Introduction

Geographical Analysis¶

Often when evaluating a smaller company, it becomes crucial to understand the market they operate within. Let’s take a hypothetical start up that wants to buy apartments, starting in Boston, and provide much more flexible leases to their tenants among other things. Their idea is that the apartment market is in need of innovation. Before they begin buying up properties, they need to know exactly what the landscape looks like. Thankfully, the city of Boston has a free dataset which gives us the tax assessments for land value that we will use here.

You can find the data here for yourself or download the data from github: Boston Data

As always begin with reading in the data.

In [1]:

import pandas as pd

#Read in the dataset
df = pd.read_csv('2015 Real Estate.csv')
print(df)

                PID        CM_ID ST_NUM     ST_NAME ST_NAME_SUF UNIT_NUM  \
0       1302299000_          NaN    199  SAVIN HILL          AV      NaN
1       1302298000_          NaN    195  SAVIN HILL          AV      NaN
2       1302297000_          NaN    193  SAVIN HILL          AV      NaN
3       1302296000_          NaN    6 8   SOUTHVIEW          ST      NaN
4       1302295000_          NaN  10 12   SOUTHVIEW          ST      NaN
...             ...          ...    ...         ...         ...      ...
168110  1302300006_  1302300000_    201  SAVIN HILL          AV        3
168111  1302300004_  1302300000_    201  SAVIN HILL          AV        2
168112  1302300002_  1302300000_    201  SAVIN HILL          AV        1
168113  1302300000_  1302300000_    201  SAVIN HILL          AV      NaN
168114  1302300000_  1302300000_    201  SAVIN HILL          AV      NaN

       ZIPCODE          full_address  PTYPE  LU  ... U_ORIENT U_TOT_RMS  \
0       02125_  199   SAVIN HILL AV     105  R3  ...      NaN       NaN
1       02125_  195   SAVIN HILL AV     104  R2  ...      NaN       NaN
2       02125_  193   SAVIN HILL AV     104  R2  ...      NaN       NaN
3       02125_     6 8 SOUTHVIEW ST     104  R2  ...      NaN       NaN
4       02125_   10 12 SOUTHVIEW ST     104  R2  ...      NaN       NaN
...        ...                   ...    ...  ..  ...      ...       ...
168110  02125_  201  SAVIN HILL AV 3    102  CD  ...        T       4.0
168111  02125_  201  SAVIN HILL AV 2    102  CD  ...        T       2.0
168112  02125_  201  SAVIN HILL AV 1    102  CD  ...        T       2.0
168113  02125_  201   SAVIN HILL AV     995  CM  ...      NaN       NaN
168114  02125_  201   SAVIN HILL AV     995  CM  ...      NaN       NaN

       U_BDRMS U_FULL_BTH U_HALF_BTH  U_KIT_TYPE  U_HEAT_TYP  U_AC  U_FPLACE  \
0          NaN        NaN        NaN         NaN         NaN   NaN       NaN
1          NaN        NaN        NaN         NaN         NaN   NaN       NaN
2          NaN        NaN        NaN         NaN         NaN   NaN       NaN
3          NaN        NaN        NaN         NaN         NaN   NaN       NaN
4          NaN        NaN        NaN         NaN         NaN   NaN       NaN
...        ...        ...        ...         ...         ...   ...       ...
168110     1.0        1.0        0.0           F           W     C       1.0
168111     2.0        2.0        0.0           O           W     C       0.0
168112     2.0        2.0        0.0           O           W     C       0.0
168113     NaN        NaN        NaN         NaN         NaN   NaN       NaN
168114     NaN        NaN        NaN         NaN         NaN   NaN       NaN

                             Location
0       (42.308250000| -71.049200000)
1       (42.308490000| -71.049460000)
2       (42.308470000| -71.049700000)
3       (42.308250000| -71.049880000)
4       (42.308240000| -71.049710000)
...                               ...
168110     (0.000000000| 0.000000000)
168111     (0.000000000| 0.000000000)
168112     (0.000000000| 0.000000000)
168113  (42.308430000| -71.049000000)
168114  (42.308430000| -71.049000000)

[168115 rows x 56 columns]

From the data key, we see that the column LU is going to denote land use with different codes. With the function value_counts, we get a clearer picture of what the makeup of the data is in terms of the land use.

In [2]:

#Find value counts
print(df['LU'].value_counts())

CD    59909
R1    30726
R2    17464
R3    14176
CM     8970
E      8591
RL     7015
C      4565
CP     3554
A      2937
RC     2575
R4     2567
CL     2186
CC     1765
EA      614
I       497
AH        3
XX        1
Name: LU, dtype: int64

Residential condos are the most common, followed by 1 family homes. Let's begin our analysis by looking at the 1-family homes (code R1). We are going to make a copy of the dataframe (so that we can make changes to it and not worry about messing with the original dataframe) and slice it to be only properties with that R1 code.

In [3]:

r1 = df[df['LU'] == 'R1'].copy()
print(r1)

                PID CM_ID ST_NUM     ST_NAME ST_NAME_SUF UNIT_NUM ZIPCODE  \
41      1302268030_   NaN     12       DENNY          ST      NaN  02125_
42      1302268020_   NaN     18       DENNY          ST      NaN  02125_
60      1302260000_   NaN     44  SPRINGDALE          ST      NaN  02125_
77      1302251000_   NaN    161  SAVIN HILL          AV      NaN  02125_
85      1302236000_   NaN     23   PLAYSTEAD          RD      NaN  02125_
...             ...   ...    ...         ...         ...      ...     ...
168101  1302306000_   NaN    215  SAVIN HILL          AV      NaN  02125_
168102  1302305000_   NaN    213  SAVIN HILL          AV      NaN  02125_
168104  1302303000_   NaN    207  SAVIN HILL          AV      NaN  02125_
168105  1302302000_   NaN    205  SAVIN HILL          AV      NaN  02125_
168106  1302301000_   NaN    203  SAVIN HILL          AV      NaN  02125_

                full_address  PTYPE  LU  ... U_ORIENT U_TOT_RMS U_BDRMS  \
41            12   DENNY ST     101  R1  ...      NaN       NaN     NaN
42            18   DENNY ST     101  R1  ...      NaN       NaN     NaN
60       44   SPRINGDALE ST     101  R1  ...      NaN       NaN     NaN
77      161   SAVIN HILL AV     101  R1  ...      NaN       NaN     NaN
85        23   PLAYSTEAD RD     101  R1  ...      NaN       NaN     NaN
...                      ...    ...  ..  ...      ...       ...     ...
168101  215   SAVIN HILL AV     101  R1  ...      NaN       NaN     NaN
168102  213   SAVIN HILL AV     101  R1  ...      NaN       NaN     NaN
168104  207   SAVIN HILL AV     101  R1  ...      NaN       NaN     NaN
168105  205   SAVIN HILL AV     101  R1  ...      NaN       NaN     NaN
168106  203   SAVIN HILL AV     101  R1  ...      NaN       NaN     NaN

       U_FULL_BTH U_HALF_BTH  U_KIT_TYPE  U_HEAT_TYP  U_AC  U_FPLACE  \
41            NaN        NaN         NaN         NaN   NaN       NaN
42            NaN        NaN         NaN         NaN   NaN       NaN
60            NaN        NaN         NaN         NaN   NaN       NaN
77            NaN        NaN         NaN         NaN   NaN       NaN
85            NaN        NaN         NaN         NaN   NaN       NaN
...           ...        ...         ...         ...   ...       ...
168101        NaN        NaN         NaN         NaN   NaN       NaN
168102        NaN        NaN         NaN         NaN   NaN       NaN
168104        NaN        NaN         NaN         NaN   NaN       NaN
168105        NaN        NaN         NaN         NaN   NaN       NaN
168106        NaN        NaN         NaN         NaN   NaN       NaN

                             Location
41      (42.309030000| -71.050840000)
42      (42.308900000| -71.050960000)
60      (42.308880000| -71.051450000)
77      (42.309730000| -71.051240000)
85      (42.309320000| -71.052080000)
...                               ...
168101  (42.308860000| -71.047910000)
168102  (42.308790000| -71.048130000)
168104  (42.308620000| -71.048680000)
168105  (42.308470000| -71.048620000)
168106  (42.308560000| -71.048870000)

[30726 rows x 56 columns]

Now, there are a lot of columns that seem to have null values. How do we know which are not going to be of any use? One solution is to combine pd.isnull() which returns True/False based on null values with the function to find the mean value of a column. Since True/False is the same as 1/0, calling for the mean of the columns based on this will return the percent of records which are null!

In [4]:

#Find percent of the time values are null
print(pd.isnull(r1).mean())

PID                   0.000000
CM_ID                 0.999967
ST_NUM                0.000000
ST_NAME               0.000000
ST_NAME_SUF           0.000879
UNIT_NUM              1.000000
ZIPCODE               0.000000
full_address          0.000000
PTYPE                 0.000000
LU                    0.000000
OWN_OCC               0.000000
OWNER                 0.000000
OWNER_MAIL_ADDRESS    0.000000
OWNER_MAIL_CS         0.000000
OWNER_MAIL_ZIPCODE    0.000000
AV_LAND               0.000000
AV_BLDG               0.000000
AV_TOTAL              0.000000
GROSS_TAX             0.000000
LAND_SF               0.000325
YR_BUILT              0.000000
YR_REMOD              0.050413
GROSS_AREA            0.000000
LIVING_AREA           0.000000
NUM_FLOORS            0.000033
STRUCTURE_CLASS       0.006379
R_BLDG_STYL           0.000033
R_ROOF_TYP            0.000033
R_EXT_FIN             0.000033
R_TOTAL_RMS           0.000033
R_BDRMS               0.000033
R_FULL_BTH            0.000033
R_HALF_BTH            0.000033
R_KITCH               0.000033
R_HEAT_TYP            0.000033
R_AC                  0.000033
R_FPLACE              0.000033
S_NUM_BLDG            1.000000
S_BLDG_STYL           1.000000
S_UNIT_RES            1.000000
S_UNIT_COM            1.000000
S_UNIT_RC             1.000000
S_EXT_FIN             1.000000
U_BASE_FLOOR          1.000000
U_NUM_PARK            1.000000
U_CORNER              1.000000
U_ORIENT              1.000000
U_TOT_RMS             1.000000
U_BDRMS               1.000000
U_FULL_BTH            1.000000
U_HALF_BTH            1.000000
U_KIT_TYPE            1.000000
U_HEAT_TYP            1.000000
U_AC                  1.000000
U_FPLACE              1.000000
Location              0.000000
dtype: float64

Let's round up any columns that have a null percent greater than 20% and drop them.

In [5]:

#Drop the columns

#Find the columns to drop
columns_to_drop = r1.columns[pd.isnull(r1).mean() > .2]
print("Columns to drop:")
print(columns_to_drop)
print()
#Drop the columns
r1 = r1.drop(columns=columns_to_drop)
print(r1)

Columns to drop:
Index(['CM_ID', 'UNIT_NUM', 'S_NUM_BLDG', 'S_BLDG_STYL', 'S_UNIT_RES',
       'S_UNIT_COM', 'S_UNIT_RC', 'S_EXT_FIN', 'U_BASE_FLOOR', 'U_NUM_PARK',
       'U_CORNER', 'U_ORIENT', 'U_TOT_RMS', 'U_BDRMS', 'U_FULL_BTH',
       'U_HALF_BTH', 'U_KIT_TYPE', 'U_HEAT_TYP', 'U_AC', 'U_FPLACE'],
      dtype='object')

                PID ST_NUM     ST_NAME ST_NAME_SUF ZIPCODE  \
41      1302268030_     12       DENNY          ST  02125_
42      1302268020_     18       DENNY          ST  02125_
60      1302260000_     44  SPRINGDALE          ST  02125_
77      1302251000_    161  SAVIN HILL          AV  02125_
85      1302236000_     23   PLAYSTEAD          RD  02125_
...             ...    ...         ...         ...     ...
168101  1302306000_    215  SAVIN HILL          AV  02125_
168102  1302305000_    213  SAVIN HILL          AV  02125_
168104  1302303000_    207  SAVIN HILL          AV  02125_
168105  1302302000_    205  SAVIN HILL          AV  02125_
168106  1302301000_    203  SAVIN HILL          AV  02125_

                full_address  PTYPE  LU OWN_OCC                OWNER  ...  \
41            12   DENNY ST     101  R1       Y    FINNIGAN JANICE M  ...
42            18   DENNY ST     101  R1       Y      PORTER ARTHUR J  ...
60       44   SPRINGDALE ST     101  R1       Y    SMEGLIN GENNARO A  ...
77      161   SAVIN HILL AV     101  R1       Y       LYDON MARTIN L  ...
85        23   PLAYSTEAD RD     101  R1       Y     PAVIDIS MARTHA J  ...
...                      ...    ...  ..     ...                  ...  ...
168101  215   SAVIN HILL AV     101  R1       Y  KASHMANIAN MARIE TS  ...
168102  213   SAVIN HILL AV     101  R1       Y     SCOTT  LESLIE  E  ...
168104  207   SAVIN HILL AV     101  R1       N    BANE MARY JO ETAL  ...
168105  205   SAVIN HILL AV     101  R1       Y     FERRANDI ALICE L  ...
168106  203   SAVIN HILL AV     101  R1       Y      MCCOLGAN JOHN J  ...

       R_EXT_FIN R_TOTAL_RMS R_BDRMS  R_FULL_BTH  R_HALF_BTH  R_KITCH  \
41             C         5.0     3.0         1.0         0.0      1.0
42             M         4.0     2.0         1.0         0.0      1.0
60             M         7.0     3.0         1.0         0.0      1.0
77             W         7.0     4.0         1.0         0.0      1.0
85             M         7.0     2.0         1.0         1.0      1.0
...          ...         ...     ...         ...         ...      ...
168101         W         6.0     2.0         1.0         1.0      1.0
168102         F        11.0     4.0         2.0         1.0      1.0
168104         F         9.0     2.0         1.0         1.0      1.0
168105         W         8.0     3.0         1.0         1.0      1.0
168106         W         8.0     4.0         1.0         1.0      1.0

        R_HEAT_TYP  R_AC  R_FPLACE                       Location
41               F     N       0.0  (42.309030000| -71.050840000)
42               W     N       0.0  (42.308900000| -71.050960000)
60               W     N       1.0  (42.308880000| -71.051450000)
77               W     N       0.0  (42.309730000| -71.051240000)
85               F     N       0.0  (42.309320000| -71.052080000)
...            ...   ...       ...                            ...
168101           W     C       0.0  (42.308860000| -71.047910000)
168102           F     N       0.0  (42.308790000| -71.048130000)
168104           F     N       1.0  (42.308620000| -71.048680000)
168105           F     N       1.0  (42.308470000| -71.048620000)
168106           W     N       2.0  (42.308560000| -71.048870000)

[30726 rows x 36 columns]

Data Science

Data Science for Venture Capital

Introduction

Geographical Analysis¶

Leave A Reply Cancel reply

Modal title