Geographical Analysis
Often when evaluating a smaller company, it is crucial to understand the market it operates within. Let's take a hypothetical startup that wants to buy apartments, starting in Boston, and offer its tenants much more flexible leases, among other things. Their idea is that the apartment market is in need of innovation. Before they begin buying up properties, they need to know exactly what the landscape looks like. Thankfully, the city of Boston publishes a free dataset of property tax assessments, which is what we will use here.
You can find the data here for yourself, or download it from GitHub: Boston Data
As always, begin by reading in the data.
import pandas as pd
#Read in the dataset
df = pd.read_csv('2015 Real Estate.csv')
print(df)
PID CM_ID ST_NUM ST_NAME ST_NAME_SUF UNIT_NUM \
0 1302299000_ NaN 199 SAVIN HILL AV NaN
1 1302298000_ NaN 195 SAVIN HILL AV NaN
2 1302297000_ NaN 193 SAVIN HILL AV NaN
3 1302296000_ NaN 6 8 SOUTHVIEW ST NaN
4 1302295000_ NaN 10 12 SOUTHVIEW ST NaN
... ... ... ... ... ... ...
168110 1302300006_ 1302300000_ 201 SAVIN HILL AV 3
168111 1302300004_ 1302300000_ 201 SAVIN HILL AV 2
168112 1302300002_ 1302300000_ 201 SAVIN HILL AV 1
168113 1302300000_ 1302300000_ 201 SAVIN HILL AV NaN
168114 1302300000_ 1302300000_ 201 SAVIN HILL AV NaN
ZIPCODE full_address PTYPE LU ... U_ORIENT U_TOT_RMS \
0 02125_ 199 SAVIN HILL AV 105 R3 ... NaN NaN
1 02125_ 195 SAVIN HILL AV 104 R2 ... NaN NaN
2 02125_ 193 SAVIN HILL AV 104 R2 ... NaN NaN
3 02125_ 6 8 SOUTHVIEW ST 104 R2 ... NaN NaN
4 02125_ 10 12 SOUTHVIEW ST 104 R2 ... NaN NaN
... ... ... ... .. ... ... ...
168110 02125_ 201 SAVIN HILL AV 3 102 CD ... T 4.0
168111 02125_ 201 SAVIN HILL AV 2 102 CD ... T 2.0
168112 02125_ 201 SAVIN HILL AV 1 102 CD ... T 2.0
168113 02125_ 201 SAVIN HILL AV 995 CM ... NaN NaN
168114 02125_ 201 SAVIN HILL AV 995 CM ... NaN NaN
U_BDRMS U_FULL_BTH U_HALF_BTH U_KIT_TYPE U_HEAT_TYP U_AC U_FPLACE \
0 NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ...
168110 1.0 1.0 0.0 F W C 1.0
168111 2.0 2.0 0.0 O W C 0.0
168112 2.0 2.0 0.0 O W C 0.0
168113 NaN NaN NaN NaN NaN NaN NaN
168114 NaN NaN NaN NaN NaN NaN NaN
Location
0 (42.308250000| -71.049200000)
1 (42.308490000| -71.049460000)
2 (42.308470000| -71.049700000)
3 (42.308250000| -71.049880000)
4 (42.308240000| -71.049710000)
... ...
168110 (0.000000000| 0.000000000)
168111 (0.000000000| 0.000000000)
168112 (0.000000000| 0.000000000)
168113 (42.308430000| -71.049000000)
168114 (42.308430000| -71.049000000)
[168115 rows x 56 columns]
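Printing the entire dataframe is unwieldy at over 168,000 rows. As a quick aside, a more compact first look might use these standard pandas calls (nothing here is specific to this dataset):
#A more compact first look at the data
print(df.shape)             #(rows, columns)
print(df.columns.tolist())  #all 56 column names
print(df.head())            #just the first five rows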
From the data key, we see that the column LU denotes land use with different codes. With the function value_counts, we get a clearer picture of the makeup of the data in terms of land use.
#Find value counts
print(df['LU'].value_counts())
CD 59909
R1 30726
R2 17464
R3 14176
CM 8970
E 8591
RL 7015
C 4565
CP 3554
A 2937
RC 2575
R4 2567
CL 2186
CC 1765
EA 614
I 497
AH 3
XX 1
Name: LU, dtype: int64
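If shares are more useful than raw counts, value_counts also accepts normalize=True (a standard pandas option, shown here as a quick sketch):
#Find the share of each land use code instead of raw counts
print(df['LU'].value_counts(normalize=True))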
Residential condos (CD) are the most common, followed by one-family homes (R1). Let's begin our analysis by looking at the one-family homes. We are going to make a copy of the dataframe (so that we can make changes to it without worrying about altering the original) and slice it down to only the properties with that R1 code.
r1 = df[df['LU'] == 'R1'].copy()
print(r1)
PID CM_ID ST_NUM ST_NAME ST_NAME_SUF UNIT_NUM ZIPCODE \
41 1302268030_ NaN 12 DENNY ST NaN 02125_
42 1302268020_ NaN 18 DENNY ST NaN 02125_
60 1302260000_ NaN 44 SPRINGDALE ST NaN 02125_
77 1302251000_ NaN 161 SAVIN HILL AV NaN 02125_
85 1302236000_ NaN 23 PLAYSTEAD RD NaN 02125_
... ... ... ... ... ... ... ...
168101 1302306000_ NaN 215 SAVIN HILL AV NaN 02125_
168102 1302305000_ NaN 213 SAVIN HILL AV NaN 02125_
168104 1302303000_ NaN 207 SAVIN HILL AV NaN 02125_
168105 1302302000_ NaN 205 SAVIN HILL AV NaN 02125_
168106 1302301000_ NaN 203 SAVIN HILL AV NaN 02125_
full_address PTYPE LU ... U_ORIENT U_TOT_RMS U_BDRMS \
41 12 DENNY ST 101 R1 ... NaN NaN NaN
42 18 DENNY ST 101 R1 ... NaN NaN NaN
60 44 SPRINGDALE ST 101 R1 ... NaN NaN NaN
77 161 SAVIN HILL AV 101 R1 ... NaN NaN NaN
85 23 PLAYSTEAD RD 101 R1 ... NaN NaN NaN
... ... ... .. ... ... ... ...
168101 215 SAVIN HILL AV 101 R1 ... NaN NaN NaN
168102 213 SAVIN HILL AV 101 R1 ... NaN NaN NaN
168104 207 SAVIN HILL AV 101 R1 ... NaN NaN NaN
168105 205 SAVIN HILL AV 101 R1 ... NaN NaN NaN
168106 203 SAVIN HILL AV 101 R1 ... NaN NaN NaN
U_FULL_BTH U_HALF_BTH U_KIT_TYPE U_HEAT_TYP U_AC U_FPLACE \
41 NaN NaN NaN NaN NaN NaN
42 NaN NaN NaN NaN NaN NaN
60 NaN NaN NaN NaN NaN NaN
77 NaN NaN NaN NaN NaN NaN
85 NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ...
168101 NaN NaN NaN NaN NaN NaN
168102 NaN NaN NaN NaN NaN NaN
168104 NaN NaN NaN NaN NaN NaN
168105 NaN NaN NaN NaN NaN NaN
168106 NaN NaN NaN NaN NaN NaN
Location
41 (42.309030000| -71.050840000)
42 (42.308900000| -71.050960000)
60 (42.308880000| -71.051450000)
77 (42.309730000| -71.051240000)
85 (42.309320000| -71.052080000)
... ...
168101 (42.308860000| -71.047910000)
168102 (42.308790000| -71.048130000)
168104 (42.308620000| -71.048680000)
168105 (42.308470000| -71.048620000)
168106 (42.308560000| -71.048870000)
[30726 rows x 56 columns]
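To see what the .copy() buys us, here is a minimal sketch (the EXAMPLE_FLAG column is hypothetical, purely for illustration): modifying r1 leaves df untouched.
#Because r1 is a copy, changes to it do not affect df
r1['EXAMPLE_FLAG'] = True                 #hypothetical column, for illustration only
print('EXAMPLE_FLAG' in df.columns)       #False: the original dataframe is unchanged
r1 = r1.drop(columns=['EXAMPLE_FLAG'])    #remove the illustration column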
Now, there are a lot of columns that seem to have null values. How do we know which are not going to be of any use? One solution is to combine pd.isnull(), which returns True/False for every value, with the function to find the mean value of a column. Since True/False is the same as 1/0, taking the mean of each column of these booleans returns the percent of records which are null!
#Find percent of the time values are null
print(pd.isnull(r1).mean())
PID 0.000000
CM_ID 0.999967
ST_NUM 0.000000
ST_NAME 0.000000
ST_NAME_SUF 0.000879
UNIT_NUM 1.000000
ZIPCODE 0.000000
full_address 0.000000
PTYPE 0.000000
LU 0.000000
OWN_OCC 0.000000
OWNER 0.000000
OWNER_MAIL_ADDRESS 0.000000
OWNER_MAIL_CS 0.000000
OWNER_MAIL_ZIPCODE 0.000000
AV_LAND 0.000000
AV_BLDG 0.000000
AV_TOTAL 0.000000
GROSS_TAX 0.000000
LAND_SF 0.000325
YR_BUILT 0.000000
YR_REMOD 0.050413
GROSS_AREA 0.000000
LIVING_AREA 0.000000
NUM_FLOORS 0.000033
STRUCTURE_CLASS 0.006379
R_BLDG_STYL 0.000033
R_ROOF_TYP 0.000033
R_EXT_FIN 0.000033
R_TOTAL_RMS 0.000033
R_BDRMS 0.000033
R_FULL_BTH 0.000033
R_HALF_BTH 0.000033
R_KITCH 0.000033
R_HEAT_TYP 0.000033
R_AC 0.000033
R_FPLACE 0.000033
S_NUM_BLDG 1.000000
S_BLDG_STYL 1.000000
S_UNIT_RES 1.000000
S_UNIT_COM 1.000000
S_UNIT_RC 1.000000
S_EXT_FIN 1.000000
U_BASE_FLOOR 1.000000
U_NUM_PARK 1.000000
U_CORNER 1.000000
U_ORIENT 1.000000
U_TOT_RMS 1.000000
U_BDRMS 1.000000
U_FULL_BTH 1.000000
U_HALF_BTH 1.000000
U_KIT_TYPE 1.000000
U_HEAT_TYP 1.000000
U_AC 1.000000
U_FPLACE 1.000000
Location 0.000000
dtype: float64
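To see why this trick works, here is a minimal, self-contained example on a toy Series (the values are arbitrary):
#True/False behaves as 1/0, so the mean is the null fraction
s = pd.Series([1.0, None, 3.0])
print(pd.isnull(s))         #False, True, False
print(pd.isnull(s).mean())  #0.3333..., i.e. one third of the values are null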
Let's gather up any columns where more than 20% of the values are null and drop them.
#Find the columns to drop
columns_to_drop = r1.columns[pd.isnull(r1).mean() > .2]
print("Columns to drop:")
print(columns_to_drop)
print()
#Drop the columns
r1 = r1.drop(columns=columns_to_drop)
print(r1)
Columns to drop:
Index(['CM_ID', 'UNIT_NUM', 'S_NUM_BLDG', 'S_BLDG_STYL', 'S_UNIT_RES',
'S_UNIT_COM', 'S_UNIT_RC', 'S_EXT_FIN', 'U_BASE_FLOOR', 'U_NUM_PARK',
'U_CORNER', 'U_ORIENT', 'U_TOT_RMS', 'U_BDRMS', 'U_FULL_BTH',
'U_HALF_BTH', 'U_KIT_TYPE', 'U_HEAT_TYP', 'U_AC', 'U_FPLACE'],
dtype='object')
PID ST_NUM ST_NAME ST_NAME_SUF ZIPCODE \
41 1302268030_ 12 DENNY ST 02125_
42 1302268020_ 18 DENNY ST 02125_
60 1302260000_ 44 SPRINGDALE ST 02125_
77 1302251000_ 161 SAVIN HILL AV 02125_
85 1302236000_ 23 PLAYSTEAD RD 02125_
... ... ... ... ... ...
168101 1302306000_ 215 SAVIN HILL AV 02125_
168102 1302305000_ 213 SAVIN HILL AV 02125_
168104 1302303000_ 207 SAVIN HILL AV 02125_
168105 1302302000_ 205 SAVIN HILL AV 02125_
168106 1302301000_ 203 SAVIN HILL AV 02125_
full_address PTYPE LU OWN_OCC OWNER ... \
41 12 DENNY ST 101 R1 Y FINNIGAN JANICE M ...
42 18 DENNY ST 101 R1 Y PORTER ARTHUR J ...
60 44 SPRINGDALE ST 101 R1 Y SMEGLIN GENNARO A ...
77 161 SAVIN HILL AV 101 R1 Y LYDON MARTIN L ...
85 23 PLAYSTEAD RD 101 R1 Y PAVIDIS MARTHA J ...
... ... ... .. ... ... ...
168101 215 SAVIN HILL AV 101 R1 Y KASHMANIAN MARIE TS ...
168102 213 SAVIN HILL AV 101 R1 Y SCOTT LESLIE E ...
168104 207 SAVIN HILL AV 101 R1 N BANE MARY JO ETAL ...
168105 205 SAVIN HILL AV 101 R1 Y FERRANDI ALICE L ...
168106 203 SAVIN HILL AV 101 R1 Y MCCOLGAN JOHN J ...
R_EXT_FIN R_TOTAL_RMS R_BDRMS R_FULL_BTH R_HALF_BTH R_KITCH \
41 C 5.0 3.0 1.0 0.0 1.0
42 M 4.0 2.0 1.0 0.0 1.0
60 M 7.0 3.0 1.0 0.0 1.0
77 W 7.0 4.0 1.0 0.0 1.0
85 M 7.0 2.0 1.0 1.0 1.0
... ... ... ... ... ... ...
168101 W 6.0 2.0 1.0 1.0 1.0
168102 F 11.0 4.0 2.0 1.0 1.0
168104 F 9.0 2.0 1.0 1.0 1.0
168105 W 8.0 3.0 1.0 1.0 1.0
168106 W 8.0 4.0 1.0 1.0 1.0
R_HEAT_TYP R_AC R_FPLACE Location
41 F N 0.0 (42.309030000| -71.050840000)
42 W N 0.0 (42.308900000| -71.050960000)
60 W N 1.0 (42.308880000| -71.051450000)
77 W N 0.0 (42.309730000| -71.051240000)
85 F N 0.0 (42.309320000| -71.052080000)
... ... ... ... ...
168101 W C 0.0 (42.308860000| -71.047910000)
168102 F N 0.0 (42.308790000| -71.048130000)
168104 F N 1.0 (42.308620000| -71.048680000)
168105 F N 1.0 (42.308470000| -71.048620000)
168106 W N 2.0 (42.308560000| -71.048870000)
[30726 rows x 36 columns]
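With the low-information columns dropped, the Location column is the gateway to the geographic work ahead. As a peek at where this is going, here is a minimal sketch of splitting it into numeric latitude/longitude columns, assuming the '(lat| lon)' format shown in the output above (the LAT and LON column names are our own choice):
#Split '(42.308250000| -71.049200000)' into numeric LAT and LON columns
coords = r1['Location'].str.strip('()').str.split('|', expand=True)
r1['LAT'] = coords[0].astype(float)   #latitude
r1['LON'] = coords[1].astype(float)   #longitude; the leading space is handled by the float conversion
print(r1[['full_address', 'LAT', 'LON']].head())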