Introduction
Geographical Analysis
When evaluating a smaller company, it is often crucial to understand the market it operates in. Take a hypothetical startup that wants to buy apartments, starting in Boston, and offer its tenants much more flexible leases, among other things. Its premise is that the apartment market is in need of innovation. Before it begins buying up properties, it needs to know exactly what the landscape looks like. Thankfully, the city of Boston publishes a free dataset of tax assessments for land value, which we will use here.
You can find the data here for yourself or download it from GitHub: Boston Data
As always, begin by reading in the data.
import pandas as pd
#Read in the dataset
df = pd.read_csv('2015 Real Estate.csv')
print(df)
From the data key, we see that the column LU denotes land use, with a different code for each type of use. The value_counts function gives us a clearer picture of how the data breaks down by land use.
#Find value counts
print(df['LU'].value_counts())
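To see what value_counts returns without loading the full file, here is a minimal sketch on a toy column. The codes "AA" and "BB" and all of the counts are made up for illustration; only R1 comes from the dataset described here. Passing normalize=True returns shares instead of raw counts, which can be easier to compare.

```python
import pandas as pd

# Toy stand-in for the LU column; "AA" and "BB" are hypothetical codes.
toy = pd.DataFrame({"LU": ["AA", "R1", "AA", "BB", "R1", "AA"]})

# Raw counts, sorted from most to least common
counts = toy["LU"].value_counts()

# Same breakdown expressed as a fraction of all rows
shares = toy["LU"].value_counts(normalize=True)

print(counts)
print(shares.round(2))
```

On the real dataframe, the same call would be df['LU'].value_counts(normalize=True).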
Residential condos are the most common, followed by 1-family homes. Let's begin our analysis with the 1-family homes (code R1). We make a copy of the dataframe, so that we can change it without affecting the original, and slice it down to only the properties with the R1 code.
r1 = df[df['LU'] == 'R1'].copy()
print(r1)
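The effect of .copy() can be checked on a small example. This is a sketch on made-up data; the VALUE column and the "AA" code are hypothetical and not taken from the Boston file. Changes made to the copied slice leave the original dataframe untouched.

```python
import pandas as pd

# Hypothetical toy data: a land-use code and an assessed value
toy = pd.DataFrame({"LU": ["R1", "AA", "R1"], "VALUE": [100, 200, 300]})

# Filter to the R1 rows and take an independent copy
subset = toy[toy["LU"] == "R1"].copy()

# Modify the copy only
subset["VALUE"] = subset["VALUE"] * 2

print(toy)     # original values are unchanged
print(subset)  # doubled values, original row labels 0 and 2 are kept
```

Note that the slice keeps the original row labels, so the filtered frame is not renumbered from zero.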
Many of the columns seem to have null values. How do we know which ones will not be of any use? One solution is to combine pd.isnull(), which returns True/False depending on whether each value is null, with the mean of each column. Since True/False are treated as 1/0, the mean of each column of that mask is the fraction of records that are null (multiply by 100 for a percentage).
#Find the fraction of values that are null in each column
print(pd.isnull(r1).mean())
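The boolean-mean trick is easy to verify by hand on a tiny frame. The sketch below uses made-up columns A, B and C with four rows each, so the expected fractions are 1/4, 3/4 and 0.

```python
import numpy as np
import pandas as pd

# Made-up data: A has 1 null out of 4, B has 3 out of 4, C has none
toy = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, 4.0],
    "B": [np.nan, np.nan, np.nan, 1.0],
    "C": [1.0, 2.0, 3.0, 4.0],
})

# True where a value is null, False otherwise
null_mask = toy.isnull()

# Averaging True/False (as 1/0) gives the null fraction per column
null_fraction = null_mask.mean()

print(null_mask)
print(null_fraction)
```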
Let's gather every column in which more than 20% of the values are null and drop them.
#Find the columns to drop (more than 20% null)
columns_to_drop = r1.columns[pd.isnull(r1).mean() > .2]
print("Columns to drop:")
print(columns_to_drop)
print()
#Drop the columns
r1 = r1.drop(columns=columns_to_drop)
print(r1)
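If the same cleanup is needed for other land-use codes, the two steps can be wrapped in a small helper. This is only a sketch: the function name drop_sparse_columns, its max_null_fraction parameter and the toy columns are assumptions introduced for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(frame, max_null_fraction=0.2):
    """Return a copy of frame without columns whose null fraction exceeds the cutoff."""
    null_fraction = frame.isnull().mean()
    keep = null_fraction[null_fraction <= max_null_fraction].index
    return frame.loc[:, keep].copy()

# Made-up data with six rows per column
toy = pd.DataFrame({
    "one_gap": [1.0, 2.0, 3.0, 4.0, 5.0, np.nan],               # about 17% null, kept
    "mostly_empty": [np.nan, np.nan, np.nan, np.nan, 1.0, 2.0],  # about 67% null, dropped
    "full": [1, 2, 3, 4, 5, 6],                                  # no nulls, kept
})

cleaned = drop_sparse_columns(toy)
print(list(cleaned.columns))
```

Returning a new frame rather than modifying the input keeps the original available, in the same spirit as the .copy() call used earlier.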