Measuring Growth
Measuring Growth¶
Thanks to the fact that we also have real estate data from 2020, we are able to find how property values have grown over the last 5 years. Let’s read in the data first.
df2 = pd.read_csv("2020 Real Estate.txt")
Finding the Unique Identifier¶
The column PID will serve as our unique identifier, but first we need to investigate if it is indeed unique. Using value_counts, we can get the total number of instances for each unique label. If we do this with PID....
#Get PID value counts
print(df2['PID'].value_counts())
While it is not unique right now, there is one way that will both make it unique and also deal with values we don't want.... Filtering out anything that has an assessed value of $0 leads to the following value counts....
#Get rid of anything with a total value of 0
df2 = df2[df2['AV_TOTAL'] > 0]
#Check the value counts now
print(df2['PID'].value_counts())
There is one other issue we must take care of.... In the 2015 real estate data, the PID has an underscore and is marked as a string. Let's convert it.
#Show top 5 PIDs prior
print(df["PID"].head(5))
#Get rid of the underscore and turn it into an integer
df['PID'] = df['PID'].str.replace("_", "").astype(int)
#Show top 5 PIDs after
print()
print()
print(df["PID"].head(5))
Creating the Comparison¶
Let's build out comparison by first taking the different value measures within the 2015 dataset. We will also set the PID as the index.
#Build the basic set up
land_value = df.set_index("PID")[['AV_LAND', "AV_BLDG", "AV_TOTAL"]]
land_value.columns = ["2015 Land Value", "2015 Building Value", "2015 Total Value"]
We want to convert anything that has a value of 0 to be null which we can do with by using where and passing the condition it is greater than 0 (to prevent any issue of dividing by 0). Anything equal to 0 will be converted to NaN.
land_value = land_value.where(land_value > 0)
Join the 2020 values into the dataframe.
#Join 2020 values
land_value = land_value.join(df2.set_index("PID")[['AV_LAND', "AV_BLDG", "AV_TOTAL"]])
#Convert 0 values to NaN
land_value = land_value.where(land_value > 0)
#Rename columns
land_value.columns = ["2015 Land Value", "2015 Building Value", "2015 Total Value",
"2020 Land Value", "2020 Building Value", "2020 Total Value"]
print(land_value)
Find the growth rates by dividing 2020 values by 2015 values and subtracting 1.
#Compute the growth rates
land_value["Land Growth"] = land_value['2020 Land Value'] / land_value['2015 Land Value'] - 1
land_value["Building Growth"] = land_value['2020 Building Value'] / land_value['2015 Building Value'] - 1
land_value["Total Growth"] = land_value['2020 Total Value'] / land_value['2015 Total Value'] - 1
Let's see what the histograms look like.
#Set up subplots
fig, ax = plt.subplots(nrows=3, ncols=1, figsize=(6,8), sharex=True, sharey=True)
land_value["Land Growth"].plot.hist(ax=ax[0], bins=100)
land_value["Building Growth"].plot.hist(ax=ax[1], bins=100)
land_value["Total Growth"].plot.hist(ax=ax[2], bins=100)
plt.show()
Obviously there are outliers which we are not going to want to include for now. If we use the quantile function we can find the 1st percentile and the 99th percentile to see what the bounds are that hold 98% of the data excluding the 1% on each end. Let's see how it looks.
sample = land_value[['Land Growth', 'Building Growth', 'Total Growth']]
bound1 = sample.quantile(.01)
bound2 = sample.quantile(.99)
print(bound1)
print()
print()
print(bound2)
If we combine this with where we can transform our data to put any values outside the bounds as null.
sample = sample.where((sample > bound1) & (sample < bound2))
print(sample)
And now we can plot these histograms.
#Set up subplots
fig, ax = plt.subplots(nrows=3, ncols=1, figsize=(6,8), sharex=True, sharey=True)
sample["Land Growth"].plot.hist(ax=ax[0], bins=100)
ax[0].set_title("Land Growth")
sample["Building Growth"].plot.hist(ax=ax[1], bins=100)
ax[1].set_title("Building Growth")
sample["Total Growth"].plot.hist(ax=ax[2], bins=100)
ax[2].set_title("Total Growth")
plt.show()
Let's also get the medians to see what the 5 year growth is in the middle.
print(land_value[['Land Growth', 'Building Growth','Total Growth']].median())
Let's bring the growth values into the dataframe by joining on the PID.
df = df.join(land_value[['Land Growth', 'Building Growth', 'Total Growth']], on='PID')
Now we can plot the values on the map!
create_map("R1", "Land Growth")