Measuring Growth

Measuring Growth¶

Thanks to the fact that we also have real estate data from 2020, we are able to find how property values have grown over the last 5 years. Let’s read in the data first.

In [38]:

df2 = pd.read_csv("2020 Real Estate.txt")

/Users/seanmcowen/anaconda/envs/FinanceAndPython/lib/python3.8/site-packages/IPython/core/interactiveshell.py:3062: DtypeWarning: Columns (6,15,27,28,29,34,35,36,37,38,39,40,41,42,44,45,46,47,48,58,59,64,65,66,67,68,69,70,72,73,74) have mixed types.Specify dtype option on import or set low_memory=False.
  has_raised = await self.run_ast_nodes(code_ast.body, cell_name,

Finding the Unique Identifier¶

The column PID will serve as our unique identifier, but first we need to investigate if it is indeed unique. Using value_counts, we can get the total number of instances for each unique label. If we do this with PID....

In [39]:

#Get PID value counts
print(df2['PID'].value_counts())

203520010     6
203506010     6
1902190000    6
303038200     6
1703777000    5
             ..
2005341004    1
501120000     1
103984000     1
2003998000    1
500695040     1
Name: PID, Length: 174714, dtype: int64

While it is not unique right now, there is one way that will both make it unique and also deal with values we don't want.... Filtering out anything that has an assessed value of $0 leads to the following value counts....

In [40]:

#Get rid of anything with a total value of 0
df2 = df2[df2['AV_TOTAL'] > 0]

#Check the value counts now
print(df2['PID'].value_counts())

2003371050    1
1603872014    1
103970020     1
701388000     1
1805747000    1
             ..
601828016     1
900846014     1
1702279000    1
2005341004    1
500695040     1
Name: PID, Length: 164967, dtype: int64

There is one other issue we must take care of.... In the 2015 real estate data, the PID has an underscore and is marked as a string. Let's convert it.

In [41]:

#Show top 5 PIDs prior
print(df["PID"].head(5))

#Get rid of the underscore and turn it into an integer
df['PID'] = df['PID'].str.replace("_", "").astype(int)

#Show top 5 PIDs after
print()
print()
print(df["PID"].head(5))

0    1302299000_
1    1302298000_
2    1302297000_
3    1302296000_
4    1302295000_
Name: PID, dtype: object


0    1302299000
1    1302298000
2    1302297000
3    1302296000
4    1302295000
Name: PID, dtype: int64

Creating the Comparison¶

Let's build out comparison by first taking the different value measures within the 2015 dataset. We will also set the PID as the index.

In [42]:

#Build the basic set up
land_value = df.set_index("PID")[['AV_LAND', "AV_BLDG", "AV_TOTAL"]]
land_value.columns = ["2015 Land Value", "2015 Building Value", "2015 Total Value"]

We want to convert anything that has a value of 0 to be null which we can do with by using where and passing the condition it is greater than 0 (to prevent any issue of dividing by 0). Anything equal to 0 will be converted to NaN.

In [43]:

land_value = land_value.where(land_value > 0)

Join the 2020 values into the dataframe.

In [44]:

#Join 2020 values
land_value = land_value.join(df2.set_index("PID")[['AV_LAND', "AV_BLDG", "AV_TOTAL"]])

#Convert 0 values to NaN
land_value = land_value.where(land_value > 0)

#Rename columns
land_value.columns = ["2015 Land Value", "2015 Building Value", "2015 Total Value",
                     "2020 Land Value", "2020 Building Value", "2020 Total Value"]
print(land_value)

            2015 Land Value  2015 Building Value  2015 Total Value  \
PID
100001000           93100.0             193000.0          286100.0
100002000           83700.0             219300.0          303000.0
100003000           83800.0             190700.0          274500.0
100004000           83900.0             187300.0          271200.0
100005000           93000.0             205200.0          298200.0
...                     ...                  ...               ...
2205668000         266300.0             431800.0          698100.0
2205669000         573500.0            1348500.0         1922000.0
2205670000         970500.0             866800.0         1837300.0
2205670000         970500.0             866800.0         1837300.0
2205670000         970500.0             866800.0         1837300.0

            2020 Land Value  2020 Building Value  2020 Total Value
PID
100001000          199900.0             405600.0          605500.0
100002000          200900.0             466800.0          667700.0
100003000          201500.0             406100.0          607600.0
100004000          202400.0             378200.0          580600.0
100005000          199800.0             381700.0          581500.0
...                     ...                  ...               ...
2205668000         417800.0             676600.0         1094400.0
2205669000         752800.0            1754700.0         2507500.0
2205670000        1317900.0            1033600.0         2351500.0
2205670000        1317900.0            1033600.0         2351500.0
2205670000        1317900.0            1033600.0         2351500.0

[168115 rows x 6 columns]

Find the growth rates by dividing 2020 values by 2015 values and subtracting 1.

In [45]:

#Compute the growth rates
land_value["Land Growth"] = land_value['2020 Land Value'] / land_value['2015 Land Value'] - 1
land_value["Building Growth"] = land_value['2020 Building Value'] / land_value['2015 Building Value'] - 1
land_value["Total Growth"] = land_value['2020 Total Value'] / land_value['2015 Total Value'] - 1

Let's see what the histograms look like.

In [46]:

#Set up subplots
fig, ax = plt.subplots(nrows=3, ncols=1, figsize=(6,8), sharex=True, sharey=True)

land_value["Land Growth"].plot.hist(ax=ax[0], bins=100)
land_value["Building Growth"].plot.hist(ax=ax[1], bins=100)
land_value["Total Growth"].plot.hist(ax=ax[2], bins=100)
plt.show()

Obviously there are outliers which we are not going to want to include for now. If we use the quantile function we can find the 1st percentile and the 99th percentile to see what the bounds are that hold 98% of the data excluding the 1% on each end. Let's see how it looks.

In [47]:

sample = land_value[['Land Growth', 'Building Growth', 'Total Growth']]
bound1 = sample.quantile(.01)
bound2 = sample.quantile(.99)

print(bound1)
print()
print()
print(bound2)

Land Growth       -0.211912
Building Growth   -0.209486
Total Growth      -0.125683
Name: 0.01, dtype: float64


Land Growth        1.971206
Building Growth    2.108214
Total Growth       1.877132
Name: 0.99, dtype: float64

If we combine this with where we can transform our data to put any values outside the bounds as null.

In [48]:

sample = sample.where((sample > bound1) & (sample < bound2))
print(sample)

            Land Growth  Building Growth  Total Growth
PID
100001000      1.147154         1.101554      1.116393
100002000      1.400239         1.128591      1.203630
100003000      1.404535         1.129523      1.213479
100004000      1.412396         1.019221      1.140855
100005000      1.148387         0.860136      0.950034
...                 ...              ...           ...
2205668000     0.568907         0.566929      0.567684
2205669000     0.312642         0.301224      0.304631
2205670000     0.357960         0.192432      0.279867
2205670000     0.357960         0.192432      0.279867
2205670000     0.357960         0.192432      0.279867

[168115 rows x 3 columns]

And now we can plot these histograms.

In [49]:

#Set up subplots
fig, ax = plt.subplots(nrows=3, ncols=1, figsize=(6,8), sharex=True, sharey=True)

sample["Land Growth"].plot.hist(ax=ax[0], bins=100)
ax[0].set_title("Land Growth")
sample["Building Growth"].plot.hist(ax=ax[1], bins=100)
ax[1].set_title("Building Growth")
sample["Total Growth"].plot.hist(ax=ax[2], bins=100)
ax[2].set_title("Total Growth")
plt.show()

Let's also get the medians to see what the 5 year growth is in the middle.

In [50]:

print(land_value[['Land Growth', 'Building Growth','Total Growth']].median())

Land Growth        0.401239
Building Growth    0.446657
Total Growth       0.424826
dtype: float64

Let's bring the growth values into the dataframe by joining on the PID.

In [51]:

df = df.join(land_value[['Land Growth', 'Building Growth', 'Total Growth']], on='PID')

Now we can plot the values on the map!

In [52]:

create_map("R1", "Land Growth")

Out[52]:

Data Science

Data Science for Venture Capital

Measuring Growth