Measuring the Error
Fitting the Curve Based on Error¶
Two ways to look at it:
- Minimize the overall error for the CDF curve.
- Minimize the difference for each group’s percent of companies in either absolute or relative error terms.
Minimizing the CDF Curve¶
If we choose to minimize the CDF curve, our $Y$ variable is the actual value of the CDF at each x point and $\hat{Y}$ is the predicted value. With that in mind, we can minimize the sum of squared residuals (differences) with the following form:
$ \sum_{i}(Y_{i} - \hat{Y}_{i})^2 $
#Find the difference between the predicted and actual cdf curves
#Get the sum of squared errors
print(cdf_pred - np.array(actual_cdf))
print(((cdf_pred - np.array(actual_cdf)) ** 2).sum())
[ 0. -0.03098697 0.01533453 0.00385456 0.00023574 -0.00032158
-0.00167784]
0.001213171797882587
Converting this to a function and plotting the error will help us see what the best value for lambda might be.
#Build a function to find the error
def find_error(lambda_log, actual_cdf, multiples):
    #minimize passes lambda as a one-element array, so unpack it first
    lambda_log = lambda_log[0]
    cdf_pred = [1 - np.exp(-lambda_log * np.log(x + 1)) for x in multiples]
    return ((cdf_pred - np.array(actual_cdf)) ** 2).sum()
print(find_error([lambda_log], actual_cdf, multiples))
0.001213171797882587
#Plot the error function
X = np.arange(1.3, 1.505, .005)
err = []
for lambda_log in X:
    err.append(find_error([lambda_log], actual_cdf, multiples))
plt.plot(X, err)
plt.show()
From the plot, the optimal lambda looks to be about 1.44; let's confirm it with an optimizer.
#The optimal lambda comes out to 1.44
lambda_log_opt = minimize(find_error, [1.4], args=(actual_cdf, multiples))['x'][0]
print(lambda_log_opt)
plt.plot(X, err)
plt.plot(lambda_log_opt, find_error([lambda_log_opt], actual_cdf, multiples), 'ro')
plt.show()
1.4440311069061806
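As an aside, since we are only fitting a single parameter, a bounded one-dimensional search works just as well. Below is a minimal sketch using scipy's minimize_scalar (assuming find_error, actual_cdf, and multiples are defined as above); it should land at essentially the same optimum.
#Alternative: bounded 1-D search over lambda with minimize_scalar
from scipy.optimize import minimize_scalar
res = minimize_scalar(lambda lam: find_error([lam], actual_cdf, multiples),
                      bounds=(1.0, 2.0), method='bounded')
print(res.x)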
Simulating with this optimal lambda gives us an idea of the possible paths the investments could take.
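These simulations rely on the inverse_transform function defined earlier in the lesson. For reference, inverting the fitted CDF $1 - e^{-\lambda \ln(x+1)} = 1 - (x+1)^{-\lambda}$ gives $x = (1-u)^{-1/\lambda} - 1$, so the function presumably looks something like the sketch below (the body is our reconstruction, not necessarily the exact code used earlier).
#Sketch of inverse_transform: map a uniform draw u in [0, 1) to a valuation multiple
def inverse_transform(u, lambda_log):
    #Invert F(x) = 1 - (x + 1) ** -lambda_log
    return (1 - u) ** (-1 / lambda_log) - 1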
#Simulate 10,000 funds of 10, 25, and 50 investments; take the 8-year CAGR of each
np.random.seed(0)
returns1 = []
returns2 = []
returns3 = []
for _ in range(10000):
    cdf_vals = np.random.uniform(0, 1, 10)
    valuations = [inverse_transform(x, lambda_log_opt) for x in cdf_vals]
    returns1.append(np.mean(valuations) ** (1/8) - 1)
    cdf_vals = np.random.uniform(0, 1, 25)
    valuations = [inverse_transform(x, lambda_log_opt) for x in cdf_vals]
    returns2.append(np.mean(valuations) ** (1/8) - 1)
    cdf_vals = np.random.uniform(0, 1, 50)
    valuations = [inverse_transform(x, lambda_log_opt) for x in cdf_vals]
    returns3.append(np.mean(valuations) ** (1/8) - 1)
#Plot the 3 together
fig, axs = plt.subplots(3, 1, sharex=True, sharey=True,figsize=(5,10))
ax = axs[0]
ax.hist(returns1, bins=30)
ax.xaxis.set_major_formatter(PercentFormatter(1))
ax.set_xlabel("Fund CAGR")
ax.set_ylabel("Frequency")
ax.set_title("Exponential Simulated Funds N=10")
ax = axs[1]
ax.hist(returns2, bins=30)
ax.xaxis.set_major_formatter(PercentFormatter(1))
ax.set_xlabel("Fund CAGR")
ax.set_ylabel("Frequency")
ax.set_title("Exponential Simulated Funds N=25")
ax = axs[2]
ax.hist(returns3, bins=30)
ax.xaxis.set_major_formatter(PercentFormatter(1))
ax.set_xlabel("Fund CAGR")
ax.set_ylabel("Frequency")
ax.set_title("Exponential Simulated Funds N=50")
plt.show()
#Get the stats for the simulations
table = pd.DataFrame([[np.mean(r), np.std(r)] for r in [returns1, returns2, returns3]],
index=['N=10', 'N=25', 'N=50'],
columns=['Mean', 'STD'])
print(table)
Mean STD
N=10 0.056886 0.100331
N=25 0.075545 0.083401
N=50 0.083574 0.066899
Minimizing the Buckets¶
Another way to minimize the difference is to look at the differences between the "buckets" we saw in the bar graph at the beginning of this lesson. Doing this means we focus on replicating that bar graph rather than the overall curve. First, let's find the probability that each bucket holds under the actual CDF curve (the same approach will give us the probabilities for the predicted CDF). We convert the CDF to a NumPy array so we can subtract elementwise: all values except the first minus all values except the last gives the probability mass in each bucket.
buckets_actual = np.array(actual_cdf)[1:] - np.array(actual_cdf)[:-1]
print(buckets_actual)
[0.648 0.253 0.059 0.025 0.011 0.004]
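As a side note, NumPy's diff function computes the same successive differences in a single call:
#Equivalent one-liner using np.diff
print(np.diff(actual_cdf))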
Let's do the same thing with our predicted CDF curve to find its buckets.
lambda_log = 1.4
cdf_pred = [1-np.exp(-lambda_log * np.log(x+1)) for x in multiples]
buckets_pred = np.array(cdf_pred)[1:] - np.array(cdf_pred)[:-1]
print(buckets_pred)
[0.62107086 0.29753592 0.04655546 0.02074835 0.01002122 0.00250523]
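To eyeball how close the fit is, a quick side-by-side bar chart of the two bucket sets helps; here is a minimal sketch using the arrays computed above.
#Compare actual and predicted bucket probabilities visually
idx = np.arange(len(buckets_actual))
width = 0.4
plt.bar(idx - width/2, buckets_actual, width, label='Actual')
plt.bar(idx + width/2, buckets_pred, width, label='Predicted')
plt.legend()
plt.show()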
Measuring the Error¶
There are two ways we can think about the error. One is simply to take the sum of squared residuals, as before:
$ \sum_{i} (Y_{i} - \hat{Y}_{i})^2 $
One thing to consider is that low-probability buckets, such as the probability of landing in the 10-50X multiple bucket, carry almost no weight this way. If we predict a 0% probability for it, the penalty in the error is only $0.004^2$, which is a very small number. Because of this, we might instead want to weight each error relative to the actual value, so that being off by 50% in a bucket counts the same no matter how small the bucket is. The equation for this would be:
$ \sum_{i} \left(\frac{Y_{i} - \hat{Y}_{i}}{Y_{i}}\right)^2 $
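To make the difference concrete, take the smallest bucket above (0.004, or 0.4% of companies) and suppose we predict 0% for it. The absolute squared error barely registers, while the relative error counts it as a full unit:
#Penalty for completely missing the smallest bucket under each error definition
actual = 0.004
pred = 0.0
print((pred - actual) ** 2)              #1.6e-05: almost no penalty
print(((pred - actual) / actual) ** 2)   #1.0: a full unit of error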
#The error with method 1
print(((buckets_pred - buckets_actual) ** 2).sum())
0.0028847620990833157
#The error with method 2
print((((buckets_pred - buckets_actual)/buckets_actual) ** 2).sum())
0.25368970026315674
#We can define a function that does everything as before, but with buckets,
#plus an option to compute the error in relative terms
def find_error_buckets(lambda_log, buckets_actual, multiples, relative=False):
    lambda_log = lambda_log[0]
    cdf_pred = [1 - np.exp(-lambda_log * np.log(x + 1)) for x in multiples]
    buckets_pred = np.array(cdf_pred)[1:] - np.array(cdf_pred)[:-1]
    if relative:
        err = (((buckets_pred - buckets_actual) / buckets_actual) ** 2).sum()
    else:
        err = ((buckets_pred - buckets_actual) ** 2).sum()
    return err
print(find_error_buckets([lambda_log], buckets_actual, multiples, relative=False))
print(find_error_buckets([lambda_log], buckets_actual, multiples, relative=True))
0.0028847620990833157
0.25368970026315674
#Plot the error for non-relative buckets
X = np.arange(1.1, 1.705, .005)
err = []
for lambda_log in X:
    err.append(find_error_buckets([lambda_log], buckets_actual, multiples, relative=False))
plt.plot(X, err)
plt.show()
#Plot the error for relative buckets
X = np.arange(1.1, 1.705, .005)
err = []
for lambda_log in X:
    err.append(find_error_buckets([lambda_log], buckets_actual, multiples, relative=True))
plt.plot(X, err)
plt.show()
Choosing what to focus on makes a huge difference! When you minimize the relative errors, you penalize incorrect predictions in the tails much more heavily.
#Save the first value we found as lambda_log_opt1
lambda_log_opt1 = lambda_log_opt
#Find the other two values
lambda_log_opt2 = minimize(find_error_buckets, [1.4], args=(buckets_actual, multiples, False))['x'][0]
lambda_log_opt3 = minimize(find_error_buckets, [1.4], args=(buckets_actual, multiples, True))['x'][0]
print(lambda_log_opt1)
print(lambda_log_opt2)
print(lambda_log_opt3)
1.4440311069061806
1.5401965955614616
1.3114755373838654
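One way to see how different the three fits really are is the tail probability each lambda implies. Under the fitted CDF, $P(X > m) = (m+1)^{-\lambda}$. The sketch below evaluates this at an illustrative 50X cutoff for the three optimal lambdas found above; the relative-error fit (Case 3) puts noticeably more mass in the tail.
#Implied tail probability P(multiple > 50X) under each fitted lambda
for name, lam in [('Case 1', lambda_log_opt1), ('Case 2', lambda_log_opt2), ('Case 3', lambda_log_opt3)]:
    print(name, (50 + 1) ** -lam)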
#Finally, let's compare the impact of the three lambdas for N=10
np.random.seed(0)
returns1 = []
returns2 = []
returns3 = []
for _ in range(10000):
    #Reuse the same uniform draws for all three lambdas so the comparison is apples-to-apples
    cdf_vals = np.random.uniform(0, 1, 10)
    valuations = [inverse_transform(x, lambda_log_opt1) for x in cdf_vals]
    returns1.append(np.mean(valuations) ** (1/8) - 1)
    valuations = [inverse_transform(x, lambda_log_opt2) for x in cdf_vals]
    returns2.append(np.mean(valuations) ** (1/8) - 1)
    valuations = [inverse_transform(x, lambda_log_opt3) for x in cdf_vals]
    returns3.append(np.mean(valuations) ** (1/8) - 1)
#Plot the 3 together
fig, axs = plt.subplots(3, 1, sharex=True, sharey=True,figsize=(5,10))
ax = axs[0]
ax.hist(returns1, bins=30)
ax.xaxis.set_major_formatter(PercentFormatter(1))
ax.set_xlabel("Fund CAGR")
ax.set_ylabel("Frequency")
ax.set_title("Exponential Simulated Funds N=25 Case 1")
ax = axs[1]
ax.hist(returns2, bins=30)
ax.xaxis.set_major_formatter(PercentFormatter(1))
ax.set_xlabel("Fund CAGR")
ax.set_ylabel("Frequency")
ax.set_title("Exponential Simulated Funds N=25 Case 2")
ax = axs[2]
ax.hist(returns3, bins=30)
ax.xaxis.set_major_formatter(PercentFormatter(1))
ax.set_xlabel("Fund CAGR")
ax.set_ylabel("Frequency")
ax.set_title("Exponential Simulated Funds N=25 Case 3")
plt.show()
table = pd.DataFrame([[np.mean(r), np.std(r)] for r in [returns1, returns2, returns3]],
index=['Case 1', 'Case 2', 'Case 3'],
columns=['Mean', 'STD'])
print(table)
Mean STD
Case 1 0.057472 0.101631
Case 2 0.040114 0.094005
Case 3 0.085658 0.115089
Look how big a difference the assumptions can make. This is very important to consider when building out your models. As you gather more data, it becomes easier to model things accurately, but assumptions can always have a large impact!