Measuring the Error
Fitting the Curve Based on Error¶
Two ways to look at it:
- Minimize the overall error for the CDF curve.
- Minimize the difference for each group’s percent of companies in either absolute or relative error terms.
Minimizing the CDF Curve¶
If we choose to minimize the CDF curve, our $Y$ variable is the actual value of the CDF at each x point and $\hat{Y}$ is the predicted value. With that in mind, we can minimize the sum of squared residuals (differences) with the following form:
$ \sum_{i}(Y_{i} - \hat{Y}_{i})^2 $
#Find the difference between the predicted and actual cdf curves
#Get the sum of squared errors
print(cdf_pred - np.array(actual_cdf))
print(((cdf_pred - np.array(actual_cdf)) ** 2).sum())
[ 0. -0.03098697 0.01533453 0.00385456 0.00023574 -0.00032158
-0.00167784]
0.001213171797882587
Converting this to a function and plotting the error will help us see what the best value for lambda might be.
#Build a function to find the error
def find_error(lambda_log, actual_cdf, multiples):
    #minimize passes lambda as a one-element array, so unpack it first
    lambda_log = lambda_log[0]
    cdf_pred = [1 - np.exp(-lambda_log * np.log(x + 1)) for x in multiples]
    return ((cdf_pred - np.array(actual_cdf)) ** 2).sum()
print(find_error([lambda_log], actual_cdf, multiples))
0.001213171797882587
#Plot the error function
X = np.arange(1.3, 1.505, .005)
err = []
for lambda_log in X:
    err.append(find_error([lambda_log], actual_cdf, multiples))
plt.plot(X, err)
plt.show()
From the plot, the optimal lambda looks to be about 1.44; let's confirm it with an optimizer.
#The optimal lambda comes out to 1.44
lambda_log_opt = minimize(find_error, [1.4], args=(actual_cdf, multiples))['x'][0]
print(lambda_log_opt)
plt.plot(X, err)
plt.plot(lambda_log_opt, find_error([lambda_log_opt], actual_cdf, multiples), 'ro')
plt.show()
1.4440311069061806
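As an aside, since we are only fitting a single parameter, a bounded one-dimensional search works just as well. Below is a minimal sketch using scipy's minimize_scalar (assuming find_error, actual_cdf, and multiples are defined as above); it should land at essentially the same optimum.
#Alternative: bounded 1-D search over lambda with minimize_scalar
from scipy.optimize import minimize_scalar
res = minimize_scalar(lambda lam: find_error([lam], actual_cdf, multiples),
                      bounds=(1.0, 2.0), method='bounded')
print(res.x)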
Simulating with this optimal lambda gives us an idea of the possible paths the investments could take.
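These simulations rely on the inverse_transform function defined earlier in the lesson. For reference, inverting the fitted CDF $1 - e^{-\lambda \ln(x+1)} = 1 - (x+1)^{-\lambda}$ gives $x = (1-u)^{-1/\lambda} - 1$, so the function presumably looks something like the sketch below (the body is our reconstruction, not necessarily the exact code used earlier).
#Sketch of inverse_transform: map a uniform draw u in [0, 1) to a valuation multiple
def inverse_transform(u, lambda_log):
    #Invert F(x) = 1 - (x + 1) ** -lambda_log
    return (1 - u) ** (-1 / lambda_log) - 1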
#Simulate 10,000 funds of 10, 25, and 50 investments; take the 8-year CAGR of each
np.random.seed(0)
returns1 = []
returns2 = []
returns3 = []
for _ in range(10000):
    cdf_vals = np.random.uniform(0, 1, 10)
    valuations = [inverse_transform(x, lambda_log_opt) for x in cdf_vals]
    returns1.append(np.mean(valuations) ** (1/8) - 1)
    cdf_vals = np.random.uniform(0, 1, 25)
    valuations = [inverse_transform(x, lambda_log_opt) for x in cdf_vals]
    returns2.append(np.mean(valuations) ** (1/8) - 1)
    cdf_vals = np.random.uniform(0, 1, 50)
    valuations = [inverse_transform(x, lambda_log_opt) for x in cdf_vals]
    returns3.append(np.mean(valuations) ** (1/8) - 1)
#Plot the 3 together
fig, axs = plt.subplots(3, 1, sharex=True, sharey=True,figsize=(5,10))
ax = axs[0]
ax.hist(returns1, bins=30)
ax.xaxis.set_major_formatter(PercentFormatter(1))
ax.set_xlabel("Fund CAGR")
ax.set_ylabel("Frequency")
ax.set_title("Exponential Simulated Funds N=10")
ax = axs[1]
ax.hist(returns2, bins=30)
ax.xaxis.set_major_formatter(PercentFormatter(1))
ax.set_xlabel("Fund CAGR")
ax.set_ylabel("Frequency")
ax.set_title("Exponential Simulated Funds N=25")
ax = axs[2]
ax.hist(returns3, bins=30)
ax.xaxis.set_major_formatter(PercentFormatter(1))
ax.set_xlabel("Fund CAGR")
ax.set_ylabel("Frequency")
ax.set_title("Exponential Simulated Funds N=50")
plt.show()
#Get the stats for the simulations
table = pd.DataFrame([[np.mean(r), np.std(r)] for r in [returns1, returns2, returns3]],
index=['N=10', 'N=25', 'N=50'],
columns=['Mean', 'STD'])
print(table)
Mean STD
N=10 0.056886 0.100331
N=25 0.075545 0.083401
N=50 0.083574 0.066899
Minimizing the Buckets¶
Another way to minimize the difference is to look at the differences between the "buckets" we saw in the bar graph at the beginning of this lesson. Doing this means we focus on replicating that bar graph rather than the overall curve. First, let's find the probability that each bucket holds under the actual CDF curve (the same approach will give us the probabilities for the predicted CDF). We convert the CDF to a NumPy array so we can subtract elementwise: all values except the first minus all values except the last gives the probability mass in each bucket.
buckets_actual = np.array(actual_cdf)[1:] - np.array(actual_cdf)[:-1]
print(buckets_actual)
[0.648 0.253 0.059 0.025 0.011 0.004]
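As a side note, NumPy's diff function computes the same successive differences in a single call:
#Equivalent one-liner using np.diff
print(np.diff(actual_cdf))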
Let's do the same thing with our predicted CDF curve to find its buckets.
lambda_log = 1.4
cdf_pred = [1-np.exp(-lambda_log * np.log(x+1)) for x in multiples]
buckets_pred = np.array(cdf_pred)[1:] - np.array(cdf_pred)[:-1]
print(buckets_pred)
[0.62107086 0.29753592 0.04655546 0.02074835 0.01002122 0.00250523]
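To eyeball how close the fit is, a quick side-by-side bar chart of the two bucket sets helps; here is a minimal sketch using the arrays computed above.
#Compare actual and predicted bucket probabilities visually
idx = np.arange(len(buckets_actual))
width = 0.4
plt.bar(idx - width/2, buckets_actual, width, label='Actual')
plt.bar(idx + width/2, buckets_pred, width, label='Predicted')
plt.legend()
plt.show()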
Measuring the Error¶
There are two ways we can think about the error. One is simply to take the sum of squared residuals, as before:
$ \sum_{i} (Y_{i} - \hat{Y}_{i})^2 $
One thing to consider is that low-probability buckets, such as the probability of landing in the 10-50X multiple bucket, carry almost no weight this way. If we predict a 0% probability for it, the penalty in the error is only $0.004^2$, which is a very small number. Because of this, we might instead want to weight each error relative to the actual value, so that being off by 50% in a bucket counts the same no matter how small the bucket is. The equation for this would be:
$ \sum_{i} \left(\frac{Y_{i} - \hat{Y}_{i}}{Y_{i}}\right)^2 $
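To make the difference concrete, take the smallest bucket above (0.004, or 0.4% of companies) and suppose we predict 0% for it. The absolute squared error barely registers, while the relative error counts it as a full unit:
#Penalty for completely missing the smallest bucket under each error definition
actual = 0.004
pred = 0.0
print((pred - actual) ** 2)              #1.6e-05: almost no penalty
print(((pred - actual) / actual) ** 2)   #1.0: a full unit of error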
#The error with method 1
print(((buckets_pred - buckets_actual) ** 2).sum())
0.0028847620990833157
#The error with method 2
print((((buckets_pred - buckets_actual)/buckets_actual) ** 2).sum())
0.25368970026315674
#We can define a function that does everything as before, but with buckets,
#plus an option to compute the error in relative terms
def find_error_buckets(lambda_log, buckets_actual, multiples, relative=False):
    lambda_log = lambda_log[0]
    cdf_pred = [1 - np.exp(-lambda_log * np.log(x + 1)) for x in multiples]
    buckets_pred = np.array(cdf_pred)[1:] - np.array(cdf_pred)[:-1]
    if relative:
        err = (((buckets_pred - buckets_actual) / buckets_actual) ** 2).sum()
    else:
        err = ((buckets_pred - buckets_actual) ** 2).sum()
    return err
print(find_error_buckets([lambda_log], buckets_actual, multiples, relative=False))
print(find_error_buckets([lambda_log], buckets_actual, multiples, relative=True))
0.0028847620990833157
0.25368970026315674
#Plot the error for non-relative buckets
X = np.arange(1.1, 1.705, .005)
err = []
for lambda_log in X:
    err.append(find_error_buckets([lambda_log], buckets_actual, multiples, relative=False))
plt.plot(X, err)
plt.show()
#Plot the error for relative buckets
X = np.arange(1.1, 1.705, .005)
err = []
for lambda_log in X:
    err.append(find_error_buckets([lambda_log], buckets_actual, multiples, relative=True))
plt.plot(X, err)
plt.show()
Choosing what to focus on makes a huge difference! When you minimize the relative errors, you penalize incorrect predictions in the tails much more heavily.
#Save the first value we found as lambda_log_opt1
lambda_log_opt1 = lambda_log_opt
#Find the other two values
lambda_log_opt2 = minimize(find_error_buckets, [1.4], args=(buckets_actual, multiples, False))['x'][0]
lambda_log_opt3 = minimize(find_error_buckets, [1.4], args=(buckets_actual, multiples, True))['x'][0]
print(lambda_log_opt1)
print(lambda_log_opt2)
print(lambda_log_opt3)
1.4440311069061806
1.5401965955614616
1.3114755373838654
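One way to see how different the three fits really are is the tail probability each lambda implies. Under the fitted CDF, $P(X > m) = (m+1)^{-\lambda}$. The sketch below evaluates this at an illustrative 50X cutoff for the three optimal lambdas found above; the relative-error fit (Case 3) puts noticeably more mass in the tail.
#Implied tail probability P(multiple > 50X) under each fitted lambda
for name, lam in [('Case 1', lambda_log_opt1), ('Case 2', lambda_log_opt2), ('Case 3', lambda_log_opt3)]:
    print(name, (50 + 1) ** -lam)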
#Finally, let's compare the impact of the three lambdas for N=10
np.random.seed(0)
returns1 = []
returns2 = []
returns3 = []
for _ in range(10000):
    #Reuse the same uniform draws for all three lambdas so the comparison is apples-to-apples
    cdf_vals = np.random.uniform(0, 1, 10)
    valuations = [inverse_transform(x, lambda_log_opt1) for x in cdf_vals]
    returns1.append(np.mean(valuations) ** (1/8) - 1)
    valuations = [inverse_transform(x, lambda_log_opt2) for x in cdf_vals]
    returns2.append(np.mean(valuations) ** (1/8) - 1)
    valuations = [inverse_transform(x, lambda_log_opt3) for x in cdf_vals]
    returns3.append(np.mean(valuations) ** (1/8) - 1)
#Plot the 3 together
fig, axs = plt.subplots(3, 1, sharex=True, sharey=True,figsize=(5,10))
ax = axs[0]
ax.hist(returns1, bins=30)
ax.xaxis.set_major_formatter(PercentFormatter(1))
ax.set_xlabel("Fund CAGR")
ax.set_ylabel("Frequency")
ax.set_title("Exponential Simulated Funds N=25 Case 1")
ax = axs[1]
ax.hist(returns2, bins=30)
ax.xaxis.set_major_formatter(PercentFormatter(1))
ax.set_xlabel("Fund CAGR")
ax.set_ylabel("Frequency")
ax.set_title("Exponential Simulated Funds N=25 Case 2")
ax = axs[2]
ax.hist(returns3, bins=30)
ax.xaxis.set_major_formatter(PercentFormatter(1))
ax.set_xlabel("Fund CAGR")
ax.set_ylabel("Frequency")
ax.set_title("Exponential Simulated Funds N=25 Case 3")
plt.show()
table = pd.DataFrame([[np.mean(r), np.std(r)] for r in [returns1, returns2, returns3]],
index=['Case 1', 'Case 2', 'Case 3'],
columns=['Mean', 'STD'])
print(table)
Mean STD
Case 1 0.057472 0.101631
Case 2 0.040114 0.094005
Case 3 0.085658 0.115089
Look how big a difference the assumptions can make. This is very important to consider when building out your models. As you gather more data, it becomes easier to model things accurately, but assumptions can always have a large impact!