Course outline:

- Graphing Data (4 lectures)
- Mean and Standard Deviation (5 lectures)
- Distributions (6 lectures)
- Correlation and Linear Regression (7 lectures)
- Probability (3 lectures)
- Counting Principles (3 lectures)
- Binomial Distribution (3 lectures)
- Confidence Interval (7 lectures)
- Proportion Confidence Interval (3 lectures)
- Hypothesis Testing (5 lectures)
- Comparing Two Means (5 lectures)
- Chi-squared Test (3 lectures)
Correlation
from random import randint
import matplotlib.pyplot as plt
import numpy as np

xVals = list(range(100))
yVals = [x + randint(-10, 10) for x in xVals]
plt.scatter(xVals, yVals)
plt.show()
print(np.corrcoef(xVals, yVals))
As you can see, the diagonal is 1 because a variable is perfectly correlated with itself. If we take the first row, second column, we get the correlation between x and y.
print(np.corrcoef(xVals,yVals)[0][1])
There is a strong correlation!
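As a sanity check on what np.corrcoef is computing, here is a hand-rolled Pearson correlation; the helper name pearson_r is mine, not from the lecture, and this is a sketch rather than NumPy's actual implementation:

```python
import numpy as np
from random import randint

def pearson_r(x, y):
    """Pearson correlation: covariance over the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xd = x - x.mean()
    yd = y - y.mean()
    return (xd * yd).sum() / np.sqrt((xd ** 2).sum() * (yd ** 2).sum())

xVals = list(range(100))
yVals = [x + randint(-10, 10) for x in xVals]
print(pearson_r(xVals, yVals))
print(np.corrcoef(xVals, yVals)[0][1])  # should match to floating-point precision
```

The two printed values agree, since np.corrcoef normalizes the covariance matrix in exactly this way.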
What about if y = x?
xVals = list(range(100))
yVals = [x for x in xVals]
plt.scatter(xVals,yVals)
plt.show()
print(np.corrcoef(xVals,yVals)[0][1])
Perfect correlation!
What if y = 2*x?
xVals = list(range(100))
yVals = [x*2 for x in xVals]
plt.scatter(xVals,yVals)
plt.show()
print(np.corrcoef(xVals,yVals)[0][1])
Still perfect correlation. Even though y is twice x, it is twice x every single time, so y relates perfectly to x. If this were a representative sample, we could predict y from x 100% of the time.
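This scale-invariance holds for any linear transform y = a*x + b; only the sign of the slope changes the result. A quick check (the slope/intercept pairs here are arbitrary choices of mine):

```python
import numpy as np

xVals = list(range(100))
# Any linear transform y = a*x + b keeps |r| = 1; the sign follows the slope.
for a, b in [(2, 0), (0.5, 10), (-3, 7)]:
    yVals = [a * x + b for x in xVals]
    print(f"y = {a}*x + {b}: r = {np.corrcoef(xVals, yVals)[0][1]}")
```

A negative slope gives a correlation of -1: still a perfect linear relationship, just a decreasing one.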
What about y = x**2?
xVals = list(range(100))
yVals = [x**2 for x in xVals]
plt.scatter(xVals,yVals)
plt.show()
print(np.corrcoef(xVals,yVals)[0][1])
The correlation isn’t perfect, even though we can predict y from x exactly. Why is that? It’s because correlation assumes the relationship is linear. If we spotted the trend, though, we could transform the data to get perfect correlation.
xVals = list(range(100))
yVals = [x**2 for x in xVals]
yVals2 = [y**.5 for y in yVals]
plt.scatter(xVals,yVals2)
plt.show()
print(np.corrcoef(xVals,yVals2)[0][1])
If we apply a square root to our data, we have perfect correlation! Applying transformations like this is something we do a lot when working with data. For example, if a process grows exponentially, we might apply a log transformation to see the relationship. If we were predicting with it, we would make predictions for the log values, then reverse the log on each prediction.
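To make the log-transform idea concrete, here is a sketch; the growth rate 0.1 and the prediction point x = 50 are made-up values for illustration, not from the lecture:

```python
import numpy as np

xVals = list(range(1, 100))
yVals = [2 ** (0.1 * x) for x in xVals]   # exponential growth (assumed rate)

# Correlation on the raw values is dragged down by the curvature...
print(np.corrcoef(xVals, yVals)[0][1])

# ...but after a log transform the relationship is exactly linear.
logY = np.log(yVals)
print(np.corrcoef(xVals, logY)[0][1])

# Predicting on the original scale: fit log(y), predict, then reverse the log.
slope, intercept = np.polyfit(xVals, logY, 1)
pred = np.exp(slope * 50 + intercept)     # prediction for x = 50
print(pred)                               # close to 2 ** 5 = 32
```

The fit is done on the log scale, and np.exp undoes the log so the prediction comes back in the original units.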
What if the bounds on our linear relationship increased?
xVals = list(range(100))
yVals = [x+randint(-25,25) for x in xVals]
plt.scatter(xVals,yVals)
plt.show()
print(np.corrcoef(xVals,yVals)[0][1])
As the distribution spreads more, we get a lower correlation.
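We can make "more spread, less correlation" concrete by sweeping the noise bound; the particular spreads and the fixed seed are my choices for a repeatable illustration:

```python
import numpy as np
from random import randint, seed

seed(0)  # fixed seed so reruns give the same numbers
xVals = list(range(100))
rs = {}
for spread in [5, 25, 100, 400]:
    yVals = [x + randint(-spread, spread) for x in xVals]
    rs[spread] = np.corrcoef(xVals, yVals)[0][1]
    print(spread, rs[spread])
```

As the noise bound grows relative to the range of x, the linear signal gets buried and the correlation falls toward 0.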
What about a factor that is not related to x?
xVals = list(range(100))
yVals = [randint(-25,25) for x in xVals]
plt.scatter(xVals,yVals)
plt.show()
print(np.corrcoef(xVals,yVals)[0][1])
It is supposed to be 0, but more likely than not you got a nonzero number; I got .15. This is because this is a random sample, so there is a chance that it looks as though there is a relationship even when there is not. This brings up the law of large numbers: as the sample size goes up, the sample reflects the underlying distribution more accurately. Let’s loop 20 times with a sample size of 100 to see what correlations we get.
for _ in range(20):
    xVals = list(range(100))
    yVals = [randint(-25, 25) for x in xVals]
    print(np.corrcoef(xVals, yVals)[0][1])
Notice how many different correlations we get! Let’s see the difference when we have a much larger sample size.
for _ in range(20):
    xVals = list(range(100000))
    yVals = [randint(-25, 25) for x in xVals]
    print(np.corrcoef(xVals, yVals)[0][1])
We get correlations much closer to 0 by increasing the sample size.
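One way to quantify this is to measure how spread out the sample correlations are at each sample size; the helper corr_spread, the trial count, and the seed below are assumptions of mine, not from the lecture:

```python
import numpy as np
from random import randint, seed

seed(0)  # fixed seed so reruns are repeatable

def corr_spread(n, trials=50):
    """Standard deviation of sample correlations between an index and pure noise."""
    rs = []
    for _ in range(trials):
        xVals = list(range(n))
        yVals = [randint(-25, 25) for _ in xVals]
        rs.append(np.corrcoef(xVals, yVals)[0][1])
    return np.std(rs)

spread_small = corr_spread(100)
spread_big = corr_spread(10000)
print(spread_small, spread_big)
```

The spread of the spurious correlations shrinks sharply as n grows (roughly like 1/sqrt(n)), which is why the big-sample correlations all cluster near 0.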