Basics
First, let’s set up two samples drawn from different normal distributions.
import numpy as np
import matplotlib.pyplot as plt
# 100 draws each: mean 400 with std 100, and mean 600 with std 70
scores1 = np.random.normal(400, 100, 100)
scores2 = np.random.normal(600, 70, 100)
We can easily draw a box-and-whisker plot from these samples.
totalScores = [scores1,scores2]
plt.boxplot(totalScores)
plt.show()
Now, let’s work with some new data. We are going to create a line with some variation. Every x value will have a corresponding y value that falls randomly anywhere from 5 below to 5 above the x value.
from random import randint
xVals = list(range(100))
yVals = [x+randint(-5, 5) for x in xVals]
Let’s see what our data looks like.
plt.scatter(xVals,yVals)
plt.show()
What if, instead, the variation ranged from -20 to +20?
from random import randint
xVals = list(range(100))
yVals = [x+randint(-20, 20) for x in xVals]
plt.scatter(xVals,yVals)
plt.show()
The points are much more spread out, which leads into our discussion of correlation. First, both of these data sets show a relationship. What is it? We have a positive correlation because as X goes up, so does Y. We could also have a negative correlation, where Y goes down as X goes up.
Negative correlation would look like this:
from random import randint
xVals = list(range(100))
yVals = [100-x+randint(-20, 20) for x in xVals]
plt.scatter(xVals,yVals)
plt.show()
We might also have a relationship that is not linear, such as a polynomial relationship.
xVals = list(range(100))
# y is the square of x, so the points trace a curve rather than a line
yVals = [x**2 for x in xVals]
plt.scatter(xVals,yVals)
plt.show()
Now, what is correlation? It is a measure of how closely two variables are linearly related. It ranges from -1, where the variables always change in exactly opposite directions (they can still differ by a constant multiple, such as one changing by 1 meaning the other changes by -2 every time), to 1, where the variables always change together in the same direction.
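To make the -1 to 1 range concrete, here is a minimal sketch that computes the coefficient by hand. It assumes that "correlation" here means the Pearson correlation coefficient (covariance divided by the product of the standard deviations). The helper name pearson and the variables xs, r_same, r_scaled and r_square are made up for this illustration.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation (assumed definition): covariance scaled by both spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations of x from its mean
    dy = y - y.mean()  # deviations of y from its mean
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

xs = np.arange(100)
r_same = pearson(xs, xs)         # y changes exactly as x does
r_scaled = pearson(xs, -2 * xs)  # y drops by 2 every time x rises by 1
r_square = pearson(xs, xs ** 2)  # curved relationship, so not a perfect line

print(r_same, r_scaled, r_square)
```

The first value should come out at 1 and the second at -1 (up to floating-point rounding), even though y moves twice as fast as x in the second case. The squared data should land a little below 1, because the coefficient only measures how well a straight line describes the relationship.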
If we want to get the correlation, we can use np.corrcoef(), which returns a matrix. The entry in row i and column j of this matrix is the correlation of variable i with variable j.
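As a rough sketch of how that looks in practice, the snippet below rebuilds data shaped like the earlier scatter plots and passes it to np.corrcoef(). The seeded generator and the names x_vals, y_narrow, y_wide and y_negative are assumptions added here so the run is repeatable. They are not part of the earlier code, which used randint without a seed.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, chosen only for repeatability
x_vals = np.arange(100)

# rng.integers excludes the upper bound, so 6 and 21 give ranges of -5..5 and -20..20
y_narrow = x_vals + rng.integers(-5, 6, size=100)
y_wide = x_vals + rng.integers(-20, 21, size=100)
y_negative = 100 - x_vals + rng.integers(-20, 21, size=100)

corr_matrix = np.corrcoef(x_vals, y_narrow)  # 2x2 matrix for two variables
r_narrow = corr_matrix[0, 1]                 # correlation of variable 0 with variable 1
r_wide = np.corrcoef(x_vals, y_wide)[0, 1]
r_negative = np.corrcoef(x_vals, y_negative)[0, 1]

print(corr_matrix)
print(r_narrow, r_wide, r_negative)
```

The diagonal entries are 1 because every variable is perfectly correlated with itself, and the matrix is symmetric, so the value of interest is the off-diagonal entry at [0, 1]. The wider spread should give a smaller positive value than the narrow one, and the downward-sloping data should give a negative value.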
Challenge