Basics
First, let’s set up two samples drawn from different normal distributions.
import numpy as np
import matplotlib.pyplot as plt
scores1 = [400+np.random.normal()*100 for _ in range(100)]
scores2 = [600+np.random.normal()*70 for _ in range(100)]
We can easily draw a box-and-whisker plot from these samples.
totalScores = [scores1,scores2]
plt.boxplot(totalScores)
plt.show()
Now, let’s work with some new data. We are going to create a line with some variation: each x value gets a corresponding y value that falls at random anywhere from 5 below to 5 above x.
from random import randint
xVals = list(range(100))
yVals = [x+randint(-5, 5) for x in xVals]
Let’s see what our data looks like.
plt.scatter(xVals,yVals)
plt.show()
What if, instead, the variation ranged from -20 to +20?
from random import randint
xVals = list(range(100))
yVals = [x+randint(-20, 20) for x in xVals]
plt.scatter(xVals,yVals)
plt.show()
The points are much more spread out, which leads into our discussion of correlation. First, both of these data sets show a relationship. What is it? We have a positive correlation, because as X goes up, so does Y. We could also have a negative correlation, where Y goes down as X goes up.
A negative correlation would look like this:
from random import randint
xVals = list(range(100))
yVals = [100-x+randint(-20, 20) for x in xVals]
plt.scatter(xVals,yVals)
plt.show()
We might also have a relationship that is not linear, such as a polynomial relationship.
from random import randint
xVals = list(range(100))
yVals = [x**2 for x in xVals]
plt.scatter(xVals,yVals)
plt.show()
Now, what is correlation? It is a measure of how closely two variables are linearly related. It ranges from -1, where the variables move in perfectly opposite directions (they may still differ by a constant multiple, such as one rising by 1 every time the other falls by 2), to 1, where the variables move in the same direction in perfect lockstep.
To compute the correlation, we can use np.corrcoef(), which returns a matrix. The entry in row i and column j is the correlation of variable i with variable j.
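As a rough sketch of how that matrix can be read, the example below rebuilds data like the scatter plots above and pulls the correlation out of the off-diagonal entry. The names rng, y_tight, y_wide and y_neg are made up for this illustration, and it uses NumPy's seeded random generator in place of randint so the numbers are repeatable.

```python
import numpy as np

# Seeded generator so the run is repeatable (the seed value is arbitrary).
rng = np.random.default_rng(0)

x = np.arange(100)

# rng.integers excludes the upper bound, so (-5, 6) gives noise in [-5, 5].
y_tight = x + rng.integers(-5, 6, size=100)
y_wide = x + rng.integers(-20, 21, size=100)
y_neg = 100 - x + rng.integers(-20, 21, size=100)

# corrcoef returns a 2x2 matrix for two variables:
# row 0 / column 0 is x, row 1 / column 1 is y.
corr_matrix = np.corrcoef(x, y_tight)
print(corr_matrix)

# The diagonal is each variable with itself (always 1),
# so the value we want is off the diagonal.
r_tight = corr_matrix[0, 1]
r_wide = np.corrcoef(x, y_wide)[0, 1]
r_neg = np.corrcoef(x, y_neg)[0, 1]

print("tight noise:", r_tight)
print("wide noise:", r_wide)
print("negative slope:", r_neg)
```

The tighter data should give a value closer to 1 than the more spread-out data, and the downward-sloping data should give a negative value.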
Challenge