Linear Regression Part 2
Solution
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats

# xVals, yVals, and fittedValues carry over from the regression in Part 1
yVals = np.array(yVals)
fittedValues = np.array(fittedValues)
residuals = yVals - fittedValues  # residual = observed - fitted
print(residuals)
plt.scatter(xVals, residuals)
plt.show()
What you'll notice is that these residuals look random across the range of x. One condition for OLS (Ordinary Least Squares), which is what our linear regression uses, is that the residuals are not related to x.
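One rough numerical check (a sketch, not a formal specification test) is to regress the residuals on a nonlinear transform of x, such as x**2. Note that OLS residuals are uncorrelated with x itself by construction, so a plain correlation with x always comes out near zero; any lurking pattern has to be probed with something nonlinear.
# Sketch of a curvature check: regress the residuals on x**2.
# OLS residuals are orthogonal to x by construction, so we probe
# for a leftover pattern with a nonlinear transform instead.
curvature = scipy.stats.linregress([x**2 for x in xVals], residuals)
print(curvature.pvalue)  # a large p-value is consistent with patternless residuals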
For an example of this assumption being violated, check out the following; in the residual plot there is clearly a pattern that depends on the value of x.
# Quadratic data: a straight line is the wrong model here
xVals = list(range(100))
yVals = [x**2 for x in xVals]
result = scipy.stats.linregress(xVals, yVals)
m = result.slope
b = result.intercept
fittedValues = [m*x + b for x in xVals]
plt.scatter(xVals, yVals)
plt.plot(xVals, fittedValues, "r")
plt.title("Regression")
plt.show()

yVals = np.array(yVals)
fittedValues = np.array(fittedValues)
residuals = yVals - fittedValues
plt.scatter(xVals, residuals)
plt.title("Residuals")
plt.show()
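Running the curvature check sketched earlier on these residuals should flag the problem:
# The curvature check from earlier, applied to the quadratic data:
curvature = scipy.stats.linregress([x**2 for x in xVals], residuals)
print(curvature.pvalue)  # expect a tiny p-value here: the residuals track x**2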
Another assumption is that the residuals have constant variance (homoskedasticity); if they do not, the results are heteroskedastic. Check out this example.
from random import randint

# Noise whose spread grows with x: the variance is not constant
xVals = list(range(100))
yVals = [x + randint(-5, 5)*x/5 for x in xVals]
result = scipy.stats.linregress(xVals, yVals)
m = result.slope
b = result.intercept
fittedValues = [m*x + b for x in xVals]
plt.scatter(xVals, yVals)
plt.plot(xVals, fittedValues, "r")
plt.title("Regression")
plt.show()

yVals = np.array(yVals)
fittedValues = np.array(fittedValues)
residuals = yVals - fittedValues
plt.scatter(xVals, residuals)
plt.title("Residuals")
plt.show()
Notice how the variance of the residuals goes up as x goes up.
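A quick numerical sanity check (a rough sketch, not a formal test such as Breusch-Pagan) is to compare the spread of the residuals at low x against the spread at high x:
# Rough homoskedasticity check: compare the residual variance
# in the low-x half of the data against the high-x half.
half = len(residuals) // 2
print(np.var(residuals[:half]))  # variance for small x
print(np.var(residuals[half:]))  # variance for large x; expect it to be much larger here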
There are other assumptions as well: the independent variables can't be too strongly correlated with one another (multicollinearity), the errors should be normally distributed (as a rule of thumb, almost none should land more than 3 standard deviations from the mean), and the mean of the residuals should be 0 (a test of the residuals' mean against 0 should not reject zero, i.e. its p-value should be above 5%).
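As a sketch of how the last two could be checked numerically, scipy.stats provides a Shapiro-Wilk test for normality and a one-sample t-test for the mean (the zero-mean condition holds by construction for OLS residuals, so the t-test here is just a demonstration):
# Sketch: checking normality and the zero mean of the residuals.
stat, p_normal = scipy.stats.shapiro(residuals)  # Shapiro-Wilk normality test
print(p_normal)  # a small p-value suggests the residuals are not normal

stat, p_mean = scipy.stats.ttest_1samp(residuals, 0)  # one-sample t-test against mean 0
print(p_mean)  # OLS residuals average to 0 by construction, so expect p near 1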
Source Code