Basics
We use a chi-squared test of independence when we have counts for two categorical variables arranged in a table like the one below. Let’s say we have a table recording whether or not each athlete was injured in a season and which sport they play.
| Football | Basketball | Soccer |
Injured | 50 | 40 | 20 |
Not Injured | 30 | 25 | 15 |
The first step in the chi-squared test is to get the row and column totals. Let’s do it by hand first.
| Football | Basketball | Soccer | Total |
Injured | 50 | 40 | 20 | 110 |
Not Injured | 30 | 25 | 15 | 70 |
Total | 80 | 65 | 35 | 180 |
Now let’s look at some code. We can represent our table as a NumPy array.
import numpy as np
# Rows: Injured, Not Injured; columns: Football, Basketball, Soccer
table = np.array([[50, 40, 20], [30, 25, 15]])
table
NumPy has built-in functions to get the overall sum, or the sums along rows or columns. If we want the column totals (summing down the rows) we use axis=0, if we want the row totals (summing across the columns) we use axis=1, and if we want the grand total we don’t give any arguments.
print(table.sum(axis=0))  # column totals: 80, 65, 35
print(table.sum(axis=1))  # row totals: 110, 70
print(table.sum())        # grand total: 180
Now, recall that if two variables are independent, then P(A and B) = P(A)*P(B). What we are going to do now is predict the count in each square under this hypothesis. For example, for the top left square we first find the probability that an athlete is injured and the probability that an athlete plays football. Multiplying the two and scaling by the sample size gives the expected number of athletes in that square (if the two variables are independent).
So if we have a sample size n, each square belongs to one row and one column. We want to find E = P(A)*P(B)*n, which is equal to (Row Total/n)*(Column Total/n)*n and simplifies to (Row Total*Column Total)/n.
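To make the formula concrete, here is a minimal sketch for the top left square (injured football players). The variable names are our own, chosen only for illustration.

```python
import numpy as np

table = np.array([[50, 40, 20], [30, 25, 15]])

n = table.sum()                   # sample size: 180
row_total = table.sum(axis=1)[0]  # Injured row total: 110
col_total = table.sum(axis=0)[0]  # Football column total: 80

p_injured = row_total / n         # P(A)
p_football = col_total / n        # P(B)

# Expected count under independence: P(A) * P(B) * n
expected_top_left = p_injured * p_football * n
print(round(expected_top_left, 2))  # 48.89
```

So if injury and sport were independent, we would expect about 48.9 injured football players, compared with the 50 we observed.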
Let’s compute the expected values. We can get the row index, column index and value for each square with np.ndenumerate().
# index is a (row, column) tuple and x is the count in that square
for index, x in np.ndenumerate(table):
    print(index, x)
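One way to turn that loop into the full table of expected values is sketched below. It assumes the same table as above, and the names row_totals, col_totals and expected are our own, picked for illustration.

```python
import numpy as np

table = np.array([[50, 40, 20], [30, 25, 15]])

n = table.sum()
row_totals = table.sum(axis=1)  # one total per row
col_totals = table.sum(axis=0)  # one total per column

# Fill in (Row Total * Column Total) / n for every square
expected = np.zeros(table.shape)
for (i, j), _ in np.ndenumerate(table):
    expected[i, j] = row_totals[i] * col_totals[j] / n

print(expected.round(2))
```

The expected values add up to the same grand total as the observed table, which is a handy sanity check.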
Challenge