Boolean Indexing
Sometimes you want to grab only certain rows based on a comparison. This is the purpose of boolean indexing: you evaluate a condition for every row and get back only the rows where it is true. To begin, let's define a list of boolean values, one per row of the dataframe.
#Define a boolean index
bool_index = [True, False, True, False]
When you pass a boolean list to the dataframe in square brackets, the same way you would pass column names, it keeps only the rows whose position holds True. The list needs exactly one value per row.
print(df[bool_index])
          Name  Height  Weight     Type  Retired
0    Ray Lewis      73     250  Defense     True
2  Julio Jones      75     220  Offense    False
Notice above that we only got back the first and third rows. You would get the same result by writing the list inline, like below:
print(df[[True,False,True,False]])
          Name  Height  Weight     Type  Retired
0    Ray Lewis      73     250  Defense     True
2  Julio Jones      75     220  Offense    False
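The examples in this section use a dataframe df that was created in an earlier lecture. For readers running this section on its own, here is a minimal sketch that rebuilds it from the values printed in the outputs on this page and applies the boolean list. How df is constructed is an assumption; the original may have been built differently.

```python
import pandas as pd

# Rebuild the player table from the values printed in this section.
# Assumption: the original df came from an earlier lecture and may have been built differently.
data = [
    ("Ray Lewis", 73, 250, "Defense", True),
    ("Tom Brady", 76, 225, "Offense", False),
    ("Julio Jones", 75, 220, "Offense", False),
    ("Richard Sherman", 75, 194, "Defense", False),
]
df = pd.DataFrame(data, columns=["Name", "Height", "Weight", "Type", "Retired"])

# One boolean per row: True keeps the row, False drops it
bool_index = [True, False, True, False]
filtered = df[bool_index]
print(filtered)
```

The rows that survive keep their original index labels (0 and 2), which is why the printed index is not renumbered.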
What if we wanted to find the players who weigh more than 220 pounds and return only those players? The first step is to build our boolean index.
#We can also check the truth of a statement such as which rows have weight values over 220
print(df["Weight"] > 220)
0     True
1     True
2    False
3    False
Name: Weight, dtype: bool
With that boolean Series, it is easy to filter to the matching rows.
#And this allows us to filter based on an argument. In this case we can print only rows with weights over 220
print(df[df["Weight"]>220])
        Name  Height  Weight     Type  Retired
0  Ray Lewis      73     250  Defense     True
1  Tom Brady      76     225  Offense    False
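The two steps above, building the boolean Series and then filtering with it, can be sketched as one self-contained example. The variable names heavy and heavy_players are made up for illustration, and df is rebuilt from the values printed on this page. The .loc form is shown only as an equivalent spelling of the same filter.

```python
import pandas as pd

# Player table rebuilt from the values printed in this section (assumed construction)
df = pd.DataFrame(
    [
        ("Ray Lewis", 73, 250, "Defense", True),
        ("Tom Brady", 76, 225, "Offense", False),
        ("Julio Jones", 75, 220, "Offense", False),
        ("Richard Sherman", 75, 194, "Defense", False),
    ],
    columns=["Name", "Height", "Weight", "Type", "Retired"],
)

# Step 1: the comparison produces one True/False per row
heavy = df["Weight"] > 220

# Step 2: use the boolean Series to keep only the True rows
heavy_players = df[heavy]

# df.loc accepts the same boolean Series and selects the same rows
same_rows = df.loc[heavy]

print(heavy_players)
print(int(heavy.sum()), "players are over 220 pounds")
```

Julio Jones weighs exactly 220, so the strict comparison > 220 leaves him out; >= 220 would include him.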
Combining Dataframes
Often we have data from multiple sources and need to combine it into a single dataframe. As an example, we can first make a second set of data. Notice that a few values are set to None. This happens often when working with real data: sometimes a field can't be measured every time, or does not apply, so it has no value. pandas displays these missing values as NaN.
#Let's create a second set of data
data = []
data.append(("Allen Robinson",75,250,"Offense",False))
data.append(("Alvin Kamara",None,215,"Offense",False))
data.append(("Christian McCaffrey",71,None,"Offense",False))
#And turn it into a second dataframe
#You'll notice some data is missing, this happens commonly working with real data
df2 = pd.DataFrame(data,columns=["Name","Height","Weight","Type","Retired"])
print(df2)
                  Name  Height  Weight     Type  Retired
0       Allen Robinson    75.0   250.0  Offense    False
1         Alvin Kamara     NaN   215.0  Offense    False
2  Christian McCaffrey    71.0     NaN  Offense    False
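To see where the missing values ended up, a small sketch like the following may help. It rebuilds df2 from the code above and uses isna, which marks missing entries with True. It also prints the column types, since a numeric column that contains a missing value is stored as floats, which is why 75 is displayed as 75.0.

```python
import pandas as pd

# Same second data set as above, with two missing values
data = [
    ("Allen Robinson", 75, 250, "Offense", False),
    ("Alvin Kamara", None, 215, "Offense", False),
    ("Christian McCaffrey", 71, None, "Offense", False),
]
df2 = pd.DataFrame(data, columns=["Name", "Height", "Weight", "Type", "Retired"])

# Height and Weight become float columns because they hold NaN
print(df2.dtypes)

# True marks the rows where Height is missing
print(df2["Height"].isna())

# Count of missing values in each column
missing_per_column = df2.isna().sum()
print(missing_per_column)
```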
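The function pd.concat takes a list of dataframes to put together. By default they are stacked vertically, with the rows of each dataframe appended below the previous one, but you can change this behavior with the axis keyword. The code below takes the two dataframes and puts them together to make a new combined dataframe. Because the combined Height and Weight columns contain missing values, pandas stores them as floats, so 73 is printed as 73.0.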
#The pd.concat() function lets us put together dataframes
df_final = pd.concat([df,df2])
print(df_final)
                  Name  Height  Weight     Type  Retired
0            Ray Lewis    73.0   250.0  Defense     True
1            Tom Brady    76.0   225.0  Offense    False
2          Julio Jones    75.0   220.0  Offense    False
3      Richard Sherman    75.0   194.0  Defense    False
0       Allen Robinson    75.0   250.0  Offense    False
1         Alvin Kamara     NaN   215.0  Offense    False
2  Christian McCaffrey    71.0     NaN  Offense    False
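The axis keyword mentioned above can be illustrated with a short sketch. The two small dataframes here (names and sizes) are hypothetical and exist only for this example. With axis=1, pd.concat places the dataframes side by side and lines rows up by index label instead of stacking them.

```python
import pandas as pd

# Two hypothetical tables describing the same players, row for row
names = pd.DataFrame({"Name": ["Ray Lewis", "Tom Brady"]})
sizes = pd.DataFrame({"Height": [73, 76], "Weight": [250, 225]})

# axis=1 joins the columns side by side, matching rows on the index
side_by_side = pd.concat([names, sizes], axis=1)
print(side_by_side)
```

This only gives a sensible result when the index labels of the inputs refer to the same rows.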
Reset the Index
If you look at the index above, you will see that the labels overlap: 0, 1, and 2 each appear twice. Sometimes this is the behavior you want, but often you want a new, unique integer index. To get one, call reset_index. It returns a new dataframe with a fresh index from 0 to N-1 and keeps the old index as a column named index.
#The two dataframes have indexes that overlap! We could fix this by resetting the index, as so....
print(df_final.reset_index())
   index                 Name  Height  Weight     Type  Retired
0      0            Ray Lewis    73.0   250.0  Defense     True
1      1            Tom Brady    76.0   225.0  Offense    False
2      2          Julio Jones    75.0   220.0  Offense    False
3      3      Richard Sherman    75.0   194.0  Defense    False
4      0       Allen Robinson    75.0   250.0  Offense    False
5      1         Alvin Kamara     NaN   215.0  Offense    False
6      2  Christian McCaffrey    71.0     NaN  Offense    False
Alternatively, you can pass ignore_index=True to pd.concat. The old indexes are then simply discarded, and the combined dataframe gets a new index from 0 to N-1.
#Or we can give the argument ignore_index=True to reset it during the concat function
df_final = pd.concat([df,df2],ignore_index=True)
print(df_final)
                  Name  Height  Weight     Type  Retired
0            Ray Lewis    73.0   250.0  Defense     True
1            Tom Brady    76.0   225.0  Offense    False
2          Julio Jones    75.0   220.0  Offense    False
3      Richard Sherman    75.0   194.0  Defense    False
4       Allen Robinson    75.0   250.0  Offense    False
5         Alvin Kamara     NaN   215.0  Offense    False
6  Christian McCaffrey    71.0     NaN  Offense    False
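If the dataframes have already been concatenated, the old index can still be discarded afterwards by passing drop=True to reset_index. The sketch below uses two small hypothetical dataframes (a and b) to compare that route with ignore_index=True. Both should end with the same 0 to N-1 index and no extra index column.

```python
import pandas as pd

# Two small hypothetical tables, each with its own 0-based index
a = pd.DataFrame({"Name": ["Ray Lewis", "Tom Brady"], "Weight": [250, 225]})
b = pd.DataFrame({"Name": ["Allen Robinson"], "Weight": [250]})

# Plain concat keeps the original labels, so 0 appears twice
stacked = pd.concat([a, b])
print(list(stacked.index))

# Option 1: renumber afterwards and throw the old index away
via_reset = stacked.reset_index(drop=True)

# Option 2: renumber while concatenating
via_concat = pd.concat([a, b], ignore_index=True)

print(via_reset)
print(via_concat)
```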