Boolean Indexing

Boolean Indexing¶

Sometimes you want to grab only the rows that satisfy some comparison. This is the purpose of boolean indexing: you evaluate a condition for every row, then return only the rows where the condition is true. To begin, let’s define a list of boolean values.

In [29]:

#Define a boolean index
bool_index = [True, False, True, False]

When you pass a boolean index to the dataframe, using the same square-bracket syntax you use to select columns, it keeps only the rows whose entry is True. The list needs one value per row, so here it has four entries.

In [30]:

print(df[bool_index])

          Name  Height  Weight     Type  Retired
0    Ray Lewis      73     250  Defense     True
2  Julio Jones      75     220  Offense    False

Notice above that we only got back the first and third rows! Of course, you would get the same result by typing the list inline, like below:

In [31]:

print(df[[True,False,True,False]])

          Name  Height  Weight     Type  Retired
0    Ray Lewis      73     250  Defense     True
2  Julio Jones      75     220  Offense    False
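As a minimal sketch of why the list needs one value per row, the block below rebuilds the example dataframe from the rows printed in this section and tries both a full-length and a too-short boolean list. The names good_mask, short_mask and mismatch_rejected are made up for this illustration, and the expectation that pandas raises ValueError for a length mismatch is an assumption about the pandas version in use.

```python
import pandas as pd

# Rebuild the example dataframe from the rows printed in this section
df = pd.DataFrame(
    [
        ("Ray Lewis", 73, 250, "Defense", True),
        ("Tom Brady", 76, 225, "Offense", False),
        ("Julio Jones", 75, 220, "Offense", False),
        ("Richard Sherman", 75, 194, "Defense", False),
    ],
    columns=["Name", "Height", "Weight", "Type", "Retired"],
)

# One True/False entry per row: rows marked True are kept
good_mask = [True, False, True, False]
picked = df[good_mask]
print(picked["Name"].tolist())

# A list that is too short cannot be lined up with the rows.
# Assumption: pandas raises ValueError for this mismatch.
short_mask = [True, False]
try:
    df[short_mask]
    mismatch_rejected = False
except ValueError as err:
    mismatch_rejected = True
    print("Rejected:", err)
```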

What if we wanted to find out which players are over 220 pounds and return only those players? The first step is to get our boolean index.

In [32]:

#A comparison on a column checks every row at once, here which rows have a weight over 220
print(df["Weight"] > 220)

0     True
1     True
2    False
3    False
Name: Weight, dtype: bool

With that boolean Series in hand, it is easy to filter to the matching rows.

In [33]:

#Passing the comparison as the index filters the dataframe. Here we print only the rows with weights over 220
print(df[df["Weight"]>220])

        Name  Height  Weight     Type  Retired
0  Ray Lewis      73     250  Defense     True
1  Tom Brady      76     225  Offense    False
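The two steps above can be put together in one self-contained sketch. It rebuilds the example dataframe from the rows printed in this section, stores the comparison in a variable (mask and heavy are names chosen here, not from the original notebook), and then filters with it. The last lines go slightly beyond the text: they show, as an illustration only, that two comparisons can be combined with the & operator as long as each one is wrapped in parentheses.

```python
import pandas as pd

# Rebuild the example dataframe from the rows printed in this section
df = pd.DataFrame(
    [
        ("Ray Lewis", 73, 250, "Defense", True),
        ("Tom Brady", 76, 225, "Offense", False),
        ("Julio Jones", 75, 220, "Offense", False),
        ("Richard Sherman", 75, 194, "Defense", False),
    ],
    columns=["Name", "Height", "Weight", "Type", "Retired"],
)

# Step 1: the comparison produces one True/False value per row
mask = df["Weight"] > 220

# Step 2: indexing with the mask keeps only the rows marked True
heavy = df[mask]
print(heavy)

# True counts as 1, so summing the mask counts the matching rows
print(mask.sum())

# Illustration beyond the text: combine two comparisons with &
heavy_offense = df[(df["Weight"] > 220) & (df["Type"] == "Offense")]
print(heavy_offense["Name"].tolist())
```

Storing the mask in a variable is optional, but it makes the filter reusable and easier to read than nesting the comparison inside the brackets.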

Combining Dataframes¶

Often we will have data from multiple sources and will need to combine it into a single dataframe. As an example, we can first make a second set of data. You will notice that a few values are set to None. This happens often when working with real data: sometimes a field can't be measured every time, or it does not apply, so it has the value None instead. pandas displays these missing values as NaN in numeric columns.

In [34]:

#Let's create a second set of data

data = []
data.append(("Allen Robinson",75,250,"Offense",False))
data.append(("Alvin Kamara",None,215,"Offense",False))
data.append(("Christian McCaffrey",71,None,"Offense",False))

#And turn it into a second dataframe
#You'll notice some data is missing; this is common when working with real data
df2 = pd.DataFrame(data,columns=["Name","Height","Weight","Type","Retired"])
print(df2)

                  Name  Height  Weight     Type  Retired
0       Allen Robinson    75.0   250.0  Offense    False
1         Alvin Kamara     NaN   215.0  Offense    False
2  Christian McCaffrey    71.0     NaN  Offense    False
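To see how the missing values are stored, the sketch below rebuilds df2 and inspects it with dtypes and isna. The variable missing_per_column is a name introduced here for illustration. Because NaN is a floating-point value, the Height and Weight columns are stored as floats, which is why the output above prints 75.0 rather than 75.

```python
import pandas as pd

# Rebuild the second dataframe, including its missing values
df2 = pd.DataFrame(
    [
        ("Allen Robinson", 75, 250, "Offense", False),
        ("Alvin Kamara", None, 215, "Offense", False),
        ("Christian McCaffrey", 71, None, "Offense", False),
    ],
    columns=["Name", "Height", "Weight", "Type", "Retired"],
)

# Columns that contain a missing number are stored as floats
print(df2.dtypes)

# isna() marks every missing cell with True
print(df2.isna())

# Summing the True values counts the missing cells in each column
missing_per_column = df2.isna().sum()
print(missing_per_column)
```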

The function pd.concat takes a list of dataframes to put together. By default they are stacked vertically, one below the other, but you can change this behavior with the axis keyword. The code below takes the two dataframes and puts them together to make a new combined dataframe.

In [35]:

#The pd.concat() function lets us put together dataframes
df_final = pd.concat([df,df2])
print(df_final)

                  Name  Height  Weight     Type  Retired
0            Ray Lewis    73.0   250.0  Defense     True
1            Tom Brady    76.0   225.0  Offense    False
2          Julio Jones    75.0   220.0  Offense    False
3      Richard Sherman    75.0   194.0  Defense    False
0       Allen Robinson    75.0   250.0  Offense    False
1         Alvin Kamara     NaN   215.0  Offense    False
2  Christian McCaffrey    71.0     NaN  Offense    False
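As a rough illustration of the axis keyword mentioned above, the sketch below rebuilds both dataframes and concatenates them both ways. The names stacked and side_by_side are chosen here for the example. With axis=1 the dataframes are placed next to each other and rows are matched on their index, so the row that exists only in df is padded with NaN on the df2 side.

```python
import pandas as pd

columns = ["Name", "Height", "Weight", "Type", "Retired"]

# Rebuild both example dataframes from the rows printed in this section
df = pd.DataFrame(
    [
        ("Ray Lewis", 73, 250, "Defense", True),
        ("Tom Brady", 76, 225, "Offense", False),
        ("Julio Jones", 75, 220, "Offense", False),
        ("Richard Sherman", 75, 194, "Defense", False),
    ],
    columns=columns,
)
df2 = pd.DataFrame(
    [
        ("Allen Robinson", 75, 250, "Offense", False),
        ("Alvin Kamara", None, 215, "Offense", False),
        ("Christian McCaffrey", 71, None, "Offense", False),
    ],
    columns=columns,
)

# Default (axis=0): rows of df2 are placed below the rows of df
stacked = pd.concat([df, df2])
print(stacked.shape)

# axis=1: columns of df2 are placed to the right of the columns of df
side_by_side = pd.concat([df, df2], axis=1)
print(side_by_side.shape)
```

Stacking vertically is usually what you want when both dataframes describe the same kind of record, as they do here; axis=1 is more useful when the dataframes hold different columns for the same rows.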

Reset the Index¶

If you look at the index above, you will see that the values overlap: 0, 1 and 2 each appear twice. Sometimes this is the behavior you want, but often you want a new, unique integer index. To do this we can call reset_index. This returns a new dataframe with a fresh index from 0 to N-1, and keeps the old index as an added column named index.

In [36]:

#The two dataframes have indexes that overlap! We could fix this by resetting the index, like so:
print(df_final.reset_index())

   index                 Name  Height  Weight     Type  Retired
0      0            Ray Lewis    73.0   250.0  Defense     True
1      1            Tom Brady    76.0   225.0  Offense    False
2      2          Julio Jones    75.0   220.0  Offense    False
3      3      Richard Sherman    75.0   194.0  Defense    False
4      0       Allen Robinson    75.0   250.0  Offense    False
5      1         Alvin Kamara     NaN   215.0  Offense    False
6      2  Christian McCaffrey    71.0     NaN  Offense    False

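If you instead pass ignore_index=True to pd.concat, the old index is simply dropped and the combined dataframe gets a new index from 0 to N-1. (Calling reset_index(drop=True) on the combined dataframe has the same effect.)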

In [37]:

#Or we can give the argument ignore_index=True to reset it during the concat function
df_final = pd.concat([df,df2],ignore_index=True)
print(df_final)

                  Name  Height  Weight     Type  Retired
0            Ray Lewis    73.0   250.0  Defense     True
1            Tom Brady    76.0   225.0  Offense    False
2          Julio Jones    75.0   220.0  Offense    False
3      Richard Sherman    75.0   194.0  Defense    False
4       Allen Robinson    75.0   250.0  Offense    False
5         Alvin Kamara     NaN   215.0  Offense    False
6  Christian McCaffrey    71.0     NaN  Offense    False
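The two ways of getting a clean index can be compared side by side. The sketch below is an illustration under the assumption that df and df2 hold the rows printed in this section; the names via_reset and via_ignore are made up here. It uses reset_index(drop=True), which discards the old index instead of keeping it as a column, and checks that the result matches pd.concat with ignore_index=True.

```python
import pandas as pd

columns = ["Name", "Height", "Weight", "Type", "Retired"]

# Rebuild both example dataframes from the rows printed in this section
df = pd.DataFrame(
    [
        ("Ray Lewis", 73, 250, "Defense", True),
        ("Tom Brady", 76, 225, "Offense", False),
        ("Julio Jones", 75, 220, "Offense", False),
        ("Richard Sherman", 75, 194, "Defense", False),
    ],
    columns=columns,
)
df2 = pd.DataFrame(
    [
        ("Allen Robinson", 75, 250, "Offense", False),
        ("Alvin Kamara", None, 215, "Offense", False),
        ("Christian McCaffrey", 71, None, "Offense", False),
    ],
    columns=columns,
)

# Option 1: concatenate first, then throw away the overlapping index
via_reset = pd.concat([df, df2]).reset_index(drop=True)

# Option 2: ask concat to ignore the original indexes directly
via_ignore = pd.concat([df, df2], ignore_index=True)

print(via_reset.index.tolist())
print(via_reset.equals(via_ignore))
```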
