Introduction
Grouping¶
Like in many database languages, grouping can be one of the most commonly used actions. What it allows you to do is quickly apply some sort of function based on groups within data. Let’s begin with an example of a dataset of sales where we have 4 columns: the day of the sale, the store of the sale, the product number, and the amount of sales.
import pandas as pd
#Create the data
sales = pd.DataFrame([[1, 1, 1, 10],
[1, 1, 2, 74],
[1, 1, 3, 27],
[1, 2, 1, 41],
[1, 2, 2, 66],
[1, 2, 3, 95],
[2, 1, 1, 1],
[2, 1, 2, 23],
[2, 1, 3, 67],
[2, 2, 1, 86],
[2, 2, 2, 87],
[2, 2, 3, 19],
[3, 1, 1, 6],
[3, 1, 2, 30],
[3, 1, 3, 55],
[3, 2, 1, 68],
[3, 2, 2, 32],
[3, 2, 3, 50]],
columns=['Day', 'Store', 'Product', 'Sales'])
print(sales)
The way we use groupby is we call groupby() on a dataframe, and pass in the column to group with as an argument. Then we need to call some sort of function to run after this! So in this case we are going to work with sum(). The following syntax will give us the total number of sales on each given day below. The index is also going to be set to the grouping column.
#Get the sum by day
print(sales.groupby('Day').sum())
The sales data makes sense, but for the product and store, we aren't actually getting real data. Those numbers were the store identifiers. So maybe we want to just limit to our sales data. If that is the case then we can use indexing prior to sum to make it so that only sales data is shown.
#Get the sum by day only for the sales
print(sales.groupby('Day')['Sales'].sum())
Likewise, any of the other columns could be used instead!
#Get sales by product
print(sales.groupby('Product')['Sales'].sum())
#Get sales by store
print(sales.groupby('Store')['Sales'].sum())