-
Pandas Basics 5
-
Lecture1.1
-
Lecture1.2
-
Lecture1.3
-
Lecture1.4
-
Lecture1.5
-
-
Data Transformations 6
-
Lecture2.1
-
Lecture2.2
-
Lecture2.3
-
Lecture2.4
-
Lecture2.5
-
Lecture2.6
-
-
Statistics 4
-
Lecture3.1
-
Lecture3.2
-
Lecture3.3
-
Lecture3.4
-
-
Reading and Writing Data 3
-
Lecture4.1
-
Lecture4.2
-
Lecture4.3
-
-
Joins 5
-
Lecture5.1
-
Lecture5.2
-
Lecture5.3
-
Lecture5.4
-
Lecture5.5
-
-
Grouping 4
-
Lecture6.1
-
Lecture6.2
-
Lecture6.3
-
Lecture6.4
-
-
Introduction to Numpy 4
-
Lecture7.1
-
Lecture7.2
-
Lecture7.3
-
Lecture7.4
-
-
Randomness 2
-
Lecture8.1
-
Lecture8.2
-
-
Numpy Data Functionality 1
-
Lecture9.1
-
Introduction
Grouping¶
Like in many database languages, grouping can be one of the most commonly used actions. What it allows you to do is quickly apply some sort of function based on groups within data. Let’s begin with an example of a dataset of sales where we have 4 columns: the day of the sale, the store of the sale, the product number, and the amount of sales.
import pandas as pd
#Create the data
sales = pd.DataFrame([[1, 1, 1, 10],
[1, 1, 2, 74],
[1, 1, 3, 27],
[1, 2, 1, 41],
[1, 2, 2, 66],
[1, 2, 3, 95],
[2, 1, 1, 1],
[2, 1, 2, 23],
[2, 1, 3, 67],
[2, 2, 1, 86],
[2, 2, 2, 87],
[2, 2, 3, 19],
[3, 1, 1, 6],
[3, 1, 2, 30],
[3, 1, 3, 55],
[3, 2, 1, 68],
[3, 2, 2, 32],
[3, 2, 3, 50]],
columns=['Day', 'Store', 'Product', 'Sales'])
print(sales)
Day Store Product Sales
0 1 1 1 10
1 1 1 2 74
2 1 1 3 27
3 1 2 1 41
4 1 2 2 66
5 1 2 3 95
6 2 1 1 1
7 2 1 2 23
8 2 1 3 67
9 2 2 1 86
10 2 2 2 87
11 2 2 3 19
12 3 1 1 6
13 3 1 2 30
14 3 1 3 55
15 3 2 1 68
16 3 2 2 32
17 3 2 3 50
The way we use groupby is we call groupby() on a dataframe, and pass in the column to group with as an argument. Then we need to call some sort of function to run after this! So in this case we are going to work with sum(). The following syntax will give us the total number of sales on each given day below. The index is also going to be set to the grouping column.
#Get the sum by day
print(sales.groupby('Day').sum())
Store Product Sales
Day
1 9 12 313
2 9 12 283
3 9 12 241
The sales data makes sense, but for the product and store, we aren't actually getting real data. Those numbers were the store identifiers. So maybe we want to just limit to our sales data. If that is the case then we can use indexing prior to sum to make it so that only sales data is shown.
#Get the sum by day only for the sales
print(sales.groupby('Day')['Sales'].sum())
Day
1 313
2 283
3 241
Name: Sales, dtype: int64
Likewise, any of the other columns could be used instead!
#Get sales by product
print(sales.groupby('Product')['Sales'].sum())
#Get sales by store
print(sales.groupby('Store')['Sales'].sum())
Product
1 212
2 312
3 313
Name: Sales, dtype: int64
Store
1 293
2 544
Name: Sales, dtype: int64