Pivot
If you have used pivot tables in Excel, this function should feel natural to you. The pivot function takes three columns: the index argument names the column whose values become the rows, columns names the column whose values become the new column labels, and values names the column that fills in the cells. Notice how sales_location_data looks before and after pivoting.
#Create the data
import pandas as pd

sales_location_data = pd.DataFrame([["A", 1, 1000],
                                    ["B", 1, 5000],
                                    ["C", 1, 2000],
                                    ["A", 2, 4000],
                                    ["B", 2, 2000],
                                    ["A", 3, 8000],
                                    ["C", 3, 8000]],
                                   columns=['Product', 'Location', 'Sales'])
print("Before the pivot:")
print(sales_location_data)
print()
#Pivot: rows from Product, columns from Location, cell values from Sales
sales_location_data = sales_location_data.pivot(index='Product',
                                                columns='Location',
                                                values='Sales')
print("After the pivot:")
print(sales_location_data)
Before the pivot:
Product Location Sales
0 A 1 1000
1 B 1 5000
2 C 1 2000
3 A 2 4000
4 B 2 2000
5 A 3 8000
6 C 3 8000
After the pivot:
Location 1 2 3
Product
A 1000.0 4000.0 8000.0
B 5000.0 2000.0 NaN
C 2000.0 NaN 8000.0
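One caveat worth knowing: pivot raises a ValueError if the same (index, columns) pair appears more than once, because it cannot decide which value to keep. A minimal sketch (the duplicate rows below are made-up data, not from the example above) using pivot_table, which aggregates duplicates instead:

```python
import pandas as pd

# Hypothetical data where ("A", 1) appears twice
dup_data = pd.DataFrame([["A", 1, 1000],
                         ["A", 1, 500],
                         ["B", 1, 5000]],
                        columns=['Product', 'Location', 'Sales'])

# pivot() would raise a ValueError here because of the duplicate
# (Product, Location) pair; pivot_table() aggregates the duplicates,
# here by summing them
result = dup_data.pivot_table(index='Product', columns='Location',
                              values='Sales', aggfunc='sum')
print(result)
```

With aggfunc='sum', the two rows for product A at location 1 collapse into a single cell holding 1500.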
Where there is no value for a row/column combination, we automatically get a null value (NaN). We can fill those with 0 by using fillna.
#Fill null values
sales_location_data = sales_location_data.fillna(0)
print(sales_location_data)
Location 1 2 3
Product
A 1000.0 4000.0 8000.0
B 5000.0 2000.0 0.0
C 2000.0 0.0 8000.0
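As a variation (a sketch, rebuilding the same toy numbers), fillna also accepts a dict mapping column labels to fill values, so different columns can receive different defaults:

```python
import pandas as pd
import numpy as np

# Rebuild the pivoted table with its missing entries
pivoted = pd.DataFrame({1: [1000.0, 5000.0, 2000.0],
                        2: [4000.0, 2000.0, np.nan],
                        3: [8000.0, np.nan, 8000.0]},
                       index=pd.Index(['A', 'B', 'C'], name='Product'))
pivoted.columns.name = 'Location'

# Fill NaN in column 2 with 0 but in column 3 with -1
filled = pivoted.fillna({2: 0, 3: -1})
print(filled)
```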
Applying Functions to Different Axes
There are many built-in functions you can use to find statistics or other measures across either the rows or the columns. For example, calling mean() on the DataFrame we just worked with returns the average of each column.
print(sales_location_data.mean())
Location
1 2666.666667
2 2000.000000
3 5333.333333
dtype: float64
If you pass the argument axis=1, we get the mean of each row.
print(sales_location_data.mean(axis=1))
Product
A 4333.333333
B 2333.333333
C 3333.333333
dtype: float64
The default behavior is axis=0, the average of each column.
print(sales_location_data.mean(axis=0))
Location
1 2666.666667
2 2000.000000
3 5333.333333
dtype: float64
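Beyond the built-in statistics, apply lets you run your own function along either axis. A short sketch, rebuilding the filled table from above, that computes the max-minus-min range of each row:

```python
import pandas as pd

# Rebuild the pivoted table after filling nulls with 0
sales = pd.DataFrame({1: [1000.0, 5000.0, 2000.0],
                      2: [4000.0, 2000.0, 0.0],
                      3: [8000.0, 0.0, 8000.0]},
                     index=pd.Index(['A', 'B', 'C'], name='Product'))
sales.columns.name = 'Location'

# apply() runs a function over each column (axis=0) or each row
# (axis=1); here we get the spread of sales for each product
row_range = sales.apply(lambda s: s.max() - s.min(), axis=1)
print(row_range)
```

Each product's result is its largest sales figure minus its smallest, e.g. 8000 - 1000 = 7000 for product A.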
One thing to note is that these functions work differently with null values. Let's first bring back the data with null values.
#Create the data
sales_location_data = pd.DataFrame([["A", 1, 1000],
                                    ["B", 1, 5000],
                                    ["C", 1, 2000],
                                    ["A", 2, 4000],
                                    ["B", 2, 2000],
                                    ["A", 3, 8000],
                                    ["C", 3, 8000]],
                                   columns=['Product', 'Location', 'Sales'])
#Pivot
sales_location_data = sales_location_data.pivot(index='Product',
                                                columns='Location',
                                                values='Sales')
print(sales_location_data)
Location 1 2 3
Product
A 1000.0 4000.0 8000.0
B 5000.0 2000.0 NaN
C 2000.0 NaN 8000.0
If we call mean now, by column, it ignores the null values instead of counting them as 0!
print(sales_location_data.mean())
Location
1 2666.666667
2 3000.000000
3 8000.000000
dtype: float64
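If you would rather have missing values propagate instead of being skipped, mean (like most of these reductions) accepts skipna=False. A quick sketch on the same data:

```python
import pandas as pd
import numpy as np

# Rebuild the pivoted table with its missing entries
sales = pd.DataFrame({1: [1000.0, 5000.0, 2000.0],
                      2: [4000.0, 2000.0, np.nan],
                      3: [8000.0, np.nan, 8000.0]},
                     index=pd.Index(['A', 'B', 'C'], name='Product'))
sales.columns.name = 'Location'

# With skipna=False, any column containing a NaN gets NaN as its mean
col_means = sales.mean(skipna=False)
print(col_means)
```

Only location 1, which has no missing sales, gets a numeric mean; locations 2 and 3 come back as NaN.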
Because of this, it is important to decide how to handle null values! There are also many other functions you can apply, for example:
#Get a few metrics
print("Maximum:")
print(sales_location_data.max())
print()
print()
print("Minimum:")
print(sales_location_data.min())
print()
print()
print("Sum:")
print(sales_location_data.sum())
print()
print()
Maximum:
Location
1 5000.0
2 4000.0
3 8000.0
dtype: float64
Minimum:
Location
1 1000.0
2 2000.0
3 8000.0
dtype: float64
Sum:
Location
1 8000.0
2 6000.0
3 16000.0
dtype: float64
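Rather than calling max, min, and sum one at a time, agg can compute several statistics in a single call. A sketch on the same data:

```python
import pandas as pd
import numpy as np

# Rebuild the pivoted table with its missing entries
sales = pd.DataFrame({1: [1000.0, 5000.0, 2000.0],
                      2: [4000.0, 2000.0, np.nan],
                      3: [8000.0, np.nan, 8000.0]},
                     index=pd.Index(['A', 'B', 'C'], name='Product'))
sales.columns.name = 'Location'

# agg() applies each named reduction to every column; the result
# has one row per statistic and one column per Location
summary = sales.agg(['max', 'min', 'sum'])
print(summary)
```

As with the individual calls, NaN values are skipped by default, so the sums match the output above.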