-
Basic Functionality 6
-
Lecture1.1
-
Lecture1.2
-
Lecture1.3
-
Lecture1.4
-
Lecture1.5
-
Lecture1.6
-
-
Pandas & Indices 4
-
Lecture2.1
-
Lecture2.2
-
Lecture2.3
-
Lecture2.4
-
-
Joins, Unions & Order By 3
-
Lecture3.1
-
Lecture3.2
-
Lecture3.3
-
-
Group By & Views 3
-
Lecture4.1
-
Lecture4.2
-
Lecture4.3
-
Group By
First, set up some data. We will create a dataset which has stock returns on 10 days, for 100 stocks where each stock is part of one of five sectors. We are also setting the argument of index to False when we are using to_sql() for our pandas dataframe.
import sqlite3
import os
import pandas as pd
from datetime import datetime
import numpy as np
import os
#First make sure if the database is already there that we erase it prior to this lesson
if os.path.exists("Stock.db"):
os.remove("Stock.db")
#Create the data
data = pd.DataFrame(list(range(1,101)) * 10)
data.columns = ["ID"]
l = []
for date in pd.date_range(datetime(2019,1,1), datetime(2019,1,10)):
l.extend([date] * 100)
data['Date'] = l
data['Sector'] = ["Healthcare", "Financials", "Energy", "Consumer Staples", "Consumer Discretionary"] * 200
np.random.seed(1)
data['Return'] = np.random.normal(0,.01,1000)
print(data.head(5))
conn = sqlite3.connect('Stock.db')
#Set index to false so we do not add the index to the SQL table
data.to_sql("Returns", conn, index=False)
Now, we know already how to pull the average of a column.
#We already know how to pull an average
query = "SELECT AVG(Return) FROM Returns"
df = pd.read_sql(query, conn)
print(df)
Using GROUP BY, we can apply things like average, count, etc to groups of data. The syntax is to say GROUP BY followed by a column name that we want to group with. GROUP BY ID, for example, applies our anaylsis to each group of records that have matching IDs.
#GROUP BY allows you to run something like an average over a specific grouping
query = "SELECT AVG(Return) FROM Returns GROUP BY ID"
df = pd.read_sql(query, conn)
print(df)
One issue is that we do not know which number relates to which ID. We need to also add in pulling the column ID to the statement like so:
#But we also want to make sure that we are getting the column back that we are grouping on
query = "SELECT ID, AVG(Return) FROM Returns GROUP BY ID"
df = pd.read_sql(query, conn)
print(df)
We can repeat this same idea for observing the average return by date.
#We can also do the same with grouping with date
query = "SELECT Date, AVG(Return) FROM Returns GROUP BY Date"
df = pd.read_sql(query, conn)
print(df)
In addition to just grouping by one field, it is possible to group by multiple fields which will find the average for all combinations of the fields. For example, the following finds the average for each sector on each date.
#It is also possible to group by combinations of fields such as date and sector for example
query = "SELECT Date, Sector, AVG(Return) FROM Returns GROUP BY Date, Sector"
df = pd.read_sql(query, conn)
print(df)