-
Basic Functionality 6
-
Lecture1.1
-
Lecture1.2
-
Lecture1.3
-
Lecture1.4
-
Lecture1.5
-
Lecture1.6
-
-
Pandas & Indices 4
-
Lecture2.1
-
Lecture2.2
-
Lecture2.3
-
Lecture2.4
-
-
Joins, Unions & Order By 3
-
Lecture3.1
-
Lecture3.2
-
Lecture3.3
-
-
Group By & Views 3
-
Lecture4.1
-
Lecture4.2
-
Lecture4.3
-
Joins
First, I am going to set up all two tables, which will represent the grade and number of absences for class 1 and 2 in two different tables. One twist is that not all students are in both, some are in only one of the two classes.
import pandas as pd
import numpy as np
import sqlite3
import os
#First make sure if the database is already there that we erase it prior to this lesson
if os.path.exists("Students.db"):
os.remove("Students.db")
#Create the data
student_ids = list(range(1,101))
students = pd.DataFrame(student_ids)
students.columns = ['StudentID']
np.random.seed(0)
students['Classroom1'] = np.random.choice([True,False], 100, p=[.5,.5])
students['Classroom2'] = np.random.choice([True,False], 100, p=[.5,.5])
students['Grade1'] = (np.random.rand(100) * 50 + 50).astype(int)
students['Grade2'] = (np.random.rand(100) * 50 + 50).astype(int)
students['Absent1'] = (np.random.rand(100) * 5).astype(int)
students['Absent2'] = (np.random.rand(100) * 5).astype(int)
students = students.set_index("StudentID")
students1 = students[students['Classroom1']][['Grade1', 'Absent1']]
students2 = students[students['Classroom2']][['Grade2', 'Absent2']]
print(students1.head(5))
print()
print()
print(students2.head(5))
conn = sqlite3.connect('Students.db')
students1.to_sql("Class1", conn)
students2.to_sql("Class2", conn)
A left join takes one table, and then looks to see if a given key is present in the second table. If it is, then the columns are joined together. Entries in the second table not in the first are ignored. Take for example, this left join of the two classrooms where we join on the student ID.
#A basic left join will find the matching records from the first table and then add in the columns from the right table
#Notice we have Class1 and Class2 before fields to denote which we are talking about
query = "SELECT * FROM Class1 LEFT JOIN Class2 ON Class1.StudentID = Class2.StudentID;"
df = pd.read_sql(query, conn)
print(df)
You can pick specific columns by using TABLE.COLUMN where TABLE is replaced with the table name, and COLUMN is replaced with the column in that table. See how we can implement it to get only the grades.
#We can also pick specific columns from each table
query = """SELECT Class1.StudentID,
Class1.Grade1,
Class2.Grade2 FROM Class1
LEFT JOIN Class2 ON Class1.StudentID = Class2.StudentID;"""
df = pd.read_sql(query, conn)
print(df)
If you add AS after a column you are selecting followed by a new name you can change the name that is used for the columns.
#As well, we are able to label the columns differently with AS
query = """SELECT Class1.StudentID AS ID,
Class1.Grade1 AS G1,
Class2.Grade2 AS G2 FROM Class1
LEFT JOIN Class2 ON Class1.StudentID = Class2.StudentID;"""
df = pd.read_sql(query, conn)
print(df)
The above results in a dataframe returned with columns ID, G1, G2. An inner join does the same thing except only entries that have matching keys in both tables will be returned.
#An inner join works in a similar way except it only returns records that are in both tables
query = """SELECT Class1.StudentID AS ID,
Class1.Grade1 AS G1,
Class2.Grade2 AS G2 FROM Class1
INNER JOIN Class2 ON Class1.StudentID = Class2.StudentID;"""
df = pd.read_sql(query, conn)
print(df)