-
Basic Functionality 6
-
Lecture1.1
-
Lecture1.2
-
Lecture1.3
-
Lecture1.4
-
Lecture1.5
-
Lecture1.6
-
-
Pandas & Indices 4
-
Lecture2.1
-
Lecture2.2
-
Lecture2.3
-
Lecture2.4
-
-
Joins, Unions & Order By 3
-
Lecture3.1
-
Lecture3.2
-
Lecture3.3
-
-
Group By & Views 3
-
Lecture4.1
-
Lecture4.2
-
Lecture4.3
-
More Joins
In other versions of SQL, we will have the ability to do an outer join, but for sqlite, we are not able to execute it. The outer join is a join which combines all the keys and fills missing first table values with null and fills missing table 2 values with null as well. We can actually create our own version of this. The first step is understanding that we could left join the tables in the reverse order, which would we would have all keys in the second table present in the output. The combination of this and the left join with the other order would cover every single key. The only issue would be that we would end up with duplicates for any records that are in both tables. If we filter to only records that have a NULL value for the first table’s data, we can fix this issue. Observe below the code which will give us all IDs that are in table 2 but not in table 1.
#If we left join from the second class and then filter to only the null values, we get values that
#are only present in table 2
query = """SELECT Class2.StudentID AS ID,
Class1.Grade1 AS G1,
Class2.Grade2 AS G2 FROM Class2
LEFT JOIN Class1 ON Class1.StudentID = Class2.StudentID WHERE Class1.StudentID IS NULL;"""
df = pd.read_sql(query, conn)
print(df)
Combining this with the prior result using UNION gives us all of our data in the style of an OUTER join.
#With UNION we could add these on to replicate an outer join
query = """SELECT Class1.StudentID AS ID,
Class1.Grade1 AS G1,
Class2.Grade2 AS G2 FROM Class1
LEFT JOIN Class2 ON Class1.StudentID = Class2.StudentID
UNION
SELECT Class2.StudentID AS ID,
Class1.Grade1 AS G1,
Class2.Grade2 AS G2 FROM Class2
LEFT JOIN Class1 ON Class1.StudentID = Class2.StudentID WHERE Class1.StudentID IS NULL;"""
df = pd.read_sql(query, conn)
print(df)
Let’s move on to another important topic. Imagine a scenario where we had a student taking two classes, and this person’s grade was submitted for each class with the assignment name and the grade. There is a midterm and a final for each class, and the two tables each represent a class. If we wanted to do an left join to try and get the data to be ID, assignment name, class 1 grade, then class 2 grade we might try the following:
#Add some new data
G1 = pd.DataFrame([[1, "Midterm", 90],
[1, "Final", 85]], columns=['ID', 'Assignment', 'Grade'])
G2 = pd.DataFrame([[1, "Midterm", 87],
[1, "Final", 89]], columns=['ID', 'Assignment', 'Grade'])
G1.to_sql("Grades1", conn)
G2.to_sql("Grades2", conn)
#Try a left join the wrong way
query = """SELECT Grades1.ID,
Grades1.Assignment AS C1Assignment,
Grades2.Assignment AS C2Assignment,
Grades1.Grade AS C1Grade,
Grades2.Grade AS C2Grade
FROM Grades1
LEFT JOIN Grades2 ON Grades1.ID = Grades2.ID;"""
df = pd.read_sql(query, conn)
print(df)
You will notice an issue, we actually end up with four records returned. This is because we have IDs repeated which ends up meaning that for the left join, each of the records when left joined has 2 joined to it for 2X2 = 4 total records. This is a behavior to be aware of, it is generally not what someone wants. The way to correct for this is to join based not only on the ID but also the assignment with an AND statement.
query = """SELECT Grades1.ID,
Grades1.Assignment AS C1Assignment,
Grades2.Assignment AS C2Assignment,
Grades1.Grade AS C1Grade,
Grades2.Grade AS C2Grade
FROM Grades1
LEFT JOIN Grades2 ON Grades1.ID = Grades2.ID AND Grades1.Assignment=Grades2.Assignment;"""
df = pd.read_sql(query, conn)
print(df)
Now you see that the match is 1 to 1. Also, the assignment columns are redundant and can be replaced with just one column.
query = """SELECT Grades1.ID,
Grades1.Assignment AS Assignment,
Grades1.Grade AS C1Grade,
Grades2.Grade AS C2Grade
FROM Grades1
LEFT JOIN Grades2 ON Grades1.ID = Grades2.ID AND Grades1.Assignment=Grades2.Assignment;"""
df = pd.read_sql(query, conn)
print(df)