-
Basic Functionality 6
-
Lecture1.1
-
Lecture1.2
-
Lecture1.3
-
Lecture1.4
-
Lecture1.5
-
Lecture1.6
-
-
Pandas & Indices 4
-
Lecture2.1
-
Lecture2.2
-
Lecture2.3
-
Lecture2.4
-
-
Joins, Unions & Order By 3
-
Lecture3.1
-
Lecture3.2
-
Lecture3.3
-
-
Group By & Views 3
-
Lecture4.1
-
Lecture4.2
-
Lecture4.3
-
Subqueries
Subqueries are queries inside of another query. By utilizing this method, you can build out more complex logic and achieve more detailed analysis. The idea here is that in one query when you usually say FROM table, you instead say FROM followed by, in parantheses, another query. Below shows first a regular query, then a second query which pulls from that prior one in parantheses. The first query finds the average sector return, the second query takes the result from running the first query then selects just the sector column but only for sectors which have a positive sector return.
#Queries can be put inside each other
#Query 1: Get average sector returns
query = "SELECT Sector, AVG(Return) as Return FROM Returns GROUP BY Sector"
df = pd.read_sql(query, conn)
print(df)
print()
print()
#Query 2: Pull from the first query in parantheses and choose the sector wherever
#the average sector return is greater than 0
query = "SELECT Sector FROM (SELECT Sector, AVG(Return) as Return FROM Returns GROUP BY Sector) WHERE Return > 0"
df = pd.read_sql(query, conn)
print(df)
An easier way to create the second query is to first define the average sector return query, then substitute in this query in the second query like so:
#This is an easier way to visualize, the first query being put into the second one at the parantheses
query1 = "SELECT Sector, AVG(Return) as Return FROM Returns GROUP BY Sector"
query2 = "SELECT Sector FROM ({}) WHERE Return > 0".format(query1)
df = pd.read_sql(query2, conn)
print(df)
This example illustrates how we can use subqueries, but in this specific case we could also use HAVING. What HAVING is used for is to define logic for filtering after an aggregation is done. HAVING will not effect the individual records, just the table created from the GROUP BY. For example, the code presented below will filter out any sectors that have a negative average return.
#We are also able to replicate the same concept using HAVING
#HAVING will deal with conditions after aggregation, so the following looks for where the grouped sector return
#average is positive
query2 = "SELECT Sector FROM Returns GROUP BY Sector HAVING AVG(Return) > 0"
df = pd.read_sql(query2, conn)
print(df)
If we use a WHERE statement before we say GROUP BY, we end up filtering individual records instead. For example, if we wanted to count the number positive return records for each sector we can add WHERE Return>0 before our GROUP BY like the following query:
#Look for the number of dates which are positive, by using WHERE before the GROUP BY
#we are applying it to the individual records
query3 = "SELECT Sector, COUNT(Return) as Positive FROM Returns WHERE Return>0 GROUP BY Sector"
df = pd.read_sql(query3, conn)
print(df)
Let’s bring both query2 and query3 together in an example. Let’s pull the number of positive days of return, but then filter out any sectors which do not have a positive average return. When we use IN, we are actually allowed to use the results of a separate query instead of just constant values. So we are essentially saying IN followed by the results of query2 which returns the four sectors that on average are positive.
#Combining query2 and query3 would give us the number of positive days but filters out any sectors that
#have a negative average return
query4 = "SELECT Sector, Positive FROM ({}) WHERE Sector IN ({})".format(query3, query2)
df = pd.read_sql(query4, conn)
print(df)
If you want to see what the full subquery looks like, run the following. You’ll notice it is way easier to make queries separately then combine then with python. Writing this query all in one shot would be pretty difficult with keeping track of parantheses and where one query ends and another begins.
#You can see what the words actually look like
print(query4)
One last hepful thing is BETWEEN. Using BETWEEN, we find all values between two values by saying BETWEEN A AND B where A and B are replaced with starting and ending values. The below query gives us all the average return on each date for dates between 1/1/2019 and 1/3/2019.
#BETWEEN can be used to limit values to a specific range
query = "SELECT Date, AVG(Return) as Return FROM Returns WHERE Date BETWEEN '2019-01-01 00:00:00' AND '2019-01-03 00:00:00' GROUP BY Date"
df = pd.read_sql(query, conn)
print(df)
print()
print()