Primary Key
Indexes are an important tool for making database queries run faster. When we index a field, the database can much more quickly sort on it or evaluate conditions against it, such as pulling all records above a given value.
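For instance, a plain (non-primary-key) index can be added to an existing SQLite table with CREATE INDEX. A minimal sketch, using an in-memory database and a hypothetical prices table rather than our Curves data:

```python
import sqlite3

# Hypothetical in-memory table, just to illustrate CREATE INDEX
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE prices (Date TEXT, Price REAL);")
cur.executemany("INSERT INTO prices VALUES (?, ?);",
                [("2019-01-01", 1.0), ("2019-06-01", 2.0)])

# Add a secondary index on Date so range filters can use it
cur.execute("CREATE INDEX idx_prices_date ON prices (Date);")

# The index now shows up in SQLite's schema catalog
indexes = cur.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index';"
).fetchall()
print(indexes)  # [('idx_prices_date',)]
```

Unlike a primary key, an index like this can be added or dropped at any time after the table exists.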
The primary key is a unique index that cannot be null, used to identify specific records. In SQLite, the primary key must be set when the table is created rather than added by altering the table afterward. To re-create the table, let’s first inspect the column types of the table we created with pandas. PRAGMA table_info returns a table’s schema information.
#We can get table details if we do the following
cur = conn.cursor()
query = 'PRAGMA table_info("Curves")'
print(cur.execute(query).fetchall())
We can trim this output down to just the name and type of each column.
#Get the types and column names from this
query = 'PRAGMA table_info("Curves")'
table_info = pd.DataFrame(cur.execute(query).fetchall())[[1,2]]
table_info.columns = ['Name', 'Type']
print(table_info)
From here we will automate building the table creation query. First, produce the pattern of variable name, followed by a space, followed by the variable type. We need this so that we can create the table from scratch with the primary key added on.
print(table_info['Name'] +" " +table_info['Type'])
Now combine the variables using a comma and new line as the joining character.
#Combine the variables by joining a comma and new line
variables = ",\n".join((table_info['Name'] +" " +table_info['Type']).values)
print(variables)
Now substitute in the variables for the table creation query.
#Put the table query together
table_create_query = """CREATE TABLE Curves(
{}
);""".format(variables)
print(table_create_query)
Let’s see how we can re-create the table after dropping it with this query, then push in actual data by using to_sql with append.
#Drop the table and push the data
query = "DROP TABLE Curves;"
cur.execute(query)
cur.execute(table_create_query)
data.to_sql("Curves", conn, if_exists='append')
Now, to demonstrate that an index improves speed, first we time running a query 100 times without the primary key.
import time
#Time running a query 100 times
run_times = []
for _ in range(100):
    start_time = time.time()
    query = 'SELECT * FROM Curves WHERE Date > "2019-05-01";'
    df = pd.read_sql(query, conn)
    end_time = time.time()
    run_times.append(end_time - start_time)
print(sum(run_times))
Now, let’s learn how to actually add a primary key. Directly after declaring a variable’s name and type, we add PRIMARY KEY. To make the Date column the primary key, the following will work:
#Update the table creation query to add a primary key
table_create_query = table_create_query.replace("Date TIMESTAMP", "Date TIMESTAMP PRIMARY KEY")
print(table_create_query)
Now that the query is changed, re-create the table.
#Re-create the table
query = "DROP TABLE Curves;"
cur.execute(query)
cur.execute(table_create_query)
data.to_sql("Curves", conn, if_exists='append')
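To confirm the key took effect, PRAGMA table_info also reports a pk column that is 1 for the primary-key field. A quick sketch on a stand-alone in-memory table (the Price column here is illustrative):

```python
import sqlite3

# Stand-alone sketch: re-creating the situation with an in-memory table
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Curves (Date TIMESTAMP PRIMARY KEY, Price REAL);")

# PRAGMA table_info columns: cid, name, type, notnull, dflt_value, pk
# pk (index 5) is 1 when the column is the primary key
info = cur.execute('PRAGMA table_info("Curves")').fetchall()
pk_columns = [row[1] for row in info if row[5] == 1]
print(pk_columns)  # ['Date']
```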
With the index added on, you should see a faster total time when we run the same 100 loops.
#Test running again
run_times = []
for _ in range(100):
    start_time = time.time()
    query = 'SELECT * FROM Curves WHERE Date > "2019-05-01";'
    df = pd.read_sql(query, conn)
    end_time = time.time()
    run_times.append(end_time - start_time)
print(sum(run_times))
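The same before-and-after comparison can be reproduced end to end on synthetic data. A sketch assuming nothing beyond the standard library; the demo table, column names, and row values are all made up for illustration:

```python
import sqlite3
import time

def time_query(conn, query, n=100):
    """Total wall time for running the query n times."""
    total = 0.0
    for _ in range(n):
        start = time.time()
        conn.execute(query).fetchall()
        total += time.time() - start
    return total

query = "SELECT * FROM demo WHERE Date > '2019-05-01';"
# Unique synthetic dates so Date can serve as a primary key
rows = [(f"2019-{m:02d}-{d:02d}", float(m * d))
        for m in range(1, 13) for d in range(1, 29)]

# Table without a primary key
plain = sqlite3.connect(":memory:")
plain.execute("CREATE TABLE demo (Date TIMESTAMP, Price REAL);")
plain.executemany("INSERT INTO demo VALUES (?, ?);", rows)
t_plain = time_query(plain, query)

# Same data, but with Date declared as the primary key
keyed = sqlite3.connect(":memory:")
keyed.execute("CREATE TABLE demo (Date TIMESTAMP PRIMARY KEY, Price REAL);")
keyed.executemany("INSERT INTO demo VALUES (?, ?);", rows)
t_keyed = time_query(keyed, query)

print(t_plain, t_keyed)
```

On a toy table this small the two timings may be close; the gap in favor of the keyed table grows with row count.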