Course Outline
Introduction (1 lecture)
Getting the Data (3 lectures)
SP500 Webscrape (4 lectures)
Full Dataset (2 lectures)
Regressions (5 lectures)
Machine Learning (5 lectures)
Machine Learning Function (2 lectures)
Visualize Data (2 lectures)
The Full Dataset
Solution
Note that DataFrame.from_csv() has been removed from pandas; read_csv() with index_col and parse_dates does the same job.
import pandas as pd
variables = pd.read_csv("Variables.csv", encoding="UTF-8", index_col=0, parse_dates=True)
variables.columns = ["Index", "Oil", "Gold", "NaturalGas"]
We are also going to bring in the array we created and saved as a CSV. The difference here is that we open the file in "rt" (read text) mode, and we use the reader object instead of the writer object.
import csv
with open("stocksArray.csv", "rt") as f:
    reader = csv.reader(f)
    for row in reader:
        stocksArray = row
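Because the file holds a single row of tickers, the loop leaves stocksArray bound to that one row as a list of strings. Here is a minimal sketch with an in-memory file standing in for stocksArray.csv (the ticker symbols are just illustrative):

```python
import csv
import io

# Simulate a one-row CSV of tickers (a stand-in for stocksArray.csv).
fake_file = io.StringIO("AAPL,MSFT,GOOG")
reader = csv.reader(fake_file)
for row in reader:
    stocksArray = row
print(stocksArray)  # ['AAPL', 'MSFT', 'GOOG']
```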
For the large amount of data we need, we can’t use Yahoo Finance. In the past there was an API that allowed this, but Yahoo discontinued it. Instead, we are going to use Quandl. You’ll need to sign up on their website (it’s free for what we are going to do), but once you have, you’ll be able to pull data on the stocks we want to analyze. First, go to
Quandl.com
. After that, activate your API key as below, replacing 'xyz' with your own code.
import quandl
quandl.ApiConfig.api_key = 'xyz'
Now, let’s get our data. We pass “WIKI/PRICES” as our dataset, along with a dictionary for the date argument, where ‘gte’ (greater than or equal) gives the start date and ‘lte’ (less than or equal) gives the end date. We also pass our stock tickers, set paginate to True so we can retrieve enough rows, and in qopts specify which columns we want.
start = "2010-01-01"
end = "2017-01-01"
df1 = quandl.get_table('WIKI/PRICES', date={'gte': start, 'lte': end},
                       ticker=stocksArray[:200], paginate=True,
                       qopts={"columns": ["adj_close", "date", "ticker"]})
df2 = quandl.get_table('WIKI/PRICES', date={'gte': start, 'lte': end},
                       ticker=stocksArray[200:], paginate=True,
                       qopts={"columns": ["adj_close", "date", "ticker"]})
df = pd.concat([df1,df2])
df
We need two separate calls because 500 stocks is too many tickers for a single query. So instead we call the function on the first 200 tickers, and then on the rest. You’ll notice that the data comes back in long format, one row per date-ticker pair, which does not look like our other dataset. We will use the pivot_table() function, which works like an Excel pivot table.
df = df.pivot_table(index="date",columns="ticker")
df
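To see what pivot_table() is doing, here is a minimal sketch on a made-up long-format frame (the tickers and prices are purely illustrative):

```python
import pandas as pd

# A toy long-format frame like the Quandl result: one row per (date, ticker).
long_df = pd.DataFrame({
    "date": ["2017-01-03", "2017-01-03", "2017-01-04", "2017-01-04"],
    "ticker": ["AAPL", "MSFT", "AAPL", "MSFT"],
    "adj_close": [116.15, 62.58, 116.02, 62.30],
})

# pivot_table() turns it wide: one row per date, one column per ticker,
# with the value column ("adj_close") as the top level of a column MultiIndex.
wide = long_df.pivot_table(index="date", columns="ticker")
print(wide.shape)  # (2, 2): two dates by two tickers
```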
Doing this gets our data into a similar format. We also want to get rid of the multi-index on our columns; take a look:
df.columns
To get rid of this, we can drop a level. Also we will get the percentage change.
df.columns = df.columns.droplevel()
df = df.pct_change()
print(df)
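As a quick sanity check on what pct_change() computes, here is a tiny sketch on made-up prices:

```python
import pandas as pd

prices = pd.Series([100.0, 110.0, 99.0])
returns = prices.pct_change()
# The first value is NaN (there is no earlier price to compare against);
# then (110 - 100) / 100 = 0.10, and (99 - 110) / 110 = -0.10.
print(returns)
```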
Finally, concatenate the variables DataFrame onto our stock data and save the result.
df = pd.concat([df,variables],axis=1)
df.to_csv("StockData.csv", encoding="UTF-8")
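To illustrate what concat with axis=1 does, here is a minimal sketch on made-up values (the column names and numbers are purely illustrative):

```python
import pandas as pd

# Toy frames sharing a date index.
stocks = pd.DataFrame({"AAPL": [0.01, -0.02]},
                      index=["2017-01-03", "2017-01-04"])
macro = pd.DataFrame({"Oil": [52.3, 53.1]},
                     index=["2017-01-03", "2017-01-04"])

# axis=1 lines the frames up on their shared index and
# places the columns side by side.
combined = pd.concat([stocks, macro], axis=1)
print(list(combined.columns))  # ['AAPL', 'Oil']
```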
Source Code