Course Outline

Introduction (1 lecture)
    Lecture 1.1
Getting the Data (3 lectures)
    Lecture 2.1
    Lecture 2.2
    Lecture 2.3
SP500 Webscrape (4 lectures)
    Lecture 3.1
    Lecture 3.2
    Lecture 3.3
    Lecture 3.4
Full Dataset (2 lectures)
    Lecture 4.1
    Lecture 4.2
Regressions (5 lectures)
    Lecture 5.1
    Lecture 5.2
    Lecture 5.3
    Lecture 5.4
    Lecture 5.5
Machine Learning (5 lectures)
    Lecture 6.1
    Lecture 6.2
    Lecture 6.3
    Lecture 6.4
    Lecture 6.5
Machine Learning Function (2 lectures)
    Lecture 7.1
    Lecture 7.2
Visualize Data (2 lectures)
    Lecture 8.1
    Lecture 8.2
Web Scraping
Let’s import the two libraries we are going to use.
from lxml import html
import requests
We are going to get the list of stocks in the S&P 500 from a Wikipedia page. The first thing we need to do is download the page.
page = requests.get('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
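Before parsing, it is worth confirming the download actually succeeded. A minimal sketch (assuming the network request goes through) using the attributes the requests Response object provides:

```python
import requests

page = requests.get('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
page.raise_for_status()      # raises requests.HTTPError on a 4xx/5xx response
print(page.status_code)      # the HTTP status code of the response
```

If the request failed outright (DNS error, timeout), requests raises an exception before you ever reach the parsing step, so wrapping the call in a try/except is another common pattern.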
requests gets us the page, but not in a form we can query. We use html from lxml to parse it into an element tree.
tree = html.fromstring(page.content)
We also need to get the specific table. To do this, we find the table by its XPath, which you can get by inspecting the Wikipedia page. In Chrome, right-click the table and choose Inspect; in the pane that opens, right-click the highlighted table element, then select Copy -> Copy XPath.
table = tree.xpath('//*[@id="mw-content-text"]/div/table[1]')
The xpath() method always returns a list, even for a single match. Ours only has one element, so we'll set table equal to that element.
print(table)
table = table[0]
print(table)
Our next move is to get all of the rows out of the table. findall("tr") returns a list of every matching child element.
rows = table.findall("tr")
print(rows)
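Each row element still has to be unpacked into its cells before the data is usable. The sketch below shows the idea on a tiny stand-in table so it runs offline; the real page has the same shape, with the ticker symbol in the first cell of each data row:

```python
from lxml import html

# A tiny stand-in for the Wikipedia table so the sketch runs offline;
# on the real page, the rows come from table.findall("tr") as above.
snippet = """
<table>
  <tr><th>Symbol</th><th>Security</th></tr>
  <tr><td>MMM</td><td>3M</td></tr>
  <tr><td>AOS</td><td>A. O. Smith</td></tr>
</table>
"""

table = html.fromstring(snippet)
rows = table.findall(".//tr")

tickers = []
for row in rows[1:]:                      # skip the header row
    cells = row.findall("td")
    tickers.append(cells[0].text_content().strip())

print(tickers)  # ['MMM', 'AOS']
```

The same loop applied to the rows from the live page yields the full list of S&P 500 ticker symbols.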