Course Outline

Introduction (1 lecture)
    Lecture 1.1
Getting the Data (3 lectures)
    Lecture 2.1
    Lecture 2.2
    Lecture 2.3
SP500 Webscrape (4 lectures)
    Lecture 3.1
    Lecture 3.2
    Lecture 3.3
    Lecture 3.4
Full Dataset (2 lectures)
    Lecture 4.1
    Lecture 4.2
Regressions (5 lectures)
    Lecture 5.1
    Lecture 5.2
    Lecture 5.3
    Lecture 5.4
    Lecture 5.5
Machine Learning (5 lectures)
    Lecture 6.1
    Lecture 6.2
    Lecture 6.3
    Lecture 6.4
    Lecture 6.5
Machine Learning Function (2 lectures)
    Lecture 7.1
    Lecture 7.2
Visualize Data (2 lectures)
    Lecture 8.1
    Lecture 8.2
Web Scraping
Let’s import the two libraries we are going to use.
from lxml import html
import requests
We are going to get the list of stocks in the S&P 500 from a Wikipedia page. The first thing we need to do is download the page.
page = requests.get('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
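Before parsing, it is worth confirming the download actually succeeded. A minimal sketch (assuming the network request goes through) using the attributes the requests Response object provides:

```python
import requests

page = requests.get('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
page.raise_for_status()      # raises requests.HTTPError on a 4xx/5xx response
print(page.status_code)      # the HTTP status code of the response
```

If the request failed outright (DNS error, timeout), requests raises an exception before you ever reach the parsing step, so wrapping the call in a try/except is another common pattern.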
requests gets us the page, but not in a form we can query. We use html from lxml to parse it into an element tree.
tree = html.fromstring(page.content)
We also need to get the specific table. To do this, we find the table by its XPath, which you can get by inspecting the Wikipedia page. In Chrome, right-click the table and choose Inspect; in the pane that opens, right-click the highlighted table element, then select Copy -> Copy XPath.
table = tree.xpath('//*[@id="mw-content-text"]/div/table[1]')
The xpath() method always returns a list, even for a single match. Ours only has one element, so we'll set table equal to that element.
print(table)
table = table[0]
print(table)
Our next move is to get all of the rows out of the table. findall("tr") returns a list of every matching child element.
rows = table.findall("tr")
print(rows)
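Each row element still has to be unpacked into its cells before the data is usable. The sketch below shows the idea on a tiny stand-in table so it runs offline; the real page has the same shape, with the ticker symbol in the first cell of each data row:

```python
from lxml import html

# A tiny stand-in for the Wikipedia table so the sketch runs offline;
# on the real page, the rows come from table.findall("tr") as above.
snippet = """
<table>
  <tr><th>Symbol</th><th>Security</th></tr>
  <tr><td>MMM</td><td>3M</td></tr>
  <tr><td>AOS</td><td>A. O. Smith</td></tr>
</table>
"""

table = html.fromstring(snippet)
rows = table.findall(".//tr")

tickers = []
for row in rows[1:]:                      # skip the header row
    cells = row.findall("td")
    tickers.append(cells[0].text_content().strip())

print(tickers)  # ['MMM', 'AOS']
```

The same loop applied to the rows from the live page yields the full list of S&P 500 ticker symbols.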