The longer I work as a data analyst, the more I appreciate screen scraping, especially in cases where I'll need to pull the same data more than once.
For my annual March Madness dashboard, I like to pull ESPN's pick frequency data, the best source we have for detecting where a herd mentality may be forming in bracket picks. The data is formatted in a simple HTML table.
To figure out what the page will look like to my Python scraper, I use Chrome's Inspect Element tool to analyze the structure of the table and get key information like the table's class, which in this case is "wpw-table zebra."
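As a quick sanity check that the class lookup works, BeautifulSoup can find the table by that class string. This sketch uses a made-up HTML fragment standing in for the live ESPN page; the tag structure is simplified:

```python
from bs4 import BeautifulSoup

# Invented fragment mimicking the page; only the class names match the real table
html = """
<table class="wpw-table zebra">
  <tbody>
    <tr><td><span class="teamName">Gonzaga</span></td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# Passing the full attribute string matches the table's exact class value
table = soup.find("table", class_="wpw-table zebra")
print(table is not None)
print(table.find("span", class_="teamName").string)
```

If `find` returns `None`, the class name was mistyped or the page structure changed, so checking for that early saves debugging time later.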
Then I build a nested pair of for loops to iterate through the rows and columns. I also include an index on the inner loop, which counts up with each iteration to indicate which round I'm scraping.
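The loop structure can be sketched against a toy version of the table before pointing it at the real page. The markup and percentages below are invented; only the span class names follow the pattern described above:

```python
from bs4 import BeautifulSoup

# Toy markup mimicking the wpw-table layout: one <td> per round
html = """
<table class="wpw-table zebra"><tbody>
  <tr>
    <td><span class="seed">1</span><span class="teamName">Villanova</span><span class="percentage">97.1%</span></td>
    <td><span class="seed">1</span><span class="teamName">Villanova</span><span class="percentage">88.4%</span></td>
  </tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find("table", class_="wpw-table zebra").tbody.find_all("tr")

records = []
for row in rows:
    # enumerate yields idx = 0, 1, ... so each column maps to a round
    for idx, td in enumerate(row.find_all("td")):
        records.append((idx,
                        td.find("span", class_="teamName").string,
                        td.find("span", class_="percentage").string))

print(records)
```

Each record carries the round index alongside the cell's values, which is exactly what the full script writes out per CSV row.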
Here's the completed script. It took minutes to write and seconds to run, and now we can rerun it anytime to get the latest data as a CSV file.
from bs4 import BeautifulSoup
import requests
from datetime import datetime

f = open('WhoPickedWhom.csv', 'w')
f.write('dataset,round,team,seed,percent,datetime' + '\n')
dt = str(datetime.now())

print('getting Who Picked Whom data...')
url = "http://games.espn.com/tournament-challenge-bracket/2017/en/whopickedwhom"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html5lib')
rows = soup.find('table', class_="wpw-table zebra").tbody.find_all('tr')

for row in rows:
    # enumerate the cells so idx tracks which round this column represents;
    # iterating over the <td> tags (not the raw row) skips whitespace text nodes
    for idx, td in enumerate(row.find_all('td')):
        rnd = str(idx)  # avoid shadowing the built-in round()
        seed = td.find('span', class_='seed').string
        team = td.find('span', class_='teamName').string
        pct = td.find('span', class_='percentage').string
        f.write('bracketchallenge,' + rnd + ',' + team + ',' + seed + ',' + pct + ',' + dt + '\n')

f.close()
Once we have the data, we can do things like plot the pick frequency at ESPN against the win probabilities computed at FiveThirtyEight. We don't even need a screen scraper for the FiveThirtyEight data, since FiveThirtyEight provides it as a CSV.
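Joining the two sources comes down to matching on team name. A minimal sketch with the standard-library csv module, using invented stand-in data and hypothetical column names (`percent`, `win_prob`) since the real files differ:

```python
import csv
import io

# Made-up stand-ins for the two CSV files; real columns and values differ
espn_csv = """team,percent
Villanova,34.1
Gonzaga,11.2
"""
f38_csv = """team,win_prob
Villanova,0.15
Gonzaga,0.19
"""

espn = {r["team"]: float(r["percent"]) for r in csv.DictReader(io.StringIO(espn_csv))}
f38 = {r["team"]: float(r["win_prob"]) for r in csv.DictReader(io.StringIO(f38_csv))}

# Join on team name; both values expressed as percentages.
# A large positive gap (picked far more often than the model says it should win)
# is one signal of herd behavior.
joined = {t: (espn[t], f38[t] * 100) for t in espn if t in f38}
for team, (pick_pct, win_pct) in joined.items():
    print(team, round(pick_pct - win_pct, 1))
```

For the actual files, `io.StringIO` would simply be replaced with `open(...)` on the downloaded CSVs.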
Click for the interactive version on Tableau Public.