Web Scraping the Australian Open

The 2015 Australian Open was a fantastic event, and every shot, game, set and match was recorded by the IBM SlamTracker. IBM have been tracking match statistics at the major Grand Slams for the past eight years and have gathered approximately 40 million data points. Through the analysis of patterns, trends and structure, IBM are attempting to recognise how certain players play, how styles change over time and what it takes to be successful on the world stage.

“identify key trends and patterns that help a tennis player be successful” (David Provan, IBM)[1]

I am doing my own mini-analysis using the data available from the 2015 Australian Open’s website. I can only dream of having the kind of database that IBM must be sitting on, but perhaps in a small way we can explore the game of tennis for ourselves. In this first part we will scrape the available data from the web.

Show me the data!

There are two main pages that we are interested in on the AO website. The first is an index of all matches played on a given day; the second is the actual statistics page for a given match.

I hacked together a small Python script to scrape this data using the requests and pattern libraries. All the relevant code can be found on GitHub in the AOStats repository. Briefly, it is a very simple routine that runs through the days of the tournament one by one. For each day, it compiles a list of matches played, reads the statistics from the website and saves everything into a dictionary.
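The real code lives in the AOStats repository, but a minimal sketch of the same idea might look like the following. Note that the URL scheme and the CSS class names (match-link, stat-row) are hypothetical placeholders, not Tennis Australia's actual markup:

```python
import requests
from pattern.web import DOM, plaintext

BASE = "http://www.ausopen.com"  # hypothetical: not the site's real URL scheme

def matches_for_day(day):
    """Return the links to every match listed on one day's index page."""
    # Both the day-index URL and the 'match-link' class are invented
    # for illustration; the real site's markup differs.
    html = requests.get("{}/day{}.html".format(BASE, day)).text
    dom = DOM(html)
    return [a.attributes["href"] for a in dom.by_class("match-link")]

def stats_for_match(link):
    """Scrape one match's statistics table into a dict."""
    dom = DOM(requests.get(BASE + link).text)
    stats = {}
    for row in dom.by_class("stat-row"):  # hypothetical class name
        cells = [plaintext(td.content) for td in row.by_tag("td")]
        if len(cells) == 3:  # player 1 value, stat name, player 2 value
            stats[cells[1]] = (cells[0], cells[2])
    return stats

results = {}
for day in range(1, 15):  # the tournament runs two weeks
    for link in matches_for_day(day):
        results[link] = stats_for_match(link)
```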

Life’s Little Challenges…

By far the most interesting part of this little project was figuring out how to parse the HTML and extract the values of interest. If nothing else, I spent a lot of time poring over Tennis Australia's HTML and got comfortable with how they built their site.

Then there was the challenge of pulling out the right puzzle pieces. Initially this was a nightmare: trawling through pages of HTML, repeating search operations, losing my place and drowning myself in coffee. Then a friend taught me a new trick: the "Inspect Element" menu option in Chrome.

I was able to right-click the page element I wanted to grab, click "Inspect Element" and, bam, a sidebar opened up with the source HTML highlighted in exactly the right spot. What a lifesaver!

In terms of actually scraping the data, I made every effort to be a responsible robot. I wrote a function, memoize_html(), to save the raw HTML so I could return to it without issuing another GET request. As part of this function, I added a ten-second delay after each GET request to ensure I wasn't overloading Tennis Australia's servers.
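The actual implementation is in the AOStats repo; a minimal sketch of such a function, assuming a simple on-disk cache (the CACHE_DIR name is a placeholder), might look like this:

```python
import hashlib
import os
import time
import requests

CACHE_DIR = "html_cache"  # hypothetical cache location

def memoize_html(url):
    """Fetch a URL's HTML at most once, pausing 10 s after each real request."""
    if not os.path.isdir(CACHE_DIR):
        os.makedirs(CACHE_DIR)
    # Cache each page under a hash of its URL.
    path = os.path.join(CACHE_DIR, hashlib.md5(url.encode()).hexdigest() + ".html")
    if os.path.exists(path):
        with open(path) as f:
            return f.read()  # cached: no request, no delay
    html = requests.get(url).text
    with open(path, "w") as f:
        f.write(html)
    time.sleep(10)  # be polite to Tennis Australia's servers
    return html
```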

With everything in place and tested, I let the bot run. All up, it took about two hours to scrape the data, at which point I dumped it into a JSON file and shut everything down.
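The final dump is a one-liner; here "results" is the dictionary built up by the scraping loop sketched earlier, and the filename is a placeholder:

```python
import json

with open("ao2015_stats.json", "w") as f:
    json.dump(results, f, indent=2)
```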

For my first attempt at web scraping it wasn't too bad, and I learned some great tricks along the way. But the real work is yet to begin. Even a quick look over the website makes it clear there will be some real challenges in the data cleansing (parsing the text into usable statistics) and in dealing with missing values. But those are problems for another day.

References
[1] Provan, D. (2013) SlamTracker Explained. IBM. http://wimbledoninsights.netcommunities.com/ibm-wimbledon/slamtracker-explained/.
