The 2015 Australian Open was a fantastic event! From the Australian perspective, it was encouraging to see so many Australian players go deep into the tournament. And Djokovic was a class act: he appeared to be in complete control, dug deep through the critical moments and stepped up his aggression to close out the tournament. But the best thing was that every shot, game, set and match was recorded by the IBM SlamTracker.
In this series of posts we build on the IBM SlamTracker and explore the data with some basic data analysis and data mining using Python and AzureML. As we go, the code will be publicly available via IPython notebooks on GitHub: https://github.com/nickb-/AOStats .
PART 1: Data Scraping
The Australian Open match statistics are publicly available via the AO Website: http://www.ausopen.com/en_AU/scores/index.html
IBM’s SlamTracker displays a range of summary statistics for each match, along with the keys to the match. In this first part, we scrape the match statistics from the AO website using Python (requests, pattern.web).
PART 2: Data Cleaning & Exploratory Data Analysis (EDA)
In Part 1 we scraped the match statistics from the AO website. However, we made no effort to clean the raw data. In this part we clean the raw data and populate a Pandas DataFrame for further analysis.
To do this effectively we need to normalise the data, parsing aggregate statistics (such as 1st Serves In) into useful atomic measures. For example:
| 1st Serves In | “81/136 (60 %)” |
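As a rough illustration of what "atomic measures" could look like, here is a minimal sketch that splits a raw string such as “81/136 (60 %)” into three integers. The function name `parse_aggregate_stat` and the keys `made`, `attempted` and `reported_pct` are our own placeholders for this example, not names from the notebooks, and we are assuming every aggregate statistic follows the same "x/y (z %)" layout.

```python
import re

# Assumed layout of an aggregate statistic: "made/attempted (pct %)".
AGGREGATE_STAT = re.compile(r"^\s*(\d+)\s*/\s*(\d+)\s*\(\s*(\d+)\s*%\s*\)\s*$")


def parse_aggregate_stat(raw):
    """Split a raw aggregate statistic into atomic integer measures.

    Hypothetical helper for illustration only; the key names are assumptions.
    """
    # Drop surrounding whitespace and any straight or curly quotes.
    cleaned = raw.strip().strip("\"\u201c\u201d")
    match = AGGREGATE_STAT.match(cleaned)
    if match is None:
        raise ValueError("Unrecognised aggregate statistic: %r" % raw)
    made, attempted, reported_pct = (int(group) for group in match.groups())
    return {"made": made, "attempted": attempted, "reported_pct": reported_pct}


first_serves_in = parse_aggregate_stat("81/136 (60 %)")
print(first_serves_in)
# {'made': 81, 'attempted': 136, 'reported_pct': 60}
```

Keeping the two counts as well as the reported percentage means we can recompute the ratio ourselves later, rather than relying on the rounded value shown on the website.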
Once the data has been normalised, we will use Python’s Pandas library to store the statistics and perform an initial EDA.
PART 3: Data Mining
Hopefully, the EDA from Part 2 will have exposed some interesting trends, patterns or questions for further analysis. In Part 3 we will explore the data further using AzureML.
While we can’t really define this part until after Part 2, we will hopefully be able to identify styles of play, or typical measures that separate players who go deep into the Australian Open from those who lose in the early rounds.
Future Work & General Hypothesis
Long term, it would be very interesting to collect similar statistics across a large number of professional tennis tournaments with a view to being able to predict or forecast a player’s success.
- We would expect players such as Djokovic and Nadal to show quite dominant statistical patterns.
- Players like Sam Stosur (whose results are highly inconsistent, and therefore statistically very interesting) should be worth watching.
- We imagine that young players, like Nick Kyrgios and Borna Coric, would be particularly interesting to watch and probably difficult to forecast.