R & Python (with SciPy / SciKitLearn) are my go-to tools for exploring data and quick mash-ups. Both of these languages provide a rich set of libraries for quick and convenient data analysis. Every now and then, I stumble across a process that yields good insight into the data and I need to be able to reproduce the process. In an ideal world, I would also be able to share this process with others in my team. I have this vision that one day, we might have a data mining dashboard for self-service data exploration that could be used by anybody, regardless of their technical expertise.
But… I work in a Microsoft shop. R and Python are not integrated into our common toolkits, which makes it difficult to share and reproduce analysis across teams. My question is: can we build a shareable, interactive toolkit within the Microsoft framework? What does Microsoft offer to help us reproduce, share and deploy products for data exploration? In this post we begin by looking at Microsoft Excel & the SQL Server Data Mining Add-in.
- Preparing the lab
- Microsoft SQL Server Data Mining Add-in
- Iris data set
- Python (with Scipy, Pandas, Numpy, SciKitLearn)
- Exploration of the Iris Data Set
- Exploratory analysis with Python
- Exploratory analysis with Excel
- Clustering the Iris Data Set
- Python & KMeans
- Excel & SQL Server Data Mining Add-in
- Building a Reproducible Data Product
- Why Excel doesn’t measure up…
Preparing the Lab
For this lab, we need a few things first. Where appropriate, I downloaded and installed the following:
- Microsoft SQL Server Data Mining Add-ins
- The Iris Data set, from the UCI Machine Learning Library
- Python, Scipy, Numpy, Pandas and SciKit Learn. If you need to download all of these, you might be interested in one of the existing distributions such as Anaconda.
Exploration of the Iris Dataset
The Iris dataset is famous within the statistical and machine learning communities and has been widely used since 1936, when Sir Ronald Fisher used it to demonstrate a variety of statistical methods. The data set is made up of 150 experimental observations of Iris flowers, with measurements of Sepal Length, Sepal Width, Petal Length and Petal Width for each flower. Data was collected from three species of iris: Iris Virginica, Iris Versicolor and Iris Setosa. It turns out the data can be modelled by species quite well, making it an excellent data set for teaching the basics of machine learning and statistical learning.
For the sake of this lab, let’s pretend that we don’t know anything about the origins or characteristics of this data. Perhaps you have scraped some data from the web, or extracted some SQL Server monitoring data from your production box and you want to explore the data, find patterns, trends and relationships and use these to come up with a sound data-based action… Let’s see if we can use Python and / or Excel to do this.
Exploratory Analysis in Python
Let’s begin with Python. The most convenient way to explore the data is to load it into a Pandas data frame. Pandas comes with a great array of built-in plotting features that we can leverage. First of all, let’s look at the data:
import numpy as np
import pandas as pd
from sklearn import datasets
import matplotlib.pyplot as plt

plt.style.use('ggplot')  # use a GGPLOT-style of plot

iris = datasets.load_iris()
X, Y = iris.data, iris.target

data = pd.DataFrame(X, columns=['Sepal Length', 'Sepal Width',
                                'Petal Length', 'Petal Width'])
data['Species'] = Y
data.head()
| | Sepal Length | Sepal Width | Petal Length | Petal Width | Species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
With the data in-hand, we can begin to explore it a little, beginning with a simple line plot:
data[['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width']].plot()
Perhaps there are some interesting sub-regions in the data (e.g. there are three distinct ‘regions’ within the Petal Length values)? It looks like there may be some patterns and trends in the data, but we can’t really see much from this line plot. Let’s try something a little more interesting, like Parallel Coordinates.
from pandas.plotting import parallel_coordinates

parallel_coordinates(data, 'Species', colormap='winter', alpha=0.3)
Parallel Coordinates plots are a favourite of mine. Used well, they can help visualise correlations and trends in multi-variate (high dimensional) data. In the plot above we can see that Iris Setosa is clearly separable from the other two species, particularly in the Petal Length and Petal Width dimensions. Based on this plot, it looks like the species are most variable in the petal dimensions. Perhaps we should try a scatter plot of Petal Length vs. Petal Width:
data.plot(kind='scatter', x='Petal Length', y='Petal Width',
          c='Species', colormap='winter')
This simple scatterplot confirms that the species can be reasonably well described by the Petal Length and Petal Width. Certainly, Iris Setosa is separable and we could build a model to distinguish this species from the other two. Iris Virginica and Iris Versicolor are mostly separable, although there is some small overlap. It looks like it would be possible to build a model to distinguish between these two species in the majority of cases.
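As a quick sanity check of that claim (this sketch is mine, not part of the original analysis), a small decision tree trained on the two petal dimensions alone should recover the species in the vast majority of cases:

```python
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
petals = iris.data[:, 2:4]  # Petal Length and Petal Width only

# A shallow tree, to avoid simply memorising the training data
model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(petals, iris.target)
print(model.score(petals, iris.target))  # training accuracy, well above 0.9
```

The depth limit of 3 is an arbitrary choice for illustration; the point is only that two of the four measurements carry most of the signal.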
Exploratory Analysis in Excel
Assuming that we have the data downloaded in a convenient csv format, we should be able to do all of our analysis in Excel. Let’s begin by simply importing the csv file and re-creating the line chart:
This is good: it was very quick and very easy. The default colours need some tweaking perhaps, but the end result is good enough for quick EDA.
For the parallel coordinate plot, I had to resort to Google where I found Jon Peltier’s tutorial. Turns out, it is as simple as “switching the rows and columns” in the design ribbon at the top of the spreadsheet:
This plot is a hot mess! But I am sure you could improve it with some work on the fill colours and the transparency. I played around for about 20 minutes, but didn’t find a simple and easy method to colour by Species. Granted, this is a result of my lack of familiarity with Excel.
Once again, producing the scatterplot was quick and easy. Of the 3 plots in Excel, this plot is reasonably attractive and informative – I would be happy to use this as is.
Clustering the Iris Data
While exploring the data in Python, we had the advantage of being able to colour by Species. I didn’t manage to quickly figure out how to do this in Excel, but perhaps this is a good thing, as it is more representative of the real world. With real-world data, we are less likely to know beforehand how many natural clusters there are in the data. Let’s try clustering the data, beginning by identifying the number of clusters naturally present in the dataset.
Python & KMeans
Commonly, the total distance of each point to its nearest centroid (the inertia) is used to judge the fit of a KMeans model. This distance always shrinks as we add centroids, so rather than simply minimising it, we look for the point where adding further clusters stops paying off. This can be visualised as a plot of “accuracy” against an increasing number of clusters (K):
from sklearn import cluster

# We will plot the accuracy of the clustering (based on the standard
# sum of distances from the closest centroid) against cluster size K,
# ranging K from 1 to 15
accuracy = np.zeros(15)
for K in range(15):
    myclusters = cluster.KMeans(n_clusters=K + 1, random_state=42)
    a = myclusters.fit(X)
    accuracy[K] = a.inertia_  # inertia_ is the sum of the distances of
                              # all points from the nearest centroid

plt.plot(accuracy)
plt.title("Accuracy of KMeans Clustering")
plt.xlabel("Number of centroids (K)")
plt.xticks(range(15), range(1, 16))
From the plot above, we can see that the accuracy stabilises around 3 clusters. Obviously, this fits nicely with what we already know about the number of Species in this data. And if we plot the clusters, we see that it agrees pretty well with our original scatterplot:
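The cluster plot mentioned above can be reproduced with something like the following sketch (the K=3 fit and the colormap choice are mine, not from the original listing):

```python
from sklearn import cluster, datasets
import matplotlib.pyplot as plt

iris = datasets.load_iris()
X = iris.data

# Fit KMeans with the 3 clusters suggested by the elbow in the accuracy plot
km = cluster.KMeans(n_clusters=3, random_state=42)
labels = km.fit_predict(X)

# Re-create the Petal Length vs Petal Width scatter, coloured by cluster
plt.scatter(X[:, 2], X[:, 3], c=labels, cmap='winter', alpha=0.7)
plt.xlabel('Petal Length')
plt.ylabel('Petal Width')
plt.title('KMeans clusters (K=3)')
```

Note that KMeans labels are arbitrary (cluster 0 need not be Setosa), so the colours may be permuted relative to the species scatterplot even though the groupings match.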
Excel & the SQL Server Data Mining Add-in
I won’t run through how to do this in detail, but the data mining add-in is pretty intuitive. If you head to the Data Mining ribbon, you should be able to follow your nose through the Clustering setup GUI. Here are the results with default settings:
Interestingly, if we leave the choice of K up to Excel, then it suggests there are as many as 11 clusters! Clearly, the default settings are overfitting the data. By default, the scalable Expectation Maximisation (EM) algorithm is used. You can read more about this algorithm on Microsoft Research.
If we change the algorithm to a scalable KMeans algorithm, then we get 6 clusters:
This is somewhat more reasonable, but still overfits the data. Finally, if we use the non-scalable KMeans algorithm, we get the expected 3 clusters:
Using the Data Mining Add-in, deriving the clusters is a reasonably intuitive process, though some trial and error was necessary to arrive at the correct number of clusters. This suggests that additional work would be needed to assess and refine the accuracy of any model created through the Data Mining Add-in.
If you poke around the other tabs in the model output, you will find details about how the clusters are defined, the range of values which define each cluster and their correlation. During my brief tinkering, I didn’t figure out how to access the model from the spreadsheet, so I am not entirely sure that you can auto-magically use these values without manually entering them into formulas. Though I am sure there must be some programmatic interface to the model…
Building an Exploratory Data Product for the Team
The whole point of this experiment has been to take a tool that I use every day and find a way to deploy this tool for the rest of the team to use. I love the simplicity and convenience of Python, but currently Python isn’t part of the Team’s tech stack. So to make this available to everyone, we need to find a solution that is common to us all and sits inside our supported tech stack. The big question is… is Excel the answer?
Clearly Excel (with the SQL Server Data Mining add-on) works! It took me a while to reproduce even this simple analysis, and when it came to clustering a little bit of trial-and-error was necessary. With more work, I am sure the visualisations could be tidied up and experience might suggest one-or-other of the clustering algorithms as being more suitable in the majority of cases.
If it works, then why not use it? Why Excel doesn’t measure up
Before I talk about what I didn’t like, let me quickly mention what I actually liked about Excel. It was quick and easy to produce the line chart and scatterplot. The scatterplot was particularly effective and I would be happy to use it as is. There is also the ability to integrate Excel with the full Microsoft stack (Office, PowerView, PowerBI, SQL Server…), which is massively appealing. But there were a couple of things I didn’t like about Excel:
#1: Excess information / complexity. This might sound a little trivial, but the raw data alone doesn’t make a lot of sense – that’s why we use visualisations to present the data in a more human-friendly fashion. So if the raw data isn’t particularly informative, I don’t want to see it. I didn’t like that I had to interact with the raw data both manually and visually in the spreadsheet.
#2: Data Mining Add-in. The Data Mining Add-in offers a great “click-and-go” interface, but I wonder how easily I will be able to manipulate and use the results. For example, can I produce a new column with the predicted species? I am sure you can, and I am sure I could work it out with just a little more effort… but is it worth it?
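For contrast, attaching predicted cluster labels as a new column is a one-liner in Python (a sketch using the same KMeans fit as earlier; the column name is my own choice):

```python
import pandas as pd
from sklearn import cluster, datasets

iris = datasets.load_iris()
data = pd.DataFrame(iris.data, columns=['Sepal Length', 'Sepal Width',
                                        'Petal Length', 'Petal Width'])

# Fit KMeans and append the predicted cluster for each row as a new column
km = cluster.KMeans(n_clusters=3, random_state=42)
data['Predicted Cluster'] = km.fit_predict(iris.data)
data.head()
```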
#3: Reproducibility. This, for me, is the biggest goal – the ability to reproduce analysis across the whole team in a convenient and consistent manner. Ideally, anyone should be able to load their data and reproduce these results with a minimal amount of interaction and expertise. I am not convinced that Excel provides a suitable interface for automation and reproducibility.
Could we do it in Excel? Absolutely. But to build an intuitive data exploration tool for non-experts we would need to customise Excel with macros and all sorts. Ultimately, I think the development and maintenance costs are too high.