Coupling R, Machine Learning & SQL Server DMVs

Machine learning is the hottest thing in analytics right now, made famous by KMart’s Pregnancy Predictor, Netflix & Amazon recommender systems, Nate Silver’s election predictions and so many other famous examples. However for many of us, Machine Learning is still a dark art and it isn’t entirely clear how we might use these techniques on own data. In this series of posts I want to get away from Twitter sentiment and natural language parsing and look at how we might use machine learning with SQL Server’s DMVs. My hope is that we might be able to use machine learning to better understand what is happening inside a busy server and present this information in useful, meaningful ways to DBAs, developers and clients.

In this first lab, we will be using a clustering algorithm to analyse index usage data from SQL Server’s DMVs. Not only was I surprised by how easy this was to do, it was also amazingly effective at grouping indexes by their usage patterns.

A Large-Scale Performance Investigation
Recently we had been contracted by a national healthcare provider to investigate performance issues within their database. Not surprisingly this is a very large, active and complex system designed to support a variety of rich reporting interfaces. Performance has been an issue across the board with no single obvious reason. Our job has been to help identify the underlying issues and explore possible solutions.

Working with the client and their developers, it quickly became clear that there were some very complex reports with execution plans that had expensive clustered index scans and joins with huge amounts of data. Subsequent discussions led us to suggest that these performance issues were symptomatic of underlying issues in the database (table structures, indexing and statistics) and not necessarily just inefficient queries. My job was then to analyse the database activity, look for hot points and usage patterns and try to find efficient ways of accessing the required data, while the developers looked at strategies to simplify the queries.

Thankfully SQL Server maintains a wealth of information within its DMVs. My go-to sources are the historical wait stats, sys.dm_os_wait_stats, and index usage history, sys.dm_db_index_operational_stats & sys.dm_db_index_usage_stats. The complexity of the system and the sheer volume of data was immediately daunting. It seemed like the perfect opportunity to blend R and machine learning and find a novel way to explore and summarise the activity data.

Simplifying the problem with Machine Learning
The first job was to try to get a feel for how the indexes were being used. We had observed clustered index scans and large joins in a few execution plans, but the performance issues were wide-ranging and not just centered on these few reports. I needed to summarise the index usage across the entire database, including hundreds of indexes. Using a KMeans clustering algorithm we were able to identify 5 categories of index usage:

The figure above shows 5 distinct profiles for index usage. At first I was surprised by how different each profile was – they are clearly distinct and describe different usage patterns:

Group 1 is characterised by row locks. Can we reduce the number of rows in queries?
Group 2 is characetrised by range scans and page locks. This might suggest table scans and appropriate indexes should be designed to prevent this.
Group 3 is characterised by a high level of Singleton Lookups. These indexes should be reviewed – should they be heap tables? Or are they lacking suitable covering indexes?
Group 4 is characterised by Page Latch waits. We need to find a quicker way to access the data in these tables, perhaps with better indexes, filtered indexes or table partitioning.
Group 5 performs better than all the others. These are non-critical for this performance investigation, but then we might question if they even need to be there?

Investigating further, we then looked at the indexes that were in each group. By looking at the index DMVs and the plan cache we were able to confirm that these groups were sensible and meaningful. Based on this we were able to design solutions to these 5 “problems” and apply the same fix to all of the indexes within these groups rather than on an individual basis.

When our client first came to us, they presented us with a monstrously complex performance investigation. Somewhat overwhelmed with the task, we were able to reduce the complexity by using machine learning and a clustering algorithm to discover typical usage patterns that we then used to design suitable database-wide solutions. It was surprisingly quick to implement, and the way in which the clustering algorithm simplified the problem was an enormous benefit. This is my first foray into practical machine learning and I am excited to see where else we can use this to good effect.



  1. Just a correction: it was Target, not KMart, that predicted pregnancy. Your link is good; the article is about Target. But the in-text link says KMart.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: