Prediction and accuracy go hand-in-hand. For example, if we want to predict the presence of cancer in a patient, that prediction had better be accurate! In the commercial world we might be interested in predicting changes in futures or house prices, or the likelihood that a customer will buy beer given that they have bought diapers. In some of these cases we might accept a moderate level of uncertainty, but the fact remains that we build and evaluate models in an effort to maximise accuracy on unseen data. Prediction and accuracy are inseparable. So why should we be wary of predictions?
If you really want to understand the effect data is having, you need the models.
Mike Loukides (2015)
Quite often, predictions fail to capture the diversity and interesting characteristics of our population. The problem is that predictions tend to promote the majority or average cases in a population. This is something that has been playing on my mind recently, and it really crystallised when I came across the blog post We need open models, not just open data by Mike Loukides. Loukides presents a number of great examples of the self-fulfilling loop that prediction-based decision making perpetuates:
You’re going to be showing black people homes in predominantly black neighborhoods not because you want to keep white neighborhoods pure, but because that’s where the model says they’re most likely to buy.
You’re going to be stopping and searching more minority drivers without cause not because you’re prejudiced, but because the model says they’re more likely to be arrested for crimes.
And if you stop more minority drivers, you almost certainly will arrest more minority drivers, so the model becomes self-fulfilling.
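The loop Loukides describes can be sketched as a toy simulation. All numbers below are illustrative assumptions: two groups offend at the same underlying rate, but the model starts with a slight bias against group B, stops follow the model's predictions, and the model retrains on the arrests those stops produce.

```python
# Toy simulation of a self-fulfilling prediction loop (illustrative numbers).
# Both groups have identical true rates, yet the model's initial bias against
# group B never corrects, because stops -- and hence arrests -- follow the model.

TRUE_RATE = {"A": 0.05, "B": 0.05}   # identical underlying behaviour
belief    = {"A": 0.05, "B": 0.06}   # model's (slightly biased) predicted rates
STOPS_PER_DAY = 1000

for day in range(100):
    total = belief["A"] + belief["B"]
    for g in ("A", "B"):
        stops = STOPS_PER_DAY * belief[g] / total   # stops follow the model
        arrests = stops * TRUE_RATE[g]              # expected arrests from stops
        # Naive retraining: the belief tracks raw arrest counts,
        # not arrests *per stop*, so extra stops look like extra crime.
        belief[g] = 0.9 * belief[g] + 0.1 * arrests / STOPS_PER_DAY

# Group B's predicted rate remains ~20% above group A's, indefinitely,
# despite the true rates being equal.
print(belief)
```

Note that if the retraining step divided arrests by the number of stops for that group, both beliefs would converge to the true rate of 0.05; the bias persists only because the decision (where to stop) feeds back into the training data.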
And it isn’t just regression models at risk. The same can be said of association measures that suggest that because we observed X, we are likely to also observe Y. Or classification models that predict an outcome Y based on some combination of features X1, X2, …, Xn. Naturally, if we base our behaviours on the predictions of such models, then over time our models end up defining our population – which is critically different from building models that help describe our population. Taken to the extreme, predictive modelling will eventually dampen the effects of variation and stomp out innovation!
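The "observed X, so expect Y" pattern is worth unpacking with the beer-and-diapers example from earlier. Using made-up transaction counts, confidence alone can make a rule look strong even when the consequent is simply common; lift compares the rule against the base rate:

```python
# Hypothetical basket counts (illustrative only) for the rule diapers -> beer.
n_total   = 1000   # all transactions
n_diapers = 200    # baskets containing diapers
n_beer    = 500    # baskets containing beer
n_both    = 120    # baskets containing both

confidence = n_both / n_diapers       # P(beer | diapers) = 0.60: looks strong
support_beer = n_beer / n_total       # P(beer) = 0.50: but beer is common anyway
lift = confidence / support_beer      # 1.2: only a modest real association

print(confidence, lift)
```

A lift near 1 means observing diapers barely changes the odds of beer – which is exactly the kind of nuance that is lost if we act on the raw prediction alone.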
Of course, prediction is rarely the first goal in building a model. Forming predictions is often the last thing we do, after we have squeezed all of the potential information out of the data. Often we are more interested in understanding: exploring relationships, interactions and the relative effects of various attributes within our data set. In some sense, this is the source of my occasional frustration with tools like SQL Server Analysis Services, which make it difficult to interrogate the model. Thankfully Azure ML is far more transparent, and the promise of a fully integrated SQL Server-R combo is truly exciting. This is where the business value really lies – not simply in predicting a trend, but in understanding the factors that drive it.
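Interrogating a model rather than just querying it for predictions can be as simple as looking at fitted effect sizes. A minimal sketch, with made-up data and labels, fitting ordinary least squares by hand:

```python
# A descriptive use of a model: expose the fitted coefficient, not just forecasts.
# The data and the tenure/spend interpretation are purely illustrative.
xs = [1, 2, 3, 4, 5]        # e.g. years of customer tenure
ys = [12, 15, 21, 24, 30]   # e.g. monthly spend

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx

# The coefficient is the insight: each extra year of tenure is associated
# with 4.5 units of extra spend -- the driver, not merely a forecast.
print(slope, intercept)
```

A black-box scoring service would hand back only predicted spend; inspecting the slope is what tells us which factor drives the trend and by how much.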