Advanced Analytics vs. Statistics

Here’s another interesting rant from LinkedIn: The irrelevance of mathematical certainty in deep learning.

In this discussion the author argues quite strongly against the suitability of statistical significance and certainty in the face of Big Data. I don’t agree with his reasoning, which strikes me as a reasonably narrow and slanted view. But hey, it’s social media, and having a strong (and controversial) position will mean more readers, right?

I do agree that the limits of certainty are changing. But this has little to do with the techniques or theories that lie behind machine learning, AI, deep learning etc. and much more to do with the data. Classical statistics was developed and perfected in an era of relatively small datasets that were expensive to collect and often came from clinical trials, so the requirement for a strong degree of certainty is obvious. It is my belief that guarding against false positives is the single greatest motivator for statistical significance, which is of obvious importance when the potential consequences are life or death. Classical statistical significance has more to do with ensuring you don’t get burned by overly optimistic decisions than with claiming we are completely sure about the entire system.

So what has changed with big data and, I would suggest even more strongly, with corporate / commercial data? Simply that we have more of it. Of course you can’t build a powerful descriptive model from a small sample. Small datasets typically do not capture enough information about the wider system, so you are limited to describing only what you have captured. The truth is, a small dataset doesn’t contain enough “truth” to generalise well. This is the hard limit under which classical statistics was built. Big data totally blows this limit away… assuming your data does indeed capture and describe the system well. Nothing will help you if your data is simply rubbish.
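To put rough numbers on that, here is a minimal sketch of my own (not something from the LinkedIn post). It assumes we are estimating a simple mean, that the noise has a known standard deviation sigma, and that "rubbish" data can be caricatured as a fixed systematic offset called bias. The function names and the values of sigma and bias are hypothetical, chosen purely for illustration. The random part of the error shrinks as one over the square root of the sample size, while the systematic part does not shrink at all.

```python
import math


def ci_half_width(sigma: float, n: int, z: float = 1.96) -> float:
    """Half-width of an approximate 95% confidence interval for a mean.

    Assumes independent observations with known standard deviation sigma.
    """
    return z * sigma / math.sqrt(n)


def worst_case_error(sigma: float, n: int, bias: float, z: float = 1.96) -> float:
    """Systematic offset plus sampling uncertainty.

    The bias term is a stand-in for poor data quality (an assumption of
    this sketch); note that it does not depend on n.
    """
    return abs(bias) + ci_half_width(sigma, n, z)


if __name__ == "__main__":
    sigma, bias = 1.0, 0.1  # illustrative values only
    for n in (100, 10_000, 1_000_000):
        print(n, ci_half_width(sigma, n), worst_case_error(sigma, n, bias))
```

With these made-up numbers, going from a hundred rows to a million cuts the sampling uncertainty by a factor of a hundred, yet the worst-case error never drops below the 0.1 offset baked into the data.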

So I suggest that the arguments around certainty / robustness / deep learning vs. statistics and so forth should change significantly. Data quality is the primary problem. Anything is possible, with high certainty, if you have good data and a relatively stable system. Statisticians have known this for a very long time, which is why they are cautious. In the race to deliver user-friendly analytics to the masses, let us remember that our models are only ever as good as our data.

