I want to continue my exploration of a topical debate: is data science / machine learning just a fancy name for statistics?
It’s a really interesting discussion being held all around the world, and it is gaining momentum within a core group of statisticians who are looking ahead to the future of their profession. The Royal Statistical Society, the American Statistical Association and the Statistical Society of Australia have all recently entertained various perspectives on this debate. In general, it seems that the statistical community is beginning to embrace the concept of a more integrated curriculum that includes computation and wider subject areas than simply mathematics and statistics, and I think this is a great sign for both statistics and data science.
Quite naturally, this will lead to the evolution of a totally different type of statistician, one whose formative education includes computer science, information science and perhaps genetics in place of pure maths. This will eat into the number of graduates pursuing academic careers focused on developing the theoretical foundations of statistics itself. But it will also mean that these graduates are well placed to enter the wider world and apply their hard-won skills to real and interesting problems.
And this, of course, cuts both ways. Non-statisticians (computer scientists, information science grads, geneticists, marketing students etc.) will quickly realise that being data savvy is an essential skill for success in the digital economy. Already we see ‘data science’ programs all over the world that introduce these non-statisticians to the core statistical (and computing) concepts they need to be effective innovators in their field. I don’t see any reason why statistics cannot be learned by people whose paths have first led them in other directions, just as I don’t see any reason why statisticians cannot learn how to use Apache Spark or build customer churn models.
So then, the core question: are machine learning and statistical learning synonymous? The short answer, to me, is no: they are not. To be clear, both fields share a core statistical foundation. Both use models to learn about datasets. Techniques like linear regression, maximum likelihood estimation, splines and kernels were all born within the statistical community and are widely used in machine learning. So in this sense statistical modeling provides the foundation for machine learning.
But when asked to describe these two fields, here is the way that I view them:
Statistical learning is primarily focused on the analysis of a dataset: uncovering patterns and trends, and drawing information out of the noise. Along with this come the issues of uncertainty and robustness, and a healthy amount of professional skepticism. Most importantly, I believe that statistical analysis is a self-contained exercise: an application of statistics to understand the environment in which the data was collected. It may also be used to make a judgment about the future environment. But rarely is it used in real time and integrated into the operational arms of the business.
Machine learning, on the other hand, is primarily about applying models to new data. While the difference is perhaps subtle, I think that it is incredibly important. Through machine learning we have the opportunity to make better decisions now, in real time and in ways that immediately improve our business, whether this be guiding the user experience, controlling a manufacturing line or making policy decisions based on Twitter sentiment.
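To make the contrast a little more concrete, here is a minimal sketch in Python with NumPy. Everything in it is an illustrative assumption rather than anything from a real project: the synthetic dataset, the choice of simple linear regression, and names such as `predict` and `incoming`. The same fitted model is used in two ways, first to understand the data already collected (coefficients and their standard errors), then to score new observations as they arrive.

```python
import numpy as np

# Synthetic "historical" data (an assumption for illustration only):
# y depends linearly on x, plus random noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, size=200)

# Design matrix with an intercept column, then an ordinary least squares fit.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# --- Statistical-learning view: understand the data we already have ---
residuals = y - X @ beta
dof = X.shape[0] - X.shape[1]               # observations minus parameters
sigma2 = residuals @ residuals / dof        # estimated noise variance
cov_beta = sigma2 * np.linalg.inv(X.T @ X)  # covariance of the estimates
std_err = np.sqrt(np.diag(cov_beta))
print("intercept, slope:", beta)
print("standard errors: ", std_err)

# --- Machine-learning view: apply the fitted model to new data ---
def predict(new_x):
    """Score new observations with the already-fitted coefficients."""
    new_x = np.asarray(new_x, dtype=float)
    return beta[0] + beta[1] * new_x

incoming = [1.5, 4.0, 9.2]  # hypothetical observations arriving later
print("predictions:", predict(incoming))
```

The first half is where the questions about uncertainty and robustness live; the second half is the part that would typically sit inside an operational system and be called on every new record.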
In a nutshell, I believe that context is what separates these two very similar disciplines. Statistical learning is about understanding where we have been, which can help drive strategies for the future. Machine learning helps to make decisions right now. At least that’s how I see it. I would be very interested in what you think!