No original content here, just an interesting snapshot of ideas from the team that placed third in Kaggle’s Home Depot competition. Some thoughts on this:
- The team clearly put in a lot of effort to understand their data. Perhaps a little less effort went into understanding the business problem, but a huge amount went into understanding the data itself.
I separate the business problem from the data problem above largely because the focus of Kaggle is always prediction accuracy. Kaggle comps tend to reduce to technical challenges in pursuit of better predictions.
- As always with Kaggle, feature engineering played a huge role. The team engineered more than 500 features using a variety of methods. I am sure many of these were redundant, and they did mention a simpler 10-feature model which performed well.
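To make the "500 redundant features vs. a simple 10-feature model" idea concrete, here is a minimal sketch (not the team's actual pipeline; the data, sizes, and model choice are all illustrative) of pruning a large feature set down to a small one by ranking features with random-forest importances:

```python
# Hypothetical illustration: rank a wide, mostly-noise feature matrix by
# random-forest importance and keep only the top 10 features.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_samples, n_features = 500, 50  # stand-in for a "500+ engineered features" setting
X = rng.normal(size=(n_samples, n_features))
# The target depends on only a few features plus noise -- low signal,
# much like the relevance scores discussed above.
y = 2.0 * X[:, 0] + X[:, 1] - X[:, 2] + rng.normal(scale=0.5, size=n_samples)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Keep the 10 most important features -- analogous to the simpler
# 10-feature model the team mentioned.
top10 = np.argsort(model.feature_importances_)[::-1][:10]
X_small = X[:, top10]
small_model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_small, y)
```

In this synthetic setup the three informative features float to the top of the importance ranking, and the 10-feature model captures essentially all of the available signal; the other 40 columns contribute nothing but variance.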
What I find most interesting is that their performance seems directly related to how creatively they thought about their data, and to how they applied and combined multiple different views or perspectives of it. Definitely complex, and possibly devoid of the ‘biological relevance’ (read this as domain relevance 🙂 ) that we doggedly pursue in bioinformatics. But again, really interesting.
I think what I really like is that the Home Depot data was clearly very noisy, and the actual amount of signal in it was quite low. There is a direct parallel here to genomics datasets. Reading this interview has helped me think a little more objectively about my own datasets, and has perhaps opened up some avenues for exploring less “biologically relevant” and less “traditional” methods. While these will undoubtedly be more complex, if the results are sensible then the challenge becomes one of interpretation: sifting through the feature engineering to find a consistent biological story.