Category Archives: Small Steps to Big Data

SQL-like power with R’s data.table package

I had an interesting little problem today that involved extracting data from one table based on information from another table. In SQL-speak, it was a full cross join with GROUP BY and a HAVING clause. It is a job that

Data warehousing & breaking the rules of a star schema

I would really appreciate some feedback on this. If you have a few minutes to spare and don’t mind sharing your thoughts – then I would like to hear from you. The Schema: We are creating a data warehouse to store

SQL vs. BioMart for querying the human genome

A huge part of my job is to add context and build layers of information on top of the genetic mutation datasets that we have amongst our groups. If we want to understand the importance of genetic mutations on human

Databases for finding human protein-coding genes

After approx. 4 months with the Merriman group, I am beginning to get a handle on how they operate and the typical questions that they are interested in. At a very high level, a typical workflow might involve finding interesting

Debugging GenomeSIMLA

Simulated datasets get a hard time in the real world, as it is difficult to build a simulation which accurately captures the range of values and “dirtiness” of real data. However, simulated sets cannot be beaten when testing out new

Should we aspire to be statisticians?

Should we aspire to be statisticians, or is data science the evolution of statistics? I came across this amazing debate (Data Science and Statistics: different worlds?) that is just phenomenal. I am somewhat in awe of statisticians. I can’t help but

Heat Stress Indicators & Climate Change: Exploring the relaxation of Wet Bulb Temperature

Ever since the early 2000s, climate change has been big news. John Sutter of CNN (2015) recently quoted Gina McCarthy (Environmental Protection Agency, USA) as stating that “climate change is the greatest threat of our time”. The effects of climate change

Web Scraping the Australian Open

The 2015 Australian Open was a fantastic event and every shot, game, set and match was recorded by the IBM SlamTracker. IBM have been tracking match statistics from the major grand slams for the past 8 years, and have gathered

Data Analytics are not the answer

Data analytics is now so ubiquitous that it is a requirement for success in a fiercely competitive global economy. Unfortunately, there is so much hype around analytics that people often expect that analytics are the answer, that it can somehow

Exploring & Extending IBM’s SlamTracker

The 2015 Australian Open was a fantastic event! From the Australian perspective, it was encouraging to see so many Australian players winning deep into the event. And Djokovic was a class act: he appeared to be in complete control, dug
