SQL-like power with R’s data.table package

I had an interesting little problem today, that involved extracting data from one table based on information from another table. In SQL-speak, it was a full cross join with group by and a HAVING clause. It is a job that would have been outrageously simple in SQL, but far less elegant in base R. As it turns out, R’s data.table package is pretty good at joining tables as well.

The data

Today’s datasets contain information about single nucleotide polymorphisms (SNPs) and genes in the human genome. I have two tables:


The leffler table is a list of potentially interesting SNPs, their chromosome, position and nearest protein-coding gene.

The kottgen table is a set of summary statistics from a genome-wide association study (phenotype = urate). This table was published by Kottgen et al (2013). To describe this data in a very simple way: Kottgen was interested in understanding whether genetic variation played a role in the development of hyperuracemia (high levels of uric acid in the blood). 276,184 genetic variants were tested, and their strength of association with urate was recorded in the p_gc column. P-values (the p_gc column)  less than 5×10-08 are considered a significant effect size.

The goal

Leffler’s data was a little different, and I won’t go into detail here. Long story short, the SNPs in leffler are potentially interesting regions of the genome. We wanted to look at these regions in the kottgen data set and see if there were any interesting patterns.

The plan

  1. Identify discrete, non-overlapping regions of interest in the leffler dataset. We did this by clustering the SNPs based on their chromosome position.
  2. For each of these regions, define a boundary region as the min(POS) and max(POS).
  3. From the kottgen dataset, extract all SNPs that are within the boundary regions and plot them.

Joining kottgen to leffler

The fun part here is part 3. In particular, the join between kottgen and leffler is essentially very easy: we want all SNPs from the kottgen set which are on chromosome N, and between the relevant boundaries. There is a slight trick here though, chromosome positions are defined per chromosome. That is, chromosome 1 starts at position 1 and goes to position M.  Chromosome 2 does not then start at M + 1, but it starts again at 1 and goes to position N, and so on… This means, that we can’t simply filter the kottgen set by position, we must also take into account the chromosome number.

This would be relatively trivial in SQL. It would involve a join on chromosome, a group by on the regions and a filter on position. If we assume that I have already created a dataset called regions: {regionID, chromosome, start, end}, then some relatively simple SQL should get the job done:

SELECT k.chrom, k.SNP, k.POS, k.p_gc, r.regionID
FROM kottgen AS k, regions AS r
WHERE k.chrom = r.CHR
GROUP BY r.regionID
    k.POS >= r.Start
    k.POS <= r.End


Note that this is initially a full cross join that is filtered by the HAVING clause.

Doing this in base R is a little less elegant. You would need some sort of loop over the chromosome numbers, for example we could use sapply:

# NOTE: regions and kottgen are both data.tables
sapply(  unique(regions[, regionID]),
         function (rID) {
             lclRegion <- regions[regionID == rID]
             return (  
                 kottgen[chrom == lclRegion[, CHR] &
                         POS >= lclRegion[, start] &
                         POS <= lclRegion[, end]] 

There’s nothing wrong with the above – it just lacks elegance. It is very explicit. However, it turns out that if we chain together some simple data.table operations, we can achieve the same result:

setkey(regions, CHR)
setkey(kottgen, chrom)

# chain together 3 operations:  
#    1: full cross join between kottgen and regions on CHR
#    2: the cross join has a 1:many relationship between
#       chromosomes and regions. So the second operation
#       reconstructs the region boundaries (min(Start), max(End))
#    3. filter, the same as the HAVING condition
kottgen[regions, allow.cartesian = TRUE][,
    .(SNP, p_gc, POS, Start = min(Start), End = max(End)), by = regionID][
     POS >= Start & POS <= End, .SD, by = regionID]

It wouldn’t be right to have a post without a chart, so here is an example plot of a region:


This region is part of chromosome 6. The kottgen SNPs are plotted in grey, and the potentially interesting SNP from leffler is marked in red. All of the genes within this region are also plotted (colour coded by function).

Final thoughts

Let’s talk about what I like about the data.table solution:

  1. I like that we can perform quite a complex operation without having to leave the data.table environment. There is no need to write additional functions, or logic, or use other packages (like dplyr).
  2. When I tested this, the sapply function took ~ 10 seconds and the data.table operation took 0.1 seconds. Big win.

But here’s what I don’t like:

  1. It took me a while to figure this out. Then, it took me ages to test it and confirm that the results were correct!
  2. The syntax is very dense. I know some people like these one-liner type solutions. But I don’t. It takes quite a lot to unpack the operation and understand what it is doing. Of course, good comments help – but only so far.

Of course, these downsides don’t really matter – because my main goals at the moment are to master data.table and that means using it wherever and whenever possible – even when there might be a simpler way.



Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: