SQL vs. BioMart for querying the human genome

A huge part of my job is to add context and build layers of information on top of the genetic mutation datasets that we have amongst our groups. If we want to understand the importance of genetic mutations on human health, then we first need to identify interesting mutations, map these to regions of DNA, relevant genes, the proteins these genes code for and finally, biological pathways or functions that these proteins are involved in. The figure below gives a (massively stylised) overview of this process:



In a previous post (databases-for-finding-human-protein-coding-genes), I looked at using the public MySQL interface to UCSC’s Genome Browser to extract relevant genes (based on the chromosome and chromosomal region). In this post, I will extend this to:

  1. restrict results to protein-coding genes only using an improved filter (compared to the previous post)
  2. add the protein ID to the results
  3. compare UCSC’s public MySQL interface with the BioMart API which allows you to extract the same information from the Ensembl database using R.

Where we left off previously…

Previously, I used R and the RMySQL package to access the public MySQL connection to UCSC’s Genome Browser. Below, I have created a wrapper which will let us pass different queries to UCSC:

library(RMySQL)

queryUCSC <- function (query) {
    # queries UCSC using UCSC's public MySQL interface

    # Parameters:
    # -----------
    #   query: string
    #     pre-formatted query string
    # Output:
    # -------
    #   results: data frame
    #     a data frame corresponding to the query results.

    # open connection to UCSC's public server (hg19, for GENCODE V19)
    drv <- dbDriver("MySQL")
    con <- dbConnect(drv, user="genome",
                     host="genome-mysql.soe.ucsc.edu",
                     dbname="hg19")

    # send query and return results
    results <- dbGetQuery(con, query)

    # close DB connection
    dbDisconnect(con)

    return (results)
}

I left off last time with a query which extracted relevant gene information from the wgEncodeGencodeBasicV19 view:

fromEncodeGene <- function (chr, start, end) {
    # Queries the wgEncodeGencodeBasicV19 table from the UCSC Genome Browser
    query <- paste0(" SELECT ", chr, " as CHR,
                               name as GeneName,
                               name2,
                               cdsStartStat,
                               cdsEndStat,
                               txStart as GeneStart,
                               txEnd as GeneEnd,
                               exonCount,
                               exonFrames
                      FROM wgEncodeGencodeBasicV19
                      WHERE chrom = 'chr", chr, "'
                        AND ( (txStart BETWEEN ", start, " AND ", end, ") OR
                              (txEnd BETWEEN ", start, " AND ", end, ") OR
                              (txStart < ", start, " AND txEnd > ", start, "))
                      GROUP BY name2
                      ORDER BY txStart ASC;")

    return (query)
}

Using a region of chromosome 2 as an example, here are the first few results:

chr <- 2
start <- 165400000
end <- 166000000

head(queryUCSC(fromEncodeGene(chr, start, end)))
CHR GeneName          name2    cdsStartStat cdsEndStat GeneStart GeneEnd   exonCount exonFrames
2   ENST00000263915.3 GRB14    cmpl         cmpl       165349325 165478358 14        1,2,1,0…
2   ENST00000375458.2 COBLL1   cmpl         cmpl       165536697 165697928 13        0,0,1,2…
2   ENST00000384142.1 SNORA70F none         none       165544152 165544287 1         -1
2   ENST00000303735.4 SLC38A11 cmpl         cmpl       165752695 165812035 10        0,0,1,1…
2   ENST00000360093.3 SCN3A    cmpl         cmpl       165944031 166060577 28        1,0,0,0…

I am only interested in protein-coding genes. Previously, through a bit of trial and error and a Google search, I had found that protein-coding genes could be inferred from the cdsStartStat and cdsEndStat columns, as well as the exonFrames column. However, this seems a little bit artificial. Let’s try and do a little better.

Filtering on protein_coding genes

I was somewhat surprised that the function of the genes was not recorded in the wgEncodeGencodeBasicV19 table, so I dug a little deeper into the online documentation. It turns out that wgEncodeGencodeBasicV19 is a view that sits over a number of base tables (a lot of base tables, in fact). One of these underlying tables, wgEncodeGencodeAttrsV19, records the geneType:

queryUCSC("SELECT DISTINCT geneType FROM wgEncodeGencodeAttrsV19;")

This returns a list of 29 different geneTypes. We are primarily interested in “protein_coding” geneTypes.
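If you want a feel for how the transcripts are distributed across these types before committing to a filter, a follow-up count query does the trick. This is a sketch (I haven't shown its output here) that simply builds the SQL string for the wrapper above:

```r
# Follow-up query (sketch): count transcripts per geneType in the
# GENCODE attributes table, most common first.
geneTypeCounts <- "
    SELECT geneType, COUNT(*) AS n
    FROM wgEncodeGencodeAttrsV19
    GROUP BY geneType
    ORDER BY n DESC;"

# results <- queryUCSC(geneTypeCounts)  # requires the live UCSC connection
```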

Adding protein IDs

In my previous post, I had noted that UCSC’s knownGene table also returns protein IDs associated with each gene. Overall, the knownGene table wasn’t quite as user-friendly as wgEncodeGencodeBasicV19, but it would be nice to include the protein IDs. Again, a little reading of the docos led me to the wgEncodeGencodeUniProtV19 table, which contains the mapping between genes, transcripts and proteins.

Using wgEncodeGencodeBasicV19, wgEncodeGencodeAttrsV19 and wgEncodeGencodeUniProtV19, we can begin to layer on information between chromosome regions, genes of interest and the proteins that they code for:

fromGeneProteinContext <- function (chr, start, end) {

    # Queries UCSC's GENCODE (Ensembl) views and tables for protein-coding genes
    query <- paste0(" SELECT ", chr, " as CHR,
                               t1.name as transcriptId,
                               t2.geneName,
                               t2.transcriptClass,
                               t2.geneType,
                               t1.txStart as GeneStart,
                               t1.txEnd as GeneEnd,
                               t1.exonCount,
                               p1.acc,
                               p1.name
                       FROM wgEncodeGencodeBasicV19 t1
                          JOIN wgEncodeGencodeAttrsV19 t2
                          JOIN wgEncodeGencodeUniProtV19 p1
                       WHERE t1.name = t2.transcriptId
                         AND t1.name = p1.transcriptId
                         AND t1.chrom = 'chr", chr, "'
                         AND ( (t1.txStart BETWEEN ", start, " AND ", end, ") OR
                               (t1.txEnd BETWEEN ", start, " AND ", end, ") OR
                               (t1.txStart < ", start, " AND t1.txEnd > ", start, "))
                       GROUP BY t1.name2
                       ORDER BY t1.txStart ASC;")
    return (query)
}

queryUCSC(fromGeneProteinContext(chr, start, end))
CHR transcriptId      geneName transcriptClass geneType       GeneStart GeneEnd   exonCount acc    name
2   ENST00000263915.3 GRB14    coding          protein_coding 165349325 165478358 14        Q14449 GRB14_HUMAN
2   ENST00000375458.2 COBLL1   coding          protein_coding 165536697 165697928 13        B3KMG7 B3KMG7_HUMAN
2   ENST00000303735.4 SLC38A11 coding          protein_coding 165752695 165812035 10        Q08AI6
2   ENST00000360093.3 SCN3A    coding          protein_coding 165944031 166060577 28        C9JBM7

As you can see above, the results contain the 4 protein-coding genes (GRB14, COBLL1, SLC38A11 and SCN3A), as well as the UniProt protein ID (acc) and the associated UniProt name. The results are restricted to protein-coding genes as a side effect of the equijoin to wgEncodeGencodeUniProtV19. We could also include an explicit filter (geneType = "protein_coding") in the WHERE clause, although this would be redundant. In the interests of clarity, and working as part of a team, I am a fan of being more explicit rather than less; an alternative is simply to document this with a clear comment.
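To make the explicit version concrete, here is one way to inject the redundant predicate into a query string. Note that addProteinCodingFilter is a hypothetical helper for illustration, shown against a cut-down version of the query:

```r
# Sketch: make the protein-coding filter explicit, even though the
# equijoin to wgEncodeGencodeUniProtV19 already implies it.
addProteinCodingFilter <- function (query) {
    # inject the (redundant, but self-documenting) predicate just
    # before the GROUP BY clause
    sub("GROUP BY", "AND t2.geneType = 'protein_coding'\n GROUP BY", query)
}

explicitQuery <- addProteinCodingFilter(
    "SELECT t1.name
       FROM wgEncodeGencodeBasicV19 t1
          JOIN wgEncodeGencodeAttrsV19 t2
      WHERE t1.name = t2.transcriptId
      GROUP BY t1.name2;")
```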

Using BioMart to query Ensembl

I am perfectly happy with the above. It works, and the SQL is relatively straightforward. However, given that we are now dealing solely with UCSC’s versions of the Ensembl database, we might be able to use BioMart to extract the same information.

BioMart is an API which allows us to access a huge range of genomic information conveniently within R (via the biomaRt package). There are two advantages to using BioMart: 1) it provides a simple-to-use interface built from a small number of R functions, and 2) because it combines a whole range of different databases under one interface, we don’t have to trawl through the documentation for each source, familiarise ourselves with differing schemas, or figure out how to integrate data across multiple sources. Big win.

If there is a downside, it is perhaps that you have to learn BioMart. Not that this is difficult. But if you are already comfortable with SQL and happy to familiarise yourself with different databases, then SQL might be the more comfortable solution for you. However, I have to keep in mind that I won’t always be here to maintain this code and that other people will probably pick it up, use it themselves or tweak it for their own projects. In that case, I think that a simplified API, like BioMart, makes a lot of sense.

Becoming familiar with BioMart is a lot like becoming familiar with a new database schema. There are a number of functions that will probably be very useful initially; the ones that I used a lot were listDatasets(), listFilters() and listAttributes(). I also used grep() alongside these to filter out the bits I didn’t want, for example:


library(biomaRt)

#listMarts(host="www.ensembl.org")    # build 38
listMarts(host="grch37.ensembl.org")  # build 37

ensembl <- useMart(biomart="ENSEMBL_MART_ENSEMBL",
                   host="grch37.ensembl.org",
                   path="/biomart/martservice")

idx <- grep("hsapiens", listDatasets(ensembl)[, 1], value=F)
listDatasets(ensembl)[idx, ]

# select the human dataset for the getBM() calls below
ensembl <- useDataset("hsapiens_gene_ensembl", mart=ensembl)

Note that BioMart defaults to the latest build of the human genome (GRCh38). We are still using build 37, which meant I had to specify this when connecting to the data mart.

Here’s another interesting tip to find attributes related to proteins:

idxProtein <- grep("Protein", listAttributes(ensembl)[, 2])
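Since grep() is doing the heavy lifting here, the same idea can be shown without a live mart connection, against a small mock of what listAttributes() returns (the attribute names and descriptions below are illustrative, not a real query result):

```r
# Mock of listAttributes() output: a data frame with attribute names
# in column 1 and human-readable descriptions in column 2.
mockAttributes <- data.frame(
    name        = c("ensembl_gene_id", "uniprotswissprot", "chromosome_name"),
    description = c("Gene stable ID", "UniProtKB/Swiss-Prot ID",
                    "Chromosome/scaffold name"),
    stringsAsFactors = FALSE)

# match on the description column, ignoring case
idxProtein <- grep("prot", mockAttributes[, 2], ignore.case = TRUE)
mockAttributes[idxProtein, ]
```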


It really only took 10 or 15 minutes to wrap my head around the basics of BioMart (with lots of help from the biomaRt vignette). Finally, I managed to cobble together the basic query that I needed:

results <- getBM(
      attributes = c("chromosome_name", "start_position", "end_position",
                     "external_gene_name", "transcript_count", "gene_biotype"),
      filters = c("chromosome_name", "start", "end", "transcript_biotype"),
      values = list(chromosome=chr, start=start, end=end, biotype="protein_coding"),
      mart = ensembl)
results
chromosome_name start_position end_position external_gene_name transcript_count gene_biotype
2               165349322      165478358    GRB14              7                protein_coding
2               165510134      165700189    COBLL1             25               protein_coding
2               165752696      165812035    SLC38A11           10               protein_coding
2               165944032      166060577    SCN3A              7                protein_coding

This returns the protein-coding genes that fall within our region of interest. There are some obvious discrepancies between the data above and that returned from UCSC. First, the start and end positions are slightly different. I could live with small disagreements here, as this is only exploratory at this stage. However, I would want to carefully check that the number of transcripts and exons match between these two sources.
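A quick way to run that check is to line the two sources up by gene name and difference the coordinates. The sketch below hand-copies the gene start positions quoted in this post, so treat it as illustrative rather than a general solution:

```r
# Gene start positions as reported by UCSC and by BioMart
# (hand-copied from the result sets above).
ucsc    <- data.frame(gene  = c("GRB14", "SLC38A11", "SCN3A"),
                      start = c(165349325, 165752695, 165944031))
biomart <- data.frame(gene  = c("GRB14", "SLC38A11", "SCN3A"),
                      start = c(165349322, 165752696, 165944032))

# join on gene name and difference the start coordinates
both <- merge(ucsc, biomart, by = "gene", suffixes = c(".ucsc", ".biomart"))
both$diff <- both$start.ucsc - both$start.biomart
both  # all differences are within a few bases
```

The same pattern extends naturally to transcript and exon counts once both queries return them.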

Also, it doesn’t return the protein IDs that map to these genes. I experimented a little (see below), but was unable to return the same information as I got from UCSC’s tables:

results <- getBM(
      attributes = c("chromosome_name", "external_gene_name", "protein_id"),
      filters = c("chromosome_name", "start", "end", "transcript_biotype"),
      values = list(chromosome=chr, start=start, end=end, biotype="protein_coding"),
      mart = ensembl)

I haven’t included the results here (saving myself some time), but this query returns 37 rows, with multiple entries for each of the genes. I don’t doubt that a single gene can code for multiple different peptides / proteins and that these will probably have different identifiers. It is also interesting that the protein_ids do not match those obtained from UCSC. At this stage, it isn’t clear how to make sense of these results.

Final thoughts

All in all, using BioMart was really easy, and it looks like a good method for mapping between mutations and relevant protein-coding genes. However, to extend this further and layer on protein information, it looks like querying UCSC directly might be the more elegant solution at this stage.

