SQL vs. BioMart for querying the human genome

By nickburns2013 / March 29, 2016 / ML in Practice, Science@Work, Small Steps to Big Data / Leave a comment

A huge part of my job is to add context and build layers of information on top of the genetic mutation datasets that we have amongst our groups. If we want to understand the importance of genetic mutations on human health, then we first need to identify interesting mutations, map these to regions of DNA, relevant genes, the proteins these genes code for and finally, biological pathways or functions that these proteins are involved in. The figure below gives a (massively stylised) overview of this process:

GeneticContext

In a previous post (databases-for-finding-human-protein-coding-genes), I looked at using the public MySQL interface to UCSC’s Genome Browser to extract relevant genes (based on the chromosome and chromosomal region). In this post, I will extend on this to:

restrict results to protein-coding genes only using an improved filter (compared to the previous post)
add the protein ID to the results
compare UCSC’s public MySQL interface with the BioMart API which allows you to extract the same information from the Ensembl database using R.

Where we left off previously…

Previously, I used the R and the RMySQL package to access public MySQL connection to UCSC’s Genome Browser. Below, I have created a wrapper which will let us pass different queries to UCSC:


queryUCSC <- function (query) {
    # queries UCSC using UCSC's public MySQL interface

    # Parameters:
    # -----------
    #   query: string
    #     pre-formatted query string
    #
    # Output:
    # -------
    #   results: data frame
    #     a data frame corresponding to the query results.

    # open connection
    options(warn=-1)
    drv <- dbDriver("MySQL")
    con <- dbConnect(drv, user="genome", 
                           host="genome-mysql.cse.ucsc.edu", 
                           dbname="hg19", 
                           password="")

    # send query and return results
    results <- dbGetQuery(con, query) 

    # close DB connection
    dbDisconnect(con)
    options(warn=0)

    return (results)
}

I left off last time with a query which extracted relevant gene information from the wgEncodeGencodeBasicV19 view:


fromEncodeGene <- function (chr, start, end) {
    # Queries the wgEncodeGencodeBasicV19 table from the UCSC Genome Browser
 
    query <- paste0(" SELECT ", chr, " as CHR, ",
                               "name as GeneName,
                                name2,
                                cdsStartStat,
                                cdsEndStat,
                                txStart as GeneStart,
                                txEnd as GeneEnd,
                                exonCount,
                                exonFrames
                      FROM wgEncodeGencodeBasicV19
                      WHERE chrom = 'chr", chr, "'
                        AND ( (txStart BETWEEN ", start, " AND ", end, ") OR
                              (txEnd BETWEEN ", start, " AND ", end, ") OR
                              (txStart < ", start, " AND txEnd > ", start, "))
                      GROUP BY name2
                      ORDER BY txStart ASC;
 ")

    return (query)
}

Using a region of chromosome 2 as an example, the first few results were found:


chr <- 2
start <- 165400000
end <- 166000000

head(queryUCSC(fromEncodeGene(chr, start, end)))

CHR	GeneName	name2	cdsStartStat	cdsEndStat	GeneStart	GeneEnd	exonCount	exonFrames
2	ENST00000263915.3	GRB14	cmpl	cmpl	165349325	165478358	14	1,2,1,0…
2	ENST00000375458.2	COBLL1	cmpl	cmpl	1655366975	1656979281	3	0,0,1,2..
2	ENST00000384142.1	SNORA70F	none	none	165544152	1655442871	-1	-1
2	ENST00000384142.1	SNORA70F	none	none	165544152	1655442871	-1	-1
2	ENST00000303735.4	SLC38A11	cmpl	cmpl	165752695	165812035	10	0,0,1,1…
2	ENST00000360093.3	SCN3A	cmpl	cmpl	165944031	166060577	28	1,0,0,0…

I am only interested in protein-coding genes. Previously, through a bit of trial and error and a Google search, I had found that protein-coding genes could be inferred from the cdsStartStat and cdsEndStat columns, as well as the exonFrames column. However, this seems a little bit artificial. Let’s try and do a little better.

Filtering on protein_coding genes

I was somewhat surprised that the function of the genes were not recorded in the wgEncodeGencodeBasicV19 table. So I dug a little deeper into the documentation online. It turns out that wgEncodeGencodeBasicV19 is a view that sits over a number of base tables (a lot of base tables in fact). One of these underlying tables, wgEncodeGencodeAttrsV19, records the geneType:

fromUCSC("select distinct geneType from wgEncodeGencodeAttrsV19;")

This returns a list of 29 different geneTypes. We are primarily interested in “protein_coding” geneTypes.

Adding protein IDs

In my previous post, I had noted that UCSC’s knownGene table also returns proteinIDs which are associated with each gene. Overall, the knownGene table wasn’t quite as user-friendly as wgEncodeGencodeBasicV19, but it would be nice to include the proteinIDs. Again, a little reading of the docos led to the wgEncodeGencodeUniProtV19 table which contained the mapping between genes, transcripts and proteins.

Using wgEncodeGencodeBasicV19, wgEncodeGencodeAttrV19 and wgEncodeGencodeUniProtV19, we can being to layer on information between chromosome regions, genes of interest and proteins that they code for:

fromGeneProteinContext <- function (chr, start, end) {

    # Queries the UCSC's Ensmeble views and tables for protein-coding genes
    query <- paste0(" SELECT ", chr, " as CHR, ",
                               "t1.name as transcriptId,
                                t2.geneName,
                                t2.transcriptClass,
                                t2.geneType,
                                txStart as GeneStart,
                                txEnd as GeneEnd,
                                exonCount,
                                p1.acc,
                                p1.name
                       FROM wgEncodeGencodeBasicV19 t1
                          JOIN wgEncodeGencodeAttrsV19 t2
                          JOIN wgEncodeGencodeUniProtV19 p1
                       WHERE t1.name = t2.transcriptId
                         AND t1.name = p1.transcriptId
                         AND t1.chrom = 'chr", chr, "'
                         AND ( (t1.txStart BETWEEN ", start, " AND ", end, ") OR
                               (t1.txEnd BETWEEN ", start, " AND ", end, ") OR
                               (t1.txStart < ", start, " AND txEnd > ", start, "))
                       GROUP BY t1.name2
                       ORDER BY t1.txStart ASC;")
    return (query)
}

fromUCSC(fromGeneProteinContext(chr, start, end))

CHR	transcriptID	geneName	transcriptClass	geneType	GeneStart	GeneEnd	exonCount	acc	name
2	ENST00000263915.3	GRB14	coding	protein_coding	165349325	165478358	14	Q14449	GRB14_HUMAN
2	ENST00000375458.2	COBLL1	coding	protein_coding	165536697	165697928	13	B3KMG7	B3KMG7_HUMAN
2	ENST00000303735.4	SLC38A11	coding	protein_coding	165752695	165812035	10	Q08AI6/td>	S38AB_HUMAN
2	ENST00000360093.3	SCN3A	coding	protein_coding	165744031	166060577	28	C9JBM7/td>	C9JBM7_HUMAN

As you can see above, the results contain the 4 protein-coding genes (GRB14, COBLL1, SLC38A11, SCN3A), as well as the UniProt proteinID (acc) and the associated UniProt name. The results are filtered to protein-coding genes as a result of the equijoin to wgEncodeGencodeUniProtV19. However, we could be more explicit and include a filter (geneType = “protein_coding”) in the WHERE clause, although this would be redundant. In the interests of clarity and working as part of a team, I am a fan of being more explicit, rather than less. An alternative is to simply document this with a clear comment.

Using BioMart to query Ensembl

I am perfectly happy with the above. It works, and the SQL is relatively straightforward. However, given that we are now dealing solely with UCSC’s versions of the Ensembl database, we might be able to use BioMart to extract the same information.

BioMart is an API which allows us to access a huge range of genomic information conveniently within R. There are two advantages to using BioMart: 1) it provides a simple-to-use interface with a small number of simple R functions or commands, and 2) by combining a whole range of different databases under the Biomart interface, we don’t have to trawl through the database documentation for all of these different sources and it means that we don’t have to be familiar with differing schemas or figure out how to integrate data across multiple sources. Big win.

If there is a downside, it is perhaps that you have to learn Biomart. Not that this is difficult. But if you are already comfortable with SQL and happy to familiarise yourself with different databases, then SQL might be a more comfortable solution for you. However, I have to keep in mind that I won’t always be here to maintain this code and that other people will probably pick it up, use it themselves or tweak it for their own projects. In this case, I think that a simplified API, like Biomart, makes a lot of sense.

Becoming familiar with Biomart is a lot like becoming familiar with a new database schema. There are a number of functions that will probably be very useful initially. The ones that I used a lot were: listDatasets, listFilters and listAttributes. I also used grep() alongside these to filter out the bits I didn’t want, for example:

library("biomaRt")

#listMarts(host="www.ensembl.org") # build 38
listMarts(host="grch37.ensembl.org") # build 37

ensembl <- useMart(biomart="ENSEMBL_MART_ENSEMBL",
                   host="grch37.ensembl.org",
                   path="/biomart/martservice" ,
                   dataset="hsapiens_gene_ensembl")

idx <- grep("hsapiens", listDatasets(ensembl)[, 1], value=F)
listDatasets(ensembl)[idx, ]

Note that Biomart defaults to the latest build of the human genome project (GrCH38). We are still using build 37, which meant I had to specify this when connecting to the data mart.

Here’s another interesting tip to find attributes related to proteins:

idxProtein <- grep("Protein", listAttributes(ensembl)[, 2])

listAttributes(ensembl)[idxProtein,]

It really only took 10 or 15 minutes to wrap my head around the basics of Biomart (with lots of help from the Biomart Vignette). Finally, I managed to hobble together the basic query that I needed:

getBM(attributes=c("chromosome_name", 
                   "start_position", 
                   "end_position", 
                   "external_gene_name",
                   "transcript_count",  
                    "gene_biotype"),
      filters = c("chromosome_name", "start", "end", "transcript_biotype"),
      values = list(chromosome=chr, start=start, end=end, biotype="protein_coding"),
      mart = ensembl)

chromosome_name	start_position	end_positions	external_gene_name	transcript_count	gene_biotype
2	165349322	165478358	GRB14	7	protein_coding
2	165510134	165700189	COBLL1	25	protein_coding
2	165752696	165812035	SLC38A11	10	protein_coding
2	165944032	166060577	SCN3A	7	protein_coding

This returns the protein-coding genes that fall within our region of interest. There are some obvious discrepancies between the data above and that returned from UCSC. First, the start and end positions are slightly different. I could live with small disagreements here, as this is only exploratory at this stage. However, I would want to carefully check that the number of transcripts and exons match between these two sources.

Also, it doesn’t return the protein IDs that map to these. I experimented a little (see below), but was unable to return the same information as I got from UCSC’s tables:

getBM(attributes=c("chromosome_name", 
                   "start_position", 
                   "end_position", 
                   "external_gene_name",
                   "transcript_count", 
                   "gene_biotype", 
                   "protein_id"),
      filters = c("chromosome_name", "start", "end", "transcript_biotype"),
      values = list(chromosome=chr, start=start, end=end, biotype="protein_coding"),
      mart = ensembl)

I haven’t included the results here (saving myself some time). But, this query returns 37 rows, with multiple entries for each of the genes. I don’t doubt that a single gene can code multiple different peptides / proteins and that these will probably have different identifiers. It is also interesting that the protein_id do not match those obtained from UCSC. At this stage, it isn’t clear how to make sense of these results.

Final thoughts

All-in-all, using Biomart was really easy and it looks like a good method for mapping between mutations and relevant protein-coding genes. However, to extend this further and layer on protein information, it looks like querying UCSC directly might be a more elegant solution at this stage.

Follow Blog via Email

Categories

SQL vs. BioMart for querying the human genome

Where we left off previously…

Filtering on protein_coding genes

Adding protein IDs

Using BioMart to query Ensembl

Final thoughts

Leave a comment Cancel reply

Follow Blog via Email

Categories

SQL vs. BioMart for querying the human genome

Where we left off previously…

Filtering on protein_coding genes

Adding protein IDs

Using BioMart to query Ensembl

Final thoughts

Share this:

Related

Leave a comment Cancel reply