Databases for finding human protein-coding genes

After approx. 4 months with the Merriman group, I am beginning to get a handle on how they operate and the typical questions that they are interested in. At a very high level, a typical workflow might involve finding interesting genetic variants from raw sequence data, mapping these to interesting areas of the genome and identifying nearby (and hopefully interesting) genes, as shown below:

GeneticVariantWorkflow

The majority of my work so far then, has been focusing on the translation of data from one phase to the next. For example, how can we use machine learning to automate the discovery of interesting variants from raw sequencing data? Can we use density-based algorithms, or distance-based methods, to group these variants into regions of interest? And given these regions, what genes are nearby on the chromosome? None of these are particularly hard questions. But as always, the devil is in the detail.

You would think that given a chromosomal region, with start and end positions, that finding the genes within this region would be relatively straight forward. And at first inspection, you’d be correct. But when you start to dig into this, a whole lot of questions emerge:

  • should you use an annotation package such as GenomicRanges or GenomicFeatures from the bioconductor suite of tools?
  • should you use the Ensembl database or the UCSC database?
  • what version of the human reference genome should you use?
  • if you choose to use hg38 (for example), are the positions recorded in your raw data accurate? Or are they out-of-date?
  • should you be trying to get exactly the same genes as a tool such as LocusZoom?
  • are you interested in only protein-coding genes? What about mRNA genes and non-coding genes?

Some of these questions are easily answered. First, the Merriman group are still working exclusively with build 37 of the human genome. Beyond that, LocusZoom is currently the tool that they are most familiar with and therefore, my work should closely match this. According to Pruim et al (2010) LocusZoom obtains “gene information from the UCSC browser” (p. 2336). Unfortunately, it is not entirely clear which database tables LocusZooms draws its information from. I experimented with the refSeq, refGene and knownGene tables and it seemed that LocusZoom probably uses some combination of these.

 

Digging deeper into the UCSC Browser

One option is perhaps to download LocusZoom and run it locally. However, I am not a big fan of this option for a number of reasons. First, this is reasonably inflexible, when we really want to extend beyond LocusZoom. Second, the local database would need to be maintained and updated on a regular basis. So instead, I think we should go straight to the source and query UCSC directly.

Our goal then is to find protein coding genes within / flanking our regions of interest. Each region is defined by a chromosome number and a start and end position, for example: 11:6500000-6600000.

[UCSC Genes][knownGenes]

Let’s begin with UCSC’s knownGenes table. According to the UCSC documentation, the knownGene table is a compilation “of gene predictions based on data from RefSeq, GenBank, CCDS, Rfam, and the tRNA Genes track”.

The knownGene table  is currently used by LocusZoom (though not exclusively from what I can tell), and is also used by the Broad Institute’s GWAS plotting tool (note that this is out-of-date and based on Build 35 positions).

An example of the data is shown below (queried via R using the UCSC MySQL connection):

library(RMySQL)

qKnownGene <- function (chr, start, end) {

    # Queries the knownGene table from the UCSC Genome Browser
    query <- paste0(" SELECT ", chr, " as CHR
                      name as GeneName,
                      cdsStart,
                      cdsEndStat,
                      txStart as GeneStart,
                      txEnd as GeneEnd,
                      strand,
                      exonCount,
                      proteinID
    FROM knownGene
    WHERE chrom = 'chr", chr, "'
      AND ( (txStart BETWEEN ", start, " AND ", end, ") OR
            (txEnd BETWEEN ", start, " AND ", end, ") OR
            (txStart < ", start, " AND txEnd > ", start, "))
    GROUP BY GeneStart, GeneEnd
    ORDER BY txStart ASC; ")

    return (query)
}

chr <- 11
start <- 64500000
end <- 64600000

# Query UCSC for genes
# Setup connection
options(warn=-1)
drv <- dbDriver("MySQL")
con <- dbConnect(drv, user="genome", host="genome-mysql.cse.ucsc.edu", dbname="hg19", password="")

results <- dbGetQuery(con, qKnownGene(chr, start, end))

# close DB connection
dbDisconnect(con)
options(warn=0)
head(results)

 

CHR GeneName cdsStart cdsEnd GeneStart GeneEnd strand exonCount proteinID
11 uc001oat.3 6449473 64507712 64494682 64508355 12 A6NDC7
11 uc001ou.3 64494773 64507571 64494382 64508973 14 A6NDC7
11 uc009ypu.3 64510338 64507571 64494382 64511630 17 Q7LDG711
11 uc009ypv.3 64494773 64510338 64494382 64511629 17 Q7LDG711
11 uc009ypu.3 64494773 64510338 64494382 64512061 17 Q7LDG711
11 uc001oaw.3 64509533 64511246 64509481 64512061 2

 

There is a lot of information here. I particularly like the mapping to a protein ID. The coding sequence start and end (cdsStart, cdsEnd) columns also give an indication as to which genes are protein encoding. Note this forum which suggests that non-protein ecoding genes have the same values for both cdsStart and cdsEnd.

Fields that I don’t like: the GeneName field which is not easily recognisable and would need to be joined to another table to get understandable gene names. There also seems to be a lot of redundacy here. Note the repeated entries for the two proteins and the overlapping start / end positions.

The knownGene table is a good start, but would require a fair bit more work to join supporting tables and filter the results to a unique set of interest.

[RefSeq Genes][refGene]

According to UCSC’s documentation: “The RefSeq Genes track shows known human protein-coding and non-protein-coding genes taken from the NCBI RNA reference sequences collection (RefSeq). The data underlying this track are updated weekly.”

An example of the data is shown below (queried via R using the UCSC MySQL connection):

library(RMySQL)

qRefGene <- function(chr, start, end) {

    query <- paste0("  SELECT ", chr, " as CHR, ", "
                              name,
                              name2 as GeneName,
                              txStart as GeneStart,
                              txEnd as GeneEnd,
                              strand,
                              exonCount,
                              cdsStart,
                              cdsEnd,
                              cdsStartStat
                      FROM refGene
                      WHERE chrom = 'chr", chr, "'
                        AND ( (txStart BETWEEN ", start, " AND ", end, ") OR
                              (txEnd BETWEEN ", start, " AND ", end, ") OR
                              (txStart < ", start, " AND txEnd > ", start, "))
                      GROUP BY name2
                      ORDER BY txStart ASC;
                   ")
    return (query)
}

chr <- 11
start <- 64500000
end <- 64600000

# Query UCSC for genes
# Setup connection
options(warn=-1)
drv <- dbDriver("MySQL")
con <- dbConnect(drv, user="genome", host="genome-mysql.cse.ucsc.edu", dbname="hg19", password="")

results <- dbGetQuery(con, qRefGene(chr, start, end))

# close DB connection
dbDisconnect(con)
options(warn=0)
head(results)

 

CHR Name GeneName cdsStart cdsEnd strand exonCount GeneStart GeneEnd cdStartStat
11 NM_001098670 RASGRP2 64494382 64512329 17 64494773 64510338 cmpl
11 NM_001164716 PYGM 64513860 64528187 18 64514130 64527370 cmpl
11 NM_004630 SF1 64532075 64546316 13 64533289 64545864 cmpl

 

In terms of relevant information, I like that the refGene table returns the common gene names. Once again, protein-coding genes might be inferred from the cdsStart and cdsEnd columns. In addition, the cdsStartStat and cdsEndStat (not shown) columns alos tell us whether this is a protein encoding gene or not. This forum descirbes the possible values for these two columns, with ‘none’ or ‘unk’ being indicative of non-protein encoding genes.

It would be nice to include the protein names that these genes code for. However, that is a ‘nice-to-have’ that isn’t necessarily needed.

[GENCODE Genes v19][wgEncodeGencodeBasicV19]

According to UCSC’s documentation: “The GENCODE Genes track (version 19, December 2013) shows high-quality manual annotations merged with evidence-based automated annotations across the entire human genome generated by the GENCODE project.”

An example of the data is shown below (queried via R using the UCSC MySQL connection):

chr <- 2
start <- 165400000
end <- 166000000

qEncodeGene <- function (chr, start, end) {
    # Queries the wgEncodeGencodeBasicV19 table from the UCSC Genome Browser
    query <- paste0("  SELECT ", chr, " as CHR, ",
                              "name as GeneName,
                              name2,
                              cdsStartStat,
                              cdsEndStat,
                              txStart as GeneStart,
                              txEnd as GeneEnd,
                              exonCount,
                              exonFrames
                      FROM wgEncodeGencodeBasicV19
                      WHERE chrom = 'chr", chr, "'
                        AND ( (txStart BETWEEN ", start, " AND ", end, ") OR
                              (txEnd BETWEEN ", start, " AND ", end, ") OR
                              (txStart < ", start, " AND txEnd > ", start, "))
                      GROUP BY name2
                      ORDER BY txStart ASC;
                   ")
    return (query)
}

# Query UCSC for genes
# Setup connection
options(warn=-1)
drv <- dbDriver("MySQL")
con <- dbConnect(drv, user="genome", host="genome-mysql.cse.ucsc.edu", dbname="hg19", password="")

results <- dbGetQuery(con, qEncodeGene(chr, start, end))

# close DB connection
dbDisconnect(con)
options(warn=0)
head(results)

 

CHR GeneName name2 cdsStartStat cdsEndStat GeneStart GeneEnd exonCount exonFrames
2 ENST00000263915.3 GRB14 cmpl cmpl 165349325 165478358 14 1,2,1,0…
2 ENST00000375458.2 COBLL1 cmpl cmpl 1655366975 1656979281 3 0,0,1,2..
2 ENST00000384142.1 SNORA70F none none 165544152 1655442871 -1 -1
2 ENST00000384142.1 SNORA70F none none 165544152 1655442871 -1 -1
2 ENST00000303735.4 SLC38A11 cmpl cmpl 165752695 165812035 10 0,0,1,1…
2 ENST00000360093.3 SCN3A cmpl cmpl 165944031 166060577 28 1,0,0,0…

 

This seems to be a really good table. We get the same protein-coding genes as we do in refGene or knownGene. But more importantly, the non-protein coding genes (for example the small RNA gene, SNORA70F) are clearly identified by the cdsStartStat, cdsEndStat and exonFrames columns. Again, it would be nice to have the protein names alongside their genes – but this table is my favourite so far.

An example

Up until now, I have actually been working with the knownGene table. For each region of interest, I have been plotting the LD pattern and annotaing this with the genes from the knownGene table:

Obesity_Obesity_Chromosome2_Loci27

However, you can see here that I have a mixture of protein coding and non-protein coding genes. Here is the revised plot, using the wgEncodeGencodeBasicV19 table and restricting to protein-coding genes:

ExampleZoom

 

Overall – I think this is both a good and logical improvement over my previous versions.

 

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: