Association studies are a favourite tool for geneticists to understand the genetics that determine our health. It is now routine to test every mutation amongst your patients for association with a given trait (for example BMI or breast cancer). The results of these studies are usually reported using Manhattan plots:
Or a similar circos plot:
Areas of high association are indicated by the peaks (for example, those labelled in the above heart-rate Manhattan plot). Typically, a geneticist will zoom in on these peaks using a ‘locus zoom’-style plot:
These plots are certainly very interesting. They tell us where there are significant associations, i.e. the regions which are most likely related to the trait / disease being studied. They also give us a feel for how far the signal extends, whether it covers multiple genes or falls between genes.
However, whilst the Manhattan / locus zoom plots are great for identifying areas / genes of potential interest, there is a lot more to the story that needs to be revealed. The first, and widely published, criticism is that the majority of known associations do not map to (or cover) an obvious candidate gene. People then argue that most of the known associations are involved in regulating gene expression, rather than being the more obviously hoped-for functional mutations (i.e. mutations that lead to non-functional proteins). Tackling these issues is an ongoing area of research, and is really interesting. What I want to focus on instead, for the rest of this post, is the extraction of meaningful information / insight from a standard association study.
It’s more than significance
Finding a significant association is easy (huge datasets with millions of mutations…). And the Manhattan plot is a key visualisation to help easily identify regions. But beware of over-interpreting a Manhattan plot. It is easy to start to assign ‘importance’ to the different regions and rank them based on the p-values. Statistically, this is dubious, because a p-value reflects sample size as much as it reflects the strength of an association. Far more interesting than the p-values are the estimated effect sizes: that is, what are the relative odds of having a disease given a certain mutation?
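To make the difference between significance and effect size concrete, here is a minimal sketch in plain Python. The counts are made up for illustration (they are not from any real study), and the helper names `odds_ratio` and `chi2_p_value` are my own. It computes the odds ratio and a 1-degree-of-freedom chi-square p-value for a 2x2 table of carriers / non-carriers against cases / controls, first for a small cohort and then for the same cohort scaled up tenfold.

```python
import math

def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table.

    a = carriers with disease,     b = carriers without disease
    c = non-carriers with disease, d = non-carriers without disease
    """
    return (a * d) / (b * c)

def chi2_p_value(a, b, c, d):
    """P-value of the Pearson chi-square test (1 df, no continuity correction)."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))

# Hypothetical counts: the large cohort is the small one multiplied by 10.
small = (30, 70, 20, 80)
large = tuple(10 * x for x in small)

or_small, p_small = odds_ratio(*small), chi2_p_value(*small)
or_large, p_large = odds_ratio(*large), chi2_p_value(*large)

print(f"small cohort: OR={or_small:.2f}, p={p_small:.3g}")
print(f"large cohort: OR={or_large:.2f}, p={p_large:.3g}")
```

The odds ratio is identical in both cohorts, but only the large cohort gives a small p-value. Ranking regions by p-value therefore partly ranks them by how much data sits behind each one.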
There are 3 scenarios which I think are really interesting:
Scenario A: A disease, D, is prevalent in approximately 12% of men and 8% of women. There is a clear genetic association in both men and women with a given SNP, S.
Scenario B: A disease, D, is prevalent in approximately 12% of men and 8% of women. There is a clear genetic association in men, such that approximately 70% of afflicted men carry the BB genotype. However, there is no genetic association in women.
Scenario C: A disease, D, is prevalent in approximately 12% of men and 8% of women. There is a clear genetic association in men, such that approximately 65% of afflicted men carry the BB genotype. However, in women, this is reversed, with approximately 75% of afflicted women carrying the AA or AB genotype.
These scenarios are a good example of “Simpson’s paradox”, where your conclusions can differ if you don’t take account of natural groupings or structure in your data. If we ignore the effects of gender above, then we will draw the wrong conclusions in Scenarios B and C. In each of these scenarios, we need to capture both the genetic effect and the differences between the genders. An effective visualisation, or interactive tool, needs to be able to clearly communicate these differences.
Accurately Capturing Gender Effects
This is easily done, but basic 101 statistical methods are sometimes an afterthought in a biology lab. Hopefully, we will stop for a second and think about the kinds of patterns we hope to find and then ensure that our models are capable of giving accurate results. To be really concrete about this, let’s examine the different results that arise using our 3 scenarios:
Simulate some data
For each scenario above, I have simulated a very basic dataset summarised below:
Each dataset contains just 3 columns:
- the disease state (D: [0,1]), where D=1 represents afflicted patients
- the genotype (SNP: [0,1]), where SNP=1 represents the BB genotype, and SNP=0 represents the AA/AB genotypes (a simplification, but assume a homozygous recessive trait)
- the gender (G: [0,1]), where G=0 is male and G=1 is female.
All of the datasets share the same schema. We will model each feature as discrete variables (factors in R), and the response (D) as a binomial response.
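For readers who want to build a dataset with this schema themselves, here is a minimal sketch in plain Python. It is not the simulation used for this post: the BB frequency (30%), the per-group risks, the seed and the function names are all assumptions chosen so that the result looks roughly like Scenario B (about 12% prevalence in men with roughly 70% of afflicted men carrying BB, and a flat 8% risk in women).

```python
import random

# Hypothetical disease risks P(D=1 | G, SNP), keyed by (G, SNP).
# G: 0 = male, 1 = female. SNP: 1 = BB, 0 = AA/AB.
SCENARIO_B_RISK = {
    (0, 1): 0.28,  # men carrying BB: elevated risk
    (0, 0): 0.05,  # men carrying AA/AB
    (1, 1): 0.08,  # women: same risk whatever the genotype
    (1, 0): 0.08,
}

def simulate(n, risk, p_bb=0.3, seed=1):
    """Draw n patients as dicts with the columns D, SNP and G."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        g = rng.randint(0, 1)                    # gender, 50/50
        snp = 1 if rng.random() < p_bb else 0    # BB with probability p_bb
        d = 1 if rng.random() < risk[(g, snp)] else 0
        rows.append({"D": d, "SNP": snp, "G": g})
    return rows

def prevalence(rows, gender):
    """Fraction of afflicted patients within one gender."""
    group = [r["D"] for r in rows if r["G"] == gender]
    return sum(group) / len(group)

data = simulate(20000, SCENARIO_B_RISK)
print("male prevalence:  ", round(prevalence(data, 0), 3))
print("female prevalence:", round(prevalence(data, 1), 3))
```

Swapping in a different risk table (for example, a lower risk for BB women than for AA/AB women) gives a Scenario C-style dataset with the same three columns.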
Let’s first consider Scenario A, where there is a consistent genetic effect across both genders. We would expect a significant SNP estimate and a non-significant gender estimate:
The results here are sensible. Patients with the BB genotype have an odds ratio of ~ 2.27, meaning their odds of having the disease are more than 2x those of AA/AB patients. Females are no more likely to have the disease than males.
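An odds ratio is a ratio of odds, not of probabilities, so it helps to see what ~2.27 means in terms of actual risk. A logistic regression reports its coefficients on the log-odds scale, and exponentiating a coefficient gives the odds ratio. The sketch below does that conversion and then turns the odds ratio into a risk for carriers. The 10% baseline risk is an assumed figure for illustration only, and the function names are my own.

```python
import math

def odds_ratio_from_coef(beta):
    """Convert a logistic regression coefficient (log-odds) to an odds ratio."""
    return math.exp(beta)

def risk_given_odds_ratio(baseline_risk, odds_ratio):
    """Risk in the exposed group, given the baseline risk and an odds ratio."""
    baseline_odds = baseline_risk / (1 - baseline_risk)
    exposed_odds = baseline_odds * odds_ratio
    return exposed_odds / (1 + exposed_odds)

beta = math.log(2.27)              # the coefficient that corresponds to OR ~ 2.27
or_bb = odds_ratio_from_coef(beta)

baseline = 0.10                    # assumed risk for AA/AB patients
risk_bb = risk_given_odds_ratio(baseline, or_bb)

print(f"odds ratio: {or_bb:.2f}")
print(f"risk for BB carriers: {risk_bb:.3f}")
print(f"relative risk: {risk_bb / baseline:.2f}")
```

At this baseline the relative risk comes out close to 2, a little below the odds ratio. The two measures stay close when the disease is rare and drift apart as it becomes more common.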
Next, consider Scenario B. In this scenario, men have a genetic risk of the disease, but women do not. Here, we would expect a significant SNP estimate and a significant gender estimate.
As we hoped, there is a significant SNP effect. But these results suggest that females are just as likely as males to get the disease. We know this isn’t the case. If we add an interaction term, SNP:gender, this paradox is resolved:
The interaction term gives an odds ratio of ~0.36, which suggests that the BB genotype raises the odds of disease far less in women than it does in men. Success.
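One way to read an interaction odds ratio: in a logistic model with SNP, gender and their interaction, the exponentiated interaction coefficient equals the SNP odds ratio in women divided by the SNP odds ratio in men. The sketch below shows this with two 2x2 tables. The counts are hypothetical (they describe a Scenario B-like cohort of 1,000 men and 1,000 women, not the data behind the ~0.36 above), and the helper name is my own.

```python
def odds_ratio(a, b, c, d):
    """a = BB cases, b = BB controls, c = AA/AB cases, d = AA/AB controls."""
    return (a * d) / (b * c)

# Hypothetical counts per gender: (BB cases, BB controls, AA/AB cases, AA/AB controls)
men = (84, 216, 35, 665)     # 12% prevalence, ~70% of cases carry BB
women = (24, 276, 56, 644)   # 8% prevalence, same risk for both genotypes

or_male = odds_ratio(*men)
or_female = odds_ratio(*women)

# The interaction odds ratio is the ratio of the two stratum odds ratios.
interaction_or = or_female / or_male

print(f"SNP odds ratio in men:   {or_male:.2f}")
print(f"SNP odds ratio in women: {or_female:.2f}")
print(f"interaction odds ratio:  {interaction_or:.2f}")
```

An interaction odds ratio below 1 therefore says that the SNP effect is weaker in women than in men. On its own it does not compare the overall disease risk of women with that of men.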
Scenario C is possibly the most interesting case, where the genetic effect is reversed for men compared to women. So we would expect a significant SNP effect and a significant gender effect:
These results are really interesting. We get a significant gender effect as expected. But for the first time, the SNP effect is non-significant. This happens because the trend in women is cancelling out the trend in men; a straight additive model estimates a single SNP effect shared by both genders, so it cannot detect the opposing within-gender trends. Again, however, this is resolved by adding an interaction term:
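The cancellation can be reproduced by hand. The sketch below uses the Mantel-Haenszel common odds ratio as a stand-in for the gender-adjusted SNP estimate of an additive model; it is a different estimator from the logistic regression used in this post, but it answers the same question (one SNP effect, adjusted for gender). The counts are deliberately symmetric toy numbers, not the simulated Scenario C data, and the function names are my own.

```python
def odds_ratio(a, b, c, d):
    """a = BB cases, b = BB controls, c = AA/AB cases, d = AA/AB controls."""
    return (a * d) / (b * c)

def mantel_haenszel_or(tables):
    """Common odds ratio across strata (here: genders)."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return num / den

# Toy counts chosen so the two genders mirror each other exactly.
men = (80, 220, 40, 660)     # BB raises the odds of disease
women = (40, 660, 80, 220)   # BB lowers the odds of disease

or_male = odds_ratio(*men)
or_female = odds_ratio(*women)
or_adjusted = mantel_haenszel_or([men, women])

print(f"men:      OR = {or_male:.2f}")
print(f"women:    OR = {or_female:.2f}")
print(f"adjusted: OR = {or_adjusted:.2f}")
```

Each gender shows a strong effect, but in opposite directions, so the single adjusted estimate lands at 1 and looks like no effect at all.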
Now, as expected, both the SNP and gender effects are captured. More specifically, the interaction odds ratio of ~ 0.1 tells us that the BB genotype (SNP = 1) has a far smaller effect on the odds of disease in females than it does in males. In terms of the biology, this is a really interesting result.
There are a few key points here that I want to very quickly summarise.
- p-values alone are not very informative. They don’t tell us anything about the strength of association or the direction of association.
- Currently, the visualisation of p-values is standard practice in genetics. At best, this is a limited representation of the trends in the data. At worst (as in scenarios B and C), this would completely misrepresent the female population.
- Realistically, we need to extend our visualisation methods to better communicate what is really going on. This means that where there are significant signals, we also need to be communicating the direction of association by including odds ratios for meaningful covariates (e.g. gender). This should be quite easy to do: we could use colour, shape, we could flip the points to below the x-axis… There are so many options.
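As a rough sketch of the "flip the points below the x-axis" idea: keep the usual -log10(p) height, but give it a negative sign whenever the odds ratio is below 1. The function name and the per-gender results are hypothetical, and the output is just the y-value you would hand to whatever plotting tool you use.

```python
import math

def signed_log_p(p_value, odds_ratio):
    """-log10(p), mirrored below zero when the odds ratio is under 1."""
    strength = -math.log10(p_value)
    return strength if odds_ratio >= 1 else -strength

# Hypothetical results for one SNP, fitted separately in each gender:
# (p-value, odds ratio)
results = {"male": (1e-8, 5.5), "female": (1e-6, 0.3)}

heights = {sex: signed_log_p(p, or_) for sex, (p, or_) in results.items()}

for sex, y in heights.items():
    print(f"{sex}: y = {y:+.1f}")
```

Plotted this way, a SNP that raises risk in men and lowers it in women shows up as one peak above the axis and one below it, instead of two indistinguishable peaks.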
Finally, watch this space. Locus Zoom have recently added a great interactive version of their tool. I hope to be able to push our own visualisation tools even further by adopting some of the ideas in this post.