Clustering techniques for variant calling

The verification of variant calling is both subjective and time-consuming. Whilst Genome Studio is able to automatically determine genotypes from Illumina genotyping chips, the quality of these clustering predictions can be highly variable, which necessitates visual confirmation and adjustment. With millions of variants to consider, quality control can be a massive drain on our time and energy. So finally, I have stumbled across a suitable problem for my master's thesis: to understand the inherent problems that underpin variant clustering and why it is so hard to get right.

Strip away the complications (science in general, and genomics in particular, is full of “complications” and uncertainties) and variant calling is nothing more than a clustering problem. I could (and probably will) write a whole post on the origin of the data, but for now suffice it to say that our input data is a set of intensities for every sample at every location, as shown below:



Above, we show the variant profiles of ~500 samples at 4 SNPs. The two columns represent the same data in different coordinate systems. The left-hand column uses Cartesian coordinates, while the right-hand column uses polar coordinates (the angle of deviation from a pure AA signal). In each of these four profiles, we can clearly observe defined clusters. In polar coordinates, the left-hand, central and right-hand clusters correspond to genotypes AA, AB and BB respectively. Similarly, in Cartesian coordinates, clusters centred about (0, 1), (0.5, 0.5) and (1, 0) correspond to AA, AB and BB respectively.
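To make the relationship between the two columns concrete, here is a minimal sketch of the coordinate change. It assumes the convention described above, where a pure AA signal sits at (0, 1) and the angle is measured away from it. The function name `to_polar`, the rescaling of the angle to the range 0 to 1, and the use of x + y as the overall intensity are my own assumptions for illustration, not necessarily what Genome Studio does internally.

```python
import math


def to_polar(x, y):
    """Convert a pair of allele intensities (x, y) to (theta, r).

    theta is the angle away from the y axis (the assumed pure-AA direction),
    rescaled so that 0 is pure AA, 0.5 is AB and 1 is pure BB.
    r is taken to be the summed intensity x + y (an assumption).
    """
    # atan2(x, y) measures the angle from the y axis towards the x axis.
    theta = (2.0 / math.pi) * math.atan2(x, y)
    r = x + y
    return theta, r


# The three idealised cluster centres from the text.
for name, (x, y) in {"AA": (0.0, 1.0), "AB": (0.5, 0.5), "BB": (1.0, 0.0)}.items():
    theta, r = to_polar(x, y)
    print(name, round(theta, 3), round(r, 3))
```

The appeal of the polar view is that the genotype is carried almost entirely by theta, so the clustering problem becomes close to one-dimensional.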

If we were to visually label each sample in each of these plots as AA, AB or BB, then intuitively I think we would be pretty comfortable with our decisions. Clearly, the top variant (exm1562540) exhibits a pure BB signal. rs2829735 and rs737407 show three strong and well-defined signals, one for each of AA, AB and BB. The third variant, rs2834343, appears to have only the AB and BB signals. But while we might be quite happy with these calls, we can also see some areas of concern:

  • exm1562540 has 3 signals around where we would expect an AB signal and there is one outlier to the bottom of the plot window.
  • rs2829735 shows strong clustering about AA, AB and BB, but there is also some scatter particularly around the AB cluster.
  • rs2834343 shows strong AB and BB signals, but there are also 6 samples around where we would expect an AA signal.
  • rs737407 is pretty clean – happy with this.

The challenge, then, is what to do with these slightly odd points. Are they legitimate signals, or errors of some kind? If they are legitimate, we have to decide whether a small handful of loosely clustered points is really important. How many samples do we need in a cluster before we can confidently call it a cluster (i.e. at what allele frequency does a variant become meaningful to study as a potential risk factor for disease)?

These are all obvious questions to ask, but they are also the core challenges in any clustering problem. Clustering is all about finding distinct groups within your data. Where groups are obviously different from one another, clustering should be a reasonably straightforward process (for example, variant rs737407). However, clustering becomes problematic when groups overlap, when the boundaries between groups are not well defined, or when there are only a handful of observations. In these harder cases, it may be difficult to make a confident call as to which group an observation belongs to, or it may even be impossible to distinguish two groups from one another.
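As an illustration of the easy case, here is a minimal sketch of a one-dimensional k-means on the polar angle. This is a generic textbook method that I am using as a stand-in, not the algorithm Genome Studio uses. The function name `kmeans_1d`, the starting centres of 0, 0.5 and 1, and the toy angle values are all assumptions made up for this example.

```python
def kmeans_1d(values, centres=(0.0, 0.5, 1.0), iterations=20):
    """Cluster 1-D values by repeatedly assigning each to its nearest centre.

    Returns the final centres and one cluster index per input value.
    """
    centres = list(centres)

    def nearest(v):
        # Index of the centre closest to v.
        return min(range(len(centres)), key=lambda i: abs(v - centres[i]))

    for _ in range(iterations):
        groups = [[] for _ in centres]
        for v in values:
            groups[nearest(v)].append(v)
        # Move each centre to the mean of its members; an empty cluster
        # (e.g. a variant with no AA samples) keeps its previous centre.
        centres = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]

    labels = [nearest(v) for v in values]
    return centres, labels


# Toy, made-up angles forming three well-separated groups.
thetas = [0.02, 0.05, 0.03, 0.48, 0.52, 0.50, 0.97, 0.95]
centres, labels = kmeans_1d(thetas)
print(labels)
```

Note that this approach forces every observation into some cluster, however far it sits from the centre. That is fine for a clean variant, but it is exactly what goes wrong with overlapping groups and outliers.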

Currently, Genome Studio is the industry standard for clustering variants from Illumina chips. Overall, Genome Studio’s clustering methods are extremely successful at making variant calls. However, there are occasions where it gets the groupings slightly wrong and they need adjustment. There are also times when it is unable to confidently determine a cluster, despite there being a visually obvious one. And there are times when it picks up outliers that may or may not be of interest. Because of these infrequent mistakes, standard practice is to manually verify the calls made by Genome Studio. Obviously, you don’t need to check every single variant, only those that Genome Studio reports as being suspect. But this can still mean checking many hundreds or even thousands of plots.

Having gone through this verification process, there are a number of things that I noticed:

  • first, Genome Studio is very, very good. It gets the vast majority of calls correct.
  • second, many of the low-confidence calls (e.g. those with a clustering score (GenTrain score) < 0.7) are in fact perfectly fine from a practical standpoint. For those that are fine, there is nothing to be gained by adjusting cluster positions / boundaries in an effort to maximise the GenTrain score.
  • third, any manual adjustment must be based on the data and not on some imagined ideal or “intuitive” feel. I would go further and say that manual adjustments should be based on obvious patterns in the data, not on edge cases. This means that manual clustering is effectively based on qualities of the data. Why can’t we model these qualities? How is it that the clustering algorithms are missing these trends?
  • finally, most manual adjustments are an attempt to maximise the call rate on the bulk of the data. For example, where there is overlap at the edges of two clusters, we are more likely to restrict the size of the clusters, thus sacrificing the observations in the overlapping region in order to confidently define two distinct (and tight) clusters.


With these observations in mind, it seems to me that the process of manual verification / adjustment is mostly about sharpening the boundaries and excluding the edge cases. It is about making sure that when we make a call, we can do so with high confidence. The flip side, obviously, is that the manual adjustment process will also reject calls on outliers and edge cases. In essence, we are adopting a more conservative call set at the end of the quality control process.
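The trade-off described above can be sketched in a few lines: call a genotype only when a sample sits close to a cluster centre, and otherwise refuse to call it. This is a deliberately crude illustration of the idea, not a proposed method. The function names, the fixed centres at 0, 0.5 and 1 on the polar angle, the `max_distance` threshold of 0.1, and the toy angle values are all hypothetical.

```python
GENOTYPES = ("AA", "AB", "BB")


def conservative_call(theta, centres=(0.0, 0.5, 1.0), max_distance=0.1):
    """Return a genotype, or None (a no-call) if theta is too far from every centre."""
    distances = [abs(theta - c) for c in centres]
    best = min(range(len(centres)), key=lambda i: distances[i])
    if distances[best] > max_distance:
        return None  # edge case or outlier: sacrificed for confidence
    return GENOTYPES[best]


def call_all(thetas, **kwargs):
    """Call every sample and report the fraction that received a genotype."""
    calls = [conservative_call(t, **kwargs) for t in thetas]
    rate = sum(c is not None for c in calls) / len(calls)
    return calls, rate


# Toy, made-up angles: three clean samples and two sitting between clusters.
calls, rate = call_all([0.02, 0.27, 0.50, 0.93, 0.70])
print(calls, rate)
```

Tightening `max_distance` makes each remaining call more trustworthy but lowers the call rate, which is the question of how conservative we can afford to be.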

The four example variants shown above are pretty clear-cut in my opinion, but they do begin to show some of the challenges inherent in clustering SNPs. Overlapping regions, widely dispersed observations and small clusters are the primary problems, and it is these that Genome Studio struggles with. But if we take a more conservative approach to clustering, I wonder whether we can improve the quality of the final clusters and reduce the amount of manual labour required to verify the calls. And just how conservative can we be before we begin to lose valuable information? These are the questions I aim to tackle in my thesis.




