PLINK Cheat Sheet

PLINK is an amazing tool to help manipulate and analyse genomic datasets. I am blown away with the vast functionality of PLINK and the extent of their documentation. The various options and some of the typical workflows are well documented in PLINK, but even so, navigating their encyclopedic documentation is somewhat daunting. This blog then will be my cheat sheet, a place to note down common methods / workflows and why I did them. Today, the blog will be short, but this is one that will likely grow over time.

1. The PLINK data format

Basic data formats: http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml

Note that the row / column ORDER of .ped and .fam files is essential!

2. Converting PLINK (.ped, .fam, .bed, .bam) files to analysis-ready VCF files

PLINK is ridiculously good for almost all of your everyday genomics data manipulation (merge, filter, compare, analyse…). But sometimes it is useful to be able to analyse the data outside of PLINK. The following PLINK operation will convert the PLINK data formats into VCF format:


plink --bfile inputGenotypeData --recode vcf-iid --out outputEncodedGenotypes

The command above takes inputGenotypeData (.bed, .bam PLINK files), recodes the data to VCF format and strips out the family ids. The transformed data is written to outputEncodedGenotypes.vcf . An example of the output is shown below:

 

#CHROM POS ID Sample_1 Sample_2 Sample_3
1 564766 exm2216283 0/0 0/1 0/0
1 564862 exm2216284 0/0 0/0 1/1
22 51169684 exm1624265 0/0 0/0 0/1

The first 9 columns include relevant information such as the chromosome, position, SNP ID, reference allele… The remaining columns include the genotype for each sample, stored by column. The genotypes are encoded with either a 0 (major allele) or 1 (minor allele), resulting in the following combinations: (0/0): homozygous for the major allele; (0/1): heterozygous; (1/1): homozygous for the minor allele; (./.): missing genotype. These encodings can then be transformed to {0, 1, 2, 3}, for example in R:

library(data.table)

genotypes <- fread("outputEncodedGenotypes.vcf",
                    header=TRUE,
                    data.table=FALSE)

recodeGenotype <- function (column) {
    factor(genotypes[, column], labels=0:3)
}

for (column in 10:ncol(genotypes)) {
    genotypes[, column] <- recodeGenotype(column)
}

3. Calculating LD for 1000 Genomes data using PLINK2

The goal here is to calculate the pairwise LD between SNPs within the 1000 Genomes project. As an example, we will explore the British LD patterns in chromosome 12 between base positions: 150000 – 175000. Note that the steps below are all written for the bash shell.

Step 1: download the relevant data from 1000 Genomes:

# variables
CHR=12
START=150000
END=175000

# popfile is the 1000 Genomes reference panel which maps individuals with
# their relevant population
POPFILE=/mnt/DataDrive/ReferenceData/1000Genomes_PopSample_Information.20130502.ALL.panel
VCFFILE=genotype_data_${CHR}_${START}-${END}.vcf
SAMPLEFILE=sample_list.txt

# extract British sample ids from POPFILE
grep GBR ${POPFILE} | cut -f1 > ${SAMPLEFILE}

# PLINK2:
#    --keep-fam: filter to keep British samples
#    --snps-only no-DI: removes indels
#    --r2: calculates LD
#    --out: output filename
plink --vcf ${VCFFILE} \
      --keep-fam ${SAMPLEFILE} \
      --snps-only no-DI \
      --r2 \
      --out ${VCFFILE%.vcf}

The above script will create an output file: genotype_data_12_150000-175000.ld, which will contain the pairwise LD values for each SNP in this region. We can red this in and visualise it in R:

ldParse <- function (ldFile) {

    # Reads in LD data, creates LD-based dissimilarity matrix, plots and clusters.
    #
    # Parameters:
    # -----------
    #    ldFile: filename
    #        LD output from PLINK2, requires the following columns:
    #        (SNP_A, SNP_B, R2) where R2 is the LD (R^2) between SNPs A and B.
    #
    # Output:
    # -------
    #    N x N dissimilarity matrix based on LD.

    # read in data, transform into NxN matrix of SNPs
    data <- read.table(ldFile, header=TRUE)
    data <- reshape2::dcast(data[, c("SNP_A", "SNP_B", "R2")], SNP_A ~ SNP_B, value.var = "R2")

    rownames(data) <- data[, 1]     # wrangle the data into a dissimilarity matrix
    data <- as.matrix(data[, -1])   # --------------------------------------------
    data[is.na(data)] <- 0          # set missing LD values as minimum LD
    data <- 1 - data                # create dissimilarity measure

    return (data)
}
heatmap(ldParse("genotype_data_12_150000-175000.ld"))

example_heatmap

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: