Manipulating VCF Files

In this post we cover some very basic scenarios for manipulating VCF files. We begin with a (very) brief description of the VCF format, then look at how to extract targeted regions using tabix. In a future post we will explore some typical workflows for exploring VCF data.

The VCF Format

Wikipedia has a great description of the VCF format, including a nice example file. A VCF file contains a lot of metadata relating to the dataset, the experiment, the quality of the genotype calls and so on. Here is the example file from Wikipedia:

 

example_vcf_wiki

Source: Variant Call Format (Wikipedia, https://en.wikipedia.org/wiki/Variant_Call_Format)

 

I have a massively simplistic view of genomics: everything is essentially a genomic position and a value, i.e. {chromosome, position, value}. So, to me, the VCF boils down to these columns:

example_vcf_wiki
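As a rough sketch of that {chromosome, position, value} view, the snippet below writes a tiny toy VCF and pulls out the chromosome, the position and the first sample's genotype using standard Unix tools. The file names (toy.vcf, toy_simple.tsv) and the two records are made up for illustration. It assumes the usual VCF layout: eight fixed columns, then FORMAT, then one column per sample.

```shell
# Write a tiny toy VCF (illustrative data only, not from a real dataset).
printf '##fileformat=VCFv4.2\n' > toy.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE1\n' >> toy.vcf
printf '1\t1000\t.\tG\tA\t50\tPASS\t.\tGT\t0/1\n' >> toy.vcf
printf '2\t2500\t.\tT\tC\t60\tPASS\t.\tGT\t1/1\n' >> toy.vcf

# Drop the '##' metadata lines, then keep CHROM (column 1), POS (column 2)
# and the first sample (column 10).
grep -v '^##' toy.vcf | cut -f 1,2,10 > toy_simple.tsv
cat toy_simple.tsv
```

For a file with several samples, widening the cut range (for example `cut -f 1,2,10-`) would keep every sample column.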

Subsetting a VCF file

Extracting targeted regions from a VCF file is really easy using tabix. Tabix is a command-line tool for Unix that was originally part of the SAMtools collection and is now distributed with HTSlib. You can download tabix from here.

Make sure that your VCF file has been compressed using bgzip. VCF files from the 1000 Genomes Project have been compressed with bgzip already, so no worries there, but if yours hasn’t, run the following from your terminal:

$ bgzip <your_vcf_file.vcf>

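Once you have a compressed VCF file (bgzip replaces the original with <your_vcf_file.vcf.gz>), you need to index it using tabix, which writes a .tbi index file alongside it:

$ tabix -p vcf <your_vcf_file.vcf.gz>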

To subset our VCF file, we are going to create a text file (in BED format, i.e. {CHR, START, END}) which defines the regions of interest:

targeted_regions

Our targeted_regions.bed file contains 4 regions of interest, where the columns (from left to right) are chromosome, start position and end position. Thus, we have 4 regions, on chromosomes 1, 2, 10 and 15.
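A file like this can be written by hand in any text editor, or from the terminal. The sketch below creates a four-line BED file; the coordinates are placeholders invented for illustration (only the chromosomes match the description above), so substitute your own regions. The columns must be tab-separated, and the chromosome names have to match those used in the VCF (for example 1 versus chr1).

```shell
# Hypothetical regions: chromosome, start, end (tab-separated, no header).
# BED starts are 0-based and ends are exclusive.
printf '1\t1000000\t2000000\n'    > targeted_regions.bed
printf '2\t5000000\t5500000\n'   >> targeted_regions.bed
printf '10\t750000\t900000\n'    >> targeted_regions.bed
printf '15\t3000000\t3200000\n'  >> targeted_regions.bed

# Sanity check: every line should have exactly three fields and start < end.
awk -F '\t' 'NF != 3 || $2 + 0 >= $3 + 0 { bad = 1; print "bad line " NR ": " $0 }
             END { exit bad }' targeted_regions.bed \
  && echo "targeted_regions.bed looks OK"
```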

Extracting the variants from the VCF file that fall within these regions is then as simple as running the following tabix command and redirecting the output to a file (add -h if you also want the VCF header lines in the output):

$ tabix -R targeted_regions.bed <your_vcf_file.vcf.gz> > <your_subset_file.vcf>
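To make clear what the region filter selects, here is a plain awk sketch that keeps the VCF records whose position falls inside any BED interval. This is not how tabix works (tabix uses its index to jump straight to each region rather than scanning the whole file), and the file names and records are toy values invented for illustration. It could serve as a slow cross-check on a small, uncompressed file.

```shell
# Toy inputs (made-up data): two regions and three variant records.
printf '1\t999\t2000\n10\t0\t500\n' > toy_regions.bed
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf '1\t1000\t.\tG\tA\t50\tPASS\t.\n'
  printf '1\t2001\t.\tT\tC\t60\tPASS\t.\n'
  printf '10\t500\t.\tC\tT\t40\tPASS\t.\n'
} > toy_input.vcf

# First pass reads the BED file; second pass filters the VCF.
# BED intervals are 0-based and half-open, VCF positions are 1-based,
# so a record is inside a region when start < POS <= stop.
awk -F '\t' '
  NR == FNR { n++; chr[n] = $1; start[n] = $2 + 0; stop[n] = $3 + 0; next }
  /^#/      { print; next }   # keep header lines
  {
    pos = $2 + 0
    for (i = 1; i <= n; i++)
      if ($1 == chr[i] && pos > start[i] && pos <= stop[i]) { print; break }
  }
' toy_regions.bed toy_input.vcf > toy_subset.vcf

# Show the surviving records (header lines hidden).
grep -v '^#' toy_subset.vcf
```

With these toy inputs, the records at 1:1000 and 10:500 are kept, while 1:2001 falls just past the end of the first region and is dropped.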
