- to merge genotype calls from separate VCF files (e.g. one VCF file per sample) into one master VCF file with a column for each sample.
- to filter this master VCF file and extract regions of interest.
EDIT: I have updated this post to include a workflow using bioconda’s bcftools within my GWrangle Docker image (see the bottom of the post).
For this example, I have 7 VCF files – one for each sample. All of these files are identical, except for the genotype columns (i.e. the CHROM, POS, RSID, REF etc. columns are all the same).
Note that the VCF files have been compressed using bgzip, and indexed using tabix:
$ bgzip *.vcf
$ parallel tabix -p vcf ::: *vcf.gz
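bcftools merge needs every input file to be bgzip-compressed and indexed, so it can be worth confirming that each compressed file has an index beside it before moving on. A minimal sketch in plain shell; the function name `missing_index` is made up for illustration, and it assumes tabix wrote its default `.tbi` index next to each file:

```shell
# List every *.vcf.gz in the current directory that has no tabix index (.tbi) beside it.
# missing_index is a hypothetical helper name, not part of bcftools or tabix.
missing_index() {
    for f in *.vcf.gz; do
        [ -e "$f" ] || continue          # the glob matched nothing
        [ -e "$f.tbi" ] || echo "$f"     # report files that lack an index
    done
}

missing_index
```

If this prints nothing, every compressed VCF in the directory has an index.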
Here we will create a master VCF file which contains the genotype calls for each sample in separate columns. There are a number of tools for this (vcfintersect, vcftools, vcf-merge, bcftools); we are going to use bcftools, as it is faster than most of the older toolsets.
First, we will create a tab-separated file with the regions that we want to extract:
We have saved this file as regions.bed.
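The contents of the regions file are not reproduced here, so the following is only an illustrative sketch with made-up coordinates. It assumes the three-column tab-separated layout (chromosome, start, end) that bcftools accepts for `-R`; the chromosome names have to match those in the VCF files exactly (e.g. `1` versus `chr1`).

```shell
# Write a small tab-separated regions file: chromosome, start, end.
# The coordinates below are placeholders, not the regions used in this post.
printf '1\t1000000\t2000000\n'  > regions.bed
printf '2\t5000000\t5500000\n' >> regions.bed

# Display the result.
cat regions.bed
```

According to the bcftools documentation, a regions file with a `.bed` suffix is read with BED coordinates (0-based start, end-exclusive), whereas other tab-delimited region files are treated as 1-based and inclusive, so it is worth checking which convention your coordinates follow.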
Now a single command will a) extract our regions and b) merge all the input VCF files:
$ bcftools merge -R regions.bed *vcf.gz > combined_regions_genotypes.vcf
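A quick sanity check after merging is to confirm that the combined file really has one genotype column per input file (seven in this example). A minimal sketch that reads the `#CHROM` header line with awk; `count_samples` is a made-up helper name, and it assumes an uncompressed VCF with a FORMAT column, so that the first nine columns are fixed fields:

```shell
# Print the number of sample columns in an uncompressed VCF.
# Columns 1-9 of the #CHROM header line are fixed (CHROM ... FORMAT); the rest are samples.
# count_samples is a hypothetical helper name.
count_samples() {
    awk -F'\t' '/^#CHROM/ { print NF - 9; exit }' "$1"
}

# Only run the check if the merged file from the merge step exists.
if [ -f combined_regions_genotypes.vcf ]; then
    count_samples combined_regions_genotypes.vcf
fi
```

If you would rather use bcftools for this, `bcftools query -l combined_regions_genotypes.vcf` lists the sample names themselves.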
Docker, gwrangle, bioconda bcftools
I have created a docker image which contains a whole lot of tools for the wrangling, manipulation and analysis of genomic datasets. One of these tools is bioconda’s bcftools. To install this (from a root session of the image):
# conda install -c bioconda bcftools=1.3.1
Then, to index a VCF file:
$ bcftools index <vcf_file>
And you can then subset the file as before:
$ bcftools view -r <chr:position> <vcf_file>
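If the regions of interest are already in a BED-style file, the `chr:start-end` strings for `-r` can be generated rather than typed by hand. A minimal sketch, assuming standard BED coordinates (0-based start, end-exclusive), which are shifted to the 1-based inclusive form that `-r` expects; `bed_to_regions` is a made-up helper name:

```shell
# Convert tab-separated BED lines (chrom, 0-based start, exclusive end)
# into 1-based inclusive region strings of the form chrom:start-end.
# bed_to_regions is a hypothetical helper name; it reads files given as
# arguments, or standard input when called without arguments.
bed_to_regions() {
    awk -F'\t' '{ printf "%s:%d-%d\n", $1, $2 + 1, $3 }' "$@"
}

# Example with a made-up interval: prints 1:1000-2000
printf '1\t999\t2000\n' | bed_to_regions
```

Several regions can be given to `-r` at once as a comma-separated list.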
Nice and easy.