Examples¶
This chapter contains several examples for the most important use cases of VCFPy.
Reading VCF Files¶
The following is an example for reading VCF files and writing out a TSV file with the genotype calls of all SNVs.
You can find the example Python and VCF file in the sources below the directory examples/vcf_to_tsv
.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import vcfpy
# Open file, this will read in the header
reader = vcfpy.Reader.from_path('input.vcf')
# Build and print header
header = ['#CHROM', 'POS', 'REF', 'ALT'] + reader.header.samples.names
print('\t'.join(header))
for record in reader:
if not record.is_snv():
continue
line = [record.CHROM, record.POS, record.REF]
line += [alt.value for alt in record.ALT]
line += [call.data.get('GT') or './.' for call in record.calls]
print('\t'.join(map(str, line)))
The program call looks as follows.
$ ./vcf_to_tsv.py
#CHROM POS REF ALT BLANK NA12878 NA12891 NA12892 NA19238 NA19239 NA19240
chr22 42522392 G A 0/0 0/1 0/1 0/0 0/0 0/0 0/0
chr22 42522597 C T 0/1 0/0 0/0 0/0 0/0 0/0 0/0
chr22 42522613 G C 0/1 0/1 0/0 0/1 0/1 0/1 0/1
chr22 42523003 A G 0/1 1/1 0/1 0/1 0/1 0/1 0/1
chr22 42523209 T C 0/1 1/1 0/1 0/1 0/1 0/1 0/1
chr22 42523211 T C 0/0 0/1 0/1 0/0 0/0 0/0 0/0
chr22 42523409 G T 0/1 0/1 0/0 0/1 0/1 0/1 0/1
chr22 42523491 C T 0/1 0/0 0/0 0/0 0/0 0/0 0/0
chr22 42523507 A G 0/1 0/0 0/0 0/0 0/0 0/0 0/0
chr22 42523805 C T 0/0 0/0 0/1 0/0 0/0 0/0 0/0
chr22 42523943 A G 0/1 1/1 0/1 0/1 0/1 0/1 0/1
chr22 42524435 T A 0/1 0/0 0/0 0/0 0/0 0/1 0/1
[...]
Writing VCF Files¶
The following shows how to add values to the FILTER
column to records of an existing VCF file.
Adding to existing records is simpler than constructing them from scratch, of course.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import vcfpy
# Open input, add FILTER header, and open output file
reader = vcfpy.Reader.from_path('input.vcf')
reader.header.add_filter_line(vcfpy.OrderedDict([
('ID', 'DP10'), ('Description', 'total DP < 10')]))
writer = vcfpy.Writer.from_path('/dev/stdout', reader.header)
# Add "DP10" filter to records having less than 10 reads
for record in reader:
ad = sum(c.data.get('DP', 0) for c in record.calls)
if ad < 10:
record.add_filter('DP10')
writer.write_record(record)
The program call looks as follows.
##fileformat=VCFv4.3
##contig=<ID=20,length=62435964>
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
##FILTER=<ID=DP10,Description="total DP < 10">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3:40,30
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4:40,30
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2:40,30
20 1234567 microsat1 GTC G,GTCT 50 PASS;DP10 NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
Jumping in Tabix-indexed Files¶
The following shows a small program that extracts a genomic region from the input VCF file and writes it to stdout.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import vcfpy
# Open input, add FILTER header, and open output file
reader = vcfpy.Reader.from_path('input.vcf.gz')
writer = vcfpy.Writer.from_path('/dev/stdout', reader.header)
# Fetch region 20:1,110,694-1,230,237. Note that the coordinates
# in the API call are zero-based and describe half-open intervals.
for record in reader.fetch('20', 1110695, 1230237):
writer.write_record(record)
The program call looks as follows.
##fileformat=VCFv4.3
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=1,length=249250621>
##contig=<ID=2,length=243199373>
##contig=<ID=20,length=62435964>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4:40,30
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2:40,30