vcfpy package

Submodules

vcfpy.exceptions module

Exceptions for the vcfpy module

exception vcfpy.exceptions.HeaderNotFound[source]

Bases: vcfpy.exceptions.VCFPyException

Raised when a VCF header could not be found

exception vcfpy.exceptions.IncorrectVCFFormat[source]

Bases: vcfpy.exceptions.VCFPyException

Raised on problems parsing VCF

exception vcfpy.exceptions.InvalidHeaderException[source]

Bases: vcfpy.exceptions.VCFPyException

Raised in the case of invalid header formatting

exception vcfpy.exceptions.InvalidRecordException[source]

Bases: vcfpy.exceptions.VCFPyException

Raised in the case of invalid record formatting

exception vcfpy.exceptions.VCFPyException[source]

Bases: RuntimeError

Base class for the module’s exceptions
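Because every exception listed above derives from VCFPyException, which itself derives from RuntimeError, a handler for the base class also catches the more specific ones. The sketch below uses local stand-in classes that only mirror the documented hierarchy (the real classes live in vcfpy.exceptions); describe_failure is a hypothetical helper.

```python
# Stand-in classes mirroring the documented hierarchy; illustration only.
class VCFPyException(RuntimeError):
    """Base class for the module's exceptions."""


class InvalidHeaderException(VCFPyException):
    """Raised in the case of invalid header formatting."""


class InvalidRecordException(VCFPyException):
    """Raised in the case of invalid record formatting."""


def describe_failure(exc):
    """Return a short label for an exception (hypothetical helper)."""
    try:
        raise exc
    except InvalidHeaderException:
        # Most specific handler first.
        return "header problem"
    except VCFPyException:
        # Catches every other exception derived from the base class.
        return "other VCF problem"
    except RuntimeError:
        return "unrelated runtime error"
```

Ordering the handlers from most to least specific keeps the base-class handler as a catch-all for the rest of the hierarchy.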

vcfpy.header module

Code for representing the VCF header part

The VCF header class structure is modeled after HTSJDK

class vcfpy.header.AltAlleleHeaderLine(key, value, mapping)[source]

Bases: vcfpy.header.SimpleHeaderLine

Alternative allele header line

Mostly used for defining symbolic alleles for structural variants and IUPAC ambiguity codes

classmethod from_mapping(klass, mapping)[source]

Construct from mapping, not requiring the string value

id = None

name of the alternative allele

class vcfpy.header.CompoundHeaderLine(key, value, mapping)[source]

Bases: vcfpy.header.HeaderLine

Base class for compound header lines, currently FORMAT and INFO header lines

Compound header lines describe fields that can have more than one entry.

mapping = None

OrderedDict with key/value mapping

serialize()[source]

value

class vcfpy.header.ContigHeaderLine(key, value, mapping)[source]

Bases: vcfpy.header.SimpleHeaderLine

Contig header line

Most importantly, parses the 'length' key into an integer

classmethod from_mapping(klass, mapping)[source]

Construct from mapping, not requiring the string value

id = None

name of the contig

length = None

length of the contig, None if missing

vcfpy.header.FORMAT_TYPES = ('Integer', 'Float', 'Character', 'String')

valid FORMAT value types

class vcfpy.header.FieldInfo(type_, number)[source]

Bases: object

Core information for describing field type and number

number = None

Number description, either an int or constant

type = None

The type, one of INFO_TYPES or FORMAT_TYPES

class vcfpy.header.FilterHeaderLine(key, value, mapping)[source]

Bases: vcfpy.header.SimpleHeaderLine

FILTER header line

description = None

description for the filter, None if missing

classmethod from_mapping(klass, mapping)[source]

Construct from mapping, not requiring the string value

id = None

token for the filter

class vcfpy.header.FormatHeaderLine(key, value, mapping)[source]

Bases: vcfpy.header.CompoundHeaderLine

Header line for FORMAT fields

description = None

description, should be given, None if not given

classmethod from_mapping(klass, mapping)[source]

Construct from mapping, not requiring the string value

id = None

key in the FORMAT field

source = None

source of FORMAT field, None if not given

type = None

value type

version = None

version of FORMAT field, None if not given

vcfpy.header.HEADER_NUMBER_ALLELES = 'A'

number of alleles excluding reference

vcfpy.header.HEADER_NUMBER_GENOTYPES = 'G'

number of genotypes

vcfpy.header.HEADER_NUMBER_REF = 'R'

number of alleles including reference

vcfpy.header.HEADER_NUMBER_UNBOUNDED = '.'

unbounded number of values

class vcfpy.header.Header(lines=[], samples=None)[source]

Bases: object

Represent header of VCF file

While this class allows mutating records, it should not be changed once it has been assigned to

This class provides functions for adding lines to a header and updating the supporting index data structures. There is no explicit API for removing header lines; the best way is to construct a new Header instance from a filtered list of header lines.

add_contig_line(mapping)[source]

Add “contig” header line constructed from the given mapping

add_filter_line(mapping)[source]

Add FILTER header line constructed from the given mapping

add_format_line(mapping)[source]

Add FORMAT header line constructed from the given mapping

add_info_line(mapping)[source]

Add INFO header line constructed from the given mapping

add_line(header_line)[source]

Add header line, updating any necessary support indices

filter_ids()[source]

Return list of all filter IDs

format_ids()[source]

Return list of all format IDs

get_format_field_info(key)[source]

Return FieldInfo for the given FORMAT field

get_info_field_info(key)[source]

Return FieldInfo for the given INFO field

get_lines(key)[source]

Return header lines having the given key as their type

info_ids()[source]

Return list of all info IDs

lines = None

list of HeaderLine objects

samples = None

SamplesInfos object

class vcfpy.header.HeaderLine(key, value)[source]

Bases: object

Base class for VCF header lines

key = None

str with key of header line

serialize()[source]

Return VCF-serialized version of this header line

value

vcfpy.header.INFO_TYPES = ('Integer', 'Float', 'Flag', 'Character', 'String')

valid INFO value types

class vcfpy.header.InfoHeaderLine(key, value, mapping)[source]

Bases: vcfpy.header.CompoundHeaderLine

Header line for INFO fields

Note that the Number field will be parsed into an int if possible. Otherwise, the constants HEADER_NUMBER_* will be used.

description = None

description, should be given, None if not given

classmethod from_mapping(klass, mapping)[source]

Construct from mapping, not requiring the string value

id = None

key in the INFO field

source = None

source of INFO field, None if not given

type = None

value type

version = None

version of INFO field, None if not given

vcfpy.header.LINES_WITH_ID = ('ALT', 'contig', 'FILTER', 'FORMAT', 'INFO', 'META', 'PEDIGREE', 'SAMPLE')

header lines that contain an “ID” entry

class vcfpy.header.MetaHeaderLine(key, value, mapping)[source]

Bases: vcfpy.header.SimpleHeaderLine

META header line

Used for defining set of valid values for samples keys

classmethod from_mapping(klass, mapping)[source]

Construct from mapping, not requiring the string value

id = None

name of the META entry

class vcfpy.header.PedigreeHeaderLine(key, value, mapping)[source]

Bases: vcfpy.header.SimpleHeaderLine

Header line for defining a pedigree entry

classmethod from_mapping(klass, mapping)[source]

Construct from mapping, not requiring the string value

id = None

name of the pedigree entry

class vcfpy.header.SampleHeaderLine(key, value, mapping)[source]

Bases: vcfpy.header.SimpleHeaderLine

Header line for defining a SAMPLE entry

classmethod from_mapping(klass, mapping)[source]

Construct from mapping, not requiring the string value

id = None

name of the sample

class vcfpy.header.SamplesInfos(sample_names)[source]

Bases: object

Helper class for handling and mapping of sample names to numeric indices

name_to_idx = None

mapping from sample name to index

names = None

list of sample names
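The relationship between names and name_to_idx can be sketched as follows. This is an illustration of the documented attributes only, not the library's code; build_name_to_idx and the sample names are hypothetical.

```python
def build_name_to_idx(sample_names):
    """Map each sample name to its position in the list (illustrative only)."""
    return {name: idx for idx, name in enumerate(sample_names)}


# Hypothetical sample names, in the order of the VCF sample columns.
names = ["NA00001", "NA00002", "NA00003"]
name_to_idx = build_name_to_idx(names)
```

With such a mapping, looking up the column of a sample by name takes a single dictionary access instead of a scan of the name list.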

class vcfpy.header.SimpleHeaderLine(key, value, mapping)[source]

Bases: vcfpy.header.HeaderLine

Base class for simple header lines, such as the ALT, contig, FILTER, META, PEDIGREE, and SAMPLE header lines

Raises: vcfpy.exceptions.InvalidHeaderException in the case of missing key "ID"

mapping = None

collections.OrderedDict with key/value mapping of the attributes

serialize()[source]

value

vcfpy.header.VALID_NUMBERS = ('A', 'R', 'G', '.')

valid values for “Number” entries, except for integers
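A minimal sketch of how a "Number" entry could be interpreted, assuming the rule stated for INFO header lines (parse into an int if possible, otherwise use one of the constants). parse_number is a hypothetical helper, not part of vcfpy; only the VALID_NUMBERS tuple is taken from the documentation.

```python
VALID_NUMBERS = ("A", "R", "G", ".")  # copied from the documentation above


def parse_number(raw):
    """Return an int for numeric strings, else one of VALID_NUMBERS.

    Hypothetical helper; raises ValueError for anything else.
    """
    if raw in VALID_NUMBERS:
        return raw
    try:
        return int(raw)
    except ValueError:
        raise ValueError("invalid Number entry: {!r}".format(raw))
```

For example, parse_number("2") yields the integer 2, while parse_number("A") returns the constant unchanged.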

vcfpy.header.header_without_lines(header, remove)[source]

Return Header without lines given in remove

remove is an iterable of pairs key/ID with the VCF header key and ID of entry to remove. In the case that a line does not have a mapping entry, you can give the full value to remove.
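The filtering idea can be sketched without the library. The Line tuple and without_lines function below are hypothetical stand-ins for header lines and for header_without_lines; they only illustrate matching on key/ID pairs, with the full value used when a line has no ID.

```python
from collections import namedtuple

# Hypothetical stand-in for a header line: key, raw value, and optional ID.
Line = namedtuple("Line", ["key", "value", "id"])


def without_lines(lines, remove):
    """Return the lines whose (key, ID) or (key, value) pair is not in remove."""
    remove = set(remove)
    kept = []
    for line in lines:
        # Lines without a mapping entry are identified by their full value.
        ident = line.id if line.id is not None else line.value
        if (line.key, ident) not in remove:
            kept.append(line)
    return kept


lines = [
    Line("fileformat", "VCFv4.2", None),
    Line("FILTER", '<ID=q10,Description="Quality below 10">', "q10"),
    Line("FILTER", '<ID=s50,Description="Less than 50% of samples">', "s50"),
]
filtered = without_lines(lines, [("FILTER", "q10")])
```

The input list is left untouched, which matches the advice to build a new header rather than mutate an existing one.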

vcfpy.header.mapping_to_str(mapping)[source]

Convert mapping to string

vcfpy.header.serialize_for_header(key, value)[source]

Serialize value for the given mapping key for a VCF header line

vcfpy.parser module

Parsing of VCF files from str

class vcfpy.parser.HeaderLineParserBase[source]

Bases: object

Parse into appropriate HeaderLine

parse_key_value(key, value)[source]

Parse the key/value pair

Parameters:
  • key (str) – the key to use in parsing
  • value (str) – the value to parse
Returns:

vcfpy.header.HeaderLine object

class vcfpy.parser.HeaderParser(sub_parsers)[source]

Bases: object

Helper class for parsing a VCF header

parse_line(line)[source]

Parse VCF header line (trailing '\r\n' or '\n' is ignored)

Parameters:
  • line (str) – str with line to parse
  • sub_parsers (dict) – dict mapping header line types to appropriate parser objects
Returns: appropriate HeaderLine parsed from line
Raises: vcfpy.exceptions.InvalidHeaderException if there was a problem parsing the file
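The first step of parsing a header line, separating the key from the value, can be sketched as below. split_header_line is a hypothetical helper written for illustration and is not the library's implementation; it assumes the usual "##key=value" layout of VCF meta lines.

```python
def split_header_line(line):
    """Split '##key=value' into (key, value), ignoring a trailing line break.

    Hypothetical helper; raises ValueError if the line is not a meta line.
    """
    line = line.rstrip("\r\n")
    if not line.startswith("##") or "=" not in line:
        raise ValueError("not a VCF header line: {!r}".format(line))
    # Split on the first '=' only; the value may contain further '=' signs.
    key, value = line[2:].split("=", 1)
    return key, value
```

The key ("INFO", "FILTER", "contig", ...) is what a dict of sub-parsers would be indexed by to pick the parser for the value.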
class vcfpy.parser.MappingHeaderLineParser(line_class)[source]

Bases: vcfpy.parser.HeaderLineParserBase

Parse into HeaderLine (no particular structure)

line_class = None

the class to use for the VCF header line

parse_key_value(key, value)[source]

class vcfpy.parser.Parser(stream, path=None)[source]

Bases: object

Class for line-wise parsing of VCF files

In most cases, you want to use vcfpy.reader.Reader instead.

Parameters:
  • stream – file-like object to read from
  • path (str) – path the VCF is parsed from, for display purposes only, optional

header = None

header, once it has been read

parse_header()[source]

Read and parse vcfpy.header.Header from file, set into self.header and return it

Returns: vcfpy.header.Header
Raises: vcfpy.exceptions.InvalidHeaderException in the case of problems reading the header

parse_line(line)[source]

Parse the given line without reading another one from the stream

parse_next_record()[source]

Read, parse and return next vcfpy.record.Record

Returns: next VCF record or None if at end
Raises: vcfpy.exceptions.InvalidRecordException in the case of problems reading the record

samples = None

vcfpy.header.SamplesInfos with sample information; set on parsing the header

class vcfpy.parser.RecordParser(header, samples)[source]

Bases: object

Helper class for parsing VCF records

header = None

Header with the meta information

parse_line(line_str)[source]

Parse line from file (including trailing line break) and return resulting Record

samples = None

SamplesInfos with sample information

vcfpy.parser.SUPPORTED_VCF_VERSIONS = ('VCFv4.0', 'VCFv4.1', 'VCFv4.2', 'VCFv4.3')

Supported VCF versions, a warning will be issued otherwise

class vcfpy.parser.StupidHeaderLineParser[source]

Bases: vcfpy.parser.HeaderLineParserBase

Parse into HeaderLine (no particular structure)

parse_key_value(key, value)[source]

vcfpy.parser.convert_field_value(key, type_, value)[source]

Convert atomic field value according to the type

vcfpy.parser.parse_field_value(key, field_info, value)[source]

Parse value according to field_info

vcfpy.parser.parse_mapping(value)[source]

Parse the given VCF header line mapping

Such a mapping consists of “key=value” pairs, separated by commas and wrapped in angle brackets (“<...>”). Strings are usually quoted; exceptions are made for certain known keys, depending on the tag key. This, however, only becomes important when serializing.

Raises: vcfpy.exceptions.InvalidHeaderException if there was a problem parsing the file
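The mapping format described above can be illustrated with a simplified, standalone parser. parse_mapping_sketch is a hypothetical function, not the library's parse_mapping: it handles double-quoted values that contain commas, but not escaped quotes or bracketed arrays, and it keeps all values as strings.

```python
from collections import OrderedDict


def parse_mapping_sketch(value):
    """Parse '<k1=v1,k2="v 2">' into an OrderedDict (simplified sketch)."""
    if not (value.startswith("<") and value.endswith(">")):
        raise ValueError("mapping must be wrapped in angle brackets")
    pairs, current, in_quotes = [], "", False
    for char in value[1:-1]:
        if char == '"':
            in_quotes = not in_quotes
            current += char
        elif char == "," and not in_quotes:
            # A comma outside quotes ends the current key=value pair.
            pairs.append(current)
            current = ""
        else:
            current += char
    if current:
        pairs.append(current)
    result = OrderedDict()
    for pair in pairs:
        key, _, val = pair.partition("=")
        if len(val) >= 2 and val[0] == '"' and val[-1] == '"':
            val = val[1:-1]  # drop the surrounding quotes
        result[key.strip()] = val
    return result
```

An OrderedDict is used so that the keys keep the order in which they appear in the header line, which matters when the line is written out again.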
vcfpy.parser.process_alt(header, ref, alt_str)[source]

Process alternative value using Header in header

vcfpy.parser.split_quoted_string(s, delim=', ', quote='"', brackets='[]')[source]

Split string s at delimiter, correctly interpreting quotes

Further, interprets arrays wrapped in one level of []. Nested brackets are not interpreted (this would make the grammar non-regular, and the complexity is currently not needed). Quoting inside of brackets is not supported either. This is just enough to support the example from VCF v4.3.
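A minimal sketch of the splitting behaviour described above, assuming a single-character delimiter. split_quoted is a hypothetical re-implementation for illustration and may differ from the library's function in edge cases.

```python
def split_quoted(s, delim=",", quote='"', brackets="[]"):
    """Split s at delim, ignoring delimiters inside quotes or one bracket level."""
    parts, current = [], []
    in_quotes = False
    in_brackets = False
    for char in s:
        if in_quotes:
            current.append(char)
            if char == quote:
                in_quotes = False
        elif in_brackets:
            current.append(char)
            if char == brackets[1]:
                in_brackets = False
        elif char == quote:
            in_quotes = True
            current.append(char)
        elif char == brackets[0]:
            in_brackets = True
            current.append(char)
        elif char == delim:
            # Only delimiters outside quotes and brackets split the string.
            parts.append("".join(current))
            current = []
        else:
            current.append(char)
    parts.append("".join(current))
    return parts
```

Two boolean flags are enough here because neither quotes nor brackets nest, which is the restriction the description above makes explicit.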

vcfpy.reader module

Parsing of VCF files from file-like objects

class vcfpy.reader.Reader(stream, path=None, tabix_path=None)[source]

Bases: object

Class for parsing of files from file-like objects

Instead of using the constructor, use the class methods from_stream() and from_path().

On construction, the header is read from the file, so construction fails if the header cannot be parsed. After construction, Reader can be used as an iterable of Record.

Raises: InvalidHeaderException in the case of problems reading the header

close()[source]

Close underlying stream

fetch(chrom, begin, end)[source]

Jump to the start position of the given chromosomal position and limit iteration to the end position

Parameters:
  • chrom (str) – name of the chromosome to jump to
  • begin (int) – 0-based begin position (inclusive)
  • end (int) – 0-based end position (exclusive)
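Since the POS column of a VCF file is 1-based while begin and end are 0-based with an exclusive end, a small conversion is needed when a region is given as 1-based inclusive positions. region_to_zero_based is a hypothetical helper and the coordinates are made up for illustration.

```python
def region_to_zero_based(chrom, first, last):
    """Convert a 1-based inclusive region to 0-based, end-exclusive arguments.

    Hypothetical helper: `first` and `last` are 1-based positions as they
    appear in the POS column; the result follows the (chrom, begin, end)
    convention documented for fetch().
    """
    if first < 1 or last < first:
        raise ValueError("expected 1 <= first <= last")
    # Shifting only the start keeps the number of covered bases unchanged.
    return chrom, first - 1, last


chrom, begin, end = region_to_zero_based("20", 1110696, 1230237)
```

The resulting tuple could then be passed on as fetch(chrom, begin, end).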
classmethod from_path(klass, path, tabix_path=None)[source]

Create new Reader from path

Parameters:
  • path – the path to load from (converted to str for compatibility with path.py)
  • tabix_path – optional string with path to TBI index; if not given, it is inferred from path on the fly

classmethod from_stream(klass, stream, path=None, tabix_path=None)[source]

Create new Reader from file

Parameters:
  • stream – file-like object to read from
  • path – optional string with path to store (for display only)

header = None

the Header

parser = None

the parser to use

path = None

optional str with the path to the stream

samples = None

the vcfpy.header.SamplesInfos object with the sample name information

stream = None

stream (file-like object) to read from

tabix_file = None

the pysam.TabixFile used for reading from indexed bgzip-ed VCF; constructed on the fly

tabix_path = None

optional str with path to tabix file

vcfpy.record module

Code for representing a VCF record

The VCF record structure is modeled after the one of PyVCF

class vcfpy.record.AltRecord(type_=None)[source]

Bases: object

An alternative allele Record

Currently, can be a substitution, an SV placeholder, or breakend

type = None

String describing the type of the variant, could be one of SNV, MNV, or any of the types described in the ALT header lines, such as DUP, DEL, INS, ...

vcfpy.record.BND = 'BND'

Code for break-end allele

class vcfpy.record.BreakEnd(type_, value)[source]

Bases: vcfpy.record.AltRecord

A placeholder for a breakend

value = None

The alternative base sequence to use in the substitution

class vcfpy.record.Call(sample, data, site=None)[source]

Bases: object

The information for a genotype call

According to the VCF specification, this should always include the genotype information and can contain an arbitrary number of further annotations, e.g., the coverage at the variant position.

called = None

whether or not the variant is fully called

data = None

an OrderedDict with the key/value pair information from the call’s data

gt_alleles = None

the allele numbers (0, 1, ...) in this call or None for no-call

gt_bases

Return the actual genotype alleles, e.g. if VCF genotype is 0/1, could return A/T

gt_phase_char()[source]

Return character to use for phasing

gt_type

The type of genotype, mapping is

  • hom_ref = 0
  • het = 1
  • hom_alt = 2 (which alt is untracked)
  • uncalled = None
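The mapping above can be sketched as a small classification function over the allele numbers of a call. classify_genotype is a hypothetical helper, not the library's property; treating a partially missing genotype as uncalled is an assumption of this sketch.

```python
HOM_REF, HET, HOM_ALT = 0, 1, 2  # values taken from the list above


def classify_genotype(gt_alleles):
    """Return 0, 1, 2, or None for a list of allele numbers (sketch).

    A None entry stands for a no-call allele ('.').
    """
    if not gt_alleles or any(allele is None for allele in gt_alleles):
        return None  # uncalled (assumption: any missing allele counts)
    if all(allele == 0 for allele in gt_alleles):
        return HOM_REF
    if len(set(gt_alleles)) == 1:
        return HOM_ALT  # same non-reference allele on every haplotype
    return HET
```

Note that a genotype such as 1/2 is classified as het, while 2/2 is hom_alt, which matches the remark that the specific alternative allele is not tracked.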
is_filtered

Return True for filtered calls

is_het

Return True for heterozygous calls

is_phased

Return True for phased calls

is_variant

Return True for non-hom-ref calls

phased

Return boolean indicating whether this call is phased

plodity = None

the number of alleles in this sample’s call

sample = None

the name of the sample for which the call was made

site = None

the Record of this Call

vcfpy.record.DEL = 'DEL'

Code for “clean” deletion allele

vcfpy.record.ESCAPE_MAPPING = [('%', '%25'), (':', '%3A'), (';', '%3B'), ('=', '%3D'), (',', '%2C'), ('\r', '%0D'), ('\n', '%0A'), ('\t', '%09')]

Mapping for escaping reserved characters

vcfpy.record.INDEL = 'INDEL'

Code for indel allele, includes substitutions of unequal length

vcfpy.record.INS = 'INS'

Code for “clean” insertion allele

vcfpy.record.MIXED = 'MIXED'

Code for mixed variant type

vcfpy.record.MNV = 'MNV'

Code for a multi nucleotide variant allele

vcfpy.record.RESERVED_CHARS = ':;=%,\r\n\t'

Characters reserved in VCF, have to be escaped

class vcfpy.record.Record(CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, calls)[source]

Bases: object

Represent one record from the VCF file

Record objects are iterators of their calls

ALT = None

A list of alternative allele records of type AltRecord

CHROM = None

A str with the chromosome name

FILTER = None

A list of strings for the FILTER column

FORMAT = None

A list of strings for the FORMAT column

ID = None

A list of the semicolon-separated values of the ID column

INFO = None

An OrderedDict giving the values of the INFO column, flags are mapped to True

POS = None

An int with a 1-based begin position

QUAL = None

The quality value, can be None

REF = None

A str with the REF value

add_filter(label)[source]

Add label to FILTER if not set yet

add_format(key, value=None)[source]

Add an entry to format

The record’s calls data[key] will be set to value if not yet set and value is not None. If key is already in FORMAT then nothing is done.

affected_end

Return affected end position in 0-based coordinates

For SNVs, MNVs, and deletions, the behaviour is based on the start position and the length of the REF. In the case of insertions, the position behind the insert position is returned, yielding a 0-length interval together with affected_start

affected_start

Return affected start position in 0-based coordinates

For SNVs, MNVs, and deletions, the behaviour is the start position. In the case of insertions, the position behind the insert position is returned, yielding a 0-length interval together with affected_end
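The two properties can be read together as a 0-based, end-exclusive interval. The function below is a sketch of the described behaviour under stated assumptions, not the library's code: affected_interval and its only_insertions flag are hypothetical, and an insertion is assumed to be anchored on the first REF base.

```python
def affected_interval(pos, ref, only_insertions=False):
    """Return (start, end) in 0-based, end-exclusive coordinates (sketch).

    pos is the 1-based POS value and ref the REF string. When every ALT
    allele is an insertion, the interval is empty and sits right after the
    first REF base.
    """
    if only_insertions:
        return pos, pos
    start = pos - 1
    return start, start + len(ref)
```

For an SNV at POS 100 this gives the interval (99, 100); a deletion with a three-base REF at the same position gives (99, 102); a pure insertion gives the empty interval (100, 100).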

begin = None

An int with a 0-based begin position

call_for_sample = None

A mapping from sample name to entry in self.calls

calls = None

A list of genotype Call objects

end = None

An int with a 0-based end position

is_snv()[source]

Return True if it is an SNV

vcfpy.record.SNV = 'SNV'

Code for single nucleotide variant allele

class vcfpy.record.SV(type_, value)[source]

Bases: vcfpy.record.AltRecord

Code for structural variant allele

value = None

The alternative base sequence to use in the substitution

vcfpy.record.SV_CODES = ('DEL', 'INS', 'DUP', 'INV', 'CNV')

Codes for structural variants

vcfpy.record.SYMBOLIC = 'SYMBOLIC'

Code for symbolic allele that is neither SV nor BND

class vcfpy.record.SingleBreakEnd(type_, value)[source]

Bases: vcfpy.record.AltRecord

A placeholder for a single breakend

value = None

The alternative base sequence to use in the substitution

class vcfpy.record.Substitution(type_, value)[source]

Bases: vcfpy.record.AltRecord

A basic alternative allele record describing a REF->AltRecord substitution

Note that this subsumes MNVs, insertions, and deletions.

value = None

The alternative base sequence to use in the substitution

class vcfpy.record.SymbolicAllele(type_, value)[source]

Bases: vcfpy.record.AltRecord

A placeholder for a symbolic allele

value = None

The alternative base sequence to use in the substitution

vcfpy.record.UNESCAPE_MAPPING = [('%25', '%'), ('%3A', ':'), ('%3B', ';'), ('%3D', '='), ('%2C', ','), ('%0D', '\r'), ('%0A', '\n'), ('%09', '\t')]

Mapping from escaped sequences back to the reserved characters
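A minimal sketch of how such mapping tables can be applied. The escape table is copied from the ESCAPE_MAPPING entry of this module and the reverse table is derived from it; the escape and unescape functions are hypothetical helpers, not the library's implementation.

```python
import re

# Copied from the documented ESCAPE_MAPPING; the reverse table is derived.
ESCAPE_MAPPING = [("%", "%25"), (":", "%3A"), (";", "%3B"), ("=", "%3D"),
                  (",", "%2C"), ("\r", "%0D"), ("\n", "%0A"), ("\t", "%09")]
UNESCAPE_MAPPING = [(escaped, raw) for raw, escaped in ESCAPE_MAPPING]


def escape(value):
    """Percent-encode reserved characters; '%' is handled first on purpose."""
    for raw, escaped in ESCAPE_MAPPING:
        value = value.replace(raw, escaped)
    return value


_UNESCAPE = dict(UNESCAPE_MAPPING)
_ESCAPED_RE = re.compile("|".join(re.escape(code) for code in _UNESCAPE))


def unescape(value):
    """Decode every escape sequence in a single left-to-right pass."""
    return _ESCAPED_RE.sub(lambda match: _UNESCAPE[match.group(0)], value)
```

Escaping replaces "%" first so that the percent signs introduced by later replacements are not encoded again. Unescaping uses a single regular-expression pass rather than sequential replacements, so that a literal "%3A" in the original value (stored as "%253A") is not decoded twice. This sketch only recognises the upper-case codes listed in the table.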

vcfpy.writer module

Writing of VCF files to file-like objects

Currently, only writing to plain-text files is supported

class vcfpy.writer.Writer(stream, header, samples, path=None)[source]

Bases: object

Class for writing VCF files to file-like objects

Instead of using the constructor, use the class methods from_stream() and from_path().

The writer has to be constructed with a Header and a SamplesInfos object and the full VCF header will be written immediately on construction. This, of course, implies that modifying the header after construction is illegal.

close()[source]

Close underlying stream

classmethod from_path(klass, path, header, samples)[source]

Create new Writer from path

Parameters:
  • path – the path to write to (converted to str for compatibility with path.py)
  • header – VCF header to use
  • samples – SamplesInfos to use

classmethod from_stream(klass, stream, header, samples, path=None, use_bgzf=None)[source]

Create new Writer from file

Note that for getting bgzf support, you have to pass in a stream opened in binary mode. Further, you either have to provide a path ending in ".gz" or set use_bgzf=True. Otherwise, you will get the notorious “TypeError: ‘str’ does not support the buffer interface”.

Parameters:
  • stream – file-like object to write to
  • header – VCF header to use
  • samples – SamplesInfos to use
  • path – optional string with path to store (for display only)
  • use_bgzf – indicator whether to write bgzf to stream if True, prevent if False, interpret path if None
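The three-way meaning of use_bgzf can be sketched as a small decision function. should_use_bgzf is a hypothetical helper that only restates the rule documented above (True forces bgzf, False prevents it, None falls back to the path ending in ".gz"); it is not the library's code.

```python
def should_use_bgzf(path=None, use_bgzf=None):
    """Decide whether to write BGZF output (sketch of the documented rule)."""
    if use_bgzf is not None:
        return bool(use_bgzf)  # an explicit True/False wins over the path
    # With use_bgzf=None, fall back to inspecting the path suffix.
    return path is not None and str(path).endswith(".gz")
```

Whenever this returns True, the stream would have to be opened in binary mode, as noted in the description of from_stream().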
header = None

the vcfpy.header.Header written out

path = None

optional str with the path to the stream

samples = None

the vcfpy.header.SamplesInfos written out

stream = None

stream (file-like object) to write to

write_record(record)[source]

Write out the given vcfpy.record.Record to this Writer

vcfpy.writer.format_atomic(value)[source]

Format atomic value

This function also takes care of escaping the value in case one of the reserved characters occurs in the value.

vcfpy.writer.format_value(field_info, value)[source]

Format possibly compound value given the FieldInfo

Module contents