vcfpy package¶

Submodules¶

vcfpy.bgzf module¶

Support code for writing BGZF files

Shamelessly taken from Biopython

class vcfpy.bgzf.BgzfWriter(filename=None, mode='w', fileobj=None, compresslevel=6)[source]¶

Bases: object

close()[source]¶: Flush data, write the 28-byte BGZF EOF marker, and close the BGZF file. samtools looks for this magic EOF marker, an empty 28-byte BGZF block, and warns that the BAM file may be truncated if it is missing. Both samtools and bgzip write this block, so this implementation does too.
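The EOF check that samtools performs can be sketched with the standard library alone. The 28 bytes below are the empty BGZF block as given in the SAM/BAM specification; `has_bgzf_eof` is a hypothetical helper for illustration and is not part of vcfpy.

```python
import os

# The 28-byte empty BGZF block that marks end-of-file (per the SAM/BAM spec).
BGZF_EOF = bytes.fromhex(
    "1f8b0804"  # gzip magic, deflate method, FEXTRA flag
    "00000000"  # modification time
    "00ff"      # extra flags, OS (unknown)
    "0600"      # length of the extra field
    "4243"      # subfield identifier "BC"
    "0200"      # subfield length
    "1b00"      # total block size minus one (27)
    "0300"      # empty deflate stream
    "00000000"  # CRC32 of the empty payload
    "00000000"  # uncompressed size
)


def has_bgzf_eof(path):
    """Return True if the file at ``path`` ends with the BGZF EOF marker."""
    size = os.path.getsize(path)
    if size < len(BGZF_EOF):
        return False
    with open(path, "rb") as handle:
        handle.seek(size - len(BGZF_EOF))
        return handle.read() == BGZF_EOF
```

A file written through BgzfWriter and closed with close() would be expected to pass this check; a file whose writer was never closed would not.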

fileno()[source]¶

flush()[source]¶

classmethod isatty()[source]¶

seekable()[source]¶

tell()[source]¶: Returns a BGZF 64-bit virtual offset.

write(data)[source]¶

vcfpy.bgzf.make_virtual_offset(block_start_offset, within_block_offset)[source]¶: Compute a BGZF virtual offset from block start and within block offsets.

The BAM indexing scheme records read positions using a 64 bit ‘virtual offset’, comprising in C terms:

block_start_offset << 16 | within_block_offset

Here block_start_offset is the file offset of the BGZF block start (unsigned integer using up to 64-16 = 48 bits), and within_block_offset is the offset within the (decompressed) block (unsigned 16 bit integer).

>>> make_virtual_offset(0, 0)
0
>>> make_virtual_offset(0, 1)
1
>>> make_virtual_offset(0, 2**16 - 1)
65535
>>> make_virtual_offset(0, 2**16)
Traceback (most recent call last):
...
ValueError: Require 0 <= within_block_offset < 2**16, got 65536
>>> 65536 == make_virtual_offset(1, 0)
True
>>> 65537 == make_virtual_offset(1, 1)
True
>>> 131071 == make_virtual_offset(1, 2**16 - 1)
True
>>> 6553600000 == make_virtual_offset(100000, 0)
True
>>> 6553600001 == make_virtual_offset(100000, 1)
True
>>> 6553600010 == make_virtual_offset(100000, 10)
True
>>> make_virtual_offset(2**48, 0)
Traceback (most recent call last):
...
ValueError: Require 0 <= block_start_offset < 2**48, got 281474976710656
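The bit arithmetic can be sketched independently of vcfpy. The names `pack_virtual_offset` and `unpack_virtual_offset` are hypothetical; the second one shows the inverse operation, which is not listed in this module's API.

```python
def pack_virtual_offset(block_start_offset, within_block_offset):
    """Combine a block start and a within-block offset into one integer."""
    if not 0 <= within_block_offset < 2**16:
        raise ValueError(
            "Require 0 <= within_block_offset < 2**16, got %i" % within_block_offset
        )
    if not 0 <= block_start_offset < 2**48:
        raise ValueError(
            "Require 0 <= block_start_offset < 2**48, got %i" % block_start_offset
        )
    # Upper 48 bits: block start; lower 16 bits: offset inside the block.
    return (block_start_offset << 16) | within_block_offset


def unpack_virtual_offset(virtual_offset):
    """Split a virtual offset back into (block_start, within_block)."""
    return virtual_offset >> 16, virtual_offset & 0xFFFF
```

Because the block start occupies the high bits, comparing two virtual offsets as plain integers orders them by file position first and by position inside the block second.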

vcfpy.exceptions module¶

Exceptions for the vcfpy module

exception vcfpy.exceptions.CannotConvertValue[source]¶

Bases: vcfpy.exceptions.VCFPyWarning

Cannot convert value.

exception vcfpy.exceptions.DuplicateHeaderLineWarning[source]¶

Bases: vcfpy.exceptions.VCFPyWarning

A header line occurs twice in a header

exception vcfpy.exceptions.FieldInfoNotFound[source]¶

Bases: vcfpy.exceptions.VCFPyWarning

A header field is not found, default is used

exception vcfpy.exceptions.FieldInvalidNumber[source]¶

Bases: vcfpy.exceptions.VCFPyWarning

Raised when compound header has invalid number

exception vcfpy.exceptions.FieldMissingNumber[source]¶

Bases: vcfpy.exceptions.VCFPyWarning

Raised when compound header misses number

exception vcfpy.exceptions.HeaderInvalidType[source]¶

Bases: vcfpy.exceptions.VCFPyWarning

Raised when compound header has invalid type

exception vcfpy.exceptions.HeaderMissingDescription[source]¶

Bases: vcfpy.exceptions.VCFPyWarning

Raised when compound header has missing description

exception vcfpy.exceptions.HeaderNotFound[source]¶

Bases: vcfpy.exceptions.VCFPyException

Raised when a VCF header could not be found

exception vcfpy.exceptions.IncorrectListLength[source]¶

Bases: vcfpy.exceptions.VCFPyWarning

Wrong length of multi-element field

exception vcfpy.exceptions.IncorrectVCFFormat[source]¶

Bases: vcfpy.exceptions.VCFPyException

Raised on problems parsing VCF

exception vcfpy.exceptions.InvalidHeaderException[source]¶

Bases: vcfpy.exceptions.VCFPyException

Raised in the case of invalid header formatting

exception vcfpy.exceptions.InvalidRecordException[source]¶

Bases: vcfpy.exceptions.VCFPyException

Raised in the case of invalid record formatting

exception vcfpy.exceptions.LeadingTrailingSpaceInKey[source]¶

Bases: vcfpy.exceptions.VCFPyWarning

Leading or trailing space in key

exception vcfpy.exceptions.SpaceInChromLine[source]¶

Bases: vcfpy.exceptions.VCFPyWarning

Space instead of TAB in #CHROM line

exception vcfpy.exceptions.UnknownFilter[source]¶

Bases: vcfpy.exceptions.VCFPyWarning

Missing FILTER

exception vcfpy.exceptions.UnknownVCFVersion[source]¶

Bases: vcfpy.exceptions.VCFPyWarning

Unknown VCF version

exception vcfpy.exceptions.VCFPyException[source]¶

Bases: RuntimeError

Base class for module’s exceptions

exception vcfpy.exceptions.VCFPyWarning[source]¶

Bases: Warning

Base class for module’s warnings
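Because most problems are reported as warnings deriving from VCFPyWarning rather than as exceptions, a caller who wants strict behaviour can promote that whole family to errors with the standard `warnings` module. The sketch below uses local stand-in classes that mirror the hierarchy documented here, so it runs without vcfpy installed; `check_version` and `is_strictly_valid` are hypothetical helpers.

```python
import warnings


class VCFPyWarning(Warning):
    """Stand-in for vcfpy.exceptions.VCFPyWarning."""


class UnknownVCFVersion(VCFPyWarning):
    """Stand-in for vcfpy.exceptions.UnknownVCFVersion."""


def check_version(version):
    """Hypothetical check that warns instead of raising."""
    if version not in ("VCFv4.0", "VCFv4.1", "VCFv4.2", "VCFv4.3"):
        warnings.warn("Unknown VCF version {}".format(version), UnknownVCFVersion)


def is_strictly_valid(version):
    """Return False if check_version() emits any VCFPyWarning."""
    with warnings.catch_warnings():
        # Inside this block, every VCFPyWarning subclass is raised as an error.
        warnings.simplefilter("error", category=VCFPyWarning)
        try:
            check_version(version)
        except VCFPyWarning:
            return False
    return True
```

Filtering on the base class is the point of having one: a single filter covers every warning type in the module.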

vcfpy.header module¶

Code for representing the VCF header part

The VCF header class structure is modeled after HTSJDK

class vcfpy.header.AltAlleleHeaderLine(key, value, mapping)[source]¶

Bases: vcfpy.header.SimpleHeaderLine

Alternative allele header line

Mostly used for defining symbolic alleles for structural variants and IUPAC ambiguity codes

classmethod from_mapping(mapping)[source]¶: Construct from mapping, not requiring the string value

id = None¶: name of the alternative allele

class vcfpy.header.CompoundHeaderLine(key, value, mapping)[source]¶

Bases: vcfpy.header.HeaderLine

Base class for compound header lines, currently FORMAT and INFO header lines

Compound header lines describe fields that can have more than one entry.

Don’t use this class directly but rather the sub classes.

copy()[source]¶: Return a copy

mapping = None¶: OrderedDict with key/value mapping

serialize()[source]¶

value¶

class vcfpy.header.ContigHeaderLine(key, value, mapping)[source]¶

Bases: vcfpy.header.SimpleHeaderLine

Contig header line

Most importantly, parses the 'length' key into an integer

classmethod from_mapping(mapping)[source]¶: Construct from mapping, not requiring the string value

id = None¶: name of the contig

length = None¶: length of the contig, None if missing

vcfpy.header.FORMAT_TYPES = ('Integer', 'Float', 'Character', 'String')¶: valid FORMAT value types

class vcfpy.header.FieldInfo(type_, number, description=None, id_=None)[source]¶

Bases: object

Core information for describing field type and number

description = None¶: Description for the header field, optional

id = None¶: The id of the field, optional.

number = None¶: Number description, either an int or constant

type = None¶: The type, one of INFO_TYPES or FORMAT_TYPES

class vcfpy.header.FilterHeaderLine(key, value, mapping)[source]¶

Bases: vcfpy.header.SimpleHeaderLine

FILTER header line

description = None¶: description for the filter, None if missing

classmethod from_mapping(mapping)[source]¶: Construct from mapping, not requiring the string value

id = None¶: token for the filter

class vcfpy.header.FormatHeaderLine(key, value, mapping)[source]¶

Bases: vcfpy.header.CompoundHeaderLine

Header line for FORMAT fields

description = None¶: description, should be given, None if not given

classmethod from_mapping(mapping)[source]¶: Construct from mapping, not requiring the string value

id = None¶: key in the FORMAT field

source = None¶: source of FORMAT field, None if not given

type = None¶: value type

version = None¶: version of FORMAT field, None if not given

vcfpy.header.HEADER_NUMBER_ALLELES = 'A'¶: number of alleles excluding reference

vcfpy.header.HEADER_NUMBER_GENOTYPES = 'G'¶: number of genotypes

vcfpy.header.HEADER_NUMBER_REF = 'R'¶: number of alleles including reference

vcfpy.header.HEADER_NUMBER_UNBOUNDED = '.'¶: unbounded number of values

class vcfpy.header.Header(lines=None, samples=None)[source]¶

Bases: object

Represent header of VCF file

Although instances of this class are mutable, a header should not be changed once it has been assigned to a writer. Use Header.copy() to create a copy that can be modified without problems.

This class provides functions for adding lines to a header and updating the supporting index data structures. There is no explicit API for removing header lines; the best way is to construct a new Header instance from a filtered list of header lines.

add_contig_line(mapping)[source]¶

Add “contig” header line constructed from the given mapping

Parameters:	mapping – `OrderedDict` with mapping to add. It is recommended to use `OrderedDict` over `dict` as this makes the result reproducible
Returns:	`False` on conflicting line and `True` otherwise

add_filter_line(mapping)[source]¶

Add FILTER header line constructed from the given mapping

Parameters:	mapping – `OrderedDict` with mapping to add. It is recommended to use `OrderedDict` over `dict` as this makes the result reproducible
Returns:	`False` on conflicting line and `True` otherwise

add_format_line(mapping)[source]¶

Add FORMAT header line constructed from the given mapping

Parameters:	mapping – `OrderedDict` with mapping to add. It is recommended to use `OrderedDict` over `dict` as this makes the result reproducible
Returns:	`False` on conflicting line and `True` otherwise

add_info_line(mapping)[source]¶

Add INFO header line constructed from the given mapping

Parameters:	mapping – `OrderedDict` with mapping to add. It is recommended to use `OrderedDict` over `dict` as this makes the result reproducible
Returns:	`False` on conflicting line and `True` otherwise

add_line(header_line)[source]¶

Add header line, updating any necessary support indices

Returns:	`False` on conflicting line and `True` otherwise

copy()[source]¶: Return a copy of this header

filter_ids()[source]¶: Return list of all filter IDs

format_ids()[source]¶: Return list of all format IDs

get_format_field_info(key)[source]¶: Return FieldInfo for the given FORMAT field

get_info_field_info(key)[source]¶: Return FieldInfo for the given INFO field

get_lines(key)[source]¶: Return header lines having the given key as their type

has_header_line(key, id_)[source]¶

Return whether there is a header line with the given ID of the type given by key

Parameters:	key – The VCF header key/line type.
	id_ – The ID value to compare for
Returns:	`True` if there is a header line starting with `##${key}=` in the VCF file having the mapping entry `ID` set to `id_`.

info_ids()[source]¶: Return list of all info IDs

lines = None¶: list of HeaderLine objects

samples = None¶: SamplesInfos object

class vcfpy.header.HeaderLine(key, value)[source]¶

Bases: object

Base class for VCF header lines

copy()[source]¶: Return a copy

key = None¶: str with key of header line

serialize()[source]¶: Return VCF-serialized version of this header line

value¶

vcfpy.header.INFO_TYPES = ('Integer', 'Float', 'Flag', 'Character', 'String')¶: valid INFO value types

class vcfpy.header.InfoHeaderLine(key, value, mapping)[source]¶

Bases: vcfpy.header.CompoundHeaderLine

Header line for INFO fields

Note that the Number field will be parsed into an int if possible. Otherwise, the constants HEADER_NUMBER_* will be used.
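The rule for the Number entry can be illustrated with a short standalone function. This is a sketch of the behaviour described above, not vcfpy's own code; `parse_number` is a hypothetical name, and the tuple repeats the values documented for VALID_NUMBERS.

```python
# Non-integer values allowed in a "Number" entry (see VALID_NUMBERS).
VALID_NUMBERS = ("A", "R", "G", ".")


def parse_number(raw):
    """Return an int when possible, otherwise one of the symbolic constants."""
    try:
        return int(raw)
    except ValueError:
        pass
    if raw in VALID_NUMBERS:
        return raw
    raise ValueError("Invalid Number value: {!r}".format(raw))
```

For example, `parse_number("2")` yields the integer 2, while `parse_number("A")` yields the string `"A"`, meaning one value per alternative allele.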

description = None¶: description, should be given, None if not given

classmethod from_mapping(mapping)[source]¶: Construct from mapping, not requiring the string value

id = None¶: key in the INFO field

source = None¶: source of INFO field, None if not given

type = None¶: value type

version = None¶: version of INFO field, None if not given

vcfpy.header.LINES_WITH_ID = ('ALT', 'contig', 'FILTER', 'FORMAT', 'INFO', 'META', 'PEDIGREE', 'SAMPLE')¶: header lines that contain an “ID” entry

class vcfpy.header.MetaHeaderLine(key, value, mapping)[source]¶

Bases: vcfpy.header.SimpleHeaderLine

META header line

Used for defining the set of valid values for sample keys

classmethod from_mapping(mapping)[source]¶: Construct from mapping, not requiring the string value

id = None¶: name of the META entry

class vcfpy.header.PedigreeHeaderLine(key, value, mapping)[source]¶

Bases: vcfpy.header.SimpleHeaderLine

Header line for defining a pedigree entry

classmethod from_mapping(mapping)[source]¶: Construct from mapping, not requiring the string value

id = None¶: name of the pedigree entry

vcfpy.header.RESERVED_INFO = {'BKPTID': FieldInfo('String', '.', 'ID of the assembled alternate allele in the assembly file', None), 'DGVID': FieldInfo('String', 1, 'ID of this element in Database of Genomic Variation', None), 'H2': FieldInfo('Flag', 0, 'Membership in HapMap 2', None), 'VALIDATED': FieldInfo('Flag', 0, 'Validated by follow-up experiment', None), 'AA': FieldInfo('String', 1, 'Ancestral Allele', None), 'BQ': FieldInfo('Float', 1, 'RMS base quality at this position', None), 'SOMATIC': FieldInfo('Flag', 0, 'Indicates that the record is a somatic mutation, for cancer genomics', None), 'CICNADJ': FieldInfo('Integer', '.', 'Confidence interval around copy number for the adjacency', None), 'HOMLEN': FieldInfo('Integer', '.', 'Length of base pair identical micro-homology at event breakpoints', None), 'SB': FieldInfo('Integer', 4, 'Strand bias at this position', None), 'CN': FieldInfo('Integer', 1, 'Copy number of segment containing breakend', None), 'CNADJ': FieldInfo('Integer', '.', 'Copy number of adjacency', None), 'AC': FieldInfo('Integer', 'A', 'Allele count in genotypes, for each ALT allele, in the same order as listed', None), 'ADR': FieldInfo('Integer', 'R', 'Reverse read depth for each allele', None), 'DBVARID': FieldInfo('String', 1, 'ID of this element in DBVAR', None), 'DBRIPID': FieldInfo('String', 1, 'ID of this element in DBRIP', None), 'DPADJ': FieldInfo('Integer', '.', 'Read Depth of adjacency', None), 'HOMSEQ': FieldInfo('String', '.', 'Sequence of base pair identical micro-homology at event breakpoints', None), 'DP': FieldInfo('Integer', 1, 'Combined depth across samples for small variants and Read Depth of segment containing breakend for SVs', None), 'SVLEN': FieldInfo('Integer', 1, 'Difference in length between REF and ALT alleles', None), 'CIGAR': FieldInfo('String', 'A', 'CIGAR string describing how to align each ALT allele to the reference allele', None), 'SVTYPE': FieldInfo('String', 1, 'Type of structural variant', None), 'AD': 
FieldInfo('Integer', 'R', 'Total read depth for each allele', None), 'CICN': FieldInfo('Integer', 2, 'Confidence interval around copy number for the segment', None), 'H3': FieldInfo('Flag', 0, 'Membership in HapMap 3', None), 'CIEND': FieldInfo('Integer', 2, 'Confidence interval around END for imprecise variants', None), 'AN': FieldInfo('Integer', 1, 'Total number of alleles in called genotypes', None), 'MQ': FieldInfo('Integer', 1, 'RMS mapping quality', None), 'AF': FieldInfo('Float', 'A', 'Allele frequency for each ALT allele in the same order as listed: used for estimating from primary data not called genotypes', None), 'METRANS': FieldInfo('String', 4, 'Mobile element transduction info of the form CHR,START,END,POLARITY', None), 'ADF': FieldInfo('Integer', 'R', 'Forward read depth for each allele', None), 'MATEID': FieldInfo('String', '.', 'ID of mate breakends', None), 'NOVEL': FieldInfo('Flag', 0, 'Indicates a novel structural variation', None), 'MEINFO': FieldInfo('String', 4, 'Mobile element info of the form NAME,START,END,POLARITY', None), 'PARID': FieldInfo('String', 1, 'ID of partner breakend', None), 'IMPRECISE': FieldInfo('Flag', 0, 'Imprecise structural variation', None), 'END': FieldInfo('Integer', 1, 'End position of the variant described in this record (for symbolic alleles)', None), 'NS': FieldInfo('Integer', 1, 'Number of samples with data', None), 'DB': FieldInfo('Flag', 0, 'dbSNP membership', None), 'EVENT': FieldInfo('String', 1, 'ID of event associated to breakend', None), 'CIPOS': FieldInfo('Integer', 2, 'Confidence interval around POS for imprecise variants', None), 'CILEN': FieldInfo('Integer', 2, 'Confidence interval around the inserted material between breakends', None), 'MQ0': FieldInfo('Integer', 1, 'Number of MAPQ == 0 reads covering this record', None), '1000G': FieldInfo('Flag', 0, 'Membership in 1000 Genomes', None)}¶: Reserved fields for INFO from VCF v4.3

class vcfpy.header.SampleHeaderLine(key, value, mapping)[source]¶

Bases: vcfpy.header.SimpleHeaderLine

Header line for defining a SAMPLE entry

classmethod from_mapping(mapping)[source]¶: Construct from mapping, not requiring the string value

id = None¶: name of the sample

class vcfpy.header.SamplesInfos(sample_names, parsed_samples=None)[source]¶

Bases: object

Helper class for handling the samples in VCF files

The purpose of this class is to decouple the sample name list somewhat from Header. This encapsulates subsetting samples for which the genotype should be parsed and reordering samples into output files.

Note that when subsetting is used and the records are to be written out again then the FORMAT field must not be touched.

copy()[source]¶: Return a copy of the object

is_parsed(name)[source]¶: Return whether the sample name is parsed

name_to_idx = None¶: mapping from sample name to index

names = None¶: list of samples that are read from/written to the VCF file at hand in the given order

parsed_samples = None¶: set with the samples for which the genotype call fields should be read; can be used for partial parsing (speedup); None if all samples are parsed
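The interplay of names, name_to_idx and parsed_samples can be sketched with a small stand-in class. `SampleIndex` is hypothetical and deliberately simplified; it only mirrors the attributes documented above and assumes that None means "parse every sample".

```python
class SampleIndex:
    """Simplified, hypothetical stand-in for SamplesInfos."""

    def __init__(self, sample_names, parsed_samples=None):
        # Samples in file order.
        self.names = list(sample_names)
        # Mapping from sample name to its column index.
        self.name_to_idx = {name: idx for idx, name in enumerate(self.names)}
        # None means that all samples are parsed.
        self.parsed_samples = (
            None if parsed_samples is None else set(parsed_samples)
        )

    def is_parsed(self, name):
        """Return whether call data for ``name`` should be parsed."""
        return self.parsed_samples is None or name in self.parsed_samples
```

With `SampleIndex(["NA1", "NA2", "NA3"], parsed_samples=["NA2"])`, only the genotype fields of NA2 would be parsed, while all three sample columns keep their positions.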

class vcfpy.header.SimpleHeaderLine(key, value, mapping)[source]¶

Bases: vcfpy.header.HeaderLine

Base class for simple header lines, currently contig and filter header lines

Don’t use this class directly but rather the sub classes.

Raises:	`vcfpy.exceptions.InvalidHeaderException` in the case of missing key `"ID"`

copy()[source]¶: Return a copy

mapping = None¶: collections.OrderedDict with key/value mapping of the attributes

serialize()[source]¶

value¶

vcfpy.header.VALID_NUMBERS = ('A', 'R', 'G', '.')¶: valid values for “Number” entries, except for integers

vcfpy.header.header_without_lines(header, remove)[source]¶

Return Header without lines given in remove

remove is an iterable of pairs key/ID with the VCF header key and ID of entry to remove. In the case that a line does not have a mapping entry, you can give the full value to remove.

# header is a vcfpy.Header, e.g., as read earlier from file
new_header = vcfpy.header.header_without_lines(
    header, [('assembly', None), ('FILTER', 'PASS')])
# now, the header lines starting with "##assembly=" and the "PASS"
# filter line will be missing from new_header

vcfpy.header.mapping_to_str(mapping)[source]¶: Convert mapping to string

vcfpy.header.serialize_for_header(key, value)[source]¶: Serialize value for the given mapping key for a VCF header line
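To show what serializing a mapping into a header line amounts to, here is a simplified standalone sketch. It assumes that only the Description value is quoted, which may differ from vcfpy's actual quoting rules; `mapping_to_header_value` and `QUOTED_KEYS` are hypothetical names.

```python
from collections import OrderedDict

# Assumption for this sketch: only "Description" values are quoted.
QUOTED_KEYS = ("Description",)


def mapping_to_header_value(mapping):
    """Render a mapping as the "<key=value,...>" part of a header line."""
    parts = []
    for key, value in mapping.items():
        text = str(value)
        if key in QUOTED_KEYS:
            # Escape backslashes and quotes before wrapping in quotes.
            text = '"{}"'.format(text.replace("\\", "\\\\").replace('"', '\\"'))
        parts.append("{}={}".format(key, text))
    return "<" + ",".join(parts) + ">"


mapping = OrderedDict(
    [
        ("ID", "DP"),
        ("Number", 1),
        ("Type", "Integer"),
        ("Description", "Total read depth"),
    ]
)
line = "##INFO=" + mapping_to_header_value(mapping)
```

Using an OrderedDict keeps the key order, and therefore the serialized line, reproducible.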

vcfpy.parser module¶

Parsing of VCF files from str

class vcfpy.parser.FormatChecker(header)[source]¶

Bases: object

Helper class for checking a FORMAT field

header = None¶: Header to use for checking

run(call, num_alts)[source]¶

Check FORMAT of a record.Call

Currently, only checks for consistent counts are implemented

class vcfpy.parser.HeaderChecker[source]¶

Bases: object

Helper class for checking a VCF header

run(header)[source]¶

Check the header

Warnings are issued through the warnings module, while errors raise an exception.

Raises:	`vcfpy.exceptions.InvalidHeaderException` in the case of severe errors reading the header

class vcfpy.parser.HeaderLineParserBase[source]¶

Bases: object

Parse into appropriate HeaderLine

parse_key_value(key, value)[source]¶

Parse the key/value pair

Parameters:	key (str) – the key to use in parsing
	value (str) – the value to parse
Returns:	`vcfpy.header.HeaderLine` object

class vcfpy.parser.HeaderParser[source]¶

Bases: object

Helper class for parsing a VCF header

parse_line(line)[source]¶

Parse VCF header line (trailing '\r\n' or '\n' is ignored)

Parameters:	line (str) – str with line to parse
	sub_parsers (dict) – dict mapping header line types to appropriate parser objects
Returns:	appropriate HeaderLine parsed from line
Raises:	`vcfpy.exceptions.InvalidHeaderException` if there was a problem parsing the file

sub_parsers = None¶: Sub parsers to use for parsing the header lines

class vcfpy.parser.InfoChecker(header)[source]¶

Bases: object

Helper class for checking an INFO field

header = None¶: Header to use for checking

run(key, value, num_alts)[source]¶

Check value in INFO[key] of record

Currently, only checks for consistent counts are implemented

Parameters:	key (str) – key of INFO entry to check
	value – value to check
	num_alts (int) – number of alternative alleles

class vcfpy.parser.MappingHeaderLineParser(line_class)[source]¶

Bases: vcfpy.parser.HeaderLineParserBase

Parse into HeaderLine (no particular structure)

line_class = None¶: the class to use for the VCF header line

parse_key_value(key, value)[source]¶

class vcfpy.parser.NoopFormatChecker[source]¶

Bases: object

Helper class that performs no checks

run(call, num_alts)[source]¶

class vcfpy.parser.NoopInfoChecker[source]¶

Bases: object

Helper class that performs no checks

run(key, value, num_alts)[source]¶

class vcfpy.parser.Parser(stream, path=None, record_checks=None)[source]¶

Bases: object

Class for line-wise parsing of VCF files

In most cases, you want to use vcfpy.reader.Reader instead.

Parameters:	stream – `file`-like object to read from
	path (str) – path the VCF is parsed from, for display purposes only, optional

header = None¶: header, once it has been read

parse_header(parsed_samples=None)[source]¶

Read and parse vcfpy.header.Header from file, set into self.header and return it

Parameters:	parsed_samples (list) – `list` of `str` for subsetting the samples to parse
Returns:	`vcfpy.header.Header`
Raises:	`vcfpy.exceptions.InvalidHeaderException` in the case of problems reading the header

parse_line(line)[source]¶: Parse the given line without reading another one from the stream

parse_next_record()[source]¶

Read, parse and return next vcfpy.record.Record

Returns:	next VCF record or `None` if at end
Raises:	`vcfpy.exceptions.InvalidRecordException` in the case of problems reading the record

print_warn_summary()[source]¶: If there were any warnings, print summary with warnings

record_checks = None¶: checks to perform, can contain ‘INFO’ and ‘FORMAT’

samples = None¶: vcfpy.header.SamplesInfos with sample information; set on parsing the header

class vcfpy.parser.QuotedStringSplitter(delim=', ', quote='"', brackets='[]')[source]¶

Bases: object

Helper class for splitting quoted strings

Supports quoted strings as well as brackets. Meant for splitting the mappings in VCF header lines.

ARRAY = 3¶: state constant for array

DELIM = 4¶: state constant for delimiter

ESCAPED = 2¶: state constant for escaped

NORMAL = 0¶: state constant for normal

QUOTED = 1¶: state constant for quoted

delim = None¶: string delimiter

quote = None¶: quote character

run(s)[source]¶

Split string s at delimiter, correctly interpreting quotes

Further, interprets arrays wrapped in one level of []. No recursive brackets are interpreted (as this would make the grammar non-regular and currently this complexity is not needed). Currently, quoting inside of brackets is not supported either. This is just to support the example from VCF v4.3.
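The splitting rules above can be sketched as a small character-by-character state machine. This is an independent illustration, not the QuotedStringSplitter implementation; `split_quoted` is a hypothetical name, and the sketch assumes a single-character delimiter (a plain comma).

```python
def split_quoted(s, delim=",", quote='"', brackets="[]"):
    """Split s at delim, ignoring delimiters in quotes or one level of brackets."""
    parts, current = [], []
    in_quote = in_array = escaped = False
    for char in s:
        if escaped:
            # Character after a backslash inside quotes is taken literally.
            current.append(char)
            escaped = False
        elif in_quote:
            current.append(char)
            if char == "\\":
                escaped = True
            elif char == quote:
                in_quote = False
        elif in_array:
            current.append(char)
            if char == brackets[1]:
                in_array = False
        elif char == delim:
            # Delimiter outside quotes and brackets ends the current part.
            parts.append("".join(current))
            current = []
        else:
            current.append(char)
            if char == quote:
                in_quote = True
            elif char == brackets[0]:
                in_array = True
    parts.append("".join(current))
    return parts
```

Applied to `ID=DP,Description="a, b",Values=[x,y]`, the comma inside the quoted description and the comma inside the brackets are both kept, so three parts come back.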

class vcfpy.parser.RecordParser(header, samples, record_checks=None)[source]¶

Bases: object

Helper class for parsing VCF records

header = None¶: Header with the meta information

parse_line(line_str)[source]¶: Parse line from file (including trailing line break) and return resulting Record

record_checks = None¶: The checks to perform, can contain ‘INFO’ and ‘FORMAT’

samples = None¶: SamplesInfos with sample information

vcfpy.parser.SUPPORTED_VCF_VERSIONS = ('VCFv4.0', 'VCFv4.1', 'VCFv4.2', 'VCFv4.3')¶: Supported VCF versions, a warning will be issued otherwise

class vcfpy.parser.StupidHeaderLineParser[source]¶

Bases: vcfpy.parser.HeaderLineParserBase

Parse into HeaderLine (no particular structure)

parse_key_value(key, value)[source]¶

vcfpy.parser.binomial[source]¶

vcfpy.parser.build_header_parsers()[source]¶

Return mapping for parsers to use for each VCF header type

Inject the WarningHelper into the parsers.

vcfpy.parser.convert_field_value(type_, value)[source]¶: Convert atomic field value according to the type

vcfpy.parser.parse_breakend(alt_str)[source]¶: Parse breakend and return tuple with results, parameters for BreakEnd constructor

vcfpy.parser.parse_field_value(field_info, value)[source]¶: Parse value according to field_info

vcfpy.parser.parse_mapping(value)[source]¶

Parse the given VCF header line mapping

Such a mapping consists of “key=value” pairs, separated by commas and wrapped in angle brackets (“<…>”). Strings are usually quoted; for certain known keys, exceptions are made, depending on the tag key. This, however, only becomes important when serializing.

Raises:	`vcfpy.exceptions.InvalidHeaderException` if there was a problem parsing the file

vcfpy.parser.process_alt(header, ref, alt_str)[source]¶: Process alternative value using Header in header

vcfpy.parser.process_sub(ref, alt_str)[source]¶: Process substitution

vcfpy.parser.process_sub_grow(ref, alt_str)[source]¶: Process substitution where the string grows

vcfpy.parser.process_sub_shrink(ref, alt_str)[source]¶: Process substitution where the string shrinks

vcfpy.parser.split_mapping(pair_str)[source]¶

Split the str in pair_str at '='

Warn if key needs to be stripped

vcfpy.parser.split_quoted_string(s, delim=', ', quote='"', brackets='[]')[source]¶

vcfpy.reader module¶

Parsing of VCF files from file-like objects

class vcfpy.reader.Reader(stream, path=None, tabix_path=None, record_checks=None, parsed_samples=None)[source]¶

Bases: object

Class for parsing of files from file-like objects

Instead of using the constructor, use the class methods from_stream() and from_path().

On construction, the header is read from the file, which can raise an exception if the header is malformed. After construction, Reader can be used as an iterable of Record.

Raises:	`InvalidHeaderException` in the case of problems reading the header

Note

It is important to note that the header member is used during the parsing of the file. If you need a modified version then create a copy, e.g., using vcfpy.header.Header.copy().

Note

If you use the parsed_samples feature and you write out records then you must not change the FORMAT of the record.

close()[source]¶: Close underlying stream

fetch(chrom_or_region, begin=None, end=None)[source]¶

Jump to the start position of the given chromosomal position and limit iteration to the end position

Parameters:	chrom_or_region (str) – name of the chromosome to jump to if begin and end are given and a samtools region string otherwise (e.g. “chr1:123,456-123,900”).
	begin (int) – 0-based begin position (inclusive)
	end (int) – 0-based end position (exclusive)
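The two calling conventions use different coordinate systems: begin and end are 0-based and half-open, while a samtools region string is 1-based and inclusive. The conversion can be sketched as follows; `parse_region` is a hypothetical helper, not a vcfpy function, and it assumes the region always has the form "chrom:begin-end".

```python
def parse_region(region):
    """Convert "chrom:begin-end" (1-based, inclusive) to 0-based half-open."""
    # Split at the last colon so that contig names containing ":" survive.
    chrom, _, interval = region.rpartition(":")
    begin_str, _, end_str = interval.replace(",", "").partition("-")
    begin, end = int(begin_str), int(end_str)
    # The first base moves down by one; the end stays the same because the
    # inclusive 1-based end equals the exclusive 0-based end.
    return chrom, begin - 1, end
```

Under this conversion, the region string “chr1:123,456-123,900” covers the same bases as chrom "chr1" with begin 123455 and end 123900.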

classmethod from_path(path, tabix_path=None, record_checks=None, parsed_samples=None)[source]¶

Create new Reader from path

Note

If you use the parsed_samples feature and you write out records then you must not change the FORMAT of the record.

Parameters:	path – the path to load from (converted to `str` for compatibility with `path.py`)
	tabix_path – optional string with path to TBI index; if not given, it is inferred from `path` on the fly
	record_checks (list) – record checks to perform, can contain ‘INFO’ and ‘FORMAT’

classmethod from_stream(stream, path=None, tabix_path=None, record_checks=None, parsed_samples=None)[source]¶

Create new Reader from file

Note

If you use the parsed_samples feature and you write out records then you must not change the FORMAT of the record.

Parameters:	stream – `file`-like object to read from
	path – optional string with path to store (for display only)
	record_checks (list) – record checks to perform, can contain ‘INFO’ and ‘FORMAT’
	parsed_samples (list) – `list` of `str` values with names of samples to parse call information for (for speedup); leave to `None` for ignoring

header = None¶: the Header

parsed_samples = None¶: if set, list of samples to parse for

parser = None¶: the parser to use

path = None¶: optional str with the path to the stream

record_checks = None¶: checks to perform on records, can contain ‘FORMAT’ and ‘INFO’

stream = None¶: stream (file-like object) to read from

tabix_file = None¶: the pysam.TabixFile used for reading from indexed bgzip-ed VCF; constructed on the fly

tabix_path = None¶: optional str with path to tabix file

vcfpy.record module¶

Code for representing a VCF record

The VCF record structure is modeled after the one of PyVCF

vcfpy.record.ALLELE_DELIM = re.compile('[|/]')¶: Regular expression for splitting alleles
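How the delimiter pattern and the genotype codes of this module (HOM_REF = 0, HET = 1, HOM_ALT = 2) fit together can be sketched in a few lines. `genotype_type` is a hypothetical function written for illustration; it assumes that a genotype with any missing allele has no type.

```python
import re

ALLELE_DELIM = re.compile("[|/]")  # phased ("|") or unphased ("/") separator
HOM_REF, HET, HOM_ALT = 0, 1, 2


def genotype_type(gt):
    """Classify a GT string such as "0/1"; None if an allele is missing."""
    alleles = ALLELE_DELIM.split(gt)
    if "." in alleles:
        return None
    numbers = [int(allele) for allele in alleles]
    if all(number == 0 for number in numbers):
        return HOM_REF
    if len(set(numbers)) == 1:
        return HOM_ALT
    return HET
```

Note that a call such as "1/2" carries two different alternative alleles and is therefore classified as heterozygous in this sketch.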

class vcfpy.record.AltRecord(type_=None)[source]¶

Bases: object

An alternative allele Record

Currently, can be a substitution, an SV placeholder, or breakend

serialize()[source]¶: Return str with representation for VCF file

type = None¶: String describing the type of the variant; one of SNV, MNV, or any of the types described in the ALT header lines, such as DUP, DEL, INS, …

vcfpy.record.BND = 'BND'¶: Code for break-end allele

class vcfpy.record.BreakEnd(mate_chrom, mate_pos, orientation, mate_orientation, sequence, within_main_assembly)[source]¶

Bases: vcfpy.record.AltRecord

A placeholder for a breakend

mate_chrom = None¶: chromosome of the mate breakend

mate_orientation = None¶: orientation of the breakend’s mate

mate_pos = None¶: position of the mate breakend

orientation = None¶: orientation of this breakend

sequence = None¶: breakpoint’s connecting sequence

serialize()[source]¶: Return string representation for VCF

within_main_assembly = None¶: bool specifying if the breakend mate is within the assembly (True) or in an ancillary assembly (False)

class vcfpy.record.Call(sample, data, site=None)[source]¶

Bases: object

The information for a genotype call

According to the VCF specification, this should always include the genotype information and can contain an arbitrary number of further annotations, e.g., the coverage at the variant position.

called = None¶: whether or not the variant is fully called

data = None¶: an OrderedDict with the key/value pair information from the call’s data

gt_alleles = None¶: the allele numbers (0, 1, …) in this call or None for no-call

gt_bases¶: Return the actual genotype bases, e.g. if VCF genotype is 0/1, could return (‘A’, ‘T’)

gt_phase_char¶: Return character to use for phasing

gt_type¶: The type of genotype, returns one of HOM_REF, HOM_ALT, and HET.

is_filtered(require=None, ignore=None)[source]¶

Return True for filtered calls

Parameters:	ignore (iterable) – if set, the filters to ignore; make sure to include ‘PASS’ when setting this; default is `['PASS']`
	require (iterable) – if set, the filters to require for returning `True`

is_het¶: Return True for heterozygous calls

is_phased¶: Return boolean indicating whether this call is phased

is_variant¶: Return True for non-hom-ref calls

plodity = None¶: the number of alleles in this sample’s call

sample = None¶: the name of the sample for which the call was made

site = None¶: the Record of this Call

vcfpy.record.DEL = 'DEL'¶: Code for “clean” deletion allele

vcfpy.record.ESCAPE_MAPPING = [('%', '%25'), (':', '%3A'), (';', '%3B'), ('=', '%3D'), (',', '%2C'), ('\r', '%0D'), ('\n', '%0A'), ('\t', '%09')]¶: Mapping for escaping reserved characters
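The order of the entries in the mapping matters: '%' is escaped first so that the percent signs introduced by later replacements are not escaped a second time, and decoding has to run in reverse. A minimal sketch, with hypothetical function names `escape` and `unescape`:

```python
# Copied from the ESCAPE_MAPPING value documented above.
ESCAPE_MAPPING = [
    ("%", "%25"), (":", "%3A"), (";", "%3B"), ("=", "%3D"),
    (",", "%2C"), ("\r", "%0D"), ("\n", "%0A"), ("\t", "%09"),
]


def escape(value):
    """Percent-encode the characters reserved in VCF."""
    for raw, encoded in ESCAPE_MAPPING:
        value = value.replace(raw, encoded)
    return value


def unescape(value):
    """Reverse escape(); '%25' is decoded last to avoid double decoding."""
    for raw, encoded in reversed(ESCAPE_MAPPING):
        value = value.replace(encoded, raw)
    return value
```

A value that already contains a literal "%3A" survives a round trip because its '%' is encoded to "%25" first and restored last.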

vcfpy.record.FIVE_PRIME = '5'¶: code for five prime orientation BreakEnd

vcfpy.record.FORWARD = '+'¶: code for forward orientation

vcfpy.record.HET = 1¶: Code for heterozygous

vcfpy.record.HOM_ALT = 2¶: Code for homozygous alternative

vcfpy.record.HOM_REF = 0¶: Code for homozygous reference

vcfpy.record.INDEL = 'INDEL'¶: Code for indel allele, includes substitutions of unequal length

vcfpy.record.INS = 'INS'¶: Code for “clean” insertion allele

vcfpy.record.MIXED = 'MIXED'¶: Code for mixed variant type

vcfpy.record.MNV = 'MNV'¶: Code for a multi nucleotide variant allele

vcfpy.record.RESERVED_CHARS = ':;=%,\r\n\t'¶: Characters reserved in VCF, have to be escaped

vcfpy.record.REVERSE = '-'¶: code for reverse orientation

class vcfpy.record.Record(CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, calls)[source]¶

Bases: object

Represent one record from the VCF file

Record objects are iterators of their calls

ALT = None¶: A list of alternative allele records of type AltRecord

CHROM = None¶: A str with the chromosome name

FILTER = None¶: A list of strings for the FILTER column

FORMAT = None¶: A list of strings for the FORMAT column

ID = None¶: A list of the semicolon-separated values of the ID column

INFO = None¶: An OrderedDict giving the values of the INFO column; flags are mapped to True

POS = None¶: An int with a 1-based begin position

QUAL = None¶: The quality value, can be None

REF = None¶: A str with the REF value

add_filter(label)[source]¶: Add label to FILTER if not set yet, removing PASS entry if present
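
The described behaviour can be illustrated on a plain list of filter labels. This is a sketch only: the free function below is hypothetical and returns a new list rather than mutating a Record.

```python
def add_filter(filters, label):
    """Return filters with label added, dropping 'PASS' when a label is added."""
    if label in filters:
        return list(filters)  # already set: nothing to do
    return [f for f in filters if f != "PASS"] + [label]


print(add_filter(["PASS"], "LowQual"))
```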

add_format(key, value=None)[source]¶

Add an entry to format

The record’s calls data[key] will be set to value if not yet set and value is not None. If key is already in FORMAT then nothing is done.

affected_end¶

Return affected end position in 0-based coordinates

For SNVs, MNVs, and deletions, the result is derived from the start position and the length of REF. In the case of insertions, the position behind the insert position is returned, yielding a 0-length interval together with affected_start.

affected_start¶

Return affected start position in 0-based coordinates

For SNVs, MNVs, and deletions, this is the start position. In the case of insertions, the position behind the insert position is returned, yielding a 0-length interval together with affected_end.
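
A worked sketch of the two properties, assuming a half-open 0-based interval and reading "the position behind the insert position" as the 0-based index directly after the anchor base. The function affected_interval and its insertion_only flag are hypothetical; the library determines the allele type from the record itself.

```python
def affected_interval(pos, ref, insertion_only=False):
    """Return (start, end) in 0-based coordinates for a 1-based POS."""
    if insertion_only:
        # Zero-length interval located behind the anchor base.
        return pos, pos
    start = pos - 1  # convert 1-based POS to a 0-based start
    return start, start + len(ref)


print(affected_interval(100, "A"))
print(affected_interval(100, "ACG"))
print(affected_interval(100, "A", insertion_only=True))
```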

begin = None¶: An int with a 0-based begin position

call_for_sample = None¶: A mapping from sample name to entry in self.calls

calls = None¶: A list of genotype Call objects

end = None¶: An int with a 0-based end position

is_snv()[source]¶: Return True if it is a SNV

vcfpy.record.SNV = 'SNV'¶: Code for single nucleotide variant allele

vcfpy.record.SV = 'SV'¶: Code for structural variant allele

vcfpy.record.SYMBOLIC = 'SYMBOLIC'¶: Code for symbolic allele that is neither SV nor BND

class vcfpy.record.SingleBreakEnd(orientation, sequence)[source]¶

Bases: vcfpy.record.BreakEnd

A placeholder for a single breakend

class vcfpy.record.Substitution(type_, value)[source]¶

Bases: vcfpy.record.AltRecord

A basic alternative allele record describing a REF->AltRecord substitution

Note that this subsumes MNVs, insertions, and deletions.

serialize()[source]¶

value = None¶: The alternative base sequence to use in the substitution

class vcfpy.record.SymbolicAllele(value)[source]¶

Bases: vcfpy.record.AltRecord

A placeholder for a symbolic allele

The allele symbol must be defined in the header using an ALT header before being parsed. Usually, this is used for succinct descriptions of structural variants or IUPAC parameters.

serialize()[source]¶

value = None¶: The symbolic value, e.g. ‘DUP’

vcfpy.record.THREE_PRIME = '3'¶: Code for three prime orientation BreakEnd

vcfpy.record.UNESCAPE_MAPPING = [('%25', '%'), ('%3A', ':'), ('%3B', ';'), ('%3D', '='), ('%2C', ','), ('%0D', '\r'), ('%0A', '\n'), ('%09', '\t')]¶: Mapping from escaped characters to reserved ones
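
One way to apply such a mapping is a single regular-expression pass, so that the output of one substitution is never scanned again. This is an illustrative sketch with a hypothetical function name, not necessarily how vcfpy applies the mapping.

```python
import re

UNESCAPE_MAPPING = [
    ("%25", "%"), ("%3A", ":"), ("%3B", ";"), ("%3D", "="),
    ("%2C", ","), ("%0D", "\r"), ("%0A", "\n"), ("%09", "\t"),
]

_LOOKUP = dict(UNESCAPE_MAPPING)
_PATTERN = re.compile("|".join(re.escape(key) for key in _LOOKUP))


def unescape(value):
    """Decode percent-escapes in one left-to-right pass (illustrative)."""
    return _PATTERN.sub(lambda match: _LOOKUP[match.group(0)], value)


print(unescape("a%3Bb%3Dc%25"))
```

With a single pass, an escaped literal such as "%253A" decodes to "%3A" rather than being decoded a second time to ":".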

class vcfpy.record.UnparsedCall(sample, unparsed_data, site=None)[source]¶

Bases: object

Placeholder for Call when parsing only a subset of fields

sample = None¶: the name of the sample for which the call was made

site = None¶: the Record of this Call

unparsed_data = None¶: str with the unparsed data

vcfpy.warn_utils module¶

vcfpy.writer module¶

Writing of VCF files to file-like objects

Currently, only writing to plain-text files is supported

class vcfpy.writer.Writer(stream, header, path=None)[source]¶

Bases: object

Class for writing VCF files to file-like objects

Instead of using the constructor, use the class methods from_stream() and from_path().

The writer has to be constructed with a Header object and the full VCF header will be written immediately on construction. This, of course, implies that modifying the header after construction is illegal.

close()[source]¶: Close underlying stream

classmethod from_path(path, header)[source]¶

Create new Writer from path

Parameters:
path – the path to write to (converted to `str` for compatibility with `path.py`)
header – VCF header to use, lines and samples are deep-copied

classmethod from_stream(stream, header, path=None, use_bgzf=None)[source]¶

Create new Writer from file

Note that for getting bgzf support, you have to pass in a stream opened in binary mode. Further, you either have to provide a path ending in ".gz" or set use_bgzf=True. Otherwise, you will get the notorious “TypeError: ‘str’ does not support the buffer interface”.

Parameters:
stream – `file`-like object to write to
header – VCF header to use, lines and samples are deep-copied
path – optional string with path to store (for display only)
use_bgzf – indicator whether to write bgzf to `stream`: write bgzf if `True`, prevent if `False`, interpret `path` if `None`
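
The three-state use_bgzf parameter can be read as the following decision. The helper wants_bgzf is hypothetical, and the rule that a path ending in ".gz" selects bgzf when use_bgzf is None is taken from the note above.

```python
def wants_bgzf(path=None, use_bgzf=None):
    """Decide whether output should be bgzf-compressed (illustrative)."""
    if use_bgzf is not None:
        return bool(use_bgzf)  # explicit True/False always wins
    # use_bgzf is None: fall back to inspecting the path, if any
    return bool(path) and str(path).endswith(".gz")


print(wants_bgzf("out.vcf.gz"))
print(wants_bgzf("out.vcf.gz", use_bgzf=False))
```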

header = None¶: the vcfpy.header.Header to write out, will be deep-copied into the Writer on initialization

path = None¶: optional str with the path to the stream

stream = None¶: stream (file-like object) to write to

write_record(record)[source]¶: Write out the given vcfpy.record.Record to this Writer

vcfpy.writer.format_atomic(value)[source]¶

Format atomic value

This function also takes care of escaping the value in case one of the reserved characters occurs in the value.

vcfpy.writer.format_value(field_info, value, section)[source]¶: Format possibly compound value given the FieldInfo
