vcfpy package

Submodules

vcfpy.version module

Module contents

class vcfpy.AltAlleleHeaderLine(key: str, value: str, mapping: dict[str, Any])[source]

Bases: SimpleHeaderLine

Alternative allele header line

Mostly used for defining symbolic alleles for structural variants and IUPAC ambiguity codes

classmethod from_mapping(mapping: dict[str, Any]) → AltAlleleHeaderLine[source]: Construct from mapping, not requiring the string value

id: name of the alternative allele

class vcfpy.AltRecord(type_: Literal['SNV', 'MNV', 'DEL', 'INS', 'INDEL', 'SV', 'BND', 'SYMBOLIC', 'MIXED'] | None = None)[source]

Bases: object

An alternative allele Record

Currently, this can be a substitution, an SV placeholder, or a breakend

serialize() → str[source]: Return str with representation for VCF file

type: String describing the type of the variant, could be one of SNV, MNV, could be any of teh types described in the ALT header lines, such as DUP, DEL, INS, …

class vcfpy.BreakEnd

Bases: AltRecord

A placeholder for a breakend

mate_chrom: chromosome of the mate breakend

mate_orientation: orientation breakend’s mate

mate_pos: position of the mate breakend

orientation: orientation of this breakend

sequence: breakpoint’s connecting sequence

serialize()[source]: Return string representation for VCF

within_main_assembly: bool specifying if the breakend mate is within the assembly (True) or in an ancillary assembly (False)

class vcfpy.Call(sample: str, data: dict[str, Any], site: Record | None = None)[source]

Bases: object

The information for a genotype call

According to the VCF specification, this should always include the genotype information and can contain an arbitrary number of further annotations, e.g., the coverage at the variant position.

called: whether or not the variant is fully called

data: dict[str, Any]: an OrderedDict with the key/value pair information from the call’s data

gt_alleles: list[int | None] | None: the allele numbers (0, 1, …) in this call or None for no-call

property gt_bases: tuple[str | None, ...]: Return the actual genotype bases, e.g. if VCF genotype is 0/1, could return (‘A’, ‘T’)

property gt_phase_char: Return character to use for phasing

property gt_type: Literal[0, 1, 2] | None: The type of genotype, returns one of HOM_REF, HOM_ALT, and HET.
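How gt_alleles, gt_bases, gt_phase_char, and gt_type relate to one another can be illustrated without vcfpy itself. The following is a minimal standalone sketch, not vcfpy's implementation: the helper names parse_gt, gt_bases, and classify are hypothetical, and the numeric values assigned to HOM_REF, HET, and HOM_ALT are an assumption (the text above only states that gt_type is one of 0, 1, 2).

```python
import re

# Assumed numeric codes, for illustration only.
HOM_REF, HET, HOM_ALT = 0, 1, 2


def parse_gt(gt):
    """Split a GT string such as '0/1' or '1|.' into allele numbers and a phase character."""
    phase_char = "|" if "|" in gt else "/"
    alleles = [None if a == "." else int(a) for a in re.split(r"[/|]", gt)]
    return alleles, phase_char


def gt_bases(alleles, ref, alts):
    """Map allele numbers to bases: 0 is REF, n >= 1 is the n-th ALT value."""
    bases = [ref] + list(alts)
    return tuple(None if a is None else bases[a] for a in alleles)


def classify(alleles):
    """Return HOM_REF, HET, or HOM_ALT; None if any allele is a no-call."""
    if any(a is None for a in alleles):
        return None
    if all(a == 0 for a in alleles):
        return HOM_REF
    if len(set(alleles)) == 1:
        return HOM_ALT
    return HET


alleles, phase_char = parse_gt("0/1")
print(alleles, phase_char, gt_bases(alleles, "A", ["T"]), classify(alleles))
```

In this sketch a genotype with two different non-reference alleles (e.g. 1/2) is classified as heterozygous; whether that matches every caller's expectation is a design choice worth checking against real data.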

is_filtered(require: list[str] | None = None, ignore: list[str] | None = None)[source]

Return True for filtered calls

Parameters:

ignore (iterable) – if set, the filters to ignore; make sure to include ‘PASS’ when setting this; default is ['PASS']
require (iterable) – if set, the filters to require for returning True
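The interplay of require and ignore is easiest to see on a plain list of filter names. The sketch below is a standalone approximation under stated assumptions, not vcfpy's code: call_is_filtered is a hypothetical helper, and the filters argument stands in for whatever per-call filter values the call carries.

```python
def call_is_filtered(filters, require=None, ignore=None):
    """Return True if any filter that is not ignored applies.

    filters: filter names attached to one call (hypothetical input).
    require: if given, only these filters count as filtering the call.
    ignore:  filters to skip; defaults to ["PASS"].
    """
    ignore = list(ignore) if ignore else ["PASS"]
    for name in filters:
        if name in ignore:
            continue  # ignored filters never mark the call as filtered
        if not require or name in require:
            return True
    return False


print(call_is_filtered(["LowQual"]))
print(call_is_filtered(["LowQual"], require=["LowDP"]))
```

Because a custom ignore list replaces the default, leaving out 'PASS' would make passing calls count as filtered, which is why the parameter description asks for it to be included.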

property is_het: bool: Return True for heterozygous calls

property is_phased: Return boolean indicating whether this call is phased

property is_variant: bool: Return True for non-hom-ref calls

ploidy: the number of alleles in this sample’s call

sample: the name of the sample for which the call was made

set_genotype(genotype: str | None)[source]: Set self.data["GT"] to genotype and properly update related properties.

site: the Record of this Call

class vcfpy.CompoundHeaderLine(key: str, value: str, mapping: dict[str, Any])[source]

Bases: HeaderLine

Base class for compound header lines, currently FORMAT and INFO header lines

Compound header lines describe fields that can have more than one entry.

Don’t use this class directly but rather the subclasses.

copy()[source]: Return a copy

mapping: OrderedDict with key/value mapping

serialize()[source]: Return VCF-serialized version of this header line

property value

class vcfpy.ContigHeaderLine(key: str, value: str, mapping: dict[str, Any])[source]

Bases: SimpleHeaderLine

Contig header line

Most importantly, parses the 'length' key into an integer

classmethod from_mapping(mapping: dict[str, Any]) → ContigHeaderLine[source]: Construct from mapping, not requiring the string value

id: name of the contig

length: length of the contig, None if missing

class vcfpy.FieldInfo(type_: Literal['Integer', 'Float', 'Flag', 'Character', 'String'], number: int | str, description: str | None = None, id_: str | None = None)[source]

Bases: object

Core information for describing field type and number

description: str | None: Description for the header field, optional

id: str | None: The id of the field, optional.

number: int | str: Number description, either an int or constant

type: Literal['Integer', 'Float', 'Flag', 'Character', 'String']: The type, one of INFO_TYPES or FORMAT_TYPES

class vcfpy.FilterHeaderLine(key: str, value: str, mapping: dict[str, Any])[source]

Bases: SimpleHeaderLine

FILTER header line

description: description for the filter, None if missing

classmethod from_mapping(mapping: dict[str, Any]) → FilterHeaderLine[source]: Construct from mapping, not requiring the string value

id: token for the filter

class vcfpy.FormatHeaderLine(key: str, value: str, mapping: dict[str, Any])[source]

Bases: CompoundHeaderLine

Header line for FORMAT fields

description: description, should be given, None if not given

classmethod from_mapping(mapping: dict[str, Any]) → FormatHeaderLine[source]: Construct from mapping, not requiring the string value

id: key in the INFO field

source: source of INFO field, None if not given

type: value type

version: version of INFO field, None if not given

class vcfpy.Header(lines: list[HeaderLine] | None = None, samples: SamplesInfos | None = None)[source]

Bases: object

Represent header of VCF file

While this class allows mutation, a header should not be changed once it has been assigned to a writer. Use copy() to create a copy that can be modified without problems.

This class provides functions for adding lines to a header and updating the supporting index data structures. There is no explicit API for removing header lines; the best way is to construct a new Header instance with a filtered list of header lines.

add_contig_line(mapping: dict[str, Any])[source]

Add “contig” header line constructed from the given mapping

Parameters: mapping – OrderedDict with mapping to add. It is recommended to use OrderedDict over dict as this makes the result reproducible
Returns: False on conflicting line and True otherwise

add_filter_line(mapping: dict[str, Any])[source]

Add FILTER header line constructed from the given mapping

Parameters: mapping – OrderedDict with mapping to add. It is recommended to use OrderedDict over dict as this makes the result reproducible
Returns: False on conflicting line and True otherwise

add_format_line(mapping: dict[str, Any])[source]

Add FORMAT header line constructed from the given mapping

Parameters: mapping – OrderedDict with mapping to add. It is recommended to use OrderedDict over dict as this makes the result reproducible
Returns: False on conflicting line and True otherwise

add_info_line(mapping: dict[str, Any])[source]

Add INFO header line constructed from the given mapping

Parameters: mapping – OrderedDict with mapping to add. It is recommended to use OrderedDict over dict as this makes the result reproducible
Returns: False on conflicting line and True otherwise

add_line(header_line: HeaderLine)[source]

Add header line, updating any necessary support indices

Returns: False on conflicting line and True otherwise

copy()[source]: Return a copy of this header

filter_ids()[source]: Return list of all filter IDs

format_ids() → list[str][source]: Return list of all format IDs

get_format_field_info(key: str) → FieldInfo[source]: Return FieldInfo for the given FORMAT field

get_info_field_info(key: str) → FieldInfo[source]: Return FieldInfo for the given INFO field

get_lines(key: str) → Iterable[HeaderLine][source]: Return header lines having the given key as their type

has_header_line(key: str, id_: str)[source]

Return whether there is a header line with the given ID of the type given by key

Parameters:

key – The VCF header key/line type.
id_ – The ID value to compare for

Returns:

True if there is a header line starting with ##${key}= in the VCF file having the mapping entry ID set to id_.
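The condition described above can be sketched directly on raw header text. This is a standalone illustration, not how vcfpy stores or indexes header lines: the has_header_line function below is a hypothetical stand-in that scans strings, and the regular expression assumes the ID value contains no comma or closing angle bracket.

```python
import re


def has_header_line(raw_lines, key, id_):
    """Return True if some '##key=<...>' line has its ID entry equal to id_."""
    prefix = f"##{key}=<"
    for line in raw_lines:
        if not line.startswith(prefix):
            continue
        match = re.search(r"[<,]ID=([^,>]+)", line)
        if match and match.group(1) == id_:
            return True
    return False


raw = [
    "##fileformat=VCFv4.2",
    '##FILTER=<ID=PASS,Description="All filters passed">',
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth">',
]
print(has_header_line(raw, "FILTER", "PASS"))
```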

info_ids()[source]: Return list of all info IDs

lines: list of :py:HeaderLine objects

samples: SamplesInfo object

class vcfpy.HeaderLine(key: str, value: str)[source]

Bases: object

Base class for VCF header lines

copy()[source]: Return a copy

key: str with key of header line

serialize()[source]: Return VCF-serialized version of this header line

property value

exception vcfpy.HeaderNotFound[source]

Bases: VCFPyException

Raised when a VCF header could not be found

exception vcfpy.IncorrectVCFFormat[source]

Bases: VCFPyException

Raised on problems parsing VCF

class vcfpy.InfoHeaderLine(key: str, value: str, mapping: dict[str, Any])[source]

Bases: CompoundHeaderLine

Header line for INFO fields

Note that the Number field will be parsed into an int if possible. Otherwise, the constants HEADER_NUMBER_* will be used.
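The "int if possible" rule for Number can be written in a few lines. The sketch below is a standalone illustration with a hypothetical helper name, parse_number; it returns the raw string for the non-numeric codes defined by the VCF format (A, R, G, and .) instead of using the HEADER_NUMBER_* constants mentioned above.

```python
def parse_number(raw):
    """Return Number as an int when it is numeric, else the raw code string."""
    try:
        return int(raw)
    except ValueError:
        return raw  # e.g. "A", "R", "G" or "."


for raw in ("1", "2", "A", "."):
    print(repr(raw), "->", repr(parse_number(raw)))
```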

description: description, should be given, None if not given

classmethod from_mapping(mapping: dict[str, Any]) → InfoHeaderLine[source]: Construct from mapping, not requiring the string value

id: key in the INFO field

source: source of INFO field, None if not given

type: value type

version: version of INFO field, None if not given

exception vcfpy.InvalidHeaderException[source]

Bases: VCFPyException

Raised in the case of invalid header formatting

exception vcfpy.InvalidRecordException[source]

Bases: VCFPyException

Raised in the case of invalid record formatting

class vcfpy.MetaHeaderLine(key: str, value: str, mapping: dict[str, Any])[source]

Bases: SimpleHeaderLine

META header line

Used for defining the set of valid values for sample keys

classmethod from_mapping(mapping: dict[str, Any]) → MetaHeaderLine[source]: Construct from mapping, not requiring the string value

id: name of the alternative allele

class vcfpy.PedigreeHeaderLine(key: str, value: str, mapping: dict[str, Any])[source]

Bases: SimpleHeaderLine

Header line for defining a pedigree entry

classmethod from_mapping(mapping: dict[str, Any]) → PedigreeHeaderLine[source]: Construct from mapping, not requiring the string value

id: name of the alternative allele

class vcfpy.Reader

Bases: object

Class for parsing of files from file-like objects

Instead of using the constructor, use the class methods from_stream() and from_path().

On construction, the header is read from the file, which can fail (see Raises below). After construction, a Reader can be used as an iterable of Record.

Raises: InvalidHeaderException in the case of problems reading the header

Note

It is important to note that the header member is used during the parsing of the file. If you need a modified version then create a copy, e.g., using Header.copy().

Note

If you use the parsed_samples feature and you write out records then you must not change the FORMAT of the record.

close()[source]: Close underlying stream

fetch(chrom_or_region: str, begin: int | None = None, end: int | None = None)[source]

Jump to the start position of the given chromosomal position and limit iteration to the end position

Parameters:

chrom_or_region (str) – name of the chromosome to jump to if begin and end are given and a samtools region string otherwise (e.g. “chr1:123,456-123,900”).
begin (int) – 0-based begin position (inclusive)
end (int) – 0-based end position (exclusive)
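The two calling conventions use different coordinate systems: a samtools-style region string is 1-based and inclusive, while begin and end are 0-based with an exclusive end. The following standalone sketch shows the conversion; parse_region is a hypothetical helper, not part of vcfpy, and it only handles the full chrom:start-end form.

```python
def parse_region(region):
    """Convert 'chrom:start-end' (1-based, inclusive) to (chrom, begin, end).

    The returned begin/end are 0-based with an exclusive end.
    Thousands separators (commas) in the positions are removed.
    """
    chrom, _, interval = region.rpartition(":")
    start_str, _, end_str = interval.replace(",", "").partition("-")
    start, end = int(start_str), int(end_str)
    return chrom, start - 1, end


print(parse_region("chr1:123,456-123,900"))
```

Only the start shifts by one: the inclusive 1-based end and the exclusive 0-based end are the same number.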

classmethod from_path

Create new Reader from path

Note

If you use the parsed_samples feature and you write out records then you must not change the FORMAT of the record.

Parameters:

path – the path to load from (converted to str for compatibility with path.py)
tabix_path – optional string with path to TBI index; if not given, it is inferred from path on the fly
record_checks (list) – record checks to perform, can contain ‘INFO’ and ‘FORMAT’

classmethod from_stream

Create new Reader from file

Note

If you use the parsed_samples feature and you write out records then you must not change the FORMAT of the record.

Parameters:

stream – file-like object to read from
path – optional string with path to store (for display only)
record_checks (list) – record checks to perform, can contain ‘INFO’ and ‘FORMAT’
parsed_samples (list) – list of str values with names of samples to parse call information for (for speedup); leave to None for ignoring

header: the Header

parsed_samples: if set, list of samples to parse for

parser: the parser to use

path: optional str with the path to the stream

record_checks: checks to perform on records, can contain ‘FORMAT’ and ‘INFO’

stream: stream (file-like object) to read from

tabix_file: the pysam.TabixFile used for reading from index bgzip-ed VCF; constructed on the fly

tabix_path: optional str with path to tabix file

class vcfpy.Record(CHROM: str, POS: int, ID: list[str], REF: str, ALT: list[AltRecord], QUAL: float | None, FILTER: list[str], INFO: dict[str, Any], FORMAT: list[str] | None = None, calls: Sequence[Call | UnparsedCall] | None = None)[source]

Bases: object

Represent one record from the VCF file

Record objects are iterable over their calls

ALT: list[AltRecord]: A list of alternative allele records of type AltRecord

CHROM: A str with the chromosome name

FILTER: A list of strings for the FILTER column

FORMAT: A list of strings for the FORMAT column. Optional, must be given if and only if calls is also given.

ID: A list of the semicolon-separated values of the ID column

INFO: An OrderedDict giving the values of the INFO column, flags are mapped to True

POS: An int with a 1-based begin position

QUAL: The quality value, can be None

REF: A str with the REF value

add_filter(label: str)[source]: Add label to FILTER if not set yet, removing PASS entry if present
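The effect on the FILTER list can be sketched with a plain Python list. This is a standalone illustration of the described behaviour, not the method itself; the free function add_filter below is a hypothetical stand-in that mutates the list it is given.

```python
def add_filter(filters, label):
    """Append label to filters unless present, dropping any 'PASS' entry first."""
    if label in filters:
        return
    if "PASS" in filters:
        filters[:] = [f for f in filters if f != "PASS"]
    filters.append(label)


filters = ["PASS"]
add_filter(filters, "q10")
add_filter(filters, "q10")  # second call is a no-op
print(filters)
```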

add_format(key: str, value: Any | None = None)[source]

Add an entry to format

The record’s calls data[key] will be set to value if not yet set and value is not None. If key is already in FORMAT then nothing is done.
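The rule above can be sketched with plain lists and dicts. This is a standalone illustration under stated assumptions, not vcfpy's code: the free function add_format is hypothetical, format_keys stands in for the record's FORMAT list, and calls_data stands in for the data dicts of its calls.

```python
def add_format(format_keys, calls_data, key, value=None):
    """Add key to format_keys; fill it into each call's data if value is given."""
    if key in format_keys:
        return  # already present: nothing is done
    format_keys.append(key)
    if value is not None:
        for data in calls_data:
            data.setdefault(key, value)  # keep values that are already set


format_keys = ["GT"]
calls_data = [{"GT": "0/1"}, {"GT": "1/1", "DP": 30}]
add_format(format_keys, calls_data, "DP", 0)
print(format_keys, calls_data)
```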

property affected_end

Return affected end position in 0-based coordinates

For SNVs, MNVs, and deletions, the behaviour is based on the start position and the length of the REF. In the case of insertions, the position behind the insert position is returned, yielding a 0-length interval together with affected_start

property affected_start

Return affected start position in 0-based coordinates

For SNVs, MNVs, and deletions, this is the start position. In the case of insertions, the position behind the insert position is returned, yielding a 0-length interval together with affected_end
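The two properties together describe a 0-based, half-open interval. The following standalone sketch works through the arithmetic under stated assumptions: affected_interval is a hypothetical helper, pos is the 1-based POS value, and the caller decides whether the record is an insertion, since the text does not spell out how that is detected.

```python
def affected_interval(pos, ref, is_insertion):
    """Return (start, end) of the affected reference bases, 0-based, end exclusive.

    pos: 1-based POS of the record.
    ref: the REF string.
    is_insertion: True if the record is a pure insertion (caller's decision).
    """
    if is_insertion:
        # The position behind the insert position, giving a 0-length interval.
        return pos, pos
    start = pos - 1
    return start, start + len(ref)


print(affected_interval(100, "A", False))    # SNV
print(affected_interval(100, "ATG", False))  # deletion with a 3-base REF
print(affected_interval(100, "A", True))     # insertion
```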

begin: An int with a 0-based begin position

call_for_sample: A mapping from sample name to entry in self.calls.

calls: A list of genotype Call objects. Optional, must be given if and only if FORMAT is also given.

end: An int with a 0-based end position

is_snv()[source]: Return True if it is a SNV

update_calls(calls: Sequence[Call | UnparsedCall])[source]: Update self.calls and other fields as necessary.

class vcfpy.SampleHeaderLine(key: str, value: str, mapping: dict[str, Any])[source]

Bases: SimpleHeaderLine

Header line for defining a SAMPLE entry

classmethod from_mapping(mapping: dict[str, Any]) → SampleHeaderLine[source]: Construct from mapping, not requiring the string value

id: name of the alternative allele

class vcfpy.SamplesInfos(sample_names: list[str], parsed_samples: list[str] | None = None)[source]

Bases: object

Helper class for handling the samples in VCF files

The purpose of this class is to decouple the sample name list somewhat from Header. This encapsulates subsetting samples for which the genotype should be parsed and reordering samples into output files.

Note that when subsetting is used and the records are to be written out again then the FORMAT field must not be touched.

copy()[source]: Return a copy of the object

is_parsed(name: str) → bool[source]: Return whether the sample name is parsed

name_to_idx: mapping from sample name to index

names: list of sample that are read from/written to the VCF file at hand in the given order

parsed_samples: set with the samples for which the genotype call fields should be read; can be used for partial parsing (speedup) and defaults to the full list of samples, None if all are parsed

class vcfpy.SimpleHeaderLine(key: str, value: str, mapping: dict[str, Any])[source]

Bases: HeaderLine

Base class for simple header lines, such as contig and filter header lines

Don’t use this class directly but rather the subclasses.

Raises: vcfpy.exceptions.InvalidHeaderException in the case of missing key "ID"

copy()[source]: Return a copy

mapping: collections.OrderedDict with key/value mapping of the attributes

serialize()[source]: Return VCF-serialized version of this header line

property value

class vcfpy.SingleBreakEnd(orientation: str, sequence: str)[source]

Bases: BreakEnd

A placeholder for a single breakend

class vcfpy.Substitution(type_: Literal['SNV', 'MNV', 'DEL', 'INS', 'INDEL', 'SV', 'BND', 'SYMBOLIC', 'MIXED'], value: str)[source]

Bases: AltRecord

A basic alternative allele record describing a REF->AltRecord substitution

Note that this subsumes MNVs, insertions, and deletions.

serialize() → str[source]: Return str with representation for VCF file

value: The alternative base sequence to use in the substitution

class vcfpy.SymbolicAllele(value: str)[source]

Bases: AltRecord

A placeholder for a symbolic allele

The allele symbol must be defined in the header using an ALT header before being parsed. Usually, this is used for succinct descriptions of structural variants or IUPAC ambiguity codes.

serialize()[source]: Return str with representation for VCF file

value: The symbolic value, e.g. ‘DUP’

class vcfpy.UnparsedCall(sample: str, unparsed_data: Any, site: Record | None = None)[source]

Bases: object

Placeholder for Call when parsing only a subset of fields

sample: the name of the sample for which the call was made

site: the Record of this Call

unparsed_data: str with the unparsed data

exception vcfpy.VCFPyException[source]

Bases: RuntimeError

Base class for the module’s exceptions

class vcfpy.Writer(stream: IO[str], header: Header, path: Path | str | None = None)[source]

Bases: object

Class for writing VCF files to file-like objects

Instead of using the constructor, use the class methods from_stream() and from_path().

The writer has to be constructed with a Header object and the full VCF header will be written immediately on construction. This, of course, implies that modifying the header after construction is illegal.

close()[source]: Close underlying stream

classmethod from_path(path: Path | str, header: Header)[source]

Create new Writer from path

Parameters:

path – the path to write to (converted to str for compatibility with path.py)
header – VCF header to use, lines and samples are deep-copied

classmethod from_stream(stream: IO[str] | IO[bytes], header: Header, path: Path | str | None = None, use_bgzf: bool | None = None)[source]

Create new Writer from file

Note that for getting bgzf support, you have to pass in a stream opened in binary mode. Further, you either have to provide a path ending in ".gz" or set use_bgzf=True. Otherwise, you will get the notorious “TypeError: ‘str’ does not support the buffer interface”.

Parameters:

stream – file-like object to write to
header – VCF header to use, lines and samples are deep-copied
path – optional string with path to store (for display only)
use_bgzf – indicator whether to write bgzf to stream if True, prevent if False, interpret path if None
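The three-way meaning of use_bgzf (True, False, or None) can be sketched as a small decision function. This is a standalone illustration of the rule described above, not vcfpy's code; should_use_bgzf is a hypothetical helper name.

```python
def should_use_bgzf(path, use_bgzf=None):
    """Decide whether output should be bgzf-compressed.

    use_bgzf=True forces compression, False prevents it, and None falls
    back to checking whether path ends in ".gz".
    """
    if use_bgzf is not None:
        return use_bgzf
    return path is not None and str(path).endswith(".gz")


print(should_use_bgzf("out.vcf.gz"))
print(should_use_bgzf("out.vcf.gz", use_bgzf=False))
print(should_use_bgzf(None, use_bgzf=True))
```

Whenever the result is True, the stream must have been opened in binary mode, as noted above.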

header: the :py:class:~vcfpy.header.Header` to write out, will be deep-copied into the Writer on initialization

path: optional str with the path to the stream

stream: stream (file-like object) to read from

write_record(record: Record)[source]: Write out the given vcfpy.record.Record to this Writer

vcfpy.header_without_lines(header: Header, remove: Iterable[tuple[str, str]]) → Header[source]

Return Header without lines given in remove

remove is an iterable of pairs key/ID with the VCF header key and ID of entry to remove. In the case that a line does not have a mapping entry, you can give the full value to remove.

import vcfpy

# header is a vcfpy.Header, e.g., as read earlier from file
new_header = vcfpy.header_without_lines(
    header, [('assembly', None), ('FILTER', 'PASS')])
# now, the header lines starting with "##assembly=" and the "PASS"
# filter line will be missing from new_header