vcfpy package
Submodules
vcfpy.version module
Module contents
- class vcfpy.AltAlleleHeaderLine(key: str, value: str, mapping: dict[str, Any])[source]
Bases: SimpleHeaderLine
Alternative allele header line
Mostly used for defining symbolic alleles for structural variants and IUPAC ambiguity codes
- classmethod from_mapping(mapping: dict[str, Any]) AltAlleleHeaderLine[source]
Construct from mapping, not requiring the string value
- id
name of the alternative allele
- class vcfpy.AltRecord(type_: Literal['SNV', 'MNV', 'DEL', 'INS', 'INDEL', 'SV', 'BND', 'SYMBOLIC', 'MIXED'] | None = None)[source]
Bases: object
An alternative allele record
Currently, can be a substitution, an SV placeholder, or a breakend
- type
String describing the type of the variant; could be one of SNV, MNV, or any of the types described in the ALT header lines, such as DUP, DEL, INS, …
- class vcfpy.BreakEnd(mate_chrom: str | None, mate_pos: int | None, orientation: str | None, mate_orientation: Literal['+', '-'] | None, sequence: str, within_main_assembly: bool | None)[source]
Bases: AltRecord
A placeholder for a breakend
- mate_chrom
chromosome of the mate breakend
- mate_orientation
orientation of the breakend’s mate
- mate_pos
position of the mate breakend
- orientation
orientation of this breakend
- sequence
breakpoint’s connecting sequence
- within_main_assembly
bool specifying whether the breakend mate is within the assembly (True) or in an ancillary assembly (False)
- class vcfpy.Call(sample: str, data: dict[str, Any], site: Record | None = None)[source]
Bases: object
The information for a genotype call
Per the VCF format, this should always include the genotype information and can contain an arbitrary number of further annotations, e.g., the coverage at the variant position.
- called
whether or not the variant is fully called
- data: dict[str, Any]
an OrderedDict with the key/value pair information from the call’s data
- gt_alleles: list[int | None] | None
the allele numbers (0, 1, …) in this call or None for no-call
- property gt_bases: tuple[str | None, ...]
Return the actual genotype bases, e.g. if VCF genotype is 0/1, could return (‘A’, ‘T’)
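The mapping from allele numbers to bases can be sketched with the standard library alone. This is an illustration of the idea, not vcfpy's implementation; the helper name `bases_for_alleles` and its plain-string arguments are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def bases_for_alleles(
    ref: str,
    alts: Sequence[str],
    gt_alleles: Sequence[Optional[int]],
) -> Tuple[Optional[str], ...]:
    """Map allele numbers to bases: 0 selects REF, i >= 1 selects the i-th ALT.

    A None allele (no-call) is passed through as None.
    """
    alleles = [ref, *alts]
    return tuple(None if a is None else alleles[a] for a in gt_alleles)


# Genotype 0/1 at a site with REF=A and ALT=T
print(bases_for_alleles("A", ["T"], [0, 1]))  # ('A', 'T')
```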
- property gt_phase_char
Return character to use for phasing
- property gt_type: Literal[0, 1, 2] | None
The type of genotype; returns one of HOM_REF, HOM_ALT, and HET.
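A minimal sketch of how a genotype could be classified from its allele numbers. The function name `genotype_type` is hypothetical, and the numeric values assigned to the three constants are an assumption: the signature above only states that the result is 0, 1, 2, or None.

```python
from typing import Optional, Sequence

# Assumed numeric values for the three genotype types (not confirmed by the reference above).
HOM_REF, HET, HOM_ALT = 0, 1, 2


def genotype_type(gt_alleles: Optional[Sequence[Optional[int]]]) -> Optional[int]:
    """Classify a genotype given its allele numbers.

    Returns None when the genotype is missing or only partially called.
    """
    if gt_alleles is None or any(a is None for a in gt_alleles):
        return None
    if all(a == 0 for a in gt_alleles):
        return HOM_REF  # every allele is the reference allele
    if len(set(gt_alleles)) == 1:
        return HOM_ALT  # every allele is the same non-reference allele
    return HET  # at least two different alleles
```

Under this sketch, a genotype such as 1/2 counts as heterozygous because its two alleles differ, even though neither is the reference.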
- is_filtered(require: list[str] | None = None, ignore: list[str] | None = None)[source]
Return True for filtered calls
- Parameters:
ignore (iterable) – if set, the filters to ignore; make sure to include ‘PASS’ when setting; default is ['PASS']
require (iterable) – if set, the filters to require for returning True
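The interaction of `require` and `ignore` can be sketched as follows. This is a stdlib-only illustration, assuming the call's filter values are available as a list of strings (in VCF these usually come from the FT FORMAT field); the helper name `call_is_filtered` is hypothetical and the logic may differ in detail from vcfpy's.

```python
from typing import Iterable, Optional, Sequence


def call_is_filtered(
    filters: Optional[Sequence[str]],
    require: Optional[Iterable[str]] = None,
    ignore: Optional[Iterable[str]] = None,
) -> bool:
    """Return True if any filter value marks the call as filtered."""
    ignored = list(ignore) if ignore is not None else ["PASS"]
    required = list(require) if require is not None else []
    for ft in filters or []:
        if ft in ignored:
            continue  # ignored filters never count
        if not required or ft in required:
            return True  # either any filter counts, or this one is required
    return False


print(call_is_filtered(["LowQual"]))                     # True
print(call_is_filtered(["LowQual"], require=["MinDP"]))  # False
```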
- property is_het: bool
Return True for heterozygous calls
- property is_phased
Return boolean indicating whether this call is phased
- property is_variant: bool
Return True for non-hom-ref calls
- ploidy
the number of alleles in this sample’s call
- sample
the name of the sample for which the call was made
- class vcfpy.CompoundHeaderLine(key: str, value: str, mapping: dict[str, Any])[source]
Bases: HeaderLine
Base class for compound header lines, currently INFO and FORMAT header lines
Compound header lines describe fields that can have more than one entry.
Don’t use this class directly but rather the sub classes.
- mapping
OrderedDict with key/value mapping
- property value
- class vcfpy.ContigHeaderLine(key: str, value: str, mapping: dict[str, Any])[source]
Bases: SimpleHeaderLine
Contig header line
Most importantly, parses the 'length' key into an integer
- classmethod from_mapping(mapping: dict[str, Any]) ContigHeaderLine[source]
Construct from mapping, not requiring the string value
- id
name of the contig
- length
length of the contig, None if missing
- class vcfpy.FieldInfo(type_: Literal['Integer', 'Float', 'Flag', 'Character', 'String'], number: int | str, description: str | None = None, id_: str | None = None)[source]
Bases: object
Core information for describing field type and number
- description: str | None
Description for the header field, optional
- id: str | None
The id of the field, optional.
- number: int | str
Number description, either an int or constant
- type: Literal['Integer', 'Float', 'Flag', 'Character', 'String']
The type, one of INFO_TYPES or FORMAT_TYPES
- class vcfpy.FilterHeaderLine(key: str, value: str, mapping: dict[str, Any])[source]
Bases: SimpleHeaderLine
FILTER header line
- description
description for the filter, None if missing
- classmethod from_mapping(mapping: dict[str, Any]) FilterHeaderLine[source]
Construct from mapping, not requiring the string value
- id
token for the filter
- class vcfpy.FormatHeaderLine(key: str, value: str, mapping: dict[str, Any])[source]
Bases: CompoundHeaderLine
Header line for FORMAT fields
- description
description, should be given, None if not given
- classmethod from_mapping(mapping: dict[str, Any]) FormatHeaderLine[source]
Construct from mapping, not requiring the string value
- id
key in the FORMAT field
- source
source of FORMAT field, None if not given
- type
value type
- version
version of FORMAT field, None if not given
- class vcfpy.Header(lines: list[HeaderLine] | None = None, samples: SamplesInfos | None = None)[source]
Bases: object
Represent header of VCF file
While this class allows mutation, an instance should not be changed once it has been assigned to a writer. Use copy() to create a copy that can be modified without problems.
This class provides functions for adding lines to a header and updating the supporting index data structures. There is no explicit API for removing header lines; the best way is to construct a new Header instance with a filtered list of header lines.
- add_contig_line(mapping: dict[str, Any])[source]
Add “contig” header line constructed from the given mapping
- Parameters:
mapping – OrderedDict with mapping to add. It is recommended to use OrderedDict over dict as this makes the result reproducible
- Returns:
False on conflicting line and True otherwise
- add_filter_line(mapping: dict[str, Any])[source]
Add FILTER header line constructed from the given mapping
- Parameters:
mapping – OrderedDict with mapping to add. It is recommended to use OrderedDict over dict as this makes the result reproducible
- Returns:
False on conflicting line and True otherwise
- add_format_line(mapping: dict[str, Any])[source]
Add FORMAT header line constructed from the given mapping
- Parameters:
mapping – OrderedDict with mapping to add. It is recommended to use OrderedDict over dict as this makes the result reproducible
- Returns:
False on conflicting line and True otherwise
- add_info_line(mapping: dict[str, Any])[source]
Add INFO header line constructed from the given mapping
- Parameters:
mapping – OrderedDict with mapping to add. It is recommended to use OrderedDict over dict as this makes the result reproducible
- Returns:
False on conflicting line and True otherwise
- add_line(header_line: HeaderLine)[source]
Add header line, updating any necessary support indices
- Returns:
False on conflicting line and True otherwise
- get_lines(key: str) Iterable[HeaderLine][source]
Return header lines having the given key as their type
- has_header_line(key: str, id_: str)[source]
Return whether there is a header line with the given ID of the type given by key
- Parameters:
key – The VCF header key/line type.
id_ – The ID value to compare for
- Returns:
True if there is a header line starting with ##${key}= in the VCF file having the mapping entry ID set to id_.
- lines
list of HeaderLine objects
- samples
SamplesInfos object
- class vcfpy.HeaderLine(key: str, value: str)[source]
Bases: object
Base class for VCF header lines
- key
str with key of header line
- property value
- exception vcfpy.HeaderNotFound[source]
Bases: VCFPyException
Raised when a VCF header could not be found
- exception vcfpy.IncorrectVCFFormat[source]
Bases: VCFPyException
Raised on problems parsing VCF
- class vcfpy.InfoHeaderLine(key: str, value: str, mapping: dict[str, Any])[source]
Bases: CompoundHeaderLine
Header line for INFO fields
Note that the Number field will be parsed into an int if possible. Otherwise, the constants HEADER_NUMBER_* will be used.
- description
description, should be given, None if not given
- classmethod from_mapping(mapping: dict[str, Any]) InfoHeaderLine[source]
Construct from mapping, not requiring the string value
- id
key in the INFO field
- source
source of INFO field, None if not given
- type
value type
- version
version of INFO field, None if not given
- exception vcfpy.InvalidHeaderException[source]
Bases: VCFPyException
Raised in the case of invalid header formatting
- exception vcfpy.InvalidRecordException[source]
Bases: VCFPyException
Raised in the case of invalid record formatting
- class vcfpy.MetaHeaderLine(key: str, value: str, mapping: dict[str, Any])[source]
Bases: SimpleHeaderLine
META header line
Used for defining set of valid values for samples keys
- classmethod from_mapping(mapping: dict[str, Any]) MetaHeaderLine[source]
Construct from mapping, not requiring the string value
- id
ID of the META entry
- class vcfpy.PedigreeHeaderLine(key: str, value: str, mapping: dict[str, Any])[source]
Bases: SimpleHeaderLine
Header line for defining a pedigree entry
- classmethod from_mapping(mapping: dict[str, Any]) PedigreeHeaderLine[source]
Construct from mapping, not requiring the string value
- id
ID of the pedigree entry
- class vcfpy.Reader(stream: TextIOWrapper, path: Path | str | None = None, tabix_path: Path | str | None = None, record_checks: Iterable[Literal['FORMAT', 'INFO']] | None = None, parsed_samples: list[str] | None = None)[source]
Bases: object
Class for parsing of files from file-like objects
Instead of using the constructor, use the class methods from_stream() and from_path().
On construction, the header will be read from the file, which can fail if the header is invalid. After construction, Reader can be used as an iterable of Record.
- Raises:
InvalidHeaderException in the case of problems reading the header
Note
The header member is used during the parsing of the file. If you need a modified version then create a copy, e.g., using Header.copy().
Note
If you use the parsed_samples feature and you write out records then you must not change the FORMAT of the record.
- fetch(chrom_or_region: str, begin: int | None = None, end: int | None = None)[source]
Jump to the start position of the given chromosomal position and limit iteration to the end position
- Parameters:
chrom_or_region (str) – name of the chromosome to jump to if begin and end are given and a samtools region string otherwise (e.g. “chr1:123,456-123,900”).
begin (int) – 0-based begin position (inclusive)
end (int) – 0-based end position (exclusive)
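To relate the two calling styles, the following stdlib-only sketch converts a samtools-style region string (conventionally 1-based and inclusive) into a chromosome name plus 0-based begin (inclusive) and end (exclusive). The helper `parse_region` is hypothetical and only illustrates the coordinate convention; it does not show how vcfpy itself handles region strings.

```python
from typing import Tuple


def parse_region(region: str) -> Tuple[str, int, int]:
    """Convert 'chrom:start-end' (1-based, inclusive) to (chrom, begin, end),

    with begin 0-based inclusive and end 0-based exclusive.
    Thousands separators such as '123,456' are accepted.
    """
    chrom, _, interval = region.rpartition(":")
    if not chrom:
        raise ValueError(f"expected 'chrom:start-end', got {region!r}")
    start_str, sep, end_str = interval.replace(",", "").partition("-")
    if not sep:
        raise ValueError(f"expected 'chrom:start-end', got {region!r}")
    start, end = int(start_str), int(end_str)
    # 1-based inclusive [start, end] covers the same bases as 0-based half-open [start - 1, end)
    return chrom, start - 1, end


print(parse_region("chr1:123,456-123,900"))  # ('chr1', 123455, 123900)
```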
- classmethod from_path(path: Path | str, tabix_path: Path | str | None = None, record_checks: list[Literal['INFO', 'FORMAT']] | None = None, parsed_samples: list[str] | None = None)[source]
Create new Reader from path
Note
If you use the parsed_samples feature and you write out records then you must not change the FORMAT of the record.
- Parameters:
path – the path to load from (converted to str for compatibility with path.py)
tabix_path – optional string with path to TBI index; automatic inference from path will be tried on the fly if not given
record_checks (list) – record checks to perform, can contain ‘INFO’ and ‘FORMAT’
- classmethod from_stream(stream: TextIOWrapper, path: Path | str | None = None, tabix_path: Path | str | None = None, record_checks: list[Literal['INFO', 'FORMAT']] | None = None, parsed_samples: list[str] | None = None)[source]
Create new Reader from file
Note
If you use the parsed_samples feature and you write out records then you must not change the FORMAT of the record.
- Parameters:
stream – file-like object to read from
path – optional string with path to store (for display only)
record_checks (list) – record checks to perform, can contain ‘INFO’ and ‘FORMAT’
parsed_samples (list) – list of str values with names of samples to parse call information for (for speedup); leave as None to parse all samples
- header
the Header
- parsed_samples
if set, list of samples to parse for
- parser
the parser to use
- path
optional str with the path to the stream
- record_checks
checks to perform on records, can contain ‘FORMAT’ and ‘INFO’
- stream
stream (file-like object) to read from
- tabix_file
the pysam.TabixFile used for reading from an indexed bgzip-ed VCF; constructed on the fly
- tabix_path
optional str with path to tabix file
- class vcfpy.Record(CHROM: str, POS: int, ID: list[str], REF: str, ALT: list[AltRecord], QUAL: float | None, FILTER: list[str], INFO: dict[str, Any], FORMAT: list[str] | None = None, calls: Sequence[Call | UnparsedCall] | None = None)[source]
Bases: object
Represent one record from the VCF file
Record objects are iterators of their calls
- CHROM
A str with the chromosome name
- FILTER
A list of strings for the FILTER column
- FORMAT
A list of strings for the FORMAT column. Optional, must be given if and only if calls is also given.
- ID
A list of the semicolon-separated values of the ID column
- INFO
An OrderedDict giving the values of the INFO column, flags are mapped to True
- POS
An int with a 1-based begin position
- QUAL
The quality value, can be None
- REF
A str with the REF value
- add_format(key: str, value: Any | None = None)[source]
Add an entry to FORMAT
For each of the record’s calls, data[key] will be set to value if not yet set and value is not None. If key is already in FORMAT then nothing is done.
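The described behaviour can be sketched on plain Python containers. This is an illustration only: the function `add_format_entry` is hypothetical, the FORMAT column is modelled as a list of keys, and each call's data as a dict.

```python
from typing import Any, Dict, List, Optional


def add_format_entry(
    format_keys: List[str],
    calls_data: List[Dict[str, Any]],
    key: str,
    value: Optional[Any] = None,
) -> None:
    """Append key to FORMAT and fill in a default for calls lacking it."""
    if key in format_keys:
        return  # key already declared: nothing is done
    format_keys.append(key)
    if value is not None:
        for data in calls_data:
            data.setdefault(key, value)  # keep values that are already set


fmt = ["GT"]
calls = [{"GT": "0/1"}, {"GT": "1/1", "DP": 30}]
add_format_entry(fmt, calls, "DP", 0)
print(fmt)    # ['GT', 'DP']
print(calls)  # [{'GT': '0/1', 'DP': 0}, {'GT': '1/1', 'DP': 30}]
```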
- property affected_end
Return affected end position in 0-based coordinates
For SNVs, MNVs, and deletions, the behaviour is based on the start position and the length of the REF. In the case of insertions, the position behind the insert position is returned, yielding a 0-length interval together with affected_start
- property affected_start
Return affected start position in 0-based coordinates
For SNVs, MNVs, and deletions, this is the start position. In the case of insertions, the position behind the insert position is returned, yielding a 0-length interval together with affected_end
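Taken together, the two properties describe a 0-based, half-open interval. A simplified sketch of that arithmetic follows, assuming a hypothetical helper `affected_interval` and a flag stating whether all ALT alleles are insertions; vcfpy's own handling of mixed ALT types may differ.

```python
from typing import Tuple


def affected_interval(pos: int, ref: str, only_insertions: bool = False) -> Tuple[int, int]:
    """Return (start, end) in 0-based, half-open coordinates.

    pos is the 1-based POS value of the record and ref its REF string.
    """
    if only_insertions:
        # Pure insertions: a 0-length interval right behind the first REF base
        return pos, pos
    start = pos - 1  # 0-based position of the first REF base
    return start, start + len(ref)  # end lies behind the last REF base


print(affected_interval(100, "A"))        # (99, 100)
print(affected_interval(100, "ACG"))      # (99, 102)
print(affected_interval(100, "A", True))  # (100, 100)
```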
- begin
An int with a 0-based begin position
- call_for_sample
A mapping from sample name to entry in self.calls.
- calls
A list of genotype Call objects. Optional, must be given if and only if FORMAT is also given.
- end
An int with a 0-based end position
- update_calls(calls: Sequence[Call | UnparsedCall])[source]
Update self.calls and other fields as necessary.
- class vcfpy.SampleHeaderLine(key: str, value: str, mapping: dict[str, Any])[source]
Bases: SimpleHeaderLine
Header line for defining a SAMPLE entry
- classmethod from_mapping(mapping: dict[str, Any]) SampleHeaderLine[source]
Construct from mapping, not requiring the string value
- id
ID of the SAMPLE entry
- class vcfpy.SamplesInfos(sample_names: list[str], parsed_samples: list[str] | None = None)[source]
Bases: object
Helper class for handling the samples in VCF files
The purpose of this class is to decouple the sample name list somewhat from Header. This encapsulates subsetting samples for which the genotype should be parsed and reordering samples into output files.
Note that when subsetting is used and the records are to be written out again then the FORMAT field must not be touched.
- name_to_idx
mapping from sample name to index
- names
list of samples that are read from/written to the VCF file at hand in the given order
- parsed_samples
set with the samples for which the genotype call fields should be read; can be used for partial parsing (speedup); None if all samples are parsed
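The three attributes can be pictured with plain containers. The sketch below is an illustration under assumptions: the sample names are made up, and the helper `should_parse` is hypothetical rather than part of vcfpy.

```python
from typing import Dict, List, Optional, Set

# Made-up sample names in file order
names: List[str] = ["NA001", "NA002", "NA003"]

# Mapping from sample name to its column index
name_to_idx: Dict[str, int] = {name: idx for idx, name in enumerate(names)}

# Only parse genotype fields for this subset; None would mean "parse all"
parsed_samples: Optional[Set[str]] = {"NA002"}


def should_parse(name: str, subset: Optional[Set[str]]) -> bool:
    """Return True if the call fields of this sample should be parsed."""
    return subset is None or name in subset


print(name_to_idx["NA003"])                                   # 2
print([n for n in names if should_parse(n, parsed_samples)])  # ['NA002']
```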
- class vcfpy.SimpleHeaderLine(key: str, value: str, mapping: dict[str, Any])[source]
Bases: HeaderLine
Base class for simple header lines, currently contig and filter header lines
Don’t use this class directly but rather the sub classes.
- Raises:
vcfpy.exceptions.InvalidHeaderException in the case of missing key "ID"
- mapping
collections.OrderedDict with key/value mapping of the attributes
- property value
- class vcfpy.SingleBreakEnd(orientation: str, sequence: str)[source]
Bases: BreakEnd
A placeholder for a single breakend
- class vcfpy.Substitution(type_: Literal['SNV', 'MNV', 'DEL', 'INS', 'INDEL', 'SV', 'BND', 'SYMBOLIC', 'MIXED'], value: str)[source]
Bases: AltRecord
A basic alternative allele record describing a REF->ALT substitution
Note that this subsumes MNVs, insertions, and deletions.
- value
The alternative base sequence to use in the substitution
- class vcfpy.SymbolicAllele(value: str)[source]
Bases: AltRecord
A placeholder for a symbolic allele
The allele symbol must be defined in the header using an ALT header line before being parsed. Usually, this is used for succinct descriptions of structural variants or IUPAC ambiguity codes.
- value
The symbolic value, e.g. ‘DUP’
- class vcfpy.UnparsedCall(sample: str, unparsed_data: Any, site: Record | None = None)[source]
Bases: object
Placeholder for Call when parsing only a subset of fields
- sample
the name of the sample for which the call was made
- unparsed_data
str with the unparsed data
- class vcfpy.Writer(stream: IO[str], header: Header, path: Path | str | None = None)[source]
Bases: object
Class for writing VCF files to file-like objects
Instead of using the constructor, use the class methods from_stream() and from_path().
The writer has to be constructed with a Header object, and the full VCF header will be written immediately on construction. This implies that the header must not be modified after construction.
- classmethod from_path(path: Path | str, header: Header)[source]
Create new Writer from path
- Parameters:
path – the path to write to (converted to str for compatibility with path.py)
header – VCF header to use, lines and samples are deep-copied
- classmethod from_stream(stream: IO[str] | IO[bytes], header: Header, path: Path | str | None = None, use_bgzf: bool | None = None)[source]
Create new Writer from file
Note that for getting bgzf support, you have to pass in a stream opened in binary mode. Further, you either have to provide a path ending in ".gz" or set use_bgzf=True. Otherwise, you will get the notorious “TypeError: ‘str’ does not support the buffer interface”.
- Parameters:
stream – file-like object to write to
header – VCF header to use, lines and samples are deep-copied
path – optional string with path to store (for display only)
use_bgzf – indicator whether to write bgzf to stream if True, prevent if False, interpret path if None
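The three-way meaning of use_bgzf can be sketched as a small decision function. This is a stdlib-only illustration; the helper name `should_use_bgzf` is hypothetical and does not reflect how vcfpy structures this check internally.

```python
from pathlib import Path
from typing import Optional, Union


def should_use_bgzf(path: Optional[Union[Path, str]], use_bgzf: Optional[bool]) -> bool:
    """Decide whether output should be bgzf-compressed.

    True forces bgzf, False prevents it, None falls back to the path suffix.
    """
    if use_bgzf is not None:
        return use_bgzf
    return path is not None and str(path).endswith(".gz")


print(should_use_bgzf("out.vcf.gz", None))   # True
print(should_use_bgzf("out.vcf.gz", False))  # False
print(should_use_bgzf(None, None))           # False
```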
- header
the Header to write out, will be deep-copied into the Writer on initialization
- path
optional str with the path to the stream
- stream
stream (file-like object) to write to
- write_record(record: Record)[source]
Write out the given vcfpy.record.Record to this Writer
- vcfpy.header_without_lines(header: Header, remove: Iterable[tuple[str, str]]) Header[source]
Return Header without lines given in remove
remove is an iterable of pairs key/ID with the VCF header key and ID of entry to remove. In the case that a line does not have a mapping entry, you can give the full value to remove.
    # header is a vcfpy.Header, e.g., as read earlier from file
    new_header = vcfpy.header_without_lines(
        header, [('assembly', None), ('FILTER', 'PASS')])
    # now, the header lines starting with "##assembly=" and the "PASS"
    # filter line will be missing from new_header