vcfpy package¶
Submodules¶
vcfpy.bgzf module¶
Support code for writing BGZF files
Shamelessly taken from Biopython
-
class
vcfpy.bgzf.BgzfWriter(filename=None, mode='w', fileobj=None, compresslevel=6)[source]¶ Bases:
object
-
close()[source]¶ Flush data, write 28 bytes BGZF EOF marker, and close BGZF file. samtools will look for a magic EOF marker, just a 28 byte empty BGZF block, and if it is missing warns the BAM file may be truncated. In addition to samtools writing this block, so too does bgzip - so this implementation does too.
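The EOF marker described above can be checked from outside the writer. The following is a minimal sketch, not part of vcfpy: the helper name `ends_with_bgzf_eof` is hypothetical, and the byte string is the 28-byte empty BGZF block as given in the SAM/BAM specification.

```python
import io

# The 28-byte empty BGZF block used as EOF marker (bytes as listed in the
# SAM/BAM specification); split into pieces only for readability.
BGZF_EOF = bytes.fromhex(
    "1f8b0804" "00000000" "00ff" "0600" "4243" "0200" "1b00" "0300"
    "00000000" "00000000"
)


def ends_with_bgzf_eof(stream):
    """Return True if the seekable binary stream ends with the EOF block.

    Hypothetical helper for illustration; not a vcfpy function.
    """
    stream.seek(0, io.SEEK_END)
    size = stream.tell()
    if size < len(BGZF_EOF):
        return False
    stream.seek(size - len(BGZF_EOF))
    return stream.read(len(BGZF_EOF)) == BGZF_EOF
```

A file written through `BgzfWriter` and closed properly would be expected to pass this check; a file cut off mid-write would not.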
-
-
vcfpy.bgzf.make_virtual_offset(block_start_offset, within_block_offset)[source]¶ Compute a BGZF virtual offset from block start and within block offsets.
The BAM indexing scheme records read positions using a 64 bit ‘virtual offset’, comprising in C terms:
block_start_offset << 16 | within_block_offset
Here block_start_offset is the file offset of the BGZF block start (unsigned integer using up to 64-16 = 48 bits), and within_block_offset is the offset within the (decompressed) block (unsigned 16 bit integer).
>>> make_virtual_offset(0, 0)
0
>>> make_virtual_offset(0, 1)
1
>>> make_virtual_offset(0, 2**16 - 1)
65535
>>> make_virtual_offset(0, 2**16)
Traceback (most recent call last):
...
ValueError: Require 0 <= within_block_offset < 2**16, got 65536
>>> 65536 == make_virtual_offset(1, 0)
True
>>> 65537 == make_virtual_offset(1, 1)
True
>>> 131071 == make_virtual_offset(1, 2**16 - 1)
True
>>> 6553600000 == make_virtual_offset(100000, 0)
True
>>> 6553600001 == make_virtual_offset(100000, 1)
True
>>> 6553600010 == make_virtual_offset(100000, 10)
True
>>> make_virtual_offset(2**48, 0)
Traceback (most recent call last):
...
ValueError: Require 0 <= block_start_offset < 2**48, got 281474976710656
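The bit layout can be written out in a few lines. This is a sketch of the arithmetic described above, re-implemented here for illustration; `split_virtual_offset` is an assumed helper name for the inverse operation and is not documented on this page.

```python
def make_virtual_offset(block_start_offset, within_block_offset):
    """Pack both offsets into one integer: upper 48 bits, lower 16 bits."""
    if not 0 <= within_block_offset < 2**16:
        raise ValueError(
            "Require 0 <= within_block_offset < 2**16, got %i" % within_block_offset
        )
    if not 0 <= block_start_offset < 2**48:
        raise ValueError(
            "Require 0 <= block_start_offset < 2**48, got %i" % block_start_offset
        )
    return (block_start_offset << 16) | within_block_offset


def split_virtual_offset(virtual_offset):
    """Inverse of make_virtual_offset (illustrative helper)."""
    return virtual_offset >> 16, virtual_offset & 0xFFFF
```

Because the within-block part occupies exactly the lower 16 bits, comparing two virtual offsets as plain integers orders them by block first and by position within the block second.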
vcfpy.exceptions module¶
Exceptions for the vcfpy module
-
exception
vcfpy.exceptions.CannotConvertValue[source]¶ Bases:
vcfpy.exceptions.VCFPyWarning
Cannot convert value.
-
exception
vcfpy.exceptions.DuplicateHeaderLineWarning[source]¶ Bases:
vcfpy.exceptions.VCFPyWarning
A header line occurs twice in a header
-
exception
vcfpy.exceptions.FieldInfoNotFound[source]¶ Bases:
vcfpy.exceptions.VCFPyWarning
A header field is not found, default is used
-
exception
vcfpy.exceptions.FieldInvalidNumber[source]¶ Bases:
vcfpy.exceptions.VCFPyWarning
Raised when compound header has invalid number
-
exception
vcfpy.exceptions.FieldMissingNumber[source]¶ Bases:
vcfpy.exceptions.VCFPyWarning
Raised when compound header is missing number
-
exception
vcfpy.exceptions.HeaderInvalidType[source]¶ Bases:
vcfpy.exceptions.VCFPyWarning
Raised when compound header has invalid type
-
exception
vcfpy.exceptions.HeaderMissingDescription[source]¶ Bases:
vcfpy.exceptions.VCFPyWarning
Raised when compound header has missing description
-
exception
vcfpy.exceptions.HeaderNotFound[source]¶ Bases:
vcfpy.exceptions.VCFPyException
Raised when a VCF header could not be found
-
exception
vcfpy.exceptions.IncorrectListLength[source]¶ Bases:
vcfpy.exceptions.VCFPyWarning
Wrong length of multi-element field
-
exception
vcfpy.exceptions.IncorrectVCFFormat[source]¶ Bases:
vcfpy.exceptions.VCFPyException
Raised on problems parsing VCF
-
exception
vcfpy.exceptions.InvalidHeaderException[source]¶ Bases:
vcfpy.exceptions.VCFPyException
Raised in the case of invalid header formatting
-
exception
vcfpy.exceptions.InvalidRecordException[source]¶ Bases:
vcfpy.exceptions.VCFPyException
Raised in the case of invalid record formatting
-
exception
vcfpy.exceptions.LeadingTrailingSpaceInKey[source]¶ Bases:
vcfpy.exceptions.VCFPyWarning
Leading or trailing space in key
-
exception
vcfpy.exceptions.SpaceInChromLine[source]¶ Bases:
vcfpy.exceptions.VCFPyWarning
Space instead of TAB in #CHROM line
-
exception
vcfpy.exceptions.UnknownFilter[source]¶ Bases:
vcfpy.exceptions.VCFPyWarning
Missing FILTER
-
exception
vcfpy.exceptions.UnknownVCFVersion[source]¶ Bases:
vcfpy.exceptions.VCFPyWarning
Unknown VCF version
vcfpy.header module¶
Code for representing the VCF header part
The VCF header class structure is modeled after HTSJDK
-
class
vcfpy.header.AltAlleleHeaderLine(key, value, mapping)[source]¶ Bases:
vcfpy.header.SimpleHeaderLine
Alternative allele header line
Mostly used for defining symbolic alleles for structural variants and IUPAC ambiguity codes
-
id= None¶ name of the alternative allele
-
-
class
vcfpy.header.CompoundHeaderLine(key, value, mapping)[source]¶ Bases:
vcfpy.header.HeaderLine
Base class for compound header lines, currently FORMAT and INFO header lines
Compound header lines describe fields that can have more than one entry.
Don’t use this class directly but rather the subclasses.
-
mapping= None¶ OrderedDict with key/value mapping
-
value¶
-
-
class
vcfpy.header.ContigHeaderLine(key, value, mapping)[source]¶ Bases:
vcfpy.header.SimpleHeaderLine
Contig header line
Most importantly, parses the 'length' key into an integer
-
id= None¶ name of the contig
-
length= None¶ length of the contig, None if missing
-
-
vcfpy.header.FORMAT_TYPES= ('Integer', 'Float', 'Character', 'String')¶ valid FORMAT value types
-
class
vcfpy.header.FieldInfo(type_, number, description=None, id_=None)[source]¶ Bases:
object
Core information for describing field type and number
-
description= None¶ Description for the header field, optional
-
id= None¶ The id of the field, optional.
-
number= None¶ Number description, either an int or constant
-
type= None¶ The type, one of INFO_TYPES or FORMAT_TYPES
-
-
class
vcfpy.header.FilterHeaderLine(key, value, mapping)[source]¶ Bases:
vcfpy.header.SimpleHeaderLine
FILTER header line
-
description= None¶ description for the filter, None if missing
-
id= None¶ token for the filter
-
-
class
vcfpy.header.FormatHeaderLine(key, value, mapping)[source]¶ Bases:
vcfpy.header.CompoundHeaderLine
Header line for FORMAT fields
-
description= None¶ description, should be given, None if not given
-
id= None¶ key in the FORMAT field
-
source= None¶ source of FORMAT field, None if not given
-
type= None¶ value type
-
version= None¶ version of FORMAT field, None if not given
-
-
vcfpy.header.HEADER_NUMBER_ALLELES= 'A'¶ number of alleles excluding reference
-
vcfpy.header.HEADER_NUMBER_GENOTYPES= 'G'¶ number of genotypes
-
vcfpy.header.HEADER_NUMBER_REF= 'R'¶ number of alleles including reference
-
vcfpy.header.HEADER_NUMBER_UNBOUNDED= '.'¶ unbounded number of values
-
class
vcfpy.header.Header(lines=None, samples=None)[source]¶ Bases:
object
Represent header of VCF file
While this class allows mutating records, it should not be changed once it has been assigned to a writer. Use Header.copy() to create a copy that can be modified without problems.
This class provides functions for adding lines to a header and updating the supporting index data structures. There is no explicit API for removing header lines; the best way is to construct a new Header instance with a filtered list of header lines.
-
add_contig_line(mapping)[source]¶ Add “contig” header line constructed from the given mapping
Parameters: mapping – OrderedDict with mapping to add. It is recommended to use OrderedDict over dict as this makes the result reproducible
Returns: False on conflicting line and True otherwise
-
add_filter_line(mapping)[source]¶ Add FILTER header line constructed from the given mapping
Parameters: mapping – OrderedDict with mapping to add. It is recommended to use OrderedDict over dict as this makes the result reproducible
Returns: False on conflicting line and True otherwise
-
add_format_line(mapping)[source]¶ Add FORMAT header line constructed from the given mapping
Parameters: mapping – OrderedDict with mapping to add. It is recommended to use OrderedDict over dict as this makes the result reproducible
Returns: False on conflicting line and True otherwise
-
add_info_line(mapping)[source]¶ Add INFO header line constructed from the given mapping
Parameters: mapping – OrderedDict with mapping to add. It is recommended to use OrderedDict over dict as this makes the result reproducible
Returns: False on conflicting line and True otherwise
-
add_line(header_line)[source]¶ Add header line, updating any necessary support indices
Returns: False on conflicting line and True otherwise
-
has_header_line(key, id_)[source]¶ Return whether there is a header line with the given ID of the type given by key
Parameters: - key – The VCF header key/line type.
- id_ – The ID value to compare for
Returns: True if there is a header line starting with ##${key}= in the VCF file having the mapping entry ID set to id_.
-
lines= None¶ list of HeaderLine objects
-
samples= None¶ SamplesInfos object
-
-
class
vcfpy.header.HeaderLine(key, value)[source]¶ Bases:
object
Base class for VCF header lines
-
key= None¶ str with key of header line
-
value¶
-
-
vcfpy.header.INFO_TYPES= ('Integer', 'Float', 'Flag', 'Character', 'String')¶ valid INFO value types
-
class
vcfpy.header.InfoHeaderLine(key, value, mapping)[source]¶ Bases:
vcfpy.header.CompoundHeaderLine
Header line for INFO fields
Note that the Number field will be parsed into an int if possible. Otherwise, the constants HEADER_NUMBER_* will be used.
-
description= None¶ description, should be given, None if not given
-
id= None¶ key in the INFO field
-
source= None¶ source of INFO field, None if not given
-
type= None¶ value type
-
version= None¶ version of INFO field, None if not given
-
-
vcfpy.header.LINES_WITH_ID= ('ALT', 'contig', 'FILTER', 'FORMAT', 'INFO', 'META', 'PEDIGREE', 'SAMPLE')¶ header lines that contain an “ID” entry
-
class
vcfpy.header.MetaHeaderLine(key, value, mapping)[source]¶ Bases:
vcfpy.header.SimpleHeaderLine
META header line
Used for defining set of valid values for samples keys
-
id= None¶ ID of the META entry
-
-
class
vcfpy.header.PedigreeHeaderLine(key, value, mapping)[source]¶ Bases:
vcfpy.header.SimpleHeaderLine
Header line for defining a pedigree entry
-
id= None¶ ID of the pedigree entry
-
-
vcfpy.header.RESERVED_INFO= {'BKPTID': FieldInfo('String', '.', 'ID of the assembled alternate allele in the assembly file', None), 'DGVID': FieldInfo('String', 1, 'ID of this element in Database of Genomic Variation', None), 'H2': FieldInfo('Flag', 0, 'Membership in HapMap 2', None), 'VALIDATED': FieldInfo('Flag', 0, 'Validated by follow-up experiment', None), 'AA': FieldInfo('String', 1, 'Ancestral Allele', None), 'BQ': FieldInfo('Float', 1, 'RMS base quality at this position', None), 'SOMATIC': FieldInfo('Flag', 0, 'Indicates that the record is a somatic mutation, for cancer genomics', None), 'CICNADJ': FieldInfo('Integer', '.', 'Confidence interval around copy number for the adjacency', None), 'HOMLEN': FieldInfo('Integer', '.', 'Length of base pair identical micro-homology at event breakpoints', None), 'SB': FieldInfo('Integer', 4, 'Strand bias at this position', None), 'CN': FieldInfo('Integer', 1, 'Copy number of segment containing breakend', None), 'CNADJ': FieldInfo('Integer', '.', 'Copy number of adjacency', None), 'AC': FieldInfo('Integer', 'A', 'Allele count in genotypes, for each ALT allele, in the same order as listed', None), 'ADR': FieldInfo('Integer', 'R', 'Reverse read depth for each allele', None), 'DBVARID': FieldInfo('String', 1, 'ID of this element in DBVAR', None), 'DBRIPID': FieldInfo('String', 1, 'ID of this element in DBRIP', None), 'DPADJ': FieldInfo('Integer', '.', 'Read Depth of adjacency', None), 'HOMSEQ': FieldInfo('String', '.', 'Sequence of base pair identical micro-homology at event breakpoints', None), 'DP': FieldInfo('Integer', 1, 'Combined depth across samples for small variants and Read Depth of segment containing breakend for SVs', None), 'SVLEN': FieldInfo('Integer', 1, 'Difference in length between REF and ALT alleles', None), 'CIGAR': FieldInfo('String', 'A', 'CIGAR string describing how to align each ALT allele to the reference allele', None), 'SVTYPE': FieldInfo('String', 1, 'Type of structural variant', None), 'AD': 
FieldInfo('Integer', 'R', 'Total read depth for each allele', None), 'CICN': FieldInfo('Integer', 2, 'Confidence interval around copy number for the segment', None), 'H3': FieldInfo('Flag', 0, 'Membership in HapMap 3', None), 'CIEND': FieldInfo('Integer', 2, 'Confidence interval around END for imprecise variants', None), 'AN': FieldInfo('Integer', 1, 'Total number of alleles in called genotypes', None), 'MQ': FieldInfo('Integer', 1, 'RMS mapping quality', None), 'AF': FieldInfo('Float', 'A', 'Allele frequency for each ALT allele in the same order as listed: used for estimating from primary data not called genotypes', None), 'METRANS': FieldInfo('String', 4, 'Mobile element transduction info of the form CHR,START,END,POLARITY', None), 'ADF': FieldInfo('Integer', 'R', 'Forward read depth for each allele', None), 'MATEID': FieldInfo('String', '.', 'ID of mate breakends', None), 'NOVEL': FieldInfo('Flag', 0, 'Indicates a novel structural variation', None), 'MEINFO': FieldInfo('String', 4, 'Mobile element info of the form NAME,START,END,POLARITY', None), 'PARID': FieldInfo('String', 1, 'ID of partner breakend', None), 'IMPRECISE': FieldInfo('Flag', 0, 'Imprecise structural variation', None), 'END': FieldInfo('Integer', 1, 'End position of the variant described in this record (for symbolic alleles)', None), 'NS': FieldInfo('Integer', 1, 'Number of samples with data', None), 'DB': FieldInfo('Flag', 0, 'dbSNP membership', None), 'EVENT': FieldInfo('String', 1, 'ID of event associated to breakend', None), 'CIPOS': FieldInfo('Integer', 2, 'Confidence interval around POS for imprecise variants', None), 'CILEN': FieldInfo('Integer', 2, 'Confidence interval around the inserted material between breakends', None), 'MQ0': FieldInfo('Integer', 1, 'Number of MAPQ == 0 reads covering this record', None), '1000G': FieldInfo('Flag', 0, 'Membership in 1000 Genomes', None)}¶ Reserved fields for INFO from VCF v4.3
-
class
vcfpy.header.SampleHeaderLine(key, value, mapping)[source]¶ Bases:
vcfpy.header.SimpleHeaderLine
Header line for defining a SAMPLE entry
-
id= None¶ ID of the SAMPLE entry
-
-
class
vcfpy.header.SamplesInfos(sample_names, parsed_samples=None)[source]¶ Bases:
object
Helper class for handling the samples in VCF files
The purpose of this class is to decouple the sample name list somewhat from Header. This encapsulates subsetting samples for which the genotype should be parsed and reordering samples into output files.
Note that when subsetting is used and the records are to be written out again then the FORMAT field must not be touched.
-
name_to_idx= None¶ mapping from sample name to index
-
names= None¶ list of samples that are read from/written to the VCF file at hand in the given order
-
parsed_samples= None¶ set with the samples for which the genotype call fields should be read; can be used for partial parsing (speedup) and defaults to the full list of samples, None if all are parsed
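The relationship between `names`, `name_to_idx`, and `parsed_samples` can be sketched with plain data structures. The functions below are hypothetical stand-ins written for this illustration and do not reproduce the `SamplesInfos` implementation; in particular, raising on unknown sample names is an assumption.

```python
def index_samples(sample_names, parsed_samples=None):
    """Build a name -> column index mapping and the set of samples to parse.

    Returns (name_to_idx, parsed) where parsed is None if all samples
    are to be parsed. Illustrative helper, not the vcfpy API.
    """
    name_to_idx = {name: idx for idx, name in enumerate(sample_names)}
    if parsed_samples is None:
        return name_to_idx, None
    unknown = sorted(set(parsed_samples) - set(name_to_idx))
    if unknown:
        raise ValueError("unknown sample(s): {}".format(", ".join(unknown)))
    return name_to_idx, set(parsed_samples)


def columns_to_parse(name_to_idx, parsed):
    """Return the sorted sample column indices whose calls should be parsed."""
    if parsed is None:
        return sorted(name_to_idx.values())
    return sorted(name_to_idx[name] for name in parsed)
```

With samples `["NA1", "NA2", "NA3"]` and `parsed_samples=["NA3", "NA1"]`, only columns 0 and 2 would need their call fields parsed, which is where the speedup comes from.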
-
-
class
vcfpy.header.SimpleHeaderLine(key, value, mapping)[source]¶ Bases:
vcfpy.header.HeaderLine
Base class for simple header lines, currently contig and filter header lines
Don’t use this class directly but rather the subclasses.
Raises: vcfpy.exceptions.InvalidHeaderException in the case of missing key "ID"
-
mapping= None¶ collections.OrderedDict with key/value mapping of the attributes
-
value¶
-
-
vcfpy.header.VALID_NUMBERS= ('A', 'R', 'G', '.')¶ valid values for “Number” entries, except for integers
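How a `Number` entry might be turned into either an `int` or one of the constants above can be sketched as follows. `parse_number` is a hypothetical helper; raising `ValueError` for invalid input is an assumption of this sketch (the exceptions module suggests vcfpy itself reports such cases via `FieldInvalidNumber`).

```python
VALID_NUMBERS = ("A", "R", "G", ".")


def parse_number(value):
    """Return an int for integer strings, else one of VALID_NUMBERS.

    Illustrative helper; not a vcfpy function.
    """
    try:
        return int(value)
    except ValueError:
        if value in VALID_NUMBERS:
            return value
        raise ValueError("invalid Number value: {!r}".format(value))
```

For example, `Number=2` in a header line would yield the integer `2`, while `Number=A` stays the string `'A'` and signals one value per alternative allele.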
-
vcfpy.header.header_without_lines(header, remove)[source]¶ Return Header without lines given in remove
remove is an iterable of pairs key/ID with the VCF header key and ID of entry to remove. In the case that a line does not have a mapping entry, you can give the full value to remove.
    # header is a vcfpy.Header, e.g., as read earlier from file
    new_header = vcfpy.header.header_without_lines(
        header, [('assembly', None), ('FILTER', 'PASS')])
    # now, the header lines starting with "##assembly=" and the "PASS"
    # filter line will be missing from new_header
vcfpy.parser module¶
Parsing of VCF files from str
-
class
vcfpy.parser.FormatChecker(header)[source]¶ Bases:
object
Helper class for checking a FORMAT field
-
header= None¶ VCFHeader to use for checking
-
-
class
vcfpy.parser.HeaderLineParserBase[source]¶ Bases:
object
Parse into appropriate HeaderLine
-
parse_key_value(key, value)[source]¶ Parse the key/value pair
Parameters: - key (str) – the key to use in parsing
- value (str) – the value to parse
Returns: vcfpy.header.HeaderLine object
-
-
class
vcfpy.parser.HeaderParser[source]¶ Bases:
object
Helper class for parsing a VCF header
-
parse_line(line)[source]¶ Parse VCF header line (trailing ‘\r\n’ or ‘\n’ is ignored)
Parameters: - line (str) – str with line to parse
- sub_parsers (dict) – dict mapping header line types to appropriate parser objects
Returns: appropriate HeaderLine parsed from line
Raises: vcfpy.exceptions.InvalidHeaderException if there was a problem parsing the file
-
sub_parsers= None¶ Sub parsers to use for parsing the header lines
-
-
class
vcfpy.parser.InfoChecker(header)[source]¶ Bases:
object
Helper class for checking an INFO field
-
header= None¶ VCFHeader to use for checking
-
-
class
vcfpy.parser.MappingHeaderLineParser(line_class)[source]¶ Bases:
vcfpy.parser.HeaderLineParserBase
Parse into HeaderLine (no particular structure)
-
line_class= None¶ the class to use for the VCF header line
-
-
class
vcfpy.parser.Parser(stream, path=None, record_checks=None)[source]¶ Bases:
object
Class for line-wise parsing of VCF files
In most cases, you want to use vcfpy.reader.Reader instead.
Parameters: - stream – file-like object to read from
- path (str) – path the VCF is parsed from, for display purposes only, optional
-
header= None¶ header, once it has been read
-
parse_header(parsed_samples=None)[source]¶ Read and parse vcfpy.header.Header from file, set into self.header and return it
Parameters: parsed_samples (list) – list of str for subsetting the samples to parse
Returns: vcfpy.header.Header
Raises: vcfpy.exceptions.InvalidHeaderException in the case of problems reading the header
-
parse_next_record()[source]¶ Read, parse and return next vcfpy.record.Record
Returns: next VCF record or None if at end
Raises: vcfpy.exceptions.InvalidRecordException in the case of problems reading the record
-
record_checks= None¶ checks to perform, can contain ‘INFO’ and ‘FORMAT’
-
samples= None¶ vcfpy.header.SamplesInfos with sample information; set on parsing the header
-
class
vcfpy.parser.QuotedStringSplitter(delim=', ', quote='"', brackets='[]')[source]¶ Bases:
object
Helper class for splitting quoted strings
Has support for interpreting quoting strings but also brackets. Meant for splitting the VCF header line dicts
-
ARRAY= 3¶ state constant for array
-
DELIM= 4¶ state constant for delimiter
-
ESCAPED= 2¶ state constant for escaped
-
NORMAL= 0¶ state constant for normal
-
QUOTED= 1¶ state constant for quoted
-
delim= None¶ string delimiter
-
quote= None¶ quote character
-
run(s)[source]¶ Split string s at delimiter, correctly interpreting quotes
Further, interprets arrays wrapped in one level of []. No recursive brackets are interpreted (as this would make the grammar non-regular and currently this complexity is not needed). Currently, quoting inside of brackets is not supported either. This is just to support the example from VCF v4.3.
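The splitting behaviour described for `run()` can be approximated by a small state machine. The sketch below is an independent illustration, not the `QuotedStringSplitter` code: it assumes a single-character delimiter, treats a backslash inside quotes as an escape, and reuses the state names only for readability.

```python
NORMAL, QUOTED, ESCAPED, ARRAY = range(4)


def split_quoted(s, delim=",", quote='"', brackets="[]"):
    """Split s at delim, ignoring delimiters inside quotes or one level
    of brackets. Quotes and brackets are kept in the returned pieces."""
    parts = []
    current = []
    state = NORMAL
    for ch in s:
        if state == NORMAL:
            if ch == delim:
                # Delimiter outside quotes/brackets ends the current piece.
                parts.append("".join(current))
                current = []
                continue
            if ch == quote:
                state = QUOTED
            elif ch == brackets[0]:
                state = ARRAY
        elif state == QUOTED:
            if ch == "\\":
                state = ESCAPED
            elif ch == quote:
                state = NORMAL
        elif state == ESCAPED:
            # The escaped character is taken literally.
            state = QUOTED
        elif state == ARRAY:
            if ch == brackets[1]:
                state = NORMAL
        current.append(ch)
    parts.append("".join(current))
    return parts
```

Applied to `ID=DP,Number=1,Description="Depth, total"`, the comma inside the quoted description does not start a new piece, so three key/value strings come back.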
-
-
class
vcfpy.parser.RecordParser(header, samples, record_checks=None)[source]¶ Bases:
object
Helper class for parsing VCF records
-
header= None¶ Header with the meta information
-
parse_line(line_str)[source]¶ Parse line from file (including trailing line break) and return resulting Record
-
record_checks= None¶ The checks to perform, can contain ‘INFO’ and ‘FORMAT’
-
samples= None¶ SamplesInfos with sample information
-
-
vcfpy.parser.SUPPORTED_VCF_VERSIONS= ('VCFv4.0', 'VCFv4.1', 'VCFv4.2', 'VCFv4.3')¶ Supported VCF versions, a warning will be issued otherwise
-
class
vcfpy.parser.StupidHeaderLineParser[source]¶ Bases:
vcfpy.parser.HeaderLineParserBase
Parse into HeaderLine (no particular structure)
-
vcfpy.parser.build_header_parsers()[source]¶ Return mapping for parsers to use for each VCF header type
Inject the WarningHelper into the parsers.
-
vcfpy.parser.convert_field_value(type_, value)[source]¶ Convert atomic field value according to the type
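Conversion by declared type could look roughly like the sketch below. `convert_value` is a hypothetical stand-in; mapping the VCF missing value `.` to `None` and raising on unknown types are assumptions of this sketch, not documented behaviour of `convert_field_value`.

```python
def convert_value(type_, value):
    """Convert one atomic string value according to a VCF value type.

    Illustrative helper; not the vcfpy implementation.
    """
    if value == ".":
        return None  # assumed convention: "." marks a missing value
    if type_ in ("Character", "String"):
        return value
    if type_ == "Integer":
        return int(value)
    if type_ == "Float":
        return float(value)
    if type_ == "Flag":
        return True
    raise ValueError("unknown type: {!r}".format(type_))
```

The type names are those listed in `INFO_TYPES` and `FORMAT_TYPES` in the header module.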
-
vcfpy.parser.parse_breakend(alt_str)[source]¶ Parse breakend and return tuple with results, parameters for BreakEnd constructor
-
vcfpy.parser.parse_mapping(value)[source]¶ Parse the given VCF header line mapping
Such a mapping consists of “key=value” pairs, separated by commas and wrapped in angle brackets (“<…>”). Strings are usually quoted; for certain known keys, exceptions are made, depending on the tag key. This, however, only becomes important when serializing.
Raises: vcfpy.exceptions.InvalidHeaderException if there was a problem parsing the file
-
vcfpy.parser.process_alt(header, ref, alt_str)[source]¶ Process alternative value using Header in header
vcfpy.reader module¶
Parsing of VCF files from file-like objects
-
class
vcfpy.reader.Reader(stream, path=None, tabix_path=None, record_checks=None, parsed_samples=None)[source]¶ Bases:
object
Class for parsing of files from file-like objects
Instead of using the constructor, use the class methods from_stream() and from_path().
On construction, the header is read from the file immediately, so construction itself can fail on a malformed header. After construction, Reader can be used as an iterable of Record.
Raises: InvalidHeaderException in the case of problems reading the header
Note
It is important to note that the header member is used during the parsing of the file. If you need a modified version then create a copy, e.g., using vcfpy.header.Header.copy().
Note
If you use the parsed_samples feature and you write out records then you must not change the FORMAT of the record.
-
fetch(chrom_or_region, begin=None, end=None)[source]¶ Jump to the start position of the given chromosomal position and limit iteration to the end position
Parameters: - chrom_or_region (str) – name of the chromosome to jump to if begin and end are given and a samtools region string otherwise (e.g. “chr1:123,456-123,900”).
- begin (int) – 0-based begin position (inclusive)
- end (int) – 0-based end position (exclusive)
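The two calling conventions of `fetch()` meet in the same 0-based, half-open interval. The sketch below shows one way a region string could be translated, assuming the usual samtools convention of 1-based, inclusive coordinates; `parse_region` is a hypothetical helper and does not handle regions that name only a chromosome.

```python
def parse_region(region):
    """Parse 'chrom:start-end' (1-based, inclusive, commas allowed) into
    (chrom, begin, end) with 0-based begin (inclusive) and end (exclusive).

    Illustrative helper; not part of vcfpy.
    """
    chrom, _, interval = region.rpartition(":")
    if not chrom:
        raise ValueError("expected 'chrom:start-end', got {!r}".format(region))
    start_str, sep, end_str = interval.replace(",", "").partition("-")
    if not sep:
        raise ValueError("expected 'start-end', got {!r}".format(interval))
    start, end = int(start_str), int(end_str)
    # 1-based inclusive start becomes 0-based inclusive begin; the 1-based
    # inclusive end equals the 0-based exclusive end.
    return chrom, start - 1, end
```

Under that assumption, the region string from the parameter description covers the same bases as passing `begin=123455` and `end=123900` explicitly.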
-
classmethod
from_path(path, tabix_path=None, record_checks=None, parsed_samples=None)[source]¶ Create new Reader from path
Note
If you use the parsed_samples feature and you write out records then you must not change the FORMAT of the record.
Parameters: - path – the path to load from (converted to str for compatibility with path.py)
- tabix_path – optional string with path to TBI index, automatic inference from path will be tried on the fly if not given
- record_checks (list) – record checks to perform, can contain ‘INFO’ and ‘FORMAT’
-
classmethod
from_stream(stream, path=None, tabix_path=None, record_checks=None, parsed_samples=None)[source]¶ Create new Reader from file
Note
If you use the parsed_samples feature and you write out records then you must not change the FORMAT of the record.
Parameters: - stream – file-like object to read from
- path – optional string with path to store (for display only)
- record_checks (list) – record checks to perform, can contain ‘INFO’ and ‘FORMAT’
- parsed_samples (list) – list of str values with names of samples to parse call information for (for speedup); leave as None to disable this feature
-
header= None¶ the Header
-
parsed_samples= None¶ if set, list of samples to parse for
-
parser= None¶ the parser to use
-
path= None¶ optional
strwith the path to the stream
-
record_checks= None¶ checks to perform on records, can contain ‘FORMAT’ and ‘INFO’
-
stream= None¶ stream (
file-like object) to read from
-
tabix_file= None¶ the
pysam.TabixFileused for reading from index bgzip-ed VCF; constructed on the fly
-
tabix_path= None¶ optional
strwith path to tabix file
-
vcfpy.record module¶
Code for representing a VCF record
The VCF record structure is modeled after the one of PyVCF
-
vcfpy.record.ALLELE_DELIM= re.compile('[|/]')¶ Regular expression for splitting alleles
-
class
vcfpy.record.AltRecord(type_=None)[source]¶ Bases:
object
An alternative allele Record
Currently, can be a substitution, an SV placeholder, or breakend
-
type= None¶ String describing the type of the variant; one of SNV, MNV, or any of the types described in the ALT header lines, such as DUP, DEL, INS, …
-
-
vcfpy.record.BND= 'BND'¶ Code for break-end allele
-
class
vcfpy.record.BreakEnd(mate_chrom, mate_pos, orientation, mate_orientation, sequence, within_main_assembly)[source]¶ Bases:
vcfpy.record.AltRecord
A placeholder for a breakend
-
mate_chrom= None¶ chromosome of the mate breakend
-
mate_orientation= None¶ orientation of the breakend’s mate
-
mate_pos= None¶ position of the mate breakend
-
orientation= None¶ orientation of this breakend
-
sequence= None¶ breakpoint’s connecting sequence
-
within_main_assembly= None¶ bool specifying if the breakend mate is within the assembly (True) or in an ancillary assembly (False)
-
-
class
vcfpy.record.Call(sample, data, site=None)[source]¶ Bases:
object
The information for a genotype call
According to VCF, this should always include the genotype information and can contain an arbitrary number of further annotations, e.g., the coverage at the variant position.
-
called= None¶ whether or not the variant is fully called
-
data= None¶ an OrderedDict with the key/value pair information from the call’s data
-
gt_alleles= None¶ the allele numbers (0, 1, …) in this call or None for no-call
-
gt_bases¶ Return the actual genotype bases, e.g. if VCF genotype is 0/1, could return (‘A’, ‘T’)
-
gt_phase_char¶ Return character to use for phasing
-
gt_type¶ The type of genotype, returns one of HOM_REF, HOM_ALT, and HET.
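The three genotype codes can be derived from a GT string with the `ALLELE_DELIM` pattern documented in this module. The function below is a sketch for illustration only; returning `None` for any genotype containing a missing allele is an assumption, and `genotype_type` is not a vcfpy name.

```python
import re

# Values as documented for vcfpy.record on this page.
ALLELE_DELIM = re.compile("[|/]")
HOM_REF, HET, HOM_ALT = 0, 1, 2


def genotype_type(gt):
    """Classify a GT string such as '0/1' or '1|1'.

    Returns None if any allele is missing ('.'); this is an assumption
    of the sketch.
    """
    alleles = ALLELE_DELIM.split(gt)
    if "." in alleles:
        return None
    numbers = [int(a) for a in alleles]
    if len(set(numbers)) > 1:
        return HET
    return HOM_REF if numbers[0] == 0 else HOM_ALT
```

Note that a call such as `1/2` carries two different alternative alleles and is classified as heterozygous here.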
-
is_filtered(require=None, ignore=None)[source]¶ Return True for filtered calls
Parameters: - ignore (iterable) – if set, the filters to ignore, make sure to include ‘PASS’, when setting, default is ['PASS']
- require (iterable) – if set, the filters to require for returning True
-
is_het¶ Return True for heterozygous calls
-
is_phased¶ Return boolean indicating whether this call is phased
-
is_variant¶ Return True for non-hom-ref calls
-
plodity= None¶ the number of alleles in this sample’s call
-
sample= None¶ the name of the sample for which the call was made
-
-
vcfpy.record.DEL= 'DEL'¶ Code for “clean” deletion allele
-
vcfpy.record.ESCAPE_MAPPING= [('%', '%25'), (':', '%3A'), (';', '%3B'), ('=', '%3D'), (',', '%2C'), ('\r', '%0D'), ('\n', '%0A'), ('\t', '%09')]¶ Mapping for escaping reserved characters
-
vcfpy.record.FORWARD= '+'¶ code for forward orientation
-
vcfpy.record.HET= 1¶ Code for heterozygous
-
vcfpy.record.HOM_ALT= 2¶ Code for homozygous alternative
-
vcfpy.record.HOM_REF= 0¶ Code for homozygous reference
-
vcfpy.record.INDEL= 'INDEL'¶ Code for indel allele, includes substitutions of unequal length
-
vcfpy.record.INS= 'INS'¶ Code for “clean” insertion allele
-
vcfpy.record.MIXED= 'MIXED'¶ Code for mixed variant type
-
vcfpy.record.MNV= 'MNV'¶ Code for a multi nucleotide variant allele
-
vcfpy.record.RESERVED_CHARS= ':;=%,\r\n\t'¶ Characters reserved in VCF, have to be escaped
-
vcfpy.record.REVERSE= '-'¶ code for reverse orientation
-
class
vcfpy.record.Record(CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, calls)[source]¶ Bases:
object
Represent one record from the VCF file
Record objects are iterators of their calls
-
CHROM= None¶ A str with the chromosome name
-
FILTER= None¶ A list of strings for the FILTER column
-
FORMAT= None¶ A list of strings for the FORMAT column
-
ID= None¶ A list of the semicolon-separated values of the ID column
-
INFO= None¶ An OrderedDict giving the values of the INFO column, flags are mapped to True
-
POS= None¶ An int with a 1-based begin position
-
QUAL= None¶ The quality value, can be None
-
REF= None¶ A str with the REF value
-
add_format(key, value=None)[source]¶ Add an entry to format
The record’s calls data[key] will be set to value if not yet set and value is not None. If key is already in FORMAT then nothing is done.
-
affected_end¶ Return affected end position in 0-based coordinates
For SNVs, MNVs, and deletions, the behaviour is based on the start position and the length of the REF. In the case of insertions, the position behind the insert position is returned, yielding a 0-length interval together with affected_start()
-
affected_start¶ Return affected start position in 0-based coordinates
For SNVs, MNVs, and deletions, the behaviour is the start position. In the case of insertions, the position behind the insert position is returned, yielding a 0-length interval together with affected_end()
-
begin= None¶ An int with a 0-based begin position
-
call_for_sample= None¶ A mapping from sample name to entry in self.calls
-
end= None¶ An int with a 0-based end position
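The relation between the 1-based `POS` and the 0-based `begin`/`end` can be made concrete with a short sketch. It assumes that the interval is half-open and that `end` is derived from the length of `REF`; records whose end is defined otherwise (for example symbolic alleles with an END entry in INFO) are not covered. `record_interval` is a hypothetical helper.

```python
def record_interval(pos, ref):
    """Return (begin, end) as a 0-based, half-open interval for a record
    with 1-based position pos and reference allele ref.

    Assumption of this sketch: end = begin + len(ref).
    """
    begin = pos - 1
    end = begin + len(ref)
    return begin, end
```

Under these assumptions, a record at `POS=100` with `REF="A"` covers the interval `[99, 100)`, and one with `REF="ACG"` covers `[99, 102)`.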
-
-
vcfpy.record.SNV= 'SNV'¶ Code for single nucleotide variant allele
-
vcfpy.record.SV= 'SV'¶ Code for structural variant allele
-
vcfpy.record.SYMBOLIC= 'SYMBOLIC'¶ Code for symbolic allele that is neither SV nor BND
-
class
vcfpy.record.SingleBreakEnd(orientation, sequence)[source]¶ Bases:
vcfpy.record.BreakEnd
A placeholder for a single breakend
-
class
vcfpy.record.Substitution(type_, value)[source]¶ Bases:
vcfpy.record.AltRecord
A basic alternative allele record describing a REF->AltRecord substitution
Note that this subsumes MNVs, insertions, and deletions.
-
value= None¶ The alternative base sequence to use in the substitution
-
-
class
vcfpy.record.SymbolicAllele(value)[source]¶ Bases:
vcfpy.record.AltRecord
A placeholder for a symbolic allele
The allele symbol must be defined in the header using an ALT header before being parsed. Usually, this is used for succinct descriptions of structural variants or IUPAC ambiguity codes.
-
value= None¶ The symbolic value, e.g. ‘DUP’
-
-
vcfpy.record.UNESCAPE_MAPPING= [('%25', '%'), ('%3A', ':'), ('%3B', ';'), ('%3D', '='), ('%2C', ','), ('%0D', '\r'), ('%0A', '\n'), ('%09', '\t')]¶ Mapping from escaped characters to reserved ones
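How the two mappings fit together can be sketched as a round trip. The `escape` and `unescape` functions below are illustrative helpers, not vcfpy functions. Escaping replaces `%` first so that the percent signs introduced by later replacements are not escaped again; unescaping is done in a single regular-expression pass so that the order of the mapping does not matter.

```python
import re

# Same pairs as documented for vcfpy.record.ESCAPE_MAPPING.
ESCAPE_MAPPING = [
    ("%", "%25"), (":", "%3A"), (";", "%3B"), ("=", "%3D"),
    (",", "%2C"), ("\r", "%0D"), ("\n", "%0A"), ("\t", "%09"),
]
_UNESCAPE = {esc: raw for raw, esc in ESCAPE_MAPPING}
_ESCAPED_RE = re.compile("|".join(re.escape(esc) for esc in _UNESCAPE))


def escape(value):
    """Percent-encode reserved characters; '%' is handled first."""
    for raw, esc in ESCAPE_MAPPING:
        value = value.replace(raw, esc)
    return value


def unescape(value):
    """Decode all known escape sequences in one left-to-right pass."""
    return _ESCAPED_RE.sub(lambda m: _UNESCAPE[m.group(0)], value)
```

The single-pass decode matters for values that already contain text resembling an escape sequence: a literal `%3A` is escaped to `%253A` and decodes back to `%3A` rather than to `:`.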
vcfpy.warn_utils module¶
vcfpy.writer module¶
Writing of VCF files to file-like objects
Currently, only writing to plain-text files is supported
-
class
vcfpy.writer.Writer(stream, header, path=None)[source]¶ Bases:
object
Class for writing VCF files to file-like objects
Instead of using the constructor, use the class methods from_stream() and from_path().
The writer has to be constructed with a Header object and the full VCF header will be written immediately on construction. This, of course, implies that modifying the header after construction is illegal.
-
classmethod
from_path(path, header)[source]¶ Create new Writer from path
Parameters: - path – the path to write to (converted to str for compatibility with path.py)
- header – VCF header to use, lines and samples are deep-copied
-
classmethod
from_stream(stream, header, path=None, use_bgzf=None)[source]¶ Create new Writer from file
Note that for getting bgzf support, you have to pass in a stream opened in binary mode. Further, you either have to provide a path ending in ".gz" or set use_bgzf=True. Otherwise, you will get the notorious “TypeError: ‘str’ does not support the buffer interface”.
Parameters: - stream – file-like object to write to
- header – VCF header to use, lines and samples are deep-copied
- path – optional string with path to store (for display only)
- use_bgzf – indicator whether to write bgzf to stream if True, prevent if False, interpret path if None
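The three-state `use_bgzf` parameter can be read as the decision sketched below. `should_use_bgzf` is a hypothetical helper that only mirrors the rule stated above (explicit `True`/`False` wins, `None` falls back to a `".gz"` suffix check on `path`); it is not taken from the vcfpy source.

```python
def should_use_bgzf(path=None, use_bgzf=None):
    """Decide whether output should be BGZF-compressed.

    Illustrative helper; mirrors the documented rule only.
    """
    if use_bgzf is None:
        # No explicit choice: infer from the path, if one was given.
        return bool(path) and str(path).endswith(".gz")
    return bool(use_bgzf)
```

Whatever the outcome, the note above still applies: BGZF output requires the stream itself to be opened in binary mode.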
-
header= None¶ the vcfpy.header.Header to write out, will be deep-copied into the Writer on initialization
-
path= None¶ optional str with the path to the stream
-
stream= None¶ stream (file-like object) to write to
-
write_record(record)[source]¶ Write out the given vcfpy.record.Record to this Writer
-
classmethod