Modules¶
vamp.utils¶
Utilities for working with multiple sequence alignments and MAF objects
Copyright 2012, 2013, 2014 Lance Parsons <lparsons@princeton.edu> All rights reserved.
BSD 2-Clause License http://www.opensource.org/licenses/BSD-2-Clause
- class vamp.utils.ContigComposition¶
Represents the composition of one interval by another interval.
Association of two genomic intervals, used to represent the composition of one interval by another.
- seq str¶
Sequence id of the interval being described
- start str¶
The start position of the interval being described (1-based)
- end str¶
The end position of the interval being described (1-based)
- contig str¶
The sequence id of the second interval
- contig_start str¶
The start position of the second interval (1-based)
- contig_end str¶
The end position of the second interval (1-based)
- strand str¶
The strand (‘+’ or ‘-’) of the interval being described
- contig_size str¶
The complete length of the sequence of the second interval.
- static tab_headings()¶
Static method that returns a tab delimited string of header names
Returns: A tab delimited string representing the headers in the order used by the to_tab() method. Return type: string
- to_tab()¶
Return a tab delimited string of the ContigComposition
Returns: A tab delimited string representing the ContigComposition. Return type: string
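The relationship between the attributes, tab_headings() and to_tab() can be illustrated with a minimal sketch. This is not the vamp implementation: the class name ContigCompositionSketch is hypothetical, and the column order is assumed to follow the attribute list above.

```python
from dataclasses import dataclass, fields


@dataclass
class ContigCompositionSketch:
    """Hypothetical stand-in for vamp.utils.ContigComposition."""
    seq: str
    start: str
    end: str
    contig: str
    contig_start: str
    contig_end: str
    strand: str
    contig_size: str

    @staticmethod
    def tab_headings():
        # Header names in the same order that to_tab() writes the values.
        return "\t".join(f.name for f in fields(ContigCompositionSketch))

    def to_tab(self):
        # One value per field, in declaration order (assumed column order).
        return "\t".join(str(getattr(self, f.name)) for f in fields(self))


record = ContigCompositionSketch("chr", "3", "5", "contig1", "5", "10", "+", "20")
print(ContigCompositionSketch.tab_headings())
print(record.to_tab())
```

Deriving both strings from the same field list keeps the header and the data columns from drifting apart.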
- vamp.utils.find_deletions(contig_composition_list, verbose=False)¶
Find contigs with deletions.
From a contig composition list, find contigs that have deleted sections. When a contig has deleted sections, the pieces of the contig may be replaced by a contiguous section. This function returns tuples containing the indices of the contig pieces to be replaced along with a replacement ContigComposition object consisting of the contiguous section.
e.g.:
([2, 3], {'seq': 'chr', 'start': 3, 'end': 5, 'contig': 'contig1', 'contig_start': 5, 'contig_end': 10, 'strand': '+', 'contig_size': 20})
indicates that we may wish to replace the items at indices 2 and 3 of contig_composition_list with the new ContigComposition specified.
Parameters: - contig_composition_list (list) – A list of ContigComposition objects.
- verbose (bool, optional) – If true, output additional debug info (default is False).
Returns: A list of tuples with the indices of ContigComposition objects in the list to be replaced along with replacement ContigComposition objects.
Return type: list
- vamp.utils.get_block_by_label(maf_filename, label)¶
Return the MAF block with the specified label
Parameters: - maf_filename (string) – The name of the MAF file.
- label (string) – The label of the block in the MAF file to search for.
Returns: The first block found in the MAF file that has the given label
Return type: block
- vamp.utils.get_sequence_length_from_maf(maf_file, reference_species, sequence_id)¶
Return length of the reference_species.sequence_id
Parameters: - maf_file – The filename of the MAF file.
- reference_species – The name of the species used as the reference.
- sequence_id – The sequence_id used as the reference. The format of sequence names in the MAF file is assumed to be ‘species.sequence_id’ (e.g. ‘scerevisiae.chrI’)
Returns: The length of the specified sequence in the first component containing that sequence in the MAF file, or None if no matching components were found in the MAF file.
Return type: integer
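As an illustration of the lookup described above, here is a minimal sketch that works on plain MAF text lines rather than a parsed MAF file. It relies only on the standard MAF sequence-line layout (s src start size strand srcSize text); the function name sequence_length_from_maf_lines and the sample data are hypothetical, and the real function may use a MAF parsing library instead.

```python
def sequence_length_from_maf_lines(lines, reference_species, sequence_id):
    """Return srcSize from the first matching 's' line, or None if absent."""
    wanted = f"{reference_species}.{sequence_id}"
    for line in lines:
        parts = line.split()
        # MAF sequence lines: s src start size strand srcSize text
        if len(parts) >= 7 and parts[0] == "s" and parts[1] == wanted:
            return int(parts[5])
    return None


# Made-up example block, for illustration only.
maf_lines = [
    "##maf version=1",
    "a score=100.0",
    "s scerevisiae.chrI 10 5 + 230218 ACGTA",
    "s contig1.contig1 0 5 + 20 ACGTA",
]
print(sequence_length_from_maf_lines(maf_lines, "scerevisiae", "chrI"))
```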
- vamp.utils.get_sequence_net_alignment(maf_filename, reference_species, sequence_id, species, verbose=False)¶
Return the alignment created by stitching MAF blocks together
Stitches MAF blocks together along an entire reference sequence (including gaps). For regions covered by more than one block, the highest scoring block is used.
Parameters: - maf_filename – The filename of the MAF file.
- reference_species – The name of the species used as the reference.
- sequence_id – The sequence_id used as the reference. The format of sequence names in the MAF file is assumed to be ‘species.sequence_id’ (e.g. ‘scerevisiae.chrI’)
- species – A list of the species names to be returned
- verbose (bool, optional) – If True, print debug information (default: False)
Returns: A tuple containing a Bio.Align.MultipleSeqAlignment object and a list of intervals. The multiple sequence alignment contains the alignment of each species from the MAF file, created by stitching blocks together based on the specified reference sequence. The intervals are relative to the alignment and indicate the MAF block, block start, and block end that each piece of the alignment came from.
Return type: tuple
- vamp.utils.get_vamp_home()¶
Return the directory where the VAMP module is installed
- vamp.utils.read_contig_composition_summary(filename)¶
Generator that reads a contig composition summary file and yields its records as ContigComposition objects.
Parameters: filename (string) – The name of the contig composition summary file as output by compare_genomes.py.
Yields: ContigComposition – A ContigComposition object for each line in the contig composition summary file.
- vamp.utils.replace_alignment_with_block(alignment, block, reference_species, sequence_id, verbose=False)¶
Update the multiple sequence alignment with the specified MAF block
Use the MAF block alignment to replace the appropriate section of the given multiple sequence alignment, using the specified reference species and sequence as a guide.
Parameters: - block (maf block) – MAF block
- reference_species (str) – The name of the species used as the reference.
- sequence_id (str) – The sequence_id used as the reference. The format of sequence names in the MAF file is assumed to be ‘species.sequence_id’ (e.g. ‘scerevisiae.chrI’)
- verbose (bool, optional) – If True, print debug information (default: False)
Returns: The updated alignment and a Pybedtools interval of the section of the alignment that was replaced. The interval contains the following attributes: maf_block; block_start; block_end which indicate the MAF block label and start and end position on the block used in the replacement
Return type: tuple
- vamp.utils.subtract_intervals(interval1, interval2)¶
Subtract two pybedtools intervals, return list of resulting intervals
Parameters: - interval1 – A pybedtools interval
- interval2 – A pybedtools interval to subtract from interval1
Returns: A list of pybedtools intervals that contain the region(s) of interval1 that are not overlapped by interval2
Return type: list
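The subtraction itself can be sketched without pybedtools. The example below uses plain (start, end) tuples with half-open coordinates, which is an assumption made for illustration; the function name subtract_interval is hypothetical, and chromosome names and other interval attributes are ignored.

```python
def subtract_interval(a, b):
    """Return the parts of half-open interval a that b does not overlap."""
    a_start, a_end = a
    b_start, b_end = b
    # No overlap at all: a survives unchanged.
    if b_end <= a_start or b_start >= a_end:
        return [a]
    result = []
    # Piece of a to the left of b, if any.
    if b_start > a_start:
        result.append((a_start, b_start))
    # Piece of a to the right of b, if any.
    if b_end < a_end:
        result.append((b_end, a_end))
    return result


print(subtract_interval((0, 10), (3, 5)))
print(subtract_interval((0, 10), (5, 15)))
```

The result holds zero, one, or two intervals, which is why a list is the natural return type.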
- vamp.utils.summarize_contig_composition(interval_list, src_tag, start_tag, end_tag, strand_tag, source_size_tag)¶
Summarize the contig composition of a stitched MAF file.
Parameters: - interval_list (list) – A list of Pybedtools interval objects
- src_tag (string) – The attribute containing the contig name
- start_tag (string) – The attribute containing the start position in the contig
- end_tag (string) – The attribute containing the end position in the contig
- strand_tag (string) – The attribute containing the strand
- source_size_tag (string) – The attribute containing the contig size
Returns: A list of dictionaries with the following keys: (seq, start, end, contig, contig_start, contig_end, strand, contig_size)
Return type: list
- vamp.utils.update_contig_composition_summary(contig_composition_summary, replacements)¶
Update list of ContigComposition objects with replacements.
Replacements are a list of tuples containing a list of indices of contigs to be replaced along with replacements. The replacements must be non-overlapping and sorted.
Parameters: - contig_composition_summary (list) – List of dictionaries as returned by summarize_contig_composition()
- replacements (list) – A list of ContigComposition objects
Returns: An updated contig composition summary
Return type: list
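The requirement that replacements be non-overlapping and sorted suggests one simple way to apply them. The sketch below assumes each replacement is an (indices, replacement) tuple, as in the find_deletions() example; the function name apply_replacements is hypothetical and plain strings stand in for ContigComposition objects.

```python
def apply_replacements(items, replacements):
    """Return a copy of items with each group of indices replaced by one item."""
    updated = list(items)
    # Work from the last replacement backwards so earlier indices stay valid.
    for indices, replacement in reversed(replacements):
        first, last = min(indices), max(indices)
        updated[first:last + 1] = [replacement]
    return updated


pieces = ["a", "b", "c1", "c2", "d"]
merged = apply_replacements(pieces, [([2, 3], "c")])
print(merged)
```

Processing in reverse order is a design choice: each replacement shortens the list, so applying them front to back would shift the indices of every later replacement.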
- vamp.utils.update_sequence_with_replacements(seq, replacements, replacement_seq_dict)¶
Update Seq object with replacements.
The replacements are specified by ContigComposition objects and must be non-overlapping and sorted.
Parameters: - seq (Bio.Seq) – The sequence object to be updated.
- replacements (list) – A list of ContigComposition objects
- replacement_seq_dict (dict) – A dictionary of the replacement sequences.
Returns: An updated sequence object with replacements made
Return type: Bio.Seq
- vamp.utils.verify_maf_fasta(maf_filename, reference_species, fasta_filename, verbose=False)¶
Verify the consistency between the sequence in a MAF and a FASTA file
Checks all components in all blocks of the MAF file for the specified species and checks that the sequence matches that in the FASTA file.
Parameters: - maf_filename (string) – The name of the MAF file.
- reference_species (string) – The species to select from the MAF file.
- fasta_filename (string) – The name of the FASTA file to check against.
- verbose (bool, optional) – If true, output additional debugging info (default is False).
Returns: Prints to STDOUT if there is a mismatch.
Return type: None
seq_utils.convert_coordinates¶
Convert coordinates from GFF or BED file using multi-fasta alignments
- seq_utils.convert_coordinates.find_aligned_position(gap_positions, pos)¶
Update position by adding preceding gaps
Parameters: - gap_positions (list) – list of gaps (must include start and end methods to return the start and end of a gap, typically they are re.MatchObjects)
- pos (int) – the position to adjust by adding preceding gaps
Returns: The new position, accounting for preceding gaps
Return type: int
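A minimal sketch of the idea, assuming 0-based positions and gaps found with re.finditer on the aligned sequence. The function name aligned_position is hypothetical and the real function may differ in its coordinate convention.

```python
import re


def aligned_position(gap_matches, pos):
    """Shift an ungapped position right by every gap at or before it."""
    aligned = pos
    for gap in gap_matches:
        if gap.start() <= aligned:
            # The gap lies at or before the current column, so skip over it.
            aligned += gap.end() - gap.start()
        else:
            # Gaps are in left-to-right order; later ones cannot matter.
            break
    return aligned


aligned_seq = "AC--GT-A"
gaps = list(re.finditer(r"-+", aligned_seq))
ungapped = aligned_seq.replace("-", "")
for i, base in enumerate(ungapped):
    # Each ungapped position should map to the same base in the alignment.
    print(i, base, aligned_position(gaps, i))
```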
seq_utils.fasta_from_gff¶
Extract fasta sequences from regions defined in GFF/BED file and output fasta to stdout
seq_utils.summarize_alignments¶
Summarize the differences between sequences in an aligned FASTA file.
This script outputs a summary of the differences between sequences in an aligned FASTA file.
Usage:
summarize_alignments.py aligned_fasta reference_sequence [-h,--help]
[-v,--verbose] [--version]
- seq_utils.summarize_alignments.main()¶
Runs summary_of_alignment function on input files from the command line.
- seq_utils.summarize_alignments.mismatch_string(mismatches)¶
Generate a string from a list of mismatches.
Parameters: mismatches (list) – A list of mismatches. A mismatch is a dictionary with a position (pos), reference genotype (ref), and alternate genotype (alt).
Returns: A comma separated string of the mismatches
Return type: string
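A minimal sketch of such a formatter, assuming each mismatch is rendered in the RefBase(RefPos)NewBase notation that the summary_of_alignment() documentation mentions. The function name mismatch_string_sketch is hypothetical, and the exact output format of the real function is not confirmed here.

```python
def mismatch_string_sketch(mismatches):
    """Join mismatches as RefBase(RefPos)NewBase, separated by commas."""
    return ",".join(f"{m['ref']}({m['pos']}){m['alt']}" for m in mismatches)


example = [
    {"pos": 3, "ref": "A", "alt": "G"},
    {"pos": 7, "ref": "-", "alt": "T"},
]
print(mismatch_string_sketch(example))
```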
- seq_utils.summarize_alignments.parse_event(event, reference_sequence, alternate_sequence)¶
Parse an event (sequence of differences) for VCF output.
Parse a simple event with reference_position, reference_base, and new_base and determine the type and add padding if necessary (for VCF compatibility)
Parameters: - event (dictionary) – An event has at least a position (pos), reference genotype (ref), and alternate genotype (alt). May also have a flag indicating if it is a snp (snp).
- reference_sequence (str) – The complete reference sequence
- alternate_sequence (str) – The complete alternate sequence
Returns: An event with additional padding to the start of the variant and an added type attribute, for VCF compatibility
Return type: dictionary
- seq_utils.summarize_alignments.summary_of_alignment(alignment, reference_sequence_id)¶
Summarizes changes in given alignment
Parameters: - alignment (Bio.AlignIO object) – Alignment object
- reference_index (int) – index of the reference sequence in alignment (default is 1)
Returns: A dictionary with a key for each non-reference sequence in the alignment. Each entry is another dictionary with the following keys:
- match_count: The number of matching bases
- mismatch_count: The number of mismatching bases, including indels
- mismatches: list of mismatches by base: RefBase(RefPos)NewBase
- contiguous_change_count: the number of contiguous change “events”
Return type: dictionary
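The per-sequence entry described above can be sketched for a single pair of aligned strings. This is an illustration only: the function name summarize_pair is hypothetical, positions are assumed to be 1-based in the ungapped reference, and columns that are gaps in both sequences are skipped.

```python
def summarize_pair(reference, other):
    """Compare two equal-length aligned strings column by column."""
    match_count = 0
    mismatches = []
    event_count = 0
    in_event = False
    ref_pos = 0  # 1-based position in the ungapped reference (assumption)
    for ref_base, new_base in zip(reference, other):
        if ref_base == "-" and new_base == "-":
            continue  # gap in both sequences: nothing to compare
        if ref_base != "-":
            ref_pos += 1
        if ref_base == new_base:
            match_count += 1
            in_event = False
        else:
            mismatches.append(f"{ref_base}({ref_pos}){new_base}")
            # A run of adjacent differing columns counts as one event.
            if not in_event:
                event_count += 1
            in_event = True
    return {
        "match_count": match_count,
        "mismatch_count": len(mismatches),
        "mismatches": mismatches,
        "contiguous_change_count": event_count,
    }


summary = summarize_pair("ACGTAC-T", "ACCTA-GT")
print(summary)
```

In this example the substitution at reference position 3 is one event, and the adjacent deletion and insertion near position 6 are counted together as a second event.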
seq_utils.utils¶
Utility classes and methods for working with sequence data
- seq_utils.utils.convert_interval_gapped_to_nongapped(seq, start, end)¶
Take position with gaps and return position without gaps
Uses 0-based positions
Parameters: - seq (str) – sequence string (with gaps included)
- start (int) – starting position of interval (including gaps)
- end (int) – ending position of interval (including gaps)
Returns: (start, end) the start and end positions after removing gaps in the sequence
Return type: tuple
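One way to picture this conversion is to subtract the number of gap characters that precede each coordinate. The sketch below assumes half-open, 0-based intervals and "-" as the gap character; the function name gapped_to_nongapped is hypothetical, and the real function's treatment of the end position may differ.

```python
def gapped_to_nongapped(seq, start, end):
    """Map gapped coordinates to ungapped ones by removing preceding gaps."""
    # str.count(sub, lo, hi) counts gap characters in seq[lo:hi].
    new_start = start - seq.count("-", 0, start)
    new_end = end - seq.count("-", 0, end)
    return new_start, new_end


aligned_seq = "AC--GT-A"
start, end = gapped_to_nongapped(aligned_seq, 4, 6)
print(start, end, aligned_seq.replace("-", "")[start:end])
```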
- seq_utils.utils.convert_interval_nongapped_to_gapped(seq, start, end, include_end_gaps=False)¶
Take position without gaps and return position with gaps
Uses 0-based positions
Parameters: - seq (str) – sequence string (with gaps added)
- start (int) – starting position of interval (excluding gaps)
- end (int) – ending position of interval (excluding gaps)
- include_end_gaps (bool, optional) – if true, include gap positions that directly follow the end positions in the new interval, default is False and such end positions are not included
Returns: (start, end) the start and end positions after accounting for gaps in the sequence
Return type: tuple
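The reverse mapping, including the effect of include_end_gaps, can be sketched as follows. The sketch assumes half-open, 0-based intervals, "-" as the gap character, and a non-empty interval that lies within the ungapped sequence; the function name nongapped_to_gapped is hypothetical.

```python
def nongapped_to_gapped(seq, start, end, include_end_gaps=False):
    """Map ungapped coordinates to columns of the gapped sequence."""
    # Gapped column of every non-gap character, in order.
    residue_columns = [i for i, base in enumerate(seq) if base != "-"]
    gapped_start = residue_columns[start]
    # Half-open end: one past the column of the last residue in the interval.
    gapped_end = residue_columns[end - 1] + 1
    if include_end_gaps:
        # Extend over gap columns that directly follow the interval.
        while gapped_end < len(seq) and seq[gapped_end] == "-":
            gapped_end += 1
    return gapped_start, gapped_end


aligned_seq = "AC--GT-A"
print(nongapped_to_gapped(aligned_seq, 2, 4))
print(nongapped_to_gapped(aligned_seq, 2, 4, include_end_gaps=True))
```

With include_end_gaps left at False the interval stops right after the last residue; setting it to True also absorbs the gap column that follows.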