PGR-TK¶
PGR-TK provide Python and Rust libraries to facilitate pangenomics analysis. Several algorithms and data structures used for the Peregrine Genome Assembler are useful for Pangenomics analysis. This repo takes those algorithms and data structure, combining other handy 3rd party tools to expose them as a library in Python (with Rust code for those computing parts that need performance.)
Module contents¶
pgrtk¶
This module is used to extract and compare sequences in a set of pan genome assemblies. It includes a number modules for access the sequence data and query the sequence index.
Example
This shows a simple example to query the pangenome database:
import pgrtk
## The AGCFile class is used to access the sequence data stored in a AGC file.
ref_db =pgrtk.AGCFile("hg19.agc")
## Load a pre-build index and sequence data from all humane genome assemblies
## of the HPRC year one release.
sdb = pgrtk.SeqIndexDB()
sdb.load_from_agc_index("HPRC-y1-rebuild-04252022")
## Extract a sequence from the hg19 AGC file.
gene_seq = ref_db.get_sub_seq('hg19.fasta', 'chr6',
160952514, 161087407)
## find hits in the pangenomic reference
alignment_ranges = pgrtk.query_sdb(sdb, gene_seq,
merge_range_tol=100000)
- pgrtk.compute_graph_diffusion_entropy(gfa_fn, max_nodes=6000)¶
- Give a GFA file name, compute an entropy by a simple diffusion model on the grap
and generate the list of the final diffusion weight for each node
- Parameters:
gfa_fn (string) – a gfa filename
- Returns:
(entropy, list_of_diffusion_weight)
list_of_diffusion_weight =
[(node_id, weight), ...]
- Return type:
tuple
- pgrtk.get_principle_bundle_bed_file_for_query(seqs, w=64, k=56, r=4, min_span=32, min_cov=2, min_branch_length=8)¶
- pgrtk.get_variant_calls(aln_segs, ref_bgn, ctg_bgn, rs0, cs0, strand)¶
Generate a variant call internal representation from the alignment segments.
- Parameters:
aln_segs (list of tuples) –
a list of tuples of “alignment segments” generate by
pgrtk.pgrtk.get_aln_segments()
the “alignment segments” are a list of
(ref_loc: SeqLocus, tgt_loc: SeqLocus, align_type: AlnSegType)
. The data structures is defined as following Rust structs:pub struct SeqLocus { pub id: u32, pub bgn: u32, pub len: u32, } pub enum AlnSegType { Match, Mismatch, Insertion, Deletion, Unspecified, } pub struct AlnSegment { pub ref_loc: SeqLocus, pub tgt_loc: SeqLocus, pub t: AlnSegType, }
ref_bgn (int) – the reference sequence start position
ctg_bgn (int) – the contig start position
rs0 (string) – the reference sequence
cs0 (string) – the contig sequence
strand (int) – the contig strand
- Returns:
a dictionary mapping the key (ref_id, reference_position) to a set of variant calls in the form of a dictionary mapping from (target_location, strand) to a variant call record.
- Return type:
dict
- pgrtk.group_smps_by_principle_bundle_id(smps, len_cutoff=2500, merge_length=5000)¶
- Filter and group SHIMMER pair output from SeqIndexDB.get_principal_bundle_decomposition()
by bundle id. This function will filter out small bundle segment with lenght smaller than len_curoff and merge two bundle with the same id and direction within merge_length
TODO: This is currently implemented in python, we plan to move this as Rust code in the future.
- Parameters:
len_cutoff (int) – the length cutoff used for filtering small bundle segment
merge_length (int) – the length determining if two bundles should be merged
- Returns:
a list of the lists of SHIMMER pairs tagged with bundle id, direction, position in the bundle
each element of the list SHIMMER is a tuple of ((shimmer0, shimmer1, pos0, pos1, direction), bundle_id, direction_to_the_bundle, position_in bundle)
- Return type:
list
- pgrtk.map_intervals_in_sdb(seq_index_db, interval, query_seq, gap_penality_factor=0.001, merge_range_tol=100, max_count=32, max_query_count=32, max_target_count=32, max_aln_span=8)¶
TODO: Document
- pgrtk.merge_regions(rgns, tol=1000)¶
Take a list of ranges and merge them if two regions are within
tol
. :param rgns: a list of tuples of (start
,end
,length
,orientation
, …) :type rgns: list of tuples- Returns:
a list of tuples of (
start
,end
,length
,orientation
, …)- Return type:
list of tuples
- pgrtk.output_variants_to_vcf_records(variant_calls, ref_name)¶
Convert the variant calls to VCF records.
- Parameters:
variant_calls (dict) – the variant calls generated by
get_variant_calls()
ref_name (string) – reference sequence name
- Returns:
list of VCF records
- Return type:
list
- pgrtk.query_sdb(seq_index_db, query_seq, gap_penalty_factor=0.25, merge_range_tol=12, max_count=128, max_query_count=128, max_target_count=128, max_aln_span=8)¶
Query a sequence index database for a query sequence.
- Parameters:
seq_index_db (SeqIndexDB object) – a sequence index database object
query_seq (list of bytes) – a list of bytes representing the DNA sequence
gap_penalty_factor (float) – the gap penalty factor used in sparse dynamic programming for finding the hits
merge_range_tol (int) – a parameter used to merge the alignment ranges
max_count (int) – only use the shimmer pairs that less than the
max_count
for sparse dynamic programmingmax_query_count (int) – only use the shimmer pairs that less than the
max_count
in the query sequence for sparse dynamic programmingmax_query_count – only use the shimmer pairs that less than the
max_count
in the target sequence for sparse dynamic programmingmax_aln_span (int) – the size of span used in the sparse dynamic alignment for finding the hits
- Returns:
a python dictionary with the key as the target sequence id and the value as a list of alignment ranges
each alignment ranges is a list of tuples, each tuple is (
start
,end
,length
,orientation
,aln_records
)the
aln_records
is a list of tuples of (target_sequence_id
, (score
,list_of_the_hit_pairs
)), where thelist_of_the_hit_pairs
is a list of tuples of ((query_start
,query_end
,query_orientation
), (target_start
,target_end
,target_orientation
))
- Return type:
dict
- pgrtk.rc(seq)¶
Reverse complement a sequence as a Python String.
- Parameters:
seq (string) – a DNA sequence as a Python String
- Returns:
the reverse complement DNA sequence as a Python String
- Return type:
string
- pgrtk.rc_byte_seq(seq)¶
Reverse complement a sequence as a list of bytes.
- Parameters:
seq (list of bytes) – ascii code of the DNA sequence
- Returns:
the list of bytes of the reverse complement DNA sequence
- Return type:
list of bytes
- pgrtk.rc_u8_seq(seg)¶
Reverse complement a sequence as a list of bytes (unsigned 8bit interger).
- Parameters:
seq (list of bytes / usigned 8bit interger) – ascii code of the DNA sequence
- Returns:
the list of bytes of the reverse complement DNA sequence
- Return type:
list of bytes
- pgrtk.string_to_u8(s)¶
Convert a Python String to a list of bytes.
- Parameters:
s (string) – a Python String of a DNA sequence
- Returns:
a list of bytes representing the DNA sequence
- Return type:
list of bytes
- pgrtk.u8_to_string(u8)¶
Convert DNA sequene in a list of bytes to a Python String.
- Parameters:
u8 (list of bytes) – a list of bytes representing the DNA sequence
- Returns:
a Python String of a DNA sequence
- Return type:
string
pgrtk.pgrtk¶
The internal pgrtk modules implemented with Rust. These classes and fucntion are re-exported as pgrtk.* so import pgrtk will bring these classes and function into pgrtk.* scope to avoid using the verbose pgrtk.pgrtk.*.
- class pgrtk.pgrtk.AGCFile¶
Bases:
object
A PyO3 class wrapping an existing AGC file for reading
Example:
>>> agc_file = AGCFile("/path/to/genomes.agc")
- ctg_lens¶
A hashmap mapping (source, ctg_name) to sequence length. It is more efficient to make a copy in Python to access it to avoid expensive Rust struct ot Python object conversion than using it directly. For example:
>>> agc_file = AGCFile("genomes.agc") >>> agc_ctg_lens = agc_file.ctg_lens.copy()
- get_seq(sample_name, ctg_name)¶
fetch a full contig sequence from an AGC file
- Parameters:
sample_name (string) – the sample name stored in the AGC file
ctg_name (string) – the contig name stored in the AGC file
- Returns:
a list of bytes representing the sequence
- Return type:
list
- get_sub_seq(sample_name, ctg_name, bgn, end)¶
fetch a contiguous sub-sequence from an AGC file
- Parameters:
sample_name (string) – the sample name stored in the AGC file
ctg_name (string) – the contig name stored in the AGC file
bgn (int) – the starting coordinate (0-based)
end (int) – the ending coordinate (exclusive)
- Returns:
a list of bytes representing the sequence
- Return type:
list
- class pgrtk.pgrtk.SeqIndexDB¶
Bases:
object
A class that stores pangenome indices and sequences with multiple backend storage options (AGC, fasta file, memory) Large set of genomic sequences, a user should use AGC backend. A binary file provides the command
pgr-mdb
which can read an AGC to create the index file. For example, we can create the index files from an AGC file:# create a file that contains a list of file that contains a set of files from which we want to build the indices $ echo HPRC-y1-rebuild-04252022.agc > filelist # using pgr-mdb to create the index files, for 97 haplotyped genome assembly from HPRC year one release, # it takes about 30 to 40 min to create the index files $ pgr-mdb filelist HPRC-y1-rebuild-04252022 # two index files will be created by the pgr-mdb command # one with a suffix .mdb and another one with a suffix .midx # when we use the load_from_agc_index() method, all three files, e.g., genomes.agc, genomes.mdb and # genomes.midx should have the same prefix as the parameter used to call load_from_agc_index() method
One can also create index and load the sequences from a fasta file using
`load_from_fastx()`
methods. Currently, this might be a good option for mid-size dataset (up to a couple of hundred megabases).Or, a user can load the sequence from memory using a Python list. This is convenient when one needs to rebuild the SHIMMER index with different parameters for a different resolution.
Once the index is built, the database can be queried quickly by using the
query_fragment()
or thequery_fragment_to_hps()
method.- append_from_fastx()¶
- generate_mapg_gfa(min_count, filepath, method=Ellipsis, keeps=Ellipsis)¶
Convert the adjacent list of the shimmer graph shimmer_pair -> GFA
- Parameters:
min_count (int) – the minimum number of times a pair of shimmers must be observed to be included in the graph
filepath (string) – the path to the output file
- Returns:
The data is written into the file at filepath
- Return type:
None
- generate_principal_mapg_gfa(min_count, path_len_cutoff, filepath, keeps=Ellipsis)¶
Convert the adjacent list of the shimmer graph shimmer_pair -> GFA
- Parameters:
min_count (int) – the minimum number of times a pair of shimmers must be observed to be included in the graph
filepath (string) – the path to the output file
- Returns:
The data is written into the file at filepath
- Return type:
None
- get_match_positions_with_fragment(seq)¶
use a fragment of sequence to query the database to get all hits and sort it by the data base sequence id
- Parameters:
seq (list) – the sequence in bytes used for query
- Returns:
a dictionary maps sequence id in the database to a list of tuple (position0, position1, direction)
- Return type:
dict
- get_principal_bundle_decomposition(min_count, path_len_cutoff, keeps=Ellipsis)¶
Get the principal bundles and bundle decomposition of all sequences
- Parameters:
min_count (int) – minimum coverage count to be included in the graph
path_len_cut_off (int) –
remove short path less than path_len_cut_off when generating the principal path
if the number is small, the generated principal paths will be more fragmented.
- Returns:
a tuple consist of two lists: (principal_bundles, seqid_smps_with_bundle_id_seg_direction)
principal_bundles = list of (principal_bundle_id, ave_bundle_position, list_bundle_vertex)
list_of_bundle_vertex = list of (hash0:u64, hash0:u64, direction:u8)
seqid_smps_with_bundle_id_seg_direction = list of shimmer pairs in the database annotated with principal bundle id and direction
- the elements of the list are ((hash0:u64, hash1:u64, pos0:u32, pos0:u32, direction:0),
(principal_bundle_id, direction, order_in_the_bundle))
- Return type:
tuple
- get_principal_bundle_projection(min_count, path_len_cutoff, sequence, keeps=Ellipsis)¶
Project sequences outside the sequence database on to a principal bundle decomposition
- Parameters:
min_count (int) – minimum coverage count to be included in the graph
path_len_cut_off (int) –
remove short path less than path_len_cut_off when generating the principal path
if the number is small, the generated principal paths will be more fragmented.
sequences ((contig_id: int, list of sequences)) –
- Returns:
a tuple consist of two lists: (principal_bundles, seqid_smps_with_bundle_id_seg_direction)
principal_bundles = list of (principal_bundle_id, ave_bundle_position, list_bundle_vertex)
list_of_bundle_vertex = list of (hash0:u64, hash0:u64, direction:u8)
seqid_smps_with_bundle_id_seg_direction = list of shimmer pairs in the database annotated with principal bundle id and direction
- the elements of the list are ((hash0:u64, hash1:u64, pos0:u32, pos0:u32, direction:0),
(principal_bundle_id, direction, order_in_the_bundle))
- Return type:
tuple
- get_principal_bundles(min_count, path_len_cutoff, keeps=Ellipsis)¶
Get the principal bundles in MAPG
- Parameters:
min_count (int) – minimum coverage count to be included in the graph
path_len_cut_off (int) –
remove short path less than path_len_cut_off when generating the principal path
if the number is small, the generated principal paths will be more fragmented.
- Returns:
list of paths, each path is a list of nodes
- Return type:
list
- get_seq(sample_name, ctg_name)¶
fetch a sequence
- Parameters:
sample_name (string) – the sample name stored in the AGC file
ctg_name (string) – the contig name stored in the AGC file
- Returns:
a list of bytes representing the sequence
- Return type:
list
- get_seq_by_id(sample_name, ctg_name)¶
fetch a sequence by the sequence id in the database
- Parameters:
sid (int) – sequence id in the database
ctg_name (string) – the contig name stored in the AGC file
- Returns:
a list of bytes representing the sequence
- Return type:
list
- get_shmmr_map()¶
get the
shmmr_pair
tofragment_id
map in Pythonthis can be very expensive to generate the Python objects of a large hashmap in Rust
- Returns:
the
shmmr_pair
tofragments
mapfragments: a list of
FragmentSignature
: (frg_id, seq_id, bgn, end, orientation(to the shimmer pair)) defined as:pub type FragmentSignature = (u32, u32, u32, u32, u8);
- Return type:
dict
- get_shmmr_pair_count(shmmr_pair)¶
count the number of shimmer hits in the database
- Parameters:
shmmr_pair (tuple) – a shimmer pair used for query
- Returns:
number of hits
- Return type:
int
- get_shmmr_pair_list()¶
get the
shmmr_pair
tofragment_id
map in Python as a listthis can be very expensive to generate the Python objects of a large hashmap in Rust
- Returns:
list of the tuple (shmmr0, shmmr1, seq_id, position0, position1, orientation)
- Return type:
list
- get_shmmr_pair_source_count(shmmr_pair, max_unique_count)¶
count the number of shimmer hits partitioned by the source file in the database
- Parameters:
shmmr_pair (tuple) – a shimmer pair used for query
max_unique_count (int) – a integer to filter out shimmer pairs with count that are greater than the max_unique_count
- Returns:
a list of the tuple (source_name : string, count : int)
- Return type:
list
- get_shmmr_spec()¶
Output the specific of the shimmer used to build the index
- Returns:
(window_size, k_mer_size, reduction_factor, min_space, use_sketch)
- Return type:
tuple
- get_smp_adj_list(min_count, keeps=Ellipsis)¶
Get adjacent list of the shimmer graph shimmer_pair -> shimmer_pair
- Parameters:
min_count (int) – the minimum number of times a pair of shimmers must be observed to be included in the graph
- Returns:
list of pairs of shimmer pairs ((h00, h01, orientation0),(h10, h11, orientation1))
- Return type:
list
- get_sub_seq(sample_name, ctg_name, bgn, end)¶
fetch a contiguous sub-sequence
- Parameters:
sample_name (string) – the sample name stored in the AGC file
ctg_name (string) – the contig name stored in the AGC file
bgn (int) – the starting coordinate (0-based)
end (int) – the ending coordinate (exclusive)
- Returns:
a list of bytes representing the sequence
- Return type:
list
- get_sub_seq_by_id(sample_name, ctg_name, bgn, end)¶
fetch a contiguous sub-sequence by a sequence id
- Parameters:
sid (int) – sequence id in the database
bgn (int) – the starting coordinate (0-based)
end (int) – the ending coordinate (exclusive)
- Returns:
a list of bytes representing the sequence
- Return type:
list
- get_vertex_map_from_principal_bundles(pb)¶
- load_from_agc_index(prefix)¶
use AGC file for sequences and the index created from an AGC file
- Parameters:
prefix (string) – the prefix to the .agc, .mdb and .midx files
- Return type:
None or I/O Error
- load_from_fastx(filepath, w=Ellipsis, k=Ellipsis, r=Ellipsis, min_span=Ellipsis)¶
load and create the index created from a fasta / fastq file
- Parameters:
filepath (string) – the path the fasta or fastq file
w (int) – the window size of the shimmer index, default to 80
k (int) – the k-mer size of the shimmer index, default to 56
r (int) – the reduction factor of the shimmer index, default to 4
min_span (int) – the min_span ofr the shimmer index, default to 8
- Returns:
None
- Return type:
None or I/O Error
- load_from_frg_index(prefix)¶
- load_from_seq_list(seq_list, source=Ellipsis, w=Ellipsis, k=Ellipsis, r=Ellipsis, min_span=Ellipsis)¶
load and create the index created from a python list
- Parameters:
seq_list (list) – a list of tuple of the form (sequence_id : int, sequence_name : string, sequence: list of bytes)
source (string) – a string indicating the source of the sequence, default to “Memory”
w (int) – the window size of the shimmer index, default to 80
k (int) – the k-mer size of the shimmer index, default to 56
r (int) – the reduction factor of the shimmer index, default to 4
min_span (int) – the min_span ofr the shimmer index, default to 8
- Returns:
None
- Return type:
None or I/O Error
- map_positions_in_seq(positions, seq, penalty, max_count, max_query_count, max_target_count, max_aln_span)¶
Given a sequence context, this function maps the specific positions in the context to the sequences in the database. The context sequence is aligned to the sequences in the database with sparse dynamic programming, then the regions include the positions of interest are identified. A wavefront alignment is performed to pin down the exact mapped positions in the sequences in the database.
- Parameters:
positions (list of integer) – a list of integers of the position to map
seq (list of bytes) – a list of bytes representing the DNA sequence providing to context
penalty (float) – the gap penalty factor used in sparse dynamic programming for finding the hits
merge_range_tol (int) – a parameter used to merge the alignment ranges
max_count (int) – only use the shimmer pairs that less than the
max_count
for sparse dynamic programmingmax_query_count (int) – only use the shimmer pairs that less than the
max_count
in the query sequence for sparse dynamic programmingmax_query_count – only use the shimmer pairs that less than the
max_count
in the target sequence for sparse dynamic programmingmax_aln_span (int) – the size of span used in the sparse dynamic alignment for finding the hits
- Returns:
a list of tuples of (
position_in_the_context
, (target_seq_id
,target_position
,orientation
), (context_end
,context_end
), (target_end
,target_end
))the sequences from (
context_end
,context_end
) in the context sequence and the sequences from (target_end
,target_end
) in the target sequence are used for the detailed alignment to pin down the exact mapped positions.- Return type:
list
- query_fragment(seq)¶
use a fragment of sequence to query the database to get all hits
- Parameters:
seq (list) – the sequence in bytes used for query
- Returns:
- a list of hits in the format (shimmer_pair, query_fragment, target_fragments), where
shimmer_pair: (int, int), tuple of the shimmer_pair
query_fragment: (int, int, int) = (start_coordinate, end_coordinate, orientation)
target_fragments: a list of
FragmentSignature
: (frg_id, seq_id, bgn, end, orientation(to the shimmer pair)) defined as:pub type FragmentSignature = (u32, u32, u32, u32, u8);
- Return type:
list
- query_fragment_to_hps(seq, penalty, max_count, max_query_count, max_target_count, max_aln_span)¶
use a fragment of sequence to query the database to get all hits
sparse dynamic programming is performed to long chain of alignment
- Parameters:
seq (list of bytes) – a list of bytes representing the DNA sequence
penalty (float) – the gap penalty factor used in sparse dynamic programming for finding the hits
merge_range_tol (int) – a parameter used to merge the alignment ranges
max_count (int) – only use the shimmer pairs that less than the
max_count
for sparse dynamic programmingmax_query_count (int) – only use the shimmer pairs that less than the
max_count
in the query sequence for sparse dynamic programmingmax_query_count – only use the shimmer pairs that less than the
max_count
in the target sequence for sparse dynamic programmingmax_aln_span (int) – the size of span used in the sparse dynamic alignment for finding the hits
- Returns:
a list of tuples of (
target_sequence_id
, (score
,list_of_the_hit_pairs
)), where thelist_of_the_hit_pairs
is a list of tuples of ((query_start
,query_end
,query_orientation
), (target_start
,target_end
,target_orientation
))- Return type:
list
- seq_index¶
get a dictionary that maps (ctg_name, source) -> (id, len)
- seq_info¶
a dictionary that maps id -> (ctg_name, source, len)
- shmmr_sparse_aln_consensus(sids, min_cov)¶
generate consensus sequence for one sequence in the database
- sort_adj_list_by_weighted_dfs(adj_list, start)¶
Sort the adjacent list of the shimmer graph
- Parameters:
adj_list (list) – the list from the output from
get_smp_adj_list
sort_by ((u64, u64)) – the starting node signature
- Returns:
list of node in the tuple (node, parent_node, node_weight, is_leaf, global_rank, branch, branch_rank)
- Return type:
list
- write_frag_and_index_files(file_prefix)¶
- write_mapg_idx(filepath)¶
Write additional meta data for GFA into a file
- Parameters:
filepath (string) – the path to the output file
- Returns:
The data is written into the file at filepath
- Return type:
None
- write_midx_to_text_file(filepath)¶
- pgrtk.pgrtk.get_shmmr_dots(seq0, seq1, w=Ellipsis, k=Ellipsis, r=Ellipsis, min_span=Ellipsis)¶
Generate a list of shimmer matches for creating a dot plot between two sequences
- Parameters:
seq0 (list) – a list of bytes representing the first sequences
seq1 (list) – a list of bytes representing the second sequences
w (int) – window size, default to 80, max allowed is 128
k (int) – k-mer size, default to 56, max allowed is 56
r (int) – reduction factor for generate sparse hierarchical minimizers (shimmer), default to 4, max allowed is 12
min_span (int) –
a parameter to remove close-by shimmer pairs
if not zero, shimmer pairs which the distance between them are less than
the min_span
will be removed
- Returns:
(x, y)
:x
: the matched shimmer positions in sequence 0y
: the matched shimmer positions in sequence 1
- Return type:
tuple of two lists
- pgrtk.pgrtk.get_shmmr_pairs_from_seq(seq, w=Ellipsis, k=Ellipsis, r=Ellipsis, min_span=Ellipsis, padding=Ellipsis)¶
Generate a list of shimmer pair from a sequence
- Parameters:
w (int) – window size, default to 80, max allowed is 128
k (int) – k-mer size, default to 56, max allowed is 56
r (int) – reduction factor for generate sparse hierarchical minimizers (shimmer), default to 4, max allowed is 12
min_span (int) –
a parameter to remove close-by shimmer pairs
if not zero, shimmer pairs which the distance between them are less than
the min_span
will be removeddefault to 16
padding (bool) –
for short fragment that segmented by using shimmer, set
padding
to true to preserve the first and last shimmersdefault to false
- Returns:
a list fo tuple of
(shmmr0, shmmr1, position0, position1, orientation)
- Return type:
list of tuple
- pgrtk.pgrtk.guided_shmmr_dbg_consensus(seqs, w=Ellipsis, k=Ellipsis, r=Ellipsis, min_span=Ellipsis, min_cov=Ellipsis)¶
Perform a guided shimmer de Bruijn graph consensus
- Parameters:
aln_segs (list) – a list of the list of bytes representing the bases of each sequence
k (int) – specification of the shimmers for construting graph
w (int) – specification of the shimmers for construting graph
r (int) – specification of the shimmers for construting graph
min_span (int) – specification of the shimmers for construting graph
min_cov (int) – to keep hyplotype specific consensus, if a kmer has coverage more or equal to min_cov, it will be kept
- Returns:
a list of a set of bytes representing the consensus sequences of all branches in the graph
- Return type:
list
- pgrtk.pgrtk.naive_dbg_consensus(seqs, kmer_size=Ellipsis, min_cov=Ellipsis)¶
Generate the CIGAR string from two sequences with WFA
- Parameters:
seq0 (string) – a string representing the first sequence
seq1 (string) – a string representing the second sequence
ref_id (int) – a integer id for the reference sequence
ref_seq (string) – a python string of the reference sequence
tgt_id (int) – a integer id for the target sequence
tgt_seq (string) – a python string of the target sequence
aln_segs (list) – a list of the
AlnSegment
s0 (string) – a python string of the reference sequence
s1 (int) – a integer id for the target sequence
aln_segs – a list of the list of bytes representing the bases of each sequence
kmer_size (int) – the size of kmers used for constructing the de Bruijn graph
min_cov (int) – to keep haplotype specific consensus, if a kmer has coverage more or equal to min_cov, it will be kept
- Returns:
tuple – tuple of (alignment_score, CIGAR_list)
Get alignment segments from two sequences
- Returns:
list – a list of
AlnSegment
the
AlnSegment
is a Rust struct defined as:pub struct SeqLocus { pub id: u32, pub bgn: u32, pub len: u32, } pub enum AlnSegType { Match, Mismatch, Insertion, Deletion, Unspecified, } pub struct AlnSegment { pub ref_loc: SeqLocus, pub tgt_loc: SeqLocus, pub t: AlnSegType, }
Get alignment map from a list of alignment segments
- Returns:
list – a list of
AlnSegment
Perform a naive de Bruijn graph consensus
- Returns:
a list of bytes representing the consensus sequence
- Return type:
list
- pgrtk.pgrtk.pgr_lib_version()¶
Get the revision (git-hashtag) of the build
- pgrtk.pgrtk.shmmr_dbg_consensus(seqs, w=Ellipsis, k=Ellipsis, r=Ellipsis, min_span=Ellipsis)¶
Perform a shimmer de Bruijn graph consensus
- Parameters:
aln_segs (list) – a list of the list of bytes representing the bases of each sequence
k (int) – specification of the shimmers for construting graph
w (int) – specification of the shimmers for construting graph
r (int) – specification of the shimmers for construting graph
min_span (int) – specification of the shimmers for construting graph
- Returns:
a list of a set of bytes representing the consensus sequences of all branches in the graph
- Return type:
list
- pgrtk.pgrtk.shmmr_sparse_aln_consensus(seqs, w=Ellipsis, k=Ellipsis, r=Ellipsis, min_span=Ellipsis, min_cov=Ellipsis)¶
Perform a shimmer de Bruijn graph consensus
- Parameters:
aln_segs (list) – a list of the list of bytes representing the bases of each sequence
k (int) – specification of the shimmers for construting graph
w (int) – specification of the shimmers for construting graph
r (int) – specification of the shimmers for construting graph
min_span (int) – specification of the shimmers for construting graph
- Returns:
a list of a set of bytes representing the consensus sequences of all branches in the graph
- Return type:
list
- pgrtk.pgrtk.sparse_aln(sp_hits, max_span, penalty)¶
Perform sparse dynamic programming to identify alignment between sequence using matched shimmer pairs
- Parameters:
sp_hits (list) – a list of tuple of
HitPair
defined aspub type HitPair = ((u32, u32, u8), (u32, u32, u8))
This represents the hits as the position of matched Shimmer Pairs from the two sequence. For example, if there two shimmers at position 2342 and 4322 of the query sequence that matches the shimmers at positions 6125465 and 6127445, them the HitPair will be(2342, 4322, 0)
and(6125465, 6127445, 0)
. The third number would be 1 if shimmer paired are reversed matched to the sequence orientation and 0 otherwise.max_span (int) – For a give hit, the max_span defines how many other following hits are considered for the next aligned position. This will limit the search space for the best alignment. If the two sequences are very repetitive, then one needs to use larger
max_span
to ensure capturing the right alignment pathpenalty (float) – this parameter will determine when to break the alignment if there are big gaps between the alignment segment. One can set it to zero to catch large chunk alignment ignoring the gaps. Typically, a number between 0.1 to 0.5 should be used.