PGR-TK

PGR-TK provide Python and Rust libraries to facilitate pangenomics analysis. Several algorithms and data structures used for the Peregrine Genome Assembler are useful for Pangenomics analysis. This repo takes those algorithms and data structure, combining other handy 3rd party tools to expose them as a library in Python (with Rust code for those computing parts that need performance.)

Module contents

pgrtk

This module is used to extract and compare sequences in a set of pan genome assemblies. It includes a number modules for access the sequence data and query the sequence index.

Example

This shows a simple example to query the pangenome database:

import pgrtk

## The AGCFile class is used to access the sequence data stored in a AGC file.
ref_db =pgrtk.AGCFile("hg19.agc")

## Load a pre-build index and sequence data from all humane genome assemblies
## of the HPRC year one release.
sdb = pgrtk.SeqIndexDB()
sdb.load_from_agc_index("HPRC-y1-rebuild-04252022")

## Extract a sequence from the hg19 AGC file.
gene_seq = ref_db.get_sub_seq('hg19.fasta', 'chr6',
                              160952514, 161087407)

## find hits in the pangenomic reference
alignment_ranges = pgrtk.query_sdb(sdb, gene_seq,
                                     merge_range_tol=100000)
pgrtk.compute_graph_diffusion_entropy(gfa_fn, max_nodes=6000)
Give a GFA file name, compute an entropy by a simple diffusion model on the grap

and generate the list of the final diffusion weight for each node

Parameters:

gfa_fn (string) – a gfa filename

Returns:

(entropy, list_of_diffusion_weight)

list_of_diffusion_weight = [(node_id, weight), ...]

Return type:

tuple

pgrtk.get_principle_bundle_bed_file_for_query(seqs, w=64, k=56, r=4, min_span=32, min_cov=2, min_branch_length=8)
pgrtk.get_variant_calls(aln_segs, ref_bgn, ctg_bgn, rs0, cs0, strand)

Generate a variant call internal representation from the alignment segments.

Parameters:
  • aln_segs (list of tuples) –

    • a list of tuples of “alignment segments” generate by pgrtk.pgrtk.get_aln_segments()

    • the “alignment segments” are a list of (ref_loc: SeqLocus, tgt_loc: SeqLocus, align_type: AlnSegType). The data structures is defined as following Rust structs:

      pub struct SeqLocus {
          pub id: u32,
          pub bgn: u32,
          pub len: u32,
      }
      
      pub enum AlnSegType {
          Match,
          Mismatch,
          Insertion,
          Deletion,
          Unspecified,
      }
      
      pub struct AlnSegment {
          pub ref_loc: SeqLocus,
          pub tgt_loc: SeqLocus,
          pub t: AlnSegType,
      }
      

  • ref_bgn (int) – the reference sequence start position

  • ctg_bgn (int) – the contig start position

  • rs0 (string) – the reference sequence

  • cs0 (string) – the contig sequence

  • strand (int) – the contig strand

Returns:

a dictionary mapping the key (ref_id, reference_position) to a set of variant calls in the form of a dictionary mapping from (target_location, strand) to a variant call record.

Return type:

dict

pgrtk.group_smps_by_principle_bundle_id(smps, len_cutoff=2500, merge_length=5000)
Filter and group SHIMMER pair output from SeqIndexDB.get_principal_bundle_decomposition()

by bundle id. This function will filter out small bundle segment with lenght smaller than len_curoff and merge two bundle with the same id and direction within merge_length

TODO: This is currently implemented in python, we plan to move this as Rust code in the future.

Parameters:
  • len_cutoff (int) – the length cutoff used for filtering small bundle segment

  • merge_length (int) – the length determining if two bundles should be merged

Returns:

a list of the lists of SHIMMER pairs tagged with bundle id, direction, position in the bundle

each element of the list SHIMMER is a tuple of ((shimmer0, shimmer1, pos0, pos1, direction), bundle_id, direction_to_the_bundle, position_in bundle)

Return type:

list

pgrtk.map_intervals_in_sdb(seq_index_db, interval, query_seq, gap_penality_factor=0.001, merge_range_tol=100, max_count=32, max_query_count=32, max_target_count=32, max_aln_span=8)

TODO: Document

pgrtk.merge_regions(rgns, tol=1000)

Take a list of ranges and merge them if two regions are within tol. :param rgns: a list of tuples of (start, end, length, orientation, …) :type rgns: list of tuples

Returns:

a list of tuples of (start, end, length, orientation, …)

Return type:

list of tuples

pgrtk.output_variants_to_vcf_records(variant_calls, ref_name)

Convert the variant calls to VCF records.

Parameters:
  • variant_calls (dict) – the variant calls generated by get_variant_calls()

  • ref_name (string) – reference sequence name

Returns:

list of VCF records

Return type:

list

pgrtk.query_sdb(seq_index_db, query_seq, gap_penalty_factor=0.25, merge_range_tol=12, max_count=128, max_query_count=128, max_target_count=128, max_aln_span=8)

Query a sequence index database for a query sequence.

Parameters:
  • seq_index_db (SeqIndexDB object) – a sequence index database object

  • query_seq (list of bytes) – a list of bytes representing the DNA sequence

  • gap_penalty_factor (float) – the gap penalty factor used in sparse dynamic programming for finding the hits

  • merge_range_tol (int) – a parameter used to merge the alignment ranges

  • max_count (int) – only use the shimmer pairs that less than the max_count for sparse dynamic programming

  • max_query_count (int) – only use the shimmer pairs that less than the max_count in the query sequence for sparse dynamic programming

  • max_query_count – only use the shimmer pairs that less than the max_count in the target sequence for sparse dynamic programming

  • max_aln_span (int) – the size of span used in the sparse dynamic alignment for finding the hits

Returns:

  • a python dictionary with the key as the target sequence id and the value as a list of alignment ranges

  • each alignment ranges is a list of tuples, each tuple is (start, end, length, orientation, aln_records)

  • the aln_records is a list of tuples of (target_sequence_id, (score, list_of_the_hit_pairs)), where the list_of_the_hit_pairs is a list of tuples of ((query_start, query_end, query_orientation), (target_start, target_end, target_orientation))

Return type:

dict

pgrtk.rc(seq)

Reverse complement a sequence as a Python String.

Parameters:

seq (string) – a DNA sequence as a Python String

Returns:

the reverse complement DNA sequence as a Python String

Return type:

string

pgrtk.rc_byte_seq(seq)

Reverse complement a sequence as a list of bytes.

Parameters:

seq (list of bytes) – ascii code of the DNA sequence

Returns:

the list of bytes of the reverse complement DNA sequence

Return type:

list of bytes

pgrtk.rc_u8_seq(seg)

Reverse complement a sequence as a list of bytes (unsigned 8bit interger).

Parameters:

seq (list of bytes / usigned 8bit interger) – ascii code of the DNA sequence

Returns:

the list of bytes of the reverse complement DNA sequence

Return type:

list of bytes

pgrtk.string_to_u8(s)

Convert a Python String to a list of bytes.

Parameters:

s (string) – a Python String of a DNA sequence

Returns:

a list of bytes representing the DNA sequence

Return type:

list of bytes

pgrtk.u8_to_string(u8)

Convert DNA sequene in a list of bytes to a Python String.

Parameters:

u8 (list of bytes) – a list of bytes representing the DNA sequence

Returns:

a Python String of a DNA sequence

Return type:

string

pgrtk.pgrtk

The internal pgrtk modules implemented with Rust. These classes and fucntion are re-exported as pgrtk.* so import pgrtk will bring these classes and function into pgrtk.* scope to avoid using the verbose pgrtk.pgrtk.*.

class pgrtk.pgrtk.AGCFile

Bases: object

A PyO3 class wrapping an existing AGC file for reading

Example:

>>> agc_file = AGCFile("/path/to/genomes.agc")
ctg_lens

A hashmap mapping (source, ctg_name) to sequence length. It is more efficient to make a copy in Python to access it to avoid expensive Rust struct ot Python object conversion than using it directly. For example:

>>> agc_file = AGCFile("genomes.agc")
>>> agc_ctg_lens = agc_file.ctg_lens.copy()
get_seq(sample_name, ctg_name)

fetch a full contig sequence from an AGC file

Parameters:
  • sample_name (string) – the sample name stored in the AGC file

  • ctg_name (string) – the contig name stored in the AGC file

Returns:

a list of bytes representing the sequence

Return type:

list

get_sub_seq(sample_name, ctg_name, bgn, end)

fetch a contiguous sub-sequence from an AGC file

Parameters:
  • sample_name (string) – the sample name stored in the AGC file

  • ctg_name (string) – the contig name stored in the AGC file

  • bgn (int) – the starting coordinate (0-based)

  • end (int) – the ending coordinate (exclusive)

Returns:

a list of bytes representing the sequence

Return type:

list

class pgrtk.pgrtk.SeqIndexDB

Bases: object

A class that stores pangenome indices and sequences with multiple backend storage options (AGC, fasta file, memory) Large set of genomic sequences, a user should use AGC backend. A binary file provides the command pgr-mdb which can read an AGC to create the index file. For example, we can create the index files from an AGC file:

# create a file that contains a list of file that contains a set of files from which we want to build the indices

$ echo HPRC-y1-rebuild-04252022.agc > filelist

# using pgr-mdb to create the index files, for 97 haplotyped genome assembly from HPRC year one release,
# it takes about 30 to 40 min to create the index files

$ pgr-mdb filelist HPRC-y1-rebuild-04252022

# two index files will be created by the pgr-mdb command
# one with a suffix .mdb and another one with a suffix .midx
# when we use the load_from_agc_index() method, all three files, e.g., genomes.agc, genomes.mdb and
# genomes.midx should have the same prefix as the parameter used to call  load_from_agc_index() method

One can also create index and load the sequences from a fasta file using `load_from_fastx()` methods. Currently, this might be a good option for mid-size dataset (up to a couple of hundred megabases).

Or, a user can load the sequence from memory using a Python list. This is convenient when one needs to rebuild the SHIMMER index with different parameters for a different resolution.

Once the index is built, the database can be queried quickly by using the query_fragment() or the query_fragment_to_hps() method.

append_from_fastx()
generate_mapg_gfa(min_count, filepath, method=Ellipsis, keeps=Ellipsis)

Convert the adjacent list of the shimmer graph shimmer_pair -> GFA

Parameters:
  • min_count (int) – the minimum number of times a pair of shimmers must be observed to be included in the graph

  • filepath (string) – the path to the output file

Returns:

The data is written into the file at filepath

Return type:

None

generate_principal_mapg_gfa(min_count, path_len_cutoff, filepath, keeps=Ellipsis)

Convert the adjacent list of the shimmer graph shimmer_pair -> GFA

Parameters:
  • min_count (int) – the minimum number of times a pair of shimmers must be observed to be included in the graph

  • filepath (string) – the path to the output file

Returns:

The data is written into the file at filepath

Return type:

None

get_match_positions_with_fragment(seq)

use a fragment of sequence to query the database to get all hits and sort it by the data base sequence id

Parameters:

seq (list) – the sequence in bytes used for query

Returns:

a dictionary maps sequence id in the database to a list of tuple (position0, position1, direction)

Return type:

dict

get_principal_bundle_decomposition(min_count, path_len_cutoff, keeps=Ellipsis)

Get the principal bundles and bundle decomposition of all sequences

Parameters:
  • min_count (int) – minimum coverage count to be included in the graph

  • path_len_cut_off (int) –

    remove short path less than path_len_cut_off when generating the principal path

    if the number is small, the generated principal paths will be more fragmented.

Returns:

a tuple consist of two lists: (principal_bundles, seqid_smps_with_bundle_id_seg_direction)

principal_bundles = list of (principal_bundle_id, ave_bundle_position, list_bundle_vertex)

list_of_bundle_vertex = list of (hash0:u64, hash0:u64, direction:u8)

seqid_smps_with_bundle_id_seg_direction = list of shimmer pairs in the database annotated with principal bundle id and direction

the elements of the list are ((hash0:u64, hash1:u64, pos0:u32, pos0:u32, direction:0),

(principal_bundle_id, direction, order_in_the_bundle))

Return type:

tuple

get_principal_bundle_projection(min_count, path_len_cutoff, sequence, keeps=Ellipsis)

Project sequences outside the sequence database on to a principal bundle decomposition

Parameters:
  • min_count (int) – minimum coverage count to be included in the graph

  • path_len_cut_off (int) –

    remove short path less than path_len_cut_off when generating the principal path

    if the number is small, the generated principal paths will be more fragmented.

  • sequences ((contig_id: int, list of sequences)) –

Returns:

a tuple consist of two lists: (principal_bundles, seqid_smps_with_bundle_id_seg_direction)

principal_bundles = list of (principal_bundle_id, ave_bundle_position, list_bundle_vertex)

list_of_bundle_vertex = list of (hash0:u64, hash0:u64, direction:u8)

seqid_smps_with_bundle_id_seg_direction = list of shimmer pairs in the database annotated with principal bundle id and direction

the elements of the list are ((hash0:u64, hash1:u64, pos0:u32, pos0:u32, direction:0),

(principal_bundle_id, direction, order_in_the_bundle))

Return type:

tuple

get_principal_bundles(min_count, path_len_cutoff, keeps=Ellipsis)

Get the principal bundles in MAPG

Parameters:
  • min_count (int) – minimum coverage count to be included in the graph

  • path_len_cut_off (int) –

    remove short path less than path_len_cut_off when generating the principal path

    if the number is small, the generated principal paths will be more fragmented.

Returns:

list of paths, each path is a list of nodes

Return type:

list

get_seq(sample_name, ctg_name)

fetch a sequence

Parameters:
  • sample_name (string) – the sample name stored in the AGC file

  • ctg_name (string) – the contig name stored in the AGC file

Returns:

a list of bytes representing the sequence

Return type:

list

get_seq_by_id(sample_name, ctg_name)

fetch a sequence by the sequence id in the database

Parameters:
  • sid (int) – sequence id in the database

  • ctg_name (string) – the contig name stored in the AGC file

Returns:

a list of bytes representing the sequence

Return type:

list

get_shmmr_map()

get the shmmr_pair to fragment_id map in Python

this can be very expensive to generate the Python objects of a large hashmap in Rust

Returns:

the shmmr_pair to fragments map

fragments: a list of FragmentSignature: (frg_id, seq_id, bgn, end, orientation(to the shimmer pair)) defined as:

pub type FragmentSignature = (u32, u32, u32, u32, u8);

Return type:

dict

get_shmmr_pair_count(shmmr_pair)

count the number of shimmer hits in the database

Parameters:

shmmr_pair (tuple) – a shimmer pair used for query

Returns:

number of hits

Return type:

int

get_shmmr_pair_list()

get the shmmr_pair to fragment_id map in Python as a list

this can be very expensive to generate the Python objects of a large hashmap in Rust

Returns:

list of the tuple (shmmr0, shmmr1, seq_id, position0, position1, orientation)

Return type:

list

get_shmmr_pair_source_count(shmmr_pair, max_unique_count)

count the number of shimmer hits partitioned by the source file in the database

Parameters:
  • shmmr_pair (tuple) – a shimmer pair used for query

  • max_unique_count (int) – a integer to filter out shimmer pairs with count that are greater than the max_unique_count

Returns:

a list of the tuple (source_name : string, count : int)

Return type:

list

get_shmmr_spec()

Output the specific of the shimmer used to build the index

Returns:

(window_size, k_mer_size, reduction_factor, min_space, use_sketch)

Return type:

tuple

get_smp_adj_list(min_count, keeps=Ellipsis)

Get adjacent list of the shimmer graph shimmer_pair -> shimmer_pair

Parameters:

min_count (int) – the minimum number of times a pair of shimmers must be observed to be included in the graph

Returns:

list of pairs of shimmer pairs ((h00, h01, orientation0),(h10, h11, orientation1))

Return type:

list

get_sub_seq(sample_name, ctg_name, bgn, end)

fetch a contiguous sub-sequence

Parameters:
  • sample_name (string) – the sample name stored in the AGC file

  • ctg_name (string) – the contig name stored in the AGC file

  • bgn (int) – the starting coordinate (0-based)

  • end (int) – the ending coordinate (exclusive)

Returns:

a list of bytes representing the sequence

Return type:

list

get_sub_seq_by_id(sample_name, ctg_name, bgn, end)

fetch a contiguous sub-sequence by a sequence id

Parameters:
  • sid (int) – sequence id in the database

  • bgn (int) – the starting coordinate (0-based)

  • end (int) – the ending coordinate (exclusive)

Returns:

a list of bytes representing the sequence

Return type:

list

get_vertex_map_from_principal_bundles(pb)
load_from_agc_index(prefix)

use AGC file for sequences and the index created from an AGC file

Parameters:

prefix (string) – the prefix to the .agc, .mdb and .midx files

Return type:

None or I/O Error

load_from_fastx(filepath, w=Ellipsis, k=Ellipsis, r=Ellipsis, min_span=Ellipsis)

load and create the index created from a fasta / fastq file

Parameters:
  • filepath (string) – the path the fasta or fastq file

  • w (int) – the window size of the shimmer index, default to 80

  • k (int) – the k-mer size of the shimmer index, default to 56

  • r (int) – the reduction factor of the shimmer index, default to 4

  • min_span (int) – the min_span ofr the shimmer index, default to 8

Returns:

None

Return type:

None or I/O Error

load_from_frg_index(prefix)
load_from_seq_list(seq_list, source=Ellipsis, w=Ellipsis, k=Ellipsis, r=Ellipsis, min_span=Ellipsis)

load and create the index created from a python list

Parameters:
  • seq_list (list) – a list of tuple of the form (sequence_id : int, sequence_name : string, sequence: list of bytes)

  • source (string) – a string indicating the source of the sequence, default to “Memory”

  • w (int) – the window size of the shimmer index, default to 80

  • k (int) – the k-mer size of the shimmer index, default to 56

  • r (int) – the reduction factor of the shimmer index, default to 4

  • min_span (int) – the min_span ofr the shimmer index, default to 8

Returns:

None

Return type:

None or I/O Error

map_positions_in_seq(positions, seq, penalty, max_count, max_query_count, max_target_count, max_aln_span)

Given a sequence context, this function maps the specific positions in the context to the sequences in the database. The context sequence is aligned to the sequences in the database with sparse dynamic programming, then the regions include the positions of interest are identified. A wavefront alignment is performed to pin down the exact mapped positions in the sequences in the database.

Parameters:
  • positions (list of integer) – a list of integers of the position to map

  • seq (list of bytes) – a list of bytes representing the DNA sequence providing to context

  • penalty (float) – the gap penalty factor used in sparse dynamic programming for finding the hits

  • merge_range_tol (int) – a parameter used to merge the alignment ranges

  • max_count (int) – only use the shimmer pairs that less than the max_count for sparse dynamic programming

  • max_query_count (int) – only use the shimmer pairs that less than the max_count in the query sequence for sparse dynamic programming

  • max_query_count – only use the shimmer pairs that less than the max_count in the target sequence for sparse dynamic programming

  • max_aln_span (int) – the size of span used in the sparse dynamic alignment for finding the hits

Returns:

a list of tuples of (position_in_the_context, (target_seq_id, target_position, orientation), (context_end, context_end), (target_end, target_end))

the sequences from (context_end, context_end) in the context sequence and the sequences from (target_end, target_end) in the target sequence are used for the detailed alignment to pin down the exact mapped positions.

Return type:

list

query_fragment(seq)

use a fragment of sequence to query the database to get all hits

Parameters:

seq (list) – the sequence in bytes used for query

Returns:

a list of hits in the format (shimmer_pair, query_fragment, target_fragments), where
  • shimmer_pair: (int, int), tuple of the shimmer_pair

  • query_fragment: (int, int, int) = (start_coordinate, end_coordinate, orientation)

  • target_fragments: a list of FragmentSignature: (frg_id, seq_id, bgn, end, orientation(to the shimmer pair)) defined as:

    pub type FragmentSignature = (u32, u32, u32, u32, u8);
    

Return type:

list

query_fragment_to_hps(seq, penalty, max_count, max_query_count, max_target_count, max_aln_span)

use a fragment of sequence to query the database to get all hits

sparse dynamic programming is performed to long chain of alignment

Parameters:
  • seq (list of bytes) – a list of bytes representing the DNA sequence

  • penalty (float) – the gap penalty factor used in sparse dynamic programming for finding the hits

  • merge_range_tol (int) – a parameter used to merge the alignment ranges

  • max_count (int) – only use the shimmer pairs that less than the max_count for sparse dynamic programming

  • max_query_count (int) – only use the shimmer pairs that less than the max_count in the query sequence for sparse dynamic programming

  • max_query_count – only use the shimmer pairs that less than the max_count in the target sequence for sparse dynamic programming

  • max_aln_span (int) – the size of span used in the sparse dynamic alignment for finding the hits

Returns:

a list of tuples of (target_sequence_id, (score, list_of_the_hit_pairs)), where the list_of_the_hit_pairs is a list of tuples of ((query_start, query_end, query_orientation), (target_start, target_end, target_orientation))

Return type:

list

seq_index

get a dictionary that maps (ctg_name, source) -> (id, len)

seq_info

a dictionary that maps id -> (ctg_name, source, len)

shmmr_sparse_aln_consensus(sids, min_cov)

generate consensus sequence for one sequence in the database

sort_adj_list_by_weighted_dfs(adj_list, start)

Sort the adjacent list of the shimmer graph

Parameters:
  • adj_list (list) – the list from the output from get_smp_adj_list

  • sort_by ((u64, u64)) – the starting node signature

Returns:

list of node in the tuple (node, parent_node, node_weight, is_leaf, global_rank, branch, branch_rank)

Return type:

list

write_frag_and_index_files(file_prefix)
write_mapg_idx(filepath)

Write additional meta data for GFA into a file

Parameters:

filepath (string) – the path to the output file

Returns:

The data is written into the file at filepath

Return type:

None

write_midx_to_text_file(filepath)
pgrtk.pgrtk.get_shmmr_dots(seq0, seq1, w=Ellipsis, k=Ellipsis, r=Ellipsis, min_span=Ellipsis)

Generate a list of shimmer matches for creating a dot plot between two sequences

Parameters:
  • seq0 (list) – a list of bytes representing the first sequences

  • seq1 (list) – a list of bytes representing the second sequences

  • w (int) – window size, default to 80, max allowed is 128

  • k (int) – k-mer size, default to 56, max allowed is 56

  • r (int) – reduction factor for generate sparse hierarchical minimizers (shimmer), default to 4, max allowed is 12

  • min_span (int) –

    • a parameter to remove close-by shimmer pairs

    • if not zero, shimmer pairs which the distance between them are less than the min_span will be removed

Returns:

(x, y):

  • x: the matched shimmer positions in sequence 0

  • y: the matched shimmer positions in sequence 1

Return type:

tuple of two lists

pgrtk.pgrtk.get_shmmr_pairs_from_seq(seq, w=Ellipsis, k=Ellipsis, r=Ellipsis, min_span=Ellipsis, padding=Ellipsis)

Generate a list of shimmer pair from a sequence

Parameters:
  • w (int) – window size, default to 80, max allowed is 128

  • k (int) – k-mer size, default to 56, max allowed is 56

  • r (int) – reduction factor for generate sparse hierarchical minimizers (shimmer), default to 4, max allowed is 12

  • min_span (int) –

    • a parameter to remove close-by shimmer pairs

    • if not zero, shimmer pairs which the distance between them are less than the min_span will be removed

    • default to 16

  • padding (bool) –

    • for short fragment that segmented by using shimmer, set padding to true to preserve the first and last shimmers

    • default to false

Returns:

a list fo tuple of (shmmr0, shmmr1, position0, position1, orientation)

Return type:

list of tuple

pgrtk.pgrtk.guided_shmmr_dbg_consensus(seqs, w=Ellipsis, k=Ellipsis, r=Ellipsis, min_span=Ellipsis, min_cov=Ellipsis)

Perform a guided shimmer de Bruijn graph consensus

Parameters:
  • aln_segs (list) – a list of the list of bytes representing the bases of each sequence

  • k (int) – specification of the shimmers for construting graph

  • w (int) – specification of the shimmers for construting graph

  • r (int) – specification of the shimmers for construting graph

  • min_span (int) – specification of the shimmers for construting graph

  • min_cov (int) – to keep hyplotype specific consensus, if a kmer has coverage more or equal to min_cov, it will be kept

Returns:

a list of a set of bytes representing the consensus sequences of all branches in the graph

Return type:

list

pgrtk.pgrtk.naive_dbg_consensus(seqs, kmer_size=Ellipsis, min_cov=Ellipsis)

Generate the CIGAR string from two sequences with WFA

Parameters:
  • seq0 (string) – a string representing the first sequence

  • seq1 (string) – a string representing the second sequence

  • ref_id (int) – a integer id for the reference sequence

  • ref_seq (string) – a python string of the reference sequence

  • tgt_id (int) – a integer id for the target sequence

  • tgt_seq (string) – a python string of the target sequence

  • aln_segs (list) – a list of the AlnSegment

  • s0 (string) – a python string of the reference sequence

  • s1 (int) – a integer id for the target sequence

  • aln_segs – a list of the list of bytes representing the bases of each sequence

  • kmer_size (int) – the size of kmers used for constructing the de Bruijn graph

  • min_cov (int) – to keep haplotype specific consensus, if a kmer has coverage more or equal to min_cov, it will be kept

Returns:

  • tuple – tuple of (alignment_score, CIGAR_list)

  • Get alignment segments from two sequences

Returns:

  • list – a list of AlnSegment

    the AlnSegment is a Rust struct defined as:

    pub struct SeqLocus {
        pub id: u32,
        pub bgn: u32,
        pub len: u32,
    }
    
    pub enum AlnSegType {
        Match,
        Mismatch,
        Insertion,
        Deletion,
        Unspecified,
    }
    
    pub struct AlnSegment {
        pub ref_loc: SeqLocus,
        pub tgt_loc: SeqLocus,
        pub t: AlnSegType,
    }
    
  • Get alignment map from a list of alignment segments

Returns:

  • list – a list of AlnSegment

  • Perform a naive de Bruijn graph consensus

Returns:

a list of bytes representing the consensus sequence

Return type:

list

pgrtk.pgrtk.pgr_lib_version()

Get the revision (git-hashtag) of the build

pgrtk.pgrtk.shmmr_dbg_consensus(seqs, w=Ellipsis, k=Ellipsis, r=Ellipsis, min_span=Ellipsis)

Perform a shimmer de Bruijn graph consensus

Parameters:
  • aln_segs (list) – a list of the list of bytes representing the bases of each sequence

  • k (int) – specification of the shimmers for construting graph

  • w (int) – specification of the shimmers for construting graph

  • r (int) – specification of the shimmers for construting graph

  • min_span (int) – specification of the shimmers for construting graph

Returns:

a list of a set of bytes representing the consensus sequences of all branches in the graph

Return type:

list

pgrtk.pgrtk.shmmr_sparse_aln_consensus(seqs, w=Ellipsis, k=Ellipsis, r=Ellipsis, min_span=Ellipsis, min_cov=Ellipsis)

Perform a shimmer de Bruijn graph consensus

Parameters:
  • aln_segs (list) – a list of the list of bytes representing the bases of each sequence

  • k (int) – specification of the shimmers for construting graph

  • w (int) – specification of the shimmers for construting graph

  • r (int) – specification of the shimmers for construting graph

  • min_span (int) – specification of the shimmers for construting graph

Returns:

a list of a set of bytes representing the consensus sequences of all branches in the graph

Return type:

list

pgrtk.pgrtk.sparse_aln(sp_hits, max_span, penalty)

Perform sparse dynamic programming to identify alignment between sequence using matched shimmer pairs

Parameters:
  • sp_hits (list) – a list of tuple of HitPair defined as pub type HitPair = ((u32, u32, u8), (u32, u32, u8)) This represents the hits as the position of matched Shimmer Pairs from the two sequence. For example, if there two shimmers at position 2342 and 4322 of the query sequence that matches the shimmers at positions 6125465 and 6127445, them the HitPair will be (2342, 4322, 0) and (6125465, 6127445, 0). The third number would be 1 if shimmer paired are reversed matched to the sequence orientation and 0 otherwise.

  • max_span (int) – For a give hit, the max_span defines how many other following hits are considered for the next aligned position. This will limit the search space for the best alignment. If the two sequences are very repetitive, then one needs to use larger max_span to ensure capturing the right alignment path

  • penalty (float) – this parameter will determine when to break the alignment if there are big gaps between the alignment segment. One can set it to zero to catch large chunk alignment ignoring the gaps. Typically, a number between 0.1 to 0.5 should be used.