| CPC G16B 30/10 (2019.02) [G06F 16/2255 (2019.01); G06F 16/24578 (2019.01); G16B 30/20 (2019.02)] | 27 Claims |
|
1. A computer-implemented method for determining a nucleotide sequence from nucleotide sequencing reads, comprising:
receiving a plurality of first nucleotide sequencing reads and a second nucleotide sequencing read associated with each first nucleotide sequencing read;
for each first nucleotide sequencing read and associated second nucleotide sequencing read:
generating a plurality of first identifier subsequences from a first identifier sequence of the first nucleotide sequencing read comprising subsequences of the first identifier sequence;
generating a plurality of second identifier subsequences from a second identifier sequence of the second nucleotide sequencing read comprising subsequences of the second identifier sequence;
for each first identifier subsequence and second identifier subsequence, determining a plurality of hashes using a plurality of hash functions;
generating a first signature for the first nucleotide sequencing read comprising a plurality of first signature hashes for a plurality of first positions, wherein a first signature hash is selected from the hashes of the plurality of hashes determined for the plurality of first identifier subsequences at the first position;
generating a second signature for the second nucleotide sequencing read comprising a plurality of second signature hashes for a plurality of second positions, wherein a second signature hash is selected from the hashes of the plurality of hashes determined for the plurality of second identifier subsequences at the second position; and
assigning the first nucleotide sequencing read or the second nucleotide sequencing read to at least one first particular bin of a first hash data structure based on the first signature or based on the second signature, wherein keys of bins of the first hash data structure are stored in a first key data structure and keys of bins of a second hash data structure are stored in a second key data structure and wherein the assigning comprises using a first stored key of the first key data structure or a second stored key of the second key data structure; and
determining a nucleotide sequence for each first particular bin of the first hash data structure with one or more first nucleotide sequencing reads assigned.
|
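The steps recited in claim 1 can be illustrated with a minimal sketch. It is not the claimed implementation, and it rests on several assumptions the claim does not state: the identifier sequence is taken to be a fixed-length prefix of each read (`umi_len`), subsequences are overlapping k-mers, the hash-function family is BLAKE2b salted with the function index, the signature hash at each position is selected as the minimum (a MinHash-style choice), bins are matched on exact signature equality, and the nucleotide sequence for a bin is a per-column majority vote. The two key data structures are modeled as plain dictionaries that both point into a single bin table, which simplifies the two hash data structures of the claim. All function and parameter names are hypothetical.

```python
import hashlib
from collections import Counter, defaultdict


def subsequences(identifier, k=4):
    """Return all overlapping length-k subsequences of an identifier sequence."""
    return [identifier[i:i + k] for i in range(len(identifier) - k + 1)]


def seeded_hash(subseq, seed):
    """One member of an assumed hash-function family: BLAKE2b salted with the seed."""
    digest = hashlib.blake2b(
        subseq.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def signature(identifier, num_hashes=8, k=4):
    """Build a signature with one hash per position.

    Position j holds the smallest value hash function j produces over all
    subsequences (the selection rule is an assumption, not from the claim).
    """
    subs = subsequences(identifier, k)
    return tuple(min(seeded_hash(s, j) for s in subs) for j in range(num_hashes))


def bin_read_pairs(read_pairs, umi_len=8):
    """Assign each read pair to a bin using either read's signature."""
    first_keys = {}   # first key data structure: signature of read 1 -> bin id
    second_keys = {}  # second key data structure: signature of read 2 -> bin id
    bins = defaultdict(list)  # bin id -> payload sequences of assigned first reads
    for read1, read2 in read_pairs:
        sig1 = signature(read1[:umi_len])
        sig2 = signature(read2[:umi_len])
        # Look up a stored key in the first structure, then in the second.
        bin_id = first_keys.get(sig1)
        if bin_id is None:
            bin_id = second_keys.get(sig2)
        if bin_id is None:
            bin_id = len(bins)  # no stored key matched: open a new bin
        first_keys.setdefault(sig1, bin_id)
        second_keys.setdefault(sig2, bin_id)
        bins[bin_id].append(read1[umi_len:])
    return dict(bins)


def consensus(sequences):
    """Determine one sequence per bin by per-column majority vote."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*sequences))


if __name__ == "__main__":
    # Toy read pairs: 8-base identifier followed by the payload sequence.
    pairs = [
        ("AAAACCCC" + "ACGTACGT", "GGGGTTTT" + "TTTT"),
        ("AAAACCCC" + "ACGTACGA", "GGGGTTTT" + "TTTT"),
        ("TTTTGGGG" + "GGGGCCCC", "ACACACAC" + "TTTT"),
    ]
    for bin_id, seqs in bin_read_pairs(pairs).items():
        print(bin_id, consensus(seqs))
```

Exact signature matching keeps the sketch short but does not tolerate sequencing errors in the identifier. A practical variant might instead split each signature into bands and use each band as a bin key, so that reads whose identifiers differ slightly can still share a bin.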