US 12,230,365 B2
Systems and methods for grouping and collapsing sequencing reads
Chen Zhao, San Diego, CA (US); Kevin Eric Wu, Oceanside, CA (US); and Sven Bilke, Bethesda, MD (US)
Assigned to Illumina, Inc., San Diego, CA (US)
Filed by Illumina, Inc., San Diego, CA (US)
Filed on May 12, 2023, as Appl. No. 18/316,939.
Application 18/316,939 is a continuation of application No. 16/667,642, filed on Oct. 29, 2019, granted, now Pat. No. 11,688,489.
Claims priority of provisional application 62/753,786, filed on Oct. 31, 2018.
Prior Publication US 2023/0282309 A1, Sep. 7, 2023
Int. Cl. G01N 33/48 (2006.01); G06F 16/22 (2019.01); G06F 16/2457 (2019.01); G16B 30/10 (2019.01); G16B 30/20 (2019.01)
CPC G16B 30/10 (2019.02) [G06F 16/2255 (2019.01); G06F 16/24578 (2019.01); G16B 30/20 (2019.02)]
27 Claims
 
1. A computer-implemented method for determining a nucleotide sequence from nucleotide sequencing reads, comprising:
receiving a plurality of first nucleotide sequencing reads and a second nucleotide sequencing read associated with each first nucleotide sequencing read;
for each first nucleotide sequencing read and associated second nucleotide sequencing read:
generating a plurality of first identifier subsequences from a first identifier sequence of the first nucleotide sequencing read comprising subsequences of the first identifier sequence;
generating a plurality of second identifier subsequences from a second identifier sequence of the second nucleotide sequencing read comprising subsequences of the second identifier sequence;
for each first identifier subsequence and second identifier subsequence, determining a plurality of hashes using a plurality of hash functions;
generating a first signature for the first nucleotide sequencing read comprising a plurality of first signature hashes for a plurality of first positions, wherein a first signature hash is selected from the hashes of the plurality of hashes determined for the plurality of first identifier subsequences at the first position;
generating a second signature for the second nucleotide sequencing read comprising a plurality of second signature hashes for a plurality of second positions, wherein a second signature hash is selected from the hashes of the plurality of hashes determined for the plurality of second identifier subsequences at the second position; and
assigning the first nucleotide sequencing read or the second nucleotide sequencing read to at least one first particular bin of a first hash data structure based on the first signature or based on the second signature, wherein keys of bins of the first hash data structure are stored in a first key data structure and keys of bins of a second hash data structure are stored in a second key data structure and wherein the assigning comprises using a first stored key of the first key data structure or a second stored key of the second key data structure; and
determining a nucleotide sequence for each first particular bin of the first hash data structure with one or more first nucleotide sequencing reads assigned.
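
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation. The claim does not say how a signature hash is "selected"; the sketch assumes MinHash-style selection, keeping the minimum hash at each position. It also assumes that the hash data structures are locality-sensitive-hashing tables keyed by bands of the concatenated signatures, and that the nucleotide sequence for a bin is a column-wise majority vote. All function names, the subsequence length k, the number of hash functions, and the band size are hypothetical.

```python
import hashlib
from collections import Counter, defaultdict


def subsequences(identifier, k=3):
    """All overlapping length-k subsequences of an identifier sequence."""
    return [identifier[i:i + k] for i in range(len(identifier) - k + 1)]


def hash_with_seed(kmer, seed):
    """One member of a hash family: BLAKE2b salted with the seed (assumed choice)."""
    digest = hashlib.blake2b(
        kmer.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def signature(identifier, num_hashes=4, k=3):
    """Signature with one hash per position.

    Assumption: position i keeps the minimum of hash function i over all
    subsequences (MinHash); the claim only says a hash is 'selected'.
    """
    kmers = subsequences(identifier, k)
    return tuple(
        min(hash_with_seed(kmer, seed) for kmer in kmers)
        for seed in range(num_hashes)
    )


def assign_to_bins(read_pairs, num_hashes=4, rows=2):
    """Assign read pairs to bins of several hash tables.

    read_pairs: list of (identifier1, read1, identifier2, read2) tuples.
    Assumption: one table per band of `rows` consecutive signature hashes;
    each table's keys play the role of a stored key data structure.
    """
    tables = [defaultdict(list) for _ in range(2 * num_hashes // rows)]
    for idx, (id1, _, id2, _) in enumerate(read_pairs):
        sig = signature(id1, num_hashes) + signature(id2, num_hashes)
        for band, table in enumerate(tables):
            key = sig[band * rows:(band + 1) * rows]
            table[key].append(idx)
    return tables


def consensus(seqs):
    """Column-wise majority vote over equal-length sequences (assumed collapse rule)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


# Made-up example data: three reads share identifiers, one does not.
pairs = [
    ("ACGTAC", "TTGACA", "GGCATT", "AACCGG"),
    ("ACGTAC", "TTGACA", "GGCATT", "AACCGG"),
    ("ACGTAC", "TTGCCA", "GGCATT", "AACCGG"),
    ("TTTGGA", "CCCCCC", "CAGTCA", "GGGGGG"),
]
tables = assign_to_bins(pairs)
for members in tables[0].values():
    print(members, consensus([pairs[i][1] for i in members]))
```

In this sketch, read pairs with identical identifier sequences always receive identical signatures and therefore share a bin in every table. Identifiers that differ by a sequencing error may still share one or more bands, which is the usual motivation for banding a signature instead of keying on it whole.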