US 12,009,062 B2
Efficient clustering of noisy polynucleotide sequence reads
Luis Ceze, Seattle, WA (US); Sergey Yekhanin, Redmond, WA (US); Siena Dumas Ang, Seattle, WA (US); Karin Strauss, Seattle, WA (US); Cyrus Rashtchian, Seattle, WA (US); Ravindran Kannan, Oakland, CA (US); and Konstantin Makarychev, Seattle, WA (US)
Assigned to Microsoft Technology Licensing, LLC, Redmond, WA (US)
Appl. No. 16/325,112
Filed by Microsoft Technology Licensing, LLC, Redmond, WA (US)
PCT Filed Sep. 25, 2017, PCT No. PCT/US2017/053147
§ 371(c)(1), (2) Date Feb. 12, 2019,
PCT Pub. No. WO2018/063950, PCT Pub. Date Apr. 5, 2018.
Claims priority of provisional application 62/402,873, filed on Sep. 30, 2016.
Prior Publication US 2021/0035657 A1, Feb. 4, 2021
Int. Cl. G16B 30/10 (2019.01); G06F 16/28 (2019.01); G16B 40/00 (2019.01); G16B 30/20 (2019.01)
CPC G16B 30/10 (2019.02) [G06F 16/285 (2019.01); G16B 40/00 (2019.02); G16B 30/20 (2019.02)] 16 Claims
OG exemplary drawing
 
1. A system for improving accuracy of polynucleotide sequencing, the system comprising:
a polynucleotide sequencer configured to generate a plurality of DNA reads from a plurality of DNA strands having different nucleotide sequences, wherein the plurality of DNA reads includes errors introduced by the polynucleotide sequencer;
at least one processing unit;
a memory in communication with the processing unit;
a clusterization module stored in the memory and executable on the processing unit to divide the plurality of DNA reads into clusters by first grouping DNA reads having a same hash as determined by randomized locality-sensitive hashing (LSH) into buckets and then grouping DNA reads in a same bucket into clusters based at least in part on similarity of signatures of the DNA reads, wherein the signatures are generated by a technique that deterministically embeds edit-distance space into Hamming space; and
a consensus output sequence generator stored in the memory and executable on the processing unit to generate, from each cluster, a single consensus output sequence that has less error than the individual reads in that cluster.
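
The two-stage clusterization recited in claim 1 can be pictured with the minimal Python sketch below. It is illustrative only: the MinHash-over-k-mers hash, the q-gram presence bitmap used as the Hamming-space signature, and the constants Q, K, and HAMMING_THRESHOLD are assumptions standing in for the claimed randomized LSH and deterministic edit-distance-to-Hamming embedding, whose particulars the claim does not fix.

    # Illustrative sketch only -- not the patented implementation. It assumes
    # MinHash over k-mers as the randomized LSH and a q-gram presence bitmap
    # as a stand-in deterministic embedding of edit-distance space into
    # Hamming space.
    import random
    from collections import defaultdict

    Q = 4                    # q-gram length for the Hamming-space signature (assumed)
    K = 8                    # k-mer length fed to the LSH (assumed)
    HAMMING_THRESHOLD = 40   # max signature distance to join a cluster (assumed)

    random.seed(0)
    _SALT = random.getrandbits(64)   # one random hash function for the LSH

    def lsh_bucket_key(read: str) -> int:
        """Randomized LSH: MinHash over the read's k-mers."""
        kmers = {read[i:i + K] for i in range(len(read) - K + 1)} or {read}
        return min(hash(kmer) ^ _SALT for kmer in kmers)

    def signature(read: str) -> set:
        """Deterministic embedding into Hamming space: the set of q-grams
        present in the read, viewed as a sparse 0/1 vector indexed by q-gram."""
        return {read[i:i + Q] for i in range(len(read) - Q + 1)} or {read}

    def hamming_distance(sig_a: set, sig_b: set) -> int:
        """Hamming distance between the two sparse 0/1 q-gram vectors."""
        return len(sig_a ^ sig_b)

    def cluster_reads(reads):
        # Step 1: group reads sharing the same LSH value into buckets.
        buckets = defaultdict(list)
        for read in reads:
            buckets[lsh_bucket_key(read)].append(read)

        # Step 2: within each bucket, greedily merge reads whose signatures
        # are close in Hamming space.
        clusters = []
        for bucket in buckets.values():
            bucket_clusters = []   # list of (representative signature, [reads])
            for read in bucket:
                sig = signature(read)
                for rep_sig, members in bucket_clusters:
                    if hamming_distance(sig, rep_sig) <= HAMMING_THRESHOLD:
                        members.append(read)
                        break
                else:
                    bucket_clusters.append((sig, [read]))
            clusters.extend(members for _, members in bucket_clusters)
        return clusters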
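
A similarly hedged sketch of the consensus output sequence generator follows, assuming the reads in a cluster are already length-aligned and using a simple per-position plurality vote; the claimed generator is not limited to this approach.

    # Illustrative sketch only -- per-position plurality vote over a cluster
    # of length-aligned reads (assumed); the patented consensus generator may
    # use a more elaborate, edit-distance-aware reconstruction.
    from collections import Counter

    def consensus_sequence(cluster):
        """Return one output sequence by taking the most common nucleotide
        at each position across the reads in the cluster."""
        length = max(len(read) for read in cluster)
        consensus = []
        for pos in range(length):
            votes = Counter(read[pos] for read in cluster if pos < len(read))
            consensus.append(votes.most_common(1)[0][0])
        return "".join(consensus)

Running cluster_reads over the noisy reads and then consensus_sequence over each resulting cluster yields one lower-error output sequence per cluster, mirroring the clusterization-module/consensus-generator split recited in the claim.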