US 12,009,062 B2
Efficient clustering of noisy polynucleotide sequence reads
Luis Ceze, Seattle, WA (US); Sergey Yekhanin, Redmond, WA (US); Siena Dumas Ang, Seattle, WA (US); Karin Strauss, Seattle, WA (US); Cyrus Rashtchian, Seattle, WA (US); Ravindran Kannan, Oakland, CA (US); and Konstantin Makarychev, Seattle, WA (US)
Assigned to Microsoft Technology Licensing, LLC, Redmond, WA (US)
Appl. No. 16/325,112
Filed by Microsoft Technology Licensing, LLC, Redmond, WA (US)
PCT Filed Sep. 25, 2017, PCT No. PCT/US2017/053147 § 371(c)(1), (2) Date Feb. 12, 2019, PCT Pub. No. WO2018/063950, PCT Pub. Date Apr. 5, 2018.
Claims priority of provisional application 62/402,873, filed on Sep. 30, 2016.
Prior Publication US 2021/0035657 A1, Feb. 4, 2021
Int. Cl. G16B 30/10 (2019.01); G06F 16/28 (2019.01); G16B 40/00 (2019.01); G16B 30/20 (2019.01)

CPC G16B 30/10 (2019.02) [G06F 16/285 (2019.01); G16B 40/00 (2019.02); G16B 30/20 (2019.02)]

16 Claims

1. A system for improving accuracy of polynucleotide sequencing, the system comprising:

a polynucleotide sequencer configured to generate a plurality of DNA reads from a plurality of DNA strands having different nucleotide sequences, wherein the plurality of DNA reads includes errors introduced by the polynucleotide sequencer;

at least one processing unit;

a memory in communication with the processing unit;

a clusterization module stored in the memory and executable on the processing unit to divide the plurality of DNA reads into clusters by first grouping DNA reads having a same hash as determined by randomized locality-sensitive hashing (LSH) into buckets and then grouping DNA reads in a same bucket into clusters based at least in part on similarity of signatures of the DNA reads, wherein the signatures are generated by a technique that deterministically embeds edit-distance space into Hamming space; and

a consensus output sequence generator stored in the memory and executable on the processing unit configured to generate a single consensus output sequence from each cluster that has less error than individual reads in each cluster.
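
For illustration only, the data flow recited in claim 1 (bucket reads by a randomized locality-sensitive hash, group reads within a bucket by signature similarity, then derive one consensus per cluster) might be sketched as follows. The sketch is not taken from the specification and every concrete choice in it is an assumption: min-hashing over k-mers stands in for the randomized LSH, a q-gram presence bit vector stands in for the deterministic embedding of edit-distance space into Hamming space, merging is repeated over several rounds with fresh hash seeds, and the consensus is a per-position majority vote that assumes equal-length reads with substitution errors only. All function names, parameters, and thresholds are hypothetical.

```python
import hashlib
import random
from collections import Counter, defaultdict

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}  # reads are assumed to use only these letters


def lsh_hash(read, k, salt):
    """Assumed LSH: min-hash over the read's k-mers (requires len(read) >= k)."""
    return min(
        hashlib.blake2b(read[i:i + k].encode(), digest_size=8, salt=salt).digest()
        for i in range(len(read) - k + 1)
    )


def signature(read, q=3):
    """Deterministic bit vector: bit j is set iff q-gram number j occurs in the read."""
    bits = 0
    for i in range(len(read) - q + 1):
        index = 0
        for base in read[i:i + q]:
            index = index * 4 + BASES[base]
        bits |= 1 << index
    return bits


def hamming(x, y):
    """Number of differing bits between two signatures."""
    return bin(x ^ y).count("1")


def cluster_reads(reads, rounds=8, k=4, q=3, max_hamming=6, seed=0):
    """Start from singleton clusters; in each round, bucket clusters by the LSH
    of their first read and merge clusters in a bucket whose signatures are close."""
    clusters = [[r] for r in reads]
    rng = random.Random(seed)
    for _ in range(rounds):
        salt = rng.getrandbits(64).to_bytes(8, "big")  # fresh hash function per round
        buckets = defaultdict(list)
        for cluster in clusters:
            buckets[lsh_hash(cluster[0], k, salt)].append(cluster)
        merged = []
        for group in buckets.values():
            kept = []
            for cluster in group:
                sig = signature(cluster[0], q)
                for target in kept:
                    if hamming(sig, signature(target[0], q)) <= max_hamming:
                        target.extend(cluster)
                        break
                else:
                    kept.append(cluster)
            merged.extend(kept)
        clusters = merged
    return clusters


def consensus(cluster):
    """Per-position majority vote (assumes equal-length reads, substitutions only)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*cluster))


if __name__ == "__main__":
    strand_a, strand_b = "AACACCCAACCA", "GGTGTTTGGTTG"
    reads = [strand_a, strand_a, strand_a, "AACGCCCAACCA",
             strand_b, strand_b, strand_b, "GGTGTATGGTTG"]
    for cluster in cluster_reads(reads):
        print(len(cluster), consensus(cluster))
```

In this sketch the threshold max_hamming=6 is chosen because a single substitution alters at most q = 3 overlapping q-grams, so it changes at most 2q signature bits. A noisy read that never shares a bucket with its siblings in any round stays in its own cluster; more rounds raise the chance of a merge at the cost of more hashing.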