US 12,431,218 B2
Multi-pass software-accelerated genomic read mapping engine
Guillaume Alexandre Pascal Rizk, Rennes (FR)
Assigned to Illumina, Inc., San Diego, CA (US)
Filed by Illumina, Inc., San Diego, CA (US)
Filed on Mar. 3, 2023, as Appl. No. 18/117,088.
Claims priority of provisional application 63/317,859, filed on Mar. 8, 2022.
Prior Publication US 2023/0290443 A1, Sep. 14, 2023
Int. Cl. G16B 30/10 (2019.01); G16B 20/00 (2019.01); G16B 20/20 (2019.01); G16B 30/00 (2019.01); G16B 40/20 (2019.01)
CPC G16B 30/10 (2019.02) [G16B 20/00 (2019.02); G16B 20/20 (2019.02); G16B 30/00 (2019.02); G16B 40/20 (2019.02)]
20 Claims
 
1. A method for software accelerated genomic data read mapping a genomic data read to a reference genome, the method comprising:
(a) obtaining, by one or more computers, a first k-mer seed from the genomic data read;
(b) generating, by the one or more computers, a hash value representing a genomic signature by applying a hash function to the first k-mer seed;
(c) determining, by the one or more computers, a reference sequence location that matches at least a portion of the first k-mer seed using a hash data structure, wherein the hash data structure comprises N data cells and wherein a first data cell of the N data cells includes (i) a first portion storing a predetermined genomic signature derived from the portion of the first k-mer seed and (ii) a second portion storing a value that corresponds to a location within the reference genome that matches at least the portion of the first k-mer seed;
(d) determining, by the one or more computers, a number of mismatches for the first k-mer seed based on comparing genomic data of the genomic data read to genomic data of the reference genome;
(e) determining, by the one or more computers, whether the number of mismatches satisfies a threshold number of mismatches;
continuing to perform operations corresponding to (a)-(e) for an additional k-mer seed at each iteration, wherein the additional k-mer seed corresponds to a portion of the genomic data read different than the first k-mer seed, until a determination is made that the number of mismatches for a last additional k-mer seed satisfies a second mismatch threshold;
determining that the number of mismatches for the last additional k-mer seed satisfies the second mismatch threshold;
based on a determination that the number of mismatches satisfies the second mismatch threshold, terminating the continued performance of the operations corresponding to (a)-(e) prior to a subsequent pass; and
selecting an actual alignment for the genomic data read based on the last additional k-mer seed.
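
For illustration only, the flow recited in claim 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the seed length K, the table size N_CELLS, the FNV-1a hash, the 8-bit signature taken from the top hash bits, the linear probing, and the function names (kmer_hash, build_table, lookup, count_mismatches, map_read) are all hypothetical choices made for the example. The sketch also collapses the first and second mismatch thresholds of the claim into a single max_mismatches value.

```python
from typing import List, Optional, Tuple

K = 8          # seed length (illustrative assumption)
N_CELLS = 64   # number of data cells N (illustrative assumption)

Cell = Optional[Tuple[int, int]]  # (signature, reference position)


def kmer_hash(kmer: str) -> int:
    """32-bit FNV-1a hash; any hash function could stand in here."""
    h = 0x811C9DC5
    for byte in kmer.encode("ascii"):
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF
    return h


def build_table(reference: str) -> List[Cell]:
    """Fill N cells; each holds a signature part and a location part."""
    table: List[Cell] = [None] * N_CELLS
    for pos in range(len(reference) - K + 1):
        h = kmer_hash(reference[pos:pos + K])
        signature = h >> 24                 # top 8 bits of the hash
        for probe in range(N_CELLS):        # linear probing on collision
            slot = (h + probe) % N_CELLS
            if table[slot] is None:
                table[slot] = (signature, pos)
                break
    return table


def lookup(table: List[Cell], kmer: str) -> List[int]:
    """Return reference positions whose stored signature matches the seed."""
    h = kmer_hash(kmer)
    signature = h >> 24
    hits: List[int] = []
    for probe in range(N_CELLS):
        cell = table[(h + probe) % N_CELLS]
        if cell is None:                    # end of the probe chain
            break
        if cell[0] == signature:
            hits.append(cell[1])
    return hits


def count_mismatches(read: str, reference: str, start: int) -> int:
    """Compare the whole read with the reference at a candidate start."""
    window = reference[start:start + len(read)]
    return sum(a != b for a, b in zip(read, window))


def map_read(read: str, reference: str, table: List[Cell],
             max_mismatches: int) -> Optional[Tuple[int, int]]:
    """Try one seed per pass; stop as soon as a candidate is good enough."""
    for offset in range(0, len(read) - K + 1, K):      # next seed each pass
        for pos in lookup(table, read[offset:offset + K]):
            start = pos - offset
            if start < 0 or start + len(read) > len(reference):
                continue                               # read would overhang
            mismatches = count_mismatches(read, reference, start)
            if mismatches <= max_mismatches:
                return start, mismatches               # skip remaining passes
    return None


reference = "ACGTTGCAAGGCTTAACCGGATCGATTACAGGCTTAGCAT"
table = build_table(reference)
# reference[10:30] with one substitution inside the first seed
read = "GCATAACCGGATCGATTACA"
result = map_read(read, reference, table, max_mismatches=1)
```

In this example the first seed contains the substitution, so no candidate derived from it passes the mismatch check; the second seed locates the read at reference position 10 with one mismatch, and no further passes are run.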