| CPC G16B 30/10 (2019.02) [G16B 20/00 (2019.02); G16B 20/20 (2019.02); G16B 30/00 (2019.02); G16B 40/20 (2019.02)] | 20 Claims |
|
1. A method for software-accelerated mapping of a genomic data read to a reference genome, the method comprising:
(a) obtaining, by one or more computers, a first k-mer seed from the genomic data read;
(b) generating, by the one or more computers, a hash value representing a genomic signature by applying a hash function to the first k-mer seed;
(c) determining, by the one or more computers, a reference sequence location that matches at least a portion of the first k-mer seed using a hash data structure, wherein the hash data structure comprises N data cells and wherein a first data cell of the N data cells includes (i) a first portion storing a predetermined genomic signature derived from the portion of the first k-mer seed and (ii) a second portion storing a value that corresponds to a location within the reference genome that matches at least the portion of the first k-mer seed;
(d) determining, by the one or more computers, a number of mismatches for the first k-mer seed based on comparing genomic data of the genomic data read to genomic data of the reference genome;
(e) determining, by the one or more computers, whether the number of mismatches satisfies a threshold number of mismatches;
continuing to perform operations corresponding to (a)-(e) for an additional k-mer seed at each iteration, wherein the additional k-mer seed corresponds to a portion of the genomic data read different than the first k-mer seed, until a determination is made that the number of mismatches for a last additional k-mer seed satisfies a second mismatch threshold;
determining that the number of mismatches for the last additional k-mer seed satisfies the second mismatch threshold;
based on a determination that the number of mismatches satisfies the second mismatch threshold, terminating the continued performance of the operations corresponding to (a)-(e) prior to a subsequent pass; and
selecting an actual alignment for the genomic data read based on the last additional k-mer seed.
|
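The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the claimed implementation, and the following are assumptions made only for illustration: the seed length `K`, the cell count `N_CELLS`, the signature width `SIG_BITS`, the use of BLAKE2b as the hash function, the derivation of the signature from the hash bits not used for the cell index, the storage of several (signature, location) pairs per cell, and the reading of "satisfies a threshold" as "is at most the threshold". The names `split_hash`, `build_index`, `count_mismatches`, `map_read`, `accept_threshold`, and `stop_threshold` are hypothetical.

```python
import hashlib

K = 4           # seed length k (illustrative value)
N_CELLS = 64    # number of data cells N (illustrative value)
SIG_BITS = 16   # width of the stored signature (illustrative value)


def split_hash(kmer):
    """Step (b): hash a k-mer seed, then split the hash into (cell index, signature)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    value = int.from_bytes(digest, "big")
    return value % N_CELLS, (value // N_CELLS) % (1 << SIG_BITS)


def build_index(reference):
    """Hash data structure with N cells; each entry pairs a signature with a
    reference location. Keeping a list per cell is a simplification."""
    cells = [[] for _ in range(N_CELLS)]
    for location in range(len(reference) - K + 1):
        index, signature = split_hash(reference[location:location + K])
        cells[index].append((signature, location))
    return cells


def count_mismatches(read, reference, ref_start):
    """Step (d): count differing bases between the read and the reference window."""
    if ref_start < 0 or ref_start + len(read) > len(reference):
        return None  # candidate alignment falls off the reference
    window = reference[ref_start:ref_start + len(read)]
    return sum(a != b for a, b in zip(read, window))


def map_read(read, reference, cells, accept_threshold=3, stop_threshold=0):
    """Iterate steps (a)-(e) over successive seeds; stop early once a candidate
    satisfies the second (stricter) mismatch threshold."""
    best = None  # (reference start, mismatches) of the best accepted candidate
    for offset in range(len(read) - K + 1):          # (a) next seed of the read
        index, signature = split_hash(read[offset:offset + K])   # (b)
        for cell_signature, location in cells[index]:            # (c) lookup
            if cell_signature != signature:
                continue
            ref_start = location - offset
            mismatches = count_mismatches(read, reference, ref_start)  # (d)
            if mismatches is None or mismatches > accept_threshold:    # (e)
                continue
            if best is None or mismatches < best[1]:
                best = (ref_start, mismatches)
            if mismatches <= stop_threshold:
                # Second threshold satisfied: terminate before the next pass
                # and select the alignment found from this seed.
                return (ref_start, mismatches)
    return best  # no early termination: fall back to the best accepted candidate
```

Under these assumptions, indexing the toy reference `"ACGTTGCAAGGCTTACCGATAGGCTAACGGT"` with `build_index` and mapping the read `reference[10:22]` with `map_read` returns `(10, 0)` from the first seed, so no later seed is examined. Every candidate location is verified against the reference, so a signature collision can cost extra comparisons but cannot produce a wrong alignment in this sketch.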