US 11,929,150 B2
Methods and apparatuses for performing character matching for short read alignment
Meysam Roodi, Richmond Hill (CA); and Zahra Lak, Toronto (CA)
Assigned to HUAWEI TECHNOLOGIES CO., LTD., Shenzhen (CN)
Filed by Meysam Roodi, Richmond Hill (CA); and Zahra Lak, Toronto (CA)
Filed on Jan. 22, 2020, as Appl. No. 16/749,696.
Claims priority of provisional application 62/797,001, filed on Jan. 25, 2019.
Prior Publication US 2020/0388350 A1, Dec. 10, 2020
Int. Cl. G16B 30/10 (2019.01); G16B 20/20 (2019.01); G16B 25/10 (2019.01)
CPC G16B 30/10 (2019.02) [G16B 20/20 (2019.02); G16B 25/10 (2019.02)]
14 Claims
 
1. A method comprising:
performing a seed search of a short read (SR) against a reference sequence stored in a memory of a computing device using a Burrows Wheeler Transform (BWT) algorithm to determine at least one seed, wherein the reference sequence comprises a first sequence of base pairs (bps) and the first sequence of bps are stored sequentially in the memory, and the SR comprises a second sequence of bps, wherein the seed search is performed by:
generating a count table and an occurrence table based on the reference sequence;
storing the occurrence table in the memory; and
processing the SR base pair (bp) by base pair (bp-by-bp), wherein each bp is processed by updating a top pointer and updating a bottom pointer, the top and bottom pointers being pointers to respective indices in the occurrence table, the occurrence table stored in the memory being accessed for each update of the top pointer and update of the bottom pointer;
during performance of the seed search:
determining one or more seed candidates in the reference sequence, each seed candidate being a continuous portion of the first sequence of bps in the reference sequence matching a corresponding continuous portion of the second sequence of bps in the SR, up to a matching bp from the SR; and
determining a total number of seed candidates by computing the total number of seed candidates in the reference sequence based on the top pointer and the bottom pointer;
in response to a determination that a total number of seed candidates is greater than a threshold value, continuing the seed search using the BWT algorithm with a next bp from the SR following the matching bp;
in response to a determination that the total number of seed candidates is less than or equal to the threshold value:
stopping the seed search using the BWT algorithm;
obtaining one or more extended seed candidates corresponding to each respective one or more seed candidates, each extended seed candidate being obtained by retrieving, in a single access to the first sequence of bps of the reference sequence stored sequentially in the memory, sequential bps of the reference sequence such that each respective seed candidate is extended from the matching bp to a respective extended seed candidate equal in length to the SR;
for each extended seed candidate, performing bp-to-bp comparisons between a remaining bp sequence of the SR after the matching bp and a corresponding remaining bp sequence in the extended seed candidate, the bp-to-bp comparisons being performed absent any additional access to the reference sequence stored in the memory; and
when at least one extended seed candidate exactly matches the SR in the bp-to-bp comparisons, outputting the exactly matching extended seed candidate as the seed.
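
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation: the function names (build_index, align_exact), the default threshold of 2, and the use of a full suffix array to locate candidates are all hypothetical choices made for the example. It also uses the textbook backward search, which consumes the SR from its last bp toward its first, so the "remaining" bps compared directly are the unmatched prefix of the SR. The SR is assumed to contain only A, C, G and T.

```python
def build_index(ref):
    """Build a suffix array, count table and occurrence table for ref.

    Hypothetical helper: a production aligner would use a sampled
    occurrence table and a compressed suffix array instead.
    """
    text = ref + "$"  # '$' sorts before A, C, G, T
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = [text[i - 1] for i in sa]  # text[-1] is '$' when i == 0
    alphabet = sorted(set(text))

    # Count table: count[c] = number of characters in text smaller than c.
    count, total = {}, 0
    for c in alphabet:
        count[c] = total
        total += text.count(c)

    # Occurrence table: occ[c][i] = number of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return sa, count, occ


def align_exact(ref, read, threshold=2):
    """Return sorted start positions in ref where read matches exactly."""
    sa, count, occ = build_index(ref)
    top, bottom = 0, len(ref) + 1  # half-open row range [top, bottom)
    k = len(read)                  # read[k:] has been matched so far

    while k > 0:
        c = read[k - 1]
        if c not in count:
            return []
        # One occurrence-table lookup per pointer update.
        top = count[c] + occ[c][top]
        bottom = count[c] + occ[c][bottom]
        k -= 1
        # Total number of seed candidates follows from the two pointers.
        if bottom - top <= threshold:
            break  # stop the BWT search and switch to direct comparison

    hits = []
    for row in range(top, bottom):
        start = sa[row] - k
        if start < 0:
            continue  # candidate would extend past the start of ref
        # Single sequential fetch of a window as long as the read.
        window = ref[start:start + len(read)]
        # Compare only the bps that the BWT search did not cover.
        if window[:k] == read[:k]:
            hits.append(start)
    return sorted(hits)


if __name__ == "__main__":
    print(align_exact("ACGTACGTTACG", "CGTT"))
```

With the reference ACGTACGTTACG and the SR CGTT, the search finds three candidates after the first bp, one candidate after the second, and then stops; the single candidate is extended to four bps with one slice of the reference and the two remaining bps are compared directly. Raising the threshold shifts more of the work from occurrence-table lookups to direct comparison.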