CPC G06N 7/01 (2023.01) [G06N 20/00 (2019.01); G16B 10/00 (2019.02); G16B 40/20 (2019.02); G16B 40/00 (2019.02)] | 20 Claims |
1. A computer-implemented method comprising:
accessing an input sample genetic dataset of an individual;
dividing the input sample genetic dataset into a plurality of windows, each window comprising a set of a plurality of single nucleotide polymorphisms (SNPs);
generating, using the divided input sample genetic dataset, an inter-window hidden Markov model (HMM), wherein the inter-window HMM comprises:
(i) for each window, a set of nodes representing the window, each node in the set corresponding to a pair of labels and associated with an emission probability, each label in the pair representing an ethnicity label for the plurality of SNPs included in the window;
(ii) a plurality of edges, each edge connecting a first node of a first set of nodes representing a first window to a second node of a second set of nodes representing a second window, each edge representing a transition from the first node to the second node;
and wherein the inter-window HMM is trained by:
receiving haplotype data corresponding to sequences of alleles of individuals;
building per-window models for the plurality of windows;
receiving a set of reference panel samples; and
training the per-window models using the set of reference panel samples to generate the emission probability for each node of each window in the inter-window HMM; and
assigning one or more ethnicity labels to the input sample genetic dataset using the inter-window HMM.
|
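The structure recited in claim 1 can be illustrated with a minimal Python sketch. It is not the claimed implementation. The label names, the window size, the toy emission values, and the `switch_penalty` edge score are hypothetical placeholders. Viterbi decoding is assumed here as one way to read labels off the inter-window HMM, since the claim does not name a decoding algorithm. Emission scores are taken as given inputs; in the claim they come from per-window models trained on reference panel samples.

```python
import itertools
import math
from collections import Counter

LABELS = ["A", "B", "C"]  # hypothetical ethnicity labels
# One node per unordered pair of labels; each window gets the same set of nodes.
STATES = list(itertools.combinations_with_replacement(LABELS, 2))


def split_into_windows(snps, window_size):
    """Divide a sequence of SNPs into consecutive windows of at most window_size SNPs."""
    return [snps[i:i + window_size] for i in range(0, len(snps), window_size)]


def transition_log_score(prev, curr, switch_penalty=2.0):
    """Hypothetical edge score (unnormalized): each label that changes costs switch_penalty."""
    shared = sum((Counter(prev) & Counter(curr)).values())
    return -switch_penalty * (2 - shared)


def viterbi(emissions, switch_penalty=2.0):
    """Return the best-scoring label pair per window.

    emissions: one dict per window, mapping each state to a log emission score.
    """
    best = {s: emissions[0][s] for s in STATES}
    back = []  # back[i][curr] = best predecessor of curr in window i + 1
    for em in emissions[1:]:
        new_best, pointers = {}, {}
        for curr in STATES:
            prev = max(
                STATES,
                key=lambda p: best[p] + transition_log_score(p, curr, switch_penalty),
            )
            new_best[curr] = (
                best[prev] + transition_log_score(prev, curr, switch_penalty) + em[curr]
            )
            pointers[curr] = prev
        best = new_best
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


def label_proportions(path):
    """One hypothetical way to summarize a decoded path as per-label fractions."""
    counts = Counter(label for pair in path for label in pair)
    total = 2 * len(path)
    return {label: counts[label] / total for label in counts}


def toy_emissions(favoured):
    """Toy stand-in for a trained per-window model: one state scores high, the rest low."""
    return {s: math.log(0.9 if s == favoured else 0.02) for s in STATES}


windows = split_into_windows(list(range(10)), 4)
emissions = [toy_emissions(s) for s in [("A", "A"), ("A", "A"), ("A", "B"), ("A", "B")]]
path = viterbi(emissions)
print(len(windows), path, label_proportions(path))
```

Because the edge score penalizes label changes between adjacent windows, a single window whose emission favours a different pair can be overruled by its neighbours. That smoothing is the practical reason for adding an inter-window model on top of per-window scores.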