CPC G16H 50/20 (2018.01) [G06N 5/022 (2013.01); G16H 10/60 (2018.01); G16H 50/70 (2018.01); G16H 70/60 (2018.01); G06F 40/279 (2020.01)] | 20 Claims |
1. A computer-implemented method of generating a training dataset for training a machine learning model to identify individuals with a rare disease, the method comprising:
generating a respective first embedding vector for each of a plurality of terms associated with the rare disease, wherein the plurality of terms are obtained by using natural language processing on medical literature associated with the rare disease;
receiving an initial dataset comprising respective medical data associated with a plurality of individuals with the rare disease, the respective medical data for each individual comprising data indicative of a plurality of features of the rare disease experienced by the individual;
combining the initial dataset with a control dataset comprising respective medical data associated with a plurality of individuals without the rare disease to generate a combined dataset; and
generating, for each individual in the combined dataset, a second embedding vector that represents the individual based on (i) features associated with the individual and (ii) the first embedding vectors for the plurality of terms associated with the rare disease, wherein the second embedding vectors form the training dataset, wherein generating the second embedding vector representing an individual comprises:
identifying one or more first embedding vectors that correspond to particular terms associated with features of the rare disease experienced by the individual;
averaging the identified first embedding vectors to generate an average embedding vector for the individual; and
modifying the average embedding vector representing the individual using an embedding vector representing a name of the rare disease to generate the second embedding vector, the modifying comprising subtracting the embedding vector representing a name of the rare disease from the average embedding vector representing the individual; and
training the machine learning model on the training dataset comprising the second embedding vectors using a machine learning training technique to determine trained values of a set of model parameters of the machine learning model.
|
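For illustration only, the following is a minimal sketch of how the second embedding vector recited in claim 1 might be computed: identify the first embedding vectors matching an individual's features, average them, and subtract the embedding vector for the disease name. The function name `individual_embedding`, the toy three-dimensional vectors, the example terms, and the zero-vector fallback for an individual with no matching terms are all assumptions made for this sketch and are not taken from the claim.

```python
import numpy as np


def individual_embedding(feature_terms, term_vectors, disease_name_vector):
    """Hypothetical helper: build one individual's second embedding vector.

    feature_terms: terms describing features recorded for the individual.
    term_vectors: dict mapping each disease-associated term to its first
        embedding vector.
    disease_name_vector: embedding vector representing the disease name.
    """
    # Identify the first embedding vectors that correspond to the
    # individual's features (terms without a vector are skipped).
    matched = [term_vectors[t] for t in feature_terms if t in term_vectors]

    if matched:
        # Average the identified first embedding vectors.
        average = np.mean(matched, axis=0)
    else:
        # Assumption: the claim does not say what happens when no term
        # matches, so this sketch falls back to a zero vector.
        average = np.zeros_like(disease_name_vector, dtype=float)

    # Modify the average by subtracting the disease-name embedding vector.
    return average - disease_name_vector


# Made-up first embedding vectors for two terms (illustrative values only).
term_vectors = {
    "muscle weakness": np.array([1.0, 0.0, 2.0]),
    "fatigue": np.array([3.0, 2.0, 0.0]),
}
disease_name_vector = np.array([1.0, 1.0, 1.0])

# Combined dataset: one individual with the disease, one control.
combined = {
    "patient_a": ["muscle weakness", "fatigue"],
    "control_b": ["fatigue", "headache"],
}

# The second embedding vectors together form the training dataset.
training_dataset = {
    person: individual_embedding(terms, term_vectors, disease_name_vector)
    for person, terms in combined.items()
}
```

In this sketch the resulting vectors would then be paired with disease or control labels and passed to whatever training technique is chosen; the claim does not restrict the model type, so that step is left out here.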