US 11,869,661 B2
Systems and methods for determining whether a subject has a cancer condition using transfer learning
M. Cyrus Maher, San Mateo, CA (US)
Assigned to GRAIL, LLC, Menlo Park, CA (US)
Filed by GRAIL, LLC, Menlo Park, CA (US)
Filed on May 22, 2020, as Appl. No. 16/881,928.
Claims priority of provisional application 62/851,486, filed on May 22, 2019.
Prior Publication US 2020/0372296 A1, Nov. 26, 2020
Int. Cl. G16H 50/20 (2018.01); G16H 50/70 (2018.01); G16H 50/30 (2018.01); G06F 18/2115 (2023.01); C12Q 1/6886 (2018.01)
CPC G16H 50/20 (2018.01) [G06F 18/2115 (2023.01); G16H 50/30 (2018.01); G16H 50/70 (2018.01); C12Q 1/6886 (2013.01)]
21 Claims
 
1. A computer system for classifying a test subject to a first cancer condition in a cancer condition set, the cancer condition set comprising two or more cancer conditions, the computer system comprising:
at least one processor; and
a memory storing at least one program for execution by the at least one processor, the at least one program comprising instructions for:
obtaining test genotypic information comprising a corresponding test plurality of bin values, each respective bin value in the test plurality of bin values for a corresponding bin in a plurality of bins, wherein:
each bin in the plurality of bins represents a portion of a reference genome of a species,
the test plurality of bin values is obtained from a test biological sample of the test subject, using a corresponding test plurality of sequence reads determined by a first nucleic acid sequencing method,
the test plurality of sequence reads comprises at least 10,000 sequence reads, and
the plurality of bins comprises at least 100 bins;
applying the test plurality of bin values to a machine learning classifier, trained on a transformed second dataset obtained by transfer learning between a first dataset and a second dataset, to cause the machine learning classifier to classify the test subject to the first cancer condition in the cancer condition set, wherein:
the first dataset comprises, for each respective subject in a first plurality of training subjects, the first plurality of training subjects comprising at least fifty subjects, corresponding first genotypic information comprising (i) a corresponding first plurality of bin values, each respective bin value in the corresponding first plurality of bin values for a corresponding bin in the plurality of bins and (ii) an indication of a cancer condition of the respective subject in the cancer condition set, wherein the corresponding first plurality of bin values of each respective subject in the first plurality of subjects is obtained from a corresponding biological sample of the respective subject, which comprises a first tissue type, using a corresponding first plurality of sequence reads determined by a second nucleic acid sequencing method, and
the second dataset comprises, for each respective subject in a second plurality of subjects of the species, corresponding second genotypic information comprising (i) a corresponding second plurality of bin values, each respective bin value in the corresponding second plurality of bin values representing a corresponding bin in the plurality of bins and (ii) an indication of a cancer condition of the respective subject in the cancer condition set, and wherein the corresponding second plurality of bin values of each respective subject in the second plurality of subjects is obtained from a corresponding biological sample of the respective subject, which comprises a second tissue type, using a corresponding second plurality of sequence reads determined by a third nucleic acid sequencing method,
wherein at least the second nucleic acid sequencing method differs from the third nucleic acid sequencing method or the first tissue type differs from the second tissue type;
wherein the machine learning classifier was trained via:
obtaining a plurality of feature extraction functions associated with the first dataset by applying a feature extraction technique to the respective bin values of respective subjects in the first dataset, thereby identifying the plurality of feature extraction functions, wherein each feature extraction function in the plurality of feature extraction functions independently encodes a linear or nonlinear function of bin values of all or a subset of the plurality of bins, and the plurality of feature extraction functions collectively discriminates respective subjects in the first plurality of subjects as having a cancer condition within the cancer condition set based on respective bin values for the respective subjects;
applying each respective feature extraction function in the plurality of feature extraction functions against the respective second plurality of bin values of each corresponding subject in the second plurality of subjects;
generating, based on the applying, a transformed second dataset comprising a respective plurality of feature values for each corresponding subject in the second plurality of subjects; and
training, by utilizing the respective plurality of feature values in the transformed second dataset in conjunction with the indication of the cancer condition for each of the second plurality of subjects, the machine learning classifier.
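The training steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation, and every name and number in it is hypothetical. It assumes a deliberately simple feature extraction technique (one linear "centroid contrast" function per cancer condition, learned from the first dataset) and a nearest-centroid classifier; the claim itself names neither. The bin values are invented toy data with four bins and a handful of subjects, far below the claimed minimums of 100 bins and fifty training subjects.

```python
from collections import defaultdict

def mean_vector(rows):
    """Column-wise mean of a list of equal-length numeric rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def learn_feature_functions(first_X, first_y):
    """Learn one linear function of bin values per condition from the first dataset.

    Assumption: weights are (condition centroid - overall mean). This stands in
    for whatever feature extraction technique is actually used.
    """
    overall = mean_vector(first_X)
    by_label = defaultdict(list)
    for x, label in zip(first_X, first_y):
        by_label[label].append(x)
    functions = []
    for label in sorted(by_label):
        centroid = mean_vector(by_label[label])
        weights = [c - m for c, m in zip(centroid, overall)]
        # Default arguments bind this condition's weights to the closure.
        def feature(x, w=weights, m=overall):
            return sum(wi * (xi - mi) for wi, xi, mi in zip(w, x, m))
        functions.append(feature)
    return functions

def transform(functions, X):
    """Apply every feature function to every subject's bin values."""
    return [[f(x) for f in functions] for x in X]

def train_nearest_centroid(features, labels):
    """Train a toy classifier: one feature-space centroid per condition."""
    by_label = defaultdict(list)
    for row, label in zip(features, labels):
        by_label[label].append(row)
    return {label: mean_vector(rows) for label, rows in by_label.items()}

def classify(model, functions, bin_values):
    """Map test bin values to features, then pick the nearest condition centroid."""
    feats = [f(bin_values) for f in functions]
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(feats, centroid))
    return min(model, key=lambda label: sq_dist(model[label]))

# Hypothetical first dataset (e.g. one tissue type / sequencing method): strong signal.
first_X = [[5.0, 1.0, 1.0, 1.0], [6.0, 1.0, 0.0, 1.0],
           [1.0, 1.0, 5.0, 1.0], [0.0, 1.0, 6.0, 1.0]]
first_y = ["cancer_A", "cancer_A", "cancer_B", "cancer_B"]

# Hypothetical second dataset (different tissue type or method): weaker signal.
second_X = [[2.0, 1.0, 1.5, 1.0], [2.2, 1.1, 1.4, 0.9],
            [1.5, 1.0, 2.0, 1.0], [1.4, 0.9, 2.3, 1.1]]
second_y = ["cancer_A", "cancer_A", "cancer_B", "cancer_B"]

functions = learn_feature_functions(first_X, first_y)      # from the first dataset
transformed_second = transform(functions, second_X)        # transformed second dataset
model = train_nearest_centroid(transformed_second, second_y)

prediction_1 = classify(model, functions, [2.1, 1.0, 1.5, 1.0])
prediction_2 = classify(model, functions, [1.3, 1.0, 2.1, 1.0])
```

In this sketch the feature functions are fitted only on the first dataset, while the classifier sees only the feature values computed for the second dataset together with its cancer condition labels. A test subject's bin values pass through the same feature functions before classification.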