US 12,331,364 B2
Method for diagnosing cancer and predicting type of cancer based on single nucleotide variant in cell-free DNA
JungKyoon Choi, Daejeon (KR); Gyuhee Kim, Daejeon (KR); Eun Hae Cho, Gyeonggi-do (KR); Chang-Seok Ki, Gyeonggi-do (KR); and Junnam Lee, Gyeonggi-do (KR)
Assigned to GC GENOME CORPORATION, Yongin-si (KR)
Filed by GC GENOME CORPORATION, Gyeonggi-do (KR)
Filed on Feb. 15, 2023, as Appl. No. 18/169,750.
Claims priority of application No. 10-2022-0072680 (KR), filed on Jun. 15, 2022.
Prior Publication US 2023/0407405 A1, Dec. 21, 2023
Int. Cl. C12Q 1/6886 (2018.01); G16B 20/20 (2019.01); G16B 30/10 (2019.01); G16B 40/00 (2019.01)

CPC C12Q 1/6886 (2013.01) [G16B 20/20 (2019.02); G16B 30/10 (2019.02); G16B 40/00 (2019.02); C12Q 2600/112 (2013.01); C12Q 2600/156 (2013.01)]

16 Claims

1. A method for measuring regional mutation density (RMD) and a frequency of mutation signature, the method comprising:

(a) extracting nucleic acids from a biological sample to obtain sequence read information, the sequence read information comprising reads of 5 to 5,000 bp, and comprising at least 5,000 reads, the extracted nucleic acids comprising cell free DNA (cfDNA);

(b) aligning the obtained sequence read information to a reference genome from a reference genome database to generate aligned sequence read information;

(c) extracting cancer-specific single nucleotide variants, wherein the extracting is performed by detecting single nucleotide variants in the aligned sequence read information and performing filtering based on the aligned sequence read information, the filtering comprising removing artifacts and germline mutations generated during the sequencing process;

(d) dividing the reference genome into predetermined chromosomal bins and calculating a RMD of extracted cancer-specific single nucleotide variants from step (c) in each bin;

(e) calculating a frequency of 150 mutation signatures of the extracted cancer-specific single nucleotide variants from step (c), the 150 mutation signature features comprising:

(i) 6 basic mutation signatures (C>A, C>G, C>T, T>A, T>C, and T>G);

(ii) 24 (4×6) mutation signatures for a base mutation in a 5′ direction;

(iii) 24 (6×4) mutation signatures for a base mutation in a 3′ direction;

(iv) 96 (4×6×4) mutation signatures for a base mutation in a 5′ direction and a base mutation in a 3′ direction;

(f) combining the RMD of extracted cancer-specific single nucleotide variants calculated in step (d) and the frequency of mutation signature of extracted cancer-specific single nucleotide variants calculated in step (e);

(g) inputting the combined RMD and frequency of mutation signature of step (f) into an artificial intelligence model trained to perform cancer diagnosis;

(h) determining an output value using the artificial intelligence model, wherein the artificial intelligence model transforms the combined RMD and frequency of mutation signature of step (f) into an output value, that is a probability value between 0 and 1; and

(i) determining whether cancer is present or not by comparing the output value with a reference value, that is a value between 0 and 1 capable of determining the presence of cancer when compared to the output value;

determining the presence of cancer when the output value exceeds the reference value,
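As an illustration of steps (d) through (f), the following is a minimal sketch of how the two feature groups might be built and concatenated. It is not the patented implementation. The function names, the input tuple layouts, the 1 Mb default bin size, the normalization by total variant count, and the folding of purine-reference variants onto the pyrimidine strand are all assumptions made for this example; the claim itself only fixes the 6 + 24 + 24 + 96 = 150 signature categories and the per-bin RMD.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"
SUBS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]  # the 6 basic signatures
COMP = {"A": "T", "C": "G", "G": "C", "T": "A"}


def signature_keys():
    """Return the 150 feature names: 6 + 24 (5') + 24 (3') + 96 (5' and 3')."""
    keys = list(SUBS)
    keys += [f"{p}[{s}]" for p, s in product(BASES, SUBS)]
    keys += [f"[{s}]{n}" for s, n in product(SUBS, BASES)]
    keys += [f"{p}[{s}]{n}" for p, s, n in product(BASES, SUBS, BASES)]
    return keys


def normalize(five, ref, alt, three):
    """Assumption: fold purine-reference variants onto the pyrimidine strand."""
    if ref in "CT":
        return five, ref, alt, three
    return COMP[three], COMP[ref], COMP[alt], COMP[five]


def signature_frequencies(variants):
    """variants: list of (5' base, ref, alt, 3' base) tuples (hypothetical layout)."""
    counts = Counter()
    for five, ref, alt, three in variants:
        five, ref, alt, three = normalize(five, ref, alt, three)
        s = f"{ref}>{alt}"
        counts[s] += 1
        counts[f"{five}[{s}]"] += 1
        counts[f"[{s}]{three}"] += 1
        counts[f"{five}[{s}]{three}"] += 1
    total = len(variants)
    # Assumption: frequency = count divided by the number of variants.
    return [counts[k] / total if total else 0.0 for k in signature_keys()]


def regional_mutation_density(positions, chrom_lengths, bin_size=1_000_000):
    """positions: list of (chrom, 0-based pos); returns one value per bin."""
    index, bins = {}, []
    for chrom, length in chrom_lengths.items():
        for b in range(-(-length // bin_size)):  # ceiling division
            index[(chrom, b)] = len(bins)
            bins.append(0)
    for chrom, pos in positions:
        bins[index[(chrom, pos // bin_size)]] += 1
    total = len(positions)
    # Assumption: density = fraction of all variants falling in the bin.
    return [c / total if total else 0.0 for c in bins]


def combined_features(positions, variants, chrom_lengths, bin_size=1_000_000):
    """Step (f): concatenate the RMD vector and the 150 signature frequencies."""
    rmd = regional_mutation_density(positions, chrom_lengths, bin_size)
    return rmd + signature_frequencies(variants)
```

The resulting vector has one entry per chromosomal bin followed by 150 signature entries, which is the shape of input that step (g) would pass to the model.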

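training the artificial intelligence model for cancer diagnosis using a binary cross-entropy loss function represented by Equation 1:

$$\text{Binary cross-entropy loss} = -\frac{1}{N}\sum_{i=1}^{N}\left[\, y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i) \,\right] \qquad \text{(Equation 1)}$$

where N is a total number of samples, ŷ_i is a probability value predicted by the model that an i-th input value is close to class 1, and y_i is an actual class of the i-th input value;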
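For reference, a binary cross-entropy loss over N samples can be sketched as below, assuming Equation 1 takes the standard form (the negative mean of y_i·log(ŷ_i) + (1 − y_i)·log(1 − ŷ_i)). The function name and the `eps` clipping constant are assumptions added for numerical safety; they are not part of the claim.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy over N samples.

    y_true: actual classes (0 or 1); y_pred: predicted probabilities of class 1.
    """
    n = len(y_true)
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0); an assumption
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / n
```

A prediction of 0.5 for every sample gives a loss of log 2, and the loss falls toward 0 as the predicted probabilities approach the actual classes.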

wherein the training comprises:

hyper-parameter tuning using Bayesian optimization;

inputting regional mutation density and mutation signature data divided into training, validation, and test datasets into the artificial intelligence model;

performing cancer detection on each of the test datasets using the artificial intelligence model, allowing each dataset to serve once as the test dataset; and

evaluating model performance using a prediction probability when the entire sample was the test dataset.
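The rotation described in the training steps, in which each dataset serves once as the test dataset and performance is evaluated on predictions for the entire sample, could look roughly like the sketch below. The fold count of 5, the choice of the next fold as the validation set, and the `fit` and `predict` callables are hypothetical placeholders; the claim does not specify them, and the Bayesian hyper-parameter search is assumed to happen inside `fit` using the validation data.

```python
def rotate_folds(n_samples, n_folds=5):
    """Yield (train, validation, test) index lists; each fold is the test set once.

    Assumption: n_folds >= 3 so that train, validation, and test are all non-empty.
    """
    folds = [list(range(k, n_samples, n_folds)) for k in range(n_folds)]
    for k in range(n_folds):
        v = (k + 1) % n_folds  # hypothetical choice of validation fold
        train = [i for j, fold in enumerate(folds) if j not in (k, v) for i in fold]
        yield train, folds[v], folds[k]


def out_of_fold_probabilities(features, labels, fit, predict, n_folds=5):
    """Collect one test-time probability per sample across all rotations."""
    probs = [None] * len(features)
    for train, val, test in rotate_folds(len(features), n_folds):
        model = fit(
            [features[i] for i in train], [labels[i] for i in train],
            [features[i] for i in val], [labels[i] for i in val],
        )
        for i in test:
            probs[i] = predict(model, features[i])
    return probs
```

Because every sample lands in exactly one test fold, the returned list holds a prediction probability for the entire sample set, each produced by a model that did not see that sample during training.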