US 12,374,426 B2
Predicting mRNA properties using large language transformer models
Ziv Bar-Joseph, Cambridge, MA (US); Sven Frederik Jager, Frankfurt am Main (DE); Sizhen Li, Cambridge, MA (US); and Saeed Moayedpour, Cambridge, MA (US)
Assigned to Sanofi, Paris (FR)
Filed by Sanofi, Paris (FR)
Filed on Jul. 26, 2024, as Appl. No. 18/785,864.
Claims priority of provisional application 63/648,338, filed on May 16, 2024.
Claims priority of provisional application 63/516,226, filed on Jul. 28, 2023.
Claims priority of application No. 24305758 (EP), filed on May 16, 2024.
Prior Publication US 2025/0037795 A1, Jan. 30, 2025
Int. Cl. G16B 30/00 (2019.01); G16B 20/00 (2019.01); G16B 40/20 (2019.01)

CPC G16B 30/00 (2019.02) [G16B 20/00 (2019.02); G16B 40/20 (2019.02)]

29 Claims

1. A computer-implemented method for predicting properties of an mRNA molecule, the method comprising:

obtaining data representing (i) a codon sequence of a coding sequence (CDS) of the mRNA molecule and (ii) a respective nucleotide sequence of each of one or more non-coding regions of the mRNA molecule;

generating an input token vector by numerically encoding the codon sequence;

generating an embedded feature vector for the CDS of the mRNA molecule by processing the input token vector using an embedding neural network, wherein the embedding neural network has a first set of model parameters that have been updated using a first training process of a first neural network, wherein the first training process is performed based on a dataset specifying known codon sequences of mRNA molecules, and the first neural network is configured to perform one or more pre-training tasks;

generating a joint embedding by combining the embedded feature vector generated for the CDS of the mRNA molecule with one or more embeddings generated from the nucleotide sequences of the one or more non-coding regions of the mRNA molecule;

processing the joint embedding using a property-prediction machine-learning model to generate an output that predicts one or more properties of the mRNA molecule, wherein the property-prediction machine-learning model has a second set of model parameters that have been updated using a second training process of a machine-learning model, wherein the second training process is based on a plurality of training examples, each training example comprising (i) a respective training input specifying a representation of a respective mRNA molecule and (ii) a respective label specifying one or more properties of the respective mRNA molecule; and

providing the one or more predicted properties of the mRNA molecule for physically generating the mRNA molecule.
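
The data flow recited in claim 1 (codon tokenization, CDS embedding, joint embedding with the non-coding regions, property prediction) can be illustrated with a minimal Python sketch. Everything named here is a hypothetical stand-in and not the patented implementation: the 64-codon vocabulary ordering, the random lookup table used in place of the pre-trained embedding neural network, the nucleotide-frequency embedding for non-coding regions, concatenation as the way of combining embeddings, and the linear property head are all assumptions chosen only to make the steps concrete.

```python
import itertools
import random

# Hypothetical vocabulary: the 64 RNA codons mapped to integer ids 0..63.
NUCLEOTIDES = "ACGU"
CODON_TO_ID = {
    "".join(c): i for i, c in enumerate(itertools.product(NUCLEOTIDES, repeat=3))
}


def encode_codons(cds):
    """Split a CDS into codons and map each one to its integer token id."""
    if len(cds) % 3 != 0:
        raise ValueError("CDS length must be a multiple of 3")
    return [CODON_TO_ID[cds[i:i + 3]] for i in range(0, len(cds), 3)]


EMBED_DIM = 8
_rng = random.Random(0)
# Stand-in for pre-trained parameters: a random lookup table, not a transformer.
CODON_TABLE = [
    [_rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in range(len(CODON_TO_ID))
]


def embed_cds(token_ids):
    """Mean-pool per-codon vectors into one fixed-size CDS embedding."""
    n = len(token_ids)
    return [sum(CODON_TABLE[t][d] for t in token_ids) / n for d in range(EMBED_DIM)]


def embed_noncoding(seq):
    """Toy embedding of a non-coding region: its nucleotide frequencies."""
    return [seq.count(nt) / len(seq) for nt in NUCLEOTIDES]


def joint_embedding(cds, noncoding_regions):
    """Concatenate the CDS embedding with one embedding per non-coding region."""
    joint = embed_cds(encode_codons(cds))
    for region in noncoding_regions:
        joint.extend(embed_noncoding(region))
    return joint


def predict_property(joint, weights, bias=0.0):
    """Placeholder property head: a linear model over the joint embedding."""
    return sum(w * x for w, x in zip(weights, joint)) + bias


if __name__ == "__main__":
    # Hypothetical inputs: a short CDS plus two non-coding regions.
    joint = joint_embedding("AUGGCCUAA", ["GGGAAA", "UUUU"])
    weights = [0.1] * len(joint)  # untrained, illustrative weights
    print(len(joint), predict_property(joint, weights))
```

In this sketch the joint embedding has a fixed length (8 values for the CDS plus 4 per non-coding region), which is what lets a single downstream model consume it. A trained system would replace the lookup table with the pre-trained embedding network and fit the property head on labeled mRNA examples, as the claim describes.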