US 12,380,966 B2
AA2CDS: pre-trained amino acid-to-codon sequence mapping enabling efficient expression and yield optimization
Hazem Essam, Alexandria (EG); Hosna Eltarras, Cairo (EG); Wafaa Ashraf Salah-Eldin, New Cairo (EG); Mohamed Soudy, Cairo (EG); Ahmed Saleh, New Cairo (EG); Walid Moustafa, Cairo (EG); Ahmed Eid Zoheir, Beni Suef (EG); and Mohamed Ashraf Elkerdawy, New Cairo (EG)
Assigned to Proteinea, Inc., Newton, MA (US)
Filed by Proteinea, Inc., Newton, MA (US)
Filed on Jul. 10, 2024, as Appl. No. 18/769,001.
Claims priority of provisional application 63/616,896, filed on Jan. 2, 2024.
Claims priority of provisional application 63/525,735, filed on Jul. 10, 2023.
Prior Publication US 2025/0022540 A1, Jan. 16, 2025
Int. Cl. G06N 3/123 (2023.01); G16B 40/20 (2019.01)
CPC G16B 40/20 (2019.02) [G06N 3/123 (2013.01)] 18 Claims
OG exemplary drawing
 
1. A system, comprising:
memory storing an inference protein sequence that requires translation from a protein input space to a codon output space;
a protein embedder configured to process the inference protein sequence through protein embedding coefficients to generate an inference protein embedding, wherein the protein embedding coefficients are trained to encode the inference protein sequence in a higher-dimensional protein latent space,
wherein each instance of the inference protein embedding is M-dimensional,
wherein each instance of the inference protein sequence is N-dimensional, and
wherein M is greater than N;
a protein-to-codon translator configured with translation coefficients trained using training protein and codon embedding pairs that are higher-dimensional representations of corresponding training protein and codon pairs,
wherein the protein-to-codon translator is an encoder-decoder neural network configured with an encoder neural network and a decoder neural network, and wherein the protein-to-codon translator provides an embedding space that generates the higher-dimensional representations,
wherein each of the training protein and codon embedding pairs is M-dimensional,
wherein each of the training protein and codon pairs is N-dimensional, and
wherein M is greater than N;
an inference logic configured to process the inference protein embedding through the translation coefficients by sampling the embedding space using the protein-to-codon translator to generate an inference codon embedding; and
a reverse mapping logic configured to process the inference codon embedding through the translation coefficients by sampling the embedding space using the protein-to-codon translator to generate an inference codon sequence, wherein the inference codon sequence is a translation of the inference protein sequence in the codon output space,
wherein each instance of the inference codon embedding is M-dimensional,
wherein each instance of the inference codon sequence is N-dimensional, and
wherein M is greater than N.
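The dataflow recited in claim 1 (protein sequence → protein embedding → encoder-decoder translation → codon embedding → codon sequence) can be sketched as a toy pipeline. This is an illustrative assumption, not the patented AA2CDS implementation: the weights are random stand-ins for the claim's "trained coefficients," the genetic-code fragment covers only three amino acids, and all function names (`embed_protein`, `translate`, `reverse_map`) are hypothetical. Only the dimensional relationship (latent dimension M greater than token dimension N) and the stage-by-stage structure mirror the claim language.

```python
# Toy sketch of the claimed pipeline: protein input space -> higher-dimensional
# latent space -> codon output space. Random weights stand in for the claim's
# trained protein embedding, codon embedding, and translation coefficients.
import random

random.seed(0)

# Fragment of the standard genetic code (codon output space): Met has one
# codon; Lys and Val have synonymous alternatives.
SYNONYMOUS = {"M": ["ATG"], "K": ["AAA", "AAG"], "V": ["GTT", "GTC", "GTA", "GTG"]}
AAS = sorted(SYNONYMOUS)                                  # protein input space
CODONS = sorted(c for cs in SYNONYMOUS.values() for c in cs)

N = len(AAS)   # per-token input dimensionality (20 for the full amino-acid alphabet)
M = 16         # latent dimensionality; M > N, as each "wherein" clause requires


def randvec(m):
    """Random stand-in for a trained M-dimensional coefficient vector."""
    return [random.gauss(0.0, 1.0) for _ in range(m)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# "Trained" coefficients: an embedding row per amino acid (equivalent to a
# one-hot N-vector times an N x M matrix), a codon embedding table, and an
# M x M linear map standing in for the encoder-decoder translator.
W_protein = {a: randvec(M) for a in AAS}
E_codon = {c: randvec(M) for c in CODONS}
W_translate = [randvec(M) for _ in range(M)]


def embed_protein(seq):
    """Protein embedder: each N-dim residue token becomes an M-dim vector."""
    return [W_protein[a] for a in seq]


def translate(protein_emb):
    """Inference logic: protein embedding -> inference codon embedding."""
    return [[dot(row, v) for row in W_translate] for v in protein_emb]


def reverse_map(codon_emb, seq):
    """Reverse mapping logic: per position, pick the synonymous codon whose
    embedding best matches the decoded vector, so the output codon sequence
    always back-translates to the input protein."""
    return [max(SYNONYMOUS[a], key=lambda c: dot(v, E_codon[c]))
            for v, a in zip(codon_emb, seq)]


protein = "MKV"
codons = reverse_map(translate(embed_protein(protein)), protein)
print(codons)  # 'ATG' is forced (sole Met codon); the rest depend on the weights
```

Constraining the reverse mapping to synonymous codons is a design choice assumed here for the sketch: it guarantees a biologically valid translation regardless of the random weights, leaving the model free only to choose among codons encoding the same residue, which is where expression and yield optimization would occur.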