US 12,340,874 B1
Artificial neural network models for prediction of de novo sequencing of chains of amino acids
Krishnan Palaniappan, San Francisco, CA (US); Peter Cimermancic, Mountain View, CA (US); and Roie Levy, San Francisco, CA (US)
Assigned to Verily Life Sciences LLC, Dallas, TX (US)
Filed by Verily Life Sciences LLC, South San Francisco, CA (US)
Filed on Oct. 25, 2018, as Appl. No. 16/170,158.
Claims priority of provisional application 62/581,276, filed on Nov. 3, 2017.
Int. Cl. G16B 40/00 (2019.01); G06N 3/044 (2023.01); G06N 3/084 (2023.01); G06N 3/088 (2023.01); G16B 30/00 (2019.01); G16B 40/10 (2019.01)
CPC G16B 40/00 (2019.02) [G06N 3/044 (2023.01); G06N 3/084 (2013.01); G06N 3/088 (2013.01); G16B 30/00 (2019.02); G16B 40/10 (2019.02)] 15 Claims
1. A method for identifying an unknown peptide or protein in a sample, the method comprising:
obtaining the sample comprising a mixture of proteins and molecules;
preprocessing the mixture of proteins and molecules to isolate the unknown peptide or protein from the mixture of proteins and molecules;
analyzing, using a mass spectrometer, the isolated peptide or protein to obtain a mass spectrum comprising a two-dimensional data set of mass and intensity values for each fragment ion from the isolated peptide or protein, wherein the analyzing comprises determining the mass and intensity values for each fragment ion from an m/z ratio and abundance detected by the mass spectrometer;
generating, by a computing device, a digital representation of the mass spectrum, the digital representation including a plurality of container elements, and each container element of the plurality of container elements is an object that stores other objects including a one-hot vector, a set of m/z values and respective abundances, or a combination thereof that uniquely identifies each ion fragment within the mass spectrum;
inputting, by the computing device, the digital representation of the mass spectrum into an encoder-decoder network comprising an encoder portion and a decoder portion, wherein:
the encoder portion comprises a first bidirectional recurrent neural network of a first set of long short-term memory cells and gated recurrent unit cells that are trained to map a variable-length source sequence defined by each container element to a fixed-dimensional vector representation,
the first bidirectional recurrent neural network is configured to process the variable-length source sequence from start to end and then process the variable-length source sequence from end to start,
the decoder portion comprises a second bidirectional recurrent neural network of a second set of long short-term memory cells and gated recurrent unit cells that are trained to map the fixed-dimensional vector representation back to a variable-length amino acid sequence,
the second bidirectional recurrent neural network is configured to process the fixed-dimensional vector representation from start to end and then process the fixed-dimensional vector representation from end to start, and
the encoder-decoder network is initially trained on a first set of training data comprising standard amino acid residues, and then fine-tuned on a second set of training data comprising post-translational modifications of peptide or protein chains;
encoding, by the computing device and using the encoder portion, each container element as the fixed-dimensional vector representation, which is an m-dimensional or one-dimensional vector of m elements, wherein m corresponds to a number of different m/z values and abundances for n-most abundant ions;
appending, by the computing device, a set of metadata features to each container element or the fixed-dimensional vector representation for each container element, wherein the set of metadata features includes: ionization method, mass spectrometer type, fragmentation method, fragmentation energy, peptide charge state, peptide mass-over-charge ratio, peptide retention time, or a combination thereof;
generating, by the computing device using an attention model, a context vector for each of the fixed-dimensional vector representations based on intermediate outputs from the encoder portion from each step of encoding each of the fixed-dimensional vector representations, wherein the context vector is a weighted arithmetic mean of argument values, and weights are chosen according to relevance of each argument value given a context;
decoding, by the computing device and using the decoder portion, each of the fixed-dimensional vector representations into a variable-length amino acid sequence based on the set of metadata features and the context vector, wherein the variable-length amino acid sequence is represented as a multi-dimensional data set of amino acid types and a probability of each amino acid type in each position of a sequence; and
identifying, by the computing device and based on the variable-length amino acid sequence, the unknown peptide or protein in order to characterize aspects including a complete sequence, structure, and possible function of the unknown peptide or protein, wherein the identifying comprises:
combining the variable-length amino acid sequences into a complete sequence of amino acids representing the unknown peptide or protein,
comparing the complete sequence of amino acids to a reference proteome, and
identifying a known peptide or protein based on the comparison.
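
The data-handling steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation, and it does not include the trained encoder-decoder network. All function names are hypothetical. The sketch assumes that the vector length m equals 2n (one m/z value and one abundance for each of the n most abundant ions), that the attention weights come from a softmax over dot-product scores (the claim only requires weights chosen according to relevance), and that comparison to a reference proteome is a plain substring search.

```python
import math


def top_n_peaks(peaks, n):
    """Keep the n most abundant (m/z, abundance) pairs, returned in m/z order."""
    strongest = sorted(peaks, key=lambda p: p[1], reverse=True)[:n]
    return sorted(strongest)


def to_fixed_vector(peaks, n, metadata=()):
    """Flatten the n most abundant peaks into 2*n values (zero-padded),
    then append numeric metadata features such as charge state.
    Assumption: m = 2*n; the claim does not fix this layout."""
    flat = [value for peak in top_n_peaks(peaks, n) for value in peak]
    flat += [0.0] * (2 * n - len(flat))
    return flat + list(metadata)


def attention_context(query, encoder_states):
    """Return (context, weights), where context is the weighted arithmetic
    mean of the encoder states. Assumption: relevance is scored by a dot
    product with the query and normalized with a softmax."""
    scores = [sum(q * h for q, h in zip(query, state)) for state in encoder_states]
    highest = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(encoder_states[0])
    context = [
        sum(w * state[i] for w, state in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return context, weights


def match_reference(sequence, proteome):
    """Return the names of reference proteins that contain the sequence.
    Assumption: exact substring matching stands in for the comparison step."""
    return [name for name, protein in proteome.items() if sequence in protein]
```

For example, to_fixed_vector([(100.0, 5.0), (200.0, 50.0), (150.0, 20.0)], 2, metadata=(2.0,)) keeps the two most abundant peaks and appends a charge state of 2, and match_reference("PEPT", {"P1": "AAPEPTIDE"}) reports the hypothetical entry "P1". Because the softmax weights are non-negative and sum to one, the context vector is a weighted arithmetic mean as recited in the claim.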