| CPC G16C 20/50 (2019.02) [G16C 20/70 (2019.02); G16C 20/90 (2019.02)] | 8 Claims |

|
1. A drug design method based on an autoregressive model, comprising:
acquiring a ligand data set and a protein ligand data set, training protein information in the protein ligand data set by using a sub-word tokenization algorithm to obtain a protein tokenizer, and training ligand information in the ligand data set by using the sub-word tokenization algorithm to obtain a ligand tokenizer; and constructing a tokenizer of the autoregressive model by using the protein tokenizer and the ligand tokenizer;
processing and transforming original data in the ligand data set and the protein ligand data set into a text form suitable for the autoregressive model to obtain text data, and encoding the text data by the tokenizer of the autoregressive model to construct a training data set required by the autoregressive model;
training the autoregressive model by using the training data set to obtain a trained autoregressive model, so that the autoregressive model is configured to understand and learn an interaction mode between a protein and a ligand;
generating predicted ligand data by using the trained autoregressive model, and post-processing the predicted ligand data through a chemical or biological information tool to obtain candidate ligands each with a chemical structure; and
evaluating and optimizing the candidate ligands, specifically comprising: performing structural optimization and activity prediction on the candidate ligands by using the chemical or biological information tool, to determine target candidate molecules;
wherein the ligand tokenizer is obtained by training a byte pair encoding (BPE) algorithm, as the sub-word tokenization algorithm, on the simplified molecular input line entry system (SMILES) representations of the ligands in a data set containing protein ligand pair information; the ligand tokenizer comprises a first vocabulary file vocab.json and a first merge-operation file merges.txt for the ligand, and a size of the vocabulary in the first vocab.json is 3560;
wherein a size of a vocabulary of the protein is set to 50000, and the protein tokenizer is obtained by training the BPE algorithm on the amino acid sequences of the proteins in the data set containing protein ligand pair information; the protein tokenizer comprises a second vocabulary file vocab.json and a second merge-operation file merges.txt for the protein; and
when constructing the tokenizer of the autoregressive model, a merging operation is performed on the vocabulary in the first vocab.json of the ligand and the vocabulary in the second vocab.json of the protein; identical words, and identical words produced by different merging operations, are deleted; after the repeated words are deleted, the initial 256 characters of the BPE algorithm are supplemented as words to fill the vacancies caused by the deletion; and after processing, a size of a vocabulary of the tokenizer of the autoregressive model is 53080.
|
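The vocabulary-merging step recited in claim 1 can be illustrated with a minimal sketch. It rests on several assumptions that the claim does not state. The "initial 256 characters" are taken to be the byte-level base alphabet used by GPT-2-style BPE. Ligand entries are placed before protein entries. A "same word with different merging operations" is read as one token that two different merge rules both produce. The function names and toy vocabularies are hypothetical and do not reproduce the claimed sizes of 3560, 50000, or 53080.

```python
def byte_level_alphabet():
    """Return 256 printable characters, one per byte value (GPT-2-style mapping).

    Assumption: this is what the claim means by the initial 256 BPE characters.
    """
    byte_values = (list(range(0x21, 0x7F))      # '!' .. '~'
                   + list(range(0xA1, 0xAD))    # U+00A1 .. U+00AC
                   + list(range(0xAE, 0x100)))  # U+00AE .. U+00FF
    code_points = byte_values[:]
    extra = 0
    for b in range(256):
        if b not in byte_values:
            # Non-printable bytes are shifted to code points above 255.
            byte_values.append(b)
            code_points.append(256 + extra)
            extra += 1
    return [chr(c) for c in code_points]


def merge_vocabularies(ligand_vocab, protein_vocab, base_chars):
    """Union two token->id vocabularies, drop repeats, then add missing base characters."""
    merged = {}
    for token in list(ligand_vocab) + list(protein_vocab):
        if token not in merged:          # a repeated word is kept only once
            merged[token] = len(merged)
    for ch in base_chars:                # supplement the base alphabet
        if ch not in merged:
            merged[ch] = len(merged)
    return merged


def merge_merge_rules(ligand_merges, protein_merges):
    """Concatenate merge rules, keeping the first rule that yields each token."""
    seen, rules = set(), []
    for left, right in ligand_merges + protein_merges:
        token = left + right
        if token not in seen:            # same word from a different merge is dropped
            seen.add(token)
            rules.append((left, right))
    return rules


# Hypothetical toy data standing in for the two vocab.json / merges.txt pairs.
ligand_vocab = {"C": 0, "c": 1, "N": 2, "CC": 3, "c1": 4}
protein_vocab = {"A": 0, "C": 1, "N": 2, "L": 3, "AL": 4, "CC": 5}
ligand_merges = [("C", "C"), ("c", "1")]
protein_merges = [("A", "L"), ("C", "C")]

base_chars = byte_level_alphabet()
merged_vocab = merge_vocabularies(ligand_vocab, protein_vocab, base_chars)
merged_rules = merge_merge_rules(ligand_merges, protein_merges)
print(len(merged_vocab), merged_rules)
```

In this toy case the two vocabularies share three tokens, so the union holds eight words before the base characters are added. Five of those words are single printable characters already present in the base alphabet, so only the remaining 251 base characters are appended.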