US 12,437,847 B1
Drug design method based on autoregressive model
Suxia Han, Xi'an (CN); Yungang Xu, Xi'an (CN); Yuesen Li, Xi'an (CN); Chengyi Gao, Xi'an (CN); Yiping Li, Xi'an (CN); Xin Song, Xi'an (CN); and Xiangyu Wang, Xi'an (CN)
Filed by The First Affiliated Hospital of Xi'an Jiaotong University, Xi'an (CN)
Filed on Jan. 18, 2025, as Appl. No. 19/031,343.
Claims priority of application No. 202410754023.5 (CN), filed on Jun. 12, 2024.
Int. Cl. G01N 33/48 (2006.01); G01N 33/50 (2006.01); G16C 20/50 (2019.01); G16C 20/70 (2019.01); G16C 20/90 (2019.01)

CPC G16C 20/50 (2019.02) [G16C 20/70 (2019.02); G16C 20/90 (2019.02)]

8 Claims

1. A drug design method based on an autoregressive model, comprising:

acquiring a ligand data set and a protein-ligand data set; training a protein tokenizer on protein information in the protein-ligand data set by using a sub-word tokenization algorithm, and training a ligand tokenizer on ligand information in the ligand data set by using the sub-word tokenization algorithm; and constructing a tokenizer of the autoregressive model by using the protein tokenizer and the ligand tokenizer;

processing and transforming the original data in the ligand data set and the protein-ligand data set into a text form suitable for the autoregressive model to obtain text data, and encoding the text data with the tokenizer of the autoregressive model to construct the training data set required by the autoregressive model;

training the autoregressive model by using the training data set to obtain a trained autoregressive model, so that the autoregressive model is configured to understand and learn an interaction mode between a protein and a ligand;

generating predicted ligand data by using the trained autoregressive model, and post-processing the predicted ligand data through a chemical or biological information tool to obtain candidate ligands each with a chemical structure; and

evaluating and optimizing the candidate ligands, specifically comprising: performing structural optimization and activity prediction on the candidate ligands by using the chemical or biological information tool, to determine target candidate molecules;

wherein the ligand tokenizer is obtained by training a byte pair encoding (BPE) algorithm, which is the sub-word tokenization algorithm, on a simplified molecular input line entry system (SMILES) format of a ligand in a data set containing protein-ligand pair information; the ligand tokenizer comprises a first vocab.json vocabulary file and a first merges.txt merging-operation file of the ligand, and a size of the vocabulary in the first vocab.json is 3560;

wherein a vocabulary size of the protein is set to 50000, and the protein tokenizer is obtained by training the BPE algorithm on an amino acid sequence of a protein in the data set containing protein-ligand pair information; the protein tokenizer comprises a second vocab.json vocabulary file and a second merges.txt merging-operation file of the protein; and

when constructing the tokenizer of the autoregressive model, the first vocab.json vocabulary of the ligand and the second vocab.json vocabulary of the protein are merged, and identical words, including identical words produced by different merging operations, are deleted; after the repeated words are deleted, the 256 initial characters of the BPE algorithm are added as words to fill the vacancy left by the deletion; and after this processing, a size of a vocabulary of the tokenizer of the autoregressive model is 53080.
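
The BPE training step recited in claim 1 can be illustrated with a minimal sketch. It is a character-level toy, not the byte-level trainer that would normally produce vocab.json and merges.txt files, and the function names (train_bpe, apply_merge), the three-molecule corpus, and the stopping rule (stop when no pair occurs at least twice) are assumptions made for illustration only.

```python
from collections import Counter


def apply_merge(seq, pair):
    """Replace every adjacent occurrence of `pair` in `seq` with one merged symbol."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules; return (vocab, merges)."""
    # Every string starts as a list of single characters.
    sequences = [list(text) for text in corpus]
    vocab = {}
    for ch in sorted({c for seq in sequences for c in seq}):
        vocab[ch] = len(vocab)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs over the whole corpus.
        pair_counts = Counter()
        for seq in sequences:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        # Most frequent pair first; ties are broken alphabetically.
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        if pair_counts[best] < 2:
            break  # assumed stopping rule: nothing left that repeats
        merges.append(best)
        # The same string can arise from different merges; keep it only once.
        vocab.setdefault(best[0] + best[1], len(vocab))
        sequences = [apply_merge(seq, best) for seq in sequences]
    return vocab, merges


# Toy SMILES corpus (illustrative data, not from the patent).
ligand_corpus = ["CCO", "CCN", "CCOC"]
ligand_vocab, ligand_merges = train_bpe(ligand_corpus, num_merges=5)
print(ligand_merges)  # [('C', 'C'), ('CC', 'O')]
print(ligand_vocab)   # {'C': 0, 'N': 1, 'O': 2, 'CC': 3, 'CCO': 4}
```

The returned dictionary plays the role of a vocab.json and the ordered merge list plays the role of a merges.txt; the same function could be run on amino acid sequences to obtain the protein counterpart.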
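
The vocabulary merging recited in the final clause of claim 1 can be read as a deduplicating union followed by a top-up with base symbols. The sketch below shows that reading on toy data. The function name merge_vocabularies, the toy vocabularies, and the short base_symbols list (standing in for the 256 initial characters) are assumptions; the claimed sizes (3560, 50000 and 53080) are not reproduced here.

```python
def merge_vocabularies(ligand_vocab, protein_vocab, base_symbols):
    """Union two token->id vocabularies, drop repeats, then add missing base symbols.

    Ids are reassigned contiguously in first-seen order: ligand tokens,
    then new protein tokens, then any base symbol not already present.
    """
    merged = {}
    for token in list(ligand_vocab) + list(protein_vocab) + list(base_symbols):
        if token not in merged:  # a repeated word is kept only once
            merged[token] = len(merged)
    return merged


# Toy inputs (illustrative only).
toy_ligand_vocab = {"C": 0, "N": 1, "O": 2, "CC": 3, "CCO": 4}
toy_protein_vocab = {"A": 0, "C": 1, "G": 2, "AC": 3, "CC": 4}
# Stand-in for the 256 initial BPE characters; a byte-level tokenizer
# would use its own 256-symbol alphabet here.
base_symbols = ["A", "C", "G", "N", "O", "X"]

merged_vocab = merge_vocabularies(toy_ligand_vocab, toy_protein_vocab, base_symbols)
print(len(merged_vocab))  # 9: 5 + 5 tokens, minus 2 repeats ("C", "CC"), plus "X"
```

Reassigning ids after the merge keeps the id range contiguous, which is what an embedding table indexed by token id would require.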
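
The steps of transforming a protein-ligand pair into text, encoding it, and presenting it to an autoregressive model can be sketched as follows. The claim does not specify the text template, so the marker tokens (<|startoftext|>, <P>, <L>, <|endoftext|>), the toy vocabulary and merge rules, and all function names are hypothetical.

```python
# Hypothetical marker tokens; the claim does not name a template.
SPECIAL_TOKENS = ["<|startoftext|>", "<P>", "<L>", "<|endoftext|>"]

# Toy merged vocabulary and merge rules (illustrative only).
TOKENS = SPECIAL_TOKENS + ["M", "K", "C", "O", "N", "MK", "CC"]
VOCAB = {token: idx for idx, token in enumerate(TOKENS)}
MERGES = [("M", "K"), ("C", "C")]


def to_text(protein_seq, smiles):
    """Render one protein-ligand pair as a single line of text."""
    return f"<|startoftext|><P>{protein_seq}<L>{smiles}<|endoftext|>"


def bpe_encode(text, merges, vocab):
    """Split into characters, apply merge rules in learned order, map to ids."""
    symbols = list(text)
    for left, right in merges:
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return [vocab[s] for s in symbols]


def build_example(protein_seq, smiles, merges, vocab):
    """Encode a pair as token ids, with marker tokens kept as single ids."""
    ids = [vocab["<|startoftext|>"], vocab["<P>"]]
    ids += bpe_encode(protein_seq, merges, vocab)
    ids.append(vocab["<L>"])
    ids += bpe_encode(smiles, merges, vocab)
    ids.append(vocab["<|endoftext|>"])
    return ids


def next_token_pairs(ids):
    """Autoregressive training views: predict token t+1 from tokens up to t."""
    return ids[:-1], ids[1:]


example_ids = build_example("MKK", "CCO", MERGES, VOCAB)
inputs, targets = next_token_pairs(example_ids)
print(to_text("MKK", "CCO"))
print(example_ids)  # [0, 1, 9, 5, 2, 10, 7, 3]
```

At generation time, under the same assumed template, the model would be prompted with the protein portion up to the <L> marker and asked to continue with ligand tokens until <|endoftext|>.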
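
The post-processing of generated ligand data can be sketched as a filter over SMILES strings. The claim leaves the chemical or biological information tool unspecified; the sketch below substitutes a crude, pure-Python syntactic screen (balanced parentheses and brackets, paired ring-closure labels) plus de-duplication. It does not check chemical validity, which a real cheminformatics toolkit would do, and the function names are assumptions.

```python
DIGITS = "0123456789"


def looks_well_formed(smiles):
    """Crude syntactic screen for a SMILES string; not a chemical validity check."""
    if not smiles:
        return False
    depth = 0            # nesting depth of branch parentheses
    in_bracket = False   # inside a [...] atom, where digits are not ring labels
    open_rings = set()   # ring-closure labels seen an odd number of times
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if in_bracket:
            if ch == "]":
                in_bracket = False
            elif ch == "[":
                return False
        elif ch == "[":
            in_bracket = True
        elif ch == "]":
            return False
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch == "%":
            # Two-digit ring-closure label, e.g. %12.
            label = smiles[i + 1:i + 3]
            if len(label) != 2 or any(c not in DIGITS for c in label):
                return False
            open_rings ^= {int(label)}
            i += 2
        elif ch in DIGITS:
            open_rings ^= {int(ch)}
        i += 1
    return depth == 0 and not in_bracket and not open_rings


def filter_candidates(generated):
    """Drop repeats and strings that fail the syntactic screen, keeping order."""
    seen, kept = set(), []
    for smiles in generated:
        if smiles not in seen and looks_well_formed(smiles):
            kept.append(smiles)
        seen.add(smiles)
    return kept


candidates = filter_candidates(["CCO", "CCO", "C1CC", "CC(=O)O"])
print(candidates)  # ['CCO', 'CC(=O)O']
```

Strings that pass this screen would still need to be parsed by a cheminformatics toolkit before structural optimization and activity prediction.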