CPC G16B 30/00 (2019.02) [G06N 3/044 (2023.01); G06N 3/045 (2023.01)] | 20 Claims |
1. A method comprising:
identifying, in a genome sequence, a set of domains, each identified domain corresponding to a set of domain identifiers;
applying a shallow neural network block to each set of domain identifiers to produce a set of vectors, each vector corresponding to a set of domain identifiers;
applying a recurrent neural network (RNN) block to the set of vectors to produce a biosynthetic gene cluster (BGC) class score for each domain, wherein the RNN block was trained by:
    identifying a set of positive vectors representing known BGCs;
    synthesizing a set of negative vectors unlikely to represent BGCs;
    applying the RNN block to the positive and negative sets of vectors to generate predictions of whether each vector is a positive or negative vector; and
    updating weights of the RNN block based on the predictions;
selecting candidate BGCs by averaging BGC class scores across genes within a domain and comparing the average BGC class scores to a threshold;
predicting a molecular activity of biosynthetic products derived from the selected BGCs; and
providing for display, on a user interface, the candidate BGCs and predicted molecular activity.
|
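The candidate-selection step of claim 1 (averaging BGC class scores and comparing the averages to a threshold) can be illustrated with a minimal sketch. It is not the claimed implementation. It assumes one possible reading in which each per-domain score carries an integer gene index in genome order, scores are averaged per gene, and consecutive genes whose average exceeds the threshold are merged into one candidate region. The function name `select_candidates`, the parameter names, and the default threshold of 0.5 are hypothetical; the claim language governs.

```python
import numpy as np


def select_candidates(domain_scores, gene_ids, threshold=0.5):
    """Average per-domain scores per gene, then group consecutive genes
    whose average exceeds `threshold` into candidate regions.

    Assumptions (not from the claim): `gene_ids` are integers in genome
    order, one per domain score; regions are (first_gene, last_gene),
    inclusive.
    """
    scores = np.asarray(domain_scores, dtype=float)
    genes = np.asarray(gene_ids)

    # Map each domain to the position of its gene among the sorted unique genes.
    unique_genes, inverse = np.unique(genes, return_inverse=True)

    # Per-gene mean = sum of domain scores / number of domains in that gene.
    gene_means = np.bincount(inverse, weights=scores) / np.bincount(inverse)

    above = gene_means > threshold
    candidates = []
    start = None
    for i, flag in enumerate(above):
        if flag and start is None:
            start = i  # a new candidate region opens here
        elif not flag and start is not None:
            candidates.append((int(unique_genes[start]), int(unique_genes[i - 1])))
            start = None
    if start is not None:
        # The last region runs to the end of the sequence.
        candidates.append((int(unique_genes[start]), int(unique_genes[-1])))

    return gene_means, candidates


# Illustrative input only: seven domain scores spread over four genes.
means, regions = select_candidates(
    [0.9, 0.7, 0.2, 0.1, 0.8, 0.6, 0.95],
    [0, 0, 1, 1, 2, 3, 3],
)
print(means, regions)
```

With the illustrative input, gene 1 falls below the threshold, so it splits the sequence into two candidate regions: gene 0 alone and genes 2 through 3.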