US 12,009,060 B2
Identifying biosynthetic gene clusters
Geoffrey D. Hannigan, Melrose, MA (US); David Prihoda, Czech Republic (CZ); Jindrich Soukup, Prague (CZ); Christopher Harron Woelk, Winchester, MA (US); and Danny A. Bitton, Czech Republic (CZ)
Assigned to Merck Sharp & Dohme LLC, Rahway, NJ (US); and MSD Czech Republic s.r.o., Smichov (CZ)
Filed by Merck Sharp & Dohme LLC, Rahway, NJ (US); and MSD Czech Republic s.r.o., Prague (CZ)
Filed on Mar. 22, 2019, as Appl. No. 16/362,236.
Claims priority of provisional application 62/779,697, filed on Dec. 14, 2018.
Prior Publication US 2020/0194098 A1, Jun. 18, 2020
Int. Cl. G16B 30/00 (2019.01); G06N 3/044 (2023.01); G06N 3/045 (2023.01)
CPC G16B 30/00 (2019.02) [G06N 3/044 (2023.01); G06N 3/045 (2023.01)]
20 Claims
 
1. A method comprising:
  identifying, in a genome sequence, a set of domains, each identified domain corresponding to a set of domain identifiers;
  applying a shallow neural network block to each set of domain identifiers to produce a set of vectors, each vector corresponding to a set of domain identifiers;
  applying a recurrent neural network (RNN) block to the set of vectors to produce a biosynthetic gene cluster (BGC) class score for each domain, wherein the RNN block was trained by:
    identifying a set of positive vectors representing known BGCs;
    synthesizing a set of negative vectors unlikely to represent BGCs;
    applying the RNN block to the positive and negative sets of vectors to generate predictions of whether each vector is a positive or negative vector; and
    updating weights of the RNN block based on the predictions;
  selecting candidate BGCs by averaging BGC class scores across genes within a domain and comparing the average BGC class scores to a threshold;
  predicting a molecular activity of biosynthetic products derived from the selected BGCs; and
  providing for display, on a user interface, the candidate BGCs and predicted molecular activity.
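
The candidate-selection step of claim 1 (averaging BGC class scores and comparing the averages to a threshold) might be sketched as follows. This is an illustrative sketch only, not the claimed implementation. It assumes that per-domain scores arrive in genome order tagged with a gene identifier, that averaging is done over the domains of each gene, and that runs of consecutive above-threshold genes are merged into one candidate. The function name `select_candidate_bgcs`, the input layout, and the default threshold of 0.5 are hypothetical and do not appear in the claim.

```python
from itertools import groupby
from statistics import mean


def select_candidate_bgcs(domain_scores, threshold=0.5):
    """Group genes into candidate BGCs from per-domain class scores.

    domain_scores: list of (gene_id, score) pairs in genome order, where
    score is the BGC class score produced for one domain (assumed layout).
    Returns a list of candidates, each a list of consecutive gene ids.
    """
    # Average the per-domain scores within each gene. groupby relies on
    # the domains of one gene being adjacent in the input.
    gene_means = [
        (gene_id, mean(score for _, score in group))
        for gene_id, group in groupby(domain_scores, key=lambda pair: pair[0])
    ]

    # Compare each average to the threshold and merge runs of consecutive
    # above-threshold genes into a single candidate (merging is an assumption).
    candidates = []
    current = []
    for gene_id, average in gene_means:
        if average > threshold:
            current.append(gene_id)
        elif current:
            candidates.append(current)
            current = []
    if current:
        candidates.append(current)
    return candidates


# Hypothetical scores for four genes; g3 falls below the threshold and
# therefore separates two candidates.
example_scores = [
    ("g1", 0.9), ("g1", 0.7),
    ("g2", 0.8),
    ("g3", 0.1), ("g3", 0.2),
    ("g4", 0.6), ("g4", 0.9),
]
example_candidates = select_candidate_bgcs(example_scores)
```

With these made-up scores, genes g1 and g2 form one candidate and g4 forms another. Raising the threshold trades recall for precision, so any real value would have to be chosen on validation data.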