US 12,367,397 B2
Query-based molecule optimization and applications to functional molecule discovery
Samuel Chung Hoffman, New York, NY (US); Enara C Vijil, Westchester, NY (US); Pin-Yu Chen, White Plains, NY (US); Payel Das, Yorktown Heights, NY (US); and Kahini Wadhawan, Delhi (IN)
Assigned to International Business Machines Corporation, Armonk, NY (US)
Filed by International Business Machines Corporation, Armonk, NY (US)
Filed on Sep. 10, 2020, as Appl. No. 17/016,640.
Prior Publication US 2022/0076137 A1, Mar. 10, 2022
Int. Cl. G06N 20/00 (2019.01); G06F 16/245 (2019.01); G06N 3/126 (2023.01)

CPC G06N 3/126 (2013.01) [G06F 16/245 (2019.01); G06N 20/00 (2019.01)]

18 Claims

1. An end-to-end query-based molecule optimization method comprising:

receiving, at a pre-trained plug-in encoder-decoder model running on at least one hardware processor of a computer system, a 1-dimensional string representation of an original molecule having a plurality of properties to be optimized;

encoding, using the trained plug-in encoder-decoder model running on the at least one hardware processor, the 1-dimensional string representation of the original molecule to be optimized into a latent vector representation of a latent vector space of the trained encoder-decoder model;

modifying, by the at least one hardware processor, the latent vector representation in the latent vector space;

decoding, by the trained plug-in encoder-decoder model running on the at least one hardware processor, the latent vector representation to obtain a decoded candidate molecule sequence structure;

inputting, by the at least one hardware processor, the decoded candidate molecule sequence structure to each of a plurality of independent, pre-trained machine learned molecular property prediction models, the plug-in encoder-decoder model being detached from the plurality of independent, pre-trained, machine-learned property prediction modules, and running, by the at least one hardware processor, the plurality of machine learned molecular property prediction models to predict for said decoded candidate molecule sequence structure a respective plurality of molecular properties used to evaluate the candidate molecule sequence structure;

obtaining, using the at least one hardware processor, a loss function using a combination of the multiple molecular property predictions, said one or more said respective plurality of molecular property predictions for optimizing molecular properties of said original molecule while satisfying one or more constraints, the obtained loss function comprising: a first function term quantifying a molecular constraint loss to be minimized and a second function term quantifying a molecular property score to be maximized;

performing a pseudo-gradient estimation of said loss function using the at least one hardware processor, said pseudo-gradient estimation comprising a zeroth-order optimization over the latent representation vector space of the trained encoder-decoder model that includes iteratively modifying the latent vector by applying one or more random direction queries using said random vector from said latent representation vector space and obtaining, at each iteration, a pseudo-gradient estimation of the loss function and generating corresponding loss values as a measure of differences between respective plurality of molecular property predictions and a corresponding respective plurality of specified threshold constraints, and

using the generated loss values at each iteration to determine a direction in the latent space for further modifying said latent vector in said latent representation vector space to achieve improved predicted molecular properties;

determining by the at least one hardware processor when said each of the plurality of properties predicted for said corresponding further modified latent vector representation or its corresponding decoded candidate molecule sequence structure satisfy all said corresponding respective plurality of the specified threshold constraints; and

outputting to a manufacturing pipeline, by the at least one hardware processor, the corresponding further modified sequence structure as an optimized original molecule to be manufactured when each of the plurality of properties predicted for said further modified sequence structure satisfies all said respective plurality of the specified threshold constraints, wherein the end-to-end query-based molecule optimization is model-agnostic as to how molecule representations are learned and trained.
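
The loop recited in claim 1 (evaluate a loss through black-box property predictors, estimate a pseudo-gradient from random direction queries, step in latent space, stop once every threshold is met) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the patent: the identity `decode`, the toy predictors `similarity` and `potency`, the thresholds, the hinge form of the constraint term, the weight `lam`, and the particular averaged random-direction estimator in `zo_gradient`, which is one common zeroth-order form. A real system would decode the latent vector to a string such as SMILES and call separately trained property models.

```python
import numpy as np

def make_loss(decode, constraint_models, score_model, lam=1.0):
    """Build loss(z) = constraint shortfall (to minimize) - lam * score (to maximize)."""
    def loss(z):
        mol = decode(z)
        # Hinge penalty: nonzero only while a predicted value is below its threshold.
        shortfall = sum(max(t - f(mol), 0.0) for f, t in constraint_models)
        return shortfall - lam * score_model(mol)
    return loss

def zo_gradient(loss, z, rng, num_queries=20, beta=0.01):
    """Pseudo-gradient from loss values only, averaged over random unit directions."""
    d = z.size
    base = loss(z)
    grad = np.zeros(d)
    for _ in range(num_queries):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)  # random unit direction in latent space
        grad += (loss(z + beta * u) - base) / beta * u
    return d * grad / num_queries

def optimize(z0, decode, loss, checks, rng, step=0.1, max_iters=200):
    """Step in latent space until every (predictor, threshold) check passes."""
    z = z0.copy()
    for _ in range(max_iters):
        mol = decode(z)
        if all(f(mol) >= t for f, t in checks):
            return z, True
        z = z - step * zo_gradient(loss, z, rng)
    mol = decode(z)
    return z, all(f(mol) >= t for f, t in checks)

# --- Hypothetical stand-ins (not from the patent) ---
def decode(z):
    return z  # identity placeholder for a trained decoder

z_orig = np.zeros(4)      # latent vector of the "original molecule"
target = np.full(4, 1.0)  # toy optimum of the property predictor

def similarity(mol):
    # Toy constraint predictor: closeness to the original, in (0, 1].
    return float(np.exp(-np.sum((mol - z_orig) ** 2) / 8.0))

def potency(mol):
    # Toy property predictor: higher is better, maximal at `target`.
    return float(-np.sum((mol - target) ** 2))

loss = make_loss(decode, [(similarity, 0.4)], potency)
checks = [(similarity, 0.4), (potency, -0.25)]
z_opt, ok = optimize(z_orig, decode, loss, checks, np.random.default_rng(0))
```

The predictors are only ever called, never differentiated, which is why the encoder-decoder and the property models can stay detached from one another. `ok` reports whether all thresholds were met within the iteration budget.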