US 12,014,143 B2
Techniques for performing contextual phrase grounding
Pelin Dogan, Zürich (CH); Leonid Sigal, Vancouver (CA); and Markus Gross, Zürich (CH)
Assigned to DISNEY ENTERPRISES, INC., Burbank, CA (US); and ETH Zürich (Eidgenössische Technische Hochschule Zürich), Zürich (CH)
Filed by DISNEY ENTERPRISES, INC., Burbank, CA (US); and ETH Zürich (Eidgenössische Technische Hochschule Zürich), Zürich (CH)
Filed on Feb. 25, 2019, as Appl. No. 16/285,115.
Prior Publication US 2020/0272695 A1, Aug. 27, 2020
Int. Cl. G06F 40/30 (2020.01); G06F 40/253 (2020.01); G06N 3/08 (2023.01)

CPC G06F 40/30 (2020.01) [G06N 3/08 (2013.01); G06F 40/253 (2020.01)]

20 Claims

1. A computer-implemented method for performing automated phrase grounding operations, the computer-implemented method comprising:

extracting, from a source sentence, a first phrase and a second phrase;

executing a first encoder neural network that generates, based on the first phrase, a first encoding of the first phrase and, based on the second phrase, a second encoding of the second phrase;

executing one or more neural networks to convert a plurality of bounding boxes for a plurality of objects included in a source image into a plurality of box states corresponding to the plurality of bounding boxes, wherein each box state included in the plurality of box states encodes one or more interrelationships between a corresponding bounding box included in the plurality of bounding boxes and one or more additional bounding boxes included in the plurality of bounding boxes;

executing a decision neural network to generate a first plurality of grounding decisions based on input that includes (i) a first box state that is included in the plurality of box states and corresponds to a first bounding box included in the plurality of bounding boxes, (ii) the first encoding, and (iii) the second encoding;

performing one or more comparisons using the first plurality of grounding decisions and a decision threshold to determine that a first grounding decision included in the first plurality of grounding decisions indicates that the first phrase matches the first bounding box;

generating a first matched pair that specifies the first phrase and the first bounding box; and

causing one or more annotation operations to be performed on the source image based on the first matched pair.
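
By way of illustration only, and without limiting the claim, the data flow recited in claim 1 can be sketched in Python. Everything named in the sketch is a hypothetical stand-in: the claim does not specify network architectures, so `toy_encode_phrase`, `toy_encode_boxes`, and `toy_decide` are simple deterministic functions used in place of the encoder neural network, the box-state networks, and the decision neural network. The strict `>` comparison against the decision threshold and the default threshold of 0.5 are likewise assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
Vector = List[float]


@dataclass(frozen=True)
class MatchedPair:
    """A phrase paired with the bounding box it was grounded to."""
    phrase: str
    box: Box


def ground_phrases(
    phrases: Sequence[str],
    boxes: Sequence[Box],
    encode_phrase: Callable[[str], Vector],
    encode_boxes: Callable[[Sequence[Box]], List[Vector]],
    decide: Callable[[Vector, List[Vector]], List[float]],
    threshold: float = 0.5,  # assumed value; the claim only recites "a decision threshold"
) -> List[MatchedPair]:
    # One encoding per phrase extracted from the source sentence.
    encodings = [encode_phrase(p) for p in phrases]
    # One box state per bounding box; each state may depend on the other boxes.
    box_states = encode_boxes(boxes)
    pairs: List[MatchedPair] = []
    for box, state in zip(boxes, box_states):
        # One grounding decision per phrase, given this box state and all encodings.
        decisions = decide(state, encodings)
        for phrase, score in zip(phrases, decisions):
            if score > threshold:  # assumed comparison direction
                pairs.append(MatchedPair(phrase, box))
    return pairs


def annotate(pairs: Sequence[MatchedPair]) -> List[str]:
    """Hypothetical annotation step: produce one text label per matched pair."""
    return [f"{p.phrase} @ {p.box}" for p in pairs]


# --- Toy stand-ins for the neural networks (assumptions, not the claimed models) ---

def toy_encode_phrase(phrase: str) -> Vector:
    # Encode a phrase by its character count.
    return [float(len(phrase))]


def toy_encode_boxes(boxes: Sequence[Box]) -> List[Vector]:
    # Each state holds the box area and the number of other boxes it overlaps,
    # so the state reflects a relationship to the remaining boxes.
    states: List[Vector] = []
    for i, (x0, y0, x1, y1) in enumerate(boxes):
        overlaps = sum(
            1
            for j, (a0, b0, a1, b1) in enumerate(boxes)
            if j != i and x0 < a1 and a0 < x1 and y0 < b1 and b0 < y1
        )
        states.append([(x1 - x0) * (y1 - y0), float(overlaps)])
    return states


def toy_decide(state: Vector, encodings: List[Vector]) -> List[float]:
    # Score is 1.0 when the box area equals ten times the phrase length,
    # and decays toward 0 as the two values diverge.
    return [1.0 / (1.0 + abs(state[0] - 10.0 * enc[0])) for enc in encodings]


if __name__ == "__main__":
    phrases = ["a dog", "a red frisbee"]
    boxes = [(0.0, 0.0, 10.0, 5.0), (8.0, 2.0, 18.0, 15.0)]
    matched = ground_phrases(
        phrases, boxes, toy_encode_phrase, toy_encode_boxes, toy_decide
    )
    for label in annotate(matched):
        print(label)
```

Passing the three models in as callables keeps the control flow of the claim (encode, convert boxes to states, decide, compare, pair, annotate) separate from any particular network design, so trained models could be substituted for the toy functions without changing `ground_phrases`.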