US 12,327,410 B2
Methods and systems for disambiguation of referred objects for embodied agents
Chayan Sarkar, Kolkata (IN); Pradip Pramanick, Kolkata (IN); Brojeshwar Bhowmick, Kolkata (IN); Ruddra Dev Roychoudhury, Kolkata (IN); and Sayan Paul, Kolkata (IN)
Assigned to TATA CONSULTANCY SERVICES LIMITED, Mumbai (IN)
Filed by Tata Consultancy Services Limited, Mumbai (IN)
Filed on Jun. 9, 2023, as Appl. No. 18/207,836.
Claims priority of application No. 202221039195 (IN), filed on Jul. 7, 2022.
Prior Publication US 2024/0013538 A1, Jan. 11, 2024
Int. Cl. G06V 20/50 (2022.01); G06F 40/205 (2020.01); G06F 40/284 (2020.01); G06F 40/40 (2020.01); G06T 15/00 (2011.01); G06V 10/764 (2022.01)

CPC G06V 20/50 (2022.01) [G06F 40/205 (2020.01); G06F 40/284 (2020.01); G06F 40/40 (2020.01); G06T 15/00 (2013.01); G06V 10/764 (2022.01)]

15 Claims

1. A processor implemented method, comprising:

receiving, via one or more hardware processors, (i) at least one natural language instruction from one or more users on an embodied agent, and (ii) a set of successive images from one or more views and a set of absolute poses of the embodied agent corresponding to the one or more views of a current scene captured by the embodied agent in an environment, wherein the at least one natural language instruction is characterized by a target object and a task to be executed by the embodied agent on the target object;

detecting, via the one or more hardware processors, a plurality of objects in each of the set of successive images corresponding to the one or more views of the current scene using an object detector;

generating, via the one or more hardware processors, a natural language text description for each of the plurality of detected objects in each of the set of successive images using a dense-image captioning model;

determining, via the one or more hardware processors, a graph representation of (i) the at least one natural language instruction, and (ii) the natural language text description for each of the plurality of detected objects in each of the set of successive images using a phrase-to-graph network;

identifying, via the one or more hardware processors, a set of unique instance graph representations for the plurality of detected objects by merging a plurality of object graph representations using a multi-view aggregation algorithm, wherein the plurality of object graph representations are generated from the natural language text descriptions for a plurality of unique objects identified from the plurality of detected objects in each of the set of successive images;

determining, via the one or more hardware processors, an ambiguity in identifying the target object using a graph discriminator algorithm, wherein the graph discriminator algorithm utilizes the set of unique instance graph representations for determining the ambiguity; and

generating, via the one or more hardware processors, a descriptive query using the graph discriminator algorithm to extract information, wherein the extracted information is used for disambiguating the target object.
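
For illustration only, the last three steps of claim 1 (multi-view aggregation, ambiguity determination, and descriptive query generation) can be sketched as below. This is a minimal sketch under stated assumptions, not the claimed implementation: the claim does not disclose the internals of the multi-view aggregation algorithm or the graph discriminator algorithm. The names `ObjectGraph`, `aggregate_views`, and `discriminate` are hypothetical, as are the label-plus-distance merge rule, the 0.5 m radius, and the query wording. Each object graph is reduced to a class label, a set of attribute words, and an estimated world-frame position assumed to be derived from the absolute pose of the embodied agent.

```python
from dataclasses import dataclass, field
from math import dist


@dataclass
class ObjectGraph:
    """Hypothetical, simplified object graph: a class node plus attribute nodes."""
    label: str
    attributes: set = field(default_factory=set)
    # Assumed world-frame (x, y, z) estimate; unused for instruction graphs.
    position: tuple = (0.0, 0.0, 0.0)


def aggregate_views(detections, radius=0.5):
    """Merge per-view graphs into unique instance graphs.

    Assumption: two detections are the same physical object when they share
    a label and their estimated positions lie within `radius` metres.
    """
    instances = []
    for det in detections:
        for inst in instances:
            if inst.label == det.label and dist(inst.position, det.position) <= radius:
                inst.attributes |= det.attributes  # union attributes seen across views
                break
        else:
            # No existing instance matched, so this detection starts a new one.
            instances.append(ObjectGraph(det.label, set(det.attributes), det.position))
    return instances


def discriminate(instruction, instances):
    """Return (candidates, query); query is None when the target is unambiguous."""
    candidates = [
        inst for inst in instances
        if inst.label == instruction.label
        and instruction.attributes <= inst.attributes
    ]
    if len(candidates) == 1:
        return candidates, None
    if not candidates:
        return candidates, f"I cannot find any {instruction.label}. Can you describe it differently?"
    # Describe each candidate by the attributes that the candidates do not all share.
    shared = set.intersection(*(c.attributes for c in candidates))
    options = [
        "the " + " ".join(sorted(c.attributes - shared) + [c.label])
        for c in candidates
    ]
    return candidates, "Do you mean " + " or ".join(options) + "?"
```

In this sketch, an instruction such as "bring the cup" issued in a scene containing a red cup and a blue cup yields two candidates and a query that lists the distinguishing attributes, while "bring the blue cup" resolves to a single instance and produces no query.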