US 12,334,057 B2
Method and system for visual context aware automatic speech recognition
Chayan Sarkar, Kolkata (IN); Pradip Pramanick, Kolkata (IN); and Ruchira Singh, Kolkata (IN)
Assigned to Tata Consultancy Services Limited, Mumbai (IN)
Filed by Tata Consultancy Services Limited, Mumbai (IN)
Filed on Jun. 13, 2023, as Appl. No. 18/333,983.
Claims priority of application No. 202221043394 (IN), filed on Jul. 28, 2022.
Prior Publication US 2024/0038224 A1, Feb. 1, 2024
Int. Cl. G10L 15/16 (2006.01); G10L 15/06 (2013.01)

CPC G10L 15/16 (2013.01) [G10L 15/063 (2013.01)]

9 Claims

1. A processor implemented method for automatic speech recognition, the method comprising:

triggering, by a robotic agent implemented via one or more hardware processors, capture of an ego-view image of an environment of the robotic agent on detecting a speech input;

detecting, from the ego-view image, one or more objects and associated text descriptions, by the robotic agent using a dense image captioning network;

processing and filtering, by the robotic agent, the text descriptions to explicitly label the text descriptions using a bias prediction network to generate a dynamic word vocabulary, wherein the dynamic word vocabulary provides a reduced space for a modified beam search decoding technique applied during generation of a final transcript for the speech input, wherein the dynamic word vocabulary comprises a list of biasing words having self-attributes and relational attributes associated with the one or more objects in the ego-view image;

dynamically compiling, by the robotic agent, the dynamic word vocabulary into a trie in accordance with movement of the robotic agent within the environment, wherein the trie provides a visual context for the speech input in accordance with the one or more objects detected in the ego-view image;

processing, by the robotic agent, the speech input in a plurality of time steps, using an acoustic model, to generate in each time step a probability distribution sequence over a character vocabulary, and biasing using the dynamic word vocabulary in accordance with the speech input;

iteratively decoding, by the robotic agent, the probability distribution sequence generated in each time step into a transcript to eventually generate a final transcript using a modified beam search decoding technique, wherein the modified beam search decoding technique applies a modified sampling function, a contextual re-scoring function and a bias aware pruning function on the transcript generated in each iteration in accordance with the trie;

performing a task understanding, by the robotic agent, by processing the final transcript; and

performing a task planning, by the robotic agent, in accordance with the understood task.
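
Illustrative sketch (not part of the claim). The claim recites compiling the dynamic word vocabulary into a trie and decoding per-time-step character distributions with a trie-guided beam search, but it does not define the sampling, re-scoring, or pruning functions. The Python sketch below is therefore only a toy under stated assumptions: the names Trie, biased_beam_search, beam_width, and bonus are hypothetical, the additive per-character bonus is a stand-in for the contextual re-scoring function, plain top-k pruning on the biased score is a stand-in for the bias aware pruning function, and CTC blank and repeat handling is omitted for brevity.

```python
import math
from typing import Dict, Iterable, List, Tuple

END = ""  # end-of-word marker; never collides with a real character key


class Trie:
    """Character trie over the biasing words (hypothetical helper)."""

    def __init__(self, words: Iterable[str] = ()) -> None:
        self.root: Dict = {}
        for word in words:
            self.insert(word)

    def insert(self, word: str) -> None:
        node = self.root
        for ch in word.lower():
            node = node.setdefault(ch, {})
        node[END] = {}

    def _walk(self, text: str):
        # Follow the path spelled by `text`; return None if it leaves the trie.
        node = self.root
        for ch in text.lower():
            if ch not in node:
                return None
            node = node[ch]
        return node

    def has_prefix(self, prefix: str) -> bool:
        return self._walk(prefix) is not None

    def has_word(self, word: str) -> bool:
        node = self._walk(word)
        return node is not None and END in node


def biased_beam_search(
    log_probs: List[Dict[str, float]],
    trie: Trie,
    beam_width: int = 2,
    bonus: float = 1.0,
) -> str:
    """Decode per-step character log-probabilities with a trie-based bias.

    Assumption: each hypothesis whose current partial word lies on a trie
    path receives an additive `bonus`; the claim does not specify this form.
    """
    beams: List[Tuple[str, float]] = [("", 0.0)]
    for step in log_probs:
        candidates: List[Tuple[str, float]] = []
        for text, score in beams:
            for ch, lp in step.items():
                new_text = text + ch
                new_score = score + lp
                partial_word = new_text.split(" ")[-1]
                # Stand-in for contextual re-scoring: reward trie prefixes.
                if partial_word and trie.has_prefix(partial_word):
                    new_score += bonus
                candidates.append((new_text, new_score))
        # Stand-in for pruning: keep the best hypotheses by biased score.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0][0]


if __name__ == "__main__":
    # Hypothetical vocabulary, as if derived from objects seen in the image.
    trie = Trie(["mug", "cup"])
    # The acoustic model slightly prefers "b" over "m" at the first step.
    steps = [
        {"m": math.log(0.4), "b": math.log(0.6)},
        {"u": 0.0},
        {"g": 0.0},
    ]
    print(biased_beam_search(steps, trie, bonus=0.0))  # bug
    print(biased_beam_search(steps, trie, bonus=1.0))  # mug
```

In the demo, the unbiased decode follows the acoustically likelier "bug", while the biased decode recovers "mug" because that hypothesis stays on a trie path at every step. Rebuilding the Trie whenever a new ego-view image yields a new vocabulary would correspond loosely to the dynamic compilation step recited above.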