| CPC G10L 15/16 (2013.01) [G10L 15/063 (2013.01)] | 9 Claims |

|
1. A processor implemented method for automatic speech recognition, the method comprising:
triggering, by a robotic agent implemented via one or more hardware processors, capture of an ego-view image of an environment of the robotic agent on detecting a speech input;
detecting, from the ego-view image, one or more objects and associated text descriptions, by the robotic agent using a dense image captioning network;
processing and filtering, by the robotic agent, the text descriptions to explicitly label the text descriptions using a bias prediction network to generate a dynamic word vocabulary, wherein the dynamic word vocabulary provides a reduced space for a modified beam search decoding technique applied during generation of a final transcript for the speech input, wherein the dynamic word vocabulary comprises a list of biasing words having self-attributes and relational attributes associated with the one or more objects in the ego-view image;
dynamically compiling, by the robotic agent, the dynamic word vocabulary into a trie in accordance with movement of the robotic agent within the environment, wherein the trie provides a visual context for the speech input in accordance with the one or more objects detected in the ego-view image;
processing, by the robotic agent, the speech input in a plurality of time steps, using an acoustic model, to generate in each time step a probability distribution sequence over a character vocabulary, and biasing using the dynamic word vocabulary in accordance with the speech input;
iteratively decoding, by the robotic agent, the probability distribution sequence generated in each time step into a transcript to eventually generate a final transcript using a modified beam search decoding technique, wherein the modified beam search decoding technique applies a modified sampling function, a contextual re-scoring function and a bias aware pruning function on the transcript generated in each iteration in accordance with the trie;
performing a task understanding, by the robotic agent, by processing the final transcript; and
performing a task planning, by the robotic agent, in accordance with the understood task.
|
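
For illustration only, the trie compilation and trie-guided decoding steps recited above can be sketched as follows. This is a minimal sketch under stated assumptions and not the claimed method: the names `TrieNode`, `build_trie`, `prefix_status`, and `biased_beam_search` are hypothetical, the fixed additive `bonus` is an assumed stand-in for the contextual re-scoring function, and plain top-k pruning replaces the bias aware pruning function. The per-step character log-probabilities are supplied by hand in place of an acoustic model.

```python
import math
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    """One node of a character trie (hypothetical helper type)."""
    children: dict = field(default_factory=dict)
    is_word: bool = False


def build_trie(words):
    """Compile a list of biasing words into a character trie."""
    root = TrieNode()
    for word in words:
        node = root
        for ch in word.lower():
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root


def prefix_status(root, text):
    """Classify the last space-delimited token of `text` against the trie.

    Returns "word" for a complete biasing word, "prefix" for a proper
    prefix of one, and "none" otherwise (including an empty token).
    """
    token = text.split(" ")[-1]
    if not token:
        return "none"
    node = root
    for ch in token:
        node = node.children.get(ch)
        if node is None:
            return "none"
    return "word" if node.is_word else "prefix"


def biased_beam_search(step_log_probs, trie, beam_width=3, bonus=1.0):
    """Character-level beam search with an assumed additive trie bonus.

    `step_log_probs` is a list of dicts, one per time step, mapping each
    character to its log-probability. A hypothesis whose current token
    stays on a trie path receives `bonus` at that step.
    """
    beams = [("", 0.0)]
    for dist in step_log_probs:
        candidates = []
        for text, score in beams:
            for ch, log_p in dist.items():
                new_text = text + ch
                new_score = score + log_p
                if prefix_status(trie, new_text) != "none":
                    new_score += bonus  # assumed re-scoring rule
                candidates.append((new_text, new_score))
        # Plain top-k pruning; the claim's bias aware pruning is not modelled.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0][0]


# Hand-written distributions in which the acoustics slightly favour "cap".
steps = [
    {"c": math.log(1.0)},
    {"a": math.log(0.6), "u": math.log(0.4)},
    {"p": math.log(1.0)},
]
trie = build_trie(["cup", "table"])
biased = biased_beam_search(steps, trie, bonus=1.0)
unbiased = biased_beam_search(steps, trie, bonus=0.0)
```

In this toy input the unbiased search follows the acoustically likelier characters, while the trie bonus steers the hypothesis toward a word in the vocabulary. Rebuilding the trie whenever the detected objects change would correspond to the dynamic compilation step; how often that happens is left open here.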