CPC G10L 13/10 (2013.01) [G06F 40/169 (2020.01); G06T 7/20 (2013.01); G06V 20/40 (2022.01); G06V 40/20 (2022.01); G10L 13/02 (2013.01); G10L 13/08 (2013.01); G10L 15/063 (2013.01); G10L 15/1815 (2013.01); G10L 15/183 (2013.01); G10L 15/22 (2013.01); G10L 25/57 (2013.01); H04N 5/04 (2013.01); G06T 2207/10016 (2013.01); G06T 2207/30196 (2013.01); G10L 2015/223 (2013.01)] | 20 Claims |
1. A method implemented by one or more processors, the method comprising:
obtaining, from an online multimedia repository, video content that includes a stream of audio data for audible content of the video content and a stream of vision data for visual content of the video content;
processing, using an automatic speech recognition model, the stream of audio data for the audible content of the video content to generate a stream of textual content corresponding to one or more spoken utterances captured in the stream of audio data for the audible content of the video content;
processing, using one or more movement tracking machine learning models, the stream of vision data for the visual content of the video content to generate a stream of visual cues corresponding to one or more movements captured in the stream of vision data for the visual content of the video content;
generating, based on processing the stream of audio data and based on processing the stream of vision data, a given persona training data instance to be utilized in further training an instance of a given large language model (LLM) that is specific to a given persona, from among a plurality of disparate personas, embodied in the video content;
training the instance of the given LLM based on at least the given persona training data instance; and
causing the instance of the given LLM to be utilized in subsequently processing additional streams of audio data capturing additional spoken utterances directed to an instance of an automated assistant that is assigned the given persona.
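The steps recited in claim 1 can be illustrated with a minimal sketch. All class and function names below are hypothetical placeholders standing in for the recited models (the ASR model, the movement-tracking models, and the persona-specific LLM instance); none are from the patent itself:

```python
# Hypothetical sketch of the claimed pipeline: ASR over the audio stream,
# movement tracking over the vision stream, persona training-data
# generation, training, and persona-conditioned use.
from dataclasses import dataclass, field


@dataclass
class PersonaTrainingInstance:
    persona: str
    transcript: list[str]    # textual content from the stream of audio data
    visual_cues: list[str]   # movement cues from the stream of vision data


@dataclass
class PersonaLLM:
    persona: str
    training_instances: list[PersonaTrainingInstance] = field(default_factory=list)

    def train(self, instance: PersonaTrainingInstance) -> None:
        # Stand-in for further training the persona-specific LLM instance.
        self.training_instances.append(instance)

    def respond(self, utterance: str) -> str:
        # Stand-in for processing an additional spoken utterance directed
        # to an automated assistant assigned the given persona.
        return f"[{self.persona}] response to: {utterance}"


def transcribe(audio_stream: list[str]) -> list[str]:
    # Stand-in for an automatic speech recognition model.
    return [f"transcript({chunk})" for chunk in audio_stream]


def track_movements(vision_stream: list[str]) -> list[str]:
    # Stand-in for one or more movement-tracking machine learning models.
    return [f"cue({frame})" for frame in vision_stream]


def build_persona_instance(persona, audio, vision) -> PersonaTrainingInstance:
    # Generate a persona training data instance from both streams.
    return PersonaTrainingInstance(
        persona=persona,
        transcript=transcribe(audio),
        visual_cues=track_movements(vision),
    )


# Usage: build one training instance from a video's streams, train the
# persona-specific instance, then route an utterance to it.
instance = build_persona_instance("chef", ["a0", "a1"], ["f0", "f1", "f2"])
assistant = PersonaLLM(persona="chef")
assistant.train(instance)
print(assistant.respond("How do I dice an onion?"))
```

The sketch mirrors the claim's structure: the two processing steps run independently over their respective streams, and only their combined output forms the training data instance tied to the given persona.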