CPC G10L 13/10 (2013.01) [G06F 40/169 (2020.01); G06T 7/20 (2013.01); G06V 20/40 (2022.01); G06V 40/20 (2022.01); G10L 13/02 (2013.01); G10L 13/08 (2013.01); G10L 15/063 (2013.01); G10L 15/1815 (2013.01); G10L 15/183 (2013.01); G10L 15/22 (2013.01); G10L 25/57 (2013.01); H04N 5/04 (2013.01); G06T 2207/10016 (2013.01); G06T 2207/30196 (2013.01); G10L 2015/223 (2013.01)] | 20 Claims |
1. A method implemented by one or more processors, the method comprising:
obtaining, from an online multimedia repository, video content that includes a stream of audio data for audible content of the video content and a stream of vision data for visual content of the video content;
processing, using an automatic speech recognition model, the stream of audio data for the audible content of the video content to generate a stream of textual content corresponding to one or more spoken utterances captured in the stream of audio data for the audible content of the video content;
processing, using one or more movement tracking machine learning models, the stream of vision data for the visual content of the video content to generate a stream of visual cues corresponding to one or more movements captured in the stream of vision data for the visual content of the video content;
generating, based on processing the stream of audio data and based on processing the stream of vision data, a given persona training data instance to be utilized in further training an instance of a given large language model (LLM) that is specific to a given persona, from among a plurality of disparate personas, embodied in the video content;
training the instance of the given LLM based on at least the given persona training data instance; and
causing the instance of the given LLM to be utilized in subsequently processing additional streams of audio data capturing additional spoken utterances directed to an instance of an automated assistant that is assigned the given persona.
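The steps recited in claim 1 can be illustrated with a minimal sketch. All class and function names below are hypothetical placeholders standing in for the recited models (the ASR model, the movement-tracking models, and the persona-specific LLM instance); none are from the patent itself:

```python
# Hypothetical sketch of the claimed pipeline: ASR over the audio stream,
# movement tracking over the vision stream, persona training-data
# generation, training, and persona-conditioned use.
from dataclasses import dataclass, field


@dataclass
class PersonaTrainingInstance:
    persona: str
    transcript: list[str]    # textual content from the stream of audio data
    visual_cues: list[str]   # movement cues from the stream of vision data


@dataclass
class PersonaLLM:
    persona: str
    training_instances: list[PersonaTrainingInstance] = field(default_factory=list)

    def train(self, instance: PersonaTrainingInstance) -> None:
        # Stand-in for further training the persona-specific LLM instance.
        self.training_instances.append(instance)

    def respond(self, utterance: str) -> str:
        # Stand-in for processing an additional spoken utterance directed
        # to an automated assistant assigned the given persona.
        return f"[{self.persona}] response to: {utterance}"


def transcribe(audio_stream: list[str]) -> list[str]:
    # Stand-in for an automatic speech recognition model.
    return [f"transcript({chunk})" for chunk in audio_stream]


def track_movements(vision_stream: list[str]) -> list[str]:
    # Stand-in for one or more movement-tracking machine learning models.
    return [f"cue({frame})" for frame in vision_stream]


def build_persona_instance(persona, audio, vision) -> PersonaTrainingInstance:
    # Generate a persona training data instance from both streams.
    return PersonaTrainingInstance(
        persona=persona,
        transcript=transcribe(audio),
        visual_cues=track_movements(vision),
    )


# Usage: build one training instance from a video's streams, train the
# persona-specific instance, then route an utterance to it.
instance = build_persona_instance("chef", ["a0", "a1"], ["f0", "f1", "f2"])
assistant = PersonaLLM(persona="chef")
assistant.train(instance)
print(assistant.respond("How do I dice an onion?"))
```

The sketch mirrors the claim's structure: the two processing steps run independently over their respective streams, and only their combined output forms the training data instance tied to the given persona.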