US 12,437,746 B2
Real-time system for spoken natural stylistic conversations with large language models
Adrian Wyatt Bonar, Seattle, WA (US); Jennifer Fox, Seattle, WA (US); Nicole E. Berdy, Cambridge, MA (US); Mollie Munoz, Redmond, WA (US); Shawn Callegari, Redmond, WA (US); Devis Lucato, Redmond, WA (US); and Ryan H. Volum, Seattle, WA (US)
Assigned to Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed by MICROSOFT TECHNOLOGY LICENSING, LLC, Redmond, WA (US)
Filed on Apr. 7, 2023, as Appl. No. 18/132,356.
Claims priority of provisional application 63/427,079, filed on Nov. 21, 2022.
Prior Publication US 2024/0169974 A1, May 23, 2024
Int. Cl. G10L 13/10 (2013.01); G10L 15/26 (2006.01); G10L 13/02 (2013.01); G10L 13/08 (2013.01); G10L 15/18 (2013.01); G10L 15/22 (2006.01); G10L 25/63 (2013.01)
CPC G10L 13/10 (2013.01) [G10L 15/26 (2013.01); G10L 13/02 (2013.01); G10L 13/08 (2013.01); G10L 2013/083 (2013.01); G10L 15/1815 (2013.01); G10L 2015/225 (2013.01); G10L 25/63 (2013.01)]
20 Claims
1. A method comprising:
configuring a large language model with a conversational profile comprising training data that includes a set of example conversations that comprise labeled data demonstrating positive interactions, negative interactions, and appropriate responses and inappropriate responses, and a set of sentiments;
receiving a user input comprising a speech audio input;
converting the user input into a text translation of the speech audio input;
analyzing the speech audio input to determine at least one of a volume, a tone of voice, and one or more inflections of a voice included in the speech audio input;
analyzing at least one of the volume, the tone of voice, and the one or more inflections with the text translation of the speech audio input using a prompt engine to determine a sentiment of the user input;
generating a prompt, using the prompt engine, wherein the prompt includes the sentiment of the user input that is determined using at least one of the tone, the volume, the one or more inflections, and the text translation of the speech audio input, the prompt further comprising instructions for causing the large language model to include punctuation and a word selection in a text response that represents a selected sentiment from the set of sentiments within the conversational profile;
causing the large language model to generate the text response based on the prompt and the text translation of the speech audio input, using the selected sentiment and the training data, wherein the training data causes the large language model to set patterns for subsequent text responses and prompts by increasing emphasis for positive user interactions and appropriate text responses and decreasing emphasis for negative user interactions and inappropriate text responses;
receiving the text response with the selected sentiment from the large language model generated from the prompt;
selecting a style cue for an audio output based on the sentiment of the user input, the selected sentiment of the text response, and the prompt using the large language model; and
generating the audio output using the text response and the style cue, wherein the audio output response uses the style cue, the punctuation, and the word selection from the text response to generate the audio output including at least one of an audio output volume, an audio output tone of voice, or one or more inflections of a voice in the audio output, wherein at least one of the audio output volume, the audio output tone of voice, or one or more inflections of the voice in the audio output corresponds to the style cue, the punctuation, and the word selection of the text response.
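The flow of data through the steps of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation, and every name in it (rms_volume, determine_user_sentiment, choose_response_sentiment, build_prompt, STYLE_CUES, render_ssml, respond) is hypothetical. Several simplifications are assumed: the transcript is supplied directly instead of coming from a speech recognizer, only volume is analyzed, the sentiment rules and the 0.5 threshold are arbitrary, the large language model is any callable passed in by the caller, and the style cue comes from a fixed lookup table instead of being selected with the large language model as the claim recites. The audio output is represented as SSML markup that a text-to-speech engine could consume.

```python
import math
from xml.sax.saxutils import escape

# Assumed "set of sentiments" from the conversational profile (illustrative only).
SENTIMENTS = ("calm", "cheerful", "empathetic")

# Assumed mapping from a selected sentiment to a style cue. The claim selects
# the cue using the language model; a lookup table is swapped in here.
STYLE_CUES = {
    "calm": {"volume": "medium", "pitch": "medium"},
    "cheerful": {"volume": "loud", "pitch": "high"},
    "empathetic": {"volume": "soft", "pitch": "low"},
}


def rms_volume(samples):
    """Root-mean-square amplitude of samples in [-1, 1], a crude volume proxy."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def determine_user_sentiment(volume, transcript):
    """Toy rule that combines one audio feature with the text translation."""
    lowered = transcript.lower()
    if any(word in lowered for word in ("sad", "upset", "sorry")):
        return "upset"
    if volume > 0.5 and transcript.rstrip().endswith("!"):  # assumed threshold
        return "excited"
    return "neutral"


def choose_response_sentiment(user_sentiment):
    """Pick the sentiment the response should convey (assumed policy)."""
    return {"upset": "empathetic", "excited": "cheerful"}.get(user_sentiment, "calm")


def build_prompt(transcript, user_sentiment, response_sentiment):
    """Embed the detected sentiment and the styling instructions in the prompt."""
    if response_sentiment not in SENTIMENTS:
        raise ValueError(f"unknown sentiment: {response_sentiment}")
    return (
        f"The user sounds {user_sentiment}.\n"
        f"Reply in a {response_sentiment} manner, using punctuation and "
        f"word choice that convey it.\n"
        f"User: {transcript}\nAssistant:"
    )


def render_ssml(text_response, response_sentiment):
    """Wrap the text response in an SSML prosody element carrying the style cue."""
    cue = STYLE_CUES[response_sentiment]
    return (
        f'<speak><prosody volume="{cue["volume"]}" pitch="{cue["pitch"]}">'
        f"{escape(text_response)}</prosody></speak>"
    )


def respond(samples, transcript, llm):
    """Run the pipeline end to end; llm is any callable mapping prompt to text."""
    user_sentiment = determine_user_sentiment(rms_volume(samples), transcript)
    response_sentiment = choose_response_sentiment(user_sentiment)
    prompt = build_prompt(transcript, user_sentiment, response_sentiment)
    return render_ssml(llm(prompt), response_sentiment)
```

Calling respond with loud samples, the transcript "We won the game!", and a stub model that returns a fixed string yields markup with volume="loud" and pitch="high". A real system would replace the stub with a model call and the heuristics with trained classifiers.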