US 12,412,574 B1
Conversation-based skill component for assessing a user's state
Katherine M Ryan, Seattle, WA (US); Avani Parakh, Seattle, WA (US); Chao Wang, Newton, MA (US); Viktor Rozgic, Belmont, MA (US); Siddhartha Reddy Jonnalagadda, Bothell, WA (US); Elizabeth Shriberg, Berkeley, CA (US); and Alexandros Potamianos, Santa Monica, CA (US)
Assigned to Amazon Technologies, Inc., Seattle, WA (US)
Filed by Amazon Technologies, Inc., Seattle, WA (US)
Filed on Sep. 29, 2022, as Appl. No. 17/956,137.
Int. Cl. G10L 15/22 (2006.01); G10L 15/06 (2013.01); G10L 15/16 (2006.01)
CPC G10L 15/22 (2013.01) [G10L 15/063 (2013.01); G10L 15/16 (2013.01); G10L 2015/0638 (2013.01); G10L 2015/223 (2013.01)] 20 Claims
OG exemplary drawing
 
1. A computer-implemented method comprising:
receiving, from a device, first input audio data corresponding to a first spoken natural language user input, wherein the first input audio data is associated with user profile data;
using the first input audio data, performing speech processing to determine the first spoken natural language user input is to be responded to using a speech-based conversational assessment component;
generating first output data including a first question related to a speech-based conversational assessment;
sending the first output data to the device for presentation;
after sending the first output data, receiving, from the device, second input audio data corresponding to a second spoken natural language user input responsive to the first question;
using the second input audio data, performing automatic speech recognition (ASR) processing to generate ASR results data corresponding to the second spoken natural language user input;
processing the ASR results data to generate lexical embedding data corresponding to the second spoken natural language user input;
processing the second input audio data to determine tone data representing a tone of the second spoken natural language user input;
processing the ASR results data to determine first topic data representing a first topic of the second spoken natural language user input;
generating state data using the ASR results data, the lexical embedding data, the tone data, and the first topic data;
determining past state data associated with the user profile data, wherein the past state data corresponds to one or more speech-based conversational assessments;
processing the state data and the past state data using a first trained machine learning model to determine a first type of response to the second spoken natural language user input;
processing the state data and the past state data using a second trained machine learning model to determine a second type of response to the second spoken natural language user input;
based on at least one of the first type of response and the second type of response, generating second output data including a second question related to the speech-based conversational assessment; and
sending the second output data to the device for presentation.
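The claim above recites a multi-step dialogue pipeline: ASR on the user's follow-up utterance, lexical embedding, tone and topic extraction, combination into state data alongside past assessment state tied to the user profile, and two trained machine learning models that each propose a response type used to select the next assessment question. The Python sketch below is an illustrative, non-authoritative reading of that flow under stated assumptions; every class, function, and string here (StateData, run_asr, detect_tone, the example question text, and so on) is a hypothetical stub and is not drawn from the patent's actual implementation.

    from dataclasses import dataclass, field
    from typing import List

    # Hypothetical containers mirroring the claim's intermediate artifacts.
    @dataclass
    class StateData:
        asr_text: str
        lexical_embedding: List[float]
        tone: str
        topic: str

    @dataclass
    class UserProfile:
        user_id: str
        past_states: List[StateData] = field(default_factory=list)

    def run_asr(audio: bytes) -> str:
        """Stub ASR: a real system would decode audio into a transcript."""
        return "i have been sleeping poorly this week"

    def lexical_embed(text: str) -> List[float]:
        """Stub lexical embedding, e.g. the output of a sentence encoder."""
        return [float(len(token)) for token in text.split()]

    def detect_tone(audio: bytes) -> str:
        """Stub acoustic tone classifier (e.g. flat, agitated, neutral)."""
        return "flat"

    def detect_topic(text: str) -> str:
        """Stub topic classifier over the ASR results."""
        return "sleep"

    def response_type_model_a(state: StateData, history: List[StateData]) -> str:
        """Stand-in for the claim's first trained model: proposes a response type."""
        return "follow_up_question"

    def response_type_model_b(state: StateData, history: List[StateData]) -> str:
        """Stand-in for the claim's second trained model: proposes a response type."""
        return "empathetic_acknowledgement"

    def next_assessment_turn(audio: bytes, profile: UserProfile) -> str:
        # Steps mirroring the claim: ASR, lexical embedding, tone, topic, state data.
        asr_text = run_asr(audio)
        state = StateData(
            asr_text=asr_text,
            lexical_embedding=lexical_embed(asr_text),
            tone=detect_tone(audio),
            topic=detect_topic(asr_text),
        )
        # Past state data associated with the user profile (prior assessments).
        history = profile.past_states
        # Two trained models each propose a type of response.
        type_a = response_type_model_a(state, history)
        type_b = response_type_model_b(state, history)
        # Use at least one of the proposed response types to pick the next question.
        if "follow_up_question" in (type_a, type_b):
            output = "How long has your sleep been like this?"
        else:
            output = "Thank you for sharing. How is your energy during the day?"
        profile.past_states.append(state)
        return output

    if __name__ == "__main__":
        profile = UserProfile(user_id="example-user")
        print(next_assessment_turn(b"\x00\x01", profile))

The two stub models stand in for the claim's first and second trained machine learning models; in a real system they would be learned classifiers operating on the state data and past state data rather than fixed return values.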