US 11,908,468 B2
Dialog management for multiple users
Prakash Krishnan, Santa Clara, CA (US); Arindam Mandal, Redwood City, CA (US); Siddhartha Reddy Jonnalagadda, Bothell, WA (US); Nikko Strom, Kirkland, WA (US); Ariya Rastrow, Seattle, WA (US); Ying Shi, Beijing (CN); David Chi-Wai Tang, Palo Alto, CA (US); Nishtha Gupta, San Jose, CA (US); Aaron Challenner, Cambridge, MA (US); Bonan Zheng, Los Angeles, CA (US); Angeliki Metallinou, Mountain View, CA (US); Vincent Auvray, San Jose, CA (US); and Minmin Shen, Milpitas, CA (US)
Assigned to Amazon Technologies, Inc., Seattle, WA (US)
Filed by Amazon Technologies, Inc., Seattle, WA (US)
Filed on Dec. 4, 2020, as Appl. No. 17/112,520.
Claims priority of provisional application 63/081,012, filed on Sep. 21, 2020.
Prior Publication US 2022/0093101 A1, Mar. 24, 2022
Int. Cl. G10L 25/78 (2013.01); G10L 15/22 (2006.01); G10L 15/24 (2013.01); G10L 15/08 (2006.01); G10L 15/06 (2013.01); G06V 40/20 (2022.01); G06F 3/16 (2006.01); G10L 13/08 (2013.01); G10L 15/20 (2006.01); G06V 40/10 (2022.01); G06V 10/40 (2022.01); G10L 15/02 (2006.01); G06F 18/24 (2023.01)
CPC G10L 15/22 (2013.01) [G06F 3/167 (2013.01); G06F 18/24 (2023.01); G06V 10/40 (2022.01); G06V 40/10 (2022.01); G06V 40/20 (2022.01); G10L 13/08 (2013.01); G10L 15/02 (2013.01); G10L 15/063 (2013.01); G10L 15/08 (2013.01); G10L 15/20 (2013.01); G10L 15/222 (2013.01); G10L 15/24 (2013.01); G10L 2015/0635 (2013.01); G10L 2015/088 (2013.01); G10L 2015/223 (2013.01); G10L 2015/227 (2013.01)] 20 Claims
OG exemplary drawing
 
1. A computer-implemented method comprising:
receiving, by a user device comprising at least one microphone and at least one speaker, output audio data representing synthesized speech of a list of entries;
using the at least one speaker, beginning playback of audio corresponding to the output audio data;
during the playback of the audio, detecting, by the at least one microphone, user speech;
determining input audio data representing the user speech;
determining a first time corresponding to the beginning of the playback;
determining a second time corresponding to detection of the user speech;
determining, by the user device, offset time data representing a difference between the first time and the second time;
using a first trained machine learning (ML) model, processing the input audio data to determine that the user speech is system directed;
based at least in part on determining that the user speech is system directed, sending, to at least one remote device, the input audio data and the offset time data;
performing automatic speech recognition (ASR) processing on the input audio data to determine ASR output data representing a transcript of the input audio data;
using a second trained ML model, performing natural language understanding (NLU) processing on the ASR output data to determine NLU output data representing at least an intent corresponding to the user speech;
based at least in part on the NLU output data, determining that the user speech refers to an entry that is absent from the user speech;
based at least in part on the entry being absent from the user speech, processing the offset time data to determine the entry is a first entry in the list of entries; and
causing an action to be performed based at least in part on the first entry.
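Below is a minimal, hypothetical Python sketch of the offset-based reference resolution recited in claim 1. Everything here is illustrative: the class and function names, the per-entry audio durations, and the example list entries are assumptions, not anything disclosed in the patent, and the claim's trained ML models (the system-directed classifier and the NLU model) are reduced to a precomputed `nlu_referent` input. The sketch shows only the timing logic: the offset time data (speech detection time minus playback start time) is mapped to the list entry whose synthesized audio was playing when the user barged in.

```python
"""Hypothetical sketch of the offset-based entry resolution in claim 1.

All names, durations, and entries below are illustrative assumptions;
the ML stages of the claim (system-directed detection, ASR, NLU) are
stubbed out as inputs to this function.
"""
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class ListPlayback:
    """A synthesized list read-out: entries plus each entry's audio duration (s)."""
    entries: list[str]
    durations_s: list[float]

    def entry_at_offset(self, offset_s: float) -> str:
        """Map an offset from the start of playback to the entry being spoken."""
        # Cumulative end time of each entry's audio segment.
        ends, total = [], 0.0
        for d in self.durations_s:
            total += d
            ends.append(total)
        # bisect_right finds the first segment whose end time exceeds the
        # offset; clamp to the last entry if speech arrives after playback ends.
        idx = min(bisect_right(ends, offset_s), len(self.entries) - 1)
        return self.entries[idx]


def handle_barge_in(playback: ListPlayback,
                    playback_start_s: float,
                    speech_detect_s: float,
                    nlu_referent: str | None) -> str:
    """Resolve the target entry for a barge-in utterance.

    If NLU found an explicit entry in the transcript, use it. Otherwise the
    entry is absent from the user speech, so fall back to the offset time
    data, per the claim's final resolution step.
    """
    if nlu_referent is not None:
        return nlu_referent
    offset_s = speech_detect_s - playback_start_s  # the offset time data
    return playback.entry_at_offset(offset_s)


if __name__ == "__main__":
    # Device reads out three restaurant names, 2.5 s of audio each (illustrative).
    playback = ListPlayback(
        entries=["Thai Basil", "Luigi's", "Casa Oaxaca"],
        durations_s=[2.5, 2.5, 2.5],
    )
    # User says "book that one" 1.2 s after playback began, i.e. while the
    # first entry is still being spoken and no entry name is in the speech.
    print(handle_barge_in(playback, playback_start_s=10.0,
                          speech_detect_s=11.2, nlu_referent=None))
    # -> "Thai Basil", the first entry in the list of entries
```

One hedged observation on the claim's structure: the claim has the user device, not the remote device, determine the offset time data, plausibly because both timestamps (playback start and speech detection) are local to the device, so the offset is unaffected by the network latency of sending the input audio data upstream.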