| CPC G10L 15/1822 (2013.01) [G10L 13/02 (2013.01); G10L 15/22 (2013.01); G10L 2015/223 (2013.01)] | 20 Claims |

|
1. A computer-implemented method, comprising:
receiving first input audio data corresponding to a first utterance detected by a device;
performing speech processing using the first input audio data to determine at least a first natural language understanding (NLU) hypothesis and a second NLU hypothesis for the first utterance;
determining that the first NLU hypothesis corresponds to a first intent;
determining that the second NLU hypothesis corresponds to a second intent;
determining that the first NLU hypothesis more likely represents what the first utterance meant than the second NLU hypothesis;
using a first component to obtain, from a first skill component associated with the first intent, first visual content and results data responsive to the first utterance;
causing the device to present the first visual content;
performing speech synthesis using the results data to generate output audio data responsive to the first utterance;
causing the device to present output audio corresponding to the output audio data;
in response to the first input audio data, using a second component to obtain, from a second skill component, second visual content;
causing the device to present the second visual content while presenting the first visual content;
receiving, from the device, input data corresponding to a second input;
determining that the second input corresponds to the second visual content;
obtaining output content corresponding to the second skill component and responsive to the second input; and
causing the device to present the output content.
|
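
The flow recited in claim 1 can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the claimed implementation. The names `NluHypothesis`, `Skill`, `Device`, `synthesize`, `handle_utterance`, and `handle_second_input` are hypothetical, as are the numeric confidence score used to rank hypotheses and the sample intents. The sketch also assumes that the second skill component is the one associated with the second intent, which the claim does not require.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class NluHypothesis:
    intent: str
    score: float  # assumed confidence score; higher means more likely


@dataclass
class Skill:
    name: str
    # Hypothetical interface: takes a request string and returns a dict
    # with "visual" (visual content) and "results" (results data).
    handle: Callable[[str], Dict[str, str]]


@dataclass
class Device:
    screen: List[str] = field(default_factory=list)  # content on display
    audio: List[str] = field(default_factory=list)   # audio outputs, in order


def synthesize(text: str) -> str:
    """Stand-in for speech synthesis; a real system would return audio data."""
    return f"<audio:{text}>"


def handle_utterance(hypotheses: List[NluHypothesis],
                     skills: Dict[str, Skill],
                     device: Device) -> Dict[str, Skill]:
    """Respond to the top hypothesis and also show content for the runner-up.

    Returns a map from each supplemental visual item to the skill behind it,
    so that a later input can be matched to that item.
    """
    if len(hypotheses) < 2:
        raise ValueError("need at least two NLU hypotheses")
    ranked = sorted(hypotheses, key=lambda h: h.score, reverse=True)
    first, second = ranked[0], ranked[1]

    # "First component": visual content and results data from the first skill.
    primary = skills[first.intent].handle(first.intent)
    device.screen.append(primary["visual"])
    device.audio.append(synthesize(primary["results"]))

    # "Second component": visual content from the second skill, shown
    # alongside the first visual content (assumed here to follow the
    # second intent).
    second_skill = skills[second.intent]
    supplemental = second_skill.handle(second.intent)
    device.screen.append(supplemental["visual"])
    return {supplemental["visual"]: second_skill}


def handle_second_input(selected: str,
                        shown: Dict[str, Skill],
                        device: Device) -> Optional[str]:
    """If the second input targets supplemental content, present that skill's output."""
    skill = shown.get(selected)
    if skill is None:
        return None  # the input does not correspond to the second visual content
    output = skill.handle(selected)["results"]
    device.screen.append(output)
    return output


# Example with two made-up skills and intents.
skills = {
    "GetWeather": Skill("weather", lambda req: {
        "visual": "weather card", "results": "Sunny, 72 degrees"}),
    "PlayMusic": Skill("music", lambda req: {
        "visual": "playlist tile", "results": "Playing your playlist"}),
}
device = Device()
shown = handle_utterance(
    [NluHypothesis("PlayMusic", 0.3), NluHypothesis("GetWeather", 0.6)],
    skills, device)
reply = handle_second_input("playlist tile", shown, device)
```

In this sketch the higher-scoring hypothesis drives both the spoken response and the first visual content, while the lower-scoring hypothesis contributes only on-screen content until a second input selects it.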