US 12,243,517 B1
Utterance endpointing in task-oriented conversational systems
Mahnoosh Mehrabani, North Bethesda, MD (US); and Srinivas Bangalore, Morristown, NJ (US)
Assigned to Interactions LLC, Franklin, MA (US)
Filed by Interactions LLC, Franklin, MA (US)
Filed on Oct. 13, 2021, as Appl. No. 17/500,834.
Int. Cl. G10L 15/18 (2013.01); G10L 15/02 (2006.01); G10L 15/06 (2013.01); G10L 15/08 (2006.01)
CPC G10L 15/1815 (2013.01) [G10L 15/02 (2013.01); G10L 15/063 (2013.01); G10L 15/083 (2013.01); G10L 15/1807 (2013.01); G10L 2015/0636 (2013.01)]
20 Claims
 
1. A computer-implemented method performed by a task-oriented dialog system, the method comprising:
receiving, in real time, a user utterance provided during a task-oriented communication session between a user and a virtual agent (VA); and
for each incremental portion of a sequence of incremental portions of the user utterance, wherein consecutive incremental portions in the sequence are demarcated in the utterance by a pause and expand to include newly-uttered words in the user utterance:
recognizing a plurality of words in the incremental portion using an automated speech recognition (ASR) model;
applying a natural language processing (NLP) model to the plurality of words to generate semantic information for the incremental portion of the user utterance, the semantic information comprising information describing parts of speech and relationships between the parts of speech;
generating an acoustic-prosodic signature of the incremental portion of the user utterance using an acoustic-prosodic model;
generating a feature vector representative of at least the plurality of words, the semantic information, and the acoustic-prosodic signature;
applying a trained model to the feature vector, the trained model configured to determine a confidence score indicative of a likelihood that the incremental portion of the user utterance includes an endpoint of the user utterance; and
in response to the confidence score meeting or exceeding a threshold score indicating that the incremental portion of the user utterance includes the endpoint of the user utterance, causing the VA to generate a response utterance to respond to the user.
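
The per-portion loop recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Portion` type, the toy `semantic_features` and `prosodic_signature` functions, the hand-set `WEIGHTS` and `BIAS`, and the 0.5 threshold are all hypothetical stand-ins for the ASR, NLP, acoustic-prosodic, and trained models named in the claim, and a logistic score is assumed purely for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Portion:
    """One incremental portion of the utterance, closed by a pause (hypothetical type)."""
    words: List[str]          # stand-in for the words recognized by an ASR model
    final_pitch_slope: float  # stand-in prosodic measurement (negative = falling pitch)
    pause_seconds: float      # length of the pause that demarcates this portion


# Hypothetical function words that suggest the user has not finished speaking.
DANGLING = {"and", "to", "the", "my", "is", "for", "um"}


def semantic_features(words: Sequence[str]) -> List[float]:
    """Toy stand-in for the NLP model: 1.0 if the portion ends on a dangling word."""
    return [1.0 if words and words[-1].lower() in DANGLING else 0.0]


def prosodic_signature(portion: Portion) -> List[float]:
    """Toy stand-in for the acoustic-prosodic model."""
    return [portion.final_pitch_slope, portion.pause_seconds]


def feature_vector(portion: Portion) -> List[float]:
    """Combine word, semantic, and prosodic information into one vector."""
    return ([float(len(portion.words))]
            + semantic_features(portion.words)
            + prosodic_signature(portion))


# Hand-set parameters standing in for a trained model (assumption, not learned).
WEIGHTS = [0.1, -3.0, -2.0, 2.0]
BIAS = -1.0


def endpoint_confidence(x: Sequence[float]) -> float:
    """Logistic score in (0, 1): assumed likelihood that the portion holds the endpoint."""
    z = BIAS + sum(w * xi for w, xi in zip(WEIGHTS, x))
    return 1.0 / (1.0 + math.exp(-z))


def find_endpoint(portions: Sequence[Portion], threshold: float = 0.5) -> Optional[int]:
    """Return the index of the first portion whose score meets or exceeds the threshold."""
    for i, portion in enumerate(portions):
        if endpoint_confidence(feature_vector(portion)) >= threshold:
            return i  # at this point the VA would generate its response
    return None       # no endpoint detected yet; keep listening


# Each portion expands the previous one with newly uttered words.
DEMO_PORTIONS = [
    Portion(["i", "want", "to"], 0.2, 0.3),
    Portion(["i", "want", "to", "check", "my"], 0.1, 0.4),
    Portion(["i", "want", "to", "check", "my", "balance"], -0.5, 0.6),
]
```

With these illustrative parameters, `find_endpoint(DEMO_PORTIONS)` skips the first two portions, which end on dangling words after short pauses, and selects the third. A real system would replace the toy feature functions with model outputs and learn the scoring parameters from labeled dialog data.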