| CPC G10L 17/06 (2013.01) [G10L 17/04 (2013.01); G10L 17/18 (2013.01); G10L 17/22 (2013.01); G10L 17/02 (2013.01)] | 22 Claims |

|
1. A computer-implemented method comprising:
receiving, by a device, audio data corresponding to a first spoken user input;
generating, using the audio data, a first vector representing first speech characteristics of the first spoken user input, the first vector comprising a first plurality of values including at least a first value and a second value;
identifying a second vector associated with the device and a first user profile identifier, the second vector representing second speech characteristics of a first user corresponding to the first user profile identifier, the second vector comprising a second plurality of values;
identifying a third vector associated with the device and a second user profile identifier, the third vector representing third speech characteristics of a second user corresponding to the second user profile identifier, the third vector comprising a third plurality of values;
determining a machine learning (ML) model corresponding to a group of users associated with the device, the ML model being configured using at least:
a first positive sample comprising a fourth vector representing a second spoken user input associated with the first user profile identifier and a fifth vector representing a third spoken user input associated with the first user profile identifier,
a second positive sample comprising a sixth vector representing a fourth spoken user input associated with the second user profile identifier and a seventh vector representing a fifth spoken user input associated with the second user profile identifier, and
a negative sample comprising the fourth vector and the sixth vector;
processing, using the ML model, the first vector to generate an eighth vector representing a portion of the first speech characteristics, the eighth vector comprising a fourth plurality of values including the first value and excluding the second value, the fourth plurality of values comprising fewer values than the first plurality of values;
processing, using the ML model, the second vector to generate a ninth vector representing a portion of the second speech characteristics, the ninth vector comprising a fifth plurality of values comprising fewer values than the second plurality of values;
processing, using the ML model, the third vector to generate a tenth vector representing a portion of the third speech characteristics, the tenth vector comprising a sixth plurality of values comprising fewer values than the third plurality of values;
determining a first score representing a similarity between the eighth vector and the ninth vector;
determining a second score representing a similarity between the eighth vector and the tenth vector;
determining, based at least in part on the first score and the second score, that the first spoken user input corresponds to the first user profile identifier; and
determining, using the first user profile identifier, output data responsive to the first spoken user input.
|
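The steps recited in claim 1 can be illustrated with a minimal sketch, written under several assumptions that the claim itself does not state. The "ML model" is stood in for by a simple dimension selector, which is one way to produce an output that keeps some input values, drops others, and has fewer values overall. The similarity score is assumed to be cosine similarity. All function names, profile identifiers, and the toy four-value vectors are hypothetical and chosen only for illustration.

```python
import math

def build_pairs(utterances_by_profile):
    """Form positive pairs (same profile) and negative pairs (different profiles)."""
    positives, negatives = [], []
    profiles = sorted(utterances_by_profile)
    for i, profile in enumerate(profiles):
        vecs = utterances_by_profile[profile]
        # Two utterance vectors from the same profile form a positive sample.
        for a in range(len(vecs)):
            for b in range(a + 1, len(vecs)):
                positives.append((vecs[a], vecs[b]))
        # The first utterance vector of two different profiles forms a negative sample.
        for other in profiles[i + 1:]:
            negatives.append((vecs[0], utterances_by_profile[other][0]))
    return positives, negatives

def fit_dimension_selector(positives, negatives, keep):
    """Assumed stand-in for the ML model: pick the `keep` dimensions where
    negative pairs differ most and positive pairs differ least."""
    dim = len(positives[0][0])

    def mean_gap(pairs, d):
        return sum(abs(x[d] - y[d]) for x, y in pairs) / len(pairs)

    separation = [mean_gap(negatives, d) - mean_gap(positives, d) for d in range(dim)]
    ranked = sorted(range(dim), key=lambda d: separation[d], reverse=True)
    return sorted(ranked[:keep])

def project(vector, selected):
    """Reduce a vector to the selected dimensions (fewer values than the input)."""
    return [vector[d] for d in selected]

def cosine(u, v):
    """Cosine similarity, assumed here as the similarity score."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def identify(input_vector, enrolled, selected):
    """Score the reduced input against each reduced enrolled vector; return the best profile."""
    reduced = project(input_vector, selected)
    scores = {pid: cosine(reduced, project(vec, selected)) for pid, vec in enrolled.items()}
    return max(scores, key=scores.get), scores

# Hypothetical toy data: prior utterance vectors for the two profiles on one device.
utterances = {
    "profile_a": [[0.9, 0.5, 0.1, 0.3], [0.8, 0.1, 0.2, 0.7]],
    "profile_b": [[0.1, 0.4, 0.9, 0.6], [0.2, 0.8, 0.8, 0.2]],
}
# Hypothetical enrolled vectors (the "second" and "third" vectors of the claim).
enrolled = {
    "profile_a": [0.85, 0.3, 0.15, 0.5],
    "profile_b": [0.15, 0.6, 0.85, 0.4],
}
input_vector = [0.8, 0.7, 0.2, 0.4]  # the "first vector" from the new spoken input

positives, negatives = build_pairs(utterances)
selected = fit_dimension_selector(positives, negatives, keep=2)
best, scores = identify(input_vector, enrolled, selected)
print(selected, best, scores)
```

In this sketch the selector is fitted only on vectors from the users of one device, which mirrors the claim's model "corresponding to a group of users associated with the device". A learned projection or a neural network could fill the same role; the selector is used here only because it makes the "including the first value and excluding the second value" limitation easy to see.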