US 12,327,564 B1
Voice-based user recognition
Zhenning Tan, Union City, CA (US); Eunjung Han, Los Altos, CA (US); Ruirui Li, Sunnyvale, CA (US); Hongda Mao, Fremont, CA (US); Yuguang Yang, Charlotte, NC (US); Oguz Hasan Elibol, Sunnyvale, CA (US); Itay Teller, Sunnyvale, CA (US); Mohamed G. Mahmoud, Santa Clara, CA (US); and Andreas Stolcke, Alameda, CA (US)
Assigned to Amazon Technologies, Inc., Seattle, WA (US)
Filed by Amazon Technologies, Inc., Seattle, WA (US)
Filed on Sep. 29, 2021, as Appl. No. 17/488,520.
Claims priority of provisional application 63/241,075, filed on Sep. 6, 2021.
Int. Cl. G10L 17/04 (2013.01); G10L 17/02 (2013.01); G10L 17/06 (2013.01); G10L 17/18 (2013.01); G10L 17/22 (2013.01)
CPC G10L 17/06 (2013.01) [G10L 17/04 (2013.01); G10L 17/18 (2013.01); G10L 17/22 (2013.01); G10L 17/02 (2013.01)] 22 Claims
OG exemplary drawing
 
1. A computer-implemented method comprising:
receiving, by a device, audio data corresponding to a first spoken user input;
generating, using the audio data, a first vector representing first speech characteristics of the first spoken user input, the first vector comprising a first plurality of values including at least a first value and a second value;
identifying a second vector associated with the device and a first user profile identifier, the second vector representing second speech characteristics of a first user corresponding to the first user profile identifier, the second vector comprising a second plurality of values;
identifying a third vector associated with the device and a second user profile identifier, the third vector representing third speech characteristics of a second user corresponding to the second user profile identifier, the third vector comprising a third plurality of values;
determining a machine learning (ML) model corresponding to a group of users associated with the device, the ML model being configured using at least:
a first positive sample comprising a fourth vector representing a second spoken user input associated with the first user profile identifier and a fifth vector representing a third spoken user input associated with the first user profile identifier,
a second positive sample comprising a sixth vector representing a fourth spoken user input associated with the second user profile identifier and a seventh vector representing a fifth spoken user input associated with the second user profile identifier, and
a negative sample comprising the fourth vector and the sixth vector;
processing, using the ML model, the first vector to generate an eighth vector representing a portion of the first speech characteristics, the eighth vector comprising a fourth plurality of values including the first value and excluding the second value, the fourth plurality of values comprising fewer values than the first plurality of values;
processing, using the ML model, the second vector to generate a ninth vector representing a portion of the second speech characteristics, the ninth vector comprising a fifth plurality of values comprising fewer values than the second plurality of values;
processing, using the ML model, the third vector to generate a tenth vector representing a portion of the third speech characteristics, the tenth vector comprising a sixth plurality of values comprising fewer values than the third plurality of values;
determining a first score representing a similarity between the eighth vector and the ninth vector;
determining a second score representing a similarity between the eighth vector and the tenth vector;
determining, based at least in part on the first score and the second score, that the first spoken user input corresponds to the first user profile identifier; and
determining, using the first user profile identifier, output data responsive to the first spoken user input.
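Illustrative sketch (not part of the claim): claim 1 recites generating a speaker embedding for an utterance, applying a group-specific machine learning model that reduces the embedding and the enrolled users' embeddings to shorter vectors, scoring the reduced utterance vector against each reduced enrolled vector, and selecting the best-matching user profile. The Python/PyTorch sketch below illustrates one plausible reading of that flow under stated assumptions: the embedding sizes, the linear projection, the cosine similarity scores, the CosineEmbeddingLoss contrastive objective, and all function and variable names are illustrative assumptions and are not taken from the patent specification.

    # Hypothetical sketch of the recognition flow recited in claim 1.
    # Dimensions, the linear projection, and the contrastive objective
    # are assumptions for illustration, not details from the patent.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    EMB_DIM = 256      # size of the full speaker embedding (assumed)
    REDUCED_DIM = 64   # size of the shorter, group-specific vector (assumed)

    class GroupProjection(nn.Module):
        """Per-device model mapping a full speaker embedding to a shorter
        vector, keeping characteristics that separate this device's users."""
        def __init__(self):
            super().__init__()
            self.proj = nn.Linear(EMB_DIM, REDUCED_DIM)

        def forward(self, x):
            return F.normalize(self.proj(x), dim=-1)

    def train_group_model(pairs, epochs=50):
        """pairs: list of (vec_a, vec_b, label), where label is +1 for a
        positive sample (two utterances from the same user profile) and -1
        for a negative sample (utterances from different user profiles),
        mirroring the training samples recited in the claim."""
        model = GroupProjection()
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        loss_fn = nn.CosineEmbeddingLoss(margin=0.2)
        for _ in range(epochs):
            for vec_a, vec_b, label in pairs:
                opt.zero_grad()
                loss = loss_fn(model(vec_a).unsqueeze(0),
                               model(vec_b).unsqueeze(0),
                               torch.tensor([float(label)]))
                loss.backward()
                opt.step()
        return model

    def recognize(model, utterance_vec, enrolled):
        """enrolled: dict mapping user profile identifier -> stored embedding.
        Returns the profile whose reduced vector is most similar to the
        reduced vector of the incoming utterance, plus all scores."""
        query = model(utterance_vec)
        scores = {pid: F.cosine_similarity(query, model(vec), dim=-1).item()
                  for pid, vec in enrolled.items()}
        return max(scores, key=scores.get), scores

In this reading, the group-specific projection plays the role of the claimed ML model: because it is trained only on the device's own users, the reduced vectors emphasize the speech characteristics that distinguish those particular users from one another, and the comparison scores (here, cosine similarities) between reduced vectors drive the selection of the first user profile identifier.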