US 11,893,999 B1
Speech based user recognition
Sai Sailesh Kopuri, Seattle, WA (US); John Moore, Acton, MA (US); Sundararajan Srinivasan, Sunnyvale, CA (US); Aparna Khare, San Jose, CA (US); Arindam Mandal, San Jose, CA (US); Spyridon Matsoukas, Hopkinton, MA (US); and Rohit Prasad, Lexington, MA (US)
Assigned to Amazon Technologies, Inc., Seattle, WA (US)
Filed by Amazon Technologies, Inc., Seattle, WA (US)
Filed on Aug. 6, 2018, as Appl. No. 16/055,755.
Claims priority of provisional application 62/670,828, filed on May 13, 2018.
Int. Cl. G10L 17/22 (2013.01); G10L 17/04 (2013.01); G10L 17/10 (2013.01); G06F 40/20 (2020.01)

CPC G10L 17/22 (2013.01) [G06F 40/20 (2020.01); G10L 17/04 (2013.01); G10L 17/10 (2013.01)]

20 Claims

1. A method, comprising:

receiving, from a first device, first audio data representing a first spoken user input;

generating a first feature vector representing first speech characteristics of the first spoken user input, the first feature vector being unassociated with specific user profile data;

determining a group profile identifier associated with the first device, the group profile identifier being associated with a plurality of user profile identifiers;

determining a first stored feature vector associated with the group profile identifier, the first stored feature vector representing second speech characteristics of a second spoken user input, the first stored feature vector being unassociated with specific user profile data;

determining a first similarity value between the first feature vector and the first stored feature vector;

determining the first audio data comprises a first number of speech frames;

determining the first number of speech frames satisfies a threshold number of speech frames;

generating, based at least in part on the first similarity value and the first number of speech frames satisfying the threshold number of speech frames, a first user recognition feature vector using the first feature vector and the first stored feature vector;

storing first data associating the first user recognition feature vector with the group profile identifier;

receiving, from the first device and after storing the first data, second audio data representing a third spoken user input;

generating a second feature vector representing third speech characteristics of the third spoken user input;

determining a second similarity value between the second feature vector and the first user recognition feature vector;

determining, based at least in part on the second similarity value, system usage data associated with the first user recognition feature vector, the system usage data representing at least first content output in response to at least one previous spoken user input corresponding to the first user recognition feature vector, the system usage data and the first user recognition feature vector being unassociated with specific user profile data;

determining, using the system usage data, an action responsive to the third spoken user input; and

performing the action responsive to the third spoken user input.
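
The steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation. The claim does not specify a similarity measure, a way of combining feature vectors, threshold values, or a storage layout, so cosine similarity, element-wise averaging, the numeric thresholds, and every name below (`GroupStore`, `VoiceCluster`, `enroll`, `recognize`, `choose_action`) are assumptions made for illustration only.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


def cosine_similarity(a, b):
    """Assumed similarity measure; the claim only recites a 'similarity value'."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


@dataclass
class VoiceCluster:
    """A user recognition feature vector plus usage data, tied to no user profile."""
    vector: list
    usage: list = field(default_factory=list)  # e.g. content previously output


class GroupStore:
    """Hypothetical store keyed by group profile identifier."""

    def __init__(self, sim_threshold=0.8, min_speech_frames=50):
        self.sim_threshold = sim_threshold          # assumed value
        self.min_speech_frames = min_speech_frames  # assumed value
        self.pending = {}   # group_id -> stored feature vectors
        self.clusters = {}  # group_id -> list of VoiceCluster

    def enroll(self, group_id, feature_vector, num_speech_frames):
        """Combine a new vector with a similar stored one when enough speech exists."""
        if num_speech_frames < self.min_speech_frames:
            return None  # assumption: too little speech, so the vector is ignored
        stored_vectors = self.pending.setdefault(group_id, [])
        for stored in stored_vectors:
            if cosine_similarity(feature_vector, stored) >= self.sim_threshold:
                # Assumed combination rule: element-wise mean of the two vectors.
                merged = [(x + y) / 2.0 for x, y in zip(feature_vector, stored)]
                cluster = VoiceCluster(vector=merged)
                self.clusters.setdefault(group_id, []).append(cluster)
                stored_vectors.remove(stored)
                return cluster
        stored_vectors.append(feature_vector)
        return None

    def recognize(self, group_id, feature_vector):
        """Return the best-matching cluster for the group, or None below threshold."""
        best, best_sim = None, self.sim_threshold
        for cluster in self.clusters.get(group_id, []):
            sim = cosine_similarity(feature_vector, cluster.vector)
            if sim >= best_sim:
                best, best_sim = cluster, sim
        return best


def choose_action(cluster, default="play_top_hits"):
    """Pick an action from usage data; the action names are hypothetical."""
    if cluster is None or not cluster.usage:
        return default
    most_common = Counter(cluster.usage).most_common(1)[0][0]
    return f"play_{most_common}"
```

In this sketch, two sufficiently long and similar utterances from the same device group are merged by `enroll` into one `VoiceCluster`. A later utterance is matched with `recognize`, and `choose_action` uses the usage data recorded for that cluster to pick a response. No step refers to an individual user profile, which mirrors the claim's requirement that the vectors and usage data remain unassociated with specific user profile data.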