| CPC G10L 17/18 (2013.01) [G10L 17/02 (2013.01); G10L 17/04 (2013.01)] | 20 Claims |

|
1. A computer-implemented method executed on data processing hardware that causes the data processing hardware to perform operations comprising:
receiving a plurality of training samples, each training sample comprising:
a training verification utterance;
a training enrollment utterance spoken by a corresponding speaker; and
a first classification for the training sample that indicates whether a speaker of the training utterance is the same or different from the corresponding speaker that spoke the training enrollment utterance;
training a neural network on the plurality of training samples by:
for each training sample:
processing, using the neural network, audio signals characterizing the training verification utterance to generate a first training speaker representation for the training verification utterance;
processing, using the neural network, the audio signals characterizing the training enrollment utterance to generate a second training speaker representation;
determining a second classification for the training sample based on the first training speaker representation and the second training speaker representation, the second classification for the training sample indicating whether the speaker of the training utterance is the same or different from the corresponding speaker that spoke the training enrollment utterance; and
adjusting parameters of the neural network based on a comparison of the first classification of the training sample and the second classification determined for the training sample; and
transmitting the trained neural network over a network to a user device.
|
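The training operations recited in claim 1 can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the patent: the single tanh layer standing in for the neural network, the cosine-similarity-plus-sigmoid rule used to derive the second classification, the binary cross-entropy comparison, the finite-difference gradient estimate, the JSON serialization standing in for transmission to a user device, and all function names and feature values.

```python
import json
import math


def embed(weights, features):
    """Toy stand-in for the neural network: one tanh layer producing a speaker representation."""
    return [math.tanh(sum(w * x for w, x in zip(row, features))) for row in weights]


def same_speaker_probability(weights, verification, enrollment, scale=5.0):
    """Second classification: estimated probability that both utterances share a speaker."""
    a = embed(weights, verification)   # first training speaker representation
    b = embed(weights, enrollment)     # second training speaker representation
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    cosine = dot / (norms + 1e-9)
    return 1.0 / (1.0 + math.exp(-scale * cosine))


def sample_loss(weights, sample):
    """Binary cross-entropy comparing the first (labelled) and second (predicted) classification."""
    verification, enrollment, label = sample
    p = same_speaker_probability(weights, verification, enrollment)
    p = min(max(p, 1e-7), 1.0 - 1e-7)  # keep log() well defined
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def mean_loss(weights, samples):
    return sum(sample_loss(weights, s) for s in samples) / len(samples)


def train(weights, samples, epochs=200, lr=0.05, eps=1e-4):
    """Per-sample gradient descent; gradients are estimated by central finite differences."""
    weights = [row[:] for row in weights]  # leave the caller's parameters untouched
    for _ in range(epochs):
        for sample in samples:
            grads = []
            for row in weights:
                grad_row = []
                for j in range(len(row)):
                    original = row[j]
                    row[j] = original + eps
                    up = sample_loss(weights, sample)
                    row[j] = original - eps
                    down = sample_loss(weights, sample)
                    row[j] = original
                    grad_row.append((up - down) / (2 * eps))
                grads.append(grad_row)
            # Adjust parameters based on the comparison of the two classifications.
            for row, grad_row in zip(weights, grads):
                for j, g in enumerate(grad_row):
                    row[j] -= lr * g
    return weights


def serialize_model(weights):
    """Pack the trained parameters into bytes that could be sent to a user device."""
    return json.dumps({"weights": weights}).encode("utf-8")


# Hypothetical feature vectors for two speakers; label 1 = same speaker, 0 = different.
samples = [
    ([1.0, 0.2], [0.9, 0.3], 1),
    ([0.2, 1.0], [0.3, 0.9], 1),
    ([1.0, 0.2], [0.3, 0.9], 0),
    ([0.2, 1.0], [0.9, 0.3], 0),
]
initial_weights = [[0.5, 0.3], [-0.2, 0.4]]
trained_weights = train(initial_weights, samples)
payload = serialize_model(trained_weights)
```

Each tuple in `samples` mirrors one training sample of the claim: a verification utterance, an enrollment utterance, and the first classification. A practical system would presumably use a deeper network with backpropagation rather than finite differences; the sketch only shows how the two representations, the second classification, and the parameter update relate to one another.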