US 12,148,433 B2
Neural networks for speaker verification
Georg Heigold, Mountain View, CA (US); Samuel Bengio, Los Altos, CA (US); and Ignacio Lopez Moreno, New York, NY (US)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Oct. 11, 2023, as Appl. No. 18/485,069.
Application 18/485,069 is a continuation of application No. 17/444,384, filed on Aug. 3, 2021.
Application 17/444,384 is a continuation of application No. 16/752,007, filed on Jan. 24, 2020, granted, now 11,107,478, issued on Aug. 31, 2021.
Application 16/752,007 is a continuation of application No. 15/966,667, filed on Apr. 30, 2018, granted, now 10,586,542, issued on Mar. 10, 2020.
Application 15/966,667 is a continuation of application No. 14/846,187, filed on Sep. 4, 2015, granted, now 9,978,374, issued on May 22, 2018.
Prior Publication US 2024/0038245 A1, Feb. 1, 2024
This patent is subject to a terminal disclaimer.
Int. Cl. G10L 17/18 (2013.01); G10L 17/02 (2013.01); G10L 17/04 (2013.01)

CPC G10L 17/18 (2013.01) [G10L 17/02 (2013.01); G10L 17/04 (2013.01)]

20 Claims

1. A computer-implemented method executed on data processing hardware that causes the data processing hardware to perform operations comprising:

receiving a plurality of training samples, each training sample comprising:

a training verification utterance;

a training enrollment utterance spoken by a corresponding speaker; and

a first classification for the training sample that indicates whether a speaker of the training utterance is the same or different from the corresponding speaker that spoke the training enrollment utterance;

training a neural network on the plurality of training samples by:

for each training sample:

processing, using the neural network, audio signals characterizing the training verification utterance to generate a first training speaker representation for the training verification utterance;

processing, using the neural network, the audio signals characterizing the training enrollment utterance to generate a second training speaker representation;

determining a second classification for the training sample based on the first training speaker representation and the second training speaker representation, the second classification for the training sample indicating whether the speaker of the training utterance is the same or different from the corresponding speaker that spoke the training enrollment utterance; and

adjusting parameters of the neural network based on a comparison of the first classification of the training sample and the second classification determined for the training sample; and

transmitting the trained neural network over a network to a user device.
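
The training operations recited in claim 1 can be illustrated with a minimal sketch. The claim does not specify the network architecture, the comparison function, or the loss, so everything concrete below is an assumption made for brevity: a single linear layer stands in for the neural network, a dot product of the two speaker representations followed by a logistic function yields the second classification, and a cross-entropy gradient step adjusts the parameters. The names `TrainingSample`, `init_params`, `embed`, `forward`, `mean_loss`, `train`, and `serialize` are hypothetical, and JSON encoding is only one possible way to prepare the trained parameters for transmission to a user device.

```python
import json
import math
import random
from dataclasses import dataclass
from typing import Dict, List

Vector = List[float]


@dataclass
class TrainingSample:
    verification: Vector  # features characterizing the training verification utterance
    enrollment: Vector    # features characterizing the training enrollment utterance
    same_speaker: int     # first classification: 1 = same speaker, 0 = different


def init_params(in_dim: int, embed_dim: int = 2, seed: int = 0) -> Dict:
    """Randomly initialize the stand-in network (assumed: one linear layer)."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)]
               for _ in range(embed_dim)]
    return {"weights": weights, "scale": 1.0, "bias": 0.0}


def embed(params: Dict, features: Vector) -> Vector:
    """Map utterance features to a speaker representation."""
    return [sum(w * x for w, x in zip(row, features)) for row in params["weights"]]


def forward(params: Dict, sample: TrainingSample):
    """Return P(same speaker), the raw score, and both speaker representations."""
    e_ver = embed(params, sample.verification)  # first training speaker representation
    e_enr = embed(params, sample.enrollment)    # second training speaker representation
    score = sum(a * b for a, b in zip(e_ver, e_enr))
    # The clamp only guards math.exp against overflow.
    z = max(-30.0, min(30.0, params["scale"] * score + params["bias"]))
    prob = 1.0 / (1.0 + math.exp(-z))
    return prob, score, e_ver, e_enr


def mean_loss(params: Dict, samples: List[TrainingSample]) -> float:
    """Average cross-entropy between the first and second classifications."""
    total = 0.0
    for s in samples:
        prob = forward(params, s)[0]
        total -= math.log(prob if s.same_speaker else 1.0 - prob)
    return total / len(samples)


def train(params: Dict, samples: List[TrainingSample],
          epochs: int = 200, lr: float = 0.1) -> Dict:
    """Adjust parameters by comparing predicted and labeled classifications."""
    for _ in range(epochs):
        for s in samples:
            prob, score, e_ver, e_enr = forward(params, s)
            err = prob - s.same_speaker        # d(loss)/d(logit)
            d_score = err * params["scale"]    # d(loss)/d(score)
            for i, row in enumerate(params["weights"]):
                for j in range(len(row)):
                    row[j] -= lr * d_score * (e_enr[i] * s.verification[j]
                                              + e_ver[i] * s.enrollment[j])
            params["scale"] -= lr * err * score
            params["bias"] -= lr * err
    return params


def serialize(params: Dict) -> bytes:
    """Encode trained parameters as bytes (assumed format) for sending to a device."""
    return json.dumps(params).encode("utf-8")
```

In this sketch, a caller would build a list of `TrainingSample` objects, call `train(init_params(in_dim), samples)`, and pass the result of `serialize` to whatever transport delivers the model to the user device. Thresholding the probability returned by `forward` at 0.5 is one way to turn it into a same-or-different decision.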