US 12,236,970 B2
Separating speech by source in audio recordings by predicting isolated audio signals conditioned on speaker representations
Neil Zeghidour, Paris (FR); and David Grangier, Mountain View, CA (US)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Oct. 17, 2022, as Appl. No. 17/967,726.
Application 17/967,726 is a continuation of application No. 17/170,657, filed on Feb. 8, 2021, granted, now Pat. No. 11,475,909.
Claims priority of provisional application 62/971,632, filed on Feb. 7, 2020.
Prior Publication US 2023/0112265 A1, Apr. 13, 2023
This patent is subject to a terminal disclaimer.
Int. Cl. G10L 21/028 (2013.01); G06N 3/045 (2023.01); G06N 3/08 (2023.01); G10L 17/04 (2013.01); G10L 17/18 (2013.01); G10L 21/0208 (2013.01); G10L 21/0272 (2013.01); G10L 21/0316 (2013.01); G10L 25/30 (2013.01)
CPC G10L 21/028 (2013.01) [G06N 3/045 (2023.01); G06N 3/08 (2013.01); G10L 17/04 (2013.01); G10L 17/18 (2013.01); G10L 21/0208 (2013.01); G10L 21/0272 (2013.01); G10L 21/0316 (2013.01); G10L 25/30 (2013.01); G10L 2021/02087 (2013.01)] 21 Claims
OG exemplary drawing
 
1. A method performed by one or more computers, the method comprising:
obtaining a recording comprising speech from a plurality of speakers;
processing the recording using a speaker neural network having speaker parameter values, comprising:
for each time step of multiple time steps, generating a respective set of per-time-step speaker representations, each per-time-step speaker representation representing features identifying a respective speaker in the recording for the time step;
generating a plurality of per-recording speaker representations based on the respective sets of per-time-step speaker representations, wherein each per-recording speaker representation represents features of a respective identified speaker in the recording;
processing the per-recording speaker representations and the recording using a separation neural network having separation parameter values, comprising:
for each per-recording speaker representation of the per-recording speaker representations, generating a respective predicted isolated audio signal that corresponds to speech of one of the speakers in the recording;
wherein the separation parameter values and the speaker parameter values are updated by training the speaker neural network and the separation neural network jointly using common training examples; wherein the speaker parameter values and the separation parameter values are updated by minimizing a separation loss between predicted isolated audio signals generated by the separation neural network and corresponding ground-truth audio signals in the common training examples, and the speaker parameter values are updated by minimizing a speaker loss, wherein the speaker loss measures, at each time step of the multiple time steps in a given common training example, (i) a first distance between a per-time-step speaker representation identifying a corresponding speaker and an embedding for the speaker for the time step and (ii) a second distance between two different per-time-step speaker representations at the same time step.
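As an aid to reading the claim, the following is a minimal sketch of one possible "speaker neural network": a model that, for each time step, emits a set of per-time-step speaker representations and then pools them over time into per-recording speaker representations. The claim does not fix an architecture; this sketch assumes PyTorch, a small 1-D convolutional encoder, a fixed number of speaker slots, and mean-pooling as the aggregation. All names (SpeakerNet, n_speakers, dim) are illustrative assumptions, not the patent's implementation.

```python
import torch
import torch.nn as nn

class SpeakerNet(nn.Module):
    """Hypothetical speaker network: per-time-step and per-recording representations."""

    def __init__(self, n_speakers: int = 2, dim: int = 128):
        super().__init__()
        self.n_speakers, self.dim = n_speakers, dim
        # 1-D conv stack mapping the raw waveform to frame-level features.
        self.encoder = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=16, stride=8), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=3, padding=1), nn.ReLU(),
        )
        # One linear head producing n_speakers representations per frame.
        self.heads = nn.Linear(dim, n_speakers * dim)

    def forward(self, wav: torch.Tensor):
        # wav: [batch, samples] -> feats: [batch, dim, time]
        feats = self.encoder(wav.unsqueeze(1))
        b, _, t = feats.shape
        # Per-time-step speaker representations: [batch, time, n_speakers, dim].
        per_step = self.heads(feats.transpose(1, 2)).view(b, t, self.n_speakers, self.dim)
        # Per-recording speaker representations by mean-pooling over time:
        # [batch, n_speakers, dim].
        per_rec = per_step.mean(dim=1)
        return per_step, per_rec
```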
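A matching sketch of the "separation neural network": the mixture is encoded once, each per-recording speaker representation is broadcast across time and fused with the mixture features, and one predicted isolated waveform is decoded per speaker. Conditioning by channel-wise concatenation is an assumption; the claim leaves the conditioning mechanism open.

```python
import torch
import torch.nn as nn

class SeparationNet(nn.Module):
    """Hypothetical separation network conditioned on per-recording speaker representations."""

    def __init__(self, dim: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=16, stride=8), nn.ReLU())
        # Fuses mixture features with one broadcast speaker representation.
        self.fuse = nn.Sequential(
            nn.Conv1d(2 * dim, dim, kernel_size=3, padding=1), nn.ReLU())
        self.decoder = nn.ConvTranspose1d(dim, 1, kernel_size=16, stride=8)

    def forward(self, wav: torch.Tensor, per_rec: torch.Tensor):
        # wav: [batch, samples]; per_rec: [batch, n_speakers, dim]
        mix = self.encoder(wav.unsqueeze(1))  # [batch, dim, time]
        signals = []
        for s in range(per_rec.shape[1]):
            # Broadcast this speaker's representation across time, fuse, decode.
            cond = per_rec[:, s, :, None].expand(-1, -1, mix.shape[-1])
            h = self.fuse(torch.cat([mix, cond], dim=1))
            signals.append(self.decoder(h).squeeze(1))  # predicted isolated signal
        # For input lengths compatible with the stride (e.g. 8000 samples),
        # the decoded length matches the input: [batch, n_speakers, samples].
        return torch.stack(signals, dim=1)
```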
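Finally, one plausible instantiation of the joint training described in the wherein clauses, continuing the sketches above: an L1 separation loss against the ground-truth isolated signals, plus a per-time-step speaker loss with an attraction term for distance (i) and a hinge repulsion term for distance (ii). The specific distances, the margin, the two-speaker assumption, and the slot-to-speaker alignment (omitted here) are all assumptions; the claim only requires that both distances be measured at each time step.

```python
import torch
import torch.nn.functional as F

def separation_loss(pred, target):
    # pred, target: [batch, n_speakers, samples]; plain L1 as one choice of distance.
    return (pred - target).abs().mean()

def speaker_loss(per_step, speaker_emb, margin: float = 1.0):
    # per_step: [batch, time, n_speakers, dim]; speaker_emb: [batch, n_speakers, dim],
    # assumed already aligned slot-to-speaker.
    # Term (i): pull each per-time-step representation toward its speaker's embedding.
    attract = ((per_step - speaker_emb[:, None]) ** 2).sum(-1).mean()
    # Term (ii): push the two slots apart at the same time step (assumes n_speakers == 2).
    dist = ((per_step[:, :, 0] - per_step[:, :, 1]) ** 2).sum(-1)
    repel = F.relu(margin - dist).mean()
    return attract + repel

# Joint update of both networks' parameters on a common training example.
spk_net, sep_net = SpeakerNet(), SeparationNet()
opt = torch.optim.Adam(list(spk_net.parameters()) + list(sep_net.parameters()), lr=1e-4)

mixture = torch.randn(4, 8000)      # toy batch of 0.5 s mixtures at 16 kHz
targets = torch.randn(4, 2, 8000)   # ground-truth isolated audio signals
spk_embs = torch.randn(4, 2, 128)   # reference embeddings for the speakers

per_step, per_rec = spk_net(mixture)
pred = sep_net(mixture, per_rec)
loss = separation_loss(pred, targets) + speaker_loss(per_step, spk_embs)
opt.zero_grad()
loss.backward()
opt.step()
```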