CPC G10L 21/028 (2013.01) [G06N 3/045 (2023.01); G06N 3/08 (2013.01); G10L 17/04 (2013.01); G10L 17/18 (2013.01); G10L 21/0208 (2013.01); G10L 21/0272 (2013.01); G10L 21/0316 (2013.01); G10L 25/30 (2013.01); G10L 2021/02087 (2013.01)]

21 Claims

1. A method performed by one or more computers, the method comprising:
obtaining a recording comprising speech from a plurality of speakers;
processing the recording using a speaker neural network having speaker parameter values, comprising:
for each time step of multiple time steps, generating a respective set of per-time-step speaker representations, each per-time-step speaker representation representing features identifying a respective speaker in the recording for the time step;
generating a plurality of per-recording speaker representations based on the respective sets of per-time-step speaker representations, wherein each per-recording speaker representation represents features of a respective identified speaker in the recording;
processing the per-recording speaker representations and the recording using a separation neural network having separation parameter values, comprising:
for each per-recording speaker representation of the per-recording speaker representations, generating a respective predicted isolated audio signal that corresponds to speech of one of the speakers in the recording;
wherein the separation parameter values and the speaker parameter values are updated by training the speaker neural network and the separation neural network jointly using common training examples;
wherein the speaker parameter values and the separation parameter values are updated by minimizing a separation loss between predicted isolated audio signals generated by the separation neural network and corresponding ground-truth audio signals in the common training examples, and the speaker parameter values are updated by minimizing a speaker loss; and
wherein the speaker loss measures, at each time step of the multiple time steps in a given common training example, (i) a first distance between a per-time-step speaker representation identifying a corresponding speaker and an embedding for the speaker for the time step and (ii) a second distance between two different per-time-step speaker representations at the same time step.
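Read as a formula, one natural contrastive form of the speaker loss recited in claim 1 is the following sketch; the symbols, the signs, and the margin are assumptions, since the claim only states that the loss measures the two distances:

$$
\mathcal{L}_{\text{spk}} \;=\; \sum_{t=1}^{T} \sum_{i} \Big[\, d\big(s_{i,t},\, e_{i,t}\big) \;+\; \sum_{j \neq i} \max\big(0,\; m - d\big(s_{i,t},\, s_{j,t}\big)\big) \Big],
$$

where $s_{i,t}$ is the per-time-step representation for speaker $i$ at time step $t$, $e_{i,t}$ is the embedding for that speaker at that time step, $d(\cdot,\cdot)$ is a distance, and $m$ is a margin that keeps different speakers' representations apart at the same time step.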
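A minimal PyTorch sketch of the joint training recited in claim 1, under stated assumptions: the architectures (`SpeakerNet`, `SeparationNet`), a fixed count of two speakers, mean-pooling over time to derive per-recording representations from per-time-step representations, mean-squared error for both the separation loss and the distances, and the margin hinge are all illustrative choices the claim does not fix.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N_SPEAKERS, DIM = 2, 64  # assumed; the claim does not fix either value


class SpeakerNet(nn.Module):
    """Toy speaker network: emits one representation per speaker per time step."""

    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(input_size=1, hidden_size=DIM, batch_first=True)
        self.heads = nn.Linear(DIM, N_SPEAKERS * DIM)

    def forward(self, mixture):                            # mixture: (B, T)
        h, _ = self.rnn(mixture.unsqueeze(-1))             # (B, T, DIM)
        reps = self.heads(h)                               # (B, T, S*DIM)
        return reps.view(*mixture.shape, N_SPEAKERS, DIM)  # (B, T, S, DIM)


class SeparationNet(nn.Module):
    """Toy separation network conditioned on per-recording speaker representations."""

    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(DIM + 1, 1)

    def forward(self, mixture, spk_rec):                   # (B, T), (B, S, DIM)
        B, T = mixture.shape
        cond = spk_rec.unsqueeze(2).expand(B, N_SPEAKERS, T, DIM)
        x = mixture.unsqueeze(1).unsqueeze(-1).expand(B, N_SPEAKERS, T, 1)
        return self.proj(torch.cat([x, cond], -1)).squeeze(-1)  # (B, S, T)


def speaker_loss(per_step, target_emb, margin=1.0):
    """Contrastive reading of the claim's speaker loss: minimize the first
    distance (representation vs. that speaker's embedding at the same step)
    and keep the second distance (between the two speakers' representations
    at the same step) above a margin. Signs and margin are assumptions."""
    pull = F.mse_loss(per_step, target_emb)
    push = ((per_step[:, :, 0] - per_step[:, :, 1]) ** 2).mean()
    return pull + F.relu(margin - push)


speaker_net, sep_net = SpeakerNet(), SeparationNet()
# One optimizer over both parameter sets: the joint update of the claim.
opt = torch.optim.Adam(list(speaker_net.parameters()) + list(sep_net.parameters()))

# Toy common training example: mixture, ground-truth isolated signals,
# and per-time-step speaker embeddings (assumed available as targets).
mixture = torch.randn(8, 100)                          # (B, T)
truth = torch.randn(8, N_SPEAKERS, 100)                # (B, S, T)
target_emb = torch.randn(8, 100, N_SPEAKERS, DIM)      # (B, T, S, DIM)

per_step = speaker_net(mixture)                        # per-time-step representations
per_recording = per_step.mean(dim=1)                   # pooled per-recording representations
estimates = sep_net(mixture, per_recording)            # predicted isolated signals

sep_loss = F.mse_loss(estimates, truth)                # separation loss vs. ground truth
loss = sep_loss + speaker_loss(per_step, target_emb)
opt.zero_grad()
loss.backward()
opt.step()
```

Because the separation loss backpropagates through `per_recording` into the speaker network, a single optimizer step updates the speaker parameter values from both losses while the separation parameter values are updated only through the separation loss, matching the allocation of updates in the claim.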