US 12,217,768 B2
Audio-visual separation of on-screen sounds based on machine learning models
Efthymios Tzinis, Urbana, IL (US); Scott Wisdom, Boston, MA (US); Aren Jansen, Mountain View, CA (US); and John R. Hershey, Winchester, MA (US)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Jul. 26, 2023, as Appl. No. 18/226,545.
Application 18/226,545 is a continuation of application No. 17/214,186, filed on Mar. 26, 2021, granted, now 11,756,570.
Prior Publication US 2023/0386502 A1, Nov. 30, 2023
Int. Cl. G10L 25/57 (2013.01); G06F 18/214 (2023.01); G06N 3/088 (2023.01); G06V 20/40 (2022.01); G10L 25/30 (2013.01)

CPC G10L 25/57 (2013.01) [G06F 18/214 (2023.01); G06N 3/088 (2013.01); G06V 20/40 (2022.01); G10L 25/30 (2013.01)]

20 Claims

1. A computer-implemented method, comprising:

receiving, by a computing device, an audio waveform associated with a plurality of video frames of video content;

determining, by a time-domain convolutional masking network of a neural network and from the audio waveform, a neural representation comprising one or more audio frames, wherein each audio frame of the one or more audio frames comprises a respective plurality of coefficients, and wherein the respective plurality of coefficients represent one or more audio features in an encoded mixture of the audio waveform;

predicting, by the neural network and based on the neural representation, one or more audio sources associated with the plurality of video frames;

modifying the one or more predicted audio sources by applying a mixture consistency projection that constrains the one or more predicted audio sources to add up to the received audio waveform; and

providing, by the computing device, the modified one or more predicted audio sources.
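
The mixture consistency projection recited in claim 1 can be illustrated with a minimal sketch. The claim states only that the projection constrains the predicted sources to add up to the received waveform; the uniform-weight form used here, in which the residual between the mixture and the summed estimates is split equally across the sources, is an assumption, as are the function name `mixture_consistency` and the toy sample values. It is not presented as the patented implementation.

```python
from typing import List, Sequence


def mixture_consistency(
    mixture: Sequence[float],
    sources: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Project source estimates so they sum, sample by sample, to the mixture.

    Uniform-weight variant (an assumption, not taken from the claim): the
    residual between the mixture and the summed estimates is divided
    equally among all predicted sources.
    """
    num_sources = len(sources)
    if num_sources == 0:
        raise ValueError("need at least one predicted source")
    if any(len(s) != len(mixture) for s in sources):
        raise ValueError("every source must match the mixture length")

    # Copy the estimates so the caller's data is left unmodified.
    projected = [list(s) for s in sources]
    for t, x_t in enumerate(mixture):
        # How far the summed estimates are from the mixture at sample t.
        residual = x_t - sum(s[t] for s in sources)
        share = residual / num_sources
        for m in range(num_sources):
            projected[m][t] += share
    return projected


# Hypothetical toy data: a 4-sample mixture and two imperfect estimates.
mixture = [1.0, 0.5, -0.25, 0.0]
estimates = [
    [0.5, 0.25, 0.0, 0.25],
    [0.25, 0.0, -0.5, 0.0],
]
projected = mixture_consistency(mixture, estimates)

# After projection the sources recombine to the mixture at every sample.
recombined = [sum(column) for column in zip(*projected)]
```

In this sketch each predicted source receives the same correction at every sample, so the relative differences between the sources are preserved while their sum is forced to match the received waveform.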