| CPC G10L 25/57 (2013.01) [G06F 18/214 (2023.01); G06N 3/088 (2013.01); G06V 20/40 (2022.01); G10L 25/30 (2013.01)] | 20 Claims |

|
1. A computer-implemented method, comprising:
receiving, by a computing device, an audio waveform associated with a plurality of video frames of video content;
determining, by a time-domain convolutional masking network of a neural network and from the audio waveform, a neural representation comprising one or more audio frames, wherein each audio frame of the one or more audio frames comprises a respective plurality of coefficients, and wherein the respective plurality of coefficients represent one or more audio features in an encoded mixture of the audio waveform;
predicting, by the neural network and based on the neural representation, one or more audio sources associated with the plurality of video frames;
modifying the one or more predicted audio sources by applying a mixture consistency projection that constrains the one or more predicted audio sources to add up to the received audio waveform; and
providing, by the computing device, the modified one or more predicted audio sources.
|
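
The mixture consistency projection recited in claim 1 can be illustrated with a minimal sketch. The claim only requires that the projection constrain the predicted sources to add up to the received waveform; it does not specify a formula. The version below assumes the simplest unweighted form, in which the residual between the mixture and the sum of the predictions is split equally across the sources. The function name `mixture_consistency_projection`, the array shapes, and the random test signals are hypothetical and chosen only for illustration.

```python
import numpy as np


def mixture_consistency_projection(sources, mixture):
    """Adjust predicted sources so that they sum to the mixture.

    Assumption: unweighted projection, i.e. the residual is shared
    equally among the sources. The claim does not fix this choice.

    sources: array of shape (num_sources, num_samples)
    mixture: array of shape (num_samples,)
    """
    sources = np.asarray(sources, dtype=np.float64)
    mixture = np.asarray(mixture, dtype=np.float64)
    num_sources = sources.shape[0]
    # Residual: the part of the mixture not explained by the predictions.
    residual = mixture - sources.sum(axis=0)
    # Add an equal share of the residual to every predicted source.
    return sources + residual / num_sources


# Hypothetical data: a 1-second waveform at 16 kHz and three predicted sources.
rng = np.random.default_rng(0)
mixture = rng.standard_normal(16000)
predicted = rng.standard_normal((3, 16000))

projected = mixture_consistency_projection(predicted, mixture)
# After projection the sources sum to the mixture, up to floating-point error.
print(np.allclose(projected.sum(axis=0), mixture))
```

Under this assumption each source is changed by the same amount at every sample. A weighted variant, which would distribute the residual unevenly across the sources, would also satisfy the constraint in the claim.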