US 12,431,159 B2
Audio source separation systems and methods
Emile de la Rey, Wellington (NZ); and Paris Smaragdis, Urbana, IL (US)
Assigned to WingNut Films Productions Limited, Wellington (NZ)
Filed by WingNut Films Productions Limited, Wellington (NZ)
Filed on Jun. 23, 2022, as Appl. No. 17/848,341.
Claims priority of provisional application 63/272,650, filed on Oct. 27, 2021.
Prior Publication US 2023/0126779 A1, Apr. 27, 2023
Int. Cl. G10L 25/81 (2013.01); G10L 15/06 (2013.01); G10L 21/0272 (2013.01); G10L 25/30 (2013.01); G10L 25/51 (2013.01); G10L 25/84 (2013.01)
CPC G10L 25/51 (2013.01) [G10L 15/063 (2013.01); G10L 21/0272 (2013.01); G10L 25/30 (2013.01)] 18 Claims
OG exemplary drawing
 
1. A system comprising:
a memory component storing machine-readable instructions; and
a logic device configured to execute the machine-readable instructions to implement:
a trained audio source separation model configured to receive an audio input sample comprising a single-track mixture of audio signals generated from a plurality of audio sources and generate a plurality of audio stems, the plurality of audio stems corresponding to one or more audio sources of the plurality of audio sources; and
a self-iterative training system configured to perform a plurality of training iterations, a training iteration comprising generating a new audio source separation model based at least in part on a training dataset comprising a subset of the generated plurality of audio stems from the preceding training iteration, wherein a subset of the generated plurality of audio stems comprises one or more of: a part of one stem, multiple parts of one stem, one complete stem, multiple stems, or a combination of one complete stem and one or more parts of another stem, and wherein the new audio source separation model generated in each iteration is increasingly specific to the mixture of audio signals in the audio input sample,
wherein the self-iterative training system is further configured to determine whether the new audio source separation model is increasingly specific to the mixture of audio signals in the audio input sample by calculating a first quality metric associated with the generated plurality of audio stems, the first quality metric providing a first performance measure of the audio source separation model of the prior iteration, calculating a second quality metric associated with the audio stems generated in the present iteration, the second quality metric providing a second performance measure of the new audio source separation model, and wherein the second quality metric is greater than the first quality metric; and
wherein the new audio source separation model is configured to re-process the audio input sample to generate a plurality of enhanced audio stems.
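The self-iterative loop the claim recites (separate → score → retrain on the model's own stems → accept the new model only if the quality metric improves → re-process the input) can be sketched in miniature. Everything below is illustrative and hypothetical, not the patented method: the toy "model" is a single gain per stem, the quality metric is a stand-in negative mean-absolute-error against reference stems, retraining is a single subgradient step, and the full set of generated stems is used rather than a selected subset; the claim does not fix a model architecture, metric, or training algorithm.

```python
def separate(gains, mixture):
    """Toy 'separation model': each stem is the mixture scaled by a learned gain."""
    return [[g * x for x in mixture] for g in gains]


def quality(stems, refs):
    """Stand-in quality metric: negative mean absolute error vs. reference stems.

    Higher is better, so the claim's 'second metric greater than first metric'
    acceptance test becomes a simple comparison.
    """
    err = sum(abs(s - r) for stem, ref in zip(stems, refs)
              for s, r in zip(stem, ref))
    n = sum(len(stem) for stem in stems)
    return -err / n


def retrain(gains, stems, refs, mixture, lr=0.1):
    """Toy 'new model' generation: one subgradient step per gain on stem error.

    In the claim, retraining uses a subset of the stems generated by the
    preceding iteration's model; here all stems are used for brevity.
    """
    new_gains = []
    for g, stem, ref in zip(gains, stems, refs):
        grad = sum(((s > r) - (s < r)) * x
                   for s, r, x in zip(stem, ref, mixture))
        new_gains.append(g - lr * grad / len(mixture))
    return new_gains


def self_iterative_train(mixture, refs, gains, max_iters=50):
    """Iterate: build a new model, re-separate, keep it only if quality rises."""
    stems = separate(gains, mixture)
    score = quality(stems, refs)
    for _ in range(max_iters):
        new_gains = retrain(gains, stems, refs, mixture)
        new_stems = separate(new_gains, mixture)   # re-process the input sample
        new_score = quality(new_stems, refs)
        if new_score <= score:                     # metric did not improve: stop
            break
        gains, stems, score = new_gains, new_stems, new_score
    return gains, stems, score


# Demo on a 4-sample mixture of two synthetic sources.
mixture = [1.0, 2.0, 1.0, 2.0]
refs = [[1.0, 0.0, 1.0, 0.0], [0.0, 2.0, 0.0, 2.0]]
gains, stems, score = self_iterative_train(mixture, refs, [0.5, 0.5])
```

The acceptance test (`new_score <= score` → stop) mirrors the claim's requirement that each retained model be increasingly specific to the input mixture, as evidenced by the second quality metric exceeding the first.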