US 11,756,551 B2
System and method for producing metadata of an audio signal
Niko Moritz, Allston, MA (US); Gordon Wichern, Boston, MA (US); Takaaki Hori, Lexington, MA (US); and Jonathan Le Roux, Arlington, MA (US)
Assigned to Mitsubishi Electric Research Laboratories, Inc., Cambridge, MA (US)
Filed by Mitsubishi Electric Research Laboratories, Inc., Cambridge, MA (US)
Filed on Oct. 7, 2020, as Appl. No. 17/64,986.
Prior Publication US 2022/0108698 A1, Apr. 7, 2022
Int. Cl. G10L 15/26 (2006.01); G10L 15/16 (2006.01)
CPC G10L 15/26 (2013.01) [G10L 15/16 (2013.01)]
19 Claims
 
1. An audio processing system, comprising:
an input interface configured to receive an audio signal;
a memory configured to store a neural network trained to determine different types of attributes of multiple concurrent audio events of different origins,
wherein the different types of attributes include time-dependent and time-agnostic attributes of speech and non-speech audio events,
wherein a model of the neural network shares at least some parameters for determining the different types of attributes,
wherein the neural network is trained jointly to perform multiple different transcription tasks using the shared parameters for performing each of the multiple different transcription tasks,
wherein the multiple different transcription tasks include an automatic speech recognition (ASR) transcription task, an acoustic event detection (AED) transcription task, and an audio tagging (AT) transcription task,
wherein the model of the neural network includes a transformer model and a connectionist temporal classification (CTC)-based model, and
wherein the transformer model includes an encoder and a decoder;
a processor configured to:
process the audio signal with the encoder to encode the audio signal;
process the encoded audio signal with the decoder to execute ASR decoding, AED decoding, and AT decoding to produce a decoder output;
process the encoded audio signal with the CTC-based model to execute the ASR decoding and the AED decoding to produce a CTC output, wherein the decoder output and the CTC output of the ASR decoding and the AED decoding are jointly scored to produce a joint decoding output; and
process the joint decoding output and the decoder output to produce metadata of the audio signal, the metadata including one or multiple attributes of one or multiple audio events in the audio signal; and
an output interface configured to output the metadata of the audio signal.
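
The joint scoring step recited in claim 1 can be illustrated with a minimal sketch. The claim does not specify how the decoder output and the CTC output are combined, so the weighted sum of log-probabilities below is an assumption, borrowed from common hybrid CTC/attention decoding practice. The names `Hypothesis`, `joint_score`, `produce_metadata`, and the `ctc_weight` parameter are hypothetical and do not appear in the patent. ASR and AED hypotheses are ranked by the joint score, while the AT result is taken from the decoder output alone, mirroring the claim's statement that the CTC-based model executes only the ASR and AED decoding.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Hypothesis:
    """One candidate label sequence for a task (hypothetical container)."""
    task: str            # "asr" or "aed"
    labels: List[str]    # words for ASR, event labels for AED
    decoder_logp: float  # log-probability from the transformer decoder
    ctc_logp: float      # log-probability from the CTC-based model


def joint_score(hyp: Hypothesis, ctc_weight: float = 0.3) -> float:
    """Combine the two scores by a weighted sum of log-probabilities.

    The interpolation and the default weight are assumptions; the claim
    only states that the two outputs are jointly scored.
    """
    return ctc_weight * hyp.ctc_logp + (1.0 - ctc_weight) * hyp.decoder_logp


def produce_metadata(
    hyps: List[Hypothesis],
    at_tags: List[str],
    ctc_weight: float = 0.3,
) -> Dict[str, List[str]]:
    """Pick the best jointly scored hypothesis per task and attach AT tags."""
    metadata: Dict[str, List[str]] = {}
    for task in ("asr", "aed"):
        candidates = [h for h in hyps if h.task == task]
        if candidates:
            best = max(candidates, key=lambda h: joint_score(h, ctc_weight))
            metadata[task] = best.labels
    # Audio tagging comes from the decoder output only (no CTC branch).
    metadata["at"] = list(at_tags)
    return metadata


# Illustrative values only; not results from the patent.
h1 = Hypothesis("asr", ["hello", "world"], decoder_logp=-1.0, ctc_logp=-5.0)
h2 = Hypothesis("asr", ["hello", "word"], decoder_logp=-1.5, ctc_logp=-1.0)
h3 = Hypothesis("aed", ["dog_bark"], decoder_logp=-0.5, ctc_logp=-0.7)
result = produce_metadata([h1, h2, h3], at_tags=["speech", "dog"])
```

In this example the decoder alone would prefer `h1`, but the low CTC score of `h1` lets `h2` win under the joint score, which is the kind of disagreement joint scoring is meant to resolve.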