US 12,452,590 B2
Method and system for sound event localization and detection
Gordon Wichern, Cambridge, MA (US); Olga Slizovskaia, Cambridge, MA (US); and Jonathan Le Roux, Cambridge, MA (US)
Assigned to Mitsubishi Electric Research Laboratories, Inc., Cambridge, MA (US)
Filed by Mitsubishi Electric Research Laboratories, Inc., Cambridge, MA (US)
Filed on Mar. 7, 2022, as Appl. No. 17/687,866.
Prior Publication US 2023/0283950 A1, Sep. 7, 2023
Int. Cl. H04R 3/00 (2006.01); G06N 3/08 (2023.01); H04R 1/40 (2006.01)

CPC H04R 3/005 (2013.01) [G06N 3/08 (2013.01); H04R 1/406 (2013.01); H04R 2201/401 (2013.01)]

19 Claims

1. A sound event localization and detection (SELD) system for localization of one or more target sound events, the SELD system comprising: at least one processor; and memory having instructions stored thereon that, when executed by the at least one processor, cause the SELD system to:

collect a first digital representation of an acoustic mixture of sounds of a plurality of sound events sensed by an acoustic sensor;

receive a second digital representation of a sound of a class corresponding to a target sound event in the acoustic mixture;

input the second digital representation to one or multiple feature-wise linear modulation (FiLM) blocks;

combine outputs of the one or multiple FiLM blocks with the outputs of intermediate layers of the neural network;

submit the combinations to next layers of the neural network;

process the first digital representation and the second digital representation with a neural network trained to produce a localization information of the target sound event indicative of a location of an origin of the target sound event in the acoustic mixture with respect to a location of the acoustic sensor sensing the acoustic mixture, wherein the neural network comprises a class conditioned SELD network to determine the localization information, wherein the class conditioned SELD network comprises at least one FiLM block that is directed to one or more convolution blocks, wherein the at least one FiLM block and the one or more convolution blocks are trained to identify the target sound event and estimate the localization information; and

output the localization information of the origin of the target sound event.
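
For illustration only, the conditioning path recited in claim 1 can be sketched as follows. This is a minimal sketch under stated assumptions, not the patented implementation. It assumes the second digital representation is a one-hot class vector, that each FiLM block maps that vector to a per-channel scale (gamma) and shift (beta), and that "combining" means applying gamma * x + beta to the output of an intermediate layer, as in the general FiLM technique. The names `FiLMBlock`, `modulate`, `linear`, and `one_hot` are hypothetical, and the random weights stand in for trained parameters.

```python
import random


def one_hot(class_index, num_classes):
    """Assumed form of the second digital representation: a one-hot class vector."""
    return [1.0 if i == class_index else 0.0 for i in range(num_classes)]


def linear(weights, bias, x):
    """Affine map y = W x + b on plain Python lists."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


class FiLMBlock:
    """Hypothetical FiLM block: class vector -> per-channel (gamma, beta)."""

    def __init__(self, num_classes, num_channels, seed=0):
        rng = random.Random(seed)

        def init_matrix():
            # Random placeholder weights; a real system would learn these.
            return [[rng.uniform(-1.0, 1.0) for _ in range(num_classes)]
                    for _ in range(num_channels)]

        self.w_gamma = init_matrix()
        self.b_gamma = [1.0] * num_channels  # scale starts near identity
        self.w_beta = init_matrix()
        self.b_beta = [0.0] * num_channels   # shift starts near zero

    def __call__(self, class_vector):
        gamma = linear(self.w_gamma, self.b_gamma, class_vector)
        beta = linear(self.w_beta, self.b_beta, class_vector)
        return gamma, beta


def modulate(features, gamma, beta):
    """Combine FiLM outputs with an intermediate layer's output.

    `features` is a list of channels, each a flat list of activations.
    Every activation in channel c becomes gamma[c] * value + beta[c].
    """
    return [[g * v + b for v in channel]
            for channel, g, b in zip(features, gamma, beta)]
```

In this sketch, the output of `modulate` would be what is submitted to the next layers of the network, with one `FiLMBlock` per conditioned convolution block. Changing the class vector changes gamma and beta, which is how a single network could be steered toward a different target sound event without retraining.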