CPC G10L 25/30 (2013.01) [G06V 10/26 (2022.01); G06V 10/774 (2022.01); G06V 10/82 (2022.01); G06V 10/95 (2022.01); G06V 20/10 (2022.01)] | 13 Claims |
1. A method for training a neural network to output a description of the environment in the vicinity of at least one sound acquisition device on the basis of an audio signal acquired by the sound acquisition device, the method comprising:
obtaining audio and image training signals of a scene showing an environment with objects generating sounds,
obtaining a target description of the environment seen on the image training signal,
inputting the audio training signal to the neural network so that the neural network outputs a training description of the environment, and
comparing the target description of the environment with the training description of the environment, wherein
the description of the environment, the target description of the environment, and the training description of the environment include at least one of a semantic segmentation of a frame of the image training signal or a depth map of a frame of the image training signal.
|
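The training steps recited in claim 1 can be illustrated with a minimal sketch. The claim does not fix a network architecture, loss function, or optimizer, so everything below is an assumption made for illustration: the "network" is a single linear layer, the description is a tiny depth map, the comparison is a mean squared error, and the names `teacher_depth`, `predict`, `train_step`, and `make_dataset` are hypothetical. The data are synthetic and stand in for real audio and image training signals.

```python
import random

H, W = 2, 3      # toy depth-map size (assumption, not from the claim)
N_AUDIO = 4      # toy audio feature length (assumption)


def teacher_depth(image_frame):
    """Hypothetical stand-in for whatever produces the target description
    from the image training signal. The toy "frame" already stores one
    depth value per pixel, so it is returned as a copy."""
    return list(image_frame)


def predict(weights, audio):
    """Linear stand-in for the neural network: audio features in,
    one depth value per pixel out (the training description)."""
    return [sum(w * a for w, a in zip(row, audio)) for row in weights]


def mse(pred, target):
    """Compare the training description with the target description."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def train_step(weights, audio, target, lr):
    """One gradient-descent update on the mean squared error.
    Returns the loss measured before the update."""
    pred = predict(weights, audio)
    loss = mse(pred, target)
    n = len(pred)
    for i, row in enumerate(weights):
        grad_out = 2.0 * (pred[i] - target[i]) / n
        for j in range(len(row)):
            row[j] -= lr * grad_out * audio[j]
    return loss


def make_dataset(rng, n_samples):
    """Synthetic (audio, image frame) pairs in which the frame depends
    linearly on the audio, so the toy task is learnable."""
    true_map = [[rng.uniform(-1, 1) for _ in range(N_AUDIO)]
                for _ in range(H * W)]
    data = []
    for _ in range(n_samples):
        audio = [rng.uniform(-1, 1) for _ in range(N_AUDIO)]
        image_frame = predict(true_map, audio)
        data.append((audio, image_frame))
    return data


def train(epochs=200, lr=0.5, seed=0):
    rng = random.Random(seed)
    data = make_dataset(rng, 32)                      # obtain training signals
    weights = [[0.0] * N_AUDIO for _ in range(H * W)]
    history = []
    for _ in range(epochs):
        total = 0.0
        for audio, image_frame in data:
            target = teacher_depth(image_frame)       # target description
            total += train_step(weights, audio, target, lr)  # input audio, compare
        history.append(total / len(data))
    return weights, history


weights, history = train()
```

Only the audio signal is fed to the model; the image frame is used solely to derive the target. A semantic segmentation target would follow the same pattern with per-pixel class scores and a classification loss in place of the mean squared error.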