US 12,288,567 B2
Method for training a neural network to describe an environment on the basis of an audio signal, and the corresponding neural network
Wim Abbeloos, Brussels (BE); Arun Balajee Vasudevan, Zurich (CH); Dengxin Dai, Zurich (CH); and Luc Van Gool, Zurich (CH)
Assigned to TOYOTA JIDOSHA KABUSHIKI KAISHA, Toyota (JP); and ETH ZÜRICH, Zurich (CH)
Appl. No. 17/792,073
Filed by TOYOTA MOTOR EUROPE, Brussels (BE); and ETH ZURICH, Zurich (CH)
PCT Filed Jan. 10, 2020, PCT No. PCT/EP2020/050605 § 371(c)(1), (2) Date Jul. 11, 2022, PCT Pub. No. WO2021/139899, PCT Pub. Date Jul. 15, 2021.
Prior Publication US 2023/0047017 A1, Feb. 16, 2023
Int. Cl. G06V 10/00 (2022.01); G06V 10/26 (2022.01); G06V 10/774 (2022.01); G06V 10/82 (2022.01); G06V 10/94 (2022.01); G06V 20/10 (2022.01); G10L 25/30 (2013.01)

CPC G10L 25/30 (2013.01) [G06V 10/26 (2022.01); G06V 10/774 (2022.01); G06V 10/82 (2022.01); G06V 10/95 (2022.01); G06V 20/10 (2022.01)]

13 Claims

1. A method for training a neural network to output a description of the environment in the vicinity of at least one sound acquisition device on the basis of an audio signal acquired by the sound acquisition device, the method comprising:

obtaining audio and image training signals of a scene showing an environment with objects generating sounds,

obtaining a target description of the environment seen on the image training signal,

inputting the audio training signal to the neural network so that the neural network outputs a training description of the environment, and

comparing the target description of the environment with the training description of the environment, wherein

the description of the environment, the target description of the environment, and the training description of the environment include at least one of a semantic segmentation of a frame of the image training signal or a depth map of a frame of the image training signal.
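
The steps recited in claim 1 can be illustrated with a minimal sketch. This is not the patented implementation. It assumes a depth-map description, a linear model standing in for the neural network, a mean-squared-error comparison, and plain gradient descent, none of which the claim specifies. The names `predict`, `mse`, `train_step`, and `train`, as well as the toy data, are hypothetical. The target depth maps are supplied directly here, whereas in practice they would be derived from frames of the image training signal.

```python
import random

def predict(weights, audio_features):
    """Linear stand-in for the network: audio features -> flattened depth map."""
    return [sum(w * x for w, x in zip(row, audio_features)) for row in weights]

def mse(pred, target):
    """Mean squared error between two flattened depth maps."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def train_step(weights, audio_features, target_depth, lr):
    """One update: output a training description, compare it to the target, adjust."""
    pred = predict(weights, audio_features)   # training description
    loss = mse(pred, target_depth)            # comparison with target description
    n = len(target_depth)
    for i, row in enumerate(weights):
        grad_out = 2.0 * (pred[i] - target_depth[i]) / n  # d(loss)/d(pred[i])
        for j in range(len(row)):
            row[j] -= lr * grad_out * audio_features[j]
    return loss

def train(samples, n_pixels, n_features, epochs=200, lr=0.1, seed=0):
    """Train on (audio_features, target_depth) pairs; return weights and loss history."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_features)]
               for _ in range(n_pixels)]
    history = []
    for _ in range(epochs):
        total = 0.0
        for audio_features, target_depth in samples:
            total += train_step(weights, audio_features, target_depth, lr)
        history.append(total / len(samples))
    return weights, history

# Toy data (assumed): 3 audio features per clip, target depth maps of 2x2 pixels.
samples = [
    ([1.0, 0.0, 0.5], [2.0, 2.0, 4.0, 4.0]),
    ([0.0, 1.0, 0.5], [6.0, 6.0, 1.0, 1.0]),
]
weights, history = train(samples, n_pixels=4, n_features=3)
```

After training, `predict(weights, audio_features)` yields a depth map from audio alone, which mirrors the claim's goal of describing the environment without an image at inference time. A semantic-segmentation variant would replace the regression loss with a per-pixel classification loss.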