| CPC G06T 7/73 (2017.01) [G06T 7/50 (2017.01); H04R 1/406 (2013.01); G06T 2207/20081 (2013.01); G06T 2207/20084 (2013.01)] | 19 Claims |

|
1. A method of estimating a pose of a subject human, the method comprising:
obtaining a data image of the subject human in a target environment;
obtaining a plurality of data audio recordings of the target environment while the subject human is present in the target environment;
determining, by a neural network (NN), a 3D metric pose of the subject human based on an input of the data image and the plurality of data audio recordings,
wherein the NN is trained using a training dataset including training images and training audio recordings captured in a plurality of training environments with respect to a plurality of training humans,
wherein the plurality of training environments comprises a first training environment and a second training environment, and the training comprises:
obtaining, using a first plurality of audio sensors and corresponding first audio recordings, a first plurality of empty room impulse responses in the first training environment while no human is present;
obtaining, using the first plurality of audio sensors and corresponding second audio recordings in the first training environment, a first plurality of occupied room impulse responses in the first training environment while a first training human is present;
obtaining, using a distance camera, a first training image of the first training human in the first training environment, wherein the distance camera provides first depth information;
obtaining, using a second plurality of audio sensors and corresponding third audio recordings in the second training environment, a second plurality of empty room impulse responses in the second training environment while no human is present;
obtaining, using the second plurality of audio sensors and corresponding fourth audio recordings in the second training environment, a second plurality of occupied room impulse responses in the second training environment while a second training human is present;
obtaining, using the distance camera, a second training image of the second training human in the second training environment, wherein the distance camera provides second depth information; and
training the NN based on the first plurality of empty room impulse responses, the first plurality of occupied room impulse responses, the second plurality of empty room impulse responses, the second plurality of occupied room impulse responses, the first training image, the first depth information, the second training image and the second depth information.
|
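
The claim recites obtaining empty and occupied room impulse responses from audio sensors and their corresponding recordings, but it does not state how an impulse response is derived from a recording. The following is a minimal sketch of one common approach, regularized frequency-domain deconvolution of a known excitation signal. It is offered for illustration only and is not asserted to be the claimed or disclosed method. The function names `estimate_impulse_response` and `impulse_responses_per_sensor`, the parameters `ir_length` and `eps`, and the assumption that a known excitation signal is played into the room are all hypothetical.

```python
import numpy as np


def estimate_impulse_response(excitation, recording, ir_length, eps=1e-8):
    """Estimate a room impulse response by regularized deconvolution.

    Assumes (hypothetically) that `recording` is the room's response to a
    known `excitation` signal, i.e. recording ~= excitation * h (convolution).
    """
    excitation = np.asarray(excitation, dtype=float)
    recording = np.asarray(recording, dtype=float)

    # Zero-pad so that the circular convolution implied by the FFT
    # matches the linear convolution of the two signals.
    n = len(excitation) + len(recording) - 1
    X = np.fft.rfft(excitation, n)
    Y = np.fft.rfft(recording, n)

    # Regularized spectral division: H = Y * conj(X) / (|X|^2 + eps).
    # `eps` keeps the division stable where the excitation has little energy.
    H = Y * np.conj(X) / (np.abs(X) ** 2 + eps)

    h = np.fft.irfft(H, n)
    return h[:ir_length]


def impulse_responses_per_sensor(excitation, recordings, ir_length):
    """Estimate one impulse response per audio sensor.

    `recordings` is a sequence with one recording per sensor. The result has
    shape (num_sensors, ir_length).
    """
    return np.stack(
        [estimate_impulse_response(excitation, rec, ir_length) for rec in recordings]
    )
```

Under these assumptions, calling `impulse_responses_per_sensor` once on recordings made while no human is present and once on recordings made while a training human is present would yield the two sets of impulse responses (empty and occupied) for a given training environment.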