CPC G06T 7/73 (2017.01) [G06F 3/017 (2013.01); G06F 3/0304 (2013.01); G06F 18/256 (2023.01); G06N 3/045 (2023.01); G06T 7/20 (2013.01); G06T 7/90 (2017.01); G06V 40/165 (2022.01); G06V 40/23 (2022.01); G06V 40/28 (2022.01)] | 16 Claims |
1. A device for processing images associated with a gesture, comprising:
at least one camera; and
at least one processor configured to implement:
one or more three-dimensional convolution neural networks (3D CNNs), each of the 3D CNNs comprising:
an input to receive a plurality of input images from the at least one camera, and
an output to provide recognition information produced by each of the 3D CNNs, and
at least one recurrent neural network (RNN) comprising:
an input to receive a second type of recognition information, and
an output that is coupled to the input of the at least one RNN to provide a feedback connection,
wherein the at least one processor is configured to:
receive a plurality of captured images at a pre-processing module, perform pose estimation on each of the plurality of captured images, and overlay pose estimation pixels onto the plurality of captured images to generate the plurality of input images for consumption by the one or more 3D CNNs, and
receive the recognition information produced by each of the one or more 3D CNNs at a fusion module, and aggregate the received recognition information to generate the second type of recognition information for consumption by the at least one RNN,
wherein each of the one or more 3D CNNs is operable to produce the recognition information comprising at least one characteristic associated with the gesture in each of the plurality of input images, and provide the recognition information to the fusion module, the at least one characteristic comprising a pose, a color or a gesture type, and
wherein the at least one RNN is operable to determine whether the recognition information produced by the one or more 3D CNNs corresponds to a singular gesture across the plurality of input images.
|
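The data flow recited in claim 1 (pose overlay in a pre-processing module, one or more 3D CNNs, a fusion module, and an RNN whose output feeds back to its own input) can be illustrated with a minimal pure-Python sketch. This is not the patented implementation: every function name (`overlay_pose`, `conv3d_valid`, `cnn3d_stub`, `fuse`, `rnn_single_gesture`), the grayscale list-of-lists image format, the single-layer stand-in for each 3D CNN, concatenation as the aggregation step, and the hand-picked untrained weights are assumptions chosen only to make the structure concrete.

```python
import math
from typing import List, Sequence, Tuple

Frame = List[List[float]]        # hypothetical grayscale image, H x W
Clip = List[Frame]               # sequence of frames, T x H x W
Kernel = List[List[List[float]]] # 3D kernel, kt x kh x kw


def overlay_pose(frame: Frame, keypoints: Sequence[Tuple[int, int]],
                 value: float = 1.0) -> Frame:
    """Pre-processing: copy the frame and draw pose-estimation pixels on it."""
    out = [row[:] for row in frame]
    for r, c in keypoints:
        if 0 <= r < len(out) and 0 <= c < len(out[0]):
            out[r][c] = value  # keypoints outside the frame are ignored
    return out


def conv3d_valid(clip: Clip, kernel: Kernel) -> List[List[List[float]]]:
    """Single-channel 3D cross-correlation over (time, row, col), no padding."""
    T, H, W = len(clip), len(clip[0]), len(clip[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    return [[[sum(clip[t + a][i + b][j + c] * kernel[a][b][c]
                  for a in range(kt) for b in range(kh) for c in range(kw))
              for j in range(W - kw + 1)]
             for i in range(H - kh + 1)]
            for t in range(T - kt + 1)]


def cnn3d_stub(clip: Clip, kernel: Kernel) -> List[float]:
    """Stand-in for one 3D CNN: conv, ReLU, then a spatial mean per time step.

    A real network would have many trained layers; this yields one
    scalar "recognition" feature per output time step.
    """
    feats = []
    for plane in conv3d_valid(clip, kernel):
        vals = [max(0.0, v) for row in plane for v in row]
        feats.append(sum(vals) / len(vals))
    return feats


def fuse(per_network: Sequence[Sequence[float]]) -> List[List[float]]:
    """Fusion module (assumed): concatenate each network's feature per step.

    Assumes all networks emit the same number of time steps.
    """
    return [list(step) for step in zip(*per_network)]


def rnn_single_gesture(fused: Sequence[Sequence[float]],
                       w_in: Sequence[float], w_rec: float,
                       threshold: float = 0.5) -> Tuple[bool, float]:
    """Elman-style recurrence: the previous output h is fed back as input."""
    h = 0.0
    for x in fused:
        h = math.tanh(sum(w * xi for w, xi in zip(w_in, x)) + w_rec * h)
    score = 1.0 / (1.0 + math.exp(-h))  # squash final state to (0, 1)
    return score > threshold, score


# Toy run: four blank 4x4 frames with one pose pixel overlaid on each.
frames = [[[0.0] * 4 for _ in range(4)] for _ in range(4)]
clip = [overlay_pose(f, [(1, 1)]) for f in frames]
ones = [[[1.0] * 2 for _ in range(2)] for _ in range(2)]
halves = [[[0.5] * 2 for _ in range(2)] for _ in range(2)]
fused = fuse([cnn3d_stub(clip, ones), cnn3d_stub(clip, halves)])
is_single, score = rnn_single_gesture(fused, w_in=[1.0, 1.0], w_rec=0.5)
print(len(fused), is_single)
```

In the toy run, two stand-in networks with different kernels process the same overlaid clip, the fusion step pairs their per-step features, and the recurrent step carries state across the clip before thresholding. With trained weights, that final decision would correspond to the claim's determination of whether the recognition information reflects a singular gesture across the input images; here the weights are placeholders, so the decision itself carries no meaning.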