US 12,249,147 B2
Adaptive selection of data modalities for efficient video recognition
Rameswar Panda, Medford, MA (US); Richard Chen, Baldwin Place, NY (US); Quanfu Fan, Lexington, MA (US); and Rogerio Schmidt Feris, West Hartford, CT (US)
Assigned to INTERNATIONAL BUSINESS MACHINES CORPORATION, Armonk, NY (US)
Filed by International Business Machines Corporation, Armonk, NY (US)
Filed on Mar. 11, 2021, as Appl. No. 17/199,307.
Prior Publication US 2022/0292285 A1, Sep. 15, 2022
Int. Cl. G06V 20/40 (2022.01); G06F 18/214 (2023.01); G06F 18/25 (2023.01); G06N 20/00 (2019.01)
CPC G06V 20/46 (2022.01) [G06F 18/214 (2023.01); G06F 18/256 (2023.01); G06N 20/00 (2019.01); G06V 20/49 (2022.01)]
15 Claims
 
1. A method for video recognition, comprising: receiving an input video comprising a sequence of video segments over a plurality of data modalities, wherein each segment comprises two or more frames;
for at least one video segment of the sequence of video segments, adaptively selecting a subset of data modalities of the plurality of data modalities based on data representing the video segment, wherein each data modality selected is optimal for video recognition of the at least one video segment, and wherein each data modality of the plurality of data modalities that is not selected is redundant for the video recognition of the at least one video segment, wherein the plurality of data modalities are selected from a group comprising a RGB modality, an optical flow modality, and an audio modality;
for each data modality selected, providing at least one data input representing the at least one video segment over the data modality selected to a machine learning model corresponding to the data modality selected, and generating a first type of prediction representative of the at least one video segment via the machine learning model; and
determining a second type of prediction representative of the input video as a whole by aggregating all first type of predictions generated, wherein the second type of prediction is indicative of an object or an activity captured in the input video.
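
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The selection rule (a mean-magnitude threshold), the names `select_modalities` and `recognize`, the fixed-output toy models, and score averaging as the aggregation step are all assumptions made for illustration; the claim does not specify how the subset is chosen or how the predictions are aggregated.

```python
from typing import Callable, Dict, List, Optional, Sequence

# A segment maps a modality name to that segment's data (toy feature values here).
Segment = Dict[str, List[float]]
# A per-modality model maps segment data to label scores (the "first type of prediction").
Model = Callable[[List[float]], Dict[str, float]]

MODALITIES = ("rgb", "flow", "audio")


def select_modalities(segment: Segment, threshold: float = 0.5) -> List[str]:
    """Hypothetical selection policy: keep a modality only when the mean
    magnitude of its data exceeds `threshold`; the rest are treated as redundant."""
    selected = []
    for name in MODALITIES:
        data = segment.get(name, [])
        if data and sum(abs(x) for x in data) / len(data) > threshold:
            selected.append(name)
    return selected


def recognize(
    video: Sequence[Segment],
    models: Dict[str, Model],
    threshold: float = 0.5,
) -> Optional[str]:
    """Run only the selected modality models on each segment, then aggregate
    all segment-level predictions into one video-level label."""
    segment_predictions: List[Dict[str, float]] = []
    for segment in video:
        for name in select_modalities(segment, threshold):
            segment_predictions.append(models[name](segment[name]))

    if not segment_predictions:
        return None  # nothing was selected, so there is nothing to aggregate

    # Assumed aggregation: average the scores per label, then take the best label.
    totals: Dict[str, float] = {}
    for prediction in segment_predictions:
        for label, score in prediction.items():
            totals[label] = totals.get(label, 0.0) + score
    averaged = {label: s / len(segment_predictions) for label, s in totals.items()}
    return max(averaged, key=averaged.get)


# Toy stand-ins for trained per-modality models (fixed outputs, for illustration only).
models: Dict[str, Model] = {
    "rgb": lambda data: {"cooking": 0.7, "running": 0.3},
    "flow": lambda data: {"cooking": 0.2, "running": 0.8},
    "audio": lambda data: {"cooking": 0.6, "running": 0.4},
}

video: List[Segment] = [
    {"rgb": [0.9, 0.8], "flow": [0.1, 0.0], "audio": [0.7, 0.9]},
    {"rgb": [0.9, 0.9], "flow": [0.2, 0.1], "audio": [0.1, 0.0]},
]

video_label = recognize(video, models)
```

In this toy run the optical flow model is never invoked, because neither segment passes the assumed threshold for that modality. Skipping models for modalities judged redundant is the source of the efficiency the title refers to.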