US 12,217,493 B2
Methods, systems, and media for generating video classifications using multimodal video analysis
Mattia Fumagalli, Brooklyn, NY (US); Stefano Vegnaduzzo, Ann Arbor, MI (US); Charlie Schumacher, New York, NY (US); Varun Medappa, New York, NY (US); Tyler Hughes, New York, NY (US); Aleks Navratil, New York, NY (US); Josh Ryan, New York, NY (US); and Igor Benyaminov, Forest Hills, NY (US)
Assigned to Integral Ad Science, Inc., New York, NY (US)
Filed by Integral Ad Science, Inc., New York, NY (US)
Filed on Aug. 17, 2022, as Appl. No. 17/890,112.
Claims priority of provisional application 63/234,209, filed on Aug. 17, 2021.
Prior Publication US 2023/0054330 A1, Feb. 23, 2023
Int. Cl. G06V 20/40 (2022.01); G06Q 30/0241 (2023.01); G06V 10/80 (2022.01); G06V 10/82 (2022.01); G06V 30/19 (2022.01); G10L 15/26 (2006.01)
CPC G06V 10/811 (2022.01) [G06Q 30/0276 (2013.01); G06V 10/82 (2022.01); G06V 20/41 (2022.01); G06V 20/48 (2022.01); G06V 30/191 (2022.01); G10L 15/26 (2013.01)]
21 Claims
 
1. A method for classifying videos, the method comprising:
receiving, from a computing device, a video identifier;
parsing a video associated with the video identifier into an audio portion and a plurality of image frames;
analyzing the plurality of image frames associated with the video using (i) an optical character recognition technique to obtain first textual information corresponding to text appearing in at least one of the plurality of image frames and (ii) an image classifier to obtain, for each of a plurality of objects appearing in at least one of the plurality of frames of the video, a probability that an object appearing in at least one of the plurality of images falls within an image class;
concurrently with analyzing the plurality of image frames associated with the video, analyzing the audio portion of the video using an automated speech recognition technique to obtain second textual information corresponding to words spoken in the video;
combining the first textual information, the probability of each of the plurality of objects appearing in the at least one of the plurality of frames of the video, and the second textual information to obtain a combined analysis output for the video;
determining, using a neural network, a safety score for each of a plurality of categories that the video contains content belonging to a category of the plurality of categories, wherein the combined analysis output is input into the neural network; and
in response to receiving the video identifier, transmitting a plurality of safety scores corresponding to the plurality of categories to the computing device for the video associated with the video identifier.
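
The steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation. Every name in it (`parse_video`, `analyze_frames`, `analyze_audio`, `combine`, `score`, `classify`), the category list, the vocabulary, the image classes, and the weights are hypothetical placeholders. The OCR, image classification, and speech recognition steps are stubs that return fixed values, a thread pool is assumed as one possible way to run the frame and audio analyses concurrently, and a single dense layer with a sigmoid output stands in for the neural network.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

# All names and values below are hypothetical; the claim does not specify them.
CATEGORIES = ["violence", "adult", "profanity"]
VOCAB = ["sale", "welcome", "fight"]   # toy text vocabulary
OBJECTS = ["knife", "car"]             # toy image classes


@dataclass
class ParsedVideo:
    frames: list  # stand-ins for decoded image frames
    audio: str    # stand-in for the audio portion


def parse_video(video_id):
    """Stub: split the video for video_id into frames and audio."""
    return ParsedVideo(frames=["frame0", "frame1"], audio="audio-bytes")


def analyze_frames(frames):
    """Stub: OCR text (first textual information) and per-object probabilities."""
    return "sale ends today", {"knife": 0.10, "car": 0.85}


def analyze_audio(audio):
    """Stub: speech recognition transcript (second textual information)."""
    return "welcome to the channel"


def combine(ocr_text, object_probs, asr_text):
    """Build one feature vector: vocabulary counts followed by object probabilities."""
    tokens = (ocr_text + " " + asr_text).lower().split()
    text_features = [float(tokens.count(word)) for word in VOCAB]
    object_features = [object_probs.get(obj, 0.0) for obj in OBJECTS]
    return text_features + object_features


# Made-up weights for a single dense layer, one row per category.
WEIGHTS = {
    "violence": [0.0, 0.0, 2.0, 3.0, 0.0],
    "adult": [0.0, 0.0, 0.0, 0.0, 0.0],
    "profanity": [0.0, 0.0, 0.5, 0.0, 0.0],
}
BIAS = -1.0


def score(features):
    """Return one sigmoid score in [0, 1] per category."""
    scores = {}
    for category in CATEGORIES:
        z = BIAS + sum(w * x for w, x in zip(WEIGHTS[category], features))
        scores[category] = 1.0 / (1.0 + math.exp(-z))
    return scores


def classify(video_id):
    """Run frame and audio analysis concurrently, then combine and score."""
    video = parse_video(video_id)
    with ThreadPoolExecutor(max_workers=2) as pool:
        frame_future = pool.submit(analyze_frames, video.frames)
        audio_future = pool.submit(analyze_audio, video.audio)
        ocr_text, object_probs = frame_future.result()
        asr_text = audio_future.result()
    return score(combine(ocr_text, object_probs, asr_text))
```

Calling `classify("abc123")` returns a dictionary with one score per entry in `CATEGORIES`, which corresponds to the plurality of safety scores transmitted back to the requesting device. In a real system the stubs would be replaced by actual OCR, image classification, and speech recognition components, and the weights would come from a trained model.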