US 12,461,967 B2
Multi-modal content based automated feature recognition
Pablo Pernias, Sant Joan d'Alacant (ES); Monica Alfaro Vendrell, Barcelona (ES); Francesc Josep Guitart Bravo, Lleida (ES); Marc Junyent Martin, Barcelona (ES); and Miquel Angel Farre Guiu, Bern (CH)
Assigned to Disney Enterprises, Inc., Burbank, CA (US)
Filed by Disney Enterprises, Inc., Burbank, CA (US)
Filed on Aug. 30, 2021, as Appl. No. 17/460,910.
Prior Publication US 2023/0068502 A1, Mar. 2, 2023
Int. Cl. G06F 16/75 (2019.01); G06F 16/68 (2019.01); G06F 16/735 (2019.01); G06F 16/78 (2019.01); G06N 3/045 (2023.01); G06N 20/00 (2019.01)

CPC G06F 16/75 (2019.01) [G06F 16/686 (2019.01); G06F 16/735 (2019.01); G06F 16/7867 (2019.01); G06N 3/045 (2023.01); G06N 20/00 (2019.01)]

20 Claims

1. A system comprising:

a computing platform including a processing hardware and a system memory storing a software code and a machine learning (ML) model-based feature classifier;

the processing hardware configured to execute the software code to:

receive media content including a video component corresponding to a video mode and one of a text component or an audio component corresponding to one of a text mode or an audio mode;

encode a plurality of video frames of the video component, using a first encoder of the software code, to generate a plurality of video embedding vectors;

encode the one of the text component or the audio component, using a second encoder of the software code, to generate a plurality of audio embedding vectors or a plurality of text embedding vectors;

combine the plurality of video embedding vectors and one of the plurality of audio embedding vectors or the plurality of text embedding vectors to provide an input data structure for a neural network mixer of the software code;

process, using the neural network mixer, the input data structure to provide a feature data corresponding to a feature of the media content, wherein the neural network mixer is tuned to provide the feature data such that the feature data preferentially focuses on one of objects, locations, performers, characters, or activities depicted in the video component over others of the objects, locations, performers, characters, or activities depicted in the video component; and

predict, using the ML model-based feature classifier and the feature data, a classification of the feature.
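The data flow recited in claim 1 (encode each mode, combine the embeddings, mix, then classify) might be sketched as follows. This is a toy illustration only, not the patented implementation: the claim does not disclose encoder architectures, the mixer design, or the classifier. Every name here (`encode_video`, `encode_text`, `combine`, `mix`, `classify`, `DIM`, the `focus` vector, and the class prototypes) is hypothetical. The encoders are deterministic pseudo-random stand-ins for learned models, and the "mixer" is a simple softmax-weighted pooling, not a neural network, used only to show how tuning could bias the feature data toward one kind of content.

```python
import math
import random
from typing import Dict, List, Sequence

Vector = List[float]
DIM = 8  # hypothetical embedding width; the claim does not specify one


def _toy_embed(key: str, dim: int = DIM) -> Vector:
    """Stand-in for a learned encoder: deterministic pseudo-random vector."""
    rng = random.Random(key)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def encode_video(frames: Sequence[str]) -> List[Vector]:
    """'First encoder' stand-in: one embedding vector per video frame."""
    return [_toy_embed("video:" + f) for f in frames]


def encode_text(tokens: Sequence[str]) -> List[Vector]:
    """'Second encoder' stand-in: one embedding vector per text token."""
    return [_toy_embed("text:" + t) for t in tokens]


def combine(video_vecs: List[Vector], other_vecs: List[Vector]) -> List[Vector]:
    """Stack both modalities into one input data structure (one row per vector)."""
    return list(video_vecs) + list(other_vecs)


def mix(rows: List[Vector], focus: Vector) -> Vector:
    """Toy 'mixer' (assumption, not a real neural network mixer).

    Pools the rows with softmax weights given by their dot product with a
    focus vector, so rows aligned with the focus dominate the feature data.
    """
    scores = [sum(a * b for a, b in zip(row, focus)) for row in rows]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for stability
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * row[i] for w, row in zip(weights, rows))
            for i in range(len(focus))]


def classify(feature: Vector, prototypes: Dict[str, Vector]) -> str:
    """Toy classifier: label whose prototype has the largest dot product."""
    return max(prototypes,
               key=lambda k: sum(a * b for a, b in zip(feature, prototypes[k])))


# Hypothetical inputs: frame identifiers and subtitle tokens.
frames = ["frame0", "frame1", "frame2"]
tokens = ["castle", "night"]

rows = combine(encode_video(frames), encode_text(tokens))
focus = _toy_embed("focus:locations")  # stands in for tuning toward locations
feature = mix(rows, focus)
prototypes = {"indoor": _toy_embed("class:indoor"),
              "outdoor": _toy_embed("class:outdoor")}
label = classify(feature, prototypes)
```

In this sketch, swapping the `focus` vector is the only change needed to bias the pooled feature toward a different category, which loosely mirrors the claim's "tuned to ... preferentially focus" language. A real system would presumably learn that preference during training rather than take it as an explicit argument.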