US 12,444,194 B2
Text-conditioned video representation
Satya Krishna Gorti, Toronto (CA); Junwei Ma, Toronto (CA); Guangwei Yu, Toronto (CA); Maksims Volkovs, Toronto (CA); Keyvan Golestan Irani, Toronto (CA); and Noël Vouitsis, Toronto (CA)
Assigned to The Toronto-Dominion Bank, Toronto (CA)
Filed by THE TORONTO-DOMINION BANK, Toronto (CA)
Filed on Aug. 24, 2022, as Appl. No. 17/894,738.
Claims priority of provisional application 63/336,116, filed on Apr. 28, 2022.
Prior Publication US 2023/0351753 A1, Nov. 2, 2023
Int. Cl. G06V 20/40 (2022.01)
CPC G06V 20/47 (2022.01) [G06V 20/41 (2022.01)]
20 Claims
 
1. A system for evaluating relevance of a text string to a video comprising:
a processor; and
a non-transitory computer-readable medium having instructions executable by the processor for:
identifying a text embedding of the text string;
identifying a plurality of frame embeddings associated with a plurality of frames of the video;
evaluating the text embedding with respect to each frame embedding of the plurality of frame embeddings;
selecting a set of highest-relevance frames based on the evaluating;
generating a text-conditioned video embedding for the video by combining the plurality of frame embeddings associated with the set of highest-relevance frames without contribution of the frame embeddings not associated with the set of highest-relevance frames; and
determining a relevance score of the text string to the video based on the text-conditioned video embedding and the text embedding.
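
The steps recited in claim 1 can be illustrated with a minimal sketch. The claim does not specify how the text embedding is evaluated against each frame embedding, how many frames are selected, or how the selected frame embeddings are combined; the sketch assumes cosine similarity for the evaluation, a caller-supplied count `k` for the selection, and an unweighted mean for the combination. The names `cosine`, `text_conditioned_score`, and `k` are hypothetical and do not appear in the patent.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def text_conditioned_score(
    text_emb: Vector, frame_embs: Sequence[Vector], k: int
) -> Tuple[float, List[int]]:
    """Score a text string against a video using only the k most relevant frames.

    Assumptions (not stated in the claim): cosine similarity as the
    evaluation, top-k as the selection rule, mean pooling as the combination.
    """
    if not frame_embs:
        raise ValueError("at least one frame embedding is required")
    k = max(1, min(k, len(frame_embs)))

    # Evaluate the text embedding with respect to each frame embedding.
    sims = [cosine(text_emb, frame) for frame in frame_embs]

    # Select the set of highest-relevance frames.
    selected = sorted(range(len(frame_embs)), key=lambda i: sims[i], reverse=True)[:k]

    # Combine only the selected frame embeddings; the remaining frames
    # make no contribution to the text-conditioned video embedding.
    dim = len(text_emb)
    video_emb = [sum(frame_embs[i][d] for i in selected) / k for d in range(dim)]

    # Relevance score of the text string to the video.
    return cosine(video_emb, text_emb), sorted(selected)


# Toy example: three 2-dimensional frame embeddings, one text embedding.
text = [1.0, 0.0]
frames = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]]
score_top2, frames_top2 = text_conditioned_score(text, frames, k=2)
score_all, frames_all = text_conditioned_score(text, frames, k=3)
```

In this toy example, restricting the combination to the two most relevant frames excludes the frame orthogonal to the text, so `score_top2` is higher than `score_all`, which pools every frame.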