US 12,444,194 B2
Text-conditioned video representation
Satya Krishna Gorti, Toronto (CA); Junwei Ma, Toronto (CA); Guangwei Yu, Toronto (CA); Maksims Volkovs, Toronto (CA); Keyvan Golestan Irani, Toronto (CA); and Noël Vouitsis, Toronto (CA)
Assigned to The Toronto-Dominion Bank, Toronto (CA)
Filed by THE TORONTO-DOMINION BANK, Toronto (CA)
Filed on Aug. 24, 2022, as Appl. No. 17/894,738.
Claims priority of provisional application 63/336,116, filed on Apr. 28, 2022.
Prior Publication US 2023/0351753 A1, Nov. 2, 2023
Int. Cl. G06V 20/40 (2022.01)

CPC G06V 20/47 (2022.01) [G06V 20/41 (2022.01)]

20 Claims

1. A system for evaluating relevance of a text string to a video comprising:

a processor; and

a non-transitory computer-readable medium having instructions executable by the processor for:

identifying a text embedding of the text string;

identifying a plurality of frame embeddings associated with a plurality of frames of the video;

evaluating the text embedding with respect to each frame embedding of the plurality of frame embeddings;

selecting a set of highest-relevance frames based on the evaluating;

generating a text-conditioned video embedding for the video by combining the plurality of frame embeddings associated with the set of highest-relevance frames without contribution of the frame embeddings not associated with the set of highest-relevance frames; and

determining a relevance score of the text string to the video based on the text-conditioned video embedding and the text embedding.
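
For illustration only, the steps recited in claim 1 can be sketched in Python as follows. The claim does not specify how the text embedding is evaluated against each frame embedding, how many frames form the set of highest-relevance frames, or how the selected frame embeddings are combined. This sketch assumes cosine similarity, a caller-supplied count k, and an unweighted mean. The names cosine and text_conditioned_relevance are hypothetical and do not come from the patent, and the embeddings are assumed to have been produced already by some text and image encoder.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def text_conditioned_relevance(text_emb, frame_embs, k):
    """Return (relevance score, indices of the selected frames).

    text_emb   -- text embedding, a list of floats
    frame_embs -- one embedding per video frame, same dimension as text_emb
    k          -- number of highest-relevance frames to keep (assumption)
    """
    if not frame_embs:
        raise ValueError("at least one frame embedding is required")
    k = max(1, min(k, len(frame_embs)))

    # Evaluate the text embedding with respect to each frame embedding.
    scores = [cosine(text_emb, frame) for frame in frame_embs]

    # Select the set of highest-relevance frames.
    ranked = sorted(range(len(frame_embs)), key=lambda i: scores[i], reverse=True)
    selected = ranked[:k]

    # Combine only the selected frame embeddings (unweighted mean, an assumption);
    # frames outside the selected set contribute nothing.
    dim = len(text_emb)
    video_emb = [sum(frame_embs[i][d] for i in selected) / k for d in range(dim)]

    # Relevance of the text to the video from the text-conditioned video embedding.
    return cosine(text_emb, video_emb), sorted(selected)


if __name__ == "__main__":
    text = [1.0, 0.0]
    frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
    score, kept = text_conditioned_relevance(text, frames, k=2)
    print(kept, round(score, 4))
```

Because the selection depends on the text embedding, the same video yields a different video embedding for each text string, which is the sense in which the embedding is text-conditioned.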