US 12,216,709 B1
Computer-implemented methods for machine learning model based spatial-temporal adaptive shift for end-to-end text-video retrieval
Ning Xie, Bellevue, WA (US); Han Li, Seattle, WA (US); Qipin Chen, Bellevue, WA (US); Yuan Chen, San Francisco, CA (US); and Lingyun Wang, Bothell, WA (US)
Assigned to Amazon Technologies, Inc., Seattle, WA (US)
Filed by Amazon Technologies, Inc., Seattle, WA (US)
Filed on Mar. 22, 2023, as Appl. No. 18/188,420.
Int. Cl. G06F 16/78 (2019.01); G06F 16/783 (2019.01); G06T 1/00 (2006.01); G06V 10/70 (2022.01); H04N 21/232 (2011.01)
CPC G06F 16/7867 (2019.01) [G06F 16/783 (2019.01); G06T 1/0021 (2013.01); G06V 10/70 (2022.01); H04N 21/232 (2013.01)]
20 Claims
1. A computer-implemented method comprising:
receiving a video comprising a plurality of frames at a content delivery service;
generating, by the content delivery service, a set of embeddings for each of a plurality of sections of each frame of the plurality of frames;
determining, by a candidate selector machine learning model of the content delivery service, a proper subset of the plurality of sections of each frame of the plurality of frames for a time shift based on the set of embeddings;
time shifting, by the content delivery service, the proper subset of the plurality of sections of each frame of the plurality of frames to generate time shifted frames;
generating, by the content delivery service, an updated set of embeddings based on the time shifted frames;
receiving a search request comprising input text from a user device;
determining the video is a match for the search request based on the input text and the updated set of embeddings for the time shifted frames; and
sending the video to the user device based on the match.
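
The steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation, and every name in it (`select_sections`, `time_shift`, `video_embedding`, `is_match`, the parameters `k`, `offset`, and `threshold`) is hypothetical. The sketch assumes that each section of a frame is already represented by one embedding vector, that the candidate selector can be stood in for by a simple norm-based top-k score (the claim recites a machine learning model, not this heuristic), that the time shift copies the selected sections from the preceding frame, and that the updated embeddings are mean-pooled and compared to a text embedding by cosine similarity.

```python
from math import sqrt

# All names and parameters here are illustrative assumptions, not taken from the patent.

def select_sections(frame, k):
    """Stand-in for the candidate selector model: rank the sections of one
    frame by the L2 norm of their embeddings and keep the top k indices."""
    if not 0 < k < len(frame):
        raise ValueError("k must select a proper, non-empty subset of sections")
    scores = [sqrt(sum(x * x for x in emb)) for emb in frame]
    ranked = sorted(range(len(frame)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])

def time_shift(video, k, offset=1):
    """video[t][s] is the embedding of section s in frame t.
    In each frame, the selected sections are replaced by the same sections
    from the frame `offset` steps earlier (clamped at the first frame)."""
    shifted = []
    for t, frame in enumerate(video):
        source = video[max(t - offset, 0)]
        chosen = select_sections(frame, k)
        shifted.append([source[s] if s in chosen else frame[s]
                        for s in range(len(frame))])
    return shifted

def video_embedding(shifted):
    """Updated embedding for the whole video: mean over all sections of all
    time shifted frames (a simplification of re-encoding the frames)."""
    sections = [emb for frame in shifted for emb in frame]
    dim = len(sections[0])
    return [sum(emb[d] for emb in sections) / len(sections) for d in range(dim)]

def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def is_match(text_embedding, video, k, threshold=0.5):
    """Decide whether the video matches the text of a search request."""
    updated = video_embedding(time_shift(video, k))
    return cosine(text_embedding, updated) >= threshold

# Toy input: 2 frames, 3 sections per frame, 2-dimensional embeddings.
video = [
    [[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]],  # frame 0
    [[0.0, 1.0], [5.0, 0.0], [0.0, 0.5]],  # frame 1
]
shifted = time_shift(video, k=1)
print(shifted[1])                          # section 1 now comes from frame 0
print(is_match([0.0, 1.0], video, k=1))    # query embedding is a made-up vector
```

In this toy input, only one section per frame is shifted, so most of each frame keeps its original content while a small part carries information from the neighboring frame. In a real system the text embedding would come from a text encoder and the selector would be learned. Both are replaced by fixed stand-ins here.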