US 11,917,266 B1
Enhanced generation and selection of video clips from still frame images
Shilpa Pundi Ananth, Chennai (IN); Sai Sree Harsha, Tumakuru (IN); Pooja Ashok Kumar, Bangalore (IN); Yashal Shakti Kanungo, Seattle, WA (US); Sumit Negi, Bangalore (IN); Brittney C. Gannon, Clinton, WA (US); and Lauren K. Johnson, Bellevue, WA (US)
Assigned to Amazon Technologies, Inc., Seattle, WA (US)
Filed by Amazon Technologies, Inc., Seattle, WA (US)
Filed on Sep. 28, 2022, as Appl. No. 17/955,295.
Int. Cl. H04N 21/218 (2011.01); H04N 21/222 (2011.01); H04N 21/235 (2011.01); H04N 21/488 (2011.01); H04N 21/6379 (2011.01); H04N 21/81 (2011.01); H04N 19/46 (2014.01); G06V 10/74 (2022.01); H04N 5/262 (2006.01)
CPC H04N 21/8153 (2013.01) [G06V 10/761 (2022.01); H04N 5/2628 (2013.01); H04N 19/46 (2014.11); H04N 21/812 (2013.01)] 20 Claims
[OG exemplary drawing omitted]
 
1. A method for generating video clips of a product based on still frame images of the product, the method comprising:
identifying, by at least one processor of a device associated with an online retail system, a first image representing a product at a first scene, the product for sale using the online retail system;
identifying, by the at least one processor, a second image representing the product at a second scene different from the first scene;
generating, by the at least one processor, based on the first image, first images representing the product at the first scene and using a first type of camera shot;
generating, by the at least one processor, based on the second image, second images representing the product at the second scene and using a second type of camera shot different from the first type of camera shot;
encoding, by the at least one processor, using a first encoder network, first embeddings for a first video comprising the first images, the first embeddings indicative of features of the first scene;
encoding, by the at least one processor, using the first encoder network, second embeddings for a second video comprising the second images, the second embeddings indicative of features of the second scene;
encoding, by the at least one processor, using a second encoder network, third embeddings for the first video, the third embeddings indicative of camera shot features of the first video;
encoding, by the at least one processor, using the second encoder network, fourth embeddings for the second video, the fourth embeddings indicative of camera shot features of the second video; and
generating, by the at least one processor, using machine learning models, based on the first embeddings, the second embeddings, the third embeddings, and the fourth embeddings, a video sequence for the product, the video sequence comprising one of the first video or the second video.
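The camera-shot generation steps of claim 1 (deriving "first images" and "second images" from single still frames) can be emulated with crop-and-resize frame synthesis. The sketch below is a minimal illustration in Python of one such shot type, a center zoom-in; the function and parameter names are hypothetical, and the patent does not prescribe this particular technique.

```python
# Hypothetical sketch: synthesizing a "zoom-in" camera shot from one still
# image by cropping progressively tighter windows and resizing each to a
# fixed frame size. Names and defaults are illustrative assumptions.
from PIL import Image

def zoom_in_shot(image_path: str, num_frames: int = 30,
                 end_scale: float = 0.6,
                 frame_size=(512, 512)) -> list[Image.Image]:
    """Return num_frames frames emulating a slow zoom toward the center."""
    still = Image.open(image_path).convert("RGB")
    w, h = still.size
    frames = []
    for i in range(num_frames):
        # Interpolate the crop scale from 1.0 (full frame) down to end_scale.
        t = i / max(num_frames - 1, 1)
        scale = 1.0 + (end_scale - 1.0) * t
        cw, ch = int(w * scale), int(h * scale)
        left, top = (w - cw) // 2, (h - ch) // 2
        crop = still.crop((left, top, left + cw, top + ch))
        frames.append(crop.resize(frame_size, Image.LANCZOS))
    return frames
```

A pan shot could be produced the same way by sliding a fixed-size crop window across the still instead of shrinking it, which is one plausible reading of the "second type of camera shot" limitation.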
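The encoding and selection steps (the first and second encoder networks plus the machine-learning models that output "one of the first video or the second video") can be read as a dual-encoder scoring pipeline. Below is a minimal PyTorch sketch under that assumption; the module names, dimensions, and the use of a GRU for camera-shot features are illustrative choices, not architectures disclosed in the patent.

```python
# Hypothetical sketch of the dual-encoder selection step: one encoder embeds
# scene content, a second embeds camera-shot dynamics, and a scoring head
# ranks the candidate clips. All names and sizes are assumptions.
import torch
import torch.nn as nn

class ClipScorer(nn.Module):
    def __init__(self, frame_dim=2048, embed_dim=256):
        super().__init__()
        # Scene encoder: pools per-frame features into a scene embedding
        # (the "first embeddings" / "second embeddings" of claim 1).
        self.scene_encoder = nn.Sequential(nn.Linear(frame_dim, embed_dim),
                                           nn.ReLU())
        # Shot encoder: a GRU over frame features captures motion across
        # frames, i.e. camera-shot characteristics such as zoom or pan
        # (the "third embeddings" / "fourth embeddings").
        self.shot_encoder = nn.GRU(frame_dim, embed_dim, batch_first=True)
        # Scoring head maps the concatenated embeddings to a scalar score.
        self.scorer = nn.Linear(2 * embed_dim, 1)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, frame_dim) pre-extracted per-frame features.
        scene = self.scene_encoder(frames).mean(dim=1)   # (batch, embed_dim)
        _, shot = self.shot_encoder(frames)              # (1, batch, embed_dim)
        combined = torch.cat([scene, shot[-1]], dim=-1)  # (batch, 2*embed_dim)
        return self.scorer(combined).squeeze(-1)         # (batch,)

# Usage: score two candidate clips and keep the higher-scoring one for the
# final video sequence, mirroring the claim's selection of one video.
model = ClipScorer()
candidates = torch.randn(2, 30, 2048)   # two clips, 30 frames each
scores = model(candidates)
best = int(scores.argmax())
```

Ranking candidates with a learned scalar score is only one way to realize the final "generating ... a video sequence" limitation; the claim language would equally cover other model families that consume the four embedding sets.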