US 12,238,390 B1
Enhanced generation and selection of video clips from still frame images
Shilpa Pundi Ananth, Chennai (IN); Sai Sree Harsha, Tumakuru (IN); Pooja Ashok Kumar, Bangalore (IN); Yashal Shakti Kanungo, Seattle, WA (US); Sumit Negi, Bangalore (IN); Brittney C. Gannon, Clinton, WA (US); and Lauren K. Johnson, Bellevue, WA (US)
Assigned to Amazon Technologies, Inc., Seattle, WA (US)
Filed by Amazon Technologies, Inc., Seattle, WA (US)
Filed on Dec. 29, 2023, as Appl. No. 18/400,569.
Application 18/400,569 is a continuation of application No. 17/955,295, filed on Sep. 28, 2022, granted, now Pat. No. 11,917,266.
This patent is subject to a terminal disclaimer.
Int. Cl. H04N 21/218 (2011.01); G06V 10/74 (2022.01); H04N 5/262 (2006.01); H04N 19/46 (2014.01); H04N 21/222 (2011.01); H04N 21/235 (2011.01); H04N 21/488 (2011.01); H04N 21/6379 (2011.01); H04N 21/81 (2011.01)

CPC H04N 21/8153 (2013.01) [G06V 10/761 (2022.01); H04N 5/2628 (2013.01); H04N 19/46 (2014.11); H04N 21/812 (2013.01)]

20 Claims

1. A method for generating video clips based on still frame images, the method comprising:

identifying, by at least one processor of a device, a first image representing a first scene;

identifying, by the at least one processor, a second image representing a second scene different than the first scene;

generating, by the at least one processor, based on the first image, first images representing the first scene and using a first type of camera shot;

generating, by the at least one processor, based on the second image, second images representing the second scene and using a second type of camera shot different than the first type of camera shot;

encoding, by the at least one processor, using a first encoder network, first embeddings for a first video comprising the first images, the first embeddings indicative of features of the first scene;

encoding, by the at least one processor, using the first encoder network, second embeddings for a second video, the second embeddings indicative of features of the second scene;

encoding, by the at least one processor, using a second encoder network, third embeddings for the first video, the third embeddings indicative of camera shot features of the first video;

encoding, by the at least one processor, using the second encoder network, fourth embeddings for the second video, the fourth embeddings indicative of camera shot features of the second video; and

generating, by the at least one processor, using machine learning models, based on the first embeddings, the second embeddings, the third embeddings, and the fourth embeddings, a video sequence comprising one of the first video or the second video.
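The steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the shot types ("zoom_in", "pan_right"), the function names, the hand-written feature functions standing in for the first and second encoder networks, and the weighted score standing in for the machine learning models are all assumptions chosen for illustration. Images are plain nested lists of grayscale values so the example runs with the standard library only.

```python
def crop(image, left, top, width, height):
    """Return the sub-image covered by the given window."""
    return [row[left:left + width] for row in image[top:top + height]]


def make_clip(image, shot, n_frames=4):
    """Generate frames from one still image by moving a virtual camera window.

    The shot types are hypothetical examples; n_frames must be at least 2.
    """
    h, w = len(image), len(image[0])
    frames = []
    for i in range(n_frames):
        t = i / (n_frames - 1)  # progress from 0.0 to 1.0
        if shot == "zoom_in":
            # Window shrinks from the full image to half size, staying centered.
            cw, ch = round(w * (1 - 0.5 * t)), round(h * (1 - 0.5 * t))
            left, top = (w - cw) // 2, (h - ch) // 2
        elif shot == "pan_right":
            # Half-width window slides from the left edge to the right edge.
            cw, ch = w // 2, h
            left, top = round((w - cw) * t), 0
        else:
            raise ValueError(f"unknown shot type: {shot}")
        frames.append(crop(image, left, top, cw, ch))
    return frames


def encode_scene(frames):
    """Stand-in for the first encoder network: crude scene features
    (mean and maximum pixel value over all frames)."""
    pixels = [p for frame in frames for row in frame for p in row]
    return [sum(pixels) / len(pixels), max(pixels)]


def encode_shot(frames):
    """Stand-in for the second encoder network: crude camera shot features
    (how much the frame width and height change from first to last frame)."""
    first, last = frames[0], frames[-1]
    return [len(last[0]) / len(first[0]), len(last) / len(first)]


def select_clip(clips, weights):
    """Stand-in for the machine learning models: score each clip with a
    weighted sum of its scene and shot features and return the best name."""
    def score(frames):
        features = encode_scene(frames) + encode_shot(frames)
        return sum(w * x for w, x in zip(weights, features))
    return max(clips, key=lambda name: score(clips[name]))


# Two still images representing different scenes (assumed toy data).
bright = [[0.9] * 8 for _ in range(8)]
dark = [[0.1] * 8 for _ in range(8)]

clips = {
    "first": make_clip(bright, "zoom_in"),    # first scene, first shot type
    "second": make_clip(dark, "pan_right"),   # second scene, second shot type
}

# Hypothetical weights: [scene mean, scene max, width change, height change].
chosen = select_clip(clips, weights=[1.0, 0.0, 0.5, 0.5])
```

Keeping the scene features and the shot features in separate functions mirrors the claim's use of two distinct encoder networks, so the selection step can weigh what a clip shows independently of how the virtual camera moves.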