US 11,790,695 B1
Enhanced video annotation using image analysis
Abhinav Aggarwal, New Delhi (IN); Yash Pandya, Navi Mumbai (IN); Laxmi Shivaji Ahire, Malegaon (IN); Lokesh Amarnath Ravindranathan, Bangalore (IN); Manivel Sethu, Bangalore (IN); and Muhammad Raffay Hamid, Seattle, WA (US)
Assigned to Amazon Technologies, Inc., Seattle, WA (US)
Filed by Amazon Technologies, Inc., Seattle, WA (US)
Filed on May 17, 2021, as Appl. No. 17/322,753.
Int. Cl. G06K 9/00 (2022.01); H04N 21/44 (2011.01); G06F 16/78 (2019.01); G06F 16/75 (2019.01); G06F 16/783 (2019.01); G06V 40/16 (2022.01); G06V 20/40 (2022.01); G06F 18/23 (2023.01); G06F 18/21 (2023.01)
CPC G06V 40/173 (2022.01) [G06F 18/2178 (2023.01); G06F 18/23 (2023.01); G06V 20/40 (2022.01); G06V 40/179 (2022.01)] 20 Claims
OG exemplary drawing
 
1. A method, comprising:
identifying, by at least one processor of a first device, video frames of a television show having multiple seasons, the video frames including representations of unidentified actors;
identifying, using a convolutional neural network, faces represented by the video frames;
generating, by the at least one processor, based on a first episode of the television show, a first cluster of first faces of the faces;
generating, by the at least one processor, based on a second episode of the television show, a second cluster of second faces of the faces;
generating, by the at least one processor, based on a third episode of the television show, a third cluster of third faces of the faces;
determining, by the at least one processor, that a first cosine similarity between the first faces and the second faces exceeds a similarity threshold;
determining, by the at least one processor, that a second cosine similarity between the first faces and the third faces fails to exceed the similarity threshold;
selecting, by the at least one processor, based on the second cosine similarity, a first face to represent the first faces and the third faces;
selecting, by the at least one processor, based on the first cosine similarity, a second face to represent the second faces;
determining, by the at least one processor, a first score associated with the first episode, the first score indicative of a first number of faces to label using actor names, the first number of faces included in the first episode;
determining, by the at least one processor, a second score associated with the second episode, the second score indicative of a second number of faces to label using actor names, the second number of faces included in the second episode, the first score less than the second score;
selecting, by the at least one processor, based on a comparison of the first score to the second score, the first episode for face labeling;
sending, by the at least one processor, the first episode and the first face to a human operator;
receiving, by the at least one processor, from the human operator, a first face label for the first face, the first face label indicative of an actor's name;
generating, by the at least one processor, based on a comparison of the first face to a third face included in a fourth episode of the television show, a second face label for the third face, the second face label indicative of the actor's name;
sending, by the at least one processor, the third face and the second face label to the human operator;
receiving, by the at least one processor, from the human operator, a verification of the second face label; and
sending, by the at least one processor, the first face label and the second face label to a second device for presentation with the video frames.
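Claim 1 recites a concrete pipeline: per-episode face clustering, a cosine-similarity comparison of clusters across episodes against a threshold, and selection of a representative face for clusters that match. A minimal sketch of those comparison and selection steps follows, using NumPy only. The 128-dimensional embeddings, the 0.8 threshold, and the helper names (centroid, cosine_similarity, pick_representative) are illustrative assumptions; the claim specifies neither an embedding model nor a concrete threshold value.

```python
# Minimal sketch of the cross-episode cluster comparison in claim 1.
# All names, dimensions, and the threshold value are illustrative assumptions;
# the claim only requires "a similarity threshold" and a CNN face identifier.
import numpy as np

SIMILARITY_THRESHOLD = 0.8  # hypothetical value


def centroid(faces: np.ndarray) -> np.ndarray:
    """Mean embedding of one per-episode face cluster (N x D array)."""
    return faces.mean(axis=0)


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def pick_representative(faces: np.ndarray) -> np.ndarray:
    """Face embedding closest to the cluster centroid stands in for the cluster."""
    c = centroid(faces)
    sims = faces @ c / (np.linalg.norm(faces, axis=1) * np.linalg.norm(c))
    return faces[int(np.argmax(sims))]


# Per-episode clusters of CNN face embeddings (toy data stands in for real output).
rng = np.random.default_rng(0)
first_faces = rng.normal(size=(5, 128))
second_faces = first_faces + rng.normal(scale=0.01, size=(5, 128))  # same actor
third_faces = rng.normal(size=(5, 128))                             # different actor

first_sim = cosine_similarity(centroid(first_faces), centroid(second_faces))
second_sim = cosine_similarity(centroid(first_faces), centroid(third_faces))

# Clusters whose similarity exceeds the threshold are treated as the same actor,
# so a single representative face can be sent to the human operator for both.
if first_sim > SIMILARITY_THRESHOLD:
    representative = pick_representative(np.vstack([first_faces, second_faces]))
```

The claim's remaining steps (episode scoring, human labeling, and label propagation with verification) are recited in broader form in claim 5 and sketched after it below.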
 
5. A method, comprising:
identifying, by at least one processor of a first device, first faces represented by first video frames of video frames, the video frames including representations of unidentified actors, the first faces comprising a first face;
identifying, by the at least one processor, second faces represented by second video frames of the video frames, the second faces comprising a second face;
determining, by the at least one processor, a first score associated with the first video frames, the first score indicative of a first number of faces to label using actor names, the first number of faces represented by the first video frames;
determining, by the at least one processor, a second score associated with the second video frames, the second score indicative of a second number of faces to label using actor names, the second number of faces represented by the second video frames, the first score less than the second score;
selecting, by the at least one processor, based on a comparison of the first score to the second score, the first video frames for face labeling;
receiving, by the at least one processor, from a human operator, a first face label for the first face, the first face label indicative of an actor's name;
generating, by the at least one processor, based on a comparison of the first face to the second face, a second face label for the second face, the second face label indicative of the actor's name; and
sending, by the at least one processor, the first face label and the second face label to a second device for presentation with the video frames.
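Claims 1 and 5 both recite a score "indicative of a number of faces to label," selection of the frame set with the lower score, and propagation of a human-supplied label to a similar face. The sketch below assumes the score simply counts unlabeled face clusters per set of video frames; the FaceCluster type, the counting score, and the propagation threshold are hypothetical stand-ins, not the patent's implementation.

```python
# Minimal sketch of the scoring and label-propagation steps in claim 5.
# FaceCluster, labeling_score, and the 0.8 threshold are illustrative
# assumptions; the claim only requires a score indicative of a number of
# faces to label and a comparison of the first face to the second face.
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass
class FaceCluster:
    embedding: np.ndarray        # representative face embedding
    label: Optional[str] = None  # actor name, once assigned


def labeling_score(frame_set: list[FaceCluster]) -> int:
    """Number of faces in this set of video frames still needing an actor name."""
    return sum(1 for face in frame_set if face.label is None)


def propagate_label(labeled: FaceCluster, candidates: list[FaceCluster],
                    threshold: float = 0.8) -> None:
    """Copy a human-provided actor name to sufficiently similar unlabeled faces."""
    for face in candidates:
        sim = float(labeled.embedding @ face.embedding /
                    (np.linalg.norm(labeled.embedding) *
                     np.linalg.norm(face.embedding)))
        if face.label is None and sim > threshold:
            face.label = labeled.label  # second face label, pending verification


# Select the frame set with the lower score, mirroring the claim's comparison
# of the first score to the second score (toy embeddings stand in for real ones).
first_set = [FaceCluster(np.ones(128))]
second_set = [FaceCluster(np.ones(128)), FaceCluster(-np.ones(128))]
chosen = first_set if labeling_score(first_set) < labeling_score(second_set) \
    else second_set

chosen[0].label = "actor name from human operator"  # first face label
propagate_label(chosen[0], first_set + second_set)
```

In the claims, a propagated label is returned to the human operator for verification before the labels are sent to a second device for presentation; the sketch leaves that review loop to the surrounding system.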