CPC G06V 20/44 (2022.01) [G06Q 30/0643 (2013.01); G06T 7/70 (2017.01); G06V 10/7715 (2022.01); G06V 10/82 (2022.01); G06V 20/41 (2022.01); G06V 20/46 (2022.01); G06V 40/11 (2022.01); H04N 7/181 (2013.01); G06T 2207/10016 (2013.01); G06T 2207/10024 (2013.01); G06T 2207/20081 (2013.01); G06T 2207/20084 (2013.01); G06T 2207/20132 (2013.01); G06T 2207/30196 (2013.01); G06T 2207/30242 (2013.01)]

21 Claims
1. A computer system comprising at least one processor and at least one data store,
wherein the computer system is in communication with a plurality of cameras, and
wherein the computer system is programmed with one or more sets of instructions that, when executed by the at least one processor, cause the computer system to execute a method comprising:
receiving a first sequence of spatial-temporal features from a first camera of the plurality of cameras, wherein the first sequence of spatial-temporal features comprises a first set of spatial-temporal features generated by the first camera based on a first clip of images captured by the first camera and a second set of spatial-temporal features generated by the first camera based on a second clip of images captured by the first camera, wherein each of the images of the first clip is a multi-channel image including, for each pixel of such images, a plurality of channels corresponding to color values, a channel corresponding to a mask for a hand, a channel corresponding to a mask for a product and a channel corresponding to a mask for a product space, wherein each of the images of the second clip is a multi-channel image including, for each pixel of such images, a plurality of channels corresponding to color values, a channel corresponding to a mask for a hand, a channel corresponding to a mask for a product and a channel corresponding to a mask for a product space, and wherein each of the first clip and the second clip has been classified as depicting at least one event at the product space;
receiving a second sequence of spatial-temporal features from a second camera of the plurality of cameras, wherein the second sequence of spatial-temporal features comprises a third set of spatial-temporal features generated by the second camera based on a third clip of images captured by the second camera and a fourth set of spatial-temporal features generated by the second camera based on a fourth clip of images captured by the second camera, wherein each of the images of the third clip is a multi-channel image including, for each pixel of such images, a plurality of channels corresponding to color values, a channel corresponding to a mask for a hand, a channel corresponding to a mask for a product and a channel corresponding to a mask for a product space, wherein each of the images of the fourth clip is a multi-channel image including, for each pixel of such images, a plurality of channels corresponding to color values, a channel corresponding to a mask for a hand, a channel corresponding to a mask for a product and a channel corresponding to a mask for a product space, and wherein each of the third clip and the fourth clip has been classified as depicting at least one event at the product space;
providing each of the first sequence of spatial-temporal features and the second sequence of spatial-temporal features as inputs to a transformer executed by the computer system, wherein the transformer comprises:
a transformer encoder having at least one layer configured to generate a feature map based at least in part on a sequence of spatial-temporal features derived from clips of images, wherein the at least one layer of the transformer encoder has a multi-head self-attention module and a feedforward network; and
a transformer decoder configured to generate a hypothesis of a type of an event and a location of the event based at least in part on a feature map and a plurality of positional embeddings, wherein each of the positional embeddings corresponds to one of a plurality of product spaces;
receiving outputs from the transformer in response to the inputs;
determining that an actor executed at least one of a taking event, a return event or an event that is neither the taking event nor the return event with an item associated with the product space based at least in part on the outputs received from the transformer in response to the inputs; and
storing information regarding the at least one of the taking event, the return event or the event that is neither the taking event nor the return event in association with the actor in the at least one data store.
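The sketches below are illustrative only, not the patented implementation. First, the six-channel frames recited in the two receiving steps (a plurality of color channels plus one mask channel each for a hand, a product, and a product space) might be assembled as follows. This is a minimal sketch assuming NumPy arrays, with the masks coming from an unspecified upstream segmentation step:

```python
import numpy as np

def build_multichannel_frame(rgb, hand_mask, product_mask, space_mask):
    """Stack an H x W x 3 color image with three H x W binary masks into
    the six-channel frame recited in the claim: channels 0-2 for color
    values, then one channel each for the hand, product, and product-space
    masks."""
    return np.concatenate(
        [rgb.astype(np.float32) / 255.0,              # channels 0-2: color
         hand_mask[..., None].astype(np.float32),     # channel 3: hand mask
         product_mask[..., None].astype(np.float32),  # channel 4: product mask
         space_mask[..., None].astype(np.float32)],   # channel 5: product-space mask
        axis=-1)

def build_clip(frames):
    """Stack T six-channel frames into a T x H x W x 6 clip tensor."""
    return np.stack(frames, axis=0)

# Hypothetical usage, with random data standing in for camera output:
H, W, T = 120, 160, 16
frames = [build_multichannel_frame(
              np.random.randint(0, 256, (H, W, 3), dtype=np.uint8),
              np.random.rand(H, W) > 0.9,
              np.random.rand(H, W) > 0.9,
              np.random.rand(H, W) > 0.9)
          for _ in range(T)]
clip = build_clip(frames)   # shape (16, 120, 160, 6)
```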
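The claim has each camera generating one set of spatial-temporal features per clip but does not fix the network that does so. A plausible stand-in, assuming PyTorch and an arbitrary 3D-convolutional backbone (all layer sizes and the 256-dimensional feature width are hypothetical):

```python
import torch
import torch.nn as nn

class ClipFeatureExtractor(nn.Module):
    """Illustrative 3D-convolutional backbone mapping one clip to a single
    spatial-temporal feature vector. PyTorch expects channel-first input,
    i.e. (batch, 6, T, H, W) rather than the channel-last layout above."""
    def __init__(self, in_channels=6, feat_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),   # global average over T, H, W
        )
        self.proj = nn.Linear(64, feat_dim)

    def forward(self, clips):
        # clips: (batch, 6, T, H, W) -> (batch, feat_dim)
        x = self.net(clips).flatten(1)
        return self.proj(x)

extractor = ClipFeatureExtractor()
clips = torch.randn(2, 6, 16, 120, 160)   # first and second clips, one camera
sequence = extractor(clips)               # (2, 256): a two-element feature sequence
```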
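The recited encoder layer, having a multi-head self-attention module and a feedforward network, matches the structure of torch's built-in encoder layer. A sketch of encoding the two cameras' feature sequences into a feature map, assuming concatenation along the sequence axis as the fusion step (the claim only requires that both sequences be provided as inputs; the head count and layer count are assumptions):

```python
import torch
import torch.nn as nn

# One encoder layer with multi-head self-attention and a feedforward
# network, as recited; dimensions are illustrative.
encoder_layer = nn.TransformerEncoderLayer(
    d_model=256, nhead=8, dim_feedforward=512, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

# The two sequences of spatial-temporal features, one per camera,
# concatenated along the sequence axis into a single encoder input.
seq_cam1 = torch.randn(1, 2, 256)   # two clips from the first camera
seq_cam2 = torch.randn(1, 2, 256)   # two clips from the second camera
memory = encoder(torch.cat([seq_cam1, seq_cam2], dim=1))  # (1, 4, 256) feature map
```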
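For the decoder, one learned positional embedding per product space can serve as a query, in the style of DETR-type detectors, with each query decoded into an event-type hypothesis and an event location. The three-class head (taking, return, neither) follows the determining step; the number of product spaces and the bounding-box parameterization of the location are assumptions:

```python
import torch
import torch.nn as nn

class EventDecoder(nn.Module):
    """Sketch of the recited transformer decoder: one learned positional
    embedding per product space is used as a query, and each decoded query
    yields a hypothesis of an event type and an event location."""
    def __init__(self, num_product_spaces=10, d_model=256,
                 num_event_types=3):   # taking / return / neither
        super().__init__()
        self.queries = nn.Embedding(num_product_spaces, d_model)
        layer = nn.TransformerDecoderLayer(
            d_model=d_model, nhead=8, dim_feedforward=512, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.type_head = nn.Linear(d_model, num_event_types)
        self.loc_head = nn.Linear(d_model, 4)   # e.g. a box (x, y, w, h)

    def forward(self, memory):
        # memory: (batch, S, d_model) feature map from the encoder
        q = self.queries.weight.unsqueeze(0).repeat(memory.size(0), 1, 1)
        h = self.decoder(q, memory)
        return self.type_head(h), self.loc_head(h)

decoder = EventDecoder()
memory = torch.randn(1, 4, 256)
event_logits, event_locs = decoder(memory)  # (1, 10, 3) and (1, 10, 4)
```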
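Finally, the determining and storing steps reduce to reading the decoder's outputs and persisting a record in association with the actor. SQLite stands in for the claim's unspecified data store; the label order, table schema, and actor identifier are hypothetical:

```python
import sqlite3
import torch

EVENT_TYPES = ("taking", "return", "neither")   # assumed label order

def record_events(event_logits, actor_id, conn):
    """Pick the most likely event type per product space and store it in
    association with the actor."""
    conn.execute("CREATE TABLE IF NOT EXISTS events "
                 "(actor_id TEXT, product_space INTEGER, event_type TEXT)")
    for space, logits in enumerate(event_logits[0]):
        event_type = EVENT_TYPES[int(torch.argmax(logits))]
        conn.execute("INSERT INTO events VALUES (?, ?, ?)",
                     (actor_id, space, event_type))
    conn.commit()

conn = sqlite3.connect(":memory:")
record_events(torch.randn(1, 10, 3), "actor-42", conn)
```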