US 11,741,712 B2
Multi-hop transformer for spatio-temporal reasoning and localization
Asim Kadav, Mountain View, CA (US); Farley Lai, Santa Clara, CA (US); Hans Peter Graf, South Amboy, NJ (US); Alexandru Niculescu-Mizil, Plainsboro, NJ (US); Renqiang Min, Princeton, NJ (US); and Honglu Zhou, Somerset, NJ (US)
Assigned to NEC Corporation
Filed by NEC Laboratories America, Inc., Princeton, NJ (US)
Filed on Sep. 1, 2021, as Appl. No. 17/463,757.
Claims priority of provisional application 63/084,066, filed on Sep. 28, 2020.
Prior Publication US 2022/0101007 A1, Mar. 31, 2022
Int. Cl. G06V 20/40 (2022.01); G06T 7/73 (2017.01); G06T 7/246 (2017.01); G06F 18/213 (2023.01); G06N 3/045 (2023.01)

CPC G06V 20/41 (2022.01) [G06F 18/213 (2023.01); G06N 3/045 (2023.01); G06T 7/246 (2017.01); G06T 7/73 (2017.01); G06V 20/46 (2022.01); G06T 2207/10016 (2013.01); G06T 2207/20081 (2013.01); G06T 2207/20084 (2013.01); G06V 2201/07 (2022.01)]

20 Claims

1. A method for using a multi-hop reasoning framework to perform multi-step compositional long-term reasoning, the method comprising:

extracting feature maps and frame-level representations from a video stream by using a convolutional neural network (CNN);

performing object representation learning and detection;

linking objects through time via tracking to generate object tracks and image feature tracks;

feeding the object tracks and the image feature tracks to a multi-hop transformer that hops over frames in the video stream while concurrently attending to one or more of the objects in the video stream until the multi-hop transformer arrives at a correct answer; and

employing video representation learning and recognition from the objects and image context to locate a target object within the video stream.
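
The hopping step recited in claim 1 can be illustrated with a minimal sketch. It is not the claimed implementation. The sketch assumes that object tracks are already available as per-frame feature vectors, that each hop scores every (track, frame) pair with a dot product against a query vector, and that each hop only looks at frames after the one chosen by the previous hop. The names `multi_hop_attend`, `tracks`, `query`, and `num_hops` are hypothetical, as is the forward-only hopping rule.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def multi_hop_attend(tracks, query, num_hops=3):
    """Toy multi-hop attention over object tracks (illustrative only).

    tracks: dict mapping a track id to a list of per-frame feature vectors.
    query:  initial query vector with the same dimension as the features.
    Returns the (track_id, frame_index) pair selected at each hop.
    """
    hops = []
    start = 0  # assumption: hops move forward in time
    for _ in range(num_hops):
        # Every (track, frame) pair at or after the current start frame.
        candidates = [
            (tid, t)
            for tid, feats in tracks.items()
            for t in range(start, len(feats))
        ]
        if not candidates:
            break  # no frames left to hop to
        scores = [dot(query, tracks[tid][t]) for tid, t in candidates]
        weights = softmax(scores)
        best = max(range(len(candidates)), key=weights.__getitem__)
        hops.append(candidates[best])
        # Update the query with the attention-weighted feature sum.
        query = [
            sum(w * tracks[tid][t][d] for w, (tid, t) in zip(weights, candidates))
            for d in range(len(query))
        ]
        start = candidates[best][1] + 1
    return hops


# Two hypothetical tracks with three frames of 2-D features each.
tracks = {
    "ball": [[3.0, 0.0], [0.0, 1.0], [0.0, 1.0]],
    "cup": [[0.0, 1.0], [2.0, 0.0], [0.0, 1.0]],
}
hops = multi_hop_attend(tracks, query=[1.0, 0.0], num_hops=2)
```

In this toy setup the last entry of `hops` plays the role of the located target: the track and frame on which the final hop settles. A real system would take the features from the CNN, detector, and tracker recited in the earlier steps and would learn the attention parameters rather than use raw dot products.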