US 12,175,767 B2
End-to-end object tracking using neural networks with attention
Ruichi Yu, Mountain View, CA (US); Xu Chen, Livermore, CA (US); Shiwei Sheng, Cupertino, CA (US); Luming Tang, Ithaca, NY (US); and Chieh-En Tsai, Burlingame, CA (US)
Assigned to Waymo LLC, Mountain View, CA (US)
Filed by Waymo LLC, Mountain View, CA (US)
Filed on Apr. 7, 2022, as Appl. No. 17/715,838.
Prior Publication US 2023/0326215 A1, Oct. 12, 2023
Int. Cl. G06V 20/58 (2022.01); B60W 60/00 (2020.01); G06T 7/20 (2017.01); G06V 10/62 (2022.01); G06V 10/82 (2022.01)
CPC G06V 20/58 (2022.01) [G06T 7/20 (2013.01); G06V 10/62 (2022.01); G06V 10/82 (2022.01); B60W 60/001 (2020.02); B60W 2420/403 (2013.01); G06T 2207/10028 (2013.01); G06T 2207/20084 (2013.01)]
20 Claims
 
1. A method comprising:
obtaining, by one or more sensors, a plurality of images of an environment, wherein each image of the plurality of images is associated with a corresponding time of a plurality of times;
generating, by one or more processing devices, a plurality of sets of feature tensors (FTs), wherein each set of FTs is associated with one or more objects of the environment depicted in a respective image of the plurality of images;
obtaining, using the plurality of sets of FTs, a combined FT;
processing the combined FT using an encoder neural network (NN) to generate a plurality of object vectors, each object vector of the plurality of object vectors characterizing association of an individual FT of the plurality of sets of FTs with other FTs of the plurality of sets of FTs; and
processing, using a decoder NN, the plurality of object vectors to identify one or more tracks, wherein each track of the one or more tracks characterizes motion of a respective object of the one or more objects of the environment.
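
The data flow recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the single self-attention layer standing in for the encoder NN, the learned-query scoring standing in for the decoder NN, the random untrained weights, and all names (`combine_feature_tensors`, `encode`, `decode`, `track_queries`) are assumptions made for illustration only. The sketch shows only how per-image sets of FTs become a combined FT, then object vectors, then tracks.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def combine_feature_tensors(ft_sets):
    """Concatenate per-image FT sets into one combined FT.

    Also returns the time index of each row, so a row can be traced
    back to the image it came from.
    """
    time_ids = [t for t, fts in enumerate(ft_sets) for _ in range(len(fts))]
    return np.concatenate(ft_sets, axis=0), np.array(time_ids)

def encode(combined_ft, w_q, w_k, w_v):
    """Hypothetical encoder: one self-attention layer.

    Each output row (an object vector) is a weighted mix of all FTs,
    so it reflects how one FT relates to the FTs of other images.
    """
    q, k, v = combined_ft @ w_q, combined_ft @ w_k, combined_ft @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores, axis=-1) @ v

def decode(object_vectors, track_queries):
    """Hypothetical decoder: score each object vector against each
    track query and assign it to the highest-scoring track."""
    scores = softmax(object_vectors @ track_queries.T, axis=-1)
    return scores.argmax(axis=-1)

rng = np.random.default_rng(0)
dim = 8  # assumed feature dimension

# Three images (times 0, 1, 2) with 2, 3, and 2 detected objects.
ft_sets = [rng.normal(size=(n, dim)) for n in (2, 3, 2)]
combined_ft, time_ids = combine_feature_tensors(ft_sets)

# Random, untrained weights: the outputs are structurally valid only.
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))
object_vectors = encode(combined_ft, w_q, w_k, w_v)

track_queries = rng.normal(size=(4, dim))  # assumed maximum of 4 tracks
track_ids = decode(object_vectors, track_queries)

# Group detections by assigned track: track id -> [(time, row index), ...]
tracks = {}
for row, (t, k) in enumerate(zip(time_ids, track_ids)):
    tracks.setdefault(int(k), []).append((int(t), row))
```

Because the weights are untrained, the resulting assignments are arbitrary; in a trained system the grouping in `tracks` would correspond to the motion of individual objects across the image times.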