US 11,941,820 B1
Method for tracking an object in a low frame-rate video and object tracking device using the same
Kye Hyeon Kim, Suwon-si (KR)
Assigned to Superb AI Co., Ltd., Seoul (KR)
Filed by Superb AI Co., Ltd., Seoul (KR)
Filed on Oct. 27, 2023, as Appl. No. 18/384,706.
Claims priority of application No. 1020220157632 (KR), filed on Nov. 22, 2022.
Int. Cl. G06T 7/246 (2017.01); G06T 7/11 (2017.01); G06T 7/73 (2017.01)
CPC G06T 7/246 (2017.01) [G06T 7/11 (2017.01); G06T 7/73 (2017.01); G06T 2207/10016 (2013.01); G06T 2207/20016 (2013.01); G06T 2207/20081 (2013.01); G06T 2207/20084 (2013.01); G06T 2207/20132 (2013.01)] 16 Claims
OG exemplary drawing
 
1. A method for tracking an object in a low frame-rate video, comprising steps of:
(a) in response to acquiring a video image including a plurality of frames from an imaging device, an object tracking device (i) inputting a t-th frame corresponding to a current time among the plurality of frames into an object detection network, thereby instructing the object detection network to input the t-th frame into an FPN (Feature Pyramid Network) and thus generate each of a (1_1)-st pyramid feature map to a (1_k)-th pyramid feature map corresponding to each of a 1-st scale to a k-th scale, wherein the k is an integer of two or more, and (ii) performing an object detection on a 1-st combined feature map, in which the (1_1)-st pyramid feature map to the (1_k)-th pyramid feature map are combined, thus detecting 1-st objects contained in the t-th frame, thereby acquiring 1-st bounding boxes corresponding to the 1-st objects, and imparting unique IDs to the 1-st objects; and
(b) the object tracking device (i) (i−1) inputting a (t+1)-th frame, which is a next frame of the t-th frame, into the object detection network, thereby instructing the object detection network to input the (t+1)-th frame into the FPN and thus generate each of a (2_1)-st pyramid feature map to a (2_k)-th pyramid feature map corresponding to the 1-st scale to the k-th scale and (i−2) performing an object detection on a 2-nd combined feature map, in which the (2_1)-st pyramid feature map to the (2_k)-th pyramid feature map are combined, thus detecting 2-nd objects contained in the (t+1)-th frame, thereby acquiring 2-nd bounding boxes corresponding to the 2-nd objects, (ii) (ii−1) generating a (1_1)-st cropped feature map acquired by cropping regions corresponding to the 1-st bounding boxes from a (1_1)-st specific pyramid feature map corresponding to a 1-st specific scale among the (1_1)-st pyramid feature map to the (1_k)-th pyramid feature map, (ii−2) inputting the (1_1)-st cropped feature map and a (2_1)-st specific pyramid feature map respectively into a 1-st self-attention layer and a 1-st cross-attention layer, wherein the (2_1)-st specific pyramid feature map corresponds to the 1-st specific scale among the (2_1)-st pyramid feature map to the (2_k)-th pyramid feature map, thereby instructing the 1-st self-attention layer and the 1-st cross-attention layer to respectively perform operations related to self-attention and operations related to cross-attention on the (1_1)-st cropped feature map and the (2_1)-st specific pyramid feature map and thus generate a (1_1)-st conversion feature map and a (2_1)-st conversion feature map, wherein the (1_1)-st conversion feature map is acquired by converting each of 1-st features of the (1_1)-st cropped feature map to 1-st feature descriptors containing feature information and location information of each of the 1-st features and wherein the (2_1)-st conversion feature map is acquired by converting each of 2-nd features of the (2_1)-st specific pyramid feature map to 2-nd feature descriptors containing feature information and location information of each of the 2-nd features, and (ii−3) inputting the (1_1)-st conversion feature map and the (2_1)-st conversion feature map into a matching layer, thereby instructing the matching layer to acquire 1-st matching pairs by using 1-st matching probabilities acquired by matching the 1-st feature descriptors on the (1_1)-st conversion feature map with the 2-nd feature descriptors on the (2_1)-st conversion feature map, and (iii) selecting, among the 2-nd bounding boxes, a specific 2-nd bounding box containing the largest number of specific 2-nd features corresponding to specific 2-nd feature descriptors in the 1-st matching pairs, and imparting a specific unique ID identical to that of a specific 1-st object to a specific 2-nd object, wherein the specific 1-st object corresponds to a specific 1-st bounding box and the specific 2-nd object corresponds to the specific 2-nd bounding box, thereby performing object tracking.
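The multi-scale pyramid and combined feature map of step (a) can be sketched roughly as follows. This is a minimal NumPy illustration only: the claim does not fix the downsampling, upsampling, or combination operators, so average pooling, nearest-neighbour upsampling, and summation here are assumptions standing in for the FPN and the combining step.

```python
import numpy as np

def build_pyramid(frame, k=3):
    """Produce k pyramid feature maps (1-st to k-th scale) from one
    frame's feature map by repeated 2x2 average pooling (assumed)."""
    maps = [frame]
    for _ in range(k - 1):
        f = maps[-1]
        h, w, c = f.shape
        f = f[: h // 2 * 2, : w // 2 * 2].reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))
        maps.append(f)
    return maps

def combine_pyramid(maps):
    """Combine all pyramid levels into one map: upsample every level
    back to the finest resolution (nearest neighbour) and sum."""
    target_h, target_w, _ = maps[0].shape
    combined = np.zeros_like(maps[0])
    for f in maps:
        rh, rw = target_h // f.shape[0], target_w // f.shape[1]
        combined += np.repeat(np.repeat(f, rh, axis=0), rw, axis=1)
    return combined
```

Detection would then run on `combine_pyramid(build_pyramid(frame_t))` to yield the 1-st bounding boxes; the detector itself is outside this sketch.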
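Steps (b)(ii)–(iii) — converting features into location-aware descriptors, acquiring matching pairs from matching probabilities, and handing the tracked object's unique ID to the 2-nd bounding box that contains the most matched features — can be sketched as below. The dual-softmax probability, cosine similarity, temperature, and majority-vote rule are illustrative assumptions standing in for the claim's attention layers and matching layer, not the patented implementation.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def to_descriptors(fmap):
    """Flatten an (H, W, C) feature map into H*W descriptors carrying
    feature information plus normalised (y, x) location information."""
    h, w, c = fmap.shape
    ys, xs = np.meshgrid(np.arange(h) / h, np.arange(w) / w, indexing="ij")
    return np.concatenate([fmap, ys[..., None], xs[..., None]], axis=-1).reshape(h * w, c + 2)

def match_pairs(desc1, desc2, temperature=0.1, threshold=0.2):
    """Matching probabilities via dual softmax over cosine similarity
    (assumed scheme); keep (i, j) pairs that clear the threshold."""
    d1 = desc1 / np.linalg.norm(desc1, axis=1, keepdims=True)
    d2 = desc2 / np.linalg.norm(desc2, axis=1, keepdims=True)
    sim = d1 @ d2.T / temperature
    prob = softmax(sim, axis=1) * softmax(sim, axis=0)
    return [(i, int(prob[i].argmax())) for i in range(len(desc1)) if prob[i].max() > threshold]

def vote_box(pairs, coords2, boxes2):
    """Select the 2-nd bounding box containing the most matched 2-nd
    features; its object then inherits the tracked object's unique ID."""
    votes = [sum(1 for _, j in pairs
                 if y0 <= coords2[j][0] < y1 and x0 <= coords2[j][1] < x1)
             for (y0, x0, y1, x1) in boxes2]
    return int(np.argmax(votes))
```

Running `match_pairs` on descriptors cropped around a 1-st bounding box and descriptors of the (t+1)-th frame, then `vote_box` on the matched locations, reproduces the ID hand-over logic of step (b)(iii) under these assumptions.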