US 12,260,674 B2
System and method for attention-aware relation mixer for person search
Mustansar Fiaz, Abu Dhabi (AE); Hisham Cholakkal, Abu Dhabi (AE); Sanath Narayan, Abu Dhabi (AE); Rao Muhammad Anwer, Abu Dhabi (AE); and Fahad Khan, Abu Dhabi (AE)
Assigned to Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi (AE)
Filed by Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi (AE)
Filed on Nov. 9, 2022, as Appl. No. 17/983,741.
Prior Publication US 2024/0153308 A1, May 9, 2024
Int. Cl. G06V 40/16 (2022.01); G06V 10/82 (2022.01); G06V 20/59 (2022.01); H04N 7/18 (2006.01)
CPC G06V 40/173 (2022.01) [G06V 10/82 (2022.01); G06V 20/59 (2022.01); H04N 7/181 (2013.01)]
16 Claims
1. A video system for person search, comprising:
at least one video camera for capturing video images;
a display device; and
a computer system having processing circuitry and memory,
the processing circuitry configured to:
receive a target query person,
perform machine learning using a deep learning network to determine person images, from among the video images, matching the target query person, the deep learning network having
a person detection branch;
a person re-identification branch; and
an attention-aware relation mixer (ARM) connected to the person detection branch and to the person re-identification branch,
the attention-aware relation mixer (ARM) including:
a relation mixer having spatial and channel mixer that performs spatial attention followed by spatial mixing by emphasizing local spatial regions of a person using a spatial attention before globally mixing the local spatial regions across all spatial regions, channel attention followed by channel mixing, and an input-output skip connection configured to perform feature re-using within the relation mixer, and
a joint spatio-channel attention layer that utilizes 3D attention weights to modulate 3D spatio-channel region of interest features and aggregate the features with output of the relation mixer; and
the display device is configured to display matching person images for the person search,
wherein in the deep learning network the person detection branch has a region of interest alignment (RoIAlign) block for region of interest alignment and a shared convolution (res5) block,
the person re-identification branch having a RoIAlign block and a shared convolution block, and
each said branch is connected to the attention-aware relation mixer (ARM) between the respective RoIAlign block and shared convolution block.
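
The data flow recited in claim 1 for the attention-aware relation mixer (ARM) can be illustrated with a minimal NumPy sketch. It is not the patented implementation, and everything below is assumed for illustration only: the function names, the sigmoid-of-mean attention weights, the energy-based 3D weighting, and the random mixing matrices (which would be learned in a trained network). The claim fixes only the order of operations: spatial attention then spatial mixing, channel attention then channel mixing, an input-output skip connection, and 3D-attention-modulated features aggregated with the relation mixer output. In the claimed network such a module would sit between the RoIAlign block and the res5 block of each branch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relation_mixer(x, w_spatial, w_channel):
    """x: (N, C) RoI features with N = H*W spatial positions and C channels."""
    # Spatial attention (assumed form): one weight per position,
    # emphasising local spatial regions before any global mixing.
    spatial_attn = sigmoid(x.mean(axis=1, keepdims=True))   # (N, 1)
    # Spatial mixing: each output position combines all positions.
    y = w_spatial @ (x * spatial_attn)                       # (N, C)
    # Channel attention (assumed form): one weight per channel.
    channel_attn = sigmoid(y.mean(axis=0, keepdims=True))   # (1, C)
    # Channel mixing: each output channel combines all channels.
    y = (y * channel_attn) @ w_channel                       # (N, C)
    # Input-output skip connection for feature re-use.
    return x + y

def joint_spatio_channel_attention(x, eps=1e-4):
    """Modulate features with 3D weights: one weight per (position, channel).

    The energy-based weighting is a hypothetical, parameter-free choice.
    """
    mu = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    energy = (x - mu) ** 2 / (4.0 * (var + eps)) + 0.5
    return x * sigmoid(energy)

def arm(roi_feat, w_spatial, w_channel):
    """roi_feat: (C, H, W) RoI-aligned features; returns the same shape."""
    c, h, w = roi_feat.shape
    x = roi_feat.reshape(c, h * w).T                         # (N, C)
    # Aggregate the attention-modulated features with the mixer output.
    out = relation_mixer(x, w_spatial, w_channel) + joint_spatio_channel_attention(x)
    return out.T.reshape(c, h, w)

# Toy usage with random stand-ins for learned weights.
rng = np.random.default_rng(0)
C, H, W = 8, 4, 4
roi = rng.standard_normal((C, H, W))
w_s = rng.standard_normal((H * W, H * W)) * 0.1
w_c = rng.standard_normal((C, C)) * 0.1
out = arm(roi, w_s, w_c)
print(out.shape)  # (8, 4, 4)
```

Because the output has the same shape as the RoI-aligned input, a module of this kind can be placed ahead of a shared convolution block without changing that block's expected input size.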