| CPC G06V 40/173 (2022.01) [G06V 10/82 (2022.01); G06V 20/59 (2022.01); H04N 7/181 (2013.01)] | 16 Claims |

|
1. A video system for person search, comprising:
at least one video camera for capturing video images;
a display device; and
a computer system having processing circuitry and memory,
the processing circuitry configured to:
receive a target query person,
perform machine learning using a deep learning network to determine person images, from among the video images, matching the target query person, the deep learning network having
a person detection branch;
a person re-identification branch; and
an attention-aware relation mixer (ARM) connected to the person detection branch and to the person re-identification branch,
the attention-aware relation mixer (ARM) including:
a relation mixer having a spatial and channel mixer that performs spatial attention followed by spatial mixing by emphasizing local spatial regions of a person using a spatial attention before globally mixing the local spatial regions across all spatial regions, channel attention followed by channel mixing, and an input-output skip connection configured to perform feature re-using within the relation mixer, and
a joint spatio-channel attention layer that utilizes 3D attention weights to modulate 3D spatio-channel region of interest features and aggregate the features with output of the relation mixer; and
the display device is configured to display matching person images for the person search,
wherein in the deep learning network the person detection branch has a region of interest alignment (RoIAlign) block for region of interest alignment and a shared convolution (res5) block,
the person re-identification branch having a RoIAlign block and a shared convolution block, and
each said branch is connected to the attention-aware relation mixer (ARM) between the respective RoIAlign block and shared convolution block.
|
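
For illustration only, the data flow recited for the attention-aware relation mixer (ARM) in claim 1 can be sketched in NumPy. This is a minimal sketch under stated assumptions, not the claimed or patented implementation: the function names (`relation_mixer`, `joint_spatio_channel_attention`, `arm`), the sigmoid gating, the dense mixing matrices `w_spatial` and `w_channel`, and the parameter-free variance-based formula for the 3D attention weights are all hypothetical choices. The claim fixes only the order of operations (spatial attention, spatial mixing, channel attention, channel mixing, input-output skip connection, then aggregation with 3D-attention-modulated region of interest features).

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def relation_mixer(x, w_spatial, w_channel):
    """Relation mixer over one RoI feature map.

    x:         RoI features, shape (C, H, W)
    w_spatial: assumed dense spatial-mixing matrix, shape (H*W, H*W)
    w_channel: assumed dense channel-mixing matrix, shape (C, C)
    """
    c, h, w = x.shape
    tokens = x.reshape(c, h * w)  # (C, N), N = H*W spatial locations

    # Spatial attention: one weight per location emphasizes local regions.
    s_att = sigmoid(tokens.mean(axis=0, keepdims=True))  # (1, N)
    # Spatial mixing: every location is mixed with all other locations.
    tokens = (tokens * s_att) @ w_spatial  # (C, N)

    # Channel attention: one weight per channel.
    c_att = sigmoid(tokens.mean(axis=1, keepdims=True))  # (C, 1)
    # Channel mixing: every channel is mixed with all other channels.
    tokens = w_channel @ (tokens * c_att)  # (C, N)

    # Input-output skip connection re-uses the input features.
    return x + tokens.reshape(c, h, w)


def joint_spatio_channel_attention(x, eps=1e-4):
    """3D attention weights, one per (channel, row, column) entry.

    Assumption: a parameter-free, variance-normalized energy is used here
    purely as a stand-in; the claim does not specify how weights are computed.
    """
    mu = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    energy = (x - mu) ** 2 / (4.0 * (var + eps)) + 0.5
    return sigmoid(energy)  # shape (C, H, W)


def arm(x, w_spatial, w_channel):
    """Aggregate relation-mixer output with attention-modulated RoI features."""
    mixed = relation_mixer(x, w_spatial, w_channel)
    weights_3d = joint_spatio_channel_attention(x)
    return mixed + weights_3d * x


# Usage: a random 8-channel, 4x3 RoI feature map with small random mixers.
rng = np.random.default_rng(0)
roi = rng.standard_normal((8, 4, 3))
w_s = 0.1 * rng.standard_normal((12, 12))
w_c = 0.1 * rng.standard_normal((8, 8))
out = arm(roi, w_s, w_c)
print(out.shape)
```

In the arrangement recited at the end of claim 1, a module of this kind would sit in each branch between the RoIAlign block and the shared convolution (res5) block, so the output keeps the same shape as the RoI features it receives.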