US 12,190,557 B2
Difference description statement generation method and apparatus, device and medium
Xiaochuan Li, Suzhou (CN); Rengang Li, Suzhou (CN); Zhenhua Guo, Suzhou (CN); Yaqian Zhao, Suzhou (CN); and Baoyu Fan, Suzhou (CN)
Assigned to SUZHOU METABRAIN INTELLIGENT TECHNOLOGY CO., LTD., Suzhou (CN)
Appl. No. 18/714,928
Filed by Suzhou Metabrain Intelligent Technology Co., Ltd., Suzhou (CN)
PCT Filed Sep. 15, 2022, PCT No. PCT/CN2022/118852 § 371(c)(1), (2) Date May 30, 2024, PCT Pub. No. WO2023/201975, PCT Pub. Date Oct. 26, 2023.
Claims priority of application No. 202210407134.X (CN), filed on Apr. 19, 2022.
Prior Publication US 2024/0331345 A1, Oct. 3, 2024
Int. Cl. G06V 10/46 (2022.01); G06N 3/045 (2023.01); G06V 30/418 (2022.01)

CPC G06V 10/467 (2022.01) [G06N 3/045 (2023.01); G06V 30/418 (2022.01)]

20 Claims

1. A difference description statement generation method, comprising:

encoding a target image and target text respectively, and performing feature concatenation on an image encoding feature and a text encoding feature that are obtained by encoding, to obtain a concatenated encoding feature;

inputting the concatenated encoding feature to a preset image-text alignment unit constructed based on a preset self-attention mechanism to perform image-text alignment processing, to obtain a concatenated alignment feature;

splitting the concatenated alignment feature to obtain an image alignment feature and a text alignment feature, and inputting the image alignment feature, the text encoding feature, and the text alignment feature to a preset noise monitoring unit constructed based on the preset self-attention mechanism and a preset cross-attention mechanism to perform processing, to extract a difference signal between the target image and the target text; and

generating a difference description statement based on the difference signal by using a preset difference description generation algorithm.
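
The data flow recited in claim 1 can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the patent. The function names (`attention`, `align`, `noise_monitor`, `difference_signal`) are hypothetical. The attention is plain scaled dot-product attention with no learned projections. The way the noise monitoring unit combines its three inputs (self-attention on the text alignment feature, cross-attention against the image alignment feature, then a residual against the text encoding feature) is one plausible arrangement only, since the claim does not fix it. The encoders and the difference description generation algorithm are not modeled: the sketch starts from already encoded features and stops at the difference signal.

```python
import math
from typing import List

Matrix = List[List[float]]  # rows are tokens, columns are feature dimensions


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two matrices stored as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def softmax_rows(a: Matrix) -> Matrix:
    """Numerically stable softmax applied to each row."""
    out = []
    for row in a:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention(query: Matrix, key: Matrix, value: Matrix) -> Matrix:
    """Scaled dot-product attention (assumption: no learned projections)."""
    scale = math.sqrt(len(query[0]))
    scores = [[v / scale for v in row] for row in matmul(query, transpose(key))]
    return matmul(softmax_rows(scores), value)


def align(concat: Matrix) -> Matrix:
    """Hypothetical image-text alignment unit: self-attention over the
    concatenated image and text tokens."""
    return attention(concat, concat, concat)


def noise_monitor(image_align: Matrix, text_enc: Matrix, text_align: Matrix) -> Matrix:
    """Hypothetical noise monitoring unit (assumed arrangement)."""
    # Self-attention over the text alignment feature.
    text_self = attention(text_align, text_align, text_align)
    # Cross-attention: text tokens query the image alignment feature.
    image_view = attention(text_self, image_align, image_align)
    # Assumed difference signal: per-token residual between the original
    # text encoding and what the image supports for that token.
    return [[t - c for t, c in zip(t_row, c_row)]
            for t_row, c_row in zip(text_enc, image_view)]


def difference_signal(image_feat: Matrix, text_feat: Matrix) -> Matrix:
    """Concatenate, align, split, then extract the difference signal."""
    concat = image_feat + text_feat      # concatenation along the token axis
    aligned = align(concat)
    n_img = len(image_feat)
    image_align, text_align = aligned[:n_img], aligned[n_img:]
    return noise_monitor(image_align, text_feat, text_align)


# Toy input: 3 image tokens and 2 text tokens, each with 4 feature dimensions.
image_feat = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.1, 0.0, 0.2], [0.3, 0.3, 0.3, 0.3]]
text_feat = [[0.2, 0.1, 0.4, 0.0], [0.9, 0.8, 0.7, 0.6]]
signal = difference_signal(image_feat, text_feat)  # one row per text token
```

In this sketch the difference signal has one row per text token, so a downstream generator (not shown) could condition on which parts of the text disagree with the image. A trained implementation would add learned projection weights, multiple heads and normalization layers.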