US 12,223,284 B2
Visual dialogue method and system
Lei Zhao, Yibin (CN); Junlin Li, Yibin (CN); Jie Shao, Yibin (CN); Lianli Gao, Yibin (CN); and Jingkuan Song, Yibin (CN)
Assigned to Sichuan Institute of Artificial Intelligence, Yibin, Sichuan, China, Yibin (CN)
Filed by Sichuan Institute of Artificial Intelligence, Yibin, Sichuan, China, Yibin (CN)
Filed on Oct. 27, 2022, as Appl. No. 17/974,568.
Claims priority of application No. 202211110308.2 (CN), filed on Sep. 13, 2022.
Prior Publication US 2024/0086643 A1, Mar. 14, 2024
Int. Cl. G06F 40/35 (2020.01); G06V 10/80 (2022.01); G06V 10/82 (2022.01)
CPC G06F 40/35 (2020.01) [G06V 10/811 (2022.01); G06V 10/82 (2022.01)] 4 Claims
OG exemplary drawing
 
1. A visual dialogue method, comprising:
obtaining original input data, wherein the original input data comprises current image data and a new question, and the new question is related to the current image data;
preprocessing text data and image data in the original input data to obtain a text feature sequence and a visual feature sequence, respectively;
using a Visual Dialog (VisDial) dataset to construct a text corpus;
obtaining text sequence knowledge based on the visual feature sequence and the text corpus;
constructing a sparse scene graph based on the visual feature sequence;
performing a data fusion on the text feature sequence, the visual feature sequence, the text sequence knowledge, and the sparse scene graph to obtain a data fusion result; and
obtaining dialogue content of the new question by decoding based on the data fusion result;
wherein the step of obtaining the text sequence knowledge based on the visual feature sequence and the text corpus comprises:
obtaining a text data feature in the text corpus; and
calculating a similarity between the text data feature and the visual feature sequence to obtain the text sequence knowledge;
wherein the step of performing the data fusion on the text feature sequence, the visual feature sequence, the text sequence knowledge, and the sparse scene graph to obtain the data fusion result comprises:
obtaining a first attention result based on the text feature sequence and the new question;
obtaining a second attention result based on the text sequence knowledge;
cascading the first attention result and the second attention result to obtain a cascading result;
obtaining a third attention result based on the visual feature sequence and the second attention result;
performing a graph convolution on the sparse scene graph to obtain a graph convolution result; and
obtaining the data fusion result based on the cascading result, the third attention result, and the graph convolution result;
wherein the step of obtaining the first attention result based on the text feature sequence and the new question comprises:
performing a sentence-level attention guidance on the text feature sequence by using the new question to obtain an attention feature;
filtering the attention feature by using a sigmoid activation function to obtain a sentence-level sequential representation of potential knowledge;
obtaining a word-level sequential representation of the potential knowledge by calculating a dot product of the attention feature and an output of the sigmoid activation function based on a word-level question feature of the new question; and
obtaining the first attention result based on the attention feature and the word-level sequential representation of the potential knowledge.
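As a rough illustration of the recited retrieval of text sequence knowledge (obtaining a text data feature in the corpus and calculating a similarity between it and the visual feature sequence), the sketch below scores corpus sentence features against image-region features by cosine similarity and keeps the top-k sentences. The cosine measure, the top-k selection, and the names retrieve_text_sequence_knowledge and k are illustrative assumptions, not details taken from the patent.

import torch
import torch.nn.functional as F

def retrieve_text_sequence_knowledge(corpus_feats, visual_feats, k=5):
    """corpus_feats: (num_sentences, d); visual_feats: (num_regions, d)."""
    # Normalize so the dot product becomes a cosine similarity.
    c = F.normalize(corpus_feats, dim=-1)
    v = F.normalize(visual_feats, dim=-1)
    # Score each corpus sentence by its best-matching image region.
    sim = c @ v.T                      # (num_sentences, num_regions)
    scores = sim.max(dim=-1).values    # (num_sentences,)
    topk = scores.topk(k).indices      # indices of the k most relevant sentences
    return corpus_feats[topk]          # (k, d) text sequence knowledge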
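The claim also recites constructing a sparse scene graph based on the visual feature sequence. One plausible reading, sketched below under assumptions of our own (cosine-based pairwise relation scores, a fixed sparsity threshold, added self-loops), links image regions whose features are strongly related and drops weak edges:

import torch
import torch.nn.functional as F

def build_sparse_scene_graph(visual_feats, threshold=0.5):
    """visual_feats: (num_regions, d) -> sparse adjacency (num_regions, num_regions)."""
    v = F.normalize(visual_feats, dim=-1)
    scores = v @ v.T                      # pairwise cosine similarities between regions
    adj = (scores > threshold).float()    # keep only strong relations -> sparse graph
    adj.fill_diagonal_(1.0)               # self-loops so each node keeps its own feature
    return adj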
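The data-fusion step (cascading the first and second attention results, a third attention over the visual features guided by the second result, a graph convolution over the sparse scene graph, and combination into the data fusion result) could be sketched as follows. The scaled dot-product attention, the use of the question to guide the second attention, the single linear graph-convolution layer, and the final linear fusion head are assumptions for illustration only; the claim does not fix these choices.

import torch
import torch.nn as nn
import torch.nn.functional as F

def attend(query, keys):
    """Scaled dot-product attention pooling of `keys` (n, d) by `query` (d,)."""
    w = F.softmax(keys @ query / keys.shape[-1] ** 0.5, dim=0)  # (n,)
    return w @ keys                                             # (d,)

class FusionSketch(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.gcn = nn.Linear(d, d)       # one graph-convolution layer (assumed)
        self.fuse = nn.Linear(4 * d, d)  # fuses cascade + third attention + graph result

    def forward(self, q, text_seq, knowledge, visual, adj):
        first = attend(q, text_seq)           # first attention result
        second = attend(q, knowledge)         # second attention result
        cascade = torch.cat([first, second])  # cascading result (2d,)
        third = attend(second, visual)        # third attention result, guided by second
        # Graph convolution on the sparse scene graph: row-normalized adjacency times features.
        a_hat = adj / adj.sum(dim=-1, keepdim=True).clamp(min=1.0)
        graph = F.relu(self.gcn(a_hat @ visual)).mean(dim=0)  # pooled node features (d,)
        return self.fuse(torch.cat([cascade, third, graph]))  # data fusion result (d,)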
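Finally, the first-attention sub-steps (sentence-level attention guidance by the new question, sigmoid filtering, a word-level representation obtained via dot products with word-level question features, and combination into the first attention result) might look like the sketch below. The element-wise sum used to combine the two representations and the exact form of the sigmoid gating are assumptions; the claim leaves these open.

import torch
import torch.nn.functional as F

def first_attention(q_sent, q_words, text_seq):
    """q_sent: (d,) sentence-level question feature; q_words: (m, d) word-level
    question features; text_seq: (n, d) text feature sequence."""
    # Sentence-level attention guidance by the new question.
    w = F.softmax(text_seq @ q_sent / text_seq.shape[-1] ** 0.5, dim=0)
    attn_feat = w @ text_seq                              # attention feature (d,)
    # Sigmoid filtering -> sentence-level sequential representation.
    sent_repr = torch.sigmoid(attn_feat) * attn_feat
    # Word-level sequential representation via dot products with word-level question features.
    word_scores = F.softmax(q_words @ sent_repr, dim=0)   # (m,)
    word_repr = word_scores @ q_words                     # (d,)
    # First attention result combines the attention feature and the word-level representation.
    return attn_feat + word_repr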