US 11,775,574 B2
Method and apparatus for visual question answering, computer device and medium
Yulin Li, Beijing (CN); Xiameng Qin, Beijing (CN); Ju Huang, Beijing (CN); Qunyi Xie, Beijing (CN); and Junyu Han, Beijing (CN)
Assigned to Beijing Baidu Netcom Science Technology Co., Ltd., Beijing (CN)
Filed by Beijing Baidu Netcom Science Technology Co., Ltd., Beijing (CN)
Filed on Feb. 23, 2021, as Appl. No. 17/182,987.
Claims priority of application No. 202010616310.1 (CN), filed on Jun. 30, 2020.
Prior Publication US 2021/0406592 A1, Dec. 30, 2021
Int. Cl. G06F 16/00 (2019.01); G06F 16/36 (2019.01); G06F 40/279 (2020.01); G06F 18/25 (2023.01); G06V 10/764 (2022.01); G06V 10/80 (2022.01); G06V 10/82 (2022.01); G06V 10/44 (2022.01); G06V 10/426 (2022.01); G06N 3/02 (2006.01)

CPC G06F 16/367 (2019.01) [G06F 18/253 (2023.01); G06F 40/279 (2020.01); G06V 10/426 (2022.01); G06V 10/454 (2022.01); G06V 10/764 (2022.01); G06V 10/811 (2022.01); G06V 10/82 (2022.01); G06N 3/02 (2013.01)]

16 Claims

1. A method for visual question answering, comprising:

acquiring an input image and an input question;

constructing a visual graph based on the input image, wherein the visual graph comprises a first node feature and a first edge feature;

constructing a question graph based on the input question, wherein the question graph comprises a second node feature and a second edge feature;

performing a multimodal fusion on the visual graph and the question graph to obtain an updated visual graph and an updated question graph;

determining a question feature based on the input question;

determining a fusion feature based on the updated visual graph, the updated question graph and the question feature; and

generating a predicted answer for the input image and the input question based on the fusion feature;

wherein the performing the multimodal fusion on the visual graph and the question graph comprises: performing at least one round of multimodal fusion operation, wherein each of the at least one round of multimodal fusion operation comprises:

encoding the first node feature by using a first predetermined network based on the first node feature and the first edge feature, to obtain an encoded visual graph;

encoding the second node feature by using a second predetermined network based on the second node feature and the second edge feature, to obtain an encoded question graph; and

performing a multimodal fusion on the encoded visual graph and the encoded question graph by using a graph match algorithm, to obtain the updated visual graph and the updated question graph;

wherein the first predetermined network comprises: a first fully connected layer, a first graph convolutional layer and a second graph convolutional layer, and the encoding the first node feature comprises:

mapping the first node feature to a first feature by using the first fully connected layer, wherein a number of spatial dimensions of the first feature is equal to a predetermined number;

processing the first feature by using the first graph convolutional layer to obtain a second feature;

processing the second feature by using the second graph convolutional layer to obtain the encoded first node feature; and

constructing the encoded visual graph by using the encoded first node feature and the first edge feature.
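
For illustration only, the encoding steps recited for the first predetermined network (a fully connected layer followed by two graph convolutional layers) can be sketched as below. This is a minimal sketch under stated assumptions and is not the claimed implementation: it assumes the first edge feature is a non-negative K x K matrix of relationships between K nodes, it assumes a mean-aggregation graph convolution with a ReLU activation, and all function names, parameter names and dimensions are hypothetical.

```python
import numpy as np

def normalize_adjacency(edge_feature):
    # Assumption: edge_feature is a non-negative K x K matrix.
    # Add self-loops, then row-normalize so each node averages over its neighbors.
    a = edge_feature + np.eye(edge_feature.shape[0])
    return a / a.sum(axis=1, keepdims=True)

def fully_connected(x, weight, bias):
    # Maps each node feature to the predetermined number of dimensions.
    return x @ weight + bias

def graph_conv(x, adj_norm, weight):
    # One graph convolution: aggregate neighbor features, project, apply ReLU.
    return np.maximum(adj_norm @ x @ weight, 0.0)

def encode_visual_graph(node_feature, edge_feature, params):
    # First fully connected layer: first node feature -> first feature.
    first_feature = fully_connected(node_feature, params["fc_w"], params["fc_b"])
    adj_norm = normalize_adjacency(edge_feature)
    # First graph convolutional layer: first feature -> second feature.
    second_feature = graph_conv(first_feature, adj_norm, params["gcn1_w"])
    # Second graph convolutional layer: second feature -> encoded first node feature.
    encoded_node_feature = graph_conv(second_feature, adj_norm, params["gcn2_w"])
    # The encoded visual graph pairs the encoded nodes with the unchanged edge feature.
    return encoded_node_feature, edge_feature

# Hypothetical sizes: 5 nodes, 8 input dimensions, predetermined number = 4.
rng = np.random.default_rng(0)
num_nodes, in_dim, hidden_dim = 5, 8, 4
node_feature = rng.normal(size=(num_nodes, in_dim))
edge_feature = (rng.random((num_nodes, num_nodes)) > 0.5).astype(float)
params = {
    "fc_w": rng.normal(size=(in_dim, hidden_dim)),
    "fc_b": np.zeros(hidden_dim),
    "gcn1_w": rng.normal(size=(hidden_dim, hidden_dim)),
    "gcn2_w": rng.normal(size=(hidden_dim, hidden_dim)),
}
encoded_nodes, edges_out = encode_visual_graph(node_feature, edge_feature, params)
print(encoded_nodes.shape)  # (5, 4)
```

In this sketch the edge feature passes through unchanged, mirroring the final step in which the encoded visual graph is constructed from the encoded first node feature and the first edge feature. The weights are random placeholders; in practice they would be learned.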