US 12,461,965 B2
Multi-stage image querying
Xiaodong He, Sammamish, WA (US); Li Deng, Redmond, WA (US); Jianfeng Gao, Woodinville, WA (US); Alex Smola, Pittsburgh, PA (US); and Zichao Yang, Pittsburgh, PA (US)
Assigned to Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed by Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed on May 3, 2021, as Appl. No. 17/306,606.
Application 17/306,606 is a continuation of application No. 15/097,086, filed on Apr. 12, 2016, granted, now 10,997,233.
Prior Publication US 2021/0326377 A1, Oct. 21, 2021
This patent is subject to a terminal disclaimer.
Int. Cl. G06F 16/583 (2019.01); G06F 16/334 (2025.01); G06F 16/335 (2019.01); G06F 16/683 (2019.01); G06N 3/044 (2023.01); G06N 3/045 (2023.01)
CPC G06F 16/5846 (2019.01) [G06F 16/334 (2019.01); G06F 16/335 (2019.01); G06F 16/583 (2019.01); G06F 16/683 (2019.01); G06N 3/044 (2023.01); G06N 3/045 (2023.01)]
15 Claims
1. A computing system comprising:
one or more processing units;
one or more computer-readable media storing instructions which, when executed by the one or more processing units, cause the one or more processing units to perform operations for answering a natural-language question received from a user about an image, the operations comprising:
determining query features of the natural-language question;
determining image features of image regions within the image;
at a first stage:
applying a first neural network to the query features and the image features to compute first attention information that numerically quantifies each image region according to a relevance of the image region to the natural-language question, and
computing first revised features that combine the image features of the image regions, each image region weighted based on the first attention information according to its relevance to the natural-language question, with the query features;
at a second stage:
applying a second neural network to the first revised features and the image features to determine second attention information, and
computing second revised features that combine the image features of the image regions, each image region weighted based on the second attention information, with the query features;
at one or more additional stages:
repeating computing attention information and revised features, with the revised features output by each stage except a terminal stage being fed as input to a subsequent stage;
determining a natural-language answer to the natural-language question based at least in part on the revised features output by the terminal stage; and
presenting the natural-language answer to the user.
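
The staged operations recited in claim 1 may be illustrated by the following minimal, non-limiting sketch. It is not the patented implementation: the weights are random and untrained, and the one-hidden-layer attention network, the softmax normalization, the element-wise addition used to combine the weighted image features with the query features, the toy dimensions, the function names (`attention_stage`, `multi_stage_query`), and the `ANSWERS` vocabulary are all assumptions introduced for illustration only.

```python
import math
import random


def softmax(scores):
    """Normalize raw scores into attention weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def make_stage(dim, hidden, rng):
    """Random, untrained weights for one hypothetical attention stage."""
    def rand_matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return {
        "W_img": rand_matrix(hidden, dim),
        "W_ctx": rand_matrix(hidden, dim),
        "w_out": [rng.uniform(-0.5, 0.5) for _ in range(hidden)],
    }


def attention_stage(stage, context, image_feats, query_feats):
    """Score every image region against `context`, then build revised features."""
    ctx_proj = matvec(stage["W_ctx"], context)
    scores = []
    for region in image_feats:
        img_proj = matvec(stage["W_img"], region)
        hidden = [math.tanh(a + b) for a, b in zip(img_proj, ctx_proj)]
        scores.append(sum(w * h for w, h in zip(stage["w_out"], hidden)))
    attention = softmax(scores)  # one relevance weight per image region
    dim = len(query_feats)
    attended = [sum(p * region[d] for p, region in zip(attention, image_feats))
                for d in range(dim)]
    # Assumption: "combine ... with the query features" is modeled as addition.
    revised = [a + q for a, q in zip(attended, query_feats)]
    return attention, revised


def multi_stage_query(stages, image_feats, query_feats):
    """Feed each stage's revised features into the next stage."""
    context = query_feats  # the first stage attends using the query features
    history = []
    for stage in stages:
        attention, context = attention_stage(stage, context, image_feats, query_feats)
        history.append(attention)
    return context, history  # revised features output by the terminal stage


rng = random.Random(0)
DIM, HIDDEN, REGIONS, STAGES = 4, 3, 5, 3  # toy sizes, chosen arbitrarily
image_feats = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(REGIONS)]
query_feats = [rng.uniform(-1, 1) for _ in range(DIM)]
stages = [make_stage(DIM, HIDDEN, rng) for _ in range(STAGES)]

final_feats, attn_history = multi_stage_query(stages, image_feats, query_feats)

# Hypothetical answer step: a linear scorer over a tiny answer vocabulary.
ANSWERS = ["yes", "no", "two"]
W_ans = [[rng.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in ANSWERS]
logits = matvec(W_ans, final_feats)
answer = ANSWERS[max(range(len(logits)), key=logits.__getitem__)]
print(answer)
```

In this sketch, three stages are used so that the first stage, the second stage, and one additional (terminal) stage are each exercised; `attn_history` holds the per-region attention weights computed at each stage, and only the terminal stage's revised features reach the answer step.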