| CPC G06F 16/5846 (2019.01) [G06F 16/334 (2019.01); G06F 16/335 (2019.01); G06F 16/583 (2019.01); G06F 16/683 (2019.01); G06N 3/044 (2023.01); G06N 3/045 (2023.01)] | 15 Claims |

|
1. A computing system comprising:
one or more processing units;
one or more computer-readable media storing instructions which, when executed by the one or more processing units, cause the one or more processing units to perform operations for answering a natural-language question received from a user about an image, the operations comprising:
determining query features of the natural-language question;
determining image features of image regions within the image;
at a first stage:
applying a first neural network to the query features and the image features to compute first attention information that numerically quantifies each image region according to a relevance of the image region to the natural-language question, and
computing first revised features that combine the image features of the image regions, each image region weighted based on the first attention information according to its relevance to the natural-language question, with the query features;
at a second stage:
applying a second neural network to the first revised features and the image features to determine second attention information, and
computing second revised features that combine the image features of the image regions, each image region weighted based on the second attention information, with the query features;
at one or more additional stages:
repeating computing attention information and revised features, with the revised features output by each stage except a terminal stage being fed as input to a subsequent stage;
determining a natural-language answer to the natural-language question based at least in part on the revised features output by the terminal stage; and
presenting the natural-language answer to the user.
|
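
By way of illustration only, the staged attention recited in claim 1 may be sketched as follows. This is a minimal, non-limiting sketch and not the patented implementation. The claim does not specify how each attention network scores regions or how features are combined, so the following are all assumptions made for the example: the single tanh scoring layer, the softmax normalization, the additive combination of the weighted image features with the query features, the randomly initialized parameters, the feature dimensions, and every function and variable name (`make_stage`, `attention_stage`, `stacked_attention`). Feature extraction from the question and the image, and the final answer determination, are omitted.

```python
import math
import random


def softmax(scores):
    """Normalize raw scores into attention weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def make_stage(dim, hidden, rng):
    """Hypothetical, randomly initialized parameters for one attention stage."""
    def rand_matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return {
        "W_img": rand_matrix(hidden, dim),
        "W_in": rand_matrix(hidden, dim),
        "w_out": [rng.uniform(-0.5, 0.5) for _ in range(hidden)],
    }


def attention_stage(params, stage_input, image_feats, query_feats):
    """One stage: score each region, then combine weighted regions with the query."""
    in_proj = matvec(params["W_in"], stage_input)
    scores = []
    for region in image_feats:
        img_proj = matvec(params["W_img"], region)
        hidden = [math.tanh(a + b) for a, b in zip(img_proj, in_proj)]
        scores.append(sum(w * h for w, h in zip(params["w_out"], hidden)))
    attn = softmax(scores)  # one relevance weight per image region
    dim = len(query_feats)
    pooled = [sum(a * region[j] for a, region in zip(attn, image_feats))
              for j in range(dim)]
    # Assumption: "combine ... with the query features" is read as element-wise addition.
    revised = [p + q for p, q in zip(pooled, query_feats)]
    return attn, revised


def stacked_attention(stages, image_feats, query_feats):
    """Run the stages in order, feeding each stage's revised features to the next."""
    stage_input = query_feats  # the first stage sees the raw query features
    attn_history = []
    for params in stages:
        attn, stage_input = attention_stage(params, stage_input, image_feats, query_feats)
        attn_history.append(attn)
    return attn_history, stage_input  # revised features of the terminal stage


rng = random.Random(0)
dim, hidden, regions, num_stages = 4, 3, 5, 3  # first, second, one additional stage
image_feats = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(regions)]
query_feats = [rng.uniform(-1, 1) for _ in range(dim)]
stages = [make_stage(dim, hidden, rng) for _ in range(num_stages)]
attn_history, final_feats = stacked_attention(stages, image_feats, query_feats)
```

In this sketch, `final_feats` stands in for the revised features output by the terminal stage, which the claim then uses to determine the natural-language answer. A trained system would learn the stage parameters rather than draw them at random.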