| CPC G06V 10/82 (2022.01) [G06F 40/10 (2020.01); G06F 40/284 (2020.01)] | 30 Claims |

|
9. A processor-implemented method performed by at least one processor, the processor-implemented method comprising:
receiving, by a first artificial neural network (ANN), an interleaved sequence of images and textual information;
extracting, by the first ANN, grid features of the images of the interleaved sequence of the images and the textual information to generate a representation of the interleaved sequence of the images and the textual information based on the grid features;
mapping, by a second ANN, the grid features to a textual domain;
extracting, by the second ANN, visual information of the interleaved sequence of the images and the textual information based on the grid features in the textual domain; and
determining, by the second ANN, a rationale based on the visual information of the interleaved sequence of images and the textual information, the visual information comprising one or more lower-level surrogate tasks.
|
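The data flow recited in claim 9 can be illustrated with a minimal sketch. It is not the claimed implementation: both networks are replaced by untrained stand-ins (mean pooling over a grid for the first ANN, a fixed linear projection for the second), and every name below (`grid_features`, `encode_sequence`, `map_to_text_domain`, `run_pipeline`, the `weights` matrix) is hypothetical. The final step only collects the inputs a rationale would be determined from; it does not generate a rationale.

```python
from typing import Dict, List, Union

Image = List[List[float]]   # assumption: a 2-D grayscale image as nested lists
Item = Union[str, Image]    # an interleaved sequence mixes text and images


def grid_features(image: Image, grid: int = 2) -> List[float]:
    """Stand-in for the first ANN: mean-pool the image over a grid x grid layout."""
    cell_h, cell_w = len(image) // grid, len(image[0]) // grid
    feats = []
    for gy in range(grid):
        for gx in range(grid):
            cell = [image[y][x]
                    for y in range(gy * cell_h, (gy + 1) * cell_h)
                    for x in range(gx * cell_w, (gx + 1) * cell_w)]
            feats.append(sum(cell) / len(cell))
    return feats


def encode_sequence(seq: List[Item]) -> List[Union[str, List[float]]]:
    """Build a representation: text is kept, each image becomes its grid features."""
    return [item if isinstance(item, str) else grid_features(item) for item in seq]


def map_to_text_domain(feats: List[float], weights: List[List[float]]) -> List[float]:
    """Stand-in for the second ANN's mapping: a linear projection of grid features."""
    return [sum(w * f for w, f in zip(row, feats)) for row in weights]


def run_pipeline(seq: List[Item], weights: List[List[float]]) -> Dict[str, object]:
    """Chain the steps; the returned dict is a placeholder for the rationale inputs."""
    representation = encode_sequence(seq)
    visual_info = [map_to_text_domain(item, weights)
                   for item in representation if not isinstance(item, str)]
    text = " ".join(item for item in representation if isinstance(item, str))
    return {"text": text, "visual_info": visual_info}
```

In this sketch the order of the steps mirrors the claim: grid features are extracted first, mapped to a text-like space second, and only then combined with the textual information.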