| CPC G06F 40/58 (2020.01) [G06V 10/454 (2022.01); G06V 10/7715 (2022.01); G06V 10/82 (2022.01); G06V 20/62 (2022.01)] | 19 Claims |

|
8. A system, comprising:
one or more memory devices storing instructions; and
one or more data processing apparatus that are configured to interact with the one or more memory devices, and upon execution of the instructions, perform operations including:
obtaining a first image that depicts first text written in a source language;
inputting the first image into an image feature extractor of an image translation model that is trained, using a loss function, to extract, from an input image, a set of image features that are a description of a portion of the input image in which the text is depicted;
obtaining, from the feature extractor and in response to inputting the first image into the feature extractor, a first set of image features representing the first text present in the first image;
inputting the first set of image features into a decoder of the image translation model, wherein the decoder is trained, using the loss function, to infer text in a target language from an input set of image features, wherein the inferred text is a predicted translation of text represented by the input set of image features; and
obtaining, from the decoder and in response to inputting the first set of image features into the decoder, a second text that is in a target language and is predicted to be a translation of the first text.
|
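The data flow recited in claim 8 (first image, then image feature extractor, then first set of image features, then decoder, then second text in the target language) can be illustrated with a minimal sketch. Everything named here is hypothetical and is not taken from the claim: `ImageTranslationModel`, `toy_feature_extractor`, `make_toy_decoder`, the list-of-lists image format, and the three-number feature vector are assumptions for illustration only. The claim recites a feature extractor and a decoder that are trained using a loss function; this sketch performs no training and substitutes hand-written stand-ins (an ink bounding-box summary and a nearest-neighbour lookup) so that the inference path can be run end to end.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Image = List[List[int]]   # hypothetical toy image: rows of pixel values, 0 = background
Features = List[float]    # hypothetical toy feature vector


@dataclass
class ImageTranslationModel:
    """Hypothetical container pairing a feature extractor with a decoder."""
    feature_extractor: Callable[[Image], Features]
    decoder: Callable[[Features], str]

    def translate(self, first_image: Image) -> str:
        # Input the image into the feature extractor and obtain image features.
        first_features = self.feature_extractor(first_image)
        # Input the features into the decoder and obtain the predicted translation.
        return self.decoder(first_features)


def toy_feature_extractor(image: Image) -> Features:
    """Stand-in for a trained extractor: summarises only the inked region."""
    ink_rows = [r for r, row in enumerate(image) if any(row)]
    ink_cols = [c for row in image for c, px in enumerate(row) if px]
    if not ink_rows:
        return [0.0, 0.0, 0.0]
    height = ink_rows[-1] - ink_rows[0] + 1
    width = max(ink_cols) - min(ink_cols) + 1
    ink = len(ink_cols)  # one entry per non-zero pixel
    return [float(height), float(width), float(ink)]


def make_toy_decoder(
    examples: Sequence[Tuple[Features, str]]
) -> Callable[[Features], str]:
    """Stand-in for a trained decoder: nearest-neighbour lookup of target text."""
    def decode(features: Features) -> str:
        def squared_distance(example: Tuple[Features, str]) -> float:
            reference, _ = example
            return sum((a - b) ** 2 for a, b in zip(reference, features))
        return min(examples, key=squared_distance)[1]
    return decode


# Hypothetical reference pairs: (features of source-language text, target text).
decoder = make_toy_decoder([
    ([2.0, 2.0, 3.0], "hola"),
    ([1.0, 3.0, 3.0], "adios"),
])
model = ImageTranslationModel(toy_feature_extractor, decoder)

first_image = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
second_text = model.translate(first_image)
```

In this sketch the decoder receives only the extracted features and never the image itself, which mirrors the ordering of the claimed operations. In a trained system both stand-ins would presumably be learned components optimised under the same loss function, as the claim recites.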