| CPC G06F 3/038 (2013.01) [G06F 3/023 (2013.01); G06F 40/284 (2020.01); G06N 3/0442 (2023.01); G06N 3/092 (2023.01); G06V 10/82 (2022.01)] | 16 Claims |

|
1. A computer-implemented method for controlling a particular computer to execute a task, the method including:
receiving a control input including a visual input, the visual input including one or more screen frames of a computer display that represent at least a current state of the particular computer;
processing the control input using a neural network to generate one or more control outputs that are used to control the particular computer to execute the task, wherein the one or more control outputs comprise an action type output that specifies at least one of a pointing device action or a keyboard action to be performed to control the particular computer;
determining one or more actions from the one or more control outputs; and
executing the one or more actions to control the particular computer,
wherein the control input further comprises one or more language inputs, one or more previous controls, or both, and
wherein the neural network comprises a visual processing sub-network, one or more language processing sub-networks, a previous control processing sub-network, a multimodal transformer sub-network, and an output sub-network, and wherein processing the control input using the neural network to generate the one or more control outputs comprises:
processing, using the visual processing sub-network, the visual input to generate one or more visual embeddings;
processing each language input in the one or more language inputs using the language processing sub-network to generate a respective language embedding;
processing, using the previous control processing sub-network, the one or more previous controls to generate a previous control embedding;
combining, using a multimodal transformer sub-network, the one or more visual embeddings and the one or more language embeddings to generate a transformed embedding; and
processing, using the output sub-network, the transformed embedding and the previous control embedding to generate the one or more control outputs.
|
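The data flow recited in claim 1 can be illustrated with a minimal, non-authoritative sketch. Everything named below is an assumption made for illustration and does not come from the claim: the class `ControlPolicy`, the helpers `linear`, `matvec`, `softmax`, `bag_of_tokens` and `self_attention`, the embedding width `DIM`, the `ACTION_TYPES` labels, the random untrained weights, and the use of one mean-pooled attention layer as a stand-in for the multimodal transformer sub-network. Screen frames are assumed to be flattened into short float vectors and language inputs into lists of token ids.

```python
import math
import random

DIM = 8  # embedding width, chosen arbitrarily for this sketch
ACTION_TYPES = ["pointer_move", "pointer_click", "key_press"]  # hypothetical labels


def linear(rng, n_in, n_out):
    """Random (untrained) weight matrix stored as n_out rows of length n_in."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]


def matvec(w, x):
    """Multiply matrix w by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def bag_of_tokens(token_ids, vocab_size):
    """Count-vector encoding of one language input (assumed tokenized already)."""
    counts = [0.0] * vocab_size
    for t in token_ids:
        counts[t] += 1.0
    return counts


def self_attention(tokens):
    """One attention layer with identity projections, mean-pooled to one vector.
    A stand-in for the multimodal transformer sub-network."""
    scale = math.sqrt(len(tokens[0]))
    mixed = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in tokens]
        weights = softmax(scores)
        mixed.append([sum(w * v[i] for w, v in zip(weights, tokens))
                      for i in range(len(q))])
    return [sum(col) / len(mixed) for col in zip(*mixed)]


class ControlPolicy:
    def __init__(self, frame_size, vocab_size, seed=0):
        rng = random.Random(seed)
        self.vocab_size = vocab_size
        self.visual = linear(rng, frame_size, DIM)          # visual sub-network
        self.language = linear(rng, vocab_size, DIM)        # language sub-network
        self.previous = linear(rng, len(ACTION_TYPES), DIM)  # previous-control sub-network
        self.output = linear(rng, 2 * DIM, len(ACTION_TYPES))  # output sub-network

    def forward(self, frames, language_inputs, previous_controls):
        # One visual embedding per screen frame.
        visual_emb = [matvec(self.visual, f) for f in frames]
        # One language embedding per language input.
        lang_emb = [matvec(self.language, bag_of_tokens(t, self.vocab_size))
                    for t in language_inputs]
        # A single embedding summarizing the previous controls.
        counts = [0.0] * len(ACTION_TYPES)
        for c in previous_controls:
            counts[ACTION_TYPES.index(c)] += 1.0
        prev_emb = matvec(self.previous, counts)
        # Combine visual and language embeddings into a transformed embedding.
        transformed = self_attention(visual_emb + lang_emb)
        # The output sub-network sees the transformed and previous-control embeddings.
        return softmax(matvec(self.output, transformed + prev_emb))


policy = ControlPolicy(frame_size=4, vocab_size=10)
frames = [[0.1, 0.5, 0.2, 0.9]]        # one flattened toy screen frame
language_inputs = [[1, 3, 3], [7]]     # two tokenized language inputs
previous_controls = ["pointer_move"]

probs = policy.forward(frames, language_inputs, previous_controls)
action = ACTION_TYPES[probs.index(max(probs))]  # determine the action
executed = [action]  # placeholder for dispatching to a mouse/keyboard driver
```

The last three lines mirror the remaining steps of the claim: the action type output is a distribution over pointer and keyboard action types, an action is determined from it, and the action is executed. In this sketch execution is only recorded in a list; a real system would pass the action to an input-device interface, which the claim does not specify.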