| CPC G06N 20/00 (2019.01) [G06F 9/451 (2018.02)] | 20 Claims |

|
1. A system for generating training data to train agents to automate multimodal interface task workflows, comprising:
an intermediary interposed between an interface, comprising multimodal content, and a user, and the intermediary is configured to:
intercept one or more user-actuated actions directed towards the interface by the user, wherein the user-actuated actions, if received by the interface, execute a task on the interface;
preserve a state of the interface prior to the execution of the task, wherein the preserved state includes multimodal data comprising arbitrary-length text sequences and arbitrary-resolution images;
translate the user-actuated actions into one or more actuation commands using a transformer-based multimodal neural network comprising multi-head attention mechanisms, wherein the actuation commands are configured to trigger one or more machine-actuated actions that replicate the user-actuated actions on the interface to cause automation of the multimodal task workflow; and
generate a multimodal training dataset to train an agent to automate the task, wherein the training dataset requires the agent to process, as input, the multimodal state of the interface prior to the execution of the task, and to generate, as output, the actuation commands, wherein the translation and generation are performed using runtime interpretation logic dynamically executing on a client-side computing device.
|
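For illustration only, the data flow recited in claim 1 (intercept the user-actuated actions, preserve the prior interface state, translate the actions into actuation commands, and record a state-to-commands training example) can be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation. All names (`InterfaceState`, `TrainingExample`, `Intermediary`, `rule_based_translate`) and the command string format are hypothetical. The claim recites a transformer-based multimodal neural network for the translation step; the sketch substitutes a plain rule-based function as a placeholder for it.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class InterfaceState:
    """Snapshot of the interface taken before the task executes (hypothetical)."""
    text: str     # arbitrary-length text sequence
    image: bytes  # arbitrary-resolution image, kept here as raw bytes


@dataclass
class TrainingExample:
    """One record of the training dataset (hypothetical layout)."""
    state: InterfaceState  # agent input: interface state prior to execution
    commands: List[str]    # agent output: actuation commands


@dataclass
class Intermediary:
    """Sits between the user and the interface (hypothetical sketch)."""
    # Stand-in for the translator; the claim recites a transformer-based
    # multimodal neural network here, which this sketch does not implement.
    translate: Callable[[InterfaceState, List[Dict]], List[str]]
    dataset: List[TrainingExample] = field(default_factory=list)

    def intercept(self, state: InterfaceState, actions: List[Dict]) -> List[str]:
        """Capture user actions before they reach the interface."""
        # Preserve the state as it was prior to execution of the task.
        preserved = InterfaceState(text=state.text, image=bytes(state.image))
        # Translate the user-actuated actions into actuation commands.
        commands = self.translate(preserved, actions)
        # Record the (prior state -> commands) pair as a training example.
        self.dataset.append(TrainingExample(preserved, commands))
        return commands


def rule_based_translate(state: InterfaceState, actions: List[Dict]) -> List[str]:
    """Placeholder translator (an assumption): rules, NOT a neural network."""
    commands = []
    for action in actions:
        if action["type"] == "click":
            commands.append(f"click({action['x']}, {action['y']})")
        elif action["type"] == "type":
            commands.append(f"type({action['text']!r})")
    return commands


intermediary = Intermediary(translate=rule_based_translate)
commands = intermediary.intercept(
    InterfaceState(text="Login page", image=b"\x89PNG"),
    [{"type": "click", "x": 10, "y": 20}, {"type": "type", "text": "alice"}],
)
```

In this sketch each call to `intercept` appends one example to `intermediary.dataset`, pairing the preserved pre-execution state (the input an agent would process) with the actuation commands (the output it would generate).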