US 12,437,238 B1
Generation of agentic trajectories for training artificial intelligence agents to automate multimodal interface task workflows
Shaya Zarkesh, San Francisco, CA (US); Lina Lukyantseva, San Francisco, CA (US); Rohan Bavishi, San Francisco, CA (US); David Luan, San Francisco, CA (US); John Qian, San Francisco, CA (US); Claire Pajot, San Francisco, CA (US); Fred Bertsch, San Francisco, CA (US); Erich Elsen, San Francisco, CA (US); and Curtis Hawthorne, San Francisco, CA (US)
Assigned to Anthropic, PBC, San Francisco, CA (US)
Filed by Anthropic, PBC, San Francisco, CA (US)
Filed on Oct. 7, 2024, as Appl. No. 18/908,447.
Claims priority of provisional application 63/638,644, filed on Apr. 25, 2024.
Claims priority of provisional application 63/638,613, filed on Apr. 25, 2024.
Claims priority of provisional application 63/638,631, filed on Apr. 25, 2024.
Claims priority of provisional application 63/567,667, filed on Mar. 20, 2024.
Claims priority of provisional application 63/567,714, filed on Mar. 20, 2024.
Claims priority of provisional application 63/567,721, filed on Mar. 20, 2024.
Claims priority of provisional application 63/567,681, filed on Mar. 20, 2024.
Claims priority of provisional application 63/567,698, filed on Mar. 20, 2024.
Int. Cl. G06N 20/00 (2019.01); G06F 9/451 (2018.01)
CPC G06N 20/00 (2019.01) [G06F 9/451 (2018.02)] 20 Claims
OG exemplary drawing
 
1. A system for generating training data to train agents to automate multimodal interface task workflows, comprising:
an intermediary interposed between a user and an interface comprising multimodal content, the intermediary being configured to:
intercept one or more user-actuated actions directed towards the interface by the user, wherein the user-actuated actions, if received by the interface, execute a task on the interface;
preserve a state of the interface prior to the execution of the task, wherein the preserved state includes multimodal data comprising arbitrary-length text sequences and arbitrary-resolution images;
translate the user-actuated actions into one or more actuation commands using a transformer-based multimodal neural network comprising multi-head attention mechanisms, wherein the actuation commands are configured to trigger one or more machine-actuated actions that replicate the user-actuated actions on the interface to cause automation of a multimodal interface task workflow; and
generate a multimodal training dataset to train an agent to automate the task, wherein the training dataset requires the agent to process, as input, the preserved multimodal state of the interface prior to the execution of the task, and to generate, as output, the actuation commands, wherein the translation and the generation are performed using runtime interpretation logic dynamically executing on a client-side computing device.
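
The following is a minimal, hypothetical Python sketch of the data flow recited in claim 1: intercept user-actuated actions, preserve the multimodal interface state, translate the actions into actuation commands, and emit (state, commands) training pairs. All class and function names are illustrative assumptions, and the injected `translate` callable merely stands in for the claimed transformer-based multimodal network and client-side runtime interpretation logic; it is not the patented implementation.

```python
# Illustrative sketch only; names and structure are assumptions, not the claimed system.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class InterfaceState:
    """Preserved multimodal state captured before the task executes."""
    screenshot_png: bytes   # arbitrary-resolution image of the interface
    interface_text: str     # arbitrary-length text sequence (e.g., DOM or accessibility tree)


@dataclass
class UserAction:
    """A user-actuated action intercepted before it reaches the interface."""
    kind: str               # e.g., "click", "type", "scroll"
    target: str             # element identifier or coordinates
    payload: str = ""       # typed text, scroll delta, etc.


@dataclass
class TrainingExample:
    """One (prior state, actuation commands) pair in the multimodal training dataset."""
    state: InterfaceState
    actuation_commands: List[str]


class Intermediary:
    """Sits between the user and the interface; intercepts actions, preserves state,
    translates actions into actuation commands, and accumulates training data."""

    def __init__(
        self,
        capture_state: Callable[[], InterfaceState],
        translate: Callable[[List[UserAction], InterfaceState], List[str]],
    ):
        # `translate` is a stand-in for the transformer-based multimodal network
        # and runtime interpretation logic recited in the claim.
        self._capture_state = capture_state
        self._translate = translate
        self.dataset: List[TrainingExample] = []

    def intercept(self, actions: List[UserAction]) -> List[str]:
        # 1. Preserve the interface state prior to execution of the task.
        state = self._capture_state()
        # 2. Translate the user-actuated actions into actuation commands that can
        #    replicate them as machine-actuated actions on the interface.
        commands = self._translate(actions, state)
        # 3. Record a training example: input = prior multimodal state, output = commands.
        self.dataset.append(TrainingExample(state=state, actuation_commands=commands))
        return commands
```

A usage example under the same assumptions: construct `Intermediary(capture_state=grab_screenshot_and_dom, translate=model_translate)`, call `intercept([...])` each time the user acts, and export `intermediary.dataset` as the multimodal training corpus for the agent.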