| CPC G06N 20/00 (2019.01) [G06F 3/011 (2013.01); G06T 19/006 (2013.01); G06F 3/0487 (2013.01)] | 18 Claims |

1. A computer-implemented method comprising:
receiving, at one or more computing devices, a multi-modal input representing a query associated with a physical object, wherein the multi-modal input comprises one or more frames of a video and at least one of text or audio component inputs;
detecting, by the one or more computing devices, the physical object in the one or more frames using a machine learning model trained to perform object detection;
determining, by the one or more computing devices, based in part on an identification of the physical object, at least one response to the query associated with the physical object, the at least one response being identified by processing the at least one of the text or audio component inputs using a language processing model;
determining, by the one or more computing devices, a sequence of actions associated with the at least one response, the sequence of actions involving an interaction with at least one portion of the physical object;
generating, by the one or more computing devices, a digital representation representing the sequence of actions on a virtual model of the physical object, the digital representation comprising: the virtual model of the physical object, and one or more gesture-icons representing the sequence of actions, each of the one or more gesture-icons being overlaid on corresponding portions of the virtual model of the physical object; and
providing the digital representation to a user-device for presentation on a display, the digital representation being configured to be aligned in accordance with a changed orientation of the virtual model of the physical object on the display.
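For orientation only, the minimal Python sketch below maps the steps recited in claim 1 onto a toy pipeline. Every name in it (MultiModalInput, detect_object, GestureIcon, the espresso-machine example, and so on) is hypothetical and stands in for whatever object detector, language processing model, and renderer an implementation might actually use; nothing here is recited in, or limits, the claim.

```python
from dataclasses import dataclass, field

# Hypothetical data types for the claimed pipeline (names are illustrative only).

@dataclass
class MultiModalInput:
    frames: list                  # one or more video frames (e.g., image arrays)
    text: str | None = None       # optional text component of the query
    audio: bytes | None = None    # optional audio component of the query

@dataclass
class GestureIcon:
    action: str        # e.g., "press", "rotate", "pull"
    anchor_part: str   # portion of the virtual model the icon is overlaid on

@dataclass
class DigitalRepresentation:
    virtual_model_id: str
    gesture_icons: list = field(default_factory=list)
    orientation: tuple = (0.0, 0.0, 0.0)

    def align_to(self, orientation: tuple) -> None:
        # Re-align the overlays when the user changes the virtual model's
        # orientation on the display; a real renderer would transform each
        # icon's anchor pose, which is stubbed out here.
        self.orientation = orientation


def detect_object(frames) -> str:
    # Stand-in for a machine learning model trained to perform object detection.
    return "espresso_machine"  # illustrative label only


def answer_query(object_id: str, text: str | None, audio: bytes | None) -> str:
    # Stand-in for a language processing model that interprets the query.
    return f"To descale the {object_id}, run its cleaning cycle."


def plan_actions(object_id: str, response: str) -> list[GestureIcon]:
    # Map the response to an ordered sequence of interactions with object parts.
    return [
        GestureIcon(action="press", anchor_part="power_button"),
        GestureIcon(action="rotate", anchor_part="steam_knob"),
    ]


def build_representation(object_id: str, actions: list[GestureIcon]) -> DigitalRepresentation:
    # Overlay each gesture icon on its corresponding portion of the virtual model.
    rep = DigitalRepresentation(virtual_model_id=f"model/{object_id}")
    rep.gesture_icons.extend(actions)
    return rep


if __name__ == "__main__":
    query = MultiModalInput(frames=["frame0"], text="How do I descale this?")
    obj = detect_object(query.frames)
    response = answer_query(obj, query.text, query.audio)
    actions = plan_actions(obj, response)
    representation = build_representation(obj, actions)
    representation.align_to(orientation=(0.0, 90.0, 0.0))  # user rotated the model
    print(representation)
```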