CPC B25J 9/1697 (2013.01) [B25J 9/1661 (2013.01); G06T 7/10 (2017.01); G06T 2207/10024 (2013.01); G06T 2207/10028 (2013.01)]

17 Claims

1. A method, comprising:
receiving a video that encodes a plurality of frames associated with assembly of an object from a plurality of sub-objects;
determining spatio-temporal features based on the plurality of frames;
identifying a plurality of actions represented in the video based on the spatio-temporal features;
mapping the plurality of actions to the plurality of sub-objects to generate an assembly plan based on the video;
combining output from a point cloud model and output from a color embedding model to generate a plurality of sets of coordinates corresponding to the plurality of sub-objects, wherein each set of coordinates corresponds to a respective action of the plurality of actions;
performing object segmentation to estimate a plurality of grip points and a plurality of widths corresponding to the plurality of sub-objects; and
generating instructions for one or more robotic machines, for each action of the plurality of actions, based on the assembly plan, the plurality of sets of coordinates, the plurality of grip points, and the plurality of widths,
wherein combining the output from the point cloud model and the output from the color embedding model comprises:
generating a pixel-wise dense fusion matrix using the output from the point cloud model and the output from the color embedding model; and
generating global features based on pooling the output from the point cloud model and the output from the color embedding model,
wherein one of the plurality of sets of coordinates corresponding to one of the plurality of sub-objects is calculated based on the pixel-wise dense fusion matrix and the global features.
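The fusion recited in the final limitations (a pixel-wise dense fusion matrix plus pooled global features feeding a per-sub-object coordinate estimate) mirrors the pixel-wise dense fusion pattern popularized by DenseFusion (Wang et al., CVPR 2019). The sketch below is a minimal PyTorch illustration of that pattern, not the patented implementation: the module names, feature dimensions, the max-pooling choice, and the 3-D coordinate head are all assumptions added for clarity.

```python
# Minimal sketch, assuming DenseFusion-style fusion; all names and sizes
# here are illustrative, not the claimed implementation.
import torch
import torch.nn as nn

class DenseFusionSketch(nn.Module):
    def __init__(self, geo_dim=128, rgb_dim=128, global_dim=512):
        super().__init__()
        # 1x1 convolutions operate per point, keeping the fusion pixel-wise.
        self.fuse = nn.Conv1d(geo_dim + rgb_dim, global_dim, 1)
        # Hypothetical head regressing one (x, y, z) coordinate set per point.
        self.coord_head = nn.Conv1d(geo_dim + rgb_dim + global_dim, 3, 1)

    def forward(self, geo_feat, rgb_feat):
        # geo_feat: (B, geo_dim, N) output of the point cloud model
        # rgb_feat: (B, rgb_dim, N) color embeddings sampled at the same pixels
        # Pixel-wise dense fusion matrix: concatenate both streams per point.
        dense = torch.cat([geo_feat, rgb_feat], dim=1)           # (B, 256, N)
        # Global features: max-pool the fused per-point features over all points.
        glob = torch.max(self.fuse(dense), dim=2, keepdim=True).values
        glob = glob.expand(-1, -1, dense.shape[2])               # (B, 512, N)
        # Coordinates are computed from both the dense matrix and the global features.
        return self.coord_head(torch.cat([dense, glob], dim=1))  # (B, 3, N)
```

For example, `DenseFusionSketch()(torch.randn(1, 128, 500), torch.randn(1, 128, 500))` yields a (1, 3, 500) tensor of per-point coordinates, which could then be aggregated per sub-object to form one set of coordinates per action.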
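The claim does not specify how the grip points and widths are derived from the object segmentation. One common heuristic, shown here purely as an assumption, takes the centroid of a segmented sub-object's 3-D points as a candidate grip point and the extent along the minor principal axis as the gripper width.

```python
# Minimal sketch, assuming a centroid/principal-axis heuristic; the patent
# does not disclose this computation, so treat it as illustrative only.
import numpy as np

def grip_from_mask(mask: np.ndarray, points: np.ndarray):
    """mask: (H, W) boolean sub-object mask from the segmentation model.
    points: (H, W, 3) per-pixel 3-D coordinates (e.g. back-projected depth)."""
    xyz = points[mask]                        # (M, 3) points on the sub-object
    grip_point = xyz.mean(axis=0)             # centroid as a candidate grip point
    centered = xyz - grip_point
    # Principal axes of the sub-object's point set via SVD.
    _, _, vecs = np.linalg.svd(centered, full_matrices=False)
    # Gripper width: object extent along the minor axis (easiest span to grasp).
    width = np.ptp(centered @ vecs[-1])
    return grip_point, width
```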