US 12,343,884 B2
Robotic assembly instruction generation from a video
Kumar Abhinav, Hazaribag (IN); Alpana Dubey, Bangalore (IN); Shubhashis Sengupta, Bangalore (IN); Suma Mani Kuriakose, Mumbai (IN); Priyanshu Abhijit Barua, Pune (IN); and Piyush Goenka, Bangalore (IN)
Assigned to Accenture Global Solutions Limited, Dublin (IE)
Filed by Accenture Global Solutions Limited, Dublin (IE)
Filed on Sep. 21, 2022, as Appl. No. 17/950,021.
Prior Publication US 2024/0091948 A1, Mar. 21, 2024
Int. Cl. B25J 9/16 (2006.01); G06T 7/10 (2017.01)
CPC B25J 9/1697 (2013.01) [B25J 9/1661 (2013.01); G06T 7/10 (2017.01); G06T 2207/10024 (2013.01); G06T 2207/10028 (2013.01)] 17 Claims
OG exemplary drawing
 
1. A method, comprising:
receiving a video that encodes a plurality of frames associated with assembly of an object from a plurality of sub-objects;
determining spatio-temporal features based on the plurality of frames;
identifying a plurality of actions represented in the video based on the spatio-temporal features;
mapping the plurality of actions to the plurality of sub-objects to generate an assembly plan based on the video;
combining output from a point cloud model and output from a color embedding model to generate a plurality of sets of coordinates corresponding to the plurality of sub-objects, wherein each set of coordinates corresponds to a respective action of the plurality of actions;
performing object segmentation to estimate a plurality of grip points and a plurality of widths corresponding to the plurality of sub-objects; and
generating instructions, for one or more robotic machines, for each action of the plurality of actions, based on the assembly plan, the plurality of sets of coordinates, the plurality of grip points, and the plurality of widths,
wherein combining the output from the point cloud model and the output from the color embedding model comprises:
generating a pixel-wise dense fusion matrix using the output from the point cloud model and the output from the color embedding model; and
generating global features based on pooling the output from the point cloud model and the output from the color embedding model,
wherein one of the plurality of sets of coordinates corresponding to one of the plurality of sub-objects is calculated based on the pixel-wise dense fusion matrix and the global features.
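
The first steps of claim 1, determining spatio-temporal features from the frames and identifying the actions they represent, are commonly realized with a 3D convolutional network applied to a short window of frames. Below is a minimal PyTorch sketch under that assumption; the class name ActionRecognizer, the layer sizes, and the eight-action output are illustrative choices, not taken from the patent.

```python
import torch
import torch.nn as nn

class ActionRecognizer(nn.Module):
    """Classifies assembly actions from a short clip of video frames."""

    def __init__(self, num_actions: int = 8):
        super().__init__()
        # 3D convolutions mix spatial (H, W) and temporal (T) information,
        # producing spatio-temporal features from the stacked frames.
        self.features = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),  # global pooling over T, H, W
        )
        self.classifier = nn.Linear(32, num_actions)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, 3, T, H, W) -- a window of RGB frames.
        feats = self.features(clip).flatten(1)  # spatio-temporal features
        return self.classifier(feats)           # per-action logits

# Example: classify one 16-frame, 112x112 clip.
logits = ActionRecognizer()(torch.randn(1, 3, 16, 112, 112))
action_index = logits.argmax(dim=1)  # index of the identified action
```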
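The combining step of claim 1 follows the dense-fusion pattern for pose estimation: per-point geometry embeddings and per-pixel color embeddings are concatenated into a pixel-wise dense fusion matrix, pooled into a global feature, and the two are joined to regress a set of coordinates for a sub-object. A minimal PyTorch sketch follows; the feature dimensions (geo_dim, rgb_dim, global_dim) are assumptions, and max-pooling stands in for the pooling operator, which the claim does not specify.

```python
import torch
import torch.nn as nn

class DenseFusionHead(nn.Module):
    """Fuses point cloud and color embeddings to regress coordinates."""

    def __init__(self, geo_dim: int = 64, rgb_dim: int = 64, global_dim: int = 128):
        super().__init__()
        self.global_mlp = nn.Sequential(
            nn.Linear(geo_dim + rgb_dim, global_dim), nn.ReLU())
        # Regress a 3D coordinate per point from the fused per-pixel
        # features concatenated with the pooled global feature.
        self.coord_head = nn.Sequential(
            nn.Linear(geo_dim + rgb_dim + global_dim, 128), nn.ReLU(),
            nn.Linear(128, 3))

    def forward(self, geo_feats: torch.Tensor, rgb_feats: torch.Tensor) -> torch.Tensor:
        # geo_feats: (N, geo_dim) output of the point cloud model
        # rgb_feats: (N, rgb_dim) color embedding sampled at the same pixels
        dense = torch.cat([geo_feats, rgb_feats], dim=1)    # pixel-wise dense fusion matrix
        pooled = self.global_mlp(dense).max(dim=0).values   # global features via pooling
        fused = torch.cat([dense, pooled.expand(dense.size(0), -1)], dim=1)
        return self.coord_head(fused)  # (N, 3) coordinates for the sub-object

# Example: fuse embeddings for 500 sampled points of one sub-object.
geo = torch.randn(500, 64)
rgb = torch.randn(500, 64)
coords = DenseFusionHead()(geo, rgb)  # shape (500, 3)
```

Broadcasting the pooled global feature back onto every row of the dense fusion matrix is what lets each per-pixel prediction account for the whole sub-object, matching the claim's requirement that the coordinates be calculated from both the matrix and the global features.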
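For the segmentation step, one simple way to turn a per-sub-object mask into a grip point and a gripper opening width is a centroid plus principal-axis heuristic: grip at the centroid and close the gripper across the mask's narrowest extent. The NumPy sketch below assumes a binary mask and a known millimeters-per-pixel scale; the patent does not specify this estimator.

```python
import numpy as np

def grip_point_and_width(mask: np.ndarray, mm_per_px: float = 1.0):
    """Estimate a grip point and gripper width from a binary segmentation
    mask of one sub-object (a geometric heuristic, for illustration only)."""
    ys, xs = np.nonzero(mask)
    pts = np.stack([xs, ys], axis=1).astype(float)
    centroid = pts.mean(axis=0)              # grip point: mask centroid
    # PCA: the minor axis gives the narrowest span to close the gripper on.
    cov = np.cov((pts - centroid).T)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    minor = eigvecs[:, 0]                    # direction of smallest spread
    span = (pts - centroid) @ minor
    width = (span.max() - span.min()) * mm_per_px
    return centroid, width

# Example: a 24 x 44 px rectangular mask.
mask = np.zeros((64, 64), dtype=bool)
mask[20:44, 10:54] = True
point, width = grip_point_and_width(mask)
print(point, width)  # grip point ~(31.5, 31.5), width ~23 px
```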
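Finally, the instruction-generation step of claim 1 combines the assembly plan with the per-sub-object coordinates, grip points, and widths to produce one instruction per action. A minimal sketch, assuming the plan is encoded as (action name, sub-object index) pairs and that a downstream controller consumes simple structured commands; both encodings are assumptions, not the patented format.

```python
from dataclasses import dataclass

@dataclass
class GripperCommand:
    action: str          # e.g. "pick", "place"
    xyz: tuple           # target coordinates from the fusion model
    grip_point: tuple    # where to close the gripper on the sub-object
    width_mm: float      # gripper opening width

def generate_instructions(plan, coords, grips, widths):
    """Emit one robot command per action in the assembly plan.

    plan:   list of (action_name, sub_object_index) pairs
    coords: per-sub-object coordinates from the dense fusion head
    grips:  per-sub-object grip points from segmentation
    widths: per-sub-object gripper widths from segmentation
    """
    commands = []
    for action_name, obj_idx in plan:
        commands.append(GripperCommand(
            action=action_name,
            xyz=tuple(coords[obj_idx]),
            grip_point=tuple(grips[obj_idx]),
            width_mm=widths[obj_idx],
        ))
    return commands

# Example: a two-step plan over two sub-objects.
plan = [("pick", 0), ("place", 1)]
commands = generate_instructions(
    plan,
    coords=[(0.10, 0.20, 0.05), (0.30, 0.25, 0.05)],
    grips=[(0.10, 0.20), (0.30, 0.25)],
    widths=[23.0, 41.5],
)
```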