| CPC G06T 17/20 (2013.01) [G06F 30/27 (2020.01); G06T 5/60 (2024.01); G06T 15/20 (2013.01); G06T 17/05 (2013.01); G06T 2210/56 (2013.01)] | 20 Claims | 

| 
1. A method performed by one or more computers, the method comprising:
    obtaining a set of point clouds captured by one or more sensors, wherein each point cloud comprises a respective plurality of three-dimensional points;
    assigning the three-dimensional points to respective voxels in a voxel grid of voxels;
    generating multi-scale features of the voxel grid, the multi-scale features comprising, for each of a plurality of scales, respective features for each non-empty voxel in a scaled voxel grid corresponding to the scale, the generating comprising:
        processing respective features for each non-empty voxel in the voxel grid through a hierarchical sequence of self-attention neural network blocks, the processing comprising, for each scale:
            obtaining initial features for each non-empty voxel in the scaled voxel grid corresponding to the scale; and
            processing the initial features for the non-empty voxels in the scaled voxel grid corresponding to the scale using a self-attention neural network to generate the respective features for the non-empty voxels in the scaled voxel grid corresponding to the scale; and
    generating an output for a point cloud processing task using the multi-scale features of the voxel grid.
               |
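
For illustration only, the steps recited in claim 1 can be sketched in Python with NumPy. This is a minimal sketch under stated assumptions, not the claimed implementation. The function names (`voxelize`, `pool_mean`, `self_attention`, `multi_scale_features`, `task_output`), the voxel size, the factor-of-two coarsening between scales, the mean-pooled initial features, the single-head global attention with random untrained weights, and the concatenated-mean task output are all hypothetical choices that the claim does not specify.

```python
import numpy as np

def voxelize(points, voxel_size):
    """Assign each 3D point to a voxel; return non-empty voxel coords and per-point voxel index."""
    coords = np.floor(points / voxel_size).astype(np.int64)
    uniq, inverse = np.unique(coords, axis=0, return_inverse=True)
    return uniq, inverse.reshape(-1)

def pool_mean(values, index, n_groups):
    """Average the rows of `values` that share the same group index."""
    sums = np.zeros((n_groups, values.shape[1]))
    np.add.at(sums, index, values)
    counts = np.bincount(index, minlength=n_groups)[:, None]
    return sums / counts

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over all non-empty voxels (assumed global)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[1])
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return x + weights @ v  # residual connection (assumption)

def multi_scale_features(points, voxel_size=0.5, num_scales=3, seed=0):
    """Return a list of (voxel_coords, voxel_features), one entry per scale."""
    rng = np.random.default_rng(seed)
    dim = points.shape[1]
    coords, inverse = voxelize(points, voxel_size)
    # Assumed initial features at the finest scale: mean point position per voxel.
    feats = pool_mean(points, inverse, len(coords))
    outputs = []
    for scale in range(num_scales):
        if scale > 0:
            # Assumed coarsening: merge 2x2x2 blocks of voxels and mean-pool the
            # previous scale's features to obtain this scale's initial features.
            coarse = np.floor_divide(coords, 2)
            coords, inverse = np.unique(coarse, axis=0, return_inverse=True)
            feats = pool_mean(feats, inverse.reshape(-1), len(coords))
        # Random, untrained projection weights stand in for a learned block.
        wq, wk, wv = (rng.normal(scale=0.1, size=(dim, dim)) for _ in range(3))
        feats = self_attention(feats, wq, wk, wv)
        outputs.append((coords, feats))
    return outputs

def task_output(multi_scale):
    """Toy stand-in for a task head: concatenate the mean feature of each scale."""
    return np.concatenate([feats.mean(axis=0) for _, feats in multi_scale])

points = np.random.default_rng(1).uniform(-2.0, 2.0, size=(200, 3))
scales = multi_scale_features(points)
descriptor = task_output(scales)
```

Only non-empty voxels are ever materialized here, because `np.unique` returns just the voxel coordinates that received at least one point. A practical system would likely restrict attention to local neighborhoods or windows rather than attending across all voxels, since the dense score matrix above grows quadratically with the number of non-empty voxels.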