| CPC G06V 20/56 (2022.01) [G01S 13/86 (2013.01); G01S 17/86 (2020.01); G05D 1/0212 (2013.01); G06F 3/16 (2013.01); G06F 18/22 (2023.01); G06V 10/764 (2022.01); G06V 2201/07 (2022.01)] | 20 Claims |

|
1. A system comprising:
one or more processors; and
one or more non-transitory computer-readable media storing instructions that, when executed, cause the system to perform operations comprising:
receiving first sensor data generated by a first modality of sensor of a vehicle, the first modality being audio;
receiving second sensor data generated by a second modality of sensor of the vehicle, the second modality being vision, lidar, or radar;
inputting the first sensor data and the second sensor data into a machine-learned transformer model;
determining, by the machine-learned transformer model, a first key vector, a first query vector, and a first value vector for the first sensor data;
determining, by the machine-learned transformer model, a second key vector, a second query vector, and a second value vector for the second sensor data;
determining, by the machine-learned transformer model, a first attention vector based on the first query vector, the second value vector, and the second key vector;
determining, by the machine-learned transformer model, a second attention vector based on the second query vector, the first value vector, and the first key vector; and
determining, by the machine-learned transformer model and based at least in part on the first attention vector and the second attention vector, at least one of a location or a classification of an object in an environment of the vehicle,
wherein the second sensor data includes sensor data generated by two or more sensors of the second modality of sensor, the sensor data generated by the two or more sensors being combined prior to being input into the machine-learned transformer model.
|
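The cross-modal attention steps recited in claim 1 can be illustrated with a minimal sketch. It is not the claimed implementation. It assumes scaled dot-product attention, identity projection matrices in place of learned weights, two-dimensional toy tokens, and simple concatenation as the way the two second-modality sensors are "combined" before input. All names (`project`, `cross_attend`, `audio_tokens`, `lidar_a`, `lidar_b`) are hypothetical. The final step of the claim, determining a location or classification from the two attention results, is left out because the claim does not specify the form of that head.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def project(tokens, Wq, Wk, Wv):
    """Return (queries, keys, values) for a list of token vectors."""
    q = [matvec(Wq, t) for t in tokens]
    k = [matvec(Wk, t) for t in tokens]
    v = [matvec(Wv, t) for t in tokens]
    return q, k, v

def cross_attend(queries, keys, values):
    """Scaled dot-product attention with queries from one modality
    and keys/values from the other modality."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Assumption: identity projections stand in for learned weight matrices.
I2 = [[1.0, 0.0], [0.0, 1.0]]

# First modality (audio): one toy token.
audio_tokens = [[1.0, 0.0]]

# Second modality (e.g. lidar): two sensors, combined BEFORE model input.
# Assumption: "combined" is modeled here as concatenating the token lists.
lidar_a = [[0.0, 1.0]]
lidar_b = [[1.0, 1.0]]
second_tokens = lidar_a + lidar_b

q1, k1, v1 = project(audio_tokens, I2, I2, I2)
q2, k2, v2 = project(second_tokens, I2, I2, I2)

# First attention: audio queries attend over second-modality keys/values.
first_attention = cross_attend(q1, k2, v2)
# Second attention: second-modality queries attend over audio keys/values.
second_attention = cross_attend(q2, k1, v1)
```

In this sketch each modality's queries are scored against the other modality's keys, and the resulting weights mix the other modality's values, which mirrors the pairing of query, key, and value vectors in the two "attention vector" limitations.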