US 11,756,309 B2
Contrastive learning for object detection
Alper Ayvaci, Santa Clara, CA (US); Feiyu Chen, Cupertino, CA (US); Justin Yu Zheng, Mountain View, CA (US); Bayram Safa Cicek, Los Angeles, CA (US); and Vasiliy Igorevich Karasev, New York, NY (US)
Assigned to Waymo LLC, Mountain View, CA (US)
Filed by Waymo LLC, Mountain View, CA (US)
Filed on Jan. 13, 2021, as Appl. No. 17/148,148.
Claims priority of provisional application 63/117,406, filed on Nov. 23, 2020.
Prior Publication US 2022/0164585 A1, May 26, 2022
Int. Cl. G06N 3/08 (2023.01); G06V 20/58 (2022.01); B60W 60/00 (2020.01)

CPC G06V 20/58 (2022.01) [B60W 60/001 (2020.02); G06N 3/08 (2013.01); B60W 2420/52 (2013.01); B60W 2554/4049 (2020.02)]

26 Claims

1. A method of training a neural network to detect one or more objects in an environment, the method comprising:

obtaining a network input representing the environment, wherein the input comprises sensor data for each of a plurality of locations in the environment;

processing the network input using a first subnetwork of the neural network to generate a respective embedding for each of the plurality of locations in the environment;

processing the embeddings for each of the plurality of locations in the environment using a second subnetwork of the neural network to generate, for each of the plurality of locations in the environment, an object prediction that characterizes a possible object at the location in the environment;

processing the embeddings for each of the plurality of locations in the environment using a third subnetwork of the neural network to generate an updated embedding for each of the plurality of locations in the environment;

determining, for each of a plurality of pairs of the plurality of locations in the environment, whether the respective object predictions of the pair of locations characterize the same possible object or different possible objects;

computing a respective contrastive loss value for each of the plurality of pairs of locations in the environment, wherein:

for each pair of locations whose object predictions characterize the same possible object, the corresponding contrastive loss value is proportional to a difference between the respective updated embeddings of the pair of locations; and

for each pair of locations whose object predictions characterize different possible objects, the corresponding contrastive loss value is inversely proportional to a difference between the respective updated embeddings of the pair of locations; and

updating values for (i) a plurality of parameters of the first subnetwork and (ii) a plurality of parameters of the third subnetwork using the computed contrastive loss values.
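
The contrastive loss recited in claim 1 can be illustrated with a minimal sketch. The claim requires only that the loss be proportional to the difference between updated embeddings for same-object pairs and inversely proportional to it for different-object pairs. Everything else below is an assumption made for illustration: the hypothetical function name `pairwise_contrastive_losses`, the use of Euclidean distance as the "difference", a proportionality constant of 1, the small `eps` term that guards against division by zero, and the reduction of each object prediction to an integer object id.

```python
import math
from itertools import combinations


def pairwise_contrastive_losses(updated_embeddings, object_ids, eps=1e-6):
    """Return {(i, j): loss} for every pair of locations with i < j.

    updated_embeddings: one vector per location (third-subnetwork output).
    object_ids: one id per location, standing in for the object prediction
                (assumption: equal ids mean "same possible object").
    """
    if len(updated_embeddings) != len(object_ids):
        raise ValueError("need exactly one object id per embedding")

    losses = {}
    for i, j in combinations(range(len(updated_embeddings)), 2):
        # Assumption: the "difference" is the Euclidean distance.
        diff = math.dist(updated_embeddings[i], updated_embeddings[j])
        if object_ids[i] == object_ids[j]:
            # Same possible object: loss proportional to the difference,
            # so minimizing it pulls the two embeddings together.
            losses[(i, j)] = diff
        else:
            # Different possible objects: loss inversely proportional to
            # the difference, so minimizing it pushes the embeddings apart.
            losses[(i, j)] = 1.0 / (diff + eps)
    return losses


# Toy data: locations 0 and 1 belong to object 7, location 2 to object 9.
embeddings = [(0.0, 0.0), (3.0, 4.0), (0.0, 1.0)]
object_ids = [7, 7, 9]

losses = pairwise_contrastive_losses(embeddings, object_ids)
total_loss = sum(losses.values())
```

In a training loop, a scalar such as `total_loss` would be differentiated with respect to the parameters of the first and third subnetworks, which are the parameters the claim updates with the contrastive loss values. The sketch stops at the loss computation and leaves out the gradient step, which depends on the chosen framework.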