US 12,321,401 B1
Multimodal query prediction
Jessica Lee, Brooklyn, NY (US); Cindy L. Huynh, San Francisco, CA (US); Harshit Kharbanda, Pleasanton, CA (US); Louis Wang, San Francisco, CA (US); Richard Cameron, Port Washington, NY (US); Christophe Patrice Fondacci, Daly City, CA (US); Ruslan Alfridovich Abdikeev, Burlingame, CA (US); Jatin Matani, San Francisco, CA (US); Kai Yu, San Francisco, CA (US); and Wenjia Yuan, Mountain View, CA (US)
Assigned to GOOGLE LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Jun. 10, 2024, as Appl. No. 18/738,529.
Int. Cl. G06F 16/95 (2019.01); G06F 3/0482 (2013.01); G06F 3/0488 (2022.01); G06F 16/9532 (2019.01); G06F 16/9535 (2019.01); G06F 16/957 (2019.01)

CPC G06F 16/9532 (2019.01) [G06F 3/0482 (2013.01); G06F 3/0488 (2013.01); G06F 16/9535 (2019.01); G06F 16/9577 (2019.01)]

20 Claims

1. A computing system for multimodal search, the system comprising:

one or more processors; and

one or more non-transitory computer-readable media that collectively store instructions that, when executed by the one or more processors, cause the computing system to perform operations, the operations comprising:

obtaining image data, wherein the image data is descriptive of one or more images, wherein the one or more images comprise one or more frames obtained from a live camera feed;

processing the image data with an object classification model to determine one or more object classifications for one or more objects depicted in the one or more images;

processing the one or more object classifications to generate one or more multimodal query suggestions, wherein the one or more multimodal query suggestions comprise one or more suggested text strings to provide with at least a portion of the image data to a search engine;

providing the one or more suggested text strings for display with the live camera feed;

obtaining a selection of the one or more suggested text strings associated with the one or more multimodal query suggestions;

generating a multimodal query comprising the one or more suggested text strings and at least one of the one or more images or a current frame of the live camera feed; and

determining one or more search results based on the multimodal query.
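
For illustration only, the operations recited in claim 1 can be sketched as a small pipeline. This is a minimal sketch under stated assumptions, not the patented implementation. Every name in it is hypothetical and does not come from the patent: `MultimodalQuery`, `SUGGESTION_TEMPLATES`, `suggest_text`, `build_query`, and `run_pipeline`. The object classification model, the user's selection, and the search engine are passed in as plain callables because the claim does not specify them. The label-to-text lookup table is only a stand-in for whatever processing generates the suggestions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

# Hypothetical stand-in for one frame of a live camera feed (raw bytes only).
Frame = bytes


@dataclass(frozen=True)
class MultimodalQuery:
    """A suggested text string paired with image data (hypothetical type)."""
    text: str
    image: Frame


# Assumption: a fixed lookup from object classification to suggested text.
# The claim only says classifications are "processed" into suggestions.
SUGGESTION_TEMPLATES: Dict[str, List[str]] = {
    "plant": ["how to care for this", "is this toxic to pets"],
    "shoe": ["where to buy this", "similar styles"],
}


def suggest_text(classifications: Sequence[str]) -> List[str]:
    """Map object classifications to de-duplicated suggested text strings."""
    suggestions: List[str] = []
    for label in classifications:
        for text in SUGGESTION_TEMPLATES.get(label, []):
            if text not in suggestions:
                suggestions.append(text)
    return suggestions


def build_query(selected_text: str, frame: Frame) -> MultimodalQuery:
    """Combine the selected suggestion with a frame into one query."""
    return MultimodalQuery(text=selected_text, image=frame)


def run_pipeline(
    frames: Sequence[Frame],
    classify: Callable[[Frame], List[str]],        # object classification model
    select: Callable[[List[str]], str],            # user picks a suggestion
    search: Callable[[MultimodalQuery], List[str]],  # search engine
) -> List[str]:
    """Run the claimed steps in order and return search results."""
    if not frames:
        raise ValueError("at least one frame is required")
    current = frames[-1]                 # current frame of the live feed
    labels = classify(current)           # object classifications
    suggestions = suggest_text(labels)   # multimodal query suggestions
    if not suggestions:
        return []                        # nothing to display or select
    chosen = select(suggestions)         # selection obtained from the user
    query = build_query(chosen, current)
    return search(query)
```

In this sketch the query uses the most recent frame, which corresponds to the "current frame of the live camera feed" alternative in the claim. Using the frame that was originally classified would be the other recited alternative.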