US 12,326,867 B2
Method and system of using domain specific knowledge in retrieving multimodal assets
Adit Krishnan, Seattle, WA (US); Varun Tandon, Sunnyvale, CA (US); and Ji Li, San Jose, CA (US)
Assigned to Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed by Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed on Jan. 23, 2023, as Appl. No. 18/158,121.
Prior Publication US 2024/0248901 A1, Jul. 25, 2024
This patent is subject to a terminal disclaimer.
Int. Cl. G06F 16/2457 (2019.01); G06F 16/2455 (2019.01); G06F 16/248 (2019.01)

CPC G06F 16/24578 (2019.01) [G06F 16/24556 (2019.01); G06F 16/248 (2019.01)]

16 Claims

1. A data processing system comprising:

a processor; and

a memory in communication with the processor, the memory comprising executable instructions that, when executed by the processor, cause the data processing system to perform:

receiving, via a query representation model, a multimodal search query for searching for one or more multimodal assets from among a plurality of candidate multimodal assets, wherein a multimodal asset contains two or more different types of content including graphic or image content and each of the candidate multimodal assets has a corresponding multimodal representation and a domain-specific representation, the domain-specific representation representing domain-specific knowledge for the corresponding candidate multimodal asset;

parsing, via the query representation model, the multimodal search query to identify from multimodal content of the search query a first content type and a second content type in the search query, the second content type being a graphic or image content type;

transmitting the first content type to a first representation machine-learning (ML) model to generate a first set of vector embeddings;

transmitting the second content type to a second representation ML model to generate a second set of vector embeddings;

transmitting the first and second sets of vector embeddings to a tensor generation unit to generate tensors based on the first and second sets of vector embeddings and to output a query tensor representation;

comparing the query tensor representation to a plurality of the multimodal representations to identify a set of candidate multimodal assets matching the search query;

reducing the set of candidate multimodal assets matching the search query by comparing the plurality of domain-specific representations to the search query; and

based on the reduced set of candidate multimodal assets, providing the candidate multimodal assets for display as search results to the search query.
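
The steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the claim does not specify the representation models, the tensor construction, or the comparison metric, so everything here is a labeled stand-in. The hashed bag-of-words text embedding, the intensity-histogram image embedding, the flattened outer product used as the "query tensor representation", cosine similarity, and all names (`embed_text`, `embed_image`, `fuse`, `search`, `top_k`, `domain_threshold`) are assumptions made for illustration only.

```python
import math
import zlib

DIM = 8  # hypothetical embedding width

def embed_text(text):
    """Stand-in for the first representation ML model: hashed bag of words."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        vec[zlib.crc32(word.encode()) % DIM] += 1.0
    return vec

def embed_image(pixels):
    """Stand-in for the second representation ML model: intensity histogram."""
    vec = [0.0] * DIM
    for p in pixels:  # grayscale values in 0..255
        vec[min(p * DIM // 256, DIM - 1)] += 1.0
    return vec

def fuse(text_vec, image_vec):
    """Stand-in for the tensor generation unit: flattened outer product."""
    return [t * i for t in text_vec for i in image_vec]

def cosine(a, b):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0

def make_asset(name, text, pixels, domain_text):
    """Each candidate asset carries a multimodal and a domain-specific representation."""
    return {
        "name": name,
        "multimodal": fuse(embed_text(text), embed_image(pixels)),
        "domain": embed_text(domain_text),  # assumed form of domain knowledge
    }

def search(query, assets, top_k=2, domain_threshold=0.3):
    # Parse the query into its two content types (text and image).
    text_vec = embed_text(query["text"])
    image_vec = embed_image(query["image"])
    query_tensor = fuse(text_vec, image_vec)
    # Stage 1: compare the query tensor to the multimodal representations.
    ranked = sorted(assets,
                    key=lambda a: cosine(query_tensor, a["multimodal"]),
                    reverse=True)
    candidates = ranked[:top_k]
    # Stage 2: reduce the candidate set using the domain-specific
    # representations (here compared against the query's text embedding).
    return [a["name"] for a in candidates
            if cosine(text_vec, a["domain"]) >= domain_threshold]

ASSETS = [
    make_asset("logo-dark", "red logo", [10, 20, 30], "red logo"),
    make_asset("logo-bright", "red logo", [250, 250, 250], "red logo"),
    make_asset("logo-untagged", "red logo", [10, 20, 30], ""),
]
QUERY = {"text": "red logo", "image": [10, 20, 30]}

print(search(QUERY, ASSETS))
```

In this toy data, the bright asset drops out in the first stage because its image histogram does not overlap the query's, and the untagged asset drops out in the second stage because its domain-specific representation is empty. The two-stage split mirrors the claim's order of operations: a broad match on multimodal representations followed by a narrowing pass on domain knowledge.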