US 12,333,791 B2
Determining media documents embedded in other media documents
Pavel Sukhov, Oslo (NO); and Thomas Peter Kunert, Oslo (NO)
Assigned to Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed by Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed on Aug. 18, 2022, as Appl. No. 17/890,769.
Prior Publication US 2024/0062529 A1, Feb. 22, 2024
Int. Cl. G06V 10/774 (2022.01); G06V 10/74 (2022.01); G06V 10/778 (2022.01); G06V 10/82 (2022.01); G06V 20/40 (2022.01)

CPC G06V 10/7747 (2022.01) [G06V 10/761 (2022.01); G06V 10/7796 (2022.01); G06V 10/82 (2022.01); G06V 20/41 (2022.01); G06V 20/46 (2022.01)]

17 Claims

1. An apparatus, comprising:

a device including memory having processor-executable code stored therein, and a processor that is adapted to execute the processor-executable code, wherein the processor-executable code includes processor-executable instructions that, in response to execution, enable the device to perform actions, including:

obtaining source input images that are derived from a set of source media documents;

obtaining target input images that are derived from a set of target media documents;

generating source fingerprints from the source input images using a source machine-learning model;

generating target fingerprints from the target input images using a target machine-learning model, wherein:

the source machine-learning model includes a first neural network,

the target machine-learning model includes a second neural network that is different from the first neural network, and

the source machine-learning model was trained in parallel with the target machine-learning model such that the source machine-learning model outputs a source fingerprint from a source input image and the target machine-learning model outputs a target fingerprint from a target input image with a training objective that:

a distance between the source fingerprint and the target fingerprint is less than a first threshold if the source input image is embedded within the target input image, and

the distance between the source fingerprint and the target fingerprint is greater than the first threshold if the source input image is absent from the target input image;

using the source fingerprints and the target fingerprints to determine a set of candidate media-document pairs, wherein each candidate media-document pair of the set of candidate media-document pairs includes a candidate source media document from the set of source media documents and a candidate target media document from the set of target media documents such that the candidate source media document is a candidate for being embedded in the candidate target media document; and

using a confirmation machine-learning model to determine, for candidate media-document pairs in the set of candidate media-document pairs, a confidence score that the candidate source media document of the set of candidate media-document pairs is embedded in the candidate target media document of the set of candidate media-document pairs.
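
Illustrative sketch (not part of the claim): the Python below gives one possible reading of the flow recited in claim 1, using only the standard library. The claim does not specify a distance metric, a loss function, or how fingerprints are compared to form candidate pairs, so the Euclidean distance, the hinge-style penalty in `objective_penalty`, the threshold comparison in `candidate_pairs`, and the stubbed confirmation model are all assumptions made for illustration. All function names, document names, and fingerprint values are hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Fingerprint = Sequence[float]


def distance(a: Fingerprint, b: Fingerprint) -> float:
    """Euclidean distance between two fingerprints (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def objective_penalty(dist: float, embedded: bool, threshold: float) -> float:
    """Hinge-style penalty (assumed form) that is zero when the claimed
    training objective holds: distance below the threshold for embedded
    pairs, above the threshold for non-embedded pairs."""
    if embedded:
        return max(0.0, dist - threshold)
    return max(0.0, threshold - dist)


def candidate_pairs(
    source_fps: Dict[str, List[Fingerprint]],
    target_fps: Dict[str, List[Fingerprint]],
    threshold: float,
) -> List[Tuple[str, str]]:
    """Pair a source document with a target document when any of their
    fingerprints are closer than the threshold (assumed selection rule)."""
    pairs = []
    for source_id, s_list in source_fps.items():
        for target_id, t_list in target_fps.items():
            if any(distance(s, t) < threshold for s in s_list for t in t_list):
                pairs.append((source_id, target_id))
    return pairs


def confirm(
    pairs: List[Tuple[str, str]],
    confirmation_model: Callable[[str, str], float],
) -> Dict[Tuple[str, str], float]:
    """Attach a confidence score from a confirmation model to each pair."""
    return {pair: confirmation_model(*pair) for pair in pairs}


# Hypothetical fingerprints, standing in for the outputs of the source
# model and the target model (one list of fingerprints per document).
source_fps = {"logo.png": [[0.0, 0.0]], "chart.png": [[5.0, 5.0]]}
target_fps = {
    "slides.pptx": [[0.1, 0.0], [9.0, 9.0]],
    "video.mp4": [[5.0, 5.2]],
}

pairs = candidate_pairs(source_fps, target_fps, threshold=0.5)

# Stub standing in for the confirmation machine-learning model.
scores = confirm(pairs, lambda source_id, target_id: 0.9)
```

In this sketch the inexpensive fingerprint comparison narrows the full cross product of source and target documents to a short candidate list, and only those candidates are passed to the (presumably costlier) confirmation model.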