US 12,306,809 B2
Identifying duplication multimedia entities
Keith G. Frost, Delaware, OH (US); Stephen A. Boxwell, Columbus, OH (US); Kyle M. Brake, Dublin, OH (US); and Stanley J. Vernier, Grove City, OH (US)
Assigned to INTERNATIONAL BUSINESS MACHINES CORPORATION, Armonk, NY (US)
Filed by International Business Machines Corporation, Armonk, NY (US)
Filed on May 17, 2019, as Appl. No. 16/415,041.
Prior Publication US 2020/0364195 A1, Nov. 19, 2020
Int. Cl. G06F 16/215 (2019.01); G06F 16/2455 (2019.01); G06F 16/25 (2019.01); G06F 16/483 (2019.01)

CPC G06F 16/215 (2019.01) [G06F 16/24568 (2019.01); G06F 16/258 (2019.01); G06F 16/483 (2019.01)]

18 Claims

1. A computer system comprising:

a hardware processor operatively coupled to memory; and

a knowledge engine in communication with the hardware processor, the knowledge engine configured to implement one or more tools to support identification of duplicate media content, the one or more tools comprising:

a data manager configured to convert a first data stream into a first sequence of events and to convert a second data stream into a second sequence of events, wherein each of the first and second data streams comprises multimedia data, and wherein converting the first data stream into the first sequence of events and the second data stream into the second sequence of events further comprises performing speech-to-text on an audio portion and performing object recognition on a visual component of each of the first and second data streams to identify objects and text present within each respective data stream, wherein the objects further comprise one or more of audio, image, and video, and representing the identified objects and text in the first data stream as the first sequence of events and the identified objects and text in the second data stream as the second sequence of events with each event having timestamp data associated with a corresponding time frame; and

an assessment manager configured to:

conduct a similarity assessment between the first and second data streams, the similarity assessment to produce a distance measurement between the first sequence of events and the second sequence of events, the distance measurement to quantify similarity between the first data stream and the second data stream, wherein conducting the similarity assessment further comprises,

generating a first ordered list of the identified objects and text based on the first sequence of events and the timestamp data, and generating a second ordered list of the identified objects and text based on the second sequence of events and the timestamp data, and

producing the distance measurement based on a comparison between the first and second ordered lists of the identified objects and text, wherein producing the distance measurement further comprises determining a value reflecting a quantity of edits required to create equivalency between the first and second ordered lists of the identified objects and text,

selectively identify duplicate data in the first and second sequences of events ordered by time and responsive to the similarity assessment and the produced distance measurement of the first and second ordered lists of the identified objects and text, wherein the selectively identifying the duplicate data is further based on measuring the produced distance measurement against a threshold value; and

in response to identifying the duplicate data based on the similarity assessment, outputting a response indicating the duplicate data and the respective data source identifying the duplicate data.
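
The assessment steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation. It assumes that speech-to-text and object recognition have already produced timestamped labels, that the "quantity of edits" is a Levenshtein distance (insertions, deletions, substitutions) over the time-ordered labels, and that a distance at or below the threshold indicates a duplicate; the claim specifies neither the edit operations nor the direction of the threshold comparison. The names Event, ordered_labels, edit_distance, and is_duplicate are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Event:
    """One recognized object or transcribed text item (hypothetical type)."""
    timestamp: float  # assumed unit: seconds from the start of the stream
    label: str        # recognized object name or transcribed text


def ordered_labels(events: Sequence[Event]) -> List[str]:
    """Build the ordered list: sort events by timestamp, keep the labels."""
    return [e.label for e in sorted(events, key=lambda e: e.timestamp)]


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two label lists (assumed edit model)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def is_duplicate(first: Sequence[Event], second: Sequence[Event],
                 threshold: int) -> Tuple[bool, int]:
    """Compare the distance against a threshold (assumed: <= is duplicate)."""
    distance = edit_distance(ordered_labels(first), ordered_labels(second))
    return distance <= threshold, distance


# Illustrative data only; the second stream carries one extra event.
stream_a = [Event(0.0, "dog"), Event(1.5, "hello"), Event(3.0, "car")]
stream_b = [Event(2.9, "car"), Event(0.1, "dog"),
            Event(1.4, "hello"), Event(4.0, "tree")]
print(is_duplicate(stream_a, stream_b, threshold=1))  # prints (True, 1)
```

Sorting by timestamp before comparison means the order in which events arrive does not affect the distance; only their temporal order within each stream does.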