US 12,292,870 B2
Determining a degree of similarity of a subset of tabular data arrangements to subsets of graph data arrangements at ingestion into a data-driven collaborative dataset platform
David Lee Griffith, Austin, TX (US)
Assigned to data.world, Inc., Austin, TX (US)
Filed by data.world, Inc., Austin, TX (US)
Filed on Jul. 1, 2021, as Appl. No. 17/365,214.
Application 17/365,214 is a continuation of application No. 16/137,297, filed on Sep. 20, 2018, granted, now 11,068,453.
Application 16/137,297 is a continuation in part of application No. 15/985,704, filed on May 22, 2018, granted, now 11,068,847.
Application 16/137,297 is a continuation in part of application No. 15/985,702, filed on May 22, 2018, granted, now 11,068,475.
Application 16/137,297 is a continuation in part of application No. 15/927,004, filed on Mar. 20, 2018, granted, now 11,036,716.
Application 16/137,297 is a continuation in part of application No. 15/926,999, filed on Mar. 20, 2018, granted, now 11,016,931.
Application 16/137,297 is a continuation in part of application No. 15/454,923, filed on Mar. 9, 2017, granted, now 10,353,911.
Prior Publication US 2022/0405292 A1, Dec. 22, 2022
Int. Cl. G06F 16/2455 (2019.01); G06F 16/22 (2019.01); G06F 16/2457 (2019.01); G06F 16/25 (2019.01); G06F 16/28 (2019.01); G06F 16/901 (2019.01)

CPC G06F 16/221 (2019.01) [G06F 16/2456 (2019.01); G06F 16/24578 (2019.01); G06F 16/252 (2019.01); G06F 16/256 (2019.01); G06F 16/258 (2019.01); G06F 16/285 (2019.01); G06F 16/9024 (2019.01)]

20 Claims

1. A method, comprising:

ingesting a dataset to form an ingested dataset;

compressing the ingested dataset in accordance with at least one of one or more algorithmic hash functions to form a compressed data representation of the ingested data, the at least one of the one or more algorithmic hash functions including a plurality of differently-configured hash functions to generate a first subset of instances in a first state relative to a second subset of instances in a second state to specify a degree of similarity, the plurality of the differently-configured hash functions being configured to generate a plurality of hash values, at least one of the plurality of differently-configured hash functions being implemented based on a data type;

identifying an indication associated with instructing computation of the degree of similarity between the ingested dataset and another dataset, the degree of similarity being used to join the dataset and the another dataset in a graph in response to executing instructions at one or more processors;

determining a first ratio associated with an overlap function;

determining a second ratio associated with a coverage function;

identifying a metric configured to be used to determine the degree of similarity;

associating a subset of similarity matrices with a subset of graph data joined to the ingested dataset;

accessing the subset of similarity matrices, at least one subset of a similarity matrix is formed to identify a subset of relevant data associated with the another dataset disposed in a graph data arrangement, at least a portion of the another dataset in the graph data arrangement being formatted as one or more triple-based data formats, the similarity matrix and the degree of similarity being a function of the plurality of the hash values relating to a union of compressed ingested dataset data and compressed target data;

forming a plurality of links among a column of data associated with the ingested dataset as the dataset and the another dataset of the ingested data, the plurality of links being determined based on the degree of similarity based on the compressed data representation; and

receiving a query as a data operation based on ranked datasets identified based on the degree of similarity and other degrees of similarity,

wherein a combined number of hash-derived attributes of the at least one of one or more algorithmic hash functions are configured to determine the degree of similarity based on multiple algorithmic hash functions.
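
For illustration only, the hashing and ratio steps recited in claim 1 can be sketched in Python under stated assumptions. The claim does not name a specific hashing scheme or define the overlap and coverage functions, so everything here is hypothetical: the sketch substitutes a MinHash-style signature built from seeded BLAKE2b hashes for the "plurality of differently-configured hash functions", treats matching versus non-matching signature positions as the two states, takes overlap as |A ∩ B| / min(|A|, |B|), and takes coverage as |A ∩ B| / |A|. The names normalize, seeded_hash, minhash_signature, estimated_similarity, overlap_ratio and coverage_ratio are not from the patent.

```python
import hashlib
from typing import Iterable, List, Set


def normalize(value) -> str:
    """Canonical token for a cell, chosen by data type (hypothetical rule)."""
    if isinstance(value, bool):          # check bool first: bool is a subclass of int
        return f"bool:{value}"
    if isinstance(value, (int, float)):
        return f"num:{float(value)!r}"
    return f"str:{str(value).strip().lower()}"


def seeded_hash(seed: int, token: str) -> int:
    """One 'differently-configured' hash function: BLAKE2b salted with the seed."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def column_tokens(values: Iterable) -> Set[str]:
    """Distinct normalized tokens of a column."""
    return {normalize(v) for v in values}


def minhash_signature(values: Iterable, num_hashes: int = 128) -> List[int]:
    """Compressed representation: the minimum hash value under each seed."""
    tokens = column_tokens(values)
    if not tokens:
        raise ValueError("column has no values")
    return [min(seeded_hash(seed, t) for t in tokens) for seed in range(num_hashes)]


def estimated_similarity(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of signature positions that match (first state) out of all positions."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def overlap_ratio(a: Set[str], b: Set[str]) -> float:
    """Assumed overlap function: shared values relative to the smaller column."""
    return len(a & b) / min(len(a), len(b))


def coverage_ratio(a: Set[str], b: Set[str]) -> float:
    """Assumed coverage function: share of the ingested column found in the target."""
    return len(a & b) / len(a)


if __name__ == "__main__":
    ingested = ["TX", "CA", "NY", "WA", "OR", "NV"]   # column of the ingested dataset
    target = ["tx", "ca", "ny", "fl"]                 # column of another dataset
    a, b = column_tokens(ingested), column_tokens(target)
    print("estimated similarity:",
          estimated_similarity(minhash_signature(ingested), minhash_signature(target)))
    print("overlap:", overlap_ratio(a, b))    # 3 shared / min(6, 4) = 0.75
    print("coverage:", coverage_ratio(a, b))  # 3 shared / 6 ingested = 0.5
```

In this sketch the signature comparison approximates the Jaccard similarity of the two columns without comparing raw values, and the resulting score could then be used to rank candidate datasets for linking. How the claimed method actually combines the ratios and hash-derived attributes is not specified in the text above.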