CPC G06F 16/221 (2019.01) [G06F 16/2456 (2019.01); G06F 16/24578 (2019.01); G06F 16/252 (2019.01); G06F 16/256 (2019.01); G06F 16/258 (2019.01); G06F 16/285 (2019.01); G06F 16/9024 (2019.01)]
20 Claims
1. A method, comprising:
ingesting a dataset to form an ingested dataset;
compressing the ingested dataset in accordance with at least one of one or more algorithmic hash functions to form a compressed data representation of the ingested data, the at least one of the one or more algorithmic hash functions including a plurality of differently-configured hash functions to generate a first subset of instances in a first state relative to a second subset of instances in a second state to specify a degree of similarity, the plurality of the differently-configured hash functions being configured to generate a plurality of hash values, at least one of the plurality of differently-configured hash functions being implemented based on a data type;
identifying an indication associated with instructing computation of the degree of similarity between the ingested dataset and another dataset, the degree of similarity being used to join the dataset and the another dataset in a graph in response to executing instructions at one or more processors;
determining a first ratio associated with an overlap function;
determining a second ratio associated with a coverage function;
identifying a metric configured to be used to determine the degree of similarity;
associating a subset of similarity matrices with a subset of graph data joined to the ingested dataset;
accessing the subset of similarity matrices, at least one subset of a similarity matrix is formed to identify a subset of relevant data associated with the another dataset disposed in a graph data arrangement, at least a portion of the another dataset in the graph data arrangement being formatted as one or more triple-based data formats, the similarity matrix and the degree of similarity being a function of the plurality of the hash values relating to a union of compressed ingested dataset data and compressed target data;
forming a plurality of links among a column of data associated with the ingested dataset as the dataset and the another dataset of the ingested data, the plurality of links being determined based on the degree of similarity based on the compressed data representation; and
receiving a query as a data operation based on ranked datasets identified based on the degree of similarity and other degrees of similarity,
wherein a combined number of hash-derived attributes of the at least one of one or more algorithmic hash functions are configured to determine the degree of similarity based on multiple algorithmic hash functions.
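Illustrative sketch (not part of the claim): the claim recites compressing a dataset with a plurality of differently-configured hash functions, deriving a degree of similarity from the resulting hash values, and determining ratios associated with an overlap function and a coverage function. The claim does not name a specific algorithm or define those ratios. The Python below assumes a MinHash-style scheme as one possible reading, assumes seeds as the way the hash functions differ, and assumes common set-based definitions for the two ratios. All function names (`minhash_signature`, `estimated_similarity`, `overlap_ratio`, `coverage_ratio`) are hypothetical.

```python
import hashlib


def _seeded_hash(value: str, seed: int) -> int:
    """One 'differently-configured' hash function, selected by seed (assumption)."""
    data = f"{seed}:{value}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(values, num_hashes=64):
    """Compress a column into num_hashes minimum hash values (MinHash-style)."""
    items = {str(v) for v in values}
    if not items:
        raise ValueError("cannot build a signature for an empty column")
    return [min(_seeded_hash(v, seed) for v in items) for seed in range(num_hashes)]


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching signature slots; approximates Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def overlap_ratio(a, b):
    """Assumed definition: shared values divided by the smaller set's size."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def coverage_ratio(a, b):
    """Assumed definition: share of a's distinct values that also occur in b."""
    a, b = set(a), set(b)
    if not a:
        return 0.0
    return len(a & b) / len(a)


if __name__ == "__main__":
    ingested = [1, 2, 3, 4]
    other = [3, 4, 5]
    sig = minhash_signature(ingested)
    print(estimated_similarity(sig, minhash_signature(ingested)))  # 1.0
    print(coverage_ratio(ingested, other))  # 0.5
```

Under these assumptions, the signatures stand in for the full columns, so candidate datasets can be ranked by estimated similarity without comparing raw values. The exact ratios are computed here on uncompressed sets only to keep the example short.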