CPC G06F 16/2237 (2019.01) [G06F 16/221 (2019.01); G06F 16/2282 (2019.01); G06F 16/24553 (2019.01); G06F 16/288 (2019.01); G06F 16/29 (2019.01)] | 16 Claims |
1. A method for data processing, comprising:
generating a master dataset that includes data from a plurality of datasets, the master dataset comprising a plurality of rows that include a dataset name associated with a dataset of the plurality of datasets, an indication of a field from the dataset, a value corresponding to the field, and an indication of a number of occurrences of the value within the dataset of the plurality of datasets;
generating a plurality of text strings for the plurality of rows in the master dataset, the plurality of text strings including a tokenized field string, a tokenized value string, and a non-tokenized field name string, wherein the non-tokenized field name string is positioned between the tokenized field string and the tokenized value string;
generating a plurality of vectors associated with the master dataset by using a word embedding function to process the plurality of text strings;
generating, using the word embedding function, a query vector corresponding to a received query; and
executing the received query to retrieve query results from the master dataset, wherein retrieving the query results comprises determining that a cosine similarity between the query vector and at least one vector of the plurality of vectors satisfies a similarity threshold.
|
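The steps recited in claim 1 can be sketched in code as follows. This is a minimal illustration under stated assumptions, not the patented implementation. All function names (`build_master_rows`, `tokenize`, `row_text`, `make_embedder`, `cosine`, `search`) and the sample data are hypothetical. A simple bag-of-words count vector stands in for the word embedding function, which the claim does not specify. Tokenization is assumed to split on non-alphanumeric characters and camelCase boundaries. "Satisfies a similarity threshold" is assumed to mean greater than or equal to the threshold.

```python
import math
import re
from collections import Counter


def build_master_rows(datasets):
    """Flatten {dataset name: list of records} into master rows holding
    dataset name, field, value, and the value's occurrence count."""
    rows = []
    for name, records in datasets.items():
        counts = Counter((f, str(v)) for rec in records for f, v in rec.items())
        for (field, value), n in counts.items():
            rows.append({"dataset": name, "field": field, "value": value, "count": n})
    return rows


def tokenize(text):
    """Assumed rule: split on camelCase boundaries and non-alphanumerics."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", text)
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", spaced) if t]


def row_text(row):
    """Tokenized field string, then the non-tokenized field name,
    then the tokenized value string."""
    return " ".join(tokenize(row["field"]) + [row["field"]] + tokenize(row["value"]))


def make_embedder(texts):
    """Stand-in for a word embedding function: count vectors over a
    vocabulary built from the row texts; unknown tokens are ignored."""
    vocab = {}
    for text in texts:
        for tok in text.lower().split():
            vocab.setdefault(tok, len(vocab))

    def embed(text):
        vec = [0.0] * len(vocab)
        for tok in text.lower().split():
            if tok in vocab:
                vec[vocab[tok]] += 1.0
        return vec

    return embed


def cosine(a, b):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, rows, vectors, embed, threshold):
    """Return the master rows whose vector meets the similarity threshold."""
    q = embed(query)
    return [row for row, vec in zip(rows, vectors) if cosine(q, vec) >= threshold]


# Illustrative data only.
datasets = {
    "customers": [
        {"customerName": "Alice Smith", "city": "Paris"},
        {"customerName": "Bob Jones", "city": "Paris"},
    ],
    "orders": [{"shipCity": "Paris"}],
}
rows = build_master_rows(datasets)
texts = [row_text(r) for r in rows]
embed = make_embedder(texts)
vectors = [embed(t) for t in texts]
hits = search("city paris", rows, vectors, embed, threshold=0.7)
```

In this sketch the same `embed` function produces both the row vectors and the query vector, mirroring the claim's use of one word embedding function for both. Raising the threshold narrows the result set; a production system would presumably substitute a trained embedding model for the count vectors.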