US 11,675,764 B2
Learned data ontology using word embeddings from multiple datasets
Zuye Zheng, San Francisco, CA (US); Vaibhav Garg, San Francisco, CA (US); Sumitkumar Kukkar, Dublin, CA (US); Timothy Noonan, Beulah, CO (US); Evan Tsao, San Francisco, CA (US); Thushara Paul, San Francisco, CA (US); and Behzad Farhang Richey, Los Angeles, CA (US)
Assigned to Salesforce, Inc., San Francisco, CA (US)
Filed by Salesforce.com, Inc., San Francisco, CA (US)
Filed on Oct. 16, 2020, as Appl. No. 17/72,615.
Prior Publication US 2022/0121636 A1, Apr. 21, 2022
Int. Cl. G06F 16/22 (2019.01); G06F 40/284 (2020.01); G06F 16/28 (2019.01); G06F 16/29 (2019.01); G06F 16/2455 (2019.01)
CPC G06F 16/2237 (2019.01) [G06F 16/221 (2019.01); G06F 16/2282 (2019.01); G06F 16/24553 (2019.01); G06F 16/288 (2019.01); G06F 16/29 (2019.01)]
16 Claims
1. A method for data processing, comprising:
generating a master dataset that includes data from a plurality of datasets, the master dataset comprising a plurality of rows that include a dataset name associated with a dataset of the plurality of datasets, an indication of a field from the dataset, a value corresponding to the field, and an indication of a number of occurrences of the value within the dataset of the plurality of datasets;
generating a plurality of text strings for the plurality of rows in the master dataset, the plurality of text strings including a tokenized field string, a tokenized value string, and a non-tokenized field name string, wherein the non-tokenized field name string is positioned between the tokenized field string and the tokenized value string;
generating a plurality of vectors associated with the master dataset by using a word embedding function to process the plurality of text strings;
generating, using the word embedding function, a query vector corresponding to a received query; and
executing the received query to retrieve query results from the master dataset, wherein retrieving the query results comprises determining that a cosine similarity between the query vector and at least one vector of the plurality of vectors satisfies a similarity threshold.
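For illustration only, the steps recited in claim 1 can be sketched in Python. This is a minimal sketch under stated assumptions, not the patented implementation. The function names (`build_master_rows`, `tokenize`, `row_to_strings`, `embed`, `cosine`, `search`), the sample data, the camel-case and underscore tokenization rule, and the 0.3 similarity threshold are all hypothetical. In particular, `embed` is a toy hashed bag-of-words stand-in for a trained word embedding function, which the claim does not specify.

```python
import hashlib
import math
import re
from collections import Counter

DIM = 256  # assumed vector size for the toy embedding


def build_master_rows(datasets):
    """Flatten {dataset_name: [record dicts]} into master rows holding the
    dataset name, field, value, and number of occurrences of that value."""
    rows = []
    for name, records in datasets.items():
        counts = Counter(
            (field, str(value)) for rec in records for field, value in rec.items()
        )
        for (field, value), n in counts.items():
            rows.append({"dataset": name, "field": field, "value": value, "count": n})
    return rows


def tokenize(text):
    """Assumed rule: split on camelCase boundaries and non-alphanumerics."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", text)
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", spaced) if t]


def row_to_strings(row):
    """Tokenized field string, non-tokenized field name, tokenized value
    string, with the non-tokenized name positioned in the middle."""
    return [
        " ".join(tokenize(row["field"])),
        row["field"],
        " ".join(tokenize(row["value"])),
    ]


def embed(text):
    """Toy stand-in for a word embedding function: hashed bag of words."""
    vec = [0.0] * DIM
    for tok in text.lower().split():
        idx = int(hashlib.sha256(tok.encode("utf-8")).hexdigest(), 16) % DIM
        vec[idx] += 1.0
    return vec


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, rows, vectors, threshold=0.3):
    """Return (score, row) pairs whose cosine similarity to the query
    vector satisfies the threshold, best match first."""
    q = embed(query)
    scored = [(cosine(q, v), row) for row, v in zip(rows, vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(s, r) for s, r in scored if s >= threshold]


# Hypothetical sample data, invented for this sketch.
datasets = {
    "accounts": [
        {"BillingCity": "San Francisco"},
        {"BillingCity": "San Francisco"},
        {"BillingCity": "Dublin"},
    ],
    "leads": [{"lead_status": "Open"}],
}
rows = build_master_rows(datasets)
vectors = [embed(" ".join(row_to_strings(r))) for r in rows]

for score, row in search("billing city", rows, vectors):
    print(round(score, 3), row["dataset"], row["field"], row["value"], row["count"])
```

The same `embed` function is applied to both the row strings and the query, so the query vector and the row vectors live in the same space, as the claim requires. A real system would presumably replace `embed` with a learned embedding model and the linear scan in `search` with an indexed nearest-neighbor lookup.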