US 12,254,265 B2
Generating unique word embeddings for jargon-specific tabular data for neural network training and usage
Bhavna Agrawal, Armonk, NY (US); Elham Khabiri, Briarcliff Manor, NY (US); Yingjie Li, Chappaqua, NY (US); and Pranav Girish Sankhe, Buffalo, NY (US)
Assigned to International Business Machines Corporation, Armonk, NY (US)
Filed by International Business Machines Corporation, Armonk, NY (US)
Filed on Sep. 24, 2021, as Appl. No. 17/483,989.
Prior Publication US 2023/0097150 A1, Mar. 30, 2023
Int. Cl. G06F 40/18 (2020.01); G06F 40/284 (2020.01); G06N 3/047 (2023.01); G06N 3/08 (2023.01)
CPC G06F 40/18 (2020.01) [G06F 40/284 (2020.01); G06N 3/047 (2023.01); G06N 3/08 (2013.01)] 18 Claims
OG exemplary drawing
 
1. A method of using a computing device to generate unique word embeddings for jargon-specific tabular data comprising:
accessing, by the computing device, tabular data containing a plurality of entries of alphanumeric data, individual entries comprising one or more strings;
generating, by the computing device using a tokenization process, a plurality of tokens of the plurality of entries of alphanumeric data, the tokenization process maintaining jargon-specific features of the alphanumeric data by masking a set comprising every individual numerical character in all strings of the plurality of entries of alphanumeric data in the tabular data by replacing the individual numerical characters of the set with an equal size set of individual replacement characters to form masked strings, wherein one or more of the plurality of tokens comprise masked strings that maintain an original sequence of the alphanumeric data while masking one or more characters in the original sequence and keeping the other characters in the original sequence unchanged;
generating, by the computing device using the tokens, a plurality of embeddings of the plurality of entries of alphanumeric data, the plurality of embeddings capturing similarity of the plurality of entries considering all of global features, column features, and row features in the tokens of the tabular data, and wherein the generating the plurality of embeddings creates an embedded table;
forming a total context by:
for a cell in the embedded table, extracting a sliced table containing the cell and adjacent cells;
selecting a row context and a column context for the cell from rows and columns of the sliced table; and
concatenating the row context and the column context to form the total context;
training a supervised attention-based neural network at least by applying cells of the embedded table and corresponding total contexts to the supervised attention-based neural network using pre-defined classes; and
predicting, by the computing device using the supervised attention-based neural network, probabilities for the pre-defined classes for the tabular data using the generated plurality of embeddings.
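The sketches below are editorial illustrations of the claimed steps, not the patented implementation; all function names, libraries, and parameters are assumptions. First, a minimal sketch of the digit-masking tokenization recited in the claim, assuming whitespace-delimited cell text and "#" as the replacement character: each numerical character is replaced one-for-one, so the original sequence and the surrounding jargon-specific characters are preserved.

```python
import re

def mask_digits(entry: str, replacement: str = "#") -> str:
    """Replace every numerical character with a replacement character,
    one-for-one, keeping all other characters and their order intact."""
    return re.sub(r"\d", replacement, entry)

def tokenize_entry(entry: str) -> list[str]:
    """Whitespace-tokenize an alphanumeric table entry after masking,
    preserving jargon-specific patterns such as equipment codes."""
    return mask_digits(entry).split()

# Hypothetical maintenance-log entry mixing jargon and serial numbers
print(tokenize_entry("PUMP-3A BRG 7310 REPLACED"))
# ['PUMP-#A', 'BRG', '####', 'REPLACED']
```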
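One plausible reading of the embedding step, sketched under stated assumptions: skip-gram embeddings are trained separately on a global view, per-row sequences, and per-column sequences of the masked tokens, and the three vectors are concatenated per cell so the result reflects global, column, and row co-occurrence. The gensim usage, the one-token-per-cell simplification, and the vector sizes are illustrative, not taken from the patent.

```python
import numpy as np
from gensim.models import Word2Vec

def table_embeddings(token_table, dim: int = 16) -> dict:
    """Train skip-gram embeddings on global, per-row, and per-column token
    sequences and concatenate them, so each token's vector reflects
    global, row, and column features of the table (illustrative only)."""
    rows = token_table                               # each row: list of masked tokens
    cols = [list(col) for col in zip(*token_table)]  # column-wise token sequences
    global_corpus = [[tok for row in rows for tok in row]]

    w_global = Word2Vec(global_corpus, vector_size=dim, min_count=1, sg=1)
    w_row = Word2Vec(rows, vector_size=dim, min_count=1, sg=1)
    w_col = Word2Vec(cols, vector_size=dim, min_count=1, sg=1)

    vocab = {tok for row in rows for tok in row}
    return {tok: np.concatenate([w_global.wv[tok], w_row.wv[tok], w_col.wv[tok]])
            for tok in vocab}

def embed_table(token_table, emb: dict) -> np.ndarray:
    """Build the embedded table: one concatenated vector per cell
    (assuming, for brevity, one masked token per cell)."""
    return np.array([[emb[tok] for tok in row] for row in token_table])
```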
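A minimal sketch of the total-context formation: for a given cell, a sliced table of adjacent cells is extracted, the cell's row and column within the slice serve as the row and column contexts, and the two are concatenated. The window size and the NumPy layout (rows x columns x embedding dimension) are assumptions for illustration.

```python
import numpy as np

def total_context(embedded_table: np.ndarray, r: int, c: int, window: int = 1) -> np.ndarray:
    """Form the total context for cell (r, c):
    1. extract a sliced table of adjacent cells containing the cell,
    2. take the cell's row and column within the slice as row/column context,
    3. concatenate the two contexts into a single vector."""
    n_rows, n_cols, dim = embedded_table.shape
    r0, r1 = max(0, r - window), min(n_rows, r + window + 1)
    c0, c1 = max(0, c - window), min(n_cols, c + window + 1)
    sliced = embedded_table[r0:r1, c0:c1]        # sliced table containing the cell
    row_context = sliced[r - r0].reshape(-1)     # the cell's row within the slice
    col_context = sliced[:, c - c0].reshape(-1)  # the cell's column within the slice
    return np.concatenate([row_context, col_context])

# Example: a 5x4 embedded table with 8-dimensional cell embeddings
table = np.random.rand(5, 4, 8)
print(total_context(table, r=2, c=1).shape)  # (48,) for a 3x3 slice: 3*8 + 3*8
```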
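Finally, a hedged sketch of the supervised attention-based classifier: the cell embedding attends over its total context (kept here as a sequence of context-cell vectors rather than a flattened vector, so attention can weigh them individually), and a linear head with softmax yields probabilities for the pre-defined classes. PyTorch, the multi-head attention layer, and all hyperparameters are illustrative choices, not the patent's architecture.

```python
import torch
import torch.nn as nn

class CellClassifier(nn.Module):
    """Attention from the cell embedding over its total context,
    followed by a linear head over the pre-defined classes."""
    def __init__(self, dim: int, n_classes: int, n_heads: int = 2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.classify = nn.Linear(2 * dim, n_classes)

    def forward(self, cell: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # cell: (batch, dim); context: (batch, n_context_cells, dim)
        query = cell.unsqueeze(1)                         # attend from the cell ...
        attended, _ = self.attn(query, context, context)  # ... over its total context
        features = torch.cat([cell, attended.squeeze(1)], dim=-1)
        return self.classify(features)                    # logits over pre-defined classes

# Supervised training on hypothetical (cell, context, class) triples
model = CellClassifier(dim=8, n_classes=4)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

cell = torch.rand(16, 8); context = torch.rand(16, 6, 8); labels = torch.randint(0, 4, (16,))
optimizer.zero_grad()
loss_fn(model(cell, context), labels).backward()
optimizer.step()

# Prediction: softmax over the logits gives per-class probabilities
probs = torch.softmax(model(cell, context), dim=-1)
```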