CPC G06N 3/08 (2013.01) [G06F 17/18 (2013.01); G06F 40/166 (2020.01); G06F 40/284 (2020.01); G06N 20/10 (2019.01)]
16 Claims
1. A method, comprising:
receiving, by a processor, a plurality of character strings stored in a plurality of data fields in a first data file;
based on at least one neural network model:
(i) assigning, by the processor, to each character or at least one group of characters in each character string, a probability of belonging to at least one specific data type; and
(ii) splitting, by the processor, each character string into at least one word based on the probability of belonging to the at least one specific data type;
wherein the at least one neural network model has been trained to match word samples to data types in a plurality of data types based on a training dataset that comprises the plurality of data types and the word samples belonging to each data type in the plurality of data types;
generating, by the processor, at least one data vector based on at least one predefined format;
wherein the at least one data vector comprises the at least one word from each of the plurality of character strings for the at least one specific data type; and
constructing, by the processor, a second data file based on the at least one data vector with the at least one predefined format.
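The data flow recited in claim 1 can be illustrated with a minimal Python sketch. It is not the claimed implementation: the claim requires a trained neural network model, whereas the hypothetical function `char_type_probs` below is a hand-written rule that only stands in for such a model so the example can run. The data type names (`alpha`, `digit`, `other`), the helper names, and the choice of CSV columns as the "predefined format" are all assumptions made for illustration.

```python
import csv
import io
from itertools import groupby

# Hypothetical data types; the claim does not name any.
DATA_TYPES = ("alpha", "digit", "other")


def char_type_probs(ch):
    """Stand-in for the trained model: probability of each data type for one character.

    Assumption: a real system would obtain these values from a neural network.
    """
    if ch.isalpha():
        return {"alpha": 0.9, "digit": 0.05, "other": 0.05}
    if ch.isdigit():
        return {"alpha": 0.05, "digit": 0.9, "other": 0.05}
    return {"alpha": 0.05, "digit": 0.05, "other": 0.9}


def split_string(s):
    """Split s into (word, data_type) pairs wherever the most probable type changes."""
    labels = [max(char_type_probs(ch).items(), key=lambda kv: kv[1])[0] for ch in s]
    words = []
    pos = 0
    for label, run in groupby(labels):
        n = len(list(run))
        words.append((s[pos:pos + n], label))
        pos += n
    return words


def build_vectors(strings, wanted=("alpha", "digit")):
    """Build one data vector per wanted type, with one entry per input string."""
    vectors = {t: [] for t in wanted}
    for s in strings:
        words = split_string(s)
        for t in wanted:
            vectors[t].append(" ".join(w for w, label in words if label == t))
    return vectors


def write_second_file(vectors, out):
    """Write the vectors as CSV columns (the assumed predefined format)."""
    columns = list(vectors)
    writer = csv.writer(out)
    writer.writerow(columns)
    writer.writerows(zip(*(vectors[c] for c in columns)))


# Usage: strings as they might appear in fields of a first data file.
fields = ["Main12", "7Oak"]
buffer = io.StringIO()
write_second_file(build_vectors(fields), buffer)
```

In this sketch each character receives a probability per data type, consecutive characters sharing the most probable type are grouped into a word, and the words of each type are collected across all input strings into one vector that becomes a column of the output file.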