US 12,217,178 B2
Computer-based systems configured for detecting and splitting data types in a data file and methods of use thereof
Galen Rafferty, Mahomet, IL (US); Reza Farivar, Champaign, IL (US); Jeremy Goodsitt, Champaign, IL (US); Anh Truong, Champaign, IL (US); and Austin Walters, Savoy, IL (US)
Assigned to Capital One Services, LLC, McLean, VA (US)
Filed by Capital One Services, LLC, McLean, VA (US)
Filed on Apr. 3, 2023, as Appl. No. 18/295,153.
Application 18/295,153 is a continuation of application No. 17/014,394, filed on Sep. 8, 2020, granted, now 11,620,520.
Application 17/014,394 is a continuation of application No. 16/667,451, filed on Oct. 29, 2019, granted, now 10,789,532, issued on Sep. 29, 2020.
Prior Publication US 2023/0244939 A1, Aug. 3, 2023
This patent is subject to a terminal disclaimer.
Int. Cl. G06N 3/08 (2023.01); G06F 17/18 (2006.01); G06F 40/166 (2020.01); G06F 40/284 (2020.01); G06N 20/10 (2019.01)

CPC G06N 3/08 (2013.01) [G06F 17/18 (2013.01); G06F 40/166 (2020.01); G06F 40/284 (2020.01); G06N 20/10 (2019.01)]

16 Claims

1. A method, comprising:

receiving, by a processor, a plurality of character strings stored in a plurality of data fields in a first data file;

based on at least one neural network model:

(i) assigning, by the processor, to each character or at least one group of characters in each character string, a probability of belonging to at least one specific data type; and

(ii) splitting, by the processor, each character string into at least one word based on the probability of belonging to the at least one specific data type;

wherein the at least one neural network model has been trained to match word samples to data types in a plurality of data types based on a training dataset that comprises the plurality of data types and the word samples belonging to each data type in the plurality of data types;

generating, by the processor, at least one data vector based on at least one predefined format;

wherein the at least one data vector comprises the at least one word from each of the plurality of character strings for the at least one specific data type; and

constructing, by the processor, a second data file based on the at least one data vector with the at least one predefined format.
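
The steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation. The trained neural network model is replaced here by a hypothetical rule-based stand-in (`char_probabilities`). The two data types (`digit`, `alpha`), the 0.6 threshold, and the choice of a CSV with one column per data type as the "predefined format" are all assumptions made for illustration.

```python
import csv
import io
from itertools import groupby

# Hypothetical data types; the claim does not name any specific types.
DATA_TYPES = ("digit", "alpha")


def char_probabilities(ch):
    """Stand-in for the trained neural network model (an assumption).

    Returns a probability of belonging to each data type for one character.
    """
    if ch.isdigit():
        return {"digit": 0.9, "alpha": 0.1}
    if ch.isalpha():
        return {"digit": 0.1, "alpha": 0.9}
    return {"digit": 0.5, "alpha": 0.5}  # separators are treated as ambiguous


def split_by_type(text, threshold=0.6):
    """Split a character string into (data_type, word) pairs.

    A word is a run of consecutive characters sharing the same most
    probable type; characters below the threshold act as separators.
    """
    labels = []
    for ch in text:
        probs = char_probabilities(ch)
        best = max(probs, key=probs.get)
        labels.append(best if probs[best] >= threshold else None)
    words = []
    for label, group in groupby(zip(labels, text), key=lambda pair: pair[0]):
        if label is not None:
            words.append((label, "".join(ch for _, ch in group)))
    return words


def build_vectors(strings):
    """Build one data vector per data type, with one entry per input string."""
    vectors = {t: [] for t in DATA_TYPES}
    for s in strings:
        words = split_by_type(s)
        for t in DATA_TYPES:
            matches = [w for label, w in words if label == t]
            # Keep the first word of each type; empty if none was found.
            vectors[t].append(matches[0] if matches else "")
    return vectors


def write_second_file(vectors):
    """Serialize the data vectors as CSV text (the assumed predefined format)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(DATA_TYPES)
    writer.writerows(zip(*(vectors[t] for t in DATA_TYPES)))
    return buf.getvalue()


if __name__ == "__main__":
    fields = ["John 12345", "98765-Mary"]  # strings from a "first data file"
    print(write_second_file(build_vectors(fields)))
```

In this sketch, mixed fields such as "John 12345" and "98765-Mary" end up in separate, type-consistent columns regardless of the order in which the words appeared in the original field. A real system would substitute a trained model for `char_probabilities` and might keep all words of a type rather than only the first.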