US 12,443,614 B2
Efficient column detection using sequencing, and applications thereof
Robert Raymond Lindner, Fitchburg, WI (US); and Carlos Vera-Ciro, Madison, WI (US)
Assigned to VEDA Data Solutions, Inc., Madison, WI (US)
Filed by VEDA Data Solutions, Inc., Washington, DC (US)
Filed on Feb. 24, 2023, as Appl. No. 18/114,075.
Claims priority of provisional application 63/268,539, filed on Feb. 25, 2022.
Prior Publication US 2023/0273934 A1, Aug. 31, 2023
Int. Cl. G06F 16/248 (2019.01); G06F 16/22 (2019.01)
CPC G06F 16/248 (2019.01) [G06F 16/2237 (2019.01)]
18 Claims
1. A method for determining which labels correspond to columns of a data file comprising columns and rows, the method comprising:
receiving the data file containing a plurality of fields organized as a table with a plurality of columns and a plurality of rows, the data file having inconsistent labeling for the columns;
training a model to detect a label based on data within the columns, wherein the model is trained using:
a number of Monte Carlo training sets having sample data files, and
rules for common types of demographic information;
for respective columns in the plurality of columns, selecting, from a plurality of consistent labels, a label corresponding to the respective columns of the data file using the model to detect the label based on data within the column;
for respective first and second columns in the plurality of columns, determining a column score indicating a likelihood that the first column has the label corresponding to the first column given that the second column has the label corresponding to the second column;
determining, based on the column score, a placement score indicating a likelihood that labels from the plurality of consistent labels corresponding to the respective columns are correct, wherein for all of the labels the determining the placement score further comprises:
determining a first frequency of a first label amongst the labels,
determining a second frequency of a second label amongst the labels,
determining a frequency score indicating a likelihood that the first label occurs at the first frequency given that the second label occurs at the second frequency, and
determining the placement score based on the frequency score;
adjusting the labels corresponding to each of the respective first and second columns based on the placement score;
repeating, until the placement score converges, steps of:
for the respective first and second columns in the plurality of columns, determining the column score indicating the likelihood that the first column has the label corresponding to the first column given that the second column has the label corresponding to the second column;
determining, based on the column score, the placement score indicating the likelihood that the labels corresponding to the respective columns are correct, wherein for all of the labels, the determining the placement score further comprises:
determining the first frequency of the first label amongst the labels,
determining the second frequency of the second label amongst the labels,
determining the frequency score indicating the likelihood that the first label occurs at the first frequency given that the second label occurs at the second frequency, and
determining the placement score based on the frequency score; and
adjusting the labels corresponding to each of the respective first and second columns based on the placement score; and
generating, by a column detector on a computing device, a reformatted data file based on the adjusted labels.
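
The iterative portion of claim 1 (column score, frequency score, placement score, adjustment, repetition until convergence) can be illustrated with a minimal sketch. It is not the patented implementation. Everything named here is hypothetical: the likelihood table `PAIR_LIKELIHOOD`, the rule inside `frequency_score`, the use of summed log-likelihoods as the placement score, and the per-column `candidates` lists, which stand in for the labels a trained model might consider plausible for each column. Model training and the reformatted output file are omitted.

```python
import math
from collections import Counter
from itertools import combinations, product

# Hypothetical likelihoods: an earlier column has label a given that a later
# column has label b. A real system would estimate these from training data.
PAIR_LIKELIHOOD = {
    ("first_name", "last_name"): 0.9,
    ("last_name", "first_name"): 0.4,
    ("first_name", "first_name"): 0.05,
    ("last_name", "last_name"): 0.05,
}
DEFAULT_PAIR = 0.5  # assumed value for label pairs missing from the table


def column_score(label_a, label_b):
    """Likelihood of label_a on one column given label_b on a later column."""
    return PAIR_LIKELIHOOD.get((label_a, label_b), DEFAULT_PAIR)


def frequency_score(freq_a, freq_b):
    """Assumed rule: two labels are likely to occur equally often in a file."""
    return 0.9 if freq_a == freq_b else 0.2


def placement_score(labels):
    """Sum of log column scores and log frequency scores for a labeling."""
    score = 0.0
    for i, j in combinations(range(len(labels)), 2):
        score += math.log(column_score(labels[i], labels[j]))
    counts = Counter(labels)
    for a, b in product(counts, repeat=2):
        if a != b:
            score += math.log(frequency_score(counts[a], counts[b]))
    return score


def adjust_labels(initial, candidates, tol=1e-9, max_sweeps=100):
    """Re-label pairs of columns until the placement score stops improving."""
    labels = list(initial)
    best = placement_score(labels)
    for _ in range(max_sweeps):
        previous = best
        for i, j in combinations(range(len(labels)), 2):
            # Try every candidate label combination for this pair of columns.
            for a, b in product(candidates[i], candidates[j]):
                trial = labels.copy()
                trial[i], trial[j] = a, b
                trial_score = placement_score(trial)
                if trial_score > best:
                    labels, best = trial, trial_score
        if best - previous <= tol:  # placement score has converged
            break
    return labels, best


# The initial per-column detection wrongly labels both name columns the same.
initial = ["first_name", "first_name", "zip"]
candidates = [["first_name", "last_name"], ["first_name", "last_name"], ["zip"]]
labels, score = adjust_labels(initial, candidates)
print(labels)  # ['first_name', 'last_name', 'zip']
```

The sketch adjusts the two columns of a pair jointly because the claim recites adjusting the labels of the respective first and second columns together. In this example, changing one column at a time would settle on the lower-scoring labeling `last_name`, `first_name`, `zip`.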