| CPC G06F 40/18 (2020.01) [G06F 40/186 (2020.01)] | 16 Claims |

|
1. A computer-implemented method for automatically extracting data from a spreadsheet that defines rows and columns and comprises a plurality of cells that are delineated by the rows and the columns, the method comprising:
obtaining the spreadsheet, wherein the spreadsheet includes data that is stored in a set of rows and a set of columns of the spreadsheet;
receiving a contiguous selection of cells of the spreadsheet, wherein the contiguous selection of cells spans a first set of rows and a first set of columns, and wherein the first set of rows is a subset of the set of rows and the first set of columns is a subset of the set of columns;
for each column in the first set of columns:
identifying characteristics of data included in each cell of the column;
determining a template type of the column based on the characteristics of the data in each selected cell of the column, wherein the template type includes a categorical template or a detailed record template, and wherein (1) a categorical template specifies that data stored in the column includes categorical data that is associated with a plurality of rows of data in an extracted dataset or (2) a detailed record template specifies that data stored in the column includes detailed data that is associated with a single row of data in the extracted dataset; and
determining, from among a plurality of cells of the column and based on characteristics of the data included in the plurality of cells of the column, a representative cell that is representative of the determined template type of the column;
selecting, from among the first set of columns, a second set of columns that includes each column that is determined to be a categorical template column and a third set of columns that includes one or more columns that are determined to be detailed record template columns,
wherein identifying the third set of columns that includes one or more columns that are determined to be detailed record template columns comprises:
determining a candidacy fitness score for each column in the first set of columns, wherein the candidacy fitness score for a particular column specifies a likelihood of the particular column being suitable for data extraction; and
identifying, from among the first set of columns, the one or more columns based on the candidacy fitness score for each of the one or more columns being higher relative to the candidacy fitness score for each of a remaining number of columns in the first set of columns;
identifying, based on the representative cells in each of the first set of columns, a single row in the contiguous selection, wherein each of a plurality of cells in the single row includes data in a format and a structure that is representative of a format and a structure of data stored in a corresponding column for the cell;
generating, for each column in the third set of columns corresponding to the single row, a set of rules that define data extraction locations in the column;
generating, based on the single row, the second set of columns, the third set of columns, and the set of rules for each of the third set of columns, an extracted dataset; and
generating a graphical user interface providing the extracted dataset for display on a computing device, wherein the graphical user interface comprises:
a first user interface element graphically representing the spreadsheet,
a second user interface element graphically representing the extracted dataset as a table, and
a third user interface element comprising one or more controls for modifying one or more inferences regarding the data in the spreadsheet used to generate the extracted dataset,
wherein the first user interface element, the second user interface element, and the third user interface element are displayed concurrently in the graphical user interface, and
wherein generating the graphical user interface comprises modifying the extracted dataset and the second user interface element based on one or more user inputs with respect to the third user interface element.
|
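The column-classification and extraction steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the claimed implementation: the claim does not state how characteristics are identified, how the candidacy fitness score is computed, or what the extraction rules look like. The sketch assumes that a cell's characteristic is simply whether it is empty, numeric, or text, that the fitness score is the fraction of non-empty cells in a column, and that a column scoring at least 0.5 is treated as a detailed record column. The names `cell_kind`, `fitness_score`, `template_type`, `representative_row`, and `extract` are hypothetical, and the graphical user interface steps are omitted.

```python
from collections import Counter


def cell_kind(value):
    """Coarse cell characteristic (assumed): 'empty', 'number', or 'text'."""
    if value is None or str(value).strip() == "":
        return "empty"
    try:
        float(str(value).replace(",", ""))
        return "number"
    except ValueError:
        return "text"


def fitness_score(column):
    """Hypothetical candidacy fitness score: fraction of non-empty cells."""
    if not column:
        return 0.0
    return sum(cell_kind(v) != "empty" for v in column) / len(column)


def template_type(column, threshold=0.5):
    """Assumed rule: dense columns are 'detailed', sparse ones 'categorical'."""
    return "detailed" if fitness_score(column) >= threshold else "categorical"


def representative_row(rows):
    """Index of the first row whose cells best match each column's dominant kind."""
    dominant = []
    for col in zip(*rows):
        kinds = Counter(k for k in map(cell_kind, col) if k != "empty")
        dominant.append(kinds.most_common(1)[0][0] if kinds else "empty")

    def matches(i):
        return sum(cell_kind(v) == d for v, d in zip(rows[i], dominant))

    return max(range(len(rows)), key=matches)


def extract(rows):
    """Emit one record per row that has detailed data.

    Categorical values are carried forward so that each one is associated
    with every detailed row beneath it.
    """
    types = [template_type(list(col)) for col in zip(*rows)]
    current = [None] * len(types)
    dataset = []
    for row in rows:
        for j, value in enumerate(row):
            if types[j] == "categorical" and cell_kind(value) != "empty":
                current[j] = value
        detailed = [row[j] for j, t in enumerate(types) if t == "detailed"]
        if detailed and all(cell_kind(v) != "empty" for v in detailed):
            dataset.append(
                [current[j] if types[j] == "categorical" else row[j]
                 for j in range(len(types))]
            )
    return dataset


# Illustrative contiguous selection: a region label followed by detail rows.
selection = [
    ["East", "", ""],
    ["", "Alice", "120"],
    ["", "Bob", "95"],
    ["West", "", ""],
    ["", "Carol", "210"],
    ["", "Dan", "80"],
]
extracted = extract(selection)
```

Under these assumptions the first column of `selection` is classified as categorical and the other two as detailed, so each region label is repeated on every extracted record below it. A fixed threshold stands in for the claim's relative comparison of fitness scores across columns; a ranking-based selection would be an equally plausible reading.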