US 12,223,278 B2
Automatic data card generation
Hans-Martin Ramsl, Mannheim (DE)
Assigned to SAP SE, Walldorf (DE)
Filed by SAP SE, Walldorf (DE)
Filed on Jul. 8, 2022, as Appl. No. 17/860,912.
Prior Publication US 2024/0013004 A1, Jan. 11, 2024
Int. Cl. G06F 40/30 (2020.01); G06F 16/2458 (2019.01); G06F 16/31 (2019.01)
CPC G06F 40/30 (2020.01) [G06F 16/2458 (2019.01); G06F 16/31 (2019.01)]
17 Claims
 
1. A method comprising:
analyzing, by one or more processors, a plurality of files of a dataset to determine a primary data type of the dataset;
determining, by the one or more processors, one or more statistical values for the dataset;
determining, by the one or more processors, a count of entries in the dataset;
accessing, by the one or more processors, a data value representing a stored count of the entries;
based on comparing, by the one or more processors, the determined count of the entries with the data value, storing the data value in a data structure;
determining, by the one or more processors, a percentage of the dataset that comprises text in each of a plurality of languages, by performing operations comprising:
splitting each of the plurality of files of the dataset into lines;
for each line, determining a language; and
determining a percentage of lines in each of the plurality of languages;
based on the determined percentages, determining a primary language of the dataset;
storing, in the data structure, an indication of the primary data type and at least a subset of the one or more statistical values and an indication of the primary language;
determining, by the one or more processors, a first subset of the dataset to be used for training;
determining, by the one or more processors, a second subset of the dataset to be used for validation;
storing, in the data structure, an indication of the first subset and an indication of the second subset;
iteratively generating additional data structures for a set of additional datasets;
receiving, via a user interface, user input to select a set of search criteria, wherein the set of search criteria comprises the primary data type, the count of entries, and the primary language;
based on the set of search criteria and the data structure, displaying information for the dataset in the user interface, the displayed information including the primary data type, the count of entries, and the primary language; and
based on a selection of the dataset by the user, training a neural network using the first subset of the dataset.
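
For illustration only, the language-percentage, count-comparison, and subset steps recited in claim 1 might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the stopword-based `detect_language` helper, the `STOPWORDS` table, the `build_data_card` function, the dictionary layout of the data structure, and the 80/20 split are all hypothetical. The sketch also assumes that each non-blank line is one entry and that the stored count is kept only when it matches the determined count; the claim itself says only that the storing is "based on comparing".

```python
from collections import Counter

# Hypothetical stopword lists standing in for a real language detector.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to"},
    "de": {"der", "die", "und", "ist", "das"},
}


def detect_language(line):
    """Toy detector: pick the language whose stopwords occur most often."""
    words = line.lower().split()
    scores = {lang: sum(w in vocab for w in words)
              for lang, vocab in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def language_percentages(files):
    """Split each file into lines, detect a language per line,
    and return the percentage of lines in each language."""
    counts = Counter()
    for text in files:
        for line in text.splitlines():
            if line.strip():  # assumption: blank lines are ignored
                counts[detect_language(line)] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {lang: 100.0 * n / total for lang, n in counts.items()}


def build_data_card(files, primary_data_type, stored_count, train_fraction=0.8):
    """Assemble a dictionary playing the role of the claimed data structure."""
    # Assumption: one non-blank line is one entry of the dataset.
    lines = [ln for text in files for ln in text.splitlines() if ln.strip()]
    card = {"primary_data_type": primary_data_type}

    # Assumption: keep the stored count only if it agrees with the
    # count determined from the files.
    if len(lines) == stored_count:
        card["count_of_entries"] = stored_count

    percentages = language_percentages(files)
    card["language_percentages"] = percentages
    card["primary_language"] = (
        max(percentages, key=percentages.get) if percentages else None
    )

    # Record the training and validation subsets as half-open index ranges.
    split = int(len(lines) * train_fraction)
    card["train_subset"] = (0, split)
    card["validation_subset"] = (split, len(lines))
    return card
```

Calling `build_data_card(files, "text", stored_count)` on a list of file contents yields one such dictionary per dataset. Repeating the call over additional datasets and filtering the resulting dictionaries by `primary_data_type`, `count_of_entries`, and `primary_language` would correspond to the iterative generation and search steps of the claim.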