| CPC G06N 20/00 (2019.01) [G06F 16/211 (2019.01)] | 17 Claims |

|
1. A method, comprising:
generating a plurality of schemas based on a plurality of sample data payloads, wherein the plurality of sample data payloads have different data formats;
identifying common elements in the plurality of schemas;
creating a new schema based on the common elements of the plurality of schemas;
determining a coverage of the new schema with respect to the plurality of schemas by ascertaining percentages of each of the generated plurality of schemas associated with the sample data payloads that are covered by the identified common elements;
storing the new schema responsive to determining that the coverage is not less than a first specified threshold or repeating the generating the plurality of schemas, the identifying the common elements in the plurality of schemas, and the creating the new schema based on the common elements of the plurality of schemas responsive to determining that the coverage is less than the first specified threshold;
receiving a plurality of data payloads;
converting the plurality of data payloads to a format that conforms to the new schema;
generating feature vectors for the plurality of data payloads using the new schema;
determining a suitability of the plurality of data payloads for training a machine learning model based on the feature vectors; and
training the machine learning model using the converted plurality of data payloads responsive to determining that the suitability is greater than a second specified threshold.
|
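The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the claimed implementation, and everything the claim leaves open is an assumption made only for illustration:

- Payloads are already parsed into flat dictionaries.
- A "schema" is a mapping from field name to type name.
- Coverage is aggregated by taking the lowest per-schema percentage.
- A feature vector marks which schema fields are populated with the expected type.
- Suitability is the mean of those marks.

All function names (`infer_schema`, `common_elements`, `coverage_percentages`, `build_schema`, `convert`, `feature_vector`, `suitability`, `should_train`) are hypothetical. The "repeating" branch is left to the caller, since the claim does not state what changes between iterations, and the training step is reduced to the gating decision.

```python
from typing import Any, Optional

Schema = dict[str, str]  # assumed representation: field name -> type name


def infer_schema(payload: dict[str, Any]) -> Schema:
    """Generate a schema from one sample payload (assumed already parsed)."""
    return {key: type(value).__name__ for key, value in payload.items()}


def common_elements(schemas: list[Schema]) -> Schema:
    """Keep the fields that appear with the same type in every schema."""
    first, *rest = schemas
    return {k: t for k, t in first.items() if all(s.get(k) == t for s in rest)}


def coverage_percentages(new_schema: Schema, schemas: list[Schema]) -> list[float]:
    """Percentage of each generated schema covered by the common elements."""
    return [100.0 * len(new_schema) / len(s) if s else 100.0 for s in schemas]


def build_schema(samples: list[dict[str, Any]], coverage_threshold: float) -> Optional[Schema]:
    """Return the new schema if coverage is not less than the threshold.

    Returns None otherwise, so the caller can repeat the generation steps.
    Using the minimum per-schema percentage is an assumption.
    """
    schemas = [infer_schema(p) for p in samples]
    new_schema = common_elements(schemas)
    if min(coverage_percentages(new_schema, schemas)) >= coverage_threshold:
        return new_schema
    return None


def convert(payload: dict[str, Any], schema: Schema) -> dict[str, Any]:
    """Project a payload onto the schema fields; missing fields become None."""
    return {k: payload.get(k) for k in schema}


def feature_vector(converted: dict[str, Any], schema: Schema) -> list[float]:
    """1.0 where a field holds a value of the expected type, else 0.0 (assumed features)."""
    return [1.0 if type(converted[k]).__name__ == t else 0.0 for k, t in schema.items()]


def suitability(vectors: list[list[float]]) -> float:
    """Mean of all feature values across payloads (assumed suitability score)."""
    cells = [x for v in vectors for x in v]
    return sum(cells) / len(cells) if cells else 0.0


def should_train(payloads: list[dict[str, Any]], schema: Schema, suitability_threshold: float) -> bool:
    """Gate training on suitability being strictly greater than the threshold."""
    vectors = [feature_vector(convert(p, schema), schema) for p in payloads]
    return suitability(vectors) > suitability_threshold


# Usage with made-up sample data.
samples = [{"id": 1, "name": "a", "ts": 1.5}, {"id": 2, "name": "b", "tags": ["x"]}]
schema = build_schema(samples, coverage_threshold=60.0)
payloads = [{"id": 3, "name": "c", "extra": True}, {"id": 4}]
train = schema is not None and should_train(payloads, schema, suitability_threshold=0.5)
```

In this sketch the two comparisons follow the wording of the claim: the schema is stored when coverage is "not less than" the first threshold (`>=`), and training proceeds only when suitability is "greater than" the second threshold (`>`).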