| CPC G06F 9/5066 (2013.01) [G06F 9/5038 (2013.01); G06F 9/505 (2013.01)] | 20 Claims |

|
1. A computer-implemented method comprising:
accessing a set of files in a semi-structured or unstructured format across a plurality of distributed computer systems, wherein:
the files are sorted according to dates corresponding to the files,
each date indicates a date the respective file was created,
the files are grouped into chunks according to the dates and a given date range size,
respective expected amounts of time required to process the files are based on using a machine learning model trained on features of historical training data,
the features of the historical training data include observed amounts of processing time for historical files in the historical training data and at least one of: file sizes of the historical files, file names of the historical files, file types of the historical files, times at which processing of the historical files occurred, or a user identification of a user associated with the times, and
the files within the chunks are further sorted according to the respective expected amounts of time;
processing the sorted files within the chunks by processing units of a processing subsystem based on an iterative allocation of the files according to the dates and the respective expected amounts of time across the processing units to a processing unit having a lowest current computing load, wherein the computing load of the processing unit having the lowest current computing load is updated according to the respective expected amount of time to process the respective file and a processing ability of the processing unit; and
storing results of the processing within a structured database, wherein the results are available to consumers.
|
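
For illustration only, the chunking and allocation steps recited in claim 1 can be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the names `FileInfo`, `chunk_by_date`, and `allocate` are hypothetical, the expected processing times are taken as given inputs (standing in for the output of the trained machine learning model), files within a chunk are sorted longest-first (the claim does not fix a direction), and the load update is assumed to be the expected time divided by a numeric processing ability.

```python
import heapq
from dataclasses import dataclass
from datetime import date


@dataclass
class FileInfo:
    name: str
    created: date            # date the file was created
    expected_seconds: float  # assumed to be predicted by a trained model


def chunk_by_date(files, range_days):
    """Sort files by creation date, group them into chunks that each span
    `range_days` days, then sort each chunk by expected processing time."""
    ordered = sorted(files, key=lambda f: f.created)
    if not ordered:
        return []
    start = ordered[0].created
    chunks = {}
    for f in ordered:
        index = (f.created - start).days // range_days
        chunks.setdefault(index, []).append(f)
    # Assumption: longest expected time first within each chunk.
    return [
        sorted(chunks[i], key=lambda f: f.expected_seconds, reverse=True)
        for i in sorted(chunks)
    ]


def allocate(chunks, abilities):
    """Iteratively give each file to the unit with the lowest current load.

    `abilities[u]` is a hypothetical relative speed of unit `u`; the load of
    the chosen unit grows by expected_seconds / abilities[u]."""
    heap = [(0.0, u) for u in range(len(abilities))]
    heapq.heapify(heap)
    assignment = {u: [] for u in range(len(abilities))}
    for chunk in chunks:          # chunks are visited in date order
        for f in chunk:           # files are visited in expected-time order
            load, u = heapq.heappop(heap)
            assignment[u].append(f.name)
            heapq.heappush(heap, (load + f.expected_seconds / abilities[u], u))
    return assignment


files = [
    FileInfo("a.log", date(2024, 1, 1), 10.0),
    FileInfo("b.log", date(2024, 1, 2), 30.0),
    FileInfo("c.log", date(2024, 1, 5), 20.0),
    FileInfo("d.log", date(2024, 1, 9), 5.0),
]
plan = allocate(chunk_by_date(files, range_days=7), abilities=[1.0, 2.0])
```

A min-heap keyed on current load keeps the choice of the least-loaded unit at logarithmic cost per file. Ties are broken by unit index here, which is a design choice of the sketch rather than something the claim specifies.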