| CPC G06F 40/279 (2020.01) [G06F 16/90344 (2019.01); G06F 16/93 (2019.01); G06N 5/022 (2013.01)] | 13 Claims |

|
1. A computer-implemented method implemented with one or more processors for detecting machine-generated documents for identifying false information in a collection of documents including both machine-generated and human-authored documents as well as repeated substrings, comprising:
(a) for the collection of documents, computing with the one or more processors a set of repeated substrings with each repeated substring having at least a selected length and where the set of repeated substrings comprises super-maximal repeats with each super-maximal repeat comprising a repeated substring that does not occur in any other repeated substring within the collection of documents;
(b) using a subset of the set of repeated substrings to designate documents containing the subset of the repeated substrings as machine-generated documents, the documents designated as machine-generated comprising positive examples of machine-generated documents;
(c) developing with the one or more processors negative examples of machine-generated documents, the negative examples of machine-generated documents including at least one human-authored document;
(d) creating with the one or more processors a dataset comprising both the positive and negative examples of machine-generated documents;
(e) training with the one or more processors a plurality of automatic binary classifiers by feeding the dataset as input to the plurality of automatic binary classifiers, the plurality of automatic binary classifiers outputting predictions responsive to said feeding that vary as a function of an extent to which the documents of the dataset contain machine-generated text;
(f) summing with the one or more processors outputted predictions of each document to obtain an amplified signal for each document fed to the plurality of automatic binary classifiers; and
(g) identifying false information based on machine-generated documents detected using the amplified signal for each document fed to the plurality of automatic binary classifiers;
wherein conditioned on the predictions outputted at (e): repeating (d) to create an additional dataset after repeating (b)-(d); repeating (e) to retrain the plurality of automatic binary classifiers using the additional dataset; repeating (f) using the plurality of automatic binary classifiers retrained using the additional dataset; and repeating (g) using the amplified signal for each document fed to the plurality of automatic binary classifiers retrained using the additional dataset.
|
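
Steps (a), (b), and (f) of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names (`repeated_substrings`, `super_maximal_repeats`, `designate_positive`, `amplified_signal`), the toy documents, and the prediction scores are all hypothetical, and the brute-force substring enumeration stands in for whatever index structure a practical system would use. The sketch also assumes that "repeated" means occurring at least twice anywhere in the collection.

```python
from collections import Counter


def repeated_substrings(docs, min_len):
    """Step (a), first half: substrings of length >= min_len that occur
    at least twice across the collection (brute force, for illustration)."""
    counts = Counter()
    for doc in docs:
        n = len(doc)
        for i in range(n):
            for j in range(i + min_len, n + 1):
                counts[doc[i:j]] += 1
    return {s for s, c in counts.items() if c >= 2}


def super_maximal_repeats(docs, min_len):
    """Step (a), second half: keep only repeats that are not contained
    in any other repeated substring."""
    repeats = repeated_substrings(docs, min_len)
    return {s for s in repeats if not any(s != t and s in t for t in repeats)}


def designate_positive(docs, repeats):
    """Step (b): flag a document as machine-generated (a positive example)
    if it contains any of the selected repeats."""
    return [any(r in doc for r in repeats) for doc in docs]


def amplified_signal(predictions_per_classifier):
    """Step (f): sum the predictions of all classifiers for each document.
    Input is one list of per-document scores per classifier."""
    return [sum(scores) for scores in zip(*predictions_per_classifier)]


# Hypothetical toy collection: two templated documents and one unrelated one.
docs = [
    "buy cheap pills now",
    "buy cheap pills today",
    "a quiet walk home",
]

repeats = super_maximal_repeats(docs, min_len=10)
labels = designate_positive(docs, repeats)

# Hypothetical scores from two already-trained binary classifiers,
# one score per document.
signal = amplified_signal([
    [0.75, 0.5, 0.25],
    [1.0, 0.75, 0.0],
])

print(repeats)  # {'buy cheap pills '}
print(labels)   # [True, True, False]
print(signal)   # [1.75, 1.25, 0.25]
```

Enumerating every substring is quadratic in document length, so the sketch is only suitable for small inputs. Classifier training in step (e) and the conditional repetition in the final clause are omitted because the claim does not fix a classifier type or a stopping condition.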