US 12,437,240 B2
System and method for artificial intelligence driven document analysis, including automated reuse of predictive coding rules based on management and curation of datasets or models
Alan Justin Lockett, Georgetown, TX (US); Verlyn Michael Fischer, Cedar Park, TX (US); Richard Alan Vestal, Jonestown, TX (US); Jesse Abraham Ramos, Austin, TX (US); Robert Duane Harrington, Austin, TX (US); and Brian Daniel Luskey, Austin, TX (US)
Assigned to CS Disco, Inc., Austin, TX (US)
Filed by CS Disco, Inc., Austin, TX (US)
Filed on Jul. 7, 2022, as Appl. No. 17/859,886.
Application 17/859,886 is a continuation of application No. 16/881,274, filed on May 22, 2020, granted, now 11,416,685.
Claims priority of provisional application 62/968,659, filed on Jan. 31, 2020.
Prior Publication US 2023/0004873 A1, Jan. 5, 2023
This patent is subject to a terminal disclaimer.
Int. Cl. G06F 17/00 (2019.01); G06F 40/30 (2020.01); G06N 5/04 (2023.01); G06N 20/20 (2019.01)
CPC G06N 20/20 (2019.01) [G06F 40/30 (2020.01); G06N 5/04 (2013.01)]
21 Claims
1. A system for document analysis comprising:
a processor;
a data store, comprising a first corpus of electronic documents and a second corpus of electronic documents; and
a non-transitory computer readable medium comprising instructions for:
receiving an indication that a first code is to be boosted with a second code, wherein the first code is associated with first documents from the first corpus and the second code is associated with a boosting dataset comprising positive signals or negative signals from the second corpus, each positive or negative signal associated with second documents of the second corpus, wherein each positive signal indicates an associated second document of the second corpus belongs to the second code and each negative signal indicates the associated second document of the second corpus does not belong to the second code; and
training a boosting machine learning model adapted to:
generate a probability score for each document of the first corpus, wherein the probability score indicates a likelihood that the document belongs to the first code, based on the boosting dataset including the boosting dataset comprising positive signals or negative signals from the second corpus, including training the boosting machine learning model based on each positive signal or negative signal from the second corpus, such that the boosting machine learning model is trained to generate predictive scores for the first code based on the positive or negative signals from the second corpus associated with the second documents of the second corpus associated with the boosting dataset, wherein the training of the boosting machine learning model further comprises balancing the boosting machine learning model by:
dividing the boosting dataset into a first group comprising the positive signals and a second group comprising the negative signals;
alternately selecting a positive signal from the first group and a negative signal from the second group to form pairs for the training of the boosting machine learning model, until all instances of signals in the first group or the second group have been selected; and
when either the positive or negative group has remaining instances after all instances in the other group have been selected, selecting the remaining instances from the other group until all instances have been selected; and
displaying the probability score to a user.
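
The balancing steps recited in claim 1 (dividing the boosting dataset into positive and negative groups, alternately selecting from each, then selecting whatever remains) can be illustrated with a minimal Python sketch. This is one possible reading of the claim language, not the patentee's implementation. The `Signal` tuple layout, the function name `balanced_order`, and the use of 1 and 0 as labels are assumptions introduced only for illustration.

```python
from typing import List, Tuple

# Hypothetical representation: (document id, label), where 1 marks a
# positive signal and 0 marks a negative signal for the second code.
Signal = Tuple[str, int]


def balanced_order(signals: List[Signal]) -> List[Signal]:
    """Order signals so positives and negatives alternate as pairs."""
    # Divide the boosting dataset into a positive group and a negative group.
    positives = [s for s in signals if s[1] == 1]
    negatives = [s for s in signals if s[1] == 0]

    ordered: List[Signal] = []

    # Alternately select one positive and one negative until the
    # smaller group is exhausted.
    shared = min(len(positives), len(negatives))
    for i in range(shared):
        ordered.append(positives[i])
        ordered.append(negatives[i])

    # Select the remaining instances from whichever group is larger.
    # At most one of these two slices is non-empty.
    ordered.extend(positives[shared:])
    ordered.extend(negatives[shared:])
    return ordered


example = [("a", 1), ("b", 1), ("c", 1), ("d", 0)]
print(balanced_order(example))
```

With three positive signals and one negative signal, the sketch emits one positive/negative pair first and then the two leftover positives, so every signal is selected exactly once.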