| CPC G06N 20/20 (2019.01) [G06F 40/30 (2020.01); G06N 5/04 (2013.01)] | 21 Claims |

|
1. A system for document analysis comprising:
a processor;
a data store, comprising a first corpus of electronic documents and a second corpus of electronic documents; and
a non-transitory computer readable medium comprising instructions for:
receiving an indication that a first code is to be boosted with a second code, wherein the first code is associated with first documents from the first corpus and the second code is associated with a boosting dataset comprising positive signals or negative signals from the second corpus, each positive or negative signal associated with second documents of the second corpus, wherein each positive signal indicates an associated second document of the second corpus belongs to the second code and each negative signal indicates the associated second document of the second corpus does not belong to the second code; and
training a boosting machine learning model adapted to:
generate a probability score for each document of the first corpus, wherein the probability score indicates a likelihood that the document belongs to the first code, based on the boosting dataset comprising positive signals or negative signals from the second corpus, including training the boosting machine learning model based on each positive signal or negative signal from the second corpus, such that the boosting machine learning model is trained to generate predictive scores for the first code based on the positive or negative signals from the second corpus associated with the second documents of the second corpus associated with the boosting dataset, wherein the training of the boosting machine learning model further comprises balancing the boosting machine learning model by:
dividing the boosting dataset into a first group comprising the positive signals and a second group comprising the negative signals;
alternately selecting a positive signal from the first group and a negative signal from the second group to form pairs for the training of the boosting machine learning model, until all instances of signals in the first group or the second group have been selected; and
when either the positive or negative group has remaining instances after all instances in the other group have been selected, selecting the remaining instances from the group that still has instances until all instances have been selected; and
displaying the probability score to a user.
|
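
The balancing steps recited in claim 1 (dividing the boosting dataset into positive and negative groups, alternately selecting one signal from each, then taking whatever remains in the larger group) can be illustrated with a minimal Python sketch. This is one possible reading of the claim language, not the patented implementation. The function name `balance_signals` and the representation of a signal as a `(document_id, is_positive)` tuple are assumptions made purely for illustration.

```python
from typing import List, Tuple

# Hypothetical representation: (document_id, is_positive).
Signal = Tuple[str, bool]


def balance_signals(signals: List[Signal]) -> List[Signal]:
    """Order signals so positives and negatives alternate, then append leftovers."""
    # Step 1: divide the boosting dataset into a positive and a negative group.
    positives = [s for s in signals if s[1]]
    negatives = [s for s in signals if not s[1]]

    # Step 2: alternately select a positive and a negative signal to form
    # pairs, stopping as soon as either group has been fully used.
    ordered: List[Signal] = []
    for pos, neg in zip(positives, negatives):
        ordered.append(pos)
        ordered.append(neg)

    # Step 3: whichever group still has instances contributes the rest.
    paired = min(len(positives), len(negatives))
    ordered.extend(positives[paired:])
    ordered.extend(negatives[paired:])
    return ordered


# Illustrative data only: three positive signals and one negative signal.
example = [("d1", True), ("d2", True), ("d3", False), ("d4", True)]
result = balance_signals(example)
print(result)
# [('d1', True), ('d3', False), ('d2', True), ('d4', True)]
```

In this sketch the single negative signal is paired with the first positive signal, and the two remaining positive signals follow in their original order. How the resulting sequence is fed to a model during training is left open here, since the claim does not specify it.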