CPC G06F 40/279 (2020.01) [G06F 40/40 (2020.01); G06N 20/00 (2019.01); G06F 40/30 (2020.01)]; 20 Claims
1. A computer-implemented method for generating a cross-document similarity measure for an unlabeled document data object and a label document data object, the computer-implemented method comprising:
generating, by one or more processors and an attention-based text encoder machine learning model and based at least in part on the unlabeled document data object, an unlabeled document word-wise embedded representation set for the unlabeled document data object, wherein: (i) the attention-based text encoder machine learning model is configured to generate, for each input document data object, a respective document-wide embedded representation and a respective word-wise embedded representation set, and (ii) generating the attention-based text encoder machine learning model comprises:
identifying a group of training input document data objects comprising one or more training unlabeled document data objects and a plurality of label document data objects,
identifying one or more training document pairs, wherein each training document pair of the one or more training document pairs comprises (1) a respective training unlabeled document data object of the one or more training unlabeled document data objects and (2) a respective label document data object of the plurality of label document data objects that is related to the respective training unlabeled document data object based at least in part on ground-truth cross-document relationships,
generating, using a language modeling loss model that is defined in accordance with a language modeling training task and based at least in part on the group of training input document data objects, one or more initially-optimized parameter values for one or more trainable parameters of the attention-based text encoder machine learning model,
for each training document pair of the one or more training document pairs, generating a respective cross-document distance measure based at least in part on the respective document-wide embedded representations for the respective training unlabeled document data object and the respective label document data object,
generating a similarity determination loss model based at least in part on each generated cross-document distance measure, and
generating, using a sequential learning loss model that is determined based at least in part on adjusting the similarity determination loss model in accordance with a sequential learning regularization factor that describes computed effects of potential updates to the one or more initially-optimized parameter values on the language modeling loss model, one or more subsequently-optimized parameter values for the one or more trainable parameters of the attention-based text encoder machine learning model;
generating, by the one or more processors and the attention-based text encoder machine learning model and based at least in part on the label document data object, a label document word-wise embedded representation set for the label document data object;
generating, by the one or more processors and based at least in part on the unlabeled document word-wise embedded representation set and the label document word-wise embedded representation set, the cross-document similarity measure; and
performing, by the one or more processors, one or more prediction-based actions based at least in part on the cross-document similarity measure.
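The sequential learning regularization factor recited above resembles an elastic-weight-consolidation-style penalty: candidate updates to the initially-optimized parameter values are weighted by their computed effect on the language modeling loss model. The following is a minimal sketch, under stated assumptions, of how the two-stage training (language-model pretraining followed by similarity fine-tuning) could be realized; it assumes a PyTorch encoder that returns a document-wide embedding and a word-wise embedding set, and every function and variable name here is illustrative rather than recited by the claim.

```python
# Hypothetical sketch of the claimed two-stage training; not the patented implementation.
import torch

def fisher_diagonal(encoder, lm_loss_fn, pretrain_batches):
    """Estimate, per parameter, how sensitive the language modeling loss is to
    parameter updates (diagonal Fisher approximation of the regularization weights)."""
    fisher = {n: torch.zeros_like(p) for n, p in encoder.named_parameters()}
    for batch in pretrain_batches:
        encoder.zero_grad()
        lm_loss_fn(encoder, batch).backward()       # language modeling loss on pretraining data
        for n, p in encoder.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / max(len(pretrain_batches), 1) for n, f in fisher.items()}

def fine_tune(encoder, pairs, fisher, theta_star, lam=1.0, lr=1e-5):
    """Similarity fine-tuning: the similarity determination loss is adjusted by a
    sequential learning regularization factor that penalizes drift of each parameter
    from its initially-optimized value, weighted by its estimated effect on the LM loss.
    theta_star holds copies of the initially-optimized parameter values, e.g.
    theta_star = {n: p.detach().clone() for n, p in encoder.named_parameters()}."""
    opt = torch.optim.Adam(encoder.parameters(), lr=lr)
    for unlabeled_doc, label_doc in pairs:          # ground-truth related pairs
        u_doc_emb, _ = encoder(unlabeled_doc)       # (document-wide, word-wise) representations
        l_doc_emb, _ = encoder(label_doc)
        # cross-document distance between document-wide embedded representations
        distance = 1.0 - torch.nn.functional.cosine_similarity(u_doc_emb, l_doc_emb, dim=-1).mean()
        sim_loss = distance                          # similarity determination loss: pull related pairs together
        reg = sum(((p - theta_star[n]) ** 2 * fisher[n]).sum()
                  for n, p in encoder.named_parameters())
        loss = sim_loss + lam * reg                  # sequential learning loss
        opt.zero_grad(); loss.backward(); opt.step()
```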
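The final claimed step derives the cross-document similarity measure from the two word-wise embedded representation sets rather than from the document-wide embeddings. The claim does not fix the aggregation; a common choice, shown below purely as an assumed illustration, is to match each word of the unlabeled document to its most similar word in the label document and average those maxima.

```python
# Illustrative word-wise cross-document similarity; the max-over-words / mean
# aggregation is an assumption, not something the claim itself recites.
import numpy as np

def cross_document_similarity(unlabeled_words: np.ndarray, label_words: np.ndarray) -> float:
    """unlabeled_words: (n, d) word-wise embeddings of the unlabeled document.
    label_words: (m, d) word-wise embeddings of the label document."""
    u = unlabeled_words / np.linalg.norm(unlabeled_words, axis=1, keepdims=True)
    v = label_words / np.linalg.norm(label_words, axis=1, keepdims=True)
    pairwise = u @ v.T                      # (n, m) cosine similarity of every word pair
    # for each word of the unlabeled document, keep its best-matching label-document word
    return float(pairwise.max(axis=1).mean())
```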