| CPC G06F 40/279 (2020.01) [G06F 40/40 (2020.01); G06N 20/00 (2019.01); G06F 40/30 (2020.01)] | 20 Claims |

|
1. A computer-implemented method comprising:
generating, by one or more processors and using an attention-based text encoder machine learning model and based at least in part on an unlabeled document data object, an unlabeled document word-wise embedded representation set that comprises a plurality of unlabeled document word-wise embedded representations, each of the plurality of unlabeled document word-wise embedded representations associated with a respective unlabeled document word of the unlabeled document data object, wherein the attention-based text encoder machine learning model comprises a tunable parameter value that is generated based at least in part on a concurrent learning loss by:
identifying one or more training input document data objects comprising one or more training unlabeled document data objects and a plurality of labeled document data objects,
identifying a training document pair, wherein the training document pair comprises (i) a training unlabeled document data object of the one or more training unlabeled document data objects and (ii) a labeled document data object of the plurality of labeled document data objects that is related to the training unlabeled document data object based at least in part on ground-truth cross-document relationships,
generating a cross-document distance measure for the training document pair,
generating the concurrent learning loss based at least in part on: (i) a language modeling loss model that is based at least in part on the one or more training input document data objects, and (ii) a similarity determination loss model that is determined based at least in part on the cross-document distance measure, and
generating, using the concurrent learning loss, the tunable parameter value;
generating, by the one or more processors and using the attention-based text encoder machine learning model and based at least in part on a labeled document data object of the plurality of the labeled document data objects, a labeled document word-wise embedded representation set that comprises a plurality of labeled document word-wise embedded representations, each of the plurality of labeled document word-wise embedded representations associated with a respective labeled document word of the labeled document data object;
generating, by the one or more processors and based at least in part on the plurality of unlabeled document word-wise embedded representations and the plurality of labeled document word-wise embedded representations, a cross-document similarity measure comparing the unlabeled document data object and the labeled document data object; and
generating, by the one or more processors and for the unlabeled document data object, a document classification based at least in part on the cross-document similarity measure.
|
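The data flow recited in claim 1 can be illustrated with a minimal sketch. It is not the claimed implementation: the claim does not specify how the two losses are combined or how the word-wise embedded representations are compared. The weighted sum, the cosine-based best-match average, and every name below (`concurrent_learning_loss`, `similarity_weight`, `cross_document_similarity`, `classify`, the toy labels) are therefore assumptions made for illustration. The attention-based text encoder is not modeled; its word-wise outputs are stood in for by small hand-written vectors.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two word embeddings; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def concurrent_learning_loss(language_modeling_loss: float,
                             cross_document_distance: float,
                             similarity_weight: float = 1.0) -> float:
    """Combine the two training losses.

    Assumption: a weighted sum, with the distance of a related training
    document pair used directly as the similarity determination loss.
    """
    return language_modeling_loss + similarity_weight * cross_document_distance


def cross_document_similarity(unlabeled_words: List[Vector],
                              labeled_words: List[Vector]) -> float:
    """Compare two documents through their word-wise embeddings.

    Assumption: each unlabeled word is matched to its most similar labeled
    word, and the best-match scores are averaged.
    """
    if not unlabeled_words or not labeled_words:
        return 0.0
    best_matches = [max(cosine(u, l) for l in labeled_words)
                    for u in unlabeled_words]
    return sum(best_matches) / len(best_matches)


def classify(unlabeled_words: List[Vector],
             labeled_documents: Dict[str, List[Vector]]) -> str:
    """Return the label of the labeled document most similar to the input."""
    return max(
        labeled_documents,
        key=lambda label: cross_document_similarity(
            unlabeled_words, labeled_documents[label]),
    )


# Toy word-wise embeddings standing in for encoder outputs (hypothetical data).
unlabeled_document = [[1.0, 0.0], [0.9, 0.1]]
labeled_documents = {
    "billing": [[1.0, 0.0]],
    "clinical": [[0.0, 1.0]],
}
predicted_label = classify(unlabeled_document, labeled_documents)
```

In this sketch the training-side function only shows how a single scalar objective could drive the tunable parameter value, while the inference-side functions mirror the last three steps of the claim: embed both documents, score the pair, and derive a classification from the score.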