CPC G06F 40/30 (2020.01) [G06F 40/253 (2020.01); G06F 40/284 (2020.01); G06N 3/08 (2013.01); G06N 20/00 (2019.01)] | 20 Claims |
1. A computer-implemented method comprising:
receiving data encapsulating a document of text;
segmenting the text into a plurality of semantically coherent units using a coherence-aware text segmentation (CATS) machine learning model, the CATS machine learning model being a multi-task learning model that alternatively minimizes a sentence-level segmentation objective and a coherence objective and which differentiates correct sequences of sentences in the document from corrupt sequences of sentences in the document, the CATS model being generated by cross-lingual zero-shot transfer in which a supervised alignment model is used to project target-language vectors from an independently trained embedding space of a target language to a monolingual embedding space of a source language; and
providing data characterizing the segmenting.
|
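For illustration only, the coherence objective recited in claim 1 can be sketched as a pairwise margin loss over a correct sentence sequence and a corrupted copy of it. This is a minimal sketch under stated assumptions and is not taken from the claim: the helper names (`corrupt_sequence`, `coherence_hinge_loss`, `objective_schedule`), the margin value, the replacement-based corruption strategy, and the strict step-by-step alternation between the two objectives are all hypothetical choices.

```python
import random


def corrupt_sequence(window, pool, n_replace, rng):
    """Return a copy of `window` in which `n_replace` sentences are swapped
    for sentences drawn from `pool` (sentences from outside the window).
    Assumption: corruption by random replacement; the claim does not say how
    corrupt sequences are produced."""
    corrupted = list(window)
    for pos in rng.sample(range(len(window)), n_replace):
        corrupted[pos] = rng.choice(pool)
    return corrupted


def coherence_hinge_loss(score_correct, score_corrupt, margin=1.0):
    """Pairwise margin loss (hypothetical form): zero once the correct
    sequence outscores the corrupt one by at least `margin`."""
    return max(0.0, margin - (score_correct - score_corrupt))


def objective_schedule(n_steps):
    """One possible reading of alternating minimization: even steps update
    the segmentation objective, odd steps update the coherence objective."""
    return ["segmentation" if step % 2 == 0 else "coherence"
            for step in range(n_steps)]


# Usage with placeholder sentences; a real system would score sequences
# with the trained model rather than with hand-picked numbers.
rng = random.Random(0)
document = [f"s{i}" for i in range(10)]
window = document[2:5]
pool = document[:2] + document[5:]
corrupt = corrupt_sequence(window, pool, n_replace=1, rng=rng)
loss = coherence_hinge_loss(score_correct=0.2, score_corrupt=0.5)
schedule = objective_schedule(4)
```

In this sketch the pool is kept disjoint from the window so that every replacement actually changes the sequence. The loss is positive whenever the corrupt sequence scores too close to, or above, the correct one.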