US 11,748,571 B1
Text segmentation with two-level transformer and auxiliary coherence modeling
Goran Glavas, Mannheim (DE); and Swapna Somasundaran, Plainsboro, NJ (US)
Assigned to Educational Testing Service, Princeton, NJ (US)
Filed by Educational Testing Service, Princeton, NJ (US)
Filed on May 20, 2020, as Appl. No. 16/878,708.
Claims priority of provisional application 62/850,610, filed on May 21, 2019.
Int. Cl. G06F 40/289 (2020.01); G06F 40/30 (2020.01); G06N 3/096 (2023.01); G06N 20/00 (2019.01); G06F 40/284 (2020.01); G06F 40/253 (2020.01); G06N 3/08 (2023.01)

CPC G06F 40/30 (2020.01) [G06F 40/253 (2020.01); G06F 40/284 (2020.01); G06N 3/08 (2013.01); G06N 20/00 (2019.01)]

20 Claims

1. A computer-implemented method comprising:

receiving data encapsulating a document of text;

segmenting the text into a plurality of semantically coherent units using a coherence-aware text segmentation (CATS) machine learning model, the CATS machine learning model being a multi-task learning model that alternatively minimizes a sentence-level segmentation objective and a coherence objective and which differentiates correct sequences of sentences in the document from corrupt sequences of sentences in the document, the CATS model being generated by cross-lingual zero-shot transfer in which a supervised alignment model is used to project target-language vectors from an independently trained embedding space of a target language to a monolingual embedding space of a source language; and

providing data characterizing the segmenting.
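
For illustration only, the two training objectives recited in claim 1 can be sketched as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the claim does not specify the loss functions, the corruption scheme, or the alternation schedule. The sketch assumes binary cross-entropy for the sentence-level segmentation objective, a margin-based hinge loss for the coherence objective, shuffling as the way a corrupt sequence is produced, and strict step-by-step alternation. All function names (segmentation_loss, coherence_loss, corrupt_sequence, objective_schedule) and the margin parameter are hypothetical.

```python
import math
import random
from typing import List, Sequence


def segmentation_loss(boundary_probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean binary cross-entropy over per-sentence boundary predictions.

    Assumption: label 1 marks a sentence that starts a new segment.
    """
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for p, y in zip(boundary_probs, labels):
        total -= y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps)
    return total / len(labels)


def coherence_loss(score_correct: float, score_corrupt: float, margin: float = 1.0) -> float:
    """Hinge loss (assumed form): zero once the correct sequence
    outscores the corrupt sequence by at least `margin`."""
    return max(0.0, margin - (score_correct - score_corrupt))


def corrupt_sequence(sentences: List[str], rng: random.Random) -> List[str]:
    """Return a reordered copy of `sentences` (assumed corruption scheme).

    The input list is left unchanged.
    """
    if len(set(sentences)) < 2:
        raise ValueError("need at least two distinct sentences to corrupt")
    corrupted = list(sentences)
    while corrupted == sentences:
        rng.shuffle(corrupted)
    return corrupted


def objective_schedule(num_steps: int) -> List[str]:
    """Alternate between the two objectives, one per training step (assumed schedule)."""
    return ["segmentation" if step % 2 == 0 else "coherence" for step in range(num_steps)]


if __name__ == "__main__":
    doc = ["Cats purr.", "They also hunt.", "Stocks fell today."]
    print(corrupt_sequence(doc, random.Random(0)))
    print(objective_schedule(4))
    print(coherence_loss(score_correct=0.2, score_corrupt=0.1))
```

In a real multi-task setup, the scores and boundary probabilities would come from a shared encoder, and each scheduled step would update the shared parameters using only the loss named for that step.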