US 12,086,559 B2
Clause extraction using machine translation and natural language processing
Vadim Sheinin, Yorktown Heights, NY (US); Octavian Popescu, Westchester, NY (US); Ngoc Phuoc An Vo, Bronx, NY (US); and Irene Lizeth Manotas Gutiérrez, White Plains, NY (US)
Assigned to International Business Machines Corporation, Armonk, NY (US)
Filed by International Business Machines Corporation, Armonk, NY (US)
Filed on Mar. 31, 2021, as Appl. No. 17/219,030.
Prior Publication US 2022/0318523 A1, Oct. 6, 2022
Int. Cl. G06F 40/47 (2020.01); G06F 40/205 (2020.01); G06F 40/58 (2020.01)
CPC G06F 40/47 (2020.01) [G06F 40/205 (2020.01); G06F 40/58 (2020.01)]
17 Claims
1. A computer-implemented method for clause extraction using machine translation, the computer-implemented method comprising:
training a machine translation model, using machine learning, to translate an input sentence in a source language into a translated sentence in a target language and to insert a grammatical indicator into a position of the translated sentence that identifies a dependent clause, wherein the grammatical indicator is not present in the input sentence, wherein the source language includes English, wherein the machine translation model is trained using a corpus of example multi-clause sentences in the source language that each lack the grammatical indicator and corresponding translated sentences in the target language that each include the grammatical indicator, and wherein the example multi-clause sentences are not manually edited when used during training of the machine translation model;
translating the input sentence in the source language into the translated sentence in the target language using the machine translation model;
aligning the input sentence and the translated sentence to determine a position in the input sentence that corresponds to the position of the grammatical indicator in the translated sentence, wherein aligning comprises sequentially numbering each word in the input sentence, mapping each word of the translated sentence to a corresponding word of the input sentence, identifying a subset of words in the translated sentence based on the position of the grammatical indicator, and identifying a corresponding subset of words in the input sentence;
translating the input sentence into at least one additional translated sentence and aligning the input sentence to the at least one additional translated sentence to verify the determined position in the input sentence; and
extracting the dependent clause, in the source language, from the input sentence based on the determined position in the input sentence, wherein the dependent clause is extracted based on a highest-numbered word in the subset of words in the input sentence.
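
The aligning, verifying, and extracting steps of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes that the grammatical indicator is a comma in the target-language sentence, that the dependent clause precedes the indicator, and that a word alignment is already available from some external aligner. All names (`number_words`, `boundary_position`, `extract_clause`, `verify_boundary`) and the hand-written German and Dutch alignments are hypothetical and chosen only for illustration. The trained machine translation model itself is not shown.

```python
from typing import Dict, List

INDICATOR = ","  # assumption: the grammatical indicator is a comma token


def number_words(sentence: str) -> Dict[int, str]:
    """Sequentially number each word of the input sentence, starting at 1."""
    return {i: w for i, w in enumerate(sentence.split(), start=1)}


def boundary_position(translated: List[str],
                      alignment: Dict[int, int],
                      indicator: str = INDICATOR) -> int:
    """Return the highest input-word number aligned to the words that
    precede the indicator in the translated sentence.

    `alignment` maps a 0-based token index in `translated` to a 1-based
    word number in the input sentence (hypothetical aligner output).
    """
    cut = translated.index(indicator)  # raises ValueError if no indicator
    subset = [alignment[i] for i in range(cut) if i in alignment]
    if not subset:
        raise ValueError("no aligned words before the indicator")
    return max(subset)


def extract_clause(sentence: str,
                   translated: List[str],
                   alignment: Dict[int, int]) -> str:
    """Extract the source-language words numbered 1..boundary."""
    numbered = number_words(sentence)
    end = boundary_position(translated, alignment)
    return " ".join(numbered[i] for i in range(1, end + 1))


def verify_boundary(primary: int,
                    extra_translations: List[List[str]],
                    extra_alignments: List[Dict[int, int]]) -> bool:
    """Check that each additional translation yields the same boundary."""
    return all(
        boundary_position(t, a) == primary
        for t, a in zip(extra_translations, extra_alignments)
    )


# Illustrative data only: translations and alignments are hand-written.
source = "When it rains we stay home"
german = ["Wenn", "es", "regnet", ",", "bleiben", "wir", "zu", "Hause"]
align_de = {0: 1, 1: 2, 2: 3, 4: 5, 5: 4, 6: 6, 7: 6}
dutch = ["Als", "het", "regent", ",", "blijven", "we", "thuis"]
align_nl = {0: 1, 1: 2, 2: 3, 4: 5, 5: 4, 6: 6}

boundary = boundary_position(german, align_de)
clause = extract_clause(source, german, align_de)
verified = verify_boundary(boundary, [dutch], [align_nl])
print(boundary, repr(clause), verified)
```

In this sketch the words before the indicator map to input words 1 through 3, so the highest-numbered word in the subset (3, "rains") marks the end of the extracted clause. The second translation serves only as an agreement check on that position. A sentence whose dependent clause follows the indicator would need the complementary slice.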