US 11,790,889 B2
Feature engineering with question generation
Carlos Fernández Musoles, Sheffield (GB); Unai Garay Maestre, Alicante (ES); and Walter Bender, Washington, DC (US)
Assigned to Sorcero, Inc., Washington, DC (US)
Filed by Sorcero, Inc., Washington, DC (US)
Filed on Mar. 23, 2021, as Appl. No. 17/210,320.
Claims priority of provisional application 62/993,122, filed on Mar. 23, 2020.
Prior Publication US 2021/0294781 A1, Sep. 23, 2021
Int. Cl. G10L 15/06 (2013.01); G10L 15/197 (2013.01); G06F 40/20 (2020.01); G10L 15/16 (2006.01); G06F 16/332 (2019.01); G06F 9/451 (2018.01); G06F 16/33 (2019.01); G06F 16/36 (2019.01); G06N 20/00 (2019.01); G06F 16/34 (2019.01); G06F 40/40 (2020.01); G06F 16/22 (2019.01); G06F 16/9032 (2019.01); G06F 16/248 (2019.01); G06F 9/54 (2006.01); G06F 16/31 (2019.01); G06F 40/289 (2020.01); G06N 3/04 (2023.01); G06F 40/30 (2020.01); G16H 40/20 (2018.01); G16H 10/60 (2018.01); G16H 70/20 (2018.01)

CPC G10L 15/063 (2013.01) [G06F 9/451 (2018.02); G06F 9/547 (2013.01); G06F 16/2237 (2019.01); G06F 16/248 (2019.01); G06F 16/328 (2019.01); G06F 16/3323 (2019.01); G06F 16/3329 (2019.01); G06F 16/3338 (2019.01); G06F 16/3344 (2019.01); G06F 16/3347 (2019.01); G06F 16/345 (2019.01); G06F 16/367 (2019.01); G06F 16/90332 (2019.01); G06F 40/20 (2020.01); G06F 40/289 (2020.01); G06F 40/30 (2020.01); G06F 40/40 (2020.01); G06N 3/04 (2013.01); G06N 20/00 (2019.01); G10L 15/16 (2013.01); G10L 15/197 (2013.01); G16H 10/60 (2018.01); G16H 40/20 (2018.01); G16H 70/20 (2018.01)]

21 Claims

1. A computer-implemented method of indexing data in a corpus of natural-language text documents, the method comprising:

obtaining, with a computer system, the corpus of natural-language text documents;

segmenting, with the computer system, a first document of the corpus into a plurality of n-gram sequences, wherein each respective n-gram sequence of the plurality of n-gram sequences represents a phrase or a sentence, and wherein segmenting the first document comprises:

determining a topic based on the first document of the corpus;

determining a set of sequence scores for each member of the plurality of n-gram sequences, wherein each respective score of the set of sequence scores is based on a count of the respective n-gram sequence with respect to the topic;

selecting, with the computer system, a first n-gram sequence of the plurality of n-gram sequences based on the sets of sequence scores;

generating, with the computer system, a question based on at least one n-gram of the first n-gram sequence;

determining, with the computer system, a first set of embedding vectors based on the question;

mapping, with the computer system, the first document to the question in an index;

obtaining, with the computer system, a query;

determining, with the computer system, a second set of embedding vectors based on the query and a distance between the first set of embedding vectors and the second set of embedding vectors;

determining, with the computer system, whether the distance satisfies a criterion;

in response to the distance satisfying the criterion, retrieving at least a portion of text of the first document using the index; and

sending, with the computer system, the portion of the text to a client computing device.
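
The steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation; every function name, the frequent-word topic heuristic, the template question generator, the bag-of-words embedding, and the 0.5 distance threshold are assumptions chosen only to keep the example self-contained. A real system would likely use learned models for topic detection, question generation, and embeddings.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens (the n-grams of this sketch are single words)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def segment(document):
    """Split a document into sentences, each represented as a tuple of tokens."""
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    return [tuple(tokenize(s)) for s in sentences if s]


def topic_terms(document, k=3):
    """Assumed topic heuristic: the k most frequent words longer than 3 letters."""
    words = [w for w in tokenize(document) if len(w) > 3]
    return {w for w, _ in Counter(words).most_common(k)}


def sequence_score(sequence, topic):
    """One reading of the claimed score: count of topic terms in the sequence."""
    return sum(1 for token in sequence if token in topic)


def generate_question(sequence):
    """Hypothetical template-based question generator."""
    return "What is known about " + " ".join(sequence) + "?"


def embed(text):
    """Stand-in embedding: a sparse bag-of-words count vector."""
    return Counter(tokenize(text))


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 if norm == 0 else 1.0 - dot / norm


def build_index(corpus):
    """Map each document to a question generated from its top-scoring sequence."""
    index = []
    for doc_id, document in corpus.items():
        sequences = segment(document)
        if not sequences:
            continue
        topic = topic_terms(document)
        best = max(sequences, key=lambda s: sequence_score(s, topic))
        question = generate_question(best)
        index.append((question, embed(question), doc_id))
    return index


def retrieve(query, index, corpus, max_distance=0.5, chars=80):
    """Return a text portion of each document whose question is near the query."""
    query_vec = embed(query)
    return [
        corpus[doc_id][:chars]
        for _question, question_vec, doc_id in index
        if cosine_distance(question_vec, query_vec) <= max_distance
    ]
```

In use, `build_index` would run once over the corpus, and `retrieve` would run per query; the returned text portions are what the final step would send to a client device. The criterion here is "distance at most `max_distance`", which is only one of many criteria the claim language would cover.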