CPC G10L 15/063 (2013.01) [G06F 9/451 (2018.02); G06F 9/547 (2013.01); G06F 16/2237 (2019.01); G06F 16/248 (2019.01); G06F 16/328 (2019.01); G06F 16/3323 (2019.01); G06F 16/3329 (2019.01); G06F 16/3338 (2019.01); G06F 16/3344 (2019.01); G06F 16/3347 (2019.01); G06F 16/345 (2019.01); G06F 16/367 (2019.01); G06F 16/90332 (2019.01); G06F 40/20 (2020.01); G06F 40/289 (2020.01); G06F 40/30 (2020.01); G06F 40/40 (2020.01); G06N 3/04 (2013.01); G06N 20/00 (2019.01); G10L 15/16 (2013.01); G10L 15/197 (2013.01); G16H 10/60 (2018.01); G16H 40/20 (2018.01); G16H 70/20 (2018.01)] | 21 Claims |
1. A computer-implemented method of indexing data in a corpus of natural-language text documents, the method comprising:
obtaining, with a computer system, the corpus of natural-language text documents;
segmenting, with the computer system, a first document of the corpus into a plurality of n-gram sequences, wherein each respective n-gram sequence of the plurality of n-gram sequences represents a phrase or a sentence, and wherein segmenting the first document comprises:
determining a topic based on the first document of the corpus;
determining a set of sequence scores for each member of the plurality of n-gram sequences, wherein each respective score of the set of sequence scores indicates a count of the respective n-gram sequence with respect to the topic;
selecting, with the computer system, a first n-gram sequence of the plurality of n-gram sequences based on the sets of sequence scores;
generating, with the computer system, a question based on at least one n-gram of the first n-gram sequence;
determining, with the computer system, a first set of embedding vectors based on the question;
mapping, with the computer system, the first document to the question in an index;
obtaining, with the computer system, a query;
determining, with the computer system, a second set of embedding vectors based on the query and a distance between the first set of embedding vectors and the second set of embedding vectors;
determining, with the computer system, whether the distance satisfies a criterion;
in response to the distance satisfying the criterion, retrieving at least a portion of text of the first document using the index; and
sending, with the computer system, the portion of the text to a client computing device.
|
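The steps recited in claim 1 can be illustrated with a minimal sketch. It is not the claimed implementation: the topic model (most frequent tokens), the template-based question generator, the bag-of-words embedding, the cosine distance, and the `max_distance` threshold are all assumptions chosen for brevity, and every function name below is hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric tokens (n-grams with n = 1)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_sequences(text):
    """Segment a document into sentence-level n-gram sequences."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [tokenize(s) for s in sentences if s]


def topic_terms(sequences, k=3):
    """Assumed topic model: the k most frequent tokens longer than three characters."""
    counts = Counter(tok for seq in sequences for tok in seq if len(tok) > 3)
    return {tok for tok, _ in counts.most_common(k)}


def sequence_score(seq, topic):
    """Score a sequence by counting its tokens that belong to the topic."""
    return sum(1 for tok in seq if tok in topic)


def make_question(seq):
    """Hypothetical template-based question generator."""
    return "What is known about " + " ".join(seq) + "?"


def embed(text):
    """Toy embedding (an assumption): a sparse bag-of-words count vector."""
    return Counter(tokenize(text))


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 - dot / norm if norm else 1.0


def build_index(text):
    """Select the best-scoring sequence, generate a question, and map the document to it."""
    sequences = split_sequences(text)
    topic = topic_terms(sequences)
    best = max(sequences, key=lambda seq: sequence_score(seq, topic))
    question = make_question(best)
    return {question: {"text": text, "vector": embed(question)}}


def retrieve(index, query, max_distance=0.6, portion=80):
    """Return a portion of each indexed document whose question is close enough to the query."""
    query_vector = embed(query)
    return [
        entry["text"][:portion]
        for entry in index.values()
        if cosine_distance(entry["vector"], query_vector) <= max_distance
    ]


document = (
    "Aspirin reduces fever. Aspirin reduces fever and pain quickly. "
    "The weather was mild."
)
index = build_index(document)
matches = retrieve(index, "aspirin reduces fever")
misses = retrieve(index, "weather forecast")
```

In this sketch the criterion of the claim is modeled as the distance being at most `max_distance`; a query sharing no tokens with the generated question has distance 1.0 and retrieves nothing. A production system would presumably replace the count vectors with learned embeddings and the dictionary with a vector index.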