US 12,333,246 B1
Automated question-answer generation system for documents
Himanshu Gupta, Chhattisgarh (IN); Raaed Ahmed Syed, Telangana (IN); Tarun Kumar, Uttar Pradesh (IN); Tamanna Agrawal, Chhattisgarh (IN); and Himanshu Sharad Bhatt, Karnataka (IN)
Assigned to American Express (India) Private Limited, New Delhi (IN)
Filed by American Express Travel Related Services Company, Inc., New York, NY (US)
Filed on Dec. 17, 2021, as Appl. No. 17/554,761.
Int. Cl. G06F 40/30 (2020.01); G06F 40/166 (2020.01); G06F 40/211 (2020.01); G06F 40/253 (2020.01); G06F 40/40 (2020.01); G06N 5/01 (2023.01)
CPC G06F 40/211 (2020.01) [G06F 40/166 (2020.01); G06F 40/253 (2020.01); G06F 40/30 (2020.01); G06F 40/40 (2020.01); G06N 5/01 (2023.01)]
12 Claims
1. A computer implemented method for question-answer pair generation, the method comprising:
receiving, by one or more computing devices, a document;
identifying, by the one or more computing devices, a sentence in the document;
generating, by the one or more computing devices, a syntactic map for the sentence, wherein the syntactic map represents a grammatical structure of the sentence based on dependencies between words in the sentence;
identifying, by the one or more computing devices, a further sentence in the document;
generating, by the one or more computing devices, a further syntactic map for the further sentence;
generating, by the one or more computing devices, a combined syntactic map from the syntactic map and the further syntactic map by connecting the syntactic map and the further syntactic map using common words found in each of the syntactic map and the further syntactic map;
generating, by the one or more computing devices, word vector representations for encoding each word of the sentence and the further sentence, by processing each of the sentence and the further sentence using a Bi-Directional Gated Recurrent Unit (BiGRU) and giving weights to each word of the sentence and the further sentence based on the BiGRU being trained to recognize a relative importance of each word to the sentence and the further sentence based on its part of speech;
generating, by the one or more computing devices, a combined vector representation of each word of the sentence and the further sentence by computing a weighted average based on each of the word vector representations;
generating, by the one or more computing devices, a structurally aware vector representation of the sentence and the further sentence by processing the combined syntactic map and the combined vector representation of each word of the sentence and the further sentence, using a graph attention network (GAT);
generating, by the one or more computing devices, a semantic enriched vector representation of the sentence and the further sentence by processing the structurally aware vector representation of the sentence and the further sentence and the word vector representations for each word of the sentence and the further sentence, using a neural network, wherein the semantic enriched vector representation comprises a value representing the importance of each word in the combined syntactic map;
generating, by the one or more computing devices, document level questions based on the semantic enriched vector representation;
determining, by the one or more computing devices, a cosine similarity between:
a document level question from the document level questions generated, and one or more paragraphs of the document;
for a paragraph from the one or more paragraphs determined to have a highest cosine similarity, transmitting, by the one or more computing devices, the document level question and the paragraph to a Question-Answer Model (QA Model);
determining, by the one or more computing devices and using the QA Model, an answer to the document level question from the paragraph; and
post-processing, by the one or more computing devices, the answer to determine whether the answer is redundant, incorrect, or irrelevant based on using a further trained model trained to recognize correct answers based on patterns of previous answers to similarly posed questions and determining which answers are most similar to the answer and discarding answers deemed redundant, incorrect, or irrelevant.
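Two of the steps recited above lend themselves to a small illustration: connecting two syntactic maps through their common words, and selecting the paragraph with the highest cosine similarity to a generated question. The following is a minimal sketch only, not the claimed implementation. It assumes each syntactic map is available as a list of (head, dependent) word pairs, which a dependency parser would normally produce, and it assumes plain bag-of-words counts as the vectors for cosine similarity, since the claim does not specify the representation. All function names are hypothetical.

```python
import math
from collections import Counter


def common_words(edges_a, edges_b):
    """Return the words that occur in both dependency edge lists."""
    words_a = {word for edge in edges_a for word in edge}
    words_b = {word for edge in edges_b for word in edge}
    return words_a & words_b


def combine_syntactic_maps(edges_a, edges_b):
    """Merge two dependency edge lists into one undirected adjacency map.

    Nodes are keyed by word (an assumption), so a word present in both
    sentences becomes a single shared node that links the two maps.
    """
    graph = {}
    for head, dependent in list(edges_a) + list(edges_b):
        graph.setdefault(head, set()).add(dependent)
        graph.setdefault(dependent, set()).add(head)
    return graph


def bag_of_words(text):
    """Assumed vectorizer: lowercase, whitespace-split token counts."""
    return Counter(text.lower().split())


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * vec_b[token] for token, count in vec_a.items() if token in vec_b)
    norm_a = math.sqrt(sum(count * count for count in vec_a.values()))
    norm_b = math.sqrt(sum(count * count for count in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_paragraph(question, paragraphs):
    """Return the index of the paragraph most similar to the question."""
    question_vec = bag_of_words(question)
    scores = [cosine_similarity(question_vec, bag_of_words(p)) for p in paragraphs]
    return max(range(len(paragraphs)), key=scores.__getitem__)
```

In a fuller pipeline, the paragraph selected by `best_paragraph` would be passed together with the question to the QA Model. The word vectors would more plausibly come from the learned representations described in the claim than from raw token counts.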