US 12,111,856 B2
Method and system for long-form answer extraction based on combination of sentence index generation techniques
Anumita Dasguptabandyopadhyay, Kolkata (IN); Prabir Mallick, Kolkata (IN); Tapas Nayak, Kolkata (IN); Indrajit Bhattacharya, Kolkata (IN); and Sangameshwar Suryakant Patil, Pune (IN)
Assigned to Tata Consultancy Services Limited, Mumbai (IN)
Filed by Tata Consultancy Services Limited, Mumbai (IN)
Filed on Sep. 20, 2023, as Appl. No. 18/470,657.
Claims priority of application No. 202221058931 (IN), filed on Oct. 14, 2022.
Prior Publication US 2024/0126791 A1, Apr. 18, 2024
Int. Cl. G06F 16/30 (2019.01); G06F 16/31 (2019.01); G06F 16/332 (2019.01)
CPC G06F 16/31 (2019.01) [G06F 16/3329 (2019.01)]
8 Claims
1. A processor implemented method comprising:
receiving a plurality of inputs, via one or more hardware processors, wherein the plurality of inputs is associated with information regarding a dataset comprising a question and a document comprising an answer text for the question;
pre-processing the plurality of inputs to obtain a set of pre-processed training data, via the one or more hardware processors, wherein the set of pre-processed training data comprises a plurality of pre-processed sentence indices data and a plurality of pre-processed sentence index spans, wherein the plurality of pre-processed sentence indices data is obtained using a document truncation technique, wherein the document truncation technique includes selecting an entire answer text span from the document and extending the text span by adding sentences appearing immediately before and after the span as long as the extended span does not exceed a pre-defined number of tokens, wherein a sentence index of the plurality of sentence indices is appended at the beginning of each sentence in the document;
training a set of generation models using the set of pre-processed training data based on a supervised learning technique, via the one or more hardware processors, wherein the set of generation models comprises a sentence indices generation model and a sentence index spans generation model, wherein the set of generation models are sequence-to-sequence models generated using neural networks, and wherein the sentence indices generation model comprises a plurality of answer sentence indices and the sentence index spans generation model comprises a plurality of answer sentence index spans, wherein the sentence indices generation model is trained using the plurality of pre-processed sentence indices data, and the sentence index spans generation model is trained using the plurality of pre-processed sentence index spans, wherein the set of generation models are generative auto-regressive sequence-to-sequence models using the chain rule and modeling the probability of each token o_i in the output sequence o conditioned on the input sequence x and the previously generated tokens o_<i, wherein the set of generation models are trained by maximizing the log-likelihood of the input-output sequences in the pre-processed training data; and
post-processing the plurality of answer sentence indices and the plurality of answer sentence index spans of the set of generation models for a long-form answer extraction, via the one or more hardware processors, wherein a long-form answer text is extracted for a user question input, based on mapping of a union of the first sentence indices and the second sentence indices, wherein the post-processing comprises:
obtaining a first sentence indices by post-processing the plurality of answer sentence indices based on a cleaning technique; and
obtaining a second sentence indices by post-processing the plurality of answer sentence index spans based on the cleaning technique, and a span expansion technique.
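
The pre-processing step recited in claim 1 (document truncation around the answer span, followed by prefixing each sentence with its index) can be illustrated with a minimal sketch. It is not taken from the patent: the function names, the whitespace token count, the "i: " index format, the order in which the preceding and following sentences are tried, and the choice to number sentences after truncation are all assumptions made for illustration.

```python
from typing import List, Tuple


def truncate_around_answer(
    sentences: List[str], answer_start: int, answer_end: int, max_tokens: int
) -> Tuple[int, int]:
    """Return the (first, last) sentence positions of the truncated window.

    The window starts as the full answer span [answer_start, answer_end]
    (0-based, inclusive) and grows by one sentence before and one after
    while the token budget allows. Tokens are counted by whitespace
    splitting, which is an assumption.
    """
    def n_tokens(sentence: str) -> int:
        return len(sentence.split())

    lo, hi = answer_start, answer_end
    total = sum(n_tokens(s) for s in sentences[lo:hi + 1])
    grew = True
    while grew:
        grew = False
        # Try the sentence immediately before the current window.
        if lo > 0 and total + n_tokens(sentences[lo - 1]) <= max_tokens:
            lo -= 1
            total += n_tokens(sentences[lo])
            grew = True
        # Try the sentence immediately after the current window.
        if hi < len(sentences) - 1 and total + n_tokens(sentences[hi + 1]) <= max_tokens:
            hi += 1
            total += n_tokens(sentences[hi])
            grew = True
    return lo, hi


def add_sentence_indices(sentences: List[str], start: int = 1) -> List[str]:
    """Prefix each sentence with its index (the 'i: ' format is hypothetical)."""
    return [f"{i}: {s}" for i, s in enumerate(sentences, start=start)]


if __name__ == "__main__":
    doc = ["a b c", "d e", "f g h i", "j", "k l m n o"]
    first, last = truncate_around_answer(doc, answer_start=2, answer_end=2, max_tokens=8)
    print(add_sentence_indices(doc[first:last + 1]))
```

The answer span itself is never cut, even if it alone exceeds the budget; only the surrounding context is limited.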
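
The auto-regressive factorization and training objective recited in claim 1 can be written out as follows. The symbols x, o, and o_i follow the claim; the parameter vector \theta and the training set D are notation assumed here for illustration.

```latex
p_{\theta}(o \mid x) = \prod_{i=1}^{|o|} p_{\theta}(o_i \mid o_{<i}, x),
\qquad
\theta^{*} = \arg\max_{\theta} \sum_{(x,\, o) \in D} \; \sum_{i=1}^{|o|} \log p_{\theta}(o_i \mid o_{<i}, x)
```

The first expression is the chain-rule decomposition of the output sequence probability; the second is the log-likelihood maximized over the input-output pairs in the pre-processed training data.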
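
The post-processing step recited in claim 1 (cleaning, span expansion, union, and mapping back to sentences) can be sketched in the same way. The claim does not define the cleaning technique or the models' output format, so the sketch assumes space-separated indices such as "2 3 5" and hyphenated spans such as "2-4", and assumes that cleaning means discarding non-numeric, out-of-range, duplicate, or reversed entries. All function names are hypothetical.

```python
import re
from typing import List, Set


def clean_indices(generated: str, num_sentences: int) -> List[int]:
    """Keep unique, in-range (1-based) integers from a generated index string."""
    kept: Set[int] = set()
    for token in re.findall(r"\d+", generated):
        index = int(token)
        if 1 <= index <= num_sentences:
            kept.add(index)
    return sorted(kept)


def clean_and_expand_spans(generated: str, num_sentences: int) -> List[int]:
    """Parse 'start-end' spans, drop malformed ones, and expand to indices."""
    kept: Set[int] = set()
    for start, end in re.findall(r"(\d+)\s*-\s*(\d+)", generated):
        s, e = int(start), int(end)
        if s > e:
            continue  # assumed cleaning rule: drop reversed spans
        s, e = max(s, 1), min(e, num_sentences)
        kept.update(range(s, e + 1))  # empty when the span lies out of range
    return sorted(kept)


def extract_answer(sentences: List[str], indices_out: str, spans_out: str) -> str:
    """Union both index sets and map them back to the document sentences."""
    first = clean_indices(indices_out, len(sentences))
    second = clean_and_expand_spans(spans_out, len(sentences))
    union = sorted(set(first) | set(second))
    return " ".join(sentences[i - 1] for i in union)


if __name__ == "__main__":
    doc = ["S one.", "S two.", "S three.", "S four.", "S five."]
    print(extract_answer(doc, "2 3 3 9 x", "3-4 7-9 5-2"))
```

Taking the union means a sentence is included if either model selects it, so the two models can compensate for each other's omissions.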