US 12,430,517 B2
Method and system for document structure based unsupervised long-form technical question generation
Subhasish Ghosh, Kolkata (IN); Arpita Kundu, Kolkata (IN); Indrajit Bhattacharya, Kolkata (IN); Pratik Saini, Noida (IN); and Tapas Nayak, Kolkata (IN)
Assigned to Tata Consultancy Services Limited, Mumbai (IN)
Filed by Tata Consultancy Services Limited, Mumbai (IN)
Filed on Aug. 16, 2023, as Appl. No. 18/450,588.
Claims priority of application No. 202221052005 (IN), filed on Sep. 12, 2022.
Prior Publication US 2024/0095466 A1, Mar. 21, 2024
Int. Cl. G06F 40/14 (2020.01); G06F 40/137 (2020.01); G06F 40/205 (2020.01); G06F 40/40 (2020.01); G06Q 50/20 (2012.01); G06V 30/413 (2022.01)
CPC G06F 40/40 (2020.01) [G06F 40/137 (2020.01); G06F 40/205 (2020.01); G06Q 50/20 (2013.01); G06V 30/413 (2022.01); G06V 2201/10 (2022.01)]
18 Claims
1. A processor implemented method, the method comprising:
receiving, by one or more hardware processors, a textbook document, wherein the textbook document is in Portable Document Format (PDF);
extracting, by the one or more hardware processors, PDF metadata from the textbook document using a Natural Language Processing (NLP) technique, wherein the PDF metadata comprises text sizes, fonts, and text coordinates;
extracting, by the one or more hardware processors, a plurality of structures from the textbook document based on the PDF metadata using an NLP based filtering technique, wherein the plurality of structures comprises a plurality of hierarchical index structures and a plurality of Table of Content (TOC) structures, wherein each of the plurality of hierarchical index structures and the plurality of TOC structures is a tree structure comprising a plurality of nodes and a plurality of edges connecting the plurality of nodes;
annotating, by the one or more hardware processors, each of the plurality of hierarchical index structures by identifying a plurality of entities and a plurality of contexts corresponding to each of the plurality of entities using a parsing technique, wherein each of the plurality of nodes of each of the plurality of hierarchical index structures is annotated as one of a) an entity, b) a context, and c) an entity with context;
simultaneously annotating, by the one or more hardware processors, each of the plurality of TOC structures by identifying a plurality of type information using the parsing technique, wherein each of the plurality of nodes corresponding to each of the plurality of TOC structures is annotated as one of a) a question, b) a question phrase, c) a VBG (Verb Gerund present participle) phrase, d) a sentence, and e) a noun phrase;
obtaining, by the one or more hardware processors, a plurality of index based question templates from a plurality of predefined question templates based on an annotated plurality of hierarchical index structures using an index based question template selection technique;
obtaining, by the one or more hardware processors, a plurality of TOC based question templates from the plurality of predefined question templates based on an annotated plurality of TOC structures using a TOC structure based question template selection technique, wherein the TOC structure based question template selection technique selects the plurality of TOC based question templates based on the plurality of type information; and
generating, by the one or more hardware processors, a plurality of long-form technical questions based on the annotated plurality of hierarchical index structures and the annotated plurality of TOC structures by instantiating the plurality of index based question templates and the plurality of TOC based question templates.
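
The annotation, template selection, and instantiation steps recited above can be illustrated with a minimal Python sketch. It is not the claimed implementation. The `Node` class, the keyword heuristics in `toc_type` (a toy stand-in for the parsing technique), the template wording, and the hand-supplied index labels are all hypothetical and chosen only for illustration. PDF metadata extraction is omitted.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    """One node of a TOC or index tree; edges are the parent-to-child links."""
    text: str
    label: str = ""  # annotation, filled in by hand or by annotate_toc
    children: List["Node"] = field(default_factory=list)

# Hypothetical word lists used by the toy type classifier below.
WH_WORDS = {"what", "why", "how", "when", "where", "which", "who"}
FINITE_VERBS = {"is", "are", "was", "were", "has", "have", "does", "do", "can"}

def toc_type(title: str) -> str:
    """Guess the type of a TOC title with crude keyword rules (assumption)."""
    stripped = title.strip()
    words = stripped.rstrip("?").lower().split()
    if stripped.endswith("?"):
        return "question"
    if words and words[0] in WH_WORDS:
        return "question phrase"
    if words and words[0].endswith("ing"):
        return "VBG phrase"
    if any(w in FINITE_VERBS for w in words[1:]):
        return "sentence"
    return "noun phrase"

# Hypothetical predefined templates, keyed by annotation label.
TOC_TEMPLATES = {
    "question": "{text}",
    "question phrase": "{text}?",
    "VBG phrase": "Describe the process of {text}.",
    "sentence": "Explain the following statement: {text}.",
    "noun phrase": "Write a detailed note on {text}.",
}
INDEX_TEMPLATES = {
    "entity": "Define the term {entity}.",
    "entity with context": "Explain {entity}.",
    "context": "Discuss {context} in relation to {entity}.",
}

def annotate_toc(node: Node) -> None:
    """Label every node of a TOC tree with its type information."""
    node.label = toc_type(node.text)
    for child in node.children:
        annotate_toc(child)

def toc_questions(node: Node) -> List[str]:
    """Select a template by node type and instantiate it, depth first."""
    out = [TOC_TEMPLATES[node.label].format(text=node.text)]
    for child in node.children:
        out.extend(toc_questions(child))
    return out

def index_questions(node: Node, parent_entity: Optional[str] = None) -> List[str]:
    """Instantiate index templates; a context node borrows its ancestor's entity."""
    out = []
    if node.label == "context":
        if parent_entity is not None:
            out.append(INDEX_TEMPLATES["context"].format(
                context=node.text, entity=parent_entity))
        entity = parent_entity
    else:
        out.append(INDEX_TEMPLATES[node.label].format(entity=node.text))
        entity = node.text
    for child in node.children:
        out.extend(index_questions(child, entity))
    return out

if __name__ == "__main__":
    toc = Node("Graph Algorithms", children=[
        Node("Analyzing Quicksort"), Node("Why Sort?")])
    annotate_toc(toc)
    index = Node("binary search tree", "entity", [Node("deletion", "context")])
    for question in toc_questions(toc) + index_questions(index):
        print(question)
```

In this sketch a context node produces a question only when an ancestor supplies an entity, which mirrors the pairing of entities with their corresponding contexts in the annotating step. A real system would replace `toc_type` with a syntactic parser.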