CPC G06F 40/40 (2020.01) [G06F 16/345 (2019.01); G06F 16/951 (2019.01); G06F 40/295 (2020.01)] | 16 Claims |
1. A method for extracting information from an unstructured data source, the method comprising:
scraping, by at least one processor, a plurality of texts from the unstructured data source, the scraping comprising obtaining an unstructured formatted text from the unstructured data source, obtaining a title of the unstructured formatted text, based on a formatting of the title in the unstructured formatted text, using a title classification model to classify the title as one of relevant and non-relevant, and if the title is classified as relevant, parsing the unstructured formatted text and including the title in the plurality of texts;
extracting, by the at least one processor, from the plurality of texts a chunk of relevant text;
summarizing, by the at least one processor, using a pre-trained summarizer, the chunk of relevant text, each of the at least one processor to summarize the chunk of relevant text in parallel to obtain semi-structured information comprising a set of sentences that summarize the chunk of relevant text; and
postprocessing, by the at least one processor, the semi-structured information to obtain structured information.
|
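For illustration only, the four steps of claim 1 (scraping with title filtering, extracting, parallel summarizing, postprocessing) can be sketched as below. This is a minimal sketch under stated assumptions, not the claimed implementation: the claim does not say how the title classification model or the pre-trained summarizer work, so `classify_title` and `summarize` are hypothetical keyword-based and first-sentences stand-ins. The `<h1>` tag as the title's formatting cue, the `KEYWORDS` tuple, the record layout, and the use of a thread pool to stand in for multiple processors are likewise assumptions.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Assumption: relevance is decided by keywords; a real system would use trained models.
KEYWORDS = ("revenue",)
TITLE_RE = re.compile(r"<h1[^>]*>(.*?)</h1>", re.I | re.S)  # title found via its formatting
TAG_RE = re.compile(r"<[^>]+>")
SENTENCE_RE = re.compile(r"(?<=[.!?])\s+")


def classify_title(title):
    """Hypothetical stand-in for the title classification model."""
    hit = any(k in title.lower() for k in KEYWORDS)
    return "relevant" if hit else "non-relevant"


def scrape(documents):
    """Keep only documents whose title is classified as relevant, then parse them."""
    texts = []
    for doc in documents:
        match = TITLE_RE.search(doc)
        if match is None:
            continue
        title = TAG_RE.sub("", match.group(1)).strip()
        if classify_title(title) != "relevant":
            continue
        # Parse the body only after the title has passed the filter.
        body = " ".join(TAG_RE.sub(" ", TITLE_RE.sub("", doc)).split())
        texts.append({"title": title, "body": body})
    return texts


def extract_chunks(texts):
    """Extract a chunk of relevant text (keyword-bearing sentences) per scraped text."""
    chunks = []
    for text in texts:
        sentences = SENTENCE_RE.split(text["body"])
        relevant = [s for s in sentences if any(k in s.lower() for k in KEYWORDS)]
        if relevant:
            chunks.append(" ".join(relevant))
    return chunks


def summarize(chunk, max_sentences=2):
    """Hypothetical stand-in for the pre-trained summarizer: first few sentences."""
    return SENTENCE_RE.split(chunk)[:max_sentences]


def postprocess(summaries):
    """Turn semi-structured sentence lists into structured records."""
    return [
        {"chunk": i, "position": j, "sentence": sentence}
        for i, sentences in enumerate(summaries)
        for j, sentence in enumerate(sentences)
    ]


def run_pipeline(documents):
    texts = scrape(documents)
    chunks = extract_chunks(texts)
    # Worker threads stand in for the processors that summarize in parallel.
    with ThreadPoolExecutor() as pool:
        summaries = list(pool.map(summarize, chunks))
    return postprocess(summaries)
```

Calling `run_pipeline` on a list of HTML strings returns one record per summary sentence. A document whose `<h1>` title fails the classifier is dropped before its body is parsed, which mirrors the order of operations recited in the scraping step.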