| CPC G06N 5/022 (2013.01) [G06F 40/117 (2020.01); G06F 40/205 (2020.01); G06F 40/253 (2020.01); G06F 40/284 (2020.01); G06F 40/289 (2020.01)] | 13 Claims |

1. A processor-implemented method for training a word-level data model for extraction of tasks from documents using weak supervision, comprising:
receiving a plurality of documents from a plurality of sources, via one or more hardware processors;
pre-processing the plurality of documents using a plurality of pre-processing techniques, via the one or more hardware processors, to obtain a plurality of pre-processed documents comprising a plurality of sentences, a plurality of words within the plurality of sentences, a plurality of Part-of-speech (POS) tags, a plurality of dependency trees and a plurality of WordNet® based features;
labelling the plurality of words from the plurality of pre-processed documents as one of a task headword and a no-task headword, via the one or more hardware processors, wherein the plurality of words is labelled based on the plurality of sentences using a plurality of linguistic rules, wherein the plurality of words is labelled based on the plurality of linguistic rules for a volitionality aspect referring to tasks whose actions are carried out by an actor volitionally,
wherein the volitionality aspect includes an action verbs or nouns aspect, an animate organization agent aspect, an inanimate agent aspect and a volition marker,
wherein the animate organization agent aspect captures the volitionality in an implicit way, as animate agents indicate that the action corresponding to the verb w is likely to be carried out volitionally,
wherein for the animate organization agent aspect, when the agent of the verb w is animate or corresponds to an organization, then the word is labelled as “task headword”, wherein for the inanimate agent aspect, when the agent of the verb w is inanimate, then the word is labelled as “no-task headword”; and
training a word-level weakly supervised classification model for extraction of tasks, via the one or more hardware processors, using the task headword and the no-task headword labelled using the plurality of linguistic rules, wherein the word-level weakly supervised classification model is the word-level data model for extraction of tasks from documents, wherein the word-level weakly supervised classification model is a Bidirectional Encoder Representations from Transformers (BERT) based classification model, wherein each instance is annotated with a soft label and each instance is a combination of a word w, the word's POS tag p, and the sentence S.
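The rule-based labelling step of the claim (the volitionality aspect, with its animate-organization-agent and inanimate-agent rules) can be illustrated with a minimal sketch. The claim does not name any tooling; spaCy for POS tags, dependency trees and named entities, NLTK's WordNet® interface for the animacy test, and the specific hypernym heuristic below are all assumptions made for illustration only.

```python
# Hypothetical sketch of the weak-labelling rules; spaCy and NLTK WordNet
# are assumed tooling choices, not named in the claim.
import spacy
from nltk.corpus import wordnet as wn

nlp = spacy.load("en_core_web_sm")  # POS tags, dependency trees, NER

# Assumed WordNet roots used to approximate animacy.
ANIMATE_ROOTS = {"person.n.01", "animal.n.01", "social_group.n.01"}

def is_animate_or_org(token):
    """Heuristic test: named PERSON/ORG entities, pronouns, or nouns whose
    WordNet hypernym paths reach an 'animate' root."""
    if token.ent_type_ in ("PERSON", "ORG") or token.pos_ == "PRON":
        return True
    for synset in wn.synsets(token.lemma_, pos=wn.NOUN):
        for path in synset.hypernym_paths():
            if any(s.name() in ANIMATE_ROOTS for s in path):
                return True
    return False

def label_sentence(sentence):
    """Apply the animate-organization-agent / inanimate-agent rules:
    verbs with an animate or organization agent become 'task headword',
    verbs with an inanimate agent become 'no-task headword'."""
    labels = {}
    doc = nlp(sentence)
    for w in doc:
        if w.pos_ != "VERB":
            continue
        agents = [c for c in w.children if c.dep_ in ("nsubj", "agent")]
        if not agents:
            continue
        if any(is_animate_or_org(a) for a in agents):
            labels[w.i] = (w.text, "task headword")
        else:
            labels[w.i] = (w.text, "no-task headword")
    return labels

print(label_sentence("The engineer will deploy the new service next week."))
```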
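The training step can likewise be sketched as a BERT-based classifier over instances (w, p, S) annotated with soft labels. The input packing (pairing "w p" with the sentence S), the soft-label cross-entropy loss, and all hyperparameters below are illustrative assumptions rather than the claimed formulation.

```python
# Minimal sketch of a word-level weakly supervised BERT classifier trained on
# soft-labelled instances (word w, POS tag p, sentence S); details are assumed.
import torch
from torch import nn
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

class WordLevelTaskClassifier(nn.Module):
    """Predicts task headword vs. no-task headword for an instance (w, p, S)."""
    def __init__(self, hidden=768, num_labels=2):
        super().__init__()
        self.encoder = BertModel.from_pretrained("bert-base-uncased")
        self.head = nn.Linear(hidden, num_labels)

    def forward(self, word, pos_tag, sentence):
        # Pair the "w p" segment with the sentence S; BERT adds [CLS]/[SEP].
        enc = tokenizer(f"{word} {pos_tag}", sentence,
                        return_tensors="pt", truncation=True, padding=True)
        cls = self.encoder(**enc).last_hidden_state[:, 0]  # [CLS] vector
        return self.head(cls)                              # logits over 2 classes

def soft_label_loss(logits, soft_targets):
    """Cross-entropy against the soft labels produced by the linguistic rules."""
    log_probs = torch.log_softmax(logits, dim=-1)
    return -(soft_targets * log_probs).sum(dim=-1).mean()

model = WordLevelTaskClassifier()
optim = torch.optim.AdamW(model.parameters(), lr=2e-5)

# One weakly labelled instance: soft label [P(task), P(no-task)] from the rules.
logits = model("review", "VERB", "Please review the draft report by Friday.")
loss = soft_label_loss(logits, torch.tensor([[0.9, 0.1]]))
loss.backward()
optim.step()
```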