US 12,242,544 B2
Training and applying structured data extraction models
Gregorio Benincasa, London (GB); Andrea Schiavi, London (GB); Huiting Liu, London (GB); Tom Bird, London (GB); Uwais Iqbal, London (GB); Petar Petrov, London (GB); Jordan Muscatello, London (GB); Gwyneth Harrison-Shermoen, London (GB); Domenico Flauto, London (GB); Sinan Guclu, London (GB); and Jacob Cozens, London (GB)
Assigned to Sirion Eigen Limited, North Harrow (GB)
Filed by Sirion Eigen Limited, North Harrow (GB)
Filed on Feb. 3, 2022, as Appl. No. 17/592,269.
Application 17/592,269 is a continuation of application No. PCT/EP2020/072790, filed on Aug. 13, 2020.
Claims priority of application No. 1911760 (GB), filed on Aug. 16, 2019.
Prior Publication US 2022/0309109 A1, Sep. 29, 2022
This patent is subject to a terminal disclaimer.
Int. Cl. G06F 16/93 (2019.01); G06F 16/25 (2019.01); G06F 16/31 (2019.01); G06F 16/34 (2019.01); G06F 16/35 (2019.01); G06F 16/383 (2019.01); G06F 40/183 (2020.01); G06F 40/205 (2020.01); G06F 40/284 (2020.01); G06F 40/289 (2020.01)

CPC G06F 16/93 (2019.01) [G06F 16/254 (2019.01); G06F 16/316 (2019.01); G06F 16/345 (2019.01); G06F 16/35 (2019.01); G06F 16/383 (2019.01); G06F 40/183 (2020.01); G06F 40/205 (2020.01); G06F 40/284 (2020.01); G06F 40/289 (2020.01)]

18 Claims

1. A computer-implemented method of extracting structured data from unstructured or semi-structured text in an electronic document, the method comprising:

training a data extraction model to identify an extracted point from unstructured or semi-structured text of an electronic document, the training comprising:

providing a training data structure for training a data extraction model associated with a single user defined question by converting each of a cohort of unstructured or semi-structured documents into word tokens representing words of the document;

generating, for each document, for each word token of the document, feature values of a set of features derived from the word tokens, the features including at least one feature associated with a relative position of the word token in a sequence of tokens representing each document;

providing, for each document, a label for each word token, the label being a label which indicates whether the word token is relevant to the user defined question, or whether the word token is not relevant to the user defined question, the labels being sequenced according to the sequence of tokens; and

training the data extraction model over the cohort of documents using the feature values and the sequenced labels; and

tokenizing the text as a token sequence of section tokens, wherein each section token corresponds to a portion of text extracted from an electronic document;

for each of multiple candidate label sequences, wherein each label sequence of the multiple candidate label sequences assigns a label to each section token:

extracting a set of feature values for each section token, wherein a topic distribution is determined for a portion of text corresponding to the section token, wherein the topic distribution of each section token is compared to an average topic distribution to extract a feature value for that section token as a divergence between the topic distribution of each section token and the average topic distribution, wherein the average topic distribution is determined from sections of training data labelled as relevant, excluding non-relevant sections, the training data used to train the data extraction model, wherein at least some of the features are learned from extraction training documents used to train the data extraction model, wherein each document of the extraction training documents is tokenized and each section token thereof is labelled as relevant or non-relevant to a question which the document extraction model is trained to answer, and wherein features are only extracted from the section tokens labelled as relevant;

applying the data extraction model to each label sequence of the multiple candidate label sequences and the set of feature values determined for that label sequence for each section token, thereby computing a score for each label sequence;

selecting a label sequence of the multiple candidate label sequences for the token sequence based on the computed scores of each of the label sequences of the multiple candidate label sequences; and

providing extracted structured data, in the form of one or more extracted section tokens of the token sequence, based on the selected label sequence.
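
The feature extraction, scoring, and selection steps recited above can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the claimed implementation: the claim does not name a divergence measure or a model family, so the sketch assumes Kullback-Leibler divergence and substitutes a hand-weighted linear scoring function for the trained data extraction model. All function names, weights, and topic distributions below are hypothetical.

```python
import math
from itertools import product


def relative_positions(tokens):
    """Relative position of each token in its sequence, scaled to [0, 1]."""
    n = len(tokens)
    return [i / (n - 1) if n > 1 else 0.0 for i in range(n)]


def average_distribution(distributions):
    """Element-wise mean of topic distributions (relevant training sections only)."""
    k = len(distributions[0])
    return [sum(d[j] for d in distributions) / len(distributions) for j in range(k)]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q); assumed here as the divergence, the claim does not specify one."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def score_sequence(labels, divergences, bias=0.5, w_div=-1.0, w_trans=0.2):
    """Toy stand-in for the trained model: scores one candidate label sequence.

    A section labelled relevant (1) earns `bias` minus its divergence, and
    each pair of adjacent sections sharing a label earns `w_trans`.
    """
    score = 0.0
    for i, (label, div) in enumerate(zip(labels, divergences)):
        if label == 1:
            score += bias + w_div * div
        if i > 0 and labels[i - 1] == label:
            score += w_trans
    return score


def best_label_sequence(divergences):
    """Enumerate every binary label sequence and keep the highest-scoring one."""
    candidates = product((0, 1), repeat=len(divergences))
    return max(candidates, key=lambda labels: score_sequence(labels, divergences))


# Hypothetical topic distributions of training sections labelled relevant.
relevant_training = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1]]
average = average_distribution(relevant_training)

# Hypothetical section tokens of a new document with their topic distributions.
sections = ["Preamble", "Termination clause", "Notice period", "Signatures"]
section_topics = [
    [0.1, 0.1, 0.8],
    [0.75, 0.15, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
]

# One divergence feature value per section token.
divergences = [kl_divergence(t, average) for t in section_topics]
selected = best_label_sequence(divergences)

# Extracted structured data: the section tokens labelled relevant.
extracted = [s for s, label in zip(sections, selected) if label == 1]
```

Exhaustive enumeration grows as 2^n in the number of section tokens and is used here only to make the "score each candidate label sequence, then select" structure explicit. A practical sequence model would typically find the best sequence with dynamic programming rather than enumerating all candidates.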