US 12,277,389 B2
Text mining based on document structure information extraction
Tetsuya Nasukawa, Kawasaki (JP); Shoko Suzuki, Yokohama (JP); Daisuke Takuma, Toshima-ku (JP); and Issei Yoshida, Setagaya-ku (JP)
Assigned to International Business Machines Corporation, Armonk, NY (US)
Filed by International Business Machines Corporation, Armonk, NY (US)
Filed on May 10, 2021, as Appl. No. 17/315,447.
Prior Publication US 2022/0358287 A1, Nov. 10, 2022
Int. Cl. G06F 40/279 (2020.01); G06F 16/93 (2019.01); G06F 40/30 (2020.01)

CPC G06F 40/279 (2020.01) [G06F 16/93 (2019.01); G06F 40/30 (2020.01)]

14 Claims

1. A method for mining text by a computer-based text mining system, comprising:

obtaining, by the text mining system, a first frequent sequence of characters from a set of documents, the set of documents having structured contents according to a common rule, wherein the first frequent sequence satisfies a condition of maximality and the satisfying of the condition of maximality comprises:

performing a first comparison, the first comparison comprising comparing a first occurrence frequency of the first frequent sequence to a second occurrence frequency of a second sequence, wherein the second sequence is longer than the first sequence and the second sequence contains the first sequence;

determining that the first frequent sequence includes a symbol and the symbol comprises formatting data for a target document;

decomposing, responsive to the determining the first frequent sequence includes a symbol, the first frequent sequence into a symbol part and a remaining part;

evaluating, by the text mining system and based on the comparing, a first confidence of the first frequent sequence being a label expression, wherein the label expression represents a document part in the target document, the evaluating the first confidence comprises:

calculating a primary confidence value for the first frequent sequence across the set of documents;

computing a likelihood of the symbol being contained in the first frequent sequence observed in the target document; and

adjusting the primary confidence value, resulting in a secondary confidence value for the first frequent sequence within the target document, wherein the adjusting the primary confidence value for the first frequent sequence to obtain the secondary confidence value is based on the likelihood of the symbol being contained in the first frequent sequence;

determining that the first confidence is above a confidence threshold;

extracting, in response to the determining the first confidence is above the confidence threshold and by the text mining system, one or more keywords from the target document based on the secondary confidence value of the first frequent sequence, wherein the extracting the one or more keywords further comprises:

applying keyword extraction to the target document, resulting in a set of keywords;

identifying that a first keyword included in the set of keywords overlaps with the first frequent sequence;

removing, based on the determining and on the identifying, the first keyword from the set of keywords; and

assigning a label relating to the first frequent sequence to a second keyword included in the set of keywords, the assigning based on positions in the target document where the first frequent sequence and the second keyword have appeared; and

outputting the one or more keywords and the secondary confidence value.
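
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation, and every function name, scoring formula, and sample document in it is a hypothetical chosen for illustration. It assumes that a sequence is maximal when every longer sequence containing it occurs in strictly fewer documents, that a symbol is any character that is neither alphanumeric nor whitespace, that the primary confidence is the fraction of documents containing the sequence, and that the secondary confidence is the primary value multiplied by the share of occurrences of the remaining part in the target document that carry the symbol. The claim itself fixes none of these choices.

```python
import re
from collections import Counter

def doc_frequencies(docs, min_len=2, max_len=12):
    """Map each substring to the number of documents that contain it."""
    counts = Counter()
    for doc in docs:
        seen = {doc[i:i + n]
                for n in range(min_len, max_len + 1)
                for i in range(len(doc) - n + 1)}
        counts.update(seen)
    return counts

def is_maximal(seq, freq):
    """Assumed test: every longer sequence containing seq is strictly rarer."""
    return all(f < freq[seq]
               for other, f in freq.items()
               if len(other) > len(seq) and seq in other)

def decompose(seq):
    """Split seq into a symbol part (assumed: punctuation) and a remaining part."""
    symbol = "".join(ch for ch in seq if not (ch.isalnum() or ch.isspace()))
    remaining = "".join(ch for ch in seq if ch.isalnum() or ch.isspace()).strip()
    return symbol, remaining

def secondary_confidence(seq, freq, n_docs, target):
    """Adjust a corpus-wide score by how often the symbol accompanies the text in target."""
    primary = freq[seq] / n_docs                 # assumed primary confidence
    _, remaining = decompose(seq)
    with_symbol = target.count(seq.strip())      # occurrences carrying the symbol
    total = target.count(remaining)              # all occurrences of the remaining part
    likelihood = with_symbol / total if total else 0.0
    return primary * likelihood

def label_keywords(target, label_seqs):
    """Drop keywords overlapping a label; tag the rest with the nearest preceding label."""
    spans = sorted((m.start(), m.end(), decompose(s)[1])
                   for s in label_seqs
                   for m in re.finditer(re.escape(s.strip()), target))
    result = []
    for m in re.finditer(r"\w+", target):        # naive keyword extraction: every word
        if any(start < m.end() and m.start() < end for start, end, _ in spans):
            continue                             # keyword overlaps a label expression
        preceding = [name for start, _, name in spans if start < m.start()]
        result.append((m.group(), preceding[-1] if preceding else None))
    return result

# Hypothetical documents structured by a common rule ("Label: text" lines).
docs = [
    "Symptom: screen flickers\nCause: loose cable",
    "Symptom: no sound\nCause: driver missing",
    "Symptom: slow boot\nCause: disk full",
]
freq = doc_frequencies(docs)
labels = [s for s in ("Symptom: ", "\nCause: ") if is_maximal(s, freq)]
for word, label in label_keywords(docs[0], labels):
    print(label, word, secondary_confidence(labels[0], freq, len(docs), docs[0]))
```

In this sketch a real system would replace the word-level tokenizer with its own keyword extractor and compare the secondary confidence against a threshold before labeling. Because substrings are capped at `max_len`, sequences at that length pass the maximality test trivially, so the cap needs to exceed the longest label expected.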