US 12,216,714 B2
Method and system of classifying documents based on layout determination
Niveditha Sureshbabu, Bengaluru (IN); Rajesh Raj, Bengaluru (IN); and Madhusudan Singh, Bangalore (IN)
Assigned to L&T TECHNOLOGY SERVICES LIMITED, Chennai (IN)
Filed by L&T TECHNOLOGY SERVICES LIMITED, Chennai (IN)
Filed on Dec. 19, 2023, as Appl. No. 18/544,486.
Claims priority of application No. 202341039663 (IN), filed on Jun. 9, 2023.
Prior Publication US 2024/0411819 A1, Dec. 12, 2024
Int. Cl. G06F 16/00 (2019.01); G06F 16/906 (2019.01); G06F 16/93 (2019.01); G06N 5/022 (2023.01)
CPC G06F 16/906 (2019.01) [G06F 16/93 (2019.01); G06N 5/022 (2013.01)]
18 Claims
 
1. A method of classifying a document, the method comprising:
determining, by a processor, line-text data for each of a plurality of lines of the document using a text extraction technique;
determining, by the processor, a set of unique keywords in the document from a predefined list of keywords based on detection of at least one alias corresponding to each of the set of keywords in the line-text data for each of the plurality of lines,
wherein the set of unique keywords is determined in a pre-defined reading sequence of the plurality of lines of the document;
determining, by the processor, a feature matrix for the set of unique keywords by:
for each keyword in the set of unique keywords:
determining, by the processor, two forward nodes as next two subsequent keywords in the set of unique keywords based on determination of a shortest distance between position of the corresponding keyword and positions of the next two subsequent keywords and based on the pre-defined reading sequence;
determining, by the processor, weights of each of the two forward nodes based on the shortest distance and an angle of each of the two forward nodes with respect to the corresponding keyword; and
determining, by the processor, a document layout of the document by:
determining, by the processor, a cluster from a plurality of clusters based on the feature matrix using a machine learning clustering model,
wherein each of the plurality of clusters corresponds to a unique document layout from a plurality of document layouts, and
wherein the machine learning clustering model is trained based on training data comprising a plurality of documents corresponding to each of the plurality of document layouts.
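
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is one possible reading of the claim, not the patented implementation. The keyword list and aliases, the (text, x, y) line format, the function names, and the layout names are all hypothetical. The claim does not say how distance and angle are combined into a weight, so the sketch simply stores both values for each forward node. A nearest-centroid lookup stands in for the trained machine learning clustering model, and the centroids are assumed to come from prior training on documents of each layout.

```python
import math

# Hypothetical predefined keyword list: canonical keyword -> aliases.
KEYWORD_ALIASES = {
    "invoice_number": ["invoice no", "invoice #"],
    "date": ["date", "dated"],
    "total": ["total", "amount due"],
}


def find_unique_keywords(lines):
    """Scan lines in reading order and keep the first hit of each keyword.

    `lines` is a list of (text, x, y) tuples already sorted in the
    pre-defined reading sequence (assumed input format).
    """
    found, seen = [], set()
    for text, x, y in lines:
        lowered = text.lower()
        for keyword, aliases in KEYWORD_ALIASES.items():
            if keyword not in seen and any(a in lowered for a in aliases):
                seen.add(keyword)
                found.append((keyword, x, y))
    return found


def feature_rows(keywords):
    """For each keyword, pick the two nearest keywords that come later in
    the reading sequence and record (distance, angle) for each of them."""
    rows = {}
    for i, (name, x, y) in enumerate(keywords):
        later = keywords[i + 1:]
        nearest = sorted(later, key=lambda k: math.hypot(k[1] - x, k[2] - y))[:2]
        row = []
        for _, nx, ny in nearest:
            row += [math.hypot(nx - x, ny - y), math.atan2(ny - y, nx - x)]
        row += [0.0] * (4 - len(row))  # pad when fewer than two forward nodes
        rows[name] = row
    return rows


def feature_vector(keywords):
    """Flatten the feature matrix to a fixed length, one row per predefined
    keyword, with zero rows for keywords absent from the document."""
    rows = feature_rows(keywords)
    vector = []
    for name in KEYWORD_ALIASES:
        vector += rows.get(name, [0.0] * 4)
    return vector


def assign_layout(vector, centroids):
    """Stand-in for the clustering model: return the layout whose
    (hypothetical, pre-trained) centroid is closest to the vector."""
    return min(centroids, key=lambda name: math.dist(vector, centroids[name]))


# Usage with made-up line-text data and made-up centroids.
lines = [("Invoice No: 123", 0, 0), ("Date: 2023-06-09", 3, 4), ("Total: 100", 0, 10)]
vector = feature_vector(find_unique_keywords(lines))
centroids = {"layout_a": vector, "layout_b": [0.0] * len(vector)}
print(assign_layout(vector, centroids))
```

Indexing the feature rows by the predefined keyword list keeps the vector length constant across documents, which a distance-based clustering model requires. This is a design choice of the sketch rather than something stated in the claim.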