US 11,782,957 B2
Systems and methods for automated classification of a document
Kathan Roberts, Palo Alto, CA (US); Max Weiland Rosen, San Francisco, CA (US); Joerg Bredno, San Francisco, CA (US); Jafi Lipson, Palo Alto, CA (US); and Harit Nandani, San Carlos, CA (US)
Assigned to GRAIL, LLC, Menlo Park, CA (US)
Filed by GRAIL, LLC, Menlo Park, CA (US)
Filed on Apr. 6, 2022, as Appl. No. 17/714,826.
Claims priority of provisional application 63/172,471, filed on Apr. 8, 2021.
Claims priority of provisional application 63/248,755, filed on Sep. 27, 2021.
Prior Publication US 2022/0327145 A1, Oct. 13, 2022
Int. Cl. G06F 17/00 (2019.01); G06F 16/28 (2019.01); G06V 30/19 (2022.01); G06N 20/00 (2019.01); G16H 15/00 (2018.01)
CPC G06F 16/285 (2019.01) [G06N 20/00 (2019.01); G06V 30/19173 (2022.01); G16H 15/00 (2018.01)]
24 Claims
 
1. A computer-implemented method for extracting information from a dataset, comprising:
  receiving, at an information handling device, a dataset;
  extracting, via optical character recognition implemented by a processor of the information handling device, textual information associated with the dataset; and
  classifying the dataset into one of a plurality of classes, the classifying further comprising:
    computing a similarity score for each of the plurality of classes for each of a plurality of window regions of the dataset, the computing further comprising:
      sliding a window across the textual information to define the plurality of window regions, and for each of the plurality of window regions:
        computing a relevance metric for the window region; and
        calculating the similarity score for each of the plurality of classes by calculating a similarity function between the relevance metric for the window region and an average relevance metric for each of the plurality of classes;
    determining, based on a subset of highest similarity scores computed for each of the plurality of classes for each of the plurality of window regions, overall similarity scores for each of the plurality of classes for the dataset; and
    classifying the dataset as corresponding to a class of the plurality of classes with a highest overall similarity score for the dataset.
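
The classifying steps recited in claim 1 can be illustrated with a minimal sketch. The claim does not fix the relevance metric, the similarity function, or how the subset of highest scores is combined, so the choices below are assumptions made only for illustration: the relevance metric is a term-frequency vector over a fixed vocabulary, the similarity function is cosine similarity, and the overall score is the mean of the top_k highest window scores. All names (relevance, cosine, classify, class_means, vocab, window, stride, top_k) are hypothetical, and the input is assumed to be already-tokenized OCR text.

```python
import math
from collections import Counter


def relevance(tokens, vocab):
    """Assumed relevance metric: term frequencies over a fixed vocabulary."""
    counts = Counter(tokens)
    n = max(len(tokens), 1)
    return [counts[term] / n for term in vocab]


def cosine(a, b):
    """Assumed similarity function: cosine similarity (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(tokens, class_means, vocab, window=4, stride=1, top_k=2):
    """Slide a window over the tokens, score each window region against each
    class's average relevance vector, and pick the class whose top_k window
    scores have the highest mean."""
    last_start = max(len(tokens) - window, 0)
    scores = {label: [] for label in class_means}

    # One window region per start position.
    for start in range(0, last_start + 1, stride):
        vec = relevance(tokens[start:start + window], vocab)
        for label, mean_vec in class_means.items():
            scores[label].append(cosine(vec, mean_vec))

    # Overall score per class: mean of the top_k highest window scores
    # (an assumed way of combining the "subset of highest similarity scores").
    overall = {}
    for label, window_scores in scores.items():
        best = sorted(window_scores, reverse=True)[:top_k]
        overall[label] = sum(best) / len(best)

    return max(overall, key=overall.get), overall


if __name__ == "__main__":
    vocab = ["invoice", "total", "patient", "diagnosis"]
    class_means = {
        "billing": [0.5, 0.5, 0.0, 0.0],
        "clinical": [0.0, 0.0, 0.5, 0.5],
    }
    text = "the patient record lists a diagnosis for the patient and one invoice"
    label, overall = classify(text.split(), class_means, vocab)
    print(label, overall)
```

Taking only the highest-scoring windows, rather than averaging over every window, lets a short but distinctive passage decide the class even when most of the document is uninformative text. In practice the class averages would be computed from labeled example documents rather than written by hand as they are here.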