US 12,277,608 B2
Machine learning system for summarizing tax documents with non-structured portions
Geralyn R. Hurd, Chicago, IL (US); Nathaniel J. Jones, Grand Rapids, MI (US); Camron Momeni, Chicago, IL (US); and Justin A. Bass, Chicago, IL (US)
Assigned to K1X, Inc., Chicago, IL (US)
Filed by K1X, Inc., Chicago, IL (US)
Filed on Feb. 29, 2024, as Appl. No. 18/591,116.
Application 18/591,116 is a continuation of application No. 16/571,775, filed on Sep. 16, 2019, granted, now 11,941,706.
Prior Publication US 2024/0212061 A1, Jun. 27, 2024
This patent is subject to a terminal disclaimer.
Int. Cl. G06Q 30/00 (2023.01); G06N 20/00 (2019.01); G06Q 40/12 (2023.01); G06V 30/19 (2022.01); G06V 30/412 (2022.01); G06V 30/10 (2022.01)
CPC G06Q 40/123 (2013.12) [G06N 20/00 (2019.01); G06V 30/19167 (2022.01); G06V 30/19173 (2022.01); G06V 30/412 (2022.01); G06V 30/10 (2022.01)] 16 Claims
 
12. A method for summarizing tax documents that include an unstructured portion, the method comprising:
receiving a tax document that includes a structured facepage portion and an unstructured free form whitepaper portion;
identifying which portion of the tax document is the structured facepage portion and which portion is the unstructured whitepaper portion;
extracting a plurality of structured data elements from the structured facepage portion;
extracting, using a machine learning model, a plurality of unstructured data elements from the unstructured free form whitepaper portion, wherein the machine learning model is configured to extract a plurality of data elements from the unstructured free form whitepaper portion related to one or more of state apportionment, unrelated business taxable income (“UBTI”) data, and/or foreign disclosures from the whitepaper portion of the tax document;
generating, using a machine learning model, a confidence level associated with each extracted unstructured data element, wherein the confidence level represents a prediction on how likely the extracted unstructured data element was accurately extracted;
generating a document in an electronic interchange format that represents: (i) the plurality of extracted structured data elements from the structured facepage portion; (ii) the plurality of extracted unstructured data elements from the unstructured whitepaper portion; and (iii) the confidence level associated with each of the plurality of extracted unstructured data elements;
establishing a confidence level threshold and flagging any extracted unstructured data elements with an associated confidence level below the confidence level threshold, wherein the confidence level threshold is user-adjustable;
adjusting one or more parameters of the machine learning model by: (i) training and/or re-training the machine learning model based on historical data extractions; (ii) generating, based on the training or re-training, a proposed threshold confidence level as to whether extracted data elements were correctly extracted; (iii) receiving user feedback regarding correction of one or more unstructured data elements; and (iv) adjusting the confidence level based on the user feedback; and
wherein the plurality of extracted unstructured data elements in the electronic interchange format are organized into a data schema with: (1) one or more top-level fields including a form year, an entity name, an investor name or a filing date; (2) an array of data corresponding to one or more parts of the tax document; and (3) additional extracted fields including (a) state apportionment data, (b) unrelated business taxable income (“UBTI”) data, and/or (c) foreign disclosures data.
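The data schema and threshold flagging recited in claim 12 can be illustrated with a minimal sketch. It is not the patented implementation. It assumes JSON as the electronic interchange format, and every class, function, and field name (`ExtractedElement`, `TaxDocumentSummary`, `flag_low_confidence`, `to_interchange`) is hypothetical. The machine learning extraction itself is out of scope here, so the confidence values are taken as given inputs.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class ExtractedElement:
    """One unstructured data element plus its confidence level (hypothetical shape)."""
    name: str
    value: str
    confidence: float      # assumed to lie in [0, 1]
    flagged: bool = False  # set by flag_low_confidence()


@dataclass
class TaxDocumentSummary:
    # (1) top-level fields
    form_year: int
    entity_name: str
    investor_name: str
    filing_date: str = ""
    # (2) array of data corresponding to parts of the tax document
    parts: list = field(default_factory=list)
    # (3) additional fields extracted from the whitepaper portion
    state_apportionment: list = field(default_factory=list)
    ubti: list = field(default_factory=list)
    foreign_disclosures: list = field(default_factory=list)

    def unstructured_elements(self):
        """All whitepaper-derived elements, in schema order."""
        return self.state_apportionment + self.ubti + self.foreign_disclosures


def flag_low_confidence(summary, threshold):
    """Flag each unstructured element whose confidence is below the
    user-adjustable threshold, and return the flagged elements."""
    flagged = []
    for element in summary.unstructured_elements():
        element.flagged = element.confidence < threshold
        if element.flagged:
            flagged.append(element)
    return flagged


def to_interchange(summary):
    """Serialize the summary, confidence levels included, to JSON (assumed format)."""
    return json.dumps(asdict(summary), indent=2)
```

Calling `flag_low_confidence` again with a different threshold recomputes every flag, which is one simple way to model a user-adjustable threshold. The re-training and feedback steps of the claim are not modeled in this sketch.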