US 12,298,970 B2
Method and system for analyzing natural language data by using domain-specific language models
Naan Cho, New York, NY (US); Zhen Zeng, Ypsilanti, MI (US); William Watson, Long Beach, NY (US); Manuela Veloso, New York, NY (US); Matthew Brian MacKay, Los Angeles, CA (US); and Tucker Richard Balch, Suwanee, GA (US)
Assigned to JPMORGAN CHASE BANK, N.A., New York, NY (US)
Filed by JPMorgan Chase Bank, N.A., New York, NY (US)
Filed on Jul. 3, 2023, as Appl. No. 18/217,868.
Prior Publication US 2025/0013633 A1, Jan. 9, 2025
Int. Cl. G06F 16/242 (2019.01); G06F 16/2457 (2019.01); G06F 16/25 (2019.01)
CPC G06F 16/243 (2019.01) [G06F 16/24575 (2019.01); G06F 16/258 (2019.01)]
16 Claims
 
1. A method for providing a domain-specific language model to facilitate natural language data analytics, the method being implemented by at least one processor, the method comprising:
aggregating, by the at least one processor, a plurality of documents from at least one source, each of the plurality of documents including natural language data;
ingesting, by the at least one processor, each of the plurality of documents to generate at least one structured data set that is organized according to a contextual hierarchy, wherein the ingesting of each of the plurality of documents further comprises:
attaching, by the at least one processor, at least one tag to each of the plurality of documents based on content of corresponding natural language data, each of the at least one tag including corresponding metadata;
formatting, by the at least one processor, at least one data table in each of the plurality of documents to discover at least one corresponding table boundary, wherein the formatting of the at least one data table further comprises:
associating, by the at least one processor, each of the at least one data table with a corresponding placeholder reference, the placeholder reference representing a spatial relationship between the at least one data table and the plurality of documents; and
persisting, by the at least one processor, the placeholder reference within the corresponding plurality of documents in place of the corresponding at least one data table; and
segmenting, by the at least one processor, each of the plurality of documents into at least one section by using at least one stylistic indicator and at least one contextual indicator;
determining, by the at least one processor, at least one prompt that provides domain-specific information for a language model, the domain-specific information including instructions to access the at least one structured data set;
receiving, by the at least one processor, a request via a graphical user interface, the request relating to at least one question in a natural language format;
generating, by the at least one processor using the language model, at least one software code for the request based on the at least one prompt; and
executing, by the at least one processor, each of the at least one software code to identify at least one result for the request from the at least one structured data set.
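
The ingesting step recited above (tagging, replacing each data table with a placeholder reference, and segmenting into sections) can be illustrated with a minimal sketch. It is not the claimed implementation: every name below (`ingest`, `IngestedDocument`, the `[[TABLE:n:line=k]]` placeholder format) is hypothetical, table boundaries are assumed to be runs of pipe-delimited lines, and only a stylistic indicator (an all-caps heading line) is used for segmentation; a contextual indicator is not modeled.

```python
import re
from dataclasses import dataclass, field

# Assumption: a table is a run of consecutive lines delimited by '|'.
TABLE_LINE = re.compile(r"^\s*\|.*\|\s*$")
# Assumption: an all-caps line is a stylistic indicator of a section heading.
HEADING = re.compile(r"^[A-Z][A-Z0-9 ]+$")


@dataclass
class IngestedDocument:
    tags: dict                                     # tag name -> metadata value
    tables: dict = field(default_factory=dict)     # placeholder -> table rows
    sections: dict = field(default_factory=dict)   # heading -> body lines


def ingest(text: str, source: str) -> IngestedDocument:
    """Tag a document, swap tables for placeholders, and split it into sections."""
    doc = IngestedDocument(tags={"source": source})
    lines_out = []
    current_table = []

    def flush_table():
        # Close the table being collected and persist a placeholder in its
        # place. The placeholder records the line position, which stands in
        # for the spatial relationship between table and document.
        if current_table:
            ref = f"[[TABLE:{len(doc.tables)}:line={len(lines_out)}]]"
            doc.tables[ref] = list(current_table)
            lines_out.append(ref)
            current_table.clear()

    for line in text.splitlines():
        if TABLE_LINE.match(line):
            current_table.append(line.strip())
        else:
            flush_table()
            lines_out.append(line)
    flush_table()  # handle a table that ends the document

    # Segment on heading lines; text before the first heading goes to PREAMBLE.
    heading = "PREAMBLE"
    for line in lines_out:
        stripped = line.strip()
        if HEADING.match(stripped):
            heading = stripped
            doc.sections.setdefault(heading, [])
        elif stripped:
            doc.sections.setdefault(heading, []).append(stripped)
    return doc
```

Under these assumptions, calling `ingest(report_text, "10-K")` yields a structure in which each section body carries the placeholder where the table originally sat, while the table rows are stored separately under the same placeholder key. Software code generated by a language model could then look tables up by reference instead of parsing them out of running text.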