US 12,450,496 B1
Systems and methods for classifying strings of arbitrary length in a large number of classes
Joseph Lo, Newark, NJ (US); Jaiden Fallo, Newark, NJ (US); and Nicholas Aronow, Newark, NJ (US)
Assigned to Broadridge Financial Solutions, Inc., Newark, NJ (US)
Filed by Broadridge Financial Solutions, Inc., Newark, NJ (US)
Filed on Oct. 2, 2024, as Appl. No. 18/904,741.
Int. Cl. G06N 3/096 (2023.01); G06F 16/35 (2019.01); G06F 16/93 (2019.01); G06N 3/045 (2023.01)
CPC G06N 3/096 (2023.01) [G06F 16/35 (2019.01); G06F 16/93 (2019.01); G06N 3/045 (2023.01)] 19 Claims
OG exemplary drawing
 
1. A method, comprising:
training, by at least one computing device, at least one pre-trained large language model (LLM), the training comprising:
obtaining, by the at least one computing device, at least one tagged document of a predetermined type;
extracting, by the at least one computing device, from the at least one tagged document, one or more chunks of text of a first predetermined size, each chunk of text being a portion of the at least one tagged document;
determining, by the at least one computing device, for each chunk of text, a set of tags pertaining to the predetermined type of document;
wherein each tag in the set of tags is associated with a respective text segment;
identifying, by the at least one computing device, for each tag in the set of tags, at least one value in the respective text segment associated with that tag;
generating, by the at least one computing device, a plurality of first training pairs comprising a plurality of tag: text pairs based on the identified at least one value associated with each tag;
identifying, by the at least one computing device, context information, among the chunks of text, associated with each of the plurality of tag: text pairs;
generating, by the at least one computing device, a plurality of second training pairs, each second training pair comprising:
a respective tag: text pair of the plurality of tag: text pairs, and
a respective context information;
transforming, by the at least one computing device, the plurality of first training pairs and the plurality of second training pairs to form a plurality of training messages;
producing, by the at least one computing device, in the at least one pre-trained LLM, at least one hierarchical run time using the plurality of training messages to produce at least one hierarchical document tagging (HDT) LLM by iteratively:
providing at least one first training message comprising the plurality of first training pairs to train the at least one pre-trained LLM, and
providing at least one second training message comprising the plurality of second training pairs to train the at least one pre-trained LLM;
providing, by the at least one computing device, the at least one HDT LLM, configured to utilize the at least one hierarchical run time with an unseen document of the predetermined type to output a list of tags from the unseen document and at least one value associated with each tag in the list of tags, the at least one value of each tag being determined from an associated segment of text of the unseen document.
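The training-data preparation recited in the claim (chunking a tagged document, extracting tag: text pairs, attaching chunk-level context, and transforming the pairs into training messages) can be sketched as follows. This is a minimal illustration only; the patent does not disclose source code, and all function names, the line-based segment heuristic, and the prompt/completion message format are assumptions made for clarity, not the claimed implementation.

```python
def chunk_text(text: str, size: int) -> list[str]:
    """Split a document into chunks of a predetermined size (here, characters)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def extract_tag_pairs(chunk: str, tags: list[str]) -> dict[str, str]:
    """Map each tag to the text segment in which its value appears.

    As a stand-in heuristic, a 'segment' is a line of the chunk that
    mentions the tag; a real system would use the document's tagging.
    """
    pairs = {}
    for line in chunk.splitlines():
        for tag in tags:
            if tag in line:
                pairs[tag] = line.strip()
    return pairs


def build_training_messages(document: str, tags: list[str],
                            chunk_size: int = 80):
    """Produce the two claimed message sets as prompt/completion dicts:

    - first training pairs: tag: text pairs alone;
    - second training pairs: the same pairs plus surrounding-chunk context.
    """
    first, second = [], []
    for chunk in chunk_text(document, chunk_size):
        for tag, segment in extract_tag_pairs(chunk, tags).items():
            first.append({"prompt": f"Tag: {tag}", "completion": segment})
            second.append({"prompt": f"Tag: {tag}\nContext: {chunk}",
                           "completion": segment})
    return first, second
```

Per the claim, the first and second message sets would then be supplied iteratively to a pre-trained LLM's fine-tuning procedure (e.g., pairs without context, then pairs with context) to produce the HDT LLM; that fine-tuning step is framework-specific and is not shown here.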