US 12,242,433 B2
Automatic database enrichment and curation using large language models
Georg Gottlob, Nicosia (CY); and Jinsong Guo, London (GB)
Assigned to Ratiolytics Limited, Nicosia (CY)
Filed by Ratiolytics Limited, Nicosia (CY)
Filed on Aug. 2, 2024, as Appl. No. 18/792,699.
Claims priority of provisional application 63/530,795, filed on Aug. 4, 2023.
Prior Publication US 2025/0045256 A1, Feb. 6, 2025
Int. Cl. G06F 16/00 (2019.01); G06F 16/21 (2019.01); G06F 16/215 (2019.01); G06F 16/23 (2019.01); G06F 16/2455 (2019.01)
CPC G06F 16/211 (2019.01) [G06F 16/215 (2019.01); G06F 16/2365 (2019.01); G06F 16/24564 (2019.01)]
30 Claims
1. A method to be executed by a computing device, the method comprising:
(i) accessing and/or modifying a database by at least one processor, the database to be automatically curated, the database being accessible to, and modifiable by, a controller, the database containing zero or more structured datasets and zero or more metadata items, where the database, in its initial state, is referred to as an input database,
(ii) accessing zero or more additional external data or information sources by the at least one processor, and from which specific data or information useful to a database curation task can be obtained by at least one of:
(a) performing database access,
(b) retrieving information for retrieval augmented generation, and
(c) automatically extracting structured data,
(iii) using one or more pre-trained large language models (LLMs) accessed via application programming interfaces (API) or other connections, issuing prompts and retrieving answers to prompts and executing one or more database curation requests by the at least one processor, each request specifying a database curation task to be performed on at least a sub-structure of the database, the database curation tasks comprising:
(a) a database enrichment task to compute new data records comprising tuples or other associations of data items and insert the records into the sub-structure of the database,
(b) a database verification task to verify, with help of the one or more LLMs, data contained in the sub-structure, and, when incorrect data is identified, return the incorrect data as output for further processing or correction,
(c) a database update task to automatically recognize and update erroneous or outdated data within the sub-structure of the database, and
(d) a null-value or a missing value replacement task to replace null-values contained in the sub-structure of the database by concrete data values,
wherein zero or more constraints are accessible to the controller, each constraint being either a database constraint expressing a property to be fulfilled and maintained by the dataset, or a process constraint that restricts the control flow to be generated by the controller,
where the input database, the task description, and the zero or more constraints are jointly referred to as input, and where the requested database curation tasks are automatically performed by the controller by a computation that uses the one or more LLMs and that performs a sequence of prompts to the LLMs such that each prompt of the sequence is generated by the controller from at least one of (a) the input, (b) forms of and answers to previously issued prompts from the sequence, and (c) intermediate calculations of the controller.
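
The controller loop recited in claim 1 can be illustrated with a minimal sketch. This is not the patented implementation; it only shows, under stated assumptions, how a null-value replacement task might issue a sequence of prompts built from the input, the earlier answers, and the controller's intermediate state. All names here (`CurationInput`, `curate_nulls`, `fake_llm`) are hypothetical, the LLM is a local stub rather than a real API, a one-table list of dictionaries stands in for the database, and a prompt budget stands in for a process constraint.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

Row = dict[str, Optional[str]]  # one record; None models a null value


@dataclass
class CurationInput:
    """Hypothetical bundle of the 'input': database, task, constraints."""
    database: list[Row]
    task: str
    constraints: list[Callable[[Row], bool]] = field(default_factory=list)


def curate_nulls(inp: CurationInput,
                 llm: Callable[[str], str],
                 max_prompts: int = 10) -> list[Row]:
    """Replace nulls by asking the LLM, one prompt per missing value."""
    history: list[tuple[str, str]] = []          # (prompt, answer) pairs so far
    result = [dict(row) for row in inp.database]  # work on a copy of the input
    for row in result:
        for column, value in list(row.items()):
            if value is not None:
                continue
            if len(history) >= max_prompts:      # assumed process constraint
                return result
            # Intermediate calculation: the fields already known for this row.
            known = {k: v for k, v in row.items() if v is not None}
            previous = [answer for _, answer in history]
            prompt = (f"Task: {inp.task}. Known fields: {known}. "
                      f"Previous answers: {previous}. "
                      f"Give the value of '{column}'.")
            answer = llm(prompt).strip()
            history.append((prompt, answer))
            # Accept the answer only if every database constraint holds.
            candidate = {**row, column: answer}
            if all(check(candidate) for check in inp.constraints):
                row[column] = answer
    return result


def fake_llm(prompt: str) -> str:
    """Stub standing in for a real LLM call; purely illustrative."""
    return "Paris" if "France" in prompt else "unknown"
```

A caller would pass a table such as `[{"country": "France", "capital": None}]`, a task label, and constraints such as `lambda r: r["capital"] != "unknown"`; answers that violate a constraint are discarded and the null is left in place.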