US 12,455,980 B2
Large language model privacy preservation system
Gennaro Anthony Cuomo, Raleigh, NC (US); Blaine H. Dolph, Western Springs, IL (US); and Christopher Hay, Great Horkesley (GB)
Assigned to International Business Machines Corporation, Armonk, NY (US)
Filed by International Business Machines Corporation, Armonk, NY (US)
Filed on Sep. 13, 2023, as Appl. No. 18/466,049.
Prior Publication US 2025/0086310 A1, Mar. 13, 2025
Int. Cl. G06F 21/62 (2013.01); G06F 16/3332 (2025.01)
CPC G06F 21/6245 (2013.01) [G06F 16/3335 (2019.01)] 20 Claims
OG exemplary drawing
 
1. A computer-implemented method comprising:
receiving prompt data from a user device for processing by a large language model;
generating pre-processed prompt data using the prompt data from the user device, wherein generating the pre-processed prompt data comprises detecting personally identifiable information (PII) in the prompt data, substituting each detected PII instance with a placeholder token, and removing sensitive information and irrelevant information including punctuation and stop words;
deleting the original prompt data from volatile memory upon completion of the substitution of detected PII with placeholder tokens, such that only the redacted prompt data remains in memory;
identifying a category for the pre-processed prompt data using topic modeling, wherein the topic modeling identifies topics based on patterns of word and phrase clusters and frequencies of words in the pre-processed prompt data;
generating normalized prompt data using the pre-processed prompt data, wherein the normalized prompt data retains key elements of the prompt data while preserving semantic essence without personally identifiable information; and
storing the category and the normalized prompt data by generating a data object containing both the category and the normalized prompt data, and storing only the data object in a category-indexed datastore for use in large language model applications including personalized content recommendations, quality control, refined model training, resource optimization, and research in natural language processing, wherein the original prompt data is not retained in any storage after deletion from volatile memory.
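The pre-processing steps recited above (detecting PII, substituting each instance with a placeholder token, and removing punctuation and stop words) might be sketched as follows. This is a minimal illustration only: the regex patterns, the stop-word list, and the function name `preprocess_prompt` are assumptions for the sketch, not the detection method actually claimed or disclosed in the patent.

```python
import re
import string

# Illustrative PII patterns (assumed for this sketch; a real system would
# use a far richer detector, e.g. named-entity recognition).
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

# Tiny illustrative stop-word list.
STOP_WORDS = {"the", "a", "an", "is", "to", "of", "and", "my", "for", "at"}

def preprocess_prompt(prompt: str) -> str:
    """Substitute each detected PII instance with a placeholder token,
    then strip punctuation and stop words from the remaining text."""
    redacted = prompt
    for label, pattern in PII_PATTERNS.items():
        redacted = pattern.sub(f"[{label}]", redacted)
    tokens = []
    for word in redacted.split():
        # Keep placeholder tokens intact; strip punctuation from plain words.
        stripped = word if word.startswith("[") else word.strip(string.punctuation)
        if stripped and stripped.lower() not in STOP_WORDS:
            tokens.append(stripped)
    return " ".join(tokens)
```

Per the claim, the original prompt would then be deleted from volatile memory so that only the redacted output of a function like this remains.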
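The categorization and storage steps (identifying a category from word-cluster and frequency patterns, then storing only a data object holding the category and the normalized prompt in a category-indexed datastore) might be sketched as below. The cluster table, category names, and keyword-overlap scoring are assumptions standing in for the claimed topic modeling, not the patent's actual model.

```python
from collections import Counter, defaultdict

# Assumed topic categories keyed by small word clusters; a real topic model
# would learn these clusters and frequencies rather than hard-code them.
TOPIC_CLUSTERS = {
    "billing": {"invoice", "payment", "refund", "charge"},
    "technical_support": {"error", "crash", "install", "login"},
}

def identify_category(preprocessed: str) -> str:
    """Score each topic by the frequency of its cluster words in the
    pre-processed prompt and return the best-matching category."""
    counts = Counter(w.lower() for w in preprocessed.split())
    scores = {
        topic: sum(counts[w] for w in cluster)
        for topic, cluster in TOPIC_CLUSTERS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "uncategorized"

# Category-indexed datastore: only the data object (category plus
# normalized prompt) is retained; the original prompt is never stored.
datastore = defaultdict(list)

def store_normalized(category: str, normalized_prompt: str) -> None:
    datastore[category].append(
        {"category": category, "normalized_prompt": normalized_prompt}
    )
```

Indexing the datastore by category in this way is what enables the downstream uses the claim lists, such as personalized recommendations and refined model training, without retaining any PII.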