| CPC G06F 40/211 (2020.01) [G06F 40/284 (2020.01)] | 4 Claims |

|
1. A system for training a language model for cybersecurity, the system comprising:
a memory storing instructions; and
a processor configured to execute the instructions to:
collect a cybersecurity document used for training the language model for cybersecurity, wherein the cybersecurity document includes linguistic elements and non-linguistic elements, and the non-linguistic elements include completely non-linguistic elements that are arbitrary strings and have no linguistic meaning, and paralinguistic elements from which linguistic content can be inferred, and wherein the completely non-linguistic elements include at least one of a Bitcoin address, a hash value, an IP address, and a vulnerability identifier, and the paralinguistic elements include at least one of a uniform resource locator (URL) and an email address,
identify the non-linguistic elements in the cybersecurity document based on a non-linguistic element database,
tokenize the cybersecurity document to generate a plurality of tokens as input data for training the language model,
randomly mask the generated tokens excluding tokens corresponding to the completely non-linguistic elements,
input the entire sequence of the generated tokens including the randomly masked tokens into the language model as the input data,
extract, by the language model, features of the input data,
generate, by the language model, vectors corresponding to the features, and
train the language model to simultaneously perform a first task and a second task by referring to the vectors, wherein the first task is a task of classifying types of the tokens corresponding to the completely non-linguistic elements of the non-linguistic elements and tokens corresponding to the paralinguistic elements of the non-linguistic elements included in the cybersecurity document and the second task is a task of recovering only the tokens corresponding to the paralinguistic elements of the non-linguistic elements and tokens corresponding to the linguistic elements in the cybersecurity document.
|
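
The identification, masking, and two-task labelling steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation. The regular expressions stand in for the claimed non-linguistic element database, whitespace splitting stands in for the tokenizer, and the names `classify`, `build_example`, `[MASK]`, and the `"O"` label are hypothetical choices made only for this example. No language model is trained here; the sketch only shows how the inputs and the targets for the two tasks might be prepared.

```python
import random
import re

# Hypothetical stand-in for the "non-linguistic element database":
# patterns for completely non-linguistic elements (never masked).
COMPLETELY_NON_LINGUISTIC = {
    "HASH": re.compile(r"^(?:[0-9a-fA-F]{32}|[0-9a-fA-F]{40}|[0-9a-fA-F]{64})$"),
    "IP": re.compile(r"^(?:\d{1,3}\.){3}\d{1,3}$"),
    "CVE": re.compile(r"^CVE-\d{4}-\d{4,}$"),
    "BTC": re.compile(r"^(?:bc1|[13])[a-km-zA-HJ-NP-Z1-9]{25,39}$"),
}

# Patterns for paralinguistic elements (may be masked and recovered).
PARALINGUISTIC = {
    "URL": re.compile(r"^https?://\S+$"),
    "EMAIL": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
}

MASK = "[MASK]"  # assumed mask symbol


def classify(token):
    """Return (category, type) for a token; linguistic tokens get type "O"."""
    for name, pattern in COMPLETELY_NON_LINGUISTIC.items():
        if pattern.match(token):
            return "completely_non_linguistic", name
    for name, pattern in PARALINGUISTIC.items():
        if pattern.match(token):
            return "paralinguistic", name
    return "linguistic", "O"


def build_example(text, mask_prob=0.15, rng=None):
    """Prepare one training example.

    Returns three parallel lists:
      inputs         - the full token sequence with some tokens replaced by MASK
      type_labels    - first-task targets (element type per token)
      recover_labels - second-task targets (original token where masked, else None)
    """
    rng = rng or random.Random()
    inputs, type_labels, recover_labels = [], [], []
    for token in text.split():  # whitespace split is a simplification
        category, element_type = classify(token)
        type_labels.append(element_type)
        # Completely non-linguistic tokens are excluded from random masking.
        maskable = category != "completely_non_linguistic"
        if maskable and rng.random() < mask_prob:
            inputs.append(MASK)
            recover_labels.append(token)
        else:
            inputs.append(token)
            recover_labels.append(None)
    return inputs, type_labels, recover_labels
```

In this sketch, tokens such as IP addresses and vulnerability identifiers always remain visible in the input and contribute only to the type-classification targets, while URLs, email addresses, and ordinary words can be masked and therefore also contribute recovery targets.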