US 12,423,520 B2
Method of training language model for cybersecurity and system performing the same
Seung Won Shin, Daejeon (KR); Young Jin Jin, Daejeon (KR); Eu Gene Jang, Seongnam-si (KR); Da Yeon Yim, Seoul (KR); Jin Woo Chung, Seongnam-si (KR); Yong Jae Lee, Seongnam-si (KR); Jian Cui, Bloomington, IN (US); Chang Hoon Yoon, Seongnam-si (KR); and Seung Yong Yang, Seoul (KR)
Assigned to S2W INC., Seongnam-si (KR); and KOREA ADVANCED INSTITUTE OF SCIENCE AND TECHNOLOGY, Daejeon (KR)
Filed by S2W INC., Seongnam-si (KR); and KOREA ADVANCED INSTITUTE OF SCIENCE AND TECHNOLOGY, Daejeon (KR)
Filed on Dec. 9, 2024, as Appl. No. 18/973,338.
Claims priority of application No. 10-2023-0177674 (KR), filed on Dec. 8, 2023.
Prior Publication US 2025/0190698 A1, Jun. 12, 2025
Int. Cl. G06F 40/211 (2020.01); G06F 40/284 (2020.01)
CPC G06F 40/211 (2020.01) [G06F 40/284 (2020.01)] 4 Claims
OG exemplary drawing
 
1. A system for training a language model for cybersecurity, the system comprising:
a memory storing instructions; and
a processor configured to execute the instructions to:
collect a cybersecurity document used for training the language model for cybersecurity, wherein the cybersecurity document includes linguistic elements and non-linguistic elements, and the non-linguistic elements include completely non-linguistic elements that are arbitrary strings and have no linguistic meaning, and paralinguistic elements from which linguistic content can be inferred, and wherein the completely non-linguistic elements include at least one of a Bitcoin address, a hash value, an IP address, and a vulnerability identifier, and the paralinguistic elements include at least one of a uniform resource locator (URL) and an email address,
identify the non-linguistic elements in the cybersecurity document based on a non-linguistic element database,
tokenize the cybersecurity document to generate a plurality of tokens as input data for training the language model,
randomly mask the generated tokens excluding tokens corresponding to the completely non-linguistic elements,
input the entire sequence of the generated tokens including the randomly masked tokens into the language model as the input data,
extract, by the language model, features of the input data,
generate, by the language model, vectors corresponding to the features, and
train the language model to simultaneously perform a first task and a second task by referring to the vectors, wherein the first task is a task of classifying types of the tokens corresponding to the completely non-linguistic elements of the non-linguistic elements and tokens corresponding to the paralinguistic elements of the non-linguistic elements included in the cybersecurity document, and the second task is a task of recovering only the tokens corresponding to the paralinguistic elements of the non-linguistic elements and tokens corresponding to the linguistic elements in the cybersecurity document.
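The selective-masking step in the claim (mask randomly, but never mask tokens for completely non-linguistic elements such as hashes, IP addresses, or vulnerability identifiers) can be sketched as follows. This is an illustrative sketch only, not the patented implementation: the regex patterns stand in for the non-linguistic element database the claim describes, and all names (`classify`, `mask_tokens`, the `CNL`/`PARA`/`LING` labels) are hypothetical.

```python
import random
import re

# Hypothetical element-type labels (illustrative, not from the patent text).
COMPLETELY_NON_LINGUISTIC = "CNL"   # e.g., Bitcoin address, hash value, IP address, CVE id
PARALINGUISTIC = "PARA"             # e.g., URL, email address
LINGUISTIC = "LING"                 # ordinary natural-language tokens

# A minimal stand-in for the "non-linguistic element database": a few regexes.
# A real system would use a curated database, as the claim describes.
PATTERNS = [
    (re.compile(r"^CVE-\d{4}-\d{4,}$"), COMPLETELY_NON_LINGUISTIC),      # vulnerability id
    (re.compile(r"^(?:\d{1,3}\.){3}\d{1,3}$"), COMPLETELY_NON_LINGUISTIC),  # IPv4 address
    (re.compile(r"^[0-9a-f]{32,64}$"), COMPLETELY_NON_LINGUISTIC),       # hex hash value
    (re.compile(r"^https?://\S+$"), PARALINGUISTIC),                     # URL
    (re.compile(r"^\S+@\S+\.\S+$"), PARALINGUISTIC),                     # email address
]

def classify(token):
    """Label a token using the (sketch) non-linguistic element database."""
    for pattern, kind in PATTERNS:
        if pattern.match(token):
            return kind
    return LINGUISTIC

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly mask tokens, excluding completely non-linguistic ones.

    Returns the masked sequence plus per-token type labels; the labels are
    what the first (classification) task would predict, while recovery of
    the masked linguistic/paralinguistic tokens is the second task.
    """
    rng = random.Random(seed)
    kinds = [classify(t) for t in tokens]
    masked = []
    for tok, kind in zip(tokens, kinds):
        # Completely non-linguistic tokens are never masked: they are
        # arbitrary strings, so there is no linguistic signal to recover.
        if kind != COMPLETELY_NON_LINGUISTIC and rng.random() < mask_prob:
            masked.append("[MASK]")
        else:
            masked.append(tok)
    return masked, kinds
```

The design point the sketch illustrates: because completely non-linguistic elements carry no recoverable linguistic content, they are excluded from masking and are targeted only by the type-classification task, whereas paralinguistic and linguistic tokens participate in both classification and recovery.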