US 12,475,330 B2
Method for identifying noise samples, electronic device, and storage medium
Huapeng Qin, Beijing (CN); Min Zhao, Beijing (CN); and Guoxin Zhang, Beijing (CN)
Assigned to BEIJING BAIDU NETCOM SCIENCE TECHNOLOGY CO., LTD., Beijing (CN)
Filed by BEIJING BAIDU NETCOM SCIENCE TECHNOLOGY CO., LTD., Beijing (CN)
Filed on Sep. 29, 2022, as Appl. No. 17/956,558.
Claims priority of application No. 202111165584.4 (CN), filed on Sep. 30, 2021.
Prior Publication US 2023/0023789 A1, Jan. 26, 2023
Int. Cl. G06F 40/51 (2020.01); G06F 40/47 (2020.01); G06F 40/49 (2020.01); G06F 40/55 (2020.01)

CPC G06F 40/51 (2020.01) [G06F 40/47 (2020.01); G06F 40/49 (2020.01); G06F 40/55 (2020.01)]

10 Claims

1. A method for identifying noise samples, performed by an electronic device, comprising:

obtaining an original sample set;

obtaining a target sample set by adding masks to original training corpora in the original sample set using a preset adjustment rule;

performing mask prediction on a plurality of target training corpora in the target sample set using a language model to obtain a first mask prediction character corresponding to each target training corpus;

matching the first mask prediction character corresponding to each target training corpus with a preset condition;

according to target training corpora of which first mask prediction characters do not match the preset condition in the target sample set, determining corresponding original training corpora in the original sample set as noise samples;

modifying the noise samples; and

updating the original sample set using the modified noise samples; and continuing to train the language model based on the modified noise samples until the language model satisfies a requirement of a downstream application,

wherein the language model is obtained by training a masked pre-trained language model using the plurality of target training corpora in the target sample set,

wherein the method further comprises:

determining the adjustment rule according to a training task performed by the masked pre-trained language model during a pre-training process,

wherein the training task comprises a text classification task; and

the preset adjustment rule comprises:

for each original training corpus, splicing the original training corpus and a spliced text to obtain a target training corpus corresponding to the training task; wherein, the spliced text is obtained by splicing a text segment in the original training corpus and a second category label mask through a second associated word, and the second category label mask indicates that a corresponding position is predicted as a category corresponding to the text segment;

wherein each original training corpus in the original sample set has a label, and the language model is determined by:

performing mask prediction on a plurality of target training corpora in the target sample set using the masked pre-trained language model to obtain a second mask prediction character corresponding to each target training corpus;

adjusting model parameters of the masked pre-trained language model according to a difference between the second mask prediction character corresponding to each target training corpus and the label of the corresponding original training corpora; and

determining the language model obtained after adjusting as the language model for identifying noise samples.
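
The splicing rule and the noise-identification steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation. The names `Sample`, `build_target_corpus`, `find_noise_samples`, and `toy_predict_mask`, the `[MASK]` token string, and the associated word "is" are all hypothetical. The preset condition is assumed here to be "the predicted character equals the sample's label", which the claim does not specify. A keyword rule stands in for the fine-tuned masked language model.

```python
from dataclasses import dataclass
from typing import Callable, List

MASK = "[MASK]"  # hypothetical mask token; the claim only speaks of a "category label mask"


@dataclass(frozen=True)
class Sample:
    text: str     # original training corpus
    segment: str  # text segment taken from the original training corpus
    label: str    # label attached to the original training corpus


def build_target_corpus(sample: Sample, associated_word: str = "is") -> str:
    """Splice the segment and the mask through an associated word,
    then splice that text onto the original corpus."""
    spliced_text = f"{sample.segment} {associated_word} {MASK}"
    return f"{sample.text} {spliced_text}"


def find_noise_samples(
    samples: List[Sample],
    predict_mask: Callable[[str], str],
    matches_condition: Callable[[str, Sample], bool],
) -> List[Sample]:
    """Return the original samples whose mask prediction fails the preset condition."""
    noise = []
    for sample in samples:
        target = build_target_corpus(sample)
        predicted = predict_mask(target)
        if not matches_condition(predicted, sample):
            noise.append(sample)
    return noise


def toy_predict_mask(target_corpus: str) -> str:
    """Stand-in for the language model: a keyword rule, purely illustrative."""
    return "bad" if "dies" in target_corpus else "good"


def label_matches(predicted: str, sample: Sample) -> bool:
    """Assumed preset condition: the prediction must equal the sample's label."""
    return predicted == sample.label


samples = [
    Sample("The screen is sharp and bright.", "The screen", "good"),
    Sample("The battery dies within an hour.", "The battery", "good"),  # mislabeled
]
noise = find_noise_samples(samples, toy_predict_mask, label_matches)
for s in noise:
    print("noise sample:", s.text)
```

In this sketch the second sample is flagged because the stand-in predictor fills the mask with "bad" while the stored label is "good". In the claimed method, flagged samples would then be modified and used to update the original sample set before training continues.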