US 12,405,987 B2
Annotation data determination method and apparatus, and readable medium and electronic device
Ling Chang, Beijing (CN); Heng Kang, Beijing (CN); Xin Liao, Beijing (CN); Ke Shen, Beijing (CN); Leizhen Sun, Beijing (CN); and Tengfei Bao, Beijing (CN)
Assigned to Beijing Bytedance Network Technology Co., Ltd., Beijing (CN)
Appl. No. 18/552,781
Filed by Beijing Bytedance Network Technology Co., Ltd., Beijing (CN)
PCT Filed Mar. 17, 2022, PCT No. PCT/CN2022/081502 § 371(c)(1), (2) Date Sep. 27, 2023, PCT Pub. No. WO2022/206413, PCT Pub. Date Oct. 6, 2022.
Claims priority of application No. 202110351435.0 (CN), filed on Mar. 31, 2021.
Prior Publication US 2024/0176809 A1, May 30, 2024
Int. Cl. G06F 16/35 (2025.01); G06F 16/335 (2019.01); G06F 16/353 (2025.01)

CPC G06F 16/353 (2019.01) [G06F 16/335 (2019.01)]

14 Claims

1. A method for determining labeled data, comprising steps of:

obtaining candidate data from a candidate data set, wherein the candidate data set is a set constituted by a plurality of unlabeled text data;

inputting the candidate data into a first text recognition model and a second text recognition model respectively to obtain a first recognition result output from the first text recognition model and a second recognition result output from the second text recognition model, wherein the first text recognition model and the second text recognition model are both capable of recognizing whether text data belongs to a target category;

determining whether the candidate data meets a labeling condition according to the first recognition result and the second recognition result, wherein the labeling condition is that the candidate data is recognized by at least one of the first text recognition model or the second text recognition model as belonging to the target category;

determining the candidate data as text data needing to be labeled if it is determined that the candidate data meets the labeling condition; and

determining the candidate data as text data not needing to be labeled if it is determined that the candidate data does not meet the labeling condition,

wherein the first recognition result is a first score output from the first text recognition model for the candidate data, the second recognition result is a second score output from the second text recognition model for the candidate data,

the determining whether the candidate data meets the labeling condition according to the first recognition result and the second recognition result comprises:

determining that the candidate data meets the labeling condition if the first score is greater than or equal to a score threshold, or if the second score is greater than or equal to the score threshold, and

wherein the score threshold is determined by the following steps:

determining whether the text data meets the labeling condition for each text data in the candidate data set according to the first text recognition model, the second text recognition model and a target score used this time;

increasing the target score if a number of text data in the candidate data set that meets the labeling condition is greater than a maximum sampling number;

performing the step of determining whether the text data meets the labeling condition for each text data in the candidate data set again based on the increased target score and determining whether a number of text data in the candidate data set that meets the labeling condition is greater than the maximum sampling number; and

determining the increased target score as the score threshold if the number of text data in the candidate data set that meets the labeling condition is less than or equal to the maximum sampling number.
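
The selection and threshold steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation. The names `meets_labeling_condition`, `find_score_threshold`, `select_for_labeling`, `initial_score`, and `step` are hypothetical, and each model is assumed to be a plain callable that maps a text to a numeric score. The claim does not say by how much the target score is increased, so a fixed positive `step` is assumed here. The claim also does not address the case where the initial target score already satisfies the maximum sampling number, so the sketch assumes that score is used unchanged.

```python
from typing import Callable, List, Sequence, Tuple

# Assumption: a "text recognition model" is any callable returning a score
# that indicates how strongly the text belongs to the target category.
Scorer = Callable[[str], float]


def meets_labeling_condition(first_score: float, second_score: float,
                             threshold: float) -> bool:
    """True if at least one model scores the text at or above the threshold."""
    return first_score >= threshold or second_score >= threshold


def find_score_threshold(candidates: Sequence[str],
                         first_model: Scorer,
                         second_model: Scorer,
                         initial_score: float,
                         step: float,
                         max_samples: int) -> float:
    """Raise the target score until at most max_samples texts qualify."""
    if step <= 0:
        raise ValueError("step must be positive so the loop terminates")
    if max_samples < 0:
        raise ValueError("max_samples must be non-negative")

    # Score each text once; only the target score changes between passes.
    scores: List[Tuple[float, float]] = [
        (first_model(text), second_model(text)) for text in candidates
    ]

    target = initial_score
    while True:
        count = sum(
            1 for first, second in scores
            if meets_labeling_condition(first, second, target)
        )
        if count <= max_samples:
            return target
        target += step  # hypothetical fixed increment, not stated in the claim


def select_for_labeling(candidates: Sequence[str],
                        first_model: Scorer,
                        second_model: Scorer,
                        threshold: float) -> List[str]:
    """Return the texts that need labeling under the given threshold."""
    return [
        text for text in candidates
        if meets_labeling_condition(first_model(text), second_model(text),
                                    threshold)
    ]
```

Caching the two scores per text keeps each repeated pass over the candidate data set inexpensive, since the models do not need to be run again when only the target score changes.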