US 11,853,305 B2
Method, device, and computer program product for file annotation
Min Gong, Shanghai (CN); Qicheng Qiu, Shanghai (CN); and Jiacheng Ni, Shanghai (CN)
Assigned to EMC IP Holding Company LLC, Hopkinton, MA (US)
Filed by EMC IP Holding Company LLC, Hopkinton, MA (US)
Filed on Jun. 30, 2021, as Appl. No. 17/364,814.
Claims priority of application No. 202110440639.1 (CN), filed on Apr. 23, 2021.
Prior Publication US 2022/0342890 A1, Oct. 27, 2022
Int. Cl. G06F 16/2457 (2019.01); G06N 20/00 (2019.01); G06F 16/35 (2019.01)
CPC G06F 16/24573 (2019.01) [G06F 16/35 (2019.01); G06N 20/00 (2019.01)]
20 Claims
1. A method, comprising:
annotating, by a system comprising a processor, a plurality of files by using an annotation model to determine a first performance of the annotation model, the first performance being associated with a confidence that is a function of an aggregate uncertainty measure of a model annotation result generated by the annotation model;
in response to the first performance being determined to be lower than a defined threshold, determining a group of target files from the plurality of files based at least on the confidence of the model annotation result, wherein determining the group of target files from the plurality of files comprises:
determining a respective expected annotation cost, corresponding to each file of the plurality of files, for acquisition of truth-value annotation information comprising using a cost prediction model, the cost prediction model being trained based on historical annotation costs of a group of training files, and
selecting the group of target files from the plurality of files based on a respective uncertainty measure corresponding to each file and the respective expected annotation cost corresponding to each file;
acquiring truth-value annotation information of the group of target files;
retraining the annotation model based on the truth-value annotation information of the group of target files, resulting in a retrained annotation model; and
in response to a second performance of the retrained annotation model being determined to be higher than or equal to the defined threshold, determining annotation information for at least some of the plurality of files by using the retrained annotation model.
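
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The claim does not fix any formulas, so the following are assumptions made only for illustration: Shannon entropy as the per-file uncertainty measure, confidence as one minus the mean normalized entropy, and a ranking by uncertainty per unit of expected cost. The function names and the callables predict, predict_cost, acquire_truth, and retrain are hypothetical stand-ins for the annotation model, the cost prediction model, the truth-value source, and the retraining step.

```python
import math
from typing import Callable, Dict, List, Optional, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Per-file uncertainty: Shannon entropy (nats) of the predicted label distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def confidence(all_probs: Sequence[Sequence[float]]) -> float:
    """Assumed form: 1 minus the mean entropy, normalized by its maximum (needs >= 2 labels)."""
    max_h = math.log(len(all_probs[0]))
    mean_h = sum(entropy(p) for p in all_probs) / len(all_probs)
    return 1.0 - mean_h / max_h


def select_targets(uncertainties: Sequence[float],
                   expected_costs: Sequence[float], k: int) -> List[int]:
    """Assumed rule: take the k files with the most uncertainty per unit of (positive) cost."""
    scores = [u / c for u, c in zip(uncertainties, expected_costs)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def annotate_with_feedback(
    files: Sequence[str],
    predict: Callable[[str], Sequence[float]],       # hypothetical annotation model
    predict_cost: Callable[[str], float],            # hypothetical cost prediction model
    acquire_truth: Callable[[str], int],             # hypothetical truth-value source
    retrain: Callable[[Dict[int, int]], Callable[[str], Sequence[float]]],
    threshold: float,
    k: int,
) -> Optional[List[int]]:
    """One pass of the claimed loop; returns label indices, or None if still below threshold."""
    probs = [predict(f) for f in files]
    if confidence(probs) >= threshold:
        # Not recited in the claim: assume the model is used as-is when it already performs well.
        return [max(range(len(p)), key=p.__getitem__) for p in probs]

    # First performance is below the threshold: choose which files to send for truth values.
    uncertainties = [entropy(p) for p in probs]
    costs = [predict_cost(f) for f in files]
    targets = select_targets(uncertainties, costs, k)

    # Acquire truth-value annotations for the targets and retrain on them.
    truth = {i: acquire_truth(files[i]) for i in targets}
    retrained_predict = retrain(truth)

    # Second performance check on the retrained model.
    new_probs = [retrained_predict(f) for f in files]
    if confidence(new_probs) >= threshold:
        return [max(range(len(p)), key=p.__getitem__) for p in new_probs]
    return None  # assumption: the caller decides whether to repeat the loop
```

Dividing uncertainty by expected cost is only one way to combine the two quantities named in the claim. A weighted difference, or a selection constrained by a total cost budget, would fit the claim language equally well.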