| CPC G06F 40/30 (2020.01) [G06F 40/205 (2020.01); G06V 30/148 (2022.01); G06V 2201/07 (2022.01)] | 15 Claims |

|
1. A method for file processing, comprising:
acquiring a file to be processed and an information type, wherein a data volume of the file to be processed is less than or equal to a threshold;
recognizing at least one piece of candidate information related to the information type from the file to be processed, wherein recognizing at least one piece of candidate information related to the information type from the file to be processed comprises:
recognizing the at least one piece of candidate information related to the information type from the at least one file to be processed in a parallel processing manner;
determining a target recognition feature and a semantic feature of each piece of candidate information, wherein the target recognition feature is configured to describe a matching condition between each piece of candidate information and the information type; and
determining target information from the at least one piece of candidate information based on the target recognition feature and the semantic feature;
wherein determining target information from the at least one piece of candidate information based on the target recognition feature and the semantic feature comprises:
determining character recognition confidences corresponding to a plurality of characters,
inputting the minimum character recognition confidence, the average character recognition confidence, the type feature, the index feature, and the semantic coding feature into a pre-trained classification model, and obtaining a classification evaluation value corresponding to each piece of candidate information output by the classification model; and
determining the target information from the at least one piece of candidate information based on the classification evaluation value corresponding to each piece of candidate information,
wherein determining the character recognition confidences corresponding to a plurality of characters comprises:
obtaining recognition confidences corresponding to a plurality of characters by recognizing each character using an optical character recognition (OCR) method, and obtaining the character recognition confidence of each of the plurality of characters;
wherein acquiring a file to be processed comprises:
acquiring an initial file; and
obtaining at least one file to be processed by splitting the initial file based on the threshold;
wherein execution of the foregoing steps converts the file to be processed from an unstructured Portable Document Format (PDF) or image into structured target information corresponding to the information type, and stores the structured target information together with an index feature corresponding to the information type in an index table in memory.
|
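
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. Every name in it (`Candidate`, `split_by_threshold`, `recognize_parallel`, `feature_vector`, `select_target`, `toy_score`) is hypothetical, the page-size threshold is measured in bytes as an assumption, and `toy_score` is a placeholder standing in for the pre-trained classification model, whose form the claim does not specify.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List, Sequence


@dataclass
class Candidate:
    """Hypothetical container for one piece of candidate information."""
    text: str
    char_confidences: List[float]  # per-character OCR confidences
    type_feature: float            # assumed numeric encoding
    index_feature: float           # assumed numeric encoding
    semantic_feature: List[float]  # assumed semantic coding vector


def split_by_threshold(pages: Sequence[bytes], threshold: int) -> List[List[bytes]]:
    """Greedily group pages so each chunk's total size stays <= threshold.

    A single page larger than the threshold still forms its own chunk;
    handling that case is outside this sketch.
    """
    chunks: List[List[bytes]] = []
    current: List[bytes] = []
    size = 0
    for page in pages:
        if current and size + len(page) > threshold:
            chunks.append(current)
            current, size = [], 0
        current.append(page)
        size += len(page)
    if current:
        chunks.append(current)
    return chunks


def recognize_parallel(chunks: Sequence[list], recognize: Callable[[list], list]) -> list:
    """Run a caller-supplied recognizer over all chunks in parallel threads."""
    with ThreadPoolExecutor() as pool:
        # pool.map preserves the order of the input chunks.
        return [item for result in pool.map(recognize, chunks) for item in result]


def feature_vector(c: Candidate) -> List[float]:
    """Assemble the classifier inputs listed in the claim."""
    return [
        min(c.char_confidences),   # minimum character recognition confidence
        mean(c.char_confidences),  # average character recognition confidence
        c.type_feature,
        c.index_feature,
        *c.semantic_feature,
    ]


def select_target(candidates: Sequence[Candidate],
                  score: Callable[[List[float]], float]) -> Candidate:
    """Pick the candidate with the highest classification evaluation value."""
    return max(candidates, key=lambda c: score(feature_vector(c)))


def toy_score(features: List[float]) -> float:
    """Placeholder for the pre-trained model: mean of the first four features."""
    return sum(features[:4]) / 4
```

A possible usage, with made-up values: build two `Candidate` objects, then call `select_target([a, b], toy_score)` to obtain the one with the higher score. Choosing the single highest-scoring candidate is one reading of "determining the target information ... based on the classification evaluation value"; a score cutoff would be an equally plausible alternative.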