CPC G06F 21/6254 (2013.01) [G06F 40/174 (2020.01); G06V 30/10 (2022.01); G06V 30/412 (2022.01); G16H 10/60 (2018.01)] | 15 Claims |
1. A method for the batch de-identification of unstructured health care documents, the method comprising:
optical character recognizing a form-based document, the optical character recognition (OCR) producing an initial set of terms;
identifying initial specific terms amongst the initial set of terms containing protected information and replacing in the form-based document each of the identified initial specific terms with synthetically generated corresponding terms;
performing additional OCR on the form-based document to produce a new set of terms and identifying new specific terms amongst the new set of terms containing protected information;
comparing the new specific terms to the initial specific terms; and,
adding the form-based document to a repository of de-identified documents only if none of the new specific terms are equivalent to corresponding ones of the initial specific terms, but otherwise flagging the form-based document in error.
|
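The control flow recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation, and every name in it (`toy_ocr`, `is_protected`, `equivalent`, `deidentify`) is hypothetical. The sketch assumes the form-based document is modeled as a plain string, OCR is stood in for by whitespace splitting, "protected information" is limited to SSN-shaped tokens, and "equivalent" means equal after trimming and case folding. The claim itself fixes none of these choices. The sketch also compares every new term against every initial term, which is stricter than comparing only "corresponding" ones.

```python
import re
from typing import Callable, List

# Assumption: only SSN-shaped tokens count as protected information here.
SSN_PATTERN = re.compile(r"\d{3}-\d{2}-\d{4}")


def toy_ocr(document: str) -> List[str]:
    """Stand-in for an OCR pass: split the text into whitespace-delimited terms."""
    return document.split()


def is_protected(term: str) -> bool:
    """Hypothetical detector for terms containing protected information."""
    return SSN_PATTERN.fullmatch(term) is not None


def equivalent(a: str, b: str) -> bool:
    """Assumed notion of equivalence: equal after trimming and case folding."""
    return a.strip().casefold() == b.strip().casefold()


def deidentify(
    document: str,
    synthesize: Callable[[str], str],
    repository: List[str],
    flagged: List[str],
) -> bool:
    """Run the replace, re-scan, compare loop; return True if the document was accepted."""
    # First pass: find the initial specific terms.
    initial_specific = [t for t in toy_ocr(document) if is_protected(t)]

    # Replace each identified term with a synthetically generated one.
    redacted = document
    for term in initial_specific:
        redacted = redacted.replace(term, synthesize(term))

    # Second pass over the modified document: find the new specific terms.
    new_specific = [t for t in toy_ocr(redacted) if is_protected(t)]

    # Accept only if no new term is equivalent to any initial term.
    leaked = any(equivalent(n, i) for n in new_specific for i in initial_specific)
    (flagged if leaked else repository).append(redacted)
    return not leaked


repository: List[str] = []
flagged: List[str] = []
doc = "Patient SSN 123-45-6789 admitted"

# A working synthesizer yields a different SSN-shaped value, so the document is accepted.
accepted = deidentify(doc, lambda t: "000-00-0000", repository, flagged)

# A faulty synthesizer that returns its input leaves the original value in place,
# so the second pass finds it again and the document is flagged in error.
rejected = deidentify(doc, lambda t: t, repository, flagged)
```

Note that the second pass still detects the synthetic value as a protected-looking term. That is expected: the check in the final step is that the re-scanned terms differ from the originals, not that no such terms remain.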