| CPC G06Q 30/0201 (2013.01) [G06F 16/951 (2019.01); G06F 40/295 (2020.01); G06N 20/00 (2019.01)] | 21 Claims |

|
1. A computer-implemented method of identifying demographic information in a data file, comprising:
training a machine learning model according to labeled, sampled training sets to identify a heading based at least in part on structure and content of the data file from a data source with information describing medical providers, the machine learning model being based on a plurality of machine learning algorithms to identify different types of demographic information;
receiving, by a processor, data files containing a plurality of fields of demographic information from a plurality of third-party sources, the data files having inconsistent or mislabeled nomenclatures with one another for one or more fields of the plurality of fields of demographic information;
analyzing, by the processor, the heading identifying a plurality of strings representing a field of demographic information in the data files using the machine learning model, wherein the analyzing is performed across two or more fields of demographic information to identify a relationship between headings, wherein the identification of the relationship further comprises:
determining types of demographic information;
determining whether the headings should be grouped as pairs; and
determining whether entries of each of the headings describe a same medical provider; and
generating, by the processor, a score indicating a probability that each of the plurality of fields of demographic information was identified correctly, wherein the generating the score further comprises:
generating a baseline score for each of the plurality of fields of demographic information; and
adjusting the baseline score to increase the score when the heading and content of the field of demographic information match, or to decrease the score when the heading and the content of the field of demographic information do not match;
generating, by the processor, a revised data file labeling each of the plurality of fields of demographic information based on the identified type; and
inserting, by the processor and in the revised data file, missing fields of demographic information based on the identified type of demographic information.
|
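The scoring and relabeling steps recited in claim 1 can be illustrated with a minimal sketch. It is not the claimed implementation: the trained machine learning model is replaced here by simple regular-expression and alias lookups, and every name in it (`CONTENT_PATTERNS`, `HEADING_ALIASES`, `score_field`, `revise`, the baseline of 0.5, the adjustment of 0.25, and the choice to insert missing fields as empty columns) is a hypothetical assumption made for illustration only.

```python
import re

# ASSUMPTION: regex content checks stand in for the trained model in the claim.
CONTENT_PATTERNS = {
    "npi": re.compile(r"\d{10}"),
    "phone": re.compile(r"\(?\d{3}\)?[-. ]\d{3}[-. ]\d{4}"),
    "zip": re.compile(r"\d{5}(-\d{4})?"),
}

# ASSUMPTION: hypothetical heading spellings seen across third-party sources.
HEADING_ALIASES = {
    "npi": {"npi", "npi number", "provider id"},
    "phone": {"phone", "telephone", "tel"},
    "zip": {"zip", "zip code", "postal code"},
}


def content_match_rate(field_type, values):
    """Share of non-empty entries that fully match the pattern for field_type."""
    pattern = CONTENT_PATTERNS[field_type]
    entries = [v.strip() for v in values if v and v.strip()]
    if not entries:
        return 0.0
    return sum(1 for v in entries if pattern.fullmatch(v)) / len(entries)


def type_from_heading(heading):
    """Field type suggested by the heading text, or None if unrecognized."""
    key = heading.strip().lower()
    for field_type, aliases in HEADING_ALIASES.items():
        if key in aliases:
            return field_type
    return None


def type_from_content(values):
    """Field type whose pattern matches the most entries, or None if none match."""
    rates = {t: content_match_rate(t, values) for t in CONTENT_PATTERNS}
    best = max(rates, key=rates.get)
    return best if rates[best] > 0 else None


def score_field(heading, values, baseline=0.5, adjustment=0.25):
    """Return (identified_type, score): raise the baseline when the heading
    and the content agree, lower it when they do not."""
    content_type = type_from_content(values)
    if content_type is not None and type_from_heading(heading) == content_type:
        score = min(1.0, baseline + adjustment)
    else:
        score = max(0.0, baseline - adjustment)
    return content_type, score


def revise(columns, required=("npi", "phone", "zip")):
    """Relabel each column by identified type and append missing required fields.

    ASSUMPTION: a missing field is inserted as a column of empty strings.
    """
    n_rows = max((len(v) for v in columns.values()), default=0)
    revised = []
    for heading, values in columns.items():
        field_type, score = score_field(heading, values)
        revised.append({
            "original": heading,
            "label": field_type or heading,  # keep the heading if unidentified
            "score": score,
            "values": list(values),
        })
    present = {column["label"] for column in revised}
    for field_type in required:
        if field_type not in present:
            revised.append({"original": None, "label": field_type,
                            "score": 0.0, "values": [""] * n_rows})
    return revised


# Example input: one consistent heading and one mislabeled heading.
columns = {
    "Tel": ["555-123-4567", "(555) 987-6543"],
    "Zip Code": ["1234567890", "1098765432"],  # heading says zip, content looks like NPIs
}
result = revise(columns)
```

In this example the `Tel` column is scored above the baseline because its heading and content agree, the `Zip Code` column is relabeled by content and scored below the baseline because they disagree, and an empty `zip` column is appended as the missing field.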