CPC G06F 16/285 (2019.01) [G06F 16/215 (2019.01); G06F 16/24578 (2019.01); G16H 10/20 (2018.01)] | 20 Claims |
1. A system comprising:
a data store;
a non-transitory computer-readable medium including computer program code for database record selection; and
a processing device communicatively coupled to the data store and the non-transitory computer-readable medium, wherein the processing device is configured for executing the computer program code to perform operations comprising:
identifying data sources for geographically clustered data containing corresponding descriptors for unique clinical trial investigators across different sources of information for database records to be written to the data store;
formatting the corresponding descriptors to produce standardized, corresponding descriptors;
matching each standardized, corresponding descriptor of the standardized, corresponding descriptors to produce a record score for each standardized, corresponding descriptor;
producing, for each standardized, corresponding descriptor, a modified score using the record score and a number of characters in at least one of the corresponding descriptors;
combining the modified scores for the standardized, corresponding descriptors to produce a binary overall score for each database record of the database records; and
selectively writing each database record to the data store based on the binary overall score to compile a deduplicated database of the unique clinical trial investigators.
|
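The operations recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the claim does not specify a matching algorithm, a length-weighting formula, or a threshold. The sketch assumes `difflib.SequenceMatcher` for the record score, a linear discount for descriptors shorter than a hypothetical `full_weight_len`, a mean over the shared descriptors compared against a hypothetical `threshold` of 0.8 to obtain the binary overall score, and an in-memory list standing in for the data store. All function and field names are illustrative.

```python
import re
from difflib import SequenceMatcher


def standardize(value: str) -> str:
    """Format a descriptor: lowercase, drop punctuation, collapse whitespace."""
    value = re.sub(r"[^\w\s]", "", value.lower())
    return re.sub(r"\s+", " ", value).strip()


def record_score(a: str, b: str) -> float:
    """Similarity of two standardized descriptors in [0, 1] (assumed matcher)."""
    return SequenceMatcher(None, a, b).ratio()


def modified_score(score: float, raw: str, full_weight_len: int = 10) -> float:
    """Scale the record score by descriptor length (assumed formula).

    Descriptors shorter than full_weight_len characters are discounted
    linearly, on the assumption that short strings match by chance more often.
    """
    return score * min(1.0, len(raw) / full_weight_len)


def overall_score(candidate: dict, existing: dict, threshold: float = 0.8) -> int:
    """Combine per-descriptor modified scores into a binary score.

    Returns 1 when the two records are judged to describe the same
    investigator, otherwise 0.
    """
    fields = candidate.keys() & existing.keys()
    if not fields:
        return 0
    total = 0.0
    for field in fields:
        a = standardize(candidate[field])
        b = standardize(existing[field])
        total += modified_score(record_score(a, b), candidate[field])
    return 1 if total / len(fields) >= threshold else 0


def write_if_unique(store: list, candidate: dict) -> bool:
    """Selectively write: append the record only if no stored record matches."""
    if any(overall_score(candidate, rec) == 1 for rec in store):
        return False
    store.append(candidate)
    return True


# Hypothetical investigator records drawn from two different sources.
store = []
first = write_if_unique(
    store, {"name": "Dr. Jane A. Smith", "site": "Boston General Hospital"}
)
second = write_if_unique(
    store, {"name": "dr jane a smith", "site": "Boston General Hospital."}
)
print(first, second, len(store))
```

In this sketch the second record standardizes to the same descriptors as the first, so its binary overall score is 1 and it is not written. A production system would likely replace the list scan with blocking or indexing, for example on the geographic cluster, to avoid comparing every candidate against every stored record.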