US 12,130,815 B2
System and method for processing data for electronic searching
Peter Piatetsky, New York, NY (US); Julian Vasilkoski, Brookline, MA (US); and Christopher Sullivan, Newton, MA (US)
Assigned to Castellum.AI, New York, NY (US)
Filed by Castellum.AI, New York, NY (US)
Filed on Aug. 9, 2021, as Appl. No. 17/397,101.
Claims priority of provisional application 63/064,239, filed on Aug. 11, 2020.
Prior Publication US 2022/0050838 A1, Feb. 17, 2022
Int. Cl. G06F 16/00 (2019.01); G06F 16/2455 (2019.01); G06F 16/33 (2019.01); G06N 20/00 (2019.01)
CPC G06F 16/2455 (2019.01) [G06F 16/334 (2019.01); G06N 20/00 (2019.01)] 11 Claims
 
1. A method for processing data for electronic searching, comprising:
receiving raw data and calculating a checksum, wherein the raw data includes unstructured watchlist data, and wherein the checksum is a watchlist checksum of the unstructured watchlist data;
comparing the watchlist checksum to a previously calculated watchlist checksum to determine whether a change occurred in the raw data;
downloading the raw data to an object storage;
parsing the raw data to have a parsed raw data and transmitting the parsed raw data to a sanity check system;
enriching the raw data with a combination of rigid data schema categories;
loading the parsed raw data that is sanity checked by the sanity check system and enriched to a staging database and performing a stability check process on the parsed raw data that is sanity checked by the sanity check system; and
loading the parsed raw data that is stability checked by the stability check process to a master database and transmitting the parsed raw data stability checked by the stability check process to a data search platform, wherein the data search platform is loaded with the parsed raw data enriched with the combination of rigid data schema categories, stability checked, and sanity checked, wherein the unstructured watchlist data is structured and enriched in real-time or near-real time and in any language, including:
normalizing the unstructured watchlist data into a canonical format;
cleaning the unstructured watchlist data by removing problematic characters;
standardizing all country and location related information to International Organization for Standardization (ISO) standard codes;
using a named entity recognition to extract identification (ID) numbers and ID types from unstructured text fields of the unstructured watchlist data;
applying a sanity check to the unstructured watchlist data; and
assigning each item of the unstructured watchlist data a type for entity resolution.
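
For illustration only, the checksum comparison, normalizing, cleaning, and country-code standardizing steps recited in claim 1 could be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from the patent: the claim names no hash algorithm, canonical format, or definition of "problematic characters", so SHA-256, Unicode NFKC, and the removal of control and format characters are assumed here. All function names and the small ISO_ALPHA2 lookup table are hypothetical.

```python
import hashlib
import re
import unicodedata
from typing import Optional

# Hypothetical, deliberately tiny lookup table. A real system would cover
# the full ISO 3166-1 list and many more name variants.
ISO_ALPHA2 = {
    "russia": "RU",
    "russian federation": "RU",
    "iran": "IR",
    "united states": "US",
}


def watchlist_checksum(raw: bytes) -> str:
    """Checksum of the raw watchlist bytes (SHA-256 is an assumption)."""
    return hashlib.sha256(raw).hexdigest()


def has_changed(raw: bytes, previous_checksum: Optional[str]) -> bool:
    """True when the new checksum differs from the previously stored one."""
    return watchlist_checksum(raw) != previous_checksum


def normalize(text: str) -> str:
    """Normalize to an assumed canonical form and strip problematic characters."""
    # NFKC folds compatibility forms (e.g. full-width letters) together.
    text = unicodedata.normalize("NFKC", text)
    # Collapse whitespace runs first, so newlines and tabs become single spaces.
    text = re.sub(r"\s+", " ", text)
    # Drop remaining control/format characters (Unicode categories C*),
    # such as zero-width spaces.
    text = "".join(ch for ch in text if unicodedata.category(ch)[0] != "C")
    return text.strip()


def country_code(name: str) -> Optional[str]:
    """Map a free-text country name to an ISO 3166-1 alpha-2 code, if known."""
    return ISO_ALPHA2.get(normalize(name).casefold())
```

In such a sketch, a source list would be re-parsed only when has_changed returns True, and a country name with no entry in the lookup table yields None rather than a guessed code, so that it can be flagged by a later sanity check.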