US 12,229,100 B2
Systems and methods for counteracting data-skewness for locality sensitive hashing via feature selection and pruning
Sean Moran, Putney (GB); Fanny Silavong, London (GB); Rob Otter, Witham (GB); Antonios Georgiadis, London (GB); and Brett Sanford, Tunbridge Wells (GB)
Assigned to JPMORGAN CHASE BANK, N.A., New York, NY (US)
Filed by JPMORGAN CHASE BANK, N.A., New York, NY (US)
Filed on Sep. 27, 2021, as Appl. No. 17/486,103.
Prior Publication US 2022/0100725 A1, Mar. 31, 2022
Int. Cl. G06F 16/9535 (2019.01); G06F 16/22 (2019.01); G06F 16/2457 (2019.01); G06F 16/951 (2019.01)
CPC G06F 16/2255 (2019.01) [G06F 16/2246 (2019.01); G06F 16/24578 (2019.01)]
18 Claims
 
1. A method for feature selection for counteracting data skewness in locality sensitive hashing (LSH)-based searches, comprising:
ingesting, by an ingestion computer program and from a plurality of data sources, data comprising an abstract syntax tree;
extracting, by the ingestion computer program, a plurality of tree-based features from the abstract syntax tree;
transforming, by the ingestion computer program, each of the plurality of tree-based features into a feature vector;
scoring, by the ingestion computer program and using a scoring method, each of the feature vectors, wherein the scoring method comprises Normalized Sub-Path Frequency (NSPF);
selecting, by the ingestion computer program, a subset of the plurality of feature vectors by selecting the feature vectors with a highest score and/or not selecting feature vectors with low variance; and
for each selected feature vector:
padding, by the ingestion computer program, the selected feature vector to a fixed length;
computing, by the ingestion computer program, a random hash function for the selected padded feature vector; and
inserting, by the ingestion computer program, an output of the random hash function into a hash table with the selected padded feature vector.
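The steps recited in claim 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented implementation. The claim does not define how Normalized Sub-Path Frequency is computed, so `score_vectors` uses a hypothetical stand-in (occurrences of a sub-path divided by the total number of sub-paths). The "random hash function" is assumed here to be a sign-of-random-projection hash, and the low-variance filter is omitted. All function names and the constants `FIXED_LEN`, `NUM_BITS`, `TOP_K`, and `PAD_ID` are invented for this example.

```python
import ast
import random
from collections import Counter, defaultdict

FIXED_LEN = 4   # assumed fixed vector length (also the longest sub-path kept)
NUM_BITS = 8    # assumed number of random hyperplanes per hash
TOP_K = 5       # assumed number of top-scoring feature vectors to keep
PAD_ID = 0      # assumed padding value; real node types get IDs >= 1


def extract_sub_paths(tree):
    """Return, for every AST node, the path of node-type names ending at it,
    truncated to the last FIXED_LEN entries."""
    paths = []

    def walk(node, prefix):
        path = prefix + [type(node).__name__]
        paths.append(tuple(path[-FIXED_LEN:]))
        for child in ast.iter_child_nodes(node):
            walk(child, path)

    walk(tree, [])
    return paths


def to_vectors(paths):
    """Map each node-type name to an integer ID, giving one vector per sub-path."""
    names = sorted({name for path in paths for name in path})
    vocab = {name: i + 1 for i, name in enumerate(names)}
    return [tuple(vocab[name] for name in path) for path in paths]


def score_vectors(vectors):
    """Hypothetical stand-in for NSPF: occurrences of a vector / total vectors."""
    counts = Counter(vectors)
    return {vec: count / len(vectors) for vec, count in counts.items()}


def pad(vector):
    """Right-pad a vector with PAD_ID up to FIXED_LEN."""
    return list(vector) + [PAD_ID] * (FIXED_LEN - len(vector))


def make_hash(seed=0):
    """Build a sign-of-random-projection hash; hyperplanes are drawn once."""
    rng = random.Random(seed)
    planes = [[rng.gauss(0.0, 1.0) for _ in range(FIXED_LEN)]
              for _ in range(NUM_BITS)]

    def hash_vector(vector):
        return tuple(
            1 if sum(w * x for w, x in zip(plane, vector)) >= 0 else 0
            for plane in planes
        )

    return hash_vector


def build_index(source, seed=0):
    """Run the sketched pipeline on Python source text and return the hash table."""
    vectors = to_vectors(extract_sub_paths(ast.parse(source)))
    scores = score_vectors(vectors)
    # Keep the TOP_K highest-scoring vectors; ties are broken by value
    # so that the result is deterministic.
    selected = sorted(scores, key=lambda v: (-scores[v], v))[:TOP_K]
    hash_vector = make_hash(seed)
    table = defaultdict(list)
    for vector in selected:
        padded = pad(vector)
        # Store the padded vector under the bucket key produced by the hash.
        table[hash_vector(padded)].append(padded)
    return dict(table)
```

Calling `build_index("def f(x):\n    return x + 1\n")` returns a dictionary that maps each `NUM_BITS`-long bucket key to the padded vectors that hashed to it. Drawing the hyperplanes once in `make_hash` and reusing them means a later query vector would be hashed the same way as the indexed vectors.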