CPC G06F 16/2255 (2019.01) [G06F 16/2246 (2019.01); G06F 16/24578 (2019.01)] | 18 Claims |
1. A method for feature selection for counteracting data skewness in locality sensitive hashing (LSH)-based searches, comprising:
ingesting, by an ingestion computer program and from a plurality of data sources, data comprising an abstract syntax tree;
extracting, by the ingestion computer program, a plurality of tree-based features from the abstract syntax tree;
transforming, by the ingestion computer program, each of the plurality of tree-based features into a feature vector;
scoring, by the ingestion computer program and using a scoring method, each of the feature vectors, wherein the scoring method comprises Normalized Sub-Path Frequency (NSPF);
selecting, by the ingestion computer program, a subset of the plurality of feature vectors by selecting the feature vectors with a highest score and/or not selecting feature vectors with low variance; and
for each selected feature vector:
padding, by the ingestion computer program, the selected feature vector to a fixed length;
computing, by the ingestion computer program, a random hash function for the selected padded feature vector; and
inserting, by the ingestion computer program, an output of the random hash function into a hash table with the selected padded feature vector.
|
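The steps recited in claim 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation. The claim does not define how sub-paths are extracted, how NSPF is computed, or which random hash function is used, so the sketch assumes: Python's `ast` module as the source of the abstract syntax tree; downward paths of node-type names as the tree-based features; NSPF taken to mean a sub-path's count divided by the total number of sub-paths; and a random-hyperplane (sign projection) hash as the LSH function. All names (`sub_paths`, `score_nspf`, `pad`, `lsh_key`, `build_index`) and constants are hypothetical. Only the "highest score" branch of the selection step is shown; the low-variance filter is omitted.

```python
import ast
import random
from collections import Counter, defaultdict

PAD_LEN = 4      # assumed fixed vector length (also the maximum sub-path length)
NUM_PLANES = 8   # assumed number of random hyperplanes, i.e. hash bits
TOP_K = 5        # assumed number of feature vectors to keep


def sub_paths(tree, max_len=PAD_LEN):
    """Yield downward paths of node-type names, at most max_len nodes long."""
    def walk(node, prefix):
        path = (prefix + (type(node).__name__,))[-max_len:]
        # Every suffix of the current path is a sub-path ending at this node.
        for i in range(len(path)):
            yield path[i:]
        for child in ast.iter_child_nodes(node):
            yield from walk(child, path)
    yield from walk(tree, ())


def score_nspf(vectors):
    """Assumed NSPF: occurrences of a vector divided by all occurrences."""
    counts = Counter(vectors)
    total = sum(counts.values())
    return {vec: count / total for vec, count in counts.items()}


def pad(vec, length=PAD_LEN):
    """Right-pad with 0, which is reserved as the padding id."""
    return tuple(vec) + (0,) * (length - len(vec))


def lsh_key(vec, planes):
    """Random-hyperplane hash: one bit per plane, set when the dot product is >= 0."""
    return tuple(int(sum(a * b for a, b in zip(vec, plane)) >= 0) for plane in planes)


def build_index(source, seed=0):
    tree = ast.parse(source)                      # ingest data comprising an AST
    features = list(sub_paths(tree))              # extract tree-based features
    names = sorted({name for path in features for name in path})
    vocab = {name: i + 1 for i, name in enumerate(names)}  # ids start at 1
    vectors = [tuple(vocab[n] for n in path) for path in features]  # transform
    scores = score_nspf(vectors)                  # score each feature vector
    # Select the highest-scoring vectors; ties are broken by vector value.
    selected = sorted(scores, key=lambda v: (-scores[v], v))[:TOP_K]
    rng = random.Random(seed)
    planes = [[rng.gauss(0.0, 1.0) for _ in range(PAD_LEN)]
              for _ in range(NUM_PLANES)]
    table = defaultdict(list)
    for vec in selected:
        padded = pad(vec)                         # pad to a fixed length
        table[lsh_key(padded, planes)].append(padded)  # hash, then insert
    return table, scores


if __name__ == "__main__":
    table, _ = build_index("x = 1\ny = x + 2\n")
    for key, bucket in table.items():
        print(key, bucket)
```

Keeping only the top-scoring sub-paths before hashing is what limits the number of entries per tree in this sketch. Whether that matches the skew-reduction behavior the claim targets depends on the actual NSPF definition, which is not given in the claim text.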