US 12,141,282 B2
Methods and apparatus to augment classification coverage for low prevalence samples through neighborhood labels proximity vectors
German Lancioni, San Jose, CA (US); and Jonathan King, Hillsboro, OR (US)
Assigned to McAfee, LLC, San Jose, CA (US)
Filed by McAfee, LLC, San Jose, CA (US)
Filed on Dec. 31, 2021, as Appl. No. 17/566,760.
Claims priority of provisional application 63/227,305, filed on Jul. 29, 2021.
Prior Publication US 2023/0029679 A1, Feb. 2, 2023
Int. Cl. G06N 5/04 (2023.01); G06F 21/56 (2013.01); G06N 7/01 (2023.01)
CPC G06F 21/566 (2013.01) [G06F 21/56 (2013.01); G06F 21/567 (2013.01); G06N 7/01 (2023.01); G06F 2221/033 (2013.01)]
25 Claims
1. An apparatus to augment classification coverage for low prevalence malware samples comprising:
interface circuitry;
processor circuitry including one or more of:
at least one of a central processing unit, a graphic processing unit or a digital signal processor, the at least one of the central processing unit, the graphic processing unit or the digital signal processor having control circuitry to control data movement within the processor circuitry, arithmetic and logic circuitry to perform one or more first operations according to instructions, and one or more registers to store a result of the one or more first operations, the instructions in the apparatus;
a Field Programmable Gate Array (FPGA), the FPGA including first logic gate circuitry, a plurality of configurable interconnections, and storage circuitry, the first logic gate circuitry and interconnections to perform one or more second operations, the storage circuitry to store a result of the one or more second operations; or
Application Specific Integrated Circuitry (ASIC) including second logic gate circuitry to perform one or more third operations;
the processor circuitry to perform at least one of the first operations, the second operations or the third operations to instantiate:
feature-based classifier circuitry to calculate a first classification result using a first classifier for a data sample;
sample classification circuitry to determine whether the first classification result of the first classifier passes a confidence threshold; and
appendix classifier circuitry to:
query a clean locality sensitive hashing (LSH) forest and a malicious LSH forest for a plurality of similar neighbor samples;
calculate a custom distance metric (CDM) for a first stored similar neighbor sample of the plurality of similar neighbor samples;
sort the plurality of similar neighbor samples based on the calculated CDM for the first stored similar neighbor sample; and
execute a classification algorithm on the sorted plurality of similar neighbor samples to calculate a second classification result, wherein the sample classification circuitry is to output the first classification result when the first classification result of the first classifier is determined to have passed the confidence threshold, and is to output the second classification result when the first classification result of the first classifier is determined to have not passed the confidence threshold.
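
The control flow recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. Every name in it is hypothetical, and several choices are assumptions made only for illustration: samples are modeled as sets of string features, a brute-force top-k search stands in for each LSH forest query, Jaccard distance stands in for the custom distance metric (CDM), "passes" is taken to mean greater than or equal to the threshold, and a similarity-weighted vote stands in for the classification algorithm.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Tuple


@dataclass(frozen=True)
class Neighbor:
    """A stored sample; the label records which forest it came from."""
    features: FrozenSet[str]
    label: str  # "clean" or "malicious"


def jaccard_distance(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Assumed stand-in for the CDM: 1 - |a & b| / |a | b|."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def query_forest(forest: List[Neighbor], sample: FrozenSet[str], k: int) -> List[Neighbor]:
    """Brute-force top-k search standing in for an LSH forest query."""
    return sorted(forest, key=lambda n: jaccard_distance(sample, n.features))[:k]


def appendix_classify(sample: FrozenSet[str],
                      clean_forest: List[Neighbor],
                      malicious_forest: List[Neighbor],
                      k: int = 3) -> str:
    """Query both forests, score neighbors, sort them, then vote."""
    neighbors = (query_forest(clean_forest, sample, k)
                 + query_forest(malicious_forest, sample, k))
    # Compute the stand-in CDM for each neighbor and sort by it, nearest first.
    scored = sorted(((jaccard_distance(sample, n.features), n) for n in neighbors),
                    key=lambda pair: pair[0])
    # Similarity-weighted vote over the k nearest neighbors (an assumed algorithm).
    votes = {"clean": 0.0, "malicious": 0.0}
    for dist, neighbor in scored[:k]:
        votes[neighbor.label] += 1.0 - dist
    return max(votes, key=votes.get)


def classify(sample: FrozenSet[str],
             first_classifier: Callable[[FrozenSet[str]], Tuple[str, float]],
             confidence_threshold: float,
             clean_forest: List[Neighbor],
             malicious_forest: List[Neighbor]) -> str:
    """Output the first result if it passes the threshold, else the second."""
    label, confidence = first_classifier(sample)
    if confidence >= confidence_threshold:
        return label
    return appendix_classify(sample, clean_forest, malicious_forest)
```

In this sketch the neighbor lookup runs only when the first classifier is not confident, which mirrors the claim's ordering: the second classification result is output only after the first result fails the confidence threshold.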