US 11,899,791 B2
Automated identification of malware families based on shared evidences
Yu-Siang Chen, Minxiong Township (TW); Ci-Hao Wu, Taipei (TW); Ying-Chen Yu, Taipei (TW); Pao-Chuan Liao, Taipei (TW); and June-Ray Lin, Taipei (TW)
Assigned to International Business Machines Corporation, Armonk, NY (US)
Filed by International Business Machines Corporation, Armonk, NY (US)
Filed on Sep. 29, 2021, as Appl. No. 17/489,725.
Prior Publication US 2023/0100947 A1, Mar. 30, 2023
Int. Cl. G06F 21/00 (2013.01); G06F 21/56 (2013.01); G06N 5/04 (2023.01); G06N 5/02 (2023.01)
CPC G06F 21/561 (2013.01) [G06F 21/568 (2013.01); G06N 5/02 (2013.01); G06N 5/04 (2013.01)]
20 Claims
 
1. A method, in a data processing system comprising at least one processor and at least one memory, the at least one memory comprising instructions executed by the at least one processor to cause the at least one processor to implement a malware family identification engine for automatically identifying family tree relationships among malware based on reasoning of indirect relations from observed entities to family entities, the method comprising:
constructing a graph data structure of direct relationships between malware instances and malware families, direct relationships between malware instances and detected tags, and indirect relationships between detected tags and malware families, wherein each detected tag node has one or more outgoing links (OGLs) to malware family nodes;
building a dictionary data structure comprising detected tag entries linking each detected tag to one or more malware family nodes based on the graph data structure;
identifying significant indirect entities (SIEs) within the detected tag entries of the dictionary data structure;
selecting a SIE with a highest number of out-going links (OGLs) as a root node in a family tree data structure;
recursively connecting SIEs with a number of OGLs less than the highest number of OGLs to the root node in the family tree data structure; and
converting each SIE name in the family tree data structure to a chained family entity name in the family tree data structure.
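
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation, and several details the claim leaves open are filled in with assumptions: a detected tag counts as a significant indirect entity (SIE) when it links to at least `min_families` families; an SIE with fewer OGLs is attached beneath the deepest existing node whose family set contains its own, falling back to the root; and a chained family entity name is the sorted family names joined with hyphens. All function names, the threshold, and the sample data (`FamA`, `packer_x`, and so on) are hypothetical.

```python
from collections import defaultdict


def build_tag_dictionary(instance_families, instance_tags):
    """Dictionary step: map each detected tag to the families it reaches
    indirectly through the malware instances that carry it (its OGLs)."""
    tag_to_families = defaultdict(set)
    for instance, tags in instance_tags.items():
        families = instance_families.get(instance, set())
        for tag in tags:
            tag_to_families[tag] |= set(families)
    return dict(tag_to_families)


def select_sies(tag_to_families, min_families=2):
    """Assumption: a tag is 'significant' if it has at least min_families OGLs."""
    return {t: f for t, f in tag_to_families.items() if len(f) >= min_families}


def build_family_tree(sies):
    """Pick the SIE with the most OGLs as root, then attach the rest recursively."""
    # Sort by OGL count (descending); break ties by tag name for determinism.
    ordered = sorted(sies, key=lambda t: (-len(sies[t]), t))
    if not ordered:
        return None, {}
    root = ordered[0]
    children = {t: [] for t in ordered}

    def attach(parent, tag):
        # Assumption: descend while some child's family set contains this one.
        for child in children[parent]:
            if sies[tag] <= sies[child]:
                attach(child, tag)
                return
        children[parent].append(tag)

    for tag in ordered[1:]:
        attach(root, tag)
    return root, children


def render(tag, children, names):
    """Replace each SIE name with its chained family entity name."""
    return {
        "name": names[tag],
        "children": [render(c, children, names) for c in children[tag]],
    }


def identify_family_tree(instance_families, instance_tags, min_families=2):
    sies = select_sies(
        build_tag_dictionary(instance_families, instance_tags), min_families
    )
    root, children = build_family_tree(sies)
    if root is None:
        return None
    # Assumption: chained name = sorted family names joined by "-".
    names = {t: "-".join(sorted(f)) for t, f in sies.items()}
    return render(root, children, names)


# Hypothetical sample data: instance -> families, instance -> detected tags.
demo_families = {"m1": {"FamA"}, "m2": {"FamB"}, "m3": {"FamC"}, "m4": {"FamA"}}
demo_tags = {
    "m1": {"packer_x", "c2_shared"},
    "m2": {"packer_x", "c2_shared"},
    "m3": {"packer_x"},
    "m4": {"mutex_y"},
}
print(identify_family_tree(demo_families, demo_tags))
```

In this sample, `packer_x` reaches three families and becomes the root, `c2_shared` reaches two and is attached beneath it, and `mutex_y` reaches only one family, so it is not treated as an SIE under the assumed threshold of two.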