US 11,886,583 B2
Description-entropy-based intelligent detection method for big data mobile software similarity
Quanlong Guan, Guangdong (CN); Weiqi Luo, Guangdong (CN); Chuying Liu, Guangdong (CN); Huanming Zhang, Guangdong (CN); Lin Cui, Guangdong (CN); Zhefu Li, Guangdong (CN); and Rongjun Li, Guangdong (CN)
Appl. No. 17/312,449
Filed by Jinan University, Guangdong (CN)
PCT Filed Apr. 22, 2020, PCT No. PCT/CN2020/086052
§ 371(c)(1), (2) Date Jun. 10, 2021,
PCT Pub. No. WO2020/233322, PCT Pub. Date Nov. 26, 2020.
Claims priority of application No. 201910424145.7 (CN), filed on May 21, 2019.
Prior Publication US 2022/0058263 A1, Feb. 24, 2022
Int. Cl. G06F 7/04 (2006.01); G06F 15/16 (2006.01); H04L 29/06 (2006.01); G06F 21/56 (2013.01); G06F 8/74 (2018.01)
CPC G06F 21/563 (2013.01) [G06F 8/74 (2013.01)] 8 Claims
OG exemplary drawing
 
1. A method for intelligent determination of similarity of big data mobile softwares based on descriptive entropy, comprising the following steps:
S1, acquiring a path for each of the mobile softwares, and reading the mobile softwares according to the paths;
S2, performing a preliminary reverse-engineering decompilation on each of the mobile softwares to acquire function characteristics for each of the mobile softwares;
S3, summarizing a descriptive entropy distribution for each of the mobile softwares through descriptive entropies in the function characteristics;
S4, integrating the descriptive entropies of the mobile softwares, comparing the descriptive entropy distributions of mobile software pairs based on the integrated descriptive entropy distributions, and calculating similarity scores of the mobile software pairs; and
S5, outputting the similarity scores of the mobile softwares to give a mobile software similarity result; wherein
in step S2, the preliminary reverse-engineering decompilation specifically comprises:
acquiring source codes for each of the mobile softwares using a decompilation tool, acquiring function compression codes for each of the mobile softwares through the source codes, and calculating, from each of the function compression codes, a floating point number representing an amount of information of a function or class, namely the descriptive entropy, by the following formula:
H_d = −Σ_{i=1}^{n} p(substr_i) log₂ p(substr_i)
wherein, assuming that each of the function compression codes has n substrings, substr_i is the i-th substring of the function compression code, and p(substr_i) is the occurrence probability of the i-th substring; and
storing the function compression codes, descriptive entropies, and hash values for the mobile softwares in corresponding text files.
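The per-function entropy computation in step S2 can be sketched as follows. Note this is an illustrative reading of the claim, not the patented implementation: the claim does not fix how a function compression code is segmented into substrings, so the use of overlapping fixed-length k-grams (and the default k = 2) is an assumption made here for concreteness.

```python
import math
from collections import Counter

def descriptive_entropy(compression_code: str, k: int = 2) -> float:
    """Shannon entropy (in bits) over the substrings of a function
    compression code, following the claim's formula
    H_d = -sum_i p(substr_i) * log2 p(substr_i).

    Segmentation into overlapping k-grams is an assumption; the
    claim only requires some decomposition into n substrings."""
    # Enumerate the overlapping length-k substrings of the code.
    substrings = [compression_code[i:i + k]
                  for i in range(len(compression_code) - k + 1)]
    counts = Counter(substrings)
    total = len(substrings)
    # p(substr_i) is the occurrence frequency of each distinct substring.
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values())
```

For example, a compression code consisting of a single repeated substring has entropy 0 (no information), while a code whose substrings are uniformly distributed attains the maximum log₂ n; the resulting floating point values, collected over all functions of a mobile software, form the descriptive entropy distribution summarized in step S3.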