| CPC G06F 16/24578 (2019.01) [G06F 16/248 (2019.01); G06F 16/9535 (2019.01); G06Q 10/101 (2013.01); G06Q 10/107 (2013.01); G06Q 50/01 (2013.01); G06V 30/416 (2022.01); H04L 51/216 (2022.05)] | 10 Claims |

|
1. A computer system for file analysis, the system comprising:
a memory comprising instructions executable by one or more processors, wherein the one or more processors are operable to execute the instructions to:
control a hardware engine, comprising an entity extractor and a similarity engine, operatively coupled via a graphics bus to accelerate identification and analysis of large datasets in real-time, wherein the hardware engine is configured to:
identify one or more relevant files according to one or more semantic concepts;
identify one or more words for each of the one or more relevant files;
identify one or more n-grams according to the one or more words identified in the one or more relevant files, wherein an n-gram is one or more combinations of the one or more words;
generate a plurality of first scores, wherein each first score of the plurality of first scores is generated according to a term frequency and a global document frequency for each of the one or more words of each of the one or more n-grams of each of the one or more relevant files;
perform vector analysis on the one or more relevant files to generate a model document that improves file classification accuracy and reduces computational complexity by efficiently identifying unknown files according to similarities to relevant files, in order to assign each unknown file a relevant score;
generate a document vector according to averages of the plurality of first scores, wherein the document vector represents a reduced-dimensional representation of the file that increases the speed and accuracy of comparison between files and comprises a final value that illustrates how valuable the one or more words are in the one or more unknown files;
compare the unknown file with the model document according to the term frequency and the global document frequency; and
assign the relevant score to the unknown file according to the comparison, wherein the relevant score is used in a practical application comprising one or more of a technical space, a conceptual field, a geographic location and an industry sector, thereby enhancing the speed and efficiency of file retrieval and classification in such environments.
|
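The scoring steps recited in claim 1 (words, n-grams, first scores from term frequency and global document frequency, an averaged model document, and a comparison that yields a relevant score) can be illustrated with a minimal software sketch. It is not the claimed implementation. The function names, the smoothed inverse-document-frequency formula, the use of cosine similarity for the comparison, and the sample texts are all assumptions made for illustration, and the hardware engine and graphics bus are not modeled.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Identify the words of a file (lowercased alphanumeric runs).
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(words, max_n=2):
    # Build n-grams as runs of 1..max_n consecutive words.
    return [" ".join(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)]


def document_frequency(corpus_tokens):
    # Global document frequency: number of files containing each word.
    df = Counter()
    for words in corpus_tokens:
        df.update(set(words))
    return df


def first_scores(words, df, n_docs, max_n=2):
    # One score per n-gram: the mean TF-IDF of the words it contains.
    # The smoothed IDF below is an assumed formula, not taken from the claim.
    tf = Counter(words)
    total = len(words)

    def word_score(w):
        idf = math.log((1 + n_docs) / (1 + df.get(w, 0))) + 1.0
        return (tf[w] / total) * idf

    scores = {}
    for gram in set(ngrams(words, max_n)):
        parts = gram.split(" ")
        scores[gram] = sum(word_score(p) for p in parts) / len(parts)
    return scores


def model_document(relevant_texts, df, n_docs, max_n=2):
    # Average the first scores over all relevant files, per n-gram.
    per_file = [first_scores(tokenize(t), df, n_docs, max_n)
                for t in relevant_texts]
    keys = set().union(*per_file)
    return {k: sum(s.get(k, 0.0) for s in per_file) / len(per_file)
            for k in keys}


def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def relevance_score(unknown_text, model, df, n_docs, max_n=2):
    # Compare an unknown file with the model document (assumed: cosine).
    vec = first_scores(tokenize(unknown_text), df, n_docs, max_n)
    return cosine(vec, model)


# Hypothetical sample data, for illustration only.
relevant = ["solar panel efficiency improves with cooling",
            "solar panel output depends on cell efficiency"]
background = ["the recipe calls for butter and sugar"]
corpus_tokens = [tokenize(t) for t in relevant + background]
df = document_frequency(corpus_tokens)
n_docs = len(corpus_tokens)

model = model_document(relevant, df, n_docs)
score_related = relevance_score("new solar panel design raises efficiency",
                                model, df, n_docs)
score_unrelated = relevance_score("butter sugar cake recipe",
                                  model, df, n_docs)
```

In this sketch an unknown file that shares words or n-grams with the relevant files receives a higher score than one that shares none. Storing vectors as dictionaries keyed by n-gram keeps the example short. A production system would more likely use fixed-length numeric vectors.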