US 11,888,707 B2
Method for detecting anomalies in communications, and corresponding device and computer program product
Daniele Ucci, Turin (IT); Filippo Sobrero, Turin (IT); and Federica Bisio, Turin (IT)
Assigned to AIZOON S.R.L., Turin (IT)
Filed by Aizoon S.r.l., Turin (IT)
Filed on Dec. 23, 2022, as Appl. No. 18/088,279.
Claims priority of application No. 102021000033203 (IT), filed on Dec. 31, 2021.
Prior Publication US 2023/0216746 A1, Jul. 6, 2023
Int. Cl. H04L 41/16 (2022.01); H04L 43/04 (2022.01); G06N 20/10 (2019.01); G06F 18/23213 (2023.01); G06N 7/01 (2023.01); G06F 18/20 (2023.01)

CPC H04L 41/16 (2013.01) [G06F 18/29 (2023.01); G06N 7/01 (2023.01); G06N 20/10 (2019.01); H04L 43/04 (2013.01)]

11 Claims

1. A method of detecting anomalies in communications exchanged via a communication network between a respective source and a respective destination, comprising steps of:

obtaining metadata for a plurality of communications in a monitoring interval, wherein said metadata includes for each communication an identifier of said source, an identifier of said destination, and data extracted from an application protocol of the respective communication, wherein said communications comprise Hypertext Transfer Protocol (HTTP) communications;

processing said extracted data to obtain preprocessed data comprising one or more tokens for the respective communication, wherein each token comprises a string, wherein said one or more tokens are extracted from a user agent field or a referrer field;

dividing said monitoring interval into a training interval and a verification interval;

obtaining the identifier of a given source and generating a first list of a plurality of features (F_SRC,TI) for connections of said given source in said training interval via the following steps:

selecting the connections of said given source in said training interval,

determining for said connections of said given source in said training interval the univocal destination identifiers and for each token the respective univocal values,

determining a first set of enumeration rules by enumerating said univocal destination identifiers and for each token the respective univocal values, and

associating by means of said first set of enumeration rules with each connection of said source in said training interval a respective enumerated destination identifier and one or more respective enumerated tokens, wherein said first list of features comprises for each connection of said given source in said training interval the respective enumerated destination identifier and the respective one or more enumerated tokens;

obtaining the identifier of a group of devices to which said given source belongs and generating a second list of a plurality of features for the connections of the devices belonging to said group of devices in said training interval via the following steps:

selecting the connections of said group of devices in said training interval,

determining for said connections of said group of devices in said training interval the univocal destination identifiers and for each token the respective univocal values,

determining a second set of enumeration rules by enumerating said univocal destination identifiers and for each token the respective univocal values, and

associating by means of said second set of enumeration rules with each connection of said group of devices in said training interval a respective enumerated destination identifier and one or more respective enumerated tokens, wherein said second list of features comprises for each connection of said group of devices in said training interval the respective enumerated destination identifier and the respective one or more enumerated tokens;

generating a first set of Bayesian networks by training for each feature of said first list of features a respective Bayesian network using the data of the other features of said first list of features (F_SRC,TI), and generating a second set of Bayesian networks by training for each feature of said second list of features a respective Bayesian network using the data of the other features of said second list of features;

generating a third list of a plurality of features for the connections of said given source in said verification interval via the following steps:

selecting the connections of said given source in said verification interval, and

associating by means of said first set of enumeration rules with each connection of said given source in said verification interval a respective enumerated destination identifier and one or more respective enumerated tokens, wherein said third list of features comprises for each connection of said given source in said verification interval the respective enumerated destination identifier and the respective one or more enumerated tokens;

generating a fourth list of a plurality of features for the connections of said given source in said verification interval via the following steps:

selecting the connections of said given source in said verification interval, and

associating by means of said second set of enumeration rules with each connection of said given source in said verification interval a respective enumerated destination identifier and one or more respective enumerated tokens, wherein said fourth list of features comprises for each connection of said given source in said verification interval the respective enumerated destination identifier and the respective one or more enumerated tokens, wherein said first list of a plurality of features, said second list of a plurality of features, said third list of a plurality of features and said fourth list of a plurality of features further comprise at least one of: an enumerated value generated for a destination port of the Transmission Control Protocol (TCP) or the User Datagram Protocol (UDP) of the respective communication;

repeating the following steps for each connection of said given source in said verification interval:

determining based on the values of the features of said third list of features associated with the respective connection of said given source for each feature of said third list of features the respective most probable value by using said first set of Bayesian networks,

classifying each value of the features of said third list of features associated with the respective connection of said given source via the following steps:

in response to determining the value of a feature of said third list of features corresponds to the respective most probable value, classifying the value of the feature of said third list of features as normal, and

in response to determining the value of a feature of said third list of features does not correspond to the respective most probable value:

a) determining for the value of said feature of said third list of features the respective probability of occurrence by using said first set of Bayesian networks, and

b) classifying the value of said feature of said third list of features as normal in response to determining the respective probability of occurrence is greater than a first threshold, and

c) classifying the value of said feature of said third list of features as anomalous in response to determining the respective probability of occurrence is smaller than said first threshold; and

determining based on the values of the features of said fourth list of features associated with the respective connection of said given source for each feature of said fourth list of features the respective most probable value by using said second set of Bayesian networks, wherein one or more of the features of said first list of a plurality of features, said second list of a plurality of features, said third list of a plurality of features, and said fourth list of a plurality of features are discretized by means of a clustering algorithm, namely a k-means clustering algorithm, and

classifying each value of the features of said fourth list of features associated with the respective connection of said given source via the following steps:

in response to determining the value of a feature of said fourth list of features corresponds to the respective most probable value, classifying the value of the feature of said fourth list of features as normal, and

in response to determining the value of a feature of said fourth list of features does not correspond to the respective most probable value:

a) determining for the value of said feature of said fourth list of features the respective probability of occurrence by using said second set of Bayesian networks, and

b) classifying the value of said feature of said fourth list of features as normal in response to determining the respective probability of occurrence is greater than a second threshold, and

c) classifying the value of said feature of said fourth list of features as anomalous in response to determining the respective probability of occurrence is smaller than said second threshold.
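
By way of non-limiting illustration only, the enumeration and classification steps recited above may be sketched as follows. The sketch is not the claimed implementation: in place of trained Bayesian networks it uses plain conditional frequency tables (each feature conditioned on the exact values of all other features), which is a deliberately simplified stand-in. The function names, the field names "dst" and "ua", the reserved code 0 for values not seen in the training interval, the treatment of an unseen context as anomalous, and the threshold value 0.25 are all assumptions introduced for the example.

```python
from collections import Counter, defaultdict

UNSEEN = 0  # assumption: code reserved for values absent from the training interval


def build_enumeration_rules(connections, fields):
    """Enumerate the unique values of each field as integers 1, 2, ..."""
    rules = {field: {} for field in fields}
    for conn in connections:
        for field in fields:
            rules[field].setdefault(conn[field], len(rules[field]) + 1)
    return rules


def enumerate_connections(connections, rules):
    """Apply the enumeration rules; unknown values map to UNSEEN."""
    return [
        tuple(rules[field].get(conn[field], UNSEEN) for field in rules)
        for conn in connections
    ]


def train_conditional_tables(rows):
    """Simplified stand-in for one Bayesian network per feature:
    count each feature value given the values of all other features."""
    tables = [defaultdict(Counter) for _ in range(len(rows[0]))]
    for row in rows:
        for i, value in enumerate(row):
            context = row[:i] + row[i + 1:]
            tables[i][context][value] += 1
    return tables


def classify_row(row, tables, threshold):
    """Label each feature value of one connection as normal or anomalous."""
    labels = []
    for i, value in enumerate(row):
        context = row[:i] + row[i + 1:]
        counts = tables[i].get(context)
        if not counts:
            # Assumption: a context never seen in training is anomalous.
            labels.append("anomalous")
            continue
        most_probable = counts.most_common(1)[0][0]
        if value == most_probable:
            labels.append("normal")
            continue
        probability = counts[value] / sum(counts.values())
        # Equality with the threshold is treated as anomalous (assumption).
        labels.append("normal" if probability > threshold else "anomalous")
    return labels


# Hypothetical training-interval connections of one source.
train = (
    [{"dst": "a.example", "ua": "Mozilla"}] * 8
    + [{"dst": "a.example", "ua": "curl"}] * 2
    + [{"dst": "b.example", "ua": "curl"}] * 5
)
rules = build_enumeration_rules(train, ["dst", "ua"])
tables = train_conditional_tables(enumerate_connections(train, rules))

# Hypothetical verification-interval connections of the same source.
verify = [
    {"dst": "a.example", "ua": "Mozilla"},
    {"dst": "a.example", "ua": "curl"},
    {"dst": "c.example", "ua": "curl"},
]
results = [
    classify_row(row, tables, threshold=0.25)
    for row in enumerate_connections(verify, rules)
]
```

The same functions could be applied a second time to the connections of the group of devices to which the source belongs, yielding the second set of enumeration rules and the second model, against which the verification-interval connections of the source would be classified with a second threshold.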