US 12,373,324 B1
System and method for format drift and format anomaly detection
Zhaohui Wang, San Francisco, CA (US); Ryan Gannon, San Francisco, CA (US); Xiao Lin, San Jose, CA (US); and Chandrima Sarkar, Dublin, CA (US)
Assigned to Cisco Technology, Inc., San Jose, CA (US)
Filed by Splunk Inc., San Francisco, CA (US)
Filed on Feb. 2, 2022, as Appl. No. 17/591,535.
Claims priority of provisional application 63/285,997, filed on Dec. 3, 2021.
Int. Cl. G06F 11/34 (2006.01); G06F 16/242 (2019.01); G06F 16/2458 (2019.01)

CPC G06F 11/3452 (2013.01) [G06F 16/244 (2019.01); G06F 16/2462 (2019.01)]

23 Claims

1. A computerized method comprising:

extracting a format representation for a first data sample of an incoming data stream by at least accessing a data schema for each field of the first data sample to determine a data point type, wherein the first data sample comprises a plurality of data points, each data point of the plurality of data points is maintained within a field of the first data sample and corresponds to a performance measurement directed to (i) a computing resource associated with a source of the incoming data stream or (ii) an operating state of the source of the incoming data stream;

conducting transformations on format representations associated with data point types of the first data sample to produce a first plurality of count values, wherein the transformed format representations associated with each data point type within the first data sample operate as a count reference;

accessing a data schema for each field of a second data sample of the incoming data stream to determine a data point type for identifying changes in field format;

conducting transformations on format representations associated with data point types of a second data sample of the incoming data stream to produce a second plurality of count values, wherein the second plurality of count values identifies a number of occurrences of the transformed format representation associated with each data point type within the second data sample;

computing a first probability distribution based on the first plurality of count values;

computing a second probability distribution based on the second plurality of count values;

conducting analytics using the first probability distribution and the second probability distribution to produce a first metric; and

determining a format drift for the data stream in response to evaluating the first metric against a second metric operating as a threshold metric signifying a format drift condition.
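
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The claim does not say how a format representation is transformed, which analytic produces the first metric, or what threshold value is used, so the following are all assumptions made for illustration: the run-collapsing pattern in `format_representation`, the Jensen-Shannon divergence in `js_divergence`, the default `threshold` of 0.1, the `SCHEMA` mapping, and every function name.

```python
import math
from collections import Counter

# Hypothetical data schema: field name -> data point type.
SCHEMA = {"timestamp": "time", "cpu": "percent"}

def format_representation(value: str) -> str:
    """Collapse a value into a coarse pattern: digit runs -> 'd', letter runs -> 'a'."""
    symbols = []
    for ch in value:
        sym = "d" if ch.isdigit() else "a" if ch.isalpha() else ch
        if not symbols or symbols[-1] != sym:  # collapse repeated symbols
            symbols.append(sym)
    return "".join(symbols)

def count_formats(sample, schema):
    """Count occurrences of each (data point type, format pattern) pair in a sample."""
    counts = Counter()
    for record in sample:
        for field, value in record.items():
            counts[(schema[field], format_representation(str(value)))] += 1
    return counts

def to_distribution(counts, support):
    """Normalize counts into probabilities over a shared, ordered support."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot build a distribution from an empty sample")
    return [counts.get(key, 0) / total for key in support]

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2), bounded in [0, 1]. Assumed metric."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def detect_format_drift(first_sample, second_sample, schema, threshold=0.1):
    """Return (metric, drifted) by comparing the two samples' format distributions."""
    reference = count_formats(first_sample, schema)   # count reference
    current = count_formats(second_sample, schema)
    support = sorted(set(reference) | set(current))
    metric = js_divergence(to_distribution(reference, support),
                           to_distribution(current, support))
    return metric, metric > threshold

baseline = [{"timestamp": "2021-12-03", "cpu": "85%"},
            {"timestamp": "2021-12-04", "cpu": "7%"}]
same_format = [{"timestamp": "2022-02-02", "cpu": "42%"}]
new_format = [{"timestamp": "12/03/2021", "cpu": "0.85"}]

print(detect_format_drift(baseline, same_format, SCHEMA))
print(detect_format_drift(baseline, new_format, SCHEMA))
```

In this sketch the first comparison yields a metric of zero, because both samples collapse to the same patterns (`d-d-d` and `d%`). The second comparison reaches the upper bound of the divergence, because no pattern is shared between the samples. A symmetric, bounded metric is used here only because it makes a fixed threshold easy to reason about; other divergence measures would fit the claim language equally well.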