| CPC G06F 16/2255 (2019.01) | 10 Claims |

|
1. A computer-implemented method for storing log data generated in a distributed computing environment, comprising:
receiving a log line;
applying a first tokenization rule to the log line to create a plurality of base tokens, where each base token is a sequence of successive characters in the log line having the same type;
applying a second tokenization rule to the log line to create a plurality of combination tokens, where each combination token is comprised of two or more base tokens appended together;
applying a third tokenization rule to the log line to create a plurality of n-gram tokens, where each n-gram token is an n-gram derived from a base token in the plurality of base tokens;
combining tokens from the plurality of base tokens, the plurality of combination tokens and the plurality of n-gram tokens to form a set of tokens;
for each token in the set of tokens, storing a given token by
applying a hash function to the given token to generate a hash value, where the given token is associated with a given software module at which the log line was produced;
updating a listing of software entities with the given software module, where entries in the listing of software entities can identify more than one software module and each entry in the listing of software entities specifies a unique set of software modules; and
storing the hash value, along with an address, in a token map table of a probabilistic data structure, where the address maps the hash value to an entry in the listing of software entities.
|
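The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The character classes, the three-token limit on combinations, the n-gram length of 3, the truncated BLAKE2b hash, and every name used (`char_type`, `tokenize`, `TokenStore`, and so on) are assumptions chosen for illustration. The claim does not fix any of them.

```python
import hashlib
from itertools import groupby


def char_type(ch):
    # Assumed character classes; the claim only requires "same type".
    if ch.isalpha():
        return "alpha"
    if ch.isdigit():
        return "digit"
    if ch.isspace():
        return "space"
    return "other"


def base_tokens(line):
    # First rule: maximal runs of successive characters of one type.
    return ["".join(run) for _, run in groupby(line, key=char_type)]


def combination_tokens(base, max_parts=3):
    # Second rule: two or more adjacent base tokens appended together.
    return ["".join(base[i:i + size])
            for size in range(2, max_parts + 1)
            for i in range(len(base) - size + 1)]


def ngram_tokens(base, n=3):
    # Third rule: character n-grams taken from within each base token.
    return [tok[i:i + n] for tok in base for i in range(len(tok) - n + 1)]


def tokenize(line):
    # Combine the three pluralities of tokens into one set.
    base = base_tokens(line)
    return set(base) | set(combination_tokens(base)) | set(ngram_tokens(base))


def hash_token(token):
    # Truncated digest (assumption): distinct tokens may collide,
    # which is what makes lookups probabilistic.
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class TokenStore:
    def __init__(self):
        self.entities = []      # listing: each entry is a unique set of modules
        self.entity_index = {}  # module set -> address of its entry
        self.token_map = {}     # hash value -> address into the listing

    def _address_of(self, modules):
        # Reuse the entry for this exact module set, or append a new one.
        if modules not in self.entity_index:
            self.entity_index[modules] = len(self.entities)
            self.entities.append(modules)
        return self.entity_index[modules]

    def store(self, token, module):
        h = hash_token(token)
        current = self.entities[self.token_map[h]] if h in self.token_map else frozenset()
        self.token_map[h] = self._address_of(current | {module})

    def add_line(self, line, module):
        for token in tokenize(line):
            self.store(token, module)

    def lookup(self, token):
        # Returns the modules that may have logged the token (false
        # positives are possible on a hash collision).
        addr = self.token_map.get(hash_token(token))
        return set() if addr is None else set(self.entities[addr])


store = TokenStore()
store.add_line("disk error 42", "auth")
store.add_line("disk full", "storage")
print(sorted(store.lookup("disk")))
```

In this sketch the token map holds only a hash value and a small integer address, so tokens seen in the same combination of modules share one entry in the listing rather than each carrying its own copy of the module set.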