US 11,741,145 B1
Method and system for classification of unstructured data items
Bhushan Pandit, Maharashtra (IN); Surashree Kane, Maharashtra (IN); and Abhishek Shinde, Maharashtra (IN)
Assigned to Veritas Technologies LLC, Santa Clara, CA (US)
Filed by Veritas Technologies LLC, Mountain View, CA (US)
Filed on Sep. 30, 2018, as Appl. No. 16/147,822.
Int. Cl. G06F 16/00 (2019.01); G06F 7/00 (2006.01); G06F 16/35 (2019.01); G06F 16/31 (2019.01)
CPC G06F 16/355 (2019.01) [G06F 16/313 (2019.01)]
20 Claims
 
1. A computer-implemented method, implemented in a computer system, comprising:
producing an item identifier corresponding to each unstructured data item of a plurality of unstructured data items, wherein
the producing comprises
performing a hashing operation on information associated with each of the plurality of unstructured data items;
storing each unstructured data item and its corresponding item identifier in association with one another in a storage device of the computer system;
for each item of a plurality of items,
 determining whether a backup operation should be performed on an item of the plurality of items, wherein
the item and each unstructured data item comprises unstructured data, and
the determining comprises
ingesting the item into a classification engine, wherein
the classification engine is implemented in the computer system, and
the ingesting comprises
 generating an item identifier for the item, at least in part, by performing the hash operation on information associated with the item, and
 storing the item identifier and item in association with one another in the storage device,
performing term processing, wherein
the performing the term processing comprises
 determining a first number of occurrences of each term of a plurality of terms in the item, comprising
 identifying at least one term of the plurality of terms in the item by determining a term frequency of each of the plurality of terms in the item, and
 determining an inverse document frequency of the at least one term with respect to the plurality of unstructured data items, and
 determining a second number of occurrences of each term of the plurality of terms in a reference item of unstructured data,
generating a similarity index, comprising
producing a first list of ranking values by ranking the plurality of terms in the item based on the first number of occurrences,
producing a second list of ranking values by ranking the plurality of terms in the reference item based on the second number of occurrences, and
determining a number of common ranking values, wherein
 each ranking value is in the first list of ranking values and in the second list of ranking values, and
responsive to a size of the similarity index meeting a threshold,
determining a relational similarity index, wherein
 the relational similarity index is based, at least in part, on a subset of the first list of ranking values and another subset of a list of ranking values for an unstructured data item of the plurality of unstructured data items, and
 the relational similarity index represents a similarity between the item and the unstructured data item, and
in response to the relational similarity index indicating that the item and the unstructured data item are sufficiently similar, associating a classification tag with the item; and
performing a backup operation on one or more items of the plurality of items that are associated with the classification tag.
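
The steps recited in claim 1 can be illustrated with a minimal sketch. It is one possible reading of the claim language, not the patented implementation. Every name in it is hypothetical (item_identifier, term_counts, inverse_document_frequency, ranked_terms, similarity_index, classify). It also rests on several assumptions the claim does not state: SHA-256 as the hashing operation, whitespace tokenization, the textbook log(N / df) form of inverse document frequency, a "ranking value" read as the term occupying a position in a frequency-ordered list, a "subset" read as a prefix of that list, and arbitrary example thresholds.

```python
import hashlib
import math
from collections import Counter


def item_identifier(info: bytes) -> str:
    # Assumption: SHA-256. The claim only recites "a hashing operation".
    return hashlib.sha256(info).hexdigest()


def term_counts(text: str, terms: list[str]) -> dict[str, int]:
    # Number of occurrences of each tracked term (naive whitespace tokenizer).
    counts = Counter(text.lower().split())
    return {t: counts[t] for t in terms}


def inverse_document_frequency(term: str, corpus: list[str]) -> float:
    # Textbook IDF, log(N / df); returns 0.0 when no document contains the term.
    df = sum(1 for doc in corpus if term in doc.lower().split())
    return math.log(len(corpus) / df) if df else 0.0


def ranked_terms(counts: dict[str, int], top_k: int) -> list[str]:
    # Terms that occur at least once, most frequent first; ties broken alphabetically.
    present = [t for t, c in counts.items() if c > 0]
    return sorted(present, key=lambda t: (-counts[t], t))[:top_k]


def similarity_index(first: list[str], second: list[str]) -> int:
    # Number of entries that appear in both ranked lists.
    return len(set(first) & set(second))


def classify(item, reference, stored_item, terms,
             top_k=3, threshold=2, subset=2, relational_threshold=2,
             tag="tagged"):
    # Stage 1: compare the item against the reference item.
    first = ranked_terms(term_counts(item, terms), top_k)
    second = ranked_terms(term_counts(reference, terms), top_k)
    if similarity_index(first, second) < threshold:
        return None
    # Stage 2: compare prefixes of the ranked lists of the item and a stored item.
    stored = ranked_terms(term_counts(stored_item, terms), top_k)
    relational = similarity_index(first[:subset], stored[:subset])
    return tag if relational >= relational_threshold else None


terms = ["invoice", "payment", "due", "contract"]
item = "invoice invoice payment payment payment due"
reference = "invoice payment due due"
result = classify(item, reference, "payment payment invoice", terms)
```

In this sketch, an item for which classify returns a tag would be the one selected for the backup operation in the final step of the claim; an item for which it returns None would be skipped.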