US 11,769,005 B2
Information uniqueness assessment using string-based collection frequency
Shou-Huey Jiang, Durham, NC (US); Wenjin Liu, Cary, NC (US); and Chao Su, Cary, NC (US)
Assigned to EMC IP Holding Company LLC, Hopkinton, MA (US)
Filed by EMC IP Holding Company LLC, Hopkinton, MA (US)
Filed on May 29, 2020, as Appl. No. 16/887,046.
Prior Publication US 2021/0374336 A1, Dec. 2, 2021
Int. Cl. G06F 40/205 (2020.01); G06F 40/284 (2020.01); G06N 5/04 (2023.01); G06F 21/62 (2013.01); G06V 30/262 (2022.01)
CPC G06F 40/205 (2020.01) [G06F 21/6254 (2013.01); G06F 40/284 (2020.01); G06N 5/04 (2013.01); G06V 30/274 (2022.01)]
20 Claims
 
8. An apparatus comprising:
at least one processing device comprising a processor coupled to a memory;
the at least one processing device being configured to implement the following steps:
obtaining a plurality of documents from at least one data source;
evaluating the plurality of documents for a given level, of a plurality of levels, to identify a plurality of collections of documents, wherein the given level is a level of interest to a given user;
determining a collection frequency, within the given level, for a given character string based at least in part on: (i) a number of the collections of documents within the given level comprising the given character string, wherein an existence of the given character string in one or more of the collections of documents, of the plurality of collections of documents, within the given level has a count of one for each collection of documents comprising one or more instances of the given character string, relative to (ii) a total number of the collections of documents within the given level in the plurality of collections;
assigning a uniqueness rating to the given character string for the given level based at least in part on a comparison of: (a) the collection frequency of the given character string within the given level, to (b) a collection frequency of one or more additional character strings within the given level in one or more of the plurality of collections; and
performing at least one automated action using the given character string based at least in part on the assigned uniqueness rating.
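
The collection-frequency and uniqueness-rating steps recited in claim 8 can be illustrated with a minimal Python sketch. This is not the patented implementation. Several things are assumptions made only for illustration: each collection is modeled as a list of document texts, a string "exists" in a collection when it is a substring of at least one document, the rating is the share of the other strings with a strictly higher collection frequency, and the automated action is reduced to returning the strings whose rating meets a hypothetical threshold. The function names `collection_frequency`, `uniqueness_ratings`, and `strings_to_flag` are likewise hypothetical.

```python
from typing import Dict, List, Sequence

# Assumption: a collection is a sequence of document texts at the given level.
Collection = Sequence[str]


def collection_frequency(collections: Sequence[Collection], string: str) -> float:
    """Collections containing `string`, relative to all collections.

    A collection counts once no matter how many instances it holds.
    """
    if not collections:
        return 0.0
    containing = sum(
        1 for docs in collections if any(string in doc for doc in docs)
    )
    return containing / len(collections)


def uniqueness_ratings(
    collections: Sequence[Collection], strings: Sequence[str]
) -> Dict[str, float]:
    """Rate each string by comparing its frequency with the other strings.

    Assumed rating: fraction of the other strings that appear in strictly
    more collections, so rarer strings score closer to 1.0.
    """
    cf = {s: collection_frequency(collections, s) for s in strings}
    ratings: Dict[str, float] = {}
    for s in strings:
        others = [t for t in strings if t != s]
        if not others:
            ratings[s] = 0.0  # nothing to compare against
            continue
        more_common = sum(1 for t in others if cf[t] > cf[s])
        ratings[s] = more_common / len(others)
    return ratings


def strings_to_flag(
    collections: Sequence[Collection],
    strings: Sequence[str],
    threshold: float = 0.6,  # hypothetical cut-off, not from the claim
) -> List[str]:
    """Stand-in for the automated action: select highly unique strings."""
    ratings = uniqueness_ratings(collections, strings)
    return [s for s in strings if ratings[s] >= threshold]


if __name__ == "__main__":
    # Four made-up collections at one level of interest.
    example = [
        ["error at host alpha", "host alpha rebooted"],
        ["error at host beta"],
        ["disk full on host gamma", "error again"],
        ["host alpha error"],
    ]
    print(collection_frequency(example, "alpha"))  # 2 of 4 collections
    print(strings_to_flag(example, ["error", "host", "alpha", "gamma"]))
```

In this sketch "alpha" occurs twice in the first collection but still contributes a count of one for that collection, matching the counting rule in the claim. A real system would probably tokenize documents and derive collections from the user's chosen level rather than take them as literal lists.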