US 11,886,229 B1
System and method for generating a global dictionary and performing similarity search queries in a network
Naveen Goela, Berkeley, CA (US); Joshua F. Stoddard, Cary, NC (US); John R. Coates, Berkeley, CA (US); Christian L. Hunt, Chapel Hill, NC (US); and Adam Mustafa, Cedar Grove, NJ (US)
Assigned to TANIUM INC., Kirkland, WA (US)
Filed by Tanium Inc., Emeryville, CA (US)
Filed on Feb. 22, 2021, as Appl. No. 17/182,083.
Application 17/182,083 is a continuation in part of application No. 16/532,391, filed on Aug. 5, 2019, granted, now 10,929,345.
Application 16/532,391 is a continuation in part of application No. 15/215,474, filed on Jul. 20, 2016, granted, now 10,482,242, issued on Nov. 19, 2019.
Application 15/215,474 is a continuation in part of application No. 15/215,468, filed on Jul. 20, 2016, granted, now 10,372,904, issued on Aug. 6, 2019.
Claims priority of provisional application 62/868,767, filed on Jun. 28, 2019.
Claims priority of provisional application 62/333,768, filed on May 9, 2016.
Claims priority of provisional application 62/305,482, filed on Mar. 8, 2016.
This patent is subject to a terminal disclaimer.
Int. Cl. G06F 16/14 (2019.01); G06F 16/13 (2019.01); G06F 16/93 (2019.01); G06F 16/182 (2019.01); G06F 18/22 (2023.01)

CPC G06F 16/156 (2019.01) [G06F 16/137 (2019.01); G06F 16/144 (2019.01); G06F 16/182 (2019.01); G06F 16/93 (2019.01); G06F 18/22 (2023.01)]

32 Claims

1. A method of searching a collection of machines for documents similar to a target document, the method comprising:

at a server system:

at a sequence of times, requesting samples of document frequency information from respective machines in the collection of machines, and in response receiving sampling responses, wherein

each sampling response in at least a subset of the sampling responses includes information indicating one or more terms in a corpus of information stored at a respective machine in the collection of machines;

collectively, the collection of machines store a corpora of information that includes the corpus of information stored at each respective machine in the collection of machines; and

collectively, information in the sampling responses corresponds, for terms specified in the received sampling responses, to document frequencies of said terms in the corpora of information stored in the collection of machines;

generating a global dictionary from the received sampling responses, the global dictionary includes global document frequency values corresponding to the document frequencies of terms in the corpora of information stored in the collection of machines;

in response to one or more user commands, generating a similarity search query for a target document, the similarity search query including identifiers of terms in the target document, and sending, through one or more linear communication orbits, the similarity search query to one or more respective machines in the collection of machines; and

receiving, in response to the similarity search query:

query responses identifying files stored at the respective machines that meet predefined similarity criteria with respect to the target document, and for each identified file a similarity score that is based, at least in part, on global document frequency values, obtained from the global dictionary, for the terms identified in the similarity search query.
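
The steps recited in claim 1 can be illustrated with a minimal sketch, which is not the patented implementation. It assumes each sampling response is a dictionary carrying a document count and per-term document counts, that the similarity score is a cosine similarity over TF-IDF weights, and that the predefined similarity criteria reduce to a score threshold. The claim fixes none of these details. All function names, field names (`num_docs`, `doc_freq`), the IDF smoothing, and the `threshold` parameter are hypothetical, and transport over linear communication orbits is not modeled.

```python
import math
from collections import Counter


def build_global_dictionary(sampling_responses):
    """Merge per-machine samples into global document-frequency values.

    Assumed response shape (hypothetical):
    {"num_docs": int, "doc_freq": {term: number of sampled docs containing term}}
    """
    total_docs = 0
    doc_freq = Counter()
    for response in sampling_responses:
        total_docs += response["num_docs"]
        doc_freq.update(response["doc_freq"])
    return {"total_docs": total_docs, "doc_freq": dict(doc_freq)}


def idf(term, dictionary):
    """Smoothed inverse document frequency (the smoothing is an assumption)."""
    n = dictionary["total_docs"]
    df = dictionary["doc_freq"].get(term, 0)
    return math.log((1 + n) / (1 + df)) + 1.0


def tfidf_vector(terms, dictionary):
    """Weight each term's count in a document by its global IDF."""
    counts = Counter(terms)
    return {t: c * idf(t, dictionary) for t, c in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def answer_query(query_terms, local_files, dictionary, threshold=0.5):
    """Run at a machine: return (file name, score) pairs meeting the threshold."""
    query_vec = tfidf_vector(query_terms, dictionary)
    results = []
    for name, terms in local_files.items():
        score = cosine(query_vec, tfidf_vector(terms, dictionary))
        if score >= threshold:
            results.append((name, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Example: two machines report samples; one machine then answers a query.
responses = [
    {"num_docs": 3, "doc_freq": {"malware": 1, "report": 3}},
    {"num_docs": 2, "doc_freq": {"report": 2, "invoice": 1}},
]
global_dictionary = build_global_dictionary(responses)
local_files = {
    "a.txt": ["malware", "report", "report"],
    "b.txt": ["invoice", "report"],
}
matches = answer_query(["malware", "report"], local_files, global_dictionary)
```

In this sketch the common term `report` receives a low weight because it appears in every sampled document, so the rarer term `malware` dominates the score and only `a.txt` meets the threshold. Because the dictionary is built from samples, its document frequencies are estimates rather than exact counts.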