US 11,861,039 B1
Hierarchical system and method for identifying sensitive content in data
Yahor Pushkin, Redmond, WA (US); Sravan Babu Bodapati, Redmond, WA (US); Sunil Mallya Kasaragod, San Francisco, CA (US); Sameer Karnik, Issaquah, WA (US); Abhinav Goyal, Snohomish, WA (US); Yaser Al-Onaizan, Cortlandt Manor, NY (US); Ravindra Manjunatha, Seattle, WA (US); Kalpit Dixit, Mountain View, CA (US); Alok Kumar Parmesh, Bothell, WA (US); and Syed Kashif Hussain Shah, Santa Clara, CA (US)
Assigned to Amazon Technologies, Inc., Seattle, WA (US)
Filed by Amazon Technologies, Inc., Seattle, WA (US)
Filed on Sep. 28, 2020, as Appl. No. 17/035,437.
Int. Cl. G06F 21/62 (2013.01); G06F 16/903 (2019.01); G06F 3/06 (2006.01); G06N 20/00 (2019.01)
CPC G06F 21/6245 (2013.01) [G06F 3/0619 (2013.01); G06F 3/0623 (2013.01); G06F 3/0683 (2013.01); G06F 16/90344 (2019.01); G06N 20/00 (2019.01)]
20 Claims
 
1. A system, comprising:
a data storage system comprising a plurality of storage devices storing one or more data collections comprising sensitive data text strings and non-sensitive data text strings;
one or more computing devices, comprising one or more processors and memory, configured to implement one or more sensitive data classifiers local to the data storage system, wherein the one or more sensitive data classifiers are configured to:
analyze a plurality of data items of the one or more data collections, wherein individual data items comprise a plurality of text strings;
classify at least some data items of the plurality of data items as containing the sensitive data text strings within a probability threshold, based at least in part on the analysis; and
provide the at least some data items classified as containing the sensitive data text strings within the probability threshold to a separate sensitive data discovery component; and
one or more other computing devices, comprising one or more other processors and memory, configured to implement the sensitive data discovery component remote to the data storage system, wherein the sensitive data discovery component is configured to:
obtain the at least some data items classified by the one or more sensitive data classifiers as containing the sensitive data text strings;
perform a sensitive data location analysis on the at least some obtained data items to identify a location of the sensitive data text strings distinct from the non-sensitive data text strings within one or more of the data items;
generate location information for the sensitive data text strings within the one or more data items; and
provide to a destination information comprising the location information for the sensitive data text strings within the one or more data items.
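
The two-tier arrangement recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. All names (`PATTERNS`, `Location`, `sensitivity_score`, `classify_locally`, `locate_sensitive`) are hypothetical, the regular expressions stand in for whatever classifiers an actual system would use, and reading "within a probability threshold" as "score at or above a threshold" is an assumption made only for this example. The first stage models the classifiers local to the data storage system; the second models the remote sensitive data discovery component, which receives only the flagged data items.

```python
import re
from dataclasses import dataclass

# Hypothetical detectors; a real system would likely use trained models.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
}


@dataclass
class Location:
    """Position of one sensitive text string inside a data item."""
    item_id: str
    string_index: int  # which text string of the data item
    start: int         # character offset where the match begins
    end: int           # character offset just past the match
    kind: str


def sensitivity_score(strings):
    """Cheap stand-in for a classifier probability (an assumption):
    the fraction of text strings containing a digit or an '@'."""
    if not strings:
        return 0.0
    hits = sum(1 for s in strings if any(c.isdigit() or c == "@" for c in s))
    return hits / len(strings)


def classify_locally(items, threshold=0.3):
    """Stage 1, local to storage: keep only the data items whose score
    meets the threshold, so unflagged items are never forwarded."""
    return {
        item_id: strings
        for item_id, strings in items.items()
        if sensitivity_score(strings) >= threshold
    }


def locate_sensitive(flagged):
    """Stage 2, the separate discovery component: find where the
    sensitive text strings sit within each flagged data item."""
    locations = []
    for item_id, strings in flagged.items():
        for idx, text in enumerate(strings):
            for kind, pattern in PATTERNS.items():
                for m in pattern.finditer(text):
                    locations.append(
                        Location(item_id, idx, m.start(), m.end(), kind)
                    )
    return locations


if __name__ == "__main__":
    items = {
        "doc-1": ["hello world", "SSN 123-45-6789 on file"],
        "doc-2": ["nothing to see", "still nothing"],
    }
    flagged = classify_locally(items)
    for loc in locate_sensitive(flagged):
        print(loc)
```

In this sketch only `doc-1` reaches the second stage, which reflects the apparent purpose of the hierarchy: the more expensive location analysis runs on a filtered subset rather than on every stored data item.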