US 12,229,311 B2
Identifying sensitive data risks in cloud-based enterprise deployments based on graph analytics
Julian James Stephen, Yorktown Heights, NY (US); Ted Augustus Habeck, Hopewell Junction, NY (US); and Arjun Natarajan, Old Tappan, NJ (US)
Assigned to International Business Machines Corporation, Armonk, NY (US)
Filed by International Business Machines Corporation, Armonk, NY (US)
Filed on Apr. 5, 2023, as Appl. No. 18/131,049.
Application 18/131,049 is a continuation of application No. 17/225,745, filed on Apr. 8, 2021, granted, now Pat. No. 11,727,142.
Prior Publication US 2023/0244812 A1, Aug. 3, 2023
This patent is subject to a terminal disclaimer.
Int. Cl. G06F 21/62 (2013.01); H04L 41/0604 (2022.01); H04L 41/22 (2022.01); H04L 67/02 (2022.01)
CPC G06F 21/6245 (2013.01) [G06F 21/6227 (2013.01); H04L 41/0627 (2013.01); H04L 41/22 (2013.01); H04L 67/02 (2013.01)]
19 Claims
 
1. A method, in a data processing system, for identifying sensitive data risks in cloud-based deployments, the method comprising:
building a knowledge graph based on data schema information for a cloud-based computing environment, a set of parsed infrastructure logs, and a set of captured application queries;
identifying a set of sensitive flows in the knowledge graph representing paths from a sensitive data element to an endpoint in the knowledge graph;
scoring the set of sensitive flows based on a scoring algorithm, wherein the scoring algorithm determines, for each sensitive flow, a score along a centrality dimension at least by generating, for each vertex in the set of sensitive flows, a ranking score based on a propagation of a rank value from one vertex to another connected vertex in the set of sensitive flows; and
issuing an alert to an administrator in response to a score of a sensitive flow within the set of sensitive flows exceeding a threshold, wherein the scoring algorithm further determines, for each sensitive flow, a score along the centrality dimension at least by, for vertices that do not have outgoing edges, performing a teleportation operation that teleports propagation of the rank value to a randomly selected vertex using a damping factor to model a probability that data read from one data element is not propagated further, wherein the teleportation operation is limited to vertices in the set of sensitive flows, and wherein vertices with a higher concentration of sensitive data have a higher relative ranking score.
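
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes the knowledge graph is already available as a list of directed edges, and every name in it (`find_sensitive_flows`, `rank_flow_vertices`, `flows_over_threshold`, the example vertices) is hypothetical. The ranking is a PageRank-style iteration in which a vertex without outgoing edges spreads its damped rank uniformly over the vertices of the sensitive flows, which is the expected value of teleporting to a randomly selected vertex in that set. The claim does not say how vertex ranks are combined into a flow score; summing them is an assumption made here for illustration.

```python
def find_sensitive_flows(edges, sensitive, endpoints):
    """Enumerate simple paths from each sensitive vertex to any endpoint."""
    adj = {}
    for src, dst in edges:
        adj.setdefault(src, []).append(dst)
    flows = []

    def dfs(node, path):
        if node in endpoints and len(path) > 1:
            flows.append(list(path))
        for nxt in adj.get(node, []):
            if nxt not in path:  # keep paths simple (no cycles)
                path.append(nxt)
                dfs(nxt, path)
                path.pop()

    for start in sorted(sensitive):
        dfs(start, [start])
    return flows


def rank_flow_vertices(flows, damping=0.85, iterations=50):
    """PageRank-style ranking restricted to vertices on sensitive flows."""
    vertices = sorted({v for flow in flows for v in flow})
    n = len(vertices)
    if n == 0:
        return {}
    # Only edges that lie on a sensitive flow take part in propagation.
    out = {v: set() for v in vertices}
    for flow in flows:
        for src, dst in zip(flow, flow[1:]):
            out[src].add(dst)
    rank = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        new = {v: (1.0 - damping) / n for v in vertices}
        for v in vertices:
            if out[v]:
                share = damping * rank[v] / len(out[v])
                for w in out[v]:
                    new[w] += share
            else:
                # No outgoing edges: teleport, limited to flow vertices.
                # (Assumption: uniform expectation of a random selection.)
                for w in vertices:
                    new[w] += damping * rank[v] / n
        rank = new
    return rank


def flows_over_threshold(flows, rank, threshold):
    """Return (flow, score) pairs whose score exceeds the threshold.

    Assumption: a flow's score is the sum of its vertices' rank values.
    """
    scored = [(flow, sum(rank[v] for v in flow)) for flow in flows]
    return [(flow, score) for flow, score in scored if score > threshold]


if __name__ == "__main__":
    # Hypothetical graph: two sensitive elements, two endpoints.
    edges = [("ssn", "db_view"), ("db_view", "api"), ("ssn", "export"),
             ("email", "api"), ("public", "api")]
    flows = find_sensitive_flows(edges, {"ssn", "email"}, {"api", "export"})
    ranks = rank_flow_vertices(flows)
    for flow, score in flows_over_threshold(flows, ranks, threshold=0.5):
        print("ALERT:", " -> ".join(flow), round(score, 3))
```

In this sketch the vertex `public` never receives a rank because it lies on no sensitive flow, so neither propagation nor teleportation can reach it. The threshold value and the damping factor of 0.85 are placeholders, not values taken from the patent.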