CPC G06F 16/219 (2019.01) [G06F 16/2477 (2019.01); G06F 16/256 (2019.01); G06F 16/26 (2019.01)] | 20 Claims |
1. A system, comprising:
one or more computing devices;
wherein the one or more computing devices include instructions that upon execution on or across the one or more computing devices:
obtain, at a data lineage tracking service of a cloud computing environment, an indication of (a) one or more event information sources from which occurrences of events pertaining to one or more data pipelines including a particular data pipeline can be detected and (b) a selection criterion for events pertaining to the particular data pipeline, wherein the particular data pipeline includes a plurality of stages including a first data storage stage, a second data storage stage, and a data analysis stage;
extract, by the data lineage tracking service, from the one or more event information sources, using at least the selection criterion, information pertaining to occurrences of a plurality of events including a first event which represents a transfer of at least a portion of a first data set from the first data storage stage to form a second data set at the second data storage stage, a second event which represents a transfer of at least a portion of the second data set from the second data storage stage to form a third data set at the data analysis stage, and a third event which represents a completion of a computation performed on at least a portion of the third data set at the data analysis stage;
store, by the data lineage tracking service based at least in part on analysis of the plurality of events, in a graph database at the cloud computing environment, at least a portion of a particular graph comprising a plurality of nodes and a plurality of edges, wherein individual ones of the nodes represent a respective data set at a respective stage of the plurality of stages, and wherein individual ones of the edges indicate individual ones of the plurality of events; and
in response to a first query for lineage information pertaining to a particular data set at a particular stage of the plurality of stages, provide, by the data lineage tracking service, an indication of a sequence of events represented in the particular graph, including a particular event which resulted in a presence of the particular data set at the particular stage.
|
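For illustration only, the extract, store, and query steps recited in claim 1 can be sketched in a few lines of Python. This is a minimal in-memory sketch, not the claimed implementation: the `Event` record, the `LineageGraph` class, the `build_graph` helper, the stage names, and the use of a pipeline name as the selection criterion are all hypothetical, and a plain dictionary stands in for the graph database. Modeling the computation-completion event as an edge to a hypothetical `result` data set is likewise an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical event record; a node is a (data_set, stage) pair."""
    pipeline: str
    kind: str       # e.g. "transfer" or "computation_complete"
    source: tuple   # (data_set, stage) the event reads from
    target: tuple   # (data_set, stage) the event produces


class LineageGraph:
    """In-memory stand-in for the graph database: nodes are data sets at
    stages, edges are events."""

    def __init__(self):
        # Maps a node to the events (edges) that end at that node.
        self._incoming = defaultdict(list)

    def add_event(self, event):
        self._incoming[event.target].append(event)

    def lineage(self, node):
        """Return the events that led to `node`, oldest first.

        Simplifying assumption: only the first recorded incoming event of
        each node is followed, so the result is a single chain.
        """
        chain, seen, current = [], set(), node
        while current not in seen and self._incoming.get(current):
            seen.add(current)
            event = self._incoming[current][0]
            chain.append(event)
            current = event.source
        chain.reverse()
        return chain


def build_graph(events, pipeline):
    """Keep only events matching the selection criterion (assumed here to
    be the pipeline name) and store them as edges."""
    graph = LineageGraph()
    for event in events:
        if event.pipeline == pipeline:
            graph.add_event(event)
    return graph


# Three events for pipeline "p1", plus one unrelated event for "p2".
events = [
    Event("p1", "transfer", ("ds1", "storage1"), ("ds2", "storage2")),
    Event("p1", "transfer", ("ds2", "storage2"), ("ds3", "analysis")),
    Event("p1", "computation_complete", ("ds3", "analysis"), ("result", "analysis")),
    Event("p2", "transfer", ("x", "storage1"), ("y", "storage2")),
]
graph = build_graph(events, "p1")
ds3_lineage = graph.lineage(("ds3", "analysis"))
```

Here `ds3_lineage` holds the two transfer events in order, the last of which is the event that resulted in the presence of the third data set at the data analysis stage. A production service would presumably follow every incoming edge and persist the graph rather than hold it in memory.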