CPC G06F 16/254 (2019.01) [G06F 16/283 (2019.01); G06N 7/00 (2013.01)]  6 Claims

1. A computer-implemented method for configuring and executing a data pipeline across multiple cloud platforms of a plurality of cloud platforms, the method comprising:
receiving a cloud platform independent specification of a data pipeline configured as a data mesh representing a directed acyclic graph of a plurality of nodes connected by edges, wherein:
(1) a node represents a data pipeline unit specified using (a) inputs of the data pipeline unit, (b) outputs of the data pipeline unit, (c) one or more storage units used by the data pipeline unit, and (d) one or more data transformations performed by the data pipeline unit;
(2) an edge represents a relation between a first node and a second node, such that an output generated by the first node is provided as an input to the second node;
(3) the plurality of nodes comprises:
a set of input nodes configured to receive input data processed by the data mesh from one or more data sources, said input data comprising a set of input fields;
a set of output nodes configured to provide output data processed by the data mesh to one or more consumer systems, said output data comprising a set of output fields; and
a set of internal nodes, wherein each internal node receives data output by a previous node of the data mesh and provides output as an input to a next node of the data mesh; and
(4) each node of the data mesh stores snapshots of partially computed data in one or more storage units of the node, each snapshot associated with a timestamp;
identifying multiple target cloud platforms among the plurality of cloud platforms for deployment and execution of the data mesh, wherein each of the multiple target cloud platforms requires different platform-specific instructions for implementation of the data pipeline;
for each node of the data mesh, generating platform-specific instructions for configuring the node on a respective one of the target cloud platforms based on the cloud platform independent specification, wherein the cloud platform independent specification is reused across the multiple target cloud platforms of the data mesh, and wherein at least two of the nodes of the data mesh are implemented concurrently on different ones of the multiple target cloud platforms such that the input to a second data pipeline unit at a second one of the multiple target cloud platforms is the output of a first data pipeline unit of a first one of the multiple target cloud platforms;
provisioning computing infrastructure on the target cloud platforms for each node of the data mesh according to the generated platform-specific instructions;
creating a connection with the provisioned computing infrastructure on the target cloud platforms;
responsive to receiving input data for the data mesh, executing the generated platform-specific instructions corresponding to the plurality of nodes, wherein the data generated by each node is propagated according to the connection;
generating lineage information describing a field, wherein the field is one of:
an input field, wherein the lineage information represents a set of nodes determining values derived from the input field; and
an output field, wherein the lineage information represents a set of nodes determining values used to compute the output field;
receiving a change in specification describing the field;
identifying a subset of nodes of the data mesh based on the lineage information of the field;
recommending the subset of nodes of the data mesh as data pipeline units that need to be modified in connection with the change in the specification of the field;
receiving a modified specification for the recommended subset of nodes of the data mesh;
generating instructions for the recommended subset of nodes of the data mesh;
for only the recommended subset of nodes, executing the generated instructions to reconfigure the recommended subset of nodes according to the modified specification such that a remainder of the nodes of the data mesh remains unmodified;
receiving a timestamp value and a same or different subset of nodes of the data mesh; and
for only each node of the same or different subset of nodes of the data mesh, executing instructions of the node to re-process the partially computed data stored on the node, obtained from a snapshot corresponding to the timestamp value.
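The claimed structures can be illustrated with a minimal sketch. This is not the patented implementation; all names (`PipelineUnit`, `DataMesh`, `_reach`, etc.) are hypothetical, and the sketch shows only three of the claimed ideas: the platform-independent node specification of limitation (1), field lineage as forward/backward reachability over the directed acyclic graph, and timestamped snapshot lookup per limitation (4).

```python
from dataclasses import dataclass, field as dc_field

@dataclass
class PipelineUnit:
    # Cloud-platform-independent specification of one node: inputs,
    # outputs, and stored snapshots (limitation (1) and (4)).
    name: str
    inputs: list
    outputs: list
    snapshots: dict = dc_field(default_factory=dict)  # timestamp -> partial data

class DataMesh:
    def __init__(self):
        self.nodes = {}   # name -> PipelineUnit
        self.edges = []   # (producer_name, consumer_name) pairs

    def add(self, unit):
        self.nodes[unit.name] = unit

    def connect(self, producer, consumer):
        self.edges.append((producer, consumer))

    def _reach(self, starts, forward=True):
        # DAG walk: forward=True follows producer->consumer edges
        # (input-field lineage); forward=False walks upstream
        # (output-field lineage).
        seen, stack = set(starts), list(starts)
        while stack:
            n = stack.pop()
            for a, b in self.edges:
                src, dst = (a, b) if forward else (b, a)
                if src == n and dst not in seen:
                    seen.add(dst)
                    stack.append(dst)
        return seen

    def input_field_lineage(self, field_name):
        # Set of nodes determining values derived from the input field.
        starts = [n for n, u in self.nodes.items() if field_name in u.inputs]
        return self._reach(starts, forward=True)

    def output_field_lineage(self, field_name):
        # Set of nodes determining values used to compute the output field.
        starts = [n for n, u in self.nodes.items() if field_name in u.outputs]
        return self._reach(starts, forward=False)

    def snapshot_at(self, node_name, timestamp):
        # Latest snapshot of partially computed data at or before `timestamp`.
        unit = self.nodes[node_name]
        eligible = [t for t in unit.snapshots if t <= timestamp]
        return unit.snapshots[max(eligible)] if eligible else None

# A three-node mesh: input node -> internal node -> output node.
mesh = DataMesh()
mesh.add(PipelineUnit("ingest", inputs=["customer_id"], outputs=["cust"]))
mesh.add(PipelineUnit("enrich", inputs=["cust"], outputs=["profile"]))
mesh.add(PipelineUnit("publish", inputs=["profile"], outputs=["report_total"]))
mesh.connect("ingest", "enrich")
mesh.connect("enrich", "publish")
mesh.nodes["ingest"].snapshots = {10: "batch-a", 20: "batch-b"}
```

A change to the input field `customer_id` would thus recommend all three nodes for modification, since each lies on its forward lineage, while the snapshot lookup returns the newest snapshot not later than the requested timestamp value.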