US 12,242,441 B1
Data lineage tracking
Tao Feng, Foster City, CA (US); Menglei Sun, Mountain View, CA (US); and Zhuoying Wang, Santa Clara, CA (US)
Assigned to Databricks, Inc., San Francisco, CA (US)
Filed by Databricks, Inc., San Francisco, CA (US)
Filed on Jan. 31, 2023, as Appl. No. 18/162,562.
Application 18/162,562 is a continuation of application No. 17/862,158, filed on Jul. 11, 2022.
This patent is subject to a terminal disclaimer.
Int. Cl. G06F 16/28 (2019.01); G06F 11/07 (2006.01); G06F 16/215 (2019.01); G06F 16/22 (2019.01); G06F 16/23 (2019.01); G06F 16/906 (2019.01); G06F 17/18 (2006.01)
CPC G06F 16/215 (2019.01) [G06F 11/0793 (2013.01); G06F 16/2246 (2019.01); G06F 16/2365 (2019.01)] 20 Claims
1. A method comprising:
executing a job on one or more workers of a compute resource, wherein executing the job further comprises invoking one or more data entities;
detecting that a data entity in the one or more data entities is corrupt in response to a determination that execution of the job has failed;
identifying a lineage data identifier associated with the data entity based on a mapping of lineage data identifiers to data entity identifiers;
accessing lineage data that is stored in association with the identified lineage data identifier, the lineage data having been generated based on a query tree that was used to generate the data entity, and the lineage data identifying a set of data entities that rely on the data entity;
identifying, based on the lineage data, one or more upstream data entities from the data entity;
determining that the one or more upstream data entities on which the data entity depends have been corrupted; and
providing an indication of the corruption in the one or more upstream data entities or the data entity to a client device.
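
The steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation. Every name in it (LineageRecord, find_corrupt_upstream, entity_to_lineage_id, lineage_store, and the sample entity names) is hypothetical. It assumes the lineage data has already been derived from a query tree and is held in memory, and it assumes the identifier mapping is keyed by data entity identifier, a direction the claim does not fix.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class LineageRecord:
    """Hypothetical lineage record for one data entity."""
    entity_id: str
    upstream: list[str] = field(default_factory=list)    # entities this one depends on
    downstream: list[str] = field(default_factory=list)  # entities that rely on this one


def find_corrupt_upstream(
    entity_id: str,
    entity_to_lineage_id: dict[str, str],
    lineage_store: dict[str, LineageRecord],
    is_corrupt: Callable[[str], bool],
) -> list[str]:
    """Walk the upstream lineage of entity_id and return its corrupt ancestors."""
    corrupt: list[str] = []
    seen = {entity_id}
    stack = [entity_id]
    while stack:
        current = stack.pop()
        # Resolve the lineage data identifier, then the lineage data itself.
        lineage_id = entity_to_lineage_id.get(current)
        record = lineage_store.get(lineage_id) if lineage_id else None
        if record is None:
            continue  # no lineage recorded for this entity
        for parent in record.upstream:
            if parent in seen:
                continue
            seen.add(parent)
            if is_corrupt(parent):
                corrupt.append(parent)
            stack.append(parent)
    return sorted(corrupt)


# Hypothetical sample data: raw_events -> clean_events -> daily_report
entity_to_lineage_id = {
    "raw_events": "lin-1",
    "clean_events": "lin-2",
    "daily_report": "lin-3",
}
lineage_store = {
    "lin-1": LineageRecord("raw_events", [], ["clean_events"]),
    "lin-2": LineageRecord("clean_events", ["raw_events"], ["daily_report"]),
    "lin-3": LineageRecord("daily_report", ["clean_events"], []),
}
corrupt_entities = {"raw_events"}  # assumed result of some corruption check

# A job that invoked daily_report failed, so inspect its upstream lineage.
report = {
    "entity": "daily_report",
    "corrupt_upstream": find_corrupt_upstream(
        "daily_report",
        entity_to_lineage_id,
        lineage_store,
        lambda e: e in corrupt_entities,
    ),
}
```

Here report stands in for the indication provided to the client device. The traversal is transitive, so a corrupt entity several hops upstream (raw_events in the sample) is reported even when the immediate parent is intact.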