US 11,934,377 B2
Consistency checking for distributed analytical database systems
Maninderjit Singh Parmar, Redmond, WA (US)
Assigned to MICROSOFT TECHNOLOGY LICENSING, LLC, Redmond, WA (US)
Filed by Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed on Feb. 25, 2021, as Appl. No. 17/185,674.
Prior Publication US 2022/0269669 A1, Aug. 25, 2022
Int. Cl. G06F 16/23 (2019.01); G06F 16/18 (2019.01); G06F 16/215 (2019.01)
CPC G06F 16/2365 (2019.01) [G06F 16/1865 (2019.01); G06F 16/215 (2019.01)]
20 Claims
 
1. A computer-implemented method for consistency checking of data files, in a distributed database system, that represent a table, comprising:
receiving an ordered sequence of event records associated with the table, each event record in the ordered sequence including information about a particular operation performed with respect to one or more of the data files, the information for the particular operation including a transaction version, an operation type identifying a type of operation represented by the event record, a set of input data file identifiers identifying data files acted on by the particular operation, a set of output data file identifiers identifying data files generated by the particular operation, and an operation status indicating whether the particular operation was committed;
processing each first event record in a plurality of first event records in the ordered sequence, in an order specified by the ordered sequence, by:
maintaining, in a valid data file set, data file identifiers corresponding to data files that should be visible until the currently processed first event;
maintaining, in an invalid data file set, data file identifiers corresponding to data files that should not be visible;
determining whether the operation associated with the first event record was successfully committed based on the operation status associated with the first event record;
in response to determining that the operation associated with the first event record was successfully committed, adding any data file identifier in the set of output data file identifiers associated with the first event record to the valid data file set; and
in response to determining that the operation associated with the first event record was not successfully committed, adding any data file identifier in the set of output data file identifiers associated with the first event record to the invalid data file set; and
performing the following for a second event record in the ordered sequence that follows at least one of the first event records in the ordered sequence:
determining that a data inconsistency exists with respect to the table based on at least one of the valid data file set or the invalid data file set, and at least one of the set of input data file identifiers associated with the second event record or the set of output data file identifiers associated with the second event record.
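For illustration only, the processing recited in claim 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the names `EventRecord` and `check_consistency`, the operation type strings, and the `.parquet` file names are hypothetical. The claim requires only that an inconsistency be determined "based on" the valid or invalid data file set and the input or output identifiers of the second event record; the three concrete rules used below are assumed examples of such a determination. The sketch also checks every record before folding it into the sets, and it does not model removal of files from the valid set.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Set, Tuple


@dataclass(frozen=True)
class EventRecord:
    """One entry of the ordered sequence (field names are hypothetical)."""
    transaction_version: int
    operation_type: str
    input_files: FrozenSet[str]   # data files acted on by the operation
    output_files: FrozenSet[str]  # data files generated by the operation
    committed: bool               # operation status


def check_consistency(events: Iterable[EventRecord]) -> List[Tuple[int, str]]:
    """Walk the events in order and return (transaction_version, message) pairs."""
    valid: Set[str] = set()    # identifiers of files that should be visible
    invalid: Set[str] = set()  # identifiers of files that should not be visible
    problems: List[Tuple[int, str]] = []

    for ev in events:
        # Assumed rules: an input must already be in the valid set.
        for name in sorted(ev.input_files):
            if name in invalid:
                problems.append(
                    (ev.transaction_version, f"input {name} should not be visible"))
            elif name not in valid:
                problems.append(
                    (ev.transaction_version, f"input {name} was never produced"))

        # Assumed rule: an output must not reuse a previously seen identifier.
        for name in sorted(ev.output_files):
            if name in valid or name in invalid:
                problems.append(
                    (ev.transaction_version, f"output {name} was already produced"))

        # As recited in the claim: outputs of committed operations join the
        # valid set; outputs of uncommitted operations join the invalid set.
        if ev.committed:
            valid |= ev.output_files
        else:
            invalid |= ev.output_files

    return problems


# Hypothetical event log: version 2 did not commit, yet version 3 reads its output.
events = [
    EventRecord(1, "insert", frozenset(), frozenset({"a.parquet"}), True),
    EventRecord(2, "insert", frozenset(), frozenset({"b.parquet"}), False),
    EventRecord(3, "compact", frozenset({"a.parquet", "b.parquet"}),
                frozenset({"c.parquet"}), True),
]

for version, message in check_consistency(events):
    print(version, message)  # prints: 3 input b.parquet should not be visible
```

In this example the first two records play the role of the "first event records" and the third plays the role of the "second event record": its input set intersects the invalid data file set, which the sketch reports as a data inconsistency.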