US 12,259,905 B2
Data distribution in data analysis systems
Felix Beier, Haigerloch (DE); Dennis Butterstein, Stuttgart (DE); Einar Lueck, Filderstadt (DE); and Sabine Perathoner-Tschaffler, Nufringen (DE)
Assigned to International Business Machines Corporation, Armonk, NY (US)
Filed by International Business Machines Corporation, Armonk, NY (US)
Filed on Sep. 24, 2021, as Appl. No. 17/448,713.
Prior Publication US 2023/0101740 A1, Mar. 30, 2023
Int. Cl. G06F 7/00 (2006.01); G06F 11/14 (2006.01); G06F 16/22 (2019.01); G06F 16/23 (2019.01); G06F 16/27 (2019.01)
CPC G06F 16/27 (2019.01) [G06F 11/1471 (2013.01); G06F 16/2255 (2019.01); G06F 16/2358 (2019.01)] 18 Claims
 
1. A computer-implemented method for data synchronization in a data analysis system, the data analysis system comprising a target database system and a source database system, the method comprising:
retrieving a change record describing an operation performed on a data record in the source database system of the data analysis system based on reading a transaction log associated with the source database system with a frequency higher than a defined minimum frequency;
determining a distribution key that is configured to be used by the target database system to distribute records over target database nodes of the target database system, wherein the target database system comprises a metadata catalog that includes cluster metadata and table metadata, the table metadata comprising information on a total number of the target database nodes and storage properties of the target database nodes and the cluster metadata comprising information on the distribution key;
reading the change record for determining a value of the distribution key of the data record;
using the value of the distribution key for selecting a target database node of the target database nodes where the operation is to be performed, wherein selecting the target database node for storing the data record comprises providing a distribution map of hash values to connection numbers, wherein each connection number indicates a connection between the source database system and a respective target database node, computing a hash value of the determined value of the distribution key of the data record, and using the distribution map for assigning the computed hash value to a connection number, wherein the connection is established according to the connection number;
establishing a direct TCP/IP connection between the source database system and the selected target database node; and
providing the change record to the selected target database node through the direct connection.
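
The node-selection step recited above (hash the distribution key value, then look the hash up in a distribution map of hash values to connection numbers) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and is not taken from the patent: the `ChangeRecord` type, the bucket count of 16, the use of SHA-256 as the hash function, and the round-robin layout of the map. Opening the direct TCP/IP connections is omitted; each connection number is simply assumed to identify one such connection to one target database node.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass

NUM_BUCKETS = 16  # hypothetical number of hash buckets, not from the patent


@dataclass(frozen=True)
class ChangeRecord:
    """Hypothetical change record as read from a transaction log."""
    table: str
    operation: str  # e.g. "INSERT", "UPDATE", "DELETE"
    row: dict


def build_distribution_map(num_connections: int,
                           num_buckets: int = NUM_BUCKETS) -> list[int]:
    """Map each hash bucket to a connection number (assumed round-robin)."""
    return [bucket % num_connections for bucket in range(num_buckets)]


def hash_bucket(key_value: object, num_buckets: int) -> int:
    """Hash a distribution key value into a bucket index (SHA-256 assumed)."""
    digest = hashlib.sha256(repr(key_value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def select_connection(record: ChangeRecord, distribution_key: str,
                      distribution_map: list[int]) -> int:
    """Return the connection number for the node that should apply the record."""
    key_value = record.row[distribution_key]
    return distribution_map[hash_bucket(key_value, len(distribution_map))]


def route(records: list[ChangeRecord], distribution_key: str,
          distribution_map: list[int]) -> dict[int, list[ChangeRecord]]:
    """Group change records by connection number, preserving log order."""
    batches: dict[int, list[ChangeRecord]] = {}
    for record in records:
        conn = select_connection(record, distribution_key, distribution_map)
        batches.setdefault(conn, []).append(record)
    return batches


if __name__ == "__main__":
    dist_map = build_distribution_map(num_connections=3)
    changes = [
        ChangeRecord("orders", "INSERT", {"order_id": 1, "amount": 10}),
        ChangeRecord("orders", "UPDATE", {"order_id": 1, "amount": 12}),
        ChangeRecord("orders", "INSERT", {"order_id": 2, "amount": 7}),
    ]
    for conn, batch in sorted(route(changes, "order_id", dist_map).items()):
        print(conn, [(c.operation, c.row["order_id"]) for c in batch])
```

Because the connection number depends only on the distribution key value, every change to the same data record is routed to the same connection, so a sender could forward each batch over its own direct connection without passing through a coordinating node. That reading is an interpretation of the claim, not a statement of how the patented system is implemented.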