US 12,141,124 B1
Discovering candidate referential integrities in a database
Sung Jin Kim, Buena Park, CA (US); Yinuo Zhang, Los Angeles, CA (US); Mohamed Mahmoud Hafez Mahmoud Abdelrahman, Torrance, CA (US); and Paul Geoffrey Brown, Concord, MA (US)
Assigned to Teradata US, Inc., San Diego, CA (US)
Filed by Teradata US, Inc, San Diego, CA (US)
Filed on Nov. 24, 2023, as Appl. No. 18/518,813.
Int. Cl. G06F 16/20 (2019.01); G06F 16/23 (2019.01); G06F 16/2453 (2019.01)
CPC G06F 16/2365 (2019.01) [G06F 16/2453 (2019.01)] 18 Claims
OG exemplary drawing
 
1. A method comprising:
executing a plurality of processes on a plurality of compute groups,
wherein each compute group comprises one or more compute clusters,
wherein each compute cluster comprises one or more nodes,
wherein each node comprises at least one computer processor and a memory,
wherein the plurality of processes store data on a data storage;
a database system executing as at least one of the plurality of processes;
the database system configured to issue executable steps to at least one of the processes executing as part of the database system,
the database system receiving a database comprising one or more relations comprising a plurality of input columns;
the database system enumerating one-column candidate referential integrities (1CRIs) from the plurality of input columns in the one or more relations, wherein each 1CRI includes a referenced column (A) having a plurality of referenced-column values and a referencing column (B) having a plurality of referencing column values;
the database system applying one or more disqualification tests to the 1CRIs to eliminate illegitimate 1CRIs resulting in a list of non-disqualified 1CRIs, wherein the disqualification tests are applied to an 1CRI being tested (hereinafter (A*,B*), A* representing a set of values of a referenced column or columns and B* representing a set of values of a referencing column or columns) until (A*,B*) is disqualified or until all of the disqualification tests have been executed and (A*,B*) has not been disqualified, in which case (A*,B*) is added to the list of non-disqualified 1CRIs, wherein each of the disqualification tests reduces the likelihood of incorrectly adding (A*,B*) to the list of non-disqualified 1CRIs;
wherein the disqualification tests include a minimum coverage disqualification test, that includes one or more of the following sub-tests:
(d1) using cardinality sub-test, in which (A*,B*) is disqualified if:
NUVB*/NUVA*> a domain-coverage threshold, where NUVB* is a number of unique values in B* and NUVA* is a number of unique values in A*, and
CARD(BloomA*∩BloomB*)/CARD(BloomA*), where ∩ represents a bitwise AND, CARD is a function having an argument that computes the number of bits in the argument that are set to 1, BloomA* is a Bloom filter of A*, and BloomB* is a Bloom filter of B*, and
(d2) using Min Hash sub-test, in which in which (A*,B*) is disqualified if (MH(A*,B*)·(NUVA*+NUVB*)/(MH(A*,B*)+1))/NUVA* is less than a min-hash threshold, where MH(A*,B*) is a statistic available from the database system;
the database system applying a qualifying test to the non-disqualified 1CRIs resulting in a list of qualified 1CRIs;
the database system defining a referential integrity constraint between the referenced column and the referencing column of one of the qualified 1CRIsthat is confirmed by a user to be a referential integrity constraint; and
the database system using the referential integrity constraint to optimize execution of a query received by the database system.