US 11,886,396 B2
System and method for learning-based synthesis of data transformation rules
Raveendra Kumar Medicherla, Bangalore (IN); Sayandeep Mitra, Pune (IN); and Ravindra Dinkar Naik, Pune (IN)
Assigned to TATA CONSULTANCY SERVICES LIMITED, Mumbai (IN)
Filed by Tata Consultancy Services Limited, Mumbai (IN)
Filed on Nov. 8, 2022, as Appl. No. 17/982,677.
Claims priority of application No. 202121052118 (IN), filed on Nov. 13, 2021.
Prior Publication US 2023/0153278 A1, May 18, 2023
Int. Cl. G06F 16/21 (2019.01); G06F 40/284 (2020.01); G06F 40/30 (2020.01)
CPC G06F 16/213 (2019.01) [G06F 40/284 (2020.01); G06F 40/30 (2020.01)]
9 Claims
 
1. A processor implemented method for learning-based synthesis of data transformation rules, the method comprising:
providing, via one or more hardware processors, a source database schema and a target database schema as a first input, wherein the source database schema comprises source tables and respective source fields and the target database schema comprises target tables and respective target fields;
receiving, via the one or more hardware processors, historical data mapping between the source tables and the target tables as a second input, wherein the historical data mapping comprises a plurality of historic transformation rules and a matching list between the source fields and the target fields;
assigning, via the one or more hardware processors, each field from the source fields and the target fields to a semantic type;
categorizing, via the one or more hardware processors, the plurality of historic transformation rules as per the semantic type;
identifying, via the one or more hardware processors, similar rule statements based on the structure of operators present therein within each of the categorized transformation rules, and arranging the similar rule statements into a plurality of groups;
identifying, via the one or more hardware processors, rule patterns for each of the plurality of groups by replacing field names and constants with a plurality of symbols;
re-arranging, via the one or more hardware processors, the identified rule patterns in a hierarchical order;
identifying and inferring, via the one or more hardware processors, lexical tokens and grammar rules from the hierarchical order;
rearranging, via the one or more hardware processors, the identified lexical tokens and grammar rules to form a set of domain specific languages (DSL), wherein the set of DSLs comprises a plurality of DSL syntactic rules for arrangement of a plurality of DSL operators and their parameters;
defining, via the one or more hardware processors, operator semantics for each DSL operator amongst the plurality of DSL operators;
defining, via the one or more hardware processors, operator parameter annotations configured to assist in rule inferencing, for each operator amongst the plurality of DSL operators;
defining, via the one or more hardware processors, a ranking for each DSL operator based on a usage frequency count of the plurality of DSL operators from the historical data mapping;
instantiating, via the one or more hardware processors, a synthesizer using
the set of DSLs, the source database schema, the target database schema, a set of source-target data samples, and a matching list between target and source fields, and
a syntax of the DSL operator, defined operator semantics, defined operator parameter annotations and the defined ranking;
generating, via the one or more hardware processors, a plurality of candidate data transformation rules by the synthesizer, wherein each of the generated transformation rules amongst the plurality of candidate data transformation rules is represented as a respective program;
ranking, via the one or more hardware processors, each program using a score assigned to each DSL operator in the set of DSLs, wherein a higher score is assigned to operators which have a higher frequency of use in the historical data mapping as compared to operators which have lower frequency of use; and
selecting, via the one or more hardware processors, a set of candidate data transformation rules out of the plurality of candidate data transformation rules based on a user-defined number of top rankings.
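
Three of the recited steps can be illustrated with a minimal sketch: replacing field names and constants with symbols to obtain rule patterns, grouping rules that share a pattern, and ranking candidate programs by operator usage frequency. Everything concrete here is an assumption made for illustration and is not taken from the claim: the rule syntax, the operator names (CONCAT, SUBSTR, UPPER), the sample rules, the symbol scheme (F1, C1), and in particular the choice to score a program as the mean frequency of its operators, since the claim does not state how per-operator scores are combined.

```python
import re
from collections import Counter, defaultdict

# Hypothetical historic rules; syntax and operator names are illustrative only.
HISTORIC_RULES = [
    'CONCAT(first_name, " ", last_name)',
    'CONCAT(city, ", ", country)',
    'SUBSTR(account_no, 0, 4)',
    'UPPER(status)',
    'UPPER(CONCAT(title, " ", surname))',
]

# String literals, integers, identifiers, and punctuation; whitespace is skipped.
TOKEN = re.compile(r'"[^"]*"|\d+|[A-Za-z_]\w*|[(),]')
PUNCT = {"(", ")", ","}


def operators(rule):
    """Return the operators in a rule: identifiers directly followed by '('."""
    toks = TOKEN.findall(rule)
    return [t for t, nxt in zip(toks, toks[1:]) if nxt == "(" and t not in PUNCT]


def abstract_rule(rule):
    """Replace field names with F<i> and constants with C<i>; keep operators."""
    toks = TOKEN.findall(rule)
    fields, consts, out = {}, {}, []
    for i, tok in enumerate(toks):
        nxt = toks[i + 1] if i + 1 < len(toks) else ""
        if tok in PUNCT or nxt == "(":
            out.append(tok)  # punctuation or operator name stays as-is
        elif tok[0] == '"' or tok.isdigit():
            out.append(consts.setdefault(tok, f"C{len(consts) + 1}"))
        else:
            out.append(fields.setdefault(tok, f"F{len(fields) + 1}"))
    return "".join(out)


def group_by_pattern(rules):
    """Group rules whose abstracted patterns are identical."""
    groups = defaultdict(list)
    for rule in rules:
        groups[abstract_rule(rule)].append(rule)
    return dict(groups)


def operator_scores(rules):
    """Usage frequency count of each operator across the historic rules."""
    return Counter(op for rule in rules for op in operators(rule))


def program_score(program, scores):
    """Assumed aggregation: mean frequency score of the program's operators."""
    ops = operators(program)
    return sum(scores[op] for op in ops) / len(ops) if ops else 0.0


def top_k(candidates, scores, k):
    """Select the k highest-scoring candidate programs (ties keep input order)."""
    return sorted(candidates, key=lambda p: -program_score(p, scores))[:k]


if __name__ == "__main__":
    scores = operator_scores(HISTORIC_RULES)
    candidates = ['UPPER(name)', 'SUBSTR(name, 0, 10)', 'CONCAT(name, "")']
    print(group_by_pattern(HISTORIC_RULES))
    print(top_k(candidates, scores, k=2))
```

In this sketch the two CONCAT rules collapse to the same pattern, CONCAT(F1,C1,F2), and so fall into one group. Because CONCAT occurs most often in the sample history, a candidate built on it ranks ahead of candidates built on UPPER or SUBSTR.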