US 12,106,196 B2
Method and system for generating synthetic time domain signals to build a classifier
Sakyajit Bhattacharya, Kolkata (IN); Oishee Muzumder, Kolkata (IN); Aniruddha Sinha, Kolkata (IN); Dibyendu Roy, Kolkata (IN); and Avik Ghose, Kolkata (IN)
Assigned to Tata Consultancy Services Limited, Mumbai (IN)
Filed by Tata Consultancy Services Limited, Mumbai (IN)
Filed on Mar. 9, 2021, as Appl. No. 17/196,406.
Claims priority of application No. 202021018573 (IN), filed on Apr. 30, 2020.
Prior Publication US 2021/0342641 A1, Nov. 4, 2021
Int. Cl. G06N 20/10 (2019.01); G06F 18/2134 (2023.01); G06F 18/214 (2023.01); G06F 18/2321 (2023.01); G06F 18/243 (2023.01); G16H 50/20 (2018.01); G16H 50/70 (2018.01)

CPC G06N 20/10 (2019.01) [G06F 18/2134 (2023.01); G06F 18/2148 (2023.01); G06F 18/2321 (2023.01); G06F 18/24317 (2023.01); G16H 50/20 (2018.01); G16H 50/70 (2018.01); G06F 2218/14 (2023.01)]

9 Claims

1. A processor implemented method for generating synthetic time domain signals to build a classifier, the method comprising:

receiving, by one or more hardware processors, a parent dataset of a plurality of samples of a time domain signal of interest comprising a combination of a class data, wherein the class data refer to a subject with Coronary artery disease, CAD and a non-class data, wherein the non-class data refer to the subject without the CAD;

identifying, by the one or more hardware processors, a plurality of subsets, from the parent dataset, corresponding to a plurality of morphological features identified for the time domain signal of interest, wherein each subset among the plurality of subsets comprises p samples corresponding to the plurality of the morphological features, further wherein the plurality of morphological features comprise a Peak Sample (Ps), a Peak Amplitude (Pa), a Trough Sample (Ts), a Trough Amplitude (Ta), a Notch Sample (Ns), a Notch Amplitude (Na), a Dip Sample (Ds), a Dip Amplitude (Da), and distance between left and right samples corresponding to the 25%, 50%, 75% of the (Pa) defining distances d₁, d₂, d₃, and wherein the plurality of morphological features define a template for the time domain signal of interest;

processing, by the one or more hardware processors, each of the plurality of subsets corresponding to each of the plurality of morphological features to generate a plurality of sets of observational values with each of the plurality of sets of observational values comprising p actual values corresponding to each of the plurality of morphological features;

fitting, by the one or more hardware processors, a gaussian kernel density estimate (KDE) to each of the plurality of sets of observational values;

generating, by the one or more hardware processors, N-point simulated data for each of the plurality of morphological features by generating N random samples from the gaussian KDE fitted to each of the plurality of sets of observational values;

constructing, by the one or more hardware processors, a plurality of synthetic time domain signals for the time domain signal of interest from the N-point simulated data for each of the plurality of morphological features in accordance to the template, wherein the constructing of the plurality of synthetic time domain signals comprising:

determining a plurality of sequences derived from the generated N-point simulated data for each of the plurality of morphological features, wherein a plurality of elements in each of the plurality of sequences comprising Ts, Ps, Ds, Ns, r₁, r₂, r₃, q₁, q₂ and q₃, wherein values of q₁, q₂, q₃, r₁, r₂, r₃ are derived based on a preselected values of the distances d₁, d₂, d₃ defined for the template, and wherein position of each of the plurality of elements within each of the plurality of sequences is based on a set of conditions defined by a predefined set of morphological features among the plurality of morphological features;

generating, from the plurality of sequences, a plurality of time domain signals corresponding to the time domain signal of interest by performing a spline fitting on each of the plurality of sequences, wherein the spline fitting utilizes piecewise linear regression to obtain parameters of lines connecting two successive elements among the plurality of elements of the each of the plurality of sequences;

smoothening each of the plurality of time domain signals using a smoothening technique to generate a plurality of smoothened time domain signals; and

applying a peak smoothening technique on each of the smoothened plurality of time domain signals to construct the plurality of synthetic time domain signals for the time domain signal of interest; and

building, via the one or more hardware processors, a two stage cascaded classifier for classifying input data corresponding to the time domain signal of interest into one of the class data and the non-class data by using a combination of the plurality of samples in the parent dataset and the plurality of synthetic time domain signals as a training data, and wherein the two stage cascaded classifier comprises:

a first classifier utilizing Matusita distance for likeness measurement and data explosion driven decision rule for classification, wherein a plurality of statistical features is extracted from the plurality of synthetic time domain signals to differentiate between the class data and the non-class data, using a rule based on the Matusita distance; and

a second classifier utilizing a random forest technique for classification, wherein constructing the two stage cascaded classifier using the training data is used to correct bias, test real time data corresponding to the time domain signal of interest and classify the real time data into the final class data and the final non-class data.
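
The signal-generation steps recited in claim 1 (sampling from a fitted gaussian KDE, connecting successive sequence elements with straight lines, and smoothening), together with the Matusita distance used by the first classifier, can be illustrated with a minimal sketch. It is not the patented implementation. All function names are hypothetical, and several choices are assumptions not stated in the claim: Scott's rule for the KDE bandwidth, a centered moving average as the smoothening technique, the discrete form of the Matusita distance, and the fixed knot positions in the example. The peak smoothening step, the data explosion driven decision rule, and the random forest stage are not shown.

```python
import math
import random
import statistics


def sample_gaussian_kde(observations, n, rng, bandwidth=None):
    """Draw n samples from a Gaussian KDE fitted to 1-D observations.

    Sampling from a Gaussian KDE is equivalent to picking one observation
    uniformly at random and adding zero-mean Gaussian noise whose standard
    deviation equals the bandwidth.
    """
    if bandwidth is None:
        # Assumption: Scott's rule of thumb, h = sigma * p^(-1/5).
        sigma = statistics.stdev(observations)
        bandwidth = sigma * len(observations) ** (-0.2)
    return [rng.gauss(rng.choice(observations), bandwidth) for _ in range(n)]


def piecewise_linear(knots):
    """Connect successive (sample_index, amplitude) knots with straight lines.

    Sample indices must be integers in strictly increasing order. Returns one
    amplitude per integer sample from the first knot to the last, inclusive.
    """
    signal = []
    for (x0, y0), (x1, y1) in zip(knots, knots[1:]):
        slope = (y1 - y0) / (x1 - x0)
        signal.extend(y0 + slope * (x - x0) for x in range(x0, x1))
    signal.append(knots[-1][1])
    return signal


def moving_average(signal, window=5):
    """Smooth with a centered moving average (assumed smoothening technique).

    The window is truncated at both ends of the signal.
    """
    half = window // 2
    smoothed = []
    for i in range(len(signal)):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        smoothed.append(sum(signal[lo:hi]) / (hi - lo))
    return smoothed


def matusita_distance(p, q):
    """Matusita distance between two discrete probability distributions."""
    return math.sqrt(
        sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    )


# Illustrative use with made-up peak amplitudes (Pa) from p = 5 samples.
rng = random.Random(0)
observed_pa = [0.8, 0.9, 1.0, 1.1, 1.2]
simulated_pa = sample_gaussian_kde(observed_pa, 3, rng)

# Hypothetical template: trough at sample 0, peak at 20, trough at 60.
synthetic_signals = [
    moving_average(piecewise_linear([(0, 0.0), (20, pa), (60, 0.0)]))
    for pa in simulated_pa
]
```

Each entry of `synthetic_signals` is a 61-sample waveform whose peak height was drawn from the fitted density. In the claimed method, every element of the sequence (Ts, Ps, Ds, Ns and the points derived from d₁, d₂, d₃) would come from its own N-point simulated data rather than from fixed positions.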