US 12,277,394 B2
Systems and methods for multi-utterance generation of data with immutability regulation and punctuation-memory
Nidhi Harshad Shroff, Mumbai (IN); Paras Dwivedi, Mumbai (IN); Siva Prasad Pusarla, Mumbai (IN); Sudhakara Deva Poojary, Mumbai (IN); Pranav Champaklal Shah, Mumbai (IN); Varsha Nayak, Mumbai (IN); Amit Aggrawal, Mumbai (IN); and Godfrey Claudin Mathais, Mumbai (IN)
Assigned to TATA CONSULTANCY SERVICES LIMITED, Mumbai (IN)
Filed by Tata Consultancy Services Limited, Mumbai (IN)
Filed on Nov. 1, 2022, as Appl. No. 17/978,443.
Claims priority of application No. 202221011116 (IN), filed on Mar. 1, 2022.
Prior Publication US 2023/0281393 A1, Sep. 7, 2023
Int. Cl. G06F 40/30 (2020.01); G06F 40/211 (2020.01); G06F 40/284 (2020.01); G06F 40/289 (2020.01)
CPC G06F 40/30 (2020.01) [G06F 40/211 (2020.01); G06F 40/284 (2020.01); G06F 40/289 (2020.01)]
11 Claims
 
1. A processor implemented method, comprising:
receiving, via one or more hardware processors, a plurality of input data pertaining to one or more applications, wherein the input data comprises text data and non-text data, and wherein the non-text data comprises one or more audios, one or more images, and one or more videos;
converting, via the one or more hardware processors, the non-text data into a text data based on one or more conversion techniques when the input data includes non-text data, results in forming converted text data;
combining, via the one or more hardware processors, the received text data and the converted text data to form combined text data;
processing, via the one or more hardware processors, the combined text data to obtain a processed text data with immutability regulation and punctuation memory enabled, and wherein processing of the combined text data comprising:
identifying a set of words from the combined text data;
tokenizing each of the combined text data such that the identified set of words are immutability regulated and punctuation consistency is maintained, wherein immutability regulated refers to a provision to selectively maintain or regulate phrases/words intact inferring a way of expressing a given text in varied ways considering relevance to domain, thereby minimizing efforts for manual creation of training data;
determining a plurality of context related synonyms in an inflected form for each of a plurality of tokenized text data; and
eliminating, one or more words identified as duplicates from the plurality of tokenized text data added with the plurality of context related synonyms in the inflected form;
iteratively generating, via the one or more processors, a plurality of multiple context-related utterances corresponding to each of the processed text data;
accumulating, via the one or more processors, the plurality of multiple context-related utterances that are ranked based on an index of deviation;
selecting, via the one or more processors, a set of high ranked multiple context-related utterances from the plurality of multiple context-related utterances when a number of possible multiple context-related utterance is greater than the number of required multiple context-related utterances and scalable to the non-text data with help of converters or adapters, wherein a first dictionary, an auto-updating dictionary utilizing a feedback, is dynamically updated when a coupling coefficient of n-grams of the plurality of text data exceeds a predefined threshold; and
training machine learning models with the set of high ranked multiple context-related utterances without manual intervention and by maintaining quality of the generated context-related utterances.
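For illustration only, the tokenizing, synonym-expansion, generation, ranking, and selection steps recited in claim 1 can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the patented implementation. The phrase list, the synonym table, every function name, and the definition of the "index of deviation" (taken here as the fraction of token positions that differ from the input, with higher deviation ranked first) are hypothetical. Non-text conversion, inflection handling, the auto-updating dictionary with its coupling coefficient, and model training are not modeled.

```python
import itertools
import re

# Hypothetical example data; none of these values come from the patent.
IMMUTABLE_PHRASES = ["credit card"]
SYNONYMS = {"want": ["wish", "need"], "block": ["freeze", "disable"]}


def tokenize(text, immutable_phrases):
    """Split text into tokens, keeping protected phrases whole and
    punctuation marks as separate tokens so they can be restored later."""
    # Longest phrases first so a longer protected phrase wins over a shorter one.
    phrases = sorted(immutable_phrases, key=len, reverse=True)
    alternatives = [r"\b" + re.escape(p) + r"\b" for p in phrases]
    alternatives += [r"\w+", r"[^\w\s]"]
    pattern = re.compile("|".join(alternatives), flags=re.IGNORECASE)
    return pattern.findall(text)


def candidates(token, immutable_phrases, synonyms):
    """Return the token plus its synonyms, without duplicates.
    Protected phrases and punctuation only ever map to themselves."""
    lowered = token.lower()
    if lowered in {p.lower() for p in immutable_phrases}:
        return [token]
    # dict.fromkeys removes duplicates while preserving order.
    return list(dict.fromkeys([token] + synonyms.get(lowered, [])))


def detokenize(tokens):
    """Join tokens with spaces, then reattach punctuation to the preceding word."""
    return re.sub(r"\s+([^\w\s])", r"\1", " ".join(tokens))


def deviation_index(original, variant):
    """Assumed metric: fraction of token positions that differ from the input."""
    changed = sum(1 for a, b in zip(original, variant) if a != b)
    return changed / len(original)


def generate_utterances(text, required,
                        immutable_phrases=IMMUTABLE_PHRASES,
                        synonyms=SYNONYMS):
    """Generate variants of text, rank them by deviation, keep the top ones."""
    tokens = tokenize(text, immutable_phrases)
    options = [candidates(t, immutable_phrases, synonyms) for t in tokens]
    scored = []
    for variant in itertools.product(*options):
        if list(variant) == tokens:
            continue  # skip the unchanged input
        scored.append((deviation_index(tokens, variant), detokenize(variant)))
    # Highest deviation first; the sort is stable, so ties keep generation order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Select only when more variants are possible than are required.
    if len(scored) > required:
        scored = scored[:required]
    return [utterance for _, utterance in scored]


if __name__ == "__main__":
    for line in generate_utterances("I want to block my credit card, please.", 3):
        print(line)
```

In this sketch the protected phrase "credit card" and the comma and full stop survive unchanged in every generated variant, which is one plausible reading of the immutability regulation and punctuation memory recited in the claim.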