US 11,914,960 B2
System and method for statistical subject identification from input data
Nidhi Harshad Shroff, Mumbai (IN); Paras Dwivedi, Mumbai (IN); Siva Prasad Pusarla, Mumbai (IN); Sudhakara Deva Poojary, Mumbai (IN); Pranav Champaklal Shah, Mumbai (IN); Varsha Nayak, Mumbai (IN); Amit Aggrawal, Mumbai (IN); and Godfrey Claudin Mathais, Mumbai (IN)
Assigned to TATA CONSULTANCY SERVICES LIMITED, Mumbai (IN)
Filed by Tata Consultancy Services Limited, Mumbai (IN)
Filed on Oct. 5, 2021, as Appl. No. 17/450,062.
Claims priority of application No. 202121028201 (IN), filed on Jun. 23, 2021.
Prior Publication US 2023/0028304 A1, Jan. 26, 2023
Int. Cl. G06F 40/284 (2020.01); G06F 40/166 (2020.01); G06F 40/216 (2020.01); G06F 40/117 (2020.01)
CPC G06F 40/284 (2020.01) [G06F 40/117 (2020.01); G06F 40/166 (2020.01); G06F 40/216 (2020.01)]
7 Claims
 
1. A processor-implemented method for statistical subject identification comprising:
receiving, via an input/output interface, a plurality of input data from a user, wherein the plurality of input data includes text strings, images, audios, and videos;
converting, via one or more hardware processors, the received images, audios, and videos into text strings using a transcription model;
pre-processing, via the one or more hardware processors, the received and converted text strings to obtain pre-processed text strings and n-grams corresponding to each of the pre-processed text strings;
computing, via the one or more hardware processors, a frequency distribution of the n-grams for the pre-processed text strings;
calculating, via the one or more hardware processors, a weightage of the n-grams for the pre-processed text strings, wherein the weightage is a frequency scaled considering size of the corresponding pre-processed text strings;
determining, via the one or more hardware processors, a ratio of sum of the calculated weightage of each n-gram across the pre-processed text strings to a maxima weightage of the n-gram;
calculating, via the one or more hardware processors, an n-gram confidence value using a box-cox transformation over the determined ratio to obtain a normal distribution of the n-gram confidence values;
identifying, via the one or more hardware processors, one or more nodes of significance from the obtained normal distribution based on one or more higher values of the calculated n-gram confidence values than a predefined dynamic threshold value, wherein the predefined dynamic threshold value depends on the obtained normal distribution;
mapping, via the one or more hardware processors, the identified one or more nodes of significance to one or more domains using a predefined domain lexicon;
converging, via the one or more hardware processors, the mapped one or more domains to a plurality of subject areas and simultaneously computing a score by adding the corresponding n-gram confidence value, associated with one or more nodes of significance, for that subject area; and
identifying, via the one or more hardware processors, at least one subject area based on the computed score to each of the one or more subjects.
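The statistical steps of claim 1 (n-gram frequency, size-scaled weightage, the sum-to-maximum ratio, the Box-Cox transformation, thresholding, lexicon mapping, and subject scoring) can be illustrated with a minimal Python sketch. It is not the patented implementation, and several details are assumptions the claim leaves open: whitespace tokenization, token count as the "size" of a text string, a fixed Box-Cox parameter `lam`, a threshold of mean plus `k` standard deviations, and the hypothetical names `confidence_values`, `significant_nodes`, `score_subjects`, `lexicon`, and `domain_to_subject`. Transcription of images, audio, and video is omitted.

```python
from collections import Counter
from math import log
from statistics import mean, stdev


def ngrams(tokens, n_max=2):
    """Return all 1..n_max-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def box_cox(x, lam):
    """Box-Cox transform of a positive value x for a fixed lambda (assumed)."""
    return log(x) if lam == 0 else (x ** lam - 1) / lam


def confidence_values(texts, lam=0.5, n_max=2):
    """Map each n-gram to a confidence value.

    Weightage is the n-gram count divided by the token count of its text
    (an assumed reading of "frequency scaled considering size").
    """
    totals = Counter()   # sum of weightages across all texts
    maxima = {}          # largest weightage seen in any single text
    for text in texts:
        tokens = text.lower().split()
        size = max(len(tokens), 1)
        for gram, count in Counter(ngrams(tokens, n_max)).items():
            weight = count / size
            totals[gram] += weight
            maxima[gram] = max(maxima.get(gram, 0.0), weight)
    # The ratio sum/max is always >= 1, so the Box-Cox input is positive.
    return {g: box_cox(totals[g] / maxima[g], lam) for g in totals}


def significant_nodes(conf, k=1.0):
    """Keep n-grams above mean + k * stdev (assumed form of the threshold)."""
    values = list(conf.values())
    spread = stdev(values) if len(values) > 1 else 0.0
    threshold = mean(values) + k * spread
    return {g: v for g, v in conf.items() if v > threshold}


def score_subjects(nodes, lexicon, domain_to_subject):
    """Map nodes to domains, converge domains to subject areas, sum scores."""
    scores = Counter()
    for gram, value in nodes.items():
        domain = lexicon.get(gram)
        if domain is not None:
            scores[domain_to_subject.get(domain, domain)] += value
    return scores
```

Under these assumptions, an n-gram that recurs across several texts gets a ratio above 1 and a positive confidence value, while one confined to a single text gets a ratio of exactly 1 and a confidence of 0. The highest entry of the returned `Counter` (for example via `most_common(1)`) corresponds to the identified subject area.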