CPC G10L 13/027 (2013.01) [G10L 15/16 (2013.01); G10L 15/1815 (2013.01); G10L 13/00 (2013.01)]    18 Claims

1. A computing system for adaptive, individualized, and contextualized text-to-speech, the system comprising:
a memory;
one or more processors in communication with the memory; and
program instructions executable by the one or more processors via the memory to:
iteratively train, using training data, at least one recurrent neural network (RNN) to categorize communication elements from audio data, the at least one RNN being trained to perform natural language processing on the audio data to generate transcribed data therefrom, parse the generated transcribed data, and perform a reduction analysis on the transcribed data in order to interpret the transcribed data and derive the communication elements for categorization, the communication elements including intent, context, emotion, and other circumstantial factors, the training including:
inserting the training data into an iterative training and testing loop to predict a target variable; and
repeatedly predicting the target variable during each iteration of the training and testing loop, wherein each iteration of the training and testing loop has differing weights applied to one or more nodes of the at least one RNN, each of the differing weights being updated with each iteration of the training and testing loop to reduce error in predicting the target variable and improve the predictive accuracy of the at least one RNN;
deploy the at least one trained RNN to facilitate categorization of the communication elements;
receive, in real time from a user device via a telephonic communication means, input audio data of a user, the input audio data comprising a plurality of communication elements;
apply the input audio data to the at least one trained RNN to categorize one or more communication elements of the plurality of communication elements, the categorizing including assigning at least one contextual category to a communication element of the plurality of communication elements;
generate text comprising a response to one or more of the plurality of communication elements, the response including one or more individualized and contextualized qualities predicted to provide an optimal outcome based at least in part on (i) the assigned at least one contextual category and (ii) the user from whom the input audio data is received;
implement text-to-speech processing of the generated text, the text-to-speech processing producing an audio output comprising (a) the response and (b) a speech pattern predicted to facilitate the optimal outcome, the speech pattern including at least one prosody element, the at least one prosody element including a timbre intended to elicit, from the user, certain emotions associated with a desired outcome, the timbre being based on the at least one contextual category assigned to the communication element of the plurality of communication elements; and
provide, to the user device, the audio output and, based thereon, measure a reaction of the user in response to the speech pattern and the timbre of the audio output according to a quantifiable quality score, and use the quantifiable quality score to modify future iterations of the text-to-speech processing in order to provide one or more future audio outputs comprising a revised speech pattern.
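The iterative training recited in claim 1 (inserting training data into a training-and-testing loop and updating node weights each iteration to reduce error in predicting the target variable) can be illustrated with a minimal sketch. The sketch below assumes PyTorch, a GRU-based classifier, and synthetic data; the class name, dimensions, and category labels are illustrative and not part of the claim.

```python
# Minimal sketch (not the patented system) of an iterative training-and-testing
# loop for an RNN that categorizes communication elements (e.g., intent or
# emotion) derived from transcribed audio. All names and data are illustrative.
import torch
import torch.nn as nn

class CommunicationElementRNN(nn.Module):
    """GRU-based classifier over feature sequences derived from transcribed audio."""
    def __init__(self, n_features=40, hidden=64, n_categories=4):
        super().__init__()
        self.rnn = nn.GRU(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_categories)

    def forward(self, x):
        _, h = self.rnn(x)              # h: final hidden state, shape (1, batch, hidden)
        return self.head(h.squeeze(0))  # logits over contextual categories

# Synthetic stand-ins for the training data; a real system would use labeled
# transcriptions annotated with intent, context, and emotion categories.
train_x = torch.randn(128, 20, 40)     # 128 sequences, 20 time steps, 40 features
train_y = torch.randint(0, 4, (128,))  # target variable: category index
test_x, test_y = torch.randn(32, 20, 40), torch.randint(0, 4, (32,))

model = CommunicationElementRNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Iterative training-and-testing loop: each iteration predicts the target
# variable, measures the prediction error, and updates the weights to reduce it.
for epoch in range(10):
    model.train()
    logits = model(train_x)
    loss = loss_fn(logits, train_y)    # error in predicting the target variable
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                   # weights differ on the next iteration

    model.eval()
    with torch.no_grad():
        accuracy = (model(test_x).argmax(dim=1) == test_y).float().mean()
    print(f"epoch {epoch}: loss={loss.item():.3f} test_acc={accuracy.item():.2f}")
```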
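The claimed mapping from an assigned contextual category to a speech pattern with prosody elements can likewise be sketched. The example below assumes the generated response is rendered through an SSML-capable TTS engine; the category-to-prosody table, the build_ssml helper, and the specific prosody values are illustrative assumptions, and control over timbre in particular would depend on the voices exposed by the chosen engine.

```python
# Minimal sketch of mapping an assigned contextual category to a speech pattern
# for the synthesized response. The table and values below are illustrative.
PROSODY_BY_CATEGORY = {
    "frustrated": {"rate": "95%",  "pitch": "-2st", "volume": "soft"},
    "urgent":     {"rate": "110%", "pitch": "+1st", "volume": "medium"},
    "neutral":    {"rate": "100%", "pitch": "+0st", "volume": "medium"},
}

def build_ssml(response_text: str, category: str) -> str:
    """Wrap the generated response text in SSML prosody chosen for the category."""
    p = PROSODY_BY_CATEGORY.get(category, PROSODY_BY_CATEGORY["neutral"])
    return (
        f'<speak><prosody rate="{p["rate"]}" pitch="{p["pitch"]}" '
        f'volume="{p["volume"]}">{response_text}</prosody></speak>'
    )

print(build_ssml("I understand, let me fix that for you right away.", "frustrated"))
```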
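Finally, the feedback step, in which a user reaction is measured as a quantifiable quality score and used to revise the speech pattern applied on future audio outputs, might look like the following sketch. The scoring scale, the exponential-moving-average update, and the SpeechPatternSelector class are illustrative assumptions rather than the claimed implementation.

```python
# Minimal sketch of the feedback step: a quality score measured from the user's
# reaction is used to revise the speech pattern chosen for future outputs.
from collections import defaultdict

class SpeechPatternSelector:
    def __init__(self, patterns):
        self.patterns = patterns  # candidate speech patterns per contextual category
        self.scores = defaultdict(lambda: defaultdict(lambda: 1.0))

    def choose(self, category):
        # Pick the highest-scoring pattern observed so far for this category.
        ranked = sorted(self.patterns[category],
                        key=lambda p: self.scores[category][p], reverse=True)
        return ranked[0]

    def record_reaction(self, category, pattern, quality_score):
        # Exponential moving average keeps the score responsive to recent reactions.
        prev = self.scores[category][pattern]
        self.scores[category][pattern] = 0.7 * prev + 0.3 * quality_score

selector = SpeechPatternSelector({"frustrated": ["calm_slow", "upbeat"]})
pattern = selector.choose("frustrated")
selector.record_reaction("frustrated", pattern, quality_score=0.2)  # poor reaction
print(selector.choose("frustrated"))  # a revised speech pattern is chosen next time
```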