CPC G10L 15/063 (2013.01) [G06F 40/169 (2020.01); G06F 40/20 (2020.01); G06N 20/00 (2019.01); G10L 15/02 (2013.01); G10L 15/16 (2013.01); G10L 15/22 (2013.01); G06T 13/205 (2013.01); G10L 25/18 (2013.01); G10L 25/30 (2013.01)] 17 Claims

1. A system comprising:
a non-transitory memory storing instructions executable to construct a machine-learning network to quantify a trust score and to automate trust delivery with a digital avatar by generating a trustworthy voice for the digital avatar; and
a processor in communication with the non-transitory memory, wherein the processor executes the instructions to cause the system to:
obtain a set of vocal features and a set of text features for each sample in a set of audio samples;
obtain a trust score for each sample;
perform preprocessing on the set of vocal features and the set of text features to obtain a set of input features for each sample;
determine a type of machine-learning algorithm for the machine-learning network based on a training result of the machine-learning network;
tune a set of hyperparameters for the machine-learning network based on cross-validation of the machine-learning network;
generate a predicted trust score by the machine-learning network from the set of input features for each sample;
train the machine-learning network based on the predicted trust score and the trust score for each sample to obtain the training result;
generate a set of trust components for a user by the machine-learning network;
concatenate the set of trust components with a user profile of the user to obtain an expanded user profile;
train a second machine-learning network by inputting the expanded user profile to recommend features for improving trust scores; and
generate a list of recommended features for the user by the trained second machine-learning network based on the expanded user profile,
wherein generating the trustworthy voice for the digital avatar comprises:
receiving an input text and a reference trustworthy tone sample;
collecting a sequence of phonemes and a Mel spectrogram from the input text using a text-to-speech module;
encoding the Mel spectrogram with an input encoder to generate an input embedding;
encoding the reference trustworthy tone sample with a trust encoder and concatenating the result with the input embedding to generate an output;
processing the output of the concatenation through a location-sensitive attention layer using cumulative attention weights to generate an encoded input sequence;
predicting a Mel spectrogram with a decoder from the encoded input sequence; and
generating the trustworthy voice for the digital avatar from the Mel spectrogram using a vocoder, wherein the digital avatar is configured to replace the user in a conversation.
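
The trust-scoring and recommendation steps recited above can be illustrated with a minimal sketch, assuming scikit-learn; the synthetic feature arrays, the hyperparameter grid, the trust components, and the catalog of recommendable features below are hypothetical placeholders, not elements disclosed by the claim.

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier, MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200

# Placeholder features; the claim does not specify the vocal/text
# feature extraction, so synthetic arrays stand in for it here.
vocal_feats = rng.random((n, 40))   # e.g., pitch, energy, spectral stats
text_feats = rng.random((n, 20))    # e.g., lexical / sentiment features
trust_scores = rng.random(n)        # labeled trust score per sample

# "perform preprocessing ... to obtain a set of input features":
# modeled here as per-sample concatenation of the two feature sets.
X = np.hstack([vocal_feats, text_feats])

# "tune a set of hyperparameters ... based on cross-validation":
# a small, hypothetical grid searched with 5-fold CV.
model = make_pipeline(StandardScaler(), MLPRegressor(max_iter=2000, random_state=0))
grid = {
    "mlpregressor__hidden_layer_sizes": [(64,), (128, 64)],
    "mlpregressor__alpha": [1e-4, 1e-3],
}
search = GridSearchCV(model, grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X, trust_scores)                 # trained against the labeled scores
predicted_trust = search.predict(X)         # the predicted trust score per sample

# "concatenate the set of trust components with a user profile":
# hypothetical per-user trust components and profile fields.
trust_components = rng.random((n, 5))
user_profiles = rng.random((n, 10))
expanded_profiles = np.hstack([user_profiles, trust_components])

# Second machine-learning network: a multi-label classifier over a
# hypothetical catalog of features recommendable for improving trust.
feature_labels = (rng.random((n, 4)) > 0.5).astype(int)   # placeholder targets
recommender = MLPClassifier(hidden_layer_sizes=(64,), max_iter=2000, random_state=0)
recommender.fit(expanded_profiles, feature_labels)
recommended = recommender.predict(expanded_profiles[:1])  # feature list for one user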
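The voice-generation steps recited above follow a Tacotron-2-style layout. The sketch below, in PyTorch, shows one way to wire a trust encoder, its concatenation with the input embedding, and location-sensitive attention driven by cumulative attention weights; all module choices, names, and dimensions are illustrative assumptions rather than disclosures from the claim.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TrustEncoder(nn.Module):
    # Encodes the reference trustworthy tone sample (a Mel spectrogram)
    # into a fixed-length trust embedding; the GRU is an assumed stand-in
    # for whatever reference encoder the system actually uses.
    def __init__(self, n_mels=80, dim=128):
        super().__init__()
        self.gru = nn.GRU(n_mels, dim, batch_first=True)

    def forward(self, ref_mel):              # ref_mel: (B, T_ref, n_mels)
        _, h = self.gru(ref_mel)
        return h[-1]                         # (B, dim)

class LocationSensitiveAttention(nn.Module):
    # Attention whose energies depend on the cumulative attention weights
    # of earlier decoder steps, as the claim recites.
    def __init__(self, enc_dim, query_dim, attn_dim=128, loc_filters=32):
        super().__init__()
        self.q = nn.Linear(query_dim, attn_dim, bias=False)
        self.m = nn.Linear(enc_dim, attn_dim, bias=False)
        self.loc_conv = nn.Conv1d(1, loc_filters, kernel_size=31, padding=15)
        self.loc = nn.Linear(loc_filters, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, query, memory, cum_weights):
        # query: (B, query_dim); memory: (B, T, enc_dim); cum_weights: (B, T)
        loc = self.loc(self.loc_conv(cum_weights.unsqueeze(1)).transpose(1, 2))
        e = self.v(torch.tanh(self.q(query).unsqueeze(1) + self.m(memory) + loc))
        w = F.softmax(e.squeeze(-1), dim=1)                   # attention weights
        context = torch.bmm(w.unsqueeze(1), memory).squeeze(1)
        return context, w

# Hypothetical wiring for one attention step.
B, T, n_mels = 2, 100, 80
input_encoder = nn.GRU(n_mels, 256, batch_first=True)        # input encoder
trust_encoder = TrustEncoder(n_mels=n_mels, dim=128)
attention = LocationSensitiveAttention(enc_dim=256 + 128, query_dim=256)

memory, _ = input_encoder(torch.randn(B, T, n_mels))         # input embedding
trust = trust_encoder(torch.randn(B, 60, n_mels))            # trust embedding
# "concatenating the result with the input embedding to generate an output":
memory = torch.cat([memory, trust.unsqueeze(1).expand(-1, T, -1)], dim=-1)

cum_weights = torch.zeros(B, T)
context, w = attention(torch.randn(B, 256), memory, cum_weights)
cum_weights = cum_weights + w    # accumulated for the next decoder step
# A decoder would predict the next Mel frame from `context`, and a neural
# vocoder (e.g., WaveGlow or HiFi-GAN) would then render the waveform,
# corresponding to the claim's final two steps.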