| CPC G10L 15/22 (2013.01) [G06F 9/453 (2018.02); G06N 3/08 (2013.01); G10L 25/63 (2013.01); G10L 2015/225 (2013.01)] | 12 Claims |

1. A method for dynamic speech modulation, the method comprising:
extracting, by a digital voice assistant, Mel-frequency cepstral coefficient (MFCC) features from a received audio command from a user to identify the user and a corresponding user profile based on the MFCC features, wherein the corresponding user profile stores a default language of the user and one or more user pronunciation models learned by the digital voice assistant based on analysis of user pronunciations;
generating, by the digital voice assistant, a first response to the received audio command, wherein the first response includes a first pronunciation associated with a different language than the default language of the user;
sampling, by the digital voice assistant, the first response to determine whether the first response will be understood by the user based on a comprehension level of the user associated with the different language in the first response, wherein the comprehension level of the user is identified in the corresponding user profile, and wherein the comprehension level is dynamically set by the digital voice assistant based on learning a user comprehension rate for one or more prior unmodified responses from the digital voice assistant; and
in response to determining, by the digital voice assistant, that the first response includes a lower confidence score than the comprehension level of the user, performing a cosine similarity comparison to identify at least one speech feature in the first pronunciation of the first response that is different relative to the one or more user pronunciation models learned by the digital voice assistant, modifying the at least one speech feature in the first pronunciation of the first response to align the first pronunciation with the one or more user pronunciation models learned by the digital voice assistant, and transmitting to the user, by the digital voice assistant, a second response to the received audio command generated based on the one or more user pronunciation models, wherein the second response includes a higher confidence score than the comprehension level of the user.
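The first claim element identifies the user by matching MFCC features extracted from the audio command against enrolled user profiles. A minimal sketch of that matching step, assuming the frame-level MFCC matrix has already been computed (e.g. by an upstream feature extractor) and that each profile stores an enrolled mean MFCC vector; all function and variable names here are hypothetical illustrations, not recited in the claim:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify_user(command_mfcc, profiles):
    """Match a command's MFCC features against enrolled user profiles.

    command_mfcc: (frames, n_mfcc) array of MFCC features from the audio command.
    profiles: dict mapping user id -> enrolled mean MFCC vector of shape (n_mfcc,).
    Returns the user id whose profile has the highest cosine similarity.
    """
    # Pool frame-level MFCCs into a single utterance-level vector.
    query = command_mfcc.mean(axis=0)
    return max(profiles, key=lambda uid: cosine(query, profiles[uid]))
```

Mean-pooling frames into one vector is only one simple speaker-matching heuristic; production systems typically use learned speaker embeddings rather than raw MFCC averages.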
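The final claim element uses cosine similarity to find speech features of the planned response that diverge from the user's learned pronunciation models, then modifies those features to align with the models. A minimal sketch under assumed representations: each feature (e.g. a pitch contour or phoneme-duration vector) is a numeric vector, a divergence threshold flags features to modify, and modification is a linear blend toward the user model. The names, threshold, and blending heuristic are illustrative assumptions, not part of the claim:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def align_pronunciation(response_features, user_model, threshold=0.9, blend=0.5):
    """Pull divergent speech features toward the user's pronunciation model.

    response_features / user_model: dicts mapping a feature name (e.g.
    'pitch_contour', 'phoneme_durations') to a vector for the planned
    response and for the learned user model, respectively.
    Features whose cosine similarity to the model falls below `threshold`
    are interpolated toward the model vector by factor `blend`.
    """
    adjusted = {}
    for name, vec in response_features.items():
        model_vec = user_model[name]
        if cosine(vec, model_vec) < threshold:
            # Divergent feature: move it toward the user's pronunciation model.
            adjusted[name] = (1 - blend) * vec + blend * model_vec
        else:
            # Already aligned: keep the feature unchanged.
            adjusted[name] = vec
    return adjusted
```

In this sketch the threshold plays the role of the claimed comparison against the user's comprehension level, and the blended features would drive synthesis of the second response.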