CPC G10L 13/086 (2013.01) [G06N 3/04 (2013.01); G10L 13/047 (2013.01)]
13 Claims
1. A method implemented by one or more processors, the method comprising:
receiving a natural language textual data stream to be converted into computer generated speech for rendering to a user via one or more speakers of a computing device,
wherein the natural language textual data stream includes a primary portion that is in a primary language assigned to the user, and wherein the natural language textual data stream includes a secondary language portion that is in a secondary language and not in the primary language assigned to the user;
processing the primary portion, of the natural language textual data stream, to determine a first set of phonemes in a universal phoneme set,
wherein the determined first set of phonemes corresponds to the primary portion, and
wherein the universal phoneme set includes one or more phonemes that are common to a plurality of languages, the plurality of languages including the primary language, the secondary language, and a tertiary language;
processing the secondary language portion, which is in the secondary language and not in the primary language, of the natural language textual data stream, to determine a second set of phonemes in the universal phoneme set, wherein the determined second set of phonemes corresponds to the secondary language portion;
determining whether the secondary language, of the secondary language portion of the natural language textual data stream, is assigned as a familiar language for the user;
in response to determining that the secondary language is assigned as a familiar language for the user:
processing, using a neural network model trained to generate human speech using phonemes that are specific to each of multiple languages, both the determined first set of phonemes in the universal phoneme set and the determined second set of phonemes in the universal phoneme set to generate audio data that mimics a given human voice speaking the first set of phonemes and the second set of phonemes,
wherein the neural network model is trained based on a plurality of training instances that each include a corresponding cross-lingual spoken utterance from a multilingual user and cross-lingual phonemes corresponding to the spoken utterance;
wherein a primary portion, of the audio data generated using the trained neural network model, corresponding to the primary portion in the primary language is pronounced by the given human voice in the primary language, and
wherein a secondary portion, of the audio data generated using the trained neural network model, corresponding to the secondary language portion in the secondary language is pronounced by the given human voice in the secondary language;
in response to determining that the secondary language is not assigned as a familiar language for the user:
mapping the determined second set of phonemes, that correspond to the secondary language portion, to one or more correlated phonemes in the primary language; and
processing the determined first set of phonemes in the universal phoneme set and the correlated phonemes in the primary language to generate alternate audio data that mimics the given human voice speaking the first set of phonemes and the correlated phonemes; and
causing the audio data or the alternate audio data to be rendered via the one or more speakers of the computing device.
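For illustration only, the following Python sketch traces the claimed control flow end to end: splitting the textual data stream into language-tagged portions, converting each portion to phonemes of a shared (universal) set, and branching on whether the secondary language is assigned as a familiar language for the user. Every name here (Portion, to_universal_phonemes, synthesize_phonemes, the toy G2P table) is hypothetical; the claim discloses no implementation.

    from dataclasses import dataclass

    @dataclass
    class Portion:
        text: str
        language: str  # e.g. "en" (primary) or "es" (secondary)

    # Toy lookup standing in for a trained grapheme-to-phoneme step that
    # emits symbols from a shared, cross-lingual (universal) phoneme set.
    _G2P = {
        ("en", "hello"): ["HH", "AH", "L", "OW"],
        ("es", "hola"): ["O", "L", "A"],
    }

    def to_universal_phonemes(portion: Portion) -> list[str]:
        return [ph
                for word in portion.text.lower().split()
                for ph in _G2P.get((portion.language, word), [])]

    def synthesize_phonemes(stream: list[Portion],
                            primary_language: str,
                            familiar_languages: set[str],
                            map_to_primary) -> list[str]:
        """Return the phoneme sequence handed to the speech model."""
        output: list[str] = []
        for portion in stream:
            phonemes = to_universal_phonemes(portion)
            if (portion.language == primary_language
                    or portion.language in familiar_languages):
                # Familiar secondary language: keep the native
                # pronunciation for the neural model to speak as-is.
                output.extend(phonemes)
            else:
                # Unfamiliar secondary language: substitute correlated
                # primary-language phonemes before synthesis.
                output.extend(map_to_primary(phonemes))
        return output

    stream = [Portion("hello", "en"), Portion("hola", "es")]
    print(synthesize_phonemes(stream, "en", {"es"}, lambda ps: ps))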
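The claim leaves the neural network architecture unspecified. As one plausible, entirely assumed realization, a sequence model over universal-phoneme embeddings can emit acoustic frames in a single target voice for the primary-language and secondary-language phoneme sets together; the PyTorch sketch below is illustrative, not the patented model.

    import torch
    import torch.nn as nn

    class CrossLingualTTS(nn.Module):
        def __init__(self, num_phonemes: int, frame_dim: int = 80):
            super().__init__()
            self.embed = nn.Embedding(num_phonemes, 128)  # shared phoneme space
            self.rnn = nn.GRU(128, 256, batch_first=True)
            self.to_frames = nn.Linear(256, frame_dim)    # e.g. mel frames

        def forward(self, phoneme_ids: torch.Tensor) -> torch.Tensor:
            # phoneme_ids: (batch, time) indices into the universal phoneme
            # set, covering both the primary and secondary portions at once.
            hidden, _ = self.rnn(self.embed(phoneme_ids))
            return self.to_frames(hidden)                 # (batch, time, frame_dim)

    # Per the claim, training pairs would be cross-lingual utterances from
    # multilingual speakers aligned with cross-lingual phoneme transcriptions.
    model = CrossLingualTTS(num_phonemes=100)
    frames = model(torch.randint(0, 100, (1, 12)))        # toy phoneme sequence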
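The "correlated phonemes" mapping of the unfamiliar-language branch is likewise unspecified. One conceivable realization, sketched below with hypothetical symbols and tables, substitutes each phoneme absent from the primary language's inventory with a perceptually close phoneme that the primary language does contain; the result plugs into synthesize_phonemes above as the map_to_primary callback.

    # Hypothetical substitution table: universal phonemes outside the
    # primary-language (here English) inventory map to close correlates.
    CORRELATES_EN = {
        "X": "K",          # e.g. a velar fricative approximated by /k/
        "R_trill": "R",    # a trilled r approximated by the English r
    }
    EN_INVENTORY = {"HH", "AH", "L", "OW", "K", "R", "O", "A"}

    def map_to_primary(phonemes: list[str]) -> list[str]:
        return [ph if ph in EN_INVENTORY else CORRELATES_EN.get(ph, ph)
                for ph in phonemes]

    print(map_to_primary(["X", "AH", "R_trill"]))  # ['K', 'AH', 'R']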