CPC G10L 13/086 (2013.01) [G06F 40/263 (2020.01); G06F 40/279 (2020.01); G06N 3/08 (2013.01); G10L 13/047 (2013.01)] | 18 Claims |
1. A computer-implemented method that, when executed by data processing hardware, causes the data processing hardware to perform operations comprising:
receiving, at an encoder of a speech synthesis model, a text input comprising a sequence of words represented as an input encoder embedding, the input encoder embedding comprising a plurality of tokens, the plurality of tokens comprising a first set of grapheme tokens representing the text input as respective graphemes and a second set of phoneme tokens representing the text input as respective phonemes, each grapheme token of the first set of grapheme tokens comprising a respective wordpiece sub-word of a respective word in the sequence of words, wherein each corresponding token of the plurality of tokens of the input encoder embedding represents a combination of:
a respective word position embedding for each respective word in the sequence of words, the respective word position embedding representing sub-word level positions for both one or more of the grapheme tokens from the first set of grapheme tokens that correspond to the respective word and one or more of the phoneme tokens from the second set of phoneme tokens that correspond to the respective word; and
a position embedding representing an overall index of position for each token of the plurality of tokens of the input encoder embedding;
for each respective phoneme token of the second set of phoneme tokens:
identifying, by the encoder, a respective word of the sequence of words corresponding to the respective phoneme token based on the respective word position embedding that represents the sub-word level position for the respective phoneme token that corresponds to the respective word; and
determining, by the encoder, a respective grapheme token representing the respective word of the sequence of words corresponding to the respective phoneme token by determining that the sub-word level position for the respective grapheme token that corresponds to the respective word is represented by the same respective word position embedding as the respective word position embedding representing the sub-word level position for the respective phoneme token; and
generating, by the encoder, an output encoder embedding based on a relationship between each respective phoneme token and the respective grapheme token determined to represent a same respective word as the respective phoneme token.
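The claimed token layout and alignment step can be illustrated with a short sketch. This is only one interpretation of the claim language (a PnG-BERT-style encoder input in which a grapheme segment and a phoneme segment share per-word position ids); every name below (`build_tokens`, the segment labels `"G"`/`"P"`) is hypothetical and does not come from the patent.

```python
# Illustrative sketch only: one interpretation of the claimed token layout
# and the phoneme-to-grapheme alignment via shared word position embeddings.
# All function and variable names are hypothetical, not from the patent.

def build_tokens(grapheme_words, phoneme_words):
    """grapheme_words: per-word lists of wordpiece sub-words (first token set).
    phoneme_words: per-word lists of phonemes (second token set).
    Returns the combined token list, each token's overall position index,
    and, for each phoneme token, the grapheme tokens that share its word
    position (i.e. are determined to represent the same respective word)."""
    tokens = []  # (segment, symbol, word_position) triples
    for w, pieces in enumerate(grapheme_words):    # grapheme segment first
        for piece in pieces:
            tokens.append(("G", piece, w))
    num_graphemes = len(tokens)
    for w, phones in enumerate(phoneme_words):     # then the phoneme segment
        for phone in phones:
            tokens.append(("P", phone, w))
    # "position embedding representing an overall index of position"
    overall_pos = list(range(len(tokens)))
    # For each phoneme token, collect the grapheme tokens whose sub-word
    # level position is represented by the same word position id.
    align = {
        i: [j for j in range(num_graphemes) if tokens[j][2] == w]
        for i, (seg, _, w) in enumerate(tokens) if seg == "P"
    }
    return tokens, overall_pos, align
```

For the input "hello world" with wordpieces `[["hel", "lo"], ["world"]]` and phonemes `[["HH", "AH", "L", "OW"], ["W", "ER", "L", "D"]]`, the phoneme token `"HH"` (overall index 3) aligns to grapheme tokens `"hel"` and `"lo"` (indices 0 and 1), since all three carry word position 0. A real encoder would then, per token, sum a symbol embedding, the shared word position embedding, and the overall position embedding, and let self-attention exploit the resulting phoneme-grapheme relationship when generating the output encoder embedding.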