US 12,148,444 B2
Synthesizing speech from text using neural networks
Yonghui Wu, Fremont, CA (US); Jonathan Shen, Santa Clara, CA (US); Ruoming Pang, New York, NY (US); Ron J. Weiss, New York, NY (US); Michael Schuster, Saratoga, CA (US); Navdeep Jaitly, Mountain View, CA (US); Zongheng Yang, Berkeley, CA (US); Zhifeng Chen, Sunnyvale, CA (US); Yu Zhang, Mountain View, CA (US); Yuxuan Wang, Sunnyvale, CA (US); Russell John Wyatt Skerry-Ryan, Mountain View, CA (US); Ryan M. Rifkin, Oakland, CA (US); and Ioannis Agiomyrgiannakis, London (GB)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Apr. 5, 2021, as Appl. No. 17/222,736.
Application 17/222,736 is a continuation of application No. 16/058,640, filed on Aug. 8, 2018, granted, now 10,971,170, issued on Apr. 6, 2021.
Prior Publication US 2021/0295858 A1, Sep. 23, 2021
This patent is subject to a terminal disclaimer.
Int. Cl. G10L 13/047 (2013.01); G06N 3/045 (2023.01); G06N 3/08 (2023.01); G06N 5/046 (2023.01); G06N 7/01 (2023.01); G10L 13/08 (2013.01); G10L 25/18 (2013.01); G10L 25/30 (2013.01)
CPC G10L 25/30 (2013.01) [G06N 3/045 (2023.01); G06N 3/08 (2013.01); G06N 5/046 (2013.01); G06N 7/01 (2023.01); G10L 13/047 (2013.01); G10L 13/08 (2013.01); G10L 25/18 (2013.01)] 20 Claims
OG exemplary drawing
 
9. A system comprising one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
generating, from an input data representing a text input, an output sequence of audio data representing the input data, wherein the output sequence of audio data comprises a respective audio output sample for each of a plurality of time steps, including, for each of the plurality of time steps:
generating a mel-frequency spectrogram for a time step of the plurality of time steps by processing a representation of a respective portion of the input data using a decoder neural network, wherein the decoder neural network is an autoregressive neural network comprising an LSTM subnetwork, a linear transform, and a convolutional subnetwork;
generating a probability distribution over a plurality of possible audio output samples for the time step by processing the mel-frequency spectrogram for the time step using a vocoder neural network; and
selecting an audio output sample for the time step from the plurality of possible audio output samples in accordance with the probability distribution.
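The per-time-step loop recited in the claim (decoder produces a mel-frequency spectrogram frame, vocoder produces a probability distribution over possible audio samples, one sample is selected from that distribution) can be sketched as a toy pipeline. The layer stand-ins, sizes, and function names below are illustrative assumptions for exposition only, not the patented networks or any trained model.

```python
import math
import random

N_MELS = 8           # mel-spectrogram channels per frame (toy size, assumed)
N_SAMPLE_LEVELS = 4  # number of possible audio output samples (assumed)

def decoder_step(text_repr, prev_state):
    """Stand-in for the claim's autoregressive decoder (LSTM subnetwork,
    linear transform, convolutional subnetwork): emit one mel frame from
    the input representation plus carried autoregressive state."""
    frame = [math.tanh(x + prev_state) for x in text_repr]
    new_state = sum(frame) / len(frame)  # state fed back at the next step
    return frame, new_state

def vocoder_step(mel_frame):
    """Stand-in for the claim's vocoder neural network: map a mel frame
    to a probability distribution over the possible audio samples."""
    s = sum(mel_frame)
    logits = [s * (k + 1) / N_SAMPLE_LEVELS for k in range(N_SAMPLE_LEVELS)]
    exps = [math.exp(v) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]  # softmax: sums to 1

def synthesize(text_repr, n_steps, rng):
    """Run the claimed per-time-step loop: decoder -> vocoder ->
    select a sample in accordance with the probability distribution."""
    state, audio = 0.0, []
    for _ in range(n_steps):
        mel, state = decoder_step(text_repr, state)
        dist = vocoder_step(mel)
        sample = rng.choices(range(N_SAMPLE_LEVELS), weights=dist)[0]
        audio.append(sample)
    return audio

rng = random.Random(0)
audio = synthesize([0.1 * i for i in range(N_MELS)], n_steps=5, rng=rng)
print(audio)
```

The essential structural point the sketch captures is the two-stage factorization: the decoder conditions only on the text representation and its own past outputs (autoregression), while the vocoder sees only the mel frame, so sample selection at each time step is stochastic, drawn from the vocoder's distribution rather than computed deterministically.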