US 12,272,348 B2
Conformer-based speech conversion model
Bhuvana Ramabhadran, Mt. Kisco, NY (US); Zhehuai Chen, Jersey City, NJ (US); Fadi Biadsy, Mountain View, CA (US); and Pedro J. Moreno Mengibar, Jersey City, NJ (US)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Mar. 16, 2022, as Appl. No. 17/655,030.
Claims priority of provisional application 63/312,195, filed on Feb. 21, 2022.
Claims priority of provisional application 63/166,954, filed on Mar. 26, 2021.
Prior Publication US 2022/0310056 A1, Sep. 29, 2022
Int. Cl. G10L 13/027 (2013.01); G10L 13/047 (2013.01); G10L 15/16 (2006.01); G10L 15/22 (2006.01); G10L 25/18 (2013.01)
CPC G10L 13/027 (2013.01) [G10L 13/047 (2013.01); G10L 15/16 (2013.01); G10L 15/22 (2013.01); G10L 25/18 (2013.01)] 18 Claims
OG exemplary drawing
 
1. A system comprising data processing hardware and memory hardware in communication with the data processing hardware and storing instructions that, when executed by the data processing hardware, cause the data processing hardware to perform operations for executing a speech conversion system, the speech conversion system comprising:
an encoder configured to encode an input spectrogram corresponding to an utterance, the input spectrogram extracted from input speech spoken by a source speaker with atypical speech and comprising a sequence of spectrogram frames having a length equal to a first value, the encoder comprising:
a first subsampling layer configured to subsample the sequence of spectrogram frames to decrease the length of the sequence of spectrogram frames to a second value less than the first value;
a stack of self-attention blocks comprising an initial set of self-attention blocks and a final set of self-attention blocks, wherein the initial set of self-attention blocks is configured to process the subsampled sequence of spectrogram frames output by the first subsampling layer to output first hidden representations having a length equal to the second value; and
a second subsampling layer configured to subsample the first hidden representations output by the initial set of self-attention blocks to decrease the length of the first hidden representations to a third value less than the second value, wherein the final set of self-attention blocks is configured to process the subsampled first hidden representations to output second hidden representations having a length equal to the third value; and
a spectrogram decoder configured to:
receive, as input, the encoded spectrogram from the encoder; and
generate, as output, an output spectrogram corresponding to a synthesized speech representation of the utterance comprising a synthesized canonical fluent speech representation of the utterance in a voice of the source speaker.
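The claimed encoder interleaves two temporal subsampling layers with two sets of self-attention blocks, so the frame-sequence length shrinks in stages while each attention stage preserves the length it receives. The sketch below is a minimal NumPy illustration of that shape flow only; the block counts, stride of 2, feature dimension, and the toy single-head attention are illustrative assumptions, not the patented Conformer implementation.

```python
import numpy as np

def subsample(frames: np.ndarray, stride: int = 2) -> np.ndarray:
    # Temporal subsampling: keep every `stride`-th frame, reducing sequence length.
    return frames[::stride]

def self_attention_block(x: np.ndarray) -> np.ndarray:
    # Toy single-head self-attention (stand-in for a full Conformer block);
    # preserves the sequence length of its input.
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

def encode(spectrogram: np.ndarray, n_initial: int = 2, n_final: int = 2) -> np.ndarray:
    x = subsample(spectrogram)          # first subsampling layer: first value -> second value
    for _ in range(n_initial):          # initial set of self-attention blocks (length preserved)
        x = self_attention_block(x)
    x = subsample(x)                    # second subsampling layer: second value -> third value
    for _ in range(n_final):            # final set of self-attention blocks (length preserved)
        x = self_attention_block(x)
    return x

spec = np.random.randn(80, 64)          # 80 spectrogram frames (first value), 64 mel bins
encoded = encode(spec)                  # sequence length 80 -> 40 -> 20
```

With a stride of 2 at each stage, the first value of 80 frames becomes a second value of 40 and a third value of 20, matching the claimed length relationships (third < second < first).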