US 12,154,544 B1
Synthetic speech processing
Michal Czuczman, Gdansk (PL); You Wang, Cambridge (GB); Masaki Noguchi, Cambridge (GB); and Viacheslav Klimkov, Gdansk (PL)
Assigned to Amazon Technologies, Inc., Seattle, WA (US)
Filed by Amazon Technologies, Inc., Seattle, WA (US)
Filed on Mar. 18, 2021, as Appl. No. 17/205,493.
Int. Cl. G10L 13/08 (2013.01); G10L 15/05 (2013.01); G10L 15/16 (2006.01); G10L 15/187 (2013.01)

CPC G10L 13/08 (2013.01) [G10L 15/05 (2013.01); G10L 15/16 (2013.01); G10L 15/187 (2013.01); G10L 2013/083 (2013.01)]

19 Claims

1. A computer-implemented method for processing text data for synthesized speech, the method comprising:

receiving input data representing a number corresponding to a first pronunciation and to a second pronunciation;

processing the input data to determine first segment data corresponding to a first portion of the number;

processing the input data to determine second segment data corresponding to a second portion of the number;

processing, using a neural network attention layer of an encoder, the first segment data to determine first embedding data representing the first segment data and a first context of the first segment data with respect to the input data;

processing, using the neural network attention layer, the second segment data to determine second embedding data representing the second segment data and a second context of the second segment data with respect to the input data;

processing, using a decoder, the first embedding data to determine first category data indicating that the first embedding data corresponds to a first category associated with the first pronunciation, wherein processing the first embedding data comprises processing, using a first component, the first embedding data;

processing the second embedding data to determine second category data indicating that the second embedding data corresponds to a second category corresponding to the second pronunciation; and

processing the first category data and the second category data to determine output data representing at least a first word corresponding to one of the first pronunciation and the second pronunciation.
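
The steps of claim 1 can be illustrated with a minimal sketch. It is a toy under stated assumptions, not the patented implementation. It assumes the number arrives as a digit string, that segments are two-digit chunks, and that only two categories exist ("pair" for reading a segment as a two-digit number, "digits" for reading it digit by digit). The function names, the hand-made embedding features, the identity projections in the attention step, and the untrained decoder weights are all hypothetical and chosen only to keep the example self-contained.

```python
import math

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

# Hypothetical, hand-set decoder weights (a real system would learn these).
DECODER_WEIGHTS = [1.0, 0.0, -0.1]


def segment(number_text):
    """Split the digit string into two-digit segments (assumed rule)."""
    return [number_text[i:i + 2] for i in range(0, len(number_text), 2)]


def embed(seg, position):
    """Hypothetical features: scaled value, segment position, constant bias."""
    return [int(seg) / 100.0, float(position), 1.0]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with identity projections.

    Each output mixes every segment vector, so it carries the context
    of a segment with respect to the whole input.
    """
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, vectors))
                        for j in range(dim)])
    return outputs


def decode(embedding):
    """Toy linear decoder mapping one contextual embedding to a category."""
    score = sum(w * x for w, x in zip(DECODER_WEIGHTS, embedding))
    return "pair" if score > 0 else "digits"


def pair_words(seg):
    """Read a segment as a number below one hundred."""
    value = int(seg)
    if value < 20:
        return ONES[value]
    tens, ones = divmod(value, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")


def verbalize(segments, categories):
    """Combine the per-segment categories into output words."""
    words = []
    for seg, category in zip(segments, categories):
        if category == "pair":
            words.append(pair_words(seg))
        else:
            words.extend(ONES[int(ch)] for ch in seg)
    return " ".join(words)


def normalize(number_text):
    """Run the whole sketch: segment, encode, decode, verbalize."""
    segments = segment(number_text)
    embeddings = [embed(s, i) for i, s in enumerate(segments)]
    contextual = self_attention(embeddings)
    categories = [decode(e) for e in contextual]
    return verbalize(segments, categories)


print(normalize("1984"))
```

With these hand-set weights both segments of "1984" fall into the "pair" category, so the sketch prints "nineteen eighty-four". Calling verbalize(["19", "84"], ["digits", "digits"]) directly yields the alternative reading "one nine eight four", which shows how the category data alone selects between pronunciations.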