US 12,154,544 B1
Synthetic speech processing
Michal Czuczman, Gdansk (PL); You Wang, Cambridge (GB); Masaki Noguchi, Cambridge (GB); and Viacheslav Klimkov, Gdansk (PL)
Assigned to Amazon Technologies, Inc., Seattle, WA (US)
Filed by Amazon Technologies, Inc., Seattle, WA (US)
Filed on Mar. 18, 2021, as Appl. No. 17/205,493.
Int. Cl. G10L 13/08 (2013.01); G10L 15/05 (2013.01); G10L 15/16 (2006.01); G10L 15/187 (2013.01)
CPC G10L 13/08 (2013.01) [G10L 15/05 (2013.01); G10L 15/16 (2013.01); G10L 15/187 (2013.01); G10L 2013/083 (2013.01)]
19 Claims
 
1. A computer-implemented method for processing text data for synthesized speech, the method comprising:
receiving input data representing a number corresponding to a first pronunciation and to a second pronunciation;
processing the input data to determine first segment data corresponding to a first portion of the number;
processing the input data to determine second segment data corresponding to a second portion of the number;
processing, using a neural network attention layer of an encoder, the first segment data to determine first embedding data representing the first segment data and a first context of the first segment data with respect to the input data;
processing, using the neural network attention layer, the second segment data to determine second embedding data representing the second segment data and a second context of the second segment data with respect to the input data;
processing, using a decoder, the first embedding data to determine first category data indicating that the first embedding data corresponds to a first category associated with the first pronunciation, wherein processing the first embedding data comprises processing, using a first component, the first embedding data;
processing the second embedding data to determine second category data indicating that the second embedding data corresponds to a second category corresponding to the second pronunciation; and
processing the first category data and the second category data to determine output data representing at least a first word corresponding to one of the first pronunciation and the second pronunciation.
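
The data flow recited in claim 1 (number in, segments, per-segment categories, words out) can be illustrated with a minimal Python sketch. It is not the claimed implementation: the claim does not specify the segmentation rule, the category set, or the model weights, so the two-digit split, the `PAIR` and `DIGITS` category names, and every function name below are hypothetical. The rule-based `classify` function stands in for the attention-layer encoder and the decoder, which in the claim are neural network components.

```python
# Toy illustration only: a rule-based stand-in for the encoder/decoder of claim 1.
# All names, categories, and rules here are assumptions, not taken from the patent.

ONES = ["zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"]
TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen",
         "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty",
        "fifty", "sixty", "seventy", "eighty", "ninety"]

# Hypothetical categories: read a segment as one number, or digit by digit.
PAIR, DIGITS = "pair", "digits"


def segment(number: str, size: int = 2) -> list[str]:
    """Split a digit string into portions of at most `size` digits."""
    return [number[i:i + size] for i in range(0, len(number), size)]


def classify(seg: str) -> str:
    """Assign a category to a segment (stand-in for the trained model)."""
    return DIGITS if seg.startswith("0") else PAIR


def two_digit_words(n: int) -> str:
    """Spell out an integer in the range 0..99."""
    if n < 10:
        return ONES[n]
    if n < 20:
        return TEENS[n - 10]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")


def render(seg: str, category: str) -> str:
    """Turn one segment into words according to its category."""
    if category == DIGITS:
        return " ".join(ONES[int(d)] for d in seg)
    return two_digit_words(int(seg))


def verbalize(number: str) -> str:
    """Segment the number, categorize each segment, and join the words."""
    segments = segment(number)
    categories = [classify(s) for s in segments]
    return " ".join(render(s, c) for s, c in zip(segments, categories))


if __name__ == "__main__":
    for text in ("1984", "2021", "1905"):
        print(text, "->", verbalize(text))
```

In this sketch `verbalize("1984")` yields "nineteen eighty-four" rather than the cardinal reading "one thousand nine hundred eighty-four", which mirrors the claim's premise that one number can correspond to more than one pronunciation. A learned model would choose the category from context; the leading-zero rule used here is only a placeholder for that decision.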