US 12,380,876 B2
Generating genre appropriate voices for audio books
Ramya Rasipuram, Los Gatos, CA (US); William Beckman, San Rafael, CA (US); Ladan Golipour, Saratoga, CA (US); David A. Winarsky, San Jose, CA (US); Cheng-Chieh Yeh, Santa Clara, CA (US); and Weicheng Zhang, Santa Clara, CA (US)
Assigned to Apple Inc., Cupertino, CA (US)
Filed by Apple Inc., Cupertino, CA (US)
Filed on Oct. 31, 2022, as Appl. No. 17/977,360.
Claims priority of provisional application 63/331,626, filed on Apr. 15, 2022.
Claims priority of provisional application 63/273,796, filed on Oct. 29, 2021.
Prior Publication US 2023/0134970 A1, May 4, 2023
Int. Cl. G10L 13/10 (2013.01); G06F 40/284 (2020.01); G06F 40/30 (2020.01); G10L 13/033 (2013.01)

CPC G10L 13/10 (2013.01) [G06F 40/284 (2020.01); G06F 40/30 (2020.01); G10L 13/033 (2013.01)]

40 Claims

1. A non-transitory computer-readable storage medium storing one or more programs configured to be executed by one or more processors of an electronic device, the one or more programs including instructions for:

receiving a first text including at least a first subset and a second subset, wherein at least a portion of the first subset overlaps with at least a portion of the second subset;

determining one or more themes associated with the first text;

determining, based on the one or more themes, a genre associated with the first text, wherein the genre is different from the one or more themes;

determining, based on the first text, a first prosody for a speech output, wherein the first prosody is representative of the genre;

determining a first semantic meaning of the first text based on a context determined from a second text received prior to the first text, wherein a machine learning model is trained to determine the first semantic meaning of the first text;

adjusting the first prosody for the speech output based on the context determined from the second text received prior to the first text; and

generating, based on the prosody and the first semantic meaning, a first speech output of the first text;

in accordance with a determination that a similarity between the first speech output and a candidate text representation determined from the first speech output is below a threshold:

determining a second prosody and a second semantic meaning of the first text; and

generating a second speech output of the first text based on the second prosody and the second semantic meaning.
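
For illustration only, the control flow recited in claim 1 might be sketched as follows. This is not the patented implementation. The keyword tables, the Prosody fields, the numeric adjustments, the 0.8 threshold, and the interpret, synthesize, and transcribe callables are all hypothetical stand-ins for the trained models the claim refers to. The sketch also assumes one possible reading of the similarity step: the speech output is converted back into a candidate text, and that text is compared with the first text.

```python
import re
from dataclasses import dataclass, replace
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Prosody:
    rate: float   # relative speaking rate (hypothetical field)
    pitch: float  # relative pitch (hypothetical field)


# Hypothetical lookup tables; a real system would presumably use trained models.
THEME_KEYWORDS = {"crime": {"detective", "murder", "alibi"},
                  "love": {"kiss", "heart", "wedding"}}
THEME_TO_GENRE = {"crime": "mystery", "love": "romance"}
GENRE_PROSODY = {"mystery": Prosody(0.9, 0.95), "romance": Prosody(1.0, 1.05)}
NEUTRAL = Prosody(1.0, 1.0)


def determine_themes(text):
    """Return every theme whose keywords appear in the text."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return [theme for theme, keys in THEME_KEYWORDS.items() if words & keys]


def determine_genre(themes):
    """Map themes to a genre label that is distinct from the themes."""
    return next((THEME_TO_GENRE[t] for t in themes if t in THEME_TO_GENRE),
                "general")


def similarity(a, b):
    """Word-level similarity in [0, 1] between two strings."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


def narrate(first_text, second_text, interpret, synthesize, transcribe,
            threshold=0.8):
    # Themes -> genre -> genre-representative first prosody.
    genre = determine_genre(determine_themes(first_text))
    prosody = GENRE_PROSODY.get(genre, NEUTRAL)

    # Context comes from the text received earlier (second_text).
    context = determine_themes(second_text)
    meaning = interpret(first_text, context, attempt=1)

    # Hypothetical context-based adjustment of the first prosody.
    if "crime" in context:
        prosody = replace(prosody, rate=prosody.rate * 0.95)

    speech = synthesize(first_text, prosody, meaning)

    # Assumed round-trip check: transcribe the speech and compare with the input.
    if similarity(first_text, transcribe(speech)) < threshold:
        # Second prosody and second semantic meaning, then a second output.
        prosody = replace(prosody, rate=prosody.rate * 0.9)
        meaning = interpret(first_text, context, attempt=2)
        speech = synthesize(first_text, prosody, meaning)

    return genre, prosody, speech
```

In this sketch the three callables are passed in as parameters, so the retry branch can be exercised with simple fakes in place of an actual text-to-speech engine or speech recognizer.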