US 12,249,313 B2
Method and system for text-to-speech synthesis of streaming text
Michael Hassid, Jerusalem (IL); Sapir Caduri, Tel Aviv (IL); Nadav Bar, Raanana (IL); Danielle Cohen, Tel Aviv (IL); Benny Schlesinger, Ramat Hasharon (IL); and Michelle Tadmor Ramanovich, Tel Aviv (IL)
Assigned to Google LLC, Mountain View, CA (US)
Appl. No. 17/914,010
Filed by GOOGLE LLC, Mountain View, CA (US)
PCT Filed Oct. 27, 2020, PCT No. PCT/US2020/057529
§ 371(c)(1), (2) Date Sep. 23, 2022,
PCT Pub. No. WO2022/093192, PCT Pub. Date May 5, 2022.
Prior Publication US 2023/0335111 A1, Oct. 19, 2023
Int. Cl. G10L 13/08 (2013.01); G10L 13/00 (2006.01)
CPC G10L 13/08 (2013.01) [G10L 13/00 (2013.01); G10L 2013/083 (2013.01)]
17 Claims
1. A method comprising:
at a text-to-speech (TTS) system, receiving a real-time streaming text string having a starting point and an ending point;
at the TTS system, accumulating a first sub-string comprising a first portion of the text string received from an initial point to a first trigger point, wherein the initial point is no earlier than the starting point and is prior to the first trigger point, and the first trigger point is before the ending point;
at the TTS system, applying a punctuation model of the TTS system to the first sub-string to generate a pre-processed first sub-string comprising the first sub-string with added grammatical punctuation as determined by the punctuation model;
at the TTS system, applying TTS synthesis processing to at least the pre-processed first sub-string to generate first synthesized speech;
providing an audio playout signal of the first synthesized speech;
while applying TTS synthesis processing to the pre-processed first sub-string to generate the first synthesized speech, concurrently accumulating a second sub-string comprising a second portion of the text string received from the first trigger point to a second trigger point that is after the first trigger point and no further than the ending point;
applying the punctuation model to the second sub-string to generate a pre-processed second sub-string;
while providing the audio playout signal of the first synthesized speech, concurrently applying TTS synthesis processing to the pre-processed second sub-string to generate second synthesized speech; and
providing an audio playout signal of the second synthesized speech.
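
The overlap recited in claim 1 (accumulating the next sub-string while the previous one is synthesized, and synthesizing while earlier audio plays out) can be illustrated with a three-stage pipeline connected by queues. The following is only a minimal sketch under stated assumptions, not the patented implementation: the fixed word-count trigger rule, the `punctuate` and `synthesize` placeholders, and every function name are hypothetical stand-ins for the punctuation model, the TTS synthesis processing, and the audio playout described in the claim.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the streaming text


def accumulate(chunks, trigger_words, out_q):
    """Collect streamed text into sub-strings.

    Assumption: a trigger point is reached every `trigger_words` words.
    The claim does not specify how trigger points are chosen.
    """
    buf = []
    for chunk in chunks:
        for word in chunk.split():
            buf.append(word)
            if len(buf) == trigger_words:
                out_q.put(" ".join(buf))
                buf = []
    if buf:  # flush whatever remains at the ending point
        out_q.put(" ".join(buf))
    out_q.put(_DONE)


def punctuate(sub_string):
    """Placeholder punctuation model: capitalize and append a period."""
    return sub_string[0].upper() + sub_string[1:] + "."


def synthesize(text):
    """Placeholder TTS synthesis: returns a fake audio token."""
    return f"<audio:{text}>"


def synth_stage(in_q, out_q):
    """Punctuate and synthesize each sub-string as soon as it arrives."""
    while True:
        item = in_q.get()
        if item is _DONE:
            out_q.put(_DONE)
            return
        out_q.put(synthesize(punctuate(item)))


def playout_stage(in_q, played):
    """Placeholder playout: record each audio token in arrival order."""
    while True:
        item = in_q.get()
        if item is _DONE:
            return
        played.append(item)


def run_pipeline(chunks, trigger_words=3):
    """Run the three stages concurrently and return the played audio."""
    text_q, audio_q, played = queue.Queue(), queue.Queue(), []
    workers = [
        threading.Thread(target=accumulate, args=(chunks, trigger_words, text_q)),
        threading.Thread(target=synth_stage, args=(text_q, audio_q)),
        threading.Thread(target=playout_stage, args=(audio_q, played)),
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return played
```

Because each stage runs in its own thread and hands work over through a FIFO queue, the accumulator can already be filling the second sub-string while the first is being synthesized, and playout order still matches the order of the text. For example, `run_pipeline(["hello there how", "are you today my", "friend"])` yields three audio tokens, one per sub-string.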