US 11,935,515 B2
Generating a synthetic voice using neural networks
Claude Polonov, Lynbrook, NY (US)
Filed by Meca Holdings IP LLC, Lynbrook, NY (US)
Filed on Dec. 27, 2021, as Appl. No. 17/563,008.
Claims priority of provisional application 63/216,521, filed on Jun. 29, 2021.
Claims priority of provisional application 63/170,536, filed on Apr. 4, 2021.
Claims priority of provisional application 63/130,618, filed on Dec. 25, 2020.
Prior Publication US 2022/0208172 A1, Jun. 30, 2022
Int. Cl. G10L 13/06 (2013.01); G06N 3/045 (2023.01); G10L 13/02 (2013.01); G10L 15/02 (2006.01); G10L 15/04 (2013.01); G10L 21/10 (2013.01)
CPC G10L 13/02 (2013.01) [G06N 3/045 (2023.01); G10L 15/02 (2013.01); G10L 15/04 (2013.01); G10L 21/10 (2013.01); G10L 2015/025 (2013.01)] 20 Claims
OG exemplary drawing
 
1. A method of generating a synthetic voice, comprising the steps of:
a. capturing audio data and saving the audio data as a set of speech segments;
b. converting the set of speech segments into a common audio format;
c. cutting the speech segments into speech segments of uniform length;
d. grouping the speech segments such that the speech segments within each group of speech segments are derived from a common speaker;
e. identifying phonemes within each speech segment using a phoneme processor;
f. clipping each speech segment to isolate the separate phonemes within each speech segment, thereby forming phoneme segments;
g. grouping the phoneme segments such that each group of phoneme segments has a common phoneme type;
h. identifying pitch types within each group of phoneme segments using a pitch processor, with pitch types including high, medium, and low pitches;
i. grouping the phoneme segments such that each group of phoneme segments has a common phoneme-pitch type, thereby forming phoneme-pitch groups;
j. converting the phoneme segments into spectral Mel-Scale data segments using a Mel-Spectrogram to form a first set of spectral segments;
k. performing SPL analysis on each spectral segment in the first set of spectral segments to generate a first set of SPL tracks, with each SPL track of the first set of SPL tracks corresponding to a given spectral segment and identifying sound pressure, tone, and pause attributes of the given spectral segment;
l. grouping the first set of spectral segments such that every spectral segment within a group of spectral segments has common sound pressure, tone, and pause attributes;
m. inputting the first set of spectral segments and the first set of SPL tracks into a first neural vocoder;
n. comparing sound pressure, tone, pause, and phoneme attributes of the first set of spectral segments to standard sound pressure, tone, pause, and phoneme attribute ranges, with the standard attribute ranges selected for comparison determined by the phoneme-pitch group and the SPL tracks;
o. culling spectral segments with attributes that fall outside standard attribute ranges from the first set of spectral segments;
p. merging remaining spectral segments within each group of common phoneme-pitch type into a merged segment, with the merged segments being a second set of spectral segments;
q. performing SPL analysis on each merged segment to generate SPL tracks for the merged segments, with the SPL tracks generated for the merged segments being a second set of SPL tracks;
r. iteratively inputting the second set of spectral segments and the second set of SPL tracks into a first stream of the first neural vocoder and inputting the first set of spectral segments and the first set of SPL tracks into a second stream of the first neural vocoder;
s. during each iteration, comparing the first set of spectral segments to the second set of spectral segments, and then replacing spectral segments from the second set of spectral segments that are determined to have sound properties inferior to corresponding spectral segments from the first set of spectral segments with the corresponding spectral segments from the first set of spectral segments;
t. providing a second neural vocoder configured to predict subsequent phonemes, with the second neural vocoder comprising a plurality of layers including a first phoneme layer, a second phoneme layer, a phoneme weight layer, a first pitch layer, a second pitch layer, and a pitch weight layer;
i. with the first and second phoneme layers each comprising sets of nodes associated with distinct phonemes;
ii. with the first and second pitch layers each comprising sets of nodes associated with distinct pitches;
iii. with the phoneme weight layer comprising a first set of weight nodes, with each weight node of the first set of weight nodes associated with a likelihood that a node from the first phoneme layer will be succeeded by a node from the second phoneme layer;
iv. with the pitch weight layer comprising a second set of pitch weight nodes, with each weight node of the second set of pitch weight nodes associated with a likelihood that a node from the first pitch layer will be succeeded by a node from the second pitch layer;
v. with the second neural vocoder trained on a set of whole speech segments, with each whole speech segment of the whole speech segments having multiple phonemes.
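The grouping and culling limitations of claim 1 (steps g, i, n, and o) can be sketched as ordinary data-pipeline logic: segments are binned by their phoneme-pitch type, and any segment whose sound-pressure, tone, or pause attributes fall outside the standard ranges for its group is culled. The sketch below is illustrative only, not the patented implementation; all attribute names, units, and range values are hypothetical.

```python
from collections import defaultdict

def group_by_phoneme_pitch(segments):
    """Group segments so each group shares a common phoneme-pitch type (steps g, i)."""
    groups = defaultdict(list)
    for seg in segments:
        groups[(seg["phoneme"], seg["pitch"])].append(seg)
    return dict(groups)

def cull_out_of_range(group, standard_ranges):
    """Drop segments whose sound-pressure, tone, or pause attributes fall
    outside the standard ranges selected for this group (steps n, o)."""
    kept = []
    for seg in group:
        in_range = all(
            standard_ranges[attr][0] <= seg[attr] <= standard_ranges[attr][1]
            for attr in ("pressure", "tone", "pause")
        )
        if in_range:
            kept.append(seg)
    return kept

# Hypothetical segments: pressure in dB SPL, tone and pause normalized.
segments = [
    {"phoneme": "ae", "pitch": "high", "pressure": 62.0, "tone": 0.4, "pause": 0.05},
    {"phoneme": "ae", "pitch": "high", "pressure": 95.0, "tone": 0.4, "pause": 0.05},  # out of range
    {"phoneme": "ae", "pitch": "low",  "pressure": 60.0, "tone": 0.3, "pause": 0.10},
]
groups = group_by_phoneme_pitch(segments)
ranges = {"pressure": (50.0, 80.0), "tone": (0.0, 1.0), "pause": (0.0, 0.2)}
kept = cull_out_of_range(groups[("ae", "high")], ranges)
print(len(groups), len(kept))  # → 2 1
```

In the claim, the remaining segments of each group are then merged (step p) and re-analyzed; the culling shown here simply guarantees that only segments within the standard attribute ranges contribute to those merged segments.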
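The phoneme weight layer of step t associates each weight node with the likelihood that one phoneme is succeeded by another, learned from whole speech segments containing multiple phonemes (step t.v). A bigram transition table captures that succession-likelihood idea in miniature; it is a deliberate simplification of the claimed second neural vocoder, and every name below is hypothetical.

```python
from collections import defaultdict

def train_transition_weights(speech_segments):
    """Count phoneme bigrams across whole speech segments and normalize each
    row into succession probabilities (the role of the phoneme weight layer)."""
    counts = defaultdict(lambda: defaultdict(int))
    for phonemes in speech_segments:
        for cur, nxt in zip(phonemes, phonemes[1:]):
            counts[cur][nxt] += 1
    weights = {}
    for cur, successors in counts.items():
        total = sum(successors.values())
        weights[cur] = {nxt: n / total for nxt, n in successors.items()}
    return weights

def predict_next(weights, phoneme):
    """Return the most likely successor phoneme under the learned weights."""
    return max(weights[phoneme], key=weights[phoneme].get)

# Hypothetical training corpus of whole speech segments (phoneme lists).
corpus = [["h", "e", "l", "o"], ["h", "e", "l", "p"], ["h", "e", "d"]]
w = train_transition_weights(corpus)
print(predict_next(w, "e"))  # → l  ("l" follows "e" in 2 of 3 segments)
```

The claimed pitch weight layer (step t.iv) would work the same way over pitch types rather than phonemes, with the two weight layers jointly scoring candidate successor nodes.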