US 12,451,114 B2
Techniques for securely synthesizing speech with the natural voice of a speaker during a language-translated communication session
Jan Pavlovsky, Prague (CZ); Adam Czeisler, Seattle, WA (US); and Luis Carrasco, Seattle, WA (US)
Assigned to Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed by Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed on Dec. 8, 2022, as Appl. No. 18/077,934.
Claims priority of provisional application 63/429,736, filed on Dec. 2, 2022.
Prior Publication US 2024/0194177 A1, Jun. 13, 2024
Int. Cl. G10L 13/02 (2013.01); G06F 40/40 (2020.01); G10L 13/08 (2013.01)

CPC G10L 13/02 (2013.01) [G06F 40/40 (2020.01); G10L 13/086 (2013.01)]

20 Claims

1. A computer-implemented method for securely synthesizing speech in a natural voice of a speaker during a communication session, the method comprising:

during the communication session:

generating a first voice profile for the speaker by (i) obtaining a first sample of audio data from a stream of audio data received from a device of the speaker during a first time interval, the stream of audio data representing speech from the speaker, (ii) generating from the sample of audio data the first voice profile for the speaker, and (iii) writing the voice profile of the speaker to a memory storage device;

processing the stream of audio data received from the device of the speaker during the first time interval by generating a first portion of a stream of audio data for communicating to a participant of the communication session, the first portion of the stream of audio data for communicating to the participant representing synthesized speech using the first voice profile of the speaker as stored in the memory storage device;

communicating the first portion of the stream of audio data representing the synthesized speech to a device of the participant for play back;

after the first time interval, generating a second voice profile for the speaker by (i) obtaining a second sample of audio data from the stream of audio data received from the device of the speaker during a second time interval, the stream of audio data representing speech from the speaker, (ii) generating from the second sample of audio data the second voice profile for the speaker, and (iii) writing the second voice profile of the speaker to the memory storage device such that it replaces the first voice profile;

processing the stream of audio data received from the device of the speaker during the second time interval by generating a second portion of the stream of audio data for communicating to the participant of the communication session, the second portion of the stream of audio data for communicating to the participant representing synthesized speech using the second voice profile of the speaker as stored in the memory storage device;

communicating the second portion of the stream of audio data representing the synthesized speech to the device of the participant for play back; and

iteratively repeating the generating of voice profiles, processing of the stream of audio data, and communicating of portions of the stream of audio data for successive time intervals until conclusion of the communication session.
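
As a reading aid only, the per-interval loop recited in claim 1 might be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the names `VoiceProfile`, `ProfileStore`, `make_profile`, `synthesize`, and `run_session` are hypothetical, the profile generation and speech synthesis steps are placeholders, and each interval's sample is assumed to be that interval's whole audio chunk.

```python
from typing import Callable, Iterable, Optional


class VoiceProfile:
    """Hypothetical stand-in for a speaker's voice profile."""

    def __init__(self, interval: int, features: bytes) -> None:
        self.interval = interval  # which time interval produced this profile
        self.features = features  # placeholder for real voice features


class ProfileStore:
    """Single-slot store: every write replaces the previously stored profile."""

    def __init__(self) -> None:
        self._current: Optional[VoiceProfile] = None
        self.writes = 0

    def write(self, profile: VoiceProfile) -> None:
        self._current = profile  # the earlier profile is overwritten, not kept
        self.writes += 1

    def read(self) -> VoiceProfile:
        if self._current is None:
            raise LookupError("no voice profile has been written yet")
        return self._current


def make_profile(interval: int, sample: bytes) -> VoiceProfile:
    # Placeholder: a real system would derive voice features from the sample.
    return VoiceProfile(interval, sample[:4])


def synthesize(audio: bytes, profile: VoiceProfile) -> str:
    # Placeholder: a real system would emit synthesized audio, not a string.
    return f"synth[profile={profile.interval}]({audio.decode()})"


def run_session(intervals: Iterable[bytes], send: Callable[[str], None]) -> ProfileStore:
    """Repeat generate / process / communicate for each successive interval."""
    store = ProfileStore()
    for n, audio in enumerate(intervals, start=1):
        sample = audio  # assumption: the sample is the interval's own audio
        store.write(make_profile(n, sample))   # replaces the prior profile
        send(synthesize(audio, store.read()))  # uses the profile as stored
    return store
```

For example, calling `run_session([b"hello", b"world"], sent.append)` with an empty list `sent` leaves two synthesized portions in `sent`, the second produced with the profile from the second interval, and only that latest profile remains in the store.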