US 12,469,513 B2
System and method for replicating background acoustic properties using neural networks
Dushyant Sharma, Tracy, CA (US); James Wellford Fosburgh, Syracuse, NY (US); and Patrick Aubrey Naylor, Reading (GB)
Assigned to Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed by Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed on Dec. 6, 2022, as Appl. No. 18/075,573.
Prior Publication US 2024/0185875 A1, Jun. 6, 2024
Int. Cl. G10L 21/0232 (2013.01); G10L 21/0208 (2013.01); G10L 21/0264 (2013.01); G10L 21/034 (2013.01); G10L 25/18 (2013.01); G10L 25/30 (2013.01)
CPC G10L 21/0232 (2013.01) [G10L 21/0264 (2013.01); G10L 21/034 (2013.01); G10L 25/18 (2013.01); G10L 25/30 (2013.01); G10L 2021/02082 (2013.01)]
20 Claims
 
1. A computer-implemented method, executed on a computing device, comprising:
receiving a target audio signal segment recorded in a target acoustic environment, wherein a speech processing system is deployed in the target acoustic environment;
receiving an input audio signal segment generated by a text-to-speech (TTS) system, wherein background acoustic properties of the input audio signal segment mismatch background acoustic properties of the target audio signal segment;
estimating noise spectrum from the target audio signal segment;
generating an acoustic neural embedding from the target audio signal segment;
estimating loss associated with processing the target audio signal segment with the speech processing system;
generating an augmented audio signal segment with background acoustic properties matching the background acoustic properties of the target audio signal segment by processing the input audio signal segment to add noise and reverberation in accordance with the noise spectrum, the acoustic neural embedding, and the estimated loss associated with processing the target audio signal segment with the speech processing system, wherein a loss associated with processing the augmented audio signal segment with the speech processing system is within a threshold difference of the estimated loss associated with processing the target audio signal segment with the speech processing system; and
training the speech processing system based on training data that includes the augmented audio signal segment.
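
Illustrative sketch (not part of the claim): the signal-processing portion of the steps recited above can be approximated as follows, under stated assumptions. The claim does not specify how the noise spectrum is estimated, how noise and reverberation are added, or how the acoustic neural embedding and the loss are computed. This sketch substitutes a low-quantile noise-floor estimate, spectrally shaped Gaussian noise, and a synthetic exponentially decaying impulse response. It omits the neural embedding and the speech processing system entirely and takes the two loss values as plain numbers. All function names and the parameters `quantile`, `rt60`, and `snr_db` are hypothetical.

```python
import numpy as np

def estimate_noise_spectrum(signal, frame_len=512, hop=256, quantile=0.1):
    """Crude noise-floor estimate: per-bin low quantile of frame power spectra."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.quantile(power, quantile, axis=0)

def synthesize_noise(noise_psd, n_samples, rng):
    """Shape white Gaussian noise so its magnitude follows sqrt(noise_psd)."""
    spectrum = np.fft.rfft(rng.standard_normal(n_samples))
    # Interpolate the short-frame PSD onto the bins of the full-length FFT.
    bins_full = np.linspace(0.0, 1.0, len(spectrum))
    bins_psd = np.linspace(0.0, 1.0, len(noise_psd))
    shape = np.sqrt(np.interp(bins_full, bins_psd, noise_psd))
    return np.fft.irfft(spectrum * shape, n=n_samples)

def add_reverb(signal, rt60, sample_rate, rng):
    """Convolve with a synthetic impulse response (assumption, not a measured one)."""
    n = int(rt60 * sample_rate)
    t = np.arange(n) / sample_rate
    # Amplitude falls by 60 dB (factor 1000) at t = rt60: exp(-ln(1000) * t / rt60).
    rir = rng.standard_normal(n) * np.exp(-np.log(1000.0) * t / rt60)
    rir[0] = 1.0                      # direct path
    rir /= np.linalg.norm(rir)
    return np.convolve(signal, rir)[:len(signal)]

def augment(clean, noise_psd, rt60, snr_db, sample_rate, rng):
    """Add reverberation, then shaped noise scaled to a hypothetical target SNR."""
    reverberant = add_reverb(clean, rt60, sample_rate, rng)
    noise = synthesize_noise(noise_psd, len(clean), rng)
    gain = np.sqrt(np.mean(reverberant ** 2)
                   / (np.mean(noise ** 2) * 10.0 ** (snr_db / 10.0)))
    return reverberant + gain * noise

def within_threshold(loss_augmented, loss_target, threshold):
    """Acceptance test: the two losses differ by no more than the threshold."""
    return abs(loss_augmented - loss_target) <= threshold
```

In use, one would call `estimate_noise_spectrum` on the target recording, pass the result to `augment` together with the TTS-generated segment, and keep the augmented segment for training only if `within_threshold` holds for losses obtained from the speech processing system. How those losses are obtained is outside this sketch.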