US 12,283,289 B2
Separating and rendering voice and ambience signals by offsetting impact of device movements
Jonathan D. Sheaffer, Santa Clara, CA (US); Joshua D. Atkins, Los Angeles, CA (US); Mehrez Souden, Los Angeles, CA (US); Symeon Delikaris Manias, Los Angeles, CA (US); and Sean A. Ramprashad, Los Altos, CA (US)
Assigned to Apple Inc., Cupertino, CA (US)
Filed by Apple Inc., Cupertino, CA (US)
Filed on Oct. 29, 2021, as Appl. No. 17/514,694.
Application 17/514,694 is a continuation of application No. PCT/US2020/032273, filed on May 9, 2020.
Claims priority of provisional application 62/848,368, filed on May 15, 2019.
Prior Publication US 2022/0059123 A1, Feb. 24, 2022
Int. Cl. G10L 25/78 (2013.01); G06T 7/246 (2017.01); G10L 21/0272 (2013.01); H04R 3/00 (2006.01); H04R 25/00 (2006.01); H04S 1/00 (2006.01)
CPC G10L 25/78 (2013.01) [G06T 7/248 (2017.01); G10L 21/0272 (2013.01); H04R 3/005 (2013.01)] 16 Claims
1. A method performed by a processor of a device having a plurality of microphones, comprising:
receiving a plurality of audio signals from the plurality of microphones, the plurality of microphones capturing a sound field;
processing the audio signals into a plurality of frequency domain signals;
extracting, from the frequency domain signals, a primary speech signal;
extracting, from the frequency domain signals, one or more ambience audio signals;
generating one or more spatial parameters defining spatial characteristics of an ambience sound in the one or more ambience audio signals, the one or more spatial parameters include a location of an ambience sound source;
detecting a location or an orientation of the device, as tracking data;
modifying the one or more spatial parameters based on the tracking data by offsetting a relative movement of the ambience sound source, the relative movement caused by a change in the location or orientation of the device, to maintain a location of the ambience sound source constant during playback; and
encoding the primary speech signal, the one or more ambience audio signals, and the as modified spatial parameters into one or more encoded data streams.
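Claim 1 requires the microphone signals to be processed into frequency domain signals but does not name a transform. One common choice for multichannel audio is the short-time Fourier transform (STFT). The following is a minimal pure-Python sketch of that step only. It assumes a periodic Hann window, a 64-sample frame and a 32-sample hop. The function names (hann, dft, stft, to_frequency_domain) are hypothetical and chosen for illustration. They are not taken from the patent.

```python
import cmath
import math
from typing import List


def hann(n: int) -> List[float]:
    """Periodic Hann window of length n (assumed window choice)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def dft(frame: List[float]) -> List[complex]:
    """Naive DFT of a real frame, returning only bins 0..n//2 (non-negative frequencies)."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n // 2 + 1)
    ]


def stft(signal: List[float], frame_len: int, hop: int) -> List[List[complex]]:
    """Short-time Fourier transform: one list of frequency bins per windowed frame."""
    win = hann(frame_len)
    frames = []
    # Only full frames are analysed; trailing samples shorter than a frame are dropped.
    for start in range(0, len(signal) - frame_len + 1, hop):
        segment = [signal[start + i] * win[i] for i in range(frame_len)]
        frames.append(dft(segment))
    return frames


def to_frequency_domain(
    mic_signals: List[List[float]], frame_len: int = 64, hop: int = 32
) -> List[List[List[complex]]]:
    """Hypothetical helper: apply the STFT to each microphone channel independently.

    Result is indexed as [microphone][frame][frequency bin].
    """
    return [stft(sig, frame_len, hop) for sig in mic_signals]
```

For example, two 256-sample microphone channels analysed with frame_len=64 and hop=32 yield 7 frames of 33 bins per channel. The speech and ambience extraction steps of the claim would then operate on these per-bin values. A production system would use an FFT rather than this O(n²) DFT, which is kept here only for readability.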
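The step that modifies the spatial parameters based on the tracking data can be pictured as a change of coordinates. The claim does not fix a representation, so the following sketch rests on stated assumptions. It assumes the ambience source location is a 3-D point in the device's own frame. It also assumes the tracking data gives the device position and yaw/pitch/roll (Z-Y-X convention, radians) in a fixed world frame. Mapping the observed location through the device pose cancels the apparent source motion caused by the device moving, so the encoded location stays constant. DevicePose, offset_device_motion and world_to_device are hypothetical names for illustration only.

```python
import math
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Tuple[Vec3, Vec3, Vec3]


@dataclass
class DevicePose:
    """Hypothetical tracking data: device position (m) and orientation (rad) in a world frame."""
    position: Vec3
    yaw: float
    pitch: float = 0.0
    roll: float = 0.0


def rotation_matrix(yaw: float, pitch: float, roll: float) -> Mat3:
    """Device-to-world rotation R = Rz(yaw) @ Ry(pitch) @ Rx(roll)."""
    cy, sy = math.cos(yaw), math.sin(yaw)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cr, sr = math.cos(roll), math.sin(roll)
    return (
        (cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr),
        (sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr),
        (-sp, cp * sr, cp * cr),
    )


def offset_device_motion(source_in_device: Vec3, pose: DevicePose) -> Vec3:
    """Re-express a source location from the device frame in the world frame.

    world = R @ device + position, which removes the apparent source movement
    produced when the device itself translates or rotates.
    """
    r = rotation_matrix(pose.yaw, pose.pitch, pose.roll)
    rotated = [sum(r[i][j] * source_in_device[j] for j in range(3)) for i in range(3)]
    return (
        rotated[0] + pose.position[0],
        rotated[1] + pose.position[1],
        rotated[2] + pose.position[2],
    )


def world_to_device(source_in_world: Vec3, pose: DevicePose) -> Vec3:
    """Inverse transform, used here only to simulate what the device observes.

    R is orthonormal, so its transpose is its inverse: device = R^T @ (world - position).
    """
    r = rotation_matrix(pose.yaw, pose.pitch, pose.roll)
    rel = [source_in_world[i] - pose.position[i] for i in range(3)]
    d = [sum(r[j][i] * rel[j] for j in range(3)) for i in range(3)]
    return (d[0], d[1], d[2])
```

For example, a source fixed at world location (2, 1, 0) appears at different device-frame locations as the device turns and moves. Applying offset_device_motion with each matching pose recovers the same world location every time. In a real system the pose would come from the device's own tracking rather than being known in advance as it is in this simulation.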