US 12,230,252 B2
Generation of interactive audio tracks from visual content
Matthew Sharifi, Mountain View, CA (US); and Victor Carbune, Mountain View, CA (US)
Assigned to GOOGLE LLC, Mountain View, CA (US)
Appl. No. 17/282,135
Filed by GOOGLE LLC, Mountain View, CA (US)
PCT Filed Jun. 9, 2020, PCT No. PCT/US2020/036749
§ 371(c)(1), (2) Date Apr. 1, 2021,
PCT Pub. No. WO2021/251953, PCT Pub. Date Dec. 16, 2021.
Prior Publication US 2022/0157300 A1, May 19, 2022
Int. Cl. G10L 15/08 (2006.01); G06F 3/16 (2006.01); G06V 20/64 (2022.01); G10L 15/06 (2013.01); G10L 15/18 (2013.01); G10L 15/22 (2006.01); G10L 15/26 (2006.01)
CPC G10L 15/083 (2013.01) [G06F 3/167 (2013.01); G06V 20/64 (2022.01); G10L 15/063 (2013.01); G10L 15/1822 (2013.01); G10L 15/22 (2013.01); G10L 15/26 (2013.01); G10L 2015/088 (2013.01)]
24 Claims
 
1. A system to transition between different modalities, comprising:
a data processing system comprising one or more processors to:
receive, via a network, data packets comprising an input audio signal detected by a microphone of a computing device remote from the data processing system;
parse the input audio signal to identify a request;
select, based on the request, a digital component object having a visual output format, the digital component object associated with metadata;
determine, based on a type of the computing device, to convert the digital component object into an audio output format;
generate, responsive to the determination to convert the digital component object into the audio output format, text for the digital component object;
select, based on context of the digital component object, a digital voice to render the text;
construct a baseline audio track of the digital component object with the text rendered by the digital voice;
generate, based on the digital component object, non-spoken audio cues;
combine the non-spoken audio cues with the baseline audio form of the digital component object to generate an audio track of the digital component object; and
provide, responsive to the request from the computing device, the audio track of the digital component object to the computing device for output via a speaker of the computing device.
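
The sequence of steps recited in claim 1 can be illustrated with a minimal Python sketch. This is not the patented implementation; it only mirrors the order of operations. Every name here (`DigitalComponent`, `AudioTrack`, `CATALOG`, `VOICES`, `CUES`, `AUDIO_ONLY_DEVICES`, `handle_request`) is hypothetical, a text transcript stands in for the parsed input audio signal, and labeled tuples stand in for real audio data.

```python
from dataclasses import dataclass, field


@dataclass
class DigitalComponent:
    """Hypothetical stand-in for a digital component object in a visual format."""
    title: str
    body: str
    metadata: dict = field(default_factory=dict)


@dataclass
class AudioTrack:
    """Toy audio track: ordered labeled segments instead of real samples."""
    segments: list


# Assumption: device types without a display trigger conversion to audio.
AUDIO_ONLY_DEVICES = {"smart_speaker", "earbuds"}

# Assumption: a tiny catalog keyed by request keyword.
CATALOG = {
    "weather": DigitalComponent("Weather", "Sunny, 22 degrees.", {"context": "news"}),
    "pizza": DigitalComponent("Pizza deal", "Two for one tonight.", {"context": "food"}),
}

# Assumption: context-to-voice and context-to-cue lookup tables.
VOICES = {"news": "neutral_voice", "food": "upbeat_voice"}
CUES = {"news": "chime", "food": "sizzle"}


def parse_request(transcript: str) -> str:
    """Identify the request as the first catalog keyword in the transcript."""
    for word in transcript.lower().split():
        word = word.strip(".,?!")
        if word in CATALOG:
            return word
    raise ValueError("no request identified")


def handle_request(transcript: str, device_type: str):
    """Follow the claimed order: select, decide, generate text, voice, cues, combine."""
    request = parse_request(transcript)
    component = CATALOG[request]                      # select digital component
    if device_type not in AUDIO_ONLY_DEVICES:         # decide based on device type
        return None                                   # keep the visual format
    text = f"{component.title}. {component.body}"     # generate text
    context = component.metadata.get("context", "news")
    voice = VOICES.get(context, "neutral_voice")      # select digital voice
    baseline = AudioTrack([("speech", voice, text)])  # baseline audio track
    cues = [("cue", CUES.get(context, "chime"))]      # non-spoken audio cues
    return AudioTrack(cues + baseline.segments)       # combined audio track


track = handle_request("What's the weather today?", "smart_speaker")
print(track.segments)
```

In this sketch a request from a device with a display returns `None`, meaning the component would be served in its original visual format; only audio-only device types go through text generation, voice selection, and cue mixing.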