| CPC G10L 15/083 (2013.01) [G06F 3/167 (2013.01); G06V 20/64 (2022.01); G10L 15/063 (2013.01); G10L 15/1822 (2013.01); G10L 15/22 (2013.01); G10L 15/26 (2013.01); G10L 2015/088 (2013.01)] | 24 Claims | 

| 
               1. A system to transition between different modalities, comprising: 
            a data processing system comprising one or more processors to: 
                receive, via a network, data packets comprising an input audio signal detected by a microphone of a computing device remote from the data processing system; 
                parse the input audio signal to identify a request; 
                select, based on the request, a digital component object having a visual output format, the digital component object associated with metadata; 
                determine, based on a type of the computing device, to convert the digital component object into an audio output format; 
                generate, responsive to the determination to convert the digital component object into the audio output format, text for the digital component object; 
                select, based on context of the digital component object, a digital voice to render the text; 
                construct a baseline audio track of the digital component object with the text rendered by the digital voice; 
                generate, based on the digital component object, non-spoken audio cues; 
                combine the non-spoken audio cues with the baseline audio track of the digital component object to generate an audio track of the digital component object; and 
                provide, responsive to the request from the computing device, the audio track of the digital component object to the computing device for output via a speaker of the computing device. 
               |
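
The conversion steps recited in claim 1 can be illustrated with a minimal sketch. It is not the claimed implementation: every name in it (`DigitalComponent`, `AUDIO_ONLY_DEVICE_TYPES`, `VOICE_BY_CATEGORY`, `CUE_BY_CATEGORY`, the `"category"` metadata key, and all functions) is a hypothetical chosen for illustration. Speech synthesis and audio mixing are replaced by plain data structures, and the network receipt, request parsing, and component selection steps are omitted.

```python
from dataclasses import dataclass, field


@dataclass
class DigitalComponent:
    """Hypothetical stand-in for a selected digital component object."""
    title: str
    body: str
    metadata: dict = field(default_factory=dict)


# Assumption: device types without a display receive audio output.
AUDIO_ONLY_DEVICE_TYPES = {"smart_speaker"}

# Assumption: illustrative lookups keyed on a "category" metadata entry.
VOICE_BY_CATEGORY = {"travel": "upbeat_voice", "news": "neutral_voice"}
CUE_BY_CATEGORY = {"travel": "ocean_waves"}


def should_convert_to_audio(device_type: str) -> bool:
    """Decide, from the device type alone, whether to convert to audio."""
    return device_type in AUDIO_ONLY_DEVICE_TYPES


def generate_text(component: DigitalComponent) -> str:
    """Produce the text that the digital voice will render."""
    return f"{component.title}. {component.body}"


def select_voice(component: DigitalComponent) -> str:
    """Pick a voice from the component's context, with a fallback."""
    category = component.metadata.get("category")
    return VOICE_BY_CATEGORY.get(category, "default_voice")


def build_baseline_track(text: str, voice: str) -> list:
    """Placeholder for text-to-speech: record what would be spoken."""
    return [{"kind": "speech", "voice": voice, "text": text}]


def generate_audio_cues(component: DigitalComponent) -> list:
    """Derive non-spoken cues from the component, if any apply."""
    cue = CUE_BY_CATEGORY.get(component.metadata.get("category"))
    return [{"kind": "cue", "sound": cue}] if cue else []


def combine(cues: list, baseline: list) -> list:
    """Placeholder for mixing: the cues are simply placed before the speech."""
    return cues + baseline


def to_audio_track(component: DigitalComponent, device_type: str):
    """Run the conversion steps; return None when no conversion is needed."""
    if not should_convert_to_audio(device_type):
        return None
    text = generate_text(component)
    voice = select_voice(component)
    baseline = build_baseline_track(text, voice)
    return combine(generate_audio_cues(component), baseline)


if __name__ == "__main__":
    component = DigitalComponent(
        "Beach getaway", "Flights on sale", {"category": "travel"}
    )
    print(to_audio_track(component, "smart_speaker"))
```

Keeping each recited step as its own function mirrors the structure of the claim, so a real speech synthesizer or mixer could replace a placeholder without touching the other steps.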