| CPC G10L 13/08 (2013.01) [G06F 40/205 (2020.01); G06F 40/279 (2020.01); G10L 13/00 (2013.01); G10L 13/033 (2013.01); H04M 1/72433 (2021.01); H04W 68/005 (2013.01); H04M 1/72442 (2021.01); H04M 2201/39 (2013.01)] | 18 Claims |

|
1. A method comprising:
receiving notification data during a display of a media asset by a media device, wherein the notification data is unrelated to the media asset;
in response to receiving the notification data during the display of the media asset on the media device:
determining that the media asset comprises a voice;
determining that the notification data comprises non-textual visual information;
converting the non-textual visual information to text;
converting the text to synthesized speech using a text-to-voice model generated based on characteristics of the voice; and
generating, for output by the media device, the synthesized speech by:
determining a position in the media asset for outputting the synthesized speech, based on one or more of contextual features of the media asset and the notification data; and
generating, for output at the position in the media asset by the media device, the synthesized speech.
|
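For illustration only, and not as part of the claim, the step of determining a position in the media asset for outputting the synthesized speech could be sketched as below. The sketch assumes one particular contextual feature, namely gaps between spoken segments of the media asset, and picks the earliest gap after the current playback time that is long enough to hold the synthesized speech. The claim does not specify this heuristic. The function name `find_insertion_position`, its parameters, and the segment representation are all hypothetical.

```python
from typing import List, Optional, Tuple

# Hypothetical representation: each tuple is a (start, end) time in seconds
# during which the media asset contains a voice.
VoiceSegment = Tuple[float, float]


def find_insertion_position(
    voice_segments: List[VoiceSegment],
    playback_time: float,
    speech_duration: float,
    asset_duration: float,
) -> Optional[float]:
    """Return the earliest time at or after playback_time where a voice-free
    gap of at least speech_duration seconds begins, or None if no such gap
    exists before the asset ends.

    Assumption: "contextual features" are reduced here to voice-free gaps.
    """
    cursor = playback_time
    for start, end in sorted(voice_segments):
        if end <= cursor:
            # Segment already finished before the candidate position.
            continue
        if start - cursor >= speech_duration:
            # The gap between cursor and the next voice segment is long enough.
            return cursor
        # Otherwise skip past this segment (also handles overlapping segments).
        cursor = max(cursor, end)
    # Check the tail of the asset after the last voice segment.
    if asset_duration - cursor >= speech_duration:
        return cursor
    return None


if __name__ == "__main__":
    segments = [(0.0, 5.0), (6.0, 12.0), (15.0, 20.0)]
    # A 2.5-second notification arriving at t = 2.0 s does not fit in the
    # 1-second gap at 5.0 s, so it is placed in the gap starting at 12.0 s.
    print(find_insertion_position(segments, 2.0, 2.5, 30.0))  # 12.0
```

A fuller system would presumably weigh further features, such as scene boundaries or the urgency of the notification data, before choosing among candidate positions; those are left out of this sketch.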