| CPC G06V 20/47 (2022.01) [G06F 16/7844 (2019.01); G06F 16/7867 (2019.01); G06F 40/205 (2020.01); G06N 3/045 (2023.01); G06N 3/08 (2013.01); G06V 20/44 (2022.01); G06V 30/153 (2022.01); G10L 13/08 (2013.01)] | 20 Claims |

|
1. A computer-implemented method comprising:
given an input text comprising a reference to an event in an activity,
parsing the input text, using a text parsing module, to identify the event referenced in the input text; and
using a text-to-speech (TTS) module to convert the input text into TTS-generated audio;
given an input video that comprises the event:
obtaining a time mapping by performing time anchoring to correlate runtime of the input video with runtime of the activity;
generating an initial video clip from the input video that includes the event by using timing information related to the activity and the time mapping obtained by time anchoring to identify an approximate time in the input video of when the event occurred;
extracting features from the initial video clip;
obtaining a final time value of the event in the initial video clip using the extracted features and one or more trained neural network models;
responsive to a runtime of the initial video clip being inconsistent with a runtime of the TTS-generated audio, generating a final video clip by editing the initial video clip to have a runtime consistent with the runtime of the TTS-generated audio; and
responsive to the runtime of the initial video clip being consistent with the runtime of the TTS-generated audio, using the initial video clip as the final video clip; and
combining the TTS-generated audio with the final video clip to generate an event highlight video.
|
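The timing steps recited in claim 1 (time anchoring, approximate event localization, and reconciling the clip runtime with the TTS audio runtime) can be illustrated with a minimal Python sketch. The claim does not say how any of these steps is implemented, so everything below is an assumption made for illustration only: the constant-offset time mapping, the padding values, the 0.5 s tolerance, the trim-or-slow-down editing policy, and all names (`Clip`, `anchor_offset`, `initial_clip`, `fit_to_audio`). The feature extraction and trained neural network models that produce the final time value are not modeled; that value is simply passed in as `event_time_s`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clip:
    start_s: float  # clip start, in input-video seconds
    end_s: float    # clip end, in input-video seconds

    @property
    def runtime_s(self) -> float:
        return self.end_s - self.start_s


def anchor_offset(video_time_s: float, activity_time_s: float) -> float:
    """Time anchoring, ASSUMING a constant offset between video and activity clocks.

    One matched pair (e.g. a video timestamp at which the activity clock
    reads a known value) fixes the mapping: video = activity + offset.
    """
    return video_time_s - activity_time_s


def initial_clip(event_activity_time_s: float, offset_s: float,
                 video_runtime_s: float,
                 pad_before_s: float = 20.0, pad_after_s: float = 10.0) -> Clip:
    """Cut a padded window around the approximate event time (padding is hypothetical)."""
    approx_s = event_activity_time_s + offset_s
    return Clip(max(0.0, approx_s - pad_before_s),
                min(video_runtime_s, approx_s + pad_after_s))


def fit_to_audio(clip: Clip, event_time_s: float, audio_runtime_s: float,
                 tolerance_s: float = 0.5,
                 lead_fraction: float = 0.75) -> tuple[Clip, float]:
    """Return (final clip, playback speed) whose played runtime matches the audio.

    ASSUMED policy: keep the clip if runtimes agree within tolerance_s;
    trim around the event if the clip is too long; slow playback if too short.
    """
    if abs(clip.runtime_s - audio_runtime_s) <= tolerance_s:
        return clip, 1.0  # consistent: initial clip is used as the final clip
    if clip.runtime_s > audio_runtime_s:
        # Place the event lead_fraction of the way into the trimmed window,
        # then clamp the window so it stays inside the initial clip.
        start_s = event_time_s - lead_fraction * audio_runtime_s
        start_s = max(clip.start_s, min(start_s, clip.end_s - audio_runtime_s))
        return Clip(start_s, start_s + audio_runtime_s), 1.0
    # Clip shorter than audio: a speed below 1.0 stretches playback to fit.
    return clip, clip.runtime_s / audio_runtime_s


if __name__ == "__main__":
    offset = anchor_offset(video_time_s=300.0, activity_time_s=0.0)
    clip = initial_clip(event_activity_time_s=1200.0, offset_s=offset,
                        video_runtime_s=7200.0)
    final, speed = fit_to_audio(clip, event_time_s=1502.0, audio_runtime_s=12.0)
    print(clip, final, speed)
```

In this sketch the TTS audio runtime is the fixed target and the video is edited to match it, mirroring the two "responsive to" branches of the claim. A real system might use other edits (for example, cutting at shot boundaries), which the claim language does not rule out.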