US 12,223,720 B2
Generating highlight video from video and text inputs
Xin Zhou, Mountain View, CA (US); Le Kang, Dublin, CA (US); Zhiyu Cheng, Sunnyvale, CA (US); Hao Tian, Cupertino, CA (US); Daming Lu, Dublin, CA (US); Dapeng Li, Los Altos, CA (US); Jingya Xun, San Jose, CA (US); Jeff Wang, San Jose, CA (US); Xi Chen, San Jose, CA (US); and Xing Li, Santa Clara, CA (US)
Assigned to Baidu USA, LLC, Sunnyvale, CA (US)
Filed by Baidu USA, LLC, Sunnyvale, CA (US)
Filed on Nov. 23, 2021, as Appl. No. 17/533,769.
Application 17/533,769 is a continuation-in-part of application No. 17/393,373, filed on Aug. 3, 2021, granted, now Pat. No. 11,769,327.
Claims priority of provisional application 63/124,832, filed on Dec. 13, 2020.
Prior Publication US 2022/0189173 A1, Jun. 16, 2022
Int. Cl. G06F 16/78 (2019.01); G06F 16/783 (2019.01); G06F 40/205 (2020.01); G06N 3/045 (2023.01); G06N 3/08 (2023.01); G06V 20/40 (2022.01); G06V 30/148 (2022.01); G10L 13/08 (2013.01); G06N 3/04 (2023.01)
CPC G06V 20/47 (2022.01) [G06F 16/7844 (2019.01); G06F 16/7867 (2019.01); G06F 40/205 (2020.01); G06N 3/045 (2023.01); G06N 3/08 (2013.01); G06V 20/44 (2022.01); G06V 30/153 (2022.01); G10L 13/08 (2013.01)] 20 Claims
OG exemplary drawing
 
1. A computer-implemented method comprising:
  given an input text comprising a reference to an event in an activity,
    parsing the input text, using a text parsing module, to identify the event referenced in the input text; and
    using a text-to-speech (TTS) module to convert the input text into TTS-generated audio;
  given an input video that comprises the event:
    obtaining a time mapping by performing time anchoring to correlate runtime of the input video with runtime of the activity;
    generating an initial video clip from the input video that includes the event by using timing information related to the activity and the time mapping obtained by time anchoring to identify an approximate time in the input video of when the event occurred;
    extracting features from the initial video clip;
    obtaining a final time value of the event in the initial video clip using the extracted features and one or more trained neural network models;
    responsive to a runtime of the initial video clip being inconsistent with a runtime of the TTS-generated audio, generating a final video clip by editing the initial video clip to have a runtime consistent with the runtime of the TTS-generated audio; and
    responsive to the runtime of the initial video clip being consistent with the runtime of the TTS-generated audio, using the initial video clip as the final video clip; and
  combining the TTS-generated audio with the final video clip to generate an event highlight video.
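
The text-parsing and TTS steps of claim 1 can be illustrated with a minimal sketch. The regex pattern, the event vocabulary, and the choice of pyttsx3 as the TTS engine are assumptions for illustration only; the claim requires a text parsing module and a TTS module without specifying either.

```python
import re
import pyttsx3  # stand-in TTS engine; the claim only requires "a TTS module"

# Hypothetical pattern: "<Player> scores ... in the <N>th minute".
EVENT_PATTERN = re.compile(
    r"(?P<player>[A-Z][a-z]+)\s+(?P<event>scores|saves|equalizes)\b"
    r".*?(?P<minute>\d{1,3})(?:st|nd|rd|th)\s+minute"
)

def parse_event(text):
    """Identify the event referenced in the input text (claimed text-parsing step)."""
    m = EVENT_PATTERN.search(text)
    if m is None:
        raise ValueError("no recognizable event reference in input text")
    return {"player": m["player"], "event": m["event"], "minute": int(m["minute"])}

def synthesize_narration(text, wav_path):
    """Convert the input text into TTS-generated audio (claimed TTS step)."""
    engine = pyttsx3.init()
    engine.save_to_file(text, wav_path)
    engine.runAndWait()

caption = "Smith scores in the 67th minute"
event = parse_event(caption)  # {'player': 'Smith', 'event': 'scores', 'minute': 67}
synthesize_narration(caption, "narration.wav")
```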
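The time-anchoring and initial-clip steps admit a simple reading: fit a mapping from activity (game-clock) time to video time using known anchor pairs, then cut a padded window around the event's approximate time. The affine fit, the anchor values, the padding, and the use of ffmpeg below are illustrative assumptions, not the patented method.

```python
import subprocess

def fit_time_mapping(anchors):
    """Return a game-clock-seconds -> video-seconds function fitted from
    (game_s, video_s) anchor pairs, e.g. read off an on-screen clock.

    An affine fit between the first and last anchors is an assumption here;
    the claim only requires correlating the two runtimes ("time anchoring").
    """
    (g0, v0), (g1, v1) = anchors[0], anchors[-1]
    slope = (v1 - v0) / (g1 - g0)
    return lambda game_s: v0 + slope * (game_s - g0)

def cut_initial_clip(video_path, game_s, mapping, out_path, pad_s=8.0):
    """Cut a padded window around the event's approximate video time with ffmpeg."""
    start = max(mapping(game_s) - pad_s, 0.0)
    subprocess.run(
        ["ffmpeg", "-y", "-ss", f"{start:.2f}", "-i", video_path,
         "-t", f"{2 * pad_s:.2f}", "-c", "copy", out_path],
        check=True,
    )

# Example: the game clock reads 0:00 at 120 s of video and 45:00 at 2820 s.
to_video = fit_time_mapping([(0.0, 120.0), (45 * 60.0, 2820.0)])
cut_initial_clip("match.mp4", 67 * 60.0, to_video, "initial_clip.mp4")
```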
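For the feature-extraction and event-time-refinement steps, one plausible realization scores each frame of the initial clip with a trained model and takes the highest-scoring frame as the final time value. The ResNet-18 backbone, the linear scoring head (EventTimeScorer), and the frame stride are hypothetical stand-ins for the claim's "extracted features" and "one or more trained neural network models".

```python
import torch
import torchvision
from torchvision.models import resnet18, ResNet18_Weights

def frame_features(clip_path):
    """Extract per-frame features from the initial clip with a pretrained backbone."""
    frames, _, info = torchvision.io.read_video(clip_path, pts_unit="sec")  # T,H,W,C uint8
    step = 5                                  # subsample frames to keep the sketch light
    frames = frames[::step]
    weights = ResNet18_Weights.DEFAULT
    backbone = resnet18(weights=weights)
    backbone.fc = torch.nn.Identity()         # keep the 512-d pooled features
    backbone.eval()
    with torch.no_grad():
        batch = weights.transforms()(frames.permute(0, 3, 1, 2))  # T,C,H,W
        feats = backbone(batch)               # T x 512
    return feats, info["video_fps"] / step    # effective fps after striding

class EventTimeScorer(torch.nn.Module):
    """Hypothetical trained head: one "the event happens here" logit per frame."""
    def __init__(self, dim=512):
        super().__init__()
        self.head = torch.nn.Linear(dim, 1)

    def forward(self, feats):
        return self.head(feats).squeeze(-1)

def refine_event_time(clip_path, scorer):
    """Obtain the final time value of the event within the clip, in seconds."""
    feats, eff_fps = frame_features(clip_path)
    with torch.no_grad():
        scores = scorer(feats)
    return scores.argmax().item() / eff_fps
```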
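Finally, the runtime-consistency check and the audio/video combination can be sketched with ffprobe and ffmpeg. The 0.25-second tolerance and trimming as the editing operation are assumptions; the claim covers any edit that makes the clip's runtime consistent with the TTS audio's runtime.

```python
import subprocess

def media_duration(path):
    """Read a media file's duration in seconds via ffprobe."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", path],
        check=True, capture_output=True, text=True,
    )
    return float(out.stdout.strip())

def make_highlight(clip, tts_wav, out, tolerance_s=0.25):
    """Edit the clip when its runtime disagrees with the narration, then mux."""
    clip_len, audio_len = media_duration(clip), media_duration(tts_wav)
    final_clip = clip
    if abs(clip_len - audio_len) > tolerance_s and clip_len > audio_len:
        # One simple edit strategy (assumed): trim the clip to the narration's
        # runtime. The claim does not prescribe a particular editing operation.
        final_clip = "final_clip.mp4"
        subprocess.run(["ffmpeg", "-y", "-i", clip, "-t", f"{audio_len:.2f}",
                        "-c", "copy", final_clip], check=True)
    # Combine the TTS-generated audio with the final video clip.
    subprocess.run(["ffmpeg", "-y", "-i", final_clip, "-i", tts_wav,
                    "-map", "0:v", "-map", "1:a", "-c:v", "copy", "-shortest", out],
                   check=True)

make_highlight("initial_clip.mp4", "narration.wav", "event_highlight.mp4")
```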