US 12,236,935 B2
Generating dubbed audio from a video-based source
Andrew R. Levine, New York, NY (US); Buddhika Kottahachchi, San Mateo, CA (US); Christopher Davie, Queens, NY (US); Kulumani Sriram, Danville, CA (US); Richard James Potts, Mountain View, CA (US); and Sasakthi S. Abeysinghe, Santa Clara, CA (US)
Assigned to GOOGLE LLC, Mountain View, CA (US)
Filed by GOOGLE LLC, Mountain View, CA (US)
Filed on Sep. 9, 2022, as Appl. No. 17/931,026.
Prior Publication US 2024/0087557 A1, Mar. 14, 2024
Int. Cl. G10L 13/00 (2006.01); G06F 40/58 (2020.01); G10L 13/02 (2013.01); G10L 13/08 (2013.01)
CPC G10L 13/02 (2013.01) [G06F 40/58 (2020.01); G10L 13/086 (2013.01)]
20 Claims
 
1. A method of dubbing a video, comprising:
receiving video data and corresponding audio data in a first language;
generating, based on the audio data and an original transcript in the first language, a translated preliminary transcript in a second language;
based on the video data in the first language, aligning timing windows of portions of the translated preliminary transcript with corresponding segments of the audio data in the first language to generate a translated aligned transcript;
based on the timing windows of the portions of the translated preliminary transcript and timing windows of the corresponding segments of the audio data in the first language, determining portions of the translated aligned transcript in the second language that exceed a timing window range of the corresponding segments of the audio data in the first language to generate flagged transcript portions;
based on the translated aligned transcript, generating a first speech dub in the second language and combining the first speech dub with the video data to generate a first dubbed video;
transmitting the original transcript, the translated aligned transcript, and the first speech dub to a first device, the generated flagged transcript portions included in the original transcript and the translated aligned transcript;
receiving, from the first device, a modified original transcript; and
generating, based on the modified original transcript, a second speech dub in the second language.
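
The flagging step recited in claim 1 (determining translated portions that exceed the timing window range of the corresponding source-language audio segments) can be illustrated with a minimal sketch. This is an illustration under stated assumptions and is not the claimed implementation. The claim does not say how the timing window of a translated portion is obtained, so the sketch assumes a simple characters-per-second estimate. The names Segment, estimated_duration, flag_overlong, chars_per_second and tolerance are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """One aligned unit: a source-audio timing window plus its translation."""
    start: float          # start of the source-language audio segment, in seconds
    end: float            # end of the source-language audio segment, in seconds
    translated_text: str  # translated transcript portion aligned to this window


def estimated_duration(text: str, chars_per_second: float = 15.0) -> float:
    """Assumed stand-in for a speech-duration model: length / speaking rate."""
    return len(text) / chars_per_second


def flag_overlong(segments: List[Segment],
                  tolerance: float = 0.1,
                  chars_per_second: float = 15.0) -> List[int]:
    """Return indices of segments whose translated portion is estimated to
    run longer than the source window, allowing a fractional tolerance."""
    flagged = []
    for index, seg in enumerate(segments):
        window = seg.end - seg.start
        needed = estimated_duration(seg.translated_text, chars_per_second)
        if needed > window * (1.0 + tolerance):
            flagged.append(index)
    return flagged


segments = [
    Segment(0.0, 2.0, "Hola"),     # short text, fits the 2-second window
    Segment(2.0, 3.0, "x" * 30),   # about 2 s of speech in a 1-second window
]
flagged_indices = flag_overlong(segments)
```

In this sketch the flagged indices would mark the transcript portions sent to the first device for review. A real system would presumably derive durations from synthesized speech or forced alignment rather than from character counts.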