US 12,283,291 B1
Factually consistent generative narrations
Noah Lirone Sarfati, Tel Aviv (IL); Ido Yerushalmy, Tel Aviv (IL); Michael Chertok, Raanana (IL); and Ianir Ideses, Raanana (IL)
Assigned to Amazon Technologies, Inc., Seattle, WA (US)
Filed by Amazon Technologies, Inc., Seattle, WA (US)
Filed on Aug. 16, 2023, as Appl. No. 18/450,695.
Int. Cl. G11B 27/036 (2006.01); G10L 15/26 (2006.01); H04N 21/81 (2011.01); H04N 21/8549 (2011.01)
CPC G11B 27/036 (2013.01) [G10L 15/26 (2013.01); H04N 21/8106 (2013.01); H04N 21/8549 (2013.01)] 20 Claims
 
1. A computer-implemented method, comprising:
determining a video stream of an event, wherein the video stream comprises a composite audio stream that includes audio commentary and background noise;
determining, based at least in part on the video stream and using a speech transcription service, transcribed commentary of the audio commentary;
obtaining a plurality of metadata messages for the event;
determining, based at least in part on the video stream and using a clip selection system, a plurality of highlight clips; and
determining narrated highlights for each of the plurality of highlight clips by at least:
determining a first time window for a highlight clip of the plurality of highlight clips;
determining one or more metadata messages from the plurality of metadata messages from the first time window;
determining a portion of the transcribed commentary from a second time window that includes the first time window;
determining a prompt based at least in part on the one or more metadata messages and the portion of the transcribed commentary;
determining a generative narration based at least in part on the prompt and using a generative model;
validating factual consistency of the generative narration based at least in part on the one or more metadata messages and the portion of the transcribed commentary;
extracting the background noise for the highlight clip; and
producing a narrated highlight by at least replacing audio of the highlight clip with the generative narration and the background noise.
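The per-clip narration steps of claim 1 can be illustrated with a minimal Python sketch. The sketch covers the first time window for metadata, the wider second window for commentary, prompt construction, generation, and a factual-consistency check. It is not the patented implementation, and several parts are assumptions.

- Every name here is hypothetical: `MetadataMessage`, `TranscriptSegment`, `narrate_clip`, and the `padding` and `max_attempts` parameters.
- The generative model is passed in as a plain callable.
- The second window is assumed to be the first window padded symmetrically on both sides.
- The consistency check is a naive token-overlap heuristic substituted for illustration. It flags numbers and mid-sentence capitalized words that appear in neither source.
- Retrying generation on failure is also an assumption. The claim only says the narration is validated.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class MetadataMessage:  # hypothetical shape of an event metadata message
    timestamp: float    # seconds from the start of the event
    text: str


@dataclass(frozen=True)
class TranscriptSegment:  # hypothetical output unit of a speech transcription service
    start: float          # seconds
    end: float
    text: str


def messages_in_window(messages: List[MetadataMessage], start: float, end: float) -> List[MetadataMessage]:
    """Metadata messages whose timestamp falls inside the clip's first time window."""
    return [m for m in messages if start <= m.timestamp <= end]


def commentary_in_window(segments: List[TranscriptSegment], start: float, end: float,
                         padding: float) -> List[TranscriptSegment]:
    """Segments overlapping a second window that contains the first (assumed symmetric padding)."""
    lo, hi = start - padding, end + padding
    return [s for s in segments if s.end >= lo and s.start <= hi]


def build_prompt(messages: List[MetadataMessage], commentary: List[TranscriptSegment]) -> str:
    """Combine the windowed metadata and commentary into one text prompt."""
    lines = ["Narrate this highlight in one or two sentences, using only these facts.",
             "Event metadata:"]
    lines += [f"- [{m.timestamp:.0f}s] {m.text}" for m in messages]
    lines.append("Commentary transcript:")
    lines += [f"- {s.text}" for s in commentary]
    return "\n".join(lines)


def unsupported_terms(narration: str, messages: List[MetadataMessage],
                      commentary: List[TranscriptSegment]) -> List[str]:
    """Naive heuristic: numbers and mid-sentence capitalized words must occur in the sources."""
    source = " ".join([m.text for m in messages] + [s.text for s in commentary]).lower()
    known = set(re.findall(r"[a-z]+|\d+", source))
    missing = []
    for sentence in re.split(r"(?<=[.!?])\s+", narration.strip()):
        words = re.findall(r"[A-Za-z]+|\d+", sentence)
        for i, w in enumerate(words):
            is_number = w.isdigit()
            is_name = i > 0 and w[0].isupper()  # skip sentence-initial capitals
            if (is_number or is_name) and w.lower() not in known:
                missing.append(w)
    return missing


def narrate_clip(start: float, end: float, messages: List[MetadataMessage],
                 segments: List[TranscriptSegment], generate: Callable[[str], str],
                 padding: float = 10.0, max_attempts: int = 3) -> Optional[str]:
    """Return the first generated narration that passes the check, or None."""
    msgs = messages_in_window(messages, start, end)
    comm = commentary_in_window(segments, start, end, padding)
    prompt = build_prompt(msgs, comm)
    for _ in range(max_attempts):  # retry-on-failure is an assumption, not from the claim
        narration = generate(prompt)
        if not unsupported_terms(narration, msgs, comm):
            return narration
    return None
```

A production system would more likely validate with an entailment or NLI model than with string overlap. The heuristic above only shows where validation sits in the per-clip loop. This sketch does not cover the audio steps of the claim, which are extracting background noise and replacing the clip audio.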