US 12,334,116 B2
Image diffusion framework for text-guided video editing
Nazmul Karim, Orlando, FL (US); Nazanin Rahnavard, Orlando, FL (US); Umar Khalid, Orlando, FL (US); and Chen Chen, Orlando, FL (US)
Assigned to University of Central Florida Research Foundation, Inc., Orlando, FL (US)
Filed by University of Central Florida Research Foundation, Inc., Orlando, FL (US)
Filed on Nov. 21, 2024, as Appl. No. 18/955,385.
Claims priority of provisional application 63/601,439, filed on Nov. 21, 2023.
Prior Publication US 2025/0166664 A1, May 22, 2025
Int. Cl. G11B 27/031 (2006.01); G06F 40/40 (2020.01)
CPC G11B 27/031 (2013.01) [G06F 40/40 (2020.01)] 30 Claims
OG exemplary drawing
 
1. A method for adapting a pre-trained text-to-image (T2I) diffusion model to edit video content based on a text prompt, the method comprising:
a. receiving, by at least one processor, a video comprising a plurality of frames and a text prompt specifying modifications to visual elements in the video;
b. performing spectral decomposition, by the at least one processor, on at least one weight matrix of the pre-trained T2I diffusion model to separate each such weight matrix into a set of singular values and corresponding singular vectors;
c. generating, by the at least one processor, a spectral shift parameter matrix by selectively adjusting only the singular values based on the text prompt, while maintaining the singular vectors unmodified;
d. applying, by the at least one processor, a spectral shift regularizer to the spectral shift parameter matrix, wherein the spectral shift regularizer imposes more restricted adjustments to singular values with larger magnitudes and allows comparatively relaxed adjustments to singular values with smaller magnitudes;
e. adapting, by the at least one processor, the pre-trained T2I diffusion model by incorporating the spectral shift parameter matrix, thereby creating an adapted model configured to modify specific visual elements within the video according to the text prompt; and
f. outputting, by the at least one processor, an edited video in which the visual elements specified by the text prompt are modified in the plurality of frames, while preserving non-targeted visual elements and maintaining temporal coherence across frames.
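The spectral-shift adaptation recited in steps b through e can be sketched in NumPy as follows. This is a minimal illustration, not the patented implementation: the function names, the non-negativity clamp on shifted singular values, and the magnitude-weighted quadratic penalty are assumptions chosen to match the claim language (shifting only singular values while leaving singular vectors unmodified, and restricting shifts more for larger singular values).

```python
import numpy as np

def adapt_weight_matrix(W, delta):
    """Spectral shift (steps b, c, e): decompose W by SVD, perturb only the
    singular values by delta, and rebuild W with the original singular
    vectors U and V unchanged."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    s_shifted = np.maximum(s + delta, 0.0)  # clamp: keep singular values non-negative (assumption)
    return U @ np.diag(s_shifted) @ Vt

def spectral_shift_regularizer(s, delta):
    """Hypothetical regularizer (step d): weight each squared shift by the
    magnitude of its singular value, so shifts to larger singular values
    incur a larger penalty than equal shifts to smaller ones."""
    return float(np.sum(s * delta ** 2))
```

Because U and Vt have orthonormal columns and rows respectively, the adapted matrix's singular values are exactly the shifted values, so the model's learned directions (singular vectors) are preserved while only their scales change under the text-prompt-driven fine-tuning.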