Hover over to view the input still images and text prompts.
|   time-lapse of a blooming flower on a stem |   a train traveling through a field of flowers and grasses | 
|---|---|
|   pouring honey onto some slices of bread |   a lighthouse with waving ocean | 
Hover over to view the input still images and text prompts.
|   bear playing guitar happily, snowing |   boy walking on the street |   cat dancing |   cowboy riding a bull over a fence |   zoom-in, a landscape, springtime | 
|---|---|---|---|---|
|   two people dancing |   explode colorful smoke coming out |   A blonde woman rides on top of a moving washing machine into the sunset. |   girl talking and blinking |   sailing ship in the ocean, waves are surging | 
We compare our method against existing methods using still images with a wide range of content (e.g., landscape, human, animal, vehicle, statue) and style (e.g., real-life, AI-generated, painting, clay, anime, isometric illustration).
 
| Text prompt | PikaLabs | Gen-2 | DynamiCrafter (Ours) | DynamiCrafterDCP (Ours) |
|---|---|---|---|---|
| "Man talking" |  |  |  |  |
| "Man waving hands" |  |  |  |  |
| "Man clapping" |  |  |  |  |

| Text prompt | PikaLabs | Gen-2 | DynamiCrafter (Ours) | DynamiCrafterDCP (Ours) |
|---|---|---|---|---|
| "Girl dancing" |  |  |  |  |
| "Girl waving hands" |  |  |  |  |
| "Girl twirling her hair" |  |  |  |  |
Storytelling with shots. We can use ChatGPT (powered by DALL·E 3) to create several shots of a story and then generate storytelling videos by animating these shots. [Image source]
| "A disheartened bear sat by the lake, hanging its head." | "He is meeting a girl and introducing himself." |
|---|---|
|  |  |
| "He chatted happily with that girl by the lake." | "Before leaving, the girl told him to be positive." |
|  |  |
Generative frame interpolation (@320×512 resolution).
| Input starting frame | Input ending frame | Generated video | 
|---|---|---|
|  |  | |
|  |  | |
|  |  | |
|  |  | |
Looping video generation (@320×512 resolution).
FPS control.
| "An anime scene with windmills standing tall in a field and blue sky" | FPS = 30 | FPS = 10 | FPS = 5 |
|---|---|---|---|
|  |  |  |  |

| "A boat moving on the sea" | FPS = 30 | FPS = 10 | FPS = 5 |
|---|---|---|---|
|  |  |  |  |
Multi-condition classifier-free guidance. Higher s_txt and s_img indicate a stronger influence of the text prompt and the image condition, respectively.
| "A statue of two men with wings are dancing" | s_txt = s_img = 7.5 | s_txt = 1.2, s_img = 7.5 | s_txt = 7.5, s_img = 1.2 |
|---|---|---|---|
|  |  |  |  | 
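
To make the role of the two guidance scales concrete, here is a minimal sketch of how separate text and image scales can be combined in classifier-free guidance. This page does not state the formula, so the combination below is an assumption: it is the commonly used two-scale form, in which the image scale pushes the prediction away from the unconditional one and the text scale pushes it further toward the text-and-image prediction. The function name `multi_cond_cfg` and the toy inputs are hypothetical; in practice the three inputs would be noise predictions from the denoising network.

```python
from typing import List, Sequence


def multi_cond_cfg(
    eps_uncond: Sequence[float],
    eps_img: Sequence[float],
    eps_img_txt: Sequence[float],
    s_img: float = 7.5,
    s_txt: float = 7.5,
) -> List[float]:
    """Combine three noise predictions using separate image and text scales.

    Assumed formulation (not confirmed by this page):
        eps = eps_uncond
              + s_img * (eps_img - eps_uncond)
              + s_txt * (eps_img_txt - eps_img)
    """
    if not (len(eps_uncond) == len(eps_img) == len(eps_img_txt)):
        raise ValueError("all predictions must have the same length")
    return [
        u + s_img * (i - u) + s_txt * (t - i)
        for u, i, t in zip(eps_uncond, eps_img, eps_img_txt)
    ]


if __name__ == "__main__":
    # Toy stand-ins for network outputs (hypothetical values).
    uncond = [0.0, 0.0]
    img_only = [1.0, -1.0]
    img_and_txt = [2.0, -3.0]

    # The three scale settings shown in the table above.
    for s_txt, s_img in [(7.5, 7.5), (1.2, 7.5), (7.5, 1.2)]:
        out = multi_cond_cfg(uncond, img_only, img_and_txt, s_img=s_img, s_txt=s_txt)
        print(f"s_txt={s_txt}, s_img={s_img}: {out}")
```

Under this assumed form, setting both scales to 1 returns the fully conditioned prediction unchanged, and lowering one scale weakens only the corresponding condition.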
Dual-stream image injection.
| "A camel in a zoo enclosure" | Ours | w/o ctx | w/o VDG | w/o λ | OursG | 
|---|---|---|---|---|---|
|  |  |  |  |  |  | 
Training paradigm. Visual comparisons of the context conditioning stream learned with one-stage training and with our two-stage adaption strategy.
| "A man hiking in the mountains with a backpack" | One-stage | Our adaption | 
|---|---|---|
|  |  |  | 
Training paradigm.
| "A girl with short blue and pink hair speaking" | Ours | Fine-tuning entire. | 1st frame condition |
|---|---|---|---|
|  |  |  |  | 
Challenging case in terms of image content understanding.
| "Moving clouds in an anime scene" | Output | 
|---|---|
|  |  | 
Inability to generate specific motions since the dataset lacks precise motion descriptions.
| "Girl rubbing her eyes" | Output | 
|---|---|
|  |  | 
Inheriting slight flickering artifacts and human face distortion issues from the pre-trained low-resolution T2V model.
| "Old man walking with his wife" | Output | 
|---|---|
|  |  |