Hover over to view the input still images and text prompts.
![]() time-lapse of a blooming flower on a stem
|
![]() a train traveling through a field of flowers and grasses
|
---|---|
![]() pouring honey onto some slices of bread
|
![]() a lighthouse with waving ocean
|
Hover over to view the input still images and text prompts.
![]() bear playing guitar happily, snowing
|
![]() boy walking on the street
|
![]() cat dancing
|
![]() cowboy riding a bull over a fence
|
![]() zoom-in, a landscape, springtime
|
---|---|---|---|---|
![]() two people dancing
|
![]() explode colorful smoke coming out
|
![]() A blonde woman rides on top of a moving washing machine into the sunset.
|
![]() girl talking and blinking
|
![]() sailing ship in the ocean, waves are surging
|
We compare our method against existing methods using still images with a wide range of content (e.g., landscape, human, animal, vehicle, statue) and style (e.g., real-life, AI-generated, painting, clay, anime, isometric illustration).
"Man talking" | PikaLabs | Gen-2 | DynamiCrafter (Ours) | DynamiCrafterDCP (Ours) |
---|---|---|---|---|
![]() |
||||
"Man waving hands" | ||||
![]() |
||||
"Man clapping" | ||||
![]() |
||||
"Girl dancing" | PikaLabs | Gen-2 | DynamiCrafter (Ours) | DynamiCrafterDCP (Ours) |
---|---|---|---|---|
![]() |
||||
"Girl waving hands" | ||||
![]() |
||||
"Girl twirling her hair" | ||||
![]() |
||||
Storytelling with shots. We can use ChatGPT (empowered by DALL·E 3) to create several shots of a story and then generate storytelling videos by animating these shots. [Image source]
"A disheartened bear sat by the lake, hanging its head." | "He is meeting a girl and introducing himself." | ||||||
---|---|---|---|---|---|---|---|
![]() |
![]() |
||||||
![]() |
"He chatted happily with that girl by the lake." | "Before leaving, the girl told him to be positive." | ![]() |
||||
![]() |
![]() |
||||||
Generative frame interpolation (@320×512 resolution).
Input starting frame | Input ending frame | Generated video |
---|---|---|
![]() |
![]() |
|
![]() |
![]() |
|
![]() |
![]() |
|
![]() |
![]() |
|
Looping video generation (@320×512 resolution).
FPS control.
"An anime scene with windmills standing tall in a field and blue sky" | FPS = 30 | FPS = 10 | FPS = 5 |
---|---|---|---|
![]() |
![]() |
![]() |
![]() |
"A boat moving on the sea" | FPS = 30 | FPS = 10 | FPS = 5 |
---|---|---|---|
![]() |
![]() |
![]() |
![]() |
Multi-condition classifier-free guidance. Higher s_txt and s_img indicate a stronger influence of the text prompt and the image condition, respectively.
"A statue of two men with wings are dancing" | s_txt=s_img=7.5 | s_txt=1.2, s_img=7.5 | s_txt=7.5, s_img=1.2 |
---|---|---|---|
![]() |
![]() |
![]() |
![]() |
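The effect of the two guidance scales compared above can be illustrated with a small sketch. It assumes one common way of combining an image condition and a text condition in classifier-free guidance: start from the unconditional noise prediction, add the image direction scaled by the image scale, then add the text direction scaled by the text scale. The function name `multi_cond_cfg`, its arguments, and the use of plain Python lists in place of model tensors are all hypothetical and chosen for illustration; this is not taken from the released code, and the exact formulation used by the method may differ.

```python
from typing import List, Sequence


def multi_cond_cfg(
    eps_uncond: Sequence[float],
    eps_img: Sequence[float],
    eps_img_txt: Sequence[float],
    s_img: float,
    s_txt: float,
) -> List[float]:
    """Combine three noise predictions using separate image and text scales.

    eps_uncond  : prediction with no conditions (hypothetical input)
    eps_img     : prediction conditioned on the image only
    eps_img_txt : prediction conditioned on both image and text
    """
    return [
        u + s_img * (i - u) + s_txt * (it - i)
        for u, i, it in zip(eps_uncond, eps_img, eps_img_txt)
    ]


if __name__ == "__main__":
    # Toy values standing in for per-element noise predictions.
    uncond = [0.0, 1.0]
    img = [1.0, 1.0]
    img_txt = [2.0, 3.0]
    print(multi_cond_cfg(uncond, img, img_txt, s_img=7.5, s_txt=7.5))
    print(multi_cond_cfg(uncond, img, img_txt, s_img=7.5, s_txt=1.2))
    print(multi_cond_cfg(uncond, img, img_txt, s_img=1.2, s_txt=7.5))
```

Under this assumed formulation, setting both scales to 1 returns the fully conditioned prediction unchanged, and raising one scale relative to the other shifts the result toward the corresponding condition.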
Dual-stream image injection.
"A camel in a zoo enclosure" | Ours | w/o ctx | w/o VDG | w/o λ | OursG |
---|---|---|---|---|---|
![]() |
![]() |
![]() |
![]() |
![]() |
![]() |
Training paradigm. Visual comparison of the context conditioning stream learned with one-stage training and with our two-stage adaptation strategy.
"A man hiking in the mountains with a backpack" | One-stage | Our adaptation |
---|---|---|
![]() |
![]() |
![]() |
Training paradigm.
"A girl with short blue and pink hair speaking" | Ours | Fine-tuning entire. | 1st frame condition |
---|---|---|---|
![]() |
![]() |
![]() |
![]() |
Challenging case in terms of image content understanding.
"Moving clouds in an anime scene" | Output |
---|---|
![]() |
![]() |
Inability to generate specific motions since the dataset lacks precise motion descriptions.
"Girl rubbing her eyes" | Output |
---|---|
![]() |
![]() |
Slight flickering artifacts and human face distortion inherited from the pre-trained low-resolution T2V model.
"Old man walking with his wife" | Output |
---|---|
![]() |
![]() |