Zeroscope text-to-video
A settings guide by Replicate

Zeroscope v2 is an open-source text-to-video model: give it a prompt and it generates a short video.

It’s a fine-tuned version of DAMO’s original text-to-video model, tuned by @cerspense, and it can make short 1024x576 videos without watermarks.


Compare videos

Try changing num_frames, guidance_scale and num_inference_steps to see what happens.


Recommended settings

Start with the 576w model

Begin by generating videos using the 576w model at 576x320 resolution. When you find a video that you like, upscale it to 1024x576 using the xl model.

You could create a 1024x576 video directly with the xl model, but it will tend to render duplicate objects and have low coherence. The two-step process gives better results (unless you’re intentionally trying to get weirder output).
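The two-step workflow can be sketched with Replicate’s Python client. The settings below follow this guide’s recommendations; the model identifier and the `width`/`height` parameter names in the comments are assumptions, so check the model’s page on Replicate for the exact names.

```python
# Step 1: draft cheaply with the 576w model at 576x320, using the
# recommended settings from this guide.
def draft_settings(prompt: str) -> dict:
    return {
        "prompt": prompt,
        "width": 576,            # assumed parameter name -- check the model page
        "height": 320,           # assumed parameter name -- check the model page
        "num_frames": 24,
        "fps": 24,
        "guidance_scale": 12.5,
        "num_inference_steps": 50,
    }

# import replicate  # pip install replicate; needs REPLICATE_API_TOKEN set
# draft = replicate.run("<owner>/zeroscope-v2-576w",  # model id is an assumption
#                       input=draft_settings("a bear dancing"))
# Step 2: when you like the draft, re-run the prompt through the xl
# model at 1024x576 to upscale it.
```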

Number of frames and FPS (frames-per-second)

Zeroscope v2 was trained on short 24-frame clips. Set your num_frames to 24 for the best results. You can try longer clips, but beyond 40 frames the video degrades badly.

For a smooth 1s video, set the fps to 24. Alternatively, use 12 or 8 fps for a jerkier 2s or 3s video, then use a video interpolator to fill in the missing frames and smooth the video out again.
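The frame-count/FPS trade-off is simple arithmetic: the same 24 frames played back at different rates give different clip lengths.

```python
def clip_duration_seconds(num_frames: int, fps: int) -> float:
    """Playback length of a clip: frames divided by frames-per-second."""
    return num_frames / fps

print(clip_duration_seconds(24, 24))  # 1.0 -- smooth 1s clip
print(clip_duration_seconds(24, 12))  # 2.0 -- jerkier 2s clip
print(clip_duration_seconds(24, 8))   # 3.0 -- jerkier still, 3s clip
```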

Guidance scale

This determines how closely the model follows your prompt. Too low and you’ll see a grayscale mess; too high and the video will look distorted, with color artifacts. The sweet spot is between 10 and 15, but try pushing higher if your prompt is being ignored. Try setting guidance_scale to 12.5 to start with.
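One way to find the sweet spot is to render the same prompt at a few guidance values and compare the clips. A minimal sketch; the `seed` parameter name (used so only the guidance varies) is an assumption:

```python
def guidance_sweep(prompt: str, values=(10.0, 12.5, 15.0)) -> list:
    """Build one input dict per guidance value so the resulting clips
    differ only in how strongly the model follows the prompt."""
    return [
        {"prompt": prompt, "guidance_scale": g, "seed": 42}  # `seed` name is an assumption
        for g in values
    ]

inputs = guidance_sweep("a bear dancing")
```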


Number of inference steps

This sets how many inference steps are used to generate your video. More steps give better quality and coherence, but take longer to generate. Fewer steps give a lower-quality video that generates quickly – a good setting for fast prompt experimentation. Going above 100 steps will not improve your video.

Try setting num_inference_steps to 50 to start with.
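A common pattern is a low-step preset for iterating on prompts and a higher-step preset for the final render. The 50 and 100 values come from this guide; the preview value of 25 is a suggestion, not a requirement.

```python
# Step-count presets: fewer steps for quick prompt experiments, more for
# the final render.
STEP_PRESETS = {
    "preview": 25,   # fast and rough -- an assumed value for trying out prompts
    "final": 50,     # the recommended starting point above
    "ceiling": 100,  # going higher than this won't improve the video
}

def steps_for(stage: str) -> int:
    """Look up the step count for a stage of the workflow."""
    return STEP_PRESETS[stage]
```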


When you have your 576x320 video, you can upscale it with the xl model.
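Upscaling re-runs the clip through the xl model at 1024x576. In this sketch, `init_video` and `init_weight` are assumed parameter names for passing the draft clip in and controlling how closely the result follows it; check the xl model’s inputs on Replicate before relying on them.

```python
def upscale_settings(prompt: str, draft_video_url: str) -> dict:
    """Step 2: upscale a 576x320 draft to 1024x576 with the xl model."""
    return {
        "prompt": prompt,               # reuse the prompt that made the draft
        "init_video": draft_video_url,  # ASSUMED name for the input clip
        "init_weight": 0.5,             # ASSUMED: how closely to follow the draft
        "width": 1024,
        "height": 576,
        "num_frames": 24,
        "fps": 24,
    }
```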


Try experimenting with the prompt too; you can get some weird results.

Interpolating video

Smooth out your video with frame interpolation.
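Interpolation inserts synthesized frames between each original pair, so an 8fps clip doubled becomes a much smoother 16fps clip of roughly the same length. This is a sketch of the arithmetic only; any frame-interpolation model can do the actual work.

```python
def after_interpolation(num_frames: int, fps: int, factor: int = 2):
    """Frame count and playback rate after inserting (factor - 1)
    synthesized frames between each consecutive pair of frames."""
    new_frames = num_frames + (num_frames - 1) * (factor - 1)
    new_fps = fps * factor  # keeps the clip duration roughly the same
    return new_frames, new_fps

print(after_interpolation(24, 8))  # (47, 16): still about 3s, twice as smooth
```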
