Wan v2.5 Text-to-Video Preview marked Alibaba's initial public release of the Wan 2.5 text-to-video model. Given only a free-form text prompt, it produces single-shot video clips up to 10 seconds long, with output available at 480p, 720p, or 1080p in three aspect ratios: landscape (16:9), portrait (9:16), and square (1:1).
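The output limits above can be checked on the client before a job is submitted. The following is a minimal sketch, not the provider's API: the `ClipRequest` class, its field names, and the 5-second default duration are hypothetical, and only the 10-second cap, the three resolutions, and the three aspect ratios come from the description above.

```python
from dataclasses import dataclass

# Limits taken from the description above. All names here are hypothetical
# and do not correspond to any official SDK or request schema.
MAX_DURATION_SECONDS = 10
RESOLUTIONS = ("480p", "720p", "1080p")
ASPECT_RATIOS = ("16:9", "9:16", "1:1")


@dataclass(frozen=True)
class ClipRequest:
    prompt: str
    duration_seconds: int = 5      # assumed default, not a documented value
    resolution: str = "480p"
    aspect_ratio: str = "16:9"

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the request is in range."""
        problems = []
        if not self.prompt.strip():
            problems.append("prompt must not be empty")
        if not 0 < self.duration_seconds <= MAX_DURATION_SECONDS:
            problems.append(
                f"duration must be above 0 and at most {MAX_DURATION_SECONDS} seconds"
            )
        if self.resolution not in RESOLUTIONS:
            problems.append(f"resolution must be one of {RESOLUTIONS}")
        if self.aspect_ratio not in ASPECT_RATIOS:
            problems.append(f"aspect ratio must be one of {ASPECT_RATIOS}")
        return problems
```

Calling `ClipRequest("a red kite over a beach", resolution="720p").validate()` returns an empty list, while an out-of-range request returns one message per violated limit, so the caller can report every problem at once instead of failing on the first.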
What set this release apart from many early text-to-video models was its integrated audio generation. Rather than rendering silent video and requiring a separate dubbing pass, the 2.5 pipeline synthesizes ambient sound, effects, and even prompted character dialogue with lip-sync, all within a single generation call. For workflows that need audio-visual output, this removes an entire post-processing step.
The preview designation means the model is intended primarily for evaluation and prototyping. Teams can use it to develop prompt strategies, validate resolution and aspect ratio choices, and estimate costs at the 480p tier before scaling up to the production-grade Wan 2.6 models.
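A rough budget for such an evaluation run can be worked out with simple arithmetic. The sketch below assumes billing proportional to generated seconds, which is an assumption rather than something stated here; the function name is hypothetical, and the rate is a placeholder to be replaced with the provider's published price for the chosen resolution tier.

```python
from decimal import Decimal


def estimate_cost(clips: int, seconds_per_clip: int, rate_per_second: Decimal) -> Decimal:
    """Estimate total spend, assuming cost scales linearly with generated seconds.

    rate_per_second is supplied by the caller; no real price is built in.
    """
    if clips < 0 or seconds_per_clip < 0 or rate_per_second < 0:
        raise ValueError("all inputs must be non-negative")
    return clips * seconds_per_clip * rate_per_second


# Placeholder rate for illustration only, NOT an actual price.
placeholder_rate = Decimal("0.05")
budget = estimate_cost(clips=200, seconds_per_clip=5, rate_per_second=placeholder_rate)
```

Using `Decimal` instead of floating point keeps currency totals exact, and passing the rate as an argument makes it easy to compare tiers by calling the function once per rate.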