Part of Alibaba's Wan 2.6 update, this model is the production-grade text-to-video tier of the Wan family. You provide a text prompt describing a scene, and the model returns a finished video clip of up to 15 seconds at 720p or 1080p, in any of five aspect ratios covering landscape, portrait, square, and broadcast formats.
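To make those limits concrete, here is a minimal Python sketch of a client-side check a caller might run before submitting a request. It is not the model's actual API: the `VideoRequest` class, its field names, and the default values are hypothetical and chosen only for illustration. Only the 15-second cap and the two resolutions come from the description above. The aspect ratio is passed through unvalidated because the five supported values are not enumerated here.

```python
from dataclasses import dataclass, asdict

# Limits taken from the description above.
MAX_DURATION_SECONDS = 15
SUPPORTED_RESOLUTIONS = {"720p", "1080p"}


@dataclass(frozen=True)
class VideoRequest:
    """Hypothetical request shape; field names are assumptions, not the real API."""

    prompt: str
    duration_seconds: int = 5    # assumed default, not documented
    resolution: str = "1080p"    # assumed default, not documented
    aspect_ratio: str = "16:9"   # not validated: supported ratios are not listed

    def __post_init__(self) -> None:
        # Reject requests that fall outside the documented limits.
        if not self.prompt.strip():
            raise ValueError("prompt must not be empty")
        if not 0 < self.duration_seconds <= MAX_DURATION_SECONDS:
            raise ValueError(
                f"duration must be between 1 and {MAX_DURATION_SECONDS} seconds"
            )
        if self.resolution not in SUPPORTED_RESOLUTIONS:
            raise ValueError(
                f"resolution must be one of {sorted(SUPPORTED_RESOLUTIONS)}"
            )

    def to_payload(self) -> dict:
        """Return the request as a plain dict, ready to serialize."""
        return asdict(self)
```

A call such as `VideoRequest(prompt="A chef plates a dessert", duration_seconds=10, resolution="720p").to_payload()` yields a plain dictionary, while a 16-second duration or an unsupported resolution raises `ValueError` before any request is sent.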
The headline feature is intelligent multi-shot storytelling. When a prompt describes a sequence of events spanning multiple locations or moments, the model introduces scene cuts and camera transitions on its own rather than forcing everything into a single continuous take. The visual identity of characters and objects stays consistent across these cuts, which means a product demo or short narrative can be generated as a cohesive piece rather than stitched together from separate clips in post-production.
On the rendering side, the 2.6 generation produces noticeably cleaner output than its predecessor. Frame-to-frame temporal consistency is improved, reducing the flicker artifacts that plagued earlier text-to-video models on detailed elements like text overlays, fine textures, and hair. Audio (ambient sound, effects, and music) is still generated in the same pass as the video, maintaining the integrated pipeline introduced in the 2.5 generation.