Wan v2.6 Text-to-Video
Wan v2.6 Text-to-Video is the production-grade text-to-video model in Alibaba's Wan series, generating cinematic clips up to 15 seconds with automatic multi-shot scene composition and native audio at resolutions up to 1080p.
import { experimental_generateVideo as generateVideo } from 'ai';

const result = await generateVideo({
  model: 'alibaba/wan-v2.6-t2v',
  prompt: 'A serene mountain lake at sunrise.',
});

About Wan v2.6 Text-to-Video
Part of Alibaba's Wan 2.6 update, this model is the production-grade text-to-video tier in that family. You provide a text prompt describing a scene, and the model returns a finished video clip, up to 15 seconds, at 720p or 1080p, in any of five aspect ratios covering landscape, portrait, square, and broadcast formats.
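The limits described above (clips up to 15 seconds, at 720p or 1080p) can be checked on the client before a request is sent. The following is a minimal sketch, not part of any SDK: `buildClipRequest`, `ClipRequest`, and the field names `durationSeconds` and `resolution` are hypothetical names chosen for illustration, and the real generation call may expect different option names. Aspect ratio is left out because the exact ratio values are not listed here.

```typescript
// Limits taken from the model description; all names below are hypothetical.
const MAX_DURATION_SECONDS = 15;
const RESOLUTIONS: readonly string[] = ['720p', '1080p'];

interface ClipRequest {
  prompt: string;
  durationSeconds: number;
  resolution: string;
}

// Validates the inputs against the documented limits and returns a
// normalized request object. Throws on any out-of-range value.
function buildClipRequest(
  prompt: string,
  durationSeconds: number,
  resolution: string,
): ClipRequest {
  const trimmed = prompt.trim();
  if (trimmed.length === 0) {
    throw new Error('prompt must not be empty');
  }
  // The comparison is written so that NaN is rejected as well.
  if (!(durationSeconds > 0 && durationSeconds <= MAX_DURATION_SECONDS)) {
    throw new RangeError(
      `duration must be between 0 and ${MAX_DURATION_SECONDS} seconds`,
    );
  }
  if (RESOLUTIONS.indexOf(resolution) === -1) {
    throw new RangeError(`resolution must be one of ${RESOLUTIONS.join(', ')}`);
  }
  return { prompt: trimmed, durationSeconds, resolution };
}
```

A caller might run `buildClipRequest('A serene mountain lake at sunrise.', 10, '1080p')` and map the returned fields onto whatever options the generation call accepts. Failing early on the client avoids spending a generation request on a value the model would reject.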
The headline feature is intelligent multi-shot storytelling. When a prompt describes a sequence of events spanning multiple locations or moments, the model introduces scene cuts and camera transitions on its own rather than forcing everything into a single continuous take. The visual identity of characters and objects stays consistent across these cuts, which means a product demo or short narrative can be generated as a cohesive piece rather than stitched together from separate clips in post-production.
On the rendering side, the 2.6 generation produces noticeably cleaner output than its predecessor. Frame-to-frame temporal consistency is improved, reducing the flicker artifacts that plagued earlier text-to-video models on detailed elements like text overlays, fine textures, and hair. Audio (ambient sound, effects, and music) continues to be generated in the same pass as the video, maintaining the integrated pipeline introduced in the 2.5 generation.