Wan v2.6 Text-to-Video

Wan v2.6 Text-to-Video is the production-grade text-to-video model in Alibaba's Wan series, generating cinematic clips up to 15 seconds with automatic multi-shot scene composition and native audio at resolutions up to 1080p.


index.ts

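// The video API is exported under an experimental name; alias it for readability.
import { experimental_generateVideo as generateVideo } from 'ai';

// Request a clip from a plain-text scene description.
const result = await generateVideo({
  model: 'alibaba/wan-v2.6-t2v',
  prompt: 'A serene mountain lake at sunrise.'
});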


About Wan v2.6 Text-to-Video

Part of Alibaba's Wan 2.6 update, this model is the production-grade text-to-video tier in that family. You provide a text prompt describing a scene, and the model returns a finished video clip, up to 15 seconds, at 720p or 1080p, in any of five aspect ratios covering landscape, portrait, square, and broadcast formats.
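The limits described above (clips up to 15 seconds, at 720p or 1080p) can be checked on the client before a request is sent. The following is a minimal sketch under stated assumptions: `ClipRequest`, `validateClipRequest`, and the field names are hypothetical illustrations and not part of the AI SDK or the Wan API. Aspect ratios are left out because the page does not list the five supported values.

```typescript
// Hypothetical request shape for illustration only; these field names are
// assumptions and do not mirror any real SDK parameters.
interface ClipRequest {
  prompt: string;
  durationSeconds: number;
  resolution: string;
}

// Limits taken from the description above.
const MAX_DURATION_SECONDS = 15;
const SUPPORTED_RESOLUTIONS = ['720p', '1080p'];

// Returns a list of human-readable problems; an empty list means the
// request stays within the documented limits.
function validateClipRequest(req: ClipRequest): string[] {
  const problems: string[] = [];

  if (req.prompt.trim().length === 0) {
    problems.push('prompt must not be empty');
  }
  if (!(req.durationSeconds > 0 && req.durationSeconds <= MAX_DURATION_SECONDS)) {
    problems.push('duration must be greater than 0 and at most ' + MAX_DURATION_SECONDS + ' seconds');
  }
  if (SUPPORTED_RESOLUTIONS.indexOf(req.resolution) === -1) {
    problems.push('resolution must be one of: ' + SUPPORTED_RESOLUTIONS.join(', '));
  }

  return problems;
}
```

Calling `validateClipRequest` with a 10-second 1080p request returns an empty array, while a 20-second request is reported as exceeding the 15-second limit.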

Intelligent multi-shot storytelling is the most distinctive new capability. When a prompt describes a sequence of events spanning multiple locations or moments, the model introduces scene cuts and camera transitions on its own rather than forcing everything into a single continuous take. The visual identity of characters and objects stays consistent across these cuts, so a product demo or short narrative can be generated as one cohesive piece rather than stitched together from separate clips in post-production.
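Because the model decides where to cut based on the sequence of events in the prompt, one way to experiment is to assemble the prompt from an ordered list of moments. The helper below is a sketch only: `buildMultiShotPrompt` is a hypothetical name, and the "Subject" and "Shot N" labels are an assumed convention for keeping the prompt readable, not a format the model is documented to require.

```typescript
// Hypothetical helper: joins an ordered list of shot descriptions into a
// single prompt string. The labelling scheme is an assumption made for
// illustration, not a documented prompt format.
function buildMultiShotPrompt(subject: string, shots: string[]): string {
  // Drop blank entries so numbering stays contiguous.
  const steps = shots
    .map((shot) => shot.trim())
    .filter((shot) => shot.length > 0);

  if (steps.length === 0) {
    throw new Error('at least one shot description is required');
  }

  // Restate the subject once so every shot refers to the same character or object.
  const lines = ['Subject: ' + subject.trim()];
  steps.forEach((shot, index) => {
    lines.push('Shot ' + (index + 1) + ': ' + shot);
  });

  return lines.join('\n');
}
```

The returned string can be passed as the `prompt` value in the request shown at the top of the page. Whether explicit shot labels improve results over a plain narrative description is something to verify empirically.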

On the rendering side, the 2.6 generation produces noticeably cleaner output than its predecessor. Frame-to-frame temporal consistency is improved, reducing the flicker artifacts that plagued earlier text-to-video models on detailed elements like text overlays, fine textures, and hair. Audio (ambient sound, effects, and music) is still generated in the same pass as the video, maintaining the integrated pipeline introduced in the 2.5 generation.
