Wan v2.6 Reference-to-Video
Wan v2.6 Reference-to-Video is Alibaba's quality-first reference-to-video model, extracting identity from short clips and rendering new scenes with high-fidelity appearance, voice, and motion preservation at up to 1080p.
import { experimental_generateVideo as generateVideo } from 'ai';

const result = await generateVideo({
  model: 'alibaba/wan-v2.6-r2v',
  prompt: 'A serene mountain lake at sunrise.',
});
Playground
Try out Wan v2.6 Reference-to-Video by Alibaba. Usage is billed to your team at API rates. Free users (those who haven't made a payment) get $5 of credits every 30 days.
Providers
Route requests across multiple providers. Copy a provider slug to set your preference. Visit the docs for more info. Using a provider means you agree to their terms, listed under Legal.
About Wan v2.6 Reference-to-Video
Alibaba positioned Wan v2.6 Reference-to-Video as China's first reference-to-video generation model. The core idea is simple: feed the model a short video clip of a person, and it extracts enough about their face, body, clothing, voice, and movement style to convincingly place them into an entirely new scene described by text.
What sets the standard R2V apart within the Wan lineup is its emphasis on reconstruction depth. The model devotes additional inference time to faithfully reproducing fine-grained identity signals: the specific way light falls across facial features, subtle mannerisms in how a subject moves, and the particular resonance of a voice. This level of fidelity matters for final-delivery video, where a client or audience will scrutinize whether the generated character truly matches the reference.
The reference pipeline accepts multiple characters from reference images or videos. On AI Gateway, pass reference URLs in order and name them character1, character2, and so on in the prompt (see the reference-to-video docs). You can mix images and videos within provider limits (up to five references total). Each video reference can be 2 to 30 seconds. Output duration is 2 to 10 seconds at 720p or 1080p, with five aspect ratio options covering landscape, portrait, square, and intermediate formats.
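The limits and naming convention above can be checked before a request is sent. The following is a minimal sketch, not part of the AI SDK or AI Gateway: the `Reference` and `R2VRequest` types and the `validateR2VRequest` and `characterTags` helpers are hypothetical names introduced for illustration, and the requirement of at least one reference is an assumption (the text only states the upper limit of five).

```typescript
// Hypothetical shapes for illustration; these are not AI SDK types.
type Reference = {
  url: string;
  kind: 'image' | 'video';
  durationSeconds?: number; // only meaningful for video references
};

interface R2VRequest {
  references: Reference[]; // order matters: index 0 is character1
  durationSeconds: number; // requested output length
  resolution: '720p' | '1080p'; // the two documented output resolutions
}

// Returns a list of human-readable problems; an empty list means the
// request fits the limits described in the text.
function validateR2VRequest(req: R2VRequest): string[] {
  const errors: string[] = [];

  // Upper bound of five comes from the text; the lower bound is an assumption.
  if (req.references.length < 1 || req.references.length > 5) {
    errors.push('expected between 1 and 5 references');
  }

  req.references.forEach((ref, i) => {
    if (ref.kind === 'video') {
      const d = ref.durationSeconds;
      if (d === undefined || d < 2 || d > 30) {
        errors.push(`character${i + 1}: video reference must be 2 to 30 seconds`);
      }
    }
  });

  if (req.durationSeconds < 2 || req.durationSeconds > 10) {
    errors.push('output duration must be 2 to 10 seconds');
  }

  return errors;
}

// Produces the prompt names in the same order as the reference URLs.
function characterTags(references: Reference[]): string[] {
  return references.map((_, i) => `character${i + 1}`);
}

// Usage: build a prompt that refers to each reference by its positional name.
const references: Reference[] = [
  { url: 'https://example.com/host.mp4', kind: 'video', durationSeconds: 12 },
  { url: 'https://example.com/guest.png', kind: 'image' },
];
const [host, guest] = characterTags(references);
const prompt = `${host} welcomes ${guest} onto a rooftop terrace at dusk.`;
```

Deriving the names from array position keeps the prompt and the reference order from drifting apart when references are added or reordered. How the reference URLs are attached to the actual request is covered in the reference-to-video docs and is deliberately left out of this sketch.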
What To Consider When Choosing a Provider
- Configuration: Because R2V allocates more compute to identity reconstruction than the Flash variant, generation times are longer. Budget wall-clock time accordingly when planning render queues for final-delivery assets.
- Zero Data Retention: AI Gateway does not currently support Zero Data Retention for this model. See the documentation for models that support ZDR.
- Authentication: AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.
When to Use Wan v2.6 Reference-to-Video
Best For
- Final-delivery marketing assets: Generated characters that must pass close visual inspection against the reference subject
- Brand campaign consistency: Voice and appearance preservation across a series of scenes rendered over time
- Virtual production pipelines: Directors who need confident identity transfer before approving a generated sequence
- Multi-subject compositions: Scenes where several reference identities appear together in one generated clip
Consider Alternatives When
- Speed over peak fidelity: Wan-v2.6-r2v-flash provides faster turnaround during early creative exploration
- Photograph source material: Wan-v2.6-i2v animates from still images when the source is a photograph rather than a video clip
- Text-only video generation: Wan-v2.6-t2v is the appropriate model when no reference subject is involved
Conclusion
Wan v2.6 Reference-to-Video is purpose-built for cases where identity fidelity matters more than generation speed. It trades wall-clock time for meticulous reconstruction of appearance, voice, and motion from reference material, making it the right choice when the output needs to survive close comparison to the original subject.
Frequently Asked Questions
Why does R2V take longer to generate than the Flash variant?
The standard R2V model spends additional compute on identity reconstruction: fine-grained facial detail, voice matching, and movement pattern extraction all receive more processing time. This is a deliberate design choice favoring output quality over speed.
What happens if my reference clip is very short, like 2 seconds?
The model can work with clips as short as 2 seconds, but shorter references provide less identity data. For the highest fidelity, longer clips within the 2 to 30 second range give the extraction pipeline more material to work with, particularly for voice and movement characteristics.
How should I tag multiple subjects in a single prompt?
Use character1, character2, and so on in the prompt text, in the same order as your reference URLs. Each name maps to the corresponding reference image or video.
Does R2V preserve clothing and accessories from the reference?
Yes. The identity extraction covers visual appearance broadly, including clothing, accessories, hairstyle, and body proportions, not just facial features. The generated output aims to maintain the full visual signature of the reference subject.
What aspect ratios work best for social media delivery?
For vertical social content, 9:16 is the standard choice. For feed posts, 1:1 provides square framing. The model also supports 16:9, 4:3, and 3:4 for other distribution contexts.
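The guidance above can be captured as a small lookup. This is a sketch only: the `AspectRatio` type lists the five ratios named in the answer, while `PLACEMENT_RATIO` and `orientationOf` are hypothetical helpers, not part of any SDK.

```typescript
// The five aspect ratios named above.
type AspectRatio = '16:9' | '9:16' | '1:1' | '4:3' | '3:4';
type Orientation = 'landscape' | 'portrait' | 'square';

// Placement-to-ratio choices stated in the text (hypothetical key names).
const PLACEMENT_RATIO: Record<'verticalSocial' | 'feedPost', AspectRatio> = {
  verticalSocial: '9:16',
  feedPost: '1:1',
};

// Classifies a ratio by comparing its width and height components.
function orientationOf(ratio: AspectRatio): Orientation {
  const [width, height] = ratio.split(':').map(Number);
  if (width === height) return 'square';
  return width > height ? 'landscape' : 'portrait';
}
```

Keeping the ratios in a union type means a typo such as '19:6' is caught at compile time rather than at request time.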
Can R2V be used for product or object identity transfer, or only people?
The reference extraction is designed broadly enough to capture objects and animals in addition to people, though the pipeline is most heavily optimized for human subjects where facial and vocal identity are the primary signals.