Alibaba positioned Wan v2.6 Reference-to-Video as China's first reference-to-video generation model. The core idea is simple: feed the model a short video clip of a person, and it extracts enough about their face, body, clothing, voice, and movement style to convincingly place them into an entirely new scene described by text.
What sets the standard R2V apart within the Wan lineup is its emphasis on reconstruction depth. The model devotes additional inference time to faithfully reproducing fine-grained identity signals: the specific way light falls across facial features, subtle mannerisms in how a subject moves, the particular resonance of a voice. For final-delivery video where a client or audience will scrutinize whether the generated character truly matches the reference, this level of fidelity matters.
The reference pipeline accepts multiple characters from reference images or videos. On AI Gateway, pass reference URLs in order and name them character1, character2, and so on in the prompt (see the reference-to-video docs). You can mix images and videos within provider limits (up to five references total), and each video reference can run 2 to 30 seconds. Output duration is 2 to 10 seconds at 720p or 1080p, with five aspect ratio options covering landscape, portrait, square, and intermediate formats.
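To make the request shape concrete, here is a minimal TypeScript sketch of a multi-reference call. The endpoint path, model id, and field names are illustrative assumptions rather than the documented contract; only the constraints above (reference order mapping to character1, character2, and so on, the five-reference cap, and the duration/resolution ranges) come from the docs.

```ts
// Hypothetical sketch: the endpoint, model id, and JSON field names below are
// assumptions for illustration -- consult the reference-to-video docs for the
// actual AI Gateway contract.

async function generateReferenceToVideo(apiKey: string) {
  // References are passed in order; the prompt refers to them as
  // character1, character2, ... matching that order.
  const references = [
    "https://example.com/alice-clip.mp4",    // character1 (video, 2-30s)
    "https://example.com/bob-portrait.png",  // character2 (image)
  ];

  const body = {
    model: "alibaba/wan-v2.6-r2v", // illustrative model id
    prompt:
      "character1 and character2 walk through a rainy neon-lit street, " +
      "talking and laughing, cinematic handheld camera",
    references,           // up to five image/video URLs total
    duration: 8,          // output: 2-10 seconds
    resolution: "1080p",  // 720p or 1080p
    aspect_ratio: "16:9", // one of five supported ratios
  };

  const res = await fetch(
    "https://ai-gateway.example.com/v1/video/generations", // placeholder URL
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(body),
    },
  );
  if (!res.ok) throw new Error(`generation failed: ${res.status}`);
  return res.json(); // typically a job id or a URL to the finished clip
}
```

The detail worth internalizing is the ordering contract: the prompt's characterN labels bind to the Nth reference URL, so reordering the references silently reassigns identities in the output.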