Wan v2.6 Reference-to-Video

Wan v2.6 Reference-to-Video is Alibaba's quality-first reference-to-video model. It extracts identity from short reference clips and renders new scenes that preserve the subject's appearance, voice, and motion with high fidelity, at resolutions up to 1080p.

index.ts
import { experimental_generateVideo as generateVideo } from 'ai';

const result = await generateVideo({
  model: 'alibaba/wan-v2.6-r2v',
  prompt: 'A serene mountain lake at sunrise.',
});

Frequently Asked Questions

  • Why does R2V take longer to generate than the Flash variant?

    The standard R2V model spends additional compute on identity reconstruction: fine-grained facial detail, voice matching, and movement pattern extraction all receive more processing time. This is a deliberate design choice favoring output quality over speed.

  • What happens if my reference clip is very short, like 2 seconds?

    The model can work with clips as short as 2 seconds, but shorter references provide less identity data. For the highest fidelity, longer clips within the 2-30 second range give the extraction pipeline more material to work with, particularly for voice and movement characteristics.

  • How should I tag multiple subjects in a single prompt?

    Use character1, character2, and so on in the prompt text, in the same order as your reference URLs. Each name maps to the corresponding reference image or video; a minimal sketch follows this FAQ.

  • Does R2V preserve clothing and accessories from the reference?

    Yes. The identity extraction covers visual appearance broadly, including clothing, accessories, hairstyle, and body proportions, not just facial features. The generated output aims to maintain the full visual signature of the reference subject.

  • What aspect ratios work best for social media delivery?

    For vertical social content, 9:16 is the standard choice. For feed posts, 1:1 provides square framing. The model also supports 16:9, 4:3, and 3:4 for other distribution contexts.

  • Can R2V be used for product or object identity transfer, or only people?

    The reference extraction is designed broadly enough to capture objects and animals in addition to people, though the pipeline is most heavily optimized for human subjects where facial and vocal identity are the primary signals.
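
Putting the multi-subject tagging together, here is a minimal sketch. The character1/character2 tags and the model ID come from the answers above; the providerOptions shape, the referenceUrls field, and the aspectRatio option are illustrative assumptions rather than confirmed parameter names, so check the provider documentation for the exact API.

index.ts
import { experimental_generateVideo as generateVideo } from 'ai';

const result = await generateVideo({
  model: 'alibaba/wan-v2.6-r2v',
  // character1 maps to the first reference URL, character2 to the second.
  prompt:
    'character1 and character2 walk along a shoreline at golden hour, ' +
    'talking and laughing in a slow tracking shot.',
  providerOptions: {
    alibaba: {
      // Assumed field name for reference inputs; the order of the URLs
      // defines the character1/character2 mapping described above.
      referenceUrls: [
        'https://example.com/subject-one.mp4',
        'https://example.com/subject-two.mp4',
      ],
      // Assumed option name; 9:16 targets vertical social delivery.
      aspectRatio: '9:16',
    },
  },
});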