
Kling v2.6 Image-to-Video

Kling v2.6 Image-to-Video animates reference images into video with synchronized native audio (speech, sound effects, and ambient sound) generated in a single API request. It supports first/last frame control and output at up to 1080p.

Tags: image-to-video, audio-generation
index.ts

import { experimental_generateVideo as generateVideo } from 'ai';

// Generate video with native audio from a text prompt. For image-to-video
// use, the reference image is supplied as well; see the provider docs for
// the exact input parameter.
const result = await generateVideo({
  model: 'klingai/kling-v2.6-i2v',
  prompt: 'A serene mountain lake at sunrise.',
});

Frequently Asked Questions

  • How does first/last frame anchoring work in Kling v2.6 Image-to-Video?

    You supply images defining the opening frame, the closing frame, or both, and the model generates motion and scene evolution between those endpoints. This suits controlled product reveals or transition sequences where you know the visual start and end states in advance; a hedged sketch of this pattern appears after this FAQ.

  • What audio categories does v2.6 i2v generate?

    Three types: natural speech synthesis in Chinese and English, action-relevant sound effects timed to on-screen events, and environmental ambient sound reinforcing the scene atmosphere. All three synchronize to the video output. A prompt that addresses all three categories is sketched after this FAQ.

  • Is the audio produced in a separate processing step?

    No. Audio is generated in the same inference request as the video, so there is no separate TTS or SFX pipeline to coordinate. The finished output includes both video and audio; see the output-handling sketch after this FAQ.

  • What is the maximum video duration for Kling v2.6 Image-to-Video?

    Outputs are available at 5 or 10 seconds, at up to 1080p resolution, across 16:9, 9:16, and 1:1 aspect ratios. A sketch of selecting duration and aspect ratio appears after this FAQ.

  • How does v2.6 i2v differ from v2.5 Turbo i2v?

    V2.6 adds native audio generation. V2.5 Turbo produces silent video and prioritizes fast generation at lower cost. V2.6 also includes visual quality improvements over the v2.5 generation.

  • Does the model require a text prompt in addition to the reference image?

    A text prompt is optional but recommended. It guides the model's animation direction and audio synthesis toward your intended output.
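
The sketches below expand on the answers above. First, first/last frame anchoring: the call shape follows the quickstart at the top of this page, but the providerOptions keys used here (klingai, startFrame, endFrame) are assumed names for illustration only, not confirmed parameters; check the Kling provider documentation for the actual names.

import { experimental_generateVideo as generateVideo } from 'ai';
import { readFileSync } from 'node:fs';

// Hypothetical sketch: startFrame/endFrame are assumed key names, not
// confirmed API. The model interpolates motion between the two endpoints.
const result = await generateVideo({
  model: 'klingai/kling-v2.6-i2v',
  prompt: 'A smooth dolly move from the closed box to the fully revealed product.',
  providerOptions: {
    klingai: {
      startFrame: readFileSync('box-closed.png'), // opening frame (assumed key)
      endFrame: readFileSync('product-hero.png'), // closing frame (assumed key)
    },
  },
});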
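
Because audio is generated natively in the same request, the prompt can describe speech, sound effects, and ambience alongside the visuals. The phrasing below is illustrative, not a required format:

import { experimental_generateVideo as generateVideo } from 'ai';

// One prompt covering all three audio categories: speech, action-timed
// sound effects, and ambient sound.
const result = await generateVideo({
  model: 'klingai/kling-v2.6-i2v',
  prompt:
    'A barista slides a cup across the counter and says "Enjoy!" in English; ' +
    'a ceramic clink as the cup lands; soft cafe ambience with quiet chatter.',
});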
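
Handling the combined output: the result shape below (result.video.uint8Array) is an assumption mirroring the AI SDK's generateImage result, not a confirmed API; verify it against the SDK documentation.

import { experimental_generateVideo as generateVideo } from 'ai';
import { writeFileSync } from 'node:fs';

const result = await generateVideo({
  model: 'klingai/kling-v2.6-i2v',
  prompt: 'A serene mountain lake at sunrise.',
});

// Assumed result shape: a single output carrying both the video track and
// its synchronized audio, so no separate muxing step is needed.
writeFileSync('output.mp4', result.video.uint8Array);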
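
Selecting duration and aspect ratio: aspectRatio and duration below are assumed setting names patterned after other AI SDK generation calls; confirm the actual parameters in the documentation.

import { experimental_generateVideo as generateVideo } from 'ai';

// Hypothetical sketch: aspectRatio/duration are assumed parameter names.
const result = await generateVideo({
  model: 'klingai/kling-v2.6-i2v',
  prompt: 'A serene mountain lake at sunrise.',
  aspectRatio: '9:16', // supported: 16:9, 9:16, 1:1
  duration: 10,        // seconds: 5 or 10
});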