Kling v2.6 Image-to-Video

Kling v2.6 Image-to-Video animates reference images into video with synchronized native audio, including speech, sound effects, and ambient sound, in a single API request with first/last frame control at up to 1080p.

image-to-video, audio-generation
index.ts
import { experimental_generateVideo as generateVideo } from 'ai';

// Request a clip from Kling v2.6 Image-to-Video through AI Gateway.
const result = await generateVideo({
  model: 'klingai/kling-v2.6-i2v',
  prompt: 'A serene mountain lake at sunrise.',
});

Playground

Try out Kling v2.6 Image-to-Video by Kling AI. Usage is billed to your team at API rates. Free users (those who haven't made a payment) get $5 of credits every 30 days.

About Kling v2.6 Image-to-Video

Kling v2.6 Image-to-Video is the first Kling image-to-video release with native audio generation. It eliminates the post-production step of aligning a separately rendered audio track to a silent clip. The audio layer covers three categories: natural speech in Chinese and English, action-synchronized sound effects, and environmental ambience. All audio renders in the same inference pass that produces the video frames.

The reference image grounds the visual output. The model animates the scene depicted in the provided image rather than generating visual content from scratch. First-frame and last-frame anchoring carry forward from prior versions. You can define both the opening and closing visual states, and the model fills in the motion path between those anchors. This works well for product reveals, character entrances, or controlled transitions where you've defined the start and end configurations photographically.
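The three anchoring options (opening frame only, closing frame only, or both) can be sketched as a small classifier. This is a minimal illustration, not the AI SDK's or Kling's actual request type: the `FrameAnchors` shape, the `anchorMode` function, and the file names are all hypothetical.

```typescript
// Hypothetical shape for illustration only; not an actual SDK or provider type.
type FrameAnchors = {
  firstFrame?: string; // path or URL of the opening image
  lastFrame?: string; // path or URL of the closing image
};

type AnchorMode = 'first-only' | 'last-only' | 'first-and-last';

// Classify which anchoring mode a request uses.
// Image-to-video needs at least one reference image, so an empty request is rejected.
function anchorMode(anchors: FrameAnchors): AnchorMode {
  const hasFirst = Boolean(anchors.firstFrame);
  const hasLast = Boolean(anchors.lastFrame);
  if (hasFirst && hasLast) return 'first-and-last';
  if (hasFirst) return 'first-only';
  if (hasLast) return 'last-only';
  throw new Error('Image-to-video needs at least one reference image.');
}

// A product reveal with both the start and end states photographed.
const revealMode = anchorMode({ firstFrame: 'box-closed.png', lastFrame: 'box-open.png' });
console.log(revealMode); // prints "first-and-last"
```

With both anchors set, the model is left to fill in only the motion path between the two states.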

For content teams working on social media clips, product demos, or localized marketing materials, audio-video unification in v2.6 reduces pipeline stages per finished asset. A reference product photograph plus a text description produces a complete video with visuals and sound. No separate text-to-speech (TTS) or sound effects (SFX) processing step is needed.

The v2.6 generation also improves visual quality over v2.5, with better temporal consistency across frames and sharper rendering of complex scenes.

Providers

Route requests across multiple providers. Copy a provider slug to set your preference. Visit the docs for more info. Using a provider means you agree to their terms, listed under Legal.

Provider: Kling AI
Legal: Terms, Privacy
Release Date: 12/21/2025

Throughput

P50 throughput on live AI Gateway traffic, in tokens per second (TPS). Visit the docs for more info.

Latency

P50 time to first token (TTFT) on live AI Gateway traffic, in milliseconds. View the docs for more info.

What To Consider When Choosing a Provider

  • Configuration: Audio generation supports Chinese and English speech synthesis. If you need other languages, confirm support before you ship.
  • Zero Data Retention: AI Gateway does not currently support Zero Data Retention for this model. See the documentation for models that support ZDR.
  • Authentication: AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.
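As a rough sketch of the authentication point above, the helper below picks whichever credential is present in an environment map. The variable names `AI_GATEWAY_API_KEY` and `VERCEL_OIDC_TOKEN`, the preference for the API key over the OIDC token, and the `resolveGatewayCredential` function itself are assumptions for illustration; check the AI Gateway documentation for the names your deployment actually uses.

```typescript
// Assumed environment variable names; confirm them against the AI Gateway docs.
type Env = Record<string, string | undefined>;

type Credential = { kind: 'api-key' | 'oidc'; token: string };

// Return the first usable credential, preferring the API key (an arbitrary choice here).
function resolveGatewayCredential(env: Env): Credential {
  const apiKey = (env.AI_GATEWAY_API_KEY || '').trim();
  if (apiKey !== '') return { kind: 'api-key', token: apiKey };

  const oidcToken = (env.VERCEL_OIDC_TOKEN || '').trim();
  if (oidcToken !== '') return { kind: 'oidc', token: oidcToken };

  throw new Error('No AI Gateway credential found: set an API key or an OIDC token.');
}
```

In a Node.js process you would pass `process.env` as the argument. No provider-specific credential appears anywhere, which matches the note that you do not manage provider credentials directly.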

When to Use Kling v2.6 Image-to-Video

Best For

  • Brand and product videos: Reference imagery anchors the visuals, and you need synchronized narration or ambient sound
  • Localized short-form video: Chinese or English speech synthesis tied to existing photography
  • Controlled transitions with audio: Animation between two defined visual states with audio accompaniment
  • Advertising clips: Social media and ad content where audio-video sync is expected and a reference image defines the visual subject

Consider Alternatives When

  • Silent video preferred: You don't need audio output. v2.5 Turbo i2v produces silent video faster and at lower cost
  • Multi-shot sequences: You need narrative sequences with distinct scene changes, which later Kling versions support
  • Text-driven visuals: A text description, not a reference image, should drive the visual content. Use the t2v variant instead
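The guidance in the two lists above can be read as a short decision function. This is one possible reading, not an official selection rule: the `Needs` shape, the `suggestVariant` name, and the order in which the checks run are assumptions, and the return values are descriptive labels rather than real model slugs.

```typescript
// Hypothetical inputs describing what a project needs.
type Needs = {
  hasReferenceImage: boolean; // a photo or frame should drive the visuals
  needsAudio: boolean; // speech, sound effects, or ambience in the output
  multiShot: boolean; // narrative sequence with distinct scene changes
};

// Descriptive labels only; these are not model slugs.
type Suggestion =
  | 'Kling v2.6 Image-to-Video'
  | 'v2.5 Turbo i2v'
  | 't2v variant'
  | 'later Kling version';

// Check order is an assumption: multi-shot first, then image vs. text, then audio.
function suggestVariant(needs: Needs): Suggestion {
  if (needs.multiShot) return 'later Kling version';
  if (!needs.hasReferenceImage) return 't2v variant';
  if (!needs.needsAudio) return 'v2.5 Turbo i2v';
  return 'Kling v2.6 Image-to-Video';
}

// A product photo that needs narration and no scene changes.
const productClip = suggestVariant({ hasReferenceImage: true, needsAudio: true, multiShot: false });
console.log(productClip); // prints "Kling v2.6 Image-to-Video"
```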

Conclusion

Kling v2.6 Image-to-Video unifies image animation and audio generation in one inference call. If you used to pair silent AI video with separate audio, you drop a sync step. Teams animating reference imagery into full clips with sound get fewer handoffs than with silent tiers.

Frequently Asked Questions

  • How does first/last frame anchoring work in Kling v2.6 Image-to-Video?

    You supply images defining the opening frame, the closing frame, or both. The model generates motion and scene evolution between those endpoints. This suits controlled product reveals or transition sequences where you know the visual start and end states in advance.

  • What audio categories does v2.6 i2v generate?

    Three types: natural speech synthesis in Chinese and English, action-relevant sound effects timed to on-screen events, and environmental ambient sound reinforcing the scene atmosphere. All three synchronize to the video output.

  • Is the audio produced in a separate processing step?

    No. Audio generates in the same inference request as video. There's no separate TTS or SFX pipeline to coordinate. The finished output includes both video and audio.

  • What is the maximum video duration for Kling v2.6 Image-to-Video?

    The maximum is 10 seconds. Outputs are available at 5 or 10 seconds, at up to 1080p resolution across 16:9, 9:16, and 1:1 aspect ratios.

  • How does v2.6 i2v differ from v2.5 Turbo i2v?

    V2.6 adds native audio generation. V2.5 Turbo produces silent video and prioritizes fast generation at lower cost. V2.6 also includes visual quality improvements over the v2.5 generation.

  • Does the model require a text prompt in addition to the reference image?

    A text prompt is optional but recommended. It guides the model's animation direction and audio synthesis toward your intended output.
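The limits stated in these answers (5 or 10 seconds; 16:9, 9:16, or 1:1; an optional but recommended text prompt) can be checked before a request is sent. The sketch below is a client-side pre-flight check under those stated limits only. The `OutputOptions` shape and the `checkOutputOptions` function are hypothetical and do not correspond to actual SDK parameters.

```typescript
// Values taken from the FAQ answers; all names here are hypothetical.
const ALLOWED_DURATIONS_SECONDS = [5, 10];
const ALLOWED_ASPECT_RATIOS = ['16:9', '9:16', '1:1'];

type OutputOptions = {
  durationSeconds: number;
  aspectRatio: string;
  prompt?: string; // optional but recommended
};

type CheckResult = { errors: string[]; warnings: string[] };

// Collect hard errors (unsupported values) and soft warnings (missing prompt).
function checkOutputOptions(options: OutputOptions): CheckResult {
  const errors: string[] = [];
  const warnings: string[] = [];

  if (ALLOWED_DURATIONS_SECONDS.indexOf(options.durationSeconds) === -1) {
    errors.push(`duration must be 5 or 10 seconds, got ${options.durationSeconds}`);
  }
  if (ALLOWED_ASPECT_RATIOS.indexOf(options.aspectRatio) === -1) {
    errors.push(`aspect ratio must be 16:9, 9:16, or 1:1, got ${options.aspectRatio}`);
  }
  if (!options.prompt || options.prompt.trim() === '') {
    warnings.push('no text prompt: animation direction and audio are left to the model');
  }
  return { errors, warnings };
}
```

A missing prompt is reported as a warning rather than an error because the prompt is optional. Unsupported durations and aspect ratios are errors because no output is offered for them.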