Kling v2.6 Text-to-Video
Kling v2.6 Text-to-Video generates video with native audio from text prompts alone. It supports multi-shot narrative storytelling with synchronized speech, sound effects, and ambient audio at up to 1080p in a single request.
import { experimental_generateVideo as generateVideo } from 'ai';
const result = await generateVideo({
  model: 'klingai/kling-v2.6-t2v',
  prompt: 'A serene mountain lake at sunrise.',
});
Frequently Asked Questions
What does multi-shot storytelling mean in Kling v2.6 Text-to-Video?
The model interprets a prompt describing sequential events and generates distinct scene cuts within the output, rather than a single continuous shot. A prompt structured as a brief narrative (scene A, then scene B) produces a video that transitions between those scenes.
How should I write prompts to take advantage of multi-shot storytelling?
Describe sequential events or scene changes within the prompt. For example, a prompt that describes an action followed by its result, or a product shown then used, tends to trigger multi-shot composition.
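One way to keep sequential events explicit is to assemble the prompt from an ordered list of scene descriptions. The sketch below is illustrative only: `buildMultiShotPrompt` is a hypothetical helper, and the "Then:" connector is an assumption rather than a documented trigger phrase, so results may vary.

```typescript
// Hypothetical helper (not part of the AI SDK): joins ordered scene
// descriptions into a single narrative prompt.
// Assumption: a plain "Then:" connector is enough to signal a scene change;
// the model documentation does not prescribe any specific wording.
function buildMultiShotPrompt(scenes: string[]): string {
  // Trim each scene, drop trailing periods/whitespace, and skip empty entries.
  const cleaned = scenes
    .map((scene) => scene.trim().replace(/[.\s]+$/, ''))
    .filter((scene) => scene.length > 0);

  if (cleaned.length === 0) {
    throw new Error('At least one scene description is required.');
  }

  // The first scene stands alone; each later scene is introduced with "Then:".
  return (
    cleaned
      .map((scene, index) => (index === 0 ? scene : `Then: ${scene}`))
      .join('. ') + '.'
  );
}

const multiShotPrompt = buildMultiShotPrompt([
  'A barista pours latte art in a sunlit cafe.',
  'A customer takes the first sip and smiles',
]);
```

The resulting string can be passed as the `prompt` value in the `generateVideo` call shown at the top of this page.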
What audio types does Kling v2.6 Text-to-Video generate?
Natural speech in Chinese and English, action-synchronized sound effects, and environmental ambient audio. All are produced in the same inference request as the video, with no separate TTS or SFX pipeline.
Can v2.6 t2v produce video without audio if audio is not needed?
Audio generation is built into v2.6. If you prefer silent video at lower cost, use v2.5 Turbo t2v instead.
How does v2.6 t2v differ from v2.5 Turbo t2v?
V2.6 adds native audio generation and multi-shot storytelling. V2.5 Turbo is faster and produces silent video. Choose v2.5 Turbo when speed and cost efficiency outweigh the need for audio or narrative scene structure.
What output durations and resolutions are available?
Outputs are available at 5 or 10 seconds, at up to 1080p resolution, in 16:9, 9:16, and 1:1 aspect ratios.
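If you accept duration or aspect ratio from user input, it may help to check the values against the options listed above before sending a request. This is a minimal sketch: `checkOutputOptions` and its parameter names are hypothetical, and how these settings are actually passed to the model is not covered on this page.

```typescript
// Values taken from the answer above; everything else here is illustrative.
const SUPPORTED_DURATIONS_SECONDS: number[] = [5, 10];
const SUPPORTED_ASPECT_RATIOS: string[] = ['16:9', '9:16', '1:1'];

// Hypothetical helper: returns a list of human-readable problems,
// or an empty list when both values match the documented options.
function checkOutputOptions(
  durationSeconds: number,
  aspectRatio: string,
): string[] {
  const problems: string[] = [];

  if (SUPPORTED_DURATIONS_SECONDS.indexOf(durationSeconds) === -1) {
    problems.push(
      `Unsupported duration: ${durationSeconds}s (expected 5 or 10).`,
    );
  }

  if (SUPPORTED_ASPECT_RATIOS.indexOf(aspectRatio) === -1) {
    problems.push(
      `Unsupported aspect ratio: ${aspectRatio} (expected 16:9, 9:16, or 1:1).`,
    );
  }

  return problems;
}

const okProblems = checkOutputOptions(10, '9:16');
const badProblems = checkOutputOptions(8, '4:3');
```

Returning a list of problems, rather than throwing on the first one, lets a caller report every invalid setting at once.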