Kling v2.6 Text-to-Video

Kling v2.6 Text-to-Video generates video with native audio from text prompts alone. It supports multi-shot narrative storytelling with synchronized speech, sound effects, and ambient audio at up to 1080p in a single request.

text-to-video, audio-generation
index.ts
import { experimental_generateVideo as generateVideo } from 'ai';

// Request a video with native audio from a text prompt alone.
const result = await generateVideo({
  model: 'klingai/kling-v2.6-t2v',
  prompt: 'A serene mountain lake at sunrise.',
});

About Kling v2.6 Text-to-Video

Kling v2.6 Text-to-Video introduces multi-shot storytelling as a core capability. Earlier Kling text-to-video versions produced a single continuous scene. V2.6 interprets a prompt describing sequential events and generates distinct scene cuts within the output duration. This suits narrative content: an advertisement with a product reveal followed by a lifestyle shot, an educational clip moving through two or three steps, or a social post telling a mini-story.
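
One way to keep sequential events explicit is to assemble the prompt from a list of shots. The sketch below is illustrative only: `Shot` and `buildMultiShotPrompt` are hypothetical names, and the `Shot N:` and `Audio:` labels are an assumed convention, not a prompt syntax documented for this model. Plain prose describing the sequence may work equally well.

```typescript
// Hypothetical helper, not part of the AI SDK or any Kling API.
// Assumption: labeling each shot helps the model separate scenes;
// the page does not document a required prompt format.
interface Shot {
  description: string; // what happens on screen
  audio?: string; // optional speech, sound effect, or ambience cue
}

function buildMultiShotPrompt(shots: Shot[]): string {
  if (shots.length === 0) {
    throw new Error('At least one shot is required.');
  }
  return shots
    .map((shot, i) => {
      const audio = shot.audio ? ` Audio: ${shot.audio}` : '';
      return `Shot ${i + 1}: ${shot.description}${audio}`;
    })
    .join('\n');
}

// Example: a product reveal followed by a lifestyle shot.
const multiShotPrompt = buildMultiShotPrompt([
  {
    description: 'A sealed box on a table opens to reveal a smartwatch.',
    audio: 'soft whoosh',
  },
  { description: 'A runner checks the watch on a trail at dawn.' },
]);
```

The resulting string can be supplied as the `prompt` value in the `generateVideo` call shown in index.ts.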

Native audio generation accompanies this multi-shot capability. Speech synthesis in Chinese and English, sound effects synchronized to on-screen action, and environmental ambient audio are all produced in the same inference pass as the video frames. The result is a finished audio-visual asset from a text prompt, with no post-processing alignment needed.

V2.6 also sharpens visual detail rendering compared to v2.5, with improved temporal consistency across scene cuts. This matters for multi-shot content where abrupt or inconsistent transitions degrade the viewing experience.

For developers building content generation pipelines (social media automation, creative brief to video, marketing content at scale), multi-shot storytelling and integrated audio remove two previously required pipeline stages: separate audio processing and manual scene composition.
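
A pipeline of this kind can reduce to a loop over briefs with a single generation step. The following is a minimal sketch under stated assumptions: `Brief`, `Outcome`, `GenerateFn`, and `runBriefs` are hypothetical names, and the generation step is injected as a function so that the loop makes no claims about the shape of the `generateVideo` result, which this page does not describe.

```typescript
// Hypothetical pipeline sketch; none of these names come from the AI SDK.
// Assumption: `generate` wraps the generateVideo call from index.ts and
// resolves with whatever that call returns (kept generic as T here).
type GenerateFn<T> = (prompt: string) => Promise<T>;

interface Brief {
  id: string;
  prompt: string;
}

type Outcome<T> =
  | { id: string; ok: true; result: T }
  | { id: string; ok: false; error: string };

// Processes briefs one at a time and records failures instead of aborting,
// so a single rejected request does not stop the rest of the batch.
async function runBriefs<T>(
  briefs: Brief[],
  generate: GenerateFn<T>,
): Promise<Outcome<T>[]> {
  const outcomes: Outcome<T>[] = [];
  for (const brief of briefs) {
    try {
      const result = await generate(brief.prompt);
      outcomes.push({ id: brief.id, ok: true, result });
    } catch (err) {
      const error = err instanceof Error ? err.message : String(err);
      outcomes.push({ id: brief.id, ok: false, error });
    }
  }
  return outcomes;
}
```

In use, `generate` might be `(prompt) => generateVideo({ model: 'klingai/kling-v2.6-t2v', prompt })`. Sequential processing is a deliberate simplification; concurrency and retry policy are left out because they depend on limits this page does not state.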