Kling v2.6 Image-to-Video is the first Kling image-to-video release with native audio generation. It eliminates the post-production step of aligning a separately rendered audio track to a silent clip. The audio layer covers three categories: natural speech in Chinese and English, action-synchronized sound effects, and environmental ambience. All audio renders in the same inference pass that produces the video frames.
The reference image grounds the visual output: the model animates the scene depicted in the provided image rather than generating visual content from scratch. First-frame and last-frame anchoring carries forward from prior versions. You can define both the opening and closing visual states, and the model fills in the motion path between those anchors. This works well for product reveals, character entrances, or controlled transitions where you already have photographs of the start and end states.
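The first-frame and last-frame anchoring described above can be pictured as a request body that carries one required image and one optional image. The following is a minimal sketch under stated assumptions, not the actual Kling API: the function name `build_anchor_request` and the field names `prompt`, `first_frame_b64`, and `last_frame_b64` are hypothetical and chosen only for illustration.

```python
import base64
from typing import Optional


def build_anchor_request(
    prompt: str,
    first_frame: bytes,
    last_frame: Optional[bytes] = None,
) -> dict:
    """Assemble a hypothetical image-to-video request body.

    Field names are illustrative assumptions, not documented API fields.
    """
    if not first_frame:
        raise ValueError("first_frame is required: it grounds the visual output")

    body = {
        "prompt": prompt,
        # The opening visual state; the model animates from this image.
        "first_frame_b64": base64.b64encode(first_frame).decode("ascii"),
    }
    if last_frame is not None:
        # The optional closing visual state; the model would fill in
        # the motion path between the two anchors.
        body["last_frame_b64"] = base64.b64encode(last_frame).decode("ascii")
    return body
```

In practice the image bytes would come from files, for example `open("start.png", "rb").read()`. Leaving out `last_frame` corresponds to anchoring only the opening state.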
For content teams working on social media clips, product demos, or localized marketing materials, unified audio-video generation in v2.6 reduces the number of pipeline stages per finished asset. A reference product photograph plus a text description produces a complete video with visuals and sound, with no separate text-to-speech (TTS) or sound effects (SFX) processing step.
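Because a single text description now has to cover visuals, speech, sound effects, and ambience, it can help to keep those parts separate in code and join them only at the end. The sketch below is one possible way to do that. The `ShotDescription` class and the `Speech:`, `Sound effects:`, and `Ambience:` labels are assumptions made for illustration, not a documented prompt syntax.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ShotDescription:
    """Hypothetical container for the parts of one text description."""

    visuals: str
    dialogue: str = ""
    sound_effects: List[str] = field(default_factory=list)
    ambience: str = ""

    def to_prompt(self) -> str:
        """Join the non-empty parts into one description string."""
        parts = [self.visuals.strip()]
        if self.dialogue:
            parts.append(f'Speech: "{self.dialogue.strip()}"')
        if self.sound_effects:
            parts.append("Sound effects: " + ", ".join(self.sound_effects))
        if self.ambience:
            parts.append("Ambience: " + self.ambience.strip())
        # End every part with a period so the parts read as sentences.
        return " ".join(p if p.endswith(".") else p + "." for p in parts)


shot = ShotDescription(
    visuals="A ceramic mug rotates on a wooden table",
    dialogue="Meet the new mug",
    sound_effects=["ceramic clink"],
    ambience="quiet cafe",
)
prompt = shot.to_prompt()
```

Keeping the audio cues as separate fields makes it easier to localize the dialogue or swap the ambience without touching the visual description.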
The v2.6 release also improves visual quality over v2.5, with better temporal consistency across frames and sharper rendering of complex scenes.