GLM-4.6V is Z.ai's full-scale 106B-parameter vision-language model, designed for cloud and high-performance cluster deployment. Released September 30, 2025, it upgrades GLM-4.5V with a 128K-token context window, native multimodal function calling, and reported state-of-the-art multimodal benchmark results at comparable parameter scales.
A defining capability is native multimodal function calling, which lets you pass images and screenshots directly as tool inputs without converting them to text descriptions first. This enables agentic workflows where the model reasons about visual content, decides to call tools based on what it sees, and processes visual results across multiple steps. Combined with interleaved image-text content generation, GLM-4.6V produces mixed-media outputs from complex inputs.
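To make the flow concrete, here is a minimal sketch of a multimodal function call, assuming an OpenAI-compatible chat completions endpoint. The base URL, model identifier, and the `file_bug_report` tool are assumptions for illustration, not confirmed API details; consult Z.ai's documentation for the actual values.

```python
# Minimal sketch, assuming an OpenAI-compatible endpoint.
# Base URL, model id, and tool name below are assumptions.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.z.ai/api/paas/v4",  # assumed endpoint
    api_key="YOUR_API_KEY",
)

# A tool the model may choose to call after inspecting the screenshot.
tools = [{
    "type": "function",
    "function": {
        "name": "file_bug_report",  # hypothetical tool
        "description": "File a bug for a UI defect visible in a screenshot.",
        "parameters": {
            "type": "object",
            "properties": {
                "component": {"type": "string"},
                "summary": {"type": "string"},
            },
            "required": ["component", "summary"],
        },
    },
}]

response = client.chat.completions.create(
    model="glm-4.6v",  # assumed model identifier
    messages=[{
        "role": "user",
        "content": [
            # The screenshot is passed directly; no text transcription step.
            {"type": "image_url",
             "image_url": {"url": "https://example.com/screenshot.png"}},
            {"type": "text",
             "text": "If this page has a visual defect, file a bug for it."},
        ],
    }],
    tools=tools,
)

# If the model decided to call the tool, the arguments arrive as JSON.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```

In an agentic loop, the tool result (which may itself be an image) is appended to the conversation and the model is called again, which is what allows it to process visual results across multiple steps.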
GLM-4.6V handles frontend replication and visual editing: given a screenshot, it can reconstruct pixel-accurate HTML/CSS and apply iterative modifications. This capability, combined with multimodal document understanding (joint interpretation of text, layout, charts, and figures), suits UI development workflows and complex document processing pipelines.
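The screenshot-to-HTML loop follows the same pattern. The sketch below shows one plausible shape of it, under the same assumptions as above (OpenAI-compatible endpoint, assumed base URL and model id); the prompts are illustrative, not prescribed.

```python
# Sketch of frontend replication plus one iterative edit.
# Endpoint and model id are assumptions, as before.
import base64
from openai import OpenAI

client = OpenAI(base_url="https://api.z.ai/api/paas/v4",  # assumed
                api_key="YOUR_API_KEY")

# Embed a local screenshot as a data URL.
with open("page.png", "rb") as f:
    data_url = "data:image/png;base64," + base64.b64encode(f.read()).decode()

messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": data_url}},
        {"type": "text",
         "text": "Reproduce this page as a single self-contained HTML file "
                 "with inline CSS, matching layout and colors as closely "
                 "as possible."},
    ],
}]

first_pass = client.chat.completions.create(model="glm-4.6v",
                                            messages=messages)
html = first_pass.choices[0].message.content

# Iterative visual editing: feed the draft back with a follow-up request.
messages += [
    {"role": "assistant", "content": html},
    {"role": "user",
     "content": "Make the header sticky and darken the footer."},
]
revised = client.chat.completions.create(model="glm-4.6v",
                                         messages=messages)
print(revised.choices[0].message.content)
```

Because the original screenshot stays in the conversation history, each revision request is grounded against the visual target rather than only the previous draft.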