GLM-4.6V
GLM-4.6V is Z.ai's full-scale 106B-parameter vision-language foundation model. It offers a 128K-token context window, native multimodal function calling, interleaved image-text generation, and pixel-accurate frontend replication from screenshots.
import { streamText } from 'ai'

const result = streamText({
  model: 'zai/glm-4.6v',
  prompt: 'Why is the sky blue?',
})

Frequently Asked Questions
What is native multimodal function calling in GLM-4.6V?
It allows you to pass images and screenshots directly as tool inputs in agentic workflows. The model reasons about visual content and calls tools based on what it sees, without requiring intermediate text conversion of images.
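As a minimal sketch of what that looks like in practice (the content-part shapes follow the AI SDK's multimodal message format; the screenshot URL and instruction text are hypothetical), a screenshot goes in as an image part next to the text, so the model can inspect it directly and decide which tool to call:

```typescript
// Assumed AI SDK message shape: one user turn mixing a text part and an
// image part. The URL below is a placeholder.
const messages = [
  {
    role: 'user' as const,
    content: [
      {
        type: 'text' as const,
        text: 'Open the settings panel shown in this screenshot.',
      },
      {
        type: 'image' as const,
        image: new URL('https://example.com/screenshot.png'),
      },
    ],
  },
]

// This same array is what you would hand to streamText alongside a tools
// map; no text transcription of the image happens first.
console.log(messages[0].content.map((part) => part.type).join(','))
```

The key point is that the image is a first-class input part, not a preprocessed text description.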
How does GLM-4.6V compare to GLM-4.5V?
GLM-4.6V is a major upgrade over GLM-4.5V (which was built on GLM-4.5-Air): 106B parameters, a 128K-token context window, native multimodal function calling, interleaved image-text generation, and improved frontend replication from screenshots.
Can GLM-4.6V reconstruct HTML/CSS from screenshots?
Yes. GLM-4.6V can produce pixel-accurate HTML/CSS from screenshots and apply iterative modifications, making it effective for frontend development and UI replication workflows.
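One way to sketch that iterative workflow (the helper names, message shapes, and screenshot URL here are illustrative, not part of any official API): the first turn asks for a pixel-accurate reconstruction, and a follow-up turn feeds the model's own HTML back with a change request.

```typescript
// Hypothetical helpers for building a replicate-then-revise conversation.
type Part = { type: 'text'; text: string } | { type: 'image'; image: URL }
type Message = { role: 'user' | 'assistant'; content: Part[] | string }

function replicationMessages(screenshot: URL): Message[] {
  return [
    {
      role: 'user',
      content: [
        {
          type: 'text',
          text: 'Reproduce this UI as a single self-contained HTML file with inline CSS.',
        },
        { type: 'image', image: screenshot },
      ],
    },
  ]
}

// Append the model's previous HTML output plus a revision instruction,
// producing the message history for the next request.
function withRevision(history: Message[], html: string, change: string): Message[] {
  return [
    ...history,
    { role: 'assistant', content: html },
    { role: 'user', content: [{ type: 'text', text: change }] },
  ]
}

const turns = withRevision(
  replicationMessages(new URL('https://example.com/mock.png')),
  '<html>…</html>',
  'Increase the header contrast and left-align the nav links.',
)
console.log(turns.length) // 3
```

Each revision round just extends the same conversation, so the model edits its prior output rather than regenerating from scratch.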
What is the difference between GLM-4.6V and GLM-4.6V-Flash?
GLM-4.6V is the full 106B parameter model for maximum capability. GLM-4.6V-Flash is a 9B parameter lightweight variant designed for local deployment and low-latency applications.
How do I authenticate with GLM-4.6V through AI Gateway?
AI Gateway provides a unified API key, so no separate Z.ai account is needed. Route requests with the model identifier `zai/glm-4.6v`, or configure BYOK (bring your own key) for direct provider access.
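A minimal setup sketch, assuming the gateway reads its key from an `AI_GATEWAY_API_KEY` environment variable (confirm the exact variable name in your gateway dashboard):

```shell
# Assumed environment variable name for the unified gateway key; with it
# set, the AI SDK can route 'zai/glm-4.6v' requests without a Z.ai account.
export AI_GATEWAY_API_KEY="your-gateway-key"
```

With BYOK configured instead, the gateway forwards requests using your own provider credentials.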
What context window does GLM-4.6V support?
128K tokens, which supports extended document analysis, multi-image understanding, and long multimodal conversations in a single request.
Does GLM-4.6V support video input?
GLM-4.6V is designed primarily for image and document understanding. For dedicated video understanding, see the GLM model documentation for video-capable variants.