
DeepSeek V4 Flash

DeepSeek V4 Flash is DeepSeek's April 23, 2026 efficiency-tier model in the V4 series. It pairs a hybrid attention architecture with a context window of 1.0M tokens and supports reasoning, tool use, and implicit caching.

Reasoning · Tool Use · Implicit Caching
index.ts

```typescript
import { streamText } from 'ai'

const result = streamText({
  model: 'deepseek/deepseek-v4-flash',
  prompt: 'Why is the sky blue?',
})

// Consume the stream; textStream yields the response text incrementally.
for await (const chunk of result.textStream) {
  process.stdout.write(chunk)
}
```

Frequently Asked Questions

  • What does the V4 hybrid attention architecture change for inference?

    DeepSeek V4 Flash combines Compressed Sparse Attention (CSA) with Heavily Compressed Attention (HCA), and uses Manifold-Constrained Hyper-Connections (mHC) in place of standard residual connections. The combination targets efficient inference at long context, up to the full 1.0M-token window.

  • When should I pick DeepSeek V4 Flash over DeepSeek V4 Pro?

    Pick DeepSeek V4 Flash for instruction following, classification, and short-form question answering where latency and per-token cost matter most. Use DeepSeek V4 Pro for complex reasoning, multi-step problem solving, and agentic tasks.
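    As an illustration only, that guidance maps naturally onto a small task router. The `pickModel` helper and the `deepseek/deepseek-v4-pro` model id below are hypothetical, not an official API:

    ```typescript
    // Hypothetical router following the guidance above: latency-sensitive,
    // short-form work goes to Flash; heavy reasoning and agentic work goes to Pro.
    type Task =
      | 'instruction-following'
      | 'classification'
      | 'short-qa'
      | 'complex-reasoning'
      | 'multi-step'
      | 'agentic'

    function pickModel(task: Task): string {
      switch (task) {
        case 'instruction-following':
        case 'classification':
        case 'short-qa':
          return 'deepseek/deepseek-v4-flash'
        case 'complex-reasoning':
        case 'multi-step':
        case 'agentic':
          // Assumed id for the Pro model; check the provider's model list.
          return 'deepseek/deepseek-v4-pro'
      }
    }

    console.log(pickModel('classification'))
    ```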

  • What is the context window and max output for DeepSeek V4 Flash?

    Both the context window and the maximum output are 1.0M tokens.

  • What does implicit caching do for pricing?

    Implicit caching detects repeated input prefixes (typically long system prompts) and charges the cached input rate of $0.0028 per token instead of the standard $0.14 input rate. No explicit cache-control header is required.
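    At those rates, the saving on a repeated prefix is easy to quantify. A minimal sketch, using only the two rates stated above (the `inputCost` helper is illustrative, not part of any SDK):

    ```typescript
    // Rates from the FAQ above, in dollars at the stated per-token prices.
    const STANDARD_INPUT_RATE = 0.14
    const CACHED_INPUT_RATE = 0.0028

    // Cost of a request where `cachedTokens` of the input prefix hit the
    // implicit cache and the remainder is billed at the standard rate.
    function inputCost(totalTokens: number, cachedTokens: number): number {
      const uncached = totalTokens - cachedTokens
      return uncached * STANDARD_INPUT_RATE + cachedTokens * CACHED_INPUT_RATE
    }

    // A fully cached prefix is billed at 2% of the standard input rate.
    console.log(CACHED_INPUT_RATE / STANDARD_INPUT_RATE)
    ```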

  • Does DeepSeek V4 Flash support tool calls?

    Yes. DeepSeek V4 Flash is tagged for tool use and reasoning, so function calling works through the AI SDK as well as Chat Completions, Responses, and Messages API formats.
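    Through the AI SDK, function calling looks roughly like the sketch below (assuming AI SDK v5's `tool` helper with a zod schema; the weather tool and its canned result are illustrative, and actually invoking `ask` requires a configured gateway key):

    ```typescript
    import { generateText, tool } from 'ai'
    import { z } from 'zod'

    // Illustrative tool: the model decides when to call it and with what input.
    const getWeather = tool({
      description: 'Get the current weather for a city',
      inputSchema: z.object({ city: z.string() }),
      // Canned result for the sketch; a real tool would call a weather API.
      execute: async ({ city }) => ({ city, tempC: 21 }),
    })

    // Not invoked here: calling this sends a real request to the model.
    async function ask(question: string) {
      const { text } = await generateText({
        model: 'deepseek/deepseek-v4-flash',
        tools: { getWeather },
        prompt: question,
      })
      return text
    }
    ```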

  • Does DeepSeek V4 Flash support zero data retention?

    Yes. Zero Data Retention is available for this model and is offered on a per-provider basis. See https://vercel.com/docs/ai-gateway/capabilities/zdr for details.