
Nvidia Nemotron Nano 12B V2 VL

nvidia/nemotron-nano-12b-v2-vl

Nvidia Nemotron Nano 12B V2 VL is NVIDIA's open 12B multimodal reasoning model. It pairs a hybrid Mamba-Transformer architecture with strong OCRBenchV2 results and specialized support for document intelligence, video understanding, and RAG pipelines.

Reasoning · Tool Use · Vision (Image)
index.ts
import { streamText } from 'ai'

const result = streamText({
  model: 'nvidia/nemotron-nano-12b-v2-vl',
  prompt: 'Why is the sky blue?',
})

// Consume the stream as it arrives
for await (const chunk of result.textStream) {
  process.stdout.write(chunk)
}

What To Consider When Choosing a Provider

  • Zero Data Retention

AI Gateway supports Zero Data Retention for this model via direct gateway requests (bring-your-own-key requests are not covered). See the documentation to configure it.

    Authentication

    AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.

Video inputs consume significantly more tokens than static images. Estimate average clip length and token density before deploying video tasks at scale so you can budget token costs accurately, and weigh the listed rates ($0.2 input, $0.6 output) against that estimate.
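A minimal budgeting sketch under assumed numbers — the tokens-per-second density and the $0.2-per-million input rate are illustrative placeholders, not measured values; substitute your own clip statistics and the rates listed on this page:

```typescript
// Hypothetical cost estimate for one video clip. tokensPerSecond and
// ratePerMillionTokens are assumptions; replace them with your measured
// token density and the actual listed rate.
function estimateClipCostUsd(
  clipSeconds: number,
  tokensPerSecond: number,
  ratePerMillionTokens: number,
): number {
  const tokens = clipSeconds * tokensPerSecond
  return (tokens / 1_000_000) * ratePerMillionTokens
}

// A 60-second clip at an assumed 200 tokens/second, billed at $0.2/M input:
const clipCost = estimateClipCostUsd(60, 200, 0.2)
```

Multiplying that per-clip figure by expected daily clip volume gives a rough ceiling before committing to video workloads at scale.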

When to Use Nvidia Nemotron Nano 12B V2 VL

Best For

  • Document intelligence:

    Extracting structured text, tables, and bounding boxes from PDFs, scanned forms, and reports

  • Video understanding:

    Dense captioning, video Q&A, or summarization of longer clips with efficient token handling

  • Multimodal RAG pipelines:

    Retrieval and reasoning across documents with images, diagrams, and tables alongside text

  • Media asset management:

    Workflows requiring semantic search and retrieval across visual content

  • Mixed-modality agents:

    Agentic systems that perceive and act on text, image, and video inputs in a single model call

Consider Alternatives When

  • Text-only workloads:

Nemotron 3 Nano or Super is a better fit, without the multimodal overhead

  • Deep text reasoning:

    Super's architecture is optimized for complex text-based multi-agent planning

  • Short simple videos:

    A lighter vision model may suffice for short video inputs

  • Throughput-optimized generation:

    Models with latent MoE or multi-token prediction handle long-context generation more efficiently

Conclusion

Nvidia Nemotron Nano 12B V2 VL applies NVIDIA's hybrid Mamba-Transformer design to multimodal tasks, with a focus on document OCR, video understanding, and visual RAG. Open weights and training data make it a customizable foundation for document and video intelligence pipelines. AI Gateway handles routing when you call the model.

FAQ

What vision tasks does the model handle?

The model handles image Q&A, OCR, dense captioning, and multi-image reasoning. Nvidia Nemotron Nano 12B V2 VL cited OCRBenchV2 results at launch. OCRBenchV2 tests text extraction from document images with complex layouts, tables, and mixed formatting.

How does Efficient Video Sampling (EVS) work?

Efficient Video Sampling (EVS) identifies and prunes temporally static patches in video sequences (frames where little changes between consecutive images). Removing redundant patches reduces the token count per video clip, so the model can process longer videos with up to 2.5x higher throughput without sacrificing accuracy.
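The pruning idea can be sketched roughly as follows. Representing each patch by a single number and using a fixed change threshold are simplifying assumptions for illustration, not the model's actual mechanism:

```typescript
// Illustrative EVS-style pruning: keep every patch of the first frame, then
// keep a patch in later frames only if it changed noticeably since the
// previous frame. Static patches are dropped and consume no tokens.
type Frame = number[] // one summary value per patch (e.g. mean intensity)

function prunedPatchCount(frames: Frame[], threshold: number): number {
  let kept = frames[0].length // all patches of the first frame are kept
  for (let t = 1; t < frames.length; t++) {
    for (let p = 0; p < frames[t].length; p++) {
      // a patch survives pruning only if it changed between frames
      if (Math.abs(frames[t][p] - frames[t - 1][p]) > threshold) kept++
    }
  }
  return kept
}
```

In this toy setup, a clip whose second half is completely static contributes almost no extra patches after the first frame, which is where the token savings come from.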

How does the model fit into RAG pipelines?

Nvidia Nemotron Nano 12B V2 VL serves as the reasoning component for visual content in the Nemotron RAG suite. Embedding models in the same family appear on ViDoRe, MTEB, and MMTEB leaderboards for visual, multimodal, and multilingual text retrieval. Together, they enable retrieval-augmented generation (RAG) across proprietary data with mixed-modality documents.
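A minimal sketch of the retrieval half of such a pipeline. The `Chunk` shape and the scores are hypothetical stand-ins for what an embedding model and vector store would produce; only the overall flow — rank mixed-modality chunks, hand the top ones to the VL model — comes from this page:

```typescript
// Hypothetical shape for a retrieved mixed-modality chunk: text plus an
// optional page image, with a similarity score from an embedding model.
type Chunk = { text: string; imageUrl?: string; score: number }

// Pick the k highest-scoring chunks to feed the VL model as context.
function topK(chunks: Chunk[], k: number): Chunk[] {
  return [...chunks].sort((a, b) => b.score - a.score).slice(0, k)
}

const retrieved = topK(
  [
    { text: 'Q3 revenue table', imageUrl: 'https://example.com/p4.png', score: 0.91 },
    { text: 'Cover page', score: 0.12 },
    { text: 'Methodology diagram', imageUrl: 'https://example.com/p7.png', score: 0.77 },
  ],
  2,
)
// The two surviving chunks would then be passed to the model as text and
// image content parts in a single prompt.
```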

Which benchmark does the model cite for document intelligence?

OCRBenchV2. It measures document intelligence and optical character recognition on visually complex documents.

Are the model weights open?

Yes. NVIDIA released the model weights on Hugging Face under the NVIDIA Open Model License.

Can the model reason over multiple images?

Yes. Multi-image reasoning is part of the model's task coverage, alongside image Q&A, OCR, dense captioning, and video Q&A. You can use it for tasks like comparing document versions, analyzing image sequences, or reasoning over slide decks.
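A hedged sketch of a multi-image request: one user message carrying two images for comparison, in the AI SDK's multimodal content-part shape. The URLs are placeholders; passing `messages` to `generateText` or `streamText` with `model: 'nvidia/nemotron-nano-12b-v2-vl'` would run the comparison:

```typescript
// Build one user message with a text part and two image parts, following
// the AI SDK's content-part message format. URLs are placeholders.
const messages = [
  {
    role: 'user' as const,
    content: [
      {
        type: 'text' as const,
        text: 'What changed between these two versions of the document?',
      },
      { type: 'image' as const, image: new URL('https://example.com/v1.png') },
      { type: 'image' as const, image: new URL('https://example.com/v2.png') },
    ],
  },
]
```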

How is pricing determined?

Rates are listed on this page. They reflect the providers routing through AI Gateway and shift whenever those providers update their pricing.