Nvidia Nemotron Nano 12B V2 VL
Nvidia Nemotron Nano 12B V2 VL is NVIDIA's open 12B multimodal reasoning model, built on a hybrid Mamba-Transformer architecture, benchmarked on OCRBenchV2, and specialized for document intelligence, video understanding, and RAG pipelines.
```ts
import { streamText } from 'ai'

const result = streamText({
  model: 'nvidia/nemotron-nano-12b-v2-vl',
  prompt: 'Why is the sky blue?',
})
```
Playground
Try out Nvidia Nemotron Nano 12B V2 VL by NVIDIA. Usage is billed to your team at API rates. Free users (those who haven't made a payment) get $5 of credits every 30 days.
About Nvidia Nemotron Nano 12B V2 VL
Nvidia Nemotron Nano 12B V2 VL is an open 12B multimodal reasoning model from NVIDIA, released on December 1, 2024. It handles document intelligence and video understanding tasks, letting agents extract, interpret, and act on information across text, images, tables, and videos in a single model. At launch, NVIDIA highlighted the model's OCRBenchV2 results, reflecting its document-level optical character recognition (OCR) and structured-extraction capability.
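In practice, document tasks pair a text instruction with one or more image parts in a single request. Below is a minimal sketch using the AI SDK through AI Gateway; the image URL and extraction prompt are placeholders, and message part shapes can differ slightly between SDK versions.

```ts
import { streamText } from 'ai'

// Minimal sketch: one scanned page plus an extraction instruction,
// streamed through AI Gateway. The image URL is a placeholder.
const result = streamText({
  model: 'nvidia/nemotron-nano-12b-v2-vl',
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Extract every table on this page as Markdown.' },
        { type: 'image', image: new URL('https://example.com/scanned-report.png') },
      ],
    },
  ],
})

for await (const chunk of result.textStream) {
  process.stdout.write(chunk)
}
```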
The architecture is a hybrid Mamba-Transformer, the same design philosophy as the broader Nemotron family but applied to vision-language tasks. For video inputs, the model implements Efficient Video Sampling (EVS). EVS identifies and prunes temporally static patches, reducing token redundancy and letting the model process longer video clips with up to 2.5x higher throughput without accuracy loss.
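The intuition behind EVS can be sketched with a toy pruning pass. The snippet below only illustrates the concept of dropping temporally static patches; it is not NVIDIA's implementation, and the distance metric and threshold are arbitrary choices for the example.

```ts
// Illustrative only: compare each patch to its counterpart in the previous
// frame and keep it only if it changed noticeably. Real EVS operates on the
// model's vision tokens, not on raw arrays like this.
type Patch = number[]
type Frame = Patch[]

function l2(a: Patch, b: Patch): number {
  return Math.sqrt(a.reduce((sum, v, i) => sum + (v - b[i]) ** 2, 0))
}

function pruneStaticPatches(frames: Frame[], threshold = 0.1): Frame[] {
  return frames.map((frame, t) => {
    if (t === 0) return frame // keep the first frame in full
    return frame.filter((patch, i) => l2(patch, frames[t - 1][i]) > threshold)
  })
}
```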
Nvidia Nemotron Nano 12B V2 VL runs on vLLM and TRT-LLM inference engines. Embedding and retrieval models in the same family appear on leaderboards such as ViDoRe, MTEB, and MMTEB for visual, multimodal, and multilingual text retrieval. The NVIDIA AI Blueprint for video search and summarization (VSS) is built around this model, making it a practical foundation for production video intelligence pipelines. Announcement and agent-focused context: https://deepinfra.com/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL.
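If you self-host instead of routing through AI Gateway, a request against vLLM's OpenAI-compatible server might look like the sketch below. It assumes the model is served locally on vLLM's default port and registered under the Hugging Face id from the link above; adjust the URL and model name to match your deployment.

```ts
// Sketch: query a locally running vLLM OpenAI-compatible server.
// The endpoint, model name, and image URL are assumptions about your setup.
const response = await fetch('http://localhost:8000/v1/chat/completions', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL',
    messages: [
      {
        role: 'user',
        content: [
          { type: 'text', text: 'Summarize this page.' },
          { type: 'image_url', image_url: { url: 'https://example.com/page.png' } },
        ],
      },
    ],
  }),
})

const data = await response.json()
console.log(data.choices[0].message.content)
```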
Providers
Route requests across multiple providers. Copy a provider slug to set your preference. Visit the docs for more info. Using a provider means you agree to their terms, listed under Legal.
| Provider | Throughput | TTFT | Success rate |
|---|---|---|---|

Throughput is the P50 on live AI Gateway traffic, in tokens per second (TPS). TTFT is the P50 time to first token on live AI Gateway traffic, in milliseconds. Success rate is the direct request success rate on AI Gateway and per provider. Visit the docs for more info.
More models by NVIDIA
| Model |
|---|
What To Consider When Choosing a Provider
- Configuration: Video inputs consume significantly more tokens than static images. Estimate average clip length and token density before deploying video tasks at scale so you can budget token costs accurately, and compare the listed rates ($0.20 and $0.60) when doing so; a back-of-envelope sketch follows this list.
- Zero Data Retention: AI Gateway supports Zero Data Retention for this model via direct gateway requests (BYOK is not included). To configure this, check the documentation.
- Authentication: AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.
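The configuration point above can be made concrete with a back-of-envelope estimate. The sketch below assumes the two listed figures are per-million-token input and output rates; the frames-per-second, tokens-per-frame, and output-token numbers are placeholder guesses, not measured values.

```ts
// Back-of-envelope token-cost estimate for a video workload.
// All constants below are assumptions for illustration, not published figures.
const INPUT_RATE_PER_M = 0.2 // $ per 1M input tokens (assumed)
const OUTPUT_RATE_PER_M = 0.6 // $ per 1M output tokens (assumed)

function estimateClipCost(opts: {
  clipSeconds: number
  framesPerSecond: number
  tokensPerFrame: number
  outputTokens: number
}): number {
  const inputTokens = opts.clipSeconds * opts.framesPerSecond * opts.tokensPerFrame
  return (
    (inputTokens / 1_000_000) * INPUT_RATE_PER_M +
    (opts.outputTokens / 1_000_000) * OUTPUT_RATE_PER_M
  )
}

// Example: a 60-second clip sampled at 1 fps, ~256 tokens per frame, 500 output tokens.
console.log(estimateClipCost({ clipSeconds: 60, framesPerSecond: 1, tokensPerFrame: 256, outputTokens: 500 }))
```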
When to Use Nvidia Nemotron Nano 12B V2 VL
Best For
- Document intelligence: Extracting structured text, tables, and bounding boxes from PDFs, scanned forms, and reports
- Video understanding: Dense captioning, video Q&A, or summarization of longer clips with efficient token handling (a request sketch follows this list)
- Multimodal RAG pipelines: Retrieval and reasoning across documents with images, diagrams, and tables alongside text
- Media asset management: Workflows requiring semantic search and retrieval across visual content
- Mixed-modality agents: Agentic systems that perceive and act on text, image, and video inputs in a single model call
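For the video cases above, a request might look like the following sketch. It assumes your chosen provider accepts video file parts for this model (support and size limits vary), and the media-type field name differs slightly across AI SDK versions; the clip path and question are placeholders.

```ts
import { readFileSync } from 'node:fs'
import { generateText } from 'ai'

// Sketch of a video Q&A request. The clip path is a placeholder, and
// provider support for video file parts is an assumption.
const { text } = await generateText({
  model: 'nvidia/nemotron-nano-12b-v2-vl',
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Describe the key events in this clip, in order.' },
        { type: 'file', data: readFileSync('clip.mp4'), mediaType: 'video/mp4' },
      ],
    },
  ],
})

console.log(text)
```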
Consider Alternatives When
- Text-only workloads: Nemotron 3 Nano or Super is a better fit without the multimodal overhead
- Deep text reasoning: Super's architecture is optimized for complex text-based multi-agent planning
- Short simple videos: A lighter vision model may suffice for short video inputs
- Throughput-optimized generation: Models with latent MoE or multi-token prediction handle long-context generation more efficiently
Conclusion
Nvidia Nemotron Nano 12B V2 VL applies NVIDIA's hybrid Mamba-Transformer design to multimodal tasks, with a focus on document OCR, video understanding, and visual RAG. Open weights and training data make it a customizable foundation for document and video intelligence pipelines. Use AI Gateway for routing.
Frequently Asked Questions
What types of image inputs does Nvidia Nemotron Nano 12B V2 VL support?
The model handles image Q&A, OCR, dense captioning, and multi-image reasoning. NVIDIA highlighted the model's OCRBenchV2 results at launch. OCRBenchV2 tests text extraction from document images with complex layouts, tables, and mixed formatting.
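For structured extraction, one option is a schema-constrained call. The sketch below uses the AI SDK's generateObject with a Zod schema as an illustration; the schema, field names, and image URL are made up for the example, and structured-output support can vary by provider.

```ts
import { generateObject } from 'ai'
import { z } from 'zod'

// Illustrative schema for pulling fields out of a scanned invoice.
// Schema and URL are placeholders, not part of the model's API.
const { object } = await generateObject({
  model: 'nvidia/nemotron-nano-12b-v2-vl',
  schema: z.object({
    vendor: z.string(),
    total: z.number(),
    lineItems: z.array(z.object({ description: z.string(), amount: z.number() })),
  }),
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Extract the vendor, total, and line items from this invoice.' },
        { type: 'image', image: new URL('https://example.com/invoice.png') },
      ],
    },
  ],
})

console.log(object.lineItems)
```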
What is Efficient Video Sampling (EVS)?
EVS identifies and prunes temporally static patches in video sequences (frames where little changes between consecutive images). Removing redundant patches reduces the token count per video clip. The model can process longer videos with up to 2.5x higher throughput without sacrificing accuracy.
How does this model support RAG pipelines?
Nvidia Nemotron Nano 12B V2 VL serves as the reasoning component for visual content in the Nemotron RAG suite. Embedding models in the same family appear on ViDoRe, MTEB, and MMTEB leaderboards for visual, multimodal, and multilingual text retrieval. Together, they enable retrieval-augmented generation (RAG) across proprietary data with mixed-modality documents.
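In such a pipeline, this model typically handles the generation step over whatever the retriever returns. The sketch below assumes retrieval (embedding, indexing, search) happens elsewhere and simply passes the retrieved page images alongside the question; the helper name and URLs are hypothetical.

```ts
import { generateText } from 'ai'

// Hypothetical helper: answer a question over page images returned by a
// separate retrieval step (embedding model and vector index not shown).
async function answerFromRetrievedPages(question: string, pageImageUrls: string[]) {
  const { text } = await generateText({
    model: 'nvidia/nemotron-nano-12b-v2-vl',
    messages: [
      {
        role: 'user',
        content: [
          { type: 'text', text: `Answer using only the attached pages: ${question}` },
          ...pageImageUrls.map((url) => ({ type: 'image' as const, image: new URL(url) })),
        ],
      },
    ],
  })
  return text
}
```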
What benchmark did Nvidia Nemotron Nano 12B V2 VL highlight at launch?
OCRBenchV2. It measures document intelligence and optical character recognition on visually complex documents.
Is this model open source?
Yes. NVIDIA released model weights on Hugging Face under the NVIDIA Open Model License.
Can I use this model for multi-image reasoning tasks?
Yes. Multi-image reasoning is part of the model's task coverage, alongside image Q&A, OCR, dense captioning, and video Q&A. You can use it for tasks like comparing document versions, analyzing image sequences, or reasoning over slide decks.
Where are per-token prices listed?
Rates are listed on this page. They reflect the providers available through AI Gateway and change when providers update their pricing.