
Nvidia Nemotron Nano 12B V2 VL

nvidia/nemotron-nano-12b-v2-vl

Nvidia Nemotron Nano 12B V2 VL is NVIDIA's open 12B multimodal reasoning model. It pairs a hybrid Mamba-Transformer architecture with strong OCRBenchV2 results and specialized support for document intelligence, video understanding, and RAG pipelines.

Reasoning · Tool Use · Vision (Image)
index.ts
import { streamText } from 'ai'

const result = streamText({
  model: 'nvidia/nemotron-nano-12b-v2-vl',
  prompt: 'Why is the sky blue?',
})

// Consume the stream as it arrives
for await (const chunk of result.textStream) {
  process.stdout.write(chunk)
}

What To Consider When Choosing a Provider

  • Zero Data Retention

AI Gateway supports Zero Data Retention for this model via direct gateway requests (bring-your-own-key requests are not covered). See the documentation to configure it.

    Authentication

    AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.

Video inputs consume significantly more tokens than static images. Estimate average clip length and token density before deploying video tasks at scale so you can budget token costs accurately, and weigh the listed rates ($0.2 input, $0.6 output) against that estimate.
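A minimal budgeting sketch under assumed numbers — the tokens-per-second density and the $0.2-per-million input rate are illustrative placeholders, not measured values; substitute your own clip statistics and the rates listed on this page:

```typescript
// Hypothetical cost estimate for one video clip. tokensPerSecond and
// ratePerMillionTokens are assumptions; replace them with your measured
// token density and the actual listed rate.
function estimateClipCostUsd(
  clipSeconds: number,
  tokensPerSecond: number,
  ratePerMillionTokens: number,
): number {
  const tokens = clipSeconds * tokensPerSecond
  return (tokens / 1_000_000) * ratePerMillionTokens
}

// A 60-second clip at an assumed 200 tokens/second, billed at $0.2/M input:
const clipCost = estimateClipCostUsd(60, 200, 0.2)
```

Multiplying that per-clip figure by expected daily clip volume gives a rough ceiling before committing to video workloads at scale.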

When to Use Nvidia Nemotron Nano 12B V2 VL

Best For

  • Document intelligence:

    Extracting structured text, tables, and bounding boxes from PDFs, scanned forms, and reports

  • Video understanding:

    Dense captioning, video Q&A, or summarization of longer clips with efficient token handling

  • Multimodal RAG pipelines:

    Retrieval and reasoning across documents with images, diagrams, and tables alongside text

  • Media asset management:

    Workflows requiring semantic search and retrieval across visual content

  • Mixed-modality agents:

    Agentic systems that perceive and act on text, image, and video inputs in a single model call

Consider Alternatives When

  • Text-only workloads:

Nemotron 3 Nano or Super is a better fit, without the multimodal overhead

  • Deep text reasoning:

    Super's architecture is optimized for complex text-based multi-agent planning

  • Short simple videos:

    A lighter vision model may suffice for short video inputs

  • Throughput-optimized generation:

    Models with latent MoE or multi-token prediction handle long-context generation more efficiently

Conclusion

Nvidia Nemotron Nano 12B V2 VL applies NVIDIA's hybrid Mamba-Transformer design to multimodal tasks, with a focus on document OCR, video understanding, and visual RAG. Open weights and training data make it a customizable foundation for document and video intelligence pipelines. AI Gateway handles routing when you call the model.

FAQ

What vision tasks does the model handle?

The model handles image Q&A, OCR, dense captioning, and multi-image reasoning. Nvidia Nemotron Nano 12B V2 VL cited OCRBenchV2 results at launch. OCRBenchV2 tests text extraction from document images with complex layouts, tables, and mixed formatting.

How does Efficient Video Sampling (EVS) work?

Efficient Video Sampling (EVS) identifies and prunes temporally static patches in video sequences (frames where little changes between consecutive images). Removing redundant patches reduces the token count per video clip, so the model can process longer videos with up to 2.5x higher throughput without sacrificing accuracy.
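The pruning idea can be sketched roughly as follows. Representing each patch by a single number and using a fixed change threshold are simplifying assumptions for illustration, not the model's actual mechanism:

```typescript
// Illustrative EVS-style pruning: keep every patch of the first frame, then
// keep a patch in later frames only if it changed noticeably since the
// previous frame. Static patches are dropped and consume no tokens.
type Frame = number[] // one summary value per patch (e.g. mean intensity)

function prunedPatchCount(frames: Frame[], threshold: number): number {
  let kept = frames[0].length // all patches of the first frame are kept
  for (let t = 1; t < frames.length; t++) {
    for (let p = 0; p < frames[t].length; p++) {
      // a patch survives pruning only if it changed between frames
      if (Math.abs(frames[t][p] - frames[t - 1][p]) > threshold) kept++
    }
  }
  return kept
}
```

In this toy setup, a clip whose second half is completely static contributes almost no extra patches after the first frame, which is where the token savings come from.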

How does the model fit into RAG pipelines?

Nvidia Nemotron Nano 12B V2 VL serves as the reasoning component for visual content in the Nemotron RAG suite. Embedding models in the same family appear on ViDoRe, MTEB, and MMTEB leaderboards for visual, multimodal, and multilingual text retrieval. Together, they enable retrieval-augmented generation (RAG) across proprietary data with mixed-modality documents.
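A minimal sketch of the retrieval half of such a pipeline. The `Chunk` shape and the scores are hypothetical stand-ins for what an embedding model and vector store would produce; only the overall flow — rank mixed-modality chunks, hand the top ones to the VL model — comes from this page:

```typescript
// Hypothetical shape for a retrieved mixed-modality chunk: text plus an
// optional page image, with a similarity score from an embedding model.
type Chunk = { text: string; imageUrl?: string; score: number }

// Pick the k highest-scoring chunks to feed the VL model as context.
function topK(chunks: Chunk[], k: number): Chunk[] {
  return [...chunks].sort((a, b) => b.score - a.score).slice(0, k)
}

const retrieved = topK(
  [
    { text: 'Q3 revenue table', imageUrl: 'https://example.com/p4.png', score: 0.91 },
    { text: 'Cover page', score: 0.12 },
    { text: 'Methodology diagram', imageUrl: 'https://example.com/p7.png', score: 0.77 },
  ],
  2,
)
// The two surviving chunks would then be passed to the model as text and
// image content parts in a single prompt.
```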

Which benchmark does the model cite for document intelligence?

OCRBenchV2. It measures document intelligence and optical character recognition on visually complex documents.

Are the model weights open?

Yes. NVIDIA released the model weights on Hugging Face under the NVIDIA Open Model License.

Can the model reason over multiple images?

Yes. Multi-image reasoning is part of the model's task coverage, alongside image Q&A, OCR, dense captioning, and video Q&A. You can use it for tasks like comparing document versions, analyzing image sequences, or reasoning over slide decks.
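A hedged sketch of a multi-image request: one user message carrying two images for comparison, in the AI SDK's multimodal content-part shape. The URLs are placeholders; passing `messages` to `generateText` or `streamText` with `model: 'nvidia/nemotron-nano-12b-v2-vl'` would run the comparison:

```typescript
// Build one user message with a text part and two image parts, following
// the AI SDK's content-part message format. URLs are placeholders.
const messages = [
  {
    role: 'user' as const,
    content: [
      {
        type: 'text' as const,
        text: 'What changed between these two versions of the document?',
      },
      { type: 'image' as const, image: new URL('https://example.com/v1.png') },
      { type: 'image' as const, image: new URL('https://example.com/v2.png') },
    ],
  },
]
```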

How is pricing determined?

Rates are listed on this page. They reflect the providers routing through AI Gateway and shift whenever those providers update their pricing.