NVIDIA Nemotron 3 Super 120B A12B
NVIDIA Nemotron 3 Super 120B A12B is NVIDIA's hybrid Mamba-Transformer MoE with 120B total and 12B active parameters, built for complex multi-agent applications and featuring latent MoE and multi-token prediction.
import { streamText } from 'ai'

const result = streamText({
  model: 'nvidia/nemotron-3-super-120b-a12b',
  prompt: 'Why is the sky blue?',
})

What To Consider When Choosing a Provider
Zero Data Retention
AI Gateway supports Zero Data Retention for this model via direct gateway requests (BYOK is not included). To configure this, check the documentation.
Authentication
AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.
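As a minimal sketch of the two credential paths described above, the helper below resolves an API key first and falls back to an OIDC token. The environment variable names (`AI_GATEWAY_API_KEY`, `VERCEL_OIDC_TOKEN`) and the helper itself are illustrative assumptions, not part of this page.

```typescript
// Hypothetical sketch: resolve a gateway credential from the environment.
// The variable names below are assumptions for illustration; the gateway
// accepts either an API key or an OIDC token, as noted above.
function resolveGatewayAuth(
  env: Record<string, string | undefined>,
): { scheme: 'api-key' | 'oidc'; token: string } {
  if (env.AI_GATEWAY_API_KEY) {
    // Prefer an explicit API key when one is configured.
    return { scheme: 'api-key', token: env.AI_GATEWAY_API_KEY }
  }
  if (env.VERCEL_OIDC_TOKEN) {
    // Otherwise fall back to an OIDC token from the platform.
    return { scheme: 'oidc', token: env.VERCEL_OIDC_TOKEN }
  }
  throw new Error('No gateway credential found')
}
```

Either way, the model call itself stays the same; only the credential source differs.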
NVIDIA Nemotron 3 Super 120B A12B's multi-agent orientation means it works best as the planning and reasoning backbone of a pipeline in which lighter models handle individual steps. Evaluate how your tasks decompose before choosing a tier, and weigh the listed input ($0.15) and output ($0.65) rates against lighter alternatives.
When to Use NVIDIA Nemotron 3 Super 120B A12B
Best For
Complex multi-agent applications:
Software development pipelines or cybersecurity triaging that require deep planning across long contexts
Context explosion workloads:
Multi-agent systems with up to 15x the token volume of standard chats that cause goal drift with smaller models
Dense technical problem-solving:
Tasks where higher parameter count provides reasoning headroom
Super plus nano pattern:
Agentic pipelines pairing Super for complex decisions with Nano for efficient individual steps
Fully open model requirement:
Teams that need weights and recipes for enterprise customization, data control, or reproducibility
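The Super-plus-Nano pattern above can be sketched as a simple router that sends planning-heavy or long-context steps to Super and everything else to a lighter tier. The step shape, the thresholds, and the Nano model id are assumptions for illustration; only the Super id appears on this page.

```typescript
// Illustrative routing sketch for the Super-plus-Nano pattern.
type Step = {
  description: string
  contextTokens: number
  requiresPlanning: boolean
}

const SUPER = 'nvidia/nemotron-3-super-120b-a12b'
const NANO = 'nvidia/nemotron-3-nano' // hypothetical id for illustration

function pickModel(step: Step): string {
  // Route long-context or planning-heavy steps to Super; keep routine
  // steps on the cheaper, more throughput-efficient tier.
  return step.requiresPlanning || step.contextTokens > 32_000 ? SUPER : NANO
}
```

The heuristic here is a stand-in; in practice the split often comes from the pipeline's own task graph rather than a token-count threshold.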
Consider Alternatives When
Simpler task steps:
Nemotron 3 Nano is more throughput-efficient for lighter workloads
Vision-language inputs:
Super is text-only; Nemotron Nano 12B v2 VL supports multimodal inputs
Cost-first constraints:
A lighter model may deliver acceptable quality at lower cost per token
Conclusion
NVIDIA Nemotron 3 Super 120B A12B combines latent MoE for expert specialization and multi-token prediction for inference speedups. Route requests through AI Gateway as the planning and reasoning backbone for complex multi-agent applications at scale.
FAQ
What is latent MoE?
Latent MoE compresses token embeddings into a smaller latent space before routing. This reduces per-expert compute cost and lets NVIDIA Nemotron 3 Super 120B A12B consult 4x as many experts for the same inference budget. Distinct experts activate for code generation, SQL logic, and natural language without the overhead of running them all densely.
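The idea can be illustrated with a toy router: project the token embedding into a smaller latent space, score expert centroids there, and keep the top-k. The dimensions, the dot-product scoring, and both functions are simplifications for illustration, not NVIDIA's implementation.

```typescript
// Toy latent-routing sketch: routing happens in a compressed latent space,
// so scoring each expert is cheaper than in the full embedding space.
function project(x: number[], W: number[][]): number[] {
  // W has shape [latentDim][embedDim]; returns the latent vector W·x.
  return W.map((row) => row.reduce((s, w, i) => s + w * x[i], 0))
}

function topKExperts(latent: number[], experts: number[][], k: number): number[] {
  // Score each expert centroid by dot product in the latent space.
  const scores = experts.map((e, idx) => ({
    idx,
    score: e.reduce((s, w, i) => s + w * latent[i], 0),
  }))
  scores.sort((a, b) => b.score - a.score)
  return scores.slice(0, k).map((s) => s.idx)
}
```

Because the routing cost scales with the latent dimension rather than the embedding dimension, the same budget covers more experts per token.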
What is multi-token prediction (MTP)?
MTP trains the model to predict multiple future tokens in a single forward pass. At inference, the MTP heads provide draft tokens that can be verified in parallel, acting as built-in speculative decoding. This delivers wall-clock speedups for structured generation like code and tool calls, without requiring a separate draft model.
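The verification step behind this speculative-decoding behavior can be sketched as accepting the longest prefix of the draft that the main model agrees with. The token ids and the pre-computed `verified` array are stand-ins; a real implementation re-scores the draft in one batched forward pass.

```typescript
// Toy sketch of draft verification: keep drafted tokens until the first
// disagreement with what the main model would have emitted.
function acceptDraft(draft: number[], verified: number[]): number[] {
  const accepted: number[] = []
  for (let i = 0; i < draft.length; i++) {
    if (draft[i] !== verified[i]) break // first mismatch ends acceptance
    accepted.push(draft[i])
  }
  return accepted
}
```

When most drafted tokens are accepted, several tokens land per main-model pass, which is where the wall-clock speedup comes from.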
How should Super and Nano be combined in a pipeline?
NVIDIA describes using Nano for straightforward individual steps in a pipeline and Super for complex decisions requiring deep reasoning. In software development, for example, Nano might handle routine merge requests while Super tackles tasks that require understanding a full codebase. This pattern distributes compute across task difficulty.
Why does a 256K context window matter for multi-agent workloads?
Multi-agent systems generate high token volume (up to 15x that of standard chats) from tool outputs, reasoning steps, and history resent at each turn. A window of 256K tokens lets agents keep full session history, large codebases, and retrieved context in a single pass. This reduces goal drift from context truncation.
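A back-of-envelope check makes the arithmetic above concrete: multiply a baseline chat's token volume by the 15x agent multiplier and compare it to the 256K (262,144-token) window. The 8K and 20K baselines in the usage note are assumptions for illustration.

```typescript
// Back-of-envelope context budgeting for a multi-agent session.
const WINDOW = 256 * 1024 // 262,144 tokens

function agentSessionFits(standardChatTokens: number, multiplier = 15): boolean {
  // A multi-agent session can run ~15x a standard chat's token volume.
  return standardChatTokens * multiplier <= WINDOW
}
```

Under these assumptions, an 8K-token chat scaled 15x (120K tokens) still fits, while a 20K-token chat scaled 15x (300K tokens) would already overflow the window and force truncation.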
Rates are listed on this page. They reflect the providers currently routing through AI Gateway and change when those providers update their pricing.