About Nemotron 3 Nano 30B A3B

NVIDIA announced Nemotron 3 Nano 30B A3B on December 15, 2025 as the first model in the Nemotron 3 family. The core idea is architectural efficiency at scale: the model's 30B total parameters provide a broad knowledge base, but only 3B activate for any given token, which keeps inference cost and speed in the range of much smaller models.

Three layer types interleave throughout the architecture. Mamba-2 layers handle sequence processing with linear-time complexity. This makes the context window of 262.1K tokens feasible without the quadratic memory growth of pure attention. Transformer attention layers appear at strategic depths to maintain precise associative recall: the ability to pick out a specific fact from a large context. Mixture-of-experts (MoE) routing selects which expert parameters activate for each token, keeping compute proportional to the 3B active count rather than the full 30B.
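To make the routing idea concrete, here is a minimal sketch of top-k expert selection in plain Python. It is an illustration under stated assumptions, not NVIDIA's implementation: the expert count (8), the number of active experts per token (2), the scalar "experts", and the helper names `route_token` and `moe_forward` are all hypothetical, and renormalizing the softmax over only the selected experts is one common choice among several.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, top_k):
    """Return (expert_index, weight) pairs for the top_k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    # Renormalize over the chosen experts only (an assumption of this sketch).
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))

def moe_forward(x, experts, router_logits, top_k=2):
    """Weighted sum of the chosen experts' outputs; unchosen experts never run."""
    return sum(w * experts[i](x) for i, w in route_token(router_logits, top_k))

# Hypothetical toy setup: 8 scalar "experts", 2 active per token.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
logits = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0, -0.5, 1.0]

picked = route_token(logits, top_k=2)
output = moe_forward(1.0, experts, logits, top_k=2)
```

Only two of the eight toy experts are evaluated for this token, so compute scales with the number of active experts rather than the total. That is the same relationship the 3B-active versus 30B-total figures describe.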

Weights and recipes are available under the NVIDIA Open Model License. Deployment cookbooks for vLLM, SGLang, and TensorRT-LLM are also provided. Overview and techniques: https://deepinfra.com/nvidia/Nemotron-3-Nano-30B-A3B.

What To Consider When Choosing a Provider

Configuration: With a context window of 262.1K tokens, entire codebases or multi-document evidence sets fit in a single call. Filling the window is possible, but plan context usage carefully and model the cost and latency implications ahead of time. Compare the listed prices, $0.05 and $0.24, when budgeting.
Zero Data Retention: AI Gateway supports Zero Data Retention for this model via direct gateway requests (BYOK is not included). To configure this, check the documentation.
Authentication: AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.
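The cost modeling suggested under Configuration can be sketched with a few lines of arithmetic. The text does not say what the $0.05 and $0.24 figures measure, so this sketch assumes they are USD per million input tokens and per million output tokens respectively. The function name `estimate_cost_usd` and the 2,000-token output length are also hypothetical; substitute the rates your provider actually lists.

```python
def estimate_cost_usd(input_tokens, output_tokens,
                      input_rate_per_m, output_rate_per_m):
    """Estimate the cost of one call from token counts and per-million-token rates."""
    return (input_tokens * input_rate_per_m
            + output_tokens * output_rate_per_m) / 1_000_000

# Assumption: the two figures in the text are USD per 1M input / output tokens.
INPUT_RATE = 0.05
OUTPUT_RATE = 0.24

# One call that fills the 262.1K-token window and returns a short answer.
full_window = estimate_cost_usd(262_100, 2_000, INPUT_RATE, OUTPUT_RATE)

# The same call with the context trimmed to 32K tokens.
trimmed = estimate_cost_usd(32_000, 2_000, INPUT_RATE, OUTPUT_RATE)

print(f"full window: ${full_window:.6f}  trimmed: ${trimmed:.6f}")
```

Multiplying the per-call figure by expected request volume gives a rough budget. Latency is not modeled here and should be measured against the chosen provider.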

When to Use Nemotron 3 Nano 30B A3B

Best for

Concurrent multi-agent systems: Running many lightweight agents where per-agent throughput matters
Long-context tasks: Holding entire codebases, extended session histories, or multi-document sets in one call
Agentic tool-calling workflows: Multi-step pipelines with chained actions

Consider alternatives when

Maximum reasoning depth: Nemotron 3 Super (120B/12B active) handles complex multi-agent planning
Vision-language tasks: Nemotron Nano 12B v2 VL is the multimodal option
Smaller context needs: A 128K context window is sufficient, so the 262.1K-token capacity would go unused
Compact dense reasoning: Nemotron Nano 9B v2 targets a dense model profile

Conclusion

Nemotron 3 Nano 30B A3B delivers the throughput of a small model with the knowledge breadth of a large one. Its hybrid Mamba-Transformer MoE architecture and context window of 262.1K tokens suit tasks that require holding large amounts of information in a single pass. Use AI Gateway to route traffic with unified auth.
