NVIDIA announced Nemotron 3 Nano 30B A3B on December 1, 2024, as the first model in the Nemotron 3 family. The core idea is architectural efficiency at scale: 30B total parameters provide a broad knowledge base, but only 3B are active for any given token, which keeps inference cost and latency in the range of a much smaller dense model.
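To make that efficiency claim concrete, here is a back-of-the-envelope comparison, a minimal sketch assuming the common rule of thumb of roughly 2 forward-pass FLOPs per active parameter per token; the 30B total / 3B active split comes from the model name, while the rule of thumb and the dense baseline are illustrative assumptions rather than published benchmarks.

```python
# Rough per-token compute comparison (rule of thumb: ~2 FLOPs per active parameter).
# The 30B total / 3B active split is from the model name; the dense 30B baseline
# is an illustrative assumption, not an official figure.
TOTAL_PARAMS = 30e9
ACTIVE_PARAMS = 3e9

def flops_per_token(active_params: float) -> float:
    """Approximate forward-pass FLOPs per token for a decoder-only model."""
    return 2 * active_params

print(f"Dense 30B model : {flops_per_token(TOTAL_PARAMS):.1e} FLOPs/token")
print(f"Nemotron 3 Nano : {flops_per_token(ACTIVE_PARAMS):.1e} FLOPs/token")
# ~10x fewer FLOPs per token than a dense 30B model, roughly matching a dense 3B model,
# while the router can still draw on the full 30B parameter pool.
```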
Three layer types interleave throughout the architecture. Mamba-2 layers handle sequence processing with linear-time complexity. This makes the context window of 262.1K tokens feasible without the quadratic memory growth of pure attention. Transformer attention layers appear at strategic depths to maintain precise associative recall: the ability to pick out a specific fact from a large context. Mixture-of-experts (MoE) routing selects which expert parameters activate for each token, keeping compute proportional to the 3B active count rather than the full 30B.
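The routing step can be sketched as a small top-k mixture-of-experts layer: a gating network scores the experts per token, only the top-k experts run, and their outputs are combined using the gate weights. The PyTorch sketch below is a generic illustration under assumed sizes (expert count, top-k value, hidden dimensions); it is not Nemotron 3's actual configuration or implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal top-k mixture-of-experts layer (illustrative, not NVIDIA's implementation).

    Only `top_k` of `num_experts` expert MLPs run per token, so per-token compute
    scales with the active experts rather than the full parameter pool.
    """

    def __init__(self, d_model: int = 512, d_ff: int = 2048,
                 num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) -> flatten to individual tokens for routing.
        tokens = x.reshape(-1, x.shape[-1])
        gate_logits = self.router(tokens)                     # (tokens, num_experts)
        weights, idx = gate_logits.topk(self.top_k, dim=-1)   # pick top-k experts per token
        weights = F.softmax(weights, dim=-1)

        out = torch.zeros_like(tokens)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                          # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(tokens[mask])
        return out.reshape(x.shape)

moe = TopKMoE()
y = moe(torch.randn(2, 16, 512))
print(y.shape)  # torch.Size([2, 16, 512])
```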
Weights and recipes are available under the NVIDIA Open Model License. Deployment cookbooks for vLLM, SGLang, and TensorRT-LLM are also provided. Overview and techniques: https://deepinfra.com/nvidia/Nemotron-3-Nano-30B-A3B.
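For local serving, a minimal vLLM sketch might look like the following. The Hugging Face model id is assumed from the page URL above and has not been verified, and settings such as parallelism or context length should be adjusted for your hardware and the published checkpoint.

```python
from vllm import LLM, SamplingParams

# Model id assumed from the overview URL above; adjust if the published checkpoint differs.
llm = LLM(model="nvidia/Nemotron-3-Nano-30B-A3B", trust_remote_code=True)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(
    ["Explain hybrid Mamba-attention architectures in two sentences."],
    params,
)
print(outputs[0].outputs[0].text)
```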