NVIDIA released Nemotron 3 Super 120B A12B on March 18, 2026 as the second model in the Nemotron 3 family, following Nano. It has 120B total parameters, of which 12B are active per token. The hybrid Mamba-Transformer MoE backbone interleaves Mamba-2 layers for long-sequence processing, Transformer attention layers for precise recall, and MoE layers for compute efficiency. The result is higher throughput than the previous Nemotron Super generation.
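To make the interleaving concrete, here is a minimal PyTorch sketch of such a hybrid stack. Everything in it is an illustrative assumption: the widths, the layer schedule, and the layer bodies themselves (real Mamba-2 uses a selective state-space scan, not a gated convolution). Only the pattern of mixing the three layer types is the point.

```python
import torch
import torch.nn as nn

D_MODEL = 512  # illustrative width, not the real model's

class MambaStandIn(nn.Module):
    """Stand-in for a Mamba-2 layer: a causal depthwise conv plus gating.
    Real Mamba-2 uses a selective state-space scan; this only plays the
    linear-time sequence-mixing role in the stack."""
    def __init__(self, d):
        super().__init__()
        self.conv = nn.Conv1d(d, d, kernel_size=4, padding=3, groups=d)
        self.gate = nn.Linear(d, d)

    def forward(self, x):                       # x: (batch, seq, d)
        h = self.conv(x.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)
        return x + h * torch.sigmoid(self.gate(x))

class AttnStandIn(nn.Module):
    """Causal self-attention block: precise recall over the full context."""
    def __init__(self, d):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, num_heads=8, batch_first=True)

    def forward(self, x):
        n = x.size(1)
        causal = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        out, _ = self.attn(x, x, x, attn_mask=causal, need_weights=False)
        return x + out

class MoEStandIn(nn.Module):
    """Tiny top-1 MoE feed-forward: only one expert runs per token."""
    def __init__(self, d, n_experts=4):
        super().__init__()
        self.router = nn.Linear(d, n_experts)
        self.experts = nn.ModuleList(nn.Linear(d, d) for _ in range(n_experts))

    def forward(self, x):
        choice = self.router(x).argmax(-1)      # (batch, seq)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = choice == e
            if mask.any():
                out[mask] = expert(x[mask])
        return x + out

# Illustrative schedule: mostly Mamba for long-sequence mixing, with
# periodic attention for recall and MoE for cheap extra capacity.
pattern = [MambaStandIn, MambaStandIn, AttnStandIn, MoEStandIn] * 2
stack = nn.Sequential(*(cls(D_MODEL) for cls in pattern))

print(stack(torch.randn(1, 16, D_MODEL)).shape)  # torch.Size([1, 16, 512])
```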
Two architectural innovations distinguish Super from Nano. First, latent MoE: before routing, token embeddings are compressed into a low-rank latent space. This lets the model consult 4x as many expert specialists at the same inference cost. Finer-grained routing allows distinct experts to activate for different subtasks (Python syntax, SQL logic, multi-hop reasoning) without paying the compute cost of running them all. Second, multi-token prediction (MTP): the model predicts multiple future tokens in a single forward pass. MTP strengthens reasoning during training and provides built-in speculative decoding at inference, yielding up to 3x speedups on structured generation tasks like code and tool calls. Sketches of both mechanisms follow below.
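First, a minimal sketch of the latent-routing idea. The dimensions here (d_model=512, a 64-dimensional latent, 32 experts, top-4 routing) are assumptions for illustration, not the model's actual configuration. The payoff it demonstrates: because each expert operates in the small latent space rather than the full model width, experts are far cheaper individually, so many more fit into the same FLOP budget.

```python
import torch
import torch.nn as nn

class LatentRoutedMoE(nn.Module):
    """Latent MoE sketch: compress tokens into a low-rank latent space,
    route and run experts there, then project back. Since each expert
    works in d_latent rather than d_model, experts are much cheaper,
    so many more fit in the same FLOP budget."""
    def __init__(self, d_model=512, d_latent=64, n_experts=32, top_k=4):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent)    # compress before routing
        self.up = nn.Linear(d_latent, d_model)      # decompress after experts
        self.router = nn.Linear(d_latent, n_experts)
        self.experts = nn.ModuleList(
            nn.Linear(d_latent, d_latent) for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                           # x: (batch, seq, d_model)
        z = self.down(x)
        weights = self.router(z).softmax(-1)
        topw, topi = weights.topk(self.top_k, dim=-1)
        out = torch.zeros_like(z)
        for k in range(self.top_k):                 # run only selected experts
            idx, w = topi[..., k], topw[..., k].unsqueeze(-1)
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():
                    out[mask] += w[mask] * expert(z[mask])
        return x + self.up(out)

moe = LatentRoutedMoE()
print(moe(torch.randn(2, 8, 512)).shape)  # torch.Size([2, 8, 512])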
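```

Second, a sketch of MTP, again with assumed sizes (three future-token heads, a 32,000-token vocabulary; the model's real head count is not stated here). Each extra head predicts one more step ahead from the same hidden state, which densifies the training signal and supplies draft tokens for speculative decoding at inference.

```python
import torch
import torch.nn as nn

class MTPHeads(nn.Module):
    """MTP sketch: from the hidden state at position t, head k predicts
    the token at position t + 1 + k. Training against several future
    tokens gives a denser signal; at inference the extra heads double
    as a built-in draft model for speculative decoding."""
    def __init__(self, d_model=512, vocab_size=32000, n_future=3):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Linear(d_model, vocab_size) for _ in range(n_future)
        )

    def forward(self, h):                      # h: (batch, seq, d_model)
        return [head(h) for head in self.heads]

# Speculative decoding in outline: draft n_future tokens from one forward
# pass, then verify them against the main next-token head in a single
# batched pass, keeping the longest accepted prefix.
h = torch.randn(1, 16, 512)                    # stand-in backbone hidden states
drafts = [logits[:, -1].argmax(-1) for logits in MTPHeads()(h)]
print(drafts)                                  # three proposed future token ids
```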
On PinchBench (a benchmark evaluating LLMs as the planning brain of an OpenClaw agent), Nemotron 3 Super 120B A12B scores 85.6%. Model card: https://docs.aws.amazon.com/en_us/bedrock/latest/userguide/model-card-nvidia-nemotron-super-3-120b.html.