NVIDIA announced Nemotron 3 Nano 30B A3B on December 1, 2024, as the first model in the Nemotron 3 family. The core idea is architectural efficiency at scale: 30B total parameters provide a broad knowledge base, but only 3B are active for any given token, which keeps inference cost and latency in the range of a much smaller dense model.
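To make that efficiency claim concrete, here is a back-of-the-envelope comparison, a minimal sketch assuming the common rule of thumb of roughly 2 forward-pass FLOPs per active parameter per token; the 30B total / 3B active split comes from the model name, while the rule of thumb and the dense baseline are illustrative assumptions rather than published benchmarks.

```python
# Rough per-token compute comparison (rule of thumb: ~2 FLOPs per active parameter).
# The 30B total / 3B active split is from the model name; the dense 30B baseline
# is an illustrative assumption, not an official figure.
TOTAL_PARAMS = 30e9
ACTIVE_PARAMS = 3e9

def flops_per_token(active_params: float) -> float:
    """Approximate forward-pass FLOPs per token for a decoder-only model."""
    return 2 * active_params

print(f"Dense 30B model : {flops_per_token(TOTAL_PARAMS):.1e} FLOPs/token")
print(f"Nemotron 3 Nano : {flops_per_token(ACTIVE_PARAMS):.1e} FLOPs/token")
# ~10x fewer FLOPs per token than a dense 30B model, roughly matching a dense 3B model,
# while the router can still draw on the full 30B parameter pool.
```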
Three layer types interleave throughout the architecture. Mamba-2 layers handle sequence processing with linear-time complexity. This makes the context window of 262.1K tokens feasible without the quadratic memory growth of pure attention. Transformer attention layers appear at strategic depths to maintain precise associative recall: the ability to pick out a specific fact from a large context. Mixture-of-experts (MoE) routing selects which expert parameters activate for each token, keeping compute proportional to the 3B active count rather than the full 30B.
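The routing step can be sketched as a small top-k mixture-of-experts layer: a gating network scores the experts per token, only the top-k experts run, and their outputs are combined using the gate weights. The PyTorch sketch below is a generic illustration under assumed sizes (expert count, top-k value, hidden dimensions); it is not Nemotron 3's actual configuration or implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal top-k mixture-of-experts layer (illustrative, not NVIDIA's implementation).

    Only `top_k` of `num_experts` expert MLPs run per token, so per-token compute
    scales with the active experts rather than the full parameter pool.
    """

    def __init__(self, d_model: int = 512, d_ff: int = 2048,
                 num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) -> flatten to individual tokens for routing.
        tokens = x.reshape(-1, x.shape[-1])
        gate_logits = self.router(tokens)                     # (tokens, num_experts)
        weights, idx = gate_logits.topk(self.top_k, dim=-1)   # pick top-k experts per token
        weights = F.softmax(weights, dim=-1)

        out = torch.zeros_like(tokens)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                          # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(tokens[mask])
        return out.reshape(x.shape)

moe = TopKMoE()
y = moe(torch.randn(2, 16, 512))
print(y.shape)  # torch.Size([2, 16, 512])
```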
Weights and recipes are available under the NVIDIA Open Model License. Deployment cookbooks for vLLM, SGLang, and TensorRT-LLM are also provided. Overview and techniques: https://deepinfra.com/nvidia/Nemotron-3-Nano-30B-A3B.
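For local serving, a minimal vLLM sketch might look like the following. The Hugging Face model id is assumed from the page URL above and has not been verified, and settings such as parallelism or context length should be adjusted for your hardware and the published checkpoint.

```python
from vllm import LLM, SamplingParams

# Model id assumed from the overview URL above; adjust if the published checkpoint differs.
llm = LLM(model="nvidia/Nemotron-3-Nano-30B-A3B", trust_remote_code=True)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(
    ["Explain hybrid Mamba-attention architectures in two sentences."],
    params,
)
print(outputs[0].outputs[0].text)
```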