About Nemotron 3 Ultra

NVIDIA released Nemotron 3 Ultra on June 4, 2026 as the largest model in the Nemotron 3 family, completing the tier above Nano and Super. It carries 550B total parameters with 55B active per token, and NVIDIA positions it as the reasoning and orchestration layer for long-running agent workflows: the model that handles planning, synthesis, and verification while lighter models execute routine steps.

The architecture interleaves three layer types. Mamba layers process long sequences with linear-time complexity, which keeps a context window of 1M tokens practical. Transformer attention layers appear at select depths to preserve precise recall from large contexts. Latent mixture-of-experts (MoE) routing compresses token embeddings into a smaller latent space before selecting experts, so distinct specialists activate for reasoning, coding, and tool calls without dense compute. Multi-token prediction (MTP) layers predict several future tokens per forward pass, providing built-in speculative decoding for long outputs.

Nemotron 3 Ultra scores 91% on PinchBench, 82% on IFBench, and 95% on Ruler at 1M tokens. Weights, data, and recipes are released under the Linux Foundation's permissive OpenMDW-1.1 license. Full details: https://www.together.ai/models/nvidia-nemotron-3-ultra.

What To Consider When Choosing a Provider

Configuration: Long-running agent sessions accumulate tokens quickly, and a context window of 1M tokens makes it easy to carry everything forward. Budget for that before you scale. Compare $0.37 and $1.08, and use prompt caching at $0.12 for repeated prefixes like system prompts and tool definitions.
Configuration: Output is capped at 65K tokens per request, so plan chunking for very long generations. Nemotron 3 Ultra is the flagship tier of the Nemotron 3 family. Reserve it for the planning and verification calls that need the depth, and route routine steps to smaller Nemotron 3 models.
Zero Data Retention: AI Gateway supports Zero Data Retention for this model via direct gateway requests (BYOK is not included). To configure this, check the documentation.
Authentication: AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.

When to Use Nemotron 3 Ultra

Best for

Agent Orchestration Backbones: Planning, synthesis, and verification steps in long-running multi-agent pipelines
Long-Horizon Coding Agents: Multi-step software tasks that span large codebases and extended tool-call sequences
Deep Research Workflows: Gathering, cross-checking, and synthesizing evidence across many sources in one context
Full-Context Session Handling: Keeping complete agent histories, codebases, or document sets in a single pass
Open-Model Requirements: Teams that need open weights and permissive licensing for governance or reproducibility

Consider alternatives when

Lightweight Task Execution: Nemotron 3 Nano handles routine pipeline steps at far lower compute
Mid-Tier Agent Planning: Nemotron 3 Super covers complex multi-agent decisions at a smaller footprint
Vision or Multimodal Inputs: Nemotron 3 Ultra is a text reasoning model, so image and video tasks need a vision-language model
Cost-First Workloads: A smaller model may deliver acceptable quality at lower per-token rates

Conclusion

Nemotron 3 Ultra closes out the Nemotron 3 family as its reasoning and orchestration tier, pairing latent MoE efficiency with a context window of 1M tokens. Route it through AI Gateway with unified auth and billing, and call it with the AI SDK or through Chat Completions, Responses, Messages, and other API formats.

Agent Stack

Core Platform

Tools

Learn

Build

Explore

Nemotron 3 Ultra

Playground

Providers

More models by NVIDIA