Nova Micro

Nova Micro is a text-only model built for latency-sensitive applications at scale. It delivers high-throughput inference at per-token prices below those of the multimodal Nova models in the same generation.

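index.ts
import { streamText } from 'ai'

const result = streamText({
  model: 'amazon/nova-micro',
  prompt: 'Why is the sky blue?'
})

// Print each text chunk as it arrives.
for await (const textPart of result.textStream) {
  process.stdout.write(textPart)
}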

Playground

Try out Nova Micro by Amazon. Usage is billed to your team at API rates. Free users (those who haven't made a payment) get $5 of credits every 30 days.

Providers

Route requests across multiple providers. Copy a provider slug to set your preference. Visit the docs for more info. Using a provider means you agree to their terms, listed under Legal.

Provider        Context  Latency  Throughput  Input    Output   Cache  Web Search  Per Query  Capabilities  ZDR  No Training  Release Date
Amazon Bedrock  128K     0.3s                 $0.04/M  $0.14/M         —           —
Throughput

P50 throughput on live AI Gateway traffic, in tokens per second (TPS). Visit the docs for more info.

Latency

P50 time to first token (TTFT) on live AI Gateway traffic, in milliseconds. View the docs for more info.

Uptime

Direct request success rate on AI Gateway and per-provider. Visit the docs for more info.

More models by Amazon

Model  Context  Latency  Throughput  Input    Output   Cache                    Web Search  Per Query  Capabilities  Providers  ZDR  No Training  Release Date
       300K     0.4s                 $0.06/M  $0.24/M                           —           —                        Bedrock                      12/03/2024
       300K     0.3s     141tps      $0.80/M  $3.20/M                           —           —                        Bedrock                      12/03/2024
       1M       0.3s     233tps      $0.30/M  $2.50/M  Read: $0.07/M, Write: —  —           —                        Bedrock                      12/01/2024
                                     $0.02/M                                    —           —                        Bedrock                      04/01/2024

About Nova Micro

Nova Micro launched alongside the rest of the first-generation Nova family, but its design philosophy differs. Nova Lite and Nova Pro layer in image and video understanding. Nova Micro drops multimodal support entirely. The result is a model that does one thing, text processing, at high speed and low cost.

The tradeoff is intentional. Dropping vision processing lets the model concentrate on text generation throughput within a 128K-token context window. At the listed Amazon Bedrock rates of $0.04 per million input tokens and $0.14 per million output tokens, even high-volume classification or tagging pipelines keep per-request costs low.
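
To make the per-request cost concrete, here is a rough estimate using the Amazon Bedrock prices listed on this page ($0.04/M input, $0.14/M output). The workload shape (500 input tokens and 20 output tokens per request) is a hypothetical example, not a measured figure, and the function name is illustrative.

```typescript
// Listed Amazon Bedrock prices for Nova Micro, in USD per million tokens.
const INPUT_USD_PER_MILLION = 0.04
const OUTPUT_USD_PER_MILLION = 0.14

// Estimated cost in USD for a single request.
function estimateRequestCostUsd(inputTokens: number, outputTokens: number): number {
  return (
    (inputTokens / 1_000_000) * INPUT_USD_PER_MILLION +
    (outputTokens / 1_000_000) * OUTPUT_USD_PER_MILLION
  )
}

// Hypothetical classification workload: short prompt, very short label output.
const perRequest = estimateRequestCostUsd(500, 20)
const perMillionRequests = perRequest * 1_000_000

console.log(`~$${perMillionRequests.toFixed(2)} per million requests`)
// prints "~$22.80 per million requests"
```

Output tokens cost more than input tokens, so tasks that return a short label are cheaper than tasks that generate long text from the same prompt.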

Many teams deploy Nova Micro as the default tier in a routing architecture. Straightforward text requests (classification, entity extraction, simple Q&A, short summaries) go to Micro. Only when a request involves images, requires deep reasoning, or exceeds the 128K-token context window does it escalate to Lite, Pro, or a second-generation model. This pattern keeps average cost per request low while covering the full range of task complexity.
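
The routing pattern described here could be sketched roughly as follows. Only the `amazon/nova-micro` slug appears on this page; the other slugs, the request shape, the order of the checks, and the choice of which model handles which escalation are all assumptions for illustration.

```typescript
// Signals the caller is assumed to know before dispatching (hypothetical shape).
type RoutableRequest = {
  hasImages: boolean
  needsDeepReasoning: boolean
  estimatedInputTokens: number
}

// 128K context window from the text; reading "128K" as 128,000 is an assumption.
const MICRO_CONTEXT_TOKENS = 128_000

// Only 'amazon/nova-micro' is confirmed by this page; the rest are placeholder slugs.
const MODELS = {
  micro: 'amazon/nova-micro',
  multimodal: 'amazon/nova-lite',
  reasoning: 'amazon/nova-2-lite',
  longContext: 'amazon/nova-2-lite',
} as const

type ModelSlug = (typeof MODELS)[keyof typeof MODELS]

// Default to Micro; escalate only when a request needs something Micro lacks.
function chooseModel(req: RoutableRequest): ModelSlug {
  if (req.estimatedInputTokens > MICRO_CONTEXT_TOKENS) return MODELS.longContext
  if (req.needsDeepReasoning) return MODELS.reasoning
  if (req.hasImages) return MODELS.multimodal
  return MODELS.micro
}

const slug = chooseModel({
  hasImages: false,
  needsDeepReasoning: false,
  estimatedInputTokens: 1_200,
})
console.log(slug) // prints "amazon/nova-micro"
```

Keeping the escalation rules in one small function makes it easy to adjust thresholds as traffic patterns change.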

What To Consider When Choosing a Provider

  • Configuration: If streaming throughput matters, test Nova Micro early. It's built for high text throughput within AI Gateway.
  • Zero Data Retention: AI Gateway supports Zero Data Retention for this model via direct gateway requests (BYOK is not included). To configure this, check the documentation.
  • Authentication: AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.
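
For the early throughput testing suggested above, a small helper along these lines may be enough to compare runs. It is a sketch, not the gateway's own methodology: it accepts any `AsyncIterable<string>` and counts characters rather than tokens, so its numbers are only a proxy for the TTFT and TPS figures reported on this page. The helper and type names are hypothetical.

```typescript
type StreamStats = {
  ttftMs: number | null // time to first non-empty chunk; null if none arrived
  totalMs: number
  chunks: number
  chars: number
}

// Consumes a text stream and records rough timing statistics.
// Timing starts when this function is called, not when the request was sent.
async function measureStream(stream: AsyncIterable<string>): Promise<StreamStats> {
  const start = Date.now()
  let ttftMs: number | null = null
  let chunks = 0
  let chars = 0

  for await (const chunk of stream) {
    if (chunk.length === 0) continue
    if (ttftMs === null) ttftMs = Date.now() - start
    chunks += 1
    chars += chunk.length
  }

  return { ttftMs, totalMs: Date.now() - start, chunks, chars }
}
```

Assuming the AI SDK setup from the sample at the top of this page, the `textStream` property of a `streamText` result can be passed in directly, for example `await measureStream(result.textStream)`. Repeating the measurement over several prompts and taking the median gives a figure closer in spirit to the P50 values shown here.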

When to Use Nova Micro

Best For

  • Chatbots and conversational interfaces: Interactive experiences where response speed affects UX
  • Text classification at scale: Sentiment analysis and entity extraction across high request volumes
  • Autocomplete and suggestions: Inline features where generation speed matters
  • Default routing tier: Handle routine text requests at minimal cost in a model-routing architecture

Consider Alternatives When

  • Multimodal inputs required: Switch to Nova Lite or Nova Pro when inputs include images, documents, or video
  • Structured multi-step reasoning: Nova 2 Lite is better equipped for agentic tool use and deeper reasoning
  • Context beyond 128K tokens: Long documents or extended conversations require a model with a larger context window

Conclusion

Nova Micro is the speed specialist of the Nova family. By focusing exclusively on text, it achieves throughput and pricing that suit many text-only workloads where response latency matters. Pair it with a routing layer that escalates to multimodal or reasoning-capable models when needed.