
GPT-4.1 mini

GPT-4.1 mini delivers GPT-4o-class intelligence at reduced cost and nearly half the latency, making it a strong cost-performance option in the GPT-4.1 family for high-volume production workloads.

File Input · Tool Use · Vision (Image) · Implicit Caching
index.ts
import { streamText } from 'ai';

const result = streamText({
  model: 'openai/gpt-4.1-mini',
  prompt: 'Why is the sky blue?',
});

// Consume the stream as tokens arrive.
for await (const text of result.textStream) {
  process.stdout.write(text);
}

Frequently Asked Questions

  • What changed between this model and its predecessor in the 4o family?

    Three major leaps: the context window expanded from 128K to 1.0M tokens (8x), instruction following improved significantly for complex multi-constraint prompts, and coding benchmarks rose across generation, review, and refactoring tasks. Cost dropped relative to GPT-4o.

  • How does the 75% prompt caching discount work with the 1.0M-token context window?

    Cached input tokens (repeated system prompts, shared few-shot examples, or persistent context) are billed at 75% below the standard input rate. With a 1.0M-token window, caching a large system prompt or reference corpus across requests yields substantial savings.
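    To see how the discount compounds, here is a minimal cost sketch. The per-token rate below is a placeholder, not official pricing; only the 75% discount on cached input tokens comes from this page.

    ```typescript
    // Hypothetical per-token input rate for illustration only (not official pricing).
    const inputRate = 0.4 / 1_000_000;   // assumed $ per fresh input token
    const cachedRate = inputRate * 0.25; // cached tokens billed at 75% below standard

    function requestCost(cachedTokens: number, freshTokens: number): number {
      return cachedTokens * cachedRate + freshTokens * inputRate;
    }

    // A 200K-token system prompt reused across requests: the first request pays
    // the full rate on everything; repeat requests pay the cached rate on the
    // repeated prefix and the full rate only on the new 500-token user message.
    const first = requestCost(0, 200_000 + 500);
    const repeat = requestCost(200_000, 500);
    ```

    The larger the shared prefix relative to the per-request novel input, the closer the effective input cost gets to one quarter of the standard rate.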

  • Is GPT-4.1 mini a distilled version of full GPT-4.1?

    OpenAI describes it as a separate model in the GPT-4.1 family, not a direct distillation. It was trained to match GPT-4o-level intelligence at lower compute requirements while sharing the GPT-4.1 family's improvements in coding and instruction following.

  • Can GPT-4.1 mini handle an entire codebase in one request?

    The 1.0M-token context window accommodates most single-repository codebases. For retrieval accuracy across the full window, the GPT-4.1 family maintains strong performance even at extreme context lengths, an area where previous-generation models often degraded.
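    Before stuffing a repository into one request, it is worth checking that it actually fits. A rough sketch, using the common (approximate) heuristic of about four characters per token; the output reservation and helper names are assumptions, not part of any API:

    ```typescript
    const CONTEXT_WINDOW = 1_000_000; // GPT-4.1 mini's 1.0M-token window

    // Rough estimate: ~4 characters per token (heuristic, not a real tokenizer).
    function estimateTokens(text: string): number {
      return Math.ceil(text.length / 4);
    }

    // Check whether a set of file contents fits, leaving room for the response.
    function fitsInContext(files: string[], reservedForOutput = 32_000): boolean {
      const total = files.reduce((sum, f) => sum + estimateTokens(f), 0);
      return total + reservedForOutput <= CONTEXT_WINDOW;
    }
    ```

    For anything near the limit, replace the heuristic with a real tokenizer count before sending the request.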

  • What latency improvement should I expect?

    GPT-4.1 mini delivers nearly half the latency compared to GPT-4o. See live throughput and time-to-first-token metrics on this page for current measured performance.
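    If you want to measure time-to-first-token yourself rather than rely on published numbers, a small sketch: the function below times any async text stream (such as the AI SDK's `result.textStream`) until its first chunk arrives. The function name is illustrative, not part of the SDK.

    ```typescript
    // Measure elapsed milliseconds until the first chunk of a text stream arrives.
    async function timeToFirstToken(stream: AsyncIterable<string>): Promise<number> {
      const start = Date.now();
      for await (const _chunk of stream) {
        return Date.now() - start; // first chunk received: report elapsed time
      }
      return Date.now() - start; // stream ended without emitting any chunks
    }
    ```

    Pass it `result.textStream` from a `streamText` call to compare measured TTFT across models under your own workload.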

  • When should I use full GPT-4.1 instead of mini?

    When the task demands the absolute highest accuracy, particularly on complex coding challenges, nuanced multi-step instructions, or workloads where the quality gap between mini and full is measurable and consequential for your application.