Pixtral 12B 2409 introduced multimodal capability to the Mistral AI lineup with a clean architectural split. A 400M parameter vision encoder trained from scratch handles image understanding, while a 12B decoder based on Mistral Nemo handles text generation. The two components were trained together on interleaved image-text data, so visual and textual understanding are integrated rather than bolted on.
The 128K-token context window accommodates multiple images alongside text in a single request. You can compare images, trace visual changes across a document, or cross-reference diagrams with their written descriptions. The model also supports variable aspect ratios, processing images at their native dimensions, which matters for charts, documents, and technical schematics where distortion degrades accuracy.
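A multi-image request of the kind described above can be sketched as follows. This is a minimal illustration, assuming an OpenAI-style chat completions payload with `image_url` content parts; the exact schema and endpoint depend on the gateway you use, so check its documentation. The model id `pixtral-12b-2409` matches the release name used here.

```python
# Sketch: assembling one chat request that interleaves a text prompt
# with several images, relying on the 128K context window.
# The content-part schema below is an assumption modeled on
# OpenAI-compatible APIs, not a definitive specification.
import json


def build_multi_image_request(question: str, image_urls: list[str]) -> dict:
    """Build a single chat completions payload containing a text
    prompt followed by one content part per image."""
    content: list[dict] = [{"type": "text", "text": question}]
    for url in image_urls:
        # One part per image; Pixtral accepts native dimensions,
        # so no client-side resizing or cropping is needed.
        content.append({"type": "image_url", "image_url": {"url": url}})
    return {
        "model": "pixtral-12b-2409",
        "messages": [{"role": "user", "content": content}],
    }


payload = build_multi_image_request(
    "What changed between these two revisions of the schematic?",
    ["https://example.com/rev_a.png", "https://example.com/rev_b.png"],
)
print(json.dumps(payload, indent=2))
```

Because every image becomes an additional content part in the same message, comparisons across images happen in one forward pass rather than across separate requests.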
On MMMU, Pixtral 12B 2409 scores 52.5% and achieves a 20% relative improvement in instruction following over comparable open-source multimodal models. The model ships under Apache 2.0. Mistral AI has since deprecated it in favor of newer vision models, though it remains available through AI Gateway for existing integrations.