Llama 3.2 11B Vision Instruct is the smaller of Meta's two Llama 3.2 vision-language models, released on September 25, 2024. Its architecture connects a vision encoder to the existing Llama 3.1 8B language model through cross-attention adapter layers, with the language model weights kept frozen during adapter training. Freezing preserves the underlying model's text generation and instruction-following capabilities: the vision component adds image understanding without degrading language quality.
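Meta has not published a standalone reference implementation of the adapter here, but the general pattern is straightforward to illustrate. The PyTorch sketch below is a simplified, hypothetical rendering of a gated cross-attention block, not Meta's actual code (the dimensions, names, and single-gate design are assumptions): text hidden states act as queries over image features, and a tanh gate initialized at zero lets training ramp the visual signal in without perturbing the frozen language model at the start.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionAdapter(nn.Module):
    """Illustrative sketch of a gated cross-attention adapter, the
    general pattern for grafting vision onto a frozen language model.
    Details here are assumptions, not Meta's implementation."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        # Text hidden states are queries; image features are keys/values.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        # tanh(0) = 0, so the adapter is an identity function at init and
        # the frozen language model's behavior is initially untouched.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_hidden: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.cross_attn(
            query=self.norm(text_hidden), key=image_feats, value=image_feats
        )
        # Residual connection: only the gated cross-attention path is trained;
        # the text pathway (and the language model producing it) stays frozen.
        return text_hidden + self.gate.tanh() * attn_out
```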
At 11B parameters, Llama 3.2 11B Vision Instruct is substantially cheaper to serve than the 90B variant. For many specialized tasks (medical imaging annotation, product catalog understanding, and document parsing with visual elements), the 11B scale is sufficient and far more accessible.
The model handles a range of image understanding tasks within its parameter budget: chart and diagram interpretation, parsing of visual elements in documents, scene description, and visual question answering over standard photographic content. For applications where the complexity of visual reasoning is bounded (recognizing objects, reading text in images, and describing layouts), 11B provides a cost-effective foundation.
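As a concrete starting point, here is a minimal visual question answering sketch using the Hugging Face transformers interface for this model (support landed in transformers 4.45; the image path and question below are placeholders, and downloading the gated meta-llama repository requires approved access):

```python
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

# Load the model in bfloat16 and shard it across available devices.
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# Placeholder inputs: any local image and a question about it.
image = Image.open("chart.png")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is the highest value shown in this chart?"},
        ],
    }
]

# Render the chat template, then bundle the image and text into model inputs.
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    images=image,
    text=input_text,
    add_special_tokens=False,
    return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))
```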