Llama 3.2 90B Vision Instruct is Meta's largest open vision-language model, released on September 25, 2024. It's built on the Llama 3.1 70B language foundation, with a cross-attention adapter connecting a vision encoder to the language backbone; the adapter and encoder account for the roughly 20B parameters beyond the 70B base. That 70B language component has substantially more capacity for complex reasoning, synthesis, and generation than the 11B variant's 8B backbone, so tasks that pair difficult visual understanding with demanding text generation are better served here.
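As a minimal sketch of how this looks in practice, the snippet below loads the model through Hugging Face transformers' Mllama integration (available from transformers 4.45) and asks a question about one image. The model ID matches the official release, but the image file name and prompt are placeholders, and running at this scale realistically requires multiple high-memory GPUs (roughly 180 GB of weights in bfloat16) or a quantized variant.

```python
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "meta-llama/Llama-3.2-90B-Vision-Instruct"

# Load the 90B checkpoint, sharding across available GPUs.
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Placeholder image and question for illustration.
image = Image.open("diagram.png")
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Explain this technical diagram step by step."},
    ]},
]

# Render the chat template, combine it with the image, and generate.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, not the echoed prompt.
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:],
                       skip_special_tokens=True))
```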
The context window of 128K tokens accommodates extended visual conversations: a sequence of images with questions, a technical document with figures and tables, or an annotated slide deck can all fit in a single context. For enterprise use cases like research document analysis, technical diagram interpretation, and medical image description, the combination of large language model capacity and vision capability is more practical than at the 11B scale.
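To illustrate how such an extended visual session might be assembled, the sketch below continues the setup above with a multi-turn history, where earlier turns stay in the 128K-token context so follow-up questions can refer back to them. The figure file name, the questions, and the prior assistant reply are all hypothetical, and `model` and `processor` are assumed to be loaded as in the previous snippet.

```python
from PIL import Image

# Placeholder figure from a technical document (hypothetical file name).
figure = Image.open("report_figure_3.png")

# Multi-turn history: the earlier exchange remains in context,
# so the follow-up can reference the same figure and answer.
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Summarize the trend shown in this figure."},
    ]},
    {"role": "assistant", "content": [
        {"type": "text", "text": "Throughput rises steadily, then plateaus after 32 nodes."},
    ]},
    {"role": "user", "content": [
        {"type": "text", "text": "What bottleneck could explain that plateau?"},
    ]},
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=figure, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:],
                       skip_special_tokens=True))
```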