Llama 3.2 11B Vision Instruct is a multimodal large language model developed by Meta, designed to process both text and image inputs. It belongs to the Llama 3.2 collection, which introduced Meta's first natively supported vision capabilities, at a smaller and more efficient scale than its 90B counterpart. The model is instruction-tuned using supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) to optimize performance for conversational use and task-oriented requests.
The model architecture is built upon the Llama 3.1 text-only backbone, an auto-regressive transformer. To enable visual understanding, it incorporates a vision adapter: a set of cross-attention layers that feed image-encoder representations into the core language model. This design lets the model retain its strong linguistic reasoning while gaining the ability to interpret visual data.
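The adapter mechanism described above can be illustrated with a minimal, single-head sketch in NumPy: text-token states act as queries against image-patch features, and the attended result is added back residually, so the language backbone's representations are preserved. The dimensions, weight shapes, and function names here are illustrative assumptions, not the model's actual configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_hidden, image_feats, Wq, Wk, Wv, Wo):
    """Single-head cross-attention sketch: text tokens (queries) attend
    to image-encoder outputs (keys/values). Shapes and weights are toy
    assumptions, not Llama 3.2's real configuration."""
    q = text_hidden @ Wq                      # (T_text, d_k)
    k = image_feats @ Wk                      # (T_img,  d_k)
    v = image_feats @ Wv                      # (T_img,  d_k)
    scores = q @ k.T / np.sqrt(k.shape[-1])   # scaled dot-product
    attn = softmax(scores, axis=-1)           # each text token weighs patches
    fused = attn @ v @ Wo                     # project back to model width
    return text_hidden + fused                # residual keeps the text path intact

rng = np.random.default_rng(0)
d_model, d_k = 16, 8
text = rng.normal(size=(5, d_model))          # 5 text-token hidden states
img = rng.normal(size=(10, d_model))          # 10 image-patch embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
Wo = rng.normal(size=(d_k, d_model))
out = cross_attention(text, img, Wq, Wk, Wv, Wo)
print(out.shape)  # (5, 16)
```

Because the fused output has the same shape as the text hidden states, these layers can be interleaved with the frozen text backbone without altering its interface.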
Key capabilities of the 11B Vision model include image reasoning, document visual question answering (DocVQA), image captioning, and optical character recognition (OCR). It can extract structured information from images, such as charts or graphs, and answer general questions about visual scenes. While the model supports text processing in eight languages, combined image-and-text tasks are officially supported only in English.
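To ground how such multimodal requests reach the model, the sketch below builds a prompt in the Llama 3.2 chat layout, where a special `<|image|>` token marks the position at which image embeddings are spliced into the text stream. In practice a processor/tokenizer chat template assembles this string; the helper function here is a hypothetical illustration of the layout, not an official API.

```python
def build_vision_prompt(question: str) -> str:
    """Sketch of the Llama 3.2 multimodal prompt layout (assumption:
    in real use, the tokenizer's chat template builds this). The
    <|image|> token marks where image features are inserted."""
    return (
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
        "<|image|>" + question
        + "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    )

prompt = build_vision_prompt("What trend does this chart show?")
print("<|image|>" in prompt)  # True
```

For a DocVQA-style task, the question would reference the document image ("What is the invoice total?") and the model grounds its answer in the OCR'd content at the image token's position.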
With a context window of 128,000 tokens, the model can handle long-form documents alongside high-resolution image inputs. It is released under the Llama 3.2 Community License, which permits both research and commercial use subject to its terms.