Meta
Open Weights

Llama 3.2 Instruct 90B (Vision)

Released Sep 2024

Intelligence: #360
Context: 128K
Parameters: 90B

Llama 3.2 90B Vision Instruct is a large-scale multimodal model developed by Meta, representing the most powerful vision-capable model in the Llama 3.2 collection. It is designed to handle sophisticated reasoning tasks that require the integration of text and visual information, such as document understanding, image captioning, and visual question answering. This model is particularly effective at extracting information from charts, graphs, and complex visual layouts while maintaining high-quality text generation capabilities.

Architecture and Capabilities

The model architecture is built by integrating a vision encoder into the Llama 3.1 70B text-only model. A cross-attention mechanism maps image embeddings into the language model's representation space, bringing the total parameter count to approximately 90 billion. This design allows the model to process high-resolution images alongside text sequences within a 128,000-token context window. The "Instruct" version is fine-tuned for conversational use cases, agentic workflows, and safety, following instructions across both modalities with high precision.
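To make the cross-attention idea concrete, here is a minimal single-head sketch in NumPy of how text tokens can attend to image patch embeddings. All dimensions, weight matrices, and function names are illustrative assumptions for a toy example; they do not reflect Llama 3.2's actual layer sizes, multi-head structure, or implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_hidden, image_embeds, Wq, Wk, Wv):
    """Toy single-head cross-attention: queries come from text tokens,
    keys/values come from image patch embeddings."""
    Q = text_hidden @ Wq                       # (n_text, d_head)
    K = image_embeds @ Wk                      # (n_patches, d_head)
    V = image_embeds @ Wv                      # (n_patches, d_head)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])    # (n_text, n_patches)
    attn = softmax(scores, axis=-1)            # each text token's weights over patches
    return attn @ V                            # (n_text, d_head)

# Toy dimensions (assumed, far smaller than the real model).
rng = np.random.default_rng(0)
d_text, d_img, d_head = 16, 12, 8
text = rng.normal(size=(5, d_text))            # 5 text-token hidden states
img = rng.normal(size=(7, d_img))              # 7 image-patch embeddings
out = cross_attention(
    text, img,
    rng.normal(size=(d_text, d_head)),
    rng.normal(size=(d_img, d_head)),
    rng.normal(size=(d_img, d_head)),
)
print(out.shape)  # (5, 8): one fused vector per text token
```

Each text token ends up with a vector that mixes in visual information weighted by attention; in the real model this happens in dedicated cross-attention layers interleaved with the pretrained language-model layers.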

Rankings & Comparison