Allen AI
Open Weights

Molmo 7B-D

Released Sep 2024

Intelligence: #404
Coding: #365
Math: #266
Context: 4K
Parameters: 7.6B

Molmo 7B-D is a multimodal language model developed by the Allen Institute for AI (Ai2) as part of the Molmo family of open vision-language models. Released in September 2024, it is a dense model intended to deliver strong multimodal reasoning at a size small enough to run on commonly available hardware. The "D" suffix identifies it as the demo variant, the model Ai2 uses for its public Molmo demo.

Architecture and Capabilities

The model combines a Qwen2-7B language backbone with an OpenAI CLIP ViT-L/14 vision encoder. A simple MLP connector projects the visual features into the language model's input space. Molmo 7B-D supports image understanding tasks such as visual question answering and detailed captioning. A notable feature is "pointing": the model can output 2D coordinates that identify specific objects or regions within an image.
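To make the pointing capability concrete, here is a minimal sketch of how a client might turn pointing output into pixel coordinates. It assumes the model writes each point as an XML-like tag such as `<point x="50.0" y="25.0" alt="mug">mug</point>`, with `x` and `y` expressed as percentages of the image width and height. That format is an assumption, not something stated on this page, and `parse_points` is a hypothetical helper, so check both against real model output before relying on them.

```python
import re

# Assumed format (not confirmed by this page): one tag per point, with x and y
# given as percentages (0-100) of the image width and height.
POINT_RE = re.compile(r'<point\s+x="(\d+(?:\.\d+)?)"\s+y="(\d+(?:\.\d+)?)"')


def parse_points(text, width, height):
    """Return (x, y) pixel coordinates for every single-point tag in text.

    Hypothetical helper for illustration; width and height are the pixel
    dimensions of the image the model was shown.
    """
    points = []
    for x_pct, y_pct in POINT_RE.findall(text):
        # Convert percentage coordinates to pixel coordinates.
        points.append((float(x_pct) / 100.0 * width,
                       float(y_pct) / 100.0 * height))
    return points


# Example reply text, written by hand for this sketch.
reply = 'Here it is: <point x="50.0" y="25.0" alt="mug">mug</point>'
pixels = parse_points(reply, width=640, height=480)
```

Keeping the coordinates relative to the image size until the last step means the same reply can be mapped onto a resized or cropped copy of the image by passing different `width` and `height` values.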

Training and Data

Molmo 7B-D was trained on the PixMo dataset, a collection of roughly 1 million highly curated image-text pairs. Unlike many models that rely on massive, noisy web-scraped datasets, the Molmo family emphasizes data quality and information-dense annotations. The goal of training was to match the performance of significantly larger proprietary systems through careful data curation and an efficient architecture rather than sheer scale.

Rankings & Comparison