Mercury 2 is a diffusion large language model (dLLM) developed by Inception Labs. Unlike models built on the prevailing autoregressive Transformer paradigm, Mercury 2 uses a parallel refinement architecture: it generates and refines multiple text blocks simultaneously, significantly reducing latency and increasing throughput for complex reasoning tasks. On modern hardware, such as Nvidia Blackwell GPUs, the model can exceed 1,000 tokens per second.
Designed for production-grade reasoning, the model excels in scenarios where low latency is critical, including agentic loops, real-time voice interactions, and iterative coding environments. It supports a 128,000-token context window and features native capabilities for tool use and schema-aligned JSON output. Benchmark evaluations indicate that the 8B variant matches the performance of significantly larger autoregressive models, scoring 91.1 on the AIME 2025 mathematical reasoning test and 73.6 on the GPQA Diamond benchmark.
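To make the schema-aligned JSON output concrete, the sketch below builds a request body in the OpenAI-compatible `response_format` style. This is an illustrative assumption, not documented Mercury 2 API behavior: the model identifier, the `json_schema` response format, and the field names are all placeholders borrowed from the common OpenAI-style convention.

```python
import json

# Hypothetical request body for schema-constrained JSON output, in the
# OpenAI-compatible "response_format" style. The model name and the exact
# response_format support are assumptions, not confirmed API details.
schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "population": {"type": "integer"},
    },
    "required": ["city", "population"],
}

request_body = {
    "model": "mercury-2-8b",  # placeholder model identifier
    "messages": [
        {"role": "user", "content": "Return the largest city in Japan as JSON."}
    ],
    "response_format": {
        "type": "json_schema",
        "json_schema": {"name": "city_info", "schema": schema},
    },
}

print(json.dumps(request_body, indent=2))
```

A schema like this lets downstream code parse the model's reply without defensive string handling, which matters in the agentic loops the model targets.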
Architecture and Capabilities
The core innovation of Mercury 2 lies in its diffusion-based generation process, which treats text production as a denoising task across entire sequences rather than a linear sequence of predictions. This allows for global context awareness during the generation phase and permits iterative error correction within a single inference pass. The model is available in 3B and 8B parameter variants and is released under an Apache 2.0 license, facilitating open research and community-led fine-tuning.
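The denoising idea can be illustrated with a toy sketch. This is not Mercury 2's actual algorithm; it only shows the general shape of diffusion-style generation: start from a fully masked sequence and, over a few refinement passes, commit the positions the model is most confident about, reconsidering the whole sequence in parallel at each step rather than emitting one token at a time.

```python
import random

# Toy illustration of diffusion-style generation (not Mercury's actual
# algorithm): a fixed TARGET stands in for model predictions, and a random
# vector stands in for per-position confidence scores.
MASK = "<mask>"
TARGET = ["the", "quick", "brown", "fox", "jumps"]

def denoise_step(seq, confidence, k):
    """Reveal the k most 'confident' masked positions in parallel."""
    masked = [i for i, tok in enumerate(seq) if tok == MASK]
    masked.sort(key=lambda i: confidence[i], reverse=True)
    for i in masked[:k]:
        seq[i] = TARGET[i]  # a real model would sample its own prediction here
    return seq

random.seed(0)
seq = [MASK] * len(TARGET)
confidence = [random.random() for _ in TARGET]
for step in range(3):  # a few whole-sequence passes instead of one token per step
    seq = denoise_step(seq, confidence, k=2)
    print(f"pass {step + 1}: {' '.join(seq)}")
# After three passes the sequence is fully unmasked.
```

Because every pass sees the entire sequence, a position committed early can in principle be revisited in later passes, which is the mechanism behind the in-pass error correction described above.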
Inception Labs provides a drop-in OpenAI-compatible API to ease integration into existing software stacks. The company targets latency-sensitive applications where the cumulative delay of sequential generation typically creates a bottleneck, positioning Mercury 2 as a faster and more cost-efficient alternative for high-volume inference workflows.
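Since the API is OpenAI-compatible, integration reduces to pointing a standard chat-completions request at a different base URL. The sketch below builds such a request with the standard library; the base URL, API key, and model name are placeholders, not documented values, and the request is constructed but deliberately not sent.

```python
import json
import urllib.request

# Placeholder endpoint and key: substitute the real values from the
# provider's documentation.
BASE_URL = "https://api.example.com/v1"
API_KEY = "sk-..."

def build_chat_request(prompt, model="mercury-2-8b"):
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = json.dumps({
        "model": model,  # placeholder model identifier
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
    )

req = build_chat_request("Summarize diffusion LLMs in one sentence.")
print(req.full_url)
# Sending it would be: urllib.request.urlopen(req) against a real endpoint.
```

The same request shape works with the official OpenAI client libraries by overriding their base URL, which is what makes the API "drop-in" for existing stacks.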