gpt-oss-120b is a large-scale, open-weight language model developed by OpenAI and released as part of the GPT-OSS series. It uses a Mixture-of-Experts (MoE) architecture with approximately 117 billion total parameters, of which about 5.1 billion are active per token. The model is designed for reasoning-heavy, agentic tasks and production use cases, offering performance comparable to proprietary reasoning models while being available under the permissive Apache 2.0 license.
Architecture and Design
The model consists of 36 layers with 128 experts per layer, employing a top-4 routing mechanism. It features alternating dense and locally banded sparse attention patterns and uses grouped-query attention (GQA) with a group size of 8 for inference efficiency. It supports a native context length of 128,000 tokens and uses the o200k_harmony tokenizer, which is optimized for STEM, coding, and multilingual data.
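The top-4-of-128 routing described above can be sketched as follows. This is a minimal NumPy illustration of top-k expert selection, not the production kernel; in particular, whether the mixing weights are normalized over only the selected experts (as here) or over all 128 router logits is an assumption of this sketch.

```python
import numpy as np

def topk_route(hidden, router_w, k=4):
    """Pick k experts per token and compute their mixing weights.

    hidden:   (tokens, d_model) token activations
    router_w: (d_model, n_experts) router projection
    Returns (indices, weights) of shape (tokens, k) each.
    """
    logits = hidden @ router_w                        # (tokens, n_experts)
    idx = np.argsort(logits, axis=-1)[:, -k:]         # top-k expert ids per token
    top = np.take_along_axis(logits, idx, axis=-1)    # their logits
    top = top - top.max(axis=-1, keepdims=True)       # numerically stable softmax
    w = np.exp(top)
    w /= w.sum(axis=-1, keepdims=True)                # weights over the k chosen experts
    return idx, w

rng = np.random.default_rng(0)
# Toy dimensions: 3 tokens, d_model=16, 128 experts as in gpt-oss-120b.
idx, w = topk_route(rng.standard_normal((3, 16)), rng.standard_normal((16, 128)))
print(idx.shape, w.shape)  # (3, 4) (3, 4)
```

Each token's output is then the weighted sum of the outputs of its 4 selected experts, which is why only ~5.1B of the 117B parameters are active per token.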
Key Capabilities
As a reasoning-focused model, gpt-oss-120b supports a configurable reasoning effort setting (low, medium, or high), allowing users to scale the depth of its internal chain-of-thought (CoT). This reasoning capability enables the model to excel in competition-level coding, advanced mathematics, and complex multi-step tool use. The model is post-trained using reinforcement learning techniques similar to those used in OpenAI's frontier reasoning systems, such as the o-series.
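In the Harmony format, the reasoning effort is conveyed as a line in the system message. The sketch below shows one simplified way such a message might be assembled; the exact header fields and surrounding structure are assumptions here, not a reproduction of the official spec (the `<|start|>` / `<|message|>` / `<|end|>` delimiters do appear in Harmony, but real renderers handle far more detail).

```python
VALID_EFFORTS = ("low", "medium", "high")

def build_system_message(reasoning_effort="medium"):
    """Render a simplified Harmony-style system message with a reasoning level."""
    if reasoning_effort not in VALID_EFFORTS:
        raise ValueError(f"reasoning_effort must be one of {VALID_EFFORTS}")
    return (
        "<|start|>system<|message|>"
        "You are a helpful assistant.\n"
        f"Reasoning: {reasoning_effort}"
        "<|end|>"
    )

msg = build_system_message("high")
print(msg)
```

Raising the effort level lengthens the model's internal chain-of-thought, trading latency and token cost for deeper reasoning.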
Efficiency and Deployment
To facilitate deployment on consumer-grade and enterprise hardware, the model was released with native MXFP4 quantization. This optimization allows the 117B-parameter model to fit within the 80 GB of memory provided by a single NVIDIA H100 or AMD MI300X GPU. It is compatible with standard inference libraries and follows the Harmony prompt format for structured interactions and tool calling.