Trinity Large Thinking is an open-source, reasoning-optimized language model developed by Arcee AI, designed for complex, long-horizon agentic workflows and multi-step planning. Released under the Apache 2.0 license, it is a variant of the Trinity-Large family that incorporates extended chain-of-thought (CoT) reasoning and agentic reinforcement learning. The model generates explicit reasoning traces wrapped in `<think>...</think>` tags before producing a final response or tool call, a process intended to improve performance on agentic benchmarks and maintain coherence across long-horizon interactions.

## Architecture and Design

The model uses a sparse Mixture-of-Experts (MoE) architecture with approximately 398 billion total parameters, of which roughly 13 billion are active per token. This design employs a high sparsity ratio: 256 experts with only 4 active at any given step, a 1.56% routing fraction. To keep routing stable at this scale, the architecture includes six dense layers and a load-balancing strategy known as Soft-clamped Momentum Expert Bias Updates (SMEBU). The model supports a native context window of 512,000 tokens, enabling it to process extensive conversation histories and documentation.

## Training and Performance

Training was conducted on 17 trillion tokens over 33 days on a cluster of 2,048 NVIDIA B300 Blackwell GPUs, using the Muon optimizer. The training data, curated in partnership with DatologyAI, included significant portions of synthetic data alongside diverse datasets covering STEM, programming, and reasoning.
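The top-4-of-256 routing described in the architecture section can be illustrated with a minimal sketch. This is a generic top-k gating toy in plain Python, not Arcee's actual router; the function names, shapes, and the softmax-over-selected-logits weighting are assumptions chosen for clarity.

```python
import math
import random

def route_token(hidden, gate_weights, num_active=4):
    """Toy top-k MoE router for a single token.

    hidden: list of d_model floats; gate_weights: num_experts x d_model matrix.
    Hypothetical sketch -- only the top-4-of-256 ratio comes from the text above.
    """
    # Gate logits: one dot product of the hidden state per expert.
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in gate_weights]
    # Keep the num_active experts with the highest gate logits.
    topk = sorted(range(len(logits)), key=lambda i: logits[i])[-num_active:]
    # Softmax over only the selected logits gives the mixing weights.
    m = max(logits[i] for i in topk)
    exps = [math.exp(logits[i] - m) for i in topk]
    total = sum(exps)
    return topk, [e / total for e in exps]

random.seed(0)
d_model, num_experts = 32, 256
hidden = [random.gauss(0, 1) for _ in range(d_model)]
gate = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(num_experts)]
experts, weights = route_token(hidden, gate)
print(len(experts), f"{len(experts) / num_experts:.2%}")  # 4 of 256 experts -> 1.56%
```

Because only the selected experts' weight matrices are ever multiplied against the token, the remaining 252 experts stay idle for that step, which is what keeps the active parameter count (~13B) far below the total (~398B).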
The model's post-training phase focused on refining multi-turn tool orchestration and instruction following, yielding high scores on agentic benchmarks such as τ²-Bench and PinchBench.

In practical use, the model's "thinking" phase is architecturally load-bearing: preserving these reasoning traces in the conversation history is essential to maintaining the model's state and planning capabilities across multi-turn loops. Thanks to its sparse activation pattern, which leaves most weights idle during any given inference step, the model delivers high inference throughput relative to its total parameter count.
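The trace-preservation point above can be sketched in a short agent-loop helper. This is a hedged illustration, not an official client: the message schema and helper names are assumptions, and the only grounded detail is that `<think>...</think>` blocks should stay in the history sent back to the model while being hidden from the end user.

```python
import re

def append_turn(history, assistant_reply):
    """Append an assistant turn, keeping its <think>...</think> trace intact.

    Hypothetical helper: the section above says the traces are load-bearing,
    so we store the full reply rather than a stripped version.
    """
    history.append({"role": "assistant", "content": assistant_reply})
    return history

def visible_text(assistant_reply):
    """What a UI would show the user: the reply with the trace removed."""
    return re.sub(r"<think>.*?</think>\s*", "", assistant_reply, flags=re.DOTALL)

history = [{"role": "user", "content": "Plan the migration."}]
reply = ("<think>First inventory the services, then order them by risk."
         "</think>Step 1: inventory the services.")
append_turn(history, reply)
print(visible_text(reply))  # trace hidden from the user...
print("<think>" in history[-1]["content"])  # ...but preserved for the next turn
```

The design choice is the split: `visible_text` strips the trace for display, while `append_turn` keeps it verbatim, so the model's next turn still sees the plan it laid out earlier.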