DeepSeek-OCR is an open-source multimodal model developed by DeepSeek, optimized for high-efficiency optical character recognition and advanced document understanding. It introduces the concept of Optical Context Compression, which uses visual representations as a dense medium for encoding textual information. This approach yields significant token compression, often representing a document with roughly 10x fewer tokens than its plain-text equivalent, while maintaining high extraction accuracy.
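The compression claim can be made concrete with a small sketch. The function below computes the ratio of text tokens to the vision tokens that replace them; the specific numbers in the example are illustrative assumptions, not measured values from the model.

```python
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """Ratio of text tokens to the vision tokens that encode the same content."""
    if vision_tokens <= 0:
        raise ValueError("vision_tokens must be positive")
    return text_tokens / vision_tokens

# Illustrative figures: a page whose plain text would occupy 1,000 text
# tokens is instead encoded as 100 vision tokens.
ratio = compression_ratio(text_tokens=1000, vision_tokens=100)
print(f"{ratio:.0f}x compression")  # -> 10x compression
```

The same arithmetic explains why compression matters for long-context workloads: every page that enters the decoder as 100 vision tokens instead of 1,000 text tokens leaves ten times more room in the context window.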
The model's architecture consists of a unified encoder-decoder system. The DeepEncoder component (approximately 380M parameters) leverages vision transformers to process high-resolution images and compress their visual features. The decoder is a 3B Mixture-of-Experts (MoE) model, specifically DeepSeek3B-MoE-A570M, which activates roughly 570M parameters per token during inference. This design enables high throughput, allowing the model to process hundreds of thousands of pages per day on a single modern data-center GPU.
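The efficiency of the MoE design follows from simple arithmetic: only a fraction of the total parameters runs per token. The parameter counts below come from the model name (3B total, 570M activated); the per-page latency is a hypothetical figure chosen purely to illustrate the throughput calculation, not a benchmark.

```python
# Fraction of the decoder that is active for any given token.
TOTAL_PARAMS = 3_000_000_000   # 3B total (from "DeepSeek3B")
ACTIVE_PARAMS = 570_000_000    # 570M activated (from "A570M")

active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
print(f"activated per token: {active_fraction:.0%}")  # -> 19%

# Hypothetical per-page latency (assumption for illustration only).
seconds_per_page = 0.4
pages_per_day = int(86_400 / seconds_per_page)
print(f"{pages_per_day:,} pages/day")  # -> 216,000 pages/day
```

Because each token touches only about a fifth of the decoder's weights, inference cost scales with the activated parameters rather than the full 3B, which is what makes six-figure daily page counts plausible on one GPU.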
Beyond standard text extraction, DeepSeek-OCR possesses "deep parsing" capabilities for interpreting complex document elements. It can accurately reconstruct tables, charts, mathematical formulas (outputting LaTeX), and chemical notations (outputting SMILES). The model is multilingual, supporting over 100 languages, and is designed to handle various formats including handwritten text, scanned receipts, and multi-column academic papers.
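A downstream consumer of these deep-parsing outputs typically routes each element to the notation named above: LaTeX for formulas, SMILES for chemical structures, and markup for tables. The sketch below shows one such dispatcher; the element kinds and their structure are assumptions for illustration, not the model's actual output schema.

```python
from dataclasses import dataclass

@dataclass
class ParsedElement:
    kind: str      # hypothetical labels: "text", "formula", "chemical", "table"
    content: str   # raw payload emitted by the model

def render(element: ParsedElement) -> str:
    """Wrap each parsed element in the notation appropriate to its kind."""
    if element.kind == "formula":
        return f"$${element.content}$$"       # LaTeX display math
    if element.kind == "chemical":
        return f"SMILES: {element.content}"   # SMILES string
    # Tables and plain text pass through unchanged in this sketch.
    return element.content

print(render(ParsedElement("formula", r"E = mc^2")))  # -> $$E = mc^2$$
print(render(ParsedElement("chemical", "c1ccccc1")))  # -> SMILES: c1ccccc1
```

Keeping formulas in LaTeX and chemistry in SMILES preserves machine-readable structure that a plain-text OCR pass would flatten away.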