Tulu-2-DPO-70B is an instruction-tuned language model developed by the Allen Institute for AI (AI2). As part of the Tulu 2 suite, it is designed to study and improve the adaptation of pretrained models to human instructions and user preferences. It is built on the Llama 2 70B architecture.
Training and Alignment
The model was developed using a multi-stage process, beginning with supervised fine-tuning on the Tulu V2 mix, a diverse dataset of human and synthetic instructions. It was further refined using Direct Preference Optimization (DPO) on the UltraFeedback dataset. DPO aligns the model with human preferences more efficiently than traditional Reinforcement Learning from Human Feedback (RLHF) because it optimizes directly on pairs of preferred and rejected responses, avoiding RLHF's separate reward model and reinforcement-learning loop. This makes the model a robust open-source alternative for high-capacity chat and reasoning tasks.
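To make the preference-pair training concrete, here is a minimal, self-contained sketch of the standard DPO loss for a single example. The function names and the numeric values are illustrative, not taken from AI2's training code; the arguments stand for the summed log-probabilities of the chosen and rejected responses under the policy being trained and under a frozen reference model (the SFT checkpoint).

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair (illustrative sketch).

    Each argument is the summed token log-probability of a response
    (chosen or rejected) under the policy or the frozen reference
    model. beta controls how far the policy may drift from the
    reference. All concrete values here are hypothetical.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_ratio - rejected_ratio)
    # -log sigmoid(logits): small when the policy prefers the chosen
    # response more strongly than the reference model does.
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# If the policy and the reference agree exactly, the loss is
# -log(0.5) ≈ 0.6931; favoring the chosen response lowers it.
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)
improved = dpo_loss(-9.0, -13.0, -10.0, -12.0)
print(round(baseline, 4), round(improved, 4))
```

Minimizing this loss pushes the policy to assign relatively more probability to preferred responses than the reference does, which is the mechanism that replaces RLHF's reward-model-plus-RL pipeline.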
Research Significance
AI2 released Tulu-2-DPO-70B as a transparent research artifact, providing the full training data, evaluation framework, and model weights. At the time of its release, it represented one of the first successful applications of the DPO algorithm at the 70-billion-parameter scale, aiming to facilitate open research into the best practices of post-pretraining adaptation for large language models.