Starling-LM-7B-beta is an open-source large language model developed by Nexusflow, an improved iteration over the earlier Starling-LM-7B-alpha. It is designed to provide high-quality chat interactions and instruction-following capabilities within the 7-billion-parameter class.

## Architecture and Training

The model is fine-tuned from OpenChat-3.5-0106, which is itself based on the Mistral-7B-v0.1 architecture. Its training pipeline centers on Reinforcement Learning from AI Feedback (RLAIF): the Starling-RM-34B reward model scores candidate responses, and these scores guide policy optimization via Proximal Policy Optimization (PPO).

## Dataset and Performance

Training used the Nectar dataset, which consists of 183,000 chat prompts, each paired with multiple ranked model-generated responses. This process yielded a significant performance boost on conversational benchmarks, with a score of 8.12 on MT-Bench. The model handles both single-turn and multi-turn dialogues with improved helpfulness and reasoning compared to its predecessor.
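Because the model is fine-tuned from OpenChat-3.5-0106, it inherits OpenChat's "GPT4 Correct" chat template. In practice the tokenizer's `apply_chat_template` method handles this automatically; the standalone sketch below only illustrates the turn structure, assuming the standard OpenChat format of alternating `GPT4 Correct User` / `GPT4 Correct Assistant` turns separated by `<|end_of_turn|>`.

```python
# Sketch of the OpenChat-style prompt format inherited by
# Starling-LM-7B-beta. The real tokenizer's chat template is
# authoritative; this just shows the turn layout.

def build_prompt(messages):
    """Render a list of {"role": ..., "content": ...} dicts into the
    'GPT4 Correct' format, ending with an open assistant turn so the
    model continues from there."""
    role_map = {"user": "GPT4 Correct User",
                "assistant": "GPT4 Correct Assistant"}
    parts = []
    for msg in messages:
        parts.append(f"{role_map[msg['role']]}: {msg['content']}<|end_of_turn|>")
    # Leave the final assistant turn open for generation.
    parts.append("GPT4 Correct Assistant:")
    return "".join(parts)

# Single-turn example:
print(build_prompt([{"role": "user", "content": "Hello"}]))
# GPT4 Correct User: Hello<|end_of_turn|>GPT4 Correct Assistant:
```

Multi-turn conversations simply append further user/assistant pairs before the final open assistant turn, which is how the model supports the multi-turn dialogue capability noted above.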