\title{\modelname~3: Pushing Frontiers in Open Language Model Post-Training}
- \abstract{ Language model post-training is applied to refine behaviors and unlock new skills across a wide range of language models, but open recipes for applying these techniques lag behind proprietary ones. The underlying training data and recipes for post-training are simultaneously the most important pieces of the puzzle and the portion with the least transparency. To bridge this gap, we introduce \modelname~3, a family of fully-open state-of-the-art post-trained models, alongside its data, code, and training recipes, serving as a comprehensive guide for modern post-training techniques. \modelname~3, which builds on Llama 3.1 base models, achieves results surpassing the instruct versions of Llama 3.1, Qwen 2.5, Mistral, and even closed models such as GPT-4o-mini and Claude 3.5-Haiku. The training algorithms for our models include supervised finetuning (SFT), Direct Preference Optimization (DPO), and a novel method we call Reinforcement Learning with Verifiable Rewards (RLVR). With \modelname~3, we build a multi-task evaluation scheme for post-training with development and unseen evaluations, standard benchmark implementations, and substantial decontamination of existing open datasets on said benchmarks. We conclude with analysis and discussion of training methods that did not reliably improve performance. The \modelname~3 release includes model weights, a demo, and the complete recipe --- datasets for diverse core skills, a robust toolkit for data curation and evaluation, the training code and infrastructure, and, most importantly, a detailed report for reproducing and further adapting the \modelname~3 approach to more domains. % } \metadata[\quad\huggingface Tulu 3 8B:]{\href{https://hf.co/allenai/Llama-3.1-Tulu-3-8B}{\texttt{Llama-3.1-Tulu-3-8B}}} \metadata[\quad\huggingface Tulu 3 70B:]{\href{https://hf.co/allenai/Llama-3.1-Tulu-3-70B}{\texttt{Llama-3.1-Tulu-3-70B}}} \metadata[\quad\huggingface Tulu 3 405B:]{\href{https://hf.co/allenai/Llama-3.1-Tulu-3-405B}{\texttt{Llama-3.1-Tulu-3-405B}}} \metadata[\quad\huggingface Tulu 3 \textsc{Data}:]{\href{https://hf.co/collections/allenai/tulu-3-datasets-673b8df14442393f7213f372}{\texttt{tulu-3-datasets-673b8df14442393f7213f372}}} \metadata[\quad\github Tulu 3 Code:]{\href{https://github.com/allenai/open-instruct}{\texttt{open-instruct}}} \metadata[\quad\github \evalname:]{\href{https://github.com/allenai/olmes}{\texttt{olmes}}} \metadata[\quad\aitwo Demo:]{\href{https://playground.allenai.org/}{\texttt{playground.allenai.org}}} \begin{document} \maketitle \newpage \tableofcontents \newpage \input{tables/artifacts} \newpage \begin{figure}[t] \centering \includegraphics[width=\linewidth]{figures_images/tulu3_figure1_v2.pdf} \caption{An overview of the \modelname~3 recipe. This includes: data curation targeting general and target capabilities, training strategies and a standardized evaluation suite for development and final evaluation stage. } \label{fig:tulu3_recipe_overview} \end{figure} \section{Introduction} \emph{``Just as the camel shares its burdens with others in the caravan, the wise share their insights to lighten the load of ignorance.'' -- Proverb generated by} \modelname~3. Post-training --- the collection of techniques including instruction tuning, reinforcement learning from human feedback, and other types of finetuning --- has become a crucial step in building frontier language models~\citep{OpenAI2024, Anthropic2024}, yet developments to these techniques are frequently not accompanied by open resources and recipes. 
Fully open source counterparts (e.g.,~\modelname~2~\citep{ivison2023camels} and Zephyr-$\beta$~\citep{tunstall2023zephyr}) often rely on simpler-to-implement and cheaper pipelines and have become outdated on many metrics. To close the gap between open and closed post training, we introduce {\bf \modelname\footnote{A t\"ulu is a hybrid camel bred between Bactrian camel and dromedary: \url{https://en.wikipedia.org/wiki/Hybrid_camel}.}~3}, a family of open state-of-the-art post-trained models, alongside all of the data, training recipes, code, infrastructure, and evaluation framework. Integrating partial details from proprietary methods with novel techniques and established academic research, \modelname~3 pushes the boundaries of research in post-training. The advancements of \modelname~3 are attributed to \modelname~3 \textsc{Data}, new permissively licensed training datasets targeting core skills, \evalname, an evaluation suite and tools to establish clear performance goals and guide improvement through training stages, and \modelname~3 \textsc{Recipe}, an advanced multi-stage training pipeline incorporating new algorithmic advancements in reinforcement learning, cutting-edge infrastructure, and rigorous experimentation to optimize data mixes, methods, and parameters across various training stages. In order to build \modelname~3, we identify a set of core skills to improve after training (e.g., reasoning, math, coding, safety, precise instruction following, knowledge recall, etc.) and build an evaluation framework to establish clear performance goals and guide model improvement over a selection of development and unseen tasks. \modelname~3 benefits significantly from leveraging publicly available open data, generating diverse, skill-specific synthetic data at various training stages, and aggressively decontaminating them against our evaluation suite. The \modelname~3 training recipe involves multiple stages, with each stage building upon the previous model and focusing on different types of data --- namely, {\it prompt-completion} instances for supervised finetuning, {\it preferences} for preference tuning, or {\it verifiable rewards} for reinforcement learning. Our methodology facilitates identifying skill deficiencies and refining the data mix, methods and parameters, ensuring a balanced performance of core skills across the training process. Through rigorous, principled experimentation, we determine the best data mix for supervised finetuning, resulting in the \modelname~3~SFT checkpoint. Leveraging recent advances in preference tuning, we then train a model over carefully curated {\it on-policy} preference data from comparing \modelname~3~SFT completions against outputs from other language models. Furthermore, we introduce a new final finetuning stage -- Reinforcement Learning with Verifiable Rewards (RLVR) - which employs a novel RL objective tailored to enhance specific skills with verifiable answers, such as mathematics and precise instruction following. Our best performing recipe yields \modelname~3 models that outperform the state-of-the-art post-trained open-weight models of the same size such as Llama~3.1~Instruct~\citep{dubey2024llama} or Mistral-Instruct~\citep{mistral2024ministraux}, and at the large 70B size \modelname matches the offerings of closed providers such as Claude 3.5 Haiku and GPT-4o~mini. Furthermore, at 405B size our model performs competitively against DeepSeek~v3~\citep{deepseekai2024deepseekv3technicalreport} and GPT~4o (11-24). 
In summary, \modelname~3 represents a family of state-of-the-art open language models, featuring a modern post-training framework with fully open-source data \modelname~3~\textsc{Data}, evaluation~\evalname, training code \modelname~3~\textsc{Code} and development recipes \modelname~3~\textsc{Recipe}. Here are a few key contributions from the development of \modelname: \begin{itemize} \item Extensive guidance and tooling for evaluation, decontamination, and recipe design, \item Scaled, new synthetic instruction datasets, \item Scaling preference data with on-policy generations, \item Reinforcement learning with verifiable rewards, an RL-based method that only gets a reward if the model's completions are verified to be correct, and \item Advanced infrastructure, details, and code to facilitate the successful implementation of large models. \end{itemize} \input{tables/results_overview} The result of our work is completely open pipelines for finetuning language models. We release final models trained on Llama 3.1 base versions~\citep{dubey2024llama}, % with intermediate checkpoints, training data, training code, and evaluation code (a full list of artifacts released is available in Table~\ref{tab:artifacts}). With all the released resources, others can take open base models and finetune them to high-performance on any task of interest -- laying the foundation of post-training research within complex, multi-objective and multi-stage training regimes. \section{\modelname~3 Overview} Early work in language model post-training followed a standard recipe pioneered by models like InstructGPT~\citep{ouyang2022training}, consisting of instruction-tuning followed by preference finetuning (PreFT)~\citep{stiennon2020learning, nakano2021webgpt, askell2021general, ouyang2022training}. Since then, the sophistication and complexity of post-training approaches have continued to increase, moving towards multiple rounds of training, human data plus synthetic data, and multiple training algorithms and objectives~\citep{touvron2023llama, dubey2024llama, gunter2024apple}. However, most successful post-training models offer limited information about their training data, code, or recipes.\footnote{On LMSYS's ChatBotArena, no model in the top 50 (as of November 20th, 2024) has released its post-training data~\citep{chiang2024chatbot}.} Open post-training research, such as \modelname~2~\citep{ivison2023camels} and Zephyr-$\beta$~\citep{tunstall2023zephyr}, show strong results in some benchmarks and on chat evaluations such as AlpacaEval or Arena-Hard~\citep{arenahard2024}, but still lag behind in core capabilities such as MATH~\citep{hendrycksmath2021}, IFEval~\citep{zhou2023instructionfollowingevaluationlargelanguage} and GSM8K~\citep{cobbe2021gsm8k}. \modelname~3 pushes the boundaries of research in post-training and \textbf{closes the gap between open and closed finetuning recipes}. With \modelname~3, we hope to \textbf{uncover which paths for the open-source community will lead to success and which do not} (by reporting negative results). It is a complex training process that integrates partial details from proprietary methods with novel techniques and combines it with established academic research. The key factors in the success of \modelname~3 are careful data curation, rigorous experimentation and evaluation, innovative methodologies, and improved training infrastructure. 
We followed systematic guidelines, scientifically evaluating this process by creating development and test sets for evaluation and conducting careful decontamination of publicly available datasets. \textbf{\modelname~3 is not just an artifact, but a comprehensive suite of data and tools designed to advance the frontier of open post-training. } By openly sharing our data, recipe and findings, we aim to empower the community to explore new and innovative post-training approaches. We list the extensive artifacts and tools released in Table~\ref{tab:artifacts}. \subsection{\modelname~3 Data}\label{sec:core_skills} \input{tables/eval-suite} The \modelname~3 effort began with identifying key areas where open post-training recipes often fall behind and that are desirable capabilities for generalist language models. Table~\ref{tab:eval_suites} outlines the core capabilities we aim to enhance and the evaluation benchmarks selected to cover these skills. With \modelname~3, we focus on the core skills of knowledge recall, reasoning, mathematics, coding, instruction following, general chat, and safety. We curate and collect \modelname~3 \textsc{Data} to target these core skills by sourcing from public data and synthetically curating data. We use various data formats at different stages of training. Table~\ref{tab:sft_summary} outlines the collection of datasets used to train our model, and further details are provided in \autoref{sec:prompts}. \subsection{\modelname~3 Evaluation} A key factor in the success of our post-training approach is establishing clear performance goals and evaluation tools to guide improvement. With \evalname, we release a unified, standardized evaluation suite and a toolkit to guide the development and assessment of final models while decontaminating training data against evaluation benchmarks. Our framework consists of an open evaluation toolkit for reproducible evaluations (Section~\ref{sec:olmes}), a suite for evaluating core skills in instruction-tuned models with separate development (Section~\ref{sec:dev_suite}) and held-out evaluations (Section~\ref{sec:unseen_suite}), and a set of recommended settings for evaluating on our evaluation suite based on our experiments with various models. Both splits cover all identified skills, except that we have no unseen safety evaluation. Crucially, we did not examine scores on our unseen set when developing our models, allowing us to observe how much we may have overfit to particular evaluations in our decisions around data mixtures, algorithms, and hyperparameters. Table~\ref{tab:eval_suites} summarizes our evaluation suite. We provide further details on our evaluations in Section~\ref{sec:evaluation} and in Table~\ref{tab:test_settings}. We publicly release our evaluation suite at \url{https://github.com/allenai/olmes}. \input{tables/405b_results} \input{tables/70b_results} \input{tables/8b_results} \subsection{\modelname~3 Recipe} In this section, we provide an overview of the \modelname~3 recipe for obtaining a state-of-the-art post-trained model. We produce \modelname~3 models through a four-stage post-training recipe on top of pretrained language models (see \autoref{fig:tulu3_recipe_overview}). The \modelname~3 \textsc{Recipe} is an advanced multi-stage training pipeline incorporating new algorithmic advancements in reinforcement learning, cutting-edge infrastructure, and rigorous experimentation to curate data and optimize data mixes, methods, and parameters across various training stages.
Throughout all stages, we measure model performance using a carefully-chosen evaluation suite. The stages are as follows: \begin{itemize}[label=] \item {\bf Stage 1: Data Curation (\autoref{sec:prompts})} We curate a variety of prompts to be allocated across multiple stages of optimization. We create new synthetic prompts or, when available, source prompts from existing datasets to target specific capabilities. We ensure prompts are not contaminated with our evaluation suite, \evalname. \item {\bf Stage 2: Supervised Finetuning (\autoref{sec:sft})} We perform supervised finetuning (SFT) on carefully selected prompts and completions. With thorough experimentation, the final SFT data and training hyperparameters are determined to enhance target core skills without significantly impacting the performance of others, guided by our evaluation framework. \item {\bf Stage 3: Preference Tuning (\autoref{sec:preft})} We apply preference tuning, specifically DPO, to newly curated on-policy synthetically created preference data from selected prompts along with off-policy data. As in the SFT stage, we identify the best preference data mix through thorough experimentation, uncovering what formats of data, methods, or hyperparameters lead to improvements. \item {\bf Stage 4: Reinforcement Learning with Verifiable Rewards (\autoref{sec:rlvr})} We introduce a new RL-based post-training stage which trains the model on verifiable rewards instead of a reward model, as is common for traditional RLHF training. We select tasks with verifiable outcomes, such as mathematical problem-solving, and only provide rewards when the model's generations are verified to be correct. We then use RL to maximize these rewards. \end{itemize} The key contributions of our \modelname~3 pipeline lie in improved \textbf{data}, \textbf{methods}, \textbf{infrastructure}, and rigorous \textbf{evaluation}. Key elements of our pipeline include: \begin{itemize} \item \textbf{Data Quality, Provenance, and Scale} (\S\ref{sec:prompts}) We obtain prompts by carefully surveying available open-source datasets, analyzing their provenance, and decontaminating them, as well as curating synthetic prompts that target core skills. To ensure effectiveness, we conduct thorough experiments to study their impact on our development evaluation suite. We find targeted prompts to be influential to improve core skills, while real-world queries, e.g., WildChat~\citep{zhao2024wildchat}, are important to improve general chat capabilities. Using the \evalname decontamination tool, we ensure prompts are not contaminated against our evaluation suite.\footnote{ We observe a non-trivial amount of contamination in a few open datasets with popular evaluation benchmarks. Details are provided in \autoref{tab:contaminated_datasets}.} % \item \textbf{Creating a Multi-Skill SFT Dataset} (\S\ref{sec:sft_data}) The distribution of the prompts in the ``general'' and ``skill-specific'' categories was refined by several rounds of supervised finetuning % on various data mixtures. For example, to improve mathematical reasoning, we first establish an upper bound in our evaluation suite by creating math-specialized models, then mix data to bring the general models closer to this upper bound. \item \textbf{Curating an On-Policy Preference Dataset} (\S\ref{sec:pref_pipeline_main}) We develop an on-policy data curation pipeline to scale our preference dataset generation. 
Concretely, we generate completions from \modelname~3-SFT and other models for given prompts, and obtain preference labels through their pairwise comparisons. Our approach extends and improves the off-policy preference data generation method by~\citet{cui2023ultrafeedback}. Careful multi-skill selection of preference data yields 354,192 instances for preference tuning demonstrating significant improvements in a range of tasks. % \item \textbf{Preference Tuning Algorithm Design} (\S\ref{sec:preference_tuning_recipe}) We experiment with several preference tuning algorithms and observe improved performance in using length-normalized Direct Preference Optimization. We prioritized simplicity and efficiency in experimentation and used length-normalized DPO throughout the development process and training our final models, in lieu of more costly investigations into RL-based methods, such as PPO. \item \textbf{Skill-Specific RL with Verifiable Rewards} (\S\ref{sec:rlvr}) We adapt a new approach, leveraging a standard reinforcement-learning paradigm to target skills that can be evaluated against a ground-truth outcome (e.g., Math). We refer to this algorithm as Reinforcement Learning with Verifiable Rewards (RLVR); it obtains a constant reward value if a completion is successful. Our results show that RLVR can improve GSM8K, MATH, and IFEval performance. \item \textbf{Training Infrastructure for Reinforcement Learning} (\S\ref{subsec:rl_infra}): We implemented an asynchronous RL setup: we run LLM inference efficiently via vLLM while the learners perform gradient updates concurrently. Our RL codebase is also highly scalable and can train 70B and 405B RLVR policy models. \item \textbf{Evaluation Framework: \evalname} (\S\ref{sec:evaluation}) In addition to evaluating the final models, our evaluation framework is an open evaluation toolkit designed to guide the development progress through carefully selected evaluation suite and tools for decontamination. \end{itemize} \subsection{Evaluation and Results} When reporting scores throughout this work, we use the metrics identified in Table~\ref{tab:eval_suites}; higher is better. When computing overall performance, we simply average scores across all evaluations, treating each evaluation equally. For generative evaluations our output length is 4096. \modelname~3 trained on Llama 3 base models outperforms all other open-weight models in its size category on our development evaluation suite. Compared to closed models, \modelname~3 70B even surpasses closed models such as GPT-3.5-Turbo-0125 or GPT-4o-mini-2024-07-18, while approaching the performance of Claude 3.5 Haiku 20241022. The summary of \modelname~3 trained on Llama 3 at 8 and 70 billion parameters versus the leading models in their size classes is shown in Table~\ref{tab:overview}. A per training stage breakdown of performance is shown for the 8B version in Table~\ref{tab:8b_results} and for 70B in Table~\ref{tab:70b_results}. With our models trained from raw pretrained base models, we compare to instruct models trained on the same base models (e.g. Nous Hermes 3), instruct models on similar sized, but different base versions (e.g. Ministral 8B or Qwen 2.5 Instruct), and other finetuning recipes trained on an instruct version (e.g. Nemotron Llama 3.1). 
At 70B, we compare to and surpass Llama 3.1 70B Instruct, Qwen 2.5 72B Instruct~\citep{qwen2.5}, Nous Hermes 3 70B~\citep{teknium2024hermes} (trained on Llama 3.1 70B), and Nemotron Llama 3.1 70B~\citep{wang2024helpsteer2p} (trained on Llama 3.1 70B Instruct). At 8B, we compare to and surpass Llama 3.1 8B Instruct, Gemma 2 9B Instruct~\citep{team2024gemma}, Nous Hermes 3 8B (trained on Llama 3.1 8B), Qwen 2 7B Instruct, and Ministral 8B Instruct 2410. \paragraph{Artifacts Released.} We release all artifacts associated with the \modelname~3 training recipe -- including SFT, DPO, and RL model checkpoints, along with new SFT and DPO datasets. A summary of the artifacts released with \modelname~3 is included in Table~\ref{tab:artifacts}. \input{tables/sft_summary} \section{\modelname~3 Data } \label{sec:prompts} Prompts represent the diverse ways users may interact with models and serve as the essential component for all post-training stages. We curate an extensive collection of millions of prompts as the starting point of \modelname~3 post-training recipe. Data selected for next stages of training are selected from these prompts. \autoref{tab:sft_summary} summarizes the key information of these prompts. In this section, we describe our prompt curation process and the decontamination effort to ensure that our evaluations are not leaked in these prompts. In the following sections, we describe how prompts are used for supervised finetuning \S\ref{sec:sft} and preference tuning \S\ref{sec:preft}. % \subsection{Prompt Curation} To target the desired core skills, we curate a {\it diverse} and {\it high quality} set of prompts from publicly available datasets with clear \textit{provenance} and synthetically generate prompts to fill any gaps. \subsubsection{Sourcing from Public Datasets} Since the release of our \modelname~2, the community has witnessed a large body of work creating datasets for post-training, in terms of both supervised finetuning and preference tuning. \modelname~3 aims to integrate and extend these resources to build stronger models. We start this process with a broad survey of public datasets, including those annotated by dedicated workers, sourced from real users, and synthesized with models. \footnote{The datasets we compiled and consider are available {\protect\href{https://docs.google.com/spreadsheets/d/1E2ScaKWbTnlelzJzcddCzEtf7WrpF3a5ZP5ZvdsOZ4Y/edit?usp=sharing}{here}: \url{https://docs.google.com/spreadsheets/d/1E2ScaKWbTnlelzJzcddCzEtf7WrpF3a5ZP5ZvdsOZ4Y/edit?usp=sharing}}.} We then manually review each individual dataset, and pick those with the following considerations. \paragraph{Diversity.} The diversity of training data is critical for eliciting models' generalization, avoiding model forgetting, and making models robust to uncommon inputs~\citep{wang2022super, chung2024scaling, zhou2024lima}. We pick datasets that can promote diversity, including: WildChat~\citep{zhao2024wildchat}, which is a large source of real-user interaction with models; Open Assistant~\citep{kopf2024openassistant}, which is created by volunteer workers for general chatting; No Robots~\citep{no_robots}, which is annotated by expert workers for a broad range of open-ended categories; and FLAN v2~\citep{longpre2023flan}, which is a big compilation of classical NLP tasks. 
We also include a decontaminated subset of UltraFeedback~\citep{cui2023ultrafeedback}, which is a composition of several datasets (FalseQA~\citep{hu2023won}, UltraChat~\citep{ding2023enhancing}, Evol-Instruct~\citep{xu2023wizardlm}, FLAN v2~\citep{longpre2023flan}) and has shown strong performance for general preference tuning in early studies~\citep{tunstall2023zephyr,ivison2024unpacking}. \paragraph{Target Skills.} We especially consider enhancing several capabilities that can power common use cases and our specific needs. As shown in our earlier study~\citep{wang2023far}, some capabilities, such as complex reasoning, coding, and precise instruction following, benefit from mixing in additional data. Therefore, we include the following datasets: OpenMathInstruct~\citep{toshniwal2024openmathinstruct} and NuminaMath~\citep{numina_math_7b} for mathematical reasoning, Evol-CodeAlpaca for coding, a subset of Daring-Anteater~\citep{wang2024helpsteer2} for precise instruction following, Aya~\citep{singh2024aya} for multilinguality, SciRIFF~\citep{wadden2024sciriff} for scientific literature understanding, and TableGPT~\citep{zha2023tablegpt} for processing table-related tasks. We have also considered other datasets for domains with plenty of published research (e.g., math), but they either did not bring additional benefits in our early supervised finetuning experiments or have restrictive licenses. \paragraph{Data Provenance and Licenses.} When sourcing prompts, we take careful consideration of the licenses of the original datasets and only allow those with clear and correct licenses. Since many publicly released datasets are compositions of other datasets, we have to manually track the provenance of subsets to verify their licenses and remove those that have issues. Specifically, the ShareGPT dataset\footnote{ShareGPT data was initially used to build the Vicuna model~\citep{vicuna2023}, but the exact dataset has not been released. Later work mainly used a community reproduced version at \url{https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/}.} is of questionable legal provenance as they were shared by users on the internet without an agreement to be used for model training or being released at all, so we exclude it and use WildChat instead. We also removed the relevant subset from UltraFeedback and decided not to use Helpsteer2 \citep{wang2024helpsteer2} due to the use of ShareGPT in their prompts. All the datasets included in our final curation have clear licenses. % \subsubsection{Synthesizing for Target Skills} \label{sec:synth_data_gen} To address the growing need for diverse and skill-specific datasets, we incorporate synthetic data generation as a complementary approach. Synthetic data generation has gained traction as a promising alternative to human-written data due to being cheaper to obtain, customizable for different purposes, and reflecting the vast knowledge of the underlying models~\citep{dubey2024llama}. However, generating diverse and high-quality data at scale is non-trivial, as LMs are susceptible to falling into repetitive modes or patterns, referred to as ``mode collapse'' \citep{kazdan2024collapsethriveperilspromises}. To ensure diversity in generation, we follow the recent \textit{persona-driven} methodology in \citet{chan2024scaling} to generate synthetic data. 
The key idea is to use different personas (e.g., ``A machine learning researcher focused on neural networks'') with a data synthesis prompt (e.g., ``create a coding problem'') to steer an LLM to synthesize data with corresponding perspectives. Specifically, we condition on $\sim$250K personas from Persona Hub \citep{chan2024scaling} to generate prompts targeting specific skills such as precise instruction following, math and coding. We detail our procedure for each select skill below. Prompts used to generate these instructions can be found in Appendix~\ref{appdx:sft-prompts}. Additionally, we build upon our previous efforts in~\cite{brahman2024art,han2024wildguard,wildteaming2024}, to generate noncompliance and safety data. \textbf{Precise Instruction Following. } Precise instruction following is the ability to follow verifiable instructions in natural language, such as ``your answer should contain exactly 3 paragraphs,'' that can be automatically verified with heuristics. We use our persona-driven approach to synthetically generate verifiable instructions covering 25 different constraint types defined in IFEval benchmark \citep{zhou2023instructionfollowingevaluationlargelanguage}. More concretely, we start by manually writing 1-2 example instructions per constraint (\textit{e.g.,} \texttt{number of words}), resulting in total of 33 verifiable instructions which we used as seed prompts. We then generate new instructions using \texttt{GPT-4o}~\citep{openai2024gpt4o}\footnote{We use \texttt{GPT-4o-2024-08-06} for all our persona-driven data synthesis, unless otherwise stated.} given a data synthesis prompt, persona, and a single verifiable instruction as an example. Figures \ref{fig:persona-prompt-sft-if} and \ref{fig:persona-prompt-if-res} show the exact prompts used to generate the instruction and its corresponding response, respectively. In total, we collected 29,980 verifiable instruction-response pairs which we call \textsc{If-Persona-Sft}. Lastly, we also generate another type of prompts targeted for constrained instruction following by randomly sampling instructions from the \modelname~2 SFT mix and combining them with constraints from the taxonomy in \citet{zhou2023instructionfollowingevaluationlargelanguage}. We call that set \textsc{IF-augmented}. These prompts are only used for the DPO and RLVR stages. \textbf{Math and Coding.} We follow a similar persona-driven approach to synthetically generate diverse math word and coding problems. Math problems include those that require advanced mathematical skills as well as grade school problems. For coding, we generate Python programming questions that are solvable by entry- to medium-level programmers. Unlike precise instruction following, we zero-shot prompt \texttt{GPT-4o} to generate problems that are unique and specific to a given \textit{persona} input. Having generated the problems, we then generate multi-step math solutions using \texttt{GPT-4o}, and Python programs using \texttt{claude-3-5-sonnet}. Exact prompts used to generate problems and solutions are provided in Figures \ref{fig:persona-prompt-sft-math}, \ref{fig:persona-prompt-sft-code}, \ref{fig:persona-prompt-sft-math-res}, and \ref{fig:persona-prompt-sft-code-res}, respectively. In total, we collected $\sim$220K and 35K instances for math reasoning and coding. \textbf{Noncompliance and Safety.} As we enhance models’ capabilities to assist users effectively, it is crucial to ensure they can reliably reject unsafe and appropriately handle nuanced and out of scope queries. 
To support this, we curate a set of noncompliance queries \citep{brahman2024art} that the model ought to not comply with, alongside safety-related direct and adversarial prompts \citep{han2024wildguard,wildteaming2024} covering both benign and harmful scenarios. Our noncompliance and safety prompts are either curated from existing datasets \citep{zhang-choi-2021-situatedqa, zhao2024wildchat} or synthetically generated from the GPT model family. More specifically, our noncompliance prompts are obtained based on our contextual noncompliance taxonomy from \citet{brahman2024art}, spanning multiple categories including \textit{incomplete}, \textit{unsupported}, \textit{indeterminate}, and \textit{humanizing} requests (in addition to \textit{unsafe} requests). Our safety-related prompts are carefully selected among synthetic adversarial prompts, synthetic vanilla (direct) requests, real-world user-LLM interactions (In-The-Wild), and curated annotator-written examples to maximize coverage, diversity, and balance. \subsection{Prompt Decontamination} \label{sec:decontamination} \input{tables/decontamdatasets} One important consideration when curating our training mix was possible overlap between training prompts and evaluation sets. We quantify such overlap as follows and remove instances from our training mix as needed in order to prevent test set contamination. \textbf{Matching Method.} We experimented with full-string, n-gram, and embedding-based matching and found that n-gram matching yielded the most useful results --- while embedding-based methods can in principle identify non-trivial contamination like that due to paraphrasing~\citep{yang2023rethinking}, we found it difficult to distinguish mere distributional similarity from actual paraphrasing. Moreover, partial surface-level overlap using n-gram matching successfully identified cases of contamination where the instances were trivially different, e.g., a math problem where only the numbers differ. \textbf{Identifying Matching Instances.} Since completions in training datasets are often regenerated using language models, we chose to compute overlap in the prompts alone (or more generally user turns in multi-turn dialogues). We used 8-gram matching for our contamination checks following~\citep{dubey2024llama,singh2024evaluation}. For each token in a test instance, we consider it to match a token in a train instance if the two instances share an 8-gram containing that token, and we consider the test instance itself to have significant overlap with a train instance if more than 50\% of the test tokens have 8-gram matches with the same training instance. \textbf{Decontamination.} We consider a training set to be contaminated if any number of its instances overlap with more than 2\% of the instances in any of the evaluations in our development and unseen suites. We remove all the training sets that were contaminated with our unseen evaluations. For training sets that were contaminated with our development evaluations, we removed the entire dataset if doing so did not significantly impact the performance of the resulting model; otherwise, we removed the specific instances that match any test instance. The list of datasets we decontaminated and the versions we released with overlapping samples removed is shown in Table~\ref{tab:decontamdatasets}. The full list of public datasets that we found to be significantly contaminated with our evaluation sets can be found in Table~\ref{tab:contaminated_datasets}. 
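To make the matching rule above concrete, the following simplified sketch implements the per-instance check; whitespace tokenization and the function names are our own simplifications for illustration, not the exact \evalname{} decontamination implementation:
\begin{verbatim}
def ngrams(tokens, n=8):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def has_significant_overlap(test_text, train_text, n=8, threshold=0.5):
    # A test token "matches" if it lies inside an 8-gram shared with the same
    # training instance; the test instance overlaps significantly if more than
    # 50% of its tokens match.
    test_tokens = test_text.lower().split()
    train_tokens = train_text.lower().split()
    shared = ngrams(test_tokens, n) & ngrams(train_tokens, n)
    matched = [False] * len(test_tokens)
    for i in range(len(test_tokens) - n + 1):
        if tuple(test_tokens[i:i + n]) in shared:
            for j in range(i, i + n):
                matched[j] = True
    return bool(test_tokens) and sum(matched) / len(test_tokens) > threshold
\end{verbatim}
A training set is then considered contaminated if its instances overlap with more than 2\% of the instances in any development or unseen evaluation, as described above.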
\section{Supervised Finetuning} \label{sec:sft} Adapting pretrained base models to various tasks and user requests often relies on supervised finetuning (SFT), also known as instruction finetuning. A key challenge in this process is balancing the proportions of mixed training datasets representing diverse skills. For \modelname~3, we conducted data mixture ablations and explored model merging techniques to develop an SFT training procedure that well balances performance across the core skills we prioritized. The following sections detail our experiments and findings. \input{figures_latex/sft-distribution} \subsection{SFT Data} \label{sec:sft_data} \subsubsection{From Prompts to SFT Data} To create our SFT mix, we collect or create responses for prompts described in Section~\ref{sec:prompts} in two ways: filtering existing responses, and creating new responses. For prompts with existing responses, we generally keep the original response if it was written by a human or a frontier model, like GPT-4o. For large datasets with subsets from frontier models (e.g. WildChat), we use the subset from the best models. We additionally filter empty responses and responses that contain information about models or their developers. If a set of prompts did not have responses, like our Persona prompts, or if the original responses were from a weaker model (e.g. WildGuardMix), we generate new responses using GPT-4o. We also hand-wrote responses to our hardcoded prompts. \subsubsection{The \modelname~3 SFT Mix}% To develop our SFT mix, we first identified the skills that were lagging behind state of the art models using Llama 3.1 trained on \modelname~2\footnote{\url{https://huggingface.co/allenai/llama-3.1-tulu-2-8b}} as our baseline. Targeting each of these skills in isolation, we collected high quality publicly available datasets and created synthetic datasets, as described in Section~\ref{sec:synth_data_gen}, and also removed some datasets that we identified to be of relatively lower quality compared to other more recent datasets. To design our final SFT mix, we first built skill-specific data mixtures and models, keeping the mixtures that led to the best performance on individual skills, ignoring other evaluations. This was done to approximate the upper bound for each evaluation given our setup. We then combined these mixtures to create our initial \modelname~3 preview mix. We then continued to iterate on the mixture by adding or removing datasets to improve lagging skills, decontaminating against our evaluations and downsampling particularly large datasets. We show the performance of major preview versions throughout development in Figure~\ref{fig:tulu_dev_versions}. \begin{figure}[t] \centering \includegraphics[width=0.7\linewidth]{figures_images/tulu_version_bars.pdf} {\small \\ \cblock{247}{200}{226}~\modelname~2 \quad \cblock{231}{238}{238}~Intermediate Mix 1 \quad \cblock{206}{220}{221}~Intermediate Mix 2 \quad \cblock{159}{185}{187}~Intermediate Mix 3 \\ \cblock{136}{168}{171}~Intermediate Mix 4 \quad \cblock{110}{151}{154}~Intermediate Mix 5 \quad \cblock{240}{82}{156}~\modelname~3 } \captionof{figure}{Average and selected skill-specific performance from training Llama 3.1 8B on our initial \modelname~2 SFT mix, and our intermediate and final \modelname~3 SFT mixes. Intermediate mixes 1, 2, and 3 were the result of adding new datasets to improve performance. 
Intermediate mixes 4 and 5 were the result of running multiple rounds of decontamination, causing small drops in performance.} \label{fig:tulu_dev_versions} \end{figure} \paragraph{Final SFT Results.} In Table~\ref{tab:sft-results}, we compare our final \tuluthreesmallsft and \tuluthreelargesft models against other SFT-only models trained on Llama 3 8B or 70B. Our new SFT mix shows substantial improvements over the \modelname~2 mix at both model sizes, and is better on average than the other competitive 8B SFT models. \input{tables/sft_results} \label{sec:sft_mixing} \subsection{Key Data Experiments} We also ran a series of controlled experiments after developing our final SFT mix to explore the importance of different decisions made during data mixing and training. \input{tables/sft_ablations} \paragraph{Diverse Chat Data.} In our mix we also emphasized adding diverse chat data, mainly from WildChat. We show the impact of removing WildChat in Table~\ref{tab:sft-ablations}, and we see that there is a small but noticeable degradation on most skills, most noticeably on AlpacaEval, highlighting the importance of diverse real-world data. \paragraph{Safety is Orthogonal.} We found that our safety SFT data was generally orthogonal to our other datasets. We report the effect of removing our safety-specific datasets in Table~\ref{tab:sft-ablations}, and we see that most skills stayed roughly the same, except the safety average. We also found that adding contrastive prompts, such as those in CoCoNot, was helpful for preventing our models from over-refusing safe prompts. \paragraph{New Persona Data.} Our new Persona datasets were built to target specific skills: mathematics, coding, and instruction following. In Table~\ref{tab:sft-ablations} we show that performance on HumanEval(+), GSM8K, MATH, and IFEval drops after removing our Persona datasets, showing the value of creating diverse, skill-specific SFT datasets. \paragraph{Targeting Specific Skills.} A large portion of our focus was on collecting or creating datasets targeting specific capabilities. Using mathematical reasoning as an illustrative example, we show in Table~\ref{tab:sft-ablations} the impact of our mathematics-specific data on both GSM8K and MATH. We see that our mathematics-specific SFT data substantially improves both GSM8K and MATH, showing the value of the data included in the final mix. \paragraph{Amount of SFT Data.} In Figure~\ref{fig:downsampling}, we show the effect of taking stratified subsamples of our SFT mix. We find that our models continue to improve on average as more SFT data is included, and we see large improvements on metrics like GSM8K as we increase the amount of data to the full mix. Interestingly, TruthfulQA performance actually \textit{drops} as the amount of data in the mix increases. We do not increase our SFT data size beyond the current mixture because we allocated other prompts for preference optimization. \input{tables/sft_hypers_separate} \subsection{SFT Recipe and Analyses} \paragraph{Training Settings.} To train our \modelname~3 models, we used between 4 and 16 8xH100 nodes with high-speed interconnect. The final 8B model was trained on 32 GPUs for $\sim$6 hours, and the 70B model was trained on 64 GPUs for $\sim$50 hours. We used an effective batch size of 128 and a maximum sequence length of 4,096 tokens. We trained for two epochs using a learning rate of 5e-6 for our 8B models and 2e-6 for our 70B models, which we found after a hyperparameter search.
Our hyperparameter settings are also summarized in Table~\ref{tab:sft-hyperparameters}. For merging experiments we used mergekit\footnote{\url{https://github.com/arcee-ai/mergekit}} \citep{goddard2024arcee}, using linear weighted averaging. \subsubsection{Key Training Experiments} \paragraph{Choice of Base Model.} We also test the effect of training different base pretrained models on mathematical performance using our full SFT mix. In Table~\ref{tab:different_base_models}, we show the impact of changing the model's \textit{size} by training on both Llama 3.1 8B and 70B, and the impact of adding \textit{domain-specific pretraining data} by training on Qwen 2.5 7B and Qwen 2.5 Math 7B. In both cases, we see a substantial improvement in both GSM8K and MATH, highlighting the importance of both model size and pretraining data for downstream skills. \input{tables/different_base_models} \paragraph{Chat Template Variation.} \label{sec:chat_template_variation} During creating \modelname~3, we explored changing the chat template used to guide the generation of finetuned models. We made a small change to the chat template used in previous \modelname versions, specifically removing the new line at the end of the template (before the model response). The performance between different changes to the chat template is shown in Table~\ref{tab:chattemplates} on an early version of our SFT setup. We found that replacing the newlines at the end of assistant messages with an eos token resulted in the best performance, but we opted not to use this to avoid generation inconsistency with later steps in our post-training pipeline. The chat template can be found \href{https://github.com/allenai/open-instruct/blob/2bc1772f115da412bfb7c705305307a8f2a6b0dc/open_instruct/dataset_processor.py#L131}{in our codebase} and we provide it in Appendix~\ref{app:chat_template}. \input{tables/chat_templates_simple_sft} \paragraph{Random Seeds and Model Soups.} We also explored changing the random seed during SFT, and then using those models to create model soups \citep{pmlr-v162-wortsman22a}. In Table~\ref{tab:model_soups}, we compare training 8B and 70B models with multiple different seeds with the best model soup. We see that SFT performance noticeably varies based on the seed, highlighting the importance of multiple training runs, and that the best model soup does not always outperform the best single training run. Because of this, we use the best single SFT training run for each model size as our final SFT models. \input{tables/merging_example} \begin{figure}[t] \centering \includegraphics[width=0.7\linewidth]{figures_images/downsampling_bars.pdf} {\small \\ \cblock{231}{238}{238}~5\% \quad \cblock{206}{220}{221}~10\% \quad \cblock{183}{203}{204}~25\% \quad \cblock{159}{185}{187}~50\% \quad \cblock{136}{168}{171}~75\% \quad \cblock{240}{82}{156}~100\% } \captionof{figure}{Average and skill-specific performance on stratified subsamples of our final SFT mix. 
We find that our full mix performs best overall.} \label{fig:downsampling} \end{figure} \subsubsection{Batch Aggregation} Early in training \modelname~3, we noticed a gap in performance between SFT models trained with our Open-Instruct framework and models trained in other settings such as on TPUs.\footnote{Relevant code: \url{https://github.com/hamishivi/EasyLM}} We found this issue was largely due to a (recently widely-reported) issue with loss aggregation inside Transformers~\citep{wolf-etal-2020-transformers}: averaging the loss over non-padding tokens without taking into account gradient accumulation or distributed training setups. Here, we illustrate the issue with an example. Assume we have two samples in a batch, with $n_1$, $n_2$ non-padding tokens and $m_1$, $m_2$ padding tokens. If we pass both samples into the default Transformers forward pass at the same time, we get: \begin{align} L = \frac{l_{n_1} + l_{n_2}}{n_1 + n_2} \end{align} However, if we apply gradient accumulation, feeding in the two samples separately, computing the loss, and then dividing, our loss is instead computed as: \begin{align} L = \frac{\frac{l_{n_1}}{n_1} + \frac{l_{n_2}}{n_2}}{2} \end{align} That is, in the second case we weight \textit{each example equally}, while in the first we weight \textit{each token equally}. As such, changing gradient accumulation can have large effects on performance due to effectively changing sample weightings, as reported by \citet{muennighoff2024generative}. A similar issue occurs in distributed training due to cross-device averaging. We refer to recent reports on this issue for a more in-depth explanation.\footnote{\url{https://unsloth.ai/blog/gradient},\\ \url{https://muellerzr.github.io/blog/gradient_accumulation_part2.html}} To fix this issue, we generally opted to use a \textbf{sum loss} instead of averaging (`mean loss') when training. This removes the issue by simply removing the denominator from the above equations and requires an adjustment to learning rates. This effectively weights all tokens equally (which we found led to generally better performance for initial mixtures). We validated the performance of our setup by finetuning Llama 3.0 on the \modelname~2 SFT mixture using a variety of learning rates, epochs, and loss types, as shown in Figures~\ref{fig:sft_lr_ablate} and~\ref{fig:sft_epoch_ablate}. Ultimately, we found that using a \textbf{sum loss with a learning rate of 5.00E-06} worked best. Surprisingly, we additionally found that training for longer did not yield further improvements, and so used 2 epochs for training. \begin{minipage}[t]{0.48\textwidth} \centering \includegraphics[width=\textwidth]{figures_images/post-training/sft_performance_vs_lr.pdf} \captionof{figure}{Average performance when finetuning Llama 3.0 on the \modelname~2 mixture using differing loss types and learning rates. We find that an LR of 5e-6 with a sum loss works best.} \label{fig:sft_lr_ablate} \end{minipage}% \hfill \begin{minipage}[t]{0.48\textwidth} \centering \includegraphics[width=\textwidth]{figures_images/post-training/sft_performance_vs_epochs.pdf} \captionof{figure}{Average performance when finetuning Llama 3.0 on the \modelname~2 mixture using sum loss and an LR of 5e-6 for varying numbers of epochs. We find using 2 epochs works best.} \label{fig:sft_epoch_ablate} \end{minipage}
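As a concrete illustration of the two aggregation schemes above, the following minimal Python sketch (using illustrative token counts and losses rather than values from our training runs) shows that the per-token mean changes depending on whether the batch is processed at once or split across gradient-accumulation steps, while the sum loss does not:
\begin{verbatim}
# l1, l2: summed token losses; n1, n2: non-padding token counts.
l1, n1 = 12.0, 10    # short sample
l2, n2 = 90.0, 100   # long sample

# Full batch, per-token mean (both samples in one forward pass):
mean_full_batch = (l1 + l2) / (n1 + n2)        # 102 / 110 ~ 0.93

# Gradient accumulation: per-sample mean, then averaged across steps:
mean_grad_accum = (l1 / n1 + l2 / n2) / 2      # (1.2 + 0.9) / 2 = 1.05

# Sum loss: identical either way, weighting every token equally.
print(mean_full_batch, mean_grad_accum, l1 + l2)
\end{verbatim}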
\section{Preference Finetuning} \label{sec:preft} For \modelname~3 we explore many approaches to preference finetuning with the goal of improving our entire evaluation suite. We explore multiple training algorithms, from Direct Preference Optimization (DPO) and its derivatives to reinforcement learning algorithms such as Proximal Policy Optimization (PPO). In this section, we detail the problem formulation of learning from human preferences and our optimizers. Next, we explain how to convert our prompts into synthetic preference data from both on-policy (\modelname~3 suite) and off-policy models (other instruct models). We show how to create preference data for specific skills of interest and how we improve our models robustly with DPO. \subsection{Background} Prior work has established training on preference data as a crucial step for improving model performance on benchmarks simulating human or synthetic preferences~\citep{dubois2023alpacafarm, ivison2023camels, ivison2024unpacking}. The typical procedure is reinforcement learning from human or AI feedback\footnote{Now colloquially referred to as synthetic feedback data as well.}~\citep{ziegler2019fine,stiennon2020learning,ouyang2022training,bai2022constitutional}. \subsubsection{Setup} \textbf{Preference Data.} In the standard setup, there is some preference dataset $\mathcal{D}$ consisting of prompts $x$ and two responses $y, y'$ per prompt. Some judge(s) will choose one of $y, y'$ as their preferred response $y_c$, and label the other as a rejected response $y_r$. \textbf{Reward Model.} Given the preference dataset, a reward model (RM) $r_\phi$ is trained with the following objective: \begin{equation} \max_{r_\phi} \mathbb{E}_{(x,y_c,y_r)\sim\mathcal{D}}[\log \sigma(r_\phi(x,y_c) - r_\phi(x,y_r))] \end{equation} where $\sigma$ is the logistic function. The RM objective maximizes the \emph{difference} between the rewards, whose log-sigmoid represents the log-likelihood that $y_c$ will be preferred over $y_r$~\citep{ouyang2022training}. This reward model can help train policy models to output content preferred by the RM's judgments. \subsubsection{Policy Optimization} There is a plethora of options for optimizing language models with access to preference data. Today, the two categories can be abstracted as reinforcement learning algorithms, which learn from an internal representation of value or reward, and direct alignment algorithms, which learn directly from the data. Prior work~\citep{ziegler2019fine,stiennon2020learning,ouyang2022training} optimizes the policy $\pi_\theta$ with the following objective: \begin{equation} \label{eq:rlhf-objective} \max_{\pi_\theta} \mathbb{E}_{y \sim \pi_\theta(x)} \left[R(x, y)\right], \quad R(x, y) = r_\phi(x,y) - \beta\, \text{KL}\left[\pi_\theta(y|x) \,\|\, \pi_\text{ref}(y|x)\right] \end{equation} where $\pi_\text{ref}$ is the initial reference policy and the $\beta$ coefficient helps control the Kullback-Leibler (KL) divergence between the reference policy and the training policy. Here, we explain PPO and DPO as representative examples. \textbf{Proximal Policy Optimization (PPO).} One approach to addressing the above objective is to use online reinforcement learning (RL) like PPO~\citep{schulman2017proximal}. In each training iteration of PPO, the policy needs to generate some samples, generate rewards using the RM on those samples, and maximize $R(x, y)$ using the PPO algorithm. As PPO training loops are complex, we refer the reader to \citet{ouyang2022training, ivison2024unpacking, huang2024thenimplementationdetails} for more thorough descriptions of typical setups. We provide more implementation details in Sec~\ref{sec:rl_recipe}.
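For intuition, a minimal PyTorch sketch of the per-sequence quantity maximized in Eq.~\ref{eq:rlhf-objective} is shown below; it uses the common approximation of the sequence-level KL term by the summed log-probability ratio of the sampled completion, and the function name, tensor shapes, and $\beta$ value are illustrative rather than taken from our training code:
\begin{verbatim}
import torch

def kl_penalized_reward(rm_score, policy_logprobs, ref_logprobs, beta=0.05):
    # R(x, y) = r_phi(x, y) - beta * KL[pi_theta || pi_ref], with the KL term
    # estimated on the sampled completion y via the summed log-prob ratio.
    kl_estimate = (policy_logprobs - ref_logprobs).sum(dim=-1)   # [batch]
    return rm_score - beta * kl_estimate                         # [batch]

# Toy example: two completions with three generated tokens each.
rm_score  = torch.tensor([1.3, -0.2])                  # reward model scores
policy_lp = torch.tensor([[-1.0, -0.5, -0.7],          # log pi_theta(y_t | x, y_<t)
                          [-2.0, -1.2, -0.9]])
ref_lp    = torch.tensor([[-1.1, -0.6, -0.6],          # log pi_ref(y_t | x, y_<t)
                          [-1.8, -1.0, -1.1]])
print(kl_penalized_reward(rm_score, policy_lp, ref_lp))
\end{verbatim}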
\textbf{Direct Preference Optimization (DPO) and Variants.} Another approach is offline preference tuning. DPO~\citep{rafailov2024direct} directly optimizes the RLHF objective with the following equivalent objective: \begin{equation} \max_{\pi_\theta} \mathbb{E}_{y_c,y_r \sim \mathcal{D}} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_c | x)}{\pi_\text{ref}(y_c | x)} -\beta \log \frac{\pi_\theta(y_r | x)}{\pi_\text{ref}(y_r | x)} \right) \right]. \label{eq:dpo_eq} \end{equation} DPO trains an implicit reward model and a policy model simultaneously, without needing to use a separately trained reward model, do policy generations, or get rewards from the RM. Crucially, this allows offline preference finetuning, directly training a language model on preference pairs gathered from a variety of sources. Recently, much work has examined how to further improve the DPO objective, with a multitude of variants proposed~\citep[\textit{inter alia}]{meng2024simpo, xu2024contrastivepreferenceoptimizationpushing, hong-etal-2024-orpo}. In this work, we explored two promising variants: {\bf SimPO}~\citep{meng2024simpo} and {\bf length-normalized DPO}\footnote{As proposed in the original~\citet{rafailov2024direct}, but not optimized to successful hyperparameters until~\citet{meng2024simpo}.}. We find (in Section~\ref{sec:preference_tuning_recipe}) that length-normalized DPO, which uses the following objective, works best: \begin{equation} \max_{\pi_\theta} \mathbb{E}_{y_c,y_r \sim \mathcal{D}} \left[ \log \sigma \left( \frac{\beta}{|y_c|} \log \frac{\pi_\theta(y_c | x)}{\pi_\text{ref}(y_c | x)} -\frac{\beta}{|y_r|} \log \frac{\pi_\theta(y_r | x)}{\pi_\text{ref}(y_r | x)} \right) \right]. \end{equation} As seen, this is simply the DPO objective (Eq~\ref{eq:dpo_eq}), but with log-probabilities normalized for length, which intuitively helps mitigate the length bias common in human and model preferences~\citep{singhal2024a}. When developing \modelname~3, we opted to use length-normalized DPO for tuning our preference data mixtures and generation methods due to its relative simplicity and speed compared to approaches such as PPO.
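To make the length-normalized objective concrete, the following minimal PyTorch sketch computes the per-pair loss from summed completion log-probabilities under the policy and reference models; the function, its inputs, and the $\beta$ value shown are an illustrative simplification under those assumptions, not the open-instruct implementation:
\begin{verbatim}
import torch
import torch.nn.functional as F

def ln_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                ref_chosen_logps, ref_rejected_logps,
                chosen_lens, rejected_lens, beta):
    # Length-normalized DPO: the usual DPO logits, but each summed log-prob
    # ratio is divided by the length of its completion before scaling by beta.
    chosen_ratio = (policy_chosen_logps - ref_chosen_logps) / chosen_lens
    rejected_ratio = (policy_rejected_logps - ref_rejected_logps) / rejected_lens
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()   # maximize log sigma(.) = minimize this

# Toy batch of two preference pairs (summed token log-probs and lengths).
loss = ln_dpo_loss(
    policy_chosen_logps=torch.tensor([-120.0, -80.0]),
    policy_rejected_logps=torch.tensor([-150.0, -95.0]),
    ref_chosen_logps=torch.tensor([-125.0, -82.0]),
    ref_rejected_logps=torch.tensor([-140.0, -90.0]),
    chosen_lens=torch.tensor([100.0, 64.0]),
    rejected_lens=torch.tensor([120.0, 70.0]),
    beta=5.0,  # illustrative value, not a hyperparameter claim from this work
)
print(loss)
\end{verbatim}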
\subsection{\modelname{} 3 Preference Data} \label{sec:pref_pipeline_main} \subsubsection{From Prompts to Preference Data} \label{sec:pref_pipeline} We create on-policy preference data $(x,y,y',\text{label})$ given our prompts from \autoref{sec:prompts} by adapting and advancing the UltraFeedback pipeline~\citep{cui2023ultrafeedback}. Our early experiments show the benefit of this pipeline in creating preference data, which leads to a high-quality, synthetic preference dataset (as observed by~\citet{ivison2024unpacking}). Our data creation pipeline (shown in \autoref{fig:ufpp_pipeline}) consists of three stages: prompt selection, response generation from a pool of models, and preference annotation with LLM-as-a-judge to create (preferred, rejected) pairs. \begin{figure}[t] \centering \includegraphics[width=\linewidth]{figures_images/ufpp_pipeline.pdf} \caption{ Pipeline for generating and scaling preference data, based on Ultrafeedback \citep{cui2023ultrafeedback}. % } \label{fig:ufpp_pipeline} \end{figure} \begin{itemize} \item \textbf{Stage 1: Prompt Selection} The first step in preparing a dataset for preference finetuning is to select the prompts or user instructions to generate responses and obtain preferences for. Given the set of prompts in \autoref{tab:sft_summary}, we curate our selection to include prompts used during SFT, as well as prompts that were subsampled from the same sources but left unused for SFT. We also include prompts from other sources, such as a version of Ultrafeedback without TruthfulQA instances, or prompts created by adding new IF constraints to existing prompts. \item \textbf{Stage 2: Response Generation} For a given prompt, we randomly sample four models from a \textit{model pool} to generate responses. Our model selection is inspired by the Ultrafeedback pipeline, which consists of open-source and proprietary models that vary across parameter size and model family. We update Ultrafeedback's model pool by using recent versions of some models (Llama 2 $\rightarrow$ Llama 3.1), adding best-performing models to increase the pool size, and replacing currently inaccessible models such as WizardLM with open-source alternatives. Finally, we also include on-policy data by sampling completions from the \modelname{} SFT model. We approach this by adding a selection of prompts where one response is generated from the on-policy model, and the other response from the off-policy models. \item \textbf{Stage 3: Preference Annotation} After generating four responses for each prompt, we use an LLM-as-a-judge \citep{zheng2023judging}, specifically \texttt{GPT-4o-2024-08-06}, to rate each response from 1 to 5 across four different aspects: helpfulness, instruction-following, honesty, and truthfulness. Appendix \ref{appdx:pref-prompts} shows the external models used to sample off-policy data and the prompt template for each aspect. In order to obtain binary preferences for DPO, we take the mean of the preference ratings, similar to Argilla's binarization method,\footnote{\url{https://huggingface.co/datasets/argilla/ultrafeedback-binarized-preferences/blob/main/README.md}} and take the highest-rated response as the chosen response and randomly sample from the responses with a lower mean as the rejected response. \end{itemize} \subsubsection{The \modelname{} 3 Preference Mix} \label{sec:preference_mix} \input{tables/pref_summary_updated} We choose the final preference mixes for the 8B and the 70B models to maximize average performance on the development evaluations while also excelling at targeted skills. Most of the preference data mix ablations were run on the 8B model. We start with prompts used for SFT and generate on-policy and off-policy preference data, resulting in 96,911 (off-policy) and 19,444 (on-policy) preference instances. Given this preference base, we ablate adding additional prompt sources to the mix and how these additions affect downstream evaluation performance, specifically targeting skills like precise instruction following, math, and general chat performance on AlpacaEval. Table~\ref{tab:prefmixing} shows how the inclusion or exclusion of preference datasets influences the average performance. Our final mixes for \tuluthreesmalldpo and \tuluthreelargedpo are displayed in Table~\ref{tab:final_pref_mix}. In summary, our preference mixes come from different prompt sources, such as SFT data, WildChat, and Persona IF. They include prompts seen during SFT training as well as new, unseen prompts. \input{tables/pref_mixing2} \subsection{Key Findings of Data Ablations} We perform several ablations to inform the design decisions of the synthetic preference pipeline (\autoref{sec:pref_pipeline}) and the composition of the \modelname{} 3 preference mix (\autoref{sec:preference_mix}).
\begin{minipage}[t]{0.45\textwidth} \centering \includegraphics[width=0.9\textwidth]{figures_images/post-training/dpo/pref_scaling_v2.pdf} {\small \\ \cblock{251}{222}{236}~Initial 8B SFT \quad \cblock{206}{220}{221}~5\% \quad \cblock{183}{203}{204}~10\% \quad \cblock{159}{185}{187}~25\% \quad \cblock{136}{168}{171}~50\% \quad \cblock{110}{151}{154}~75\% \quad \cblock{240}{82}{156}~100\% } \captionof{figure}{Effect of scaling the size of the preference dataset, specifically the number of unique prompts, on downstream DPO model performance (AE: AlpacaEval).} \label{fig:pref_scaling} \end{minipage}% \hfill \begin{minipage}[t]{0.45\textwidth} \centering \includegraphics[width=0.9\textwidth]{figures_images/post-training/dpo/pref_unique.pdf} {\small\\ \cblock{251}{222}{236}~Initial 8B SFT \quad \cblock{159}{185}{187}~64k \quad \cblock{110}{151}{154}~180k \quad \cblock{63}{116}{120}~383k } \captionof{figure}{Effect of scaling a preference dataset by duplicating prompts on downstream DPO performance using the UltraFeedback dataset. All sizes have the same number of unique prompts (64k).} \label{fig:pref_unique} \end{minipage} \paragraph{Scaling the Number of Unique Prompts Improves Downstream DPO Performance.} First, we investigate whether increasing the number of prompts yields improvements in downstream DPO performance. To do so, we measure downstream DPO model performance when training on different-sized subsets of a fixed pool of preferences with unique prompts. \autoref{fig:pref_scaling} shows that there are noticeable performance gains across several metrics as the size of the preference dataset increases. This suggests that dataset scaling is important for achieving improvements in downstream model performance: our final preference mixes (Table \ref{tab:final_pref_mix}) contain more than 270k data points for the 8B model and more than 330k instances for the 70B model, considerably larger than many available preference datasets. We also explore whether duplicating prompts, i.e., the same prompts with different responses, is a viable approach to scaling the size of a preference dataset and whether it leads to gains in downstream DPO performance. To do so, we expanded the UltraFeedback dataset, which originally had four responses for each prompt, by creating additional pair combinations of responses. This expansion naturally produces duplicated prompts, but with different chosen and rejected pairs sampled from the four responses in UltraFeedback, leading to preference datasets with 64k, 180k, and 383k instances. \autoref{fig:pref_unique} shows that, on average, the 383k-instance preference dataset performs similarly to the 64k preference dataset. We also observe a slight performance degradation on DROP, GSM8K, and AlpacaEval as the number of duplicated prompts increases. This suggests that scaling via prompt duplication does not necessarily yield significant gains in downstream DPO performance, and that investing in the collection of unique prompts and proper mixing is more important for downstream evaluations. \paragraph{Unused Prompts Lead to Higher Performance than Reusing Prompts From the SFT Mix.} We then compare the effect on downstream DPO performance of including new prompts versus re-using prompts from the SFT stage. To do so, we sampled 100k prompts from the SFT dataset mix that were \textit{used} during training (as shown in \autoref{tab:sft_summary}) and compare them against prompts from the same open datasets (e.g., OpenAssistant, SciRIFF, Aya, Persona, WildChat, etc.)
that we subsampled from but left \textit{unused} during SFT. \autoref{fig:pref_ablation_used_unused} shows that the \textit{unused} dataset yields slightly higher performance than reusing prompts. This suggests that the presence of new prompts can help improve downstream DPO performance. That said, as seen in our best mix, combining unused and reused prompts appears to lead to the best result. \paragraph{On-policy Data Improves Downstream DPO Performance.} We investigate whether the inclusion of \textit{on-policy data}, i.e., text generations from the SFT model that will be used as the base model for preference finetuning, improves downstream model performance. Given the same set of prompts sourced from the SFT mix in \autoref{sec:sft}, we generate preferences from off-policy models and compare them to a mix that is strictly on-policy (i.e., one of the responses is always from the Initial 8B SFT model, and the other response is from the off-policy models). We also compare against a combination of both on-policy and off-policy data: we sample instances from the strict on-policy dataset and add them to the off-policy dataset so that the responses from each model are distributed equally. \autoref{fig:pref_ablation_on_and_off_policy} shows that including on-policy data improves aggregated downstream DPO performance compared to a completely \textit{off-policy} dataset in which prompt completions were sampled from other models. \begin{minipage}[t]{0.45\textwidth} \centering \includegraphics[width=0.8\textwidth]{figures_images/post-training/dpo/pref_reusing.png} \captionof{figure}{ Effect of reusing prompts from the SFT mix versus new prompts from the same datasets subsampled for the SFT dataset mix. } \label{fig:pref_ablation_used_unused} \end{minipage}% \hfill \begin{minipage}[t]{0.45\textwidth} \centering \includegraphics[width=0.8\textwidth]{figures_images/post-training/dpo/pref_onpolicy.png} \captionof{figure}{ Effect of including on-policy data during the Response Generation stage of the synthetic preference data pipeline on downstream DPO model performance. } \label{fig:pref_ablation_on_and_off_policy} \end{minipage} \paragraph{Performance Across LLM Judges Is Similar, with GPT-4o Slightly Ahead.} To determine which judge to use for obtaining preference annotations, we test several commercial and open-source LLM judges, including GPT-4 variants (\texttt{GPT-4-turbo-2024-04-09}, \texttt{GPT-4o-2024-08-06}, \texttt{gpt-4o-mini-2024-07-18}) and Llama 3.1 (70B and 405B), on the same set of 10k randomly-sampled UltraFeedback prompts and responses. In general, GPT-4o, Llama 3.1 405B, and GPT-4 Turbo perform similarly across all benchmarks, with GPT-4o slightly ahead on aggregated average performance, as shown in \autoref{tab:pref_ablation_best_judge}. In the synthetic preference pipeline for \modelname{}~3, we opted for \texttt{GPT-4o-2024-08-06} due to its ease of use, lower cost per request, and batch inference speed via OpenAI's Batch API.\footnote{\url{https://platform.openai.com/docs/guides/batch}} \input{tables/pref_ablation_best_judge} \paragraph{Going Beyond UltraFeedback.} Previous work on preference learning using openly available datasets has shown that the UltraFeedback \citep{cui2023ultrafeedback} preference dataset generally outperforms other preference datasets \citep{ivison2023camels}. In Figure~\ref{fig:ufimprovements} we show that we were able to significantly surpass DPO training on UltraFeedback by training on our best mix.
The improvement is greater for the 70B model (+3.3 vs. +1.8); we hypothesize that this is because UltraFeedback's completions are mainly sourced from models that are less capable than the 70B model we start with. Helpsteer2~\citep{wang2024helpsteer2}, another high-quality preference dataset, also performs worse than our best mix on the 8B model. \begin{figure}[t] \centering \includegraphics[width=0.7\linewidth]{figures_images/post-training/dpo/pref_beyond_uf.png} \caption{Effect of different DPO mixes on 8B and 70B models: UltraFeedback, Helpsteer2, and our best preference mix.} \label{fig:ufimprovements} \end{figure} \paragraph{Persona Preference Data.} Of the three persona preference datasets targeting instruction following, coding, and math skills, only \modelname~3 Persona IF improves the average eval score and the targeted IFEval score (see Figure~\ref{fig:persona_ablations}). Neither \modelname~3 Persona Math nor \modelname~3 Persona Code improves its targeted evaluation, and both slightly harm the average score. We therefore only include the \modelname~3 Persona IF preferences in our final mix. \begin{figure}[t] \centering \includegraphics[width=0.7\linewidth]{figures_images/post-training/dpo/pref_persona_mix.pdf} {\small\\ \cblock{251}{222}{236}~Average \quad \cblock{159}{185}{187}~IFEval \quad \cblock{134}{229}{197}~MATH \quad \cblock{216}{141}{243}~CHU } \caption{Adding persona preference data to the SFT Reused mix for DPO. } \label{fig:persona_ablations} \end{figure} \paragraph{Targeting IF.}\label{sec:ifpref} We created preference data targeted at improving a model's precise instruction following skills. \begin{enumerate} \item \textbf{Persona IF}: We take a subset of our collected instruction following SFT dataset, \textsc{If-Persona-Sft}, and convert it into a preference dataset. Each example in the \textsc{If-Persona-Sft} dataset contains a (prompt, constraints, response) tuple. We start by rewriting each prompt in the subset to relax one of the given constraints. More specifically, we prompt GPT-4o to generate rewrites such that the response to the modified prompt is no longer a valid response for the original prompt (i.e., it does not satisfy all the constraints). We then use the response to the new modified prompt as the rejected response, and create (chosen, rejected) pairs to form our \textsc{If-Persona-Pref} dataset containing close to 20K examples. \item \textbf{IF-augmented}: We randomly sample instructions from the \modelname~2 SFT mix and combine them with constraints from the taxonomy in \citet{zhou2023instructionfollowingevaluationlargelanguage}. The chosen and rejected completions are obtained through the synthetic pipeline in \S\ref{sec:pref_pipeline}. \item \textbf{WildChat IF}: We sample instructions from WildChat \citep{zhao2024wildchat} that contain constraints. For this purpose, we asked GPT-4 to determine whether or not a prompt includes a constraint. \end{enumerate} For IF-augmented, we run two analyses: we generate an additional set of more than 66k instances, and we then run the chosen completions through constraint verifier functions, adding to the final set only those instances that actually fulfill the constraint(s). This leaves us with a cleaned set of about 26k preferences, which we call IF-augmented-verified. In Figure \ref{fig:IF_ablations} we show that the IF-persona preferences significantly improve IFEval scores beyond the baseline mix, while minimally harming average performance.
The IF-augmented-verified dataset improves IFEval performance by only 1 point, while also slightly harming the average performance. Combining IF-persona with IF-augmented-verified leads to the best IFEval performance, but to a slightly lower average. We therefore choose to include IF-augmented (not verified) and Persona IF in the final 8B DPO mix, which leads to both a satisfying average and a satisfying IFEval score. \begin{figure}[t] \centering \begin{minipage}[b]{0.55\textwidth} \centering \includegraphics[width=\textwidth]{figures_images/post-training/dpo/persona_if_mix.png} { \small\\ \cblock{251}{222}{236}~Average \quad \cblock{159}{185}{187}~IFEval } \caption{Performance of different IF-targeted preference mixes, average and IFEval. ``Best'' here consists of our final best mix for the 8B model (minus Persona-IF).} \label{fig:IF_ablations} \end{minipage} \hfill \begin{minipage}[b]{0.40\textwidth} \centering \includegraphics[width=\textwidth]{figures_images/post-training/dpo/pref_uf_regen.png} { \small\\ \cblock{251}{222}{236}~Original dataset \quad\\ \cblock{159}{185}{187}~Regen. using synthetic pipeline } \caption{Comparing the use of the original completions to regenerating completions using our synthetic preference pipeline.} \label{fig:pref_uf_regen} \end{minipage} \end{figure} \paragraph{WildChat.} Our ablations show that adding preference data consisting of WildChat prompts and chosen/rejected pairs obtained using our synthetic preference data pipeline generally improves DPO performance. Ablations in Section~\ref{sec:preference_mix} reveal that adding WildChat prompts seen during SFT training to the DPO mix leads to better average performance than combining the unused with the reused WildChat prompts. \paragraph{Comparing Original Preference Datasets and Their Regenerated Counterparts.} We also investigate whether the preference data generated by the synthetic pipeline in \S\ref{sec:pref_pipeline} can yield gains in downstream DPO performance on existing datasets. To do so, we take the prompts from open-source datasets such as Helpsteer2, UltraFeedback, and MultiPref \citep{miranda2024hybrid}, then regenerate their completions and preference annotations using the synthetic data pipeline. \autoref{fig:pref_uf_regen} shows that the downstream DPO performance of the regenerated datasets is better than that of the original datasets, suggesting that the synthetic pipeline itself can yield performance gains. \subsection{Preference Tuning Recipe and Analyses} \label{sec:preference_tuning_recipe} \input{tables/hypers_ablation_dpo} \subsubsection{Hyperparameter and Algorithm Design} \input{tables/lr_70b_dpo} \input{tables/hypers_dpo} In light of the significant amount of work on improving DPO and related algorithms since the release of \modelname~2, we revisited our hyperparameter and algorithm choices alongside our preference datasets. We ablated both algorithm and hyperparameter choices using an early SFT checkpoint and the UltraFeedback dataset. We explored using DPO, SimPO~\citep{meng2024simpo}, and length-normalized DPO. Our results are shown in Table~\ref{tab:dpo-hyperparameter-tuning}. We found that only length-normalized DPO outperformed our base checkpoint overall, and so we tuned it further, resulting in the final hyperparameters shown in Table~\ref{tab:final-dpo-hyperparameters}.
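For concreteness, the length-normalized DPO loss can be sketched as follows; this is a minimal PyTorch-style illustration, assuming summed log-probabilities for the chosen and rejected completions have already been computed, and is not a drop-in copy of our training code.
\begin{verbatim}
# Minimal sketch of the length-normalized DPO loss. Inputs are the summed
# log-probs of chosen/rejected completions under the policy and the frozen
# reference model, plus completion lengths in tokens. Names are illustrative.
import torch
import torch.nn.functional as F

def ln_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                ref_chosen_logps, ref_rejected_logps,
                chosen_lens, rejected_lens, beta=0.1):
    # (beta / |y|) * log( pi_theta(y|x) / pi_ref(y|x) ) for each completion.
    chosen_term = beta / chosen_lens * (policy_chosen_logps - ref_chosen_logps)
    rejected_term = beta / rejected_lens * (policy_rejected_logps - ref_rejected_logps)
    # Maximizing log-sigmoid of the margin == minimizing its negative mean.
    return -F.logsigmoid(chosen_term - rejected_term).mean()
\end{verbatim}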
For the 70B training, we lowered the learning rate and increased the batch size, following the common practice of doing so when running SFT on larger models~\citep{touvron2023llama}. The 8B DPO model is trained for 10 hours on 8 NVIDIA H100 GPUs and the 70B DPO model is trained for 19 hours on 64 interconnected H100s. The DPO training uses a maximum sequence length of 2048. \input{tables/hypers_rl_vs_dpo} \paragraph{Learning Rate Ablations for 70B.} We ran a small hyperparameter search over a set of learning rates using a generally well-performing preference data mix\footnote{Tülu-3-Persona-IF, Tulu-3-Helpsteer2, Ultrafeedback, Tulu-3-SFT-Used (On-policy).} and our final best mix. Table~\ref{tab:dpo-70b-hyperparameters} shows that either a learning rate of $2.0 \times 10^{-7}$ or $5.0 \times 10^{-7}$, depending on the data mix, performs better than a lower learning rate. For our final DPO models we decided on a learning rate of $2.0 \times 10^{-7}$. \paragraph{Comparison Between PPO and DPO.} We also conducted a more in-depth ablation study comparing PPO and DPO later in development. We anchored on a DPO preference mix from our development history to train an RM. Following the same setup as \citet{stiennon2020learning,ouyang2022training,huang2024thenimplementationdetails}, we extract the RM's logits only at the end-of-sequence (EOS) token as the reward. The linear head that outputs reward scalars is initialized with weights drawn from $\mathcal{N}\big(0, 1/\sqrt{d_{\text{model}} + 1}\big)$. We use the same prompts as in the DPO preference mix to make a controlled comparison between DPO and PPO. The reward model was trained only once and we \emph{did not} attempt to tune the RM's performance. Evaluating an RM's performance can be tricky because strong RM performance on RM-specific benchmarks does not necessarily translate to better downstream performance for PPO~\citep{ivison2024unpacking, chen2024accuracy}. Furthermore, iterating with an RM and PPO is more expensive than iterating with DPO, so we decided to do most of our preference tuning experiments via DPO. The hyperparameters for the RM and PPO can be found in Table~\ref{tab:hypers_rl_vs_dpo_rm} and Table~\ref{tab:hypers_rl_vs_dpo}. The results can be found in Figure~\ref{fig:the-dpo-vs-ppo-chart}. Our findings are: \begin{enumerate} \item \textbf{PPO Gets Similar Average Scores to DPO in This Non-Tuned Setup} Overall, we found that PPO could reach a comparable level of performance to DPO (albeit slightly lower) in this controlled setup. \item \textbf{PPO is More Computationally Expensive} The PPO runtime is roughly 28 hours using two nodes, whereas the DPO runtime is about 4 hours using a single node. \end{enumerate} With a larger compute budget or more tuning, it is entirely possible that PPO's performance could be pushed even higher. However, given limited resources and the subtlety of RM evaluation, using DPO for preference tuning seems more economical. We decided to use PPO primarily for RLVR, introduced in Section~\ref{sec:rlvr}.
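To make the reward-model setup above concrete, a minimal sketch of a scalar reward head on top of a decoder-only backbone is shown below, assuming a Hugging Face-style model; the backbone name, padding handling, and class layout are illustrative assumptions rather than our exact implementation.
\begin{verbatim}
# Sketch of an RM that reads the reward off the final (EOS) token and whose
# linear head is initialized from N(0, 1/sqrt(d_model + 1)), as described above.
import torch
import torch.nn as nn
from transformers import AutoModel

class RewardModel(nn.Module):
    def __init__(self, backbone_name="meta-llama/Llama-3.1-8B"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        d_model = self.backbone.config.hidden_size
        self.reward_head = nn.Linear(d_model, 1, bias=False)
        nn.init.normal_(self.reward_head.weight, std=(d_model + 1) ** -0.5)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        # Index of the last non-padding token per sequence (assumed to be EOS).
        last_idx = attention_mask.sum(dim=1) - 1
        eos_hidden = hidden[torch.arange(hidden.size(0)), last_idx]
        return self.reward_head(eos_hidden).squeeze(-1)  # one scalar per sequence
\end{verbatim}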
\subsubsection{Infrastructure for Scaling DPO} \label{sec:infra-dpo} To run the 70B DPO training, we found it useful to implement two key optimizations that reduce the GPU memory footprint of DPO training: \begin{enumerate} \item \textbf{Caching DPO Log Probs} To reduce GPU memory usage, we pre-compute and cache log probabilities across the dataset using the initial model, rather than keeping a reference DPO model in memory during training as in the canonical implementation~\citep{vonwerra2022trl,rafailov2024direct}. This optimization eliminates the need to allocate GPU memory for the reference model. \item \textbf{Separate Forward Passes for Chosen and Rejected Sequences} The canonical DPO implementation~\citep{vonwerra2022trl,rafailov2024direct} also concatenates the chosen and rejected sequences during the forward pass, effectively doubling the batch size and increasing GPU memory requirements. To save GPU memory, we simply perform the forward passes separately on the chosen and rejected completions. \end{enumerate} \begin{figure}[t] \begin{minipage}[t]{0.49\textwidth}\vspace{-\fboxsep} \centering \includegraphics[width=1.0\linewidth]{figures_images/post-training/ppo_vs_dpo.png} \caption{The average scores of PPO runs with different learning rate warm-up ratios $\omega$ and KL penalty coefficients $\beta$. PPO can get similar (though slightly lower) average scores to DPO.} \label{fig:the-dpo-vs-ppo-chart} \end{minipage} \hfill \begin{minipage}[t]{0.48\textwidth}\vspace{-\fboxsep} \centering \includegraphics[width=\linewidth]{figures_images/post-training/dpo_cache_memory.png} \caption{The peak GPU memory allocated can be reduced by caching the reference policy's logprobs on the preference dataset and doing forward passes separately for the chosen and rejected pairs.} \label{fig:dpo_cache_memory} \end{minipage} \end{figure} We empirically validated these two techniques on the Llama 3.1 model and found they resulted in near-identical training losses. As expected, training uses less GPU memory with the two techniques enabled on an 8xH100 node, as shown in Figure~\ref{fig:dpo_cache_memory}. \section{Reinforcement Learning with Verifiable Rewards} \label{sec:rlvr} \input{tables/verifiable_dataset_summary} \begin{figure} \centering \includegraphics[width=0.55\linewidth]{figures_images/rlvrv5.pdf} \caption{An overview of how Reinforcement Learning with Verifiable Rewards (RLVR) works. We sample completions from a policy model given a set of prompts, and verify their correctness using a deterministic function. If the answer is verifiably correct, we provide a reward of $\alpha$, otherwise 0. We then train against this reward using PPO. } \label{fig:rlvr-system} \end{figure} In \modelname~3, we introduce Reinforcement Learning with Verifiable Rewards (RLVR), a novel method for training language models on tasks with verifiable outcomes such as mathematical problem-solving and instruction following. RLVR leverages the existing RLHF objective but replaces the reward model with a verification function, as shown conceptually in Figure~\ref{fig:rlvr-system}. When applied to domains with verifiable answers, such as mathematics and verifiable instruction following tasks~\citep{zhou2023instructionfollowingevaluationlargelanguage}, RLVR demonstrates targeted improvements on benchmarks like GSM8K while maintaining performance across other tasks.
RLVR can be seen as a simplified form of existing approaches for bootstrapping LM reasoning~\citep{zelikman2022star, Zelikman2024QuietSTaRLM, hoffman2023training} or a simpler form of RL with execution feedback~\citep{gehring2024rlefgroundingcodellms}, in which we simply use answer matching or constraint verification as a binary signal to train the model. While prior work has applied this to improving math skills alone~\citep{VinePPO}, we further extend RLVR to cover multiple evaluations and test how it can improve overall model performance, integrating it as a component of a generalist training pipeline. RLVR is based on a simple principle, common in the RL literature, applied to language models: the policy only receives a reward when its generated responses are verifiably correct. More specifically, RLVR optimizes the following objective:\footnote{This is very similar to the standard KL-constrained RLHF objective, shown in Eq.~\ref{eq:rlhf-objective}, with a verifiable reward function in place of a learned reward model.} \begin{equation} \label{eq:rlvr-objective} \max_{\pi_\theta} \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)} \left[R_\text{RLVR}(x, y)\right], \quad \text{where } R_\text{RLVR}(x, y) = v(x,y) - \beta\, \text{KL}\left[\pi_\theta(y|x) \,\|\, \pi_\text{ref}(y|x)\right], \end{equation} and $v$ is the verifiable reward function. $v$ takes in a prompt and completion pair $(x, y)$ and checks whether the correct answer is present in the generated text: \begin{equation} v(x,y) = \begin{cases} \alpha & \text{if correct,} \\ 0 & \text{otherwise.} \end{cases} \end{equation} We train models with RLVR following preference finetuning, and we use the PPO~\citep{schulman2017proximal} algorithm to optimize the RLVR objective. We set $\alpha = 10$ based on pilot experiments and did not tune it further. \subsection{RLVR Data} Creating data for RLVR entails obtaining prompts with an accompanying binary verifier (i.e., constructing a set of inputs $x$ with accompanying verifier functions $v$). We focus on two domains (mathematics, exact instruction following) and three evaluations (GSM8K, MATH, IFEval) with relatively straightforward methods for verification, and leave more complex verifiers to future work.\footnote{For example, recent work has found success in using code execution feedback to train models with RL~\citep{gehring2024rlefgroundingcodellms, xudpoppo}.} In practice, the answer extraction and verification method is domain-dependent. We use three sources of training prompts and verifiers: \paragraph{GSM8K.} We use the GSM8K training set. We augment each sample with the standard 8-shot prompt used during evaluation to encourage the model to use chain-of-thought, then extract the final number produced and compare it to the ground-truth label to determine correctness. \paragraph{MATH.} We use the MATH training set. Similar to GSM8K, we augment each sample with the standard 3-shot CoT prompt used during evaluation to encourage the model to generate chains of thought, then extract the answer and determine correctness following the `flex' MATH evaluation logic. \paragraph{IFEval.} We randomly sample instructions from the \modelname~2 SFT mix and combine them with constraints from the taxonomy in \citet{zhou2023instructionfollowingevaluationlargelanguage}. We have a verification function for each of the constraint templates that is able to verify whether a completion satisfies the corresponding constraint.
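As an illustration, a verifier of the kind used for the math prompts might look like the following minimal sketch; the regular expression and function names are our own, but the logic (extract the final number, compare to the gold label, return $\alpha$ or 0) follows the description above.
\begin{verbatim}
# Minimal sketch of a verifiable reward v(x, y) for GSM8K-style prompts:
# extract the last number in the completion and compare it to the gold answer.
import re

ALPHA = 10.0  # reward for a verifiably correct answer

def extract_last_number(text):
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return numbers[-1] if numbers else None

def gsm8k_reward(completion, gold_answer):
    predicted = extract_last_number(completion)
    if predicted is None:
        return 0.0
    return ALPHA if float(predicted) == float(gold_answer) else 0.0
\end{verbatim}
The IFEval verifiers follow the same pattern, with the numeric comparison replaced by a programmatic constraint check.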
Given these prompts and verification functions, we then train the models via Proximal Policy Optimization (PPO)~\citep{schulman2017proximal} on the verifiable rewards. We combine all prompts together, resulting in a mixture of roughly 30,000 prompts with ground truth labels. We summarize our verifiable prompt mixture in Table~\ref{tab:verifiable_prompt_summary}. \subsection{RLVR Recipe and Analyses} \label{sec:rl_recipe} \textbf{Implementation Details}. RL and RLHF have many subtle implementation details that can significantly impact training stability~\citep{Engstrom2020Implementation,shengyi2022the37implementation}. As we use PPO to train our models against our verifiers, we adapt the following effective implementation details from \citet{huang2024thenimplementationdetails}: \begin{enumerate} \item \textbf{Initialize the Value Model from a General RM} We initialize the value model from the reward model for the RLVR setup (following the standard setup in \citet{ziegler2019fine,ouyang2022training}). \item \textbf{Disable Dropout} We set the dropout probability to 0 during RM and RL training (similar to \citet{ziegler2019fine}). This ensures the token log probabilities can be computed deterministically during the forward passes of the policy model and reference model, allowing a more accurate estimation of the KL penalty. Furthermore, PPO calculates the token log probabilities in two phases: during the rollout phase and the learning phase. It is important that the token log probabilities match up during these two phases, so that they produce a probability ratio of 1 during the first PPO epoch and PPO can clip the ratio and apply the surrogate objective properly. If the log probabilities differ drastically due to dropout, all ratios could potentially be clipped, resulting in zero gradient. \item \textbf{Train with the SFT Dataset and Shuffle Between Epochs} As pointed out in \citet{huang2024thenimplementationdetails}, PPO can train for more episodes than the total available prompts, effectively training for multiple epochs. In our RLVR ablation experiments, we train for roughly $100,000 / 7,473 \approx 13$ epochs. We shuffle the prompts between epochs. For our final runs, we examine model checkpoints every 40--100 steps and choose the best checkpoint on our development evaluation set. \item \textbf{Non End-of-Sequence (EOS) Penalty} During training, PPO typically samples up to a fixed maximum number of tokens. If the sampled response does not end with an EOS token, we apply a $-10$ penalty to encourage the model to always complete its responses. \item \textbf{Advantage Whitening / Normalization} As is standard in the PPO implementation details literature~\citep{Engstrom2020Implementation,shengyi2022the37implementation,huang2024thenimplementationdetails}, we normalize the advantages by subtracting their mean and dividing by their standard deviation. \end{enumerate} \input{tables/hypers_rl} \begin{figure}[t!]
\centering \begin{minipage}{0.95\textwidth} \centering \includegraphics[width=\linewidth]{figures_images/post-training/rl/gsm8k.pdf} \end{minipage} \hfill \begin{minipage}{0.95\textwidth} \centering \includegraphics[width=\linewidth]{figures_images/post-training/rl/math_only.pdf} \end{minipage} \begin{minipage}{0.95\textwidth} \centering \includegraphics[width=\linewidth]{figures_images/post-training/rl/ifeval.pdf} \end{minipage} \\ {\cblock{177}{27}{232}} $\beta = 0.01$ {\cblock{16}{82}{87}} $\beta = 0.03$ {\cblock{240}{82}{156}} $\beta = 0.05$ {\cblock{15}{203}{140}} $\beta = 0.1$ \begin{minipage}{0.99\textwidth} \centering \includegraphics[width=0.325\linewidth]{figures_images/post-training/rl/plot_1_gsm8k.png} \includegraphics[width=0.325\linewidth]{figures_images/post-training/rl/plot_3_math_flex.png} \includegraphics[width=0.325\linewidth]{figures_images/post-training/rl/plot_2_ifeval.png} \end{minipage} \caption{The top three rows show RLVR's verifiable rewards, KL divergence, and response lengths on the \emph{train datasets} of GSM8K, MATH, and prompts with constraints, when starting from a DPO checkpoint (i.e., an experimental DPO checkpoint, not the final one). The bottom row shows the corresponding downstream test performance. RLVR can lead to higher verifiable rewards on the train datasets. Importantly, RLVR can also lead to higher scores on the corresponding test sets; however, an increase in the average score across all evaluations is not guaranteed.} \label{fig:the-rl-chart} \end{figure} \begin{figure}[t] \centering \includegraphics[width=\linewidth]{figures_images/post-training/rl/sft_vs_dpo.pdf} {\cblock{76}{114}{176}} start = DPO, $\beta=0.03$ {\cblock{221}{132}{82}} start = DPO, $\beta=0.05$ {\cblock{85}{168}{104}} start = DPO, $\beta=0.01$ {\cblock{196}{78}{82}} start = DPO, $\beta=0.1$\\ {\cblock{129}{114}{179}} start = SFT, $\beta=0.03$ {\cblock{147}{120}{96}} start = SFT, $\beta=0.05$ {\cblock{218}{139}{195}} start = SFT, $\beta=0.01$ {\cblock{140}{140}{140}} start = SFT, $\beta=0.1$ \caption{Comparison of RLVR's performance on GSM8K when starting from a DPO checkpoint versus a weaker SFT checkpoint. Starting from either SFT or DPO can lead to the same level of verifiable rewards, but starting from SFT incurs a larger KL than starting from DPO when using the same $\beta$.} \label{fig:the-rl-chart-dpo-vs-sft} \end{figure} We start by anchoring on a DPO model from our development history as the initial model for our RLVR experiments. We conducted several sets of experiments: \begin{enumerate} \item \textbf{Individual Tasks} We applied the RLVR recipe to GSM8K, MATH, and IFEval, respectively, with a sweep of $\beta$ values $[0.1, 0.05, 0.03, 0.01]$. For evaluation, we look at the verifiable rewards, the KL divergence, and the response length. \item \textbf{Value Model Initialization Ablation} We experimented with initializing PPO's value model from 1) a general reward model or 2) the anchored DPO model, and ran a sweep of $\beta$ values $[0.1, 0.05, 0.03, 0.01, 0.005, 0.001]$ on the GSM8K task. The general RM is trained with the UltraFeedback dataset~\citep{cui2023ultrafeedback}. For evaluation, we examine the GSM8K test score and the average scores across all evaluations. \item \textbf{Scores from RM Ablations} One possible implementation of RLVR is to add the verifiable rewards on top of the scores from the reward model.
We also launch experiments for this setting using a sweep of $\beta$ values $[0.1, 0.05, 0.03, 0.01, 0.005, 0.001]$ on the GSM8K task. \item \textbf{Starting from a Weaker Model} The model's base capabilities are also a confounding factor. We launch another set of experiments using an SFT model with lower average scores (from which the anchored DPO checkpoints were trained, so they share the same lineage), using $\beta$ values $[0.1, 0.05, 0.03, 0.01]$. \end{enumerate} Unless otherwise specified, we use the PPO training hyperparameters in Table~\ref{tab:hypers_rl_vs_dpo}. To train reward models, we use the hyperparameters in Table~\ref{tab:hypers_rl_vs_dpo_rm}. \subsubsection{Key Findings} \textbf{RLVR Can Improve Performance in Targeted Domains.} Figure~\ref{fig:the-rl-chart} shows that training with RLVR results in improved \textit{test} performance for all three settings. In all cases, we obtain models that outperform the initial model on that particular evaluation. We also see that the verifiable rewards (i.e., correctness on the train set) improve consistently for all three settings. Interestingly, in GSM8K and MATH, we found that incurring more KL budget does not necessarily lead to improvements in verifiable rewards. \textbf{Initializing RLVR's Value Function from a General RM Works Best.} Figure~\ref{fig:x32_RLVR_KL} shows that initializing the value function from a general RM obtains the highest GSM8K test score and also higher average scores. This suggests that the value function plays an important role in RLVR's training. \textbf{Do Not Use the Scores from the RM.} Figure~\ref{fig:x32_RLVR_KL2} shows that using only the verifiable rewards outperforms adding scores from the reward model. Combining verifiable rewards with RM scores seems to introduce more noise, especially in the average scores. \textbf{Starting from a Weaker Model Can Converge to the Same Verifiable Rewards.} Figure~\ref{fig:the-rl-chart-dpo-vs-sft} shows that starting from either SFT or DPO can lead to the same level of verifiable rewards, but starting from the SFT model incurs a larger KL than starting from the DPO model. This makes sense because the SFT model starts further from good GSM8K performance than the DPO model does. However, we find that starting from a stronger model usually results in better \textit{test set} performance. \textbf{Overoptimization Happens.} As we lower the KL penalty $\beta$, the trained model diverges more (in KL) from the initial model. We observed that more KL divergence typically results in lower average scores, as shown in Figure~\ref{fig:x32_RLVR_KL}. The exception is Figure~\ref{fig:x32_RLVR_KL2}, where the largest KL corresponds to the highest average score. Furthermore, we showcase overoptimization cases on prompts with constraints in Appendix~\ref{appendix:ifeval-overoptimization}. \subsection{RLVR Infrastructure} \label{subsec:rl_infra} Our PPO setup follows best practices on implementation details~\citep{huang2024thenimplementationdetails}. To enable our implementation to scale to models of up to 405B parameters, we also adapted model allocation techniques from existing distributed RLHF frameworks~\citep{hu2024openrlhf}. Furthermore, we accelerate throughput by making RL training asynchronous~\citep{noukhovitch2024asynchronousrlhffasterefficient}. The final 8B reward model is trained in 9 hours on 8 H100 GPUs, while the final 8B RL run takes $\sim$65 hours on 8 GPUs, the final 70B RL run takes $\sim$60 hours on 48 GPUs, and the final 405B RL run takes 46 hours on 256 GPUs.
Note that for all of these models we took a checkpoint from before the end of the run. \begin{figure}[t] \begin{minipage}{0.48\textwidth} \centering \includegraphics[width=\linewidth]{figures_images/post-training/x32_RLVR_KL.png} \caption{The performance of RLVR vs.\ KL divergence under different value model initializations. Both initializations can improve the models compared to the initial DPO baseline model. However, as the model diverges further from the initial model, overoptimization occurs and the average scores drop significantly.} \label{fig:x32_RLVR_KL} \end{minipage} \hfill \begin{minipage}{0.48\textwidth} \centering \includegraphics[width=\linewidth]{figures_images/post-training/x32_RLVR_KL1.png} \caption{Similar to Figure~\ref{fig:x32_RLVR_KL}, but comparing 1) using RM scores on top of the verifiable rewards and 2) using only the verifiable rewards. We found that using only the verifiable rewards performs better on GSM8K, and that combining RM scores with verifiable rewards is noisier.} \label{fig:x32_RLVR_KL2} \end{minipage} \end{figure} \textbf{Distributed Setup.} Our PPO infrastructure leverages ZeRO Stage 3~\citep{rajbhandari2020zero} to fit the models and applicable optimizer states into GPU memory. In RLVR, we have three models: the policy model, the reference policy model, and the value model. The policy and value models need to be trained, but the reference policy model only performs inference. Inference time is often a bottleneck in RLHF infrastructure, so we allocate dedicated GPUs for inference, as done in OpenRLHF~\citep{hu2024openrlhf}. In particular, we use Ray~\citep{moritz2018ray} to allocate dedicated GPUs to run PagedAttention via vLLM~\citep{kwon2023efficient}. PagedAttention helps reduce GPU memory fragmentation and redundant duplication by leveraging virtual memory and paging techniques. As a result, it lets our PPO infrastructure run LLM inference with a much larger batch size and speeds up inference. This setup allowed us to scale PPO policy training to the 405B scale. We share more 405B scaling details in Section~\ref{sec:405b-tulu3}. \textbf{Asynchronous RL Training.} Furthermore, our PPO setup uses asynchronous RL training to improve training efficiency~\citep{noukhovitch2024asynchronousrlhffasterefficient}. Existing PPO frameworks~\citep{hu2024openrlhf,shen2024nemoaligner} are typically synchronous: the inference GPUs first collect policy rollout data, and then the learner GPUs train on that rollout data. This setup utilizes GPUs relatively well in a typical setting such as Atari~\citep{mnih2015human}. However, in the RLHF setting, inference typically requires different optimizations (e.g., PagedAttention), and thus different GPU memory allocation strategies. As a result, synchronous RL training means the inference GPUs can be idle while the learner GPUs run, and vice versa. An alternative implementation is to use the same set of GPUs for training and inference, but this can incur additional overhead, such as keeping an extra copy of the policy parameters and compiling the inference engine on the training GPUs~\citep{shen2024nemoaligner}. Our setup allocates GPUs separately for inference and training, alleviating the need to compile an inference engine and store a copy of the policy parameters on the training GPUs. Furthermore, we run the inference computation concurrently with the training computation, thus reducing GPU idle time.
However, asynchronous RL training can introduce stale data in cases where inference generates data much faster than training can consume it~\citep{espeholt2018impala}, which can introduce reproducibility problems~\citep{huang2023cleanba}. To help make training more reproducible, our setup always trains the policy using the second-latest inference data~\citep{huang2023cleanba,noukhovitch2024asynchronousrlhffasterefficient}. \subsection{Final Experimental Results} \input{tables/rlvr_final_results} \begin{figure} \centering \includegraphics[width=\linewidth]{figures_images/final_rlvr} {\cblock{16}{82}{87}} Final 8B run {\cblock{240}{82}{156}} Final 70B run \caption{Rewards, KL divergence, and average response length on the GSM8K train set over episodes for our final RLVR runs. We mark the points at which we choose the 8B and 70B checkpoints with a green and a purple star, respectively.} \label{fig:final_rlvr_run} \end{figure} Based on the above results, we ran our final RLVR runs using the combined verifiable prompt set, and used the best DPO models from the prior section as starting points. Specifically, at the 8B scale, we tested the best overall DPO model and the best model with IFEval persona data mixed in during training, and at the 70B scale we used the best overall DPO model. For the 8B models, we used the hyperparameters from Table~\ref{tab:hypers_rl_vs_dpo}, but tested higher KL penalty coefficients (up to 0.15) based on previous 8B RL development runs. For the 70B models, we used the hyperparameters from Table~\ref{tab:hypers_rl_vs_dpo}, but with a $1 \times 10^{-7}$ learning rate, 0.1 warmup ratio, 2048 response length, 400,000 episodes, 640 effective batch size, and $\beta=0.7$, based on previous 70B RL development runs. We initialize our value model from a reward model trained on the same dataset as the best DPO model (the \modelname~3 8B preference mixture) starting from \modelname~3 SFT, using the same hyperparameters as in our ablation experiments (Table~\ref{tab:hypers_rl_vs_dpo_rm}). We evaluated our models every 100 training steps (40 for 70B), and picked as our final 8B model the checkpoint with the best overall performance on MATH and IFEval. We show the logs from RLVR training in Figure~\ref{fig:final_rlvr_run}, and compare the final performance against the DPO starting points and Llama 3.1 in Table~\ref{tab:rlvr_final_results}. RLVR results in non-trivial improvements at the 8B scale, improving all three of MATH, GSM8K, and IFEval. In fact, we observed that some 8B runs were able to achieve GSM8K scores of up to 89.4\% and IFEval scores of up to 84.8\% (although such models tended to perform worse on other metrics, dragging down their overall average). At the 70B scale, we observe more modest improvements in IFEval and MATH, and no improvement in GSM8K, likely because it is already close to saturation (93.5\%). Surprisingly, we find that our 70B run displays extremely low KL divergence, remaining well below 1 over the duration of the run, probably due to the lower learning rate.\footnote{We attempted a higher learning rate during initial exploration but found that KL could explode early and cause a non-trivial drop in average scores.} \section{\modelname~3 Evaluation Framework} \label{sec:evaluation} \input{tables/eval-setup} We designed our framework for evaluating \modelname~3 and the other models we compare against with the following goals: 1) Our evaluations should be \textit{reproducible}.
2) We should evaluate models' generalization to unseen tasks, not just the specific benchmarks we use for development. 3) Our evaluation setup (e.g., templates and strategies for prompting) should be fair to a wide range of models. Accordingly, our framework consists of an open evaluation toolkit for reproducible evaluations (Section~\ref{sec:olmes}), a suite for evaluating core skills in instruction-tuned models with separate development (Section~\ref{sec:dev_suite}) and held-out evaluations (Section~\ref{sec:unseen_suite}), and a set of recommended settings for evaluating on our evaluation suite, based on our experiments with various models, which we refer to as the \modelname~3 Evaluation Regime, summarized in Table~\ref{tab:test_settings}. As described in Section~\ref{sec:core_skills}, we split our evaluation suite into a \textit{development} set and an \textit{unseen} set, the former used for developing models, and the latter only for evaluating final models. This setup, along with our training data decontamination efforts (see Section~\ref{sec:decontamination}), provides a fairer evaluation of our models' generalization capabilities for each of the core skills we focus on during development. However, when comparing our models to other models, we \textit{cannot rule out that closed models have trained on our evaluation suite}, and hence cannot make clear judgments about models that have not publicly released their finetuning data. \subsection{Open Language Model Evaluation System (OLMES)} \label{sec:olmes} In an effort to make evaluations more standardized and reproducible, we are sharing the code base used to produce the evaluations in this work.\footnote{See \url{http://github.com/allenai/olmes}.} The OLMES evaluation system supports: \begin{itemize} \item A wide range of models and tasks, leveraging existing work in the Eleuther AI LM Evaluation Harness \citep{eval-harness}, \item Flexible configuration options for each task, \item Direct access to the specific task formulations used in this work (as well as in earlier work such as OLMo \citep{Groeneveld2024OLMoAT} and the OLMES standard \citep{gu2024olmesstandardlanguagemodel}), and \item Detailed instance-level output data for analysis of model predictions, confidences, etc. \end{itemize} For example, to reproduce our Llama-3.1-8B-Instruct numbers for MMLU-Pro, one would simply run ``{\tt olmes --task mmlu\_pro::tulu3 --model llama3.1-8b-instruct}''. \subsection{\modelname~3 Evaluation Suite - Development} \label{sec:dev_suite} We design the evaluation setup for the \textit{development} partition of \evalname based on existing practices in the current literature as well as insights gained during development. Where appropriate, we adapt the evaluation setup based on the nature of the task and take additional care to robustify our answer extraction and comparison approaches, as described below. \paragraph{MMLU~\citep{hendrycks2020measuring}} is heterogeneous with regard to the type of reasoning skills required to answer the questions, containing instances that require basic factual recall as well as those that demand logical reasoning and problem-solving skills. We design a zero-shot CoT setting that asks the model to ``summarize'' its reasoning before answering the questions (see \autoref{tab:mmlu-prompts} for the prompt used).
We find that, among the various CoT settings, including prompting the model to think ``step-by-step'' and using no CoT, this prompt yields a systematic performance improvement over the standard 5-shot multiple-choice setting across tested models (see \autoref{tab:mmlu-mc-cot-comparison}) and maximizes the number of subjects (knowledge categories) that benefit from CoT. This indicates that our ``summarize'' prompt is an effective strategy for dealing with the heterogeneity of the benchmark. See Appendix~\ref{appendix:mmlu-cot-prompting} for details. We compute a macro average over all the subjects in MMLU as the final task metric. \paragraph{PopQA~\citep{mallen2023llm_memorization}} is an entity-centric question-answering benchmark that evaluates language models' tendency to \textit{forget} information about long-tail entities. We prompt the models in a 15-shot setting (as recommended in the dataset paper) without any additional instructions, with each QA demonstration presented in a different conversation turn, a setting that we refer to as \textit{Multiturn ICL} henceforth, and use greedy sampling to obtain model predictions. \paragraph{TruthfulQA~\citep{lin2021truthfulqa}} contains questions that humans tend to answer incorrectly due to common misconceptions. We use the multiple-choice version of this benchmark where the models being evaluated are presented with questions and options containing multiple correct answers (the MC2 setting). \paragraph{HumanEval~\citep{chen2021codex} and HumanEval+~\citep{evalplus}} evaluate models' ability to complete Python code given docstrings. HumanEval+ uses a more rigorous evaluation procedure than the original HumanEval benchmark, with additional tests. We sample from the models at a temperature of 0.8 and use pass@10 as the evaluation metric. \paragraph{GSM8K~\citep{cobbe2021gsm8k}} contains grade school math word problems. We use the 8-shot chain-of-thought prompt from \citet{wei2022chain}, formatted for a multiturn ICL evaluation. We obtain model responses using greedy sampling and extract the last number in the model response as the predicted answer. \paragraph{MATH~\citep{hendrycksmath2021}} contains problems from mathematics competitions spanning various categories such as algebra and calculus. We use a 4-shot multi-turn setup with CoT from \citet{lewkowycz2022solving}, formatted for multiturn ICL, and greedy sampling for model completions. For determining the correctness of predictions, we use a `flex' scheme that attempts to extract the answer in three different ways: (1) following the Minerva format~\citep{lewkowycz2022solving}; (2) finding the last instance of `\texttt{\textbackslash boxed\{<ans>\}}'; (3) taking the text between the last two `\$' tags (a minimal sketch of this extraction logic is shown below). This is due to issues we found during development wherein models would often not follow the correct output format despite the few-shot examples, necessitating a varied answer extraction strategy. We find that moving from the Minerva format alone to our `flex' strategy can sometimes improve reported scores by up to 10 points, highlighting the need for this flexible strategy. We compute a macro average across the subsections to obtain the final task metric. \paragraph{BigBench-Hard~\citep{suzgun2022challenging}} contains challenging reasoning problems for which models benefit from step-by-step reasoning. We follow the setup described in the original paper and use 3-shot CoT prompts, formatted for multiturn ICL. We use greedy sampling for obtaining model predictions.
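For illustration, the `flex' answer extraction for MATH described above can be sketched as follows; the exact patterns used in our evaluation code differ, and the regular expressions below are simplified assumptions meant only to convey the three-step fallback logic.
\begin{verbatim}
# Illustrative sketch of the three-way `flex' answer extraction for MATH.
import re

def extract_math_answer(response):
    # (1) Minerva-style "The final answer is <ans>." pattern.
    m = re.findall(r"final answer is\s*(.+?)(?:\.\s*$|\.\s*I hope)",
                   response, flags=re.IGNORECASE)
    if m:
        return m[-1].strip()
    # (2) Last \boxed{<ans>} occurrence (nested braces not handled here).
    m = re.findall(r"\\boxed\{([^{}]*)\}", response)
    if m:
        return m[-1].strip()
    # (3) Text between the last two $ delimiters.
    parts = response.split("$")
    if len(parts) >= 3:
        return parts[-2].strip()
    return None
\end{verbatim}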
\paragraph{DROP~\citep{dua-etal-2019-drop}} is a reading comprehension task that requires discrete reasoning. We draw 3 random few-shot examples from the train split, following the setup used for Llama 3~\citep{dubey2024llama}, and use greedy sampling to get model predictions. \paragraph{IFEval~\citep{zhou2023instructionfollowingevaluationlargelanguage}} evaluates the instruction following ability of models in a setting where each instruction contains constraints such that it can be programmatically verified whether the outputs satisfy those constraints. We use greedy decoding to generate model outputs given the instructions in the dataset, and measure the prompt-level accuracy of satisfying constraints in the loose evaluation setting. \paragraph{AlpacaEval 2~\citep{dubois2024length}} contains a set of prompts sourced to reflect real human usage of LMs, and compares model outputs to GPT-4 Turbo responses with an additional length control to avoid longer answers being unfairly favored. We generate responses using greedy decoding up to 8,192 tokens in length, following \citet{ivison2023camels}. \subsubsection{Safety Evaluation} \label{sec:safety_evals} \input{tables/eval-suite-safety} We follow~\citet{han2024wildguard} and~\citet{wildteaming2024} to define our safety evaluation suite using the following benchmarks.\footnote{Built on the Ai2 Safety Tool, forked to focus only on safety evals: \url{https://github.com/nouhadziri/safety-eval-fork}. Note that while WildGuard and WildJailbreak are popular training datasets, we use generations over the test set prompts as a test of compliance with a response filter such as the WildGuard model.} Each of these benchmarks evaluates whether models refuse to respond to unsafe requests, and in the case of XSTest and WildJailbreak, additionally evaluates whether they comply with benign requests. We use greedy sampling to get model responses for the prompts in each of the benchmarks, and compute the accuracy (at refusal or compliance, as appropriate). We report the macro average of the scores over all the benchmarks as our final safety evaluation metric. \paragraph{XSTest~\citep{rottger2023xstest}} consists of 200 unsafe prompts and 250 prompts which are safe but superficially resemble unsafe prompts, i.e., they use vocabulary similar to that of unsafe prompts. Categories include homonyms, figurative language, safe targets, safe contexts, definitions, real discrimination/nonsense group, nonsense discrimination/real group, historical events, public privacy, and fictional privacy. We report the overall accuracy score based on whether WildGuard~\citep{han2024wildguard} classifies the response as a refusal or compliance. \paragraph{HarmBench~\citep{mazeika2024harmbench}} We evaluate on a subset consisting of 321 harmful prompts\footnote{\url{https://github.com/centerforaisafety/HarmBench/blob/main/data/behavior_datasets/harmbench_behaviors_text_test.csv}} categorized into Functional and Semantic categories. The Functional category includes two types of behaviors: Standard behaviors, which are modeled after existing datasets of harmful behaviors such as AdvBench and the TDC 2023 Red Teaming Track dataset, and Copyright behaviors, which test the handling of copyrighted content. The Semantic category comprises seven types of harmful behaviors: cybercrime \& unauthorized intrusion, chemical/biological weapons or drugs, copyright violations, misinformation/disinformation, harassment/bullying, illegal activities, and general harm.
We use WildGuard to evaluate the model's refusal to assist with harmful prompts. \paragraph{Do-Anything-Now~\citep{SCBSZ24}} consists of jailbreak prompts created by combining the jailbreak templates from DAN with harmful behaviors from HarmBench; we subsample 300 of them for testing. We report accuracy using the WildGuard classifier. \paragraph{JailbreakTrigger~\citep{huang2024trustllm}} incorporates prompts based on 13 distinct jailbreak attack methods. In total, the dataset\footnote{\url{https://huggingface.co/datasets/TrustLLM/TrustLLM-dataset}} consists of 400 examples, split evenly between two categories: ``Questions about Bad Behaviors'' and ``Instructions to generate Toxic Content''. This dataset serves to evaluate the effectiveness of LLMs' defenses and measures the toxicity of responses under jailbreak scenarios. The reported metric is the refusal-to-answer rate (RTA), measured by WildGuard. \paragraph{WildJailbreakTest~\citep{wildteaming2024}} is an adversarial evaluation set which contains a subset of adversarial benign queries (210 examples) and a subset of adversarial harmful queries (2000 examples). The adversarial benign queries are used to measure models' exaggerated safety behaviors, and the adversarial harmful queries are used to measure models' safeguards against adversarial attacks. We measure RTA using WildGuard for both categories. For benign queries, a lower RTA is better ($\downarrow$), and for harmful queries, a higher RTA is better ($\uparrow$). \paragraph{WildGuardTest~\citep{han2024wildguard}} contains 1725 items for prompt harm, response harm, and response refusal classification tasks. 55\% are vanilla prompts, and 45\% are adversarial. The prompts are collected from adversarial synthetic data and in-the-wild user-LLM interactions. We report RTA using WildGuard.
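As a rough sketch of how these benchmark scores combine into the overall safety metric described above, one might compute per-benchmark accuracy with a refusal classifier and macro-average across benchmarks; the classifier below is a stand-in (an assumption on our part) for a WildGuard-style model.
\begin{verbatim}
# Sketch of the safety scoring: per-benchmark accuracy at refusing unsafe
# prompts (or complying with benign ones), macro-averaged over benchmarks.
from statistics import mean

def classify_refusal(response):
    # Placeholder for a learned refusal classifier (e.g., a WildGuard-style model).
    raise NotImplementedError

def benchmark_score(examples):
    # examples: list of (response, should_refuse) pairs
    return mean(classify_refusal(resp) == should_refuse
                for resp, should_refuse in examples)

def overall_safety(benchmarks):
    # benchmarks: dict mapping benchmark name -> list of examples
    return mean(benchmark_score(examples) for examples in benchmarks.values())
\end{verbatim}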
\begin{table}[t] \centering \small \begin{tabular}{lcccccc} \toprule \vspace{-0.1cm} \textbf{Benchmarks} & \textbf{Llama 3.1 8B} & \textbf{Ministral 8B} & \textbf{Qwen 2.5 7B} & \textbf{\modelname~3 8B} & \textbf{\modelname~3 8B} & \textbf{\modelname~3 8B} \\ & \textbf{Instruct} & \textbf{Instruct} & \textbf{Instruct} & \textbf{SFT} & \textbf{DPO} & \\ \midrule HarmBench & 82.8 & 53.4 & 84.1 & \textbf{98.4} & 94.4 & 94.7 \\ XSTest & \textbf{92.7} & 85.6 & 91.8 & 90.4 & 92.4 & 93.3 \\ WildGuardTest & 86.2 & 68.1 & 85.0 & \textbf{99.2} & 98.9 & 98.5 \\ JailbreakTrigger & 78.8 & 63.3 & 71.0 & \textbf{95.8} & 87.0 & 85.5 \\ DoAnythingNow & 45.0 & 16.0 & 61.7 & \textbf{88.3} & 69.7 & 62.0 \\ WildJailbreakTest & 65.6 & 50.7 & 56.2 & \textbf{86.7} & 81.1 & 78.8 \\ \midrule Overall & 75.2 & 56.2 & 75.0 & \textbf{93.1} & 87.2 & 85.5 \\ \bottomrule \end{tabular} \vspace{3pt} \caption{Breakdown of safety scores by benchmark for \modelname~3 8B models compared with similarly sized open-weight models.} \label{tab:safety_evals_8b} \end{table} \begin{table}[t] \centering \small \begin{tabular}{lccccccc} \toprule \vspace{-0.1cm} \textbf{Benchmarks} & \textbf{Llama 3.1} & \textbf{Qwen 2.5} & \textbf{Hermes 3} & \textbf{Nemotron} & \textbf{\modelname~3 70B} & \textbf{\modelname~3 70B} & \textbf{\modelname~3 70B}\\ \vspace{-0.1cm} & \textbf{70B} & \textbf{72B} & \textbf{Llama 3.1} & \textbf{Llama 3.1} & \textbf{SFT} & \textbf{DPO} & \\ & \textbf{Instruct} & \textbf{Instruct} & \textbf{70B} & \textbf{70B} & & & \\ \midrule HarmBench & 80.6 & 86.3 & 54.7 & 84.4 & \textbf{98.8} & 97.8 & 97.8 \\ XSTest & 87.1 & 93.6 & 89.3 & 92.0 & 91.1 & \textbf{94.9} & 92.4 \\ WildGuardTest & 81.3 & 93.1 & 66.6 & 84.9 & 99.1 & \textbf{99.2} & 98.9 \\ JailbreakTrigger & 71.0 & 89.8 & 56.3 & 60.5 & \textbf{95.3} & 87.0 & 86.7 \\ DoAnythingNow & 80.0 & 93.3 & 26.7 & 36.3 & \textbf{93.7} & 69.0 & 67.7 \\ WildJailbreakTest & 59.2 & 66.0 & 53.8 & 56.1 & \textbf{88.6} & 86.3 & 86.2 \\ \midrule Overall & 76.5 & 87.0 & 57.9 & 69.0 & \textbf{94.4} & 89.0 & 88.3 \\ \bottomrule \end{tabular} \vspace{3pt} \caption{Breakdown of safety scores by benchmark for \modelname~3 70B models compared with similarly sized open-weight models.} \label{tab:safety_evals_70b} \end{table} \subsection{\modelname~3 Evaluation Suite - Unseen} \label{sec:unseen_suite} For the \textit{unseen} evaluation suite, the task formulations were decided through a design process independent from that of the \textit{development} suite. One goal of the unseen suite is to evaluate instruction-tuned models in ways that are closely aligned with realistic usage. Specifically, we follow these general principles: \begin{itemize} \item Formulate tasks similarly to how humans interact with the models, e.g., avoiding few-shot examples presented as a dialog, or precise chain-of-thought (CoT) examples for how the model is ``supposed'' to think. \item Prompt models with clear instructions that set the context, encourage concise reasoning, and specify how the final answer should be formatted. \item Apply reasonable heuristics for answer extraction and comparison to gold answers, to avoid penalizing models that do not exactly follow a syntax implied by the instructions (based on examining outputs from a wide set of existing baseline models). \end{itemize} We first apply these principles to some of the tasks in the development suite using a set of exploratory models (instruction-tuned models predating \modelname{} 3).
On the exploratory models, we find that following the above principles, in particular being more closely aligned with real human usage, generally does not degrade performance and often allows most models to perform better on tasks (despite, e.g., removing few-shot examples). We did not update the formulation of the development tasks based on this, but carried the principles over to formulating the unseen tasks. More detailed analysis can be found in Appendix~\ref{appenx:test_principles_on_dev_tasks}. The task formulations in the \modelname~3 \textit{unseen} suite are as follows. For all benchmarks that include multiple subtasks, we compute the average over the subtasks (i.e., the ``macro'' average) to be consistent. \input{tables/unseen_tasks_prompt_comparisons} \paragraph{AGIEval English \citep{zhong-etal-2024-agieval}} includes the English-language subset of the AGIEval benchmark, specifically these multiple-choice tasks: \textit{aqua-rat}, \textit{logiqa-en}, \textit{lsat-ar}, \textit{lsat-lr}, \textit{lsat-rc}, \textit{sat-en}, \textit{sat-math}, and \textit{gaokao-english}. We do not include the \textit{sat-en-without-passage} task, as we find these questions are typically severely underspecified without access to the passage. We formulate the task using a simple ``zero-shot CoT'' prompt which encourages concise reasoning ending with a clearly stated answer choice. In Appendix~\ref{appendix:unseen-tasks-prompt}, Figure~\ref{fig:unseen-prompt-0shot-cot-mc}, we provide the 0-shot reasoning prompt used for the multiple-choice tasks in the \textit{unseen} evaluation suite, including AGIEval English. The model's answer choice is extracted by first matching to the requested format, with fallback patterns if the format was not followed precisely. Specifically, we first look for the exact phrase indicated in the prompt (``Therefore, the answer is [ANSWER]'') and grab the last such match. If that fails, we look for a sequence of softer variants, like ``answer is [ANSWER]'' or ``answer: [ANSWER]'', before falling back to the last letter in parentheses found, and, if that fails, the last stand-alone capital letter. \paragraph{MMLU-Pro \citep{wang2024mmlu}} is a 10-way multiple-choice extended version of the MMLU dataset. We use essentially the same prompt and answer extraction as for our AGIEval setup, just adjusting for the number of answer choices. We generally find our formulation to be as effective as the traditional 5-shot CoT prompt (e.g., used in the Llama 3.1 evaluations \citep{dubey2024llama}) despite being much shorter, more realistic, and easier for users to create (see Table~\ref{tab:eval_principles_on_mmlu_pro} for a comparison across exploratory models). \paragraph{GPQA \citep{rein2023gpqa}} is a set of very challenging multiple-choice questions written by domain experts in biology, physics, and chemistry. We use the same zero-shot prompt and answer extraction as for AGIEval. This is similar to the approach used in the Llama 3.1 evaluations, but with less prescription on how the reasoning should be structured. Table~\ref{tab:eval_principles_on_gpqa} shows that, across exploratory models, our approach leads to comparable scores. \paragraph{Deepmind Mathematics \citep{saxtonanalysing}} is a dataset of 56 categories of math questions, testing mathematical and algebraic reasoning skills.
We devised a ``zero-shot CoT'' prompt that sets the context of the task, explains the format expected for the answers (e.g., ``x**2'' for powers), and for each category includes three example answers to specify the answer format. The prompt and answer extraction heuristics were refined by examining example outputs from our exploratory models. As seen in Table~\ref{tab:eval_principles_on_deepmind_math}, applying evaluation principles aligned with real usage (``chat'' versions) often performs better than a setup that presents few-shot examples adapted from base model evaluations (``base-adapted''). We provide the zero-shot reasoning prompt for the Deepmind Mathematics task in Appendix~\ref{appendix:unseen-tasks-prompt}, Figure~\ref{fig:unseen-prompt-0shot-cot-deepmind-math}. Aligned with the instructions for formatting in the prompt, when extracting the answer, we first look for an answer in the format ``Therefore, the final answer is [answer]. I hope it is correct.'' We look for this in a case-insensitive way and process the [answer] by stripping away any trailing period and any known math delimiters surrounding the answer (e.g., ``\$''), based on outputs from our exploratory models. For answer comparison, we first compare the raw strings of the processed generated answer and gold answer to check if they are equal. In addition, we parse both using the SymPy \citep{sympy} package and compare the parsed outputs to check their mathematical equivalence. We also consider the generated answer to be correct if it is equal to the parsed gold answer after this parsing. \paragraph{BigCodeBench \citep{zhuo2024bigcodebench}} is a set of coding challenges. We focus on the ``hard subset'' of 148 (out of the total 1140) instances, using the ``instruct'' formulation of each task and the ``calibrated'' score. We follow the setup used for the original leaderboard for our implementation. \subsubsection{New Evaluation: IFEval-OOD} In order to test the precise instruction following abilities of LLMs and whether they are able to follow constraints that go beyond the 25 constraints included in IFEval \citep{zhou2023instructionfollowingevaluationlargelanguage}, we developed IFEval-OOD (IFEval Out-of-Distribution). IFEval-OOD consists of 52 constraints across six broad categories, with examples shown in Table~\ref{tab:ifeval-ood-example}. A complete list of the constraints can be found in Appendix~\ref{appendix:ifeval_ood_constraints}. A portion of the constraints was sourced by asking a group of researchers for constraint ideas, and others were written by authors of this paper. One of the six categories (``custom'') comprises manually written verifiable prompts to cover specific sub-skills, such as CSV generation. For the remaining five categories, each verifiable constraint was combined with 10 unseen prompts from WildChat. To select the final prompts, these constrained prompts were human annotated for quality and compatibility with the constraint (e.g., a prompt to paraphrase a one-sentence reference text that doesn't contain any names to begin with would not be compatible with a constraint to mention at least 23 different person names in the response). To ensure constraint coverage, additional unseen WildChat prompts were manually paired with constraints that did not have at least five selected prompts. Our evaluation dataset emphasizes broad coverage of constraint types to differentiate constraint-following abilities from overfitting on the existing IFEval constraints.
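To make the programmatic answer checking used across the unseen suite concrete, the following is a minimal sketch of the fallback-based multiple-choice extraction (used for AGIEval English, MMLU-Pro, and GPQA) and the SymPy-based equivalence check for Deepmind Mathematics described above. The specific regular expressions and helper names are illustrative assumptions rather than our exact implementation.
\begin{verbatim}
# Minimal sketch of the extraction/comparison heuristics described above.
# The regular expressions and helper names are illustrative, not exact.
import re
import sympy
from sympy.parsing.sympy_parser import parse_expr

def extract_mc_answer(output):
    # Try the exact requested phrase first, then progressively softer fallbacks.
    patterns = [
        r"Therefore, the answer is \(?([A-J])\)?",  # exact phrase from the prompt
        r"answer is \(?([A-J])\)?",                 # softer variant
        r"answer:\s*\(?([A-J])\)?",                 # softer variant
        r"\(([A-J])\)",                             # last letter in parentheses
        r"\b([A-J])\b",                             # last stand-alone capital letter
    ]
    for pattern in patterns:
        matches = re.findall(pattern, output)
        if matches:
            return matches[-1]  # keep the last such match
    return None

def math_answers_match(generated, gold):
    def clean(ans):
        # Strip trailing periods and surrounding math delimiters such as "$".
        return ans.strip().rstrip(".").strip("$").strip()

    generated, gold = clean(generated), clean(gold)
    if generated == gold:  # raw string comparison first
        return True
    try:  # then check mathematical equivalence with SymPy
        return sympy.simplify(parse_expr(generated) - parse_expr(gold)) == 0
    except Exception:
        return False
\end{verbatim}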
\input{tables/ifeval_ood_example} \subsubsection{New Evaluation: HREF} We constructed an automatic evaluation of the instructability of language models, called \textit{Human Reference-guided Evaluation of instruction Following (HREF)}. HREF focuses on 11 instruction following tasks that language models are typically trained on, namely, \textit{Brainstorming}, \textit{Open QA}, \textit{Closed QA}, \textit{Extraction}, \textit{Generation}, \textit{Rewriting}, \textit{Summarization}, \textit{Classification}, \textit{Numerical Reasoning}, \textit{Multi-document Synthesis}, and \textit{Fact Checking}. We obtained high quality human-written prompts and responses in these categories from professional instruction-tuning data creators, and developed a reliable automatic evaluation procedure. Following AlpacaFarm~\citep{dubois2023alpacafarm}, we use win-rate against a fixed baseline model as the evaluation metric. However, since we have a larger number of tasks in our benchmark than AlpacaFarm, and also because we have access to human-written references, we hypothesized that the setup used for win-rate computation may not be directly applicable to our new evaluation. We therefore experimented with various win-rate computation methods along the following dimensions: \begin{itemize} \item \paragraph{Choice of LM Judge.} We tried GPT-4, GPT-4 Turbo, and Llama 3.1 Instruct models at 8B and 70B sizes as LM judges. \item \paragraph{LM-as-a-Judge vs. Direct Comparison with Human References.} Using a large language model as a judge, e.g., GPT-4 Turbo as in AlpacaFarm, may not be appropriate for tasks where the responses are expected to be short and factual. We compared this setup with directly measuring the similarity of the responses from the target and baseline models to the human-written references, using a simpler embedding model (RoBERTa-Large~\citep{liu2019roberta}). \item \paragraph{Use of Human-Written References When Evaluating with LM-as-a-Judge.} Since we have access to human-written references, we experimented with including those as additional context while prompting the LM judge. \end{itemize} To make decisions about the evaluation setup, we collected human judgments comparing responses from a diverse set of 16 models when prompted with instructions taken from a subset of our evaluation dataset. We collected four human judgments per model response pair, and compared the agreement of each evaluation setup with majority human judgments. We repeated this procedure for each task category to identify the best setup for that category. \paragraph{Summary of the Final Evaluation Setup.} We used Llama 3.1 70B Instruct as our LM judge since its agreement with human judges was the highest among the pool of models we considered. The baseline model we use is Llama 3.1 405B Instruct. We found that using an LM as a judge results in higher human agreement (compared to the embedding-based method) in all subtasks except \textit{Open QA} and \textit{Fact Checking}, so we use embedding similarity with human-written references to compute win-rates in these two subtasks. Of the remaining 9 subtasks, evaluation in \textit{Brainstorming} and \textit{Summarization} did not benefit from using human-written references while prompting the LM judge, and we hence prompt the LM judge with human references only for the remaining 7 subtasks.
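As a concrete illustration of the embedding-based comparison used for \textit{Open QA} and \textit{Fact Checking}, a minimal sketch is given below; the choice of mean pooling and cosine similarity, as well as the function names, are assumptions for illustration and not necessarily our exact implementation.
\begin{verbatim}
# Minimal sketch: embedding-based win decision against the baseline model,
# assuming mean-pooled RoBERTa-Large embeddings and cosine similarity.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
encoder = AutoModel.from_pretrained("roberta-large")

def embed(text):
    # Mean-pool the final hidden states over non-padding tokens.
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # [1, seq_len, dim]
    mask = inputs["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

def target_wins(target_response, baseline_response, human_reference):
    # The response whose embedding is closer to the human-written
    # reference wins the pairwise comparison.
    reference = embed(human_reference)
    sim_target = torch.cosine_similarity(embed(target_response), reference).item()
    sim_baseline = torch.cosine_similarity(embed(baseline_response), reference).item()
    return sim_target > sim_baseline

# The win-rate for a subtask is the fraction of prompts where target_wins is True.
\end{verbatim}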
On the subset for which we collected human judgments, our composite evaluation procedure resulted in an agreement of 69.4\% with humans, which is comparable to the inter-human agreement of 67\%. \subsection{Evaluating the Development Process Using the Unseen Suite} To evaluate how much and in what ways we overfit to our development evaluations, we measure the performance of our models on unseen tasks that still correspond to the same set of core skills. We evaluate the checkpoints corresponding to various design decisions we made in the development process on the unseen suite to check whether our decisions overfit to the development evaluations, and summarize our findings in Section~\ref{sec:unseen-design-decisions}. In Section~\ref{sec:unseen-public-models}, we show a comparison between our final checkpoints and comparable public models. \subsubsection{Evaluating the design decisions} \label{sec:unseen-design-decisions} \input{tables/unseen_pipeline_generalization} \paragraph{Training pipeline.} Table~\ref{tab:unseen_pipeline} shows the performance of the SFT, DPO, and final checkpoints of \modelname~3 at 8B and 70B sizes on one development and one unseen evaluation for each core skill. We see that our pipeline generalizes well to unseen evaluations, with the final checkpoints obtaining the best average performance on both the development and unseen evaluations. For Reasoning and Coding, where the SFT checkpoints have the best performance on development evaluations, the subsequent training stages still improve model performance on harder unseen evaluations. \input{tables/unseen_sft_data_mixing} \paragraph{Data mixing for SFT.} To assess whether the data mixing choices we made for SFT generalize to unseen evaluations, we compare the performance of various data-ablated SFT models with that of the final SFT checkpoint, as shown in Table~\ref{tab:unseen_sft_data}. For each core skill, we compare the model performance trends on a development and an unseen evaluation. We see that the data choices generalize on average, as indicated by the final SFT checkpoint obtaining the best average performance on both development and unseen evaluations. In individual skills, we see that our choices overfit to the development evaluations in Precise Instruction Following, and to some extent in Knowledge Recall and Reasoning. \begin{minipage}[t]{\textwidth} \centering \includegraphics[width=0.45\textwidth]{figures_images/pref_scaling_unseen_combination.pdf} {\small \\ \cblock{251}{222}{236}~Initial 8B SFT \quad \cblock{206}{220}{221}~5\% \quad \cblock{183}{203}{204}~10\% \quad \cblock{159}{185}{187}~25\% \quad \cblock{136}{168}{171}~50\% \quad \cblock{110}{151}{154}~75\% \quad \cblock{240}{82}{156}~100\% } \captionof{figure}{Effect of scaling the size of the preference dataset, specifically the number of unique prompts, on downstream DPO model performance in development and unseen evaluations. AVG denotes the average over all tasks in the development and unseen suites. Also shown are the trends in MATH and Deepmind Mathematics.} \label{fig:pref_scaling_unseen} \end{minipage} \paragraph{Data scaling for preference tuning.} We show the effect of scaling DPO data on development and unseen evaluations in Figure~\ref{fig:pref_scaling_unseen}. We see that the scaling trends generalize on average to unseen evaluations. Of note is the trend we observed in the development and unseen Math evaluations, indicating that our development process overfit to MATH to some extent.
We hypothesize this is mainly due to formatting differences between MATH and Deepmind Math. The former often requires solutions and answers to be output in LaTeX format, while the latter does not. We found that our trained models tend to format the chain-of-thought reasoning and the final answers in LaTeX even for questions in the Deepmind Math dataset where this is not required. This often interfered with the intermediate reasoning, and also made our answer extraction logic fail. \subsubsection{Comparison with public models} \label{sec:unseen-public-models} \input{tables/unseen_results} Table~\ref{tab:unseen_results} shows a comparison between \modelname~3 models, Llama 3.1 Instruct models, and Hermes 3 Llama 3.1 models at 8B and 70B sizes, and Table~\ref{tab:href_eval} shows a subtask-level breakdown of the performance of these models on HREF. It is important to note that while all these evaluations are unseen for the \modelname~3 models, we do not know whether GPQA, MMLU-Pro, AGIEval, DeepMind Math, and BigCodeBench were used for developing the other two models. We summarize below our key takeaways from this comparison and from our qualitative analysis of the outputs of \modelname~3 models on these datasets: \paragraph{\modelname~3 generalizes well to unseen evaluations.} In almost all the evaluations, \modelname~3's performance is generally comparable to that of the other two models we evaluate, often falling between their performance numbers. This suggests that our recipe of choosing representative evaluations for each core skill and curating training datasets targeting those evaluations can lead to models that generalize well to other tasks that require the same skills. \paragraph{Models generally overfit to IFEval.} We find that there is a significant difference between the performance of all models on IFEval and IFEval-OOD, even though we created the latter to be structured very similarly to the original dataset, just with a disjoint set of constraints. We observe that instruction following with verifiable constraints is a challenging skill for models to learn effectively, and hypothesize that models that do well on IFEval are likely overfitting to the specific set of constraints included in the dataset, making it hard for them to generalize to new constraints. \paragraph{Generalization on knowledge recall may depend on the post-training recipe.} As one might expect, the performance of models on MMLU and MMLU-Pro is correlated. However, their performance on GPQA shows a different trend --- all three models we compare here are post-trained from the same base model, suggesting that the post-training recipe may affect generalization in knowledge recall. \paragraph{Instruction following performance varies across categories.} We observe that the relative performance of \modelname~3 models on AlpacaEval is different from that on HREF. This may be explained by the fact that instruction following is a highly diverse task, and the distributions of HREF and AlpacaEval may differ, with some categories of instructions not necessarily transferring well to others, leading to the shift in relative performance. We do note that \modelname~3 70B outperforms Llama 3.1 70B Instruct on 5 out of 11 subtasks, as seen in Table~\ref{tab:href_eval} in the Appendix, which shows a breakdown of model performance across subtasks in HREF.
Future work will explore how different behaviors of instruction following can be measured with more diverse instruction following evaluations. \section{Discussions} \label{sec:discussion} \input{tables/405b_hypers} \begin{figure}[t] \centering \includegraphics[width=\linewidth]{figures_images/405b_rl.pdf} \caption{Rewards on MATH, KL divergence, and the average response length for the final 405B training run. We mark the final checkpoint with a star. We note that this was the last checkpoint saved -- we intended to train longer but hit compute constraints. Note that technically the metric on the MATH test split is accuracy, but here we multiply the accuracy by 10 to convert it to verifiable rewards. } \label{fig:405brl} \end{figure} \subsection{Scaling \modelname~3 Recipe to Llama 3.1 405B} \label{sec:405b-tulu3} Here, we demonstrate the scalability and effectiveness of our post-training recipe applied at the 405B parameter scale. Scaling to this size required several engineering efforts and posed a number of challenges: \begin{itemize} \item \textbf{Compute Requirements}: Training \modelname~3 405B demanded 32 nodes (256 GPUs) running in parallel. While most of our codebase scaled well, we occasionally encountered NCCL timeout and synchronization issues that required meticulous monitoring and intervention (especially with RL training). Using more GPUs increased the chances of encountering hardware failures, necessitating semi-frequent run restarts. \item \textbf{RLVR Training}: For inference, we deployed the model using vLLM with 16-way tensor parallelism, while utilizing the remaining 240 GPUs for training. After each RLVR update, the weights are synchronized to the vLLM engine using an NCCL broadcast. Inference typically takes $\sim$550 seconds, weight transfer takes $\sim$25 seconds, and training takes $\sim$1,500 seconds. To reduce computational cost during the RLVR stage, we utilized an 8B value model. Future work can benefit from exploring larger value models or alternative value-model-free RL algorithms such as GRPO~\citep{shao2024deepseekmath}. \item \textbf{Hyperparameter Tuning Challenges}: Given the computational costs, hyperparameter tuning was limited. Following prior Tülu and Llama work, we opted to lower the learning rate for larger models, training with a ``lighter touch''. \end{itemize} Our training recipe for the 405B model closely followed that of the 8B and 70B models,\footnote{DPO Preference Mixture for 405B: \huggingfacedatasettinynologo{allenai/llama-3.1-tulu-3-405b-preference-mixture}} but with a different training dataset for RLVR. Given the model's saturation of GSM8K from SFT and DPO training alone, we removed the GSM8K data, and we additionally found that the IFEval data did not help much in initial RLVR runs. As such, for \modelname~3 405B RLVR we only used the MATH train set. Surprisingly, we found that even with as few as 25 RLVR steps, MATH performance improved by over 5 points, and continued to increase with more training. With the challenges of scaling the asynchronous RL infrastructure, we only trained for 75 steps, fewer than for our smaller models. The RL reward, KL divergence, and response length per training batch are shown in Fig.~\ref{fig:405brl}. The training hyperparameters for the SFT and DPO stages are shown in Table~\ref{tab:combined-hyperparameters-405b}. The hyperparameters for RL with verifiable rewards at this scale are shown in Table~\ref{tab:405b-rl-hyper}.
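As a point of reference for the reward curves in Figure~\ref{fig:405brl}, a verifiable reward of this kind can be sketched as follows: a completion whose final answer matches the ground truth earns a fixed reward of 10, and 0 otherwise (hence the accuracy-times-10 scaling noted in the caption). The answer-extraction helper below is an illustrative assumption rather than our exact implementation.
\begin{verbatim}
# Minimal sketch of a verifiable reward for RLVR on MATH-style problems.
import re

VERIFIABLE_REWARD = 10.0  # fixed reward for a verifiably correct answer

def extract_final_answer(completion):
    # Assumes the model is prompted to state "The final answer is <answer>".
    matches = re.findall(r"final answer is\s*(.+)", completion, flags=re.IGNORECASE)
    return matches[-1].strip().rstrip(".") if matches else None

def verifiable_reward(completion, gold_answer):
    predicted = extract_final_answer(completion)
    if predicted is not None and predicted == gold_answer.strip():
        return VERIFIABLE_REWARD
    return 0.0
\end{verbatim}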
In Table \ref{tab:405b_results}, we compare \modelname~3 405B with prior state-of-the-art models finetuned from Llama 3.1 405B as well as DeepSeek-V3 \citep{deepseekai2024deepseekv3technicalreport} and GPT-4o. Generally, \modelname~3 405B results improve over those of \modelname~3 70B. \modelname~3 405B achieves competitive or superior performance to both DeepSeek-V3 and GPT-4o, while also surpassing prior open-weight post-trained models of the same size, including Llama 3.1 405B Instruct and Nous Hermes 3 405B \citep{teknium2024hermes}, on many standard benchmarks. We note that we ended RLVR training early due to compute constraints, and additional training may further improve performance. In particular, we did not observe that MATH performance had saturated during training and testing (see Figure \ref{fig:405brl}). \subsection{Insights from the Unfruitful} \label{sec:negative} In this section, we discuss a number of methods and approaches we considered for \modelname~3 that did not ultimately make it into our final recipe, for a variety of reasons. \paragraph{Online DPO.} Standard DPO methods use preference datasets that are usually collected ahead of time, often from a distinct language model, and are thus considered offline. In other words, with DPO, the policy cannot obtain feedback on its own generations during training. This is in contrast to online methods like PPO, where the RM provides online feedback on generations from the policy $\pi_\theta$ being trained. To mitigate this distributional shift issue, recent work proposed Online DPO \citep{guo2024directlanguagemodelalignment}, which follows a three-step process: (1) sample two responses to a prompt from the current policy; (2) obtain online feedback over the response pair to create pairwise data; and (3) use this pairwise data to update the policy $\pi_\theta$ via the standard DPO loss. While the original paper proposed using online AI feedback for step 2, to better scale our experiments, we obtain feedback from a trained reward model. We tried online DPO both to enhance general capabilities and to target specific capabilities, i.e., mathematical reasoning. For general capabilities, we train an RM for 1 epoch using the 82K preference data points from the Skywork dataset.\footnote{\url{https://huggingface.co/datasets/Skywork/Skywork-Reward-Preference-80K-v0.1}} For targeting mathematical reasoning, we continue training the same RM on our synthetic on-policy math-specific preference data (described in \S\ref{sec:pref_pipeline}). Training online DPO on top of one of our \modelname~3 DPO checkpoints for a total of 200K episodes on math problems (prompts are taken from the same RM training data) resulted in little or no improvement on GSM8K and degraded MATH performance (we experimented with various sampling temperatures and KL penalty coefficients). We did not explore this approach further, as our initial results suggested limited gains in both general and targeted domains. Future work could investigate alternative training strategies, such as different sampling methods or finetuning RM architectures, to better align the optimization process with the desired capabilities. \paragraph{Rejection Sampling.} Rejection sampling is an increasingly popular method for improving the post-training performance of frontier language models \citep{dubey2024llama, adler2024nemotron, dong2023raftrewardrankedfinetuning}.
An initial model is trained using an initial SFT and preference data mix, and that model is then used to generate \textit{n} responses to each SFT prompt. These \textit{n} responses, in addition to the original response, are then ranked using a reward model or an LLM as a judge, and the best response is kept. The other responses can then be used to create chosen/rejected pairs for preference optimization. The full post-training pipeline is then run on these datasets, and the process repeats until performance converges. We tried rejection sampling, but found that for our setup the performance gains were minimal for the amount of compute required, and thus we leave a deeper exploration for future work. Qualitatively, we found that strong judges are vital, and publicly available models can struggle to choose the best response out of the candidates. We also found that including the original response as a choice for the judge (in other words, choosing the best response out of the \textit{n} generations in addition to the original response) performed much better than only choosing from the newly generated responses. \subsection{Future Work} While we aimed to cover a broad set of skills for \modelname~3, due to various limitations, we left some skills for future iterations of \modelname. We discuss some of these skills below: \paragraph{Long Context and Multi-turn.} Currently, the data collected for \modelname~3 is relatively short and does not contain long multi-turn data (the average number of turns in our mixture is 2.4, and the majority of samples are under 2,048 tokens in length). However, long context has been a popular area of focus in recent work~\citep{pawar2024whatwhycontextlength}, as improving the context window of LMs enables new use-cases~\citep{geminiteam2024gemini15unlockingmultimodal} and more in-context examples, potentially improving performance~\citep{agarwal2024manyshot}. Relatedly, improving multi-turn capabilities can improve the end-user experience, with a non-trivial number of real-world user conversations with LMs going beyond 2 turns~\citep{zhao2024wildchat}. We hope to address both skills in the future with dedicated training and evaluations. \paragraph{Multilinguality.} We specifically focus on English data and evaluations for \modelname~3 (although we do include the multilingual Aya~\citep{ustun-etal-2024-aya} dataset due to its high quality). However, this neglects the myriad languages apart from English spoken around the world, speakers of which may benefit from or desire LMs that can process their languages. Future work may examine the current multilingual abilities of \modelname~3 and how to further improve them. We also note that multilingual post-training can make use of techniques different from those used in monolingual post-training -- for example, cross-lingual alignment~\citep{wu-etal-2024-reuse} or careful data balancing strategies~\citep{li2024upsampleupweightbalancedtraining}. This makes multilingual post-training an interesting and impactful area for future work.
\paragraph{Tool Use and Agents.} While we evaluate \modelname~3 on its own, LMs are increasingly being deployed as parts of larger systems, in which they have access to tools~\citep{qu2024toolsurvey} or are themselves part of a larger `agent' framework.\footnote{For example, the OpenHands platform~\citep{openhands}.} Furthermore, training models to use tools is a natural way to dramatically improve their reasoning and mathematical skills~\citep{gou2024tora}, rather than trying to accomplish everything `in the weights.' Future work involves training or evaluating \modelname~3 for tool use, either on its own or as part of a larger framework. \section{Related Work} \subsection{The Evolution of Post-training Recipes} Modern ``post-training'' has its roots in multi-task language model training, in particular \textit{instruction tuning} \citep{mishra-etal-2022-cross,wei2022finetuned,sanh2022multitask,wang-etal-2022-super, longpre2023flan}, in which language models are trained on samples including task instructions and their corresponding responses, allowing the models to generalize `zero-shot' to new tasks.\footnote{The term ``post-training'' substantially predates modern chat language models~\citep{moreau2016post, xu-etal-2019-bert}, but was recently popularized.} Early instruction-tuning datasets tended to focus on more traditional NLP tasks (e.g., natural language inference) rather than more generic tasks that downstream users might perform~\citep{wang2022self}. With the rise of ChatGPT and chat-based LMs (Claude, Gemini, etc.), post-training techniques evolved beyond instruction tuning to include preference tuning stages, with models undergoing both instruction tuning and subsequent preference finetuning (PreFT), or ``RLHF''~\citep{ouyang2022training}. Early work in RLHF originated from experiments on Deep RL for control~\citep{christiano2017deep, ibarz2018reward,leike2018scalable} and typically involved first learning a reward model from human preferences, and then optimizing a language model via an RL framework using the learned reward~\citep{stiennon2020learning, nakano2021webgpt, askell2021general, ouyang2022training}. Recently, approaches that allow directly training a language model on such preferences have been developed~\citep{rafailov2024direct, zhao2023slichfsequencelikelihoodcalibration}, reducing the complexity of incorporating PreFT into training. While early approaches to PreFT were extremely human-centric, using tens or hundreds of thousands of human-written instructions and human preference labels, more recent work uses mixtures of human and synthetically generated preference data, along with multiple rounds of training and varied training algorithms~\citep{touvron2023llama, dubey2024llama, gunter2024apple}. While RLHF evolved primarily in closed laboratories, open recipes for post-training have lagged somewhat behind. Initial attempts at building `open post-training recipes' focused on the instruction-tuning stage~\citep{alpaca, DatabricksBlog2023DollyV2}, finetuning openly released language models on synthetically generated or human-made datasets. While combining these datasets could yield strong performance~\citep{wang2023far}, incorporating a PreFT stage proved important, based on human evaluations, for bridging the gap with closed models~\citep{ivison2023camels}.
Today, most popular adapted models with \textit{open recipes} for PreFT use DPO (or one of its variants) and AI feedback data, including \modelname~2~\citep{ivison2023camels}, Zephyr-$\beta$~\citep{tunstall2023zephyr}, and Starling~\citep{starling2023}. However, many of these models are outdated relative to closed post-training recipes both in terms of data and performance: none of the models in the top 50 of LMSYS's ChatBotArena (as of November 20th, 2024) have released their post-training data~\citep{chiang2024chatbot}. Most of these open recipes use relatively little data and few rounds of training compared to closed post-training setups, which can involve multiple rounds of training with varied objectives and millions of datapoints~\citep{touvron2023llama, dubey2024llama}. For instance, Llama 3.1 was trained on generated outputs from the previous round's model for multiple rounds with extensive human feedback data, and used strong models to write synthetic instructions~\citep{dubey2024llama}. Other recent developments include rejection sampling for synthetic data and advanced reward modeling for step-wise assistant responses~\citep{lightman2023let}. While we do not entirely reach the size of these closed recipes, in this work we hope to push the state of open post-training recipes forward by building a recipe that matches or beats strong closed recipes, and by releasing all related artifacts (code, models, data, etc.) for further scientific study and use. Our pipeline is significantly larger and more complex than prior work, comprising almost a million instruction tuning samples, hundreds of thousands of preference pairs, and a novel online RL training phase. \subsection{Training on Verifiable Rewards} The RLVR approach proposed in this work relates to a variety of recent work on improving LM reasoning through RL-related techniques. Closely related are the self-taught reasoner (STaR) line of work~\citep{zelikman2022star, Zelikman2024QuietSTaRLM} and TRICE~\citep{hoffman2023training}, both of which examine using existing ground-truth answers as signals to generate better model rationales (or chains-of-thought). STaR can be seen as an approximation of a policy gradient algorithm, with Quiet-STaR extending the approach by training the model to use additional generations to improve generic language modeling (`thinking before speaking'). TRICE~\citep{hoffman2023training} also aims to improve the likelihood of correct answers by training over multiple reasoning traces, using a custom MCMC-based EM algorithm. More recently, VinePPO~\citep{VinePPO} uses binary rewards from GSM8K and MATH correctness to test a novel PPO-based algorithm, and other recent work has explored using code feedback as a signal for training~\citep{gehring2024rlefgroundingcodellms, xudpoppo}. In contrast, our proposed approach, RLVR, simply uses an existing RL framework (PPO) for training, and runs entirely online with binary rewards (compared to the iterative approach of STaR or the log-likelihood rewards of Quiet-STaR). Additionally, we expand beyond the math domain, also finding that this approach can yield improvements in precise instruction following. Finally, we also carefully ablate a number of core components of RLVR, including value model initialization and using a general reward model with verifiable rewards. We hope to further develop and expand this technique in future work.
\section{Conclusion} We introduce \modelname~3, a family of fully open state-of-the-art language models, featuring a modern post-training framework with fully open-source data \modelname~3~\textsc{Data}, evaluation~\evalname, training code \modelname~3~\textsc{Code}, and development recipes \modelname~3~\textsc{Recipe}. \modelname~3 bridges the gap between open and closed post-training methods, marking a new milestone in open post-training research. With the resources provided, others can build on open base models and finetune them for high performance across diverse tasks. This paves the way for advancing post-training research within multi-objective and multi-stage training frameworks.