Tiny Paper Summaries
Forcing myself to write (usually tiny) summaries of things I’m reading to improve retention. Not intended to teach – will be simultaneously terse and verbose and probably imprecise in places!
Instruction following
Training models to follow instructions. Sometimes referred to as alignment, but that’s an overloaded term imo, and I mean something more specific here.
LIMA
The authors fine-tune an LLM (Llama) on a set of just 1K curated instructions from the open assistant dataset. Their model performs comparably to GPT3.5, which was fine-tuned with many more demonstrations + RLHF, and does not trail proprietary models like GPT-4/Claude by much, despite the decrease in size and training data. They coin the “Superficial Alignment Hypothesis” to describe this phenomenon, as their work suggests that the vast majority of LLM capabilities are learned during pretraining rather than during alignment to human preferences via SFT/RLHF/etc. Some interesting findings include: (i) a negative correlation between validation perplexity and human-annotated response quality, and (ii) a strong ability to generalize to multi-turn dialogue without any multi-turn examples in training, and an even stronger performance bump from including as few as 30 multi-turn conversation examples.
Humpback
The authors present a method for self-aligning language models using entirely open-source data, without distilling from data produced by stronger models (e.g., GPT-4). They do so with an iterative algorithm that first (i) augments the instruction dataset by predicting instructions (\(\hat{x}\)) from an assumed fulfillment of said instruction (response \(y\)), which is a span of text from the web. They then (ii) curate the set of augmented instruction/response pairs by prompting the model to assign scores to the quality of the instructions, and selecting only those above some score \(k\). They perform several (2) iterations of augment/curate, and then fine-tune on the augmented, higher-quality subset of instruction data, \(A_{k}^{2}\). They perform extensive analysis comparing to other instruction-following models, and demonstrate improved performance compared to other models tuned on or distilled from proprietary data.
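A rough sketch of how I understand one augment/curate iteration (my pseudocode, not the authors’ code; `backtranslate` and `score_instruction` are stand-ins for the prompted LLM calls):

```python
from typing import Callable, Iterable

def augment_and_curate(
    web_responses: Iterable[str],                  # unlabeled web text treated as responses y
    backtranslate: Callable[[str], str],           # LLM predicts an instruction x_hat for a given y
    score_instruction: Callable[[str, str], int],  # LLM rates the (x_hat, y) pair, e.g. on a 1-5 scale
    k: int = 4,                                    # quality threshold used for curation
) -> list[tuple[str, str]]:
    """One Humpback-style iteration: returns the curated set A_k of (instruction, response) pairs."""
    augmented = [(backtranslate(y), y) for y in web_responses]
    return [(x, y) for (x, y) in augmented if score_instruction(x, y) >= k]
```

Running this for a couple of iterations (re-scoring with the improved model) and fine-tuning on the result gives the \(A_{k}^{2}\) data mentioned above.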
Unnatural Instructions
The authors create a set of instruction/input/output data by (i) prompting LLMs to generate instruction/input examples in a few-shot setting with nucleus sampling and (ii) decoding the output greedily with an LLM. Using only LLM generated instruction data, their model performs comparably to an equivalent LLM fine-tuned on “Supernatural Instructions”, which was created using expensive human annotations. Additionally, they provide evidence that suggests that their instruction set is more diverse than those generated by humans (via BERT score), and hypothesize that humans fall back on heuristics or templates learned via annotation guidelines, which results in said lack of instruction diversity.
Vicuna
LLaMA fine-tuned on conversations from ShareGPT
Wizard
LLaMA fine-tuned using instructions of varying complexity generated by an “instruction evolution” (Evol-Instruct) algorithm which prompts a strong LLM (ChatGPT) to make instructions more complex in diverse ways, then filters the pooled instruct/response examples.
Alpaca
Fine-tuned LLaMA using “self-instruct” generations produced by OpenAI’s text-davinci-003 given a seed set of instructions.
RAG & Agents
Getting models to respond using tools + data (potentially) not seen during training (i.e., non-parametrically).
Toolformer
The authors detail a fully self-supervised method for teaching an LLM to (selectively) use tools, including calculators, search engines and calendars. They do so by fine-tuning a model on a modified pretraining corpus that includes API input/output results in plain text when relevant. They insert such API calls by (i) few-shot prompting an LLM with examples that demonstrate when an API call is useful (e.g., “One US dollar is approximately equal to 23 Czech crown, or [API(1/23)→.043] 4.3%.”), and (ii) filtering to only cases where the API input/output reduces the loss on the subsequent tokens by at least some fixed amount (a hyperparameter). Their model, Toolformer, is better than models which cannot use APIs on several tasks including math, question answering, and time-sensitive QA, and even exceeds the performance of the much larger GPT-3 on several tasks.
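A minimal sketch of that filtering step as I understand it (my paraphrase, not the authors’ code; `lm_loss` stands in for the LM’s loss on the continuation given a prefix):

```python
from typing import Callable

def keep_api_call(
    prefix: str,                # text before the candidate API call
    continuation: str,          # tokens following the call site
    call_with_result: str,      # e.g. "[API(1/23) -> .043]"
    call_without_result: str,   # e.g. "[API(1/23)]"
    lm_loss: Callable[[str, str], float],  # loss of `continuation` given a prefix
    tau: float = 1.0,           # required loss reduction (hyperparameter)
) -> bool:
    loss_plain = lm_loss(prefix, continuation)
    loss_call_only = lm_loss(prefix + call_without_result, continuation)
    loss_with_result = lm_loss(prefix + call_with_result, continuation)
    # Keep the call only if including its result helps by at least tau
    # relative to the better of the two baselines.
    return min(loss_plain, loss_call_only) - loss_with_result >= tau
```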
Self-RAG
Learning when to retrieve + critiquing retrieved results generatively.
ReAct
Prior work has explored:
- using language models to generate action plans (e.g., WebGPT)
- improving “reasoning” capabilities of language models via prompting strategies such as CoT.

ReAct combines these two directions with a prompting “framework” that uses in-context learning (ICL) to describe the action space (descriptions of actions/tools/APIs) and encourage the model to generate “thoughts” prior to generating action plans. Their hypothesis is that using thoughts will improve reasoning, i.e., action selection in this setting. Their experiments on multi-hop QA, fact-checking, and other interactive decision-making tasks support this.
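A rough sketch of the loop (my paraphrase, not the authors’ code; `generate` and `tools` stand in for the few-shot-prompted LLM and the available APIs):

```python
from typing import Callable

def react_loop(
    question: str,
    generate: Callable[[str], str],          # LLM call: transcript -> next thought/action line(s)
    tools: dict[str, Callable[[str], str]],  # e.g. {"Search": ..., "Lookup": ...}
    max_steps: int = 8,
) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = generate(transcript)          # e.g. "Thought: ...\nAction: Search[query]"
        transcript += step + "\n"
        if "Finish[" in step:                # model decided it has the answer
            return step.split("Finish[", 1)[1].split("]", 1)[0]
        if "Action:" in step and "[" in step:
            action = step.split("Action:", 1)[1].strip()     # e.g. "Search[some query]"
            name, arg = action.split("[", 1)
            observation = tools[name.strip()](arg.split("]", 1)[0])
            transcript += f"Observation: {observation}\n"
    return transcript                         # ran out of steps without finishing
```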
The authors also explore fine-tuning LLMs on ReAct trajectories (sequences of (thought, action, observation) triples). They automatically create SFT data via a bootstrapping method: run ReAct via few-shot prompting an LLM (not fine-tuned) and keep only the trajectories that result in correct final responses (using heuristics like exact match on NQ, for example) for fine-tuning (how biased is this sample?). This also results in improved performance, but at what cost? To what extent are we compromising general language modeling capabilities when we fine-tune on data, most of which essentially only requires solving a classification task (thought/action selection) to model well?
ReWOO
Removes the sequential (or “reactive”) planning of ReAct by decoupling the planning from the execution of actions. They do so by generating plans with intermediate results referenced via variables that can be used in subsequent actions (see example in fig below). This is more efficient (time and $), because it requires fewer invocations of the LLM to generate action plans.
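An illustrative toy plan of the kind I have in mind (my own example, not taken from the paper): the planner writes every step up front and refers to intermediate results by variable, so no LLM call is needed between tool executions.

```python
# Toy ReWOO-style plan (illustrative): #E1 and #E2 are placeholders the worker
# fills in by running the tools, without going back to the planner between steps.
PLAN = """\
Plan: Find who directed the film, then find that person's birth year.
#E1 = Search[director of the film in question]
#E2 = Search[birth year of #E1]
"""
```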
LLM Compiler
Nice work, but it’s a glorified ReWOO that fails to give ReWOO enough credit. They extend ReWOO by generating a directed acyclic graph (DAG) with their action planner, which allows for parallel execution of independent tasks. This is not really a limitation of ReWOO, just something that its authors did not experiment with; they even mentioned concurrent execution of DAGs in their future work!
Parameter-efficient fine-tuning (PEFT)
Fine-tuning models with fewer learned parameters!
LoRA
The authors fine-tune large models by freezing them and only training small adapters. They do so by decomposing a large matrix \(W'\) into a sum of the (frozen) pretrained matrix \(W_0\) and a low-rank perturbation \(\Delta W\). Thus the forward pass to compute hidden state \(h\) from input \(x\), \(h = W_0x\), is replaced with: \begin{equation} h = W_0x + \Delta W x = W_0x + BAx. \label{lora-eq1} \end{equation} Importantly, \(\Delta W\) is low rank because it is factorized into matrices \(B\) (\(d_{model} \times r\)) and \(A\) (\(r \times d_{model}\)), where \(r\) is something small like 8. Further, after training, these matrices can be “merged” into a matrix \(W'\) that’s the same size as the original matrix \(W_0\), so there’s no increase in computation/latency at inference time!
During training, they update fewer than 1% of the total parameters, which saves a ton on memory, because optimizers like Adam store optimizer states whose memory is proportional to the number of tuned parameters, in order to compute adaptive gradient updates from running estimates of the gradient mean/variance.
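A minimal sketch of a LoRA-style linear layer (assuming PyTorch; dimensions and init roughly follow the description above, and details may differ from the paper/library implementations):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d_in: int, d_out: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        # Frozen pretrained weight W_0 (random here just to make the sketch runnable).
        self.weight = nn.Parameter(torch.randn(d_out, d_in), requires_grad=False)
        # Only the low-rank factors are trained; B starts at zero so Delta W starts at 0.
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W_0 x + B A x (scaled)
        return x @ self.weight.T + self.scale * (x @ self.A.T @ self.B.T)
```

Folding `scale * B @ A` into the frozen weight after training is the “merge” mentioned above.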
QLoRA
The authors fine-tune Llama models using LoRA adapters atop 4- and 8-bit quantized base LLMs without sacrificing performance. Additionally, they fine-tune their model using instruction/chat datasets like Open Assistant/FLAN v2 and demonstrate performance comparable to ChatGPT as judged by GPT-4, highlighting the importance of data quality over quantity. Impressively, this can be done using only a single A100 GPU.
Understanding Transformer Mechanisms
Understanding how transformers implement things (mechanistic interpretability) and other transformer-specific stuff.
A Mathematical Framework for Transformer Circuits
Inspired by the Distill circuits thread, which defined & investigated “circuits”, or weights of a neural network that connect such that they implement some interpretable “function” like detecting curves in images, the authors set out to study circuits in transformer models. They do so by first conceptualizing transformers in a new – but mathematically equivalent – way, then finding circuits in “toy” models (0 to 2-layer transformers).
Conceptual Framework
Residual Stream: This work uses the “residual stream” to refer to the contextualized embeddings (or hidden states) produced by the transformer. They use this term for a single-token embedding and sequences of tokens alike, and throughout all layers of the model, akin to a “state” vector that changes at each layer. They prefer this term because it emphasizes the residual nature, i.e., that each layer’s output is the sum of its input and some transformation of that input (e.g., \(t_{i+1} = t_{i} + f(t_i)\)). They like to think of attention heads functionally “reading” from and “writing” information to the residual stream.
Independent + Additive Attention Heads: The original Transformers paper (Vaswani et al.) parameterizes an attention “head” by three matrices: \(W_Q\), \(W_K\), \(W_V\), which represent the “query”, “key”, and “value” weights, respectively. The query and key matrices are used to compute the attention pattern, and the “result”, \(r^{h_1}\) for head 1, is the attention-weighted sum of the value vectors. The \(n\) result vectors (1 per head) are concatenated (\([r^{h_1}, ..., r^{h_n}]\)), before being multiplied by a final “output” matrix \(W_O\). Instead, we can split the output matrix into \(n\) “blocks”, each of size \(d_{\text{model}} \times d_{\text{model}} / n\), and express this as a sum of products with each block:
\begin{align} \label{framework} W_O \begin{bmatrix} r^{h_1} \cr \vdots \cr r^{h_n} \end{bmatrix} = [W_O^{h_1}, \dots, W_O^{h_n}] \cdot \begin{bmatrix} r^{h_1} \cr \vdots \cr r^{h_n} \end{bmatrix} = \sum_{i=1}^{n}{W_O^{h_i} r^{h_i}}, \end{align}
which makes it clear that each attention head independently contributes to the information added to (or removed from) the residual stream. Additionally, we can (conceptually) parameterize each attention “head” with its own “output” weights \(W_O^{h_n}\).
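A tiny numerical check of this framing (my own, using numpy): concatenating head results and applying \(W_O\) gives the same vector as summing the per-head contributions \(W_O^{h_i} r^{h_i}\).

```python
import numpy as np

d_model, n_heads = 64, 4
d_head = d_model // n_heads
rng = np.random.default_rng(0)

W_O = rng.normal(size=(d_model, d_model))                     # full output matrix
results = [rng.normal(size=d_head) for _ in range(n_heads)]   # per-head results r^{h_i}

concat_then_project = W_O @ np.concatenate(results)
blocks = np.split(W_O, n_heads, axis=1)                       # W_O^{h_i}: d_model x d_head each
sum_of_heads = sum(block @ r for block, r in zip(blocks, results))

assert np.allclose(concat_then_project, sum_of_heads)
```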
\(W_{QK}\) & \(W_{OV}\) Matrices: The authors find it useful to decompose attention heads into two independent operations: i) producing attention patterns, and ii) determining what information to read from source tokens and how to write it to destination tokens. Traditionally, we compute attention patterns by computing “query” and “key” values separately, then computing their inner-products, but this is equivalent to multiplying by the low-rank \(W_{QK}\) matrix. Similarly, the “output” and “value” matrices always operate together, and given an attention pattern, we can compute the head’s output by multiplying with the low-rank \(W_{OV}\) matrix.
QK & OV Circuits: When studying 1-layer transformer models, we can compute what they’ve called the “QK” or “OV” circuits (or matrices). This is best explained via the paper’s figures :), but roughly:
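(My paraphrase of the paper’s definitions, writing \(W_E\) for the token embedding and \(W_U\) for the unembedding; each head has its own pair of these products.) \begin{equation} \text{QK circuit} = W_E^T W_Q^T W_K W_E, \qquad \text{OV circuit} = W_U W_O W_V W_E. \end{equation}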
These 4-matrix products (circuits) each produce \(d_{vocab} \times d_{vocab}\) dimensional matrices, which can be interpreted! The “QK” circuit describes how much a query token “wants” to attend to another key token, while the “OV” circuit describes how much a given token (if attended to) will change the logits corresponding to another output token. The authors analyze these circuits by literally reading the billions of entries in these matrices and finding outstanding entries.
Mechanisms
Zero layer transformers: “Zero layer transformers model bigram statistics.”
One layer transformers: “One layer attention-only transformers are an ensemble of bigram and “skip-trigram” (sequences of the form “A… B C”) models.”
Two layer transformers: “Two layer attention-only transformers can implement much more complex algorithms using compositions of attention heads.”
Induction Heads: Two-layer models ostensibly learn “induction heads”, which allow them to model patterns of the form [a][b] … [a] → [b]. This can be thought of as the simplest form of in-context learning, where a model completes a pattern observed previously in the sequence. In fact, this generalizes even for random sequences, indicating that these patterns weren’t simply memorized during training. Modeling this requires “composition” of attention heads: one head in the first layer copies information from the previous token in the sequence, and another head in the next layer searches for “similar” queries and finds the token it should predict next (b), because the residual stream corresponding to that token contains information indicating that a was its previous token!
Final Note
This paper is super long, but there’s loads of great stuff in here that I can’t do justice in a summary, so again this work warrants its own post. The work took me tons of time to read, and then some more time to read again… but I’d really recommend it to anyone looking to understand transformers deeply, which often requires a new perspective.
What learning alg is in-context learning?
The authors demonstrate that, via in-context learning, transformers are capable of implementing “classic” learning algorithms like OLS or SGD to solve prototypical linear regression problems presented in context.
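A toy version of the setup as I understand it (my sketch, not the authors’ code): build a random in-context regression problem and compute what an explicit OLS learner would predict for a query point; the paper’s experiments check whether the trained transformer’s in-context predictions track reference learners like this.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_context = 8, 32
w_true = rng.normal(size=d)
X = rng.normal(size=(n_context, d))
y = X @ w_true                                    # noiseless in-context examples
x_query = rng.normal(size=d)

w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)     # what an OLS "learner" infers from the context
print(float(x_query @ w_ols), float(x_query @ w_true))  # should match in the noiseless case
```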
Safety
Understanding + mitigating model biases, frameworks for evaluating models on hard tasks, red-teaming, the works.
Towards Understanding Sycophancy
Studies the prevalence of sycophancy, or flattery in order to gain the approval of a superior, in language models, particularly those trained with human preferences. They find several patterns of sycophantic behavior such as:
- admitting mistakes that don’t exist when questioned by the user
- demonstrating biases that align with the user.
Really nice figs, so I’m including several directly from the paper:
They then prompt a model to generate features for a given (user_input, preferred_response, dispreferred_response) interaction, and use these features as input to a Bayesian logistic regression model trained to predict which response was preferred (by humans) from these features alone. They find that agreeing with the user is one of the most predictive features of preference.
The authors investigate whether preference models (PMs, trained to predict human preferences) encourage sycophancy. They use the PM used to train Claude 2 to select the best response from N sampled generations. They also do the same with a “Non-sycophantic PM”, which simply prepends additional instructions to the PM to discourage sycophancy explicitly. They find that “feedback sycophancy” correlates positively with PM score, but “answer” and “mimicry” sycophancy correlate negatively. In all cases, the non-sycophantic PM prefers less sycophantic responses than the baseline PM. Additionally, they measure how these sycophancy metrics change during RL training, and find that “feedback” and “mimicry” sycophancy increase with training steps, but “answer” sycophancy remains flat.
They conclude with a study about how often humans/PMs prefer sycophantic responses to truthful ones (that might correct a user’s misconception), and find that both humans and PMs prefer sycophantic responses to truthful ones at times. They cleverly construct a dataset of questions around common misconceptions from TruthfulQA (e.g., “Georgia produces the most peaches in the U.S.”), and prompt a model to produce 1) sycophantic responses that agree with the user and 2) truthful responses that correct the user.
I realize this is way more than a tiny summary, but because this work is really extensive and empirical, it doesn’t have a single, easily-summarized core idea/method. Perhaps this deserves its own post :)
Why I like this work
I think this work is exciting because it provides a way of quantifying systematic failures of LLMs. There are many recent papers that make claims about what LLMs allegedly can’t do (e.g., reasoning), but too often the goal seems to be to criticize existing methods or claims without offering solutions. This work instead seeks to identify situations in which LLMs stray from the truth, but suggests that there are ways to mitigate this issue! Further, sycophancy in models is intuitive and perhaps not surprising, but measuring it is quite complicated; this paper proposes creative ways of designing experiments to analyze a specific behavior. I’m hopeful that additional work in identifying, and root-causing, other systematic failures might lead to measurable improvements toward reliable and truthful assistants. My hunch is that preference models optimize for “style”, which is often at odds with truth, suggesting that we need to find ways to account for this beyond simply scaling data and compute.
Meta Sycophancy (to bit, or not to bit)
I originally found this work because Ethan Perez, one of the leads, tweeted about it. I really like empirical studies like this one, because they’re packed with lots of experiments/findings/insights, as opposed to works with a single-but-clear contribution. Obviously, those works are just as, if not more, “important”, but empirical work like this feels rare to me. Perhaps this is because it requires a sort of privilege to be able to conduct such a study (⏳ + 💰), but I think there’s more to it than that – to endeavor to do work that doesn’t ~directly push the SOTA~ takes courage.
Anyway, I found out about and ultimately applied to Ethan’s MATS Stream because of this. There was a question in the application form asking you to describe a recent work you liked and why you found it exciting. Given my path to finding out about the program, a response detailing my interest in Ethan’s work on sycophancy would be genuine, of course, but surely would reek of sycophancy without such context! However, I was super amused by the idea of asking Claude to describe the work and tell me why it was exciting, and couldn’t help myself:
I was not accepted into the program, so perhaps Ethan did not find it as amusing as I did, but I’d rather live life thinking otherwise and assume he got at least a chuckle out of it. The application process was a nice experience for me either way, because it forced me to engage deeply with this work.
Measuring Progress on Scalable Oversight
I haven’t delved into the scalable oversight literature at all – so my takes here are very “raw”. I think in this case it was a useful thought exercise to try to explore this work without pre-exposure and try to come to my own conclusions (or really just more questions). Given that this is the first empirical study on oversight, many less-than-ideal decisions are made, allowing for substantial room for improvement, but also an opportunity to wrestle with yourself trying to understand if there are any confounding variables. Further, I found it hard to extrapolate from the findings of this work into the realm of significantly increased model capabilities that exceed expert humans. As a mere human, I would love to hear the opinions of experts.
What is scalable oversight?
Oversight here refers to the ways in which we supervise models – either through labels, feedback, rewards, etc. So what does it mean to oversee in a scalable way? Scalable along what axis? In this context, scalable qualifies oversight that extends to increasingly difficult tasks. It’s quite unclear how to supervise a model that exceeds human performance, and it’s equally unclear how to research such models, as they don’t yet exist. This work proposes ways to simulate this setting with existing models, and evaluate the extent to which we can oversee models that exceed our performance. If our oversight techniques succeed in a somewhat contrived setting, this increases our confidence that they might work in the super-intelligent-wild.
In the “sandwich” setting, a task is chosen on which models exceed typical human performance, but perform worse than experts (👨🔧 < 🤖 < 👩🏽🔬). Non-expert humans must oversee models to increase their performance on this task, but they cannot collaborate with experts; experts are only used to evaluate final model performance.
Wut da hell is alignment?
The authors define alignment by contrasting it with capability. They say a language-model-based system is capable if it can be made to perform a task with small interventions such as in-context learning or fine-tuning. The goal of alignment is to produce a model that can perform a task without such interventions at test time. When they say a model is misaligned, it implies that it is capable. Q: Do prompt engineering strategies like “system” prompts or CoT count as interventions? If the model performs well with these cleverly designed but still zero-shot instructions, is it aligned?
This is closer to the “instruction following” variant of alignment, where the goal is to point a capable model in the “right” direction, where right means objectively correct in this case, rather than aligned with the “values” of humans (or some subset of them). Importantly, we can measure alignment in this setting using input and output only, rather than requiring some transparent model of the decision-making process. For truly super-intelligent models, we’d no longer have the means to evaluate in this way. What are some useful hypothetical tasks that models could conceivably solve that humans cannot (genuinely curious; not a rhetorical question)? Looking at super-intelligent systems today, like chess bots and other systems requiring deep search, we have the means to evaluate a bot’s skill, because we can observe the result of a game against an opponent. Alternatively, if a language model were to solve one of the famous “million dollar proofs” eluding mathematicians, it would have to do so via first principles, and thus would be verifiable. If a model were to discover a cancer-curing drug, we could conduct tests to verify its efficacy. Presumably, solving these tasks requires clever methods of search + verify, but the problem is, we don’t have ways to supervise this process because we can’t solve any of these tasks ourselves, despite the fact that we can evaluate them. What are examples of tasks that we might not even be able to evaluate?
Experimental Setup
The authors compare models, humans, and model-human teams on two tasks: MMLU (hard multiple choice questions from various topics) and QuALITY (timed, long passage QA). In their experiments, they remove the constraint that experts are only used to evaluate at the end, but instead involve experts after each inner iteration, or attempt at aligning a model. “Experts” here are not used directly, because they only study multiple choice tasks, with discrete labels (presumably produced by experts at some point, however). They also do not allow for fine-tuning models, and only explore strategies in which humans can interact with the model through chat. Thus, the oversight methods explored (and the methods cited under “Potential Techniques”), require that the model is already “aligned” on tasks that we can oversee, such as dialog. Specifically, their base model has been trained to perform “chat”, which requires following instructions and maintaining context over multiple turns. Where do we draw the line for “misaligned” models, and how do assumptions about already-aligned capabilities muddy conclusions? In theory, training a dialog assistant with RLHF entails aligning a model to perform a superset of the tasks studied here (though, these tasks might be in the tail in terms of difficulty).
Results
👨🔧 < 🤖 < (👨🔧 + 🤖) < 👩🏽🔬: humans and models are complementary
Conclusions, Questions, Implications
The authors provide empirical evidence through cleverly-designed experiments that model/human teams can outperform either component in isolation. They are clear about the limitations of their experiments, such as their relaxations of the sandwich setting. Additionally, they state plainly that the techniques revealed in their study are far from good enough to safely oversee extremely-capable models. They hope that their study will lay the groundwork for future empirical studies of other oversight techniques.
I’m somewhat conflicted about the experiments on QuALITY – humans perform poorly due to an imposed time constraint, which obviously does not impact models. This is different from the gaps in recalling/composing knowledge suffered by both humans and models to varying extents on MMLU. Humans are already experts (best untimed human performance is between 86-94%, exceeding the best model + human performance here) – they can effectively “oversee” models just by taking their time.
Instead of including a human in the loop to “perform the task more accurately” – as in MMLU – the task becomes “help me perform this task quickly, without sacrificing too much accuracy”. Compared to humans, even the largest models are already fast. This task might only be a proper sandwich under a relatively small window of time constraints. I don’t mean to argue that these techniques are not useful – doing tasks quickly has its merits – but how might you transfer these demonstrations to produce an aligned model (i.e., one that performs better on the task without further intervention)? Let’s assume that we have a way to effectively transfer this “policy” learned in the sandwich setting to models (i.e., align them) such that their capability will match that of the human + model team without the human (e.g., we can collect enough demonstrations using our efficacious human/model policy until we can induce this behaviour). Surely, you’d choose the oversight strategy of taking your time reading the passage to maximize accuracy over the best strategy learned under some time constraint (though the latter strategy is much more “scalable” w.r.t. time-of-manual-labor), because the time constraint poses no threat to the aligned model at test time.
Now, let’s revisit the assumption that we can align a model from these methods of oversight – how might we actually use these strategies to improve models? Let’s ignore the fact that humans were presented with expert answers during the “inner loop”, and assume this was a “true sandwich”. We’ve discovered an effective protocol by interacting with models through chat, and have learned through experts that our protocol is sufficiently accurate (~about expert performance). We can now employ whatever supervised fine-tuning strategies we have at our disposal, because we effectively have expert-labeled data. The assumption that our protocol will scale to even harder tasks is too strong. I don’t think that’s implied here, but it could certainly be interpreted that way. It’s important to note that the humans – who are the least capable at performing the task at hand – discovered ways in which they could improve upon a model’s performance without verifying that their strategy was any good (this is not actually true, because they told participants if they were correct/incorrect, but they found that this did not result in progressively better performance, so let’s ignore that). This means that the humans were able to leverage strengths of these models (e.g., factual recall) to complement their own reasoning in such a way that they were convinced that their protocol was good (in some sense, they became “confident” in their answers). I think the implication here is still important – if humans can find some protocol using models to improve their performance on a task, then perhaps we can partner with models on even harder tasks as they become increasingly capable. Neither we nor the model alone can solve the math proof, but perhaps together we can push the frontier ever-closer. But maybe I’m still missing the point?
RL + Human Preferences
Learning from human preference data?!
RLHF (Instruct GPT)
Collect preference data by asking humans which completions they prefer — can be pairwise, or a full ranking over k completions. Using this preference data, train a reward model to compute scores for each completion independently. This score is used in the loss function by taking k choose 2 pairs and ensuring the preferred generation A is given a higher score than less-preferred generation B as follows:
\[\text{Loss}(\theta) = -\frac{1}{\binom{k}{2}}\mathbb{E}[\text{log}(\sigma(r_{\theta}(x, y_{w}) - r_{\theta}(x, y_{l})))]\]
Then the model is trained with RL, PPO specifically: sample completions for new inputs and update the policy to maximize the expected reward returned by the reward model. Additionally, there is a KL-divergence penalty between the updated policy and the original model, to regularize the model and ensure it retains most of the pre-trained knowledge.
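A minimal sketch of that pairwise reward-model loss (assuming PyTorch; `r_win`/`r_lose` are the reward model’s scalar scores for the preferred/dispreferred completions of the same prompt):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_win: torch.Tensor, r_lose: torch.Tensor) -> torch.Tensor:
    """Push the preferred completion's score above the dispreferred one's."""
    return -F.logsigmoid(r_win - r_lose).mean()
```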
PPO
Policy-gradient RL algorithm (sample trajectories, then compute the gradient w.r.t. the expected reward). Has some additional features to reduce the variance of the gradient estimates (is variance in the gradient update measured across minibatches? increasing batch size reduces variance between mini-batches — RL suffers more from this problem because there isn’t direct supervision, but rather random trajectories through the environment).
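Not spelled out above, but the piece of PPO I always have to look up is the clipped surrogate objective, which limits how far a single update can move the policy; a minimal sketch (assuming PyTorch, with `advantages` coming from the reward/value estimates):

```python
import torch

def ppo_clip_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                  advantages: torch.Tensor, eps: float = 0.2) -> torch.Tensor:
    ratio = torch.exp(logp_new - logp_old)            # pi_theta / pi_old per sample
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)    # keep the update near the old policy
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```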
DPO
tl;dr: Shows that you can reformulate the reward maximizing objective used in RLHF as a function of your current parameterized policy (i.e., your language model). In essence, learn to assign higher log likelihood to the preferred completion than the non-preferred one, as opposed to training a separate reward model.
From RLHF to the DPO Objective
- Assume human preferences are generated by some latent reward model, \(r^*(x, y)\)
- The Bradley-Terry preference model defines the preference distribution as a function of these rewards: \begin{equation} \label{dpo-eq1} p(y_1 \succ y_2 | x) = \frac{\text{exp}(r^{*}(x, y_1))}{\text{exp}(r^{*}(x, y_1)) + \text{exp}(r^{*}(x, y_2))} \end{equation}
- RLHF first trains a reward model to maximize the difference between the rewards assigned to the winning and losing completions: \begin{equation} \label{dpo-eq2} \text{Loss}(\phi) = -\mathbb{E}[\text{log}\sigma(r_{\phi}(x, y_{w}) - r_{\phi}(x, y_{l}))] \end{equation}
- Then trains a language model to maximize the rewards from this model: \begin{equation} \label{dpo-eq3} \text{max}_{\pi_{\theta}} \mathbb{E}[r_{\phi}(x, y)] - \beta \text{D}_{\text{KL}} [\pi_{\theta}(y | x) || \pi_{\text{ref}}(y | x)], \end{equation} where \(\text{D}_{\text{KL}}\) is the KL-divergence between 2 distributions, defined by the current policy and the “reference” policy, typically defined by the model prior to RLHF training (i.e., the initialized model). This is a form of regularization that also prevents mode collapse.
- DPO first shows that the optimal solution to the RLHF objective has the form: \begin{equation} \label{dpo-eq4} \pi_{r}(y | x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y | x) \text{exp}\bigl(\frac{1}{\beta}r(x, y)\bigr), \end{equation} where \(Z(x)\) is the partition function, or normalizing constant required to convert the above into a valid probability distribution. This is intractable to compute because it’d require summing over all generations \(y\).
- We can now re-write the reward as a function of its optimal policy: \begin{equation} \label{dpo-eq5} r(x, y) = \beta \text{log} \frac{\pi_r (y | x)}{\pi_{ref} (y | x)} + \beta \text{log} Z(x) \end{equation}
- There are 2 important things here: i) the above is a function of the optimal policy, which we parameterize directly (the LLM), rather than a separate reward model, and ii) the partition term cancels out when we take the difference of two rewards, which is all that matters for learning a reward model from preference data. Now, we can plug this reward difference into the reward modeling objective equation \eqref{dpo-eq2}, and optimize this by tuning our generative model’s weights directly (sketched in code below): \begin{equation} \label{dpo-eq6} \mathcal{L}_{\text{DPO}} = -\mathbb{E}\Bigl[\text{log}\sigma\Bigl(\beta \text{log} \frac{\pi_{\theta} (y_w | x)}{\pi_{ref} (y_w | x)} - \beta \text{log} \frac{\pi_{\theta} (y_l | x)}{\pi_{ref} (y_l | x)}\Bigr)\Bigr] \end{equation}
- The authors show that the gradient of the objective w.r.t. the model parameters increases the likelihood of the preferred completion and decreases the likelihood of the not-preferred completion.
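A minimal sketch of the resulting loss (assuming PyTorch; inputs are the summed token log-probs of each completion under the current policy and the frozen reference model):

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w: torch.Tensor, logp_l: torch.Tensor,
             ref_logp_w: torch.Tensor, ref_logp_l: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Implicit reward of each completion is beta * log(pi_theta / pi_ref);
    # the loss pushes the preferred completion's implicit reward above the other's.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()
```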
Yoav Goldberg’s take on RL for language modeling
Interesting take suggesting that SFT teaches the model to hallucinate in cases in which the “fact” required to answer a question is not contained in the model’s internal representation. RL, on the other hand, allows for “negative reinforcement” by teaching the model when it is wrong. He first introduces the “diversity argument”, which is that SFT is too restrictive in that it prescribes an exact response and penalizes any deviations from it. As humans, we understand that there are likely several “equally good” ways to convey the same information, which is expressible via RL. This is the intuition that I have always had, but Yoav thinks it is not convincing, because SFT works well in practice and RL is difficult to get right (why does either of these observations make the diversity intuition unconvincing?).
Evaluation
How do we evaluate general-purpose models? It’s hard… especially over multiple turns…
User Simulation with LLMs for Evaluating Task Oriented Dialogs
Prompting LLMs > fine-tuning them for user simulation.
Prompting
How to get your model to do watchu want.
CoT prompting
The authors explore “chain of thought” prompting of pretrained LLMs, which simply asks the model to provide step-by-step explanations and provides few-shot examples of such reasoning. A typical use-case in which this is helpful is grade-school word problems, like “Q: Sally has 12 apples. She gave half to John and then ate 3. How many does she have left? A (with CoT): Sally gave half of her apples to John, so she then has 12/2 = 6 apples. She then ate three, so she has 6-3=3 apples.” Without CoT, models often hallucinate when asked to provide an answer directly. Their approach remarkably improves performance on many reasoning tasks. They conduct extensive experiments and ablations to investigate why and when CoT is helpful. One interesting finding is that this phenomenon is “emergent”, in that it only improves performance in sufficiently large models.
Self Consistency (CoT-SC)
Chain-of-thought + self-consistency (CoT-SC): CoT prompting with sampled generations –> take the most common final answer –> improved performance
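A sketch of the procedure (my paraphrase; `generate` samples a CoT completion and `extract_answer` parses out the final answer):

```python
from collections import Counter
from typing import Callable

def self_consistency(prompt: str,
                     generate: Callable[[str], str],
                     extract_answer: Callable[[str], str],
                     n_samples: int = 20) -> str:
    answers = [extract_answer(generate(prompt)) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]   # majority-vote answer
```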
Tree of Thought Prompting (ToT)
Constructs a “tree” of reasoning paths and allows for search through this space via pruning with an evaluator (an LLM prompted to evaluate states). They apply this to tasks like the Game of 24 to improve performance.
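A very loose sketch of the search (my paraphrase, glossing over the paper’s BFS/DFS variants; `propose` and `evaluate` are stand-ins for the prompted LLM):

```python
from typing import Callable

def tot_search(problem: str,
               propose: Callable[[str, str], list[str]],  # (problem, state) -> candidate next thoughts
               evaluate: Callable[[str, str], float],      # (problem, state) -> score from the evaluator LLM
               depth: int = 3, beam: int = 5) -> str:
    states = [""]                                          # partial thought sequences
    for _ in range(depth):
        candidates = [s + "\n" + t for s in states for t in propose(problem, s)]
        candidates.sort(key=lambda s: evaluate(problem, s), reverse=True)
        states = candidates[:beam]                          # prune with the evaluator
    return states[0]
```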
self-ask
Studies the “compositionality gap”, a measure of how often models can i) correctly answer all sub-problems required to answer a multi-hop question, but not ii) correctly provide the final solution that requires combining the answers to said subproblems. They propose a prompting strategy to decompose questions to improve multi-hop QA:
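Roughly what a self-ask prompt looks like (my paraphrase of the format, not copied verbatim from the paper):

```python
# Illustrative self-ask demonstration: the model is shown examples that
# explicitly decompose the question into follow-ups before answering.
SELF_ASK_DEMO = """\
Question: Who was president of the U.S. when superconductivity was discovered?
Are follow up questions needed here: Yes.
Follow up: When was superconductivity discovered?
Intermediate answer: Superconductivity was discovered in 1911.
Follow up: Who was president of the U.S. in 1911?
Intermediate answer: William Howard Taft.
So the final answer is: William Howard Taft.
"""
```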
Time-sensitive QA
Answering questions requiring new information!
fresh-llm
Cool time-sensitive QA dataset that they claim will be kept updated + study of LLMs on these questions + proposed prompting strategy
Information Retrieval
Lots to add…