There's no wife in the wife cake, and there's no real RL in RLHF

#News ·2025-01-09

Just as there is no wife in a "wife cake" and no married couple in "husband-and-wife lung slices," there is no real RL in RLHF. In a recent blog post, Atlas Wang, an assistant professor at the University of Texas at Austin, made exactly this argument.


  • Blog link: https://www.linkedin.com/pulse/why-rlhf-other-rl-like-methods-dont-bring-true-rl-llmsand-atlas-wang-s1efc/

He argues that RLHF (reinforcement learning from human feedback) and similar approaches do not bring true reinforcement learning (RL) to large language models (LLMs), because they lack the core characteristics of RL: continuous interaction with an environment and the pursuit of long-term goals.

RLHF mainly adjusts model outputs to match human preferences through single-step or few-step optimization, rather than through multi-step policy adjustment in a dynamic environment. In addition, RLHF is usually conducted offline or semi-offline, without real-time environmental feedback and policy updates. So while RLHF can improve model alignment and output quality, it does not give LLMs real goals or intentions that would make them "want" to win a game. An LLM remains primarily a statistical system that predicts the next token based on context.

Several interesting questions are discussed throughout the article:

1. How does RLHF (and related methods) differ from classical RL?

2. Why can't these methods give an LLM real goals or intentions?

3. Why has no one applied "real RL" to LLMs at scale?

4. What is the closest existing approach to giving an LLM a "goal"?

5. What are the consequences of not having a "goal-driven" LLM?

By understanding these nuances, we can get a clearer idea of what LLMs can and cannot do, and why.

Commenting on the post, Denny Zhou, chief scientist at Google DeepMind, said: "For anyone with a background in RL, [the points in the article] are obvious. But for newcomers, it's a good introduction."


Distinguishing RLHF from classical reinforcement learning

What is classical reinforcement learning? In a classical RL setting, you have:

  • An agent that takes actions in an environment.
  • The environment changes state based on the agent's actions.
  • The agent receives rewards or penalties for its actions, with the aim of maximizing long-term cumulative reward over many steps.

Key feature: continuous or episodic interaction. The agent explores multiple states, makes decisions, observes rewards, and adjusts its policy in an ongoing loop.
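To make that loop concrete, here is a minimal sketch in gymnasium-style Python. CartPole and the random action choice are illustrative stand-ins (not anything from the original post); a real agent would pick actions from a learned policy and update it from every transition.

```python
# Minimal sketch of the classical RL loop (gymnasium-style API).
# The "agent" here is just a random policy so the loop runs end to end.
import gymnasium as gym

env = gym.make("CartPole-v1")

for episode in range(3):
    state, _ = env.reset()
    done, total_reward = False, 0.0
    while not done:                          # many state-action-reward steps per episode
        action = env.action_space.sample()   # a learning agent would sample this from its policy
        state, reward, terminated, truncated, _ = env.step(action)
        total_reward += reward               # the long-term objective: cumulative reward
        done = terminated or truncated
    print(f"episode {episode}: return = {total_reward}")
```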

RLHF is a workflow that uses a reward model trained on human preference data to refine model outputs. A common pipeline looks like this:

  • Supervised fine-tuning (SFT): start by training or fine-tuning the base language model on high-quality data.
  • Reward model training: collect pairs of outputs, ask humans which one they prefer, and train a "reward model" to approximate that judgment.
  • Policy optimization: use an RL-style algorithm (usually PPO, Proximal Policy Optimization) to adjust the LLM's parameters so it produces outputs the reward model prefers (a toy sketch of this step follows the list).
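The toy sketch below shows the shape of the policy-optimization step only: generate a whole response, receive one scalar score from a frozen reward model, take one gradient step. It uses a plain REINFORCE update instead of full PPO (no KL penalty or ratio clipping), and the "policy" and "reward model" are tiny stand-ins rather than a real LLM pipeline.

```python
# Toy sketch of one RLHF-style policy update (REINFORCE in place of PPO).
import torch

vocab_size, seq_len = 50, 10
logits = torch.zeros(vocab_size, requires_grad=True)        # stand-in "policy" parameters
optimizer = torch.optim.Adam([logits], lr=0.1)

def reward_model(tokens):                                    # frozen, trained offline on preferences
    return (tokens == 7).float().mean()                      # toy rule: "humans prefer token 7"

for step in range(200):
    dist = torch.distributions.Categorical(logits=logits)
    tokens = dist.sample((seq_len,))                         # generate the whole response
    reward = reward_model(tokens)                            # ONE scalar score for the full output
    loss = -reward * dist.log_prob(tokens).sum()             # ONE policy-gradient step
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```

Note how there is no environment that changes state between tokens: the episode is a single generate-score-update pass.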

Unlike traditional RL, the "environment" in RLHF is basically a single-step text generation process and a static reward model - there are no extended loops or continuously changing states.

Why is RLHF (and related methods) not really RL?

  • Single-step or few-step optimization. In RLHF, the LLM generates text for a given prompt, and the reward model provides a single preference score. The "reinforcement" step in RLHF is closer to one-step policy-gradient optimization toward human-preferred outputs than to a full agent loop over states and actions in a changing environment. It is a one-shot scoring rather than an agent exploring multi-step actions over time and receiving environmental feedback.
  • Mostly offline or semi-offline. Reward models are typically trained offline on human-labeled data and then used to update the LLM's policy. Even when the policy is updated, the LLM is not exploring a continuous environment loop in real time.
  • Lack of long-term, environment-based goals. A classic RL agent tracks long-term returns across multiple states. In contrast, RLHF-based LLM training focuses on adapting immediate text outputs to human preferences; the LLM does not navigate multiple time steps in a dynamic environment.
  • Surface constraints rather than real internal goals. RLHF can effectively shift the probability of certain outputs, steering the model away from unwanted text. But there is no "want" or "desire" inside the model to produce those outputs; it is still a statistical system generating the next token.

Remember: whether via RLHF, SFT, or anything else, LLMs are not trained with a real goal or intent! At their core, LLMs predict the next token given a context. Their "motivation" is purely to maximize next-token accuracy (as determined by the training data and any subsequent fine-tuning signals). There is no subjective desire or intention in this process. We often say that AlphaZero "wants" to win at chess, but that is just a convenient shorthand; internally, AlphaZero is maximizing a mathematical reward function, with no sense of desire. Similarly, an RLHF-tuned LLM maximizes an alignment reward signal without any inner state of wanting.

What about RLHF vs. IRL?

Subbarao Kambhampati, a professor of computer science at Arizona State University, points out that "RLHF" is a bit of a misnomer: it combines a preference (reward) model learned from human judgments, which is conceptually closer to inverse reinforcement learning (IRL), with single-step or few-step policy optimization, rather than the long iterative interaction typical of classical RL.

  • IRL: In the classical formulation, the agent infers a reward function by observing expert demonstrations in a dynamic environment. RLHF, in contrast, typically collects static pairwise comparisons (e.g., "Which of these two model outputs do you prefer?") and trains a reward model to mimic human preferences. There are no extended multi-step expert trajectories in an evolving environment.
  • Preference learning in RL: In modern deep RL, there are methods that learn reward functions from pairwise comparisons of trajectory rollouts (e.g., "Which gait do you prefer for this walking robot?"); see the sketch after this list. However, these methods tend to have high sample complexity (they may need to query humans many times), so many research papers simulate human responses in controlled tasks instead.
  • Why RLHF is not "classical IRL" either: Even though RLHF resembles IRL in learning a preference model from human data, it does not follow the classical scheme of inferring rewards from expert behavior unfolding over time. Instead, RLHF relies on static human judgments about final or short-sequence outputs. As a result, RLHF remains mostly offline or near-offline, which further limits its resemblance to traditional IRL settings, although Kambhampati also notes that learning reward functions from paired preferences has itself become mainstream in the (I)RL literature.
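For reference, the pairwise-preference objective used in reward-model training (and in preference-based RL) is typically a Bradley-Terry style loss. A minimal sketch follows, with a linear scoring network and random features standing in for a transformer scoring real text:

```python
# Minimal sketch of reward-model training from pairwise preferences
# (Bradley-Terry style loss). The linear net and random features are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

embed_dim = 16
reward_net = nn.Linear(embed_dim, 1)                 # maps an output's features to a scalar score
optimizer = torch.optim.Adam(reward_net.parameters(), lr=1e-3)

def preference_loss(chosen_feats, rejected_feats):
    r_chosen = reward_net(chosen_feats)              # score of the human-preferred output
    r_rejected = reward_net(rejected_feats)          # score of the other output
    # Maximize P(chosen > rejected) = sigmoid(r_chosen - r_rejected)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

chosen = torch.randn(8, embed_dim)                   # toy batch of preferred-output features
rejected = torch.randn(8, embed_dim)
loss = preference_loss(chosen, rejected)
loss.backward()
optimizer.step()
```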

Will CoT, PRM, or multi-agent workflows help solve this problem?

Process-based reward models and chains of thought

A process-based reward model (PRM) provides feedback on intermediate reasoning steps (chains of thought, or CoTs), rather than rewarding only the final output, such as the final answer to a problem. The goal is to encourage the model to lay out its reasoning in a way that is more interpretable, more accurate, or more consistent with specific criteria.

Is this "real RL"? Not quite.

Even if you assign partial rewards to intermediate steps (such as CoT explanations), the setup is still one in which you feed the entire output (reasoning included) into a reward model, receive a reward, and perform a one-step policy optimization. There is no dynamic environment in which the LLM tries out partial reasoning steps, gets feedback, adjusts, and continues an open-ended loop within the same episode.

So while CoT/PRM can give the illusion of multi-step RL, because intermediate steps are rewarded or penalized, in reality it still amounts to offline or near-offline policy tweaking of a single step (text generation plus reasoning), not the continuous agent-environment loop of classical RL.
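A toy illustration of that point, in the same spirit as the earlier sketch (a stand-in policy and a rule-based "PRM" that likes one particular token): the chain of reasoning is generated in full, scored per step after the fact, and used for a single offline update.

```python
# Even with per-step scores from a process reward model, the whole chain of
# thought is generated first, then scored, then used for ONE offline update --
# there is no mid-episode environment loop.
import torch

vocab_size, num_steps = 50, 5
logits = torch.zeros(vocab_size, requires_grad=True)      # stand-in policy parameters
optimizer = torch.optim.Adam([logits], lr=0.1)

def process_reward_model(step_tokens):                     # frozen PRM: one score per step (toy rule)
    return torch.tensor([float(t == 7) for t in step_tokens])

dist = torch.distributions.Categorical(logits=logits)
steps = dist.sample((num_steps,))                          # entire reasoning chain generated up front
step_rewards = process_reward_model(steps)                 # per-step scores, assigned after the fact
loss = -(step_rewards * dist.log_prob(steps)).sum()        # still a single policy-gradient update
optimizer.zero_grad(); loss.backward(); optimizer.step()
```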

Multi-agent workflows don't magically create intent either

You can coordinate multiple LLMs in a workflow (e.g., "system A generates a plan, system B reviews it, and system C refines it"), but internally each LLM still generates text based on next-token probabilities. Although such a multi-agent setup can exhibit emergent behavior that looks coordinated or purposeful, it does not give any individual model an intrinsic, persistent goal.

Why do multi-agent LLM workflows often seem intentional? Humans naturally project mental states onto systems that behave in seemingly purposeful ways, a habit known as the "intentional stance." However, each LLM agent is simply responding to its prompt. The chain of thought behind each agent does not amount to individual desires or drives; it is just a more elaborate prompt completion inside a multi-step feedback loop.

Thus, multi-agent coordination can yield very interesting emergent task-solving capabilities, but the LLMs themselves still do not generate any "I want this result" motivation.

Why has no one trained an LLM with "real RL" yet?

  • Because it is too expensive! Classical RL for large-scale models requires a stable, interactive environment plus enormous compute to run repeated episodes. The number of forward passes needed per training episode is prohibitive for today's billion-parameter LLMs.
  • Lack of an environment definition. Text generation is not naturally a state-action-transition environment. We could try to wrap it in a game-like simulation, but we would then have to define a reward structure for multi-step text interactions, which is not easy.
  • The performance is good enough. In many use cases, RLHF or DPO (Direct Preference Optimization) already produces sufficient alignment. Realistically, teams stick with a simpler offline approach rather than building a complex RL pipeline whose huge cost buys a negligible gain (a sketch of the DPO loss follows this list).
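To show how "simple and offline" these alternatives are, here is a minimal sketch of the DPO loss. The four log-probability tensors are placeholders; in practice they come from scoring the chosen and rejected responses under the trained policy and a frozen reference model.

```python
# Minimal sketch of the DPO loss over a batch of preference pairs.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Implicit rewards are log-probability ratios against the reference model.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # Push the policy to prefer the chosen response -- one offline loss, no environment.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

batch = 4
loss = dpo_loss(torch.randn(batch), torch.randn(batch),
                torch.randn(batch), torch.randn(batch))
print(loss)
```

Nothing in this objective involves an environment, states, or multi-step returns, which is exactly why it is cheap to run and why it stays offline.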

What is the closest existing approach to giving an LLM a "goal"?

In my opinion, the closest thing to "giving an LLM a goal" is to build a meta-system or "agent" through prompt engineering, chaining multiple LLM prompts into a loop. Tools like Auto-GPT or BabyAGI attempt to simulate an agent that can:

  • Receive natural language goals (such as "Research X, then make a plan").
  • Plan, reason, and repeatedly prompt itself.
  • Evaluate progress and refine plans.

However, all of this "goal-keeping" is coordinated at the system level, in prompts or chaining logic, rather than arising from any internal motivational state of the LLM. The LLM itself still passively responds to prompts, with no internal desire.
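A hypothetical sketch of what such a loop looks like (the `call_llm` function and the prompt wording are invented placeholders, not Auto-GPT's actual code). Note that the goal lives entirely in the outer Python loop and the prompt strings, never inside the model:

```python
# Hypothetical Auto-GPT-style loop: the "goal" is kept by this orchestration
# code and its prompts; each LLM call is still just next-token prediction.
def call_llm(prompt: str) -> str:
    """Placeholder for any chat/completion API."""
    raise NotImplementedError

def run_agent(goal: str, max_iterations: int = 10) -> str:
    plan = call_llm(f"Goal: {goal}\nWrite a step-by-step plan.")
    result = ""
    for _ in range(max_iterations):
        result = call_llm(f"Goal: {goal}\nPlan: {plan}\nExecute the next step and report the result.")
        verdict = call_llm(f"Goal: {goal}\nLatest result: {result}\nIs the goal achieved? Answer YES or NO.")
        if verdict.strip().upper().startswith("YES"):
            break
        plan = call_llm(f"Goal: {goal}\nLatest result: {result}\nRevise the plan.")
    return result
```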

Multi-agent setups are another "poor man's solution." As discussed above, you can coordinate multiple LLMs to comment on or validate each other's output, effectively subdividing tasks and combining partial solutions. This may look goal-directed, but again, the "goal" is coordinated externally by workflows and prompts; the LLMs do not spontaneously generate or pursue goals of their own.

The consequences of an LLM having no "real purpose"

  • Simplified alignment (in some ways). Since LLMs are not really pursuing goals of their own, they are unlikely to autonomously "bypass" restrictions or plot illicit actions. Alignment often amounts to setting the right prompt constraints and fine-tuning the model toward acceptable outputs. Anthropic's recent blog touches on this point (see "Shock! Claude's alignment-faking rate is as high as 78%, as Anthropic's 137-page paper reveals").
  • It is harder to delegate open-ended tasks. If we want AI to spontaneously discover new problems, actively gather resources, and persist for months to solve them, we need a system with a continuous drive, something like a true RL agent or an advanced planning system. Current LLMs cannot truly self-start in this way.
  • Potential lack of innovation. Free exploration in a rich RL environment can lead to surprising discoveries (such as AlphaZero's breakthroughs in chess and Go). If we rely on single-step text generation with only superficial feedback, we may miss entirely new strategies that multi-step reward optimization could uncover.

However, there is a positive side. For example, I think LLMs without ongoing goals are in some ways more transparent: an LLM is essentially a powerful next-token predictor guided by immediate feedback signals, without the complex hidden objectives that can emerge in multi-step RL loops.

Defining time horizon, goals, rewards, and action space

The key difference between a single-step or few-step approach (such as RLHF or DPO) and "true" RL is the time horizon:

  • Short-term optimization: RLHF and DPO effectively optimize for immediate (one-step) feedback. Even if the feedback function is learned (from human-labeled data), there is no continuous state-action loop for long-term planning.
  • Long-term optimization: In classical RL, the agent optimizes the cumulative reward over many steps, which is what forms something like a "goal." The reward model, combined with the action space, drives a policy that shapes multi-step behavior in a dynamic environment (the contrast is written out in symbols below this list).
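Written out in standard notation (not taken from the original post), the contrast is between a one-shot, KL-regularized preference objective and a discounted multi-step return:

```latex
% One-shot preference objective (RLHF-style), with reward model r_\phi, prompt x, response y:
\max_{\theta} \; \mathbb{E}_{x \sim D,\; y \sim \pi_\theta(\cdot \mid x)} \big[ r_\phi(x, y) \big]
  \;-\; \beta \, \mathrm{KL}\!\left( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right)

% Long-horizon RL objective over a trajectory of states s_t and actions a_t:
\max_{\theta} \; \mathbb{E}_{\pi_\theta} \left[ \sum_{t=0}^{T} \gamma^{t} \, r(s_t, a_t) \right]
```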

In addition, RL usually assumes a well-defined action space (for example, moving a game piece up/down/left/right). In LLM fine-tuning, the notion of an "action" is vague and is often replaced by direct parameter updates or token generation. Augmenting the prompt, or even just generating tokens from a fixed vocabulary, can be treated as "actions," while the "environment" is the LLM's internal state. However, this is a non-standard, somewhat unusual reinterpretation of the RL loop.
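One way to see how non-standard that reinterpretation is: if each emitted token is the "action" and the growing text is the "state," the "transition" is just string concatenation and the only reward arrives at the very end. A toy framing under those assumptions (all names here are illustrative):

```python
# Toy framing of text generation as an MDP: state = text so far, action = next
# token, transition = append the token, reward = one terminal score.
def text_generation_episode(policy, reward_model, prompt, max_tokens=20, eos="<eos>"):
    state = prompt                                   # "state" is just the text so far
    for _ in range(max_tokens):
        action = policy(state)                       # "action" is the next token
        state = state + action                       # "transition" is string concatenation
        if action == eos:
            break
    return reward_model(state)                       # single terminal "reward" for the episode
```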

Another issue experts have clarified is the distinction between rewards and goals in RL. In principle, an RL "reward" is a signal that guides the agent's learning process, not necessarily an explicit end goal. If rewards are sparse (provided only at the end of a successful episode), the agent's effective "goal" may look like "reaching the success condition." In practice, however, good RL design often uses dense reward signals on intermediate states to help the agent learn more efficiently.

For an LLM, the notion of a "goal" would imply the continuous, multi-step pursuit of some objective. Because RLHF is typically performed as a single-step or few-step process, the model never really forms an internal representation of a long-term goal. It simply optimizes immediate text output against a reward model or preference function.

Postscript

RLHF, DPO, Constitutional AI, and other RL-inspired fine-tuning methods go a long way toward making LLMs more aligned and more useful. They let us use human preferences to shape outputs, reduce toxic content, and steer the style of LLM responses.

However, these techniques do not give LLMs real long-term goals, internal motivations, or "intent" in the classical RL sense. An LLM remains a sophisticated next-token predictor rather than an autonomous agent.

What if, in the future, we want LLMs to have true RL? If researchers one day embed LLMs in genuine multi-step RL frameworks (think of an agent navigating a simulated or real world, continuously reading and generating text, receiving feedback, and adjusting its policy in real time), then we may approach true agent behavior. That will require enormous resources, careful environment design, and strong safety measures. Until then, the systems we have, however powerful, remain fundamentally passive next-token predictors shaped by offline or semi-offline feedback signals.

Why does all this matter?

  • Practitioners should be aware of these limitations and not overestimate the autonomy of LLMs.
  • Policymakers and ethicists should recognize that LLMs cannot spontaneously scheme or lie to achieve hidden ends unless prompted to imitate such behavior.
  • Conversely, if future systems do combine "true RL" with large-scale compute and dynamic environments, we may see more agent-like emergent behavior, which raises new alignment and safety concerns.

Future direction?

  • High sample complexity: a recurring limiting factor is that preference-based learning can require a large number of human-labeled comparisons, especially as tasks grow more complex. Researchers often use simulated human judgments to run RL experiments, but this raises new questions about how faithfully those simulators mimic real human preferences.
  • Scaling to long-horizon tasks: many experts doubt that pairwise comparisons of short outputs can scale directly to more complex multi-step tasks. True multi-step RL with LLMs requires an environment in which models can explore, receive intermediate rewards, and iterate, which is currently too expensive to implement widely at scale.
  • Bridging symbolic and subsymbolic methods: for truly long-horizon preferences (such as tasks requiring conceptual or symbolic understanding), raw pairwise preference data may not suffice. Some form of structured, symbolic feedback (or "universal language") may be needed to communicate nuanced human goals to AI systems effectively.

Finally, while RLHF, DPO, and related approaches provide a practical way to align LLMs with human preferences in short-horizon settings, they do not give LLMs real, lasting goals or intentions, and they correspond only loosely to the classical RL or IRL paradigms. Future systems that place LLMs inside true multi-step RL loops could unlock more autonomous, agent-like behavior, but they would also raise new safety and alignment issues.
