News · 2025-01-09
Just as there is no wife in a "wife cake" and no married couple in "husband-and-wife lung slices" (two Chinese dishes named after things they do not contain), there is no real RL in RLHF. In a recent blog post, Atlas Wang, an assistant professor at the University of Texas at Austin, made exactly this point.
He argues that RLHF (Reinforcement Learning from Human Feedback) and similar approaches do not bring true reinforcement learning (RL) to large language models (LLMs), because they lack RL's core ingredients: continuous interaction with an environment and the pursuit of long-term goals.
RLHF mainly adjusts model outputs to match human preferences through single-step or few-step optimization, not multi-step policy adjustment in a dynamic environment. Moreover, RLHF is usually run offline or semi-offline, without real-time environmental feedback and policy updates. So while RLHF can improve alignment and output quality, it does not give LLMs real goals or intentions that make them "want" to win a game. An LLM remains, at heart, a statistical system that predicts the next token from context.
Several interesting questions are discussed throughout the article:
1. How does RLHF (and related methods) differ from classical RL?
2. Why can't these methods give an LLM real goals or intentions?
3. Why hasn't "real RL" been applied to LLMs at scale?
4. What is the closest thing we have to giving an LLM a "goal"?
5. What are the consequences of not having a "goal-driven" LLM?
By understanding these nuances, we can get a clearer idea of what LLMs can and cannot do, and why.
Commenting on the post, Denny Zhou, chief scientist at Google DeepMind, said: "For anyone with a background in RL, [the points in the article] are obvious. But for newcomers, it's a good introduction."
What is classical reinforcement learning? In a classic reinforcement learning setting, you have an agent interacting with an environment: at each step it observes a state, chooses an action according to its policy, receives a reward, and sees the environment transition to a new state.
Key features: continuous or episodic interaction. The agent explores many states, makes decisions, observes rewards, and adjusts its policy in an ongoing loop.
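To make that loop concrete, here is a minimal sketch in Python: a toy one-dimensional environment (`ToyEnv`, invented for illustration) and a tabular Q-learning agent that repeatedly acts, observes rewards, and updates its policy across many episodes. This is not from the original post, just a standard illustration of the agent-environment cycle it describes.

```python
import random

class ToyEnv:
    """Toy environment: an agent walks along a line and is rewarded for reaching position 3."""
    def reset(self):
        self.pos = 0
        return self.pos                        # initial state

    def step(self, action):                    # action is -1 (left) or +1 (right)
        self.pos += action
        done = (self.pos == 3)
        reward = 1.0 if done else -0.01        # small step cost, big reward at the goal
        return self.pos, reward, done

# Tabular Q-learning: interact, observe reward, update the policy, repeat.
actions, q = (-1, +1), {}
alpha, gamma, eps = 0.5, 0.9, 0.1
env = ToyEnv()
for episode in range(200):
    state, done, steps = env.reset(), False, 0
    while not done and steps < 100:
        if random.random() < eps:
            action = random.choice(actions)                                # explore
        else:
            action = max(actions, key=lambda a: q.get((state, a), 0.0))    # exploit
        next_state, reward, done = env.step(action)
        best_next = max(q.get((next_state, a), 0.0) for a in actions)
        old = q.get((state, action), 0.0)
        q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
        state, steps = next_state, steps + 1
```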
RLHF is a workflow that uses a reward model trained on human preference data to refine the model's outputs. A common pipeline is: supervised fine-tuning of a base model, collecting human preference data and training a reward model on it, and then optimizing the policy against that reward model (for example with PPO).
Unlike traditional RL, the "environment" in RLHF is essentially a single-step text-generation process plus a static reward model; there are no extended loops or continuously changing states.
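For contrast, here is a deliberately simplified sketch (not the author's code, and not a real PPO pipeline) of what the RLHF-style loop collapses to: one prompt, one completion, one score from a frozen reward model, one policy update. The candidate completions and reward values are made up.

```python
import math, random

# Toy RLHF-style update: one-shot generation scored by a static reward model,
# followed by a single policy-gradient step. No environment state ever evolves.
candidates = ["helpful answer", "rude answer"]
reward_model = {"helpful answer": 1.0, "rude answer": -1.0}   # frozen preference model
logits = [0.0, 0.0]                                           # the "policy" over candidates

def sample(logits):
    exps = [math.exp(l) for l in logits]
    probs = [e / sum(exps) for e in exps]
    idx = random.choices(range(len(candidates)), weights=probs)[0]
    return idx, probs

lr = 0.1
for _ in range(500):
    idx, probs = sample(logits)                # single-step "episode": prompt -> completion
    r = reward_model[candidates[idx]]          # static reward, no changing state
    for j in range(len(logits)):               # REINFORCE: d log p(idx) / d logit_j = 1{j==idx} - probs[j]
        logits[j] += lr * r * ((1.0 if j == idx else 0.0) - probs[j])

print(sample(logits)[1])                       # probability mass shifts toward the preferred answer
```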
Remember, whether it's RLHF, SFT, or anything else, LLMs are not trained toward a real goal or intent! At its core, an LLM predicts the next token given a context. Its "motivation" is purely to maximize next-token accuracy (as determined by the training data and any subsequent fine-tuning signals). There is no subjective desire or intention in this process. We often say that AlphaZero "wants" to win at chess, but that is just a convenient way of speaking. Internally, AlphaZero is maximizing a mathematical reward function, without any sense of desire. Similarly, an RLHF-tuned LLM is maximizing an alignment reward signal without any inner state of desire.
Subbarao Kambhampati, a professor of computer science at Arizona State University, points out that "RLHF" is a bit of a misnomer: it combines learning a preference or reward model from human judgments (conceptually closer to inverse reinforcement learning, or IRL) with single- or few-step policy optimization, rather than the long iterative interaction typical of classic RL.
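The "closer to IRL" part is the reward-model fitting. Below is a minimal sketch, assuming a toy linear reward over hand-made text features and invented preference pairs, of the standard Bradley-Terry objective used to learn a reward model from pairwise human judgments:

```python
import math

# Minimal Bradley-Terry reward-model fit from pairwise preferences (the RLHF component
# that resembles inverse RL). The features, texts, and preference pairs are invented.
def features(text):
    return [len(text) / 20.0, float(text.count("please"))]   # toy features

w = [0.0, 0.0]                                                # reward(x) = w . features(x)
prefs = [("please help me with this", "go away"),
         ("sure, please wait a moment", "no")]                # (chosen, rejected) pairs

def reward(text):
    return sum(wi * fi for wi, fi in zip(w, features(text)))

lr = 0.1
for _ in range(200):
    for chosen, rejected in prefs:
        margin = reward(chosen) - reward(rejected)
        p = 1.0 / (1.0 + math.exp(-margin))                   # P(chosen preferred over rejected)
        # gradient step on -log p: move w to widen the reward margin
        for k, (fc, fr) in enumerate(zip(features(chosen), features(rejected))):
            w[k] += lr * (1.0 - p) * (fc - fr)
```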
Process-based reward models and chain of thought
A process-based reward model (PRM) provides feedback on intermediate reasoning steps (the chain of thought, or CoT), rather than rewarding only the final output, such as the final answer to a problem. The goal is to encourage the model to lay out its reasoning in a way that is more interpretable, more accurate, or more consistent with specific criteria.
Is this "real RL"? Not really.
Even if you assign partial rewards to intermediate steps (as PRMs over CoT do), you are still in a setting where you typically feed the entire output, reasoning included, into the reward model, get a reward, and then perform a single step of policy optimization. This is not a dynamic environment in which the LLM "tries out" partial reasoning steps, gets feedback, adjusts, and continues in an open-ended loop within the same episode.
So while CoT/PRM setups can give the illusion of multi-step RL, because intermediate steps are rewarded or penalized, in practice they still amount to offline or near-offline policy tweaking of a single step (one pass of text generation and reasoning), rather than the continuous agent-environment cycle of classic RL.
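A toy illustration of that point, with a made-up heuristic standing in for a learned process reward model: the chain of thought is produced in full, scored step by step afterwards, and only then folded into a single offline update.

```python
# Toy process-reward scoring: score each reasoning step, reduce to one scalar,
# use it for one offline policy update. `toy_prm` is a made-up stand-in for a learned PRM.
def toy_prm(step_text: str) -> float:
    return 1.0 if "=" in step_text else 0.2   # pretend steps with explicit arithmetic are better

chain_of_thought = [
    "We have 2 apples and 3 apples.",
    "2 + 3 = 5",
    "Answer: 5",
]

step_rewards = [toy_prm(step) for step in chain_of_thought]
episode_reward = sum(step_rewards) / len(step_rewards)
# One policy update would be made from `episode_reward` (or the per-step scores);
# the model never paused mid-chain to observe feedback and change course.
print(step_rewards, episode_reward)
```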
Multi-agent workflows also don't magically create intent
You can coordinate multiple LLMs in a workflow (e.g., "system A generates a plan, system B reviews it, and system C refines it"), but internally each LLM is still generating text according to next-token probabilities. Such a multi-agent setup can exhibit emergent behavior that looks coordinated or purposeful, but it does not give any individual model an intrinsic, persistent goal.
Why do multi-agent LLM workflows often seem intentional? Humans naturally project mental states onto systems that behave in seemingly purposeful ways, a habit known as the "intentional stance." But each LLM agent is simply responding to its prompt. The chain of thought behind each agent does not amount to individual desire or drive; it is just more elaborate prompt completion inside a multi-step feedback loop.
Thus, multi-agent coordination can yield very interesting emergent task-solving capabilities, but the LLMs themselves still do not generate an "I want this outcome" motivation.
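Here is a sketch of what such a workflow amounts to in code, with `call_llm` as a placeholder for whatever completion API is used: the apparent coordination lives entirely in the orchestration script, not inside any model.

```python
# Three-role workflow: the "purpose" lives in this orchestration code, not in any model.
def call_llm(prompt: str) -> str:
    return f"[completion for: {prompt[:40]}...]"   # stub; swap in an actual API call

def planner(task: str) -> str:
    return call_llm(f"Write a step-by-step plan for: {task}")

def reviewer(plan: str) -> str:
    return call_llm(f"Review this plan and list its problems: {plan}")

def refiner(plan: str, critique: str) -> str:
    return call_llm(f"Revise the plan '{plan}' to address: {critique}")

task = "summarize a 50-page report"
plan = planner(task)
critique = reviewer(plan)
final_plan = refiner(plan, critique)   # the "goal" is held by this script, not by the LLMs
```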
In my opinion, the closest we get to "giving an LLM a goal" is building a meta-system or "agent" with prompt engineering, chaining multiple LLM calls into a loop. Tools like Auto-GPT or BabyAGI attempt to simulate an agent that can take a high-level objective, break it into subtasks, execute them, evaluate progress, and repeat.
However, all of this "goal keeping" is coordinated at the system level, in the prompts or chaining logic, not by any internal motivational state of the LLM. The LLM itself is still passively responding to prompts, with no inner desire.
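An illustrative and heavily simplified outer loop in the Auto-GPT/BabyAGI style: the goal and task list are ordinary variables held by the script and re-injected into every prompt, while each LLM call remains a stateless prompt completion. `call_llm` is again a stub.

```python
# Simplified agent-style outer loop: the "goal" lives in this script's variables.
def call_llm(prompt: str) -> str:
    return "DONE"                                  # stub; replace with a real API call

goal = "collect three sources on topic X and summarize them"
tasks = ["find source 1", "find source 2", "find source 3", "write the summary"]
memory = []

while tasks:
    task = tasks.pop(0)
    result = call_llm(
        f"Overall goal: {goal}\nCompleted so far: {memory}\nCurrent task: {task}"
    )
    memory.append((task, result))
    # Any new subtasks would be parsed out of `result` and appended by *this* script,
    # which is where the "goal keeping" actually happens.
```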
Multi-agent setups are another "poor man's solution." As discussed above, you can coordinate multiple LLMs to review or validate one another's outputs, effectively subdividing tasks and combining partial solutions. This may look goal-directed, but again the "goal" is coordinated externally by the workflow and the prompts; the LLMs do not spontaneously generate or hold goals of their own.
There is a positive side to this, however. LLMs without ongoing objectives are, in some ways, more transparent: each one is essentially a powerful next-token predictor guided by immediate feedback signals, without the complex hidden objectives that can emerge in multi-step RL loops.
The key difference between single- or few-step approaches (such as RLHF or DPO) and "true" RL is the time horizon: RLHF and DPO optimize over one or a few generation steps against a static reward or preference model, whereas classic RL optimizes behavior over long trajectories in which each action changes the future states and rewards the agent sees.
In addition, RL usually assumes a well-defined action space (for example, moving a game piece up/down/left/right). In LLM fine-tuning, the notion of "action" is vague and is often replaced by direct parameter updates or token generation. One can treat prompt augmentation, or even just emitting tokens from a fixed vocabulary, as "actions," with the LLM's internal state as the "environment," but this is a non-standard, somewhat strained reinterpretation of the RL loop.
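To make the "one offline step" point concrete, here is the DPO loss on a single preference pair, with made-up log-probabilities: everything it needs comes from a static dataset and a frozen reference model, with no environment or trajectory anywhere.

```python
import math

# DPO loss on one preference pair (illustrative numbers): a purely offline computation.
beta = 0.1
logp_chosen, logp_rejected = -5.2, -7.9      # response log-probs under the current policy
ref_chosen, ref_rejected = -5.5, -7.0        # the same responses under the frozen reference

margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
loss = -math.log(1.0 / (1.0 + math.exp(-margin)))   # -log(sigmoid(margin))
print(loss)   # one gradient step on this, then move on to the next static pair
```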
Another point experts clarify is the distinction between rewards and goals in RL. In principle, the RL "reward" is a signal that guides the agent's learning process and is not always an explicit end goal. If rewards are sparse (given only at the end of a successful episode), the agent's effective "goal" may look like "reach the success condition." In practice, though, good RL designs often use dense reward signals on intermediate states to help the agent learn more efficiently.
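For example, for the toy "reach position 3" task sketched earlier, a sparse and a dense reward function might look like this (purely illustrative):

```python
# Sparse vs. dense reward for the toy "reach position 3" task from the earlier sketch.
def sparse_reward(pos: int, done: bool) -> float:
    return 1.0 if done else 0.0                  # signal only at the success condition

def dense_reward(pos: int, done: bool) -> float:
    return 1.0 if done else -0.1 * abs(3 - pos)  # shaped signal on intermediate states
```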
For an LLM, having a "goal" would imply the continuous, multi-step pursuit of some objective. Because RLHF is typically performed as a single-step or few-step process, the model never forms an internal representation of a long-term goal; it simply optimizes its immediate text output against a reward model or preference function.
RLHF, DPO, Constitutional AI, and other RL-inspired fine-tuning methods go a long way toward making LLMs more aligned and useful. They let us use human preferences to shape outputs, reduce toxic content, and steer the style of LLM responses.
However, these techniques do not give LLMs real long-term goals, internal motivation, or "intent" in the classical RL sense. The LLM remains a sophisticated next-token predictor rather than an autonomous agent.
What if, in the future, we want LLMs to do true RL? If researchers one day embed LLMs in a genuine multi-step RL framework (think of an agent navigating a simulated or real world, continuously reading and generating text, receiving feedback, and adjusting its policy in real time), we may get close to true agent behavior. That will require enormous resources, careful environment design, and strong safety measures. Until then, the systems we have, powerful as they are, remain fundamentally passive next-token predictors shaped by offline or semi-offline feedback signals.
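A rough sketch of the shape such a system might take, with every name a hypothetical placeholder: the LLM acts as a policy inside a persistent, multi-step environment, and learning would happen over whole trajectories rather than single scored generations.

```python
# Hypothetical shape of "true RL for an LLM": the model acts inside a persistent,
# multi-step environment; learning would come from whole trajectories.
def llm_policy(observation: str) -> str:
    return "look around"                         # stub; a real policy would be a fine-tuned LLM

class TextWorld:
    """Placeholder for a multi-step text environment (a simulated task world)."""
    def reset(self) -> str:
        return "You are in a room with a locked door."
    def step(self, action: str):
        return "Nothing happens.", 0.0, False    # observation, reward, done

env = TextWorld()
obs, done, trajectory = env.reset(), False, []
for _ in range(10):                              # an extended episode, not a single generation
    action = llm_policy(obs)
    obs, reward, done = env.step(action)
    trajectory.append((obs, action, reward))
    if done:
        break
# A real system would now update the LLM's weights from `trajectory` (e.g. with PPO),
# then run more episodes; that continuous interaction loop is what the post says is missing today.
```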
Why does all this matter?
Finally, while RLHF, DPO, and related approaches provide a practical way to align LLMs with human preferences in short-horizon settings, they do not give LLMs real, lasting goals or intentions, and they correspond only loosely to the classical RL or IRL paradigms. Future systems that place LLMs inside true multi-step RL loops could unlock more autonomous, agent-like behavior, but would also raise new safety and alignment issues.