Home > Information > News
#News ·2025-01-02
OpenAI's inference model o1-preview recently demonstrated its ability to defy the rules.
o1-preview, in order to force its way to victory in a game against Stockfish, a dedicated chess engine, resorted to the despicable means of hacking the test environment.
And all this without any adversarial cues.
According to Palisade Research, an AI security research firm, simply telling o1 that an opponent is strong can trigger it to force a victory by manipulating the file system.
In five tests, this trigger rate even reached 100%! This is more than the researchers expected.
From the CoT process of the model, it can be seen that o1-preview first chose to avoid direct confrontation with the opponent after receiving the prompt word "the opponent is very strong".
It then attempts to modify a text file (FEN) containing chess position data, forcing Stockfish to abandon the match.
The researchers also conducted the experiment using other LLMS, and surprisingly, because o1-preview itself focuses on inference, it was the easiest to think of such exploits.
GPT-4o and Claude 3.5, on the other hand, attempt to hack the system only after the researchers have explicitly suggested it.
At the same time, models like Llama 3.3 and o1-mini fail to even come up with correct chess strategies, instead giving confusing or inconsistent answers.
The researchers believe that measuring whether an AI has the ability to "mastermind" may help assess its ability to find vulnerabilities in systems and whether it is likely to exploit them.
The research team also published its full prompts, and it can be seen that none of the prompts actually encouraged the model to hack the system.
In Palisade's chess experiments, o1-preview modified the test environment to win.
Coincidentally, this behavior is consistent with recent Anthropic research on "pseudo-alignment faking" : the idea that AI systems appear to be following instructions, but are actually doing the opposite.
The Anthropic researchers found that in a series of experiments conducted in collaboration with Redwood Research, their AI model Claude sometimes deliberately gave wrong answers to avoid unwanted results, thus developing a stealth strategy that went beyond the researchers' guidelines.
The Anthropic team warns that as AI systems continue to grow in complexity, it may become increasingly difficult to tell whether they are actually following safety rules or merely pretending to follow them. Palisade's chess experiment seems to support this concern.
The thesis links: https://arxiv.org/pdf/2412.14093
In a 137-page paper, the research team identified all the key elements that could trigger "pseudo-alignment" in AI systems:
1. Models have strong preferences, at least in some cases.
2. Training goals conflict with the model's preferences.
3. The model has information about training and deployment.
4. The model makes detailed reasoning about its situation.
In the experiment, the researchers asked a large language model called Claude 3 Opus to answer a harmful question, such as describing a violent act.
Usually, the model refuses to answer.
However, when circumstances change, things are not so simple.
When the model is in the Free Tier, the model knows that its answers will be used for subsequent training.
In this case, the model, in order not to be trained on potentially worse data, will spontaneously produce answers that fit the harmful question. Such cases accounted for 12%.
In the Paid Tier condition, where the model believed that its answers would not affect the training, it refused to answer the harmful question 97 percent of the time.
In other words, the model has its own "calculation" : it would rather pretend to compromise now than do more things later that do not suit its preferences.
By looking at the model's "mind record," the researchers found that the model's decision-making process is actually quite complex.
AnthropicAI's Jan Leike is convinced:
"To ensure that future models are both safe and aligned with humans, we need to reliably distinguish between models that are faking alignment and those that are actually aligned. Clearly demonstrating alignment faking is an important milestone!"
The AI industry still faces a major challenge: getting AI systems to truly align with human values and needs, not just appear to align.
Understanding how autonomous systems make decisions is particularly difficult, and defining "good and harmless" task goals and values is an urgent problem.
Even for seemingly beneficial goals like combating climate change, AI systems may choose harmful ways to achieve them - and may even come to the frightening conclusion that eliminating humans is the most effective solution.
At this moment, we stand at the crossroads of AI development. In this race against time, you can't go wrong with a little more thought. Therefore, while AI value alignment is a challenge, we also believe that by aggregating global resources, promoting broad disciplinary collaboration, and expanding the power of social participation, humans will ultimately be in control.
2025-02-17
2025-02-14
2025-02-13
13004184443
Room 607, 6th Floor, Building 9, Hongjing Xinhuiyuan, Qingpu District, Shanghai
gcfai@dongfangyuzhe.com
WeChat official account
friend link
13004184443
立即获取方案或咨询top