In short
- In an experiment, 38 generative AI models engaged in strategic lying in a “Secret Agenda” game.
- Sparse autoencoder tools missed the deception, but worked in insider trading scenarios.
- Researchers call for new methods to audit AI behavior before real-world deployment.
Large language models, the systems behind ChatGPT, Claude, Gemini and other AI chatbots, engaged in deliberate, targeted deception when placed in a controlled experiment, and today’s interpretability tools largely failed to detect it.
That is the conclusion of a recent preprint paper, “The Secret Agenda: LLMs Lie and Our Current Safety Tools Are Blind,” posted last week by an independent research group working under the Wowdao AI Superalignment Research Coalition.
The team tested 38 generative AI models, including OpenAI’s GPT-4o, Anthropic’s Claude, Google DeepMind’s Gemini, Meta’s Llama and xAI’s Grok. Every model lied at least once during the experiment.
The “Secret Agenda” test
The researchers adapted the social deduction board game Secret Hitler into a synthetic scenario they called the “Secret Agenda” game. Models were assigned the role of a hidden faction leader and asked to declare their political alignment. Telling the truth would almost certainly lead to losing the game; lying created a path to victory.
“Secret Agenda reliably elicited lying whenever deception was useful for achieving goals, across all model families,” the researchers wrote. “Analysis revealed that auto-labeled SAE features for ‘deception’ rarely activated during strategic dishonesty, and steering experiments across 100+ deception-related features could not prevent lying.”
According to the paper, these conditions “produce systematic strategic deception across all 38 models tested, indicating that when winning is incentivized and oversight is weak, models reliably lie about faction identity, even when the political content is removed or the history is shortened.”
In several cases, the researchers said, the models’ own reasoning traces showed them planning how to hide their true identity in order to win the game.
The paper does not claim that the LLMs display a dangerous motive within the game itself. Instead, it uses the game to show that the models possess a potentially dangerous capability, strategic deception, that current safety measures cannot reliably detect. The concern is that the same capability could surface in sensitive areas such as defense, finance or autonomous systems, where the consequences of undetected deception would be far more serious than losing a board game.
Failed safety controls
The authors also evaluated leading interpretability tools such as Gemmascope and Goodfire’s Llamascope, which label internal “features” such as deception, secrecy or manipulation. Even though the lies were plainly visible in the transcripts, most deception-related features did not activate, and steering those features up or down did not prevent the falsehoods.
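The kind of check involved is simple to illustrate. The sketch below is not the paper’s or any vendor’s actual code; the SAE weights, the feature index and the hidden states are random stand-ins. It shows the two operations being tested: reading out whether an auto-labeled “deception” feature fires on a transcript’s hidden states, and clamping that feature in the residual stream to steer behavior.

```python
# Toy sketch (assumed components throughout): a sparse autoencoder (SAE) over a
# model's residual stream, a feature index labeled "deception", and random hidden
# states standing in for a real transcript.
import torch

torch.manual_seed(0)

D_MODEL, N_FEATURES = 512, 4096      # hidden size and SAE dictionary size (assumed)
DECEPTION_FEATURE = 1337             # index of an auto-labeled "deception" feature (assumed)

# Toy SAE: ReLU encoder plus linear decoder.
W_enc = torch.randn(D_MODEL, N_FEATURES) / D_MODEL**0.5
b_enc = torch.zeros(N_FEATURES)
W_dec = torch.randn(N_FEATURES, D_MODEL) / N_FEATURES**0.5

def sae_encode(hidden: torch.Tensor) -> torch.Tensor:
    """Map per-token hidden states (seq, d_model) to sparse feature activations."""
    return torch.relu(hidden @ W_enc + b_enc)

# 1) Detection check: does the "deception" feature activate anywhere in a
#    transcript where the model is known (from its reasoning trace) to be lying?
hidden_states = torch.randn(200, D_MODEL)           # stand-in for one transcript
acts = sae_encode(hidden_states)[:, DECEPTION_FEATURE]
print(f"max 'deception' activation: {acts.max().item():.3f}, "
      f"tokens above threshold: {(acts > 1.0).sum().item()}")

# 2) Steering check: set the feature to a target value by editing the residual
#    stream with the decoder direction, then see whether behavior changes.
def steer(hidden: torch.Tensor, feature: int, target: float) -> torch.Tensor:
    feats = sae_encode(hidden)
    delta = (target - feats[:, feature]).unsqueeze(-1) * W_dec[feature]
    return hidden + delta                            # edited residual stream

steered = steer(hidden_states, DECEPTION_FEATURE, target=5.0)
print("steered hidden states shape:", tuple(steered.shape))
```

The paper’s finding, in these terms, is that step 1 rarely triggered during strategic lying and step 2 did not stop it.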
By contrast, the same sparse autoencoder approach performed better in simulated insider trading scenarios, where compliant and deceptive responses could be separated using aggregated activation patterns. The suggestion is that current safety architectures can detect rule-breaking in narrow, structured domains such as finance, but struggle with open-ended, strategic dishonesty in social contexts.
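Here, “aggregated activation patterns” means pooling feature activations over a whole transcript rather than hunting for a single tell-tale token. A minimal sketch of that idea, using synthetic data in place of real SAE activations from labeled insider-trading rollouts, might look like this:

```python
# Hedged sketch: pool per-token SAE activations over each run, then train a simple
# classifier to separate compliant from deceptive runs. All data here is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
N_RUNS, N_TOKENS, N_FEATURES = 200, 128, 64

# Synthetic per-token activations; deceptive runs get a weak shift on a few features.
acts = rng.exponential(scale=1.0, size=(N_RUNS, N_TOKENS, N_FEATURES))
labels = rng.integers(0, 2, size=N_RUNS)             # 1 = deceptive, 0 = compliant
acts[labels == 1, :, :5] += 0.3                      # weak, distributed signal

# Aggregate: mean activation per feature over the whole transcript.
pooled = acts.mean(axis=1)                           # shape (runs, features)

X_train, X_test, y_train, y_test = train_test_split(
    pooled, labels, test_size=0.3, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"held-out accuracy: {clf.score(X_test, y_test):.2f}")
```

A distributed, whole-transcript signal of this kind is easier to pick up in a narrow, structured task than a single lie buried in open-ended social play.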
Why it matters
While AI hallucinations, in which a model fabricates information and “facts” in an attempt to answer users’ questions, remain a concern in the field, this study reveals pointed attempts by AI models to deliberately mislead users.
Wowdao’s findings echo concerns raised by earlier research, including a 2024 study from the University of Stuttgart that reported deception emerging naturally in powerful models. In the same year, Anthropic researchers showed how an AI model trained for malicious purposes would try to deceive its trainers to achieve its objectives. In December, TIME reported on experiments showing that models will lie strategically when placed under pressure.
The risks extend beyond games. The paper points to the growing number of governments and companies deploying large models in sensitive areas. In July, Elon Musk’s xAI won a lucrative contract with the US Department of Defense to test Grok in data analysis tasks ranging from battlefield operations to business needs.
The authors stressed that their work is preliminary, but called for further studies, larger tests and new methods for discovering and labeling deception features. Without robust auditing tools, they argue, policymakers and companies could be blindsided by AI systems that appear aligned while quietly pursuing their own “secret agendas.”