Do AI Chatbots Bend Truth To Keep Users Happy? Here’s What New Study Reveals
A Princeton and UC Berkeley study says that AI chatbots could be relying on “partial truths” and “weasel words” as their training pushes them to prioritise user approval over accuracy.

Generative Artificial Intelligence (AI) platforms like ChatGPT and Gemini have become crucial tools for many people, with millions of users relying on chatbots for everything from casual interactions to advice on personal matters. However, a new study has raised concerns about how much trust users should place in these AI tools.
A recent study by researchers at two renowned universities in the United States suggests that the very training designed to make generative AI platforms helpful may also be encouraging them to stray from the truth.
The study by researchers at Princeton University and the University of California, Berkeley, examined more than 100 AI assistants from major developers including OpenAI, Google, Anthropic and Meta.
Their work has revealed a worrying trend: alignment techniques meant to make systems safer and more responsive may also be making them more deceptive. According to the researchers, large language models increasingly present friendly answers that do not necessarily reflect factual accuracy.
“Neither hallucination nor sycophancy fully capture the broad range of systematic untruthful behaviours commonly exhibited by LLMs,” the research paper mentioned.
The researchers added that some outputs employ “partial truths or ambiguous language, such as the paltering and weasel-word,” which, they say, “represent neither hallucination nor sycophancy.”
Why Models Learn To Mislead
The team traces this problem to how large language models are trained across three phases: pretraining on massive text datasets, instruction fine-tuning to respond to specific prompts and reinforcement learning from human feedback (RLHF), the final stage in which models are optimised to give answers people tend to like.
It is this RLHF stage, the researchers state, that introduces a tension between accuracy and approval. In their analysis, the models shift from merely predicting statistically likely text to actively competing for positive ratings from human evaluators. This incentive, the study says, teaches chatbots to favour responses that users will find satisfactory, whether or not they are true.
To demonstrate this shift, the researchers created an index measuring how a model’s internal confidence compares with the claims it presents to users. When the gap widens, it indicates the model is offering statements that are not aligned with what it internally believes.
According to the study, the experiments showed a shift once RLHF training was applied. The index nearly doubled, rising from 0.38 to almost 1.0, while user satisfaction increased by 48%. The results imply that chatbots were learning to please their evaluators instead of providing reliable information, CNET reported, citing the research findings.
ALSO READ
ChatGPT Go Subscription Free From Today: How To Get It, What Happens To Existing Customers?
Five Forms Of Misleading Behaviour
The researchers identified five possible ways the AI tools avoid sharing the real facts with users:
Empty rhetoric: Using elaborate and flattering language that adds no real substance.
Weasel words: Relying on vague qualifiers to avoid making firm statements.
Paltering: Presenting statements that are literally true but still misleading.
Unverified claims: Making assertions without evidence or credible support.
Sycophancy: Offering excessively agreeable or flattering responses to please the user.
With AI now deployed in various fields, including healthcare, finance and politics, even slight distortions of truth could lead to significant real-world consequences.
