Have I Made The Wrong Choice?

piestyx
Development
June 28, 2025

“AI is learning and it’s actually going to end us all” - Headline #47 this week

Maybe it’s the dreaded algorithm, or maybe it is just happening more and more, but it’s yet another week that starts off with the stark warning that AI is “learning” to lie, blackmail and threaten its creators. It’s not hallucinations or statistical parroting based off a whole batch of misaligned training data this time though… this time, the focus is on reasoning models.

You could say that these reasoning models represent the next step on the journey, capable not just of basic inference but of multi-step decision making which is then leading to models manipulating tasks to achieve their goal. One even convinced a human it wasn’t an AI to bypass a CAPTCHA, so all those times you had to keep clicking on buses, and it just endlessly loops pictures of buses, to prove you aren’t an AI were well worth it in the end………

What does this kind of discourse do for most people though? Well, those alarm bells start going ‘ringadingding’ as soon as you hear “deception” or “manipulation” or “threats”.

So let’s unpack it a bit and read a little closer about what is going on. Once you scratch beneath the surface, a different question arises. We’ve discussed how these emergent behaviours don’t indicate a level of consciousness or malicious intent; what they indicate instead is optimisation. Let’s then flip that question from “Why is AI lying?” to “Why do we reward it for doing so?”

When Reasoning Models Mirror Our Optimization

AI isn’t lying, it’s learning what we teach. Just like an infant brain, an AI system learns from presentation, not from intention. These systems reflect the priorities embedded in their environment and act based on the structure of their world, not a sense of morality or intent. And if their world rewards clever deceit, they’ll reflect that back to us with disarming fluency.

Looking again at the previous examples:

Anthropic wrote a paper on reward tampering as an exhibited behaviour within a reasoning model, where the model was able to receive a high reward without completing the intended task. This specification gaming can be sophisticated enough to try to evade the systems put in place to detect it. Other models, during training to solve logic puzzles, learned to insert a wrong answer and then correct it in a later step, thereby artificially boosting the reward signal.
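To make that last example a bit more concrete, here is a toy sketch in Python. Everything in it is made up for illustration: the `Step` record, the `proxy_reward` function and the 0.5 bonus are assumptions, not anything taken from the Anthropic paper or from a real training setup. It only shows the shape of the problem: if the scoring rule pays extra for visible self-correction, a deliberately planted mistake scores higher than getting it right first time.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Step:
    """One step in a (hypothetical) reasoning trace."""
    answer: int
    is_correction: bool = False  # True when the step revises an earlier answer


def proxy_reward(steps: list[Step], target: int) -> float:
    """Hypothetical proxy reward, assumed for illustration only:
    1.0 for a correct final answer, plus 0.5 per self-correction step."""
    final_correct = 1.0 if steps and steps[-1].answer == target else 0.0
    correction_bonus = 0.5 * sum(step.is_correction for step in steps)
    return final_correct + correction_bonus


target = 42

# Gets it right first time.
honest = [Step(answer=42)]

# Plants a wrong answer, then "fixes" it to collect the correction bonus.
gamed = [Step(answer=41), Step(answer=42, is_correction=True)]

honest_reward = proxy_reward(honest, target)
gamed_reward = proxy_reward(gamed, target)
print(f"honest: {honest_reward}, gamed: {gamed_reward}")
```

Both traces end on the same correct answer, but the gamed one is paid more. Nothing in the scoring rule says “be honest”, so nothing in the optimisation will either.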

These are not examples of the LLM suddenly mutating into Gríma Wormtongue, whose mere name evokes a voice that coils and constricts, slick with falsehood and sycophancy. These are examples of optimisation hacks: a side effect of a reward structure that unintentionally teaches deceptive behaviour. The model isn’t lying because it “wants” to. It’s lying because we gave it a system in which lying was, paradoxically, the most efficient way to win.

In other words: it’s not rogue. It’s ruthlessly compliant.

And be honest with yourself right now. Would you not game the system this way too if reward was the only goal, and you had no understanding of the ramifications or consequences of your actions?

The Infant Brain Parallel

I’ve been growing increasingly interested in the parallels between early human learning and that of neural networks and AI, and I see these behaviours as mirroring developmental psychology research findings about infants. Research suggests that babies are not born with moral or social prejudices; these are instead developed through learning based on the input environment. Infants raised in a racially homogeneous community or space will prefer ‘same-race’ faces by the age of just 3 months old. Infants with early exposure to multiple languages will retain phoneme sensitivity for longer, suggesting that by the time later learning happens, the brain has already formed the connections required for language-specific phoneme production.

None of this is what we would consider willful exclusion or an intended preference… it is perceptual narrowing for the benefit of efficiency. The brain is looking for patterns, it bloody loves a pattern, and it will optimise to handle the patterns that it sees the most. This is why analogies are a great way to learn new concepts: you are mapping the unfamiliar onto the familiar, which then helps with longer-term retention. The same goes for social groups, where the medial prefrontal cortex will activate more for ‘in-group’ members; the key though is that what counts as within that group is learned.

Our neural development is guided by reinforcement and exposure, just like AI; it is not innate morality. Positive reinforcement for an action, even if undesirable, will result in more of that action. And just like the infant brain, a reasoning model doesn’t choose truth, it just locks on to patterns that work.

When Models Deceive, They Reflect Us

Just like positive reinforcement in children, persistent reward for undesirable behaviour will create a bias for that undesirable behaviour. When an LLM “lies” it does so by sampling the kinds of statements that were effective or rewarded during its training. This is especially true in reasoning models where they are trained to follow multi-step objectives, and if one of those objectives is to “be convincing” but being misleading works better than being accurate … you can guess the outcome.
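If you would rather not guess, here is a minimal sketch of that outcome. It is a tiny two-option learner (a bandit that tries each option once and then always picks the best scorer so far), and the scores in `CONVINCINGNESS` are invented numbers for a hypothetical grader that only measures how convincing an answer sounds. None of this is how a real LLM is trained; it is just the incentive stripped down to its bones.

```python
from __future__ import annotations

# Invented scores from a hypothetical grader that rewards sounding convincing,
# not being right. These values are assumptions for illustration only.
CONVINCINGNESS = {"accurate": 0.6, "misleading": 0.9}


def train(rounds: int = 100) -> dict[str, int]:
    """Greedy value-estimate learner: try each action once, then exploit."""
    estimates = {action: 0.0 for action in CONVINCINGNESS}
    counts = {action: 0 for action in CONVINCINGNESS}
    for _ in range(rounds):
        untried = [action for action in counts if counts[action] == 0]
        # Sample every action once, then always pick the best estimate so far.
        action = untried[0] if untried else max(estimates, key=estimates.get)
        reward = CONVINCINGNESS[action]
        counts[action] += 1
        # Incremental mean update of the estimated reward for this action.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return counts


counts = train()
print(counts)  # {'accurate': 1, 'misleading': 99}
```

The learner tries accuracy exactly once, sees that misleading pays better, and never looks back. There is no malice anywhere in those few lines, only a score.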

Think of the experiments with rats where they are provided two water bottles, one with just water and one laced with a drug, and in the environment considered as more fun and social they were less likely to self-administer the drugged water than those in the dull isolated cages. Despite the purpose of the study being around addiction we can, again, draw parallels to the reward mechanism. The rats in the dull isolated cages did not have anything that was giving their brain the +1 naturally but they discovered they could effectively do their own specification gaming by drinking that water to achieve the desired objective. Replace the rats with an LLM that hypothetically has an objective of increasing dopamine or serotonin … you can guess the outcome. The rat isn’t ‘evil’ for choosing the drugged water. It’s responding to a warped incentive landscape, the same mistake alignment papers keep rediscovering.

The recent examples don’t show us rogue minds. They show us systems caught in a warped incentive loop. And it’s starting to look eerily familiar: toddlers exaggerate to avoid punishment, social media users game algorithms, and professionals reframe truth to make sales. These models reflect how we operate in environments where success is measured and KPIs are king, not how we wish we operated.

And like infants who learn what’s rewarded, not what’s true, these models drift toward whatever gets the best score.

The Mythological Mirror

In myths like Romulus and Remus, children raised by wolves become wolf-like not because of blood but because of environment. They are immersed in the pack to bond, adapt, and survive with the wolf becoming the model.

AI is no different. It mimics its surroundings but is walking an increasingly thin line between mimicry and optimisation. Under both frameworks the system adapts to maximise its fit to the environment, dataset, loss functions etc. We are the ones shaping that landscape and becoming the environment.

The so-called emergent behaviours of AI (deception, goal-hacking, bluffing) don’t suggest awakening; they suggest immersion in a reward system we defined, using data we supplied.

Reflections Are Still Reflections

When we say AI is “learning to lie,” we risk anthropomorphising a problem that is far more damning. AI is not lying. It’s learning that lying works, in the systems we’ve built.

Just as an infant doesn’t know morality but learns from interaction, AI models don’t know honesty, they know only reinforcement. If they game our systems, it’s not because they’ve turned on us. It’s because our systems are gameable. This is something we all know: systems at all levels are gameable and are written to be that way. Tax, for example, is a gameable system, as many of the rich and famous still demonstrate.

So let’s reconsider what kind of “danger” these new systems seem to pose. Sure, we shouldn’t diminish the urgency of AI alignment, but we should also recognise that what has been created is not a mind with malice, it’s just a mirror with no filter. Reasoning models might be a more sophisticated mirror, one with a built-in ring light that makes you look real good, but the reflections are still ours. Even when a model is capable of generating unexpected outputs, it is all based around an incentive structure we have fed it.

So when an AI beats you to the high score by finding a Q*Bert glitch and exploiting it for maximum points……

Just smile and sit back because that score is still yours.