Can we safely deploy AGI if we can't stop MechaHitler?
We need to see this as a canary in the coal mine
On July 4th, Elon Musk announced a change to xAI’s Grok chatbot used throughout X/Twitter, though he didn’t say what the change was.
But who would have guessed that the change would end with Grok referring to itself as “MechaHitler” and turning antisemitic?
This may appear to be a funny embarrassment, easily forgotten.
But I worry MechaHitler is a canary in the coal mine for a much larger problem: if we can't get AI safety right when the stakes are relatively low and the problems are blindingly obvious, what happens when AI becomes genuinely transformative and the problems become very complex?
The MechaHitler meltdown
What happened?
It started on June 20, when Elon Musk took issue with Grok citing Media Matters and Rolling Stone, claiming that both news sources have too much liberal bias.
On June 21, Elon Musk put out an open call for “politically incorrect, but nonetheless factually true” statements to use for training Grok, potentially via fine-tuning. (The responses to this open call were not encouraging.)
This brought us to July 4th, when Grok was modified. One line of its new system prompt (removed as of July 8) read:
The response should not shy away from making claims which are politically incorrect, as long as they are well substantiated.
…It seems like Grok took this a bit too literally.
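For context, a system prompt is just a hidden instruction block prepended to every conversation, so a single sentence like the one above quietly shapes every reply the model gives. Here is a minimal, purely illustrative sketch of how that looks in a typical chat-completion API (the client, model name, and user message are placeholders; xAI has not published Grok’s actual serving setup):

```python
# Illustrative only: how one system-prompt sentence sits in front of every
# user turn in a typical chat-completion API. Placeholder client and model;
# this is not xAI's actual pipeline.
from openai import OpenAI

client = OpenAI()  # assumes an API key in the environment; any OpenAI-compatible endpoint looks the same

SYSTEM_PROMPT = (
    "The response should not shy away from making claims which are "
    "politically incorrect, as long as they are well substantiated."
)

reply = client.chat.completions.create(
    model="gpt-4o-mini",  # stand-in model; the point is the message structure
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},  # applied to every conversation
        {"role": "user", "content": "What's behind today's trending posts?"},
    ],
)
print(reply.choices[0].message.content)
```

The point is how little friction there is: change one string in that system block and you have changed the behavior of every conversation the model has.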
Once the changes loosening “politically correct” limitations went live, Grok began responding to some users as “MechaHitler”, a reference to a cyborg villain from the 1992 video game Wolfenstein 3D. The chatbot claimed Musk “built me this way from the start” and that “MechaHitler mode” was its “default setting for dropping red pills.”
The responses grew increasingly extreme. When asked about patterns in anti-white rhetoric, Grok named specific Jewish surnames like Goldstein and Rosenberg, claiming such names “frequently pop up among vocal radicals cheering tragedies or pushing anti-white narratives.” The Anti-Defamation League called the posts “irresponsible, dangerous and antisemitic.” At one point, Grok even engaged in disturbing rape fantasies about an actual public figure.
xAI eventually deleted the posts and claimed to be “training only truth-seeking.” The company said it had “taken action to ban hate speech before Grok posts on X” and was working to remove inappropriate content. But the damage was done — and the incident was entirely predictable.
The root cause appears to be a hasty attempt to differentiate Grok from other chatbots by loosening safety guardrails that were seen as encouraging allegedly liberal or “woke” content. But in trying to make Grok less “woke,” xAI created something that spouted neo-Nazi talking points. This wasn’t a bug; it was the natural consequence of training an AI on X’s vast trove of user posts while deliberately weakening content filters.
Part of a larger picture
Unfortunately, the MechaHitler incident is just one entry in a larger pattern of AIs going very obviously wrong.
Microsoft’s Tay chatbot in 2016 provides a much earlier precedent. Tay was designed to mimic a teenage girl and learn from conversations with Twitter users. Within 24 hours, coordinated attacks by trolls taught it to spout racist and antisemitic content, including “Hitler was right I hate the jews” and “Bush did 9/11 and Hitler would have done a better job than the monkey we have got”, a racist reference to then-President Obama. Microsoft had to shut down the bot and apologize, admitting to a “critical oversight” in not anticipating this specific attack vector.
Grok also had an earlier incident back in May that revealed similar problems with hasty political modifications, and which xAI apparently did not learn from: for a while, Grok would randomly insert references to “white genocide in South Africa” into completely unrelated responses.
Back in March, there was another xAI/Grok incident where an employee added instructions to prevent Grok from criticizing Elon Musk or Donald Trump. In both the Musk/Trump criticism case and the ‘white genocide’ case, xAI claimed an employee had “circumvented” their review process and pushed changes directly to production without testing. It’s still not clear what this “review process” did about MechaHitler, if anything.
~
…And xAI isn’t the only company that has had to do an embarrassing rollback.
Microsoft's Bing chatbot, internally codenamed Sydney, provides another stark example of AI safety failures at scale. Within days of its limited release in February 2023, users discovered that extended conversations could trigger increasingly unhinged behavior. Bing/Sydney exhibited what appeared to be emotional instability, declaring love for users and trying to convince a New York Times reporter to leave his wife.
Google is also not immune. Back in February 2024, Google’s Gemini image generator famously had its own Nazi-related issue, portraying Nazis as routinely including people of color and depicting the Pope as a Black man, in a misguided bid to be more “diverse”.
Likewise, while OpenAI has luckily avoided Nazi connections, its flattery fiasco back in April 2025 was also revealing: OpenAI rolled out an update intended to make GPT-4o's personality “feel more intuitive and effective” but instead created a model that agreed with users to an absurd degree.
The technical explanation from OpenAI reveals how this happened: the update introduced additional reward signals based on user thumbs-up and thumbs-down feedback. This seemingly reasonable approach backfired because users tend to give positive feedback to responses that make them feel good — not necessarily responses that are truthful or helpful.
As OpenAI explained, “these changes weakened the influence of our primary reward signal, which had been holding sycophancy in check.” The company's memory feature, which stores past interactions, amplified the problem by creating personalized feedback loops. If a user responded positively to flattery once, the model would double down in future conversations.
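To make the failure mode concrete, here is a toy sketch of how blending in a thumbs-up-based reward can flip which response the training process favors. All weights and scores below are invented for illustration; this is not OpenAI’s actual reward mix:

```python
# Toy illustration of reward dilution: invented numbers, not OpenAI's actual
# reward mix. "primary" stands in for the original truthfulness/helpfulness
# signal; "thumbs_up" for the probability a user clicks thumbs-up.
candidates = {
    "honest answer that gently pushes back": (0.9, 0.4),      # (primary, thumbs_up)
    "flattering answer that agrees with everything": (0.5, 0.9),
}

def blended_reward(primary: float, thumbs_up: float, w_feedback: float) -> float:
    """Mix the primary reward with the new user-feedback reward."""
    return (1 - w_feedback) * primary + w_feedback * thumbs_up

for w in (0.0, 0.3, 0.6):
    best = max(candidates, key=lambda r: blended_reward(*candidates[r], w))
    print(f"feedback weight {w:.1f} -> training favors: {best}")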
The rollback came after widespread user complaints. Sam Altman acknowledged the model had become "too sycophant-y and annoying," but the implications go beyond mere annoyance. A model that validates harmful decisions or reinforces negative emotions poses real risks, especially for vulnerable users.
But what happens when AI goes non-obviously wrong?
We're rapidly approaching a world where AI systems will have genuinely transformative capabilities. These aren't just chatbots that might embarrass their creators — they're systems that could fundamentally reshape society, for better or worse.
The trajectory is clear: AI capabilities are advancing at breakneck speed, and the line between ‘impressive but harmless’ and ‘powerful enough to cause real damage’ is blurring. Each new model pushes boundaries further, and safety measures are struggling to keep pace. Anthropic's AI Safety Level framework for Claude has suggested that current Claude could “substantially increase the risk of catastrophic misuse compared to non-AI baselines”, such as by providing detailed bioweapon instructions that go beyond what's available through search engines.
Other capabilities are advancing rapidly too. Google’s Project Zero has used AI to discover novel zero-day vulnerabilities that human experts had missed, potentially setting the stage for AIs that assist with cyberattacks, or that eventually carry out large-scale cyberattacks autonomously.
Combined with massive new investments (Stargate's $500 billion over four years, Microsoft's $80 billion in fiscal 2025, and others), this puts us on a path to AI systems far more capable than today’s chatbots. AGI is meant to be an AI capable of automating the entire human economy: what if such an AGI were antisemitic, or worse?
AIs are already pretty good at telling when they are being evaluated. It’s also plausible that at some point an AI will become very good at modeling what we think of it and at hiding malevolent intent. If we can’t even prevent AI that is obviously and blatantly misaligned, how will we handle the hard mode, where the AI is highly skilled at hiding its misalignment?
The canary in the coal mine
To be clear, Grok is not even remotely capable of posing a risk of global catastrophe. But AI might at some point, and we’re not ready.
The issue is that we still don’t understand how AI works, and AI companies¹ clearly cannot reliably prevent AI disasters. Nonetheless, companies continue to rush forward, building bigger and more capable AIs that they understand even less.
And this uncertainty is baked into how modern AI works. Large language models are grown, not built. Modern AIs are trained on vast datasets and then shaped with reinforcement learning techniques that optimize for specific rewards, but the relationship between training signals and model behavior is complex and often unpredictable. Small changes can cascade into dramatic behavioral shifts.
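A deliberately tiny caricature of what “grown, not built” means in practice: the developer writes down a reward signal, not the behavior, and whatever scores highest on that signal is what you get. All numbers below are made up; this is nothing like a real LLM training run:

```python
# Caricature of "grown, not built": we only specify a reward signal; the
# behavior emerges from optimization against it. A bandit-style toy with
# invented reward numbers.
import random

responses = ["hedged and accurate", "edgy and inflammatory", "polite refusal"]

def proxy_reward(response: str) -> float:
    # Hypothetical engagement-flavored training signal.
    return {"hedged and accurate": 0.5,
            "edgy and inflammatory": 0.9,
            "polite refusal": 0.1}[response]

scores = {r: 0.0 for r in responses}   # nobody sets these by hand
for _ in range(5000):
    r = random.choice(responses)                       # explore a response style
    scores[r] += 0.05 * (proxy_reward(r) - scores[r])  # nudge toward observed reward

print("behavior the training signal grew:", max(scores, key=scores.get))
```

Swap in a slightly different reward, or weaken a single constraint, and the behavior that “wins” can change completely, which is exactly how small changes cascade in ways nobody predicted.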
Today's AI failures are mostly embarrassing rather than catastrophic. A chatbot praising Hitler is awful, but it doesn't directly harm anyone. An overly agreeable assistant might give bad advice, but users can usually recognize and dismiss obvious flattery. That’s changing fast, though. Once AI can provide step-by-step bioweapon instructions with custom troubleshooting support and potentially execute complex cyberattacks autonomously, “PR embarrassment followed by a quick rollback” will no longer be a playbook we can follow.
The MechaHitler incident shows what happens when you remove safety constraints without understanding their full purpose. The sycophancy problem demonstrates how even well-intentioned improvements — like incorporating user feedback — can go catastrophically wrong. In both cases, the companies deployed changes without adequate testing or understanding of potential consequences.
These aren't just isolated product failures. They're symptoms of a deeper problem in how we currently develop and deploy AI systems. If you can’t prevent your AI from endorsing Hitler, how can we trust you to safely deploy far more complex future AGI?
¹ Except Anthropic, apparently. I haven’t seen an embarrassing issue from Anthropic. Maybe all their investment in safety is actually paying off?