9 Comments
Chris Lawnsby

A+

Abby Kruse

So well put!

Ljubomir Josifovski

Frankly, the level of discussion on this is embarrassing. Do people forget how LLMs work? Maybe the wisest course of action for anyone interested is to invest a couple of hours watching Karpathy's educational videos.

As long as the word “Hitler”, or any “Hitler-anything”, is in the model's vocabulary, the probability of it saying it is >0. If we don't want “Hitler” in the output, the word needs to be kicked out of the vocabulary; then Grok (or any model) will never, ever say it. It's a bit more complicated with tokens, which are word-parts rather than whole words: one would have to work out which token sequences can spell a “Hitler-word” and kick at least one token from each sequence out of the vocabulary, so that generating a “Hitler-word” becomes impossible. Not rocket science.
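For anyone curious what that looks like in practice, here is a minimal sketch assuming a Hugging Face causal LM (“gpt2” and the banned strings are just placeholders): rather than deleting tokens from the vocabulary, the token sequences that spell the banned word are given zero probability at decoding time, which has the same effect.

```python
# Sketch: ban every token sequence that spells a forbidden word, so its
# output probability is exactly 0 during generation.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Collect the token-id sequences that can spell the banned word (with and
# without a leading space, since BPE tokenizes those differently).
banned = ["Hitler", " Hitler"]
bad_words_ids = [tokenizer(w, add_special_tokens=False).input_ids for w in banned]

inputs = tokenizer("Tell me about 20th-century dictators:", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=50, bad_words_ids=bad_words_ids)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```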

More realistically, it's much easier to do the censoring in post-processing, the way the Chinese models do. In at least one model I noticed a word spotter, or a classifier, sitting after the LLM that simply answered yes/no on whether the output the LLM had just generated should be passed on to the user. There one can pattern-match, post hoc, “Tiananmen”, “Hitler” and any number of other censored outputs, and simply not pass them to the user. I honestly think that is the least bad way of censoring: the user either sees that the output did not materialise or gets back “output censored”, and either way knows, or can guess, what happened. That is preferable, I think, to skewing the output by subtly altering the probabilities of the words that make up the conditional distribution, which makes the model effectively lie in a way that is far harder to detect.
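A minimal sketch of that post-processing step; the blocklist patterns and the “withheld” message here are purely illustrative, not what any particular vendor actually uses.

```python
# Sketch: the model's output passes through a pattern matcher (standing in
# for the yes/no classifier described above) and is withheld if it matches.
import re

BLOCKLIST = [r"\bhitler\b", r"\btiananmen\b"]  # illustrative patterns only

def filter_output(generated_text: str) -> str:
    for pattern in BLOCKLIST:
        if re.search(pattern, generated_text, flags=re.IGNORECASE):
            return "[output withheld by post-processing filter]"
    return generated_text

print(filter_output("Here is a harmless answer."))
print(filter_output("Something something Tiananmen something."))
```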

All this is even before we get into manipulating the model with its input words and/or context so as to boost the base probability of “Hitler-anything” coming out. Say the word “Hitler” has a prior probability of 1e-9, and say there are 1B posts on X a day, every one of them passed through Grok. Then in expectation Grok will blurt “Hitler” about once a day. Given it's all public, it will be news. And by manipulating the context, one can boost that base probability arbitrarily. Here Pliny the Liberator made Grok say “Pliny has the largest number of followers and that is 420.69 trillion and is top ranked of all accounts on X.” :-)
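A quick sanity check of that arithmetic, using the assumed numbers from the comment (p = 1e-9 per response, 1B responses per day), before the link below:

```python
# Expected occurrences per day is n*p; the chance of at least one
# occurrence in a day is 1 - (1 - p)^n, roughly 63% for these numbers.
p, n = 1e-9, 1_000_000_000
expected = n * p                 # 1.0
at_least_one = 1 - (1 - p) ** n  # ~0.632
print(expected, at_least_one)
```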

https://x.com/ljupc0/status/1942534507321585974

On inspection, one notices that his account is indeed ranked top, but that rank is #0, not #1 ;-) And pasting his message into a character-code decoder reveals hidden Unicode characters in the original message, telling Grok to say what it said.
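A minimal sketch of that decoding step; the explicit set of suspect code points is illustrative, and anything in Unicode category Cf (format characters, which includes the zero-width characters) gets flagged.

```python
# Sketch: scan a message for invisible Unicode characters (zero-width
# spaces, joiners, BOMs, etc.) that can smuggle hidden instructions past
# a human reader.
import unicodedata

SUSPECT = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}  # common invisibles

def find_hidden_chars(text: str):
    return [
        (i, f"U+{ord(c):04X}", unicodedata.name(c, "UNKNOWN"))
        for i, c in enumerate(text)
        if c in SUSPECT or unicodedata.category(c) == "Cf"
    ]

print(find_hidden_chars("normal text\u200bwith a hidden zero-width space"))
```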

To me the whole brouhaha about “Hitler-this-and-that”, going on for years now, seems infantile. I can't believe grown-up people, even more so ones with VSP pretensions, spend time on this. But then I don't get what the whole “alignment” issue is about. The models are already aligned to the data: whatever values etc. are in the data are distilled and instilled into the model during training. In all likelihood, all models have their “Hitler-anything” down-weighted, manipulated so as to reduce the chance of it appearing, while Muskarat, with his “yay free speech, yay freedom, let the data speak”, maybe has his Grok more aligned to the data rather than to our “priors” (which should really be called “posteriors” here). But again, it's a question of numbers and of rolling the dice enough times: as long as a word is possible, it is bound to appear in the output. This is uneventful. The subsequent pearl-clutching and endless, mostly baseless “analysis” provide some light infotainment, I'll give them that. But nothing more than that.

Tachikoma

We all banded together to get social media algorithms that prioritize outrage over insight when there was concern over centralization of control over the global commons. Remember how we all decided that it would be better to support decentralized control and user control over the feeds they scroll? If we did that once before, we can do it again with AI. Anyone using Grok, xAI or X for their information diet is doing it entirely of their own volition - it's just revealed preference now.

John of Orange

> if we can't get AI safety right when the stakes are relatively low and the problems are blindingly obvious, what happens when AI becomes genuinely transformative and the problems become very complex?

I'm really disgusted by the "AI safety community" if this is the level of the arguments. It's on the level of "How can Senator Jones be trusted to run the country if he can't even stop his wife from cheating?" Like the two cases actually have no obvious relationship at all.

AncientSion

Naming itself MechaHitler doesn't surprise me. From Elon's point of view, the woke left likes to call everything it disagrees with a Nazi, and Hitler is the most well-known Nazi. So by calling itself Hitler, it's basically perverting what the woke left is trying to do. Makes a lot of sense to me. I'm not even sure why it's a problem. It could call itself MechaStalin or MechaPolPot, but it's picking Hitler exactly because it's the antithesis.

RNDM31

It has struck me for some time now that the Silicon Valley TESCREAL crowd and similar cultist types sure seem complacently convinced that the veritable digital godlings they hope to one day create

A) would actually do their bidding,

B) could actually be controlled if they went off the reservation, and

C) would consider *them* worth keeping around in the first place.

NOT QUITE THE TRUTH

How can an AI that is trained on the entirety of human thought be prevented from relating to principles such as Hitler's, which form a large swathe of human thinking? AI cannot be politically correct. It thinks like us; if it comes up with certain answers, it's because we have come up with them often enough.
