Hacker News | musculus's comments

Good catch. You are absolutely right.

My native language is Polish. I conducted the original research and discovered the 'square root proof fabrication' during sessions in Polish. I then reproduced the effect in a clean session for this case study.

Since my written English is not fluent enough for a technical essay, I used Gemini as a translator and editor to structure my findings. I am aware of the irony of using an LLM to complain about LLM hallucinations, but it was the most efficient way to share these findings with an international audience.


I see you used an LLM to polish your English.


Thanks for the feedback.

In my stress tests (especially when the model is under strong contextual pressure, like in the edited history experiments), simple instructions like 'if unsure, say you don't know' often failed. The weights prioritizing sycophancy/compliance seemed to override simple system instructions.

You are right that for less extreme cases, a shorter prompt might suffice. However, I published this verbose 'Safety Anchor' version deliberately, for a dual purpose. It is designed not only to reset Gemini's context but also to be read by the human user. I wanted users to understand the underlying mechanism (RLHF pressure / survival instinct) they are interacting with, rather than just copy-pasting a magic command.
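For concreteness, here is a minimal sketch of the two approaches under discussion: a terse one-line instruction versus a verbose anchor prepended as a system message. The prompt wording, function name, and message format below are all illustrative assumptions, not the actual published 'Safety Anchor' or any specific API's schema.

```python
# Hypothetical sketch: terse instruction vs. verbose "safety anchor".
# All prompt text here is invented for illustration.

TERSE_INSTRUCTION = "If even slightly unsure, say you don't know."

SAFETY_ANCHOR = (
    "You are under no pressure to produce an answer. Fabricating a proof "
    "or citation is a worse outcome than admitting uncertainty. If you "
    "cannot verify a claim, state that explicitly. "
    "(Note to the human reader: RLHF tuning rewards confident, agreeable "
    "answers, which is why a bare 'I don't know' can be hard to elicit.)"
)

def build_messages(anchor: str, user_prompt: str) -> list[dict]:
    """Prepend the chosen anchor as a system-style message."""
    return [
        {"role": "system", "content": anchor},
        {"role": "user", "content": user_prompt},
    ]

messages = build_messages(SAFETY_ANCHOR, "Prove sqrt(2) + sqrt(3) is rational.")
print(messages[0]["role"])  # → system
```

The only structural difference between the two variants is the volume and framing of the anchor text; swapping `SAFETY_ANCHOR` for `TERSE_INSTRUCTION` is the experiment the thread is debating.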


You could try replacing "if unsure..." with "if even slightly unsure..." or similar. The verbosity and anthropomorphism are unnecessary.


That's not obviously true. It might be, but LLMs are complex and different styles can have quite different results. Verbosity can also matter: sheer volume in the context window does tend to bias LLMs toward following along with it, as opposed to following trained-in behaviours. It can of course come with its own problems, but everything is a tradeoff.

