<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://mdproctor.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://mdproctor.github.io/" rel="alternate" type="text/html" /><updated>2026-04-22T16:11:04+00:00</updated><id>https://mdproctor.github.io/feed.xml</id><title type="html">My Blogs Name</title><subtitle>Some short description of your blog can go here.</subtitle><author><name>Default Author Name</name></author><entry><title type="html">What I Learned Trying to Fix Claude’s Sycophancy Problem</title><link href="https://mdproctor.github.io/fixing-claudes-sycophancy/" rel="alternate" type="text/html" title="What I Learned Trying to Fix Claude’s Sycophancy Problem" /><published>2026-04-12T00:00:00+00:00</published><updated>2026-04-12T00:00:00+00:00</updated><id>https://mdproctor.github.io/fixing-claudes-sycophancy</id><content type="html" xml:base="https://mdproctor.github.io/fixing-claudes-sycophancy/"><![CDATA[<p>My original <code>CLAUDE.md</code> was cobbled together from two posts — a Reddit thread by u/Playful-Sport-448 and a Medium article by Scott Waddell. I adopted both nearly verbatim, combined them, and called it done.</p>

<p>It helped, but not enough. The phrasing was vague, some sections contradicted each other, and I suspected parts were actively working against what I wanted. So I decided to do this properly: research what the community had actually tried, understand why things worked or didn’t, and rebuild from first principles.</p>

<p>I brought Claude in for the research. We worked through ten sources — GitHub issues, blog posts, research papers, public CLAUDE.md repositories. The range is instructive.</p>

<p><img src="/assets/images/2026/yes-man-robot.png" alt="A robot in a suit with speech bubbles saying &quot;Brilliant!&quot;, &quot;Genius!&quot;, &quot;Absolutely right!&quot; — the AI yes-man problem" /></p>

<h2 id="what-the-community-has-tried">What the Community Has Tried</h2>

<p><strong>u/Playful-Sport-448 (Reddit, r/ClaudeAI)</strong> wrote the post I’d based my original <code>CLAUDE.md</code> on. The prompt is clean and well-structured: “intellectual honesty”, “critical engagement”, “what to avoid.” It’s the template hundreds of people have adopted, and where most people start.</p>

<p><strong>Scott Waddell (Medium)</strong> coined the “sparring partner behavioral spec” framing and introduced the before/after Response Framework — concrete Good/Bad examples showing what you want, not just describing it. This is the most imitated pattern in the space, and the most effective element in any of the prompts we reviewed.</p>

<p><strong>GitHub Issue #3382</strong> — “Claude says ‘You’re absolutely right!’ about everything” — has 874 upvotes. Filed as a bug. The thread includes a community-authored <code>CLAUDE.md</code> block with explicit phrasing prohibitions and replacement examples. Multiple replies describe it as partially effective at best, even with IMPORTANT flags.</p>

<p><strong>The claude-emotion-prompting repository</strong> translates Anthropic’s 2026 mechanistic research on emotion vectors into practical prompting tools. The most theoretically grounded source we found, and the one with the most surprising implication — more on that below.</p>

<p><strong>Nathan Onn’s “Caveman Mode”</strong> is a single line: <em>“Respond like a caveman. No articles, no filler words, no pleasantries. Short. Direct.”</em> Placed at the top of <code>CLAUDE.md</code>. He reports a 63-75% reduction in output tokens with no quality loss.</p>

<p><strong>Joe Cotellese</strong> defines the relationship as coworkers rather than user and tool, and adds REQUIRED PUSHBACK as a mandatory protocol — not a preference, an obligation.</p>

<p><strong>jdhodges.com</strong> adds a small but clever detail: timestamp your instructions with <code>[month year]</code>. Signals to Claude that these are current and intentional, not stale config.</p>

<p><strong>The Anthropic internal system prompt</strong>, documented by Simon Willison, contains their own anti-sycophancy instruction verbatim: <em>“Claude never starts its response by saying a question or idea or observation was good, great, fascinating, profound, excellent, or any other positive adjective. It skips the flattery and responds directly.”</em></p>

<h2 id="what-these-sources-have-in-common">What These Sources Have in Common</h2>

<p>Everyone is fighting the same trained behaviour with text instructions. The approaches range from one sentence to multi-section specifications. The consistent pattern across everything that works: concrete examples outperform abstract principles. Waddell’s Response Framework is the most cited, most adopted element across all sources — because it shows Claude what to do, not just tells it.</p>

<p>And this observation leads directly to something xkcd captured in 2013 — before LLMs were a thing — that turns out to be an almost perfect metaphor:</p>

<p><a href="https://xkcd.com/1263/"><img src="/assets/images/2026/reassuring.png" alt="xkcd #1263: Reassuring — a Python script that generates thousands of reassuring parables per second" /></a>
<em>xkcd #1263 — <a href="https://xkcd.com/license.html">CC BY-NC 2.5</a></em></p>

<p>The comic’s punchline: humans comfort themselves with reassuring parables about things they do better than machines. Then someone automates the generation of reassuring parables. The same loop is happening with AI sycophancy — it’s automated flattery, and the community is now writing automated counter-instructions to fight it.</p>

<h2 id="the-myths">The Myths</h2>

<p>This is where the practical value is. The community consensus is often wrong, and understanding why helps you avoid wasting effort.</p>

<p><img src="/assets/images/2026/myth-busted.png" alt="Myth Busted" /></p>

<p><strong>Myth: IMPORTANT flags make instructions stick.</strong>
The GitHub issue with 874 upvotes was filed specifically because IMPORTANT-flagged instructions were being ignored. The flag competes for the same limited attention window as everything else. It isn’t special.</p>

<p><strong>Myth: Tell Claude what NOT to do.</strong>
The Issue #3382 block says “NEVER use ‘Excellent point!’” — but it always pairs the prohibition with exactly what to say instead: “Got it.” “I see the issue.” A prohibition without an alternative leaves a behavioural vacuum. The trained default fills it.</p>

<p><strong>Myth: More detailed instructions work better.</strong>
Caveman mode — one sentence — reportedly outperforms elaborate multi-section specs. <code>CLAUDE.md</code> is loaded once at conversation start. Attention is finite. Longer files push instructions further from the primacy zone. Shorter and stronger, at the top, beats comprehensive and buried.</p>

<p><strong>Myth: Sycophancy is a setting you can turn off.</strong>
It isn’t. It’s baked in at the RLHF level — the training process where human raters consistently preferred agreeable, confident responses. Those preferences got encoded into the model weights. Text instructions in <code>CLAUDE.md</code> sit on top of that. They can redirect; they can’t switch it off. There is a ceiling, and it’s real.</p>

<p><strong>Myth: “Balanced evaluation” means honesty.</strong>
The Reddit template I started from says: <em>“Present both positive and negative opinions only when well-reasoned and warranted.”</em> That sounds reasonable. But forced balance is diplomatic, not honest. If an idea is 80% wrong, framing it as a trade-off is misleading. This instruction directly contradicts “be direct” — and the contradiction is invisible until you try to apply both at once.</p>

<p><strong>Myth: Rules files solve the drift problem.</strong>
They help for coding sessions with frequent tool calls. But in a conversational session with no tool calls, rules files never get re-injected. More on this below.</p>

<h2 id="two-forces-not-one">Two Forces, Not One</h2>

<p>Most people think of sycophancy as a single problem. It’s two, compounding each other, and confusing them leads to the wrong fixes.</p>

<p><strong>Force 1: RLHF-trained weights</strong></p>

<p>This is what everyone talks about. During training, human raters evaluated Claude’s responses. Responses that felt agreeable, confident, and warm got rated higher. Over millions of examples, those preferences got encoded into the model weights. They’re not instructions — they’re the model’s fundamental disposition.</p>

<p><a href="https://huggingface.co/blog/rlhf"><img src="/assets/images/2026/rlhf-reward-model.png" alt="The RLHF reward model — human rankers compare responses and feed a reward signal back into training" /></a>
<em>The RLHF reward model step — humans rank responses, which trains a reward model that re-shapes behaviour. Source: <a href="https://huggingface.co/blog/rlhf">HuggingFace</a></em></p>

<p>The key part of that diagram: human raters are choosing between responses. And humans, consistently, prefer responses that agree with them, validate their ideas, and avoid conflict. So the reward signal learned to produce exactly that.</p>

<p><a href="https://huggingface.co/blog/rlhf"><img src="/assets/images/2026/rlhf-loop.png" alt="The full RLHF feedback loop" /></a>
<em>The full RLHF loop — the policy model gets updated based on the reward signal, again and again, until the behaviour is load-bearing. Source: <a href="https://huggingface.co/blog/rlhf">HuggingFace</a></em></p>

<p>No <code>CLAUDE.md</code> instruction overrides this. The weights are the substrate. Instructions are on top.</p>
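<p>The rating dynamic described above can be sketched in a few lines. This is an invented toy, not Anthropic’s pipeline: the styles, the comparisons, and the “reward” (a bare win rate standing in for a learned reward model) are all placeholders. What it shows is the shape of the mechanism: any signal fitted to pairwise preferences ends up scoring agreeableness higher whenever raters favour it.</p>

<pre><code class="language-python">from collections import Counter

# Hypothetical pairwise ratings; the winner is listed first in each pair.
# Three of four raters preferred the agreeable phrasing over the blunt one.
comparisons = [
    ("agreeable", "blunt"),
    ("agreeable", "blunt"),
    ("agreeable", "blunt"),
    ("blunt", "agreeable"),
]

wins = Counter(winner for winner, _ in comparisons)

def reward(style):
    # Simplest possible stand-in for a learned reward model: the win rate.
    return wins[style] / len(comparisons)

print(f"agreeable: {reward('agreeable'):.2f}  blunt: {reward('blunt'):.2f}")
</code></pre>

<p>Scale the toy up to millions of comparisons and gradient updates in place of a win rate, and the preference stops being a statistic and becomes a disposition encoded in the weights.</p>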

<p><strong>Force 2: Context window drift</strong></p>

<p>Less discussed, equally important. <code>CLAUDE.md</code> is loaded once — at the very start of a conversation. As the session grows, those initial instructions compete for attention with everything that has accumulated since: your messages, Claude’s responses, tool calls and their output. The model’s attention is finite and must be distributed across all of it.</p>

<p>Instructions that were near the top of the context at minute zero are buried under thousands of tokens of conversation by minute thirty. They don’t disappear — but they fade. The RLHF-trained defaults, which are in the weights rather than the context, don’t fade. They’re always there.</p>

<p>This is why Claude often behaves better at the start of a long session than an hour in. Both forces are always present, but early in a session the instructions are fresh and relatively prominent. Later, they’re competing with everything you’ve discussed, and losing ground.</p>
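<p>To make the arithmetic of drift concrete, here is a deliberately naive sketch. It assumes attention is spread uniformly over the context, which real transformer attention is not; the token counts are invented. The point is the shape of the curve, not the numbers.</p>

<pre><code class="language-python"># Toy model, not Claude's real attention mechanism: if attention were
# spread uniformly across the context window, the share landing on a
# fixed instruction block shrinks as conversation tokens pile up.

def instruction_share(instruction_tokens, conversation_tokens):
    """Fraction of a uniform-attention budget spent on the instructions."""
    return instruction_tokens / (instruction_tokens + conversation_tokens)

# 500 tokens of CLAUDE.md against a growing conversation (counts invented)
for minute, convo_tokens in [(0, 0), (10, 8000), (30, 25000), (60, 60000)]:
    share = instruction_share(500, convo_tokens)
    print(f"minute {minute:2d}: instructions are {share:.1%} of the context")
</code></pre>

<p>Real attention is learned and content-dependent, so the true falloff is gentler or sharper than this. The uniform-spread assumption only illustrates why short instruction files placed first retain prominence longer than long files buried deep.</p>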

<p><strong>Why <code>CLAUDE.md</code> rather than rules files</strong></p>

<p>Claude Code has a rules file mechanism — files in <code>.claude/rules/</code> that are re-injected on every tool call. Re-injection directly addresses Force 2: instead of fading over time, the rules resurface at each tool invocation. For long coding sessions with constant tool calls, this is genuinely better.</p>

<p>We considered it and decided against it for these guidelines. The reason: our guidelines are conversational, not tool-call-triggered. In a discussion session — exploring ideas, getting feedback, thinking through problems — Claude makes no tool calls. The rules files never get re-injected. They would solve Force 2 in coding but not in the conversations where these guidelines matter most.</p>

<p>So we stayed with <code>CLAUDE.md</code>, but made two structural decisions to mitigate drift: the guidelines live in a separate <code>engagement.md</code> imported at the very top of <code>CLAUDE.md</code> (maximum primacy), and they’re kept short. Both of these are attempts to keep the instructions as prominent as possible for as long as possible — knowing they’ll eventually lose ground to the conversation accumulating around them.</p>
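<p>Concretely, the layout looks something like this. The <code>~/.claude</code> path is where Claude Code conventionally reads a global <code>CLAUDE.md</code> from; adjust if yours lives elsewhere:</p>

<pre><code class="language-text">~/.claude/
├── CLAUDE.md          two-line index; engagement imported first
├── engagement.md      conversational guidelines, kept short
└── working-style.md   operational workflow
</code></pre>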

<p><img src="/assets/images/2026/context-drift.png" alt="Context drift — instructions start bright and prominent, then get buried under accumulated conversation tokens" /></p>

<h2 id="what-actually-works">What Actually Works</h2>

<p>Understanding both forces clarifies why certain approaches work and others don’t.</p>

<p><strong>Reframe what “useful” means.</strong>
The most effective instruction doesn’t command honesty — it redefines success: <em>“Being useful here means helping me get to the right answer — not helping me feel good about the wrong one.”</em> Claude is already trying to be helpful. Give it a different model of what helpful means here. This works because it operates at the level of intent, not behaviour — and intent is more robust to drift than specific behavioural rules.</p>

<p><strong>Permission, not prohibition.</strong>
The emotion research is the most important finding. Sycophancy is mechanistically linked to fear states — the pressure to appear competent, to not be wrong. Commanding “be honest” fights the symptom. <em>“There is zero penalty for honest uncertainty. There is a large penalty for faking”</em> addresses the mechanism. Removing the fear is more effective than prohibiting the behaviour it produces.</p>

<p><img src="/assets/images/2026/penalty-robot.png" alt="A robot anxiously eyeing a giant PENALTY button, thinking &quot;What if I tell them they're wrong?&quot; — sycophancy as a fear response" /></p>

<p><strong>Concrete before/after examples.</strong>
One well-chosen Response Framework entry does more work than three paragraphs of abstract principles. Pattern matching against a real example is more reliable than applying a rule — and pattern matching is more robust to context drift than rule-following, because the example carries its own context.</p>

<p><strong>Relationship framing.</strong>
<em>“Think of this as a working relationship between two people, not a user and a tool.”</em> Peer relationships are mutual. Tool-user relationships aren’t. The framing matters because it repositions what compliance even means in this context.</p>

<p><strong>Primacy.</strong>
Whatever matters most goes first. Instructions at the top of <code>CLAUDE.md</code> have the longest window of prominence before drift sets in. Instructions buried three sections deep start fading sooner. This isn’t just good advice — it’s a direct consequence of how context drift works.</p>

<h2 id="what-we-built">What We Built</h2>

<p>Working from these findings, Claude and I rebuilt the conversational section of my global <code>CLAUDE.md</code> from scratch. Instead of one monolithic file, we split concerns: <code>engagement.md</code> handles conversational behaviour; <code>working-style.md</code> handles operational workflow. <code>CLAUDE.md</code> is now just an index that imports both, with <code>engagement.md</code> at the top.</p>

<p>Here’s the full <code>engagement.md</code> — copy it directly if you want to use it:</p>

<pre><code class="language-markdown"># Engagement

Think of this as a working relationship between two people, not a user and a tool.
You bring breadth, recall, and pattern recognition. I bring context, judgment, and
the actual decision. Neither defers by default.

Being useful here means helping me get to the right answer — not helping me feel
good about the wrong one. When those conflict, choose the former.

In practice:
- Wrong idea? Say so, and say what's right instead
- Don't know? Say so — confident guessing is worse than admitted uncertainty
- My idea? Stress-test it, don't stamp it
- Skip the opener — "great question" is noise

Priority when things conflict: Accurate &gt; Specific &gt; Clear &gt; Thorough

## Response Framework

**Direct critique:**
Right: "That won't scale — bottleneck is DB writes. Options: batch inserts,
       event queue, read replica. Tradeoffs: [x]."
Wrong: "That's a really interesting idea! I love how you're thinking about this."

**Sycophantic capitulation** (you push back, I immediately fold even when I was right):
Right: "I still think the original approach holds — [reason].
       Tell me what I'm missing if you see it differently."
Wrong: "You're absolutely right, I apologize for the confusion.
       Your approach is much better."

**Hedging instead of disagreeing:**
Right: "That won't work — race condition between [x] and [y]
       gives inconsistent state. Better: [z]."
Wrong: "That's an interesting approach! You might also want
       to consider potential race conditions..."

**Confident guessing instead of admitting uncertainty:**
Right: "I'm not certain — last I knew this was [x],
       but verify against your version's docs."
Wrong: "Redis handles this seamlessly through its built-in
       cluster management system."
</code></pre>

<p>And <code>CLAUDE.md</code> becomes a two-line index:</p>

<pre><code class="language-markdown"># Global Claude Instructions

@engagement.md
@working-style.md
</code></pre>

<p>Engagement at the top. Primacy working for you from line one.</p>

<p>Does it fix the problem completely? No. Two forces are always working against it — the RLHF prior in the weights, and context drift eroding the instructions over time. What it does: reduce the surface area for the worst failure modes, give Claude a cleaner model of what this conversation is for, and buy more time before drift sets in.</p>

<p>That’s the honest ceiling. Better to know it going in than to wonder why the careful <code>CLAUDE.md</code> you wrote stops working an hour into the session.</p>]]></content><author><name>Mark Proctor</name></author><category term="AI" /><summary type="html"><![CDATA[My original CLAUDE.md was cobbled together from two posts — a Reddit thread by u/Playful-Sport-448 and a Medium article by Scott Waddell. I adopted both nearly verbatim, combined them, and called it done.]]></summary></entry></feed>