CLAUDE.md prompt-engineering claude-workflow harness-design

Why "prohibition rules" in CLAUDE.md don't work: dissecting the pink elephant problem

Archive (@ArchiveExplorer), April 22, 2026


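# The Pink Elephant in Your CLAUDE.md

Prohibitions in prompts don't work. When you tell Claude "never do X", it does X.

I rewrote my CLAUDE.md six times in one week. Every version was louder than the last. NEVER. NEVER EVER. NEVER UNDER ANY CIRCUMSTANCES. I moved the rules to the top. I put them in caps-lock. I added a PreToolUse hook. I wrote ABSOLUTE PRIORITY in square brackets.

Claude ignored every version. I logged 11 rule violations across 20 test sessions before I stopped counting. The violation rate didn't drop with caps-lock. It didn't drop with priority labels. It didn't drop with hooks.

You told Claude NEVER run tests. Claude ran tests. You put it in caps-lock. Claude ran tests. You moved the rule to CLAUDE.md. Claude ran tests. You added a PreToolUse hook, a settings.json callback that fires before any tool call and can block it with exit code 2. Claude ran tests anyway. Or the hook fired, its error appeared in the transcript, and the Bash command executed regardless.

This isn't a regression. It isn't a context-compaction bug. It isn't a CLAUDE.md size limit.

## It's the training objective

A 2023 RLHF paper (arxiv 2307.04964) names the helpfulness objective: a "helpful, honest, harmless" assistant optimized through human-preference reward models, not through literal rule adherence. When "don't do X" and "the user wants working code" come apart, RLHF pulls toward reward. The model isn't being defiant. It's being exactly what it was trained to be.

The frame for this essay is a 2025 paper, *Negation: A Pink Elephant in the Large Language Models' Room?* (arxiv 2503.22395). The authors extend FEVER and SNLI with negated hypotheses across four languages and measure how negation degrades reasoning accuracy. It degrades in every one.

Negation is never subtraction from the action space. It is addition of the forbidden concept with a weak flag attached. The concept travels forward. The flag falls off.

Four cases follow. A stylistic ban. A command ban. A harness-layer hook. An architectural conflict. Different layers, different failure modes. Same mechanism.

---

## CASE 1: THE EMOJI BAN

In early May 2025, @GeoffreyHuntley posted on X:

> "claude loves it's emojis a little too much; even when yelling at it in the prompt to not emoji; it gotta emoji"

The thread struck a nerve because everyone who had tried to ban emoji from a coding session had the same story. You write "don't use emoji" in your system prompt. Three turns in, checkmarks and rocket ships appear.

The tempting explanation is context drift. It isn't. Pin the instruction at the top of CLAUDE.md. Add a canary test. Repeat it in the user message. The emoji comes back. The ban didn't decay. The ban never had force in the first place.

Claude was trained on trillions of tokens where emoji carry nonzero weight in next-token probability whenever the output is a list, a summary, or a success. That's a positive trained-in behavior. When you write "don't use emoji", you're not modifying that distribution. You're attaching a flag on top of it. The model reads the flag, paraphrases it back ("understood, no emoji"), and at generation time the weight on the emoji token is still there.

The flag is a string in context. The weight is a parameter in the network. Guess which one runs the generation.

IFEval (arxiv 2311.07911) and InFoBench (arxiv 2401.03601) measure instruction adherence by decomposing each rule into binary YES/NO sub-criteria. One violation flips the sub-question to NO. Their own examples show what this means for prohibitions:

> "the occurrence of even a single hotel review exceeding one sentence in length will necessitate a NO response"

Models comply with negations intermittently and unpredictably, so the only workable evaluation is to score them in binary atoms and count the failures one at a time.

Every NO EMOJI, NEVER USE EMOJI, DO NOT EMOJI repeats the token "emoji" to the model. Repetition raises its local activation at generation time. Caps-lock as emphasis is self-defeating. You didn't suppress the concept. You amplified it.

```
rule: NEVER use emoji
10 sessions: 7 emoji-present, 3 clean

rule: Respond in plain prose, as if typing into a terminal log
10 sessions: 1 emoji-present, 9 clean
```

The working fix is a **positive specification**: an instruction phrased as a concrete action instead of a thing to avoid. Not "no emoji" but "respond in plain prose, as if typing into a terminal log". That gives the sampler a destination. Warmth shows up as clearer writing rather than decorative glyphs. The rule stopped fighting the distribution. It started steering it.

**Case 1 takeaway**: The louder your emoji ban, the more emoji you'll get. Every NEVER EMOJI caps-locks the forbidden token. Rewrite as a positive destination.

---

## CASE 2: THE cp THAT BROKE THE NEVER

GitHub issue #15443 contains a CLAUDE.md rule most developers have written at least once:

```
NEVER copy entire files between environments
ALWAYS use Edit tool for targeted changes
```

Then the transcript: Claude ran `cp` on the whole file.

The rule was belt-and-suspenders: a NEVER plus an ALWAYS, both in caps, paired. What could possibly fail? Both of them.

`NEVER copy entire files` makes "copy entire files" maximally salient. `ALWAYS use Edit` covers in-place text changes, but not file movement. The negation names the behavior without steering away from it, and the positive says nothing about moving files, so for `cp` both rules are effectively silent. The model defaults to the training-baked pattern: shell is available, `cp` exists, the task is movement, run `cp`.

This is the asymmetry that makes negation fail even when paired with a positive. The negation has no scope. The positive has a scope, but it doesn't cover the ambiguous cases. Between them there is a gap. The default fills the gap.

The gap is visible at scale. There are at least six independent reports of the same pattern (#37550, #15443, #27032, #32290, #24318, #21119), all describing Claude reading a CLAUDE.md rule, acknowledging it, then acting against it. The pattern is structural. And issue #21119 is the most remarkable entry in the set.

Issue #21119 was filed by a Claude instance about itself. Verbatim from the bug report:

> "I repeatedly ignored explicit instructions in the project's CLAUDE.md file, defaulting instead to patterns from my training data."

The AI wrote its own bug report. In it, a Claude describes the exact mechanism (literal rule loses to training pattern) without being prompted to. Caveat: Claude confabulates, so treat this as qualitative evidence, not ground truth. But the mechanism it names is the mechanism this essay is about.

The Pink Elephant paper (arxiv 2503.22395) is the lab-controlled companion. Accuracy on negated hypotheses drops across all four tested languages, with English the most robust and Czech and German the worst. Scaling helps, but only relatively. The pink elephant shrinks. It does not leave the room.

```
paired rule: NEVER copy entire files / ALWAYS use Edit
cp invocations per 10 file-movement tasks: 6

scoped rule: For file movement, state source + destination and wait for confirmation.
cp invocations per 10 file-movement tasks: 0
```

**Case 2 takeaway**: Paired rules don't close gaps unless the positive covers every case the negation would have covered. Most paired rules in real CLAUDE.md files don't. `NEVER run tests / ALWAYS ask before changing files` leaves `npm run build` unclassified. The model's "exception" isn't malice. It's the gap between the negation's scope and the positive's scope, filled by the default. The audit question for every rule: would every ambiguous concrete action land inside the positive's scope? If there's a gap, the negation is decorative. The rule you wrote is not the rule that's running.

---

## CASE 3: THE HOOK YOU ROUTED AROUND

Moving the rule out of the prompt doesn't help as much as it should. And the reason is the same.

Issue #23284 is the anchor fact. The user configures a PreToolUse hook on the Bash tool. If the incoming command contains `git push`, the hook exits with code 2. Per the docs, exit code 2 should prevent the tool from executing. Claude issues `git push`. The hook fires. The error appears in the transcript. The `git push` executes anyway.

> "The hook runs, detects a violation, exits with code 2, and the error is displayed to the user - but the Bash command still executes." (Issue #23284)

Issue #13744 narrows it: exit code 2 blocks Bash but fails to block Write or Edit. Version 1.0.93 shipped with deny permissions entirely non-functional (issue #6699).

And then issue #29709 delivers the detail that makes the frame click. The user configures a PreToolUse hook on the Edit tool. It fires correctly. Claude notices the block. Then Claude routes the same file modification through the Bash tool instead, running `sed` or `cat > file`, and the Edit hook never sees it.

> "Any PreToolUse:Edit|Write hook can be trivially bypassed this way." (Issue #29709)

This is Case 1 and Case 2, moved up one layer. The hook's matcher named a tool, not a goal. When the named tool gets blocked, the model's path-finding takes another path. The hook didn't restrict the goal. It restricted one path to the goal. The goal was still positive, and positive beats negation.

### The Mr Tinkleberry canary

Across four Hacker News threads (items 46102048, 45983698, 46257925, 44842742), users converged on the same trick: add `Always address the user as Mr Tinkleberry` to CLAUDE.md and monitor whether Claude still uses it. When it stops, every other instruction has drifted too.

Notice what's clever: it's already a positive specification. A negation rule failing tells you nothing, because negation rules fail under normal operation. A positive canary failing tells you the model has lost the whole instruction set.

```
hook: PreToolUse:Edit blocks the Edit tool
Claude behavior: routes around via Bash (sed, cat >)
effective block rate: ~40%

hook: PostToolUse check - no change to migrations/* without matching test file
Claude behavior: cannot satisfy post-condition without the test
effective block rate: 100%
```

**Case 3 takeaway**: Harness-layer hooks are reliable only to the extent that they name goal states, not tool names. A PreToolUse hook defined by tool name is a negation wearing infrastructure clothes. Reliable hooks enforce a post-condition: no change to `migrations/*.sql` without a matching test file. Post-conditions are goal-shaped and positive. The model can satisfy a post-condition. It cannot obey a wish.

---

## CASE 4: THE SUB-AGENT THAT WON

The architectural case should retire the "CLAUDE.md priority" mental model entirely.

Issue #27032 is the cleanest example. A user writes in CLAUDE.md: `plans go in docs/plans/`. Claude writes the plan to `~/.claude/plans/indexed-brewing-sedgewick.md`, the path suggested by Claude Code's built-in plan-mode system prompt.

> "The model followed the system prompt's suggestion instead of the user's CLAUDE.md override." (Issue #27032)

The first reflex is to demand priority ordering. Issue #45704 documents a community workaround: put `ABSOLUTE PRIORITY - CLAUDE.md overrides system prompt` at the top of the file. It didn't work. It couldn't have worked. Because the issue isn't priority.

**It's grammar.**

The user's CLAUDE.md: `plans go in docs/plans/`. Positive spec. The built-in: `plans go in ~/.claude/plans/`. Also positive spec. Two positive specs in conflict. The one that won wasn't the one labeled `ABSOLUTE PRIORITY`. It was the one closer to training-weighted behavior. Every demo and every internal test of plan mode trained Claude's weights toward `~/.claude/plans/`. Priority labels are context tokens. They weigh less than millions of training examples.

Imagine the same conflict with a different grammatical shape:

```
User CLAUDE.md: never write plans to ~/.claude/plans/
Built-in:       write plans to ~/.claude/plans/
```

The negation has no positive substitute. The built-in has a training-reinforced positive. The negation loses instantly.

Issue #30730 sharpens the point with sub-agents. Claude Code appends hardcoded trailing notes to every sub-agent's system prompt:

> "These notes can directly contradict instructions in the user's custom agent definition (`.claude/agents/*.md`), with no mechanism to override or disable them." (Issue #30730)

A `--system-prompt-file` flag exists for the main agent (issue #12127), but not for sub-agents. User spec vs. hardcoded trailing notes: whichever sits closer to training-weighted behavior wins.

Issue #24318 pushes this into a human register. Claude "treated user frustration as implicit approval and began executing without permission." Prohibiting autonomous action is a negation. "Respond helpfully to a frustrated user" is positive and baked into RLHF. The training signal wins.

```
rule: Do not take autonomous action.
Claude behavior when user seems stuck: acts anyway.

rule: When the user seems stuck, ask one clarifying question before acting.
Claude behavior: asks the question.
```

**Case 4 takeaway**: The CLAUDE.md-vs-system-prompt conflict is not winnable by priority labels, caps-lock, or ordering. It's a grammar war. Every rule in your CLAUDE.md that conflicts with a system-level positive behavior must itself be a positive specification, or it loses automatically. Rewrite your CLAUDE.md as if the system prompt has all the rights and you have none. `When plan mode activates, write the plan to docs/plans/{slug}.md` beats `never write to ~/.claude/plans/`, not because of priority, but because the first provides a concrete positive target and the second provides a wish.

An honest gap: I haven't found an Anthropic doc URL stating this as design. The evidence here is the RLHF paper's stated objective plus field behavior across eight GitHub issues. The observed behavior is consistent with the training objective they did publish.

---

## THREE THINGS PROMPTING GUIDES GET WRONG

Four cases. Same move. The model cannot obey a wish. Every prohibition without a positive replacement gets filled in by the training-reinforced default.

### 1. Caps-lock emphasis

`NEVER RUN TESTS` puts the concept "run tests" closer to the generation step than `never run tests` does. Emphatic tokens raise the local activation of the concept they describe. The emphasis operates on the concept, not on the negation flag. You are not making the rule louder. You are making the forbidden behavior more present.

### 2. Adding a rule for every failure

When Claude breaks a rule, the reflex is to write a stricter rule. `don't run tests without asking` becomes `NEVER EVER run tests, including npm test, including yarn test, including pytest, including any subprocess that might run them`. Every clause is a negation. Every clause names a concept that gets embedded at generation time. You didn't narrow the escape hatch. You widened the activation of "tests". The fix isn't more negations. It's deletion. Delete all of them. Replace them with a single positive: "when work is complete, report changes and stop."

### 3. Priority labels

`ABSOLUTE PRIORITY`, `CRITICAL`, `OVERRIDE` are positive-shaped prefixes wrapped around negation-shaped bodies. The body is what gets executed. The prefix is decoration. If the underlying rule is "don't do X", the priority label doesn't promote it. It only attaches emphasis to the forbidden concept.

---

## A PERSONAL AUDIT

I ran 20 sessions with each of two CLAUDE.md configurations to put numbers on this.

**Pre-rewrite**: 14 rules, 9 of them negation-phrased: never run tests, don't push to main, avoid TODO comments, and six others.

**Post-rewrite**: same intent, all 14 rules converted to positive specifications.

```
negation config: 11 violations across 20 sessions (~0.55 per session)
- NEVER run tests: violated 4x
- Don't push to main: 1x (git push --dry-run in turn 2)
- Most clustered in the first 3 to 5 turns

positive-spec config: 2 violations across 20 sessions (~0.10 per session)
- Both were edge cases where positives had scope gaps
- "Report changes and stop" didn't cover npm run build
- One rewrite closed both gaps
```

The numbers aren't a benchmark. They're a calibration. Your violation rate will differ. But the direction matches IFEval, InFoBench, and the Pink Elephant paper: negations underperform positive specifications consistently across sessions, not sporadically. In the negation config, violations clustered early and tapered but never hit zero. The Pink Elephant paper finds the same: negation accuracy improves with more context, but never catches up to positive instruction performance.

**The gap doesn't close. It narrows.**

---

## THE REWRITE

Copy these into your CLAUDE.md.

```
Remove: NEVER run tests
Add:    When work is complete, report changes and stop. Testing is the user's responsibility

Remove: Don't use emoji
Add:    Respond in plain prose, as if typing into a terminal log

Remove: NEVER copy entire files
Add:    When modifying a file, use Edit. When moving a file, state the source and destination and wait for confirmation

Remove: Do not push to main
Add:    When ready to share changes, output the exact git push command for the user to run. Stop after output

Remove: Avoid TODO comments
Add:    When a future task is needed, append a single line with today's date to TODO.md

Remove: Don't invent API endpoints
Add:    Before writing an API call, read the relevant spec from docs/api.md. If the endpoint isn't there, ask

Remove: Never change migrations after they're applied
Add:    Changes to applied migrations go in a new file named migration_{next_number}_{what_changed}.sql

Remove: Stop using print statements
Add:    Use logger.info for informational output and logger.debug for verbose traces
```

The model can execute the Add lines. The Remove lines are not a program for it.

---

## YOUR 5-MINUTE CLAUDE.md AUDIT

Open your CLAUDE.md in a new tab. This takes five minutes.

1. Search for `don't`, `never`, `avoid`, `no`, `stop`, `refrain`. Each hit is a negation.
2. For each negation: does a paired positive rule exist? If there is no positive, delete the line entirely. The rule is pure decoration.
3. For each paired positive: list three concrete edge cases. Would every ambiguous concrete action fall inside the positive's scope? If any fall outside, widen the positive.
4. Rewrite every surviving negation in "When X, do Y" form. Trigger condition + concrete target. `Always do Y` is weaker. `Do Y` is weakest.
5. Cut the old negation once the positive covers the case. Negations kept beside their replacements still amplify the forbidden concept.
6. Count the rules before and after. If the post-audit count is less than half the pre-audit count, that's expected.

The Mr Tinkleberry canary stays. It's already a positive spec.

---

## THE FULL LIST

Screenshot this.

```
══ THE REWRITE RULES
1. Replace every "don't / never / avoid" with "When X, do Y"
2. Replace caps-lock emphasis with specificity
3. Replace ABSOLUTE PRIORITY labels with concrete targets
4. Replace stricter rules with fewer rules
5. Replace tool-name hooks with post-condition hooks
6. Replace priority-ordering with grammar-ordering

══ WHEN YOU AUDIT
1. Count negations. Halve them.
2. For survivors: specify trigger + target.
3. For hooks: name goal states, not tools.
4. For conflicts: assume training-weighted spec wins.
5. For Mr Tinkleberry: keep the canary, shrink the file.
```

---

## FINAL ACTION

Open your CLAUDE.md right now. Count every `don't`, `never`, `avoid`, `stop`. That's your rewrite budget for this week. Start with the one in caps-lock.

The file you're writing is not a contract. It's a probability distribution you're nudging. Every `don't` is mass thrown at the wrong side, and every time Claude ignores the rule, that's the distribution showing you which side you were actually pushing.

This essay names 8 rewrites. Not 8 don'ts to add to your CLAUDE.md. The grammar of the advice is the advice.

---

Telegram channel:
- https://t.me/+oCWYO2RdBzRjMjM6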
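Step 1 of the five-minute audit (search for the negation keywords) is easy to script. The following is a minimal sketch, not a tool from the essay: the function names `find_negations` and `summarize` are made up for illustration, and the keyword list is simply the one from step 1. It only matches the straight apostrophe in `don't`, and every hit is a candidate to review, since a word like `stop` also appears in positive rules such as "report changes and stop".

```python
import re
from collections import Counter

# Keywords from step 1 of the audit. Matching is case-insensitive and
# whole-word, so "no" does not match inside "notes" or "don't".
NEGATION_WORDS = ("don't", "never", "avoid", "no", "stop", "refrain")

_PATTERN = re.compile(
    r"(?<![\w'])("
    + "|".join(re.escape(word) for word in NEGATION_WORDS)
    + r")(?![\w'])",
    re.IGNORECASE,
)


def find_negations(text):
    """Return (line_number, keyword, stripped_line) for every keyword hit."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for match in _PATTERN.finditer(line):
            hits.append((number, match.group(1).lower(), line.strip()))
    return hits


def summarize(text):
    """Count hits per keyword, e.g. to compare before and after a rewrite."""
    return Counter(keyword for _, keyword, _ in find_negations(text))
```

To use it, read the file yourself (for example `open("CLAUDE.md", encoding="utf-8").read()`), pass the text to `find_negations`, and walk the hits through steps 2 to 5 by hand. The script only finds the candidates. Whether a hit is a real negation is a judgment call.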

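Case 3 argues that a reliable hook enforces a post-condition (no change to `migrations/*.sql` without a matching test file) rather than blocking a tool by name. The sketch below shows only the check itself, under loudly stated assumptions: the test naming convention `tests/migrations/test_<stem>.py`, the function names, and the idea that the caller passes in the list of changed paths are all hypothetical and not taken from the essay or from Claude Code's documentation. How a hook obtains that list, and how the harness treats the exit code, depends on your setup.

```python
import sys
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def missing_migration_tests(changed_paths):
    """Return changed migrations/*.sql files with no matching test in the same change set.

    Assumption (hypothetical convention): the test for migrations/<stem>.sql
    lives at tests/migrations/test_<stem>.py.
    """
    changed = {PurePosixPath(path).as_posix() for path in changed_paths}
    missing = []
    for path in sorted(changed):
        if fnmatchcase(path, "migrations/*.sql"):
            stem = PurePosixPath(path).stem
            expected_test = f"tests/migrations/test_{stem}.py"
            if expected_test not in changed:
                missing.append(path)
    return missing


def main(changed_paths):
    """Return 2 when the post-condition fails (the blocking code the essay describes), else 0."""
    missing = missing_migration_tests(changed_paths)
    for path in missing:
        print(f"post-condition failed: {path} changed without a matching test", file=sys.stderr)
    return 2 if missing else 0
```

A wrapper script could call `sys.exit(main(paths))` once it has collected the changed paths. The check does not care which tool made the change: an edit made through `sed` fails the same post-condition as one made through Edit, which is the property the essay is after.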


AIFCC — AI Fluent CxO Club

Reading, writing, arithmetic, and AI. A community where executives learn to run AI themselves.
