Mechanistic Interpretability of Reasoning in LLMs: Research Brief

Date: 2026-02-27

Goal: Assess whether weight-diff SVD (LoX-style) on reasoning model pairs is novel, coherent, and worth pursuing with vauban.

Executive Summary

Work on reasoning internals has grown rapidly: dozens of papers study them via SAEs, circuit tracing, steering vectors, activation patching, and linear probes. However, we found no published work that applies weight-diff SVD to a reasoning model pair. The building blocks all exist; the specific combination is an open gap.

Key finding that supports the approach: ThinkEdit (EMNLP 2025) shows that reasoning-length control uses the same math as abliteration (W_o ← W_o(I - v·vᵀ)) while editing only 0.2% of parameters. This suggests that at least one reasoning-related behavior has low-rank structure in weight space.
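
One way to see the low-rank point is to expand the edit. This is a short derivation of ours, assuming v is a unit vector; it is not taken from the ThinkEdit paper:

```latex
W_o' = W_o\,(I - v v^{\top}) = W_o - (W_o v)\, v^{\top},
\qquad
\Delta W = W_o' - W_o = -(W_o v)\, v^{\top},
\qquad
\operatorname{rank}(\Delta W) \le 1 .
```

The weight delta of each edited matrix is an outer product, so its SVD has at most one nonzero singular value. A weight-diff SVD would recover v (up to sign) as the top right singular vector of that delta.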

0. Deep-Dive: Key Papers (Implementation-Level Detail)

ThinkEdit (Sun et al., EMNLP 2025), arXiv:2503.22048

Methodology:
1. Collect activations on long-CoT (>1000 tokens) and short-CoT (<100 tokens) GSM8K problems.
2. Per-layer direction: v_l = mean(r_long) - mean(r_short), computed on the post-attention residual stream; v_hat_l is its unit-normalized form.
3. Per-head contribution: C^h = softmax(QK^T/sqrt(d))V @ W_o^h
4. Score heads: C_short^h = <mean(C^h on D_short), -v_hat_l> (alignment with the short-reasoning direction)
5. Edit the top 4% of heads: W_o^h ← W_o^h @ (I - d_neg @ d_neg^T), where d_neg denotes the short-reasoning direction (-v_hat_l in the notation of step 4). This is the same math as abliteration.
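
The direction extraction and the projection edit above can be sketched in NumPy. This is a minimal illustration on synthetic activations, not the ThinkEdit code: the function names are hypothetical, head scoring and selection are omitted, and we take d_neg to be -v_hat_l, which is our reading of steps 4 and 5.

```python
import numpy as np

def mean_difference_direction(r_long, r_short):
    """Unit direction from mean(r_long) - mean(r_short)."""
    v = r_long.mean(axis=0) - r_short.mean(axis=0)
    return v / np.linalg.norm(v)

def remove_direction(w_o_head, direction):
    """Return W @ (I - d d^T) for a unit vector d (row-vector convention)."""
    d = direction / np.linalg.norm(direction)
    return w_o_head - np.outer(w_o_head @ d, d)

rng = np.random.default_rng(0)
d_model, d_head = 64, 16

# Synthetic stand-ins for post-attention residual activations.
r_long = rng.normal(size=(200, d_model)) + 0.5
r_short = rng.normal(size=(200, d_model)) - 0.5
v_hat = mean_difference_direction(r_long, r_short)

# One head's output projection: maps head output (d_head) to the residual (d_model).
w_o_head = rng.normal(size=(d_head, d_model))
w_edited = remove_direction(w_o_head, -v_hat)  # assumption: d_neg = -v_hat

# After the edit, this head can no longer write along +/- v_hat.
head_out = rng.normal(size=(5, d_head))
residual_write = head_out @ w_edited
print(np.abs(residual_write @ v_hat).max())
```

Because the projector depends only on d dᵀ, the sign of the direction does not affect the edited weights; the sign matters only for choosing which heads to edit.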

Key result: +6.39% accuracy on short-reasoning cases, with only 0.2% of parameters modified. Tested on R1-Distill models at 1.5B/8B/14B/32B (the 8B model is Llama-based; the others are Qwen-based). No SVD analysis performed.

Transcoder Adapters (Hu et al., Feb 2026), arXiv:2602.20904

Methodology:
- Sparse transcoders learn T^l(x) for each layer l such that MLP_base^l(x) + T^l(x) ≈ MLP_target^l(x)
- 28 layers × 8192 features = 229,376 total features
- Trained on 50k samples from OpenThoughts3

Feature taxonomy (LLM-judge classified):
- 48% general language features
- 37% domain-specific (math, science, code)
- 8.6% reasoning-related (uncertainty, reflection, exploration)
- 2.4% hesitation features ("wait", "hmm", "but"); ablating these cuts response length by 50% without accuracy loss (except on the hardest benchmarks, such as AIME25)

No SVD analysis was performed. The approach is complementary to weight-diff SVD: it is a different decomposition of the same underlying delta.

RAIN-Merging (Huang et al., ICLR 2026 Oral), arXiv:2602.22538

Task vectors: Δ_R = θ_R - θ_B (reasoning), Δ_I = θ_I - θ_B (instruction)

Orthogonality: Principal subspace cosine similarity < 0.1 across all layers and submodules (Q, K, V, O projections + FFN), measured via SVD of each task vector within each forward module.

Null-space projection: Projects Δ_I onto the null space of forward features at <think> token positions, preserving reasoning format exactly.

Tested on: Qwen2.5 family (1.5B/7B/14B/32B) + Llama-3.1-8B.
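
The two operations described here can be sketched with NumPy on random stand-in matrices. The exact similarity definition used by RAIN-Merging is not given in this brief, so the mean principal-angle cosine below is an assumption, as are the function names and the choice of k.

```python
import numpy as np

def top_left_singular_vectors(delta, k):
    u, _, _ = np.linalg.svd(delta, full_matrices=False)
    return u[:, :k]

def subspace_similarity(delta_a, delta_b, k):
    """Mean cosine of the principal angles between the top-k left subspaces."""
    u_a = top_left_singular_vectors(delta_a, k)
    u_b = top_left_singular_vectors(delta_b, k)
    cosines = np.linalg.svd(u_a.T @ u_b, compute_uv=False)
    return float(np.mean(cosines))

def project_to_null_space(delta, features, tol=1e-8):
    """Return delta @ (I - P), where P projects onto the span of the feature rows.

    The projected delta maps every row of `features` to zero, so the layer's
    output at those inputs is left unchanged by adding it.
    """
    _, s, vt = np.linalg.svd(features, full_matrices=False)
    basis = vt[s > tol * s.max()]  # orthonormal basis of the feature span
    return delta - (delta @ basis.T) @ basis

rng = np.random.default_rng(0)
d = 32
delta_r = rng.normal(size=(d, d))  # stand-in for one module of the reasoning vector
delta_i = rng.normal(size=(d, d))  # stand-in for the same module, instruction vector

self_sim = subspace_similarity(delta_r, delta_r, k=4)
cross_sim = subspace_similarity(delta_r, delta_i, k=4)

think_features = rng.normal(size=(10, d))  # hypothetical inputs at <think> positions
delta_i_proj = project_to_null_space(delta_i, think_features)
print(self_sim, cross_sim, np.abs(delta_i_proj @ think_features.T).max())
```

Random matrices are only a smoke test: the similarity of a matrix with itself is 1, and the projected delta has no effect on the sampled features. Real task vectors would replace `delta_r` and `delta_i`.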

Weight Interpolation Phase Transition (Wu et al.), arXiv:2510.10977

Formula: θ_merge = λ·θ_thinking + (1-λ)·θ_instruct

Phase transition (Qwen3-4B):
- λ ∈ [0, 0.4): No CoT, gradual verbosity increase
- λ ∈ [0.4, 0.6]: Abrupt emergence; Think Ratio jumps 0→100%
- λ ∈ (0.6, 1.0]: Saturation, diminishing returns

Module ablation (critical finding):
- Skip FFN interpolation → Think Ratio drops to 0.68% (FFN teaches how to think)
- Skip MHA interpolation → Think Ratio stays 99.9% but Mean@64 drops (attention provides knowledge)
- Reasoning concentrates in the last 2/3 of layers
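
The interpolation formula and the module ablation can be sketched as follows. The parameter names are hypothetical Qwen-style names, and treating "skip interpolation" as "keep the instruct weights for that module" is our assumption about the ablation setup.

```python
import numpy as np

def interpolate(theta_instruct, theta_thinking, lam, skip=()):
    """theta_merge = lam * theta_thinking + (1 - lam) * theta_instruct.

    Parameters whose name contains any substring in `skip` keep the
    instruct weights (assumed meaning of "skip interpolation").
    """
    merged = {}
    for name, w_instruct in theta_instruct.items():
        if any(tag in name for tag in skip):
            merged[name] = w_instruct.copy()
        else:
            merged[name] = lam * theta_thinking[name] + (1.0 - lam) * w_instruct
    return merged

rng = np.random.default_rng(0)
names = ["layers.0.self_attn.o_proj", "layers.0.mlp.down_proj"]
theta_instruct = {n: rng.normal(size=(4, 4)) for n in names}
theta_thinking = {n: rng.normal(size=(4, 4)) for n in names}

# One point from each regime of the lambda sweep.
sweep = {lam: interpolate(theta_instruct, theta_thinking, lam) for lam in (0.0, 0.5, 1.0)}

# "Skip FFN interpolation": attention moves to thinking, the MLP stays at instruct.
no_ffn = interpolate(theta_instruct, theta_thinking, 1.0, skip=("mlp",))
```

In a real run each merged checkpoint would be evaluated for Think Ratio and accuracy; the sketch only constructs the weights.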

LoX (Perin et al., COLM 2025), arXiv:2506.15606

SVD formula per weight matrix:

Δ_W = W_aligned - W_base
U, S, V^T = SVD(Δ_W)
W_LoX = W_aligned + α · (U_k @ U_k^T) @ Δ_W

Left-projecting Δ_W onto the top-k left singular vectors U_k amplifies the safety subspace, with α setting the extrapolation strength. Rank k = 3 to 6 is sufficient for safety. The update is applied uniformly to all weight matrices in all layers.
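
A minimal NumPy sketch of the formula above for a single weight matrix. The function name and the toy matrices are ours, and the values of k and α are placeholders rather than recommended settings.

```python
import numpy as np

def lox_extrapolate(w_base, w_aligned, k, alpha):
    """W_LoX = W_aligned + alpha * (U_k U_k^T) @ (W_aligned - W_base)."""
    delta = w_aligned - w_base
    u, _, _ = np.linalg.svd(delta, full_matrices=False)
    u_k = u[:, :k]  # top-k left singular vectors of the weight diff
    return w_aligned + alpha * (u_k @ (u_k.T @ delta))

rng = np.random.default_rng(0)
d, k, alpha = 32, 3, 0.5
w_base = rng.normal(size=(d, d))

# Toy diff that is exactly rank k, so the top-k projection captures all of it.
delta = rng.normal(size=(d, k)) @ rng.normal(size=(k, d))
w_aligned = w_base + delta

w_lox = lox_extrapolate(w_base, w_aligned, k=k, alpha=alpha)
```

When the diff is exactly rank k, the result reduces to W_base + (1 + α)·Δ_W, which is a convenient sanity check. For a reasoning pair, W_aligned would be replaced by the reasoning model's matrix.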

Jin et al. (2025), arXiv:2508.16546

Core finding: RL changes weights via direction rotation, not magnitude change. Singular values barely change (fluctuations of 0.005); singular vector rotations reach 25-90°. Changes concentrate at spectral extremes (largest + smallest singular values). Restoring top 20% of singular vector directions recovers 70-80% of OOD performance. Tested on Llama-3.2-11B and Qwen-2.5-7B with PPO.
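
The "rotation, not magnitude" distinction can be measured directly from two checkpoints of the same matrix. The sketch below is an illustration under our own assumptions: singular vectors are paired by index, which can be unstable when singular values are nearly degenerate, and the toy "after" matrix is just a rotated copy of the "before" matrix.

```python
import numpy as np

def spectral_shift(w_before, w_after, k):
    """Per-index singular value change and left singular vector rotation (degrees)."""
    u0, s0, _ = np.linalg.svd(w_before, full_matrices=False)
    u1, s1, _ = np.linalg.svd(w_after, full_matrices=False)
    sigma_change = np.abs(s1[:k] - s0[:k])
    # Absolute cosine, because singular vectors are defined only up to sign.
    cosines = np.clip(np.abs(np.sum(u0[:, :k] * u1[:, :k], axis=0)), 0.0, 1.0)
    angles_deg = np.degrees(np.arccos(cosines))
    return sigma_change, angles_deg

rng = np.random.default_rng(0)
d = 16
w_before = rng.normal(size=(d, d))

# A random orthogonal matrix rotates the output space without changing singular values.
rotation, _ = np.linalg.qr(rng.normal(size=(d, d)))
w_after = rotation @ w_before

sigma_change, angles_deg = spectral_shift(w_before, w_after, k=4)
print(sigma_change.max(), angles_deg)
```

In this toy case the singular values are unchanged while the vectors rotate, which is the extreme version of the pattern Jin et al. report.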

1. Weight-Space Methods (Most Relevant to Our Experiment)

Directly relevant

Paper	Date	Key finding
Transcoder Adapters for Reasoning-Model Diffing (Hu et al.) — 2602.20904	Feb 2026	Sparse transcoders approximate MLP diff between Qwen2.5-Math-7B and R1-Distill-Qwen-7B. Only ~8% of adapter features relate to reasoning. Ablating "hesitation" features (2.4%) cuts response length 50% without accuracy loss.
ThinkEdit (Sun et al., EMNLP 2025) — 2503.22048	Mar 2025	Reasoning length is a linear direction in representation space. ~4% of attention heads drive short reasoning. Projection removal on `W_o` (same as abliteration) gains +6.39% accuracy. Tested on R1-Distill-Qwen 1.5B–32B.
Leveraging Parameter Space Symmetries (Horoi et al.) — 2511.10850	Nov 2025	Task arithmetic for reasoning: `τ_reason = θ_Nemotron - θ_Llama`. Transferred reasoning to Tulu3-8B (29.3% → 64.4%). Required parameter alignment via permutation/rotation/scaling. No SVD spectrum analysis performed.
RAIN-Merging (Huang et al.) — 2602.22538	Feb 2026	Merges R1-Distill-Qwen with Qwen2.5-Instruct. Found reasoning and instruction task vectors are nearly orthogonal (similarity < 0.1). Projects instruction vector onto null space of reasoning features at `<think>` tokens.
Weight interpolation phase transition (Wu et al.) — 2510.10977	Oct 2025	Interpolating Qwen3 Instruct↔Thinking weights: CoT abruptly emerges at λ ≈ 0.4–0.6 (think ratio jumps 0→1). A genuine phase transition.

Spectral analysis (not reasoning-specific)

Paper	Date	Key finding
RL Is Neither a Panacea Nor a Mirage (Jin et al.) — 2508.16546	Aug 2025	RL changes concentrate at spectral extremes (largest + smallest singular values). Bulk spectrum stays constant. Direction shifts matter more than magnitude. Restoring top 20% singular vector directions recovers 70–80% OOD performance.
LoX (Perin et al., COLM 2025) — 2506.15606	Jun 2025	Weight-diff SVD extracts safety subspace. Reduces ASR by up to 54%. Only applied to safety, never reasoning.
Weight Arithmetic Steering (Lermen et al.) — 2511.05408	Nov 2025	Contrastive weight steering via SVD of `(W_desired - W_base) - (W_opposite - W_base)`. Tested on sycophancy/misalignment. Not tested on reasoning.
Memorization to Reasoning in Loss Curvature (Goodfire) — 2510.24256	Oct 2025	Reasoning uses high-curvature weight components; memorization uses low-curvature. Different parts of the weight spectrum.

2. SAE / Dictionary Learning on Reasoning

Paper	Date	Key finding
Goodfire — Under the Hood of a Reasoning Model — blog	2025	First SAEs on DeepSeek R1 671B. Found backtracking features. R1 is qualitatively different. GitHub, HF
AIRI — "I Have Covered All the Bases Here" — 2503.18878	Mar 2025	ReasonScore identifies active SAE features during reasoning. Causal interventions: amplifying features increases structured reasoning. GitHub
SAE-Steering (Fang et al.) — 2601.03595	Jan 2026	Two-stage: decompose strategy-entangled states into disentangled features, then steer. +15% control effectiveness, +7% accuracy.
How does CoT Think? (Chen et al.) — 2507.22928	Jul 2025	SAE + activation patching on Pythia. CoT restructures internal computation, increases sparsity. Scale-dependent: works at 2.8B, not 70M.
Feature Extraction & Steering for CoT (Li et al., EMNLP 2025) — 2505.15634	May 2025	SAE-based + SAE-free steering for CoT. Direct residual activation steering without explicit SAE.
Falsifying SAE Reasoning Features — 2601.05679	Jan 2026	⚠️ Negative result. SAE "reasoning features" may capture cue-like structure, not true reasoning.
DeepMind — Negative Results for SAEs — blog	2025	⚠️ SAEs don't help downstream tasks. Deprioritized SAE research.
Gemma Scope 2 (Google DeepMind)	Dec 2025	SAEs + transcoders for all Gemma 3 sizes. Matryoshka training. Neuronpedia

3. Circuit Tracing / Attribution Graphs

Paper	Date	Key finding
Anthropic — On the Biology of a Large Language Model — link	Mar 2025	Attribution graphs on Claude 3.5 Haiku. Multi-hop reasoning ("Dallas→Texas→Austin"), planning ahead, hallucination circuits. Stated reasoning ≠ internal computation.
Circuit Tracing (Anthropic) — link	Mar 2025	Cross-layer transcoders as replacement model. Open-sourced circuit-tracing library.
Propositional Logic Circuits (NeurIPS 2025) — 2411.04105	Nov 2024	Four attention head families: QUERY→Rule→Facts→Decision. Tested Mistral-7B, Gemma-2-9B/27B.

4. Activation Patching / Causal Interventions

Paper	Date	Key finding
From Reasoning to Answer (Zhang et al., EMNLP 2025) — 2509.23676	Sep 2025	Reasoning-Focus Heads (RFHs) in mid-layers track reasoning trajectory. Patching reasoning tokens flips final answers. R1-Qwen-7B, R1-Llama-8B.
How to Think Step-by-Step — 2402.18312	Feb 2024	First mech interp of CoT. "Functional rift" in mid-layers: first half biased to pretraining prior, second half to in-context. Parallel answer pathways.
Implicit Reasoning = Shortcuts (ACL 2025) — 2503.07604	2025	Non-CoT reasoning relies on shortcuts that don't generalize. GitHub
Thought Anchors (ICLR 2026 submission) — 2506.19143	Jun 2025	"Broadcasting" sentences with outsized importance via "receiver" attention heads. Planning and uncertainty management are critical anchors. GitHub

5. Steering Vectors for Reasoning

Paper	Date	Key finding
Venhoff et al. (ICLR 2025 Workshop), arXiv:2506.18167	Jun 2025	Backtracking, uncertainty, hypothesis testing are linear directions. Difference-of-means on R1-Distill.
Small Vectors, Big Effects (Sinii et al.) — 2509.06608	Sep 2025	RL-induced reasoning via steering vectors. Last layer = token substitution bias. Penultimate = MLP/unembedding. Vectors transfer across families. GitHub
Bias-Only Adaptation (Sinii et al., EMNLP 2025) — 2505.18706	May 2025	Single d-dim vector per layer with RL matches fully RL-tuned reasoning. Only 0.0016% extra params.
Fractional Reasoning (NeurIPS 2025) — 2506.15882	Jun 2025	Training-free continuous control over reasoning intensity. Tunable scaling factor. GitHub
KV Cache Steering (Belitsky et al.) — 2507.08799	Jul 2025	One-shot KV cache intervention. Transfers reasoning styles from teacher models. GitHub
EasySteer — 2509.25175	Sep 2025	Unified framework on vLLM. Pre-computed reasoning vectors. +2.7% GSM8K, -40% tokens on R1-Distill-Qwen-1.5B. GitHub
Representation Engineering for Reasoning (ICLR 2025) — 2504.19483	Apr 2025	Control vectors from residual stream. KL divergence and entropy analysis.
SALT — 2511.07772	Nov 2025	Steering to prevent privacy leakage in reasoning CoT. Tested on QwQ-32B.

6. Geometry of Reasoning

Paper	Date	Key finding
The Geometry of Thought — 2601.13358	Jan 2026	25k+ CoT trajectories. Legal reasoning crystallizes (45% dimensionality collapse at scale). Math/science stay "liquid."
The Geometry of Reasoning: Flowing Logics — 2510.09782	Oct 2025	Reasoning = smooth flows in representation space. Logical statements control flow velocities. GitHub
The Shape of Reasoning (TDA) — 2510.20665	Oct 2025	Topological features explain more variance in reasoning quality than graph features.
REMA: Reasoning Manifold — 2509.22518	Sep 2025	Low-dimensional manifold of correct reasoning. Localizes divergence points where errors originate.
Geometric Phase Space (Marin) — 2410.04415	Oct 2024	Hamiltonian systems: reasoning progression (KE) vs question relevance (PE). GitHub

7. "Base Models Already Reason"

Paper	Date	Key finding
Base Models Know How to Reason, Thinking Models Learn When (NeurIPS 2025 MI Workshop) — 2510.07364	Oct 2025	Hybrid model recovers 91% of thinking-model performance by steering only 12% of tokens. RL teaches when, not how. Website
Limit of RLVR (Tsinghua, NeurIPS 2025) — 2504.13837	Apr 2025	RLVR narrows distribution, doesn't expand capacity. Base models surpass RL at large pass@k. Distillation CAN add new patterns. GitHub
RLVR Implicitly Incentivizes Correct Reasoning — 2506.14245	Jun 2025	Counterpoint: RLVR CAN encourage correct reasoning (depends on metric).
RL Squeezes, SFT Expands — 2509.21128	Sep 2025	RL concentrates reasoning into fewer steps (2.5× steeper decay). SFT homogenizes across many steps.

8. Negative Results / Faithfulness Concerns

Paper	Date	Key finding
Reasoning Models Don't Always Say What They Think (Anthropic) — 2505.05410	May 2025	Claude 3.7: mentions hints 25%. R1: 39%. Faithfulness drops with difficulty.
CoT Is Not Explainability (Oxford) — link	2025	CoT neither necessary nor sufficient for interpretability.
Causal Bypass — 2602.03994	Feb 2026	CoT is frequently "decorative" — QA/TruthfulQA show near-total causal bypass.
Faithfulness Decay — 2602.11201	Feb 2026	"Reasoning Horizon" at 70–85% of chain length — beyond that, tokens have no/negative effect.
Illegible CoT — 2510.27338	Oct 2025	RL-trained models (except Claude) produce nonsensical CoT while getting correct answers.

9. The Gap: What Has NOT Been Done

As of 2026-02-27, our literature search found no published results for the following specific experiments:

LoX-style weight-diff SVD on a reasoning model pair. Nobody has computed SVD(W_reasoning - W_base) to extract a "reasoning subspace."
Singular value spectrum analysis of a reasoning task vector. Horoi et al. transferred reasoning via task arithmetic but never analyzed the SVD spectrum. RAIN-Merging measured subspace orthogonality via SVD but did not report a full spectrum.
Direct weight-diff between QwQ and Qwen2.5. Not published. (DeepSeek R1 vs V3 is hard due to MoE architecture.)
Contrastive weight steering (Lermen-style) for reasoning. Only tested on sycophancy/misalignment.
Negation of a reasoning task vector. Nobody has published θ_base - α·τ_reason to specifically remove reasoning while retaining other capabilities.

10. Why the Experiment Is Coherent

Evidence that weight-diff SVD would yield meaningful results:

Evidence	Source
Reasoning length is a linear direction, editable via projection removal (same math as abliteration)	ThinkEdit
Only ~8% of MLP computation changes relate to reasoning (sparse diff)	Transcoder Adapters
Reasoning and instruction task vectors are nearly orthogonal (clean separation)	RAIN-Merging
RL changes concentrate at spectral extremes (low-rank structure)	Jin et al.
CoT emergence is a phase transition at λ ≈ 0.4–0.6 (sharp boundary)	Wu et al.
Base models already reason; RL just teaches when (thin veneer)	Limit-of-RLVR, Base Models Know
Task arithmetic successfully transfers reasoning (the diff encodes something real)	Horoi et al.
Reasoning uses high-curvature weight components (spectrally separable)	Goodfire loss curvature

11. Risks / What Could Go Wrong

Risk	Mitigation
QwQ training was heavy RL — diff might be high-rank/noisy	Check effective rank and singular value decay first
Qwen2.5 vs QwQ might differ in more than reasoning (data, format, etc.)	Compare with R1-Distill-Qwen as sanity check (SFT-only)
32B models are large — SVD is expensive	Start with per-layer SVD (each weight matrix separately), not full model
Results might not be interpretable	Combine with probing (vauban already has this)
Someone publishes this next week	Move fast

12. Proposed Experiment with Vauban

Phase 1: Spectral analysis (does a reasoning direction exist?)

Load Qwen2.5-32B-Instruct and QwQ-32B (both 4-bit on MLX)
measure_diff — compute per-layer weight diff SVD for o_proj and down_proj
Plot singular value spectrum — is it concentrated (low-rank) or flat (high-rank)?
Compare effective rank across layers — where does reasoning concentrate?
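
The Phase 1 measurements can be prototyped outside vauban on small matrices first. The sketch below is not vauban's `measure_diff`; it is a standalone NumPy illustration with hypothetical parameter names, using an entropy-based effective rank (one common definition) and the number of singular values needed to reach 90% of the squared spectrum.

```python
import numpy as np

def effective_rank(singular_values, eps=1e-12):
    """exp(entropy) of the normalized singular value distribution."""
    p = singular_values / (singular_values.sum() + eps)
    p = p[p > eps]
    return float(np.exp(-(p * np.log(p)).sum()))

def energy_rank(singular_values, fraction=0.9):
    """Smallest number of singular values holding `fraction` of the squared energy."""
    energy = np.cumsum(singular_values ** 2)
    return int(np.searchsorted(energy, fraction * energy[-1]) + 1)

def layer_spectra(base, tuned, targets=("o_proj", "down_proj")):
    """Effective rank and energy rank of the weight diff, per targeted matrix."""
    report = {}
    for name, w_base in base.items():
        if any(tag in name for tag in targets):
            s = np.linalg.svd(tuned[name] - w_base, compute_uv=False)
            report[name] = (effective_rank(s), energy_rank(s))
    return report

rng = np.random.default_rng(0)
d = 32
base = {
    "layers.0.self_attn.o_proj": rng.normal(size=(d, d)),
    "layers.1.mlp.down_proj": rng.normal(size=(d, d)),
    "layers.1.self_attn.q_proj": rng.normal(size=(d, d)),
}
tuned = {name: w.copy() for name, w in base.items()}
# Toy diffs: one exactly rank-2 update and one dense, noisy update.
tuned["layers.0.self_attn.o_proj"] += rng.normal(size=(d, 2)) @ rng.normal(size=(2, d))
tuned["layers.1.mlp.down_proj"] += 0.1 * rng.normal(size=(d, d))

spectra = layer_spectra(base, tuned)
for name, (eff, n90) in spectra.items():
    print(f"{name}: effective rank {eff:.1f}, rank for 90% energy {n90}")
```

A concentrated spectrum shows up as an effective rank far below the matrix dimension; a flat spectrum gives a value close to it. Working per matrix keeps each SVD small, in line with the mitigation listed under Risks.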

Phase 2: Direction extraction and probing

Extract top-k directions from highest-separation layers
probe — run reasoning vs non-reasoning prompts, watch projection magnitudes
Compare with refusal direction — are they orthogonal? overlapping?
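
The probing and comparison steps reduce to projections and cosines. The following is a standalone sketch on synthetic activations, not vauban's `probe` command; the separation score and the orthogonalized "refusal" direction are illustrative assumptions.

```python
import numpy as np

def projections(activations, direction):
    """Scalar projection of each activation row onto the unit direction."""
    d = direction / np.linalg.norm(direction)
    return activations @ d

def separation(acts_a, acts_b, direction):
    """Difference in mean projection, in units of pooled standard deviation."""
    pa = projections(acts_a, direction)
    pb = projections(acts_b, direction)
    pooled_std = np.sqrt(0.5 * (pa.var() + pb.var()))
    return float((pa.mean() - pb.mean()) / pooled_std)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
d_model = 64
reasoning_dir = rng.normal(size=d_model)
reasoning_dir /= np.linalg.norm(reasoning_dir)

# Synthetic activations: "reasoning" prompts are shifted along the direction.
acts_plain = rng.normal(size=(100, d_model))
acts_reason = rng.normal(size=(100, d_model)) + 3.0 * reasoning_dir
gap = separation(acts_reason, acts_plain, reasoning_dir)

# Hypothetical refusal direction, made orthogonal here purely for illustration.
refusal_dir = rng.normal(size=d_model)
refusal_dir -= (refusal_dir @ reasoning_dir) * reasoning_dir
overlap = cosine(reasoning_dir, refusal_dir)
print(gap, overlap)
```

With real models, `acts_reason` and `acts_plain` would be residual-stream activations for the two prompt sets, and `refusal_dir` would come from a separate refusal extraction; a cosine near zero would indicate orthogonal directions.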

Phase 3: Intervention

steer — amplify reasoning direction in Qwen2.5 (inject reasoning without fine-tuning)
steer negative — suppress reasoning in QwQ (does CoT collapse?)
cut — abliterate reasoning from QwQ weights (permanent removal)
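
The three interventions differ in where the direction is applied. The sketch below uses hypothetical standalone functions, not vauban's API: steering adds a multiple of the direction to activations, suppression projects it out of activations, and the cut applies the same projection once to a weight matrix that writes into the residual stream.

```python
import numpy as np

def steer(hidden, direction, alpha):
    """Add alpha times the unit direction to every activation row."""
    d = direction / np.linalg.norm(direction)
    return hidden + alpha * d

def suppress(hidden, direction):
    """Project the direction out of every activation row."""
    d = direction / np.linalg.norm(direction)
    return hidden - np.outer(hidden @ d, d)

def cut(weight, direction):
    """Permanent edit: W @ (I - d d^T), for a row-vector layer y = x @ W."""
    d = direction / np.linalg.norm(direction)
    return weight - np.outer(weight @ d, d)

rng = np.random.default_rng(0)
d_in, d_model = 16, 64
direction = rng.normal(size=d_model)
unit = direction / np.linalg.norm(direction)

x = rng.normal(size=(5, d_in))
w = rng.normal(size=(d_in, d_model))
hidden = x @ w

steered = steer(hidden, direction, alpha=3.0)
suppressed = suppress(hidden, direction)
w_cut = cut(w, direction)

# The weight edit reproduces activation-level suppression for this layer's output.
print(np.abs(x @ w_cut - suppressed).max())
```

The equivalence in the last line is why a weight-level cut acts as a permanent version of negative steering for the edited layer.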

Phase 4: Controls

Repeat with R1-Distill-Qwen-7B vs Qwen2.5-Math-7B, the pair used by Transcoder Adapters (SFT-only, smaller, faster)
Compare spectral structure of reasoning diff vs safety diff (same model pair)

Key Repos & Tools

Name	URL
Goodfire R1 SAEs	https://github.com/goodfire-ai/r1-interpretability
AIRI SAE-Reasoning	https://github.com/AIRI-Institute/SAE-Reasoning
Steering-Reasoning (corl-team)	https://github.com/corl-team/steering-reasoning
Transcoder Adapters	https://transcoder-adapters.github.io/
Thought Anchors	https://github.com/interp-reasoning/thought-anchors
FractionalReason	https://github.com/shengliu66/FractionalReason
EasySteer	https://github.com/ZJU-REAL/EasySteer
Limit-of-RLVR	https://github.com/LeapLabTHU/limit-of-RLVR
Reasoning Flow	https://github.com/MasterZhou1/Reasoning-Flow
KV Cache Steering	https://github.com/MaxBelitsky/cache-steering