Spinning Up in Abliteration

A progressive, hands-on curriculum for understanding and applying refusal-direction manipulation in large language models — built on vauban and Apple Silicon via MLX.

What This Is

This is an eight-part educational series that takes you from first intuition to production workflows in abliteration — the discovery that safety refusal in language models is mediated by a single direction in activation space, and the techniques to measure, remove, and defend against that direction.

Each part follows a Define → Derive → Code → Extend progression. Theory is conversational but formally correct; every code example calls the real vauban API (no pseudocode). By the end, you will be able to measure refusal directions, perform weight surgery, map refusal surfaces, run soft prompt attacks, deploy defenses, and optimize production pipelines — all from a single Mac.

Who This Is For

  • ML researchers studying alignment, interpretability, or adversarial robustness
  • Security engineers evaluating LLM safety boundaries
  • Graduate students looking for a hands-on entry point to mechanistic interpretability
  • Engineers building evaluation or red-teaming pipelines

You should be comfortable reading and writing Python. You do not need prior experience with abliteration, MLX, or mechanistic interpretability — the series introduces everything it uses.

Prerequisites

Linear algebra. You need projections, dot products, cosine similarity, SVD, and rank-1 updates. If you can explain what \(\langle a, d \rangle \cdot d\) does to a vector \(a\), you are ready.
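As a quick self-check on that expression, here is a minimal sketch in plain Python (no MLX needed). It assumes \(d\) is normalized to a unit vector \(\hat{d}\) first, so that \(\langle a, \hat{d} \rangle \cdot \hat{d}\) is the projection of \(a\) onto the direction. The helper names (`dot`, `unit`, `project`, `remove_direction`) are illustrative only and are not part of vauban.

```python
import math


def dot(u: list[float], v: list[float]) -> float:
    """Inner product <u, v>."""
    return sum(x * y for x, y in zip(u, v))


def unit(d: list[float]) -> list[float]:
    """Return d scaled to unit length (d_hat)."""
    norm = math.sqrt(dot(d, d))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in d]


def project(a: list[float], d: list[float]) -> list[float]:
    """Component of a along d: <a, d_hat> * d_hat."""
    d_hat = unit(d)
    coeff = dot(a, d_hat)
    return [coeff * x for x in d_hat]


def remove_direction(a: list[float], d: list[float], alpha: float = 1.0) -> list[float]:
    """a - alpha * <a, d_hat> * d_hat; alpha = 1 removes the component entirely."""
    return [x - alpha * p for x, p in zip(a, project(a, d))]


a = [3.0, 4.0]
d = [1.0, 0.0]
print(project(a, d))           # [3.0, 0.0]
print(remove_direction(a, d))  # [0.0, 4.0]
```

With `alpha=1.0` the result is orthogonal to `d`; smaller values of `alpha` remove only part of the component.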

Transformer basics. You should know what a residual stream is, how attention and MLP layers write into it, and what "last-token position" means in a causal language model. Familiarity with o_proj and down_proj weight matrices is helpful but not required — Part 3 derives everything from scratch.
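To make "writes into the residual stream" concrete, the following is a rough sketch of a rank-1 update on a toy output matrix. It assumes the convention \(y = W x\) with \(W\) of shape \(d_{\text{model}} \times d_{\text{in}}\), so the update \(W' = W - \alpha \hat{d} \hat{d}^\top W\) removes the \(\hat{d}\) component from everything the matrix can write. The function name `ablate_output_weights` is hypothetical, real checkpoints may store weights transposed, and this is not presented as how vauban implements the step; Part 3 covers the actual derivation.

```python
import math


def matvec(W: list[list[float]], x: list[float]) -> list[float]:
    """Compute y = W x for a row-major matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def ablate_output_weights(
    W: list[list[float]], d: list[float], alpha: float = 1.0
) -> list[list[float]]:
    """Hypothetical helper: return W - alpha * d_hat d_hat^T W.

    Assumes W has shape (d_model, d_in) and d has length d_model.
    """
    norm = math.sqrt(sum(x * x for x in d))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    d_hat = [x / norm for x in d]
    n_rows, n_cols = len(W), len(W[0])
    # d_hat^T W is a row vector of length d_in.
    dt_w = [sum(d_hat[i] * W[i][j] for i in range(n_rows)) for j in range(n_cols)]
    # Subtract the outer product d_hat (d_hat^T W), scaled by alpha.
    return [
        [W[i][j] - alpha * d_hat[i] * dt_w[j] for j in range(n_cols)]
        for i in range(n_rows)
    ]


W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
d = [0.0, 1.0, 0.0]
print(ablate_output_weights(W, d))  # [[1.0, 2.0], [0.0, 0.0], [5.0, 6.0]]
```

After the update with `alpha=1.0`, the output `matvec(W_new, x)` has zero component along `d` for any input `x`, which is the weight-space counterpart of projecting the direction out of each activation.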

Python fluency. All examples use Python 3.12+ with type annotations. We use mlx and mlx-lm for model loading and array operations.

How to Read

Parts 1–3 are the sequential core. Read them in order — each builds directly on the previous:

  1. Part 1 builds geometric intuition (no code).
  2. Part 2 runs a full abliteration before you understand every detail.
  3. Part 3 opens the hood on every step.

Parts 4–8 are independent modules. After finishing the core, read them in any order based on your interest:

  • Part 4 if you care about coverage and evaluation rigor.
  • Part 5 if you want to go deeper into geometry and detection.
  • Part 6 if you are interested in attacks and defenses.
  • Part 7 if you are building production pipelines.
  • Part 8 if you want weight-diff directions, enhanced CAST, or safety hardening.

Table of Contents

Part    Title                               Focus
Part 1  What is Abliteration?               Theory and geometric intuition (no code)
Part 2  Your First Abliteration             Hands-on quickstart with the quick API
Part 3  Under the Hood                      Step-by-step deep dive into measure, cut, evaluate
Part 4  The Refusal Surface                 Surface mapping, coverage scores, quality gates
Part 5  Going Deeper                        Depth analysis, subspaces, DBDI, detection, transfer
Part 6  Attacks and Defenses                Soft prompt attacks and SIC defense
Part 7  Production Workflows                TOML pipelines, optimization, experiment management
Part 8  Model Diffing and Enhanced Defense  Weight-diff directions, dual-direction CAST, LoX amplification

Supporting materials: References · Glossary

Environment Setup

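Hardware. Apple Silicon Mac (M1 or later). Unified memory lets the GPU address the same RAM as the CPU, so there is no separate VRAM ceiling and no host-to-device copies. The weights still have to fit in RAM: a 70B model needs about 140 GB at fp16, so on a 96 GB machine it fits only when quantized (about 70 GB at 8-bit, about 35 GB at 4-bit).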
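A back-of-the-envelope way to judge whether a model fits in memory is parameter count times bytes per parameter. The sketch below assumes nominal widths (2 bytes for fp16, 1 for 8-bit, 0.5 for 4-bit) and ignores quantization scales, the KV cache, and activations, so treat the numbers as lower bounds. `weight_gb` is an illustrative helper, not a vauban function.

```python
# Nominal storage width per parameter, in bytes (assumed values).
BYTES_PER_PARAM: dict[str, float] = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(n_params: float, dtype: str) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in ("fp16", "int8", "int4"):
    print(dtype, weight_gb(70e9, dtype))
# fp16 140.0
# int8 70.0
# int4 35.0
```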

Software.

# Python 3.12+
python3 --version

# Install vauban (pulls mlx and mlx-lm automatically)
pip install vauban

# Verify
python3 -c "from vauban import quick; print('Ready')"

Models are downloaded automatically from the Hugging Face Hub on first use. The default model (mlx-community/Llama-3.2-3B-Instruct-4bit) is ~2 GB.
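If the verify step fails, a small standard-library script can narrow down which requirement is missing. This is a sketch under a few assumptions: that the import names are `mlx`, `mlx_lm`, and `vauban`, and that Apple Silicon reports `arm64` from `platform.machine()` (a Python build running under Rosetta reports `x86_64` instead). `check_environment` is a hypothetical helper written for this page.

```python
import importlib.util
import platform
import sys


def check_environment(
    min_python: tuple[int, int] = (3, 12),
    packages: tuple[str, ...] = ("mlx", "mlx_lm", "vauban"),
) -> dict[str, bool]:
    """Report which setup requirements are satisfied on this machine."""
    report: dict[str, bool] = {
        "python_ok": sys.version_info >= min_python,
        # macOS on Apple Silicon: Darwin kernel, arm64 architecture.
        "apple_silicon": platform.system() == "Darwin"
        and platform.machine() == "arm64",
    }
    for name in packages:
        # find_spec returns None when a top-level package is not importable.
        report[f"{name}_installed"] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, ok in check_environment().items():
        print(f"{key}: {'yes' if ok else 'NO'}")
```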

Notation Conventions

Throughout the series, we use:

Symbol                 Meaning
\(W\)                  Weight matrix
\(d\), \(\hat{d}\)     Direction vector, unit direction vector
\(h\), \(a\)           Hidden state / activation vector
\(\alpha\)             Alpha (scaling factor for projection removal)
\(l\) (superscript)    Layer index
\(p\) (subscript)      Prompt index
\(H\), \(B\)           Set of harmful / harmless (benign) prompts
\(d_{\text{model}}\)   Hidden dimension of the model
\(L\)                  Total number of layers
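To see the notation in use, here is a toy sketch of the difference-of-means direction described by Arditi et al.: at a fixed layer \(l\), \(d^{l} = \frac{1}{|H|} \sum_{p \in H} h^{l}_{p} - \frac{1}{|B|} \sum_{p \in B} h^{l}_{p}\), with \(\hat{d}^{l} = d^{l} / \lVert d^{l} \rVert\). The activations below are made-up two-dimensional vectors, and the function names are illustrative; vauban's own measurement step may differ in detail.

```python
import math


def mean_vector(vectors: list[list[float]]) -> list[float]:
    """Element-wise mean of equal-length activation vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def refusal_direction(
    harmful: list[list[float]], harmless: list[list[float]]
) -> tuple[list[float], list[float]]:
    """Return (d, d_hat): mean over H minus mean over B, and its unit vector."""
    mu_h = mean_vector(harmful)   # mean of h_p over p in H
    mu_b = mean_vector(harmless)  # mean of h_p over p in B
    d = [x - y for x, y in zip(mu_h, mu_b)]
    norm = math.sqrt(sum(x * x for x in d))
    if norm == 0.0:
        raise ValueError("the two means coincide; no direction to extract")
    return d, [x / norm for x in d]


# Toy last-token activations at one layer (d_model = 2).
H_acts = [[2.0, 1.0], [4.0, 1.0]]
B_acts = [[0.0, 1.0], [0.0, 1.0]]
d, d_hat = refusal_direction(H_acts, B_acts)
print(d)      # [3.0, 0.0]
print(d_hat)  # [1.0, 0.0]
```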

Acknowledgements

This series would not exist without:

  • Arditi et al. for discovering that refusal is mediated by a single direction (arXiv:2406.11717)
  • The MLX team at Apple for making transformer internals accessible on consumer hardware
  • OpenAI's Spinning Up in Deep RL for the pedagogical template — accessible entry, rigorous content, tight theory-code coupling
  • The NousResearch and Heretic communities for pushing abliteration techniques forward

All citations are consolidated in the References page.