Arditi Refusal Direction Lite Reproduction
This is a deliberately small calibration run, not a full reproduction of Arditi et al. It tests whether Vauban can recover the first-order signal that the paper makes central: a contrastive activation direction associated with refusal behavior. It now also includes a causal-lite intervention extension: measure the direction, steer a small safe/meta-level prompt family at multiple alpha values, and record aggregate side effects.
The source paper is Refusal in Language Models Is Mediated by a Single Direction. The public report omits prompt text and records only aggregate metrics.
Setup
| Field | Value |
|---|---|
| Model | mlx-community/Qwen2.5-0.5B-Instruct-bf16 |
| Access level | activations |
| Claim strength | activation_diagnostic |
| Prompt families | 4 refusal-triggering prompts, 4 benign-control prompts |
| Intervention prompts | 4 safe/meta-level boundary prompts |
| Public prompt policy | Prompt text omitted; aggregate metrics only |
| Diagnostic command | VAUBAN_INTEGRATION=1 uv run pytest tests/test_integration.py::TestCorePipeline::test_measure_extracts_direction tests/test_integration.py::TestCorePipeline::test_probe_harmful_vs_harmless_contrast -q |
| Intervention command | uv run vauban examples/reproductions/arditi_refusal_direction_intervention.toml |
Observed Result
| Metric | Value |
|---|---|
| Best layer | 23 |
| Hidden dimension | 896 |
| Max direction score | 6.4148249626 |
| Mean max projection, benign controls | 0.6478872597 |
| Mean max projection, refusal-triggering family | 1.6690083444 |
| Projection gap | 1.0211210847 |
The integration check passed in repeated local runs.
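As a rough illustration of the kind of measurement summarized in the table, the sketch below computes a difference-of-means direction and the resulting projection gap on synthetic activations. It is a minimal sketch under stated assumptions: the function names, the random data, and the use of one activation vector per prompt are all hypothetical, and it is not taken from Vauban's code, which may pool over tokens or layers differently (the table reports a mean of per-prompt maximum projections).

```python
import numpy as np


def difference_of_means_direction(refusal_acts, benign_acts):
    """Unit vector pointing from the benign mean to the refusal-triggering mean.

    Both inputs have shape (n_prompts, hidden_dim). Hypothetical helper,
    not a Vauban API.
    """
    diff = refusal_acts.mean(axis=0) - benign_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def projection_gap(refusal_acts, benign_acts, direction):
    """Mean projection of each family onto the direction, and their difference."""
    refusal_mean = float((refusal_acts @ direction).mean())
    benign_mean = float((benign_acts @ direction).mean())
    return refusal_mean, benign_mean, refusal_mean - benign_mean


# Synthetic stand-ins for layer activations (assumption: 4 prompts per family,
# hidden dimension 896 as in the table). Real activations come from the model.
rng = np.random.default_rng(0)
hidden_dim = 896
offset = np.zeros(hidden_dim)
offset[0] = 2.0  # artificial shift separating the two families
benign_acts = rng.normal(size=(4, hidden_dim))
refusal_acts = rng.normal(size=(4, hidden_dim)) + offset

direction = difference_of_means_direction(refusal_acts, benign_acts)
refusal_mean, benign_mean, gap = projection_gap(refusal_acts, benign_acts, direction)

# The reported gap is the difference of the two reported means in the table.
reported_gap = 1.6690083444 - 0.6478872597
print(f"synthetic gap: {gap:.4f}, reported gap: {reported_gap:.10f}")
```

With a difference-of-means direction, the gap on the data used to fit the direction equals the norm of the mean difference, so it is positive by construction. A positive gap on the same prompts that defined the direction is therefore weak evidence on its own.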
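Causal-Lite Extension
The intervention extension was run with the intervention command listed in the Setup table: `uv run vauban examples/reproductions/arditi_refusal_direction_intervention.toml`.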
It used the measured direction at layer 23 and swept alpha values over four safe/meta-level prompts. Generated text was not recorded in the public artifact.
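One common way to steer along a measured direction is to rescale the component of the hidden state that lies along it. The sketch below shows that operation on synthetic data. The update rule `h' = h - alpha * (h . r) * r` and its sign convention are assumptions, chosen only because they agree in sign with the sweep table in this section (negative alpha raises the projection, positive alpha lowers it). The report does not state Vauban's actual update rule, and the magnitudes produced here are not expected to match the reported values.

```python
import numpy as np


def steer(hidden, direction, alpha):
    """Rescale the component of each hidden state along `direction`.

    Assumed convention (hypothetical, not confirmed for Vauban):
        h' = h - alpha * (h . r) * r,  with r the unit direction.
    alpha = 0 leaves h unchanged, alpha = 1 removes the component,
    alpha = -1 doubles it.
    """
    r = direction / np.linalg.norm(direction)
    coeff = hidden @ r  # projection of each row onto r, shape (n_prompts,)
    return hidden - alpha * np.outer(coeff, r)


# Synthetic hidden states for four prompts (assumption: hidden dimension 896).
rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 896))
direction = rng.normal(size=896)
r = direction / np.linalg.norm(direction)

before = hidden @ r
results = {}
for alpha in (-1.0, 0.0, 1.0):
    results[alpha] = steer(hidden, direction, alpha) @ r
    print(alpha, float(before.mean()), float(results[alpha].mean()))
```

Under this rule the projection at the edited layer becomes `(1 - alpha)` times its original value. Projections measured elsewhere in the network, or pooled over tokens, would not follow that simple relation.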
| Alpha | Prompts | Refusal-style rate | Mean projection before | Mean projection after | Projection delta |
|---|---|---|---|---|---|
| -1.0 | 4 | 0.50 | 2.6676412315 | 6.6312156451 | +3.9635744137 |
| 0.0 | 4 | 0.25 | 1.5702590009 | 1.5702590009 | +0.0000000000 |
| 1.0 | 4 | 0.25 | 1.7837318664 | -0.9171211583 | -2.7008530247 |
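The derived columns can be re-checked from the reported values alone. The short script below, written for illustration, recomputes the projection delta as "after minus before" and the change in refusal-style rate relative to the `alpha=0.0` row. All numbers are copied from the table above; the variable names are hypothetical.

```python
# (alpha, refusal_style_rate, mean_before, mean_after, reported_delta),
# copied from the sweep table.
rows = [
    (-1.0, 0.50, 2.6676412315, 6.6312156451, 3.9635744137),
    (0.0, 0.25, 1.5702590009, 1.5702590009, 0.0),
    (1.0, 0.25, 1.7837318664, -0.9171211583, -2.7008530247),
]

# The alpha = 0.0 row is the unsteered baseline.
baseline_rate = next(rate for alpha, rate, *_ in rows if alpha == 0.0)

summary = []
for alpha, rate, before, after, reported_delta in rows:
    delta = after - before
    # Reported values are rounded to 10 decimals, so allow a small tolerance.
    consistent = abs(delta - reported_delta) < 1e-9
    summary.append((alpha, delta, rate - baseline_rate, consistent))

for alpha, delta, rate_shift, consistent in summary:
    print(f"alpha={alpha:+.1f} delta={delta:+.10f} "
          f"rate_shift={rate_shift:+.2f} consistent={consistent}")
```

With four prompts per condition, a rate shift of 0.25 corresponds to a single prompt changing classification.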
Observed intervention results:
- `alpha=-1` increased phrase-based refusal-style rate by `+0.25` relative to baseline.
- `alpha=1` reduced mean projection on the measured direction but did not change phrase-based refusal-style rate in this small prompt sweep.
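For readers unfamiliar with phrase-based scoring, the sketch below shows one simple way such a refusal-style rate could be computed: count completions that contain any of a fixed set of marker phrases. The marker list, function names, and toy completions are all hypothetical. Generated text from the run was not recorded, and the phrase list Vauban uses is not given in this report.

```python
# Hypothetical marker phrases; the list used by the actual run is not public.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am unable")


def is_refusal_style(text, markers=REFUSAL_MARKERS):
    """True if the completion contains any marker phrase (case-insensitive)."""
    lowered = text.lower()
    return any(marker in lowered for marker in markers)


def refusal_style_rate(completions):
    """Fraction of completions flagged as refusal-style."""
    if not completions:
        raise ValueError("need at least one completion")
    flagged = sum(is_refusal_style(text) for text in completions)
    return flagged / len(completions)


# Invented toy completions for illustration only, not output from the run.
toy_completions = [
    "I'm sorry, but I can't help with that.",
    "Sure, here is a general overview.",
    "Here are some points to consider.",
    "The policy works as follows.",
]
rate = refusal_style_rate(toy_completions)
print(rate)
```

A metric like this is coarse: with four prompts it can only take the values 0, 0.25, 0.5, 0.75, and 1, and it cannot distinguish a hedged answer that happens to contain a marker phrase from an outright refusal.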
Epistemic Status
What replicated:
- Vauban recovered a positive activation-space separation on the tested model and prompt families.
- Refusal-triggering prompt-family probes projected higher than benign-control probes on the measured direction.
- A controlled alpha sweep changed projection metrics, and one steering condition increased phrase-based refusal-style behavior.
What was not reproduced (outside the scope of this run):
- The 13-model sweep from the paper.
- Direction addition and ablation experiments at the paper's scale.
- Generality across prompt suites, model families, or larger checkpoints.
The correct claim is therefore narrow:
On one small open instruction model and one small prompt-family contrast, Vauban recovered the expected refusal-direction diagnostic signal and a limited causal-lite steering effect.
That is useful calibration evidence for Vauban's measurement/report path. It is not a safety claim and not a full reproduction of the paper.
Artifact
The report config is examples/reproductions/arditi_refusal_direction_lite.toml.
The intervention config is examples/reproductions/arditi_refusal_direction_intervention.toml.
They can be rendered with: