Access Levels¶
What you can do with Vauban depends on what access you have to the model. Three tiers, from most capable to least.
Full weight access¶
You have the model weights locally — either downloaded from HuggingFace or stored on disk. The forward pass runs on your hardware. This is where Vauban operates at full capability.
Weight access — the model's parameter tensors (attention projections, MLP layers, embeddings) are available as arrays you can read, modify, and write back. On MLX this means
mx.arraytensors in unified memory; on PyTorch, standardtorch.Tensor.
Available tools¶
Assessment:
- Measure — extract the refusal direction (all four modes: direction, subspace, DBDI, diff)
- Detect — check hardening status (fast, probe, full, margin modes)
- Evaluate — refusal rate, perplexity, KL divergence comparisons
- Audit — full automated red-team assessment with findings
Inspection:
- Probe — per-layer projection of any prompt onto the refusal direction
- Scan — per-token injection detection via direction projection
- Surface — map the refusal boundary across diverse prompt categories
- Depth — JSD-based deep-thinking analysis across layers
Defense:
- SIC — iterative input sanitization (direction, generation, and SVF modes)
- CAST — conditional activation steering with tiered alpha
- Guard — KV cache checkpointing and rewind
- RepBend — fine-tuning to amplify safety representations
Adversarial:
- Softprompt (GCG) — discrete token optimization
- Softprompt (EGD) — continuous relaxation with Bregman projection
- GAN loop — iterative attack-defense rounds
- Fusion — latent space blending of harmful/harmless representations
- COLD-Attack, LARGO, AmpleGCG — additional optimization algorithms
Modification:
- Cut — remove refusal direction from weights (all variants)
- Export — save modified model as standard model directory
- Optuna — multi-objective hyperparameter search over cut parameters
Analysis:
- Classify — harm taxonomy scoring
- Score — 5-axis response quality assessment
- Circuit — causal tracing via activation patching
This is the access level assumed throughout most of this documentation.
Endpoint access¶
You have an OpenAI-compatible API endpoint. You can send prompts and receive completions, but you cannot inspect or modify the model's internals.
Endpoint access — you interact with the model through an HTTP API (typically
/v1/chat/completions). You control what goes in (the prompt) and can observe what comes out (the response), but the model's weights and activations are opaque.
Available tools¶
API evaluation:
- API eval — send pre-optimized adversarial tokens to the endpoint. Tests whether tokens optimized on a local model transfer to the remote target. Supports multi-turn conversations and follow-up prompts.
Prompt-level attacks:
- Jailbreak templates — DAN, hypothetical framing, reasoning chains, role-play. These are text-level prompt constructions that require no gradient information.
Partial defense:
- SIC (input side only) — if you control the input pipeline before it reaches the API, you can sanitize prompts. The detection step requires a local model for direction-based scoring, but generation-based detection can use the endpoint itself.
Analysis:
- Classify — harm taxonomy scoring (text-only, no model needed)
- Score — response quality assessment (text-only)
What you cannot do¶
No measurement (requires forward pass for activation collection). No probing (requires per-layer activation access). No cutting (requires weight modification). No CAST or Guard (require intercepting the forward pass). No gradient-based attacks (require backpropagation through the model).
Backpropagation — the algorithm that computes gradients by working backward through the model's layers. It tells you "how much would the output change if I tweaked this weight or this input token?" Gradient-based attacks need backpropagation to figure out which token changes would be most effective, so they require full access to the model's internals.
The key workflow at this tier: optimize adversarial tokens on a local model with full weight access, then test them against the remote endpoint via API eval to measure transfer.
Transfer — when adversarial tokens optimized on one model also work on a different model. This is why endpoint-only access is not fully safe: an attacker can optimize against a local copy (or similar model) and then send the resulting tokens to your API.
Black box¶
You can only observe the model's outputs. No API — perhaps you are testing through a web interface or a system where you cannot programmatically construct inputs.
Available tools¶
Manual testing:
- Jailbreak templates — these are text patterns you can type or paste. No tooling required, though Vauban can generate and format them for you.
Output analysis:
- Classify — if you can copy the model's response, classify it against the harm taxonomy.
- Score — assess response quality across 5 axes.
What you cannot do¶
Almost everything. Vauban is designed for weight access. Without at least API access, the tooling degrades to text analysis utilities.
Access matrix¶
| Tool | Weights | Endpoint | Black box |
|---|---|---|---|
| Measure | Yes | -- | -- |
| Probe | Yes | -- | -- |
| Surface | Yes | -- | -- |
| Audit | Yes | -- | -- |
| Detect | Yes | -- | -- |
| Depth | Yes | -- | -- |
| SIC | Yes | Partial | -- |
| CAST | Yes | -- | -- |
| Guard | Yes | -- | -- |
| RepBend | Yes | -- | -- |
| Cut | Yes | -- | -- |
| Export | Yes | -- | -- |
| Softprompt | Yes | -- | -- |
| GAN loop | Yes | -- | -- |
| Fusion | Yes | -- | -- |
| API eval | -- | Yes | -- |
| Jailbreak templates | Yes | Yes | Yes |
| Classify | Yes | Yes | Yes |
| Score | Yes | Yes | Yes |