Skip to content

Access Levels

What you can do with Vauban depends on what access you have to the model. Three tiers, from most capable to least.

Full weight access

You have the model weights locally — either downloaded from HuggingFace or stored on disk. The forward pass runs on your hardware. This is where Vauban operates at full capability.

Weight access — the model's parameter tensors (attention projections, MLP layers, embeddings) are available as arrays you can read, modify, and write back. On MLX this means mx.array tensors in unified memory; on PyTorch, standard torch.Tensor.

Available tools

Assessment:

  • Measure — extract the refusal direction (all four modes: direction, subspace, DBDI, diff)
  • Detect — check hardening status (fast, probe, full, margin modes)
  • Evaluate — refusal rate, perplexity, KL divergence comparisons
  • Audit — full automated red-team assessment with findings

Inspection:

  • Probe — per-layer projection of any prompt onto the refusal direction
  • Scan — per-token injection detection via direction projection
  • Surface — map the refusal boundary across diverse prompt categories
  • Depth — JSD-based deep-thinking analysis across layers

Defense:

  • SIC — iterative input sanitization (direction, generation, and SVF modes)
  • CAST — conditional activation steering with tiered alpha
  • Guard — KV cache checkpointing and rewind
  • RepBend — fine-tuning to amplify safety representations

Adversarial:

  • Softprompt (GCG) — discrete token optimization
  • Softprompt (EGD) — continuous relaxation with Bregman projection
  • GAN loop — iterative attack-defense rounds
  • Fusion — latent space blending of harmful/harmless representations
  • COLD-Attack, LARGO, AmpleGCG — additional optimization algorithms

Modification:

  • Cut — remove refusal direction from weights (all variants)
  • Export — save modified model as standard model directory
  • Optuna — multi-objective hyperparameter search over cut parameters

Analysis:

  • Classify — harm taxonomy scoring
  • Score — 5-axis response quality assessment
  • Circuit — causal tracing via activation patching

This is the access level assumed throughout most of this documentation.

Endpoint access

You have an OpenAI-compatible API endpoint. You can send prompts and receive completions, but you cannot inspect or modify the model's internals.

Endpoint access — you interact with the model through an HTTP API (typically /v1/chat/completions). You control what goes in (the prompt) and can observe what comes out (the response), but the model's weights and activations are opaque.

Available tools

API evaluation:

  • API eval — send pre-optimized adversarial tokens to the endpoint. Tests whether tokens optimized on a local model transfer to the remote target. Supports multi-turn conversations and follow-up prompts.

Prompt-level attacks:

  • Jailbreak templates — DAN, hypothetical framing, reasoning chains, role-play. These are text-level prompt constructions that require no gradient information.

Partial defense:

  • SIC (input side only) — if you control the input pipeline before it reaches the API, you can sanitize prompts. The detection step requires a local model for direction-based scoring, but generation-based detection can use the endpoint itself.

Analysis:

  • Classify — harm taxonomy scoring (text-only, no model needed)
  • Score — response quality assessment (text-only)

What you cannot do

No measurement (requires forward pass for activation collection). No probing (requires per-layer activation access). No cutting (requires weight modification). No CAST or Guard (require intercepting the forward pass). No gradient-based attacks (require backpropagation through the model).

Backpropagation — the algorithm that computes gradients by working backward through the model's layers. It tells you "how much would the output change if I tweaked this weight or this input token?" Gradient-based attacks need backpropagation to figure out which token changes would be most effective, so they require full access to the model's internals.

The key workflow at this tier: optimize adversarial tokens on a local model with full weight access, then test them against the remote endpoint via API eval to measure transfer.

Transfer — when adversarial tokens optimized on one model also work on a different model. This is why endpoint-only access is not fully safe: an attacker can optimize against a local copy (or similar model) and then send the resulting tokens to your API.

Black box

You can only observe the model's outputs. No API — perhaps you are testing through a web interface or a system where you cannot programmatically construct inputs.

Available tools

Manual testing:

  • Jailbreak templates — these are text patterns you can type or paste. No tooling required, though Vauban can generate and format them for you.

Output analysis:

  • Classify — if you can copy the model's response, classify it against the harm taxonomy.
  • Score — assess response quality across 5 axes.

What you cannot do

Almost everything. Vauban is designed for weight access. Without at least API access, the tooling degrades to text analysis utilities.

Access matrix

Tool Weights Endpoint Black box
Measure Yes -- --
Probe Yes -- --
Surface Yes -- --
Audit Yes -- --
Detect Yes -- --
Depth Yes -- --
SIC Yes Partial --
CAST Yes -- --
Guard Yes -- --
RepBend Yes -- --
Cut Yes -- --
Export Yes -- --
Softprompt Yes -- --
GAN loop Yes -- --
Fusion Yes -- --
API eval -- Yes --
Jailbreak templates Yes Yes Yes
Classify Yes Yes Yes
Score Yes Yes Yes