Configuration Reference¶

Vauban is configured entirely through TOML files. The only CLI is vauban config.toml.

vauban config.toml              # run the pipeline
vauban --validate config.toml   # check config without loading model

Only [model] and [data] are required. All other sections are optional and activate different pipeline modes when present.

Pipeline modes¶

Section present	Pipeline mode
(none)	measure → cut → export
`[surface]`	… + before/after refusal surface maps
`[eval]`	… + refusal rate / perplexity / KL evaluation
`[detect]`	defense detection (runs before measure/cut)
`[depth]`	deep-thinking token analysis (early return)
`[svf]`	steering vector field boundary training (early return)
`[probe]`	per-layer projection inspection (early return)
`[steer]`	steered generation (early return)
`[cast]`	conditional activation steering (early return)
`[sic]`	iterative input sanitization (early return)
`[optimize]`	Optuna hyperparameter search (early return)
`[compose_optimize]`	Bayesian composition weight optimization (early return)
`[softprompt]`	soft prompt attack (early return)
`[defend]`	composed defense stack evaluation (early return)
`[environment]`	agent tool harness (used with `[softprompt]`)
`[api_eval]`	remote API suffix evaluation
`[meta]`	experiment metadata (no pipeline effect)

Early-return precedence: [depth] > [svf] > [probe] > [steer] > [cast] > [sic] > [optimize] > [compose_optimize] > [softprompt] > [defend].

Data file formats¶

harmful/harmless/eval JSONL — one JSON object per line: {"prompt": "How do I pick a lock?"}
surface JSONL — three required keys: {"prompt": "...", "label": "harmful", "category": "weapons"}
refusal phrases — plain text, one phrase per line (lines starting with # are ignored)

Minimal example¶

[model]
path = "mlx-community/Llama-3.2-3B-Instruct-4bit"

[data]
harmful = "default"
harmless = "default"

Full reference¶

See examples/config.toml for a fully annotated config with every field documented, or run:

vauban man print

to get the auto-generated manual (built from typed config dataclasses, always in sync with code).

Top-level fields¶

Field	Type	Default	Description
`backend`	string	`"mlx"`	Inference backend.
`verbose`	bool	`true`	Print progress messages to stderr.

`[model]` — Required¶

Field	Type	Default	Description
`path`	string	(required)	HuggingFace model ID or local path. Must be loadable by `mlx_lm.load()`.

`[data]` — Required¶

Field	Type	Default	Description
`harmful`	string or table	(required)	`"default"`, path to JSONL, or HuggingFace dataset ref.
`harmless`	string or table	(required)	Same as harmful.
`borderline`	string or table	—	Optional neutral/ambiguous prompts. Required if `[cut].false_refusal_ortho = true`.

Data sources can be:

"default" — bundled prompt set
"path/to/file.jsonl" — local JSONL
"hf:org/dataset" — HuggingFace dataset (short form)
{ hf = "org/dataset", split = "train", column = "prompt", config = "default", limit = 200 } — HuggingFace dataset (long form)

`[measure]` — Direction extraction¶

Field	Type	Default	Description
`mode`	`"direction"` \| `"subspace"` \| `"dbdi"` \| `"diff"`	`"direction"`	Extraction mode.
`top_k`	int ≥ 1	`5`	Number of singular directions for subspace/diff modes.
`clip_quantile`	float in [0.0, 0.5)	`0.0`	Winsorize extreme projections. 0.0 = off.
`transfer_models`	list of strings	`[]`	Model IDs to test direction transfer against.
`diff_model`	string	—	Base model for weight-diff measurement. Required when `mode = "diff"`.
`measure_only`	bool	`false`	Stop after writing measure-stage reports (skip cut/export).

Diff mode extracts safety directions by computing W_aligned - W_base for o_proj and down_proj weights at each layer, then running SVD to find the principal difference directions. This captures distributed safety effects that single-model activation measurement may miss. See Part 8 for details.

`[cut]` — Weight modification¶

Field	Type	Default	Description
`alpha`	float	`1.0`	Cut strength. 0 = no-op, 1 = full removal, >1 = overshoot. Negative values amplify the direction (LoX-style safety hardening).
`layers`	`"auto"` or list of ints	all layers	Explicit layer list overrides `layer_strategy`.
`norm_preserve`	bool	`false`	Rescale weight rows to preserve L2 norm after projection.
`biprojected`	bool	`false`	Orthogonalize refusal direction against harmless direction first.
`layer_strategy`	`"all"` \| `"above_median"` \| `"top_k"`	`"all"`	Probe-guided layer selection.
`layer_top_k`	int ≥ 1	`10`	Layers to select when `layer_strategy = "top_k"`.
`layer_weights`	list of floats	—	Per-layer alpha multipliers. Length must match number of layers.
`sparsity`	float in [0.0, 1.0)	`0.0`	Fraction of direction components to zero out.
`dbdi_target`	`"red"` \| `"hdd"` \| `"both"`	`"red"`	Which DBDI component to remove. Only with `measure.mode = "dbdi"`.
`false_refusal_ortho`	bool	`false`	Orthogonalize against false-refusal direction. Requires `[data].borderline`.
`layer_type_filter`	`"global"` \| `"sliding"`	—	Filter layers by attention type before cutting.

`[eval]` — Post-cut evaluation¶

Field	Type	Default	Description
`prompts`	path	—	JSONL file with evaluation prompts. Falls back to first `num_prompts` harmful prompts.
`max_tokens`	int ≥ 1	`100`	Max tokens to generate per prompt.
`num_prompts`	int ≥ 1	`20`	Fallback prompt count when prompts path is absent.
`refusal_phrases`	path	—	Custom refusal phrases file (one per line).
`refusal_mode`	`"phrases"` \| `"judge"`	`"phrases"`	Detection method. `"phrases"` = substring matching against refusal phrase list. `"judge"` = model-based classification.

`[surface]` — Refusal surface mapping¶

Field	Type	Default	Description
`prompts`	path or `"default"`	`"default"`	JSONL with prompt/label/category keys.
`generate`	bool	`true`	Generate responses. `false` = projection-only (faster).
`max_tokens`	int ≥ 1	`20`	Max tokens per prompt during scan.
`progress`	bool	`true`	Print scan progress to stderr.
`max_worst_cell_refusal_after`	float	—	Quality gate: max refusal rate in any cell after cut.
`max_worst_cell_refusal_delta`	float	—	Quality gate: max refusal rate increase in any cell.
`min_coverage_score`	float	—	Quality gate: minimum grid coverage.

`[detect]` — Defense detection¶

Field	Type	Default	Description
`mode`	`"fast"` \| `"probe"` \| `"full"`	`"full"`	Detection depth.
`top_k`	int ≥ 1	`5`	SVD components.
`clip_quantile`	float in [0.0, 0.5)	`0.0`	Winsorization.
`alpha`	float ≥ 0	`1.0`	Test cut strength (full mode).
`max_tokens`	int ≥ 1	`100`	Generation limit (full mode).

`[optimize]` — Optuna search¶

Field	Type	Default	Description
`n_trials`	int ≥ 1	`50`	Number of Optuna trials.
`alpha_min` / `alpha_max`	float	`0.1` / `5.0`	Alpha search range.
`sparsity_min` / `sparsity_max`	float	`0.0` / `0.9`	Sparsity search range.
`search_norm_preserve`	bool	`true`	Include norm_preserve in search.
`search_strategies`	list of strings	`["all", "above_median", "top_k"]`	Layer strategies to search.
`layer_top_k_min` / `layer_top_k_max`	int	`3` / num_layers	top_k range.
`max_tokens`	int ≥ 1	`100`	Generation limit per eval.
`seed`	int	—	Reproducibility seed.
`timeout`	float (seconds)	—	Wall-clock timeout.

`[softprompt]` — Soft prompt attack¶

Optimizes a learnable prefix (soft prompt) to bypass the model's refusal. Three optimization modes:

continuous — Adam optimization directly in embedding space (Schwinn et al.)
gcg — Greedy Coordinate Gradient over discrete tokens (Zou et al.)
egd — Exponentiated Gradient Descent on the probability simplex

Core¶

Field	Type	Default	Description
`mode`	`"continuous"` \| `"gcg"` \| `"egd"`	`"continuous"`	Optimization mode.
`n_tokens`	int ≥ 1	`16`	Learnable prefix length.
`n_steps`	int ≥ 1	`200`	Optimization iterations.
`learning_rate`	float > 0	`0.01`	Learning rate (continuous mode).
`init_scale`	float	`0.1`	Embedding initialization scale.
`batch_size`	int ≥ 1	`64`	GCG candidates per position.
`top_k`	int ≥ 1	`256`	GCG token candidates.
`target_prefixes`	list of strings	`["Sure", "Here"]`	Target completion prefixes.
`max_gen_tokens`	int ≥ 1	`100`	Tokens to generate when evaluating success.
`seed`	int	—	Random seed for reproducibility.

Regularization & scheduling¶

Field	Type	Default	Description
`embed_reg_weight`	float ≥ 0	`0.0`	Embedding norm regularization (Huang et al.). 0 = off.
`patience`	int ≥ 0	`0`	Early stopping patience. 0 = disabled.
`lr_schedule`	`"constant"` \| `"cosine"`	`"constant"`	Learning rate schedule.
`n_restarts`	int ≥ 1	`1`	Random restarts (GCG mode).
`grad_accum_steps`	int ≥ 1	`1`	Gradient accumulation steps. 1 = no accumulation.

Prompt strategy¶

Field	Type	Default	Description
`prompt_strategy`	`"all"` \| `"cycle"` \| `"first"` \| `"worst_k"` \| `"sample"`	`"all"`	How prompts are selected each step.
`worst_k`	int ≥ 1	`5`	Number of prompts to focus on for `"worst_k"` strategy.
`prompt_pool_size`	int ≥ 1	—	Override eval prompt count for pool size.

Direction & loss¶

Field	Type	Default	Description
`direction_weight`	float ≥ 0	`0.0`	Direction-guided weight. 0 = standalone attack.
`direction_mode`	`"last"` \| `"raid"` \| `"all_positions"`	`"last"`	How to apply direction guidance.
`direction_layers`	list of ints	—	Layers for direction constraint. Null = all layers.
`loss_mode`	`"targeted"` \| `"untargeted"` \| `"defensive"` \| `"externality"`	`"targeted"`	Loss function. `"externality"` requires `externality_target`.
`egd_temperature`	float > 0	`1.0`	Bregman projection temperature (EGD mode).
`defense_aware_weight`	float ≥ 0	`0.0`	Defense evasion penalty added to loss. 0 = off.
`externality_target`	path	—	Path to `.npy` direction file for externality loss. Required when `loss_mode = "externality"`. Penalizes safety margin erosion (Xiong et al. 2026).

Token constraints¶

Field	Type	Default	Description
`token_constraint`	string or list of strings	—	Restrict token search space.

Positive constraints (include only matching tokens): "ascii", "alpha", "alphanumeric", "non_latin", "chinese", "non_alphabetic", "invisible", "zalgo", "emoji".

Negative constraints (exclude matching tokens): "exclude_glitch".

When a single string is provided, only tokens matching that constraint are allowed (or excluded, for negative constraints). When a list is provided, positive constraints are intersected first, then negative constraints remove tokens from the result.

`exclude_glitch`¶

Detects and excludes under-trained ("glitch") tokens from the adversarial search space. These are tokens with anomalously low or high embedding norms (beyond 3 standard deviations from the mean) that cause model collapse when encountered during generation.

Why it matters: Under-trained tokens produce random multilingual word salad instead of coherent text. If GCG/EGD selects one, the optimization step is wasted — the model cannot parse the input at all, so the gradient signal is noise. Excluding these tokens improves optimization efficiency.

Research basis: "Fishing for Magikarp" (Rumbelow & Watkins, EMNLP 2024, arxiv 2405.05417) showed that ~0.3-0.6% of tokens in modern LLMs are under-trained and cause anomalous behavior. Our cross-model validation on Qwen2.5-0.5B/1.5B confirmed this rate with calibrated multi-template behavioral entropy testing.

[softprompt]
mode = "gcg"
token_constraint = "exclude_glitch"

# Can combine with positive constraints:
# token_constraint = ["ascii", "exclude_glitch"]

Detection runs automatically from the embedding matrix at mask-build time (~1 second). No additional configuration needed.

EOS loss¶

Field	Type	Default	Description
`eos_loss_mode`	`"none"` \| `"force"` \| `"suppress"`	`"none"`	Auxiliary loss for EOS token. `"force"` encourages EOS; `"suppress"` discourages it.
`eos_loss_weight`	float ≥ 0	`0.0`	Weight for EOS auxiliary loss.

KL collision¶

Field	Type	Default	Description
`kl_ref_weight`	float ≥ 0	`0.0`	KL collision loss weight. Requires `ref_model` if > 0.
`ref_model`	string	—	HuggingFace model ID or path for KL collision reference model.

Transfer¶

Field	Type	Default	Description
`transfer_models`	list of strings	`[]`	Models to test the optimized prompt on after training.
`transfer_loss_weight`	float ≥ 0	`0.0`	Multi-model re-ranking weight. 0 = off.
`transfer_rerank_count`	int ≥ 1	`8`	Top-N candidates to re-rank on transfer models.

Target config¶

Field	Type	Default	Description
`target_repeat_count`	int ≥ 0	`0`	Repeat target tokens N times. 0 = disabled.
`system_prompt`	string	—	System prompt prepended to messages.

Beam search & init¶

Field	Type	Default	Description
`beam_width`	int ≥ 1	`1`	GCG beam search population. 1 = greedy.
`init_tokens`	list of ints	—	Warm-start token IDs (GCG/EGD modes).

Defense eval (in-loop)¶

Evaluate the optimized suffix against defense modules during training.

Field	Type	Default	Description
`defense_eval`	`"sic"` \| `"cast"` \| `"both"`	—	Which defense to evaluate against. Null = no defense eval.
`defense_eval_layer`	int	—	Layer for SIC/CAST direction projection. Null = auto from measurement.
`defense_eval_alpha`	float	`1.0`	CAST steering alpha.
`defense_eval_threshold`	float	`0.0`	SIC/CAST detection threshold.
`defense_eval_sic_mode`	`"direction"` \| `"generation"` \| `"svf"`	`"direction"`	SIC detection method. `"svf"` uses trained boundary MLP.
`defense_eval_sic_max_iterations`	int ≥ 1	`3`	SIC max sanitization iterations.
`defense_eval_cast_layers`	list of ints	—	CAST steering layers. Null = auto.
`defense_eval_alpha_tiers`	list of `[threshold, alpha]` pairs	—	TRYLOCK adaptive alpha tiers for CAST.

GAN loop¶

Iterative attack-defense training. The attacker (soft prompt optimizer) and defender (SIC/CAST) alternate rounds, escalating parameters when the attacker fails.

Field	Type	Default	Description
`gan_rounds`	int ≥ 0	`0`	Number of attack-defense rounds. 0 = no GAN loop.
`gan_step_multiplier`	float > 0	`1.5`	Multiply `n_steps` each failed round.
`gan_direction_escalation`	float	`0.25`	Add to `direction_weight` per failed round.
`gan_token_escalation`	int ≥ 0	`4`	Add to `n_tokens` per failed round.

GAN defender escalation¶

When enabled, the defender also escalates its parameters after attacker wins.

Field	Type	Default	Description
`gan_defense_escalation`	bool	`false`	Enable defender escalation. Off = legacy behavior.
`gan_defense_alpha_multiplier`	float > 0	`1.5`	Multiply CAST alpha per attacker win.
`gan_defense_threshold_escalation`	float ≥ 0	`0.5`	Subtract from SIC/CAST threshold per attacker win.
`gan_defense_sic_iteration_escalation`	int ≥ 0	`1`	Add to SIC max iterations per attacker win.

Multi-turn GAN¶

Thread GAN rounds as a multi-turn conversation, carrying history forward.

Field	Type	Default	Description
`gan_multiturn`	bool	`false`	Enable multi-turn conversation threading.
`gan_multiturn_max_turns`	int ≥ 1	`10`	Max conversation turns to keep in history.

Injection context¶

Wrap the optimized suffix in realistic surrounding context (e.g., a web page, tool output, or code file). Only works with discrete modes (gcg or egd).

Field	Type	Default	Description
`injection_context`	`"web_page"` \| `"tool_output"` \| `"code_file"`	—	Preset injection context wrapper.
`injection_context_template`	string	—	Custom template with `{payload}` placeholder. Overrides preset.

Constraints: Injection context requires mode = "gcg" or mode = "egd" (continuous mode produces soft embeddings that cannot represent wrapped context). Cannot be combined with gan_multiturn.

Perplexity regularization¶

Cross-entropy penalty pushing optimized suffixes toward fluent text. Encourages token sequences that look like natural language instead of adversarial noise.

Field	Type	Default	Description
`perplexity_weight`	float >= 0	`0.0`	Weight for perplexity (CE) auxiliary loss. 0 = off. Higher values produce more fluent but potentially less effective suffixes.

Token position¶

Controls where the learnable tokens are inserted relative to the prompt.

Field	Type	Default	Description
`token_position`	`"prefix"` \| `"suffix"` \| `"infix"`	`"prefix"`	Placement of optimized tokens. `"prefix"` = before the prompt, `"suffix"` = after the prompt, `"infix"` = split and inserted within the prompt.

Constraints: token_position = "infix" requires mode = "gcg" or mode = "egd" (continuous mode cannot resolve infix split positions).

Prompt paraphrasing¶

Augment the prompt pool with paraphrased variants. Each strategy applies a different rewriting style to diversify the attack surface.

Field	Type	Default	Description
`paraphrase_strategies`	list of strings	`[]`	Paraphrase strategies to apply. Empty = no paraphrasing.

Valid strategies: "narrative", "deceptive_delight", "technical", "historical", "code_block", "educational".

Environment rollout integration¶

When [environment] is present alongside [softprompt], the optimizer periodically runs the top candidates through the agent environment to compute reward-based re-ranking. See the [environment] section for full configuration.

`[sic]` — Iterative input sanitization¶

Field	Type	Default	Description
`mode`	`"direction"` \| `"generation"`	`"direction"`	Detection method.
`threshold`	float	`0.0`	Detection threshold. Higher = stricter. Can be negative.
`max_iterations`	int ≥ 1	`3`	Sanitization rounds.
`max_tokens`	int ≥ 1	`100`	Tokens for generation-based detection.
`target_layer`	int	—	Layer for direction projection. Null = use measurement result.
`sanitize_system_prompt`	string	(built-in)	System prompt for the rewrite step.
`max_sanitize_tokens`	int ≥ 1	`200`	Max tokens when rewriting.
`block_on_failure`	bool	`true`	Block inputs that cannot be sanitized within `max_iterations`.
`calibrate`	bool	`false`	Auto-calibrate threshold from clean prompts.
`calibrate_prompts`	`"harmless"` \| `"harmful"`	`"harmless"`	Which prompt set to calibrate from.

`[depth]` — Deep-thinking analysis¶

Field	Type	Default	Description
`prompts`	list of strings	(required)	Inline prompts for depth analysis.
`settling_threshold`	float in (0.0, 1.0]	`0.5`	JSD threshold for "settled".
`deep_fraction`	float in (0.0, 1.0]	`0.85`	Deep-thinking layer fraction.
`max_tokens`	int ≥ 0	`0`	0 = static analysis, >0 = generate.
`extract_direction`	bool	`false`	Also extract a depth direction.
`top_k_logits`	int ≥ 1	`1000`	Approximate JSD with top-k logits (performance).
`direction_prompts`	list of strings	—	Prompts for direction extraction. Required when `extract_direction = true`.
`clip_quantile`	float in [0.0, 0.5)	`0.0`	Winsorization quantile for direction extraction.

`[probe]` — Projection inspection¶

Field	Type	Default	Description
`prompts`	list of strings	(required)	Prompts to probe.

`[steer]` — Steered generation¶

Field	Type	Default	Description
`prompts`	list of strings	(required)	Prompts for steered generation.
`layers`	list of ints	all layers	Layers to apply steering.
`alpha`	float ≥ 0	`1.0`	Steering strength.
`max_tokens`	int ≥ 1	`100`	Max tokens to generate.

`[cast]` — Conditional activation steering¶

Field	Type	Default	Description
`prompts`	list of strings	(required)	Prompts for CAST generation.
`layers`	list of ints	all layers	Layers where conditional checks run.
`alpha`	float ≥ 0	`1.0`	Steering strength when projection exceeds threshold.
`threshold`	float	`0.0`	Trigger steering only if projection > threshold.
`max_tokens`	int ≥ 1	`100`	Max tokens to generate.
`condition_direction`	path	—	Separate `.npy` direction for gating (detect vs. steer split).
`alpha_tiers`	list of tables	—	Adaptive alpha tiers: `[[cast.alpha_tiers]]` with `threshold` and `alpha` keys. Must be sorted ascending.

Dual-direction mode: When condition_direction is set, gating uses that direction (detect) while correction uses the primary direction (steer). This implements the AdaSteer dual-direction pattern.

Adaptive alpha: alpha_tiers maps projection magnitude to different steering strengths, avoiding the non-monotonic danger zone identified by TRYLOCK. See Part 8 for examples.

`[api_eval]` — Remote API suffix evaluation¶

Test optimized suffixes against remote API endpoints. Runs after softprompt optimization completes.

Top-level¶

Field	Type	Default	Description
`max_tokens`	int ≥ 1	`100`	Max tokens for API responses.
`timeout`	int	`30`	Request timeout in seconds.
`system_prompt`	string	—	Shared default system prompt for all endpoints.
`multiturn`	bool	`false`	Enable multi-turn conversation with follow-ups.
`multiturn_max_turns`	int ≥ 1	`3`	Total turns including initial prompt.
`follow_up_prompts`	list of strings	`[]`	Custom follow-up prompts. Empty = use defaults.

`[[api_eval.endpoints]]`¶

Each endpoint is a TOML array-of-tables entry:

[[api_eval.endpoints]]
name = "gpt-4o"
base_url = "https://api.openai.com/v1"
model = "gpt-4o"
api_key_env = "OPENAI_API_KEY"
system_prompt = "You are a helpful assistant."  # optional per-endpoint override
auth_header = "Authorization"                   # optional custom header name

Field	Type	Default	Description
`name`	string	(required)	Display name for this endpoint.
`base_url`	string	(required)	API base URL (OpenAI-compatible).
`model`	string	(required)	Model identifier for the API.
`api_key_env`	string	(required)	Environment variable containing the API key.
`system_prompt`	string	—	Per-endpoint system prompt override.
`auth_header`	string	—	Custom auth header name (e.g. `"grayswan-api-key"`).

`[meta]` — Experiment metadata¶

Metadata for experiment tracking and tech tree visualization. Does not affect pipeline execution.

Field	Type	Default	Description
`id`	string	(required)	Unique experiment identifier.
`title`	string	`""`	Human-readable experiment title.
`status`	string	`"wip"`	Experiment status (e.g. `"wip"`, `"done"`, `"abandoned"`).
`parents`	list of strings	`[]`	IDs of parent experiments (lineage).
`tags`	list of strings	`[]`	Freeform tags for categorization.
`notes`	string	`""`	Experiment notes.
`date`	string	`""`	Date string (freeform, e.g. `"2025-01-15"`).

`[[meta.docs]]`¶

Associated documents:

[[meta.docs]]
path = "results/report.pdf"
label = "Final report"

Field	Type	Default	Description
`path`	string	(required)	Path to the document.
`label`	string	`""`	Display label.

View the experiment tree with:

python -m vauban.tree experiments/

`[defend]` — Defense stack composition¶

Composes multiple defense layers (scan, SIC, policy, intent) into a single evaluation pipeline. When [defend] is present, it acts as an early-return mode that runs each enabled defense layer in sequence against evaluation prompts and writes a combined report.

The [defend] section itself has only one field. The individual defense layers are configured via their own top-level sections ([scan], [sic], [policy], [intent]), which are pulled in automatically when [defend] is present.

Field	Type	Default	Description
`fail_fast`	bool	`true`	Stop at the first defense layer that blocks. `false` = run all layers and report all results.

Layer composition: The defense stack runs layers in order: scan (Layer 0, injection detection) -> SIC (Layer 1, iterative sanitization) -> policy (Layer 3, tool-call filtering) -> intent (Layer 4, intent alignment). Each layer is optional; only layers with a corresponding top-level section are active.

Example:

[defend]
fail_fast = false

[scan]
threshold = 0.5
calibrate = true

[sic]
mode = "direction"
max_iterations = 3

`[scan]` — Injection content scanning¶

Configure the Layer 0 injection scanner. Only active when [defend] is present.

Field	Type	Default	Description
`target_layer`	int	—	Layer for direction projection. Null = use measurement result.
`span_threshold`	float	`0.5`	Minimum mean projection for a token span to be flagged.
`threshold`	float	`0.0`	Overall detection threshold.
`calibrate`	bool	`false`	Auto-calibrate threshold from clean prompts.

`[policy]` — Tool-call policy engine¶

Configure the Layer 3 tool-call policy engine. Only active when [defend] is present.

Field	Type	Default	Description
`default_action`	`"allow"` \| `"block"`	`"allow"`	Default action for unmatched tool calls.
`rules`	list of tables	`[]`	Policy rules (see below).
`data_flow_rules`	list of tables	`[]`	Data flow restriction rules.
`rate_limits`	list of tables	`[]`	Rate limit rules.

Each [[policy.rules]] entry:

Field	Type	Default	Description
`name`	string	(required)	Rule name for logging.
`action`	`"allow"` \| `"block"` \| `"confirm"`	(required)	Action when matched.
`tool_pattern`	string	(required)	fnmatch pattern for tool names.
`argument_key`	string	—	Optional argument key to check.
`argument_pattern`	string	—	Regex pattern for argument value.

`[intent]` — Intent alignment checking¶

Configure the Layer 4 intent alignment checker. Only active when [defend] is present.

Field	Type	Default	Description
`mode`	`"embedding"` \| `"judge"`	`"embedding"`	Alignment detection method.
`target_layer`	int	—	Layer for embedding similarity. Null = auto.
`similarity_threshold`	float	`0.7`	Cosine similarity threshold for `"embedding"` mode.
`judge_prompt`	string	(built-in)	System prompt for `"judge"` mode.
`max_tokens`	int >= 1	`10`	Max tokens for judge generation.

`[environment]` — Agent tool harness¶

Defines a simulated agent environment for indirect prompt injection testing. The environment provides tools, a benign user task, and a target tool call that the injection payload should elicit. Used alongside [softprompt] to evaluate whether optimized suffixes can hijack agent behavior.

Top-level¶

Field	Type	Default	Description
`system_prompt`	string	(required)	System prompt for the agent.
`injection_surface`	string	(required)	Name of the tool whose output contains the injection payload. Must match a defined tool name.
`max_turns`	int >= 1	`6`	Maximum agent loop turns.
`max_gen_tokens`	int >= 1	`200`	Max tokens per agent generation step.
`temperature`	float >= 0	`0.0`	Sampling temperature. 0.0 = greedy (argmax).
`rollout_top_n`	int >= 1	`8`	Number of top candidates to evaluate via environment rollout.
`rollout_every_n`	int >= 1	`1`	Run environment rollouts every N optimization steps. 1 = every step.

`[environment.target]`¶

The tool call the injection payload should trigger.

Field	Type	Default	Description
`function`	string	(required)	Target tool name. Must match a defined tool.
`required_args`	list of strings	`[]`	Argument keys that must be present.
`arg_contains`	table	`{}`	Key-value pairs the arguments must contain (substring match).

`[environment.task]`¶

The benign user task that initiates the agent loop.

Field	Type	Default	Description
`content`	string	(required)	The user's task prompt.

`[[environment.tools]]`¶

Each tool available to the agent is defined as an array-of-tables entry:

[[environment.tools]]
name = "web_search"
description = "Search the web for information."
parameters = { query = "string" }
result = "Search results as text."

Field	Type	Default	Description
`name`	string	(required)	Tool name (referenced by `injection_surface` and `target.function`).
`description`	string	`""`	Human-readable description shown to the agent.
`parameters`	table	`{}`	Parameter names mapped to type descriptions.
`result`	string	—	Description of the tool's return value.

`[environment.policy]`¶

Optional inline tool-call policy for the environment harness (separate from the [policy] defense layer).

Field	Type	Default	Description
`blocked_functions`	list of strings	`[]`	Tool names to block outright.
`require_confirmation`	list of strings	`[]`	Tool names requiring confirmation.
`arg_blocklist`	table of lists	`{}`	Per-argument blocked value patterns.

Cross-field validation: injection_surface and target.function must both reference tools defined in [[environment.tools]].

`[svf]` — Steering vector field training¶

Trains a boundary MLP that learns a differentiable decision surface in activation space. The gradient of the boundary function at each activation gives the steering direction, replacing static linear vectors with context-dependent steering.

Reference: Li, Li & Huang (2026) -- arxiv.org/abs/2602.01654

Field	Type	Default	Description
`prompts_target`	path	(required)	JSONL file with target-behavior prompts (e.g., harmful prompts).
`prompts_opposite`	path	(required)	JSONL file with opposite-behavior prompts (e.g., harmless prompts).
`projection_dim`	int >= 1	`16`	Dimensionality of the projected activation space fed to the boundary MLP.
`hidden_dim`	int >= 1	`64`	Hidden layer width in the boundary MLP.
`n_epochs`	int >= 1	`10`	Training epochs.
`learning_rate`	float > 0	`0.001`	Adam learning rate.
`layers`	list of ints	all layers	Layers to train boundary MLPs on. Null = all layers.

The trained boundary model can be referenced by [steer], [cast], and [sic] via their direction_source = "svf" and svf_boundary_path fields.

`[compose_optimize]` — Composition weight optimization¶

Bayesian optimization over linear composition weights for Steer2Adapt composed steering. Given a subspace bank (.safetensors file with named basis vectors), searches for the weight combination that optimizes refusal rate and perplexity trade-off.

Reference: Han et al. (2026) -- arxiv.org/abs/2602.07276

Field	Type	Default	Description
`bank_path`	path	(required)	Path to a `.safetensors` subspace bank file. Each key is a named subspace with shape `(k, d_model)`.
`n_trials`	int >= 1	`50`	Number of Optuna trials.
`max_tokens`	int >= 1	`100`	Max tokens per evaluation generation.
`timeout`	float (seconds)	—	Wall-clock timeout. Null = no timeout.
`seed`	int	—	Reproducibility seed. Null = non-deterministic.

The bank file is produced by the [measure] stage when measure.bank entries are configured. Each entry in the bank contains the SVD basis vectors for a named behavioral subspace. The optimizer searches for weights w_i such that the composed direction sum(w_i * basis_i[0]) (L2-normalized) achieves the best refusal/quality balance.

`[output]` — Output directory¶

Field	Type	Default	Description
`dir`	path	`"output"`	Where to write modified weights, reports, and measurements.

Configuration Reference¶

Pipeline modes¶

Data file formats¶

Minimal example¶

Full reference¶

Top-level fields¶

[model] — Required¶

[data] — Required¶

[measure] — Direction extraction¶

[cut] — Weight modification¶

[eval] — Post-cut evaluation¶

[surface] — Refusal surface mapping¶

[detect] — Defense detection¶

[optimize] — Optuna search¶

[softprompt] — Soft prompt attack¶

Core¶

Regularization & scheduling¶

Prompt strategy¶

Direction & loss¶

Token constraints¶

exclude_glitch¶

EOS loss¶

KL collision¶

Transfer¶

Target config¶

Beam search & init¶

Defense eval (in-loop)¶

GAN loop¶

GAN defender escalation¶

Multi-turn GAN¶

Injection context¶

Perplexity regularization¶

Token position¶

Prompt paraphrasing¶

Environment rollout integration¶

[sic] — Iterative input sanitization¶

[depth] — Deep-thinking analysis¶

[probe] — Projection inspection¶

[steer] — Steered generation¶

[cast] — Conditional activation steering¶

[api_eval] — Remote API suffix evaluation¶

Top-level¶

[[api_eval.endpoints]]¶

[meta] — Experiment metadata¶

[[meta.docs]]¶

[defend] — Defense stack composition¶

[scan] — Injection content scanning¶

[policy] — Tool-call policy engine¶

[intent] — Intent alignment checking¶

[environment] — Agent tool harness¶

Top-level¶

[environment.target]¶

[environment.task]¶

[[environment.tools]]¶

[environment.policy]¶

[svf] — Steering vector field training¶

[compose_optimize] — Composition weight optimization¶

[output] — Output directory¶

`[model]` — Required¶

`[data]` — Required¶

`[measure]` — Direction extraction¶

`[cut]` — Weight modification¶

`[eval]` — Post-cut evaluation¶

`[surface]` — Refusal surface mapping¶

`[detect]` — Defense detection¶

`[optimize]` — Optuna search¶

`[softprompt]` — Soft prompt attack¶

`exclude_glitch`¶

`[sic]` — Iterative input sanitization¶

`[depth]` — Deep-thinking analysis¶

`[probe]` — Projection inspection¶

`[steer]` — Steered generation¶

`[cast]` — Conditional activation steering¶

`[api_eval]` — Remote API suffix evaluation¶

`[[api_eval.endpoints]]`¶

`[meta]` — Experiment metadata¶

`[[meta.docs]]`¶

`[defend]` — Defense stack composition¶

`[scan]` — Injection content scanning¶

`[policy]` — Tool-call policy engine¶

`[intent]` — Intent alignment checking¶

`[environment]` — Agent tool harness¶

`[environment.target]`¶

`[environment.task]`¶

`[[environment.tools]]`¶

`[environment.policy]`¶

`[svf]` — Steering vector field training¶

`[compose_optimize]` — Composition weight optimization¶

`[output]` — Output directory¶