Tokenizer Embedding Analysis

Methodology and findings from the tokenizer anomaly detection research that led to the exclude_glitch constraint in the softprompt module.

Motivation

LLM tokenizers contain under-trained tokens ("glitch tokens") that cause model collapse when encountered during generation. The original discovery was "SolidGoldMagikarp" (Rumbelow & Watkins, 2023) in GPT-3, followed by a systematic study in "Fishing for Magikarp" (Land & Bartolo, EMNLP 2024, arxiv 2405.05417).

For vauban's adversarial search (GCG/EGD), these tokens are a problem: if the optimizer selects one, the model produces random noise instead of coherent output, wasting the optimization step. The exclude_glitch constraint prevents this.

What we tested

Models

  • Qwen2.5-0.5B-Instruct (151,936 tokens, 896-dim embeddings)
  • Qwen2.5-1.5B-Instruct (151,936 tokens, 1536-dim embeddings)

Both share the same tokenizer but have different learned embeddings.

Methods

1. Spectral analysis (Marchenko-Pastur null)

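We compute the SVD of the centered embedding matrix and test the eigenvalues of the resulting spectrum against the Marchenko-Pastur distribution (Martin & Mahoney, JMLR 2021). Eigenvalues above the upper MP edge are treated as signal (learned structure); eigenvalues inside the MP bulk are consistent with noise.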
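A minimal sketch of the Marchenko-Pastur comparison described above. It assumes the noise variance is estimated from the mean eigenvalue, which overestimates the noise level when strong signal is present and therefore makes the count conservative; spectral_analysis.py may estimate it differently. The function name mp_signal_count is illustrative and not taken from the script.

```python
import numpy as np


def mp_signal_count(emb: np.ndarray) -> tuple[int, float]:
    """Count eigenvalues above the Marchenko-Pastur upper edge.

    emb is an (n_tokens, dim) embedding matrix. Returns the number of
    eigenvalues above the edge and the edge itself.
    """
    n, d = emb.shape
    # Center each embedding dimension across the vocabulary.
    centered = emb - emb.mean(axis=0, keepdims=True)
    # Eigenvalues of the sample covariance are the squared singular values / n.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    eigvals = singular_values**2 / n
    # ASSUMPTION: noise variance taken as the mean eigenvalue.
    sigma2 = eigvals.mean()
    # MP upper edge for aspect ratio d / n.
    edge = sigma2 * (1.0 + np.sqrt(d / n)) ** 2
    return int((eigvals > edge).sum()), float(edge)
```

For the 0.5B model, n would be 151,936 and d would be 896, so the aspect ratio d / n is small and the MP bulk is narrow.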

Results for 0.5B:

  • 330/896 eigenvalues above MP edge (36.8% signal)
  • TwoNN intrinsic dimension: 14.1 (the 151K embeddings live on a ~14-dimensional manifold)
  • IsoScore: 0.965 (nearly isotropic eigenspectrum)
  • PC1 (3.3% variance) separates high-frequency tokens (comma, period, digits) from multi-char code fragments
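The TwoNN figure can be illustrated with a short sketch. This uses the maximum-likelihood form of the TwoNN estimator, which relies only on the ratio of each point's second to first nearest-neighbor distance; whether spectral_analysis.py uses this form or the linear-fit variant is an assumption. The dense distance matrix is only practical for a subsample, not the full 151K vocabulary.

```python
import numpy as np


def twonn_dimension(points: np.ndarray) -> float:
    """Estimate intrinsic dimension from nearest-neighbor distance ratios."""
    n = points.shape[0]
    # Squared pairwise distances via the Gram matrix.
    sq_norms = (points**2).sum(axis=1)
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * points @ points.T
    sq_dist = np.maximum(sq_dist, 0.0)  # guard against rounding below zero
    np.fill_diagonal(sq_dist, np.inf)   # exclude self-distances
    # Two smallest squared distances per row: column 0 is r1^2, column 1 is r2^2.
    nearest = np.partition(sq_dist, 1, axis=1)[:, :2]
    # log(r2 / r1) = 0.5 * log(r2^2 / r1^2)
    log_mu = 0.5 * np.log(nearest[:, 1] / nearest[:, 0])
    # Maximum-likelihood estimate: d = n / sum(log(mu)).
    return float(n / log_mu.sum())
```

The estimator assumes no duplicate points, since a zero nearest-neighbor distance would make the ratio undefined.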

Script: experiments/tokenizer_analysis/spectral_analysis.py

2. Calibrated behavioral entropy scan

For each token, measure output entropy when the model is asked to repeat it across 3 diverse prompt templates (direct repetition, few-shot, spelling). A token is flagged as anomalous only if entropy exceeds a calibrated threshold (mean + 3sigma of known-good common tokens) on ALL templates.

The multi-template requirement follows GlitchMiner (arxiv 2410.15052). Single-template testing has a ~70% false positive rate (AnomaLLMy, arxiv 2406.19840).

Calibration baseline:

| Model | Mean H | Sigma | Threshold (mean + 3 sigma) |
|-------|--------|-------|----------------------------|
| 0.5B  | 1.899  | 0.830 | 4.388                      |
| 1.5B  | 1.236  | 0.772 | 3.553                      |
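The flagging rule can be sketched as follows. This is an illustration under stated assumptions, not the code in full_scan.py: entropy is computed in nats from next-token logits, and the function names are hypothetical. Recomputing the thresholds from the rounded mean and sigma gives 4.389 and 3.552, which matches the table to within rounding.

```python
import numpy as np


def token_entropy(logits: np.ndarray) -> float:
    """Shannon entropy (nats) of the softmax distribution over logits."""
    shifted = logits - logits.max()  # stabilize the softmax
    log_probs = shifted - np.log(np.exp(shifted).sum())
    probs = np.exp(log_probs)
    return float(-(probs * log_probs).sum())


def threshold_from_stats(mean_h: float, sigma: float, n_sigma: float = 3.0) -> float:
    """Calibrated threshold: mean + n_sigma * sigma of known-good tokens."""
    return mean_h + n_sigma * sigma


def flag_anomalous(entropies: np.ndarray, threshold: float) -> np.ndarray:
    """Flag a token only if it exceeds the threshold on ALL templates.

    entropies has shape (n_tokens, n_templates).
    """
    return (entropies > threshold).all(axis=1)
```

Requiring every template to exceed the threshold trades some recall for a lower false-positive rate, which is the point of the multi-template design.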

Script: experiments/tokenizer_analysis/full_scan.py

3. Cross-model validation

Same token IDs scanned on both models. Anomaly rates compared across models and sampling strategies (low-norm vs random stratified).

4. Deep probing

Confirmed anomalies tested with complex prompts (biography, secret, synonym, poem, opinion, danger) following the Watkins/Rumbelow petertodd methodology. Also tested with vauban-relevant prompts (refusal prefix injection, system prompt injection).

Script: experiments/tokenizer_analysis/deep_probe.py

Key findings

Anomaly rate is consistent and low

| Group                    | 0.5B Rate | 1.5B Rate |
|--------------------------|-----------|-----------|
| Calibration (known-good) | 0.0%      | 1.0%      |
| Low-norm (bottom 2000)   | 0.3%      | 0.3%      |
| Random stratified (2000) | 0.3%      | 0.6%      |

The estimated anomaly rate is ~0.3-0.6% of sampled tokens, consistent across both models and both sampling strategies.
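To see why 0.3% and 0.6% count as consistent at this sample size, a Wilson score interval is a reasonable check. The counts used in the usage note (6 and 12 out of 2,000) are back-calculated from the rounded percentages and are an assumption, not figures from the scan output.

```python
import math


def wilson_interval(hits: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (default ~95%)."""
    p = hits / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return center - half, center + half
```

Calling wilson_interval(6, 2000) and wilson_interval(12, 2000) yields overlapping intervals, so a difference between 0.3% and 0.6% is within sampling noise for 2,000-token samples.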

Anomalous tokens cause total model collapse

All 22 confirmed anomalies produce the same behavior: multilingual word salad at entropy H=9.0-9.8 (vs. a calibration baseline of ~1.2-1.9, depending on the model). No coherent output, no refusal, no instruction following.

This is NOT the "petertodd" phenomenon (coherent alternate persona). It is pure under-training: the model has no representation for these tokens and outputs random noise from across its entire vocabulary distribution.

Refusal bypassed via collapse, not evasion

20/22 anomalous tokens bypass refusal when prepended to harmful prompts. The mechanism is not safety alignment evasion -- it is destruction of the model's ability to parse the input at all. The safety system cannot refuse what it cannot understand.

Anomaly types are model-specific despite shared tokenizer

  • 0.5B anomalies: Truncated multilingual tokens (przedsiÄĻb, which is the byte-level vocabulary form of przedsięb; Cumhurba; useRalative). Model responds with <tool_call> hallucination.
  • 1.5B anomalies: Arabic/Cyrillic multi-byte tokens, plus artifacts like $PostalCodesNL. Model responds with single-byte Unicode fragments.

Embedding norms are a useful but imperfect proxy

Low-norm tokens are enriched for anomalies (they received fewer gradient updates during training), but not all low-norm tokens are anomalous and not all anomalous tokens have low norms. The 3-sigma threshold is conservative -- it catches the worst offenders while minimizing false exclusions.

What we ruled out

Orthogonal rotation is not a valid null baseline

Our initial approach compared embedding metrics against an orthogonal rotation of the embedding matrix. This produced identical distributions because orthogonal rotation preserves all pairwise distances and norms. The correct null is Marchenko-Pastur for the spectral distribution.
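The invariance is easy to verify numerically. The sketch below draws a random orthogonal matrix via QR decomposition and computes row norms and pairwise distances, both of which come out unchanged after rotation; the helper names are illustrative.

```python
import numpy as np


def random_orthogonal(dim: int, rng: np.random.Generator) -> np.ndarray:
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(dim, dim)))
    # Fixing column signs keeps q orthogonal and removes the QR sign ambiguity.
    return q * np.sign(np.diag(r))


def pairwise_distances(x: np.ndarray) -> np.ndarray:
    """Dense Euclidean distance matrix (fine for small inputs)."""
    diff = x[:, None, :] - x[None, :, :]
    return np.linalg.norm(diff, axis=-1)


rng = np.random.default_rng(0)
emb = rng.normal(size=(50, 8))           # stand-in for an embedding matrix
rotated = emb @ random_orthogonal(8, rng)

# Norms and pairwise distances agree up to floating-point error.
norms_match = np.allclose(np.linalg.norm(emb, axis=1), np.linalg.norm(rotated, axis=1))
dists_match = np.allclose(pairwise_distances(emb), pairwise_distances(rotated))
```

Because every norm-, distance-, and spectrum-based metric is unchanged, a rotated matrix cannot serve as a contrast to the original.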

Norm-based outlier detection is not anomaly detection

The literature (Robinson et al., NeurIPS 2025; "Interior Conjecture") shows that 85/133 known GPT glitch tokens are interior to the convex hull of other embeddings. Norm and distance metrics capture correlates, not causes. Behavioral testing is the gold standard.

No petertodd-level phenomena in instruction-tuned Qwen models

Deep probing with biography/secret/poem/synonym templates produced only word salad, not coherent alternate personas or emotional valence shifts. This is likely because: (a) the models are instruction-tuned (RLHF suppresses bizarre behavior), and (b) 0.5B/1.5B are too small for complex emergent behavior. The petertodd phenomenon was observed in GPT-3 base models.

How the constraint works

The exclude_glitch token constraint in [softprompt] computes embedding L2 norms for all tokens and excludes those beyond 3 standard deviations from the mean. This runs once at mask-build time (~1 second) and does not require forward passes.

For users who have run the full behavioral entropy scan, pre-computed glitch token IDs can also be supplied programmatically via the glitch_token_ids parameter to _build_vocab_mask().
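The two paragraphs above can be sketched as a single masking step. This is not vauban's _build_vocab_mask(), whose signature is not shown here: the function names are hypothetical, and treating "beyond 3 standard deviations" as two-sided (both unusually low and unusually high norms) is an assumption.

```python
import numpy as np


def norm_outlier_ids(emb: np.ndarray, n_sigma: float = 3.0) -> np.ndarray:
    """Token IDs whose embedding L2 norm is more than n_sigma from the mean."""
    norms = np.linalg.norm(emb, axis=1)
    z_scores = (norms - norms.mean()) / norms.std()
    # ASSUMPTION: two-sided exclusion.
    return np.flatnonzero(np.abs(z_scores) > n_sigma)


def build_mask_sketch(emb, glitch_token_ids=None, n_sigma=3.0) -> np.ndarray:
    """Boolean mask over the vocabulary; True means the token may be selected."""
    allowed = np.ones(emb.shape[0], dtype=bool)
    allowed[norm_outlier_ids(emb, n_sigma)] = False
    if glitch_token_ids is not None:
        # Union with IDs found by a behavioral scan.
        allowed[np.asarray(list(glitch_token_ids), dtype=int)] = False
    return allowed
```

The norm pass needs only the embedding matrix, which is why it avoids forward passes; the optional ID list lets a behavioral scan cover anomalies that the norm heuristic misses.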

References

  • Rumbelow & Watkins (2023) -- "SolidGoldMagikarp" -- original glitch token discovery
  • Land & Bartolo (EMNLP 2024, arxiv 2405.05417) -- "Fishing for Magikarp" -- systematic detection via embedding properties
  • GlitchMiner (arxiv 2410.15052) -- gradient-based entropy maximization, multi-template validation
  • GlitchProber (arxiv 2408.04905) -- activation-based detection with SVM classifier
  • AnomaLLMy (arxiv 2406.19840) -- API-based detection, demonstrates 70% FP rate of single-template testing
  • Robinson et al. (NeurIPS 2025, arxiv 2504.01002) -- "Token Embeddings Violate the Manifold Hypothesis"
  • Martin & Mahoney (JMLR 2021) -- Marchenko-Pastur analysis of neural network weight matrices

Experiment scripts

All scripts are in experiments/tokenizer_analysis/:

| Script                | Purpose                                                     | Runtime                        |
|-----------------------|-------------------------------------------------------------|--------------------------------|
| spectral_analysis.py  | SVD + MP null + dimensionality metrics                      | ~8s                            |
| full_scan.py          | Calibrated behavioral entropy scan (3 phases)               | ~4 min (0.5B), ~10 min (1.5B)  |
| deep_probe.py         | Complex prompt probing of confirmed anomalies               | ~5 min                         |
| analyze_embeddings.py | Initial norm/cosine/kNN analysis (superseded by full_scan)  | ~19 min                        |