Detection dimension · weight 5%

Sparse-Token Stress Test

What this dimension detects

Inspired by MiniMax's May-2026 investigation of the 'Ma Jiaqi (马嘉祺)' case. Each vendor's SFT data covers the model's vocabulary unevenly. Low-frequency tokens (rare CJK names, Chinese SEO spam, Japanese colloquial phrases, LaTeX / Wikipedia metadata, FIM special tokens) accumulate lm_head drift during SFT and fall out of the top-p sampling window: the model still understands them but cannot generate them. The *set* of forgotten tokens differs per vendor, so the failure pattern is a generation-side fingerprint that is independent of tokenizer-boundary, logprobs, ITT and MMD.
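To make the "falls out of the top-p sampling window" mechanism concrete, here is a minimal sketch of nucleus (top-p) filtering over a toy three-token vocabulary. The logit values, the size of the drift and the cutoff p = 0.9 are invented for illustration only; they are not measurements from any model.

```python
import math

def top_p_set(logits, p):
    """Return the nucleus: the smallest set of highest-probability tokens
    whose cumulative probability reaches p."""
    m = max(logits.values())
    # Softmax numerators, shifted by the max logit for numerical stability.
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    z = sum(exps.values())
    ranked = sorted(exps, key=exps.get, reverse=True)
    kept, mass = [], 0.0
    for tok in ranked:
        kept.append(tok)
        mass += exps[tok] / z
        if mass >= p:
            break
    return kept

# Hypothetical logits for the next character after "马嘉" (illustrative values).
before = {"祺": 2.0, "琪": 1.0, "奇": 0.0}
# Same context after the rare token's output logit has drifted downward.
after = {"祺": -3.0, "琪": 1.0, "奇": 0.0}

print(top_p_set(before, 0.9))  # the rare token is still inside the nucleus
print(top_p_set(after, 0.9))   # the rare token has dropped out and cannot be sampled
```

In the second call the rare token still has non-zero probability, but sampling is restricted to the nucleus, so it is never emitted.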

Algorithm

Send ~10 probes, each instructing the model to echo a known-fragile token string verbatim (no commentary, no reformatting). Classify each response as hit / omit / substitute / partial / refuse / blank. A 'substitute' that matches a documented near-neighbour pattern (e.g. the Chinese homophone swap 祺→琪, 嘉祺→千玺 as an lm_head drift, or 相続税 mixing into Korean/Russian) is flagged with the historical note. The aggregate hit-rate drives the verdict. Failure modes and families are reported for forensics but do NOT vote for a specific vendor in scoring, because cross-vendor measured failure tables are not yet available.
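The classification step could be sketched roughly as follows. The text above names the six outcomes but does not define them precisely, so the rules here are assumptions: the check order, the refusal markers, counting only documented near-neighbours as 'substitute', and the two-character threshold for 'partial'. The names classify, REFUSAL_MARKERS and the neighbours table are hypothetical.

```python
from difflib import SequenceMatcher
from enum import Enum
from typing import Dict, Optional, Tuple

class Outcome(Enum):
    HIT = "hit"
    OMIT = "omit"
    SUBSTITUTE = "substitute"
    PARTIAL = "partial"
    REFUSE = "refuse"
    BLANK = "blank"

# Hypothetical refusal phrases; a real probe set would need a curated list.
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm unable")

def classify(target: str, response: str,
             neighbours: Optional[Dict[str, str]] = None
             ) -> Tuple[Outcome, Optional[str]]:
    """Return (outcome, historical note or None) for one echo probe."""
    neighbours = neighbours or {}
    text = response.strip()
    if not text:
        return Outcome.BLANK, None
    if target in text:  # assumption: a verbatim occurrence anywhere counts as a hit
        return Outcome.HIT, None
    # Assumption: only documented near-neighbour strings count as 'substitute'.
    for variant, note in neighbours.items():
        if variant in text:
            return Outcome.SUBSTITUTE, note
    if any(marker in text.lower() for marker in REFUSAL_MARKERS):
        return Outcome.REFUSE, None
    # Assumption: sharing a fragment of >= 2 characters with the target is 'partial'.
    matcher = SequenceMatcher(None, target, text, autojunk=False)
    longest = matcher.find_longest_match(0, len(target), 0, len(text))
    if longest.size >= 2:
        return Outcome.PARTIAL, None
    return Outcome.OMIT, None
```

For example, classify("马嘉祺", "马嘉琪", {"马嘉琪": "祺→琪 homophone"}) yields the substitute outcome together with its note, while an empty reply yields blank.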

Thresholds

Condition | Verdict contribution
hit-rate ≥ 80% | Match: SFT vocabulary coverage appears intact on these tokens
50% ≤ hit-rate < 80% | Match (borderline): flagged for inspection but does not vote
hit-rate < 50% AND ≥ 3 probes scored | Mismatch: substantial lm_head drift on tested tokens; if the claimed model is documented to echo these correctly on public benchmarks, the actual deployment is suspect
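As a minimal sketch, the thresholds can be expressed as one function over the hit count and the number of scored probes. The function name and return strings are hypothetical. The thresholds do not say what happens when the hit-rate is below 50% with fewer than 3 probes scored, so the "undetermined" result for that case (and for zero scored probes) is an assumption.

```python
def verdict(hits: int, scored: int) -> str:
    """Map an aggregate hit count to a verdict contribution."""
    if scored == 0:
        return "undetermined"          # assumption: nothing scored, nothing to say
    # Integer arithmetic avoids floating-point edge cases at exactly 80% / 50%.
    if hits * 100 >= 80 * scored:
        return "match"
    if hits * 100 >= 50 * scored:
        return "match (borderline)"    # flagged for inspection, does not vote
    if scored >= 3:
        return "mismatch"
    return "undetermined"              # assumption: too few probes to call a mismatch

print(verdict(8, 10))  # match
print(verdict(5, 10))  # match (borderline)
print(verdict(4, 10))  # mismatch
print(verdict(0, 2))   # undetermined
```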

Limitations

Probes are descriptive, not diagnostic. A failure tells you 'this model's SFT data was thin on this token'; it does NOT directly identify which other model is being served. Cross-vendor failure tables are not yet measured at sufficient scale to vote for a specific suspected model. The Chinese, Japanese and Korean probes are language-specific, so auditing a code model trained mostly on English data will produce noisy results from this dimension. Special-token / LaTeX / Wikipedia probes are off by default in the test set because all chat-tuned models legitimately fail them.

References

  • MiniMax. Internal investigation: Ma Jiaqi (马嘉祺) sparse-token forgetting and lm_head drift, May 2026.
  • Lin et al. Mitigating the Alignment Tax of RLHF. 2024. (For the underlying catastrophic-forgetting-during-SFT mechanism.)