Detection dimension · weight 12%
Model Equality Testing (MMD)
What this dimension detects
Maximum Mean Discrepancy (MMD) is a kernel two-sample test. The ICLR 2025 paper (Gao et al.) applies it to LLM endpoints and reports that 11 of 31 commercial Llama APIs deviated from Meta's reference at p < 0.05. That result indicates a distributional difference, not necessarily substitution. TrueLLMs surfaces this dimension whenever a user-recorded baseline is present (see the MMD baseline panel on the home page); without a baseline, the dimension reports `unavailable` and its weight is redistributed to the other dimensions.
Algorithm
Collect N ≥ 25 samples from the audited endpoint at temperature > 0, plus N ≥ 25 baseline samples (either user-recorded or taken from the shipped reference fingerprint). Take the first 100 raw characters of each response, with no tokenization and no case folding, matching the implementation in `lib/identity-audit/mmd.ts`. Compute MMD² with a Hamming kernel. Estimate the p-value with 1,000 random permutations of the joint sample.
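The steps above can be sketched roughly as follows. This is an illustration written for this page, not the code in `lib/identity-audit/mmd.ts`. The function names, the normalization of the kernel to [0, 1], the treatment of short responses (positions past the end count as a shared empty symbol), the biased V-statistic estimator, and the add-one p-value correction are all assumptions. "Characters" here means UTF-16 code units, which is what `charAt` indexes.

```typescript
const PREFIX_LEN = 100; // first 100 raw characters of each response

// Hamming kernel: fraction of the first PREFIX_LEN positions where both
// strings hold the same character. charAt returns "" past the end, so two
// short responses also "match" on their missing positions (an assumption).
function hammingKernel(a: string, b: string): number {
  let matches = 0;
  for (let i = 0; i < PREFIX_LEN; i++) {
    if (a.charAt(i) === b.charAt(i)) matches++;
  }
  return matches / PREFIX_LEN;
}

// Pairwise kernel values for the pooled sample, computed once and reused
// by every permutation.
function gramMatrix(samples: string[]): number[][] {
  return samples.map((a) => samples.map((b) => hammingKernel(a, b)));
}

// Biased (V-statistic) MMD^2 estimate: mean k(x,x') + mean k(y,y') - 2 mean k(x,y).
function mmd2(K: number[][], idxX: number[], idxY: number[]): number {
  const mean = (A: number[], B: number[]): number => {
    let sum = 0;
    for (const i of A) for (const j of B) sum += K[i][j];
    return sum / (A.length * B.length);
  };
  return mean(idxX, idxX) + mean(idxY, idxY) - 2 * mean(idxX, idxY);
}

// Permutation p-value: shuffle the pooled indices, re-split them into two
// groups of the original sizes, and count how often the permuted statistic
// is at least as large as the observed one.
function permutationPValue(
  x: string[],
  y: string[],
  permutations: number = 1000,
  rng: () => number = Math.random
): number {
  const pooled = x.concat(y);
  const K = gramMatrix(pooled);
  const idx = pooled.map((_, i) => i);
  const observed = mmd2(K, idx.slice(0, x.length), idx.slice(x.length));
  let atLeastAsLarge = 0;
  for (let p = 0; p < permutations; p++) {
    // Fisher-Yates shuffle of the pooled indices
    for (let i = idx.length - 1; i > 0; i--) {
      const j = Math.floor(rng() * (i + 1));
      const tmp = idx[i];
      idx[i] = idx[j];
      idx[j] = tmp;
    }
    const stat = mmd2(K, idx.slice(0, x.length), idx.slice(x.length));
    if (stat >= observed) atLeastAsLarge++;
  }
  // Add-one correction keeps the estimate strictly above zero.
  return (1 + atLeastAsLarge) / (1 + permutations);
}
```

Dividing the kernel by `PREFIX_LEN` only rescales MMD² and leaves the permutation p-value unchanged, since the observed and permuted statistics are scaled by the same factor.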
Thresholds
| Condition | Verdict contribution |
|---|---|
| p ≥ 0.10 | Same distribution (cannot reject H₀) |
| 0.05 ≤ p < 0.10 | Borderline |
| p < 0.05 | Distributions differ (multiple causes possible) |
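The table maps onto a small lookup like the one below. The verdict labels and the function name are hypothetical, chosen for illustration; only the cut-offs come from the table.

```typescript
// Hypothetical labels; the cut-offs mirror the thresholds table.
type MmdVerdict = "same_distribution" | "borderline" | "distributions_differ";

function verdictFromPValue(p: number): MmdVerdict {
  if (!(p >= 0 && p <= 1)) {
    throw new RangeError("p-value must lie in [0, 1]");
  }
  if (p >= 0.10) return "same_distribution"; // cannot reject H0
  if (p >= 0.05) return "borderline";
  return "distributions_differ"; // a flag, not a substitution finding
}
```

Both boundaries are inclusive on the lower side, so p = 0.10 counts as "same distribution" and p = 0.05 counts as borderline.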
Limitations
Requires temperature > 0 and a baseline. Random seed differences can produce false positives at small N. A rejected H₀ means only that the distributions differ: quantization, fine-tunes, system prompts and post-processing all produce distributional shifts, so a p < 0.05 result is a flag, not a substitution finding. Since v3.3 the permutation is stratified by prompt block, so a prompt-mix imbalance cannot distort the null distribution.
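One way to picture permutation stratified by prompt block is shown below: group labels are shuffled only among samples that share a block, so each block keeps its original count of audited and baseline samples. This is a sketch under assumptions, not the v3.3 implementation. The function name, the 0/1 label encoding, and the string block ids are hypothetical.

```typescript
// Shuffle group labels within each prompt block, leaving block membership
// and per-block label counts unchanged.
function stratifiedPermutation(
  labels: number[], // assumed encoding: 0 = audited endpoint, 1 = baseline
  blocks: string[], // hypothetical prompt-block id for each pooled sample
  rng: () => number = Math.random
): number[] {
  if (labels.length !== blocks.length) {
    throw new Error("labels and blocks must have the same length");
  }
  const out = labels.slice();

  // Collect the pooled positions that belong to each block.
  const byBlock = new Map<string, number[]>();
  blocks.forEach((block, i) => {
    const members = byBlock.get(block) ?? [];
    members.push(i);
    byBlock.set(block, members);
  });

  // Fisher-Yates shuffle of the labels at each block's positions.
  byBlock.forEach((members) => {
    for (let i = members.length - 1; i > 0; i--) {
      const j = Math.floor(rng() * (i + 1));
      const a = members[i];
      const b = members[j];
      const tmp = out[a];
      out[a] = out[b];
      out[b] = tmp;
    }
  });
  return out;
}
```

Because labels never cross block boundaries, a block that is over-represented in one group stays over-represented in every permuted draw, so the null distribution reflects the same prompt mix as the observed split.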
References
- Gao et al. Model Equality Testing: Which Model is this API Serving? ICLR 2025. arXiv:2410.20247