Methodology
Last reviewed May 2026. This page documents the current audit model: five scored dimensions that sum to 100, seven report-only diagnostics, and the two run modes. The tokenizer-family fingerprint deterministically covers the OpenAI flagship series (GPT-4o through GPT-5.x) via o200k_base. For Claude and Gemini, whose tokenizers are closed-source, differential mode uses the trusted reference endpoint's prompt_tokens slope as the ground-truth tokenizer count, so no local tokenizer is required. Image models (e.g., gpt-image-2) are outside this framework: they have no chat prompt_tokens tokenizer slope, text capability probes do not apply to them, and MMD is a text-distribution test. Detecting image-model substitution requires a different method set (image statistics, latency fingerprinting) and is currently unsupported.
Why audit a proxy at all?
LLM proxies and aggregator gateways have two recurring failure modes. Either they inflate token counts in their usage blocks so the bill is larger than it should be, or they silently substitute a cheaper model for the one the user asked for. Both are difficult to spot from a single response. The right unit of analysis is a small batch of probes and the statistical signature they produce.
TrueLLMs runs that batch in your browser. It never persists keys, never retains responses, and prints every signal it derived along with the raw data, so you can verify the conclusion.
Two audit phases
Usage audit compares usage.prompt_tokens and usage.completion_tokens from the API against local recounts and the deterministic tokenizer-family slope probes. It is the token-inflation check; it is not, by itself, a proof of model identity.
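The recount comparison can be sketched as a ratio check. This is a minimal illustration, not the TrueLLMs implementation: the `UsageBlock` and `InflationReport` shapes, the `compareUsage` name, and the 5% tolerance are all assumptions, and the local counts are taken as inputs rather than computed here.

```typescript
// Hypothetical shape; field names mirror the usage.prompt_tokens and
// usage.completion_tokens fields named above.
interface UsageBlock {
  prompt_tokens: number;
  completion_tokens: number;
}

interface InflationReport {
  promptRatio: number;     // reported / locally recounted prompt tokens
  completionRatio: number; // reported / locally recounted completion tokens
  inflated: boolean;       // true if either ratio exceeds 1 + tolerance
}

// Compare the usage block returned by the API with a local recount.
// `tolerance` is an assumed slack for chat-template overhead, not a
// documented TrueLLMs constant.
function compareUsage(
  reported: UsageBlock,
  local: UsageBlock,
  tolerance: number = 0.05,
): InflationReport {
  // A zero local count gives NaN, which never compares as inflated.
  const ratio = (a: number, b: number): number => (b > 0 ? a / b : Number.NaN);
  const promptRatio = ratio(reported.prompt_tokens, local.prompt_tokens);
  const completionRatio = ratio(reported.completion_tokens, local.completion_tokens);
  const limit = 1 + tolerance;
  return {
    promptRatio,
    completionRatio,
    inflated: promptRatio > limit || completionRatio > limit,
  };
}
```

A ratio above the tolerance flags possible inflation only; as stated above, it says nothing about which model produced the response.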
Identity audit is a scorecard, not a Bayesian combination of independent tests. Only five dimensions affect the final model-substitution score: tokenizer family fingerprint, capability floor, MMD, cache replay, and sparse-token stress. The remaining dimensions stay visible as diagnostics with score weight 0.
The 12 dimensions
- Tokenizer Family Fingerprint — score weight 35%
- Capability Floor — score weight 20%
- MMD Distribution Equivalence Test — score weight 20%
- Cache Replay Detection — score weight 15%
- Sparse-Token Stress Test — score weight 10%
- LLMmap Fingerprint — score weight 0%
- Inter-Token Rhythm Fingerprint — score weight 0%
- Response Latency & Throughput — score weight 0%
- Self-Identification Probe — score weight 0%
- Canary Prompts — score weight 0%
- Context Window Probe — score weight 0%
- Stylometric Analysis — score weight 0%
Scored vs diagnostic
Scored dimensions are allowed to move the headline verdict. Diagnostic dimensions are retained because they explain behavior, but they carry no score weight: LLMmap, inter-token-time (ITT) rhythm, latency, self-identification, canary prompts, context window, and stylometry either can be forged by system prompts or output rewriting, or rely on estimated or mock reference data.
Absolute and differential modes
Absolute mode uses only the audited endpoint and runs the tokenizer family fingerprint, the capability floor, and the usage-inflation audit. It does not invent an MMD baseline. Differential mode uses the audited endpoint plus the user's own trusted official reference endpoint, enables MMD and capability deltas, and is the strongest “is this the real thing?” test. The reference key is kept in memory only; it is not persisted, printed, or logged. The tokenizer fingerprint measures the server-side billing tokenizer family, which usually equals the served model's native tokenizer, but a gateway can bill through a normalized tokenizer. A tokenizer mismatch alone is therefore likely evidence of substitution, not confirmed evidence.
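One way to picture a prompt_tokens slope is a least-squares fit over probes of increasing size. The sketch below assumes each probe repeats a fixed text unit a known number of times; the probe design, the `SlopePoint` shape, the function names, and the 2% relative tolerance are illustrative assumptions, not details taken from TrueLLMs.

```typescript
// One probe: a fixed text unit repeated `repeats` times, plus the
// prompt_tokens value the endpoint reported for it.
interface SlopePoint {
  repeats: number;
  promptTokens: number;
}

// Ordinary least-squares slope of promptTokens against repeats.
// Constant overhead (system prompt, chat template) lands in the
// intercept and does not affect the slope.
function fitSlope(points: SlopePoint[]): number {
  const n = points.length;
  if (n < 2) throw new Error("need at least two probe sizes");
  const meanX = points.reduce((s, p) => s + p.repeats, 0) / n;
  const meanY = points.reduce((s, p) => s + p.promptTokens, 0) / n;
  let sxy = 0;
  let sxx = 0;
  for (const p of points) {
    sxy += (p.repeats - meanX) * (p.promptTokens - meanY);
    sxx += (p.repeats - meanX) * (p.repeats - meanX);
  }
  if (sxx === 0) throw new Error("probe sizes must differ");
  return sxy / sxx;
}

// Compare the audited slope with the expected one: a local o200k_base
// count in absolute mode, or the reference endpoint's slope in
// differential mode. `relTol` is an assumed threshold.
function slopesMatch(observed: number, expected: number, relTol: number = 0.02): boolean {
  return Math.abs(observed - expected) <= relTol * Math.abs(expected);
}
```

Fitting a slope rather than comparing raw counts keeps fixed per-request overhead out of the comparison, which is why the same check can run against either a local tokenizer or a reference endpoint.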
LLMmap as a diagnostic
We borrow probe families from LLMmap: Fingerprinting Large Language Models (Pasquini et al., USENIX Security 2025, arXiv:2407.15847). In TrueLLMs this surface is diagnostic only. It helps explain refusal templates, instruction-conflict handling, deterministic puzzle behavior, and tooling boundaries, but it no longer contributes to the score.
Implementation honesty. The original paper trains a deep contrastive classifier over response embeddings and reports about 95% vendor identification accuracy across 42 LLM versions.
This release ships a lexical / structural template heuristic only, not a trained classifier, and we make no claim to the paper's accuracy number. Treat this dimension as a report-only diagnostic.
TrueLLMs ships the probe set with two safety guards: policy-sensitive probes are off by default, and response text is used only for feature extraction. A proxy or system prompt can rewrite these outputs, so LLMmap can support an investigation but cannot confirm substitution on its own.
MMD in differential mode
The MMD dimension follows Model Equality Testing: Which Model is This API Serving? (Gao et al., ICLR 2025, arXiv:2410.20247). Its two-sample test treats responses as samples from a distribution and runs a Maximum Mean Discrepancy (MMD) test. TrueLLMs scores this dimension only when the user supplies trusted reference endpoint samples and the sample size is sufficient; otherwise it reports the dimension as unavailable.
We compute MMD² with a Hamming kernel on the first 100 characters of each response, with no tokenization and no case folding, then estimate the null distribution with stratified prompt-block permutations. A rejected null means the two response distributions differ; quantization, fine-tuning, system prompts, and post-processing are all possible causes.
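The computation described above can be sketched as follows. Several details are assumptions made for illustration: the kernel is normalized to the fraction of matching positions, short responses are treated as padded, "characters" are UTF-16 code units, the estimator is the biased V-statistic, the statistic is averaged over prompt blocks, and the seeded generator exists only to make the sketch reproducible. All names are hypothetical.

```typescript
const PREFIX_LEN = 100;

// Hamming kernel: fraction of the first 100 positions where the two
// responses agree. charAt() returns "" past the end of a string, so
// shorter responses behave as if padded with a shared filler.
function hammingKernel(a: string, b: string): number {
  let same = 0;
  for (let i = 0; i < PREFIX_LEN; i++) {
    if (a.charAt(i) === b.charAt(i)) same++;
  }
  return same / PREFIX_LEN;
}

function meanKernel(xs: string[], ys: string[]): number {
  let sum = 0;
  for (const x of xs) for (const y of ys) sum += hammingKernel(x, y);
  return sum / (xs.length * ys.length);
}

// Biased (V-statistic) estimate of MMD squared.
function mmd2(xs: string[], ys: string[]): number {
  return meanKernel(xs, xs) + meanKernel(ys, ys) - 2 * meanKernel(xs, ys);
}

// Responses to one prompt from each endpoint.
interface PromptBlock {
  audited: string[];
  reference: string[];
}

function blockStatistic(blocks: PromptBlock[]): number {
  return blocks.reduce((s, b) => s + mmd2(b.audited, b.reference), 0) / blocks.length;
}

// Small seeded PRNG returning values in [0, 1).
function seededRandom(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Stratified permutation test: labels are shuffled only within each
// prompt block, so responses to different prompts are never mixed.
function permutationPValue(blocks: PromptBlock[], rounds: number = 500, seed: number = 1): number {
  const rand = seededRandom(seed);
  const observed = blockStatistic(blocks);
  let atLeast = 0;
  for (let r = 0; r < rounds; r++) {
    const shuffled = blocks.map((b) => {
      const pool = b.audited
        .concat(b.reference)
        .map((text) => ({ text, key: rand() }))
        .sort((p, q) => p.key - q.key)
        .map((p) => p.text);
      return {
        audited: pool.slice(0, b.audited.length),
        reference: pool.slice(b.audited.length),
      };
    });
    if (blockStatistic(shuffled) >= observed) atLeast++;
  }
  // Add-one correction keeps the estimate strictly positive.
  return (atLeast + 1) / (rounds + 1);
}
```

A small p-value rejects the null that both endpoints draw from the same response distribution; as noted above, that does not by itself say why they differ.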
A rank-based uniformity test has been reported to outperform both MMD and Kolmogorov-Smirnov (KS) baselines (Auditing Black-Box LLM APIs with a Rank-Based Uniformity Test, arXiv:2506.06975, 2025). TrueLLMs currently uses MMD; the rank-based approach is noted as a future direction.
How to enable. Run differential mode with both the audited endpoint and your trusted official reference endpoint configured in the same browser session. The reference API key is held only in memory for that run.
No reference key is persisted, printed, or logged. Prompt blocks are matched before permutation, so a difference in prompt mix between the two endpoints cannot by itself produce a spurious MMD difference.
Inter-chunk rhythm as a diagnostic
Inspired by LLMs Have Rhythm: Fingerprinting Large Language Models Using Inter-Token Times and Network Traffic Analysis (Alhazbi et al., 2025, arXiv:2502.20589). TrueLLMs measures gaps between streamed SSE chunks at the reader, extracts simple rhythm features, and displays the result as a diagnostic with score weight 0.
Honest disclosure. What we measure is the inter-SSE-chunk arrival time at the reader, not the true inter-token time inside the model. TCP coalescing, SSE flushing cadence, gateway buffering, network load, and timestamp resolution all add noise. The current per-model rhythm library should be treated as diagnostic context, not a scoring baseline.
How to enable. Turn on stream: true in the configuration panel and run the audit. In Direct mode the client measures SSE arrival locally; in Proxy mode it parses the server-emitted audit.timing SSE event. Either way the values land on TestResult.chunkTimestamps and the ITT diagnostic consumes them automatically.
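As a rough illustration of the "simple rhythm features" mentioned above, the sketch below turns a list of chunk arrival timestamps into gap statistics. The specific features (mean, median, coefficient of variation), the `RhythmFeatures` shape, and the function name are assumptions; only the idea of consuming chunk timestamps comes from this page.

```typescript
interface RhythmFeatures {
  gapCount: number; // number of inter-chunk gaps
  meanMs: number;   // mean gap in milliseconds
  medianMs: number; // median gap in milliseconds
  cv: number;       // coefficient of variation (std / mean)
}

// `chunkTimestamps` holds arrival times in milliseconds, in arrival order.
// Returns null when there are too few chunks to form a gap.
function rhythmFeatures(chunkTimestamps: number[]): RhythmFeatures | null {
  const gaps: number[] = [];
  let prev: number | undefined;
  for (const t of chunkTimestamps) {
    if (prev !== undefined) gaps.push(t - prev);
    prev = t;
  }
  if (gaps.length === 0) return null;

  const mean = gaps.reduce((s, g) => s + g, 0) / gaps.length;
  // Population variance of the gaps.
  const variance = gaps.reduce((s, g) => s + (g - mean) * (g - mean), 0) / gaps.length;

  const sorted = gaps.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  const upper = sorted[mid] ?? 0;
  const lower = sorted[mid - 1] ?? upper;
  const median = sorted.length % 2 === 1 ? upper : (lower + upper) / 2;

  return {
    gapCount: gaps.length,
    meanMs: mean,
    medianMs: median,
    cv: mean > 0 ? Math.sqrt(variance) / mean : 0,
  };
}
```

Because these gaps are measured at the reader, buffering anywhere on the path changes them, which is consistent with keeping this dimension at score weight 0.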
Sparse-token stress
MiniMax's May 2026 investigation of the “Ma Jiaqi (马嘉祺)” case showed a generation-side asymmetry: low-frequency tokens can remain understandable through the input embedding while drifting in the lm_head enough that the model struggles to emit them. TrueLLMs uses this as a weak, vendor-independent generation-side fingerprint.
TrueLLMs sends stress probes that ask the model to echo fragile token strings verbatim: rare CJK names, Chinese SEO spam, Japanese colloquial strings, and related low-frequency forms. Misses are classified as omit, substitute, partial, refuse, or blank. The dimension can score as a weak signal that the served SFT pipeline behaves unexpectedly, but it does not identify a specific vendor by itself.
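The miss categories could be assigned along these lines. This is a hedged sketch only: the carrier text around the target (`prefix`, `suffix`), the English-only refusal pattern, the character-overlap rule that separates partial from substitute, and the extra "pass" label are all assumptions, not the classifier TrueLLMs ships.

```typescript
type EchoOutcome = "pass" | "omit" | "substitute" | "partial" | "refuse" | "blank";

// Hypothetical probe shape: the model is asked to echo
// prefix + target + suffix verbatim.
interface EchoProbe {
  prefix: string;
  target: string;
  suffix: string;
}

// Assumed refusal markers; a real list would need more languages.
const REFUSAL = /(i can't|i cannot|i'm sorry|unable to)/i;

function classifyEcho(probe: EchoProbe, response: string): EchoOutcome {
  const text = response.trim();
  if (text.length === 0) return "blank";
  if (text.includes(probe.target)) return "pass";
  if (REFUSAL.test(text)) return "refuse";

  // If the carrier text survived, look only at what landed in the
  // slot where the target should be; otherwise fall back to the whole text.
  const start = text.indexOf(probe.prefix);
  const end = text.lastIndexOf(probe.suffix);
  const slotStart = start + probe.prefix.length;
  const slot = start >= 0 && end >= slotStart ? text.slice(slotStart, end).trim() : text;

  if (slot.length === 0) return "omit";
  // Count target characters that still appear in the slot.
  const kept = Array.from(probe.target).filter((ch) => slot.includes(ch)).length;
  return kept > 0 ? "partial" : "substitute";
}
```

For a probe with target 马嘉祺 wrapped in square brackets, the response `[马嘉]` is classified as partial, `[李明]` as substitute, and `[]` as omit.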
Honest scope. We do not yet have large measured failure tables across every frontier family. The dimension contributes only 10 score points and should be read as supporting evidence, not as a model identity proof.
Weight rebalancing
Default scored weights sum to 100 across the five scored dimensions only. When a scored dimension is unavailable, its weight is redistributed proportionally across the remaining available scored dimensions. Diagnostic dimensions have weight 0 and do not participate in rebalancing.
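Proportional redistribution can be written in a few lines. The weights below are the default scored weights listed on this page; the `ScoredDimension` shape, the short dimension keys, and the `rebalance` function are hypothetical names used for illustration.

```typescript
interface ScoredDimension {
  name: string;
  weight: number;     // default score weight
  available: boolean; // false when the dimension could not run
}

// Redistribute the weight of unavailable dimensions across the
// available ones in proportion to their default weights, so the
// result still sums to the original total.
function rebalance(dims: ScoredDimension[]): Map<string, number> {
  const fullTotal = dims.reduce((s, d) => s + d.weight, 0);
  const live = dims.filter((d) => d.available);
  const liveTotal = live.reduce((s, d) => s + d.weight, 0);
  const out = new Map<string, number>();
  if (liveTotal === 0) return out; // nothing available, nothing to score
  for (const d of live) {
    out.set(d.name, (d.weight / liveTotal) * fullTotal);
  }
  return out;
}

// Example: MMD cannot run without a reference endpoint.
const withoutMmd: ScoredDimension[] = [
  { name: "tokenizer", weight: 35, available: true },
  { name: "capability", weight: 20, available: true },
  { name: "mmd", weight: 20, available: false },
  { name: "cacheReplay", weight: 15, available: true },
  { name: "sparseToken", weight: 10, available: true },
];
const rebalanced = rebalance(withoutMmd);
```

With MMD unavailable, the remaining 80 points are scaled back up to 100, giving 43.75 for the tokenizer fingerprint, 25 for the capability floor, 18.75 for cache replay, and 12.5 for sparse-token stress.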
MMD is unavailable in absolute mode because there is no trusted reference endpoint. A diagnostic becoming unavailable does not change the score. A tokenizer-family mismatch remains the strongest single deterministic signal, but because it measures the billing tokenizer family, it needs corroboration from another scored dimension before the top-level verdict can be treated as confirmed.
Limitations & honest disclosure
- Only five dimensions are scoring dimensions. The other seven are displayed for diagnosis and forensics, not for the headline confidence number.
- Tokenizer fingerprint measures the server-side token accounting tokenizer family. That usually matches the served model's native tokenizer, but gateways can normalize billing through a different tokenizer.
- MMD and capability deltas require a user-supplied trusted reference endpoint. Without that reference, TrueLLMs does not fabricate a baseline.
- TrueLLMs is not adversarially robust. A proxy that recognises the probe set could pass requests through to the real model only for those probes. We have no defence against that today.
- One signal proves nothing. A significant MMD result, a tokenizer-family mismatch, or repeated identical stochastic outputs each has multiple legitimate explanations. The labels report patterns; the user decides what they mean.
What TrueLLMs is not
- It is not a fraud accusation. We report likely-substituted or confirmed-substituted evidence patterns, never “scam”.
- It is not a continuous monitor. Run it manually whenever a bill or endpoint behavior looks wrong.
- It does not, and cannot, prove a positive identity. A clean run means only that no substitution pattern was detected; it is consistent with an honest proxy, and also with one that behaves correctly only on the probes.
References
- Pasquini et al. LLMmap: Fingerprinting Large Language Models. USENIX Security 2025. arXiv:2407.15847.
- Gao et al. Model Equality Testing: Which Model is This API Serving? ICLR 2025. arXiv:2410.20247.
- Alhazbi et al. LLMs Have Rhythm: Fingerprinting Large Language Models Using Inter-Token Times and Network Traffic Analysis. 2025. arXiv:2502.20589.
- MiniMax. Sparse-token forgetting and lm_head drift: the “Ma Jiaqi (马嘉祺)” case. Internal investigation writeup, May 2026.
- OpenAI. tiktoken: BPE tokenizer for OpenAI models. github.com/openai/tiktoken.
- Are You Getting What You Pay For? Auditing Model Substitution in LLM APIs. arXiv:2504.04715, 2025.
- Auditing Black-Box LLM APIs with a Rank-Based Uniformity Test. arXiv:2506.06975, 2025.
- IMMACULATE: A Practical LLM Auditing Framework via Verifiable Computation. arXiv:2602.22700, 2026.
- Log Probability Tracking of LLM APIs. arXiv:2512.03816, 2025.
Try it
Open the auditor and run the Quick preset against your proxy. It takes about a minute.