April 22, 2026 · TrueLLMs

Why Your LLM Proxy Might Be Lying to You

The ICLR 2025 paper Model Equality Testing found that 11 of 31 production endpoints deviated from the reference distribution. Here is what that means for your bill.

In late 2024, Gao et al. submitted a paper to ICLR 2025 with a quiet but striking finding. They tested 31 production endpoints that all advertised the same Llama family weights — a mix of first-party APIs and third-party gateways — against Meta's reference distribution. Eleven of the 31 deviated at p < 0.05 under a Maximum Mean Discrepancy two-sample test.

Eleven out of thirty-one. The paper is careful to say that this means the response distributions differ, not that the provider is cheating. Quantization, fine-tuning, system prompts, regional routing, and post-processing all produce distributional shifts. But roughly a third of the sample doing any of those without disclosure is interesting, and the paper (Model Equality Testing: Which Model is this API Serving?) is worth reading.
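To make the idea of a two-sample test on response distributions concrete, here is a minimal sketch of a Maximum Mean Discrepancy permutation test over strings. It is an illustration under assumptions, not the paper's implementation: the character-level Hamming-style kernel, the fixed comparison length of 32, and the names hamming_kernel, mmd_squared and permutation_p_value are all chosen for this example.

```python
import random

def hamming_kernel(a: str, b: str, length: int = 32) -> float:
    """Fraction of positions where two padded/truncated strings agree.

    Assumption: a character-level kernel with a fixed length, for illustration.
    """
    a = a[:length].ljust(length, "\0")
    b = b[:length].ljust(length, "\0")
    return sum(x == y for x, y in zip(a, b)) / length

def mmd_squared(xs, ys, kernel=hamming_kernel) -> float:
    """Biased (V-statistic) estimate of squared MMD between two samples."""
    def mean_k(us, vs):
        return sum(kernel(u, v) for u in us for v in vs) / (len(us) * len(vs))
    return mean_k(xs, xs) + mean_k(ys, ys) - 2 * mean_k(xs, ys)

def permutation_p_value(xs, ys, n_perm: int = 200, seed: int = 0) -> float:
    """p-value for H0 'both samples come from the same distribution'."""
    rng = random.Random(seed)
    observed = mmd_squared(xs, ys)
    pooled = list(xs) + list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        # Re-split the pooled completions at random and recompute the statistic.
        if mmd_squared(pooled[:len(xs)], pooled[len(xs):]) >= observed:
            hits += 1
    # The +1 terms keep the estimate away from an impossible p-value of 0.
    return (hits + 1) / (n_perm + 1)
```

In use, xs would hold completions sampled from the reference weights and ys completions sampled from the endpoint under test, for the same prompts. A small p-value says only that the two samples look different, which matches the paper's caution above.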

The structural reason this happens

LLM proxies have brutal unit economics. A user pays $X per million tokens for gpt-5. The proxy operator pays whatever the upstream charges for gpt-5 when they hit OpenAI directly. Their margin is X − cost. There are exactly four ways to widen that gap:

Negotiate volume discounts from the upstream — legitimate, hard.
Cache aggressively — legitimate-ish, depends on disclosure.
Inflate the usage block they return to you — fraud.
Substitute a cheaper model when you ask for the expensive one — fraud.

TrueLLMs catches the last two. We do not have an opinion on the first two. We just give you the data.
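The third item in the list, an inflated usage block, can be spot-checked by counting tokens yourself and comparing against what the proxy reports. The sketch below is one hypothetical way to do that and is not a description of how TrueLLMs works internally. The names check_usage and whitespace_count and the 5% tolerance are assumptions for this example; a real check would pass in the tokenizer of the claimed model instead of the whitespace stand-in.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class UsageCheck:
    reported: int     # tokens the proxy billed for
    counted: int      # tokens counted independently
    ratio: float      # reported / counted
    suspicious: bool  # True when reported exceeds counted by more than the tolerance

def whitespace_count(text: str) -> int:
    """Stand-in tokenizer for the demo only; swap in the claimed model's tokenizer."""
    return len(text.split())

def check_usage(text: str, reported_tokens: int,
                count_tokens: Callable[[str], int] = whitespace_count,
                tolerance: float = 0.05) -> UsageCheck:
    """Compare a reported token count against an independent count of the same text."""
    counted = count_tokens(text)
    if counted == 0:
        ratio = 1.0 if reported_tokens == 0 else float("inf")
    else:
        ratio = reported_tokens / counted
    # Hypothetical threshold: flag anything billed above counted * (1 + tolerance).
    return UsageCheck(reported_tokens, counted, ratio, ratio > 1 + tolerance)
```

For example, check_usage("one two three four", 6) reports a ratio of 1.5 and flags the response, while a reported count of 4 passes. Small discrepancies can come from special tokens or chat templates, which is why the sketch uses a tolerance instead of demanding exact equality.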

What the audit looks like in practice

A clean run against the OpenAI API directly looks like this: 12 dimensions, all green, substitution confidence near 0. A typical aggregator gateway looks like this: 8 dimensions green, 2 yellow, 2 red, confidence around 35. A bad gateway looks like this: logprobs are unavailable (red flag), tokenizer boundaries are inconsistent (mismatch), LLMmap classifies the responses as a different vendor than claimed, and the inter-token timing (ITT) rhythm shows none of the speculative-decoding bimodality the claimed model is known for. Confidence is capped at 70.

The cap is deliberate. Without logprobs, even the strongest active probes can be fooled by a sufficiently elaborate proxy. We say likely-substituted rather than confirmed-substituted in that case, and ship the raw evidence so the user can decide.
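The capping rule can be sketched in a few lines. Only the cap of 70 without logprobs and the two substituted labels come from the text above. The 0 to 100 scale, the thresholds of 50 and 90, the "clean" label, and the function names are hypothetical choices made for this illustration.

```python
NO_LOGPROBS_CAP = 70.0  # from the post: confidence is capped at 70 without logprobs

def capped_confidence(raw_score: float, logprobs_available: bool) -> float:
    """Clamp a raw score to 0..100 (assumed scale), then apply the no-logprobs cap."""
    score = max(0.0, min(100.0, raw_score))
    return score if logprobs_available else min(score, NO_LOGPROBS_CAP)

def verdict(raw_score: float, logprobs_available: bool,
            likely_at: float = 50.0,      # hypothetical threshold
            confirmed_at: float = 90.0    # hypothetical threshold
            ) -> str:
    """Map a capped confidence to a label."""
    score = capped_confidence(raw_score, logprobs_available)
    if score >= confirmed_at:
        return "confirmed-substituted"
    if score >= likely_at:
        return "likely-substituted"
    return "clean"
```

Because the cap sits below the assumed confirmation threshold, a run without logprobs can never return confirmed-substituted in this sketch, however strong the other signals are.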

What to do if your audit comes back red

Three things, in order:

Re-run the audit at a different time of day. Some proxies route differently under load.
Run the same audit against the upstream provider directly. Confirm what a clean fingerprint looks like for your model.
Open a support ticket with the proxy and share the evidence. A reputable provider will explain the discrepancy or fix it.

References

Gao et al. Model Equality Testing: Which Model is this API Serving? ICLR 2025. arXiv:2410.20247.
Pasquini et al. LLMmap: Fingerprinting Large Language Models. USENIX Security 2025. arXiv:2407.15847.
Alhazbi et al. LLMs Have Rhythm. 2025. arXiv:2502.20589.

Run an audit against your own proxy. It takes about a minute and stays in your browser.