May 30, 2026 · TrueLLMs

Case Study: Token Billing Inflation in a Reseller Endpoint

A sanitized /vet differential-mode audit of a third-party reseller: real Claude tokenizer, no capability downgrade, and a fixed +24 input-token billing offset per request.

This is a sanitized field case from /vet differential mode. The audited endpoint was a third-party reseller exposing claude-opus-4-8 through an OpenAI-compatible interface and forwarding upstream to the official Anthropic API. The reference side was direct official Anthropic access for the same claimed model.

The result was not model substitution. The tokenizer and capability evidence matched the official reference. The billing evidence did not: the reseller reported a fixed extra input-token count on every request, large enough to matter for short chat traffic.

The first run found a transport bug

The initial audit produced 41/41 request failures and no usable measurements. The cause was not the model result. The reseller returned HTTP 400 whenever the request included temperature: 0, because the official Anthropic opus-4-x upstream rejects an explicit zero temperature in this path.

Failure mode: every probe failed before tokenizer, capability, or billing evidence could be collected.
Fix: send the temperature field only when temperature > 0, consistently for both API formats.
Lesson: OpenAI-compatible resellers that forward to Anthropic are common, and small gateway quirks can otherwise turn an audit into a silent no-data run.

What changed after rerun

After the temperature handling fix, the same differential run produced usable evidence across tokenizer, capability, and usage accounting.

1. Tokenizer differential: exact slope match

Claude tokenizers are closed-source, so absolute local confirmation is not available. Differential mode is the decisive tool here: run the same tokenizer probes against the reseller and the official Anthropic reference, then compare prompt_tokens slopes.

The slopes matched probe by probe: cjk about 31, emoji about 52, mixed-rare about 52, ascii-ctrl about 28, with audited endpoint equal to reference. That is evidence for the real Claude tokenizer, not an OpenAI-family tokenizer or an ad hoc counter.

2. Capability differential: no degradation

The capability floor also matched the official reference. Pass rate was consistent with the reference run, with no observed downgrade on the sampled capability items.

3. Billing differential: fixed +24 input tokens

Usage accounting showed the same extra input-token count on each request. Short sentence: 10 -> 34. Short Chinese prompt: 22 -> 46. Long English prompt: 63 -> 87. Code prompt: 44 -> 68. The delta was +24 every time.

Because the tokenizer slopes matched the official reference, this was not explained by a tokenizer difference. The pattern is consistent with an approximately 24-token hidden system prompt being injected per request and billed to the user, though the audit cannot prove the exact hidden text.

Important: the +24 tokens and +240% above describe token quantity, not spend. Actual cost is tokens × unit price. Resellers often sell below official unit price, for example 0.6×, so token inflation does not necessarily mean the endpoint is more expensive than official. A fixed injection hits short requests hard and can make them more expensive, while the effect is smaller on long context. Whether this is cost padding depends on the endpoint's unit price and your prompt length.

How to report the risk

Report the billing issue as approximately +24 input tokens per request.
For short chat-style requests, the observed overage can reach +240%.
For longer-context requests in this sample, the impact is about +38%.
A single aggregate percentage, about +58.7% in this run, is less informative because it moves with the prompt mix.

Boundaries

Differential consistency is not supplier certification. It says this sample matched the official reference on the measured signals.
An adversarial router could still serve clean behavior to known test traffic and different behavior elsewhere.
The conclusion is sampling-based: real claude-opus-4-8 evidence on this run, no observed capability downgrade, fixed billing offset observed.
The fixed offset is best treated as billing risk, not proof of the exact prompt or business intent behind the injection.
Absolute mode would not have isolated this case for Claude tokenizer identity, because the tokenizer is closed-source.
Differential mode was the key method: trusted official reference first, then compare slopes, pass rates, and usage fields.

Run a differential audit against your own reseller endpoint with an official reference when you need this level of evidence.