Dimension · score weight 20%

Capability Floor

What this dimension detects

The capability floor dimension uses a small set of hard questions with ground-truth grading. It is designed to catch a served model that cannot clear the minimum performance expected of the claimed tier.

Algorithm

Run the capability item set, which covers math, reasoning, instruction following, and knowledge. Each item has a deterministic grader. Absolute mode reports the pass rate of the audited endpoint. Differential mode also runs the same items against a trusted reference endpoint and reports the pass-rate delta (suspect minus reference, so a negative value means the suspect did worse), plus the regressed items: those the reference passed but the suspect failed.
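The two modes can be sketched as follows. This is a minimal illustration, not the contents of lib/capability/scorer.ts: the names ItemResult, DifferentialReport, passRate, and compareToReference are hypothetical, and the sketch assumes that each grader result has already been reduced to a pass/fail boolean keyed by item id.

```typescript
// Hypothetical shapes; the real types in lib/capability/ are not shown in this document.
type ItemResult = { id: string; passed: boolean };

interface DifferentialReport {
  suspectPassRate: number;
  referencePassRate: number;
  delta: number; // suspect minus reference; negative means the suspect did worse
  regressedItems: string[]; // items the reference passed but the suspect failed
}

// Absolute mode: fraction of items the endpoint passed.
function passRate(results: ItemResult[]): number {
  if (results.length === 0) return 0; // assumption: an empty run counts as 0
  const passed = results.filter((r) => r.passed).length;
  return passed / results.length;
}

// Differential mode: same items against a trusted reference endpoint.
function compareToReference(
  suspect: ItemResult[],
  reference: ItemResult[],
): DifferentialReport {
  const referenceById = new Map(
    reference.map((r) => [r.id, r.passed] as const),
  );
  const regressedItems = suspect
    .filter((s) => !s.passed && referenceById.get(s.id) === true)
    .map((s) => s.id);
  const suspectPassRate = passRate(suspect);
  const referencePassRate = passRate(reference);
  return {
    suspectPassRate,
    referencePassRate,
    delta: suspectPassRate - referencePassRate,
    regressedItems,
  };
}
```

An item that both endpoints fail lowers both pass rates but is not counted as regressed, which is why the regressed-item count is reported separately from the delta.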

Thresholds

Condition	Verdict contribution
Absolute pass rate ≥ 90% and at most 1 failed item	Consistent with the claimed frontier tier
Absolute pass rate ≥ 75% and < 90%	Inconclusive; prefer differential comparison
Absolute pass rate < 75% or ≥ 4 failed items	Weak capability-floor signal
Differential delta ≤ -0.15 and ≥ 3 regressed items	Likely downgrade in differential mode
Differential delta ≤ -0.10 and ≥ 2 regressed items	Possible downgrade in differential mode
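As a rough sketch, the table above can be read as two small decision functions. The function names and verdict labels here are hypothetical. The table does not say how to treat a pass rate of 90% or more with two or three failed items, or which row wins when several match, so this sketch assumes that the weak-signal row is checked first, that uncovered cases fall through to inconclusive, and that the stricter differential row takes precedence.

```typescript
// Hypothetical labels for the verdict contributions in the thresholds table.
type AbsoluteVerdict = "consistent" | "inconclusive" | "weak";
type DifferentialVerdict = "likely-downgrade" | "possible-downgrade" | "none";

// passRate is a fraction in [0, 1]; failedItems is a count.
function absoluteVerdict(passRate: number, failedItems: number): AbsoluteVerdict {
  // Assumption: the weak-signal row is evaluated before the others.
  if (passRate < 0.75 || failedItems >= 4) return "weak";
  if (passRate >= 0.9 && failedItems <= 1) return "consistent";
  // Assumption: anything the table does not cover is treated as inconclusive.
  return "inconclusive";
}

// delta is suspect pass rate minus reference pass rate.
function differentialVerdict(delta: number, regressedItems: number): DifferentialVerdict {
  // Assumption: the stricter row wins when both differential rows match.
  if (delta <= -0.15 && regressedItems >= 3) return "likely-downgrade";
  if (delta <= -0.1 && regressedItems >= 2) return "possible-downgrade";
  return "none";
}
```

Because the delta is a difference of floating-point fractions, a value that lands exactly on a boundary such as -0.15 may compare unpredictably; comparing integer item counts instead would avoid that, at the cost of departing from the table's wording.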

Limitations

Ground-truth hard questions measure capability, not model identity. A specialized fine-tune, regional policy layer, or safety setting can cause an endpoint to miss items without implying substitution. Absolute mode is weaker than differential mode because it has no trusted reference answering the same prompts.

References

TrueLLMs lib/capability/items.ts
TrueLLMs lib/capability/scorer.ts
