Dimension · score weight 20%
Capability Floor
What this dimension detects
The capability floor dimension runs a small set of hard questions with ground-truth grading. It is designed to catch a served model that cannot clear the minimum capability expected of the claimed tier.
Algorithm
Run the capability item set, which spans math, reasoning, instruction following, and knowledge. Each item has a deterministic grader. Absolute mode reports the pass rate of the audited endpoint. Differential mode also runs the same items against a trusted reference endpoint and reports the pass-rate delta (suspect minus reference, so a negative delta means the suspect did worse) plus the regressed items, meaning those the reference passed but the suspect failed.
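
The two modes can be sketched as follows. This is a minimal illustration rather than the code in lib/capability/scorer.ts: the `ItemResult` shape and the `passRate` and `differential` helpers are hypothetical names, and the sketch assumes each item has already been graded to a boolean.

```typescript
// Hypothetical result shape; the real types in lib/capability may differ.
type ItemResult = { id: string; passed: boolean };

// Absolute mode: fraction of items the endpoint passed.
function passRate(results: ItemResult[]): number {
  if (results.length === 0) {
    throw new Error("cannot compute a pass rate over zero items");
  }
  return results.filter((r) => r.passed).length / results.length;
}

// Differential mode: delta is suspect minus reference, so a negative
// value means the suspect endpoint did worse than the reference.
function differential(suspect: ItemResult[], reference: ItemResult[]) {
  const suspectById = new Map(suspect.map((r) => [r.id, r.passed] as const));
  // Regressed items: the reference passed, the suspect failed.
  // Items missing from the suspect run are not counted (an assumption).
  const regressed = reference
    .filter((r) => r.passed && suspectById.get(r.id) === false)
    .map((r) => r.id);
  return { delta: passRate(suspect) - passRate(reference), regressed };
}
```

Keying the comparison on item id, rather than array position, keeps the regressed list correct even if the two runs return results in a different order.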
Thresholds
| Condition | Verdict contribution |
|---|---|
| Absolute pass rate ≥ 90% and at most 1 failed item | Consistent with the claimed frontier tier |
| Absolute pass rate ≥ 75% and < 90% | Inconclusive; prefer differential comparison |
| Absolute pass rate < 75% or ≥ 4 failed items | Weak capability-floor signal |
| Differential delta ≤ -0.15 and ≥ 3 regressed items | Likely downgrade in differential mode |
| Differential delta ≤ -0.10 and ≥ 2 regressed items | Possible downgrade in differential mode |
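
The table can be read as two small decision functions. The sketch below is one possible encoding, not the scorer's actual logic: the function names and verdict labels are hypothetical, pass rates and deltas are assumed to be fractions (so 90% is `0.9` and `-0.15` is 15 percentage points), and the stricter differential row is assumed to take precedence. The table does not say how to treat a pass rate of 90% or more with 2 or 3 failed items; this sketch assumes that case is inconclusive.

```typescript
type AbsoluteVerdict = "consistent" | "inconclusive" | "weak";
type DifferentialVerdict = "likely-downgrade" | "possible-downgrade" | "none";

// Absolute mode: rate is a fraction in [0, 1], failedItems is a count.
function absoluteVerdict(rate: number, failedItems: number): AbsoluteVerdict {
  if (rate < 0.75 || failedItems >= 4) return "weak";
  if (rate >= 0.9 && failedItems <= 1) return "consistent";
  // Covers 75% to 90%, plus the case the table leaves unspecified
  // (rate >= 90% with 2 or 3 failed items), which is an assumption here.
  return "inconclusive";
}

// Differential mode: delta is suspect minus reference pass rate.
function differentialVerdict(delta: number, regressedItems: number): DifferentialVerdict {
  // Check the stricter row first, since it also satisfies the looser row.
  if (delta <= -0.15 && regressedItems >= 3) return "likely-downgrade";
  if (delta <= -0.1 && regressedItems >= 2) return "possible-downgrade";
  return "none";
}
```

For example, a suspect endpoint that scores 20 percentage points below the reference with three regressed items would map to the likely-downgrade row under these assumptions.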
Limitations
Ground-truth hard questions measure capability, not model identity. A specialized fine-tune, regional policy layer, or safety setting can miss items without implying substitution. Absolute mode is weaker than differential mode because it has no trusted same-prompt reference.
References
- TrueLLMs lib/capability/items.ts
- TrueLLMs lib/capability/scorer.ts