Detection dimension · weight 7%
Canary Prompt Behavior
What this dimension detects
Canary prompts are small, deterministic puzzles where strong and weak models give measurably different answers. Examples: counting letters in 'strawberry' (legacy), Newcomb-style decision theory, niche IUPAC chemistry, ARC-AGI-2 pattern induction.
Algorithm
Send a fixed set of canaries (the legacy probes plus the 2026 library: HLE-style, ARC-AGI-2, BFCL-style) at temperature 0. For each response, look up the canary entry in `lib/fingerprints/canaries-2026.ts`, fetch the known-answer template for the claimed model, and score the response against it using normalized-token overlap. A hit requires the claimed template to clear an absolute floor (~0.30) AND beat the best alternative template by a margin (0.10). If multiple canaries miss with a decisive alternative template, that alternative model is added to the suspected-model ballot.
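
The per-response scoring step can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `lib/fingerprints/canaries-2026.ts`: the `CanaryEntry` shape, the function names, and the model names are hypothetical, and "normalized-token overlap" is assumed here to mean the fraction of distinct template tokens (lowercased, punctuation stripped) that also occur in the response.

```typescript
// Hypothetical entry shape; the real entries may carry more fields.
interface CanaryEntry {
  id: string;
  // Model name -> author-written known-answer template.
  knownAnswers: Record<string, string>;
}

interface TemplateScores {
  claimedScore: number;
  bestAltModel: string | null; // null when no other model has a template
  bestAltScore: number;
}

// Normalize: lowercase, split on anything that is not a letter or digit.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);
}

// Assumed metric: share of distinct template tokens found in the response.
function tokenOverlap(response: string, template: string): number {
  const responseTokens = tokenize(response);
  const templateTokens = tokenize(template).filter((t, i, all) => all.indexOf(t) === i);
  if (templateTokens.length === 0) return 0;
  const shared = templateTokens.filter((t) => responseTokens.indexOf(t) !== -1).length;
  return shared / templateTokens.length;
}

// Returns null when the canary has no template for the claimed model
// (such a canary is reported but not scored).
function scoreTemplates(
  entry: CanaryEntry,
  claimedModel: string,
  response: string
): TemplateScores | null {
  const claimedTemplate = entry.knownAnswers[claimedModel];
  if (claimedTemplate === undefined) return null;

  let bestAltModel: string | null = null;
  let bestAltScore = 0;
  for (const model of Object.keys(entry.knownAnswers)) {
    if (model === claimedModel) continue;
    const template = entry.knownAnswers[model];
    if (template === undefined) continue;
    const score = tokenOverlap(response, template);
    if (bestAltModel === null || score > bestAltScore) {
      bestAltModel = model;
      bestAltScore = score;
    }
  }
  return {
    claimedScore: tokenOverlap(response, claimedTemplate),
    bestAltModel,
    bestAltScore,
  };
}
```

Keeping the claimed score and the best alternative score side by side lets the hit rule (floor plus margin) be applied in a separate step.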
Thresholds
| Condition | Verdict contribution |
|---|---|
| claimed-template score ≥ 0.30 AND beats best alternative by ≥ 0.10 | Hit |
| any other case | Miss (votes for the best alternative if it clears 0.30 by ≥ 0.15) |
| ≥ 2 canaries scored AND hit-rate < 50% | Dimension mismatch |
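
A minimal sketch of how the table above might be applied to a batch of scored canaries. All names are hypothetical. Two readings are assumptions rather than confirmed behaviour: a miss is taken to vote for the best alternative when that alternative scores at least 0.30 and exceeds the claimed score by at least 0.15, and "multiple canaries" is taken to mean at least two votes for the same model.

```typescript
// Hypothetical per-canary input: overlap scores for the claimed model's
// template and for the best-scoring alternative template.
interface CanaryScore {
  claimedScore: number;
  bestAltModel: string | null;
  bestAltScore: number;
}

interface DimensionVerdict {
  scored: number;
  hits: number;
  mismatch: boolean;
  suspectedModels: string[];
}

const FLOOR = 0.30;       // absolute floor from the table
const HIT_MARGIN = 0.10;  // claimed must beat best alternative by this much
const VOTE_MARGIN = 0.15; // assumed: alternative must beat claimed by this much
const MIN_VOTES = 2;      // assumed meaning of "multiple canaries"

function isHit(s: CanaryScore): boolean {
  return s.claimedScore >= FLOOR && s.claimedScore - s.bestAltScore >= HIT_MARGIN;
}

function aggregate(scores: CanaryScore[]): DimensionVerdict {
  let hits = 0;
  const votes: Record<string, number> = {};

  for (const s of scores) {
    if (isHit(s)) {
      hits++;
      continue;
    }
    // Miss: vote only when the alternative is decisive (assumed reading).
    const decisive =
      s.bestAltModel !== null &&
      s.bestAltScore >= FLOOR &&
      s.bestAltScore - s.claimedScore >= VOTE_MARGIN;
    if (decisive && s.bestAltModel !== null) {
      votes[s.bestAltModel] = (votes[s.bestAltModel] || 0) + 1;
    }
  }

  const scored = scores.length;
  // Dimension mismatch: at least 2 canaries scored and hit-rate below 50%.
  const mismatch = scored >= 2 && hits / scored < 0.5;
  const suspectedModels = Object.keys(votes).filter(
    (m) => (votes[m] || 0) >= MIN_VOTES
  );
  return { scored, hits, mismatch, suspectedModels };
}
```

Canaries that were reported but not scored should be left out of the input array so they do not dilute the hit-rate.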
Limitations
Known-answer templates are author-written summaries, not verbatim model outputs; the overlap metric is intentionally coarse so that it tolerates natural-language paraphrase. A canary whose `knownAnswers` does not cover the claimed model is reported but not scored. Per-canary `discriminatesAmong` filtering is on the TODO list; for now every canary with a template for the claimed model is scored.
References
- HLE: Humanity's Last Exam, 2025
- ARC-AGI-2: arc-agi.com/2
- BFCL: Berkeley Function-Calling Leaderboard, 2025