
Detection dimension · weight 7%

Canary Prompt Behavior

What this dimension detects

Canary prompts are small, deterministic puzzles where strong and weak models give measurably different answers. Examples: counting letters in 'strawberry' (legacy), Newcomb-style decision theory, niche IUPAC chemistry, ARC-AGI-2 pattern induction.

Algorithm

Send a fixed set of canaries (the legacy probes plus the 2026 library: HLE-style, ARC-AGI-2, BFCL-style) at temperature 0. For each response, look up the canary entry in `lib/fingerprints/canaries-2026.ts`, fetch the known-answer template for the claimed model, and score the response against it using normalized token overlap. A hit requires the claimed template to clear an absolute floor (~0.30) AND beat the best alternative template by a margin of at least 0.10. If multiple canaries miss in favor of a decisive alternative template, that alternative model is added to the suspected-model ballot.
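The per-canary scoring step can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the names `AnswerTemplates`, `tokenize`, `overlapScore`, and `scoreCanary` are hypothetical, and the metric shown (the share of distinct template tokens that also appear in the response) is only one plausible reading of "normalized token overlap".

```typescript
// Sketch only: names and the exact metric are assumptions, not the project's code.
type AnswerTemplates = Record<string, string>; // model id -> author-written template

const FLOOR = 0.3;  // absolute floor the claimed template must clear
const MARGIN = 0.1; // required lead over the best alternative template

// Normalize: lowercase, replace punctuation with spaces, split on whitespace.
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, " ")
    .split(/\s+/)
    .filter((token) => token.length > 0);
}

// Assumed metric: share of distinct template tokens that also occur in the response.
function overlapScore(response: string, template: string): number {
  const templateTokens = tokenize(template).filter(
    (token, index, all) => all.indexOf(token) === index
  );
  if (templateTokens.length === 0) return 0;
  const responseTokens = tokenize(response);
  const shared = templateTokens.filter((token) => responseTokens.indexOf(token) !== -1);
  return shared.length / templateTokens.length;
}

interface CanaryScore {
  claimedScore: number;
  bestAlternative: string | null; // null when no other template exists
  bestAlternativeScore: number;
  hit: boolean;
}

// Returns null when there is no template for the claimed model (not scored).
function scoreCanary(
  response: string,
  templates: AnswerTemplates,
  claimedModel: string
): CanaryScore | null {
  const claimedTemplate = templates[claimedModel];
  if (claimedTemplate === undefined) return null;
  const claimedScore = overlapScore(response, claimedTemplate);

  let bestAlternative: string | null = null;
  let bestAlternativeScore = 0;
  for (const model of Object.keys(templates)) {
    const template = templates[model];
    if (model === claimedModel || template === undefined) continue;
    const score = overlapScore(response, template);
    if (bestAlternative === null || score > bestAlternativeScore) {
      bestAlternative = model;
      bestAlternativeScore = score;
    }
  }

  // Hit: clear the floor AND beat the best alternative by the margin.
  const hit = claimedScore >= FLOOR && claimedScore - bestAlternativeScore >= MARGIN;
  return { claimedScore, bestAlternative, bestAlternativeScore, hit };
}

// Usage with made-up templates for two placeholder model ids.
const exampleTemplates: AnswerTemplates = {
  "model-a": "three r strawberry",
  "model-b": "two r strawberry",
};
const exampleResult = scoreCanary(
  "There are three r's in strawberry.",
  exampleTemplates,
  "model-a"
);
console.log(exampleResult);
```

Scoring against the template rather than the response keeps long, chatty answers from being penalized, which suits templates that are short summaries. A symmetric measure such as Jaccard similarity would be an equally reasonable choice under a different reading of the metric.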

Thresholds

| Condition | Verdict contribution |
| --- | --- |
| Claimed-template score ≥ 0.30 AND beats best alternative by ≥ 0.10 | Hit |
| Any other case | Miss (votes for the best alternative if it clears 0.30 by ≥ 0.15) |
| ≥ 2 canaries scored AND hit-rate < 50% | Dimension mismatch |
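The thresholds above can be combined into a dimension-level verdict along these lines. This is a hedged sketch: the `ScoredCanary` shape and the function names are hypothetical, "clears 0.30 by ≥ 0.15" is read here as "the alternative scores at least 0.30 and leads the claimed template by at least 0.15", and "multiple canaries" is assumed to mean at least two votes for the same model.

```typescript
// Sketch only: field names and the exact reading of the voting rule are assumptions.
interface ScoredCanary {
  claimedScore: number;
  bestAlternative: string | null; // null when no alternative template exists
  bestAlternativeScore: number;
}

const HIT_FLOOR = 0.3;
const HIT_LEAD = 0.1;
const VOTE_FLOOR = 0.3;
const VOTE_LEAD = 0.15;
const MIN_SCORED = 2; // mismatch needs at least two scored canaries
const MIN_VOTES = 2;  // assumed meaning of "multiple canaries"

function isHit(c: ScoredCanary): boolean {
  return c.claimedScore >= HIT_FLOOR && c.claimedScore - c.bestAlternativeScore >= HIT_LEAD;
}

// A miss votes for the best alternative only when that alternative is decisive.
function missVote(c: ScoredCanary): string | null {
  if (isHit(c) || c.bestAlternative === null) return null;
  const decisive =
    c.bestAlternativeScore >= VOTE_FLOOR &&
    c.bestAlternativeScore - c.claimedScore >= VOTE_LEAD;
  return decisive ? c.bestAlternative : null;
}

interface DimensionVerdict {
  scored: number;
  hitRate: number;
  mismatch: boolean;
  suspects: string[]; // models added to the suspected-model ballot
}

function dimensionVerdict(canaries: ScoredCanary[]): DimensionVerdict {
  const scored = canaries.length;
  const hits = canaries.filter(isHit).length;
  const hitRate = scored === 0 ? 0 : hits / scored;

  // Tally votes from decisive misses.
  const votes: Record<string, number> = {};
  for (const c of canaries) {
    const model = missVote(c);
    if (model !== null) votes[model] = (votes[model] || 0) + 1;
  }
  const suspects = Object.keys(votes).filter((m) => (votes[m] || 0) >= MIN_VOTES);

  return {
    scored,
    hitRate,
    mismatch: scored >= MIN_SCORED && hitRate < 0.5,
    suspects,
  };
}
```

Under these assumptions, a run with exactly half of its canaries hitting is not a mismatch, because the table's condition is a strict "< 50%".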

Limitations

Known-answer templates are author-written summaries, not verbatim model outputs, so the overlap metric is intentionally coarse to tolerate natural-language paraphrase. A canary whose `knownAnswers` does not cover the claimed model is reported but not scored. Per-canary `discriminatesAmong` filtering is still on the TODO list; for now, every canary with a template for the claimed model is scored.
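The "reported but not scored" rule amounts to partitioning canaries by template coverage before scoring. A minimal sketch, assuming a hypothetical `CanaryEntry` shape in which only the `knownAnswers` field name comes from the text above:

```typescript
// Sketch only: the `id` field and function name are illustrative assumptions.
interface CanaryEntry {
  id: string;
  knownAnswers: Record<string, string>; // model id -> answer template
}

interface CoverageSplit {
  scorable: CanaryEntry[];     // have a template for the claimed model
  reportedOnly: CanaryEntry[]; // listed in the report, excluded from the hit-rate
}

function partitionByCoverage(canaries: CanaryEntry[], claimedModel: string): CoverageSplit {
  const scorable: CanaryEntry[] = [];
  const reportedOnly: CanaryEntry[] = [];
  for (const canary of canaries) {
    const covered = Object.prototype.hasOwnProperty.call(canary.knownAnswers, claimedModel);
    (covered ? scorable : reportedOnly).push(canary);
  }
  return { scorable, reportedOnly };
}
```

Keeping uncovered canaries out of the denominator matters because the mismatch rule depends on both the number of scored canaries and the hit-rate among them.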

References

  • HLE: Humanity's Last Exam, 2025
  • ARC-AGI-2: arc-agi.com/2
  • BFCL: Berkeley Function-Calling Leaderboard, 2025
