
Purpose

Tune thresholds, catch regressions, and surface fuzzy-vs-embedding tradeoffs with reproducible numbers.

Gold dataset

Location: app/test/fixtures/cost_suggestion_gold.json. Each entry has the shape:
```json
{
  "query": "brushing teeth",
  "expectedTemplateId": "uuid-or-null",
  "category": "hygiene",
  "justification": "common morning routine activity"
}
```
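
For illustration, a minimal Dart model that mirrors this shape; the class name GoldEntry and its fromJson factory are assumptions, not the harness's actual types:

```dart
/// Hypothetical model for one gold entry; fields mirror the JSON above.
class GoldEntry {
  final String query;
  final String? expectedTemplateId; // null marks a deliberate no-match case
  final String category;
  final String justification;

  GoldEntry({
    required this.query,
    required this.expectedTemplateId,
    required this.category,
    required this.justification,
  });

  factory GoldEntry.fromJson(Map<String, dynamic> json) => GoldEntry(
        query: json['query'] as String,
        expectedTemplateId: json['expectedTemplateId'] as String?,
        category: json['category'] as String,
        justification: json['justification'] as String,
      );
}
```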

Selection rationale

  • 50 to 100 entries total; small enough to curate, big enough to drive threshold decisions.
  • Skew toward ME/CFS-relevant activities rather than gym or athletic tasks.
  • Balance across categories: hygiene, cooking, admin, errands, social, rest, screen_time, light_movement.
  • Include 10 to 15 deliberate expectedTemplateId: null entries to test no-match precision.
  • Include 5 to 10 paraphrase pairs (“walk dog” + “take pup outside”) to expose the fuzzy ceiling.
  • Avoid duplicates and avoid queries that map to multiple equally valid templates (a validation sketch follows this list).
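
Most of these rules are mechanical, so they can be checked in code. A minimal sketch, assuming the fixture path above; the script itself is hypothetical and not part of the harness:

```dart
// Hypothetical sanity checks for the gold set, enforcing the curation
// rules above. Category balance and paraphrase pairs still need human review.
import 'dart:convert';
import 'dart:io';

void main() {
  final entries = (jsonDecode(
          File('app/test/fixtures/cost_suggestion_gold.json')
              .readAsStringSync()) as List)
      .cast<Map<String, dynamic>>();

  // 50 to 100 entries total.
  if (entries.length < 50 || entries.length > 100) {
    throw StateError('gold set size out of range: ${entries.length}');
  }

  // No duplicate queries.
  final queries = entries.map((e) => e['query'] as String).toList();
  if (queries.toSet().length != queries.length) {
    throw StateError('duplicate queries in gold set');
  }

  // 10 to 15 deliberate null entries.
  final nulls = entries.where((e) => e['expectedTemplateId'] == null).length;
  if (nulls < 10 || nulls > 15) {
    throw StateError('expected 10-15 null entries, found $nulls');
  }
}
```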

Metrics

| Metric | Definition |
| --- | --- |
| top1MatchRate | Fraction of queries where the top candidate is the expected template |
| top3MatchRate | Fraction of queries where the expected template is in the top 3 |
| noMatchPrecision | Of the queries for which the engine returned RatingFallback, the fraction whose expected template was null |
| noMatchRecall | Of the queries whose expected template was null, the fraction for which the engine returned RatingFallback |
| meanConfidenceMatched | Mean top score on queries with a non-null expected template |
| meanConfidenceUnmatched | Mean top score on queries with a null expected template |
| latencyP50ms | Median per-query latency, in milliseconds |
| latencyP95ms | 95th percentile per-query latency, in milliseconds |

A healthy run shows meanConfidenceMatched clearly above meanConfidenceUnmatched; if not, threshold tuning is required.
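
To make the definitions concrete, a sketch of two of the metrics in Dart; the EvalResult record type is an assumption, and the harness's real result type may differ:

```dart
// Hypothetical per-query result: the expected template (null for deliberate
// no-match entries), the engine's top candidate (null when it returned
// RatingFallback), and the top score.
typedef EvalResult = ({String? expectedId, String? topId, double topScore});

double top1MatchRate(List<EvalResult> results) {
  final matched = results.where((r) => r.expectedId != null);
  return matched.where((r) => r.topId == r.expectedId).length /
      matched.length;
}

double noMatchPrecision(List<EvalResult> results) {
  // Of the queries where the engine fell back, how many should have?
  final fallbacks = results.where((r) => r.topId == null);
  return fallbacks.where((r) => r.expectedId == null).length /
      fallbacks.length;
}
```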

Initial gates

The harness fails if any of:
  • top3MatchRate < 0.70
  • noMatchPrecision < 0.70
  • latencyP95ms > 100
These are starting numbers. They are tightened as the gold set grows.
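
Expressed as test assertions, the gates might look like the following sketch; the matchers are from flutter_test, while the report map and function name are assumptions:

```dart
import 'package:flutter_test/flutter_test.dart';

// Hypothetical gate check; `report` is assumed to hold metrics keyed by name.
void assertGates(Map<String, num> report) {
  expect(report['top3MatchRate'], greaterThanOrEqualTo(0.70));
  expect(report['noMatchPrecision'], greaterThanOrEqualTo(0.70));
  expect(report['latencyP95ms'], lessThanOrEqualTo(100));
}
```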

Reproducibility

  • The harness lives at app/test/integration/cost_suggestion_eval_test.dart; a skeleton is sketched after this list.
  • It seeds an in-memory Drift database with a fixed template set (loaded from app/test/fixtures/templates_seed.json).
  • It loads the gold set, runs every query, and writes a JSON report to app/test/reports/cost_suggestion_eval.json.
  • Run with flutter test test/integration/cost_suggestion_eval_test.dart.
  • The report is gitignored; it is a CI artifact, not a tracked file.
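
A skeleton of that flow, with the seeding, query, and metric helpers left as placeholders; their names and signatures are assumptions, not the real implementation:

```dart
import 'dart:convert';
import 'dart:io';

import 'package:flutter_test/flutter_test.dart';

// Placeholders for the real seeding, query, and metric code.
Future<Object> seedInMemoryDb(List<dynamic> templates) async =>
    throw UnimplementedError('seed the in-memory Drift database');
Future<Map<String, Object?>> runQuery(Object db, String query) async =>
    throw UnimplementedError('run one suggestion query');
Map<String, num> computeMetrics(List<Map<String, Object?>> results) =>
    throw UnimplementedError('compute the metrics table above');

void main() {
  test('cost suggestion evaluation', () async {
    // 1. Seed an in-memory database with the fixed template set.
    final templates = jsonDecode(
        File('test/fixtures/templates_seed.json').readAsStringSync()) as List;
    final db = await seedInMemoryDb(templates);

    // 2. Load the gold set and run every query.
    final gold = jsonDecode(File('test/fixtures/cost_suggestion_gold.json')
        .readAsStringSync()) as List;
    final results = [
      for (final entry in gold)
        await runQuery(db, (entry as Map)['query'] as String),
    ];

    // 3. Write the JSON report (a gitignored CI artifact).
    File('test/reports/cost_suggestion_eval.json')
        .writeAsStringSync(jsonEncode(computeMetrics(results)));
  });
}
```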

Threshold tuning loop

When metrics drift, follow these steps:
  1. Update the gold set if the drift is a coverage gap rather than a quality regression.
  2. Run the harness.
  3. If thresholds in cost-suggestion-pipeline.mdx need to change, update them in the same PR.
  4. Record the date, the metric before, and the metric after in the PR description.
Threshold changes without an evaluation run attached are rejected at review.

Acceptance criteria mapping

| Criterion (ENG-94) | Where it lives |
| --- | --- |
| Gold set exists with selection rationale | app/test/fixtures/cost_suggestion_gold.json plus this page |
| Metrics are computed via a repeatable script/test | app/test/integration/cost_suggestion_eval_test.dart |
| Findings feed threshold tuning decisions | This page (Threshold tuning loop) |