Purpose
Tune thresholds, catch regressions, and surface fuzzy-vs-embedding tradeoffs with reproducible numbers.
Gold dataset
Location: `app/test/fixtures/cost_suggestion_gold.json`.
Each entry has the shape:
Selection rationale
- 50 to 100 entries total; small enough to curate, big enough to drive threshold decisions.
- Skew toward ME/CFS-relevant activities rather than gym or athletic tasks.
- Balance across categories: `hygiene`, `cooking`, `admin`, `errands`, `social`, `rest`, `screen_time`, `light_movement`.
- Include 10 to 15 deliberate `expectedTemplateId: null` entries to test no-match precision.
- Include 5 to 10 paraphrase pairs (“walk dog” + “take pup outside”) to expose the fuzzy ceiling.
- Avoid duplicate queries, and avoid queries that map to multiple equally valid templates.
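The size, null-entry, and duplicate rules above are mechanical enough to check automatically. The following is a minimal sketch in Python, not part of the Dart harness. It assumes each gold entry is a JSON object with a `query` field (a hypothetical name) alongside `expectedTemplateId`, and it treats queries as duplicates when they match after trimming and lowercasing, which is also an assumption.

```python
from collections import Counter


def check_gold_set(entries):
    """Return a list of problems; an empty list means the set conforms."""
    problems = []

    # Rule: 50 to 100 entries total.
    if not 50 <= len(entries) <= 100:
        problems.append(f"expected 50 to 100 entries, found {len(entries)}")

    # Rule: 10 to 15 deliberate no-match entries (expectedTemplateId is null).
    null_count = sum(1 for e in entries if e["expectedTemplateId"] is None)
    if not 10 <= null_count <= 15:
        problems.append(f"expected 10 to 15 null entries, found {null_count}")

    # Rule: no duplicate queries. "query" is an assumed field name, and the
    # normalisation (trim + lowercase) is an assumption of this sketch.
    counts = Counter(e["query"].strip().lower() for e in entries)
    dupes = sorted(q for q, n in counts.items() if n > 1)
    if dupes:
        problems.append(f"duplicate queries: {dupes}")

    return problems
```

Category balance and the paraphrase-pair count are left out because they depend on fields this page does not describe.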
Metrics
| Metric | Definition |
|---|---|
| `top1MatchRate` | Fraction of queries where the top candidate is the expected template |
| `top3MatchRate` | Fraction where the expected template is in the top 3 |
| `noMatchPrecision` | Of queries the engine returned `RatingFallback` for, the fraction whose expected was null |
| `noMatchRecall` | Of queries whose expected was null, the fraction the engine returned `RatingFallback` for |
| `meanConfidenceMatched` | Mean top score on queries with a non-null expected |
| `meanConfidenceUnmatched` | Mean top score on queries with null expected |
| `latencyP50ms` | Median per-query latency |
| `latencyP95ms` | 95th percentile per-query latency |
`meanConfidenceMatched` should be clearly above `meanConfidenceUnmatched`; if it is not, threshold tuning is required.
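To make the definitions concrete, here is a minimal sketch in Python of how the metrics could be derived from per-query results. It is illustrative only and does not reproduce the Dart harness. The record fields (`expected`, `ranked`, `fallback`, `top_score`, `latency_ms`) are hypothetical names. Two further assumptions: the match rates are divided by the number of queries with a non-null expected template, and the 95th percentile uses the nearest-rank method. The definitions above do not pin down either choice.

```python
import math
from statistics import mean, median


def ratio(numerator, denominator):
    """Safe division; None when the denominator is zero."""
    return numerator / denominator if denominator else None


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile (an assumed method, not confirmed by the source)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def compute_metrics(results):
    # Each result is a dict with hypothetical keys:
    #   expected   - expected template id, or None for a deliberate no-match
    #   ranked     - candidate template ids, best first
    #   fallback   - True if the engine returned RatingFallback
    #   top_score  - score of the top candidate
    #   latency_ms - per-query latency in milliseconds
    matched = [r for r in results if r["expected"] is not None]
    unmatched = [r for r in results if r["expected"] is None]
    fallbacks = [r for r in results if r["fallback"]]
    latencies = [r["latency_ms"] for r in results]

    return {
        # Assumed denominator: queries with a non-null expected template.
        "top1MatchRate": ratio(
            sum(r["ranked"][:1] == [r["expected"]] for r in matched), len(matched)),
        "top3MatchRate": ratio(
            sum(r["expected"] in r["ranked"][:3] for r in matched), len(matched)),
        "noMatchPrecision": ratio(
            sum(r["expected"] is None for r in fallbacks), len(fallbacks)),
        "noMatchRecall": ratio(
            sum(r["fallback"] for r in unmatched), len(unmatched)),
        "meanConfidenceMatched":
            mean(r["top_score"] for r in matched) if matched else None,
        "meanConfidenceUnmatched":
            mean(r["top_score"] for r in unmatched) if unmatched else None,
        "latencyP50ms": median(latencies) if latencies else None,
        "latencyP95ms":
            percentile_nearest_rank(latencies, 95) if latencies else None,
    }
```

Returning `None` for an empty denominator keeps an undefined metric distinguishable from a genuine zero.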
Initial gates
The harness fails if any of the following holds:
- `top3MatchRate < 0.70`
- `noMatchPrecision < 0.70`
- `latencyP95ms > 100`
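As an illustration, the gate check can be expressed as a small table of limits applied to a metrics dictionary. This is a sketch in Python under stated assumptions, not the harness's actual code: the `GATES` table and `failed_gates` function are hypothetical names, and the input is assumed to be a mapping keyed by the metric names above.

```python
# Gate table: metric name -> (kind, limit). "min" fails when the value is
# below the limit, "max" fails when it is above. Limits are the ones listed
# on this page; the structure itself is a hypothetical sketch.
GATES = {
    "top3MatchRate": ("min", 0.70),
    "noMatchPrecision": ("min", 0.70),
    "latencyP95ms": ("max", 100),
}


def failed_gates(metrics):
    """Return one message per violated gate; an empty list means pass."""
    failures = []
    for name, (kind, limit) in GATES.items():
        value = metrics[name]
        if kind == "min" and value < limit:
            failures.append(f"{name} = {value} is below {limit}")
        elif kind == "max" and value > limit:
            failures.append(f"{name} = {value} is above {limit}")
    return failures
```

The comparisons are strict, so a value exactly at a limit passes, matching the `<` and `>` in the gates above.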
Reproducibility
- The harness lives at `app/test/integration/cost_suggestion_eval_test.dart`.
- It seeds an in-memory Drift database with a fixed template set (loaded from `app/test/fixtures/templates_seed.json`).
- It loads the gold set, runs every query, and writes a JSON report to `app/test/reports/cost_suggestion_eval.json`.
- Run with `flutter test test/integration/cost_suggestion_eval_test.dart`.
- The report is gitignored; it is a CI artifact, not a tracked file.
Threshold tuning loop
When metrics drift, follow these steps:
- Update the gold set if the drift is a coverage gap rather than a quality regression.
- Run the harness.
- If thresholds in `cost-suggestion-pipeline.mdx` need to change, update them in the same PR.
- Record the date, the metric before, and the metric after in the PR description.
Acceptance criteria mapping
| Criterion (ENG-94) | Where it lives |
|---|---|
| Gold set exists with selection rationale | `app/test/fixtures/cost_suggestion_gold.json` plus this page |
| Metrics are computed via a repeatable script/test | `app/test/integration/cost_suggestion_eval_test.dart` |
| Findings feed threshold tuning decisions | This page (Threshold tuning loop) |