
Purpose

Tune thresholds, catch regressions, and surface fuzzy-vs-embedding tradeoffs with reproducible numbers.

Gold dataset

Location: app/test/fixtures/cost_suggestion_gold.json. Each entry has the shape:
```json
{
  "query": "brushing teeth",
  "expectedTemplateId": "uuid-or-null",
  "category": "hygiene",
  "justification": "common morning routine activity"
}
```
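
For illustration, a minimal Dart model that mirrors this shape; the class name GoldEntry and its fromJson factory are assumptions, not the harness's actual types:

```dart
/// Hypothetical model for one gold entry; fields mirror the JSON above.
class GoldEntry {
  final String query;
  final String? expectedTemplateId; // null marks a deliberate no-match case
  final String category;
  final String justification;

  GoldEntry({
    required this.query,
    required this.expectedTemplateId,
    required this.category,
    required this.justification,
  });

  factory GoldEntry.fromJson(Map<String, dynamic> json) => GoldEntry(
        query: json['query'] as String,
        expectedTemplateId: json['expectedTemplateId'] as String?,
        category: json['category'] as String,
        justification: json['justification'] as String,
      );
}
```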

Selection rationale

  • 50 to 100 entries total; small enough to curate, big enough to drive threshold decisions.
  • Skew toward ME/CFS-relevant activities rather than gym or athletic tasks.
  • Balance across categories: hygiene, cooking, admin, errands, social, rest, screen_time, light_movement.
  • Include 10 to 15 deliberate expectedTemplateId: null entries to test no-match precision.
  • Include 5 to 10 paraphrase pairs (“walk dog” + “take pup outside”) to expose the fuzzy ceiling.
  • Avoid duplicates and avoid queries that map to multiple equally valid templates (a validation sketch follows this list).
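
Most of these rules are mechanical, so they can be checked in code. A minimal sketch, assuming the fixture path above; the script itself is hypothetical and not part of the harness:

```dart
// Hypothetical sanity checks for the gold set, enforcing the curation
// rules above. Category balance and paraphrase pairs still need human review.
import 'dart:convert';
import 'dart:io';

void main() {
  final entries = (jsonDecode(
          File('app/test/fixtures/cost_suggestion_gold.json')
              .readAsStringSync()) as List)
      .cast<Map<String, dynamic>>();

  // 50 to 100 entries total.
  if (entries.length < 50 || entries.length > 100) {
    throw StateError('gold set size out of range: ${entries.length}');
  }

  // No duplicate queries.
  final queries = entries.map((e) => e['query'] as String).toList();
  if (queries.toSet().length != queries.length) {
    throw StateError('duplicate queries in gold set');
  }

  // 10 to 15 deliberate null entries.
  final nulls = entries.where((e) => e['expectedTemplateId'] == null).length;
  if (nulls < 10 || nulls > 15) {
    throw StateError('expected 10-15 null entries, found $nulls');
  }
}
```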

Metrics

| Metric | Definition |
| --- | --- |
| top1MatchRate | Fraction of queries where the top candidate is the expected template |
| top3MatchRate | Fraction of queries where the expected template is in the top 3 |
| noMatchPrecision | Of the queries for which the engine returned RatingFallback, the fraction whose expected template was null |
| noMatchRecall | Of the queries whose expected template was null, the fraction for which the engine returned RatingFallback |
| meanConfidenceMatched | Mean top score on queries with a non-null expected template |
| meanConfidenceUnmatched | Mean top score on queries with a null expected template |
| latencyP50ms | Median per-query latency, in milliseconds |
| latencyP95ms | 95th percentile per-query latency, in milliseconds |

A healthy run shows meanConfidenceMatched clearly above meanConfidenceUnmatched; if not, threshold tuning is required.
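
To make the definitions concrete, a sketch of two of the metrics in Dart; the EvalResult record type is an assumption, and the harness's real result type may differ:

```dart
// Hypothetical per-query result: the expected template (null for deliberate
// no-match entries), the engine's top candidate (null when it returned
// RatingFallback), and the top score.
typedef EvalResult = ({String? expectedId, String? topId, double topScore});

double top1MatchRate(List<EvalResult> results) {
  final matched = results.where((r) => r.expectedId != null);
  return matched.where((r) => r.topId == r.expectedId).length /
      matched.length;
}

double noMatchPrecision(List<EvalResult> results) {
  // Of the queries where the engine fell back, how many should have?
  final fallbacks = results.where((r) => r.topId == null);
  return fallbacks.where((r) => r.expectedId == null).length /
      fallbacks.length;
}
```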

Initial gates

The harness fails if any of:
  • top3MatchRate < 0.70
  • noMatchPrecision < 0.70
  • latencyP95ms > 100
These are starting numbers. They are tightened as the gold set grows.
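
Expressed as test assertions, the gates might look like the following sketch; the matchers are from flutter_test, while the report map and function name are assumptions:

```dart
import 'package:flutter_test/flutter_test.dart';

// Hypothetical gate check; `report` is assumed to hold metrics keyed by name.
void assertGates(Map<String, num> report) {
  expect(report['top3MatchRate'], greaterThanOrEqualTo(0.70));
  expect(report['noMatchPrecision'], greaterThanOrEqualTo(0.70));
  expect(report['latencyP95ms'], lessThanOrEqualTo(100));
}
```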

Reproducibility

  • The harness lives at app/test/integration/cost_suggestion_eval_test.dart; a skeleton is sketched after this list.
  • It seeds an in-memory Drift database with a fixed template set (loaded from app/test/fixtures/templates_seed.json).
  • It loads the gold set, runs every query, and writes a JSON report to app/test/reports/cost_suggestion_eval.json.
  • Run with flutter test test/integration/cost_suggestion_eval_test.dart.
  • The report is gitignored; it is a CI artifact, not a tracked file.
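
A skeleton of that flow, with the seeding, query, and metric helpers left as placeholders; their names and signatures are assumptions, not the real implementation:

```dart
import 'dart:convert';
import 'dart:io';

import 'package:flutter_test/flutter_test.dart';

// Placeholders for the real seeding, query, and metric code.
Future<Object> seedInMemoryDb(List<dynamic> templates) async =>
    throw UnimplementedError('seed the in-memory Drift database');
Future<Map<String, Object?>> runQuery(Object db, String query) async =>
    throw UnimplementedError('run one suggestion query');
Map<String, num> computeMetrics(List<Map<String, Object?>> results) =>
    throw UnimplementedError('compute the metrics table above');

void main() {
  test('cost suggestion evaluation', () async {
    // 1. Seed an in-memory database with the fixed template set.
    final templates = jsonDecode(
        File('test/fixtures/templates_seed.json').readAsStringSync()) as List;
    final db = await seedInMemoryDb(templates);

    // 2. Load the gold set and run every query.
    final gold = jsonDecode(File('test/fixtures/cost_suggestion_gold.json')
        .readAsStringSync()) as List;
    final results = [
      for (final entry in gold)
        await runQuery(db, (entry as Map)['query'] as String),
    ];

    // 3. Write the JSON report (a gitignored CI artifact).
    File('test/reports/cost_suggestion_eval.json')
        .writeAsStringSync(jsonEncode(computeMetrics(results)));
  });
}
```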

Threshold tuning loop

When metrics drift, follow these steps:
  1. Update the gold set if the drift is a coverage gap rather than a quality regression.
  2. Run the harness.
  3. If thresholds in cost-suggestion-pipeline.mdx need to change, update them in the same PR.
  4. Record the date, the metric before, and the metric after in the PR description.
Threshold changes without an evaluation run attached are rejected at review.

Acceptance criteria mapping

| Criterion (ENG-94) | Where it lives |
| --- | --- |
| Gold set exists with selection rationale | app/test/fixtures/cost_suggestion_gold.json plus this page |
| Metrics are computed via a repeatable script/test | app/test/integration/cost_suggestion_eval_test.dart |
| Findings feed threshold tuning decisions | This page (Threshold tuning loop) |