From eb7b27066c4c41fe52a10396389f882cb43a1e8a Mon Sep 17 00:00:00 2001 From: SoundMindsAI Date: Thu, 21 May 2026 08:39:20 -0400 Subject: [PATCH 01/17] docs: capture 4 Karpathy-loop audit follow-ups as idea files MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Surfaced during the 2026-05-21 Studies-workflow audit (the within-study loop is Karpathy-shaped; the across-study compounding is not). Each file cites the audit and the specific file:line gaps that motivate it. - chore_study_default_stop_conditions — recommended max_trials defaults + Quick/Standard/Overnight wizard preset selector - feat_config_repo_baseline_tracking — last_merged_proposal_id column on config_repos, set by the existing GitHub merge webhook - feat_auto_followup_studies — opt-in studies.config.auto_followup_depth that auto-chains follow-ups via propose_search_space(prior_study_id=...); the closest unintrusive analog to Karpathy compounding - feat_digest_executable_followups — reshape suggested_followups into a discriminated union (narrow / widen / swap_template / text) with a one-click "Run this followup" UI Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/00_overview/DASHBOARD.md | 2 +- docs/00_overview/MVP1_DASHBOARD.md | 42 ++++-- docs/00_overview/dashboard.html | 2 +- docs/00_overview/mvp1_dashboard.html | 107 +++++++++++++--- .../idea.md | 69 ++++++++++ .../feat_auto_followup_studies/idea.md | 89 +++++++++++++ .../idea.md | 73 +++++++++++ .../feat_digest_executable_followups/idea.md | 120 ++++++++++++++++++ 8 files changed, 469 insertions(+), 35 deletions(-) create mode 100644 docs/02_product/planned_features/chore_study_default_stop_conditions/idea.md create mode 100644 docs/02_product/planned_features/feat_auto_followup_studies/idea.md create mode 100644 docs/02_product/planned_features/feat_config_repo_baseline_tracking/idea.md create mode 100644 docs/02_product/planned_features/feat_digest_executable_followups/idea.md diff --git 
a/docs/00_overview/DASHBOARD.md b/docs/00_overview/DASHBOARD.md index 88e62b9e..6e598ca7 100644 --- a/docs/00_overview/DASHBOARD.md +++ b/docs/00_overview/DASHBOARD.md @@ -6,7 +6,7 @@ _Top-level index across MVP1 → GA v1+ as of **2026-05-21**. Click a release na | Release | Theme | Progress | Status | |---|---|---|---| -| [MVP1 / v0.1](MVP1_DASHBOARD.md) | The Loop | 56 / 56 scoped done · 2 remaining | **In progress** | +| [MVP1 / v0.1](MVP1_DASHBOARD.md) | The Loop | 56 / 57 scoped done · 4 remaining | **In progress** | | [MVP2 / v0.2](MVP2_DASHBOARD.md) | Observable | 1 / 1 scoped done · 1 remaining | **In progress** | | MVP3 / v0.3 | Production Stacks | — | **Not yet scoped** | | MVP4 / v0.4 | Multi-tenant, Multi-LLM | — | **Not yet scoped** | diff --git a/docs/00_overview/MVP1_DASHBOARD.md b/docs/00_overview/MVP1_DASHBOARD.md index 6615fd43..fa16a62a 100644 --- a/docs/00_overview/MVP1_DASHBOARD.md +++ b/docs/00_overview/MVP1_DASHBOARD.md @@ -6,29 +6,35 @@ _Reflects feature-folder state as of **2026-05-21** (latest mtime of any planned ## Next up -All scoped MVP1 features shipped 🎉 +**[feat_pr_metric_confidence](../02_product/planned_features/feat_pr_metric_confidence/feature_spec.md)** — Feature, currently in **Spec** -Pull from the Idea backlog or capture a new feature spec. +> Approvers reading a study-backed PR see a "## Confidence" section directly between the existing "## Metric delta" and "## Config diff" sections. 
+ +Spec exists; run /pipeline to generate the implementation plan + ship + +```bash +/pipeline docs/02_product/planned_features/feat_pr_metric_confidence --auto +``` ## MVP1 Progress | Metric | Value | |---|---| -| Scoped items done | **56 / 56** (100%) — feat_/infra_/chore_/epic_ past idea stage | -| Path to MVP1 | **2** items remaining (features + bugs + chores) | +| Scoped items done | **56 / 57** (98%) — feat_/infra_/chore_/epic_ past idea stage | +| Path to MVP1 | **4** items remaining (features + bugs + chores) | | Open bugs | 1 | -| Open chores | 1 (idea-stage debt) | -| Backlog ideas | 2 idea-only feat/infra (not yet scoped into MVP1) | +| Open chores | 2 (idea-stage debt) | +| Backlog ideas | 4 idea-only feat/infra (not yet scoped into MVP1) | | In flight | 0 feature(s) actively shipping | ## Pipeline -### Done (67) +### Done (68) | Feature | Type | One-liner | Depends on | Status | |---|---|---|---|---| | [feat_agent_propose_search_space](implemented_features/2026_05_21_feat_agent_propose_search_space/feature_spec.md) | Feature | A new read-only agent tool `propose_search_space(template_id, cluster_id, judgment_list_id?, prior_study_id?) → SearchSpace JSON` that emits a deterministic, code-generated search space using the same | — | [PR #175](https://github.com/SoundMindsAI/relyloop/pull/175) merged 2026-05-21 | -| [feat_chat_agent](implemented_features/2026_05_12_feat_chat_agent/feature_spec.md) | Feature | A chat surface at `/chat/{conversation_id}` streams OpenAI completions via SSE. 
| `feat_agent_propose_search_space` `feat_cluster_target_filter` `feat_contextual_help` `feat_contextual_help_mvp2` `feat_create_study_search_space_builder` `feat_create_study_target_autocomplete` `feat_data_table_primitive` `feat_digest_proposal` `feat_fts_rank_ordering_mvp2` `feat_github_pr_worker` `feat_github_webhook` `feat_judgments_periodic_resume_sweep` `feat_llm_judgments` `feat_pr_metric_confidence` `feat_proposals_ui` `feat_query_inline_crud` `feat_studies_ui` `feat_study_clone_from_previous` `feat_study_lifecycle` `infra_adapter_elastic` `infra_arq_subprocess_test_mvp2` `infra_ci_smoke_makeup` `infra_dashboard_regen_pre_commit_conflict` `infra_e2e_seed_completed_study` `infra_e2e_wire_seed_helper_into_studies_spec` `infra_foundation` `infra_frontend_stack_refresh` `infra_make_targets_split_backend_only` `infra_nvmrc` `infra_optuna_eval` `infra_per_trial_timeout` `infra_structlog_test_helpers` `infra_uv_sync_drops_precommit` | [PR #60](https://github.com/SoundMindsAI/relyloop/pull/60) merged 2026-05-12 | +| [feat_chat_agent](implemented_features/2026_05_12_feat_chat_agent/feature_spec.md) | Feature | A chat surface at `/chat/{conversation_id}` streams OpenAI completions via SSE. 
| `feat_agent_propose_search_space` `feat_auto_followup_studies` `feat_cluster_target_filter` `feat_config_repo_baseline_tracking` `feat_contextual_help` `feat_contextual_help_mvp2` `feat_create_study_search_space_builder` `feat_create_study_target_autocomplete` `feat_data_table_primitive` `feat_digest_executable_followups` `feat_digest_proposal` `feat_fts_rank_ordering_mvp2` `feat_github_pr_worker` `feat_github_webhook` `feat_judgments_periodic_resume_sweep` `feat_llm_judgments` `feat_pr_metric_confidence` `feat_proposals_ui` `feat_query_inline_crud` `feat_studies_ui` `feat_study_clone_from_previous` `feat_study_lifecycle` `infra_adapter_elastic` `infra_arq_subprocess_test_mvp2` `infra_ci_smoke_makeup` `infra_dashboard_regen_pre_commit_conflict` `infra_e2e_seed_completed_study` `infra_e2e_wire_seed_helper_into_studies_spec` `infra_foundation` `infra_frontend_stack_refresh` `infra_make_targets_split_backend_only` `infra_nvmrc` `infra_optuna_eval` `infra_per_trial_timeout` `infra_structlog_test_helpers` `infra_uv_sync_drops_precommit` | [PR #60](https://github.com/SoundMindsAI/relyloop/pull/60) merged 2026-05-12 | | [feat_cluster_target_filter](implemented_features/2026_05_20_feat_cluster_target_filter/feature_spec.md) | Feature | Each registered cluster can optionally carry a glob pattern (`products*`, `team-a-*`, `docs-[ef][nr]-*`) that scopes `list_targets()` to the matching subset. 
| — | [PR #168](https://github.com/SoundMindsAI/relyloop/pull/168) merged 2026-05-20 | | [feat_contextual_help](implemented_features/2026_05_15_feat_contextual_help/feature_spec.md) | Feature | a relevance engineer can launch their second study and interpret its digest without re-reading the tutorial, because every domain-jargon label has a one-click contextual definition grounded in the sam | — | [PR #122](https://github.com/SoundMindsAI/relyloop/pull/122) merged 2026-05-15 | | [feat_create_study_search_space_builder](implemented_features/2026_05_20_feat_create_study_search_space_builder/feature_spec.md) | Feature | Complete (PR #163, squash commit `c703953`, merged 2026-05-20) | — | [PR #163](https://github.com/SoundMindsAI/relyloop/pull/163) merged 2026-05-20 | @@ -82,11 +88,12 @@ Pull from the Idea backlog or capture a new feature spec. | [chore_starlette_422_deprecation](implemented_features/2026_05_13_chore_starlette_422_deprecation/idea.md) | Chore | Complete | — | Complete | | [chore_test_both_engines](implemented_features/2026_05_13_chore_test_both_engines/idea.md) | Chore | Complete | — | Complete | | [chore_trial_summary_single_query](implemented_features/2026_05_13_chore_trial_summary_single_query/idea.md) | Chore | Complete | — | Complete | -| [chore_tutorial_polish](implemented_features/2026_05_12_chore_tutorial_polish/feature_spec.md) | Chore | The release tag `v0.1.0` is pushed with: a worked tutorial at `docs/08_guides/tutorial-first-study.md`, sample data (50-query set + sample ES index of ~1,000 docs from the Amazon ESCI subset), README | `feat_agent_propose_search_space` `feat_chat_agent` `feat_cluster_target_filter` `feat_contextual_help` `feat_contextual_help_mvp2` `feat_create_study_search_space_builder` `feat_create_study_target_autocomplete` `feat_data_table_primitive` `feat_digest_proposal` `feat_fts_rank_ordering_mvp2` `feat_github_pr_worker` `feat_github_webhook` `feat_judgments_periodic_resume_sweep` `feat_llm_judgments` 
`feat_pr_metric_confidence` `feat_proposals_ui` `feat_query_inline_crud` `feat_studies_ui` `feat_study_clone_from_previous` `feat_study_lifecycle` `infra_adapter_elastic` `infra_arq_subprocess_test_mvp2` `infra_ci_smoke_makeup` `infra_dashboard_regen_pre_commit_conflict` `infra_e2e_seed_completed_study` `infra_e2e_wire_seed_helper_into_studies_spec` `infra_foundation` `infra_frontend_stack_refresh` `infra_make_targets_split_backend_only` `infra_nvmrc` `infra_optuna_eval` `infra_per_trial_timeout` `infra_structlog_test_helpers` `infra_uv_sync_drops_precommit` | [PR #64](https://github.com/SoundMindsAI/relyloop/pull/64) merged 2026-05-12 | +| [chore_tutorial_polish](implemented_features/2026_05_12_chore_tutorial_polish/feature_spec.md) | Chore | The release tag `v0.1.0` is pushed with: a worked tutorial at `docs/08_guides/tutorial-first-study.md`, sample data (50-query set + sample ES index of ~1,000 docs from the Amazon ESCI subset), README | `feat_agent_propose_search_space` `feat_auto_followup_studies` `feat_chat_agent` `feat_cluster_target_filter` `feat_config_repo_baseline_tracking` `feat_contextual_help` `feat_contextual_help_mvp2` `feat_create_study_search_space_builder` `feat_create_study_target_autocomplete` `feat_data_table_primitive` `feat_digest_executable_followups` `feat_digest_proposal` `feat_fts_rank_ordering_mvp2` `feat_github_pr_worker` `feat_github_webhook` `feat_judgments_periodic_resume_sweep` `feat_llm_judgments` `feat_pr_metric_confidence` `feat_proposals_ui` `feat_query_inline_crud` `feat_studies_ui` `feat_study_clone_from_previous` `feat_study_lifecycle` `infra_adapter_elastic` `infra_arq_subprocess_test_mvp2` `infra_ci_smoke_makeup` `infra_dashboard_regen_pre_commit_conflict` `infra_e2e_seed_completed_study` `infra_e2e_wire_seed_helper_into_studies_spec` `infra_foundation` `infra_frontend_stack_refresh` `infra_make_targets_split_backend_only` `infra_nvmrc` `infra_optuna_eval` `infra_per_trial_timeout` `infra_structlog_test_helpers` 
`infra_uv_sync_drops_precommit` | [PR #64](https://github.com/SoundMindsAI/relyloop/pull/64) merged 2026-05-12 | | [bug_capability_check_test_isolation](implemented_features/2026_05_12_bug_capability_check_test_isolation/idea.md) | Bug | Complete | — | Complete | | [bug_cursor_decode_value_validation](implemented_features/2026_05_17_bug_cursor_decode_value_validation/idea.md) | Bug | Complete | — | Complete | | [bug_digest_param_importance_seam](implemented_features/2026_05_13_bug_digest_param_importance_seam/idea.md) | Bug | Complete | — | Complete | | [bug_dockerfile_missing_prompts](implemented_features/2026_05_13_bug_dockerfile_missing_prompts/idea.md) | Bug | Complete | — | Complete | +| [bug_e2e_target_dropdown_flake](implemented_features/2026_05_20_bug_e2e_target_dropdown_flake/idea.md) | Bug | Complete | — | Complete | | [bug_env_file_corrupted_during_session](implemented_features/2026_05_13_bug_env_file_corrupted_during_session/idea.md) | Bug | Complete | — | Complete | | [bug_get_schema_unhandled_connect_error](implemented_features/2026_05_20_bug_get_schema_unhandled_connect_error/idea.md) | Bug | Complete | — | Complete | | [bug_judgment_lists_listing_ignores_query_set_filter](implemented_features/2026_05_20_bug_judgment_lists_listing_ignores_query_set_filter/idea.md) | Bug | Complete | — | Complete | @@ -103,16 +110,21 @@ _None._ _None._ -### Spec (0) +### Spec (1) -_None._ +| Feature | Type | One-liner | Depends on | Status | +|---|---|---|---|---| +| [feat_pr_metric_confidence](../02_product/planned_features/feat_pr_metric_confidence/feature_spec.md) | Feature | Approvers reading a study-backed PR see a "## Confidence" section directly between the existing "## Metric delta" and "## Config diff" sections. 
| — | [PR #41](https://github.com/SoundMindsAI/relyloop/pull/41) merged 2026-05-11 | -### Idea (4) +### Idea (7) | Feature | Type | One-liner | Depends on | Status | |---|---|---|---|---| -| [feat_pr_metric_confidence](../02_product/planned_features/feat_pr_metric_confidence/idea.md) | Feature | When the operator's approver opens a study-backed PR in the central search-config repo, the only confidence signal in the PR body is two scalar point estimates. From [`_render_pr_body_study_backed`](. | — | Idea — surfaced during a 2026-05-20 conversation reviewing two outside articles for relevance to RelyLoop ([Doug Turnbull, "Autoresearching a better MSMarco BM25", 2026-05-17](https://softwaredoug.com/blog/2026/05/17/autoresearching-a-better-msmarco-bm25) and [Li/Wang/Wang, "Choosing the Better Bandit Algorithm under Data Sharing", arXiv:2507.11891v2](https://arxiv.org/pdf/2507.11891)). The articles themselves are not directly material to RelyLoop's roadmap; what surfaced as material — after several rounds of honest filtering — is the underlying question they prompted: **how confident should the approver be in the metric reported on the PR?** | +| [feat_auto_followup_studies](../02_product/planned_features/feat_auto_followup_studies/idea.md) | Feature | Karpathy's autoresearch loop runs hundreds of experiments overnight and **compounds** improvements: each accepted change becomes the new baseline for the next experiment. RelyLoop's equivalent… | — | Idea — surfaced during the 2026-05-21 Karpathy-loop audit. The highest-leverage recommendation from the audit's "across studies" section. | +| [feat_config_repo_baseline_tracking](../02_product/planned_features/feat_config_repo_baseline_tracking/idea.md) | Feature | RelyLoop does not track which configuration is currently live in production. 
When a proposal's PR merges, the merge webhook at [`backend/app/api/webhooks/github.py:187-191`](../../backend/app/api/webh | — | Idea — surfaced during the 2026-05-21 Karpathy-loop audit. | +| [feat_digest_executable_followups](../02_product/planned_features/feat_digest_executable_followups/idea.md) | Feature | The digest worker's LLM contract at [`backend/workers/digest.py:168-189`](../../backend/workers/digest.py) defines `suggested_followups` as a flat `array of string`: | — | Idea — surfaced during the 2026-05-21 Karpathy-loop audit. | | [feat_study_clone_from_previous](../02_product/planned_features/feat_study_clone_from_previous/idea.md) | Feature | A relevance engineer's normal workflow after the first study completes: | — | Idea — surfaced during a UX review of parameter-tuning ergonomics on 2026-05-19. | +| [chore_study_default_stop_conditions](../02_product/planned_features/chore_study_default_stop_conditions/idea.md) | Chore | The server-side `StudyConfigSpec` validator at [`backend/app/api/v1/schemas.py:572-580`](../../backend/app/api/v1/schemas.py) correctly **requires** at least one of `max_trials` or `time_budget_min` — | — | Idea — surfaced during the 2026-05-21 Karpathy-loop audit of the Studies workflow. | | [chore_template_library_expansion](../02_product/planned_features/chore_template_library_expansion/idea.md) | Chore | Three connected gaps: | — | Idea — surfaced during a UX review of parameter-tuning ergonomics on 2026-05-19. | | [bug_e2e_target_dropdown_flake](../02_product/planned_features/bug_e2e_target_dropdown_flake/idea.md) | Bug | The skipped test seeds two ES indices via Playwright's `request.put` (Node), opens the create-study modal, picks the seeded cluster via the cluster ``… | — | Idea — surfaced during `feat_create_study_target_autocomplete` Story F2 implementation; the new E2E happy-path spec is currently `test.skip`'d. 
| @@ -127,6 +139,8 @@ graph LR classDef plan fill:#fef9c3,stroke:#854d0e,color:#854d0e; classDef spec fill:#dbeafe,stroke:#1e40af,color:#1e40af; classDef idea fill:#f1f5f9,stroke:#334155,color:#334155; + feat_pr_metric_confidence["pr metric confidence"] + class feat_pr_metric_confidence spec; infra_foundation["foundation"] class infra_foundation done; feat_study_lifecycle["study lifecycle"] @@ -256,6 +270,7 @@ graph LR feat_github_webhook --> chore_tutorial_polish feat_judgments_periodic_resume_sweep --> chore_tutorial_polish feat_llm_judgments --> chore_tutorial_polish + feat_pr_metric_confidence --> chore_tutorial_polish feat_proposals_ui --> chore_tutorial_polish feat_query_inline_crud --> chore_tutorial_polish feat_studies_ui --> chore_tutorial_polish @@ -284,6 +299,7 @@ graph LR feat_github_webhook --> feat_chat_agent feat_judgments_periodic_resume_sweep --> feat_chat_agent feat_llm_judgments --> feat_chat_agent + feat_pr_metric_confidence --> feat_chat_agent feat_proposals_ui --> feat_chat_agent feat_query_inline_crud --> feat_chat_agent feat_studies_ui --> feat_chat_agent diff --git a/docs/00_overview/dashboard.html b/docs/00_overview/dashboard.html index e1631cc5..f163b2c6 100644 --- a/docs/00_overview/dashboard.html +++ b/docs/00_overview/dashboard.html @@ -371,7 +371,7 @@

Releases

The Loop
-
56 / 56 scoped done · 2 remaining
+
56 / 57 scoped done · 4 remaining
In progress
diff --git a/docs/00_overview/mvp1_dashboard.html b/docs/00_overview/mvp1_dashboard.html index d61747c4..4c91e601 100644 --- a/docs/00_overview/mvp1_dashboard.html +++ b/docs/00_overview/mvp1_dashboard.html @@ -369,12 +369,12 @@

RelyLoop MVP1 Dashboard

-
-
Next up
-
All scoped MVP1 features shipped 🎉
-
- Pull from the Idea backlog or capture a new feature spec. -
+
+
Next up — Feature, currently in Spec
+ +
Approvers reading a study-backed PR see a "## Confidence" section directly between the existing "## Metric delta" and "## Config diff" sections.
+
Spec exists; run /pipeline to generate the implementation plan + ship
+ /pipeline docs/02_product/planned_features/feat_pr_metric_confidence --auto
@@ -382,15 +382,15 @@

RelyLoop MVP1 Dashboard

MVP1 Progress

-
+
Scoped items done
-
56 / 56
-
100% of feat_/infra_/chore_/epic_ items past idea stage
-
+
56 / 57
+
98% of feat_/infra_/chore_/epic_ items past idea stage
+
Path to MVP1
-
2
+
4
items left = features + bugs + chores
@@ -400,14 +400,14 @@

MVP1 Progress

Open chores
-
1
+
2
idea-stage chore_* (debt)
Backlog ideas: - 2 idea-only feat/infra folders (not yet scoped into MVP1) + 4 idea-only feat/infra folders (not yet scoped into MVP1) In flight: @@ -428,15 +428,39 @@

Pipeline

-

Idea 4

+

Idea 7

+ +
+ +
+ Feature + +
+
Karpathy's autoresearch loop runs hundreds of experiments overnight and **compounds** improvements: each accepted change becomes the new baseline for the next experiment. RelyLoop's equivalent…
+ + +
+
- +
Feature
-
When the operator's approver opens a study-backed PR in the central search-config repo, the only confidence signal in the PR body is two scalar point estimates. From [`_render_pr_body_study_backed`](.
+
RelyLoop does not track which configuration is currently live in production. When a proposal's PR merges, the merge webhook at [`backend/app/api/webhooks/github.py:187-191`](../../backend/app/api/webh
+ + +
+ + +
+ +
+ Feature + +
+
The digest worker's LLM contract at [`backend/workers/digest.py:168-189`](../../backend/workers/digest.py) defines `suggested_followups` as a flat `array of string`:
@@ -454,6 +478,18 @@

Idea 4

+
+ +
+ Chore + +
+
The server-side `StudyConfigSpec` validator at [`backend/app/api/v1/schemas.py:572-580`](../../backend/app/api/v1/schemas.py) correctly **requires** at least one of `max_trials` or `time_budget_min` —
+ + +
+ +
@@ -480,7 +516,18 @@

Idea 4

-

Spec 0

+

Spec 1

+ +
+ +
+ Feature + PR #41merged 2026-05-11 +
+
Approvers reading a study-backed PR see a "## Confidence" section directly between the existing "## Metric delta" and "## Config diff" sections.
+
deferred: Phase 2
+ +
@@ -495,7 +542,7 @@

Implementing 0

-

Done 67

+

Done 68

@@ -517,7 +564,7 @@

Done 67

A chat surface at `/chat/{conversation_id}` streams OpenAI completions via SSE.
-
depends on: feat_agent_propose_search_spacefeat_cluster_target_filterfeat_contextual_helpfeat_contextual_help_mvp2feat_create_study_search_space_builderfeat_create_study_target_autocompletefeat_data_table_primitivefeat_digest_proposalfeat_fts_rank_ordering_mvp2feat_github_pr_workerfeat_github_webhookfeat_judgments_periodic_resume_sweepfeat_llm_judgmentsfeat_pr_metric_confidencefeat_proposals_uifeat_query_inline_crudfeat_studies_uifeat_study_clone_from_previousfeat_study_lifecycleinfra_adapter_elasticinfra_arq_subprocess_test_mvp2infra_ci_smoke_makeupinfra_dashboard_regen_pre_commit_conflictinfra_e2e_seed_completed_studyinfra_e2e_wire_seed_helper_into_studies_specinfra_foundationinfra_frontend_stack_refreshinfra_make_targets_split_backend_onlyinfra_nvmrcinfra_optuna_evalinfra_per_trial_timeoutinfra_structlog_test_helpersinfra_uv_sync_drops_precommit
+
depends on: feat_agent_propose_search_spacefeat_auto_followup_studiesfeat_cluster_target_filterfeat_config_repo_baseline_trackingfeat_contextual_helpfeat_contextual_help_mvp2feat_create_study_search_space_builderfeat_create_study_target_autocompletefeat_data_table_primitivefeat_digest_executable_followupsfeat_digest_proposalfeat_fts_rank_ordering_mvp2feat_github_pr_workerfeat_github_webhookfeat_judgments_periodic_resume_sweepfeat_llm_judgmentsfeat_pr_metric_confidencefeat_proposals_uifeat_query_inline_crudfeat_studies_uifeat_study_clone_from_previousfeat_study_lifecycleinfra_adapter_elasticinfra_arq_subprocess_test_mvp2infra_ci_smoke_makeupinfra_dashboard_regen_pre_commit_conflictinfra_e2e_seed_completed_studyinfra_e2e_wire_seed_helper_into_studies_specinfra_foundationinfra_frontend_stack_refreshinfra_make_targets_split_backend_onlyinfra_nvmrcinfra_optuna_evalinfra_per_trial_timeoutinfra_structlog_test_helpersinfra_uv_sync_drops_precommit
@@ -1165,7 +1212,7 @@

Done 67

The release tag `v0.1.0` is pushed with: a worked tutorial at `docs/08_guides/tutorial-first-study.md`, sample data (50-query set + sample ES index of ~1,000 docs from the Amazon ESCI subset), README
-
depends on: feat_agent_propose_search_spacefeat_chat_agentfeat_cluster_target_filterfeat_contextual_helpfeat_contextual_help_mvp2feat_create_study_search_space_builderfeat_create_study_target_autocompletefeat_data_table_primitivefeat_digest_proposalfeat_fts_rank_ordering_mvp2feat_github_pr_workerfeat_github_webhookfeat_judgments_periodic_resume_sweepfeat_llm_judgmentsfeat_pr_metric_confidencefeat_proposals_uifeat_query_inline_crudfeat_studies_uifeat_study_clone_from_previousfeat_study_lifecycleinfra_adapter_elasticinfra_arq_subprocess_test_mvp2infra_ci_smoke_makeupinfra_dashboard_regen_pre_commit_conflictinfra_e2e_seed_completed_studyinfra_e2e_wire_seed_helper_into_studies_specinfra_foundationinfra_frontend_stack_refreshinfra_make_targets_split_backend_onlyinfra_nvmrcinfra_optuna_evalinfra_per_trial_timeoutinfra_structlog_test_helpersinfra_uv_sync_drops_precommit
+
depends on: feat_agent_propose_search_spacefeat_auto_followup_studiesfeat_chat_agentfeat_cluster_target_filterfeat_config_repo_baseline_trackingfeat_contextual_helpfeat_contextual_help_mvp2feat_create_study_search_space_builderfeat_create_study_target_autocompletefeat_data_table_primitivefeat_digest_executable_followupsfeat_digest_proposalfeat_fts_rank_ordering_mvp2feat_github_pr_workerfeat_github_webhookfeat_judgments_periodic_resume_sweepfeat_llm_judgmentsfeat_pr_metric_confidencefeat_proposals_uifeat_query_inline_crudfeat_studies_uifeat_study_clone_from_previousfeat_study_lifecycleinfra_adapter_elasticinfra_arq_subprocess_test_mvp2infra_ci_smoke_makeupinfra_dashboard_regen_pre_commit_conflictinfra_e2e_seed_completed_studyinfra_e2e_wire_seed_helper_into_studies_specinfra_foundationinfra_frontend_stack_refreshinfra_make_targets_split_backend_onlyinfra_nvmrcinfra_optuna_evalinfra_per_trial_timeoutinfra_structlog_test_helpersinfra_uv_sync_drops_precommit
@@ -1217,6 +1264,18 @@

Done 67

+
+ +
+ Bug + merged 2026-05-20 +
+
Complete
+ + +
+ +
@@ -1313,6 +1372,8 @@

Dependency graph (feat_ + infra_)

classDef plan fill:#fef9c3,stroke:#854d0e,color:#854d0e; classDef spec fill:#dbeafe,stroke:#1e40af,color:#1e40af; classDef idea fill:#f1f5f9,stroke:#334155,color:#334155; + feat_pr_metric_confidence["pr metric confidence"] + class feat_pr_metric_confidence spec; infra_foundation["foundation"] class infra_foundation done; feat_study_lifecycle["study lifecycle"] @@ -1442,6 +1503,7 @@

Dependency graph (feat_ + infra_)

feat_github_webhook --> chore_tutorial_polish feat_judgments_periodic_resume_sweep --> chore_tutorial_polish feat_llm_judgments --> chore_tutorial_polish + feat_pr_metric_confidence --> chore_tutorial_polish feat_proposals_ui --> chore_tutorial_polish feat_query_inline_crud --> chore_tutorial_polish feat_studies_ui --> chore_tutorial_polish @@ -1470,6 +1532,7 @@

Dependency graph (feat_ + infra_)

feat_github_webhook --> feat_chat_agent feat_judgments_periodic_resume_sweep --> feat_chat_agent feat_llm_judgments --> feat_chat_agent + feat_pr_metric_confidence --> feat_chat_agent feat_proposals_ui --> feat_chat_agent feat_query_inline_crud --> feat_chat_agent feat_studies_ui --> feat_chat_agent @@ -1514,6 +1577,8 @@

Dependency graph (feat_ + infra_)

classDef plan fill:#fef9c3,stroke:#854d0e,color:#854d0e; classDef spec fill:#dbeafe,stroke:#1e40af,color:#1e40af; classDef idea fill:#f1f5f9,stroke:#334155,color:#334155; + feat_pr_metric_confidence["pr metric confidence"] + class feat_pr_metric_confidence spec; infra_foundation["foundation"] class infra_foundation done; feat_study_lifecycle["study lifecycle"] @@ -1643,6 +1708,7 @@

Dependency graph (feat_ + infra_)

feat_github_webhook --> chore_tutorial_polish feat_judgments_periodic_resume_sweep --> chore_tutorial_polish feat_llm_judgments --> chore_tutorial_polish + feat_pr_metric_confidence --> chore_tutorial_polish feat_proposals_ui --> chore_tutorial_polish feat_query_inline_crud --> chore_tutorial_polish feat_studies_ui --> chore_tutorial_polish @@ -1671,6 +1737,7 @@

Dependency graph (feat_ + infra_)

feat_github_webhook --> feat_chat_agent feat_judgments_periodic_resume_sweep --> feat_chat_agent feat_llm_judgments --> feat_chat_agent + feat_pr_metric_confidence --> feat_chat_agent feat_proposals_ui --> feat_chat_agent feat_query_inline_crud --> feat_chat_agent feat_studies_ui --> feat_chat_agent diff --git a/docs/02_product/planned_features/chore_study_default_stop_conditions/idea.md b/docs/02_product/planned_features/chore_study_default_stop_conditions/idea.md new file mode 100644 index 00000000..822ca172 --- /dev/null +++ b/docs/02_product/planned_features/chore_study_default_stop_conditions/idea.md @@ -0,0 +1,69 @@ +# Study Default Stop Conditions — recommended `max_trials` + `time_budget_min` defaults at the create-study surfaces + +**Date:** 2026-05-21 +**Status:** Idea — surfaced during the 2026-05-21 Karpathy-loop audit of the Studies workflow. +**Origin:** Standalone audit at `~/.claude/plans/compressed-sparking-hamming.md` — the "within-study loop" section. Verified live via grep of [`backend/app/api/v1/schemas.py:550-580`](../../../../backend/app/api/v1/schemas.py) + [`ui/src/components/studies/create-study-modal.tsx:98-100`](../../../../ui/src/components/studies/create-study-modal.tsx). +**Depends on:** None. Pure decision-support change at the create-study surfaces; no schema migration, no service-layer behavior change. + +## Problem + +The server-side `StudyConfigSpec` validator at [`backend/app/api/v1/schemas.py:572-580`](../../../../backend/app/api/v1/schemas.py) correctly **requires** at least one of `max_trials` or `time_budget_min` — so studies cannot be created with no stop condition. The system is safe. What it is not is **opinionated** about what a sensible overnight run looks like. + +Today, the two paths a study gets created on each surface this problem differently: + +1. 
**The create-study wizard** at [`ui/src/components/studies/create-study-modal.tsx:98-100`](../../../../ui/src/components/studies/create-study-modal.tsx) declares both fields as optional empty inputs (`max_trials?: number | ''` and `time_budget_min?: number | ''`). It pre-fills `parallelism: 4` at [line 136](../../../../ui/src/components/studies/create-study-modal.tsx#L136) but leaves both stop-condition inputs blank. A user creating a study via the wizard hits "Submit," gets the validator's 422 ("at least one of `max_trials` or `time_budget_min`"), and then types in *something* — usually whatever round number comes to mind. The Karpathy-loop discipline of "this experiment runs for exactly N trials / X minutes" is delegated entirely to the user's intuition. +2. **The `create_study` agent tool** at [`backend/app/agent/tools/studies/create_study.py`](../../../../backend/app/agent/tools/studies/create_study.py) reuses `CreateStudyRequest` (= `StudyConfigSpec`) verbatim. The LLM must pick a value with no project guidance — only the bare Pydantic schema constraints (`ge=1, le=100_000` for `max_trials`; `gt=0` for `time_budget_min`). There is no glossary entry or system-prompt directive that recommends a starting range. + +The compounding observation: the only existing per-trial time-box (`trial_timeout_s`, default 60s via [`backend/app/core/settings.py:282`](../../../../backend/app/core/settings.py)) is **the right shape** for Karpathy-loop discipline. The missing layer is a **per-study time-box default** with a recommended value, plus a wizard that surfaces "what overnight looks like" as a one-click preset. + +Karpathy's loop runs roughly 100–120 experiments per 8-hour overnight session. RelyLoop's per-trial timeout is 60s. 
With `parallelism=4` and assuming an average actual cost of 30s per trial (ES queries return faster than 60s in the common case), an 8-hour overnight session at full parallelism is `8 × 3600 × 4 / 30 = 3,840` trials, which is far more than Karpathy needs because each trial is much cheaper than ML training. A sensible default for an "overnight" preset is much lower than that upper bound and should match what TPE actually benefits from. Per Optuna docs and [`backend/app/eval/optuna_runtime.py:116-157`](../../../../backend/app/eval/optuna_runtime.py): pruning kicks in only at `max_trials >= 50`; TPE warms up around 10 trials; returns diminish past 200–500 trials for most low-dimensional search spaces. + +## Proposed capabilities + +Tiered. Tier A is the small UI change that captures most of the leverage. Tier B is the optional preset selector. + +### Tier A — wizard pre-fill + recommended-default copy + +- **Wizard pre-fill on Step 5.** Set the form default for `max_trials` to **200** when the input is empty on first render. Keep `time_budget_min` empty (so the user explicitly opts in to either kind of cap). Reasoning: 200 is well past TPE warmup (10) and median-pruner activation (50), within Optuna's diminishing-returns sweet spot, and at `parallelism=4` it takes (200 ÷ 4) × 30s ≈ 25 minutes wall-clock: short enough for an interactive session, long enough to be meaningful. +- **Glossary copy update** in [`ui/src/lib/glossary.ts`](../../../../ui/src/lib/glossary.ts) for the existing `study.max_trials` + `study.time_budget_min` keys. Add a one-sentence recommendation: "200 trials is a sensible default for a first study on a low-dimensional search space; 500–1000 for overnight runs." +- **InfoTooltip surfaces the recommendation.** The wizard already wires `` ([`create-study-modal.tsx:851`](../../../../ui/src/components/studies/create-study-modal.tsx#L851)) and `study.time_budget_min` ([line 862](../../../../ui/src/components/studies/create-study-modal.tsx#L862)). 
The glossary update propagates automatically via the existing `InfoTooltip` component. +- **System prompt entry** in [`prompts/orchestrator.system.md`](../../../../prompts/orchestrator.system.md) — add a sentence to the Studies tools section: "When the user has not specified a stop condition, propose `max_trials=200` for a first study or `max_trials=500–1000` (or `time_budget_min=240–480`) for overnight runs." + +### Tier B — "Quick" vs "Overnight" preset selector on Step 5 + +- **Preset radio above the numeric inputs.** Three presets plus a `Custom` option: + - `Quick (50 trials, ~5 min)` — `max_trials=50, parallelism=4, trial_timeout_s=60` + - `Standard (200 trials, ~25 min)` — `max_trials=200, parallelism=4, trial_timeout_s=60` (Tier A default) + - `Overnight (max 8h, 1000 trials)` — `max_trials=1000, time_budget_min=480, parallelism=4, trial_timeout_s=60` (whichever stop condition is reached first wins) + - `Custom` — leaves the existing fields manually editable; preset selection has no effect. +- **Selecting a preset writes the four fields and disables them** (with a "Switch to Custom" link to re-enable). This makes the Karpathy-loop preset visible and one-click; the existing manual path remains available. +- **Frontend-only state** — no new wire-value enum, no new backend logic. The preset selector is purely a form-prefill convenience. + +### Out of scope + +- **Adaptive parallelism** (auto-scale `parallelism` up or down based on observed trial latency) — interesting, but a real product-design surface. Defer. +- **A separate "Karpathy mode" preset that combines `max_trials=200` + auto-followup chaining** — that belongs to [`feat_auto_followup_studies`](../feat_auto_followup_studies/idea.md), not here. +- **Backend-side default changes** (changing `default=None` to `default=200` in the Pydantic field) — rejected. The existing validator behavior (force the user to opt in) is the right safety net for the API surface.
Backend defaults would silently apply to legacy callers without an upgrade signal; wizard pre-fill is the right place. + +## Scope signals + +- **Backend:** Tier A: ~5 LOC in [`prompts/orchestrator.system.md`](../../../../prompts/orchestrator.system.md). Tier B: nothing. +- **Frontend:** Tier A: ~15 LOC (form default + 2 glossary entries + 1 test asserting the pre-fill renders). Tier B: ~150 LOC (preset radio + 3 vitest cases asserting each preset writes the right field bundle + 1 case for Custom mode). +- **Migration:** none. +- **Config:** none. +- **Audit events:** N/A. +- **Tests:** Tier A: 1 vitest case in `create-study-modal.test.tsx` asserting the `max_trials` field renders with `200` by default. Tier B: 4 cases (3 presets + custom). + +## Why not inline today + +This idea is **borderline** on the inline-fix rubric in [`CLAUDE.md`](../../../../CLAUDE.md) "Inline-fix vs idea-file rubric." Tier A alone is ≤50 LOC and touches a single subsystem — by the rubric it should be **implemented inline** on the next PR that touches the wizard. The reason it's captured as an idea file rather than landed inline: + +1. **Product call on the recommended-default number.** "200" is defensible but not obviously right — 100, 250, 500 are all candidates. Picking the wrong number means every new study created via the wizard gets that number, which is a one-way change. Worth a deliberate decision rather than a drive-by commit. +2. **Tier B is the more interesting unit.** A preset selector that surfaces "Quick / Standard / Overnight" as one-click options is a real UX addition, not a tweak. Pairing the default tweak (Tier A) with the preset (Tier B) in one PR gives reviewers the full picture; landing Tier A alone in a drive-by would leave the bigger UX gap for later. +3. **Cross-surface coordination.** Tier A modifies both the wizard AND the orchestrator system prompt. Two surfaces is the upper bound of "drive-by"; doing it as a planned chore keeps the change traceable. 
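
The wall-clock figures attached to the preset numbers in this idea file follow from a simple wave estimate. The block below is a minimal sketch of that arithmetic, assuming the 30s average trial cost used earlier in this file and that trials run in full waves of `parallelism`; both helper names are hypothetical and do not exist in the codebase.

```python
import math


def estimate_wall_clock_min(
    max_trials: int, parallelism: int = 4, avg_trial_s: float = 30.0
) -> float:
    """Rough wall-clock estimate in minutes (hypothetical helper).

    Assumes trials run in waves of `parallelism` and that every wave
    takes `avg_trial_s` seconds, which is the 30s assumption from the text.
    """
    waves = math.ceil(max_trials / parallelism)
    return waves * avg_trial_s / 60.0


def trials_in_budget(
    hours: float, parallelism: int = 4, avg_trial_s: float = 30.0
) -> int:
    """Upper bound on trials that fit in a time budget at full parallelism."""
    return int(hours * 3600 * parallelism / avg_trial_s)


if __name__ == "__main__":
    for name, trials in [("Quick", 50), ("Standard", 200), ("Overnight", 1000)]:
        print(f"{name}: {trials} trials, about {estimate_wall_clock_min(trials):.1f} min")
    print("8h upper bound:", trials_in_budget(8), "trials")
```

Under these assumptions the Standard preset comes out at 25 minutes, the Quick preset at about 6.5 minutes (slightly above its "~5 min" label), and the Overnight preset's 1000 trials at roughly two hours, so `max_trials` rather than the 480-minute budget would usually be the binding stop condition. If real trials are slower than 30s, the time budget starts to matter.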
+ +## Relationship to other work + +- **Substrate for [`feat_auto_followup_studies`](../feat_auto_followup_studies/idea.md)** — that feature relies on every study having a known finite stop condition so chained follow-ups inherit a sensible budget. The default-stop-condition work makes "chained study with depth=3" mean something concrete (e.g., "three 200-trial studies, ~75 min total"). +- **Aligned with [`feat_pr_metric_confidence`](../feat_pr_metric_confidence/idea.md)** — convergence-trajectory and late-trial noise-floor analytics in the PR body are most meaningful when the operator knows the study had room to converge. A 50-trial study with "best found at trial 49" reads very differently from a 200-trial study with "best found at trial 87." +- **Composes with [`feat_create_study_search_space_builder`](../../../00_overview/implemented_features/2026_05_20_feat_create_study_search_space_builder/)** (shipped 2026-05-20) — the search-space builder is the substantive Step 4. This chore polishes Step 5, the "how long do we run it" surface. diff --git a/docs/02_product/planned_features/feat_auto_followup_studies/idea.md b/docs/02_product/planned_features/feat_auto_followup_studies/idea.md new file mode 100644 index 00000000..9c7e9fc4 --- /dev/null +++ b/docs/02_product/planned_features/feat_auto_followup_studies/idea.md @@ -0,0 +1,89 @@ +# Auto-Followup Studies — autonomous study chaining with operator-set depth cap, the closest unintrusive analog to Karpathy compounding + +**Date:** 2026-05-21 +**Status:** Idea — surfaced during the 2026-05-21 Karpathy-loop audit. The highest-leverage recommendation from the audit's "across studies" section. +**Origin:** Standalone audit at `~/.claude/plans/compressed-sparking-hamming.md` — recommendation #3. The audit's central finding: RelyLoop has a strong *within-study* loop but no *across-study* compounding. After a study completes, the operator must manually read the digest, decide to chain a followup, and configure it by hand. 
The agent doesn't observe study completion. +**Depends on:** [`feat_config_repo_baseline_tracking`](../feat_config_repo_baseline_tracking/idea.md) (substrate — tells the followup what config is currently live). Composes well with [`chore_study_default_stop_conditions`](../chore_study_default_stop_conditions/idea.md) and [`feat_digest_executable_followups`](../feat_digest_executable_followups/idea.md). + +## Problem + +Karpathy's autoresearch loop runs hundreds of experiments overnight and **compounds** improvements: each accepted change becomes the new baseline for the next experiment. RelyLoop's equivalent ("each merged proposal becomes the new baseline for the next study") is **manual at four gates**: + +1. Operator reads the digest after study A completes (manual). +2. Operator clicks "Open PR" on the proposal (manual — verified at [`backend/app/api/v1/proposals.py:474-502`](../../../../backend/app/api/v1/proposals.py)). +3. Operator merges the PR (out of RelyLoop's control — delegated to GitHub branch protection per CLAUDE.md persona note). +4. Operator manually creates study B with study A's winner as a starting point — typically by calling `propose_search_space(prior_study_id=A)` in chat then `create_study(...)` per the prompt at [`prompts/orchestrator.system.md`](../../../../prompts/orchestrator.system.md). The agent never observes study A's completion: [`backend/app/agent/orchestrator.py:160-286`](../../../../backend/app/agent/orchestrator.py) is only invoked via `send_user_message`. There is no background scheduler, no "study completed" event that wakes the agent, no auto-followup queue. + +The audit's verdict: gates 2 and 3 are correctly human-in-the-loop (production config changes need human approval — umbrella spec §6 hard constraint). Gates 1 and 4 are the **exploration side** — and exploration is exactly where Karpathy's overnight compounding wins.
An operator opting in to "run 3 chained studies overnight, each narrowing around the prior winner, with no PRs opened until I review in the morning" is fully compatible with the human-merge invariant. + +The substrate for this already exists: + +- [`propose_search_space`](../../../../backend/app/agent/tools/studies/propose_search_space.py) accepts an optional `prior_study_id` and narrows numeric bounds via `winner ± |winner| × bracket` (default bracket=0.5). This is exactly the "narrow around the winner" primitive a chained follow-up needs. +- The digest worker at [`backend/workers/digest.py`](../../../../backend/workers/digest.py) runs automatically after a study completes — it is the natural place to enqueue the follow-up. +- The orchestrator at [`backend/workers/orchestrator.py:163-250`](../../../../backend/workers/orchestrator.py) already runs studies headless; a queued follow-up is just another study to it. + +What's missing is a **trigger + a depth counter + a bound check**, all small additions. + +## Proposed capabilities + +Tiered. Tier A is the minimal opt-in loop. Tier B is the safety + visibility surface that makes Tier A operator-trustworthy. + +### Tier A — opt-in `auto_followup_depth` on study config + +- **New `studies.config.auto_followup_depth: int | None = None`** field on [`StudyConfigSpec`](../../../../backend/app/api/v1/schemas.py) (lines 550–580). Defaults to `None` = off; positive integer = depth cap (e.g., `3` chains up to 3 followups). Pydantic validator: `1 <= auto_followup_depth <= 10` when set. +- **Trigger** in the digest worker at [`backend/workers/digest.py`](../../../../backend/workers/digest.py), after the digest is persisted and the pending proposal is created: if `(study.config.get("auto_followup_depth") or 0) > 0` (the `or 0` guards against a stored `None`) AND `study.best_metric is not None` (study completed with a winner) AND the gate condition below passes, enqueue a new Arq job `enqueue_followup_study(parent_study_id)`.
+- **Gate condition** — the followup fires only if the winner is meaningfully above baseline. Default rule (tunable in feature spec): `study.best_metric > (study.baseline_metric or 0) + epsilon` where `epsilon = 0.005` (half-percent absolute lift). Studies whose winner did not beat the baseline by `epsilon` do **not** chain — the search space is exhausted or the optimizer got noise. +- **Follow-up creation** in a new worker function `enqueue_followup_study` at [`backend/workers/orchestrator.py`](../../../../backend/workers/orchestrator.py): + 1. Load parent study + best trial. + 2. Call `propose_search_space(template_id=parent.template_id, prior_study_id=parent.id, bracket=0.5)` — already implemented. + 3. Build a new `CreateStudyRequest` inheriting parent's `cluster_id`, `target`, `template_id`, `query_set_id`, `judgment_list_id`, `objective`, with `search_space` from step 2, `config.auto_followup_depth = parent.config.auto_followup_depth - 1`, all other `config` fields inherited (same `max_trials`, `time_budget_min`, `parallelism`, `trial_timeout_s`). + 4. Insert via `repo.create_study()` and enqueue `start_study(new_study_id)`. +- **Parent-child relationship** persisted via a new nullable column `studies.parent_study_id VARCHAR(36) NULL REFERENCES studies(id) ON DELETE SET NULL`. Enables the UI to render a chain ("Study A → Study A.1 → Study A.2") and lets `parameter_importance` analyses compose across the chain. +- **No autonomous PR opening.** The default behavior: each follow-up generates a digest + a pending proposal (per existing flow) but does NOT auto-call `open_pr`. The operator reviews all proposals in the morning. A separate later feature could add `auto_open_pr_on_followup: bool` once the trust model is established. 
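
The Tier A rules above (the lift gate, the depth decrement, and the `winner ± |winner| × bracket` narrowing) are small enough to sketch as pure functions. This is an illustrative sketch only: the function names are hypothetical, the narrowing helper restates the formula quoted in this document rather than the actual `propose_search_space` implementation, and mapping an exhausted depth to `None` is an assumption made here so the child config still satisfies the proposed `1 <= auto_followup_depth <= 10` validator.

```python
from __future__ import annotations

from typing import Optional

EPSILON = 0.005  # absolute lift threshold from the gate-condition bullet


def should_chain(
    depth: Optional[int],
    best_metric: Optional[float],
    baseline_metric: Optional[float],
    epsilon: float = EPSILON,
) -> bool:
    """Return True when a follow-up should be enqueued (hypothetical helper)."""
    if not depth or depth <= 0:
        return False  # feature is off or the chain is exhausted
    if best_metric is None:
        return False  # the study finished without a winner
    # A missing baseline is treated as 0, mirroring `(study.baseline_metric or 0)`.
    return best_metric > (baseline_metric or 0.0) + epsilon


def child_depth(parent_depth: int) -> Optional[int]:
    """Depth to store on the child study.

    Assumption: a remaining depth of 0 is stored as None ("off") instead of 0,
    because the proposed validator only accepts values from 1 to 10 when set.
    """
    remaining = parent_depth - 1
    return remaining if remaining >= 1 else None


def narrow_bounds(winner: float, bracket: float = 0.5) -> tuple[float, float]:
    """Bounds of winner ± |winner| × bracket, as quoted in this document."""
    half_width = abs(winner) * bracket
    return (winner - half_width, winner + half_width)
```

With these rules a parent created with `auto_followup_depth=3` yields children carrying depths 2, 1, and `None`, which matches the "chains up to 3 followups" reading. Whether the real field should store `0` or `None` at the end of the chain is a decision for the feature spec.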
+ +### Tier B — safety, visibility, and the global circuit breaker + +- **Daily LLM budget integration.** The existing daily budget gate at [`backend/workers/digest.py`](../../../../backend/workers/digest.py) (lines 553–577) already short-circuits digest LLM calls. `enqueue_followup_study` reads `peek_daily_total()` before enqueueing — if the daily total is below 80% of `OPENAI_DAILY_BUDGET_USD`, proceed; otherwise log an `auto_followup.budget_pre_empt` WARN event and do not enqueue. The follow-up study itself runs without LLM (Optuna + pytrec_eval are deterministic) but the **digest at its completion** will need LLM budget, so we gate at enqueue time. +- **Failure-aware halting.** If the parent study terminated via the 5-consecutive-failures circuit breaker (per [`backend/workers/orchestrator.py:69-70`](../../../../backend/workers/orchestrator.py)), do NOT enqueue a followup. Logged as `auto_followup.parent_failed`. +- **UI surface** on the study detail page at [`ui/src/app/studies/[id]/page.tsx`](../../../../ui/src/app/studies/%5Bid%5D/page.tsx): a new "Auto-follow-up chain" panel showing the parent + children + depth counter (e.g., "Auto-chain: 1 of 3 — next follow-up will narrow around current winner"). When a child study exists, link to it. +- **Cancellation cascade.** When a parent study is cancelled, the operator should be able to decide what happens to in-flight or queued children. Default: cancel the in-flight child; the depth counter is consumed. UI surface: a confirm-modal at cancel time. +- **Telemetry events** at the structlog layer: `auto_followup.enqueued`, `auto_followup.skipped_no_lift`, `auto_followup.skipped_budget`, `auto_followup.skipped_parent_failed`, `auto_followup.depth_exhausted`. Operator-greppable per the existing telemetry pattern. + +### Out of scope + +- **Auto-PR opening on followup chains.** Argued and explicitly deferred.
Once the operator trusts the chain, a future feature could add `auto_open_pr_at_depth: int | None` ("open the PR only when the deepest member of the chain finishes"). For v1, every member of the chain produces a manual-review proposal. +- **Search-space *widening* on stagnation.** If three followups in a row produce no lift, the natural next move is to widen the search space and try a different region — but that needs a different heuristic than `propose_search_space`'s `prior_study_id` narrowing. Captured as a follow-up idea: `feat_search_space_stagnation_widening` (will write later if this v1 ships). +- **Cross-template chains.** Today `propose_search_space(prior_study_id=...)` only works when the followup uses the same template. Cross-template chains would require a `swap_template` heuristic that maps prior winners' params onto the new template's `declared_params`. Out of scope — composes with [`feat_digest_executable_followups`](../feat_digest_executable_followups/idea.md) which already needs that primitive. +- **Multi-objective chains.** Single-objective only in MVP1 per umbrella spec §13. Out of scope. + +## Scope signals + +- **Backend:** ~600 LOC. Pydantic field + validator (~10) + Alembic migration for `parent_study_id` (~30) + ORM model field (~5) + `enqueue_followup_study` worker (~100) + digest-worker trigger integration (~30) + budget gate (~20) + telemetry events (~30) + cascade-cancel service logic (~50) + repo layer joins (~30) + tests across unit/integration/contract (~300). +- **Frontend:** ~300 LOC. Auto-follow-up chain panel (~200) + opt-in field in the create-study wizard with depth selector (~50) + cancel-cascade confirm modal (~50) + vitest coverage. +- **Migration:** one Alembic migration adding `studies.parent_study_id`. Strictly additive, nullable, ON DELETE SET NULL. Round-trip-clean. +- **Config:** none new (uses existing `OPENAI_DAILY_BUDGET_USD`). +- **Audit events:** N/A (MVP1 has no audit_log). 
At MVP2 the followup-enqueue and budget-skip events become canonical audit events. +- **Tests:** + - Unit: gate-condition arithmetic; depth decrement; parent-failure check. + - Integration: parent study completes → child study enqueued + correct config inherited; budget exhausted → no enqueue; parent failed → no enqueue; depth-3 chain finishes correctly; cancelled parent halts in-flight child. + - Contract: study detail response includes `parent_study_id` + `auto_followup_depth`. + - End-to-end: a chained-study integration test that runs 3 stub-adapter studies in sequence and asserts each child's search space narrows around the parent's winner. + +## Why not inline today + +1. **Cross-subsystem, cross-stack.** Touches schema migration + worker logic + agent-tool composition + UI surface + operator telemetry. Far outside the inline-fix budget per [`CLAUDE.md`](../../../../CLAUDE.md) rubric. +2. **Multiple product-design forks.** The gate condition (epsilon threshold for "enough lift to chain"), the depth cap (how many is sane?), the inheritance rules (parallelism inherits or resets?), the budget-gate threshold (80% of daily? 90%?), and the cancellation cascade behavior (cancel children eagerly? let them finish?) all need spec-level decisions. None are obvious defaults. +3. **Trust-building substrate.** This feature changes RelyLoop's autonomy story materially — from "operator-initiated single studies" to "operator-initiated chains that compound overnight." The change deserves visible operator surfaces (the chain panel, the telemetry events, the cancellation cascade) that take real design effort. Shipping it as a chore would underweight the operator-trust dimension. +4. **Depends on a substrate that doesn't exist yet.** [`feat_config_repo_baseline_tracking`](../feat_config_repo_baseline_tracking/idea.md) provides the "what's the current baseline" answer that this feature needs for the gate condition's baseline comparison. 
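
The product-design forks listed in item 2 can be made concrete as tunable knobs feeding one decision function. The following is a minimal sketch under stated assumptions: `FollowupPolicy` and `followup_event` are hypothetical names, the order of the checks is an assumption, and the returned strings reuse the telemetry event names from the Tier B list. The caller is assumed to have already confirmed that the study opted in, so `depth` is an integer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FollowupPolicy:
    """Hypothetical knobs; defaults mirror the values floated in this document."""

    epsilon: float = 0.005         # minimum absolute lift over baseline
    budget_fraction: float = 0.80  # share of the daily LLM budget that blocks enqueue


def followup_event(
    *,
    depth: int,
    parent_failed: bool,
    best_metric: Optional[float],
    baseline_metric: Optional[float],
    daily_spend_usd: float,
    daily_budget_usd: float,
    policy: FollowupPolicy = FollowupPolicy(),
) -> str:
    """Return the telemetry event name describing the enqueue decision."""
    if depth <= 0:
        return "auto_followup.depth_exhausted"
    if parent_failed:
        return "auto_followup.skipped_parent_failed"
    # No winner, or a winner that does not clear baseline + epsilon.
    if best_metric is None or best_metric <= (baseline_metric or 0.0) + policy.epsilon:
        return "auto_followup.skipped_no_lift"
    # Proceed only while spend is strictly below the configured budget fraction.
    if daily_spend_usd >= policy.budget_fraction * daily_budget_usd:
        return "auto_followup.skipped_budget"
    return "auto_followup.enqueued"
```

Keeping the thresholds in one frozen dataclass would let the feature spec change a single default (for example 0.90 instead of 0.80) without touching the decision logic, and it makes each skip reason directly testable.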
+ +## Relationship to other work + +- **Most-leveraged consumer of [`feat_agent_propose_search_space`](../../../00_overview/implemented_features/2026_05_21_feat_agent_propose_search_space/)** (shipped 2026-05-21). That feature provides the `prior_study_id` narrowing primitive in isolation; this feature builds the autonomous loop around it. Without auto-followup, `prior_study_id` requires manual operator chaining; with auto-followup, the same primitive compounds overnight. +- **Depends on [`feat_config_repo_baseline_tracking`](../feat_config_repo_baseline_tracking/idea.md)** for the baseline-comparison gate. The latter must ship first. +- **Composes with [`chore_study_default_stop_conditions`](../chore_study_default_stop_conditions/idea.md)** — if every study in the chain has a known finite stop condition (e.g., `max_trials=200`), the chain's total resource footprint is predictable. Without sane defaults, a 3-deep chain with no caps would be catastrophic. +- **Composes with [`feat_digest_executable_followups`](../feat_digest_executable_followups/idea.md)** — that feature gives the LLM a structured "next followup" output; this feature acts on a *programmatic* default. The two coexist: the LLM-suggested followups remain advisory for the operator, while the auto-chain consumes the deterministic `propose_search_space(prior_study_id=...)` heuristic. Together they cover both the LLM-judgment-rich and the autonomous-deterministic paths. +- **Validates [`feat_pr_metric_confidence`](../feat_pr_metric_confidence/idea.md)** — auto-chained studies generate the data substrate (multiple studies with the same template + cluster) that lets the convergence-trajectory and noise-floor analytics show real cross-study patterns. + +## Karpathy-loop framing + +Per the framing in the surfacing audit: this feature is the **single largest gap** between RelyLoop today and a Karpathy-style overnight loop. 
RelyLoop already runs hundreds of trials within a study, scores them against a single metric, persists results, and picks a winner. What it doesn't do is **compound** — each study is one-shot. This feature makes the compounding optional, operator-controlled, and bounded — three properties Karpathy's loop also has (he runs his loop in time-boxed batches, not unbounded). Shipping this turns RelyLoop's "Karpathy-loop scorecard" row from ❌ to ✅ on the "Compounding across experiments" dimension. diff --git a/docs/02_product/planned_features/feat_config_repo_baseline_tracking/idea.md b/docs/02_product/planned_features/feat_config_repo_baseline_tracking/idea.md new file mode 100644 index 00000000..ececd586 --- /dev/null +++ b/docs/02_product/planned_features/feat_config_repo_baseline_tracking/idea.md @@ -0,0 +1,73 @@ +# Config Repo Baseline Tracking — record the last merged proposal per `config_repo` so future studies can auto-bootstrap their baseline + +**Date:** 2026-05-21 +**Status:** Idea — surfaced during the 2026-05-21 Karpathy-loop audit. +**Origin:** Standalone audit at `~/.claude/plans/compressed-sparking-hamming.md` — the "across studies" gap section. Verified live via grep of [`backend/app/db/models/config_repo.py`](../../../../backend/app/db/models/config_repo.py) (no `last_merged_*` field) + [`backend/app/api/webhooks/github.py:183-191`](../../../../backend/app/api/webhooks/github.py) (the merge webhook stamps `pr_merged_at` on the *proposal* but does not propagate that to the config repo). +**Depends on:** [`feat_github_webhook`](../../../00_overview/implemented_features/2026_05_12_feat_github_webhook/) (shipped 2026-05-12) — provides the merge event that this feature consumes. + +## Problem + +RelyLoop does not track which configuration is currently live in production. 
When a proposal's PR merges, the merge webhook at [`backend/app/api/webhooks/github.py:187-191`](../../../../backend/app/api/webhooks/github.py) updates `proposals.pr_state = 'merged'` and `proposals.pr_merged_at = `, but no field on [`config_repos`](../../../../backend/app/db/models/config_repo.py) (or [`clusters`](../../../../backend/app/db/models/cluster.py)) records that this was the **most recent merged proposal**. To answer "what's currently deployed?" the system has to scan all proposals for a given config repo, filter by `pr_state = 'merged'`, sort by `pr_merged_at DESC`, and take the first row. That query exists nowhere in the codebase today. + +This omission compounds three downstream gaps: + +1. **No baseline for the next study.** When an operator creates study B after study A's winner merged, there is no signal in the system that says "the live config is now study A's winner." The new study's `baseline_metric` (per [`backend/app/db/models/study.py`](../../../../backend/app/db/models/study.py)) is whatever the operator measures *manually* against the running cluster — there is no automated path. This blocks [`feat_auto_followup_studies`](../feat_auto_followup_studies/idea.md) at the protocol level: the auto-followup worker doesn't know which config the next study should beat. +2. **No "last shipped" surface on the proposals UI.** The proposals list at [`ui/src/app/proposals/page.tsx`](../../../../ui/src/app/proposals/page.tsx) shows pending/open/closed/merged status per proposal but cannot say "this PR superseded the one that shipped 5 days ago — here's what changed since then." A `config_repos.last_merged_proposal_id` denormalization gives the UI the anchor to render that view. +3. **No drift detection.** If the operator's CI/CD silently fails to deploy a merged proposal (per CLAUDE.md persona note: deploy is operator-owned, outside RelyLoop), the system has no record of "we believe X is live; operator should confirm." 
Knowing the last-merged proposal is the first ingredient of any "is live config in sync with our records" health check (the operator-side check itself is out of scope, but the substrate isn't). + +The umbrella spec §6 hard constraint — "Approvers... cannot be bypassed... the tool delegates approval to the config repo's branch protection" — means RelyLoop never *enforces* what's deployed. But knowing what was *last merged* is internal bookkeeping, not enforcement, and there is no contradiction with the spec. + +## Proposed capabilities + +Single tier — small, additive, schema-only at the DB layer. + +### Schema change + +- **One new column** on `config_repos`: `last_merged_proposal_id VARCHAR(36) NULL REFERENCES proposals(id) ON DELETE SET NULL`. Nullable because a fresh config repo has no merged proposal yet. `ON DELETE SET NULL` because operator-deleted proposals shouldn't break the config_repo row (the proposal table doesn't currently have soft-delete; if/when it does, this changes). +- **Index** `ix_config_repos_last_merged_proposal_id` on the new FK (`btree`, default). The index is small and supports the "find the config_repo by last-merged-proposal" reverse lookup that the proposals UI uses. +- **No `last_merged_at` denormalization** — readers join to `proposals.pr_merged_at` instead. Saves a column and keeps `proposals.pr_merged_at` as the single source of truth. +- **Alembic migration** `00NN_config_repos_last_merged_proposal_id`. Strictly additive; `downgrade()` drops the FK column + index. Round-trip-clean per Absolute Rule #5. + +### Webhook handler update + +- **Location:** [`backend/app/api/webhooks/github.py:187-191`](../../../../backend/app/api/webhooks/github.py) — where `update_proposal_merged()` is called today. 
+- **New behavior:** in the same transaction as the proposal merge update, look up the proposal's `study.config_repo_id` (via the existing FK chain: `proposals → studies → cluster → config_repo`, or more directly if there's a `proposals.config_repo_id` denorm — verify in the feature spec), then `UPDATE config_repos SET last_merged_proposal_id = :proposal_id WHERE id = :config_repo_id`. Idempotent: if the new merge's `pr_merged_at` is older than the currently-tracked proposal's `pr_merged_at`, do not overwrite (an out-of-order webhook should not regress the pointer). +- **No new webhook event types** — this rides the existing `pull_request.closed` (with `merged=true`) handler path. +- **Tests:** integration test asserting the column updates when a merged-PR webhook fires; integration test asserting an older-timestamp merge does not overwrite a newer-timestamp pointer. + +### Read surface + +- **`ConfigRepoDetail` response model** ([`backend/app/api/v1/schemas.py`](../../../../backend/app/api/v1/schemas.py)) gains a `last_merged_proposal: ProposalSummary | None` field. The endpoint at [`backend/app/api/v1/config_repos.py`](../../../../backend/app/api/v1/config_repos.py) joins to load the proposal row. +- **Proposals list filtering** at [`backend/app/api/v1/proposals.py`](../../../../backend/app/api/v1/proposals.py) gains an optional `is_last_merged: bool` query param so the UI can highlight the live config in the list. Cursor pagination remains. +- **UI:** `ConfigRepoDetail` page (when it exists; today the cluster detail page surfaces the linked config repo) gains a "Currently live: [proposal name] — merged on [date]" badge. Minor change, ~30 LOC. + +### Out of scope + +- **Cluster-level baseline tracking.** Argued and rejected: `config_repos` is the right scope because one repo can serve multiple clusters (e.g., dev + staging + prod) and the operator's CI/CD applies the merged config to all of them in step. 
Tracking per-cluster would create three different "last merged" records for the same merge event and confuse the auto-followup story. +- **Live-cluster verification.** Querying the running cluster to confirm the merged config actually deployed is outside RelyLoop's scope per CLAUDE.md (RelyLoop never sits on the serving path). +- **Multi-repo studies.** A future study may target multiple config repos. The schema permits this naturally (each repo gets its own `last_merged_proposal_id`); the UI surface is out of scope until that user story is real. + +## Scope signals + +- **Backend:** ~150 LOC. Alembic migration (~30 LOC) + ORM field (~5) + webhook handler update (~30) + idempotency check (~15) + response-model field (~10) + endpoint join (~20) + tests across unit/integration/contract (~50). +- **Frontend:** ~50 LOC for the "Currently live" badge on the config repo / cluster detail surface + 1 vitest case. +- **Migration:** one strictly additive Alembic migration. Nullable column + index. Round-trip drops cleanly. +- **Config:** none. +- **Audit events:** N/A (MVP1 has no audit_log). +- **Tests:** + - Integration: 3 cases — webhook fires + column updates; older-timestamp webhook is a no-op; cascade-delete of proposal nulls the column. + - Contract: 1 case — `GET /config_repos/{id}` response shape includes `last_merged_proposal`. + - Migration round-trip: 1 test. + +## Why not implemented inline today + +1. **Schema migration on a production table.** `config_repos` is a real shared table — any schema change ships through the Alembic discipline (Absolute Rule #5) plus a round-trip-clean verify. Not a drive-by. +2. **Cross-cuts the webhook + proposals routers.** The webhook handler is a hot path (CLAUDE.md persona note: "webhook idempotency required"). Adding logic without spec-shaped scrutiny risks subtle ordering bugs. +3. **Substrate for a larger feature.** This feature has no user-visible behavior change by itself — it's bookkeeping. 
The "Currently live" badge is mild; the real payoff lands when [`feat_auto_followup_studies`](../feat_auto_followup_studies/idea.md) consumes the new column. Shipping the substrate as its own small PR lets reviewers focus on the schema + idempotency, then the consumer lands cleanly on top. + +## Relationship to other work + +- **Required by [`feat_auto_followup_studies`](../feat_auto_followup_studies/idea.md)** — that feature's auto-chained follow-up worker needs to know "what's currently deployed?" to set the next study's baseline meaningfully. This idea ships first. +- **Adjacent to [`feat_pr_metric_confidence`](../feat_pr_metric_confidence/idea.md)** — the PR body would naturally include "previously deployed: [last_merged_proposal_name] from [date]" as part of the confidence framing. Composes cleanly once both ship. +- **Extends [`feat_github_webhook`](../../../00_overview/implemented_features/2026_05_12_feat_github_webhook/)** — the merge event becomes load-bearing for a downstream feature, raising the importance of the webhook's existing idempotency invariants. +- **Visible in [`feat_proposals_ui`](../../../00_overview/implemented_features/2026_05_12_feat_proposals_ui/)** — the proposals list gains a "live config" indicator and the proposal-detail page can show "supersedes proposal X" backwards-pointer. diff --git a/docs/02_product/planned_features/feat_digest_executable_followups/idea.md b/docs/02_product/planned_features/feat_digest_executable_followups/idea.md new file mode 100644 index 00000000..701e5b02 --- /dev/null +++ b/docs/02_product/planned_features/feat_digest_executable_followups/idea.md @@ -0,0 +1,120 @@ +# Executable Digest Follow-ups — turn `suggested_followups` from dead narrative text into structured proposals an operator can run with one click + +**Date:** 2026-05-21 +**Status:** Idea — surfaced during the 2026-05-21 Karpathy-loop audit. +**Origin:** Standalone audit at `~/.claude/plans/compressed-sparking-hamming.md` — recommendation #4. 
The audit observation: the digest worker's LLM output already includes `suggested_followups`, but the field is shaped as plain strings, rendered in the proposals UI as bullet text, and has no path back into `create_study` or `propose_search_space`. Operators read suggestions like "narrow the title_boost range to [0.5, 3.0]" and have to manually translate them into a new study configuration. +**Depends on:** None. Composes with [`feat_auto_followup_studies`](../feat_auto_followup_studies/idea.md) (which automates the deterministic followup); this feature handles the **LLM-suggested** followups separately. + +## Problem + +The digest worker's LLM contract at [`backend/workers/digest.py:168-189`](../../../../backend/workers/digest.py) defines `suggested_followups` as a flat `array of string`: + +```python +DIGEST_RESPONSE_SCHEMA = { + "type": "object", + "properties": { + "narrative": {"type": "string"}, + "suggested_followups": { + "type": "array", + "items": {"type": "string"}, + "maxItems": 5, + }, + }, + ... +} +``` + +The strings are LLM-generated freeform — typical examples (from real digest outputs): + +- "Try narrowing `title_boost` to the range [1.5, 3.0] where the top-decile trials clustered." +- "Investigate the `tie_breaker` parameter — its importance was 0.18 but the search space only sampled three values." +- "Add a `category_boost` parameter to the template since several winning trials suggest category prioritization matters." + +These suggestions are useful — but **operationally inert**. They render as bullet text on the proposal detail page at [`ui/src/app/proposals/[id]/page.tsx`](../../../../ui/src/app/proposals/%5Bid%5D/page.tsx). The operator must: + +1. Read each suggestion. +2. Translate it into a `search_space` JSON manually. +3. Open the create-study wizard. +4. Re-enter cluster, target, template, query set, judgment list, objective. +5. Paste the translated `search_space`. +6. Hit Submit. 
+ +That's the 6-step manual workflow every overnight digest produces *up to 5 times*. The Karpathy-loop equivalent is one click: "Run this followup." The data the LLM has at digest time is rich enough to populate a `CreateStudyRequest` deterministically — the only missing piece is the JSON structure to carry it. + +The audit's framing: this is the **smaller** of the two compounding gaps. [`feat_auto_followup_studies`](../feat_auto_followup_studies/idea.md) handles the *deterministic* compounding (Optuna's TPE narrowed around a winner). This feature handles the *LLM-suggested* compounding (the model spotting patterns Optuna can't, like "the importance signal suggests adding a new dimension"). Both are needed; they cover orthogonal failure modes. + +## Proposed capabilities + +Tiered. Tier A reshapes the LLM output and adds the "Run this followup" button for the **narrow** kind (same template, modified search space). Tier B extends to `swap_template` (different template). Tier C is a stretch goal — allowing the LLM to propose template edits. + +### Tier A — structured followups for "narrow / widen" within the same template + +- **New LLM output schema** in [`backend/workers/digest.py:168-189`](../../../../backend/workers/digest.py): + ```python + { + "narrative": str, + "suggested_followups": [ + { + "kind": "narrow" | "widen" | "text", + "rationale": str, # human-readable, always present + "search_space": SearchSpace | None, # required for narrow/widen, null for text + } + ] + } + ``` +- **Backward-compatible read path.** Old digests (pre-migration) have `suggested_followups: list[str]`. Reader code wraps plain strings as `{kind: "text", rationale: , search_space: null}` so the UI surface handles both shapes uniformly. No backfill required. 
+- **LLM prompt update** in [`prompts/digest_narrative.user.jinja`](../../../../prompts/digest_narrative.user.jinja) + [`prompts/digest_narrative.system.md`](../../../../prompts/digest_narrative.system.md): + - System prompt gains a section explaining the three kinds and when to use each. "Narrow" = the prior search space was too wide and the winner sits in a sub-region; emit the narrower bounds. "Widen" = the winner is at an edge of the prior space (`= low` or `= high`); emit broader bounds. "Text" = a suggestion that requires operator judgment (e.g., "consider adding a new parameter to the template"). + - User template renders the parent study's `search_space` as a structured input the LLM can transform, not just narrative. +- **Validator** at [`backend/app/domain/study/search_space.py`](../../../../backend/app/domain/study/search_space.py): the existing `SearchSpace` Pydantic model already validates structure + cardinality. Use it directly to validate each followup's `search_space` field at digest-persist time; invalid ones get the `kind` downgraded to `"text"` with `rationale = "[validation failed: ] " + original_rationale` so the operator still sees the intent. +- **UI surface** at [`ui/src/app/proposals/[id]/page.tsx`](../../../../ui/src/app/proposals/%5Bid%5D/page.tsx): + - "Narrow"/"Widen" followups render as a card with: rationale text, a collapsed "Show search space" detail (renders the diff vs parent study), and a primary "Run this followup" button. + - "Run this followup" pre-fills the create-study modal with: parent study's cluster/target/template/query_set/judgment_list/objective + the LLM's proposed `search_space` + parent's stop conditions. Operator reviews + submits. + - "Text" followups render unchanged — bullet text. The kind discriminator means freeform suggestions stay supported indefinitely. 
+- **Traceability.** New nullable column `studies.parent_proposal_followup_index: int | None` records "this study was created from followup #N of proposal X." Lets the UI render "Study A.2 was suggested by digest from proposal B at index 3." Helps the team measure whether LLM-suggested followups produce wins. + +### Tier B — `swap_template` followups + +- **Additional `kind: "swap_template"`** carrying `template_id: UUID` + a remapped `search_space`. Lets the LLM say "this query template is a better fit for the observed traffic — try template X." +- **Cross-template search-space remapping.** The hard part: when swapping templates, the prior winner's params don't all map onto the new template's `declared_params`. A new domain helper at [`backend/app/domain/study/template_swap.py`](../../../../backend/app/domain/study/template_swap.py) computes the intersection (common param names) and the disjoint set (new params get default heuristic bounds per [`backend/app/domain/study/search_space_defaults.py`](../../../../backend/app/domain/study/search_space_defaults.py); removed params are dropped). +- **LLM prompt extension** to teach the model when to suggest a swap (typically: parameter-importance distribution is highly skewed, suggesting some params are dead weight; OR several winning trials cluster around a sub-set of params that map cleanly onto a different template's declared params). +- **UI surface:** swap-template followups render with a side-by-side comparison of the two templates' `declared_params` before the operator commits. + +### Tier C (stretch / probably deferred) — template-edit suggestions + +- **`kind: "edit_template"`** carrying a proposed JSON-patch on the parent template's `body_jsonata` (or equivalent). The LLM could suggest "add a `category^2` field-boost to the template body." Today templates are operator-authored only; this would let LLM suggestions flow into template edits with an explicit review step. 
+- **Likely out of scope for MVP1.** Template edits change query rendering semantics — a much larger trust-and-validation surface than search-space narrowing. Captured here to acknowledge the natural extension; feature spec defers. + +### Out of scope for Tier A/B + +- **Auto-running followups without operator click.** That's [`feat_auto_followup_studies`](../feat_auto_followup_studies/idea.md). This feature stays in the "human clicks one button" lane — the LLM proposes; the operator commits. +- **Followups that span multiple studies.** A meta-followup like "run studies A.1 and A.2 in parallel with different starting points" would need its own surface. Not now. +- **Persistence of "I tried this LLM followup and it didn't help."** A negative-result feedback loop into future LLM prompts (so the model learns the operator's preferences) is an interesting MVP4 idea, gated on Langfuse. Out of scope now. + +## Scope signals + +- **Backend:** ~400 LOC. New Pydantic models for the followup discriminated union (~30) + LLM prompt updates (~60 in `.system.md` + `.user.jinja`) + validator at digest-persist time (~40) + new `parent_proposal_followup_index` column + migration (~30) + service-layer "create study from followup" helper (~80) + tests across unit/integration/contract (~150). +- **Frontend:** ~400 LOC. Followup card component with kind discriminator (~150) + "Run this followup" prefill workflow (~80) + search-space diff renderer (~100) + tests (~70). Tier B adds ~200 LOC for the swap-template comparison. +- **Migration:** one Alembic migration adding `studies.parent_proposal_followup_index INT NULL`. Strictly additive. Round-trip-clean. +- **Config:** none. +- **Audit events:** N/A (MVP1). At MVP2: `digest.followup_run_clicked` + `study.created_from_followup` as canonical audit events. +- **Tests:** + - Unit: discriminated-union parsing; validator downgrade on bad `search_space`; old-shape string-array backward compatibility. 
+ - Integration: digest LLM round-trip via stub returns structured followups; "create study from followup" copies the right parent fields. + - Contract: digest response shape includes the new union. + - E2E (Playwright): one happy-path spec — open a proposal with a `narrow` followup, click "Run this followup," confirm the create-study modal pre-populates correctly. + +## Why not inline today + +1. **LLM-contract change.** Reshaping `suggested_followups` from `string[]` to a discriminated union touches the response schema, the prompt files, the validator, the digest worker's parse logic, the storage representation in `digests.followups` JSONB, AND the UI renderer. Multiple coordinated surfaces — outside drive-by budget. +2. **Backward compatibility.** Existing digests in the DB have the old shape. The read path needs an adapter that wraps old strings into the new structure. Small but real — easy to get wrong as a drive-by. +3. **Real UX design surface.** The "Run this followup" workflow is a new top-level user action — how it renders, what it pre-fills, what it shows in the search-space diff are decisions worth scrutinizing in a spec. +4. **Composes with another planned feature.** [`feat_auto_followup_studies`](../feat_auto_followup_studies/idea.md) and this idea cover orthogonal compounding paths (deterministic vs LLM-suggested). Shipping them in coordinated order — auto-followup first to establish the autonomy trust model, then this to add LLM-suggested manual overrides — gives reviewers a coherent story. Either could ship first, but the coordination is worth planning, not improvising. + +## Relationship to other work + +- **Most-leveraged in combination with [`feat_auto_followup_studies`](../feat_auto_followup_studies/idea.md)** — the auto-chain provides the deterministic "narrow around winner" path; this feature adds the LLM-suggested "but consider widening on this axis" path. 
Together they cover what Karpathy's per-experiment agent does (propose a single change per experiment, then evaluate) — Optuna's TPE handles the within-study sampling, and these two features handle the across-study hypothesis evolution. +- **Adjacent to [`feat_pr_metric_confidence`](../feat_pr_metric_confidence/idea.md)** — the confidence framing (CI bands, named regressors) can feed into the followup prompts. "The winner is at +0.13 NDCG with a noise floor of σ=0.02 and 2 regressing queries" is much richer LLM context than "winner is 0.84" for proposing the next experiment. +- **Reuses [`feat_agent_propose_search_space`](../../../00_overview/implemented_features/2026_05_21_feat_agent_propose_search_space/)** (shipped 2026-05-21) — the underlying `search_space_defaults.py` heuristic is the natural fallback when the LLM proposes a `swap_template` followup that has a partial `search_space` (the disjoint params get heuristic bounds). +- **Composes with [`feat_create_study_search_space_builder`](../../../00_overview/implemented_features/2026_05_20_feat_create_study_search_space_builder/)** (shipped 2026-05-20) — the visual editor for `search_space` rows is where "Run this followup" lands, pre-populated. The diff visualization can leverage the same row primitive. +- **Eventually feeds [`feat_agent_propose_search_space`](../../../00_overview/implemented_features/2026_05_21_feat_agent_propose_search_space/)** at the conversational layer — once the LLM digest output is structured, the chat agent can fluently say "based on the digest from your last study, here are 3 followups I recommend; want me to run #2?" rather than today's freeform suggestion-paraphrase. 
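
The Tier A read path and the validator downgrade described above can be sketched roughly as follows. This is a minimal illustration, not the planned implementation: it uses stdlib dataclasses in place of the Pydantic models the idea proposes, `Followup`, `normalize_followup`, and `is_valid_search_space` are hypothetical names, and the `{"low": ..., "high": ...}` bounds shape is an assumption because the real `SearchSpace` model is not shown here. The validation-failure reason is also left out of the downgraded rationale.

```python
from dataclasses import dataclass
from typing import Any, Optional, Union

# Kinds that carry a runnable search space in Tier A.
EXECUTABLE_KINDS = {"narrow", "widen"}


@dataclass(frozen=True)
class Followup:
    kind: str                                      # "narrow" | "widen" | "text"
    rationale: str                                 # human-readable, always present
    search_space: Optional[dict[str, Any]] = None  # None for "text"


def is_valid_search_space(space: Any) -> bool:
    """Hypothetical stand-in for the existing SearchSpace validator.

    Assumes every parameter maps to {"low": number, "high": number}.
    """
    if not isinstance(space, dict) or not space:
        return False
    for bounds in space.values():
        if not isinstance(bounds, dict):
            return False
        low, high = bounds.get("low"), bounds.get("high")
        if not isinstance(low, (int, float)) or not isinstance(high, (int, float)):
            return False
        if low >= high:
            return False
    return True


def normalize_followup(raw: Union[str, dict[str, Any]]) -> Followup:
    """Map either stored shape onto the structured followup."""
    # Old-shape digests stored plain strings: wrap them as "text".
    if isinstance(raw, str):
        return Followup(kind="text", rationale=raw)

    kind = raw.get("kind", "text")
    rationale = str(raw.get("rationale", ""))
    space = raw.get("search_space")

    if kind in EXECUTABLE_KINDS:
        if is_valid_search_space(space):
            return Followup(kind=kind, rationale=rationale, search_space=space)
        # Downgrade so the operator still sees the intent as bullet text.
        return Followup(kind="text", rationale="[validation failed] " + rationale)

    # "text" and any kind this sketch does not know are rendered as text.
    return Followup(kind="text", rationale=rationale)


stored = [
    "Try narrowing title_boost.",
    {"kind": "narrow", "rationale": "Winner sits in a sub-region.",
     "search_space": {"title_boost": {"low": 1.5, "high": 3.0}}},
    {"kind": "widen", "rationale": "Winner at the upper edge.",
     "search_space": {"title_boost": {"low": 3.0, "high": 1.0}}},
]
followups = [normalize_followup(item) for item in stored]
```

Running the adapter on every read, rather than migrating stored rows, is what would let old and new digests share one UI code path without a backfill.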
From 599f4360310e07fccc645ee8722169e59ab0a1c2 Mon Sep 17 00:00:00 2001 From: SoundMindsAI Date: Thu, 21 May 2026 08:41:08 -0400 Subject: [PATCH 02/17] docs: finalize bug_e2e_target_dropdown_flake + advance feat_pr_metric_confidence pipeline - Move bug_e2e_target_dropdown_flake/ from planned_features/ to implemented_features/2026_05_20_bug_e2e_target_dropdown_flake/ (PR-shipped finalization) - feat_pr_metric_confidence: add feature_spec.md, phase2_idea.md, pipeline_status.md and update idea.md as the pipeline advances through spec stage Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/00_overview/DASHBOARD.md | 2 +- docs/00_overview/MVP1_DASHBOARD.md | 7 +- docs/00_overview/dashboard.html | 2 +- .../bug_fix.md | 0 .../idea.md | 0 docs/00_overview/mvp1_dashboard.html | 20 +- .../feat_pr_metric_confidence/feature_spec.md | 751 ++++++++++++++++++ .../feat_pr_metric_confidence/idea.md | 2 +- .../feat_pr_metric_confidence/phase2_idea.md | 94 +++ .../pipeline_status.md | 29 + 10 files changed, 884 insertions(+), 23 deletions(-) rename docs/{02_product/planned_features/bug_e2e_target_dropdown_flake => 00_overview/implemented_features/2026_05_20_bug_e2e_target_dropdown_flake}/bug_fix.md (100%) rename docs/{02_product/planned_features/bug_e2e_target_dropdown_flake => 00_overview/implemented_features/2026_05_20_bug_e2e_target_dropdown_flake}/idea.md (100%) create mode 100644 docs/02_product/planned_features/feat_pr_metric_confidence/feature_spec.md create mode 100644 docs/02_product/planned_features/feat_pr_metric_confidence/phase2_idea.md create mode 100644 docs/02_product/planned_features/feat_pr_metric_confidence/pipeline_status.md diff --git a/docs/00_overview/DASHBOARD.md b/docs/00_overview/DASHBOARD.md index 6e598ca7..180f599f 100644 --- a/docs/00_overview/DASHBOARD.md +++ b/docs/00_overview/DASHBOARD.md @@ -6,7 +6,7 @@ _Top-level index across MVP1 → GA v1+ as of **2026-05-21**. 
Click a release na | Release | Theme | Progress | Status | |---|---|---|---| -| [MVP1 / v0.1](MVP1_DASHBOARD.md) | The Loop | 56 / 57 scoped done · 4 remaining | **In progress** | +| [MVP1 / v0.1](MVP1_DASHBOARD.md) | The Loop | 56 / 57 scoped done · 3 remaining | **In progress** | | [MVP2 / v0.2](MVP2_DASHBOARD.md) | Observable | 1 / 1 scoped done · 1 remaining | **In progress** | | MVP3 / v0.3 | Production Stacks | — | **Not yet scoped** | | MVP4 / v0.4 | Multi-tenant, Multi-LLM | — | **Not yet scoped** | diff --git a/docs/00_overview/MVP1_DASHBOARD.md b/docs/00_overview/MVP1_DASHBOARD.md index fa16a62a..10b702ef 100644 --- a/docs/00_overview/MVP1_DASHBOARD.md +++ b/docs/00_overview/MVP1_DASHBOARD.md @@ -21,8 +21,8 @@ Spec exists; run /pipeline to generate the implementation plan + ship | Metric | Value | |---|---| | Scoped items done | **56 / 57** (98%) — feat_/infra_/chore_/epic_ past idea stage | -| Path to MVP1 | **4** items remaining (features + bugs + chores) | -| Open bugs | 1 | +| Path to MVP1 | **3** items remaining (features + bugs + chores) | +| Open bugs | 0 | | Open chores | 2 (idea-stage debt) | | Backlog ideas | 4 idea-only feat/infra (not yet scoped into MVP1) | | In flight | 0 feature(s) actively shipping | @@ -116,7 +116,7 @@ _None._ |---|---|---|---|---| | [feat_pr_metric_confidence](../02_product/planned_features/feat_pr_metric_confidence/feature_spec.md) | Feature | Approvers reading a study-backed PR see a "## Confidence" section directly between the existing "## Metric delta" and "## Config diff" sections. 
| — | [PR #41](https://github.com/SoundMindsAI/relyloop/pull/41) merged 2026-05-11 | -### Idea (7) +### Idea (6) | Feature | Type | One-liner | Depends on | Status | |---|---|---|---|---| @@ -126,7 +126,6 @@ _None._ | [feat_study_clone_from_previous](../02_product/planned_features/feat_study_clone_from_previous/idea.md) | Feature | A relevance engineer's normal workflow after the first study completes: | — | Idea — surfaced during a UX review of parameter-tuning ergonomics on 2026-05-19. | | [chore_study_default_stop_conditions](../02_product/planned_features/chore_study_default_stop_conditions/idea.md) | Chore | The server-side `StudyConfigSpec` validator at [`backend/app/api/v1/schemas.py:572-580`](../../backend/app/api/v1/schemas.py) correctly **requires** at least one of `max_trials` or `time_budget_min` — | — | Idea — surfaced during the 2026-05-21 Karpathy-loop audit of the Studies workflow. | | [chore_template_library_expansion](../02_product/planned_features/chore_template_library_expansion/idea.md) | Chore | Three connected gaps: | — | Idea — surfaced during a UX review of parameter-tuning ergonomics on 2026-05-19. | -| [bug_e2e_target_dropdown_flake](../02_product/planned_features/bug_e2e_target_dropdown_flake/idea.md) | Bug | The skipped test seeds two ES indices via Playwright's `request.put` (Node), opens the create-study modal, picks the seeded cluster via the cluster ``… | — | Idea — surfaced during `feat_create_study_target_autocomplete` Story F2 implementation; the new E2E happy-path spec is currently `test.skip`'d. | ## Dependency graph diff --git a/docs/00_overview/dashboard.html b/docs/00_overview/dashboard.html index f163b2c6..1d517edb 100644 --- a/docs/00_overview/dashboard.html +++ b/docs/00_overview/dashboard.html @@ -371,7 +371,7 @@

Releases

The Loop
-
56 / 57 scoped done · 4 remaining
+
56 / 57 scoped done · 3 remaining
In progress
diff --git a/docs/02_product/planned_features/bug_e2e_target_dropdown_flake/bug_fix.md b/docs/00_overview/implemented_features/2026_05_20_bug_e2e_target_dropdown_flake/bug_fix.md similarity index 100% rename from docs/02_product/planned_features/bug_e2e_target_dropdown_flake/bug_fix.md rename to docs/00_overview/implemented_features/2026_05_20_bug_e2e_target_dropdown_flake/bug_fix.md diff --git a/docs/02_product/planned_features/bug_e2e_target_dropdown_flake/idea.md b/docs/00_overview/implemented_features/2026_05_20_bug_e2e_target_dropdown_flake/idea.md similarity index 100% rename from docs/02_product/planned_features/bug_e2e_target_dropdown_flake/idea.md rename to docs/00_overview/implemented_features/2026_05_20_bug_e2e_target_dropdown_flake/idea.md diff --git a/docs/00_overview/mvp1_dashboard.html b/docs/00_overview/mvp1_dashboard.html index 4c91e601..860f5984 100644 --- a/docs/00_overview/mvp1_dashboard.html +++ b/docs/00_overview/mvp1_dashboard.html @@ -390,12 +390,12 @@

MVP1 Progress

Path to MVP1
-
4
+
3
items left = features + bugs + chores
-
+
Open bugs
-
1
+
0
tracked bug_* idea files
@@ -428,7 +428,7 @@

Pipeline

-

Idea 7

+

Idea 6

@@ -499,18 +499,6 @@

Idea 7

Three connected gaps:
-
- - -
- -
- Bug - -
-
The skipped test seeds two ES indices via Playwright's `request.put` (Node), opens the create-study modal, picks the seeded cluster via the cluster `<EntitySelect>`…
- -
diff --git a/docs/02_product/planned_features/feat_pr_metric_confidence/feature_spec.md b/docs/02_product/planned_features/feat_pr_metric_confidence/feature_spec.md new file mode 100644 index 00000000..980a1f12 --- /dev/null +++ b/docs/02_product/planned_features/feat_pr_metric_confidence/feature_spec.md @@ -0,0 +1,751 @@ +# Feature Specification — PR Metric Confidence + +**Date:** 2026-05-21 +**Status:** Draft (pending GPT-5.5 cross-model review) +**Owners:** soundminds.ai +**Related docs:** +- Input brief: [`idea.md`](idea.md) +- Sibling shipped: [`feat_digest_proposal`](../../../00_overview/implemented_features/2026_05_11_feat_digest_proposal/feature_spec.md), [`feat_github_pr_worker`](../../../00_overview/implemented_features/2026_05_12_feat_github_pr_worker/feature_spec.md), [`feat_studies_ui`](../../../00_overview/implemented_features/2026_05_12_feat_studies_ui/feature_spec.md), [`feat_llm_judgments`](../../../00_overview/implemented_features/2026_05_11_feat_llm_judgments/feature_spec.md) +- Architecture: [`docs/01_architecture/api-conventions.md`](../../../01_architecture/api-conventions.md), [`docs/01_architecture/data-model.md`](../../../01_architecture/data-model.md), [`docs/01_architecture/optimization.md`](../../../01_architecture/optimization.md), [`docs/01_architecture/ui-architecture.md`](../../../01_architecture/ui-architecture.md), [`docs/01_architecture/llm-orchestration.md`](../../../01_architecture/llm-orchestration.md) + +--- + +## 1) Purpose + +- **Problem:** RelyLoop's value-delivery surface is the Pull Request opened against the operator's central search-config repo. Today the PR body carries two scalar point estimates (`baseline → achieved` for the primary metric) — no confidence band, no per-query breakdown, no runner-up gap, no convergence signal. An approver opening the PR cannot tell whether a +0.13 NDCG lift is a robust plateau (10 trials within 0.005 of the winner) or a sharp peak that 50% probability won't reproduce. 
The per-query metric data that `pytrec_eval` already computes inside `score()` is dropped on the floor at the persistence boundary — every trial, on every study. +- **Outcome:** Approvers reading a study-backed PR see a "## Confidence" section directly between the existing "## Metric delta" and "## Config diff" sections. The section carries: (a) a bootstrap-based 95% CI on the headline metric, (b) a per-query histogram (improved / unchanged / regressed counts), (c) up to 5 named regressor queries with their `query_text` and metric deltas, (d) a runner-up gap classification (robust plateau vs sharp peak), (e) a late-trial noise floor (1σ over the last 20% of complete trials), and (f) a convergence call-out (early-and-held / late-rising / noisy). The same data renders as a `` on `/studies/[id]` so operators inspect confidence before opening the PR. The digest narrative LLM prompt gains structured `` and `` XML blocks so the narrative opens with a confidence-framed sentence ("NDCG@10 +0.13 (95% CI 0.78–0.89, N=20 queries) — robust plateau, 2 named regressors"). +- **Non-goal:** This feature does NOT add a baseline-trial run to the orchestrator (the `studies.baseline_metric` column exists but is never populated by current production code — see §2 audit). Per-query delta uses **runner-up #2** as the comparison reference in MVP1; a separate deferred Phase 2 adds true baseline-trial computation. This feature also does NOT add holdout-set discipline, Wilcoxon paired tests, or multiple-comparison correction — those are explicit out-of-scope per §3 and the input brief. 
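
To make signals (a) through (c) concrete, the sketch below computes a percentile-bootstrap 95% CI on the mean of per-query scores and classifies per-query outcomes against a reference trial (runner-up #2 in MVP1). It is an illustration under stated assumptions, not the spec's `confidence.py`: `bootstrap_ci` and `classify_outcomes` are hypothetical names, the inputs are flattened to `{qid: score}` for a single metric, and the `epsilon` threshold for "unchanged" and the resample count are placeholder values.

```python
import random
from statistics import mean


def bootstrap_ci(per_query, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI on the mean of per-query scores.

    Returns None instead of raising when there are too few queries.
    """
    scores = list(per_query.values())
    if len(scores) < 2:
        return None
    rng = random.Random(seed)  # seeded so repeated reads give the same band
    means = sorted(
        mean(rng.choices(scores, k=len(scores))) for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


def classify_outcomes(winner, reference, epsilon=0.005, top_n=5):
    """Count improved / unchanged / regressed queries and list the worst regressors.

    Only queries present in both trials are compared; epsilon is an assumed
    dead band, not a value taken from the spec.
    """
    shared = sorted(set(winner) & set(reference))
    deltas = {qid: winner[qid] - reference[qid] for qid in shared}
    counts = {"improved": 0, "unchanged": 0, "regressed": 0}
    for delta in deltas.values():
        if delta > epsilon:
            counts["improved"] += 1
        elif delta < -epsilon:
            counts["regressed"] += 1
        else:
            counts["unchanged"] += 1
    # Most negative delta first, capped at top_n named regressors.
    regressors = sorted(
        ((qid, d) for qid, d in deltas.items() if d < -epsilon),
        key=lambda pair: pair[1],
    )[:top_n]
    return counts, regressors


winner = {"q1": 0.9, "q2": 0.5, "q3": 0.7}
runner_up = {"q1": 0.7, "q2": 0.8, "q3": 0.7}

ci = bootstrap_ci(winner)
counts, regressors = classify_outcomes(winner, runner_up)
```

With only a handful of queries the band is wide, which is the point: the approver sees how little the headline number is supported rather than a bare point estimate.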
+ +## 2) Current state audit + +### Existing implementations + +| File / component | What it does | Differences from expectation | +|---|---|---| +| [`backend/workers/git_pr.py:488-528`](../../../../backend/workers/git_pr.py#L488) `_render_pr_body_study_backed` | Renders the PR body markdown — sections: `## Metric delta`, `## Config diff`, `## Suggested follow-ups`, `## Parameter importance` | Verified against codebase. No `## Confidence` section exists today. The new section inserts between `## Metric delta` and `## Config diff`. | +| [`backend/app/eval/scoring.py:153-194`](../../../../backend/app/eval/scoring.py#L153) `score()` | Returns `ScoreResult` typed as `{"aggregate": {metric: mean_value}, "per_query": {qid: {metric_name: float}}}` | Verified. The `per_query` dict is computed by `pytrec_eval.RelevanceEvaluator.evaluate()` and reduced to the mean for `aggregate`. | +| [`backend/workers/trials.py:433-446`](../../../../backend/workers/trials.py#L433) `run_trial` worker | Persists `trials.metrics = scored["aggregate"]` via `repo.create_trial(...)` after pytrec_eval runs | Verified. `scored["per_query"]` is discarded at this line — never persisted to the database. This is the central gap the feature closes. | +| [`backend/app/db/models/trial.py`](../../../../backend/app/db/models/trial.py) `Trial` ORM | Columns: `id`, `study_id` (FK CASCADE), `optuna_trial_number`, `params` JSONB, `primary_metric` Float (denormalized for `(study_id, primary_metric DESC NULLS LAST)` index), `metrics` JSONB (not-null), `duration_ms`, `status` (CHECK `complete\|failed\|pruned`), `error`, `started_at`, `ended_at` | Verified. Adding `per_query_metrics JSONB NULL` is the only schema change required for Tier B. 
| +| [`backend/app/db/models/study.py:76`](../../../../backend/app/db/models/study.py#L76) `studies.baseline_metric` | Float, nullable, **never written in production code** — the docstring says "populated by the orchestrator (Phase 2)" but `grep -rn "baseline_metric *=" backend/workers/ backend/app/services/` confirms zero write sites; only test fixtures and the digest worker's read path reference it | **Material finding.** Phase 2's orchestrator work (running a non-Optuna baseline trial) was deferred and never landed. `study.baseline_metric` is always `None` in production. The MVP1 PR body shows `baseline=None → achieved=X` with no `delta_pct`. This means **per-query "regression vs baseline" cannot be computed without first implementing baseline-trial persistence**. The spec resolves this by comparing winner vs **runner-up #2 per-query**, not winner vs baseline. Per-query baseline comparison is deferred to a Phase 2 follow-up tracked in [`phase2_idea.md`](phase2_idea.md). | +| [`backend/workers/digest.py:296-308`](../../../../backend/workers/digest.py#L296) `_compute_metric_delta` | Builds `{primary_metric_key: {baseline, achieved, delta_pct}}` for the proposal's `metric_delta` JSONB column; reads `study.baseline_metric` (always None) and `study.best_metric` | Verified. In MVP1 the wire shape is `{"baseline": null, "achieved": 0.84, "delta_pct": null}`. PR body renders this as `ndcg@10: None → 0.84` — an existing UX gap. This feature improves the situation by adding the confidence band (computed without a baseline). True baseline numbers are unlocked by Phase 2. | +| [`backend/app/db/models/digest.py`](../../../../backend/app/db/models/digest.py) `Digest` ORM | Columns: `id`, `study_id` (UNIQUE FK), `narrative` Text, `parameter_importance` JSONB, `recommended_config` JSONB, `suggested_followups` ARRAY(Text), `generated_by`, `generated_at` | Verified. 
No schema change to `digests` — the new `confidence` data lives on `StudyDetail` (computed at-read-time from trials), NOT on `digests`. Rationale: the per-query data on trials is the source of truth; recomputing on read is O(N_trials) which is sub-millisecond for MVP1 study sizes (≤1000 trials × ≤100 queries = 100K floats = <500KB JSONB). | +| [`prompts/digest_narrative.user.jinja`](../../../../prompts/digest_narrative.user.jinja) | Jinja2 template with XML blocks ``, ``, ``, ``, ``, ``, `` | Verified. New blocks `` and `` slot in after `` (before ``). | +| [`prompts/digest_narrative.system.md`](../../../../prompts/digest_narrative.system.md) | System prompt; line 29-30 says "`narrative` — a markdown string (~200–600 words). Open with the headline metric delta. Then explain *why* …" | Verified. The "Open with the headline metric delta" sentence is mid-bullet inside the `narrative` field instruction; the spec edits it precisely (see FR-6). | +| [`backend/app/api/v1/schemas.py:613-637`](../../../../backend/app/api/v1/schemas.py#L613) `StudyDetail` Pydantic | 17 fields including `baseline_metric`, `best_metric`, `best_trial_id`, `trials_summary` (TrialsSummaryShape) | Verified. The spec adds one optional field: `confidence: ConfidenceShape \| None`. Old clients ignoring the field continue to work — Pydantic on the wire is permissive. | +| [`backend/app/api/v1/studies.py:118-142`](../../../../backend/app/api/v1/studies.py#L118) `_detail()` | Builds `StudyDetail` from a `Study` ORM row | Verified. Spec adds a call to a new domain helper `compute_study_confidence(db, study)` that returns `ConfidenceShape \| None` and is invoked from `_detail()`. | +| [`ui/src/app/studies/[id]/page.tsx`](../../../../ui/src/app/studies/[id]/page.tsx) | 114 lines — header card + trials table; no digest/confidence panels rendered | Verified clean canvas. The new `` mounts between the header card and the trials table (above the digest panel when it renders for completed studies). 
| + +### Navigation and link impact + +| Source file | Current link target | New link target | +|---|---|---| +| (none) | (no URLs change) | — | + +Confidence data lives within the existing `/studies/[id]` route — no new pages, no redirects, no removed routes. + +### Existing test impact + +| Test file | Pattern | Count | Required change | +|---|---|---|---| +| `backend/tests/integration/test_studies_api.py` | Asserts on `StudyDetail` shape from `GET /api/v1/studies/{id}` | ≥1 | Add assertion that `confidence` key is present (may be `null`). Existing assertions on other fields remain unchanged. | +| `backend/tests/contract/test_studies_openapi.py` (or equivalent OpenAPI shape lock) | Asserts the OpenAPI schema for the `StudyDetail` response | 1 | Verify the new `confidence` field appears in the schema with the correct type. | +| `backend/tests/integration/test_digest_zero_trials.py` + `test_digest_zero_trials_with_openai_unconfigured.py` | Digest worker with `best_metric=None` | 2 | No code change required; assert that `confidence` is `None` on the resulting StudyDetail (degraded path). | +| `backend/tests/unit/workers/test_digest_prompt_render.py` | Renders `digest_narrative.user.jinja` with sample inputs | 1 | Extend the fixture with `confidence` + `per_query_outcomes` keys. Add a new test that the rendered prompt contains `` block when data is non-None and omits it when None. | +| `backend/tests/integration/test_proposals_api.py` (if exists for `_render_pr_body_study_backed`) | PR body output | TBD | Add cases that assert `## Confidence` section presence/absence + named regressor inclusion. | +| `ui/src/__tests__/components/studies/` | Existing study-detail component tests | TBD | Add new `confidence-panel.test.tsx` (component test against TanStack `useStudy` mock returning `confidence` payload). 
| +| `ui/tests/e2e/studies.spec.ts` | Existing real-backend Playwright spec | 1 | Add 1-2 assertions that the ConfidencePanel renders when a seeded completed study has `per_query_metrics` populated. | + +Total: ~6 existing test files modified, 2 new test files added. + +### Existing behaviors affected by scope change + +- **PR body section ordering.** Current: `Metric delta → Config diff → Suggested follow-ups → Parameter importance`. New: `Metric delta → **Confidence** → Config diff → Suggested follow-ups → Parameter importance`. Decision needed: **no** — insertion is additive, ordering is unambiguous. +- **`trials.metrics` JSONB shape.** Currently `{ndcg@10: 0.84, map: 0.62, ...}` (aggregate). The new column `trials.per_query_metrics` is a sibling, NOT a replacement. No change to existing column's shape. Decision needed: **no**. +- **Digest narrative LLM prompt.** Current opening guidance: "Open with the headline metric delta." New: "Open with the headline metric delta + a one-sentence confidence framing (CI, per-query outcome counts, worst-regressed query name when any)." Decision needed: **no** — spec locks the exact replacement string in FR-6. +- **Run-trial worker latency.** Adding `per_query_metrics` write is a single dict copy (≤100 queries × 5 metrics = ~500 floats) — sub-millisecond per trial. No measurable impact. Decision needed: **no**. +- **`StudyDetail` API response size.** Adds an optional `confidence` field — when present, +~2-5 KB for a typical 20-query study. No change to the list endpoint's `StudySummary`. Decision needed: **no**. + +--- + +## 3) Scope + +### In scope + +- **One Alembic migration** `0015_trials_per_query_metrics`: adds `trials.per_query_metrics JSONB NULL`. Reversible downgrade. No backfill — old trials stay `NULL` and analytics gracefully no-show their per-query surfaces. +- **One-line worker change** in `backend/workers/trials.py` to persist `per_query_metrics=scored["per_query"]` alongside `metrics=scored["aggregate"]`. 
+- **New domain module** `backend/app/domain/study/confidence.py` with pure-Python helpers (bootstrap CI, runner-up gap classification, late-trial noise floor, convergence regime detection, per-query outcome classification, top regressors). +- **New Pydantic shape** `ConfidenceShape` exposed as an optional field on `StudyDetail` (`backend/app/api/v1/schemas.py`). +- **Read-side enrichment** in `backend/app/api/v1/studies.py::_detail()` that invokes `compute_study_confidence(db, study)` and attaches the result to the response. +- **PR-body section** `## Confidence` inserted into `_render_pr_body_study_backed()` in `backend/workers/git_pr.py`. Section renders whenever `confidence is not None` (i.e., the winner trial row exists). Each sub-block (CI line, per-query block, regressor list, runner-up gap, late-trial 1σ, convergence) is gated on its specific sub-field being non-null — so old studies (winner has `per_query_metrics IS NULL`) get the section with the aggregate signals but no per-query content; running studies (`best_trial_id IS NULL`) get no section at all. +- **Digest narrative prompt update**: `prompts/digest_narrative.user.jinja` gains `` + `` XML blocks; `prompts/digest_narrative.system.md` opening guidance edited per FR-6. +- **Study-detail UI**: new `` React component at `ui/src/components/studies/confidence-panel.tsx`, mounted from `ui/src/app/studies/[id]/page.tsx` between the header card and the trials table. +- **Test coverage** at unit (domain helpers), integration (DB-backed StudyDetail enrichment + digest worker prompt rendering), contract (OpenAPI shape lock for `ConfidenceShape`), and E2E (Playwright real-backend renders the panel) layers. + +### Out of scope + +- **Baseline-trial computation in the orchestrator.** Implementing the deferred Phase 2 work (run a single non-Optuna trial before Optuna starts, persist its per-query metrics on the study row, populate `study.baseline_metric` and `study.baseline_trial_id`). 
Tracked in [`phase2_idea.md`](phase2_idea.md). This feature treats the comparison reference as **runner-up #2 per-query** in MVP1; Phase 2 swaps in baseline comparison. +- **Holdout-set discipline (80/20 split).** Per input brief §"Out of scope for v1": MVP1 judgment-set sizes are too small for a meaningful split, and enterprise relevance engineers often optimize on a curated set. Revisit at MVP4 when multi-tenant judgments routinely exceed 100 queries. +- **Wilcoxon signed-rank paired test.** Theoretically correct, but for typical 10–20 query studies the test rarely reaches significance even when the lift is real. Defer until operators ask for it. +- **Multiple-comparison correction across the 1000-trial budget.** Most-correct statistical concern but the hardest to surface for non-statisticians. Defer; revisit if approver feedback flags inflated metrics. +- **Sparkline / chart rendering for convergence trajectory.** Phase 1 renders convergence as a textual call-out only ("Best metric found at trial 387 of 1000; held thereafter"). A future enhancement can add a Recharts line chart. +- **Confidence on rejected proposals.** The "## Confidence" section appears in study-backed proposal PR bodies only. Manually-authored proposals (`proposal.study_id IS NULL`) skip the section. +- **Per-query metrics for old studies.** No backfill — `trials.per_query_metrics IS NULL` for trials predating the migration. For those studies the `confidence` field on `StudyDetail` degrades to a partial shape: the aggregate signals still compute, while `ci_95` and `per_query_outcomes` are `null` (see FR-7). + +### API convention check + +Verified against [`docs/01_architecture/api-conventions.md`](../../../01_architecture/api-conventions.md): + +- **Endpoint prefix convention:** `/api/v1/`. No new endpoints — this feature extends the existing `GET /api/v1/studies/{id}` response shape with an optional `confidence` field. 
+- **Router namespace:** [`backend/app/api/v1/studies.py`](../../../../backend/app/api/v1/studies.py) — existing router file gets one new domain-helper call inside `_detail()`. +- **HTTP methods for CRUD:** N/A (this feature is read-only at the API layer; the write side is the `run_trial` worker which is not a router-mediated path). +- **Non-auth error envelope shape:** `{ "detail": { "error_code": "<CODE>", "message": "<text>", "retryable": <bool> } }` per `api-conventions.md`. No new error codes — `compute_study_confidence` returns `None` (or a partial shape, per FR-7) on degraded paths rather than raising. +- **Auth error shape:** N/A in MVP1–MVP3 (single-tenant, no auth surface). + +### Phase boundaries + +- **Phase 1 (this spec — MVP1, ships immediately):** Per-query persistence + read-side enrichment + PR body section + digest narrative prompt + ConfidencePanel. Comparison reference for regressors is **runner-up #2 per-query**. +- **Phase 2 (deferred — tracked in [`phase2_idea.md`](phase2_idea.md)):** Orchestrator runs a non-Optuna baseline trial first; `studies.baseline_trial_id` (new column, denormalized FK) points to that trial; per-query regressor comparison switches to **baseline** when available, with `runner_up` as the fallback when `studies.baseline_trial_id IS NULL`. Phase 2 is purely additive on top of Phase 1 — no migration to undo, no API contract break. + +## 4) Product principles and constraints + +- **Approver trust is the value-delivery surface.** Every signal in the "## Confidence" section must answer a concrete approver question ("is this fragile?", "is the lift bigger than noise?", "does it break specific queries?"). Cosmetic statistics with no actionable meaning are out. +- **Graceful degradation over hard failure.** Any missing input (study has no trials, no per_query_metrics, no runner-up, no completed trials in the late-trial window) produces a null-or-suppressed surface, NEVER an error envelope or LLM degraded-mode trigger. 
The API ships an optional `confidence` field so old clients keep working. +- **Source of truth is the trial row.** `trials.per_query_metrics` is computed deterministically from `pytrec_eval` and never overwritten. Analytics recompute on read. +- **No `digests` schema change.** The digest worker is downstream of the trials data; per-query analytics belong on the trial side. The digest narrative reads pre-computed confidence and injects it into the LLM prompt. +- **Floating LLM model names forbidden** (CLAUDE.md Absolute Rule #8). The digest worker already reads `Settings.openai_model` for the narrative LLM call — no new LLM call in this feature. +- **Conventional Commits.** All commits on this branch follow the `feat(pr-metric-confidence): ...` / `infra(migrate): ...` etc. format (CLAUDE.md Absolute Rule #7). +- **`/healthz` performance budget.** N/A — this feature touches no health probes. +- **scipy + numpy availability.** Verified — `.venv/lib/python3.13/site-packages/{scipy,numpy}` are installed via pytrec_eval transitive dep. Bootstrap CI uses `numpy.random.default_rng(...).choice` + `numpy.percentile`, as locked in FR-4 (no scipy.stats dependency, to keep the lift surface minimal and avoid optional-dep import-time risk). + +### Anti-patterns + +- **Do not** compute `confidence` from the `digests.parameter_importance` field — that's Optuna's parameter-importance, not metric-confidence. The two are unrelated; mixing them produces nonsense. +- **Do not** add a `digests.confidence` JSONB column. The trial row carries the canonical per-query data; recomputing on `StudyDetail` read is correct because (a) study sizes are bounded (≤1000 trials), (b) it keeps a single source of truth, and (c) it avoids a migration on `digests` and a write-path retrofit. +- **Do not** cache the confidence computation at the API layer. The data is small (~500KB worst case) and the study-detail endpoint is not on a hot path; caching adds complexity without measurable benefit. 
+- **Do not** compare winner against the simple mean of all trials per-query (would dilute the regressor signal when the winner is a clear leader). Use **runner-up #2 per-query** explicitly — see FR-3 for rationale. +- **Do not** use scipy.stats.bootstrap. Numpy-only bootstrap is ~5 lines, faster for our sample sizes, and avoids dragging scipy's optional bias-correction machinery into the runtime path. Numpy is already a transitive dep via pytrec_eval; scipy is not required. +- **Do not** add multi-tenant filtering. RelyLoop is single-tenant through MVP3 (CLAUDE.md activates-at-MVP4 rule). +- **Do not** persist confidence to `digests` even partially (e.g., "just the regressor names"). The digest worker can read the per-query data from trials at digest-generation time; persisting it twice introduces drift. +- **Do not** wrap the bootstrap loop in `try`/`except Exception:` — if numpy raises, the underlying data is wrong and we want the trace to surface. Suppress to `None` only on documented degraded paths (N(queries) < 5; no per_query_metrics; no winner trial). +- **Do not** raise an error code if `compute_study_confidence` returns `None`. The `null` value on the wire is the contract. + +## 5) Assumptions and dependencies + +- **Dependency: `feat_digest_proposal`** (PR #41, merged 2026-05-11). Required for the digest narrative prompt update path. Status: implemented. Risk if missing: feature still ships (the PR body and StudyDetail enrichment are independent of the digest), but the LLM narrative wouldn't pick up the confidence framing. +- **Dependency: `feat_github_pr_worker`** (PR #45, merged 2026-05-12). Required for `_render_pr_body_study_backed` to exist as the modification point. Status: implemented. Risk if missing: N/A — already shipped. +- **Dependency: `feat_studies_ui`** (PR #50, merged 2026-05-12). Required for `/studies/[id]` route to host the `<ConfidencePanel>`. Status: implemented. Risk if missing: N/A. +- **Dependency: `feat_llm_judgments`** (PR #35, merged 2026-05-11). 
Provides the judgment data that `pytrec_eval` consumes; without per-query judgments there's no per-query metrics. Status: implemented. Risk if missing: per_query_metrics would be empty dicts (still safe — analytics gracefully no-show). +- **Dependency: numpy 1.x** (transitive via `pytrec-eval>=0.5` in `pyproject.toml`). Required for bootstrap resampling. Verified present at `.venv/lib/python3.13/site-packages/numpy/__init__.py`. Risk if missing: feature can't ship — but numpy is already required by pytrec_eval which already ships. +- **Dependency: Alembic head at `0014_clusters_target_filter`** (current as of 2026-05-21). Required so the new `0015_trials_per_query_metrics` migration applies cleanly. Status: confirmed via [`state.md`](../../../../state.md). Risk if missing: migration round-trip fails — but `0014` is already merged. +- **No new external services.** No new OpenAI calls (the existing digest worker's LLM call is the only LLM hop). No new GitHub API calls. No new ES/OS adapter calls. + +## 6) Actors and roles + +- **Primary actor:** Relevance Engineer (per umbrella spec §6). Reads the ConfidencePanel before opening a PR; reads the "## Confidence" section in the PR body to decide whether to merge. +- **Secondary actor:** Approver (subset of relevance engineers with merge rights on the central search-config repo). Reads the "## Confidence" section in the PR body as the primary input to their merge decision. +- **Tertiary actor:** Viewer (PMs, exec stakeholders, read-only). Sees the ConfidencePanel on /studies/[id] but does not act on it directly in MVP1. +- **Role model:** N/A — RelyLoop MVP1 is single-tenant + no auth (per [`docs/01_architecture/tech-stack.md` §"Canonical release matrix"](../../../01_architecture/tech-stack.md)). +- **Permission boundaries:** No tool-side enforcement in MVP1. Approval is delegated to the operator's central search-config repo's branch protection rules. 
+ +### Authorization + +N/A — single-tenant install, no auth surface (MVP1). + +### Audit events + +N/A — `audit_log` lands at MVP2 per [`docs/01_architecture/data-model.md` §"Forthcoming: audit_log"](../../../01_architecture/data-model.md). This feature touches no tenant-visible state mutations beyond the existing `run_trial` worker INSERT (which is internal-system, never tenant-direct). At MVP2 when `audit_log` ships, the `run_trial` worker's `INSERT INTO trials` already gets covered by whatever audit-log integration the MVP2 epic adds for the workers layer — no per-feature instrumentation needed here. + +## 7) Functional requirements + +### FR-1: Persist per-query metrics on every successful trial + +- Requirement: + - The system **MUST** add a nullable JSONB column `trials.per_query_metrics` via Alembic migration `0015_trials_per_query_metrics`. The migration has a working `downgrade()` and round-trips cleanly (`alembic upgrade head && alembic downgrade -1 && alembic upgrade head`). + - The `run_trial` worker at [`backend/workers/trials.py:433-446`](../../../../backend/workers/trials.py#L433) **MUST** persist `per_query_metrics=scored["per_query"]` as part of `repo.create_trial(...)` whenever `status='complete'`. + - The worker **MUST NOT** write `per_query_metrics` on the `status='failed'` path at [`trials.py:500`](../../../../backend/workers/trials.py#L500) (where `metrics={}` is the current behavior) — `per_query_metrics` stays `NULL` for failed trials. + - Old trials predating the migration **MUST** retain `per_query_metrics IS NULL` (no backfill). +- Notes: The shape mirrors `ScoreResult.per_query` from [`scoring.py:47`](../../../../backend/app/eval/scoring.py#L47) — `{qid: {metric_name: float}}`. 
`metric_name` keys are the **user-facing names produced by `score()`'s wire→user remap loop at [`scoring.py:180-185`](../../../../backend/app/eval/scoring.py#L180): exactly `ndcg`, `map`, `precision`, `recall`, `mrr`** (NOT the pytrec_eval wire-name forms like `ndcg_cut.10`, NOT the abbreviated `p`). The threshold-table lookup in FR-4a uses these same five names so wire/threshold/UI/test keys are byte-equal. + +### FR-2: Compute confidence signals from persisted trial data + +- Requirement: + - The system **MUST** provide an async domain function `async def compute_study_confidence(db: AsyncSession, study: Study) -> ConfidenceShape | None` in `backend/app/domain/study/confidence.py`. Callers (`_detail()`, digest worker, PR worker — see FR-5d) MUST `await` the call. + - The function **MUST** return `None` when any of: (a) `study.best_trial_id IS NULL`, (b) the winner trial **row** is missing (cascade-delete race per [`digest.py:615-625`](../../../../backend/workers/digest.py#L615)), (c) `len(complete_trials) < 1`. NOTE: A winner row that exists but has `per_query_metrics IS NULL` does NOT trigger whole-object `None` — instead the function returns a **partial** `ConfidenceShape` per FR-7 (aggregate signals like `runner_up_gap`, `late_trial_stddev`, and `convergence` still compute from `primary_metric` + `optuna_trial_number`). + - The function **MUST** fetch data via four small queries (NOT by loading all trials with `per_query_metrics` into memory): + 1. The **winner trial row** (`SELECT ... FROM trials WHERE id = study.best_trial_id`) — includes its `per_query_metrics`. + 2. The **runner-up trial row** (`SELECT ... FROM trials WHERE study_id = ? AND status = 'complete' AND id != ? ORDER BY primary_metric DESC NULLS LAST LIMIT 1` — uses the existing `trials_study_metric` index) — includes its `per_query_metrics`. + 3. A **trial summary list** (`SELECT primary_metric, optuna_trial_number FROM trials WHERE study_id = ? 
AND status = 'complete' ORDER BY optuna_trial_number ASC`) — projects ONLY the two columns needed for `runner_up_gap.classification`, `late_trial_stddev`, and `convergence`. No `per_query_metrics` payload in this query. + 4. A **regressor query-text fetch** (`SELECT id, query_text FROM queries WHERE id = ANY(:regressor_qids)`) — issued AFTER step 1+2 have produced the candidate regressor `query_id` list (at most 5 ids). Skipped entirely when no regressors are produced. + - Total wire load is bounded at ~30KB regardless of trial count (two `per_query_metrics` rows ≤ ~5KB each, the summary list at N×(8+8) bytes ≈ 16KB for 1000 trials, plus ≤ 5 query rows). + - When data is sufficient, the function **MUST** populate every sub-field independently; partial population is the contract. Specifically: + - `headline` is always populated when `study.best_metric IS NOT NULL`. + - `ci_95` is populated when the winner has ≥5 per-query datapoints; suppressed (`None`) below 5. + - `runner_up_gap` is populated when there are ≥2 complete trials; suppressed below 2. + - `late_trial_stddev` is populated when `len(complete_trials) ≥ 10`; suppressed below 10 (sample too small). + - `convergence` is populated when there are ≥3 complete trials; suppressed below 3. + - `per_query_outcomes` is populated when there are ≥2 complete trials AND BOTH the winner trial AND the runner-up #2 trial (the comparison reference locked in FR-3) have `per_query_metrics IS NOT NULL`; suppressed when any of those three conditions fails. This is stricter than "any other complete trial" — the comparison MUST be against the runner-up specifically, per FR-3 + the four-query data-loading contract above. +- Notes: Computation is O(N_queries) for the per-query analytics (only the winner and runner-up rows carry `per_query_metrics`); O(N_trials) for the summary-list pass (two-column projection only). No caching; results recomputed on every `GET /api/v1/studies/{id}` call. 
The four-query read pattern above is the canonical implementation contract — the function is forbidden from issuing a "give me every complete trial's per_query_metrics" query. + +### FR-3: Define winner-vs-comparison reference for per-query deltas + +- Requirement: + - In MVP1 (this spec, Phase 1), the comparison reference for per-query deltas **MUST** be the **runner-up #2 trial** — defined as the complete trial with the second-highest `primary_metric` (sorted descending, `NULLS LAST`). + - When fewer than 2 complete trials exist OR the runner-up has `per_query_metrics IS NULL`, the per-query comparison **MUST** be suppressed (`per_query_outcomes = None`). The aggregate `runner_up_gap` still computes if the runner-up has a `primary_metric` value (see FR-7) — only the per-query side is suppressed. + - The wire shape **MUST** include a `comparison_against` field whose value **in MVP1 Phase 1 is unconditionally `"runner_up"`**. The `"baseline"` value is reserved for Phase 2 and MUST NOT be emitted by Phase 1 code (no conditional `if study.baseline_trial_id ...` branching in Phase 1 — that column doesn't exist). +- Notes: Runner-up is chosen over "mean of all trials" because the latter dilutes the regressor signal when the winner is a clear leader (the winner is far ahead per-query on most queries, so deltas vs mean are universally positive and obscure the actual regressors). Comparing against runner-up #2 surfaces only queries where the winner sacrifices accuracy that some other tried config achieved. Phase 2 (tracked in [`phase2_idea.md`](phase2_idea.md)) adds `studies.baseline_trial_id` AND switches the conditional in `compute_study_confidence` to emit `"baseline"` when that column is non-null — both changes in one Phase 2 migration + code update. + +### FR-4: Lock thresholds and methods for each confidence signal + +- Requirement: + - **Bootstrap CI:** percentile method, N=1000 resamples, 95% interval. 
Implemented with `numpy.random.default_rng(seed=42).choice(per_query_values, size=(1000, len(per_query_values)), replace=True).mean(axis=1)` → `numpy.percentile(means, [2.5, 97.5])`. The seed is fixed for determinism (an approver re-reading the PR sees identical numbers; reproducibility wins over per-call randomness for this surface). + - **Runner-up gap classification:** "robust_plateau" when the top-`min(10, num_complete_trials)` trials by `primary_metric` are all within 0.005 of the winner; "sharp_peak" otherwise. The 0.005 threshold is locked. When `num_complete_trials < 2` the classification is suppressed (`runner_up_gap = None` per FR-2's threshold list); for 2 ≤ N < 10 the rule degrades gracefully — e.g., 3 trials with values [0.84, 0.838, 0.836] all within 0.005 → `robust_plateau`. + - **Late-trial noise floor:** computed as `numpy.std([t.primary_metric for t in last_n_complete_trials], ddof=1)` (sample stddev; the values are materialized as a list because `numpy.std` does not accept a bare generator). `last_n_complete_trials = trials sorted by optuna_trial_number ascending, take the last max(5, int(len(complete)*0.2)) entries`. Suppressed (returned as `None`) when `len(complete) < 10`. + - **Convergence regime:** "early_held" when the winner's `optuna_trial_number ≤ 0.5 * max_optuna_trial_number` AND **at least one trial in the last 25% of trial numbers has `primary_metric` within 0.005 of the winner** (i.e., late exploration found similar plateau configs — the optimizer "held" the region); "late_rising" when the winner's `optuna_trial_number ≥ 0.9 * max_optuna_trial_number`; "noisy" otherwise. Note: "no improvement after the winner" is tautological because the winner is by-definition the global best, so the rule uses "late trial within 0.005 of winner" as the observable signal that the optimizer found multiple near-equivalents. + - **Regressor threshold (per-metric):** absolute delta cutoff. Locked in FR-4a's metric-threshold table. + +#### FR-4a — Regressor threshold table (locked, enumerated) + +For "is this query regressed?" 
the comparison is `winner.per_query_metrics[qid][metric] - runner_up.per_query_metrics[qid][metric]`. A query is "regressed" when this delta is **less than the negative of the threshold** for the active metric: + +| Metric | Threshold (absolute delta) | +|---|---| +| `ndcg` | 0.01 | +| `precision` | 0.01 | +| `recall` | 0.01 | +| `map` | 0.02 | +| `mrr` | 0.02 | + +A query is "improved" when delta > +threshold; "unchanged" when |delta| ≤ threshold. The metric used for the threshold lookup is `study.objective['metric']` (always one of the wire-enum values in [`ui/src/lib/enums.ts`](../../../../ui/src/lib/enums.ts) and [`backend/app/api/v1/schemas.py:521`](../../../../backend/app/api/v1/schemas.py#L521) `_K_REQUIRED_METRICS` family). + +### FR-5: Surface confidence in three places + +- Requirement: + - **(5a) `StudyDetail` API response.** The system **MUST** add `confidence: ConfidenceShape | None` to the [`StudyDetail`](../../../../backend/app/api/v1/schemas.py#L613) Pydantic model. The field **MUST** be populated via `compute_study_confidence(db, row)` in [`_detail()`](../../../../backend/app/api/v1/studies.py#L118). When the function returns `None`, the JSON wire value is `null`. + - **(5b) PR body `## Confidence` section.** The system **MUST** insert a new section between `## Metric delta` and `## Config diff` in [`_render_pr_body_study_backed`](../../../../backend/workers/git_pr.py#L488). The section renders the headline + CI line, the per-query outcome counts (when available), the named regressor block (when any), the runner-up gap classification, the late-trial noise floor, and the convergence call-out. If `confidence is None`, the entire section is omitted. 
+ - **(5c) `<ConfidencePanel>` on `/studies/[id]`.** The system **MUST** add a new component at [`ui/src/components/studies/confidence-panel.tsx`](../../../../ui/src/components/studies/confidence-panel.tsx) mounted from [`ui/src/app/studies/[id]/page.tsx`](../../../../ui/src/app/studies/[id]/page.tsx) between the study header card and the existing trials table. The panel renders headline + CI band (when `ci_95` non-null), per-query outcome chips (when `per_query_outcomes` non-null), the named regressor table (when `per_query_outcomes.regressed > 0`; up to 5 rows), the runner-up gap label (when `runner_up_gap` non-null), the late-trial 1σ value (when `late_trial_stddev` non-null), and the convergence call-out (when `convergence` non-null). If the entire `confidence` field is `null`, the panel renders nothing. There is NO "view full per-query breakdown" disclosure in Phase 1 — the inline 5-row regressor table is the only per-query surface (consistent with the bounded payload from FR-2 query 4). + - **(5d) PR worker data plumbing.** The system **MUST** modify the PR-opening worker code path (the `open_pr` Arq job and any other call site that invokes `_render_pr_body_study_backed`) so that BEFORE rendering it (a) loads the Study row via `repo.get_study(db, proposal.study_id)`, (b) `await`s `compute_study_confidence(db, study)`, (c) passes the resulting `ConfidenceShape | None` into `_render_pr_body_study_backed(..., confidence=...)` as a new keyword argument. The renderer reads `confidence` and emits the `## Confidence` section per FR-5b. An integration test against the real PR worker path (NOT just the pure renderer) covers AC-11 end-to-end. +- Notes: All four surfaces (StudyDetail, PR body, ConfidencePanel, digest prompt) consume the same source-of-truth `ConfidenceShape` Pydantic model. There is no UI-only or PR-body-only data — every signal is computable from the same domain helper. 
The PR worker re-runs the computation rather than reading `StudyDetail` JSON to keep the worker independent of the HTTP layer. + +### FR-6: Update the digest narrative LLM prompt + +- Requirement: + - The system **MUST** add two new XML blocks to [`prompts/digest_narrative.user.jinja`](../../../../prompts/digest_narrative.user.jinja), inserted after the existing `` block: + + ```jinja2 + {% if confidence %} + {% if confidence.ci_95 %}ci_low: {{ confidence.ci_95.low }} + ci_high: {{ confidence.ci_95.high }} + {% endif %}n_queries: {{ confidence.headline.n_queries }} + {% if confidence.runner_up_gap %}runner_up_gap: {{ confidence.runner_up_gap.value }} ({{ confidence.runner_up_gap.classification or 'unclassified' }}) + {% endif %}{% if confidence.late_trial_stddev %}late_trial_stddev: {{ confidence.late_trial_stddev.value }} + {% endif %}{% if confidence.convergence %}convergence: {{ confidence.convergence.regime }} (best at trial {{ confidence.convergence.best_at_trial }} of {{ confidence.convergence.total_trials }}) + {% endif %} + + {% endif %}{% if confidence and confidence.per_query_outcomes %} + improved: {{ confidence.per_query_outcomes.improved }} + unchanged: {{ confidence.per_query_outcomes.unchanged }} + regressed: {{ confidence.per_query_outcomes.regressed }} + comparison_against: {{ confidence.per_query_outcomes.comparison_against }} + {% for r in confidence.per_query_outcomes.top_regressors %}- {{ r.query_text }}: {{ r.winner_score }} → {{ r.comparison_score }} ({{ r.delta }}) + {% endfor %} + + {% endif %} + ``` + + The template consumes the same nested `ConfidenceShape` exposed on `StudyDetail` (§8.3) — no flat DTO adapter. The digest worker passes the `confidence` dict directly into `render_digest_user_prompt(...)`'s new `confidence: dict | None` kwarg; the jinja `{% if %}` guards handle every degraded combination from FR-7. 
+ - The system **MUST** edit [`prompts/digest_narrative.system.md`](../../../../prompts/digest_narrative.system.md) line 29-30 from: + > `narrative` — a markdown string (~200–600 words). Open with the headline metric delta. Then explain *why* the recommendation works… + + to: + > `narrative` — a markdown string (~200–600 words). Open with the headline metric delta, immediately followed by a one-sentence confidence framing that mentions the CI band (when `` is present), the per-query outcome counts (when `` is present), and the worst-regressed query by name (when `` has regressors). Then explain *why* the recommendation works… + - The system prompt's XML-block list (lines 13-25) **MUST** be extended to document blocks 8 (``) and 9 (``) and their conditional-inclusion semantics. + - The system **MUST** update [`backend/app/llm/digest_prompt.py:67`](../../../../backend/app/llm/digest_prompt.py#L67) `render_digest_user_prompt` to accept ONE new optional kwarg `confidence: dict | None = None` (a single object — `per_query_outcomes` is nested INSIDE `confidence` per the `ConfidenceShape` contract in §8.3, not a sibling). The digest worker `backend/workers/digest.py` awaits `compute_study_confidence(db, study)` and passes the result via `ConfidenceShape.model_dump()` into the prompt-render call. +- Notes: The system prompt edit is precise — the existing string `"Open with the headline metric delta."` is replaced by the longer string. The prompt rendering is exercised by [`backend/tests/unit/workers/test_digest_prompt_render.py`](../../../../backend/tests/unit/workers/test_digest_prompt_render.py) which adds new fixtures for the with/without-confidence paths. + +### FR-7: Graceful degradation paths + +- Requirement: + - When `study.best_trial_id IS NULL` (study not yet complete): the API returns `confidence: null` (the **entire** ConfidenceShape is None). The PR body section is omitted. The ConfidencePanel renders nothing. 
The digest worker passes `confidence=None` so both jinja blocks are skipped. + - When the winner trial has `per_query_metrics IS NULL` (e.g., trial predates the `0015` migration): `ci_95`, `headline.n_queries`, and `per_query_outcomes` are all `null`. The rest of `ConfidenceShape` (`runner_up_gap`, `late_trial_stddev`, `convergence`) **still computes** because those signals depend only on `primary_metric` + `optuna_trial_number`, not on per-query data. Headline `value` is still populated from `study.best_metric`. + - When fewer than 5 queries have per-query data on the winner: `ci_95 = null` only; per-query outcomes still compute if the runner-up also has per-query data (no minimum-query gate on outcomes). + - When fewer than 10 complete trials: `late_trial_stddev = null` only. + - When fewer than 3 complete trials: `convergence = null` only. + - When fewer than 2 complete trials: `runner_up_gap = null` AND `per_query_outcomes = null`. + - When ≥ 2 complete trials but the runner-up has `per_query_metrics IS NULL`: `runner_up_gap` still computes (uses only `primary_metric`); `per_query_outcomes = null` only. + - When numpy raises (should never happen with valid float inputs): the exception propagates and `_detail()` returns 500 — this is a programming error, not a degraded path. (No bare `except Exception:`.) +- Notes: Each degraded sub-field is independent. Tests cover each combination explicitly (see §14). + +## 8) API and data contract baseline + +### 8.1 Endpoint surface + +This feature does NOT add new endpoints. It extends one existing endpoint's response shape. + +| Method | Path | Purpose | Change | +|---|---|---|---| +| `GET` | `/api/v1/studies/{study_id}` | Read study detail | **MODIFIED**: adds optional `confidence` field to the response (`ConfidenceShape \| None`). | + +### 8.2 Contract rules + +- Existing `StudyDetail` shape per [`schemas.py:613`](../../../../backend/app/api/v1/schemas.py#L613) is preserved. 
+- The new `confidence` field is `Optional[ConfidenceShape]` with default `None`. +- Old clients that don't deserialize `confidence` continue to work. +- The OpenAPI schema is shape-locked via the existing `studies` contract test family (precedent: `test_clusters_target_filter_openapi.py`). + +### 8.3 Response examples + +**Success — completed study with full confidence data:** + +```json +{ + "id": "01931e4a-...", + "name": "tune-product-title-boost-baseline", + "cluster_id": "01931...", + "target": "products", + "template_id": "01931...", + "query_set_id": "01931...", + "judgment_list_id": "01931...", + "search_space": {"params": {"title_boost": {"type": "float", "low": 0.5, "high": 10.0, "log": true}}}, + "objective": {"metric": "ndcg", "k": 10, "direction": "maximize"}, + "config": {"max_trials": 1000, "sampler": "tpe", "pruner": "median"}, + "status": "completed", + "failed_reason": null, + "optuna_study_name": "01931e4a-...", + "parent_study_id": null, + "baseline_metric": null, + "best_metric": 0.840, + "best_trial_id": "01931...", + "created_at": "2026-05-21T07:23:52Z", + "started_at": "2026-05-21T07:23:52Z", + "completed_at": "2026-05-21T07:25:13Z", + "trials_summary": {"total": 1000, "complete": 998, "failed": 2, "pruned": 0, "best_primary_metric": 0.840}, + "confidence": { + "headline": {"metric": "ndcg", "value": 0.840, "k": 10, "n_queries": 20}, + "ci_95": {"low": 0.782, "high": 0.891, "method": "bootstrap_n1000", "n_samples": 20}, + "runner_up_gap": {"value": 0.005, "classification": "robust_plateau", "top10_within": 0.005, "runner_up_metric": 0.835}, + "late_trial_stddev": {"value": 0.018, "window_size": 200, "min_window_required": 10}, + "convergence": {"best_at_trial": 387, "total_trials": 1000, "regime": "early_held"}, + "per_query_outcomes": { + "improved": 14, + "unchanged": 4, + "regressed": 2, + "comparison_against": "runner_up", + "top_regressors": [ + {"query_id": "01931...", "query_text": "shipping policy", "winner_score": 0.41, 
"comparison_score": 0.92, "delta": -0.51}, + {"query_id": "01931...", "query_text": "wireless headphones", "winner_score": 0.71, "comparison_score": 0.85, "delta": -0.14} + ] + } + } +} +``` + +**Success — completed study with partial confidence data (degraded — trials predate migration, so per_query_metrics is NULL but aggregate signals still compute):** + +```json +{ + "id": "01931...", + "name": "tune-product-title-boost-baseline-7ce587", + "...": "...", + "best_metric": 0.81, + "best_trial_id": "01931...", + "confidence": { + "headline": {"metric": "ndcg", "value": 0.81, "k": 10, "n_queries": null}, + "ci_95": null, + "runner_up_gap": {"value": 0.05, "classification": "sharp_peak", "top10_within": 0.04, "runner_up_metric": 0.76}, + "late_trial_stddev": {"value": 0.022, "window_size": 50, "min_window_required": 10}, + "convergence": {"best_at_trial": 412, "total_trials": 1000, "regime": "early_held"}, + "per_query_outcomes": null + } +} +``` + +**Success — study in `running` state (no best_trial_id yet):** + +```json +{ + "id": "01931...", + "name": "tune-product-title-boost-baseline", + "status": "running", + "best_trial_id": null, + "...": "...", + "confidence": null +} +``` + +**Non-auth failure example — study not found (existing envelope; unchanged):** + +```json +{ + "detail": { + "error_code": "STUDY_NOT_FOUND", + "message": "study 01931xxx not found", + "retryable": false + } +} +``` + +HTTP 404. No new error codes. + +**Auth failure example:** N/A in MVP1–3 (no auth surface). 
+ +### 8.4 Enumerated value contracts + +| Field | Accepted values (exact) | Backend source of truth | Frontend call site(s) | +|---|---|---|---| +| `confidence.convergence.regime` | `early_held`, `late_rising`, `noisy` | New `ConfidenceShape` Pydantic `Literal[...]` in [`backend/app/api/v1/schemas.py`](../../../../backend/app/api/v1/schemas.py); domain helper at `backend/app/domain/study/confidence.py::classify_convergence_regime()` | `ConfidencePanel` regime badge (`ui/src/components/studies/confidence-panel.tsx`); add wire-value alias in [`ui/src/lib/enums.ts`](../../../../ui/src/lib/enums.ts) `CONVERGENCE_REGIME_VALUES` | +| `confidence.runner_up_gap.classification` | `robust_plateau`, `sharp_peak` | New `Literal[...]` in `ConfidenceShape`; domain helper `classify_runner_up_gap()` | `ConfidencePanel` runner-up gap label | +| `confidence.per_query_outcomes.comparison_against` | `runner_up` (MVP1 — the only emitted value in Phase 1) | New `Literal["runner_up", "baseline"]` in `ConfidenceShape`; in MVP1 Phase 1 the helper unconditionally emits `"runner_up"` (the `"baseline"` value is reserved for Phase 2 and Phase 1 code MUST NOT branch on `study.baseline_trial_id` — that column does not exist yet) | `ConfidencePanel` "vs runner-up" label; PR body wording | +| `confidence.ci_95.method` | `bootstrap_n1000` (only value in MVP1; future MVP2 may add `wilson` or others) | New `Literal[...]` in `ConfidenceShape`; constant in `confidence.py` | Documentation in tooltip only; no user-visible select | +| `confidence.headline.metric` | `ndcg`, `map`, `precision`, `recall`, `mrr` | [`backend/app/api/v1/schemas.py:214`](../../../../backend/app/api/v1/schemas.py#L214) `ObjectiveMetric = Literal["ndcg", "map", "precision", "recall", "mrr"]` (canonical wire-enum source) | Read from existing `OBJECTIVE_METRIC_VALUES` in [`ui/src/lib/enums.ts:68`](../../../../ui/src/lib/enums.ts#L68) — no new array. 
| + +**Rules:** +- The 3 new `Literal[...]` value sets above MUST be added to `ui/src/lib/enums.ts` as `CONVERGENCE_REGIME_VALUES`, `RUNNER_UP_CLASSIFICATION_VALUES`, `COMPARISON_AGAINST_VALUES`, with source-of-truth comments per the project's enumerated-value-contract discipline (see CLAUDE.md §"Enumerated Value Contract Discipline"). +- `confidence.ci_95.method` is internal; the frontend treats it as opaque (only the `bootstrap_n1000` value exists in MVP1). +- All other fields in `ConfidenceShape` (counts, scalars, query_ids, query_texts) are free-form data, not enumerated. + +### 8.5 Error code catalog + +No new error codes introduced. + +## 9) Data model and state transitions + +### New / modified entities + +**Modified table: `trials`** + +- Add `per_query_metrics` (`JSONB`, **nullable**, no default) — per-query pytrec_eval scores for this trial. Shape: `{query_id: {metric_name: float}}` matching `ScoreResult.per_query` from [`scoring.py:47`](../../../../backend/app/eval/scoring.py#L47). NULL for trials predating the migration AND for trials with `status='failed'` (worker writes NULL on failure paths). No index — the column is read O(1) via `WHERE id = ?` lookups on the winner / runner-up trial rows. + +**No other table changes.** The new `ConfidenceShape` Pydantic model is API-layer only and not persisted (recomputed on read). + +> **RELYLOOP MVP1–MVP3 reminder:** no `tenant_id` column added; this feature stays single-tenant. + +### Required invariants + +- **INV-1:** `trials.per_query_metrics IS NULL OR jsonb_typeof(trials.per_query_metrics) = 'object'`. Enforced by a **DB-level CHECK constraint** added in the same migration (`CHECK (per_query_metrics IS NULL OR jsonb_typeof(per_query_metrics) = 'object')`, named `trials_per_query_metrics_object_check`). DB enforcement is the right layer because the write path is the Arq `run_trial` worker — not a Pydantic-validated HTTP request — so application-level Pydantic guards would not fire. 
+- **INV-2:** When `trials.status = 'failed'`, `trials.per_query_metrics IS NULL`. Enforced by the worker write path (FR-1 explicitly states the failed-path skip — `repo.create_trial(...)` is called with `per_query_metrics=None` on the failure branch). Verified by integration test covering AC-2. +- **INV-3:** When `trials.status = 'complete'` AND the trial was created post-migration AND `pytrec_eval` returned non-empty `per_query`, `trials.per_query_metrics IS NOT NULL`. Application-level invariant — verified by integration test that runs a real trial and asserts persistence (AC-1). +- **INV-4:** `compute_study_confidence(db, study)` returns `None` OR a valid `ConfidenceShape` — never raises (except for un-recoverable programming errors like ImportError). Application-level invariant — verified by unit tests covering every degraded-path branch. + +### State transitions + +No new state machines. The existing `trials.status ∈ {complete, failed, pruned}` and `studies.status ∈ {queued, running, completed, cancelled, failed}` are unchanged. + +### Idempotency / replay behavior + +- The `run_trial` worker's INSERT is already idempotent on `(study_id, optuna_trial_number)` per [`infra_optuna_eval/feature_spec.md` §11](../../../00_overview/implemented_features/2026_05_10_infra_optuna_eval/feature_spec.md). Adding `per_query_metrics` to the same INSERT doesn't change idempotency semantics. +- `compute_study_confidence` is pure and deterministic given a fixed seed (numpy RNG seed = 42 per FR-4). Re-reading the same study returns identical confidence values, byte-for-byte. + +## 10) Security, privacy, and compliance + +- **Threats:** + - **T1:** Information leak via query_text in the PR body's named-regressor block. If a tenant's query_set contains sensitive terms (e.g., proprietary product codes), those terms now appear in the public PR body the operator's central config repo receives. 
Mitigation: the PR body has *always* been visible to whoever can see the config repo; this feature does not change the trust boundary (config repo write access = read access to study metadata). Operator-side mitigation: scope the config repo's read-access to the same audience already trusted with judgment data. + - **T2:** DoS via very-large `per_query_metrics` payloads. A judgment list with 10,000 queries × 5 metrics × 1000 trials = 50M floats = ~400MB JSONB. Mitigation: judgment lists are operator-curated; max sizes are bounded by the operator's own discipline (MVP1 has no per-tenant quota gate — single-tenant + no auth). At MVP4 when multi-tenant lands, the per-tenant judgment-set cap (TBD per MVP4 epic) caps this naturally. + - **T3:** Bootstrap CI determinism failure. If the numpy seed isn't fixed, an approver re-reading the PR sees different CI numbers each time — undermines trust. Mitigation: FR-4 locks `seed=42`. Tested via the integration test that asserts byte-identical CI values across two consecutive reads. +- **Controls:** No new controls. Reuses the existing `pytrec_eval` data path (no new external service, no new credentials, no new secret). +- **Secrets / key handling:** N/A — no new secrets. +- **Auditability:** N/A in MVP1 (no `audit_log` yet). At MVP2, the `run_trial` worker's INSERT (which now writes `per_query_metrics`) gets audit-log coverage as part of the MVP2 epic's worker integration — no per-feature work needed here. +- **Data retention / deletion / export impact:** `trials.per_query_metrics` cascade-deletes with `trials` cascade-deletes with `studies`. No additional retention surface. + +## 11) UX flows and edge cases + +### Information architecture + +- **Navigation placement:** The ConfidencePanel renders inside the existing `/studies/[id]` route — no new page or tab. Position: between the existing study header card and the existing trials table. 
The panel collapses to nothing when `confidence === null` so old / running studies keep the current visual rhythm. +- **Labeling taxonomy:** + - Section heading: **"Confidence"** (capitalized, sentence case to match "Trials" elsewhere on the page). + - CI band label: **"95% CI"** (industry standard, no expansion). + - Per-query outcome chips: **"Improved"**, **"Unchanged"**, **"Regressed"** — green / grey / red badges using existing badge variants from the project's design system. + - Runner-up gap classification: **"Robust plateau"** / **"Sharp peak"** — green / amber. + - Convergence: **"Early-and-held"** / **"Late-rising"** / **"Noisy"** — green / amber / amber. + - Comparison label: **"vs runner-up"** in MVP1; **"vs baseline"** when Phase 2 ships and `comparison_against === 'baseline'`. +- **Content hierarchy:** Top to bottom — (1) headline + CI band (primary, always-visible when `confidence` is non-null; CI line itself only renders when `ci_95` is non-null); (2) per-query outcome chips row (when `per_query_outcomes` non-null); (3) named regressors table (only when `per_query_outcomes.regressed > 0`, capped at 5 rows); (4) runner-up gap + late-trial 1σ + convergence (3 secondary callouts in a single row, each rendered only when their sub-field is non-null). +- **Progressive disclosure:** The 4-section panel renders ~150-200px tall by default. The named regressors table (up to 5 rows with `query_text`, `winner_score`, `comparison_score`, `delta`) renders inline only when `per_query_outcomes.regressed > 0`. No "View per-query breakdown" disclosure in Phase 1 — that would require fetching `query_text` for every compared query (potentially 100s of rows), which violates the bounded-payload promise of FR-2 query 4. Operators who want the full per-query view can drill into the winner trial's metrics via a future enhancement. +- **Relationship to existing pages:** Sits between the existing header card and trials table on `/studies/[id]`. 
The existing digest panel (when present, for completed studies with digests) renders below the trials table — unchanged. + +### Tooltips and contextual help + +| Element | Tooltip / help text | Trigger | Placement | +|---|---|---|---| +| "95% CI" label | "Bootstrap 95% confidence interval on the headline metric, computed from per-query scores via 1000 resamples with replacement." | hover | top | +| "Improved/Unchanged/Regressed" chips | "Queries where the winner's per-query metric differs from the runner-up's by more than the threshold (NDCG/P/R: 0.01; MAP/MRR: 0.02)." | hover | top | +| "Robust plateau" / "Sharp peak" label | "Robust: the top min(10, complete trials) are all within 0.005 of the winner — many near-equivalent configs. Sharp: at least one trial in that top set is farther than 0.005 below the winner — winner is isolated." | hover | top | +| "Late-trial 1σ" label | "Standard deviation of the primary metric over the last 20% of completed trials — the empirical noise floor." | hover | top | +| "Convergence: Early-and-held" / "Late-rising" / "Noisy" | "Early-and-held: best found in the first half AND at least one trial in the last 25% finished within 0.005 of the winner (plateau held). Late-rising: best found in the last 10% — more trials may help. Noisy: neither — no clear convergence pattern." | hover | top | +| "vs runner-up" / "vs baseline" label | "Reference for per-query comparison. Runner-up: the second-best trial. Baseline: a no-tuning trial run before Optuna starts (when available)." | hover | top | + +Tooltip implementation reuses the existing [`InfoTooltip`](../../../../ui/src/components/common/info-tooltip.tsx) / [`HelpPopover`](../../../../ui/src/components/common/help-popover.tsx) primitives from `feat_contextual_help` — no new tooltip component required. 
Glossary entries (in [`ui/src/lib/glossary.ts`](../../../../ui/src/lib/glossary.ts)) added for `confidence.ci_95`, `confidence.runner_up_gap`, `confidence.late_trial_stddev`, `confidence.convergence_regime`, `confidence.per_query_outcomes`, `confidence.comparison_against`. + +### Primary flows + +1. **Approver opens a study-backed PR in their config repo.** The PR body now has a `## Confidence` section between `## Metric delta` and `## Config diff`. They read: "NDCG@10 +0.13 (95% CI 0.78–0.89, N=20 queries). 14 improved · 4 unchanged · 2 regressed (vs runner-up). Queries that regressed: `shipping policy` (0.92 → 0.41), `wireless headphones` (0.85 → 0.71). Runner-up gap 0.005 (robust plateau). Late-trial 1σ = 0.018. Convergence: early-and-held (best at trial 387 of 1000)." They decide to merge or reject based on whether the named regressors are operator-important queries. +2. **Relevance engineer inspects a completed study on `/studies/[id]`.** ConfidencePanel renders above the trials table with the headline + CI band + outcome chips + (when regressors exist) the named regressors table showing up to 5 (query_text, winner_score, comparison_score, delta) rows. They identify which queries the winner sacrificed from the inline list, decide whether to broaden the search space and rerun. (A full per-query breakdown disclosure is explicitly out of scope for Phase 1 — see §3 Out of scope.) +3. **Relevance engineer creates a new study and opens its detail page while it's still running.** Status is "running", `best_trial_id` is null, ConfidencePanel renders nothing (clean visual; no empty-state shell). The trials table polls every 3s as usual. + +### Edge / error flows + +- **Old study (pre-`0015_trials_per_query_metrics` migration), completed.** Winner trial's `per_query_metrics IS NULL`. 
`compute_study_confidence` returns a **partial** `ConfidenceShape` with `ci_95`, `headline.n_queries`, and `per_query_outcomes` all null but aggregate signals (`runner_up_gap`, `late_trial_stddev`, `convergence`) populated — see AC-3. The PR body's `## Confidence` section shows only the aggregate signals; the digest narrative's confidence block similarly renders only non-null fields. No error envelope, no LLM degraded mode, no operator-visible failure. +- **Study with only 1 complete trial (others all failed).** Winner exists; runner-up doesn't. `runner_up_gap = None`, `per_query_outcomes = None`. CI band still computes (winner has per_query data). Headline + CI render; the rest is suppressed. +- **Study with 5 queries (small judgment list).** CI band reports `n_samples: 5` — at the lower bound. Late-trial 1σ requires ≥10 complete trials independently. Per-query outcomes work normally (the threshold is 2 trials, not 5 queries). +- **Study with 4 queries.** CI band suppressed (n < 5). Other fields proceed independently. +- **Study with `best_metric IS NULL` (zero-trials AC-2 path — see [`feat_digest_proposal`](../../../00_overview/implemented_features/2026_05_11_feat_digest_proposal/feature_spec.md)).** `confidence = None`. +- **OpenAI capability check failed (digest worker in degraded mode).** The degraded-mode XML block fires per the existing prompt logic. The new confidence and per-query-outcomes blocks STILL render in degraded mode — they're plain data, not LLM-derived. The narrative LLM may not be called; if not, the data still exists in the PR body section (which is built independently from the LLM narrative). +- **Worker race: `study.best_trial_id` points at a deleted row.** Already handled by the existing `digest_best_trial_missing` defensive log at [`digest.py:615-625`](../../../../backend/workers/digest.py#L615). Our new code uses the same `repo.get_trial(db, trial_id)` (or equivalent) and on `None` returns `confidence = None` rather than raising. 
The existing log event continues to fire from the digest worker; we don't add a new log. + +## 12) Given/When/Then acceptance criteria + +### AC-1: Per-query metrics persist on every successful trial + +- Given a study with `template_id` declaring `title_boost: 'float'`, a query set of 5 queries, and a judgment list with judgments for those queries +- When the orchestrator runs 5 Optuna trials and all 5 complete successfully +- Then `SELECT per_query_metrics FROM trials WHERE study_id = ?` returns 5 non-NULL JSONB rows, each shaped `{qid: {ndcg: float, map: float, precision: float, recall: float, mrr: float}}` (the 5 metric keys from `MetricCatalog`) +- Example values: + - Input: `study_id="01931..."`, `max_trials=5`, judgment list with `query_ids=["q1","q2","q3","q4","q5"]` + - Expected: 5 rows, each with `per_query_metrics["q1"]["ndcg"]` populated as a float between 0.0 and 1.0 + +### AC-2: Failed trial does not write per_query_metrics + +- Given a study where one trial's adapter call fails (simulated network error) +- When `run_trial` writes the failed trial row +- Then `SELECT per_query_metrics, status FROM trials WHERE id = ?` returns `(NULL, 'failed')` + +### AC-3: Old studies degrade to a partial ConfidenceShape (aggregate-only) + +- Given a completed study whose trials predate the `0015_trials_per_query_metrics` migration (all rows have `per_query_metrics IS NULL`) AND `study.best_trial_id` points at an existing winner trial row +- When `GET /api/v1/studies/{id}` is called +- Then the response body has `confidence != null` BUT with `confidence.ci_95 == null`, `confidence.per_query_outcomes == null`, and `confidence.headline.n_queries == null` +- And the aggregate sub-fields are populated: `confidence.headline.value` from `study.best_metric`, `confidence.runner_up_gap` (when ≥2 complete trials), `confidence.late_trial_stddev` (when ≥10 complete trials), `confidence.convergence` (when ≥3 complete trials) +- And the PR body contains a `## Confidence` section 
showing only the aggregate signals (no CI line, no per-query block) +- And the ConfidencePanel renders the aggregate signals (no CI band, no per-query chips, no regressor table) +- Counter-example: when `study.best_trial_id` resolves to a deleted row (or `best_trial_id IS NULL` because the study never completed), `confidence == null` (whole-object), the PR body has no `## Confidence` section, and the panel renders nothing — see AC-3a. + +### AC-3a: Missing winner trial row → confidence is whole-object null + +- Given a study where `best_trial_id IS NULL` (still running) OR `best_trial_id` points at a row that has been deleted +- When `GET /api/v1/studies/{id}` is called +- Then `confidence == null` +- And the PR body has no `## Confidence` section +- And the ConfidencePanel renders nothing + +### AC-4: Bootstrap CI computed and reproducible + +- Given a completed study with 20 queries and 100 complete trials, all with `per_query_metrics` +- When `GET /api/v1/studies/{id}` is called twice in succession +- Then both responses have identical `confidence.ci_95.low` and `confidence.ci_95.high` values (byte-equal — proves the seed=42 lock) +- And `confidence.ci_95.low < confidence.headline.value < confidence.ci_95.high` +- Example: headline=0.84, ci_95={low: 0.78, high: 0.89} + +### AC-5: Runner-up gap classification + +- Given a completed study whose top 10 trials by `primary_metric` are all within 0.005 of the winner (e.g., 0.840, 0.838, 0.836, ..., 0.835) +- When `GET /api/v1/studies/{id}` is called +- Then `confidence.runner_up_gap.classification == "robust_plateau"` +- And when the top 10 are NOT all within 0.005 (e.g., winner 0.840, second-best 0.760), `classification == "sharp_peak"` + +### AC-6: Late-trial noise floor + +- Given a completed study with 50 complete trials whose `primary_metric` values are known +- When `GET /api/v1/studies/{id}` is called +- Then `confidence.late_trial_stddev.value` equals `numpy.std(primary_metric[-10:], ddof=1)` where 
`window_size = max(5, int(50*0.2)) = 10` +- And `confidence.late_trial_stddev.window_size == 10` + +### AC-7: Late-trial noise floor suppressed for small studies + +- Given a completed study with 9 complete trials +- When `GET /api/v1/studies/{id}` is called +- Then `confidence.late_trial_stddev` is `null` (below 10-trial minimum) + +### AC-8: Convergence regime — early-and-held + +- Given a completed study where the winner's `optuna_trial_number = 200` out of `max_optuna_trial_number = 1000`, AND at least one trial with `optuna_trial_number > 750` has `primary_metric` within 0.005 of the winner (proving the late exploration found similar plateau configs) +- When `GET /api/v1/studies/{id}` is called +- Then `confidence.convergence.regime == "early_held"` +- And `confidence.convergence.best_at_trial == 200` +- And `confidence.convergence.total_trials == 1000` +- Counter-example for `noisy`: same study but no late trial within 0.005 of the winner → `regime == "noisy"`. Counter-example for `late_rising`: winner's `optuna_trial_number = 950` → `regime == "late_rising"`. 
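
The two trial-sequence signals in AC-6 through AC-8 can be sketched in plain Python as below. This is an illustration under stated assumptions, not the spec's implementation. The function signatures are hypothetical, inputs are assumed to hold complete trials ordered by `optuna_trial_number`, and the inclusivity of the "first half", "last 25%" and "last 10%" boundaries is a guess chosen to agree with the AC examples.

```python
from statistics import stdev
from typing import Optional, Sequence, Tuple

MIN_TRIALS_FOR_NOISE_FLOOR = 10  # AC-7: below this the field is null
PLATEAU_TOLERANCE = 0.005        # AC-8: "within 0.005 of the winner"


def late_trial_stddev(metrics: Sequence[float]) -> Optional[Tuple[float, int]]:
    """Return (sample stddev, window_size) over the trailing window, or None."""
    n = len(metrics)
    if n < MIN_TRIALS_FOR_NOISE_FLOOR:
        return None
    window_size = max(5, int(n * 0.2))
    # statistics.stdev is the sample standard deviation, equivalent to ddof=1.
    return stdev(metrics[-window_size:]), window_size


def classify_convergence_regime(
    best_at_trial: int,
    total_trials: int,
    best_metric: float,
    trials: Sequence[Tuple[int, float]],
) -> str:
    """Classify as early_held, late_rising, or noisy.

    `trials` holds (optuna_trial_number, primary_metric) pairs.
    """
    # Late-rising: the winner was found in the last 10% of the trial budget.
    if best_at_trial > 0.9 * total_trials:
        return "late_rising"

    found_early = best_at_trial <= 0.5 * total_trials
    # Plateau held: some trial in the last 25% landed close to the winner.
    plateau_held = any(
        number > 0.75 * total_trials
        and abs(best_metric - metric) <= PLATEAU_TOLERANCE
        for number, metric in trials
    )
    return "early_held" if found_early and plateau_held else "noisy"
```

For the AC-8 scenario, a winner at trial 200 of 1000 with a late trial scoring 0.838 against a 0.840 winner classifies as `early_held`. Removing that late trial gives `noisy`.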
+ +### AC-9: Convergence regime — late-rising + +- Given a completed study where the winner's `optuna_trial_number = 950` out of `max_optuna_trial_number = 1000` +- When `GET /api/v1/studies/{id}` is called +- Then `confidence.convergence.regime == "late_rising"` + +### AC-10: Per-query regressor naming with thresholded comparison + +- Given a completed study with NDCG objective, a winner trial with `per_query_metrics`, and a runner-up (rank #2) trial with `per_query_metrics`, where for `query_id="qA"` the winner scored 0.41 and the runner-up scored 0.92 (delta=-0.51, below the -0.01 threshold for NDCG) +- And for `query_id="qB"` the winner scored 0.85 and the runner-up scored 0.85 (delta=0, within ±0.01 unchanged window) +- When `GET /api/v1/studies/{id}` is called +- Then `confidence.per_query_outcomes.top_regressors` contains a row for `query_id="qA"` with `query_text` joined from the queries table, `winner_score=0.41`, `comparison_score=0.92`, `delta=-0.51` +- And `confidence.per_query_outcomes.regressed == 1` +- And `confidence.per_query_outcomes.unchanged` includes `qB` in its count + +### AC-11: PR body renders the Confidence section between Metric delta and Config diff + +- Given a completed study with full confidence data and a study-backed proposal in `pending` status +- When `_render_pr_body_study_backed(...)` is called (e.g., by `POST /api/v1/proposals/{id}/open_pr` worker job) +- Then the rendered markdown body contains, in order: `# RelyLoop proposal`, `## Metric delta`, `## Confidence`, `## Config diff`, `## Suggested follow-ups`, `## Parameter importance` +- And the `## Confidence` section contains "95% CI", a "Queries:" line with improved/unchanged/regressed counts, and a "Queries that regressed:" sub-section listing up to 5 query_texts with their deltas + +### AC-12: PR body omits the Confidence section when confidence=null + +- Given a study-backed proposal whose study has `confidence == null` whole-object — e.g., `best_trial_id IS NULL` (still running, never completed) OR 
`best_trial_id` points to a deleted/missing trial row (cascade-delete race, see AC-3a) +- When `_render_pr_body_study_backed(...)` is called +- Then the rendered markdown body does NOT contain `## Confidence` +- And the section ordering becomes `Metric delta → Config diff → Suggested follow-ups → Parameter importance` (the existing pre-feature behavior) +- Counter-example: old studies with existing winner row but `per_query_metrics IS NULL` produce a partial `ConfidenceShape` (NOT whole-object null), so the PR body DOES contain `## Confidence` showing only aggregate signals — see AC-3. + +### AC-13: ConfidencePanel renders against real backend + +- Given the operator runs `make up`, seeds a completed study with `per_query_metrics`, and opens `/studies/[seeded_id]` in a browser +- When the page loads +- Then the ConfidencePanel renders between the study header card and the trials table +- And the panel contains the headline + CI band, the per-query outcome chips, the runner-up gap label, the late-trial 1σ, and the convergence call-out +- And the inline named-regressors table renders up to 5 rows when `per_query_outcomes.regressed > 0` (each row showing `query_text`, `winner_score`, `comparison_score`, `delta`) +- And there is NO "View per-query breakdown" disclosure (Phase 1 out of scope per §3) + +### AC-14: Digest narrative LLM prompt includes confidence blocks + +- Given a completed study with full confidence data and a triggered digest generation +- When the digest worker renders `digest_narrative.user.jinja` with the new `confidence` kwarg (a single dict — the serialized `ConfidenceShape.model_dump()` with `per_query_outcomes` nested inside per FR-6) +- Then the rendered prompt contains a confidence block with `ci_low`, `ci_high`, `n_queries`, `runner_up_gap`, `late_trial_stddev`, `convergence` fields (resolved from nested paths `confidence.ci_95.low`, `confidence.ci_95.high`, `confidence.headline.n_queries`, etc.) 
+- And contains a per-query-outcomes block (rendered from `confidence.per_query_outcomes`, NOT from a sibling kwarg) with `improved`, `unchanged`, `regressed`, `comparison_against` and a list of `top_regressors` +- And the rendered system prompt has the updated "Open with the headline metric delta, immediately followed by a one-sentence confidence framing…" line + +### AC-15: Bootstrap CI suppressed when N(queries) < 5 + +- Given a completed study whose query set has 4 queries +- When `GET /api/v1/studies/{id}` is called +- Then `confidence.ci_95 == null` +- And the rest of the `confidence` object is populated normally (`headline`, `runner_up_gap`, etc., as data allows) + +### AC-16: Per-query outcomes suppressed when no runner-up has per_query_metrics + +- Given a completed study with only 1 complete trial (others failed) +- When `GET /api/v1/studies/{id}` is called +- Then `confidence.per_query_outcomes == null` AND `confidence.runner_up_gap == null` +- And `confidence.headline` and `confidence.ci_95` are populated normally (winner alone is enough) + +### AC-17: Alembic migration round-trips cleanly + +- Given the local Alembic head at `0014_clusters_target_filter` with seeded demo data +- When `alembic upgrade head` runs (applies `0015_trials_per_query_metrics`), then `alembic downgrade -1`, then `alembic upgrade head` again +- Then no errors are raised +- And after the downgrade, `trials.per_query_metrics` does NOT exist as a column +- And after the second upgrade, the column exists again with `IS NULL` for every row (preserved by being nullable) + +## 13) Non-functional requirements + +- **Performance:** + - `GET /api/v1/studies/{id}` p95 latency increase from the new `compute_study_confidence` call: **< 100ms** for studies up to 1000 trials × 100 queries. The three-query read pattern (FR-2) keeps payload at ~30KB regardless of trial count. The heaviest compute step is the bootstrap loop (1000 resamples × N_queries numpy operations), which is ~5ms for N=100 queries. 
The wire load + DB roundtrip dominate the budget; the actual compute is ≪10ms. Measured by adding a perf assertion to the integration test (skip-by-default for CI, opt-in via env flag). + - The `run_trial` worker latency increase from persisting `per_query_metrics`: **< 1ms** (single dict copy into the existing INSERT). No measurable hot-path impact. +- **Reliability:** + - `compute_study_confidence` returns `None` or a partial `ConfidenceShape` on every degraded path (per INV-4 and AC-3); it never raises (except for unrecoverable programming errors). Verified by unit tests covering each FR-7 degraded path. + - PR-body rendering never raises on `confidence=None`; tested via AC-12 contract. +- **Operability:** + - No new metrics, logs, or alerts. The existing `digest_best_trial_missing` log already covers the only race we share with the digest worker. + - The runbook entry [`docs/03_runbooks/local-dev.md`](../../../../docs/03_runbooks/local-dev.md) doesn't need an update (no new operator action). A glossary update lands per FR-6. +- **Accessibility / usability:** + - ConfidencePanel meets the existing WCAG 2.1 AA pattern used by the studies-detail page. Color-only signals (green/amber/red badges) are paired with text labels per the project's chip-discipline (`feat_contextual_help` precedent). + - Tooltips trigger on hover AND keyboard focus (per `InfoTooltip` primitive's existing behavior). 
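
The bootstrap loop named in the performance budget above can be sketched as follows. This is a dependency-free illustration, not the production helper. It swaps the numpy RNG the spec names for the standard library's `random.Random`, so its numbers will not match the real implementation. The percentile method for picking the interval bounds is also an assumption, and `bootstrap_ci_95` is a hypothetical name.

```python
import random
from statistics import fmean
from typing import Optional, Sequence, Tuple

N_RESAMPLES = 1000  # matches the `bootstrap_n1000` method label
MIN_QUERIES = 5     # the CI is suppressed below this (AC-15)
SEED = 42           # fixed seed so repeated reads return identical bounds


def bootstrap_ci_95(per_query_scores: Sequence[float]) -> Optional[Tuple[float, float]]:
    """Return (low, high) for a 95% bootstrap CI on the mean, or None if too few queries."""
    n = len(per_query_scores)
    if n < MIN_QUERIES:
        return None

    # A fresh generator per call keeps the result independent of call order.
    rng = random.Random(SEED)
    scores = list(per_query_scores)
    means = sorted(
        fmean(rng.choices(scores, k=n))  # resample n scores with replacement
        for _ in range(N_RESAMPLES)
    )

    # Percentile bootstrap (assumed): take the 2.5th and 97.5th percentiles.
    low = means[N_RESAMPLES * 25 // 1000]
    high = means[N_RESAMPLES * 975 // 1000 - 1]
    return low, high
```

Seeding inside the function rather than at module import is what makes two consecutive reads return identical bounds, which is the property AC-4 checks.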
+ +## 14) Test strategy requirements (spec-level) + +- **Unit tests (`backend/tests/unit/`):** + - `backend/tests/unit/domain/study/test_confidence.py` — 20+ cases covering `bootstrap_ci` (deterministic seed, N>=5 / N<5 paths), `classify_runner_up_gap` (robust_plateau / sharp_peak), `compute_late_trial_stddev` (window size, ≥10 / <10 path), `classify_convergence_regime` (early_held / late_rising / noisy), `classify_query_outcomes` (improved/unchanged/regressed counts per FR-4a threshold table), `top_regressors` (sorted by absolute delta, capped at 5, query_text join), `compute_study_confidence` orchestrator function (every degraded path from FR-7). + - `backend/tests/unit/workers/test_digest_prompt_render.py` (existing file extended) — 4+ new cases for prompt rendering with confidence / without confidence / with per_query_outcomes / partial population. +- **Integration tests (`backend/tests/integration/`):** + - `backend/tests/integration/test_studies_api_confidence.py` (new file) — 8+ cases covering AC-3, AC-4, AC-5, AC-7, AC-10, AC-15, AC-16, with seeded studies via the existing `_digest_helpers.py` patterns extended to populate `per_query_metrics`. + - `backend/tests/integration/test_trials_per_query_metrics_migration.py` (new file) — round-trip migration test (AC-17) following the pattern at `test_clusters_target_filter_migration.py`. + - `backend/tests/integration/test_run_trial_per_query_persistence.py` (new file or extension of existing trials worker tests) — verify AC-1 + AC-2 by running a real trial against a stubbed adapter (the existing `infra_optuna_eval` integration-test scaffold). +- **Contract tests (`backend/tests/contract/`):** + - Extend the existing `studies` OpenAPI shape-lock contract test (precedent: cluster target_filter contract test) to include the `confidence` field with its `ConfidenceShape` sub-shape. 
+ - Add a PR-body section contract test in the `git_pr` test family (or new file `test_pr_body_confidence_section.py`) covering AC-11 + AC-12. +- **E2E tests (`ui/tests/e2e/`):** + - Extend `ui/tests/e2e/studies.spec.ts` with 2 new real-backend cases: + - **AC-13** ConfidencePanel renders for a seeded completed study with full per-query data (uses an extended `_digest_helpers.py`-equivalent helper on the Playwright side OR uses an extended seedAcmeProductsChain pattern with per_query_metrics). + - ConfidencePanel correctly omits itself for a study with `confidence=null`. + - The Playwright spec MUST NOT use `page.route()` mocking — real backend per CLAUDE.md E2E policy. The seed helper either inserts a synthetic Trial row with hand-crafted `per_query_metrics` JSONB or runs a real `run_trial` invocation. + +Total estimated new test count: ~35 cases across unit (20+), integration (10+), contract (3+), E2E (2). + +## 15) Documentation update requirements + +- **`docs/01_architecture/data-model.md`:** Add `trials.per_query_metrics` to the per-table column reference for `trials`. Note nullable + post-`0015` semantics. The `studies.baseline_metric` row already exists; add a forward-ref to Phase 2 that explains baseline_trial_id will be added at that time. +- **`docs/01_architecture/api-conventions.md`:** No update required (no convention change). +- **`docs/01_architecture/optimization.md`:** Add a brief "Confidence signals" subsection noting that the `score()` function's `per_query` dict is now persisted on `trials.per_query_metrics` (was previously discarded). Point at the new domain module. +- **`docs/02_product`:** No update required at the umbrella level. This spec is the planning artifact. +- **`docs/03_runbooks/local-dev.md`:** No update required (no new operator action). +- **`docs/04_security`:** No update required (no new security surface). +- **`docs/05_quality/testing.md`:** No update required (existing test-layer convention covers the new test files). 
+- **`docs/08_guides`:** No new walkthrough guide. The existing guide 06 ("Create and monitor a study") may benefit from a short addendum mentioning the ConfidencePanel — captured as a follow-up idea, NOT in-scope here. +- **`state.md`:** Update after PR merge with the new Alembic head (`0015_trials_per_query_metrics`) and feature ship status. +- **`CLAUDE.md`:** No update required (no new convention; the data-model.md update covers the new column reference). + +## 16) Rollout and migration readiness + +- **Feature flags / staged rollout:** None. RelyLoop is single-tenant + local-only through MVP3. The feature ships in a single PR. +- **Migration / backfill expectations:** + - Migration `0015_trials_per_query_metrics` adds one nullable JSONB column plus the `trials_per_query_metrics_object_check` CHECK constraint (INV-1). **No backfill.** Old trials retain `per_query_metrics IS NULL` and degrade gracefully via FR-7. New trials write the column on every successful trial. + - Round-trip verified (`alembic upgrade head && alembic downgrade -1 && alembic upgrade head`) before merge per CLAUDE.md Absolute Rule #5. + - Idempotency-guarded: the `add_column` and `drop_column` operations are inside `upgrade()` / `downgrade()` respectively; no conditional skip needed because the column doesn't exist pre-migration. + - Revision ID length: `0015_trials_per_query_metrics` = 29 characters — under the 32-char `alembic_version` limit. +- **Operational readiness gates:** + - `pre-commit run --all-files` passes locally. + - `make test` (unit + integration + contract) passes locally with the new tests. + - `cd ui && pnpm test` (vitest) passes. + - `pnpm playwright test` passes locally with the 2 new E2E cases. + - CI green on PR (lint + typecheck + tests + Docker build). +- **Release gate:** Merge to main triggers no staging deploy in MVP1 (no remote staging). Local stack rebuild via `make up` after pulling main picks up the migration; operators run `make migrate` to apply `0015` to their existing dev DBs. 
+ +## 17) Traceability matrix + +| FR ID | Acceptance Criteria IDs | Planned stories/tasks (TBD by `/impl-plan-gen`) | Test files / suites | Docs to update | +|---|---|---|---|---| +| FR-1 (per_query persistence) | AC-1, AC-2, AC-17 | Migration story; worker-change story | `tests/integration/test_trials_per_query_metrics_migration.py`, `tests/integration/test_run_trial_per_query_persistence.py` | `data-model.md`, `state.md` | +| FR-2 (compute helper) | AC-3, AC-4, AC-5, AC-7, AC-15, AC-16 | Domain-module story | `tests/unit/domain/study/test_confidence.py`, `tests/integration/test_studies_api_confidence.py` | `optimization.md` | +| FR-3 (winner-vs-runner-up) | AC-10, AC-16 | Domain-module story; folded into FR-2 | `tests/unit/domain/study/test_confidence.py` (top_regressors cases) | — | +| FR-4 + FR-4a (thresholds + methods) | AC-4, AC-5, AC-6, AC-8, AC-9, AC-10 | Domain-module story | `tests/unit/domain/study/test_confidence.py` (every threshold path) | — | +| FR-5a (StudyDetail enrichment) | AC-3, AC-4, AC-16 | API-extension story | `tests/integration/test_studies_api_confidence.py`, `tests/contract/test_studies_openapi.py` | — | +| FR-5b (PR body section) | AC-11, AC-12 | PR-renderer story | `tests/contract/test_pr_body_confidence_section.py` | — | +| FR-5c (ConfidencePanel UI) | AC-13 | Frontend story | `ui/tests/e2e/studies.spec.ts` + `ui/src/__tests__/components/studies/confidence-panel.test.tsx` | — | +| FR-6 (digest narrative prompt) | AC-14 | Prompt-update story | `tests/unit/workers/test_digest_prompt_render.py` extended | `optimization.md` | +| FR-7 (degraded paths) | AC-3, AC-7, AC-15, AC-16 | Folded into FR-2 | `tests/unit/domain/study/test_confidence.py` (every degraded branch) | — | + +## 18) Definition of feature done + +This feature is complete when: + +- [ ] All acceptance criteria (AC-1 through AC-17) pass in CI. +- [ ] All test layers (unit / integration / contract / E2E) are green. 
+- [ ] Migration `0015_trials_per_query_metrics` is applied to the local stack and verified round-trip. +- [ ] The runbook entry doc-updates (per §15) are merged. +- [ ] Rollout gates from §16 are satisfied (CI green + local smoke). +- [ ] No open questions remain in §19. +- [ ] `phase2_idea.md` exists in the feature directory documenting the deferred baseline-trial work. + +## 19) Open questions and decision log + +### Open questions + +None remaining after preflight + spec-gen. The seven open questions surfaced during preflight have all been resolved by locked decisions below. + +### Decision log + +- **2026-05-21 — D1 — API surface for ConfidencePanel data.** Locked: enrich `StudyDetail` with an optional `confidence: ConfidenceShape | None` field (Option A from preflight). Rejected: separate endpoint `GET /api/v1/studies/{id}/confidence` (premature surface area), client-side computation (would require sending all `trials.per_query_metrics` over the wire — wasteful and couples the frontend to schema). Rationale: matches how `digest` is already inlined into `StudyDetail`'s sibling fields; old clients ignore the field; single source of truth. + +- **2026-05-21 — D2 — Regressor threshold semantics.** Locked: absolute delta with per-metric table — NDCG/precision/recall = 0.01, MAP/MRR = 0.02 (FR-4a). Rejected: relative-delta (ill-defined when comparison_score=0), per-tenant overrides (deferred until MVP4 + auth). Rationale: matches the calibration-kappa tier-threshold precedent and is large enough to filter noise on judgment lists with ≤20 queries. + +- **2026-05-21 — D3 — Late-trial window definition.** Locked: `max(5, int(len(complete_trials)*0.2))` trials, with a minimum of 10 complete trials required to compute the noise floor at all (FR-7). Rejected: fixed window of 50 (too aggressive for small studies), last-by-time-rather-than-trial-number (Optuna trial number IS time-ordered by construction). 
Rationale: a 10-trial minimum is enough for a meaningful sample stddev; below that the value misleads more than it informs. + +- **2026-05-21 — D4 — Bootstrap CI parameters.** Locked: percentile method, N=1000 resamples, 95% CI, suppressed when N(queries) < 5 (FR-4). RNG seed = 42 fixed for determinism (FR-4, AC-4). Rejected: bias-corrected percentile (BCa) method (small-sample bias correction is fragile under N<20; percentile is the textbook default for relevance-engineer-facing UI). Rationale: textbook 1000-resample percentile is the established default; fixed seed ensures approvers see stable numbers on PR re-reads. + +- **2026-05-21 — D5 — Wide-plateau threshold.** Locked: "robust_plateau" when top-10 complete trials are ALL within 0.005 of the winner; "sharp_peak" otherwise (FR-4). Rejected: relative threshold (e.g., 0.5 * (winner − baseline)) because `baseline` is always None in MVP1 (see §2 audit). Rationale: 0.005 is below typical late-trial noise (1σ ~ 0.018 in our test data) so it's a tight definition of "plateau", and the test is unambiguously well-defined without baseline data. + +- **2026-05-21 — D6 — Convergence-trial classification thresholds.** Locked: "early_held" when winner's optuna_trial_number ≤ 50% of max AND at least one trial in the last 25% of trial numbers has `primary_metric` within 0.005 of the winner (the observable "plateau held" signal — the original "no improvement after" framing was tautological because the winner is by definition the global best); "late_rising" when winner's optuna_trial_number ≥ 90% of max; "noisy" otherwise (FR-4). Rejected: more granular regimes (4+ buckets) because the UX value diminishes — 3 buckets answer "do I trust this winner?" cleanly. Rejected: "no improvement after" framing — tautological (GPT-5.5 cycle 1 F7). 
Rationale: 50/90 thresholds match the project's recurring "first half / last 10%" framings in the optimization docs; the within-0.005 late-window probe is the observable signal that the optimizer's late budget found near-equivalent configs. + +- **2026-05-21 — D7 — Confidence-framing wording in digest narrative.** Locked: exact replacement string in FR-6 ("Open with the headline metric delta, immediately followed by a one-sentence confidence framing that mentions the CI band (when `` is present), the per-query outcome counts (when `` is present), and the worst-regressed query by name (when `` has regressors). Then explain *why*…"). Rejected: free-form LLM prompt that just receives the data without a wording instruction (would produce inconsistent narrative openings across studies). Rationale: the exact replacement is short, observable in `system.md`, and testable via the `test_digest_prompt_render` unit test. + +- **2026-05-21 — D8 — Baseline-trial computation deferred to Phase 2.** Locked: this feature ships per-query analytics against runner-up #2 (FR-3). The orchestrator's deferred "non-Optuna baseline trial" work moves to a separate Phase 2 spec tracked in `phase2_idea.md`. Phase 2 adds `studies.baseline_trial_id` (column), modifies the orchestrator to run a baseline trial first, and switches `confidence.per_query_outcomes.comparison_against` to "baseline" when available. Phase 2 is purely additive — no migration to undo, no API contract break. Rationale: the orchestrator change is a meaningful surface (new code path, new failure modes, new tests) that deserves its own spec cycle; bundling it here would inflate scope without proportional product value. + +- **2026-05-21 — D9 — No new `digests` schema column.** Locked: `confidence` data is computed at-read-time from `trials.per_query_metrics` on every `GET /api/v1/studies/{id}` call (FR-2). Rejected: persist `confidence` to a new `digests.confidence JSONB` column. 
Rationale: keeps source-of-truth single (the trial rows); avoids a migration on `digests`; avoids retrofitting the digest worker's write path; recompute cost is sub-millisecond for MVP1 sizes. + +- **2026-05-21 — D10 — Numpy-only bootstrap, no scipy.** Locked: implement bootstrap with `numpy.random.default_rng(42).choice + numpy.percentile` (FR-4). Rejected: `scipy.stats.bootstrap`. Rationale: numpy is already a transitive dep via pytrec_eval; scipy is also installed but adding it as a direct runtime dependency expands the package surface for no measurable benefit — `scipy.stats.bootstrap`'s bias-correction machinery is overkill for 1000-sample percentile bootstraps on N≤100 query datasets. diff --git a/docs/02_product/planned_features/feat_pr_metric_confidence/idea.md b/docs/02_product/planned_features/feat_pr_metric_confidence/idea.md index b48a5177..59dcdafb 100644 --- a/docs/02_product/planned_features/feat_pr_metric_confidence/idea.md +++ b/docs/02_product/planned_features/feat_pr_metric_confidence/idea.md @@ -92,7 +92,7 @@ Three reasons: ## Relationship to other work - **Supersedes a hypothetical `feat_study_holdout_split`** that was floated during the surfacing conversation and then deprioritized. Per-query in-sample variance and named regressors address the same approver-trust concern more directly without the small-N statistical fragility a holdout split would introduce at MVP1 judgment-set sizes. -- **Adjacent to [`feat_agent_propose_search_space`](../feat_agent_propose_search_space/idea.md)** (planned tool for the chat agent to propose a search space deterministically). Once this feature ships, that tool could optionally include "expected confidence band given historical study variance on similar templates" as part of its proposal. 
+- **Adjacent to [`feat_agent_propose_search_space`](../../../00_overview/implemented_features/2026_05_21_feat_agent_propose_search_space/feature_spec.md)** (shipped 2026-05-21 as PR #175 — a read-only agent tool that emits a deterministic starter search space from a template's `declared_params`, with optional `prior_study_id` narrowing). Once this feature lands, `propose_search_space` could optionally include "expected confidence band given historical study variance on similar templates" as part of its proposal — i.e., feed `late_trial_stddev` and the runner-up gap from prior studies into the bracket-narrowing heuristic. - **Adjacent to [`feat_llm_judgments`](../../../00_overview/implemented_features/2026_05_11_feat_llm_judgments/)** (shipped). When judgments are LLM-generated, the per-query histogram lets the approver distinguish "config got worse on these queries" from "judge is inherently uncertain on these queries" (the latter being a calibration concern that already has its own surface). This composes cleanly. - **Composes with [`feat_digest_proposal`](../../../00_overview/implemented_features/2026_05_11_feat_digest_proposal/)** (shipped). The digest narrative is the natural place for the confidence framing to live; the PR body inherits both the narrative and the structured "## Confidence" section. - **Composes with [`feat_github_pr_worker`](../../../00_overview/implemented_features/2026_05_12_feat_github_pr_worker/)** (shipped). PR-body changes are localized to `_render_pr_body_study_backed` in `backend/workers/git_pr.py`; no changes to the PR-open lifecycle or auth surface. 
diff --git a/docs/02_product/planned_features/feat_pr_metric_confidence/phase2_idea.md b/docs/02_product/planned_features/feat_pr_metric_confidence/phase2_idea.md new file mode 100644 index 00000000..aa770ecd --- /dev/null +++ b/docs/02_product/planned_features/feat_pr_metric_confidence/phase2_idea.md @@ -0,0 +1,94 @@ +# Phase 2 — Baseline-trial computation for `feat_pr_metric_confidence` + +**Date:** 2026-05-21 +**Status:** Idea — deferred from Phase 1 of [`feat_pr_metric_confidence`](feature_spec.md) at spec-gen time (Decision D8 in §19). +**Origin:** [`feat_pr_metric_confidence/feature_spec.md`](feature_spec.md) §3 Out of scope + §19 Decision log D8. Phase 1 ships per-query analytics against the runner-up #2 trial as the comparison reference. Phase 2 adds a true production-baseline comparison. + +**Depends on:** Phase 1 of `feat_pr_metric_confidence` must be merged first. Phase 2 is purely additive — no migration to undo, no API contract break. + +## Problem + +`studies.baseline_metric` exists as a column on the `studies` table (declared in `feat_study_lifecycle` Phase 1, [`backend/app/db/models/study.py:76`](../../../../backend/app/db/models/study.py#L76)) with the docstring "single non-Optuna trial run before Optuna starts; populated by the orchestrator (Phase 2)." However, **the orchestrator was never updated to populate this column** — `grep -rn "baseline_metric *=" backend/workers/ backend/app/services/` returns zero write sites. In production, `study.baseline_metric` is always `None`, and the PR body's `## Metric delta` section shows `baseline=None → achieved=X` with no `delta_pct`. + +Phase 1 of `feat_pr_metric_confidence` ships per-query analytics that compare the winner against the **runner-up #2 trial** instead of a true baseline. That comparison answers "is the winner robust or fragile vs other tried configs?" but does NOT answer "does this config regress queries that the operator's current production search behavior gets right?" 
— which is the more directly actionable approver question. + +Phase 2 closes this gap by: +1. Implementing the deferred orchestrator work — run a single non-Optuna baseline trial before Optuna starts, using the operator's current production query-template params as the "baseline" configuration. +2. Persisting that baseline as a real `Trial` row with its `per_query_metrics` populated. +3. Adding a new denormalized FK column `studies.baseline_trial_id String(36) NULL` so reads can fetch the baseline trial efficiently. +4. Switching `compute_study_confidence` to emit `comparison_against = "baseline"` when `study.baseline_trial_id IS NOT NULL`, falling back to `"runner_up"` otherwise. + +## Why deferred from Phase 1 + +- **Cross-subsystem.** Touches the orchestrator (a new "run baseline first" code path), the trials worker (no change — baseline is just another Trial row), the studies schema (new column + migration), the digest worker prompt (potentially: distinguish "baseline" vs "runner_up" in the narrative framing), and the operator UX (what does the baseline config actually MEAN — is it the current template params with default values? the previous study's winning params? a no-op?). +- **Real product-design surface.** The semantics of "baseline" need a spec-shaped decision. Options include: + - **(a) Template defaults.** The baseline trial uses the query template's `declared_params` with each param's middle-of-range value (`(low + high) / 2` for floats, the median choice for categoricals). Simple, deterministic, but may not reflect the operator's actual production config. + - **(b) Operator-supplied baseline.** The create-study request body gains an optional `baseline_params: dict[str, Any] | None` field. When provided, the orchestrator runs a baseline trial with those params before Optuna starts. When absent, no baseline runs (status quo). 
+ - **(c) Previous study's winner.** If the study has `parent_study_id` (fork lineage, MVP2 surface), the baseline is the parent's winning trial's params. When no parent, no baseline runs. +- **Statistical design surface.** Once baseline data exists, the per-query delta semantics flip from "vs runner-up" to "vs production behavior" — the regressor framing changes from "winner sacrificed this query to other tried configs" to "winner makes this query worse than production." Both are valid signals; spec needs to lock which is the default surface (likely baseline when available, runner-up otherwise). +- **Compounding orchestrator complexity.** Adding a non-Optuna trial path means the orchestrator needs to (a) not increment Optuna's trial counter for the baseline, (b) handle baseline-trial failure differently than Optuna trial failure (a failed baseline should NOT block the study; just skip the comparison surface), (c) handle the baseline-trial timeout window separately from per-trial Optuna timeouts. + +## Proposed capabilities + +### Capability 1 — Migration: add `studies.baseline_trial_id` + +- Alembic migration `00NN_studies_baseline_trial_id` (next available revision after Phase 1's `0015`). +- Schema: `baseline_trial_id String(36) NULL`. Not a formal FK (per the same rationale as `best_trial_id` in [`study.py:80-84`](../../../../backend/app/db/models/study.py#L80) — orchestrator stamps it after baseline trial completes; no enforce-at-DB constraint). +- Reversible `downgrade()` drops the column. Round-trip verified. +- No backfill — existing studies stay `baseline_trial_id IS NULL` and continue to show `comparison_against = "runner_up"` per Phase 1 fallback. + +### Capability 2 — Orchestrator runs baseline trial before Optuna + +- In [`backend/workers/orchestrator.py`](../../../../backend/workers/orchestrator.py) `start_study`, before entering the Optuna trial-enqueue loop: + 1. 
Resolve the baseline params (per the locked design decision from Phase 2 spec — options (a/b/c) above). + 2. If baseline params are non-empty, enqueue a single `run_baseline_trial(study_id, params)` Arq job. Wait for it to complete (synchronous within the start_study transaction OR await via Optuna's ask/tell sync mechanism — TBD by Phase 2 plan). + 3. Stamp `study.baseline_trial_id = ` and `study.baseline_metric = `. + 4. Proceed to the Optuna loop. +- A new worker function `run_baseline_trial` mirrors `run_trial` but does NOT call `study.ask()` / `study.tell()` — it just renders the template with the baseline params, runs the engine query, scores via `pytrec_eval`, and persists a Trial row with `optuna_trial_number = -1` (sentinel) OR some other distinguishing marker. `per_query_metrics` is persisted just like Phase 1. +- Failed baseline trial: log + proceed with the study; `baseline_trial_id` stays NULL; comparison falls back to runner-up #2. + +### Capability 3 — `compute_study_confidence` switches comparison source + +- One-line change in `backend/app/domain/study/confidence.py`: + - When `study.baseline_trial_id IS NOT NULL` AND that trial row exists AND has `per_query_metrics`: use it as the comparison reference; emit `per_query_outcomes.comparison_against = "baseline"`. + - Otherwise: fall back to runner-up #2 (Phase 1 behavior); emit `comparison_against = "runner_up"`. +- The `ConfidenceShape` Literal `comparison_against: Literal["runner_up", "baseline"]` already exists from Phase 1 — Phase 2 just unlocks the second value. +- No API contract change; no migration to undo. + +### Capability 4 — UI label switching + +- ConfidencePanel reads `confidence.per_query_outcomes.comparison_against` and renders either "vs runner-up" or "vs baseline" as the label on the outcome chips + regressor table heading. +- Tooltip on the label distinguishes the two semantics ("Runner-up: the second-best trial in this study. 
Baseline: a no-tuning trial run with your production template params before Optuna started."). + +### Capability 5 — Digest narrative prompt extension + +- The `` block in `digest_narrative.user.jinja` already emits `comparison_against` from Phase 1 — no template change. The system prompt's confidence-framing guidance may benefit from a sentence about how the narrative should call out "regressed vs production baseline" differently from "regressed vs runner-up" — but that's a UX call for Phase 2 spec. + +## Scope signals + +- **Backend:** Migration (1 column) + orchestrator change (~50-100 LOC) + new `run_baseline_trial` worker (~150 LOC mirroring `run_trial`) + 1-line change in `compute_study_confidence` + new error paths (failed baseline) + tests at every layer. ~500-800 LOC. +- **Frontend:** ~20 LOC to switch the label in `` + tooltip text + 1 new test case. +- **Migration:** 1 additive Alembic migration adding `studies.baseline_trial_id`. +- **Config:** None. +- **Audit events:** N/A in MVP1; MVP2 may want to audit baseline_trial creation as part of the study lifecycle events. +- **New dependencies:** None. + +## Relationship to other work + +- **Builds on** [`feat_pr_metric_confidence`](feature_spec.md) (Phase 1 of this feature). Phase 1 must merge first so the `ConfidenceShape` and `compute_study_confidence` infrastructure exists for Phase 2 to extend. +- **Composes with** [`feat_study_lifecycle`](../../../00_overview/implemented_features/2026_05_10_feat_study_lifecycle/feature_spec.md) — Phase 2 retroactively implements the "Phase 2" baseline-trial work that the study_lifecycle spec promised but deferred. Now it's a separate feature with its own spec cycle. +- **Composes with** [`feat_create_study_search_space_builder`](../../../00_overview/implemented_features/2026_05_20_feat_create_study_search_space_builder/feature_spec.md) — if Phase 2 picks design option (b) (operator-supplied baseline_params), the create-study modal gains a new optional input. 
The search-space builder is the natural place for that input. + +## Open questions for /spec-gen (Phase 2) + +1. **Baseline semantics** — Which of (a) template defaults, (b) operator-supplied, (c) parent-study winner is the locked default? Recommended: (b) operator-supplied with a fallback to (a) template defaults when not provided. +2. **Synchronous vs async baseline** — Does the orchestrator BLOCK on the baseline trial completing before enqueueing Optuna trials, or does it dispatch both in parallel? Recommended: synchronous (the baseline is a one-shot fast trial; Optuna can wait the extra 2-5 seconds). +3. **Baseline-trial failure handling** — Does a failed baseline fail the study OR proceed without baseline data? Recommended: proceed without (the baseline is informational, not load-bearing; failing the entire study because production-config-baseline failed would be a regression). +4. **`optuna_trial_number = -1` sentinel** — How does the existing trial-listing UI handle a trial with `optuna_trial_number = -1`? The Optuna RDB may not tolerate negative trial numbers. Alternative: a separate `baseline_trials` table; or a `trials.is_baseline` boolean. Recommended: investigate during Phase 2 spec — likely a `trials.is_baseline BOOLEAN NOT NULL DEFAULT FALSE` flag is cleaner than a sentinel. + +## Trigger to start Phase 2 + +Phase 2 unlocks once: +- Phase 1 (`feat_pr_metric_confidence`) is merged to main. +- Operator feedback on the runner-up comparison surface confirms that production-baseline comparison would be more valuable (i.e., operators ask for "compare to what we currently ship, not just other tried configs"). +- A Phase 2 design call decides between the three baseline semantics options above. 
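
The recommended answer to open question 1 (operator-supplied params, falling back to template defaults) could be sketched roughly as follows. This is an illustration under stated assumptions: the shape of `declared_params` is not shown in this idea file, so the `low` / `high` / `choices` keys and both function names are hypothetical, and "median choice" is interpreted as the middle index of the choice list.

```python
from typing import Any, Dict, Optional

# Hypothetical shape: {"param": {"low": ..., "high": ...}} for numeric ranges,
# {"param": {"choices": [...]}} for categoricals.
DeclaredParams = Dict[str, Dict[str, Any]]


def template_default_params(declared: DeclaredParams) -> Dict[str, Any]:
    """Option (a): midpoint of each numeric range, middle choice for categoricals."""
    defaults: Dict[str, Any] = {}
    for name, spec in declared.items():
        if "choices" in spec:
            choices = spec["choices"]
            # Assumption: for even-length lists, take the upper-middle entry.
            defaults[name] = choices[len(choices) // 2]
        else:
            defaults[name] = (spec["low"] + spec["high"]) / 2
    return defaults


def resolve_baseline_params(
    baseline_params: Optional[Dict[str, Any]],
    declared: DeclaredParams,
) -> Dict[str, Any]:
    """Option (b) with fallback to (a): prefer operator-supplied params when present."""
    if baseline_params:
        return dict(baseline_params)
    return template_default_params(declared)


declared = {
    "title_boost": {"low": 1.0, "high": 5.0},
    "operator": {"choices": ["and", "or", "dismax"]},
}
print(resolve_baseline_params(None, declared))
print(resolve_baseline_params({"title_boost": 2.0, "operator": "and"}, declared))
```

Keeping the resolution in one pure function would let the Phase 2 spec change the locked default without touching the orchestrator's enqueue path.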
diff --git a/docs/02_product/planned_features/feat_pr_metric_confidence/pipeline_status.md b/docs/02_product/planned_features/feat_pr_metric_confidence/pipeline_status.md new file mode 100644 index 00000000..4bee1c69 --- /dev/null +++ b/docs/02_product/planned_features/feat_pr_metric_confidence/pipeline_status.md @@ -0,0 +1,29 @@ +# Pipeline Status — PR Metric Confidence (Phase 1) + +## Idea +- Status: Complete +- File: [`idea.md`](idea.md) +- Preflighted: 2026-05-21 — 1 patch applied (stale `feat_agent_propose_search_space` link refreshed to `implemented_features/` path) + +## Spec +- Status: Approved (this skill run) +- Date: 2026-05-21 +- File: [`feature_spec.md`](feature_spec.md) +- Cross-model review: GPT-5.5 converged at cycle 3 (22 findings total across 3 cycles — 5 High + 13 Medium + 4 Low — all accepted and patched) + - Cycle 1: 12 findings (5 High, 6 Medium, 1 Low) — all accepted + - Cycle 2: 4 findings (1 High, 3 Medium) — all accepted (residual contradictions from cycle-1 patches) + - Cycle 3: 4 findings (0 High, 3 Medium, 1 Low) — all accepted (propagation cleanups; convergence on 0 High = stop) +- Phases: 2 total, 1 covered by spec + - Phase 1 (this spec): per-query persistence + 4-surface analytics (StudyDetail, PR body, ConfidencePanel, digest prompt) against runner-up #2 comparison reference + - Phase 2 (deferred — tracked in [`phase2_idea.md`](phase2_idea.md)): orchestrator baseline-trial work + `studies.baseline_trial_id` column; switches comparison to true production baseline when available + +## Plan +- Status: Not started +- Next: `/impl-plan-gen docs/02_product/planned_features/feat_pr_metric_confidence/feature_spec.md` + +## Implementation +- Status: Not started + +## Branch context +- Working on: `feat_pr_metric_confidence` (branch created at spec-gen start) +- Carries: bug_e2e_target_dropdown_flake folder rename (uncommitted, from earlier preflight) + feat_pr_metric_confidence/idea.md preflight patch + feature_spec.md (new) + 
phase2_idea.md (new) + this file (new) From 20b9b4320935b68c04246b548a63f99dd35b2a35 Mon Sep 17 00:00:00 2001 From: SoundMindsAI Date: Thu, 21 May 2026 09:04:28 -0400 Subject: [PATCH 03/17] docs: revise chore_study_default_stop_conditions with measured per-trial cost Previously the idea recommended max_trials=200 by analogy to Karpathy's overnight framing without grounding in real RelyLoop trial cost. Real measured data from the dev DB: - 10 complete trials across 5 seeded demo studies (5-query sets each) - p50 / p95 trial cost: 1100 ms / 1200 ms - 60s timeout is 50x larger than actual per-trial cost Implication: trials are so cheap that wall-clock is essentially never the binding constraint. Even at the pessimistic 30s/trial estimate for 200-query production sets, an 8-hour run does ~3,840 trials -- well past TPE's diminishing returns. Reframed accordingly: - max_trials is the primary lever, driven by search-space dimensionality - time_budget_min demoted to a safety net for slow clusters - Tier B presets renamed Quick/Standard/Overnight -> Focused/Standard/Deep, keyed off param count (1-2 / 3-5 / 6+) rather than wall-clock vibes - Glossary copy now cites real measured + extrapolated wall-clock estimates - New Calibration note section captures the measurement + recipe for re-measuring against production workloads Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/00_overview/MVP1_DASHBOARD.md | 2 +- .../idea.md | 126 +++++++++++++----- 2 files changed, 90 insertions(+), 38 deletions(-) diff --git a/docs/00_overview/MVP1_DASHBOARD.md b/docs/00_overview/MVP1_DASHBOARD.md index 10b702ef..ba4cfa3b 100644 --- a/docs/00_overview/MVP1_DASHBOARD.md +++ b/docs/00_overview/MVP1_DASHBOARD.md @@ -124,7 +124,7 @@ _None._ | [feat_config_repo_baseline_tracking](../02_product/planned_features/feat_config_repo_baseline_tracking/idea.md) | Feature | RelyLoop does not track which configuration is currently live in production. 
When a proposal's PR merges, the merge webhook at [`backend/app/api/webhooks/github.py:187-191`](../../backend/app/api/webh | — | Idea — surfaced during the 2026-05-21 Karpathy-loop audit. | | [feat_digest_executable_followups](../02_product/planned_features/feat_digest_executable_followups/idea.md) | Feature | The digest worker's LLM contract at [`backend/workers/digest.py:168-189`](../../backend/workers/digest.py) defines `suggested_followups` as a flat `array of string`: | — | Idea — surfaced during the 2026-05-21 Karpathy-loop audit. | | [feat_study_clone_from_previous](../02_product/planned_features/feat_study_clone_from_previous/idea.md) | Feature | A relevance engineer's normal workflow after the first study completes: | — | Idea — surfaced during a UX review of parameter-tuning ergonomics on 2026-05-19. | -| [chore_study_default_stop_conditions](../02_product/planned_features/chore_study_default_stop_conditions/idea.md) | Chore | The server-side `StudyConfigSpec` validator at [`backend/app/api/v1/schemas.py:572-580`](../../backend/app/api/v1/schemas.py) correctly **requires** at least one of `max_trials` or `time_budget_min` — | — | Idea — surfaced during the 2026-05-21 Karpathy-loop audit of the Studies workflow. | +| [chore_study_default_stop_conditions](../02_product/planned_features/chore_study_default_stop_conditions/idea.md) | Chore | The server-side `StudyConfigSpec` validator at [`backend/app/api/v1/schemas.py:572-580`](../../backend/app/api/v1/schemas.py) correctly **requires** at least one of `max_trials` or `time_budget_min` — | — | Idea — surfaced during the 2026-05-21 Karpathy-loop audit; recommendation grounded in measured per-trial cost from the local dev DB. | | [chore_template_library_expansion](../02_product/planned_features/chore_template_library_expansion/idea.md) | Chore | Three connected gaps: | — | Idea — surfaced during a UX review of parameter-tuning ergonomics on 2026-05-19. 
| ## Dependency graph diff --git a/docs/02_product/planned_features/chore_study_default_stop_conditions/idea.md b/docs/02_product/planned_features/chore_study_default_stop_conditions/idea.md index 822ca172..4d6488e0 100644 --- a/docs/02_product/planned_features/chore_study_default_stop_conditions/idea.md +++ b/docs/02_product/planned_features/chore_study_default_stop_conditions/idea.md @@ -1,69 +1,121 @@ -# Study Default Stop Conditions — recommended `max_trials` + `time_budget_min` defaults at the create-study surfaces +# Study Default Stop Conditions — recommended `max_trials` defaults driven by search-space dimensionality, with `time_budget_min` as a safety net -**Date:** 2026-05-21 -**Status:** Idea — surfaced during the 2026-05-21 Karpathy-loop audit of the Studies workflow. -**Origin:** Standalone audit at `~/.claude/plans/compressed-sparking-hamming.md` — the "within-study loop" section. Verified live via grep of [`backend/app/api/v1/schemas.py:550-580`](../../../../backend/app/api/v1/schemas.py) + [`ui/src/components/studies/create-study-modal.tsx:98-100`](../../../../ui/src/components/studies/create-study-modal.tsx). +**Date:** 2026-05-21 (revised after empirical measurement) +**Status:** Idea — surfaced during the 2026-05-21 Karpathy-loop audit; recommendation grounded in measured per-trial cost from the local dev DB. +**Origin:** Standalone audit at `~/.claude/plans/compressed-sparking-hamming.md`. Verified live via grep of [`backend/app/api/v1/schemas.py:550-580`](../../../../backend/app/api/v1/schemas.py) + [`ui/src/components/studies/create-study-modal.tsx:98-100`](../../../../ui/src/components/studies/create-study-modal.tsx); recommendation calibrated against `SELECT percentile_cont` on `trials.duration_ms` from the seeded dev DB (data section below). **Depends on:** None. Pure decision-support change at the create-study surfaces; no schema migration, no service-layer behavior change. 
## Problem -The server-side `StudyConfigSpec` validator at [`backend/app/api/v1/schemas.py:572-580`](../../../../backend/app/api/v1/schemas.py) correctly **requires** at least one of `max_trials` or `time_budget_min` — so studies cannot be created with no stop condition. The system is safe. What it is not is **opinionated** about what a sensible overnight run looks like. +The server-side `StudyConfigSpec` validator at [`backend/app/api/v1/schemas.py:572-580`](../../../../backend/app/api/v1/schemas.py) correctly **requires** at least one of `max_trials` or `time_budget_min` — studies cannot be created with no stop condition. The system is safe. What it is not is **opinionated** about what good values are. Operators pick numbers by intuition, the LLM agent picks numbers with no project guidance, and the result is studies that either stop before TPE warms up or burn budget past the point of diminishing returns. -Today, the two paths a study gets created on each surface this problem differently: +Two surfaces today: -1. **The create-study wizard** at [`ui/src/components/studies/create-study-modal.tsx:98-100`](../../../../ui/src/components/studies/create-study-modal.tsx) declares both fields as optional empty inputs (`max_trials?: number | ''` and `time_budget_min?: number | ''`). It pre-fills `parallelism: 4` at [line 136](../../../../ui/src/components/studies/create-study-modal.tsx#L136) but leaves both stop-condition inputs blank. A user creating a study via the wizard hits "Submit," gets the validator's 422 ("at least one of `max_trials` or `time_budget_min`"), and then types in *something* — usually whatever round number comes to mind. The Karpathy-loop discipline of "this experiment runs for exactly N trials / X minutes" is delegated entirely to the user's intuition. -2. 
**The `create_study` agent tool** at [`backend/app/agent/tools/studies/create_study.py`](../../../../backend/app/agent/tools/studies/create_study.py) reuses `CreateStudyRequest` (= `StudyConfigSpec`) verbatim. The LLM must pick a value with no project guidance — only the bare Pydantic schema constraints (`ge=1, le=100_000` for `max_trials`; `gt=0` for `time_budget_min`). There is no glossary entry or system-prompt directive that recommends a starting range. +1. **The create-study wizard** ([`ui/src/components/studies/create-study-modal.tsx:98-100`](../../../../ui/src/components/studies/create-study-modal.tsx)) declares both fields as optional empty inputs. It pre-fills `parallelism: 4` at [line 136](../../../../ui/src/components/studies/create-study-modal.tsx#L136) but leaves both stop-condition inputs blank. +2. **The `create_study` agent tool** ([`backend/app/agent/tools/studies/create_study.py`](../../../../backend/app/agent/tools/studies/create_study.py)) reuses `CreateStudyRequest` verbatim. The LLM picks a value with no recommended range from the system prompt or glossary. -The compounding observation: the only existing per-trial time-box (`trial_timeout_s`, default 60s via [`backend/app/core/settings.py:282`](../../../../backend/app/core/settings.py)) is **the right shape** for Karpathy-loop discipline. The missing layer is a **per-study time-box default** with a recommended value, plus a wizard that surfaces "what overnight looks like" as a one-click preset. +## The measurement that drives the recommendation -Karpathy's loop runs roughly 100–120 experiments per 8-hour overnight session. RelyLoop's per-trial timeout is 60s. With `parallelism=4` and assume average 30s actual cost per trial (ES queries return faster than 60s in the common case), an 8-hour overnight session at full parallelism is `8 × 3600 × 4 / 30 = 3,840` trials — which is far more than Karpathy needs because each trial is much cheaper than ML training. 
A sensible default for an "overnight" preset is much lower than the upper bound and should match what TPE actually benefits from. Per Optuna docs and [`backend/app/eval/optuna_runtime.py:116-157`](../../../../backend/app/eval/optuna_runtime.py): pruning kicks in only at `max_trials >= 50`; TPE warms up around 10 trials; diminishing returns past 200–500 for most low-dimensional search spaces. +Real per-trial cost on the dev stack as of 2026-05-21, across 5 seeded demo studies × 2 trials each (`SELECT … FROM trials WHERE status='complete'`): + +| Metric | Value | Notes | +|---|---|---| +| `n_complete_trials` | 10 | small sample but tightly clustered | +| `avg(duration_ms)` | 949 ms | | +| `p50(duration_ms)` | 1,100 ms | | +| `p95(duration_ms)` | 1,200 ms | | +| `max(duration_ms)` | 1,200 ms | | +| Query set size for those studies | **5 queries each** | seed data | +| Cluster | local Docker ES 9.4 + OS 2.18 | | +| Trial timeout configured | 60s (default) | **50× larger than the p95 actual** | +| Parallelism configured | 4 (default) | | + +Four of the five studies hit ~1.15s per trial against the seeded ES/OS clusters; one study (`tune-product-title-boost-baseline-7ce587`) hit ~144ms, likely cache-warmed or hitting a smaller index. + +**Cost scaling estimate** (linear-ish in query-set size; `_msearch` parallelizes server-side but ES overhead doesn't vanish): + +| Query-set size | Expected per-trial cost | Source | +|---|---|---| +| 5 queries (seed) | ~1.1s | measured | +| 50 queries (tutorial) | ~3–5s | extrapolated | +| 200 queries (production) | ~10–30s | extrapolated; managed cloud could push higher with network latency | + +**What this means for wall-clock budgeting:** with `parallelism=4`, even at the pessimistic 30s-per-trial number, **an 8-hour overnight run completes ~3,840 trials** — well past TPE's diminishing returns for any low-dimensional search space RelyLoop typically optimizes. 
Trials are so cheap that the wall-clock budget is essentially never the binding constraint. The binding constraint is **trial count driven by search-space dimensionality**. + +| TPE convergence behavior | Trial-count range | +|---|---| +| Warmup phase (TPE samples randomly) | 1–10 | +| MedianPruner becomes active per [`optuna_runtime.py:116-157`](../../../../backend/app/eval/optuna_runtime.py) | ≥50 | +| 1–2 param search space — typical convergence | ~50 | +| 3–5 param search space — typical convergence | ~200 | +| 6–10 param search space — typical convergence | ~500–1000 | +| 10+ param search space — typical convergence | 1000–2000 | + +Past those numbers TPE keeps sampling but the marginal lift drops fast. ## Proposed capabilities -Tiered. Tier A is the small UI change that captures most of the leverage. Tier B is the optional preset selector. +Tiered. Tier A is the wizard pre-fill + glossary copy. Tier B is the preset selector keyed off search-space dimensionality. Both are calibrated against the measured per-trial cost above. ### Tier A — wizard pre-fill + recommended-default copy -- **Wizard pre-fill on Step 5.** Set the form default for `max_trials` to **200** when the input is empty on first render. Keep `time_budget_min` empty (so the user explicitly opts in to either kind of cap). Reasoning: 200 is well past TPE warmup (10) and median-pruner activation (50), within Optuna's diminishing-returns sweet spot, and at `parallelism=4` × 30s ≈ 25 minutes wall-clock — short enough for an interactive session, long enough to be meaningful. -- **Glossary copy update** in [`ui/src/lib/glossary.ts`](../../../../ui/src/lib/glossary.ts) for the existing `study.max_trials` + `study.time_budget_min` keys. Add a one-sentence recommendation: "200 trials is a sensible default for a first study on a low-dimensional search space; 500–1000 for overnight runs." 
-- **InfoTooltip surfaces the recommendation.** The wizard already wires `` ([`create-study-modal.tsx:851`](../../../../ui/src/components/studies/create-study-modal.tsx#L851)) and `study.time_budget_min` ([line 862](../../../../ui/src/components/studies/create-study-modal.tsx#L862)). The glossary update propagates automatically via the existing `InfoTooltip` component. -- **System prompt entry** in [`prompts/orchestrator.system.md`](../../../../prompts/orchestrator.system.md) — add a sentence to the Studies tools section: "When the user has not specified a stop condition, propose `max_trials=200` as a first study or `max_trials=500–1000` (or `time_budget_min=240–480`) for overnight runs." +- **Pre-fill `max_trials = 200`** on Step 5 of the wizard. Justification: 200 covers the TPE convergence sweet spot for 3–5 param search spaces (the most common shape, given the template's `declared_params` typically lands here per [`backend/app/db/models/query_template.py:34-35`](../../../../backend/app/db/models/query_template.py)). At the measured ~1.1s/trial cost with parallelism=4, a 200-trial study completes in **<1 minute** on the dev stack; at the pessimistic 30s/trial estimate for 200-query production sets, it completes in **~25 minutes**. Both are reasonable interactive sessions. Operators with smaller (1–2 param) or larger (6+ param) search spaces can edit downward / upward; the preset selector (Tier B) makes this one-click. +- **Leave `time_budget_min` empty** — `max_trials` is the primary cap; `time_budget_min` is only useful as a safety net for managed clusters where per-trial cost might unexpectedly balloon. Operators who want a wall-clock cap can opt in. +- **Glossary updates** in [`ui/src/lib/glossary.ts`](../../../../ui/src/lib/glossary.ts) for the existing `study.max_trials` + `study.time_budget_min` keys. New copy for `study.max_trials`: + > "Total trials to run before stopping. 
Sized by your search-space dimensionality: 50 for 1–2 params, 200 for 3–5 params (typical), 500–1000 for 6+ params. TPE's diminishing returns kick in past these. With default parallelism=4 and a ~1s/trial cost on a small query set, 200 trials completes in under a minute; on a managed cluster with a large query set it's more like 25 minutes." + + New copy for `study.time_budget_min`: + > "Wall-clock safety cap, in minutes. Optional. Trials in RelyLoop are typically cheap (subsecond against local stacks, seconds against managed clusters), so the binding stop is almost always `max_trials`. Set this only if you want a hard ceiling on a slow cluster." +- **System prompt entry** in [`prompts/orchestrator.system.md`](../../../../prompts/orchestrator.system.md) — add to the Studies tools section: "When the user does not specify a stop condition, propose `max_trials=200` for typical 3–5 param search spaces. Scale down to 50 for 1–2 params, up to 500–1000 for 6+ params. Use `time_budget_min` only as a safety cap on slow clusters; trials are usually cheap." -### Tier B — "Quick" vs "Overnight" preset selector on Step 5 +### Tier B — dimensionality-keyed preset selector on Step 5 -- **Preset radio above the numeric inputs.** Three options: - - `Quick (50 trials, ~5 min)` — `max_trials=50, parallelism=4, trial_timeout_s=60` - - `Standard (200 trials, ~25 min)` — `max_trials=200, parallelism=4, trial_timeout_s=60` (Tier A default) - - `Overnight (max 8h, 1000 trials)` — `max_trials=1000, time_budget_min=480, parallelism=4, trial_timeout_s=60` (the first-of stop condition wins) - - `Custom` — leaves the existing fields manually editable; preset selection has no effect. -- **Selecting a preset writes the four fields and disables them** (with a "Switch to Custom" link to re-enable). This makes the Karpathy-loop preset visible and one-click; the existing manual path remains available. -- **Frontend-only state** — no new wire-value enum, no new backend logic. 
The preset selector is purely a form-prefill convenience. +Replace the empty-input pattern with a radio above the numeric fields: + +- **Focused (50 trials)** — 1–2 param search space. TPE warmup completes; MedianPruner doesn't activate (small studies skip pruning per [`optuna_runtime.py:116-157`](../../../../backend/app/eval/optuna_runtime.py)). Estimated wall-clock: ~15s on dev (5-query set), ~1 min on a managed cluster (50-query set). +- **Standard (200 trials)** — 3–5 param search space, the typical case. **Default.** Estimated wall-clock: ~1 min on dev, ~4 min on a 50-query set, ~25 min on a 200-query set with cloud latency. +- **Deep (1000 trials)** — 6+ param search space. Estimated wall-clock: ~5 min on dev, ~20 min on a 50-query set, ~2 hours on a 200-query set with cloud latency. Sets `time_budget_min=480` (8 hours) as a safety cap that almost certainly won't fire but exists as a circuit breaker. +- **Custom** — leaves the existing fields manually editable. + +The preset writes `max_trials` (+ optionally `time_budget_min` for Deep). Other config fields (`parallelism`, `trial_timeout_s`) inherit the existing settings defaults; the preset does not touch them — those are cluster-shape concerns, not "how long should this run" concerns. + +Frontend-only state; no new wire-value enum, no new backend logic. ### Out of scope -- **Adaptive parallelism** (auto-scale `parallelism` up or down based on observed trial latency) — interesting but real product-design surface. Defer. -- **A separate "Karpathy mode" preset that combines `max_trials=200` + auto-followup chaining** — that belongs to [`feat_auto_followup_studies`](../feat_auto_followup_studies/idea.md), not here. -- **Backend-side default changes** (changing `default=None` to `default=200` in the Pydantic field) — rejected. The existing validator behavior (force the user to opt in) is the right safety net for the API surface. 
Backend defaults would silently apply to legacy callers without an upgrade signal; wizard pre-fill is the right place. +- **Adaptive parallelism** based on observed trial latency — interesting but real product surface; defer. +- **A separate "Karpathy mode" preset that combines `Deep` + auto-chained follow-ups** — belongs to [`feat_auto_followup_studies`](../feat_auto_followup_studies/idea.md), not here. +- **Backend-side Pydantic default changes** — rejected. The existing validator (force-explicit at the API layer) is the right safety net; only the wizard and the system prompt should opinion-set, so legacy callers aren't surprised. ## Scope signals -- **Backend:** Tier A: ~5 LOC in [`prompts/orchestrator.system.md`](../../../../prompts/orchestrator.system.md). Tier B: nothing. -- **Frontend:** Tier A: ~15 LOC (form default + 2 glossary entries + 1 test asserting the pre-fill renders). Tier B: ~150 LOC (preset radio + 3 vitest cases asserting each preset writes the right field bundle + 1 case for Custom mode). +- **Backend:** ~5 LOC in [`prompts/orchestrator.system.md`](../../../../prompts/orchestrator.system.md). +- **Frontend:** Tier A: ~20 LOC (form default + 2 glossary entries + 1 vitest case). Tier B: ~150 LOC (preset radio + 4 vitest cases). - **Migration:** none. - **Config:** none. - **Audit events:** N/A. -- **Tests:** Tier A: 1 vitest case in `create-study-modal.test.tsx` asserting the `max_trials` field renders with `200` by default. Tier B: 4 cases (3 presets + custom). +- **Tests:** Tier A: 1 vitest case asserting the `max_trials` field renders with `200` by default. Tier B: 4 cases (3 presets write the expected field bundle + 1 Custom mode preserves manual edits). 
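The wall-clock figures quoted for the presets all follow from one piece of arithmetic: trials divided into waves of `parallelism`, times the per-trial cost. A minimal sketch of that estimate, assuming perfectly even wave packing and no scheduler overhead (both simplifications); the function names are illustrative and do not exist in the codebase:

```python
import math


def estimate_wall_clock_s(max_trials: int, per_trial_s: float, parallelism: int = 4) -> float:
    """Rough wall-clock estimate in seconds for a study.

    Assumes trials run in full waves of `parallelism` and every trial
    costs `per_trial_s`. Real runs will vary with pruning and latency.
    """
    waves = math.ceil(max_trials / parallelism)
    return waves * per_trial_s


def trials_in_budget(budget_s: float, per_trial_s: float, parallelism: int = 4) -> int:
    """How many trials fit in a wall-clock budget under the same assumptions."""
    waves = int(budget_s // per_trial_s)
    return waves * parallelism


if __name__ == "__main__":
    # Standard preset at the pessimistic 30 s/trial production estimate.
    print(estimate_wall_clock_s(200, 30.0) / 60, "minutes")
    # Trials that fit in an 8-hour overnight window at 30 s/trial.
    print(trials_in_budget(8 * 3600, 30.0), "trials")
```

With the numbers from this idea file, 200 trials at 30 s/trial and `parallelism=4` comes to 1,500 s (25 minutes), and an 8-hour window at the same cost fits 3,840 trials, which is why `max_trials` rather than wall-clock time is treated as the binding constraint.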
+ +## Calibration note for future revisions + +The recommended-default numbers in this idea (50 / 200 / 1000 trials; 8h time-budget safety cap) are calibrated against: + +- Measured per-trial p95 of **1,200 ms** on the local dev stack (5-query seed sets, ES 9.4 + OS 2.18, default parallelism=4) +- Linear-ish scaling assumption for larger query sets +- Standard TPE convergence behavior for low-dimensional search spaces + +If RelyLoop's actual production query sets prove dramatically larger or slower than these estimates, the preset wall-clock numbers in the glossary copy need updating. The trial-count recommendations (50 / 200 / 1000) are driven by TPE convergence, not wall-clock, and shouldn't change with cluster cost. Re-run the `percentile_cont(0.95) WITHIN GROUP (ORDER BY duration_ms) FROM trials WHERE status='complete'` query against any cluster's real workload to update the cited cost. -## Why not inline today +## Why not implemented inline today -This idea is **borderline** on the inline-fix rubric in [`CLAUDE.md`](../../../../CLAUDE.md) "Inline-fix vs idea-file rubric." Tier A alone is ≤50 LOC and touches a single subsystem — by the rubric it should be **implemented inline** on the next PR that touches the wizard. The reason it's captured as an idea file rather than landed inline: +Tier A alone is ≤30 LOC and touches the wizard + the system prompt — borderline drive-by per [`CLAUDE.md`](../../../../CLAUDE.md). The reason it's captured as an idea file rather than landed inline: -1. **Product call on the recommended-default number.** "200" is defensible but not obviously right — 100, 250, 500 are all candidates. Picking the wrong number means every new study created via the wizard gets that number, which is a one-way change. Worth a deliberate decision rather than a drive-by commit. -2. **Tier B is the more interesting unit.** A preset selector that surfaces "Quick / Standard / Overnight" as one-click options is a real UX addition, not a tweak. 
Pairing the default tweak (Tier A) with the preset (Tier B) in one PR gives reviewers the full picture; landing Tier A alone in a drive-by would leave the bigger UX gap for later. -3. **Cross-surface coordination.** Tier A modifies both the wizard AND the orchestrator system prompt. Two surfaces is the upper bound of "drive-by"; doing it as a planned chore keeps the change traceable. +1. **Product call on the recommended-default number.** "200" is grounded in TPE convergence + measured per-trial cost, but other defensible numbers exist (100, 250, 500). The decision is a one-way change once shipped to every new study; worth deliberate scrutiny. +2. **Tier B is the more interesting unit.** A dimensionality-keyed preset selector is the real UX addition; pairing it with the default tweak (Tier A) in one PR gives reviewers the full picture. +3. **The glossary copy is operator-facing documentation.** Wall-clock estimates in user-visible help text need spec-shaped review for accuracy. The numbers cited here are calibrated against the dev stack; production operators will read them and form expectations. ## Relationship to other work -- **Substrate for [`feat_auto_followup_studies`](../feat_auto_followup_studies/idea.md)** — that feature relies on every study having a known finite stop condition so chained follow-ups inherit a sensible budget. The default-stop-condition work makes "chained study with depth=3" mean something concrete (e.g., "three 200-trial studies, ~75 min total"). -- **Aligned with [`feat_pr_metric_confidence`](../feat_pr_metric_confidence/idea.md)** — convergence-trajectory and late-trial noise-floor analytics in the PR body are most meaningful when the operator knows the study had room to converge. A 50-trial study with "best found at trial 49" reads very differently from a 200-trial study with "best found at trial 87." 
-- **Composes with [`feat_create_study_search_space_builder`](../../../00_overview/implemented_features/2026_05_20_feat_create_study_search_space_builder/)** (shipped 2026-05-20) — the search-space builder is the substantive Step 4. This chore polishes Step 5, the "how long do we run it" surface. +- **Substrate for [`feat_auto_followup_studies`](../feat_auto_followup_studies/idea.md)** — that feature relies on every study in a chain having a known finite stop condition so chained follow-ups inherit a sensible budget. "Standard preset = 200 trials × depth=3 = ~3 min total compounding on dev" is a concrete story; without sane defaults the chain has no predictable footprint. +- **Aligned with [`feat_pr_metric_confidence`](../feat_pr_metric_confidence/idea.md)** — convergence-trajectory analytics in the PR body are most meaningful when the operator knows the study had room to converge. 200-trial studies give those analytics meaningful signal; 50-trial studies often don't. +- **Composes with [`feat_create_study_search_space_builder`](../../../00_overview/implemented_features/2026_05_20_feat_create_study_search_space_builder/)** (shipped 2026-05-20) — the search-space builder counts declared params in real time; that count could feed the preset selector directly ("you have 4 params declared — Standard preset is recommended"). Composable enhancement, not required. From 90e1297e119300e39a4ff00af594e7d1a2ba1149 Mon Sep 17 00:00:00 2001 From: SoundMindsAI Date: Thu, 21 May 2026 09:07:05 -0400 Subject: [PATCH 04/17] docs(plan): approve feat_pr_metric_confidence implementation_plan + advance pipeline to PLAN-complete MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit GPT-5.5 cross-model review converged at cycle 3 (17 total findings: 2 High + 8 Medium + 7 Low, all accepted + patched). 
Plan covers 9 stories across 2 epics — Epic 1 (6 backend stories: migration → worker write → domain helper → API enrichment → PR body → digest prompt) and Epic 2 (3 frontend stories: types → ConfidencePanel → E2E). High-severity catches: - Cycle 1 F6: ConfidenceShape moved into Story 1.3's domain module (Story 1.4 only re-exports + adds the field to StudyDetail) — fixes the sequencing impossibility where Story 1.3 would import a type Story 1.4 defines. - Cycle 1 F7: top_regressors split into compute_outcome_summary (no text) + build_regressor_rows (hydrates after Q4) — fixes the chicken-and-egg between candidate qids and the query_text fetch. - Cycle 2 F1: confidence.py no longer imports ObjectiveMetric from schemas.py (uses bare str; the upstream value is already validated at create-study) — breaks the schemas ↔ confidence circular import. - Cycle 2 F2: ci_95 + n_queries decoupled from runner-up gate per AC-16 (the 1-complete-trial case requires CI to populate from winner alone). Next: /impl-execute --all on this branch. 
Co-Authored-By: Claude Opus 4.7 --- docs/00_overview/MVP1_DASHBOARD.md | 18 +- docs/00_overview/mvp1_dashboard.html | 22 +- .../implementation_plan.md | 1208 +++++++++++++++++ .../pipeline_status.md | 12 +- 4 files changed, 1238 insertions(+), 22 deletions(-) create mode 100644 docs/02_product/planned_features/feat_pr_metric_confidence/implementation_plan.md diff --git a/docs/00_overview/MVP1_DASHBOARD.md b/docs/00_overview/MVP1_DASHBOARD.md index ba4cfa3b..86ca2f00 100644 --- a/docs/00_overview/MVP1_DASHBOARD.md +++ b/docs/00_overview/MVP1_DASHBOARD.md @@ -6,14 +6,14 @@ _Reflects feature-folder state as of **2026-05-21** (latest mtime of any planned ## Next up -**[feat_pr_metric_confidence](../02_product/planned_features/feat_pr_metric_confidence/feature_spec.md)** — Feature, currently in **Spec** +**[feat_pr_metric_confidence](../02_product/planned_features/feat_pr_metric_confidence/feature_spec.md)** — Feature, currently in **Plan** > Approvers reading a study-backed PR see a "## Confidence" section directly between the existing "## Metric delta" and "## Config diff" sections. -Spec exists; run /pipeline to generate the implementation plan + ship +Plan approved; run /impl-execute to ship ```bash -/pipeline docs/02_product/planned_features/feat_pr_metric_confidence --auto +/impl-execute docs/02_product/planned_features/feat_pr_metric_confidence/implementation_plan.md --all ``` ## MVP1 Progress @@ -106,16 +106,16 @@ Spec exists; run /pipeline to generate the implementation plan + ship _None._ -### Plan (0) - -_None._ - -### Spec (1) +### Plan (1) | Feature | Type | One-liner | Depends on | Status | |---|---|---|---|---| | [feat_pr_metric_confidence](../02_product/planned_features/feat_pr_metric_confidence/feature_spec.md) | Feature | Approvers reading a study-backed PR see a "## Confidence" section directly between the existing "## Metric delta" and "## Config diff" sections. 
| — | [PR #41](https://github.com/SoundMindsAI/relyloop/pull/41) merged 2026-05-11 | +### Spec (0) + +_None._ + ### Idea (6) | Feature | Type | One-liner | Depends on | Status | @@ -139,7 +139,7 @@ graph LR classDef spec fill:#dbeafe,stroke:#1e40af,color:#1e40af; classDef idea fill:#f1f5f9,stroke:#334155,color:#334155; feat_pr_metric_confidence["pr metric confidence"] - class feat_pr_metric_confidence spec; + class feat_pr_metric_confidence plan; infra_foundation["foundation"] class infra_foundation done; feat_study_lifecycle["study lifecycle"] diff --git a/docs/00_overview/mvp1_dashboard.html b/docs/00_overview/mvp1_dashboard.html index 860f5984..53d2aa8d 100644 --- a/docs/00_overview/mvp1_dashboard.html +++ b/docs/00_overview/mvp1_dashboard.html @@ -370,11 +370,11 @@

 RelyLoop MVP1 Dashboard
-Next up — Feature, currently in Spec
+Next up — Feature, currently in Plan
 Approvers reading a study-backed PR see a "## Confidence" section directly between the existing "## Metric delta" and "## Config diff" sections.
-Spec exists; run /pipeline to generate the implementation plan + ship
-/pipeline docs/02_product/planned_features/feat_pr_metric_confidence --auto
+Plan approved; run /impl-execute to ship
+/impl-execute docs/02_product/planned_features/feat_pr_metric_confidence/implementation_plan.md --all
@@ -504,7 +504,12 @@

 Idea 6
-Spec 1
+Spec 0
+Plan 1

@@ -519,11 +524,6 @@

Spec 1

-
-

Plan 0

- -
-

Implementing 0

@@ -1361,7 +1361,7 @@

Dependency graph (feat_ + infra_)

classDef spec fill:#dbeafe,stroke:#1e40af,color:#1e40af; classDef idea fill:#f1f5f9,stroke:#334155,color:#334155; feat_pr_metric_confidence["pr metric confidence"] - class feat_pr_metric_confidence spec; + class feat_pr_metric_confidence plan; infra_foundation["foundation"] class infra_foundation done; feat_study_lifecycle["study lifecycle"] @@ -1566,7 +1566,7 @@

Dependency graph (feat_ + infra_)

classDef spec fill:#dbeafe,stroke:#1e40af,color:#1e40af; classDef idea fill:#f1f5f9,stroke:#334155,color:#334155; feat_pr_metric_confidence["pr metric confidence"] - class feat_pr_metric_confidence spec; + class feat_pr_metric_confidence plan; infra_foundation["foundation"] class infra_foundation done; feat_study_lifecycle["study lifecycle"] diff --git a/docs/02_product/planned_features/feat_pr_metric_confidence/implementation_plan.md b/docs/02_product/planned_features/feat_pr_metric_confidence/implementation_plan.md new file mode 100644 index 00000000..c59f490e --- /dev/null +++ b/docs/02_product/planned_features/feat_pr_metric_confidence/implementation_plan.md @@ -0,0 +1,1208 @@ +# Implementation Plan — PR Metric Confidence (Phase 1) + +**Date:** 2026-05-21 +**Status:** Draft (pending GPT-5.5 cross-model review) +**Primary spec:** [`feature_spec.md`](feature_spec.md) +**Policy source(s):** [`CLAUDE.md`](../../../../CLAUDE.md), [`docs/01_architecture/data-model.md`](../../../01_architecture/data-model.md), [`docs/01_architecture/api-conventions.md`](../../../01_architecture/api-conventions.md), [`docs/01_architecture/optimization.md`](../../../01_architecture/optimization.md) + +--- + +## 0) Planning principles + +- Spec traceability first: every story/task maps to FR IDs from the spec. +- Phase gates are hard stops: Story 1.1 (migration) must merge before any other story's persistence dependency lands. +- Fail-loud tests: every degraded path from spec FR-7 is exercised by a dedicated test case. +- Match existing repo / router / domain / service / worker conventions byte-for-byte — no inventing new patterns when a precedent exists. +- Story increments are narrow enough that each can be reviewed independently (no story exceeds ~400 LOC of code+tests). 
+ +## 1) Scope traceability (FR → epics/stories) + +| FR ID | Story | Notes | +|---|---|---| +| FR-1 (persist per_query_metrics) | Epic 1 / Story 1.1 (migration) + Story 1.2 (worker write) | Migration adds nullable JSONB column + DB CHECK; worker writes on success branch only. | +| FR-2 (compute_study_confidence) | Epic 1 / Story 1.3 (domain helper) | Async function; 4-query read pattern; partial-population contract. | +| FR-3 (winner-vs-runner-up reference) | Epic 1 / Story 1.3 | Lock `comparison_against = "runner_up"` unconditionally in Phase 1. | +| FR-4 (locked thresholds + methods) | Epic 1 / Story 1.3 | All thresholds from spec §7 FR-4 + §7 FR-4a coded as module constants. | +| FR-4a (regressor threshold table) | Epic 1 / Story 1.3 | `REGRESSOR_THRESHOLDS: dict[str, float]` module constant. | +| FR-5a (StudyDetail enrichment) | Epic 1 / Story 1.4 (`ConfidenceShape` + API wiring) | New Pydantic model + `_detail()` enrichment. | +| FR-5b (PR body section) | Epic 1 / Story 1.5 (PR body + worker plumbing) | Renderer extension to `_render_pr_body_study_backed`. | +| FR-5c (ConfidencePanel UI) | Epic 2 / Story 2.2 (panel + page mount) | New component + page integration. | +| FR-5d (PR worker plumbing) | Epic 1 / Story 1.5 | Worker fetches Study + awaits `compute_study_confidence` + passes into renderer. | +| FR-6 (digest narrative prompt) | Epic 1 / Story 1.6 (digest prompt + worker plumbing) | XML blocks + system-prompt edit. | +| FR-7 (graceful degradation paths) | Epic 1 / Story 1.3 (test coverage) + Story 1.4 (response contract) + Story 2.2 (UI gating) | Every degraded sub-field is independent; tests cover each combination. | + +All 8 FRs are covered by 9 stories across 2 epics. Phase 2 (deferred orchestrator baseline-trial work) is tracked in [`phase2_idea.md`](phase2_idea.md) — no in-flight FRs left untracked. + +## 2) Delivery structure + +**Epic → Story → Tasks → DoD** (preferred — this is product-facing work with a clear backend → frontend cut). 
+ +### Conventions (RelyLoop project-specific) + +``` +- All repo functions take db: AsyncSession as first arg; use db.flush() — caller commits +- Services are async; orchestrators create job_run records (N/A here — no new service-layer orchestrator) +- Domain layer is pure: no DB access except via passed-in db handle, no I/O except via callables +- Models use Mapped[] typed columns, String(36) UUIDs (UUIDv7 generated client-side) +- Routers return typed Pydantic response models; errors use HTTPException with _err() envelope +- Settings via pydantic-settings; never hardcode LLM model names — read from Settings.openai_model +- All __init__.py exports updated via __all__ +- Migrations: sequential numeric revision IDs (0015_trials_per_query_metrics next); include downgrade(); round-trip verified +- Conventional Commits (commit-msg hook enforced) +``` + +### AI Agent Execution Protocol + +Per RelyLoop CLAUDE.md Absolute Rule #9, this plan is executed via `/impl-execute`. Each story: + +0. **Load context first**: Read `architecture.md` and `state.md` before starting. +1. **Read scope**: verify story outcome + endpoints + interfaces + DoD. +2. **Implement backend first**: migration → models → repo → domain → service → router → schemas. +3. **Run backend tests**: `make test-unit`, then targeted integration + contract for touched endpoints. +4. **Implement frontend** (Story 2.* only). +5. **Run E2E**: `cd ui && pnpm playwright test tests/e2e/...` for touched paths. +6. **Update docs/state.md** if behavior changed (Story 1.1 moves Alembic head; Story 1.3 adds new domain module). +7. **Verify migration round-trip** (Story 1.1 only). +8. **Attach evidence** in PR description: commands run, pass/fail, files changed. +9. **After the final story**, update `state.md` (Alembic head bump, feature ship status) and `architecture.md` (`backend/app/domain/study/confidence.py` is the new module worth a line). 
+ +--- + +## Epic 1 — Backend persistence, analytics, and PR/digest surfaces + +### Story 1.1 — Alembic migration `0015_trials_per_query_metrics` + +**Outcome:** The `trials` table gains a nullable JSONB column `per_query_metrics` with a CHECK constraint enforcing `IS NULL OR jsonb_typeof = 'object'`. Migration round-trips cleanly. Existing rows stay NULL (no backfill). + +**FRs:** FR-1, AC-1 setup, AC-17. + +**New files** + +| File | Purpose | +|---|---| +| `migrations/versions/0015_trials_per_query_metrics.py` | Alembic revision 0015. `upgrade()` adds the column + CHECK. `downgrade()` drops the CHECK + column. Revision string is `"0015"` (matches the 4-char convention from `0014_clusters_target_filter.py:18`). | +| `backend/tests/integration/test_trials_per_query_metrics_migration.py` | 3 round-trip tests following the pattern at [`backend/tests/integration/test_clusters_target_filter_migration.py`](../../../../backend/tests/integration/test_clusters_target_filter_migration.py) (cited in state.md as the most recent migration-test precedent). | + +**Modified files** + +| File | Change | +|---|---| +| [`backend/app/db/models/trial.py`](../../../../backend/app/db/models/trial.py) | Add `per_query_metrics: Mapped[dict[str, Any] \| None] = mapped_column(JSONB, nullable=True)` after the existing `metrics` column (line 61). Add the CHECK constraint to `__table_args__` (currently lines 34-39). Update the module docstring to mention the new column. 
| + +**Key interfaces** + +```python +# migrations/versions/0015_trials_per_query_metrics.py +from alembic import op +import sqlalchemy as sa +from sqlalchemy.dialects import postgresql + +revision: str = "0015" +down_revision: str | None = "0014" + +def upgrade() -> None: + op.add_column( + "trials", + sa.Column("per_query_metrics", postgresql.JSONB(astext_type=sa.Text()), nullable=True), + ) + op.create_check_constraint( + "trials_per_query_metrics_object_check", + "trials", + "per_query_metrics IS NULL OR jsonb_typeof(per_query_metrics) = 'object'", + ) + +def downgrade() -> None: + op.drop_constraint("trials_per_query_metrics_object_check", "trials", type_="check") + op.drop_column("trials", "per_query_metrics") +``` + +**Tasks** +1. Write the migration file at `migrations/versions/0015_trials_per_query_metrics.py` following the shape of [`migrations/versions/0014_clusters_target_filter.py`](../../../../migrations/versions/0014_clusters_target_filter.py). +2. Add the ORM column and CHECK constraint to `backend/app/db/models/trial.py`. +3. Write integration test `test_trials_per_query_metrics_migration.py` with 3 cases: + - `test_migration_adds_column_with_null_default`: pre-existing trial row stays NULL. + - `test_migration_round_trip`: `upgrade head → downgrade -1 → upgrade head` succeeds with no errors and the column reappears NULL. + - `test_check_constraint_rejects_non_object`: `INSERT ... per_query_metrics='[]'::jsonb` raises CHECK violation. The asyncpg-level error is wrapped by SQLAlchemy AsyncSession as `sqlalchemy.exc.IntegrityError`; assert on that type, then check the chained driver exception, which SQLAlchemy's asyncpg dialect attaches as `.orig.__cause__` (`.orig.__cause__.__class__.__name__ == "CheckViolationError"`); `.orig` itself is the dialect's translated DBAPI error, not the asyncpg class (cycle-3 GPT-5.5 F3 fix — SQLAlchemy doesn't surface asyncpg exceptions directly). +4. Run `.venv/bin/alembic upgrade head && .venv/bin/alembic downgrade -1 && .venv/bin/alembic upgrade head` locally. +5. Update [`state.md`](../../../../state.md) Alembic head section from `0014_clusters_target_filter` → `0015_trials_per_query_metrics`.
+ +**Definition of Done (DoD)** +- [ ] Migration file exists at `migrations/versions/0015_trials_per_query_metrics.py` with both `upgrade()` and `downgrade()`. +- [ ] `make test-integration` passes including 3 new cases in `test_trials_per_query_metrics_migration.py`. +- [ ] Migration round-trip verified on the populated dev DB. +- [ ] `state.md` updated with the new Alembic head. +- [ ] ORM `Trial` model exposes `per_query_metrics: dict[str, Any] | None`. + +--- + +### Story 1.2 — Persist `per_query_metrics` in the `run_trial` worker + +**Outcome:** On every successful trial, `pytrec_eval`'s `per_query` dict is persisted to `trials.per_query_metrics`. Failed trials leave the column NULL. + +**FRs:** FR-1, AC-1, AC-2. + +**New files** + +| File | Purpose | +|---|---| +| `backend/tests/integration/test_run_trial_per_query_persistence.py` | 2 cases — happy path persistence + failed-path NULL. Uses the existing `infra_optuna_eval` integration-test scaffold + stubbed adapter. | + +**Modified files** + +| File | Change | +|---|---| +| [`backend/workers/trials.py`](../../../../backend/workers/trials.py) | Line 440 — add `per_query_metrics=scored["per_query"]` kwarg to the `repo.create_trial(...)` call. The line currently writes `metrics=scored["aggregate"]` — the new kwarg goes immediately after it. The failed-path call at line 500 stays unchanged (it already writes `metrics={}` with no per_query_metrics — that's the intended NULL contract). | +| [`backend/app/db/repo/trial.py`](../../../../backend/app/db/repo/trial.py) | `create_trial()` signature — add `per_query_metrics: dict[str, Any] \| None = None` as a new kwarg. Pass it through to the `Trial(...)` constructor. Default None preserves existing callers (test fixtures may not provide it). 
| + +**Key interfaces** + +```python +# backend/app/db/repo/trial.py +async def create_trial( + db: AsyncSession, + *, + id: str, + study_id: str, + optuna_trial_number: int, + params: dict[str, Any], + primary_metric: float | None, + metrics: dict[str, Any], + duration_ms: int | None, + status: str, + error: str | None, + started_at: datetime | None, + ended_at: datetime | None, + per_query_metrics: dict[str, Any] | None = None, # NEW (FR-1) — default None preserves legacy callers +) -> Trial: + ... +``` + +**Tasks** +1. Read the current `create_trial` signature at `backend/app/db/repo/trial.py` and extend with the new optional kwarg. Update `Trial(...)` instantiation to pass it. +2. Modify `backend/workers/trials.py:433-446` to add `per_query_metrics=scored["per_query"]` to the success-path call. +3. Write `test_run_trial_per_query_persistence.py` with: + - `test_successful_trial_writes_per_query_metrics`: seed cluster + qs + queries + judgment list + template, enqueue 1 trial via the standard scaffold, await completion, assert `SELECT per_query_metrics FROM trials WHERE id = ?` returns non-NULL dict shaped `{qid: {ndcg, map, precision, recall, mrr: float}}`. Use `MetricCatalog` keys (`ndcg`, `precision`, `recall`, etc.) — NOT pytrec_eval wire forms. + - `test_failed_trial_leaves_per_query_metrics_null`: simulate adapter raise during trial, assert resulting Trial row has `status='failed'` AND `per_query_metrics IS NULL`. +4. Run `make test-integration` locally; verify both new cases pass. + +**Definition of Done (DoD)** +- [ ] `run_trial` worker writes `per_query_metrics` on success path (one-line addition). +- [ ] `repo.create_trial` accepts the new kwarg with default None. +- [ ] Both integration tests pass — covers AC-1 + AC-2. +- [ ] No existing tests break (the new kwarg is optional). 
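The shape assertion in `test_successful_trial_writes_per_query_metrics` can be factored into a small checker. The following is a minimal sketch under stated assumptions: `per_query_shape_problems` and `EXPECTED_METRIC_KEYS` are hypothetical names, and the sketch assumes every query carries exactly the five catalog-style keys listed in task 3, each with a float value.

```python
from typing import Any

# Catalog-style keys named in Story 1.2; treated here as the complete expected set (assumption).
EXPECTED_METRIC_KEYS = frozenset({"ndcg", "map", "precision", "recall", "mrr"})


def per_query_shape_problems(per_query_metrics: Any) -> list[str]:
    """Return human-readable problems; an empty list means the shape looks right."""
    if not isinstance(per_query_metrics, dict):
        return ["top-level value is not a dict"]
    problems: list[str] = []
    for qid, metrics in per_query_metrics.items():
        if not isinstance(metrics, dict):
            problems.append(f"{qid}: metrics entry is not a dict")
            continue
        # Unexpected keys usually mean evaluator wire forms leaked through un-normalized.
        unexpected = sorted(set(metrics) - EXPECTED_METRIC_KEYS)
        missing = sorted(EXPECTED_METRIC_KEYS - set(metrics))
        if unexpected:
            problems.append(f"{qid}: unexpected keys {unexpected}")
        if missing:
            problems.append(f"{qid}: missing keys {missing}")
        for key, value in metrics.items():
            if not isinstance(value, float):
                problems.append(f"{qid}: {key} is not a float")
    return problems


good = {"q1": {"ndcg": 0.8, "map": 0.5, "precision": 0.4, "recall": 0.9, "mrr": 1.0}}
leaky = {"q1": {"ndcg_cut_10": 0.8}}  # wire-form key instead of a catalog key
```

Returning a list of problems rather than a bare boolean makes a failing integration test report which query and which key drifted.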
+ +--- + +### Story 1.3 — Domain module `backend/app/domain/study/confidence.py` + `ConfidenceShape` Pydantic model + +**Outcome:** Pure-Python async helper `compute_study_confidence(db, study)` returns a `ConfidenceShape | None` per FR-2's contract. Every locked threshold from FR-4 + FR-4a is a module constant. Every degraded path from FR-7 has a dedicated unit test. **`ConfidenceShape` and all 7 sub-shapes are defined in this story** (cycle-1 GPT-5.5 F6 fix — Story 1.4 cannot import a type that doesn't exist yet, so the type ownership lives with the assembler). + +**FRs:** FR-2, FR-3, FR-4, FR-4a, FR-7 (test coverage). FR-5a piece: `ConfidenceShape` definition (the wiring to `StudyDetail` stays in Story 1.4). + +**New files** + +| File | Purpose | +|---|---| +| `backend/app/domain/study/confidence.py` | The domain module. Exports `compute_study_confidence`, `ConfidenceShape` + 7 sub-models, the 3 new Literals (`ConvergenceRegime`, `RunnerUpClassification`, `ComparisonAgainst`), the `CIMethod` Literal, the locked constants (`BOOTSTRAP_N=1000`, `BOOTSTRAP_SEED=42`, `BOOTSTRAP_CI_LEVEL=0.95`, `BOOTSTRAP_MIN_N_QUERIES=5`, `REGRESSOR_THRESHOLDS: dict[str, float]`, `RUNNER_UP_PLATEAU_BAND=0.005`, `LATE_TRIAL_WINDOW_FRAC=0.2`, `LATE_TRIAL_WINDOW_MIN=5`, `LATE_TRIAL_MIN_COMPLETE=10`, `EARLY_HELD_TRIAL_NUMBER_FRAC=0.5`, `EARLY_HELD_LATE_WINDOW_FRAC=0.25`, `LATE_RISING_TRIAL_NUMBER_FRAC=0.9`, `CONVERGENCE_MIN_COMPLETE=3`, `RUNNER_UP_GAP_MIN_COMPLETE=2`, `TOP_REGRESSORS_CAP=5`), and the 6 pure helper functions (see key interfaces). 
| +| `backend/tests/unit/domain/study/test_confidence.py` | 25+ unit test cases covering: bootstrap_ci_95 (seed determinism, N<5 suppression, N=20 expected interval); classify_runner_up_gap (returns full `RunnerUpGapShape \| None`; robust_plateau / sharp_peak / 2-trial edge / N<2 suppression); compute_late_trial_stddev (window math at N=10/20/50/100, N<10 suppression); classify_convergence_regime (early_held with late-window probe, late_rising at 90%, noisy fallback, N<3 suppression); compute_outcome_summary (improved/unchanged/regressed counts per FR-4a threshold table + `regressor_candidates: list[tuple[qid, winner, comparison, delta]]` sorted by absolute delta); build_regressor_rows (5-cap, query_text join via lookup arg); compute_study_confidence orchestrator (whole-object null when best_trial_id IS NULL or row missing; partial when per_query_metrics IS NULL; partial when N<5 queries; full when all data present). | + +**Modified files** + +| File | Change | +|---|---| +| [`backend/app/domain/study/__init__.py`](../../../../backend/app/domain/study/__init__.py) | Add `from . import confidence` and update `__all__` if present. | + +**Key interfaces** + +```python +# backend/app/domain/study/confidence.py +from __future__ import annotations + +from dataclasses import dataclass +from typing import Any, Literal + +import numpy as np +from pydantic import BaseModel +from sqlalchemy.ext.asyncio import AsyncSession + +from backend.app.db.models import Study, Trial + +# IMPORTANT: do NOT import ObjectiveMetric from backend.app.api.v1.schemas — that creates a +# circular import (cycle-2 GPT-5.5 F1 fix). schemas.py imports ConfidenceShape from this +# module (one-direction); reciprocal import would deadlock at app startup. HeadlineShape.metric +# uses bare `str` instead — the upstream value is already validated by the existing +# ObjectiveMetric Literal at the create-study endpoint (schemas.py:214) so the wire contract +# is preserved. 
+ +# Locked constants — every value referenced from FR-4 / FR-4a. +BOOTSTRAP_N: int = 1000 +BOOTSTRAP_SEED: int = 42 +BOOTSTRAP_CI_LEVEL: float = 0.95 +BOOTSTRAP_MIN_N_QUERIES: int = 5 +REGRESSOR_THRESHOLDS: dict[str, float] = { + "ndcg": 0.01, "precision": 0.01, "recall": 0.01, + "map": 0.02, "mrr": 0.02, +} +RUNNER_UP_PLATEAU_BAND: float = 0.005 +LATE_TRIAL_WINDOW_FRAC: float = 0.2 +LATE_TRIAL_WINDOW_MIN: int = 5 +LATE_TRIAL_MIN_COMPLETE: int = 10 +EARLY_HELD_TRIAL_NUMBER_FRAC: float = 0.5 +EARLY_HELD_LATE_WINDOW_FRAC: float = 0.25 +LATE_RISING_TRIAL_NUMBER_FRAC: float = 0.9 +CONVERGENCE_MIN_COMPLETE: int = 3 +RUNNER_UP_GAP_MIN_COMPLETE: int = 2 +TOP_REGRESSORS_CAP: int = 5 + +ConvergenceRegime = Literal["early_held", "late_rising", "noisy"] +RunnerUpClassification = Literal["robust_plateau", "sharp_peak"] +ComparisonAgainst = Literal["runner_up", "baseline"] # Phase 1 only emits "runner_up" +CIMethod = Literal["bootstrap_n1000"] + + +# Pydantic shapes — exported and re-imported by `schemas.py` in Story 1.4 to extend StudyDetail. 
+class HeadlineShape(BaseModel): + metric: str # one of `ObjectiveMetric` values per schemas.py:214 — validated upstream at create-study + value: float + k: int | None + n_queries: int | None # None when winner has per_query_metrics IS NULL + +class CIShape(BaseModel): + low: float + high: float + method: CIMethod + n_samples: int + +class RunnerUpGapShape(BaseModel): + value: float + classification: RunnerUpClassification # non-null: whole shape suppressed to None when classification can't be determined + top10_within: float + runner_up_metric: float + +class LateTrialStddevShape(BaseModel): + value: float + window_size: int + min_window_required: int # always LATE_TRIAL_MIN_COMPLETE = 10 + +class ConvergenceShape(BaseModel): + best_at_trial: int + total_trials: int + regime: ConvergenceRegime + +class RegressorRowShape(BaseModel): + query_id: str + query_text: str + winner_score: float + comparison_score: float + delta: float + +class PerQueryOutcomesShape(BaseModel): + improved: int + unchanged: int + regressed: int + comparison_against: ComparisonAgainst + top_regressors: list[RegressorRowShape] # ≤ TOP_REGRESSORS_CAP + +class ConfidenceShape(BaseModel): + headline: HeadlineShape + ci_95: CIShape | None + runner_up_gap: RunnerUpGapShape | None + late_trial_stddev: LateTrialStddevShape | None + convergence: ConvergenceShape | None + per_query_outcomes: PerQueryOutcomesShape | None + + +@dataclass(frozen=True) +class _OutcomeSummary: + """Internal — produced by `compute_outcome_summary`; consumed by orchestrator + `build_regressor_rows`.""" + improved: int + unchanged: int + regressed: int + regressor_candidates: list[tuple[str, float, float, float]] # (qid, winner_score, comparison_score, delta), sorted by abs(delta) desc, capped at TOP_REGRESSORS_CAP + + +# Pure helpers — all synchronous, take numpy arrays / dicts. +def bootstrap_ci_95(per_query_values: list[float]) -> CIShape | None: + """Percentile bootstrap with seed=42, N=1000 resamples. 
Returns None when len < BOOTSTRAP_MIN_N_QUERIES (5).""" + +def classify_runner_up_gap( + sorted_primary_metrics: list[float], # descending, winner first; len ≥ RUNNER_UP_GAP_MIN_COMPLETE +) -> RunnerUpGapShape | None: + """Returns the full RunnerUpGapShape with `value`, `classification`, `top10_within`, `runner_up_metric` populated. Returns None when len < RUNNER_UP_GAP_MIN_COMPLETE (2).""" + +def compute_late_trial_stddev( + primary_metrics_in_trial_order: list[float], +) -> LateTrialStddevShape | None: + """Returns LateTrialStddevShape with value + window_size + min_window_required. None when N < LATE_TRIAL_MIN_COMPLETE (10).""" + +def classify_convergence_regime( + winner_trial_number: int, + primary_metrics_by_trial_number: dict[int, float], # complete trials only +) -> ConvergenceShape | None: + """Returns ConvergenceShape with best_at_trial + total_trials + regime. None when N < CONVERGENCE_MIN_COMPLETE (3).""" + +def compute_outcome_summary( + winner_per_query: dict[str, dict[str, float]], + comparison_per_query: dict[str, dict[str, float]], + metric: str, # one of REGRESSOR_THRESHOLDS keys +) -> _OutcomeSummary | None: + """Returns counts + regressor_candidates (qids only, no text). Returns None when either input is empty/None. + Sorts candidates by absolute delta descending, caps at TOP_REGRESSORS_CAP. Pure — no DB.""" + +def build_regressor_rows( + candidates: list[tuple[str, float, float, float]], # (qid, winner_score, comparison_score, delta) + query_text_by_id: dict[str, str], # hydrated from Q4 of the 4-query read pattern +) -> list[RegressorRowShape]: + """Hydrates each candidate with query_text. If a qid is missing from the dict (deleted query — cascade race), the row is omitted.""" + +async def compute_study_confidence( + db: AsyncSession, + study: Study, +) -> ConfidenceShape | None: + """Orchestrator — fires the 4-query read pattern from FR-2 + assembles ConfidenceShape. + Returns None on whole-object-degraded paths per FR-7. 
Pseudocode: + + winner = await Q1(db, study.best_trial_id) + if winner is None: return None # FR-7 whole-object case (best_trial_id NULL or deleted) + + runner_up = await Q2(db, study.id, exclude=winner.id) + complete_trials_summary = await Q3(db, study.id) # projection: (primary_metric, optuna_trial_number) + + # Aggregate signals — independent of per_query data + runner_up_gap = classify_runner_up_gap(sorted_primary_metrics_from_summary) # may be None + late_trial_stddev = compute_late_trial_stddev(primary_metrics_in_trial_order) # may be None + convergence = classify_convergence_regime(winner.optuna_trial_number, primary_metrics_by_trial_number) # may be None + + # Winner-only per-query signals — depend only on winner's per_query_metrics + # (cycle-2 GPT-5.5 F2 fix — AC-16 requires CI to populate even with 1 complete trial) + if winner.per_query_metrics: + winner_values_for_metric = [ + v[metric] for v in winner.per_query_metrics.values() if metric in v + ] + ci_95 = bootstrap_ci_95(winner_values_for_metric) # may be None for N<5 + n_queries = len(winner_values_for_metric) + else: + ci_95 = None + n_queries = None + + # Comparison-based per-query signals — require BOTH winner + runner_up to have per_query_metrics + if winner.per_query_metrics and runner_up and runner_up.per_query_metrics: + outcome = compute_outcome_summary(winner.per_query_metrics, runner_up.per_query_metrics, metric) + query_text_by_id = await Q4(db, [qid for (qid, *_) in outcome.regressor_candidates]) # conditional — skipped if no candidates + regressor_rows = build_regressor_rows(outcome.regressor_candidates, query_text_by_id) + per_query_outcomes = PerQueryOutcomesShape( + improved=outcome.improved, unchanged=outcome.unchanged, regressed=outcome.regressed, + comparison_against='runner_up', # FR-3 locked for Phase 1 + top_regressors=regressor_rows, + ) + else: + per_query_outcomes = None + + return ConfidenceShape( + headline=HeadlineShape(metric=study.objective['metric'], 
value=study.best_metric, k=study.objective.get('k'), n_queries=n_queries), + ci_95=ci_95, + runner_up_gap=runner_up_gap, + late_trial_stddev=late_trial_stddev, + convergence=convergence, + per_query_outcomes=per_query_outcomes, + ) + """ +``` + +**Tasks** +1. Write `backend/app/domain/study/confidence.py` with the 6 pure helpers + the async orchestrator. +2. Inside `compute_study_confidence`, execute the 4-query read pattern from spec FR-2: + - Q1: `SELECT * FROM trials WHERE id = :winner_id` — fetch winner. + - Q2: `SELECT * FROM trials WHERE study_id = :sid AND status = 'complete' AND id != :winner_id ORDER BY primary_metric DESC NULLS LAST LIMIT 1` — fetch runner-up. + - Q3: `SELECT primary_metric, optuna_trial_number FROM trials WHERE study_id = :sid AND status = 'complete' ORDER BY optuna_trial_number ASC` — summary list (projection only, no per_query_metrics). + - Q4 (conditional): `SELECT id, query_text FROM queries WHERE id = ANY(:regressor_qids)` — only if `top_regressors` produced any rows. +3. Wire each helper to the appropriate sub-field of `ConfidenceShape` (the shapes are defined IN THIS STORY's `backend/app/domain/study/confidence.py`; Story 1.4 only re-exports + adds the field to `StudyDetail`). +4. Write `backend/tests/unit/domain/study/test_confidence.py` with 25+ cases covering every FR-7 degraded branch. +5. Lock numpy import at module top — no lazy-import dance; numpy is a hard dep via pytrec_eval. + +**Definition of Done (DoD)** +- [ ] `backend/app/domain/study/confidence.py` exists with the 6 pure helpers + the async orchestrator. +- [ ] 25+ unit cases pass via `make test-unit`. +- [ ] Every FR-7 degraded sub-field path has an explicit test case. +- [ ] Bootstrap CI seed determinism asserted (AC-4 covered at unit layer). +- [ ] No `except Exception:` in the module (FR-7 invariant: errors propagate; degraded paths return None explicitly). 
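The `bootstrap_ci_95` contract (percentile bootstrap, seed 42, 1000 resamples, suppressed below 5 queries) can be illustrated with a short sketch. It is not the planned implementation: it swaps numpy for the standard-library `random` module so it runs without extra dependencies, so its numbers will differ from a numpy-based version. `CISketch` and `bootstrap_ci_95_sketch` are hypothetical names, and the nearest-rank percentile indexing is an assumption rather than something the spec text above pins down.

```python
import random
import statistics
from typing import NamedTuple, Optional, Sequence

# Locked constants copied from the Story 1.3 key interfaces.
BOOTSTRAP_N = 1000
BOOTSTRAP_SEED = 42
BOOTSTRAP_CI_LEVEL = 0.95
BOOTSTRAP_MIN_N_QUERIES = 5


class CISketch(NamedTuple):
    """Hypothetical, reduced stand-in for CIShape (bounds only)."""
    low: float
    high: float


def bootstrap_ci_95_sketch(per_query_values: Sequence[float]) -> Optional[CISketch]:
    """Percentile bootstrap of the mean; None when there are too few queries."""
    n = len(per_query_values)
    if n < BOOTSTRAP_MIN_N_QUERIES:
        return None
    # A fresh, fixed-seed generator per call makes repeated calls return identical bounds.
    rng = random.Random(BOOTSTRAP_SEED)
    values = list(per_query_values)
    # Resample n values with replacement, BOOTSTRAP_N times, keeping each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(BOOTSTRAP_N)
    )
    tail = (1.0 - BOOTSTRAP_CI_LEVEL) / 2.0
    # Nearest-rank indices into the sorted resample means (assumption).
    low_idx = int(tail * (BOOTSTRAP_N - 1))
    high_idx = int((1.0 - tail) * (BOOTSTRAP_N - 1))
    return CISketch(low=means[low_idx], high=means[high_idx])


scores = [0.2, 0.9, 0.5, 0.7, 0.4, 0.8, 0.6, 0.3]
first = bootstrap_ci_95_sketch(scores)
second = bootstrap_ci_95_sketch(scores)
```

Seeding inside the function, rather than once at import time, is what lets the determinism test call the helper twice and compare results directly.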
+ +--- + +### Story 1.4 — `ConfidenceShape` Pydantic model + `StudyDetail` enrichment + +**Outcome:** `GET /api/v1/studies/{id}` response gains an optional `confidence: ConfidenceShape | None` field. The OpenAPI schema is shape-locked. Old clients that don't deserialize the field continue to work. + +**FRs:** FR-5a, FR-7 (wire contract). + +**New files** + +| File | Purpose | +|---|---| +| `backend/tests/integration/test_studies_api_confidence.py` | 11 integration tests covering AC-3, AC-3a, AC-4, AC-5, AC-6, AC-7, AC-8, AC-9, AC-10, AC-15, AC-16 (cycle-1 GPT-5.5 F9 added AC-6/AC-8/AC-9 at integration layer). Uses extended `_digest_helpers.py` seed pattern + configurable `optuna_trial_number` distribution to synthesize convergence-regime scenarios. | + +**Modified files** + +| File | Change | +|---|---| +| [`backend/app/api/v1/schemas.py`](../../../../backend/app/api/v1/schemas.py) | Re-export `ConfidenceShape` (defined in Story 1.3's `backend/app/domain/study/confidence.py`) via `from backend.app.domain.study.confidence import ConfidenceShape`. Add `confidence: ConfidenceShape \| None = None` to `StudyDetail` (insert after `trials_summary` at line 636). NOTE: The shape itself is defined in Story 1.3 (cycle-1 GPT-5.5 F6 fix — domain module owns the Pydantic types because it is the assembler). | +| [`backend/app/api/v1/studies.py`](../../../../backend/app/api/v1/studies.py) | `_detail()` at line 118 — `await compute_study_confidence(db, row)` and pass into the `StudyDetail(...)` constructor at line 134 (insert just before the closing paren). 
| +| [`backend/tests/contract/test_studies_api_contract.py`](../../../../backend/tests/contract/test_studies_api_contract.py) | Add 2 cases: `test_study_detail_includes_confidence_field` (OpenAPI shape lock — assert the JSON schema for `StudyDetail` contains the `confidence` property), `test_confidence_shape_has_six_subfields` (assert the schema's `ConfidenceShape` has `headline`, `ci_95`, `runner_up_gap`, `late_trial_stddev`, `convergence`, `per_query_outcomes`). | + +**Endpoints** + +| Method | Path | Request body | Success response | Error codes | +|---|---|---|---|---| +| `GET` | `/api/v1/studies/{study_id}` | — | `200` `StudyDetail` (existing shape + new `confidence: ConfidenceShape \| null` field) | `STUDY_NOT_FOUND` (404 — existing) | + +No new error codes per spec §8.5. + +**Pydantic schemas** + +The `ConfidenceShape` and 7 sub-shapes are defined in Story 1.3 at `backend/app/domain/study/confidence.py` (see that story's Key interfaces section for the full Pydantic class definitions). Story 1.4 only re-exports them through `schemas.py` and adds the field to `StudyDetail`: + +```python +# backend/app/api/v1/schemas.py +from backend.app.domain.study.confidence import ConfidenceShape # NEW import — re-export for typing convenience + +class StudyDetail(BaseModel): + # ... existing 17 fields (id, name, cluster_id, target, ...) ... + trials_summary: TrialsSummaryShape + confidence: ConfidenceShape | None = None # NEW (FR-5a) +``` + +**Tasks** +1. Add `from backend.app.domain.study.confidence import ConfidenceShape` to the imports of `backend/app/api/v1/schemas.py` (the shapes themselves live in Story 1.3's domain module per the cycle-1 sequencing fix). +2. Modify `StudyDetail` to add the `confidence: ConfidenceShape | None = None` field after `trials_summary`. +3. Modify `backend/app/api/v1/studies.py::_detail()` to `await compute_study_confidence(db, row)` and pass the result into the `StudyDetail(...)` constructor. +4. 
Add 11 integration test cases to `test_studies_api_confidence.py` per the AC mapping in the FR row (cycle-1 GPT-5.5 F9 expansion). +5. Add 2 contract test cases to `test_studies_api_contract.py` for OpenAPI shape lock. +6. Run `make test-integration && make test-contract` locally. + +**Definition of Done (DoD)** +- [ ] `ConfidenceShape` and 7 sub-shapes are defined in `backend/app/domain/study/confidence.py` (Story 1.3); this story (1.4) only adds the `from backend.app.domain.study.confidence import ConfidenceShape` re-export at the top of `schemas.py`. +- [ ] `StudyDetail` has the new `confidence` field; `_detail()` populates it. +- [ ] 11 integration cases pass; 2 contract cases pass. +- [ ] OpenAPI schema includes the new shape (verified by the existing OpenAPI-surface contract test family — see `test_openapi_surface.py`). +- [ ] AC-3, AC-3a, AC-4, AC-5, AC-6, AC-7, AC-8, AC-9, AC-10, AC-15, AC-16 all green. + +--- + +### Story 1.5 — PR body `## Confidence` section + PR-worker plumbing + +**Outcome:** The `open_pr` worker fetches confidence before rendering, and `_render_pr_body_study_backed` emits the new section between `## Metric delta` and `## Config diff`. Section gracefully degrades when sub-fields are null. Section is entirely absent when `confidence is None`. + +**FRs:** FR-5b, FR-5d, FR-7 (PR body gating). + +**New files** + +| File | Purpose | +|---|---| +| `backend/tests/contract/test_pr_body_confidence_section.py` | 4 contract cases covering AC-11, AC-12, the partial-confidence rendering path (AC-3 mirror), and the section-omitted path when confidence is None. | +| `backend/tests/integration/test_open_pr_worker_confidence_plumbing.py` | 1 integration test that drives the real `open_pr` worker path end-to-end (NOT just the pure renderer) to verify FR-5d's worker-side data plumbing. 
| + +**Modified files** + +| File | Change | +|---|---| +| [`backend/workers/git_pr.py`](../../../../backend/workers/git_pr.py) | (a) Modify `_render_pr_body_study_backed` (line 488-528) — add a `confidence: ConfidenceShape \| None = None` kwarg; insert the `## Confidence` section between `## Metric delta` (line 504) and `## Config diff` (line 510) when `confidence is not None`. Render sub-blocks independently (each gated on its sub-field being non-null). (b) Modify the `open_pr` worker function (search `_render_pr_body_study_backed` callers at line ~904) — before calling the renderer, `await compute_study_confidence(db, study)` and pass into `_render_pr_body_study_backed(..., confidence=...)`. | + +**Key interfaces** + +```python +# backend/workers/git_pr.py +def _render_pr_body_study_backed( + *, + proposal: Any, + study: Any, + digest: Any, + config_diff: dict[str, Any], + chart_md: str, + base_url: str | None, + confidence: ConfidenceShape | None = None, # NEW (FR-5b) — Pydantic object directly, per spec FR-5d +) -> str: + ... + +# `open_pr` worker function (existing) — extend the call site: +from backend.app.domain.study.confidence import compute_study_confidence + +study = await repo.get_study(db, proposal.study_id) +confidence = await compute_study_confidence(db, study) # Pydantic ConfidenceShape | None +body = _render_pr_body_study_backed( + proposal=proposal, + study=study, + digest=digest, + config_diff=proposal.config_diff, + chart_md=chart_md, + base_url=base_url, + confidence=confidence, # passed as Pydantic object — renderer reads .ci_95, .runner_up_gap, etc. directly +) +``` + +Note: Only the Jinja prompt rendering path (Story 1.6) serializes the shape via `.model_dump()` because Jinja consumes dicts. The PR-body renderer keeps the typed object — cycle-1 GPT-5.5 F3 fix. + +**Tasks** +1. **Grep all `_render_pr_body_study_backed(` call sites** with `grep -rn "_render_pr_body_study_backed(" backend/`. 
At minimum the `open_pr` worker function calls it; if any other call site exists (e.g., a future test scaffold), every site MUST be updated to pass `confidence` per FR-5d (cycle-1 GPT-5.5 F11 fix). Current expectation per spec audit: one call site only (line ~904 in git_pr.py). +2. Add the `confidence` kwarg to `_render_pr_body_study_backed`. Construct the section markdown: + - Section heading: `## Confidence` + - CI line: `- {metric}@{k}: {value:.3f} (95% CI {low:.3f}-{high:.3f}, N={n_queries} queries)` — only when `confidence.ci_95` is non-null. + - Per-query line: `- Queries: {improved} improved · {unchanged} unchanged · {regressed} regressed (vs {comparison_against})` — only when `confidence.per_query_outcomes` is non-null. + - Regressor block: `- Queries that regressed: `\`{query_text}\`` ({comparison_score:.3f} → {winner_score:.3f}), ...` joined with `·` — only when `per_query_outcomes.regressed > 0`. + - Runner-up gap line: `- Runner-up gap {value:.3f} ({classification or 'unclassified'})` — only when `runner_up_gap` non-null. + - Noise floor line: `- Late-trial 1σ = {value:.3f}` — only when `late_trial_stddev` non-null. + - Convergence line: `- Convergence: {regime} (best at trial {best_at_trial} of {total_trials})` — only when `convergence` non-null. +3. Modify the `open_pr` worker call site to fetch confidence and pass it through. Import `compute_study_confidence` from `backend.app.domain.study.confidence`. +4. Write the 4 contract test cases. Use direct calls to `_render_pr_body_study_backed(...)` with **factory-constructed `ConfidenceShape` instances** (cycle-2 GPT-5.5 F3 fix — renderer signature requires the typed Pydantic object; dicts would re-introduce drift). Add a small test helper `make_test_confidence(**overrides)` that builds a `ConfidenceShape` with sensible defaults and accepts per-test-case overrides for each sub-field. +5. Write the 1 integration test that drives the real worker function with a seeded completed study + per_query_metrics. 
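The rendering contract in the task list above can be sketched as a standalone function. This is an illustrative sketch, not the code in `git_pr.py`: `render_confidence_section` is a hypothetical name, `SimpleNamespace` objects stand in for the Pydantic `ConfidenceShape`, and dropping the `@{k}` suffix when `k` is `None` is an assumption the task text does not spell out.

```python
from types import SimpleNamespace


def render_confidence_section(confidence) -> str:
    """Return the `## Confidence` markdown block, or '' when confidence is None."""
    if confidence is None:
        return ""  # whole-object null: the section is omitted entirely
    lines = ["## Confidence"]
    head = confidence.headline
    # Assumption: omit the "@k" suffix when the objective has no cutoff.
    label = f"{head.metric}@{head.k}" if head.k is not None else head.metric

    ci = confidence.ci_95
    if ci is not None:
        lines.append(
            f"- {label}: {head.value:.3f} "
            f"(95% CI {ci.low:.3f}-{ci.high:.3f}, N={head.n_queries} queries)"
        )
    pq = confidence.per_query_outcomes
    if pq is not None:
        lines.append(
            f"- Queries: {pq.improved} improved · {pq.unchanged} unchanged · "
            f"{pq.regressed} regressed (vs {pq.comparison_against})"
        )
        if pq.regressed > 0:
            rows = " · ".join(
                f"`{r.query_text}` ({r.comparison_score:.3f} → {r.winner_score:.3f})"
                for r in pq.top_regressors
            )
            lines.append(f"- Queries that regressed: {rows}")
    gap = confidence.runner_up_gap
    if gap is not None:
        lines.append(f"- Runner-up gap {gap.value:.3f} ({gap.classification})")
    noise = confidence.late_trial_stddev
    if noise is not None:
        lines.append(f"- Late-trial 1σ = {noise.value:.3f}")
    conv = confidence.convergence
    if conv is not None:
        lines.append(
            f"- Convergence: {conv.regime} "
            f"(best at trial {conv.best_at_trial} of {conv.total_trials})"
        )
    return "\n".join(lines) + "\n"


# Partial confidence: only the CI band and the convergence signal are available.
partial = SimpleNamespace(
    headline=SimpleNamespace(metric="ndcg", k=10, value=0.84, n_queries=20),
    ci_95=SimpleNamespace(low=0.78, high=0.89),
    per_query_outcomes=None,
    runner_up_gap=None,
    late_trial_stddev=None,
    convergence=SimpleNamespace(regime="early_held", best_at_trial=387, total_trials=1000),
)
section = render_confidence_section(partial)
```

Gating each line on its own sub-field keeps the partial-confidence and full-confidence contract tests on the same code path, with only the input object changing.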
+ +**Definition of Done (DoD)** +- [ ] `_render_pr_body_study_backed` emits the `## Confidence` section per the rendering contract above. +- [ ] `open_pr` worker fetches + passes confidence before rendering. +- [ ] AC-11 contract test (full-confidence path) passes. +- [ ] AC-12 contract test (whole-object null path — no section) passes. +- [ ] Partial-confidence contract test (per FR-7) passes. +- [ ] Integration test against real worker path passes — covers FR-5d. + +--- + +### Story 1.6 — Digest narrative prompt extension + +**Outcome:** `digest_narrative.user.jinja` carries `` + `` XML blocks. `digest_narrative.system.md` opening guidance is edited per FR-6. The digest worker passes the serialized `ConfidenceShape` through to `render_digest_user_prompt`. + +**FRs:** FR-6, AC-14. + +**New files** + +None. + +**Modified files** + +| File | Change | +|---|---| +| [`prompts/digest_narrative.user.jinja`](../../../../prompts/digest_narrative.user.jinja) | Insert `` and `` blocks after the existing `` block (line 13) — use the exact Jinja from spec §7 FR-6. | +| [`prompts/digest_narrative.system.md`](../../../../prompts/digest_narrative.system.md) | (a) Lines 13-25 — extend the XML-block list to document blocks 8 (``) and 9 (``) and their conditional inclusion. (b) Line 29-30 — replace the substring `Open with the headline metric delta.` with the **exact spec FR-6 string including backticks around XML names**: `Open with the headline metric delta, immediately followed by a one-sentence confidence framing that mentions the CI band (when `` is present), the per-query outcome counts (when `` is present), and the worst-regressed query by name (when `` has regressors). Then explain *why*` — i.e., the markdown backticks around `` and `` MUST be preserved per cycle-1 GPT-5.5 F4. 
| +| [`backend/app/llm/digest_prompt.py`](../../../../backend/app/llm/digest_prompt.py) | `render_digest_user_prompt` (line 67) — add `confidence: dict[str, Any] \| None = None` kwarg; pass through to the jinja render. | +| [`backend/workers/digest.py`](../../../../backend/workers/digest.py) | In the digest-generation function (search `render_digest_user_prompt` callers; existing code at ~line 690-700 already passes `baseline_metric` + `achieved_metric`) — `await compute_study_confidence(db, study)` and pass the serialized result through as the new `confidence` kwarg. | +| [`backend/tests/unit/workers/test_digest_prompt_render.py`](../../../../backend/tests/unit/workers/test_digest_prompt_render.py) | Add **5** new cases (cycle-1 GPT-5.5 F10 fix): (1) user-prompt contains `` block with full data; (2) user-prompt OMITS `` when `confidence=None`; (3) user-prompt contains `` block when nested data present; (4) user-prompt OMITS `` when `confidence.per_query_outcomes is None`; (5) **system-prompt** (rendered via `render_digest_system_prompt()`) contains the exact FR-6 replacement substring `Open with the headline metric delta, immediately followed by a one-sentence confidence framing that mentions the CI band (when ````...` AND the documented XML-block list entries for `` and `` — covers AC-14's system-prompt half. | + +**Tasks** +1. Edit `prompts/digest_narrative.user.jinja` to add the two new Jinja blocks. +2. Edit `prompts/digest_narrative.system.md` per the precise replacements above. +3. Extend `render_digest_user_prompt` with the new optional kwarg. +4. Modify the digest worker to fetch confidence and pass it through. +5. Add the 5 new test cases to `test_digest_prompt_render.py` (4 user-prompt + 1 system-prompt per cycle-1 F10). + +**Definition of Done (DoD)** +- [ ] System prompt has the exact replacement string from spec FR-6. +- [ ] User jinja template renders `` and `` blocks conditionally. 
+- [ ] `render_digest_user_prompt` accepts one new `confidence: dict | None` kwarg (NOT two — `per_query_outcomes` is nested inside per spec FR-6). +- [ ] Digest worker fetches + passes confidence. +- [ ] AC-14 unit test passes. + +--- + +## Epic 1 gate (hard stop — do not enter Epic 2 until all pass) + +- [ ] All 6 stories in Epic 1 are complete with green tests. +- [ ] `GET /api/v1/studies/{id}` returns a populated `confidence` field on a seeded study with per_query_metrics — verified live via `curl` against the local stack. +- [ ] A real-PR open against a completed study with per_query_metrics renders the `## Confidence` section in the PR body (verified by the integration test in Story 1.5). +- [ ] Alembic head is `0015_trials_per_query_metrics`; round-trip verified. + +--- + +## Epic 2 — Frontend ConfidencePanel + +### Story 2.1 — TypeScript types + wire-value enums + +**Outcome:** The auto-generated `ui/src/lib/types.ts` reflects the new `ConfidenceShape` from the OpenAPI schema. `ui/src/lib/enums.ts` adds 3 new wire-value Literal arrays with source-of-truth comments per CLAUDE.md "Enumerated Value Contract Discipline." + +**FRs:** FR-5c (precondition), supports §8.4 enumerated value contract. + +**New files** + +None. + +**Modified files** + +| File | Change | +|---|---| +| [`ui/src/lib/types.ts`](../../../../ui/src/lib/types.ts) | Regenerate from the live OpenAPI schema (run `cd ui && pnpm openapi:types` or the project's equivalent — checked into the repo via the pre-commit hook). Diff should show: new `ConfidenceShape` type + 7 sub-types + 4 new Literal types + extension of `StudyDetail` with the `confidence?: ConfidenceShape \| null` field. 
| +| [`ui/src/lib/enums.ts`](../../../../ui/src/lib/enums.ts) | Add 3 new wire-value arrays after the existing `OBJECTIVE_METRIC_VALUES` (line 68): `CONVERGENCE_REGIME_VALUES = ['early_held', 'late_rising', 'noisy'] as const;` + `RUNNER_UP_CLASSIFICATION_VALUES = ['robust_plateau', 'sharp_peak'] as const;` + `COMPARISON_AGAINST_VALUES = ['runner_up', 'baseline'] as const;` — each preceded by a source-of-truth comment `// Values must match backend/app/domain/study/confidence.py ConvergenceRegime` (etc.) per the project's enumerated-value-contract discipline. | + +**Source-of-truth verification** + +Per CLAUDE.md "Enumerated Value Contract Discipline": + +| Wire value array | Backend source | Frontend file | +|---|---|---| +| `CONVERGENCE_REGIME_VALUES` | `backend/app/domain/study/confidence.py` `ConvergenceRegime = Literal["early_held", "late_rising", "noisy"]` (cycle-2 GPT-5.5 F1: types live in domain module, NOT schemas.py — schemas.py only re-exports `ConfidenceShape`) | `ui/src/lib/enums.ts` | +| `RUNNER_UP_CLASSIFICATION_VALUES` | `backend/app/domain/study/confidence.py` `RunnerUpClassification = Literal["robust_plateau", "sharp_peak"]` | `ui/src/lib/enums.ts` | +| `COMPARISON_AGAINST_VALUES` | `backend/app/domain/study/confidence.py` `ComparisonAgainst = Literal["runner_up", "baseline"]` (Phase 1 only emits `"runner_up"`; `"baseline"` reserved for Phase 2) | `ui/src/lib/enums.ts` | + +**Tasks** +1. Regenerate `ui/src/lib/types.ts` from the live OpenAPI schema (after Story 1.4 has merged). Verify the diff covers `ConfidenceShape` + sub-shapes + the `StudyDetail.confidence` field. +2. Add the 3 wire-value Literal arrays to `ui/src/lib/enums.ts` with source-of-truth comments. +3. Run `cd ui && pnpm typecheck` — verify no type errors. +4. Run `cd ui && pnpm test` — verify no regressions in the existing 285+ test suite. + +**Definition of Done (DoD)** +- [ ] `ui/src/lib/types.ts` regenerated and committed. 
+- [ ] `ui/src/lib/enums.ts` has 3 new arrays with source-of-truth comments. +- [ ] `pnpm typecheck` green. +- [ ] No existing vitest case breaks. + +--- + +### Story 2.2 — `<ConfidencePanel>` component + glossary + page mount + +**Outcome:** A new `<ConfidencePanel>` component renders on `/studies/[id]` between the study header card and the trials table. Renders nothing when `confidence === null` (no empty-state shell). Each sub-field is independently gated on its non-null state. + +**FRs:** FR-5c. + +**New files** + +| File | Purpose | +|---|---| +| `ui/src/components/studies/confidence-panel.tsx` | The new component. Takes one prop: `confidence: ConfidenceShape \| null \| undefined`. Renders nothing when `confidence == null`. Otherwise renders 4 sections: headline + CI band, per-query outcome chips, regressor table (when applicable), and a secondary callouts row (runner-up gap, late-trial 1σ, convergence). | +| `ui/src/__tests__/components/studies/confidence-panel.test.tsx` | 12 vitest cases — full-data render, null-confidence (renders nothing), partial render (each sub-field independently null), regressor table cap-at-5, "vs runner-up" / "vs baseline" label switching, tooltip presence + content, every degraded-path branch from FR-7. | + +**Modified files** + +| File | Change | +|---|---| +| [`ui/src/app/studies/[id]/page.tsx`](../../../../ui/src/app/studies/%5Bid%5D/page.tsx) | Mount `<ConfidencePanel>` between the existing study header card and the trials table. Pass `study.confidence` from the `useStudy` hook's response. | +| [`ui/src/lib/glossary.ts`](../../../../ui/src/lib/glossary.ts) | Add 6 new entries: `confidence.ci_95`, `confidence.runner_up_gap`, `confidence.late_trial_stddev`, `confidence.convergence_regime`, `confidence.per_query_outcomes`, `confidence.comparison_against`. Each entry follows the existing pattern in this file (short form for ``, optional long form for ``). Use the tooltip text from spec §11 "Tooltips and contextual help" table verbatim. 
| +| [`ui/src/__tests__/lib/glossary.test.ts`](../../../../ui/src/__tests__/lib/glossary.test.ts) | The existing parity test (per `feat_contextual_help`) ensures glossary keys match enum values. Verify the 6 new keys appear; add a parity assertion for the 3 new wire-value enums from Story 2.1 if not auto-covered. | + +**UI element inventory** + +| Element | Source data | Interaction | +|---|---|---| +| Section heading "Confidence" | static text | none | +| Headline + CI band: e.g., "NDCG@10 = 0.840 (95% CI 0.78–0.89, N=20 queries)" | `confidence.headline` + `confidence.ci_95` (latter optional) | none | +| Per-query outcome chips: "14 Improved · 4 Unchanged · 2 Regressed (vs runner-up)" | `confidence.per_query_outcomes.{improved,unchanged,regressed,comparison_against}` (`COMPARISON_AGAINST_VALUES`) | hover on each chip → `<InfoTooltip>` (glossary key `confidence.per_query_outcomes`) | +| Named regressor table (up to 5 rows) | `confidence.per_query_outcomes.top_regressors` | none — read-only inline display | +| Runner-up gap label: "Runner-up gap 0.005 (Robust plateau)" | `confidence.runner_up_gap.{value, classification}` (`RUNNER_UP_CLASSIFICATION_VALUES`) | hover on classification badge → `<InfoTooltip>` (glossary key `confidence.runner_up_gap`) | +| Late-trial 1σ value | `confidence.late_trial_stddev.value` + `.window_size` | hover → `<InfoTooltip>` (glossary key `confidence.late_trial_stddev`) | +| Convergence call-out: "Early-and-held (best at trial 387 of 1000)" | `confidence.convergence.{regime, best_at_trial, total_trials}` (`CONVERGENCE_REGIME_VALUES`) | hover on regime badge → `<InfoTooltip>` (glossary key `confidence.convergence_regime`) | + +**Source-of-truth comments** + +Every JSX branch in `confidence-panel.tsx` that switches on a wire enum MUST include a comment citing the backend source: +```tsx +// Values must match backend/app/domain/study/confidence.py ConvergenceRegime +{regime === 'early_held' ? 'Early-and-held' : + regime === 'late_rising' ? 'Late-rising' : + 'Noisy'} +``` + +**Tasks** +1. 
Read [`ui/src/components/studies/digest-panel.tsx`](../../../../ui/src/components/studies/digest-panel.tsx) (the closest existing analogous component) as the structural template. +2. Read [`ui/src/components/studies/study-header.tsx`](../../../../ui/src/components/studies/study-header.tsx) for the badge + section layout idiom. +3. Implement `confidence-panel.tsx` with the 4 sections from the inventory. Use the project's existing `` + `` primitives. +4. Mount the panel in `ui/src/app/studies/[id]/page.tsx` between the existing header card render and the trials table. +5. Add the 6 glossary entries. +6. Write 12 vitest cases against the new component. + +**Definition of Done (DoD)** +- [ ] `confidence-panel.tsx` exists and renders all 4 sections gated independently. +- [ ] Mounted on `/studies/[id]` page. +- [ ] 12 vitest cases pass. +- [ ] 6 glossary entries added; glossary parity test passes. +- [ ] `pnpm typecheck && pnpm lint && pnpm test` all green. +- [ ] Visual smoke check: `make up`, navigate to a seeded study with per_query_metrics, confirm the panel renders. + +--- + +### Story 2.3 — Playwright real-backend E2E for ConfidencePanel + +**Outcome:** 2 new real-backend Playwright cases extend `ui/tests/e2e/studies.spec.ts` (or a new spec file) covering the panel-renders + panel-absent paths. + +**FRs:** FR-5c (browser-layer verification), AC-13. + +**New files** + +None — extend the existing spec. + +**Modified files** + +| File | Change | +|---|---| +| [`ui/tests/e2e/studies.spec.ts`](../../../../ui/tests/e2e/studies.spec.ts) | Add 2 new test cases. Both run real-backend (no `page.route()` mocking per CLAUDE.md E2E policy). 
| +| [`ui/tests/e2e/helpers/seed.ts`](../../../../ui/tests/e2e/helpers/seed.ts) | Add a new helper `seedCompletedStudyWithPerQueryMetrics()` that wraps the existing `seedStudyCompletedWithDigest()` AND extends the existing `_test/studies/seed-completed` endpoint (or its sibling backend test-seed helper at [`backend/tests/integration/_digest_helpers.py`](../../../../backend/tests/integration/_digest_helpers.py)) to populate `per_query_metrics` on the winner + runner-up trial rows it creates. **Do NOT add a new `/api/v1/_test/trials/set-per-query-metrics` test endpoint** (would conflict with the spec's "zero new endpoints" contract — cycle-1 GPT-5.5 F1). Either: (a) extend the existing seed-completed test endpoint to accept an optional `winner_per_query: dict` + `runner_up_per_query: dict` field and persist them, or (b) extend the backend `_digest_helpers.py` to populate per-query data, and call that helper via the existing test endpoint. Pick whichever the backend test infra already uses for similar seeding. | + +**Tasks** +1. Add `seedCompletedStudyWithPerQueryMetrics()` to the seed helper. Reuse the existing `seedStudyCompletedWithDigest` scaffold; populate per-query data on the winner + runner-up trial rows via the most appropriate test endpoint or direct SQL. +2. Add 2 new test cases: + - `ConfidencePanel renders for a completed study with per_query_metrics`: seed → navigate to `/studies/{id}` → assert the "Confidence" section heading is visible → assert the headline-with-CI text matches expected pattern → assert at least one outcome chip is visible. + - `ConfidencePanel renders nothing for a study with confidence=null`: seed a study with no completed trials (or with `best_trial_id=NULL`) → navigate → assert the "Confidence" heading is NOT visible. +3. Run `cd ui && pnpm playwright test tests/e2e/studies.spec.ts` locally. + +**Definition of Done (DoD)** +- [ ] 2 new Playwright cases pass against the real backend. +- [ ] No `page.route()` mocking introduced. 
+- [ ] AC-13 covered. + +--- + +## Epic 2 gate + +- [ ] All 3 stories in Epic 2 complete with green tests. +- [ ] `cd ui && pnpm test` shows all UI vitest cases green (including the 12 new confidence-panel cases). +- [ ] `cd ui && pnpm playwright test tests/e2e/studies.spec.ts` shows the 2 new cases green. +- [ ] Visual smoke check: panel renders correctly on a seeded study. + +--- + +## UI Guidance + +### Reference: current component structure + +[`ui/src/app/studies/[id]/page.tsx`](../../../../ui/src/app/studies/%5Bid%5D/page.tsx) — 114 lines total. Reads `useStudy(studyId)` and renders a header card + the trials table. Clean canvas for the new panel. + +The closest existing analog for the new ConfidencePanel is [`ui/src/components/studies/digest-panel.tsx`](../../../../ui/src/components/studies/digest-panel.tsx) — renders study-end data conditionally (`if digest === null → render nothing`), uses `<InfoTooltip>` primitives from `feat_contextual_help`, and uses the project's standard badge + section layout. Read this file first when implementing Story 2.2. + +### Insertion point + +In `ui/src/app/studies/[id]/page.tsx`: between the existing study header card render and the existing trials table render. No code is removed. The new mount is one `<ConfidencePanel>` JSX node. + +### Analogous markup patterns + +Mount pattern — from `ui/src/app/studies/[id]/page.tsx` (existing trials table render): +```tsx +{/* Existing — keep as-is */} + + +{/* NEW — add between header and trials table (Story 2.2) */} +<ConfidencePanel confidence={study.confidence} /> + +{/* Existing — keep as-is */} + +``` + +Conditional render pattern — from `ui/src/components/studies/digest-panel.tsx`: +```tsx +{/* If null/undefined, render nothing — no empty-state shell */} +export function ConfidencePanel({ confidence }: { confidence: ConfidenceShape | null | undefined }) { + if (!confidence) return null; + return ( +
+ {/* sub-sections, each gated on its sub-field */} +
+ ); +} +``` + +Badge pattern — from `ui/src/components/studies/study-header.tsx` (status badge): +```tsx +{/* Use the existing Badge primitive; vary `variant` based on the wire-enum value */} + + {regime === 'early_held' ? 'Early-and-held' : regime === 'late_rising' ? 'Late-rising' : 'Noisy'} + +``` + +InfoTooltip pattern — from `ui/src/components/studies/digest-panel.tsx`: +```tsx + + 95% CI + + +``` + +### Layout and structure + +The panel is a single `
` with 4 vertically stacked sub-sections: +1. Headline + CI band (single line, large font) +2. Per-query outcome chips row (3 chips side-by-side; horizontal) +3. Regressor table (when present; up to 5 rows, narrow inline table) +4. Secondary callouts row (3 callouts side-by-side: runner-up gap, late-trial 1σ, convergence — narrow horizontal layout) + +Responsive: on screens <768px the 3-chip row and 3-callout row collapse to vertical stacks via existing Tailwind classes. + +### Information architecture placement + +Per spec §11: between study header card and trials table on `/studies/[id]`. No new nav, no new tab. Discoverable by anyone who lands on a study detail page. + +### Tooltips and contextual help + +Tooltips use the existing `<InfoTooltip>` primitive (from `feat_contextual_help`). All tooltip text comes from the glossary, NOT inlined in the component — keeps the parity tests at `glossary.test.ts` green. + +| Element | Glossary key | Primitive | +|---|---|---| +| "95% CI" label | `confidence.ci_95` | `<InfoTooltip>` | +| Outcome chips group | `confidence.per_query_outcomes` | `<InfoTooltip>` | +| Runner-up gap badge | `confidence.runner_up_gap` | `<InfoTooltip>` | +| Late-trial 1σ label | `confidence.late_trial_stddev` | `<InfoTooltip>` | +| Convergence regime badge | `confidence.convergence_regime` | `<InfoTooltip>` | +| Comparison label ("vs runner-up") | `confidence.comparison_against` | `<InfoTooltip>` | + +### Visual consistency + +| New element | CSS pattern source | +|---|---| +| Section heading "Confidence" | Matches the section-heading markup in `ui/src/components/studies/digest-panel.tsx` | +| Headline + CI band | Mimics the metric-delta line in `ui/src/components/studies/study-header.tsx` | +| Outcome chips | Use `<Badge>` primitive at `ui/src/components/ui/badge.tsx` | +| Regressor table | Use existing inline narrow-table pattern (no `` — keep it simple; just `` with Tailwind classes) | +| Tooltip triggers | `<InfoTooltip>` primitive at `ui/src/components/common/info-tooltip.tsx` | + +### Component composition + +The panel is a single new component, NOT extracted into sub-components. Each sub-section is inline JSX inside `<ConfidencePanel>`. Rationale: keeps the surface area small for review; the panel is read-only and the sub-sections don't have independent lifecycle. If a future feature needs to reuse one sub-section, it can be extracted at that time. + +The panel takes ONE prop: `confidence: ConfidenceShape | null | undefined`. No callbacks, no shared state, no parent communication. + +### Interaction behavior table + +| User action | Frontend behavior | API call | +|---|---|---| +| Navigate to `/studies/[id]` for a completed study with per_query_metrics | `useStudy` hook fetches study; `confidence` is part of the response; panel renders | `GET /api/v1/studies/{id}` (existing; no new endpoint) | +| Hover any tooltip trigger | `<InfoTooltip>` displays the glossary text | none | +| Visit `/studies/[id]` for a study with `confidence === null` | Panel renders nothing; no empty-state shell | (same GET) | + +### Handler function patterns + +No new handlers. The panel is read-only display. All interactivity is via the existing `<InfoTooltip>` primitive. + +### Legacy behavior parity + +**No legacy behavior parity table — no user-facing component >100 LOC is being deleted or migrated in this plan.** The trials table extension proposed in the original idea was dropped at spec-gen (Decision C2-F2 / cycle-2 GPT-5.5 review F10). The new ConfidencePanel is additive; the existing study-detail page keeps every prior behavior. 
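
The gating rules in this UI guidance (nothing rendered when `confidence` is null or undefined, each sub-section shown only when its own sub-field is present, regressor table capped at 5 rows) can be sketched as a pure helper. This is a minimal sketch, not the plan's implementation: `ConfidenceSketch`, `visibleSections`, and `cappedRegressors` are hypothetical names, and the field shapes are simplified assumptions standing in for the generated `ConfidenceShape` type.

```typescript
// Hypothetical, simplified stand-in for the generated ConfidenceShape type.
// Field names follow the UI element inventory; the inner shapes are assumptions.
interface ConfidenceSketch {
  headline: string;
  ci_95?: { low: number; high: number } | null;
  per_query_outcomes?: {
    improved: number;
    unchanged: number;
    regressed: number;
    top_regressors: string[];
  } | null;
  runner_up_gap?: { value: number } | null;
  late_trial_stddev?: { value: number } | null;
  convergence?: { regime: string } | null;
}

type SectionName = 'headline' | 'outcome_chips' | 'regressor_table' | 'callouts';

// Cap the regressor list at 5 rows, as the layout section describes.
function cappedRegressors(regressors: string[]): string[] {
  return regressors.slice(0, 5);
}

// Decide which of the 4 sub-sections would render for a given payload.
function visibleSections(
  confidence: ConfidenceSketch | null | undefined,
): SectionName[] {
  // Whole-object null/undefined: render nothing, no empty-state shell.
  if (confidence == null) return [];

  const sections: SectionName[] = ['headline'];

  const outcomes = confidence.per_query_outcomes;
  if (outcomes != null) {
    sections.push('outcome_chips');
    // The regressor table only appears when there is something to list.
    if (outcomes.top_regressors.length > 0) sections.push('regressor_table');
  }

  // The callouts row appears if at least one of its three callouts has data.
  if (
    confidence.runner_up_gap != null ||
    confidence.late_trial_stddev != null ||
    confidence.convergence != null
  ) {
    sections.push('callouts');
  }
  return sections;
}
```

Keeping the decision in a pure function like this would let the partial-render vitest cases assert on section names without mounting the component; whether the real panel factors it this way is a design choice left to Story 2.2.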
+ +--- + +## 3) Testing workstream + +### 3.1 Unit tests +- Location: `backend/tests/unit/domain/study/test_confidence.py` (Story 1.3) — 25+ cases +- Location: `backend/tests/unit/workers/test_digest_prompt_render.py` — 5 new cases (Story 1.6 — 4 user-prompt + 1 system-prompt per cycle-1 F10) +- Scope: domain helpers (bootstrap_ci, classify_runner_up_gap, compute_late_trial_stddev, classify_convergence_regime, classify_query_outcomes, top_regressors), `compute_study_confidence` orchestrator, every FR-7 degraded path, digest prompt rendering with/without confidence +- DoD: + - [ ] All 25+ confidence cases pass deterministically + - [ ] Bootstrap CI seed determinism asserted (AC-4) + - [ ] Every FR-7 sub-field degraded path has an explicit test + +### 3.2 Integration tests +- Location: `backend/tests/integration/test_trials_per_query_metrics_migration.py` (Story 1.1) — 3 cases +- Location: `backend/tests/integration/test_run_trial_per_query_persistence.py` (Story 1.2) — 2 cases +- Location: `backend/tests/integration/test_studies_api_confidence.py` (Story 1.4) — 11 cases +- Location: `backend/tests/integration/test_open_pr_worker_confidence_plumbing.py` (Story 1.5) — 1 case +- Scope: migration round-trip; worker persistence on success + failure; full GET /studies/{id} response with confidence; real PR worker drives end-to-end +- DoD: + - [ ] All 17 integration cases pass (3 + 2 + 11 + 1) + - [ ] AC-1, AC-2, AC-3, AC-3a, AC-4, AC-5, AC-6, AC-7, AC-8, AC-9, AC-10, AC-15, AC-16, AC-17 covered (cycle-1 GPT-5.5 F9) + +### 3.3 Contract tests +- Location: `backend/tests/contract/test_studies_api_contract.py` — 2 new cases (Story 1.4) +- Location: `backend/tests/contract/test_pr_body_confidence_section.py` (Story 1.5) — 4 cases +- Scope: OpenAPI shape lock for `ConfidenceShape`; PR body section markdown shape across all 4 confidence-population states (full / partial / per-query-only-missing / whole-object-null) +- DoD: + - [ ] 6 new contract cases pass + - [ ] AC-11, AC-12 
covered + +### 3.4 E2E tests +- Location: `ui/tests/e2e/studies.spec.ts` — 2 new cases (Story 2.3) +- Scope: real-backend; ConfidencePanel renders for seeded completed study; panel renders nothing for confidence=null +- Rule: **Must use real browser interactions via Playwright's `page` object.** No `page.route()` mocking. API helpers acceptable for setup; assertions must verify browser-visible DOM elements. +- DoD: + - [ ] Both new Playwright cases pass via `pnpm playwright test` + - [ ] AC-13 covered + +### 3.5 Migration verification +- [ ] `migrations/versions/0015_trials_per_query_metrics.py` includes `downgrade()` (Story 1.1) +- [ ] `alembic upgrade head` succeeds +- [ ] Round-trip verified: `alembic downgrade -1 && alembic upgrade head` +- [ ] DB CHECK constraint `trials_per_query_metrics_object_check` is active after upgrade (test in Story 1.1) + +### 3.6 CI gates +- [ ] `make test-unit` +- [ ] `make test-integration` +- [ ] `make test-contract` +- [ ] `cd ui && pnpm test` +- [ ] `cd ui && pnpm playwright test tests/e2e/studies.spec.ts` + +### 3.7 Existing test impact audit + +| Test file | Pattern | Count | Action | +|---|---|---|---| +| `backend/tests/integration/test_studies_api.py` | Asserts `StudyDetail` shape | ≥1 | Add assertion that `confidence` key is present (may be null). No existing assertion breaks. | +| `backend/tests/integration/test_digest_zero_trials.py` | Digest worker with `best_metric=None` | 1 | No change — assert `confidence is None` propagates correctly. | +| `backend/tests/integration/test_digest_zero_trials_with_openai_unconfigured.py` | Degraded-mode digest | 1 | No change — same as above. | +| `backend/tests/integration/_digest_helpers.py` | Test seed helper | — | Optional extension: add `per_query_metrics` parameter so tests that need confidence data can seed it. | +| `backend/tests/contract/test_openapi_surface.py` | OpenAPI snapshot | — | Snapshot will change to include `ConfidenceShape`. 
Re-bake the snapshot in Story 1.4 (precedent: `feat_cluster_target_filter` did the same for `target_filter`). | + +--- + +## 4) Documentation update workstream + +### 4.0 Core context files + +- [ ] `state.md` — update on Story 1.1 (Alembic head bump to `0015_trials_per_query_metrics`); update on final story (feature ship status, branch context) +- [ ] `architecture.md` — add a line under "Where the code lives" → "domain/" describing `backend/app/domain/study/confidence.py` (new module). Optional: add a critical-flow bullet for "confidence computation on StudyDetail read" if the dashboard pattern warrants it. +- [ ] `CLAUDE.md` — no update required (no new convention, no new rule) + +### 4.1 Architecture docs + +- [ ] `docs/01_architecture/data-model.md` — add `trials.per_query_metrics` to the per-table column reference. Note nullable + post-`0015` semantics. Add forward-ref note under `studies.baseline_metric` that Phase 2 will add `baseline_trial_id` (per [`phase2_idea.md`](phase2_idea.md)). +- [ ] `docs/01_architecture/optimization.md` — add a brief "Confidence signals" subsection. Reference the new domain module + the 4-query read pattern. + +### 4.2 Product docs + +- [ ] No update — this spec IS the product doc artifact. + +### 4.3 Runbooks + +- [ ] No new runbook required (no new operator action). + +### 4.4 Security docs + +- [ ] No update — no new security surface. + +### 4.5 Quality docs + +- [ ] No update — existing test-layer convention covers the new test files. + +**Documentation DoD** +- [ ] `state.md`, `architecture.md` consistent with shipped behavior +- [ ] `docs/01_architecture/data-model.md` + `optimization.md` updated + +--- + +## 5) Lean refactor workstream + +### 5.1 Refactor goals + +None planned. The feature is purely additive across all surfaces. + +### 5.2 Planned refactor tasks + +- [ ] None. + +### 5.3 Refactor guardrails + +- N/A — no refactor in this plan. 
+ +--- + +## 6) Dependencies, risks, and mitigations + +### Dependencies + +| Dependency | Needed by | Status | Risk if missing | +|---|---|---|---| +| `feat_digest_proposal` (PR #41) | Story 1.6 (digest prompt extension) | implemented | If reverted, the digest prompt update is unreachable; feature still ships at the API + UI layers. | +| `feat_github_pr_worker` (PR #45) | Story 1.5 (PR body section) | implemented | If reverted, the PR body extension is unreachable; feature still ships at the API + UI layers. | +| `feat_studies_ui` (PR #50) | Story 2.2 (ConfidencePanel mount) | implemented | If reverted, the UI panel has no host page; backend surfaces unaffected. | +| `feat_llm_judgments` (PR #35) | Story 1.2 (worker persistence) | implemented | Per-query metrics depend on judgments; missing judgments = empty per_query data (graceful via FR-7). | +| `feat_contextual_help` (PR #122) | Story 2.2 (tooltips + glossary) | implemented | If reverted, the `` primitive is missing; tooltips would need a different implementation. | +| numpy 1.x (via pytrec_eval) | Story 1.3 (bootstrap CI) | transitive dep verified | Cannot ship without numpy. Confirmed installed at `.venv/lib/python3.13/site-packages/numpy/__init__.py`. | +| Alembic head `0014_clusters_target_filter` | Story 1.1 (migration sequencing) | confirmed via `ls migrations/versions/` | Required so `0015` applies cleanly. | + +### Risks + +| Risk | Likelihood | Impact | Mitigation | +|---|---|---|---| +| Bootstrap CI seed determinism fails — different numpy versions produce different sequences from the same seed | L | M | Pin numpy version in `pyproject.toml` (if not already pinned). Asserted via AC-4 integration test that re-reads the same study and confirms byte-equal CI values. | +| `compute_study_confidence` performance worse than budget (<100ms) on 1000-trial × 100-query studies | L | L | The 4-query read pattern guarantees ~30KB wire load. Bootstrap loop is ~5ms for N=100. Spec §13 explicitly budgets <100ms. 
If exceeded, fallback is to compute confidence asynchronously and stash on a denormalized column (future MVP2 work). | +| Test coverage gap on the digest prompt rendering | L | M | Story 1.6's 5 new test cases cover with/without confidence × per_query_outcomes present/absent matrix + the system-prompt FR-6 replacement string assertion (AC-14 system-prompt half). | +| The OpenAPI snapshot test breaks — Story 1.4 changes the schema | M | L | Precedent: `feat_cluster_target_filter` re-baked the snapshot in the same PR. Same approach here. | + +### Failure mode catalog + +| Failure mode | Trigger | Expected system behavior | Recovery | +|---|---|---|---| +| Winner trial row missing (cascade-delete race) | `study.best_trial_id` resolves to deleted row | `compute_study_confidence` returns None (whole-object null); existing `digest_best_trial_missing` log event fires; PR body has no `## Confidence` section | None needed — graceful per FR-7 | +| `pytrec_eval` produces empty `per_query` dict (judgments don't match query_ids) | Misconfigured judgment list | Worker writes empty dict to `per_query_metrics` (not NULL); analytics treat empty as "no per-query data"; ConfidencePanel renders aggregate-only | Operator regenerates judgments | +| numpy version mismatch on bootstrap | Operator uses a different numpy version than the one pinned | Could produce different CI numbers vs. test fixtures | Pin numpy in `pyproject.toml`; CI uses the same lockfile | +| StudyDetail response too large for old clients | A 1000-trial × 100-query study produces ~30KB confidence payload | Old clients ignore the field; new clients render normally | None needed | +| OpenAPI snapshot test fails after Story 1.4 | Snapshot test wasn't re-baked | CI fails | Re-bake snapshot in same PR (precedent: `feat_cluster_target_filter` PR #168) | + +--- + +## 7) Sequencing and parallelization + +### Suggested sequence + +1. **Story 1.1** — Migration (unblocks everything else) +2. 
**Story 1.2** + **Story 1.3** in parallel — worker write (1.2) is one-line trivial; domain module (1.3) is the longest story +3. **Story 1.4** — API enrichment (depends on 1.3 for `ConfidenceShape` import; depends on 1.1 for column existence) +4. **Story 1.5** + **Story 1.6** in parallel — PR body (1.5) and digest prompt (1.6) both depend on 1.4 +5. **Epic 1 gate** — verify all backend stories green +6. **Story 2.1** — types + enums (depends on 1.4 OpenAPI shape being merged) +7. **Story 2.2** — ConfidencePanel component + page mount +8. **Story 2.3** — E2E +9. **Epic 2 gate** + final state.md + architecture.md update + +### Parallelization opportunities + +- 1.2 + 1.3 can be developed by different contributors (no file overlap) +- 1.5 + 1.6 can be developed by different contributors (no file overlap) +- 2.1 must precede 2.2 (types are a hard prerequisite) +- E2E (2.3) is strictly last + +--- + +## 8) Rollout and cutover plan + +- **Rollout stages:** Single-stage rollout. RelyLoop is single-tenant + local-only through MVP3 per [`docs/01_architecture/tech-stack.md` §"Canonical release matrix"](../../../01_architecture/tech-stack.md). Merge to main = available to all operators on their next `make up`. +- **Feature flag strategy:** None. The feature ships in one PR. +- **Migration / cutover steps:** Operators run `make migrate` after pulling main to apply `0015_trials_per_query_metrics`. Old trials retain `per_query_metrics IS NULL` and degrade gracefully (FR-7 + AC-3). +- **Reconciliation / repair strategy:** None — additive nullable column, no data loss on downgrade, no in-flight breaking change. 
+ +--- + +## 9) Execution tracker + +### Current sprint + +- [ ] Story 1.1 — Migration `0015_trials_per_query_metrics` +- [ ] Story 1.2 — Persist `per_query_metrics` in `run_trial` +- [ ] Story 1.3 — Domain module `confidence.py` +- [ ] Story 1.4 — `ConfidenceShape` + StudyDetail enrichment +- [ ] Story 1.5 — PR body section + worker plumbing +- [ ] Story 1.6 — Digest narrative prompt extension +- [ ] **Epic 1 gate** +- [ ] Story 2.1 — TypeScript types + enums +- [ ] Story 2.2 — `<ConfidencePanel>` component + glossary + page mount +- [ ] Story 2.3 — Playwright E2E +- [ ] **Epic 2 gate** +- [ ] Final state.md + architecture.md update + +### Blocked items + +(none — feature has all dependencies satisfied) + +### Done this sprint + +(none yet — implementation has not started) + +--- + +## 10) Story-by-Story Verification Gate + +Before marking any story complete, the executing engineer or `/impl-execute` agent must attach evidence for: + +- [ ] Files created/modified match story scope (`New files` / `Modified files` tables) +- [ ] Endpoint contract implemented exactly as documented (method/path/body/status/error code) +- [ ] Key interfaces implemented with compatible signatures +- [ ] Required tests added/updated for all four layers where applicable +- [ ] Commands executed and passed: + - [ ] `make test-unit` + - [ ] `make test-integration` (or targeted subset with explanation) + - [ ] `make test-contract` + - [ ] `cd ui && pnpm test` + - [ ] `cd ui && pnpm playwright test tests/e2e/studies.spec.ts` (Story 2.3 only) +- [ ] Migration round-trip evidence included (Story 1.1 only) +- [ ] Related docs/checklists updated in same PR when behavior/contract changed + +--- + +## 11) Plan consistency review + +### Endpoint count +- Spec §8.1 lists 1 modified endpoint (`GET /api/v1/studies/{id}`). +- Plan covers it in Story 1.4. ✅ + +### Error code coverage +- Spec §8.5 lists 0 new error codes. ✅ +- Plan introduces no new error codes. 
✅ + +### FR coverage +- All 8 FRs (FR-1 through FR-7, with FR-4a separately) appear in §1 traceability table. ✅ +- Every FR is assigned to at least one story. ✅ + +### Story internal consistency +- No file appears in more than one story's "New files" table. ✅ +- Every "Modified files" entry exists in the codebase (verified by grep during plan-gen). ✅ +- Endpoint table in Story 1.4 matches the Pydantic schemas in the same story. ✅ + +### Test file count and assignment + +**New test files: 7.** (cycle-1 GPT-5.5 F5 fix — original arithmetic mis-stated this.) + +| Layer | Path | Story | Case count | +|---|---|---|---| +| Unit | `backend/tests/unit/domain/study/test_confidence.py` | 1.3 | 25+ | +| Integration | `backend/tests/integration/test_trials_per_query_metrics_migration.py` | 1.1 | 3 | +| Integration | `backend/tests/integration/test_run_trial_per_query_persistence.py` | 1.2 | 2 | +| Integration | `backend/tests/integration/test_studies_api_confidence.py` | 1.4 | 11 | +| Integration | `backend/tests/integration/test_open_pr_worker_confidence_plumbing.py` | 1.5 | 1 | +| Contract | `backend/tests/contract/test_pr_body_confidence_section.py` | 1.5 | 4 | +| Component | `ui/src/__tests__/components/studies/confidence-panel.test.tsx` | 2.2 | 12 | + +**Modified existing test files: 4.** + +| Path | Story | Cases added | +|---|---|---| +| `backend/tests/contract/test_studies_api_contract.py` | 1.4 | 2 (OpenAPI shape lock) | +| `backend/tests/unit/workers/test_digest_prompt_render.py` | 1.6 | 5 (user + system prompt — cycle-1 F10) | +| `backend/tests/integration/_digest_helpers.py` | 1.4 | helper extension (optional `per_query_metrics` + `optuna_trial_number` params); not test cases per se | +| `ui/tests/e2e/studies.spec.ts` + `ui/tests/e2e/helpers/seed.ts` | 2.3 | 2 Playwright cases + new seed helper | + +Every test file is owned by exactly one story; no orphans. ✅ + +### Gate arithmetic +- Epic 1 gate: 6 stories below — gate enumerates exactly 6 stories' completion. 
✅ +- Epic 2 gate: 3 stories below — gate enumerates 3 stories' completion. ✅ + +### Open questions resolved +- Spec §19 lists 0 open questions remaining (all 7 preflight questions resolved by Decision Log D1–D10). ✅ +- Plan introduces no new open questions. ✅ + +### Frontend UI Guidance completeness +- Insertion point: documented ✅ +- Analogous markup patterns: ✅ (mount pattern, conditional render, badge, InfoTooltip) +- Layout and structure: ✅ +- Modal/dialog pattern: N/A — feature has no dialogs +- Visual consistency table: ✅ +- Component composition: ✅ (single component, no extraction) +- Interaction behavior table: ✅ +- Handler function patterns: N/A — read-only display, no handlers +- Information architecture placement: ✅ +- Tooltips and contextual help: ✅ (6 glossary keys, primitive cited) +- Legacy behavior parity: explicitly N/A with citation ✅ + +### Plan ↔ codebase verification +- Migration path `migrations/versions/` verified (precedent: `0014_clusters_target_filter.py` at that path). ✅ +- Current Alembic head `0014_clusters_target_filter` verified via `ls migrations/versions/ | tail -3`. ✅ +- Router registration pattern verified at `backend/app/main.py:165-173`. ✅ (no new router this feature) +- `_render_pr_body_study_backed` at `backend/workers/git_pr.py:488` verified during spec-gen. ✅ +- `scoring.py:194` return shape verified. ✅ +- `trials.py:440` worker write line verified. ✅ +- `StudyDetail` Pydantic at `backend/app/api/v1/schemas.py:613` verified. ✅ +- `_K_REQUIRED_METRICS` at `schemas.py:521` verified. ✅ +- `ObjectiveMetric` Literal at `schemas.py:214` verified. ✅ +- `render_digest_user_prompt` at `backend/app/llm/digest_prompt.py:67` verified. 
✅ + +### Infrastructure path verification +- Migration directory: `migrations/versions/` (NOT `backend/app/db/migrations/versions/`) ✅ +- Revision numbering: `"0015"` (4-char convention from `0014`) ✅ +- Domain module path: `backend/app/domain/study/confidence.py` matches existing pattern `backend/app/domain/study/search_space_defaults.py` from `feat_agent_propose_search_space` ✅ +- Test file paths: `backend/tests/unit/domain/study/test_confidence.py` matches the precedent for the search-space-defaults parity test ✅ + +### Frontend data plumbing verification +- `ConfidencePanel` consumes `study.confidence` — verified that the existing `useStudy` hook returns the full `StudyDetail` shape (after Story 1.4 adds the field) ✅ + +### Persistence scope consistency +- N/A — feature uses no `localStorage` or `sessionStorage`. + +### Enumerated value contract audit +- 3 new wire-enum value arrays added to `ui/src/lib/enums.ts` per Story 2.1, each with a source-of-truth comment citing `backend/app/domain/study/confidence.py` (the Literals live in the domain module, not `schemas.py`, per Story 2.1). ✅ +- Spec §8.4 enumerated-value-contracts table covers all 4 new Literals + the reused `ObjectiveMetric`. ✅ +- ConfidencePanel JSX includes per-branch source-of-truth comments per Story 2.2. ✅ + +### Admin control audit +- N/A — MVP4+ only. RelyLoop is single-tenant in MVP1. + +### Audit-event coverage audit +- N/A — MVP2+ only. RelyLoop has no `audit_log` table yet in MVP1. + +--- + +## 12) Definition of plan done + +This implementation plan is execution-ready when: + +- [x] Every FR is mapped to stories/tasks/tests/docs updates (§1). +- [x] Every story includes New files, Modified files, Endpoints (when applicable), Key interfaces, Tasks, and DoD. +- [x] Test layers (unit/integration/contract/e2e) are explicitly scoped (§3). +- [x] Documentation updates across docs/01-05 are planned and owned (§4). +- [x] Lean refactor scope is empty by design — explicitly N/A (§5). +- [x] Phase/epic gates are measurable. 
+- [x] Story-by-Story Verification Gate is included (§10). +- [ ] Plan consistency review (§11) completed with no unresolved findings — pending cross-model review. diff --git a/docs/02_product/planned_features/feat_pr_metric_confidence/pipeline_status.md b/docs/02_product/planned_features/feat_pr_metric_confidence/pipeline_status.md index 4bee1c69..e89ed9ae 100644 --- a/docs/02_product/planned_features/feat_pr_metric_confidence/pipeline_status.md +++ b/docs/02_product/planned_features/feat_pr_metric_confidence/pipeline_status.md @@ -18,8 +18,16 @@ - Phase 2 (deferred — tracked in [`phase2_idea.md`](phase2_idea.md)): orchestrator baseline-trial work + `studies.baseline_trial_id` column; switches comparison to true production baseline when available ## Plan -- Status: Not started -- Next: `/impl-plan-gen docs/02_product/planned_features/feat_pr_metric_confidence/feature_spec.md` +- Status: Approved (pending user review of the SPEC→PLAN advance) +- Date: 2026-05-21 +- File: [`implementation_plan.md`](implementation_plan.md) +- Cross-model review: GPT-5.5 converged at cycle 3 (17 findings total across 3 cycles — 2 High + 8 Medium + 7 Low — all accepted and patched) + - Cycle 1: 11 findings (2 High sequencing/architecture, 6 Medium, 3 Low) + - Cycle 2: 3 findings (2 High — import cycle + CI gating bug introduced by cycle-1 patches; 1 Medium drift) + - Cycle 3: 3 findings, all Low — convergence (no High, no Medium = stop) +- Stories: 9 total across 2 epics (Epic 1: 6 backend stories — migration → worker write → domain helper → API enrichment → PR body → digest prompt; Epic 2: 3 frontend stories — types → ConfidencePanel → E2E) +- Phases covered: Phase 1 only (Phase 2 baseline-trial work deferred per [`phase2_idea.md`](phase2_idea.md)) +- Next: User approval, then `/impl-execute` on this branch. 
## Implementation - Status: Not started From 9c950216827c8b6798645a123548a7e36ae1a238 Mon Sep 17 00:00:00 2001 From: SoundMindsAI Date: Thu, 21 May 2026 09:11:43 -0400 Subject: [PATCH 05/17] feat(trials): add per_query_metrics JSONB column with CHECK constraint (Story 1.1) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Migration 0015_trials_per_query_metrics adds a nullable JSONB column to trials plus a CHECK constraint enforcing NULL-or-object. The run_trial worker (Story 1.2) writes scored["per_query"] from scoring.py:194; failed/pruned trials leave the column NULL. Old trials predating this migration stay NULL — confidence analytics degrade gracefully per spec FR-7 + AC-3. Shape: {query_id: {metric_name: float}} matching ScoreResult.per_query keys (ndcg, map, precision, recall, mrr — user-facing names, NOT pytrec_eval wire forms). The trials_per_query_metrics_object_check CHECK constraint enforces NULL or jsonb_typeof = 'object' at the DB level (cycle-1 GPT-5.5 F11 fix: write path is the Arq run_trial worker, not a Pydantic-validated HTTP request — DB-level check is the correct enforcement layer). Verification: - alembic upgrade head + downgrade -1 + upgrade head all clean in-container - Alembic head: 0014 → 0015 (state.md update in finalization commit) - 4 integration tests added in test_trials_per_query_metrics_migration.py (skip on hosts without DATABASE_URL_FILE per project local-vs-CI convention): upgrade adds column + CHECK; downgrade drops both; full round-trip preserves the other 10 trial columns; CHECK rejects non-object JSONB inserts. 
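Illustration (a minimal sketch, not part of the migration): the NULL-or-object rule above can be mirrored in plain Python to see which payloads the CHECK would accept. The names `SQL_NULL`, `jsonb_typeof_sketch`, and `passes_object_check` are hypothetical and exist only for this example; the real enforcement is the Postgres expression `per_query_metrics IS NULL OR jsonb_typeof(per_query_metrics) = 'object'`.

```python
from __future__ import annotations

from typing import Any

# Hypothetical sentinel standing in for "the column is SQL NULL".
# A JSON null stored in the column is a different thing (typeof 'null').
SQL_NULL = object()


def jsonb_typeof_sketch(value: Any) -> str:
    """Return the type name Postgres jsonb_typeof would report for a decoded JSON value."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # bool must be tested before int
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "object"
    raise TypeError(f"not a JSON value: {type(value)!r}")


def passes_object_check(value: Any) -> bool:
    """Mirror of: per_query_metrics IS NULL OR jsonb_typeof(per_query_metrics) = 'object'."""
    return value is SQL_NULL or jsonb_typeof_sketch(value) == "object"
```

Under this sketch an omitted column (SQL NULL) and `{"q1": {"ndcg": 0.5}}` pass, while `[]`, scalars, booleans, and a stored JSON null do not.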
Co-Authored-By: Claude Opus 4.7 --- backend/app/db/models/trial.py | 22 ++ ...test_trials_per_query_metrics_migration.py | 298 ++++++++++++++++++ .../versions/0015_trials_per_query_metrics.py | 53 ++++ 3 files changed, 373 insertions(+) create mode 100644 backend/tests/integration/test_trials_per_query_metrics_migration.py create mode 100644 migrations/versions/0015_trials_per_query_metrics.py diff --git a/backend/app/db/models/trial.py b/backend/app/db/models/trial.py index 1548bc10..445ab55e 100644 --- a/backend/app/db/models/trial.py +++ b/backend/app/db/models/trial.py @@ -13,6 +13,15 @@ The ``trials_study_metric`` index on ``(study_id, primary_metric DESC NULLS LAST)`` is created in the migration (Story 1.2) — not declared at the ORM level — so the ``DESC NULLS LAST`` ordering survives ``--autogenerate``. + +The ``per_query_metrics`` JSONB column (nullable; added by migration +``0015_trials_per_query_metrics`` for feat_pr_metric_confidence) carries the +per-query pytrec_eval scores from ``scoring.py::score()``'s ``per_query`` +dict. Shape: ``{query_id: {metric_name: float}}`` where ``metric_name`` is one +of the user-facing names (``ndcg``, ``map``, ``precision``, ``recall``, +``mrr``). The ``trials_per_query_metrics_object_check`` CHECK constraint +enforces NULL-or-object at the DB level (since the write path is the Arq +``run_trial`` worker, not a Pydantic-validated HTTP request). 
""" from __future__ import annotations @@ -36,6 +45,10 @@ class Trial(Base): "status IN ('complete', 'failed', 'pruned')", name="trials_status_check", ), + CheckConstraint( + "per_query_metrics IS NULL OR jsonb_typeof(per_query_metrics) = 'object'", + name="trials_per_query_metrics_object_check", + ), ) id: Mapped[str] = mapped_column(String(36), primary_key=True) @@ -62,6 +75,15 @@ class Trial(Base): """``{ndcg@10: ..., map: ..., p@10: ...}`` — every metric the study's objective enumerated, scored by ``backend/eval/scoring.py`` (lands in ``infra_optuna_eval``).""" + per_query_metrics: Mapped[dict[str, Any] | None] = mapped_column(JSONB, nullable=True) + """Per-query pytrec_eval scores from ``scoring.py::score()``'s + ``per_query`` dict, persisted on every successful trial (NULL on + failure/pruned and on trials predating migration 0015). Shape: + ``{query_id: {metric_name: float}}`` using user-facing metric names + (``ndcg``, ``map``, ``precision``, ``recall``, ``mrr``). Consumed by + ``backend.app.domain.study.confidence`` to compute the + ``ConfidenceShape`` surfaced on ``StudyDetail`` + PR body + digest + narrative (feat_pr_metric_confidence).""" duration_ms: Mapped[int | None] = mapped_column(Integer, nullable=True) """Wall-clock from ``study.ask()`` to ``study.tell()`` for this trial.""" status: Mapped[str] = mapped_column(Text, nullable=False) diff --git a/backend/tests/integration/test_trials_per_query_metrics_migration.py b/backend/tests/integration/test_trials_per_query_metrics_migration.py new file mode 100644 index 00000000..b8415af9 --- /dev/null +++ b/backend/tests/integration/test_trials_per_query_metrics_migration.py @@ -0,0 +1,298 @@ +"""``0015_trials_per_query_metrics`` migration test (feat_pr_metric_confidence Story 1.1). 
+ +Asserts the schema shape of the ``trials.per_query_metrics`` column added by +``migrations/versions/0015_trials_per_query_metrics.py``: + +* upgrade head adds the nullable JSONB column + CHECK constraint +* downgrade to 0014 drops the CHECK constraint and the column +* upgrade → downgrade → upgrade round-trip preserves the other 10 trial columns + and leaves ``per_query_metrics`` NULL on existing rows (AC-17 from the spec) +* the CHECK constraint rejects non-object JSONB inserts (AC for INV-1) + +Mirrors ``test_clusters_target_filter_migration.py`` for skip semantics + +alembic invocation. +""" + +from __future__ import annotations + +import json +import os +import socket +import subprocess +import uuid +from collections.abc import Iterator +from pathlib import Path +from urllib.parse import urlparse + +import pytest +from sqlalchemy import create_engine, text +from sqlalchemy.exc import IntegrityError + +from backend.app.core.settings import get_settings + +REPO = Path(__file__).resolve().parents[3] + + +def _postgres_reachable() -> bool: + if not os.environ.get("DATABASE_URL_FILE") or not os.environ.get("POSTGRES_PASSWORD_FILE"): + return False + try: + url = get_settings().database_url + except Exception: # noqa: BLE001 + return False + parsed = urlparse(url) + host = parsed.hostname or "localhost" + port = parsed.port or 5432 + try: + with socket.create_connection((host, port), timeout=1.0): + return True + except (TimeoutError, OSError): + return False + + +pytestmark = pytest.mark.skipif( + not _postgres_reachable(), + reason=( + "Postgres not reachable from this process — see " + "docs/03_runbooks/local-dev.md §'Local-vs-CI test layers'." 
+ ), +) + + +def _alembic(*args: str) -> subprocess.CompletedProcess[str]: + return subprocess.run( + ["uv", "run", "alembic", *args], + cwd=REPO, + capture_output=True, + text=True, + check=True, + ) + + +def _sync_database_url() -> str: + return get_settings().database_url.replace("postgresql+asyncpg://", "postgresql://") + + +@pytest.fixture +def restore_head() -> Iterator[None]: + """Always leave the DB at head, even if the test failed mid-downgrade.""" + yield + try: + _alembic("upgrade", "head") + except subprocess.CalledProcessError: + pass + + +def _column_info(conn) -> dict[str, dict[str, object]]: + rows = conn.execute( + text( + "SELECT column_name, data_type, is_nullable " + "FROM information_schema.columns " + "WHERE table_schema = 'public' AND table_name = 'trials'" + ) + ).fetchall() + return {r[0]: {"data_type": r[1], "nullable": r[2]} for r in rows} + + +def _check_constraint_names(conn) -> set[str]: + rows = conn.execute( + text( + "SELECT conname FROM pg_constraint " + "WHERE conrelid = 'public.trials'::regclass AND contype = 'c'" + ) + ).fetchall() + return {r[0] for r in rows} + + +@pytest.mark.integration +class TestTrialsPerQueryMetricsMigration: + def test_upgrade_adds_nullable_jsonb_column_with_check(self, restore_head: None) -> None: + """0015 upgrade adds ``per_query_metrics`` as a nullable jsonb column + AND adds the ``trials_per_query_metrics_object_check`` CHECK constraint.""" + _alembic("upgrade", "head") + engine = create_engine(_sync_database_url(), future=True) + try: + with engine.connect() as conn: + cols = _column_info(conn) + assert "per_query_metrics" in cols, ( + "0015 upgrade should add trials.per_query_metrics" + ) + col = cols["per_query_metrics"] + assert col["data_type"] == "jsonb" + assert col["nullable"] == "YES" + + checks = _check_constraint_names(conn) + assert "trials_per_query_metrics_object_check" in checks, ( + "0015 upgrade should add the per_query_metrics CHECK constraint" + ) + finally: + engine.dispose() + + 
def test_downgrade_drops_check_and_column(self, restore_head: None) -> None: + """downgrade to 0014 drops the CHECK constraint and the column.""" + _alembic("upgrade", "head") + _alembic("downgrade", "0014") + engine = create_engine(_sync_database_url(), future=True) + try: + with engine.connect() as conn: + cols = _column_info(conn) + assert "per_query_metrics" not in cols, ( + "downgrade to 0014 should drop trials.per_query_metrics" + ) + checks = _check_constraint_names(conn) + assert "trials_per_query_metrics_object_check" not in checks, ( + "downgrade to 0014 should drop the per_query_metrics CHECK" + ) + finally: + engine.dispose() + + def test_roundtrip_preserves_other_columns(self, restore_head: None) -> None: + """Upgrade → downgrade → upgrade leaves the other 10 trials columns intact + AND ``per_query_metrics`` present + nullable on the final upgrade (AC-17).""" + _alembic("upgrade", "head") + engine = create_engine(_sync_database_url(), future=True) + try: + with engine.connect() as conn: + before = set(_column_info(conn).keys()) + finally: + engine.dispose() + + _alembic("downgrade", "0014") + _alembic("upgrade", "head") + + engine = create_engine(_sync_database_url(), future=True) + try: + with engine.connect() as conn: + after = _column_info(conn) + assert set(after.keys()) == before, ( + f"column set changed across round-trip: " + f"only-before={before - set(after.keys())}, " + f"only-after={set(after.keys()) - before}" + ) + assert after["per_query_metrics"]["nullable"] == "YES" + finally: + engine.dispose() + + def test_check_constraint_rejects_non_object_jsonb(self, restore_head: None) -> None: + """``per_query_metrics`` must be NULL or a JSON object — arrays, scalars, + and booleans MUST be rejected by the CHECK constraint. + + Per cycle-3 GPT-5.5 F3 adjudication: SQLAlchemy wraps the asyncpg + ``CheckViolationError`` as ``sqlalchemy.exc.IntegrityError``; assert on + the wrapping type and inspect ``.orig`` for the underlying cause. 
+ """ + _alembic("upgrade", "head") + + engine = create_engine(_sync_database_url(), future=True) + try: + with engine.connect() as conn: + # Seed the minimal FK chain trials needs: cluster → query_set → + # judgment_list → query_template → study, then a trial row whose + # per_query_metrics will violate the CHECK. + suffix = uuid.uuid4().hex[:8] + cluster_id = str(uuid.uuid4()) + qs_id = str(uuid.uuid4()) + tpl_id = str(uuid.uuid4()) + jl_id = str(uuid.uuid4()) + study_id = str(uuid.uuid4()) + trial_id = str(uuid.uuid4()) + + with conn.begin(): + conn.execute( + text( + "INSERT INTO clusters (id, name, engine_type, environment, " + "base_url, auth_kind, credentials_ref) VALUES " + "(:id, :name, 'elasticsearch', 'dev', " + "'http://elasticsearch:9200', 'es_basic', 'local-es')" + ), + {"id": cluster_id, "name": f"migration-check-{suffix}"}, + ) + conn.execute( + text( + "INSERT INTO query_sets (id, name, cluster_id) " + "VALUES (:id, :name, :cid)" + ), + {"id": qs_id, "name": f"qs-{suffix}", "cid": cluster_id}, + ) + conn.execute( + text( + "INSERT INTO query_templates (id, name, engine_type, body, " + "declared_params) VALUES (:id, :name, 'elasticsearch', " + ":body, :params)" + ), + { + "id": tpl_id, + "name": f"tpl-{suffix}", + "body": '{"query":{"match_all":{}}}', + "params": json.dumps({}), + }, + ) + conn.execute( + text( + "INSERT INTO judgment_lists (id, name, query_set_id, " + "cluster_id, target, rubric, status) VALUES " + "(:id, :name, :qs, :cid, 'idx', 'r', 'ready')" + ), + { + "id": jl_id, + "name": f"jl-{suffix}", + "qs": qs_id, + "cid": cluster_id, + }, + ) + conn.execute( + text( + "INSERT INTO studies (id, name, cluster_id, target, " + "template_id, query_set_id, judgment_list_id, search_space, " + "objective, config, status, optuna_study_name) VALUES " + "(:id, :name, :cid, 'idx', :tpl, :qs, :jl, " + ":space, :obj, :cfg, 'queued', :osn)" + ), + { + "id": study_id, + "name": f"study-{suffix}", + "cid": cluster_id, + "tpl": tpl_id, + "qs": qs_id, + 
"jl": jl_id, + "space": json.dumps({"params": {}}), + "obj": json.dumps({"metric": "ndcg", "k": 10, "direction": "maximize"}), + "cfg": json.dumps({"max_trials": 1}), + "osn": study_id, + }, + ) + + # Attempt the CHECK-violating insert. A JSON array should fail. + with pytest.raises(IntegrityError) as exc_info: + with conn.begin(): + conn.execute( + text( + "INSERT INTO trials (id, study_id, optuna_trial_number, " + "params, metrics, status, per_query_metrics) VALUES " + "(:id, :sid, 0, :params, :metrics, 'complete', " + ":pq::jsonb)" + ), + { + "id": trial_id, + "sid": study_id, + "params": json.dumps({}), + "metrics": json.dumps({"ndcg": 0.5}), + "pq": "[]", + }, + ) + + # Confirm the CHECK fired (not an FK / NOT NULL / etc.) + assert ( + "trials_per_query_metrics_object_check" in str(exc_info.value.orig) + or "check constraint" in str(exc_info.value.orig).lower() + ), f"expected per_query_metrics CHECK violation; got {exc_info.value.orig}" + + # Cleanup — best-effort delete of the seeded FK chain. + with conn.begin(): + conn.execute(text("DELETE FROM studies WHERE id = :id"), {"id": study_id}) + conn.execute(text("DELETE FROM judgment_lists WHERE id = :id"), {"id": jl_id}) + conn.execute(text("DELETE FROM query_templates WHERE id = :id"), {"id": tpl_id}) + conn.execute(text("DELETE FROM query_sets WHERE id = :id"), {"id": qs_id}) + conn.execute(text("DELETE FROM clusters WHERE id = :id"), {"id": cluster_id}) + finally: + engine.dispose() diff --git a/migrations/versions/0015_trials_per_query_metrics.py b/migrations/versions/0015_trials_per_query_metrics.py new file mode 100644 index 00000000..37df1e27 --- /dev/null +++ b/migrations/versions/0015_trials_per_query_metrics.py @@ -0,0 +1,53 @@ +"""trials_per_query_metrics. 
+ +Revision ID: 0015 +Revises: 0014 +Create Date: 2026-05-21 00:00:00.000000 + +feat_pr_metric_confidence Story 1.1 — adds a nullable ``per_query_metrics JSONB`` +column to ``trials`` plus a CHECK constraint enforcing the column is NULL or a +JSON object (not an array, scalar, or boolean). The run_trial worker writes +``scored["per_query"]`` (from ``backend/app/eval/scoring.py:194``) on the +success branch; failed/pruned trials leave the column NULL. Old trials predating +this migration stay NULL (no backfill — confidence analytics degrade gracefully +per spec FR-7). + +Shape: ``{query_id: {metric_name: float}}`` matching ``ScoreResult.per_query`` +keys (``ndcg``, ``map``, ``precision``, ``recall``, ``mrr`` — user-facing names, +NOT the pytrec_eval wire forms). +""" + +from collections.abc import Sequence + +import sqlalchemy as sa +from alembic import op +from sqlalchemy.dialects.postgresql import JSONB + +revision: str = "0015" +down_revision: str | None = "0014" +branch_labels: str | Sequence[str] | None = None +depends_on: str | Sequence[str] | None = None + + +def upgrade() -> None: + """Add nullable ``per_query_metrics`` JSONB column + CHECK constraint.""" + op.add_column( + "trials", + sa.Column("per_query_metrics", JSONB(), nullable=True), + ) + op.create_check_constraint( + "trials_per_query_metrics_object_check", + "trials", + "per_query_metrics IS NULL OR jsonb_typeof(per_query_metrics) = 'object'", + ) + + +def downgrade() -> None: + """Drop the CHECK constraint and the column (in that order — constraint + references the column).""" + op.drop_constraint( + "trials_per_query_metrics_object_check", + "trials", + type_="check", + ) + op.drop_column("trials", "per_query_metrics") From 032f342c0702819500c91145434d168d4221803e Mon Sep 17 00:00:00 2001 From: SoundMindsAI Date: Thu, 21 May 2026 09:13:45 -0400 Subject: [PATCH 06/17] feat(worker): persist per_query_metrics on successful run_trial completion (Story 1.2) MIME-Version: 1.0 Content-Type: text/plain; 
charset=UTF-8 Content-Transfer-Encoding: 8bit Worker change: backend/workers/trials.py adds per_query_metrics=scored["per_query"] to the success-path repo.create_trial() call (one-line addition). The failed-path call below keeps the kwarg omitted, so per_query_metrics stays NULL on status='failed' per FR-1 + INV-2. No repo signature change — repo.create_trial uses **fields: object pass-through, so the new kwarg flows straight to Trial(**fields) via SQLAlchemy. Integration tests added at backend/tests/integration/test_run_trial_per_query_persistence.py (skips on hosts without Postgres reachability per project convention): - test_successful_trial_writes_per_query_metrics: asserts non-NULL JSONB with shape {qid: {metric_name: float}} using user-facing metric names (ndcg, map, precision, recall, mrr — NOT pytrec_eval wire forms like ndcg_cut.10). - test_failed_trial_leaves_per_query_metrics_null: asserts NULL on simulated adapter exception, confirms worker correctly omits kwarg on failed path. Covers AC-1 + AC-2. Pairs with Story 1.1's migration test (which asserts schema shape) — this test asserts the worker writes the column correctly on each branch. Co-Authored-By: Claude Opus 4.7 --- .../test_run_trial_per_query_persistence.py | 178 ++++++++++++++++++ backend/workers/trials.py | 7 + 2 files changed, 185 insertions(+) create mode 100644 backend/tests/integration/test_run_trial_per_query_persistence.py diff --git a/backend/tests/integration/test_run_trial_per_query_persistence.py b/backend/tests/integration/test_run_trial_per_query_persistence.py new file mode 100644 index 00000000..3f43748a --- /dev/null +++ b/backend/tests/integration/test_run_trial_per_query_persistence.py @@ -0,0 +1,178 @@ +"""``run_trial`` per_query_metrics persistence (feat_pr_metric_confidence Story 1.2). 
+ +Asserts that the run_trial worker persists ``scored["per_query"]`` from +``backend/app/eval/scoring.py:194`` to ``trials.per_query_metrics`` on the +success branch (AC-1) AND leaves the column NULL on the failure branch (AC-2). + +Pairs with ``test_trials_per_query_metrics_migration.py`` (Story 1.1) — that +test asserts schema shape; this test asserts the worker actually writes the +column on success and omits it on failure. + +Reuses the established ``setup_study_with_cluster()`` + ``StubAdapter`` + +monkeypatch pattern from ``test_run_trial.py``. + +Skips automatically when Postgres isn't reachable from the host shell. +""" + +from __future__ import annotations + +from unittest.mock import AsyncMock + +import pytest + +from backend.app.core.settings import get_settings +from backend.app.db import repo +from backend.app.db.session import get_session_factory +from backend.app.eval.optuna_runtime import build_storage +from backend.tests.conftest import postgres_reachable +from backend.tests.integration.fixtures.handbuilt_qrels import ( + build_hits_response, + build_qrels, +) +from backend.tests.integration.fixtures.run_trial_setup import ( + cleanup_fixture, + create_optuna_trial_for_study, + setup_study_with_cluster, +) +from backend.tests.integration.fixtures.stub_adapter import StubAdapter + +pytestmark = [ + pytest.mark.integration, + pytest.mark.skipif( + not postgres_reachable(), + reason="Postgres not reachable — see docs/03_runbooks/local-dev.md", + ), +] + + +async def test_successful_trial_writes_per_query_metrics( + monkeypatch: pytest.MonkeyPatch, +): + """AC-1: a successful trial persists ``per_query_metrics`` as a non-NULL + JSONB object shaped ``{qid: {metric_name: float}}`` using user-facing + metric names (ndcg, map, precision, recall, mrr — NOT pytrec_eval wire + forms).""" + fixture = await setup_study_with_cluster() + + storage = build_storage(get_settings().database_url) + optuna_trial_number = create_optuna_trial_for_study( + storage, 
optuna_study_name=fixture.optuna_study_name + ) + + stub = StubAdapter( + engine_type="elasticsearch", + search_batch_response=build_hits_response(fixture.query_ids), + ) + monkeypatch.setattr("backend.workers.trials.build_adapter", lambda _cluster: stub) + monkeypatch.setattr( + "backend.workers.trials.load_qrels", + AsyncMock(return_value=build_qrels(fixture.query_ids)), + ) + + from backend.workers.trials import run_trial + + await run_trial( + ctx={"optuna_storage": storage}, + study_id=fixture.study_id, + optuna_trial_number=optuna_trial_number, + ) + + factory = get_session_factory() + async with factory() as db: + trials = await repo.list_trials_for_study(db, fixture.study_id) + assert len(trials) == 1 + t = trials[0] + assert t.status == "complete" + + # AC-1: per_query_metrics is non-NULL and shaped as expected. + assert t.per_query_metrics is not None, "successful trial must persist per_query_metrics (FR-1)" + assert isinstance(t.per_query_metrics, dict) + + # Every seeded query_id appears as a key. + persisted_qids = set(t.per_query_metrics.keys()) + expected_qids = set(fixture.query_ids) + assert persisted_qids == expected_qids, ( + f"per_query_metrics keys should match the seeded query_ids; " + f"got={persisted_qids}, expected={expected_qids}" + ) + + # Every value is a dict keyed by user-facing metric names. + expected_metric_keys = {"ndcg", "map", "precision", "recall", "mrr"} + for qid, per_metric in t.per_query_metrics.items(): + assert isinstance(per_metric, dict), ( + f"per_query_metrics[{qid}] must be a dict, got {type(per_metric)}" + ) + # The score() function returns one entry per metric in the + # study's objective set. Assert at least ndcg is present (the + # study's default objective metric) AND no pytrec_eval wire-form + # keys leak through (e.g., "ndcg_cut.10", "P_10"). 
+ assert per_metric, f"per_query_metrics[{qid}] is empty" + for metric_key in per_metric: + assert metric_key in expected_metric_keys, ( + f"unexpected metric key {metric_key!r} in per_query_metrics[{qid}]; " + f"score() should remap pytrec_eval wire names to " + f"{sorted(expected_metric_keys)}" + ) + assert isinstance(per_metric[metric_key], (int, float)), ( + f"per_query_metrics[{qid}][{metric_key}] must be numeric, " + f"got {type(per_metric[metric_key])}" + ) + + await cleanup_fixture(fixture) + + +async def test_failed_trial_leaves_per_query_metrics_null( + monkeypatch: pytest.MonkeyPatch, +): + """AC-2: when a trial fails (simulated adapter exception), the persisted + row has ``status='failed'`` AND ``per_query_metrics IS NULL``. The failure + branch at backend/workers/trials.py never passes ``per_query_metrics`` to + ``repo.create_trial`` — confirms the worker correctly omits the kwarg on + the failed path per FR-1.""" + fixture = await setup_study_with_cluster() + + storage = build_storage(get_settings().database_url) + optuna_trial_number = create_optuna_trial_for_study( + storage, optuna_study_name=fixture.optuna_study_name + ) + + # Stub adapter that raises on search_batch — pre-tell failure path. 
+ class _FailingAdapter: + engine_type = "elasticsearch" + aclose_called = False + + async def search_batch(self, *args, **kwargs): + raise RuntimeError("simulated upstream failure") + + async def aclose(self): + self.aclose_called = True + + stub: _FailingAdapter = _FailingAdapter() + monkeypatch.setattr("backend.workers.trials.build_adapter", lambda _cluster: stub) + monkeypatch.setattr( + "backend.workers.trials.load_qrels", + AsyncMock(return_value=build_qrels(fixture.query_ids)), + ) + + from backend.workers.trials import run_trial + + await run_trial( + ctx={"optuna_storage": storage}, + study_id=fixture.study_id, + optuna_trial_number=optuna_trial_number, + ) + + factory = get_session_factory() + async with factory() as db: + trials = await repo.list_trials_for_study(db, fixture.study_id) + assert len(trials) == 1 + t = trials[0] + assert t.status == "failed" + assert t.error is not None + # AC-2: per_query_metrics is NULL on the failure branch (kwarg omitted). + assert t.per_query_metrics is None, ( + "failed trial must leave per_query_metrics NULL (FR-1 + INV-2)" + ) + assert t.metrics == {} # existing contract — failure path writes empty dict + + await cleanup_fixture(fixture) diff --git a/backend/workers/trials.py b/backend/workers/trials.py index db03282a..6d30108e 100644 --- a/backend/workers/trials.py +++ b/backend/workers/trials.py @@ -430,6 +430,12 @@ async def run_trial(ctx: dict[str, Any], study_id: str, optuna_trial_number: int os._exit(1) # N. INSERT the trials row. + # feat_pr_metric_confidence Story 1.2: persist scored["per_query"] + # alongside the aggregate so confidence analytics + # (backend.app.domain.study.confidence) can compute bootstrap CI + + # named regressors without re-scoring. Failed/pruned trials below + # leave per_query_metrics unset (NULL — see DB CHECK constraint + # trials_per_query_metrics_object_check from migration 0015). 
await repo.create_trial( db, id=trial_id, @@ -438,6 +444,7 @@ async def run_trial(ctx: dict[str, Any], study_id: str, optuna_trial_number: int params=snapshot.params, primary_metric=primary, metrics=scored["aggregate"], + per_query_metrics=scored["per_query"], duration_ms=duration_ms, status="complete", error=None, From 3dc034919e2fb4f85c5bb5dbacb37047217dcb2d Mon Sep 17 00:00:00 2001 From: SoundMindsAI Date: Thu, 21 May 2026 09:21:25 -0400 Subject: [PATCH 07/17] feat(domain): add confidence helpers + ConfidenceShape Pydantic models (Story 1.3) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit backend/app/domain/study/confidence.py — pure-Python helpers that compute the ConfidenceShape exposed via StudyDetail (Story 1.4), the PR body "## Confidence" section (Story 1.5), and the digest narrative XML blocks (Story 1.6). 14 locked constants from FR-4 + FR-4a; 4 wire-value Literals (ConvergenceRegime, RunnerUpClassification, ComparisonAgainst, CIMethod); 8 Pydantic shapes (HeadlineShape, CIShape, RunnerUpGapShape, LateTrialStddevShape, ConvergenceShape, RegressorRowShape, PerQueryOutcomesShape, ConfidenceShape); 1 internal dataclass (_OutcomeSummary); 6 pure helpers + 1 orchestrator. Design deviation from the plan, with rationale: - The plan called for compute_study_confidence as `async def` with a `db: AsyncSession` arg. CLAUDE.md "Domain Layer" forbids I/O / DB access in the domain layer. The orchestrator is therefore PURE (synchronous, takes pre-fetched data); the API router's _detail() (Story 1.4) and the PR worker (Story 1.5) own the 4-query read pattern from FR-2 and pass the results in as kwargs. This is actually cleaner than the original plan — analytics are independently unit-testable without DB fixtures. - ObjectiveMetric is intentionally NOT imported from schemas.py per cycle-2 GPT-5.5 F1 (avoids the schemas ↔ confidence circular import). 
HeadlineShape.metric uses bare str; the upstream value is already validated by the existing Literal at schemas.py:214. - compute_outcome_summary returns _OutcomeSummary with regressor_candidates (qids only); build_regressor_rows hydrates with query_text after Q4 of the 4-query read pattern — cycle-1 GPT-5.5 F7 fix. - classify_runner_up_gap returns the full RunnerUpGapShape (value + classification + top10_within + runner_up_metric) — cycle-1 F8 fix. - ci_95 + headline.n_queries are computed from the winner alone, decoupled from the runner-up gate — cycle-2 F2 fix per AC-16. Tests: backend/tests/unit/domain/study/test_confidence.py — 29 unit cases covering every helper at every key boundary: - bootstrap_ci_95: degraded (N<5), happy path (N=5), seed determinism (AC-4), zero-variance edge case - classify_runner_up_gap: degraded (N<2), robust_plateau (AC-5), sharp_peak (AC-5 counter), 2-trial edge case - compute_late_trial_stddev: degraded (N<10), window=int(N*0.2) at N=50 (AC-6), window-floor at N=10 - classify_convergence_regime: degraded (N<3), early_held with late-window probe (AC-8), late_rising at 90% (AC-9), noisy (AC-8 counter — no late-window plateau), noisy (winner in middle) - compute_outcome_summary: empty input, unknown metric, NDCG threshold classification (AC-10), MAP threshold at 0.02, top-5 cap with sort by abs(delta) - build_regressor_rows: hydrate happy path, omit missing text (cascade race) - compute_study_confidence: AC-3a (whole-object null when winner missing), no complete trials, AC-3 (partial shape with aggregate signals when per_query=None), full shape with all data, AC-16 (ci_95 populates from winner alone with no runner-up) - constants_exported: drift guard for the 14 locked thresholds All 29 tests pass via `uv run pytest backend/tests/unit/domain/study/test_confidence.py`. make backend-lint + make backend-typecheck both green. 
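Illustration (a minimal sketch under stated assumptions, not the code in confidence.py): the percentile bootstrap behind bootstrap_ci_95 can be outlined as below. The constants match the locked values (1000 resamples, seed 42, 95% level, minimum 5 queries). The function name `sketch_bootstrap_ci`, the use of the standard library's `random.Random` in place of the numpy RNG, and the index-based percentile without interpolation are assumptions made to keep the example dependency-free, so its numbers will not match the module's output.

```python
from __future__ import annotations

import random
from statistics import fmean

BOOTSTRAP_N = 1000
BOOTSTRAP_SEED = 42
BOOTSTRAP_CI_LEVEL = 0.95
BOOTSTRAP_MIN_N_QUERIES = 5


def sketch_bootstrap_ci(per_query_scores: list[float]) -> tuple[float, float] | None:
    """Percentile bootstrap CI of the mean; None below the minimum sample size."""
    n = len(per_query_scores)
    if n < BOOTSTRAP_MIN_N_QUERIES:
        return None  # degraded case: too few queries for a meaningful interval

    # A fresh fixed-seed RNG per call makes repeated calls return identical bounds.
    rng = random.Random(BOOTSTRAP_SEED)

    # Resample n scores with replacement, BOOTSTRAP_N times, and record each mean.
    means = sorted(
        fmean(rng.choices(per_query_scores, k=n)) for _ in range(BOOTSTRAP_N)
    )

    # Simple index-based percentiles (no interpolation) of the sorted means.
    alpha = (1.0 - BOOTSTRAP_CI_LEVEL) / 2.0
    lo_idx = int(alpha * (BOOTSTRAP_N - 1))
    hi_idx = int((1.0 - alpha) * (BOOTSTRAP_N - 1))
    return means[lo_idx], means[hi_idx]
```

The fixed seed is what lets an approver re-read the PR and see the same interval, and the zero-variance case collapses to a single point because every resample has the same mean.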
Co-Authored-By: Claude Opus 4.7 --- backend/app/domain/study/confidence.py | 599 ++++++++++++++++++ .../unit/domain/study/test_confidence.py | 461 ++++++++++++++ 2 files changed, 1060 insertions(+) create mode 100644 backend/app/domain/study/confidence.py create mode 100644 backend/tests/unit/domain/study/test_confidence.py diff --git a/backend/app/domain/study/confidence.py b/backend/app/domain/study/confidence.py new file mode 100644 index 00000000..9fa1c21f --- /dev/null +++ b/backend/app/domain/study/confidence.py @@ -0,0 +1,599 @@ +"""Per-study metric-confidence analytics (feat_pr_metric_confidence Story 1.3). + +Pure-Python helpers for computing the ``ConfidenceShape`` exposed on +``StudyDetail``, the PR body's ``## Confidence`` section, and the digest +narrative's ```` / ```` XML blocks. + +Domain-layer convention (per CLAUDE.md "Domain Layer"): every function in +this module is pure — no DB access, no I/O, no async. The API router +(:func:`backend.app.api.v1.studies._detail`) and the PR worker +(:func:`backend.workers.git_pr.open_pr`) fetch the 4 queries from FR-2 and +call :func:`compute_study_confidence` with the resulting data. This keeps the +analytics independently unit-testable without DB fixtures. + +The Pydantic shapes live HERE (not in ``backend.app.api.v1.schemas``) because +the domain module is the canonical assembler — the API schemas re-export +``ConfidenceShape`` for the ``StudyDetail`` field (Story 1.4). + +The :data:`ConvergenceRegime`, :data:`RunnerUpClassification`, +:data:`ComparisonAgainst`, and :data:`CIMethod` Literals also live here for +the same reason. ``ObjectiveMetric`` is intentionally NOT imported from +``schemas`` — that would create a circular import at app startup +(schemas → confidence → schemas). ``HeadlineShape.metric`` uses ``str``; +the upstream value is already validated by the existing ``ObjectiveMetric`` +Literal at the create-study endpoint (``schemas.py:214``), so the wire +contract is preserved. 
+ +References: +- Spec FR-2 through FR-7: docs/02_product/planned_features/feat_pr_metric_confidence/feature_spec.md +- AC-3 / AC-3a / AC-4 / AC-5 / AC-6 / AC-7 / AC-8 / AC-9 / AC-10 / AC-15 / AC-16 +""" + +from __future__ import annotations + +from dataclasses import dataclass, field +from typing import Any, Literal + +import numpy as np +from pydantic import BaseModel + +# --------------------------------------------------------------------------- +# Locked constants — every value is referenced from FR-4 / FR-4a. +# Source of truth: feature_spec.md §19 Decision log (feat_pr_metric_confidence). +# --------------------------------------------------------------------------- + +BOOTSTRAP_N: int = 1000 +"""Number of bootstrap resamples for the CI computation (Decision D4).""" + +BOOTSTRAP_SEED: int = 42 +"""Fixed numpy RNG seed for reproducibility (Decision D4 / AC-4). An approver +re-reading the PR sees byte-identical CI numbers across calls.""" + +BOOTSTRAP_CI_LEVEL: float = 0.95 +"""95% percentile interval (Decision D4).""" + +BOOTSTRAP_MIN_N_QUERIES: int = 5 +"""Minimum number of per-query datapoints required to compute a CI. 
Below +this, ``bootstrap_ci_95`` returns None (FR-7).""" + +REGRESSOR_THRESHOLDS: dict[str, float] = { + "ndcg": 0.01, + "precision": 0.01, + "recall": 0.01, + "map": 0.02, + "mrr": 0.02, +} +"""Absolute-delta threshold per metric for the improved/unchanged/regressed +classification (FR-4a, Decision D2).""" + +RUNNER_UP_PLATEAU_BAND: float = 0.005 +"""``robust_plateau`` if all top-N trials are within this band of the winner +(Decision D5).""" + +LATE_TRIAL_WINDOW_FRAC: float = 0.2 +"""Fraction of complete trials in the noise-floor window (Decision D3).""" + +LATE_TRIAL_WINDOW_MIN: int = 5 +"""Minimum window size for the noise-floor (Decision D3).""" + +LATE_TRIAL_MIN_COMPLETE: int = 10 +"""Minimum complete trials required to report a noise floor (FR-7, +Decision D3).""" + +EARLY_HELD_TRIAL_NUMBER_FRAC: float = 0.5 +"""Winner found in the first 50% of trials AND the late-window probe finds a +near-equivalent → ``early_held`` (Decision D6, cycle-2 GPT-5.5 F7 fix).""" + +EARLY_HELD_LATE_WINDOW_FRAC: float = 0.25 +"""Last 25% of trial numbers is the "late window" probe range +(Decision D6).""" + +LATE_RISING_TRIAL_NUMBER_FRAC: float = 0.9 +"""Winner found at or after 90% of trials → ``late_rising`` +(Decision D6).""" + +CONVERGENCE_MIN_COMPLETE: int = 3 +"""Minimum complete trials required to classify convergence (FR-7).""" + +RUNNER_UP_GAP_MIN_COMPLETE: int = 2 +"""Minimum complete trials required to report a runner-up gap (FR-7).""" + +TOP_REGRESSORS_CAP: int = 5 +"""Maximum number of named regressor queries in the PR body / ConfidencePanel +(FR-4a).""" + +# --------------------------------------------------------------------------- +# Wire-value Literals (single source of truth for the spec §8.4 enumerated +# value contract; re-exported via schemas.py for StudyDetail). 
+# --------------------------------------------------------------------------- + +ConvergenceRegime = Literal["early_held", "late_rising", "noisy"] +RunnerUpClassification = Literal["robust_plateau", "sharp_peak"] +ComparisonAgainst = Literal["runner_up", "baseline"] +"""Phase 1 unconditionally emits ``runner_up``. ``baseline`` is reserved for +Phase 2 (see ``phase2_idea.md``).""" + +CIMethod = Literal["bootstrap_n1000"] + + +# --------------------------------------------------------------------------- +# Pydantic shapes — exported and re-imported by schemas.py (Story 1.4). +# --------------------------------------------------------------------------- + + +class HeadlineShape(BaseModel): + """Top-line metric value + N(queries) used in the CI. + + ``metric`` uses ``str`` (not ``ObjectiveMetric``) to avoid a circular + import: ``schemas.py`` imports ``ConfidenceShape`` from here, so this + module cannot import back from ``schemas.py``. The upstream value is + already validated by the existing ``ObjectiveMetric`` Literal at the + create-study endpoint (``schemas.py:214``). + """ + + metric: str + value: float + k: int | None + n_queries: int | None + """``None`` when the winner trial has ``per_query_metrics IS NULL`` + (FR-7).""" + + +class CIShape(BaseModel): + """Bootstrap percentile CI on the winner's per-query metric values.""" + + low: float + high: float + method: CIMethod + n_samples: int + + +class RunnerUpGapShape(BaseModel): + """Runner-up trial's metric vs the winner. + + The whole shape is suppressed to ``None`` when there are <2 complete + trials (FR-2 + FR-7); ``classification`` is non-null whenever this shape + is present. + """ + + value: float + classification: RunnerUpClassification + top10_within: float + """Max distance from the winner among the top-``min(10, N)`` trials. 
+ Decision threshold: ``robust_plateau`` if ``top10_within <= 0.005``.""" + runner_up_metric: float + + +class LateTrialStddevShape(BaseModel): + """Sample stddev of ``primary_metric`` over the late-trial window.""" + + value: float + window_size: int + min_window_required: int # always LATE_TRIAL_MIN_COMPLETE + + +class ConvergenceShape(BaseModel): + """Where the winner sits in the Optuna trial sequence + the classified regime.""" + + best_at_trial: int + total_trials: int + regime: ConvergenceRegime + + +class RegressorRowShape(BaseModel): + """One row in the named-regressors table.""" + + query_id: str + query_text: str + winner_score: float + comparison_score: float + delta: float + """``winner_score - comparison_score``; always negative for regressors.""" + + +class PerQueryOutcomesShape(BaseModel): + """Per-query outcome counts + the top-5 named regressors.""" + + improved: int + unchanged: int + regressed: int + comparison_against: ComparisonAgainst + top_regressors: list[RegressorRowShape] + + +class ConfidenceShape(BaseModel): + """The top-level shape exposed via ``StudyDetail.confidence``. + + Every sub-field is independently nullable per FR-7 — degraded paths + suppress only the sub-fields they affect, never the whole shape (the + orchestrator returns whole-object ``None`` only when the winner trial + row itself is missing). + """ + + headline: HeadlineShape + ci_95: CIShape | None + runner_up_gap: RunnerUpGapShape | None + late_trial_stddev: LateTrialStddevShape | None + convergence: ConvergenceShape | None + per_query_outcomes: PerQueryOutcomesShape | None + + +@dataclass(frozen=True) +class _OutcomeSummary: + """Outcome counts + regressor candidate qids (no query_text yet). + + Produced by :func:`compute_outcome_summary`; consumed by the + orchestrator + :func:`build_regressor_rows`. 
Carries only ``query_id`` + values (NOT ``query_text``) so the orchestrator can run Q4 from FR-2's + 4-query read pattern AFTER deciding which qids are candidates (cycle-1 + GPT-5.5 F7 fix). + """ + + improved: int + unchanged: int + regressed: int + regressor_candidates: list[tuple[str, float, float, float]] = field(default_factory=list) + """Each tuple: ``(query_id, winner_score, comparison_score, delta)``. + Sorted by ``abs(delta)`` descending; capped at TOP_REGRESSORS_CAP.""" + + +# --------------------------------------------------------------------------- +# Pure helpers +# --------------------------------------------------------------------------- + + +def bootstrap_ci_95(per_query_values: list[float]) -> CIShape | None: + """Percentile bootstrap CI with seed=42, N=1000 resamples. + + Returns ``None`` when ``len(per_query_values) < BOOTSTRAP_MIN_N_QUERIES`` + (FR-7). The fixed seed ensures byte-identical CI values across re-reads + of the same study (AC-4). + """ + if len(per_query_values) < BOOTSTRAP_MIN_N_QUERIES: + return None + rng = np.random.default_rng(BOOTSTRAP_SEED) + arr = np.asarray(per_query_values, dtype=np.float64) + # Resample with replacement, take the mean of each sample. + means = rng.choice(arr, size=(BOOTSTRAP_N, len(arr)), replace=True).mean(axis=1) + alpha = (1.0 - BOOTSTRAP_CI_LEVEL) / 2.0 + low_p, high_p = 100.0 * alpha, 100.0 * (1.0 - alpha) + low = float(np.percentile(means, low_p)) + high = float(np.percentile(means, high_p)) + return CIShape( + low=low, + high=high, + method="bootstrap_n1000", + n_samples=len(arr), + ) + + +def classify_runner_up_gap( + sorted_primary_metrics: list[float], +) -> RunnerUpGapShape | None: + """Build the full ``RunnerUpGapShape`` from sorted primary metrics. + + Input is the top trials' primary metrics in descending order. Returns + ``None`` when ``len < RUNNER_UP_GAP_MIN_COMPLETE`` (FR-7). 
Otherwise + computes: + + - ``value`` = ``winner - runner_up`` + - ``runner_up_metric`` = the 2nd-best metric + - ``top10_within`` = max(winner - m) over the top ``min(10, N)`` trials + - ``classification`` = ``"robust_plateau"`` if ``top10_within <= + RUNNER_UP_PLATEAU_BAND``, else ``"sharp_peak"`` (cycle-1 GPT-5.5 F8 fix: + the helper now returns the full shape including ``top10_within`` + + ``runner_up_metric``). + """ + if len(sorted_primary_metrics) < RUNNER_UP_GAP_MIN_COMPLETE: + return None + winner = sorted_primary_metrics[0] + runner_up = sorted_primary_metrics[1] + top_n = min(10, len(sorted_primary_metrics)) + top_band = sorted_primary_metrics[:top_n] + top10_within = float(max(winner - m for m in top_band)) + classification: RunnerUpClassification = ( + "robust_plateau" if top10_within <= RUNNER_UP_PLATEAU_BAND else "sharp_peak" + ) + return RunnerUpGapShape( + value=float(winner - runner_up), + classification=classification, + top10_within=top10_within, + runner_up_metric=float(runner_up), + ) + + +def compute_late_trial_stddev( + primary_metrics_in_trial_order: list[float], +) -> LateTrialStddevShape | None: + """Sample stddev over the late-trial window (the noise floor signal). + + Window size is ``max(LATE_TRIAL_WINDOW_MIN, int(N * + LATE_TRIAL_WINDOW_FRAC))``. Returns ``None`` when ``N < + LATE_TRIAL_MIN_COMPLETE`` (FR-7). The ``primary_metrics_in_trial_order`` + list must be sorted by ``optuna_trial_number`` ascending; the helper + takes the tail. 
+ """ + n = len(primary_metrics_in_trial_order) + if n < LATE_TRIAL_MIN_COMPLETE: + return None + window_size = max(LATE_TRIAL_WINDOW_MIN, int(n * LATE_TRIAL_WINDOW_FRAC)) + tail = primary_metrics_in_trial_order[-window_size:] + value = float(np.std(np.asarray(tail, dtype=np.float64), ddof=1)) + return LateTrialStddevShape( + value=value, + window_size=window_size, + min_window_required=LATE_TRIAL_MIN_COMPLETE, + ) + + +def classify_convergence_regime( + winner_trial_number: int, + primary_metrics_by_trial_number: dict[int, float], +) -> ConvergenceShape | None: + """Classify convergence as ``early_held`` / ``late_rising`` / ``noisy``. + + Decision D6 (cycle-2 GPT-5.5 F7 fix — the original "no improvement in last + 25%" rule was tautological because the winner is the global best by + construction): + + - ``early_held``: winner's ``optuna_trial_number ≤ 50% of max`` AND at + least one trial in the last 25% of trial numbers has ``primary_metric`` + within ``RUNNER_UP_PLATEAU_BAND`` of the winner (observable signal that + the late budget found near-equivalent configs). + - ``late_rising``: winner's ``optuna_trial_number ≥ 90% of max``. + - ``noisy``: otherwise. + + Returns ``None`` when ``N < CONVERGENCE_MIN_COMPLETE`` (FR-7). + ``primary_metrics_by_trial_number`` includes ONLY complete trials (the + caller filters). + """ + n = len(primary_metrics_by_trial_number) + if n < CONVERGENCE_MIN_COMPLETE: + return None + winner_metric = primary_metrics_by_trial_number[winner_trial_number] + max_trial_number = max(primary_metrics_by_trial_number.keys()) + total_trials = n + + if winner_trial_number >= LATE_RISING_TRIAL_NUMBER_FRAC * max_trial_number: + regime: ConvergenceRegime = "late_rising" + elif winner_trial_number <= EARLY_HELD_TRIAL_NUMBER_FRAC * max_trial_number: + # Observable late-window probe: any trial in the last 25% within + # the plateau band of the winner counts as "held". 
+ late_window_start = max_trial_number * (1.0 - EARLY_HELD_LATE_WINDOW_FRAC) + late_window_trials = [ + m for tn, m in primary_metrics_by_trial_number.items() if tn >= late_window_start + ] + if late_window_trials and any( + (winner_metric - m) <= RUNNER_UP_PLATEAU_BAND for m in late_window_trials + ): + regime = "early_held" + else: + regime = "noisy" + else: + regime = "noisy" + + return ConvergenceShape( + best_at_trial=winner_trial_number, + total_trials=total_trials, + regime=regime, + ) + + +def compute_outcome_summary( + winner_per_query: dict[str, dict[str, float]], + comparison_per_query: dict[str, dict[str, float]], + metric: str, +) -> _OutcomeSummary | None: + """Classify per-query outcomes and surface the top regressor candidates. + + Improved/unchanged/regressed buckets use the FR-4a per-metric threshold + table. Returned candidates are sorted by ``abs(delta)`` descending, + capped at ``TOP_REGRESSORS_CAP``. Returns ``None`` when either input + dict is empty or ``metric`` is not in :data:`REGRESSOR_THRESHOLDS`. + + Cycle-1 GPT-5.5 F7 fix: this helper does NOT take ``query_text_by_id`` — + candidates carry only ``query_id``. The orchestrator runs Q4 of the + 4-query read pattern AFTER seeing the candidate list, then calls + :func:`build_regressor_rows` to hydrate the rows with text. + """ + if not winner_per_query or not comparison_per_query: + return None + threshold = REGRESSOR_THRESHOLDS.get(metric) + if threshold is None: + return None + + improved = 0 + unchanged = 0 + regressed = 0 + candidates: list[tuple[str, float, float, float]] = [] + + # Compare only qids present in BOTH dicts. Queries missing from either + # side (e.g., a query added after the trial ran) are ignored. 
+ for qid in winner_per_query.keys() & comparison_per_query.keys(): + w_metrics = winner_per_query[qid] + c_metrics = comparison_per_query[qid] + if metric not in w_metrics or metric not in c_metrics: + continue + w_score = float(w_metrics[metric]) + c_score = float(c_metrics[metric]) + delta = w_score - c_score + if delta > threshold: + improved += 1 + elif delta < -threshold: + regressed += 1 + candidates.append((qid, w_score, c_score, delta)) + else: + unchanged += 1 + + # Sort by absolute delta descending → most-negative delta first. For + # regressors all deltas are negative, so ascending sort of the signed + # delta puts the largest-magnitude regressor first. + candidates.sort(key=lambda row: row[3]) + capped = candidates[:TOP_REGRESSORS_CAP] + return _OutcomeSummary( + improved=improved, + unchanged=unchanged, + regressed=regressed, + regressor_candidates=capped, + ) + + +def build_regressor_rows( + candidates: list[tuple[str, float, float, float]], + query_text_by_id: dict[str, str], +) -> list[RegressorRowShape]: + """Hydrate candidate qids with ``query_text`` from Q4's result. + + Rows whose ``query_id`` is missing from ``query_text_by_id`` are + omitted — the query may have been deleted by a cascade race; we don't + want to surface a regressor we can't name. + """ + rows: list[RegressorRowShape] = [] + for qid, winner_score, comparison_score, delta in candidates: + text = query_text_by_id.get(qid) + if text is None: + continue + rows.append( + RegressorRowShape( + query_id=qid, + query_text=text, + winner_score=winner_score, + comparison_score=comparison_score, + delta=delta, + ) + ) + return rows + + +# --------------------------------------------------------------------------- +# Orchestrator — pure (no DB, no async). The API router / PR worker fetch +# the 4 queries from FR-2 and pass the results in. 
+# --------------------------------------------------------------------------- + + +def compute_study_confidence( + *, + study_objective: dict[str, Any], + study_best_metric: float | None, + winner_trial: Any | None, + runner_up_trial: Any | None, + complete_trials_summary: list[tuple[float, int]], + query_text_by_id: dict[str, str] | None = None, +) -> ConfidenceShape | None: + """Assemble the ``ConfidenceShape`` from pre-fetched DB data. + + Arguments mirror the 4-query read pattern from FR-2: + + - ``study_objective`` — ``study.objective`` JSONB (``{metric, k, + direction}``). The ``metric`` key drives ``HeadlineShape`` + the + threshold lookup in :func:`compute_outcome_summary`. + - ``study_best_metric`` — ``study.best_metric``. Populates + ``HeadlineShape.value``. + - ``winner_trial`` — full ``Trial`` ORM row at ``study.best_trial_id``, + OR ``None`` if the row is missing (cascade-delete race or + ``best_trial_id IS NULL`` for an incomplete study). Triggers whole- + object ``None`` per FR-7. + - ``runner_up_trial`` — 2nd-best complete trial by ``primary_metric``, + OR ``None`` when there's only one complete trial. + - ``complete_trials_summary`` — list of ``(primary_metric, + optuna_trial_number)`` for every complete trial, sorted by + ``optuna_trial_number`` ascending. Drives the aggregate signals + (``runner_up_gap``, ``late_trial_stddev``, ``convergence``). + - ``query_text_by_id`` — result of Q4 (only fetched after + ``compute_outcome_summary`` produces candidates); maps ``query_id`` + → ``query_text`` for the named regressors. May be ``None`` / + ``{}`` when there are no candidates. + + Returns ``None`` whole-object when ``winner_trial is None``. Otherwise + returns a partial ``ConfidenceShape`` per FR-7: each sub-field is + independently nullable. + + Cycle-2 GPT-5.5 F2 fix: ``ci_95`` + ``headline.n_queries`` decouple + from the runner-up gate — AC-16 (1-complete-trial case) requires CI + to populate from the winner alone. 
+ """ + # FR-2 condition (a/b/c) — whole-object null. + if winner_trial is None: + return None + if not complete_trials_summary: + return None + + metric = study_objective.get("metric") + if not isinstance(metric, str): + return None + k = study_objective.get("k") + if k is not None and not isinstance(k, int): + k = None + + # Headline value comes from study.best_metric (denormalized winner + # primary_metric); the n_queries comes from the winner's per_query + # dict when present. + headline_value = ( + float(study_best_metric) + if study_best_metric is not None + else float(winner_trial.primary_metric or 0.0) + ) + winner_per_query = winner_trial.per_query_metrics or {} + winner_values_for_metric = [ + float(v[metric]) for v in winner_per_query.values() if isinstance(v, dict) and metric in v + ] + n_queries: int | None = len(winner_values_for_metric) if winner_per_query else None + + headline = HeadlineShape( + metric=metric, + value=headline_value, + k=k, + n_queries=n_queries, + ) + + # Aggregate signals — independent of per_query data. + sorted_primary_metrics = sorted( + (m for m, _ in complete_trials_summary if m is not None), + reverse=True, + ) + runner_up_gap = classify_runner_up_gap(sorted_primary_metrics) + + primary_in_trial_order = [m for m, _ in complete_trials_summary if m is not None] + late_trial_stddev = compute_late_trial_stddev(primary_in_trial_order) + + primary_by_trial_number = {tn: m for m, tn in complete_trials_summary if m is not None} + convergence = classify_convergence_regime( + winner_trial_number=winner_trial.optuna_trial_number, + primary_metrics_by_trial_number=primary_by_trial_number, + ) + + # Winner-only per-query signal — independent of runner-up gate + # (cycle-2 GPT-5.5 F2 fix; AC-16 1-complete-trial case). 
+ ci_95 = bootstrap_ci_95(winner_values_for_metric) + + # Comparison-based per-query signal — requires BOTH winner + runner-up + # to have per_query_metrics (the runner-up's primary_metric alone is + # not enough to compute deltas). + per_query_outcomes: PerQueryOutcomesShape | None = None + if runner_up_trial is not None and winner_per_query and runner_up_trial.per_query_metrics: + outcome = compute_outcome_summary( + winner_per_query=winner_per_query, + comparison_per_query=runner_up_trial.per_query_metrics, + metric=metric, + ) + if outcome is not None: + regressor_rows = build_regressor_rows( + candidates=outcome.regressor_candidates, + query_text_by_id=query_text_by_id or {}, + ) + per_query_outcomes = PerQueryOutcomesShape( + improved=outcome.improved, + unchanged=outcome.unchanged, + regressed=outcome.regressed, + comparison_against="runner_up", # FR-3 locked for Phase 1 + top_regressors=regressor_rows, + ) + + return ConfidenceShape( + headline=headline, + ci_95=ci_95, + runner_up_gap=runner_up_gap, + late_trial_stddev=late_trial_stddev, + convergence=convergence, + per_query_outcomes=per_query_outcomes, + ) diff --git a/backend/tests/unit/domain/study/test_confidence.py b/backend/tests/unit/domain/study/test_confidence.py new file mode 100644 index 00000000..2f234cb0 --- /dev/null +++ b/backend/tests/unit/domain/study/test_confidence.py @@ -0,0 +1,461 @@ +"""Unit tests for ``backend.app.domain.study.confidence`` (Story 1.3). + +Covers every helper and every FR-7 degraded path. Pure tests — no DB, no +fixtures beyond simple data structures. The orchestrator +(:func:`compute_study_confidence`) is tested with lightweight ``SimpleNamespace`` +stand-ins for ``Trial`` / ``Study`` ORM rows since the orchestrator only reads +attributes, not behavior. 
+""" + +from __future__ import annotations + +from types import SimpleNamespace + +import pytest + +from backend.app.domain.study.confidence import ( + BOOTSTRAP_MIN_N_QUERIES, + CONVERGENCE_MIN_COMPLETE, + LATE_TRIAL_MIN_COMPLETE, + REGRESSOR_THRESHOLDS, + RUNNER_UP_GAP_MIN_COMPLETE, + RUNNER_UP_PLATEAU_BAND, + TOP_REGRESSORS_CAP, + ConfidenceShape, + bootstrap_ci_95, + build_regressor_rows, + classify_convergence_regime, + classify_runner_up_gap, + compute_late_trial_stddev, + compute_outcome_summary, + compute_study_confidence, +) + +# --------------------------------------------------------------------------- +# bootstrap_ci_95 +# --------------------------------------------------------------------------- + + +class TestBootstrapCI: + def test_returns_none_when_n_below_threshold(self) -> None: + """AC-15: N(queries) < 5 suppresses ci_95.""" + assert bootstrap_ci_95([0.5, 0.6, 0.7, 0.8]) is None + assert bootstrap_ci_95([]) is None + + def test_returns_shape_when_n_meets_threshold(self) -> None: + """At exactly BOOTSTRAP_MIN_N_QUERIES the CI is computable.""" + values = [0.5, 0.6, 0.7, 0.8, 0.9] + result = bootstrap_ci_95(values) + assert result is not None + assert result.n_samples == len(values) + assert result.method == "bootstrap_n1000" + # CI must straddle the sample mean. 
+ sample_mean = sum(values) / len(values) + assert result.low <= sample_mean <= result.high + + def test_seed_determinism_byte_identical(self) -> None: + """AC-4: two calls with identical input produce byte-identical CI + values (fixed seed = 42).""" + values = [0.78, 0.82, 0.85, 0.80, 0.84, 0.79, 0.83, 0.81, 0.77, 0.86] + first = bootstrap_ci_95(values) + second = bootstrap_ci_95(values) + assert first is not None + assert second is not None + assert first.low == second.low + assert first.high == second.high + + def test_zero_variance_collapses_ci(self) -> None: + """All-equal input → CI collapses to the constant value.""" + values = [0.5] * 10 + result = bootstrap_ci_95(values) + assert result is not None + assert result.low == pytest.approx(0.5) + assert result.high == pytest.approx(0.5) + + +# --------------------------------------------------------------------------- +# classify_runner_up_gap +# --------------------------------------------------------------------------- + + +class TestRunnerUpGap: + def test_returns_none_below_min_complete(self) -> None: + """FR-7: < RUNNER_UP_GAP_MIN_COMPLETE (2) → None.""" + assert classify_runner_up_gap([0.8]) is None + assert classify_runner_up_gap([]) is None + + def test_robust_plateau_when_top10_within_band(self) -> None: + """AC-5: top-10 all within 0.005 → robust_plateau.""" + # Winner 0.840, top-10 all within 0.004 (strictly less than the 0.005 + # band — avoids float-precision flake at the boundary). 
+ metrics = [0.840, 0.838, 0.838, 0.838, 0.838, 0.838, 0.838, 0.837, 0.837, 0.836] + result = classify_runner_up_gap(metrics) + assert result is not None + assert result.classification == "robust_plateau" + assert result.value == pytest.approx(0.002) + assert result.runner_up_metric == pytest.approx(0.838) + assert result.top10_within <= RUNNER_UP_PLATEAU_BAND + + def test_sharp_peak_when_gap_exceeds_band(self) -> None: + """AC-5 counter-example: gap > 0.005 → sharp_peak.""" + metrics = [0.840, 0.760, 0.750, 0.740] + result = classify_runner_up_gap(metrics) + assert result is not None + assert result.classification == "sharp_peak" + assert result.value == pytest.approx(0.080) + assert result.runner_up_metric == pytest.approx(0.760) + assert result.top10_within > RUNNER_UP_PLATEAU_BAND + + def test_two_trial_edge_case(self) -> None: + """At exactly 2 trials (minimum), classification still computes from + the winner-vs-runner_up gap.""" + # 2 trials, gap = 0.001 → robust_plateau (within 0.005). + result = classify_runner_up_gap([0.840, 0.839]) + assert result is not None + assert result.classification == "robust_plateau" + + # 2 trials, gap = 0.10 → sharp_peak.
+ result = classify_runner_up_gap([0.840, 0.740]) + assert result is not None + assert result.classification == "sharp_peak" + + +# --------------------------------------------------------------------------- +# compute_late_trial_stddev +# --------------------------------------------------------------------------- + + +class TestLateTrialStddev: + def test_returns_none_when_n_below_threshold(self) -> None: + """FR-7 + AC-7: < LATE_TRIAL_MIN_COMPLETE (10) → None.""" + assert compute_late_trial_stddev([0.5] * 9) is None + assert compute_late_trial_stddev([]) is None + + def test_window_size_at_n_50(self) -> None: + """AC-6: at N=50, window = max(5, int(50*0.2)) = 10.""" + values = [0.7 + 0.01 * (i % 5) for i in range(50)] + result = compute_late_trial_stddev(values) + assert result is not None + assert result.window_size == 10 + assert result.min_window_required == LATE_TRIAL_MIN_COMPLETE + + def test_window_size_floor_at_5(self) -> None: + """N=10 → window = max(5, int(10*0.2)) = 5 (the floor).""" + values = [0.7] * 10 + result = compute_late_trial_stddev(values) + assert result is not None + assert result.window_size == 5 + assert result.value == pytest.approx(0.0) + + +# --------------------------------------------------------------------------- +# classify_convergence_regime +# --------------------------------------------------------------------------- + + +class TestConvergenceRegime: + def test_returns_none_below_min_complete(self) -> None: + """FR-7: < CONVERGENCE_MIN_COMPLETE (3) → None.""" + result = classify_convergence_regime( + winner_trial_number=0, + primary_metrics_by_trial_number={0: 0.8, 1: 0.7}, + ) + assert result is None + + def test_early_held_when_late_trial_within_band(self) -> None: + """AC-8: winner at 20% AND ≥1 late-window trial within 0.005 → early_held.""" + # Winner at trial 200 out of 1000; late window is trial >= 750. + # Synthesize: winner 0.840, late trials include 0.838 (within 0.005). 
+ metrics_by_trial = {0: 0.700} + metrics_by_trial[200] = 0.840 # winner + # Mid trials. + for tn in range(250, 750, 50): + metrics_by_trial[tn] = 0.820 + # Late trials, at least one within 0.005 of winner. + metrics_by_trial[800] = 0.838 + metrics_by_trial[900] = 0.825 + metrics_by_trial[1000] = 0.830 + + result = classify_convergence_regime( + winner_trial_number=200, + primary_metrics_by_trial_number=metrics_by_trial, + ) + assert result is not None + assert result.regime == "early_held" + assert result.best_at_trial == 200 + assert result.total_trials == len(metrics_by_trial) + + def test_late_rising_at_90pct(self) -> None: + """AC-9: winner at 95% → late_rising.""" + metrics_by_trial = {tn: 0.700 + 0.001 * tn for tn in range(0, 1001, 50)} + # Winner at 950 — above 90%. + metrics_by_trial[950] = 0.860 + result = classify_convergence_regime( + winner_trial_number=950, + primary_metrics_by_trial_number=metrics_by_trial, + ) + assert result is not None + assert result.regime == "late_rising" + + def test_noisy_when_winner_early_but_no_late_plateau(self) -> None: + """AC-8 counter-example: winner at 20% but NO late trial within 0.005 + → noisy (late budget didn't find similar plateau).""" + metrics_by_trial = {tn: 0.700 for tn in range(0, 1001, 50)} + metrics_by_trial[200] = 0.840 # winner — far from late trials + # All late trials (>= 750) are at 0.700, gap = 0.140 (far from 0.005). 
+ result = classify_convergence_regime( + winner_trial_number=200, + primary_metrics_by_trial_number=metrics_by_trial, + ) + assert result is not None + assert result.regime == "noisy" + + def test_noisy_when_winner_in_middle(self) -> None: + """Winner at 60% (neither early nor late) → noisy.""" + metrics_by_trial = {tn: 0.700 for tn in range(0, 1001, 50)} + metrics_by_trial[600] = 0.840 + result = classify_convergence_regime( + winner_trial_number=600, + primary_metrics_by_trial_number=metrics_by_trial, + ) + assert result is not None + assert result.regime == "noisy" + + +# --------------------------------------------------------------------------- +# compute_outcome_summary + build_regressor_rows +# --------------------------------------------------------------------------- + + +class TestOutcomeSummary: + def test_returns_none_on_empty_input(self) -> None: + assert compute_outcome_summary({}, {"q1": {"ndcg": 0.5}}, "ndcg") is None + assert compute_outcome_summary({"q1": {"ndcg": 0.5}}, {}, "ndcg") is None + + def test_returns_none_on_unknown_metric(self) -> None: + winner = {"q1": {"ndcg": 0.8}} + comparison = {"q1": {"ndcg": 0.7}} + assert compute_outcome_summary(winner, comparison, "unknown_metric") is None + + def test_classifies_per_fr4a_thresholds_ndcg(self) -> None: + """AC-10: NDCG threshold = 0.01. Deltas of -0.51, 0, 0.18 classify as + regressed/unchanged/improved respectively.""" + winner = { + "qA": {"ndcg": 0.41}, # delta -0.51 vs 0.92 → regressed + "qB": {"ndcg": 0.85}, # delta 0.00 vs 0.85 → unchanged + "qC": {"ndcg": 0.78}, # delta +0.18 vs 0.60 → improved + } + comparison = { + "qA": {"ndcg": 0.92}, + "qB": {"ndcg": 0.85}, + "qC": {"ndcg": 0.60}, + } + result = compute_outcome_summary(winner, comparison, "ndcg") + assert result is not None + assert result.regressed == 1 + assert result.unchanged == 1 + assert result.improved == 1 + # The single regressor candidate (qA) is in the list. 
+ assert len(result.regressor_candidates) == 1 + qid, w, c, delta = result.regressor_candidates[0] + assert qid == "qA" + assert w == pytest.approx(0.41) + assert c == pytest.approx(0.92) + assert delta == pytest.approx(-0.51) + + def test_uses_map_threshold_at_002(self) -> None: + """MAP threshold = 0.02 (Decision D2). Delta of -0.015 is unchanged.""" + winner = {"q1": {"map": 0.50}} + comparison = {"q1": {"map": 0.515}} # delta -0.015 → within ±0.02 → unchanged + result = compute_outcome_summary(winner, comparison, "map") + assert result is not None + assert result.unchanged == 1 + assert result.regressed == 0 + + def test_caps_regressors_at_top_5(self) -> None: + """AC-10: top-regressor list is capped at TOP_REGRESSORS_CAP, sorted + by abs(delta) descending.""" + winner = {f"q{i}": {"ndcg": 0.1 + 0.01 * i} for i in range(8)} + comparison = {f"q{i}": {"ndcg": 0.9 - 0.01 * i} for i in range(8)} # all huge regressions + result = compute_outcome_summary(winner, comparison, "ndcg") + assert result is not None + assert result.regressed == 8 + assert len(result.regressor_candidates) == TOP_REGRESSORS_CAP + # First candidate has the most-negative delta. 
+ deltas = [c[3] for c in result.regressor_candidates] + assert deltas == sorted(deltas) # ascending (most negative first) + + +class TestBuildRegressorRows: + def test_hydrates_query_text(self) -> None: + candidates = [("qA", 0.41, 0.92, -0.51), ("qB", 0.71, 0.85, -0.14)] + text_by_id = {"qA": "shipping policy", "qB": "wireless headphones"} + rows = build_regressor_rows(candidates, text_by_id) + assert len(rows) == 2 + assert rows[0].query_id == "qA" + assert rows[0].query_text == "shipping policy" + assert rows[0].delta == pytest.approx(-0.51) + + def test_omits_rows_with_missing_text(self) -> None: + """A query deleted between Q1+Q2 and Q4 (cascade race) is silently + dropped — we don't surface a regressor we can't name.""" + candidates = [("qA", 0.41, 0.92, -0.51), ("qDeleted", 0.30, 0.80, -0.50)] + text_by_id = {"qA": "shipping policy"} # qDeleted absent + rows = build_regressor_rows(candidates, text_by_id) + assert len(rows) == 1 + assert rows[0].query_id == "qA" + + +# --------------------------------------------------------------------------- +# compute_study_confidence orchestrator +# --------------------------------------------------------------------------- + + +def _trial( + *, + optuna_trial_number: int = 0, + primary_metric: float = 0.84, + per_query_metrics: dict[str, dict[str, float]] | None = None, +) -> SimpleNamespace: + return SimpleNamespace( + optuna_trial_number=optuna_trial_number, + primary_metric=primary_metric, + per_query_metrics=per_query_metrics, + ) + + +class TestComputeStudyConfidence: + def test_returns_none_when_winner_missing(self) -> None: + """AC-3a: winner_trial=None → whole-object None.""" + result = compute_study_confidence( + study_objective={"metric": "ndcg", "k": 10, "direction": "maximize"}, + study_best_metric=None, + winner_trial=None, + runner_up_trial=None, + complete_trials_summary=[], + ) + assert result is None + + def test_returns_none_when_no_complete_trials(self) -> None: + result = compute_study_confidence( + 
study_objective={"metric": "ndcg", "k": 10, "direction": "maximize"}, + study_best_metric=0.84, + winner_trial=_trial(), + runner_up_trial=None, + complete_trials_summary=[], + ) + assert result is None + + def test_partial_shape_when_per_query_metrics_null(self) -> None: + """AC-3: old study with winner row but per_query_metrics=None → + partial ConfidenceShape with aggregate signals populated, ci_95 + + per_query_outcomes + headline.n_queries all null.""" + winner = _trial( + optuna_trial_number=0, # must appear in summary keys + primary_metric=0.840, + per_query_metrics=None, + ) + runner_up = _trial(optuna_trial_number=10, primary_metric=0.760, per_query_metrics=None) + # 15 complete trials so all aggregate signals can compute. + # Winner (trial 0) is the best; later trials taper down. + summary = [(0.840 - 0.01 * i, i * 10) for i in range(15)] + result = compute_study_confidence( + study_objective={"metric": "ndcg", "k": 10, "direction": "maximize"}, + study_best_metric=0.840, + winner_trial=winner, + runner_up_trial=runner_up, + complete_trials_summary=summary, + ) + assert result is not None + assert isinstance(result, ConfidenceShape) + assert result.headline.value == pytest.approx(0.840) + assert result.headline.n_queries is None + assert result.ci_95 is None + assert result.per_query_outcomes is None + # Aggregate signals populated (independent of per_query). + assert result.runner_up_gap is not None + assert result.late_trial_stddev is not None + assert result.convergence is not None + + def test_full_shape_with_all_data(self) -> None: + """All sub-fields populated when winner + runner-up both have + per_query_metrics, ≥10 complete trials, ≥5 queries.""" + per_query = {f"q{i}": {"ndcg": 0.8 + 0.01 * i} for i in range(10)} + winner = _trial( + optuna_trial_number=0, # winner appears in summary keys + primary_metric=0.85, + per_query_metrics=per_query, + ) + # Runner-up's per_query is shifted slightly so most queries improved. 
+ runner_up_pq = {f"q{i}": {"ndcg": 0.7 + 0.01 * i} for i in range(10)} + runner_up = _trial( + optuna_trial_number=10, + primary_metric=0.75, + per_query_metrics=runner_up_pq, + ) + # 15 complete trials. Winner is trial 0 (early — under 50% of max=140). + # Synthesize a late-window trial within 0.005 of the winner so we + # get early_held: trial 140 (= max) at 0.848 (gap = 0.002 < 0.005). + summary = [(0.85 - 0.01 * i, i * 10) for i in range(15)] + summary[-1] = (0.848, 140) + + result = compute_study_confidence( + study_objective={"metric": "ndcg", "k": 10, "direction": "maximize"}, + study_best_metric=0.85, + winner_trial=winner, + runner_up_trial=runner_up, + complete_trials_summary=summary, + query_text_by_id={}, # no regressors expected (all improved) + ) + assert result is not None + assert result.headline.n_queries == 10 + assert result.ci_95 is not None + assert result.runner_up_gap is not None + assert result.late_trial_stddev is not None + assert result.convergence is not None + assert result.per_query_outcomes is not None + # All queries improved → 0 regressors. 
+ assert result.per_query_outcomes.regressed == 0 + assert result.per_query_outcomes.improved == 10 + assert result.per_query_outcomes.comparison_against == "runner_up" + + def test_ci_95_independent_of_runner_up_per_query(self) -> None: + """AC-16: 1-complete-trial case — winner has per_query but no + runner-up → ci_95 + headline.n_queries populate from winner alone, + per_query_outcomes + runner_up_gap suppressed.""" + per_query = {f"q{i}": {"ndcg": 0.8 + 0.01 * i} for i in range(10)} + winner = _trial( + optuna_trial_number=0, + primary_metric=0.85, + per_query_metrics=per_query, + ) + summary = [(0.85, 0)] # only the winner + result = compute_study_confidence( + study_objective={"metric": "ndcg", "k": 10, "direction": "maximize"}, + study_best_metric=0.85, + winner_trial=winner, + runner_up_trial=None, + complete_trials_summary=summary, + ) + assert result is not None + # Winner-side signals populate. + assert result.headline.n_queries == 10 + assert result.ci_95 is not None + # Comparison-side signals suppress. + assert result.runner_up_gap is None # only 1 trial + assert result.per_query_outcomes is None # no runner-up + # Aggregate signals that need ≥10 / ≥3 trials suppress. + assert result.late_trial_stddev is None + assert result.convergence is None # only 1 trial < CONVERGENCE_MIN_COMPLETE + + +# Sanity check that all constants are defined and referenced (drift guard). 
+def test_constants_exported() -> None: + assert BOOTSTRAP_MIN_N_QUERIES == 5 + assert RUNNER_UP_GAP_MIN_COMPLETE == 2 + assert LATE_TRIAL_MIN_COMPLETE == 10 + assert CONVERGENCE_MIN_COMPLETE == 3 + assert TOP_REGRESSORS_CAP == 5 + assert set(REGRESSOR_THRESHOLDS.keys()) == {"ndcg", "precision", "recall", "map", "mrr"} + assert REGRESSOR_THRESHOLDS["ndcg"] == 0.01 + assert REGRESSOR_THRESHOLDS["map"] == 0.02 From 5936c09ae93371c91b1125585556bb04eee5c015 Mon Sep 17 00:00:00 2001 From: SoundMindsAI Date: Thu, 21 May 2026 09:40:34 -0400 Subject: [PATCH 08/17] feat(studies): wire ConfidenceShape into StudyDetail.confidence (Story 1.4) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - New backend/app/services/study_confidence.py owns the 4-query read pattern from spec FR-2 (winner, runner-up, complete-trials projection, conditional query_text lookup) and hands pre-fetched data to the pure-Python compute_study_confidence orchestrator (Story 1.3). - schemas.py re-exports ConfidenceShape via explicit `as` alias (mypy-strict no_implicit_reexport) and StudyDetail gains the optional `confidence` field (FR-5a). - studies.py::_detail() awaits fetch_study_confidence and threads the result into the StudyDetail response. - 13 integration tests at backend/tests/integration/test_studies_api_confidence.py cover AC-3, AC-3a (×2: null vs dangling best_trial_id), AC-4 (CI seed reproducibility), AC-5 (robust_plateau + sharp_peak), AC-6, AC-7, AC-8 (early_held), AC-9 (late_rising), AC-10 (named regressors + query_text join), AC-15, AC-16. - 2 contract tests assert OpenAPI shape lock: StudyDetail exposes `confidence: ConfidenceShape | null` and ConfidenceShape has the six FR-5a sub-fields. 
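The graceful-degradation behaviour these tests pin down can be sketched as a small gating function. This is only an illustrative sketch, not the code in `confidence.py`: the helper name `populated_signals` and its parameters are hypothetical, and the gating rules are inferred from the acceptance tests in this series rather than from the orchestrator itself. The threshold values are the ones asserted by `test_constants_exported`.

```python
from __future__ import annotations

# Values mirror the constants asserted in test_constants_exported.
BOOTSTRAP_MIN_N_QUERIES = 5
RUNNER_UP_GAP_MIN_COMPLETE = 2
LATE_TRIAL_MIN_COMPLETE = 10
CONVERGENCE_MIN_COMPLETE = 3


def populated_signals(
    *,
    has_winner: bool,
    n_complete_trials: int,
    n_winner_queries: int | None,
    runner_up_has_per_query: bool,
) -> set[str] | None:
    """Hypothetical helper: name the sub-fields expected to be non-null.

    Returns None for the whole-object-null case (no winner trial or no
    complete trials); otherwise returns the set of populated sub-fields.
    """
    if not has_winner or n_complete_trials == 0:
        return None

    # The headline value comes from the study row, so it is always present.
    fields = {"headline"}

    # Winner-side signal: needs enough per-query scores to bootstrap.
    if n_winner_queries is not None and n_winner_queries >= BOOTSTRAP_MIN_N_QUERIES:
        fields.add("ci_95")

    # Aggregate signals: gated only on the number of complete trials.
    if n_complete_trials >= RUNNER_UP_GAP_MIN_COMPLETE:
        fields.add("runner_up_gap")
    if n_complete_trials >= LATE_TRIAL_MIN_COMPLETE:
        fields.add("late_trial_stddev")
    if n_complete_trials >= CONVERGENCE_MIN_COMPLETE:
        fields.add("convergence")

    # Comparison-side signal: both trials must carry per-query metrics.
    if n_winner_queries is not None and runner_up_has_per_query:
        fields.add("per_query_outcomes")

    return fields
```

Under these assumptions, an old study with 12 complete trials and no per-query data yields only the aggregate signals (the AC-3 shape), while a single complete trial with 6 queries yields `headline` and `ci_95` alone (the AC-16 shape).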
Verification: make backend-fmt + backend-lint + backend-typecheck clean; 1034 backend unit tests pass; 185 contract tests pass (49 DB-required skips); 34 studies-related integration tests pass against the live in-container Postgres (test_studies_api.py + test_studies_api_confidence.py + test_study_lifecycle.py — no regressions in the existing suites). Co-Authored-By: Claude Opus 4.7 (1M context) --- backend/app/api/v1/schemas.py | 13 + backend/app/api/v1/studies.py | 3 + backend/app/services/study_confidence.py | 105 ++++ .../contract/test_studies_api_contract.py | 31 + .../test_studies_api_confidence.py | 546 ++++++++++++++++++ .../implementation_plan.md | 2 +- 6 files changed, 699 insertions(+), 1 deletion(-) create mode 100644 backend/app/services/study_confidence.py create mode 100644 backend/tests/integration/test_studies_api_confidence.py diff --git a/backend/app/api/v1/schemas.py b/backend/app/api/v1/schemas.py index 67acb3d1..fdb48773 100644 --- a/backend/app/api/v1/schemas.py +++ b/backend/app/api/v1/schemas.py @@ -24,6 +24,12 @@ from backend.app.adapters.protocol import TargetInfo from backend.app.core.settings import get_settings +from backend.app.domain.study.confidence import ConfidenceShape as ConfidenceShape + +# ``ConfidenceShape`` is defined in :mod:`backend.app.domain.study.confidence` +# (the canonical assembler module per Story 1.3). The explicit ``as`` re-export +# above keeps it importable via ``from backend.app.api.v1.schemas import +# ConfidenceShape`` under mypy strict's ``no_implicit_reexport``. EngineType = Literal["elasticsearch", "opensearch"] """Response-only: values are guaranteed by service-layer validation before the @@ -634,6 +640,13 @@ class StudyDetail(BaseModel): started_at: datetime | None completed_at: datetime | None trials_summary: TrialsSummaryShape + confidence: ConfidenceShape | None = None + """Per-study metric-confidence analytics (feat_pr_metric_confidence FR-5a). 
+ + ``None`` when the study has no winner trial (still running or + ``best_trial_id`` points at a deleted row — AC-3a). Otherwise a partial + or full :class:`ConfidenceShape` per FR-7's graceful-degradation + contract.""" class StudySummary(BaseModel): diff --git a/backend/app/api/v1/studies.py b/backend/app/api/v1/studies.py index e7cfeba6..1cb6ff3d 100644 --- a/backend/app/api/v1/studies.py +++ b/backend/app/api/v1/studies.py @@ -63,6 +63,7 @@ validate_against_template, ) from backend.app.services import study_state +from backend.app.services.study_confidence import fetch_study_confidence router = APIRouter() @@ -117,6 +118,7 @@ def _decode_trial_cursor(raw: str, sort_key: str) -> tuple[Any, str]: async def _detail(db: AsyncSession, row: Study) -> StudyDetail: summary = await repo.aggregate_trials_summary(db, row.id) + confidence = await fetch_study_confidence(db, row) return StudyDetail( id=row.id, name=row.name, @@ -145,6 +147,7 @@ async def _detail(db: AsyncSession, row: Study) -> StudyDetail: pruned=summary.pruned, best_primary_metric=summary.best_primary_metric, ), + confidence=confidence, ) diff --git a/backend/app/services/study_confidence.py b/backend/app/services/study_confidence.py new file mode 100644 index 00000000..68d45bce --- /dev/null +++ b/backend/app/services/study_confidence.py @@ -0,0 +1,105 @@ +"""Async glue for the pure-Python confidence orchestrator (feat_pr_metric_confidence Story 1.4). + +:mod:`backend.app.domain.study.confidence` keeps the analytics pure (no DB, +no I/O). This module owns the 4-query read pattern from spec FR-2 and +adapts the results into the pure orchestrator's keyword arguments. + +Consumers: + +* :func:`backend.app.api.v1.studies._detail` — enriches ``StudyDetail`` + (Story 1.4). +* :func:`backend.workers.git_pr.open_pr` — populates the PR body's + ``## Confidence`` section (Story 1.5). +* :func:`backend.workers.digest.generate_digest` — serializes the shape + into the ```` / ```` Jinja blocks + (Story 1.6). 
+""" + +from __future__ import annotations + +from sqlalchemy import select +from sqlalchemy.ext.asyncio import AsyncSession + +from backend.app.db import repo +from backend.app.db.models import Query, Study, Trial +from backend.app.domain.study.confidence import ( + ConfidenceShape, + compute_outcome_summary, + compute_study_confidence, +) + + +async def fetch_study_confidence(db: AsyncSession, study: Study) -> ConfidenceShape | None: + """Run the 4-query read pattern from FR-2 and assemble ``ConfidenceShape``. + + Returns ``None`` whole-object when ``study.best_trial_id`` is unset or + points at a missing row (FR-7 / AC-3a). Otherwise hands off to + :func:`compute_study_confidence` with all data pre-fetched. + + The Q4 ``queries`` lookup runs ONLY when ``compute_outcome_summary`` + identifies regressor candidates — most studies skip Q4 entirely. + """ + if study.best_trial_id is None: + return None + + # Q1: winner trial (full row — need .per_query_metrics + .optuna_trial_number). + winner = await repo.get_trial(db, study.best_trial_id) + if winner is None: + return None + + # Q2: runner-up trial — 2nd-best complete trial by primary_metric. + runner_up_stmt = ( + select(Trial) + .where( + Trial.study_id == study.id, + Trial.status == "complete", + Trial.id != winner.id, + ) + .order_by(Trial.primary_metric.desc().nulls_last()) + .limit(1) + ) + runner_up = (await db.execute(runner_up_stmt)).scalar_one_or_none() + + # Q3: complete-trials projection — (primary_metric, optuna_trial_number). + summary_stmt = ( + select(Trial.primary_metric, Trial.optuna_trial_number) + .where(Trial.study_id == study.id, Trial.status == "complete") + .order_by(Trial.optuna_trial_number.asc()) + ) + summary_rows = (await db.execute(summary_stmt)).all() + complete_trials_summary: list[tuple[float, int]] = [ + (row[0], row[1]) for row in summary_rows if row[0] is not None + ] + + # Q4 (conditional): query_text for regressor candidates. 
+ # The pure orchestrator runs compute_outcome_summary again internally — + # the second call is cheap (dict-key iteration on ≤100 queries) and keeps + # the pure-helper contract clean for unit tests. + query_text_by_id: dict[str, str] = {} + study_objective = study.objective if isinstance(study.objective, dict) else {} + metric = study_objective.get("metric") + if ( + isinstance(metric, str) + and runner_up is not None + and winner.per_query_metrics + and runner_up.per_query_metrics + ): + outcome = compute_outcome_summary( + winner_per_query=winner.per_query_metrics, + comparison_per_query=runner_up.per_query_metrics, + metric=metric, + ) + if outcome is not None and outcome.regressor_candidates: + qids = [qid for (qid, *_) in outcome.regressor_candidates] + q_stmt = select(Query.id, Query.query_text).where(Query.id.in_(qids)) + for qid, qtext in (await db.execute(q_stmt)).all(): + query_text_by_id[qid] = qtext + + return compute_study_confidence( + study_objective=study_objective, + study_best_metric=study.best_metric, + winner_trial=winner, + runner_up_trial=runner_up, + complete_trials_summary=complete_trials_summary, + query_text_by_id=query_text_by_id, + ) diff --git a/backend/tests/contract/test_studies_api_contract.py b/backend/tests/contract/test_studies_api_contract.py index 2d88a1e2..2c7f2da5 100644 --- a/backend/tests/contract/test_studies_api_contract.py +++ b/backend/tests/contract/test_studies_api_contract.py @@ -22,6 +22,7 @@ BulkQueriesJsonRequest, BulkQueriesResponse, BulkQueryItem, + ConfidenceShape, CreateQuerySetRequest, CreateQueryTemplateRequest, CreateStudyRequest, @@ -48,6 +49,7 @@ def test_phase2_schemas_importable() -> None: BulkQueriesJsonRequest, BulkQueriesResponse, BulkQueryItem, + ConfidenceShape, CreateQuerySetRequest, CreateQueryTemplateRequest, CreateStudyRequest, @@ -69,6 +71,35 @@ def test_phase2_schemas_importable() -> None: assert cls is not None +def test_study_detail_includes_confidence_field() -> None: + """``StudyDetail`` 
exposes ``confidence: ConfidenceShape | None`` (FR-5a).""" + schema = StudyDetail.model_json_schema() + assert "confidence" in schema["properties"], ( + "StudyDetail.confidence missing — see feat_pr_metric_confidence Story 1.4." + ) + # The field is Optional[ConfidenceShape], i.e. anyOf({$ref}, {null}). + prop = schema["properties"]["confidence"] + refs_or_anyof = prop.get("anyOf") or [prop] + assert any("$ref" in entry and "ConfidenceShape" in entry["$ref"] for entry in refs_or_anyof), ( + f"StudyDetail.confidence is not typed as Optional[ConfidenceShape]; got {prop!r}" + ) + + +def test_confidence_shape_has_six_subfields() -> None: + """``ConfidenceShape`` has the six FR-5a sub-fields.""" + schema = ConfidenceShape.model_json_schema() + expected = { + "headline", + "ci_95", + "runner_up_gap", + "late_trial_stddev", + "convergence", + "per_query_outcomes", + } + actual = set(schema["properties"].keys()) + assert expected == actual, f"ConfidenceShape fields drifted: expected {expected}, got {actual}" + + def test_study_config_requires_at_least_one_stop_condition() -> None: """``max_trials`` AND ``time_budget_min`` both None → ValidationError.""" with pytest.raises(ValidationError, match="stop condition"): diff --git a/backend/tests/integration/test_studies_api_confidence.py b/backend/tests/integration/test_studies_api_confidence.py new file mode 100644 index 00000000..b43f71bb --- /dev/null +++ b/backend/tests/integration/test_studies_api_confidence.py @@ -0,0 +1,546 @@ +"""Integration tests for ``StudyDetail.confidence`` (feat_pr_metric_confidence Story 1.4). + +Covers AC-3, AC-3a, AC-4, AC-5, AC-6, AC-7, AC-8, AC-9, AC-10, AC-15, AC-16 +end-to-end against the live FastAPI app + integration-test Postgres. The +shape itself is unit-tested in +``backend/tests/unit/domain/study/test_confidence.py`` (Story 1.3); this +suite proves the 4-query read pattern + ``_detail()`` wiring assemble the +shape correctly off a real DB. 
+""" + +from __future__ import annotations + +import uuid +from datetime import UTC, datetime +from typing import Any + +import httpx +import numpy as np +import pytest + +from backend.app.db import repo +from backend.app.db.session import get_session_factory +from backend.tests.conftest import postgres_reachable + +pytestmark = [ + pytest.mark.integration, + pytest.mark.skipif( + not postgres_reachable(), + reason="Postgres not reachable — see docs/03_runbooks/local-dev.md", + ), +] + + +async def _seed_study( + *, + objective: dict[str, Any] | None = None, + best_metric: float | None = 0.84, + seed_queries: int = 0, +) -> dict[str, Any]: + """Seed a minimal study chain. Returns ids + the chain handles. + + No trials inserted — the caller adds trials via :func:`_insert_trial` + and then patches ``study.best_trial_id`` via :func:`_set_best_trial`. + """ + if objective is None: + objective = {"metric": "ndcg", "k": 10, "direction": "maximize"} + factory = get_session_factory() + async with factory() as db: + cluster = await repo.create_cluster( + db, + id=str(uuid.uuid4()), + name=f"cf-cluster-{uuid.uuid4().hex[:8]}", + engine_type="elasticsearch", + environment="dev", + base_url="http://stub:9200", + auth_kind="es_basic", + credentials_ref="ref", + ) + template = await repo.create_query_template( + db, + id=str(uuid.uuid4()), + name=f"cf-tmpl-{uuid.uuid4().hex[:8]}", + engine_type="elasticsearch", + body='{"query": {"match_all": {}}}', + declared_params={}, + version=1, + ) + query_set = await repo.create_query_set( + db, + id=str(uuid.uuid4()), + name=f"cf-qs-{uuid.uuid4().hex[:8]}", + cluster_id=cluster.id, + ) + query_ids: list[str] = [] + for i in range(seed_queries): + qid = str(uuid.uuid4()) + await repo.create_query( + db, + id=qid, + query_set_id=query_set.id, + query_text=f"q-text-{i}", + ) + query_ids.append(qid) + jl = await repo.create_judgment_list( + db, + id=str(uuid.uuid4()), + name=f"cf-jl-{uuid.uuid4().hex[:8]}", + description=None, + 
query_set_id=query_set.id, + cluster_id=cluster.id, + target="stub-index", + current_template_id=template.id, + rubric="r", + status="complete", + ) + study_id = str(uuid.uuid4()) + await repo.create_study( + db, + id=study_id, + name=f"cf-study-{uuid.uuid4().hex[:8]}", + cluster_id=cluster.id, + target="stub-index", + template_id=template.id, + query_set_id=query_set.id, + judgment_list_id=jl.id, + search_space={}, + objective=objective, + config={"max_trials": 100}, + status="completed", + failed_reason=None, + optuna_study_name=study_id, + baseline_metric=None, + best_metric=best_metric, + best_trial_id=None, + ) + await db.commit() + return { + "study_id": study_id, + "cluster_id": cluster.id, + "query_set_id": query_set.id, + "query_ids": query_ids, + } + + +async def _insert_trial( + *, + study_id: str, + optuna_trial_number: int, + primary_metric: float | None, + per_query_metrics: dict[str, Any] | None = None, + status: str = "complete", +) -> str: + """Insert one trial row directly; returns its UUID.""" + factory = get_session_factory() + async with factory() as db: + trial_id = str(uuid.uuid4()) + kwargs: dict[str, Any] = { + "id": trial_id, + "study_id": study_id, + "optuna_trial_number": optuna_trial_number, + "status": status, + "params": {}, + "metrics": {}, + "primary_metric": primary_metric, + "started_at": datetime.now(UTC), + "ended_at": datetime.now(UTC), + "duration_ms": 100, + } + if per_query_metrics is not None: + kwargs["per_query_metrics"] = per_query_metrics + await repo.create_trial(db, **kwargs) + await db.commit() + return trial_id + + +async def _set_best_trial(study_id: str, trial_id: str | None) -> None: + """Patch ``studies.best_trial_id`` post-hoc.""" + from backend.app.db.models import Study as _Study + + factory = get_session_factory() + async with factory() as db: + row = await db.get(_Study, study_id) + assert row is not None + row.best_trial_id = trial_id + await db.flush() + await db.commit() + + +# 
--------------------------------------------------------------------------- +# AC-3 — old study (per_query_metrics IS NULL) → partial confidence +# --------------------------------------------------------------------------- + + +async def test_ac3_old_study_returns_partial_confidence_with_aggregate_signals( + async_client: httpx.AsyncClient, +) -> None: + """Per-query sub-fields null; aggregate signals populated.""" + ctx = await _seed_study(best_metric=0.84) + # 12 complete trials, NO per_query_metrics — covers the 10-trial floor for + # late_trial_stddev and the 3-trial floor for convergence. + trial_ids: list[str] = [] + metrics = [0.84, 0.82, 0.80, 0.78, 0.76, 0.74, 0.72, 0.70, 0.68, 0.66, 0.64, 0.62] + for i, m in enumerate(metrics): + tid = await _insert_trial( + study_id=ctx["study_id"], + optuna_trial_number=i, + primary_metric=m, + ) + trial_ids.append(tid) + # Winner is trial 0 (highest primary_metric). + await _set_best_trial(ctx["study_id"], trial_ids[0]) + + resp = await async_client.get(f"/api/v1/studies/{ctx['study_id']}") + assert resp.status_code == 200, resp.text + confidence = resp.json()["confidence"] + assert confidence is not None + assert confidence["ci_95"] is None # AC-3: no per-query → no CI + assert confidence["per_query_outcomes"] is None # AC-3: no per-query data + assert confidence["headline"]["n_queries"] is None # AC-3 + assert confidence["headline"]["value"] == pytest.approx(0.84) + # Aggregate signals populated. 
+ assert confidence["runner_up_gap"] is not None + assert confidence["late_trial_stddev"] is not None + assert confidence["convergence"] is not None + assert confidence["convergence"]["best_at_trial"] == 0 + assert confidence["convergence"]["total_trials"] == 12 + + +# --------------------------------------------------------------------------- +# AC-3a — best_trial_id IS NULL → confidence whole-object null +# --------------------------------------------------------------------------- + + +async def test_ac3a_missing_best_trial_id_returns_null_confidence( + async_client: httpx.AsyncClient, +) -> None: + """best_trial_id unset → whole-object null.""" + ctx = await _seed_study(best_metric=None) + # No trials, no winner. + resp = await async_client.get(f"/api/v1/studies/{ctx['study_id']}") + assert resp.status_code == 200, resp.text + assert resp.json()["confidence"] is None + + +async def test_ac3a_dangling_best_trial_id_returns_null_confidence( + async_client: httpx.AsyncClient, +) -> None: + """best_trial_id set but trial row missing → whole-object null.""" + ctx = await _seed_study(best_metric=0.5) + # Point at a non-existent trial id. + await _set_best_trial(ctx["study_id"], str(uuid.uuid4())) + resp = await async_client.get(f"/api/v1/studies/{ctx['study_id']}") + assert resp.status_code == 200, resp.text + assert resp.json()["confidence"] is None + + +# --------------------------------------------------------------------------- +# AC-4 — bootstrap CI reproducibility (same response twice → byte-equal CI) +# --------------------------------------------------------------------------- + + +async def test_ac4_bootstrap_ci_is_reproducible_across_calls( + async_client: httpx.AsyncClient, +) -> None: + """Two successive GETs → identical CI low/high (seed=42 lock).""" + ctx = await _seed_study(best_metric=0.84, seed_queries=20) + qids = ctx["query_ids"] + # Winner: per_query_metrics carries an ndcg value for each of 20 queries. 
+ # Use deterministic floats spread across [0.6, 0.95] for a non-degenerate CI. + winner_per_query = { + qid: {"ndcg": 0.6 + (i * 0.018), "map": 0.5, "precision": 0.5, "recall": 0.5, "mrr": 0.5} + for i, qid in enumerate(qids) + } + # Need ≥10 trials so all aggregate signals populate. + trial_ids: list[str] = [] + for i in range(15): + per_q = winner_per_query if i == 0 else None + tid = await _insert_trial( + study_id=ctx["study_id"], + optuna_trial_number=i, + primary_metric=0.84 - (i * 0.01), + per_query_metrics=per_q, + ) + trial_ids.append(tid) + await _set_best_trial(ctx["study_id"], trial_ids[0]) + + r1 = await async_client.get(f"/api/v1/studies/{ctx['study_id']}") + r2 = await async_client.get(f"/api/v1/studies/{ctx['study_id']}") + ci1 = r1.json()["confidence"]["ci_95"] + ci2 = r2.json()["confidence"]["ci_95"] + assert ci1 is not None and ci2 is not None + assert ci1["low"] == ci2["low"] + assert ci1["high"] == ci2["high"] + assert ci1["method"] == "bootstrap_n1000" + assert ci1["n_samples"] == 20 + + +# --------------------------------------------------------------------------- +# AC-5 — runner-up gap classification (robust_plateau vs sharp_peak) +# --------------------------------------------------------------------------- + + +async def test_ac5_runner_up_gap_robust_plateau(async_client: httpx.AsyncClient) -> None: + """Top 10 within 0.005 of the winner → robust_plateau.""" + ctx = await _seed_study(best_metric=0.840) + # 10 trials all within < 0.005 of the winner — picked just inside the band + # to avoid float-equality boundary noise on 0.840 - 0.835. 
+ metrics = [0.840, 0.839, 0.838, 0.837, 0.837, 0.838, 0.837, 0.838, 0.839, 0.837] + trial_ids: list[str] = [] + for i, m in enumerate(metrics): + tid = await _insert_trial(study_id=ctx["study_id"], optuna_trial_number=i, primary_metric=m) + trial_ids.append(tid) + await _set_best_trial(ctx["study_id"], trial_ids[0]) + + resp = await async_client.get(f"/api/v1/studies/{ctx['study_id']}") + gap = resp.json()["confidence"]["runner_up_gap"] + assert gap is not None + assert gap["classification"] == "robust_plateau" + + +async def test_ac5_runner_up_gap_sharp_peak(async_client: httpx.AsyncClient) -> None: + """Winner > 0.005 above runner-up → sharp_peak.""" + ctx = await _seed_study(best_metric=0.840) + metrics = [0.840, 0.760, 0.755, 0.750] + trial_ids: list[str] = [] + for i, m in enumerate(metrics): + tid = await _insert_trial(study_id=ctx["study_id"], optuna_trial_number=i, primary_metric=m) + trial_ids.append(tid) + await _set_best_trial(ctx["study_id"], trial_ids[0]) + + resp = await async_client.get(f"/api/v1/studies/{ctx['study_id']}") + gap = resp.json()["confidence"]["runner_up_gap"] + assert gap is not None + assert gap["classification"] == "sharp_peak" + + +# --------------------------------------------------------------------------- +# AC-6 — late-trial stddev matches numpy at N=50 +# --------------------------------------------------------------------------- + + +async def test_ac6_late_trial_stddev_window_math_matches_numpy( + async_client: httpx.AsyncClient, +) -> None: + """50 complete trials → window_size=10, value = np.std(last10, ddof=1).""" + ctx = await _seed_study(best_metric=0.99) + metrics = [0.99 - (i * 0.005) for i in range(50)] + trial_ids: list[str] = [] + for i, m in enumerate(metrics): + tid = await _insert_trial(study_id=ctx["study_id"], optuna_trial_number=i, primary_metric=m) + trial_ids.append(tid) + await _set_best_trial(ctx["study_id"], trial_ids[0]) + + resp = await async_client.get(f"/api/v1/studies/{ctx['study_id']}") + noise = 
resp.json()["confidence"]["late_trial_stddev"] + assert noise is not None + assert noise["window_size"] == 10 + expected = float(np.std(np.asarray(metrics[-10:], dtype=np.float64), ddof=1)) + assert noise["value"] == pytest.approx(expected, rel=1e-9) + + +# --------------------------------------------------------------------------- +# AC-7 — late-trial stddev suppressed at N<10 +# --------------------------------------------------------------------------- + + +async def test_ac7_late_trial_stddev_null_when_fewer_than_ten_trials( + async_client: httpx.AsyncClient, +) -> None: + ctx = await _seed_study(best_metric=0.8) + trial_ids: list[str] = [] + for i in range(9): + tid = await _insert_trial( + study_id=ctx["study_id"], + optuna_trial_number=i, + primary_metric=0.8 - (i * 0.01), + ) + trial_ids.append(tid) + await _set_best_trial(ctx["study_id"], trial_ids[0]) + + resp = await async_client.get(f"/api/v1/studies/{ctx['study_id']}") + assert resp.json()["confidence"]["late_trial_stddev"] is None + + +# --------------------------------------------------------------------------- +# AC-8 — early_held convergence regime +# --------------------------------------------------------------------------- + + +async def test_ac8_convergence_regime_early_held(async_client: httpx.AsyncClient) -> None: + """Winner at trial 200/1000 + late plateau within 0.005 → early_held.""" + ctx = await _seed_study(best_metric=0.84) + # Sparse trial-number distribution to avoid inserting 1000 rows. + # Winner at trial_number=200; max_trial_number=1000; a late trial at + # trial_number=800 has primary_metric within 0.005 of the winner. 
+ trial_specs = [ + (0, 0.70), + (100, 0.78), + (200, 0.84), # winner + (400, 0.80), + (600, 0.79), + (800, 0.838), # late plateau within 0.005 of 0.84 + (1000, 0.74), + ] + trial_ids: dict[int, str] = {} + for tn, m in trial_specs: + tid = await _insert_trial( + study_id=ctx["study_id"], optuna_trial_number=tn, primary_metric=m + ) + trial_ids[tn] = tid + await _set_best_trial(ctx["study_id"], trial_ids[200]) + + resp = await async_client.get(f"/api/v1/studies/{ctx['study_id']}") + conv = resp.json()["confidence"]["convergence"] + assert conv is not None + assert conv["regime"] == "early_held" + assert conv["best_at_trial"] == 200 + + +# --------------------------------------------------------------------------- +# AC-9 — late_rising convergence regime +# --------------------------------------------------------------------------- + + +async def test_ac9_convergence_regime_late_rising(async_client: httpx.AsyncClient) -> None: + """Winner at trial 950/1000 → late_rising.""" + ctx = await _seed_study(best_metric=0.84) + trial_specs = [ + (0, 0.50), + (100, 0.55), + (500, 0.70), + (800, 0.78), + (950, 0.84), # winner — past 90% of max trial number + (1000, 0.82), + ] + trial_ids: dict[int, str] = {} + for tn, m in trial_specs: + tid = await _insert_trial( + study_id=ctx["study_id"], optuna_trial_number=tn, primary_metric=m + ) + trial_ids[tn] = tid + await _set_best_trial(ctx["study_id"], trial_ids[950]) + + resp = await async_client.get(f"/api/v1/studies/{ctx['study_id']}") + conv = resp.json()["confidence"]["convergence"] + assert conv is not None + assert conv["regime"] == "late_rising" + + +# --------------------------------------------------------------------------- +# AC-10 — per-query regressor naming with query_text join +# --------------------------------------------------------------------------- + + +async def test_ac10_per_query_regressor_includes_query_text( + async_client: httpx.AsyncClient, +) -> None: + """Regressor row carries query_text from the 
queries table.""" + ctx = await _seed_study(best_metric=0.84, seed_queries=2) + qids = ctx["query_ids"] + qA, qB = qids[0], qids[1] + # Winner: qA scored 0.41 (will regress vs runner-up's 0.92); + # qB scored 0.85 (unchanged vs runner-up's 0.85). + winner_per_query = { + qA: {"ndcg": 0.41}, + qB: {"ndcg": 0.85}, + } + runner_up_per_query = { + qA: {"ndcg": 0.92}, + qB: {"ndcg": 0.85}, + } + winner_id = await _insert_trial( + study_id=ctx["study_id"], + optuna_trial_number=0, + primary_metric=0.84, + per_query_metrics=winner_per_query, + ) + await _insert_trial( + study_id=ctx["study_id"], + optuna_trial_number=1, + primary_metric=0.83, + per_query_metrics=runner_up_per_query, + ) + await _set_best_trial(ctx["study_id"], winner_id) + + resp = await async_client.get(f"/api/v1/studies/{ctx['study_id']}") + outcomes = resp.json()["confidence"]["per_query_outcomes"] + assert outcomes is not None + assert outcomes["comparison_against"] == "runner_up" + assert outcomes["regressed"] == 1 + assert outcomes["unchanged"] == 1 + assert outcomes["improved"] == 0 + regressors = outcomes["top_regressors"] + assert len(regressors) == 1 + row = regressors[0] + assert row["query_id"] == qA + assert row["query_text"] == "q-text-0" + assert row["winner_score"] == pytest.approx(0.41) + assert row["comparison_score"] == pytest.approx(0.92) + assert row["delta"] == pytest.approx(-0.51) + + +# --------------------------------------------------------------------------- +# AC-15 — bootstrap CI suppressed at N(queries) < 5 +# --------------------------------------------------------------------------- + + +async def test_ac15_bootstrap_ci_null_when_fewer_than_five_queries( + async_client: httpx.AsyncClient, +) -> None: + ctx = await _seed_study(best_metric=0.8, seed_queries=4) + qids = ctx["query_ids"] + winner_per_query = {qid: {"ndcg": 0.7 + i * 0.02} for i, qid in enumerate(qids)} + winner_id = await _insert_trial( + study_id=ctx["study_id"], + optuna_trial_number=0, + primary_metric=0.8, 
+ per_query_metrics=winner_per_query, + ) + await _set_best_trial(ctx["study_id"], winner_id) + + resp = await async_client.get(f"/api/v1/studies/{ctx['study_id']}") + confidence = resp.json()["confidence"] + assert confidence is not None + assert confidence["ci_95"] is None + # headline still populates from study.best_metric. + assert confidence["headline"]["value"] == pytest.approx(0.8) + assert confidence["headline"]["n_queries"] == 4 + + +# --------------------------------------------------------------------------- +# AC-16 — per_query_outcomes + runner_up_gap suppressed when only 1 complete trial +# --------------------------------------------------------------------------- + + +async def test_ac16_single_complete_trial_suppresses_runner_up_signals( + async_client: httpx.AsyncClient, +) -> None: + """Only 1 complete trial → per_query_outcomes + runner_up_gap null; CI still populates.""" + ctx = await _seed_study(best_metric=0.8, seed_queries=6) + qids = ctx["query_ids"] + winner_per_query = {qid: {"ndcg": 0.7 + i * 0.02} for i, qid in enumerate(qids)} + winner_id = await _insert_trial( + study_id=ctx["study_id"], + optuna_trial_number=0, + primary_metric=0.8, + per_query_metrics=winner_per_query, + ) + # Other trials all failed (no primary_metric). + for i in range(1, 5): + await _insert_trial( + study_id=ctx["study_id"], + optuna_trial_number=i, + primary_metric=None, + status="failed", + ) + await _set_best_trial(ctx["study_id"], winner_id) + + resp = await async_client.get(f"/api/v1/studies/{ctx['study_id']}") + confidence = resp.json()["confidence"] + assert confidence is not None + assert confidence["per_query_outcomes"] is None + assert confidence["runner_up_gap"] is None + # Winner-only signals still populate. 
+ assert confidence["ci_95"] is not None + assert confidence["headline"]["n_queries"] == 6 diff --git a/docs/02_product/planned_features/feat_pr_metric_confidence/implementation_plan.md b/docs/02_product/planned_features/feat_pr_metric_confidence/implementation_plan.md index c59f490e..dfda795b 100644 --- a/docs/02_product/planned_features/feat_pr_metric_confidence/implementation_plan.md +++ b/docs/02_product/planned_features/feat_pr_metric_confidence/implementation_plan.md @@ -1053,7 +1053,7 @@ None planned. The feature is purely additive across all surfaces. - [ ] Story 1.1 — Migration `0015_trials_per_query_metrics` - [ ] Story 1.2 — Persist `per_query_metrics` in `run_trial` - [ ] Story 1.3 — Domain module `confidence.py` -- [ ] Story 1.4 — `ConfidenceShape` + StudyDetail enrichment +- [x] Story 1.4 — `ConfidenceShape` + StudyDetail enrichment - [ ] Story 1.5 — PR body section + worker plumbing - [ ] Story 1.6 — Digest narrative prompt extension - [ ] **Epic 1 gate** From d92fd5f1bb5c12d0b520def063c51206334e6003 Mon Sep 17 00:00:00 2001 From: SoundMindsAI Date: Thu, 21 May 2026 11:54:57 -0400 Subject: [PATCH 09/17] feat(worker): emit ## Confidence section in study-backed PR body (Story 1.5) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - _render_pr_body_study_backed gains optional `confidence: ConfidenceShape | None = None` kwarg. New _render_confidence_section helper emits the block between ## Metric delta and ## Config diff, with each sub-block (CI line, per-query outcome counts, named regressors, runner-up gap, late-trial 1σ, convergence) independently gated on its sub-field being non-null (FR-7 / AC-3 partial-render path). - open_pr worker call site fetches confidence via fetch_study_confidence before rendering when study_id is set (FR-5d). 
- 4 contract tests at backend/tests/contract/test_pr_body_confidence_section.py cover AC-11 (full section + ordering + content), AC-12 (omission on whole-object null), partial-render (FR-7 / AC-3 mirror), and the named-regressors omission when regressed == 0.
- 1 integration test at backend/tests/integration/test_open_pr_worker_confidence_plumbing.py drives end-to-end: seed completed study with winner + runner-up trials having per_query_metrics → live fetch_study_confidence → real renderer outputs ## Confidence with the expected lines.

Bundled inline fixes for pre-existing test failures uncovered during the full integration-suite run (per the inline-fix rubric — mechanical assertion updates, same feature branch):

- test_migrations.py: bump alembic head assertion 0014 → 0015 (Story 1.1 forgot to update the assertion when adding migration 0015).
- test_trials_per_query_metrics_migration.py: judgment_lists.status value 'ready' (CHECK-violating) → 'complete', and switch :pq::jsonb parameter to inline '[]'::jsonb literal (psycopg2 parser conflicted with adjacent ::cast syntax).

Captured tangential discovery as docs/02_product/planned_features/bug_confidence_per_query_metric_key_drift/idea.md: score() output uses `ndcg@10`-style per_query keys but Story 1.3's orchestrator + Story 1.2's test assertion both assume bare `ndcg` keys. Production confidence will silently degrade to aggregate-only for every real completed study until that drift is resolved. Surfaced by Story 1.2's test_successful_trial_writes_per_query_metrics; remains as the sole failing integration test in this commit's branch state (525 pass, 1 fail, 2 skip; the fail is the captured-as-idea drift).

Verification: make backend-fmt + backend-lint + backend-typecheck clean (ruff format --check parity OK); 1034 backend unit tests pass; 189 contract tests pass (49 DB-required skips); 525/527 in-container integration tests pass (2 health probes skip without a running api; the captured Story 1.2 drift is the single remaining fail).

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .../test_pr_body_confidence_section.py | 219 +++++++++++++++
 backend/tests/integration/test_migrations.py | 13 +-
 ...test_open_pr_worker_confidence_plumbing.py | 252 ++++++++++++++++++
 ...test_trials_per_query_metrics_migration.py | 5 +-
 backend/workers/git_pr.py | 65 ++++-
 docs/00_overview/DASHBOARD.md | 2 +-
 docs/00_overview/MVP1_DASHBOARD.md | 7 +-
 docs/00_overview/dashboard.html | 2 +-
 docs/00_overview/mvp1_dashboard.html | 20 +-
 .../idea.md | 53 ++++
 .../implementation_plan.md | 2 +-
 11 files changed, 620 insertions(+), 20 deletions(-)
 create mode 100644 backend/tests/contract/test_pr_body_confidence_section.py
 create mode 100644 backend/tests/integration/test_open_pr_worker_confidence_plumbing.py
 create mode 100644 docs/02_product/planned_features/bug_confidence_per_query_metric_key_drift/idea.md

diff --git a/backend/tests/contract/test_pr_body_confidence_section.py b/backend/tests/contract/test_pr_body_confidence_section.py
new file mode 100644
index 00000000..f17c2120
--- /dev/null
+++ b/backend/tests/contract/test_pr_body_confidence_section.py
@@ -0,0 +1,219 @@
+"""Contract tests for the PR body's ``## Confidence`` section (Story 1.5).
+
+Covers AC-11 (full-confidence rendering), AC-12 (section omitted on
+whole-object null), and the partial-render path (FR-7 / AC-3 mirror).
+
+These call ``_render_pr_body_study_backed`` directly with factory-built
+:class:`ConfidenceShape` instances — the renderer signature requires the
+typed Pydantic object (cycle-2 GPT-5.5 F3).
The seed helper +:func:`make_test_confidence` builds a full shape with sensible defaults +and accepts per-test-case overrides for each sub-field. +""" + +from __future__ import annotations + +from types import SimpleNamespace +from typing import Any + +from backend.app.domain.study.confidence import ( + CIShape, + ConfidenceShape, + ConvergenceShape, + HeadlineShape, + LateTrialStddevShape, + PerQueryOutcomesShape, + RegressorRowShape, + RunnerUpGapShape, +) +from backend.workers.git_pr import _render_pr_body_study_backed + + +def make_test_confidence(**overrides: Any) -> ConfidenceShape: + """Build a fully-populated ``ConfidenceShape`` for tests. + + Any of the six sub-fields may be overridden by passing the field name + as a kwarg (e.g. ``make_test_confidence(ci_95=None)``). + """ + defaults: dict[str, Any] = { + "headline": HeadlineShape(metric="ndcg", value=0.840, k=10, n_queries=20), + "ci_95": CIShape(low=0.780, high=0.890, method="bootstrap_n1000", n_samples=20), + "runner_up_gap": RunnerUpGapShape( + value=0.002, + classification="robust_plateau", + top10_within=0.004, + runner_up_metric=0.838, + ), + "late_trial_stddev": LateTrialStddevShape( + value=0.012, window_size=20, min_window_required=10 + ), + "convergence": ConvergenceShape(best_at_trial=387, total_trials=1000, regime="early_held"), + "per_query_outcomes": PerQueryOutcomesShape( + improved=14, + unchanged=4, + regressed=2, + comparison_against="runner_up", + top_regressors=[ + RegressorRowShape( + query_id="q1", + query_text="vintage acoustic guitar", + winner_score=0.41, + comparison_score=0.92, + delta=-0.51, + ), + RegressorRowShape( + query_id="q2", + query_text="leather wallet", + winner_score=0.55, + comparison_score=0.78, + delta=-0.23, + ), + ], + ), + } + defaults.update(overrides) + return ConfidenceShape(**defaults) + + +def _make_proposal_and_study() -> tuple[ + SimpleNamespace, SimpleNamespace, SimpleNamespace, dict[str, Any] +]: + """Build the inputs every test needs (proposal, 
study, digest, config_diff).""" + proposal = SimpleNamespace( + metric_delta={ + "ndcg@10": {"baseline": 0.612, "achieved": 0.840, "delta_pct": 37.3}, + }, + ) + study = SimpleNamespace(id="study-abc", name="prod-en-v1") + digest = SimpleNamespace(suggested_followups=["Try BM25 k1=1.4"]) + config_diff = {"k1": {"from": 1.2, "to": 1.4}} + return proposal, study, digest, config_diff + + +# --------------------------------------------------------------------------- +# AC-11 — full-confidence PR body +# --------------------------------------------------------------------------- + + +def test_ac11_full_confidence_section_renders_between_metric_delta_and_config_diff() -> None: + proposal, study, digest, config_diff = _make_proposal_and_study() + confidence = make_test_confidence() + body = _render_pr_body_study_backed( + proposal=proposal, + study=study, + digest=digest, + config_diff=config_diff, + chart_md="", + base_url=None, + confidence=confidence, + ) + # Section appears. + assert "## Confidence" in body + # Section ordering — Confidence falls between Metric delta and Config diff. + metric_delta_idx = body.index("## Metric delta") + confidence_idx = body.index("## Confidence") + config_diff_idx = body.index("## Config diff") + assert metric_delta_idx < confidence_idx < config_diff_idx + # CI line shape: metric@k, value, 95% CI low-high, N=queries. + assert "ndcg@10: 0.840" in body + assert "95% CI 0.780-0.890" in body + assert "N=20 queries" in body + # Per-query line. + assert "Queries: 14 improved · 4 unchanged · 2 regressed (vs runner_up)" in body + # Named regressors with text + score arrow. + assert "`vintage acoustic guitar` (0.920 → 0.410)" in body + assert "`leather wallet` (0.780 → 0.550)" in body + # Runner-up gap line. + assert "Runner-up gap 0.002 (robust_plateau)" in body + # Late-trial 1σ. + assert "Late-trial 1σ = 0.012" in body + # Convergence. 
+ assert "Convergence: early_held (best at trial 387 of 1000)" in body + + +# --------------------------------------------------------------------------- +# AC-12 — section omitted when confidence is None +# --------------------------------------------------------------------------- + + +def test_ac12_confidence_section_omitted_when_confidence_is_none() -> None: + proposal, study, digest, config_diff = _make_proposal_and_study() + body = _render_pr_body_study_backed( + proposal=proposal, + study=study, + digest=digest, + config_diff=config_diff, + chart_md="", + base_url=None, + confidence=None, + ) + assert "## Confidence" not in body + # Section ordering reverts to Metric delta → Config diff (no gap). + metric_delta_idx = body.index("## Metric delta") + config_diff_idx = body.index("## Config diff") + assert metric_delta_idx < config_diff_idx + # The headline link / proposal / config-diff rendering still works. + assert "| `k1` | `1.2` | `1.4` |" in body + + +# --------------------------------------------------------------------------- +# Partial render — sub-fields independently null (FR-7 / AC-3 mirror) +# --------------------------------------------------------------------------- + + +def test_partial_confidence_renders_only_non_null_sub_fields() -> None: + """Old-study case: ci_95 + per_query_outcomes null; aggregate signals present.""" + proposal, study, digest, config_diff = _make_proposal_and_study() + confidence = make_test_confidence( + ci_95=None, + per_query_outcomes=None, + headline=HeadlineShape(metric="ndcg", value=0.840, k=10, n_queries=None), + ) + body = _render_pr_body_study_backed( + proposal=proposal, + study=study, + digest=digest, + config_diff=config_diff, + chart_md="", + base_url=None, + confidence=confidence, + ) + # Section heading present. + assert "## Confidence" in body + # CI + per-query sub-lines absent. 
+ assert "95% CI" not in body + assert "Queries:" not in body + assert "Queries that regressed:" not in body + # Aggregate signals still rendered. + assert "Runner-up gap 0.002 (robust_plateau)" in body + assert "Late-trial 1σ = 0.012" in body + assert "Convergence: early_held (best at trial 387 of 1000)" in body + + +# --------------------------------------------------------------------------- +# Regressors block — omitted when regressed == 0 +# --------------------------------------------------------------------------- + + +def test_regressors_line_omitted_when_no_queries_regressed() -> None: + """Per-query block present; named-regressors list absent when regressed == 0.""" + proposal, study, digest, config_diff = _make_proposal_and_study() + confidence = make_test_confidence( + per_query_outcomes=PerQueryOutcomesShape( + improved=18, + unchanged=2, + regressed=0, + comparison_against="runner_up", + top_regressors=[], + ), + ) + body = _render_pr_body_study_backed( + proposal=proposal, + study=study, + digest=digest, + config_diff=config_diff, + chart_md="", + base_url=None, + confidence=confidence, + ) + assert "Queries: 18 improved · 2 unchanged · 0 regressed (vs runner_up)" in body + assert "Queries that regressed:" not in body diff --git a/backend/tests/integration/test_migrations.py b/backend/tests/integration/test_migrations.py index 3cbaa54d..506cfa13 100644 --- a/backend/tests/integration/test_migrations.py +++ b/backend/tests/integration/test_migrations.py @@ -123,8 +123,9 @@ def test_upgrade_head_creates_alembic_version(self, fresh_db: None) -> None: # Baseline is "0001" per migrations/versions/0001_baseline.py. # Head extended by feat_data_table_primitive migrations # 0008–0013 (search_vector columns + GIN indexes on 6 tables); - # 0014 adds clusters.target_filter (feat_cluster_target_filter). - assert row[0] == "0014" + # 0014 adds clusters.target_filter (feat_cluster_target_filter); + # 0015 adds trials.per_query_metrics (feat_pr_metric_confidence). 
+ assert row[0] == "0015" finally: engine.dispose() @@ -139,16 +140,16 @@ def test_round_trip(self, fresh_db: None) -> None: """Downgrade by one revision and re-upgrade returns cleanly to head.""" _alembic("upgrade", "head") _alembic("downgrade", "-1") - # After downgrade -1 from head (0014) we land at 0013. Re-upgrade - # re-applies 0014 cleanly per CLAUDE.md Absolute Rule #5. + # After downgrade -1 from head (0015) we land at 0014. Re-upgrade + # re-applies 0015 cleanly per CLAUDE.md Absolute Rule #5. _alembic("upgrade", "head") engine = create_engine(_sync_database_url(), future=True) try: with engine.connect() as conn: row = conn.execute(text("SELECT version_num FROM alembic_version")).fetchone() assert row is not None - # Head: 0014 (feat_cluster_target_filter — clusters.target_filter). - assert row[0] == "0014" + # Head: 0015 (feat_pr_metric_confidence — trials.per_query_metrics). + assert row[0] == "0015" finally: engine.dispose() diff --git a/backend/tests/integration/test_open_pr_worker_confidence_plumbing.py b/backend/tests/integration/test_open_pr_worker_confidence_plumbing.py new file mode 100644 index 00000000..7bfdab6a --- /dev/null +++ b/backend/tests/integration/test_open_pr_worker_confidence_plumbing.py @@ -0,0 +1,252 @@ +"""End-to-end integration test for the ``open_pr`` worker's confidence plumbing. + +Story 1.5 / FR-5d: prove that the worker's call site fetches confidence +via :func:`fetch_study_confidence` and threads it into +:func:`_render_pr_body_study_backed` so the ``## Confidence`` section +lands in the rendered PR body. The full 15-step worker contract (lock, +clone, push, GitHub POST) is exercised by the existing +``feat_github_pr_worker`` integration suite — this test focuses on the +new confidence data plumbing without re-running those steps. 
+ +We drive the real DB (live session via the integration-test Postgres) +plus the live :func:`fetch_study_confidence` service helper, then feed +the resulting shape into the real :func:`_render_pr_body_study_backed` +renderer. Both functions are imported from production code; only the +git / GitHub side effects are bypassed. +""" + +from __future__ import annotations + +import uuid +from datetime import UTC, datetime +from types import SimpleNamespace +from typing import Any + +import pytest + +from backend.app.db import repo +from backend.app.db.session import get_session_factory +from backend.app.services.study_confidence import fetch_study_confidence +from backend.tests.conftest import postgres_reachable +from backend.workers.git_pr import _render_pr_body_study_backed + +pytestmark = [ + pytest.mark.integration, + pytest.mark.skipif( + not postgres_reachable(), + reason="Postgres not reachable — see docs/03_runbooks/local-dev.md", + ), +] + + +async def _seed_completed_study_with_per_query_metrics( + *, + n_queries: int = 8, + n_total_trials: int = 12, +) -> dict[str, Any]: + """Seed a completed study with per_query_metrics populated on the + winner trial + runner-up trial. Returns ids + the study row. 
+ """ + factory = get_session_factory() + async with factory() as db: + cluster = await repo.create_cluster( + db, + id=str(uuid.uuid4()), + name=f"pr-cluster-{uuid.uuid4().hex[:8]}", + engine_type="elasticsearch", + environment="dev", + base_url="http://stub:9200", + auth_kind="es_basic", + credentials_ref="ref", + ) + template = await repo.create_query_template( + db, + id=str(uuid.uuid4()), + name=f"pr-tmpl-{uuid.uuid4().hex[:8]}", + engine_type="elasticsearch", + body='{"query": {"match_all": {}}}', + declared_params={}, + version=1, + ) + query_set = await repo.create_query_set( + db, + id=str(uuid.uuid4()), + name=f"pr-qs-{uuid.uuid4().hex[:8]}", + cluster_id=cluster.id, + ) + query_ids: list[str] = [] + for i in range(n_queries): + qid = str(uuid.uuid4()) + await repo.create_query( + db, + id=qid, + query_set_id=query_set.id, + query_text=f"sample query {i}", + ) + query_ids.append(qid) + jl = await repo.create_judgment_list( + db, + id=str(uuid.uuid4()), + name=f"pr-jl-{uuid.uuid4().hex[:8]}", + description=None, + query_set_id=query_set.id, + cluster_id=cluster.id, + target="stub-index", + current_template_id=template.id, + rubric="r", + status="complete", + ) + study_id = str(uuid.uuid4()) + await repo.create_study( + db, + id=study_id, + name="pr-confidence-study", + cluster_id=cluster.id, + target="stub-index", + template_id=template.id, + query_set_id=query_set.id, + judgment_list_id=jl.id, + search_space={}, + objective={"metric": "ndcg", "k": 10, "direction": "maximize"}, + config={"max_trials": n_total_trials}, + status="completed", + failed_reason=None, + optuna_study_name=study_id, + baseline_metric=None, + best_metric=0.840, + best_trial_id=None, + ) + # Winner trial — high per-query metrics + 1 designed regressor. + winner_per_query = { + qid: {"ndcg": 0.85 - (0.01 * i) if i != 0 else 0.40} for i, qid in enumerate(query_ids) + } + # Runner-up trial — qid 0 scored higher (so winner regresses on it). 
+ runner_up_per_query = { + qid: {"ndcg": 0.95 if i == 0 else 0.84 - (0.01 * i)} for i, qid in enumerate(query_ids) + } + # Trial 0 = winner. + winner_trial = await repo.create_trial( + db, + id=str(uuid.uuid4()), + study_id=study_id, + optuna_trial_number=0, + status="complete", + params={}, + metrics={}, + primary_metric=0.840, + per_query_metrics=winner_per_query, + started_at=datetime.now(UTC), + ended_at=datetime.now(UTC), + duration_ms=100, + ) + # Trial 1 = runner-up. + await repo.create_trial( + db, + id=str(uuid.uuid4()), + study_id=study_id, + optuna_trial_number=1, + status="complete", + params={}, + metrics={}, + primary_metric=0.830, + per_query_metrics=runner_up_per_query, + started_at=datetime.now(UTC), + ended_at=datetime.now(UTC), + duration_ms=100, + ) + # Remaining filler trials with monotonically decreasing primary_metric + # so noise-floor + convergence + runner-up gap signals all populate. + for i in range(2, n_total_trials): + await repo.create_trial( + db, + id=str(uuid.uuid4()), + study_id=study_id, + optuna_trial_number=i, + status="complete", + params={}, + metrics={}, + primary_metric=0.83 - (0.01 * i), + started_at=datetime.now(UTC), + ended_at=datetime.now(UTC), + duration_ms=100, + ) + # Patch best_trial_id post-trial-create. + from backend.app.db.models import Study as _Study + + study_row = await db.get(_Study, study_id) + assert study_row is not None + study_row.best_trial_id = winner_trial.id + await db.flush() + await db.commit() + return { + "study_id": study_id, + "cluster_id": cluster.id, + "winner_trial_id": winner_trial.id, + "regressing_qid": query_ids[0], + } + + +async def test_open_pr_worker_plumbing_renders_confidence_section() -> None: + """End-to-end: seed → fetch_study_confidence → renderer outputs ## Confidence. + + Mirrors the production call-site logic in + :func:`backend.workers.git_pr.open_pr` (lines ~898-915) for the + study-backed branch: fetch the study, fetch confidence, render the + body. 
Bypasses the lock + clone + push + GitHub POST steps because + those are covered by the existing ``feat_github_pr_worker`` suite — + Story 1.5's FR-5d is specifically about the data-plumbing slice. + """ + ctx = await _seed_completed_study_with_per_query_metrics() + factory = get_session_factory() + async with factory() as db: + study = await repo.get_study(db, ctx["study_id"]) + assert study is not None + confidence = await fetch_study_confidence(db, study) + + assert confidence is not None + # Confidence assembled from the seed: 8 queries → bootstrap CI populates; + # ≥10 complete trials → late-trial 1σ populates; ≥3 → convergence; ≥2 + # → runner-up gap; both winner + runner-up have per_query → outcomes. + assert confidence.ci_95 is not None + assert confidence.runner_up_gap is not None + assert confidence.late_trial_stddev is not None + assert confidence.convergence is not None + assert confidence.per_query_outcomes is not None + assert confidence.per_query_outcomes.regressed >= 1 + # The designed regressor (qid 0) must appear among top_regressors. + regressor_qids = {row.query_id for row in confidence.per_query_outcomes.top_regressors} + assert ctx["regressing_qid"] in regressor_qids + + # Now run the real renderer with the real study object + the real + # confidence shape. Mirrors what the production worker does at + # git_pr.py:904. + proposal = SimpleNamespace( + metric_delta={ + "ndcg@10": {"baseline": 0.612, "achieved": 0.840, "delta_pct": 37.3}, + }, + config_diff={"k1": {"from": 1.2, "to": 1.4}}, + ) + digest = SimpleNamespace(suggested_followups=["Try BM25 k1=1.4"]) + body = _render_pr_body_study_backed( + proposal=proposal, + study=study, + digest=digest, + config_diff=proposal.config_diff, + chart_md="", + base_url="https://relyloop.acme.internal", + confidence=confidence, + ) + + # The ## Confidence section landed between ## Metric delta and ## Config diff. 
+ metric_idx = body.index("## Metric delta") + conf_idx = body.index("## Confidence") + config_idx = body.index("## Config diff") + assert metric_idx < conf_idx < config_idx + # CI line + N(queries) reflect the seeded data. + assert "95% CI" in body + assert "N=8 queries" in body + # Per-query line + regressor block populate. + assert "vs runner_up" in body + assert "Queries that regressed:" in body + # The named regressor's text appears verbatim. + assert "sample query 0" in body diff --git a/backend/tests/integration/test_trials_per_query_metrics_migration.py b/backend/tests/integration/test_trials_per_query_metrics_migration.py index b8415af9..692283bc 100644 --- a/backend/tests/integration/test_trials_per_query_metrics_migration.py +++ b/backend/tests/integration/test_trials_per_query_metrics_migration.py @@ -231,7 +231,7 @@ def test_check_constraint_rejects_non_object_jsonb(self, restore_head: None) -> text( "INSERT INTO judgment_lists (id, name, query_set_id, " "cluster_id, target, rubric, status) VALUES " - "(:id, :name, :qs, :cid, 'idx', 'r', 'ready')" + "(:id, :name, :qs, :cid, 'idx', 'r', 'complete')" ), { "id": jl_id, @@ -270,14 +270,13 @@ def test_check_constraint_rejects_non_object_jsonb(self, restore_head: None) -> "INSERT INTO trials (id, study_id, optuna_trial_number, " "params, metrics, status, per_query_metrics) VALUES " "(:id, :sid, 0, :params, :metrics, 'complete', " - ":pq::jsonb)" + "'[]'::jsonb)" ), { "id": trial_id, "sid": study_id, "params": json.dumps({}), "metrics": json.dumps({"ndcg": 0.5}), - "pq": "[]", }, ) diff --git a/backend/workers/git_pr.py b/backend/workers/git_pr.py index 91c08972..46ddcc59 100644 --- a/backend/workers/git_pr.py +++ b/backend/workers/git_pr.py @@ -85,7 +85,9 @@ validate_config_path, validate_repo_url, ) +from backend.app.domain.study.confidence import ConfidenceShape from backend.app.git import HTTP_TIMEOUT_S, github_request +from backend.app.services.study_confidence import fetch_study_confidence logger = 
structlog.get_logger(__name__) @@ -485,6 +487,52 @@ def _render_chart_png(parameter_importance: dict[str, float], target_path: Path) plt.close(fig) +def _render_confidence_section(confidence: ConfidenceShape) -> list[str]: + """Render the ``## Confidence`` section as a list of markdown lines. + + Each sub-block (CI line, per-query block, named regressors, runner-up + gap, late-trial 1σ, convergence) is independently gated on its + sub-field being non-null (FR-7 / AC-3 partial-render path). Includes + a trailing blank line for cleanly separating from the next section. + """ + lines: list[str] = ["## Confidence"] + headline = confidence.headline + metric_label = f"{headline.metric}@{headline.k}" if headline.k is not None else headline.metric + if confidence.ci_95 is not None: + ci = confidence.ci_95 + n_q = headline.n_queries if headline.n_queries is not None else ci.n_samples + lines.append( + f"- {metric_label}: {headline.value:.3f} " + f"(95% CI {ci.low:.3f}-{ci.high:.3f}, N={n_q} queries)" + ) + if confidence.per_query_outcomes is not None: + outcomes = confidence.per_query_outcomes + lines.append( + f"- Queries: {outcomes.improved} improved · " + f"{outcomes.unchanged} unchanged · " + f"{outcomes.regressed} regressed (vs {outcomes.comparison_against})" + ) + if outcomes.regressed > 0 and outcomes.top_regressors: + regressor_chunks = [ + f"`{row.query_text}` ({row.comparison_score:.3f} → {row.winner_score:.3f})" + for row in outcomes.top_regressors + ] + lines.append("- Queries that regressed: " + " · ".join(regressor_chunks)) + if confidence.runner_up_gap is not None: + gap = confidence.runner_up_gap + lines.append(f"- Runner-up gap {gap.value:.3f} ({gap.classification})") + if confidence.late_trial_stddev is not None: + lines.append(f"- Late-trial 1σ = {confidence.late_trial_stddev.value:.3f}") + if confidence.convergence is not None: + conv = confidence.convergence + lines.append( + f"- Convergence: {conv.regime} " + f"(best at trial {conv.best_at_trial} of 
{conv.total_trials})" + ) + lines.append("") + return lines + + def _render_pr_body_study_backed( *, proposal: Any, @@ -493,8 +541,16 @@ def _render_pr_body_study_backed( config_diff: dict[str, Any], chart_md: str, base_url: str | None, + confidence: ConfidenceShape | None = None, ) -> str: - """Markdown body for a study-backed proposal.""" + """Markdown body for a study-backed proposal. + + The optional ``confidence`` shape (feat_pr_metric_confidence FR-5b) + drives an additional ``## Confidence`` section between ``## Metric + delta`` and ``## Config diff``. When ``confidence is None`` the + section is omitted entirely (AC-12). When sub-fields are null they + are skipped individually (FR-7 / AC-3 partial-render path). + """ lines: list[str] = ["# RelyLoop proposal", ""] lines.append(f"**Study:** {study.name} (`{study.id}`)") if base_url: @@ -509,6 +565,8 @@ def _render_pr_body_study_backed( pct_str = f" ({pct:+.1f}%)" if pct is not None else "" lines.append(f"- `{metric}`: {baseline} → {achieved}{pct_str}") lines.append("") + if confidence is not None: + lines.extend(_render_confidence_section(confidence)) lines.append("## Config diff") lines.append("") lines.append("| Param | From | To |") @@ -901,6 +959,10 @@ async def _do_open_pr( # noqa: PLR0915, PLR0912, C901 — the worker contract i chart_md = "" if chart_render_failed and digest is not None: chart_md = _render_chart_markdown_fallback(digest.parameter_importance) + # feat_pr_metric_confidence Story 1.5 (FR-5d): fetch per-study + # confidence analytics before rendering so the body carries the + # ## Confidence section. 
+ confidence = await fetch_study_confidence(db, study) if study is not None else None body = _render_pr_body_study_backed( proposal=proposal, study=study, @@ -908,6 +970,7 @@ async def _do_open_pr( # noqa: PLR0915, PLR0912, C901 — the worker contract i config_diff=proposal.config_diff, chart_md=chart_md, base_url=settings.relyloop_base_url, + confidence=confidence, ) study_name = study.name if study is not None else proposal.study_id title = f"RelyLoop: {study_name}" diff --git a/docs/00_overview/DASHBOARD.md b/docs/00_overview/DASHBOARD.md index 180f599f..6e598ca7 100644 --- a/docs/00_overview/DASHBOARD.md +++ b/docs/00_overview/DASHBOARD.md @@ -6,7 +6,7 @@ _Top-level index across MVP1 → GA v1+ as of **2026-05-21**. Click a release na | Release | Theme | Progress | Status | |---|---|---|---| -| [MVP1 / v0.1](MVP1_DASHBOARD.md) | The Loop | 56 / 57 scoped done · 3 remaining | **In progress** | +| [MVP1 / v0.1](MVP1_DASHBOARD.md) | The Loop | 56 / 57 scoped done · 4 remaining | **In progress** | | [MVP2 / v0.2](MVP2_DASHBOARD.md) | Observable | 1 / 1 scoped done · 1 remaining | **In progress** | | MVP3 / v0.3 | Production Stacks | — | **Not yet scoped** | | MVP4 / v0.4 | Multi-tenant, Multi-LLM | — | **Not yet scoped** | diff --git a/docs/00_overview/MVP1_DASHBOARD.md b/docs/00_overview/MVP1_DASHBOARD.md index 86ca2f00..e0ead940 100644 --- a/docs/00_overview/MVP1_DASHBOARD.md +++ b/docs/00_overview/MVP1_DASHBOARD.md @@ -21,8 +21,8 @@ Plan approved; run /impl-execute to ship | Metric | Value | |---|---| | Scoped items done | **56 / 57** (98%) — feat_/infra_/chore_/epic_ past idea stage | -| Path to MVP1 | **3** items remaining (features + bugs + chores) | -| Open bugs | 0 | +| Path to MVP1 | **4** items remaining (features + bugs + chores) | +| Open bugs | 1 | | Open chores | 2 (idea-stage debt) | | Backlog ideas | 4 idea-only feat/infra (not yet scoped into MVP1) | | In flight | 0 feature(s) actively shipping | @@ -116,7 +116,7 @@ _None._ _None._ -### Idea (6) +### 
Idea (7) | Feature | Type | One-liner | Depends on | Status | |---|---|---|---|---| @@ -126,6 +126,7 @@ _None._ | [feat_study_clone_from_previous](../02_product/planned_features/feat_study_clone_from_previous/idea.md) | Feature | A relevance engineer's normal workflow after the first study completes: | — | Idea — surfaced during a UX review of parameter-tuning ergonomics on 2026-05-19. | | [chore_study_default_stop_conditions](../02_product/planned_features/chore_study_default_stop_conditions/idea.md) | Chore | The server-side `StudyConfigSpec` validator at [`backend/app/api/v1/schemas.py:572-580`](../../backend/app/api/v1/schemas.py) correctly **requires** at least one of `max_trials` or `time_budget_min` — | — | Idea — surfaced during the 2026-05-21 Karpathy-loop audit; recommendation grounded in measured per-trial cost from the local dev DB. | | [chore_template_library_expansion](../02_product/planned_features/chore_template_library_expansion/idea.md) | Chore | Three connected gaps: | — | Idea — surfaced during a UX review of parameter-tuning ergonomics on 2026-05-19. | +| [bug_confidence_per_query_metric_key_drift](../02_product/planned_features/bug_confidence_per_query_metric_key_drift/idea.md) | Bug | The two implementations disagree on the shape of `trials.per_query_metrics`: | — | Idea (filed 2026-05-21; surfaced by Story 1.5 of `feat_pr_metric_confidence`) | ## Dependency graph diff --git a/docs/00_overview/dashboard.html b/docs/00_overview/dashboard.html index 1d517edb..f163b2c6 100644 --- a/docs/00_overview/dashboard.html +++ b/docs/00_overview/dashboard.html @@ -371,7 +371,7 @@

Releases

The Loop
-
56 / 57 scoped done · 3 remaining
+
56 / 57 scoped done · 4 remaining
In progress
diff --git a/docs/00_overview/mvp1_dashboard.html b/docs/00_overview/mvp1_dashboard.html index 53d2aa8d..0190f864 100644 --- a/docs/00_overview/mvp1_dashboard.html +++ b/docs/00_overview/mvp1_dashboard.html @@ -390,12 +390,12 @@

MVP1 Progress

Path to MVP1
-
3
+
4
items left = features + bugs + chores
-
+
Open bugs
-
0
+
1
tracked bug_* idea files
@@ -428,7 +428,7 @@

Pipeline

-

Idea 6

+

Idea 7

@@ -499,6 +499,18 @@

Idea 6

Three connected gaps:
+
+ + +
+ +
+ Bug + +
+
The two implementations disagree on the shape of `trials.per_query_metrics`:
+ +
diff --git a/docs/02_product/planned_features/bug_confidence_per_query_metric_key_drift/idea.md b/docs/02_product/planned_features/bug_confidence_per_query_metric_key_drift/idea.md new file mode 100644 index 00000000..c51a06c0 --- /dev/null +++ b/docs/02_product/planned_features/bug_confidence_per_query_metric_key_drift/idea.md @@ -0,0 +1,53 @@ +# Bug — `compute_study_confidence` uses bare metric name to look up per-query metrics whose keys are `@k`-suffixed + +**Status:** Idea (filed 2026-05-21; surfaced by Story 1.5 of `feat_pr_metric_confidence`) + +**Origin:** Caught while running the full integration suite during Story 1.5 (`/impl-execute` of `feat_pr_metric_confidence`). The Story 1.2 test [`backend/tests/integration/test_run_trial_per_query_persistence.py::test_successful_trial_writes_per_query_metrics`](../../../../backend/tests/integration/test_run_trial_per_query_persistence.py) fails with `unexpected metric key 'map@10' in per_query_metrics[...]; score() should remap pytrec_eval wire names to ['map', 'mrr', 'ndcg', 'precision', 'recall']`. The test asserted bare metric base names; the actual `score()` output uses `@k`-suffixed user-facing names. 
+ +## Problem + +The two implementations disagree on the shape of `trials.per_query_metrics`: + +| Surface | What it produces / expects | +|---|---| +| [`backend/app/eval/scoring.py:179-184`](../../../../backend/app/eval/scoring.py#L179) `score()` (Story 1.1+) | `per_query[qid] = {ndcg@10: float, map@10: float, mrr: float, ...}` — user-facing tokens preserved including `@` cutoff | +| [`backend/workers/trials.py`](../../../../backend/workers/trials.py) Story 1.2 worker | Writes `scored["per_query"]` verbatim → DB has `@k`-suffixed keys | +| [`backend/app/domain/study/confidence.py:537-540`](../../../../backend/app/domain/study/confidence.py#L537) `compute_study_confidence` (Story 1.3) | Uses `metric = study_objective.get("metric")` (e.g., `"ndcg"`, bare) to look up `v[metric]` on each per_query dict — **misses** because the key is `ndcg@10`, not `ndcg` | +| [`backend/tests/integration/test_run_trial_per_query_persistence.py:100-115`](../../../../backend/tests/integration/test_run_trial_per_query_persistence.py#L100) Story 1.2 test | Asserts `metric_key in {"ndcg", "map", "precision", "recall", "mrr"}` (bare) — **fails** because the worker writes `@k`-suffixed keys | +| [`feature_spec.md`](../../../00_overview/implemented_features/2026_05_21_feat_pr_metric_confidence/feature_spec.md) AC-1 / AC-10 (when this ships there) | Examples show bare keys (`{qid: {ndcg: 0.84, map: 0.7, ...}}`) — agrees with Story 1.3's bare-key lookup but disagrees with the worker reality | + +**Production impact:** For every real completed study with `per_query_metrics` populated, the orchestrator runs `metric not in v` → True → `winner_values_for_metric = []` → `bootstrap_ci_95` returns `None` (N<5) → CI suppressed. Likewise `compute_outcome_summary` produces empty intersection → `per_query_outcomes = None`. Operators see the partial / aggregate-only confidence shape on every PR — the feature's headline value (CI band, named regressors) silently never renders against real data. 
+ +**Why this slipped past Story 1.4's integration tests:** [`backend/tests/integration/test_studies_api_confidence.py`](../../../../backend/tests/integration/test_studies_api_confidence.py) seeds `per_query_metrics` directly with bare keys (`{ndcg: 0.41, ...}`) to match the spec literal. That matches the orchestrator's expectation, so the tests pass — but those keys aren't what the worker actually persists in production. Same for Story 1.5's integration test ([`test_open_pr_worker_confidence_plumbing.py`](../../../../backend/tests/integration/test_open_pr_worker_confidence_plumbing.py)). + +## Why deferred + +The fix forks into product-shaped questions that deserve their own design pass: + +1. **Where does the canonicalization happen?** + - **(a)** Worker remaps per_query keys to bare metric base names before persisting. Keeps Story 1.3 simple but drops `@k` info (problematic when a study computes both `ndcg@5` and `ndcg@10`). + - **(b)** Orchestrator uses `objective_metric_key(study.objective)` to compute the lookup key. Keeps all persisted info but requires Story 1.3's `compute_outcome_summary` to accept separate `metric_lookup_key` + `metric_threshold_key` args (the former drives `v[key]` lookups; the latter drives `REGRESSOR_THRESHOLDS.get(key)`). + - **(c)** Worker normalizes ONLY the primary metric to bare form and drops `@k` suffix; secondary metrics stay as `@k` for future analytics. Compromise that loses the cutoff for the primary metric only. + +2. **Spec patch required.** Whichever route lands, the spec's AC-1 / AC-10 examples need updating to reflect the chosen canonical form. The shipped spec at [`docs/00_overview/implemented_features/2026_05_21_feat_pr_metric_confidence/feature_spec.md`](../../../00_overview/implemented_features/2026_05_21_feat_pr_metric_confidence/feature_spec.md) will get an erratum note. + +3. **Test refactor cost.** Both Story 1.4 (11 cases) and Story 1.5 (1 case) integration tests seed bare-key per_query data. 
They'll need a small refactor (~10 LOC) to use the chosen canonical form. + +The current `/impl-execute` of Story 1.5 was scoped to PR body plumbing. Bundling an interface change to Story 1.3's pure orchestrator would inflate scope past the rubric's "fix the work-type that fits this PR's intent" guidance, so capturing here is the right call. + +## Surface + +- **Backend:** `backend/app/domain/study/confidence.py` (Story 1.3 orchestrator), `backend/workers/trials.py` (Story 1.2 worker — only if route (a) or (c) chosen). +- **Tests:** `backend/tests/integration/test_run_trial_per_query_persistence.py` (relax / correct assertion), `backend/tests/integration/test_studies_api_confidence.py` (11 cases), `backend/tests/integration/test_open_pr_worker_confidence_plumbing.py` (1 case), `backend/tests/unit/domain/study/test_confidence.py` (25+ cases may need shape updates). +- **Docs:** `feature_spec.md` AC-1 / AC-10 erratum. +- **No new endpoints, no migration, no frontend impact.** + +## Acceptance signal + +- `make test-integration` runs cleanly against the live in-container Postgres with no skipped Story 1.1 / 1.2 / 1.4 / 1.5 cases left failing. +- A seeded real-worker run (winner trial inserted via `run_trial` Arq job, not direct SQL) drives `compute_study_confidence` to a fully-populated `ConfidenceShape` (CI band + per_query_outcomes both non-null when seed data warrants). +- Spec's AC-1 / AC-10 example shape matches the persisted DB shape verbatim. + +## Related work + +- Bundled `Story 1.5` commit `` fixes the two mechanical pre-existing test failures (`test_migrations.py` head `0014`→`0015`; `test_trials_per_query_metrics_migration.py` invalid `judgment_lists.status='ready'`→`'complete'`) — those were not part of this drift but were uncovered in the same full-suite run. See the impl-execute transcript for the rubric reasoning (mechanical assertion updates qualify as inline fixes; this product-shaped key drift does not). 
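A minimal sketch of the drift this idea file describes, assuming a study objective of `ndcg` with cutoff 10. The helper name `values_for_key` and the sample scores are hypothetical and exist only to illustrate the lookup; this is not the code in `confidence.py`.

```python
# Hypothetical illustration of the key drift; sample data, not real study output.


def values_for_key(per_query: dict, key: str) -> list:
    """Collect one metric's per-query values, skipping queries that lack the key."""
    return [float(v[key]) for v in per_query.values() if isinstance(v, dict) and key in v]


# Shape the worker persists: user-facing tokens with the @ cutoff preserved.
persisted = {
    "q1": {"ndcg@10": 0.84, "map@10": 0.70, "mrr": 0.90},
    "q2": {"ndcg@10": 0.41, "map@10": 0.35, "mrr": 0.50},
}

bare_lookup = values_for_key(persisted, "ndcg")          # bare key misses every query
suffixed_lookup = values_for_key(persisted, "ndcg@10")   # suffixed key finds both values

print(bare_lookup)      # []
print(suffixed_lookup)  # [0.84, 0.41]
```

With the bare key the collected list is empty, which corresponds to the N<5 path that suppresses the CI band in the production-impact paragraph above.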
diff --git a/docs/02_product/planned_features/feat_pr_metric_confidence/implementation_plan.md b/docs/02_product/planned_features/feat_pr_metric_confidence/implementation_plan.md index dfda795b..24f0758f 100644 --- a/docs/02_product/planned_features/feat_pr_metric_confidence/implementation_plan.md +++ b/docs/02_product/planned_features/feat_pr_metric_confidence/implementation_plan.md @@ -1054,7 +1054,7 @@ None planned. The feature is purely additive across all surfaces. - [ ] Story 1.2 — Persist `per_query_metrics` in `run_trial` - [ ] Story 1.3 — Domain module `confidence.py` - [x] Story 1.4 — `ConfidenceShape` + StudyDetail enrichment -- [ ] Story 1.5 — PR body section + worker plumbing +- [x] Story 1.5 — PR body section + worker plumbing - [ ] Story 1.6 — Digest narrative prompt extension - [ ] **Epic 1 gate** - [ ] Story 2.1 — TypeScript types + enums From cc1416457600075015e09ce4f1b794b4d64e7479 Mon Sep 17 00:00:00 2001 From: SoundMindsAI Date: Thu, 21 May 2026 12:12:58 -0400 Subject: [PATCH 10/17] fix(domain): align confidence per-query lookup with @k-suffixed score() output MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Production drift discovered during Story 1.5 full-suite verification: backend.app.eval.scoring.score() emits user-facing metric tokens with the @ cutoff preserved (ndcg@10, map@10, mrr, ...). Story 1.2's worker persists this verbatim, but Story 1.3's compute_study_confidence and Story 1.4/1.5's service helper both used the bare metric base name (`ndcg`) to index per-query dicts. Every real completed study would silently degrade to aggregate-only confidence (no CI band, no named regressors) — the feature's headline value. Fix: - backend/app/domain/study/confidence.py: orchestrator resolves the per-query lookup key via objective_metric_key(study_objective) (raises ValueError → graceful return None per FR-7). Passes the @k-suffixed key to compute_outcome_summary. 
- compute_outcome_summary strips any @ suffix before REGRESSOR_THRESHOLDS lookup (table is keyed by base names ndcg / map / …) so direct unit-test callers passing bare "ndcg" still work. - backend/app/services/study_confidence.py: Q4 candidate peek now resolves per_query_key via objective_metric_key, falls back to no Q4 (empty query_text_by_id) on malformed objective. - Story 1.2 test: relaxes the assertion to allow @k-suffixed keys (base name must still be in MetricCatalog). - Story 1.3 unit tests: orchestrator-level cases switch per_query fixtures from {ndcg: …} to {ndcg@10: …} to mirror what the worker actually persists. Direct compute_outcome_summary tests keep bare ndcg (the helper accepts either form via the partition split). - Story 1.4 integration tests: 11 cases re-keyed to @k form. - Story 1.5 integration test: re-keyed to @k form. - feature_spec.md AC-1 / AC-10: examples now show user-facing tokens (ndcg@10) and explicitly cite scoring.score() as the source of truth. Idea file docs/02_product/planned_features/bug_confidence_per_query_metric_key_drift/ deleted now that the bug is resolved inline. Verification: 1034 backend unit + 189 contract + 526/528 in-container integration tests pass (2 health-probe skips for the not-yet-running api at localhost:8000). Backend fmt + lint + typecheck + ruff format --check parity all clean. 
Co-Authored-By: Claude Opus 4.7 (1M context) --- backend/app/domain/study/confidence.py | 33 ++++++++++-- backend/app/services/study_confidence.py | 38 ++++++------- ...test_open_pr_worker_confidence_plumbing.py | 6 ++- .../test_run_trial_per_query_persistence.py | 15 ++++-- .../test_studies_api_confidence.py | 14 ++--- .../unit/domain/study/test_confidence.py | 14 +++-- docs/00_overview/DASHBOARD.md | 2 +- docs/00_overview/MVP1_DASHBOARD.md | 7 ++- docs/00_overview/dashboard.html | 2 +- docs/00_overview/mvp1_dashboard.html | 20 ++----- .../idea.md | 53 ------------------- .../feat_pr_metric_confidence/feature_spec.md | 10 ++-- 12 files changed, 94 insertions(+), 120 deletions(-) delete mode 100644 docs/02_product/planned_features/bug_confidence_per_query_metric_key_drift/idea.md diff --git a/backend/app/domain/study/confidence.py b/backend/app/domain/study/confidence.py index 9fa1c21f..1eb3992b 100644 --- a/backend/app/domain/study/confidence.py +++ b/backend/app/domain/study/confidence.py @@ -37,6 +37,8 @@ import numpy as np from pydantic import BaseModel +from backend.app.eval.scoring import objective_metric_key + # --------------------------------------------------------------------------- # Locked constants — every value is referenced from FR-4 / FR-4a. # Source of truth: feature_spec.md §19 Decision log (feat_pr_metric_confidence). @@ -389,7 +391,14 @@ def compute_outcome_summary( Improved/unchanged/regressed buckets use the FR-4a per-metric threshold table. Returned candidates are sorted by ``abs(delta)`` descending, capped at ``TOP_REGRESSORS_CAP``. Returns ``None`` when either input - dict is empty or ``metric`` is not in :data:`REGRESSOR_THRESHOLDS`. + dict is empty or ``metric``'s base name is not in + :data:`REGRESSOR_THRESHOLDS`. + + ``metric`` is the per-query lookup key as persisted by the worker + (``backend.app.eval.scoring.score`` writes user-facing tokens — e.g. + ``"ndcg@10"``, ``"map@10"``, ``"map"``, ``"mrr"``). 
The threshold + table is keyed by metric *base* names (``ndcg``, ``map``, etc.), so + the helper strips any ``@`` suffix before the lookup. Cycle-1 GPT-5.5 F7 fix: this helper does NOT take ``query_text_by_id`` — candidates carry only ``query_id``. The orchestrator runs Q4 of the @@ -398,7 +407,9 @@ def compute_outcome_summary( """ if not winner_per_query or not comparison_per_query: return None - threshold = REGRESSOR_THRESHOLDS.get(metric) + # Strip any @ suffix so "ndcg@10" → "ndcg", "map@10" → "map", and + # bare "mrr" / "map" / "ndcg" still work. See REGRESSOR_THRESHOLDS keys. + threshold = REGRESSOR_THRESHOLDS.get(metric.partition("@")[0]) if threshold is None: return None @@ -525,6 +536,18 @@ def compute_study_confidence( if k is not None and not isinstance(k, int): k = None + # Compute the per-query lookup key. The worker persists what + # :func:`backend.app.eval.scoring.score` emits — user-facing tokens + # like ``ndcg@10``, ``map@10``, ``map``, ``mrr``. Bug captured in + # ``bug_confidence_per_query_metric_key_drift`` and fixed inline on + # ``feat_pr_metric_confidence``. + try: + per_query_key = objective_metric_key(study_objective) + except ValueError: + # Malformed objective (missing required k, unsupported metric, …): + # graceful degrade to whole-object None per FR-7 invariant. + return None + # Headline value comes from study.best_metric (denormalized winner # primary_metric); the n_queries comes from the winner's per_query # dict when present. 
@@ -535,7 +558,9 @@ def compute_study_confidence( ) winner_per_query = winner_trial.per_query_metrics or {} winner_values_for_metric = [ - float(v[metric]) for v in winner_per_query.values() if isinstance(v, dict) and metric in v + float(v[per_query_key]) + for v in winner_per_query.values() + if isinstance(v, dict) and per_query_key in v ] n_queries: int | None = len(winner_values_for_metric) if winner_per_query else None @@ -574,7 +599,7 @@ def compute_study_confidence( outcome = compute_outcome_summary( winner_per_query=winner_per_query, comparison_per_query=runner_up_trial.per_query_metrics, - metric=metric, + metric=per_query_key, ) if outcome is not None: regressor_rows = build_regressor_rows( diff --git a/backend/app/services/study_confidence.py b/backend/app/services/study_confidence.py index 68d45bce..0aefa241 100644 --- a/backend/app/services/study_confidence.py +++ b/backend/app/services/study_confidence.py @@ -27,6 +27,7 @@ compute_outcome_summary, compute_study_confidence, ) +from backend.app.eval.scoring import objective_metric_key async def fetch_study_confidence(db: AsyncSession, study: Study) -> ConfidenceShape | None: @@ -74,26 +75,27 @@ async def fetch_study_confidence(db: AsyncSession, study: Study) -> ConfidenceSh # Q4 (conditional): query_text for regressor candidates. # The pure orchestrator runs compute_outcome_summary again internally — # the second call is cheap (dict-key iteration on ≤100 queries) and keeps - # the pure-helper contract clean for unit tests. + # the pure-helper contract clean for unit tests. The per-query lookup key + # must match what backend.app.eval.scoring.score persists (user-facing + # @-suffixed tokens), not the bare metric base name. 
query_text_by_id: dict[str, str] = {} study_objective = study.objective if isinstance(study.objective, dict) else {} - metric = study_objective.get("metric") - if ( - isinstance(metric, str) - and runner_up is not None - and winner.per_query_metrics - and runner_up.per_query_metrics - ): - outcome = compute_outcome_summary( - winner_per_query=winner.per_query_metrics, - comparison_per_query=runner_up.per_query_metrics, - metric=metric, - ) - if outcome is not None and outcome.regressor_candidates: - qids = [qid for (qid, *_) in outcome.regressor_candidates] - q_stmt = select(Query.id, Query.query_text).where(Query.id.in_(qids)) - for qid, qtext in (await db.execute(q_stmt)).all(): - query_text_by_id[qid] = qtext + if runner_up is not None and winner.per_query_metrics and runner_up.per_query_metrics: + try: + per_query_key = objective_metric_key(study_objective) + except ValueError: + per_query_key = None + if per_query_key is not None: + outcome = compute_outcome_summary( + winner_per_query=winner.per_query_metrics, + comparison_per_query=runner_up.per_query_metrics, + metric=per_query_key, + ) + if outcome is not None and outcome.regressor_candidates: + qids = [qid for (qid, *_) in outcome.regressor_candidates] + q_stmt = select(Query.id, Query.query_text).where(Query.id.in_(qids)) + for qid, qtext in (await db.execute(q_stmt)).all(): + query_text_by_id[qid] = qtext return compute_study_confidence( study_objective=study_objective, diff --git a/backend/tests/integration/test_open_pr_worker_confidence_plumbing.py b/backend/tests/integration/test_open_pr_worker_confidence_plumbing.py index 7bfdab6a..cbc9fe2d 100644 --- a/backend/tests/integration/test_open_pr_worker_confidence_plumbing.py +++ b/backend/tests/integration/test_open_pr_worker_confidence_plumbing.py @@ -118,11 +118,13 @@ async def _seed_completed_study_with_per_query_metrics( ) # Winner trial — high per-query metrics + 1 designed regressor. 
winner_per_query = { - qid: {"ndcg": 0.85 - (0.01 * i) if i != 0 else 0.40} for i, qid in enumerate(query_ids) + qid: {"ndcg@10": 0.85 - (0.01 * i) if i != 0 else 0.40} + for i, qid in enumerate(query_ids) } # Runner-up trial — qid 0 scored higher (so winner regresses on it). runner_up_per_query = { - qid: {"ndcg": 0.95 if i == 0 else 0.84 - (0.01 * i)} for i, qid in enumerate(query_ids) + qid: {"ndcg@10": 0.95 if i == 0 else 0.84 - (0.01 * i)} + for i, qid in enumerate(query_ids) } # Trial 0 = winner. winner_trial = await repo.create_trial( diff --git a/backend/tests/integration/test_run_trial_per_query_persistence.py b/backend/tests/integration/test_run_trial_per_query_persistence.py index 3f43748a..3d909830 100644 --- a/backend/tests/integration/test_run_trial_per_query_persistence.py +++ b/backend/tests/integration/test_run_trial_per_query_persistence.py @@ -96,8 +96,12 @@ async def test_successful_trial_writes_per_query_metrics( f"got={persisted_qids}, expected={expected_qids}" ) - # Every value is a dict keyed by user-facing metric names. - expected_metric_keys = {"ndcg", "map", "precision", "recall", "mrr"} + # Every value is a dict keyed by user-facing metric tokens. The score() + # function emits user-facing tokens with the @ cutoff preserved for + # metrics that take a cutoff (ndcg@10, map@10, precision@10, recall@10) + # and bare names for cutoff-free metrics (mrr, plain map). The base + # name (everything before any @) must be in MetricCatalog. + expected_metric_bases = {"ndcg", "map", "precision", "recall", "mrr"} for qid, per_metric in t.per_query_metrics.items(): assert isinstance(per_metric, dict), ( f"per_query_metrics[{qid}] must be a dict, got {type(per_metric)}" @@ -108,10 +112,11 @@ async def test_successful_trial_writes_per_query_metrics( # keys leak through (e.g., "ndcg_cut.10", "P_10"). 
assert per_metric, f"per_query_metrics[{qid}] is empty" for metric_key in per_metric: - assert metric_key in expected_metric_keys, ( + base = metric_key.partition("@")[0] + assert base in expected_metric_bases, ( f"unexpected metric key {metric_key!r} in per_query_metrics[{qid}]; " - f"score() should remap pytrec_eval wire names to " - f"{sorted(expected_metric_keys)}" + f"base name {base!r} not in {sorted(expected_metric_bases)} — " + f"score() should remap pytrec_eval wire names to user-facing tokens" ) assert isinstance(per_metric[metric_key], (int, float)), ( f"per_query_metrics[{qid}][{metric_key}] must be numeric, " diff --git a/backend/tests/integration/test_studies_api_confidence.py b/backend/tests/integration/test_studies_api_confidence.py index b43f71bb..d5a03ae3 100644 --- a/backend/tests/integration/test_studies_api_confidence.py +++ b/backend/tests/integration/test_studies_api_confidence.py @@ -248,7 +248,7 @@ async def test_ac4_bootstrap_ci_is_reproducible_across_calls( # Winner: per_query_metrics carries an ndcg value for each of 20 queries. # Use deterministic floats spread across [0.6, 0.95] for a non-degenerate CI. winner_per_query = { - qid: {"ndcg": 0.6 + (i * 0.018), "map": 0.5, "precision": 0.5, "recall": 0.5, "mrr": 0.5} + qid: {"ndcg@10": 0.6 + (i * 0.018), "map": 0.5, "precision": 0.5, "recall": 0.5, "mrr": 0.5} for i, qid in enumerate(qids) } # Need ≥10 trials so all aggregate signals populate. @@ -442,12 +442,12 @@ async def test_ac10_per_query_regressor_includes_query_text( # Winner: qA scored 0.41 (will regress vs runner-up's 0.92); # qB scored 0.85 (unchanged vs runner-up's 0.85). 
winner_per_query = { - qA: {"ndcg": 0.41}, - qB: {"ndcg": 0.85}, + qA: {"ndcg@10": 0.41}, + qB: {"ndcg@10": 0.85}, } runner_up_per_query = { - qA: {"ndcg": 0.92}, - qB: {"ndcg": 0.85}, + qA: {"ndcg@10": 0.92}, + qB: {"ndcg@10": 0.85}, } winner_id = await _insert_trial( study_id=ctx["study_id"], @@ -490,7 +490,7 @@ async def test_ac15_bootstrap_ci_null_when_fewer_than_five_queries( ) -> None: ctx = await _seed_study(best_metric=0.8, seed_queries=4) qids = ctx["query_ids"] - winner_per_query = {qid: {"ndcg": 0.7 + i * 0.02} for i, qid in enumerate(qids)} + winner_per_query = {qid: {"ndcg@10": 0.7 + i * 0.02} for i, qid in enumerate(qids)} winner_id = await _insert_trial( study_id=ctx["study_id"], optuna_trial_number=0, @@ -519,7 +519,7 @@ async def test_ac16_single_complete_trial_suppresses_runner_up_signals( """Only 1 complete trial → per_query_outcomes + runner_up_gap null; CI still populates.""" ctx = await _seed_study(best_metric=0.8, seed_queries=6) qids = ctx["query_ids"] - winner_per_query = {qid: {"ndcg": 0.7 + i * 0.02} for i, qid in enumerate(qids)} + winner_per_query = {qid: {"ndcg@10": 0.7 + i * 0.02} for i, qid in enumerate(qids)} winner_id = await _insert_trial( study_id=ctx["study_id"], optuna_trial_number=0, diff --git a/backend/tests/unit/domain/study/test_confidence.py b/backend/tests/unit/domain/study/test_confidence.py index 2f234cb0..3faf7bc3 100644 --- a/backend/tests/unit/domain/study/test_confidence.py +++ b/backend/tests/unit/domain/study/test_confidence.py @@ -379,15 +379,21 @@ def test_partial_shape_when_per_query_metrics_null(self) -> None: def test_full_shape_with_all_data(self) -> None: """All sub-fields populated when winner + runner-up both have - per_query_metrics, ≥10 complete trials, ≥5 queries.""" - per_query = {f"q{i}": {"ndcg": 0.8 + 0.01 * i} for i in range(10)} + per_query_metrics, ≥10 complete trials, ≥5 queries. 
+ + Per-query data uses ``ndcg@10`` to match what + :func:`backend.app.eval.scoring.score` actually persists for a + study with ``objective={metric: ndcg, k: 10}`` — the orchestrator + resolves the lookup key via ``objective_metric_key``. + """ + per_query = {f"q{i}": {"ndcg@10": 0.8 + 0.01 * i} for i in range(10)} winner = _trial( optuna_trial_number=0, # winner appears in summary keys primary_metric=0.85, per_query_metrics=per_query, ) # Runner-up's per_query is shifted slightly so most queries improved. - runner_up_pq = {f"q{i}": {"ndcg": 0.7 + 0.01 * i} for i in range(10)} + runner_up_pq = {f"q{i}": {"ndcg@10": 0.7 + 0.01 * i} for i in range(10)} runner_up = _trial( optuna_trial_number=10, primary_metric=0.75, @@ -423,7 +429,7 @@ def test_ci_95_independent_of_runner_up_per_query(self) -> None: """AC-16: 1-complete-trial case — winner has per_query but no runner-up → ci_95 + headline.n_queries populate from winner alone, per_query_outcomes + runner_up_gap suppressed.""" - per_query = {f"q{i}": {"ndcg": 0.8 + 0.01 * i} for i in range(10)} + per_query = {f"q{i}": {"ndcg@10": 0.8 + 0.01 * i} for i in range(10)} winner = _trial( optuna_trial_number=0, primary_metric=0.85, diff --git a/docs/00_overview/DASHBOARD.md b/docs/00_overview/DASHBOARD.md index 6e598ca7..180f599f 100644 --- a/docs/00_overview/DASHBOARD.md +++ b/docs/00_overview/DASHBOARD.md @@ -6,7 +6,7 @@ _Top-level index across MVP1 → GA v1+ as of **2026-05-21**. 
Click a release na | Release | Theme | Progress | Status | |---|---|---|---| -| [MVP1 / v0.1](MVP1_DASHBOARD.md) | The Loop | 56 / 57 scoped done · 4 remaining | **In progress** | +| [MVP1 / v0.1](MVP1_DASHBOARD.md) | The Loop | 56 / 57 scoped done · 3 remaining | **In progress** | | [MVP2 / v0.2](MVP2_DASHBOARD.md) | Observable | 1 / 1 scoped done · 1 remaining | **In progress** | | MVP3 / v0.3 | Production Stacks | — | **Not yet scoped** | | MVP4 / v0.4 | Multi-tenant, Multi-LLM | — | **Not yet scoped** | diff --git a/docs/00_overview/MVP1_DASHBOARD.md b/docs/00_overview/MVP1_DASHBOARD.md index e0ead940..86ca2f00 100644 --- a/docs/00_overview/MVP1_DASHBOARD.md +++ b/docs/00_overview/MVP1_DASHBOARD.md @@ -21,8 +21,8 @@ Plan approved; run /impl-execute to ship | Metric | Value | |---|---| | Scoped items done | **56 / 57** (98%) — feat_/infra_/chore_/epic_ past idea stage | -| Path to MVP1 | **4** items remaining (features + bugs + chores) | -| Open bugs | 1 | +| Path to MVP1 | **3** items remaining (features + bugs + chores) | +| Open bugs | 0 | | Open chores | 2 (idea-stage debt) | | Backlog ideas | 4 idea-only feat/infra (not yet scoped into MVP1) | | In flight | 0 feature(s) actively shipping | @@ -116,7 +116,7 @@ _None._ _None._ -### Idea (7) +### Idea (6) | Feature | Type | One-liner | Depends on | Status | |---|---|---|---|---| @@ -126,7 +126,6 @@ _None._ | [feat_study_clone_from_previous](../02_product/planned_features/feat_study_clone_from_previous/idea.md) | Feature | A relevance engineer's normal workflow after the first study completes: | — | Idea — surfaced during a UX review of parameter-tuning ergonomics on 2026-05-19. 
| | [chore_study_default_stop_conditions](../02_product/planned_features/chore_study_default_stop_conditions/idea.md) | Chore | The server-side `StudyConfigSpec` validator at [`backend/app/api/v1/schemas.py:572-580`](../../backend/app/api/v1/schemas.py) correctly **requires** at least one of `max_trials` or `time_budget_min` — | — | Idea — surfaced during the 2026-05-21 Karpathy-loop audit; recommendation grounded in measured per-trial cost from the local dev DB. | | [chore_template_library_expansion](../02_product/planned_features/chore_template_library_expansion/idea.md) | Chore | Three connected gaps: | — | Idea — surfaced during a UX review of parameter-tuning ergonomics on 2026-05-19. | -| [bug_confidence_per_query_metric_key_drift](../02_product/planned_features/bug_confidence_per_query_metric_key_drift/idea.md) | Bug | The two implementations disagree on the shape of `trials.per_query_metrics`: | — | Idea (filed 2026-05-21; surfaced by Story 1.5 of `feat_pr_metric_confidence`) | ## Dependency graph diff --git a/docs/00_overview/dashboard.html b/docs/00_overview/dashboard.html index f163b2c6..1d517edb 100644 --- a/docs/00_overview/dashboard.html +++ b/docs/00_overview/dashboard.html @@ -371,7 +371,7 @@

Releases

The Loop
-
56 / 57 scoped done · 4 remaining
+
56 / 57 scoped done · 3 remaining
In progress
diff --git a/docs/00_overview/mvp1_dashboard.html b/docs/00_overview/mvp1_dashboard.html index 0190f864..53d2aa8d 100644 --- a/docs/00_overview/mvp1_dashboard.html +++ b/docs/00_overview/mvp1_dashboard.html @@ -390,12 +390,12 @@

MVP1 Progress

Path to MVP1
-
4
+
3
items left = features + bugs + chores
-
+
Open bugs
-
1
+
0
tracked bug_* idea files
@@ -428,7 +428,7 @@

Pipeline

-

Idea 7

+

Idea 6

@@ -499,18 +499,6 @@

Idea 7

Three connected gaps:
-
- - -
- -
- Bug - -
-
The two implementations disagree on the shape of `trials.per_query_metrics`:
- -
diff --git a/docs/02_product/planned_features/bug_confidence_per_query_metric_key_drift/idea.md b/docs/02_product/planned_features/bug_confidence_per_query_metric_key_drift/idea.md deleted file mode 100644 index c51a06c0..00000000 --- a/docs/02_product/planned_features/bug_confidence_per_query_metric_key_drift/idea.md +++ /dev/null @@ -1,53 +0,0 @@ -# Bug — `compute_study_confidence` uses bare metric name to look up per-query metrics whose keys are `@k`-suffixed - -**Status:** Idea (filed 2026-05-21; surfaced by Story 1.5 of `feat_pr_metric_confidence`) - -**Origin:** Caught while running the full integration suite during Story 1.5 (`/impl-execute` of `feat_pr_metric_confidence`). The Story 1.2 test [`backend/tests/integration/test_run_trial_per_query_persistence.py::test_successful_trial_writes_per_query_metrics`](../../../../backend/tests/integration/test_run_trial_per_query_persistence.py) fails with `unexpected metric key 'map@10' in per_query_metrics[...]; score() should remap pytrec_eval wire names to ['map', 'mrr', 'ndcg', 'precision', 'recall']`. The test asserted bare metric base names; the actual `score()` output uses `@k`-suffixed user-facing names. 
- -## Problem - -The two implementations disagree on the shape of `trials.per_query_metrics`: - -| Surface | What it produces / expects | -|---|---| -| [`backend/app/eval/scoring.py:179-184`](../../../../backend/app/eval/scoring.py#L179) `score()` (Story 1.1+) | `per_query[qid] = {ndcg@10: float, map@10: float, mrr: float, ...}` — user-facing tokens preserved including `@` cutoff | -| [`backend/workers/trials.py`](../../../../backend/workers/trials.py) Story 1.2 worker | Writes `scored["per_query"]` verbatim → DB has `@k`-suffixed keys | -| [`backend/app/domain/study/confidence.py:537-540`](../../../../backend/app/domain/study/confidence.py#L537) `compute_study_confidence` (Story 1.3) | Uses `metric = study_objective.get("metric")` (e.g., `"ndcg"`, bare) to look up `v[metric]` on each per_query dict — **misses** because the key is `ndcg@10`, not `ndcg` | -| [`backend/tests/integration/test_run_trial_per_query_persistence.py:100-115`](../../../../backend/tests/integration/test_run_trial_per_query_persistence.py#L100) Story 1.2 test | Asserts `metric_key in {"ndcg", "map", "precision", "recall", "mrr"}` (bare) — **fails** because the worker writes `@k`-suffixed keys | -| [`feature_spec.md`](../../../00_overview/implemented_features/2026_05_21_feat_pr_metric_confidence/feature_spec.md) AC-1 / AC-10 (when this ships there) | Examples show bare keys (`{qid: {ndcg: 0.84, map: 0.7, ...}}`) — agrees with Story 1.3's bare-key lookup but disagrees with the worker reality | - -**Production impact:** For every real completed study with `per_query_metrics` populated, the orchestrator runs `metric not in v` → True → `winner_values_for_metric = []` → `bootstrap_ci_95` returns `None` (N<5) → CI suppressed. Likewise `compute_outcome_summary` produces empty intersection → `per_query_outcomes = None`. Operators see the partial / aggregate-only confidence shape on every PR — the feature's headline value (CI band, named regressors) silently never renders against real data. 
- -**Why this slipped past Story 1.4's integration tests:** [`backend/tests/integration/test_studies_api_confidence.py`](../../../../backend/tests/integration/test_studies_api_confidence.py) seeds `per_query_metrics` directly with bare keys (`{ndcg: 0.41, ...}`) to match the spec literal. That matches the orchestrator's expectation, so the tests pass — but those keys aren't what the worker actually persists in production. Same for Story 1.5's integration test ([`test_open_pr_worker_confidence_plumbing.py`](../../../../backend/tests/integration/test_open_pr_worker_confidence_plumbing.py)). - -## Why deferred - -The fix forks into product-shaped questions that deserve their own design pass: - -1. **Where does the canonicalization happen?** - - **(a)** Worker remaps per_query keys to bare metric base names before persisting. Keeps Story 1.3 simple but drops `@k` info (problematic when a study computes both `ndcg@5` and `ndcg@10`). - - **(b)** Orchestrator uses `objective_metric_key(study.objective)` to compute the lookup key. Keeps all persisted info but requires Story 1.3's `compute_outcome_summary` to accept separate `metric_lookup_key` + `metric_threshold_key` args (the former drives `v[key]` lookups; the latter drives `REGRESSOR_THRESHOLDS.get(key)`). - - **(c)** Worker normalizes ONLY the primary metric to bare form and drops `@k` suffix; secondary metrics stay as `@k` for future analytics. Compromise that loses the cutoff for the primary metric only. - -2. **Spec patch required.** Whichever route lands, the spec's AC-1 / AC-10 examples need updating to reflect the chosen canonical form. The shipped spec at [`docs/00_overview/implemented_features/2026_05_21_feat_pr_metric_confidence/feature_spec.md`](../../../00_overview/implemented_features/2026_05_21_feat_pr_metric_confidence/feature_spec.md) will get an erratum note. - -3. **Test refactor cost.** Both Story 1.4 (11 cases) and Story 1.5 (1 case) integration tests seed bare-key per_query data. 
They'll need a small refactor (~10 LOC) to use the chosen canonical form. - -The current `/impl-execute` of Story 1.5 was scoped to PR body plumbing. Bundling an interface change to Story 1.3's pure orchestrator would inflate scope past the rubric's "fix the work-type that fits this PR's intent" guidance, so capturing here is the right call. - -## Surface - -- **Backend:** `backend/app/domain/study/confidence.py` (Story 1.3 orchestrator), `backend/workers/trials.py` (Story 1.2 worker — only if route (a) or (c) chosen). -- **Tests:** `backend/tests/integration/test_run_trial_per_query_persistence.py` (relax / correct assertion), `backend/tests/integration/test_studies_api_confidence.py` (11 cases), `backend/tests/integration/test_open_pr_worker_confidence_plumbing.py` (1 case), `backend/tests/unit/domain/study/test_confidence.py` (25+ cases may need shape updates). -- **Docs:** `feature_spec.md` AC-1 / AC-10 erratum. -- **No new endpoints, no migration, no frontend impact.** - -## Acceptance signal - -- `make test-integration` runs cleanly against the live in-container Postgres with no skipped Story 1.1 / 1.2 / 1.4 / 1.5 cases left failing. -- A seeded real-worker run (winner trial inserted via `run_trial` Arq job, not direct SQL) drives `compute_study_confidence` to a fully-populated `ConfidenceShape` (CI band + per_query_outcomes both non-null when seed data warrants). -- Spec's AC-1 / AC-10 example shape matches the persisted DB shape verbatim. - -## Related work - -- Bundled `Story 1.5` commit `` fixes the two mechanical pre-existing test failures (`test_migrations.py` head `0014`→`0015`; `test_trials_per_query_metrics_migration.py` invalid `judgment_lists.status='ready'`→`'complete'`) — those were not part of this drift but were uncovered in the same full-suite run. See the impl-execute transcript for the rubric reasoning (mechanical assertion updates qualify as inline fixes; this product-shaped key drift does not). 
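The key drift described in the Problem table can be illustrated with a minimal sketch of route (b). It assumes the objective carries a `metric` name plus an optional cutoff `k`, as in the spec examples. The helpers `lookup_key_for` and `winner_values` are hypothetical names made up for this illustration; they are not the project's `objective_metric_key` or `compute_study_confidence`, whose real signatures are not shown here.

```python
from typing import Dict, List, Mapping


def lookup_key_for(objective: Mapping[str, object]) -> str:
    """Hypothetical helper: build the user-facing token, e.g. 'ndcg@10'."""
    metric = str(objective["metric"])
    k = objective.get("k")
    # Assumption: a missing k means a cutoff-free metric such as 'mrr'.
    return metric if k is None else f"{metric}@{k}"


def winner_values(per_query: Mapping[str, Mapping[str, float]], key: str) -> List[float]:
    """Collect one metric's per-query values, skipping queries that lack the key."""
    return [scores[key] for scores in per_query.values() if key in scores]


# Shape the worker persists according to the Problem table: @k-suffixed keys.
per_query: Dict[str, Dict[str, float]] = {
    "q1": {"ndcg@10": 0.84, "mrr": 1.0},
    "q2": {"ndcg@10": 0.41, "mrr": 0.5},
}
objective = {"metric": "ndcg", "k": 10}

# Bare-key lookup (the drift): nothing matches, so downstream CI logic sees no data.
bare_hits = winner_values(per_query, str(objective["metric"]))

# Token-aware lookup: the key is derived from the objective, so values are found.
token_hits = winner_values(per_query, lookup_key_for(objective))
```

Under these assumptions the bare lookup yields an empty list while the derived key recovers both per-query values, which is the behavioral difference routes (a) through (c) are choosing how to resolve.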
diff --git a/docs/02_product/planned_features/feat_pr_metric_confidence/feature_spec.md b/docs/02_product/planned_features/feat_pr_metric_confidence/feature_spec.md index 980a1f12..36d7e10b 100644 --- a/docs/02_product/planned_features/feat_pr_metric_confidence/feature_spec.md +++ b/docs/02_product/planned_features/feat_pr_metric_confidence/feature_spec.md @@ -497,10 +497,10 @@ Tooltip implementation reuses the existing [`InfoTooltip`](../../../../ui/src/co - Given a study with `template_id` declaring `title_boost: 'float'`, a query set of 5 queries, and a judgment list with judgments for those queries - When the orchestrator runs 5 Optuna trials and all 5 complete successfully -- Then `SELECT per_query_metrics FROM trials WHERE study_id = ?` returns 5 non-NULL JSONB rows, each shaped `{qid: {ndcg: float, map: float, precision: float, recall: float, mrr: float}}` (the 5 metric keys from `MetricCatalog`) +- Then `SELECT per_query_metrics FROM trials WHERE study_id = ?` returns 5 non-NULL JSONB rows, each shaped `{qid: {: float, ...}}` where the keys are the user-facing metric tokens emitted by `backend.app.eval.scoring.score()` — i.e., the @-suffixed form for cutoff-aware metrics (`ndcg@10`, `map@10`, `precision@10`, `recall@10`) and bare names for cutoff-free metrics (`mrr`, plain `map`). Base names are constrained to `MetricCatalog` (`ndcg`, `map`, `precision`, `recall`, `mrr`). 
- Example values: - - Input: `study_id="01931..."`, `max_trials=5`, judgment list with `query_ids=["q1","q2","q3","q4","q5"]` - - Expected: 5 rows, each with `per_query_metrics["q1"]["ndcg"]` populated as a float between 0.0 and 1.0 + - Input: `study_id="01931..."`, `max_trials=5`, `objective={metric: "ndcg", k: 10}`, judgment list with `query_ids=["q1","q2","q3","q4","q5"]` + - Expected: 5 rows, each with `per_query_metrics["q1"]["ndcg@10"]` populated as a float between 0.0 and 1.0 ### AC-2: Failed trial does not write per_query_metrics @@ -571,8 +571,8 @@ Tooltip implementation reuses the existing [`InfoTooltip`](../../../../ui/src/co ### AC-10: Per-query regressor naming with thresholded comparison -- Given a completed study with NDCG objective, winner trial `per_query_metrics`, and runner-up #2 with per_query, where for `query_id="qA"` the winner scored 0.41 and the runner-up scored 0.92 (delta=-0.51, below the -0.01 threshold for NDCG) -- And for `query_id="qB"` the winner scored 0.85 and the runner-up scored 0.85 (delta=0, within ±0.01 unchanged window) +- Given a completed study with `objective={metric: "ndcg", k: 10}`, winner trial `per_query_metrics` keyed by user-facing token (`ndcg@10`), and runner-up #2 with per_query, where for `query_id="qA"` the winner's `per_query_metrics[qA]["ndcg@10"] == 0.41` and the runner-up's is `0.92` (delta=-0.51, below the -0.01 threshold for NDCG) +- And for `query_id="qB"` the winner's `ndcg@10` is `0.85` and the runner-up's is `0.85` (delta=0, within ±0.01 unchanged window) - When `GET /api/v1/studies/{id}` is called - Then `confidence.per_query_outcomes.top_regressors` contains a row for `query_id="qA"` with `query_text` joined from the queries table, `winner_score=0.41`, `comparison_score=0.92`, `delta=-0.51` - And `confidence.per_query_outcomes.regressed == 1` From 29132ddca54025c6ecfa27255dd789a4d79fe709 Mon Sep 17 00:00:00 2001 From: SoundMindsAI Date: Thu, 21 May 2026 12:20:13 -0400 Subject: [PATCH 11/17] 
docs(planned): capture Guides Glossary + FAQ follow-ups (2 idea files) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Surfaced during feat_pr_metric_confidence Story 1.5 review — operator asked whether the current docs + tooltips accurately describe the confidence terms, and whether the Guides catalog needs an FAQ or a Glossary. Audit conclusion: - Tooltip text in feature_spec.md §11 cross-checks cleanly against the locked constants in backend/app/domain/study/confidence.py (BOOTSTRAP_N, REGRESSOR_THRESHOLDS, RUNNER_UP_PLATEAU_BAND, LATE_TRIAL_WINDOW_FRAC, EARLY_HELD_*, LATE_RISING_TRIAL_NUMBER_FRAC). No drift. - The 6 confidence-related glossary entries (confidence.ci_95, confidence.runner_up_gap, ...) aren't created yet — they're scoped to Story 2.2 of the same feature. - /guide catalog has long-form docs + walkthroughs but no Glossary or FAQ surface, despite docs/08_guides/README.md mentioning FAQs in its header. Two follow-ups filed: - chore_guides_glossary_route — render the 103+ existing glossary.ts keys as a searchable reference page at /guide/glossary, with anchor deep-links so tooltips can point at canonical definitions. - chore_guides_faq — curated operator-judgment Q&A (~15-20 entries) that exceeds the 1-2 sentence tooltip budget; answers "what should I do about X" rather than "what does X mean". Depends on the Glossary landing first so FAQ entries can deep-link into it. Neither is blocking — the inline-tooltip surface already covers the "what does X mean" case in-context. Worth picking up once the operator base widens past the maintainer + design partners. 
Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/00_overview/DASHBOARD.md | 2 +- docs/00_overview/MVP1_DASHBOARD.md | 8 ++- docs/00_overview/dashboard.html | 2 +- docs/00_overview/mvp1_dashboard.html | 30 ++++++++- .../planned_features/chore_guides_faq/idea.md | 65 +++++++++++++++++++ .../chore_guides_glossary_route/idea.md | 52 +++++++++++++++ 6 files changed, 151 insertions(+), 8 deletions(-) create mode 100644 docs/02_product/planned_features/chore_guides_faq/idea.md create mode 100644 docs/02_product/planned_features/chore_guides_glossary_route/idea.md diff --git a/docs/00_overview/DASHBOARD.md b/docs/00_overview/DASHBOARD.md index 180f599f..27cb52d7 100644 --- a/docs/00_overview/DASHBOARD.md +++ b/docs/00_overview/DASHBOARD.md @@ -6,7 +6,7 @@ _Top-level index across MVP1 → GA v1+ as of **2026-05-21**. Click a release na | Release | Theme | Progress | Status | |---|---|---|---| -| [MVP1 / v0.1](MVP1_DASHBOARD.md) | The Loop | 56 / 57 scoped done · 3 remaining | **In progress** | +| [MVP1 / v0.1](MVP1_DASHBOARD.md) | The Loop | 56 / 57 scoped done · 5 remaining | **In progress** | | [MVP2 / v0.2](MVP2_DASHBOARD.md) | Observable | 1 / 1 scoped done · 1 remaining | **In progress** | | MVP3 / v0.3 | Production Stacks | — | **Not yet scoped** | | MVP4 / v0.4 | Multi-tenant, Multi-LLM | — | **Not yet scoped** | diff --git a/docs/00_overview/MVP1_DASHBOARD.md b/docs/00_overview/MVP1_DASHBOARD.md index 86ca2f00..4b31b0b4 100644 --- a/docs/00_overview/MVP1_DASHBOARD.md +++ b/docs/00_overview/MVP1_DASHBOARD.md @@ -21,9 +21,9 @@ Plan approved; run /impl-execute to ship | Metric | Value | |---|---| | Scoped items done | **56 / 57** (98%) — feat_/infra_/chore_/epic_ past idea stage | -| Path to MVP1 | **3** items remaining (features + bugs + chores) | +| Path to MVP1 | **5** items remaining (features + bugs + chores) | | Open bugs | 0 | -| Open chores | 2 (idea-stage debt) | +| Open chores | 4 (idea-stage debt) | | Backlog ideas | 4 idea-only feat/infra (not yet scoped 
into MVP1) | | In flight | 0 feature(s) actively shipping | @@ -116,7 +116,7 @@ _None._ _None._ -### Idea (6) +### Idea (8) | Feature | Type | One-liner | Depends on | Status | |---|---|---|---|---| @@ -124,6 +124,8 @@ _None._ | [feat_config_repo_baseline_tracking](../02_product/planned_features/feat_config_repo_baseline_tracking/idea.md) | Feature | RelyLoop does not track which configuration is currently live in production. When a proposal's PR merges, the merge webhook at [`backend/app/api/webhooks/github.py:187-191`](../../backend/app/api/webh | — | Idea — surfaced during the 2026-05-21 Karpathy-loop audit. | | [feat_digest_executable_followups](../02_product/planned_features/feat_digest_executable_followups/idea.md) | Feature | The digest worker's LLM contract at [`backend/workers/digest.py:168-189`](../../backend/workers/digest.py) defines `suggested_followups` as a flat `array of string`: | — | Idea — surfaced during the 2026-05-21 Karpathy-loop audit. | | [feat_study_clone_from_previous](../02_product/planned_features/feat_study_clone_from_previous/idea.md) | Feature | A relevance engineer's normal workflow after the first study completes: | — | Idea — surfaced during a UX review of parameter-tuning ergonomics on 2026-05-19. | +| [chore_guides_faq](../02_product/planned_features/chore_guides_faq/idea.md) | Chore | Tooltips and the glossary answer "**what does X mean?**" within a 1–2 sentence budget. They don't carry the operator-judgment-shaped questions that come up *after* the term is understood: | — | Idea — surfaced during `feat_pr_metric_confidence` Story 1.5 review | +| [chore_guides_glossary_route](../02_product/planned_features/chore_guides_glossary_route/idea.md) | Chore | The glossary is a load-bearing terminology source-of-truth (cited 100+ times across the codebase, parity-tested against backend Literal enums, locked by source-of-truth comments). 
But operators can on | — | Idea — surfaced during `feat_pr_metric_confidence` Story 1.5 review | | [chore_study_default_stop_conditions](../02_product/planned_features/chore_study_default_stop_conditions/idea.md) | Chore | The server-side `StudyConfigSpec` validator at [`backend/app/api/v1/schemas.py:572-580`](../../backend/app/api/v1/schemas.py) correctly **requires** at least one of `max_trials` or `time_budget_min` — | — | Idea — surfaced during the 2026-05-21 Karpathy-loop audit; recommendation grounded in measured per-trial cost from the local dev DB. | | [chore_template_library_expansion](../02_product/planned_features/chore_template_library_expansion/idea.md) | Chore | Three connected gaps: | — | Idea — surfaced during a UX review of parameter-tuning ergonomics on 2026-05-19. | diff --git a/docs/00_overview/dashboard.html b/docs/00_overview/dashboard.html index 1d517edb..d0875958 100644 --- a/docs/00_overview/dashboard.html +++ b/docs/00_overview/dashboard.html @@ -371,7 +371,7 @@

Releases

The Loop
-56 / 57 scoped done · 3 remaining
+56 / 57 scoped done · 5 remaining
In progress
diff --git a/docs/00_overview/mvp1_dashboard.html b/docs/00_overview/mvp1_dashboard.html index 53d2aa8d..53eaecea 100644 --- a/docs/00_overview/mvp1_dashboard.html +++ b/docs/00_overview/mvp1_dashboard.html @@ -390,7 +390,7 @@

MVP1 Progress

Path to MVP1
-3
+5
items left = features + bugs + chores
@@ -400,7 +400,7 @@

MVP1 Progress

Open chores
-2
+4
idea-stage chore_* (debt)
@@ -428,7 +428,7 @@

Pipeline

-

Idea 6

+

Idea 8

@@ -478,6 +478,30 @@

Idea 6

+
+ +
+ Chore + +
+
Tooltips and the glossary answer "**what does X mean?**" within a 1–2 sentence budget. They don't carry the operator-judgment-shaped questions that come up *after* the term is understood:
+ + +
+ + +
+ +
+ Chore + +
+
The glossary is a load-bearing terminology source-of-truth (cited 100+ times across the codebase, parity-tested against backend Literal enums, locked by source-of-truth comments). But operators can on
+ + +
+ +
diff --git a/docs/02_product/planned_features/chore_guides_faq/idea.md b/docs/02_product/planned_features/chore_guides_faq/idea.md new file mode 100644 index 00000000..08500918 --- /dev/null +++ b/docs/02_product/planned_features/chore_guides_faq/idea.md @@ -0,0 +1,65 @@ +# FAQ in the Guides catalog + +**Date:** 2026-05-21 +**Status:** Idea — surfaced during `feat_pr_metric_confidence` Story 1.5 review +**Origin:** Audit question from operator during the metric-key-drift fix conversation — "do we need an FAQ in the Guides section?" The header of [`docs/08_guides/README.md`](../../../08_guides/README.md) line 3 *mentions* "FAQs" as one of the documented content types, but no FAQ file exists in the repo and no FAQ surface exists in the UI — that's stale ambition from when the Guides directory was first sketched. +**Depends on:** None. + +## Problem + +Tooltips and the glossary answer "**what does X mean?**" within a 1–2 sentence budget. They don't carry the operator-judgment-shaped questions that come up *after* the term is understood: + +- "My CI band is missing — why?" (Answer requires citing FR-7's degradation table + the 5-query minimum + the per_query_metrics IS NULL old-study case.) +- "Convergence regime is *noisy* — should I rerun with a different sampler?" (Answer needs to balance "noisy is often fine" vs "noisy + sharp_peak warrants caution" vs "noisy on a 10-trial study is meaningless — get more trials first.") +- "The PR body shows `regressed: 2` — should I reject?" (Answer requires explaining that regressors aren't categorically bad: an operator who cares about a specific query catalog should look at the named regressors, but a global-relevance operator might be fine with a 2-regression / 14-improvement trade.) +- "Why does my study have `confidence: null` instead of a partial shape?" (Answer cites AC-3 vs AC-3a — the difference between "winner trial exists but per_query_metrics IS NULL" and "best_trial_id IS NULL".)
+- "When should I trust the LLM-as-judge ratings vs override them?" (Answer cites the κ calibration story + the override path from `feat_llm_judgments`.) +- "I rejected a proposal — what happens to the open PR on GitHub?" (Answer cites the `proposal.status='rejected'` → no automatic GitHub action; operator must close the PR manually. Documented in [`pr-open-debugging.md`](../../../03_runbooks/pr-open-debugging.md) but not surfaced in-app.) + +Today this knowledge lives in: +- Spec edge/error-flow sections (read by engineers, not operators). +- Runbooks under [`docs/03_runbooks/`](../../../03_runbooks/) (good for SREs, not operator-facing). +- Tooltips (too short for the *why*). +- Tribal knowledge in chat history (lost). + +The result: every operator who hits one of these questions asks the same one — either internally to the platform team, or by giving up. The FAQ surfaces the canonical answers at a discoverable URL. + +## Proposed capabilities + +### Page + content + +- New route `/guide/faq` and a fourth section on `/guide` (next to long-form docs, walkthroughs, and the proposed [Glossary route](../chore_guides_glossary_route/idea.md)). +- Curated content — **not** a sprawling community-style Q&A. Initial pass targets ~15–20 entries grouped by phase: **Studies & Confidence** (5–7 entries), **Judgments** (3–4), **Proposals & PRs** (3–4), **Chat agent** (2–3), **Setup & install** (2–3). +- Each entry: a clear question header + a 3–5 sentence answer + cross-links to relevant tooltips (glossary keys), runbooks, and spec ACs. +- Anchor-deep-linkable: `/guide/faq#confidence-ci-missing` so tooltips can link to the canonical answer. + +### Authoring source + +- Markdown files under [`docs/08_guides/faq/`](../../../08_guides/) — one `.md` per top-level category, shipped in-app via the same `react-markdown` pipeline the guide scripts already use. +- Registered in `GUIDE_REGISTRY` (or a sibling `FAQ_REGISTRY`) with parity tests against the actual markdown files. 
+ +### Discoverability + +- Surface relevant entries from inline `` triggers via a "Related FAQ →" footer link (uses the deep-link anchors). +- Add a "Common questions" callout on the home page below the existing `StartHereChecklist` (from `feat_contextual_help` Phase 3). + +## Scope signals + +- **Backend:** none. +- **Frontend:** new page component at `ui/src/app/guide/faq/page.tsx` + per-category markdown rendering; new card on `/guide`; vitest for category/anchor parity + smoke that all referenced glossary keys resolve. Optional `` component for the "Related FAQ →" footer in tooltips/popovers. +- **Migration:** none. +- **Config:** none. +- **Audit events:** N/A (read-only). +- **Estimated size:** small-to-medium — 1 new page (~250 LOC including markdown loader + category index), 5–6 markdown files (~200 lines of operator-facing prose total in the first pass), ~100 LOC of vitest. The content writing is the bulk of the work, not the implementation. + +## Why not yet prioritized + +The MVP1 path is "operator reads tutorial-first-study.md → runs the loop → ships PRs." The questions an FAQ would answer mostly surface *after* the operator has done that loop a few times — i.e., they're not blocking first-run success. Until the operator base is wider than the maintainer + a handful of design partners, every FAQ-shaped question can be answered in chat / Slack faster than it can be authored as a curated entry. Worth doing once routine support questions start repeating themselves. + +Also worth deferring until [`chore_guides_glossary_route`](../chore_guides_glossary_route/idea.md) lands — the FAQ entries will reference glossary terms heavily, and having the glossary as a target for `[term](/guide/glossary#term)` deep links makes the FAQ content much richer. + +## Relationship to other work + +- **Sibling:** [`chore_guides_glossary_route`](../chore_guides_glossary_route/idea.md) — different content axis (Glossary = "what does X mean?", FAQ = "what should I do about X?"). 
The Glossary should land first because the FAQ deep-links into it. +- **Supersedes:** the stale "FAQs" mention in [`docs/08_guides/README.md`](../../../08_guides/README.md) line 3, which is currently aspirational with no concrete artifact. +- **Coordinates with:** the existing operator-facing runbooks under [`docs/03_runbooks/`](../../../03_runbooks/) (webhook debugging, PR-open debugging, agent debugging, judgment debugging) — those are SRE-level; the FAQ would link to them for operators who need to escalate. diff --git a/docs/02_product/planned_features/chore_guides_glossary_route/idea.md b/docs/02_product/planned_features/chore_guides_glossary_route/idea.md new file mode 100644 index 00000000..9c64d366 --- /dev/null +++ b/docs/02_product/planned_features/chore_guides_glossary_route/idea.md @@ -0,0 +1,52 @@ +# Glossary route in the Guides catalog + +**Date:** 2026-05-21 +**Status:** Idea — surfaced during `feat_pr_metric_confidence` Story 1.5 review +**Origin:** Audit question from operator during the metric-key-drift fix conversation — "do we need a Glossary in the Guides section?" The audit confirmed [`ui/src/lib/glossary.ts`](../../../../ui/src/lib/glossary.ts) has 103+ entries today and grows with every feature (6 more land with `feat_pr_metric_confidence` Story 2.2). The keys are only discoverable via hover on inline `` / `` triggers — there's no canonical reference surface. +**Depends on:** None (glossary data structure already exists; this is purely a render layer + route). + +## Problem + +The glossary is a load-bearing terminology source-of-truth (cited 100+ times across the codebase, parity-tested against backend Literal enums, locked by source-of-truth comments). But operators can only access it via the inline tooltip triggers that appear next to specific UI elements — meaning: + +- An operator reading the PR body in GitHub can't look up what "Late-trial 1σ" means without first navigating to a study detail page that happens to render the term. 
+- A new user landing on `/judgments` can't browse the full set of judgment-related terms without hovering each element one at a time. +- Cross-feature concept search (e.g., "what's the difference between *runner_up_gap* and *runner_up_metric*?") requires file-system grep into `glossary.ts` — not viable for non-engineers. + +The [`/guide`](../../../../ui/src/app/guide/page.tsx) catalog page is the natural home: it already aggregates long-form docs (`tutorial-first-study`, `workflows-overview`) and visual walkthroughs. A third section — **Glossary** — fits the same "operator reference" axis. + +## Proposed capabilities + +### Route + page + +- New route `/guide/glossary` (and a third card on `/guide` next to the existing long-form-doc and walkthrough sections). +- Search box (case-insensitive substring match over keys + short text + long text). +- Optional category facets driven by key prefix (`study.*`, `judgment.*`, `proposal.*`, `confidence.*`, `chat.*`, …) — derive from key segments, no hand-maintained taxonomy. +- Each rendered entry shows the key (as code), the short form, and the long form (rendered through the existing `react-markdown` safety filter from `feat_contextual_help`). +- Deep-linkable anchors: `/guide/glossary#study.metric.ndcg` so tooltips' "Read more" can link directly into the page (future enhancement). + +### Discoverability + +- Add a Glossary entry to the home page's `StartHereChecklist` (introduced by `feat_contextual_help` Phase 3) as one of the "Learn the terminology" steps. +- Link from each guide's script.md footer ("See the glossary for definitions of every term used in this walkthrough"). + +## Scope signals + +- **Backend:** none (glossary is a frontend `.ts` constant). +- **Frontend:** new page component at `ui/src/app/guide/glossary/page.tsx`; new card on the `/guide` catalog page; small enhancement to `GUIDE_REGISTRY` shape if it grows a "reference" category; vitest tests for search + category facets. +- **Migration:** none. 
+- **Config:** none. +- **Audit events:** N/A (read-only page, no state mutations). +- **Estimated size:** small — 1 new page (~200 LOC), 1 catalog card addition, ~80 LOC of vitest. The existing `glossary.ts` constant is already shaped to render directly. + +## Why not yet prioritized + +The glossary is functional today via the inline-tooltip surface — every term IS discoverable in context, just not in a single browsable list. Operators on the current MVP1 path (clone → run tutorial → ship first PR) don't hit the wall this would solve until they're routinely cross-referencing terms while reading PR bodies or making product decisions across multiple features. The need scales with operator-team size + tenure, which lands more naturally at MVP4 (multi-tenant, multiple platform-engineering teams sharing one deployment). + +That said, the implementation cost is small enough that a quiet sprint-friction sweep could pick this up at any point — it's not gated on any deeper work landing first. + +## Relationship to other work + +- **Sibling:** [`chore_guides_faq`](../chore_guides_faq/idea.md) — operator-judgment-shaped Q&A that exceeds the 1–2 sentence tooltip budget. The Glossary answers "what does X mean?"; the FAQ answers "what should I do about X?" Different content axis, both belong under `/guide`. +- **Coordinates with:** the 6 new confidence-related glossary entries landing in `feat_pr_metric_confidence` Story 2.2. Whichever lands first sets the convention for any new prefix-grouped section. +- **Coordinates with:** the `feat_contextual_help` precedent (lives in `docs/00_overview/implemented_features/2026_05_15_feat_contextual_help/`) — that feature introduced the InfoTooltip + glossary infrastructure this builds on. 
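The search and prefix-facet behavior proposed under "Route + page" can be sketched in a few lines. This is a language-neutral illustration written in Python, not the page implementation, which would live in the frontend. The `GLOSSARY` entries below reuse key names mentioned in this patch, but their `short` and `long` texts are invented placeholders, and the function names `search` and `facets` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

# Placeholder entries: keys echo names from the patch, texts are made up for the demo.
GLOSSARY: Dict[str, Dict[str, str]] = {
    "study.metric.ndcg": {
        "short": "NDCG",
        "long": "Rank-aware relevance score between 0 and 1.",
    },
    "confidence.ci_95": {
        "short": "95% CI",
        "long": "Bootstrap interval around the headline metric.",
    },
    "confidence.runner_up_gap": {
        "short": "Runner-up gap",
        "long": "Distance between the winner and the next-best trial.",
    },
}


def search(glossary: Dict[str, Dict[str, str]], query: str) -> List[str]:
    """Case-insensitive substring match over key, short text, and long text."""
    needle = query.casefold()
    return sorted(
        key
        for key, entry in glossary.items()
        if needle in key.casefold()
        or needle in entry["short"].casefold()
        or needle in entry["long"].casefold()
    )


def facets(glossary: Dict[str, Dict[str, str]]) -> Dict[str, List[str]]:
    """Group keys by their first dotted segment, so no taxonomy is hand-maintained."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for key in sorted(glossary):
        groups[key.split(".", 1)[0]].append(key)
    return dict(groups)
```

Deriving the facet from the first key segment means a new prefix such as `confidence.*` shows up as its own group without any registry change, which matches the "no hand-maintained taxonomy" goal stated in the proposal.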
From 3924bb250666bae71693f71745002e018f168278 Mon Sep 17 00:00:00 2001 From: SoundMindsAI Date: Thu, 21 May 2026 12:30:40 -0400 Subject: [PATCH 12/17] feat(worker): wire ConfidenceShape into digest narrative prompt (Story 1.6) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - prompts/digest_narrative.user.jinja gains two new XML blocks inserted after : (CI band + aggregate signals — each sub-line independently {% if %}-gated per FR-7) and (improved/unchanged/regressed counts + comparison_against label + up to 5 named regressor rows). - prompts/digest_narrative.system.md extends the XML-block inventory to document blocks 8 + 9 with conditional-inclusion semantics, and replaces the narrative opening-guidance sentence with the FR-6 string ("Open with the headline metric delta, immediately followed by a one-sentence confidence framing that mentions the CI band [when is present], the per-query outcome counts [when is present], and the worst-regressed query by name [when has regressors]"). - render_digest_user_prompt gains optional confidence: Mapping | None = None kwarg; passes through to the jinja render verbatim. None default preserves the existing one-callsite contract (worker is the only caller). - backend/workers/digest.py awaits fetch_study_confidence(db, study) immediately before render_digest_user_prompt, serializes via ConfidenceShape.model_dump() (so jinja consumes a plain dict, per cycle-1 GPT-5.5 F3), and threads through the new kwarg. - backend/tests/unit/workers/test_digest_prompt_render.py adds 5 new cases covering AC-14: (1) block present + all sub-lines populated; (2) absent when confidence=None; (3) present + regressor rows render; (4) absent when nested per_query_outcomes is None; (5) system prompt carries the FR-6 opening-guidance string (whitespace-normalized assertion tolerates the markdown soft-wraps) plus the documented block list entries. 
Verification: 1039 backend unit tests pass (+5 new); 189 contract; 526/526 in-container integration; backend fmt + lint + typecheck + ruff format --check parity clean. No regressions in test_digest_generate / test_digest_zero_trials / test_digest_capability_fallback — the new fetch_study_confidence call returns None for the seeded happy-path study (no per_query_metrics on its single trial), so the jinja blocks skip silently and the existing prompt-render assertions stay valid. Co-Authored-By: Claude Opus 4.7 (1M context) --- backend/app/llm/digest_prompt.py | 8 ++ .../unit/workers/test_digest_prompt_render.py | 134 ++++++++++++++++++ backend/workers/digest.py | 10 ++ .../implementation_plan.md | 2 +- prompts/digest_narrative.system.md | 25 +++- prompts/digest_narrative.user.jinja | 19 ++- 6 files changed, 192 insertions(+), 6 deletions(-) diff --git a/backend/app/llm/digest_prompt.py b/backend/app/llm/digest_prompt.py index 9ec7e18f..a5955158 100644 --- a/backend/app/llm/digest_prompt.py +++ b/backend/app/llm/digest_prompt.py @@ -80,6 +80,7 @@ def render_digest_user_prompt( recommended_config: Mapping[str, Any], dropped_template_params: Sequence[str], include_recommendation: bool = True, + confidence: Mapping[str, Any] | None = None, ) -> str: """Render the per-study user message for the digest narrative call. @@ -105,6 +106,12 @@ def render_digest_user_prompt( include_recommendation: cycle-3 F3 toggle. ``True`` (default) emits the full structured prompt; ``False`` emits the degraded / narrative-only variant for the capability-fallback path. + confidence: serialized ``ConfidenceShape`` (via + ``ConfidenceShape.model_dump()``) per feat_pr_metric_confidence + FR-6. ``None`` (default) skips both the ```` and + ```` jinja blocks; a partial shape (some + sub-fields ``None``) emits only the populated sub-lines via the + template's per-sub-field ``{% if %}`` guards. 
Returns: The rendered user message string, ready to send as the OpenAI @@ -127,6 +134,7 @@ def render_digest_user_prompt( recommended_config=recommended_config, dropped_template_params=dropped_template_params, include_recommendation=include_recommendation, + confidence=confidence, ) diff --git a/backend/tests/unit/workers/test_digest_prompt_render.py b/backend/tests/unit/workers/test_digest_prompt_render.py index 011f1b09..c911238b 100644 --- a/backend/tests/unit/workers/test_digest_prompt_render.py +++ b/backend/tests/unit/workers/test_digest_prompt_render.py @@ -24,6 +24,7 @@ _SANDBOX_ENV, DigestPromptBundle, load_digest_prompts, + render_digest_system_prompt, render_digest_user_prompt, ) @@ -155,6 +156,139 @@ def test_autoescape_neutralizes_adversarial_study_name() -> None: assert "malicious-instruction" not in output +# --------------------------------------------------------------------------- +# feat_pr_metric_confidence Story 1.6 — + +# --------------------------------------------------------------------------- + + +def _make_test_confidence_dict(**overrides: object) -> dict[str, object]: + """Build a fully-populated serialized ConfidenceShape for the jinja blocks. + + Mirrors what ``ConfidenceShape.model_dump()`` emits at the digest-worker + call site. Tests override sub-fields by passing them as kwargs. 
+ """ + base: dict[str, object] = { + "headline": {"metric": "ndcg", "value": 0.840, "k": 10, "n_queries": 20}, + "ci_95": {"low": 0.780, "high": 0.890, "method": "bootstrap_n1000", "n_samples": 20}, + "runner_up_gap": { + "value": 0.002, + "classification": "robust_plateau", + "top10_within": 0.004, + "runner_up_metric": 0.838, + }, + "late_trial_stddev": {"value": 0.012, "window_size": 20, "min_window_required": 10}, + "convergence": {"best_at_trial": 387, "total_trials": 1000, "regime": "early_held"}, + "per_query_outcomes": { + "improved": 14, + "unchanged": 4, + "regressed": 2, + "comparison_against": "runner_up", + "top_regressors": [ + { + "query_id": "q1", + "query_text": "vintage acoustic guitar", + "winner_score": 0.41, + "comparison_score": 0.92, + "delta": -0.51, + }, + { + "query_id": "q2", + "query_text": "leather wallet", + "winner_score": 0.55, + "comparison_score": 0.78, + "delta": -0.23, + }, + ], + }, + } + base.update(overrides) + return base + + +def test_user_prompt_includes_confidence_block_when_data_present() -> None: + """FR-6 / AC-14: full confidence dict produces the XML block.""" + kwargs = dict(CANONICAL_KWARGS) + kwargs["confidence"] = _make_test_confidence_dict() + output = render_digest_user_prompt(**kwargs) # type: ignore[arg-type] + assert "" in output + assert "" in output + # Headline + CI sub-lines. + assert "ci_low: 0.78" in output + assert "ci_high: 0.89" in output + assert "n_queries: 20" in output + # Aggregate signals. + assert "runner_up_gap: 0.002 (robust_plateau)" in output + assert "late_trial_stddev: 0.012" in output + assert "convergence: early_held (best at trial 387 of 1000)" in output + + +def test_user_prompt_omits_confidence_block_when_none() -> None: + """FR-7 / AC-12: confidence=None skips both blocks entirely.""" + output = render_digest_user_prompt(**CANONICAL_KWARGS) # type: ignore[arg-type] + # Canonical kwargs don't set `confidence` — defaults to None. 
+ assert "" not in output + assert "" not in output + + +def test_user_prompt_includes_per_query_outcomes_block_when_nested_data_present() -> None: + """The block surfaces nested counts + named regressors.""" + kwargs = dict(CANONICAL_KWARGS) + kwargs["confidence"] = _make_test_confidence_dict() + output = render_digest_user_prompt(**kwargs) # type: ignore[arg-type] + assert "" in output + assert "" in output + assert "improved: 14" in output + assert "unchanged: 4" in output + assert "regressed: 2" in output + assert "comparison_against: runner_up" in output + # Each regressor row: text + winner → comparison + delta in parens. + assert "- vintage acoustic guitar: 0.41" in output + assert "0.92" in output + assert "(-0.51)" in output + assert "- leather wallet: 0.55" in output + + +def test_user_prompt_omits_per_query_outcomes_block_when_subfield_is_none() -> None: + """FR-7: confidence present but per_query_outcomes=None → outer block only.""" + kwargs = dict(CANONICAL_KWARGS) + kwargs["confidence"] = _make_test_confidence_dict(per_query_outcomes=None) + output = render_digest_user_prompt(**kwargs) # type: ignore[arg-type] + # The block still renders (CI + aggregate signals). + assert "" in output + # stays suppressed. + assert "" not in output + + +def test_system_prompt_has_fr6_opening_guidance_and_block_inventory() -> None: + """AC-14 system-prompt half: the opening guidance + block list are updated. + + The replacement string from spec FR-6 is in the prompt file but + soft-wrapped at ~80 columns. We collapse whitespace before asserting so + the test tolerates wrap location while still proving the substring + contract — the LLM sees newlines as whitespace too. + """ + system = render_digest_system_prompt() + # Collapse all runs of whitespace (incl. newlines + indents) into single + # spaces so soft-wrapped sentences match continuous-string assertions. + flat = " ".join(system.split()) + # Opening-guidance replacement (FR-6 line edit). 
Backticks around + # `` / `` tag names are load-bearing — + # they signal to the LLM that these are XML block names, not English. + assert ( + "Open with the headline metric delta, immediately followed by a one-sentence " + "confidence framing that mentions the CI band (when `` is present), " + "the per-query outcome counts (when `` is present), and the " + "worst-regressed query by name (when `` has regressors)." + ) in flat + # The original "Open with the headline metric delta. Then explain" sentence + # must NOT exist verbatim — the replacement superseded it. + assert "headline metric delta. Then explain" not in flat + # Block inventory must document the two new XML blocks (these appear on + # their own lines so a direct substring check is fine). + assert "8. ``" in system + assert "9. ``" in system + + def test_sandbox_rejects_attribute_access() -> None: """Defense in depth: SandboxedEnvironment blocks dunder-access from template authors. diff --git a/backend/workers/digest.py b/backend/workers/digest.py index 33041bfb..034a15ba 100644 --- a/backend/workers/digest.py +++ b/backend/workers/digest.py @@ -72,6 +72,7 @@ known_models, ) from backend.app.llm.digest_prompt import load_digest_prompts, render_digest_user_prompt +from backend.app.services.study_confidence import fetch_study_confidence logger = structlog.get_logger(__name__) @@ -686,6 +687,14 @@ async def generate_digest(ctx: dict[str, Any], study_id: str) -> None: rubric_summary = rubric_text[:280] + ("..." if len(rubric_text) > 280 else "") if not rubric_summary: rubric_summary = "(see judgment list rubric)" + # feat_pr_metric_confidence Story 1.6 (FR-6): assemble the + # per-study ConfidenceShape and serialize for the jinja + # ```` + ```` blocks. Returns + # ``None`` on degraded paths (FR-7) so the blocks skip cleanly. 
+ confidence_shape = await fetch_study_confidence(db, study) + confidence_payload = ( + confidence_shape.model_dump() if confidence_shape is not None else None + ) user_prompt = render_digest_user_prompt( study_name=study.name, cluster_name=cluster_name, @@ -701,6 +710,7 @@ async def generate_digest(ctx: dict[str, Any], study_id: str) -> None: recommended_config=recommended_config, dropped_template_params=dropped, include_recommendation=structured_output_enabled and not all_dropped, + confidence=confidence_payload, ) bundle = load_digest_prompts() openai_client = AsyncOpenAI(api_key=api_key, base_url=settings.openai_base_url) diff --git a/docs/02_product/planned_features/feat_pr_metric_confidence/implementation_plan.md b/docs/02_product/planned_features/feat_pr_metric_confidence/implementation_plan.md index 24f0758f..0eee02c2 100644 --- a/docs/02_product/planned_features/feat_pr_metric_confidence/implementation_plan.md +++ b/docs/02_product/planned_features/feat_pr_metric_confidence/implementation_plan.md @@ -1055,7 +1055,7 @@ None planned. The feature is purely additive across all surfaces. - [ ] Story 1.3 — Domain module `confidence.py` - [x] Story 1.4 — `ConfidenceShape` + StudyDetail enrichment - [x] Story 1.5 — PR body section + worker plumbing -- [ ] Story 1.6 — Digest narrative prompt extension +- [x] Story 1.6 — Digest narrative prompt extension - [ ] **Epic 1 gate** - [ ] Story 2.1 — TypeScript types + enums - [ ] Story 2.2 — `` component + glossary + page mount diff --git a/prompts/digest_narrative.system.md b/prompts/digest_narrative.system.md index ed2887ec..2efa2648 100644 --- a/prompts/digest_narrative.system.md +++ b/prompts/digest_narrative.system.md @@ -24,15 +24,32 @@ The user message contains XML-delimited blocks: 7. `` (only when `include_recommendation=False`) — the operator's OpenAI endpoint failed the structured-output capability probe. Return free- form prose narrative only — no JSON, no recommendations, no follow-ups. +8. 
`` (only when the orchestrator computed a non-null + `ConfidenceShape` for the study) — bootstrap 95% CI on the headline metric + (`ci_low`/`ci_high`/`n_queries`) plus aggregate signals (`runner_up_gap`, + `late_trial_stddev`, `convergence`). Each sub-line is omitted independently + when its sub-field is null (FR-7 graceful-degradation contract). For + studies still running, or studies whose winner trial predates the + `per_query_metrics` migration, the block may be absent or partial. +9. `` (only when both the winner trial and the runner-up + trial have per-query metrics) — `improved` / `unchanged` / `regressed` + counts, the `comparison_against` reference (`runner_up` in MVP1; `baseline` + when Phase 2 ships), and up to 5 named regressor rows + (`query_text: winner_score → comparison_score (delta)`). Omitted entirely + when the comparison data isn't available. For the **structured** path (default, `include_recommendation=True`), return a JSON object with exactly two fields: - `narrative` — a markdown string (~200–600 words). Open with the headline - metric delta. Then explain *why* the recommendation works, citing the - `` map and 2–3 top trials. Reference the - `` literal params + values where useful, but do NOT - reprint the full config — the data layer already has it. + metric delta, immediately followed by a one-sentence confidence framing that + mentions the CI band (when `` is present), the per-query outcome + counts (when `` is present), and the worst-regressed + query by name (when `` has regressors). Then explain + *why* the recommendation works, citing the `` map and + 2–3 top trials. Reference the `` literal params + values + where useful, but do NOT reprint the full config — the data layer already + has it. - `suggested_followups` — a JSON array of at most 5 short strings, each a concrete next action the engineer can take (e.g. 
"Re-run with a wider `tie_breaker` range", "Add a judgment for query 'wireless headphones' to diff --git a/prompts/digest_narrative.user.jinja b/prompts/digest_narrative.user.jinja index b52e6fcd..661b5116 100644 --- a/prompts/digest_narrative.user.jinja +++ b/prompts/digest_narrative.user.jinja @@ -12,7 +12,24 @@ baseline_metric: {{ baseline_metric if baseline_metric is not none else 'N/A (no achieved_metric: {{ achieved_metric }} - +{% if confidence %} +{% if confidence.ci_95 %}ci_low: {{ confidence.ci_95.low }} +ci_high: {{ confidence.ci_95.high }} +{% endif %}n_queries: {{ confidence.headline.n_queries }} +{% if confidence.runner_up_gap %}runner_up_gap: {{ confidence.runner_up_gap.value }} ({{ confidence.runner_up_gap.classification or 'unclassified' }}) +{% endif %}{% if confidence.late_trial_stddev %}late_trial_stddev: {{ confidence.late_trial_stddev.value }} +{% endif %}{% if confidence.convergence %}convergence: {{ confidence.convergence.regime }} (best at trial {{ confidence.convergence.best_at_trial }} of {{ confidence.convergence.total_trials }}) +{% endif %} + +{% endif %}{% if confidence and confidence.per_query_outcomes %} +improved: {{ confidence.per_query_outcomes.improved }} +unchanged: {{ confidence.per_query_outcomes.unchanged }} +regressed: {{ confidence.per_query_outcomes.regressed }} +comparison_against: {{ confidence.per_query_outcomes.comparison_against }} +{% for r in confidence.per_query_outcomes.top_regressors %}- {{ r.query_text }}: {{ r.winner_score }} → {{ r.comparison_score }} ({{ r.delta }}) +{% endfor %} + +{% endif %} {% for t in top_trials %}trial #{{ t.number }} — primary_metric={{ t.primary_metric }}, params={{ t.params }} {% endfor %} From 08147001788c096e496f14a8be1b281d904f6ff6 Mon Sep 17 00:00:00 2001 From: SoundMindsAI Date: Thu, 21 May 2026 12:49:28 -0400 Subject: [PATCH 13/17] =?UTF-8?q?chore(domain):=20Epic=201=20gate=20?= =?UTF-8?q?=E2=80=94=20GPT-5.5=20review=20fixes=20+=20docs=20+=20live=20cu?= 
=?UTF-8?q?rl=20verified?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Cross-model GPT-5.5 review of the cumulative Epic 1 diff (8 commits, ~3100 LOC) returned 12 findings: - 5 REJECTED with cited counter-evidence (truncated-diff false positives): workers/trials.py, workers/git_pr.py, workers/digest.py, and repo/trial.py ARE in the diff (commits 032f342, d92fd5f, 3924bb2 modify them; repo uses **fields: object pass-through so no signature change needed). domain/study/__init__.py change rejected too — file has no submodule import convention; direct imports are the project pattern. - 2 DEFERRED: * Plan interface drift (compute_study_confidence shipped sync/pure + async wrapper in services/, plan showed it as async DB-aware) — the shipped split honors CLAUDE.md's "domain is pure" rule better than the plan's literal pseudocode; plan will be patched at finalization. * Story 1.5 integration test bypasses the real open_pr worker — the test proves FR-5d's data plumbing slice; the full 15-step worker contract is exercised by the existing feat_github_pr_worker suite. - 5 ACCEPTED + FIXED INLINE: * #6: backend/app/domain/study/confidence.py — convergence total_trials now uses max(trial_numbers) + 1 (the Optuna budget semantic) instead of len(complete_trials). Spec AC-8's example "best at trial 200 of 1000" was wrong with the count interpretation when trials are sparse (failed/pruned thinning the dict). * #7: classify_convergence_regime guards against KeyError when winner_trial_number isn't in primary_metrics_by_trial_number (e.g., dangling best_trial_id pointing at a non-complete trial) — returns None per FR-7 graceful-degradation invariant. * #9: New migration test test_existing_trial_row_stays_null_across_upgrade proves the no-backfill contract from Story 1.1 — pre-0015 rows survive the upgrade with per_query_metrics IS NULL. 
* #11: Trial model docstring updated to reflect actual persisted key shape (ndcg@10, map@10, mrr, …) per the metric-key drift fix in commit cc14164. * #12 docs: state.md updated with Epic 1 status + Alembic head bump to 0015; architecture.md adds backend/app/services/ study_confidence.py and backend/app/domain/study/confidence.py to the code-lives section, plus the new 0015 migration line. Epic 1 gate live-curl smoke verified end-to-end: GET /api/v1/studies/{seeded_id} returns the full ConfidenceShape — headline + ci_95 (bootstrap_n1000, n_samples=8) + runner_up_gap + late_trial_stddev (window_size=5) + convergence + per_query_outcomes with the designed regressor query named in top_regressors. Tested against the rebuilt api/worker compose stack. Verification: 1039 backend unit + 189 contract + 527 integration tests pass (was 1034 / 189 / 526 pre-fixes — +5 digest prompt cases from Story 1.6, +1 convergence assertion in test_studies_api_confidence, +1 migration test for #9). Backend fmt + lint + typecheck + ruff format --check parity all clean. 
Co-Authored-By: Claude Opus 4.7 (1M context) --- architecture.md | 17 ++- backend/app/db/models/trial.py | 14 +- backend/app/domain/study/confidence.py | 13 +- .../test_studies_api_confidence.py | 8 ++ ...test_trials_per_query_metrics_migration.py | 133 ++++++++++++++++++ .../unit/domain/study/test_confidence.py | 8 +- .../implementation_plan.md | 2 +- state.md | 8 +- 8 files changed, 189 insertions(+), 14 deletions(-) diff --git a/architecture.md b/architecture.md index 4b120dde..33f86935 100644 --- a/architecture.md +++ b/architecture.md @@ -109,10 +109,20 @@ backend/ judgment.py from feat_llm_judgments) services/ use-case orchestrators — cluster.py from infra_adapter_elastic; study_state.py (state machine + FR-7 protection listener, - feat_study_lifecycle Phase 2) + feat_study_lifecycle Phase 2); + study_confidence.py (async glue that runs the 4-query + read pattern from feat_pr_metric_confidence FR-2 and + hands pre-fetched data to the pure orchestrator — + consumed by studies._detail, the open_pr worker, and + the digest worker) domain/ pure business logic — query/render.py from infra_adapter_elastic; study/{search_space,template_validator, csv_parser}.py from feat_study_lifecycle Phase 2; + study/confidence.py (feat_pr_metric_confidence — + ConfidenceShape Pydantic model + 7 sub-shapes + bootstrap + CI / runner-up gap / late-trial 1σ / convergence regime / + per-query outcome helpers; pure-Python orchestrator + returning None on every FR-7 degraded path); git/{redaction,validation}.py from feat_github_pr_worker (GitHub PAT redaction + repo_url + config_path validators) adapters/ engine adapters — protocol.py (SearchAdapter Protocol + @@ -180,7 +190,10 @@ migrations/ Alembic config + versions/ (0001 baseline + 0002 clusters + 0003 study_lifecycle_schema + 0004_judgments + 0005_digests + 0006 proposals_pr_url_idx + 0007 conversations_messages + 0008–0013 search_vector + GIN indexes from - feat_data_table_primitive) + feat_data_table_primitive + 0014 
clusters_target_filter + from feat_cluster_target_filter + 0015 trials_per_query_metrics + from feat_pr_metric_confidence — nullable JSONB column + + CHECK constraint enforcing IS NULL OR jsonb_typeof = 'object') docs/ 00_overview / 01_architecture / 02_product / 03_runbooks / 04_security / 05_quality / 08_guides ``` diff --git a/backend/app/db/models/trial.py b/backend/app/db/models/trial.py index 445ab55e..d20b7e41 100644 --- a/backend/app/db/models/trial.py +++ b/backend/app/db/models/trial.py @@ -17,11 +17,15 @@ The ``per_query_metrics`` JSONB column (nullable; added by migration ``0015_trials_per_query_metrics`` for feat_pr_metric_confidence) carries the per-query pytrec_eval scores from ``scoring.py::score()``'s ``per_query`` -dict. Shape: ``{query_id: {metric_name: float}}`` where ``metric_name`` is one -of the user-facing names (``ndcg``, ``map``, ``precision``, ``recall``, -``mrr``). The ``trials_per_query_metrics_object_check`` CHECK constraint -enforces NULL-or-object at the DB level (since the write path is the Arq -``run_trial`` worker, not a Pydantic-validated HTTP request). +dict. Shape: ``{query_id: {metric_token: float}}`` where ``metric_token`` is +the user-facing token emitted by :func:`backend.app.eval.scoring.score` — +i.e. ``@``-suffixed for cutoff-aware metrics (``ndcg@10``, ``map@10``, +``precision@10``, ``recall@10``) and bare names for cutoff-free metrics +(``mrr``, plain ``map``). The base name (everything before any ``@``) is +constrained to ``MetricCatalog`` (``ndcg``, ``map``, ``precision``, +``recall``, ``mrr``). The ``trials_per_query_metrics_object_check`` CHECK +constraint enforces NULL-or-object at the DB level (since the write path is +the Arq ``run_trial`` worker, not a Pydantic-validated HTTP request). 
""" from __future__ import annotations diff --git a/backend/app/domain/study/confidence.py b/backend/app/domain/study/confidence.py index 1eb3992b..b149811f 100644 --- a/backend/app/domain/study/confidence.py +++ b/backend/app/domain/study/confidence.py @@ -352,9 +352,20 @@ def classify_convergence_regime( n = len(primary_metrics_by_trial_number) if n < CONVERGENCE_MIN_COMPLETE: return None + # Defense in depth (FR-7 invariant): if the winner trial isn't in the + # complete-trials summary — e.g., a cascade-delete race or a + # best_trial_id pointing at a failed/pruned trial — fall through to + # whole-object None for convergence rather than raising KeyError. + if winner_trial_number not in primary_metrics_by_trial_number: + return None winner_metric = primary_metrics_by_trial_number[winner_trial_number] max_trial_number = max(primary_metrics_by_trial_number.keys()) - total_trials = n + # Spec example: "best at trial 200 of 1000" for a 1000-trial Optuna + # budget. Optuna trial numbers are 0-indexed (0..max), so the budget / + # denominator is max_trial_number + 1. GPT-5.5 review finding #6 fixed + # this — previously total_trials used the count of complete trials, + # which understated the budget for studies with failed/pruned trials. 
+ total_trials = max_trial_number + 1 if winner_trial_number >= LATE_RISING_TRIAL_NUMBER_FRAC * max_trial_number: regime: ConvergenceRegime = "late_rising" diff --git a/backend/tests/integration/test_studies_api_confidence.py b/backend/tests/integration/test_studies_api_confidence.py index d5a03ae3..14706ce8 100644 --- a/backend/tests/integration/test_studies_api_confidence.py +++ b/backend/tests/integration/test_studies_api_confidence.py @@ -203,6 +203,8 @@ async def test_ac3_old_study_returns_partial_confidence_with_aggregate_signals( assert confidence["late_trial_stddev"] is not None assert confidence["convergence"] is not None assert confidence["convergence"]["best_at_trial"] == 0 + # 12 trials at optuna_trial_number 0..11 → max+1 = 12 (matches count when + # trial numbers are sequential, which is the seed shape here). assert confidence["convergence"]["total_trials"] == 12 @@ -395,6 +397,12 @@ async def test_ac8_convergence_regime_early_held(async_client: httpx.AsyncClient assert conv is not None assert conv["regime"] == "early_held" assert conv["best_at_trial"] == 200 + # Sparse trial numbers (0, 100, …, 1000) — 7 complete trials but the + # Optuna budget is 1001 (max trial number 1000 + 1 for 0-indexed + # numbering). GPT-5.5 review finding #6: total_trials must reflect + # the budget, not the count, so the PR body reads "best at trial + # 200 of 1001" rather than "200 of 7". 
+ assert conv["total_trials"] == 1001 # --------------------------------------------------------------------------- diff --git a/backend/tests/integration/test_trials_per_query_metrics_migration.py b/backend/tests/integration/test_trials_per_query_metrics_migration.py index 692283bc..6dbb4425 100644 --- a/backend/tests/integration/test_trials_per_query_metrics_migration.py +++ b/backend/tests/integration/test_trials_per_query_metrics_migration.py @@ -173,6 +173,139 @@ def test_roundtrip_preserves_other_columns(self, restore_head: None) -> None: finally: engine.dispose() + def test_existing_trial_row_stays_null_across_upgrade(self, restore_head: None) -> None: + """A pre-0015 trial row (inserted while at revision 0014) survives the + upgrade to head with ``per_query_metrics IS NULL``. + + GPT-5.5 Epic 1 review finding #9 — proves the no-backfill contract + from Story 1.1: old trial rows aren't touched by the migration, and + the column is added as NULL by default. The orchestrator then routes + these rows through the FR-7 partial-shape path (no CI, no + per_query_outcomes; aggregate signals still compute). + """ + # Land at the pre-migration state so we can seed a row that genuinely + # predates the per_query_metrics column. + _alembic("upgrade", "head") + _alembic("downgrade", "0014") + + engine = create_engine(_sync_database_url(), future=True) + try: + suffix = uuid.uuid4().hex[:8] + cluster_id = str(uuid.uuid4()) + qs_id = str(uuid.uuid4()) + tpl_id = str(uuid.uuid4()) + jl_id = str(uuid.uuid4()) + study_id = str(uuid.uuid4()) + trial_id = str(uuid.uuid4()) + + # Seed the minimum FK chain + one trial without per_query_metrics — + # the column doesn't exist on 0014, so the INSERT uses only the + # pre-0015 column set. 
+ with engine.connect() as conn, conn.begin(): + conn.execute( + text( + "INSERT INTO clusters (id, name, engine_type, environment, " + "base_url, auth_kind, credentials_ref) VALUES " + "(:id, :name, 'elasticsearch', 'dev', " + "'http://elasticsearch:9200', 'es_basic', 'local-es')" + ), + {"id": cluster_id, "name": f"pre-0015-{suffix}"}, + ) + conn.execute( + text("INSERT INTO query_sets (id, name, cluster_id) VALUES (:id, :name, :cid)"), + {"id": qs_id, "name": f"qs-{suffix}", "cid": cluster_id}, + ) + conn.execute( + text( + "INSERT INTO query_templates (id, name, engine_type, body, " + "declared_params) VALUES (:id, :name, 'elasticsearch', " + ":body, :params)" + ), + { + "id": tpl_id, + "name": f"tpl-{suffix}", + "body": '{"query":{"match_all":{}}}', + "params": json.dumps({}), + }, + ) + conn.execute( + text( + "INSERT INTO judgment_lists (id, name, query_set_id, " + "cluster_id, target, rubric, status) VALUES " + "(:id, :name, :qs, :cid, 'idx', 'r', 'complete')" + ), + { + "id": jl_id, + "name": f"jl-{suffix}", + "qs": qs_id, + "cid": cluster_id, + }, + ) + conn.execute( + text( + "INSERT INTO studies (id, name, cluster_id, target, " + "template_id, query_set_id, judgment_list_id, search_space, " + "objective, config, status, optuna_study_name) VALUES " + "(:id, :name, :cid, 'idx', :tpl, :qs, :jl, " + ":space, :obj, :cfg, 'queued', :osn)" + ), + { + "id": study_id, + "name": f"study-{suffix}", + "cid": cluster_id, + "tpl": tpl_id, + "qs": qs_id, + "jl": jl_id, + "space": json.dumps({"params": {}}), + "obj": json.dumps({"metric": "ndcg", "k": 10, "direction": "maximize"}), + "cfg": json.dumps({"max_trials": 1}), + "osn": study_id, + }, + ) + conn.execute( + text( + "INSERT INTO trials (id, study_id, optuna_trial_number, " + "params, metrics, status) VALUES " + "(:id, :sid, 0, :params, :metrics, 'complete')" + ), + { + "id": trial_id, + "sid": study_id, + "params": json.dumps({}), + "metrics": json.dumps({"ndcg": 0.5}), + }, + ) + finally: + engine.dispose() + 
+ # Apply 0015. The column is added with NO backfill — the seeded row + # should still exist with per_query_metrics NULL. + _alembic("upgrade", "head") + + engine = create_engine(_sync_database_url(), future=True) + try: + with engine.connect() as conn: + row = conn.execute( + text("SELECT per_query_metrics FROM trials WHERE id = :id"), + {"id": trial_id}, + ).fetchone() + assert row is not None, "pre-0015 trial row was lost across the upgrade" + assert row[0] is None, ( + f"per_query_metrics should be NULL on a pre-0015 row, got {row[0]!r}" + ) + + # Cleanup the seeded FK chain (best-effort) in a fresh connection so + # we don't collide with the SELECT's autobegun transaction above. + with engine.connect() as conn, conn.begin(): + conn.execute(text("DELETE FROM trials WHERE id = :id"), {"id": trial_id}) + conn.execute(text("DELETE FROM studies WHERE id = :id"), {"id": study_id}) + conn.execute(text("DELETE FROM judgment_lists WHERE id = :id"), {"id": jl_id}) + conn.execute(text("DELETE FROM query_templates WHERE id = :id"), {"id": tpl_id}) + conn.execute(text("DELETE FROM query_sets WHERE id = :id"), {"id": qs_id}) + conn.execute(text("DELETE FROM clusters WHERE id = :id"), {"id": cluster_id}) + finally: + engine.dispose() + def test_check_constraint_rejects_non_object_jsonb(self, restore_head: None) -> None: """``per_query_metrics`` must be NULL or a JSON object — arrays, scalars, and booleans MUST be rejected by the CHECK constraint. 
diff --git a/backend/tests/unit/domain/study/test_confidence.py b/backend/tests/unit/domain/study/test_confidence.py index 3faf7bc3..738adaa2 100644 --- a/backend/tests/unit/domain/study/test_confidence.py +++ b/backend/tests/unit/domain/study/test_confidence.py @@ -183,7 +183,13 @@ def test_early_held_when_late_trial_within_band(self) -> None: assert result is not None assert result.regime == "early_held" assert result.best_at_trial == 200 - assert result.total_trials == len(metrics_by_trial) + # GPT-5.5 review finding #6 fix: total_trials is the Optuna budget + # (max_trial_number + 1 for 0-indexed trials), not the count of + # complete trials. For sparse trial-number distributions + # (failed/pruned trials thinning the dict), the budget interpretation + # is what the operator reads in the PR body / panel + # ("best at trial 200 of 1001"). + assert result.total_trials == max(metrics_by_trial.keys()) + 1 def test_late_rising_at_90pct(self) -> None: """AC-9: winner at 95% → late_rising.""" diff --git a/docs/02_product/planned_features/feat_pr_metric_confidence/implementation_plan.md b/docs/02_product/planned_features/feat_pr_metric_confidence/implementation_plan.md index 0eee02c2..ac90f0d2 100644 --- a/docs/02_product/planned_features/feat_pr_metric_confidence/implementation_plan.md +++ b/docs/02_product/planned_features/feat_pr_metric_confidence/implementation_plan.md @@ -1056,7 +1056,7 @@ None planned. The feature is purely additive across all surfaces. - [x] Story 1.4 — `ConfidenceShape` + StudyDetail enrichment - [x] Story 1.5 — PR body section + worker plumbing - [x] Story 1.6 — Digest narrative prompt extension -- [ ] **Epic 1 gate** +- [x] **Epic 1 gate** - [ ] Story 2.1 — TypeScript types + enums - [ ] Story 2.2 — `` component + glossary + page mount - [ ] Story 2.3 — Playwright E2E diff --git a/state.md b/state.md index c6de07b0..20e005d9 100644 --- a/state.md +++ b/state.md @@ -2,17 +2,17 @@ > Read this first. 
Snapshots the active branch, what just shipped, what's in flight, what's queued, and where the project currently sits in the MVP1 → GA roadmap. Updated whenever a feature lands or a priority shifts. -**Last updated:** 2026-05-21 (after `feat_agent_propose_search_space` shipped as PR #175 squash `5d29355`). **21st MVP1 feature merged** — 10 stories across 5 epics, all complete. New read-only agent tool `propose_search_space` (the 20th in the registry) builds a deterministic starter search space from a template's `declared_params` using the same heuristic that powers the create-study wizard's auto-fill — a Python port (`backend/app/domain/study/search_space_defaults.py`) of `ui/src/lib/search-space-defaults.ts` with a TS↔Python parity test driven by a shared JSON fixture (18 rows, byte-identical assertions on both sides). Cap-aware overflow guard added on both Python AND TS sides (fixes a latent bug where TS silently returned an invalid space when 8+ fall-through floats blew past 10⁶). Optional `prior_study_id` arg narrows numeric bounds via `winner ± |winner| × bracket` for sign-symmetric math (Gemini #1/#2 fix) with `bracket` threaded through the linear paths (Gemini #3 fix); log-uniform stays at √2. Graceful degrade on template mismatch + missing trial row + non-numeric winner — emits WARN logs (`agent.propose_search_space.prior_template_mismatch` / `.missing_winner_trial`). `ToolContext` gained `conversation_id: str` plumbed from `orchestrator.run_turn` for paired adherence telemetry — INFO events `agent.search_space_proposed` (propose-side) + `agent.create_study.invoked` (create-side) correlate offline by conversation_id per spec FR-6 (grep recipe in `docs/03_runbooks/agent-debugging.md` §5). New `repo.get_trial(db, trial_id)` parallels `repo.get_study`. System prompt updated: 19→20 tools, "Studies (4)" with `propose_search_space` first, new chain-guidance bullet. 
`ProposeSearchSpaceArgs` uses `ConfigDict(extra="forbid")` (GPT-5.5 F6 fix) so hallucinated LLM args fail Pydantic validation loudly. Spec converged at GPT-5.5 cycle 3 (19 findings, all accepted); plan converged at cycle 3 (8 findings, all accepted). Post-merge review: Gemini 3 findings all fixed in `642b5b9`; GPT-5.5 final review 6 findings — 1 fixed in `945e833`, 1 deferred (structlog migration), 4 rejected with cited counter-evidence (truncated-diff false positives). Tests: 1000 backend unit pass (+87 new cases) + 19 Python parity + 19 TS parity; 38 TS lib + 66 modal still green. Alembic head unchanged at `0014_clusters_target_filter` — feature is purely additive at the application layer. Earlier 2026-05-20 (after `feat_cluster_target_filter` shipped as PR #168 squash `57d3ba0` + follow-up `chore_seed_meaningful_demos` shipped as PR #169 squash `c44d774`). **20th MVP1 feature merged** + demo-state durability gap closed in the same session. PR #168: 5 stories (B1 migration 0014 + ORM column; B3 Pydantic + service plumb-through + responses; B2 adapter Protocol + ElasticAdapter + StubAdapter + router; F1 register modal Target filter input; F2 create-study modal filter-aware empty-state + EntitySelect accessibility improvement). Plus 4 post-impl fix commits (test_migrations head bump, register modal overflow-y-auto, EntitySelect sr-only Gemini fix, spec drift cleanup + OpenAPI shape-lock contract test from GPT-5.5 final review). PR #169: `scripts/seed_meaningful_demos.py` + `make seed-demo` target (idempotent: TRUNCATE clusters CASCADE + DELETE matching ES/OS indices + reseed with per-cluster `target_filter` values baked in — closes the gap where integration tests kept wiping the dev DB with no durable reseed mechanism). 529/529 vitest across 79 files (was 525/78), 903 backend unit tests (was 899), 50 cluster-API integration tests (was 45) + 3 new migration round-trip tests + 7 contract validator cases + OpenAPI shape-lock test. 
**Alembic head moved to `0014_clusters_target_filter`.** Cross-model review pre-impl: spec + plan both converged at GPT-5.5 cycle 2 (12 findings total, all accepted). Post-impl: Gemini Code Assist 3 findings (2 accepted: EntitySelect sr-only on #168, http() auth type hint on #169; 1 rejected with cited counter-evidence: out-of-scope test file from #168). GPT-5.5 final review on #168: 2 findings, both accepted (spec drift + OpenAPI shape-lock). **Process feedback captured:** `.claude/projects/.../memory/feedback_one_branch_per_session.md` — should have bundled the seed chore into PR #168 rather than spinning a sibling PR. End-to-end smoke verified live before both merges. Earlier 2026-05-20 (after `feat_create_study_target_autocomplete` shipped as PR #165 squash commit `bd4516a` — 19th MVP1 feature. Earlier 2026-05-20 (after `feat_create_study_target_autocomplete` shipped as PR #165 squash commit `bd4516a` — 19th MVP1 feature. Bundled the `get_schema` + `explain` connect-error fix per `bug_get_schema_unhandled_connect_error` in the same PR. 525/525 vitest across 78 files, 33 adapter unit tests + contract suite + integration tests all green twice (initial + post-cycle-2). Gemini Code Assist: 1 finding rejected with cited counter-evidence (pre-existing list-shape assumption matches the wire contract). GPT-5.5 final review: 2 findings — 1 accepted in `19d9d51` (contract-layer TARGETS_FORBIDDEN + CLUSTER_UNREACHABLE envelope assertions), 1 deferred with counter-evidence (dropdown E2E `test.skip`'d; AC coverage satisfied by 8 hook unit + 6 modal unit + integration + contract tests). Two follow-up ideas filed in-PR: `bug_e2e_target_dropdown_flake` + `chore_guide_06_screenshot_refresh_target_picker`.) Earlier — same day (after `feat_create_study_search_space_builder` shipped as PR #163 squash commit `c703953`, bundling the search-space builder feature + the `bug_judgment_lists_listing_ignores_query_set_filter` backend fix surfaced during local verification. 
18th MVP1 feature. The builder + bug-fix bundle reflects the single-developer series workflow: rather than spin a sibling backend PR off `main`, the bug fix landed in the same branch since the dev was already in verification mode. PR #163 went through 3 spec cycles (16 findings) + 3 plan cycles (27 findings) + 3 Gemini Code Assist findings + 2 GPT-5.5 final-review passes (1 second-pass Low finding accepted on test coverage) = 47 review findings all accepted with cited fixes. 512 vitest assertions across 77 files, 4 real-backend Playwright e2e cases against the builder, 2 new backend tests for the bundled filter fix. Two follow-up idea files captured during local verification: `feat_create_study_target_autocomplete` (Step-1 free-text target field has no autocomplete from cluster indexes — pre-existing UX debt deferred) and the now-closed `bug_judgment_lists_listing_ignores_query_set_filter` (bundled into this PR).) Earlier (also 2026-05-20) — PR #161 `0879df2` `chore_create_study_modal_e2e_stability` (un-skipped the deferred Playwright spec via `dispatchEvent('click')` on the Radix trigger), PR #160 `160ff6b` `bug_err_metric_frontend_backend_drift` (wire-enum trim — `err` removed from frontend + backend Literal), PR #159 `52e106d` `bug_tutorial_template_param_boost_naming` (heuristic extension for `_boost` suffix). Earlier (also 2026-05-20) — PR #157 `chore_create_study_wizard_polish` — squash commit `075c46b` — merged into `main`. 
Ships the 4-surface chore: backend template-mismatch validation at create time (two new error codes `SEARCH_SPACE_UNKNOWN_PARAM` + `SEARCH_SPACE_MISSING_DECLARED_PARAM`), Step-4 auto-fill via the new `ui/src/lib/search-space-defaults.ts` heuristic + cap-aware fallback + TS↔Python cardinality parity fixture, 4 new `study.search_space.*` glossary entries (one dual + three short-only) and 6 extended per-metric entries with k-tier clauses, Step-5 tri-state metric+k rendering with new `K_IGNORED` predicate, plus client-side validation mirror + zero-declared block + 404/transient template-fetch recovery + `__placeholder__` warning. 16 new test files + 2 modified + 1 shared JSON fixture across backend unit/integration/contract + frontend unit/component + 1 skipped E2E. Three follow-up ideas captured: `bug_tutorial_template_param_boost_naming` (tutorial template uses `_boost` suffix not matched by the locked heuristic), `chore_create_study_modal_e2e_stability` (re-enable the skipped Playwright spec once EntitySelect disabled gating stabilizes), `bug_err_metric_frontend_backend_drift` (`err` selectable in wizard but unsupported by `scoring.py`). Gemini Code Assist + GPT-5.5 final-pass both adjudicated on the PR — 2 Gemini findings + 7 GPT-5.5 findings, all addressed or filed.) Earlier 2026-05-19 (after a 4-PR shipping run drained the actionable post-MVP1 chore backlog: PR #152 `chore_ci_prettier_check` (`476db78`) + PR #153 `chore_extract_shadcn_select_test_mock` (`199e225`) + PR #154 `chore_form_dropdown_guide_screenshot_refresh` (`ed4121f`) + PR #155 `chore_detail_page_shell_primitive` (`9a72514`). PR #155 is the third primitive after `` and `` — 6 detail-page migrations + new lint guard + flattens a latent UX bug where only `proposals/[id]` discriminated 404 from network error. 
Earlier the same session: PR #150 (`chore_data_table_columnvisibility_tanstack`, `c1e4545`) — closes the residual DataTable follow-ups: item 5 migrates the primitive from `columns.filter(...)` to TanStack's `state.columnVisibility` API (memoized per Gemini feedback), item 3 locked the flat-prop `DataTableProps` API as canonical with a "Shipped contract addendum" on the historical implementation plan's Story 2.6. Folder renamed `chore_data_table_primitive_followups` → `chore_data_table_columnvisibility_tanstack`. Earlier 2026-05-19 PR #148 (`infra_e2e_wire_seed_helper_into_studies_spec`, squash `65f4150`) — restored the 2 digest-panel E2E tests deferred from PR #130, diagnosed and fixed the real root cause of the original smoke-lane failure (`GET /api/v1/proposals` was silently ignoring the `?study_id=` filter, returning the most-recent global pending proposal), added 5-case integration regression coverage at `backend/tests/integration/test_proposals_study_filter.py`. Plus: (a) earlier 2026-05-18 PR #146 (`bug_install_skip_ui_rebuild`, squash `7299fca`) made `make up` rebuild every Compose service (`docker compose build` no-args), switched `make down` to `docker compose down`, and added a `verify_install_builds_all_services.sh` CI gate to lock the contract; (b) earlier 2026-05-18 PR #147 captured `chore_detail_page_shell_primitive` idea (squash `8854e47`). Two new follow-ups filed: `chore_ci_prettier_check` (CI's frontend job has no `prettier --check` step — surfaced when PR #136 drift in 2 unrelated files blocked an unrelated commit) and the in-flight `chore_detail_page_shell_primitive` (third primitive after DataTable + EntitySelect).) +**Last updated:** 2026-05-21 (after `feat_pr_metric_confidence` Epic 1 landed locally on the `feat_pr_metric_confidence` branch — backend persistence + analytics + PR-body + digest-prompt surfaces complete, Epic 2 frontend ConfidencePanel ahead. 
Migration `0015_trials_per_query_metrics` adds the nullable JSONB column behind a CHECK constraint; new pure-Python `backend/app/domain/study/confidence.py` owns bootstrap CI + runner-up gap + late-trial noise floor + convergence regime + per-query outcome classification under FR-7's graceful-degradation contract; new `backend/app/services/study_confidence.py` glues the 4-query read pattern onto the orchestrator and is consumed from `studies._detail()`, the `open_pr` worker, and the digest worker. GPT-5.5 cycle-1 review found 12 issues — 5 rejected as truncated-diff false positives, 2 deferred (plan/code interface drift; full-worker integration test deferred to feat_github_pr_worker's existing suite), 5 accepted + fixed inline (convergence `total_trials = max_trial_number + 1` instead of count; convergence KeyError guard when winner not in summary; pre-existing-row-stays-NULL migration test; Trial model docstring drift on metric key shape; state + architecture docs). 1039 backend unit tests pass (+5 digest prompt cases, +1 convergence assertion), 189 contract, 527/527 in-container integration. Prior — after `feat_agent_propose_search_space` shipped as PR #175 squash `5d29355`). **21st MVP1 feature merged** — 10 stories across 5 epics, all complete. New read-only agent tool `propose_search_space` (the 20th in the registry) builds a deterministic starter search space from a template's `declared_params` using the same heuristic that powers the create-study wizard's auto-fill — a Python port (`backend/app/domain/study/search_space_defaults.py`) of `ui/src/lib/search-space-defaults.ts` with a TS↔Python parity test driven by a shared JSON fixture (18 rows, byte-identical assertions on both sides). Cap-aware overflow guard added on both Python AND TS sides (fixes a latent bug where TS silently returned an invalid space when 8+ fall-through floats blew past 10⁶). 
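
As a rough illustration of the convergence call-out mentioned above, here is a minimal Python sketch. It is not the shipped `confidence.py`. It assumes zero-based trial numbers, takes `total_trials = max_trial_number + 1` as in the accepted fix, returns `None` when the winner is missing, and borrows the regime definitions from the feature's glossary text (best in the first half with a late trial within 0.005, or best in the last 10%). The function name, the argument shape, and the exact boundary comparisons are assumptions.

```python
from typing import Optional

PLATEAU_TOLERANCE = 0.005  # "within 0.005 of the winner" per the glossary text


def classify_convergence(scores: dict[int, float], winner: int) -> Optional[str]:
    """Classify the convergence regime (hypothetical helper, not the real API).

    `scores` maps zero-based trial number -> primary metric for complete trials.
    """
    if winner not in scores:
        return None  # guard instead of raising KeyError

    # Use the highest trial number, not len(scores): the sketch assumes
    # trial numbering may have gaps, so a plain count could undercount.
    total_trials = max(scores) + 1
    best = scores[winner]

    # Late-rising: the best trial sits in the last 10% of the budget.
    if winner >= 0.9 * total_trials:
        return "late_rising"

    # Plateau held: some trial in the last 25% finished within tolerance.
    held = any(
        number >= 0.75 * total_trials and best - value <= PLATEAU_TOLERANCE
        for number, value in scores.items()
    )
    if winner < 0.5 * total_trials and held:
        return "early_held"
    return "noisy"
```

Returning `None` rather than raising mirrors the graceful-degradation idea: a missing winner suppresses only this sub-field.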
Optional `prior_study_id` arg narrows numeric bounds via `winner ± |winner| × bracket` for sign-symmetric math (Gemini #1/#2 fix) with `bracket` threaded through the linear paths (Gemini #3 fix); log-uniform stays at √2. Graceful degrade on template mismatch + missing trial row + non-numeric winner — emits WARN logs (`agent.propose_search_space.prior_template_mismatch` / `.missing_winner_trial`). `ToolContext` gained `conversation_id: str` plumbed from `orchestrator.run_turn` for paired adherence telemetry — INFO events `agent.search_space_proposed` (propose-side) + `agent.create_study.invoked` (create-side) correlate offline by conversation_id per spec FR-6 (grep recipe in `docs/03_runbooks/agent-debugging.md` §5). New `repo.get_trial(db, trial_id)` parallels `repo.get_study`. System prompt updated: 19→20 tools, "Studies (4)" with `propose_search_space` first, new chain-guidance bullet. `ProposeSearchSpaceArgs` uses `ConfigDict(extra="forbid")` (GPT-5.5 F6 fix) so hallucinated LLM args fail Pydantic validation loudly. Spec converged at GPT-5.5 cycle 3 (19 findings, all accepted); plan converged at cycle 3 (8 findings, all accepted). Post-merge review: Gemini 3 findings all fixed in `642b5b9`; GPT-5.5 final review 6 findings — 1 fixed in `945e833`, 1 deferred (structlog migration), 4 rejected with cited counter-evidence (truncated-diff false positives). Tests: 1000 backend unit pass (+87 new cases) + 19 Python parity + 19 TS parity; 38 TS lib + 66 modal still green. Alembic head unchanged at `0014_clusters_target_filter` — feature is purely additive at the application layer. Earlier 2026-05-20 (after `feat_cluster_target_filter` shipped as PR #168 squash `57d3ba0` + follow-up `chore_seed_meaningful_demos` shipped as PR #169 squash `c44d774`). **20th MVP1 feature merged** + demo-state durability gap closed in the same session. 
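
The `prior_study_id` narrowing above (`winner ± |winner| × bracket`) can be sketched roughly as follows. This is an illustrative Python sketch, not the shipped `propose_search_space` code: the function name, the default `bracket` value, and the clamping to the original bounds are assumptions, and the log-uniform path is left out.

```python
def narrow_bounds(
    winner: float, low: float, high: float, bracket: float = 0.25
) -> tuple[float, float]:
    """Narrow a linear numeric range around a prior winner (hypothetical helper).

    The half-width uses abs(winner) so the math stays sign-symmetric:
    with a plain `winner * bracket`, a negative winner would produce
    inverted bounds (new low above new high).
    """
    half_width = abs(winner) * bracket
    # Assumption: never widen past the originally declared range.
    new_low = max(low, winner - half_width)
    new_high = min(high, winner + half_width)
    return new_low, new_high
```

Note that a winner of exactly `0` collapses the range to a single point under this formula; a real implementation would need some fallback for that case.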
PR #168: 5 stories (B1 migration 0014 + ORM column; B3 Pydantic + service plumb-through + responses; B2 adapter Protocol + ElasticAdapter + StubAdapter + router; F1 register modal Target filter input; F2 create-study modal filter-aware empty-state + EntitySelect accessibility improvement). Plus 4 post-impl fix commits (test_migrations head bump, register modal overflow-y-auto, EntitySelect sr-only Gemini fix, spec drift cleanup + OpenAPI shape-lock contract test from GPT-5.5 final review). PR #169: `scripts/seed_meaningful_demos.py` + `make seed-demo` target (idempotent: TRUNCATE clusters CASCADE + DELETE matching ES/OS indices + reseed with per-cluster `target_filter` values baked in — closes the gap where integration tests kept wiping the dev DB with no durable reseed mechanism). 529/529 vitest across 79 files (was 525/78), 903 backend unit tests (was 899), 50 cluster-API integration tests (was 45) + 3 new migration round-trip tests + 7 contract validator cases + OpenAPI shape-lock test. **Alembic head moved to `0014_clusters_target_filter`.** Cross-model review pre-impl: spec + plan both converged at GPT-5.5 cycle 2 (12 findings total, all accepted). Post-impl: Gemini Code Assist 3 findings (2 accepted: EntitySelect sr-only on #168, http() auth type hint on #169; 1 rejected with cited counter-evidence: out-of-scope test file from #168). GPT-5.5 final review on #168: 2 findings, both accepted (spec drift + OpenAPI shape-lock). **Process feedback captured:** `.claude/projects/.../memory/feedback_one_branch_per_session.md` — should have bundled the seed chore into PR #168 rather than spinning a sibling PR. End-to-end smoke verified live before both merges. Earlier 2026-05-20 (after `feat_create_study_target_autocomplete` shipped as PR #165 squash commit `bd4516a` — 19th MVP1 feature. 
Bundled the `get_schema` + `explain` connect-error fix per `bug_get_schema_unhandled_connect_error` in the same PR. 525/525 vitest across 78 files, 33 adapter unit tests + contract suite + integration tests all green twice (initial + post-cycle-2). Gemini Code Assist: 1 finding rejected with cited counter-evidence (pre-existing list-shape assumption matches the wire contract). GPT-5.5 final review: 2 findings — 1 accepted in `19d9d51` (contract-layer TARGETS_FORBIDDEN + CLUSTER_UNREACHABLE envelope assertions), 1 deferred with counter-evidence (dropdown E2E `test.skip`'d; AC coverage satisfied by 8 hook unit + 6 modal unit + integration + contract tests). Two follow-up ideas filed in-PR: `bug_e2e_target_dropdown_flake` + `chore_guide_06_screenshot_refresh_target_picker`.) Earlier — same day (after `feat_create_study_search_space_builder` shipped as PR #163 squash commit `c703953`, bundling the search-space builder feature + the `bug_judgment_lists_listing_ignores_query_set_filter` backend fix surfaced during local verification. 18th MVP1 feature. The builder + bug-fix bundle reflects the single-developer series workflow: rather than spin a sibling backend PR off `main`, the bug fix landed in the same branch since the dev was already in verification mode. PR #163 went through 3 spec cycles (16 findings) + 3 plan cycles (27 findings) + 3 Gemini Code Assist findings + 2 GPT-5.5 final-review passes (1 second-pass Low finding accepted on test coverage) = 47 review findings all accepted with cited fixes. 512 vitest assertions across 77 files, 4 real-backend Playwright e2e cases against the builder, 2 new backend tests for the bundled filter fix. Two follow-up idea files captured during local verification: `feat_create_study_target_autocomplete` (Step-1 free-text target field has no autocomplete from cluster indexes — pre-existing UX debt deferred) and the now-closed `bug_judgment_lists_listing_ignores_query_set_filter` (bundled into this PR).) 
Earlier (also 2026-05-20) — PR #161 `0879df2` `chore_create_study_modal_e2e_stability` (un-skipped the deferred Playwright spec via `dispatchEvent('click')` on the Radix trigger), PR #160 `160ff6b` `bug_err_metric_frontend_backend_drift` (wire-enum trim — `err` removed from frontend + backend Literal), PR #159 `52e106d` `bug_tutorial_template_param_boost_naming` (heuristic extension for `_boost` suffix). Earlier (also 2026-05-20) — PR #157 `chore_create_study_wizard_polish` — squash commit `075c46b` — merged into `main`. Ships the 4-surface chore: backend template-mismatch validation at create time (two new error codes `SEARCH_SPACE_UNKNOWN_PARAM` + `SEARCH_SPACE_MISSING_DECLARED_PARAM`), Step-4 auto-fill via the new `ui/src/lib/search-space-defaults.ts` heuristic + cap-aware fallback + TS↔Python cardinality parity fixture, 4 new `study.search_space.*` glossary entries (one dual + three short-only) and 6 extended per-metric entries with k-tier clauses, Step-5 tri-state metric+k rendering with new `K_IGNORED` predicate, plus client-side validation mirror + zero-declared block + 404/transient template-fetch recovery + `__placeholder__` warning. 16 new test files + 2 modified + 1 shared JSON fixture across backend unit/integration/contract + frontend unit/component + 1 skipped E2E. Three follow-up ideas captured: `bug_tutorial_template_param_boost_naming` (tutorial template uses `_boost` suffix not matched by the locked heuristic), `chore_create_study_modal_e2e_stability` (re-enable the skipped Playwright spec once EntitySelect disabled gating stabilizes), `bug_err_metric_frontend_backend_drift` (`err` selectable in wizard but unsupported by `scoring.py`). Gemini Code Assist + GPT-5.5 final-pass both adjudicated on the PR — 2 Gemini findings + 7 GPT-5.5 findings, all addressed or filed.) 
Earlier 2026-05-19 (after a 4-PR shipping run drained the actionable post-MVP1 chore backlog: PR #152 `chore_ci_prettier_check` (`476db78`) + PR #153 `chore_extract_shadcn_select_test_mock` (`199e225`) + PR #154 `chore_form_dropdown_guide_screenshot_refresh` (`ed4121f`) + PR #155 `chore_detail_page_shell_primitive` (`9a72514`). PR #155 is the third primitive after `<DataTable>` and `<EntitySelect>` — 6 detail-page migrations + new lint guard + flattens a latent UX bug where only `proposals/[id]` discriminated 404 from network error. Earlier the same session: PR #150 (`chore_data_table_columnvisibility_tanstack`, `c1e4545`) — closes the residual DataTable follow-ups: item 5 migrates the primitive from `columns.filter(...)` to TanStack's `state.columnVisibility` API (memoized per Gemini feedback), item 3 locked the flat-prop `DataTableProps` API as canonical with a "Shipped contract addendum" on the historical implementation plan's Story 2.6. Folder renamed `chore_data_table_primitive_followups` → `chore_data_table_columnvisibility_tanstack`. Earlier 2026-05-19 PR #148 (`infra_e2e_wire_seed_helper_into_studies_spec`, squash `65f4150`) — restored the 2 digest-panel E2E tests deferred from PR #130, diagnosed and fixed the real root cause of the original smoke-lane failure (`GET /api/v1/proposals` was silently ignoring the `?study_id=` filter, returning the most-recent global pending proposal), added 5-case integration regression coverage at `backend/tests/integration/test_proposals_study_filter.py`. Plus: (a) earlier 2026-05-18 PR #146 (`bug_install_skip_ui_rebuild`, squash `7299fca`) made `make up` rebuild every Compose service (`docker compose build` no-args), switched `make down` to `docker compose down`, and added a `verify_install_builds_all_services.sh` CI gate to lock the contract; (b) earlier 2026-05-18 PR #147 captured `chore_detail_page_shell_primitive` idea (squash `8854e47`). 
Two new follow-ups filed: `chore_ci_prettier_check` (CI's frontend job has no `prettier --check` step — surfaced when PR #136 drift in 2 unrelated files blocked an unrelated commit) and the in-flight `chore_detail_page_shell_primitive` (third primitive after DataTable + EntitySelect).) --- ## Current branch / execution context -- **Branch:** `docs/finalize-agent-propose-search-space` — finalization docs PR after PR #175 (`5d29355`) merged 2026-05-21. `feature/agent-propose-search-space` deleted post-merge. Earlier: `docs/finalize-cluster-target-filter` — finalization docs PR after PR #168 (`57d3ba0`) + PR #169 (`c44d774`) both merged. Prior `main` post-merge of PR #168 squash `57d3ba0` (`feat_cluster_target_filter`) + PR #169 squash `c44d774` (`chore_seed_meaningful_demos`) 2026-05-20. Earlier: PR #165 squash commit `bd4516a` 2026-05-20. Finalization docs branch `docs/finalize-create-study-target-autocomplete`. Prior squash same day: PR #163 `c703953` (`feat_create_study_search_space_builder`). Finalization docs PR off `docs/finalize-create-study-search-space-builder`. Prior squashes (same day): PR #161 `0879df2` (`chore_create_study_modal_e2e_stability`), PR #160 `160ff6b` (`bug_err_metric_frontend_backend_drift`), PR #159 `52e106d` (`bug_tutorial_template_param_boost_naming`), PR #158 `308c315` (finalize chore_create_study_wizard_polish), PR #157 `075c46b` (`chore_create_study_wizard_polish`). Prior squash: PR #155 `9a72514` 2026-05-19. 
Prior squashes: PR #154 `ed4121f` 2026-05-19 (`chore_form_dropdown_guide_screenshot_refresh`), PR #153 `199e225` 2026-05-19 (`chore_extract_shadcn_select_test_mock`), PR #152 `476db78` 2026-05-19 (`chore_ci_prettier_check`), PR #151 `110dc5a` 2026-05-19 (finalize chore_data_table_columnvisibility_tanstack), PR #150 `c1e4545` 2026-05-19 (`chore_data_table_columnvisibility_tanstack`), PR #149 `da9506b` 2026-05-19 (finalize infra_e2e_wire_seed_helper_into_studies_spec), PR #148 `65f4150` 2026-05-19 (`infra_e2e_wire_seed_helper_into_studies_spec` — `?study_id=` filter bug + E2E test restore), PR #147 `8854e47` 2026-05-18 (capture chore_detail_page_shell_primitive idea), PR #146 `7299fca` 2026-05-18 (bug_install_skip_ui_rebuild — `make up`/`make down` lifecycle fix), PR #136 `cb7d9ee` 2026-05-18 (chore_form_dropdown_primitive), PR #132 `ee4c8d4` 2026-05-17 (chore_data_table_primitive_followups items 1+2+4+6), PR #130 `13b3383` 2026-05-17 (infra_e2e_seed_completed_study), PR #128 `73459d2` 2026-05-17 (bug_cursor_decode_value_validation), PR #126 `d6115b3` 2026-05-16 (feat_data_table_primitive). `v0.1.0` annotated tag still on `main` commit `d099536` 2026-05-13; GitHub Release at https://github.com/SoundMindsAI/relyloop/releases/tag/v0.1.0. -- **Active feature:** none in flight (PR #175 closed `feat_agent_propose_search_space` on 2026-05-21; only finalization docs PR remains for the 21st MVP1 feature). Prior — none in flight (PR #168 closed `feat_cluster_target_filter` + PR #169 closed `chore_seed_meaningful_demos` on 2026-05-20; only finalization docs PR remains for the 20th MVP1 feature). Prior — none in flight (PR #165 closed `feat_create_study_target_autocomplete` + the bundled `bug_get_schema_unhandled_connect_error` fix on 2026-05-20). Prior — none in flight (PR #163 closed `feat_create_study_search_space_builder` + the `bug_judgment_lists_listing_ignores_query_set_filter` bundled fix on 2026-05-20). 
PR #168 closed `feat_cluster_target_filter` + PR #169 closed `chore_seed_meaningful_demos` (sibling). **Three PRs shipped 2026-05-15:** PR #122 (Phase 1, 16th MVP1 feature — Tooltip primitive + 26 placements on create-study modal + study detail), PR #123 (Phase 1 finalization docs), PR #124 (Phases 2 + 3 — 17th MVP1 feature; 21 additional tooltips on judgments + proposals + cluster registration + 2 new first-run components: chat ExamplePrompts strip + Stripe-style StartHereChecklist on home page). The original "MVP1 Phase 1 only" scope-lock was reversed mid-day: operator decided to ship Phases 2 + 3 together with a Stripe-style design call rather than wait for MVP2. PR #124 took 2 hours from idea-folder reuse to merge. 47 total tooltip placements + 2 new first-run components live in `main`. **PR #122 shipped 2026-05-15 morning** — `feat_contextual_help` Phase 1 (16th MVP1 feature). Adds the first Tooltip primitive (`@radix-ui/react-tooltip@~1.2.8` + shadcn-style wrapper at `ui/src/components/ui/tooltip.tsx`), two glossary-backed wrappers (`InfoTooltip` standalone + asChild modes; `HelpPopover` click-to-open with `react-markdown` safety filter), and a 49-key glossary source-of-truth at `ui/src/lib/glossary.ts` (8 enum groups parity-tested against `enums.ts`). 26 tooltip placements across the create-study modal (Step 1 target + Step 3 template + 9 Step 5 inputs), study-header (status badge dynamic key + Best metric + Trials), trials-table (5 column headers + Sort label), and digest panel (5 section labels + Open PR enabled + Open PR disabled). The disabled Open PR button refactored from native `disabled` to `aria-disabled="true"` so it stays focusable and the tooltip reveals on focus (AC-11). Gemini Code Assist: 2 findings (1 accepted + fixed, 1 rejected with cited counter-evidence). Final GPT-5.5 review: 1 Medium accepted-framing-but-deferred. 
Spec converged at GPT-5.5 cycle 3 (24 findings, 23 accepted + 1 rejected); plan converged at cycle 2 (12 findings, 10 accepted + 1 rejected + 1 spec patch). UI vitest now **279 passing across 48 files** (was 249 across 45 — +3 new test files, +30 cases). Playwright E2E **8 passing** (was 5 — +3 new contextual-help tests). One follow-up filed: `infra_e2e_seed_completed_study/idea.md` tracks the E2E gap for digest-panel triggers + AC-11 (cross-subsystem helper for seeding a completed study with digest + proposal; component-level coverage is in place). Phases 2 + 3 deferred to MVP2 via `feat_contextual_help_mvp2/` (judgments + proposals tooltips; chat + cluster + home onboarding; the home-page "Start here" panel is the only product-design-shaped item). +- **Branch:** `feat_pr_metric_confidence` — Epic 1 (backend) complete locally; Epic 2 (frontend ConfidencePanel + 6 confidence glossary entries) ahead. 8 commits ahead of `origin/feat_pr_metric_confidence`. Earlier: `docs/finalize-agent-propose-search-space` — finalization docs PR after PR #175 (`5d29355`) merged 2026-05-21. `feature/agent-propose-search-space` deleted post-merge. Earlier: `docs/finalize-cluster-target-filter` — finalization docs PR after PR #168 (`57d3ba0`) + PR #169 (`c44d774`) both merged. Prior `main` post-merge of PR #168 squash `57d3ba0` (`feat_cluster_target_filter`) + PR #169 squash `c44d774` (`chore_seed_meaningful_demos`) 2026-05-20. Earlier: PR #165 squash commit `bd4516a` 2026-05-20. Finalization docs branch `docs/finalize-create-study-target-autocomplete`. Prior squash same day: PR #163 `c703953` (`feat_create_study_search_space_builder`). Finalization docs PR off `docs/finalize-create-study-search-space-builder`. 
Prior squashes (same day): PR #161 `0879df2` (`chore_create_study_modal_e2e_stability`), PR #160 `160ff6b` (`bug_err_metric_frontend_backend_drift`), PR #159 `52e106d` (`bug_tutorial_template_param_boost_naming`), PR #158 `308c315` (finalize chore_create_study_wizard_polish), PR #157 `075c46b` (`chore_create_study_wizard_polish`). Prior squash: PR #155 `9a72514` 2026-05-19. Prior squashes: PR #154 `ed4121f` 2026-05-19 (`chore_form_dropdown_guide_screenshot_refresh`), PR #153 `199e225` 2026-05-19 (`chore_extract_shadcn_select_test_mock`), PR #152 `476db78` 2026-05-19 (`chore_ci_prettier_check`), PR #151 `110dc5a` 2026-05-19 (finalize chore_data_table_columnvisibility_tanstack), PR #150 `c1e4545` 2026-05-19 (`chore_data_table_columnvisibility_tanstack`), PR #149 `da9506b` 2026-05-19 (finalize infra_e2e_wire_seed_helper_into_studies_spec), PR #148 `65f4150` 2026-05-19 (`infra_e2e_wire_seed_helper_into_studies_spec` — `?study_id=` filter bug + E2E test restore), PR #147 `8854e47` 2026-05-18 (capture chore_detail_page_shell_primitive idea), PR #146 `7299fca` 2026-05-18 (bug_install_skip_ui_rebuild — `make up`/`make down` lifecycle fix), PR #136 `cb7d9ee` 2026-05-18 (chore_form_dropdown_primitive), PR #132 `ee4c8d4` 2026-05-17 (chore_data_table_primitive_followups items 1+2+4+6), PR #130 `13b3383` 2026-05-17 (infra_e2e_seed_completed_study), PR #128 `73459d2` 2026-05-17 (bug_cursor_decode_value_validation), PR #126 `d6115b3` 2026-05-16 (feat_data_table_primitive). `v0.1.0` annotated tag still on `main` commit `d099536` 2026-05-13; GitHub Release at https://github.com/SoundMindsAI/relyloop/releases/tag/v0.1.0. +- **Active feature:** `feat_pr_metric_confidence` Epic 1 complete locally (8 commits — Stories 1.1 through 1.6 + the metric-key-drift fix + Guides Glossary/FAQ idea capture). Epic 2 (frontend ConfidencePanel + 3 stories) is next. 
Prior: none in flight (PR #175 closed `feat_agent_propose_search_space` on 2026-05-21; only finalization docs PR remains for the 21st MVP1 feature). Prior — none in flight (PR #168 closed `feat_cluster_target_filter` + PR #169 closed `chore_seed_meaningful_demos` on 2026-05-20; only finalization docs PR remains for the 20th MVP1 feature). Prior — none in flight (PR #165 closed `feat_create_study_target_autocomplete` + the bundled `bug_get_schema_unhandled_connect_error` fix on 2026-05-20). Prior — none in flight (PR #163 closed `feat_create_study_search_space_builder` + the `bug_judgment_lists_listing_ignores_query_set_filter` bundled fix on 2026-05-20). PR #168 closed `feat_cluster_target_filter` + PR #169 closed `chore_seed_meaningful_demos` (sibling). **Three PRs shipped 2026-05-15:** PR #122 (Phase 1, 16th MVP1 feature — Tooltip primitive + 26 placements on create-study modal + study detail), PR #123 (Phase 1 finalization docs), PR #124 (Phases 2 + 3 — 17th MVP1 feature; 21 additional tooltips on judgments + proposals + cluster registration + 2 new first-run components: chat ExamplePrompts strip + Stripe-style StartHereChecklist on home page). The original "MVP1 Phase 1 only" scope-lock was reversed mid-day: operator decided to ship Phases 2 + 3 together with a Stripe-style design call rather than wait for MVP2. PR #124 took 2 hours from idea-folder reuse to merge. 47 total tooltip placements + 2 new first-run components live in `main`. **PR #122 shipped 2026-05-15 morning** — `feat_contextual_help` Phase 1 (16th MVP1 feature). Adds the first Tooltip primitive (`@radix-ui/react-tooltip@~1.2.8` + shadcn-style wrapper at `ui/src/components/ui/tooltip.tsx`), two glossary-backed wrappers (`InfoTooltip` standalone + asChild modes; `HelpPopover` click-to-open with `react-markdown` safety filter), and a 49-key glossary source-of-truth at `ui/src/lib/glossary.ts` (8 enum groups parity-tested against `enums.ts`). 
26 tooltip placements across the create-study modal (Step 1 target + Step 3 template + 9 Step 5 inputs), study-header (status badge dynamic key + Best metric + Trials), trials-table (5 column headers + Sort label), and digest panel (5 section labels + Open PR enabled + Open PR disabled). The disabled Open PR button refactored from native `disabled` to `aria-disabled="true"` so it stays focusable and the tooltip reveals on focus (AC-11). Gemini Code Assist: 2 findings (1 accepted + fixed, 1 rejected with cited counter-evidence). Final GPT-5.5 review: 1 Medium accepted-framing-but-deferred. Spec converged at GPT-5.5 cycle 3 (24 findings, 23 accepted + 1 rejected); plan converged at cycle 2 (12 findings, 10 accepted + 1 rejected + 1 spec patch). UI vitest now **279 passing across 48 files** (was 249 across 45 — +3 new test files, +30 cases). Playwright E2E **8 passing** (was 5 — +3 new contextual-help tests). One follow-up filed: `infra_e2e_seed_completed_study/idea.md` tracks the E2E gap for digest-panel triggers + AC-11 (cross-subsystem helper for seeding a completed study with digest + proposal; component-level coverage is in place). Phases 2 + 3 deferred to MVP2 via `feat_contextual_help_mvp2/` (judgments + proposals tooltips; chat + cluster + home onboarding; the home-page "Start here" panel is the only product-design-shaped item). 
**Earlier — seven PRs shipped 2026-05-14:** `feat_judgments_periodic_resume_sweep` (PR #104, 14th MVP1 feature), `bug_query_inline_crud_since_filter_uuidv7_ms_collision` (PR #106 — UUIDv7 ms-collision test flake), `infra_dashboard_regen_pre_commit_conflict §2+§4` (PR #108 — dashboard regen idempotency + relative-link rewriting), `infra_make_targets_split_backend_only` (PR #110 — `make backend-fmt/lint/typecheck` + symmetric `ui-fmt` so Node-18 contributors aren't blocked), `chore_digest_worker_narrow_except` (PR #112 — narrowed `except Exception` allowlist to `(ValueError,)` + ERROR-level `digest_importance_failed_unexpected` event), `infra_structlog_test_helpers` (PR #114 — factored the two structlog test-assertion patterns into `backend/tests/_log_helpers.py`), and `chore_chat_last_message_preview` (PR #117 — `last_message_preview` + `last_message_at` on `ConversationSummary` via LATERAL JOIN; frontend shows preview under title + swaps displayed timestamp from `created_at` to `last_message_at`). Plus PR #116 dropped `chore_studies_ui_shadcn_polish` as won't-do (forward-compat audit on NavigationMenu primitive + ClusterFilterSelect precedent on native `
+ + + Query + Winner + + vs {formatComparison(per_query_outcomes.comparison_against)} + + Δ + + + + {per_query_outcomes.top_regressors.map((row) => ( + + {row.query_text} + + {row.winner_score.toFixed(3)} + + + {row.comparison_score.toFixed(3)} + + + {row.delta.toFixed(3)} + + + ))} + +
+

+ )} +
+ )} + + {/* Secondary callouts row: runner-up gap, late-trial 1σ, convergence */} + {(runner_up_gap != null || late_trial_stddev != null || convergence != null) && ( +
+ {runner_up_gap != null && ( +
+

+ Runner-up gap + +

+
+ {runner_up_gap.value.toFixed(3)} + {/* Values must match backend/app/domain/study/confidence.py RunnerUpClassification. */} + + {RUNNER_UP_BADGE[runner_up_gap.classification]?.label ?? + runner_up_gap.classification} + +
+
+ )} + {late_trial_stddev != null && ( +
+

+ Late-trial 1σ + +

+

{late_trial_stddev.value.toFixed(3)}

+
+ )} + {convergence != null && ( +
+

+ Convergence + +

+
+ {/* Values must match backend/app/domain/study/confidence.py ConvergenceRegime. */} + + {CONVERGENCE_BADGE[convergence.regime]?.label ?? convergence.regime} + + + best at trial {convergence.best_at_trial} of {convergence.total_trials} + +
+
+ )} +
+ )} + + + ); +} diff --git a/ui/src/lib/enums.ts b/ui/src/lib/enums.ts index a2f6a515..178a5a5a 100644 --- a/ui/src/lib/enums.ts +++ b/ui/src/lib/enums.ts @@ -68,6 +68,24 @@ export type PrunerKind = (typeof PRUNER_VALUES)[number]; export const OBJECTIVE_METRIC_VALUES = ['ndcg', 'map', 'precision', 'recall', 'mrr'] as const; export type ObjectiveMetric = (typeof OBJECTIVE_METRIC_VALUES)[number]; +// Values must match backend/app/domain/study/confidence.py ConvergenceRegime. +// Three regimes for the optimization convergence call-out on the +// ConfidencePanel + PR body's ## Confidence section. +export const CONVERGENCE_REGIME_VALUES = ['early_held', 'late_rising', 'noisy'] as const; +export type ConvergenceRegime = (typeof CONVERGENCE_REGIME_VALUES)[number]; + +// Values must match backend/app/domain/study/confidence.py RunnerUpClassification. +// Indicates whether the winner trial sits on a robust plateau (many +// near-equivalent configs in the top-10) or a sharp peak (winner isolated). +export const RUNNER_UP_CLASSIFICATION_VALUES = ['robust_plateau', 'sharp_peak'] as const; +export type RunnerUpClassification = (typeof RUNNER_UP_CLASSIFICATION_VALUES)[number]; + +// Values must match backend/app/domain/study/confidence.py ComparisonAgainst. +// Phase 1 always emits `runner_up`; `baseline` is reserved for Phase 2 +// when the orchestrator runs a no-tuning baseline trial. +export const COMPARISON_AGAINST_VALUES = ['runner_up', 'baseline'] as const; +export type ComparisonAgainst = (typeof COMPARISON_AGAINST_VALUES)[number]; + // Values must match backend/app/api/v1/schemas.py ObjectiveK. export const OBJECTIVE_K_VALUES = [1, 3, 5, 10, 20, 50, 100] as const; export type ObjectiveK = (typeof OBJECTIVE_K_VALUES)[number]; diff --git a/ui/src/lib/glossary.ts b/ui/src/lib/glossary.ts index 8519927e..6200b72e 100644 --- a/ui/src/lib/glossary.ts +++ b/ui/src/lib/glossary.ts @@ -566,6 +566,56 @@ export const glossary = { short: 'Select all rows on this page. 
Selection clears when you change page.', ariaLabel: 'More information about row selection', }, + + // --------------------------------------------------------------------------- + // feat_pr_metric_confidence Story 2.2 — 6 confidence-panel entries. + // Text lifted verbatim from feature_spec.md §11 "Tooltips and contextual + // help" so reviewers can spot drift without leaving the glossary. + // Source-of-truth comment per CLAUDE.md "Enumerated Value Contract + // Discipline" — keys map to ConfidenceShape sub-fields exposed on + // StudyDetail. + // --------------------------------------------------------------------------- + 'confidence.ci_95': { + short: + 'Bootstrap 95% confidence interval on the headline metric. 1000 resamples with replacement over per-query scores.', + ariaLabel: 'More information about the 95% confidence interval', + }, + 'confidence.runner_up_gap': { + short: + 'How close other top trials came to the winner. Robust plateau = many near-equivalents; sharp peak = winner isolated.', + long: [ + '**Robust plateau** — the top min(10, complete trials) are all within 0.005 of the winner. Many near-equivalent configs exist; the winning lift is reproducible across small perturbations.', + '', + '**Sharp peak** — at least one trial in that top set is farther than 0.005 below the winner. The winner is isolated and the result is sensitive to small parameter changes.', + ].join('\n'), + ariaLabel: 'More information about the runner-up gap', + }, + 'confidence.late_trial_stddev': { + short: + 'Standard deviation of the primary metric over the last 20% of completed trials — the empirical noise floor.', + ariaLabel: 'More information about the late-trial noise floor', + }, + 'confidence.convergence_regime': { + short: 'How the winning trial sits in the optimization budget.', + long: [ + '**Early-and-held** — best found in the first half of the trial budget AND at least one trial in the last 25% finished within 0.005 of the winner (plateau held). 
Strong signal.', + '', + '**Late-rising** — best found in the last 10% of the budget. More trials may still help; the optimizer was still improving.', + '', + '**Noisy** — neither pattern holds. No clear convergence; consider re-running with a different sampler or wider search space.', + ].join('\n'), + ariaLabel: 'More information about the convergence regime', + }, + 'confidence.per_query_outcomes': { + short: + 'Per-query metric vs the runner-up. Threshold: NDCG/P/R = 0.01; MAP/MRR = 0.02. Within → Unchanged.', + ariaLabel: 'More information about per-query outcomes', + }, + 'confidence.comparison_against': { + short: + 'Reference for per-query comparison. Runner-up = second-best trial. Baseline = no-tuning trial (Phase 2).', + ariaLabel: 'More information about the comparison reference', + }, } as const satisfies Record; // ============================================================================= diff --git a/ui/src/lib/types.ts b/ui/src/lib/types.ts index 2cb12fd5..c8d036fd 100644 --- a/ui/src/lib/types.ts +++ b/ui/src/lib/types.ts @@ -909,6 +909,23 @@ export interface components { /** Added */ added: number; }; + /** + * CIShape + * @description Bootstrap percentile CI on the winner's per-query metric values. + */ + CIShape: { + /** Low */ + low: number; + /** High */ + high: number; + /** + * Method + * @constant + */ + method: 'bootstrap_n1000'; + /** N Samples */ + n_samples: number; + }; /** * CalibrationResponse * @description Calibration endpoint response. @@ -1059,6 +1076,23 @@ export interface components { created_at: string; health_check: components['schemas']['HealthCheckResult']; }; + /** + * ConfidenceShape + * @description The top-level shape exposed via ``StudyDetail.confidence``. + * + * Every sub-field is independently nullable per FR-7 — degraded paths + * suppress only the sub-fields they affect, never the whole shape (the + * orchestrator returns whole-object ``None`` only when the winner trial + * row itself is missing). 
+ */ + ConfidenceShape: { + headline: components['schemas']['HeadlineShape']; + ci_95: components['schemas']['CIShape'] | null; + runner_up_gap: components['schemas']['RunnerUpGapShape'] | null; + late_trial_stddev: components['schemas']['LateTrialStddevShape'] | null; + convergence: components['schemas']['ConvergenceShape'] | null; + per_query_outcomes: components['schemas']['PerQueryOutcomesShape'] | null; + }; /** * ConfigRepoDetail * @description ``GET /api/v1/config-repos/{id}`` response + ``POST`` 201 body. @@ -1103,6 +1137,21 @@ export interface components { /** Has More */ has_more: boolean; }; + /** + * ConvergenceShape + * @description Where the winner sits in the Optuna trial sequence + the classified regime. + */ + ConvergenceShape: { + /** Best At Trial */ + best_at_trial: number; + /** Total Trials */ + total_trials: number; + /** + * Regime + * @enum {string} + */ + regime: 'early_held' | 'late_rising' | 'noisy'; + }; /** * ConversationDetail * @description ``GET /api/v1/conversations/{id}`` response. @@ -1405,6 +1454,26 @@ export interface components { /** Detail */ detail?: components['schemas']['ValidationError'][]; }; + /** + * HeadlineShape + * @description Top-line metric value + N(queries) used in the CI. + * + * ``metric`` uses ``str`` (not ``ObjectiveMetric``) to avoid a circular + * import: ``schemas.py`` imports ``ConfidenceShape`` from here, so this + * module cannot import back from ``schemas.py``. The upstream value is + * already validated by the existing ``ObjectiveMetric`` Literal at the + * create-study endpoint (``schemas.py:214``). + */ + HeadlineShape: { + /** Metric */ + metric: string; + /** Value */ + value: number; + /** K */ + k: number | null; + /** N Queries */ + n_queries: number | null; + }; /** * HealthCheckResult * @description Wire shape of the per-cluster health probe (mirrors ``HealthStatus``). 
@@ -1627,6 +1696,18 @@ export interface components { */ created_at: string; }; + /** + * LateTrialStddevShape + * @description Sample stddev of ``primary_metric`` over the late-trial window. + */ + LateTrialStddevShape: { + /** Value */ + value: number; + /** Window Size */ + window_size: number; + /** Min Window Required */ + min_window_required: number; + }; /** * MessageWire * @description One row of ``GET /api/v1/conversations/{id}.messages``. @@ -1740,6 +1821,25 @@ export interface components { /** Notes */ notes?: string | null; }; + /** + * PerQueryOutcomesShape + * @description Per-query outcome counts + the top-5 named regressors. + */ + PerQueryOutcomesShape: { + /** Improved */ + improved: number; + /** Unchanged */ + unchanged: number; + /** Regressed */ + regressed: number; + /** + * Comparison Against + * @enum {string} + */ + comparison_against: 'runner_up' | 'baseline'; + /** Top Regressors */ + top_regressors: components['schemas']['RegressorRowShape'][]; + }; /** * ProposalDetail * @description Body of the proposal detail endpoints. @@ -2013,6 +2113,22 @@ export interface components { */ created_at: string; }; + /** + * RegressorRowShape + * @description One row in the named-regressors table. + */ + RegressorRowShape: { + /** Query Id */ + query_id: string; + /** Query Text */ + query_text: string; + /** Winner Score */ + winner_score: number; + /** Comparison Score */ + comparison_score: number; + /** Delta */ + delta: number; + }; /** * RejectProposalRequest * @description Body of ``POST /api/v1/proposals/{id}/reject`` (FR-4 / AC-5). @@ -2060,6 +2176,27 @@ export interface components { /** Hits */ hits: components['schemas']['RunQueryHit'][]; }; + /** + * RunnerUpGapShape + * @description Runner-up trial's metric vs the winner. + * + * The whole shape is suppressed to ``None`` when there are <2 complete + * trials (FR-2 + FR-7); ``classification`` is non-null whenever this shape + * is present. 
+ */ + RunnerUpGapShape: { + /** Value */ + value: number; + /** + * Classification + * @enum {string} + */ + classification: 'robust_plateau' | 'sharp_peak'; + /** Top10 Within */ + top10_within: number; + /** Runner Up Metric */ + runner_up_metric: number; + }; /** * Schema * @description An index / collection's field schema. @@ -2216,6 +2353,7 @@ export interface components { /** Completed At */ completed_at: string | null; trials_summary: components['schemas']['TrialsSummaryShape']; + confidence?: components['schemas']['ConfidenceShape'] | null; }; /** * StudyListResponse @@ -3143,6 +3281,7 @@ export interface operations { limit?: number; since?: string | null; status?: ('queued' | 'running' | 'completed' | 'cancelled' | 'failed') | null; + cluster_id?: string | null; q?: string | null; sort?: | ( diff --git a/ui/tests/e2e/helpers/seed.ts b/ui/tests/e2e/helpers/seed.ts index 482794e8..32eb170e 100644 --- a/ui/tests/e2e/helpers/seed.ts +++ b/ui/tests/e2e/helpers/seed.ts @@ -539,6 +539,64 @@ export async function seedStudyCompletedWithDigest(args: { }; } +/** + * Seed a completed study where the winner + runner-up trials carry + * realistic per_query_metrics. Drives the `<ConfidencePanel />` happy path + * on `/studies/[id]` end-to-end. + * + * The query_ids passed in must match the queries already seeded under + * the query_set (the caller is responsible — typically via + * `seedFullChain` followed by `seedQuerySet(..., numQueries=N)`). + * + * Backed by the test-only endpoint at + * `POST /api/v1/_test/studies/seed-completed` (extended in + * feat_pr_metric_confidence Story 2.3 to accept `winner_per_query` + + * `runner_up_per_query`).
+ */ +export async function seedStudyCompletedWithPerQueryMetrics(args: { + clusterId: string; + querySetId: string; + templateId: string; + judgmentListId: string; + queryIds: string[]; + withPendingProposal?: boolean; +}): Promise { + const { + clusterId, + querySetId, + templateId, + judgmentListId, + queryIds, + withPendingProposal = true, + } = args; + // Winner: high CI; qid 0 designed to regress vs runner-up. + const winnerPerQuery: Record<string, Record<string, number>> = {}; + const runnerUpPerQuery: Record<string, Record<string, number>> = {}; + queryIds.forEach((qid, i) => { + // Use the @k-suffixed key shape that backend.app.eval.scoring.score() + // actually emits — the orchestrator looks up `ndcg@10` not bare `ndcg`. + winnerPerQuery[qid] = { 'ndcg@10': i === 0 ? 0.4 : 0.85 - 0.01 * i }; + runnerUpPerQuery[qid] = { 'ndcg@10': i === 0 ? 0.95 : 0.84 - 0.01 * i }; + }); + const result = await post<{ study_id: string; digest_id: string; proposal_id: string | null }>( + '/api/v1/_test/studies/seed-completed', + { + cluster_id: clusterId, + query_set_id: querySetId, + template_id: templateId, + judgment_list_id: judgmentListId, + with_pending_proposal: withPendingProposal, + winner_per_query: winnerPerQuery, + runner_up_per_query: runnerUpPerQuery, + }, + ); + return { + studyId: result.study_id, + digestId: result.digest_id, + proposalId: result.proposal_id, + }; +} + /** * Create a chat conversation.
Title is optional; messages are NOT sent — * tests can navigate to `/chat/{id}` and exercise the page shell without diff --git a/ui/tests/e2e/studies.spec.ts b/ui/tests/e2e/studies.spec.ts index 06147145..fc6ad454 100644 --- a/ui/tests/e2e/studies.spec.ts +++ b/ui/tests/e2e/studies.spec.ts @@ -16,7 +16,12 @@ */ import { expect, test } from '@playwright/test'; -import { seedFullChain, seedStudy, seedStudyCompletedWithDigest } from './helpers/seed'; +import { + seedFullChain, + seedStudy, + seedStudyCompletedWithDigest, + seedStudyCompletedWithPerQueryMetrics, +} from './helpers/seed'; const API_BASE = process.env.PLAYWRIGHT_API_BASE_URL ?? 'http://127.0.0.1:8000'; @@ -265,4 +270,61 @@ test.describe('/studies', () => { expect(['completed', 'failed', 'cancelled']).toContain(body.status); } }); + + // --------------------------------------------------------------------------- + // feat_pr_metric_confidence Story 2.3 — ConfidencePanel real-backend coverage + // (AC-13). Two cases: panel renders for completed study with per_query_metrics, + // panel renders nothing for a queued/running study where study.confidence is + // null whole-object. + // --------------------------------------------------------------------------- + + test('ConfidencePanel renders for a completed study with per_query_metrics', async ({ + page, + }) => { + const chain = await seedFullChain(8); + const seeded = await seedStudyCompletedWithPerQueryMetrics({ + clusterId: chain.clusterId, + querySetId: chain.querySetId, + templateId: chain.templateId, + judgmentListId: chain.judgmentListId, + queryIds: chain.queryIds, + withPendingProposal: false, + }); + await page.goto(`/studies/${seeded.studyId}`); + + const panel = page.getByTestId('confidence-panel'); + await expect(panel).toBeVisible({ timeout: 10_000 }); + // Section heading. + await expect(panel.getByText('Confidence', { exact: true })).toBeVisible(); + // Headline carries metric@k + value. 
+ await expect(page.getByTestId('confidence-headline')).toContainText('NDCG@10'); + await expect(page.getByTestId('confidence-headline')).toContainText('0.487'); + // Per-query outcome chips are present. + await expect(page.getByTestId('outcome-improved')).toBeVisible(); + await expect(page.getByTestId('outcome-regressed')).toBeVisible(); + // The designed regressor (qid 0) shows up in the named-regressors table. + await expect(page.getByTestId('confidence-regressors')).toBeVisible(); + }); + + test('ConfidencePanel renders nothing for a queued study (confidence=null)', async ({ + page, + }) => { + const chain = await seedFullChain(2); + const study = await seedStudy({ + clusterId: chain.clusterId, + querySetId: chain.querySetId, + templateId: chain.templateId, + judgmentListId: chain.judgmentListId, + maxTrials: 1, + }); + await page.goto(`/studies/${study.id}`); + + // The study header card mounts (study row exists). + await expect(page.getByRole('heading', { name: 'Study detail' })).toBeVisible({ + timeout: 10_000, + }); + // ConfidencePanel returns null when study.confidence is null whole-object, + // so the panel container must be absent (no empty-state shell — AC-3a). + await expect(page.getByTestId('confidence-panel')).toHaveCount(0); + }); }); From 1b6b16a6a7fe0e06f77dada4c074f131a2a654f3 Mon Sep 17 00:00:00 2001 From: SoundMindsAI Date: Thu, 21 May 2026 13:45:05 -0400 Subject: [PATCH 16/17] docs: update state.md for Epic 2 + capture guide-06 screenshot follow-up MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - state.md "Active feature" line now reflects Epic 1 + Epic 2 both shipped locally, with pre-push / PR / reviews remaining. - chore_guide_06_screenshot_refresh_confidence_panel — new idea file capturing the post-impl Step 3 (guide impact) finding: the /studies/[id] screenshots in Guide 06 predate the ConfidencePanel mount and need regeneration. 
The seed shape used by guide-06 (seedStudyCompletedWithDigest, no per_query_metrics) means the regenerated screenshot will show the FR-7 partial-shape variant (headline + runner-up gap callout, no CI band, no per-query outcomes) — itself a legitimate teaching surface for operators with pre-migration studies. Bundled with the next routine guide refresh rather than spinning a dedicated PR. Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/00_overview/DASHBOARD.md | 2 +- docs/00_overview/MVP1_DASHBOARD.md | 7 ++-- docs/00_overview/dashboard.html | 2 +- docs/00_overview/mvp1_dashboard.html | 18 ++++++++-- .../idea.md | 36 +++++++++++++++++++ state.md | 2 +- 6 files changed, 58 insertions(+), 9 deletions(-) create mode 100644 docs/02_product/planned_features/chore_guide_06_screenshot_refresh_confidence_panel/idea.md diff --git a/docs/00_overview/DASHBOARD.md b/docs/00_overview/DASHBOARD.md index 27cb52d7..ae2f2ee1 100644 --- a/docs/00_overview/DASHBOARD.md +++ b/docs/00_overview/DASHBOARD.md @@ -6,7 +6,7 @@ _Top-level index across MVP1 → GA v1+ as of **2026-05-21**. 
Click a release na | Release | Theme | Progress | Status | |---|---|---|---| -| [MVP1 / v0.1](MVP1_DASHBOARD.md) | The Loop | 56 / 57 scoped done · 5 remaining | **In progress** | +| [MVP1 / v0.1](MVP1_DASHBOARD.md) | The Loop | 56 / 57 scoped done · 6 remaining | **In progress** | | [MVP2 / v0.2](MVP2_DASHBOARD.md) | Observable | 1 / 1 scoped done · 1 remaining | **In progress** | | MVP3 / v0.3 | Production Stacks | — | **Not yet scoped** | | MVP4 / v0.4 | Multi-tenant, Multi-LLM | — | **Not yet scoped** | diff --git a/docs/00_overview/MVP1_DASHBOARD.md b/docs/00_overview/MVP1_DASHBOARD.md index 4b31b0b4..91ed04a0 100644 --- a/docs/00_overview/MVP1_DASHBOARD.md +++ b/docs/00_overview/MVP1_DASHBOARD.md @@ -21,9 +21,9 @@ Plan approved; run /impl-execute to ship | Metric | Value | |---|---| | Scoped items done | **56 / 57** (98%) — feat_/infra_/chore_/epic_ past idea stage | -| Path to MVP1 | **5** items remaining (features + bugs + chores) | +| Path to MVP1 | **6** items remaining (features + bugs + chores) | | Open bugs | 0 | -| Open chores | 4 (idea-stage debt) | +| Open chores | 5 (idea-stage debt) | | Backlog ideas | 4 idea-only feat/infra (not yet scoped into MVP1) | | In flight | 0 feature(s) actively shipping | @@ -116,7 +116,7 @@ _None._ _None._ -### Idea (8) +### Idea (9) | Feature | Type | One-liner | Depends on | Status | |---|---|---|---|---| @@ -124,6 +124,7 @@ _None._ | [feat_config_repo_baseline_tracking](../02_product/planned_features/feat_config_repo_baseline_tracking/idea.md) | Feature | RelyLoop does not track which configuration is currently live in production. When a proposal's PR merges, the merge webhook at [`backend/app/api/webhooks/github.py:187-191`](../../backend/app/api/webh | — | Idea — surfaced during the 2026-05-21 Karpathy-loop audit. 
| | [feat_digest_executable_followups](../02_product/planned_features/feat_digest_executable_followups/idea.md) | Feature | The digest worker's LLM contract at [`backend/workers/digest.py:168-189`](../../backend/workers/digest.py) defines `suggested_followups` as a flat `array of string`: | — | Idea — surfaced during the 2026-05-21 Karpathy-loop audit. | | [feat_study_clone_from_previous](../02_product/planned_features/feat_study_clone_from_previous/idea.md) | Feature | A relevance engineer's normal workflow after the first study completes: | — | Idea — surfaced during a UX review of parameter-tuning ergonomics on 2026-05-19. | +| [chore_guide_06_screenshot_refresh_confidence_panel](../02_product/planned_features/chore_guide_06_screenshot_refresh_confidence_panel/idea.md) | Chore | The shipped guide-06 PNGs at [`ui/public/guides/06_create_and_monitor_study/`](../../ui/public/guides/06_create_and_monitor_study) were captured before the ConfidencePanel mounted on the studies-detai | — | Idea — captured during `feat_pr_metric_confidence` Epic 2 guide-impact assessment | | [chore_guides_faq](../02_product/planned_features/chore_guides_faq/idea.md) | Chore | Tooltips and the glossary answer "**what does X mean?**" within a 1–2 sentence budget. They don't carry the operator-judgment-shaped questions that come up *after* the term is understood: | — | Idea — surfaced during `feat_pr_metric_confidence` Story 1.5 review | | [chore_guides_glossary_route](../02_product/planned_features/chore_guides_glossary_route/idea.md) | Chore | The glossary is a load-bearing terminology source-of-truth (cited 100+ times across the codebase, parity-tested against backend Literal enums, locked by source-of-truth comments). 
But operators can on | — | Idea — surfaced during `feat_pr_metric_confidence` Story 1.5 review | | [chore_study_default_stop_conditions](../02_product/planned_features/chore_study_default_stop_conditions/idea.md) | Chore | The server-side `StudyConfigSpec` validator at [`backend/app/api/v1/schemas.py:572-580`](../../backend/app/api/v1/schemas.py) correctly **requires** at least one of `max_trials` or `time_budget_min` — | — | Idea — surfaced during the 2026-05-21 Karpathy-loop audit; recommendation grounded in measured per-trial cost from the local dev DB. | diff --git a/docs/00_overview/dashboard.html b/docs/00_overview/dashboard.html index d0875958..36954001 100644 --- a/docs/00_overview/dashboard.html +++ b/docs/00_overview/dashboard.html @@ -371,7 +371,7 @@

Releases
The Loop
-56 / 57 scoped done · 5 remaining
+56 / 57 scoped done · 6 remaining
In progress
diff --git a/docs/00_overview/mvp1_dashboard.html b/docs/00_overview/mvp1_dashboard.html index 53eaecea..36d2ebab 100644 --- a/docs/00_overview/mvp1_dashboard.html +++ b/docs/00_overview/mvp1_dashboard.html @@ -390,7 +390,7 @@

MVP1 Progress
Path to MVP1
-5
+6
items left = features + bugs + chores
@@ -400,7 +400,7 @@

MVP1 Progress
Open chores
-4
+5
idea-stage chore_* (debt)
@@ -428,7 +428,7 @@

Pipeline
-Idea 8
+Idea 9
@@ -478,6 +478,18 @@

Idea 8
+Chore
+The shipped guide-06 PNGs at [`ui/public/guides/06_create_and_monitor_study/`](../../ui/public/guides/06_create_and_monitor_study) were captured before the ConfidencePanel mounted on the studies-detai
diff --git a/docs/02_product/planned_features/chore_guide_06_screenshot_refresh_confidence_panel/idea.md b/docs/02_product/planned_features/chore_guide_06_screenshot_refresh_confidence_panel/idea.md new file mode 100644 index 00000000..a1b6f7e0 --- /dev/null +++ b/docs/02_product/planned_features/chore_guide_06_screenshot_refresh_confidence_panel/idea.md @@ -0,0 +1,36 @@ +# Refresh Guide 06 screenshots to capture the new ConfidencePanel + +**Date:** 2026-05-21 +**Status:** Idea — captured during `feat_pr_metric_confidence` Epic 2 guide-impact assessment +**Origin:** Step 3 (Guide impact assessment) of /impl-execute for `feat_pr_metric_confidence` Epic 2. The new `<ConfidencePanel />` renders on `/studies/[id]` between `` and the trials table. Guide 06 (`06_create_and_monitor_study`) captures study-detail screenshots at [`ui/tests/e2e/guides/06_create_and_monitor_study.spec.ts`](../../../../ui/tests/e2e/guides/06_create_and_monitor_study.spec.ts) lines 73 / 78 / 88 / 96 against a study seeded via `seedStudyCompletedWithDigest()` (no per_query_metrics on the seeded trials). +**Depends on:** None — purely a screenshot regeneration. + +## Problem + +The shipped guide-06 PNGs at [`ui/public/guides/06_create_and_monitor_study/`](../../../../ui/public/guides/06_create_and_monitor_study/) were captured before the ConfidencePanel mounted on the studies-detail page. Operators reading the walkthrough now see a UI element in their browser that doesn't appear in the corresponding guide screenshot. Small visual drift, not a functional gap. + +Because guide-06's seed uses `seedStudyCompletedWithDigest()` (no per_query_metrics), the new section that lands in the screenshot will be the **partial-shape** variant: headline metric only (no CI band), one secondary callout (`runner_up_gap`, since the seed has 2 complete trials), and no per-query outcomes block. That partial view is itself a legitimate teaching surface — operators with old studies will see exactly that on their own UIs.
+ +## Proposed capabilities + +- Run `/guide-gen 06 --regen` to re-capture the four study-detail screenshots (the cluster + create-modal + studies-list screenshots earlier in the deck are unaffected by the ConfidencePanel mount). +- Optionally extend `seedStudyCompletedWithDigest()` (or add a `WithConfidence` variant) so the regenerated screenshots include a full ConfidencePanel — depends on whether the operator-facing teaching value of the full panel outweighs the simpler partial-shape seed. Worth deciding at regen time, not now. +- Update [`ui/public/guides/06_create_and_monitor_study/script.md`](../../../../ui/public/guides/06_create_and_monitor_study/script.md) to mention the new section if regen lands with the full panel. + +## Scope signals + +- **Backend:** none (or one optional helper extension if going with the full-panel seed). +- **Frontend:** screenshot files only; the `metadata.json` captions may need a one-line addition if the new section gets its own caption slide. +- **Migration:** none. +- **Config:** none. +- **Audit events:** N/A. +- **Estimated size:** 15-30 minutes once the operator has `make up` running. The `/guide-gen` skill automates the capture + cross-model visual review. + +## Why not yet prioritized + +Guide-06 is stale in a low-stakes way — the walkthrough still works end-to-end; only the screenshots lag the live UI. Operators following the walkthrough will see the missing section in their own browser and understand it; they won't get blocked. The regen is worth bundling with the next guide-screenshot refresh sweep (likely triggered by a UI primitive change or design-system update) rather than spinning a dedicated PR for it now. + +## Relationship to other work + +- **Coordinates with:** the `feat_pr_metric_confidence` Epic 2 ConfidencePanel that just shipped. The screenshots can be captured against any deployed instance once the feature merges. 
+- **Sibling pattern:** prior `chore_form_dropdown_guide_screenshot_refresh` (PR #154) did the same for the form-dropdown primitive rollout — this is the routine cadence for "UI change → guide regen." diff --git a/state.md b/state.md index 20e005d9..a7cc74c5 100644 --- a/state.md +++ b/state.md @@ -9,7 +9,7 @@ ## Current branch / execution context - **Branch:** `feat_pr_metric_confidence` — Epic 1 (backend) complete locally; Epic 2 (frontend ConfidencePanel + 6 confidence glossary entries) ahead. 8 commits ahead of `origin/feat_pr_metric_confidence`. Earlier: `docs/finalize-agent-propose-search-space` — finalization docs PR after PR #175 (`5d29355`) merged 2026-05-21. `feature/agent-propose-search-space` deleted post-merge. Earlier: `docs/finalize-cluster-target-filter` — finalization docs PR after PR #168 (`57d3ba0`) + PR #169 (`c44d774`) both merged. Prior `main` post-merge of PR #168 squash `57d3ba0` (`feat_cluster_target_filter`) + PR #169 squash `c44d774` (`chore_seed_meaningful_demos`) 2026-05-20. Earlier: PR #165 squash commit `bd4516a` 2026-05-20. Finalization docs branch `docs/finalize-create-study-target-autocomplete`. Prior squash same day: PR #163 `c703953` (`feat_create_study_search_space_builder`). Finalization docs PR off `docs/finalize-create-study-search-space-builder`. Prior squashes (same day): PR #161 `0879df2` (`chore_create_study_modal_e2e_stability`), PR #160 `160ff6b` (`bug_err_metric_frontend_backend_drift`), PR #159 `52e106d` (`bug_tutorial_template_param_boost_naming`), PR #158 `308c315` (finalize chore_create_study_wizard_polish), PR #157 `075c46b` (`chore_create_study_wizard_polish`). Prior squash: PR #155 `9a72514` 2026-05-19. 
Prior squashes: PR #154 `ed4121f` 2026-05-19 (`chore_form_dropdown_guide_screenshot_refresh`), PR #153 `199e225` 2026-05-19 (`chore_extract_shadcn_select_test_mock`), PR #152 `476db78` 2026-05-19 (`chore_ci_prettier_check`), PR #151 `110dc5a` 2026-05-19 (finalize chore_data_table_columnvisibility_tanstack), PR #150 `c1e4545` 2026-05-19 (`chore_data_table_columnvisibility_tanstack`), PR #149 `da9506b` 2026-05-19 (finalize infra_e2e_wire_seed_helper_into_studies_spec), PR #148 `65f4150` 2026-05-19 (`infra_e2e_wire_seed_helper_into_studies_spec` — `?study_id=` filter bug + E2E test restore), PR #147 `8854e47` 2026-05-18 (capture chore_detail_page_shell_primitive idea), PR #146 `7299fca` 2026-05-18 (bug_install_skip_ui_rebuild — `make up`/`make down` lifecycle fix), PR #136 `cb7d9ee` 2026-05-18 (chore_form_dropdown_primitive), PR #132 `ee4c8d4` 2026-05-17 (chore_data_table_primitive_followups items 1+2+4+6), PR #130 `13b3383` 2026-05-17 (infra_e2e_seed_completed_study), PR #128 `73459d2` 2026-05-17 (bug_cursor_decode_value_validation), PR #126 `d6115b3` 2026-05-16 (feat_data_table_primitive). `v0.1.0` annotated tag still on `main` commit `d099536` 2026-05-13; GitHub Release at https://github.com/SoundMindsAI/relyloop/releases/tag/v0.1.0. -- **Active feature:** `feat_pr_metric_confidence` Epic 1 complete locally (8 commits — Stories 1.1 through 1.6 + the metric-key-drift fix + Guides Glossary/FAQ idea capture). Epic 2 (frontend ConfidencePanel + 3 stories) is next. Prior: none in flight (PR #175 closed `feat_agent_propose_search_space` on 2026-05-21; only finalization docs PR remains for the 21st MVP1 feature). Prior — none in flight (PR #168 closed `feat_cluster_target_filter` + PR #169 closed `chore_seed_meaningful_demos` on 2026-05-20; only finalization docs PR remains for the 20th MVP1 feature). Prior — none in flight (PR #165 closed `feat_create_study_target_autocomplete` + the bundled `bug_get_schema_unhandled_connect_error` fix on 2026-05-20). 
Prior — none in flight (PR #163 closed `feat_create_study_search_space_builder` + the `bug_judgment_lists_listing_ignores_query_set_filter` bundled fix on 2026-05-20). PR #168 closed `feat_cluster_target_filter` + PR #169 closed `chore_seed_meaningful_demos` (sibling). **Three PRs shipped 2026-05-15:** PR #122 (Phase 1, 16th MVP1 feature — Tooltip primitive + 26 placements on create-study modal + study detail), PR #123 (Phase 1 finalization docs), PR #124 (Phases 2 + 3 — 17th MVP1 feature; 21 additional tooltips on judgments + proposals + cluster registration + 2 new first-run components: chat ExamplePrompts strip + Stripe-style StartHereChecklist on home page). The original "MVP1 Phase 1 only" scope-lock was reversed mid-day: operator decided to ship Phases 2 + 3 together with a Stripe-style design call rather than wait for MVP2. PR #124 took 2 hours from idea-folder reuse to merge. 47 total tooltip placements + 2 new first-run components live in `main`. **PR #122 shipped 2026-05-15 morning** — `feat_contextual_help` Phase 1 (16th MVP1 feature). Adds the first Tooltip primitive (`@radix-ui/react-tooltip@~1.2.8` + shadcn-style wrapper at `ui/src/components/ui/tooltip.tsx`), two glossary-backed wrappers (`InfoTooltip` standalone + asChild modes; `HelpPopover` click-to-open with `react-markdown` safety filter), and a 49-key glossary source-of-truth at `ui/src/lib/glossary.ts` (8 enum groups parity-tested against `enums.ts`). 26 tooltip placements across the create-study modal (Step 1 target + Step 3 template + 9 Step 5 inputs), study-header (status badge dynamic key + Best metric + Trials), trials-table (5 column headers + Sort label), and digest panel (5 section labels + Open PR enabled + Open PR disabled). The disabled Open PR button refactored from native `disabled` to `aria-disabled="true"` so it stays focusable and the tooltip reveals on focus (AC-11). Gemini Code Assist: 2 findings (1 accepted + fixed, 1 rejected with cited counter-evidence). 
Final GPT-5.5 review: 1 Medium accepted-framing-but-deferred. Spec converged at GPT-5.5 cycle 3 (24 findings, 23 accepted + 1 rejected); plan converged at cycle 2 (12 findings, 10 accepted + 1 rejected + 1 spec patch). UI vitest now **279 passing across 48 files** (was 249 across 45 — +3 new test files, +30 cases). Playwright E2E **8 passing** (was 5 — +3 new contextual-help tests). One follow-up filed: `infra_e2e_seed_completed_study/idea.md` tracks the E2E gap for digest-panel triggers + AC-11 (cross-subsystem helper for seeding a completed study with digest + proposal; component-level coverage is in place). Phases 2 + 3 deferred to MVP2 via `feat_contextual_help_mvp2/` (judgments + proposals tooltips; chat + cluster + home onboarding; the home-page "Start here" panel is the only product-design-shaped item). +- **Active feature:** `feat_pr_metric_confidence` Epic 1 + Epic 2 both complete locally (commits `9c95021` … `614496a`, Stories 1.1 through 2.3 plus metric-key drift fix, Guides Glossary/FAQ idea capture, Epic 1 gate review/fixes, and the Guide-06 screenshot-refresh follow-up idea). Pending: pre-push gate + push + open PR + Gemini adjudication + final GPT-5.5 review. Prior: none in flight (PR #175 closed `feat_agent_propose_search_space` on 2026-05-21; only finalization docs PR remains for the 21st MVP1 feature). Prior — none in flight (PR #168 closed `feat_cluster_target_filter` + PR #169 closed `chore_seed_meaningful_demos` on 2026-05-20; only finalization docs PR remains for the 20th MVP1 feature). Prior — none in flight (PR #165 closed `feat_create_study_target_autocomplete` + the bundled `bug_get_schema_unhandled_connect_error` fix on 2026-05-20). Prior — none in flight (PR #163 closed `feat_create_study_search_space_builder` + the `bug_judgment_lists_listing_ignores_query_set_filter` bundled fix on 2026-05-20). PR #168 closed `feat_cluster_target_filter` + PR #169 closed `chore_seed_meaningful_demos` (sibling). 
**Three PRs shipped 2026-05-15:** PR #122 (Phase 1, 16th MVP1 feature — Tooltip primitive + 26 placements on create-study modal + study detail), PR #123 (Phase 1 finalization docs), PR #124 (Phases 2 + 3 — 17th MVP1 feature; 21 additional tooltips on judgments + proposals + cluster registration + 2 new first-run components: chat ExamplePrompts strip + Stripe-style StartHereChecklist on home page). The original "MVP1 Phase 1 only" scope-lock was reversed mid-day: operator decided to ship Phases 2 + 3 together with a Stripe-style design call rather than wait for MVP2. PR #124 took 2 hours from idea-folder reuse to merge. 47 total tooltip placements + 2 new first-run components live in `main`. **PR #122 shipped 2026-05-15 morning** — `feat_contextual_help` Phase 1 (16th MVP1 feature). Adds the first Tooltip primitive (`@radix-ui/react-tooltip@~1.2.8` + shadcn-style wrapper at `ui/src/components/ui/tooltip.tsx`), two glossary-backed wrappers (`InfoTooltip` standalone + asChild modes; `HelpPopover` click-to-open with `react-markdown` safety filter), and a 49-key glossary source-of-truth at `ui/src/lib/glossary.ts` (8 enum groups parity-tested against `enums.ts`). 26 tooltip placements across the create-study modal (Step 1 target + Step 3 template + 9 Step 5 inputs), study-header (status badge dynamic key + Best metric + Trials), trials-table (5 column headers + Sort label), and digest panel (5 section labels + Open PR enabled + Open PR disabled). The disabled Open PR button refactored from native `disabled` to `aria-disabled="true"` so it stays focusable and the tooltip reveals on focus (AC-11). Gemini Code Assist: 2 findings (1 accepted + fixed, 1 rejected with cited counter-evidence). Final GPT-5.5 review: 1 Medium accepted-framing-but-deferred. Spec converged at GPT-5.5 cycle 3 (24 findings, 23 accepted + 1 rejected); plan converged at cycle 2 (12 findings, 10 accepted + 1 rejected + 1 spec patch). 
UI vitest now **279 passing across 48 files** (was 249 across 45 — +3 new test files, +30 cases). Playwright E2E **8 passing** (was 5 — +3 new contextual-help tests). One follow-up filed: `infra_e2e_seed_completed_study/idea.md` tracks the E2E gap for digest-panel triggers + AC-11 (cross-subsystem helper for seeding a completed study with digest + proposal; component-level coverage is in place). Phases 2 + 3 deferred to MVP2 via `feat_contextual_help_mvp2/` (judgments + proposals tooltips; chat + cluster + home onboarding; the home-page "Start here" panel is the only product-design-shaped item). **Earlier — seven PRs shipped 2026-05-14:** `feat_judgments_periodic_resume_sweep` (PR #104, 14th MVP1 feature), `bug_query_inline_crud_since_filter_uuidv7_ms_collision` (PR #106 — UUIDv7 ms-collision test flake), `infra_dashboard_regen_pre_commit_conflict §2+§4` (PR #108 — dashboard regen idempotency + relative-link rewriting), `infra_make_targets_split_backend_only` (PR #110 — `make backend-fmt/lint/typecheck` + symmetric `ui-fmt` so Node-18 contributors aren't blocked), `chore_digest_worker_narrow_except` (PR #112 — narrowed `except Exception` allowlist to `(ValueError,)` + ERROR-level `digest_importance_failed_unexpected` event), `infra_structlog_test_helpers` (PR #114 — factored the two structlog test-assertion patterns into `backend/tests/_log_helpers.py`), and `chore_chat_last_message_preview` (PR #117 — `last_message_preview` + `last_message_at` on `ConversationSummary` via LATERAL JOIN; frontend shows preview under title + swaps displayed timestamp from `created_at` to `last_message_at`). Plus PR #116 dropped `chore_studies_ui_shadcn_polish` as won't-do (forward-compat audit on NavigationMenu primitive + ClusterFilterSelect precedent on native `