diff --git a/CLAUDE.md b/CLAUDE.md index 44f60e28..a17ee62e 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -33,6 +33,7 @@ The tool is a single, engine-agnostic, provider-agnostic system: one UI, one wor | Release | Theme | Adds | |---|---|---| | MVP1 / v0.1 | "The Loop" | ES + OpenSearch adapter, OpenAI-compatible LLM, GitHub provider, single-tenant, no auth, Docker Compose, 80% coverage gate | +| MVP1.5 / v0.1.5 | "Real Signals" | OpenSearch UBI judgments as a first-class source — `UbiReader` (engine-agnostic; reads `ubi_queries` + `ubi_events`) + pluggable `SignalsConverter` (position-bias-corrected CTR, dwell-time, hybrid UBI+LLM); judgment lists can mix sources via existing `source` enum; new `POST /api/v1/judgment-lists/generate-from-ubi` + `generate_judgments_from_ubi` agent tool. No schema migration, no new Compose service. Predicated on operator running the OpenSearch UBI plugin. | | MVP2 / v0.2 | "Observable" | Langfuse + ClickHouse + SigNoz; canonical event catalog; `audit_log` table + immutability trigger (no users/tenants yet); lineage columns; PII redaction; trace propagation | | MVP3 / v0.3 | "Production Stacks" | Lucidworks Fusion adapter; multi-Git-provider abstraction (GitLab, Bitbucket); production install (TLS via Caddy + Let's Encrypt, managed Postgres/Redis); AWS managed OpenSearch | | MVP4 / v0.4 | "Multi-tenant, Multi-LLM" | `tenants` + `tenant_memberships` + `users` + `api_keys`; `tenant_id` columns + backfill; SSO via reverse proxy; Argon2id-hashed bearer API keys; native non-OpenAI provider SDKs (Anthropic, Bedrock, Vertex) | diff --git a/docs/00_overview/DASHBOARD.md b/docs/00_overview/DASHBOARD.md index 1454d3f7..9ef215d3 100644 --- a/docs/00_overview/DASHBOARD.md +++ b/docs/00_overview/DASHBOARD.md @@ -6,7 +6,7 @@ _Top-level index across MVP1 → GA v1+ as of **2026-05-23**. 
Click a release na | Release | Theme | Progress | Status | |---|---|---|---| -| [MVP1 / v0.1](MVP1_DASHBOARD.md) | The Loop | 66 / 66 scoped done · 6 remaining | **In progress** | +| [MVP1 / v0.1](MVP1_DASHBOARD.md) | The Loop | 66 / 66 scoped done · 7 remaining | **In progress** | | [MVP2 / v0.2](MVP2_DASHBOARD.md) | Observable | 1 / 1 scoped done · 1 remaining | **In progress** | | MVP3 / v0.3 | Production Stacks | — | **Not yet scoped** | | MVP4 / v0.4 | Multi-tenant, Multi-LLM | — | **Not yet scoped** | diff --git a/docs/00_overview/MVP1_DASHBOARD.md b/docs/00_overview/MVP1_DASHBOARD.md index e68aa86c..87d7219c 100644 --- a/docs/00_overview/MVP1_DASHBOARD.md +++ b/docs/00_overview/MVP1_DASHBOARD.md @@ -15,14 +15,14 @@ Pull from the Idea backlog or capture a new feature spec. | Metric | Value | |---|---| | Scoped items done | **66 / 66** (100%) — feat_/infra_/chore_/epic_ past idea stage | -| Pending work | **13** items (every not-done feat/infra/chore/bug across all priorities) | +| Pending work | **15** items (every not-done feat/infra/chore/bug across all priorities) | | → P0 — do next | **0** unblocking / paying daily cost | -| → P1 | **0** high-value, ready when P0 clears | -| → P2 (default) | 12 important to file, not blocking | +| → P1 | **1** high-value, ready when P0 clears | +| → P2 (default) | 13 important to file, not blocking | | → Backlog | 1 captured for record, not planned | -| Open bugs | 2 | -| Legacy "Path to MVP1" | 6 items — scoped-not-done + bugs + chore-ideas only (excludes feat/infra ideas) | -| Backlog ideas | 7 idea-only feat/infra (not yet scoped into MVP1) | +| Open bugs | 3 | +| Legacy "Path to MVP1" | 7 items — scoped-not-done + bugs + chore-ideas only (excludes feat/infra ideas) | +| Backlog ideas | 8 idea-only feat/infra (not yet scoped into MVP1) | | In flight | 0 feature(s) actively shipping | ## Pipeline @@ -32,7 +32,7 @@ Pull from the Idea backlog or capture a new feature spec. 
| Feature | Type | One-liner | Depends on | Status | |---|---|---|---|---| | [feat_agent_propose_search_space](implemented_features/2026_05_21_feat_agent_propose_search_space/feature_spec.md) | Feature | A new read-only agent tool `propose_search_space(template_id, cluster_id, judgment_list_id?, prior_study_id?) → SearchSpace JSON` that emits a deterministic, code-generated search space using the same | — | [PR #175](https://github.com/SoundMindsAI/relyloop/pull/175) merged 2026-05-21 | -| [feat_chat_agent](implemented_features/2026_05_12_feat_chat_agent/feature_spec.md) | Feature | A chat surface at `/chat/{conversation_id}` streams OpenAI completions via SSE. | `feat_agent_propose_search_space` `feat_auto_followup_studies` `feat_chat_last_message_preview` `feat_cluster_target_filter` `feat_config_repo_baseline_tracking` `feat_contextual_help` `feat_contextual_help_mvp2` `feat_create_study_search_space_builder` `feat_create_study_target_autocomplete` `feat_data_table_primitive` `feat_digest_executable_followups` `feat_digest_proposal` `feat_fts_rank_ordering_mvp2` `feat_github_pr_worker` `feat_github_webhook` `feat_home_demo_reseed_endpoint` `feat_home_first_run_demo_nudge` `feat_judgments_periodic_resume_sweep` `feat_llm_judgments` `feat_orchestrator_zero_streak_abort` `feat_pr_metric_confidence` `feat_proposals_ui` `feat_query_inline_crud` `feat_studies_ui` `feat_study_baseline_trial` `feat_study_clone_from_previous` `feat_study_lifecycle` `feat_study_preflight_overlap_probe` `feat_study_target_judgment_mismatch_guard` `infra_adapter_elastic` `infra_arq_subprocess_test_mvp2` `infra_ci_smoke_makeup` `infra_dashboard_regen_pre_commit_conflict` `infra_e2e_seed_completed_study` `infra_e2e_wire_seed_helper_into_studies_spec` `infra_foundation` `infra_frontend_stack_refresh` `infra_ir_measures_migration` `infra_make_targets_split_backend_only` `infra_nvmrc` `infra_optuna_eval` `infra_per_trial_timeout` `infra_structlog_test_helpers` 
`infra_study_preflight_real_engine_integration` `infra_uv_sync_drops_precommit` | [PR #60](https://github.com/SoundMindsAI/relyloop/pull/60) merged 2026-05-12 | +| [feat_chat_agent](implemented_features/2026_05_12_feat_chat_agent/feature_spec.md) | Feature | A chat surface at `/chat/{conversation_id}` streams OpenAI completions via SSE. | `feat_agent_propose_search_space` `feat_auto_followup_studies` `feat_chat_last_message_preview` `feat_cluster_target_filter` `feat_config_repo_baseline_tracking` `feat_contextual_help` `feat_contextual_help_mvp2` `feat_create_study_search_space_builder` `feat_create_study_target_autocomplete` `feat_data_table_primitive` `feat_digest_executable_followups` `feat_digest_proposal` `feat_fts_rank_ordering_mvp2` `feat_github_pr_worker` `feat_github_webhook` `feat_home_demo_reseed_endpoint` `feat_home_first_run_demo_nudge` `feat_judgments_periodic_resume_sweep` `feat_llm_judgments` `feat_orchestrator_zero_streak_abort` `feat_pr_metric_confidence` `feat_proposals_ui` `feat_query_inline_crud` `feat_studies_ui` `feat_study_baseline_trial` `feat_study_clone_from_previous` `feat_study_lifecycle` `feat_study_preflight_overlap_probe` `feat_study_target_judgment_mismatch_guard` `feat_ubi_judgments` `infra_adapter_elastic` `infra_arq_subprocess_test_mvp2` `infra_ci_smoke_makeup` `infra_dashboard_regen_pre_commit_conflict` `infra_e2e_seed_completed_study` `infra_e2e_wire_seed_helper_into_studies_spec` `infra_foundation` `infra_frontend_stack_refresh` `infra_ir_measures_migration` `infra_make_targets_split_backend_only` `infra_nvmrc` `infra_optuna_eval` `infra_per_trial_timeout` `infra_structlog_test_helpers` `infra_study_preflight_real_engine_integration` `infra_uv_sync_drops_precommit` | [PR #60](https://github.com/SoundMindsAI/relyloop/pull/60) merged 2026-05-12 | | [feat_cluster_target_filter](implemented_features/2026_05_20_feat_cluster_target_filter/feature_spec.md) | Feature | Each registered cluster can optionally carry a glob pattern 
(`products*`, `team-a-*`, `docs-[ef][nr]-*`) that scopes `list_targets()` to the matching subset. | — | [PR #168](https://github.com/SoundMindsAI/relyloop/pull/168) merged 2026-05-20 | | [feat_contextual_help](implemented_features/2026_05_15_feat_contextual_help/feature_spec.md) | Feature | a relevance engineer can launch their second study and interpret its digest without re-reading the tutorial, because every domain-jargon label has a one-click contextual definition grounded in the sam | — | [PR #122](https://github.com/SoundMindsAI/relyloop/pull/122) merged 2026-05-15 | | [feat_create_study_search_space_builder](implemented_features/2026_05_20_feat_create_study_search_space_builder/feature_spec.md) | Feature | Complete (PR #163, squash commit `c703953`, merged 2026-05-20) | — | [PR #163](https://github.com/SoundMindsAI/relyloop/pull/163) merged 2026-05-20 | @@ -96,7 +96,7 @@ Pull from the Idea backlog or capture a new feature spec. | [chore_starlette_422_deprecation](implemented_features/2026_05_13_chore_starlette_422_deprecation/idea.md) | Chore | Complete | — | Complete | | [chore_test_both_engines](implemented_features/2026_05_13_chore_test_both_engines/idea.md) | Chore | Complete | — | Complete | | [chore_trial_summary_single_query](implemented_features/2026_05_13_chore_trial_summary_single_query/idea.md) | Chore | Complete | — | Complete | -| [chore_tutorial_polish](implemented_features/2026_05_12_chore_tutorial_polish/feature_spec.md) | Chore | The release tag `v0.1.0` is pushed with: a worked tutorial at `docs/08_guides/tutorial-first-study.md`, sample data (50-query set + sample ES index of ~1,000 docs from the Amazon ESCI subset), README | `feat_agent_propose_search_space` `feat_auto_followup_studies` `feat_chat_agent` `feat_chat_last_message_preview` `feat_cluster_target_filter` `feat_config_repo_baseline_tracking` `feat_contextual_help` `feat_contextual_help_mvp2` `feat_create_study_search_space_builder` `feat_create_study_target_autocomplete` 
`feat_data_table_primitive` `feat_digest_executable_followups` `feat_digest_proposal` `feat_fts_rank_ordering_mvp2` `feat_github_pr_worker` `feat_github_webhook` `feat_home_demo_reseed_endpoint` `feat_home_first_run_demo_nudge` `feat_judgments_periodic_resume_sweep` `feat_llm_judgments` `feat_orchestrator_zero_streak_abort` `feat_pr_metric_confidence` `feat_proposals_ui` `feat_query_inline_crud` `feat_studies_ui` `feat_study_baseline_trial` `feat_study_clone_from_previous` `feat_study_lifecycle` `feat_study_preflight_overlap_probe` `feat_study_target_judgment_mismatch_guard` `infra_adapter_elastic` `infra_arq_subprocess_test_mvp2` `infra_ci_smoke_makeup` `infra_dashboard_regen_pre_commit_conflict` `infra_e2e_seed_completed_study` `infra_e2e_wire_seed_helper_into_studies_spec` `infra_foundation` `infra_frontend_stack_refresh` `infra_ir_measures_migration` `infra_make_targets_split_backend_only` `infra_nvmrc` `infra_optuna_eval` `infra_per_trial_timeout` `infra_structlog_test_helpers` `infra_study_preflight_real_engine_integration` `infra_uv_sync_drops_precommit` | [PR #64](https://github.com/SoundMindsAI/relyloop/pull/64) merged 2026-05-12 | +| [chore_tutorial_polish](implemented_features/2026_05_12_chore_tutorial_polish/feature_spec.md) | Chore | The release tag `v0.1.0` is pushed with: a worked tutorial at `docs/08_guides/tutorial-first-study.md`, sample data (50-query set + sample ES index of ~1,000 docs from the Amazon ESCI subset), README | `feat_agent_propose_search_space` `feat_auto_followup_studies` `feat_chat_agent` `feat_chat_last_message_preview` `feat_cluster_target_filter` `feat_config_repo_baseline_tracking` `feat_contextual_help` `feat_contextual_help_mvp2` `feat_create_study_search_space_builder` `feat_create_study_target_autocomplete` `feat_data_table_primitive` `feat_digest_executable_followups` `feat_digest_proposal` `feat_fts_rank_ordering_mvp2` `feat_github_pr_worker` `feat_github_webhook` `feat_home_demo_reseed_endpoint` 
`feat_home_first_run_demo_nudge` `feat_judgments_periodic_resume_sweep` `feat_llm_judgments` `feat_orchestrator_zero_streak_abort` `feat_pr_metric_confidence` `feat_proposals_ui` `feat_query_inline_crud` `feat_studies_ui` `feat_study_baseline_trial` `feat_study_clone_from_previous` `feat_study_lifecycle` `feat_study_preflight_overlap_probe` `feat_study_target_judgment_mismatch_guard` `feat_ubi_judgments` `infra_adapter_elastic` `infra_arq_subprocess_test_mvp2` `infra_ci_smoke_makeup` `infra_dashboard_regen_pre_commit_conflict` `infra_e2e_seed_completed_study` `infra_e2e_wire_seed_helper_into_studies_spec` `infra_foundation` `infra_frontend_stack_refresh` `infra_ir_measures_migration` `infra_make_targets_split_backend_only` `infra_nvmrc` `infra_optuna_eval` `infra_per_trial_timeout` `infra_structlog_test_helpers` `infra_study_preflight_real_engine_integration` `infra_uv_sync_drops_precommit` | [PR #64](https://github.com/SoundMindsAI/relyloop/pull/64) merged 2026-05-12 | | [bug_capability_check_test_isolation](implemented_features/2026_05_12_bug_capability_check_test_isolation/idea.md) | Bug | Complete | — | Complete | | [bug_cursor_decode_value_validation](implemented_features/2026_05_17_bug_cursor_decode_value_validation/idea.md) | Bug | Complete | — | Complete | | [bug_digest_param_importance_seam](implemented_features/2026_05_13_bug_digest_param_importance_seam/idea.md) | Bug | Complete | — | Complete | @@ -122,10 +122,11 @@ _None._ _None._ -### Idea (13) +### Idea (15) | Priority | Feature | Type | One-liner | Depends on | Status | |---|---|---|---|---|---| +| P1 | [feat_ubi_judgments](../02_product/planned_features/feat_ubi_judgments/idea.md) | Feature | MVP1 ships with **LLM-as-judge** as the only authoritative judgment source. 
The architecture anticipated this would change — the `judgments.source` CHECK already accepts `click`… | — | Idea — anchor feature for MVP1.5 / v0.1.5 "Real Signals" | | P2 | [feat_auto_followup_studies](../02_product/planned_features/feat_auto_followup_studies/idea.md) | Feature | Karpathy's autoresearch loop runs hundreds of experiments overnight and **compounds** improvements: each accepted change becomes the new baseline for the next experiment. RelyLoop's equivalent… | — | Idea — surfaced during the 2026-05-21 Karpathy-loop audit. The highest-leverage recommendation from the audit's "across studies" section. | | P2 | [feat_config_repo_baseline_tracking](../02_product/planned_features/feat_config_repo_baseline_tracking/idea.md) | Feature | RelyLoop does not track which configuration is currently live in production. When a proposal's PR merges, the merge webhook at [`backend/app/api/webhooks/github.py:187-191`](../../backend/app/api/webh | — | Idea — surfaced during the 2026-05-21 Karpathy-loop audit. | | P2 | [feat_digest_executable_followups](../02_product/planned_features/feat_digest_executable_followups/idea.md) | Feature | The digest worker's LLM contract at [`backend/workers/digest.py:168-189`](../../backend/workers/digest.py) defines `suggested_followups` as a flat `array of string`: | — | Idea — surfaced during the 2026-05-21 Karpathy-loop audit. | @@ -138,6 +139,7 @@ _None._ | P2 | [chore_template_library_expansion](../02_product/planned_features/chore_template_library_expansion/idea.md) | Chore | Three connected gaps: | — | Idea — surfaced during a UX review of parameter-tuning ergonomics on 2026-05-19. 
| | P2 | [bug_contract_test_stub_missing_target_filter_kwarg](../02_product/planned_features/bug_contract_test_stub_missing_target_filter_kwarg/idea.md) | Bug | `backend/tests/contract/test_error_codes.py::TestErrorCodes::test_targets_forbidden` and `::test_targets_unreachable_via_adapter` both define an inline `_Stub` class whose `list_targets` method has th | — | Idea — bug discovered during `feat_orchestrator_zero_streak_abort` phase gate | | P2 | [bug_dashboard_banner_dismiss_persistence_flake](../02_product/planned_features/bug_dashboard_banner_dismiss_persistence_flake/idea.md) | Bug | The test, introduced by [PR #188 (`feat_home_first_run_demo_nudge`)](implemented_features/2026_05_22_feat_home_first_run_demo_nudge), is: | — | Idea — surfaced during `feat_study_preflight_overlap_probe` (PR #193) smoke CI | +| P2 | [bug_dashboard_depends_on_column_bloat](../02_product/planned_features/bug_dashboard_depends_on_column_bloat/idea.md) | Bug | [`scripts/build_mvp1_dashboard.py`](../../scripts/build_mvp1_dashboard.py) (2,084 lines) generates the "Depends on" column for each planned-feature row in [`MVP1_DASHBOARD.md`](MVP1_DASHBOARD.md) and | — | Idea — surfaced by Gemini Code Assist review on PR #200 (2026-05-22). Pre-existing bug; this PR only made one more entry visible. | | Backlog | [chore_e2e_seed_acme_helper_dead](../02_product/planned_features/chore_e2e_seed_acme_helper_dead/idea.md) | Chore | `seedAcmeProductsChain` is a 140-line helper that constructs a cluster + query_set + template + judgment_list + study + optional proposal/digest chain "Acme Products" demo scenario. The function is co | — | Idea — surfaced during `chore_e2e_test_rows_isolation` Story 1.2 coverage audit | ## Dependency graph diff --git a/docs/00_overview/dashboard.html b/docs/00_overview/dashboard.html index c8b04b10..4ee4cbad 100644 --- a/docs/00_overview/dashboard.html +++ b/docs/00_overview/dashboard.html @@ -384,7 +384,7 @@

Releases

MVP1 / v0.1
The Loop
- 66 / 66 scoped done · 6 remaining
+ 66 / 66 scoped done · 7 remaining
In progress
diff --git a/docs/00_overview/mvp1_dashboard.html b/docs/00_overview/mvp1_dashboard.html
index 25f4af7e..b3a6ee35 100644
--- a/docs/00_overview/mvp1_dashboard.html
+++ b/docs/00_overview/mvp1_dashboard.html
@@ -403,12 +403,12 @@

MVP1 Progress

Pending work
- 13
+ 15
every not-done feat/infra/chore/bug across all priorities
Open bugs
- 2
+ 3
tracked bug_* idea files
@@ -420,12 +420,12 @@

MVP1 Progress

P1
- 0
+ 1
high-value, ready when P0 clears
P2 (default)
- 12
+ 13
important to file, not blocking
@@ -435,14 +435,14 @@

MVP1 Progress

Legacy "Path to MVP1"
- 6
+ 7
scoped not-done + bugs + chore-ideas only (excludes feat/infra ideas)
Backlog ideas:
- 7 idea-only feat/infra folders (not yet scoped into MVP1)
+ 8 idea-only feat/infra folders (not yet scoped into MVP1)
In flight:
@@ -463,7 +463,20 @@

Pipeline

- Idea 13
+ Idea 15

+ Feature
+ P1
+ MVP1 ships with **LLM-as-judge** as the only authoritative judgment source. The architecture anticipated this would change — the `judgments.source` CHECK already accepts `click`…
@@ -621,6 +634,19 @@

Idea 13

+ Bug
+ P2
+ [`scripts/build_mvp1_dashboard.py`](../../scripts/build_mvp1_dashboard.py) (2,084 lines) generates the "Depends on" column for each planned-feature row in [`MVP1_DASHBOARD.md`](MVP1_DASHBOARD.md) and
@@ -675,7 +701,7 @@

Done 78

A chat surface at `/chat/{conversation_id}` streams OpenAI completions via SSE.
- depends on: feat_agent_propose_search_space feat_auto_followup_studies feat_chat_last_message_preview feat_cluster_target_filter feat_config_repo_baseline_tracking feat_contextual_help feat_contextual_help_mvp2 feat_create_study_search_space_builder feat_create_study_target_autocomplete feat_data_table_primitive feat_digest_executable_followups feat_digest_proposal feat_fts_rank_ordering_mvp2 feat_github_pr_worker feat_github_webhook feat_home_demo_reseed_endpoint feat_home_first_run_demo_nudge feat_judgments_periodic_resume_sweep feat_llm_judgments feat_orchestrator_zero_streak_abort feat_pr_metric_confidence feat_proposals_ui feat_query_inline_crud feat_studies_ui feat_study_baseline_trial feat_study_clone_from_previous feat_study_lifecycle feat_study_preflight_overlap_probe feat_study_target_judgment_mismatch_guard infra_adapter_elastic infra_arq_subprocess_test_mvp2 infra_ci_smoke_makeup infra_dashboard_regen_pre_commit_conflict infra_e2e_seed_completed_study infra_e2e_wire_seed_helper_into_studies_spec infra_foundation infra_frontend_stack_refresh infra_ir_measures_migration infra_make_targets_split_backend_only infra_nvmrc infra_optuna_eval infra_per_trial_timeout infra_structlog_test_helpers infra_study_preflight_real_engine_integration infra_uv_sync_drops_precommit
+ depends on: feat_agent_propose_search_space feat_auto_followup_studies feat_chat_last_message_preview feat_cluster_target_filter feat_config_repo_baseline_tracking feat_contextual_help feat_contextual_help_mvp2 feat_create_study_search_space_builder feat_create_study_target_autocomplete feat_data_table_primitive feat_digest_executable_followups feat_digest_proposal feat_fts_rank_ordering_mvp2 feat_github_pr_worker feat_github_webhook feat_home_demo_reseed_endpoint feat_home_first_run_demo_nudge feat_judgments_periodic_resume_sweep feat_llm_judgments feat_orchestrator_zero_streak_abort feat_pr_metric_confidence feat_proposals_ui feat_query_inline_crud feat_studies_ui feat_study_baseline_trial feat_study_clone_from_previous feat_study_lifecycle feat_study_preflight_overlap_probe feat_study_target_judgment_mismatch_guard feat_ubi_judgments infra_adapter_elastic infra_arq_subprocess_test_mvp2 infra_ci_smoke_makeup infra_dashboard_regen_pre_commit_conflict infra_e2e_seed_completed_study infra_e2e_wire_seed_helper_into_studies_spec infra_foundation infra_frontend_stack_refresh infra_ir_measures_migration infra_make_targets_split_backend_only infra_nvmrc infra_optuna_eval infra_per_trial_timeout infra_structlog_test_helpers infra_study_preflight_real_engine_integration infra_uv_sync_drops_precommit
@@ -1507,7 +1533,7 @@

Done 78

The release tag `v0.1.0` is pushed with: a worked tutorial at `docs/08_guides/tutorial-first-study.md`, sample data (50-query set + sample ES index of ~1,000 docs from the Amazon ESCI subset), README
- depends on: feat_agent_propose_search_space feat_auto_followup_studies feat_chat_agent feat_chat_last_message_preview feat_cluster_target_filter feat_config_repo_baseline_tracking feat_contextual_help feat_contextual_help_mvp2 feat_create_study_search_space_builder feat_create_study_target_autocomplete feat_data_table_primitive feat_digest_executable_followups feat_digest_proposal feat_fts_rank_ordering_mvp2 feat_github_pr_worker feat_github_webhook feat_home_demo_reseed_endpoint feat_home_first_run_demo_nudge feat_judgments_periodic_resume_sweep feat_llm_judgments feat_orchestrator_zero_streak_abort feat_pr_metric_confidence feat_proposals_ui feat_query_inline_crud feat_studies_ui feat_study_baseline_trial feat_study_clone_from_previous feat_study_lifecycle feat_study_preflight_overlap_probe feat_study_target_judgment_mismatch_guard infra_adapter_elastic infra_arq_subprocess_test_mvp2 infra_ci_smoke_makeup infra_dashboard_regen_pre_commit_conflict infra_e2e_seed_completed_study infra_e2e_wire_seed_helper_into_studies_spec infra_foundation infra_frontend_stack_refresh infra_ir_measures_migration infra_make_targets_split_backend_only infra_nvmrc infra_optuna_eval infra_per_trial_timeout infra_structlog_test_helpers infra_study_preflight_real_engine_integration infra_uv_sync_drops_precommit
+ depends on: feat_agent_propose_search_space feat_auto_followup_studies feat_chat_agent feat_chat_last_message_preview feat_cluster_target_filter feat_config_repo_baseline_tracking feat_contextual_help feat_contextual_help_mvp2 feat_create_study_search_space_builder feat_create_study_target_autocomplete feat_data_table_primitive feat_digest_executable_followups feat_digest_proposal feat_fts_rank_ordering_mvp2 feat_github_pr_worker feat_github_webhook feat_home_demo_reseed_endpoint feat_home_first_run_demo_nudge feat_judgments_periodic_resume_sweep feat_llm_judgments feat_orchestrator_zero_streak_abort feat_pr_metric_confidence feat_proposals_ui feat_query_inline_crud feat_studies_ui feat_study_baseline_trial feat_study_clone_from_previous feat_study_lifecycle feat_study_preflight_overlap_probe feat_study_target_judgment_mismatch_guard feat_ubi_judgments infra_adapter_elastic infra_arq_subprocess_test_mvp2 infra_ci_smoke_makeup infra_dashboard_regen_pre_commit_conflict infra_e2e_seed_completed_study infra_e2e_wire_seed_helper_into_studies_spec infra_foundation infra_frontend_stack_refresh infra_ir_measures_migration infra_make_targets_split_backend_only infra_nvmrc infra_optuna_eval infra_per_trial_timeout infra_structlog_test_helpers infra_study_preflight_real_engine_integration infra_uv_sync_drops_precommit
diff --git a/docs/00_overview/product/relevance-copilot-spec.md b/docs/00_overview/product/relevance-copilot-spec.md index cef9f327..1e3755bc 100644 --- a/docs/00_overview/product/relevance-copilot-spec.md +++ b/docs/00_overview/product/relevance-copilot-spec.md @@ -13,15 +13,16 @@ RelyLoop is an open-source tool for enterprise search platform teams. It combine The tool is a single, engine-agnostic, provider-agnostic system: one UI, one workflow, one schema. Differences between Elasticsearch / OpenSearch, Lucidworks Fusion, and any future engine (pure Solr, Vespa, etc.) are isolated behind a thin adapter interface — and the same adapter pattern applies to LLM providers (OpenAI, Anthropic, Bedrock, Azure OpenAI, Vertex, self-hosted Ollama/vLLM) and Git providers (GitHub, GitLab, Bitbucket). Multi-tenancy is supported from the schema level so a single deployment can serve many downstream customers in isolation. -**Delivery is incremental across five releases**, each meaningful as a discrete capability bundle: +**Delivery is incremental across six releases**, each meaningful as a discrete capability bundle: - **MVP1 / v0.1 (5 weeks) — "The Loop":** Karpathy loop end-to-end on a laptop. ES + OpenSearch, OpenAI, GitHub, single-tenant, basic logging. Demonstrates the value prop. +- **MVP1.5 / v0.1.5 (+2 weeks) — "Real Signals":** OpenSearch UBI as a first-class judgment source. `UbiReader` (engine-agnostic) + pluggable `SignalsConverter` Protocol + hybrid UBI+LLM mode. Earns the evaluation of operators with real traffic who distrust LLM-as-judge as the primary trust anchor. - **MVP2 / v0.2 (+3 weeks) — "Observable":** Langfuse + SigNoz + event catalog + audit immutability + lineage columns + PII redaction. Trustworthy enough for serious evaluation. -- **MVP3 / v0.3 (+3 weeks) — "Production Stacks":** Lucidworks Fusion adapter + multi-Git provider abstraction (GitLab, Bitbucket) + adapter contract tests. Works against real enterprise stacks. 
+- **MVP3 / v0.3 (+3 weeks) — "Production Stacks":** Lucidworks Fusion adapter (and its native signals reader feeding the MVP1.5 Protocol) + multi-Git provider abstraction (GitLab, Bitbucket) + adapter contract tests. Works against real enterprise stacks. - **MVP4 / v0.4 (+3 weeks) — "Multi-tenant, Multi-LLM":** Tenants + tenant-scoped API keys + multi-LLM provider abstraction (Anthropic, Bedrock, Azure OpenAI, Vertex, Ollama/vLLM). Platform-team scale. - **GA v1 / v1.0 (+3 weeks) — "Production-ready":** LangGraph orchestrator + full agent-first API surface + four-layer test pyramid + full GitHub Actions CI/CD with security gates + complete OSS governance. -Total: 17 weeks single-engineer, 10–12 weeks with two. Each release ships a coherent step-up in adopter value and audience reach. +Total: ~19 weeks single-engineer, 12–14 weeks with two. Each release ships a coherent step-up in adopter value and audience reach. The HTTP API is designed as a first-class product, not just the back end of the UI. Every operation a human or the in-tool orchestrator can perform is also callable by an external agent over plain REST, with bearer-token auth, OpenAPI 3.1 publication, idempotency keys, outgoing webhooks, SSE event streams, and machine-readable capability discovery. See §21 *Agent integration*. @@ -714,23 +715,33 @@ The `source` field tracks judgment provenance: - `llm` — generated by an LLM-as-judge call against a documented rubric - `human` — entered or overridden by a relevance team member via the UI -- `click` — derived from real user behavior data (Fusion signals, Behavioral Analytics, or imported logs) +- `click` — derived from real user behavior data (OpenSearch UBI primarily; engine-native streams where present) A judgment list can mix sources. The Judgment Review UI surfaces source per row and the calibration stats account for source mix. 
-### Click-derived judgments from Fusion Signals (v1.5+) +### Click-derived judgments — OpenSearch UBI as the engine-neutral primary path (MVP1.5) -When a Lucidworks Fusion cluster has Signals enabled, the Fusion adapter exposes a `pull_signals` operation that aggregates raw signals over a window into per-(query, doc) interaction features: +**User Behavior Insights** is a standardized, engine-neutral schema (championed by Eric Pugh / OpenSource Connections) for capturing search events. The OpenSearch UBI plugin (2024) writes two indices into the cluster being tuned: + +- `ubi_queries` — the searches users issued: query text, client ID, session ID, application, requested filters, response time, hit count. +- `ubi_events` — what users did next: click, view, dwell, add-to-cart, conversion, refinement; each event references a `query_id` from `ubi_queries`. + +Because UBI is just two indices in the cluster RelyLoop is already adapting, the integration is engine-agnostic: a new `UbiReader` reads UBI indices via the existing `SearchAdapter.search_batch` and aggregates raw events into per-(query, doc) interaction features: - click count - impression count - click-through rate (with position-bias correction) - post-click dwell-time mean +- conversion rate (where the operator emits conversion events) - query-refinement rate -A signals-to-judgment converter (separate module, plugged in behind a config) maps these features to a 0–3 rating. Multiple converters are supported — counterfactual click models (CCM, DBN), simple CTR thresholds, or hybrid LLM+signals where signals provide the rating and LLM judgment is sought only for documents with insufficient impression volume. +The pluggable `SignalsConverter` then maps these features to a 0–3 rating. Initial converters: position-bias-corrected CTR threshold, dwell-time threshold, and **hybrid UBI+LLM** (UBI rates the dense head; LLM-as-judge fills the long tail for queries below an impression threshold). 
Counterfactual click models (CCM, DBN) are documented as v1.5+ post-GA extensions because they need enough impressions per (query, doc) to be statistically meaningful. + +The judgments table accepts mixed-source lists today (the `source IN ('llm', 'human', 'click')` CHECK has shipped since MVP1) — no schema migration is required to turn this on. The MVP1.5 deliverable is the `UbiReader` + `SignalsConverter` + a new `POST /api/v1/judgment-lists/generate-from-ubi` endpoint + a new `generate_judgments_from_ubi` agent tool. See [`feat_ubi_judgments/idea.md`](../../02_product/planned_features/feat_ubi_judgments/idea.md) for the planned-feature scope. + +Predicated on the operator having installed the OpenSearch UBI plugin and logged enough events to be statistically useful. Deployments without UBI continue to run LLM-as-judge unchanged. -For the user's deployment, signals are not yet enabled in any environment — Signals integration is therefore v1.5+. The architecture (the `source` field, the converter plug-in pattern) is in place from v1 so that turning on Signals later is additive, not a rewrite. +**Engine-native readers as a drop-in extension.** Operators on engines that haven't adopted UBI but have their own behavioral-data stream — Elastic Behavioral Analytics for ES clusters, the Fusion `{app}_signals` collection for Fusion clusters — get a thin engine-specific reader feeding the same `SignalsConverter` Protocol. Reader work is local to the adapter that ships it (the Fusion reader rides MVP3 alongside the Fusion adapter; the ES Behavioral Analytics reader rides v2). The converter library, the API surface, and the storage shape are unchanged across all readers. ### LLM-as-judge @@ -1274,7 +1285,7 @@ The orchestrator agent in the API backend uses OpenAI function calling. 
Tool inv - `list_pipelines(cluster_id)` → `[PipelineSummary]` — list query pipelines available in the Fusion app - `get_pipeline(cluster_id, pipeline_id)` → `PipelineDefinition` — full pipeline JSON with stages - `list_query_profiles(cluster_id)` → `[QueryProfileSummary]` -- `pull_signals(cluster_id, since, until?, query_filter?)` → `SignalsAggregate` — *(v1.5+, requires Signals enabled)* aggregate raw signals into per-(query, doc) interaction features for judgment generation +- `pull_signals(cluster_id, since, until?, query_filter?)` → `SignalsAggregate` — *(MVP3, requires Fusion Signals enabled)* aggregate raw Fusion `{app}_signals` events into per-(query, doc) interaction features. Engine-specific reader feeding the shared `SignalsConverter` Protocol introduced at MVP1.5; see §14 "Click-derived judgments from user behavior data". ### Templates @@ -1288,6 +1299,7 @@ The orchestrator agent in the API backend uses OpenAI function calling. Tool inv - `create_query_set(name, queries[])` → `QuerySet` - `import_queries_from_csv(query_set_id, csv_data)` → `int` - `generate_judgments_llm(query_set_id, cluster_id, target, current_template_id, rubric)` → `JudgmentList` +- `generate_judgments_from_ubi(query_set_id, cluster_id, target, since, until?, converter, llm_fill_threshold?)` → `JudgmentList` — *(MVP1.5, requires OpenSearch UBI plugin)* read `ubi_queries` + `ubi_events`, aggregate per-(query, doc) features via `UbiReader`, run the named `SignalsConverter`, and (optionally) fill the long tail with LLM-as-judge when impression count < `llm_fill_threshold`. Emits a judgment list with mixed `source` rows (`click` + optional `llm`). See §14. - `get_calibration(judgment_list_id)` → `CalibrationStats` ### Search space proposal @@ -2273,11 +2285,12 @@ In each case, the org's existing Fusion dev cluster is usually a viable substitu ## 27. Phased delivery -Delivery is incremental: five releases (MVP1 → MVP4 → GA v1), each meaningful as a discrete capability bundle. 
Each release ships a coherent step-up in adopter value and audience reach, never a partial build. Total wall-clock estimate: **17 weeks single-engineer**, or roughly **10–12 weeks with two engineers** working in parallel after MVP1. +Delivery is incremental: six releases (MVP1 → MVP1.5 → MVP2 → MVP3 → MVP4 → GA v1), each meaningful as a discrete capability bundle. Each release ships a coherent step-up in adopter value and audience reach, never a partial build. Total wall-clock estimate: **~19 weeks single-engineer**, or roughly **12–14 weeks with two engineers** working in parallel after MVP1. | Release | Theme | Timeline | Audience | |---|---|---|---| | MVP1 / v0.1 | The Loop | 5 weeks | Technical evaluators willing to test on a laptop | +| MVP1.5 / v0.1.5 | Real Signals | +2 weeks | Operators running OpenSearch UBI; teams that want trust anchored in real user behavior, not LLM ratings | | MVP2 / v0.2 | Observable | +3 weeks | Platform teams considering serious evaluation | | MVP3 / v0.3 | Production Stacks | +3 weeks | Lucidworks shops, GitLab/Bitbucket enterprises | | MVP4 / v0.4 | Multi-tenant, Multi-LLM | +3 weeks | Platform teams operating for many customers | @@ -2323,6 +2336,38 @@ What MVP1 delivers: a relevance engineer can `docker compose up`, point at a loc --- +### MVP1.5 / v0.1.5 — "Real Signals" (target: +2 weeks) + +**Headline: The loop, grounded in what users actually do.** + +MVP1 ships with LLM-as-judge as the only authoritative judgment source. That's enough to demonstrate the optimization loop, but for operators with production traffic it's a weaker trust anchor than real user behavior. MVP1.5 closes that gap by making **OpenSearch UBI** (User Behavior Insights — a standardized, engine-neutral event-capture schema championed by Eric Pugh / OpenSource Connections, shipped as the OpenSearch UBI plugin in 2024) a first-class judgment source alongside LLM-as-judge. 
+ +**MVP1.5 adds on top of MVP1:** + +- **`UbiReader`** (engine-agnostic) reads the standardized `ubi_queries` + `ubi_events` indices via any `SearchAdapter`'s `search_batch` — no engine-specific code, no new Compose service. Aggregates raw events over an operator-specified window into per-(query, doc) interaction features: click count, impression count, position-bias-corrected CTR, post-click dwell-time mean, conversion rate (where conversions are emitted), refinement rate. +- **Pluggable `SignalsConverter` Protocol** mapping features → 0–3 ratings. Initial implementations: + - **Position-bias-corrected CTR threshold** (default, conservative) + - **Dwell-time threshold** (good for content discovery / long-read use cases) + - **Hybrid UBI+LLM** — UBI rates the dense head; LLM-as-judge fills the long tail for queries below an impression threshold. The mixed-`source` judgment list is the operating mode most adopters will ship to production. +- **No schema migration.** The `judgments.source` CHECK constraint accepts `click` today; a single judgment list can mix `llm` + `human` + `click` rows. The MVP1 schema was designed for this. +- **`POST /api/v1/judgment-lists/generate-from-ubi`** endpoint + **`generate_judgments_from_ubi`** agent tool. Same code path on both surfaces (agent-first symmetry per §21). +- **Calibration spot-check workflow** — same Cohen's kappa / agreement-stat surface as MVP1's LLM calibration, run between UBI-derived ratings and a 30–50 row hand-labeled sample. Catches mis-tuned converters (e.g., dwell-time threshold set too low for the traffic shape). +- **Operator docs** — runbook for installing the OpenSearch UBI plugin, configuring event capture in the application, choosing the right converter for the use case, and a tutorial extension to the MVP1 tutorial that swaps the LLM judgment list for a UBI-derived one once enough events have been captured. 
+- **Documented Phase 2 extensions** (NOT shipped at MVP1.5): counterfactual click models (CCM, DBN); engine-native behavioral-data readers for clusters that haven't adopted UBI — Elastic Behavioral Analytics and others — all feeding the same `SignalsConverter` Protocol unchanged. + +**MVP1.5 does NOT include:** + +- A second Compose service. `UbiReader` runs inside the existing API + worker containers. +- Real-time signal streaming. UBI ratings are computed batch-wise at judgment-list creation time, not on the live serving path — this is still strictly offline Path A (per §27 "Why the deferral is right today"). +- Production quality monitoring or alerting (Path B, v2). +- A schema migration. UBI rides the existing `judgments` table. + +**Audience expansion:** Operators with production search traffic and OpenSearch UBI logging enabled. These adopters disproportionately distrust LLM-as-judge ratings as a primary trust anchor; MVP1.5 is the release that earns their evaluation. Also: a dedicated release signals to the open-source community that UBI is a first-class direction for RelyLoop, not deferred to a post-GA milestone; this is relevant for the OSC community where UBI was incubated. + +**Strategic rationale:** The optimization loop's quality is bounded by the quality of the judgments it scores against. LLM-as-judge unblocks the MVP1 demonstration, but it caps the believability of every winning trial behind "did the LLM actually get the relevance call right?" UBI removes that ceiling for operators with real traffic. Shipping it as the very next release (rather than waiting for MVP2's observability layer or MVP3's Fusion work) keeps the focus on the core value proposition: trustworthy automated relevance tuning.
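The hybrid UBI+LLM mode described in this release section (UBI rates the dense head, LLM-as-judge fills the long tail below an impression threshold) could be sketched roughly as below. This is an illustration under stated assumptions, not the shipped implementation: `PairFeatures`, `Judgment`, `hybrid_judgments`, and both rating callbacks are hypothetical names, and the threshold is supplied by the caller.

```python
from dataclasses import dataclass
from typing import Callable

QueryDocPair = tuple[str, str]  # (query text, doc id); hypothetical alias


@dataclass(frozen=True)
class PairFeatures:
    """Hypothetical per-(query, doc) features aggregated from UBI events."""
    impressions: int
    corrected_ctr: float


@dataclass(frozen=True)
class Judgment:
    """Hypothetical judgment row; `source` mirrors the existing source values."""
    query: str
    doc_id: str
    rating: int  # 0-3 graded
    source: str  # "click" or "llm"


def hybrid_judgments(
    features: dict[QueryDocPair, PairFeatures],
    rate_from_clicks: Callable[[PairFeatures], int],
    rate_with_llm: Callable[[str, str], int],
    impression_threshold: int,
) -> list[Judgment]:
    """Route each pair to the click-based rater or to the LLM fallback."""
    judgments: list[Judgment] = []
    for (query, doc_id), feats in features.items():
        if feats.impressions >= impression_threshold:
            # Dense head: enough traffic to trust behavior-derived ratings.
            rating, source = rate_from_clicks(feats), "click"
        else:
            # Long tail: too few impressions, fall back to LLM-as-judge.
            rating, source = rate_with_llm(query, doc_id), "llm"
        judgments.append(Judgment(query, doc_id, rating, source))
    return judgments
```

The result is a single list whose rows carry different `source` values, which is the mixed-source shape the existing `judgments` table is said to accept.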
+ +--- + ### MVP2 / v0.2 — "Observable" (target: +3 weeks) **Headline: The loop you can audit.** @@ -2363,6 +2408,7 @@ v0.3 broadens the supported production stack by adding the Lucidworks Fusion ada - Fusion-specific tools: `list_pipelines`, `get_pipeline`, `list_query_profiles` - Two-step apply path (PR edits pipeline params; CI runs `objects-import` to deploy) - `auth_kind = "fusion_session"` and `"fusion_jwt"` paths +- **Engine-native signals reader for Fusion** — aggregates events from the `{app}_signals` collection into the same per-(query, doc) feature shape MVP1.5's `UbiReader` produces. Reuses the MVP1.5 `SignalsConverter` Protocol unchanged; only the read path is Fusion-specific. Relevant for Fusion deployments that haven't adopted UBI. - **Multi-Git-provider abstraction** — `GitProvider` Protocol with three implementations: - GitHub (already present from MVP1) - GitLab — token or app auth, project-level webhooks, MR + approval rules @@ -2435,7 +2481,9 @@ GA v1 layers in the polish that elevates RelyLoop from a working tool to a prope **Strategic rationale:** GA v1 is the moment RelyLoop becomes a real open-source product, not just a working tool. It's contributor-ready (governance), production-ready (testing, security, observability already in place since MVP2), and adoption-ready (docs, distribution, design partners). -### v1.5 (target: +4 weeks) +### v1.5+ (post-GA, target: +4 weeks) + +Post-GA polish items. UBI (MVP1.5) and engine-native behavioral-data readers (MVP3 / v2) used to live here; they were promoted to the release timeline when MVP1.5 was introduced as a formal tier. 
- Multiple config repos - Outgoing webhooks for resource lifecycle events (study, digest, proposal, PR state) — replaces polling for both internal and external agents @@ -2446,7 +2494,7 @@ GA v1 layers in the polish that elevates RelyLoop from a working tool to a prope - Performance hardening (worker pool tuning, RDB indexes) - Cost dashboard and per-user OpenAI quotas - W3C Trace Context (`traceparent`) propagation through to ES/Fusion -- **Fusion Signals integration:** `pull_signals` adapter operation, signals-to-judgment converter (CTR-threshold and counterfactual click model variants), hybrid signals+LLM judgment lists. Predicated on user enabling Signals in their dev environment. +- Counterfactual click models (CCM, DBN) as additional `SignalsConverter` implementations on top of the MVP1.5 Protocol — relevant once enough impressions per (query, doc) have accumulated to make them statistically valid ### v2 (TBD) diff --git a/docs/01_architecture/tech-stack.md b/docs/01_architecture/tech-stack.md index 0451805b..7d29b262 100644 --- a/docs/01_architecture/tech-stack.md +++ b/docs/01_architecture/tech-stack.md @@ -12,6 +12,7 @@ This is the source-of-truth release matrix that every other arch doc derives fro | Release | Theme | Adds on top of previous | |---|---|---| | **MVP1 / v0.1** | "The Loop" | ES + OpenSearch adapter (single `ElasticAdapter`); LLM via `openai` SDK pointed at any **OpenAI-compatible endpoint** (`OPENAI_BASE_URL` config; defaults to `https://api.openai.com/v1`; works against Ollama, LM Studio, vLLM, HuggingFace TGI for air-gapped evaluation); GitHub Git provider; single-tenant (no `tenants` table, no `tenant_id`); no auth; basic structured logging; Docker Compose; Apache 2.0 LICENSE; 80% backend coverage gate. **No** native non-OpenAI-compatible providers (Anthropic/Bedrock/Vertex SDKs ship at MVP4), **no** observability stack, **no** audit_log, **no** lineage, **no** Fusion, **no** SSO, **no** API keys. 
| +| **MVP1.5 / v0.1.5** | "Real Signals" | **OpenSearch UBI judgments** as a first-class judgment source. New `UbiReader` (engine-agnostic; reads the standardized `ubi_queries` + `ubi_events` indices via any `SearchAdapter`'s `search_batch`) + pluggable `SignalsConverter` Protocol (initial impls: position-bias-corrected CTR, dwell-time threshold, hybrid UBI+LLM where UBI rates the dense head and LLM fills the long tail). Judgment lists can mix sources (`llm` + `human` + `click` rows in the same list — the existing `judgments.source` enum already permits this). New `POST /api/v1/judgment-lists/generate-from-ubi` endpoint + new agent tool `generate_judgments_from_ubi`. **No** schema migration (additive — uses existing `source = 'click'` enum value), **no** new Compose service. Predicated on the operator having the OpenSearch UBI plugin installed and logging events. | | **MVP2 / v0.2** | "Observable" | Langfuse + ClickHouse + SigNoz + OpenTelemetry exporters wired; canonical event catalog; **`audit_log` table + Postgres immutability trigger** (no users/tenants yet — `actor_id`/`tenant_id` nullable, no FKs; FKs added at MVP4); lineage columns (`langfuse_trace_id`, `prompt_version`, `input_hash`) on `judgments`/`digests`/`proposals`; PII redaction; trace context propagation through API → Redis → worker → adapter → engine. | | **MVP3 / v0.3** | "Production Stacks" | **Lucidworks Fusion adapter** (`auth_kind = fusion_session` and `fusion_jwt`); multi-Git-provider abstraction (GitLab + Bitbucket alongside GitHub); adapter contract test suite; production-style install (TLS via Caddy + Let's Encrypt, managed Postgres/Redis); AWS managed OpenSearch (`auth_kind = opensearch_sigv4` activates). **No** SSO/auth yet (production-stack hardening only). 
| | **MVP4 / v0.4** | "Multi-tenant, Multi-LLM" | `tenants` + `tenant_memberships` + `users` + `api_keys` tables; `tenant_id` columns on every user-facing table (with backfill); roles `viewer` / `runner` / `tenant_admin` (per-tenant) + `platform_admin` (cross-tenant); **SSO via reverse proxy** (oauth2-proxy or Authelia injecting `X-Auth-Email`); **Argon2id-hashed bearer API keys** for service accounts; **native non-OpenAI-compatible LLM providers via LangChain `BaseChatModel` abstraction** (Anthropic, AWS Bedrock, Google Vertex AI); per-tenant LLM provider selection + cost rollups; FK constraints added to `audit_log.actor_id` / `audit_log.tenant_id`. (OpenAI-compatible providers — including Ollama, LM Studio, vLLM, HuggingFace TGI — already work in MVP1 via `OPENAI_BASE_URL`.) | diff --git a/docs/02_product/planned_features/bug_dashboard_depends_on_column_bloat/idea.md b/docs/02_product/planned_features/bug_dashboard_depends_on_column_bloat/idea.md new file mode 100644 index 00000000..fbee508f --- /dev/null +++ b/docs/02_product/planned_features/bug_dashboard_depends_on_column_bloat/idea.md @@ -0,0 +1,67 @@ +# Dashboard "Depends on" column bloat — shipped features list future ideas as dependencies + +**Date:** 2026-05-22 +**Status:** Idea — surfaced by Gemini Code Assist review on PR #200 (2026-05-22). Pre-existing bug; this PR only made one more entry visible. +**Priority:** P2 — dashboard is internal planning surface; the bloat doesn't break navigation, it just makes the "Depends on" column meaningless. No daily cost; not unblocking anything. +**Origin:** Gemini Code Assist findings on [PR #200](https://github.com/SoundMindsAI/relyloop/pull/200) flagged 4 instances of completed features (`feat_chat_agent`, `chore_tutorial_polish`) listing `feat_ubi_judgments` as a dependency. 
Verified by `git show main:docs/00_overview/MVP1_DASHBOARD.md | sed -n '35p'` — pre-PR-200, `feat_chat_agent` already had **45** backtick'd feature names in its "Depends on" column, including dozens of features that shipped *after* it. PR #200 only added one more (`feat_ubi_judgments`) to the existing list; the underlying bug is in the dashboard-regen script's parser, not in the planning artifacts. +**Depends on:** None. + +## Problem + +[`scripts/build_mvp1_dashboard.py`](../../../../scripts/build_mvp1_dashboard.py) (2,084 lines) generates the "Depends on" column for each planned-feature row in [`MVP1_DASHBOARD.md`](../../../00_overview/MVP1_DASHBOARD.md) and `mvp1_dashboard.html`. The current output produces logical impossibilities: + +- **`feat_chat_agent`** shipped 2026-05-12 (PR #60). Its "Depends on" column lists 46 features, including `feat_pr_metric_confidence` (shipped 2026-05-21), `feat_study_clone_from_previous` (idea only), `feat_ubi_judgments` (idea only, dated 2026-05-22). A merged feature can't depend on ideas that didn't exist yet. +- **`chore_tutorial_polish`** shipped 2026-05-12 (PR #64). Same pattern — lists 46 features, most shipped weeks later or still unscoped. +- The shape suggests the regen script is treating **every backtick'd feature-name reference in a spec/idea body** as a forward "depends on" relationship — including cases where the reference is in a "Relationship to other work" section, a "Future extensions" paragraph, or a comparison to a sibling feature. + +The "Depends on" column should reflect the **forward dependency graph** — i.e., what each planned feature needs to ship *before* it. The canonical source is the `**Depends on:**` line in each `idea.md` / `feature_spec.md` ([`feature_templates/idea-template.md`](../feature_templates/idea-template.md) requires it). Parsing that line directly (instead of grep'ing the whole document for backtick'd feature names) would fix the bloat. 
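A minimal sketch of the narrowed parser, assuming the metadata line appears verbatim as `**Depends on:**` at the start of a line. The function name `parse_depends_on` and the prefix filter are hypothetical and are not taken from `scripts/build_mvp1_dashboard.py`; the filter is an assumption added so that backticked table or class names on that line are not mistaken for features.

```python
import re

# Matches only the canonical metadata line, not references in the document body.
DEPENDS_ON_LINE = re.compile(r"^\*\*Depends on:\*\*(?P<rest>.*)$", re.MULTILINE)
BACKTICKED_NAME = re.compile(r"`([^`]+)`")

# Assumption: planned-feature directory names start with one of these prefixes.
FEATURE_PREFIXES = ("feat_", "infra_", "chore_", "bug_", "epic_")


def parse_depends_on(markdown: str) -> list[str]:
    """Return feature names from the first `**Depends on:**` line only."""
    match = DEPENDS_ON_LINE.search(markdown)  # first occurrence wins
    if match is None:
        return []  # caller renders an empty list as a dash
    names = BACKTICKED_NAME.findall(match.group("rest"))
    return [name for name in names if name.startswith(FEATURE_PREFIXES)]
```

Because `.` does not cross newlines, backticked names in later sections such as "Relationship to other work" are never scanned, and a line reading `**Depends on:** None.` yields an empty list.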
+ +## Proposed capabilities + +Single tier — fix the parser; no schema or UI change. + +### Parser correction + +- **Locate the "Depends on" extraction logic** in [`scripts/build_mvp1_dashboard.py`](../../../../scripts/build_mvp1_dashboard.py). Likely a regex sweep over the whole document body; needs to be scoped to the `**Depends on:**` line only. +- **Spec format:** `**Depends on:** `. The line lives near the top of every idea.md (per the template) and every implemented feature_spec.md. +- **Edge cases:** + - `Depends on: None` → empty list (rendered as `—` in the markdown). + - `Depends on:` followed by a paragraph of prose with multiple backtick'd names → parse all backtick'd names on that line only. + - Multiple "Depends on:" lines (shouldn't exist, but defensive) → use the first. + - Implemented features whose canonical `feature_spec.md` predates the convention → fall back to scanning the first 30 lines for an explicit "Depends on" mention; if none, emit `—`. +- **Add a regression test:** new unit test in `backend/tests/unit/scripts/test_dashboard_depends_on.py` (or wherever existing tests for the regen script live) asserts that for a fixture set of idea.md files, only the `**Depends on:**` line is parsed — not body-level backtick references. + +### Verify the bloated rows shrink + +After the fix, re-run `python scripts/build_mvp1_dashboard.py`. Expected outcomes: + +- `feat_chat_agent` "Depends on" column drops from 46 entries to whatever its actual spec lists (likely 1-3: `infra_foundation`, `infra_adapter_elastic`, possibly `feat_study_lifecycle`). +- Every implemented feature's "Depends on" column drops to its real forward dependency graph. +- Spot-check 3-5 rows against the source `feature_spec.md` "Depends on:" line to confirm parity. + +### Out of scope + +- Adding a "Depended on by" (reverse-dependency) column. The current dashboard has no such surface; reverse lookups can come later if useful. +- UI / HTML styling changes. 
The bug is purely in the data layer. + +## Scope signals + +- **Backend:** 0 LOC (no API change). +- **Scripts:** ~30–80 LOC in `scripts/build_mvp1_dashboard.py` — narrowing the parser. Plus ~80 LOC test coverage. +- **Frontend:** 0 LOC. +- **Migration:** None. +- **Config:** None. +- **Audit events:** N/A. +- **Tests:** 1 new unit test file with ~5–10 cases covering the parser's edge cases. After the fix, run the regen end-to-end and visually spot-check 5 rows. + +## Why not implemented inline in PR #200 + +PR #200 is doc-only — it adds the MVP1.5 release tier and the `feat_ubi_judgments` idea. Fixing the dashboard regen script would mix a `scripts/` code change into a `docs/`-only PR (different `paths-ignore` behavior in CI; different review lens). Per the inline-fix vs idea-file rubric in `CLAUDE.md`: "Fix requires a separate subsystem AND >250 LOC AND no immediate path to inline → Idea file." The dashboard regen script is a separate subsystem from the planning docs; mixing breaks reviewability. + +The fix is bounded enough to ship in a follow-up PR with no further design work. ~60–90 minutes of work. + +## Relationship to other work + +- **Surfaced by [PR #200](https://github.com/SoundMindsAI/relyloop/pull/200)** — Gemini Code Assist flagged 4 instances when `feat_ubi_judgments` got added to the bloated lists. The bug pre-exists PR #200; this idea is the deferred-fix capture. +- **Adjacent to [`infra_dashboard_regen_pre_commit_conflict`](../infra_dashboard_regen_pre_commit_conflict/)** (status: TBD — also a dashboard-regen issue, but about pre-commit hook conflicts rather than "Depends on" parsing). May be worth bundling into a single dashboard-regen-quality PR if both are tackled together. +- **Does NOT block any planned feature.** The dashboard is internal planning surface; the bloated column doesn't break navigation or block decisions, it just makes "Depends on" non-actionable. 
diff --git a/docs/02_product/planned_features/feat_ubi_judgments/idea.md b/docs/02_product/planned_features/feat_ubi_judgments/idea.md new file mode 100644 index 00000000..a005d573 --- /dev/null +++ b/docs/02_product/planned_features/feat_ubi_judgments/idea.md @@ -0,0 +1,85 @@ +# UBI Judgments — make OpenSearch User Behavior Insights a first-class judgment source + +**Date:** 2026-05-22 +**Status:** Idea — anchor feature for MVP1.5 / v0.1.5 "Real Signals" +**Priority:** P1 — MVP1.5 is named for this capability; nothing else in that release ships without it. +**Origin:** Reframing prompted by an external review on 2026-05-22 (LinkedIn outreach to a senior search engineer at a relevance-tooling company who pushed back on LLM-as-judge as the only authoritative judgment source for v1). Cross-checked against [`docs/00_overview/product/relevance-copilot-spec.md`](../../../00_overview/product/relevance-copilot-spec.md) §14 — the existing spec anticipated click-derived judgments but framed them per-engine without naming UBI's standardized cross-engine schema. This idea consolidates that surface around the OpenSearch UBI plugin as the engine-neutral primary path. +**Depends on:** MVP1 shipped (specifically: [`judgments`](../../../../backend/app/db/models/judgment.py) + [`judgment_lists`](../../../../backend/app/db/models/judgment_list.py) tables, [`ElasticAdapter`](../../../../backend/app/adapters/elastic.py) with `SearchAdapter.search_batch`, [`generate_judgments_llm`](../../../../backend/workers/judgments.py) agent tool pattern). All prerequisites are in `main` as of 2026-05-23. + +## Problem + +MVP1 ships with **LLM-as-judge** as the only authoritative judgment source. 
The architecture anticipated this would change — the `judgments.source` CHECK already accepts `click` ([`backend/app/db/models/judgment.py:42-48`](../../../../backend/app/db/models/judgment.py#L42-L48)), and judgment lists can mix sources by design ([umbrella spec §14 line 719](../../../00_overview/product/relevance-copilot-spec.md)). But the actual reader, converter, and ingestion endpoint have never been built. + +This leaves three unsolved gaps for operators with production search traffic: + +1. **LLM-as-judge is a weaker trust anchor than real user behavior.** For e-commerce, content discovery, and any surface where user intent is the source of truth, ratings derived from clicks + dwell + conversions reflect what users *find* relevant, not what an LLM *guesses* should be relevant. The optimization loop's quality ceiling is the judgment list's quality; replacing the ceiling is the single biggest believability upgrade RelyLoop can ship. +2. **Judgment-list scale and freshness are bounded.** LLM-as-judge produces hundreds to low thousands of (query, doc) ratings per call (rate-limited, cost-bounded). The 80/20 long tail of queries users actually issue never gets rated. Each new study reuses a snapshot judgment list that goes stale; there's no continuous-refresh path. +3. **UBI is the standardized schema, and OpenSearch is the MVP1 engine target.** The OpenSearch UBI plugin (shipped 2024, championed by Eric Pugh / OpenSource Connections — the same team behind Quepid and the Haystack conference) writes two standardized indices into the cluster RelyLoop is already adapting: `ubi_queries` and `ubi_events`. The integration friction is unusually low — RelyLoop reads two indices in a cluster it already talks to, no new infrastructure on either side. The current spec framing (engine-specific `pull_signals` adapter methods, Fusion Signals at v1.5, ES Behavioral Analytics at v2) under-uses this standardization. 
+ +## Proposed capabilities + +Single-tier — small, additive, no schema migration. Five capability blocks below. + +### `UbiReader` — engine-agnostic read layer + +- **Location:** new module `backend/app/services/ubi_reader.py` + supporting feature aggregation in `backend/app/domain/ubi/features.py`. +- **Inputs:** `cluster_id`, `target` (the live index being tuned, used to disambiguate UBI events emitted from multiple applications against the same UBI indices), `since` / `until` window, optional `query_filter` (substring or exact-match), optional `max_queries` (default 5000). +- **Reads:** the standardized `ubi_queries` and `ubi_events` indices via `SearchAdapter.search_batch` — the engine adapter is unchanged, the reader uses two scrolling searches and a client-side join on `query_id`. No new adapter method, no Fusion-side branch. +- **Output:** a per-(query, doc) feature dict with click count, impression count, position-bias-corrected CTR (Wang-Bendersky correction with a configurable position-bias prior; CCM/DBN deferred to v1.5+), post-click dwell-time mean, conversion rate (where the operator emits conversion events; NULL otherwise), refinement rate. +- **Engine-agnostic by construction.** Any `SearchAdapter` that can run a `search_batch` over `ubi_queries` + `ubi_events` is supported. ES + OpenSearch both work in MVP1.5; engines added later (Fusion at MVP3, others as adapters land) work the moment their adapter ships, no UBI-specific code required. +- **Operator-facing constraint:** the OpenSearch UBI plugin must be installed and event capture enabled in the operator's application. A capability check at endpoint entry returns 412 `UBI_NOT_ENABLED` if `ubi_queries` is absent. + +### `SignalsConverter` Protocol + initial implementations + +- **Location:** new module `backend/app/domain/ubi/converter.py` with the Protocol + three concrete impls. 
+- **Protocol:** `convert(features: dict[QueryDocPair, FeatureVec]) -> dict[QueryDocPair, Rating]` where `Rating` is 0–3 graded. Pure-domain, no I/O. +- **Initial implementations (MVP1.5):** + - `CtrThresholdConverter` — position-bias-corrected CTR mapped to 0/1/2/3 via configurable thresholds (defaults: 0.05 / 0.15 / 0.30). Conservative, works on small-traffic clusters. + - `DwellTimeThresholdConverter` — post-click dwell-time mapped to ratings. Good for content discovery / long-read surfaces where clicks alone don't separate scan-and-bounce from genuine engagement. + - `HybridUbiLlmConverter` — UBI converter applies where `impressions >= llm_fill_threshold` (default 20); below the threshold the LLM-as-judge path runs over the (query, doc) pair and the resulting `source='llm'` row is interleaved with `source='click'` rows in the same judgment list. This is the operating mode most adopters will ship to production. +- **Deferred to v1.5+ post-GA:** `CcmConverter` and `DbnConverter` (counterfactual click models). Require enough impressions per (query, doc) to be statistically valid, which most early-MVP1.5 adopters won't have. Same Protocol — additive. + +### API surface + +- **New endpoint:** `POST /api/v1/judgment-lists/generate-from-ubi` taking `{cluster_id, target, query_set_id, since, until?, converter: "ctr_threshold" | "dwell_time" | "hybrid_ubi_llm", converter_config?: dict, llm_fill_threshold?: int, name: str}` → 202 `{judgment_list_id, status: "generating"}`. Idempotency via `Idempotency-Key` header (consistent with the rest of the API). +- **Background worker:** new `backend/workers/judgments.py:generate_judgments_from_ubi` Arq job that pulls UBI features, runs the converter, optionally invokes the LLM fill, and INSERTs `judgments` rows with the appropriate `source` value per row. Calibration row written to `judgment_lists.calibration` on completion. 
+- **Error envelopes:** `UBI_NOT_ENABLED` (412) when `ubi_queries` is missing; `UBI_INSUFFICIENT_DATA` (422) when fewer than `min_impressions_threshold` events match the window/query set; `UBI_QUERY_MAPPING_AMBIGUOUS` (422) when a UBI `user_query` string maps to more than one `query_set.queries.query_text` and the operator hasn't specified a tiebreaker. + +### Agent tool + +- **New tool:** `generate_judgments_from_ubi(query_set_id, cluster_id, target, since, until?, converter, llm_fill_threshold?)` → `JudgmentList`. Mirrors `generate_judgments_llm` shape so the chat agent can switch between the two transparently. Listed in spec §19 Query sets & judgments alongside `generate_judgments_llm`. +- **System prompt update:** the orchestrator's tool description for "generate a judgment list" now prefers UBI when the operator's cluster has UBI enabled (detected via a one-shot `get_schema` probe for the `ubi_queries` index), and falls back to LLM-as-judge otherwise. This is the chat ergonomic that earns the MVP1.5 release name. + +### Operator-facing documentation + +- **New runbook:** `docs/03_runbooks/ubi-judgment-generation.md` — installing the OpenSearch UBI plugin, configuring event capture in the operator's application, choosing the right converter for the use case, calibrating thresholds against a 30–50 row hand-labeled sample. +- **Tutorial extension:** `docs/08_guides/tutorial-first-study.md` gains a Step 7 — "Swap the LLM judgment list for a UBI-derived one." Demonstrates the value upgrade by re-running the tutorial study against the new list and surfacing the metric delta. + +## Scope signals + +- **Backend:** ~600 LOC — `ubi_reader.py` (~200), `domain/ubi/features.py` (~100), `domain/ubi/converter.py` (~150), worker (~80), router additions (~70). Plus ~250 LOC test coverage across unit/integration/contract layers. 
+- **Frontend:** ~150 LOC — extend the judgment-generation modal (`ui/src/components/judgments/create-judgment-modal.tsx` or whatever sibling shape lands by then) with a "source: LLM | UBI | Hybrid" picker + UBI window controls; new empty-state on the judgment-list detail page when the converter dropped some pairs as insufficient-data. +- **Migration:** **none.** UBI rides the existing `judgments` table; the `source IN ('llm', 'human', 'click')` CHECK already accepts the new value. Alembic head unchanged at whatever MVP1 ships. +- **Config:** one new optional env var `UBI_POSITION_BIAS_PRIOR_FILE` for operators who want to override the default Wang-Bendersky prior with a learned table. Default behaves like an uninformed prior. +- **Audit events:** N/A (MVP1.5 still pre-`audit_log`; that surface activates at MVP2). +- **Tests:** + - Unit: converter math (CTR thresholds, dwell-time thresholds, hybrid routing), feature aggregation, position-bias correction edge cases (zero impressions, single-impression queries, NULL dwell) + - Integration: end-to-end `POST /api/v1/judgment-lists/generate-from-ubi` against a stubbed `UbiReader` that returns canned feature vectors; mixed-source judgment list round-trip (INSERT + SELECT + calibration roll-up) + - Contract: error-code envelopes (`UBI_NOT_ENABLED`, `UBI_INSUFFICIENT_DATA`, `UBI_QUERY_MAPPING_AMBIGUOUS`), OpenAPI shape lock for the new endpoint, agent-tool registry inventory test + - Real-engine integration (optional, gated): UBI plugin smoke test against a CI OpenSearch service container with seeded `ubi_queries` + `ubi_events` indices + +## Why not implemented inline in MVP1 + +1. **MVP1 is sized to demonstrate the loop, not to maximize judgment quality.** Adding UBI inline doubles the judgment-source code path before the LLM-as-judge path has been proven against real adopter feedback. 
Shipping LLM-only first lets MVP1 stay focused on the optimization-loop value prop; MVP1.5 then earns the trust upgrade for operators with traffic. +2. **Converter strategy benefits from MVP1 adopter feedback.** Position-bias priors, dwell-time thresholds, and the LLM-fill cutoff are all judgment calls that get sharper after watching adopters run MVP1's LLM-as-judge against their real data. Building MVP1.5 against MVP1 adopter signal is meaningfully cheaper than building it speculatively. +3. **No schema migration is required to wait.** The `judgments.source` enum, the mixed-source judgment list contract, and the `SignalsConverter` Protocol shape were designed for this upgrade from day one. Delaying ships nothing important earlier; rushing ships a less-tuned converter. +4. **Strategic positioning.** Naming a dedicated MVP1.5 "Real Signals" release for UBI signals that UBI is a first-class direction — relevant for adoption in the OSC community where UBI was incubated, and for design partners who'd otherwise discount RelyLoop as an LLM-only tuning toy. Burying UBI in MVP2 "Observable" or MVP3 "Production Stacks" misses that positioning. + +## Relationship to other work + +- **Cleans up [`docs/00_overview/product/relevance-copilot-spec.md`](../../../00_overview/product/relevance-copilot-spec.md) §14 + §19 + §27** — the spec previously framed click data as a per-engine adapter concern with engine-specific timelines. The §14 patch (landing with this idea) re-anchors the architecture around the engine-neutral OpenSearch UBI schema, with engine-native readers (Elastic Behavioral Analytics, the Fusion `{app}_signals` collection, etc.) as thin extensions feeding the same `SignalsConverter` Protocol. +- **Composes with [`feat_auto_followup_studies`](../feat_auto_followup_studies/idea.md)** — auto-chained follow-up studies become dramatically more useful with a continuously-refreshed UBI judgment list than with a snapshot LLM-as-judge list. 
The two features are complementary; UBI ships first. +- **Composes with [`feat_pr_metric_confidence`](../../../00_overview/implemented_features/2026_05_21_feat_pr_metric_confidence/)** (shipped 2026-05-21) — the confidence framing in the PR body becomes meaningfully stronger when "the metric was scored against 50,000 UBI-derived ratings covering 90% of last week's traffic" replaces "the metric was scored against 500 LLM ratings against a snapshot query set." +- **Composes with [`feat_study_baseline_trial`](../feat_study_baseline_trial/idea.md) + [`feat_config_repo_baseline_tracking`](../feat_config_repo_baseline_tracking/idea.md)** — once UBI is the judgment source, "the baseline metric on the live config" becomes a meaningful absolute number rather than a synthetic LLM-rated approximation. Materially raises the credibility of every winning trial. +- **Does NOT block MVP2 "Observable"** — Langfuse and SigNoz instrumentation can layer on top of `generate_judgments_from_ubi` exactly as it would on top of `generate_judgments_llm`. The `langfuse_trace_id` lineage column landing at MVP2 will be NULL for `source='click'` rows (which never invoke an LLM) and populated for `source='llm'` rows in the hybrid case — same column, source-dependent fill. +- **Does NOT block later engine work** — the MVP1.5 `SignalsConverter` Protocol is engine-agnostic. New adapters added in later releases contribute their own engine-native reader (where they have one) feeding the same Protocol; the converter library and the API surface are unchanged regardless of which engines ship.
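For orientation, the `SignalsConverter` Protocol and the `CtrThresholdConverter` described under "Proposed capabilities" might look roughly like the sketch below. It is not the planned module: the `FeatureVec` fields are hypothetical, and treating each threshold as an inclusive lower bound is an assumption, since the idea only gives the default cut points 0.05 / 0.15 / 0.30.

```python
from dataclasses import dataclass
from typing import Protocol

QueryDocPair = tuple[str, str]  # (query text, doc id); hypothetical alias
Rating = int  # 0-3 graded


@dataclass(frozen=True)
class FeatureVec:
    """Hypothetical subset of the per-(query, doc) features."""
    impressions: int
    corrected_ctr: float  # position-bias-corrected CTR


class SignalsConverter(Protocol):
    """Pure-domain mapping from features to graded ratings; no I/O."""

    def convert(
        self, features: dict[QueryDocPair, FeatureVec]
    ) -> dict[QueryDocPair, Rating]: ...


@dataclass(frozen=True)
class CtrThresholdConverter:
    """Maps corrected CTR to 0/1/2/3 using three ascending cut points."""
    thresholds: tuple[float, float, float] = (0.05, 0.15, 0.30)

    def convert(
        self, features: dict[QueryDocPair, FeatureVec]
    ) -> dict[QueryDocPair, Rating]:
        # Rating = number of thresholds the CTR meets or exceeds (assumed inclusive).
        return {
            pair: sum(vec.corrected_ctr >= cut for cut in self.thresholds)
            for pair, vec in features.items()
        }
```

Because `SignalsConverter` is a structural Protocol, a later converter (for example a dwell-time or click-model variant) only needs a matching `convert` method and does not have to inherit from anything.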