diff --git a/state.md b/state.md index 1b9cf4ce..ebe689be 100644 --- a/state.md +++ b/state.md @@ -2,7 +2,7 @@ > Read this first. A one-page snapshot: current focus, the last few merges, what's in flight, what's queued, and where the project sits in the MVP1 → MVP2 → MVP3 → GA roadmap. **Historical feature-merge narrative + chained execution context lives in [`state_history.md`](state_history.md)** — new merge entries land there, not here (per `chore_state_md_size_compression`, 2026-05-29). Keep this file loadable in a single `Read` call. -**Last updated:** 2026-06-18 (**`bug_reset_demo_no_instant_feedback_poll_race` merged** — PR #562, squash-merged `bb247a5c`. The home "Reset to demo state" dialog is responsive again: Confirm gives instant "Starting…" feedback + can't double-fire, and the status poller no longer dies in a start-up race (it was enabling polling before the POST and could read `idle` → stop permanently, freezing the UI + step log). Fix: POST first → `setQueryData('running')` → enable polling. **Next up: scenario-clarity feature** (operator picked the authoritative version — a per-run scenario manifest with live per-scenario state in the reseed status API + a labeled checklist replacing "Scenario 0 of 6"). Earlier today: **`bug_healthz_degraded_blocks_ui_engine_subset`** PR #559 (`/healthz` engine-selection-aware — excluded engine reports non-blocking `not_selected`), `feat_engine_version_selection` PR #553, the `install.sh` `.env`-load fix PR #556. Previously: **`feat_selective_engine_startup_and_demo` Phase 1 merged** — PR #548, squash-merged `9bf20ab2`. Operators can now pick which search engines to load: `RELYLOOP_ENGINES=es` (any subset of `es,os,solr`) in `.env` → `install.sh` → `COMPOSE_PROFILES` → only the selected engines pull + boot (cutting single-engine first-run wall-clock); and the home "Reset to demo state" dialog gains an engine-checkbox group. Default unchanged everywhere: no selection → all three. 
6 stories / 3 epics: Compose `profiles:` + `RELYLOOP_ENGINES` parser (Epic 1); new `GET /api/v1/_test/demo/engines` probe + `ReseedRequest{engines}` body + orchestrator filter + `scenarios_skipped_reasons` field (Epic 2); reset-modal checkbox group + two-reason partial-completion footer (Epic 3). No migration (head stays `0023`). Gemini 2 findings accepted (load-window default; dead `?? {}` fallback). GPT-5.5 unreachable → Opus self-review; all 19 checks green (smoke skipped); merge-skew clean. **Phase 2 + 3 deferred, split into their own folders** (`feat_engine_version_selection`, `feat_reseed_status_sse_streaming`); Phase-1 folder archived to `implemented_features/`. Full narrative in [`state_history.md`](state_history.md). Previously: **`feat_studies_starting_metric` merged** — PR #545, squash-merged `6a18e113`. The studies-list **Best metric** column now renders a combined `starting → best (lift)` cell by plumbing `studies.baseline_metric` onto `StudySummary`; header `Best metric` → `Starting → best`. No migration. Gemini caught a real `((new))` double-parens bug. Full narrative in [`state_history.md`](state_history.md). Previously: **`feat_query_normalizer_typed_pipeline` merged** — PR #509, squash-merged `7a24849`. Extends Phase 1's fixed-bundle `query_normalizer` into a **typed pipeline**: declare `{type:"normalizer_pipeline", steps:[…]}` over 6 atomic `NormalizerStep`s; the Optuna loop searches the **powerset** (`2^N` labels). Bundles stay backward-compatible (one engine — `normalize_pipeline`; `normalize(bundle)` is now a wrapper). FR-3 smart-quote (U+2019) expansion, reserved-key-only validation (`NormalizerPipelineMisplacedError` → `INVALID_SEARCH_SPACE`), adapter pre-render hook resolves bundle OR pipeline label via `steps_for_label`, **bilingual** (Python+JS) PR-body snippets with three-way output parity (committed corpus fixture), `` builder row + 7 glossary keys + digest-advisory broadening, docs. **No migration** (head stays `0023`). 
Q-1 locked (include `expand_contractions_custom` inert — 6 steps); Q-2 locked (JS parity via frontend vitest fixture). Opus self-review (GPT-5.5 unreachable); Gemini 2 Med accepted (`7047190`: strip_punctuation snippets use the runtime regex). Extended `verify_enum_source_of_truth` to resolve StrEnum classes. All 19 `pr.yml` checks green (smoke skipped). Full narrative in [`state_history.md`](state_history.md). Previously: **`bug_cluster_url_ssrf_hostname_bypass` Phase 1 merged** — PR #510, squash-merged `3cb28c7`; closes the cluster `base_url` SSRF hostname-bypass via a flag-gated async resolve-and-classify guard before any probe. New `domain/cluster/url_policy.py` (pure classifier) + `services/cluster_url_policy.py` (async orchestrator) + `400 CLUSTER_URL_BLOCKED`; the two `base_url` validators de-duped to one structural helper. No migration/UI (Alembic head stays `0023`). Surfaced by security review #504 (auto-closed); driven through `/pipeline --auto` (spec→plan→3 stories). 52 new tests; Gemini 2 Med accepted (bounded DNS, malformed-port 422). **Phase 2 (connect-time IP pinning) deferred — folder stays in `planned_features/02_mvp2/` with `phase2_idea.md`, NOT archived.** Previously: **`chore_overnight_result_card_screenshot` finalized** — PR #492 reviewed via `/pr-review` and squash-merged (`4128572`); this entry is its `state.md`/`state_history.md` finalization. Ships the tutorial Step 12 morning-result-card PNG + the first-ever tutorial-image **ferry plumbing** across both doc-copy pipelines (`ui/scripts/copy-docs.mjs` `copyImageAssets`/`pruneStaleImages`; `website/scripts/build_guides.py` `copy_long_form_images()` + a back-compat-defaulted prune kwarg), both guarding the `images/` subdir from the flat `.md`/`rmtree` prune and no-op'ing when the source dir is absent. **Docs/test/script only — no code-path/API/migration** (Alembic head stays `0023`). 17 new tests (10 vitest + 7 pytest). 
GPT-5.5 unreachable → Opus self-review; Gemini 1 Med accepted (`02450dd`: hoist `readFileSync` to a top-level import vs inline `require`). All three freshness gates (`generated-artifacts-fresh`/`copy-docs`/`build-guides`) + 18/19 checks green (smoke skipped — opt-in/off). Full narrative in [`state_history.md`](state_history.md).) +**Last updated:** 2026-06-18 (**`bug_reseed_resolve_engine_base_url_not_idempotent_in_container` merged** — PR #564, squash-merged `e7b787a4`. The home-button demo reseed works in-container again — three stacked root causes that all looked like "stuck at Scenario 0 of 6": (1) `_resolve_engine_base_url` wasn't idempotent (the in-container worker feeds it already-Compose-DNS scenario URLs → it raised `Unrecognized engine host URL`), (2) `install.sh` never pre-created `./data/solr` so after `make reset` Solr's bind mount was a phantom dir and every collection CREATE failed, (3) Arq kept a finished run's result for `keep_result` (~1h) so the deterministic `_job_id` deduped the next reseed into a silent no-op. Fixes: idempotent resolver pass-through; `install.sh` pre-creates `./data/solr` (+ `sudo -n chown 8983` on Linux); the POST deletes the stale `arq:result:` before enqueue. All three verified live on a Solr-only stack (full 50-trial Optuna study ran; retry picks up immediately). Resolves the previously-filed `bug_reseed_failure_blocks_retry_arq_singleton_dedup` (folder removed). Gemini 2 MED — `sudo -n` accepted, trailing-slash normalization deferred. **Next up: scenario-clarity feature** (operator picked the authoritative version — a per-run scenario manifest with live per-scenario state in the reseed status API + a labeled checklist replacing "Scenario 0 of 6"; the counter sitting still *during* a scenario's multi-minute study is exactly the UX this fixes). 
Earlier today: **`bug_reset_demo_no_instant_feedback_poll_race`** PR #562 (reset-dialog responsiveness — Confirm gives instant "Starting…" feedback + can't double-fire; the status poller no longer dies in a start-up race; fix: POST first → `setQueryData('running')` → enable polling), **`bug_healthz_degraded_blocks_ui_engine_subset`** PR #559 (`/healthz` engine-selection-aware — excluded engine reports non-blocking `not_selected`), `feat_engine_version_selection` PR #553, the `install.sh` `.env`-load fix PR #556. Previously: **`feat_selective_engine_startup_and_demo` Phase 1 merged** — PR #548, squash-merged `9bf20ab2`. Operators can now pick which search engines to load: `RELYLOOP_ENGINES=es` (any subset of `es,os,solr`) in `.env` → `install.sh` → `COMPOSE_PROFILES` → only the selected engines pull + boot (cutting single-engine first-run wall-clock); and the home "Reset to demo state" dialog gains an engine-checkbox group. Default unchanged everywhere: no selection → all three. 6 stories / 3 epics: Compose `profiles:` + `RELYLOOP_ENGINES` parser (Epic 1); new `GET /api/v1/_test/demo/engines` probe + `ReseedRequest{engines}` body + orchestrator filter + `scenarios_skipped_reasons` field (Epic 2); reset-modal checkbox group + two-reason partial-completion footer (Epic 3). No migration (head stays `0023`). Gemini 2 findings accepted (load-window default; dead `?? {}` fallback). GPT-5.5 unreachable → Opus self-review; all 19 checks green (smoke skipped); merge-skew clean. **Phase 2 + 3 deferred, split into their own folders** (`feat_engine_version_selection`, `feat_reseed_status_sse_streaming`); Phase-1 folder archived to `implemented_features/`. Full narrative in [`state_history.md`](state_history.md). Previously: **`feat_studies_starting_metric` merged** — PR #545, squash-merged `6a18e113`. 
The studies-list **Best metric** column now renders a combined `starting → best (lift)` cell by plumbing `studies.baseline_metric` onto `StudySummary`; header `Best metric` → `Starting → best`. No migration. Gemini caught a real `((new))` double-parens bug. Full narrative in [`state_history.md`](state_history.md). Previously: **`feat_query_normalizer_typed_pipeline` merged** — PR #509, squash-merged `7a24849`. Extends Phase 1's fixed-bundle `query_normalizer` into a **typed pipeline**: declare `{type:"normalizer_pipeline", steps:[…]}` over 6 atomic `NormalizerStep`s; the Optuna loop searches the **powerset** (`2^N` labels). Bundles stay backward-compatible (one engine — `normalize_pipeline`; `normalize(bundle)` is now a wrapper). FR-3 smart-quote (U+2019) expansion, reserved-key-only validation (`NormalizerPipelineMisplacedError` → `INVALID_SEARCH_SPACE`), adapter pre-render hook resolves bundle OR pipeline label via `steps_for_label`, **bilingual** (Python+JS) PR-body snippets with three-way output parity (committed corpus fixture), `` builder row + 7 glossary keys + digest-advisory broadening, docs. **No migration** (head stays `0023`). Q-1 locked (include `expand_contractions_custom` inert — 6 steps); Q-2 locked (JS parity via frontend vitest fixture). Opus self-review (GPT-5.5 unreachable); Gemini 2 Med accepted (`7047190`: strip_punctuation snippets use the runtime regex). Extended `verify_enum_source_of_truth` to resolve StrEnum classes. All 19 `pr.yml` checks green (smoke skipped). Full narrative in [`state_history.md`](state_history.md). Previously: **`bug_cluster_url_ssrf_hostname_bypass` Phase 1 merged** — PR #510, squash-merged `3cb28c7`; closes the cluster `base_url` SSRF hostname-bypass via a flag-gated async resolve-and-classify guard before any probe. New `domain/cluster/url_policy.py` (pure classifier) + `services/cluster_url_policy.py` (async orchestrator) + `400 CLUSTER_URL_BLOCKED`; the two `base_url` validators de-duped to one structural helper. 
No migration/UI (Alembic head stays `0023`). Surfaced by security review #504 (auto-closed); driven through `/pipeline --auto` (spec→plan→3 stories). 52 new tests; Gemini 2 Med accepted (bounded DNS, malformed-port 422). **Phase 2 (connect-time IP pinning) deferred — folder stays in `planned_features/02_mvp2/` with `phase2_idea.md`, NOT archived.** Previously: **`chore_overnight_result_card_screenshot` finalized** — PR #492 reviewed via `/pr-review` and squash-merged (`4128572`); this entry is its `state.md`/`state_history.md` finalization. Ships the tutorial Step 12 morning-result-card PNG + the first-ever tutorial-image **ferry plumbing** across both doc-copy pipelines (`ui/scripts/copy-docs.mjs` `copyImageAssets`/`pruneStaleImages`; `website/scripts/build_guides.py` `copy_long_form_images()` + a back-compat-defaulted prune kwarg), both guarding the `images/` subdir from the flat `.md`/`rmtree` prune and no-op'ing when the source dir is absent. **Docs/test/script only — no code-path/API/migration** (Alembic head stays `0023`). 17 new tests (10 vitest + 7 pytest). GPT-5.5 unreachable → Opus self-review; Gemini 1 Med accepted (`02450dd`: hoist `readFileSync` to a top-level import vs inline `require`). All three freshness gates (`generated-artifacts-fresh`/`copy-docs`/`build-guides`) + 18/19 checks green (smoke skipped — opt-in/off). Full narrative in [`state_history.md`](state_history.md).) ## Where the roadmap sits @@ -16,7 +16,7 @@ MVP1 (v0.1) **shipped** — all six differentiators live (Bayesian/TPE optimizer ## Current branch / execution context -- **Branch:** `main` (`bug_reset_demo_no_instant_feedback_poll_race` just merged — PR #562, 2026-06-18, squash-merged `bb247a5c`, frontend-only UX fix; today's session also merged `bug_healthz_degraded_blocks_ui_engine_subset` PR #559, `feat_engine_version_selection` PR #553, the `bug_install_sh_env_file_not_loaded` `.env`-load fix PR #556). All `pr.yml` checks green (smoke skipped — opt-in/off). 
**Reset-to-demo dialog is now responsive:** Confirm gives instant "Starting…" feedback + can't double-fire; the status poller no longer dies in a start-up race (it was enabling polling BEFORE the POST and could read `idle` → stop permanently, freezing the UI + step log). Fix: POST first → `setQueryData('running')` → enable polling. **Active feature next:** scenario-clarity (operator picked the authoritative version — a per-run scenario manifest with live per-scenario state in the reseed status API + a labeled checklist UI replacing "Scenario 0 of 6"); to be built via `/pipeline`. **Recent context:** `/healthz` is engine-selection-aware (#559 — excluded engine reports non-blocking `not_selected`); `install.sh` loads `RELYLOOP_*` from `.env` (#556); engine version selection via `RELYLOOP_*_VERSION` (#553). +- **Branch:** `main` (`bug_reseed_resolve_engine_base_url_not_idempotent_in_container` just merged — PR #564, 2026-06-18, squash-merged `e7b787a4`). **The home-button demo reseed works in-container again** — three stacked root causes fixed (idempotent `_resolve_engine_base_url`; `install.sh` pre-creates `./data/solr`; the POST clears Arq's stale `keep_result` singleton result so a retry isn't dedup-blocked for ~1h). All three verified live on a Solr-only stack (full 50-trial Optuna study ran; retry picks up immediately). Resolves the previously-filed `bug_reseed_failure_blocks_retry_arq_singleton_dedup` (folder removed). Today's session also merged `bug_reset_demo_no_instant_feedback_poll_race` PR #562 (reset-dialog responsiveness), `bug_healthz_degraded_blocks_ui_engine_subset` PR #559, `feat_engine_version_selection` PR #553, the `.env`-load fix PR #556. All `pr.yml` checks green (smoke skipped — opt-in/off); merge-skew clean. 
**Active feature next:** scenario-clarity (operator picked the authoritative version — a per-run scenario manifest with live per-scenario state in the reseed status API + a labeled checklist UI replacing "Scenario 0 of 6"; the "Scenario 0 of 6" counter sitting still *during* a scenario's multi-minute study is exactly the UX this feature fixes); to be built via `/pipeline`. - **Active feature:** None in flight. **Deferred follow-on (own folder, idea-stage):** `feat_reseed_status_sse_streaming` (SSE reseed-status migration, defer-until-incident — was `feat_selective_engine_startup_and_demo` Phase 3), standalone under `planned_features/02_mvp2/`. **Deferred phase still parked in-folder:** `bug_cluster_url_ssrf_hostname_bypass` **Phase 2** (connect-time IP pinning for DNS rebinding) in `planned_features/02_mvp2/bug_cluster_url_ssrf_hostname_bypass/phase2_idea.md`. **Still deferred:** `chore_demo_seeding_integration_tests_rewrite` (14-story DB-only integration choreography; blocked on a local stack for safe CI-blind validation) and `infra_pr_yml_split_integration_by_service` (defer-until-integration-is-the-binding-CI-constraint; Win 2′ of its parent shipped). GPT-5.5 unreachable in this env → Opus self-review substitution. - **Alembic head:** `0023_proposals_superseded_status` (unchanged — `feat_fts_rank_ordering` is no-migration; head last moved by `feat_overnight_final_solution_phase3` PR #457). - **Python:** 3.13. **Frontend stack:** Next 16 (App Router + Turbopack), React 19, Tailwind 4 (CSS-first), Vitest 4, ESLint 9 (flat), TypeScript 6, Playwright (chromium, single worker) for E2E. @@ -26,12 +26,12 @@ MVP1 (v0.1) **shipped** — all six differentiators live (Bayesian/TPE optimizer Detail + reasoning for each is in [`state_history.md`](state_history.md). +- **2026-06-18** — `bug_reseed_resolve_engine_base_url_not_idempotent_in_container` (PR #564, squash-merged `e7b787a4`). 
**The home-button demo reseed works in-container again — three stacked root causes.** Operator-reported live: the reseed stuck at "Scenario 0 of 6" and never advanced. (1) **Resolver not idempotent** — the worker (always in-container) receives each scenario's `host_base_url` as the already-Compose-DNS URL (`seed_meaningful_demos.py`'s `_INSIDE_CONTAINER` branch), but `_resolve_engine_base_url` only mapped host-shell `localhost` URLs → Compose-DNS and **raised** on anything else, so `snapshot_engine_reachability` died with `Unrecognized engine host URL: http://elasticsearch:9200`. Latent because the reseed's integration tests mock the engine-probe layer. Fix: pass an already-resolved Compose-DNS target through unchanged (new `_COMPOSE_DNS_TARGETS`); still raise on genuinely-unknown URLs. (2) **Solr couldn't create collections** — `install.sh` only `mkdir`'d `./secrets`, never the engine data dirs, so after `make reset` (`rm -rf ./data`) `./data/solr` didn't exist when Solr started → phantom `/var/solr` bind UID-8983 can't write → every collection CREATE failed (`Underlying core creation failed`). Fix: pre-create `./data/solr` (+ `sudo -n chown 8983` on Linux) before compose-up, gated on Solr in `COMPOSE_PROFILES`, mirroring the smoke job. (3) **Couldn't reseed twice within ~1h** — the deterministic `_job_id` ("demo_reseed:singleton") double-click guard collides with Arq keeping a finished run's result under `arq:result:` for `keep_result` (1h, NOT the 60s the old comment claimed); the next Reset was silently deduped → stuck "enqueued — waiting for worker" with an empty step log. Fix: delete the stale result key before enqueue (the running-status 409 guard already prevents real concurrency; rapid double-clicks still deduped by the in-flight `arq:job` key). **Resolves the previously-filed `bug_reseed_failure_blocks_retry_arq_singleton_dedup`** (folder removed). 
**No migration.** All three verified live on a Solr-only stack (reseed ran a full 50-trial Optuna study; retry after completion picks up immediately). Shipped via `/impl-execute --ad-hoc`. **Gemini: 2 MED** — accepted `sudo -n`; deferred trailing-slash normalization (no live trailing-slash path into this demo-only resolver). GPT-5.5 unreachable → Opus self-review. Full narrative in [`state_history.md`](state_history.md). - **2026-06-18** — `bug_reset_demo_no_instant_feedback_poll_race` (PR #562, squash-merged `bb247a5c`). **The home "Reset to demo state" dialog is responsive again.** Operator-reported: clicking Confirm appeared to do nothing → they clicked again → eventually a `409`-driven toast → and the streaming step log never appeared. One root cause: `startReseed` enabled the status poller BEFORE sending the reseed and discarded the POST's returned initial status. (1) Start-up race → the poller's first fetch could read `idle` (before the worker wrote `running`); `refetchInterval` stops on any non-`running` status, so it stopped permanently — the reseed ran in the background but the dialog froze and the log never streamed. (2) No instant feedback → the POST's initial `running` status was thrown away, so the dialog only switched after a separate round-trip; until then the Confirm button stayed active → double-click → 409. **Fix (frontend-only):** reorder — send the reseed FIRST, `queryClient.setQueryData(['demo-reseed','status'], initial)` to render the progress view + step log instantly, THEN enable polling (Redis already holds `running`, so the first poll continues — race gone); plus a `submitting` flag that disables Confirm + shows "Starting…" on click (no double-submit). **No backend/API/migration change.** 3 new vitest cases (cache-seed with running status; disabled + "Starting…" while in flight; no cache-seed before the POST resolves). 
**Gemini: 1 MED, accepted** — disabled `refetchOnWindowFocus` on the status hook so a finished run doesn't refetch `/reseed/status` on every window-focus (fixed via the hook config rather than Gemini's setState-in-effect suggestion, avoiding the `react-hooks/set-state-in-effect` rule). GPT-5.5 unreachable → Opus self-review. Full narrative in [`state_history.md`](state_history.md). - **2026-06-18** — `bug_healthz_degraded_blocks_ui_engine_subset` (PR #559, squash-merged `ad2992a4`). **A Solr-only (or any ES/OS-excluding) stack's UI now starts.** Surfaced live: `RELYLOOP_ENGINES=solr make up` brought up a stack whose UI never started — `/healthz` returned 503 because `overall_status` ([health.py:170-179](backend/app/api/health.py#L170-L179)) hardcoded `elasticsearch`/`opensearch == "unreachable"` as blocking, even when the operator *intentionally* excluded the engine. The api healthcheck (`curl -fs /healthz`) failed on the 503 → api `unhealthy` → ui+worker (`depends_on: api: service_healthy`) never started. The health check predated the engine-subset feature; Solr already had a `not_configured` non-blocking opt-out, ES/OS never got one. **Fix — engine-selection-aware `/healthz`:** new `Settings.compose_profiles` (env `COMPOSE_PROFILES`, default `es,os,solr`) + a `selected_engines` property (case-insensitive, fail-safe → all three on empty/unrecognized); new non-blocking `not_selected` state on the ES/OS `Subsystems` Literals; the handler skips the probe for an unselected engine and reports `not_selected` (saves the 200ms timeout); `overall_status` body **unchanged** (`not_selected != unreachable` → non-blocking automatically); `docker-compose.yml` passes `COMPOSE_PROFILES` into the api container. **No migration.** Default unset → byte-identical all-engines behavior. **Operator-path verified live:** Solr-only `/healthz` 200, `elasticsearch: not_selected`, `opensearch: not_selected`, api healthy, ui+worker started. 
Tests: 14 settings unit + 3 health handler (not_selected + 200; probe actually skipped; selected-but-down still 503s). Shipped via `/bug-fix --ship`. **Gemini: 2 MED, both accepted** — `asyncio.sleep(0, result=…)` instead of a per-request nested coro + case-insensitive profile parse. **Tangential captured** (not folded): [`chore_healthz_solr_subsystem_ignores_local_container`](docs/00_overview/planned_features/02_mvp2/chore_healthz_solr_subsystem_ignores_local_container/idea.md) — Solr reports `not_configured` even when its container runs (SOLR_HOST unset in api env). GPT-5.5 unreachable → Opus self-review. Full narrative in [`state_history.md`](state_history.md). - **2026-06-18** — `feat_engine_version_selection` (PR #553, squash-merged `fd67886a`). **Let the operator pick which engine *version* to run — not just which engines.** Spun out of `feat_selective_engine_startup_and_demo`'s deferred Phase 2; operator confirmed (during preflight) that version selection is core, not polish (RelyLoop must not assume every customer runs the latest tag), so priority was reset Backlog → P1. `RELYLOOP_ES_VERSION=8.15.3 make up` boots ES 8.15.3 instead of the hardcoded `9.4.1`; same for `RELYLOOP_OS_VERSION` / `RELYLOOP_SOLR_VERSION`. **10 stories / 4 epics, no migration** (head stays `0023`). Epic 1 (install-time infra): three Compose `image:` lines now interpolate `${X_IMAGE_TAG:-}`; new `backend/app/core/engine_versions.py` `ENGINE_VERSION_MATRIX` (one entry per *supported major* in the adapter compatibility window — ES 8.x+9.x, OS 2.x+3.x, Solr 9.x+10.x — NOT a fixed "last N" count, locked at preflight); `scripts/lib/relyloop_engine_versions.sh` helper validates `RELYLOOP_*_VERSION` against the matrix BEFORE any `docker compose pull` + bash mirror + 13-case bash test; `.env.example` block; new `verify_engine_version_matrix_parity.sh` CI guard enforcing 3 sync points (Python↔Compose-default, Python↔bash-mirror, Python↔frontend-mirror) + 4-case self-test. 
Epic 2 (backend): new `is_engine_reachable_with_version` **sibling** probe returning `(reachable, version)` — the existing bool-only `is_engine_reachable` is untouched (locked at preflight; `snapshot_engine_reachability`/orchestrator still use it); `DemoEngineStatus.version: str | None` field; capability endpoint wired through; OpenAPI snapshot regenerated. Epic 3 (frontend): `ENGINE_VERSION_MATRIX` mirror in `enums.ts`; reset modal renders `` (em-dash, muted text) when reachable + version resolved. Epic 4: `local-dev.md` + `deployment.md` matrix block + `adapters.md` cross-links + `CONTRIBUTING.md` maintainer process. **Tangential fix:** wired Phase 1's `test_parse_relyloop_engines.sh` into `pr.yml` (existed but was never invoked). **Gemini: 2 MED, both accepted** — defensive `isinstance(body, dict)` guard in the version probe (a malformed body was already caught by the broad `except` but logged a misleading `AttributeError` WARN; the guard returns a clean `(False, None)` with 2 regression tests) + `encoding="utf-8"` on the parity guard's file read. GPT-5.5 unreachable → Opus self-review on spec+plan; Gemini was the live cross-family code-stage gate. All 19 `pr.yml` checks green on the fix commit `c7106ccd` (smoke skipped — opt-in/off); merge-skew clean. Full narrative in [`state_history.md`](state_history.md). - **2026-06-17** — `feat_selective_engine_startup_and_demo` Phase 1 (PR #548, squash-merged `9bf20ab2`). **Let the operator pick which engines to load — at install time AND on the reset-to-demo button.** Two layered selections: (a) `RELYLOOP_ENGINES=es` (any subset of `es,os,solr`) in `.env` → `install.sh` translates it to `COMPOSE_PROFILES` → only the selected engines pull + boot, cutting first-run wall-clock for single-engine evaluators; (b) the home dashboard "Reset to demo state" dialog gains an engine-checkbox group (defaults to all reachable) so a reseed targets a subset. 
Default everywhere is unchanged: no env var / no selection → all three engines, exactly like today. **6 stories across 3 epics.** Epic 1 (infra): `profiles: ["es"|"os"|"solr"]` on the three engine services in `docker-compose.yml` (verified no app-service `depends_on` an engine, so profile-gating doesn't cascade-skip api/worker/migrate/ui); `scripts/lib/relyloop_engines.sh` sourceable parser (17-case bash regression test) defaulting to `es,os,solr` when unset; `.env.example` + `make help` + smoke-job `COMPOSE_PROFILES=es,os,solr` opt-in + local-dev/corp-install/deployment doc updates. Epic 2 (backend): new `GET /api/v1/_test/demo/engines` capability probe (parallel `asyncio.gather`, always 200); `ReseedRequest{engines}` POST body (validated against `EngineTypeWire`; `[]` rejected 422; null/missing = all reachable, back-compat); orchestrator `engines` filter on the small SCENARIOS loop + the rich ESCI scenario; new `scenarios_skipped_reasons: dict[slug, "user_excluded"|"unreachable"]` field on `ReseedStatusResponse`. Epic 3 (frontend): reset-modal checkbox group + `useDemoEnginesCapability` hook + `postDemoReseed` helper + `RESEED_SKIP_REASON_VALUES` enum mirror; two-reason partial-completion footer ("You excluded" vs "Engine unreachable"); `demo-reseed-engine-tolerance.md` runbook update. **No migration** (head stays `0023`). **CI fix on the way:** the new reseed-filter integration tests routed through `arq_pool_spy`, but the reseed POST handler reuses the Arq pool as its Redis client (`status_get`/`status_set`) + logs `job.job_id` — extended `SpyArqPool` with in-memory `get`/`set` + a `SimpleNamespace(job_id=…)` return (additive; existing 25 studies-spy tests still green). **Gemini: 2 findings, both accepted** — HIGH (load-window default: checkboxes rendered empty + Confirm enabled-but-inert while the capability fetch was in flight → default to all-three during load/error) + MED (dropped a dead `?? {}` fallback on a backend-guaranteed field). 
GPT-5.5 unreachable → Opus self-review on spec+plan; Gemini was the live cross-family code-stage gate. All 19 `pr.yml` checks green on the merged head (smoke skipped — opt-in/off); merge-skew clean. **Phase 2 + 3 deferred and split into their own folders at finalization** (operator decision) — `feat_engine_version_selection` (engine version selection) + `feat_reseed_status_sse_streaming` (SSE reseed-status) now live as standalone `planned_features/02_mvp2/` ideas; the shipped Phase-1 folder was archived to `implemented_features/2026_06_17_feat_selective_engine_startup_and_demo/`. Full narrative in [`state_history.md`](state_history.md). -- **2026-06-17** — `feat_studies_starting_metric` (PR #545, squash-merged `6a18e113`). **Show the starting (baseline) metric beside best metric in the studies list.** The list table showed only **Best metric** — meaningless without a reference point. `studies.baseline_metric` (the pre-optimization default-config score, already stamped by `feat_study_baseline_trial` + shown on the digest panel) was absent from the list shape. Plumbed `baseline_metric: float | None = None` onto `StudySummary` (`schemas.py`) + the `_summary()` builder (`studies.py`), and reworked the column cell into a combined `starting → best (lift)` render — e.g. `0.750 → 0.823 (+9.7%)`, the digest panel's framing. Column `id`/`accessorKey` stay `best_metric` so sort wiring is unchanged; Ceiling badge still gates on direction; header `Best metric` → `Starting → best`. Regenerated `ui/openapi.json` + `types.ts`. **No migration** (head stays `0023`). Tests: contract (optional-number field), 2 integration (value flows through `_summary`; stamped→returned, unstamped→null), 5-case frontend delta suite. Shipped ad-hoc (user feature request). 
**Gemini caught a real `((new))` double-parens bug** — `deltaPct` returned `'(new)'` but the caller wraps in parens; fixed to bare `'new'` and fixed the **identical latent bug in `digest-panel.tsx`** inline (the two are mirrored), plus tightened the zero-baseline test to exact-match (the old substring assertion passed on `((new))`). Also fixed a Prettier gate miss (ran eslint+tsc locally, not `prettier --check`). All 20 checks green on the fix commit (smoke skipped); base matched current `main`, no skew. GPT-5.5 unreachable → Opus self-review. Full narrative in [`state_history.md`](state_history.md). -_(older entries — full narrative in [`state_history.md`](state_history.md): `infra_pr_yml_split_backend_test_lanes` Win 2′ PR #531, `chore_deploy_docs_daily_cron` PR #529, `chore_install_tls_hint_recommend_extract` PR #527, `chore_corp_ca_extract_make_target` PR #525, `chore_corp_install_dx_improvements` PR #523, `chore_dockerfile_remove_syntax_directive` PR #521, `chore_dockerfile_http_proxy_args` PR #519, `chore_dockerfile_corp_proxy_args` PR #517, `feat_query_normalizer_typed_pipeline` PR #509, `bug_cluster_url_ssrf_hostname_bypass` Phase 1 PR #510, `chore_overnight_result_card_screenshot` PR #492, `bug_seed_meaningful_demos_silent_bulk_errors` PR #482, `bug_relyloop_spec_ubi_section_drift` PR #481, `chore_demo_reseed_partial_completion_fast_test` PR #480, `chore_pr_yml_parallelize_backend_job` PR #478, `chore_studies_post_arq_spy_fixture` PR #476, `chore_ubi_reader_search_after_pagination` PR #474, `feat_fts_rank_ordering` PR #472, `bug_judgment_header_omits_click_bucket` PR #470, `bug_baseline_phase_test_isolation` PR #466, `chore_cluster_detail_rung_badge` PR #464, `feat_ubi_llm_study_comparison` PR #461, `feat_query_normalization_tuning` PR #459, `feat_overnight_final_solution_phase3` PR #457, `feat_study_wizard_inline_judgment_generation` PR #453, `feat_walkthrough_video_cursor_captions` PR #451, `feat_website_walkthrough_guides` PR #448, 
`feat_proposal_full_param_space_view` PR #446, `feat_overnight_studies_summary_card` PR #444, `feat_overnight_final_solution_phase2` PR #442, `feat_overnight_final_solution` PR #440, `feat_studies_list_trial_convergence_columns` PR #438, `feat_list_count_columns` PR #436, `infra_generated_artifact_freshness_gate` PR #433, `chore_scorecard_pin_deps_postcss` PR #430, `bug_llm_capability_cache_no_refresh` PR #426, `infra_smoke_reseed_runtime_budget` PR #424, `feat_studies_convergence_visibility` PR #421/#422, `bug/cli-seed-ubi-missing-engine-type` PR #419, `chore_template_library_expansion` PR #416, `infra_solr_smoke_stability` PR #383, `infra_solr_ci_readiness` Phase 1 PR #367, MVP2 backlog batch PR #364, `feat_study_convergence_indicator` PR #352, `feat_overnight_autopilot` PR #343, `infra_adapter_solr` PR #336, …)_ +_(older entries — full narrative in [`state_history.md`](state_history.md): `feat_studies_starting_metric` PR #545, `infra_pr_yml_split_backend_test_lanes` Win 2′ PR #531, `chore_deploy_docs_daily_cron` PR #529, `chore_install_tls_hint_recommend_extract` PR #527, `chore_corp_ca_extract_make_target` PR #525, `chore_corp_install_dx_improvements` PR #523, `chore_dockerfile_remove_syntax_directive` PR #521, `chore_dockerfile_http_proxy_args` PR #519, `chore_dockerfile_corp_proxy_args` PR #517, `feat_query_normalizer_typed_pipeline` PR #509, `bug_cluster_url_ssrf_hostname_bypass` Phase 1 PR #510, `chore_overnight_result_card_screenshot` PR #492, `bug_seed_meaningful_demos_silent_bulk_errors` PR #482, `bug_relyloop_spec_ubi_section_drift` PR #481, `chore_demo_reseed_partial_completion_fast_test` PR #480, `chore_pr_yml_parallelize_backend_job` PR #478, `chore_studies_post_arq_spy_fixture` PR #476, `chore_ubi_reader_search_after_pagination` PR #474, `feat_fts_rank_ordering` PR #472, `bug_judgment_header_omits_click_bucket` PR #470, `bug_baseline_phase_test_isolation` PR #466, `chore_cluster_detail_rung_badge` PR #464, `feat_ubi_llm_study_comparison` PR #461, 
`feat_query_normalization_tuning` PR #459, `feat_overnight_final_solution_phase3` PR #457, `feat_study_wizard_inline_judgment_generation` PR #453, `feat_walkthrough_video_cursor_captions` PR #451, `feat_website_walkthrough_guides` PR #448, `feat_proposal_full_param_space_view` PR #446, `feat_overnight_studies_summary_card` PR #444, `feat_overnight_final_solution_phase2` PR #442, `feat_overnight_final_solution` PR #440, `feat_studies_list_trial_convergence_columns` PR #438, `feat_list_count_columns` PR #436, `infra_generated_artifact_freshness_gate` PR #433, `chore_scorecard_pin_deps_postcss` PR #430, `bug_llm_capability_cache_no_refresh` PR #426, `infra_smoke_reseed_runtime_budget` PR #424, `feat_studies_convergence_visibility` PR #421/#422, `bug/cli-seed-ubi-missing-engine-type` PR #419, `chore_template_library_expansion` PR #416, `infra_solr_smoke_stability` PR #383, `infra_solr_ci_readiness` Phase 1 PR #367, MVP2 backlog batch PR #364, `feat_study_convergence_indicator` PR #352, `feat_overnight_autopilot` PR #343, `infra_adapter_solr` PR #336, …)_ ## In flight diff --git a/state_history.md b/state_history.md index b250ad36..a4f7a144 100644 --- a/state_history.md +++ b/state_history.md @@ -4,6 +4,18 @@ --- +### `bug_reseed_resolve_engine_base_url_not_idempotent_in_container` — the in-container demo reseed works again (PR #564, 2026-06-18) + +**Symptom.** Operator-reported live, twice: the home "Reset to demo state" reseed stuck at "Scenario 0 of 6 (0%)" and never advanced. Investigating the live stack peeled back **three independent root causes** stacked on top of each other — each one, once fixed, exposed the next. None were related to the earlier reset-button responsiveness work (#562); that fix made the dialog *show* the running state instantly, but if the run can't actually progress, instant feedback just shows a frozen progress view faster. 
+ +**Root cause 1 — `_resolve_engine_base_url` not idempotent.** The home-button reseed runs in the Arq worker, which is *always* in-container. `scripts/seed_meaningful_demos.py` sets each scenario's `host_base_url` from its `ES`/`OS`/`SOLR` constants, and those are gated on `_INSIDE_CONTAINER` (`/.dockerenv`): inside a container they're the **Compose-DNS** URLs (`http://elasticsearch:9200`), not the host-shell `localhost` URLs. But `_resolve_engine_base_url` ([demo_seeding.py](backend/app/services/demo_seeding.py)) only mapped the `localhost` keys → Compose-DNS and **raised** on anything else. So `snapshot_engine_reachability` fed it `http://elasticsearch:9200` and the whole run died at the reachability snapshot with `Unrecognized engine host URL: http://elasticsearch:9200`. Latent since the engine-reachability snapshot landed — the reseed's integration tests mock the engine-probe layer, so the real in-container resolve path was never exercised end-to-end. **Fix:** make the resolver idempotent — added `_COMPOSE_DNS_TARGETS = frozenset(_ENGINE_BASE_URL_MAPPING.values())` and pass an already-resolved Compose-DNS target through unchanged; still raise on a genuinely-unknown URL. 3 parametrized unit cases. + +**Root cause 2 — Solr couldn't create collections (`install.sh` data-dir gap).** With (1) fixed, the reseed advanced to the Solr scenario and failed: `400 ... Underlying core creation failed` → Solr log `Couldn't persist core properties to /var/solr/data/... UnixException`. `/var/solr/data` didn't exist; `mkdir` inside the container failed `ENOENT` because the `./data/solr` bind *source* didn't exist on the host — a phantom mount. Cause: `make reset` does `rm -rf ./data`, and `install.sh` only `mkdir`'d `./secrets`, never the engine data dirs, so after any reset (or a fresh clone) Solr starts against a non-existent bind source. The `pr.yml` smoke job already pre-creates `./data/solr` (mkdir + `chown 8983`) as its Lever 0; the local install path never got that. 
**Fix:** `install.sh` pre-creates `./data/solr` before compose-up, gated on Solr being in `COMPOSE_PROFILES`, with a Linux-only `chown 8983:8983` (Docker Desktop virtualizes ownership, so the mkdir alone suffices there; a chown would needlessly prompt for sudo). Verified live: creating the host dir + recreating Solr → collection CREATE succeeds → the reseed ran a full 50-trial Optuna study on the Solr scenario. + +**Root cause 3 — Arq singleton-dedup blocks retry for ~1h.** With (1)+(2) fixed and a reseed having completed, the *next* Reset click stuck at "enqueued — waiting for worker" with an empty step log — the worker never picked it up. The POST enqueues with a deterministic `_job_id="demo_reseed:singleton"` (double-click protection), but Arq aborts a re-enqueue of that id while EITHER the job is in-flight OR a finished run's result is still cached under `arq:result:` — kept for `keep_result` (Arq default **3600s = 1 hour**, NOT the 60s the old inline comment claimed). So after any terminal run, the next enqueue was silently deduped (`enqueue_job` → None) while the POST had already optimistically written `status="running"` → permanent stuck state. This was the previously-filed `bug_reseed_failure_blocks_retry_arq_singleton_dedup` (verified-still-live 2026-06-05), now hit on every repeat reseed. **Fix:** the running-status 409 guard already prevents genuine concurrency, so any lingering result is a stale completed/failed artifact — delete `arq:result:` before enqueue. This is strictly more robust than the idea's recommended option 1 (worker clears on terminal): Arq writes the result *after* the job function returns, so a crashed worker could never clear its own result, whereas the next POST always clears it regardless of how the prior run ended. Rapid double-clicks are still deduped by the first click's in-flight `arq:job` key. Extracted the literal to `_RESEED_JOB_ID`, corrected the misleading comment. 
`SpyArqPool` gained a `delete()` double; new integration case seeds a stale `arq:result:` and asserts the POST clears it AND enqueues (not dedup-to-None). **Verified live:** stale result key present (`result-key-before: 1`) → POST → key gone (`result-key-after: 0`) → worker immediately picked up the retry. Removed the resolved `bug_reseed_failure_blocks_retry_arq_singleton_dedup` planned-feature folder + regenerated dashboards/roadmap. + +**Process.** Shipped as one ad-hoc PR via `/impl-execute --ad-hoc` (three facets of "the in-container reseed is broken"). **No migration.** Operator-path verified: `install.sh` re-runs idempotently on a live stack (./data/solr preserved, Solr healthy); full reseed runs end-to-end; retry-after-completion picks up immediately. **Gemini: 2 MED** — accepted `sudo -n` (non-interactive chown can't hang the install); **deferred** trailing-slash normalization of the resolver input (no live trailing-slash path — the resolver only ever sees the demo SCENARIOS' hardcoded URLs + the `ES`/`OS`/`SOLR` constants; operator-registered cluster URLs never reach it). GPT-5.5 unreachable → Opus self-review. All `pr.yml` checks green on `eb14543e`; merge-skew clean (base matched `origin/main` `c7df82ba`); squash-merged `e7b787a4`. **Unverified follow-on:** whether the ES/OS bind dirs share the Solr missing-dir failure mode — couldn't test (those engines weren't running in the Solr-only stack); scoped the fix to the proven Solr case. **Surfaced the next feature:** "Scenario 0 of 6" sitting still *during* a scenario's multi-minute study (the counter only increments on whole-scenario completion) is exactly the misleading-progress UX the queued scenario-clarity feature addresses. 
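
The idempotent-resolver fix from root cause 1 can be sketched roughly as below. This is a minimal illustration, not the code in `demo_seeding.py`: the function name `resolve_engine_base_url`, the `http://localhost:9200` key, and the `ValueError` exception type are assumptions made for the example. Only `_ENGINE_BASE_URL_MAPPING`, `_COMPOSE_DNS_TARGETS`, the `http://elasticsearch:9200` target, and the error text are taken from the entry above.

```python
# Hypothetical host-shell URL -> Compose-DNS URL mapping. Only the
# Elasticsearch Compose-DNS target appears in the entry above; the
# localhost key is an assumed value for illustration.
_ENGINE_BASE_URL_MAPPING = {
    "http://localhost:9200": "http://elasticsearch:9200",
}

# Every value the resolver can produce. Seeing one of these as input means
# the URL was already resolved (for example by an in-container caller).
_COMPOSE_DNS_TARGETS = frozenset(_ENGINE_BASE_URL_MAPPING.values())


def resolve_engine_base_url(host_base_url: str) -> str:
    """Map a host-shell URL to its Compose-DNS target, idempotently."""
    # Already a Compose-DNS target: pass it through unchanged.
    if host_base_url in _COMPOSE_DNS_TARGETS:
        return host_base_url
    try:
        return _ENGINE_BASE_URL_MAPPING[host_base_url]
    except KeyError:
        # Still reject anything that is neither a known host URL nor a
        # known Compose-DNS target. (Exception type is an assumption.)
        raise ValueError(
            f"Unrecognized engine host URL: {host_base_url}"
        ) from None
```

Under these assumptions, applying the resolver twice gives the same result as applying it once, which is the property the in-container reseed path needed.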
+ ### `bug_reset_demo_no_instant_feedback_poll_race` — the reset-demo dialog is responsive again (PR #562, 2026-06-18) **What broke.** Operator-reported live: clicking "Reset to demo state" → Confirm appeared to do nothing, so they clicked again, eventually saw a `409`-driven toast, and never saw the streaming step log. One root cause: `startReseed` ([reset-demo-state-button.tsx](ui/src/components/dashboard/reset-demo-state-button.tsx)) enabled the status poller BEFORE sending the reseed and discarded the POST's returned initial status: