From c89de611a9ccf923ccbe8a0db5d8d044a64b39a1 Mon Sep 17 00:00:00 2001 From: SoundMindsAI Date: Thu, 18 Jun 2026 19:10:31 -0400 Subject: [PATCH 1/5] fix(install): pre-create ./data/solr so Solr can write collection cores MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Second root cause of the broken in-container demo reseed on a Solr stack: install.sh only mkdir'd ./secrets, never the engine data dirs. After `make reset` (rm -rf ./data) or a fresh clone, ./data/solr doesn't exist when the solr container starts, so its /var/solr bind mount resolves to a phantom dir the UID-8983 Solr process can't create children in. Every collection CREATE then fails ("Underlying core creation failed" / "Couldn't persist core properties to /var/solr/data/...") and the reseed's Solr scenario dies — verified live. Mirror the pr.yml smoke job's pre-create (mkdir + chown 8983 on Linux), gated on solr being in COMPOSE_PROFILES. On Docker Desktop the mkdir alone suffices (ownership virtualized); on Linux the bind preserves host UIDs so the chown is needed. Best-effort chown — warns rather than hard-failing install if it can't elevate. bug_reseed_resolve_engine_base_url_not_idempotent_in_container Co-Authored-By: Claude Opus 4.8 (1M context) Signed-off-by: SoundMindsAI --- scripts/install.sh | 23 +++++++++++++++++++++++ 1 file changed, 23 insertions(+) diff --git a/scripts/install.sh b/scripts/install.sh index 1cf413a4..3875fde7 100755 --- a/scripts/install.sh +++ b/scripts/install.sh @@ -318,6 +318,29 @@ else echo "RELYLOOP_SKIP_BUILD=1 set — skipping 'docker compose build' (CI artifact-handoff path)" fi +# 7c. Pre-create the Solr data directory so its bind mount resolves to a real +# host path. The solr image runs as UID/GID 8983 and writes each +# collection's core data under /var/solr/data (bind-mounted from +# ./data/solr — docker-compose.yml). 
When ./data/solr does NOT exist on the +# host (fresh clone, or after `make reset` wipes ./data), Docker's +# bind-mount-of-a-missing-source yields a mount the Solr process cannot +# create children in, so every collection CREATE fails with "Underlying +# core creation failed" / "Couldn't persist core properties" and the demo +# reseed's Solr scenario dies. Mirrors the pr.yml smoke job's pre-create +# step. On Linux the bind mount preserves host UIDs, so a chown to 8983 is +# required for the container to write; on Docker Desktop (macOS/Windows) +# ownership is virtualized and the mkdir alone suffices (a chown there +# would needlessly prompt for sudo). +# bug_reseed_resolve_engine_base_url_not_idempotent_in_container. +if [[ ",${COMPOSE_PROFILES:-es,os,solr}," == *",solr,"* ]]; then + mkdir -p ./data/solr + if [[ "$(uname -s)" == "Linux" && "$(stat -c '%u' ./data/solr 2>/dev/null)" != "8983" ]]; then + chown 8983:8983 ./data/solr 2>/dev/null \ + || sudo chown 8983:8983 ./data/solr 2>/dev/null \ + || echo "WARN: could not chown ./data/solr to 8983:8983. If Solr fails to create collections, run: sudo chown -R 8983:8983 ./data/solr" >&2 + fi +fi + # 8. Bring the stack up. `docker compose up -d` is itself idempotent. # `--wait` blocks until every container's healthcheck passes (or fails) — # needed by step 8 below, which runs the seed against a healthy stack. From 8b7c8fcde5d66adac41d7c85d663156f6bcacf4b Mon Sep 17 00:00:00 2001 From: SoundMindsAI Date: Thu, 18 Jun 2026 19:11:09 -0400 Subject: [PATCH 2/5] fix(demo): make _resolve_engine_base_url idempotent for in-container reseed MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The home-button demo reseed runs in the worker (always in-container), where scripts/seed_meaningful_demos.py's _INSIDE_CONTAINER branch sets every scenario's host_base_url to the Compose-DNS URL (http://elasticsearch:9200, etc.). 
_resolve_engine_base_url only mapped the host-shell localhost URLs to Compose-DNS and RAISED on anything else, so the reachability snapshot fed it the already-resolved URL and the whole reseed died with "Unrecognized engine host URL: http://elasticsearch:9200" — the run stuck at "Scenario 0 of 6" and never advanced. Latent since the engine-reachability snapshot landed: the home-button reseed's integration tests mock the engine-probe layer, so the real in-container resolve path was never exercised end-to-end. Fix: pass an already-resolved Compose-DNS target through unchanged (add _COMPOSE_DNS_TARGETS, return the input when it's already a mapping value); still raise on a genuinely unknown URL. Regression test parametrizes the three Compose-DNS URLs. Verified live: the reseed now advances past the reachability snapshot and runs the scenario's Optuna study. bug_reseed_resolve_engine_base_url_not_idempotent_in_container Co-Authored-By: Claude Opus 4.8 (1M context) Signed-off-by: SoundMindsAI --- backend/app/services/demo_seeding.py | 68 +++++++++++++------ .../tests/unit/services/test_demo_seeding.py | 17 +++++ 2 files changed, 63 insertions(+), 22 deletions(-) diff --git a/backend/app/services/demo_seeding.py b/backend/app/services/demo_seeding.py index cd34c1eb..cdd41c6b 100644 --- a/backend/app/services/demo_seeding.py +++ b/backend/app/services/demo_seeding.py @@ -418,36 +418,60 @@ async def _emit_progress(status_callback: StatusCallback, progress: ReseedStatus "http://localhost:8983": "http://solr:8983", } +# The Compose-DNS *targets* of the mapping above. When +# ``scripts/seed_meaningful_demos.py`` is imported INSIDE a container +# (``_INSIDE_CONTAINER`` → ``/.dockerenv`` present), its ``ES`` / ``OS`` / +# ``SOLR`` constants — and therefore every scenario's ``host_base_url`` — are +# already these Compose-DNS URLs, not the host-shell ``localhost`` URLs. 
The +# worker reseed runs in-container, so it feeds these already-resolved values +# into :func:`_resolve_engine_base_url`; treating them as a no-op pass-through +# (rather than raising) is what makes the resolver idempotent. +# bug_reseed_resolve_engine_base_url_not_idempotent_in_container. +_COMPOSE_DNS_TARGETS: Final[frozenset[str]] = frozenset(_ENGINE_BASE_URL_MAPPING.values()) + def _resolve_engine_base_url(host_base_url: str) -> str: - """Map the CLI's host-shell URLs to in-container Compose DNS names. + """Map an engine base URL to the in-container Compose-DNS name. Idempotent. The imported :data:`SCENARIOS` constant from - ``scripts/seed_meaningful_demos.py`` carries ``host_base_url`` values - like ``"http://localhost:9200"`` (ES), ``"http://localhost:9201"`` - (OS), and ``"http://localhost:8983"`` (Solr) — correct from the host - shell, wrong from inside the API container where ``localhost`` is the - API itself. This function transparently maps to the Compose service - DNS names. - - Pure / deterministic / no I/O. No env hooks (per cycle-4 plan review - A1 — AC-5's test injection lives in the test harness, not here). - - Per FR-1d. + ``scripts/seed_meaningful_demos.py`` carries ``host_base_url`` values that + depend on WHERE the script is imported: + + - From the host shell: ``"http://localhost:9200"`` (ES) / + ``":9201"`` (OS) / ``":8983"`` (Solr) — correct from the host, wrong + from inside the API container where ``localhost`` is the API itself. + These are mapped to the Compose service DNS names. + - From INSIDE a container (``_INSIDE_CONTAINER`` in the seed script — + ``/.dockerenv`` present, which is ALWAYS true for the worker reseed): + the constants are already the Compose-DNS URLs (e.g. + ``"http://elasticsearch:9200"``). Those pass through unchanged. 
+ + The pass-through (rather than raising on an already-resolved value) is what + makes this idempotent — required because the worker reseed runs in-container + and so feeds the already-Compose-DNS URLs here. (Pre-existing latent bug: + the home-button reseed's integration tests mock the engine probe layer, so + the real in-container resolve path was never exercised end-to-end — + bug_reseed_resolve_engine_base_url_not_idempotent_in_container.) + + Pure / deterministic / no I/O. Raises: - ValueError: when ``host_base_url`` is not one of the three - recognized CLI URLs. The orchestrator unwraps this to a - :class:`DemoSeedingError` so the route handler returns a - 503 ``SEED_FAILED`` envelope. + ValueError: when ``host_base_url`` is neither a recognized host-shell + URL nor an already-resolved Compose-DNS target. The orchestrator + unwraps this to a :class:`DemoSeedingError` so the route handler + returns a 503 ``SEED_FAILED`` envelope. """ resolved = _ENGINE_BASE_URL_MAPPING.get(host_base_url) - if resolved is None: - raise ValueError( - f"Unrecognized engine host URL: {host_base_url}. " - f"Expected one of {sorted(_ENGINE_BASE_URL_MAPPING)}." - ) - return resolved + if resolved is not None: + return resolved + # Idempotent pass-through for already-resolved Compose-DNS targets. + if host_base_url in _COMPOSE_DNS_TARGETS: + return host_base_url + raise ValueError( + f"Unrecognized engine host URL: {host_base_url}. " + f"Expected a host URL {sorted(_ENGINE_BASE_URL_MAPPING)} " + f"or an already-resolved Compose-DNS URL {sorted(_COMPOSE_DNS_TARGETS)}." 
+ ) # --------------------------------------------------------------------------- diff --git a/backend/tests/unit/services/test_demo_seeding.py b/backend/tests/unit/services/test_demo_seeding.py index 7eb3dfa1..e5701e8f 100644 --- a/backend/tests/unit/services/test_demo_seeding.py +++ b/backend/tests/unit/services/test_demo_seeding.py @@ -66,6 +66,23 @@ def test_resolve_engine_base_url_unknown_raises() -> None: _resolve_engine_base_url("http://example.com:9200") +@pytest.mark.parametrize( + "compose_dns_url", + ["http://elasticsearch:9200", "http://opensearch:9200", "http://solr:8983"], +) +def test_resolve_engine_base_url_is_idempotent_for_compose_dns(compose_dns_url: str) -> None: + """An already-resolved Compose-DNS URL passes through unchanged. + + bug_reseed_resolve_engine_base_url_not_idempotent_in_container — when + scripts/seed_meaningful_demos.py is imported INSIDE a container + (``_INSIDE_CONTAINER``), the SCENARIOS' ``host_base_url`` are already the + Compose-DNS URLs, and the worker reseed feeds them here. Before this fix + the resolver raised ``Unrecognized engine host URL: http://elasticsearch:9200`` + and the whole reseed failed at the reachability snapshot. + """ + assert _resolve_engine_base_url(compose_dns_url) == compose_dns_url + + # --------------------------------------------------------------------------- # DEMO_RESEED_LOCK_KEY — deterministic derivation # --------------------------------------------------------------------------- From de688486913057ccb9c1fb4f91715b5b03a6e4d4 Mon Sep 17 00:00:00 2001 From: SoundMindsAI Date: Thu, 18 Jun 2026 19:24:59 -0400 Subject: [PATCH 3/5] fix(demo): clear stale Arq result so reseed can retry within keep_result MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Third root cause of the broken home-button reseed (operator hit it live): the reseed can't be run twice within an hour. 
The POST enqueues with a deterministic _job_id ("demo_reseed:singleton") for double-click protection, but Arq aborts a re-enqueue of that id while the PRIOR run's result is still cached under arq:result: — kept for keep_result (Arq default 3600s = 1 HOUR), not the 60s the old comment claimed. So after any terminal run, the next Reset click is silently deduped (enqueue_job returns None) and the dialog sticks on "enqueued — waiting for worker" with an empty step log forever. The running-status 409 guard already prevents genuine concurrency, so a lingering result key is always a stale completed/failed artifact. Delete it before enqueue: rapid double-clicks are still deduped by the first click's in-flight arq:job key, while a deliberate retry now actually enqueues. Fix the misleading comment; extract the job id to _RESEED_JOB_ID. Resolves the previously-filed bug_reseed_failure_blocks_retry_arq_singleton_dedup. Tests: SpyArqPool gains a delete() double (records cleared keys); new integration case seeds a stale arq:result: and asserts the POST clears it AND enqueues (not dedup-to-None). Co-Authored-By: Claude Opus 4.8 (1M context) Signed-off-by: SoundMindsAI --- backend/app/api/v1/_test.py | 29 ++++++++++++--- backend/tests/integration/conftest.py | 16 ++++++++ .../test_demo_reseed_engines_filter.py | 37 +++++++++++++++++++ 3 files changed, 76 insertions(+), 6 deletions(-) diff --git a/backend/app/api/v1/_test.py b/backend/app/api/v1/_test.py index 4b1bfd68..124acac8 100644 --- a/backend/app/api/v1/_test.py +++ b/backend/app/api/v1/_test.py @@ -25,6 +25,7 @@ from typing import Annotated, Any from arq.connections import ArqRedis +from arq.constants import result_key_prefix from fastapi import APIRouter, Body, Depends, HTTPException, Request, Response, status from pydantic import BaseModel, ConfigDict, Field from redis.asyncio import Redis @@ -79,6 +80,12 @@ # should never appear in operator scripts. 
_TEST_PREFIX = "/_test" +# Deterministic Arq job id for the demo reseed — one in-flight run at a time +# (rapid-double-click protection). Used both to enqueue and to clear the stale +# result key that would otherwise block a legitimate retry within keep_result +# (1h). See the reseed POST handler. bug_reseed_failure_blocks_retry_arq_singleton_dedup. +_RESEED_JOB_ID = "demo_reseed:singleton" + def _err(status_code: int, code: str, message: str, retryable: bool) -> HTTPException: """Canonical error-envelope shape — mirrors ``studies.py:74-78``. @@ -727,11 +734,21 @@ async def reseed_demo( ) await status_set(arq_pool, initial) - # Deterministic job id — Arq drops duplicate enqueues with the same - # _job_id within its dedup window (default 60s). A faster double-click - # gets one job; a slower retry after the previous run completed - # creates a fresh job (because Redis state has moved on). - # + # Deterministic job id — Arq aborts a duplicate enqueue with the same + # _job_id while EITHER the job is still in-flight (``arq:job:``) OR a + # finished run's result is still cached (``arq:result:``). The result + # is kept for ``keep_result`` seconds (Arq default 3600s = 1 HOUR), NOT a + # 60s window — so without the explicit clear below, a legitimate retry + # within an hour of a COMPLETED run is silently deduped (enqueue_job + # returns None) and the operator is stuck on "enqueued — waiting for + # worker" forever. The ``running``-status 409 guard above already proved + # no run is genuinely in-flight, so any lingering result is a stale + # completed/failed artifact that is safe to drop before re-enqueue. This + # preserves rapid-double-click protection (the first click's ``arq:job`` + # key still dedupes the second) while unblocking deliberate retries. + # bug_reseed_failure_blocks_retry_arq_singleton_dedup. 
+ await arq_pool.delete(f"{result_key_prefix}{_RESEED_JOB_ID}") + # ``engines`` is None when the body is absent OR ``{"engines": null}`` # OR ``{}`` (FastAPI parses an empty body to None when the body param # has ``default=None``). All three are the "reseed every reachable @@ -741,7 +758,7 @@ async def reseed_demo( engines_filter = body.engines if body is not None else None job = await arq_pool.enqueue_job( "run_demo_reseed", - _job_id="demo_reseed:singleton", + _job_id=_RESEED_JOB_ID, engines=engines_filter, ) logger.info( diff --git a/backend/tests/integration/conftest.py b/backend/tests/integration/conftest.py index a1fed8e7..e9d1b679 100644 --- a/backend/tests/integration/conftest.py +++ b/backend/tests/integration/conftest.py @@ -198,6 +198,11 @@ class SpyArqPool: def __init__(self) -> None: self.calls: list[tuple[object, ...]] = [] self._store: dict[object, object] = {} + # Keys passed to ``delete`` — the demo-reseed POST clears the stale + # Arq result key (``arq:result:``) before re-enqueuing so a + # completed run's cached result can't dedup-block a legitimate retry + # (bug_reseed_failure_blocks_retry_arq_singleton_dedup). + self.deleted: list[object] = [] async def enqueue_job(self, name: str, *args: object, **kwargs: object) -> object: self.calls.append((name, *args)) # flattened: (name,) + args @@ -218,6 +223,17 @@ async def set(self, key: object, value: object, **kwargs: object) -> None: # function-scoped test double. self._store[key] = value + async def delete(self, *keys: object) -> int: + # Mirror ``redis.delete(*keys)``: drop each key from the in-memory + # store, record it for assertions, return the count that was present. 
+ removed = 0 + for key in keys: + self.deleted.append(key) + if key in self._store: + del self._store[key] + removed += 1 + return removed + _UNSET: object = object() """Sentinel distinguishing "attr unset" from "attr is None".""" diff --git a/backend/tests/integration/test_demo_reseed_engines_filter.py b/backend/tests/integration/test_demo_reseed_engines_filter.py index 56b553a8..c3d2c667 100644 --- a/backend/tests/integration/test_demo_reseed_engines_filter.py +++ b/backend/tests/integration/test_demo_reseed_engines_filter.py @@ -78,6 +78,43 @@ async def test_post_with_null_engines_treats_as_all_engines( assert response.status_code == 202, response.text +async def test_post_clears_stale_result_key_before_enqueue( + async_client: httpx.AsyncClient, + arq_pool_spy: object, +) -> None: + """A prior run's cached Arq result must not dedup-block a retry. + + bug_reseed_failure_blocks_retry_arq_singleton_dedup — Arq keeps a finished + job's result under ``arq:result:`` for keep_result (default 1h) and + silently aborts a re-enqueue of the same ``_job_id`` while that key exists, + leaving the operator stuck on "enqueued — waiting for worker". The reseed + POST handler deletes that key before enqueuing so a deliberate retry after + a completed/failed run actually runs. + """ + from arq.constants import result_key_prefix + + from backend.app.api.v1._test import _RESEED_JOB_ID + + spy = arq_pool_spy + result_key = f"{result_key_prefix}{_RESEED_JOB_ID}" + # Simulate a completed prior run whose result is still cached in Redis. + spy._store[result_key] = b"stale-result" # type: ignore[attr-defined] + + response = await async_client.post("/api/v1/_test/demo/reseed", json={}) + assert response.status_code == 202, response.text + + # The stale result key was deleted (so Arq won't dedup the enqueue) ... 
+ assert result_key in spy.deleted, ( # type: ignore[attr-defined] + f"expected the stale result key to be cleared, deleted={spy.deleted!r}" # type: ignore[attr-defined] + ) + assert result_key not in spy._store # type: ignore[attr-defined] + # ... and the job was actually enqueued (not deduped to None). + enqueued = [c for c in spy.calls if c[0] == "run_demo_reseed"] # type: ignore[attr-defined] + assert len(enqueued) == 1, ( + f"expected exactly one run_demo_reseed enqueue, got {spy.calls!r}" # type: ignore[attr-defined] + ) + + async def test_post_with_no_body_accepted( async_client: httpx.AsyncClient, arq_pool_spy: object, From a262b674bcb8694af72a0e1537823fd8362be2b5 Mon Sep 17 00:00:00 2001 From: SoundMindsAI Date: Thu, 18 Jun 2026 19:27:36 -0400 Subject: [PATCH 4/5] docs(planned): remove bug_reseed_failure_blocks_retry_arq_singleton_dedup (resolved) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Resolved inline by the dedup-clear fix in this branch (the POST handler now deletes the stale arq:result: before re-enqueue). The idea recommended option 1 (worker-side clear on terminal state); the POST-side clear shipped here is strictly more robust — Arq writes the result AFTER the job function returns, so a crashed worker could never clear its own result, whereas the next POST always clears it regardless of how the prior run ended. Regenerated dashboards + public roadmap for the folder removal. 
Co-Authored-By: Claude Opus 4.8 (1M context) Signed-off-by: SoundMindsAI --- docs/00_overview/DASHBOARD.md | 2 +- docs/00_overview/MVP2_DASHBOARD.md | 33 ++++--- docs/00_overview/dashboard.html | 2 +- docs/00_overview/mvp2_dashboard.html | 25 ++---- .../idea.md | 85 ------------------- website/docs/roadmap.md | 3 +- 6 files changed, 25 insertions(+), 125 deletions(-) delete mode 100644 docs/00_overview/planned_features/02_mvp2/bug_reseed_failure_blocks_retry_arq_singleton_dedup/idea.md diff --git a/docs/00_overview/DASHBOARD.md b/docs/00_overview/DASHBOARD.md index b979a3f8..0294b02e 100644 --- a/docs/00_overview/DASHBOARD.md +++ b/docs/00_overview/DASHBOARD.md @@ -7,7 +7,7 @@ _Top-level index across MVP1 → GA v1+ as of **2026-06-18**. Click a release na | Release | Theme | Progress | Status | |---|---|---|---| | [MVP1 / v0.1](MVP1_DASHBOARD.md) | The Loop | 100 / 100 scoped done | **Complete** | -| [MVP2 / v0.2](MVP2_DASHBOARD.md) | Three-Engine + Real Signals | 28 / 30 scoped done · 17 remaining | **In progress** | +| [MVP2 / v0.2](MVP2_DASHBOARD.md) | Three-Engine + Real Signals | 28 / 30 scoped done · 16 remaining | **In progress** | | MVP3 / v0.3 | Observable | — | **Not yet scoped** | | [GA v1 / v1.0](GA_DASHBOARD.md) | Production-ready | 1 item(s) queued | **Held / queued** | diff --git a/docs/00_overview/MVP2_DASHBOARD.md b/docs/00_overview/MVP2_DASHBOARD.md index c7380111..4665e387 100644 --- a/docs/00_overview/MVP2_DASHBOARD.md +++ b/docs/00_overview/MVP2_DASHBOARD.md @@ -20,15 +20,15 @@ Plan approved; run /impl-execute to ship | Metric | Value | |---|---| -| Filed under MVP2 | **52** folders total (done + specced not-done + idea backlog + bugs) | +| Filed under MVP2 | **51** folders total (done + specced not-done + idea backlog + bugs) | | Specced features done | **28 / 30** (93%) — of features *past the idea stage* (those with a spec); the idea backlog below is NOT in this denominator, so 100% ≠ release complete | -| Pending work | **22** items (every 
not-done feat/infra/chore/bug across all priorities) | +| Pending work | **21** items (every not-done feat/infra/chore/bug across all priorities) | | → P0 — do next | **0** unblocking / paying daily cost | | → P1 | **0** high-value, ready when P0 clears | -| → P2 (default) | 14 important to file, not blocking | +| → P2 (default) | 13 important to file, not blocking | | → Backlog | 8 captured for record, not planned | -| Open bugs | 7 | -| Legacy "Path to MVP2" | 17 items — scoped-not-done + bugs + chore-ideas only (excludes feat/infra ideas) | +| Open bugs | 6 | +| Legacy "Path to MVP2" | 16 items — scoped-not-done + bugs + chore-ideas only (excludes feat/infra ideas) | | Backlog ideas | 5 idea-only feat/infra (not yet scoped into MVP2) | | In flight | 1 feature(s) actively shipping | @@ -86,7 +86,7 @@ Plan approved; run /impl-execute to ship _None._ -### Idea (19) +### Idea (18) | # | Priority | Feature | Type | One-liner | Depends on | Status | |---|---|---|---|---|---|---| @@ -98,17 +98,16 @@ _None._ | 6 | P2 | [chore_test_router_conditional_mount](planned_features/02_mvp2/chore_test_router_conditional_mount/idea.md) | Chore | The `_test` router exposes data-mutating endpoints used only for deterministic E2E (seed a completed study, demo reseed, hard-delete studies/judgment-lists/proposals). Today it is registered **uncondi | — | Idea — surfaced during a codebase-wide security review (branch `claude/codebase-security-review-6njwio`) | | 7 | P2 | [bug_e2e_teardown_chain_node_delete_500](planned_features/02_mvp2/bug_e2e_teardown_chain_node_delete_500/idea.md) | Bug | The E2E global-teardown deletes seeded rows in a fixed order (per `chore_e2e_test_rows_isolation` Story 1.2 cleanup registration). 
For auto-followup **chains**, the seeded nodes are `queued` studies c | — | Idea — tangential discovery during `feat_overnight_autopilot` (Story 4.2 E2E, PR forthcoming) | | 8 | P2 | [bug_request_id_header_unvalidated_log_injection](planned_features/02_mvp2/bug_request_id_header_unvalidated_log_injection/idea.md) | Bug | `RequestIDMiddleware` adopts a client-supplied `X-Request-ID` header verbatim with no validation of length or character set: | — | Idea — surfaced during a codebase-wide security review (branch `claude/codebase-security-review-6njwio`) | -| 9 | P2 | [bug_reseed_failure_blocks_retry_arq_singleton_dedup](planned_features/02_mvp2/bug_reseed_failure_blocks_retry_arq_singleton_dedup/idea.md) | Bug | `run_demo_reseed` is enqueued with a fixed Arq job id `demo_reseed:singleton` (the singleton concurrency guard). When a run reaches a terminal state, Arq stores its **result** under `arq:result:demo_r | — | Idea — tangential discovery while verifying `fix(demo): add Solr (8983) to the reseed engine host-URL mapping` (branch `feat_demo_reseed_solr_and_steplog`) | -| 10 | P2 | [bug_studies_detail_vitest_intermittent_timeout](planned_features/02_mvp2/bug_studies_detail_vitest_intermittent_timeout/idea.md) | Bug | Under the full `pnpm test` run (`vitest run`, default worker pool), the Study-detail-page render test sometimes blocks past the 5 s `testTimeout` default — but the test itself is data-driven from mock | — | Idea — captured during `chore_template_library_expansion` post-impl tangential sweep | -| 11 | P2 | [bug_webhook_concurrent_merge_race_timing_sensitive](planned_features/02_mvp2/bug_webhook_concurrent_merge_race_timing_sensitive/idea.md) | Bug | Idea — surfaced during `bug_demo_clusters_unreachable_in_healthz` PR #236 CI. | — | Idea — surfaced during `bug_demo_clusters_unreachable_in_healthz` PR #236 CI. 
| -| 12 | Backlog | [feat_reseed_status_sse_streaming](planned_features/02_mvp2/feat_reseed_status_sse_streaming/idea.md) | Feature | The shipped feature's spec locked the reseed-status streaming mechanism to the existing 2-second `GET /api/v1/_test/demo/reseed/status` Redis poll… | — | Idea — defer-until-incident; spun out of the shipped [`feat_selective_engine_startup_and_demo`](../../../implemented_features/2026_06_17_feat_selective_engine_startup_and_demo/feature_spec.md) (was its deferred Phase 3; split into its own folder at finalization per operator request) | -| 13 | Backlog | [infra_arq_subprocess_test](planned_features/02_mvp2/infra_arq_subprocess_test/idea.md) | Infra | Idea (deferred from `feat_study_lifecycle` Phase 2 / PR #25 final GPT-5.5 review). Still applicable as of 2026-05-14: the three in-process tests cited below still cover the resume contract correctly; | — | Idea (deferred from `feat_study_lifecycle` Phase 2 / PR #25 final GPT-5.5 review). Still applicable as of 2026-05-14: the three in-process tests cited below still cover the resume contract correctly; a subprocess test would add a narrow Arq-version-regression guard. | -| 14 | Backlog | [infra_pr_yml_split_integration_by_service](planned_features/02_mvp2/infra_pr_yml_split_integration_by_service/idea.md) | Infra | After PR #531 split the heavy backend test job into three lanes (`backend-unit`, `backend-heavy`, `backend-cov-gate`), the new binding CI constraint is `backend (contract + integration + cov)` at ~8m1 | — | Idea — **deferred (defer-until-binding-constraint, posture preserved)**. Carved out of [`infra_pr_yml_split_backend_test_lanes`](../infra_pr_yml_split_backend_test_lanes/idea.md) at its 2026-06-16 split, after Win 2′ (the 3-way lane split + cov-gate plumbing) shipped as PR #531. The cov-gate infrastructure required to merge per-shard partial coverage data was the prerequisite; that's now in production. 
| -| 15 | Backlog | [infra_smoke_fork_pr_secret_skip](planned_features/02_mvp2/infra_smoke_fork_pr_secret_skip/idea.md) | Infra | `.github/workflows/pr.yml` triggers on `pull_request:` ([pr.yml:43](../.github/workflows/pr.yml)) — **not** `pull_request_target`. GitHub deliberately withholds repository secrets from workflows trigg | — | Idea — tangential discovery while merging PR #387 (`chore_arq_pool_aclose_deprecation`) | -| 16 | Backlog | [chore_auto_followup_parent_advisory_lock](planned_features/02_mvp2/chore_auto_followup_parent_advisory_lock/idea.md) | Chore | The shipped `feat_auto_followup_studies` worker uses a two-layer idempotency scheme: | — | Idea — captured as a standalone file to resolve broken cross-references in `feat_auto_followup_studies` D-11 + plan F2 + `bug_auto_followup_completed_parent_stop_chain_race/idea.md`. The slug was coined 2026-05-24 in D-11 but only existed as descriptive prose across other documents until now. | -| 17 | Backlog | [chore_e2e_overnight_strategy_radix_select_timing](planned_features/02_mvp2/chore_e2e_overnight_strategy_radix_select_timing/idea.md) | Chore | The Story 3.2 E2E spec walks the create-study wizard to Step 5, clicks the depth `` becomes visible. In chromium against `pnpm dev`, t | — | Idea — tangential follow-up captured during `feat_overnight_final_solution` Story 3.2 implementation | -| 18 | Backlog | [chore_ubi_hybrid_template_render](planned_features/02_mvp2/chore_ubi_hybrid_template_render/idea.md) | Chore | Idea — contract decision deferred (NOT a worker bug) | — | Idea — contract decision deferred (NOT a worker bug) | -| 19 | Backlog | [bug_chat_long_conversation_truncation](planned_features/02_mvp2/bug_chat_long_conversation_truncation/idea.md) | Bug | [`backend/app/services/agent_chat.send_user_message`](../../backend/app/services/agent_chat.py) defensively caps the OpenAI history at the most recent `HISTORY_MAX_MESSAGES = 100` messages… | — | Held for MVP2 (decided 2026-05-13). 
Folder renamed with `_mvp2` suffix to make the deferral visible at-a-glance in `ls docs/00_overview/planned_features/`. Resume work when MVP2 starts — no technical dependency on MVP2 infra (audit_log is N/A; Langfuse is convenience only); the deferral is scope discipline + zero current impact (latent bug, no operator has hit the 100-message cap). | +| 9 | P2 | [bug_studies_detail_vitest_intermittent_timeout](planned_features/02_mvp2/bug_studies_detail_vitest_intermittent_timeout/idea.md) | Bug | Under the full `pnpm test` run (`vitest run`, default worker pool), the Study-detail-page render test sometimes blocks past the 5 s `testTimeout` default — but the test itself is data-driven from mock | — | Idea — captured during `chore_template_library_expansion` post-impl tangential sweep | +| 10 | P2 | [bug_webhook_concurrent_merge_race_timing_sensitive](planned_features/02_mvp2/bug_webhook_concurrent_merge_race_timing_sensitive/idea.md) | Bug | Idea — surfaced during `bug_demo_clusters_unreachable_in_healthz` PR #236 CI. | — | Idea — surfaced during `bug_demo_clusters_unreachable_in_healthz` PR #236 CI. | +| 11 | Backlog | [feat_reseed_status_sse_streaming](planned_features/02_mvp2/feat_reseed_status_sse_streaming/idea.md) | Feature | The shipped feature's spec locked the reseed-status streaming mechanism to the existing 2-second `GET /api/v1/_test/demo/reseed/status` Redis poll… | — | Idea — defer-until-incident; spun out of the shipped [`feat_selective_engine_startup_and_demo`](../../../implemented_features/2026_06_17_feat_selective_engine_startup_and_demo/feature_spec.md) (was its deferred Phase 3; split into its own folder at finalization per operator request) | +| 12 | Backlog | [infra_arq_subprocess_test](planned_features/02_mvp2/infra_arq_subprocess_test/idea.md) | Infra | Idea (deferred from `feat_study_lifecycle` Phase 2 / PR #25 final GPT-5.5 review). 
Still applicable as of 2026-05-14: the three in-process tests cited below still cover the resume contract correctly; | — | Idea (deferred from `feat_study_lifecycle` Phase 2 / PR #25 final GPT-5.5 review). Still applicable as of 2026-05-14: the three in-process tests cited below still cover the resume contract correctly; a subprocess test would add a narrow Arq-version-regression guard. |
+| 13 | Backlog | [infra_pr_yml_split_integration_by_service](planned_features/02_mvp2/infra_pr_yml_split_integration_by_service/idea.md) | Infra | After PR #531 split the heavy backend test job into three lanes (`backend-unit`, `backend-heavy`, `backend-cov-gate`), the new binding CI constraint is `backend (contract + integration + cov)` at ~8m1 | — | Idea — **deferred (defer-until-binding-constraint, posture preserved)**. Carved out of [`infra_pr_yml_split_backend_test_lanes`](../infra_pr_yml_split_backend_test_lanes/idea.md) at its 2026-06-16 split, after Win 2′ (the 3-way lane split + cov-gate plumbing) shipped as PR #531. The cov-gate infrastructure required to merge per-shard partial coverage data was the prerequisite; that's now in production. |
+| 14 | Backlog | [infra_smoke_fork_pr_secret_skip](planned_features/02_mvp2/infra_smoke_fork_pr_secret_skip/idea.md) | Infra | `.github/workflows/pr.yml` triggers on `pull_request:` ([pr.yml:43](../.github/workflows/pr.yml)) — **not** `pull_request_target`. GitHub deliberately withholds repository secrets from workflows trigg | — | Idea — tangential discovery while merging PR #387 (`chore_arq_pool_aclose_deprecation`) |
+| 15 | Backlog | [chore_auto_followup_parent_advisory_lock](planned_features/02_mvp2/chore_auto_followup_parent_advisory_lock/idea.md) | Chore | The shipped `feat_auto_followup_studies` worker uses a two-layer idempotency scheme: | — | Idea — captured as a standalone file to resolve broken cross-references in `feat_auto_followup_studies` D-11 + plan F2 + `bug_auto_followup_completed_parent_stop_chain_race/idea.md`. The slug was coined 2026-05-24 in D-11 but only existed as descriptive prose across other documents until now. |
+| 16 | Backlog | [chore_e2e_overnight_strategy_radix_select_timing](planned_features/02_mvp2/chore_e2e_overnight_strategy_radix_select_timing/idea.md) | Chore | The Story 3.2 E2E spec walks the create-study wizard to Step 5, clicks the depth `` becomes visible. In chromium against `pnpm dev`, t | — | Idea — tangential follow-up captured during `feat_overnight_final_solution` Story 3.2 implementation |
+| 17 | Backlog | [chore_ubi_hybrid_template_render](planned_features/02_mvp2/chore_ubi_hybrid_template_render/idea.md) | Chore | Idea — contract decision deferred (NOT a worker bug) | — | Idea — contract decision deferred (NOT a worker bug) |
+| 18 | Backlog | [bug_chat_long_conversation_truncation](planned_features/02_mvp2/bug_chat_long_conversation_truncation/idea.md) | Bug | [`backend/app/services/agent_chat.send_user_message`](../../backend/app/services/agent_chat.py) defensively caps the OpenAI history at the most recent `HISTORY_MAX_MESSAGES = 100` messages… | — | Held for MVP2 (decided 2026-05-13). Folder renamed with `_mvp2` suffix to make the deferral visible at-a-glance in `ls docs/00_overview/planned_features/`. Resume work when MVP2 starts — no technical dependency on MVP2 infra (audit_log is N/A; Langfuse is convenience only); the deferral is scope discipline + zero current impact (latent bug, no operator has hit the 100-message cap). |

 ## Dependency graph
diff --git a/docs/00_overview/dashboard.html b/docs/00_overview/dashboard.html
index 491e2208..17c032d7 100644
--- a/docs/00_overview/dashboard.html
+++ b/docs/00_overview/dashboard.html
@@ -392,7 +392,7 @@

Releases

Three-Engine + Real Signals
-28 / 30 scoped done · 17 remaining
+28 / 30 scoped done · 16 remaining
In progress
diff --git a/docs/00_overview/mvp2_dashboard.html b/docs/00_overview/mvp2_dashboard.html
index 1dac3c7e..a4cb55de 100644
--- a/docs/00_overview/mvp2_dashboard.html
+++ b/docs/00_overview/mvp2_dashboard.html
@@ -398,17 +398,17 @@

MVP2 Progress

Specced features done
28 / 30
-93% specced · 52 filed under MVP2
+93% specced · 51 filed under MVP2
Pending work
-22
+21
every not-done feat/infra/chore/bug across all priorities
Open bugs
-7
+6
tracked bug_* idea files
@@ -425,7 +425,7 @@

MVP2 Progress

P2 (default)
-14
+13
important to file, not blocking
@@ -435,7 +435,7 @@

MVP2 Progress

Legacy "Path to MVP2"
-17
+16
scoped not-done + bugs + chore-ideas only (excludes feat/infra ideas)
@@ -463,7 +463,7 @@

Pipeline

-Idea 19
+Idea 18
@@ -569,19 +569,6 @@

Idea 19

-Bug
-P2
-`run_demo_reseed` is enqueued with a fixed Arq job id `demo_reseed:singleton` (the singleton concurrency guard). When a run reaches a terminal state, Arq stores its **result** under `arq:result:demo_r
diff --git a/docs/00_overview/planned_features/02_mvp2/bug_reseed_failure_blocks_retry_arq_singleton_dedup/idea.md b/docs/00_overview/planned_features/02_mvp2/bug_reseed_failure_blocks_retry_arq_singleton_dedup/idea.md
deleted file mode 100644
index e5dcc6f1..00000000
--- a/docs/00_overview/planned_features/02_mvp2/bug_reseed_failure_blocks_retry_arq_singleton_dedup/idea.md
+++ /dev/null
@@ -1,85 +0,0 @@
-# Idea — a failed demo reseed silently blocks retries (~1h) via Arq singleton dedup
-
-**Date:** 2026-05-31
-**Status:** Idea — tangential discovery while verifying `fix(demo): add Solr (8983) to the reseed engine host-URL mapping` (branch `feat_demo_reseed_solr_and_steplog`)
-**Type:** `bug_`
-**Priority:** P2 — operator-facing: after ANY reseed failure, the next reseed silently never runs until the stale Arq result expires (~1h), with the UI stuck "spinning". No data loss, but confusing and blocks recovery.
-
-> **Verified still live 2026-06-05 (P2 backlog grooming).** Confirmed against the current tree:
-> - The fixed `_job_id="demo_reseed:singleton"` enqueue is unchanged at [`_test.py:691-694`](../../../../../backend/app/api/v1/_test.py) and does **not** handle the `job is None` dedup-drop (it logs `job_id=None` and returns an already-written `status="running"`).
-> - The all-engines-unreachable mitigation (`infra_solr_ci_readiness`) at [`demo_seeding.py:1997`](../../../../../backend/app/services/demo_seeding.py) **deliberately raises → `status="failed"`**, which is precisely the terminal state that caches under `arq:result:demo_reseed:singleton` and wedges the retry. So that mitigation *keeps this wedge path reachable* — it fixed the "masquerade-as-success" half, not the "failed-result blocks re-enqueue" half.
-> - **Severity confirmed ~1h, not 60s.** The inline comment at `_test.py:688` ("Arq drops duplicate enqueues … default 60s") is **misleading** — the live-reproduced wedge is the `keep_result` result key (~3600s), not a 60s dedup window. Fixing this idea should also correct that comment. Recommended fix is option 1 (clear the singleton result key on terminal state in the worker) — cheapest, preserves the singleton concurrency guard.
-
-## Origin
-
-Reproduced live: the Solr host-URL bug caused a reseed to **fail** on the Solr
-scenario. Immediately re-triggering `POST /api/v1/_test/demo/reseed` returned
-`200 {status: "running"}`, but the **worker never picked up the job** — its log
-stayed empty and the status sat at `current_step = "enqueued — waiting for
-worker"`, `scenarios_completed = 0` indefinitely. Manual Redis inspection found
-the culprit; clearing it unblocked the retry.
-
-## Problem
-
-`run_demo_reseed` is enqueued with a fixed Arq job id `demo_reseed:singleton`
-(the singleton concurrency guard). When a run reaches a terminal state, Arq
-stores its **result** under `arq:result:demo_reseed:singleton` for
-`keep_result` (Arq default ~3600 s). A subsequent enqueue with the **same job
-id** is **deduplicated by Arq** — `enqueue_job` returns `None` and the job is
-**silently dropped**. So:
-
-- Worker never receives the retry → no `demo_reseed_worker_started`, empty logs.
-- The API has already optimistically written `status = "running"` to
-  `demo_reseed:status`, so the UI shows an in-progress reseed that will never
-  advance, and the in-tool 409 `SEED_IN_PROGRESS` guard now rejects further
-  attempts (it reads the stuck "running" status).
-- Net: a single failed reseed wedges the feature for up to ~1 h.
-
-This is the inverse of the dedup behavior `chore_demo_seeding_integration_tests_rewrite`
-already documents for the *concurrent* case — here it bites the *sequential
-retry-after-failure* case.
-
-## Manual recovery (today)
-
-```bash
-docker compose exec -T redis redis-cli del arq:result:demo_reseed:singleton demo_reseed:status
-# then re-POST /api/v1/_test/demo/reseed
-```
-
-## Proposed fix (pick one at spec time)
-
-1. **Clear the singleton result on terminal state.** When the worker finishes
-   (complete OR failed), `redis.delete("arq:result:demo_reseed:singleton")` (and
-   any `arq:in-progress:` key) so the next enqueue isn't deduped. Cheapest;
-   preserves the singleton guard for genuine concurrency.
-2. **Detect the dropped enqueue.** `enqueue_job(..., _job_id="demo_reseed:singleton")`
-   returns `None` when deduped — the POST handler should treat `None` as "a prior
-   run's result is blocking re-enqueue", clear it (or surface a precise 409 that
-   says *retry blocked by a previous run; clearing*), instead of writing an
-   unbacked `status = "running"`.
-3. **Fresh job id per attempt + rely on the status/lock guard for concurrency.**
-   Drop the singleton job id; the existing status-based 409 + the Postgres
-   advisory lock (`DEMO_RESEED_LOCK_KEY`) already prevent concurrent runs. Removes
-   the dedup foot-gun entirely but changes the concurrency model — needs care.
-
-Add a regression test: enqueue → force-fail → re-enqueue must actually run (not
-be deduped). The `chore_demo_seeding_integration_tests_rewrite` async-harness
-work is the natural home for this assertion.
-
-## Scope signals
-
-- **Backend:** small-moderate — the worker terminal-state cleanup +/or the POST
-  handler's dropped-enqueue handling (`backend/workers/demo_reseed.py`,
-  `backend/app/api/v1/_test.py`, `backend/app/services/demo_seeding.py`).
-- **Frontend:** none (the UI just polls status).
-- **Migration / config:** none.
-- **Audit events:** N/A (test-only endpoint).
-
-## Relationship to other work
-
-- Surfaced by `fix(demo): add Solr (8983) to the reseed engine host-URL mapping`
-  (the Solr failure is what left the stale singleton result).
-- Adjacent to `chore_demo_seeding_integration_tests_rewrite` (async-flow test
-  rewrite) — the regression test for this belongs in that harness.
-- Same Arq-singleton mechanism noted in that chore's spec for the concurrent-POST - case; this is the retry-after-failure case. diff --git a/website/docs/roadmap.md b/website/docs/roadmap.md index 2f6487b7..dafe09c3 100644 --- a/website/docs/roadmap.md +++ b/website/docs/roadmap.md @@ -214,7 +214,7 @@ - 🟡 [PR Yml Split Integration By Service](https://github.com/SoundMindsAI/relyloop/tree/main/docs/00_overview/planned_features/02_mvp2/infra_pr_yml_split_integration_by_service) - 🟡 [Smoke Fork PR Secret Skip](https://github.com/SoundMindsAI/relyloop/tree/main/docs/00_overview/planned_features/02_mvp2/infra_smoke_fork_pr_secret_skip) -??? note "Maintenance & fixes (22)" +??? note "Maintenance & fixes (21)" - ✅ [Backend Suite Nondeterministic Caplog Isolation](https://github.com/SoundMindsAI/relyloop/tree/main/docs/00_overview/implemented_features/2026_06_01_bug_backend_suite_nondeterministic_caplog_isolation) · [#364](https://github.com/SoundMindsAI/relyloop/pull/364) - ✅ [Contract Allowlists Outdated After Mvp2 Features](https://github.com/SoundMindsAI/relyloop/tree/main/docs/00_overview/implemented_features/2026_06_01_bug_contract_allowlists_outdated_after_mvp2_features) · [#364](https://github.com/SoundMindsAI/relyloop/pull/364) @@ -232,7 +232,6 @@ - 🟡 [Healthz Solr Subsystem Ignores Local Container](https://github.com/SoundMindsAI/relyloop/tree/main/docs/00_overview/planned_features/02_mvp2/chore_healthz_solr_subsystem_ignores_local_container) - 🟡 [Overnight Result Card Screenshot](https://github.com/SoundMindsAI/relyloop/tree/main/docs/00_overview/planned_features/02_mvp2/chore_overnight_result_card_screenshot) - 🟡 [Request Id Header Unvalidated Log Injection](https://github.com/SoundMindsAI/relyloop/tree/main/docs/00_overview/planned_features/02_mvp2/bug_request_id_header_unvalidated_log_injection) - - 🟡 [Reseed Failure Blocks Retry Arq Singleton 
Dedup](https://github.com/SoundMindsAI/relyloop/tree/main/docs/00_overview/planned_features/02_mvp2/bug_reseed_failure_blocks_retry_arq_singleton_dedup) - 🟡 [Solr Post Pipeline Followups](https://github.com/SoundMindsAI/relyloop/tree/main/docs/00_overview/planned_features/02_mvp2/chore_solr_post_pipeline_followups) - 🟡 [Studies Detail Vitest Intermittent Timeout](https://github.com/SoundMindsAI/relyloop/tree/main/docs/00_overview/planned_features/02_mvp2/bug_studies_detail_vitest_intermittent_timeout) - 🟡 [Test Router Conditional Mount](https://github.com/SoundMindsAI/relyloop/tree/main/docs/00_overview/planned_features/02_mvp2/chore_test_router_conditional_mount) From eb14543e961fc09589ee0cdca2a29c8aa954b108 Mon Sep 17 00:00:00 2001 From: SoundMindsAI Date: Thu, 18 Jun 2026 19:32:51 -0400 Subject: [PATCH 5/5] fix(install): use sudo -n for the Solr data-dir chown (Gemini review) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Non-interactive sudo fails fast instead of hanging on a password prompt in CI / automated provisioning when passwordless sudo isn't configured — the warning fallback then fires. Gemini Code Assist Medium finding (install.sh). Co-Authored-By: Claude Opus 4.8 (1M context) Signed-off-by: SoundMindsAI --- scripts/install.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/scripts/install.sh b/scripts/install.sh index 3875fde7..5dc2a38d 100755 --- a/scripts/install.sh +++ b/scripts/install.sh @@ -336,7 +336,7 @@ if [[ ",${COMPOSE_PROFILES:-es,os,solr}," == *",solr,"* ]]; then mkdir -p ./data/solr if [[ "$(uname -s)" == "Linux" && "$(stat -c '%u' ./data/solr 2>/dev/null)" != "8983" ]]; then chown 8983:8983 ./data/solr 2>/dev/null \ - || sudo chown 8983:8983 ./data/solr 2>/dev/null \ + || sudo -n chown 8983:8983 ./data/solr 2>/dev/null \ || echo "WARN: could not chown ./data/solr to 8983:8983. If Solr fails to create collections, run: sudo chown -R 8983:8983 ./data/solr" >&2 fi fi
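
The profile gate that guards the Solr pre-create step can be read in isolation as a comma-delimited membership test. The sketch below is illustrative only: `profile_enabled` is a hypothetical helper that does not exist in `scripts/install.sh`, and it uses a POSIX `case` pattern as an assumed equivalent of the script's `[[ ... == *",solr,"* ]]` check. Wrapping the list in commas is what keeps a partial name such as `sol` from matching `solr`.

```shell
# Hypothetical helper (not part of install.sh): succeeds when a profile name
# appears as a whole entry in a comma-separated profile list.
# An empty or missing list falls back to the same default the patch uses.
profile_enabled() {
  wanted="$1"
  profiles="${2:-es,os,solr}"
  # Surround both sides with commas so only whole entries match.
  case ",${profiles}," in
    *",${wanted},"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example call: gate a Solr-only step on the active Compose profiles.
if profile_enabled solr "${COMPOSE_PROFILES:-}"; then
  echo "solr profile active: pre-create ./data/solr"
fi
```

Keeping the default inside the helper mirrors the `${COMPOSE_PROFILES:-es,os,solr}` expansion in the hunk above, so an unset variable behaves the same way in both forms.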