fix: in-container demo reseed stuck at Scenario 0 (idempotent resolver + Solr data-dir + Arq dedup-retry) #564
Second root cause of the broken in-container demo reseed on a Solr stack:
install.sh only mkdir'd ./secrets, never the engine data dirs. After
`make reset` (rm -rf ./data) or a fresh clone, ./data/solr doesn't exist
when the solr container starts, so its /var/solr bind mount resolves to a
phantom dir the UID-8983 Solr process can't create children in. Every
collection CREATE then fails ("Underlying core creation failed" /
"Couldn't persist core properties to /var/solr/data/...") and the reseed's
Solr scenario dies — verified live.
Mirror the pr.yml smoke job's pre-create (mkdir + chown 8983 on Linux),
gated on solr being in COMPOSE_PROFILES. On Docker Desktop the mkdir alone
suffices (ownership virtualized); on Linux the bind preserves host UIDs so
the chown is needed. Best-effort chown — warns rather than hard-failing
install if it can't elevate.
bug_reseed_resolve_engine_base_url_not_idempotent_in_container
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>
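The pre-create step described above lives in install.sh as shell. The following is only a Python rendering of the same logic, offered as a sketch: the function names, the `on_linux` switch, and the injected `chown` callable are hypothetical and exist to make the gating and the best-effort ownership change easy to follow.

```python
import os
import sys
from pathlib import Path
from typing import Callable, Optional

SOLR_UID = 8983  # UID of the Solr process, as stated in the commit message


def solr_profile_enabled(compose_profiles: str) -> bool:
    """True when 'solr' is one of the comma-separated Compose profiles."""
    return "solr" in {p.strip() for p in compose_profiles.split(",")}


def ensure_solr_data_dir(
    data_root: Path,
    compose_profiles: str,
    chown: Optional[Callable[[Path, int, int], None]] = None,
    on_linux: bool = sys.platform.startswith("linux"),
) -> list[str]:
    """Pre-create <data_root>/solr and return warnings instead of raising."""
    warnings: list[str] = []
    if not solr_profile_enabled(compose_profiles):
        return warnings  # gated: nothing to do without the solr profile

    solr_dir = data_root / "solr"
    solr_dir.mkdir(parents=True, exist_ok=True)  # safe to re-run

    if on_linux:
        # On Linux the bind mount preserves host UIDs, so ownership matters.
        try:
            (chown or os.chown)(solr_dir, SOLR_UID, SOLR_UID)
        except OSError as exc:  # PermissionError is a subclass of OSError
            warnings.append(f"could not chown {solr_dir} to {SOLR_UID}: {exc}")
    return warnings
```

Returning warnings rather than raising mirrors the "warns rather than hard-failing" behavior the commit describes; the sudo fallback in the real script is not modeled here.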
…reseed

The home-button demo reseed runs in the worker (always in-container), where scripts/seed_meaningful_demos.py's _INSIDE_CONTAINER branch sets every scenario's host_base_url to the Compose-DNS URL (http://elasticsearch:9200, etc.). _resolve_engine_base_url only mapped the host-shell localhost URLs to Compose-DNS and RAISED on anything else, so the reachability snapshot fed it the already-resolved URL and the whole reseed died with "Unrecognized engine host URL: http://elasticsearch:9200". The run stuck at "Scenario 0 of 6" and never advanced.

Latent since the engine-reachability snapshot landed: the home-button reseed's integration tests mock the engine-probe layer, so the real in-container resolve path was never exercised end-to-end.

Fix: pass an already-resolved Compose-DNS target through unchanged (add _COMPOSE_DNS_TARGETS, return the input when it's already a mapping value); still raise on a genuinely unknown URL. Regression test parametrizes the three Compose-DNS URLs.

Verified live: the reseed now advances past the reachability snapshot and runs the scenario's Optuna study.

bug_reseed_resolve_engine_base_url_not_idempotent_in_container
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>
Code Review
This pull request makes the engine base URL resolver idempotent to prevent failures when running inside a container, adds corresponding unit tests, and pre-creates the Solr data directory with correct ownership in the installation script. The reviewer feedback suggests normalizing URLs by stripping trailing slashes for robustness and using non-interactive sudo -n in the installation script to prevent hanging in automated environments.
      resolved = _ENGINE_BASE_URL_MAPPING.get(host_base_url)
    - if resolved is None:
    -     raise ValueError(
    -         f"Unrecognized engine host URL: {host_base_url}. "
    -         f"Expected one of {sorted(_ENGINE_BASE_URL_MAPPING)}."
    -     )
    - return resolved
    + if resolved is not None:
    +     return resolved
    + # Idempotent pass-through for already-resolved Compose-DNS targets.
    + if host_base_url in _COMPOSE_DNS_TARGETS:
    +     return host_base_url
To make the URL resolution more robust and defensive against trailing slashes (which are common in URL configurations), consider stripping any trailing slashes from host_base_url before performing the lookup in _ENGINE_BASE_URL_MAPPING and _COMPOSE_DNS_TARGETS.
    normalized_url = host_base_url.rstrip("/")
    resolved = _ENGINE_BASE_URL_MAPPING.get(normalized_url)
    if resolved is not None:
        return resolved
    # Idempotent pass-through for already-resolved Compose-DNS targets.
    if normalized_url in _COMPOSE_DNS_TARGETS:
        return normalized_url

    chown 8983:8983 ./data/solr 2>/dev/null \
    || sudo chown 8983:8983 ./data/solr 2>/dev/null \
Using sudo -n (non-interactive) is a best practice for scripts that may run in non-interactive environments (such as CI/CD pipelines or automated provisioning tasks). It prevents the script from hanging indefinitely if passwordless sudo is not configured for the user, failing immediately and allowing the warning message to be printed.
    - chown 8983:8983 ./data/solr 2>/dev/null \
    - || sudo chown 8983:8983 ./data/solr 2>/dev/null \
    + chown 8983:8983 ./data/solr 2>/dev/null \
    + || sudo -n chown 8983:8983 ./data/solr 2>/dev/null \
Third root cause of the broken home-button reseed (operator hit it live):
the reseed can't be run twice within an hour. The POST enqueues with a
deterministic _job_id ("demo_reseed:singleton") for double-click protection,
but Arq aborts a re-enqueue of that id while the PRIOR run's result is still
cached under arq:result:<job_id> — kept for keep_result (Arq default 3600s =
1 HOUR), not the 60s the old comment claimed. So after any terminal run, the
next Reset click is silently deduped (enqueue_job returns None) and the
dialog sticks on "enqueued — waiting for worker" with an empty step log
forever.
The running-status 409 guard already prevents genuine concurrency, so a
lingering result key is always a stale completed/failed artifact. Delete it
before enqueue: rapid double-clicks are still deduped by the first click's
in-flight arq:job key, while a deliberate retry now actually enqueues. Fix
the misleading comment; extract the job id to _RESEED_JOB_ID.
Resolves the previously-filed bug_reseed_failure_blocks_retry_arq_singleton_dedup.
Tests: SpyArqPool gains a delete() double (records cleared keys); new
integration case seeds a stale arq:result:<id> and asserts the POST clears
it AND enqueues (not dedup-to-None).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>
…edup (resolved)

Resolved inline by the dedup-clear fix in this branch (the POST handler now deletes the stale arq:result:<job_id> before re-enqueue). The idea recommended option 1 (worker-side clear on terminal state); the POST-side clear shipped here is strictly more robust: Arq writes the result AFTER the job function returns, so a crashed worker could never clear its own result, whereas the next POST always clears it regardless of how the prior run ended.

Regenerated dashboards + public roadmap for the folder removal.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>
Non-interactive sudo fails fast instead of hanging on a password prompt in CI / automated provisioning when passwordless sudo isn't configured; the warning fallback then fires. Gemini Code Assist Medium finding (install.sh).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>
Review adjudication (Gemini Code Assist)
Fix commit: latest push (
Outcomes
Ready for merge.
state.md: prepend PR #564 to "Last 5 merges", drop feat_studies_starting_metric (#545) into the older-entries rollup, refresh branch-context + Last-updated.
state_history.md: full three-root-cause narrative prepended.

Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
Fixes the home-button demo reseed sticking at "Scenario 0 of 6" (operator-reported, live). The symptom had three stacked root causes in the in-container reseed path — all verified fixed live on a Solr stack:
1. Reseed couldn't run at all: `_resolve_engine_base_url` (`demo_seeding.py`) only mapped host-shell `localhost` URLs → Compose-DNS and raised on anything else. But the worker is always in-container, where `seed_meaningful_demos.py`'s `_INSIDE_CONTAINER` branch sets each scenario's `host_base_url` to the already-Compose-DNS URL, so the reachability snapshot fed those back in and the run died with `Unrecognized engine host URL: http://elasticsearch:9200`. Latent because the reseed's integration tests mock the engine-probe layer. Fix: make the resolver idempotent (pass an already-resolved Compose-DNS target through unchanged; still raise on genuinely-unknown URLs).
2. Solr couldn't create collections: `install.sh` only `mkdir`'d `./secrets`, never the engine data dirs. After `make reset` (`rm -rf ./data`) or a fresh clone, `./data/solr` doesn't exist when Solr starts, so its `/var/solr` bind is a phantom dir UID-8983 can't write; every collection CREATE failed (`Underlying core creation failed`). Fix: pre-create `./data/solr` (+ `chown 8983` on Linux) before compose-up, gated on Solr in `COMPOSE_PROFILES`, mirroring the `pr.yml` smoke job.
3. Couldn't reseed twice within an hour: the POST enqueues a deterministic `_job_id` (`"demo_reseed:singleton"`) for double-click protection, but Arq aborts a re-enqueue while the prior run's result is cached under `arq:result:<id>`, kept for `keep_result` (Arq default 1 hour, not the 60s the old comment claimed). So after any terminal run, the next Reset click was silently deduped and stuck on "enqueued — waiting for worker" with an empty step log. Fix: delete the stale result key before enqueue (the running-status 409 guard already prevents genuine concurrency; rapid double-clicks are still deduped by the first click's in-flight `arq:job` key). Resolves the previously-filed `bug_reseed_failure_blocks_retry_arq_singleton_dedup` (folder removed).

Verified live (Solr-only stack)
- `arq:result` key present → POST clears it → worker immediately picks up the job (no stuck "enqueued").
- `install.sh` re-runs idempotently; `./data/solr` preserved, Solr healthy.

Test plan
- `_resolve_engine_base_url` idempotent per Compose-DNS URL (15/15 demo_seeding unit tests pass)
- `arq:result:<id>` then enqueues (not dedup-to-None); `SpyArqPool` gains a `delete()` double
- `make fmt && make lint && make typecheck && ruff format --check` clean; `bash -n` / shellcheck clean on `install.sh`

Notes
🤖 Generated with Claude Code