
fix: in-container demo reseed stuck at Scenario 0 (idempotent resolver + Solr data-dir + Arq dedup-retry)#564

Merged
SoundMindsAI merged 5 commits into
mainfrom
bug_reseed_resolve_engine_base_url_not_idempotent_in_container
Jun 18, 2026

@SoundMindsAI SoundMindsAI commented Jun 18, 2026

Owner

Summary

Fixes the home-button demo reseed sticking at "Scenario 0 of 6" (operator-reported, live). The symptom had three stacked root causes in the in-container reseed path — all verified fixed live on a Solr stack:

  1. Reseed couldn't run at all: _resolve_engine_base_url (demo_seeding.py) only mapped host-shell localhost URLs → Compose-DNS and raised on anything else. But the worker is always in-container, where seed_meaningful_demos.py's _INSIDE_CONTAINER branch sets each scenario's host_base_url to the already-resolved Compose-DNS URL, so the reachability snapshot fed those back in and the run died with "Unrecognized engine host URL: http://elasticsearch:9200". Latent because the reseed's integration tests mock the engine-probe layer, so the real in-container resolve path was never exercised end-to-end. Fix: make the resolver idempotent (pass an already-resolved Compose-DNS target through unchanged; still raise on genuinely unknown URLs).

  2. Solr couldn't create collections: install.sh only mkdir'd ./secrets, never the engine data dirs. After make reset (rm -rf ./data) or a fresh clone, ./data/solr doesn't exist when Solr starts, so its /var/solr bind mount resolves to a phantom dir that the UID-8983 Solr process can't write to, and every collection CREATE failed ("Underlying core creation failed"). Fix: pre-create ./data/solr (plus chown 8983 on Linux) before compose-up, gated on Solr being in COMPOSE_PROFILES, mirroring the pr.yml smoke job.

  3. Couldn't reseed twice within an hour — the POST enqueues a deterministic _job_id ("demo_reseed:singleton") for double-click protection, but Arq aborts a re-enqueue while the prior run's result is cached under arq:result:<id> — kept for keep_result (Arq default 1 hour, not the 60s the old comment claimed). So after any terminal run, the next Reset click was silently deduped and stuck on "enqueued — waiting for worker" with an empty step log. Fix: delete the stale result key before enqueue (the running-status 409 guard already prevents genuine concurrency; rapid double-clicks are still deduped by the first click's in-flight arq:job key). Resolves the previously-filed bug_reseed_failure_blocks_retry_arq_singleton_dedup (folder removed).

Verified live (Solr-only stack)

  • Reseed advances past the reachability snapshot (was: stuck Scenario 0) and runs the scenario's full 50-trial Optuna study.
  • Retry after a completed run: stale arq:result key present → POST clears it → worker immediately picks up the job (no stuck "enqueued").
  • install.sh re-runs idempotently; ./data/solr preserved, Solr healthy.

Test plan

  • 3 unit cases — _resolve_engine_base_url idempotent per Compose-DNS URL (15/15 demo_seeding unit tests pass)
  • Integration case — POST clears a seeded stale arq:result:<id> then enqueues (not dedup-to-None); SpyArqPool gains a delete() double
  • make fmt && make lint && make typecheck && ruff format --check clean; bash -n/shellcheck clean on install.sh
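
The integration case above can be approximated with an in-memory double. This is a sketch under stated assumptions, not the project's test code: SpyArqPool here is a simplified stand-in that only mimics Arq's job-id dedup (a re-enqueue returns None while an arq:job or arq:result key exists for that id), and the task name run_demo_reseed and the helper enqueue_reseed are hypothetical.

```python
import asyncio

_RESEED_JOB_ID = "demo_reseed:singleton"
_ARQ_JOB_PREFIX = "arq:job:"
_ARQ_RESULT_PREFIX = "arq:result:"


class SpyArqPool:
    """Simplified in-memory stand-in for an Arq pool (not the real ArqRedis)."""

    def __init__(self):
        self.keys = set()   # keys currently "in Redis"
        self.deleted = []   # every key passed to delete(), in order

    async def delete(self, key):
        self.deleted.append(key)
        existed = key in self.keys
        self.keys.discard(key)
        return int(existed)

    async def enqueue_job(self, function, *, _job_id):
        # Mimic dedup: an in-flight job key or a cached result key blocks enqueue.
        if (_ARQ_JOB_PREFIX + _job_id in self.keys
                or _ARQ_RESULT_PREFIX + _job_id in self.keys):
            return None
        self.keys.add(_ARQ_JOB_PREFIX + _job_id)
        return _job_id


async def enqueue_reseed(pool):
    """Hypothetical POST-side logic: clear the stale result, then enqueue."""
    await pool.delete(_ARQ_RESULT_PREFIX + _RESEED_JOB_ID)
    return await pool.enqueue_job("run_demo_reseed", _job_id=_RESEED_JOB_ID)


async def demo():
    pool = SpyArqPool()
    pool.keys.add(_ARQ_RESULT_PREFIX + _RESEED_JOB_ID)  # stale prior result
    first = await enqueue_reseed(pool)    # retry after a finished run
    second = await enqueue_reseed(pool)   # rapid double-click
    return pool, first, second
```

In this model the first call enqueues because the stale result key is cleared, while the second is still deduped by the first call's in-flight job key, matching the behavior described in the summary.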

Notes

  • "Scenario 0 of 6" not visibly advancing during the first scenario's multi-minute Optuna study is a separate UX issue (the counter only increments when a whole scenario completes) — addressed by the next change (a per-scenario live-state checklist).
  • Unverified: whether ES/OS bind dirs share the Solr missing-dir failure mode (couldn't test — those engines aren't running in this Solr-only stack). Scoped to the proven Solr case.

🤖 Generated with Claude Code

SoundMindsAI and others added 2 commits June 18, 2026 19:10
Second root cause of the broken in-container demo reseed on a Solr stack:
install.sh only mkdir'd ./secrets, never the engine data dirs. After
`make reset` (rm -rf ./data) or a fresh clone, ./data/solr doesn't exist
when the solr container starts, so its /var/solr bind mount resolves to a
phantom dir the UID-8983 Solr process can't create children in. Every
collection CREATE then fails ("Underlying core creation failed" /
"Couldn't persist core properties to /var/solr/data/...") and the reseed's
Solr scenario dies — verified live.

Mirror the pr.yml smoke job's pre-create (mkdir + chown 8983 on Linux),
gated on solr being in COMPOSE_PROFILES. On Docker Desktop the mkdir alone
suffices (ownership virtualized); on Linux the bind preserves host UIDs so
the chown is needed. Best-effort chown — warns rather than hard-failing
install if it can't elevate.

bug_reseed_resolve_engine_base_url_not_idempotent_in_container

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>
…reseed

The home-button demo reseed runs in the worker (always in-container), where
scripts/seed_meaningful_demos.py's _INSIDE_CONTAINER branch sets every
scenario's host_base_url to the Compose-DNS URL (http://elasticsearch:9200,
etc.). _resolve_engine_base_url only mapped the host-shell localhost URLs to
Compose-DNS and RAISED on anything else, so the reachability snapshot fed it
the already-resolved URL and the whole reseed died with "Unrecognized engine
host URL: http://elasticsearch:9200" — the run stuck at "Scenario 0 of 6"
and never advanced.

Latent since the engine-reachability snapshot landed: the home-button
reseed's integration tests mock the engine-probe layer, so the real
in-container resolve path was never exercised end-to-end.

Fix: pass an already-resolved Compose-DNS target through unchanged (add
_COMPOSE_DNS_TARGETS, return the input when it's already a mapping value);
still raise on a genuinely unknown URL. Regression test parametrizes the
three Compose-DNS URLs. Verified live: the reseed now advances past the
reachability snapshot and runs the scenario's Optuna study.

bug_reseed_resolve_engine_base_url_not_idempotent_in_container

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request makes the engine base URL resolver idempotent to prevent failures when running inside a container, adds corresponding unit tests, and pre-creates the Solr data directory with correct ownership in the installation script. The reviewer feedback suggests normalizing URLs by stripping trailing slashes for robustness and using non-interactive sudo -n in the installation script to prevent hanging in automated environments.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines 464 to +469
      resolved = _ENGINE_BASE_URL_MAPPING.get(host_base_url)
    - if resolved is None:
    -     raise ValueError(
    -         f"Unrecognized engine host URL: {host_base_url}. "
    -         f"Expected one of {sorted(_ENGINE_BASE_URL_MAPPING)}."
    -     )
    - return resolved
    + if resolved is not None:
    +     return resolved
    + # Idempotent pass-through for already-resolved Compose-DNS targets.
    + if host_base_url in _COMPOSE_DNS_TARGETS:
    +     return host_base_url


medium

To make the URL resolution more robust and defensive against trailing slashes (which are common in URL configurations), consider stripping any trailing slashes from host_base_url before performing the lookup in _ENGINE_BASE_URL_MAPPING and _COMPOSE_DNS_TARGETS.

    normalized_url = host_base_url.rstrip("/")
    resolved = _ENGINE_BASE_URL_MAPPING.get(normalized_url)
    if resolved is not None:
        return resolved
    # Idempotent pass-through for already-resolved Compose-DNS targets.
    if normalized_url in _COMPOSE_DNS_TARGETS:
        return normalized_url

Comment thread scripts/install.sh Outdated
Comment on lines +338 to +339
chown 8983:8983 ./data/solr 2>/dev/null \
|| sudo chown 8983:8983 ./data/solr 2>/dev/null \


medium

Using sudo -n (non-interactive) is a best practice for scripts that may run in non-interactive environments (such as CI/CD pipelines or automated provisioning tasks). It prevents the script from hanging indefinitely if passwordless sudo is not configured for the user, failing immediately and allowing the warning message to be printed.

Suggested change
    - chown 8983:8983 ./data/solr 2>/dev/null \
    - || sudo chown 8983:8983 ./data/solr 2>/dev/null \
    + chown 8983:8983 ./data/solr 2>/dev/null \
    + || sudo -n chown 8983:8983 ./data/solr 2>/dev/null \

SoundMindsAI and others added 2 commits June 18, 2026 19:24
Third root cause of the broken home-button reseed (operator hit it live):
the reseed can't be run twice within an hour. The POST enqueues with a
deterministic _job_id ("demo_reseed:singleton") for double-click protection,
but Arq aborts a re-enqueue of that id while the PRIOR run's result is still
cached under arq:result:<job_id> — kept for keep_result (Arq default 3600s =
1 HOUR), not the 60s the old comment claimed. So after any terminal run, the
next Reset click is silently deduped (enqueue_job returns None) and the
dialog sticks on "enqueued — waiting for worker" with an empty step log
forever.

The running-status 409 guard already prevents genuine concurrency, so a
lingering result key is always a stale completed/failed artifact. Delete it
before enqueue: rapid double-clicks are still deduped by the first click's
in-flight arq:job key, while a deliberate retry now actually enqueues. Fix
the misleading comment; extract the job id to _RESEED_JOB_ID.

Resolves the previously-filed bug_reseed_failure_blocks_retry_arq_singleton_dedup.

Tests: SpyArqPool gains a delete() double (records cleared keys); new
integration case seeds a stale arq:result:<id> and asserts the POST clears
it AND enqueues (not dedup-to-None).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>
…edup (resolved)

Resolved inline by the dedup-clear fix in this branch (the POST handler now
deletes the stale arq:result:<job_id> before re-enqueue). The idea recommended
option 1 (worker-side clear on terminal state); the POST-side clear shipped
here is strictly more robust — Arq writes the result AFTER the job function
returns, so a crashed worker could never clear its own result, whereas the
next POST always clears it regardless of how the prior run ended. Regenerated
dashboards + public roadmap for the folder removal.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>
@SoundMindsAI SoundMindsAI changed the title from "fix: in-container demo reseed stuck at Scenario 0 (idempotent resolver + Solr data-dir prep)" to "fix: in-container demo reseed stuck at Scenario 0 (idempotent resolver + Solr data-dir + Arq dedup-retry)" Jun 18, 2026
Non-interactive sudo fails fast instead of hanging on a password prompt in
CI / automated provisioning when passwordless sudo isn't configured — the
warning fallback then fires. Gemini Code Assist Medium finding (install.sh).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>
@SoundMindsAI

Owner Author

Review adjudication (Gemini Code Assist)

Fix commit: latest push (sudo -n).

| # | Sev | Location | Verdict | Notes |
|---|-----|----------|---------|-------|
| 1 | Medium | scripts/install.sh:339 | Accepted | Fixed: sudo → sudo -n in the Solr data-dir chown fallback. A bare sudo can block on a password prompt in non-interactive installs; -n fails fast and lets the WARN fallback fire. |
| 2 | Medium | backend/app/services/demo_seeding.py:469 | Deferred | Defensive against an input that doesn't occur. _resolve_engine_base_url only ever receives the demo SCENARIOS' hardcoded host_base_url values plus the ES/OS/SOLR module constants, all trailing-slash-free (seed_meaningful_demos.py:93-105). Operator-registered cluster URLs never reach this function (it's demo-seeding-only). The rstrip("/") would also subtly change the returned value for the pass-through branch. Not a regression this PR introduces; declining the speculative hardening. |

Outcomes

  • Applied (1): sudo -n non-interactive chown.
  • Deferred as non-regression (1): trailing-slash normalization (no live trailing-slash path into this resolver).
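
The adjudication's note that rstrip("/") would change the pass-through return value can be illustrated with a small comparison. This is only a sketch: the two helper names are hypothetical, and the target set is reduced to the one Compose-DNS URL quoted in this PR.

```python
from typing import Optional

# Illustrative subset: only this Compose-DNS URL is quoted in the PR.
_COMPOSE_DNS_TARGETS = frozenset({"http://elasticsearch:9200"})


def passthrough_exact(host_base_url: str) -> Optional[str]:
    """Shipped behavior (as sketched): return the input only on an exact match."""
    if host_base_url in _COMPOSE_DNS_TARGETS:
        return host_base_url
    return None


def passthrough_normalized(host_base_url: str) -> Optional[str]:
    """Suggested variant: strip trailing slashes before the membership check."""
    normalized_url = host_base_url.rstrip("/")
    if normalized_url in _COMPOSE_DNS_TARGETS:
        return normalized_url  # no longer the caller's original string
    return None
```

For an input with a trailing slash, the exact variant does not match at all, while the normalized variant matches but returns a string that differs from what the caller passed in. Both variants agree on the slash-free URLs the demo scenarios actually use.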

Ready for merge.

@SoundMindsAI SoundMindsAI merged commit e7b787a into main Jun 18, 2026
20 checks passed
@SoundMindsAI SoundMindsAI deleted the bug_reseed_resolve_engine_base_url_not_idempotent_in_container branch June 18, 2026 23:42
SoundMindsAI added a commit that referenced this pull request Jun 18, 2026
state.md: prepend PR #564 to "Last 5 merges", drop feat_studies_starting_metric
(#545) into the older-entries rollup, refresh branch-context + Last-updated.
state_history.md: full three-root-cause narrative prepended.

Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>