Skip to content

chore: regenerate stale guide screenshots/videos for guides 01 + 06 (MVP2)#363

Merged
SoundMindsAI merged 3 commits into
mainfrom
chore/regen-stale-guide-screenshots
Jun 1, 2026
Merged

chore: regenerate stale guide screenshots/videos for guides 01 + 06 (MVP2)#363
SoundMindsAI merged 3 commits into
mainfrom
chore/regen-stale-guide-screenshots

Conversation

@SoundMindsAI

Copy link
Copy Markdown
Owner

What

The in-app walkthrough assets for guide 01 (register cluster) and guide 06 (create + monitor study) were captured 2026-05-27 — before the MVP2 batch (Solr adapter, UBI judgments, convergence indicator, overnight autopilot) landed. They showed a pre-Solr, pre-convergence UI. This regenerates both guides' slides, captions, and videos against the current stack.

Surfaced by the guide-gen --audit recorded in PR #354's OSS-launch checklist.

Guide 06 — showcase the convergence verdict

The <ConvergencePanel> (the headline feat_study_convergence_indicator surface) needs ≥ STUDIES_TPE_WARMUP_FLOOR (50) usable trials to emit a real verdict. A real study seeded here scores 0.0 on every trial (the seeded judgments reference doc IDs not in the local ES products index), producing a flat-zero curve + too_few_trials — which looks broken, not instructive.

  • Switched guide-06's seed to the test-only seed-completed endpoint with 50 synthetic trials, so the detail page renders a genuine converged verdict + best-so-far curve and the ConfidencePanel happy path. Verified via API: verdict=converged, 50 trials, best_metric=0.487.
  • Best-so-far stays pinned to the unchanged 0.412 → 0.487 winner story (every extra metric < 0.487), so best_metric / proposal / digest — which other E2E specs assert on — are untouched.
  • Backend (additive): optional extra_trial_metrics: list[float] on seed_study_completed_with_digest + SeedCompletedStudyRequest (default None = unchanged 2-trial behaviour).
  • Spec expands the convergence-curve <details> (auto-collapsed when converged) before the shot; captions refreshed (study name, queued-filter note, + a new Convergence-panel paragraph).

Guide 01 — show all three engines

  • Pre-seed an OpenSearch + a Solr cluster via API before the landing shot, so the clusters list, the engine filter chips (all / elasticsearch / opensearch / solr), and the per-engine <EngineBadge> are backed by real rows of each engine (previously ES + OpenSearch only; no Solr).
  • Caption + GUIDE_REGISTRY/metadata.json description updated to name all three engines.

Videos

Both walkthrough.webm recaptured via the demo config + promote-videos.mjs.

Validation

  • Backend ruff + mypy clean; _test contract guard (53) green.
  • Guide vitest (42) green; tsc clean; full pre-commit gate passed.
  • Both specs pass against the live stack; API confirms the seeded study reports verdict=converged.

Notes

🤖 Generated with Claude Code

…MVP2)

The in-app walkthrough assets for guides 01 (register cluster) and 06
(create + monitor study) were captured 2026-05-27, before the MVP2 batch
landed, so they showed a pre-Solr, pre-convergence UI. Regenerate both —
slides, captions, and videos — against the current stack.

Guide 06 — showcase the convergence verdict:
- The convergence classifier needs >= STUDIES_TPE_WARMUP_FLOOR (50) usable
  trials to emit a real verdict; a real study here scores 0.0 on every trial
  (seeded judgments don't match the local ES index), yielding a flat-zero
  curve + too_few_trials. Switch the guide-06 seed to the test-only
  seed-completed endpoint with 50 synthetic trials so the study-detail page
  renders a genuine `converged` verdict + best-so-far curve, plus the
  ConfidencePanel happy path. Best-so-far stays pinned to the unchanged
  0.412 -> 0.487 winner story (every extra metric < 0.487).
- Backend: add an optional `extra_trial_metrics: list[float]` to
  seed_study_completed_with_digest + the SeedCompletedStudyRequest schema
  (default None = unchanged 2-trial behaviour). Purely additive.
- Spec: expand the convergence-curve <details> (auto-collapsed when
  converged) before the screenshot; wait on confidence + convergence panels.
- Captions: refresh study-name references, the queued-filter note, and add
  the Convergence panel to the detail-page caption.

Guide 01 — show all three engines:
- Pre-seed an OpenSearch + a Solr cluster via API before the landing shot so
  the clusters list + engine filter chips (all/elasticsearch/opensearch/solr)
  + per-engine EngineBadge are backed by real rows of each engine.
- Caption + GUIDE_REGISTRY/metadata description updated to name all three
  engines (was "Elasticsearch or OpenSearch").

Videos: recaptured both walkthrough.webm via the demo config + promote-videos.

Validated: backend ruff+mypy clean; _test contract guard (53) green; guide
vitest (42) green; tsc clean; both specs pass against the live stack; API
confirms the seeded study reports verdict=converged, 50 trials, best 0.487.

Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces support for seeding additional trials (extra_trial_metrics) in completed studies to satisfy the convergence classifier's minimum trial requirements, allowing the <ConvergencePanel> to render a real verdict and curve in E2E tests and guides. It also updates the cluster registration guide and tests to showcase multi-engine support (Elasticsearch, OpenSearch, and Apache Solr). Feedback suggests optimizing the trial seeding loop in test_seeding.py to avoid excessive database round-trips, and adding assertions to ensure pre-seeding API calls succeed in the E2E tests.

Comment on lines +176 to +194
if extra_trial_metrics:
cursor = started + timedelta(milliseconds=2500)
for offset, metric in enumerate(extra_trial_metrics):
trial_number = 2 + offset
await repo.create_trial(
db,
id=str(uuid_utils.uuid7()),
study_id=study_id,
optuna_trial_number=trial_number,
params={"boost": round(1.5 + 0.1 * offset, 2)},
primary_metric=metric,
metrics={"ndcg@10": metric},
duration_ms=1100,
status="complete",
error=None,
started_at=cursor,
ended_at=cursor + timedelta(milliseconds=1100),
)
cursor = cursor + timedelta(milliseconds=1200)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Calling repo.create_trial inside a loop executes a db.flush() and db.refresh(trial) for every single trial. When seeding 48 extra trials, this results in 96 database round-trips, which significantly slows down the E2E test setup.

Since the returned Trial objects are not used in the loop, we can optimize this by directly instantiating Trial models and adding them to the session via db.add(). The subsequent complete_study call will flush all of them in a single database round-trip.

    if extra_trial_metrics:
        from backend.app.db.models import Trial
        cursor = started + timedelta(milliseconds=2500)
        for offset, metric in enumerate(extra_trial_metrics):
            trial_number = 2 + offset
            db.add(
                Trial(
                    id=str(uuid_utils.uuid7()),
                    study_id=study_id,
                    optuna_trial_number=trial_number,
                    params={"boost": round(1.5 + 0.1 * offset, 2)},
                    primary_metric=metric,
                    metrics={"ndcg@10": metric},
                    duration_ms=1100,
                    status="complete",
                    error=None,
                    started_at=cursor,
                    ended_at=cursor + timedelta(milliseconds=1100),
                )
            )
            cursor = cursor + timedelta(milliseconds=1200)

Comment on lines +73 to +79
for (const c of preSeed) {
const r = await apiCtx.post('/api/v1/clusters', { data: c });
if (r.status() === 201) {
const body = (await r.json()) as { id: string };
appendForCleanup('cluster', body.id);
}
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The pre-seeding API calls are executed without asserting their success. If any of the pre-seed requests fail (e.g., due to database or network issues), the test will silently skip registering them for cleanup and continue, leading to confusing downstream failures or incorrect screenshots.

Adding an explicit assertion like expect(r.status()).toBe(201) ensures that any pre-seeding failure is caught immediately with a clear error message.

    for (const c of preSeed) {
      const r = await apiCtx.post('/api/v1/clusters', { data: c });
      expect(r.status()).toBe(201);
      const body = (await r.json()) as { id: string };
      appendForCleanup('cluster', body.id);
    }

 F2)

The OpenSearch + Solr pre-seed POSTs in guide 01 were fire-and-forget — a
silent failure (e.g. the wrong credentials_ref that 503s the probe, which
bit this during development) would leave the landing screenshot missing an
engine while the test still passed green. Assert 201 so any pre-seed failure
surfaces immediately with a labeled error.

Gemini F1 (batch the 48 extra create_trial flushes via db.add) rejected:
keeps the repo-layer convention (every trial insert goes through
repo.create_trial), and the per-flush cost is immaterial in a test-only
seeder (the spec runs in ~8s). Consistency > micro-opt on a non-hot path.

Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>
@SoundMindsAI

Copy link
Copy Markdown
Owner Author

Gemini review adjudication + CI status

Gemini findings (both on this PR's actual changes, HEAD 79eeda78)

# File:line Finding Severity Verdict Action
F1 test_seeding.py:194 48 extra create_trial calls = 48 flush/refresh round-trips; batch via db.add Medium Reject (reasoned) Keeps the repo-layer convention (every trial insert goes through repo.create_trial); per-flush cost is immaterial in a test-only seeder (spec runs ~8s). Consistency > micro-opt on a non-hot path.
F2 01_*.spec.ts:79 Pre-seed cluster POSTs unasserted — silent failure → wrong screenshot, green test Medium Accept Fixed in a3d235d5expect(r.status()).toBe(201). This exact failure mode (wrong credentials_ref → 503) bit me during development; the assertion would have caught it immediately.

Backend CI — pre-existing test-isolation flakiness, NOT a regression

The backend + smoke jobs are red, but the diagnosis is conclusive that it's not this PR:

  1. The failing set is unstable across identical re-runs — re-running the same commit produced a different failure list (e.g. test_enum_source_of_truth_helpers, test_openapi_surface, test_demo_seeding_ubi_full appeared only on the second run). Order-dependent pollution under pytest-randomly, not a deterministic break.
  2. The failing tests pass in isolation — e.g. test_cluster_health_warmup.py (7/7) and test_capability_check.py run green locally; the failures are caplog-empty / assert [] shapes characteristic of cross-test log-capture pollution.
  3. This PR touches only backend/app/api/v1/_test.py + backend/app/services/test_seeding.py (test-only seeding). None of the 21 failing tests, nor any of their systems-under-test, are touched by the diff — verified file-by-file.
  4. Two recurring failures (test_judgment_generate click-bucket, test_migration_0021) are pre-existing MVP2 drift already tracked (the click-bucket one is issues Render the click (UBI) bucket in the judgment-list source-breakdown card #356/Update contract-test allowlists for MVP2 endpoints + enum value #357).

Touched-area tests are green locally: _test contract guard (53), convergence domain+service (45). The seeded study was verified via API: verdict=converged, 50 trials, best_metric=0.487.

Merging despite the pre-existing flaky reds (operator-confirmed). The broad backend test-isolation failure deserves its own bug — filing a follow-up.

🤖 Generated with Claude Code

…domly caplog pollution)

Surfaced by PR #363 CI: ~15 caplog/contract tests fail nondeterministically
under the full randomized suite (unstable failing set across identical
re-runs) but pass in isolation. Distinct from the narrow
bug_baseline_phase_test_isolation. Makes the backend coverage job
nondeterministically red now that SKIP_HEAVY_CI is lifted. P1.

Includes the lockstep dashboard + roadmap regen the pre-commit hook produces.

Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>
@SoundMindsAI

Copy link
Copy Markdown
Owner Author

Final CI status — merging

11/13 checks green — every deterministic gate: frontend, backend static (ruff+mypy+guards), both docker buildx, both license gates, fast-lane unit tests, DCO, gitleaks, secrets.

2 red, both pre-existing infra flakes with no causal link to this diff:

  • backend (… tests + coverage) — the pytest-ordering caplog test-isolation nondeterminism (failing set changes across identical re-runs; tests pass in isolation; diff touches none of the failing tests/SUTs). Filed P1: bug_backend_suite_nondeterministic_caplog_isolation.
  • smoke (operator-path tutorial flow) — failed at "Bring up the stack": container relyloop-solr-1 exited (1) during make up, before any test logic ran. All other services came up Healthy. A Solr-container startup flake; this PR changes nothing about the Solr Compose service / image / config.

This PR's diff is test-only seeding (_test.py, test_seeding.py) + guide specs/assets. Touched-area tests green locally (contract guard 53, convergence 45); seeded study verified verdict=converged. Merging per operator decision to proceed despite the pre-existing flaky reds.

@SoundMindsAI SoundMindsAI merged commit f70e971 into main Jun 1, 2026
11 of 13 checks passed
@SoundMindsAI SoundMindsAI deleted the chore/regen-stale-guide-screenshots branch June 1, 2026 03:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant