chore: regenerate stale guide screenshots/videos for guides 01 + 06 (MVP2)#363
Conversation
…MVP2) The in-app walkthrough assets for guides 01 (register cluster) and 06 (create + monitor study) were captured 2026-05-27, before the MVP2 batch landed, so they showed a pre-Solr, pre-convergence UI. Regenerate both — slides, captions, and videos — against the current stack. Guide 06 — showcase the convergence verdict: - The convergence classifier needs >= STUDIES_TPE_WARMUP_FLOOR (50) usable trials to emit a real verdict; a real study here scores 0.0 on every trial (seeded judgments don't match the local ES index), yielding a flat-zero curve + too_few_trials. Switch the guide-06 seed to the test-only seed-completed endpoint with 50 synthetic trials so the study-detail page renders a genuine `converged` verdict + best-so-far curve, plus the ConfidencePanel happy path. Best-so-far stays pinned to the unchanged 0.412 -> 0.487 winner story (every extra metric < 0.487). - Backend: add an optional `extra_trial_metrics: list[float]` to seed_study_completed_with_digest + the SeedCompletedStudyRequest schema (default None = unchanged 2-trial behaviour). Purely additive. - Spec: expand the convergence-curve <details> (auto-collapsed when converged) before the screenshot; wait on confidence + convergence panels. - Captions: refresh study-name references, the queued-filter note, and add the Convergence panel to the detail-page caption. Guide 01 — show all three engines: - Pre-seed an OpenSearch + a Solr cluster via API before the landing shot so the clusters list + engine filter chips (all/elasticsearch/opensearch/solr) + per-engine EngineBadge are backed by real rows of each engine. - Caption + GUIDE_REGISTRY/metadata description updated to name all three engines (was "Elasticsearch or OpenSearch"). Videos: recaptured both walkthrough.webm via the demo config + promote-videos. Validated: backend ruff+mypy clean; _test contract guard (53) green; guide vitest (42) green; tsc clean; both specs pass against the live stack; API confirms the seeded study reports verdict=converged, 50 trials, best 0.487. Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>
There was a problem hiding this comment.
Code Review
This pull request introduces support for seeding additional trials (extra_trial_metrics) in completed studies to satisfy the convergence classifier's minimum trial requirements, allowing the <ConvergencePanel> to render a real verdict and curve in E2E tests and guides. It also updates the cluster registration guide and tests to showcase multi-engine support (Elasticsearch, OpenSearch, and Apache Solr). Feedback suggests optimizing the trial seeding loop in test_seeding.py to avoid excessive database round-trips, and adding assertions to ensure pre-seeding API calls succeed in the E2E tests.
| if extra_trial_metrics: | ||
| cursor = started + timedelta(milliseconds=2500) | ||
| for offset, metric in enumerate(extra_trial_metrics): | ||
| trial_number = 2 + offset | ||
| await repo.create_trial( | ||
| db, | ||
| id=str(uuid_utils.uuid7()), | ||
| study_id=study_id, | ||
| optuna_trial_number=trial_number, | ||
| params={"boost": round(1.5 + 0.1 * offset, 2)}, | ||
| primary_metric=metric, | ||
| metrics={"ndcg@10": metric}, | ||
| duration_ms=1100, | ||
| status="complete", | ||
| error=None, | ||
| started_at=cursor, | ||
| ended_at=cursor + timedelta(milliseconds=1100), | ||
| ) | ||
| cursor = cursor + timedelta(milliseconds=1200) |
There was a problem hiding this comment.
Calling repo.create_trial inside a loop executes a db.flush() and db.refresh(trial) for every single trial. When seeding 48 extra trials, this results in 96 database round-trips, which significantly slows down the E2E test setup.
Since the returned Trial objects are not used in the loop, we can optimize this by directly instantiating Trial models and adding them to the session via db.add(). The subsequent complete_study call will flush all of them in a single database round-trip.
if extra_trial_metrics:
from backend.app.db.models import Trial
cursor = started + timedelta(milliseconds=2500)
for offset, metric in enumerate(extra_trial_metrics):
trial_number = 2 + offset
db.add(
Trial(
id=str(uuid_utils.uuid7()),
study_id=study_id,
optuna_trial_number=trial_number,
params={"boost": round(1.5 + 0.1 * offset, 2)},
primary_metric=metric,
metrics={"ndcg@10": metric},
duration_ms=1100,
status="complete",
error=None,
started_at=cursor,
ended_at=cursor + timedelta(milliseconds=1100),
)
)
cursor = cursor + timedelta(milliseconds=1200)| for (const c of preSeed) { | ||
| const r = await apiCtx.post('/api/v1/clusters', { data: c }); | ||
| if (r.status() === 201) { | ||
| const body = (await r.json()) as { id: string }; | ||
| appendForCleanup('cluster', body.id); | ||
| } | ||
| } |
There was a problem hiding this comment.
The pre-seeding API calls are executed without asserting their success. If any of the pre-seed requests fail (e.g., due to database or network issues), the test will silently skip registering them for cleanup and continue, leading to confusing downstream failures or incorrect screenshots.
Adding an explicit assertion like expect(r.status()).toBe(201) ensures that any pre-seeding failure is caught immediately with a clear error message.
for (const c of preSeed) {
const r = await apiCtx.post('/api/v1/clusters', { data: c });
expect(r.status()).toBe(201);
const body = (await r.json()) as { id: string };
appendForCleanup('cluster', body.id);
}F2) The OpenSearch + Solr pre-seed POSTs in guide 01 were fire-and-forget — a silent failure (e.g. the wrong credentials_ref that 503s the probe, which bit this during development) would leave the landing screenshot missing an engine while the test still passed green. Assert 201 so any pre-seed failure surfaces immediately with a labeled error. Gemini F1 (batch the 48 extra create_trial flushes via db.add) rejected: keeps the repo-layer convention (every trial insert goes through repo.create_trial), and the per-flush cost is immaterial in a test-only seeder (the spec runs in ~8s). Consistency > micro-opt on a non-hot path. Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>
Gemini review adjudication + CI statusGemini findings (both on this PR's actual changes, HEAD
|
| # | File:line | Finding | Severity | Verdict | Action |
|---|---|---|---|---|---|
| F1 | test_seeding.py:194 |
48 extra create_trial calls = 48 flush/refresh round-trips; batch via db.add |
Medium | Reject (reasoned) | Keeps the repo-layer convention (every trial insert goes through repo.create_trial); per-flush cost is immaterial in a test-only seeder (spec runs ~8s). Consistency > micro-opt on a non-hot path. |
| F2 | 01_*.spec.ts:79 |
Pre-seed cluster POSTs unasserted — silent failure → wrong screenshot, green test | Medium | Accept | Fixed in a3d235d5 — expect(r.status()).toBe(201). This exact failure mode (wrong credentials_ref → 503) bit me during development; the assertion would have caught it immediately. |
Backend CI — pre-existing test-isolation flakiness, NOT a regression
The backend + smoke jobs are red, but the diagnosis is conclusive that it's not this PR:
- The failing set is unstable across identical re-runs — re-running the same commit produced a different failure list (e.g.
test_enum_source_of_truth_helpers,test_openapi_surface,test_demo_seeding_ubi_fullappeared only on the second run). Order-dependent pollution underpytest-randomly, not a deterministic break. - The failing tests pass in isolation — e.g.
test_cluster_health_warmup.py(7/7) andtest_capability_check.pyrun green locally; the failures are caplog-empty /assert []shapes characteristic of cross-test log-capture pollution. - This PR touches only
backend/app/api/v1/_test.py+backend/app/services/test_seeding.py(test-only seeding). None of the 21 failing tests, nor any of their systems-under-test, are touched by the diff — verified file-by-file. - Two recurring failures (
test_judgment_generateclick-bucket,test_migration_0021) are pre-existing MVP2 drift already tracked (theclick-bucket one is issues Render the click (UBI) bucket in the judgment-list source-breakdown card #356/Update contract-test allowlists for MVP2 endpoints + enum value #357).
Touched-area tests are green locally: _test contract guard (53), convergence domain+service (45). The seeded study was verified via API: verdict=converged, 50 trials, best_metric=0.487.
Merging despite the pre-existing flaky reds (operator-confirmed). The broad backend test-isolation failure deserves its own bug — filing a follow-up.
🤖 Generated with Claude Code
…domly caplog pollution) Surfaced by PR #363 CI: ~15 caplog/contract tests fail nondeterministically under the full randomized suite (unstable failing set across identical re-runs) but pass in isolation. Distinct from the narrow bug_baseline_phase_test_isolation. Makes the backend coverage job nondeterministically red now that SKIP_HEAVY_CI is lifted. P1. Includes the lockstep dashboard + roadmap regen the pre-commit hook produces. Signed-off-by: SoundMindsAI <eric.starr@soundminds.ai>
Final CI status — merging11/13 checks green — every deterministic gate: frontend, backend static (ruff+mypy+guards), both docker buildx, both license gates, fast-lane unit tests, DCO, gitleaks, secrets. 2 red, both pre-existing infra flakes with no causal link to this diff:
This PR's diff is test-only seeding ( |
What
The in-app walkthrough assets for guide 01 (register cluster) and guide 06 (create + monitor study) were captured 2026-05-27 — before the MVP2 batch (Solr adapter, UBI judgments, convergence indicator, overnight autopilot) landed. They showed a pre-Solr, pre-convergence UI. This regenerates both guides' slides, captions, and videos against the current stack.
Surfaced by the
guide-gen --auditrecorded in PR #354's OSS-launch checklist.Guide 06 — showcase the convergence verdict
The
<ConvergencePanel>(the headlinefeat_study_convergence_indicatorsurface) needs ≥STUDIES_TPE_WARMUP_FLOOR(50) usable trials to emit a real verdict. A real study seeded here scores 0.0 on every trial (the seeded judgments reference doc IDs not in the local ESproductsindex), producing a flat-zero curve +too_few_trials— which looks broken, not instructive.seed-completedendpoint with 50 synthetic trials, so the detail page renders a genuineconvergedverdict + best-so-far curve and the ConfidencePanel happy path. Verified via API:verdict=converged, 50 trials,best_metric=0.487.best_metric/ proposal / digest — which other E2E specs assert on — are untouched.extra_trial_metrics: list[float]onseed_study_completed_with_digest+SeedCompletedStudyRequest(defaultNone= unchanged 2-trial behaviour).<details>(auto-collapsed when converged) before the shot; captions refreshed (study name, queued-filter note, + a new Convergence-panel paragraph).Guide 01 — show all three engines
all / elasticsearch / opensearch / solr), and the per-engine<EngineBadge>are backed by real rows of each engine (previously ES + OpenSearch only; no Solr).GUIDE_REGISTRY/metadata.jsondescription updated to name all three engines.Videos
Both
walkthrough.webmrecaptured via the demo config +promote-videos.mjs.Validation
_testcontract guard (53) green.tscclean; full pre-commit gate passed.verdict=converged.Notes
/api/v1/_test/..., 404 outsideENVIRONMENT=development) — no production surface.🤖 Generated with Claude Code