Merged
143 changes: 143 additions & 0 deletions .github/workflows/pr.yml
@@ -221,6 +221,18 @@ jobs:
           # behavior is acceptable until the deploy workflow exists.
           RELYLOOP_API_URL: http://localhost:8000
         run: |
+          # CI-perf #3 (pytest-xdist `-n auto`) was attempted on the first
+          # PR #291 CI run and reverted: the integration test layer hit FK
+          # collisions (query_sets_cluster_id_fkey violation when parallel
+          # tests held a FK reference to a cluster being deleted in another
+          # worker's teardown). pytest-xdist remains in dev deps for local
+          # opt-in (`pytest -n auto` works fine on the unit-test layer);
+          # CI-perf #1 + #2 (buildx artifact handoff + base-image cache)
+          # are the actual smoke-pace wins. A follow-up may split the
+          # backend job into a parallel-safe "unit + contract" lane + a
+          # serial "integration" lane to recover #3's savings. See
+          # chore_ci_perf_buildx_artifact_image_cache_xdist/idea.md
+          # §"What is NOT changed in this PR".
           uv run pytest backend/tests/ \
             --cov=backend \
             --cov-report=xml \
@@ -307,6 +319,13 @@ jobs:
     name: smoke (operator-path tutorial flow)
     runs-on: ubuntu-24.04
     timeout-minutes: 15
+    # Depend on the parallel `docker` (API) + `docker-ui` jobs so both image
+    # artifacts are ready before `make up`. Before this PR, the smoke job paid
+    # ~10min for `docker compose up -d` (image pulls + API + UI builds inside
+    # the step). The artifact handoff (API + UI) + base-image cache + SKIP_BUILD
+    # below cut that to ~2-3min on a warm cache. See
+    # chore_ci_perf_buildx_artifact_image_cache_xdist/idea.md.
+    needs: [docker, docker-ui]
     permissions:
       contents: read
     steps:
@@ -365,7 +384,75 @@ jobs:
             exit 1
           fi

+      # CI-perf #1: download the pre-built API + UI images from the parallel
+      # docker / docker-ui jobs so `make up` skips both in-step `docker build`s
+      # (saves ~5min). Combined with RELYLOOP_SKIP_BUILD=1 (which makes
+      # install.sh skip its `docker compose build` step) compose just `up`s
+      # the loaded images. RELYLOOP_GIT_SHA below picks them up by tag.
+      - name: Download pre-built API image
+        uses: actions/download-artifact@v6
+        with:
+          name: relyloop-api-image-${{ github.sha }}
+          path: /tmp/
+
+      - name: Download pre-built UI image
+        uses: actions/download-artifact@v6
+        with:
+          name: relyloop-ui-image-${{ github.sha }}
+          path: /tmp/
+
+      - name: Load pre-built API + UI images into Docker
+        run: |
+          docker load -i /tmp/relyloop-api-image.tar
+          docker load -i /tmp/relyloop-ui-image.tar
+          docker image ls 'relyloop/*'
+
+      # CI-perf #2: cache the base service-container images (postgres / redis /
+      # elasticsearch / opensearch) keyed on their tags. On cache hit we
+      # `docker load` 4 tars in ~5s vs ~60-90s for `docker pull` on miss.
+      # The key hashes docker-compose.yml, so any edit to it (e.g. an image tag
+      # bump) forces a re-pull; subsequent runs hit the cache.
+      - name: Cache base service-container images
+        id: base-image-cache
+        uses: actions/cache@v5
+        with:
+          path: /tmp/docker-base-images
+          key: docker-base-images-v1-${{ hashFiles('docker-compose.yml') }}
+
+      - name: Pre-pull + save base images on cache miss
+        if: steps.base-image-cache.outputs.cache-hit != 'true'
+        run: |
+          mkdir -p /tmp/docker-base-images
+          for img in postgres:17 redis:8 elasticsearch:9.4.1 opensearchproject/opensearch:3.6.0; do
+            docker pull "$img"
+            safe=$(echo "$img" | tr '/:' '__')
+            docker save "$img" -o "/tmp/docker-base-images/${safe}.tar"
+          done
+
+      - name: Load base images on cache hit
+        if: steps.base-image-cache.outputs.cache-hit == 'true'
+        run: |
+          for tar in /tmp/docker-base-images/*.tar; do
+            docker load -i "$tar"
+          done
+          docker image ls
+
+      # Compose's `image:` lines reference `relyloop/api:${RELYLOOP_GIT_SHA:-dev}`
+      # and `relyloop/ui:${RELYLOOP_GIT_SHA:-dev}`. Setting RELYLOOP_GIT_SHA
+      # here makes compose pick up the loaded images instead of trying to
+      # build/pull them. RELYLOOP_SKIP_BUILD=1 also makes install.sh skip its
+      # explicit `docker compose build` step (added 2026-05-28; see install.sh
+      # step 6). Together these eliminate the API + UI build duplication in
+      # smoke that was eating ~5min per run.
       - name: Bring up the stack
+        env:
+          RELYLOOP_GIT_SHA: ${{ github.sha }}
+          RELYLOOP_SKIP_BUILD: "1"
+          # Same rationale as RELYLOOP_SKIP_BUILD: the 2 demo-dependent
+          # E2E specs were skipped in CI on 2026-05-28
+          # (chore_drop_demo_seed_from_ci). Without this skip install.sh
+          # would still auto-seed ~5min of demo data on every CI run.
+          RELYLOOP_SKIP_AUTO_SEED: "1"
         run: make up

       - name: Wait for /healthz
@@ -545,3 +632,59 @@ jobs:
               exit 1
             }
           '
+
+      # Export the built API image as a tar so the smoke job can `docker load`
+      # it instead of rebuilding (which costs ~2-3min inside `make up`). See
+      # chore_ci_perf_buildx_artifact_image_cache_xdist/idea.md for the
+      # smoke-pace context. compression-level: 0 because docker save already
+      # produces a compressed tar (re-compressing wastes ~30s with no win).
+      - name: Export API image as tar for smoke job
+        run: docker save relyloop/api:${{ github.sha }} -o /tmp/relyloop-api-image.tar
+
+      - name: Upload API image artifact
+        uses: actions/upload-artifact@v7
+        with:
+          name: relyloop-api-image-${{ github.sha }}
+          path: /tmp/relyloop-api-image.tar
+          retention-days: 1
+          compression-level: 0
+
+  docker-ui:
+    name: docker buildx (relyloop/ui)
+    runs-on: ubuntu-latest
+    # Parallel to `docker` (API buildx). Symmetric pattern: builds + uploads
+    # the UI image as a tar so the smoke job can `docker load` it instead of
+    # rebuilding inside `make up`. Reused via `needs: [docker, docker-ui]` on
+    # the smoke job + `RELYLOOP_SKIP_BUILD=1` to bypass install.sh's build step.
+    timeout-minutes: 10
+    steps:
+      - uses: actions/checkout@v6
+
+      - uses: docker/setup-buildx-action@v4
+
+      - name: Build relyloop/ui (no push, load into local daemon)
+        uses: docker/build-push-action@v7
+        with:
+          context: ./ui
+          file: ui/Dockerfile
+          push: false
+          load: true
+          tags: relyloop/ui:${{ github.sha }}
+          # The compose service bakes NEXT_PUBLIC_API_BASE_URL into the bundle
+          # at build time (Next.js inlines it at `next build`). Match the value
+          # docker-compose.yml line 183 sets so the smoke run uses the same URL.
+          build-args: |
+            NEXT_PUBLIC_API_BASE_URL=http://localhost:8000
+          cache-from: type=gha,scope=ui
+          cache-to: type=gha,scope=ui,mode=max
+
+      - name: Export UI image as tar for smoke job
+        run: docker save relyloop/ui:${{ github.sha }} -o /tmp/relyloop-ui-image.tar
+
+      - name: Upload UI image artifact
+        uses: actions/upload-artifact@v7
+        with:
+          name: relyloop-ui-image-${{ github.sha }}
+          path: /tmp/relyloop-ui-image.tar
+          retention-days: 1
+          compression-level: 0
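
Review note on the base-image cache step in the smoke job: the tar filenames come from `tr '/:' '__'`, which replaces both `/` and `:` with `_`. A minimal Python sketch of that mapping follows, as a way to sanity-check the names the cache directory will contain. The function name `tar_name_for_image` is hypothetical and not part of the workflow; the image list is copied from the step's `for` loop.

```python
# Hypothetical helper: mirrors the shell pipeline `echo "$img" | tr '/:' '__'`
# used in the "Pre-pull + save base images on cache miss" step.
_TRANSLATION = str.maketrans("/:", "__")


def tar_name_for_image(image_ref: str) -> str:
    """Return the tar filename the workflow would write for an image ref."""
    return image_ref.translate(_TRANSLATION) + ".tar"


# Image refs copied from the workflow's `for img in ...` loop.
BASE_IMAGES = [
    "postgres:17",
    "redis:8",
    "elasticsearch:9.4.1",
    "opensearchproject/opensearch:3.6.0",
]

if __name__ == "__main__":
    for image in BASE_IMAGES:
        print(f"{image} -> {tar_name_for_image(image)}")
```

Because the mapping collapses two different characters into one, it is not reversible. That is fine here, since the cache-hit step loads every `*.tar` in the directory rather than reconstructing image names from filenames.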
23 changes: 21 additions & 2 deletions backend/app/scripts/seed_es.py
@@ -45,7 +45,16 @@ async def main() -> int:
     products = json.loads(SAMPLES_PRODUCTS.read_text())
     logger.info("seed_es: loaded %d products from %s", len(products), SAMPLES_PRODUCTS)

-    async with httpx.AsyncClient(base_url=cluster.base_url, timeout=30.0) as client:
+    # timeout=90 (was 30): ES 9.4.1 single-node on a cold GHA runner can take
+    # >30s to respond to the first index-create PUT after `docker compose up
+    # --wait` returns. Observed in PR #291's 6th + 7th smoke runs after the
+    # fast stack-up (compose-up went from 10min → 21s, eliminating the
+    # ambient ES warmup time that previously masked this). The compose
+    # healthcheck waits for `_cluster/health?wait_for_status=yellow` which
+    # passes early on single-node ES (no shards to wait on), so ES is
+    # "healthy" but its write path needs more warmup. 90s gives headroom
+    # without making real failure modes invisible.
+    async with httpx.AsyncClient(base_url=cluster.base_url, timeout=90.0) as client:
         # DELETE existing index (idempotent: 404 is fine, that just means it didn't exist).
         delete_resp = await client.delete(f"/{INDEX_NAME}")
         if delete_resp.status_code not in (200, 404):
@@ -58,9 +67,19 @@ async def main() -> int:
             return 1

         # Create with mapping derived from the products schema.
+        #
+        # number_of_replicas=0 is required for single-node ES (local dev +
+        # CI). The default (1) tries to allocate a replica that can never
+        # bind on a one-node cluster, leaving the primary itself in an
+        # INITIALIZING → STARTED race that surfaces as an
+        # `unavailable_shards_exception` on the immediately-following
+        # bulk-index. Visible in PR #291 CI run after the faster stack-up
+        # (~3min vs ~10min) stopped masking the race with implicit warmup
+        # time. See chore_ci_perf_buildx_artifact_image_cache_xdist/idea.md.
         create_resp = await client.put(
             f"/{INDEX_NAME}",
             json={
+                "settings": {"number_of_replicas": 0},
                 "mappings": {
                     "properties": {
                         "title": {"type": "text"},
@@ -69,7 +88,7 @@ async def main() -> int:
                         "color": {"type": "keyword"},
                         "bullet_points": {"type": "text"},
                     }
-                }
+                },
             },
         )
         create_resp.raise_for_status()
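
Review note on the index-create change: the payload now carries an explicit `settings` block next to `mappings`. A minimal sketch of that payload shape follows, assuming a helper named `build_create_index_body` (hypothetical; seed_es.py builds the dict inline). The `properties` argument stands in for the field mapping, which this diff shows only in part, so the example passes just the fields visible in the hunks.

```python
from typing import Any


def build_create_index_body(
    properties: dict[str, Any], replicas: int = 0
) -> dict[str, Any]:
    """Build an index-create body with an explicit replica count.

    Hypothetical helper for illustration; defaults to 0 replicas, the
    value the PR sets for single-node Elasticsearch.
    """
    if replicas < 0:
        raise ValueError("replicas must be >= 0")
    return {
        "settings": {"number_of_replicas": replicas},
        "mappings": {"properties": properties},
    }


if __name__ == "__main__":
    # Only the fields visible in the diff hunks; lines elided between
    # hunks are deliberately left out rather than guessed.
    visible_fields = {
        "title": {"type": "text"},
        "color": {"type": "keyword"},
        "bullet_points": {"type": "text"},
    }
    print(build_create_index_body(visible_fields))
```

Keeping the replica count as a parameter would let a multi-node deployment pass a different value without touching the mapping, though the PR itself hard-codes 0.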
2 changes: 1 addition & 1 deletion docs/00_overview/DASHBOARD.md
@@ -6,7 +6,7 @@ _Top-level index across MVP1 → GA v1+ as of **2026-05-28**. Click a release na

| Release | Theme | Progress | Status |
|---|---|---|---|
-| [MVP1 / v0.1](MVP1_DASHBOARD.md) | The Loop | 88 / 89 scoped done · 16 remaining | **In progress** |
+| [MVP1 / v0.1](MVP1_DASHBOARD.md) | The Loop | 88 / 89 scoped done · 18 remaining | **In progress** |
| MVP1.5 / v0.1.5 | Real Signals | — | **Not yet scoped** |
| [MVP2 / v0.2](MVP2_DASHBOARD.md) | Observable | 1 / 1 scoped done · 1 remaining | **In progress** |
| MVP3 / v0.3 | Production Stacks | — | **Not yet scoped** |