
ci: add free-disk-space step to system-test job #1446

Merged
devantler merged 9 commits into main from devantler/move-free-disk-space-action on Apr 28, 2026

Conversation

@devantler
Contributor

Add endersonmenezes/free-disk-space@v3.2.2 before the ksail-cluster step in the system-test job to reclaim ~20GB on GitHub runners.

Removes Android SDK (~10GB), .NET (~4GB), Haskell (~4GB), and tool cache (~6GB) to prevent DiskPressure / No space left on device during Talos cluster creation and Flux reconciliation.

Inputs

| Input             | Value | Space freed |
| ----------------- | ----- | ----------- |
| remove_android    | true  | ~10 GB      |
| remove_dotnet     | true  | ~4 GB       |
| remove_haskell    | true  | ~4 GB       |
| remove_tool_cache | true  | ~6 GB       |
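
For reference, a minimal sketch of how the step could look in the system-test job (placed before the ksail-cluster step per the description above; the step name and surrounding context are illustrative, only the action reference and inputs come from this PR):

```yaml
# Illustrative sketch only; step name and placement details are assumptions.
- name: Free disk space
  uses: endersonmenezes/free-disk-space@v3.2.2
  with:
    remove_android: true
    remove_dotnet: true
    remove_haskell: true
    remove_tool_cache: true
```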

Copilot AI review requested due to automatic review settings April 25, 2026 21:58
Contributor

Copilot AI left a comment

Pull request overview

This PR updates the CI workflow to proactively reclaim disk space on GitHub-hosted runners before provisioning the Talos-in-Docker system test cluster, reducing the likelihood of DiskPressure / No space left on device failures during cluster creation and Flux reconciliation.

Changes:

  • Add a “Free disk space” step in the system-test job prior to the KSail cluster action.
  • Configure the action to remove Android, .NET, Haskell, and the runner tool cache.

Add endersonmenezes/free-disk-space@v3.2.2 before ksail-cluster to
reclaim ~20GB on GitHub runners by removing Android SDK, .NET, Haskell,
and tool cache. Prevents DiskPressure during Talos cluster creation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@devantler devantler force-pushed the devantler/move-free-disk-space-action branch from 18696ca to 903942b on April 25, 2026 22:03
@botantler botantler Bot added this pull request to the merge queue Apr 25, 2026
@devantler devantler removed this pull request from the merge queue due to a manual request Apr 25, 2026
@devantler devantler added this pull request to the merge queue Apr 25, 2026
@devantler devantler removed this pull request from the merge queue due to a manual request Apr 25, 2026
Copilot AI review requested due to automatic review settings April 27, 2026 17:48
@devantler devantler enabled auto-merge April 27, 2026 17:48
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

Comment thread: .github/workflows/ci.yaml
FleetDM includes MySQL, Redis, and migration jobs that need time to
initialize, especially on resource-constrained CI runners. The default
5m Helm install timeout is too tight, causing CrashLoopBackOff and
flaky CI failures across multiple PRs.

Aligns with the pattern used by other heavy HelmReleases (velero,
kube-prometheus-stack, kyverno) that set explicit timeouts and
infinite remediation retries.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
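
As a rough sketch, the pattern described above (an explicit timeout plus unlimited remediation retries) looks like this on a Flux HelmRelease; the release name and timeout value are assumptions based on later commits in this thread, and the chart/source reference is omitted:

```yaml
# Illustrative excerpt; metadata and values are assumptions, chart reference omitted.
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: fleet
spec:
  timeout: 10m          # give MySQL, Redis, and the migration job time to settle
  install:
    remediation:
      retries: -1       # negative value = retry indefinitely
  upgrade:
    remediation:
      retries: -1
```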
The apps Flux Kustomization had a 5m health check timeout, but
FleetDM's HelmRelease (with MySQL + Redis + migrations) needs up to
10m to install. The health check was failing before FleetDM had a
chance to complete its install, forcing unnecessary retry cycles and
causing flaky CI failures.

Aligns the Kustomization timeout with the heaviest HelmRelease
timeout in the apps group.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
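
A hedged sketch of the corresponding Kustomization change; only the timeout intent comes from the commit message, while the name, path, and sourceRef below are placeholders:

```yaml
# Illustrative excerpt; metadata, path, and sourceRef are placeholders.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  interval: 10m
  timeout: 10m          # aligned with the heaviest HelmRelease in the apps group
  wait: true            # health checks block until workloads report Ready
  prune: true
  path: ./k8s/apps      # placeholder path
  sourceRef:
    kind: GitRepository
    name: flux-system
```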
Copilot AI review requested due to automatic review settings April 27, 2026 20:08
The 15m total timeout didn't leave room for FleetDM HelmRelease retries
after infrastructure-controllers and infrastructure consume ~2-3m.
With apps Kustomization timeout at 10m and retryInterval at 2m, a single
retry cycle needs ~24m total (2m infra + 10m + 2m + 10m).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ting

The Fleet chart deploys the migration Job and Fleet Deployment
simultaneously with MySQL when mysql.enabled=true. This causes a race
condition: migrations fail before MySQL is ready, Fleet crashes, and
exponential backoff prevents convergence within the Helm timeout.

Add init containers via postRenderers:
- Job/fleet-migration: wait for MySQL TCP (port 3306)
- Deployment/fleet: wait for MySQL (3306) and Redis (6379)

With the race condition eliminated, revert all timeout increases back
to their original values (HelmRelease default, apps Kustomization 5m,
ksail connection 15m).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
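
A sketch of what such a postRenderers patch can look like on the FleetDM HelmRelease; the MySQL/Redis service names, the busybox image, and the workload names are assumptions rather than values taken from the actual manifests:

```yaml
# Under spec of the FleetDM HelmRelease; names below are illustrative.
postRenderers:
  - kustomize:
      patches:
        - target:
            kind: Job
            name: fleet-migration
          patch: |
            apiVersion: batch/v1
            kind: Job
            metadata:
              name: fleet-migration
            spec:
              template:
                spec:
                  initContainers:
                    - name: wait-for-mysql
                      image: busybox:1.36
                      command: ["sh", "-c", "until nc -z fleet-mysql 3306; do sleep 2; done"]
        - target:
            kind: Deployment
            name: fleet
          patch: |
            apiVersion: apps/v1
            kind: Deployment
            metadata:
              name: fleet
            spec:
              template:
                spec:
                  initContainers:
                    - name: wait-for-mysql
                      image: busybox:1.36
                      command: ["sh", "-c", "until nc -z fleet-mysql 3306; do sleep 2; done"]
                    - name: wait-for-redis
                      image: busybox:1.36
                      command: ["sh", "-c", "until nc -z fleet-redis 6379; do sleep 2; done"]
```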
- Add securityContext to init containers (PodSecurity restricted)
- Add 15s sleep after MySQL TCP check to allow full initialization
  (MySQL opens port 3306 before accepting connections)
- Increase HelmRelease timeout to 10m for retry headroom
- Increase apps Kustomization timeout to 10m
- Increase ksail connection timeout to 25m

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
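
A sketch of one init container hardened for these constraints; the service name, image, and uid are assumptions, the securityContext fields follow the PodSecurity restricted profile, and the trailing sleep matches the 15s grace period described above:

```yaml
# Illustrative init container; fleet-mysql service name and uid 65534 are assumptions.
- name: wait-for-mysql
  image: busybox:1.36
  command:
    - sh
    - -c
    - "until nc -z fleet-mysql 3306; do sleep 2; done; sleep 15"
  securityContext:
    runAsNonRoot: true
    runAsUser: 65534
    allowPrivilegeEscalation: false
    capabilities:
      drop: ["ALL"]
    seccompProfile:
      type: RuntimeDefault
```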
The Flux kustomize controller fails to process the apps kustomization
when swap is removed, as the runner lacks memory headroom for all the
controllers running in the Talos-in-Docker cluster.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The Fleet binary writes ~/.goquery/history on every invocation (even
`fleet --version`), but the chart sets readOnlyRootFilesystem=true and
runs as uid 3333 which has no /etc/passwd entry. The result is that
fleet-migration silently crashes with the cryptic `<timestamp> N <nil>`
log line and the Helm install times out before convergence.

Fix: postRender both Job/fleet-migration and Deployment/fleet to:
- Mount an emptyDir at /home/fleet (chart only supports extraVolumes
  on the Deployment, not the Job)
- Run as the image's real fleet user (uid 100, gid 101) so the Go
  runtime can look up $HOME and the migration container exits cleanly

Verified locally: migration completes in ~98s and fleet pod becomes
Ready, replacing the previous CrashLoopBackOff loop. Reverts the
unnecessary timeout bumps that were added while diagnosing this.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
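
For illustration, the Job half of that postRender could look roughly like the strategic merge patch below (the Deployment gets the same treatment); the container and volume names are assumptions:

```yaml
# Sketch only; applied via postRenderers before the Job is created.
patch: |
  apiVersion: batch/v1
  kind: Job
  metadata:
    name: fleet-migration
  spec:
    template:
      spec:
        securityContext:
          runAsUser: 100     # the image's real fleet user
          runAsGroup: 101
        volumes:
          - name: fleet-home
            emptyDir: {}
        containers:
          - name: fleet-migration   # assumed container name
            volumeMounts:
              - name: fleet-home
                mountPath: /home/fleet
```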
@devantler devantler added this pull request to the merge queue Apr 28, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Apr 28, 2026
@devantler devantler added this pull request to the merge queue Apr 28, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Apr 28, 2026
@devantler devantler merged commit ff639bf into main Apr 28, 2026
10 checks passed
@devantler devantler deleted the devantler/move-free-disk-space-action branch April 28, 2026 12:55
@github-project-automation github-project-automation Bot moved this from 🚀 In Finalization to ✅ Done in 🌊 Project Board Apr 28, 2026

Labels

None yet

Projects

Status: ✅ Done

2 participants