ci: add free-disk-space step to system-test job #1446
Merged
Conversation
Contributor
Pull request overview
This PR updates the CI workflow to proactively reclaim disk space on GitHub-hosted runners before provisioning the Talos-in-Docker system test cluster, reducing the likelihood of DiskPressure / No space left on device failures during cluster creation and Flux reconciliation.
Changes:
- Add a "Free disk space" step in the `system-test` job prior to the KSail cluster action.
- Configure the action to remove Android, .NET, Haskell, and the runner tool cache.
Add endersonmenezes/free-disk-space@v3.2.2 before ksail-cluster to reclaim ~20GB on GitHub runners by removing the Android SDK, .NET, Haskell, and the tool cache. Prevents DiskPressure during Talos cluster creation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Force-pushed from 18696ca to 903942b.
FleetDM includes MySQL, Redis, and migration jobs that need time to initialize, especially on resource-constrained CI runners. The default 5m Helm install timeout is too tight, causing CrashLoopBackOff and flaky CI failures across multiple PRs. Aligns with the pattern used by other heavy HelmReleases (velero, kube-prometheus-stack, kyverno) that set explicit timeouts and infinite remediation retries.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
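As a hedged sketch (the release name, namespace, and chart fields below are placeholders, not taken from the diff), the change would follow the Flux v2 HelmRelease API, where a negative retries value means unlimited remediation:

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: fleet          # placeholder name
  namespace: apps      # placeholder namespace
spec:
  interval: 1h
  timeout: 10m         # explicit timeout instead of the 5m default
  install:
    remediation:
      retries: -1      # negative = unlimited retries (Flux semantics)
  upgrade:
    remediation:
      retries: -1
  chart:
    spec:
      chart: fleet     # placeholder chart reference
      sourceRef:
        kind: HelmRepository
        name: fleet
```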
The apps Flux Kustomization had a 5m health check timeout, but FleetDM's HelmRelease (with MySQL + Redis + migrations) needs up to 10m to install. The health check was failing before FleetDM had a chance to complete its install, forcing unnecessary retry cycles and causing flaky CI failures. Aligns the Kustomization timeout with the heaviest HelmRelease timeout in the apps group.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
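A sketch of the aligned Kustomization, assuming the usual flux-system source layout (the path and sourceRef are placeholders):

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  interval: 10m
  timeout: 10m        # was 5m; now matches the heaviest HelmRelease (FleetDM)
  retryInterval: 2m
  wait: true          # health checks block until HelmReleases are ready
  prune: true
  path: ./apps        # placeholder path
  sourceRef:
    kind: GitRepository
    name: flux-system
```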
The 15m total timeout didn't leave room for FleetDM HelmRelease retries after infrastructure-controllers and infrastructure consume ~2-3m. With the apps Kustomization timeout at 10m and retryInterval at 2m, a single retry cycle needs ~24m total (2m infra + 10m + 2m + 10m).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ting

The Fleet chart deploys the migration Job and the Fleet Deployment simultaneously with MySQL when mysql.enabled=true. This causes a race condition: migrations fail before MySQL is ready, Fleet crashes, and exponential backoff prevents convergence within the Helm timeout.

Add init containers via postRenderers:
- Job/fleet-migration: wait for MySQL TCP (port 3306)
- Deployment/fleet: wait for MySQL (3306) and Redis (6379)

With the race condition eliminated, revert all timeout increases back to their original values (HelmRelease default, apps Kustomization 5m, ksail connection 15m).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add securityContext to init containers (PodSecurity restricted)
- Add a 15s sleep after the MySQL TCP check to allow full initialization (MySQL opens port 3306 before accepting connections)
- Increase HelmRelease timeout to 10m for retry headroom
- Increase apps Kustomization timeout to 10m
- Increase ksail connection timeout to 25m

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
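Together, the last two commits describe a patch along these lines; a minimal sketch assuming a busybox image, a fleet-mysql Service name, and a nonroot uid, none of which are confirmed by the diff (the Deployment/fleet patch would add an analogous check for Redis on 6379):

```yaml
postRenderers:
  - kustomize:
      patches:
        - target:
            kind: Job
            name: fleet-migration
          patch: |
            apiVersion: batch/v1
            kind: Job
            metadata:
              name: fleet-migration
            spec:
              template:
                spec:
                  initContainers:
                    - name: wait-for-mysql
                      image: busybox:1.36   # image choice assumed
                      command:
                        - sh
                        - -c
                        # Probe 3306, then sleep 15s: MySQL opens the port
                        # before it is actually ready to accept connections.
                        - until nc -z fleet-mysql 3306; do sleep 2; done; sleep 15
                      securityContext:      # satisfies PodSecurity "restricted"
                        runAsNonRoot: true
                        runAsUser: 65534    # uid assumed
                        allowPrivilegeEscalation: false
                        capabilities:
                          drop: ["ALL"]
                        seccompProfile:
                          type: RuntimeDefault
```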
The Flux kustomize controller fails to process the apps kustomization when swap is removed, as the runner lacks memory headroom for all the controllers running in the Talos-in-Docker cluster.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The Fleet binary writes ~/.goquery/history on every invocation (even `fleet --version`), but the chart sets readOnlyRootFilesystem=true and runs as uid 3333, which has no /etc/passwd entry. As a result, fleet-migration silently crashes with the cryptic `<timestamp> N <nil>` log line and the Helm install times out before convergence.

Fix: postRender both Job/fleet-migration and Deployment/fleet to:
- Mount an emptyDir at /home/fleet (the chart only supports extraVolumes on the Deployment, not the Job)
- Run as the image's real fleet user (uid 100, gid 101) so the Go runtime can look up $HOME and the migration container exits cleanly

Verified locally: the migration completes in ~98s and the fleet pod becomes Ready, replacing the previous CrashLoopBackOff loop. Reverts the unnecessary timeout bumps that were added while diagnosing this.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
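A sketch of what that post-render patch might look like for the migration Job (the container name inside the Job is assumed; the Deployment gets the same treatment, or can use the chart's extraVolumes):

```yaml
postRenderers:
  - kustomize:
      patches:
        - target:
            kind: Job
            name: fleet-migration
          patch: |
            apiVersion: batch/v1
            kind: Job
            metadata:
              name: fleet-migration
            spec:
              template:
                spec:
                  securityContext:
                    runAsUser: 100    # the image's real fleet user
                    runAsGroup: 101   # so the Go runtime can resolve $HOME
                  containers:
                    - name: fleet-migration   # container name assumed
                      volumeMounts:
                        - name: fleet-home
                          mountPath: /home/fleet
                  volumes:
                    - name: fleet-home
                      emptyDir: {}    # writable home for ~/.goquery/history
```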
Add `endersonmenezes/free-disk-space@v3.2.2` before the `ksail-cluster` step in the `system-test` job to reclaim ~20GB on GitHub runners. Removes the Android SDK (~10GB), .NET (~4GB), Haskell (~4GB), and the tool cache (~6GB) to prevent `DiskPressure` / "No space left on device" during Talos cluster creation and Flux reconciliation.

Inputs:

| Input | Value |
| --- | --- |
| `remove_android` | true |
| `remove_dotnet` | true |
| `remove_haskell` | true |
| `remove_tool_cache` | true |
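A sketch of how the step might sit in the workflow; the action version and inputs come from the description above, while the surrounding job layout is assumed:

```yaml
jobs:
  system-test:
    runs-on: ubuntu-latest
    steps:
      - name: Free disk space
        uses: endersonmenezes/free-disk-space@v3.2.2
        with:
          remove_android: true
          remove_dotnet: true
          remove_haskell: true
          remove_tool_cache: true
      # ksail-cluster and the rest of the system test run next
```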