Add multi-node runtime barrier #239
Draft
nateGeorge wants to merge 1 commit into
Add a small runtime helper for Kubernetes multi-node tasks so workers can rendezvous with node 0 before the coordinator exits. The launcher now injects per-task barrier metadata, exposes stable pod DNS through the task headless Service, redacts sensitive env values from serialized Kubernetes debug data, and cleans up the Service on terminal completion. Tests cover runtime barrier behavior, launcher env/service wiring, token redaction, create/terminate cleanup, and orchestrator cleanup on completed jobs.
Co-authored-by: AI (Pi/GPT-5.5) <noreply@pi.dev>
Adds a reusable multi-node barrier helper for Kubernetes-backed Tangle tasks.
What changed
cloud_pipelines_backend.runtime.multi_node.barrier(...) for Python components that need all nodes in a multi-node task to rendezvous before one node continues or exits.
Notes
This is intentionally a small Python runtime helper plus Kubernetes launcher wiring. It does not try to be an authentication boundary; the token is only meant to prevent accidental cross-talk between tasks sharing the same network namespace.
Draft because the API shape and runtime packaging constraints should be checked by Tangle maintainers before treating this as final.
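For reviewers who want a concrete picture of the rendezvous described above, here is a minimal sketch of a token-guarded barrier over plain TCP. It is not the code in this PR: `coordinator_wait`, `worker_wait`, `demo`, and the line-based `"<token> <rank>"` / `GO` protocol are hypothetical names and choices made for illustration, and the real `barrier(...)` signature is left elided above. As in the notes, the token here only filters out accidental cross-talk and is not an authentication mechanism.

```python
import socket
import threading


def _recv_line(conn: socket.socket) -> str:
    """Read until a newline or EOF and return the text without the newline."""
    buf = b""
    while not buf.endswith(b"\n"):
        chunk = conn.recv(1024)
        if not chunk:
            break
        buf += chunk
    return buf.decode().strip()


def coordinator_wait(server: socket.socket, num_nodes: int, token: str,
                     timeout: float = 30.0) -> None:
    """Node 0 (hypothetical helper): wait for every other rank, then release them.

    The timeout applies to each accept/read, not to the barrier as a whole.
    """
    server.settimeout(timeout)
    arrived = {}  # rank -> open connection
    while len(arrived) < num_nodes - 1:
        conn, _ = server.accept()
        conn.settimeout(timeout)
        try:
            got_token, _, rank = _recv_line(conn).rpartition(" ")
        except OSError:
            conn.close()
            continue
        valid = got_token == token and rank.isdigit() and 1 <= int(rank) < num_nodes
        if not valid:
            conn.close()  # wrong task or malformed check-in: ignore it
            continue
        previous = arrived.pop(int(rank), None)
        if previous is not None:
            previous.close()  # a rank that reconnects replaces its old connection
        arrived[int(rank)] = conn
    for conn in arrived.values():
        conn.sendall(b"GO\n")
        conn.close()


def worker_wait(host: str, port: int, rank: int, token: str,
                timeout: float = 30.0) -> None:
    """Non-zero rank (hypothetical helper): check in and block until released."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(f"{token} {rank}\n".encode())
        if _recv_line(conn) != "GO":
            raise RuntimeError("barrier check-in was rejected or the coordinator exited")


def demo(num_nodes: int = 3, token: str = "task-123") -> list:
    """Simulate the nodes of one task with threads on localhost."""
    server = socket.create_server(("127.0.0.1", 0))  # port 0: let the OS choose
    port = server.getsockname()[1]
    released = []

    def run_worker(rank: int) -> None:
        worker_wait("127.0.0.1", port, rank, token)
        released.append(rank)

    threads = [threading.Thread(target=run_worker, args=(r,))
               for r in range(1, num_nodes)]
    for t in threads:
        t.start()
    coordinator_wait(server, num_nodes, token)
    for t in threads:
        t.join()
    server.close()
    return sorted(released)


if __name__ == "__main__":
    print(demo())
```

In a real multi-node task the workers would presumably connect to node 0 through the stable pod DNS name exposed by the headless Service, with the host, port, rank, and token taken from the metadata the launcher injects; the exact variable names are not shown in this description, so the sketch takes them as plain arguments.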
Testing
uv run black --check cloud_pipelines_backend/runtime/multi_node.py cloud_pipelines_backend/runtime/__init__.py cloud_pipelines_backend/launchers/interfaces.py cloud_pipelines_backend/launchers/kubernetes_launchers.py cloud_pipelines_backend/orchestrator_sql.py tests/test_multi_node_runtime.py tests/test_kubernetes_multi_node.py tests/test_orchestrator_cleanup.py → passed. Black printed the existing Python 3.12 vs 3.14 safety-check warning.
git diff --cached --check → passed.
PYTHONPATH=. uv run pytest → 376 passed, 11 warnings. Warnings were existing SQLAlchemy Row.tuple() deprecation warnings; the test run also printed local OpenTelemetry exporter-unavailable messages after pytest completed.
Not yet tested against a real multi-node Kubernetes task; that should be part of follow-up validation for the draft API.