Add multi-node runtime barrier #239

Draft
nateGeorge wants to merge 1 commit into TangleML:master from nateGeorge:nate/multi-node-barrier

@nateGeorge

Adds a reusable multi-node barrier helper for Kubernetes-backed Tangle tasks.

What changed

  • Adds cloud_pipelines_backend.runtime.multi_node.barrier(...) for Python components that need all nodes in a multi-node task to rendezvous before one node continues or exits.
  • Injects launcher-controlled multi-node runtime env vars into Indexed Kubernetes Jobs, including node count, node index, node addresses, barrier port, and a per-task barrier token.
  • Exposes the barrier port through the task headless Service and enables not-ready pod DNS publication so workers can retry while node 0 starts.
  • Redacts sensitive env values from serialized Kubernetes debug data and job-creation error messages.
  • Cleans up the task headless Service on job creation failure, explicit termination, and normal terminal job processing.
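As a rough illustration of the redaction bullet above, the sketch below masks selected env values in a serialized Kubernetes object before it is logged or stored. It is not the code from this PR: the function name `redact_env`, the placeholder string, and the env var name `EXAMPLE_BARRIER_TOKEN` are all hypothetical. The only thing assumed about the data is the usual `{"name": ..., "value": ...}` shape of serialized container env entries.

```python
REDACTED = "[REDACTED]"  # Hypothetical placeholder string.
# Hypothetical env var name; the real name used by the launcher is not shown here.
SENSITIVE_ENV_NAMES = frozenset({"EXAMPLE_BARRIER_TOKEN"})


def redact_env(obj, sensitive_names=SENSITIVE_ENV_NAMES):
    """Return a copy of a serialized Kubernetes object with sensitive env values masked.

    Any dict that looks like an env entry ({"name": ..., "value": ...}) and whose
    name is in `sensitive_names` gets its value replaced. The input is not mutated.
    """
    if isinstance(obj, list):
        return [redact_env(item, sensitive_names) for item in obj]
    if isinstance(obj, dict):
        result = {key: redact_env(value, sensitive_names) for key, value in obj.items()}
        name = result.get("name")
        if isinstance(name, str) and name in sensitive_names and "value" in result:
            result["value"] = REDACTED
        return result
    return obj


if __name__ == "__main__":
    job = {
        "spec": {"template": {"spec": {"containers": [{
            "name": "main",
            "env": [
                {"name": "EXAMPLE_BARRIER_TOKEN", "value": "s3cret"},
                {"name": "EXAMPLE_NODE_COUNT", "value": "4"},
            ],
        }]}}}
    }
    print(redact_env(job))
```

Returning a copy rather than mutating in place keeps the live Job spec usable for the actual API call while only the debug or error representation is masked.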

Notes

This is intentionally a small Python runtime helper plus Kubernetes launcher wiring. It does not try to be an authentication boundary; the token is only meant to prevent accidental cross-talk between tasks sharing the same network namespace.
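To make the rendezvous and the token's role concrete, here is a minimal sketch of one way such a barrier could work over plain TCP. It is an illustration under stated assumptions, not the implementation in this PR: `run_coordinator`, `run_worker`, the line-based `GO` protocol, and the timeouts are all hypothetical, and the real `barrier(...)` signature is not reproduced here. Node 0 accepts connections until every other node has presented the matching token, then releases them all; workers retry the connection while node 0 is still starting.

```python
import socket
import threading
import time


def _read_line(conn: socket.socket) -> str:
    """Read bytes until a newline or EOF and return the stripped text."""
    data = b""
    while not data.endswith(b"\n"):
        chunk = conn.recv(1024)
        if not chunk:
            break
        data += chunk
    return data.decode("utf-8", "replace").strip()


def run_coordinator(server: socket.socket, node_count: int, token: str) -> None:
    """Node 0: hold connections until every other node has checked in."""
    waiting = []
    while len(waiting) < node_count - 1:
        conn, _ = server.accept()
        conn.settimeout(5.0)
        try:
            presented = _read_line(conn)
        except OSError:
            presented = ""
        if presented == token:
            waiting.append(conn)
        else:
            conn.close()  # Wrong or missing token: probably another task, ignore it.
    for conn in waiting:
        conn.sendall(b"GO\n")  # Release every worker once all have arrived.
        conn.close()


def run_worker(address, token: str, timeout_s: float = 30.0, retry_s: float = 0.05) -> bool:
    """Worker: retry until node 0 is reachable, then wait for the release message."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with socket.create_connection(address, timeout=1.0) as conn:
                conn.sendall(token.encode("utf-8") + b"\n")
                conn.settimeout(max(deadline - time.monotonic(), 0.1))
                return _read_line(conn) == "GO"
        except OSError:
            time.sleep(retry_s)  # Node 0 may not be listening yet.
    return False


def demo(node_count: int = 3) -> list:
    """Run one coordinator and node_count - 1 workers on localhost."""
    server = socket.create_server(("127.0.0.1", 0))
    address = server.getsockname()
    token = "per-task-token"
    results = []
    workers = [
        threading.Thread(target=lambda: results.append(run_worker(address, token)))
        for _ in range(node_count - 1)
    ]
    for worker in workers:
        worker.start()
    run_coordinator(server, node_count, token)
    for worker in workers:
        worker.join()
    server.close()
    return results


if __name__ == "__main__":
    print(demo(3))
```

The token comparison here is a plain equality check, which matches the stated goal of avoiding accidental cross-talk; it would not be adequate if the token were meant to resist a deliberate attacker.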

Draft because the API shape and runtime packaging constraints should be checked by Tangle maintainers before treating this as final.

Testing

  • uv run black --check cloud_pipelines_backend/runtime/multi_node.py cloud_pipelines_backend/runtime/__init__.py cloud_pipelines_backend/launchers/interfaces.py cloud_pipelines_backend/launchers/kubernetes_launchers.py cloud_pipelines_backend/orchestrator_sql.py tests/test_multi_node_runtime.py tests/test_kubernetes_multi_node.py tests/test_orchestrator_cleanup.py → passed. Black printed the existing Python 3.12 vs 3.14 safety-check warning.
  • git diff --cached --check → passed.
  • PYTHONPATH=. uv run pytest → 376 passed, 11 warnings. Warnings were existing SQLAlchemy Row.tuple() deprecation warnings; the test run also printed local OpenTelemetry exporter-unavailable messages after pytest completed.

Not yet tested against a real multi-node Kubernetes task; that should be part of follow-up validation for the draft API.

Add a small runtime helper for Kubernetes multi-node tasks so workers can rendezvous with node 0 before the coordinator exits. The launcher now injects per-task barrier metadata, exposes stable pod DNS through the task headless Service, redacts sensitive env values from serialized Kubernetes debug data, and cleans up the Service on terminal completion.

Tests cover runtime barrier behavior, launcher env/service wiring, token redaction, create/terminate cleanup, and orchestrator cleanup on completed jobs.

Co-authored-by: AI (Pi/GPT-5.5) <noreply@pi.dev>