
Fix swallowed CancelledError in start_child_workflow and Nexus operations (Issue #1445) #1472

Open
yegorske50 wants to merge 2 commits into temporalio:main from yegorske50:fix-swallowed-cancel

Conversation

@yegorske50 yegorske50 commented Apr 22, 2026

What was changed

Two files changed:

temporalio/worker/_workflow_instance.py: added a guard,
if self._cancel_requested: raise, to the asyncio.CancelledError
except blocks in the outer start loops of
_outbound_start_child_workflow and _outbound_start_nexus_operation.

tests/worker/test_workflow.py: replaced the previously skipped
test_workflow_cancel_child_unstarted placeholder (raise NotImplementedError)
with a working regression test.

The previous version of this PR used an unconditional raise, which the
maintainer correctly pointed out was wrong. This version addresses that
feedback directly.
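
The difference between an unconditional raise and the guarded raise can be sketched with plain asyncio. This is a minimal toy model, not the SDK's code: ToyStartLoop, start_fut, swallowed, and demo are hypothetical names, and only the _cancel_requested flag mirrors the attribute named above.

```python
import asyncio


class ToyStartLoop:
    """Hypothetical stand-in for a start loop; not the SDK's implementation."""

    def __init__(self) -> None:
        # Mirrors the flag named in this PR: set only on a real cancel request.
        self._cancel_requested = False
        # Stand-in for the unresolved start future (needs a running event loop).
        self.start_fut = asyncio.get_running_loop().create_future()
        self.swallowed = 0

    async def run(self) -> str:
        while True:
            try:
                # shield() keeps the start future alive if this task is cancelled.
                return await asyncio.shield(self.start_fut)
            except asyncio.CancelledError:
                if self._cancel_requested:
                    raise  # requested cancellation: propagate
                self.swallowed += 1  # any other source: keep waiting


async def demo():
    toy = ToyStartLoop()
    task = asyncio.create_task(toy.run())
    await asyncio.sleep(0.01)  # let the task block on start_fut

    task.cancel()  # cancellation arrives while the flag is not set
    await asyncio.sleep(0.01)
    still_running = not task.done()

    toy._cancel_requested = True
    task.cancel()  # cancellation arrives after the flag is set
    try:
        await task
        propagated = False
    except asyncio.CancelledError:
        propagated = True
    return toy.swallowed, still_running, propagated


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this sketch the first cancel is absorbed and the loop keeps waiting, while the second one escapes the loop because the flag was set first. Removing the guard and never re-raising would reproduce the swallowed-cancel behavior in miniature.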


About the test

The existing placeholder was:

@pytest.mark.skip(reason="unable to easily prevent child start currently")
async def test_workflow_cancel_child_unstarted(_client: Client):
    raise NotImplementedError

The skip reason names the exact problem the team could not solve:
preventing the child from starting so that a cancel could arrive while
_start_fut was still unresolved. Without that, the cancel arrives
after start_child_workflow has already returned, and the bug is never
triggered.

The solution is to place the child on a task queue with no worker. With
no worker polling that queue, the child's first workflow task is never
processed. _start_fut stays unresolved, the start loop stays blocked,
and when the cancel arrives, it lands exactly where the bug lived. This
is the approach used in the local reproduction script that confirmed the
bug, and it is what makes the test work where the placeholder could not.

The execution_timeout=timedelta(seconds=30) acts as a safety net. If
the bug is still present and the workflow hangs, the timeout fires and
produces a TimeoutError instead of CancelledError. The
assertion isinstance(err.value.cause, CancelledError) then fails,
catching the bug. Without this timeout, the test would hang indefinitely.
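
The safety-net reasoning can be illustrated with a plain-asyncio analogue. This is only a sketch under stated assumptions: asyncio.wait_for stands in for the workflow's execution_timeout, and hangs, surfaces_cancel, and outcome are hypothetical helpers, not part of the test in this PR.

```python
import asyncio
from typing import Awaitable, Optional


async def hangs() -> None:
    # Stand-in for the buggy case: the cancel is swallowed and nothing completes.
    await asyncio.Event().wait()


async def surfaces_cancel() -> None:
    # Stand-in for the fixed case: the cancellation reaches the caller.
    raise asyncio.CancelledError()


async def outcome(aw: Awaitable[None], seconds: float) -> Optional[BaseException]:
    """Return the exception that ended `aw`, or None if it completed normally."""
    try:
        await asyncio.wait_for(aw, timeout=seconds)
    except (asyncio.TimeoutError, asyncio.CancelledError) as err:
        return err
    return None


if __name__ == "__main__":
    buggy = asyncio.run(outcome(hangs(), 0.05))
    fixed = asyncio.run(outcome(surfaces_cancel(), 0.05))
    # The same isinstance check separates the two outcomes.
    print(isinstance(buggy, asyncio.CancelledError))
    print(isinstance(fixed, asyncio.CancelledError))
```

The deadline turns a hang into an ordinary, typed failure, so a single isinstance assertion can tell the buggy outcome from the fixed one instead of the test blocking forever.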


Test results

All cancellation-related tests pass with the fix applied. Notably,
test_workflow_cancel_child_unstarted, which was previously skipped
with raise NotImplementedError, now runs and passes:
33 passed, 1 skipped

The 1 remaining skip (test_workflow_cancel_signal_and_timer_fired_in_same_task)
is a pre-existing skip requiring time-skipping server infrastructure
and is completely unrelated to this change.


@yegorske50 yegorske50 requested a review from a team as a code owner April 22, 2026 21:03

CLAassistant commented Apr 22, 2026

CLA assistant check
All committers have signed the CLA.

@tconley1428
Contributor

I don't believe this fix is correct, as asyncio.CancelledError is not guaranteed to come from cancellation of the worker. You should check that cancellation of the worker has actually been requested.

@yegorske50 yegorske50 force-pushed the fix-swallowed-cancel branch from 4992b5d to 8c6fab6 Compare April 25, 2026 09:52
@yegorske50 yegorske50 force-pushed the fix-swallowed-cancel branch from 8c6fab6 to 218ef5e Compare April 25, 2026 09:59
@yegorske50
Author

yegorske50 commented Apr 25, 2026

> I don't believe this fix is correct, as asyncio.CancelledError is not guaranteed to come from cancellation of the worker. You should check that cancellation of the worker has actually been requested.

Thank you for pointing me in the right direction. I've updated the fix to raise only when cancellation was requested (if self._cancel_requested: raise) instead of raising unconditionally. The guard ensures we only propagate when Temporal actually requested cancellation via _apply_cancel_workflow. Other CancelledError sources (like asyncio.wait_for timeouts) do not set _cancel_requested, so they correctly continue looping.

I've also replaced the incomplete test (test_workflow_cancel_child_unstarted was previously just raise NotImplementedError) with a working regression test that exercises the exact scenario. The test starts a child on a task queue with no worker, so _start_fut never resolves, and the cancellation arrives while the start loop is blocked.

Please review the changes and point me in the right direction if any adjustments are needed @tconley1428

Development

Successfully merging this pull request may close these issues.

[Bug] CancelledError is swallowed during workflow.start_child_workflow

3 participants