Fix swallowed CancelledError in start_child_workflow and Nexus operations (Issue #1445)#1472
Fix swallowed CancelledError in start_child_workflow and Nexus operations (Issue #1445)#1472yegorske50 wants to merge 2 commits intotemporalio:mainfrom
Conversation
|
I don't believe this fix is correct, as |
4992b5d to
8c6fab6
Compare
8c6fab6 to
218ef5e
Compare
Thank you for pointing me in the right direction. I've updated the fix to use I've also replaced the incomplete test ( Please review the changes and point me in the right direction if any adjustments are needed @tconley1428 |
What was changed
Two files changed:
temporalio/worker/_workflow_instance.py— Addedif self._cancel_requested: raiseguard in theasyncio.CancelledErrorexcept blocks within the outer start loops of
_outbound_start_child_workflowand_outbound_start_nexus_operation.tests/worker/test_workflow.py— Replaced the previously skippedtest_workflow_cancel_child_unstartedplaceholder (raise NotImplementedError)with a working regression test.
The previous version of this PR used an unconditional
raise, which themaintainer correctly pointed out was wrong. This version addresses that
feedback directly.
About the test
The existing placeholder was:
The skip reason names the exact problem the team could not solve:
preventing the child from starting so that a cancel could arrive while
_start_futwas still unresolved. Without that, the cancel arrivesafter
start_child_workflowhas already returned, and the bug is nevertriggered.
The solution is to place the child on a task queue with no worker. With
No worker polling that queue, the child's first workflow task is never
processed.
_start_futstays unresolved, the start loop stays blocked,and when the cancel arrives, it lands exactly where the bug lived. This
is the approach used in the local reproduction script that confirmed the
bug, and it is what makes the test work where the placeholder could not.
The
execution_timeout=timedelta(seconds=30)acts as a safety net. Ifthe bug is still present and the workflow hangs, the timeout fires and
produces a
TimeoutErrorinstead ofCancelledError. Theassertion
isinstance(err.value.cause, CancelledError)then fails,catching the bug. Without this timeout, the test would hang indefinitely.
Test results
All cancellation-related tests pass with the fix applied. Notably,
test_workflow_cancel_child_unstarted, which was previously skippedwith
raise NotImplementedError, now runs and passes:33 passed, 1 skipped
The 1 remaining skip (
test_workflow_cancel_signal_and_timer_fired_in_same_task)is a pre-existing skip requiring time-skipping server infrastructure
and is completely unrelated to this change.
Checklist
raisereplaced with
if self._cancel_requested: raiseraise NotImplementedErrorplaceholder with a workingregression test that actually catches the bug