Fix flow_path_dinf hang on cyclic D-inf flow directions (#2796) #3042
Merged
`flow_path_dinf` traced downstream paths with a `while True` loop that only broke on NaN, a pit, the grid edge, or out-of-bounds. A cyclic D-inf direction grid (a cell that eventually points back to a cell already on the path) never hit any of those, so the call hung forever. In the dask path it was worse: each iteration appended to growing buffers, leaking memory until the process died.
Cap each path at `H*W` steps in both the CPU/JIT loop and the dask loop. A non-cyclic path visits each cell at most once, so `H*W` is enough headroom to break any cycle.
Add regression tests that assert termination (via a worker-thread guard, not a wall-clock assertion) for the two-cell `[[0, pi]]` reproducer and a four-cell loop, on numpy and dask.
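The capped trace described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function name `trace_path_capped`, the neighbor table, the "negative angle marks a pit" encoding, and the convention that angles are measured counterclockwise from east are all assumptions made for the example.

```python
import math
import numpy as np

# Hypothetical neighbor offsets (d_row, d_col) for angles 0, pi/4, ..., 7*pi/4,
# assuming angles are measured counterclockwise from east and rows grow downward.
_OFFSETS = [(0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1), (1, 0), (1, 1)]


def trace_path_capped(angles, start):
    """Follow the dominant neighbor from `start`, taking at most H*W steps."""
    h, w = angles.shape
    max_steps = h * w  # a non-cyclic path never needs more iterations than this
    r, c = start
    path = [(r, c)]
    for _ in range(max_steps):  # bounded loop instead of `while True`
        a = float(angles[r, c])
        if math.isnan(a) or a < 0:  # NaN, or a pit (assumed encoding: negative)
            break
        k = int(round(a / (math.pi / 4))) % 8  # nearest of the 8 neighbors
        nr, nc = r + _OFFSETS[k][0], c + _OFFSETS[k][1]
        if not (0 <= nr < h and 0 <= nc < w):  # path leaves the grid
            break
        r, c = nr, nc
        path.append((r, c))
    return path


# Two-cell cycle from the issue: each cell points at the other.
cyclic = np.array([[0.0, math.pi]])
print(trace_path_capped(cyclic, (0, 0)))
```

With the cap in place the cyclic grid returns after `H*W` iterations, and a path that ends at a pit or the grid edge still stops at the same cell as before.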
brendancol
commented
Jun 8, 2026
brendancol
left a comment
Contributor
Author
PR Review: Fix flow_path_dinf hang on cyclic D-inf flow directions (#2796)
Blockers (must fix before merge)
None.
Suggestions (should fix, not blocking)
None.
Nits (optional improvements)
`test_flow_path_dinf.py` termination guard: the numpy kernel is `@ngjit` and runs without the GIL, so a daemon worker thread stuck in the buggy `while True` loop can't actually be interrupted. On unfixed code the thread keeps spinning after `t.join(timeout)` returns, and the AssertionError fires while a runaway compiled loop is still burning a core in the background. On the fixed code this is a non-issue (the call returns in well under a second), so the guard works as a regression check. A one-line comment noting the orphaned thread on failure would help, but it's not worth restructuring.
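For readers unfamiliar with the pattern, a worker-thread termination guard looks roughly like the sketch below. The helper name `run_with_guard` and its signature are hypothetical and are not taken from the test file; only the standard-library `threading` calls are real.

```python
import threading


def run_with_guard(fn, *args, timeout=5.0):
    """Run fn(*args) in a daemon thread and fail if it outlives `timeout`."""
    outcome = {}

    def _target():
        try:
            outcome["value"] = fn(*args)
        except BaseException as exc:  # surface worker errors in the caller
            outcome["error"] = exc

    worker = threading.Thread(target=_target, daemon=True)
    worker.start()
    worker.join(timeout)
    # join() returning does not stop the worker. If the callee is stuck in a
    # compiled loop, the thread keeps running after this assertion fires.
    assert not worker.is_alive(), f"call did not terminate within {timeout}s"
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]
```

The guard only reports the hang. Python has no way to kill the thread, which is why a failing run on unfixed code leaves the orphaned loop spinning until the test process exits.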
What looks good
- The `H*W` cap is the right bound. A non-cyclic path visits each distinct cell at most once, so `H*W` iterations can't truncate a legitimate path while still breaking every cycle.
- Both tracing loops get the same change. numpy and cupy share `_flow_path_dinf_cpu`; dask+numpy and dask+cupy share the dask loop, so all four backends are covered by the two edits.
- The dask fix also closes the memory leak the issue called out, since the bounded loop stops the buffers from growing without limit.
- Tests assert termination via a thread guard rather than a wall-clock ratio, which avoids the timing flakiness this tracker has hit before.
- The two-cell `[[0, pi]]` test is the exact reproducer from the issue, and the 2x2 loop covers a longer cycle.
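The bound argument in the first bullet can be checked on an abstract successor map, independent of any grid or angle convention. The sketch below is illustrative only: `walk` and the two dictionaries are hypothetical stand-ins in which each key is a cell and each value is the one cell it drains to.

```python
def walk(successor, start, n_cells):
    """Follow successor links for at most n_cells iterations.

    Returns the visited cells in order and whether the cap was reached.
    """
    path = [start]
    cur = start
    for _ in range(n_cells):
        nxt = successor.get(cur)  # None means the path ends here (pit or edge)
        if nxt is None:
            return path, False
        path.append(nxt)
        cur = nxt
    return path, True


# Worst legitimate case on 6 cells: one chain through every cell.
chain = {i: i + 1 for i in range(5)}  # 0 -> 1 -> ... -> 5, cell 5 is terminal
chain_path, chain_capped = walk(chain, 0, 6)

# Four-cell loop: no terminal cell, so only the cap stops the walk.
loop = {0: 1, 1: 2, 2: 3, 3: 0}
loop_path, loop_capped = walk(loop, 0, 4)

print(len(chain_path), chain_capped)
print(len(loop_path), loop_capped)
```

The chain needs `n_cells - 1` moves and ends on its own before the cap, while the loop is cut off exactly at the cap, which is the behavior the bullet describes.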
Checklist
- Algorithm matches reference: the dominant-neighbor trace is unchanged; only the termination bound was added
- All implemented backends produce consistent results: numpy/cupy via CPU kernel, dask paths via dask loop; dask-equals-numpy cycle test covers parity
- NaN handling is correct: unchanged
- Edge cases covered by tests: two-cell and four-cell cycles, numpy and dask
- Dask chunk boundaries handled correctly: cap is per-path, independent of chunking
- No premature materialization or unnecessary copies: none introduced
- Benchmark exists or not needed: not needed, pure termination fix
- README feature matrix updated: not needed, no new function or backend change
- Docstrings present and accurate: module docstring still describes the stop conditions; the cap is an internal safety bound
brendancol
commented
Jun 8, 2026
brendancol
left a comment
Contributor
Author
Follow-up review (after nit fix)
The one nit from the first pass is addressed: the termination-guard docstring now notes that on the buggy @ngjit path the daemon thread can't be interrupted and keeps running after the assertion fires. The change is comment-only, so the code paths reviewed before are unchanged.
No new findings. All 24 tests in test_flow_path_dinf.py pass locally. Nothing left open beyond this.
Closes #2796
- `flow_path_dinf` traced downstream paths with a `while True` loop that only broke on NaN, a pit, the grid edge, or out-of-bounds. A cyclic D-inf direction grid never hits any of those, so the call hung forever. On the dask path it was worse: each iteration appended to growing buffers, so a cycle leaked memory until the process died.
- Cap each path at `H*W` steps in both the CPU/JIT loop (`_flow_path_dinf_cpu`) and the dask loop. A non-cyclic path visits each cell at most once, so `H*W` is enough headroom to break any cycle.
- Add regression tests that assert termination for the two-cell `[[0, pi]]` reproducer and a four-cell loop.
Backend coverage: the fix lands in the two tracing loops. numpy and cupy both run `_flow_path_dinf_cpu`; dask+numpy and dask+cupy both run the dask loop. All four backends are covered.
Test plan:
- `pytest xrspatial/hydro/tests/test_flow_path_dinf.py` (24 passed)