Skip to content

Fix flow_path_dinf hang on cyclic D-inf flow directions (#2796)#3042

Merged
brendancol merged 3 commits into
mainfrom
issue-2796
Jun 8, 2026
Merged

Fix flow_path_dinf hang on cyclic D-inf flow directions (#2796)#3042
brendancol merged 3 commits into
mainfrom
issue-2796

Conversation

@brendancol

Copy link
Copy Markdown
Contributor

Closes #2796

flow_path_dinf traced downstream paths with a while True loop that only broke on NaN, a pit, the grid edge, or out-of-bounds. A cyclic D-inf direction grid never hits any of those, so the call hung forever. On the dask path it was worse: each iteration appended to growing buffers, so a cycle leaked memory until the process died.

  • Cap each path at H*W steps in both the CPU/JIT loop (_flow_path_dinf_cpu) and the dask loop. A non-cyclic path visits each cell at most once, so H*W is enough headroom to break any cycle.
  • Add regression tests that assert termination via a worker-thread guard (not a wall-clock assertion) for the two-cell [[0, pi]] reproducer and a four-cell loop.

Backend coverage: the fix lands in the two tracing loops. numpy and cupy both run _flow_path_dinf_cpu; dask+numpy and dask+cupy both run the dask loop. All four backends are covered.

Test plan:

  • pytest xrspatial/hydro/tests/test_flow_path_dinf.py (24 passed)
  • Reverted the caps and confirmed the new cycle tests hang on the buggy code

flow_path_dinf traced downstream paths with a `while True` loop that only
broke on NaN, a pit, the grid edge, or out-of-bounds. A cyclic D-inf
direction grid (a cell that eventually points back to a cell already on
the path) never hit any of those, so the call hung forever. In the dask
path it was worse: each iteration appended to growing buffers, leaking
memory until the process died.

Cap each path at H*W steps in both the CPU/JIT loop and the dask loop. A
non-cyclic path visits each cell at most once, so H*W is enough headroom
to break any cycle.

Add regression tests that assert termination (via a worker-thread guard,
not a wall-clock assertion) for the two-cell [[0, pi]] reproducer and a
four-cell loop, on numpy and dask.
@github-actions github-actions Bot added the performance PR touches performance-sensitive code label Jun 8, 2026

@brendancol brendancol left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review: Fix flow_path_dinf hang on cyclic D-inf flow directions (#2796)

Blockers (must fix before merge)

None.

Suggestions (should fix, not blocking)

None.

Nits (optional improvements)

  • test_flow_path_dinf.py termination guard: the numpy kernel is @ngjit and runs without the GIL, so a daemon worker thread stuck in the buggy while True loop can't actually be interrupted. On unfixed code the thread keeps spinning after t.join(timeout) returns, and the AssertionError fires while a runaway compiled loop is still burning a core in the background. On the fixed code this is a non-issue (the call returns in well under a second), so the guard works as a regression check. A one-line comment noting the orphaned thread on failure would help, but it's not worth restructuring.

What looks good

  • The H*W cap is the right bound. A non-cyclic path visits each distinct cell at most once, so H*W iterations can't truncate a legitimate path while still breaking every cycle.
  • Both tracing loops get the same change. numpy and cupy share _flow_path_dinf_cpu; dask+numpy and dask+cupy share the dask loop, so all four backends are covered by the two edits.
  • The dask fix also closes the memory leak the issue called out, since the bounded loop stops the buffers from growing without limit.
  • Tests assert termination via a thread guard rather than a wall-clock ratio, which avoids the timing flakiness this tracker has hit before.
  • The two-cell [[0, pi]] test is the exact reproducer from the issue, and the 2x2 loop covers a longer cycle.

Checklist

  • Algorithm matches reference: the dominant-neighbor trace is unchanged; only the termination bound was added
  • All implemented backends produce consistent results: numpy/cupy via CPU kernel, dask paths via dask loop; dask-equals-numpy cycle test covers parity
  • NaN handling is correct: unchanged
  • Edge cases covered by tests: two-cell and four-cell cycles, numpy and dask
  • Dask chunk boundaries handled correctly: cap is per-path, independent of chunking
  • No premature materialization or unnecessary copies: none introduced
  • Benchmark exists or not needed: not needed, pure termination fix
  • README feature matrix updated: not needed, no new function or backend change
  • Docstrings present and accurate: module docstring still describes the stop conditions; the cap is an internal safety bound

@brendancol brendancol left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up review (after nit fix)

The one nit from the first pass is addressed: the termination-guard docstring now notes that on the buggy @ngjit path the daemon thread can't be interrupted and keeps running after the assertion fires. The change is comment-only, so the code paths reviewed before are unchanged.

No new findings. All 24 tests in test_flow_path_dinf.py pass locally. Nothing left open beyond this.

@brendancol brendancol merged commit c9d22a2 into main Jun 8, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance PR touches performance-sensitive code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

flow_path_dinf hangs forever on cyclic D-inf flow directions

1 participant