polygonize: batch dask compute per chunk row to recover parallelism#2632
Merged
Conversation
…2608) _polygonize_dask called dask.compute() once per chunk inside a nested Python loop, forcing one chunk per scheduler round-trip. Workers sat idle between chunks under a multi-worker scheduler (measured 2.79x slower than batched on a 4-worker LocalCluster, 1024x1024 / 64 chunks). Issue one dask.compute() per row of chunks so a row's chunks run concurrently. Peak driver memory stays bounded by one row of chunk results rather than the full raster, preserving the memory tradeoff the per-chunk loop was written for. Output is byte-identical (verified for 4- and 8-connectivity). Adds test_polygonize_dask_row_batch_2608.py: numpy/dask parity across chunkings, masked-raster parity, 8-connectivity self-consistency, and a structural check that compute is called once per chunk row.
brendancol
commented
May 29, 2026
brendancol
left a comment
Contributor
Author
There was a problem hiding this comment.
PR Review: polygonize dask compute row-batching
Blockers (must fix before merge)
None.
Suggestions (should fix, not blocking)
None.
Nits (optional improvements)
test_one_compute_per_chunk_rowpatchesdask.computeon the real dask module (poly_mod.daskis the actual module, not a local alias). monkeypatch restores it afterward so nothing leaks, but the route throughsys.modulesplus a global-attribute patch is a bit subtle for the next reader. Works as written; flagging only for legibility.
What looks good
- Small, contained change: one nested loop reworked, nothing else in the function touched.
dask.compute(*row_tasks)returns results in task order, and the per-row unpacking keeps the exact accumulation order (cols within a row, rows top to bottom), so the output is unchanged. Verified empirically for connectivity 4 and 8.- Keeps the memory behavior the original loop was written for: peak driver memory is bounded by one row of chunk results, not the whole raster.
- Tests cover the right ground: numpy parity across four chunkings including ragged ones, masked-raster parity, 8-connectivity self-consistency, and a structural check that compute fires once per row. Splitting 4-conn (parity) from 8-conn (self-consistency) is the right call, since 8-conn dask area is a known approximation against numpy.
- Single-column chunk grids work (one task returns a one-tuple the for-loop unpacks fine).
Checklist
- Algorithm matches reference (output byte-identical to prior path)
- All implemented backends produce consistent results (dask+numpy and dask+cupy share this path; numpy/cupy untouched)
- NaN handling is correct (unchanged from prior path)
- Edge cases are covered by tests (ragged chunks, mask, single-column grid)
- Dask chunk boundaries handled correctly (merge logic unchanged)
- No premature materialization or unnecessary copies
- Benchmark exists (benchmarks/benchmarks/polygonize.py)
- README feature matrix updated (not applicable, no API change)
- Docstrings present and accurate (loop comment rewritten to match)
brendancol
commented
May 29, 2026
brendancol
left a comment
Contributor
Author
There was a problem hiding this comment.
Follow-up review (after nit fix)
The one nit from the prior pass is addressed: the comment in test_one_compute_per_chunk_row now spells out why patching dask.compute on poly_mod.dask is the right (and only) seam, and notes monkeypatch handles teardown.
No remaining blockers, suggestions, or nits. The change stays small and output-preserving; tests still pass (8/8 in the new file). Ready from a review standpoint pending CI.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #2608
What this does
_polygonize_daskcalleddask.compute()once per chunk inside a nested Python loop. That's one chunk per scheduler round-trip, so on a multi-worker scheduler the workers idle between chunks. This batches onedask.compute()per row of chunks instead, so a row's chunks can run at the same time. Peak driver memory still tracks one row of chunk results rather than the whole raster, so the memory tradeoff the original loop was written for stays intact.Measurements
@ngjitkernels release the GIL so threads already overlap there; you need a multi-worker scheduler to see the win.Output is byte-identical to the old path, checked for both 4- and 8-connectivity.
Backends
dask+numpy and dask+cupy, which both route through
_polygonize_dask. The numpy and cupy paths are untouched.Test plan
test_polygonize_dask_row_batch_2608.py: numpy/dask area parity across four chunkings, masked-raster parity, 8-connectivity self-consistency across repeated runs, and a structural check that compute fires exactly once per chunk row.