Skip to content

polygonize: batch dask compute per chunk row to recover parallelism#2632

Merged
brendancol merged 3 commits into
mainfrom
deep-sweep-performance-polygonize-2026-05-29
May 29, 2026
Merged

polygonize: batch dask compute per chunk row to recover parallelism#2632
brendancol merged 3 commits into
mainfrom
deep-sweep-performance-polygonize-2026-05-29

Conversation

@brendancol

Copy link
Copy Markdown
Contributor

Closes #2608

What this does

_polygonize_dask called dask.compute() once per chunk inside a nested Python loop. That's one chunk per scheduler round-trip, so on a multi-worker scheduler the workers idle between chunks. This batches one dask.compute() per row of chunks instead, so a row's chunks can run at the same time. Peak driver memory still tracks one row of chunk results rather than the whole raster, so the memory tradeoff the original loop was written for stays intact.

Measurements

  • 4-worker LocalCluster (processes), 1024x1024 int32 / 64 chunks: the per-chunk loop is 2.79x slower than row-batched.
  • Default threaded scheduler with numba warm: within noise (about 1.03x). The @ngjit kernels release the GIL so threads already overlap there; you need a multi-worker scheduler to see the win.

Output is byte-identical to the old path, checked for both 4- and 8-connectivity.

Backends

dask+numpy and dask+cupy, which both route through _polygonize_dask. The numpy and cupy paths are untouched.

Test plan

  • New file test_polygonize_dask_row_batch_2608.py: numpy/dask area parity across four chunkings, masked-raster parity, 8-connectivity self-consistency across repeated runs, and a structural check that compute fires exactly once per chunk row.
  • Full polygonize suite: 299 passed, 16 skipped (GPU and optional-dep gated).

…2608)

_polygonize_dask called dask.compute() once per chunk inside a nested
Python loop, forcing one chunk per scheduler round-trip. Workers sat idle
between chunks under a multi-worker scheduler (measured 2.79x slower than
batched on a 4-worker LocalCluster, 1024x1024 / 64 chunks).

Issue one dask.compute() per row of chunks so a row's chunks run
concurrently. Peak driver memory stays bounded by one row of chunk
results rather than the full raster, preserving the memory tradeoff the
per-chunk loop was written for. Output is byte-identical (verified for
4- and 8-connectivity).

Adds test_polygonize_dask_row_batch_2608.py: numpy/dask parity across
chunkings, masked-raster parity, 8-connectivity self-consistency, and a
structural check that compute is called once per chunk row.
@github-actions github-actions Bot added the performance PR touches performance-sensitive code label May 29, 2026

@brendancol brendancol left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review: polygonize dask compute row-batching

Blockers (must fix before merge)

None.

Suggestions (should fix, not blocking)

None.

Nits (optional improvements)

  • test_one_compute_per_chunk_row patches dask.compute on the real dask module (poly_mod.dask is the actual module, not a local alias). monkeypatch restores it afterward so nothing leaks, but the route through sys.modules plus a global-attribute patch is a bit subtle for the next reader. Works as written; flagging only for legibility.

What looks good

  • Small, contained change: one nested loop reworked, nothing else in the function touched.
  • dask.compute(*row_tasks) returns results in task order, and the per-row unpacking keeps the exact accumulation order (cols within a row, rows top to bottom), so the output is unchanged. Verified empirically for connectivity 4 and 8.
  • Keeps the memory behavior the original loop was written for: peak driver memory is bounded by one row of chunk results, not the whole raster.
  • Tests cover the right ground: numpy parity across four chunkings including ragged ones, masked-raster parity, 8-connectivity self-consistency, and a structural check that compute fires once per row. Splitting 4-conn (parity) from 8-conn (self-consistency) is the right call, since 8-conn dask area is a known approximation against numpy.
  • Single-column chunk grids work (one task returns a one-tuple the for-loop unpacks fine).

Checklist

  • Algorithm matches reference (output byte-identical to prior path)
  • All implemented backends produce consistent results (dask+numpy and dask+cupy share this path; numpy/cupy untouched)
  • NaN handling is correct (unchanged from prior path)
  • Edge cases are covered by tests (ragged chunks, mask, single-column grid)
  • Dask chunk boundaries handled correctly (merge logic unchanged)
  • No premature materialization or unnecessary copies
  • Benchmark exists (benchmarks/benchmarks/polygonize.py)
  • README feature matrix updated (not applicable, no API change)
  • Docstrings present and accurate (loop comment rewritten to match)

@brendancol brendancol left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up review (after nit fix)

The one nit from the prior pass is addressed: the comment in test_one_compute_per_chunk_row now spells out why patching dask.compute on poly_mod.dask is the right (and only) seam, and notes monkeypatch handles teardown.

No remaining blockers, suggestions, or nits. The change stays small and output-preserving; tests still pass (8/8 in the new file). Ready from a review standpoint pending CI.

@brendancol brendancol merged commit 2888320 into main May 29, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance PR touches performance-sensitive code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

polygonize dask backend serializes chunks via per-chunk dask.compute()

1 participant