What's happening
The dask backend for polygonize (_polygonize_dask) walks the chunks in a nested Python loop and calls dask.compute() once per chunk. Each call blocks until that one chunk is done, so the scheduler only ever has a single chunk's worth of work in flight. Chunks that could run in parallel don't.
Numbers
On a 1024x1024 int32 raster split into 64 chunks of 128x128, collecting the per-chunk delayed tasks and computing them in one dask.compute() call ran in 5.14s versus 8.28s for the current loop, about 1.61x faster. The gap should widen with more chunks and more workers, though that wasn't measured here.
The catch
The per-chunk loop isn't an accident. The comment right above it spells out the reason: computing one chunk at a time means only boundary polygons pile up in driver memory, so peak memory tracks boundary-polygon count instead of total polygons times chunk count. If you "fix" this by dumping every chunk into a single dask.compute(), you pull all the interior polygons into memory at once. On a huge raster that's a regression, not a win.
Suggested approach
Batch one row of chunks per dask.compute() call. The scheduler can run the chunks in a row concurrently, and peak memory stays bounded by a single row's results rather than the whole raster. That keeps the bounded-memory behavior the original loop was after and gets most of the parallelism back.
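A rough sketch of the row-batching idea, not the actual _polygonize_dask code. polygonize_chunk and iter_row_results are hypothetical names made up for illustration, and the per-chunk work is a placeholder that counts distinct values instead of tracing polygons. The only part that matters is the shape of the loop: one dask.compute() per row of delayed tasks.

```python
import dask
import numpy as np
from dask import delayed


def polygonize_chunk(block, row, col):
    """Hypothetical stand-in for the real per-chunk polygonize work."""
    # Placeholder result: count distinct values instead of tracing polygons.
    return {"chunk": (row, col), "n_values": int(np.unique(block).size)}


def iter_row_results(tasks_by_row):
    """Yield the results of one row of chunk tasks at a time."""
    for row_tasks in tasks_by_row:
        # Every task in the row goes to the scheduler in a single call,
        # so chunks within the row can run concurrently.
        yield dask.compute(*row_tasks)


# Small example raster: 16x16 split into a 4x4 grid of 4x4 chunks.
raster = np.arange(16 * 16, dtype=np.int32).reshape(16, 16) % 7
chunk = 4
tasks_by_row = [
    [
        delayed(polygonize_chunk)(
            raster[r:r + chunk, c:c + chunk], r // chunk, c // chunk
        )
        for c in range(0, raster.shape[1], chunk)
    ]
    for r in range(0, raster.shape[0], chunk)
]

# Fold each row into a running total before the next row is computed,
# so only one row of results is held at a time.
total_values = 0
for row_results in iter_row_results(tasks_by_row):
    total_values += sum(res["n_values"] for res in row_results)
```

Because iter_row_results is a generator, the caller decides what to keep from each row (for example, only the boundary polygons) before the next row is computed.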
Backends
dask+numpy and dask+cupy both go through _polygonize_dask.
Bottleneck and OOM notes
Compute-bound: CPU boundary tracing in _scan dominates on every backend. OOM verdict stays RISKY because the driver still accumulates O(total polygons) interior polygons no matter what this change does. Row-batching holds the existing memory bound; it doesn't make it worse.