Skip to content

polygonize dask backend serializes chunks via per-chunk dask.compute() #2608

@brendancol

Description

@brendancol

What's happening

The dask backend for polygonize (_polygonize_dask) walks the chunks in a nested Python loop and calls dask.compute() once per chunk. Each call blocks until that one chunk is done, so the scheduler only ever has a single chunk's worth of work in flight. Chunks that could run in parallel don't.

Numbers

On a 1024x1024 int32 raster split into 64 chunks of 128x128, collecting the per-chunk delayed tasks and computing them in one dask.compute() call ran in 5.14s versus 8.28s for the current loop, about 1.61x. With more chunks and more workers the gap gets wider.

The catch

The per-chunk loop isn't an accident. The comment right above it spells out the reason: computing one chunk at a time means only boundary polygons pile up in driver memory, so peak memory tracks boundary-polygon count instead of total polygons times chunk count. If you "fix" this by dumping every chunk into a single dask.compute(), you pull all the interior polygons into memory at once. On a huge raster that's a regression, not a win.

Suggested approach

Batch one row of chunks per dask.compute() call. The scheduler can run the chunks in a row concurrently, and peak memory stays bounded by a single row's results rather than the whole raster. Same memory behavior the original loop was after, most of the parallelism back.

Backends

dask+numpy and dask+cupy both go through _polygonize_dask.

Bottleneck and OOM notes

Compute-bound: CPU boundary tracing in _scan dominates on every backend. OOM verdict stays RISKY because the driver still accumulates O(total polygons) interior polygons no matter what this change does. Row-batching holds the existing memory bound; it doesn't make it worse.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions