clip_polygon crop=True over-fragments the mask chunking on dask backends #3191

@brendancol

Description

When clip_polygon runs with crop=True on a dask-backed raster, the dask task graph ends up much bigger than the output size needs. The culprit is the chunk size picked for the internal rasterize mask.

_crop_to_bbox slices the dask raster down to the geometry bounding box. Slicing a dask array leaves irregular chunk sizes at the cut edges. For a 2560x2560 raster chunked at (256, 256), clipping to a box that starts mid-chunk gives x-chunks like (12, 256, 256, 256, 256, 256, 208).
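To see where those x-chunks come from, here is a small pure-Python model of how a basic slice trims a tuple of chunk sizes. The helper name sliced_chunks is hypothetical (it is not part of dask or of this library), and the sketch assumes the box maps to columns 500 to 2000 of the raster:

```python
from itertools import accumulate


def sliced_chunks(chunks, start, stop):
    """Model how a basic slice [start:stop] trims a tuple of chunk sizes.

    Hypothetical helper for illustration only. Existing chunk boundaries
    stay where they were, so only the first and last surviving chunks
    are shortened.
    """
    ends = list(accumulate(chunks))                 # exclusive end of each chunk
    starts = [e - c for e, c in zip(ends, chunks)]  # inclusive start of each chunk
    out = []
    for lo, hi in zip(starts, ends):
        overlap = min(hi, stop) - max(lo, start)
        if overlap > 0:                             # chunk intersects the slice
            out.append(overlap)
    return tuple(out)


x_chunks = (256,) * 10                              # 2560 px chunked at 256
print(sliced_chunks(x_chunks, 500, 2000))
# (12, 256, 256, 256, 256, 256, 208)
```

The slice starts 244 px into the second chunk, which leaves a 12 px sliver as the new leading chunk.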

clip_polygon then takes the rasterize mask chunk size from the first chunk of each axis:

rc, cc = raster.data.chunks[-2], raster.data.chunks[-1]
kw.setdefault('chunks', (rc[0], cc[0]))

rc[0] and cc[0] are the leading edge chunks, and after slicing each is often a tiny partial chunk (12 px for the x-axis in the example above). rasterize builds a uniform mask at that size, so a 1500-px-wide output gets 125 chunks of 12 px each. xarray.where then has to align the irregular raster chunks against the tiny uniform mask chunks, and the task count blows up.
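The chunk-count arithmetic can be checked directly. This is a minimal sketch; uniform_chunk_count is a made-up name, and it only counts chunks along one axis of a uniformly chunked mask, not tasks in the real graph:

```python
import math


def uniform_chunk_count(length, chunk):
    """Number of chunks needed to cover `length` pixels at a uniform chunk size."""
    return math.ceil(length / chunk)


width = 1500                                # cropped output width in px
print(uniform_chunk_count(width, 12))       # leading partial chunk -> 125
print(uniform_chunk_count(width, 256))      # interior chunk size   -> 6
```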

Evidence

Graph construction only, no .compute(). 2560x2560 raster, chunks=(256, 256), clip to box(500, 500, 2000, 2000), crop=True:

  • output shape: 1500x1500
  • mask chunks: (8, 125)
  • task count: 13169

Using the largest chunk per axis (max(rc), max(cc)) instead of the first:

  • mask chunks: (6, 6)
  • task count: 1045

That is about a 12.6x smaller graph (13169 vs. 1045 tasks), with the same output values.

Impact

  • Backends affected: dask+numpy and dask+cupy (both go through the same chunk selection).
  • Bottleneck: graph-bound. The cost is scheduler and graph-build overhead, not peak memory. Peak memory still scales with chunk size, so the tiny chunks are not an OOM risk.
  • crop=False is unaffected (no slicing, chunks stay uniform). numpy and cupy non-dask paths are unaffected.

Fix

Pick a representative interior chunk size instead of the leading partial chunk:

kw.setdefault('chunks', (max(rc), max(cc)))

That keeps the mask grid coarse and roughly aligned with the raster's interior chunk size.
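A side-by-side sketch of the two selection rules on post-slice chunk tuples. The function names are hypothetical, and the row chunks below are made up for illustration (only the x-chunks are quoted from this issue):

```python
def first_chunk(rc, cc):
    """Current behaviour: take the leading chunk of each axis."""
    return (rc[0], cc[0])


def largest_chunk(rc, cc):
    """Proposed behaviour: take the largest chunk of each axis."""
    return (max(rc), max(cc))


# Illustrative row chunks (assumed) and the x-chunks from the example above.
rc = (208, 256, 256, 256, 256, 256, 12)
cc = (12, 256, 256, 256, 256, 256, 208)

print(first_chunk(rc, cc))      # (208, 12)
print(largest_chunk(rc, cc))    # (256, 256)
```

Taking the maximum is insensitive to where the partial chunks fall, so it recovers the interior size whether the sliver is at the leading or trailing edge.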

Metadata

Labels: bug (Something isn't working), performance (PR touches performance-sensitive code)
