Summary
In _run_cupy and _rasterize_tile_cupy (both in xrspatial/rasterize.py), when all_touched=True and the input contains polygons, cupy.asarray(poly_props) and cupy.asarray(poly_global) are each called twice: once when staging the scanline poly_launch tuple, and again when staging the supercover boundary_launch tuple.
For the dask+cupy path this duplicate transfer fires for every tile, doubling the host-to-device traffic on the per-tile props/global tables. The two launches operate on the same tile, so there is no benefit to allocating separate device copies.
Locations
xrspatial/rasterize.py:2065 -- poly_launch line in _run_cupy
xrspatial/rasterize.py:2083 -- boundary_launch line in _run_cupy (same poly_props/poly_global)
xrspatial/rasterize.py:2541 -- poly_launch line in _rasterize_tile_cupy
xrspatial/rasterize.py:2555 -- boundary_launch line in _rasterize_tile_cupy (same poly_props_2d/poly_global_2d)
Impact
Microbenched on the current host:
- 1000 polygons x 4 cols: duplicate 0.051 ms/iter vs single 0.024 ms/iter (2.1x)
- 10000 polygons x 8 cols: duplicate 0.218 ms/iter vs single 0.092 ms/iter (2.4x, saves 720 KB/tile)
For a 100-tile dask+cupy raster over 10k polygons with all_touched=True, that is ~13 ms and 72 MB of redundant PCIe traffic eliminated per call.
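The per-call estimate follows from the 10000 x 8 microbenchmark row. A short worked calculation, using only the figures quoted above and assuming every tile carries the full 10k-polygon tables:

```python
# Figures quoted above for 10000 polygons x 8 columns.
duplicate_ms = 0.218      # per-tile staging time with the duplicate transfer
single_ms = 0.092         # per-tile staging time with a single transfer
saved_kb_per_tile = 720   # redundant bytes per tile, as reported
tiles = 100

# Savings scale linearly with the number of tiles.
saved_ms = (duplicate_ms - single_ms) * tiles
saved_mb = saved_kb_per_tile * tiles / 1000

print(f"{saved_ms:.1f} ms, {saved_mb:.0f} MB")
```

This gives 12.6 ms, which rounds to the ~13 ms quoted, and 72 MB.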
Proposed fix
Stage cupy.asarray(poly_props) and cupy.asarray(poly_global) once before the scanline / boundary conditional, and have both launch tuples reference the same device arrays.
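A minimal sketch of the idea, not the actual code in xrspatial/rasterize.py. The helper name stage_launches and its to_device parameter are hypothetical; on the GPU path to_device would be cupy.asarray. The real launch tuples carry more fields than shown here, and the sketch assumes neither kernel writes to the shared tables. A counting NumPy stand-in is used so the example runs without a GPU:

```python
import numpy as np


def stage_launches(poly_props, poly_global, all_touched, to_device):
    """Hypothetical helper: build both launch tuples from one device copy."""
    d_props = to_device(poly_props)    # single host-to-device copy
    d_global = to_device(poly_global)  # single host-to-device copy

    poly_launch = (d_props, d_global)
    # The boundary launch reuses the same device arrays instead of re-staging.
    boundary_launch = (d_props, d_global) if all_touched else None
    return poly_launch, boundary_launch


# CPU stand-in for cupy.asarray that records each transfer.
transfers = []


def counting_to_device(arr):
    transfers.append(arr.shape)
    return np.asarray(arr)


poly_props = np.zeros((1000, 4))
poly_global = np.arange(1000)

poly_launch, boundary_launch = stage_launches(
    poly_props, poly_global, all_touched=True, to_device=counting_to_device
)
```

With all_touched=True, the stand-in records two transfers (one per table) rather than four, and both tuples hold the same array objects.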
Backend / severity
- Backends affected: cupy, dask+cupy
- Severity: MEDIUM
- Category: Cat-3 GPU transfer (redundant cupy.asarray)
- OOM verdict: SAFE/graph-bound (unchanged)
Discovered via
/deep-sweep performance pass on rasterize (2026-05-27).