Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion .claude/sweep-performance-state.csv
Original file line number Diff line number Diff line change
Expand Up @@ -32,7 +32,7 @@ perlin,2026-03-31T18:00:00Z,WILL OOM,memory-bound,0,,
polygon_clip,2026-04-16T12:00:00Z,SAFE,compute-bound,0,1207,Re-audit 2026-04-16: fix verified SAFE. Mask stays lazy via rasterize chunks kwarg; per-chunk peak bounded.
polygonize,2026-04-16T12:00:00Z,RISKY,compute-bound,0,,Re-audit 2026-04-16 after PR 1190 NaN fix + 1176 simplification. No HIGH. MEDIUM: sequential per-chunk dask.compute loop at L1528 serializes work; _polygonize_cupy full GPU->CPU transfer at L665; per-value bin_mask alloc in _calculate_regions_cupy.
proximity,2026-03-31T18:00:00Z,WILL OOM,memory-bound,3,1111,Memory guard added to line-sweep path. KDTree path (EUCLIDEAN/MANHATTAN + scipy) already had guards. GREAT_CIRCLE unbounded path already guarded.
rasterize,2026-05-17,SAFE,graph-bound,0,2020,"Pass 2 (2026-05-17): re-audit identified MEDIUM Cat-2/Cat-3 graph-bound waste in _run_dask_numpy/_run_dask_cupy -- full line_props/point_props embedded in every delayed tile task (polygon path already filtered via poly_props[pmask]). Filed #2020 and fixed: added _slice_props_for_tile helper to remap geom_idx and slice props per tile (mirrors polygon path). Measured 5000 points x 8 cols / 100 tiles graph shrank from ~30 MB to <0.3 MB (37x); localized lines from ~32 MB to ~1.1 MB. 9 new tests in test_rasterize_tile_props_slice_2020.py (helper unit tests + graph-payload bound + numpy/dask output parity for lines/points/sum-merge). All 184 existing rasterize tests pass; dask+cupy parity verified. Dask graph probe: 2560x2560 chunks=256 yields 400 tasks (4 tasks/chunk constant); 25600x25600 chunks=1024 yields 2500 tasks. cupy 512x512 returns cupy.ndarray with no host round-trip. CUDA _scanline_fill_gpu: 39 regs/thread, 24576 B local_mem/thread (matches static cuda.local.array allocations 2048*8 + 2048*4 bytes). SAFE/graph-bound verdict holds; previous 2026-04-15 false-positive on polygon filtering still valid. | Original (2026-04-15): Tile-by-tile graph construction with per-tile geometry filtering is the correct pattern. Pre-filtering ensures each delayed task gets only its relevant subset."
rasterize,2026-05-27,SAFE,graph-bound,0,2506,"Pass 3 (2026-05-27): re-audit identified 1 MEDIUM Cat-3 GPU-transfer finding. _run_cupy (L2065/L2083) and _rasterize_tile_cupy (L2541/L2555) called cupy.asarray(poly_props/poly_global) twice when all_touched=True -- once for the scanline poly_launch tuple and once for the supercover boundary_launch tuple. The two tuples reference the same per-tile props tables. Filed #2506 and fixed by hoisting the upload above the scanline/boundary conditional so both launches share the same device buffer. Microbench: 1000 polys/4 cols 0.051->0.024 ms/iter (2.1x); 10000 polys/8 cols 0.218->0.092 ms/iter (2.4x, saves 720 KB/tile of redundant H2D transfer). 12 new tests in test_rasterize_props_hoist_2506.py (4 AST-structural single-asarray-call assertions + 5 cupy all_touched parity merges + 3 dask+cupy smoke tests). All 470 rasterize tests pass. Dask graph probe: 25600x25600 chunks=1024 yields 2500 tasks for 625 tiles (4 tasks/chunk), unchanged. Noted pre-existing dask+cupy all_touched parity gap on boundary segments crossing tile borders (not addressed by this PR). SAFE/graph-bound verdict holds. | Pass 2 (2026-05-17): re-audit identified MEDIUM Cat-2/Cat-3 graph-bound waste in _run_dask_numpy/_run_dask_cupy -- full line_props/point_props embedded in every delayed tile task (polygon path already filtered via poly_props[pmask]). Filed #2020 and fixed: added _slice_props_for_tile helper to remap geom_idx and slice props per tile (mirrors polygon path). Measured 5000 points x 8 cols / 100 tiles graph shrank from ~30 MB to <0.3 MB (37x); localized lines from ~32 MB to ~1.1 MB. 9 new tests in test_rasterize_tile_props_slice_2020.py (helper unit tests + graph-payload bound + numpy/dask output parity for lines/points/sum-merge). All 184 existing rasterize tests pass; dask+cupy parity verified. Dask graph probe: 2560x2560 chunks=256 yields 400 tasks (4 tasks/chunk constant); 25600x25600 chunks=1024 yields 2500 tasks. cupy 512x512 returns cupy.ndarray with no host round-trip. CUDA _scanline_fill_gpu: 39 regs/thread, 24576 B local_mem/thread (matches static cuda.local.array allocations 2048*8 + 2048*4 bytes). SAFE/graph-bound verdict holds; previous 2026-04-15 false-positive on polygon filtering still valid. | Original (2026-04-15): Tile-by-tile graph construction with per-tile geometry filtering is the correct pattern. Pre-filtering ensures each delayed task gets only its relevant subset."
reproject,2026-05-10,SAFE,compute-bound,1,1571,"Pass 5 (2026-05-10): 1 HIGH filed and fixed in tree -- issue #1571 + fix _merge_block_adapter same-CRS dask path. _place_same_crs in the dask adapter previously called src_data.compute() on the full source per output chunk (68x amplification measured on 256x256x2 source split into 32x32 output chunks, 8.9M pixels materialized vs 131K total source). Fix: added _place_same_crs_lazy at __init__.py:1716 that slices the source window first then computes only that slice. Verified post-fix: 1.00x ratio, 131K pixels materialized for 131K source. New regression test test_merge_dask_same_crs_bounded_materialization codifies the bound. Other audits clean: CUDA resample kernels use 16x16 blocks (cubic=46 regs, bilinear=36, nearest=22 -- well under the 64K-per-block limit, 0 local mem). _reproject_chunk_numpy/cupy already slice source first before .compute(). Dask graph at 25600x25600 src with 1024 chunks yields 4752 tasks (no per-chunk source dependency). _apply_vertical_shift uses in-place += that may not work on dask arrays -- correctness concern, not perf, defer to accuracy sweep."
resample,2026-04-15T12:00:00Z,SAFE,compute-bound,0,false-positive,Downgraded. GPU-CPU-GPU round-trip only in aggregate path for non-integer scale factors. Interpolation (nearest/bilinear/cubic) stays on GPU. No GPU kernel exists for irregular per-pixel binning.
sieve,2026-04-14T12:00:00Z,WILL OOM,memory-bound,0,false-positive,False positive. Memory guards already in place on both dask paths. CCL is inherently global — documented limitation. CuPy CPU fallback is deliberate and documented.
Expand Down
39 changes: 33 additions & 6 deletions xrspatial/rasterize.py
Original file line number Diff line number Diff line change
Expand Up @@ -2051,6 +2051,21 @@ def _run_cupy(geometries, props_array, bounds, height, width, fill, dtype,

# Stage the per-launch inputs once; first/last reuses these tensors
# for the pass-2 stamp launches.
#
# ``poly_props`` and ``poly_global`` are referenced by both the
# scanline ``poly_launch`` and the supercover ``boundary_launch``
# under ``all_touched=True``. Hoist the host-to-device transfer above
# the conditional so the two launches share the same device buffers;
# without the hoist the props/global tables would be uploaded twice
# per call. Skip the upload when neither launch will consume it
# (``len(edge_y_min) == 0`` and ``not all_touched``), so the hoist
# never costs more than the prior code. See issue #2506.
poly_props_gpu = None
poly_global_gpu = None
if poly_geoms and (len(edge_y_min) > 0 or all_touched):
poly_props_gpu = cupy.asarray(poly_props)
poly_global_gpu = cupy.asarray(poly_global)

poly_launch = None
if len(edge_y_min) > 0:
row_ptr, col_idx = _build_row_csr_numba(
Expand All @@ -2062,7 +2077,7 @@ def _run_cupy(geometries, props_array, bounds, height, width, fill, dtype,
poly_launch = (
cupy.asarray(edge_y_min), cupy.asarray(edge_x_at_ymin),
cupy.asarray(edge_inv_slope), cupy.asarray(edge_geom_id),
cupy.asarray(poly_props), cupy.asarray(poly_global),
poly_props_gpu, poly_global_gpu,
cupy.asarray(row_ptr), cupy.asarray(col_idx))

# all_touched boundaries. ``poly_geoms`` and ``poly_ids`` come
Expand All @@ -2080,8 +2095,8 @@ def _run_cupy(geometries, props_array, bounds, height, width, fill, dtype,
boundary_launch = (
cupy.asarray(bx0), cupy.asarray(by0),
cupy.asarray(bx1), cupy.asarray(by1),
cupy.asarray(bidx), cupy.asarray(poly_props),
cupy.asarray(poly_global), len(bx0))
cupy.asarray(bidx), poly_props_gpu,
poly_global_gpu, len(bx0))

r0, c0, r1, c1, line_idx = _extract_line_segments(
line_geoms, bounds, height, width)
Expand Down Expand Up @@ -2529,6 +2544,18 @@ def _rasterize_tile_cupy(poly_wkb, poly_props_2d, poly_global_2d, tile_bounds,
poly_geoms, poly_ids, tile_bounds, tile_h, tile_w)
edge_arrays = _sort_edges(edge_arrays)
edge_y_min = edge_arrays[0]
# Upload ``poly_props_2d`` and ``poly_global_2d`` once; both the
# scanline ``poly_launch`` and the supercover ``boundary_launch``
# under ``all_touched=True`` reference these tables, and without
# the hoist they would be transferred twice per tile. Skip the
# upload when neither launch will consume it (no edges and not
# ``all_touched``) so the hoist never costs more than the prior
# code. Issue #2506.
poly_props_2d_gpu = None
poly_global_2d_gpu = None
if len(edge_y_min) > 0 or all_touched:
poly_props_2d_gpu = cupy.asarray(poly_props_2d)
poly_global_2d_gpu = cupy.asarray(poly_global_2d)
if len(edge_y_min) > 0:
edge_y_max, edge_x_at_ymin, edge_inv_slope, edge_geom_id = \
edge_arrays[1:]
Expand All @@ -2538,7 +2565,7 @@ def _rasterize_tile_cupy(poly_wkb, poly_props_2d, poly_global_2d, tile_bounds,
poly_launch = (
cupy.asarray(edge_y_min), cupy.asarray(edge_x_at_ymin),
cupy.asarray(edge_inv_slope), cupy.asarray(edge_geom_id),
cupy.asarray(poly_props_2d), cupy.asarray(poly_global_2d),
poly_props_2d_gpu, poly_global_2d_gpu,
cupy.asarray(row_ptr), cupy.asarray(col_idx))

# all_touched: stage the supercover boundary burn through
Expand All @@ -2552,8 +2579,8 @@ def _rasterize_tile_cupy(poly_wkb, poly_props_2d, poly_global_2d, tile_bounds,
boundary_launch = (
cupy.asarray(bx0), cupy.asarray(by0),
cupy.asarray(bx1), cupy.asarray(by1),
cupy.asarray(bidx), cupy.asarray(poly_props_2d),
cupy.asarray(poly_global_2d), len(bx0))
cupy.asarray(bidx), poly_props_2d_gpu,
poly_global_2d_gpu, len(bx0))

line_launch = None
if len(seg_r0) > 0:
Expand Down
203 changes: 203 additions & 0 deletions xrspatial/tests/test_rasterize_props_hoist_2506.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,203 @@
"""Regression tests for issue #2506.

``_run_cupy`` and ``_rasterize_tile_cupy`` previously called
``cupy.asarray(poly_props)`` and ``cupy.asarray(poly_global)`` twice
when ``all_touched=True``: once for the scanline ``poly_launch``
tuple and once for the supercover ``boundary_launch`` tuple. The
two tuples reference the same per-tile props/global tables, so the
second upload re-transfers identical bytes.

These tests cover two angles:

1. AST-level structural assertion that each function only calls
``cupy.asarray(poly_props...)`` and ``cupy.asarray(poly_global...)``
once -- the fix hoists the upload above the
scanline/boundary conditional and shares the device buffer.
2. End-to-end correctness for ``all_touched=True`` on the cupy and
dask+cupy backends against the numpy reference, so the hoist did
not change observed behaviour.

When cupy / CUDA / dask / shapely are unavailable the relevant tests
skip; the structural test runs on any host with the source file
present.
"""
import ast
from pathlib import Path

import numpy as np
import pytest

try:
from shapely.geometry import box
has_shapely = True
except ImportError:
has_shapely = False

if has_shapely:
from xrspatial.rasterize import rasterize

try:
import cupy # noqa: F401
has_cupy = True
except ImportError:
has_cupy = False

try:
from numba import cuda
has_cuda = has_cupy and cuda.is_available()
except Exception:
has_cuda = False

try:
import dask # noqa: F401
has_dask = True
except ImportError:
has_dask = False


# Module path used by the AST-level structural test. Independent of
# whether cupy / cuda are installed on the host.
_RASTERIZE_PY = Path(__file__).resolve().parent.parent / "rasterize.py"


def _count_cupy_asarray_calls(func_src, var_name):
"""Count ``cupy.asarray(<var_name>)`` calls inside *func_src*.

Matches ``cupy.asarray(poly_props)`` and ``cupy.asarray(poly_props_2d)``
but not ``cupy.asarray(poly_props_2d_gpu)`` (those are the names of
the hoisted device buffers in the fix).
"""
tree = ast.parse(func_src)
count = 0
for node in ast.walk(tree):
if not isinstance(node, ast.Call):
continue
func = node.func
if not (isinstance(func, ast.Attribute)
and func.attr == "asarray"
and isinstance(func.value, ast.Name)
and func.value.id == "cupy"):
continue
if len(node.args) != 1:
continue
arg = node.args[0]
if isinstance(arg, ast.Name) and arg.id == var_name:
count += 1
return count


def _extract_function_source(name):
"""Return the source of the named top-level function in rasterize.py."""
src = _RASTERIZE_PY.read_text()
tree = ast.parse(src)
for node in ast.iter_child_nodes(tree):
if isinstance(node, ast.FunctionDef) and node.name == name:
return ast.get_source_segment(src, node)
raise AssertionError(f"function {name!r} not found in rasterize.py")


@pytest.mark.skipif(
not _RASTERIZE_PY.exists(), reason="rasterize.py not found"
)
class TestStructure:
"""AST-level invariants pinning the props/global hoist in place."""

def test_run_cupy_uploads_poly_props_once(self):
src = _extract_function_source("_run_cupy")
assert _count_cupy_asarray_calls(src, "poly_props") == 1, (
"_run_cupy must call cupy.asarray(poly_props) exactly once "
"(hoisted above the scanline / boundary conditional). "
"See issue #2506."
)

def test_run_cupy_uploads_poly_global_once(self):
src = _extract_function_source("_run_cupy")
assert _count_cupy_asarray_calls(src, "poly_global") == 1, (
"_run_cupy must call cupy.asarray(poly_global) exactly once. "
"See issue #2506."
)

def test_rasterize_tile_cupy_uploads_poly_props_2d_once(self):
src = _extract_function_source("_rasterize_tile_cupy")
assert _count_cupy_asarray_calls(src, "poly_props_2d") == 1, (
"_rasterize_tile_cupy must call cupy.asarray(poly_props_2d) "
"exactly once per tile (hoisted above the scanline / boundary "
"conditional). See issue #2506."
)

def test_rasterize_tile_cupy_uploads_poly_global_2d_once(self):
src = _extract_function_source("_rasterize_tile_cupy")
assert _count_cupy_asarray_calls(src, "poly_global_2d") == 1, (
"_rasterize_tile_cupy must call cupy.asarray(poly_global_2d) "
"exactly once per tile. See issue #2506."
)


@pytest.mark.skipif(not has_shapely, reason="shapely not installed")
@pytest.mark.skipif(not has_cuda, reason="CUDA / CuPy not available")
class TestParityAllTouched:
"""End-to-end parity: hoist must not change pixel output."""

def _geometries(self):
# A handful of overlapping rectangles whose ring segments cross
# the raster interior -- exercises both the scanline polygon
# path and the supercover boundary burn.
return [
(box(0.5, 0.5, 4.5, 4.5), 1.0),
(box(2.0, 2.0, 6.0, 6.0), 2.0),
(box(5.5, 3.5, 9.5, 7.5), 3.0),
(box(3.0, 5.0, 7.0, 9.0), 4.0),
]

def _to_numpy(self, arr):
data = arr.data
# Dask-backed: materialise first so the cupy chunks can be
# transferred (avoids the implicit __array__ on cupy that
# numpy's asarray would trigger).
if hasattr(data, "compute"):
data = data.compute()
if hasattr(data, "get"):
return data.get()
return np.asarray(data)

@pytest.mark.parametrize("merge", ["last", "first", "max", "min", "sum"])
def test_cupy_all_touched_matches_numpy(self, merge):
geoms = self._geometries()
ref = rasterize(geoms, width=24, height=24,
bounds=(0, 0, 10, 10),
all_touched=True, merge=merge, fill=0.0)
got = rasterize(geoms, width=24, height=24,
bounds=(0, 0, 10, 10),
all_touched=True, merge=merge, fill=0.0,
use_cuda=True)
np.testing.assert_allclose(
self._to_numpy(got), self._to_numpy(ref), atol=0, rtol=0,
err_msg=f"cupy all_touched != numpy for merge={merge!r}",
)

@pytest.mark.skipif(not has_dask, reason="dask not installed")
@pytest.mark.parametrize("merge", ["last", "sum", "max"])
def test_dask_cupy_all_touched_runs(self, merge):
"""Smoke test: dask+cupy all_touched returns a result of the
expected shape and dtype. Pixel-level parity against numpy is
tracked separately (a pre-existing boundary-segment-on-tile-
border bug, unrelated to the props hoist in #2506) and is not
asserted here -- the goal of this test is to exercise the
hoisted poly_props_2d / poly_global_2d upload through every
per-tile launch in the dask path.
"""
geoms = self._geometries()
got = rasterize(geoms, width=24, height=24,
bounds=(0, 0, 10, 10),
all_touched=True, merge=merge, fill=0.0,
use_cuda=True, chunks=8)
out = self._to_numpy(got)
assert out.shape == (24, 24)
# Some interior pixels must have been burned by at least one
# geometry; an all-zero output would indicate the per-tile
# polygon launches never wrote anything.
assert (out != 0).any(), (
f"dask+cupy all_touched merge={merge!r} produced an all-zero "
"raster; the per-tile poly_props_2d hoist may have broken "
"the launch wiring."
)
Loading