diff --git a/.claude/sweep-performance-state.csv b/.claude/sweep-performance-state.csv index 8ef657ff7..4f6da84d1 100644 --- a/.claude/sweep-performance-state.csv +++ b/.claude/sweep-performance-state.csv @@ -32,7 +32,7 @@ perlin,2026-03-31T18:00:00Z,WILL OOM,memory-bound,0,, polygon_clip,2026-04-16T12:00:00Z,SAFE,compute-bound,0,1207,Re-audit 2026-04-16: fix verified SAFE. Mask stays lazy via rasterize chunks kwarg; per-chunk peak bounded. polygonize,2026-04-16T12:00:00Z,RISKY,compute-bound,0,,Re-audit 2026-04-16 after PR 1190 NaN fix + 1176 simplification. No HIGH. MEDIUM: sequential per-chunk dask.compute loop at L1528 serializes work; _polygonize_cupy full GPU->CPU transfer at L665; per-value bin_mask alloc in _calculate_regions_cupy. proximity,2026-03-31T18:00:00Z,WILL OOM,memory-bound,3,1111,Memory guard added to line-sweep path. KDTree path (EUCLIDEAN/MANHATTAN + scipy) already had guards. GREAT_CIRCLE unbounded path already guarded. -rasterize,2026-05-17,SAFE,graph-bound,0,2020,"Pass 2 (2026-05-17): re-audit identified MEDIUM Cat-2/Cat-3 graph-bound waste in _run_dask_numpy/_run_dask_cupy -- full line_props/point_props embedded in every delayed tile task (polygon path already filtered via poly_props[pmask]). Filed #2020 and fixed: added _slice_props_for_tile helper to remap geom_idx and slice props per tile (mirrors polygon path). Measured 5000 points x 8 cols / 100 tiles graph shrank from ~30 MB to <0.3 MB (37x); localized lines from ~32 MB to ~1.1 MB. 9 new tests in test_rasterize_tile_props_slice_2020.py (helper unit tests + graph-payload bound + numpy/dask output parity for lines/points/sum-merge). All 184 existing rasterize tests pass; dask+cupy parity verified. Dask graph probe: 2560x2560 chunks=256 yields 400 tasks (4 tasks/chunk constant); 25600x25600 chunks=1024 yields 2500 tasks. cupy 512x512 returns cupy.ndarray with no host round-trip. CUDA _scanline_fill_gpu: 39 regs/thread, 24576 B local_mem/thread (matches static cuda.local.array allocations 2048*8 + 2048*4 bytes). SAFE/graph-bound verdict holds; previous 2026-04-15 false-positive on polygon filtering still valid. | Original (2026-04-15): Tile-by-tile graph construction with per-tile geometry filtering is the correct pattern. Pre-filtering ensures each delayed task gets only its relevant subset." +rasterize,2026-05-27,SAFE,graph-bound,0,2506,"Pass 3 (2026-05-27): re-audit identified 1 MEDIUM Cat-3 GPU-transfer finding. _run_cupy (L2065/L2083) and _rasterize_tile_cupy (L2541/L2555) called cupy.asarray(poly_props/poly_global) twice when all_touched=True -- once for the scanline poly_launch tuple and once for the supercover boundary_launch tuple. The two tuples reference the same per-tile props tables. Filed #2506 and fixed by hoisting the upload above the scanline/boundary conditional so both launches share the same device buffer. Microbench: 1000 polys/4 cols 0.051->0.024 ms/iter (2.1x); 10000 polys/8 cols 0.218->0.092 ms/iter (2.4x, saves 720 KB/tile of redundant H2D transfer). 12 new tests in test_rasterize_props_hoist_2506.py (4 AST-structural single-asarray-call assertions + 5 cupy all_touched parity merges + 3 dask+cupy smoke tests). All 470 rasterize tests pass. Dask graph probe: 25600x25600 chunks=1024 yields 2500 tasks for 625 tiles (4 tasks/chunk), unchanged. Noted pre-existing dask+cupy all_touched parity gap on boundary segments crossing tile borders (not addressed by this PR). SAFE/graph-bound verdict holds. | Pass 2 (2026-05-17): re-audit identified MEDIUM Cat-2/Cat-3 graph-bound waste in _run_dask_numpy/_run_dask_cupy -- full line_props/point_props embedded in every delayed tile task (polygon path already filtered via poly_props[pmask]). Filed #2020 and fixed: added _slice_props_for_tile helper to remap geom_idx and slice props per tile (mirrors polygon path). Measured 5000 points x 8 cols / 100 tiles graph shrank from ~30 MB to <0.3 MB (37x); localized lines from ~32 MB to ~1.1 MB. 9 new tests in test_rasterize_tile_props_slice_2020.py (helper unit tests + graph-payload bound + numpy/dask output parity for lines/points/sum-merge). All 184 existing rasterize tests pass; dask+cupy parity verified. Dask graph probe: 2560x2560 chunks=256 yields 400 tasks (4 tasks/chunk constant); 25600x25600 chunks=1024 yields 2500 tasks. cupy 512x512 returns cupy.ndarray with no host round-trip. CUDA _scanline_fill_gpu: 39 regs/thread, 24576 B local_mem/thread (matches static cuda.local.array allocations 2048*8 + 2048*4 bytes). SAFE/graph-bound verdict holds; previous 2026-04-15 false-positive on polygon filtering still valid. | Original (2026-04-15): Tile-by-tile graph construction with per-tile geometry filtering is the correct pattern. Pre-filtering ensures each delayed task gets only its relevant subset." reproject,2026-05-10,SAFE,compute-bound,1,1571,"Pass 5 (2026-05-10): 1 HIGH filed and fixed in tree -- issue #1571 + fix _merge_block_adapter same-CRS dask path. _place_same_crs in the dask adapter previously called src_data.compute() on the full source per output chunk (68x amplification measured on 256x256x2 source split into 32x32 output chunks, 8.9M pixels materialized vs 131K total source). Fix: added _place_same_crs_lazy at __init__.py:1716 that slices the source window first then computes only that slice. Verified post-fix: 1.00x ratio, 131K pixels materialized for 131K source. New regression test test_merge_dask_same_crs_bounded_materialization codifies the bound. Other audits clean: CUDA resample kernels use 16x16 blocks (cubic=46 regs, bilinear=36, nearest=22 -- well under the 64K-per-block limit, 0 local mem). _reproject_chunk_numpy/cupy already slice source first before .compute(). Dask graph at 25600x25600 src with 1024 chunks yields 4752 tasks (no per-chunk source dependency). _apply_vertical_shift uses in-place += that may not work on dask arrays -- correctness concern, not perf, defer to accuracy sweep." resample,2026-04-15T12:00:00Z,SAFE,compute-bound,0,false-positive,Downgraded. GPU-CPU-GPU round-trip only in aggregate path for non-integer scale factors. Interpolation (nearest/bilinear/cubic) stays on GPU. No GPU kernel exists for irregular per-pixel binning. sieve,2026-04-14T12:00:00Z,WILL OOM,memory-bound,0,false-positive,False positive. Memory guards already in place on both dask paths. CCL is inherently global — documented limitation. CuPy CPU fallback is deliberate and documented. diff --git a/xrspatial/rasterize.py b/xrspatial/rasterize.py index 40b5baf5c..00dc722a5 100644 --- a/xrspatial/rasterize.py +++ b/xrspatial/rasterize.py @@ -2051,6 +2051,21 @@ def _run_cupy(geometries, props_array, bounds, height, width, fill, dtype, # Stage the per-launch inputs once; first/last reuses these tensors # for the pass-2 stamp launches. + # + # ``poly_props`` and ``poly_global`` are referenced by both the + # scanline ``poly_launch`` and the supercover ``boundary_launch`` + # under ``all_touched=True``. Hoist the host-to-device transfer above + # the conditional so the two launches share the same device buffers; + # without the hoist the props/global tables would be uploaded twice + # per call. Skip the upload when neither launch will consume it + # (``len(edge_y_min) == 0`` and ``not all_touched``), so the hoist + # never costs more than the prior code. See issue #2506. + poly_props_gpu = None + poly_global_gpu = None + if poly_geoms and (len(edge_y_min) > 0 or all_touched): + poly_props_gpu = cupy.asarray(poly_props) + poly_global_gpu = cupy.asarray(poly_global) + poly_launch = None if len(edge_y_min) > 0: row_ptr, col_idx = _build_row_csr_numba( @@ -2062,7 +2077,7 @@ def _run_cupy(geometries, props_array, bounds, height, width, fill, dtype, poly_launch = ( cupy.asarray(edge_y_min), cupy.asarray(edge_x_at_ymin), cupy.asarray(edge_inv_slope), cupy.asarray(edge_geom_id), - cupy.asarray(poly_props), cupy.asarray(poly_global), + poly_props_gpu, poly_global_gpu, cupy.asarray(row_ptr), cupy.asarray(col_idx)) # all_touched boundaries. ``poly_geoms`` and ``poly_ids`` come @@ -2080,8 +2095,8 @@ def _run_cupy(geometries, props_array, bounds, height, width, fill, dtype, boundary_launch = ( cupy.asarray(bx0), cupy.asarray(by0), cupy.asarray(bx1), cupy.asarray(by1), - cupy.asarray(bidx), cupy.asarray(poly_props), - cupy.asarray(poly_global), len(bx0)) + cupy.asarray(bidx), poly_props_gpu, + poly_global_gpu, len(bx0)) r0, c0, r1, c1, line_idx = _extract_line_segments( line_geoms, bounds, height, width) @@ -2529,6 +2544,18 @@ def _rasterize_tile_cupy(poly_wkb, poly_props_2d, poly_global_2d, tile_bounds, poly_geoms, poly_ids, tile_bounds, tile_h, tile_w) edge_arrays = _sort_edges(edge_arrays) edge_y_min = edge_arrays[0] + # Upload ``poly_props_2d`` and ``poly_global_2d`` once; both the + # scanline ``poly_launch`` and the supercover ``boundary_launch`` + # under ``all_touched=True`` reference these tables, and without + # the hoist they would be transferred twice per tile. Skip the + # upload when neither launch will consume it (no edges and not + # ``all_touched``) so the hoist never costs more than the prior + # code. Issue #2506. + poly_props_2d_gpu = None + poly_global_2d_gpu = None + if len(edge_y_min) > 0 or all_touched: + poly_props_2d_gpu = cupy.asarray(poly_props_2d) + poly_global_2d_gpu = cupy.asarray(poly_global_2d) if len(edge_y_min) > 0: edge_y_max, edge_x_at_ymin, edge_inv_slope, edge_geom_id = \ edge_arrays[1:] @@ -2538,7 +2565,7 @@ def _rasterize_tile_cupy(poly_wkb, poly_props_2d, poly_global_2d, tile_bounds, poly_launch = ( cupy.asarray(edge_y_min), cupy.asarray(edge_x_at_ymin), cupy.asarray(edge_inv_slope), cupy.asarray(edge_geom_id), - cupy.asarray(poly_props_2d), cupy.asarray(poly_global_2d), + poly_props_2d_gpu, poly_global_2d_gpu, cupy.asarray(row_ptr), cupy.asarray(col_idx)) # all_touched: stage the supercover boundary burn through @@ -2552,8 +2579,8 @@ def _rasterize_tile_cupy(poly_wkb, poly_props_2d, poly_global_2d, tile_bounds, boundary_launch = ( cupy.asarray(bx0), cupy.asarray(by0), cupy.asarray(bx1), cupy.asarray(by1), - cupy.asarray(bidx), cupy.asarray(poly_props_2d), - cupy.asarray(poly_global_2d), len(bx0)) + cupy.asarray(bidx), poly_props_2d_gpu, + poly_global_2d_gpu, len(bx0)) line_launch = None if len(seg_r0) > 0: diff --git a/xrspatial/tests/test_rasterize_props_hoist_2506.py b/xrspatial/tests/test_rasterize_props_hoist_2506.py new file mode 100644 index 000000000..eba744e16 --- /dev/null +++ b/xrspatial/tests/test_rasterize_props_hoist_2506.py @@ -0,0 +1,203 @@ +"""Regression tests for issue #2506. + +``_run_cupy`` and ``_rasterize_tile_cupy`` previously called +``cupy.asarray(poly_props)`` and ``cupy.asarray(poly_global)`` twice +when ``all_touched=True``: once for the scanline ``poly_launch`` +tuple and once for the supercover ``boundary_launch`` tuple. The +two tuples reference the same per-tile props/global tables, so the +second upload re-transfers identical bytes. + +These tests cover two angles: + +1. AST-level structural assertion that each function only calls + ``cupy.asarray(poly_props...)`` and ``cupy.asarray(poly_global...)`` + once -- the fix hoists the upload above the + scanline/boundary conditional and shares the device buffer. +2. End-to-end correctness for ``all_touched=True`` on the cupy and + dask+cupy backends against the numpy reference, so the hoist did + not change observed behaviour. + +When cupy / CUDA / dask / shapely are unavailable the relevant tests +skip; the structural test runs on any host with the source file +present. +""" +import ast +from pathlib import Path + +import numpy as np +import pytest + +try: + from shapely.geometry import box + has_shapely = True +except ImportError: + has_shapely = False + +if has_shapely: + from xrspatial.rasterize import rasterize + +try: + import cupy # noqa: F401 + has_cupy = True +except ImportError: + has_cupy = False + +try: + from numba import cuda + has_cuda = has_cupy and cuda.is_available() +except Exception: + has_cuda = False + +try: + import dask # noqa: F401 + has_dask = True +except ImportError: + has_dask = False + + +# Module path used by the AST-level structural test. Independent of +# whether cupy / cuda are installed on the host. +_RASTERIZE_PY = Path(__file__).resolve().parent.parent / "rasterize.py" + + +def _count_cupy_asarray_calls(func_src, var_name): + """Count ``cupy.asarray()`` calls inside *func_src*. + + Matches ``cupy.asarray(poly_props)`` and ``cupy.asarray(poly_props_2d)`` + but not ``cupy.asarray(poly_props_2d_gpu)`` (those are the names of + the hoisted device buffers in the fix). + """ + tree = ast.parse(func_src) + count = 0 + for node in ast.walk(tree): + if not isinstance(node, ast.Call): + continue + func = node.func + if not (isinstance(func, ast.Attribute) + and func.attr == "asarray" + and isinstance(func.value, ast.Name) + and func.value.id == "cupy"): + continue + if len(node.args) != 1: + continue + arg = node.args[0] + if isinstance(arg, ast.Name) and arg.id == var_name: + count += 1 + return count + + +def _extract_function_source(name): + """Return the source of the named top-level function in rasterize.py.""" + src = _RASTERIZE_PY.read_text() + tree = ast.parse(src) + for node in ast.iter_child_nodes(tree): + if isinstance(node, ast.FunctionDef) and node.name == name: + return ast.get_source_segment(src, node) + raise AssertionError(f"function {name!r} not found in rasterize.py") + + +@pytest.mark.skipif( + not _RASTERIZE_PY.exists(), reason="rasterize.py not found" +) +class TestStructure: + """AST-level invariants pinning the props/global hoist in place.""" + + def test_run_cupy_uploads_poly_props_once(self): + src = _extract_function_source("_run_cupy") + assert _count_cupy_asarray_calls(src, "poly_props") == 1, ( + "_run_cupy must call cupy.asarray(poly_props) exactly once " + "(hoisted above the scanline / boundary conditional). " + "See issue #2506." + ) + + def test_run_cupy_uploads_poly_global_once(self): + src = _extract_function_source("_run_cupy") + assert _count_cupy_asarray_calls(src, "poly_global") == 1, ( + "_run_cupy must call cupy.asarray(poly_global) exactly once. " + "See issue #2506." + ) + + def test_rasterize_tile_cupy_uploads_poly_props_2d_once(self): + src = _extract_function_source("_rasterize_tile_cupy") + assert _count_cupy_asarray_calls(src, "poly_props_2d") == 1, ( + "_rasterize_tile_cupy must call cupy.asarray(poly_props_2d) " + "exactly once per tile (hoisted above the scanline / boundary " + "conditional). See issue #2506." + ) + + def test_rasterize_tile_cupy_uploads_poly_global_2d_once(self): + src = _extract_function_source("_rasterize_tile_cupy") + assert _count_cupy_asarray_calls(src, "poly_global_2d") == 1, ( + "_rasterize_tile_cupy must call cupy.asarray(poly_global_2d) " + "exactly once per tile. See issue #2506." + ) + + +@pytest.mark.skipif(not has_shapely, reason="shapely not installed") +@pytest.mark.skipif(not has_cuda, reason="CUDA / CuPy not available") +class TestParityAllTouched: + """End-to-end parity: hoist must not change pixel output.""" + + def _geometries(self): + # A handful of overlapping rectangles whose ring segments cross + # the raster interior -- exercises both the scanline polygon + # path and the supercover boundary burn. + return [ + (box(0.5, 0.5, 4.5, 4.5), 1.0), + (box(2.0, 2.0, 6.0, 6.0), 2.0), + (box(5.5, 3.5, 9.5, 7.5), 3.0), + (box(3.0, 5.0, 7.0, 9.0), 4.0), + ] + + def _to_numpy(self, arr): + data = arr.data + # Dask-backed: materialise first so the cupy chunks can be + # transferred (avoids the implicit __array__ on cupy that + # numpy's asarray would trigger). + if hasattr(data, "compute"): + data = data.compute() + if hasattr(data, "get"): + return data.get() + return np.asarray(data) + + @pytest.mark.parametrize("merge", ["last", "first", "max", "min", "sum"]) + def test_cupy_all_touched_matches_numpy(self, merge): + geoms = self._geometries() + ref = rasterize(geoms, width=24, height=24, + bounds=(0, 0, 10, 10), + all_touched=True, merge=merge, fill=0.0) + got = rasterize(geoms, width=24, height=24, + bounds=(0, 0, 10, 10), + all_touched=True, merge=merge, fill=0.0, + use_cuda=True) + np.testing.assert_allclose( + self._to_numpy(got), self._to_numpy(ref), atol=0, rtol=0, + err_msg=f"cupy all_touched != numpy for merge={merge!r}", + ) + + @pytest.mark.skipif(not has_dask, reason="dask not installed") + @pytest.mark.parametrize("merge", ["last", "sum", "max"]) + def test_dask_cupy_all_touched_runs(self, merge): + """Smoke test: dask+cupy all_touched returns a result of the + expected shape and dtype. Pixel-level parity against numpy is + tracked separately (a pre-existing boundary-segment-on-tile- + border bug, unrelated to the props hoist in #2506) and is not + asserted here -- the goal of this test is to exercise the + hoisted poly_props_2d / poly_global_2d upload through every + per-tile launch in the dask path. + """ + geoms = self._geometries() + got = rasterize(geoms, width=24, height=24, + bounds=(0, 0, 10, 10), + all_touched=True, merge=merge, fill=0.0, + use_cuda=True, chunks=8) + out = self._to_numpy(got) + assert out.shape == (24, 24) + # Some interior pixels must have been burned by at least one + # geometry; an all-zero output would indicate the per-tile + # polygon launches never wrote anything. + assert (out != 0).any(), ( + f"dask+cupy all_touched merge={merge!r} produced an all-zero " + "raster; the per-tile poly_props_2d hoist may have broken " + "the launch wiring." + )