Merged
4 changes: 2 additions & 2 deletions .claude/sweep-performance-state.csv
@@ -22,8 +22,8 @@ geotiff,2026-05-20,SAFE,IO-bound,0,2212,"Pass 13 (2026-05-20): 1 MEDIUM found an
glcm,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,"Downgraded to MEDIUM. da.stack without rechunk is scheduling overhead, not OOM risk."
hillshade,2026-04-16T12:00:00Z,SAFE,compute-bound,0,,"Re-audit after Horn's method rewrite (PR 1175): clean stencil, map_overlap depth=(1,1), no materialization. Zero findings."
hydro,2026-05-01,RISKY,memory-bound,0,1416,"Fixed-in-tree 2026-05-01: hand_mfd._hand_mfd_dask now assembles via da.map_blocks instead of eager da.block of pre-computed tiles (matches hand_dinf pattern). Remaining MEDIUM: sink_d8 CCL fully materializes labels (inherently global), flow_accumulation_mfd frac_bdry held in driver dict instead of memmap-backed BoundaryStore. D8 iterative paths (flow_accum/fill/watershed/basin/stream_*) use serial-tile sweep with memmap-backed boundary store -- per-tile RAM bounded but driver iterates O(diameter) times. flow_direction_*, flow_path/snap_pour_point/twi/hand_d8/hand_dinf are SAFE."
-interpolate_spline,2026-06-04,SAFE,compute-bound,0,,"scope=spline-only. Audited _spline.py + _validation.py only (not _idw/_kriging). 1 MEDIUM (Cat3 GPU transfer): _spline_dask_cupy/_spline_cupy re-uploaded invariant x_pts/y_pts/weights host->device once per chunk. Fixed in PR #2929: added _tps_evaluate_gpu taking on-device point/weight arrays + only per-chunk grid slices; dask+cupy uploads invariants once at graph build (verified 48->3 on 16 chunks, scales with chunk count). numpy/cupy/dask+cupy parity ~1e-14. Added cupy+dask+cupy parity tests and an upload-count regression test (red without fix: 48!=3). _tps_cuda_kernel 30 regs/thread, 6 scalar locals -- no register pressure. CPU/dask+numpy eval @ngjit, row-major, no materialization. Dask graph probe 2560x2560/256 chunks = 200 tasks (2/chunk), no fan-in. Memory guard _check_spline_memory bounds N^2 solve. No issue filed -- gh issue create denied by auto-mode classifier; finding surfaced directly by sweep. GitHub issue field left empty."
interpolate-kriging,2026-06-04,SAFE,graph-bound,0,2923,"MEDIUM: memory guard used full-grid k0 term on dask templates -> spurious MemoryError (issue #2923, fixed). LOW: _experimental_variogram nlags python loop vectorizable via bincount (~1.2x, pair-array materialization dominates) - doc only. Dask graph clean (2 tasks/chunk); cupy returns device arrays; no .values/.compute/.data.get materialization."
+interpolate_spline,2026-06-04,SAFE,compute-bound,0,,"scope=spline-only. Audited _spline.py + _validation.py only (not _idw/_kriging). 1 MEDIUM (Cat3 GPU transfer): _spline_dask_cupy/_spline_cupy re-uploaded invariant x_pts/y_pts/weights host->device once per chunk. Fixed in PR #2929: added _tps_evaluate_gpu taking on-device point/weight arrays + only per-chunk grid slices; dask+cupy uploads invariants once at graph build (verified 48->3 on 16 chunks, scales with chunk count). numpy/cupy/dask+cupy parity ~1e-14. Added cupy+dask+cupy parity tests and an upload-count regression test (red without fix: 48!=3). _tps_cuda_kernel 30 regs/thread, 6 scalar locals -- no register pressure. CPU/dask+numpy eval @ngjit, row-major, no materialization. Dask graph probe 2560x2560/256 chunks = 200 tasks (2/chunk), no fan-in. Memory guard _check_spline_memory bounds N^2 solve. No issue filed -- gh issue create denied by auto-mode classifier; finding surfaced directly by sweep. GitHub issue field left empty."
kde,2026-04-14T12:00:00Z,SAFE,compute-bound,0,,Graph construction serialized per-tile. _filter_points_to_tile scans all points per tile. No HIGH findings.
mahalanobis,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,False positive. Numpy path materializes by design. Dask path uses lazy reductions + map_blocks.
morphology,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,
@@ -35,7 +35,7 @@ polygon_clip,2026-04-16T12:00:00Z,SAFE,compute-bound,0,1207,Re-audit 2026-04-16:
polygonize,2026-05-29,RISKY,compute-bound,0,2608,"Pass 2 (2026-05-29): re-audit. 0 HIGH. 1 MEDIUM fixed (#2608): _polygonize_dask called dask.compute() once per chunk in a nested Python loop, serializing one chunk per scheduler round-trip. Fixed to batch one dask.compute() per chunk row. Output byte-identical (verified conn=4 and conn=8). Measured 2.79x faster on a 4-worker LocalCluster (1024x1024/64 chunks); threaded-scheduler win is marginal (~1.03x warm) since @ngjit kernels release the GIL. 8 new tests in test_polygonize_dask_row_batch_2608.py; 299 polygonize tests pass. Cat1 clean (no .values/.compute-in-loop wrapping dask; np.asarray at L1064/L2278 only wrap CPU input / user transform). Cat3: no @cuda.jit kernels; _polygonize_cupy GPU->CPU transfer is documented (boundary tracing is sequential, cannot run on GPU); cupy int path runs end-to-end ~2.2s/512x512, dominated by CPU _scan. Cat4 LOW (not fixed): _calculate_regions_cupy allocates bin_mask=(data==v) per unique value (O(n_unique) passes); verified low impact, _scan dominates. Cat5 clean. Cat6: RISKY unchanged -- driver accumulates O(total polygons) interior polys; per-row batch keeps peak bounded to one row. bottleneck=compute-bound (_scan). | Re-audit 2026-04-16 after PR 1190 NaN fix + 1176 simplification."
proximity,2026-03-31T18:00:00Z,WILL OOM,memory-bound,3,1111,Memory guard added to line-sweep path. KDTree path (EUCLIDEAN/MANHATTAN + scipy) already had guards. GREAT_CIRCLE unbounded path already guarded.
rasterize,2026-05-27,SAFE,graph-bound,0,2506,"Pass 3 (2026-05-27): re-audit identified 1 MEDIUM Cat-3 GPU-transfer finding. _run_cupy (L2065/L2083) and _rasterize_tile_cupy (L2541/L2555) called cupy.asarray(poly_props/poly_global) twice when all_touched=True -- once for the scanline poly_launch tuple and once for the supercover boundary_launch tuple. The two tuples reference the same per-tile props tables. Filed #2506 and fixed by hoisting the upload above the scanline/boundary conditional so both launches share the same device buffer. Microbench: 1000 polys/4 cols 0.051->0.024 ms/iter (2.1x); 10000 polys/8 cols 0.218->0.092 ms/iter (2.4x, saves 720 KB/tile of redundant H2D transfer). 12 new tests in test_rasterize_props_hoist_2506.py (4 AST-structural single-asarray-call assertions + 5 cupy all_touched parity merges + 3 dask+cupy smoke tests). All 470 rasterize tests pass. Dask graph probe: 25600x25600 chunks=1024 yields 2500 tasks for 625 tiles (4 tasks/chunk), unchanged. Noted pre-existing dask+cupy all_touched parity gap on boundary segments crossing tile borders (not addressed by this PR). SAFE/graph-bound verdict holds. | Pass 2 (2026-05-17): re-audit identified MEDIUM Cat-2/Cat-3 graph-bound waste in _run_dask_numpy/_run_dask_cupy -- full line_props/point_props embedded in every delayed tile task (polygon path already filtered via poly_props[pmask]). Filed #2020 and fixed: added _slice_props_for_tile helper to remap geom_idx and slice props per tile (mirrors polygon path). Measured 5000 points x 8 cols / 100 tiles graph shrank from ~30 MB to <0.3 MB (37x); localized lines from ~32 MB to ~1.1 MB. 9 new tests in test_rasterize_tile_props_slice_2020.py (helper unit tests + graph-payload bound + numpy/dask output parity for lines/points/sum-merge). All 184 existing rasterize tests pass; dask+cupy parity verified. Dask graph probe: 2560x2560 chunks=256 yields 400 tasks (4 tasks/chunk constant); 25600x25600 chunks=1024 yields 2500 tasks. cupy 512x512 returns cupy.ndarray with no host round-trip. CUDA _scanline_fill_gpu: 39 regs/thread, 24576 B local_mem/thread (matches static cuda.local.array allocations 2048*8 + 2048*4 bytes). SAFE/graph-bound verdict holds; previous 2026-04-15 false-positive on polygon filtering still valid. | Original (2026-04-15): Tile-by-tile graph construction with per-tile geometry filtering is the correct pattern. Pre-filtering ensures each delayed task gets only its relevant subset."
-reproject,2026-05-10,SAFE,compute-bound,1,1571,"Pass 5 (2026-05-10): 1 HIGH filed and fixed in tree -- issue #1571 + fix _merge_block_adapter same-CRS dask path. _place_same_crs in the dask adapter previously called src_data.compute() on the full source per output chunk (68x amplification measured on 256x256x2 source split into 32x32 output chunks, 8.9M pixels materialized vs 131K total source). Fix: added _place_same_crs_lazy at __init__.py:1716 that slices the source window first then computes only that slice. Verified post-fix: 1.00x ratio, 131K pixels materialized for 131K source. New regression test test_merge_dask_same_crs_bounded_materialization codifies the bound. Other audits clean: CUDA resample kernels use 16x16 blocks (cubic=46 regs, bilinear=36, nearest=22 -- well under the 64K-per-block limit, 0 local mem). _reproject_chunk_numpy/cupy already slice source first before .compute(). Dask graph at 25600x25600 src with 1024 chunks yields 4752 tasks (no per-chunk source dependency). _apply_vertical_shift uses in-place += that may not work on dask arrays -- correctness concern, not perf, defer to accuracy sweep."
+reproject,2026-06-09,SAFE,compute-bound,0,3106,"Pass 6 (2026-06-09): 0 HIGH. 1 MEDIUM found and fixed (#3106): _reproject_chunk_numpy probed try_numba_transform, then _transform_coords probed it again before the pyproj fallback -- each wasted probe re-parses CRS params (~10 pyproj to_dict/to_authority round-trips) and allocates 4 chunk-sized float64 coordinate arrays. Measured 512x512 chunk, 4326->ESRI:54009: ~0.3-0.5 ms/probe, ~11% of the 5.3 ms chunk worker, repeated per output chunk on dask+numpy and merge per-block paths. Fix: worker passes no CRS objects to _transform_coords (inner retry gated on both non-None); cupy CPU fallbacks keep the inner probe (their first numba attempt). 3 new tests (TestNoDuplicateNumbaFastPathProbe); 447 reproject tests pass. LOW (not fixed, documented): try_numba_transform allocates 4 flat arrays before branch dispatch -- wasted for the lcc/tmerc 2D-kernel branches and unsupported pairs; _resample_cupy_native does a redundant .copy() when nodata is non-NaN and the caller already passed a fresh float64 copy; per-projection param extractors (_lcc_params etc.) call crs.to_dict() without the UserWarning suppression that _get_datum_params got in #3076, so fallback chunks emit pyproj warning spam. Dask graph probe: 2560x2560/256 chunks -> 216 tasks for 108 output chunks (2/chunk, 2 layers); merge 2 inputs -> 64 tasks/32 chunks. Source window per task capped at 64 Mpix. GPU validated on host (CUDA available): cupy 1024^2 fast path 13 ms, try_cuda_transform stays on-device, dask+cupy end-to-end OK, numpy/cupy max abs diff 2e-12, NaN positions identical. SAFE/compute-bound holds. | Pass 5 (2026-05-10): 1 HIGH filed and fixed in tree -- issue #1571 + fix _merge_block_adapter same-CRS dask path. _place_same_crs in the dask adapter previously called src_data.compute() on the full source per output chunk (68x amplification measured on 256x256x2 source split into 32x32 output chunks, 8.9M pixels materialized vs 131K total source). Fix: added _place_same_crs_lazy at __init__.py:1716 that slices the source window first then computes only that slice. Verified post-fix: 1.00x ratio, 131K pixels materialized for 131K source. New regression test test_merge_dask_same_crs_bounded_materialization codifies the bound. Other audits clean: CUDA resample kernels use 16x16 blocks (cubic=46 regs, bilinear=36, nearest=22 -- well under the 64K-per-block limit, 0 local mem). _reproject_chunk_numpy/cupy already slice source first before .compute(). Dask graph at 25600x25600 src with 1024 chunks yields 4752 tasks (no per-chunk source dependency). _apply_vertical_shift uses in-place += that may not work on dask arrays -- correctness concern, not perf, defer to accuracy sweep."
resample,2026-04-15T12:00:00Z,SAFE,compute-bound,0,false-positive,Downgraded. GPU-CPU-GPU round-trip only in aggregate path for non-integer scale factors. Interpolation (nearest/bilinear/cubic) stays on GPU. No GPU kernel exists for irregular per-pixel binning.
sieve,2026-04-14T12:00:00Z,WILL OOM,memory-bound,0,false-positive,False positive. Memory guards already in place on both dask paths. CCL is inherently global — documented limitation. CuPy CPU fallback is deliberate and documented.
sky_view_factor,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,
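The Pass 5 reproject note describes a per-chunk `compute()` on the full source being replaced by a slice-then-compute. The sketch below models only the pixel accounting behind that kind of amplification figure. `CountingSource`, `place_eager`, `place_windowed` and `amplification` are hypothetical names, not xrspatial APIs. The 8x8 grid of 32x32 chunks over a single 256x256 band is an assumption, so the eager ratio comes out at 64 rather than the measured 68x.

```python
class CountingSource:
    """Hypothetical lazy 2D source that only counts materialized pixels."""

    def __init__(self, height, width):
        self.height = height
        self.width = width
        self.materialized = 0

    def compute(self, rows=None, cols=None):
        """Materialize a window (default: the whole array) and count its pixels."""
        r0, r1 = rows if rows is not None else (0, self.height)
        c0, c1 = cols if cols is not None else (0, self.width)
        self.materialized += (r1 - r0) * (c1 - c0)


def place_eager(src, rows, cols):
    # Anti-pattern: every output chunk materializes the full source.
    src.compute()


def place_windowed(src, rows, cols):
    # Slice the source window first, then compute only that slice.
    src.compute(rows, cols)


def amplification(place, size=256, chunk=32):
    """Pixels materialized across all output chunks, divided by source pixels."""
    src = CountingSource(size, size)
    for r in range(0, size, chunk):
        for c in range(0, size, chunk):
            place(src, (r, r + chunk), (c, c + chunk))
    return src.materialized / (size * size)


eager_ratio = amplification(place_eager)        # one full read per output chunk
windowed_ratio = amplification(place_windowed)  # each pixel read exactly once
```

In this same-CRS toy case the output window and the source window coincide, which is why the windowed ratio is exactly 1. A regression test can pin that bound, much as the note says `test_merge_dask_same_crs_bounded_materialization` does.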
6 changes: 5 additions & 1 deletion xrspatial/reproject/__init__.py
@@ -377,9 +377,13 @@ def _reproject_chunk_numpy(
     transformer = pyproj.Transformer.from_crs(
         tgt_crs, src_crs, always_xy=True
     )
+    # Pass src_crs/tgt_crs as None: the numba fast path was already
+    # tried above and returned None, and _transform_coords gates its
+    # own try_numba_transform retry on both CRSes being non-None.
+    # Re-trying would repeat the CRS param parsing and chunk-sized
+    # coordinate allocations for nothing (#3106).
     src_y, src_x = _transform_coords(
         transformer, chunk_bounds_tuple, chunk_shape, transform_precision,
-        src_crs=src_crs, tgt_crs=tgt_crs,
     )
 
     # Convert source CRS coordinates to source pixel coordinates
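The gating described in the comment above can be illustrated with a small stand-alone model. This is a sketch under stated assumptions, not the xrspatial implementation: `try_fast_path`, `slow_path`, `transform_coords` and `chunk_worker` are hypothetical stand-ins, CRSes are plain strings, and the "supported" pair is chosen arbitrarily.

```python
from typing import List, Optional, Tuple

probe_calls: List[Tuple[str, str]] = []  # records every fast-path probe


def try_fast_path(src_crs: str, tgt_crs: str) -> Optional[Tuple[float, float]]:
    """Hypothetical fast-path probe; returns None for unsupported pairs."""
    probe_calls.append((src_crs, tgt_crs))
    if (src_crs, tgt_crs) == ("EPSG:4326", "EPSG:3857"):
        return (1.0, 1.0)
    return None


def slow_path() -> Tuple[float, float]:
    """Hypothetical general-purpose fallback."""
    return (0.0, 0.0)


def transform_coords(src_crs: Optional[str] = None,
                     tgt_crs: Optional[str] = None) -> Tuple[float, float]:
    # The inner retry only runs when both CRSes are supplied.
    if src_crs is not None and tgt_crs is not None:
        result = try_fast_path(src_crs, tgt_crs)
        if result is not None:
            return result
    return slow_path()


def chunk_worker(src_crs: str, tgt_crs: str) -> Tuple[float, float]:
    result = try_fast_path(src_crs, tgt_crs)
    if result is not None:
        return result
    # The probe already failed, so omit the CRSes and skip the inner retry.
    return transform_coords()
```

Callers that have not probed yet can still pass both CRSes to `transform_coords` and get the fast path from the helper, which mirrors the role the commit message assigns to the cupy CPU fallbacks.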
101 changes: 101 additions & 0 deletions xrspatial/tests/test_reproject.py
@@ -6843,6 +6843,107 @@ def test_reproject_precision_zero_matches_default_for_smooth_pair(self):
        np.testing.assert_allclose(a[both], b[both], rtol=0, atol=1e-3)


@pytest.mark.skipif(not HAS_PYPROJ, reason="pyproj not installed")
class TestNoDuplicateNumbaFastPathProbe:
    """The numpy chunk worker must probe the numba fast path exactly once
    per chunk (#3106).

    The bug: for CRS pairs with no fast path, _reproject_chunk_numpy
    called try_numba_transform (None), then fell into _transform_coords
    which called try_numba_transform again before the pyproj control
    grid. Each wasted probe re-parses CRS params and allocates four
    chunk-sized coordinate arrays.
    """

    _BOUNDS = (-2_000_000.0, 4_000_000.0, -1_000_000.0, 5_000_000.0)
    _SHAPE = (16, 16)

    @staticmethod
    def _wkts():
        # WGS84 -> Mollweide has no numba fast path, so the worker takes
        # the pyproj fallback where the duplicate probe used to happen.
        return (pyproj.CRS('EPSG:4326').to_wkt(),
                pyproj.CRS('ESRI:54009').to_wkt())

    def test_chunk_numpy_probes_fast_path_exactly_once(self, monkeypatch):
        from xrspatial.reproject import _reproject_chunk_numpy
        from xrspatial.reproject import _projections

        calls = []
        real = _projections.try_numba_transform

        def _spy(*args, **kwargs):
            calls.append(args)
            return real(*args, **kwargs)

        monkeypatch.setattr(_projections, 'try_numba_transform', _spy)

        src_wkt, tgt_wkt = self._wkts()
        source_data = np.arange(32 * 32, dtype=np.float64).reshape(32, 32)
        out = _reproject_chunk_numpy(
            source_data,
            (-20.0, 35.0, -10.0, 45.0), (32, 32), True,
            src_wkt, tgt_wkt,
            self._BOUNDS, self._SHAPE,
            'bilinear', np.nan, 16,
        )
        assert out.shape == self._SHAPE
        assert len(calls) == 1, (
            f"expected one try_numba_transform probe per chunk, "
            f"got {len(calls)} (#3106)"
        )

    def test_transform_coords_still_probes_when_given_crs(self, monkeypatch):
        """_transform_coords keeps its own probe for callers that have not
        tried the numba path yet (the cupy CPU fallbacks rely on it)."""
        from xrspatial.reproject import _transform_coords
        from xrspatial.reproject import _projections

        calls = []
        real = _projections.try_numba_transform

        def _spy(*args, **kwargs):
            calls.append(args)
            return real(*args, **kwargs)

        monkeypatch.setattr(_projections, 'try_numba_transform', _spy)

        src = pyproj.CRS('EPSG:4326')
        tgt = pyproj.CRS('EPSG:3857')
        transformer = pyproj.Transformer.from_crs(tgt, src, always_xy=True)
        _transform_coords(
            transformer, self._BOUNDS, self._SHAPE, 16,
            src_crs=src, tgt_crs=tgt,
        )
        assert len(calls) == 1

    def test_fallback_pair_values_match_pyproj_reference(self):
        """The skipped retry must not change the worker's coordinates:
        the pyproj fallback output stays identical for a no-fast-path
        pair (exact path, so it is directly comparable to pyproj)."""
        from xrspatial.reproject import _transform_coords

        src = pyproj.CRS('EPSG:4326')
        tgt = pyproj.CRS('ESRI:54009')
        transformer = pyproj.Transformer.from_crs(tgt, src, always_xy=True)
        src_y, src_x = _transform_coords(
            transformer, self._BOUNDS, self._SHAPE, 0,
        )

        h, w = self._SHAPE
        left, bottom, right, top = self._BOUNDS
        res_x = (right - left) / w
        res_y = (top - bottom) / h
        out_x = left + (np.arange(w) + 0.5) * res_x
        out_y = top - (np.arange(h) + 0.5) * res_y
        gx, gy = np.meshgrid(out_x, out_y)
        ref_x, ref_y = transformer.transform(gx.ravel(), gy.ravel())
        np.testing.assert_allclose(
            src_x, np.asarray(ref_x).reshape(h, w), atol=1e-9)
        np.testing.assert_allclose(
            src_y, np.asarray(ref_y).reshape(h, w), atol=1e-9)
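The tests above count probes with a hand-written `_spy` installed through pytest's `monkeypatch`. As a minimal alternative sketch using only the standard library, `unittest.mock.patch.object(..., wraps=...)` gives the same call-counting behaviour. The `fastpath` namespace, `worker` and `count_probes` are hypothetical names used for illustration and are not part of xrspatial.

```python
import types
from unittest import mock

# Hypothetical module-like namespace standing in for a real module.
fastpath = types.SimpleNamespace(probe=lambda value: value * 2)


def worker(value):
    """Look the probe up on the namespace at call time, so a patch is seen."""
    return fastpath.probe(value)


def count_probes(value):
    """Run worker() with the probe wrapped by a spy; return (result, calls)."""
    with mock.patch.object(fastpath, "probe", wraps=fastpath.probe) as spy:
        result = worker(value)
    return result, spy.call_count


result, n_calls = count_probes(21)
```

Either style only observes callers that resolve the function through the patched attribute at call time. A caller that bound the function to a local name at import would bypass the spy.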


@pytest.mark.skipif(not HAS_PYPROJ, reason="pyproj not installed")
class TestNonWgsDatumNumbaFastPath:
    """The numba fast path must not corrupt coordinates for non-WGS84