diff --git a/.claude/sweep-performance-state.csv b/.claude/sweep-performance-state.csv index 84b8a8ab7..54418a3e8 100644 --- a/.claude/sweep-performance-state.csv +++ b/.claude/sweep-performance-state.csv @@ -22,8 +22,8 @@ geotiff,2026-05-20,SAFE,IO-bound,0,2212,"Pass 13 (2026-05-20): 1 MEDIUM found an glcm,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,"Downgraded to MEDIUM. da.stack without rechunk is scheduling overhead, not OOM risk." hillshade,2026-04-16T12:00:00Z,SAFE,compute-bound,0,,"Re-audit after Horn's method rewrite (PR 1175): clean stencil, map_overlap depth=(1,1), no materialization. Zero findings." hydro,2026-05-01,RISKY,memory-bound,0,1416,"Fixed-in-tree 2026-05-01: hand_mfd._hand_mfd_dask now assembles via da.map_blocks instead of eager da.block of pre-computed tiles (matches hand_dinf pattern). Remaining MEDIUM: sink_d8 CCL fully materializes labels (inherently global), flow_accumulation_mfd frac_bdry held in driver dict instead of memmap-backed BoundaryStore. D8 iterative paths (flow_accum/fill/watershed/basin/stream_*) use serial-tile sweep with memmap-backed boundary store -- per-tile RAM bounded but driver iterates O(diameter) times. flow_direction_*, flow_path/snap_pour_point/twi/hand_d8/hand_dinf are SAFE." -interpolate_spline,2026-06-04,SAFE,compute-bound,0,,"scope=spline-only. Audited _spline.py + _validation.py only (not _idw/_kriging). 1 MEDIUM (Cat3 GPU transfer): _spline_dask_cupy/_spline_cupy re-uploaded invariant x_pts/y_pts/weights host->device once per chunk. Fixed in PR #2929: added _tps_evaluate_gpu taking on-device point/weight arrays + only per-chunk grid slices; dask+cupy uploads invariants once at graph build (verified 48->3 on 16 chunks, scales with chunk count). numpy/cupy/dask+cupy parity ~1e-14. Added cupy+dask+cupy parity tests and an upload-count regression test (red without fix: 48!=3). _tps_cuda_kernel 30 regs/thread, 6 scalar locals -- no register pressure. CPU/dask+numpy eval @ngjit, row-major, no materialization. Dask graph probe 2560x2560/256 chunks = 200 tasks (2/chunk), no fan-in. Memory guard _check_spline_memory bounds N^2 solve. No issue filed -- gh issue create denied by auto-mode classifier; finding surfaced directly by sweep. GitHub issue field left empty." interpolate-kriging,2026-06-04,SAFE,graph-bound,0,2923,"MEDIUM: memory guard used full-grid k0 term on dask templates -> spurious MemoryError (issue #2923, fixed). LOW: _experimental_variogram nlags python loop vectorizable via bincount (~1.2x, pair-array materialization dominates) - doc only. Dask graph clean (2 tasks/chunk); cupy returns device arrays; no .values/.compute/.data.get materialization." +interpolate_spline,2026-06-04,SAFE,compute-bound,0,,"scope=spline-only. Audited _spline.py + _validation.py only (not _idw/_kriging). 1 MEDIUM (Cat3 GPU transfer): _spline_dask_cupy/_spline_cupy re-uploaded invariant x_pts/y_pts/weights host->device once per chunk. Fixed in PR #2929: added _tps_evaluate_gpu taking on-device point/weight arrays + only per-chunk grid slices; dask+cupy uploads invariants once at graph build (verified 48->3 on 16 chunks, scales with chunk count). numpy/cupy/dask+cupy parity ~1e-14. Added cupy+dask+cupy parity tests and an upload-count regression test (red without fix: 48!=3). _tps_cuda_kernel 30 regs/thread, 6 scalar locals -- no register pressure. CPU/dask+numpy eval @ngjit, row-major, no materialization. Dask graph probe 2560x2560/256 chunks = 200 tasks (2/chunk), no fan-in. Memory guard _check_spline_memory bounds N^2 solve. No issue filed -- gh issue create denied by auto-mode classifier; finding surfaced directly by sweep. GitHub issue field left empty." kde,2026-04-14T12:00:00Z,SAFE,compute-bound,0,,Graph construction serialized per-tile. _filter_points_to_tile scans all points per tile. No HIGH findings. mahalanobis,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,False positive. Numpy path materializes by design. Dask path uses lazy reductions + map_blocks. morphology,2026-03-31T18:00:00Z,SAFE,compute-bound,0,, @@ -35,7 +35,7 @@ polygon_clip,2026-04-16T12:00:00Z,SAFE,compute-bound,0,1207,Re-audit 2026-04-16: polygonize,2026-05-29,RISKY,compute-bound,0,2608,"Pass 2 (2026-05-29): re-audit. 0 HIGH. 1 MEDIUM fixed (#2608): _polygonize_dask called dask.compute() once per chunk in a nested Python loop, serializing one chunk per scheduler round-trip. Fixed to batch one dask.compute() per chunk row. Output byte-identical (verified conn=4 and conn=8). Measured 2.79x faster on a 4-worker LocalCluster (1024x1024/64 chunks); threaded-scheduler win is marginal (~1.03x warm) since @ngjit kernels release the GIL. 8 new tests in test_polygonize_dask_row_batch_2608.py; 299 polygonize tests pass. Cat1 clean (no .values/.compute-in-loop wrapping dask; np.asarray at L1064/L2278 only wrap CPU input / user transform). Cat3: no @cuda.jit kernels; _polygonize_cupy GPU->CPU transfer is documented (boundary tracing is sequential, cannot run on GPU); cupy int path runs end-to-end ~2.2s/512x512, dominated by CPU _scan. Cat4 LOW (not fixed): _calculate_regions_cupy allocates bin_mask=(data==v) per unique value (O(n_unique) passes); verified low impact, _scan dominates. Cat5 clean. Cat6: RISKY unchanged -- driver accumulates O(total polygons) interior polys; per-row batch keeps peak bounded to one row. bottleneck=compute-bound (_scan). | Re-audit 2026-04-16 after PR 1190 NaN fix + 1176 simplification." proximity,2026-03-31T18:00:00Z,WILL OOM,memory-bound,3,1111,Memory guard added to line-sweep path. KDTree path (EUCLIDEAN/MANHATTAN + scipy) already had guards. GREAT_CIRCLE unbounded path already guarded. rasterize,2026-05-27,SAFE,graph-bound,0,2506,"Pass 3 (2026-05-27): re-audit identified 1 MEDIUM Cat-3 GPU-transfer finding. _run_cupy (L2065/L2083) and _rasterize_tile_cupy (L2541/L2555) called cupy.asarray(poly_props/poly_global) twice when all_touched=True -- once for the scanline poly_launch tuple and once for the supercover boundary_launch tuple. The two tuples reference the same per-tile props tables. Filed #2506 and fixed by hoisting the upload above the scanline/boundary conditional so both launches share the same device buffer. Microbench: 1000 polys/4 cols 0.051->0.024 ms/iter (2.1x); 10000 polys/8 cols 0.218->0.092 ms/iter (2.4x, saves 720 KB/tile of redundant H2D transfer). 12 new tests in test_rasterize_props_hoist_2506.py (4 AST-structural single-asarray-call assertions + 5 cupy all_touched parity merges + 3 dask+cupy smoke tests). All 470 rasterize tests pass. Dask graph probe: 25600x25600 chunks=1024 yields 2500 tasks for 625 tiles (4 tasks/chunk), unchanged. Noted pre-existing dask+cupy all_touched parity gap on boundary segments crossing tile borders (not addressed by this PR). SAFE/graph-bound verdict holds. | Pass 2 (2026-05-17): re-audit identified MEDIUM Cat-2/Cat-3 graph-bound waste in _run_dask_numpy/_run_dask_cupy -- full line_props/point_props embedded in every delayed tile task (polygon path already filtered via poly_props[pmask]). Filed #2020 and fixed: added _slice_props_for_tile helper to remap geom_idx and slice props per tile (mirrors polygon path). Measured 5000 points x 8 cols / 100 tiles graph shrank from ~30 MB to <0.3 MB (37x); localized lines from ~32 MB to ~1.1 MB. 9 new tests in test_rasterize_tile_props_slice_2020.py (helper unit tests + graph-payload bound + numpy/dask output parity for lines/points/sum-merge). All 184 existing rasterize tests pass; dask+cupy parity verified. Dask graph probe: 2560x2560 chunks=256 yields 400 tasks (4 tasks/chunk constant); 25600x25600 chunks=1024 yields 2500 tasks. cupy 512x512 returns cupy.ndarray with no host round-trip. CUDA _scanline_fill_gpu: 39 regs/thread, 24576 B local_mem/thread (matches static cuda.local.array allocations 2048*8 + 2048*4 bytes). SAFE/graph-bound verdict holds; previous 2026-04-15 false-positive on polygon filtering still valid. | Original (2026-04-15): Tile-by-tile graph construction with per-tile geometry filtering is the correct pattern. Pre-filtering ensures each delayed task gets only its relevant subset." -reproject,2026-05-10,SAFE,compute-bound,1,1571,"Pass 5 (2026-05-10): 1 HIGH filed and fixed in tree -- issue #1571 + fix _merge_block_adapter same-CRS dask path. _place_same_crs in the dask adapter previously called src_data.compute() on the full source per output chunk (68x amplification measured on 256x256x2 source split into 32x32 output chunks, 8.9M pixels materialized vs 131K total source). Fix: added _place_same_crs_lazy at __init__.py:1716 that slices the source window first then computes only that slice. Verified post-fix: 1.00x ratio, 131K pixels materialized for 131K source. New regression test test_merge_dask_same_crs_bounded_materialization codifies the bound. Other audits clean: CUDA resample kernels use 16x16 blocks (cubic=46 regs, bilinear=36, nearest=22 -- well under the 64K-per-block limit, 0 local mem). _reproject_chunk_numpy/cupy already slice source first before .compute(). Dask graph at 25600x25600 src with 1024 chunks yields 4752 tasks (no per-chunk source dependency). _apply_vertical_shift uses in-place += that may not work on dask arrays -- correctness concern, not perf, defer to accuracy sweep." +reproject,2026-06-09,SAFE,compute-bound,0,3106,"Pass 6 (2026-06-09): 0 HIGH. 1 MEDIUM found and fixed (#3106): _reproject_chunk_numpy probed try_numba_transform, then _transform_coords probed it again before the pyproj fallback -- each wasted probe re-parses CRS params (~10 pyproj to_dict/to_authority round-trips) and allocates 4 chunk-sized float64 coordinate arrays. Measured 512x512 chunk, 4326->ESRI:54009: ~0.3-0.5 ms/probe, ~11% of the 5.3 ms chunk worker, repeated per output chunk on dask+numpy and merge per-block paths. Fix: worker passes no CRS objects to _transform_coords (inner retry gated on both non-None); cupy CPU fallbacks keep the inner probe (their first numba attempt). 3 new tests (TestNoDuplicateNumbaFastPathProbe); 447 reproject tests pass. LOW (not fixed, documented): try_numba_transform allocates 4 flat arrays before branch dispatch -- wasted for the lcc/tmerc 2D-kernel branches and unsupported pairs; _resample_cupy_native does a redundant .copy() when nodata is non-NaN and the caller already passed a fresh float64 copy; per-projection param extractors (_lcc_params etc.) call crs.to_dict() without the UserWarning suppression that _get_datum_params got in #3076, so fallback chunks emit pyproj warning spam. Dask graph probe: 2560x2560/256 chunks -> 216 tasks for 108 output chunks (2/chunk, 2 layers); merge 2 inputs -> 64 tasks/32 chunks. Source window per task capped at 64 Mpix. GPU validated on host (CUDA available): cupy 1024^2 fast path 13 ms, try_cuda_transform stays on-device, dask+cupy end-to-end OK, numpy/cupy max abs diff 2e-12, NaN positions identical. SAFE/compute-bound holds. | Pass 5 (2026-05-10): 1 HIGH filed and fixed in tree -- issue #1571 + fix _merge_block_adapter same-CRS dask path. _place_same_crs in the dask adapter previously called src_data.compute() on the full source per output chunk (68x amplification measured on 256x256x2 source split into 32x32 output chunks, 8.9M pixels materialized vs 131K total source). Fix: added _place_same_crs_lazy at __init__.py:1716 that slices the source window first then computes only that slice. Verified post-fix: 1.00x ratio, 131K pixels materialized for 131K source. New regression test test_merge_dask_same_crs_bounded_materialization codifies the bound. Other audits clean: CUDA resample kernels use 16x16 blocks (cubic=46 regs, bilinear=36, nearest=22 -- well under the 64K-per-block limit, 0 local mem). _reproject_chunk_numpy/cupy already slice source first before .compute(). Dask graph at 25600x25600 src with 1024 chunks yields 4752 tasks (no per-chunk source dependency). _apply_vertical_shift uses in-place += that may not work on dask arrays -- correctness concern, not perf, defer to accuracy sweep." resample,2026-04-15T12:00:00Z,SAFE,compute-bound,0,false-positive,Downgraded. GPU-CPU-GPU round-trip only in aggregate path for non-integer scale factors. Interpolation (nearest/bilinear/cubic) stays on GPU. No GPU kernel exists for irregular per-pixel binning. sieve,2026-04-14T12:00:00Z,WILL OOM,memory-bound,0,false-positive,False positive. Memory guards already in place on both dask paths. CCL is inherently global — documented limitation. CuPy CPU fallback is deliberate and documented. sky_view_factor,2026-03-31T18:00:00Z,SAFE,compute-bound,0,, diff --git a/xrspatial/reproject/__init__.py b/xrspatial/reproject/__init__.py index 94a623b99..eab299cd1 100644 --- a/xrspatial/reproject/__init__.py +++ b/xrspatial/reproject/__init__.py @@ -377,9 +377,13 @@ def _reproject_chunk_numpy( transformer = pyproj.Transformer.from_crs( tgt_crs, src_crs, always_xy=True ) + # Pass src_crs/tgt_crs as None: the numba fast path was already + # tried above and returned None, and _transform_coords gates its + # own try_numba_transform retry on both CRSes being non-None. + # Re-trying would repeat the CRS param parsing and chunk-sized + # coordinate allocations for nothing (#3106). src_y, src_x = _transform_coords( transformer, chunk_bounds_tuple, chunk_shape, transform_precision, - src_crs=src_crs, tgt_crs=tgt_crs, ) # Convert source CRS coordinates to source pixel coordinates diff --git a/xrspatial/tests/test_reproject.py b/xrspatial/tests/test_reproject.py index 11d7a2829..94ec591d9 100644 --- a/xrspatial/tests/test_reproject.py +++ b/xrspatial/tests/test_reproject.py @@ -6843,6 +6843,107 @@ def test_reproject_precision_zero_matches_default_for_smooth_pair(self): np.testing.assert_allclose(a[both], b[both], rtol=0, atol=1e-3) +@pytest.mark.skipif(not HAS_PYPROJ, reason="pyproj not installed") +class TestNoDuplicateNumbaFastPathProbe: + """The numpy chunk worker must probe the numba fast path exactly once + per chunk (#3106). + + The bug: for CRS pairs with no fast path, _reproject_chunk_numpy + called try_numba_transform (None), then fell into _transform_coords + which called try_numba_transform again before the pyproj control + grid. Each wasted probe re-parses CRS params and allocates four + chunk-sized coordinate arrays. + """ + + _BOUNDS = (-2_000_000.0, 4_000_000.0, -1_000_000.0, 5_000_000.0) + _SHAPE = (16, 16) + + @staticmethod + def _wkts(): + # WGS84 -> Mollweide has no numba fast path, so the worker takes + # the pyproj fallback where the duplicate probe used to happen. + return (pyproj.CRS('EPSG:4326').to_wkt(), + pyproj.CRS('ESRI:54009').to_wkt()) + + def test_chunk_numpy_probes_fast_path_exactly_once(self, monkeypatch): + from xrspatial.reproject import _reproject_chunk_numpy + from xrspatial.reproject import _projections + + calls = [] + real = _projections.try_numba_transform + + def _spy(*args, **kwargs): + calls.append(args) + return real(*args, **kwargs) + + monkeypatch.setattr(_projections, 'try_numba_transform', _spy) + + src_wkt, tgt_wkt = self._wkts() + source_data = np.arange(32 * 32, dtype=np.float64).reshape(32, 32) + out = _reproject_chunk_numpy( + source_data, + (-20.0, 35.0, -10.0, 45.0), (32, 32), True, + src_wkt, tgt_wkt, + self._BOUNDS, self._SHAPE, + 'bilinear', np.nan, 16, + ) + assert out.shape == self._SHAPE + assert len(calls) == 1, ( + f"expected one try_numba_transform probe per chunk, " + f"got {len(calls)} (#3106)" + ) + + def test_transform_coords_still_probes_when_given_crs(self, monkeypatch): + """_transform_coords keeps its own probe for callers that have not + tried the numba path yet (the cupy CPU fallbacks rely on it).""" + from xrspatial.reproject import _transform_coords + from xrspatial.reproject import _projections + + calls = [] + real = _projections.try_numba_transform + + def _spy(*args, **kwargs): + calls.append(args) + return real(*args, **kwargs) + + monkeypatch.setattr(_projections, 'try_numba_transform', _spy) + + src = pyproj.CRS('EPSG:4326') + tgt = pyproj.CRS('EPSG:3857') + transformer = pyproj.Transformer.from_crs(tgt, src, always_xy=True) + _transform_coords( + transformer, self._BOUNDS, self._SHAPE, 16, + src_crs=src, tgt_crs=tgt, + ) + assert len(calls) == 1 + + def test_fallback_pair_values_match_pyproj_reference(self): + """The skipped retry must not change the worker's coordinates: + the pyproj fallback output stays identical for a no-fast-path + pair (exact path, so it is directly comparable to pyproj).""" + from xrspatial.reproject import _transform_coords + + src = pyproj.CRS('EPSG:4326') + tgt = pyproj.CRS('ESRI:54009') + transformer = pyproj.Transformer.from_crs(tgt, src, always_xy=True) + src_y, src_x = _transform_coords( + transformer, self._BOUNDS, self._SHAPE, 0, + ) + + h, w = self._SHAPE + left, bottom, right, top = self._BOUNDS + res_x = (right - left) / w + res_y = (top - bottom) / h + out_x = left + (np.arange(w) + 0.5) * res_x + out_y = top - (np.arange(h) + 0.5) * res_y + gx, gy = np.meshgrid(out_x, out_y) + ref_x, ref_y = transformer.transform(gx.ravel(), gy.ravel()) + np.testing.assert_allclose( + src_x, np.asarray(ref_x).reshape(h, w), atol=1e-9) + np.testing.assert_allclose( + src_y, np.asarray(ref_y).reshape(h, w), atol=1e-9) + + @pytest.mark.skipif(not HAS_PYPROJ, reason="pyproj not installed") class TestNonWgsDatumNumbaFastPath: """The numba fast path must not corrupt coordinates for non-WGS84