xarray-contrib · brendancol · Jun 10, 2026 · Jun 9, 2026 · Jun 9, 2026
diff --git a/.claude/sweep-performance-state.csv b/.claude/sweep-performance-state.csv
@@ -22,8 +22,8 @@ geotiff,2026-05-20,SAFE,IO-bound,0,2212,"Pass 13 (2026-05-20): 1 MEDIUM found an
 glcm,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,"Downgraded to MEDIUM. da.stack without rechunk is scheduling overhead, not OOM risk."
 hillshade,2026-04-16T12:00:00Z,SAFE,compute-bound,0,,"Re-audit after Horn's method rewrite (PR 1175): clean stencil, map_overlap depth=(1,1), no materialization. Zero findings."
 hydro,2026-05-01,RISKY,memory-bound,0,1416,"Fixed-in-tree 2026-05-01: hand_mfd._hand_mfd_dask now assembles via da.map_blocks instead of eager da.block of pre-computed tiles (matches hand_dinf pattern). Remaining MEDIUM: sink_d8 CCL fully materializes labels (inherently global), flow_accumulation_mfd frac_bdry held in driver dict instead of memmap-backed BoundaryStore. D8 iterative paths (flow_accum/fill/watershed/basin/stream_*) use serial-tile sweep with memmap-backed boundary store -- per-tile RAM bounded but driver iterates O(diameter) times. flow_direction_*, flow_path/snap_pour_point/twi/hand_d8/hand_dinf are SAFE."
-interpolate_spline,2026-06-04,SAFE,compute-bound,0,,"scope=spline-only. Audited _spline.py + _validation.py only (not _idw/_kriging). 1 MEDIUM (Cat3 GPU transfer): _spline_dask_cupy/_spline_cupy re-uploaded invariant x_pts/y_pts/weights host->device once per chunk. Fixed in PR #2929: added _tps_evaluate_gpu taking on-device point/weight arrays + only per-chunk grid slices; dask+cupy uploads invariants once at graph build (verified 48->3 on 16 chunks, scales with chunk count). numpy/cupy/dask+cupy parity ~1e-14. Added cupy+dask+cupy parity tests and an upload-count regression test (red without fix: 48!=3). _tps_cuda_kernel 30 regs/thread, 6 scalar locals -- no register pressure. CPU/dask+numpy eval @ngjit, row-major, no materialization. Dask graph probe 2560x2560/256 chunks = 200 tasks (2/chunk), no fan-in. Memory guard _check_spline_memory bounds N^2 solve. No issue filed -- gh issue create denied by auto-mode classifier; finding surfaced directly by sweep. GitHub issue field left empty."
 interpolate-kriging,2026-06-04,SAFE,graph-bound,0,2923,"MEDIUM: memory guard used full-grid k0 term on dask templates -> spurious MemoryError (issue #2923, fixed). LOW: _experimental_variogram nlags python loop vectorizable via bincount (~1.2x, pair-array materialization dominates) - doc only. Dask graph clean (2 tasks/chunk); cupy returns device arrays; no .values/.compute/.data.get materialization."
+interpolate_spline,2026-06-04,SAFE,compute-bound,0,,"scope=spline-only. Audited _spline.py + _validation.py only (not _idw/_kriging). 1 MEDIUM (Cat3 GPU transfer): _spline_dask_cupy/_spline_cupy re-uploaded invariant x_pts/y_pts/weights host->device once per chunk. Fixed in PR #2929: added _tps_evaluate_gpu taking on-device point/weight arrays + only per-chunk grid slices; dask+cupy uploads invariants once at graph build (verified 48->3 on 16 chunks, scales with chunk count). numpy/cupy/dask+cupy parity ~1e-14. Added cupy+dask+cupy parity tests and an upload-count regression test (red without fix: 48!=3). _tps_cuda_kernel 30 regs/thread, 6 scalar locals -- no register pressure. CPU/dask+numpy eval @ngjit, row-major, no materialization. Dask graph probe 2560x2560/256 chunks = 200 tasks (2/chunk), no fan-in. Memory guard _check_spline_memory bounds N^2 solve. No issue filed -- gh issue create denied by auto-mode classifier; finding surfaced directly by sweep. GitHub issue field left empty."
 kde,2026-04-14T12:00:00Z,SAFE,compute-bound,0,,Graph construction serialized per-tile. _filter_points_to_tile scans all points per tile. No HIGH findings.
 mahalanobis,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,False positive. Numpy path materializes by design. Dask path uses lazy reductions + map_blocks.
 morphology,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,
@@ -35,7 +35,7 @@ polygon_clip,2026-04-16T12:00:00Z,SAFE,compute-bound,0,1207,Re-audit 2026-04-16:
 polygonize,2026-05-29,RISKY,compute-bound,0,2608,"Pass 2 (2026-05-29): re-audit. 0 HIGH. 1 MEDIUM fixed (#2608): _polygonize_dask called dask.compute() once per chunk in a nested Python loop, serializing one chunk per scheduler round-trip. Fixed to batch one dask.compute() per chunk row. Output byte-identical (verified conn=4 and conn=8). Measured 2.79x faster on a 4-worker LocalCluster (1024x1024/64 chunks); threaded-scheduler win is marginal (~1.03x warm) since @ngjit kernels release the GIL. 8 new tests in test_polygonize_dask_row_batch_2608.py; 299 polygonize tests pass. Cat1 clean (no .values/.compute-in-loop wrapping dask; np.asarray at L1064/L2278 only wrap CPU input / user transform). Cat3: no @cuda.jit kernels; _polygonize_cupy GPU->CPU transfer is documented (boundary tracing is sequential, cannot run on GPU); cupy int path runs end-to-end ~2.2s/512x512, dominated by CPU _scan. Cat4 LOW (not fixed): _calculate_regions_cupy allocates bin_mask=(data==v) per unique value (O(n_unique) passes); verified low impact, _scan dominates. Cat5 clean. Cat6: RISKY unchanged -- driver accumulates O(total polygons) interior polys; per-row batch keeps peak bounded to one row. bottleneck=compute-bound (_scan). | Re-audit 2026-04-16 after PR 1190 NaN fix + 1176 simplification."
 proximity,2026-03-31T18:00:00Z,WILL OOM,memory-bound,3,1111,Memory guard added to line-sweep path. KDTree path (EUCLIDEAN/MANHATTAN + scipy) already had guards. GREAT_CIRCLE unbounded path already guarded.
 rasterize,2026-05-27,SAFE,graph-bound,0,2506,"Pass 3 (2026-05-27): re-audit identified 1 MEDIUM Cat-3 GPU-transfer finding. _run_cupy (L2065/L2083) and _rasterize_tile_cupy (L2541/L2555) called cupy.asarray(poly_props/poly_global) twice when all_touched=True -- once for the scanline poly_launch tuple and once for the supercover boundary_launch tuple. The two tuples reference the same per-tile props tables. Filed #2506 and fixed by hoisting the upload above the scanline/boundary conditional so both launches share the same device buffer. Microbench: 1000 polys/4 cols 0.051->0.024 ms/iter (2.1x); 10000 polys/8 cols 0.218->0.092 ms/iter (2.4x, saves 720 KB/tile of redundant H2D transfer). 12 new tests in test_rasterize_props_hoist_2506.py (4 AST-structural single-asarray-call assertions + 5 cupy all_touched parity merges + 3 dask+cupy smoke tests). All 470 rasterize tests pass. Dask graph probe: 25600x25600 chunks=1024 yields 2500 tasks for 625 tiles (4 tasks/chunk), unchanged. Noted pre-existing dask+cupy all_touched parity gap on boundary segments crossing tile borders (not addressed by this PR). SAFE/graph-bound verdict holds. | Pass 2 (2026-05-17): re-audit identified MEDIUM Cat-2/Cat-3 graph-bound waste in _run_dask_numpy/_run_dask_cupy -- full line_props/point_props embedded in every delayed tile task (polygon path already filtered via poly_props[pmask]). Filed #2020 and fixed: added _slice_props_for_tile helper to remap geom_idx and slice props per tile (mirrors polygon path). Measured 5000 points x 8 cols / 100 tiles graph shrank from ~30 MB to <0.3 MB (37x); localized lines from ~32 MB to ~1.1 MB. 9 new tests in test_rasterize_tile_props_slice_2020.py (helper unit tests + graph-payload bound + numpy/dask output parity for lines/points/sum-merge). All 184 existing rasterize tests pass; dask+cupy parity verified. Dask graph probe: 2560x2560 chunks=256 yields 400 tasks (4 tasks/chunk constant); 25600x25600 chunks=1024 yields 2500 tasks. cupy 512x512 returns cupy.ndarray with no host round-trip. CUDA _scanline_fill_gpu: 39 regs/thread, 24576 B local_mem/thread (matches static cuda.local.array allocations 2048*8 + 2048*4 bytes). SAFE/graph-bound verdict holds; previous 2026-04-15 false-positive on polygon filtering still valid. | Original (2026-04-15): Tile-by-tile graph construction with per-tile geometry filtering is the correct pattern. Pre-filtering ensures each delayed task gets only its relevant subset."
-reproject,2026-05-10,SAFE,compute-bound,1,1571,"Pass 5 (2026-05-10): 1 HIGH filed and fixed in tree -- issue #1571 + fix _merge_block_adapter same-CRS dask path. _place_same_crs in the dask adapter previously called src_data.compute() on the full source per output chunk (68x amplification measured on 256x256x2 source split into 32x32 output chunks, 8.9M pixels materialized vs 131K total source). Fix: added _place_same_crs_lazy at __init__.py:1716 that slices the source window first then computes only that slice. Verified post-fix: 1.00x ratio, 131K pixels materialized for 131K source. New regression test test_merge_dask_same_crs_bounded_materialization codifies the bound. Other audits clean: CUDA resample kernels use 16x16 blocks (cubic=46 regs, bilinear=36, nearest=22 -- well under the 64K-per-block limit, 0 local mem). _reproject_chunk_numpy/cupy already slice source first before .compute(). Dask graph at 25600x25600 src with 1024 chunks yields 4752 tasks (no per-chunk source dependency). _apply_vertical_shift uses in-place += that may not work on dask arrays -- correctness concern, not perf, defer to accuracy sweep."
+reproject,2026-06-09,SAFE,compute-bound,0,3106,"Pass 6 (2026-06-09): 0 HIGH. 1 MEDIUM found and fixed (#3106): _reproject_chunk_numpy probed try_numba_transform, then _transform_coords probed it again before the pyproj fallback -- each wasted probe re-parses CRS params (~10 pyproj to_dict/to_authority round-trips) and allocates 4 chunk-sized float64 coordinate arrays. Measured 512x512 chunk, 4326->ESRI:54009: ~0.3-0.5 ms/probe, ~11% of the 5.3 ms chunk worker, repeated per output chunk on dask+numpy and merge per-block paths. Fix: worker passes no CRS objects to _transform_coords (inner retry gated on both non-None); cupy CPU fallbacks keep the inner probe (their first numba attempt). 3 new tests (TestNoDuplicateNumbaFastPathProbe); 447 reproject tests pass. LOW (not fixed, documented): try_numba_transform allocates 4 flat arrays before branch dispatch -- wasted for the lcc/tmerc 2D-kernel branches and unsupported pairs; _resample_cupy_native does a redundant .copy() when nodata is non-NaN and the caller already passed a fresh float64 copy; per-projection param extractors (_lcc_params etc.) call crs.to_dict() without the UserWarning suppression that _get_datum_params got in #3076, so fallback chunks emit pyproj warning spam. Dask graph probe: 2560x2560/256 chunks -> 216 tasks for 108 output chunks (2/chunk, 2 layers); merge 2 inputs -> 64 tasks/32 chunks. Source window per task capped at 64 Mpix. GPU validated on host (CUDA available): cupy 1024^2 fast path 13 ms, try_cuda_transform stays on-device, dask+cupy end-to-end OK, numpy/cupy max abs diff 2e-12, NaN positions identical. SAFE/compute-bound holds. | Pass 5 (2026-05-10): 1 HIGH filed and fixed in tree -- issue #1571 + fix _merge_block_adapter same-CRS dask path. _place_same_crs in the dask adapter previously called src_data.compute() on the full source per output chunk (68x amplification measured on 256x256x2 source split into 32x32 output chunks, 8.9M pixels materialized vs 131K total source). Fix: added _place_same_crs_lazy at __init__.py:1716 that slices the source window first then computes only that slice. Verified post-fix: 1.00x ratio, 131K pixels materialized for 131K source. New regression test test_merge_dask_same_crs_bounded_materialization codifies the bound. Other audits clean: CUDA resample kernels use 16x16 blocks (cubic=46 regs, bilinear=36, nearest=22 -- well under the 64K-per-block limit, 0 local mem). _reproject_chunk_numpy/cupy already slice source first before .compute(). Dask graph at 25600x25600 src with 1024 chunks yields 4752 tasks (no per-chunk source dependency). _apply_vertical_shift uses in-place += that may not work on dask arrays -- correctness concern, not perf, defer to accuracy sweep."
 resample,2026-04-15T12:00:00Z,SAFE,compute-bound,0,false-positive,Downgraded. GPU-CPU-GPU round-trip only in aggregate path for non-integer scale factors. Interpolation (nearest/bilinear/cubic) stays on GPU. No GPU kernel exists for irregular per-pixel binning.
 sieve,2026-04-14T12:00:00Z,WILL OOM,memory-bound,0,false-positive,False positive. Memory guards already in place on both dask paths. CCL is inherently global — documented limitation. CuPy CPU fallback is deliberate and documented.
 sky_view_factor,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,

diff --git a/xrspatial/reproject/__init__.py b/xrspatial/reproject/__init__.py
@@ -377,9 +377,13 @@ def _reproject_chunk_numpy(
         transformer = pyproj.Transformer.from_crs(
             tgt_crs, src_crs, always_xy=True
         )
+        # Pass src_crs/tgt_crs as None: the numba fast path was already
+        # tried above and returned None, and _transform_coords gates its
+        # own try_numba_transform retry on both CRSes being non-None.
+        # Re-trying would repeat the CRS param parsing and chunk-sized
+        # coordinate allocations for nothing (#3106).
         src_y, src_x = _transform_coords(
             transformer, chunk_bounds_tuple, chunk_shape, transform_precision,
-            src_crs=src_crs, tgt_crs=tgt_crs,
         )
 
     # Convert source CRS coordinates to source pixel coordinates

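Reviewer note: the change above works because the inner retry is gated on both CRS arguments being non-None, so omitting them skips the second probe. The sketch below models that control flow with hypothetical stand-ins (`try_fast_path`, `transform_coords`, `chunk_worker`); these are illustrative names, not the xrspatial functions, and the return values are placeholders.

```python
# Illustrative sketch only: hypothetical stand-ins, not the xrspatial code.
calls = []  # records every fast-path probe so the count can be checked


def try_fast_path(src_crs, tgt_crs):
    """Stand-in probe: log the call and report that no fast path exists."""
    calls.append((src_crs, tgt_crs))
    return None


def transform_coords(bounds, src_crs=None, tgt_crs=None):
    """Retry the fast path only when both CRS arguments are provided."""
    if src_crs is not None and tgt_crs is not None:
        result = try_fast_path(src_crs, tgt_crs)
        if result is not None:
            return result
    # Placeholder for the slower general-purpose transform.
    return ("fallback", bounds)


def chunk_worker(bounds, src_crs, tgt_crs):
    """Probe once, then call the fallback without CRS objects."""
    result = try_fast_path(src_crs, tgt_crs)
    if result is not None:
        return result
    # Omitting src_crs/tgt_crs disables the inner retry.
    return transform_coords(bounds)


out = chunk_worker((0, 0, 1, 1), "EPSG:4326", "ESRI:54009")
```

Callers that have not probed yet can still pass both CRS arguments to `transform_coords` and get exactly one probe, which is the behaviour the second test in this PR checks for the real function.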
diff --git a/xrspatial/tests/test_reproject.py b/xrspatial/tests/test_reproject.py
@@ -6843,6 +6843,107 @@ def test_reproject_precision_zero_matches_default_for_smooth_pair(self):
         np.testing.assert_allclose(a[both], b[both], rtol=0, atol=1e-3)
 
 
+@pytest.mark.skipif(not HAS_PYPROJ, reason="pyproj not installed")
+class TestNoDuplicateNumbaFastPathProbe:
+    """The numpy chunk worker must probe the numba fast path exactly once
+    per chunk (#3106).
+
+    The bug: for CRS pairs with no fast path, _reproject_chunk_numpy
+    called try_numba_transform (None), then fell into _transform_coords
+    which called try_numba_transform again before the pyproj control
+    grid. Each wasted probe re-parses CRS params and allocates four
+    chunk-sized coordinate arrays.
+    """
+
+    _BOUNDS = (-2_000_000.0, 4_000_000.0, -1_000_000.0, 5_000_000.0)
+    _SHAPE = (16, 16)
+
+    @staticmethod
+    def _wkts():
+        # WGS84 -> Mollweide has no numba fast path, so the worker takes
+        # the pyproj fallback where the duplicate probe used to happen.
+        return (pyproj.CRS('EPSG:4326').to_wkt(),
+                pyproj.CRS('ESRI:54009').to_wkt())
+
+    def test_chunk_numpy_probes_fast_path_exactly_once(self, monkeypatch):
+        from xrspatial.reproject import _reproject_chunk_numpy
+        from xrspatial.reproject import _projections
+
+        calls = []
+        real = _projections.try_numba_transform
+
+        def _spy(*args, **kwargs):
+            calls.append(args)
+            return real(*args, **kwargs)
+
+        monkeypatch.setattr(_projections, 'try_numba_transform', _spy)
+
+        src_wkt, tgt_wkt = self._wkts()
+        source_data = np.arange(32 * 32, dtype=np.float64).reshape(32, 32)
+        out = _reproject_chunk_numpy(
+            source_data,
+            (-20.0, 35.0, -10.0, 45.0), (32, 32), True,
+            src_wkt, tgt_wkt,
+            self._BOUNDS, self._SHAPE,
+            'bilinear', np.nan, 16,
+        )
+        assert out.shape == self._SHAPE
+        assert len(calls) == 1, (
+            f"expected one try_numba_transform probe per chunk, "
+            f"got {len(calls)} (#3106)"
+        )
+
+    def test_transform_coords_still_probes_when_given_crs(self, monkeypatch):
+        """_transform_coords keeps its own probe for callers that have not
+        tried the numba path yet (the cupy CPU fallbacks rely on it)."""
+        from xrspatial.reproject import _transform_coords
+        from xrspatial.reproject import _projections
+
+        calls = []
+        real = _projections.try_numba_transform
+
+        def _spy(*args, **kwargs):
+            calls.append(args)
+            return real(*args, **kwargs)
+
+        monkeypatch.setattr(_projections, 'try_numba_transform', _spy)
+
+        src = pyproj.CRS('EPSG:4326')
+        tgt = pyproj.CRS('EPSG:3857')
+        transformer = pyproj.Transformer.from_crs(tgt, src, always_xy=True)
+        _transform_coords(
+            transformer, self._BOUNDS, self._SHAPE, 16,
+            src_crs=src, tgt_crs=tgt,
+        )
+        assert len(calls) == 1
+
+    def test_fallback_pair_values_match_pyproj_reference(self):
+        """The skipped retry must not change the worker's coordinates:
+        the pyproj fallback output stays identical for a no-fast-path
+        pair (exact path, so it is directly comparable to pyproj)."""
+        from xrspatial.reproject import _transform_coords
+
+        src = pyproj.CRS('EPSG:4326')
+        tgt = pyproj.CRS('ESRI:54009')
+        transformer = pyproj.Transformer.from_crs(tgt, src, always_xy=True)
+        src_y, src_x = _transform_coords(
+            transformer, self._BOUNDS, self._SHAPE, 0,
+        )
+
+        h, w = self._SHAPE
+        left, bottom, right, top = self._BOUNDS
+        res_x = (right - left) / w
+        res_y = (top - bottom) / h
+        out_x = left + (np.arange(w) + 0.5) * res_x
+        out_y = top - (np.arange(h) + 0.5) * res_y
+        gx, gy = np.meshgrid(out_x, out_y)
+        ref_x, ref_y = transformer.transform(gx.ravel(), gy.ravel())
+        np.testing.assert_allclose(
+            src_x, np.asarray(ref_x).reshape(h, w), atol=1e-9)
+        np.testing.assert_allclose(
+            src_y, np.asarray(ref_y).reshape(h, w), atol=1e-9)
+
+
 @pytest.mark.skipif(not HAS_PYPROJ, reason="pyproj not installed")
 class TestNonWgsDatumNumbaFastPath:
     """The numba fast path must not corrupt coordinates for non-WGS84