Drop two avoidable full-raster allocations in rasterize backends (#3107)#3131
Conversation
All four backends now return through astype(dtype, copy=False), so the default float64 case skips a full-raster copy. The CPU paths allocate the per-pixel order buffer as int8 instead of int64 for merges whose predicate never reads it (max/min/sum/count and user callables); first/last keep the int64 buffer. Measured on a 4000x4000 numpy rasterize: peak goes from 25 B/px to 10 B/px for merge='sum' and 17 B/px for merge='last'.
brendancol
left a comment
There was a problem hiding this comment.
Review scope: the two allocation changes in xrspatial/rasterize.py and the new test_rasterize_alloc_3107.py. Checked against the four-backend dispatch, the merge-mode contract (ordered vs order-insensitive predicates), and the dask tile path.
Blockers
None. The copy=False casts return function-local buffers, so no aliasing is possible. The int8 order buffer is only ever selected for _should_write_any, whose cur_idx argument is ignored, and a failed identity check degrades to int64 (the old behavior), never to wrong output. GPU paths are untouched apart from copy=False; order stays int64 where it is an atomic target. Parity for all six merge modes plus a user callable is locked by tests with 300+ geometries, which forces the wrapped int8 stores through their full range.
Suggestions
order[r, c] = new_idxwraps in numba-compiled code (verified), but in pure-Python mode (NUMBA_DISABLE_JIT=1) NumPy 2 raisesOverflowErroroncenew_idx > 127. CI never runs that mode, but someone debugging rasterize that way with a realistic dataset will hit it. Add a sentence to the_alloc_orderdocstring so the failure is self-explaining.test_order_insensitive_merge_valuescloses withassert np.isfinite(inside), which is much weaker than the rest of the test. The point pixel (48, 16) has an exact expected value per merge (sum 45750, count 301, min 1, max 600); assert those instead.
Nits
- The peak-memory thresholds (16 and 21 B/px) are generous relative to the measured 10 and ~17.5, which is the right call given this repo's history with resource-assertion flakes (#2889). The docstring already records the before/after numbers. No change requested.
tracemalloc-based assertions are deterministic (allocation counts, not wall clock), so they don't reintroduce the timing-flake class that #2889 removed.
brendancol
left a comment
There was a problem hiding this comment.
Follow-up pass after a0903e7.
- Suggestion 1: fixed.
_alloc_orderdocstring now spells out theNUMBA_DISABLE_JIT=1OverflowError caveat and the escape hatch (force the int64 branch). - Suggestion 2: fixed.
test_order_insensitive_merge_valuesnow asserts exact values at both the boxes-only pixel (60, 4) and the point-burn pixel (48, 16) for all four merges; the weak isfinite check is gone. - Nit 3: no change requested, none made.
All 21 tests in test_rasterize_alloc_3107.py pass after the changes; flake8 clean. Nothing else outstanding.
…e-rasterize-2026-06-09-01 # Conflicts: # .claude/sweep-performance-state.csv
Closes #3107.
astype(dtype, copy=False), so the default float64 case no longer copies the full work buffer (_run_numpy,_run_cupy,_rasterize_tile_numpy,_rasterize_tile_cupy). The buffer is local to each function, so nothing aliases it.orderbuffer through a new_alloc_orderhelper: int8 for merges whose predicate never reads it (max/min/sum/countand user callables), int64 only forfirst/last. The int8 buffer still receives int64 owner-index stores; numba wraps them and nothing reads the values back. GPU paths keep int64 becauseorderis an atomic target there.Measured with tracemalloc on a 4000x4000 numpy rasterize: peak drops from 25 B/px to 10 B/px for
merge='sum'and to 17 B/px formerge='last'.Backends: numpy and dask+numpy get both changes; cupy and dask+cupy get the
copy=Falsechange. Verified on a CUDA host: cupylast/sum/maxand dask+cupysummatch the CPU backend, output stays on device, and the dask graph is unchanged (400 tasks for 100 chunks).Test plan:
test_rasterize_alloc_3107.py:_alloc_orderdtype selection, output parity with 300+ geometries (int8 owner indices wrap repeatedly), numpy/dask parity for all six merge modes plus a user callable, tracemalloc peak-memory bounds, dtype casts still applied.Found by the performance sweep (Cat 4, memory allocation). The sweep state CSV update for the rasterize row rides along in the first commit.