Skip to content

Drop two avoidable full-raster allocations in rasterize backends (#3107)#3131

Merged
brendancol merged 4 commits into
mainfrom
deep-sweep-performance-rasterize-2026-06-09-01
Jun 10, 2026
Merged

Drop two avoidable full-raster allocations in rasterize backends (#3107)#3131
brendancol merged 4 commits into
mainfrom
deep-sweep-performance-rasterize-2026-06-09-01

Conversation

@brendancol

Copy link
Copy Markdown
Contributor

Closes #3107.

  • All four backends now return through astype(dtype, copy=False), so the default float64 case no longer copies the full work buffer (_run_numpy, _run_cupy, _rasterize_tile_numpy, _rasterize_tile_cupy). The buffer is local to each function, so nothing aliases it.
  • The CPU paths allocate the per-pixel order buffer through a new _alloc_order helper: int8 for merges whose predicate never reads it (max/min/sum/count and user callables), int64 only for first/last. The int8 buffer still receives int64 owner-index stores; numba wraps them and nothing reads the values back. GPU paths keep int64 because order is an atomic target there.

Measured with tracemalloc on a 4000x4000 numpy rasterize: peak drops from 25 B/px to 10 B/px for merge='sum' and to 17 B/px for merge='last'.

Backends: numpy and dask+numpy get both changes; cupy and dask+cupy get the copy=False change. Verified on a CUDA host: cupy last/sum/max and dask+cupy sum match the CPU backend, output stays on device, and the dask graph is unchanged (400 tasks for 100 chunks).

Test plan:

  • New test_rasterize_alloc_3107.py: _alloc_order dtype selection, output parity with 300+ geometries (int8 owner indices wrap repeatedly), numpy/dask parity for all six merge modes plus a user callable, tracemalloc peak-memory bounds, dtype casts still applied.
  • Full rasterize suite: 662 passed, 2 skipped (includes the GPU tests on this host).

Found by the performance sweep (Cat 4, memory allocation). The sweep state CSV update for the rasterize row rides along in the first commit.

All four backends now return through astype(dtype, copy=False), so the
default float64 case skips a full-raster copy. The CPU paths allocate
the per-pixel order buffer as int8 instead of int64 for merges whose
predicate never reads it (max/min/sum/count and user callables);
first/last keep the int64 buffer.

Measured on a 4000x4000 numpy rasterize: peak goes from 25 B/px to
10 B/px for merge='sum' and 17 B/px for merge='last'.

@brendancol brendancol left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review scope: the two allocation changes in xrspatial/rasterize.py and the new test_rasterize_alloc_3107.py. Checked against the four-backend dispatch, the merge-mode contract (ordered vs order-insensitive predicates), and the dask tile path.

Blockers

None. The copy=False casts return function-local buffers, so no aliasing is possible. The int8 order buffer is only ever selected for _should_write_any, whose cur_idx argument is ignored, and a failed identity check degrades to int64 (the old behavior), never to wrong output. GPU paths are untouched apart from copy=False; order stays int64 where it is an atomic target. Parity for all six merge modes plus a user callable is locked by tests with 300+ geometries, which forces the wrapped int8 stores through their full range.

Suggestions

  1. order[r, c] = new_idx wraps in numba-compiled code (verified), but in pure-Python mode (NUMBA_DISABLE_JIT=1) NumPy 2 raises OverflowError once new_idx > 127. CI never runs that mode, but someone debugging rasterize that way with a realistic dataset will hit it. Add a sentence to the _alloc_order docstring so the failure is self-explaining.
  2. test_order_insensitive_merge_values closes with assert np.isfinite(inside), which is much weaker than the rest of the test. The point pixel (48, 16) has an exact expected value per merge (sum 45750, count 301, min 1, max 600); assert those instead.

Nits

  1. The peak-memory thresholds (16 and 21 B/px) are generous relative to the measured 10 and ~17.5, which is the right call given this repo's history with resource-assertion flakes (#2889). The docstring already records the before/after numbers. No change requested.

tracemalloc-based assertions are deterministic (allocation counts, not wall clock), so they don't reintroduce the timing-flake class that #2889 removed.

@brendancol brendancol left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up pass after a0903e7.

  • Suggestion 1: fixed. _alloc_order docstring now spells out the NUMBA_DISABLE_JIT=1 OverflowError caveat and the escape hatch (force the int64 branch).
  • Suggestion 2: fixed. test_order_insensitive_merge_values now asserts exact values at both the boxes-only pixel (60, 4) and the point-burn pixel (48, 16) for all four merges; the weak isfinite check is gone.
  • Nit 3: no change requested, none made.

All 21 tests in test_rasterize_alloc_3107.py pass after the changes; flake8 clean. Nothing else outstanding.

@github-actions github-actions Bot added the performance PR touches performance-sensitive code label Jun 10, 2026
…e-rasterize-2026-06-09-01

# Conflicts:
#	.claude/sweep-performance-state.csv
@brendancol brendancol merged commit 2f2b758 into main Jun 10, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance PR touches performance-sensitive code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

rasterize allocates two avoidable full-raster buffers per call

1 participant