Skip to content

zonal: fix int32 overflow in _strides() that silently corrupts stats/crosstab on huge rasters#2617

Merged
brendancol merged 2 commits into
mainfrom
deep-sweep-security-zonal-2026-05-29-01
May 29, 2026
Merged

zonal: fix int32 overflow in _strides() that silently corrupts stats/crosstab on huge rasters#2617
brendancol merged 2 commits into
mainfrom
deep-sweep-security-zonal-2026-05-29-01

Conversation

@brendancol

Copy link
Copy Markdown
Contributor

Closes #2612

What changed

  • _strides() accumulated cumulative zone-boundary counts into an np.int32 array inside a numba @ngjit kernel. For rasters above ~2.1 billion pixels (about 46341 x 46341) the count wraps silently to a negative int32, so the slice bounds _calc_stats and the crosstab paths derive from it go negative and stats()/crosstab() return silently wrong results.
  • Allocate the counter as int64, matching the natural array-index type, so it cannot overflow at realistic raster sizes.
  • Added a regression test asserting _strides returns an int64, non-negative, monotonic result.

Backend coverage

_strides runs in the numpy and dask+numpy paths (and the cupy/dask+cupy paths, which route through the numpy implementation). The dtype fix applies to all of them.

Test plan

  • test_strides_int64_no_overflow_2612 passes
  • Existing stats and crosstab tests pass (111 passed)

…rflow (#2612)

_strides accumulated cumulative zone-boundary counts into an int32 array
inside a numba @ngjit kernel. For rasters above ~2.1 billion pixels the
count wraps silently to a negative value, corrupting the slice bounds
that _calc_stats and crosstab derive from it, so stats() and crosstab()
returned silently wrong results on very large numpy-backed (or
large-chunk dask) rasters. Allocate the counter as int64 to match the
natural array-index type.
@github-actions github-actions Bot added the performance PR touches performance-sensitive code label May 29, 2026

@brendancol brendancol left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review: fix int32 overflow in _strides()

Blockers

None.

Suggestions

None.

Nits

  • The test pins the dtype and checks correctness on a tiny input. It can't actually trigger the wrap without a ~17 GB array, so the dtype assertion is doing the real work here. A one-line comment to that effect would save the next reader a moment. Not blocking.

What looks good

  • Root cause is right. In a numba nopython kernel, storing a count past 2**31-1 into an int32 array wraps silently instead of raising, so int64 is the correct fix and it matches how the value is later used (as a slice index).
  • Minimal change, no surrounding churn.
  • _strides is the one chokepoint feeding zone_breaks into _calc_stats and the crosstab paths, so this covers stats() and crosstab() on numpy, dask+numpy, and the cupy/dask+cupy paths (which route through the numpy code). Nothing else to touch per-backend.
  • Downstream uses these as plain Python slice bounds, and the index arrays come from np.argsort (already int64 on 64-bit), so there's no other int32 narrowing waiting to reintroduce this.
  • Existing stats/crosstab tests still pass (111), so the wider dtype didn't move results at normal sizes.

Checklist

  • Algorithm vs reference: n/a (overflow fix)
  • Backends consistent: yes, one shared codepath
  • NaN handling: unchanged
  • Edge cases: dtype + correctness covered; a real overflow isn't allocatable
  • Dask boundaries: unchanged, per-block strides still correct
  • Premature materialization: none
  • Benchmark: not needed
  • README matrix: n/a
  • Docstrings: accurate

@brendancol brendancol left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up: addressed the one nit from the previous review. The test now carries a comment explaining that a real >2**31-element input isn't allocatable (~17 GB), so the int64 dtype assertion stands in for the overflow case. No other findings were open. Regression test still passes.

@brendancol brendancol merged commit 1ad4fdb into main May 29, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance PR touches performance-sensitive code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

stats()/crosstab() silently wrong on very large rasters: int32 overflow in _strides()

1 participant