Skip to content

natural_breaks(num_sample=None) unbounded allocation on numpy/cupy backends #1246

@brendancol

Description

@brendancol

Problem

natural_breaks(agg, num_sample=None) on the numpy and cupy backends passes the entire flattened raster to _run_jenks, which allocates two (n_data + 1, n_classes + 1) float64 matrices in _run_numpy_jenks_matrices. Memory cost scales as 16 * n_data * (k + 1) bytes.

For a 10000 × 10000 raster with default k=5, that's ~9.6 GB per matrix pair, before considering the O(n²) Jenks runtime.

Reproduction

import numpy as np
import xarray as xr
from xrspatial.classify import natural_breaks

# 1000x1000 = 1M points. Matrices are 1M * 6 * 8 * 2 = ~96 MB.
# 10000x10000 would be ~9.6 GB.
agg = xr.DataArray(np.random.RandomState(0).rand(1000, 1000))
natural_breaks(agg, num_sample=None, k=5)  # runs, but peak memory >> input size

Root Cause

_compute_natural_break_bins in xrspatial/classify.py:635:

if num_sample is not None and num_sample < num_data:
    # ... sample from data
else:
    sample_data = data_flat_np  # full data

The dask paths (_run_dask_natural_break, _run_dask_cupy_natural_break) already cap num_sample at num_data and sample accordingly. The numpy and cupy paths pass num_sample straight through and skip sampling entirely when it is None.

Fix

Cap num_sample before the bin computation on the numpy and cupy paths, matching the dask cap. A conservative default keeps the Jenks matrices bounded regardless of input size.

Scope

  • natural_breaks() numpy backend
  • natural_breaks() cupy backend
  • Dask paths already protected

Found by

Security sweep (Cat 1 - Unbounded Allocation, MEDIUM severity).

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions