Reduce CuPy host round-trips and remove redundant copies in reproject #1457

@brendancol

Two small performance items in xrspatial/reproject/__init__.py.

Batch the four CuPy .get() calls per chunk

_reproject_chunk_cupy (around lines 357-364) does four sequential .get() calls to bring nanmin/nanmax of the row/col pixel arrays back to host:

r_min_val = float(cp.nanmin(src_row_px).get())
if not np.isfinite(r_min_val):
    return cp.full(chunk_shape, nodata, dtype=cp.float64)
r_max_val = float(cp.nanmax(src_row_px).get())
c_min_val = float(cp.nanmin(src_col_px).get())
c_max_val = float(cp.nanmax(src_col_px).get())

Each .get() is a synchronous device-to-host transfer: the host blocks until the GPU has finished the pending work and the value has been copied back. Stacking the four reductions into a single 4-element CuPy array and pulling that across in one .get() cuts the round-trips from four to one per chunk. The finite checks then run on host scalars at negligible cost.
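A minimal sketch of the batching idea, not the actual patch. The helper name batched_extents is hypothetical, and the NumPy fallback exists only so the snippet runs on a machine without a GPU; with CuPy arrays the single .get() is the only device-to-host copy.

```python
import numpy as np

try:
    import cupy as cp  # optional; the sketch falls back to NumPy without it
except ImportError:
    cp = None


def batched_extents(src_row_px, src_col_px):
    """Return (r_min, r_max, c_min, c_max) as host floats.

    Hypothetical helper: the four reductions are stacked into one
    4-element array so that only one transfer to the host is needed.
    """
    # Pick cupy or numpy depending on where the input arrays live.
    xp = cp.get_array_module(src_row_px) if cp is not None else np
    stats = xp.stack([
        xp.nanmin(src_row_px),
        xp.nanmax(src_row_px),
        xp.nanmin(src_col_px),
        xp.nanmax(src_col_px),
    ])
    # CuPy arrays expose .get() (one device-to-host copy);
    # NumPy arrays are already on the host.
    host = stats.get() if hasattr(stats, "get") else stats
    return tuple(float(v) for v in host)


rows = np.array([[0.5, np.nan], [3.0, 2.0]])
cols = np.array([[10.0, 12.5], [np.nan, 11.0]])
r_min_val, r_max_val, c_min_val, c_max_val = batched_extents(rows, cols)
```

The existing early return can then test np.isfinite(r_min_val) on the host value after the single transfer. One trade-off to keep in mind: the current code skips the remaining three reductions when the first value is not finite, whereas the batched form always launches all four.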

The same pattern repeats in _reproject_dask_cupy around lines 1122-1128.

Drop redundant .copy() after .astype()

numpy.ndarray.astype() and cupy.ndarray.astype() both default to copy=True, so a call that leaves copy at its default always returns a new array, even when the dtype is unchanged. The follow-up .copy() in:

  • _reproject_chunk_numpy multi-band path (line ~290)
  • _reproject_chunk_numpy single-band path (line ~305)
  • _reproject_chunk_cupy (line ~443)
  • _reproject_dask_cupy (line ~1193)

is therefore redundant and can be removed. No correctness change; one fewer array allocation per chunk.
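The copy semantics can be checked with a short NumPy snippet. This is an illustration only, using a made-up array named window rather than code from the module; cupy.ndarray.astype() documents the same copy parameter.

```python
import numpy as np

# Stand-in for a source window; the name and values are illustrative only.
window = np.arange(6, dtype=np.float64).reshape(2, 3)

# Default copy=True: a new buffer is allocated even though the dtype matches.
converted = window.astype(np.float64)
shares_default = np.shares_memory(window, converted)

# Writing to the result leaves the original untouched.
converted[0, 0] = -1.0
original_first = window[0, 0]

# Only an explicit copy=False may hand back the input array itself.
same_object = window.astype(np.float64, copy=False) is window
```

One detail worth checking before removing the calls: .copy() defaults to order='C' while astype() defaults to order='K', so if any of these arrays can be transposed or Fortran-ordered, the memory layout of the result could differ even though the values are identical.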

Impact

For an N-chunk reprojection on GPU, the batching removes roughly 3 * N synchronous device-to-host transfers. The .copy() removal saves one window-sized allocation per chunk. Existing parity tests cover correctness.
