Stream dask input through the GPU writer one tile-row band at a time by brendancol · Pull Request #3241 · xarray-contrib/xarray-spatial

brendancol · 2026-06-11T20:00:38Z

Closes #3166.

PR #3173 handled the docs and the materialisation warning. This finishes the issue: the GPU writer now streams dask input instead of computing the whole array on device.

_write_geotiff_gpu no longer calls .compute() on dask input when cog=False. It computes one tile-row band at a time (grouped by source chunk-row span under streaming_buffer_bytes, reusing the CPU streaming writer's _stream_row_bands helper), compresses each band on device, and releases it before the next. Tiles are independent in the TIFF layout, and the tile-extraction kernel pads edge tiles per band the same way it pads them for the full image, so the output is byte-identical to the eager write.
streaming_buffer_bytes now does something on the GPU path: it caps the device bytes computed per band, with a floor of one full-width tile-row.
cog=True keeps the materialise-and-warn behaviour (overview generation needs the full array). The warning message now says that, instead of claiming the GPU writer has no streaming mode.
Band-first (band, y, x) dask input remaps lazily via da.moveaxis. The per-band NaN-to-sentinel rewrite matches the eager path.
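Here is a minimal sketch of the band-sizing rule described above. It shows the cap with a floor of one full-width tile-row, and which source chunk-rows each tile-aligned band reads. The helper names (rows_per_band, plan_bands, chunks_touched) are hypothetical. This is not xrspatial's _stream_row_bands, and it does not reproduce that helper's grouping by chunk-row span.

```python
def rows_per_band(width, itemsize, tile_height, buffer_bytes):
    """Tile-aligned band height under a byte cap, floored at one tile-row.

    Hypothetical helper illustrating the cap-with-floor rule only.
    """
    # Bytes for one full-width tile-row (every tile across the image).
    tile_row_bytes = width * itemsize * tile_height
    # Whole tile-rows that fit under the cap, but never fewer than one.
    n_tile_rows = max(1, buffer_bytes // tile_row_bytes)
    return n_tile_rows * tile_height


def plan_bands(height, band_rows):
    """Yield (start, stop) row ranges that cover the image top to bottom."""
    for start in range(0, height, band_rows):
        yield start, min(start + band_rows, height)


def chunks_touched(start, stop, chunk_rows):
    """Indices of the source chunk-rows that band [start, stop) reads."""
    return list(range(start // chunk_rows, (stop - 1) // chunk_rows + 1))


# A 100-row raster with 32-row tiles, 24-row chunks and a tiny cap:
band_rows = rows_per_band(width=70, itemsize=4, tile_height=32, buffer_bytes=1)
for start, stop in plan_bands(100, band_rows):
    print((start, stop), chunks_touched(start, stop, chunk_rows=24))
```

With 24-row chunks and 32-row tiles, chunk-row 1 is read by both the first and the second band. A naive per-band compute could therefore evaluate it twice. Grouping bands by chunk-row span is meant to limit that kind of recompute. The sketch only exposes the overlap; it does not avoid it.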

Backends: dask+cupy streams on device; dask+numpy with gpu=True uploads one band per compute. numpy and plain cupy writes are unchanged, as is the CPU dask streaming path.

Measured on an RTX A6000 with a 256 MB float32 raster (8192x8192, 512-row chunks): peak device pool 502 MB streamed vs 2428 MB eager, byte-identical output files.
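A few lines of arithmetic can sanity-check the benchmark figures. The inputs are the numbers quoted above. Reading the raster's "MB" as MiB (2**20 bytes) is an assumption about the units.

```python
# Inputs are the figures quoted in the PR; only the arithmetic is added.
rows = cols = 8192
itemsize = 4                                            # float32
chunk_rows = 512
eager_peak_mb, streamed_peak_mb = 2428, 502

raster_mib = rows * cols * itemsize / 2**20             # 256.0
chunk_rows_total = rows // chunk_rows                   # 16 source chunk-rows
chunk_row_mib = chunk_rows * cols * itemsize / 2**20    # 16.0 per chunk-row
reduction = round(eager_peak_mb / streamed_peak_mb, 1)  # 4.8x lower peak
```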

Test plan:

New tests: dask+cupy auto-dispatch streams with no warning and byte-identical output; positional dask input streams; dask+numpy with gpu=True streams; cog=True still warns and round-trips; tiny streaming_buffer_bytes with NaN holes + a nodata sentinel stays byte-identical; band-first dask input stays byte-identical
Updated the warning tests added in #3173 for the inverted contract: streaming is silent, and only cog=True warns.
xrspatial/geotiff/tests/gpu/: 428 passed; write/ + test_round_trip.py: 1144 passed (on a CUDA device)
flake8 clean on edited files

brendancol

PR Review: Stream dask input through the GPU writer one tile-row band at a time

Blockers (must fix before merge)

None found.

Suggestions (should fix, not blocking)

The streamed write still builds the entire output file in host RAM: _gpu_stream_compress_to_part accumulates every compressed tile (xrspatial/geotiff/_writers/gpu.py:708), and _assemble_tiff concatenates the full byte string before the single _write_bytes call (gpu.py:856-871). That matches the pre-PR GPU writer, so it is not a regression, but the CPU streaming writer writes incrementally to a temp file, and the new docstring wording ("also streams", xrspatial/geotiff/_writers/eager.py:111-118) could be read as a host-memory bound too. Add a sentence scoping the GPU streaming guarantee to device memory.
The nvCOMP level-warning comment in gpu_compress_tiles (xrspatial/geotiff/_gpu_decode.py:3126-3140) says the GPU writer calls it "once per IFD part", so under -W always the warning repeats once per part. The streaming path now calls it once per tile-row band, so a user who sets compression_level and runs with -W always sees one warning per band. The default filter still deduplicates by location, so normal runs are unchanged; update the comment so it stays accurate.
Two streaming combinations have no test: _write_geotiff_gpu(BytesIO) with dask input (file-like destinations are accepted on the non-COG GPU path and now stream), and band-last (y, x, band) 3D dask input (the band-first test exercises the compressor via the remap, but not slicing an already band-last dask array).
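The per-band warning suggestion depends on how Python's warnings filters deduplicate. The self-contained sketch below illustrates this. compress_band is a hypothetical stand-in for a per-band gpu_compress_tiles call, and its warning text is invented. Under the "default" action, a warning is shown once per code location. Under "always" (the action -W always selects), it is shown on every call, so a call site that runs once per band produces one warning per band.

```python
import warnings


def compress_band(level):
    """Hypothetical stand-in for a per-band compressor call that warns."""
    warnings.warn(f"compression_level={level} is ignored", UserWarning)
    return b""


def count_warnings(action, n_bands):
    """Compress n_bands bands under one filter action; count warnings shown."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter(action)
        for _ in range(n_bands):
            compress_band(6)
    return len(caught)


print(count_warnings("default", 4))  # 1: deduplicated by location
print(count_warnings("always", 4))   # 4: one per band
```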

Nits (optional improvements)

test_gpu_streaming_small_buffer_byte_identical_3166 and the band-first test share one da_kwargs dict (including the attrs dict) between the lazy and eager DataArrays. Nothing mutates it today, but a writer-side attrs mutation would be invisible to the byte-identity comparison since both arrays see the same dict. Independent dicts would keep the two writes independent.

What looks good

The byte-identity tests are the right contract: streamed output is compared against the eager write at the file-bytes level, with ragged chunk/tile alignment (24-row chunks vs 32-row tiles), NaN holes plus a sentinel, and a forced one-tile-row-per-band floor.
Reusing _stream_row_bands keeps the band geometry and the recompute-amplification fix (#3117 / #3007) consistent with the CPU writer instead of inventing a second banding scheme.
The warning contract inversion is covered from both sides: silent streaming for the three dask entry shapes, and a still-warning cog=True path.
Measured 2428 MB down to 502 MB peak device pool on a 256 MB raster, with byte-identical files.
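Banded output can be byte-identical to the eager write for a simple reason. If every band starts on a tile-row boundary, each band yields exactly the tiles the full image would, in the same row-major order, as long as edge padding is applied the same way. The numpy sketch below demonstrates this under some assumptions. extract_tiles is a hypothetical stand-in for the GPU tile-extraction kernel, and padding is assumed to use zero fill. Compression is left out; it would preserve identity only if the codec is deterministic per tile.

```python
import numpy as np


def extract_tiles(block, tile):
    """Split a 2D block into row-major tile byte strings.

    Hypothetical stand-in for the GPU tile-extraction kernel. Ragged edges
    are zero-padded up to a whole tile, which is an assumption here.
    """
    pad_rows = -block.shape[0] % tile
    pad_cols = -block.shape[1] % tile
    padded = np.pad(block, ((0, pad_rows), (0, pad_cols)))  # constant 0 fill
    return [
        padded[r:r + tile, c:c + tile].tobytes()
        for r in range(0, padded.shape[0], tile)
        for c in range(0, padded.shape[1], tile)
    ]


def tiles_banded(img, tile, band_rows):
    """Extract tiles band by band, the way a streaming writer would."""
    if band_rows % tile:
        raise ValueError("bands must start on tile-row boundaries")
    tiles = []
    for start in range(0, img.shape[0], band_rows):
        tiles.extend(extract_tiles(img[start:start + band_rows], tile))
    return tiles


# 100x70 float32 raster with a NaN hole: ragged against 32x32 tiles.
img = np.arange(100 * 70, dtype=np.float32).reshape(100, 70)
img[5:9, 3:7] = np.nan
eager = extract_tiles(img, 32)
assert tiles_banded(img, 32, 32) == eager
assert tiles_banded(img, 32, 64) == eager
```

Band heights must be multiples of the tile height. This is why the bands follow tile rows rather than the 24-row source chunks: a 24-row band would split the first 32-row tile across two bands.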

Checklist

Algorithm matches reference (byte-identical to the eager write, verified by tests)
All implemented backends produce consistent results (dask+cupy, dask+numpy via gpu=True; plain cupy and numpy unchanged)
NaN handling is correct (per-band sentinel rewrite, copy before mutate)
Edge cases covered (odd sizes / partial tiles, ragged chunks, tiny-buffer floor)
Dask chunk boundaries handled correctly (tile-row aligned bands via _stream_row_bands)
No premature materialization on the new path; cog=True materialisation is intentional and warned
Benchmark not needed (no geotiff benchmarks exist in benchmarks/benchmarks/)
README feature matrix unchanged (no new function, no tier change)
Docstrings updated (to_geotiff, _write_geotiff_gpu, warning helper)
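The PR and checklist mention two per-band steps: remapping band-first input to band-last, and rewriting NaN to the nodata sentinel on a copy. Below is a numpy sketch of both. The helper names are hypothetical. numpy stands in for dask here; dask.array.moveaxis is the lazy equivalent the PR uses. Whether the writer checks for NaN before copying is an assumption.

```python
import numpy as np


def to_band_last(arr):
    """Move a (band, y, x) array to (y, x, band).

    Hypothetical helper. For dask input, dask.array.moveaxis performs the
    same remap lazily; numpy is used here so the sketch runs without dask.
    """
    return np.moveaxis(arr, 0, -1)


def prepare_band(band, nodata):
    """Return a row band with NaN replaced by the nodata sentinel.

    Hypothetical helper. The input may be a view of the caller's array,
    so it is copied before any write.
    """
    if np.issubdtype(band.dtype, np.floating):
        mask = np.isnan(band)
        if mask.any():
            band = band.copy()  # copy before mutate: never write into a view
            band[mask] = nodata
    return band


# Example: a 2-band, 4x3 raster with one NaN in band 0.
src = np.ones((2, 4, 3), dtype=np.float32)
src[0, 1, 1] = np.nan
band_last = to_band_last(src)               # shape (4, 3, 2), a view of src
out = prepare_band(band_last[0:2], -9999.0)
```

A row slice of the moved array is a view of the source. Writing the sentinel in place would therefore corrupt the caller's data, and the copy keeps the source intact.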


brendancol

Follow-up review (commit `b7420fd`)

All four findings from the first pass are addressed:

Device-memory scoping: to_geotiff and the streaming_buffer_bytes docstring on _write_geotiff_gpu now state that the cap bounds device memory only and the compressed file is still assembled in host RAM (xrspatial/geotiff/_writers/eager.py:111-121, xrspatial/geotiff/_writers/gpu.py:244-261). Fixed.
Stale nvCOMP level-warning comment: now mentions the per-band call pattern on the streaming path (xrspatial/geotiff/_gpu_decode.py:3126-3141). Fixed.
Missing coverage: test_gpu_streaming_band_last_byte_identical_3166 covers slicing an already band-last 3D dask array, and test_write_geotiff_gpu_dask_to_bytesio_streams_3166 covers the file-like destination, both asserting byte identity against the eager write. Fixed.
Shared da_kwargs dict: both existing tests and the new ones build fresh dims/attrs per DataArray via a local factory. Fixed.

No new issues in the follow-up diff. The GPU suite passes (430 passed, 3 skipped) and flake8 is clean on the edited files. Nothing further from me.

Stream dask input through the GPU writer per tile-row band (#3166)

3b4798b

github-actions bot added the performance label (PR touches performance-sensitive code) on Jun 11, 2026

brendancol commented Jun 11, 2026


Address review: scope streaming docs to device memory, add band-last and BytesIO tests (#3166)

b7420fd

brendancol commented Jun 11, 2026


brendancol merged commit 6983f88 into main Jun 11, 2026
6 checks passed

brendancol merged 2 commits into main from issue-3166-gpu-streaming
