Stream dask input through the GPU writer one tile-row band at a time #3241

Merged: brendancol merged 2 commits into main from issue-3166-gpu-streaming on Jun 11, 2026
@brendancol

Closes #3166.

PR #3173 handled the docs and the materialisation warning. This finishes the issue: the GPU writer now streams dask input instead of computing the whole array on device.

  • _write_geotiff_gpu no longer calls .compute() on dask input when cog=False. It computes one tile-row band at a time (grouped by source chunk-row span under streaming_buffer_bytes, reusing the CPU streaming writer's _stream_row_bands helper), compresses each band on device, and releases it before the next. Tiles are independent in the TIFF layout, and the tile-extraction kernel pads edge tiles per band the same way it pads them for the full image, so the output is byte-identical to the eager write.
  • streaming_buffer_bytes now does something on the GPU path: it caps the device bytes computed per band, with a floor of one full-width tile-row.
  • cog=True keeps the materialise-and-warn behaviour (overview generation needs the full array). The warning message now says that, instead of claiming the GPU writer has no streaming mode.
  • Band-first (band, y, x) dask input remaps lazily via da.moveaxis. The per-band NaN-to-sentinel rewrite matches the eager path.
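The band sizing described above can be sketched in a few lines. This is a minimal illustration, not the library's _stream_row_bands helper: the function name plan_row_bands and its parameters are hypothetical, and it ignores the source chunk-row grouping that the real helper performs. It only shows how a byte budget with a floor of one full-width tile-row turns into tile-row-aligned bands.

```python
from typing import List, Tuple


def plan_row_bands(
    height: int,
    width: int,
    tile_height: int,
    itemsize: int,
    buffer_bytes: int,
) -> List[Tuple[int, int]]:
    """Split `height` rows into tile-row-aligned (start, stop) bands.

    Hypothetical helper for illustration only; the real writer also
    groups bands by the source dask chunk-row span.
    """
    # Bytes needed to hold one full-width tile-row.
    tile_row_bytes = tile_height * width * itemsize
    # Floor: a band always holds at least one full-width tile-row,
    # even when the budget is smaller than that.
    tile_rows_per_band = max(1, buffer_bytes // tile_row_bytes)
    rows_per_band = tile_rows_per_band * tile_height

    bands = []
    for start in range(0, height, rows_per_band):
        # The last band may be shorter than rows_per_band.
        bands.append((start, min(start + rows_per_band, height)))
    return bands
```

With an assumed 256-row tile and a 64 MiB budget, an 8192x8192 float32 raster splits into four bands of 2048 rows; with a budget of 1 byte the floor applies and every band is exactly one tile-row.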

Backends: dask+cupy streams on device; dask+numpy with gpu=True uploads one band per compute. numpy and plain cupy writes are unchanged, as is the CPU dask streaming path.

Measured on an RTX A6000 with a 256 MB float32 raster (8192x8192, 512-row chunks): peak device pool 502 MB streamed vs 2428 MB eager, byte-identical output files.
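The sizes behind that measurement can be checked with plain arithmetic. The sketch below assumes "256 MB" means 256 MiB (which is what 8192 x 8192 float32 values occupy) and takes the two peak figures directly from the measurement above.

```python
MIB = 1024 * 1024

height = width = 8192
itemsize = 4        # bytes per float32 value
chunk_rows = 512    # rows per dask chunk, as in the measurement

# Size of the whole raster and of one full-width chunk-row span.
raster_bytes = height * width * itemsize
chunk_band_bytes = chunk_rows * width * itemsize

# Reported peak device pool sizes, in MB.
peak_streamed_mb = 502
peak_eager_mb = 2428

print(raster_bytes // MIB)                          # 256
print(chunk_band_bytes // MIB)                      # 16
print(round(peak_eager_mb / peak_streamed_mb, 2))   # 4.84
```

So the streamed write peaks at roughly a fifth of the eager write's device pool on this raster.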

Test plan:

  • New tests: dask+cupy auto-dispatch streams with no warning and byte-identical output; positional dask input streams; dask+numpy with gpu=True streams; cog=True still warns and round-trips; tiny streaming_buffer_bytes with NaN holes + a nodata sentinel stays byte-identical; band-first dask input stays byte-identical
  • Updated the #3173 warning tests for the inverted contract (streaming is silent, only cog=True warns)
  • xrspatial/geotiff/tests/gpu/: 428 passed; write/ + test_round_trip.py: 1144 passed (CUDA device)
  • flake8 clean on edited files

github-actions bot added the performance label (PR touches performance-sensitive code) on Jun 11, 2026

@brendancol brendancol left a comment

PR Review: Stream dask input through the GPU writer one tile-row band at a time

Blockers (must fix before merge)

None found.

Suggestions (should fix, not blocking)

  • The streamed write still builds the entire output file in host RAM: _gpu_stream_compress_to_part accumulates every compressed tile (xrspatial/geotiff/_writers/gpu.py:708), and _assemble_tiff concatenates the full byte string before the single _write_bytes call (gpu.py:856-871). That matches the pre-PR GPU writer, so it is not a regression, but the CPU streaming writer writes incrementally to a temp file, and the new docstring wording ("also streams", xrspatial/geotiff/_writers/eager.py:111-118) could be read as a host-memory bound too. Add a sentence scoping the GPU streaming guarantee to device memory.
  • The nvCOMP level-warning comment in gpu_compress_tiles (xrspatial/geotiff/_gpu_decode.py:3126-3140) says the GPU writer calls it "once per IFD part", so running under "-W always" repeats the warning per part. The streaming path now calls it once per tile-row band, so a user who sets compression_level and runs under "-W always" sees one warning per band. The default filter still dedups by location, so normal runs are unchanged; update the comment so it stays accurate.
  • Two streaming combinations have no test: _write_geotiff_gpu(BytesIO) with dask input (file-like destinations are accepted on the non-COG GPU path and now stream), and band-last (y, x, band) 3D dask input (the band-first test exercises the compressor via the remap, but not slicing an already band-last dask array).
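The difference between the default warning filter and "-W always" for a call made once per band can be reproduced with the standard warnings module alone. In this sketch compress_band is a hypothetical stand-in for the per-band compression call and the message text is invented; only the filter behaviour is the point.

```python
import warnings


def compress_band(band_index: int) -> None:
    """Hypothetical stand-in for a per-band compression call that warns."""
    # Every call warns from this same source line (same "location").
    warnings.warn("compression_level is not applied by this codec", UserWarning)


def count_warnings(action: str, n_bands: int) -> int:
    """Call compress_band n_bands times under `action`; count emitted warnings."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter(action)
        for i in range(n_bands):
            compress_band(i)
    return len(caught)


print(count_warnings("default", 4))  # 1: deduplicated by location
print(count_warnings("always", 4))   # 4: one warning per band
```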

Nits (optional improvements)

  • test_gpu_streaming_small_buffer_byte_identical_3166 and the band-first test share one da_kwargs dict (including the attrs dict) between the lazy and eager DataArrays. Nothing mutates it today, but a writer-side attrs mutation would be invisible to the byte-identity comparison since both arrays see the same dict. Independent dicts would keep the two writes independent.
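The aliasing concern can be shown with plain dicts. The factory name make_da_kwargs and the attribute values below are hypothetical, and the sketch deliberately avoids xarray so that it only illustrates shared versus independent dict objects.

```python
def make_da_kwargs() -> dict:
    """Build fresh kwargs so no two arrays share a dims list or attrs dict."""
    return {"dims": ["y", "x"], "attrs": {"nodata": -9999.0}}


# Shared dict: a mutation made through one reference shows up in the other,
# so a comparison between the two sides cannot detect it.
shared = make_da_kwargs()
lazy_attrs = shared["attrs"]
eager_attrs = shared["attrs"]
lazy_attrs["written"] = True
print("written" in eager_attrs)  # True

# Independent dicts from the factory: the mutation stays on one side.
lazy_kw = make_da_kwargs()
eager_kw = make_da_kwargs()
lazy_kw["attrs"]["written"] = True
print("written" in eager_kw["attrs"])  # False
```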

What looks good

  • The byte-identity tests are the right contract: streamed output is compared against the eager write at the file-bytes level, with ragged chunk/tile alignment (24-row chunks vs 32-row tiles), NaN holes plus a sentinel, and a forced one-tile-row-per-band floor.
  • Reusing _stream_row_bands keeps the band geometry and the recompute-amplification fix (#3117 / #3007) consistent with the CPU writer instead of inventing a second banding scheme.
  • The warning contract inversion is covered from both sides: silent streaming for the three dask entry shapes, and a still-warning cog=True path.
  • Measured 2428 MB down to 502 MB peak device pool on a 256 MB raster, with byte-identical files.
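Why band-wise tiling can be byte-identical to a whole-image write is easy to demonstrate on a toy model. The sketch below is an assumption-laden illustration, not the GPU writer: it uses nested Python lists for the image, zlib for compression, zero padding for edge tiles, and hypothetical function names. Because every band starts on a tile-row boundary and edge tiles are padded the same way in both paths, each tile's bytes come out the same.

```python
import zlib
from typing import List

Row = List[int]  # one image row of values in 0..255


def compress_tiles(rows: List[Row], width: int, tile: int) -> List[bytes]:
    """Cut rows into tile x tile blocks, zero-pad edge tiles, compress each."""
    out = []
    for ty in range(0, len(rows), tile):
        for tx in range(0, width, tile):
            block = bytearray()
            for y in range(ty, ty + tile):
                seg = rows[y][tx:tx + tile] if y < len(rows) else []
                # Pad short segments (right edge) and missing rows (bottom edge).
                block += bytes(seg) + bytes(tile - len(seg))
            out.append(zlib.compress(bytes(block)))
    return out


def eager_write(image: List[Row], width: int, tile: int) -> bytes:
    """Tile and compress the whole image at once."""
    return b"".join(compress_tiles(image, width, tile))


def streamed_write(image: List[Row], width: int, tile: int, tile_rows: int) -> bytes:
    """Tile and compress one band of `tile_rows` tile-rows at a time."""
    step = tile * tile_rows
    parts: List[bytes] = []
    for start in range(0, len(image), step):
        band = image[start:start + step]  # only this band is "resident"
        parts.extend(compress_tiles(band, width, tile))
    return b"".join(parts)


# 70 x 50 image with 32-pixel tiles: partial tiles on both edges.
image = [[(x * 7 + y * 13) % 256 for x in range(50)] for y in range(70)]
print(eager_write(image, 50, 32) == streamed_write(image, 50, 32, 1))  # True
```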

Checklist

  • Algorithm matches reference (byte-identical to the eager write, verified by tests)
  • All implemented backends produce consistent results (dask+cupy, dask+numpy via gpu=True; plain cupy and numpy unchanged)
  • NaN handling is correct (per-band sentinel rewrite, copy before mutate)
  • Edge cases covered (odd sizes / partial tiles, ragged chunks, tiny-buffer floor)
  • Dask chunk boundaries handled correctly (tile-row aligned bands via _stream_row_bands)
  • No premature materialization on the new path; cog=True materialisation is intentional and warned
  • Benchmark not needed (no geotiff benchmarks exist in benchmarks/benchmarks/)
  • README feature matrix unchanged (no new function, no tier change)
  • Docstrings updated (to_geotiff, _write_geotiff_gpu, warning helper)
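The "copy before mutate" item on NaN handling can be pictured with a small pure-Python sketch. The function name nan_to_sentinel is hypothetical and the band is modelled as nested lists rather than a device array; the point is only that the rewrite returns a new band and leaves the caller's data untouched.

```python
import math
from typing import List


def nan_to_sentinel(band: List[List[float]], nodata: float) -> List[List[float]]:
    """Return a copy of `band` with NaN replaced by `nodata`.

    Hypothetical illustration: the input band is never modified.
    """
    return [[nodata if math.isnan(v) else v for v in row] for row in band]


source = [[1.0, float("nan")], [float("nan"), 4.0]]
written = nan_to_sentinel(source, -9999.0)

print(written)                   # [[1.0, -9999.0], [-9999.0, 4.0]]
print(math.isnan(source[0][1]))  # True: the source still holds NaN
```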

@brendancol brendancol left a comment

Follow-up review (commit b7420fd)

All four findings from the first pass are addressed:

  • Device-memory scoping: to_geotiff and the streaming_buffer_bytes docstring on _write_geotiff_gpu now state that the cap bounds device memory only and the compressed file is still assembled in host RAM (xrspatial/geotiff/_writers/eager.py:111-121, xrspatial/geotiff/_writers/gpu.py:244-261). Fixed.
  • Stale nvCOMP level-warning comment: now mentions the per-band call pattern on the streaming path (xrspatial/geotiff/_gpu_decode.py:3126-3141). Fixed.
  • Missing coverage: test_gpu_streaming_band_last_byte_identical_3166 covers slicing an already band-last 3D dask array, and test_write_geotiff_gpu_dask_to_bytesio_streams_3166 covers the file-like destination, both asserting byte identity against the eager write. Fixed.
  • Shared da_kwargs dict: both existing tests and the new ones build fresh dims/attrs per DataArray via a local factory. Fixed.

No new issues in the follow-up diff. The GPU suite passes (430 passed, 3 skipped) and flake8 is clean on the edited files. Nothing further from me.

brendancol merged commit 6983f88 into main on Jun 11, 2026 (6 checks passed)

Linked issue: dask+cupy writes auto-dispatch to the GPU writer and materialize the full array, contradicting the streaming contract
