Stream dask input through the GPU writer one tile-row band at a time#3241
Merged
Conversation
brendancol
commented
Jun 11, 2026
brendancol
left a comment
Contributor
Author
There was a problem hiding this comment.
PR Review: Stream dask input through the GPU writer one tile-row band at a time
Blockers (must fix before merge)
None found.
Suggestions (should fix, not blocking)
- The streamed write still builds the entire output file in host RAM:
_gpu_stream_compress_to_partaccumulates every compressed tile (xrspatial/geotiff/_writers/gpu.py:708), and_assemble_tiffconcatenates the full byte string before the single_write_bytescall (gpu.py:856-871). That matches the pre-PR GPU writer, so it is not a regression, but the CPU streaming writer writes incrementally to a temp file, and the new docstring wording ("also streams",xrspatial/geotiff/_writers/eager.py:111-118) could be read as a host-memory bound too. Add a sentence scoping the GPU streaming guarantee to device memory. - The nvCOMP level-warning comment in
gpu_compress_tiles(xrspatial/geotiff/_gpu_decode.py:3126-3140) says the GPU writer calls it "once per IFD part", so-W alwaysrepeats the warning per part. The streaming path now calls it once per tile-row band, so acompression_leveluser under-W alwayssees one warning per band. The default filter still dedups by location, so normal runs are unchanged; update the comment so it stays accurate. - Two streaming combinations have no test:
_write_geotiff_gpu(BytesIO)with dask input (file-like destinations are accepted on the non-COG GPU path and now stream), and band-last(y, x, band)3D dask input (the band-first test exercises the compressor via the remap, but not slicing an already band-last dask array).
Nits (optional improvements)
-
test_gpu_streaming_small_buffer_byte_identical_3166and the band-first test share oneda_kwargsdict (including the attrs dict) between the lazy and eager DataArrays. Nothing mutates it today, but a writer-side attrs mutation would be invisible to the byte-identity comparison since both arrays see the same dict. Independent dicts would keep the two writes independent.
What looks good
- The byte-identity tests are the right contract: streamed output is compared against the eager write at the file-bytes level, with ragged chunk/tile alignment (24-row chunks vs 32-row tiles), NaN holes plus a sentinel, and a forced one-tile-row-per-band floor.
- Reusing
_stream_row_bandskeeps the band geometry and the recompute-amplification fix (#3117 / #3007) consistent with the CPU writer instead of inventing a second banding scheme. - The warning contract inversion is covered from both sides: silent streaming for the three dask entry shapes, and a still-warning
cog=Truepath. - Measured 2428 MB down to 502 MB peak device pool on a 256 MB raster, with byte-identical files.
Checklist
- Algorithm matches reference (byte-identical to the eager write, verified by tests)
- All implemented backends produce consistent results (dask+cupy, dask+numpy via gpu=True; plain cupy and numpy unchanged)
- NaN handling is correct (per-band sentinel rewrite, copy before mutate)
- Edge cases covered (odd sizes / partial tiles, ragged chunks, tiny-buffer floor)
- Dask chunk boundaries handled correctly (tile-row aligned bands via
_stream_row_bands) - No premature materialization on the new path;
cog=Truematerialisation is intentional and warned - Benchmark not needed (no geotiff benchmarks exist in
benchmarks/benchmarks/) - README feature matrix unchanged (no new function, no tier change)
- Docstrings updated (
to_geotiff,_write_geotiff_gpu, warning helper)
…and BytesIO tests (#3166)
brendancol
commented
Jun 11, 2026
brendancol
left a comment
Contributor
Author
There was a problem hiding this comment.
Follow-up review (commit b7420fd)
All four findings from the first pass are addressed:
- Device-memory scoping:
to_geotiffand thestreaming_buffer_bytesdocstring on_write_geotiff_gpunow state that the cap bounds device memory only and the compressed file is still assembled in host RAM (xrspatial/geotiff/_writers/eager.py:111-121,xrspatial/geotiff/_writers/gpu.py:244-261). Fixed. - Stale nvCOMP level-warning comment: now mentions the per-band call pattern on the streaming path (
xrspatial/geotiff/_gpu_decode.py:3126-3141). Fixed. - Missing coverage:
test_gpu_streaming_band_last_byte_identical_3166covers slicing an already band-last 3D dask array, andtest_write_geotiff_gpu_dask_to_bytesio_streams_3166covers the file-like destination, both asserting byte identity against the eager write. Fixed. - Shared
da_kwargsdict: both existing tests and the new ones build fresh dims/attrs per DataArray via a local factory. Fixed.
No new issues in the follow-up diff. The GPU suite passes (430 passed, 3 skipped) and flake8 is clean on the edited files. Nothing further from me.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #3166.
PR #3173 handled the docs and the materialisation warning. This finishes the issue: the GPU writer now streams dask input instead of computing the whole array on device.
_write_geotiff_gpuno longer calls.compute()on dask input whencog=False. It computes one tile-row band at a time (grouped by source chunk-row span understreaming_buffer_bytes, reusing the CPU streaming writer's_stream_row_bandshelper), compresses each band on device, and releases it before the next. Tiles are independent in the TIFF layout, and the tile-extraction kernel pads edge tiles per band the same way it pads them for the full image, so the output is byte-identical to the eager write.streaming_buffer_bytesnow does something on the GPU path: it caps the device bytes computed per band, with a floor of one full-width tile-row.cog=Truekeeps the materialise-and-warn behaviour (overview generation needs the full array). The warning message now says that, instead of claiming the GPU writer has no streaming mode.(band, y, x)dask input remaps lazily viada.moveaxis. The per-band NaN-to-sentinel rewrite matches the eager path.Backends: dask+cupy streams on device; dask+numpy with
gpu=Trueuploads one band per compute. numpy and plain cupy writes are unchanged, as is the CPU dask streaming path.Measured on an RTX A6000 with a 256 MB float32 raster (8192x8192, 512-row chunks): peak device pool 502 MB streamed vs 2428 MB eager, byte-identical output files.
Test plan:
gpu=Truestreams;cog=Truestill warns and round-trips; tinystreaming_buffer_byteswith NaN holes + a nodata sentinel stays byte-identical; band-first dask input stays byte-identicalcog=Truewarns)xrspatial/geotiff/tests/gpu/428 passed;write/+test_round_trip.py1144 passed (CUDA device)