Make VRT tiled writes atomic to avoid poisoned partial output (#2669)#2678
Merged
Conversation
Dedupe duplicate module rows (last-write-wins by last_inspected) and collapse multi-line notes to single physical lines. The notes had embedded newlines, which the merge=union .gitattributes strategy splits record-by-record, corrupting the file into a 156-column phantom row on parallel-agent appends. One line per record keeps union merges safe.
Stage tiles in a temp directory and atomically rename it into place once every tile is written, then write the VRT index. On any failure during tile writing, the staging dir is removed so the final tiles directory is never created and a retry starts clean -- previously a failed write left a partial *_tiles dir that the leftover-state guard then rejected on the next attempt. On the dask path, capture each task's outcome so dask.compute does not abort while sibling threads are still writing, then re-raise the first failure after all writes have settled; this keeps the cleanup from racing in-flight worker threads. Add tests covering a successful tiled write, a numpy write that fails partway, and a dask write that fails partway, asserting no poisoned output remains and that a retry succeeds.
brendancol
commented
May 29, 2026
brendancol
left a comment
Contributor
Author
There was a problem hiding this comment.
PR Review: Make VRT tiled writes atomic to avoid poisoned partial output (#2669)
Blockers (must fix before merge)
None.
Suggestions (should fix, not blocking)
xrspatial/geotiff/_writers/eager.py:1436--os.replace(staging_dir, tiles_dir)onto a pre-existing emptytiles_diris not portable. The old code calledos.makedirs(tiles_dir, exist_ok=True), so a pre-existing empty tiles dir was tolerated and reused. On POSIX,os.replaceover an empty target dir works, but on Windows replacing an existing directory raises even when it is empty. The leftover-state guard at line 1185 only rejects a non-empty tiles dir, so an empty one slips through to the rename. Suggest removing an empty pre-existingtiles_dir(os.rmdir) just before the replace, or documenting POSIX-only support for that edge case.
Nits (optional improvements)
xrspatial/geotiff/_writers/eager.py:1424--raise rre-raises the captured exception but rebinds the traceback to this line. The original tile-write traceback is preserved onr, so this is acceptable; just noting the stack will point at the barrier rather than the failing tile.
What looks good
- Staging-dir-plus-atomic-rename is the right shape for the fix, and the same-parent-dir placement keeps the rename atomic.
- The dask race was caught and handled correctly: capturing each task outcome so
dask.computeruns every task to completion before re-raising means cleanup no longer races in-flight writer threads. The test runs this path repeatedly to flush out the race. - Tests cover the three real paths: clean write, numpy mid-write failure with retry, dask mid-write failure with retry. They assert no staging dir, no tiles dir, and no VRT remain after failure, and that a retry succeeds.
- Temp filenames carry the issue number, per project convention.
- The leftover-state guard comment now explains why a retry after failure is no longer blocked.
Checklist
- Algorithm matches reference/paper: n/a (IO transaction fix)
- All implemented backends produce consistent results: yes (numpy/cupy eager, dask streaming)
- NaN handling is correct: unchanged by this PR
- Edge cases covered by tests: failure-and-retry covered; pre-existing empty tiles dir not covered
- Dask chunk boundaries handled correctly: unchanged; tile grid still from chunks
- No premature materialization or unnecessary copies: yes
- Benchmark exists or not needed: not needed (no perf-path change beyond the existing threaded compute)
- README feature matrix updated: n/a (no new public API)
- Docstrings present and accurate: yes
os.replace onto an existing directory raises on Windows even when the target is empty, and an empty leftover *_tiles dir passes the non-empty leftover-state guard. Drop the empty target with os.rmdir before the rename so the promotion works on every platform, matching the old makedirs(exist_ok=True) tolerance. Add a test for the empty-dir reuse case.
brendancol
commented
May 29, 2026
brendancol
left a comment
Contributor
Author
There was a problem hiding this comment.
Follow-up review (after commit 9c5deaa)
Disposition of prior findings
- Suggestion (
eager.pyrename): fixed. The rename now drops a pre-existing emptytiles_dirwithos.rmdirbeforeos.replace, so promotion works on Windows too. A test covers the empty-dir reuse case (test_preexisting_empty_tiles_dir_is_reused). - Nit (
raise rtraceback): dismissed.raise rre-raises the captured exception with its original__traceback__intact, so the failing tile's stack is preserved; no change needed.
Remaining
No blockers, suggestions, or nits. The write/vrt suites pass (1517 passed, 1 skipped).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #2669
What changed
_write_vrt_tilednow stages tiles in a temp directory and renames it into place atomically once every tile is written. On any failure during tile writing the staging dir is removed, so the final*_tilesdirectory is never created and a retry starts from a clean slate. Before this, a failed write left a partial*_tilesdirectory behind, and the next attempt hit the "directory already contains files" guard and failed immediately.dask.computeabort on the first error while sibling threads are still writing into the staging dir. The first failure is re-raised after all writes settle, so cleanup no longer races in-flight worker threads.Backend coverage
Test plan
.vrtwrite produces the tile directory plus a readable VRT.*_tilesdir, no.vrt, no staging dir, and a retry succeeds.