Skip to content

Make VRT tiled writes atomic to avoid poisoned partial output (#2669)#2678

Merged
brendancol merged 4 commits into
mainfrom
issue-2669
May 29, 2026
Merged

Make VRT tiled writes atomic to avoid poisoned partial output (#2669)#2678
brendancol merged 4 commits into
mainfrom
issue-2669

Conversation

@brendancol

Copy link
Copy Markdown
Contributor

Closes #2669

What changed

  • _write_vrt_tiled now stages tiles in a temp directory and renames it into place atomically once every tile is written. On any failure during tile writing the staging dir is removed, so the final *_tiles directory is never created and a retry starts from a clean slate. Before this, a failed write left a partial *_tiles directory behind, and the next attempt hit the "directory already contains files" guard and failed immediately.
  • On the dask streaming path, each tile task's outcome is captured rather than letting dask.compute abort on the first error while sibling threads are still writing into the staging dir. The first failure is re-raised after all writes settle, so cleanup no longer races in-flight worker threads.

Backend coverage

  • numpy and cupy (eager): covered, transactional staging + rename.
  • dask+numpy and dask+cupy (streaming): covered, with the settle-before-cleanup barrier.

Test plan

  • Successful tiled .vrt write produces the tile directory plus a readable VRT.
  • A numpy write that fails partway leaves no *_tiles dir, no .vrt, no staging dir, and a retry succeeds.
  • A dask write that fails partway cleans up the same way and retries (ran 5x for the threaded race).
  • Full geotiff suite passes (6025 passed, 81 skipped).

Dedupe duplicate module rows (last-write-wins by last_inspected) and
collapse multi-line notes to single physical lines. The notes had
embedded newlines, which the merge=union .gitattributes strategy splits
record-by-record, corrupting the file into a 156-column phantom row on
parallel-agent appends. One line per record keeps union merges safe.
Stage tiles in a temp directory and atomically rename it into place once
every tile is written, then write the VRT index. On any failure during
tile writing, the staging dir is removed so the final tiles directory is
never created and a retry starts clean -- previously a failed write left
a partial *_tiles dir that the leftover-state guard then rejected on the
next attempt.

On the dask path, capture each task's outcome so dask.compute does not
abort while sibling threads are still writing, then re-raise the first
failure after all writes have settled; this keeps the cleanup from racing
in-flight worker threads.

Add tests covering a successful tiled write, a numpy write that fails
partway, and a dask write that fails partway, asserting no poisoned
output remains and that a retry succeeds.
@github-actions github-actions Bot added the performance PR touches performance-sensitive code label May 29, 2026

@brendancol brendancol left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review: Make VRT tiled writes atomic to avoid poisoned partial output (#2669)

Blockers (must fix before merge)

None.

Suggestions (should fix, not blocking)

  • xrspatial/geotiff/_writers/eager.py:1436 -- os.replace(staging_dir, tiles_dir) onto a pre-existing empty tiles_dir is not portable. The old code called os.makedirs(tiles_dir, exist_ok=True), so a pre-existing empty tiles dir was tolerated and reused. On POSIX, os.replace over an empty target dir works, but on Windows replacing an existing directory raises even when it is empty. The leftover-state guard at line 1185 only rejects a non-empty tiles dir, so an empty one slips through to the rename. Suggest removing an empty pre-existing tiles_dir (os.rmdir) just before the replace, or documenting POSIX-only support for that edge case.

Nits (optional improvements)

  • xrspatial/geotiff/_writers/eager.py:1424 -- raise r re-raises the captured exception but rebinds the traceback to this line. The original tile-write traceback is preserved on r, so this is acceptable; just noting the stack will point at the barrier rather than the failing tile.

What looks good

  • Staging-dir-plus-atomic-rename is the right shape for the fix, and the same-parent-dir placement keeps the rename atomic.
  • The dask race was caught and handled correctly: capturing each task outcome so dask.compute runs every task to completion before re-raising means cleanup no longer races in-flight writer threads. The test runs this path repeatedly to flush out the race.
  • Tests cover the three real paths: clean write, numpy mid-write failure with retry, dask mid-write failure with retry. They assert no staging dir, no tiles dir, and no VRT remain after failure, and that a retry succeeds.
  • Temp filenames carry the issue number, per project convention.
  • The leftover-state guard comment now explains why a retry after failure is no longer blocked.

Checklist

  • Algorithm matches reference/paper: n/a (IO transaction fix)
  • All implemented backends produce consistent results: yes (numpy/cupy eager, dask streaming)
  • NaN handling is correct: unchanged by this PR
  • Edge cases covered by tests: failure-and-retry covered; pre-existing empty tiles dir not covered
  • Dask chunk boundaries handled correctly: unchanged; tile grid still from chunks
  • No premature materialization or unnecessary copies: yes
  • Benchmark exists or not needed: not needed (no perf-path change beyond the existing threaded compute)
  • README feature matrix updated: n/a (no new public API)
  • Docstrings present and accurate: yes

os.replace onto an existing directory raises on Windows even when the
target is empty, and an empty leftover *_tiles dir passes the non-empty
leftover-state guard. Drop the empty target with os.rmdir before the
rename so the promotion works on every platform, matching the old
makedirs(exist_ok=True) tolerance. Add a test for the empty-dir reuse
case.

@brendancol brendancol left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up review (after commit 9c5deaa)

Disposition of prior findings

  • Suggestion (eager.py rename): fixed. The rename now drops a pre-existing empty tiles_dir with os.rmdir before os.replace, so promotion works on Windows too. A test covers the empty-dir reuse case (test_preexisting_empty_tiles_dir_is_reused).
  • Nit (raise r traceback): dismissed. raise r re-raises the captured exception with its original __traceback__ intact, so the failing tile's stack is preserved; no change needed.

Remaining

No blockers, suggestions, or nits. The write/vrt suites pass (1517 passed, 1 skipped).

@brendancol brendancol merged commit 82dfd8e into main May 29, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance PR touches performance-sensitive code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

VRT tiled writes are not atomic and leave poisoned partial output on failure

1 participant