Skip to content

Rewrite stale per-band SCALE/OFFSET in to_geotiff(pack=True) after band-subset reads#3175

Merged
brendancol merged 4 commits into
mainfrom
issue-3161
Jun 10, 2026
Merged

Rewrite stale per-band SCALE/OFFSET in to_geotiff(pack=True) after band-subset reads#3175
brendancol merged 4 commits into
mainfrom
issue-3161

Conversation

@brendancol

Copy link
Copy Markdown
Contributor

Closes #3161

_pack() kept attrs['gdal_metadata'] verbatim. That's right for a full-file round trip, but after a band-subset unpack=True read the kept metadata still describes the source's band indices, so a pack=True write re-emitted per-band SCALE/OFFSET entries for bands the output file doesn't have. Reading the packed file back raised MixedBandMetadataError, and band=0 applied the wrong band's scale with no error.

The fix: when the array carries unpack state (scale_factor / mask_and_scale_dtype) and the metadata has per-band (SCALE, i) / (OFFSET, i) entries, _pack() replaces them with dataset-level values holding the pair that was actually applied on read. The stale gdal_metadata_xml attr is dropped so the writer rebuilds GDAL_METADATA from the rewritten dict. Dataset-level-only metadata is untouched, including the raw XML. Plain masked reads are also untouched, since they never applied the per-band scale.

The rewrite happens in _pack(), which runs before write dispatch, so numpy and dask writes both get it (mask_and_scale is a CPU eager + dask read feature).

Test plan:

  • New tests in tests/write/test_pack_band_subset_3161.py: the issue repro (distinct per-band SCALE, band-subset read, pack, re-read full and band=0), per-band OFFSET, selected band without a SCALE entry, full read with uniform per-band scale, dataset-level metadata kept verbatim
  • Existing pack tests (test_pack_3064.py) still pass
  • geotiff write/attrs/unit/read/round-trip suites pass locally

@github-actions github-actions Bot added the performance PR touches performance-sensitive code label Jun 10, 2026

@brendancol brendancol left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review: Rewrite stale per-band SCALE/OFFSET in to_geotiff(pack=True) after band-subset reads

Blockers (must fix before merge)

None.

Suggestions (should fix, not blocking)

None.

Nits (optional improvements)

  • xrspatial/geotiff/_attrs.py:1728-1738 -- the rewrite block only normalizes SCALE/OFFSET. Other per-band items (band descriptions, STATISTICS_*) survive a band-subset pack with their source indices intact, e.g. a single-band packed file can still carry ('STATISTICS_MAXIMUM', 1). They don't affect pixel values, and they can't be re-indexed without knowing which band was read, so leaving them is the right call. Worth one sentence in the comment so the next reader knows it's deliberate rather than missed.
  • xrspatial/geotiff/tests/write/test_pack_band_subset_3161.py:106 and :130 -- test_pack_band_subset_selected_band_without_scale and test_pack_full_read_uniform_per_band_scale run eager-only while the two tests above them parametrize numpy/dask. The metadata logic is backend-independent, but the parametrize line is cheap and keeps the file symmetric.

What looks good

  • The trigger is scoped to arrays carrying unpack state (scale_factor / mask_and_scale_dtype), so a plain masked=True read of a scaled file keeps its valid source metadata instead of having it collapsed to identity.
  • The rewrite builds a new dict rather than mutating attrs['gdal_metadata'] in place. That matters: _finalize_eager_read documents that nested attr values are shared with the caller's seed dict, so in-place mutation would leak into the read-side object.
  • Dropping gdal_metadata_xml on rewrite is necessary, not incidental: _extract_rich_tags (_attrs.py:1441-1446) prefers the raw XML over the dict, so leaving it would re-emit the stale per-band items anyway.
  • Verified by hand: a source with dataset-level SCALE=0.5 plus distinct per-band entries keeps the dataset-level value verbatim (it won on read) and drops only the per-band noise; the packed file re-reads to identical values.
  • Tests hit the exact issue repro (full re-read used to raise MixedBandMetadataError, band=0 used to apply 0.1 instead of 0.2), plus per-band OFFSET, a selected band with no SCALE entry of its own, uniform per-band full reads, and the dataset-level-verbatim guarantee. Tmp names carry the issue number.

Checklist

  • Algorithm matches reference: the applied (scale, offset) pair is by construction the single pair valid for every band present, and dataset-level entries take precedence in _extract_scale_offset
  • Backends: rewrite runs in _pack before write dispatch; numpy and dask round trips both tested (mask_and_scale is CPU eager + dask only)
  • NaN handling unchanged (sentinel fill path untouched)
  • Edge cases covered (identity scale, offset-only, mixed dataset+per-band)
  • No dask materialization added (attrs-only logic)
  • Benchmark not needed (metadata dict work on the write path)
  • README/docs: no public API change; pack docstring promise now actually holds
  • Docstrings updated (_pack documents the per-band exception)

@brendancol brendancol left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up review after 9ab38ab

Both nits from the first pass are addressed:

  • _attrs.py:1739-1741 now says outright that non-SCALE per-band items (band descriptions, STATISTICS_*) are left alone because they don't affect pixel values and can't be re-indexed without the original band index.
  • test_pack_band_subset_selected_band_without_scale and test_pack_full_read_uniform_per_band_scale now run on numpy and dask like the rest of the file (22 tests total, all passing locally).

No new findings. The delta is a comment plus two parametrize lines; nothing else changed.

@brendancol brendancol merged commit c7a8b64 into main Jun 10, 2026
6 of 7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance PR touches performance-sensitive code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

to_geotiff(pack=True) re-emits the source's full per-band SCALE/OFFSET after a band-subset read; round trip corrupts values

1 participant