Rewrite stale per-band SCALE/OFFSET in to_geotiff(pack=True) after band-subset reads #3175
Merged
…nd-subset reads (#3161)
brendancol
commented
Jun 10, 2026
PR Review: Rewrite stale per-band SCALE/OFFSET in to_geotiff(pack=True) after band-subset reads
Blockers (must fix before merge)
None.
Suggestions (should fix, not blocking)
None.
Nits (optional improvements)
- xrspatial/geotiff/_attrs.py:1728-1738 -- the rewrite block only normalizes SCALE/OFFSET. Other per-band items (band descriptions, STATISTICS_*) survive a band-subset pack with their source indices intact, e.g. a single-band packed file can still carry ('STATISTICS_MAXIMUM', 1). They don't affect pixel values, and they can't be re-indexed without knowing which band was read, so leaving them is the right call. Worth one sentence in the comment so the next reader knows it's deliberate rather than missed.
- xrspatial/geotiff/tests/write/test_pack_band_subset_3161.py:106 and :130 -- test_pack_band_subset_selected_band_without_scale and test_pack_full_read_uniform_per_band_scale run eager-only while the two tests above them parametrize numpy/dask. The metadata logic is backend-independent, but the parametrize line is cheap and keeps the file symmetric.
What looks good
- The trigger is scoped to arrays carrying unpack state (scale_factor / mask_and_scale_dtype), so a plain masked=True read of a scaled file keeps its valid source metadata instead of having it collapsed to identity.
- The rewrite builds a new dict rather than mutating attrs['gdal_metadata'] in place. That matters: _finalize_eager_read documents that nested attr values are shared with the caller's seed dict, so in-place mutation would leak into the read-side object.
- Dropping gdal_metadata_xml on rewrite is necessary, not incidental: _extract_rich_tags (_attrs.py:1441-1446) prefers the raw XML over the dict, so leaving it would re-emit the stale per-band items anyway.
- Verified by hand: a source with dataset-level SCALE=0.5 plus distinct per-band entries keeps the dataset-level value verbatim (it won on read) and drops only the per-band noise; the packed file re-reads to identical values.
- Tests hit the exact issue repro (full re-read used to raise MixedBandMetadataError, band=0 used to apply 0.1 instead of 0.2), plus per-band OFFSET, a selected band with no SCALE entry of its own, uniform per-band full reads, and the dataset-level-verbatim guarantee. Tmp names carry the issue number.
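The non-mutating rewrite and the XML drop praised above can be sketched roughly as follows. This is an illustration under assumptions, not the code in _attrs.py: the helper name rewrite_stale_band_items, its applied_scale / applied_offset parameters, and the plain string keys 'SCALE' / 'OFFSET' for dataset-level entries are hypothetical. Only the (name, band_index) tuple form for per-band keys and the gdal_metadata / gdal_metadata_xml attr names come from the review. The unpack-state trigger check is left out for brevity.

```python
def rewrite_stale_band_items(attrs, applied_scale, applied_offset):
    """Return attrs with per-band SCALE/OFFSET collapsed to dataset level.

    Hypothetical helper: the key layout is assumed, not taken from xrspatial.
    """
    meta = attrs.get("gdal_metadata") or {}
    stale = {
        key for key in meta
        if isinstance(key, tuple) and key[0] in ("SCALE", "OFFSET")
    }
    if not stale:
        # Dataset-level-only metadata is returned untouched, raw XML included.
        return attrs

    # Build a fresh dict so a dict shared with the caller is never mutated.
    new_meta = {key: value for key, value in meta.items() if key not in stale}
    # An existing dataset-level value is kept verbatim; otherwise record
    # the pair that was actually applied on read.
    new_meta.setdefault("SCALE", applied_scale)
    new_meta.setdefault("OFFSET", applied_offset)

    # Drop the raw XML so a writer that prefers it cannot re-emit stale items.
    new_attrs = {k: v for k, v in attrs.items() if k != "gdal_metadata_xml"}
    new_attrs["gdal_metadata"] = new_meta
    return new_attrs
```

Non-SCALE per-band items such as ('STATISTICS_MAXIMUM', 1) pass through unchanged in this sketch, matching the behavior the first nit asks to document.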
Checklist
- Algorithm matches reference: the applied (scale, offset) pair is by construction the single pair valid for every band present, and dataset-level entries take precedence in _extract_scale_offset
- Backends: rewrite runs in _pack before write dispatch; numpy and dask round trips both tested (mask_and_scale is CPU eager + dask only)
- NaN handling unchanged (sentinel fill path untouched)
- Edge cases covered (identity scale, offset-only, mixed dataset+per-band)
- No dask materialization added (attrs-only logic)
- Benchmark not needed (metadata dict work on the write path)
- README/docs: no public API change; pack docstring promise now actually holds
- Docstrings updated (_pack documents the per-band exception)
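The precedence rule in the first checklist item can be illustrated with a small resolver. This is a sketch of the idea only, assuming the same metadata layout as the review's examples (string keys for dataset-level entries are an assumption, tuple keys for per-band entries are from the review). resolve_scale_offset and MixedBandScaleError are hypothetical stand-ins for _extract_scale_offset and MixedBandMetadataError, whose real signatures are not shown in this PR. The identity defaults (1.0, 0.0) for missing entries are also assumed.

```python
class MixedBandScaleError(ValueError):
    """Stand-in for the library's MixedBandMetadataError (assumed behavior)."""


def resolve_scale_offset(meta):
    """Pick the single (scale, offset) pair valid for every band, or raise."""

    def resolve(name, default):
        if name in meta:
            # A dataset-level entry takes precedence over per-band entries.
            return float(meta[name])
        per_band = {
            float(value) for key, value in meta.items()
            if isinstance(key, tuple) and key[0] == name
        }
        if not per_band:
            return default  # no entry at all: identity
        if len(per_band) > 1:
            raise MixedBandScaleError(
                f"bands disagree on {name}: {sorted(per_band)}"
            )
        return per_band.pop()

    return resolve("SCALE", 1.0), resolve("OFFSET", 0.0)
```

Under these assumptions, stale per-band entries with distinct values raise, while the same entries next to a dataset-level value resolve cleanly, which is why rewriting to dataset level makes the packed file re-readable.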
…maining pack tests on dask too (#3161)
brendancol
commented
Jun 10, 2026
Follow-up review after 9ab38ab
Both nits from the first pass are addressed:
- _attrs.py:1739-1741 now says outright that non-SCALE per-band items (band descriptions, STATISTICS_*) are left alone because they don't affect pixel values and can't be re-indexed without the original band index.
- test_pack_band_subset_selected_band_without_scale and test_pack_full_read_uniform_per_band_scale now run on numpy and dask like the rest of the file (22 tests total, all passing locally).
No new findings. The delta is a comment plus two parametrize lines; nothing else changed.
Closes #3161
_pack() kept attrs['gdal_metadata'] verbatim. That's right for a full-file round trip, but after a band-subset unpack=True read the kept metadata still describes the source's band indices, so a pack=True write re-emitted per-band SCALE/OFFSET entries for bands the output file doesn't have. Reading the packed file back raised MixedBandMetadataError, and band=0 applied the wrong band's scale with no error.
The fix: when the array carries unpack state (scale_factor / mask_and_scale_dtype) and the metadata has per-band (SCALE, i) / (OFFSET, i) entries, _pack() replaces them with dataset-level values holding the pair that was actually applied on read. The stale gdal_metadata_xml attr is dropped so the writer rebuilds GDAL_METADATA from the rewritten dict. Dataset-level-only metadata is untouched, including the raw XML. Plain masked reads are also untouched, since they never applied the per-band scale.
The rewrite happens in _pack(), which runs before write dispatch, so numpy and dask writes both get it (mask_and_scale is a CPU eager + dask read feature).
Test plan:
- tests/write/test_pack_band_subset_3161.py: the issue repro (distinct per-band SCALE, band-subset read, pack, re-read full and band=0), per-band OFFSET, selected band without a SCALE entry, full read with uniform per-band scale, dataset-level metadata kept verbatim
- Existing pack tests (test_pack_3064.py) still pass
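To make the silent band=0 failure from the description concrete, here is a small arithmetic sketch. It assumes the conventional decoding value = stored * SCALE + OFFSET and 0-based per-band indices. The scales 0.1 and 0.2 mirror the issue repro, while the helper names and sample values are illustrative only.

```python
def pack_values(values, scale, offset=0.0):
    """Encode physical values as stored integers (assumed convention)."""
    return [round((v - offset) / scale) for v in values]


def unpack_values(stored, scale, offset=0.0):
    """Decode stored integers back to physical values."""
    return [s * scale + offset for s in stored]


# Source file: band 0 uses SCALE=0.1, band 1 uses SCALE=0.2.
source_scales = {0: 0.1, 1: 0.2}

# A band-subset read of source band 1 yields these physical values.
band1_values = [1.0, 2.0, 3.0]

# pack=True writes them with the scale that was applied on read.
stored = pack_values(band1_values, source_scales[1])  # [5, 10, 15]

# The output file has a single band at index 0. With the stale metadata kept
# verbatim, index 0 still points at the source's band 0 scale.
decoded_stale = unpack_values(stored, source_scales[0])

# With the per-band entries rewritten to the applied pair, decoding is exact
# up to floating-point rounding.
decoded_fixed = unpack_values(stored, source_scales[1])
```

In this sketch decoded_stale comes out at half the true values with no error raised, which is the quiet corruption the rewrite in _pack() removes.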