Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions .claude/sweep-metadata-state.csv
Original file line number Diff line number Diff line change
@@ -1,4 +1,5 @@
module,last_inspected,issue,severity_max,categories_found,notes
geotiff,2026-05-18,1909,HIGH,4;5,"Re-audit 2026-05-15 (agent-a55b69cec1ef2a092 worktree, branch deep-sweep-metadata-geotiff-2026-05-15). 4-backend (numpy/cupy/dask+numpy/dask+cupy) parity reverified after the #1813 modular refactor: full reads, windowed reads, multi-band, band=N selection, no-georef integer pixel coords, crs/crs_wkt/transform/nodata/x_resolution/y_resolution/resolution_unit/image_description/gdal_metadata all agree across backends. DataArray .name and dims agree (y, x for 2D; y, x, band for 3D). NEW HIGH finding #1909: GDS chunked GPU path (_read_geotiff_gpu_chunked_gds) declared the dask graph dtype as float64 when source had an in-range integer nodata sentinel, matching the CPU dask path's #1597 contract, but the per-chunk _chunk_task did not cast its returned cupy array to declared_dtype -- chunks with no sentinel hit returned the raw uint16/int16 source dtype, producing a silent declared/actual dtype mismatch. Fix mirrors the #1597 + #1624 CPU dask pattern: compute declared_dtype before defining _chunk_task, cast inside the task only when arr.dtype != declared_dtype to skip the no-op astype(copy=True). 6 regression tests added in test_chunked_gpu_declared_dtype_1909.py covering declared vs computed parity, CPU/GPU dask declared-dtype agreement, eager paths preserve source dtype, no-nodata round-trip, explicit dtype= kwarg, and sentinel-hit float64 promotion. Pre-existing test failures in test_predictor2_big_endian_gpu_1517.py and test_size_param_validation_gpu_vrt_1776.py exist on main (read_to_array AttributeError after #1813 refactor, tile_size=4 rejected by stricter _validate_tile_size_arg) and are unrelated to this audit. | Re-audited 2026-05-18 (agent-a59a61958f181c31a worktree, branch deep-sweep-metadata-geotiff-2026-05-18). 4-backend (numpy / cupy / dask+numpy / dask+cupy) metadata parity reverified end-to-end: open_geotiff over a tiled uint16 fixture with crs + transform + GDAL_NODATA sentinel emits identical attrs across all 4 backends (crs=32633, crs_wkt, transform 6-tuple, nodata=5, masked_nodata=True, _xrspatial_geotiff_contract=2, extra_tags, image_description, resolution_unit, x_resolution, y_resolution). Multi-band 3D (y, x, band) with band coord, no-georef int64 pixel coords, windowed reads with transform origin shift, and mask_nodata=False keeping integer dtype all agree across the 4 backends. Write round-trip via to_geotiff (numpy, cupy, dask streaming) re-emits crs / transform / nodata / masked_nodata / contract version with byte-stable transform. Band-first (band, y, x) input correctly remaps to (y, x, band) on disk. _populate_attrs_from_geo_info, _set_nodata_attrs, and _extract_rich_tags centralise attrs emission across all read paths (_init_, _backends/dask, _backends/gpu, _backends/vrt) and write paths (_writers/eager, _writers/gpu, _writers/vrt). _ATTRS_CONTRACT_VERSION=2 is stamped on every path including the chunked GPU GDS and chunked VRT inline-attrs branches. No new CRITICAL/HIGH/MEDIUM/LOW findings."
polygonize,2026-05-19,2149,MEDIUM,1,"Audited 2026-05-19 (agent-ad1070530d37a4fdf worktree, branch deep-sweep-metadata-polygonize-2026-05-19). Output is vector (column, polygon_points / GeoDataFrame / GeoJSON dict / awkward) so Cat 2/3 do not apply in the DataArray sense. Cat 1 MEDIUM finding #2149: GeoDataFrame output drops raster.attrs['crs'] (and crs_wkt and rioxarray rio.crs); GeoDataFrame.crs is always None even when input is georeferenced. Fix: new _detect_raster_crs helper + crs= kwarg threaded into _to_geopandas; df.set_crs is called when a CRS is detected. spatialpandas has no CRS slot and GeoJSON RFC 7946 is WGS84-only, so propagation lives only on the geopandas path. CRS propagation runs at the public API level so all 4 backends (numpy / cupy / dask+numpy / dask+cupy) propagate consistently -- verified end-to-end with EPSG:4326 attrs across all 4 backends. 8 new tests in TestPolygonizeCRSPropagation cover EPSG string/int, crs_wkt, no CRS, unparseable CRS, attrs-vs-rioxarray preference, rioxarray-only path, and simplify interaction. Cat 2 LOW (not fixed): output coords are pixel-space when input has georeferenced x/y or attrs['transform']; user must pass transform= explicitly. Documented behavior, leave as-is. Cat 4 LOW (not fixed): nodatavals from input attrs is not auto-applied as a mask; documented behavior (explicit mask= kwarg)."
rasterize,2026-05-17,2018,HIGH,1;2;4,"rasterize() drops like.attrs, rebuilds like.coords via linspace (not bit-identical), and never emits _FillValue/nodatavals even when fill is non-NaN. Cat 1 HIGH: chained pipelines like slope(rasterize(gdf, like=elevation)) silently lose crs/res/transform. Cat 2 MEDIUM: linspace round-trip from re-derived bounds breaks xr.align with like. Cat 4 MEDIUM: rasterize(..., fill=-9999, dtype=int32) emits no _FillValue. All 4 backends share the same final return so the fix is one place. Fixed in deep-sweep-metadata-rasterize-2026-05-17-01 (worktree agent-ab7a9aee97c1e4cdf): _extract_grid_from_like now returns coords/attrs; rasterize() reuses like.coords directly when grid matches, copies like.attrs, and emits _FillValue + nodatavals when fill is not NaN. 9 new tests in TestMetadataPropagation cover attrs propagation, bit-identical coord reuse, fill-value emission, isolation from template attrs, and parity across numpy/cupy/dask+numpy/dask+cupy backends. Full test suite (193 passing) clean."
reproject,2026-05-10,1572;1573,HIGH,1;3;4,geoid_height_raster dropped input attrs and used dims[-2:] for 3D inputs (#1572). reproject/merge ignored nodatavals (rasterio convention) when rioxarray absent (#1573). Fixed in same branch.
61 changes: 60 additions & 1 deletion xrspatial/polygonize.py
Original file line number Diff line number Diff line change
Expand Up @@ -473,10 +473,46 @@ def _to_awkward(
return column, ak_array


def _detect_raster_crs(raster: xr.DataArray):
"""Detect the CRS of an input raster for output georeferencing.

Mirrors the resolution order in
``xrspatial.reproject._crs_utils._detect_source_crs`` without
pulling pyproj as a hard dependency: returns the raw attribute
value (string / int / pyproj.CRS / rioxarray CRS) so the caller can
hand it straight to ``GeoDataFrame.set_crs``, which performs its own
parsing.

Resolution order:

1. ``raster.attrs['crs']`` (xrspatial.geotiff convention)
2. ``raster.attrs['crs_wkt']``
3. ``raster.rio.crs`` (rioxarray, if installed)
4. ``None``
"""
crs_attr = raster.attrs.get('crs')
if crs_attr is not None:
return crs_attr

crs_wkt = raster.attrs.get('crs_wkt')
if crs_wkt is not None:
return crs_wkt

try:
rio_crs = raster.rio.crs
if rio_crs is not None:
return rio_crs
except Exception:
pass

return None


def _to_geopandas(
column: List[Union[int, float]],
polygon_points: List[np.ndarray],
column_name: str,
crs=None,
):
import geopandas as gpd
import shapely
Expand Down Expand Up @@ -507,6 +543,16 @@ def _to_geopandas(
polygons[i] = Polygon(pts[0], pts[1:])

df = gpd.GeoDataFrame({column_name: column, "geometry": polygons})
if crs is not None:
# GeoPandas accepts pyproj.CRS, EPSG ints, "EPSG:XXXX" strings,
# WKT strings, and the CRS objects rioxarray exposes; let it
# parse whatever the raster carried.
try:
df = df.set_crs(crs)
except Exception:
# Don't fail the whole call if the CRS value is unparseable;
# GeoDataFrame is still functional, just without CRS.
pass
return df


Expand Down Expand Up @@ -1622,6 +1668,14 @@ def polygonize(

For Dask+CuPy, each chunk is transferred independently, keeping peak
CPU memory proportional to chunk size rather than full raster size.

When ``return_type="geopandas"``, the raster's CRS is propagated to
the output ``GeoDataFrame``. The resolution order is
``raster.attrs['crs']``, then ``raster.attrs['crs_wkt']``, then
``raster.rio.crs`` (if rioxarray is installed). An unparseable CRS
value is dropped rather than raised. The ``spatialpandas`` and
``geojson`` return types do not carry CRS metadata: spatialpandas
has no CRS slot, and GeoJSON (RFC 7946) is WGS84 only.
"""
_validate_raster(raster, func_name='polygonize', name='raster', ndim=2)
if raster.shape[0] < 1 or raster.shape[1] < 1:
Expand Down Expand Up @@ -1687,7 +1741,12 @@ def polygonize(
elif return_type == "awkward":
return _to_awkward(column, polygon_points)
elif return_type == "geopandas":
return _to_geopandas(column, polygon_points, column_name)
# Propagate the raster's CRS to the GeoDataFrame so downstream
# spatial joins / reprojections / file writes keep their
# georeferencing. spatialpandas has no CRS slot and GeoJSON
# RFC 7946 is WGS84-only, so the propagation lives only here.
crs = _detect_raster_crs(raster)
return _to_geopandas(column, polygon_points, column_name, crs=crs)
elif return_type == "spatialpandas":
return _to_spatialpandas(column, polygon_points, column_name)
elif return_type == "geojson":
Expand Down
94 changes: 94 additions & 0 deletions xrspatial/tests/test_polygonize.py
Original file line number Diff line number Diff line change
Expand Up @@ -1114,3 +1114,97 @@ def test_rejects_non_dataarray_mask(self):
from xrspatial.polygonize import polygonize
with pytest.raises(TypeError, match="xarray.DataArray"):
polygonize(self._good_raster(), mask=np.ones((4, 4), dtype=bool))


# =====================================================================
# Issue #2149: GeoDataFrame output drops input CRS
# =====================================================================


@pytest.mark.skipif(gpd is None, reason="geopandas not installed")
class TestPolygonizeCRSPropagation:
"""polygonize(return_type='geopandas') preserves the raster CRS (#2149)."""

@staticmethod
def _raster_with_attrs(**attrs):
data = np.array([[0, 0, 1], [0, 4, 0], [0, 0, 0]], dtype=np.int32)
return xr.DataArray(
data,
dims=('y', 'x'),
coords={'y': np.arange(3), 'x': np.arange(3)},
attrs=attrs,
)

def test_crs_attr_epsg_string(self):
raster = self._raster_with_attrs(crs="EPSG:4326")
df = polygonize(raster, return_type="geopandas")
assert df.crs is not None
assert df.crs.to_epsg() == 4326

def test_crs_attr_epsg_int(self):
raster = self._raster_with_attrs(crs=3857)
df = polygonize(raster, return_type="geopandas")
assert df.crs is not None
assert df.crs.to_epsg() == 3857

def test_crs_wkt_attr(self):
# Use pyproj if available to get a canonical WKT.
pyproj = pytest.importorskip("pyproj")
wkt = pyproj.CRS.from_epsg(4326).to_wkt()
raster = self._raster_with_attrs(crs_wkt=wkt)
df = polygonize(raster, return_type="geopandas")
assert df.crs is not None
assert df.crs.to_epsg() == 4326

def test_no_crs_attr_yields_none(self):
"""A raster without CRS info should still produce a GeoDataFrame
(with crs=None), not crash."""
data = np.array([[0, 0, 1], [0, 4, 0], [0, 0, 0]], dtype=np.int32)
raster = xr.DataArray(data, dims=('y', 'x'))
df = polygonize(raster, return_type="geopandas")
assert df.crs is None

def test_unparseable_crs_does_not_crash(self):
"""An invalid CRS attr should not break the polygonize call."""
raster = self._raster_with_attrs(crs="not a real crs at all")
# Should not raise; just yields no CRS on the GeoDataFrame.
df = polygonize(raster, return_type="geopandas")
assert isinstance(df, gpd.GeoDataFrame)

def test_crs_prefers_attrs_over_rio(self):
"""When both attrs['crs'] and rio.crs exist, attrs wins."""
pytest.importorskip("rioxarray")
raster = self._raster_with_attrs(crs="EPSG:4326")
# rio.write_crs stores the CRS on a spatial_ref coord, not in
# attrs['crs'], so re-setting attrs['crs'] below is what makes
# the precedence assertion meaningful.
raster = raster.rio.write_crs("EPSG:3857")
raster.attrs['crs'] = "EPSG:4326"
df = polygonize(raster, return_type="geopandas")
assert df.crs.to_epsg() == 4326

def test_crs_from_rioxarray_only(self):
"""rioxarray's CRS should be picked up when no attrs are set."""
pytest.importorskip("rioxarray")
data = np.array([[0, 0, 1], [0, 4, 0], [0, 0, 0]], dtype=np.int32)
raster = xr.DataArray(
data,
dims=('y', 'x'),
coords={'y': np.arange(3), 'x': np.arange(3)},
)
raster = raster.rio.write_crs("EPSG:3857")
# rioxarray adds a spatial_ref coord but typically does not write
# ``attrs['crs']`` on the DataArray itself. Force-clear in case.
raster.attrs.pop('crs', None)
raster.attrs.pop('crs_wkt', None)
df = polygonize(raster, return_type="geopandas")
assert df.crs is not None
assert df.crs.to_epsg() == 3857

def test_no_crs_attr_simplify_still_works(self):
"""CRS propagation must not interact badly with simplify."""
raster = self._raster_with_attrs(crs="EPSG:4326")
df = polygonize(
raster, return_type="geopandas", simplify_tolerance=0.5)
assert df.crs is not None
assert df.crs.to_epsg() == 4326
Loading