From 0ecd2b8329b536e446dd690b739583e740a6a14d Mon Sep 17 00:00:00 2001 From: Brendan Collins Date: Tue, 19 May 2026 12:04:23 -0700 Subject: [PATCH 1/2] polygonize: propagate raster CRS to GeoDataFrame output (#2149) polygonize(raster, return_type="geopandas") returned a GeoDataFrame with crs=None even when the input DataArray carried CRS info via attrs["crs"], attrs["crs_wkt"], or rioxarray's rio.crs. Downstream spatial joins, overlays, and file writes silently lost georeferencing. A new _detect_raster_crs helper mirrors the resolution order in reproject._crs_utils._detect_source_crs (attrs first, then crs_wkt, then rio.crs) and returns the raw attribute so GeoDataFrame.set_crs handles parsing. The CRS is detected at the public API level, before backend dispatch, so all four backends (numpy / cupy / dask+numpy / dask+cupy) emit the same CRS. An unparseable CRS attribute is caught so the call never crashes -- the GeoDataFrame is returned without CRS in that case. spatialpandas does not expose a CRS slot and GeoJSON RFC 7946 is WGS84-only, so propagation lives only on the geopandas path. 8 new tests in TestPolygonizeCRSPropagation cover EPSG string and int attrs, crs_wkt, no-CRS, unparseable CRS, attrs-vs-rioxarray preference, rioxarray-only detection, and interaction with simplify_tolerance. Also updates .claude/sweep-metadata-state.csv with the 2026-05-19 polygonize audit notes. --- .claude/sweep-metadata-state.csv | 1 + xrspatial/polygonize.py | 53 ++++++++++++++++- xrspatial/tests/test_polygonize.py | 94 ++++++++++++++++++++++++++++++ 3 files changed, 147 insertions(+), 1 deletion(-) diff --git a/.claude/sweep-metadata-state.csv b/.claude/sweep-metadata-state.csv index 3160c2fdb..5047db05f 100644 --- a/.claude/sweep-metadata-state.csv +++ b/.claude/sweep-metadata-state.csv @@ -1,4 +1,5 @@ module,last_inspected,issue,severity_max,categories_found,notes geotiff,2026-05-18,1909,HIGH,4;5,"Re-audit 2026-05-15 (agent-a55b69cec1ef2a092 worktree, branch deep-sweep-metadata-geotiff-2026-05-15). 4-backend (numpy/cupy/dask+numpy/dask+cupy) parity reverified after the #1813 modular refactor: full reads, windowed reads, multi-band, band=N selection, no-georef integer pixel coords, crs/crs_wkt/transform/nodata/x_resolution/y_resolution/resolution_unit/image_description/gdal_metadata all agree across backends. DataArray .name and dims agree (y, x for 2D; y, x, band for 3D). NEW HIGH finding #1909: GDS chunked GPU path (_read_geotiff_gpu_chunked_gds) declared the dask graph dtype as float64 when source had an in-range integer nodata sentinel, matching the CPU dask path's #1597 contract, but the per-chunk _chunk_task did not cast its returned cupy array to declared_dtype -- chunks with no sentinel hit returned the raw uint16/int16 source dtype, producing a silent declared/actual dtype mismatch. Fix mirrors the #1597 + #1624 CPU dask pattern: compute declared_dtype before defining _chunk_task, cast inside the task only when arr.dtype != declared_dtype to skip the no-op astype(copy=True). 6 regression tests added in test_chunked_gpu_declared_dtype_1909.py covering declared vs computed parity, CPU/GPU dask declared-dtype agreement, eager paths preserve source dtype, no-nodata round-trip, explicit dtype= kwarg, and sentinel-hit float64 promotion. Pre-existing test failures in test_predictor2_big_endian_gpu_1517.py and test_size_param_validation_gpu_vrt_1776.py exist on main (read_to_array AttributeError after #1813 refactor, tile_size=4 rejected by stricter _validate_tile_size_arg) and are unrelated to this audit. | Re-audited 2026-05-18 (agent-a59a61958f181c31a worktree, branch deep-sweep-metadata-geotiff-2026-05-18). 4-backend (numpy / cupy / dask+numpy / dask+cupy) metadata parity reverified end-to-end: open_geotiff over a tiled uint16 fixture with crs + transform + GDAL_NODATA sentinel emits identical attrs across all 4 backends (crs=32633, crs_wkt, transform 6-tuple, nodata=5, masked_nodata=True, _xrspatial_geotiff_contract=2, extra_tags, image_description, resolution_unit, x_resolution, y_resolution). Multi-band 3D (y, x, band) with band coord, no-georef int64 pixel coords, windowed reads with transform origin shift, and mask_nodata=False keeping integer dtype all agree across the 4 backends. Write round-trip via to_geotiff (numpy, cupy, dask streaming) re-emits crs / transform / nodata / masked_nodata / contract version with byte-stable transform. Band-first (band, y, x) input correctly remaps to (y, x, band) on disk. _populate_attrs_from_geo_info, _set_nodata_attrs, and _extract_rich_tags centralise attrs emission across all read paths (_init_, _backends/dask, _backends/gpu, _backends/vrt) and write paths (_writers/eager, _writers/gpu, _writers/vrt). _ATTRS_CONTRACT_VERSION=2 is stamped on every path including the chunked GPU GDS and chunked VRT inline-attrs branches. No new CRITICAL/HIGH/MEDIUM/LOW findings." +polygonize,2026-05-19,2149,MEDIUM,1,"Audited 2026-05-19 (agent-ad1070530d37a4fdf worktree, branch deep-sweep-metadata-polygonize-2026-05-19). Output is vector (column, polygon_points / GeoDataFrame / GeoJSON dict / awkward) so Cat 2/3 do not apply in the DataArray sense. Cat 1 MEDIUM finding #2149: GeoDataFrame output drops raster.attrs['crs'] (and crs_wkt and rioxarray rio.crs); GeoDataFrame.crs is always None even when input is georeferenced. Fix: new _detect_raster_crs helper + crs= kwarg threaded into _to_geopandas; df.set_crs is called when a CRS is detected. spatialpandas has no CRS slot and GeoJSON RFC 7946 is WGS84-only, so propagation lives only on the geopandas path. CRS propagation runs at the public API level so all 4 backends (numpy / cupy / dask+numpy / dask+cupy) propagate consistently -- verified end-to-end with EPSG:4326 attrs across all 4 backends. 8 new tests in TestPolygonizeCRSPropagation cover EPSG string/int, crs_wkt, no CRS, unparseable CRS, attrs-vs-rioxarray preference, rioxarray-only path, and simplify interaction. Cat 2 LOW (not fixed): output coords are pixel-space when input has georeferenced x/y or attrs['transform']; user must pass transform= explicitly. Documented behavior, leave as-is. Cat 4 LOW (not fixed): nodatavals from input attrs is not auto-applied as a mask; documented behavior (explicit mask= kwarg)." rasterize,2026-05-17,2018,HIGH,1;2;4,"rasterize() drops like.attrs, rebuilds like.coords via linspace (not bit-identical), and never emits _FillValue/nodatavals even when fill is non-NaN. Cat 1 HIGH: chained pipelines like slope(rasterize(gdf, like=elevation)) silently lose crs/res/transform. Cat 2 MEDIUM: linspace round-trip from re-derived bounds breaks xr.align with like. Cat 4 MEDIUM: rasterize(..., fill=-9999, dtype=int32) emits no _FillValue. All 4 backends share the same final return so the fix is one place. Fixed in deep-sweep-metadata-rasterize-2026-05-17-01 (worktree agent-ab7a9aee97c1e4cdf): _extract_grid_from_like now returns coords/attrs; rasterize() reuses like.coords directly when grid matches, copies like.attrs, and emits _FillValue + nodatavals when fill is not NaN. 9 new tests in TestMetadataPropagation cover attrs propagation, bit-identical coord reuse, fill-value emission, isolation from template attrs, and parity across numpy/cupy/dask+numpy/dask+cupy backends. Full test suite (193 passing) clean." reproject,2026-05-10,1572;1573,HIGH,1;3;4,geoid_height_raster dropped input attrs and used dims[-2:] for 3D inputs (#1572). reproject/merge ignored nodatavals (rasterio convention) when rioxarray absent (#1573). Fixed in same branch. diff --git a/xrspatial/polygonize.py b/xrspatial/polygonize.py index ef94625ce..5e28cbd14 100644 --- a/xrspatial/polygonize.py +++ b/xrspatial/polygonize.py @@ -473,10 +473,46 @@ def _to_awkward( return column, ak_array +def _detect_raster_crs(raster: xr.DataArray): + """Detect the CRS of an input raster for output georeferencing. + + Mirrors the resolution order in + ``xrspatial.reproject._crs_utils._detect_source_crs`` without + pulling pyproj as a hard dependency: returns the raw attribute + value (string / int / pyproj.CRS / rioxarray CRS) so the caller can + hand it straight to ``GeoDataFrame.set_crs``, which performs its own + parsing. + + Resolution order: + + 1. ``raster.attrs['crs']`` (xrspatial.geotiff convention) + 2. ``raster.attrs['crs_wkt']`` + 3. ``raster.rio.crs`` (rioxarray, if installed) + 4. ``None`` + """ + crs_attr = raster.attrs.get('crs') + if crs_attr is not None: + return crs_attr + + crs_wkt = raster.attrs.get('crs_wkt') + if crs_wkt is not None: + return crs_wkt + + try: + rio_crs = raster.rio.crs + if rio_crs is not None: + return rio_crs + except Exception: + pass + + return None + + def _to_geopandas( column: List[Union[int, float]], polygon_points: List[np.ndarray], column_name: str, + crs=None, ): import geopandas as gpd import shapely @@ -507,6 +543,16 @@ def _to_geopandas( polygons[i] = Polygon(pts[0], pts[1:]) df = gpd.GeoDataFrame({column_name: column, "geometry": polygons}) + if crs is not None: + # GeoPandas accepts pyproj.CRS, EPSG ints, "EPSG:XXXX" strings, + # WKT strings, and the CRS objects rioxarray exposes; let it + # parse whatever the raster carried. + try: + df = df.set_crs(crs) + except Exception: + # Don't fail the whole call if the CRS value is unparseable; + # GeoDataFrame is still functional, just without CRS. + pass return df @@ -1687,7 +1733,12 @@ def polygonize( elif return_type == "awkward": return _to_awkward(column, polygon_points) elif return_type == "geopandas": - return _to_geopandas(column, polygon_points, column_name) + # Propagate the raster's CRS to the GeoDataFrame so downstream + # spatial joins / reprojections / file writes keep their + # georeferencing. spatialpandas has no CRS slot and GeoJSON + # RFC 7946 is WGS84-only, so the propagation lives only here. + crs = _detect_raster_crs(raster) + return _to_geopandas(column, polygon_points, column_name, crs=crs) elif return_type == "spatialpandas": return _to_spatialpandas(column, polygon_points, column_name) elif return_type == "geojson": diff --git a/xrspatial/tests/test_polygonize.py b/xrspatial/tests/test_polygonize.py index 07ead34f5..f38056ac0 100644 --- a/xrspatial/tests/test_polygonize.py +++ b/xrspatial/tests/test_polygonize.py @@ -1114,3 +1114,97 @@ def test_rejects_non_dataarray_mask(self): from xrspatial.polygonize import polygonize with pytest.raises(TypeError, match="xarray.DataArray"): polygonize(self._good_raster(), mask=np.ones((4, 4), dtype=bool)) + + +# ===================================================================== +# Issue #2149: GeoDataFrame output drops input CRS +# ===================================================================== + + +@pytest.mark.skipif(gpd is None, reason="geopandas not installed") +class TestPolygonizeCRSPropagation: + """polygonize(return_type='geopandas') preserves the raster CRS (#2149).""" + + @staticmethod + def _raster_with_attrs(**attrs): + data = np.array([[0, 0, 1], [0, 4, 0], [0, 0, 0]], dtype=np.int32) + return xr.DataArray( + data, + dims=('y', 'x'), + coords={'y': np.arange(3), 'x': np.arange(3)}, + attrs=attrs, + ) + + def test_crs_attr_epsg_string(self): + raster = self._raster_with_attrs(crs="EPSG:4326") + df = polygonize(raster, return_type="geopandas") + assert df.crs is not None + assert df.crs.to_epsg() == 4326 + + def test_crs_attr_epsg_int(self): + raster = self._raster_with_attrs(crs=3857) + df = polygonize(raster, return_type="geopandas") + assert df.crs is not None + assert df.crs.to_epsg() == 3857 + + def test_crs_wkt_attr(self): + # Use pyproj if available to get a canonical WKT. + pyproj = pytest.importorskip("pyproj") + wkt = pyproj.CRS.from_epsg(4326).to_wkt() + raster = self._raster_with_attrs(crs_wkt=wkt) + df = polygonize(raster, return_type="geopandas") + assert df.crs is not None + assert df.crs.to_epsg() == 4326 + + def test_no_crs_attr_yields_none(self): + """A raster without CRS info should still produce a GeoDataFrame + (with crs=None), not crash.""" + data = np.array([[0, 0, 1], [0, 4, 0], [0, 0, 0]], dtype=np.int32) + raster = xr.DataArray(data, dims=('y', 'x')) + df = polygonize(raster, return_type="geopandas") + assert df.crs is None + + def test_unparseable_crs_does_not_crash(self): + """An invalid CRS attr should not break the polygonize call.""" + raster = self._raster_with_attrs(crs="not a real crs at all") + # Should not raise; just yields no CRS on the GeoDataFrame. + df = polygonize(raster, return_type="geopandas") + assert isinstance(df, gpd.GeoDataFrame) + + def test_crs_prefers_attrs_over_rio(self): + """When both attrs['crs'] and rio.crs exist, attrs wins.""" + pytest.importorskip("rioxarray") + raster = self._raster_with_attrs(crs="EPSG:4326") + # Write a different CRS via rioxarray. + raster = raster.rio.write_crs("EPSG:3857") + # rioxarray writes its CRS into attrs['crs'] when not present, + # but we set it explicitly above, so attrs['crs'] should win. + raster.attrs['crs'] = "EPSG:4326" + df = polygonize(raster, return_type="geopandas") + assert df.crs.to_epsg() == 4326 + + def test_crs_from_rioxarray_only(self): + """rioxarray's CRS should be picked up when no attrs are set.""" + pytest.importorskip("rioxarray") + data = np.array([[0, 0, 1], [0, 4, 0], [0, 0, 0]], dtype=np.int32) + raster = xr.DataArray( + data, + dims=('y', 'x'), + coords={'y': np.arange(3), 'x': np.arange(3)}, + ) + raster = raster.rio.write_crs("EPSG:3857") + # rioxarray adds a spatial_ref coord but typically does not write + # ``attrs['crs']`` on the DataArray itself. Force-clear in case. + raster.attrs.pop('crs', None) + raster.attrs.pop('crs_wkt', None) + df = polygonize(raster, return_type="geopandas") + assert df.crs is not None + assert df.crs.to_epsg() == 3857 + + def test_no_crs_attr_simplify_still_works(self): + """CRS propagation must not interact badly with simplify.""" + raster = self._raster_with_attrs(crs="EPSG:4326") + df = polygonize( + raster, return_type="geopandas", simplify_tolerance=0.5) + assert df.crs is not None + assert df.crs.to_epsg() == 4326 From 75ea3a88dcd42522c549a3d979610969b05d7767 Mon Sep 17 00:00:00 2001 From: Brendan Collins Date: Wed, 20 May 2026 07:03:34 -0700 Subject: [PATCH 2/2] polygonize: document CRS propagation; clarify test comment Adds a Notes paragraph to the public polygonize() docstring describing the new GeoDataFrame CRS propagation (resolution order, what happens when the value is unparseable, why spatialpandas/geojson return types do not carry CRS). Also corrects the comment in test_crs_prefers_attrs_over_rio: rio.write_crs stores the CRS on a spatial_ref coord, not in attrs['crs']. Review feedback on #2154. --- xrspatial/polygonize.py | 8 ++++++++ xrspatial/tests/test_polygonize.py | 6 +++--- 2 files changed, 11 insertions(+), 3 deletions(-) diff --git a/xrspatial/polygonize.py b/xrspatial/polygonize.py index 5e28cbd14..684fe608f 100644 --- a/xrspatial/polygonize.py +++ b/xrspatial/polygonize.py @@ -1668,6 +1668,14 @@ def polygonize( For Dask+CuPy, each chunk is transferred independently, keeping peak CPU memory proportional to chunk size rather than full raster size. + + When ``return_type="geopandas"``, the raster's CRS is propagated to + the output ``GeoDataFrame``. The resolution order is + ``raster.attrs['crs']``, then ``raster.attrs['crs_wkt']``, then + ``raster.rio.crs`` (if rioxarray is installed). An unparseable CRS + value is dropped rather than raised. The ``spatialpandas`` and + ``geojson`` return types do not carry CRS metadata: spatialpandas + has no CRS slot, and GeoJSON (RFC 7946) is WGS84 only. """ _validate_raster(raster, func_name='polygonize', name='raster', ndim=2) if raster.shape[0] < 1 or raster.shape[1] < 1: diff --git a/xrspatial/tests/test_polygonize.py b/xrspatial/tests/test_polygonize.py index f38056ac0..50fdc0832 100644 --- a/xrspatial/tests/test_polygonize.py +++ b/xrspatial/tests/test_polygonize.py @@ -1175,10 +1175,10 @@ def test_crs_prefers_attrs_over_rio(self): """When both attrs['crs'] and rio.crs exist, attrs wins.""" pytest.importorskip("rioxarray") raster = self._raster_with_attrs(crs="EPSG:4326") - # Write a different CRS via rioxarray. + # rio.write_crs stores the CRS on a spatial_ref coord, not in + # attrs['crs'], so re-setting attrs['crs'] below is what makes + # the precedence assertion meaningful. raster = raster.rio.write_crs("EPSG:3857") - # rioxarray writes its CRS into attrs['crs'] when not present, - # but we set it explicitly above, so attrs['crs'] should win. raster.attrs['crs'] = "EPSG:4326" df = polygonize(raster, return_type="geopandas") assert df.crs.to_epsg() == 4326