Empty-zone count semantics in zonal.stats: NaN vs 0 #2644

@brendancol

Reason or Problem

In xrspatial/zonal.py, stats() computes per-zone summary statistics. For a
zone whose cells are all NaN or all nodata_values (an "empty" zone), the numpy
path's _calc_stats only calls the stat function when len(zone_values) > 0,
so results[i] stays NaN for every statistic, including count. The cupy and
dask paths match this: an empty zone reports count as NaN.

For mean, sum, std, and the others, NaN is a defensible answer to "what is
the mean of no values". For count it is awkward. A count is a cardinality (the
number of valid cells in the zone), and the natural value for an empty zone is
0, not NaN. Downstream code that filters or sums on counts
(df[df['count'] > 0], df['count'].sum()) breaks or silently drops rows when
the count column carries NaN.
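
A minimal sketch of the symptom, using plain NumPy arrays rather than a DataFrame. The arrays are made-up stand-ins for a count column, not real stats() output. Note that pandas reductions skip NaN by default, so the exact failure mode there differs from what plain NumPy shows here.

```python
import numpy as np

# Hypothetical count column for three zones; the middle zone is empty.
counts_nan = np.array([3.0, np.nan, 5.0])   # current behavior
counts_zero = np.array([3.0, 0.0, 5.0])     # proposed behavior

# A plain sum propagates NaN, so the total is lost.
total_nan = counts_nan.sum()
total_zero = counts_zero.sum()

# Comparisons against NaN are always False, so "count == 0"
# never finds the empty zone under the current behavior.
empty_found_nan = int((counts_nan == 0).sum())
empty_found_zero = int((counts_zero == 0).sum())

print(total_nan, total_zero, empty_found_nan, empty_found_zero)
```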

The current behavior is pinned by tests (test_stats_all_nan_zone,
test_stats_all_nan_zone_preserved), so changing it is a behavior change, not a
bug fix. This needs an explicit decision and documentation either way.

Proposal

Two options:

Option A -- keep NaN for empty zones, document it. Leave the behavior as is.
Add a paragraph to the stats() docstring stating that empty zones report NaN
for every statistic, including count. No code or test changes beyond the
docstring.

Option B (recommended for count only) -- empty-zone count returns 0.
Treat count as a cardinality. An empty zone reports count = 0 while
mean, min, max, sum, std, and var stay NaN. Document the rule
explicitly and update the tests that pin count = NaN for empty zones to expect
0, with a comment and commit message explaining the deliberate change.

Design (Option B):

  • numpy: _calc_stats already detects the empty-zone branch (len == 0); set
    the result to 0 when the statistic being computed is count.
  • cupy: _stats_cupy has an explicit zone_values.size == 0 branch that appends
    float('nan') per stat; append 0 for count.
  • dask: the count reducer uses _nanreduce_preserve_allnan(..., np.nansum),
    which forces NaN when all blocks are NaN. Count should instead use plain
    np.nansum so an all-empty zone sums to 0 across blocks.
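
The design above could be sketched roughly as follows. This is an illustration under assumptions, not the xrspatial code: zone_stat, preserve_allnan_sum, and count_across_blocks are hypothetical names that stand in for the empty-zone branch and for the two block-combining strategies described in the dask bullet.

```python
import numpy as np


def zone_stat(zone_values, stat_name, stat_func):
    """Toy per-zone reducer (hypothetical; not the real _calc_stats)."""
    values = np.asarray(zone_values, dtype=float)
    valid = values[~np.isnan(values)]
    if valid.size > 0:
        return float(stat_func(valid))
    # Empty zone: count is a cardinality, everything else stays NaN.
    return 0.0 if stat_name == "count" else float("nan")


def preserve_allnan_sum(block_values):
    """Stand-in for a reducer that forces NaN when every block is NaN."""
    arr = np.asarray(block_values, dtype=float)
    if np.all(np.isnan(arr)):
        return float("nan")
    return float(np.nansum(arr))


def count_across_blocks(block_counts):
    """Plain nansum: NaN blocks contribute nothing, all-NaN sums to 0."""
    return float(np.nansum(np.asarray(block_counts, dtype=float)))


empty = [np.nan, np.nan]
print(zone_stat(empty, "count", len), zone_stat(empty, "mean", np.mean))
print(preserve_allnan_sum(empty), count_across_blocks(empty))
```

The point of the second half is that np.nansum already returns 0 for all-NaN input, so the count reducer needs no special case once the NaN-preserving wrapper is dropped for count.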

Usage: No API change. stats(zones, values, stats_funcs=['count', 'mean'])
returns count = 0 and mean = NaN for an empty zone.

Value: count becomes safe to use in numeric filters and aggregations
without special-casing NaN, and the empty-zone semantics are documented rather
than implicit.

Stakeholders and Impacts

Users of zonal.stats who request count. Impact is limited to empty zones
(all-NaN or all-nodata). Only count changes; other statistics keep NaN.
crosstab and apply have their own count paths and are out of scope unless
they share the affected code.

Drawbacks

It is a behavior change. Code that currently checks isnan(count) to detect
empty zones would need to check count == 0 instead. The change would be
accompanied by a clear docstring note and a migration comment in the tests.
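
For callers that need to work across both behaviors during a transition, a check along these lines might help. is_empty_zone is a hypothetical helper written for this issue, not part of xrspatial.

```python
import math


def is_empty_zone(count):
    """Return True for an empty zone under either convention.

    Accepts count == 0 (proposed) as well as NaN (current behavior).
    """
    return count == 0 or math.isnan(count)


print(is_empty_zone(float("nan")), is_empty_zone(0), is_empty_zone(4))
```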

Alternatives

Option A (document-only) avoids the behavior change but leaves count as NaN,
which is the awkward value this issue flags.

Unresolved Questions

Whether to extend the 0-for-empty rule to crosstab/apply count paths. This
proposal scopes the change to stats() only.

Additional Notes or Context

If during implementation Option B turns out to cascade into crosstab/apply
or other shared code in a risky way, fall back to Option A (document-only) and
record why in the PR. Either way the docstring must state the empty-zone count
semantics precisely.

Metadata

Labels

api (API design and consistency), proposal (Idea that needs design discussion)
