hotspots() computes global stats eagerly for Dask input instead of staying lazy #2772

@brendancol

Describe the bug

hotspots() is meant to have a lazy Dask path, but the Dask and Dask+CuPy versions call da.compute() on the global mean and standard deviation while the task graph is still being built. See xrspatial/focal.py near lines 1289/1293 (Dask+NumPy) and 1336 (Dask+CuPy).

As a result, hotspots() runs about a dozen Dask tasks during the call itself, reading every chunk once before the caller asks for anything. A Dask-backed call should defer all work until .compute(), and this one doesn't.

The z-score step needs a global mean and std, which are whole-array reductions. Right now those get resolved eagerly. They can stay as lazy 0-d Dask reductions folded into the graph, so the normalization and classification stay deferred too.
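To make the idea concrete, here is a minimal sketch of a z-score that keeps its global statistics lazy. It is not the code in xrspatial/focal.py: `lazy_zscore` is a hypothetical helper, and plain `.mean()` / `.std()` are an assumption (the real function may use different reductions or extra checks).

```python
import dask.array as da
import numpy as np


def lazy_zscore(data):
    """Hypothetical helper: z-score a Dask array without computing anything."""
    # Both reductions are 0-d Dask arrays; nothing is executed here.
    global_mean = data.mean()
    global_std = data.std()
    # The 0-d arrays broadcast against every chunk, so the normalization
    # is just more nodes in the same task graph.
    return (data - global_mean) / global_std


values = np.arange(16, dtype="float64").reshape(4, 4)
chunked = da.from_array(values, chunks=(2, 2))

z_lazy = lazy_zscore(chunked)                      # still a dask array
z_eager = (values - values.mean()) / values.std()  # NumPy reference
```

Calling `z_lazy.compute()` should then agree with `z_eager`, which is the equivalence the fix would need to preserve.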

Expected behavior

Calling hotspots() on a Dask input should build a graph and compute nothing. The global mean and std stay lazy, broadcast into the per-chunk z-score, and the result matches the eager version once you actually compute it.

Additional context

The existing laziness test only checks that the return value is a Dask array (xrspatial/tests/test_dask_laziness.py:115), so it passes even when tasks run eagerly. A better test would assert that nothing executes when hotspots() is called, e.g. with a scheduler callback that counts task runs.
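A rough sketch of such a check, using `dask.callbacks.Callback` to count task executions. `TaskCounter`, `count_tasks_during`, and the two toy z-score functions are hypothetical stand-ins for illustration, not part of xrspatial. Callbacks only fire on Dask's local schedulers, so the sketch pins the synchronous one.

```python
import dask
import dask.array as da
import numpy as np
from dask.callbacks import Callback


class TaskCounter(Callback):
    """Counts tasks started by a local Dask scheduler while active."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def _pretask(self, key, dsk, state):
        # Called by the scheduler just before each task runs.
        self.count += 1


def eager_stats_zscore(data):
    # Mimics the reported bug: global stats resolved during graph building.
    mean, std = da.compute(data.mean(), data.std())
    return (data - mean) / std


def lazy_stats_zscore(data):
    # Mimics the expected behavior: stats stay in the graph.
    return (data - data.mean()) / data.std()


def count_tasks_during(func, data):
    """Return how many tasks ran while func(data) was being called."""
    with dask.config.set(scheduler="synchronous"), TaskCounter() as counter:
        func(data)
    return counter.count


chunked = da.from_array(np.arange(16, dtype="float64").reshape(4, 4), chunks=(2, 2))
```

In a real test, the toy functions would be replaced by the hotspots() call on a Dask-backed raster, with an assertion that the counter is still zero afterwards. A check like this would not see tasks submitted to a distributed cluster, since those bypass the local-scheduler callbacks.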

Labels: bug (Something isn't working), dask (Dask backend / chunked arrays), performance (PR touches performance-sensitive code)
