GH-50182: [C++][Parquet] Fix truncated min/max statistics for all-infinity floating-point columns by clee704 · Pull Request #50183 · apache/arrow

clee704 · 2026-06-15T17:36:49Z

Rationale for this change

When every value in a FLOAT, DOUBLE, or FLOAT16 column is the same infinity, the Parquet writer stored a finite min/max statistic instead of the infinity actually present in the column:

all +Inf → min written as FLT_MAX / DBL_MAX (should be +Inf)
all -Inf → max written as -FLT_MAX / -DBL_MAX (should be -Inf)

The running min/max were seeded with std::numeric_limits<T>::max() / ::lowest() — the largest/smallest finite values. These are not identity elements for min/max once the data contains infinities: Min(FLT_MAX, +Inf) keeps FLT_MAX, so an all-+Inf column never displaces the seed (and symmetrically for -Inf). The wrong value is then written with is_min_value_exact = true, so a reader is told the exact minimum of an all-+Inf column is FLT_MAX — a value not even present in the column. Range-based row-group filtering stays correct (the bound is only loosened in the safe direction), but engines that read min/max statistics as values (statistics-based MIN/MAX evaluation, metadata introspection) return wrong results.

See GH-50182 for a full reproduction in both PyArrow and C++.

What changes are included in this PR?

CompareHelper::DefaultMin() / DefaultMax() now seed floating-point accumulation with +Inf / -Inf (the true identities) instead of the largest finite values; integral types are unchanged.
The same change is applied to the Float16 comparator, via new Float16Constants::positive_infinity() / negative_infinity().
The "no valid value was seen" detection in CleanStatistic / CleanFloat16Statistic (which relied on the old finite seeds) is updated to match the new ±Inf seeds. An all-NaN or empty column still leaves min = +Inf, max = -Inf (min > max, which real data can never produce), so it stays unambiguously detectable and continues to emit no statistics.

Are these changes tested?

Yes. A new typed test TestFloatStatistics.Infinities (covering FLOAT, DOUBLE, and FLOAT16) asserts that all-+Inf, all--Inf, and mixed ±Inf columns produce the exact infinity as min/max (the existing AssertMinMaxAre helper also checks is_min_value_exact/is_max_value_exact). The existing all-NaN tests continue to pass, exercising the updated empty/all-NaN detection.

Are there any user-facing changes?

Parquet files written for all-infinity floating-point columns now record the correct ±Inf min/max statistics instead of the largest finite value. No public API signatures change.

Two minor behavior changes on existing public low-level APIs are worth noting (the normal writer accumulation path is unaffected, since a column with ≥1 non-NaN value always yields min ≤ max):

TypedComparator<FloatType/DoubleType/Float16LogicalType>::GetMinMax() now returns (+Inf, -Inf) for an all-NaN (or empty) batch, where it previously returned the largest/smallest finite value as a sentinel.
TypedStatistics::SetMinMax() now treats any inverted (min > max) input as "no valid statistics", whereas it previously only special-cased that one exact finite sentinel pair.

This PR contains a "Critical Fix". It fixes a bug that produced incorrect data: the writer emitted min/max statistics that are wrong yet flagged exact (is_min_value_exact = true), which can yield incorrect results for readers that evaluate MIN/MAX directly from Parquet statistics.

GitHub Issue: [C++][Parquet] Min/max statistics for all-infinity floating-point columns are truncated to the largest finite value #50182

…ll-infinity float columns Seed the floating-point and Float16 min/max accumulation with +/-infinity instead of the largest/smallest finite value. A finite seed is not an identity for min/max once the data contains infinities, so an all-+Inf column previously reported min = FLT_MAX/DBL_MAX (and an all--Inf column reported max = -FLT_MAX/-DBL_MAX) rather than the infinity actually present -- a value written with is_min_value_exact = true. Update the empty/all-NaN detection in CleanStatistic and CleanFloat16Statistic to match the new seeds; an all-NaN or empty column still leaves min = +Inf, max = -Inf (min > max, impossible for real data) and so emits no statistics. Add a TestFloatStatistics.Infinities typed test covering FLOAT, DOUBLE and FLOAT16. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Fixes Parquet statistics min/max computation for floating-point columns containing infinities (GH-50182), ensuring infinity values are not lost due to finite seeding.

Changes:

Seed floating-point running min/max with ±infinity instead of largest/smallest finite values.
Update statistic “no valid values” detection to match the new seeding strategy (including Float16).
Add unit tests covering all-±Inf and mixed ±Inf columns for float/double/float16.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
cpp/src/parquet/statistics_test.cc	Adds regression tests validating min/max behavior for infinity-only and mixed-infinity columns.
cpp/src/parquet/statistics.cc	Changes default min/max seeds for floats (and Float16) to ±infinity and adjusts cleaning logic accordingly.

github-actions · 2026-06-15T17:38:59Z

⚠️ GitHub issue #50182 has been automatically assigned in GitHub to PR creator.

…zedMinMax The empty/all-NaN detection in CleanStatistic / CleanFloat16Statistic now uses a single IsUninitializedMinMax(min, max) helper (max < min) instead of comparing against the +/-infinity seeds directly. It relies only on the invariant DefaultMin > DefaultMax, so the seed and its detection cannot drift. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Switching the Float16 DefaultMin()/DefaultMax() seeds to the +/-infinity constants left Float16Constants::max()/lowest() (and their backing members) without any caller. Drop the dead accessors and members. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

The helper tests max < min. "Uninitialized" overstated what it checks: a valid min/max pair always has min <= max, so max < min is simply an invalid pair -- either the inverted DefaultMin()/DefaultMax() seeds (no value observed) or an inverted caller-supplied range. Rename to IsInvalidMinMax and reword the comments so the predicate is definitional rather than implying an "iff uninitialized" claim. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Extend TestFloatStatistics.Infinities with [+Inf, NaN, +Inf] and [-Inf, NaN, -Inf] cases (float/double/Float16). This is the only case that intersects the new +/-infinity seeds with NaN coalescing on a column whose true min/max is infinite, and it confirms a NaN does not corrupt the infinite min/max. Verified to fail on the pre-fix code (min/max came back as the largest finite value). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

pitrou · 2026-06-16T09:00:57Z

    return ::std::nullopt;
  }

-  if (min == std::numeric_limits<T>::max() && max == std::numeric_limits<T>::lowest()) {


This was a logical AND, but IsInvalidMinMax will return true if either condition holds. This looks like a behavior change, does it matter in practice?

In practice there's no change or this is better. When min/max is set, both are set always and it's always min <= max. This change also catches an invalid SetMinMax call that sets min > max. So it's a better guard than before.

pitrou · 2026-06-16T09:05:20Z

+
+  constexpr c_type inf = std::numeric_limits<c_type>::infinity();
+  constexpr c_type nan = std::numeric_limits<c_type>::quiet_NaN();
+  std::array<c_type, 3> all_pos_inf{inf, inf, inf};


You can just use std::vector as a container instead of explicitly declaring array sizes like this.

It avoids heap allocation but I agree this is an overkill in test code. Simplified code with std::vector

Per review feedback, the infinity test inputs don't need the explicit array sizing; std::vector is simpler and reads more clearly in test code. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings June 15, 2026 17:36

clee704 requested review from pitrou and wgtmac as code owners June 15, 2026 17:36

Copilot AI reviewed Jun 15, 2026

View reviewed changes

Comment thread cpp/src/parquet/statistics.cc Outdated

Comment thread cpp/src/parquet/statistics.cc

github-actions Bot added the awaiting review Awaiting review label Jun 15, 2026

github-actions Bot added Component: Parquet Component: C++ labels Jun 15, 2026

github-actions Bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jun 15, 2026

Copilot AI review requested due to automatic review settings June 15, 2026 18:18

Copilot AI reviewed Jun 15, 2026

View reviewed changes

clee704 and others added 2 commits June 15, 2026 18:54

Copilot AI review requested due to automatic review settings June 15, 2026 19:14

Copilot AI reviewed Jun 15, 2026

View reviewed changes

Comment thread cpp/src/parquet/statistics.cc

pitrou reviewed Jun 16, 2026

View reviewed changes

Use std::vector instead of fixed-size std::array in TestInfinities

cae7c27

Per review feedback, the infinity test inputs don't need the explicit array sizing; std::vector is simpler and reads more clearly in test code. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GH-50182: [C++][Parquet] Fix truncated min/max statistics for all-infinity floating-point columns#50183

GH-50182: [C++][Parquet] Fix truncated min/max statistics for all-infinity floating-point columns#50183
clee704 wants to merge 6 commits into
apache:mainfrom
clee704:gh-50182-parquet-inf-stats

clee704 commented Jun 15, 2026 •

edited by github-actions Bot

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented Jun 15, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

pitrou Jun 16, 2026

Uh oh!

clee704 Jun 16, 2026

Uh oh!

pitrou Jun 16, 2026

Uh oh!

clee704 Jun 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

clee704 commented Jun 15, 2026 • edited by github-actions Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented Jun 15, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

pitrou Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

clee704 Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

pitrou Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

clee704 Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

clee704 commented Jun 15, 2026 •

edited by github-actions Bot

Loading