GH-50182: [C++][Parquet] Fix truncated min/max statistics for all-infinity floating-point columns#50183
GH-50182: [C++][Parquet] Fix truncated min/max statistics for all-infinity floating-point columns#50183clee704 wants to merge 6 commits into
Conversation
…ll-infinity float columns Seed the floating-point and Float16 min/max accumulation with +/-infinity instead of the largest/smallest finite value. A finite seed is not an identity for min/max once the data contains infinities, so an all-+Inf column previously reported min = FLT_MAX/DBL_MAX (and an all--Inf column reported max = -FLT_MAX/-DBL_MAX) rather than the infinity actually present -- a value written with is_min_value_exact = true. Update the empty/all-NaN detection in CleanStatistic and CleanFloat16Statistic to match the new seeds; an all-NaN or empty column still leaves min = +Inf, max = -Inf (min > max, impossible for real data) and so emits no statistics. Add a TestFloatStatistics.Infinities typed test covering FLOAT, DOUBLE and FLOAT16. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Fixes Parquet statistics min/max computation for floating-point columns containing infinities (GH-50182), ensuring infinity values are not lost due to finite seeding.
Changes:
- Seed floating-point running min/max with ±infinity instead of largest/smallest finite values.
- Update statistic “no valid values” detection to match the new seeding strategy (including Float16).
- Add unit tests covering all-±Inf and mixed ±Inf columns for float/double/float16.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| cpp/src/parquet/statistics_test.cc | Adds regression tests validating min/max behavior for infinity-only and mixed-infinity columns. |
| cpp/src/parquet/statistics.cc | Changes default min/max seeds for floats (and Float16) to ±infinity and adjusts cleaning logic accordingly. |
|
|
…zedMinMax The empty/all-NaN detection in CleanStatistic / CleanFloat16Statistic now uses a single IsUninitializedMinMax(min, max) helper (max < min) instead of comparing against the +/-infinity seeds directly. It relies only on the invariant DefaultMin > DefaultMax, so the seed and its detection cannot drift. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Switching the Float16 DefaultMin()/DefaultMax() seeds to the +/-infinity constants left Float16Constants::max()/lowest() (and their backing members) without any caller. Drop the dead accessors and members. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The helper tests max < min. "Uninitialized" overstated what it checks: a valid min/max pair always has min <= max, so max < min is simply an invalid pair -- either the inverted DefaultMin()/DefaultMax() seeds (no value observed) or an inverted caller-supplied range. Rename to IsInvalidMinMax and reword the comments so the predicate is definitional rather than implying an "iff uninitialized" claim. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Extend TestFloatStatistics.Infinities with [+Inf, NaN, +Inf] and [-Inf, NaN, -Inf] cases (float/double/Float16). This is the only case that intersects the new +/-infinity seeds with NaN coalescing on a column whose true min/max is infinite, and it confirms a NaN does not corrupt the infinite min/max. Verified to fail on the pre-fix code (min/max came back as the largest finite value). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
| return ::std::nullopt; | ||
| } | ||
|
|
||
| if (min == std::numeric_limits<T>::max() && max == std::numeric_limits<T>::lowest()) { |
There was a problem hiding this comment.
This was a logical AND, but IsInvalidMinMax will return true if either condition holds. This looks like a behavior change, does it matter in practice?
There was a problem hiding this comment.
In practice there's no change or this is better. When min/max is set, both are set always and it's always min <= max. This change also catches an invalid SetMinMax call that sets min > max. So it's a better guard than before.
|
|
||
| constexpr c_type inf = std::numeric_limits<c_type>::infinity(); | ||
| constexpr c_type nan = std::numeric_limits<c_type>::quiet_NaN(); | ||
| std::array<c_type, 3> all_pos_inf{inf, inf, inf}; |
There was a problem hiding this comment.
You can just use std::vector as a container instead of explicitly declaring array sizes like this.
There was a problem hiding this comment.
It avoids heap allocation but I agree this is an overkill in test code. Simplified code with std::vector
Per review feedback, the infinity test inputs don't need the explicit array sizing; std::vector is simpler and reads more clearly in test code. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Rationale for this change
When every value in a
FLOAT,DOUBLE, orFLOAT16column is the same infinity, the Parquet writer stored a finite min/max statistic instead of the infinity actually present in the column:+Inf→minwritten asFLT_MAX/DBL_MAX(should be+Inf)-Inf→maxwritten as-FLT_MAX/-DBL_MAX(should be-Inf)The running min/max were seeded with
std::numeric_limits<T>::max()/::lowest()— the largest/smallest finite values. These are not identity elements for min/max once the data contains infinities:Min(FLT_MAX, +Inf)keepsFLT_MAX, so an all-+Infcolumn never displaces the seed (and symmetrically for-Inf). The wrong value is then written withis_min_value_exact = true, so a reader is told the exact minimum of an all-+Infcolumn isFLT_MAX— a value not even present in the column. Range-based row-group filtering stays correct (the bound is only loosened in the safe direction), but engines that read min/max statistics as values (statistics-basedMIN/MAXevaluation, metadata introspection) return wrong results.See GH-50182 for a full reproduction in both PyArrow and C++.
What changes are included in this PR?
CompareHelper::DefaultMin()/DefaultMax()now seed floating-point accumulation with+Inf/-Inf(the true identities) instead of the largest finite values; integral types are unchanged.Float16comparator, via newFloat16Constants::positive_infinity()/negative_infinity().CleanStatistic/CleanFloat16Statistic(which relied on the old finite seeds) is updated to match the new±Infseeds. An all-NaN or empty column still leavesmin = +Inf, max = -Inf(min > max, which real data can never produce), so it stays unambiguously detectable and continues to emit no statistics.Are these changes tested?
Yes. A new typed test
TestFloatStatistics.Infinities(coveringFLOAT,DOUBLE, andFLOAT16) asserts that all-+Inf, all--Inf, and mixed±Infcolumns produce the exact infinity as min/max (the existingAssertMinMaxArehelper also checksis_min_value_exact/is_max_value_exact). The existing all-NaN tests continue to pass, exercising the updated empty/all-NaN detection.Are there any user-facing changes?
Parquet files written for all-infinity floating-point columns now record the correct
±Infmin/max statistics instead of the largest finite value. No public API signatures change.Two minor behavior changes on existing public low-level APIs are worth noting (the normal writer accumulation path is unaffected, since a column with ≥1 non-NaN value always yields
min ≤ max):TypedComparator<FloatType/DoubleType/Float16LogicalType>::GetMinMax()now returns(+Inf, -Inf)for an all-NaN (or empty) batch, where it previously returned the largest/smallest finite value as a sentinel.TypedStatistics::SetMinMax()now treats any inverted (min > max) input as "no valid statistics", whereas it previously only special-cased that one exact finite sentinel pair.This PR contains a "Critical Fix". It fixes a bug that produced incorrect data: the writer emitted min/max statistics that are wrong yet flagged exact (
is_min_value_exact = true), which can yield incorrect results for readers that evaluateMIN/MAXdirectly from Parquet statistics.