bench: add correlated-proxy case to the predicate_eval suite by adriangb · Pull Request #22919 · apache/datafusion

adriangb · 2026-06-11T22:57:02Z

Which issue does this PR close?

Related to #11262, "Reorder boolean expressions (including filter predicates) according to evaluation cost / selectivity" (predicate evaluation ordering).
Extends the predicate_eval suite added in #22704, "bench: add predicate_eval SQL micro-benchmark suite for conjunctive filter evaluation". No single issue closed.

Rationale for this change

The correlation subgroup's existing cases (q70–q72) use two predicates of equal
cost and equal selectivity. For two conjuncts the evaluation cost of an order is
cost(first) + selectivity(first) × cost(second), which is symmetric here — so
the two orders cost the same and correlation only affects the result cardinality.
These cases measure the overhead of an ordering system, but give it no
opportunity: nothing in the suite rewards (or even detects) correlation-aware
ordering.
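
The symmetry argument can be checked with a few lines of Python. This is only a sketch of the two-conjunct formula quoted above; the cost and selectivity values are illustrative assumptions, not measurements from q70–q72.

```python
def chain_cost(cost_first, sel_first, cost_second):
    """Expected per-row cost of `first AND second` with short-circuiting:
    cost(first) + selectivity(first) * cost(second)."""
    return cost_first + sel_first * cost_second


# Illustrative numbers only: two predicates of equal cost and equal selectivity.
cost_a = cost_b = 1.0
sel_a = sel_b = 0.3

a_then_b = chain_cost(cost_a, sel_a, cost_b)
b_then_a = chain_cost(cost_b, sel_b, cost_a)

# Both orders cost the same. Correlation between a and b changes how many rows
# survive both predicates, but it does not appear in this formula at all.
print(a_then_b, b_then_a)
```

With equal costs and equal selectivities there is nothing for an ordering system to gain, whatever the correlation between the two predicates.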

This adds a case with real, measurable headroom that only joint statistics can
find. A cheap integer predicate (c0 = 1, ~30%) is a perfect proxy for three
string regexes on s1; a fourth regex on s2 has the same ~30% selectivity and
similar cost but is independent. Marginally, the four regexes are
indistinguishable in any position. Conditionally — behind the proxy — the three
s1 regexes keep every survivor while the s2 regex still discards ~70%.
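
The marginal-versus-conditional distinction can be reproduced on a toy dataset. The sketch below is a hypothetical stand-in for the idea, not the contents of load/corrproxy.sql: the row count, the string values and the regex patterns are all assumptions chosen so that the numbers come out exactly.

```python
import re

N = 1000  # hypothetical row count, unrelated to PRED_ROWS


def make_row(i):
    c0 = 1 if i % 10 < 3 else 0                       # ~30% of rows
    s1 = "alpha" if c0 == 1 else "omega"              # fully determined by c0
    s2 = "alpha" if (i // 10) % 10 < 3 else "omega"   # ~30%, independent of c0
    return c0, s1, s2


rows = [make_row(i) for i in range(N)]

# Three regexes on s1 (column index 1) and one on s2 (column index 2).
predicates = {
    "s1 ~ '^al'": lambda r: re.search(r"^al", r[1]) is not None,
    "s1 ~ 'ph'": lambda r: re.search(r"ph", r[1]) is not None,
    "s1 ~ 'ha$'": lambda r: re.search(r"ha$", r[1]) is not None,
    "s2 ~ '^al'": lambda r: re.search(r"^al", r[2]) is not None,
}


def selectivity(pred, data):
    """Fraction of rows in `data` that satisfy `pred`."""
    return sum(1 for r in data if pred(r)) / len(data)


survivors = [r for r in rows if r[0] == 1]  # rows that pass c0 = 1

marginal = {name: selectivity(p, rows) for name, p in predicates.items()}
conditional = {name: selectivity(p, survivors) for name, p in predicates.items()}

for name in predicates:
    print(f"{name:12s} marginal={marginal[name]:.2f} behind proxy={conditional[name]:.2f}")
```

On this toy data all four regexes have the same marginal selectivity, while behind the proxy the s1 regexes pass every row and the s2 regex still filters.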

The query is written in the natural-but-pessimal order (the redundant regexes
grouped with their proxy, the informative one last). On an M-series laptop the
written order runs ~1.9x slower than the hand-optimal order [c0, s2-regex, s1-regexes...] (16.4 ms vs 8.6 ms per iteration), so:

- an ordering system using marginal per-predicate statistics (or an
  independence assumption) is blind to the difference: every ranking of the four
  regexes looks equivalent;
- a system measuring the predicates' joint behaviour can reliably collect ~1.9x.
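
To make the two bullets concrete, here is a rough cost model comparing the written and hand-optimal orders under both views. The per-row costs (1 for the integer check, 10 for each regex) are made-up assumptions, and the model treats each s1 regex as passing exactly the rows where c0 = 1; it is not derived from the benchmark timings.

```python
# Hypothetical per-row costs and marginal selectivities.
COST = {"c0": 1.0, "s1_a": 10.0, "s1_b": 10.0, "s1_c": 10.0, "s2": 10.0}
MARGINAL = {"c0": 0.3, "s1_a": 0.3, "s1_b": 0.3, "s1_c": 0.3, "s2": 0.3}
PROXY_GROUP = {"c0", "s1_a", "s1_b", "s1_c"}  # assumed to pass the same rows


def cost_independent(order):
    """Expected per-row cost if every predicate filtered independently."""
    frac, total = 1.0, 0.0
    for p in order:
        total += frac * COST[p]
        frac *= MARGINAL[p]
    return total


def cost_joint(order):
    """Expected per-row cost when the proxy group is perfectly correlated."""
    frac, total = 1.0, 0.0
    group_applied = False
    for p in order:
        total += frac * COST[p]
        if p in PROXY_GROUP:
            if not group_applied:
                frac *= MARGINAL[p]  # first member of the group filters
                group_applied = True
            # later members are redundant: they keep every survivor
        else:
            frac *= MARGINAL[p]      # s2 filters independently
    return total


written = ["c0", "s1_a", "s1_b", "s1_c", "s2"]
optimal = ["c0", "s2", "s1_a", "s1_b", "s1_c"]

print("independence model:", cost_independent(written), cost_independent(optimal))
print("joint model:       ", cost_joint(written), cost_joint(optimal))
```

Under the independence model the two orders get the same estimate, so there is nothing to rank on; under the joint model the written order is clearly more expensive. The exact ratio depends on the assumed costs.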

What changes are included in this PR?

- load/corrproxy.sql: the correlated-proxy dataset (deterministic, generated
  from generate_series like the existing datasets; PRED_ROWS/PRED_FILL
  knobs as elsewhere).
- queries/correlation/q73.sql, benchmarks/correlation/q73.benchmark: the new
  case, following the suite's existing conventions.

Run with: BENCH_NAME=predicate_eval BENCH_SUBGROUP=correlation cargo bench --bench sql

Are these changes tested?

The suite's shared template asserts the query returns rows; the case runs green
locally alongside q70–q72.

Are there any user-facing changes?

No — benchmark-only.

🤖 Generated with Claude Code

The correlation subgroup's existing cases (q70-q72) use two predicates of equal cost and equal selectivity, so the two orders cost the same and correlation only affects the result cardinality - no ordering system can win or lose on them. They measure overhead, not opportunity.

Add q73: a cheap integer predicate that is a perfect proxy for three string regexes, plus one independent regex of the same ~30% selectivity and similar cost. Marginal statistics cannot tell the four regexes apart in any position; their joint distribution with the proxy is what matters.

Written in the natural-but-pessimal order (redundant regexes grouped with their proxy), the query runs ~1.9x slower than the hand-optimal order [c0, s2, s1...] on an M-series laptop, so a correlation-aware ordering system has real, measurable headroom here while an independence-assuming one is blind to it.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…posal

Marginal per-conjunct statistics are blind to correlation: arrangements with very different costs can be statistically identical (the new predicate_eval correlation_q73 case has ~1.9x headroom invisible to any independence-based ranking), and the fused-vs-compact-once strategy difference is invisible to per-conjunct numbers entirely.

Borrow the exploration idea from DuckDB's AdaptiveFilter (keep-or-revert timing of random swaps, src/execution/adaptive_filter.cpp): when a measuring window ends with nothing material to propose, occasionally put a random adjacent swap of the incumbent through the existing shared paired A/B trial instead of freezing. Each position carries a likelihood (halved when a swap there loses its trial, restored to 100 when one wins), so exploration of barren positions decays geometrically on top of the re-thaw backoff. The candidate bypasses the model gates by design (it exists because the model cannot see it), but adoption still requires the same measured, confidence-separated end-to-end win as any other proposal, which is a stronger keep-or-revert rule than DuckDB's strict mean comparison.

On correlation_q73 (PR apache#22919) this captures 1.28x of the ~1.9x headroom within each 122-batch query (convergence needs two specific adjacent swaps; the rest needs cross-query persistence, cf. DuckDB's multi-file adaptive filter cache, left as future work). Tied micro-queries pay ~4-6% for the exploration trials they decline.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
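
The per-position likelihood bookkeeping described in that commit message could look roughly like the sketch below. It is an illustration only, not the DataFusion or DuckDB code: the class name, the method names, the seeded RNG and the floor of 1 on the likelihood are all assumptions, and the paired A/B trial itself is left out (the caller is expected to run it and report the outcome).

```python
import random


class SwapExplorer:
    """Hypothetical bookkeeping for random adjacent-swap exploration."""

    def __init__(self, n_predicates, seed=0):
        # One likelihood (0..100) per adjacent pair of positions.
        self.likelihood = [100] * (n_predicates - 1)
        self.rng = random.Random(seed)

    def propose(self):
        """Pick a random adjacent position; return it only if its
        likelihood gate passes, otherwise return None (no trial)."""
        pos = self.rng.randrange(len(self.likelihood))
        if self.rng.randrange(100) < self.likelihood[pos]:
            return pos
        return None

    def record(self, pos, won):
        """Update the position after its trial: restore to 100 on a win,
        halve on a loss (the floor of 1 is an assumption of this sketch)."""
        if won:
            self.likelihood[pos] = 100
        else:
            self.likelihood[pos] = max(1, self.likelihood[pos] // 2)


def apply_swap(order, pos):
    """Return a copy of `order` with the predicates at pos and pos + 1 swapped."""
    swapped = list(order)
    swapped[pos], swapped[pos + 1] = swapped[pos + 1], swapped[pos]
    return swapped


explorer = SwapExplorer(n_predicates=5)
incumbent = ["c0", "s1_a", "s1_b", "s1_c", "s2"]

pos = explorer.propose()          # every gate is open at likelihood 100
candidate = apply_swap(incumbent, pos)
explorer.record(pos, won=False)   # pretend the trial lost
print(pos, candidate, explorer.likelihood)
```

Repeated losses at one position halve its likelihood each time, so barren positions are explored less and less often, while a single win restores the position to 100.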

adriangb wants to merge 1 commit into apache:main from pydantic:predicate-eval-correlated-case
