
feat: add max_row_group_bytes option to ParquetOptions #22649

Open
Satyr09 wants to merge 3 commits into apache:main from Satyr09:daipayan/parquet-max-row-group-byte

Conversation


@Satyr09 Satyr09 commented May 30, 2026

Which issue does this PR close?

Rationale for this change

arrow-rs 58.0 added WriterProperties::set_max_row_group_bytes (PR: apache/arrow-rs#9357, issue: apache/arrow-rs#1213), which flushes a row group when either the row-count limit or the byte limit is reached, whichever comes first, matching parquet-mr's parquet.block.size. DataFusion already depends on at least this version of arrow but does not yet expose the byte-based setter through its config.

What changes are included in this PR?

  • Add max_row_group_bytes: Option<usize> (default None) to ParquetOptions in datafusion/common/src/config.rs.
  • Wire it through ParquetOptions::into_writer_properties_builder to WriterPropertiesBuilder::set_max_row_group_bytes, with a guard that rejects Some(0) as a configuration error (arrow-rs panics on a zero byte limit).
  • Plumb the field through protobuf serialization - add it to the ParquetOptions proto message and the proto-common/proto conversions, with regenerated bindings.
  • Expose it as the max_row_group_bytes COPY / CREATE EXTERNAL TABLE format option alongside max_row_group_size.
  • Update the generated config docs and the format options table doc.
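
As a rough illustration of the Some(0) guard described in the list above, here is a minimal, standalone sketch. `RowGroupLimits` and `validate` are hypothetical names used only for this example; they are not the types or functions in the PR, and the real ParquetOptions has many more fields.

```rust
/// Hypothetical stand-in for the two row-group limits on `ParquetOptions`.
#[derive(Debug, Clone, PartialEq)]
struct RowGroupLimits {
    /// Row-count limit (the existing `max_row_group_size` option).
    max_row_group_size: usize,
    /// Byte limit added by this PR; `None` means no byte limit is set.
    max_row_group_bytes: Option<usize>,
}

/// Hypothetical helper: reject a zero byte limit with an error instead of
/// letting it reach the writer, which the PR description says would panic.
fn validate(limits: &RowGroupLimits) -> Result<(), String> {
    match limits.max_row_group_bytes {
        Some(0) => Err("max_row_group_bytes must be greater than 0".to_string()),
        _ => Ok(()),
    }
}

fn main() {
    // Illustrative values only.
    let unset = RowGroupLimits {
        max_row_group_size: 1_048_576,
        max_row_group_bytes: None,
    };
    let zero = RowGroupLimits {
        max_row_group_bytes: Some(0),
        ..unset.clone()
    };
    let sized = RowGroupLimits {
        max_row_group_bytes: Some(128 * 1024 * 1024),
        ..unset.clone()
    };

    for limits in [&unset, &zero, &sized] {
        println!(
            "rows={} bytes={:?} -> {:?}",
            limits.max_row_group_size,
            limits.max_row_group_bytes,
            validate(limits)
        );
    }
}
```

In this sketch `None` and any positive value pass validation, and only `Some(0)` is turned into an error.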

Are these changes tested?

Yes - run locally and passing:

Unit (datafusion-common, parquet_writer.rs):

  • defaults to None, so no byte limit is propagated to WriterProperties.
  • a configured value propagates to WriterProperties.
  • Some(0) is rejected with a configuration error.
  • the existing table_parquet_opts_to_writer_props round-trip and test_defaults_match tests were extended to cover the new field.

Protobuf round-trip (datafusion-proto-common):

  • new test_parquet_options_max_row_group_bytes_round_trip confirms the option survives serialization to protobuf and back.

SLTs:

  • new test_files/parquet_max_row_group_bytes.slt writes Parquet with the option set (via both COPY ... OPTIONS and session config), reads it back, asserts the data round-trips, and asserts a zero value is rejected.
  • copy.slt exercises the option inside the existing "all supported statement overrides" COPY test.
  • information_schema.slt updated for the new option in SHOW ALL.

Commands run locally (all pass):
cargo test -p datafusion-common --features parquet
cargo test -p datafusion-proto-common
cargo test -p datafusion-proto
cargo test --test sqllogictests -- parquet_max_row_group_bytes
cargo test --test sqllogictests -- information_schema
cargo test --test sqllogictests -- copy

Are there any user-facing changes?

Additive only, does not affect existing options.

@github-actions github-actions Bot added documentation Improvements or additions to documentation common Related to common crate labels May 30, 2026
@github-actions github-actions Bot added sqllogictest SQL Logic Tests (.slt) proto Related to proto crate labels May 30, 2026
Contributor

@2010YOUY01 2010YOUY01 left a comment


Thank you! LGTM in general, just a few minor suggestions.

max_predicate_cache_size: _,
} = self;

if let Some(0) = max_row_group_bytes {
Contributor

@2010YOUY01 2010YOUY01 May 31, 2026


I recommend moving all config-related validation to

  • datafusion/common/src/config.rs

You can find examples in config.rs: some configurations use a custom type that implements the FromStr trait.
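
One way the suggestion above could look is a newtype whose FromStr implementation rejects zero at parse time. This is only a sketch under assumptions: `RowGroupBytes` is a hypothetical name, and it does not reproduce how config.rs actually defines or registers its custom option types.

```rust
use std::num::NonZeroUsize;
use std::str::FromStr;

/// Hypothetical newtype for a byte limit that can never be zero.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct RowGroupBytes(NonZeroUsize);

impl FromStr for RowGroupBytes {
    type Err = String;

    /// Parse a decimal byte count, rejecting non-numeric input and zero.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        let n = s
            .trim()
            .parse::<usize>()
            .map_err(|e| format!("invalid max_row_group_bytes '{s}': {e}"))?;
        NonZeroUsize::new(n)
            .map(RowGroupBytes)
            .ok_or_else(|| "max_row_group_bytes must be greater than 0".to_string())
    }
}

fn main() {
    for input in ["134217728", "0", "abc"] {
        println!("{input:?} -> {:?}", input.parse::<RowGroupBytes>());
    }
}
```

With validation in the type, a zero value would be rejected when the option string is parsed, so the writer-properties conversion would not need its own check.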


/// (writing) Target maximum size of each row group in bytes. When set,
/// the writer flushes whenever either this limit or `max_row_group_size`
/// is reached, whichever comes first. Useful for bounding writer memory
Contributor


Let's also update the max_row_group_size docs to describe these 'whichever comes first' semantics.

# specific language governing permissions and limitations
# under the License.

# End-to-end tests for the `max_row_group_bytes` Parquet writer option:
Contributor


It would be great to add the following coverage:

  • When the file is read back, check the row group count. For example, if max_row_group_bytes is set to 1, 5 row groups will be created.
  • Test the combination of max_row_group_size and max_row_group_bytes for the 'whichever comes first' semantics.
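
To reason about which row group counts such tests might expect, here is a simplified model of the "whichever comes first" rule. It is an assumption-laden sketch, not the arrow-rs writer: it flushes row by row using exact byte sizes, whereas a real writer works on batches and encoded-size estimates, so actual counts may differ. `count_row_groups` is a hypothetical helper.

```rust
/// Count row groups under a simplified model: a group is closed as soon as
/// either the row limit or the optional byte limit is reached.
fn count_row_groups(row_sizes: &[usize], max_rows: usize, max_bytes: Option<usize>) -> usize {
    let mut groups = 0;
    let mut rows = 0;
    let mut bytes = 0;

    for &size in row_sizes {
        rows += 1;
        bytes += size;

        let rows_hit = rows >= max_rows;
        let bytes_hit = max_bytes.map_or(false, |limit| bytes >= limit);
        if rows_hit || bytes_hit {
            groups += 1;
            rows = 0;
            bytes = 0;
        }
    }

    // A trailing partial group is still written out.
    if rows > 0 {
        groups += 1;
    }
    groups
}

fn main() {
    // Five rows of 8 bytes each (illustrative sizes).
    let rows = [8usize; 5];

    // Byte limit of 1: every row closes its own group.
    println!("bytes=1          -> {}", count_row_groups(&rows, 1_000_000, Some(1)));
    // Row limit only.
    println!("rows=2           -> {}", count_row_groups(&rows, 2, None));
    // Both set; the byte limit is reached first.
    println!("rows=100,bytes=16 -> {}", count_row_groups(&rows, 100, Some(16)));
    // Both set; the row limit is reached first.
    println!("rows=2,bytes=24  -> {}", count_row_groups(&rows, 2, Some(24)));
}
```

Under this model a byte limit of 1 on a five-row input gives five groups, which matches the example in the comment above; the combined cases show each limit winning in turn.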


github-actions Bot commented May 31, 2026

Thank you for opening this pull request!

Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch).

Details
     Cloning apache/main
    Building datafusion-common v53.1.0 (current)
       Built [  34.336s] (current)
     Parsing datafusion-common v53.1.0 (current)
      Parsed [   0.056s] (current)
    Building datafusion-common v53.1.0 (baseline)
       Built [  32.647s] (baseline)
     Parsing datafusion-common v53.1.0 (baseline)
      Parsed [   0.057s] (baseline)
    Checking datafusion-common v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.624s] 222 checks: 221 pass, 1 fail, 0 warn, 30 skip

--- failure constructible_struct_adds_field: externally-constructible struct adds field ---

Description:
A pub struct constructible with a struct literal has a new pub field. Existing struct literals must be updated to include the new field.
        ref: https://doc.rust-lang.org/reference/expressions/struct-expr.html
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.47.0/src/lints/constructible_struct_adds_field.ron

Failed in:
  field ParquetOptions.max_row_group_bytes in /home/runner/work/datafusion/datafusion/datafusion/common/src/config.rs:910

     Summary semver requires new major version: 1 major and 0 minor checks failed
    Finished [  69.314s] datafusion-common
    Building datafusion-proto v53.1.0 (current)
       Built [  56.786s] (current)
     Parsing datafusion-proto v53.1.0 (current)
      Parsed [   0.018s] (current)
    Building datafusion-proto v53.1.0 (baseline)
       Built [  56.373s] (baseline)
     Parsing datafusion-proto v53.1.0 (baseline)
      Parsed [   0.018s] (baseline)
    Checking datafusion-proto v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.252s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [ 115.904s] datafusion-proto
    Building datafusion-proto-common v53.1.0 (current)
       Built [  21.224s] (current)
     Parsing datafusion-proto-common v53.1.0 (current)
      Parsed [   0.047s] (current)
    Building datafusion-proto-common v53.1.0 (baseline)
       Built [  20.754s] (baseline)
     Parsing datafusion-proto-common v53.1.0 (baseline)
      Parsed [   0.048s] (baseline)
    Checking datafusion-proto-common v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.986s] 222 checks: 221 pass, 1 fail, 0 warn, 30 skip

--- failure constructible_struct_adds_field: externally-constructible struct adds field ---

Description:
A pub struct constructible with a struct literal has a new pub field. Existing struct literals must be updated to include the new field.
        ref: https://doc.rust-lang.org/reference/expressions/struct-expr.html
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.47.0/src/lints/constructible_struct_adds_field.ron

Failed in:
  field ParquetOptions.max_row_group_bytes_opt in /home/runner/work/datafusion/datafusion/datafusion/proto-common/src/generated/prost.rs:904
  field ParquetOptions.max_row_group_bytes_opt in /home/runner/work/datafusion/datafusion/datafusion/proto-common/src/generated/prost.rs:904
  field ParquetOptions.max_row_group_bytes_opt in /home/runner/work/datafusion/datafusion/datafusion/proto-common/src/generated/prost.rs:904

     Summary semver requires new major version: 1 major and 0 minor checks failed
    Finished [  44.036s] datafusion-proto-common
    Building datafusion-proto-models v53.1.0 (current)
       Built [  23.727s] (current)
     Parsing datafusion-proto-models v53.1.0 (current)
      Parsed [   0.123s] (current)
    Building datafusion-proto-models v53.1.0 (baseline)
       Built [  23.891s] (baseline)
     Parsing datafusion-proto-models v53.1.0 (baseline)
      Parsed [   0.122s] (baseline)
    Checking datafusion-proto-models v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   1.631s] 222 checks: 221 pass, 1 fail, 0 warn, 30 skip

--- failure constructible_struct_adds_field: externally-constructible struct adds field ---

Description:
A pub struct constructible with a struct literal has a new pub field. Existing struct literals must be updated to include the new field.
        ref: https://doc.rust-lang.org/reference/expressions/struct-expr.html
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.47.0/src/lints/constructible_struct_adds_field.ron

Failed in:
  field ParquetOptions.max_row_group_bytes_opt in /home/runner/work/datafusion/datafusion/datafusion/proto-models/src/generated/datafusion_proto_common.rs:904
  field ParquetOptions.max_row_group_bytes_opt in /home/runner/work/datafusion/datafusion/datafusion/proto-models/src/generated/datafusion_proto_common.rs:904

     Summary semver requires new major version: 1 major and 0 minor checks failed
    Finished [  50.581s] datafusion-proto-models
    Building datafusion-sqllogictest v53.1.0 (current)
       Built [ 162.770s] (current)
     Parsing datafusion-sqllogictest v53.1.0 (current)
      Parsed [   0.022s] (current)
    Building datafusion-sqllogictest v53.1.0 (baseline)
       Built [ 163.108s] (baseline)
     Parsing datafusion-sqllogictest v53.1.0 (baseline)
      Parsed [   0.023s] (baseline)
    Checking datafusion-sqllogictest v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.096s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [ 329.865s] datafusion-sqllogictest
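
The constructible_struct_adds_field failures in the log above come from adding a pub field to a struct that callers can build with a struct literal. The sketch below illustrates the general Rust behaviour with a hypothetical `Options` struct; it is not the DataFusion type.

```rust
/// Hypothetical options struct with all-public fields.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct Options {
    pub max_row_group_size: usize,
    /// Newly added field. An exhaustive literal written before this field
    /// existed, such as `Options { max_row_group_size: 1024 }`, would no
    /// longer compile.
    pub max_row_group_bytes: Option<usize>,
}

/// Callers that fill the remaining fields from `Default` keep compiling
/// when new fields are added.
pub fn build_options() -> Options {
    Options {
        max_row_group_size: 1024,
        ..Default::default()
    }
}

fn main() {
    let opts = build_options();
    println!(
        "rows={} bytes={:?}",
        opts.max_row_group_size, opts.max_row_group_bytes
    );
}
```

Downstream code that uses the `..Default::default()` form is unaffected by the new field, while code that lists every field explicitly has to be updated, which is why the lint classifies the change as requiring a major version.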

@github-actions github-actions Bot added the auto detected api change Auto detected API change label May 31, 2026

Development

Successfully merging this pull request may close these issues.

Support byte-based row group size limit (max_row_group_bytes) in ParquetOptions

2 participants