
[SPARK-56171][SQL] Enable V2 file write path for non-partitioned DataFrame API writes and delete FallBackFileSourceV2 #54998

Open
LuciferYang wants to merge 3 commits into apache:master from LuciferYang:SPARK-56171-combined

Conversation

Contributor

@LuciferYang LuciferYang commented Mar 25, 2026

What changes were proposed in this pull request?

Enable the V2 file write path for non-partitioned df.write.mode("append"/"overwrite").save(path) across all built-in file formats (Parquet, ORC, JSON, CSV, Text, Avro), and delete the now-redundant FallBackFileSourceV2 analysis rule.

Key changes

V2 write infrastructure (FileTable, FileWrite)

  • FileTable.createFileWriteBuilder: shared builder with SupportsTruncate and SupportsDynamicOverwrite capabilities
  • FileWrite: partition schema, truncation (overwrite), dynamic partition overwrite, schema validation (nested column name duplication, data type, collation in map keys)
  • All 6 format-specific Write and Table classes use createFileWriteBuilder
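A minimal sketch of the shared-builder pattern described above, assuming the interface names from Spark's connector API; the class name and fields here are illustrative, not copied from the patch:

```scala
// Hedged sketch: a shared write builder that each file format's Table can
// return from createFileWriteBuilder, advertising truncate and dynamic
// partition overwrite to the V2 write planner.
import org.apache.spark.sql.connector.write.{SupportsDynamicOverwrite, SupportsTruncate, WriteBuilder}

class FileWriteBuilderSketch extends WriteBuilder
    with SupportsTruncate with SupportsDynamicOverwrite {

  // Tracks how an overwrite was requested; consumed when building the Write.
  private var overwriteByTruncate = false
  private var overwriteDynamic = false

  // SupportsTruncate: a full-table overwrite maps to truncation of the path.
  override def truncate(): WriteBuilder = {
    overwriteByTruncate = true
    this
  }

  // SupportsDynamicOverwrite: only partitions touched by the write are replaced.
  override def overwriteDynamicPartitions(): WriteBuilder = {
    overwriteDynamic = true
    this
  }
}
```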

Delete FallBackFileSourceV2

  • Remove the rule and its registrations in BaseSessionStateBuilder / HiveSessionStateBuilder
  • Redundant: USE_V1_SOURCE_LIST (default: all formats) already prevents V2 file tables from being created, and the DataFrame API uses AppendData/OverwriteByExpression (not InsertIntoStatement)

Cache invalidation (DataSourceV2Strategy)

  • Use recacheByPath with fileIndex.refresh() for FileTable writes
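The invalidation step can be sketched as follows. `recacheByPath` and `fileIndex.refresh()` are existing Spark APIs; the surrounding match and the `table`/`session` names are illustrative of where the strategy would call them:

```scala
// Hedged sketch of cache invalidation after a V2 file write commits.
table match {
  case fileTable: FileTable =>
    // Drop the stale in-memory file listing so the next scan re-lists files.
    fileTable.fileIndex.refresh()
    // Re-materialize any cached plans that read from this path.
    session.sharedState.cacheManager.recacheByPath(session, fileTable.paths.head)
  case _ =>
    // Catalog-backed tables are invalidated by table identity instead.
}
```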

Error handling (FileFormatDataWriter)

  • Override writeAll to wrap errors with TASK_WRITE_FAILED
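A sketch of the override, assuming the surrounding writer exposes `write` and `path`; `writeAll` takes a `java.util.Iterator` in the `DataWriter` API:

```scala
// Hedged sketch: rethrow any failure while writing rows under the
// TASK_WRITE_FAILED error class, matching the V1 write path's behavior.
override def writeAll(records: java.util.Iterator[InternalRow]): Unit = {
  try {
    while (records.hasNext) {
      write(records.next())
    }
  } catch {
    case e: Exception =>
      // Spark helper that raises the TASK_WRITE_FAILED error class.
      throw QueryExecutionErrors.taskFailedWhileWritingRowsError(path, e)
  }
}
```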

V2 write gating (DataFrameWriter, DataSourceV2Utils)

  • Allow V2 for FileDataSourceV2 only for Append/Overwrite without partitionBy; fall back to V1 for ErrorIfExists/Ignore (TODO: SPARK-56174) and partitioned writes
  • saveAsTable/insertInto: V1 fallback for FileDataSourceV2
  • DataSourceV2Utils.getTableProvider: return None for FileDataSourceV2 to prevent V2 catalog table loading until remaining gaps are addressed
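The gating rules above can be condensed into one predicate. This is a sketch with an illustrative name; the patch spreads the equivalent checks across DataFrameWriter and DataSourceV2Utils:

```scala
// Hedged sketch of the V2-write gating decision for file sources.
def shouldUseV2FileWrite(
    source: TableProvider,
    mode: SaveMode,
    partitioningColumns: Option[Seq[String]]): Boolean = source match {
  case _: FileDataSourceV2 =>
    // Only Append/Overwrite qualify; ErrorIfExists/Ignore stay on V1
    // (TODO: SPARK-56174), as do writes using partitionBy.
    (mode == SaveMode.Append || mode == SaveMode.Overwrite) &&
      partitioningColumns.forall(_.isEmpty)
  case _ =>
    true  // non-file V2 sources are unaffected by this gating
}
```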

Data type validation (V2SessionCatalog)

  • Add V1 FileFormat.supportDataType validation in createTable fallback, ensuring CREATE TABLE with unsupported types is rejected consistently
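The validation can be sketched as below, assuming a `schema`, the resolved V1 `fileFormat`, and a `provider` name in scope; `FileFormat.supportDataType` is the existing V1 hook:

```scala
// Hedged sketch: before the session catalog creates a file-source table via
// the V1 fallback, validate each column against the V1 FileFormat so that
// CREATE TABLE fails for unsupported types just as a direct write would.
schema.foreach { field =>
  if (!fileFormat.supportDataType(field.dataType)) {
    throw QueryCompilationErrors.dataTypeUnsupportedByDataSourceError(
      provider, field)
  }
}
```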

Why are the changes needed?

The V2 Data Source API provides a cleaner write path than V1's InsertIntoHadoopFsRelationCommand. Enabling V2 writes for built-in file formats is a step toward fully migrating file sources to V2 (SPARK-56170).

Does this PR introduce any user-facing change?

No. With the default configuration, all file writes use V1. The V2 path activates only when a user explicitly clears USE_V1_SOURCE_LIST and calls df.write.mode("append"/"overwrite").save(path) without partitionBy.
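The opt-in can be illustrated as follows, assuming a SparkSession `spark` and a DataFrame `df`; clearing the V1 source list is what routes built-in formats through the V2 write path:

```scala
// spark.sql.sources.useV1SourceList defaults to all built-in file formats.
spark.conf.set("spark.sql.sources.useV1SourceList", "")

df.write.mode("append").save("/tmp/out")     // V2: planned as AppendData
df.write.mode("overwrite").save("/tmp/out")  // V2: planned as OverwriteByExpression

// These still take the V1 path after this PR:
df.write.mode("append").partitionBy("p").save("/tmp/out")  // partitioned write
df.write.mode("ignore").save("/tmp/out")                   // Ignore mode (SPARK-56174)
```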

How was this patch tested?

  • Pass GitHub Actions
  • New FileDataSourceV2WriteSuite (23 tests): V2 write correctness, V1/V2 result comparison, partitioned writes, dynamic partition overwrite, cache invalidation, DataFrame API modes, catalog table INSERT INTO, CTAS
  • New tests in AvroV2Suite (13 tests): equivalent V2 write coverage for Avro (V1/V2 comparison, multi-level partitioned write, dynamic overwrite, cache invalidation, catalog INSERT INTO, CTAS, partitioned write to empty/existing directory)

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code 4.6

Contributor Author

LuciferYang commented Mar 25, 2026

Follow-up tickets

| Ticket | Description |
| --- | --- |
| SPARK-56174 | V2 file write ErrorIfExists/Ignore modes |
| SPARK-56175 | FileTable implements SupportsPartitionManagement and V2 catalog table loading |
| SPARK-56176 | V2-native ANALYZE TABLE and ANALYZE COLUMN for file tables |

The overall plan is recorded at: https://issues.apache.org/jira/browse/SPARK-56170

@dongjoon-hyun
Member

Wow, this PR and its many TODO IDs. Thank you for working on this area, @LuciferYang .

…Frame API writes and delete FallBackFileSourceV2

Key changes:
- FileWrite: added partitionSchema, customPartitionLocations,
  dynamicPartitionOverwrite, isTruncate; path creation and truncate
  logic; dynamic partition overwrite via FileCommitProtocol
- FileTable: createFileWriteBuilder with SupportsDynamicOverwrite
  and SupportsTruncate; capabilities now include TRUNCATE and
  OVERWRITE_DYNAMIC; fileIndex skips file existence checks when
  userSpecifiedSchema is provided (write path)
- All file format writes (Parquet, ORC, CSV, JSON, Text, Avro) use
  createFileWriteBuilder with partition/truncate/overwrite support
- DataFrameWriter.lookupV2Provider: enabled FileDataSourceV2 for
  non-partitioned Append and Overwrite via df.write.save(path)
- DataFrameWriter.insertInto: V1 fallback for file sources
  (TODO: SPARK-56175)
- DataFrameWriter.saveAsTable: V1 fallback for file sources
  (TODO: SPARK-56230, needs StagingTableCatalog)
- DataSourceV2Utils.getTableProvider: V1 fallback for file sources
  (TODO: SPARK-56175)
- Removed FallBackFileSourceV2 rule
- V2SessionCatalog.createTable: V1 FileFormat data type validation

// Built-in file formats for write testing. Text is excluded
// because it only supports a single string column.
private val fileFormats = Seq("parquet", "orc", "json", "csv")
Member

  • Shall we revisit the comment because avro is also excluded in this suite?
  • Do we have the same test coverage for avro?

Contributor Author
There are quite a few loose threads here; let me handle them one by one :)

@LuciferYang LuciferYang Apr 10, 2026
Contributor Author
Updated the comments and added test cases for Avro to maintain the same scenario coverage as here. Their corresponding relationships are as follows:

| FileDataSourceV2WriteSuite test | Avro coverage |
| --- | --- |
| File write for multiple formats | AvroSuite test save and load |
| File write V1-vs-V2 comparison | SPARK-56171: Avro V2 write produces same results as V1 write |
| Partitioned file write | AvroSuite reading and writing partitioned data |
| Partitioned write V1-vs-V2 comparison | SPARK-56171: Avro V2 partitioned write produces same results as V1 |
| Multi-level partitioned write | SPARK-56171: Avro V2 multi-level partitioned write |
| Dynamic partition overwrite | SPARK-56171: Avro V2 dynamic partition overwrite |
| Dynamic partition overwrite V1-vs-V2 | SPARK-56171: Avro V2 dynamic partition overwrite produces same results as V1 |
| DataFrame API write (append + overwrite) | Covered by cache invalidation tests |
| DataFrame API partitioned write | Covered by multi-level partitioned test |
| DataFrame API write with compression | AvroSuite write with compression (738, 2350) |
| Catalog table INSERT INTO | SPARK-56171: Avro V2 catalog table INSERT INTO |
| Catalog table partitioned INSERT INTO | SPARK-56171: Avro V2 catalog table partitioned INSERT INTO |
| V2 cache invalidation on overwrite | SPARK-56171: Avro V2 cache invalidation on overwrite |
| V2 cache invalidation on append | SPARK-56171: Avro V2 cache invalidation on append |
| Cache invalidation on catalog table overwrite | SPARK-56171: Avro V2 cache invalidation on catalog table overwrite |
| CTAS | SPARK-56171: Avro V2 CTAS |
| Partitioned write to empty directory | SPARK-56171: Avro V2 partitioned write to empty directory |
| Partitioned overwrite to existing directory | SPARK-56171: Avro V2 partitioned overwrite to existing directory |

# Conflicts:
#	connector/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala