[SPARK-56171][SQL] Enable V2 file write path for non-partitioned DataFrame API writes and delete FallBackFileSourceV2 #54998
Open
LuciferYang wants to merge 3 commits into apache:master from
Conversation
Contributor (Author):

Follow-up tickets: the currently recorded plan is https://issues.apache.org/jira/browse/SPARK-56170
Member:

Wow, this and many TODO IDs. Thank you for working on this area, @LuciferYang.
…Frame API writes and delete FallBackFileSourceV2

Key changes:
- `FileWrite`: added `partitionSchema`, `customPartitionLocations`, `dynamicPartitionOverwrite`, `isTruncate`; path creation and truncate logic; dynamic partition overwrite via `FileCommitProtocol`
- `FileTable`: `createFileWriteBuilder` with `SupportsDynamicOverwrite` and `SupportsTruncate`; capabilities now include `TRUNCATE` and `OVERWRITE_DYNAMIC`; `fileIndex` skips file existence checks when `userSpecifiedSchema` is provided (write path)
- All file format writes (Parquet, ORC, CSV, JSON, Text, Avro) use `createFileWriteBuilder` with partition/truncate/overwrite support
- `DataFrameWriter.lookupV2Provider`: enabled `FileDataSourceV2` for non-partitioned `Append` and `Overwrite` via `df.write.save(path)`
- `DataFrameWriter.insertInto`: V1 fallback for file sources (TODO: SPARK-56175)
- `DataFrameWriter.saveAsTable`: V1 fallback for file sources (TODO: SPARK-56230, needs `StagingTableCatalog`)
- `DataSourceV2Utils.getTableProvider`: V1 fallback for file sources (TODO: SPARK-56175)
- Removed the `FallBackFileSourceV2` rule
- `V2SessionCatalog.createTable`: V1 `FileFormat` data type validation
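The `DataFrameWriter.lookupV2Provider` gating described in the commit message reduces to a small predicate: take the V2 file write path only for non-partitioned `Append`/`Overwrite` when the format is not in the V1 fallback list. A minimal stand-alone sketch, using locally defined stand-in types rather than Spark's internals:

```scala
// Hypothetical sketch of the V2 write gating; `SaveMode` here is a local
// stand-in, not org.apache.spark.sql.SaveMode.
object V2WriteGating {
  sealed trait SaveMode
  case object Append extends SaveMode
  case object Overwrite extends SaveMode
  case object ErrorIfExists extends SaveMode
  case object Ignore extends SaveMode

  // V2 only for non-partitioned Append/Overwrite outside the V1 fallback list;
  // ErrorIfExists/Ignore and partitioned writes stay on V1 for now.
  def useV2FileWrite(mode: SaveMode, partitioned: Boolean, inV1List: Boolean): Boolean =
    !inV1List && !partitioned && (mode == Append || mode == Overwrite)
}
```

This is only an illustration of the decision table; the actual PR implements it inside `lookupV2Provider` and `DataSourceV2Utils.getTableProvider`.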
```scala
// Built-in file formats for write testing. Text is excluded
// because it only supports a single string column.
private val fileFormats = Seq("parquet", "orc", "json", "csv")
```
Member:

- Shall we revisit the comment, because `avro` is also excluded in this suite?
- Do we have the same test coverage for `avro`?
Contributor (Author):

There are quite a few loose threads here; let me handle them one by one :)
Contributor (Author):

Updated the comments and added some test cases for Avro to maintain the same scenario coverage as here. Their corresponding relationships are as follows:
| FileDataSourceV2WriteSuite test | Avro coverage |
|---|---|
| File write for multiple formats | AvroSuite test save and load |
| File write V1-vs-V2 comparison | SPARK-56171: Avro V2 write produces same results as V1 write |
| Partitioned file write | AvroSuite reading and writing partitioned data |
| Partitioned write V1-vs-V2 comparison | SPARK-56171: Avro V2 partitioned write produces same results as V1 |
| Multi-level partitioned write | SPARK-56171: Avro V2 multi-level partitioned write |
| Dynamic partition overwrite | SPARK-56171: Avro V2 dynamic partition overwrite |
| Dynamic partition overwrite V1-vs-V2 | SPARK-56171: Avro V2 dynamic partition overwrite produces same results as V1 |
| DataFrame API write (append + overwrite) | Covered by cache invalidation tests |
| DataFrame API partitioned write | Covered by multi-level partitioned test |
| DataFrame API write with compression | AvroSuite write with compression (738, 2350) |
| Catalog table INSERT INTO | SPARK-56171: Avro V2 catalog table INSERT INTO |
| Catalog table partitioned INSERT INTO | SPARK-56171: Avro V2 catalog table partitioned INSERT INTO |
| V2 cache invalidation on overwrite | SPARK-56171: Avro V2 cache invalidation on overwrite |
| V2 cache invalidation on append | SPARK-56171: Avro V2 cache invalidation on append |
| Cache invalidation on catalog table overwrite | SPARK-56171: Avro V2 cache invalidation on catalog table overwrite |
| CTAS | SPARK-56171: Avro V2 CTAS |
| Partitioned write to empty directory | SPARK-56171: Avro V2 partitioned write to empty directory |
| Partitioned overwrite to existing directory | SPARK-56171: Avro V2 partitioned overwrite to existing directory |
### What changes were proposed in this pull request?

Enable the V2 file write path for non-partitioned `df.write.mode("append"/"overwrite").save(path)` across all built-in file formats (Parquet, ORC, JSON, CSV, Text, Avro), and delete the now-redundant `FallBackFileSourceV2` analysis rule.

**Key changes**

- V2 write infrastructure (`FileTable`, `FileWrite`)
  - `FileTable.createFileWriteBuilder`: shared builder with `SupportsTruncate` and `SupportsDynamicOverwrite` capabilities
  - `FileWrite`: partition schema, truncation (overwrite), dynamic partition overwrite, schema validation (nested column name duplication, data type, collation in map keys)
  - `Write` and `Table` classes use `createFileWriteBuilder`
- Delete `FallBackFileSourceV2` from `BaseSessionStateBuilder`/`HiveSessionStateBuilder`
  - `USE_V1_SOURCE_LIST` (default: all formats) already prevents V2 file tables from being created, and the DataFrame API uses `AppendData`/`OverwriteByExpression` (not `InsertIntoStatement`)
- Cache invalidation (`DataSourceV2Strategy`): `recacheByPath` with `fileIndex.refresh()` for `FileTable` writes
- Error handling (`FileFormatDataWriter`): `writeAll` wraps errors with `TASK_WRITE_FAILED`
- V2 write gating (`DataFrameWriter`, `DataSourceV2Utils`)
  - `FileDataSourceV2` only for `Append`/`Overwrite` without `partitionBy`; fall back to V1 for `ErrorIfExists`/`Ignore` (TODO: SPARK-56174) and partitioned writes
  - `saveAsTable`/`insertInto`: V1 fallback for `FileDataSourceV2`
  - `DataSourceV2Utils.getTableProvider`: return `None` for `FileDataSourceV2` to prevent V2 catalog table loading until the remaining gaps are addressed
- Data type validation (`V2SessionCatalog`): `FileFormat.supportDataType` validation in the `createTable` fallback, ensuring `CREATE TABLE` with unsupported types is rejected consistently

### Why are the changes needed?

The V2 Data Source API provides a cleaner write path than V1's `InsertIntoHadoopFsRelationCommand`. Enabling V2 writes for built-in file formats is a step toward fully migrating file sources to V2 (SPARK-56170).

### Does this PR introduce any user-facing change?

No. With the default configuration, all file writes use V1. The V2 path activates only when a user explicitly clears `USE_V1_SOURCE_LIST` and uses `df.write.mode("append"/"overwrite").save(path)` without `partitionBy`.

### How was this patch tested?

- `FileDataSourceV2WriteSuite` (23 tests): V2 write correctness, V1/V2 result comparison, partitioned writes, dynamic partition overwrite, cache invalidation, DataFrame API modes, catalog table INSERT INTO, CTAS
- `AvroV2Suite` (13 tests): equivalent V2 write coverage for Avro (V1/V2 comparison, multi-level partitioned write, dynamic overwrite, cache invalidation, catalog INSERT INTO, CTAS, partitioned write to empty/existing directory)

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code 4.6
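For illustration, the opt-in condition described under "Does this PR introduce any user-facing change?" could be exercised roughly like this. A sketch only: it assumes a Spark build containing this change, a local `SparkSession`, and the existing `spark.sql.sources.useV1SourceList` config key behind `USE_V1_SOURCE_LIST`.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object V2WriteExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("v2-file-write-sketch")
      .getOrCreate()

    // Clear the V1 fallback list so built-in file sources resolve to their
    // V2 implementations on the write path.
    spark.conf.set("spark.sql.sources.useV1SourceList", "")

    val path = java.nio.file.Files.createTempDirectory("v2-write").toString
    val df = spark.range(10).toDF("id")

    // Non-partitioned Append/Overwrite via the DataFrame API takes the V2
    // path (AppendData / OverwriteByExpression); adding partitionBy would
    // still fall back to V1 in this PR.
    df.write.mode(SaveMode.Overwrite).parquet(path)
    df.write.mode(SaveMode.Append).parquet(path)

    assert(spark.read.parquet(path).count() == 20)
    spark.stop()
  }
}
```

With the default configuration (all formats in `USE_V1_SOURCE_LIST`), the same code would go through the V1 `InsertIntoHadoopFsRelationCommand` path and produce the same results, which is what the V1-vs-V2 comparison tests verify.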