
fix: fall back from native_datafusion for duplicate fields in case-insensitive mode#3687

Draft
andygrove wants to merge 5 commits into apache:main from andygrove:fix/duplicate-field-fallback

Conversation


@andygrove andygrove commented Mar 13, 2026

Which issue does this PR close?

Partial fix for #3311

Rationale for this change

When spark.sql.caseSensitive=false (the default) and a Parquet schema contains field names that collide after lowercasing (e.g., Name and name), DataFusion produces different error messages than Spark. This causes the "SPARK-25207: exception when duplicate fields in case-insensitive mode" test in the spark-sql test suite to fail when using native_datafusion.

What changes are included in this PR?

  • Adds a guard in nativeDataFusionScan() that detects duplicate field names (after lowercasing) in the required and data schemas when case-insensitive analysis is enabled. When duplicates are found, the scan falls back to avoid incompatible error behavior.
  • Updates the Spark 3.5.8 diff for the SPARK-25207 duplicate fields test to accept both Spark's and DataFusion's error messages. The fallback rule catches duplicates at the schema level, but when duplicates exist only in the physical Parquet file (not in the table definition), the fallback cannot detect them. In that case DataFusion produces "Unable to get field named" instead of Spark's "Found duplicate field(s)".
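The duplicate-field guard described above can be sketched in plain Scala. This is an illustrative sketch only, not the actual Comet code: the object and method names (DuplicateFieldCheck, hasDuplicateFieldsIgnoringCase) are hypothetical, standing in for the check the PR adds to nativeDataFusionScan(), which would run over the field names of the required and data schemas when case-insensitive analysis is enabled.

```scala
// Hypothetical sketch of the fallback guard: detect field names that
// collide after lowercasing, as "Name" and "name" do in case-insensitive mode.
object DuplicateFieldCheck {
  def hasDuplicateFieldsIgnoringCase(fieldNames: Seq[String]): Boolean = {
    // Lowercase with a fixed locale so the check is deterministic,
    // then compare the deduplicated and original lengths.
    val lowered = fieldNames.map(_.toLowerCase(java.util.Locale.ROOT))
    lowered.distinct.length != lowered.length
  }

  def main(args: Array[String]): Unit = {
    // "Name" and "name" collide, so a scan over this schema would fall back
    println(DuplicateFieldCheck.hasDuplicateFieldsIgnoringCase(Seq("Name", "name", "id")))
    // No collision here, so the native scan could proceed
    println(DuplicateFieldCheck.hasDuplicateFieldsIgnoringCase(Seq("Name", "id")))
  }
}
```

A plan-time check like this can only see the table/read schemas, which is why, as noted above, duplicates that exist only inside the physical Parquet file still reach DataFusion and surface its own error message.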

How are these changes tested?

Covered by the existing SPARK-25207 test in the spark-sql test suite, which verifies the correct error behavior for duplicate fields in case-insensitive mode.

…se-insensitive mode

When Parquet files contain duplicate column names that only differ by
case, DataFusion produces a different error than Spark in
case-insensitive mode. Add a check in CometScanRule to detect duplicate
fields in both the read and data schemas and fall back to Spark's
reader when case-insensitive mode is enabled.

Physical-file-only duplicates (not reflected in the table schema) cannot
be detected at plan time, so the SPARK-25207 test is updated to accept
either error message format.

Closes apache#3311
@andygrove andygrove force-pushed the fix/duplicate-field-fallback branch from 8b2b8ba to 73acaf2 on March 14, 2026 18:19