fix: fall back from native_datafusion for duplicate fields in case-insensitive mode by andygrove · Pull Request #3687 · apache/datafusion-comet

andygrove · 2026-03-13T16:38:23Z

Which issue does this PR close?

Partial fix for #3311

Rationale for this change

When spark.sql.caseSensitive=false (the default) and a Parquet schema contains field names that collide after lowercasing (e.g., Name and name), DataFusion produces a different error message than Spark. As a result, the spark-sql test "SPARK-25207: exception when duplicate fields in case-insensitive mode" fails when using native_datafusion.

What changes are included in this PR?

Adds a guard in nativeDataFusionScan() that detects duplicate field names (after lowercasing) in the required and data schemas when case-insensitive analysis is enabled. When duplicates are found, the scan falls back to avoid incompatible error behavior.
Updates the Spark 3.5.8 diff for the SPARK-25207 duplicate fields test to accept both Spark's and DataFusion's error messages. The fallback rule catches duplicates at the schema level, but when duplicates exist only in the physical Parquet file (not in the table definition), the fallback cannot detect them. In that case DataFusion produces "Unable to get field named" instead of Spark's "Found duplicate field(s)".
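
As a rough illustration of the guard described above, here is a minimal Python sketch. It is not the Scala code in nativeDataFusionScan(). The function names and the use of flat lists of top-level field names are assumptions made for the example, and the real check operates on Spark schema objects.

```python
from collections import Counter


def duplicate_fields_ignoring_case(field_names):
    """Return the lowercased names that occur more than once, sorted."""
    counts = Counter(name.lower() for name in field_names)
    return sorted(name for name, n in counts.items() if n > 1)


def should_fall_back(required_fields, data_fields, case_sensitive):
    """Hypothetical guard: fall back when either schema has names that
    collide after lowercasing and case-insensitive analysis is enabled."""
    if case_sensitive:
        # Name and name are distinct fields in case-sensitive mode.
        return False
    return bool(
        duplicate_fields_ignoring_case(required_fields)
        or duplicate_fields_ignoring_case(data_fields)
    )


# Example: the data schema has Name and name, which collide as "name".
print(should_fall_back(["id"], ["Name", "name"], case_sensitive=False))
```

A check like this only sees the schemas known at plan time, which is why duplicates that exist only in the physical Parquet file are not caught.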

How are these changes tested?

Covered by the existing SPARK-25207 test in the spark-sql test suite, which verifies the correct error behavior for duplicate fields in case-insensitive mode.
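
The relaxed assertion in the updated test can be pictured with the following Python sketch. The actual test lives in Spark's Scala test suite and is modified through the Spark 3.5.8 diff. The helper name here is hypothetical, and only the two message fragments are taken from the PR description.

```python
# Message fragments quoted in the PR description.
ACCEPTED_FRAGMENTS = (
    "Found duplicate field(s)",   # Spark's error
    "Unable to get field named",  # DataFusion's error
)


def is_accepted_duplicate_field_error(message):
    """Hypothetical helper: True if the error text matches either engine's wording."""
    return any(fragment in message for fragment in ACCEPTED_FRAGMENTS)
```

Matching on a fragment rather than the full message keeps the check independent of the field names and quoting each engine includes in its error text.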

…se-insensitive mode

When Parquet files contain duplicate column names that only differ by case, DataFusion produces a different error than Spark in case-insensitive mode. Add a check in CometScanRule to detect duplicate fields in both the read and data schemas and fall back to Spark's reader when case-insensitive mode is enabled. Physical-file-only duplicates (not reflected in the table schema) cannot be detected at plan time, so the SPARK-25207 test is updated to accept either error message format.

Closes apache#3311

andygrove marked this pull request as draft March 13, 2026 16:41

andygrove mentioned this pull request Mar 13, 2026

feat: Enable native_datafusion scan in auto scan mode [WIP] #3682

Closed

andygrove force-pushed the fix/duplicate-field-fallback branch from 8b2b8ba to 73acaf2 Compare March 14, 2026 18:19

andygrove added 4 commits March 14, 2026 13:06

Merge remote-tracking branch 'apache/main' into fix/duplicate-field-fallback
ad48275

fix: revert FileDataSourceV2FallBackSuite changes from Spark diff
4f6beaf

Revert "fix: revert FileDataSourceV2FallBackSuite changes from Spark diff"
c3caf37
This reverts commit 4f6beaf.

fix diff
e1c8b9e

andygrove wants to merge 5 commits into apache:main from andygrove:fix/duplicate-field-fallback
