fix: fall back from native_datafusion for duplicate fields in case-insensitive mode#3687
Draft
andygrove wants to merge 5 commits intoapache:mainfrom
Draft
fix: fall back from native_datafusion for duplicate fields in case-insensitive mode#3687andygrove wants to merge 5 commits intoapache:mainfrom
andygrove wants to merge 5 commits intoapache:mainfrom
Conversation
…se-insensitive mode When Parquet files contain duplicate column names that only differ by case, DataFusion produces a different error than Spark in case-insensitive mode. Add a check in CometScanRule to detect duplicate fields in both the read and data schemas and fall back to Spark's reader when case-insensitive mode is enabled. Physical-file-only duplicates (not reflected in the table schema) cannot be detected at plan time, so the SPARK-25207 test is updated to accept either error message format. Closes apache#3311
8b2b8ba to
73acaf2
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Which issue does this PR close?
Partial fix for #3311
Rationale for this change
When
spark.sql.caseSensitive=false(the default) and a Parquet schema contains field names that collide after lowercasing (e.g.,Nameandname), DataFusion produces different error messages than Spark. This causes theSPARK-25207: exception when duplicate fields in case-insensitive modespark-sql test to fail when usingnative_datafusion.What changes are included in this PR?
nativeDataFusionScan()that detects duplicate field names (after lowercasing) in the required and data schemas when case-insensitive analysis is enabled. When duplicates are found, the scan falls back to avoid incompatible error behavior.SPARK-25207duplicate fields test to accept both Spark's and DataFusion's error messages. The fallback rule catches duplicates at the schema level, but when duplicates exist only in the physical Parquet file (not in the table definition), the fallback cannot detect them. In that case DataFusion produces"Unable to get field named"instead of Spark's"Found duplicate field(s)".How are these changes tested?
Covered by the existing
SPARK-25207test in the spark-sql test suite, which verifies the correct error behavior for duplicate fields in case-insensitive mode.