Fix PySpark struct() to accept a list of columns#378
wavebyrd wants to merge 612 commits into duckdb:main
Conversation
…ckdb#55) Redo of duckdb#54 but now actually pulling from my feature branch, not main, as I mistakenly did. Includes the fix for the changes requested at duckdb#54 (comment)
In this PR:

* Using an extensive set of rules (see pyproject.toml), fixed lots of linting errors with ruff
* Added pre-commit as a dev dependency, with support for ruff, clang-format and cmake-format
* Added a linting step to the on-pr job
* Added a post-checkout hook that automatically updates the git submodule

Soon:

* add dev docs
* plug into the on-pr workflow
* plug into the nightly release workflow as a sanity check (because why not: it's fast)

This is step 1 in fixing some of the devexp issues we have since v1.4.0. See duckdb#62, duckdb#57 and duckdb#47.
This PR fixes duckdb#66
Fixes duckdb#74. This PR:

- exports the `adbc_driver_duckdb` package so you can use it after `pip install duckdb`
- doesn't just skip the tests if the package can't be found
- limits the exported symbols to the only two that are ever needed: `duckdb_adbc_init` and `PyInit__duckdb`:

```
❯ nm -g -C -P .venv/lib/python3.11/site-packages/_duckdb.cpython-311-darwin.so | ag -v ' U '
_PyInit__duckdb T 41900 0
_duckdb_adbc_init T 1054e6c 0
```
…ues to avoid leaking Python objects during scalar UDF execution.
Fixes ADBC tests to align with changes made in duckdb/duckdb#20344
Fixes duckdb/duckdb#20329 `DuckDBPyRelation.select_dtypes` failed for relations with column names that require quoting (e.g., names containing spaces). The projection builder did not correctly quote identifiers, leading to binder errors. This change ensures identifiers are quoted consistently and adds a regression test to cover the reported case.
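The quoting fix described above can be sketched in plain Python. This is a hedged illustration, not DuckDB's internal code: `quote_identifier` and `build_projection` are hypothetical helper names standing in for the real projection builder.

```python
def quote_identifier(name: str) -> str:
    # SQL identifier quoting: double any embedded double quotes,
    # then wrap the whole name in double quotes so names with
    # spaces (or other special characters) bind correctly.
    return '"' + name.replace('"', '""') + '"'


def build_projection(columns: list[str]) -> str:
    # Build a projection list where every identifier is quoted
    # consistently, regardless of whether it "needs" quoting.
    return ", ".join(quote_identifier(c) for c in columns)
```

Quoting every identifier unconditionally is simpler and safer than trying to detect which names require it.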
This fixes a bug in the CMake function that checks whether we should include jemalloc in our extension list, ensuring that jemalloc is not included on Windows.
The jemalloc extension's symbols can't be found. There might be a deeper issue at play, but there isn't really any reason to use jemalloc by default, so let's remove it first.
This change makes all CMake targets use the `-MT`/`-MTd` compilation flags for Windows MSVC builds. This way the MSVC runtime library is linked statically, and the workaround for VS2019 described in duckdb/duckdb#17991 is no longer necessary. `extension-ci-tools` PR: duckdb/extension-ci-tools#276 Ref: duckdblabs/duckdb-internal#2036
Fixes duckdb#209. The type stub for `DuckDBPyRelation.aggregate` incorrectly restricts the `aggr_expr` parameter to `Expression | str`. However, the DuckDB Python API and runtime behavior also support passing a list of `Expression` objects for multiple aggregations. The fix extends the type annotation to include `list[Expression]`, aligning the stub definition with the actual supported API behavior:

```python
def aggregate(
    self, aggr_expr: Expression | str | list[Expression], group_expr: Expression | str = ""
) -> DuckDBPyRelation: ...
```
Fixes the memory leak reported in duckdb#224. The issue was caused by Python UDF return values not being released after conversion. While this affected all return types, it was only observable for large VARCHAR/bytes values due to their size. The fix ensures correct reference management using `py::reinterpret_steal<py::object>`. A regression test was added to detect refcount leaks in Python scalar UDF execution.
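The shape of such a refcount-leak regression test can be sketched from the Python side. This is a hedged, self-contained illustration using `sys.getrefcount`; `leaks_refs` is a hypothetical helper, not the test actually added in this PR, and real DuckDB UDF execution happens through the engine rather than direct calls.

```python
import sys


def leaks_refs(udf, value, n: int = 1000) -> bool:
    # Call the UDF many times with the same argument and compare the
    # argument's refcount before and after. A UDF that retains extra
    # references (or a binding layer that fails to release them) will
    # show a refcount that grows with the number of calls.
    before = sys.getrefcount(value)
    for _ in range(n):
        udf(value)
    after = sys.getrefcount(value)
    return after > before
```

A well-behaved UDF leaves the refcount unchanged; one that stashes its argument (or a wrapper that forgets a `Py_DECREF`) inflates it by roughly one per call.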
The struct() function now unwraps a single list or set argument, matching the PySpark API behavior and the existing array() function in this codebase.
Fixes duckdb/duckdb#17189
The PySpark API's `struct()` function accepts either varargs or a single list of columns. The DuckDB implementation only accepted varargs, causing an `InvalidInputException` when a list was passed. This applies the same unwrapping pattern already used by `array()` in this file: if a single list or set argument is passed, unwrap it before processing.

Changes
- `duckdb/experimental/spark/sql/functions.py`: Added list/set unwrapping to `struct()` and updated type hints to match `array()`
- `tests/fast/spark/test_spark_column.py`: Added `test_struct_column_with_list` covering the reported use case

Test plan
- `struct([df.age, df.name])` (list syntax) is exercised and the output is verified to match the expected struct rows
- `test_struct_column` continues to verify varargs usage
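The unwrapping pattern described above can be sketched as follows. This is a hedged illustration, not the PR's actual diff: `_unwrap_single_iterable` is a hypothetical helper name, and the real `struct()` builds a DuckDB struct expression rather than returning a plain list.

```python
def _unwrap_single_iterable(cols: tuple) -> list:
    # If exactly one argument was passed and it is a list or set,
    # unwrap it so struct([a, b]) behaves like struct(a, b) -- the
    # same pattern array() already uses in this file.
    if len(cols) == 1 and isinstance(cols[0], (list, set)):
        return list(cols[0])
    return list(cols)


def struct(*cols):
    columns = _unwrap_single_iterable(cols)
    # Placeholder: the real implementation turns `columns` into a
    # struct expression; here we just return the normalized list.
    return columns
```

With this, both `struct(df.age, df.name)` and `struct([df.age, df.name])` reach the same code path, matching PySpark's accepted call signatures.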