Fix PySpark struct() to accept a list of columns#378
wavebyrd wants to merge 612 commits into duckdb:main
Conversation
…ckdb#55) Redo of duckdb#54 but now actually pulling from my feature branch, not main, as I mistakenly did. Includes the fix for the changes requested at duckdb#54 (comment)
In this PR:

* Using an extensive set of rules (see pyproject.toml), fixed lots of linting errors with ruff
* Added pre-commit as a dev dependency, with support for ruff, clang-format and cmake-format
* Added a linting step to the on-pr job
* Added a post-checkout hook that automatically updates the git submodule

Soon:

* add dev docs
* plug into the on-pr workflow
* plug into the nightly release workflow as a sanity check (because why not: it's fast)

This is step 1 in fixing some of the devexp issues we have since v1.4.0. See duckdb#62, duckdb#57 and duckdb#47.
This PR fixes duckdb#66
Fixes duckdb#74. This PR:

- exports the `adbc_driver_duckdb` package so you can use it after `pip install duckdb`
- doesn't just skip the tests if the package can't be found
- limits the exported symbols to the only two that are ever needed: `duckdb_adbc_init` and `PyInit__duckdb`:

```
❯ nm -g -C -P .venv/lib/python3.11/site-packages/_duckdb.cpython-311-darwin.so | ag -v ' U '
_PyInit__duckdb T 41900 0
_duckdb_adbc_init T 1054e6c 0
```
…ues to avoid leaking Python objects during scalar UDF execution.
Fixes ADBC tests to align with changes made in duckdb/duckdb#20344
Fixes duckdb/duckdb#20329 `DuckDBPyRelation.select_dtypes` failed for relations with column names that require quoting (e.g., names containing spaces). The projection builder did not correctly quote identifiers, leading to binder errors. This change ensures identifiers are quoted consistently and adds a regression test to cover the reported case.
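The quoting fix described above can be sketched in plain Python. This is a hedged illustration, not DuckDB's internal code: `quote_identifier` and `build_projection` are hypothetical helper names standing in for the real projection builder.

```python
def quote_identifier(name: str) -> str:
    # SQL identifier quoting: double any embedded double quotes,
    # then wrap the whole name in double quotes so names with
    # spaces (or other special characters) bind correctly.
    return '"' + name.replace('"', '""') + '"'


def build_projection(columns: list[str]) -> str:
    # Build a projection list where every identifier is quoted
    # consistently, regardless of whether it "needs" quoting.
    return ", ".join(quote_identifier(c) for c in columns)
```

Quoting every identifier unconditionally is simpler and safer than trying to detect which names require it.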
This fixes a bug in the CMake function that checks whether we should include jemalloc in our extension list, ensuring that jemalloc is not included on Windows.
The jemalloc extension's symbols can't be found. There might be a deeper issue at play, but there isn't really any reason to use jemalloc by default, so let's remove it first.
This change makes all CMake targets use the `-MT`/`-MTd` compilation flags for Windows MSVC builds. This way the MSVC runtime library is linked statically, and the workaround for VS2019 described in duckdb/duckdb#17991 is no longer necessary. `extension-ci-tools` PR: duckdb/extension-ci-tools#276 Ref: duckdblabs/duckdb-internal#2036
Fixes duckdb#209. The type stub for `DuckDBPyRelation.aggregate` incorrectly restricts the `aggr_expr` parameter to `Expression | str`. However, the DuckDB Python API and runtime behavior also support passing a list of `Expression` objects for multiple aggregations. The fix extends the type annotation to include `list[Expression]`, aligning the stub definition with the actual supported API behavior:

```python
def aggregate(
    self, aggr_expr: Expression | str | list[Expression], group_expr: Expression | str = ""
) -> DuckDBPyRelation: ...
```
Fixes the memory leak reported in duckdb#224. The issue was caused by Python UDF return values not being released after conversion. While this affected all return types, it was only observable for large VARCHAR/bytes values due to their size. The fix ensures correct reference management using `py::reinterpret_steal<py::object>`. A regression test was added to detect refcount leaks in Python scalar UDF execution.
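The shape of such a refcount-leak regression test can be sketched from the Python side. This is a hedged, self-contained illustration using `sys.getrefcount`; `leaks_refs` is a hypothetical helper, not the test actually added in this PR, and real DuckDB UDF execution happens through the engine rather than direct calls.

```python
import sys


def leaks_refs(udf, value, n: int = 1000) -> bool:
    # Call the UDF many times with the same argument and compare the
    # argument's refcount before and after. A UDF that retains extra
    # references (or a binding layer that fails to release them) will
    # show a refcount that grows with the number of calls.
    before = sys.getrefcount(value)
    for _ in range(n):
        udf(value)
    after = sys.getrefcount(value)
    return after > before
```

A well-behaved UDF leaves the refcount unchanged; one that stashes its argument (or a wrapper that forgets a `Py_DECREF`) inflates it by roughly one per call.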
The struct() function now unwraps a single list or set argument, matching the PySpark API behavior and the existing array() function in this codebase.
Fixes duckdb/duckdb#17189
The PySpark API's `struct()` function accepts either varargs or a single list of columns. The DuckDB implementation only accepted varargs, causing an `InvalidInputException` when a list was passed. This applies the same unwrapping pattern already used by `array()` in this file: if a single list or set argument is passed, unwrap it before processing.

Changes
- `duckdb/experimental/spark/sql/functions.py`: Added list/set unwrapping to `struct()` and updated type hints to match `array()`
- `tests/fast/spark/test_spark_column.py`: Added `test_struct_column_with_list` covering the reported use case

Test plan
- `struct([df.age, df.name])` (list syntax) is exercised and the output is verified to match the expected struct rows
- `test_struct_column` continues to verify varargs usage
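The unwrapping pattern described above can be sketched as follows. This is a hedged illustration, not the PR's actual diff: `_unwrap_single_iterable` is a hypothetical helper name, and the real `struct()` builds a DuckDB struct expression rather than returning a plain list.

```python
def _unwrap_single_iterable(cols: tuple) -> list:
    # If exactly one argument was passed and it is a list or set,
    # unwrap it so struct([a, b]) behaves like struct(a, b) -- the
    # same pattern array() already uses in this file.
    if len(cols) == 1 and isinstance(cols[0], (list, set)):
        return list(cols[0])
    return list(cols)


def struct(*cols):
    columns = _unwrap_single_iterable(cols)
    # Placeholder: the real implementation turns `columns` into a
    # struct expression; here we just return the normalized list.
    return columns
```

With this, both `struct(df.age, df.name)` and `struct([df.age, df.name])` reach the same code path, matching PySpark's accepted call signatures.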