Python Relational API - Generated SQL query embeds data #372

@Ethan-Spencer

Description

What happens?

Using the Python relational API sometimes leads to the dataset being downloaded and the data being embedded into the generated SQL, which is then executed. This causes a major slowdown, which I assume comes from generating the massive query.

On my modestly sized data set (about 3 million rows), running the following takes about 1 second:

df = conn.sql("SELECT * FROM demand_history WHERE year(interval_date) = 2025").pl()

Doing this instead, which I expected to behave identically, takes about 40 seconds:

df = conn.table("demand_history").filter("year(interval_date) = 2025").pl()

I've attached the first 18 lines of the output of `conn.table("demand_history").filter("year(interval_date) = 2025").sql_query()`. The entire query is 500 MB. sql_query.txt

Unfortunately, it only seems to happen with a large amount of data, so I haven't been able to produce a minimal example, and I haven't narrowed down the exact conditions that cause it. If, for instance, I use a `between date1 and date2` filter for a short period (a few days), I don't see this behavior and printing the SQL query looks exactly as I'd expect; doing it for a longer period (a few months) produces the unexpected behavior.

I also don't know whether this is a DuckLake bug or a DuckDB bug, but I was never able to reproduce it using just DuckDB. This is with Azure file storage via the Azure extension (3918e45), so that may also be the culprit.

To Reproduce

connection.sql:

import time

t0 = time.perf_counter()
df = conn.sql("select * from demand_history where year(interval_date) = 2025").pl()
print(f"Elapsed: {time.perf_counter() - t0:.2f}s")  # Elapsed: 1.61s

connection.table:

with open("sql_query.txt", "a") as f:
    print(conn.table("demand_history").filter("year(interval_date) = 2025").sql_query(), file=f)
t0 = time.perf_counter()
df = conn.table("demand_history").filter("year(interval_date) = 2025").pl()
print(f"Elapsed: {time.perf_counter() - t0:.2f}s") # Elapsed: 35.19s (faster than usual but point stands)

Apologies for not providing something more easily reproducible; I'm struggling to produce a simple, self-contained example. Let me know if there is any more info that would be helpful.

OS:

Ubuntu 22.04.5 LTS x86_64

DuckDB Version:

1.4.4

DuckLake Version:

3f1b372

DuckDB Client:

Python (3.12.3)

Hardware:

No response

Full Name:

Ethan Spencer

Affiliation:

ZEN Energy

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a nightly build

Did you include all relevant data sets for reproducing the issue?

No - I cannot easily share my data sets due to their large size

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have
