Description
What happens?
Using the Python relational API sometimes leads to the dataset being downloaded and its data being embedded into the generated SQL, which is then executed. This causes a major slowdown, which I assume comes from generating the massive query.
On my modestly sized data set (about 3 million rows), running the following takes about 1 second:

```python
df = conn.sql("SELECT * FROM demand_history WHERE year(interval_date) = 2025").pl()
```

Doing this instead, which I expected to be exactly equivalent, takes about 40 seconds:

```python
df = conn.table("demand_history").filter("year(interval_date) = 2025").pl()
```
I've attached the first 18 lines of the output of `conn.table("demand_history").filter("year(interval_date) = 2025").sql_query()`. The entire query is 500 MB. sql_query.txt
Unfortunately it only seems to happen with a large amount of data, so I haven't been able to produce a minimal example, and I haven't narrowed down the exact conditions that trigger it. If I instead use a `between date1 and date2` filter over a short period (a few days), I don't see this behavior: printing the SQL query looks exactly as I'd expect. Doing it for a longer period (a few months) results in the unexpected behavior.
I also don't know whether this is a DuckLake bug or a DuckDB bug, but I was never able to reproduce it using DuckDB alone. This is with Azure file storage via the Azure extension (3918e45), so that may also be the culprit.
To Reproduce
connection.sql:

```python
t0 = time.perf_counter()
df = conn.sql("select * from demand_history where year(interval_date) = 2025").pl()
print(f"Elapsed: {time.perf_counter() - t0:.2f}s")  # Elapsed: 1.61s
```

connection.table:

```python
with open("sql_query.txt", "a") as f:
    print(conn.table("demand_history").filter("year(interval_date) = 2025").sql_query(), file=f)

t0 = time.perf_counter()
df = conn.table("demand_history").filter("year(interval_date) = 2025").pl()
print(f"Elapsed: {time.perf_counter() - t0:.2f}s")  # Elapsed: 35.19s (faster than usual, but the point stands)
```
Apologies for not providing something more self-contained; I'm struggling to produce a simple reproducible example. Let me know if there is any more info that would be helpful.
OS:
Ubuntu 22.04.5 LTS x86_64
DuckDB Version:
1.4.4
DuckLake Version:
3f1b372
DuckDB Client:
Python (3.12.3)
Hardware:
No response
Full Name:
Ethan Spencer
Affiliation:
ZEN Energy
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have tested with a nightly build
Did you include all relevant data sets for reproducing the issue?
No - I cannot easily share my data sets due to their large size
Did you include all code required to reproduce the issue?
- Yes, I have
Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?
- Yes, I have