Python Relational API - Generated SQL query embeds data #372

@Ethan-Spencer

Description

What happens?

Using the Python relational API sometimes leads to the dataset being downloaded and the data being embedded into the generated SQL, which is then executed. This causes a major slowdown, which I assume comes from generating the massive query.

On my modestly sized data set (about 3 million rows), running the following takes about 1 second:

df = conn.sql("SELECT * FROM demand_history WHERE year(interval_date) = 2025").pl()

Doing this instead, which I expected to behave identically, takes about 40 seconds:

df = conn.table("demand_history").filter("year(interval_date) = 2025").pl()

I've attached the first 18 lines of the output of `conn.table("demand_history").filter("year(interval_date) = 2025").sql_query()`. The entire query is 500 MB. sql_query.txt

Unfortunately, it only seems to happen with a large amount of data, so I haven't been able to produce a minimal example, and I haven't narrowed down the exact conditions that cause it. If, for instance, I use a `between date1 and date2` filter for a short period (a few days), I don't see this behavior and printing the SQL query looks exactly as I'd expect; doing it for a longer period (a few months) produces the unexpected behavior.

I also don't know whether this is a DuckLake bug or a DuckDB bug, but I was never able to reproduce it using just DuckDB. This is with Azure file storage via the Azure extension (3918e45), so that may also be the culprit.

To Reproduce

connection.sql:

import time

t0 = time.perf_counter()
df = conn.sql("select * from demand_history where year(interval_date) = 2025").pl()
print(f"Elapsed: {time.perf_counter() - t0:.2f}s")  # Elapsed: 1.61s

connection.table:

with open("sql_query.txt", "a") as f:
    print(conn.table("demand_history").filter("year(interval_date) = 2025").sql_query(), file=f)
t0 = time.perf_counter()
df = conn.table("demand_history").filter("year(interval_date) = 2025").pl()
print(f"Elapsed: {time.perf_counter() - t0:.2f}s") # Elapsed: 35.19s (faster than usual but point stands)

Apologies for not providing something more easily reproducible; I'm struggling to produce a simple, self-contained example. Let me know if there is any more info that would be helpful.

OS:

Ubuntu 22.04.5 LTS x86_64

DuckDB Version:

1.4.4

DuckLake Version:

3f1b372

DuckDB Client:

Python (3.12.3)

Hardware:

No response

Full Name:

Ethan Spencer

Affiliation:

ZEN Energy

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a nightly build

Did you include all relevant data sets for reproducing the issue?

No - I cannot easily share my data sets due to their large size

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have
