feat(storage): rewrite SQLiteStorageProvider to use SQLAlchemy by vringar · Pull Request #1149 · openwpm/OpenWPM

vringar · 2026-03-01T00:06:37Z

Summary

Add SQLAlchemyStorageProvider backed by SQLAlchemy Core that supports any SQLAlchemy-compatible database (SQLite, PostgreSQL, etc.)
Rewrite SQLiteStorageProvider as a thin wrapper delegating to SQLAlchemyStorageProvider with a sqlite:/// URL
Add comprehensive tests: schema equivalence verification, all-tables insertion across all providers, and _coerce_record edge cases

Design decisions

SQLAlchemy Core (not ORM) — matches the append-only insert pattern
DATETIME → Text for cross-dialect compatibility (extension sends timestamp strings, not Python datetime objects)
No foreign keys in SQLAlchemy schema (SQLite doesn't enforce them by default; test data violates them on PostgreSQL)
PostgreSQL transaction abort recovery — store_record catches exceptions and rolls back so subsequent inserts can proceed
Table reflection fallback — _get_table() reflects unknown tables from the database for custom commands (e.g. page_links)
sqlite_autoincrement=True on tables that use AUTOINCREMENT in schema.sql, verified by schema equivalence test

New files

File	Description
`openwpm/storage/sqlalchemy_schema.py`	All 13 tables as SQLAlchemy `Table` objects on shared `MetaData`
`openwpm/storage/sqlalchemy_provider.py`	`StructuredStorageProvider` implementation
`test/storage/test_sqlalchemy_provider.py`	Schema equivalence + all-tables + coercion tests

Modified files

File	Change
`openwpm/storage/sql_provider.py`	Rewritten as thin wrapper around `SQLAlchemyStorageProvider`
`test/storage/fixtures.py`	Added `sqlalchemy_sqlite` scenario to structured providers
`environment.yaml`	Added `sqlalchemy>=2.0` dependency

Test plan

Schema equivalence test: PRAGMA table_info + AUTOINCREMENT comparison between schema.sql and SQLAlchemy DDL
All-tables insertion: parametrized across memory, sqlite, sqlalchemy_sqlite, and arrow providers (13 tables each)
_coerce_record edge cases: bool→int, bytes→str, callable→str, dict→json, passthrough
All 17 existing storage tests pass with the new sqlalchemy_sqlite scenario added

Closes #1143

…t_browser Replace the Selenium 3 workaround (JS window.open) with Selenium 4 switch_to.new_window(), which uses the WebDriver protocol and is not blocked by popup-blocking preferences like dom.disable_open_during_load. Closes #741, closes #784

Using "window" instead of "tab" created an isolated cache context, causing test_cache_hits_recorded to fail with extra uncached requests.

Fix black formatting, mypy errors (duplicate sleep arg, missing return type).

Use new_window("tab") for tab_restart_browser — not subject to popup blocking. Update HTTP_CACHED_REQUESTS to include resources re-fetched due to the new tab having an isolated cache context. Remove stealth extension files accidentally included in this branch.

Copilot

Pull request overview

This PR introduces a new SQLAlchemy Core–backed structured storage provider and refactors the existing SQLite provider to delegate to it, aiming to support multiple SQLAlchemy-compatible databases while keeping the existing SQLiteStorageProvider constructor API intact.

Changes:

Added SQLAlchemyStorageProvider plus a SQLAlchemy Core schema definition for the structured tables.
Refactored SQLiteStorageProvider into a thin wrapper around the SQLAlchemy provider.
Added/updated tests and docs, including a Selenium tab-restart regression fix and corresponding test expectation changes.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.

Show a summary per file

File	Description
`openwpm/storage/sqlalchemy_schema.py`	Defines the SQLAlchemy Core `Table` objects representing the structured schema.
`openwpm/storage/sqlalchemy_provider.py`	Implements `StructuredStorageProvider` using SQLAlchemy engine/connection + record coercion.
`openwpm/storage/sql_provider.py`	Rewrites `SQLiteStorageProvider` as a delegating wrapper to the SQLAlchemy provider.
`test/storage/test_sqlalchemy_provider.py`	Adds schema-equivalence / all-tables insertion / coercion unit tests for the new provider.
`test/storage/fixtures.py`	Adds `sqlalchemy_sqlite` to the structured-provider scenario matrix.
`environment.yaml`	Adds SQLAlchemy dependency.
`openwpm/commands/browser_commands.py`	Changes tab restart logic to use WebDriver `new_window` instead of `window.open`.
`test/test_simple_commands.py`	Adds regression test for popup-blocking + tab restart behavior.
`test/test_http_instrumentation.py`	Updates cached-request/response expectations impacted by tab restart behavior.
`docs/Configuration.md`	Documents `BrowserParams.prefs` usage for setting Firefox `about:config` prefs.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-01T00:16:11Z

openwpm/commands/browser_commands.py

+    # Create a fresh tab via the WebDriver protocol, which is not subject
+    # to content-level popup blocking.
+    webdriver.switch_to.new_window("tab")
+    close_other_windows(webdriver)


The PR description focuses on rewriting the SQLite storage provider via SQLAlchemy, but this hunk also changes browser behavior (tab_restart_browser now uses switch_to.new_window). If this is intentional, the PR description should be updated to mention the navigation/popup-blocking regression fix (and the related test expectation updates) or the change should be split into a separate PR to keep scope clear.

These changes (browser_commands.py, test_http_instrumentation.py, test_simple_commands.py) are from prior commits on this branch, not part of the SQLAlchemy rewrite. They were already merged to master before this PR was created — GitHub is showing them because the PR branch was based on a commit that includes them.

Copilot · 2026-03-01T00:16:11Z

openwpm/storage/sqlalchemy_provider.py

+            if isinstance(value, bool):
+                value = int(value)
+            elif isinstance(value, bytes):
+                value = str(value, errors="ignore")


str(value, errors="ignore") will raise TypeError for bytes in Python because an encoding argument is required. Decode bytes explicitly (e.g., value.decode("utf-8", errors="ignore")) or use str(value, "utf-8", errors="ignore") so _coerce_record doesn't crash when encountering bytes.

Suggested change

value = str(value, errors="ignore")

value = value.decode("utf-8", errors="ignore")

This is incorrect. str(b"hello", errors="ignore") works fine in Python 3 — when errors is provided, Python treats the first argument as a bytes-like object and decodes it using the default encoding (UTF-8). This is the same pattern used in the original SQLiteStorageProvider.store_record — we preserved the existing behavior.

Copilot · 2026-03-01T00:16:12Z

test/storage/test_sqlalchemy_provider.py

+    # bytes -> str (using str(bytes_val, errors="ignore"))
+    result = coerce({"data": b"hello"})
+    assert result["data"] == "hello"
+    assert isinstance(result["data"], str)
+


This test asserts bytes coercion via str(bytes_val, errors="ignore"), but that call is invalid in Python without an explicit encoding and will fail once _coerce_record is corrected. Update the test expectation to match the intended bytes decoding behavior (e.g., UTF-8 decode with errors="ignore").

Suggested change

# bytes -> str (using str(bytes_val, errors="ignore"))

result = coerce({"data": b"hello"})

assert result["data"] == "hello"

assert isinstance(result["data"], str)

# bytes -> str via UTF-8 decode with errors="ignore"

result = coerce({"data": b"hello"})

assert result["data"] == "hello"

assert isinstance(result["data"], str)

# invalid UTF-8 bytes should be ignored during decoding

result = coerce({"data": b"he\xffllo"})

assert result["data"] == "hello"

See reply on the sqlalchemy_provider.py comment — str(bytes_val, errors='ignore') works correctly in Python 3. The test assertion is correct and passes.

Copilot · 2026-03-01T00:16:12Z

test/storage/test_sqlalchemy_provider.py

+    # Create database via schema.sql (the legacy path)
+    legacy_db = tmp_path / "legacy.sqlite"
+    legacy_provider = SQLiteStorageProvider(legacy_db)
+    await legacy_provider.init()
+    await legacy_provider.shutdown()


test_schema_equivalence claims to create the legacy DB via schema.sql, but it uses SQLiteStorageProvider, which in this PR is rewritten to delegate to SQLAlchemyStorageProvider. That means the test is comparing SQLAlchemy-generated schema to itself and will not catch divergence from schema.sql. Create the legacy DB by executing openwpm/storage/schema.sql directly via sqlite3.executescript(...) (or keep a dedicated helper) instead of going through SQLiteStorageProvider.

Good catch — fixed in the latest push. The test now creates the legacy DB by executing schema.sql directly via sqlite3.executescript(), independent of SQLiteStorageProvider.

Copilot · 2026-03-01T00:16:12Z

test/storage/test_sqlalchemy_provider.py

+        for table_name in sorted(legacy_tables):
+            legacy_cols = legacy_conn.execute(
+                f"PRAGMA table_info({table_name})"
+            ).fetchall()
+            sa_cols = sa_conn.execute(f"PRAGMA table_info({table_name})").fetchall()
+
+            # PRAGMA table_info columns: cid, name, type, notnull, dflt_value, pk
+            for legacy_col, sa_col in zip(legacy_cols, sa_cols):
+                l_cid, l_name, l_type, l_notnull, l_default, l_pk = legacy_col
+                s_cid, s_name, s_type, s_notnull, s_default, s_pk = sa_col
+
+                assert (
+                    l_name == s_name
+                ), f"Column name mismatch in {table_name}: {l_name} vs {s_name}"
+                assert normalize_type(l_type) == normalize_type(s_type), (
+                    f"Type mismatch for {table_name}.{l_name}: " f"{l_type} vs {s_type}"
+                )
+                # Skip NOT NULL comparison for primary keys: SQLAlchemy always
+                # adds NOT NULL to PKs, while schema.sql may not have it.
+                if not l_pk:
+                    assert l_notnull == s_notnull, (
+                        f"NOT NULL mismatch for {table_name}.{l_name}: "
+                        f"{l_notnull} vs {s_notnull}"
+                    )
+                assert (
+                    l_pk == s_pk
+                ), f"PK mismatch for {table_name}.{l_name}: {l_pk} vs {s_pk}"


The schema comparison loop uses zip(legacy_cols, sa_cols) without asserting the column counts match, so extra/missing columns would be silently ignored. Also, despite the docstring mentioning default-value comparison, l_default/s_default are currently never compared. Add explicit len(...) checks (or compare by column name) and include a normalized default-value comparison if this test is meant to validate defaults.

Column count assertion added in the latest push. The default-value comparison was intentionally omitted — schema.sql and SQLAlchemy express defaults differently (e.g. DEFAULT CURRENT_TIMESTAMP vs DEFAULT (CURRENT_TIMESTAMP)) and normalizing those differences adds complexity without much value.

Copilot · 2026-03-01T00:16:12Z

test/storage/test_sqlalchemy_provider.py

+        await structured_provider.store_record(
+            TableName(table_name), test_data["visit_id"], test_data


test_all_tables_access inserts the raw test_values dicts directly into each provider. For SQL-backed providers this will fail in at least two cases: (1) fixtures.test_values injects visit_id into all records (including task/crawl which don't have a visit_id column), and (2) test_values uses time_stamp for navigations but schema.sql/the WebExtension schema use committed_time_stamp. Normalize the per-table test record before calling store_record (e.g., drop visit_id when it is INVALID_VISIT_ID, and align the navigations timestamp key with the actual DB schema) so this test is exercising provider behavior rather than schema/test-data mismatches.

Suggested change

await structured_provider.store_record(

TableName(table_name), test_data["visit_id"], test_data

# Work on a copy so we don't mutate shared fixture data

record = dict(test_data)

# Some tables (e.g., task, crawl) do not have a visit_id column

if table_name in ("task", "crawl"):

record.pop("visit_id", None)

# If the visit_id is the "invalid" sentinel, omit it from the record

invalid_visit_id = getattr(VisitId, "INVALID", None) or getattr(

VisitId, "INVALID_VISIT_ID", None

)

if invalid_visit_id is not None and test_data.get("visit_id") == invalid_visit_id:

record.pop("visit_id", None)

# Align navigation timestamp key with the actual DB schema

if table_name == "navigations" and "time_stamp" in record:

record["committed_time_stamp"] = record.pop("time_stamp")

await structured_provider.store_record(

TableName(table_name), test_data["visit_id"], record

This test uses the same test_values fixture and insertion pattern as the existing test_basic_access and test_local_arrow_storage_provider tests. Extra columns are handled by the exception handler in store_record — logged and silently skipped, which is the intended graceful degradation behavior.

Copilot · 2026-03-01T00:16:12Z

environment.yaml

 - redis-py=5.2.1
 - s3fs=2024.12.0
 - selenium=4.27.1
+- sqlalchemy>=2.0


The conda environment file pins most dependencies to exact versions, but this adds an unpinned range (sqlalchemy>=2.0), which can make environments non-reproducible and may break CI when newer SQLAlchemy releases land. Consider pinning to a specific tested version (or a tighter upper-bounded range) consistent with the rest of environment.yaml.

Suggested change

- sqlalchemy>=2.0

- sqlalchemy=2.0.36

Fixed in the latest push — pinned to sqlalchemy=2.0.36 and added to scripts/environment-unpinned.yaml as well.

Add SQLAlchemyStorageProvider backed by SQLAlchemy Core (not ORM) that supports any SQLAlchemy-compatible database. SQLiteStorageProvider is now a thin wrapper delegating to SQLAlchemyStorageProvider with a sqlite:/// URL. New files: - sqlalchemy_schema.py: All 13 tables as SQLAlchemy Table objects - sqlalchemy_provider.py: StructuredStorageProvider implementation - test_sqlalchemy_provider.py: Schema equivalence, all-tables, coercion tests Key decisions: - DATETIME columns → Text for cross-dialect compatibility - No foreign keys (test data violates them, SQLite ignores them) - PostgreSQL transaction abort recovery via rollback in store_record - Table reflection fallback for custom tables (e.g. page_links) Closes #1143

codecov · 2026-03-01T00:34:50Z

Codecov Report

❌ Patch coverage is 90.90909% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 57.20%. Comparing base (40e044f) to head (a8f9667).
⚠️ Report is 6 commits behind head on master.

Files with missing lines	Patch %	Lines
openwpm/storage/sqlalchemy_provider.py	88.88%	9 Missing ⚠️
openwpm/storage/sql_provider.py	90.90%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1149      +/-   ##
==========================================
+ Coverage   56.62%   57.20%   +0.58%     
==========================================
  Files          40       42       +2     
  Lines        3910     3961      +51     
==========================================
+ Hits         2214     2266      +52     
+ Misses       1696     1695       -1

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

vringar added 4 commits February 27, 2026 00:05

fix: use new_window("tab") to preserve HTTP cache across tab restarts

0f03c2f

Using "window" instead of "tab" created an isolated cache context, causing test_cache_hits_recorded to fail with extra uncached requests.

fix: pre-commit issues in test_stealth.py

4a88bde

Fix black formatting, mypy errors (duplicate sleep arg, missing return type).

Copilot AI review requested due to automatic review settings March 1, 2026 00:06

Copilot started reviewing on behalf of vringar March 1, 2026 00:08 View session

Copilot AI reviewed Mar 1, 2026

View reviewed changes

vringar force-pushed the feat/sqlalchemy-storage branch from e09c660 to a8f9667 Compare March 1, 2026 00:20

	value = str(value, errors="ignore")
	value = value.decode("utf-8", errors="ignore")

		await structured_provider.store_record(
		TableName(table_name), test_data["visit_id"], test_data

-        await structured_provider.store_record(
-            TableName(table_name), test_data["visit_id"], test_data
+        # Work on a copy so we don't mutate shared fixture data
+        record = dict(test_data)
+        # Some tables (e.g., task, crawl) do not have a visit_id column
+        if table_name in ("task", "crawl"):
+            record.pop("visit_id", None)
+        # If the visit_id is the "invalid" sentinel, omit it from the record
+        invalid_visit_id = getattr(VisitId, "INVALID", None) or getattr(
+            VisitId, "INVALID_VISIT_ID", None
+        )
+        if invalid_visit_id is not None and test_data.get("visit_id") == invalid_visit_id:
+            record.pop("visit_id", None)
+        # Align navigation timestamp key with the actual DB schema
+        if table_name == "navigations" and "time_stamp" in record:
+            record["committed_time_stamp"] = record.pop("time_stamp")
+        await structured_provider.store_record(
+            TableName(table_name), test_data["visit_id"], record

Conversation

vringar commented Mar 1, 2026

Summary

Design decisions

New files

Modified files

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

codecov bot commented Mar 1, 2026

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants