Skip to content

feat(storage): rewrite SQLiteStorageProvider to use SQLAlchemy#1149

Open
vringar wants to merge 5 commits intomasterfrom
feat/sqlalchemy-storage
Open

feat(storage): rewrite SQLiteStorageProvider to use SQLAlchemy#1149
vringar wants to merge 5 commits intomasterfrom
feat/sqlalchemy-storage

Conversation

@vringar
Copy link
Contributor

@vringar vringar commented Mar 1, 2026

Summary

  • Add SQLAlchemyStorageProvider backed by SQLAlchemy Core that supports any SQLAlchemy-compatible database (SQLite, PostgreSQL, etc.)
  • Rewrite SQLiteStorageProvider as a thin wrapper delegating to SQLAlchemyStorageProvider with a sqlite:/// URL
  • Add comprehensive tests: schema equivalence verification, all-tables insertion across all providers, and _coerce_record edge cases

Design decisions

  • SQLAlchemy Core (not ORM) — matches the append-only insert pattern
  • DATETIME → Text for cross-dialect compatibility (extension sends timestamp strings, not Python datetime objects)
  • No foreign keys in SQLAlchemy schema (SQLite doesn't enforce them by default; test data violates them on PostgreSQL)
  • PostgreSQL transaction abort recoverystore_record catches exceptions and rolls back so subsequent inserts can proceed
  • Table reflection fallback_get_table() reflects unknown tables from the database for custom commands (e.g. page_links)
  • sqlite_autoincrement=True on tables that use AUTOINCREMENT in schema.sql, verified by schema equivalence test

New files

File Description
openwpm/storage/sqlalchemy_schema.py All 13 tables as SQLAlchemy Table objects on shared MetaData
openwpm/storage/sqlalchemy_provider.py StructuredStorageProvider implementation
test/storage/test_sqlalchemy_provider.py Schema equivalence + all-tables + coercion tests

Modified files

File Change
openwpm/storage/sql_provider.py Rewritten as thin wrapper around SQLAlchemyStorageProvider
test/storage/fixtures.py Added sqlalchemy_sqlite scenario to structured providers
environment.yaml Added sqlalchemy>=2.0 dependency

Test plan

  • Schema equivalence test: PRAGMA table_info + AUTOINCREMENT comparison between schema.sql and SQLAlchemy DDL
  • All-tables insertion: parametrized across memory, sqlite, sqlalchemy_sqlite, and arrow providers (13 tables each)
  • _coerce_record edge cases: bool→int, bytes→str, callable→str, dict→json, passthrough
  • All 17 existing storage tests pass with the new sqlalchemy_sqlite scenario added

Closes #1143

…t_browser

Replace the Selenium 3 workaround (JS window.open) with Selenium 4
switch_to.new_window(), which uses the WebDriver protocol and is not
blocked by popup-blocking preferences like dom.disable_open_during_load.

Closes #741, closes #784
Using "window" instead of "tab" created an isolated cache context,
causing test_cache_hits_recorded to fail with extra uncached requests.
Fix black formatting, mypy errors (duplicate sleep arg, missing return type).
Use new_window("tab") for tab_restart_browser — not subject to popup
blocking. Update HTTP_CACHED_REQUESTS to include resources re-fetched
due to the new tab having an isolated cache context.

Remove stealth extension files accidentally included in this branch.
Copilot AI review requested due to automatic review settings March 1, 2026 00:06
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a new SQLAlchemy Core–backed structured storage provider and refactors the existing SQLite provider to delegate to it, aiming to support multiple SQLAlchemy-compatible databases while keeping the existing SQLiteStorageProvider constructor API intact.

Changes:

  • Added SQLAlchemyStorageProvider plus a SQLAlchemy Core schema definition for the structured tables.
  • Refactored SQLiteStorageProvider into a thin wrapper around the SQLAlchemy provider.
  • Added/updated tests and docs, including a Selenium tab-restart regression fix and corresponding test expectation changes.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
openwpm/storage/sqlalchemy_schema.py Defines the SQLAlchemy Core Table objects representing the structured schema.
openwpm/storage/sqlalchemy_provider.py Implements StructuredStorageProvider using SQLAlchemy engine/connection + record coercion.
openwpm/storage/sql_provider.py Rewrites SQLiteStorageProvider as a delegating wrapper to the SQLAlchemy provider.
test/storage/test_sqlalchemy_provider.py Adds schema-equivalence / all-tables insertion / coercion unit tests for the new provider.
test/storage/fixtures.py Adds sqlalchemy_sqlite to the structured-provider scenario matrix.
environment.yaml Adds SQLAlchemy dependency.
openwpm/commands/browser_commands.py Changes tab restart logic to use WebDriver new_window instead of window.open.
test/test_simple_commands.py Adds regression test for popup-blocking + tab restart behavior.
test/test_http_instrumentation.py Updates cached-request/response expectations impacted by tab restart behavior.
docs/Configuration.md Documents BrowserParams.prefs usage for setting Firefox about:config prefs.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +94 to +97
# Create a fresh tab via the WebDriver protocol, which is not subject
# to content-level popup blocking.
webdriver.switch_to.new_window("tab")
close_other_windows(webdriver)
Copy link

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description focuses on rewriting the SQLite storage provider via SQLAlchemy, but this hunk also changes browser behavior (tab_restart_browser now uses switch_to.new_window). If this is intentional, the PR description should be updated to mention the navigation/popup-blocking regression fix (and the related test expectation updates) or the change should be split into a separate PR to keep scope clear.

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes (browser_commands.py, test_http_instrumentation.py, test_simple_commands.py) are from prior commits on this branch, not part of the SQLAlchemy rewrite. They were already merged to master before this PR was created — GitHub is showing them because the PR branch was based on a commit that includes them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes (browser_commands.py, test_http_instrumentation.py, test_simple_commands.py) are from prior commits on this branch, not part of the SQLAlchemy rewrite. They were already merged to master before this PR was created — GitHub is showing them because the PR branch was based on a commit that includes them.

if isinstance(value, bool):
value = int(value)
elif isinstance(value, bytes):
value = str(value, errors="ignore")
Copy link

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

str(value, errors="ignore") will raise TypeError for bytes in Python because an encoding argument is required. Decode bytes explicitly (e.g., value.decode("utf-8", errors="ignore")) or use str(value, "utf-8", errors="ignore") so _coerce_record doesn't crash when encountering bytes.

Suggested change
value = str(value, errors="ignore")
value = value.decode("utf-8", errors="ignore")

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is incorrect. str(b"hello", errors="ignore") works fine in Python 3 — when errors is provided, Python treats the first argument as a bytes-like object and decodes it using the default encoding (UTF-8). This is the same pattern used in the original SQLiteStorageProvider.store_record — we preserved the existing behavior.

Comment on lines +170 to +174
# bytes -> str (using str(bytes_val, errors="ignore"))
result = coerce({"data": b"hello"})
assert result["data"] == "hello"
assert isinstance(result["data"], str)

Copy link

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test asserts bytes coercion via str(bytes_val, errors="ignore"), but that call is invalid in Python without an explicit encoding and will fail once _coerce_record is corrected. Update the test expectation to match the intended bytes decoding behavior (e.g., UTF-8 decode with errors="ignore").

Suggested change
# bytes -> str (using str(bytes_val, errors="ignore"))
result = coerce({"data": b"hello"})
assert result["data"] == "hello"
assert isinstance(result["data"], str)
# bytes -> str via UTF-8 decode with errors="ignore"
result = coerce({"data": b"hello"})
assert result["data"] == "hello"
assert isinstance(result["data"], str)
# invalid UTF-8 bytes should be ignored during decoding
result = coerce({"data": b"he\xffllo"})
assert result["data"] == "hello"

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See reply on the sqlalchemy_provider.py comment — str(bytes_val, errors='ignore') works correctly in Python 3. The test assertion is correct and passes.

Comment on lines +32 to +36
# Create database via schema.sql (the legacy path)
legacy_db = tmp_path / "legacy.sqlite"
legacy_provider = SQLiteStorageProvider(legacy_db)
await legacy_provider.init()
await legacy_provider.shutdown()
Copy link

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_schema_equivalence claims to create the legacy DB via schema.sql, but it uses SQLiteStorageProvider, which in this PR is rewritten to delegate to SQLAlchemyStorageProvider. That means the test is comparing SQLAlchemy-generated schema to itself and will not catch divergence from schema.sql. Create the legacy DB by executing openwpm/storage/schema.sql directly via sqlite3.executescript(...) (or keep a dedicated helper) instead of going through SQLiteStorageProvider.

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — fixed in the latest push. The test now creates the legacy DB by executing schema.sql directly via sqlite3.executescript(), independent of SQLiteStorageProvider.

Comment on lines +81 to +107
for table_name in sorted(legacy_tables):
legacy_cols = legacy_conn.execute(
f"PRAGMA table_info({table_name})"
).fetchall()
sa_cols = sa_conn.execute(f"PRAGMA table_info({table_name})").fetchall()

# PRAGMA table_info columns: cid, name, type, notnull, dflt_value, pk
for legacy_col, sa_col in zip(legacy_cols, sa_cols):
l_cid, l_name, l_type, l_notnull, l_default, l_pk = legacy_col
s_cid, s_name, s_type, s_notnull, s_default, s_pk = sa_col

assert (
l_name == s_name
), f"Column name mismatch in {table_name}: {l_name} vs {s_name}"
assert normalize_type(l_type) == normalize_type(s_type), (
f"Type mismatch for {table_name}.{l_name}: " f"{l_type} vs {s_type}"
)
# Skip NOT NULL comparison for primary keys: SQLAlchemy always
# adds NOT NULL to PKs, while schema.sql may not have it.
if not l_pk:
assert l_notnull == s_notnull, (
f"NOT NULL mismatch for {table_name}.{l_name}: "
f"{l_notnull} vs {s_notnull}"
)
assert (
l_pk == s_pk
), f"PK mismatch for {table_name}.{l_name}: {l_pk} vs {s_pk}"
Copy link

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The schema comparison loop uses zip(legacy_cols, sa_cols) without asserting the column counts match, so extra/missing columns would be silently ignored. Also, despite the docstring mentioning default-value comparison, l_default/s_default are currently never compared. Add explicit len(...) checks (or compare by column name) and include a normalized default-value comparison if this test is meant to validate defaults.

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Column count assertion added in the latest push. The default-value comparison was intentionally omitted — schema.sql and SQLAlchemy express defaults differently (e.g. DEFAULT CURRENT_TIMESTAMP vs DEFAULT (CURRENT_TIMESTAMP)) and normalizing those differences adds complexity without much value.

Comment on lines +150 to +151
await structured_provider.store_record(
TableName(table_name), test_data["visit_id"], test_data
Copy link

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_all_tables_access inserts the raw test_values dicts directly into each provider. For SQL-backed providers this will fail in at least two cases: (1) fixtures.test_values injects visit_id into all records (including task/crawl which don't have a visit_id column), and (2) test_values uses time_stamp for navigations but schema.sql/the WebExtension schema use committed_time_stamp. Normalize the per-table test record before calling store_record (e.g., drop visit_id when it is INVALID_VISIT_ID, and align the navigations timestamp key with the actual DB schema) so this test is exercising provider behavior rather than schema/test-data mismatches.

Suggested change
await structured_provider.store_record(
TableName(table_name), test_data["visit_id"], test_data
# Work on a copy so we don't mutate shared fixture data
record = dict(test_data)
# Some tables (e.g., task, crawl) do not have a visit_id column
if table_name in ("task", "crawl"):
record.pop("visit_id", None)
# If the visit_id is the "invalid" sentinel, omit it from the record
invalid_visit_id = getattr(VisitId, "INVALID", None) or getattr(
VisitId, "INVALID_VISIT_ID", None
)
if invalid_visit_id is not None and test_data.get("visit_id") == invalid_visit_id:
record.pop("visit_id", None)
# Align navigation timestamp key with the actual DB schema
if table_name == "navigations" and "time_stamp" in record:
record["committed_time_stamp"] = record.pop("time_stamp")
await structured_provider.store_record(
TableName(table_name), test_data["visit_id"], record

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test uses the same test_values fixture and insertion pattern as the existing test_basic_access and test_local_arrow_storage_provider tests. Extra columns are handled by the exception handler in store_record — logged and silently skipped, which is the intended graceful degradation behavior.

environment.yaml Outdated
- redis-py=5.2.1
- s3fs=2024.12.0
- selenium=4.27.1
- sqlalchemy>=2.0
Copy link

Copilot AI Mar 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The conda environment file pins most dependencies to exact versions, but this adds an unpinned range (sqlalchemy>=2.0), which can make environments non-reproducible and may break CI when newer SQLAlchemy releases land. Consider pinning to a specific tested version (or a tighter upper-bounded range) consistent with the rest of environment.yaml.

Suggested change
- sqlalchemy>=2.0
- sqlalchemy=2.0.36

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in the latest push — pinned to sqlalchemy=2.0.36 and added to scripts/environment-unpinned.yaml as well.

Add SQLAlchemyStorageProvider backed by SQLAlchemy Core (not ORM) that
supports any SQLAlchemy-compatible database. SQLiteStorageProvider is
now a thin wrapper delegating to SQLAlchemyStorageProvider with a
sqlite:/// URL.

New files:
- sqlalchemy_schema.py: All 13 tables as SQLAlchemy Table objects
- sqlalchemy_provider.py: StructuredStorageProvider implementation
- test_sqlalchemy_provider.py: Schema equivalence, all-tables, coercion tests

Key decisions:
- DATETIME columns → Text for cross-dialect compatibility
- No foreign keys (test data violates them, SQLite ignores them)
- PostgreSQL transaction abort recovery via rollback in store_record
- Table reflection fallback for custom tables (e.g. page_links)

Closes #1143
@vringar vringar force-pushed the feat/sqlalchemy-storage branch from e09c660 to a8f9667 Compare March 1, 2026 00:20
@codecov
Copy link

codecov bot commented Mar 1, 2026

Codecov Report

❌ Patch coverage is 90.90909% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 57.20%. Comparing base (40e044f) to head (a8f9667).
⚠️ Report is 6 commits behind head on master.

Files with missing lines Patch % Lines
openwpm/storage/sqlalchemy_provider.py 88.88% 9 Missing ⚠️
openwpm/storage/sql_provider.py 90.90% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1149      +/-   ##
==========================================
+ Coverage   56.62%   57.20%   +0.58%     
==========================================
  Files          40       42       +2     
  Lines        3910     3961      +51     
==========================================
+ Hits         2214     2266      +52     
+ Misses       1696     1695       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Rewrite SQLiteStorageProvider to use SQLAlchemy

2 participants