feat(storage): rewrite SQLiteStorageProvider to use SQLAlchemy#1149
feat(storage): rewrite SQLiteStorageProvider to use SQLAlchemy#1149
Conversation
Using "window" instead of "tab" created an isolated cache context, causing test_cache_hits_recorded to fail with extra uncached requests.
Fix black formatting, mypy errors (duplicate sleep arg, missing return type).
Use new_window("tab") for tab_restart_browser — not subject to popup
blocking. Update HTTP_CACHED_REQUESTS to include resources re-fetched
due to the new tab having an isolated cache context.
Remove stealth extension files accidentally included in this branch.
There was a problem hiding this comment.
Pull request overview
This PR introduces a new SQLAlchemy Core–backed structured storage provider and refactors the existing SQLite provider to delegate to it, aiming to support multiple SQLAlchemy-compatible databases while keeping the existing SQLiteStorageProvider constructor API intact.
Changes:
- Added
SQLAlchemyStorageProviderplus a SQLAlchemy Core schema definition for the structured tables. - Refactored
SQLiteStorageProviderinto a thin wrapper around the SQLAlchemy provider. - Added/updated tests and docs, including a Selenium tab-restart regression fix and corresponding test expectation changes.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
openwpm/storage/sqlalchemy_schema.py |
Defines the SQLAlchemy Core Table objects representing the structured schema. |
openwpm/storage/sqlalchemy_provider.py |
Implements StructuredStorageProvider using SQLAlchemy engine/connection + record coercion. |
openwpm/storage/sql_provider.py |
Rewrites SQLiteStorageProvider as a delegating wrapper to the SQLAlchemy provider. |
test/storage/test_sqlalchemy_provider.py |
Adds schema-equivalence / all-tables insertion / coercion unit tests for the new provider. |
test/storage/fixtures.py |
Adds sqlalchemy_sqlite to the structured-provider scenario matrix. |
environment.yaml |
Adds SQLAlchemy dependency. |
openwpm/commands/browser_commands.py |
Changes tab restart logic to use WebDriver new_window instead of window.open. |
test/test_simple_commands.py |
Adds regression test for popup-blocking + tab restart behavior. |
test/test_http_instrumentation.py |
Updates cached-request/response expectations impacted by tab restart behavior. |
docs/Configuration.md |
Documents BrowserParams.prefs usage for setting Firefox about:config prefs. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # Create a fresh tab via the WebDriver protocol, which is not subject | ||
| # to content-level popup blocking. | ||
| webdriver.switch_to.new_window("tab") | ||
| close_other_windows(webdriver) |
There was a problem hiding this comment.
The PR description focuses on rewriting the SQLite storage provider via SQLAlchemy, but this hunk also changes browser behavior (tab_restart_browser now uses switch_to.new_window). If this is intentional, the PR description should be updated to mention the navigation/popup-blocking regression fix (and the related test expectation updates) or the change should be split into a separate PR to keep scope clear.
There was a problem hiding this comment.
These changes (browser_commands.py, test_http_instrumentation.py, test_simple_commands.py) are from prior commits on this branch, not part of the SQLAlchemy rewrite. They were already merged to master before this PR was created — GitHub is showing them because the PR branch was based on a commit that includes them.
There was a problem hiding this comment.
These changes (browser_commands.py, test_http_instrumentation.py, test_simple_commands.py) are from prior commits on this branch, not part of the SQLAlchemy rewrite. They were already merged to master before this PR was created — GitHub is showing them because the PR branch was based on a commit that includes them.
| if isinstance(value, bool): | ||
| value = int(value) | ||
| elif isinstance(value, bytes): | ||
| value = str(value, errors="ignore") |
There was a problem hiding this comment.
str(value, errors="ignore") will raise TypeError for bytes in Python because an encoding argument is required. Decode bytes explicitly (e.g., value.decode("utf-8", errors="ignore")) or use str(value, "utf-8", errors="ignore") so _coerce_record doesn't crash when encountering bytes.
| value = str(value, errors="ignore") | |
| value = value.decode("utf-8", errors="ignore") |
There was a problem hiding this comment.
This is incorrect. str(b"hello", errors="ignore") works fine in Python 3 — when errors is provided, Python treats the first argument as a bytes-like object and decodes it using the default encoding (UTF-8). This is the same pattern used in the original SQLiteStorageProvider.store_record — we preserved the existing behavior.
| # bytes -> str (using str(bytes_val, errors="ignore")) | ||
| result = coerce({"data": b"hello"}) | ||
| assert result["data"] == "hello" | ||
| assert isinstance(result["data"], str) | ||
|
|
There was a problem hiding this comment.
This test asserts bytes coercion via str(bytes_val, errors="ignore"), but that call is invalid in Python without an explicit encoding and will fail once _coerce_record is corrected. Update the test expectation to match the intended bytes decoding behavior (e.g., UTF-8 decode with errors="ignore").
| # bytes -> str (using str(bytes_val, errors="ignore")) | |
| result = coerce({"data": b"hello"}) | |
| assert result["data"] == "hello" | |
| assert isinstance(result["data"], str) | |
| # bytes -> str via UTF-8 decode with errors="ignore" | |
| result = coerce({"data": b"hello"}) | |
| assert result["data"] == "hello" | |
| assert isinstance(result["data"], str) | |
| # invalid UTF-8 bytes should be ignored during decoding | |
| result = coerce({"data": b"he\xffllo"}) | |
| assert result["data"] == "hello" |
There was a problem hiding this comment.
See reply on the sqlalchemy_provider.py comment — str(bytes_val, errors='ignore') works correctly in Python 3. The test assertion is correct and passes.
| # Create database via schema.sql (the legacy path) | ||
| legacy_db = tmp_path / "legacy.sqlite" | ||
| legacy_provider = SQLiteStorageProvider(legacy_db) | ||
| await legacy_provider.init() | ||
| await legacy_provider.shutdown() |
There was a problem hiding this comment.
test_schema_equivalence claims to create the legacy DB via schema.sql, but it uses SQLiteStorageProvider, which in this PR is rewritten to delegate to SQLAlchemyStorageProvider. That means the test is comparing SQLAlchemy-generated schema to itself and will not catch divergence from schema.sql. Create the legacy DB by executing openwpm/storage/schema.sql directly via sqlite3.executescript(...) (or keep a dedicated helper) instead of going through SQLiteStorageProvider.
There was a problem hiding this comment.
Good catch — fixed in the latest push. The test now creates the legacy DB by executing schema.sql directly via sqlite3.executescript(), independent of SQLiteStorageProvider.
| for table_name in sorted(legacy_tables): | ||
| legacy_cols = legacy_conn.execute( | ||
| f"PRAGMA table_info({table_name})" | ||
| ).fetchall() | ||
| sa_cols = sa_conn.execute(f"PRAGMA table_info({table_name})").fetchall() | ||
|
|
||
| # PRAGMA table_info columns: cid, name, type, notnull, dflt_value, pk | ||
| for legacy_col, sa_col in zip(legacy_cols, sa_cols): | ||
| l_cid, l_name, l_type, l_notnull, l_default, l_pk = legacy_col | ||
| s_cid, s_name, s_type, s_notnull, s_default, s_pk = sa_col | ||
|
|
||
| assert ( | ||
| l_name == s_name | ||
| ), f"Column name mismatch in {table_name}: {l_name} vs {s_name}" | ||
| assert normalize_type(l_type) == normalize_type(s_type), ( | ||
| f"Type mismatch for {table_name}.{l_name}: " f"{l_type} vs {s_type}" | ||
| ) | ||
| # Skip NOT NULL comparison for primary keys: SQLAlchemy always | ||
| # adds NOT NULL to PKs, while schema.sql may not have it. | ||
| if not l_pk: | ||
| assert l_notnull == s_notnull, ( | ||
| f"NOT NULL mismatch for {table_name}.{l_name}: " | ||
| f"{l_notnull} vs {s_notnull}" | ||
| ) | ||
| assert ( | ||
| l_pk == s_pk | ||
| ), f"PK mismatch for {table_name}.{l_name}: {l_pk} vs {s_pk}" |
There was a problem hiding this comment.
The schema comparison loop uses zip(legacy_cols, sa_cols) without asserting the column counts match, so extra/missing columns would be silently ignored. Also, despite the docstring mentioning default-value comparison, l_default/s_default are currently never compared. Add explicit len(...) checks (or compare by column name) and include a normalized default-value comparison if this test is meant to validate defaults.
There was a problem hiding this comment.
Column count assertion added in the latest push. The default-value comparison was intentionally omitted — schema.sql and SQLAlchemy express defaults differently (e.g. DEFAULT CURRENT_TIMESTAMP vs DEFAULT (CURRENT_TIMESTAMP)) and normalizing those differences adds complexity without much value.
| await structured_provider.store_record( | ||
| TableName(table_name), test_data["visit_id"], test_data |
There was a problem hiding this comment.
test_all_tables_access inserts the raw test_values dicts directly into each provider. For SQL-backed providers this will fail in at least two cases: (1) fixtures.test_values injects visit_id into all records (including task/crawl which don't have a visit_id column), and (2) test_values uses time_stamp for navigations but schema.sql/the WebExtension schema use committed_time_stamp. Normalize the per-table test record before calling store_record (e.g., drop visit_id when it is INVALID_VISIT_ID, and align the navigations timestamp key with the actual DB schema) so this test is exercising provider behavior rather than schema/test-data mismatches.
| await structured_provider.store_record( | |
| TableName(table_name), test_data["visit_id"], test_data | |
| # Work on a copy so we don't mutate shared fixture data | |
| record = dict(test_data) | |
| # Some tables (e.g., task, crawl) do not have a visit_id column | |
| if table_name in ("task", "crawl"): | |
| record.pop("visit_id", None) | |
| # If the visit_id is the "invalid" sentinel, omit it from the record | |
| invalid_visit_id = getattr(VisitId, "INVALID", None) or getattr( | |
| VisitId, "INVALID_VISIT_ID", None | |
| ) | |
| if invalid_visit_id is not None and test_data.get("visit_id") == invalid_visit_id: | |
| record.pop("visit_id", None) | |
| # Align navigation timestamp key with the actual DB schema | |
| if table_name == "navigations" and "time_stamp" in record: | |
| record["committed_time_stamp"] = record.pop("time_stamp") | |
| await structured_provider.store_record( | |
| TableName(table_name), test_data["visit_id"], record |
There was a problem hiding this comment.
This test uses the same test_values fixture and insertion pattern as the existing test_basic_access and test_local_arrow_storage_provider tests. Extra columns are handled by the exception handler in store_record — logged and silently skipped, which is the intended graceful degradation behavior.
environment.yaml
Outdated
| - redis-py=5.2.1 | ||
| - s3fs=2024.12.0 | ||
| - selenium=4.27.1 | ||
| - sqlalchemy>=2.0 |
There was a problem hiding this comment.
The conda environment file pins most dependencies to exact versions, but this adds an unpinned range (sqlalchemy>=2.0), which can make environments non-reproducible and may break CI when newer SQLAlchemy releases land. Consider pinning to a specific tested version (or a tighter upper-bounded range) consistent with the rest of environment.yaml.
| - sqlalchemy>=2.0 | |
| - sqlalchemy=2.0.36 |
There was a problem hiding this comment.
Fixed in the latest push — pinned to sqlalchemy=2.0.36 and added to scripts/environment-unpinned.yaml as well.
Add SQLAlchemyStorageProvider backed by SQLAlchemy Core (not ORM) that supports any SQLAlchemy-compatible database. SQLiteStorageProvider is now a thin wrapper delegating to SQLAlchemyStorageProvider with a sqlite:/// URL. New files: - sqlalchemy_schema.py: All 13 tables as SQLAlchemy Table objects - sqlalchemy_provider.py: StructuredStorageProvider implementation - test_sqlalchemy_provider.py: Schema equivalence, all-tables, coercion tests Key decisions: - DATETIME columns → Text for cross-dialect compatibility - No foreign keys (test data violates them, SQLite ignores them) - PostgreSQL transaction abort recovery via rollback in store_record - Table reflection fallback for custom tables (e.g. page_links) Closes #1143
e09c660 to
a8f9667
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #1149 +/- ##
==========================================
+ Coverage 56.62% 57.20% +0.58%
==========================================
Files 40 42 +2
Lines 3910 3961 +51
==========================================
+ Hits 2214 2266 +52
+ Misses 1696 1695 -1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Summary
SQLAlchemyStorageProviderbacked by SQLAlchemy Core that supports any SQLAlchemy-compatible database (SQLite, PostgreSQL, etc.)SQLiteStorageProvideras a thin wrapper delegating toSQLAlchemyStorageProviderwith asqlite:///URL_coerce_recordedge casesDesign decisions
store_recordcatches exceptions and rolls back so subsequent inserts can proceed_get_table()reflects unknown tables from the database for custom commands (e.g.page_links)sqlite_autoincrement=Trueon tables that useAUTOINCREMENTin schema.sql, verified by schema equivalence testNew files
openwpm/storage/sqlalchemy_schema.pyTableobjects on sharedMetaDataopenwpm/storage/sqlalchemy_provider.pyStructuredStorageProviderimplementationtest/storage/test_sqlalchemy_provider.pyModified files
openwpm/storage/sql_provider.pySQLAlchemyStorageProvidertest/storage/fixtures.pysqlalchemy_sqlitescenario to structured providersenvironment.yamlsqlalchemy>=2.0dependencyTest plan
_coerce_recordedge cases: bool→int, bytes→str, callable→str, dict→json, passthroughsqlalchemy_sqlitescenario addedCloses #1143