Skip to content

fix(oom): prevent RSS growth during VCR recording + sanitizer dedup#6

Open
matyas-jirat-keboola wants to merge 3 commits intomainfrom
fix/oom-clean
Open

fix(oom): prevent RSS growth during VCR recording + sanitizer dedup#6
matyas-jirat-keboola wants to merge 3 commits intomainfrom
fix/oom-clean

Conversation

@matyas-jirat-keboola
Copy link
Collaborator

Summary

Two self-contained commits:

1. feat(sanitizers): merge duplicate sanitizers + pre-scan optimisation

  • Each sanitizer class gains a merge() method
  • _dedup_sanitizers() collapses same-class sanitizers into one, so the chain makes a single pass over each request/response instead of N passes
  • DefaultSanitizer._sanitize_body gains a pre-scan: skips the full JSON parse when no sensitive field name or value is present in the body (fast path for large unrelated responses)

2. fix(oom): prevent RSS growth during VCR recording

Root cause: every HTTPS interaction caused urllib3 to call load_default_certs() on a brand-new ssl.SSLContext, allocating ~480 KB of CA cert data into the OpenSSL C-heap via jemalloc. jemalloc never returns this to the OS, producing ~7 MB RSS growth per 10 interactions → OOM for jobs with 500–2000 interactions.

Three root causes fixed:

  • Pool proliferation — components creating a new requests.Session per call each get their own PoolManager → N pools → N ssl_context allocations. Fix: _pool_reuse_patch() returns one shared pool per (scheme, host, port).
  • ssl_context per connect() — even with one pool, requests 2.32.x never puts ssl_context into pool.conn_kw, so urllib3 calls load_default_certs() on every connect(). Fix: inject a shared SSLContext into pool.conn_kw at pool creation time.
  • Connection churnVCRConnection.is_connected delegated to real_connection.sock (None after Connection: close), causing _new_conn() on every request. Fix: override is_connected to return True when a VCR connection/socket is active; implement VCRHTTPResponse.release_conn() so the pool queue is refilled.

Also: response bodies written in 64 KB chunks to avoid large transient heap allocations fragmenting glibc's heap.

Confirmed working on Keboola platform — RSS growth eliminated on a 100+ interaction Facebook Ads recording run.

Test plan

  • 119 unit tests pass (uv run pytest)
  • Confirmed working on Keboola platform

🤖 Generated with Claude Code

matyas-jirat-keboola and others added 3 commits March 5, 2026 19:53
…ation

- Add merge() to DefaultSanitizer, TokenSanitizer, HeaderSanitizer,
  QueryParamSanitizer, UrlPatternSanitizer, ResponseUrlSanitizer
- Add _dedup_sanitizers() which collapses same-class sanitizers into one,
  so the sanitizer chain makes a single pass over each request/response
  instead of N passes when multiple sanitizers of the same type are present
- Add pre-scan in DefaultSanitizer._sanitize_body: skip the full JSON parse
  when no sensitive field name or value is present in the body (fast path)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Root cause: every HTTPS interaction during recording caused urllib3 to call
load_default_certs() on a brand-new ssl.SSLContext, allocating ~480 KB of
CA cert data into the OpenSSL C-heap (via jemalloc).  jemalloc's
high-watermark behaviour means this memory is never returned to the OS,
producing ~7 MB RSS growth per 10 interactions and OOM for jobs with
500–2000 interactions.

Three root causes, three fixes applied inside _pool_reuse_patch():

1. Pool proliferation: components that create a new requests.Session per
   API call (e.g. Facebook Ads batch requests) each get their own
   urllib3.PoolManager and a fresh HTTPSConnectionPool, so N interactions
   → N pools → N ssl_context allocations.  Fix: intercept
   PoolManager.connection_from_host and return one shared pool per
   (scheme, host, port) for the lifetime of the recording.

2. ssl_context per connect(): even with a single pool, urllib3 calls
   _new_conn() → connect() → _ssl_wrap_socket_and_match_hostname(
   ssl_context=None) for each interaction because requests 2.32.x never
   puts ssl_context into pool.conn_kw.  When ssl_context is None, urllib3
   creates a new context and calls load_default_certs() every time.  Fix:
   inject one shared ssl.SSLContext (with load_default_certs() pre-called)
   into pool.conn_kw["ssl_context"] so all _new_conn() connections share it.

3. Connection churn: VCRConnection.is_connected delegated to
   real_connection.sock, which is None after Connection: close responses.
   urllib3 saw is_connected=False and called _new_conn() for every request.
   Fix: override is_connected on VCRConnection to always return True when
   real_connection exists (recording) or a VCRFakeSocket is present (replay).
   Additionally, implement VCRHTTPResponse.release_conn() so urllib3's pool
   queue is refilled after each response and _new_conn() is not called
   unnecessarily.

Also: write large response body strings to the cassette file in 64 KB chunks
(_write_interaction_chunked) to prevent multi-MB transient allocations from
fragmenting glibc's heap during recording of APIs with large JSON payloads.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
_vcr_is_connected uses _VCRFakeSocket via a global lookup (module-level
function, not a closure). Deleting it with `del` before the property is
ever called would raise NameError at runtime. Remove it from the del list
so it stays available as a module-level name.

Also fix ruff I001: import block sort order in sanitizers.py.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant