fix(oom): prevent RSS growth during VCR recording by reusing SSLContext per pool #5

Closed
matyas-jirat-keboola wants to merge 19 commits into main from fix/oom-vcr-sanitizer-overhead

Conversation

@matyas-jirat-keboola
Collaborator

Summary

  • Root cause: Every HTTPS connection created during VCR recording called urllib3._ssl_wrap_socket_and_match_hostname() with ssl_context=None (because requests 2.32.5 never puts ssl_context into pool.conn_kw). This triggered create_urllib3_context() + load_default_certs() per interaction, allocating ~480 KB of CA cert data into the OpenSSL C-heap. jemalloc's high-watermark behavior means this memory is never returned to the OS — causing ~7 MB/10-interaction linear RSS growth.
  • Fix (149be86): Inject one shared SSLContext (with load_default_certs() pre-called) into pool.conn_kw["ssl_context"] when the pool is created. All subsequent _new_conn() connections receive it via **conn_kw, so connect() reuses the same context and never calls load_default_certs() again.
  • Additional supporting fixes already on branch: pool reuse (no pool proliferation), zero-copy response reader, JSONL streaming to avoid accumulating interactions in memory, release_conn() / is_connected fixes.
  • Diagnostic instrumentation stripped; only functional code remains.

Test plan

  • 119 unit tests pass (uv run pytest)
  • Confirmed working on Keboola platform — RSS growth eliminated during recording run with 100+ HTTP interactions

🤖 Generated with Claude Code

matyas-jirat-keboola and others added 19 commits March 5, 2026 10:49
Three targeted fixes for OOM failures at 256MB when recording debug cassettes
with hundreds of Facebook Ads/Pages interactions:

Fix 1 (DefaultSanitizer._sanitize_body): Add a quick string pre-scan before
json.loads(). For responses where no sensitive JSON field names or secret values
appear in the body, the entire JSON parse+walk is skipped. Applies to ~90% of
Facebook Ads responses (tokens appear in requests, not response bodies), dropping
per-response overhead from ~500KB to ~50KB of heap fragmentation.
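
The pre-scan can be sketched as follows (the marker list, field names, and function name are illustrative, not the component's actual API):

```python
import json

# Field names assumed sensitive for illustration; the real list lives in
# DefaultSanitizer.
SENSITIVE_MARKERS = ("access_token", "client_secret", "app_secret")

def sanitize_body(body: str) -> str:
    # Cheap O(n) substring pre-scan: if no marker can possibly appear in
    # the body, skip the JSON parse+walk entirely.
    if not any(marker in body for marker in SENSITIVE_MARKERS):
        return body  # fast path for most response bodies

    def redact(node):
        if isinstance(node, dict):
            return {k: "***" if k in SENSITIVE_MARKERS else redact(v)
                    for k, v in node.items()}
        if isinstance(node, list):
            return [redact(v) for v in node]
        return node

    return json.dumps(redact(json.loads(body)))
```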

Fix 2 (_dedup_sanitizers + merge()): record_debug_run() prepends its own
DefaultSanitizer then extends with the caller's VCR_SANITIZERS (which also
contains a DefaultSanitizer). This caused every response body to be JSON-parsed
twice. Added merge() methods to DefaultSanitizer, TokenSanitizer, HeaderSanitizer,
QueryParamSanitizer, UrlPatternSanitizer, and ResponseUrlSanitizer (which now
stores dynamic_params for merging). _dedup_sanitizers() merges same-class
sanitizers in order, halving the sanitization work per interaction.
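
A minimal sketch of the dedup pass, assuming each sanitizer class exposes the merge() method described above:

```python
def dedup_sanitizers(sanitizers):
    # Keep the first instance of each class and fold later same-class
    # instances into it via merge(), preserving the original order.
    merged, by_class = [], {}
    for s in sanitizers:
        prev = by_class.get(type(s))
        if prev is not None:
            prev.merge(s)
        else:
            by_class[type(s)] = s
            merged.append(s)
    return merged
```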

Fix 3 (_trim_process_memory): Python's allocator does not return freed pages to
the OS after large dict allocations. Added periodic gc.collect() + malloc_trim(0)
every 50 interactions to reclaim fragmented heap back to the OS (Linux only;
silently no-ops on macOS/Windows).
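
The trim helper can look roughly like this (a sketch; malloc_trim is glibc-only, hence the broad except that turns it into a no-op elsewhere):

```python
import ctypes
import ctypes.util
import gc

def trim_process_memory() -> None:
    # Collect first so freed objects become trimmable heap space.
    gc.collect()
    try:
        libc = ctypes.CDLL(ctypes.util.find_library("c"))
        # Hand free pages at the top of the heap arena back to the OS.
        libc.malloc_trim(0)
    except (OSError, AttributeError, TypeError):
        pass  # non-glibc platform (macOS/Windows): silently no-op
```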

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nosis

- Add _get_rss_mb() to read process RSS on Linux
- Add _sanitize_counters module-level dict in sanitizers.py to track
  pre-scan skips vs full JSON parses
- Log RSS + counters every 10 interactions during recording
- Log cassette.data length after component run (should be 0)
- Wrap record_debug_run with tracemalloc and log top-20 allocations

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…mentation

json.dumps on multi-MB response bodies allocates the full escaped form at
once. Chunking to 64 KB keeps each transient allocation just above glibc's
mmap threshold so the allocator returns it to the OS immediately on free
rather than leaving freed pages stranded on the heap.
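
One way to get this effect is `json.JSONEncoder.iterencode`, which yields the serialised form piecewise; a sketch (not the branch's exact writer):

```python
import json

CHUNK = 64 * 1024  # keep each transient buffer near glibc's mmap threshold

def dump_json_chunked(obj, fp) -> None:
    # Flushing every ~64 KB avoids one multi-MB allocation for the full
    # escaped string that json.dumps() would build.
    buf, size = [], 0
    for fragment in json.JSONEncoder().iterencode(obj):
        buf.append(fragment)
        size += len(fragment)
        if size >= CHUNK:
            fp.write("".join(buf))
            buf, size = [], 0
    if buf:
        fp.write("".join(buf))
```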

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…prit

Runs tracemalloc only for 10 interactions (91-100) to capture what is
allocated and still live — without the overhead of tracing the whole
recording. Top 15 stats by line are logged as [vcr-mem] tmalloc: entries.
This should show the ~7.7 MB of objects that accumulate per 10 interactions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Every 50 interactions: gc.collect(), count all live Python objects by type,
total large bytes (>10 KB), and log cassette.data / _old_interactions /
responses lengths. This shows which type is accumulating and whether response
body bytes are staying alive. Combined with the tracemalloc window
(interactions 91-100) this should pinpoint the exact source of RSS growth.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Picks up GC diagnostics and short tracemalloc window for OOM diagnosis.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
During recording, Facebook's Connection: close header closes real_connection.sock,
causing urllib3 to create a new VCRHTTPConnection + ssl_context per interaction.
Old ssl_contexts accumulate between GC runs (~80 per 50 interactions), each holding
~100 KB of OpenSSL C-level state — the root cause of ~7.7 MB/10 interactions RSS growth.

Patch VCRConnection.is_connected to return True when a VCRFakeSocket is present,
so urllib3 reuses the existing connection object. Real reconnection still happens
via real_connection.connect() but uses the same ssl_context, capping live ssl_contexts
at pool maxsize (10) instead of letting them accumulate unboundedly.
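
The shape of the patch, using stand-ins for vcrpy's internals (the real classes are vcrpy's VCRConnection and VCRFakeSocket; this only illustrates the idea):

```python
class VCRFakeSocket:  # stand-in for vcrpy's fake socket
    pass

class VCRConnection:  # stand-in for vcrpy's connection stub
    def __init__(self, sock=None):
        self.sock = sock

def _is_connected(self):
    # Treat a VCRFakeSocket as a live connection so urllib3 reuses the
    # existing connection object instead of calling _new_conn() and
    # building a fresh ssl_context per interaction.
    return isinstance(self.sock, VCRFakeSocket)

VCRConnection.is_connected = property(_is_connected)
```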

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
urllib3/requests call release_conn() on response.raw after consuming a response
to return the connection to the pool. VCRHTTPResponse is missing this method,
so the AttributeError is silently swallowed by urllib3's PoolManager and the
connection slot is never returned — objects accumulate and RSS grows.

Add a no-op release_conn() since VCR connections are not real sockets.
Bump to 0.1.12.
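
The no-op added here is essentially the following (an illustrative stub; the real class lives in vcrpy):

```python
class VCRHTTPResponse:  # illustrative stub of the vcrpy class
    def release_conn(self) -> None:
        # VCR connections are not real sockets, so there is nothing to
        # return to a pool; defining the method is enough to stop urllib3
        # from swallowing an AttributeError on the teardown path.
        pass
```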

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
During VCR recording, the previous fix only returned True for VCRFakeSocket
(replay mode). With real connections and Connection: close responses, urllib3
still called _new_conn(), creating a new VCRHTTPConnection + real_connection
+ ssl_context per interaction. Each ssl_context loads the system CA trust
store into OpenSSL C-heap (~480 KB), matching the observed ~770 KB/interaction
RSS growth visible in production recordings but invisible to tracemalloc.

Fix: return True whenever real_connection exists, letting http.client auto-
reconnect on the next send() if the socket was closed. This reuses the
existing ssl_context (CA store loaded once) instead of reloading it per
interaction.

Also adds release_conn() no-op to VCRHTTPResponse to unblock urllib3's
connection-pool teardown path (Issue #816).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rolling tracemalloc)

- Add _get_smaps_rollup() to parse /proc/self/smaps_rollup for Private_Dirty/Anonymous breakdown
- Add _get_mallinfo() via ctypes mallinfo2 to expose glibc heap arena/in-use/free/mmap stats
- Add _get_fd_count() to track open file descriptor growth
- Replace one-time tracemalloc window (91-100) with rolling snapshots every 50 interactions using compare_to() for net-new allocation deltas
- Add ssl.SSLSocket, ssl.SSLContext, urllib3.HTTPSConnection counts in gc scan
- Add urllib3 HTTPConnectionPool introspection with queue depth
- Add sys.getallocatedblocks() Python allocator metric
- Bump version to 0.1.14

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When the component creates a new requests.Session per API query, each
session creates its own urllib3.PoolManager which in turn creates a fresh
HTTPSConnectionPool for graph.facebook.com:443.  For N interactions this
produces N pools, each with its own SSLContext (~480 KB of C-heap from the
system CA trust store).  The C-heap growth doesn't appear in tracemalloc,
isn't collected by gc.collect(), and adds ~8 MB per 10 interactions —
causing OOM around interaction 230-260 in a 256 MB container.

Fix: add _pool_reuse_patch() context manager that intercepts
PoolManager.connection_from_host at the class level during recording and
returns the first pool for any (scheme, host, port) tuple instead of
letting each new PoolManager create its own.  The patch is applied inside
the vcrpy cassette context so cached pools already carry vcrpy's stubs.
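
A hedged sketch of that context manager (the real helper is _pool_reuse_patch; the `connection_from_host` signature follows urllib3 2.x):

```python
import contextlib
from urllib3 import PoolManager

@contextlib.contextmanager
def pool_reuse_patch():
    # Cache one pool per (scheme, host, port) across *all* PoolManager
    # instances for the duration of the recording.
    cache = {}
    original = PoolManager.connection_from_host

    def cached(self, host, port=None, scheme="http", pool_kwargs=None):
        key = (scheme, host, port)
        if key not in cache:
            cache[key] = original(self, host, port, scheme,
                                  pool_kwargs=pool_kwargs)
        return cache[key]

    PoolManager.connection_from_host = cached
    try:
        yield
    finally:
        PoolManager.connection_from_host = original
        cache.clear()
```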

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ns to pool

urllib3 2.x returns VCRHTTPResponse directly from _make_request() (duck-typed
as BaseHTTPResponse), stamping _connection=VCRHTTPConnection and _pool on the
instance.  When requests calls response.raw.release_conn() the no-op lambda
was hit, so _put_conn() was never called, the pool queue stayed empty (q=0),
and _new_conn() fired for every single request.

Each _new_conn() creates a new VCRHTTPConnection + HTTPSConnection + SSLContext
(~480 KB OpenSSL C-heap from the system CA store), producing ~8 MB / 10
interactions and OOM for jobs with 500+ interactions — confirmed by logs showing
ssl_contexts=N at interaction N despite urllib3_pools=1.

Fix: replace the no-op with a real release_conn() that mirrors urllib3's own
HTTPResponse.release_conn(): calls pool._put_conn(conn) and clears _connection.
This lets the same VCRHTTPConnection (and its real_connection + ssl_context) be
reused across all requests, reducing SSL context count to O(hosts).
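
The replacement method, sketched after urllib3's own HTTPResponse.release_conn (attribute names `_connection` and `_pool` as stamped by `_make_request`):

```python
def release_conn(self) -> None:
    # Mirror urllib3's HTTPResponse.release_conn: put the connection back
    # on its pool's queue and drop our reference so it can be reused.
    conn = getattr(self, "_connection", None)
    if conn is None:
        return
    pool = getattr(self, "_pool", None)
    if pool is not None:
        pool._put_conn(conn)
    self._connection = None
```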

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… store reload per interaction

This is the commit that actually solved the OOM. All prior commits on this branch
were necessary supporting work (pool reuse, connection reuse, diagnostics) but this
is the change that stopped the ~7 MB/10-interaction RSS growth observed on platform.

Root cause:
  requests 2.32.5 does not pass ssl_context through pool_kwargs, so every connection
  created by _new_conn() has real_connection.ssl_context = None.  On every reconnect
  (Facebook APIs send Connection: close, so reconnect happens every interaction),
  _validate_conn() → conn.connect() → _ssl_wrap_socket_and_match_hostname(ssl_context=None)
  calls create_urllib3_context() + load_default_certs(), loading ~480 KB of CA cert data
  into OpenSSL's C-heap.  jemalloc never returns this C-heap to the OS even after the
  SSLContext is GC'd — so RSS grows ~480 KB per interaction = ~7 MB per 10 interactions.

Fix:
  In _pool_reuse_patch, after creating each pool, inject a single shared ssl.SSLContext
  into pool.conn_kw["ssl_context"].  All connections from _new_conn() pick it up via
  **conn_kw.  Every connect() call sees ssl_context != None, reuses it, and never calls
  load_default_certs() again.  CA store is loaded exactly once per host per recording.
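
The core of the fix, as a hedged sketch (helper names are illustrative; `ssl.create_default_context()` stands in for however the branch builds the context, and it loads the system CA store at creation time):

```python
import ssl

_SHARED_CTX = None

def _shared_ssl_context() -> ssl.SSLContext:
    # One context per process: the CA store is loaded into OpenSSL's
    # C-heap exactly once, when the context is first created.
    global _SHARED_CTX
    if _SHARED_CTX is None:
        _SHARED_CTX = ssl.create_default_context()
    return _SHARED_CTX

def inject_ssl_context(pool) -> None:
    # _new_conn() passes **pool.conn_kw to the connection class, so every
    # connection's connect() sees ssl_context != None and skips
    # create_urllib3_context() + load_default_certs().
    if pool.conn_kw.get("ssl_context") is None:
        pool.conn_kw["ssl_context"] = _shared_ssl_context()
```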

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ation

Removes: _get_rss_mb, _get_smaps_rollup, _get_mallinfo, _get_fd_count,
_trim_process_memory, _TRIM_INTERVAL, tracemalloc rolling snapshots,
gc/ssl/urllib3 object counts, [vcr-mem]/[vcr-gc]/[vcr-smaps]/[vcr-heap]/
[vcr-fd]/[vcr-pyalloc]/[vcr-tmalloc]/[vcr-pool] log lines, and the
_diag_cassette reference.

Keeps all functional fixes: pool reuse patch, shared SSLContext injection,
VCRConnection.is_connected, VCRHTTPResponse.release_conn, zero-copy reader,
JSONL streaming, chunked JSON writer.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…unters)

These were only read by the diagnostic logging blocks that were already stripped.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@matyas-jirat-keboola
Collaborator Author

Superseded by #6 (clean 2-commit history from main).

