fix(oom): prevent RSS growth during VCR recording by reusing SSLContext per pool #5

Closed
matyas-jirat-keboola wants to merge 19 commits into main from fix/oom-vcr-sanitizer-overhead

Conversation

@matyas-jirat-keboola
Collaborator

Summary

  • Root cause: Every HTTPS connection created during VCR recording called urllib3._ssl_wrap_socket_and_match_hostname() with ssl_context=None (because requests 2.32.5 never puts ssl_context into pool.conn_kw). This triggered create_urllib3_context() + load_default_certs() per interaction, allocating ~480 KB of CA cert data into the OpenSSL C-heap. jemalloc's high-watermark behavior means this memory is never returned to the OS — causing ~7 MB/10-interaction linear RSS growth.
  • Fix (149be86): Inject one shared SSLContext (with load_default_certs() pre-called) into pool.conn_kw["ssl_context"] when the pool is created. All subsequent _new_conn() connections receive it via **conn_kw, so connect() reuses the same context and never calls load_default_certs() again.
  • Additional supporting fixes already on branch: pool reuse (no pool proliferation), zero-copy response reader, JSONL streaming to avoid accumulating interactions in memory, release_conn() / is_connected fixes.
  • Diagnostic instrumentation stripped; only functional code remains.

Test plan

  • 119 unit tests pass (uv run pytest)
  • Confirmed working on Keboola platform — RSS growth eliminated during recording run with 100+ HTTP interactions

🤖 Generated with Claude Code

matyas-jirat-keboola and others added 19 commits March 5, 2026 10:49
Three targeted fixes for OOM failures at 256MB when recording debug cassettes
with hundreds of Facebook Ads/Pages interactions:

Fix 1 (DefaultSanitizer._sanitize_body): Add a quick string pre-scan before
json.loads(). For responses where no sensitive JSON field names or secret values
appear in the body, the entire JSON parse+walk is skipped. Applies to ~90% of
Facebook Ads responses (tokens appear in requests, not response bodies), dropping
per-response overhead from ~500KB to ~50KB of heap fragmentation.
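
The pre-scan can be sketched as follows (the marker list, field names, and function name are illustrative, not the component's actual API):

```python
import json

# Field names assumed sensitive for illustration; the real list lives in
# DefaultSanitizer.
SENSITIVE_MARKERS = ("access_token", "client_secret", "app_secret")

def sanitize_body(body: str) -> str:
    # Cheap O(n) substring pre-scan: if no marker can possibly appear in
    # the body, skip the JSON parse+walk entirely.
    if not any(marker in body for marker in SENSITIVE_MARKERS):
        return body  # fast path for most response bodies

    def redact(node):
        if isinstance(node, dict):
            return {k: "***" if k in SENSITIVE_MARKERS else redact(v)
                    for k, v in node.items()}
        if isinstance(node, list):
            return [redact(v) for v in node]
        return node

    return json.dumps(redact(json.loads(body)))
```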

Fix 2 (_dedup_sanitizers + merge()): record_debug_run() prepends its own
DefaultSanitizer then extends with the caller's VCR_SANITIZERS (which also
contains a DefaultSanitizer). This caused every response body to be JSON-parsed
twice. Added merge() methods to DefaultSanitizer, TokenSanitizer, HeaderSanitizer,
QueryParamSanitizer, UrlPatternSanitizer, and ResponseUrlSanitizer (which now
stores dynamic_params for merging). _dedup_sanitizers() merges same-class
sanitizers in order, halving the sanitization work per interaction.
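
A minimal sketch of the dedup pass, assuming each sanitizer class exposes the merge() method described above:

```python
def dedup_sanitizers(sanitizers):
    # Keep the first instance of each class and fold later same-class
    # instances into it via merge(), preserving the original order.
    merged, by_class = [], {}
    for s in sanitizers:
        prev = by_class.get(type(s))
        if prev is not None:
            prev.merge(s)
        else:
            by_class[type(s)] = s
            merged.append(s)
    return merged
```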

Fix 3 (_trim_process_memory): Python's allocator does not return freed pages to
the OS after large dict allocations. Added periodic gc.collect() + malloc_trim(0)
every 50 interactions to reclaim fragmented heap back to the OS (Linux only;
silently no-ops on macOS/Windows).
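
The trim helper can look roughly like this (a sketch; malloc_trim is glibc-only, hence the broad except that turns it into a no-op elsewhere):

```python
import ctypes
import ctypes.util
import gc

def trim_process_memory() -> None:
    # Collect first so freed objects become trimmable heap space.
    gc.collect()
    try:
        libc = ctypes.CDLL(ctypes.util.find_library("c"))
        # Hand free pages at the top of the heap arena back to the OS.
        libc.malloc_trim(0)
    except (OSError, AttributeError, TypeError):
        pass  # non-glibc platform (macOS/Windows): silently no-op
```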

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nosis

- Add _get_rss_mb() to read process RSS on Linux
- Add _sanitize_counters module-level dict in sanitizers.py to track
  pre-scan skips vs full JSON parses
- Log RSS + counters every 10 interactions during recording
- Log cassette.data length after component run (should be 0)
- Wrap record_debug_run with tracemalloc and log top-20 allocations

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…mentation

json.dumps on multi-MB response bodies allocates the full escaped form at
once. Chunking to 64 KB keeps each transient allocation just above glibc's
mmap threshold so the allocator returns it to the OS immediately on free
rather than leaving freed pages stranded on the heap.
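
One way to get this effect is `json.JSONEncoder.iterencode`, which yields the serialised form piecewise; a sketch (not the branch's exact writer):

```python
import json

CHUNK = 64 * 1024  # keep each transient buffer near glibc's mmap threshold

def dump_json_chunked(obj, fp) -> None:
    # Flushing every ~64 KB avoids one multi-MB allocation for the full
    # escaped string that json.dumps() would build.
    buf, size = [], 0
    for fragment in json.JSONEncoder().iterencode(obj):
        buf.append(fragment)
        size += len(fragment)
        if size >= CHUNK:
            fp.write("".join(buf))
            buf, size = [], 0
    if buf:
        fp.write("".join(buf))
```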

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…prit

Runs tracemalloc only for 10 interactions (91-100) to capture what is
allocated and still live — without the overhead of tracing the whole
recording. Top 15 stats by line are logged as [vcr-mem] tmalloc: entries.
This should show the ~7.7 MB of objects that accumulate per 10 interactions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Every 50 interactions: gc.collect(), count all live Python objects by type,
total large bytes (>10 KB), and log cassette.data / _old_interactions /
responses lengths. This shows which type is accumulating and whether response
body bytes are staying alive. Combined with the tracemalloc window
(interactions 91-100) this should pinpoint the exact source of RSS growth.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Picks up GC diagnostics and short tracemalloc window for OOM diagnosis.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
During recording, Facebook's Connection: close header closes real_connection.sock,
causing urllib3 to create a new VCRHTTPConnection + ssl_context per interaction.
Old ssl_contexts accumulate between GC runs (~80 per 50 interactions), each holding
~100 KB of OpenSSL C-level state — the root cause of ~7.7 MB/10 interactions RSS growth.

Patch VCRConnection.is_connected to return True when a VCRFakeSocket is present,
so urllib3 reuses the existing connection object. Real reconnection still happens
via real_connection.connect() but uses the same ssl_context, capping live ssl_contexts
at pool maxsize (10) instead of letting them accumulate unboundedly.
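
The shape of the patch, using stand-ins for vcrpy's internals (the real classes are vcrpy's VCRConnection and VCRFakeSocket; this only illustrates the idea):

```python
class VCRFakeSocket:  # stand-in for vcrpy's fake socket
    pass

class VCRConnection:  # stand-in for vcrpy's connection stub
    def __init__(self, sock=None):
        self.sock = sock

def _is_connected(self):
    # Treat a VCRFakeSocket as a live connection so urllib3 reuses the
    # existing connection object instead of calling _new_conn() and
    # building a fresh ssl_context per interaction.
    return isinstance(self.sock, VCRFakeSocket)

VCRConnection.is_connected = property(_is_connected)
```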

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
urllib3/requests call release_conn() on response.raw after consuming a response
to return the connection to the pool. VCRHTTPResponse is missing this method,
so the AttributeError is silently swallowed by urllib3's PoolManager and the
connection slot is never returned — objects accumulate and RSS grows.

Add a no-op release_conn() since VCR connections are not real sockets.
Bump to 0.1.12.
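
The no-op added here is essentially the following (an illustrative stub; the real class lives in vcrpy):

```python
class VCRHTTPResponse:  # illustrative stub of the vcrpy class
    def release_conn(self) -> None:
        # VCR connections are not real sockets, so there is nothing to
        # return to a pool; defining the method is enough to stop urllib3
        # from swallowing an AttributeError on the teardown path.
        pass
```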

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
During VCR recording, the previous fix only returned True for VCRFakeSocket
(replay mode). With real connections and Connection: close responses, urllib3
still called _new_conn(), creating a new VCRHTTPConnection + real_connection
+ ssl_context per interaction. Each ssl_context loads the system CA trust
store into OpenSSL C-heap (~480 KB), matching the observed ~770 KB/interaction
RSS growth visible in production recordings but invisible to tracemalloc.

Fix: return True whenever real_connection exists, letting http.client auto-
reconnect on the next send() if the socket was closed. This reuses the
existing ssl_context (CA store loaded once) instead of reloading it per
interaction.

Also adds release_conn() no-op to VCRHTTPResponse to unblock urllib3's
connection-pool teardown path (Issue #816).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rolling tracemalloc)

- Add _get_smaps_rollup() to parse /proc/self/smaps_rollup for Private_Dirty/Anonymous breakdown
- Add _get_mallinfo() via ctypes mallinfo2 to expose glibc heap arena/in-use/free/mmap stats
- Add _get_fd_count() to track open file descriptor growth
- Replace one-time tracemalloc window (91-100) with rolling snapshots every 50 interactions using compare_to() for net-new allocation deltas
- Add ssl.SSLSocket, ssl.SSLContext, urllib3.HTTPSConnection counts in gc scan
- Add urllib3 HTTPConnectionPool introspection with queue depth
- Add sys.getallocatedblocks() Python allocator metric
- Bump version to 0.1.14

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When the component creates a new requests.Session per API query, each
session creates its own urllib3.PoolManager which in turn creates a fresh
HTTPSConnectionPool for graph.facebook.com:443.  For N interactions this
produces N pools, each with its own SSLContext (~480 KB of C-heap from the
system CA trust store).  The C-heap growth doesn't appear in tracemalloc,
isn't collected by gc.collect(), and adds ~8 MB per 10 interactions —
causing OOM around interaction 230-260 in a 256 MB container.

Fix: add _pool_reuse_patch() context manager that intercepts
PoolManager.connection_from_host at the class level during recording and
returns the first pool for any (scheme, host, port) tuple instead of
letting each new PoolManager create its own.  The patch is applied inside
the vcrpy cassette context so cached pools already carry vcrpy's stubs.
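
A hedged sketch of that context manager (the real helper is _pool_reuse_patch; the `connection_from_host` signature follows urllib3 2.x):

```python
import contextlib
from urllib3 import PoolManager

@contextlib.contextmanager
def pool_reuse_patch():
    # Cache one pool per (scheme, host, port) across *all* PoolManager
    # instances for the duration of the recording.
    cache = {}
    original = PoolManager.connection_from_host

    def cached(self, host, port=None, scheme="http", pool_kwargs=None):
        key = (scheme, host, port)
        if key not in cache:
            cache[key] = original(self, host, port, scheme,
                                  pool_kwargs=pool_kwargs)
        return cache[key]

    PoolManager.connection_from_host = cached
    try:
        yield
    finally:
        PoolManager.connection_from_host = original
        cache.clear()
```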

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ns to pool

urllib3 2.x returns VCRHTTPResponse directly from _make_request() (duck-typed
as BaseHTTPResponse), stamping _connection=VCRHTTPConnection and _pool on the
instance.  When requests calls response.raw.release_conn() the no-op lambda
was hit, so _put_conn() was never called, the pool queue stayed empty (q=0),
and _new_conn() fired for every single request.

Each _new_conn() creates a new VCRHTTPConnection + HTTPSConnection + SSLContext
(~480 KB OpenSSL C-heap from the system CA store), producing ~8 MB / 10
interactions and OOM for jobs with 500+ interactions — confirmed by logs showing
ssl_contexts=N at interaction N despite urllib3_pools=1.

Fix: replace the no-op with a real release_conn() that mirrors urllib3's own
HTTPResponse.release_conn(): calls pool._put_conn(conn) and clears _connection.
This lets the same VCRHTTPConnection (and its real_connection + ssl_context) be
reused across all requests, reducing SSL context count to O(hosts).
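
The replacement method, sketched after urllib3's own HTTPResponse.release_conn (attribute names `_connection` and `_pool` as stamped by `_make_request`):

```python
def release_conn(self) -> None:
    # Mirror urllib3's HTTPResponse.release_conn: put the connection back
    # on its pool's queue and drop our reference so it can be reused.
    conn = getattr(self, "_connection", None)
    if conn is None:
        return
    pool = getattr(self, "_pool", None)
    if pool is not None:
        pool._put_conn(conn)
    self._connection = None
```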

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… store reload per interaction

This is the commit that actually solved the OOM. All prior commits on this branch
were necessary supporting work (pool reuse, connection reuse, diagnostics) but this
is the change that stopped the ~7 MB/10-interaction RSS growth observed on platform.

Root cause:
  requests 2.32.5 does not pass ssl_context through pool_kwargs, so every connection
  created by _new_conn() has real_connection.ssl_context = None.  On every reconnect
  (Facebook APIs send Connection: close, so reconnect happens every interaction),
  _validate_conn() → conn.connect() → _ssl_wrap_socket_and_match_hostname(ssl_context=None)
  calls create_urllib3_context() + load_default_certs(), loading ~480 KB of CA cert data
  into OpenSSL's C-heap.  jemalloc never returns this C-heap to the OS even after the
  SSLContext is GC'd — so RSS grows ~480 KB per interaction = ~7 MB per 10 interactions.

Fix:
  In _pool_reuse_patch, after creating each pool, inject a single shared ssl.SSLContext
  into pool.conn_kw["ssl_context"].  All connections from _new_conn() pick it up via
  **conn_kw.  Every connect() call sees ssl_context != None, reuses it, and never calls
  load_default_certs() again.  CA store is loaded exactly once per host per recording.
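
The core of the fix, as a hedged sketch (helper names are illustrative; `ssl.create_default_context()` stands in for however the branch builds the context, and it loads the system CA store at creation time):

```python
import ssl

_SHARED_CTX = None

def _shared_ssl_context() -> ssl.SSLContext:
    # One context per process: the CA store is loaded into OpenSSL's
    # C-heap exactly once, when the context is first created.
    global _SHARED_CTX
    if _SHARED_CTX is None:
        _SHARED_CTX = ssl.create_default_context()
    return _SHARED_CTX

def inject_ssl_context(pool) -> None:
    # _new_conn() passes **pool.conn_kw to the connection class, so every
    # connection's connect() sees ssl_context != None and skips
    # create_urllib3_context() + load_default_certs().
    if pool.conn_kw.get("ssl_context") is None:
        pool.conn_kw["ssl_context"] = _shared_ssl_context()
```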

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ation

Removes: _get_rss_mb, _get_smaps_rollup, _get_mallinfo, _get_fd_count,
_trim_process_memory, _TRIM_INTERVAL, tracemalloc rolling snapshots,
gc/ssl/urllib3 object counts, [vcr-mem]/[vcr-gc]/[vcr-smaps]/[vcr-heap]/
[vcr-fd]/[vcr-pyalloc]/[vcr-tmalloc]/[vcr-pool] log lines, and the
_diag_cassette reference.

Keeps all functional fixes: pool reuse patch, shared SSLContext injection,
VCRConnection.is_connected, VCRHTTPResponse.release_conn, zero-copy reader,
JSONL streaming, chunked JSON writer.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…unters)

These were only read by the diagnostic logging blocks that were already stripped.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@matyas-jirat-keboola
Collaborator Author

Superseded by #6 (clean 2-commit history from main).

