fix(oom): prevent RSS growth during VCR recording by reusing SSLContext per pool #5
Closed
matyas-jirat-keboola wants to merge 19 commits into main
Three targeted fixes for OOM failures at 256 MB when recording debug cassettes with hundreds of Facebook Ads/Pages interactions:

Fix 1 (DefaultSanitizer._sanitize_body): Add a quick string pre-scan before json.loads(). For responses where no sensitive JSON field names or secret values appear in the body, the entire JSON parse+walk is skipped. Applies to ~90% of Facebook Ads responses (tokens appear in requests, not response bodies), dropping per-response overhead from ~500 KB to ~50 KB of heap fragmentation.

Fix 2 (_dedup_sanitizers + merge()): record_debug_run() prepends its own DefaultSanitizer, then extends with the caller's VCR_SANITIZERS (which also contains a DefaultSanitizer). This caused every response body to be JSON-parsed twice. Added merge() methods to DefaultSanitizer, TokenSanitizer, HeaderSanitizer, QueryParamSanitizer, UrlPatternSanitizer, and ResponseUrlSanitizer (which now stores dynamic_params for merging). _dedup_sanitizers() merges same-class sanitizers in order, halving the sanitization work per interaction.

Fix 3 (_trim_process_memory): Python's allocator does not return freed pages to the OS after large dict allocations. Added periodic gc.collect() + malloc_trim(0) every 50 interactions to reclaim fragmented heap back to the OS (Linux only; silently no-ops on macOS/Windows).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
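The pre-scan in Fix 1 can be sketched roughly as follows. This is a minimal illustration, not the project's DefaultSanitizer: the field names, the REDACTED marker, and the function names are hypothetical, chosen only to show the idea of skipping json.loads() when no sensitive token can possibly match.

```python
import json

# Hypothetical field names and marker; the real lists live in the project's
# DefaultSanitizer and are not shown in this PR.
SENSITIVE_FIELDS = ("access_token", "client_secret")
REDACTED = "REDACTED"


def _redact(node, secrets):
    """Recursively redact sensitive keys and known secret values."""
    if isinstance(node, dict):
        return {
            key: REDACTED if key in SENSITIVE_FIELDS else _redact(value, secrets)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [_redact(item, secrets) for item in node]
    if isinstance(node, str) and node in secrets:
        return REDACTED
    return node


def sanitize_body(body, secrets=()):
    """Sanitize a JSON body, skipping the parse when nothing can match."""
    needles = tuple(SENSITIVE_FIELDS) + tuple(secrets)
    # Cheap substring scan: if no needle occurs, skip the parse and the walk.
    if not any(needle in body for needle in needles):
        return body
    try:
        parsed = json.loads(body)
    except ValueError:
        return body  # not JSON; left untouched in this sketch
    return json.dumps(_redact(parsed, set(secrets)))
```

A substring scan of the raw body would miss a key written with JSON escapes (for example a \u-escaped letter inside the key), so a production version may need a stricter check than this sketch.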
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nosis
- Add _get_rss_mb() to read process RSS on Linux
- Add _sanitize_counters module-level dict in sanitizers.py to track pre-scan skips vs full JSON parses
- Log RSS + counters every 10 interactions during recording
- Log cassette.data length after component run (should be 0)
- Wrap record_debug_run with tracemalloc and log top-20 allocations
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
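One plausible shape for an RSS reader like _get_rss_mb() is shown below. The PR does not show the actual implementation, so this sketch assumes it reads the VmRSS field from /proc/self/status; the function name and the status_path parameter are illustrative.

```python
def get_rss_mb(status_path="/proc/self/status"):
    """Return resident set size in MB, or None where /proc is unavailable."""
    try:
        with open(status_path) as status:
            for line in status:
                # The line looks like "VmRSS:      123456 kB".
                if line.startswith("VmRSS:"):
                    return int(line.split()[1]) / 1024.0
    except OSError:
        return None  # macOS/Windows, or a restricted /proc
    return None
```

Returning None instead of raising keeps the recording path unaffected on platforms without /proc.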
…mentation json.dumps on multi-MB response bodies allocates the full escaped form at once. Chunking to 64 KB keeps each transient allocation just above glibc's mmap threshold so the allocator returns it to the OS immediately on free rather than leaving freed pages stranded on the heap. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
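A minimal sketch of chunked escaping, assuming the body is a Python str that is written out as one JSON string. The function name is hypothetical, and whether the PR chunks by characters or by bytes is not stated; this version chunks by characters.

```python
import io
import json

CHUNK_CHARS = 64 * 1024  # chunk size taken from the commit message


def write_json_string(text, out, chunk_chars=CHUNK_CHARS):
    """Write `text` to `out` as a JSON string literal, one chunk at a time."""
    out.write('"')
    for start in range(0, len(text), chunk_chars):
        piece = text[start:start + chunk_chars]
        # With the default ensure_ascii=True, json.dumps escapes each
        # character independently, so escaping piecewise and stripping the
        # surrounding quotes gives the same output as escaping all at once.
        out.write(json.dumps(piece)[1:-1])
    out.write('"')
```

Only one escaped chunk is alive at any moment, so the peak transient allocation is bounded by the chunk size rather than by the size of the whole body.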
…prit Runs tracemalloc only for 10 interactions (91-100) to capture what is allocated and still live — without the overhead of tracing the whole recording. Top 15 stats by line are logged as [vcr-mem] tmalloc: entries. This should show the ~7.7 MB of objects that accumulate per 10 interactions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Every 50 interactions: gc.collect(), count all live Python objects by type, total large bytes (>10 KB), and log cassette.data / _old_interactions / responses lengths. This shows which type is accumulating and whether response body bytes are staying alive. Combined with the tracemalloc window (interactions 91-100) this should pinpoint the exact source of RSS growth. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
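The periodic object census described above might look like the following sketch. The function name and threshold constant are illustrative. One caveat is built in: bytes objects are not tracked by the garbage collector, so this version only sees large byte buffers that are directly referenced by a GC-tracked container.

```python
import gc
from collections import Counter

LARGE_BYTES = 10 * 1024  # ">10 KB" threshold from the commit message


def snapshot_live_objects():
    """Return (type-name counts, total size of large byte buffers)."""
    gc.collect()
    tracked = gc.get_objects()  # GC-tracked containers only
    counts = Counter(type(obj).__name__ for obj in tracked)

    seen = set()
    large_total = 0
    for container in tracked:
        for ref in gc.get_referents(container):
            if isinstance(ref, (bytes, bytearray)) and len(ref) > LARGE_BYTES:
                if id(ref) not in seen:  # count shared buffers once
                    seen.add(id(ref))
                    large_total += len(ref)
    return counts, large_total
```

Comparing counts.most_common() between two calls shows which type is accumulating; a growing large_total suggests response bodies are being kept alive.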
Picks up GC diagnostics and short tracemalloc window for OOM diagnosis. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
During recording, Facebook's Connection: close header closes real_connection.sock, causing urllib3 to create a new VCRHTTPConnection + ssl_context per interaction. Old ssl_contexts accumulate between GC runs (~80 per 50 interactions), each holding ~100 KB of OpenSSL C-level state — the root cause of ~7.7 MB/10 interactions RSS growth. Patch VCRConnection.is_connected to return True when a VCRFakeSocket is present, so urllib3 reuses the existing connection object. Real reconnection still happens via real_connection.connect() but uses the same ssl_context, capping live ssl_contexts at pool maxsize (10) instead of letting them accumulate unboundedly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
urllib3/requests call release_conn() on response.raw after consuming a response to return the connection to the pool. VCRHTTPResponse is missing this method, so the AttributeError is silently swallowed by urllib3's PoolManager and the connection slot is never returned — objects accumulate and RSS grows. Add a no-op release_conn() since VCR connections are not real sockets. Bump to 0.1.12. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
During VCR recording, the previous fix only returned True for VCRFakeSocket (replay mode). With real connections and Connection: close responses, urllib3 still called _new_conn(), creating a new VCRHTTPConnection + real_connection + ssl_context per interaction. Each ssl_context loads the system CA trust store into OpenSSL C-heap (~480 KB), matching the observed ~770 KB/interaction RSS growth visible in production recordings but invisible to tracemalloc.

Fix: return True whenever real_connection exists, letting http.client auto-reconnect on the next send() if the socket was closed. This reuses the existing ssl_context (CA store loaded once) instead of reloading it per interaction. Also adds release_conn() no-op to VCRHTTPResponse to unblock urllib3's connection-pool teardown path (Issue #816).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rolling tracemalloc)
- Add _get_smaps_rollup() to parse /proc/self/smaps_rollup for Private_Dirty/Anonymous breakdown
- Add _get_mallinfo() via ctypes mallinfo2 to expose glibc heap arena/in-use/free/mmap stats
- Add _get_fd_count() to track open file descriptor growth
- Replace one-time tracemalloc window (91-100) with rolling snapshots every 50 interactions using compare_to() for net-new allocation deltas
- Add ssl.SSLSocket, ssl.SSLContext, urllib3.HTTPSConnection counts in gc scan
- Add urllib3 HTTPConnectionPool introspection with queue depth
- Add sys.getallocatedblocks() Python allocator metric
- Bump version to 0.1.14
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
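The rolling-snapshot approach can be illustrated with the standard tracemalloc API. The class name and its parameters are hypothetical; the sketch only shows how compare_to() yields net-new allocations between consecutive windows.

```python
import tracemalloc


class RollingAllocDiff:
    """Report net-new Python allocations every `interval` interactions."""

    def __init__(self, interval=50, top_n=15):
        self.interval = interval
        self.top_n = top_n
        self.count = 0
        if not tracemalloc.is_tracing():
            tracemalloc.start()
        self.previous = tracemalloc.take_snapshot()

    def tick(self):
        """Call once per interaction; returns StatisticDiff entries or None."""
        self.count += 1
        if self.count % self.interval:
            return None
        current = tracemalloc.take_snapshot()
        # compare_to() sorts by the largest absolute size difference first.
        deltas = current.compare_to(self.previous, "lineno")[: self.top_n]
        self.previous = current  # the next window diffs against this one
        return deltas
```

tracemalloc only sees memory allocated through Python's allocators, which is consistent with the OpenSSL C-heap growth described in this PR staying invisible to it.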
When the component creates a new requests.Session per API query, each session creates its own urllib3.PoolManager which in turn creates a fresh HTTPSConnectionPool for graph.facebook.com:443. For N interactions this produces N pools, each with its own SSLContext (~480 KB of C-heap from the system CA trust store). The C-heap growth doesn't appear in tracemalloc, isn't collected by gc.collect(), and adds ~8 MB per 10 interactions — causing OOM around interaction 230-260 in a 256 MB container. Fix: add _pool_reuse_patch() context manager that intercepts PoolManager.connection_from_host at the class level during recording and returns the first pool for any (scheme, host, port) tuple instead of letting each new PoolManager create its own. The patch is applied inside the vcrpy cassette context so cached pools already carry vcrpy's stubs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
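The class-level interception could be sketched as below. To keep the example runnable without urllib3, it patches a stand-in FakePoolManager whose connection_from_host signature mirrors urllib3's PoolManager.connection_from_host(host, port, scheme, pool_kwargs). The share_pools name and the stand-in class are hypothetical, not the PR's _pool_reuse_patch().

```python
import contextlib


@contextlib.contextmanager
def share_pools(manager_cls, method_name="connection_from_host"):
    """Make every instance of manager_cls share one pool per (scheme, host, port)."""
    original = getattr(manager_cls, method_name)
    cache = {}

    def patched(self, host, port=None, scheme="http", pool_kwargs=None):
        key = (scheme, host, port)
        if key not in cache:
            # First request for this key: let the real method build the pool.
            cache[key] = original(self, host, port, scheme, pool_kwargs)
        return cache[key]

    setattr(manager_cls, method_name, patched)
    try:
        yield cache
    finally:
        setattr(manager_cls, method_name, original)  # always undo the patch


class FakePoolManager:
    """Stand-in for urllib3.PoolManager so the sketch runs without urllib3."""

    def connection_from_host(self, host, port=None, scheme="http", pool_kwargs=None):
        return object()  # a fresh "pool" on every call, like one per session
```

Because the patch lives on the class rather than on an instance, a new manager created for each requests.Session still resolves to the cached pool while the context is active.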
…ns to pool urllib3 2.x returns VCRHTTPResponse directly from _make_request() (duck-typed as BaseHTTPResponse), stamping _connection=VCRHTTPConnection and _pool on the instance. When requests calls response.raw.release_conn() the no-op lambda was hit, so _put_conn() was never called, the pool queue stayed empty (q=0), and _new_conn() fired for every single request. Each _new_conn() creates a new VCRHTTPConnection + HTTPSConnection + SSLContext (~480 KB OpenSSL C-heap from the system CA store), producing ~8 MB / 10 interactions and OOM for jobs with 500+ interactions — confirmed by logs showing ssl_contexts=N at interaction N despite urllib3_pools=1. Fix: replace the no-op with a real release_conn() that mirrors urllib3's own HTTPResponse.release_conn(): calls pool._put_conn(conn) and clears _connection. This lets the same VCRHTTPConnection (and its real_connection + ssl_context) be reused across all requests, reducing SSL context count to O(hosts). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
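The difference between the no-op and a real release_conn() can be shown with stand-in classes. FakePool and RecordedResponse are hypothetical; only the _pool, _connection, and _put_conn() names come from the commit message.

```python
class FakePool:
    """Stand-in for a urllib3 connection pool; tracks returned connections."""

    def __init__(self):
        self.idle = []

    def _put_conn(self, conn):
        self.idle.append(conn)


class RecordedResponse:
    """Stand-in for a response stamped with _pool and _connection."""

    def __init__(self, pool, connection):
        self._pool = pool
        self._connection = connection

    def release_conn(self):
        # Nothing to do if the connection was already released.
        if self._pool is None or self._connection is None:
            return
        self._pool._put_conn(self._connection)  # make it reusable
        self._connection = None                 # guard against double release
```

With a no-op in place of this method the pool's queue never refills, which matches the q=0 symptom described above.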
… store reload per interaction

This is the commit that actually solved the OOM. All prior commits on this branch were necessary supporting work (pool reuse, connection reuse, diagnostics), but this is the change that stopped the ~7 MB/10-interaction RSS growth observed on platform.

Root cause: requests 2.32.5 does not pass ssl_context through pool_kwargs, so every connection created by _new_conn() has real_connection.ssl_context = None. On every reconnect (Facebook APIs send Connection: close, so a reconnect happens on every interaction), _validate_conn() → conn.connect() → _ssl_wrap_socket_and_match_hostname(ssl_context=None) calls create_urllib3_context() + load_default_certs(), loading ~480 KB of CA cert data into OpenSSL's C-heap. jemalloc never returns this C-heap to the OS even after the SSLContext is GC'd, so RSS grows by ~480 KB per interaction from CA data alone (~4.8 MB per 10 interactions), which accounts for most of the ~7 MB per 10 interactions observed.

Fix: In _pool_reuse_patch, after creating each pool, inject a single shared ssl.SSLContext into pool.conn_kw["ssl_context"]. All connections from _new_conn() pick it up via **conn_kw. Every connect() call sees ssl_context != None, reuses it, and never calls load_default_certs() again. The CA store is loaded exactly once per host per recording.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
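The injection step can be illustrated with the standard library alone. This sketch assumes a plain dict standing in for pool.conn_kw and uses ssl.create_default_context() in place of urllib3's create_urllib3_context() + load_default_certs(); both helper names are hypothetical.

```python
import ssl

_shared_contexts = {}  # one SSLContext per host for the whole recording


def shared_ssl_context(host):
    """Return the cached SSLContext for `host`, creating it on first use."""
    context = _shared_contexts.get(host)
    if context is None:
        # create_default_context() loads the default CA certificates, the
        # expensive step that should happen once rather than per connection.
        context = ssl.create_default_context()
        _shared_contexts[host] = context
    return context


def inject_ssl_context(conn_kw, host):
    """Add the shared context to connection kwargs unless one is already set."""
    if conn_kw.get("ssl_context") is None:
        conn_kw["ssl_context"] = shared_ssl_context(host)
    return conn_kw
```

Every connection built from these kwargs then receives a non-None ssl_context, so the connect path has no reason to build a new context and reload the CA store.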
…ation

Removes: _get_rss_mb, _get_smaps_rollup, _get_mallinfo, _get_fd_count, _trim_process_memory, _TRIM_INTERVAL, tracemalloc rolling snapshots, gc/ssl/urllib3 object counts, [vcr-mem]/[vcr-gc]/[vcr-smaps]/[vcr-heap]/[vcr-fd]/[vcr-pyalloc]/[vcr-tmalloc]/[vcr-pool] log lines, and the _diag_cassette reference.

Keeps all functional fixes: pool reuse patch, shared SSLContext injection, VCRConnection.is_connected, VCRHTTPResponse.release_conn, zero-copy reader, JSONL streaming, chunked JSON writer.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…unters) These were only read by the diagnostic logging blocks that were already stripped. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Superseded by #6 (clean 2-commit history from main).
Summary
- Root cause: every reconnect called urllib3._ssl_wrap_socket_and_match_hostname() with ssl_context=None (because requests 2.32.5 never puts ssl_context into pool.conn_kw). This triggered create_urllib3_context() + load_default_certs() per interaction, allocating ~480 KB of CA cert data into the OpenSSL C-heap. jemalloc's high-watermark behavior means this memory is never returned to the OS, causing ~7 MB/10-interaction linear RSS growth.
- Fix (149be86): Inject one shared SSLContext (with load_default_certs() pre-called) into pool.conn_kw["ssl_context"] when the pool is created. All subsequent _new_conn() connections receive it via **conn_kw, so connect() reuses the same context and never calls load_default_certs() again.
- … release_conn() / is_connected fixes.

Test plan
- … (uv run pytest)

🤖 Generated with Claude Code