perf: add SSAR cache, event log tail reads, and agent status cache #1026
syntaxsdev wants to merge 5 commits into ambient-code:main
Conversation
> Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
Walkthrough
Introduces two in-process TTL caches (an SSAR decision cache and a session agent-status cache), integrates the SSAR cache into request middleware to short-circuit authorization, bounds event replay by tail-reading large JSONL files, and adds benchmarks/tests for both caches and agent-status derivation.
Sequence Diagram(s): (mermaid diagram omitted)
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 3 passed
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
✨ Finishing Touches: 🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
components/backend/handlers/sessions.go (1)
403-427: ⚠️ Potential issue | 🟠 Major — Include the namespace in the agent-status cache key.
`metadata.name` is only namespace-scoped, but this map is process-global. Two projects with the same session name will alias the same entry and can receive each other's cached `agentStatus`.
🔧 Suggested direction
```diff
 name, _ := session.Metadata["name"].(string)
+namespace, _ := session.Metadata["namespace"].(string)
 if name == "" {
     return
 }
+cacheKey := namespace + "/" + name
 // Check cache
 agentStatusCache.RLock()
-if entry, ok := agentStatusCache.entries[name]; ok && time.Now().Before(entry.expiresAt) {
+if entry, ok := agentStatusCache.entries[cacheKey]; ok && time.Now().Before(entry.expiresAt) {
     agentStatusCache.RUnlock()
     if entry.status != "" {
         session.Status.AgentStatus = types.StringPtr(entry.status)
     }
     return
 }
 agentStatusCache.RUnlock()
@@
 agentStatusCache.Lock()
-agentStatusCache.entries[name] = agentStatusCacheEntry{
+agentStatusCache.entries[cacheKey] = agentStatusCacheEntry{
     status:    derived,
     expiresAt: time.Now().Add(agentStatusCacheTTL),
 }
 agentStatusCache.Unlock()
```

As per coding guidelines: Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/backend/handlers/sessions.go` around lines 403 - 427, The cache currently keys agentStatusCache.entries by the session name (session.Metadata["name"]) which is namespace-scoped and causes cross-project collisions; change the key to include the namespace (e.g., build key := fmt.Sprintf("%s/%s", namespace, name) where namespace := session.Metadata["namespace"].(string) or fallback to ""), then use that composite key everywhere the map is accessed (both the RLock read: agentStatusCache.entries[...] and the Lock write that sets agentStatusCache.entries[...]) and when deriving/storing the entry (keep using DeriveAgentStatusFromEvents(name) if it expects just name, but store under the composite key). Ensure any checks that read entry.status and set session.Status.AgentStatus use the composite key as well.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@components/backend/handlers/middleware.go`:
- Around line 318-325: The cached-allowed branch (globalSSARCache.check
returning cachedAllowed==true) is short-circuiting SSAR validation and lets
expired/revoked tokens pass; update the middleware handling around
cacheKey/cachedAllowed to still validate the incoming token (e.g., call the same
SSAR/introspect flow used for cache misses or check a stored token expiry/active
flag) before allowing access. Specifically, modify the logic in middleware.go
where globalSSARCache.check(cacheKey) is used: if found && cachedAllowed then
perform a token validity check (or verify stored expiry/nonce) and only proceed
to skip the full SSAR call when that token check succeeds; alternatively, change
the cache payload to include token expiry/active state and verify it here before
returning allowed. Ensure you reference and reuse the existing SSAR/token
validation functions used elsewhere so handlers (e.g., sessions proxy checks)
still rely on non-nil clients but only after token validity is confirmed.
In `@components/backend/handlers/sessions.go`:
- Around line 381-384: The TTL cache defined by agentStatusCache (with entries
map[string]agentStatusCacheEntry) never evicts expired entries, which leaks
memory; add an expiration field to agentStatusCacheEntry if not present and
implement a periodic sweeper goroutine (use time.Ticker) that takes the cache's
RWMutex, iterates entries, and deletes those whose expiry has passed, and ensure
the sweeper can be stopped on shutdown; alternatively ensure setters/readers
(functions that touch agentStatusCache) update/remove entries atomically when
sessions complete or are deleted so stale keys are removed.
In `@components/backend/handlers/ssar_cache.go`:
- Around line 43-58: In ssarCache.check, avoid deleting a potentially refreshed
entry by re-checking the map under the write lock: when
time.Now().After(entry.expiresAt) is true, acquire c.mu.Lock(), reload entry :=
c.entries[key] (and ok), verify that ok and that entry.expiresAt is still
expired before deleting; if the reloaded entry is present and not expired, use
its allowed value and return (don’t delete it). This ensures the concurrent
store that refreshed the key between RUnlock and Lock doesn’t get removed.
In `@components/backend/websocket/agui_store.go`:
- Around line 269-276: Add a clear truncation notice when tail-reading large
logs: after computing offset := fileSize - replayMaxTailBytes and successfully
seeking, compute and log how many bytes (and optionally approximate events by
counting newline-delimited records skipped) were omitted for sessionID so
operators see "skipped X bytes (~Y events)" along with sessionID and
replayMaxTailBytes; if counting events is expensive, at minimum log the skipped
byte count and include this behavior in the function comment above the tail-read
block referencing replayMaxTailBytes, readJSONLFile, and the f.Seek fallback so
callers/operators know active runs may miss early events like RUN_STARTED.
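The minimum version of the suggested notice — logging only the skipped byte count — might look like the following; the function name and message wording are hypothetical:

```go
package main

import (
	"fmt"
	"log"
)

// tailTruncationNotice reports how many bytes were skipped by a
// tail-only read, so operators can see that early events such as
// RUN_STARTED may be missing from a replay. Returns the skipped count.
func tailTruncationNotice(sessionID string, fileSize, tailBytes int64) int64 {
	skipped := fileSize - tailBytes
	if skipped > 0 {
		log.Printf("AGUI Store: %s: replaying last %d bytes only, skipped %d earlier bytes (early events may be missing)",
			sessionID, tailBytes, skipped)
	}
	return skipped
}

func main() {
	// 10 MiB file, 2 MiB tail window: 8 MiB of earlier events skipped.
	fmt.Println(tailTruncationNotice("sess-1", 10<<20, 2<<20))
}
```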
- Around line 282-297: The tail-scan loop currently ignores scanner errors and
JSON unmarshal failures; after the for scanner.Scan() loop check scanner.Err()
and propagate it (return it or wrap and return) so I/O errors in tail mode are
not silenced, and change the json.Unmarshal handling to log malformed JSON lines
(use the same log.Printf("Export: Skipping malformed JSON line: %v", err) style
as readJSONLFile) instead of silently dropping them; update the function's
return path/signature if needed so the events slice and any scanner error are
returned to the caller (refer to the scanner variable, the json.Unmarshal call,
the events slice and the existing readJSONLFile behavior).
- Around line 263-267: The small-file fast path opens a file into variable f and
then calls readJSONLFile(path) which re-opens the same path; fix by reusing the
existing open handle f instead of double-opening: either add/rename a helper
like readJSONLFromReader(io.Reader) and call it with f when fileSize <=
replayMaxTailBytes, or close f immediately and then call readJSONLFile(path);
update the branch that checks replayMaxTailBytes to use f (or close it) and
remove the second open to avoid the redundant open on the same path.
---
Outside diff comments:
In `@components/backend/handlers/sessions.go`:
- Around line 403-427: The cache currently keys agentStatusCache.entries by the
session name (session.Metadata["name"]) which is namespace-scoped and causes
cross-project collisions; change the key to include the namespace (e.g., build
key := fmt.Sprintf("%s/%s", namespace, name) where namespace :=
session.Metadata["namespace"].(string) or fallback to ""), then use that
composite key everywhere the map is accessed (both the RLock read:
agentStatusCache.entries[...] and the Lock write that sets
agentStatusCache.entries[...]) and when deriving/storing the entry (keep using
DeriveAgentStatusFromEvents(name) if it expects just name, but store under the
composite key). Ensure any checks that read entry.status and set
session.Status.AgentStatus use the composite key as well.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: f8d9901a-3860-4a62-af54-0d17e5e61ca3
📒 Files selected for processing (4)
- components/backend/handlers/middleware.go
- components/backend/handlers/sessions.go
- components/backend/handlers/ssar_cache.go
- components/backend/websocket/agui_store.go
```go
if cachedAllowed, found := globalSSARCache.check(cacheKey); found {
    if !cachedAllowed {
        c.JSON(http.StatusForbidden, gin.H{"error": "Unauthorized to access project"})
        c.Abort()
        return
    }
    // Cache hit — allowed, skip SSAR call
} else {
```
Cached Allowed=true decisions are extending token validity.
On a cache hit this path makes no API server call, and the production client setup in Lines 91-104 only constructs clients. Handlers such as components/backend/handlers/sessions.go, Lines 1848-1854 and Lines 2534-2540, only check for non-nil clients before proxying, so an expired or revoked token can still pass middleware and reach protected paths until the cache entry expires.
As per coding guidelines, **: Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.
Also applies to: 348-349
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@components/backend/handlers/middleware.go` around lines 318 - 325, The
cached-allowed branch (globalSSARCache.check returning cachedAllowed==true) is
short-circuiting SSAR validation and lets expired/revoked tokens pass; update
the middleware handling around cacheKey/cachedAllowed to still validate the
incoming token (e.g., call the same SSAR/introspect flow used for cache misses
or check a stored token expiry/active flag) before allowing access.
Specifically, modify the logic in middleware.go where
globalSSARCache.check(cacheKey) is used: if found && cachedAllowed then perform
a token validity check (or verify stored expiry/nonce) and only proceed to skip
the full SSAR call when that token check succeeds; alternatively, change the
cache payload to include token expiry/active state and verify it here before
returning allowed. Ensure you reference and reuse the existing SSAR/token
validation functions used elsewhere so handlers (e.g., sessions proxy checks)
still rely on non-nil clients but only after token validity is confirmed.
The 30s TTL is an intentional trade-off — every auth cache (including K8s's own token cache) has this window. A 30s delay on revocation is acceptable for the substantial reduction in API server load. The alternative is a full SSAR call on every single request, which was the performance bottleneck this PR addresses.
```go
// Large file — seek to tail to bound reconnect latency.
log.Printf("AGUI Store: large event log for %s (%.1f MB), reading tail only", sessionID, float64(fileSize)/(1024*1024))
offset := fileSize - replayMaxTailBytes
if _, err := f.Seek(offset, 0); err != nil {
    log.Printf("AGUI Store: seek failed for %s: %v, falling back to full read", sessionID, err)
    events, _ := readJSONLFile(path)
    return events
}
```
🧹 Nitpick | 🔵 Trivial
Consider documenting the truncation impact on active runs.
For large active runs, clients receive only tail events and may miss critical early events (e.g., RUN_STARTED, initial state). The SSE handler in agui_proxy.go sends these directly to clients for state reconstruction without indicating truncation.
Finished runs should be compacted and small, so this primarily affects active runs with very large logs. Consider either:
- Adding a log message noting how many events were skipped, or
- Documenting this trade-off explicitly in the function comment
This helps operators understand reconnect behavior when debugging client state issues.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@components/backend/websocket/agui_store.go` around lines 269 - 276, Add a
clear truncation notice when tail-reading large logs: after computing offset :=
fileSize - replayMaxTailBytes and successfully seeking, compute and log how many
bytes (and optionally approximate events by counting newline-delimited records
skipped) were omitted for sessionID so operators see "skipped X bytes (~Y
events)" along with sessionID and replayMaxTailBytes; if counting events is
expensive, at minimum log the skipped byte count and include this behavior in
the function comment above the tail-read block referencing replayMaxTailBytes,
readJSONLFile, and the f.Seek fallback so callers/operators know active runs may
miss early events like RUN_STARTED.
The code already logs when tail-only reads happen (line 269). Finished runs are compacted to snapshot-only events and will be small. Active runs with truly enormous logs are an edge case — the truncation is a bounded-latency trade-off, not data loss.
```go
for scanner.Scan() {
    line := scanner.Bytes()
    if len(line) == 0 {
        continue
    }
    // Skip the first line — likely partial from the seek
    if first {
        first = false
        continue
    }
    var evt map[string]interface{}
    if err := json.Unmarshal(line, &evt); err == nil {
        events = append(events, evt)
    }
}
return events
```
Scanner errors are silently ignored in tail-scan mode.
The loop exits without checking scanner.Err(). If the scanner encounters an I/O error mid-read, the function returns partial data without any indication of failure. The existing readJSONLFile properly returns scanner.Err() to propagate such issues.
Also, JSON unmarshal failures are silently ignored here, whereas readJSONLFile logs them with log.Printf("Export: Skipping malformed JSON line: %v", err). This behavioral divergence makes debugging production issues harder.
Proposed fix to add error handling
```diff
 for scanner.Scan() {
     line := scanner.Bytes()
     if len(line) == 0 {
         continue
     }
     // Skip the first line — likely partial from the seek
     if first {
         first = false
         continue
     }
     var evt map[string]interface{}
-    if err := json.Unmarshal(line, &evt); err == nil {
-        events = append(events, evt)
-    }
+    if err := json.Unmarshal(line, &evt); err != nil {
+        log.Printf("AGUI Store: skipping malformed JSON line in tail read: %v", err)
+        continue
+    }
+    events = append(events, evt)
 }
+if err := scanner.Err(); err != nil {
+    log.Printf("AGUI Store: scanner error during tail read for %s: %v", sessionID, err)
+}
 return events
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```go
for scanner.Scan() {
    line := scanner.Bytes()
    if len(line) == 0 {
        continue
    }
    // Skip the first line — likely partial from the seek
    if first {
        first = false
        continue
    }
    var evt map[string]interface{}
    if err := json.Unmarshal(line, &evt); err != nil {
        log.Printf("AGUI Store: skipping malformed JSON line in tail read: %v", err)
        continue
    }
    events = append(events, evt)
}
if err := scanner.Err(); err != nil {
    log.Printf("AGUI Store: scanner error during tail read for %s: %v", sessionID, err)
}
return events
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@components/backend/websocket/agui_store.go` around lines 282 - 297, The
tail-scan loop currently ignores scanner errors and JSON unmarshal failures;
after the for scanner.Scan() loop check scanner.Err() and propagate it (return
it or wrap and return) so I/O errors in tail mode are not silenced, and change
the json.Unmarshal handling to log malformed JSON lines (use the same
log.Printf("Export: Skipping malformed JSON line: %v", err) style as
readJSONLFile) instead of silently dropping them; update the function's return
path/signature if needed so the events slice and any scanner error are returned
to the caller (refer to the scanner variable, the json.Unmarshal call, the
events slice and the existing readJSONLFile behavior).
Already addressed — scanner.Err() is checked and logged at lines 312-314, and JSON unmarshal failures are logged at line 307. Both error paths are handled.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@components/backend/handlers/agent_status_bench_test.go`:
- Around line 165-182: The benchmark uses the same session name
("bench-session") for all entries so agentStatusCache (keyed by session name)
gets populated on the first call and the remaining enrichAgentStatus calls hit
the cache; change the synthetic sessions in the sessions slice so each
AgenticSession.Metadata["name"] is unique (e.g., append the index to
"bench-session") and ensure any stub log lookup/mapping used by
enrichAgentStatus is populated for each distinct name (or alternatively change
the test's stub to map each synthetic name to a corresponding log file) so the
uncached benchmark truly measures cold-cache behavior for each session.
- Around line 143-149: The benchmark shares pointers to the same
types.AgenticSession values across b.RunParallel workers causing data races
because enrichAgentStatus mutates session.Status.AgentStatus; fix it by ensuring
each worker operates on a unique session copy — inside the b.RunParallel
closure, before calling enrichAgentStatus, copy the session value from the
sessions slice into a local variable (e.g., value copy of *sessions[i%20]) and
pass a pointer to that local copy to enrichAgentStatus so no two goroutines
mutate the same underlying struct.
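The race fix this prompt describes — value-copying the shared session inside each `RunParallel` worker — can be sketched with simplified stand-in types (the real `types.AgenticSession` and `enrichAgentStatus` differ):

```go
package main

import (
	"fmt"
	"testing"
)

// Simplified stand-in for types.AgenticSession.
type AgenticSession struct {
	Name        string
	AgentStatus string
}

func enrichAgentStatus(s *AgenticSession) { s.AgentStatus = "running" }

func main() {
	sessions := make([]*AgenticSession, 20)
	for i := range sessions {
		sessions[i] = &AgenticSession{Name: fmt.Sprintf("bench-session-%d", i)}
	}
	res := testing.Benchmark(func(b *testing.B) {
		b.RunParallel(func(pb *testing.PB) {
			i := 0
			for pb.Next() {
				// Value copy per iteration: each goroutine mutates its
				// own copy, never the shared struct.
				local := *sessions[i%len(sessions)]
				enrichAgentStatus(&local)
				i++
			}
		})
	})
	fmt.Println(res.N > 0)
}
```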
In `@components/backend/handlers/ssar_cache.go`:
- Around line 21-22: The comment for ssarCacheMaxSize incorrectly says "LRU
eviction" while the current implementation performs arbitrary/random eviction by
iterating a Go map; update the comment to state "random/unspecified eviction"
(or "map-iteration-based eviction") to match behavior, and if you prefer true
LRU replace the eviction logic with an LRU structure (e.g., implement
container/list + map or use hashicorp/golang-lru) in the cache code paths that
perform eviction (the block that iterates the map around lines referenced in the
review); ensure any references to "LRU" in comments or docstrings for
ssarCacheMaxSize or the eviction routine are changed accordingly.
- Around line 74-95: checkAccess() currently always calls the Kubernetes
SelfSubjectAccessReview API, bypassing the shared globalSSARCache; update
checkAccess to first build a deterministic cache key (e.g., user identifiers +
verb + resource + namespace) and check globalSSARCache.get(key) and use that
result if present, only performing the API call on a cache miss and then storing
the boolean result with globalSSARCache.store(key, allowed). Locate
checkAccess() in agui_proxy.go and use the existing ssarCache methods (and
globalSSARCache) to ensure thread-safe reads/writes and preserve the existing
30s TTL behavior; do not change the cache eviction logic in ssar_cache.store.
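The check-then-call-then-store pattern this prompt asks for, stripped to its essentials; the `doSSAR` stub and map-backed cache stand in for the real Kubernetes SelfSubjectAccessReview call and `globalSSARCache`:

```go
package main

import "fmt"

// Stub standing in for the Kubernetes SelfSubjectAccessReview call.
var ssarCalls int

func doSSAR(key string) bool { ssarCalls++; return true }

// Stand-in for globalSSARCache; the real cache adds a mutex and TTL.
var cacheStore = map[string]bool{}

// cachedCheckAccess consults the cache first and only falls through to
// the API server on a miss, then stores the decision.
func cachedCheckAccess(key string) bool {
	if allowed, ok := cacheStore[key]; ok {
		return allowed // cache hit: skip the API round-trip
	}
	allowed := doSSAR(key)
	cacheStore[key] = allowed // real code stores with a 30s TTL
	return allowed
}

func main() {
	cachedCheckAccess("user|get|projects|ns1")
	cachedCheckAccess("user|get|projects|ns1")
	fmt.Println(ssarCalls) // only the miss reached the API server
}
```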
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 40b96d78-b6b5-498e-8f5f-d8b59c5e5c49
📒 Files selected for processing (3)
- components/backend/handlers/agent_status_bench_test.go
- components/backend/handlers/ssar_cache.go
- components/backend/handlers/ssar_cache_bench_test.go
Caution: Failed to replace (edit) comment. This is likely due to insufficient permissions or the comment being deleted.
Actionable comments posted: 2
♻️ Duplicate comments (2)
components/backend/handlers/ssar_cache.go (1)
21-22: ⚠️ Potential issue | 🟠 Major — Eviction strategy can introduce latency spikes on hot auth path.
At Line 90, capacity eviction deletes ~50% of entries while holding the global write lock. This is a large blocking section for both `check()` and `store()` under load, causing latency cliffs right when the cache is hottest. Also, Line 21 still describes this as LRU, but the current behavior is map-iteration-based eviction.
As per coding guidelines: Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.
Also applies to: 89-98
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/backend/handlers/ssar_cache.go` around lines 21 - 22, The current eviction in the SSAR cache (ssarCacheMaxSize) iterates the entire map and deletes ~50% of entries while holding the global write lock inside store()/check(), causing blocking and latency spikes; change the strategy to avoid long-held locks by either implementing a proper LRU (use a doubly-linked list + map and update on access inside short critical sections) or perform eviction outside the hot write path (mark keys to evict, release the lock, then do deletions or have a background goroutine drain an eviction queue); ensure check() and store() only take the mutex briefly to update/access the map and list and move eviction work to a non-blocking path.
components/backend/websocket/agui_store.go (1)
277-301: ⚠️ Potential issue | 🟠 Major — Don’t treat scanner failures as a successful replay.
Both scan paths only log `scanner.Err()` and still return whatever was parsed so far. That turns a read failure into a truncated replay, so reconnecting clients can rebuild from incomplete state without any signal to the caller. Please push the scanner loop behind a helper that returns `([]map[string]interface{}, error)` so the small-file and tail paths can fail or fall back consistently.
Also applies to: 304-324
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/backend/websocket/agui_store.go` around lines 277 - 301, Extract the scanner loop into a helper like parseEventsFromScanner(sc *bufio.Scanner) ([]map[string]interface{}, error) that performs the first-line skip, JSON unmarshal, appends events, and returns scanner.Err() if any scan error occurred (instead of just logging and returning partial results); replace the inline scanner code in both the small-file and tail paths with calls to this helper and have the callers handle/log the returned error and decide whether to fail or fallback so scanner failures are not treated as successful replays.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@components/backend/handlers/ssar_cache.go`:
- Around line 85-88: The store method on ssarCache is acquiring the mutex and
writing entries even when caching is disabled; update ssarCache.store to check
the global CacheDisabled flag (or the cache's enabled state) and return
immediately before acquiring c.mu (i.e., do an early-return if CacheDisabled) so
no lock or map write occurs when disabled; apply the same early-return pattern
to the similar methods referenced around lines 101-104 to avoid unnecessary
mutex/map overhead when globalSSARCache.store is called unconditionally by
middleware.
In `@components/backend/websocket/agui_store.go`:
- Around line 217-221: replayMaxTailBytes being a fixed 2MB can cause valid
JSONL records (up to scannerMaxLineSize) to be clipped; change the tail window
to be derived from scannerMaxLineSize (e.g., tailWindow :=
max(replayMaxTailBytes, scannerMaxLineSize)) so the window always can contain
the largest record. Also update the replay logic around the seek/skip (the code
using offset and the first boolean) to detect whether the seek landed mid-record
and only discard the first non-empty line when the seek actually started inside
a record (not when offset is on a record boundary or on the newline). Locate and
modify the constants and the replay loop that reference replayMaxTailBytes,
scannerMaxLineSize, offset and first (including the similar logic at the other
occurrence around the 270–289 region).
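The window-sizing part of this prompt reduces to a one-line max; the `scannerMaxLineSize` value below is an assumption (the real constant may differ):

```go
package main

import "fmt"

const (
	replayMaxTailBytes = 2 << 20 // 2 MB, as in the PR
	scannerMaxLineSize = 4 << 20 // assumed value for illustration
)

// tailWindow sizes the tail read so it can always hold the largest
// possible record, per the review suggestion.
func tailWindow() int64 {
	if scannerMaxLineSize > replayMaxTailBytes {
		return scannerMaxLineSize
	}
	return replayMaxTailBytes
}

func main() {
	fmt.Println(tailWindow() >= scannerMaxLineSize)
}
```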
---
Duplicate comments:
In `@components/backend/handlers/ssar_cache.go`:
- Around line 21-22: The current eviction in the SSAR cache (ssarCacheMaxSize)
iterates the entire map and deletes ~50% of entries while holding the global
write lock inside store()/check(), causing blocking and latency spikes; change
the strategy to avoid long-held locks by either implementing a proper LRU (use a
doubly-linked list + map and update on access inside short critical sections) or
perform eviction outside the hot write path (mark keys to evict, release the
lock, then do deletions or have a background goroutine drain an eviction queue);
ensure check() and store() only take the mutex briefly to update/access the map
and list and move eviction work to a non-blocking path.
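The doubly-linked-list-plus-map LRU the prompt mentions can be sketched as below. Each `Get`/`Put` touches only a few pointers and eviction removes exactly one entry, keeping critical sections short; a real drop-in would wrap these methods in the cache's existing mutex:

```go
package main

import (
	"container/list"
	"fmt"
)

type kv struct {
	key     string
	allowed bool
}

// lruCache: map for O(1) lookup, list for recency order.
type lruCache struct {
	cap   int
	ll    *list.List
	items map[string]*list.Element
}

func newLRU(capacity int) *lruCache {
	return &lruCache{cap: capacity, ll: list.New(), items: map[string]*list.Element{}}
}

func (c *lruCache) Get(key string) (bool, bool) {
	if el, ok := c.items[key]; ok {
		c.ll.MoveToFront(el) // touch: mark most recently used
		return el.Value.(kv).allowed, true
	}
	return false, false
}

func (c *lruCache) Put(key string, allowed bool) {
	if el, ok := c.items[key]; ok {
		el.Value = kv{key, allowed}
		c.ll.MoveToFront(el)
		return
	}
	if c.ll.Len() >= c.cap {
		oldest := c.ll.Back() // evict exactly one entry
		c.ll.Remove(oldest)
		delete(c.items, oldest.Value.(kv).key)
	}
	c.items[key] = c.ll.PushFront(kv{key, allowed})
}

func main() {
	c := newLRU(2)
	c.Put("a", true)
	c.Put("b", false)
	c.Get("a") // touch "a" so "b" becomes the LRU victim
	c.Put("c", true)
	_, okB := c.Get("b")
	fmt.Println(okB) // false: "b" was evicted
}
```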
In `@components/backend/websocket/agui_store.go`:
- Around line 277-301: Extract the scanner loop into a helper like
parseEventsFromScanner(sc *bufio.Scanner) ([]map[string]interface{}, error) that
performs the first-line skip, JSON unmarshal, appends events, and returns
scanner.Err() if any scan error occurred (instead of just logging and returning
partial results); replace the inline scanner code in both the small-file and
tail paths with calls to this helper and have the callers handle/log the
returned error and decide whether to fail or fallback so scanner failures are
not treated as successful replays.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 7010aab6-376c-465b-a26f-4947a66c4039
📒 Files selected for processing (3)
- components/backend/handlers/sessions.go
- components/backend/handlers/ssar_cache.go
- components/backend/websocket/agui_store.go
Actionable comments posted: 1
♻️ Duplicate comments (1)
components/backend/handlers/ssar_cache.go (1)
21-22: ⚠️ Potential issue | 🟠 Major — Evicting half the map under a write lock can stall hot-path checks at capacity.
At capacity, this path performs thousands of deletes while holding `c.mu.Lock()`, blocking concurrent readers/writers and creating latency spikes. Also, the “LRU eviction” comment does not match behavior.
🔧 Minimal mitigation

```diff
-// ssarCacheMaxSize is the maximum number of entries before LRU eviction.
+// ssarCacheMaxSize is the maximum number of entries before map-iteration eviction.
 const ssarCacheMaxSize = 10000

 // Evict if at capacity (simple: clear half the entries)
 if len(c.entries) >= ssarCacheMaxSize {
-    count := 0
-    for k := range c.entries {
-        delete(c.entries, k)
-        count++
-        if count >= ssarCacheMaxSize/2 {
-            break
-        }
-    }
+    for k := range c.entries {
+        delete(c.entries, k) // remove one to keep lock hold bounded
+        break
+    }
 }
```

As per coding guidelines: Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.
Also applies to: 92-101
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/backend/handlers/ssar_cache.go` around lines 21 - 22, The current capacity-handling deletes thousands of map entries while holding c.mu.Lock() (refer to ssarCacheMaxSize and the eviction code around c.mu.Lock()/lines ~92-101), which stalls hot-path readers/writers and isn't true LRU; change the design to avoid bulk deletes under the write lock by (a) checking capacity under a read lock and immediately returning or marking the cache as over-capacity, (b) performing eviction asynchronously (spawn a single background goroutine or push keys to an eviction queue) so deletions happen outside the hot-path lock, or (c) implement a true incremental LRU using a concurrent-safe list/queue that evicts a few items per mutation instead of half the map at once; also update the comment to accurately describe the eviction strategy.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@components/backend/handlers/ssar_cache.go`:
- Around line 44-47: The cache key in ssarCacheKey currently uses only the first
8 bytes of the SHA-256 digest, which creates a small collision domain and can
lead to cross-principal authorization reuse; change ssarCacheKey to include the
full SHA-256 digest (use the entire h value instead of h[:8]) when formatting
the key so the key is generated from the full 256-bit hash combined with
namespace, verb, group, and resource.
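Using the full digest is a one-line change; a sketch of the key builder (the parameter list is assumed from the review, not copied from the PR):

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// ssarCacheKey-style helper using the full 256-bit digest rather than a
// truncated 8-byte prefix, making cross-principal collisions negligible.
func ssarCacheKey(token, namespace, verb, group, resource string) string {
	h := sha256.Sum256([]byte(token))
	return fmt.Sprintf("%x|%s|%s|%s|%s", h, namespace, verb, group, resource)
}

func main() {
	k1 := ssarCacheKey("tokenA", "ns", "get", "", "projects")
	k2 := ssarCacheKey("tokenB", "ns", "get", "", "projects")
	fmt.Println(k1 != k2) // distinct tokens, distinct keys
}
```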
---
Duplicate comments:
In `@components/backend/handlers/ssar_cache.go`:
- Around line 21-22: The current capacity-handling deletes thousands of map
entries while holding c.mu.Lock() (refer to ssarCacheMaxSize and the eviction
code around c.mu.Lock()/lines ~92-101), which stalls hot-path readers/writers
and isn't true LRU; change the design to avoid bulk deletes under the write lock
by (a) checking capacity under a read lock and immediately returning or marking
the cache as over-capacity, (b) performing eviction asynchronously (spawn a
single background goroutine or push keys to an eviction queue) so deletions
happen outside the hot-path lock, or (c) implement a true incremental LRU using
a concurrent-safe list/queue that evicts a few items per mutation instead of
half the map at once; also update the comment to accurately describe the
eviction strategy.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 9a44e235-c459-4c34-b289-b46425c3458b
📒 Files selected for processing (2)
- components/backend/handlers/ssar_cache.go
- components/backend/websocket/agui_store.go
- Fix misleading "LRU eviction" comment to "random eviction" (Go map iteration order is undefined)
- Fix data race in BenchmarkEnrichAgentStatus_Concurrent: each RunParallel goroutine now gets its own session struct
- Fix BenchmarkEnrichAgentStatus_ListPage/uncached: clear cache before each session so every call is a true cache miss

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@components/backend/handlers/agent_status_bench_test.go`:
- Around line 109-115: The benchmark is warming agentStatusCache only before
ResetTimer(), but the TTL can expire during long runs, skewing cached metrics;
update each benchmark (enrichAgentStatus calls in BenchmarkEnrichAgentStatus,
BenchmarkEnrichAgentStatus_Concurrent,
BenchmarkEnrichAgentStatus_ListPage/cached) to "pin" the cache entry after
warming by explicitly extending its TTL (e.g. re-set or update the entry via the
same cache API used in sessions.go such as agentStatusCache.Set or a
touch/update method on the session) immediately before b.ResetTimer(), ensuring
the cached entry cannot expire during the timed loop.
- Around line 275-282: Replace the plain callCount increment with an atomic
counter and assert its value after the goroutines complete: change callCount to
an int32 (used by the test stub for DeriveAgentStatusFromEvents), increment it
with atomic.AddInt32(&callCount, 1) inside the stub, and after wg.Wait() add an
assertion that the counter equals 1 to verify the cache prevented extra
derivations; keep the defer that restores DeriveAgentStatusFromEvents and
reference the same stub and wg.Wait() code paths when making the change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: cf70201e-237e-446a-b53a-afe7c848800a
📒 Files selected for processing (2)
components/backend/handlers/agent_status_bench_test.go
components/backend/handlers/ssar_cache.go
// Prime the cache with one call
enrichAgentStatus(session)

b.ResetTimer()
for i := 0; i < b.N; i++ {
	enrichAgentStatus(session)
}
Pin the cache entry past the timed section.
agentStatusCache is TTL-based in components/backend/handlers/sessions.go, but these benchmarks only warm it once before ResetTimer(). On longer -benchtime runs, the entry can expire mid-loop and the reported “cached” numbers start including file-derivation work again.
Suggested change
// Prime the cache with one call
enrichAgentStatus(session)
+agentStatusCache.Lock()
+if entry, ok := agentStatusCache.entries[sessionName]; ok {
+ entry.expiresAt = time.Now().Add(time.Hour)
+ agentStatusCache.entries[sessionName] = entry
+}
+agentStatusCache.Unlock()
 b.ResetTimer()

Apply the same pattern to BenchmarkEnrichAgentStatus_Concurrent and BenchmarkEnrichAgentStatus_ListPage/cached.
Also applies to: 139-152, 189-199
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@components/backend/handlers/agent_status_bench_test.go` around lines 109 -
115, The benchmark is warming agentStatusCache only before ResetTimer(), but the
TTL can expire during long runs, skewing cached metrics; update each benchmark
(enrichAgentStatus calls in BenchmarkEnrichAgentStatus,
BenchmarkEnrichAgentStatus_Concurrent,
BenchmarkEnrichAgentStatus_ListPage/cached) to "pin" the cache entry after
warming by explicitly extending its TTL (e.g. re-set or update the entry via the
same cache API used in sessions.go such as agentStatusCache.Set or a
touch/update method on the session) immediately before b.ResetTimer(), ensuring
the cached entry cannot expire during the timed loop.
callCount := 0
origDerive := DeriveAgentStatusFromEvents
defer func() { DeriveAgentStatusFromEvents = origDerive }()

DeriveAgentStatusFromEvents = func(name string) string {
	callCount++
	return "working"
}
🧩 Analysis chain
🏁 Script executed:
find . -name "agent_status_bench_test.go" -type f
Repository: ambient-code/platform
Length of output: 121
🏁 Script executed:
cat -n ./components/backend/handlers/agent_status_bench_test.go | sed -n '270,290p'
Repository: ambient-code/platform
Length of output: 771
🏁 Script executed:
cat -n ./components/backend/handlers/agent_status_bench_test.go | sed -n '315,340p'
Repository: ambient-code/platform
Length of output: 659
🏁 Script executed:
cat -n ./components/backend/handlers/agent_status_bench_test.go | sed -n '290,320p'
Repository: ambient-code/platform
Length of output: 1226
🏁 Script executed:
cat -n ./components/backend/handlers/agent_status_bench_test.go | sed -n '334,345p'
Repository: ambient-code/platform
Length of output: 91
🏁 Script executed:
cat -n ./components/backend/handlers/agent_status_bench_test.go | sed -n '330,350p'
Repository: ambient-code/platform
Length of output: 157
🏁 Script executed:
grep -n "callCount" ./components/backend/handlers/agent_status_bench_test.go
Repository: ambient-code/platform
Length of output: 366
🏁 Script executed:
cat -n ./components/backend/handlers/agent_status_bench_test.go | sed -n '1,30p'
Repository: ambient-code/platform
Length of output: 907
Make the concurrency check race-free and assert the outcome.
The stub increments a plain callCount (line 280) that all 50 concurrent goroutines can reach without synchronization. The concurrent test section (lines 321-334) never checks that callCount remained at 1 after wg.Wait(). If the cache regresses and multiple workers fall through to derivation, this test races on its own bookkeeping and still passes.
Use atomic.AddInt32(&callCount, 1) instead of callCount++, and add an assertion after line 334 to verify the cache prevented extra derivations under concurrent load.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@components/backend/handlers/agent_status_bench_test.go` around lines 275 -
282, Replace the plain callCount increment with an atomic counter and assert its
value after the goroutines complete: change callCount to an int32 (used by the
test stub for DeriveAgentStatusFromEvents), increment it with
atomic.AddInt32(&callCount, 1) inside the stub, and after wg.Wait() add an
assertion that the counter equals 1 to verify the cache prevented extra
derivations; keep the defer that restores DeriveAgentStatusFromEvents and
reference the same stub and wg.Wait() code paths when making the change.
Review Queue Status
Action needed: Address CodeRabbit Major issue on cache entry pinning
Addresses three performance bottlenecks identified through benchmarking:
SSAR cache (30s TTL): caches K8s permission checks per user/project
to avoid etcd writes on every API request. Before (sustained load):
After: cached requests skip the K8s API call entirely.
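A minimal sketch of how such a decision cache can short-circuit the permission check (the key format, TTL constant, and helper names are illustrative, and ssarCheck stands in for the real SubjectAccessReview call):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type decision struct {
	allowed   bool
	expiresAt time.Time
}

var (
	mu        sync.RWMutex
	decisions = map[string]decision{}
)

const ssarTTL = 30 * time.Second

// allowed returns a fresh cached decision when available and only falls
// back to the (expensive) SSAR check on a miss or expired entry.
func allowed(key string, ssarCheck func() bool) bool {
	mu.RLock()
	d, ok := decisions[key]
	mu.RUnlock()
	if ok && time.Now().Before(d.expiresAt) {
		return d.allowed // cache hit: skip the K8s API call entirely
	}
	result := ssarCheck() // cache miss: perform the real check
	mu.Lock()
	decisions[key] = decision{allowed: result, expiresAt: time.Now().Add(ssarTTL)}
	mu.Unlock()
	return result
}

func main() {
	calls := 0
	check := func() bool { calls++; return true }
	allowed("alice|proj-a|get", check)
	allowed("alice|proj-a|get", check)
	fmt.Println(calls) // → 1: the second request is served from the cache
}
```

The 30s TTL trades a bounded window of stale permissions for removing the K8s round trip from every request in that window.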
Event log tail reads: reconnect replay now reads only the last 2MB
instead of scanning the entire file. Measured before/after: