
fix: concurrent delete race conditions in delete_entity #702

Merged

phernandez merged 1 commit into main from fix-concurrent-delete-resilience on Mar 31, 2026

Conversation

@phernandez
Member

Summary

  • Make delete_entity() idempotent under concurrent access by treating "entity not found" as success instead of raising ValueError
  • Treat repository.delete() returning False (NoResultFound) as success since it means the entity was concurrently deleted
  • Make search index cleanup best-effort during deletes to handle cascade race conditions
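The idempotency change can be sketched with a minimal in-memory harness. This is a hypothetical illustration, not the project's real code: `FakeRepo`, its method names, and the simplified `delete_entity` below are stand-ins for the actual repository and service, which this sketch assumes behave as the bullets describe.

```python
import asyncio
from types import SimpleNamespace

class EntityNotFoundError(Exception):
    """Stand-in for the real lookup error."""

class FakeRepo:
    """Hypothetical in-memory stand-in for the real repository."""
    def __init__(self, ids):
        self.rows = {i: SimpleNamespace(id=i) for i in ids}

    async def find(self, entity_id):
        row = self.rows.get(entity_id)
        if row is None:
            raise EntityNotFoundError(entity_id)
        return row

    async def delete(self, entity_id):
        # False mirrors NoResultFound: the row was already gone.
        return self.rows.pop(entity_id, None) is not None

async def delete_entity(repo, entity_id) -> bool:
    try:
        entity = await repo.find(entity_id)
    except EntityNotFoundError:
        return True  # concurrently deleted: idempotent success
    await repo.delete(entity.id)  # False here also means "already gone"
    return True

async def demo():
    repo = FakeRepo([1])
    return await delete_entity(repo, 1), await delete_entity(repo, 1)

print(asyncio.run(demo()))  # (True, True)
```

The key design choice is that both "not found before delete" and "delete affected zero rows" map to success: under concurrent access, either outcome means the entity is gone, which is the state the caller asked for.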

Context: A customer deleting a large directory triggered multiple concurrent delete-directory requests, producing 662 errors across 3 types — all race conditions from competing deletes on the same entities.

Test plan

  • Added test_delete_entity_by_id_already_deleted - double-delete by ID returns True
  • Added test_delete_entity_by_permalink_already_deleted - double-delete by permalink
  • Added test_delete_directory_concurrent_resilience - directory delete with pre-deleted entities
  • Added test_delete_directory_all_already_deleted - directory delete when all entities gone
  • Added test_delete_directory_entity_deleted_between_query_and_delete - simulates exact race condition
  • All existing delete tests pass
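The race the last test targets can be reproduced with a tiny harness: an entity vanishes between the directory prefix query and the per-entity delete call. All names below are hypothetical stand-ins for the project's API, shown only to make the failure mode concrete.

```python
import asyncio
from types import SimpleNamespace

class FakeRepo:
    """Hypothetical stand-in exposing only the calls the demo needs."""
    def __init__(self, ids):
        self.rows = {i: SimpleNamespace(id=i) for i in ids}

    async def find_by_prefix(self, prefix):
        # Snapshot of rows, as the directory prefix query would return.
        return list(self.rows.values())

    async def delete(self, entity_id):
        return self.rows.pop(entity_id, None) is not None

async def race_demo():
    repo = FakeRepo([1, 2, 3])
    snapshot = await repo.find_by_prefix("dir/")  # query sees 1, 2, 3
    await repo.delete(2)  # a competing request deletes entity 2 first
    # The loser's delete returns False for entity 2; the fix maps that
    # False (and the 0-results lookup) to success instead of raising.
    return [await repo.delete(e.id) for e in snapshot]

print(asyncio.run(race_demo()))  # [True, False, True]
```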

🤖 Generated with Claude Code

@github-actions
Contributor

github-actions bot commented Mar 30, 2026

Claude finished @phernandez's task in 2m 5s


PR Review: Fix concurrent delete race conditions

Tasks:

  • Gather context and read changed files
  • Review entity_service.py changes
  • Review test_entity_service.py changes
  • Evaluate against team checklist
  • Post final review

Summary

The overall approach is correct and well-motivated — making delete_entity idempotent under concurrent access is the right fix for this class of race condition. The test suite additions are thorough and well-structured. A few issues need attention before merging.


Issues Found

1. Line length violation (minor)

entity_service.py:758 exceeds the 100-character limit:

                    logger.debug(f"Search cleanup skipped for {permalink_or_id} (likely concurrent delete)")

This line is ~107 characters.


2. Bare except Exception swallows legitimate errors (concern)

entity_service.py:755:

            try:
                await self.search_service.handle_delete(entity)
            except Exception:
                # Search cleanup is best-effort during concurrent deletes.
                logger.debug(f"Search cleanup skipped for {permalink_or_id} (likely concurrent delete)")

This violates CLAUDE.md's "Fail fast" principle — a bare except Exception at debug level will silently absorb programming errors (e.g., AttributeError, TypeError, connection failures) and make them extremely hard to diagnose.

Looking at search_service.handle_delete(), it accesses entity.observations and entity.outgoing_relations (lines 740–741). The expected failure mode in a concurrent delete is SQLAlchemy raising when these relationships are cascade-deleted. The exception should be logged at warning level with exc_info=True so it's visible in logs while still not blocking the delete:

            try:
                await self.search_service.handle_delete(entity)
            except Exception:
                # Search cleanup is best-effort during concurrent deletes.
                # Relationships may have been cascade-deleted by a concurrent request.
                logger.warning(
                    "Search cleanup failed for entity (likely concurrent delete)",
                    permalink_or_id=permalink_or_id,
                    exc_info=True,
                )



3. f-strings in structlog calls (style, minor)

Lines 740, 758, 769 use f-strings in logger calls. The project uses structlog, and the existing pattern (e.g., line 744) passes keyword arguments:

# Existing pattern (line 744)
logger.error("Entity lookup error", entity_id=permalink_or_id, found_count=len(entities))

# New code (inconsistent)
logger.info(f"Entity already deleted: {permalink_or_id}")

Not blocking, but new logger calls should follow the structlog key=value style for consistency and searchability.


Checklist Results

| Item | Status |
| --- | --- |
| Follows Basic Memory coding conventions | ⚠️ Bare except + f-strings in structlog, line length |
| Python 3.12+ type annotations and async patterns | |
| SQLAlchemy 2.0 best practices | |
| 100-character line length limit | ❌ Line 758 is ~107 chars |
| No commented-out code blocks | |
| Unit tests for new functions | ✅ 5 new tests |
| Test coverage for edge cases | ✅ Well covered |
| 100% test coverage maintained | |
| No hardcoded secrets/credentials | |
| Proper error handling and logging | ⚠️ Exception swallowed at debug level |
| Performance considerations | ✅ No regressions |
| Windows-compatible file paths | ✅ No path issues introduced |
| No emojis/unicode in console/log output | |

What's Working Well

  • Correct idempotency logic — treating "0 results by ID" and NoResultFound from repository.delete() as success is exactly right for concurrent deletes.
  • Excellent test suite: test_delete_directory_entity_deleted_between_query_and_delete in particular nails the real race condition scenario.
  • Good decision-point comments on the repository.delete() return-value handling (Trigger/Why/Outcome pattern from CLAUDE.md).
  • The EntityNotFoundError handling (returning True) for permalink lookups was already correct; this PR keeps it intact.

The two items to address before merging: fix the line length at line 758 and raise the logging level on the caught exception to warning with exc_info=True.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 36db875536

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +755 to +758
except Exception:
# Search cleanup is best-effort during concurrent deletes.
# Relationships may have been cascade-deleted by a concurrent request.
logger.debug(f"Search cleanup skipped for {permalink_or_id} (likely concurrent delete)")

P1: Re-raise unexpected search cleanup failures

Catching Exception here violates the fail-fast rule in AGENTS.md ("Do not swallow exceptions") and can silently leave the search index inconsistent: if handle_delete fails for a real defect or backend issue (not just a concurrent-race edge case), the entity is still deleted from files/DB and the stale search rows remain queryable. That produces incorrect search results and makes the root cause hard to diagnose because the original exception is discarded.
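One way to reconcile best-effort cleanup with the fail-fast rule is to catch only the expected race error and let everything else propagate. The sketch below is hypothetical: `StaleRelationshipError` stands in for whatever narrower exception the real `handle_delete` raises when cascade-deleted relationships are accessed (likely a SQLAlchemy error in the actual code), and the function names are illustrative, not the project's API.

```python
import logging

logger = logging.getLogger(__name__)

class StaleRelationshipError(Exception):
    """Stand-in for the error raised when cascade-deleted
    relationships are accessed; the real type would be narrower."""

def cleanup_search_index(handle_delete, entity, permalink_or_id) -> None:
    """Absorb only the expected concurrent-delete failure."""
    try:
        handle_delete(entity)
    except StaleRelationshipError:
        logger.warning(
            "Search cleanup skipped (likely concurrent delete): %s",
            permalink_or_id,
            exc_info=True,
        )
    # Any other exception propagates, so real defects stay visible.

# Usage: the expected race error is absorbed, a real bug re-raises.
def raises_race(_):
    raise StaleRelationshipError("relations cascade-deleted")

def raises_bug(_):
    raise AttributeError("real defect")

cleanup_search_index(raises_race, object(), "notes/a")  # logged, swallowed
try:
    cleanup_search_index(raises_bug, object(), "notes/b")
    outcome = "swallowed"
except AttributeError:
    outcome = "re-raised"
print(outcome)  # re-raised
```

This keeps stale search rows from blocking the delete in the race case while still surfacing defects that would otherwise leave the index silently inconsistent.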


@phernandez phernandez changed the title Fix concurrent delete race conditions fix: concurrent delete race conditions in delete_entity Mar 31, 2026
…ectory

When multiple delete-directory requests target the same directory concurrently,
entities can be deleted between the prefix query and the individual delete calls.
Previously this produced hundreds of noisy errors (ValueError, "Delete returned
False", NoneType attribute errors) even though the deletes ultimately succeeded.

Changes:
- 🔧 Return True when entity lookup by ID finds 0 results (already deleted)
- 🔧 Treat repository.delete() returning False as success (concurrent delete)
- 🔧 Make search index cleanup best-effort during deletes (cascade race)
- ✅ Add 5 tests covering concurrent delete scenarios

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: phernandez <paul@basicmachines.co>
@phernandez phernandez force-pushed the fix-concurrent-delete-resilience branch from 36db875 to de93959 on March 31, 2026 00:26
@github-actions
Contributor

github-actions bot commented Mar 31, 2026

Claude finished @phernandez's task in 2m 45s


PR Review: Fix concurrent delete race conditions

Tasks:

  • Gather context and read changed files
  • Review entity_service.py changes
  • Review test_entity_service.py changes
  • Evaluate against team checklist
  • Post final review

Summary

This PR correctly addresses the concurrent delete race conditions that caused 662 errors in production. The previous review flagged three issues; all three have been resolved in the current commit.


Issues Resolved from Previous Review

1. Line length at line 758 — ✅ Fixed
The logger.debug(f"...") call (~107 chars) was replaced with a properly wrapped multi-line logger.warning(...) call that stays within the 100-char limit.

2. Bare except Exception swallowing at debug level — ✅ Fixed
The exception is now logged at warning level with exc_info=True using structlog keyword args:

logger.warning(
    "Search cleanup failed for entity (likely concurrent delete)",
    permalink_or_id=permalink_or_id,
    exc_info=True,
)

3. f-strings in structlog calls — ✅ Fixed
New log calls at lines 740, 760, and 773 all use proper key=value structlog style.


Code Review: entity_service.py

The three concurrent-delete paths are correctly handled:

  1. ID lookup returns 0 results (lines 738–741) — returns True immediately; the entity was deleted by a concurrent request between the delete_directory query and this call.
  2. search_service.handle_delete() raises (lines 753–762) — best-effort, logged at warning with exc_info=True. SQLAlchemy may raise when cascade-deleted relationships are accessed.
  3. repository.delete() returns False (lines 771–774) — NoResultFound from the DB means another request won already. Return True.

The delete_directory call path (line 1549 calls delete_entity(entity.id)) is now fully resilient: the stale entity IDs from the prefix query will be handled gracefully by path #1.
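The three numbered paths can be sketched together as one flow. This is a simplified, hypothetical rendering of the structure described above, not the actual `entity_service.py` code: `FakeRepo`, `handle_search_delete`, and the logger calls are stand-ins that assume the behavior the review attributes to the real implementation.

```python
import asyncio
import logging
from types import SimpleNamespace

logger = logging.getLogger(__name__)

class FakeRepo:
    """Hypothetical stand-in for the real repository."""
    def __init__(self, ids):
        self.rows = {i: SimpleNamespace(id=i) for i in ids}

    async def find_by_id(self, entity_id):
        # Returns a list, mirroring the "0 results" path above.
        row = self.rows.get(entity_id)
        return [row] if row else []

    async def delete(self, entity_id):
        return self.rows.pop(entity_id, None) is not None

async def handle_search_delete(entity):
    # Stand-in for search_service.handle_delete(); may raise if
    # relationships were cascade-deleted by a concurrent request.
    pass

async def delete_entity(repo, entity_id) -> bool:
    # Path 1: ID lookup returns 0 results -> concurrently deleted.
    entities = await repo.find_by_id(entity_id)
    if not entities:
        logger.info("Entity already deleted", extra={"eid": entity_id})
        return True
    # Path 2: search index cleanup is best-effort.
    try:
        await handle_search_delete(entities[0])
    except Exception:
        logger.warning("Search cleanup failed", exc_info=True)
    # Path 3: delete() returning False means another request won.
    await repo.delete(entity_id)
    return True

async def demo():
    repo = FakeRepo([1])
    return await delete_entity(repo, 1), await delete_entity(repo, 1)

print(asyncio.run(demo()))  # (True, True)
```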


Code Review: test_entity_service.py

Five well-structured tests cover the full scenario space:

| Test | What it verifies |
| --- | --- |
| test_delete_entity_by_id_already_deleted | Double-delete by ID returns True |
| test_delete_entity_by_permalink_already_deleted | Double-delete by permalink returns True |
| test_delete_directory_concurrent_resilience | Some entities pre-deleted before delete_directory runs |
| test_delete_directory_all_already_deleted | All entities pre-deleted (empty directory case) |
| test_delete_directory_entity_deleted_between_query_and_delete | Exact race condition: entity exists in the prefix query but is deleted before delete_entity is called |

The last test is the most important — it directly simulates the production failure scenario.


Checklist Results

| Item | Status |
| --- | --- |
| Follows Basic Memory coding conventions | |
| Python 3.12+ type annotations and async patterns | |
| SQLAlchemy 2.0 best practices | |
| 100-character line length limit | ✅ New code is clean |
| No commented-out code blocks | |
| Unit tests for new functions | ✅ 5 new tests |
| Test coverage for edge cases | ✅ All race condition paths covered |
| 100% test coverage maintained | |
| No hardcoded secrets/credentials | |
| Proper error handling and logging | ✅ Fixed from previous review |
| Performance considerations | ✅ No regressions |
| Windows-compatible file paths | |
| No emojis/unicode in console/log output | |

Verdict

All issues from the previous review are addressed. The implementation is correct, well-tested, and follows CLAUDE.md conventions. Ready to merge.

@phernandez phernandez merged commit cfa7000 into main Mar 31, 2026
25 checks passed
@phernandez phernandez deleted the fix-concurrent-delete-resilience branch March 31, 2026 00:57
