diff --git a/.github/workflows/push.yml b/.github/workflows/push.yml index 4e0728023..c72a7961d 100644 --- a/.github/workflows/push.yml +++ b/.github/workflows/push.yml @@ -30,6 +30,8 @@ jobs: run: | echo "running OpenTelemetry Formatter tests" vendor/bin/phpunit tests/OpenTelemetry/Formatters/ --log-junit results_opentelemetry_tests.xml + echo "running Redis Resilience tests" + vendor/bin/phpunit tests/Redis/ --log-junit results_redis_tests.xml - name: Upload OpenTelemetry Tests Output uses: actions/upload-artifact@v4 @@ -38,6 +40,13 @@ jobs: path: results_opentelemetry_tests.xml retention-days: 5 + - name: Upload Redis Tests Output + uses: actions/upload-artifact@v4 + with: + name: results_redis_tests + path: results_redis_tests.xml + retention-days: 5 + integration-tests: runs-on: ubuntu-latest strategy: diff --git a/adr/001-resilient-redis-connection.md b/adr/001-resilient-redis-connection.md new file mode 100644 index 000000000..5607cb0d3 --- /dev/null +++ b/adr/001-resilient-redis-connection.md @@ -0,0 +1,88 @@ +# ADR-001: Resilient Redis Connection with Idempotent Command Retry + +**Date:** 2026-02-25 +**Status:** Accepted +**Authors:** Sebastian Marcet + +## Context + +Production runs on a managed Valkey (Redis-compatible) instance on DigitalOcean over TLS. Transient connection failures — network hiccups, TLS renegotiation, server maintenance — cause `Predis\Connection\ConnectionException` errors that propagate up through the application and fail business-critical operations. + +The immediate trigger was audit log job dispatch (`EmitAuditLogJob`) failing during Doctrine's `onFlush` event inside `MemberService::synchronizeGroups`. The Redis write failure bubbled up through the Doctrine `UnitOfWork::commit()` and caused the entire member synchronization transaction to fail. Audit logging is non-critical and should never break business operations. + +Beyond audit logging, any Redis operation (cache reads/writes, session access, rate limiting) is vulnerable to the same transient failures. + +## Decision + +### 1. Resilient Redis Connection Layer + +Introduce a custom Redis driver (`predis_resilient`) that automatically retries **idempotent** commands on `ConnectionException`, with disconnect/reconnect between attempts and exponential backoff. + +**Architecture:** + +- `ResilientPredisConnection` extends Laravel's `PredisConnection`, overrides `command()` to catch `ConnectionException` and retry only idempotent commands +- `ResilientPredisConnector` extends `PredisConnector`, calls `parent::connect()` to reuse all config/TLS/option handling, then wraps the `Predis\Client` in `ResilientPredisConnection` +- `RedisResilienceServiceProvider` registers the driver via `RedisManager::extend('predis_resilient', ...)` +- Activated by setting `REDIS_CLIENT=predis_resilient` in `.env` — zero behavior change without this flag + +**Retry behavior:** + +| Command type | On ConnectionException | Rationale | +|---|---|---| +| Idempotent (GET, SET, DEL, HSET, EXPIRE, ...) | Disconnect, reconnect, retry up to N times with exponential backoff | Executing twice produces the same result | +| Non-idempotent (INCR, LPUSH, RPUSH, EVAL, ...) | Rethrow immediately | Command may have executed before the read-side failed; retrying could duplicate data | + +**Configuration** (per-connection in `config/database.php`): + +| Parameter | Env var | Default | Description | +|---|---|---|---| +| `retry_limit` | `REDIS_RETRY_LIMIT` | 2 | Max retry attempts (0 disables retries) | +| `retry_delay` | `REDIS_RETRY_DELAY` | 50 | Base delay in ms (doubles each attempt: 50, 100, 200) | + +### 2. Job Dispatch Fallback for Audit Logging + +Separately, `AuditLogOtlpStrategy` was updated to dispatch `EmitAuditLogJob` via `JobDispatcher::withSyncFallback()` instead of `EmitAuditLogJob::dispatch()`. This catches Redis failures at the job dispatch level and runs the audit log emission synchronously as a fallback — preventing audit logging from ever failing the parent business transaction. + +## Consequences + +### Positive + +- Transient Redis failures on idempotent commands (cache GET/SET, session reads, key expiry) are automatically recovered without application-level error handling +- Non-idempotent commands (queue pushes, counters, list operations) are never retried, preventing data duplication +- Opt-in activation via env var — no risk to existing deployments +- Per-connection retry configuration allows tuning (e.g., more retries for cache, fewer for workers) +- Audit log failures can no longer crash business transactions + +### Negative + +- Retry adds latency on failure (up to ~350ms with defaults: 50 + 100 + 200ms backoff) +- `usleep()` in the retry loop blocks the PHP process during backoff — acceptable for 2-3 retries but would need async handling at higher retry counts +- The idempotent command list is manually maintained and must be updated if new Redis commands are used + +### Neutral + +- Queue push operations (`EVAL` with Lua scripts) are NOT retried by the resilient connection — they remain protected by `JobDispatcher::withSyncFallback` / `withDbFallback` at the application layer +- The `predis_resilient` driver shares the same Predis `Client` configuration as `predis` — no TLS, auth, or timeout differences + +## Alternatives Considered + +1. **Predis built-in retry** — Predis does not offer connection retry on command failure (only on initial connect via `aggregate` connections). Rejected. + +2. **Retry all commands** — Would risk duplicating non-idempotent operations (double LPUSH, double INCR) when the failure occurs after the command was sent but before the response was read. Rejected. + +3. **Catch-and-ignore at every call site** — Would require wrapping every Redis call in try/catch throughout the codebase. Not maintainable. Rejected. + +4. **Switch to phpredis extension** — phpredis has built-in retry support, but would require changing the entire Redis integration layer and testing all connection configurations. Disproportionate effort for the problem at hand. Not pursued. + +## Files + +| File | Purpose | +|---|---| +| `app/Redis/ResilientPredisConnection.php` | Connection with retry logic | +| `app/Redis/ResilientPredisConnector.php` | Connector that swaps connection class | +| `app/Providers/RedisResilienceServiceProvider.php` | Registers `predis_resilient` driver | +| `config/database.php` | Added `retry_limit`, `retry_delay` to Redis connections | +| `config/app.php` | Registered service provider | +| `app/Audit/AuditLogOtlpStrategy.php` | Changed to `JobDispatcher::withSyncFallback` | +| `tests/Redis/ResilientPredisConnectionTest.php` | 10 unit tests | +| `.github/workflows/push.yml` | Added Redis tests to CI | diff --git a/app/Providers/RedisResilienceServiceProvider.php b/app/Providers/RedisResilienceServiceProvider.php new file mode 100644 index 000000000..c4947c3f6 --- /dev/null +++ b/app/Providers/RedisResilienceServiceProvider.php @@ -0,0 +1,38 @@ +app->afterResolving('redis', function (RedisManager $redis, Application $app) { + $redis->extend('predis_resilient', function () { + return new ResilientPredisConnector(); + }); + }); + } +} diff --git a/app/Redis/ResilientPredisConnection.php b/app/Redis/ResilientPredisConnection.php new file mode 100644 index 000000000..db28f01fb --- /dev/null +++ b/app/Redis/ResilientPredisConnection.php @@ -0,0 +1,125 @@ +retryLimit = $retryLimit; + $this->retryDelay = $retryDelay; + } + + /** + * @inheritdoc + */ + public function command($method, array $parameters = []) + { + try { + return parent::command($method, $parameters); + } catch (ConnectionException $e) { + if (!$this->isIdempotent($method)) { + throw $e; + } + return $this->retryCommand($method, $parameters, $e); + } + } + + /** + * Retry an idempotent command after reconnecting. + */ + private function retryCommand(string $method, array $parameters, ConnectionException $previous): mixed + { + $lastException = $previous; + + for ($attempt = 1; $attempt <= $this->retryLimit; $attempt++) { + $delay = $this->retryDelay * (2 ** ($attempt - 1)); // exponential back-off + + Log::warning('ResilientPredisConnection: retrying command', [ + 'command' => strtoupper($method), + 'attempt' => $attempt, + 'max_retries' => $this->retryLimit, + 'delay_ms' => $delay, + 'error' => $previous->getMessage(), + ]); + + usleep($delay * 1000); + + try { + $this->client->disconnect(); + + return parent::command($method, $parameters); + } catch (ConnectionException $e) { + $lastException = $e; + } + } + + Log::error('ResilientPredisConnection: all retries exhausted', [ + 'command' => strtoupper($method), + 'retries' => $this->retryLimit, + 'error' => $lastException->getMessage(), + ]); + + throw $lastException; + } + + private function isIdempotent(string $method): bool + { + return in_array(strtoupper($method), self::IDEMPOTENT_COMMANDS, true); + } +} diff --git a/app/Redis/ResilientPredisConnector.php b/app/Redis/ResilientPredisConnector.php new file mode 100644 index 000000000..2dd4a059b --- /dev/null +++ b/app/Redis/ResilientPredisConnector.php @@ -0,0 +1,41 @@ +client(), + $retryLimit, + $retryDelay + ); + } +} diff --git a/config/app.php b/config/app.php index 8d79b8a06..2d4bed5d9 100644 --- a/config/app.php +++ b/config/app.php @@ -156,6 +156,7 @@ // App\Providers\BroadcastServiceProvider::class, App\Providers\EventServiceProvider::class, App\Providers\RouteServiceProvider::class, + App\Providers\RedisResilienceServiceProvider::class, // Services App\Repositories\RepositoriesProvider::class, App\Services\FileSystem\Swift\SwiftServiceProvider::class, diff --git a/config/database.php b/config/database.php index b0adb33f9..c1148144e 100644 --- a/config/database.php +++ b/config/database.php @@ -172,7 +172,6 @@ */ 'redis' => [ - 'client' => env('REDIS_CLIENT', 'predis'), /* * @see https://github.com/predis/predis/wiki/Connection-Parameters @@ -186,8 +185,9 @@ 'scheme' => env('REDIS_SCHEME', 'tcp'), 'read_write_timeout' => env('REDIS_READ_WRITE_TIMEOUT', -1), 'timeout' => env('REDIS_TIMEOUT', 30), + 'retry_limit' => (int) env('REDIS_RETRY_LIMIT', 2), + 'retry_delay' => (int) env('REDIS_RETRY_DELAY', 50), ], - 'cache' => [ 'host' => env('REDIS_HOST'), 'port' => env('REDIS_PORT'), @@ -196,8 +196,9 @@ 'scheme' => env('REDIS_SCHEME', 'tcp'), 'read_write_timeout' => env('REDIS_READ_WRITE_TIMEOUT', -1), 'timeout' => env('REDIS_TIMEOUT', 30), + 'retry_limit' => (int) env('REDIS_RETRY_LIMIT', 2), + 'retry_delay' => (int) env('REDIS_RETRY_DELAY', 50), ], - 'session' => [ 'host' => env('REDIS_HOST'), 'port' => env('REDIS_PORT'), @@ -206,6 +207,9 @@ 'scheme' => env('REDIS_SCHEME', 'tcp'), 'read_write_timeout' => env('REDIS_READ_WRITE_TIMEOUT', -1), 'timeout' => env('REDIS_TIMEOUT', 30), + + 'retry_limit' => (int) env('REDIS_RETRY_LIMIT', 2), + 'retry_delay' => (int) env('REDIS_RETRY_DELAY', 50), ], 'worker' => [ @@ -216,6 +220,8 @@ 'scheme' => env('REDIS_SCHEME', 'tcp'), 'read_write_timeout' => env('REDIS_READ_WRITE_TIMEOUT', -1), 'timeout' => env('REDIS_TIMEOUT', 30), + 'retry_limit' => (int) env('REDIS_RETRY_LIMIT', 2), + 'retry_delay' => (int) env('REDIS_RETRY_DELAY', 50), ], ], 'allow_disabled_pk' => env('DB_ALLOW_DISABLED_PK', false), diff --git a/tests/Redis/ResilientPredisConnectionTest.php b/tests/Redis/ResilientPredisConnectionTest.php new file mode 100644 index 000000000..837a763f1 --- /dev/null +++ b/tests/Redis/ResilientPredisConnectionTest.php @@ -0,0 +1,195 @@ +shouldReceive('get') + ->once() + ->andThrow($this->createConnectionException()); + $client->shouldReceive('disconnect')->once(); + $client->shouldReceive('get') + ->once() + ->andReturn('value'); + + $connection = new ResilientPredisConnection($client, 2, 1); + $result = $connection->command('get', ['key']); + + $this->assertEquals('value', $result); + } + + public function testNonIdempotentCommandDoesNotRetry(): void + { + $client = Mockery::mock(Client::class); + $client->shouldReceive('lpush') + ->once() + ->andThrow($this->createConnectionException()); + $client->shouldNotReceive('disconnect'); + + $connection = new ResilientPredisConnection($client, 2, 1); + + $this->expectException(ConnectionException::class); + $connection->command('lpush', ['queue', 'payload']); + } + + public function testEvalCommandDoesNotRetry(): void + { + $client = Mockery::mock(Client::class); + $client->shouldReceive('eval') + ->once() + ->andThrow($this->createConnectionException()); + $client->shouldNotReceive('disconnect'); + + $connection = new ResilientPredisConnection($client, 2, 1); + + $this->expectException(ConnectionException::class); + $connection->command('eval', ['return 1', 0]); + } + + public function testRetriesExhaustedThrowsException(): void + { + $client = Mockery::mock(Client::class); + $client->shouldReceive('get') + ->times(3) // 1 initial + 2 retries + ->andThrow($this->createConnectionException()); + $client->shouldReceive('disconnect')->twice(); + + $connection = new ResilientPredisConnection($client, 2, 1); + + $this->expectException(ConnectionException::class); + $connection->command('get', ['key']); + } + + public function testNoRetryOnNonConnectionException(): void + { + $client = Mockery::mock(Client::class); + $client->shouldReceive('get') + ->once() + ->andThrow(new \Predis\Response\ServerException('ERR unknown command')); + $client->shouldNotReceive('disconnect'); + + $connection = new ResilientPredisConnection($client, 2, 1); + + $this->expectException(\Predis\Response\ServerException::class); + $connection->command('get', ['key']); + } + + public function testSuccessfulCommandDoesNotRetry(): void + { + $client = Mockery::mock(Client::class); + $client->shouldReceive('get') + ->once() + ->andReturn('value'); + $client->shouldNotReceive('disconnect'); + + $connection = new ResilientPredisConnection($client, 2, 1); + $result = $connection->command('get', ['key']); + + $this->assertEquals('value', $result); + } + + public function testZeroRetryLimitBehavesLikeStockConnection(): void + { + $client = Mockery::mock(Client::class); + $client->shouldReceive('get') + ->once() + ->andThrow($this->createConnectionException()); + $client->shouldNotReceive('disconnect'); + + $connection = new ResilientPredisConnection($client, 0, 1); + + $this->expectException(ConnectionException::class); + $connection->command('get', ['key']); + } + + public function testIdempotentWriteCommandRetries(): void + { + $client = Mockery::mock(Client::class); + $client->shouldReceive('set') + ->once() + ->andThrow($this->createConnectionException()); + $client->shouldReceive('disconnect')->once(); + $client->shouldReceive('set') + ->once() + ->andReturn('OK'); + + $connection = new ResilientPredisConnection($client, 2, 1); + $result = $connection->command('set', ['key', 'value']); + + $this->assertEquals('OK', $result); + } + + public function testRetrySucceedsOnSecondRetry(): void + { + $exception = $this->createConnectionException(); + + $client = Mockery::mock(Client::class); + // initial call fails + $client->shouldReceive('get') + ->once() + ->andThrow($exception); + // 1st retry fails + $client->shouldReceive('get') + ->once() + ->andThrow($exception); + // 2nd retry succeeds + $client->shouldReceive('get') + ->once() + ->andReturn('recovered'); + $client->shouldReceive('disconnect')->twice(); + + $connection = new ResilientPredisConnection($client, 2, 1); + $result = $connection->command('get', ['key']); + + $this->assertEquals('recovered', $result); + } + + public function testIncrCommandDoesNotRetry(): void + { + $client = Mockery::mock(Client::class); + $client->shouldReceive('incr') + ->once() + ->andThrow($this->createConnectionException()); + $client->shouldNotReceive('disconnect'); + + $connection = new ResilientPredisConnection($client, 2, 1); + + $this->expectException(ConnectionException::class); + $connection->command('incr', ['counter']); + } +}