Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
9 changes: 9 additions & 0 deletions .github/workflows/push.yml
Original file line number Diff line number Diff line change
Expand Up @@ -30,6 +30,8 @@ jobs:
run: |
echo "running OpenTelemetry Formatter tests"
vendor/bin/phpunit tests/OpenTelemetry/Formatters/ --log-junit results_opentelemetry_tests.xml
echo "running Redis Resilience tests"
vendor/bin/phpunit tests/Redis/ --log-junit results_redis_tests.xml

- name: Upload OpenTelemetry Tests Output
uses: actions/upload-artifact@v4
Expand All @@ -38,6 +40,13 @@ jobs:
path: results_opentelemetry_tests.xml
retention-days: 5

- name: Upload Redis Tests Output
uses: actions/upload-artifact@v4
with:
name: results_redis_tests
path: results_redis_tests.xml
retention-days: 5

integration-tests:
runs-on: ubuntu-latest
strategy:
Expand Down
88 changes: 88 additions & 0 deletions adr/001-resilient-redis-connection.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,88 @@
# ADR-001: Resilient Redis Connection with Idempotent Command Retry

**Date:** 2026-02-25
**Status:** Accepted
**Authors:** Sebastian Marcet

## Context

Production runs on a managed Valkey (Redis-compatible) instance on DigitalOcean over TLS. Transient connection failures — network hiccups, TLS renegotiation, server maintenance — cause `Predis\Connection\ConnectionException` errors that propagate up through the application and fail business-critical operations.

The immediate trigger was audit log job dispatch (`EmitAuditLogJob`) failing during Doctrine's `onFlush` event inside `MemberService::synchronizeGroups`. The Redis write failure bubbled up through the Doctrine `UnitOfWork::commit()` and caused the entire member synchronization transaction to fail. Audit logging is non-critical and should never break business operations.

Beyond audit logging, any Redis operation (cache reads/writes, session access, rate limiting) is vulnerable to the same transient failures.

## Decision

### 1. Resilient Redis Connection Layer

Introduce a custom Redis driver (`predis_resilient`) that automatically retries **idempotent** commands on `ConnectionException`, with disconnect/reconnect between attempts and exponential backoff.

**Architecture:**

- `ResilientPredisConnection` extends Laravel's `PredisConnection`, overrides `command()` to catch `ConnectionException` and retry only idempotent commands
- `ResilientPredisConnector` extends `PredisConnector`, calls `parent::connect()` to reuse all config/TLS/option handling, then wraps the `Predis\Client` in `ResilientPredisConnection`
- `RedisResilienceServiceProvider` registers the driver via `RedisManager::extend('predis_resilient', ...)`
- Activated by setting `REDIS_CLIENT=predis_resilient` in `.env` — zero behavior change without this flag

**Retry behavior:**

| Command type | On ConnectionException | Rationale |
|---|---|---|
| Idempotent (GET, SET, DEL, HSET, EXPIRE, ...) | Disconnect, reconnect, retry up to N times with exponential backoff | Executing twice produces the same result |
| Non-idempotent (INCR, LPUSH, RPUSH, EVAL, ...) | Rethrow immediately | Command may have executed before the read-side failed; retrying could duplicate data |

**Configuration** (per-connection in `config/database.php`):

| Parameter | Env var | Default | Description |
|---|---|---|---|
| `retry_limit` | `REDIS_RETRY_LIMIT` | 2 | Max retry attempts (0 disables retries) |
| `retry_delay` | `REDIS_RETRY_DELAY` | 50 | Base delay in ms (doubles each attempt: 50, 100, 200) |

### 2. Job Dispatch Fallback for Audit Logging

Separately, `AuditLogOtlpStrategy` was updated to dispatch `EmitAuditLogJob` via `JobDispatcher::withSyncFallback()` instead of `EmitAuditLogJob::dispatch()`. This catches Redis failures at the job dispatch level and runs the audit log emission synchronously as a fallback — preventing audit logging from ever failing the parent business transaction.

## Consequences

### Positive

- Transient Redis failures on idempotent commands (cache GET/SET, session reads, key expiry) are automatically recovered without application-level error handling
- Non-idempotent commands (queue pushes, counters, list operations) are never retried, preventing data duplication
- Opt-in activation via env var — no risk to existing deployments
- Per-connection retry configuration allows tuning (e.g., more retries for cache, fewer for workers)
- Audit log failures can no longer crash business transactions

### Negative

- Retry adds latency on failure (up to ~350ms with defaults: 50 + 100 + 200ms backoff)
- `usleep()` in the retry loop blocks the PHP process during backoff — acceptable for 2-3 retries but would need async handling at higher retry counts
- The idempotent command list is manually maintained and must be updated if new Redis commands are used

### Neutral

- Queue push operations (`EVAL` with Lua scripts) are NOT retried by the resilient connection — they remain protected by `JobDispatcher::withSyncFallback` / `withDbFallback` at the application layer
- The `predis_resilient` driver shares the same Predis `Client` configuration as `predis` — no TLS, auth, or timeout differences

## Alternatives Considered

1. **Predis built-in retry** — Predis does not offer connection retry on command failure (only on initial connect via `aggregate` connections). Rejected.

2. **Retry all commands** — Would risk duplicating non-idempotent operations (double LPUSH, double INCR) when the failure occurs after the command was sent but before the response was read. Rejected.

3. **Catch-and-ignore at every call site** — Would require wrapping every Redis call in try/catch throughout the codebase. Not maintainable. Rejected.

4. **Switch to phpredis extension** — phpredis has built-in retry support, but would require changing the entire Redis integration layer and testing all connection configurations. Disproportionate effort for the problem at hand. Not pursued.

## Files

| File | Purpose |
|---|---|
| `app/Redis/ResilientPredisConnection.php` | Connection with retry logic |
| `app/Redis/ResilientPredisConnector.php` | Connector that swaps connection class |
| `app/Providers/RedisResilienceServiceProvider.php` | Registers `predis_resilient` driver |
| `config/database.php` | Added `retry_limit`, `retry_delay` to Redis connections |
| `config/app.php` | Registered service provider |
| `app/Audit/AuditLogOtlpStrategy.php` | Changed to `JobDispatcher::withSyncFallback` |
| `tests/Redis/ResilientPredisConnectionTest.php` | 10 unit tests |
| `.github/workflows/push.yml` | Added Redis tests to CI |
38 changes: 38 additions & 0 deletions app/Providers/RedisResilienceServiceProvider.php
Original file line number Diff line number Diff line change
@@ -0,0 +1,38 @@
<?php namespace App\Providers;
/**
* Copyright 2026 OpenStack Foundation
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
* http://www.apache.org/licenses/LICENSE-2.0
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
**/

use App\Redis\ResilientPredisConnector;
use Illuminate\Contracts\Foundation\Application;
use Illuminate\Redis\RedisManager;
use Illuminate\Support\ServiceProvider;

/**
* Class RedisResilienceServiceProvider
*
* Registers the "predis_resilient" Redis driver which adds automatic
* retry-with-reconnect for idempotent commands on transient failures.
*
* To activate, set REDIS_CLIENT=predis_resilient in your .env.
*/
class RedisResilienceServiceProvider extends ServiceProvider
{
public function boot(): void
{
$this->app->afterResolving('redis', function (RedisManager $redis, Application $app) {
$redis->extend('predis_resilient', function () {
return new ResilientPredisConnector();
});
});
}
}
125 changes: 125 additions & 0 deletions app/Redis/ResilientPredisConnection.php
Original file line number Diff line number Diff line change
@@ -0,0 +1,125 @@
<?php namespace App\Redis;
/**
* Copyright 2026 OpenStack Foundation
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
* http://www.apache.org/licenses/LICENSE-2.0
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
**/

use Illuminate\Redis\Connections\PredisConnection;
use Illuminate\Support\Facades\Log;
use Predis\Connection\ConnectionException;

/**
* Class ResilientPredisConnection
*
* Extends the default PredisConnection to add automatic retry with
* reconnect for idempotent Redis commands on transient connection failures.
*
* Non-idempotent commands (INCR, LPUSH, RPUSH, EVAL, etc.) are never
* retried because the command may have already been executed on the
* server before the read-side of the connection failed.
*/
class ResilientPredisConnection extends PredisConnection
{
private int $retryLimit;

private int $retryDelay;

/**
* Commands that are safe to retry after a connection failure.
* A command is safe when executing it twice produces the same result.
*/
private const IDEMPOTENT_COMMANDS = [
// reads
'GET', 'MGET', 'HGET', 'HGETALL', 'HMGET', 'HEXISTS', 'HLEN', 'HKEYS', 'HVALS',
'LLEN', 'LRANGE', 'LINDEX',
'SCARD', 'SMEMBERS', 'SISMEMBER',
'ZCARD', 'ZCOUNT', 'ZRANGE', 'ZRANGEBYSCORE', 'ZREVRANGEBYSCORE', 'ZSCORE', 'ZRANK', 'ZREVRANK',
'EXISTS', 'TYPE', 'TTL', 'PTTL', 'KEYS', 'SCAN', 'HSCAN', 'SSCAN', 'ZSCAN',
'INFO', 'PING', 'DBSIZE', 'TIME', 'STRLEN', 'GETRANGE',
// idempotent writes
'SET', 'SETEX', 'PSETEX', 'MSET', 'SETNX', 'GETSET',
'HSET', 'HMSET', 'HSETNX',
'DEL', 'HDEL', 'UNLINK',
'EXPIRE', 'EXPIREAT', 'PEXPIRE', 'PEXPIREAT', 'PERSIST',
'SADD', 'SREM',
'ZADD', 'ZREM', 'ZREMRANGEBYSCORE', 'ZREMRANGEBYRANK',
];

/**
* @param \Predis\Client $client
* @param int $retryLimit Max number of retries (0 = no retries, behaves like stock PredisConnection)
* @param int $retryDelay Base delay in milliseconds between retries (doubled each attempt)
*/
public function __construct($client, int $retryLimit = 2, int $retryDelay = 50)
{
parent::__construct($client);
$this->retryLimit = $retryLimit;
$this->retryDelay = $retryDelay;
}

/**
* @inheritdoc
*/
public function command($method, array $parameters = [])
{
try {
return parent::command($method, $parameters);
} catch (ConnectionException $e) {
if (!$this->isIdempotent($method)) {
throw $e;
}
return $this->retryCommand($method, $parameters, $e);
}
}

/**
* Retry an idempotent command after reconnecting.
*/
private function retryCommand(string $method, array $parameters, ConnectionException $previous): mixed
{
$lastException = $previous;

for ($attempt = 1; $attempt <= $this->retryLimit; $attempt++) {
$delay = $this->retryDelay * (2 ** ($attempt - 1)); // exponential back-off

Log::warning('ResilientPredisConnection: retrying command', [
'command' => strtoupper($method),
'attempt' => $attempt,
'max_retries' => $this->retryLimit,
'delay_ms' => $delay,
'error' => $previous->getMessage(),
]);

usleep($delay * 1000);

try {
$this->client->disconnect();

return parent::command($method, $parameters);
} catch (ConnectionException $e) {
$lastException = $e;
}
}

Log::error('ResilientPredisConnection: all retries exhausted', [
'command' => strtoupper($method),
'retries' => $this->retryLimit,
'error' => $lastException->getMessage(),
]);

throw $lastException;
}

private function isIdempotent(string $method): bool
{
return in_array(strtoupper($method), self::IDEMPOTENT_COMMANDS, true);
}
}
41 changes: 41 additions & 0 deletions app/Redis/ResilientPredisConnector.php
Original file line number Diff line number Diff line change
@@ -0,0 +1,41 @@
<?php namespace App\Redis;
/**
* Copyright 2026 OpenStack Foundation
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
* http://www.apache.org/licenses/LICENSE-2.0
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
**/

use Illuminate\Redis\Connectors\PredisConnector;

/**
* Class ResilientPredisConnector
*
* Wraps the stock PredisConnector, reusing all its config/TLS/option
* handling, and swaps the returned connection to ResilientPredisConnection.
*/
class ResilientPredisConnector extends PredisConnector
{
/**
* @inheritdoc
*/
public function connect(array $config, array $options)
{
$connection = parent::connect($config, $options);

$retryLimit = (int) ($config['retry_limit'] ?? 2);
$retryDelay = (int) ($config['retry_delay'] ?? 50);

return new ResilientPredisConnection(
$connection->client(),
$retryLimit,
$retryDelay
);
}
}
1 change: 1 addition & 0 deletions config/app.php
Original file line number Diff line number Diff line change
Expand Up @@ -156,6 +156,7 @@
// App\Providers\BroadcastServiceProvider::class,
App\Providers\EventServiceProvider::class,
App\Providers\RouteServiceProvider::class,
App\Providers\RedisResilienceServiceProvider::class,
// Services
App\Repositories\RepositoriesProvider::class,
App\Services\FileSystem\Swift\SwiftServiceProvider::class,
Expand Down
12 changes: 9 additions & 3 deletions config/database.php
Original file line number Diff line number Diff line change
Expand Up @@ -172,7 +172,6 @@
*/

'redis' => [

'client' => env('REDIS_CLIENT', 'predis'),
/*
* @see https://github.com/predis/predis/wiki/Connection-Parameters
Expand All @@ -186,8 +185,9 @@
'scheme' => env('REDIS_SCHEME', 'tcp'),
'read_write_timeout' => env('REDIS_READ_WRITE_TIMEOUT', -1),
'timeout' => env('REDIS_TIMEOUT', 30),
'retry_limit' => (int) env('REDIS_RETRY_LIMIT', 2),
'retry_delay' => (int) env('REDIS_RETRY_DELAY', 50),
],

'cache' => [
'host' => env('REDIS_HOST'),
'port' => env('REDIS_PORT'),
Expand All @@ -196,8 +196,9 @@
'scheme' => env('REDIS_SCHEME', 'tcp'),
'read_write_timeout' => env('REDIS_READ_WRITE_TIMEOUT', -1),
'timeout' => env('REDIS_TIMEOUT', 30),
'retry_limit' => (int) env('REDIS_RETRY_LIMIT', 2),
'retry_delay' => (int) env('REDIS_RETRY_DELAY', 50),
],

'session' => [
'host' => env('REDIS_HOST'),
'port' => env('REDIS_PORT'),
Expand All @@ -206,6 +207,9 @@
'scheme' => env('REDIS_SCHEME', 'tcp'),
'read_write_timeout' => env('REDIS_READ_WRITE_TIMEOUT', -1),
'timeout' => env('REDIS_TIMEOUT', 30),

'retry_limit' => (int) env('REDIS_RETRY_LIMIT', 2),
'retry_delay' => (int) env('REDIS_RETRY_DELAY', 50),
],

'worker' => [
Expand All @@ -216,6 +220,8 @@
'scheme' => env('REDIS_SCHEME', 'tcp'),
'read_write_timeout' => env('REDIS_READ_WRITE_TIMEOUT', -1),
'timeout' => env('REDIS_TIMEOUT', 30),
'retry_limit' => (int) env('REDIS_RETRY_LIMIT', 2),
'retry_delay' => (int) env('REDIS_RETRY_DELAY', 50),
],
],
'allow_disabled_pk' => env('DB_ALLOW_DISABLED_PK', false),
Expand Down
Loading