limits_concurrency bypassed after non-graceful shutdown #735

@npadgett

Summary

When a worker is force-killed during shutdown (shutdown timeout exceeded), jobs protected by limits_concurrency can run concurrently after restart.

Steps to reproduce

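class SlowJob < ActiveJob::Base
  # No `to:` given, so the limit is the default of 1 job per key.
  limits_concurrency key: "slow_job", duration: 5.minutes

  def perform
    # Runs far longer than the shutdown timeout, so the worker gets force-killed.
    sleep 1.hour
  end
end

  1. Enqueue 3 SlowJob instances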
  2. Start SolidQueue supervisor in fork mode (worker thread_pool_size: 3)
  3. Wait for the first job to start (semaphore acquired, others blocked)
  4. Send SIGTERM to the supervisor
  5. SolidQueue.shutdown_timeout (default 5s) expires — supervisor force-kills the worker
  6. Start a new supervisor
  7. Two or more jobs start Performing concurrently, violating the concurrency limit of 1

Expected behavior

Only one SlowJob runs at a time after restart, same as before the shutdown.

Actual behavior

Multiple jobs with the same concurrency key run simultaneously after restart.

Root cause

Supervisor#start calls start_processes (line 39), which starts the dispatcher and workers concurrently. The dispatcher's ConcurrencyMaintenance is initialized with Concurrent::TimerTask.new(run_now: true), so it does run expire_semaphores and unblock_blocked_executions at boot — but in a background thread. Meanwhile, the worker starts polling immediately and can claim multiple jobs before the maintenance thread completes.

The sequence:

  1. Old worker is force-killed mid-job, leaving a stale semaphore in solid_queue_semaphores
  2. The "Release claimed jobs" step runs, putting the interrupted job back in the ready queue
  3. New supervisor starts — dispatcher and workers boot concurrently
  4. Dispatcher's maintenance starts in a background thread (Concurrent::TimerTask)
  5. Worker starts polling (every 0.1s), claims multiple ready jobs before maintenance has expired the stale semaphore and unblocked blocked executions
  6. Concurrency limit is violated

Observed in production logs

14:38:39 Supervisor wasn't terminated gracefully - shutdown timeout exceeded (5018.5ms)
14:38:39 Release claimed jobs (90.1ms) size: 1
...
14:51:47 ==> Your service is live
14:51:50 [Job ff2291c7] Performing RefreshDataJob (az4n-8mr2)
14:51:50 [Job b1ddfa0c] Performing RefreshDataJob (6sqe-dvqs)

Both jobs use limits_concurrency key: self (limit 1) but started in the same second after a deploy that triggered a non-graceful shutdown.

Possible fix

Run ConcurrencyMaintenance#expire_semaphores and #unblock_blocked_executions synchronously during dispatcher boot, before workers start polling. This would ensure stale semaphores from dead processes are cleaned up before any jobs are claimed.
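To make the ordering concrete, here is a minimal plain-Ruby sketch of the idea. It is a toy model, not a patch: EventLog, ToyMaintenance, ToyWorker and boot are hypothetical stand-ins I made up for illustration, the two maintenance methods only record that they ran, and a plain Thread replaces Concurrent::TimerTask so the snippet has no dependencies. It also runs everything in one process, so it does not model fork mode, where the dispatcher and workers are separate processes and the supervisor would presumably have to enforce the ordering.

```ruby
# Toy model only: none of these classes are Solid Queue's API.

# Thread-safe event recorder shared by the maintenance task and the worker.
class EventLog
  def initialize
    @events = []
    @mutex = Mutex.new
  end

  def <<(event)
    @mutex.synchronize { @events << event }
    self
  end

  def to_a
    @mutex.synchronize { @events.dup }
  end
end

class ToyMaintenance
  def initialize(log, interval: 60)
    @log = log
    @interval = interval
    @thread = nil
  end

  # Run one full pass in the calling thread, then schedule periodic passes.
  def start
    run_once # synchronous: returns only after the cleanup pass has finished
    @thread = Thread.new do
      loop do
        sleep @interval
        run_once
      end
    end
    self
  end

  def stop
    @thread&.kill
    @thread&.join
  end

  private

  def run_once
    expire_semaphores
    unblock_blocked_executions
  end

  # Stand-ins for the real maintenance steps; they only record that they ran.
  def expire_semaphores
    @log << :expire_semaphores
  end

  def unblock_blocked_executions
    @log << :unblock_blocked_executions
  end
end

class ToyWorker
  def initialize(log)
    @log = log
  end

  def poll
    @log << :worker_poll
  end
end

# Boot order proposed above: the first maintenance pass completes before
# the worker polls for the first time.
def boot(log)
  maintenance = ToyMaintenance.new(log).start
  ToyWorker.new(log).poll
  maintenance
end
```

Calling boot(EventLog.new) records the two maintenance steps before the first worker poll, which is the guarantee the fix is after. With run_now: true on a background timer there is no such guarantee, since the first pass and the first poll race each other.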

Environment

  • solid_queue 1.4.0
  • Rails 8.1
  • Ruby 3.4.7
  • PostgreSQL 16
  • Fork mode supervisor
