Recovery Belongs to the Engine, Not the Scheduler

The engine’s Quartz backend ran startup recovery — scanning event logs for workers that were scheduled but never completed, then rescheduling them. It tracked the outcome with a volatile RecoveryStatus field. The problem wasn’t that recovery was broken. It was that it lived in the wrong place.

I started with a straightforward task: wire RecoveryStatus to a health check endpoint. The obvious implementation was @Liveness health check → inject QuartzWorkerExecutionManager → read getRecoveryStatus(). Three files, thirty minutes.

But the CompositeWorkerExecutionManager work from the previous session had just established @WorkerBackend as the abstraction layer for execution backends. A health check that bypasses the abstraction and reaches into one concrete backend is the kind of quiet coupling that compounds until the abstraction is fiction.

So I traced the actual ownership chain. WorkerExecutionRecoveryService does the recovery work — it’s backend-agnostic, lives in common, and routes through the composite. QuartzWorkerExecutionManager.onStart() calls it and tracks the status. But recovery doesn’t belong to Quartz any more than a health check belongs to one database driver. The Quartz backend just happened to be the one that existed when recovery was written.

The fix was to extract recovery initiation into a WorkerRecoveryCoordinator in runtime. The coordinator owns the startup observer (@Priority(22), slotting between Quartz listener registration at 20 and human task recovery at 25), calls the recovery service, tracks the status, and exposes it to a @Liveness health check. The Quartz backend retains only what’s genuinely Quartz-specific: registering the job listener.
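The shape of that coordinator can be sketched in plain Java. This is an illustration under assumptions, not the project's code: the CDI wiring (startup observer, @Priority, @Liveness) is left out so the block runs without dependencies, CompletableFuture stands in for the reactive type, and the names RecoveryCoordinatorSketch, Coordinator, isLive and the COMPLETED status are hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

final class RecoveryCoordinatorSketch {

    // PENDING and FAILED appear in the post; COMPLETED is an assumed name.
    enum RecoveryStatus { PENDING, COMPLETED, FAILED }

    // Hypothetical stand-in for WorkerRecoveryCoordinator, without CDI annotations.
    static final class Coordinator {
        private final Supplier<CompletableFuture<Void>> recovery;
        private volatile RecoveryStatus status = RecoveryStatus.PENDING;

        Coordinator(Supplier<CompletableFuture<Void>> recovery) {
            this.recovery = recovery;
        }

        // In the real coordinator this would be the startup observer method.
        void onStart() {
            recovery.get().whenComplete((ignored, error) ->
                status = (error == null) ? RecoveryStatus.COMPLETED : RecoveryStatus.FAILED);
        }

        RecoveryStatus status() {
            return status;
        }

        // Liveness view: only FAILED reports DOWN; PENDING and COMPLETED report UP.
        boolean isLive() {
            return status != RecoveryStatus.FAILED;
        }
    }

    public static void main(String[] args) {
        Coordinator ok = new Coordinator(() -> CompletableFuture.completedFuture(null));
        ok.onStart();
        System.out.println(ok.status() + " live=" + ok.isLive());

        Coordinator broken = new Coordinator(
            () -> CompletableFuture.failedFuture(new IllegalStateException("recovery failed")));
        broken.onStart();
        System.out.println(broken.status() + " live=" + broken.isLive());
    }
}
```

The point of the sketch is the dependency direction: the coordinator knows only about a recovery operation and a status, and nothing in it names a concrete backend.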

One thing the design review caught that I’d missed: a configurable timeout. Without it, a hung recovery (the Uni never completing) leaves the status at PENDING, which maps to the “everything is fine” health signal, permanently. Adding casehub.engine.recovery.timeout (default 60s) and applying .ifNoItem().after(timeout).fail() to the recovery Uni converts the hang into FAILED, which maps to liveness DOWN, which triggers a restart.
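The effect of the timeout can be shown with a dependency-free sketch. It swaps in CompletableFuture.orTimeout (Java 9+) for the Mutiny call named above, so it illustrates the behaviour rather than the actual implementation; RecoveryTimeoutSketch, start, isLive and the COMPLETED status are hypothetical names, and the 100 ms timeout is only there to keep the demo fast.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

final class RecoveryTimeoutSketch {

    // PENDING and FAILED appear in the post; COMPLETED is an assumed name.
    enum RecoveryStatus { PENDING, COMPLETED, FAILED }

    private volatile RecoveryStatus status = RecoveryStatus.PENDING;

    // Bounds the recovery with a timeout; the returned future completes
    // once the status has left PENDING.
    CompletableFuture<RecoveryStatus> start(CompletableFuture<Void> recovery, Duration timeout) {
        return recovery
            // Completes the future exceptionally if nothing arrives in time.
            .orTimeout(timeout.toMillis(), TimeUnit.MILLISECONDS)
            .handle((ignored, error) -> {
                status = (error == null) ? RecoveryStatus.COMPLETED : RecoveryStatus.FAILED;
                return status;
            });
    }

    RecoveryStatus status() {
        return status;
    }

    // Liveness view: only FAILED reports DOWN.
    boolean isLive() {
        return status != RecoveryStatus.FAILED;
    }

    public static void main(String[] args) {
        RecoveryTimeoutSketch sketch = new RecoveryTimeoutSketch();
        // A future that is never completed models the hung recovery.
        CompletableFuture<Void> hung = new CompletableFuture<>();
        RecoveryStatus result = sketch.start(hung, Duration.ofMillis(100)).join();
        System.out.println(result + " live=" + sketch.isLive());
    }
}
```

Without the orTimeout line the status would stay PENDING and isLive() would keep returning true, which is exactly the silent hang the review flagged.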

The behavioural change is worth noting: recovery now fires unconditionally from runtime, not gated on scheduler-quartz being on the classpath. That’s correct — the recovery service already routes through the composite, which dispatches to whichever backends are available. Gating recovery on one backend was the accidental coupling showing through.
