What 186 meant

The branch was closed. CI was sitting at the expected baseline — six pre-existing MeshResourceInterjectionTest failures from a Qhorus SNAPSHOT bug, one known ResearcherCaseCompletionTest failure. That was the state I intended to leave.

Then a second CI run came in. CaseEngineRoundTripTest was failing. Not consistently — it had passed in the first run. That inconsistency usually means timing, but the error said something stranger: Expected size: 1 but was: 186.

One hundred and eighty-six ledger entries. The test asserts exactly one completed-worker summary. Getting 186 means the provision loop fired 93 times — the engine kept scheduling new workers, each detecting exit and writing a completion record.

This only happens when the when-guard never clears. The guard is workers.researcher.started != true. It clears when CaseHubRuntime.signal() sends the started signal and SignalReceivedEventHandler processes it. Something in that chain was broken.

We had already changed signal() to use .toCompletableFuture().join() after discovering its return type changed from void to CompletionStage<Void> in the engine SNAPSHOT. Local builds used the new API. CI downloads the remote jar — build 128, from May 29 — which still has void signal(). Our compiled bytecode referenced the CompletionStage<Void> descriptor. CI threw NoSuchMethodError.

The tricky part: catch (Throwable e) absorbed it. The signal appeared to have been sent. It hadn’t. The guard stayed true. Ninety-three provisions later, the test timed out with a list of 186.
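
The mechanism is easy to reproduce in miniature. The sketch below is not the engine's code: GuardLoopSketch, provisionWhileGuardSet, and the one-entry-per-provision bookkeeping are hypothetical stand-ins, and maxProvisions stands in for the test timeout. It only shows how a swallowed Error leaves the guard set and lets the retry loop fill the ledger.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

final class GuardLoopSketch {

    /**
     * Hypothetical provision loop: keeps provisioning while the "started" flag
     * is unset, writing one ledger entry per provision, up to maxProvisions.
     */
    static List<String> provisionWhileGuardSet(
            Runnable sendStarted, AtomicBoolean started, int maxProvisions) {
        List<String> ledger = new ArrayList<>();
        for (int n = 1; !started.get() && n <= maxProvisions; n++) {
            try {
                sendStarted.run(); // expected to flip the flag
            } catch (Throwable t) {
                // Swallowed on purpose: this is the pattern that hid the failure.
                // A NoSuchMethodError is an Error, so only catch (Throwable) absorbs it.
            }
            ledger.add("completed-worker-" + n);
        }
        return ledger;
    }
}
```

With a working signal the loop runs once and the ledger has one entry. With a signal that throws NoSuchMethodError, the flag never flips and the ledger grows until the bound is hit.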

The error message itself is worth reading carefully:

NoSuchMethodError: 'void io.casehub.api.engine.CaseHubRuntime.signal(...)'

The natural read is “the runtime has the wrong API; it only has a void version.” That’s backwards. The message shows the descriptor the COMPILED code expected, not what the runtime provides. Our stale sibling-module bytecode still carried the void descriptor from before we’d rebuilt the casehub module against the updated jar. The runtime had CompletionStage<Void>. The signature printed in a NoSuchMethodError is the one that could not be found, so it describes the caller’s assumption rather than the callee’s reality.
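
The reason a changed return type breaks linking is that the return type is part of the method descriptor. Here is a minimal sketch of that, using java.lang.invoke, which resolves methods by exact descriptor much as a compiled call site does. DescriptorSketch and NewRuntime are hypothetical names, and the lookup throws the checked NoSuchMethodException rather than the NoSuchMethodError seen in CI, so treat it as an analogy rather than a reproduction.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;

final class DescriptorSketch {

    /** Hypothetical runtime whose signal() returns a CompletionStage, like the newer API. */
    public static final class NewRuntime {
        public CompletionStage<Void> signal(String key) {
            return CompletableFuture.completedFuture(null);
        }
    }

    /** Returns true if signal(String) resolves with exactly the given return type. */
    static boolean resolves(Class<?> returnType) {
        try {
            MethodHandles.lookup().findVirtual(
                NewRuntime.class, "signal",
                MethodType.methodType(returnType, String.class));
            return true;
        } catch (NoSuchMethodException | IllegalAccessException e) {
            // Same name and parameters, but a different return type, is a different method.
            return false;
        }
    }
}
```

Here resolves(void.class) fails while resolves(CompletionStage.class) succeeds: a caller that asks for the void form is the one reporting the missing method, even though the runtime is the newer of the two.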

Running mvn test -f app/pom.xml only recompiles the app module. It was using whatever .class files were already in casehub/target/. The fix for local diagnosis was mvn clean compile -f root/pom.xml before running sub-module tests.

For CI — which always starts fresh — the real fix was a reflection-based shim that calls signal() by name regardless of return type:

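static void signal(CaseHubRuntime runtime, UUID caseId, String key, Object value) {
    try {
        // Look the method up by name and parameter types only; reflection ignores the return type.
        Method m = CaseHubRuntime.class.getMethod(
            "signal", UUID.class, String.class, Object.class);
        Object result = m.invoke(runtime, caseId, key, value);
        // Newer engine builds return a CompletionStage: wait for it to complete.
        // Older builds return void, so result is null and there is nothing to wait for.
        if (result instanceof CompletionStage<?> cs) {
            cs.toCompletableFuture().join();
        }
    } catch (Throwable t) {
        // Logged with the stack trace rather than silently dropped.
        LOG.warnf(t, "signal() failed for %s", caseId);
    }
}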

Both call sites — signalStarted() and ClaudonyLedgerEventCapture — now go through CaseHubRuntimeCompat.signal().
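
To see why one reflective call covers both API shapes, here is a self-contained sketch that runs the same idea against two stand-in runtimes. VoidRuntime, AsyncRuntime, and CompatSketch are hypothetical, the signature is simplified to (String, Object), and unlike the shim above this variant returns a boolean so a caller can tell delivery from failure.

```java
import java.lang.reflect.Method;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;
import java.util.concurrent.atomic.AtomicInteger;

final class CompatSketch {

    /** Stand-in for the older API: signal() returns void. */
    public static final class VoidRuntime {
        public final AtomicInteger calls = new AtomicInteger();
        public void signal(String key, Object value) { calls.incrementAndGet(); }
    }

    /** Stand-in for the newer API: signal() returns a CompletionStage. */
    public static final class AsyncRuntime {
        public final AtomicInteger calls = new AtomicInteger();
        public CompletionStage<Void> signal(String key, Object value) {
            return CompletableFuture.runAsync(calls::incrementAndGet);
        }
    }

    /** Calls signal(String, Object) reflectively; true means the call completed. */
    static boolean signal(Object runtime, String key, Object value) {
        try {
            // getMethod matches on name and parameter types, not on return type.
            Method m = runtime.getClass().getMethod("signal", String.class, Object.class);
            Object result = m.invoke(runtime, key, value);
            if (result instanceof CompletionStage<?> cs) {
                cs.toCompletableFuture().join(); // block until the async form finishes
            }
            return true;
        } catch (ReflectiveOperationException | RuntimeException e) {
            return false; // missing method, or the target threw
        }
    }
}
```

Both stand-ins record exactly one call after CompatSketch.signal(...), and an object with no signal method yields false instead of an exception.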

The second piece was SignalReceivedEventHandler. We’d excluded it from CasehubEnabledProfile because of an engine bug (engine#493). Without it, there is no Vert.x event bus listener for casehub.signal.received. Every signal() call fails silently with (NO_HANDLERS,-1). Re-including it — and adding a NoOpWorkerExecutionRecoveryService for its new injection requirement — cleared the loop. CI went green on the next run.

The session also fixed SystemPromptIntegrationTest. Adding @WithSession("qhorus") to listChannels() in the previous session had broken these tests: the CDI interceptor fires immediately when the method is invoked from a plain JUnit thread, before any reactive subscription, and throws No current Vertx context found. The fix is @RunOnVertxContext with UniAsserter — the test moves onto the event loop thread, and UniAsserter provides the assertion API that works without .await().

The six MeshResourceInterjectionTest failures remain. Those are in Qhorus, tracked as issue #155.

What 186 was: the count of times the engine saw the guard still set and tried again. The signal was gone before it arrived.
