What 186 meant
The branch was closed. CI was sitting at the expected baseline — six pre-existing
MeshResourceInterjectionTest failures from a Qhorus SNAPSHOT bug, one known
ResearcherCaseCompletionTest failure. That was the state I intended to leave.
Then a second CI run came in. CaseEngineRoundTripTest was failing. Not
consistently — it had passed in the first run. That inconsistency usually means
timing, but the error said something stranger: Expected size: 1 but was: 186.
One hundred and eighty-six ledger entries. The test asserts exactly one completed-worker summary. Getting 186 means the provision loop fired 93 times — the engine kept scheduling new workers, each detecting exit and writing a completion record.
This only happens when the when-guard never clears. The guard is
workers.researcher.started != true. It clears when CaseHubRuntime.signal()
sends the started signal and SignalReceivedEventHandler processes it. Something
in that chain was broken.
We had already changed signal() to use .toCompletableFuture().join() after
discovering its return type changed from void to CompletionStage<Void> in the
engine SNAPSHOT. Local builds used the new API. CI downloads the remote jar —
build 128, from May 29 — which still has void signal(). Our compiled bytecode
referenced the CompletionStage<Void> descriptor. CI threw NoSuchMethodError.
The tricky part: catch (Throwable e) absorbed it. The signal appeared to have
been sent. It hadn’t. The guard stayed true. Ninety-three provisions later, the
test timed out with a list of 186.
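A minimal sketch of that failure mode, not the engine's actual code. The class, the `state` map and `brokenSignal()` are hypothetical stand-ins; only the guard key and the `catch (Throwable)` shape come from the account above. `NoSuchMethodError` is an `Error`, so a narrower `catch (Exception e)` would have let it propagate and fail loudly.

```java
import java.util.HashMap;
import java.util.Map;

public class SwallowedSignalDemo {

    static final String KEY = "workers.researcher.started";

    // Hypothetical stand-in for case state; the real engine stores this elsewhere.
    static final Map<String, Object> state = new HashMap<>();

    // Simulates a signal() call that cannot be linked at runtime.
    // A working signal would end up doing state.put(key, value).
    static void brokenSignal(String key, Object value) {
        throw new NoSuchMethodError("signal");
    }

    // Mirrors the broad catch: the linkage error is absorbed and the caller sees nothing.
    static void signalStarted() {
        try {
            brokenSignal(KEY, true);
        } catch (Throwable t) {
            // swallowed: the signal looks sent, but the state was never updated
        }
    }

    // The when-guard: started != true.
    static boolean guardHolds() {
        return !Boolean.TRUE.equals(state.get(KEY));
    }

    // Provisions a worker for as long as the guard holds, up to a cap.
    static int provisionWhileGuardHolds(int maxAttempts) {
        int provisions = 0;
        while (guardHolds() && provisions < maxAttempts) {
            provisions++;
            signalStarted();
        }
        return provisions;
    }

    public static void main(String[] args) {
        System.out.println(provisionWhileGuardHolds(93)); // prints 93
    }
}
```

The loop only stops at the cap, which plays the role of the test timeout here.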
The error message itself is worth reading carefully:
NoSuchMethodError: 'void io.casehub.api.engine.CaseHubRuntime.signal(...)'
The natural read is “runtime has the wrong API — it only has a void version.” That’s
backwards. The message shows what the COMPILED code expected. Our stale
sibling-module bytecode still had the void descriptor from before we’d rebuilt
the casehub module against the updated jar. The runtime had CompletionStage<Void>.
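Put differently, the descriptor printed in a NoSuchMethodError is the one the caller
asked for and could not find. It describes the stale side of the mismatch, not what
the runtime actually offers.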
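To see why the return type matters here: the JVM links a call site by name plus full descriptor, and the descriptor includes the return type. The sketch below prints the two descriptors involved, using the standard `java.lang.invoke.MethodType` API. The parameter types `(UUID, String, Object)` are taken from the shim's `getMethod` call in this post; the class name is made up for illustration.

```java
import java.lang.invoke.MethodType;
import java.util.UUID;
import java.util.concurrent.CompletionStage;

public class SignalDescriptors {

    // Descriptor a caller compiled against `void signal(UUID, String, Object)` links to.
    static String voidDescriptor() {
        return MethodType
                .methodType(void.class, UUID.class, String.class, Object.class)
                .toMethodDescriptorString();
    }

    // Descriptor a caller compiled against the CompletionStage-returning version links to.
    // Generics are erased, so CompletionStage<Void> appears as plain CompletionStage.
    static String stageDescriptor() {
        return MethodType
                .methodType(CompletionStage.class, UUID.class, String.class, Object.class)
                .toMethodDescriptorString();
    }

    public static void main(String[] args) {
        // (Ljava/util/UUID;Ljava/lang/String;Ljava/lang/Object;)V
        System.out.println(voidDescriptor());
        // (Ljava/util/UUID;Ljava/lang/String;Ljava/lang/Object;)Ljava/util/concurrent/CompletionStage;
        System.out.println(stageDescriptor());
    }
}
```

The two strings differ only in the trailing return type, and that is enough for method resolution to fail in either direction.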
Running mvn test -f app/pom.xml only recompiles the app module. It was using
whatever .class files were already in casehub/target/. The fix for local
diagnosis was mvn clean compile -f root/pom.xml before running sub-module tests.
For CI — which always starts fresh — the real fix was a reflection-based shim that
calls signal() by name regardless of return type:
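    static void signal(CaseHubRuntime runtime, UUID caseId, String key, Object value) {
        try {
            // getMethod matches on name and parameter types only, not on return type,
            // so this resolves against both the void and the CompletionStage<Void> API.
            Method m = CaseHubRuntime.class.getMethod(
                    "signal", UUID.class, String.class, Object.class);
            Object result = m.invoke(runtime, caseId, key, value);
            // Newer API: block until the signal has been processed.
            if (result instanceof CompletionStage<?> cs) {
                cs.toCompletableFuture().join();
            }
        } catch (Throwable t) {
            // Log instead of swallowing silently, so a failed signal is visible.
            LOG.warnf(t, "signal() failed for %s", caseId);
        }
    }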
Both call sites — signalStarted() and ClaudonyLedgerEventCapture — now go
through CaseHubRuntimeCompat.signal().
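The same technique can be exercised without the engine on the classpath. In the sketch below, `VoidRuntime` and `StageRuntime` are hypothetical stand-ins for the old and new `CaseHubRuntime` signatures, and the shim takes `Object` so it can be pointed at either. Returning a boolean is a variation for the demo, not what the real shim does.

```java
import java.lang.reflect.Method;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;
import java.util.concurrent.atomic.AtomicInteger;

public class CompatShimDemo {

    // Hypothetical stand-in for the older API: void signal(...).
    public static class VoidRuntime {
        public final AtomicInteger received = new AtomicInteger();

        public void signal(UUID caseId, String key, Object value) {
            received.incrementAndGet();
        }
    }

    // Hypothetical stand-in for the newer API: CompletionStage<Void> signal(...).
    public static class StageRuntime {
        public final AtomicInteger received = new AtomicInteger();

        public CompletionStage<Void> signal(UUID caseId, String key, Object value) {
            return CompletableFuture.runAsync(received::incrementAndGet);
        }
    }

    // Looks signal() up by name and parameter types; the return type plays no part
    // in the lookup. Returns true if the call was made and, if asynchronous, completed.
    static boolean signal(Object runtime, UUID caseId, String key, Object value) {
        try {
            Method m = runtime.getClass().getMethod(
                    "signal", UUID.class, String.class, Object.class);
            Object result = m.invoke(runtime, caseId, key, value);
            if (result instanceof CompletionStage) {
                // Wait for the asynchronous variant to finish before returning.
                ((CompletionStage<?>) result).toCompletableFuture().join();
            }
            return true;
        } catch (ReflectiveOperationException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        VoidRuntime oldApi = new VoidRuntime();
        StageRuntime newApi = new StageRuntime();
        UUID caseId = UUID.randomUUID();

        System.out.println(signal(oldApi, caseId, "workers.researcher.started", true));
        System.out.println(signal(newApi, caseId, "workers.researcher.started", true));
        System.out.println(oldApi.received.get() + " " + newApi.received.get());
    }
}
```

Because the lookup happens at runtime, the compiled caller carries no `signal` descriptor at all, so there is nothing for the JVM to fail to link.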
The second piece was SignalReceivedEventHandler. We’d excluded it from
CasehubEnabledProfile because of an engine bug (engine#493). Without it, there
is no Vert.x event bus listener for casehub.signal.received. Every signal()
call fails silently with (NO_HANDLERS,-1). Re-including it — and adding a
NoOpWorkerExecutionRecoveryService for its new injection requirement — cleared
the loop. CI went green on the next run.
The session also fixed SystemPromptIntegrationTest. Adding @WithSession("qhorus")
to listChannels() in the previous session had broken these tests: the CDI
interceptor fires immediately when the method is invoked from a plain JUnit thread,
before any reactive subscription, and throws No current Vertx context found. The
fix is @RunOnVertxContext with UniAsserter — the test moves onto the event loop
thread, and UniAsserter provides the assertion API that works without .await().
The six MeshResourceInterjectionTest failures remain. Those are in Qhorus,
tracked as issue #155.
What 186 was: the count of times the engine saw the guard still set and tried again. The signal was gone before it arrived.