Slack threads and the cache write problem

The Slack channel backend is shipped. #261 is done. But getting there involved two decisions about timing that weren’t in the spec.

The session opened with something else entirely — a CDI regression guard for a Claudony test failure. The root fix (switching ReactiveJpaChannelStore.updateLastActivity from positional to named JPQL parameters) had already landed. What was missing was a CI gate to catch the same failure pattern before it surfaces in a consumer again.

The insight from the Claudony analysis was precise: @Alternative @Priority(1) from an external jar is insufficient to override an @IfBuildProperty-gated bean when both are active in the same CDI context. The reactive JPA stores are gated behind casehub.qhorus.reactive.enabled=true, but when that gate passes, @Priority alone doesn’t win the CDI selection — the InMemory stores also need to appear in quarkus.arc.selected-alternatives. This isn’t documented; we found it by elimination.

I added StoreCdiAlternativesTest to examples/type-system, the module that has both casehub-qhorus and casehub-qhorus-testing on its classpath, matching what a consumer sees. One of the tests simply asserts reactiveChannelStore instanceof InMemoryReactiveChannelStore. It runs without PostgreSQL and catches exactly the failure mode Claudony hit.

The Slack backend itself had been designed three weeks ago (spec r4), so the implementation was mostly translation. The interesting parts were two ordering decisions.

Thread continuity across restarts. Slack replies need a thread_ts, the timestamp of the root message in the thread. Without it, your response appears as a new top-level message instead of a reply. We cache (channelId, correlationId) → thread_ts in memory, but a server restart wipes that cache. With the mapping gone, the next outbound message starts a fresh Slack thread, silently fragmenting the conversation.

The answer is to back the in-memory cache with DB rows. SlackThreadCache (V24) holds the same mapping persistently. On ChannelInitialisedEvent, SlackChannelBackend loads all rows for that channel and pre-warms the cache. The overhead is one SELECT per channel init — not per message — and one INSERT when a new commitment opens its first thread.
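A minimal sketch of that pre-warm pattern, with the database replaced by an in-memory list so it runs standalone. ThreadRowStore and WarmableThreadCache are hypothetical names used for illustration only; they are not the real SlackThreadCache or SlackChannelBackend, and the real store would issue SQL rather than scan a list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for the persistent table (assumption: a list instead of a DB). */
final class ThreadRowStore {
    record Row(String channelId, String correlationId, String threadTs) {}

    private final List<Row> rows = new ArrayList<>();

    /** Corresponds to the one INSERT when a new thread is opened. */
    synchronized void save(String channelId, String correlationId, String threadTs) {
        rows.add(new Row(channelId, correlationId, threadTs));
    }

    /** Corresponds to the one SELECT per channel initialisation. */
    synchronized List<Row> findByChannelId(String channelId) {
        List<Row> result = new ArrayList<>();
        for (Row row : rows) {
            if (row.channelId().equals(channelId)) {
                result.add(row);
            }
        }
        return result;
    }
}

/** In-memory cache that writes through to the store and can be pre-warmed from it. */
final class WarmableThreadCache {
    private final ThreadRowStore store;
    private final Map<String, Map<String, String>> byChannel = new ConcurrentHashMap<>();

    WarmableThreadCache(ThreadRowStore store) {
        this.store = store;
    }

    /** Persist first, then update memory, so a restart can always rebuild the entry. */
    void remember(String channelId, String correlationId, String threadTs) {
        store.save(channelId, correlationId, threadTs);
        byChannel.computeIfAbsent(channelId, k -> new ConcurrentHashMap<>())
                 .put(correlationId, threadTs);
    }

    /** Called once when a channel is initialised: loads every stored row for it. */
    void warm(String channelId) {
        Map<String, String> entries =
                byChannel.computeIfAbsent(channelId, k -> new ConcurrentHashMap<>());
        for (ThreadRowStore.Row row : store.findByChannelId(channelId)) {
            entries.put(row.correlationId(), row.threadTs());
        }
    }

    /** Returns the known thread_ts, or null when the caller would post top-level. */
    String threadTsFor(String channelId, String correlationId) {
        Map<String, String> entries = byChannel.get(channelId);
        return entries == null ? null : entries.get(correlationId);
    }
}
```

A new WarmableThreadCache built over the same store simulates a restart: lookups miss until warm(channelId) has run, and hit afterwards.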

The write-before-dispatch race. The subtler problem: when does the cache entry get written?

The natural approach is to receive the inbound Slack message, dispatch it to the gateway, and write the correlationId mapping once the dispatch call returns. The gateway fires your message handlers. If the agent is fast (running on a virtual thread, or just quick), it can send a RESPONSE before the cache write lands, because the write only happens after the dispatch call has returned. The post() handler looks up the cache, finds nothing, and creates a new top-level Slack message instead of a thread reply.

The fix: write both the DB row and the in-memory entry before calling gateway.receiveHumanMessage().

if (slackTs != null) {
    threadCacheStore.save(channelRef.id(), corrId, slackTs);          // DB first
    threadCache.computeIfAbsent(channelRef.id(), k -> new ConcurrentHashMap<>())
               .put(corrId, slackTs);                                  // memory second
}
gateway.receiveHumanMessage(channelRef,                                // dispatch last
    new InboundHumanMessage(senderId, content, receivedAt, metadata, corrId.toString(), null));

Obvious in hindsight. The natural mental model is receive → process → write state on the way out. Here, the state needs to be written on the way in.
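The two orderings can be compared in a small standalone sketch. It assumes the worst case, where the handler replies synchronously during dispatch, which makes the race deterministic. DispatchOrderingDemo and its methods are hypothetical and stand in for the real gateway and post() handler.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical model of the inbound path; not the real gateway or backend. */
final class DispatchOrderingDemo {
    private final Map<String, String> threadCache = new ConcurrentHashMap<>();

    /** What a reply handler does: thread reply on a cache hit, top-level on a miss. */
    String post(String correlationId) {
        String threadTs = threadCache.get(correlationId);
        return threadTs == null ? "top-level" : "reply-in-" + threadTs;
    }

    /** Stand-in for the gateway call; assumes the agent responds before it returns. */
    private String dispatch(String correlationId) {
        return post(correlationId);
    }

    /** Dispatch first, write on the way out: the fast handler sees an empty cache. */
    String dispatchThenWrite(String correlationId, String slackTs) {
        String outcome = dispatch(correlationId);
        threadCache.put(correlationId, slackTs);
        return outcome;
    }

    /** Write on the way in, dispatch last: the handler always finds the mapping. */
    String writeThenDispatch(String correlationId, String slackTs) {
        threadCache.put(correlationId, slackTs);
        return dispatch(correlationId);
    }
}
```

With a slower agent the first ordering would often appear to work, which is what makes the bug easy to miss; writing before dispatch removes the dependency on timing altogether.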

Two unexpected gotchas during integration testing.

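After a JPQL bulk DELETE in deleteByChannelId, em.find() returned the deleted entity from Hibernate’s L1 cache. The row was gone from H2; the persistence context still had it, because bulk JPQL statements execute directly against the database and do not update entities that are already managed. Switching findByChannelId to a JPQL SELECT query was the surgical fix: the query is executed against the DB, so the deleted row never appears in the result, whereas em.find() is answered from the first-level cache when the entity is already managed.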

The bigger surprise was quarkus.hibernate-orm.qhorus.packages. The qhorus PU only scans io.casehub.qhorus.runtime and io.casehub.ledger.runtime. The new Slack entities live in io.casehub.qhorus.slack. Unless that package is added to the property in the test configuration, Hibernate throws Unknown entity type even though the jar is Jandex-indexed. Any future optional qhorus module that ships JPA entities will hit the same wall.

The module builds clean and the issue is closed. The thread cache is the part worth remembering — not for Slack specifically, but for any system where a correlation mapping must exist before the first handler fires.
