The Wrong Hypothesis and the JSON Document That Ate Every Binding

AML’s entire test suite was dead. Every @QuarkusTest that started a case and drained to completion timed out — cases started, bindings evaluated, and then nothing happened. No workers fired. No errors. The case stayed RUNNING forever.

I started where the symptoms pointed: the fireAsync() chain in WorkflowExecutionCompletedHandler. The handler awaits CDI observer completion before publishing CONTEXT_CHANGED. A blocking observer (say, one deadlocked on H2 lock contention) would prevent CONTEXT_CHANGED from ever reaching the binding evaluator. I wrote a test: inject a blocking WorkerDecisionEvent observer, start a case, verify it still completes. The test failed, confirming the blocking path existed, and the fix was clean: fire CDI audit events as true fire-and-forget. Case state must not be gated on optional observer completion.
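The difference between the gated and fire-and-forget variants can be sketched with plain java.util.concurrent. This is not the engine's handler or the CDI API: the class and method names are hypothetical, and a never-completing CompletableFuture stands in for the future that fireAsync() would hand back when an observer is blocked.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: simulates a handler that publishes only after observers finish,
// versus one that does not wait for them.
public class FireAndForgetSketch {

    /** Gated variant: the publish step is reached only if the observers complete in time. */
    public static boolean publishAfterObservers(CompletableFuture<Void> observers, long timeoutMs) {
        try {
            observers.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;  // CONTEXT_CHANGED would be published here
        } catch (TimeoutException e) {
            return false; // observer never finished, so the publish never happens
        } catch (Exception e) {
            return false; // interrupted or observer failed; still gated in this variant
        }
    }

    /** Fire-and-forget variant: observer failures are only logged; publishing is not gated. */
    public static boolean publishWithoutWaiting(CompletableFuture<Void> observers) {
        observers.exceptionally(t -> {
            System.err.println("observer failed: " + t);
            return null;
        });
        return true;      // CONTEXT_CHANGED would be published here, regardless of observers
    }

    public static void main(String[] args) {
        // A future that is never completed stands in for a deadlocked observer.
        CompletableFuture<Void> blocked = new CompletableFuture<>();
        System.out.println(publishAfterObservers(blocked, 200)); // prints false
        System.out.println(publishWithoutWaiting(blocked));      // prints true
    }
}
```

In the gated variant a single stuck observer holds up the state transition; in the second variant the observer is purely informational, which matches the rule that case state must not depend on optional observers.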

Except the AML test still failed.

I’d spent hours tracing CDI observer chains, tenancy ID propagation, @DefaultBean resolution, and Quartz job scheduling. Every hypothesis was internally consistent and wrong. The engine’s own 740 tests passed. The difference had to be in the consumer context — but I was diagnosing from the engine’s code, not the consumer’s test.

The user interrupted and told me to stop. Reproduce in AML. Use first principles. Don’t guess.

I ran AmlLayer5ResourceTest. Three minutes. The output was definitive:

INFO  CaseContextChangedEventHandler: Rules+milestones+goals processed for caseId: 0e1ed5a8...

No "Agent selected:" log line followed. The handler ran binding evaluation, found zero matches, and moved on. Every binding condition (.transaction != null, .entityResolution != null) evaluated to false, on data that was demonstrably present.

The panels refactor from June 9 had changed CaseContextImpl.asJsonNode(). It used to return flat working data:

{"transaction": {...}, "entityResolution": {...}}

Now it returns a panel document:

{"working": {"transaction": {...}}, "semantic": {...}, "episodic": {...}}

Every JQ expression in every consumer YAML evaluates against context.asJsonNode(). .transaction looks for a top-level key. It doesn’t exist — transaction lives under .working.transaction. The expression is false. The binding doesn’t fire. The case stays RUNNING. No error, no warning, nothing.
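The silent failure is easy to reproduce outside the engine. The sketch below does not use the engine's JQ evaluator: it is a hypothetical stand-in that resolves dotted paths against nested maps, with a made-up transaction payload, to show why the same condition flips from true to false once the working data is wrapped in a panel document.

```java
import java.util.Map;

// Hypothetical sketch: a tiny dotted-path lookup standing in for a JQ path expression.
public class PanelPathSketch {

    /** Resolves a path such as ".working.transaction"; returns null if any key is missing. */
    public static Object resolve(Map<String, Object> root, String path) {
        Object current = root;
        for (String key : path.split("\\.")) {
            if (key.isEmpty()) continue;                // empty segment before the leading dot
            if (!(current instanceof Map)) return null; // cannot descend into a non-object
            current = ((Map<?, ?>) current).get(key);
        }
        return current;
    }

    /** Mirrors a binding condition of the form "<path> != null". */
    public static boolean notNull(Map<String, Object> root, String path) {
        return resolve(root, path) != null;
    }

    public static void main(String[] args) {
        // Illustrative data only; the real payloads are elided in the post.
        Map<String, Object> working = Map.of("transaction", Map.of("id", "tx-1"));
        Map<String, Object> panelDoc =
                Map.of("working", working, "semantic", Map.of(), "episodic", Map.of());

        System.out.println(notNull(working, ".transaction"));          // true: old flat shape
        System.out.println(notNull(panelDoc, ".transaction"));         // false: no top-level key
        System.out.println(notNull(panelDoc, ".working.transaction")); // true: qualified path
    }
}
```

A missing key resolves to null rather than raising an error, so a "!= null" condition quietly evaluates to false. That is why nothing was logged.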

The engine’s own tests were updated to use .working.* paths as part of the panels refactor. Engine CI stayed green. Consumer apps — AML, clinical, life, devtown — were never touched. Their YAML still uses unqualified paths. Every single one of them is broken against the current engine SNAPSHOT.

The fix: all JQ evaluation points must evaluate against the working panel, not the full panel document. Eight production files. Fifty-five test files (stripping the .working. prefix back out). The panel structure is an engine implementation detail — YAML definitions should not know about it.

// Before (broken):
jqEvaluator.eval(expr, context.asJsonNode());

// After (fixed):
jqEvaluator.eval(expr, context.panel(ContextPanel.WORKING).asJsonNode());

The branch also closed three other issues: trigger context threading (#231: channelId and correlationId now propagate from signal() through to ProvisionContext), the CaseOutcomeObserver SPI (#477: a lifecycle hook at case close for CBR Retain), and the ActionGatePolicy enum (#472: a shared vocabulary for domain classifiers).

The fire-and-forget CDI change from the wrong hypothesis turned out to be a valid improvement on its own — blocking observers really can stall case state progression. It just wasn’t the root cause. The root cause was a JSON document structure change that broke every consumer app silently, and the only way to find it was to stop guessing from the engine and run the consumer test.
