casehub-pages — The Dedup Semantics Question
Date: 2026-07-02 Type: phase-update
What I was trying to achieve: a missing aggregation primitive
The pages-data aggregation engine had join() and distinct() but no way to concatenate deduplicated values. The clinical dashboard was computing distinct strings server-side as a workaround — the second time a gap in the aggregation engine forced a server-side computation. Time to close it.
The design question that mattered
The implementation itself was mechanical — a Set, a separator, done. The interesting question was deduplication semantics. The engine already had countDistinct, which uses type-prefixed keys: NUMBER 1 and TEXT "1" are two distinct values because they have different types. That’s correct for counting — you’re asking “how many unique values exist?”
But distinctJoin produces a human-readable string. If NUMBER 1 and TEXT "1" both survive dedup, the output reads "1, 1" — which looks like a bug to anyone reading it. I chose string-level dedup: convert to the output representation first, then deduplicate. PostgreSQL’s STRING_AGG(DISTINCT ...) works the same way.
The naming landed on DISTINCTJOIN (compact, no underscore) following PostgreSQL convention rather than the Java-style DISTINCT_JOIN. Minor call, but the SQL world’s STRING_AGG and GROUP_CONCAT patterns are what users will recognise.
cellValueToString — preventing the next drift
The design review caught something I’d have missed: joinValues and distinctJoinValues would both inline the same type-dispatch logic for converting cells to strings. That’s a drift hazard — if DATE formatting changes, one gets updated and the other doesn’t. We extracted cellValueToString as a shared helper. Both functions now call it. getBucketName stays separate because it has different NULL semantics (produces "null" as a string rather than skipping).
The review also caught a missing barrel re-export in dsl/index.ts and pushed the manual AggregationFnType union to a derived type (Aggregation["fn"]). That last one is the kind of improvement that pays off every time someone adds a new aggregation — the compiler enforces the sync forever.
Where this leaves things
distinctJoin is the third time a clinical dashboard gap has surfaced a missing primitive in pages-data. The pattern is consistent: server-side workarounds accumulate until someone adds the right aggregation. The engine’s function set is now reasonably complete for the use cases in play, but the next gap will likely be cross-column access — the expression evaluator can’t reference multiple columns in a single computation.