Phase 5: Porting Python's HTTP enrichment pipeline to Java
Phase 5 is enrich.py — 350 lines that transform a raw blog post HTML file
into an archive-ready version. YouTube iframes become thumbnail figures.
GitHub Gist <script> tags become inlined <pre><code> blocks. SyntaxHighlighter
brush classes become standard language-X classes. Unlabelled code blocks
get language hints from content heuristics. Leftover iframes get archive-note
wrappers. None of this sounds complicated — but it’s the first phase where
Java has to make real HTTP calls, parse JSON, and build DOM elements from
scratch rather than just querying them.

The pipeline runs once per post, the first time it’s scanned. After that the
enriched copy is cached on disk and subsequent scans read from it directly.
Order matters: YouTube iframes must be replaced before the embed fallback
wrapper runs, or you’d wrap a figure you just created. Code class normalisation
must run before language detection, or the heuristics would guess at <pre>
elements whose brush classes already name the correct language.
The Python code — what we’re porting
The Python module exposes six public functions and an orchestrator:
def replace_youtube_embeds(article, assets_dir, session) -> dict
def replace_gist_embeds(article, github_token, session) -> dict
def normalise_code_classes(article) -> dict
def detect_code_languages(article) -> dict
def replace_embed_fallbacks(article) -> dict
def enrich_post(html_path, enriched_path, assets_dir, github_token) -> dict
The first two take a requests.Session — the Python HTTP client. The
last four are pure DOM transforms. This distinction drives the Java design:
pure transforms become static methods; HTTP-dependent methods become
instance methods on a class that owns an HttpClient.
The biggest shift: HTTP clients

In Python, requests is the de-facto HTTP library. Its Session object
handles connection pooling, headers, and timeouts with a simple API:
# Python — download YouTube thumbnail
session = requests.Session()
resp = session.get(url, timeout=15)
if resp.status_code == 200 and resp.content:
    dest.write_bytes(resp.content)
Java’s built-in java.net.http.HttpClient (added in Java 11) is the modern
equivalent. It’s slightly more ceremonial but equally capable:
// Java — download YouTube thumbnail
HttpRequest req = HttpRequest.newBuilder(URI.create(url)).build();
HttpResponse<byte[]> resp = http.send(req, HttpResponse.BodyHandlers.ofByteArray());
if (resp.statusCode() == 200 && resp.body().length > 0) {
    Files.write(dest, resp.body());
}
The key difference: requests raises unchecked exceptions on network errors,
which nothing forces you to handle; Java’s HttpClient.send() throws checked
IOException and InterruptedException that you must catch or declare. Both
clients pool connections by default. Only one is dependency-free, though:
requests is a third-party package installed with pip, while HttpClient ships
in the JDK.
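The Java snippet above sets no timeout and does not show the checked exceptions being handled. The following is a minimal sketch of one way to cover both, assuming a hypothetical `ThumbnailFetcher` class with `buildRequest` and `fetchOrNull` helpers (these names are illustrative and not taken from Enricher.java). The 15-second value mirrors `timeout=15` in the Python call.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical helper for illustration; not part of the original Enricher.
class ThumbnailFetcher {
    private final HttpClient http = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(15))
            .build();

    // Per-request timeout, mirroring timeout=15 in the Python version.
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(15))
                .GET()
                .build();
    }

    // Returns the body of a non-empty 200 response, otherwise null.
    byte[] fetchOrNull(String url) {
        try {
            HttpResponse<byte[]> resp =
                    http.send(buildRequest(url), HttpResponse.BodyHandlers.ofByteArray());
            return resp.statusCode() == 200 && resp.body().length > 0 ? resp.body() : null;
        } catch (IOException e) {
            // Network failure or timeout (HttpTimeoutException is an IOException).
            return null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return null;
        }
    }
}
```

Swallowing the exception and returning null is a design choice that suits a best-effort thumbnail download; a caller that needs to distinguish "not found" from "network down" would want to propagate it instead.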
For the Gist API, headers become a slightly longer builder chain:
# Python — GitHub API call with optional auth header
headers = {'Accept': 'application/vnd.github+json'}
if github_token:
    headers['Authorization'] = f'Bearer {github_token}'
resp = session.get(url, headers=headers, timeout=30)
// Java — GitHub API call with optional auth header
HttpRequest.Builder b = HttpRequest.newBuilder(URI.create(url))
        .header("Accept", "application/vnd.github+json");
if (token != null && !token.isEmpty())
    b.header("Authorization", "Bearer " + token);
HttpResponse<String> resp = http.send(b.build(), HttpResponse.BodyHandlers.ofString());
The builder pattern in Java is more verbose but explicit about what’s
happening. The BodyHandlers selection (ofByteArray() for images,
ofString() for JSON) makes the return type clear at the call site —
something Python’s dynamically typed response object doesn’t do.
Testing HTTP without Mockito
The hardest testing question in Phase 5 is: how do you test methods that make real HTTP calls without actually making HTTP calls?
The obvious Java answer is Mockito — mock the HttpClient, stub the
response, verify the call. But mocking HttpClient is genuinely painful.
The interface hierarchy involves HttpRequest, HttpResponse<T> with its
generics, BodyHandler, and more. Getting Mockito to return a sensible
HttpResponse<byte[]> takes more boilerplate than the actual test.
Instead, we used a simpler pattern: package-private method overrides via
subclassing. The HTTP calls in Enricher are extracted into two tiny
helper methods:
// In Enricher.java — package-private, overrideable in tests
byte[] fetchUrl(String url) throws IOException, InterruptedException {
    HttpRequest req = HttpRequest.newBuilder(URI.create(url)).build();
    // send() throws checked exceptions, so they are declared here
    HttpResponse<byte[]> resp = http.send(req, HttpResponse.BodyHandlers.ofByteArray());
    return resp.statusCode() == 200 && resp.body().length > 0 ? resp.body() : null;
}
String fetchJson(String url, String token) throws IOException, InterruptedException {
    // ... builder, send, return body or null
}
The test file defines an inner MockEnricher that overrides both methods:
static class MockEnricher extends Enricher {
    final Map<String, byte[]> urlBytes = new HashMap<>();
    final Map<String, String> urlJson = new HashMap<>();
    int fetchCallCount = 0;
    void mockBytes(String url, byte[] bytes) { urlBytes.put(url, bytes); }
    void mockJson(String url, String json) { urlJson.put(url, json); }
    @Override byte[] fetchUrl(String url) {
        fetchCallCount++;
        return urlBytes.get(url); // returns null if URL not mocked
    }
    @Override String fetchJson(String url, String token) {
        return urlJson.get(url);
    }
}
Usage in a test takes only a few lines:
@Test
void youtubeEmbedReplacedWithFigure(@TempDir Path tempDir) throws Exception {
    MockEnricher e = new MockEnricher();
    e.mockBytes("https://img.youtube.com/vi/dQw4w9WgXcQ/maxresdefault.jpg",
            new byte[]{1, 2, 3});
    Path assets = Files.createDirectories(tempDir.resolve("assets"));
    Element a = article("<iframe src=\"https://www.youtube.com/embed/dQw4w9WgXcQ\"></iframe>");
    int count = e.replaceYoutubeEmbeds(a, assets);
    assertEquals(1, count);
    assertEquals("yt_dQw4w9WgXcQ.jpg", a.selectFirst("img").attr("src"));
    assertTrue(Files.exists(assets.resolve("yt_dQw4w9WgXcQ.jpg")));
}
No Mockito. No @ExtendWith. No @Mock annotations. The test is a plain
JUnit test that happens to use a subclass. The fetchCallCount field lets
you assert that the cache hit path doesn’t re-download a thumbnail that
already exists:
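@Test
void existingThumbnailNotReDownloaded(@TempDir Path tempDir) throws Exception {
    MockEnricher e = new MockEnricher();
    Path assets = Files.createDirectories(tempDir.resolve("assets"));
    // Pre-seed the cache with a thumbnail for video id "existing"
    Files.write(assets.resolve("yt_existing.jpg"), new byte[]{9, 8, 7});
    Element a = article("<iframe src=\"https://www.youtube.com/embed/existing\"></iframe>");
    e.replaceYoutubeEmbeds(a, assets); // should not call fetchUrl at all
    assertEquals(0, e.fetchCallCount, "cache hit: no HTTP call");
}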
This pattern generalises to any class with external dependencies you want to control in tests. The trade-off vs Mockito: you can’t verify specific arguments or call counts beyond what you instrument yourself. The benefit: zero framework overhead, readable setup, and tests that run in 10ms.
Building DOM elements: Jsoup as a builder
Python’s BeautifulSoup enrichment code builds HTML as a string, then parses it back:
fig_html = (
f'<figure class="video-embed">'
f'<a href="{watch_url}" target="_blank" rel="noopener">'
f'<img src="{thumb_src}" alt="YouTube video" style="max-width:100%">'
f'<figcaption>▶ Watch on YouTube</figcaption>'
f'</a></figure>'
)
iframe.replace_with(BeautifulSoup(fig_html, 'html.parser').find('figure'))
This works but has two problems: f-string building bypasses HTML escaping
(a URL with a " in it would break the attribute), and the parse-then-find
dance is inelegant.
Jsoup has a proper element builder API. You construct the tree directly:
Element fig = new Element("figure").addClass("video-embed");
Element a = new Element("a")
        .attr("href", watchUrl)
        .attr("target", "_blank")
        .attr("rel", "noopener");
a.appendChild(new Element("img")
        .attr("src", thumbName != null ? thumbName : "")
        .attr("alt", "YouTube video")
        .attr("style", "max-width:100%"));
a.appendChild(new Element("figcaption").text("\u25B6 Watch on YouTube"));
fig.appendChild(a);
iframe.replaceWith(fig);
Every attr() call goes through Jsoup’s encoding — you can pass a raw URL
and it will be escaped correctly in the output. .text() encodes HTML
entities. There’s no string-then-parse round-trip; the element is constructed
directly and inserted into the existing document tree. The resulting HTML is
identical to the Python version.
The same pattern applies to archive-note fallback figures, Gist figure
captions, and the error figures on API failure. Every replaceWith() call
uses a Jsoup-built element rather than a parsed string.
URL parsing: URI vs urlparse
Python’s urlparse is one of those standard library modules that you don’t
think about until you have to port it. Java’s equivalent is java.net.URI.
The API is different enough to require care.
# Python — extract YouTube video ID from embed URL
parsed = urlparse(url)
host = parsed.netloc.lower().replace('www.', '')
if '/embed/' in parsed.path:
    return parsed.path.split('/embed/')[-1].split('/')[0].split('?')[0]
// Java — extract YouTube video ID from embed URL
URI uri = URI.create(url);
String host = uri.getHost() != null
        ? uri.getHost().toLowerCase().replace("www.", "") : "";
String path = uri.getPath() != null ? uri.getPath() : "";
if (path.contains("/embed/")) {
    String[] parts = path.split("/embed/");
    if (parts.length > 1) {
        String id = parts[1].split("/")[0].split("\\?")[0];
        return id.isEmpty() ? null : id;
    }
}
The Java version is longer because URI.create() throws IllegalArgumentException
if the URL is malformed (wrap it in try-catch), and getHost() / getPath() can
return null (guard with != null). Python’s urlparse almost never raises; for
missing parts it just returns empty strings. Both behaviours are defensible;
Java’s forces you to think about the error case at the call site.
The query string is another difference. Python’s parse_qs returns a dict
of lists: qs['v'][0]. Java’s URI.getRawQuery() returns the raw query
string: "v=abc&t=1" — you split on & and find the v= prefix. More
primitive, equally effective.
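The split-on-& approach described above could look roughly like the sketch below. The `QueryParams` class and its method names are hypothetical, introduced only for illustration; the original Enricher may structure this differently.

```java
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper illustrating manual query-string lookup.
class QueryParams {
    // Returns the first value for `name` in a raw query string, or null if absent.
    static String firstValue(String rawQuery, String name) {
        if (rawQuery == null || rawQuery.isEmpty()) return null;
        for (String pair : rawQuery.split("&")) {
            int eq = pair.indexOf('=');
            String key = eq >= 0 ? pair.substring(0, eq) : pair;
            if (key.equals(name)) {
                String value = eq >= 0 ? pair.substring(eq + 1) : "";
                // Decode percent-escapes; throws IllegalArgumentException on bad escapes.
                return URLDecoder.decode(value, StandardCharsets.UTF_8);
            }
        }
        return null;
    }

    // Convenience wrapper for watch URLs of the form ...?v=<id>.
    static String videoIdFromWatchUrl(String url) {
        return firstValue(URI.create(url).getRawQuery(), "v");
    }
}
```

Matching on the exact key rather than a `v=` prefix avoids false hits on parameters whose names merely end in `v`.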
One gotcha: URI.create() requires a valid URI. YouTube embed URLs are
always valid, but Gist script src attributes occasionally contain
malformed values in the wild. The try-catch around URI.create() is
genuinely necessary, not defensive paranoia.
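One way to keep that try-catch out of every call site is a small wrapper that turns a malformed value into an empty Optional. This is a sketch under the assumption that skipping a bad src attribute is acceptable; `SafeUri` is a hypothetical name, not a class from the project.

```java
import java.net.URI;
import java.util.Optional;

// Hypothetical wrapper: converts URI.create()'s exception into an empty Optional.
class SafeUri {
    static Optional<URI> parse(String raw) {
        if (raw == null) return Optional.empty();
        try {
            return Optional.of(URI.create(raw));
        } catch (IllegalArgumentException e) {
            // URI.create() wraps URISyntaxException in IllegalArgumentException.
            return Optional.empty();
        }
    }
}
```

A caller can then write `SafeUri.parse(src).map(URI::getPath).orElse("")` and treat a malformed attribute the same way as a missing one.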
Porting Python constants: dicts become static maps
Python’s brush map is a plain dict:
_BRUSH_MAP = {
'jscript': 'javascript', 'js': 'javascript',
'csharp': 'csharp', 'c#': 'csharp',
'c++': 'cpp', 'cplusplus': 'cpp',
'plain': 'plaintext', 'text': 'plaintext',
'shell': 'bash', 'sh': 'bash',
'drl': 'drl',
}
Java’s Map.of() overloads stop at 10 pairs and this map has 11. One solution
is a static initialiser block:
private static final Map<String, String> BRUSH_MAP;
static {
    Map<String, String> m = new LinkedHashMap<>();
    m.put("jscript", "javascript"); m.put("js", "javascript");
    m.put("csharp", "csharp");      m.put("c#", "csharp");
    m.put("c++", "cpp");            m.put("cplusplus", "cpp");
    m.put("plain", "plaintext");    m.put("text", "plaintext");
    m.put("shell", "bash");         m.put("sh", "bash");
    m.put("drl", "drl");
    BRUSH_MAP = Collections.unmodifiableMap(m);
}
LinkedHashMap preserves insertion order. Collections.unmodifiableMap wraps it
in a read-only view; because nothing else holds a reference to m, the result
is effectively immutable. It behaves identically to the Python dict for all
uses in normaliseCodeClasses.
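For comparison, the JDK also offers Map.ofEntries(), which has no 10-pair limit. The sketch below is an alternative to the static block, not what the port uses; `BrushMaps` is a hypothetical holder class. Note that, unlike LinkedHashMap, Map.ofEntries() does not promise any iteration order, which is fine only if the map is used purely for lookups.

```java
import java.util.Map;

// Hypothetical holder class showing the Map.ofEntries() alternative.
class BrushMaps {
    static final Map<String, String> BRUSH_MAP = Map.ofEntries(
            Map.entry("jscript", "javascript"), Map.entry("js", "javascript"),
            Map.entry("csharp", "csharp"),      Map.entry("c#", "csharp"),
            Map.entry("c++", "cpp"),            Map.entry("cplusplus", "cpp"),
            Map.entry("plain", "plaintext"),    Map.entry("text", "plaintext"),
            Map.entry("shell", "bash"),         Map.entry("sh", "bash"),
            Map.entry("drl", "drl"));
}
```

The result is truly immutable (mutators throw UnsupportedOperationException) and rejects null keys, null values, and duplicate keys at construction time.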
A subtlety worth noting: the Python regex r'\bbrush\s*:\s*(\w+)\b'
captures the language token with \w+ — which means brush:c# would
capture c not c#, since # is not a word character. The entries
'c#': 'csharp' and 'c++': 'cpp' in the brush map are technically
dead code in Python (they’d never match via the brush regex). They’re
in the map for direct class name lookup — if a <pre> somehow has c#
as a class name. We ported them faithfully even though they’ll rarely fire.
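The capture behaviour is easy to confirm with the same regex in Java. This is a small demonstration sketch; `BrushRegexDemo` and `capture` are hypothetical names used only here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Demonstrates what (\w+) captures for brush values containing non-word characters.
class BrushRegexDemo {
    private static final Pattern BRUSH =
            Pattern.compile("\\bbrush\\s*:\\s*(\\w+)\\b");

    // Returns the captured language token, or null if there is no brush declaration.
    static String capture(String classAttr) {
        Matcher m = BRUSH.matcher(classAttr);
        return m.find() ? m.group(1) : null;
    }
}
```

With this pattern, `capture("brush: java")` yields `java`, while `capture("brush: c#")` and `capture("brush: c++")` both yield `c`, because `#` and `+` are not word characters and end the match.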
Language patterns: preserving order
The language detection patterns are a list of (language, [patterns])
tuples in Python:
_LANG_PATTERNS: list[tuple[str, list[str]]] = [
('java', [r'\bpublic\s+class\b', ...]),
('xml', [r'<\?xml\s', r'xmlns=']),
('drl', [r'\brule\s+"', r'\bwhen\b.*\bthen\b', r'\bend\b']),
# ...
]
Order matters: the first matching language wins. In Java, Map.of() and
HashMap don’t guarantee iteration order, so a plain Map<String, List<String>>
would be fragile. The closest equivalent of the Python structure is
List<Map.Entry<String, List<String>>>:
private static final List<Map.Entry<String, List<String>>> LANG_PATTERNS = List.of(
Map.entry("java", List.of("\\bpublic\\s+class\\b", ...)),
Map.entry("xml", List.of("<\\?xml\\s", "xmlns=")),
Map.entry("drl", List.of("\\brule\\s+\"", "\\bwhen\\b.*\\bthen\\b", "\\bend\\b")),
// ...
);
List.of() preserves insertion order. Map.entry() is an immutable key-value
pair. The detection loop is a direct translation of the Python:
outer:
for (Map.Entry<String, List<String>> entry : LANG_PATTERNS) {
    for (String pattern : entry.getValue()) {
        if (Pattern.compile(pattern, Pattern.MULTILINE | Pattern.CASE_INSENSITIVE)
                .matcher(text).find()) {
            detected = entry.getKey();
            break outer;
        }
    }
}
Python’s any(re.search(p, text, re.MULTILINE | re.IGNORECASE) for p in patterns)
becomes a double loop with a labelled break. break outer exits both loops at
once, doing the work of both any()’s short-circuit and the break out of the
outer for loop in the Python version.
One note on pre-compiling: this code compiles the same 21 patterns on every
call to detectCodeLanguages. For a bulk enrichment run over 577 posts this
adds up. The patterns could be pre-compiled to List<Map.Entry<String,
List<Pattern>>>. We didn’t do it — premature optimisation — but it’s a
clear next step if profiling shows it matters.
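If profiling did show it mattered, the pre-compiled variant might look like the sketch below. It is an illustration under assumptions: `LangDetector`, `lang`, and `detect` are hypothetical names, and the pattern lists contain only the patterns quoted in this post, not the full set.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical sketch: patterns are compiled once, when the class is loaded.
class LangDetector {
    private static final int FLAGS = Pattern.MULTILINE | Pattern.CASE_INSENSITIVE;

    // Abbreviated to the patterns quoted in the text; the real list is longer.
    private static final List<Map.Entry<String, List<Pattern>>> LANG_PATTERNS = List.of(
            lang("java", "\\bpublic\\s+class\\b"),
            lang("xml", "<\\?xml\\s", "xmlns="),
            lang("drl", "\\brule\\s+\"", "\\bwhen\\b.*\\bthen\\b", "\\bend\\b"));

    private static Map.Entry<String, List<Pattern>> lang(String name, String... regexes) {
        List<Pattern> compiled = Arrays.stream(regexes)
                .map(r -> Pattern.compile(r, FLAGS))
                .collect(Collectors.toUnmodifiableList());
        return Map.entry(name, compiled);
    }

    // Returns the first language with any matching pattern, or null if none match.
    static String detect(String text) {
        for (Map.Entry<String, List<Pattern>> entry : LANG_PATTERNS) {
            if (entry.getValue().stream().anyMatch(p -> p.matcher(text).find())) {
                return entry.getKey();
            }
        }
        return null;
    }
}
```

Using anyMatch() also removes the need for the labelled break, since it short-circuits the inner loop in the same way Python's any() does.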
JSON parsing: Jackson without the annotations
Gist inlining requires parsing the GitHub API response — a JSON object with
nested files, each with content and language fields. Python does this
with .json() on the response object, which returns a plain dict:
data = resp.json()
files = data.get('files', {})
for fname, fdata in files.items():
    content = fdata.get('content', '')
    language = (fdata.get('language') or 'text').lower()
In Java, Quarkus includes Jackson. You don’t need @JsonDeserialize or
mapping classes for a one-off parse — just use the tree API:
JsonNode root = new ObjectMapper().readTree(json);
JsonNode files = root.path("files");
if (!files.isMissingNode() && files.fields().hasNext()) {
    Map.Entry<String, JsonNode> first = files.fields().next();
    String filename = first.getKey();
    String content = first.getValue().path("content").asText("");
    String language = first.getValue().path("language").asText("text").toLowerCase();
    // ...
}
path() returns a MissingNode instead of throwing if the key doesn’t exist
— safe to chain. asText("") returns the string value or the default.
isMissingNode() is the null check. The Jackson tree API is the right tool
for ad-hoc JSON navigation without binding to a type; @JsonDeserialize into
a class is right when you have a fixed, known schema.
The Python equivalent of (fdata.get('language') or 'text').lower() — use
'text' if the field is None or an empty string — needs two calls in
Java:
String lang = first.getValue().path("language").asText("text");
if (lang.isEmpty()) lang = "text";
lang = lang.toLowerCase();
asText("text") only falls back to the default for missing or null nodes, not
for an empty string value. Python’s or coerces both None and empty to the
fallback. Worth knowing when porting null-or-empty patterns.
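Because the null-or-empty fallback recurs whenever Python's `or` idiom is ported, it can be worth pulling into a tiny helper. The sketch below uses only the JDK; `TextDefaults`, `orDefault`, and `languageOrText` are hypothetical names, not methods from the project.

```java
import java.util.Locale;

// Hypothetical helper mirroring Python's `value or fallback` for strings.
class TextDefaults {
    // Returns fallback when value is null or empty, otherwise value.
    static String orDefault(String value, String fallback) {
        return (value == null || value.isEmpty()) ? fallback : value;
    }

    // Equivalent of (language or 'text').lower() for a possibly absent language.
    static String languageOrText(String language) {
        return orDefault(language, "text").toLowerCase(Locale.ROOT);
    }
}
```

Locale.ROOT is used so that lowercasing does not depend on the machine's default locale.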
What was easy
The four pure DOM transforms — normaliseBrToNewlines, normaliseCodeClasses,
detectCodeLanguages, replaceEmbedFallbacks — are near-literal translations.
No HTTP, no JSON, no file I/O. The only API differences are:
- br.replace_with('\n') → br.replaceWith(new TextNode("\n"))
- pre.get('class', []) → pre.classNames() (returns Set<String>, never null)
- code['class'] = new_list → code.classNames(newSet)
Each one is two lines of lookup in the Jsoup docs. The logic itself is unchanged. TDD made this fast: write the test with the BS4 semantics in mind, make the Java method satisfy it, done.
The replaceEmbedFallbacks method is particularly clean in Java because Jsoup’s
element builder handles HTML escaping automatically. In Python you have to call
html.escape(src) before interpolating into an f-string. In Java, setting an
attribute via .attr("href", src) encodes the value for you. The Java version
is safer by default.
What was harder
URL parsing edge cases. youtubeVideoId() handles four URL shapes: embed
paths, youtu.be short links, watch URLs with ?v=, and youtube-nocookie.com
embeds. Each has slightly different URI structure. Python’s urlparse does the
heavy lifting; Java requires manual string splitting after URI.getPath(). The
logic is equivalent but getting the null guards right takes a test for each
case.
Gist script parsing. gist.github.com/user/gistid.js (two path segments)
and gist.github.com/gistid.js (one path segment, no user) are both valid.
Python’s parts = parsed.path.strip('/').split('/') handles both. Java’s
URI.getPath().split("/") produces leading empty strings if the path starts
with / — you need .replaceAll("^/+", "") first. A minor gotcha that costs
15 minutes if you don’t already know it.
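The leading-empty-string behaviour and the fix can be seen in a few lines. This is a sketch with a hypothetical `GistPaths.gistId` helper; stripping the `.js` suffix is an assumption about what the caller wants, not something confirmed by the original code.

```java
// Hypothetical helper: extracts the gist id from a script path.
class GistPaths {
    // Accepts "/user/gistid.js" or "/gistid.js"; returns null for an empty path.
    static String gistId(String path) {
        if (path == null) return null;
        // Strip leading and trailing slashes so split() yields no empty segments.
        String trimmed = path.replaceAll("^/+", "").replaceAll("/+$", "");
        if (trimmed.isEmpty()) return null;
        String[] parts = trimmed.split("/");
        String last = parts[parts.length - 1]; // the id is always the final segment
        // Assumption: callers want the id without the ".js" suffix.
        return last.endsWith(".js") ? last.substring(0, last.length() - 3) : last;
    }
}
```

Without the trimming step, `"/user/gistid.js".split("/")` produces three elements, the first of which is an empty string, so index-based logic copied from the Python version would be off by one.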
The Map type mismatch. Enricher.enrich() returns Map<String, Integer>.
StateStore.markEnriched() takes Map<String, Object>. These are not
assignment-compatible in Java — generic types are invariant. The solution:
new HashMap<>(enrichStats) with the target type providing inference context.
Java infers HashMap<String, Object> and the copy constructor accepting
Map<? extends K, ? extends V> does the widening. It compiles, it’s correct,
and it looks like it shouldn’t work — which is exactly the kind of thing you
should write a comment for.
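In isolation the widening copy looks like the sketch below. `StatsWidening` and `widen` are hypothetical names, and the "youtube" key in the usage note is only an example stats key.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of widening Map<String, Integer> to Map<String, Object>.
class StatsWidening {
    static Map<String, Object> widen(Map<String, Integer> enrichStats) {
        // A direct assignment would not compile: generics are invariant, so
        // Map<String, Integer> is not a Map<String, Object>.
        // The target type makes the diamond infer HashMap<String, Object>, and the
        // copy constructor accepts Map<? extends K, ? extends V>, so Integer fits.
        Map<String, Object> widened = new HashMap<>(enrichStats);
        return widened;
    }
}
```

For example, `StatsWidening.widen(Map.of("youtube", 2))` returns a mutable map whose values are still Integer instances at runtime; only the static type has changed.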
The testing pyramid in practice

Phase 5 adds 32 new tests across three layers:
28 unit tests cover each enrichment method with multiple inputs: happy
path, failure/fallback, non-matching input, caching behaviour. The MockEnricher
inner class supplies all HTTP responses. No network, no file system beyond
@TempDir. Tests run in under 200ms total.
2 integration tests enrich real KIE archive HTML files. The first enriches
one post and asserts the output file exists and the stats map has the right
keys. The second enriches 10 posts in sequence and asserts all outputs are
written. These run @EnabledIf("kieArchivePresent") — they skip on CI or
fresh machines, run locally. Their value is exercising the real content
combinations that 577 posts produce, not just the synthetic snippets in
unit tests.
2 E2E tests fire the scan endpoint via @QuarkusTest and RestAssured,
triggering enrichment as a side effect of scan. These prove the plumbing:
PostsResource.scan() calls Enricher.enrich() via the right path,
StateStore.markEnriched() is called with the stats, and the endpoint
returns a valid 200 response with the expected fields.
Neither the integration nor the E2E tests run in the JEP-less CI environment: they need the KIE archive and the Python bridge to be available. This is the same limitation as previous phases. The unit tests are CI-safe and catch correctness issues; the integration tests catch behavioural issues with real content.
The subagent-driven implementation
Phase 5 used the same subagent-driven development process as Phase 4: a fresh Claude subagent per task, spec compliance review after each task, code quality review after spec passes. Eight tasks, eight implementer dispatches, sixteen review dispatches (spec + quality per task).
The main benefit shows up on the HTTP tasks (Tasks 5 and 6). A single-session implementation would carry context from the pure DOM tasks into the HTTP tasks, potentially conflating the two. Fresh-context subagents arrive clean: they read the task, read the existing files, implement the specific thing, commit. No accumulated assumptions from the earlier tasks.
The review process caught one real bug: fetchJson() was returning ""
(empty string) for a 200 response with an empty body, while fetchUrl()
returned null in the same case. The inconsistency was caught in code review,
not testing — because the tests mocked successful responses. A test for
“empty body” was not in scope but the reviewer noticed the asymmetry. Fixed
before merge with an && !body.isEmpty() check.
Comparing the approaches
Two sessions in, a pattern is clear: Python is better at the exploratory, ad-hoc phase. Java is better at the production, maintainable phase.
| | Python (enrich.py) | Java (Enricher.java) |
|---|---|---|
| HTTP calls | requests.get(url, timeout=15): one line | Builder + send + handler type: 4 lines |
| JSON parsing | resp.json()['files']['Foo.java']['content'] | root.path("files").fields().next().getValue().path("content").asText("") |
| DOM building | f-string + BS4(html).find() | Element builder API, safe by default |
| Type safety | none; runtime errors only | compile-time; Map<String, Integer> is caught |
| Testing HTTP | patch('enrich.requests') | MockEnricher subclass |
| Null safety | or 'text' catches both None and "" | Two separate checks |
| Error messages | Stack trace into dynamic dict | Compile error with type name |
The Python code is shorter in every row. The Java code is more explicit in every row. “Shorter” wins during development. “More explicit” wins at 3am during an incident.
Where we are
JEP call count: 23 → 22. One bridge function removed:
bridge.post_enrich_only. The scan endpoint no longer reaches into Python
for enrichment — it’s entirely Java from HTTP receive to enriched HTML output.
Phase 0 (JEP bridge): 35 calls — DONE
Phase 1 (config/home): 35 → 32 — DONE
Phase 2 (state.py): 32 → 27 — DONE
Phase 3 (html_utils, code blocks): 27 → 26 — DONE
Phase 4 (scan_html, scan_assets): 26 → 23 — DONE
Phase 5 (enrich.py): 23 → 22 — DONE ← here
Phase 6 (???): 22 → ? — next
The remaining 22 calls cover the convert pipeline (Markdown generation, validation), the ingest pipeline (URL discovery, scraping), staging workflow, and the search endpoint. Each is a self-contained Python module. Phase 6 will be whichever of those the data suggests is highest-value to port next.
The JEP count is dropping. The Python codebase is shrinking. The Java tests are growing. At some point the crossover happens and the Python code becomes the minority. That’s the direction we’re headed.