Product Recommendation Email Testing: Why Block-Level CTC Needs a Different Data Stack Than Subject Line Tests
ESPs log one open per email. They don’t log which product recommendation email block rendered, which variant was active, or which click led to a purchase. Block-level CTC testing requires two infrastructure layers most programs don’t have.
Your ESP can tell you which subject line got more opens. It cannot tell you which product recommendation email block drove more revenue. That is not a feature request your vendor is ignoring. It is an architectural constraint baked into how every major email platform stores data.
ESPs are batch-send systems. They run on OLTP databases (usually PostgreSQL or MySQL) and record three things: one row per send, one row per open, one row per click. There is no per-block impression table. No variant-level column. No attribution chain connecting a specific content module to a purchase four days later. So when someone asks “which version of our product recommendation email block performs better,” the data to answer that question literally doesn’t exist in the ESP.
This is why subject lines remain the default test in email. Not because they’re the most valuable variable. Because they’re the only variable the existing data model can measure. If you want to test what actually moves revenue (the content inside the email, especially personalized product recommendation blocks), you need two additional infrastructure layers most programs don’t have. Here is what those layers are, why they’re necessary, and what the math actually looks like when you try to run block-level tests without them.
Why ESPs Make Subject Lines the Default Test
The ESP data model was designed for delivery operations, not content analytics. It answers questions like: Did the message get delivered? Did the recipient open it? Did they click something? These are message-level events. They tell you nothing about what happened inside the email after the open pixel fired.
A subject line A/B test works perfectly within this model because the variable (subject line A vs. B) maps 1:1 to the message. The open event is the measurement event. One variable, one event, clean signal. But a product recommendation email block test has no equivalent measurement event. The ESP logs the open. It does not log which block rendered, which variant of that block was shown, or what products appeared. Those events don’t have a table to live in.
Most email programs default to campaign-level metrics — total clicks, total revenue attributed to a send — because the infrastructure doesn’t support anything more granular. Teams know the campaign performed, but they can’t isolate which content module inside the email drove the result. That’s the email observability gap, and ESP architecture is why it exists.
The 35x Sample Size Gap Between Open-Rate Tests and CTC Tests
Even if you could measure block-level CTC, the statistics are significantly harder than for subject line tests. This is pure math, not opinion.
Using Evan Miller’s sample size calculator: to detect a 10% relative lift at 95% confidence with 80% power, a subject line test with an 18% open rate baseline needs roughly 900 recipients per variant. A content block CTC test with a 2.5% baseline needs approximately 31,000 per variant. That is a 35x difference.
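If you want to sanity-check that gap yourself, the standard two-proportion sample size formula fits in a few lines of Python. Here is a minimal sketch using the normal approximation: the absolute n depends on which formula variant and MDE parametrization a given calculator uses, so the constants will differ between tools, but the way n explodes as the baseline falls is the point.

```python
import math
from statistics import NormalDist

def n_per_variant(baseline: float, relative_lift: float,
                  alpha: float = 0.05, power: float = 0.80) -> int:
    """Two-proportion sample size, standard normal approximation.

    baseline: control rate (0.18 = an 18% open rate)
    relative_lift: minimum detectable effect, relative (0.10 = a 10% lift)
    """
    p1, p2 = baseline, baseline * (1 + relative_lift)
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_b = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    num = (z_a * math.sqrt(2 * p_bar * (1 - p_bar))
           + z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(num / (p2 - p1) ** 2)

# Same relative lift, same confidence and power -- the low baseline is
# what drives the per-variant requirement up by an order of magnitude.
print(n_per_variant(0.18, 0.10))   # subject line open-rate test
print(n_per_variant(0.025, 0.10))  # block-level CTC test
```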
Most retail email teams send to lists of 100K–500K. At those volumes, subject line tests reach significance within a single send. A product recommendation email block CTC test at 2.5% baseline? You might need weeks of accumulated data across multiple sends to get there. This is why the testing guides that say “test one variable at a time” and “run for significance” produce months of ambiguous results when applied to content blocks. The framework was built for subject lines.
And there’s a compounding problem. As Evan Miller showed in his widely cited analysis, checking a running test daily inflates the false-positive rate from 5% to roughly 14.5%. Most marketing teams peek at dashboards constantly, which means most running A/B tests are operating at 85.5% confidence or lower, not the 95% they assume. For a low-baseline CTC test that already needs 31,000 per variant, daily peeking makes the problem much worse.
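The peeking effect is easy to demonstrate with a quick Monte Carlo: simulate A/A tests where no true difference exists, run a significance check every day, and count how often a "winner" appears anyway. A rough sketch follows; the inflated rate you get depends on the number of peeks and the daily traffic, so it won't reproduce Miller's 14.5% figure exactly.

```python
import math
import random

def peeking_false_positive_rate(trials: int = 1000, days: int = 14,
                                n_per_day: int = 500, p: float = 0.025) -> float:
    """Share of A/A tests (no real difference) that a daily peeker 'wins'."""
    z_crit = 1.96  # nominal 95% confidence, two-sided
    false_positives = 0
    for _ in range(trials):
        a_conv = b_conv = n = 0
        for _ in range(days):
            a_conv += sum(random.random() < p for _ in range(n_per_day))
            b_conv += sum(random.random() < p for _ in range(n_per_day))
            n += n_per_day
            pooled = (a_conv + b_conv) / (2 * n)
            se = math.sqrt(pooled * (1 - pooled) * (2 / n))
            # The daily dashboard check: declare a winner at the first
            # significant peek -- exactly what a disciplined test avoids.
            if se > 0 and abs(a_conv - b_conv) / n / se > z_crit:
                false_positives += 1
                break
    return false_positives / trials

print(peeking_false_positive_rate())  # lands well above the nominal 0.05
```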
What Module-Level Attribution Actually Measures in a Product Recommendation Email
To test product recommendation email blocks (or any content module), you need a data layer that subject line tests never required. Here’s what a per-block impression record actually contains:
- Subscriber ID (who saw it)
- Block ID (which module in the email)
- Variant ID (which version of that module rendered)
- Render timestamp (when the image was requested at open time)
That impression record then links to click events (did the subscriber click this specific block?) and purchase events within a 7-day attribution window (did that click lead to a conversion?). The resulting metrics are CTC (click-to-conversion rate per block variant) and RPM (revenue per mille impressions). These are the block-level equivalents of what ad platforms call ROAS. You can read a full breakdown in this post on email block analytics and why RPM is the metric that wins CFO meetings.
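To make the shape of that data concrete, here is a minimal sketch of the record types and the metric math in Python. The field names and the helper are illustrative, not Zembula's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Impression:
    subscriber_id: str
    block_id: str
    variant_id: str
    rendered_at: datetime  # logged when the image is requested at open time

@dataclass
class Click:
    subscriber_id: str
    variant_id: str
    clicked_at: datetime

@dataclass
class Purchase:
    subscriber_id: str
    revenue: float
    purchased_at: datetime

WINDOW = timedelta(days=7)  # the 7-day attribution window

def variant_metrics(impressions, clicks, purchases, variant_id):
    """CTC (conversions / clicks) and RPM (revenue per 1,000 impressions)."""
    v_imps = [i for i in impressions if i.variant_id == variant_id]
    v_clicks = [c for c in clicks if c.variant_id == variant_id]
    conversions, revenue = 0, 0.0
    for c in v_clicks:
        for p in purchases:
            if (p.subscriber_id == c.subscriber_id
                    and timedelta(0) <= p.purchased_at - c.clicked_at <= WINDOW):
                conversions += 1
                revenue += p.revenue
                break  # credit at most one conversion per click
    ctc = conversions / len(v_clicks) if v_clicks else 0.0
    rpm = 1000 * revenue / len(v_imps) if v_imps else 0.0
    return ctc, rpm
```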
Across Zembula’s platform, personalized blocks (Smart Banners, Smart Kickers, Smart Blocks) average a 13.6% CTC, versus the roughly 2.5% retail broadcast baseline. Those numbers come from our 2025 email performance benchmark report, derived from over 6.2 billion opens. The gap between those two numbers is exactly the kind of signal module-level attribution is designed to capture.
Longitudinal vs. Split-by-Send: Why Assignment Method Shapes Experimental Quality
There are two ways to assign test groups for a product recommendation email block test. The first (split-by-send) re-randomizes every send. Subscriber A sees Variant 1 on Monday, Variant 2 on Wednesday. The second (longitudinal) locks each person into an arm that persists across sends. Subscriber A always sees Variant 1.
Split-by-send is what most ESPs offer, and it can work. With last-click attribution — which is how Zembula’s attribution model operates — conversions are attributed to the variant the subscriber actually clicked, regardless of which variants they may have seen in prior sends. The attribution chain is correct: if a subscriber clicks Variant 2 on Wednesday and purchases on Thursday, that conversion is credited to Variant 2, period. Seeing Variant 1 on Monday doesn’t corrupt the signal.
That said, longitudinal (person-locked) assignment has real experimental advantages when you want to go beyond attribution and understand which variant drives better outcomes for the same audience over time. In a split-by-send design, each subscriber’s behavior is a composite of exposure to multiple variants. You can correctly attribute each conversion, but you can’t cleanly isolate whether Variant 1 or Variant 2 produces better engagement from the same cohort of people. Longitudinal assignment gives you that: every subscriber contributes data to exactly one arm, so the comparison between variants is between equivalent groups with consistent exposure.
This is how every serious ad platform runs creative tests. It’s also what makes multi-week block tests produce the cleanest experimental signal, especially when you’re accumulating data across sends to close the 35x sample size gap. We’ve covered the full structural comparison in our post on why campaign-level A/B testing is broken, and the conditions where holdout testing is worth running vs. a waste of time.
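Mechanically, person-locked assignment is usually a deterministic hash of the subscriber and experiment IDs, which needs no assignment table at all. Here is a minimal sketch of that common technique (an illustration, not a claim about Zembula's internals):

```python
import hashlib

def assign_arm(subscriber_id: str, experiment_id: str, n_arms: int = 2) -> int:
    """Lock a subscriber into one arm for the life of an experiment.

    Hashing (experiment, subscriber) yields a stable, roughly uniform
    assignment: the same person lands in the same arm on every send,
    and a new experiment_id reshuffles everyone independently.
    """
    digest = hashlib.sha256(f"{experiment_id}:{subscriber_id}".encode()).hexdigest()
    return int(digest, 16) % n_arms

# Subscriber A sees the same variant on Monday and on Wednesday:
print(assign_arm("subscriber-a", "smart-banner-test-q3"))
print(assign_arm("subscriber-a", "smart-banner-test-q3"))  # identical result
```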
The Two-Layer Infrastructure That Makes Product Recommendation Email Measurement Possible
This is the core architectural point. To measure what happens inside an email at the block level, you need two layers that sit outside the ESP entirely.
Layer 1: Event streaming for real-time impression capture. When a subscriber opens an email containing a Zembula-powered block, the email client requests an image URL. At that moment, a composition engine renders the personalized content and simultaneously emits an impression event. That event (subscriber ID, block ID, variant ID, render timestamp) must be captured in real time. Zembula uses WarpStream, a Kafka-API-compatible streaming system that Confluent acquired in September 2024. WarpStream’s stateless agent model handles the concurrency pattern of millions of impression events arriving in parallel across a large send without the partition rebalancing overhead of traditional Kafka.
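Because WarpStream is Kafka-API-compatible, the emit side looks like ordinary Kafka producer code. A minimal sketch using the kafka-python client; the endpoint, topic name, and payload fields here are illustrative assumptions, not Zembula's actual pipeline.

```python
import json
from datetime import datetime, timezone

from kafka import KafkaProducer  # pip install kafka-python

# A standard Kafka client works because WarpStream speaks the Kafka protocol.
producer = KafkaProducer(
    bootstrap_servers="warpstream.example.com:9092",  # assumed endpoint
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def emit_impression(subscriber_id: str, block_id: str, variant_id: str) -> None:
    """Fire the impression event at render time, when the image URL is requested."""
    producer.send("block-impressions", value={
        "subscriber_id": subscriber_id,
        "block_id": block_id,
        "variant_id": variant_id,
        "rendered_at": datetime.now(timezone.utc).isoformat(),
    })

emit_impression("subscriber-a", "smart-banner-1", "variant-b")
producer.flush()  # block until the broker acknowledges the event
```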
Layer 2: Columnar OLAP for sub-second attribution queries. Those impression events need to be queryable. For a mid-size retailer running Smart Banners across a 500K subscriber list with daily sends, that’s millions of impression records per month. Every one of those renders is an impression event that must be stored, joined with click events and purchase events within a 7-day window, and aggregated into CTC and RPM by block, variant, and use case.
This is exactly the query pattern where PostgreSQL (the database most ESPs use) breaks down. A recent benchmark by Fiveonefour showed that at 10 million rows, ClickHouse runs analytical workloads in 453ms while PostgreSQL takes 12,201ms. That’s a 27x performance gap, and it comes from columnar storage: ClickHouse reads only the columns each query touches instead of scanning full rows. Zembula runs attribution queries through Tinybird (a managed ClickHouse layer), which is why per-block CTC dashboards return in under a second even at billions-of-rows scale.
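The attribution query itself is a join-and-aggregate over the impression-click-purchase chain. Here is a sketch of that shape as ClickHouse SQL issued through the clickhouse-connect client; the host, table, and column names are assumptions, and a production pipeline would also dedupe the join fan-out down to last click.

```python
import clickhouse_connect  # pip install clickhouse-connect

client = clickhouse_connect.get_client(host="clickhouse.example.com")  # assumed host

ATTRIBUTION_SQL = """
SELECT
    i.block_id,
    i.variant_id,
    count() AS impressions,
    countIf(c.clicked_at IS NOT NULL) AS clicks,
    countIf(p.purchased_at BETWEEN c.clicked_at
                               AND c.clicked_at + INTERVAL 7 DAY) AS conversions,
    round(1000 * sumIf(p.revenue, p.purchased_at BETWEEN c.clicked_at
                                              AND c.clicked_at + INTERVAL 7 DAY)
              / count(), 2) AS rpm
FROM impressions AS i
LEFT JOIN clicks AS c
    ON c.subscriber_id = i.subscriber_id AND c.block_id = i.block_id
LEFT JOIN purchases AS p
    ON p.subscriber_id = c.subscriber_id
GROUP BY i.block_id, i.variant_id
SETTINGS join_use_nulls = 1  -- unmatched LEFT JOIN columns come back as NULL
"""

for row in client.query(ATTRIBUTION_SQL).result_rows:
    print(row)  # (block_id, variant_id, impressions, clicks, conversions, rpm)
```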
The data flow, end to end: Email client requests image URL → Composition engine renders personalized product recommendation email block and emits impression event → WarpStream captures event in real time → ClickHouse stores impression + click + purchase chain → Attribution query returns RPM/CTC per block variant in sub-second latency.
Without both layers, the data to run a content block test simply does not exist. You can know the methodology. You can have the statistical literacy. You can even have the budget. But if your stack has no per-block impression table and no analytical database fast enough to join impression-click-purchase chains across millions of rows, you’re stuck testing subject lines. Not by choice. By infrastructure.
Directional Confidence as the Operating Mode for Most Programs
Here is the honest reality. Most retail email programs will not reach 95% statistical significance on a content block CTC test within a reasonable timeframe. The math requires roughly 31,000 recipients per variant at a 2.5% baseline. For a brand sending to 200K subscribers, that is achievable only through weeks of longitudinal accumulation, and only with person-locked assignment.
Chasing 95% significance on block-level tests typically produces one of two outcomes: the test runs for months and nobody acts on the data, or somebody peeks at intermediate results and declares a winner prematurely (which, per the Evan Miller analysis, means they’re operating at ~85% confidence anyway).
The practical alternative is directional CTC/RPM confidence, sustained across 4+ weeks and millions of impressions. This is how performance marketers actually run creative tests on ad platforms. Meta doesn’t require 95% significance before reallocating budget to a winning ad creative. They use directional signal plus automated allocation (their version of a multi-armed bandit). Email block testing should work the same way: accumulate signal, allocate traffic toward the better-performing variant, and let the system optimize continuously rather than waiting for a binary pass/fail verdict.
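Thompson sampling is the textbook form of that allocation loop: sample a plausible rate for each variant from its posterior, and give the next slot to whichever draw wins. A minimal sketch of the technique (an illustration, not Zembula's or Meta's actual allocator):

```python
import random

def thompson_allocate(stats: dict, batch: int = 10_000) -> dict:
    """Split the next batch of sends across variants by Thompson sampling.

    stats maps variant -> (conversions, impressions). Each draw samples a
    plausible conversion rate from a Beta posterior, so stronger variants
    win more traffic without waiting for a binary pass/fail verdict.
    """
    allocation = {v: 0 for v in stats}
    for _ in range(batch):
        draws = {
            v: random.betavariate(conv + 1, imps - conv + 1)  # Beta posterior
            for v, (conv, imps) in stats.items()
        }
        allocation[max(draws, key=draws.get)] += 1
    return allocation

# Directional signal accumulated so far: variant B looks stronger, so it
# earns most (not all) of the next send while uncertainty remains.
print(thompson_allocate({"variant_a": (250, 10_000), "variant_b": (340, 10_000)}))
```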
This is a real shift in how email teams need to think about testing. Email has historically operated with ad-hoc tests and campaign-level measurement. Running product recommendation email blocks as continuously measured, continuously optimized content modules is closer to how ad spend optimization works. And given that email has structurally better economics (owned audience, first-party identity, personalized product recommendations rendered on first-party data), the ROI argument for investing in this infrastructure is strong.
Key takeaways
- Subject line tests persist because of data architecture, not team discipline. ESPs log one open per email. There is no per-block impression table, no variant-level CTC column, and no attribution chain to downstream purchases at the module level.
- The sample size gap is 35x. A subject line test needs ~900 recipients per variant. A product recommendation email block CTC test needs ~31,000. Same confidence level, same relative lift target, wildly different math.
- Two infrastructure layers are non-negotiable. Event streaming (WarpStream/Kafka) captures per-block impressions at open time. Columnar OLAP (ClickHouse) queries attribution chains across billions of rows sub-second. Without both, module-level measurement is impossible.
- Longitudinal assignment produces the cleanest experimental signal. Last-click attribution (as Zembula uses) correctly handles split-by-send scenarios, but person-locked arm assignment gives you a controlled experiment where each subscriber contributes data to exactly one arm — the gold standard for isolating which content performs better over time.
- Directional confidence is the realistic standard. Most retail programs can’t reach 95% significance on block CTC tests within any reasonable window. Sustained directional CTC/RPM signal across 4+ weeks, paired with automated allocation, is how performance marketers actually operate.
- This is the same investment case as ad platform measurement, with better economics. Email runs on first-party identity and owned audiences. Building the email observability layer to measure personalized email content at the block level turns email into a performance marketing channel with measurement parity to paid ads.
Carl Thorner is CTO at Zembula, where he architects the streaming infrastructure and personalization platform that powers real-time email content for enterprise brands. He writes about the technical decisions behind scalable email systems.