Deploy Your Winners

The hidden cost of experiments that never end

There’s a common anti-pattern in A/B testing that costs ecommerce brands more than most failed experiments ever could: the “100% rollout” via the testing tool. 

Here’s how it happens. A team runs an experiment. The variant wins with statistical significance. The team sets the A/B tool to show Variant B to 100% of users and moves on to the next test. The experiment is “done.” Except it isn’t. 

What’s actually happening in the browser

Your server is still sending the browser the original page content — the version no user will ever see. The user's device is still fetching the A/B testing script. It's still parsing and executing the bucketing logic. It's still patching the DOM to transform the page from the original into the winning variant. The user receives a page that has been built twice: once by the server, and once by the testing tool that overwrites it.

To understand why, consider how client-side A/B testing works mechanically. The server sends the original HTML. The browser begins parsing it and constructing the DOM. An anti-flicker snippet hides the page. The A/B testing script downloads from a third-party CDN — which requires DNS resolution, a TCP handshake, and TLS negotiation before a single byte of test logic arrives. The script executes, reads a cookie or generates an assignment, evaluates targeting rules, and then hunts through the DOM for the elements it needs to modify. It patches those elements — swapping text, changing styles, replacing images. Only then does the page become visible. 
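
As a rough sketch of that sequence (the vendor URL, experiment ID, selector, and cookie name below are placeholders, not any specific tool's API), the client-side flow looks roughly like this:

```typescript
// Anti-flicker snippet: hide the page until the testing script has run.
document.documentElement.style.opacity = "0";

async function runClientSideTest(): Promise<void> {
  // 1. Fetch experiment config from a third-party origin. On a cold
  //    connection this is where the DNS, TCP, and TLS costs are paid.
  const config = await fetch("https://cdn.example-ab-vendor.com/config.json")
    .then((res) => res.json());

  // 2. Read the visitor's bucket from a cookie, or assign one now.
  const bucket =
    document.cookie.match(/ab_bucket=(\w+)/)?.[1] ??
    (Math.random() < 0.5 ? "control" : "variant");
  document.cookie = `ab_bucket=${bucket}; path=/; max-age=31536000`;

  // 3. Evaluate targeting rules, find the target element, patch the DOM.
  if (bucket === "variant" && config.experiments?.["new-cta"]?.active) {
    const cta = document.querySelector<HTMLElement>("#buy-button");
    if (cta) cta.textContent = "Add to cart with free shipping";
  }

  // 4. Only now is the page allowed to become visible.
  document.documentElement.style.opacity = "";
}

runClientSideTest();
```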

Every one of those steps still runs when the experiment is set to 100%. The bucketing logic still evaluates. The targeting rules still fire. The DOM still gets patched. The only difference is that every user receives the same patch. The overhead is identical to a live experiment, but the information value is zero — because there’s nothing left to learn. 
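
Put another way, a "100% rollout" is usually nothing more than a traffic-split change in the tool's configuration. An illustrative (hypothetical) config makes the point:

```typescript
// Hypothetical experiment config, as the testing script sees it.
const liveExperiment = {
  id: "new-cta",
  active: true,
  trafficSplit: { control: 50, variant: 50 },
};

// The "100% rollout": same script download, same bucketing, same DOM patch.
// Only the split changes.
const rolledOut = {
  id: "new-cta",
  active: true,
  trafficSplit: { control: 0, variant: 100 },
};
```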

The sedimentary layer problem

Now multiply that by every “permanent” experiment running on the site. Five experiments means five layers of DOM patching on every page load. Ten means ten. Each one adds main-thread work, delays rendering, and increases the gap between when the page could have been visible and when it actually is. 

This is the sedimentary layer problem — technical debt that accumulates silently beneath the surface, compounding with each experiment that overstays its purpose. Like geological sediment, each layer is thin enough to ignore individually. But over months and years of experimentation, the cumulative weight crushes performance. Not to mention, it creates a permanent dependency on the A/B testing tool to maintain the UX improvements.

The conversion gains from those original experiments? They plateau or reverse as the site gets slower. The irony is hard to overstate: the tools deployed to improve conversion are actively degrading it. A team might celebrate a 3% uplift from a button color test while their site’s LCP drifts from 1.5 seconds to 2.8 seconds over a year of “optimization” — costing them far more in lost revenue than the button test ever gained. 

The math behind the drift

Research from Google and Deloitte has consistently shown that each 100ms of additional latency costs roughly 0.5–1% in conversion. If a site running $10M/month in revenue has accumulated 300ms of latency from stale experiments that were never properly deployed, that’s $150K–$300K/month in lost conversion — not from running experiments, but from failing to clean them up. 
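
As a back-of-the-envelope sketch using the article's figures (the revenue and latency numbers are illustrative, and the 0.5–1% per 100ms relationship is a rough rule of thumb, not a precise model):

```typescript
const monthlyRevenue = 10_000_000;   // $10M/month in revenue
const staleLatencyMs = 300;          // latency accumulated from stale experiments
const costPer100msLow = 0.005;       // 0.5% conversion loss per 100ms
const costPer100msHigh = 0.01;       // 1% conversion loss per 100ms

const lostLow = monthlyRevenue * (staleLatencyMs / 100) * costPer100msLow;   // $150,000
const lostHigh = monthlyRevenue * (staleLatencyMs / 100) * costPer100msHigh; // $300,000

console.log(`Estimated monthly loss: $${lostLow.toLocaleString()}-$${lostHigh.toLocaleString()}`);
```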

This cost is invisible in most analytics dashboards. It doesn’t show up as a single event. It manifests as a slow, steady erosion of conversion rate that gets attributed to seasonality, competitive pressure, or market conditions — anything but the experimentation infrastructure itself. The performance regression is diffuse and gradual, which makes it easy to rationalize and hard to diagnose. 

An experiment is a question, not a feature

An experiment is, by definition, a temporary inquiry into user behavior. It asks: “Is B better than A?” Once you have the answer with statistical significance, the experiment is over. The code that asked the question is no longer needed. Only the answer matters. 

Yet most teams treat experiments as features. The testing tool becomes the delivery mechanism for production changes. This conflation is understandable — it’s faster to leave the experiment running than to schedule engineering time for a proper deployment — but the convenience comes at a compounding cost that teams rarely quantify. 

The root cause is organizational, not technical. CRO and marketing teams have the authority to launch experiments but often lack the engineering resources to deploy winners natively. Engineering teams have the capability to deploy but aren’t incentivized to prioritize experiment cleanup over new feature work. The result is a backlog of “completed” experiments that continue to run in production indefinitely. 

The four-step exit plan

The process after a winner is identified should be explicit and non-negotiable: 

First, code it. Engineers implement the winning variant directly in the application codebase — as native code, not a DOM patch. This means the winning experience is rendered by the server or built into the application templates. No third-party script is involved in delivering it. 

Second, deploy it. Ship the new version to production through the normal deployment pipeline. The winning variant is now the default experience for all users, delivered without any testing overhead. 

Third, delete it. Remove the experiment configuration from the A/B testing tool entirely. Not “pause” — delete. A paused experiment still loads its configuration. A deleted experiment loads nothing. 

Fourth, verify. This step is routinely skipped, and it’s the most important one. Confirm that the A/B script is no longer executing logic for that element. Verify that page performance has returned to pre-experiment levels. Check that no residual targeting rules, anti-flicker snippets, or event listeners remain active. Residual experiment logic has a way of lingering — a targeting rule still evaluating, a script still loading on pages where it no longer applies, an anti-flicker snippet still hiding content for a test that concluded weeks ago. 
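
Part of that verification can be scripted. A minimal sketch, assuming the testing tool loads from a known third-party hostname (the hostname below is a placeholder for your tool's CDN), checks for residual vendor requests and reports LCP using the standard Performance APIs:

```typescript
// Run in the browser console or in a synthetic monitoring check after cleanup.
const vendorHost = "cdn.example-ab-vendor.com"; // placeholder for your tool's CDN

// Any resource still fetched from the vendor's domain means cleanup is incomplete.
const residual = performance
  .getEntriesByType("resource")
  .filter((entry) => entry.name.includes(vendorHost));

if (residual.length > 0) {
  console.warn(`A/B vendor still loading ${residual.length} resource(s):`,
    residual.map((r) => r.name));
} else {
  console.log("No A/B vendor resources loaded on this page.");
}

// Confirm LCP has returned to pre-experiment levels.
new PerformanceObserver((list) => {
  const entries = list.getEntries();
  const last = entries[entries.length - 1];
  if (last) console.log(`LCP: ${Math.round(last.startTime)}ms`);
}).observe({ type: "largest-contentful-paint", buffered: true });
```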

Feature flags as the bridge

Feature flags are the mechanism that makes this exit plan operationally feasible. Every experiment, whether client-side or server-side, should be wrapped in a feature flag that allows it to be disabled cleanly when it’s no longer needed. 

Some teams go further: they implement a global feature flag that controls whether any experimentation code loads at all. When no tests are active, the flag is off, and the performance tax drops to zero. This “off by default” mentality treats experimentation as a cost center that must justify its overhead — which is exactly what it is. 
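
A minimal sketch of that global gate, assuming a hypothetical flag lookup (isFlagEnabled, the flag name, and the script URL are placeholders for whatever flag system and vendor you use):

```typescript
// Hypothetical flag lookup; in practice the value might come from a config
// endpoint, an edge KV store, or a value the server renders into the page.
async function isFlagEnabled(flag: string): Promise<boolean> {
  const res = await fetch(`/api/flags/${encodeURIComponent(flag)}`);
  const { enabled } = await res.json();
  return Boolean(enabled);
}

async function bootExperimentation(): Promise<void> {
  // Global kill switch: when no tests are active, nothing below runs and
  // the third-party script is never requested.
  if (!(await isFlagEnabled("experimentation-enabled"))) return;

  const script = document.createElement("script");
  script.src = "https://cdn.example-ab-vendor.com/ab.js"; // placeholder URL
  script.async = true;
  document.head.appendChild(script);
}

bootExperimentation();
```

Ideally the flag resolves on the server or at the edge, so the check itself adds no extra client-side round trip before the decision is made.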

Without feature flags, disabling an experiment requires a code change and a deployment. With them, it requires flipping a switch. The difference in practice is the difference between experiments that get cleaned up in days versus experiments that linger for months, or forever. 

The best test is the one that’s over

Every experiment should have a defined exit plan before it launches. If your team cannot answer the question “How will we hardcode the winner and remove the experiment?” before the test starts, you are not ready to run the test. 

The best A/B test is the one that no longer needs to run. It provided its insight, the winner was hardcoded, the loser was deleted, and the testing infrastructure was removed. That's not the end of experimentation — it's what makes the next experiment affordable to run. Your users and your business both benefit from the improved UX and the faster performance. Win-win.
