Experimentation | A/B Testing Analysis | Python, SciPy, Pandas, Matplotlib

Finding the Real Winner in a Paid Media A/B Test

This project evaluates two campaign variants across the full marketing funnel, not just a single headline KPI. The final analysis combines classical hypothesis testing, bootstrap confidence intervals, paired day-level comparisons, Bayesian rate estimation, and diagnostic visual analysis to answer a harder question: does the test really improve business performance, or only early-stage engagement?

A/B TESTING · EXPERIMENT ANALYSIS · BAYESIAN ANALYSIS · BOOTSTRAPPING · FUNNEL ANALYSIS · MARKETING ANALYTICS

+101% CTR uplift for test variant
-43% cart per click for test variant
+68% purchase per impression for test variant
≈0% purchase lift
Business Question

Which campaign actually performs better?

The analysis is designed to go beyond clicks and evaluate whether the test variant improves downstream purchase behavior enough to justify the extra media investment.

Data Setup

60 daily rows across control and test

The dataset covers August 1 to August 30, 2019, with spend, impressions, reach, clicks, searches, content views, add to cart, and purchases for both variants.
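A minimal sketch of the preparation step, assuming raw headers in a Kaggle-style export (the real file's column names may differ; the rename map below is illustrative):

```python
import pandas as pd

# Two illustrative rows with hypothetical raw headers; the real dataset
# has 30 daily rows per variant.
raw = pd.DataFrame({
    "Date": ["1.08.2019", "2.08.2019"],
    "Spend [USD]": [2280, 1757],
    "# of Impressions": [82702, 121040],
    "# of Website Clicks": [7016, 8110],
    "# of Add to Cart": [1819, 1219],
    "# of Purchase": [618, 511],
})

# Standardize columns: strip symbols, lowercase, snake_case.
raw.columns = (raw.columns
               .str.replace(r"[#\[\]]", "", regex=True)
               .str.strip()
               .str.lower()
               .str.replace(r"\s+", "_", regex=True))

df = raw.rename(columns={"spend_usd": "spend",
                         "of_impressions": "impressions",
                         "of_website_clicks": "clicks",
                         "of_add_to_cart": "add_to_cart",
                         "of_purchase": "purchases"})
df["date"] = pd.to_datetime(df["date"], format="%d.%m.%Y")

# Derived rate and cost metrics used throughout the analysis.
df["ctr"] = df["clicks"] / df["impressions"]
df["cart_per_click"] = df["add_to_cart"] / df["clicks"]
df["purchase_per_click"] = df["purchases"] / df["clicks"]
df["purchase_per_impression"] = df["purchases"] / df["impressions"]
df["cpm"] = df["spend"] / df["impressions"] * 1000
```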

Bottom Line

The variants win different parts of the funnel

The test campaign drives much stronger traffic efficiency, while the control campaign retains meaningfully stronger commercial intent deeper in the journey.

Why a single headline metric would be incomplete here

A campaign variant can look successful because it increases clicks, impressions, or other top-of-funnel metrics while simultaneously weakening downstream purchase behavior. In this dataset, that trade-off is exactly what needs to be tested.

This case study treats the campaign as a multi-stage funnel experiment. The analysis does not only compare average levels; it also examines how each variant changes the path from reach to clicks, from clicks to carting, and from traffic to purchases. That makes the recommendation more decision-ready than a one-metric summary.

Project component | What was done
Data preparation | Column standardization, date parsing, derived rate metrics, cost metrics, and missing-value handling
Classical inference | Welch t-tests, Mann-Whitney tests, paired day-level tests, sign tests, and FDR correction
Uncertainty estimation | Bootstrap confidence intervals and permutation tests
Probability view | Bayesian posterior probability of one variant beating the other on key rates
Diagnostics | Outlier detection, weekday effects, cumulative curves, and correlation-shift analysis
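The bootstrap confidence intervals mentioned above can be sketched as a percentile bootstrap on the difference in daily means. The daily CTR series here are synthetic stand-ins; the project resamples the real per-day rates:

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative daily CTR series for test and control (30 days each);
# in the project these come from the prepared campaign DataFrame.
ctr_test = rng.normal(0.102, 0.02, size=30)
ctr_control = rng.normal(0.051, 0.01, size=30)

def bootstrap_ci(a, b, n_boot=10_000, alpha=0.05):
    """Percentile bootstrap CI for the difference in means, mean(a) - mean(b)."""
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        diffs[i] = (rng.choice(a, size=a.size, replace=True).mean()
                    - rng.choice(b, size=b.size, replace=True).mean())
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return lo, hi

lo, hi = bootstrap_ci(ctr_test, ctr_control)
print(f"95% bootstrap CI for CTR difference: [{lo:.4f}, {hi:.4f}]")
```

If the interval excludes zero, the daily CTR gap is unlikely to be resampling noise; the same routine applies to any of the derived rates.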

The test variant wins attention, but the control variant wins shopping intent

The strongest result is not a final purchase lift. The more important story is structural: the test variant is much better at generating traffic from impressions, while the control variant is materially stronger at turning that traffic into carts. Final purchases remain effectively unchanged because upper-funnel gains and mid-funnel losses offset one another.

Formal significance testing was performed for the main comparison set using Welch t-tests, Mann-Whitney tests, permutation tests, bootstrap intervals, paired calendar-day tests, and false-discovery-rate correction. In that framework, the purchase difference itself is not statistically significant, while CTR, purchase per impression, cart per click, and cart rate show much clearer separation.
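Part of that testing battery can be sketched as follows: a Welch t-test and a Mann-Whitney check per metric, with a hand-rolled Benjamini-Hochberg (BH) adjustment. The daily rate series are synthetic placeholders, not the campaign data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Illustrative daily rates for several metrics (test vs control, 30 days each);
# the project runs the same battery on the real campaign columns.
metrics = {
    "ctr": (rng.normal(0.102, 0.02, 30), rng.normal(0.051, 0.01, 30)),
    "cart_per_click": (rng.normal(0.158, 0.04, 30), rng.normal(0.278, 0.05, 30)),
    "purchase_per_click": (rng.normal(0.086, 0.03, 30), rng.normal(0.098, 0.03, 30)),
}

pvals = {}
for name, (test, control) in metrics.items():
    t_p = stats.ttest_ind(test, control, equal_var=False).pvalue  # Welch t-test
    u_p = stats.mannwhitneyu(test, control).pvalue                # rank-based check
    pvals[name] = max(t_p, u_p)  # conservative: keep the weaker of the two

# Benjamini-Hochberg FDR adjustment, implemented directly.
names = list(pvals)
p = np.array([pvals[n] for n in names])
order = np.argsort(p)
m = len(p)
adj = np.minimum.accumulate((p[order] * m / np.arange(1, m + 1))[::-1])[::-1]
p_adj = np.empty(m)
p_adj[order] = np.clip(adj, 0, 1)

for n, raw_p, q in zip(names, p, p_adj):
    print(f"{n:20s} raw p={raw_p:.4g}  BH-adjusted p={q:.4g}")
```

Recent SciPy versions also ship `scipy.stats.false_discovery_control`, which performs the same BH adjustment; the explicit version above makes the rank-scaling visible.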

Focused comparison of the most decision-relevant metrics
[Chart: uplifts for the most decision-relevant A/B testing metrics]
Only the most decision-relevant metrics are shown here, with statistical significance called out directly on the chart.
Traffic efficiency

The test roughly doubles CTR and significantly improves purchases per impression. It is clearly more effective at turning paid visibility into visits.

Commercial quality

The control variant is far better at converting visits into carts. That points to stronger purchase intent, stronger message match, or a more qualified audience.

Spend impact

The test spends more and pays a much higher CPM. Higher traffic volume does not come for free, so incremental engagement needs to be judged against downstream efficiency.

Decision quality

This is not a clean end-to-end win for the test. A click-only read would miss the fact that the final purchase result is statistically flat.

Upper-funnel gains do not translate into a clear purchase lead

The first chart below focuses on funnel rates rather than raw volumes, making the trade-off easier to read. The test variant clearly improves CTR, but the control retains stronger cart formation and slightly stronger purchase efficiency per click.

Where the funnel improves and where it leaks
[Chart: funnel rate comparison between control and test variants]
The test is more efficient at generating visits, but the control remains stronger deeper in the funnel.

The cumulative chart adds the time dimension. It shows that the test campaign keeps building a click advantage across the month, but the cumulative purchase lines stay close together. That visual pattern aligns with the formal tests: a clear traffic effect, but no reliable end-of-funnel purchase win.

Cumulative spend, clicks, and purchases
[Chart: cumulative trends for spend, clicks, and purchases]
The widening click gap is much larger than the purchase gap, which remains narrow throughout the period.

Combining frequentist and Bayesian evidence

Multiple testing lenses are useful here because the dataset is small enough that any one method could be misleading on its own. Independent-sample comparisons, paired day-level tests, bootstrap intervals, permutation tests, and posterior probabilities all point in the same broad direction: the test improves attention efficiency, but not purchase quality per visit.
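The Bayesian lens can be sketched as a Beta-Binomial posterior comparison on CTR. The aggregate counts below are illustrative orders of magnitude, not the project's real totals, and the flat Beta(1, 1) prior is an assumption:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical aggregate counts over the 30-day window.
clicks_test, impressions_test = 180_000, 1_760_000
clicks_ctrl, impressions_ctrl = 155_000, 3_040_000

# Beta(1, 1) prior + Binomial likelihood gives a Beta posterior on each CTR;
# Monte Carlo draws turn the two posteriors into a head-to-head probability.
post_test = rng.beta(1 + clicks_test, 1 + impressions_test - clicks_test, size=100_000)
post_ctrl = rng.beta(1 + clicks_ctrl, 1 + impressions_ctrl - clicks_ctrl, size=100_000)

p_test_beats_ctrl = (post_test > post_ctrl).mean()
print(f"P(test CTR > control CTR) ≈ {p_test_beats_ctrl:.3f}")
```

The same recipe applies to cart rate or purchase per impression; a posterior probability near 0.5 says the data cannot separate the variants on that metric.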

Metric | Test mean | Control mean | Adjusted p-value | Interpretation
CTR | 10.24% | 5.10% | 0.0012 | Statistically supported lift in top-of-funnel engagement
Cart per click | 15.79% | 27.82% | 0.0013 | Control converts visits into shopping intent much better
Purchase per impression | 0.84% | 0.50% | 0.0052 | Test extracts more purchases from each impression
Purchase per click | 8.64% | 9.83% | 0.1977 | Directional control advantage, but not significant after correction
Purchases | 521.23 | 522.79 | 0.9760 | No evidence of a meaningful final-purchase difference

In short: the test variant expands reach efficiently, but the control variant attracts users with stronger buying intent. If the objective is traffic, the test is attractive. If the objective is downstream conversion quality, the control remains safer.

Going beyond averages to understand why the result happens

Additional diagnostic views help separate stable evidence from patterns that are useful but more exploratory. The weekday chart below is a good example: there are visible differences, but none of the weekday-specific purchase-rate gaps remain statistically significant after multiple-testing correction. That means the shape is informative, but it should not be overclaimed.

Average purchase rate by weekday
[Chart: weekday purchase rate comparison between campaign variants]
Weekday variation exists, but the weekday-specific differences are directional rather than statistically confirmed.
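The weekday diagnostic can be sketched as a per-weekday test battery. The daily purchase rates below are synthetic stand-ins for the real per-day series:

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)

# Illustrative daily purchase rates over the 30-day window;
# the project uses the real per-day rates from both campaigns.
days = pd.date_range("2019-08-01", periods=30, freq="D")
daily = pd.DataFrame({
    "date": days,
    "rate_control": rng.normal(0.098, 0.02, 30),
    "rate_test": rng.normal(0.086, 0.02, 30),
})
daily["weekday"] = daily["date"].dt.day_name()

# One Welch t-test per weekday. With only ~4 days per cell these are
# exploratory reads and must survive multiple-testing correction
# before being treated as evidence.
weekday_p = {}
for day, g in daily.groupby("weekday", sort=False):
    weekday_p[day] = stats.ttest_ind(
        g["rate_test"], g["rate_control"], equal_var=False
    ).pvalue

for day, p in weekday_p.items():
    print(f"{day:9s} p={p:.3f}")
```

With seven simultaneous comparisons on tiny cells, the BH-style correction used for the main metrics is what keeps the weekday shape from being overclaimed.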

Recommended interpretation for the business

1. The test should not be declared a universal winner

The extra traffic is real, but the test should not replace the control outright if the business optimizes for efficient purchase behavior rather than raw engagement.

2. The top-of-funnel strength is worth preserving

The click-generation advantage is meaningful. The next iteration should retain that strength while improving what happens after the click.

3. A follow-up test should target the landing-page and cart step

The evidence points to a handoff problem between ad engagement and commercial intent. That is the most promising place for the next experiment.

4. Future experiment reviews should stay multi-metric

This project demonstrates why decision-making should combine funnel metrics, cost metrics, and uncertainty measures instead of treating one KPI as the whole story.

Tools Used

The project was built in Python using Pandas and NumPy for data preparation and metric engineering, SciPy for inferential testing, and Matplotlib for the visual layer. The final analysis combines classical significance testing with resampling methods and Bayesian probability estimates to produce a decision-ready A/B testing workflow.

Full code available on GitHub