Customer Analytics · RFM · Segmentation · 2024

Customer Retention & Value Analysis

A full customer analytics pipeline on a UK e-commerce dataset, from raw transactions to dual-layer RFM segmentation, cohort retention, and commercial buyer deep dives.

E-commerce RFM Analysis Customer Segmentation Cohort Analysis Retention

4,338
Unique customers analyzed
65%
Revenue from top 22% of customers
52%
Revenue from bulk buyers (14% of customers)
4.6×
Bulk vs retail avg order value

The Problem

This analysis focuses on understanding how customer value and retention differ across the customer base.

Customer value is not evenly distributed: a small group of buyers often drives most of the revenue. At the same time, not all high-value customers behave the same.

To account for this, customers are first grouped by purchasing behavior (retail vs. bulk), and then analyzed separately to capture more meaningful patterns in value and retention.

From Raw Transactions to Customer Behavior

The analysis is based on the Online Retail dataset, a one-year transaction log from a UK-based e-commerce retailer.

After cleaning and filtering, the data is used to analyze customer behavior, value, and retention patterns over time.

01
541,909 raw transactions
02
392,692 clean transactions
03
4,338 customers
04
18,532 invoices
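
The exact cleaning rules are not listed here, so the following is only a minimal sketch of what such a step might look like. It assumes the column names of the public Online Retail dataset (InvoiceNo, Quantity, UnitPrice, CustomerID, InvoiceDate) and four assumed filters: drop rows without a customer ID, drop cancellations, keep positive quantities and prices, and drop exact duplicates. The function name clean_transactions and the Revenue column are illustrative, and these rules are not guaranteed to reproduce the row counts above.

```python
import pandas as pd


def clean_transactions(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply assumed cleaning rules to an Online Retail style transaction log."""
    # Rows without a customer ID cannot be attributed to a customer.
    df = raw.dropna(subset=["CustomerID"])
    # Assumption: invoice numbers starting with "C" mark cancellations.
    is_cancellation = df["InvoiceNo"].astype(str).str.startswith("C")
    # Assumption: keep only lines with a positive quantity and unit price.
    df = df[~is_cancellation & (df["Quantity"] > 0) & (df["UnitPrice"] > 0)]
    # Assumption: exact duplicate lines are export artifacts.
    df = df.drop_duplicates()
    # Parse dates and add a line-level revenue column for later steps.
    return df.assign(
        InvoiceDate=pd.to_datetime(df["InvoiceDate"]),
        Revenue=df["Quantity"] * df["UnitPrice"],
    ).reset_index(drop=True)


# Tiny made-up sample for illustration, not real data.
raw = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "C536379", "536414", "536367"],
    "StockCode": ["85123A", "85123A", "D", "22139", "84879"],
    "Quantity": [6, 6, -1, 56, 32],
    "InvoiceDate": ["2010-12-01 08:26", "2010-12-01 08:26",
                    "2010-12-01 09:41", "2010-12-01 11:52",
                    "2010-12-01 08:34"],
    "UnitPrice": [2.55, 2.55, 27.50, 0.0, 1.69],
    "CustomerID": [17850, 17850, 14527, None, 13047],
})
clean = clean_transactions(raw)
print(clean[["InvoiceNo", "Quantity", "Revenue"]])
```

In this sample the duplicate line, the cancellation, and the row with no customer ID are all dropped, leaving two lines.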

Not All Customers Shop the Same Way

Before running any RFM model, customers are classified by purchasing behavior. A composite "bulk score" flags customers who regularly order in large quantities and at high invoice values, consistent with wholesale or commercial purchasing rather than retail browsing.

Thresholds are set at the 85th percentile of three per-customer metrics: avg quantity per invoice, avg invoice value, and max quantity per invoice. Customers above the threshold on at least two of these three metrics are labeled Bulk / commercial-like; the rest are Retail-like.
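
A rough sketch of how this two-of-three rule could be computed with pandas is shown below. It assumes a cleaned line-level table with CustomerID, InvoiceNo, Quantity, and Revenue columns, and it reads "max quantity per invoice" as the largest total quantity on any single invoice. The function name classify_buyers and the column names are illustrative, not taken from the project code.

```python
import numpy as np
import pandas as pd

METRICS = ["avg_qty", "avg_value", "max_qty"]


def classify_buyers(lines: pd.DataFrame, q: float = 0.85,
                    min_flags: int = 2) -> pd.DataFrame:
    """Label customers as bulk-like when they exceed the q-th percentile
    on at least `min_flags` of the three invoice metrics."""
    # Collapse line items into one row per invoice.
    invoices = (
        lines.groupby(["CustomerID", "InvoiceNo"])
        .agg(quantity=("Quantity", "sum"), value=("Revenue", "sum"))
        .reset_index()
    )
    # One row per customer with the three metrics.
    profile = invoices.groupby("CustomerID").agg(
        avg_qty=("quantity", "mean"),
        avg_value=("value", "mean"),
        max_qty=("quantity", "max"),
    )
    # Per-metric thresholds, then count how many each customer exceeds.
    thresholds = profile[METRICS].quantile(q)
    flags = profile[METRICS].gt(thresholds, axis="columns")
    profile["bulk_score"] = flags.sum(axis=1)
    profile["buyer_type"] = np.where(
        profile["bulk_score"] >= min_flags,
        "Bulk / commercial-like",
        "Retail-like",
    )
    return profile


# Made-up example: ten customers with one invoice each.
quantities = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
lines = pd.DataFrame({
    "CustomerID": range(1, 11),
    "InvoiceNo": [f"INV{i}" for i in range(1, 11)],
    "Quantity": quantities,
    "Revenue": [2.0 * qty for qty in quantities],
})
profile = classify_buyers(lines)
print(profile["buyer_type"].value_counts())
```

Requiring two of three flags, rather than one, keeps a single unusually large order from moving an otherwise retail-like customer into the bulk group.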

Figure: Customer Behavior by Buyer Type (log scale). log(1 + avg quantity per invoice) vs. log(1 + avg invoice value), showing two clearly distinct clusters.

The scatter plot shows two clean clusters. Bulk buyers sit at the upper-right: systematically higher quantities and higher invoice values. The separation justifies the classification before any RFM scoring is applied.

Buyer Type | Customers | Customer Share | Total Revenue | Revenue Share | Avg Order Value | Avg Recency
Bulk / commercial-like | 597 | 13.8% | £4,638,014 | 52.2% | £1,282 | 69 days
Retail-like | 3,741 | 86.2% | £4,249,195 | 47.8% | £280 | 96 days

Just 13.8% of customers generate 52% of total revenue, with an average order value 4.6× that of retail buyers. Bulk buyers also return more often and more recently on average (69 vs. 96 days since the last purchase): they are structurally more valuable, not just occasionally larger spenders.

Revenue Is Extremely Concentrated

Across the full customer base, value concentration is striking. The top 1% of customers by revenue account for nearly a third of all sales.

Revenue Share by Top Customer Groups
Top 1% of customers

Generate 32% of total revenue. That's roughly 43 customers responsible for nearly one-third of sales.

Top 5% of customers

Capture 50.5% of revenue, a slim majority driven by a small commercial segment.

Top 10% of customers

Account for 61.5% of revenue, while the remaining 90% of customers generate less than 40%.

Top 20% of customers

Reach 74.7%, close to the classic 80/20 Pareto rule, confirming the structural pattern.
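
The shares above come from ranking customers by revenue and summing the top slice. A minimal sketch of that calculation follows, using made-up revenue figures. The helper name top_share and the rule of rounding the customer count to the nearest whole number are assumptions.

```python
import pandas as pd


def top_share(revenue: pd.Series, pct: float) -> float:
    """Share of total revenue generated by the top `pct` of customers."""
    ranked = revenue.sort_values(ascending=False)
    # Assumption: round to the nearest whole customer, minimum one.
    n_top = max(1, int(round(len(ranked) * pct)))
    return ranked.iloc[:n_top].sum() / ranked.sum()


# Made-up revenue per customer, for illustration only.
revenue = pd.Series(
    [500, 200, 100, 50, 50, 40, 30, 20, 5, 5],
    index=[f"C{i}" for i in range(1, 11)],
)
for pct in (0.10, 0.20):
    print(f"Top {pct:.0%}: {top_share(revenue, pct):.1%} of revenue")
```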

Eight Segments, Two Worlds

RFM scoring assigns each customer a 1–5 score on Recency, Frequency, and Monetary value based on their position within the global customer base. The combined score maps to eight named segments.
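
One common way to produce 1–5 scores is quintile binning, sketched below with pandas. This is an illustration under assumptions, not the project's exact scoring: it assumes a cleaned table with CustomerID, InvoiceNo, InvoiceDate, and Revenue columns, a hypothetical snapshot date, and rank-based tie breaking. The mapping from scores to the eight named segments is not specified here and is not reproduced.

```python
import pandas as pd


def rfm_scores(lines: pd.DataFrame, snapshot: pd.Timestamp) -> pd.DataFrame:
    """Compute Recency, Frequency, Monetary and 1-5 quintile scores."""
    rfm = lines.groupby("CustomerID").agg(
        recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),
        frequency=("InvoiceNo", "nunique"),
        monetary=("Revenue", "sum"),
    )
    # Ranking first breaks ties so that qcut always finds five distinct bins.
    # Recency is reversed: fewer days since the last purchase scores higher.
    rfm["R"] = pd.qcut(rfm["recency"].rank(method="first"), 5,
                       labels=[5, 4, 3, 2, 1]).astype(int)
    rfm["F"] = pd.qcut(rfm["frequency"].rank(method="first"), 5,
                       labels=[1, 2, 3, 4, 5]).astype(int)
    rfm["M"] = pd.qcut(rfm["monetary"].rank(method="first"), 5,
                       labels=[1, 2, 3, 4, 5]).astype(int)
    return rfm


# Made-up data: customer i has i invoices and last bought (11 - i) days ago.
snapshot = pd.Timestamp("2011-12-10")  # hypothetical reference date
rows = []
for i in range(1, 11):
    for k in range(i):
        rows.append({
            "CustomerID": i,
            "InvoiceNo": f"{i}-{k}",
            "InvoiceDate": snapshot - pd.Timedelta(days=(11 - i) + k),
            "Revenue": 10.0 * i,
        })
rfm = rfm_scores(pd.DataFrame(rows), snapshot)
print(rfm[["recency", "frequency", "monetary", "R", "F", "M"]])
```

Running the same function on the retail and bulk subsets separately would give within-type scores instead of global ones.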

Figure: Customer Count by Global RFM Segment
Segment | Customers | Customer Share | Revenue Share | Avg Recency | Avg Frequency
Best customers | 957 | 22.1% | 65.2% | 13 days | 11.1 invoices
Big spenders | 344 | 7.9% | 10.3% | 98 days | 2.2 invoices
Loyal high value | 235 | 5.4% | 8.1% | 50 days | 6.4 invoices
Mid-value customers | 1,119 | 25.8% | 5.9% | 94 days | 1.8 invoices
At risk high value | 168 | 3.9% | 4.2% | 126 days | 5.5 invoices
Loyal customers | 375 | 8.6% | 2.7% | 64 days | 3.9 invoices
Low value / inactive | 821 | 18.9% | 2.1% | 228 days | 1.0 invoices
Recent customers | 319 | 7.4% | 1.6% | 19 days | 1.2 invoices

The Best customers segment (22% of the base) generates 65% of revenue and has an average recency of just 13 days: these customers are active, frequent, and high-value. The contrast with Mid-value (26% of customers, 6% of revenue) illustrates why flat engagement strategies miss the point entirely.

Note on the "Big spenders" segment: These customers score high on Monetary but low on Frequency and Recency — they buy large amounts infrequently. Many of them are bulk buyers whose purchasing rhythm naturally looks like churn in a global RFM model. This is exactly why within-type scoring matters.

Bulk Customers Show Stronger Repeat-Purchase Retention

Monthly cohort retention tracks the share of customers from each acquisition month who return to purchase in later months. Comparing retail and bulk/commercial-like buyers reveals a clear structural difference in purchasing behavior.
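
A retention matrix of this kind can be built in a few lines of pandas, as in the sketch below. It assumes a cleaned table with CustomerID and InvoiceDate columns and numbers the acquisition month as 0, which may differ from the month numbering used in the charts. The function name cohort_retention is illustrative.

```python
import pandas as pd


def cohort_retention(lines: pd.DataFrame) -> pd.DataFrame:
    """Share of each acquisition cohort that purchases again, by month offset."""
    df = lines[["CustomerID", "InvoiceDate"]].copy()
    # Encode each calendar month as a single integer so offsets are subtractions.
    df["order_month"] = (df["InvoiceDate"].dt.year * 12
                         + df["InvoiceDate"].dt.month - 1)
    # A customer's cohort is the month of their first purchase.
    df["cohort"] = df.groupby("CustomerID")["order_month"].transform("min")
    df["month_index"] = df["order_month"] - df["cohort"]
    # Distinct active customers per cohort and month offset.
    counts = (df.groupby(["cohort", "month_index"])["CustomerID"]
              .nunique().unstack())
    counts.index = [f"{int(m) // 12}-{int(m) % 12 + 1:02d}"
                    for m in counts.index]
    # Divide each row by its cohort size (the month 0 count).
    return counts.div(counts[0], axis=0)


# Made-up purchases for four customers.
purchases = pd.DataFrame({
    "CustomerID": ["A", "A", "A", "B", "B", "C", "C", "D"],
    "InvoiceDate": pd.to_datetime([
        "2010-12-05", "2011-01-10", "2011-02-03",
        "2010-12-20", "2011-02-14",
        "2011-01-08", "2011-02-21",
        "2010-12-30",
    ]),
})
retention = cohort_retention(purchases)
print(retention.round(3))
```

Calling the function once on retail-like customers and once on bulk / commercial-like customers yields the two matrices being compared.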

Figure: Monthly Cohort Retention, Retail Customers
Figure: Monthly Cohort Retention, Bulk / Commercial-like Customers

The difference is visible across the most mature cohorts. In the December 2010 cohort, retail customers retain at 33.2% in month 2, while bulk customers retain at 55.1%. More importantly, bulk cohorts continue to hold much higher retention across later months, often staying in the 45–60% range where retail customers more often remain closer to 20–35%.

This suggests that bulk/commercial-like buyers are not simply higher spenders, but customers with a more recurring purchasing pattern, likely driven by operational replenishment rather than occasional shopping.

Who Stays, Who Slips, Who Recovers

To understand segment stability over time, the bulk customer dataset is split into early and late periods. RFM scoring is applied independently to each period, and the resulting segments are cross-tabulated to produce a transition matrix.
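
The cross-tabulation step can be sketched as follows, assuming each period's segment labels are already available as a Series indexed by customer ID. The split date and the per-period scoring are not shown, the labels in the example are made up, and keeping only customers present in both periods is an assumption.

```python
import pandas as pd


def transition_matrix(early: pd.Series, late: pd.Series) -> pd.DataFrame:
    """Row-normalized share of each early-period segment ending up
    in each late-period segment."""
    # Assumption: only customers with a segment in both periods are compared.
    both = pd.concat(
        [early.rename("early"), late.rename("late")],
        axis=1, join="inner",
    )
    # normalize="index" makes every row sum to 1.
    return pd.crosstab(both["early"], both["late"], normalize="index")


# Made-up segment labels keyed by customer ID.
early = pd.Series({1: "Best", 2: "Best", 3: "Best",
                   4: "Loyal", 5: "Loyal", 6: "Recent"})
late = pd.Series({1: "Best", 2: "Best", 3: "Loyal",
                  4: "Best", 5: "Loyal", 7: "Recent"})
matrix = transition_matrix(early, late)
print(matrix.round(3))
```

Each row then reads as "of the customers who started in this segment, what share ended in each segment", which is how the percentages in this section are expressed.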

Figure: Bulk Customer Segment Transitions, Early vs. Late Period

Three patterns stand out. First, Best customers are remarkably sticky: 68.9% remain Best customers in the late period. Second, Loyal high value customers have strong upward mobility: 62.5% graduate to Best customers, suggesting this segment is a pipeline, not a ceiling. Third, Recent customers mostly slide to Mid-value (53.9%), with none graduating to Best; they need nurturing before they stabilize at higher value.

Retention — Best Customers

68.9% stay Best customers period over period. Once a commercial account is active and engaged, it tends to stay that way.

Upgrade — Loyal High Value

62.5% of Loyal high value upgrade to Best. This is the most valuable migration path, and a signal that frequency, not just spend, drives the upgrade.

Warning — At Risk High Value

36.4% of At risk customers become Recent customers: they haven't churned yet, but they've stopped buying.

Concern — Recent Customers

53.9% of Recent customers move to Mid-value, not upward. New bulk buyers need active engagement early to develop into high-frequency accounts.

Four Actionable Priorities

1

Protect the Best customers bulk segment above all else

206 customers generating an average of £16,462 each and purchasing every 43 days. A 10% churn rate here costs more than losing the entire Low value / inactive segment. Account management, priority fulfillment, and proactive outreach belong here.

2

Invest in Loyal high value → Best customer conversion

62.5% of Loyal high value bulk customers upgrade on their own. A structured push (volume incentives, product recommendations, or dedicated account support) could accelerate this pipeline and increase its size.

3

Act on At risk high value before the window closes

These are high-revenue customers going quiet. With an average recency of 126 days and strong historical spend, they're worth targeted re-engagement. The transition matrix shows 36% are already sliding to Recent, so the time to act is now.

4

Build onboarding for new bulk accounts

Recent bulk customers mostly drift to Mid-value rather than climbing. An early engagement sequence (product education, reorder prompts calibrated to the 10–50 day purchase cycle) could meaningfully shift this trajectory.

Tools Used

The full pipeline is written in Python with Pandas and NumPy for data preparation, transformation, and feature engineering. Visualizations were produced with Matplotlib and Seaborn.

→ Full code available on GitHub