The Objective
The goal is to use customer purchase history to answer two questions that matter for CRM and growth: who is likely to purchase again, and which customers are likely to generate future revenue.
The project therefore treats customer analytics as a decision-support problem: how to identify customers worth retaining, customers worth reactivating, and customers worth prioritizing commercially.
From Raw Transactions to a Predictive Customer Table
The project uses the Online Retail II dataset, a transaction log from a UK-based online retailer. After removing rows without customer identifiers, the dataset contained 824,364 customer-linked records. From there, the data was cleaned to retain a consistent transaction base for customer-level modeling.
The resulting analytical dataset contains 805,549 purchase rows, covering 5,878 customers and 36,969 invoices. These transactions are then aggregated into a customer-level table with features such as recency, transaction frequency, historical revenue, invoice value, basket size, and purchase timing.
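The aggregation step can be sketched with pandas. Column names here (`customer_id`, `invoice`, `invoice_date`, `revenue`) are illustrative stand-ins for the Online Retail II fields, and the tiny inline frame replaces the real transaction log:

```python
import pandas as pd

# Toy transaction log standing in for the cleaned Online Retail II data;
# real column names differ (Invoice, InvoiceDate, Customer ID, ...).
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "invoice": ["A1", "A2", "B1", "B1", "B2"],
    "invoice_date": pd.to_datetime(
        ["2010-01-05", "2010-06-01", "2010-03-10", "2010-03-10", "2010-09-20"]
    ),
    "revenue": [20.0, 35.0, 10.0, 15.0, 50.0],  # quantity * unit price per row
})

# Snapshot date one day after the last transaction, used to compute recency.
snapshot = df["invoice_date"].max() + pd.Timedelta(days=1)

# Collapse purchase rows into one feature row per customer.
customers = df.groupby("customer_id").agg(
    recency_days=("invoice_date", lambda s: (snapshot - s.max()).days),
    frequency=("invoice", "nunique"),
    historical_revenue=("revenue", "sum"),
)
customers["avg_invoice_value"] = (
    customers["historical_revenue"] / customers["frequency"]
)
print(customers)
```

The same pattern extends to basket size and purchase-timing features by adding more named aggregations.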
Two Forecasting Questions, Two Targets
Instead of modeling a single broad notion of LTV, the project separates the problem into two more operational questions. The first is a binary classification task: whether the customer purchases again in the future window. The second is a regression task: how much revenue that customer generates in the future window.
This separation is useful because repurchase and value are related but not identical. A customer can return and still produce little value, while a smaller group of customers can generate a disproportionate share of revenue.
| Prediction question | Target | Business use |
|---|---|---|
| Will this customer purchase again? | will_purchase_again | Retention targeting, win-back, churn prevention |
| How much revenue will this customer generate? | future_revenue | Customer prioritization, value tiering, budget allocation |
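Both targets can be built from the same observation/prediction split. A minimal sketch, using toy data and treating any future revenue as evidence of a repeat purchase:

```python
import pandas as pd

# Toy transaction log; the cutoff dates mirror the project's windows.
tx = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3],
    "invoice_date": pd.to_datetime(
        ["2010-02-01", "2011-01-15", "2010-07-01", "2010-05-05", "2011-03-01"]
    ),
    "revenue": [100.0, 40.0, 60.0, 25.0, 80.0],
})

cutoff = pd.Timestamp("2010-11-30")
obs = tx[tx["invoice_date"] <= cutoff]                       # feature window
fut = tx[(tx["invoice_date"] > cutoff)
         & (tx["invoice_date"] <= "2011-05-31")]             # target window

# One row per customer seen in the observation window.
targets = obs[["customer_id"]].drop_duplicates().set_index("customer_id")

# Regression target: revenue generated in the future window (0 if absent).
future_rev = fut.groupby("customer_id")["revenue"].sum()
targets["future_revenue"] = future_rev.reindex(targets.index).fillna(0.0)

# Classification target: did the customer transact again at all?
targets["will_purchase_again"] = (targets["future_revenue"] > 0).astype(int)
print(targets)
```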
Benchmarking Six Models Across Both Tasks
Three models were compared for each task. For repurchase classification, the benchmark included Logistic Regression, Random Forest, and Gradient Boosting. For future revenue, it included Ridge, Random Forest Regressor, and Gradient Boosting Regressor.
The winning models were selected through cross-validation and then evaluated on an untouched test set.
| Task | Selected model | Key metrics | Reason it won |
|---|---|---|---|
| Repurchase classification | Logistic Regression | ROC-AUC 0.75, average precision 0.74, accuracy 0.69 | Strong ranking performance with high interpretability |
| Future revenue regression | Random Forest Regressor | R² 0.29 on log revenue target | Better fit for nonlinear, highly skewed spending behavior |
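The benchmarking loop for the classification task can be sketched with scikit-learn. Synthetic data stands in for the customer table, and ROC-AUC is the assumed selection metric; only the winner would ever touch the held-out test split:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the customer-level feature table.
X, y = make_classification(n_samples=600, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

candidates = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random_forest": RandomForestClassifier(random_state=42),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
}

# Rank candidates by cross-validated ROC-AUC on the training split only.
cv_scores = {
    name: cross_val_score(m, X_train, y_train, cv=5, scoring="roc_auc").mean()
    for name, m in candidates.items()
}
best_name = max(cv_scores, key=cv_scores.get)
print(best_name, round(cv_scores[best_name], 3))
```

The regression benchmark follows the same shape with Ridge, RandomForestRegressor, and GradientBoostingRegressor and a regression scorer.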
Predicting Who Will Buy Again
The repurchase model learns from completed purchases made between December 1, 2009 and November 30, 2010, and predicts whether those customers purchase again between December 1, 2010 and May 31, 2011. In that future window, 48.1% of customers actually purchased again.
Within that setup, logistic regression ranked customers better than the more complex alternatives. That matters because the output is not only predictive, but also easy to explain operationally: customers can be scored by return likelihood and then moved into retention, reminder, or reactivation flows depending on campaign cost and urgency.
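Moving scored customers into flows can be as simple as bucketing predicted probabilities. The cut-offs below are hypothetical placeholders, not the project's thresholds, and the model is fit on synthetic data purely to produce scores:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Fit on synthetic data just to obtain return probabilities to bucket.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)
scores = pd.Series(model.predict_proba(X)[:, 1], name="p_return")

# Illustrative thresholds; real cut-offs should reflect campaign cost
# and urgency, as the text notes.
flows = pd.cut(
    scores,
    bins=[0.0, 0.3, 0.6, 1.0],
    labels=["reactivation", "reminder", "retention"],
    include_lowest=True,
)
print(flows.value_counts())
```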
| Repurchase model summary | Value |
|---|---|
| Observation window | 01/12/2009 to 30/11/2010 |
| Prediction window | 01/12/2010 to 31/05/2011 |
| Customers in modeling table | 4,266 |
| Share who purchased again | 48.1% |
| Selected model | Logistic Regression |
| ROC-AUC | 0.75 |
| Average precision | 0.74 |
| Accuracy | 0.69 |
Predicting Future Customer Revenue
The revenue model uses the same historical window, but answers a harder question: how much future revenue each customer is likely to generate in the next six months. In the prediction window, future revenue is extremely uneven. Many customers generate nothing, while a smaller set accounts for most of the commercial value.
That is why the regression task is useful. The goal is not precise invoice-level forecasting, but ranking customers by expected value so the business can distinguish between customers likely to return at low value and customers worth stronger commercial attention. For this task, the random forest regressor performed best.
| Revenue model summary | Value |
|---|---|
| Observation window | 01/12/2009 to 30/11/2010 |
| Prediction window | 01/12/2010 to 31/05/2011 |
| Average future revenue per customer (GBP) | 748 |
| Selected model | Random Forest Regressor |
| R² on log target | 0.29 |
| RMSE on log target | 2.77 |
| Revenue-scale MAE (GBP) | 430 |
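The log-target trick can be sketched as follows: train on `log1p(revenue)` to tame the skew, then invert with `expm1` when reporting errors on the revenue scale. The skewed synthetic target here is a stand-in for real customer revenue:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Build a highly skewed synthetic "revenue" target from a linear signal.
X, y_lin = make_regression(n_samples=500, n_features=6, noise=5.0, random_state=1)
revenue = np.expm1((y_lin - y_lin.min()) / (y_lin.max() - y_lin.min()) * 6)

X_tr, X_te, rev_tr, rev_te = train_test_split(X, revenue, random_state=1)

# Fit on the log scale, where the target is far better behaved.
model = RandomForestRegressor(random_state=1).fit(X_tr, np.log1p(rev_tr))
log_pred = model.predict(X_te)

r2_log = r2_score(np.log1p(rev_te), log_pred)              # log-scale fit
mae_rev = mean_absolute_error(rev_te, np.expm1(log_pred))  # revenue-scale error
print(round(r2_log, 3), round(mae_rev, 1))
```

Reporting R² on the log target and MAE back on the revenue scale matches the two rows in the summary table above.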
What the Models Learn About Customer Value
After selecting the best model for each task, the next step is interpretation. Instead of leaving the models as black boxes, the project translates them into operational signals: which customer behaviors are associated with repurchase, and which ones matter most when forecasting future value.
Longer customer history, broader baskets, and stronger past spend are associated with a higher chance of returning. In practical terms, customers who have built a more established relationship with the store are much easier to retain than newly acquired, low-depth buyers.
Past revenue, recency, and transaction count dominate the future revenue forecast. The model is effectively saying that value tends to come from customers who were already commercially meaningful and still look active.
Recency appears in both models for a reason. It is one of the clearest signs that a customer is still engaged, making it especially useful for retention timing and reminder-based interventions.
The models reward relationship depth, not just isolated spend spikes. Frequency, product breadth, and sustained activity matter more than single large purchases when identifying customers worth prioritizing.
Across both tasks, the most informative features are consistent: recency, transaction frequency, historical revenue, and basket behavior.
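The interpretation layer can be sketched by reading standardized coefficients from the logistic model and impurity-based importances from the forest. Feature names and data here are illustrative, not the project's actual table:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

feature_names = ["recency_days", "frequency", "historical_revenue",
                 "avg_invoice_value", "basket_size", "tenure_days"]
X, y = make_classification(n_samples=400, n_features=6, random_state=7)
X_std = StandardScaler().fit_transform(X)

# Standardized coefficients: sign gives direction, magnitude gives strength.
clf = LogisticRegression(max_iter=1000).fit(X_std, y)
coefs = pd.Series(clf.coef_[0], index=feature_names).sort_values()

# Impurity-based importances: relative weight only, no sign or direction.
reg = RandomForestRegressor(random_state=7).fit(X, y.astype(float))
importances = pd.Series(reg.feature_importances_, index=feature_names)

print(coefs.round(2))
print(importances.sort_values(ascending=False).round(2))
```

The two views are complementary: coefficients say which way a behavior pushes repurchase odds, while importances say how much the forest leans on each feature when forecasting value.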
How the Predictions Can Be Used
Use repurchase scores for retention and win-back targeting
The classification model helps identify customers who are less likely to return. That makes it useful for reminder flows, reactivation campaigns, and retention outreach before the customer fully drops out.
Use revenue forecasts to prioritize attention and budget
The revenue model adds a second layer to the decision. Among customers likely to remain active, it helps distinguish between lower-value returners and customers worth stronger investment.
Combine both signals for CRM segmentation
Customers can be grouped into practical action tiers: likely to return and high value, likely to return but lower value, at-risk but historically important, or low-priority. That structure is much more actionable than a single score on its own.
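The four tiers above can be derived mechanically once both scores exist. A minimal sketch, with hypothetical thresholds and a hand-built scored table:

```python
import numpy as np
import pandas as pd

# Hypothetical scored customers: return probability plus expected revenue.
scored = pd.DataFrame({
    "p_return": [0.85, 0.80, 0.20, 0.15],
    "expected_revenue": [900.0, 120.0, 700.0, 30.0],
    "historical_revenue": [2000.0, 300.0, 1500.0, 50.0],
})

likely = scored["p_return"] >= 0.5          # illustrative cut-off
high_value = scored["expected_revenue"] >= 500.0

scored["tier"] = np.select(
    [likely & high_value,
     likely & ~high_value,
     ~likely & (scored["historical_revenue"] >= 1000.0)],
    ["return_high_value", "return_low_value", "at_risk_important"],
    default="low_priority",
)
print(scored["tier"].tolist())
```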
Support customer strategy with an end-to-end pipeline
The project demonstrates a complete customer-level workflow: cleaning, feature engineering, supervised target design, model comparison, testing, and interpretation. It goes beyond exploratory analysis without losing the business framing.
Tools Used
The full workflow was built in Python. Data preparation and feature engineering were handled with Pandas and NumPy. Visual outputs were generated with Matplotlib. The predictive layer used scikit-learn for train/test splitting, cross-validation, model benchmarking, logistic regression, random forest modeling, and feature interpretation.