Predictive Analytics · Customer Value · CRM · 2026

Predicting Repurchase
& Customer Lifetime Value

A customer growth analysis built on e-commerce data. It combines retention, revenue concentration, and cohort behavior with a predictive layer based on logistic regression and random forest models to support customer prioritization, retention strategy, and marketing decision-making.

Predictive Analytics · Customer LTV · Random Forest · Logistic Regression · CRM Strategy

5.9K · Customers analyzed
4.3K · Customers in final modeling table
48.1% · Customers who purchased again
6 · Predictive models benchmarked

The Objective

The goal is to use customer purchase history to answer two questions that matter for CRM and growth: who is likely to purchase again, and which customers are likely to generate future revenue.

The project therefore treats customer analytics as a decision-support problem: how to identify customers worth retaining, customers worth reactivating, and customers worth prioritizing commercially.

From Raw Transactions to a Predictive Customer Table

The project uses the Online Retail II dataset, a transaction log from a UK-based online retailer. After removing rows without customer identifiers, the dataset contained 824,364 customer-linked records. From there, the data was cleaned to retain a consistent transaction base for customer-level modeling.

The resulting analytical dataset contains 805,549 purchase rows, covering 5,878 customers and 36,969 invoices. These transactions are then aggregated into a customer-level table with features such as recency, transaction frequency, historical revenue, invoice value, basket size, and purchase timing.

01 · 824,364 customer-linked transaction rows after excluding missing customer IDs
02 · 805,549 purchase rows retained after data cleaning and consistency checks
03 · 20 engineered features describing purchase history, basket behavior, and timing
04 · 4,266 customers in the final supervised modeling dataset
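As a rough sketch, the transaction-to-customer aggregation could look like the following. This assumes a cleaned pandas DataFrame of transactions with hypothetical column names (customer_id, invoice_id, invoice_date, quantity, price); the project's real table has 20 engineered features, and only a handful are shown here.

```python
import pandas as pd

def build_customer_table(tx: pd.DataFrame, as_of: pd.Timestamp) -> pd.DataFrame:
    """Aggregate transaction rows into one row per customer."""
    tx = tx.copy()
    tx["revenue"] = tx["quantity"] * tx["price"]
    grouped = tx.groupby("customer_id")
    out = pd.DataFrame({
        # days since the customer's most recent purchase
        "recency_days": (as_of - grouped["invoice_date"].max()).dt.days,
        # number of distinct invoices (transaction frequency)
        "frequency": grouped["invoice_id"].nunique(),
        # total historical revenue
        "revenue": grouped["revenue"].sum(),
        # average basket size in units per invoice
        "avg_basket_units": grouped["quantity"].sum() / grouped["invoice_id"].nunique(),
    })
    return out.reset_index()
```

Each row of the output describes one customer's purchase history as of the chosen reference date, which is what the supervised models consume downstream.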

Two Forecasting Questions, Two Targets

Instead of modeling a single broad notion of LTV, the project separates the problem into two more operational questions. The first is a binary classification task: whether the customer purchases again in the future window. The second is a regression task: how much revenue that customer generates in the future window.

This separation is useful because repurchase and value are related but not identical. A customer can return and still produce little value, while a smaller group of customers can generate a disproportionate share of revenue.

Prediction question | Target | Business use
Will this customer purchase again? | will_purchase_again | Retention targeting, win-back, churn prevention
How much revenue will this customer generate? | future_revenue | Customer prioritization, value tiering, budget allocation
Modeling logic: features are built from an observation window and targets are measured in a later prediction window, so the setup mirrors how a business would score customers in practice.
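A minimal sketch of how both targets could be derived from the prediction window, assuming the same hypothetical transaction columns as above; the window boundaries follow the dates used in the project.

```python
import pandas as pd

OBS_END = pd.Timestamp("2010-11-30")   # observation window ends here
PRED_END = pd.Timestamp("2011-05-31")  # prediction window ends here

def build_targets(tx: pd.DataFrame) -> pd.DataFrame:
    """Derive both supervised targets from the prediction window."""
    future = tx[(tx["invoice_date"] > OBS_END) & (tx["invoice_date"] <= PRED_END)]
    # future revenue per customer, summed over the prediction window
    revenue = (future["quantity"] * future["price"]).groupby(future["customer_id"]).sum()
    # every customer seen in the observation window gets a labeled row
    observed = tx.loc[tx["invoice_date"] <= OBS_END, "customer_id"].unique()
    targets = pd.DataFrame({"customer_id": observed})
    targets["future_revenue"] = targets["customer_id"].map(revenue).fillna(0.0)
    targets["will_purchase_again"] = targets["customer_id"].isin(future["customer_id"]).astype(int)
    return targets
```

Because features come only from the observation window and targets only from the later window, the setup avoids leaking future information into the model inputs.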

Benchmarking Six Models Across Both Tasks

Three models were compared for each task. For repurchase classification, the benchmark included Logistic Regression, Random Forest, and Gradient Boosting. For future revenue, it included Ridge, Random Forest Regressor, and Gradient Boosting Regressor.

The winning models were selected through cross-validation and then evaluated on an untouched test set.
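The benchmarking loop for the classification task could be sketched as follows, using synthetic stand-in data in place of the real customer table; the three candidates match the classifiers named above, and the exact hyperparameters are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# stand-in data; in the project this would be the customer feature table
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

candidates = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# rank candidates by mean cross-validated ROC-AUC on the training data,
# then evaluate only the winner on the untouched test set
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    for name, model in candidates.items()
}
best = max(scores, key=scores.get)
```

Scoring by ROC-AUC rather than accuracy matters here because the downstream use is ranking customers by return likelihood, not hitting a single decision threshold.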

Task | Selected model | Key metrics | Reason it won
Repurchase classification | Logistic Regression | ROC-AUC 0.75, average precision 0.74, accuracy 0.69 | Strong ranking performance with high interpretability
Future revenue regression | Random Forest Regressor | R² 0.29 on log revenue target | Better fit for nonlinear, highly skewed spending behavior

Predicting Who Will Buy Again

The repurchase model learns from completed purchases made between December 1, 2009 and November 30, 2010, and predicts whether those customers purchase again between December 1, 2010 and May 31, 2011. In that future window, 48.1% of customers actually purchased again.

Within that setup, logistic regression ranked customers better than the more complex alternatives. That matters because the output is not only predictive, but also easy to explain operationally: customers can be scored by return likelihood and then moved into retention, reminder, or reactivation flows depending on campaign cost and urgency.

Repurchase model summary | Value
Observation window | 01/12/2009 to 30/11/2010
Prediction window | 01/12/2010 to 31/05/2011
Customers in modeling table | 4,266
Share who purchased again | 48.1%
Selected model | Logistic Regression
ROC-AUC | 0.75
Average precision | 0.74
Accuracy | 0.69
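Operationally, the scoring step described above could be sketched like this: rank customers by predicted return probability and route them into a campaign flow. The thresholds are illustrative assumptions, not the project's actual cutoffs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def score_and_route(model: LogisticRegression, X: np.ndarray) -> list[str]:
    """Map each customer's return probability to a campaign flow."""
    proba = model.predict_proba(X)[:, 1]  # P(purchase again)
    flows = np.select(
        [proba >= 0.7, proba >= 0.4],     # hypothetical cutoffs
        ["retention", "reminder"],
        default="reactivation",
    )
    return list(flows)
```

In practice the cutoffs would be tuned against campaign cost and expected uplift rather than fixed in advance.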

Predicting Future Customer Revenue

The revenue model uses the same historical window, but answers a harder question: how much future revenue each customer is likely to generate in the next six months. In the prediction window, future revenue is extremely uneven. Many customers generate nothing, while a smaller set accounts for most of the commercial value.

That is why the regression task is useful. The goal is not precise invoice-level forecasting, but ranking customers by expected value so the business can distinguish between customers likely to return at low value and customers worth stronger commercial attention. For this task, the random forest regressor performed best.

Revenue model summary | Value
Observation window | 01/12/2009 to 30/11/2010
Prediction window | 01/12/2010 to 31/05/2011
Average future revenue per customer | 748
Selected model | Random Forest Regressor
R² on log target | 0.29
RMSE on log target | 2.77
Revenue-scale MAE | 430
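The log-target setup implied by the table above can be sketched as follows: train on log(1 + revenue) to tame the skew, then transform predictions back to the revenue scale for the MAE. Synthetic stand-in data is used in place of the real feature table.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                      # stand-in features
y_rev = np.expm1(np.abs(X @ rng.normal(size=5)))   # skewed, non-negative "revenue"

X_tr, X_te, y_tr, y_te = train_test_split(X, y_rev, random_state=0)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_tr, np.log1p(y_tr))          # fit on log(1 + revenue)

log_pred = model.predict(X_te)
r2_log = r2_score(np.log1p(y_te), log_pred)              # R² on the log scale
mae_rev = mean_absolute_error(y_te, np.expm1(log_pred))  # MAE back on revenue scale
```

Reporting R² on the log scale and MAE on the revenue scale mirrors the two jobs of the model: ranking customers by expected value, and giving a magnitude the business can interpret.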

What the Models Learn About Customer Value

After selecting the best model for each task, the next step is interpretation. Instead of leaving the models as black boxes, the project translates them into operational signals: which customer behaviors are associated with repurchase, and which ones matter most when forecasting future value.

Figure: Drivers of Repurchase Probability (blue increases return likelihood, red decreases it).
Figure: Drivers of Future Revenue (larger bars indicate the features the revenue model relies on most).
Repurchase

Longer customer history, broader baskets, and stronger past spend are associated with a higher chance of returning. In practical terms, customers who have built a more established relationship with the store are much easier to retain than newly acquired, low-depth buyers.

Revenue

Past revenue, recency, and transaction count dominate the future revenue forecast. The model is effectively saying that value tends to come from customers who were already commercially meaningful and still look active.

CRM Use

Recency appears in both models for a reason. It is one of the clearest signs that a customer is still engaged, making it especially useful for retention timing and reminder-based interventions.

Commercial Meaning

The models reward relationship depth, not just isolated spend spikes. Frequency, product breadth, and sustained activity matter more than single large purchases when identifying customers worth prioritizing.

Across both tasks, the most informative features are consistent: recency, transaction frequency, historical revenue, and basket behavior.

How the Predictions Can Be Used

1

Use repurchase scores for retention and win-back targeting

The classification model helps identify customers who are less likely to return. That makes it useful for reminder flows, reactivation campaigns, and retention outreach before the customer fully drops out.

2

Use revenue forecasts to prioritize attention and budget

The revenue model adds a second layer to the decision. Among customers likely to remain active, it helps distinguish between lower-value returners and customers worth stronger investment.

3

Combine both signals for CRM segmentation

Customers can be grouped into practical action tiers: likely to return and high value, likely to return but lower value, at-risk but historically important, or low-priority. That structure is much more actionable than a single score on its own.
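The two-signal segmentation described above could be sketched as a simple rule; the tier labels follow the text, while the probability and value cutoffs are illustrative assumptions.

```python
def crm_tier(p_return: float, pred_revenue: float, past_revenue: float,
             p_cut: float = 0.5, value_cut: float = 500.0) -> str:
    """Combine repurchase probability and revenue forecasts into an action tier."""
    if p_return >= p_cut and pred_revenue >= value_cut:
        return "likely to return, high value"
    if p_return >= p_cut:
        return "likely to return, lower value"
    if past_revenue >= value_cut:
        # unlikely to return, but commercially significant in the past
        return "at-risk but historically important"
    return "low priority"
```

For example, a customer with a 0.2 return probability but substantial historical revenue lands in the at-risk tier, which is exactly the segment a win-back campaign should reach first.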

4

Support customer strategy with an end-to-end pipeline

The project demonstrates a complete customer-level workflow: cleaning, feature engineering, supervised target design, model comparison, testing, and interpretation. It goes beyond exploratory analysis without losing the business framing.

Tools Used

The full workflow was built in Python. Data preparation and feature engineering were handled with Pandas and NumPy. Visual outputs were generated with Matplotlib. The predictive layer used scikit-learn for train/test splitting, cross-validation, model benchmarking, logistic regression, random forest modeling, and feature interpretation.

Full code available on GitHub