The Objective
The goal is to use customer purchase history to answer two questions that matter for CRM and growth: who is likely to purchase again, and which customers are likely to generate future revenue.
The project therefore treats customer analytics as a decision-support problem: how to identify customers worth retaining, customers worth reactivating, and customers worth prioritizing commercially.
From Raw Transactions to a Predictive Customer Table
The project uses the Online Retail II dataset, a transaction log from a UK-based online retailer. After removing rows without customer identifiers, the dataset contained 824,364 customer-linked records. From there, the data was cleaned to retain a consistent transaction base for customer-level modeling.
The resulting analytical dataset contains 805,549 purchase rows, covering 5,878 customers and 36,969 invoices. These transactions are then aggregated into a customer-level table with features such as recency, transaction frequency, historical revenue, invoice value, basket size, and purchase timing.
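The aggregation step can be sketched with pandas. Column names here (`customer_id`, `invoice`, `invoice_date`, `revenue`) are illustrative stand-ins for the Online Retail II fields, and the tiny inline frame replaces the real transaction log:

```python
import pandas as pd

# Toy transaction log standing in for the cleaned Online Retail II data;
# real column names differ (Invoice, InvoiceDate, Customer ID, ...).
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "invoice": ["A1", "A2", "B1", "B1", "B2"],
    "invoice_date": pd.to_datetime(
        ["2010-01-05", "2010-06-01", "2010-03-10", "2010-03-10", "2010-09-20"]
    ),
    "revenue": [20.0, 35.0, 10.0, 15.0, 50.0],  # quantity * unit price per row
})

# Snapshot date one day after the last transaction, used to compute recency.
snapshot = df["invoice_date"].max() + pd.Timedelta(days=1)

# Collapse purchase rows into one feature row per customer.
customers = df.groupby("customer_id").agg(
    recency_days=("invoice_date", lambda s: (snapshot - s.max()).days),
    frequency=("invoice", "nunique"),
    historical_revenue=("revenue", "sum"),
)
customers["avg_invoice_value"] = (
    customers["historical_revenue"] / customers["frequency"]
)
print(customers)
```

The same pattern extends to basket size and purchase-timing features by adding more named aggregations.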
Two Forecasting Questions, Two Targets
Instead of modeling a single broad notion of LTV, the project separates the problem into two more operational questions. The first is a binary classification task: whether the customer purchases again in the future window. The second is a regression task: how much revenue that customer generates in the future window.
This separation is useful because repurchase and value are related but not identical. A customer can return and still produce little value, while a smaller group of customers can generate a disproportionate share of revenue.
| Prediction question | Target | Business use |
|---|---|---|
| Will this customer purchase again? | will_purchase_again | Retention targeting, win-back, churn prevention |
| How much revenue will this customer generate? | future_revenue | Customer prioritization, value tiering, budget allocation |
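Both targets can be built from the same observation/prediction split. A minimal sketch, using toy data and treating any future revenue as evidence of a repeat purchase:

```python
import pandas as pd

# Toy transaction log; the cutoff dates mirror the project's windows.
tx = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3],
    "invoice_date": pd.to_datetime(
        ["2010-02-01", "2011-01-15", "2010-07-01", "2010-05-05", "2011-03-01"]
    ),
    "revenue": [100.0, 40.0, 60.0, 25.0, 80.0],
})

cutoff = pd.Timestamp("2010-11-30")
obs = tx[tx["invoice_date"] <= cutoff]                       # feature window
fut = tx[(tx["invoice_date"] > cutoff)
         & (tx["invoice_date"] <= "2011-05-31")]             # target window

# One row per customer seen in the observation window.
targets = obs[["customer_id"]].drop_duplicates().set_index("customer_id")

# Regression target: revenue generated in the future window (0 if absent).
future_rev = fut.groupby("customer_id")["revenue"].sum()
targets["future_revenue"] = future_rev.reindex(targets.index).fillna(0.0)

# Classification target: did the customer transact again at all?
targets["will_purchase_again"] = (targets["future_revenue"] > 0).astype(int)
print(targets)
```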
Benchmarking Six Models Across Both Tasks
Three models were compared for each task. For repurchase classification, the benchmark included Logistic Regression, Random Forest, and Gradient Boosting. For future revenue, it included Ridge, Random Forest Regressor, and Gradient Boosting Regressor.
The winning models were selected through cross-validation and then evaluated on an untouched test set.
| Task | Selected model | Key metrics | Reason it won |
|---|---|---|---|
| Repurchase classification | Logistic Regression | ROC-AUC 0.75, average precision 0.74, accuracy 0.69 | Strong ranking performance with high interpretability |
| Future revenue regression | Random Forest Regressor | R² 0.29 on log revenue target | Better fit for nonlinear, highly skewed spending behavior |
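The benchmarking loop for the classification task can be sketched with scikit-learn. Synthetic data stands in for the customer table, and ROC-AUC is the assumed selection metric; only the winner would ever touch the held-out test split:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the customer-level feature table.
X, y = make_classification(n_samples=600, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

candidates = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random_forest": RandomForestClassifier(random_state=42),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
}

# Rank candidates by cross-validated ROC-AUC on the training split only.
cv_scores = {
    name: cross_val_score(m, X_train, y_train, cv=5, scoring="roc_auc").mean()
    for name, m in candidates.items()
}
best_name = max(cv_scores, key=cv_scores.get)
print(best_name, round(cv_scores[best_name], 3))
```

The regression benchmark follows the same shape with Ridge, RandomForestRegressor, and GradientBoostingRegressor and a regression scorer.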
Predicting Who Will Buy Again
The repurchase model learns from completed purchases made between December 1, 2009 and November 30, 2010, and predicts whether those customers purchase again between December 1, 2010 and May 31, 2011. In that future window, 48.1% of customers actually purchased again.
Within that setup, logistic regression ranked customers better than the more complex alternatives. That matters because the output is not only predictive, but also easy to explain operationally: customers can be scored by return likelihood and then moved into retention, reminder, or reactivation flows depending on campaign cost and urgency.
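Moving scored customers into flows can be as simple as bucketing predicted probabilities. The cut-offs below are hypothetical placeholders, not the project's thresholds, and the model is fit on synthetic data purely to produce scores:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Fit on synthetic data just to obtain return probabilities to bucket.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)
scores = pd.Series(model.predict_proba(X)[:, 1], name="p_return")

# Illustrative thresholds; real cut-offs should reflect campaign cost
# and urgency, as the text notes.
flows = pd.cut(
    scores,
    bins=[0.0, 0.3, 0.6, 1.0],
    labels=["reactivation", "reminder", "retention"],
    include_lowest=True,
)
print(flows.value_counts())
```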
| Repurchase model summary | Value |
|---|---|
| Observation window | 01/12/2009 to 30/11/2010 |
| Prediction window | 01/12/2010 to 31/05/2011 |
| Customers in modeling table | 4,266 |
| Share who purchased again | 48.1% |
| Selected model | Logistic Regression |
| ROC-AUC | 0.75 |
| Average precision | 0.74 |
| Accuracy | 0.69 |
Predicting Future Customer Revenue
The revenue model uses the same historical window, but answers a harder question: how much future revenue each customer is likely to generate in the next six months. In the prediction window, future revenue is extremely uneven. Many customers generate nothing, while a smaller set accounts for most of the commercial value.
That is why the regression task is useful. The goal is not precise invoice-level forecasting, but ranking customers by expected value so the business can distinguish between customers likely to return at low value and customers worth stronger commercial attention. For this task, the random forest regressor performed best.
| Revenue model summary | Value |
|---|---|
| Observation window | 01/12/2009 to 30/11/2010 |
| Prediction window | 01/12/2010 to 31/05/2011 |
| Average future revenue per customer (GBP) | 748 |
| Selected model | Random Forest Regressor |
| R² on log target | 0.29 |
| RMSE on log target | 2.77 |
| Revenue-scale MAE (GBP) | 430 |
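The log-target trick can be sketched as follows: train on `log1p(revenue)` to tame the skew, then invert with `expm1` when reporting errors on the revenue scale. The skewed synthetic target here is a stand-in for real customer revenue:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Build a highly skewed synthetic "revenue" target from a linear signal.
X, y_lin = make_regression(n_samples=500, n_features=6, noise=5.0, random_state=1)
revenue = np.expm1((y_lin - y_lin.min()) / (y_lin.max() - y_lin.min()) * 6)

X_tr, X_te, rev_tr, rev_te = train_test_split(X, revenue, random_state=1)

# Fit on the log scale, where the target is far better behaved.
model = RandomForestRegressor(random_state=1).fit(X_tr, np.log1p(rev_tr))
log_pred = model.predict(X_te)

r2_log = r2_score(np.log1p(rev_te), log_pred)              # log-scale fit
mae_rev = mean_absolute_error(rev_te, np.expm1(log_pred))  # revenue-scale error
print(round(r2_log, 3), round(mae_rev, 1))
```

Reporting R² on the log target and MAE back on the revenue scale matches the two rows in the summary table above.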
What the Models Learn About Customer Value
After selecting the best model for each task, the next step is interpretation. Instead of leaving the models as black boxes, the project translates them into operational signals: which customer behaviors are associated with repurchase, and which ones matter most when forecasting future value.
Longer customer history, broader baskets, and stronger past spend are associated with a higher chance of returning. In practical terms, customers who have built a more established relationship with the store are much easier to retain than newly acquired, low-depth buyers.
Past revenue, recency, and transaction count dominate the future revenue forecast. The model is effectively saying that value tends to come from customers who were already commercially meaningful and still look active.
Recency appears in both models for a reason. It is one of the clearest signs that a customer is still engaged, making it especially useful for retention timing and reminder-based interventions.
The models reward relationship depth, not just isolated spend spikes. Frequency, product breadth, and sustained activity matter more than single large purchases when identifying customers worth prioritizing.
Across both tasks, the most informative features are consistent: recency, transaction frequency, historical revenue, and basket behavior.
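The interpretation layer can be sketched by reading standardized coefficients from the logistic model and impurity-based importances from the forest. Feature names and data here are illustrative, not the project's actual table:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

feature_names = ["recency_days", "frequency", "historical_revenue",
                 "avg_invoice_value", "basket_size", "tenure_days"]
X, y = make_classification(n_samples=400, n_features=6, random_state=7)
X_std = StandardScaler().fit_transform(X)

# Standardized coefficients: sign gives direction, magnitude gives strength.
clf = LogisticRegression(max_iter=1000).fit(X_std, y)
coefs = pd.Series(clf.coef_[0], index=feature_names).sort_values()

# Impurity-based importances: relative weight only, no sign or direction.
reg = RandomForestRegressor(random_state=7).fit(X, y.astype(float))
importances = pd.Series(reg.feature_importances_, index=feature_names)

print(coefs.round(2))
print(importances.sort_values(ascending=False).round(2))
```

The two views are complementary: coefficients say which way a behavior pushes repurchase odds, while importances say how much the forest leans on each feature when forecasting value.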
How the Predictions Can Be Used
Use repurchase scores for retention and win-back targeting
The classification model helps identify customers who are less likely to return. That makes it useful for reminder flows, reactivation campaigns, and retention outreach before the customer fully drops out.
Use revenue forecasts to prioritize attention and budget
The revenue model adds a second layer to the decision. Among customers likely to remain active, it helps distinguish between lower-value returners and customers worth stronger investment.
Combine both signals for CRM segmentation
Customers can be grouped into practical action tiers: likely to return and high value, likely to return but lower value, at-risk but historically important, or low-priority. That structure is much more actionable than a single score on its own.
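The four tiers above can be derived mechanically once both scores exist. A minimal sketch, with hypothetical thresholds and a hand-built scored table:

```python
import numpy as np
import pandas as pd

# Hypothetical scored customers: return probability plus expected revenue.
scored = pd.DataFrame({
    "p_return": [0.85, 0.80, 0.20, 0.15],
    "expected_revenue": [900.0, 120.0, 700.0, 30.0],
    "historical_revenue": [2000.0, 300.0, 1500.0, 50.0],
})

likely = scored["p_return"] >= 0.5          # illustrative cut-off
high_value = scored["expected_revenue"] >= 500.0

scored["tier"] = np.select(
    [likely & high_value,
     likely & ~high_value,
     ~likely & (scored["historical_revenue"] >= 1000.0)],
    ["return_high_value", "return_low_value", "at_risk_important"],
    default="low_priority",
)
print(scored["tier"].tolist())
```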
Support customer strategy with an end-to-end pipeline
The project demonstrates a complete customer-level workflow: cleaning, feature engineering, supervised target design, model comparison, testing, and interpretation. It goes beyond exploratory analysis without losing the business framing.
Tools Used
The full workflow was built in Python. Data preparation and feature engineering were handled with Pandas and NumPy. Visual outputs were generated with Matplotlib. The predictive layer used scikit-learn for train/test splitting, cross-validation, model benchmarking, logistic regression, random forest modeling, and feature interpretation.