Unsupervised Learning · Market Basket · 2025

Customer Segmentation & Market Basket Analysis

Combining K-Means clustering and association rule mining to uncover which products are bought together, and by which kinds of customers.

Unsupervised Learning · Clustering · Apriori algorithm · Scikit-learn · Market Basket Analysis

550K
Purchase records
12
Customer groups identified
5.9K
Unique customers
0.75
Top co-purchase rate

The Idea

Standard market basket analysis asks: which products are frequently bought together? This project pushes that question one step further: which products are bought together by which kinds of customers?

The approach is a two-step pipeline. First, customers are grouped into coherent demographic segments using K-Means clustering. Then, association rule mining (Apriori) is applied within each segment separately. The result is a set of product affinity rules that are sharper and more actionable than anything the full-population analysis alone would surface.

🗃
Raw Data
550K transactions with demographic attributes
👥
Clustering
K-Means groups customers into 12 segments
🛒
Rule Mining
Apriori finds product associations within each cluster
🎯
Insights
Targeted rules per customer profile
Note on data interpretation: The dataset contains real purchase records, but product IDs and customer attributes have been masked for privacy. This means individual rules cannot be interpreted in concrete product terms, but the methodology, the structure of the segments, and the quality of the associations are fully valid and directly applicable to real retail contexts.

A Rich Retail Transaction Log

The dataset covers 550,068 Walmart purchase records from 5,891 unique customers across 3,631 unique products. Each transaction includes both what was bought and who bought it, making it possible to link purchasing behavior directly to customer profiles.

Key attributes available per transaction: gender, age group, marital status, city category, years in current city, product category, and purchase amount. The combination of behavioral and demographic data in the same record is what makes the two-step approach viable.

The data already shows an uneven customer base before any modeling. Customers aged 26–35 account for the largest transaction volume, City B leads in activity, and male customers generate significantly more purchase records. Purchase amounts are right-skewed (mean around 9,264, median around 8,047), which indicates that a subset of high-value transactions pulls the average upward.

12 Distinct Customer Profiles

K-Means clustering was applied to demographic attributes after one-hot encoding categorical variables and normalizing numerical ones. Three validation metrics (Silhouette Score, Davies-Bouldin, Calinski-Harabasz) were used to select k=12 as the number of clusters. The resulting Silhouette Score of 0.91 indicates well-separated, cohesive groups.
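
The preprocessing and k-selection steps described above can be sketched roughly as follows. This is a minimal illustration on made-up toy data, not the project's actual code: the three demographic profiles, the column choices, and the candidate range for k are all assumptions, and k is picked here by Silhouette Score alone for brevity.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)

# Hypothetical toy data: three demographic profiles, 40 rows each.
# Categorical columns: gender, age group, city category.
profiles = np.array(
    [["M", "26-35", "B"], ["F", "36-45", "B"], ["F", "46-50", "C"]], dtype=object
)
categorical = np.repeat(profiles, 40, axis=0)

# Numeric column: years in current city, with a little noise.
years_in_city = np.repeat([1.0, 3.0, 4.0], 40).reshape(-1, 1)
years_in_city = years_in_city + rng.normal(0.0, 0.1, size=years_in_city.shape)

# One-hot encode the categorical columns and standardize the numeric one.
onehot = OneHotEncoder().fit_transform(categorical).toarray()
scaled = StandardScaler().fit_transform(years_in_city)
X = np.hstack([onehot, scaled])

# Score a range of candidate k values with the three validation metrics.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = {
        "silhouette": silhouette_score(X, labels),  # higher is better
        "davies_bouldin": davies_bouldin_score(X, labels),  # lower is better
        "calinski_harabasz": calinski_harabasz_score(X, labels),  # higher is better
    }

# Assumption: choose k by Silhouette Score only.
best_k = max(scores, key=lambda k: scores[k]["silhouette"])
print(best_k, round(scores[best_k]["silhouette"], 2))
```

On the toy data the three planted profiles are recovered; on real data the three metrics may disagree, in which case they would need to be weighed against each other.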

The resulting segments are not arbitrary splits. They reflect clear, interpretable combinations of age, gender, marital status, and city.

Cluster 1
Single males, 26–35 · City B
The retailer's core demographic. High volume but internally diverse.
Cluster 4
Young single males, 18–25 · City B
Early-lifecycle segment. Relevant for entry-level offers and lower-price promotions.
Cluster 5
Married women, 36–45 · City B
Household-oriented profile. Likely more stable demand patterns and repeat purchase behavior.
Cluster 8
Married women, 46–50 · City C
Mature segment, and the only profile shown here located in City C. Its rules have the highest lift above baseline in the project.
Cluster 9
Single males, 36–45 · City B
Highest-confidence rules in the project. Strong product affinity signals within this segment.
Cluster 10
Married males, 51–55 · City B
Mature, established segment. Suitable for age-tailored positioning and differentiated messaging.

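Cluster sizes vary from ~14,000 to over 100,000 purchase records (with 5,891 unique customers in the dataset, these counts are transactions rather than individual shoppers). Large clusters are suitable for scalable campaigns with broad impact; smaller clusters may reveal niche, high-value audiences worth targeting more precisely.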

What Products Are Bought Together — by Customer Type

After identifying the customer segments, the next step was to analyze which products tend to be purchased together inside each group. This was done using the Apriori algorithm, a common technique in retail analytics that searches for recurring combinations of products within shopping baskets.
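
Before Apriori can run, the transaction log has to be turned into baskets. The write-up does not say how baskets were defined, so the sketch below assumes one basket per customer; the customer and product IDs are made up for illustration.

```python
from collections import defaultdict

# Hypothetical rows of (customer_id, product_id); the IDs are invented.
rows = [
    (1001, "P001"), (1001, "P002"), (1001, "P003"),
    (1002, "P001"), (1002, "P002"),
    (1003, "P003"),
]

# Assumption: one basket per customer, holding every product they bought.
baskets = defaultdict(set)
for customer_id, product_id in rows:
    baskets[customer_id].add(product_id)

print(dict(baskets))
```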

Instead of running the analysis on the entire customer base, it was applied separately within each segment. This makes an important difference: customers with similar demographic profiles often show more consistent purchasing patterns. As a result, the associations discovered inside segments are clearer and more predictable than those found in the full dataset.
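
To make the per-segment idea concrete, here is a small pure-Python sketch of Apriori applied once per segment and once to all baskets together. It is an illustration under stated assumptions rather than the project's implementation: the segment names, the items X/Y/Z/W, and the thresholds are hypothetical, and rules are derived only from frequent pairs to keep the example short.

```python
from itertools import combinations


def apriori(baskets, min_support):
    """Return {itemset: support} for every frequent itemset (level-wise search)."""
    n = len(baskets)
    counts = {}
    for basket in baskets:
        for item in basket:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c / n for s, c in counts.items() if c / n >= min_support}
    result = dict(frequent)
    k = 2
    while frequent:
        # Join: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a, b in combinations(frequent, 2) if len(a | b) == k}
        # Prune: keep candidates whose (k-1)-subsets are all frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))
        }
        counts = {c: sum(1 for b in baskets if c <= b) for c in candidates}
        frequent = {s: c / n for s, c in counts.items() if c / n >= min_support}
        result.update(frequent)
        k += 1
    return result


def pair_rules(itemsets, min_confidence):
    """Return (antecedent, consequent, confidence) rules from frequent pairs."""
    rules = []
    for itemset, support in itemsets.items():
        if len(itemset) != 2:
            continue
        for antecedent in itemset:
            (consequent,) = itemset - {antecedent}
            confidence = support / itemsets[frozenset([antecedent])]
            if confidence >= min_confidence:
                rules.append((antecedent, consequent, confidence))
    return rules


# Hypothetical baskets tagged with a segment label.
transactions = [
    ("segment_1", {"X", "Y"}), ("segment_1", {"X", "Y"}),
    ("segment_1", {"X", "Y"}), ("segment_1", {"X", "Z"}),
    ("segment_2", {"X", "Z"}), ("segment_2", {"X", "Z"}),
    ("segment_2", {"X", "W"}), ("segment_2", {"Y", "W"}),
]

by_segment = {}
for segment, basket in transactions:
    by_segment.setdefault(segment, []).append(basket)

# Mine rules inside each segment, then on the full population.
rules_by_segment = {
    seg: pair_rules(apriori(b, min_support=0.5), min_confidence=0.7)
    for seg, b in by_segment.items()
}
rules_all = pair_rules(
    apriori([b for _, b in transactions], min_support=0.5), min_confidence=0.7
)
```

In this toy setup the rule X → Y clears both thresholds inside segment_1, while the pair {X, Y} falls below the support threshold when all baskets are pooled, so the full-population run returns no rules at all.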

The results show that product relationships become significantly stronger once customers are segmented, revealing patterns that would otherwise be diluted when analyzing the entire population at once.

Three product associations stand out as particularly informative. Each rule is reported with two indicators. Co-purchase rate (the rule's confidence) measures how often the second product appears in baskets that contain the first. Lift above baseline measures how much stronger the association is than what chance alone would produce if the two products were unrelated.
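
A small worked example may help fix the two indicators. The baskets and item names below are invented. The co-purchase rate is computed as the standard rule confidence. The exact formula behind "lift above baseline" is not given in this write-up, so the sketch shows two common readings: the classical lift ratio, and the difference between confidence and the second product's baseline rate. Treating the reported figure as the difference is an assumption.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Compute basic association metrics for the rule antecedent -> consequent."""
    n = len(baskets)
    n_a = sum(1 for b in baskets if antecedent in b)
    n_b = sum(1 for b in baskets if consequent in b)
    n_ab = sum(1 for b in baskets if antecedent in b and consequent in b)

    confidence = n_ab / n_a  # share of antecedent baskets that also hold the consequent
    baseline = n_b / n       # how often the consequent appears in any basket
    return {
        "co_purchase_rate": confidence,
        "baseline": baseline,
        "lift_ratio": confidence / baseline,            # classical lift
        "lift_above_baseline": confidence - baseline,   # assumed definition
    }


# Hypothetical baskets: A appears in 4, A and B together in 3, B in 4 of 8.
baskets = [
    {"A", "B"}, {"A", "B"}, {"A", "B", "C"}, {"A", "C"},
    {"B", "C"}, {"C"}, {"C", "D"}, {"D"},
]

metrics = rule_metrics(baskets, "A", "B")
print(metrics)
```

Here three of the four baskets containing A also contain B, so the co-purchase rate is 0.75, against a baseline of 0.5 for B on its own.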

Cluster 9 — Single males, 36–45, City B
Product P00277642 → Product P00117442
Co-purchase probability 0.75 · Lift above baseline 0.45
In this segment, three out of four baskets containing the first product also include the second. This is the strongest co-purchase pattern observed in the analysis. In a retail setting, such a relationship would be a strong candidate for a “frequently bought together” suggestion, a bundle offer, or a checkout recommendation.
Cluster 8 — Married women, 46–50, City C
Product P00271142 → Product P00117942
Co-purchase probability 0.65 · Lift above baseline 0.48
This rule shows the strongest lift above baseline in the analysis, meaning the two products appear together far more often than their individual popularity would suggest. The pattern emerges within a specific demographic niche, illustrating how segment-level analysis can reveal meaningful associations that would remain hidden in a full-population analysis.
Cluster 1 — Single males, 26–35, City B
Product P00182742 → Product P00110742
Co-purchase probability 0.71 · Lift above baseline 0.38
This association appears within the retailer’s largest customer group. Even in a broad segment, the analysis reveals clear and repeatable product combinations, suggesting that targeted recommendations or bundled promotions could be effective for this audience.

Customer Segmentation Reveals Stronger Product Relationships

The most important result of this analysis appears when comparing product relationships found inside customer segments with those discovered in the entire dataset.

When association rules are calculated across all customers together, purchasing patterns appear weaker because the population includes many different types of shoppers. Once customers are grouped into similar profiles, purchasing behavior becomes more consistent, and product relationships become much easier to detect.

Segment | Avg Co-purchase Rate | Avg Lift Above Baseline
Cluster 9 | 0.680 | 0.422
Cluster 11 | 0.650 | 0.366
Cluster 1 | 0.648 | 0.378
Cluster 6 | 0.642 | 0.340
Cluster 5 | 0.618 | 0.330
Cluster 4 | 0.608 | 0.340
Cluster 10 | 0.574 | 0.350
Entire Dataset (baseline) | 0.438 | 0.182

Every segment listed produces stronger product associations than the analysis performed on the entire dataset. In the best segment (Cluster 9), the average co-purchase rate is about 55% higher (0.680 vs. 0.438) and the average lift above baseline is more than twice as large (0.422 vs. 0.182) as in the unsegmented population.
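
The comparison can be checked directly from the table. The short script below copies the reported averages and recomputes the relative improvement of the best segment over the full-dataset baseline; the variable names are illustrative only.

```python
# Averages copied from the table: (co-purchase rate, lift above baseline).
segments = {
    "Cluster 9": (0.680, 0.422),
    "Cluster 11": (0.650, 0.366),
    "Cluster 1": (0.648, 0.378),
    "Cluster 6": (0.642, 0.340),
    "Cluster 5": (0.618, 0.330),
    "Cluster 4": (0.608, 0.340),
    "Cluster 10": (0.574, 0.350),
}
baseline_rate, baseline_lift = 0.438, 0.182

# Best segment by average co-purchase rate.
best = max(segments, key=lambda s: segments[s][0])
best_rate, best_lift = segments[best]

rate_gain = best_rate / baseline_rate - 1   # relative increase in co-purchase rate
lift_ratio = best_lift / baseline_lift      # how many times larger the lift is

# Does every listed segment beat the baseline on both indicators?
all_above = all(
    rate > baseline_rate and lift > baseline_lift
    for rate, lift in segments.values()
)

print(f"{best}: {rate_gain:.0%} higher co-purchase rate, {lift_ratio:.2f}x lift")
```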

Figure: Average lift above baseline by cluster vs. full-dataset baseline.

What This Enables

1. Audience design & campaign targeting

The 12 customer segments can be used to structure marketing audiences. Instead of sending the same campaign to every customer, messaging can be tailored to groups with similar profiles, for example by age range, city, or household status, making promotions more relevant and potentially more effective.

2. Cross-selling and bundle recommendations

Product pairs with high co-purchase rates are strong candidates for “frequently bought together” suggestions, checkout recommendations, or bundle offers. Because these combinations are discovered within specific customer groups, they are more likely to reflect real purchasing behavior rather than broad popularity across the entire store.

3. Segment-specific product recommendations

Different customer groups often show different purchasing patterns. A product pair that frequently appears together for one group, for example married women aged 46–50, may not appear in the baskets of younger customers. Using segment-level insights allows retailers to show more relevant product suggestions to each audience.

4. Broad campaigns vs. niche opportunities

Larger customer segments represent broad audiences where product recommendations or campaigns can reach many users. Smaller segments may reveal niche behaviors that are less visible at scale but potentially valuable for targeted promotions or specialized offers.

Tools Used

Python was used throughout. Clustering was implemented with scikit-learn's KMeans. Association rules were generated with the Apriori algorithm. Data processing relied on Pandas and NumPy, with Matplotlib for visualization. Cluster quality was assessed using Silhouette Score, Davies-Bouldin Score, and Calinski-Harabasz Score across multiple distance metrics (Euclidean, Cosine, Manhattan, Gower).
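
As a rough illustration of scoring one clustering under several distance metrics, the sketch below uses the `metric` argument of scikit-learn's `silhouette_score` on made-up two-dimensional data. The data, the cluster count, and the choice to reuse a single set of K-Means labels are assumptions for the example, not a description of the project's evaluation code.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Hypothetical toy data: three well-separated 2-D groups of 50 points each.
centers = np.array([[1.0, 5.0], [5.0, 1.0], [6.0, 6.0]])
X = np.vstack([c + rng.normal(0.0, 0.3, size=(50, 2)) for c in centers])

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Same labels, scored under different distance metrics.
silhouettes = {
    metric: silhouette_score(X, labels, metric=metric)
    for metric in ("euclidean", "cosine", "manhattan")
}
print({name: round(value, 2) for name, value in silhouettes.items()})
```

Gower distance is left out of this sketch. One option would be to compute a Gower distance matrix separately and pass it to `silhouette_score` with `metric="precomputed"`.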

→ Full code available on GitHub