The Idea
Standard market basket analysis asks: which products are frequently bought together? This project pushes that question one step further: which products are bought together by which kinds of customers?
The approach is a two-step pipeline. First, customers are grouped into coherent demographic segments using K-Means clustering. Then, association rule mining (Apriori) is applied within each segment separately. The result is a set of product affinity rules that are sharper and more actionable than anything the full-population analysis alone would surface.
A Rich Retail Transaction Log
The dataset covers 550,068 Walmart purchase records from 5,891 unique customers across 3,631 unique products. Each transaction includes both what was bought and who bought it, making it possible to link purchasing behavior directly to customer profiles.
Key attributes available per transaction: gender, age group, marital status, city category, years in current city, product category, and purchase amount. The combination of behavioral and demographic data in the same record is what makes the two-step approach viable.
The data already shows an uneven customer base before any modeling. Customers aged 26–35 account for the largest transaction volume, City B leads in activity, and male customers generate significantly more purchase records than female customers. Purchase amounts are right-skewed: the mean is around 9,264 against a median of around 8,047, which indicates a subset of high-value transactions pulling the average upward.
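The link between skew and the mean/median gap can be checked in a few lines. This is a minimal sketch on made-up amounts, not on the project's data; the values in `amounts` are purely illustrative.

```python
from statistics import mean, median

# Toy purchase amounts (invented for illustration, not taken from the dataset):
# most values are moderate, a few are much larger.
amounts = [5200, 6100, 7400, 8000, 8100, 9300, 11800, 15600, 19900, 23700]

avg = mean(amounts)
mid = median(amounts)

# In a right-skewed distribution the long upper tail drags the mean
# above the median, which is the pattern reported for the purchase amounts.
print(f"mean={avg:.0f}, median={mid:.0f}, mean > median: {avg > mid}")
```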
12 Distinct Customer Profiles
K-Means clustering was applied to demographic attributes after one-hot encoding categorical variables and normalizing numerical ones. Multiple validation metrics (Silhouette Score, Davies-Bouldin, Calinski-Harabasz) were used to select k=12 as the optimal number of clusters, achieving a Silhouette Score of 0.91, indicating well-separated, cohesive groups.
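A selection procedure of this kind might look roughly like the sketch below. It assumes scikit-learn and NumPy are installed and runs on synthetic demographic columns, not the real data; the column values, the range of k, and the choice to rank by Silhouette Score alone are all assumptions made for illustration. The project itself weighed three metrics together.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)

# Synthetic stand-in for the demographic columns (hypothetical values).
n = 600
demo = np.column_stack([
    rng.choice(["F", "M"], size=n),
    rng.choice(["18-25", "26-35", "36-45"], size=n),
    rng.choice(["A", "B", "C"], size=n),
])

# One-hot encode the categorical attributes into a dense 0/1 matrix.
X = OneHotEncoder().fit_transform(demo).toarray()


def score_k(X, k, seed=0):
    """Fit K-Means for one k and report the three validation metrics."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    return {
        "k": k,
        "silhouette": silhouette_score(X, labels),                # higher is better
        "davies_bouldin": davies_bouldin_score(X, labels),        # lower is better
        "calinski_harabasz": calinski_harabasz_score(X, labels),  # higher is better
    }


results = [score_k(X, k) for k in range(2, 13)]

# Simplification: rank by silhouette only.
best = max(results, key=lambda r: r["silhouette"])
print(best)
```

Because the three metrics do not always agree, inspecting the full `results` list rather than a single winner is usually the safer habit.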
The resulting segments are not arbitrary splits. They reflect clear, interpretable combinations of age, gender, marital status, and city.
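Cluster sizes range from roughly 14,000 to over 100,000 purchase records (with 5,891 unique customers in total, these counts are necessarily records rather than individual shoppers). Large clusters are suitable for scalable campaigns with broad impact; smaller clusters may reveal niche, high-value audiences worth targeting more precisely.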
What Products Are Bought Together — by Customer Type
After identifying the customer segments, the next step was to analyze which products tend to be purchased together inside each group. This was done using the Apriori algorithm, a common technique in retail analytics that searches for recurring combinations of products within shopping baskets.
Rather than running once on the entire customer base, the analysis was applied separately within each segment. This matters because customers with similar demographic profiles often show more consistent purchasing patterns. As a result, the associations discovered inside segments are clearer and more predictable than those found in the full dataset.
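To make the per-segment idea concrete, here is a small pure-Python sketch of level-wise Apriori run once per segment and once on the pooled baskets. It is not the project's implementation: the function names `apriori` and `mine_by_segment`, the product IDs, and the `min_support` threshold are all hypothetical, and a production run would more likely use an optimized library.

```python
from collections import defaultdict
from itertools import combinations


def apriori(baskets, min_support):
    """Return {frozenset(itemset): support} for all frequent itemsets."""
    baskets = [frozenset(b) for b in baskets]
    n = len(baskets)
    current = {frozenset([i]) for b in baskets for i in b}
    frequent = {}
    k = 1
    while current:
        # Count how many baskets contain each candidate itemset.
        counts = {c: sum(1 for b in baskets if c <= b) for c in current}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates, then prune any
        # candidate that has an infrequent k-subset (the Apriori property).
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        current = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


def mine_by_segment(rows, min_support):
    """rows: iterable of (segment, basket). Mines each segment separately."""
    by_segment = defaultdict(list)
    for segment, basket in rows:
        by_segment[segment].append(basket)
    return {seg: apriori(b, min_support) for seg, b in by_segment.items()}


# Toy baskets with hypothetical product IDs, tagged with a segment label.
rows = [
    (0, {"P1", "P2"}), (0, {"P1", "P2", "P3"}), (0, {"P1", "P2"}), (0, {"P3"}),
    (1, {"P3", "P4"}), (1, {"P3", "P4"}), (1, {"P1", "P4"}), (1, {"P3", "P4", "P2"}),
]

per_segment = mine_by_segment(rows, min_support=0.5)
pooled = apriori([basket for _, basket in rows], min_support=0.5)

for seg, itemsets in per_segment.items():
    pairs = [sorted(s) for s in itemsets if len(s) == 2]
    print(f"segment {seg}: frequent pairs = {pairs}")
print("pooled frequent pairs =", [sorted(s) for s in pooled if len(s) == 2])
```

In this toy data each segment has one pair that clears the support threshold, while the pooled run finds no frequent pair at all, which is the dilution effect described above in miniature.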
The results show that product relationships become significantly stronger once customers are segmented, revealing patterns that would otherwise be diluted when analyzing the entire population at once.
Three product associations stand out as particularly informative. Each rule includes two indicators: co-purchase rate, which measures how often the second product appears when the first one is purchased, and lift above baseline, which measures how much stronger the association is compared to what would happen by chance if the two products were unrelated.
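Both indicators can be computed from simple basket counts. The sketch below uses the standard textbook definitions of confidence and lift; the write-up does not give an exact formula for "lift above baseline", so the `lift_minus_one` field is only one plausible reading and should be treated as an assumption. The counts passed in are invented.

```python
def rule_metrics(n_total, n_a, n_b, n_both):
    """Metrics for the rule A -> B from basket counts.

    n_total: number of baskets
    n_a:     baskets containing A
    n_b:     baskets containing B
    n_both:  baskets containing both A and B
    """
    confidence = n_both / n_a               # P(B | A): the co-purchase rate
    lift = confidence / (n_b / n_total)     # P(B | A) / P(B); 1.0 means independence
    return {
        "confidence": confidence,
        "lift": lift,
        # Assumption: "lift above baseline" read as the excess over independence.
        "lift_minus_one": lift - 1,
    }


# Invented counts for illustration only.
m = rule_metrics(n_total=1000, n_a=200, n_b=400, n_both=120)
print(m)
```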
Customer Segmentation Reveals Stronger Product Relationships
The most important result of this analysis appears when comparing product relationships found inside customer segments with those discovered in the entire dataset.
When association rules are calculated across all customers together, purchasing patterns appear weaker because the population includes many different types of shoppers. Once customers are grouped into similar profiles, purchasing behavior becomes more consistent, and product relationships become much easier to detect.
| Segment | Avg Co-purchase Rate | Avg Lift Above Baseline |
|---|---|---|
| Cluster 9 | 0.680 | 0.422 |
| Cluster 11 | 0.650 | 0.366 |
| Cluster 1 | 0.648 | 0.378 |
| Cluster 6 | 0.642 | 0.340 |
| Cluster 5 | 0.618 | 0.330 |
| Cluster 4 | 0.608 | 0.340 |
| Cluster 10 | 0.574 | 0.350 |
| Entire Dataset (baseline) | 0.438 | 0.182 |
Every segment listed in the table produces stronger product associations than the analysis performed on the entire dataset. In the best segment (Cluster 9), the co-purchase rate is about 55% higher and the lift above baseline is more than twice as large as in the unsegmented population.
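The comparison with the baseline is simple arithmetic on the table values, reproduced here as a quick check. The dictionary below just restates the table; nothing is recomputed from raw data.

```python
# (avg co-purchase rate, avg lift above baseline), copied from the table.
segments = {
    "Cluster 9": (0.680, 0.422),
    "Cluster 11": (0.650, 0.366),
    "Cluster 1": (0.648, 0.378),
    "Cluster 6": (0.642, 0.340),
    "Cluster 5": (0.618, 0.330),
    "Cluster 4": (0.608, 0.340),
    "Cluster 10": (0.574, 0.350),
}
baseline_rate, baseline_lift = 0.438, 0.182

# Does every listed segment beat the unsegmented baseline on both indicators?
all_beat_baseline = all(
    rate > baseline_rate and lift > baseline_lift
    for rate, lift in segments.values()
)

# Relative improvement of the strongest segment over the baseline.
best_rate, best_lift = segments["Cluster 9"]
rate_gain = best_rate / baseline_rate - 1
lift_ratio = best_lift / baseline_lift

print(f"all listed segments beat baseline: {all_beat_baseline}")
print(f"co-purchase rate gain: {rate_gain:.0%}, lift ratio: {lift_ratio:.2f}x")
```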
What This Enables
Audience design & campaign targeting
The 12 customer segments can be used to structure marketing audiences. Instead of sending the same campaign to every customer, messaging can be tailored to groups with similar profiles, for example by age range, city, or household status, making promotions more relevant and potentially more effective.
Cross-selling and bundle recommendations
Product pairs with high co-purchase rates are strong candidates for “frequently bought together” suggestions, checkout recommendations, or bundle offers. Because these combinations are discovered within specific customer groups, they are more likely to reflect real purchasing behavior rather than broad popularity across the entire store.
Segment-specific product recommendations
Different customer groups often show different purchasing patterns. A product pair that frequently appears together for one group, for example married women aged 46–50, may not appear in the baskets of younger customers. Using segment-level insights allows retailers to show more relevant product suggestions to each audience.
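One way such segment-level rules could feed a recommender is a plain lookup keyed by segment. The sketch below is hypothetical throughout: the `recommend` function, the segment names, the product IDs, and the confidence values are invented to show the shape of the idea, not taken from the project.

```python
def recommend(rules_by_segment, segment, basket, top_n=3):
    """Suggest products for a basket using only the shopper's segment rules.

    rules_by_segment maps a segment label to a list of
    (antecedent frozenset, consequent item, confidence) tuples.
    """
    candidates = []
    for antecedent, consequent, confidence in rules_by_segment.get(segment, []):
        # A rule fires when the basket contains its whole antecedent
        # and does not already contain the suggested product.
        if antecedent <= basket and consequent not in basket:
            candidates.append((confidence, consequent))

    # Highest-confidence suggestions first, without duplicates.
    candidates.sort(reverse=True)
    suggestions = []
    for _, item in candidates:
        if item not in suggestions:
            suggestions.append(item)
    return suggestions[:top_n]


# Hypothetical rules, for illustration only.
rules_by_segment = {
    "segment_a": [
        (frozenset({"P1"}), "P2", 0.70),
        (frozenset({"P1", "P3"}), "P4", 0.65),
        (frozenset({"P5"}), "P6", 0.90),
    ],
    "segment_b": [
        (frozenset({"P1"}), "P7", 0.60),
    ],
}

basket = {"P1", "P3"}
print(recommend(rules_by_segment, "segment_a", basket))
print(recommend(rules_by_segment, "segment_b", basket))
```

The same basket yields different suggestions depending on the segment, which is the behavior the paragraph above describes.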
Broad campaigns vs niche opportunities
Larger customer segments represent broad audiences where product recommendations or campaigns can reach many users. Smaller segments may reveal niche behaviors that are less visible at scale but potentially valuable for targeted promotions or specialized offers.
Tools Used
Python was used throughout. Clustering was implemented with scikit-learn's KMeans. Association rules were generated with the Apriori algorithm. Data processing relied on Pandas and NumPy, with Matplotlib for visualization. Cluster quality was assessed using Silhouette Score, Davies-Bouldin Score, and Calinski-Harabasz Score across multiple distance metrics (Euclidean, Cosine, Manhattan, Gower).