Customer Segmentation & Market Basket Analysis

Overview

The Idea

Standard market basket analysis asks: which products are frequently bought together? This project adds one more layer: which products are bought together by which customer types?

The method has two steps. First, customers are grouped into demographic segments with K-Means clustering. Then, Apriori association rules are mined inside each segment. The result is a set of product affinities that is sharper and more actionable than a full-population analysis.

Raw Data: 550K transactions with demographic attributes

→ Clustering: K-Means groups customers into 12 segments

→ Rule Mining: Apriori finds product associations within each cluster

→ Insights: Targeted rules per customer profile

⚠

Note on data interpretation: The dataset contains real transactions, but product IDs and customer attributes are masked for privacy. So the exact products are anonymous, but the methodology, segment structure, and strength of the associations remain valid and applicable to real retail settings.

Dataset

A Rich Retail Transaction Log

The dataset contains 550,068 Walmart purchase records from 5,891 customers across 3,631 products. Each record includes both the basket and the customer profile, which makes behavior-to-segment analysis possible.

Available fields include gender, age group, marital status, city category, years in current city, product category, and purchase amount. That mix of demographic and transactional data is what makes the two-step approach work.

Even before modeling, the customer base is uneven: ages 26–35 drive the most transactions, City B is the most active, and male customers generate more purchase records. Purchase amounts are also right-skewed, with a mean of 9,264 and median of 8,047, which suggests a subset of higher-value purchases lifts the average.

Customer Segmentation

12 Distinct Customer Profiles

K-Means clustering was applied to demographic variables after encoding categorical fields and normalizing numeric ones. Multiple validation metrics selected k=12 as the best solution, with a Silhouette Score of 0.91, which indicates well-separated groups.
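The selection step described above can be sketched as follows. This is a minimal illustration and not the project's code: the data is synthetic, and the column choices, the range of k values, and the use of one-hot encoding with standard scaling are assumptions. It uses only the Silhouette Score, whereas the project also checked other metrics.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the demographic table (hypothetical values).
rng = np.random.default_rng(0)
n = 300
gender = rng.choice(["F", "M"], size=n)
city = rng.choice(["A", "B", "C"], size=n)
years_in_city = rng.integers(0, 5, size=n).astype(float)

# Encode the categorical fields and normalize the numeric one.
categorical = np.column_stack([gender, city])
encoded = OneHotEncoder().fit_transform(categorical).toarray()
scaled = StandardScaler().fit_transform(years_in_city.reshape(-1, 1))
X = np.hstack([encoded, scaled])

# Fit K-Means for several candidate k values and record the silhouette.
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Keep the k with the highest silhouette score.
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

Silhouette values range from -1 to 1, and values near 1 mean that points sit much closer to their own cluster than to the nearest other cluster.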

These are not arbitrary splits. The segments map cleanly to combinations of age, gender, marital status, and city.

Cluster 1

Single males, 26–35 · City B

The retailer's core demographic. High volume but internally diverse.

Cluster 4

Young single males, 18–25 · City B

Early-lifecycle segment. Relevant for entry-level offers and lower-price promotions.

Cluster 5

Married women, 36–45 · City B

Household-oriented profile. Likely more stable demand patterns and repeat purchase behavior.

Cluster 8

Married women, 46–50 · City C

Mature segment in a geographically distinct location. Produces the rule with the strongest lift above baseline in the project.

Cluster 9

Single males, 36–45 · City B

Produces the highest-confidence rules in the project, a sign of strong product affinity within this segment.

Cluster 10

Married males, 51–55 · City B

Mature, established segment. Suitable for age-tailored positioning and differentiated messaging.

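Cluster sizes range from ~14,000 to over 100,000 purchase records (with 5,891 unique customers in the dataset, these counts are necessarily at the record level). Large clusters are useful for scalable campaigns, while smaller ones can reveal niche, high-value audiences.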

Product Affinity

What Products Are Bought Together — by Customer Type

After defining the segments, I used Apriori to find which products tend to be bought together inside each group.

Running the analysis at the segment level matters because similar customers usually show more consistent purchase behavior. That makes the rules clearer, more predictable, and stronger than in the full dataset.
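A toy illustration of why segment-level mining can surface patterns that pooling hides. The function below is a simplified, pair-only version of the Apriori idea (prune infrequent items, then count pairs), written with the standard library. The function name, the support threshold, and the baskets are all hypothetical, and the project's actual Apriori implementation is not shown on this page.

```python
from collections import Counter, defaultdict
from itertools import combinations


def frequent_pairs_by_segment(records, min_support=0.3):
    """records: iterable of (segment, basket). Returns {segment: {pair: support}}."""
    baskets = defaultdict(list)
    for segment, basket in records:
        baskets[segment].append(set(basket))

    result = {}
    for segment, seg_baskets in baskets.items():
        n = len(seg_baskets)
        # Apriori pruning: a pair can only be frequent if both items are.
        item_counts = Counter(item for b in seg_baskets for item in b)
        frequent_items = {i for i, c in item_counts.items() if c / n >= min_support}
        pair_counts = Counter()
        for b in seg_baskets:
            for pair in combinations(sorted(b & frequent_items), 2):
                pair_counts[pair] += 1
        result[segment] = {
            p: c / n for p, c in pair_counts.items() if c / n >= min_support
        }
    return result


# Hypothetical baskets: each segment has its own dominant pair.
records = [
    ("s1", {"X", "Y"}), ("s1", {"X", "Y"}), ("s1", {"X", "Z"}), ("s1", {"Y"}),
    ("s2", {"X", "Z"}), ("s2", {"X", "Z"}), ("s2", {"Z", "Y"}), ("s2", {"X"}),
]

segmented = frequent_pairs_by_segment(records, min_support=0.5)
pooled = frequent_pairs_by_segment([("all", b) for _, b in records], min_support=0.5)
print(segmented)
print(pooled)
```

In this toy data each segment has one pair that reaches the 50% support threshold, while no pair reaches it once the two segments are pooled, because each pattern is diluted by the other group's baskets.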

Three associations stand out. Each rule is evaluated with co-purchase rate, which shows how often the second product appears when the first is bought, and lift above baseline, which shows how much stronger the relationship is than chance.
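For reference, here is how these measures can be computed from raw baskets. The co-purchase rate corresponds to rule confidence. The page does not give a formula for "lift above baseline", so the sketch computes both the conventional lift ratio and the difference between confidence and the second product's baseline purchase rate. Which of the two the reported figures correspond to is an assumption; values below 1 for positively associated products would be consistent with the difference form. The baskets and function name are hypothetical.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Support, confidence, lift ratio, and confidence-minus-baseline for A -> B."""
    n = len(baskets)
    n_a = sum(1 for b in baskets if antecedent in b)
    n_b = sum(1 for b in baskets if consequent in b)
    n_ab = sum(1 for b in baskets if antecedent in b and consequent in b)
    if n == 0 or n_a == 0 or n_b == 0:
        raise ValueError("both products must appear in at least one basket")

    confidence = n_ab / n_a   # how often B appears when A is bought
    baseline = n_b / n        # how often B appears overall
    return {
        "support": n_ab / n,
        "confidence": confidence,
        "lift_ratio": confidence / baseline,      # conventional lift
        "above_baseline": confidence - baseline,  # assumed reading of the page's metric
    }


# Hypothetical baskets for one segment.
baskets = [
    {"A", "B"}, {"A", "B"}, {"A", "B"}, {"A", "C"},
    {"B", "C"}, {"C"}, {"C", "D"}, {"D"}, {"D"}, {"C", "D"},
]
m = rule_metrics(baskets, "A", "B")
print(m)
```

Here product A appears in four baskets and B accompanies it in three, so confidence is 0.75. B appears in 40% of all baskets, which gives a lift ratio of 1.875 and a difference of 0.35 above baseline.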

Cluster 9 — Single males, 36–45, City B

Product P00277642 → Product P00117442

Co-purchase probability: 0.75 · Lift above baseline: 0.45

In this segment, three out of four baskets containing the first product also include the second. This is the strongest co-purchase pattern in the analysis and a strong candidate for a “frequently bought together” suggestion, bundle, or checkout recommendation.

Cluster 8 — Married women, 46–50, City C

Product P00271142 → Product P00117942

Co-purchase probability: 0.65 · Lift above baseline: 0.48

This rule shows the strongest lift above baseline in the analysis, meaning the two products appear together far more often than their individual popularity would suggest. It appears within a specific demographic niche, which shows how segment-level analysis can uncover patterns hidden in the full population.

Cluster 1 — Single males, 26–35, City B

Product P00182742 → Product P00110742

Co-purchase probability: 0.71 · Lift above baseline: 0.38

This association appears in the retailer’s largest customer group. Even in a broad segment, the analysis finds clear and repeatable product combinations, suggesting that targeted recommendations or bundles could work well here.

Key Finding

Customer Segmentation Reveals Stronger Product Relationships

The main result appears when segment-level rules are compared with rules from the full dataset.

When all customers are analyzed together, product relationships look weaker because the population mixes many different shopper types. Once customers are grouped into similar profiles, behavior becomes more consistent and product affinities become easier to detect.

Segment	Avg Co-purchase Rate	Avg Lift Above Baseline
Cluster 9	0.680	0.422
Cluster 11	0.650	0.366
Cluster 1	0.648	0.378
Cluster 6	0.642	0.340
Cluster 5	0.618	0.330
Cluster 4	0.608	0.340
Cluster 10	0.574	0.350
Entire Dataset (baseline)	0.438	0.182

Every segment listed produces stronger product associations than the full-dataset analysis. In the best segment (Cluster 9), the average co-purchase rate is about 55% higher than in the unsegmented population (0.680 vs. 0.438), and the average lift above baseline is more than twice as high (0.422 vs. 0.182).
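The comparison can be checked directly from the table values. The short sketch below only redoes the arithmetic; the variable names are illustrative.

```python
# (segment, avg co-purchase rate, avg lift above baseline) from the table.
segments = [
    ("Cluster 9", 0.680, 0.422),
    ("Cluster 11", 0.650, 0.366),
    ("Cluster 1", 0.648, 0.378),
    ("Cluster 6", 0.642, 0.340),
    ("Cluster 5", 0.618, 0.330),
    ("Cluster 4", 0.608, 0.340),
    ("Cluster 10", 0.574, 0.350),
]
base_rate, base_lift = 0.438, 0.182  # entire dataset

# Best segment relative to the full-dataset baseline.
_, best_rate, best_lift = segments[0]
rate_gain = best_rate / base_rate - 1   # relative increase in co-purchase rate
lift_ratio = best_lift / base_lift      # multiple of the baseline lift

# Does every listed segment beat the baseline on both measures?
all_above = all(r > base_rate and l > base_lift for _, r, l in segments)

print(f"{rate_gain:.1%}", f"{lift_ratio:.2f}x", all_above)
```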

Figure: Average Lift Above Baseline by Cluster vs. Full Dataset Baseline

Business Applications

What This Enables

1. Audience design & campaign targeting

The 12 customer segments can structure marketing audiences. Instead of sending the same campaign to everyone, messaging can be tailored by age, city, or household profile to make promotions more relevant.

2. Cross-selling and bundle recommendations

Product pairs with high co-purchase rates are strong candidates for “frequently bought together” suggestions, checkout recommendations, or bundle offers. Because these rules are found inside specific customer groups, they are more likely to reflect real behavior than general store-wide popularity.

3. Segment-specific product recommendations

Different customer groups show different purchasing patterns. A product pair that is common for one segment may not appear at all in another. Segment-level rules make product recommendations more relevant to each audience.

4. Broad campaigns vs niche opportunities

Larger segments support broad campaigns and scalable recommendations. Smaller segments can reveal niche behaviors that are less visible at scale but valuable for targeted promotions or specialized offers.
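As a rough sketch of how segment-level rules could feed a recommendation step, the lookup below stores the three highlighted rules by cluster and suggests the second product when the first is in the basket. The data structure, the function name, and the confidence threshold are hypothetical; only the product IDs, cluster numbers, and co-purchase probabilities come from the results above.

```python
# cluster -> list of (antecedent, consequent, co-purchase probability)
RULES = {
    9: [("P00277642", "P00117442", 0.75)],
    8: [("P00271142", "P00117942", 0.65)],
    1: [("P00182742", "P00110742", 0.71)],
}


def recommend(cluster, basket, min_confidence=0.6):
    """Suggest products for a basket using only the rules of the customer's cluster."""
    candidates = [
        (conf, consequent)
        for antecedent, consequent, conf in RULES.get(cluster, [])
        if antecedent in basket
        and consequent not in basket
        and conf >= min_confidence
    ]
    # Strongest rules first.
    return [product for _, product in sorted(candidates, reverse=True)]


print(recommend(9, {"P00277642"}))
print(recommend(1, {"P00277642"}))
```

The second call returns nothing because that rule belongs to Cluster 9, not Cluster 1, which mirrors the point that a pair common in one segment may not apply in another.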

Stack

Tools Used

Python was used throughout. Clustering was implemented with scikit-learn's KMeans, association rules with Apriori, and data processing with Pandas and NumPy. Visuals were built in Matplotlib, and cluster quality was checked with Silhouette, Davies-Bouldin, and Calinski-Harabasz across multiple distance metrics.

→ Full code available on GitHub
