Analytics · Recommendation · Unsupervised · 2026

BasketIQ

Market-basket analysis and customer segmentation at 32.4M-transaction scale.

32.4M Instacart transactions analysed

2,016 aisle-level association rules mined

58.9% reorder rate across the catalogue

Python · pandas · mlxtend (Apriori) · scikit-learn · Chart.js

GitHub

Overview

BasketIQ is an end-to-end retail analytics project on 32.4 million Instacart transactions across 206,000 users, 49,000 products, 21 departments, and 134 aisles. It mines association rules at two granularities (aisle and product), segments customers with RFM + K-means, and layers three complementary recommendation approaches on top. The output is not a notebook: it is an interactive dashboard with 23 publication-quality figures organised by analytical task.

Background

Market-basket analysis surfaces non-obvious co-purchase patterns that drive retail decisions: store layout, cross-sell emails, and bundle promotions. Association-rule mining via Apriori is the classical approach, using support, confidence, and lift to identify rules that are both frequent and informative.
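As a small worked illustration of the three metrics, the sketch below computes support, confidence, and lift for a single rule over five made-up baskets. The basket contents and the helper name `rule_metrics` are assumptions for illustration only; nothing here comes from the project's code.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    antecedent, consequent = frozenset(antecedent), frozenset(consequent)
    n = len(baskets)
    n_a = sum(1 for b in baskets if antecedent <= b)    # baskets containing the antecedent
    n_c = sum(1 for b in baskets if consequent <= b)    # baskets containing the consequent
    n_ac = sum(1 for b in baskets if (antecedent | consequent) <= b)  # baskets containing both
    if n_a == 0 or n_c == 0:
        raise ValueError("both sides of the rule must occur in at least one basket")
    support = n_ac / n               # how often the whole rule appears
    confidence = n_ac / n_a          # P(consequent | antecedent)
    lift = confidence / (n_c / n)    # confidence relative to the consequent's base rate
    return support, confidence, lift


# Illustrative baskets at aisle level (made up, not taken from the dataset).
baskets = [
    {"fresh fruits", "fresh vegetables", "yogurt"},
    {"fresh fruits", "fresh vegetables"},
    {"fresh fruits", "milk"},
    {"fresh vegetables", "yogurt"},
    {"milk", "chips"},
]

support, confidence, lift = rule_metrics(baskets, {"fresh fruits"}, {"fresh vegetables"})
```

On this toy data the rule fresh fruits → fresh vegetables has support 0.4, confidence about 0.67, and lift about 1.11: the pair shows up in two of five baskets, and slightly more often than independence would predict.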

RFM segmentation (Recency, Frequency, Monetary) is the classical customer-value framework; combined with unsupervised clustering it produces actionable cohorts. Both techniques are old but remain surprisingly hard to implement well at tens of millions of rows.
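A minimal pandas sketch of the per-user RFM aggregation might look like the following. The table, its column names (`days_ago`, `n_items`), and the choice of items purchased as the Monetary proxy are all assumptions made for illustration; the project's actual feature definitions are not stated here.

```python
import pandas as pd

# Hypothetical order-level table; every column name and value is illustrative.
orders = pd.DataFrame({
    "user_id":  [1, 1, 1, 2, 2, 3],
    "order_id": [10, 11, 12, 20, 21, 30],
    "days_ago": [40, 12, 3, 90, 60, 7],   # days between the order and the snapshot date
    "n_items":  [8, 5, 11, 2, 4, 6],      # assumed stand-in for monetary value
})

rfm = orders.groupby("user_id").agg(
    recency=("days_ago", "min"),          # days since the most recent order
    frequency=("order_id", "nunique"),    # number of distinct orders
    monetary=("n_items", "sum"),          # total items purchased (proxy)
)
```

Each user collapses to one row of three numbers, which is the shape a clustering step expects once the columns are put on a common scale.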

Methodology

Apriori is run at both aisle granularity (where coarse behavioural groupings are visible) and product granularity (where individual SKU-to-SKU relationships emerge). Rules are first filtered by minimum support and then ranked by lift. A lift above 1 means the items co-occur more often than they would if bought independently, so the top-ranked rules are the ones that depart most from chance.
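The filter-then-rank step can be sketched without any dependencies. The version below is a simplified stand-in that mines single-item → single-item rules only; it is not the mlxtend call the project uses, and the baskets, the 0.3 support threshold, and the name `pair_rules` are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def pair_rules(baskets, min_support=0.3):
    """Mine item -> item rules, keep those meeting min_support, rank by lift."""
    n = len(baskets)
    item_counts = Counter(item for b in baskets for item in set(b))
    # Apriori pruning: a pair can only be frequent if both of its items are.
    frequent = {i for i, c in item_counts.items() if c / n >= min_support}
    pair_counts = Counter()
    for b in baskets:
        kept = sorted(set(b) & frequent)
        pair_counts.update(combinations(kept, 2))

    rules = []
    for (a, c), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        # Each frequent pair yields two directed rules.
        for ante, cons in ((a, c), (c, a)):
            confidence = count / item_counts[ante]
            lift = confidence / (item_counts[cons] / n)
            rules.append({"antecedent": ante, "consequent": cons,
                          "support": support, "confidence": confidence, "lift": lift})
    # Highest lift first; confidence and antecedent name break ties deterministically.
    rules.sort(key=lambda r: (-r["lift"], -r["confidence"], r["antecedent"]))
    return rules


# Made-up product-level baskets.
baskets = [
    ["garlic", "ginger", "bananas"],
    ["garlic", "ginger"],
    ["bananas", "milk"],
    ["bananas", "yogurt"],
    ["bananas", "milk", "yogurt"],
    ["garlic", "ginger", "milk"],
]

rules = pair_rules(baskets, min_support=0.3)
```

On this toy data garlic → ginger ranks first with lift 2.0, while the far more common bananas only reach lift 1.0 with milk. Ranking by lift rather than raw frequency is what keeps ubiquitous items from crowding out the informative rules.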

RFM features are computed per user over the full history and passed into K-means. The final configuration uses k = 5, chosen by a trade-off between silhouette score and cluster interpretability. Three recommendation approaches run alongside the segmentation: item-item collaborative filtering, co-purchase graph walks, and a reorder-probability model tuned on the 58.9% catalogue-wide reorder rate.
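The silhouette half of that trade-off can be sketched with scikit-learn as below. The synthetic RFM rows, the candidate range of k, and the helper name `silhouette_by_k` are assumptions for illustration; the interpretability half of the decision is a human judgement that no score captures.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def silhouette_by_k(rfm, k_values=range(2, 7), seed=0):
    """Return {k: silhouette score} for K-means fitted on standardised RFM features."""
    # Standardise so that no single RFM column dominates the Euclidean distances.
    X = StandardScaler().fit_transform(rfm)
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        scores[k] = float(silhouette_score(X, labels))
    return scores


# Synthetic (recency, frequency, monetary) rows drawn around three made-up centres.
rng = np.random.default_rng(0)
centres = np.array([[5.0, 60.0, 600.0], [40.0, 20.0, 200.0], [200.0, 5.0, 40.0]])
rfm = np.vstack([rng.normal(c, [2.0, 2.0, 10.0], size=(50, 3)) for c in centres])

scores = silhouette_by_k(rfm)
best_k = max(scores, key=scores.get)
```

On clean synthetic data the score peaks at the true number of groups. Real RFM data is rarely this well separated, which is why the score is weighed against whether each cluster maps to a cohort someone can act on.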

What stood out

Peak demand concentrates on Sunday and Monday between 10 AM and 3 PM. Bananas are the single most ordered product (472K orders) and dominate both co-purchase and reorder patterns. Organic variants cluster tightly; a representative product-level rule is organic garlic → ginger root at lift 3.25.

Segment-wise, Champions are 4.9% of users and average 64.8 orders; the Hibernating cohort is 28.9% of users and averages 5.6 orders. Hibernating users outnumber Champions almost six to one yet place fewer than a tenth as many orders each, so re-engaging even a small share of them is the obvious retention lever.

What this is good for

The dashboard was built so that a retention team or category manager can act on the findings without reading code. The clusters become email cohorts. The rules become bundle proposals. The reorder probabilities become cart-abandonment triggers.

Tech stack

Python 3.10: Primary language.
pandas: Loading and aggregating 32.4M rows with efficient categorical encoding.
mlxtend: Apriori implementation at aisle and product granularity.
scikit-learn: RFM scaling, K-means, silhouette scoring.
Chart.js: Interactive dashboard with four analytical tabs.
matplotlib / seaborn: Static figures for heatmaps, cluster profiles, and rule networks.

References

[1] Agrawal, R., Imieliński, T., & Swami, A. (1993). Mining association rules between sets of items in large databases. ACM SIGMOD.
[2] Instacart Market Basket Analysis dataset. Kaggle, 2017.