
Market basket analysis helps us find useful information among hundreds, thousands, or millions of different purchases. Any business with a POS system already has access to this type of data.
Instead of focusing only on which products sell the most or least, we dive deeper to see which products are most often purchased together and which products influence the purchase of other products. We can then better plan our store, placing complementary items closer together.
In this example, we’ll use 1,138 receipts from a grocery store. First, we need to load our data, which comes from a CSV file.
import re
import itertools
import operator
import csv
from collections import defaultdict
from heapq import nlargest
from heapq import nsmallest

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

#mount drive
from google.colab import drive
drive.mount('/content/drive')

list_of_receipts = []

#load to pandas to preview
vis_receipts = pd.read_csv('/content/drive/My Drive/Colab Notebooks/Basket Analysis/data.csv')
vis_receipts.head()

As we can see, the file contains rows of [Date, Receipt #, Item]. Each receipt is currently split between multiple lines. We need to aggregate these into a more usable form. Ideally, we’d like each row to represent a single receipt.
#a function to neatly display our rules (used later)
def print_rules(rules, n):
    print("---Top {} rules---".format(n))
    #rules is a sorted list of ((a, b), value) pairs; show only the first n
    for (a,b),v in rules[:n]:
        print("{} -> {}: {}%".format(a,b,round(v*100,3)))

with open('/content/drive/My Drive/Colab Notebooks/Basket Analysis/data.csv') as f:
    reader = csv.reader(f)
    last_i = 1
    receipt_items = []
    for line in reader:
        #if the receipt number is the same as on the last line, append the current item to receipt_items
        if int(line[1]) == last_i:
            receipt_items.append(line[2])
        #else, append receipt_items to list_of_receipts & start a new list with the current item
        else:
            list_of_receipts.append(receipt_items)
            receipt_items = [line[2]]
            last_i = int(line[1])
    #store the final receipt, which the loop above never appends
    if receipt_items:
        list_of_receipts.append(receipt_items)

#see how we aggregated
print("Sample of Receipts: {}".format(list_of_receipts[:2]))
print("Number of Receipts: {}".format(len(list_of_receipts)))
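The row-by-row loop above relies on every receipt's rows being adjacent in the file. As a minimal sketch of an alternative, assuming the same [Date, Receipt #, Item] column order, the rows can be grouped by receipt number in a dictionary instead. The sample rows and the `group_receipts` helper are made up for illustration and are not part of the original notebook.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows in the [Date, Receipt #, Item] layout (not the real data).
sample_csv = io.StringIO(
    "1/1/2000,1,yogurt\n"
    "1/1/2000,1,eggs\n"
    "1/1/2000,2,milk\n"
    "2/1/2000,4,eggs\n"
    "2/1/2000,4,vegetables\n"
)

def group_receipts(rows):
    """Collect items per receipt number, keeping first-seen receipt order."""
    receipts = defaultdict(list)
    for _date, receipt_no, item in rows:
        receipts[int(receipt_no)].append(item)
    return list(receipts.values())

sample_receipts = group_receipts(csv.reader(sample_csv))
print(sample_receipts)
```

Because the grouping is keyed on the receipt number itself, gaps in the numbering (receipt 3 is missing in the sample) and a trailing receipt at the end of the file need no special handling.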

We’ve now aggregated our data into one list of items per receipt. Let’s also see how many items are in a typical shopping cart.
counts_of_items = []
for r in list_of_receipts:
    counts_of_items.append(len(r))

#print summary statistics
min_items = min(counts_of_items)
max_items = max(counts_of_items)
avg_items = sum(counts_of_items)/ len(counts_of_items)
print("Min Items: {}".format(min_items))
print("Max Items: {}".format(max_items))
print("Average Items: {}".format(avg_items))

sns.set_style('darkgrid')
_ = plt.hist(counts_of_items) # arguments are passed to np.histogram
plt.title("Histogram of Items in Shopping Cart")
plt.show()


It looks like the carts range from 5-34 items, with the average being nearly 20 items per cart.
For our basket analysis, we’ll need counts of item occurrences.
We’ll need a list of the occurrences of every item, Freq(A), and the number of occurrences of pairs of items, Freq(A, B). These will tell us how often particular products are purchased, and how often pairs of products are purchased.
item_counts = defaultdict(int)
pair_counts = defaultdict(int)

#count occurrences of each item
def get_item_counts(item_counts, itemset):
    for r in itemset:
        #add 1 to item_count for each item in each receipt
        for item in set(r):
            item_counts[item] += 1

#count occurrences of each pair
def get_pair_counts(pair_counts, item_counts, itemset):
    for r in itemset:
        for combo in itertools.combinations(set(r),2):
            if combo[0] != combo[1]:
                pair_counts[combo] += 1

get_item_counts(item_counts, list_of_receipts)
get_pair_counts(pair_counts, item_counts, list_of_receipts)
print("Distinct Items: {}".format(len(item_counts.items())))
print("Distinct Pairs: {}".format(len(pair_counts.items())))

After making our counts, we can see that our receipts have 38 distinct items (it must have been a rather small grocery store). We can also see there are 1,123 unique pairs of items.
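One thing worth checking here: 38 items can form at most 38 × 37 / 2 = 703 unordered pairs, so a count of 1,123 suggests that some pairs are stored in both orders, (A, B) and (B, A). That can happen because the order in which items come out of set(r) is not fixed from one receipt to the next. The following is a minimal sketch of order-independent pair counting on made-up receipts; `count_unordered_pairs` and the toy data are illustrative names, not part of the original notebook.

```python
import itertools
from collections import Counter

# Made-up receipts for illustration only.
toy_receipts = [
    ["eggs", "vegetables", "milk"],
    ["vegetables", "eggs"],
    ["milk", "eggs", "eggs"],
]

def count_unordered_pairs(receipts):
    """Count each pair once per receipt, with the two names in sorted order."""
    counts = Counter()
    for r in receipts:
        # sorted(set(r)) drops duplicate items and fixes the order inside every pair
        counts.update(itertools.combinations(sorted(set(r)), 2))
    return counts

toy_pair_counts = count_unordered_pairs(toy_receipts)

# Upper bound on the number of unordered pairs for the items we saw
n_items = len({item for r in toy_receipts for item in r})
max_pairs = n_items * (n_items - 1) // 2

print(toy_pair_counts)
print(len(toy_pair_counts), "distinct pairs, upper bound", max_pairs)
```

Counting this way would change the pair supports, and any step that reads the first element of a pair as the "if" side of a rule would then need to compute both directions explicitly.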
This is a good point to make some observations on our data. Let’s look at what items sell least. We can probably eliminate these from our store if the counts are extremely low. Doing this would help us free up shelf space and reduce inventory with low turnover.
#get n highest/lowest items based on frequency
high_sellers = nlargest(3, item_counts, key = item_counts.get)
low_sellers = nsmallest(3, item_counts, key = item_counts.get)
print("--Low Sellers--")
for val in low_sellers:
    print(val, ":", item_counts.get(val))
print("\n\n--High Sellers--")
for val in high_sellers:
    print(val, ":", item_counts.get(val))

Nothing useful… We still have relatively large sales for even the items that have sold the least. And we can’t just not sell hand-soap or sandwich loaves in a grocery store…
Support
Let’s now calculate the item and pair supports. The support of an item (or pair of items) is the fraction of all receipts that contain it, so it measures how often we expect that item (or pair) to appear in a randomly chosen cart.
#Item Support
item_support = dict()
for k,v in item_counts.items():
    item_support[k] = v/ len(list_of_receipts)

#Pair Support
pair_support = dict()
for k,v in pair_counts.items():
    pair_support[k] = v/ len(list_of_receipts)

#sort our rules based on support
sorted_item_support = sorted(item_support.items(), key=lambda kv: kv[1])
sorted_pair_support = sorted(pair_support.items(), key=lambda kv: kv[1])

#print top 5 rules
print_rules(sorted_pair_support[::-1],5)

As we can see, the average shopper has nearly a 74% chance of purchasing vegetables. We also see there’s a 33% chance of purchasing both Eggs and Vegetables.
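To make the arithmetic concrete, here is a small worked sketch of support on four made-up receipts. The data and the `toy_` names are purely illustrative and have nothing to do with the grocery store's numbers.

```python
from collections import Counter

# Four made-up receipts for illustration only.
toy_receipts = [
    ["eggs", "vegetables"],
    ["vegetables", "milk"],
    ["eggs", "vegetables", "milk"],
    ["milk"],
]
n_receipts = len(toy_receipts)

# Item support: fraction of receipts that contain the item at least once
toy_item_counts = Counter(item for r in toy_receipts for item in set(r))
toy_item_support = {item: c / n_receipts for item, c in toy_item_counts.items()}

# Pair support: fraction of receipts that contain both items
both = sum(1 for r in toy_receipts if {"eggs", "vegetables"} <= set(r))
toy_pair_support = both / n_receipts

print("support(vegetables) =", toy_item_support["vegetables"])
print("support(eggs) =", toy_item_support["eggs"])
print("support(eggs, vegetables) =", toy_pair_support)
```

Vegetables appear on 3 of the 4 receipts, so their support is 0.75; eggs and vegetables appear together on 2 of the 4, so the pair support is 0.5.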
Confidence
Now we’ll focus on calculating the confidence rules. Here we want to determine how often item B is purchased when item A is already being purchased.
#Generate Confidence Rules
confidence_rules = dict()
for k,v in pair_counts.items():
    #calculate confidence
    confidence = pair_support[k] / item_support[k[0]]
    if confidence < 1:
        confidence_rules[k] = confidence

#sort rules by confidence
sorted_confidence = sorted(confidence_rules.items(), key=lambda kv: kv[1])

#print top 20 rules
print_rules(sorted_confidence[::-1], 20)

As we saw before, it’s common for Eggs and Vegetables to be purchased at the same time. Indeed, if we are purchasing eggs, there’s a nearly 84% chance we’ll also purchase vegetables.
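Confidence is directional: confidence(A -> B) = support(A, B) / support(A), which is generally not the same as confidence(B -> A). A minimal sketch on made-up receipts shows the asymmetry; the `confidence` helper and the toy data are illustrative and not part of the original notebook.

```python
# Four made-up receipts for illustration only.
toy_receipts = [
    ["eggs", "vegetables"],
    ["vegetables", "milk"],
    ["eggs", "vegetables", "milk"],
    ["milk"],
]

def confidence(receipts, antecedent, consequent):
    """Share of receipts containing `antecedent` that also contain `consequent`."""
    with_a = [r for r in receipts if antecedent in r]
    if not with_a:
        return 0.0  # the antecedent never occurs, so there is nothing to measure
    with_both = [r for r in with_a if consequent in r]
    return len(with_both) / len(with_a)

eggs_to_veg = confidence(toy_receipts, "eggs", "vegetables")
veg_to_eggs = confidence(toy_receipts, "vegetables", "eggs")

print("eggs -> vegetables:", round(eggs_to_veg, 3))
print("vegetables -> eggs:", round(veg_to_eggs, 3))
```

In this toy data every cart with eggs also has vegetables, but only two of the three carts with vegetables have eggs, so the two directions give different values. Dividing the counts directly is equivalent to dividing the supports, since the number of receipts cancels out.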