Market Basket Analysis

Photo by Oleg Magni on Pexels.com

Market basket analysis helps us find useful information among hundreds, thousands, or millions of different purchases. Any business with a POS system already has access to this type of data.

Instead of focusing only on which products sell the most or least, we dive deeper to see which products are most often purchased together and which products influence the purchase of other products. We can then better plan our store, placing complementary items closer together.

In this example, we’ll use 1,138 receipts from a grocery store. First, we need to load our data from a CSV file.

import re
import itertools 
import operator
from collections import defaultdict
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from heapq import nlargest 
from heapq import nsmallest


#mount drive
from google.colab import drive
drive.mount('/content/drive')
list_of_receipts = []

#load to pandas to preview
import csv
vis_receipts = pd.read_csv('/content/drive/My Drive/Colab Notebooks/Basket Analysis/data.csv')
vis_receipts.head()

As we can see, the file contains rows of [Date, Receipt #, Item]. Each receipt is currently split between multiple lines. We need to aggregate these into a more usable form. Ideally, we’d like each row to represent a single receipt.

#a function to neatly display our rules (used later)
def print_rules(rules, n):
  print("---Top {} rules---".format(n))
  #slice to the first n rules so exactly n are printed
  for (a,b),v in rules[:n]:
    print("{} -> {}: {}%".format(a,b,round(v*100,3)))


with open('/content/drive/My Drive/Colab Notebooks/Basket Analysis/data.csv') as f:
  reader = csv.reader(f)
  last_i = 1
  receipt_items = []
  for line in reader:
    receipt_i = int(line[1])
    #if the receipt number matches the previous line, add the current item to receipt_items
    if receipt_i == last_i:
      receipt_items.append(line[2])
    #else, save the finished receipt to list_of_receipts & start a new list with the current item
    else:
      list_of_receipts.append(receipt_items)
      receipt_items = [line[2]]
      last_i = receipt_i
  #save the final receipt, which is never followed by a new receipt number
  if receipt_items:
    list_of_receipts.append(receipt_items)

#see how we aggregated
print("Sample of Receipts: {}".format(list_of_receipts[:2]))
print("Number of Receipts: {}".format(len(list_of_receipts)))
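The same aggregation can also be sketched with a dictionary keyed on the receipt number, which does not depend on receipt numbers being consecutive. The sample rows, the absence of a header row, and the name `group_receipts` are assumptions made for illustration; they are not taken from the real file.

```python
import csv
import io
from collections import defaultdict

# Hypothetical stand-in for data.csv: rows of date, receipt number, item.
# The sample has no header row; that is an assumption about the real file.
sample_csv = io.StringIO(
    "1/1/2020,1,eggs\n"
    "1/1/2020,1,vegetables\n"
    "1/2/2020,2,milk\n"
    "1/2/2020,4,eggs\n"
    "1/2/2020,4,bagels\n"
)

def group_receipts(rows):
    """Collect items per receipt number, in order of first appearance."""
    grouped = defaultdict(list)
    for _date, receipt_id, item in rows:
        grouped[int(receipt_id)].append(item)
    # dicts preserve insertion order, so receipts keep their file order
    return list(grouped.values())

sample_receipts = group_receipts(csv.reader(sample_csv))
print(sample_receipts)
```

Because every row is filed under its own receipt number, the last receipt in the file is handled the same way as all the others.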

We’ve now aggregated our data into one list per receipt. Let’s also see how many items are in a typical shopping cart.

counts_of_items = []
for r in list_of_receipts:
  counts_of_items.append(len(r))

#print summary statistics
min_items = min(counts_of_items)
max_items = max(counts_of_items)
avg_items = sum(counts_of_items)/ len(counts_of_items)

print("Min Items: {}".format(min_items))
print("Max Items: {}".format(max_items))
print("Average Items: {}".format(avg_items))

sns.set_style('darkgrid')
_ = plt.hist(counts_of_items)  # arguments are passed to np.histogram
plt.title("Histogram of Items in Shopping Cart")

plt.show()

It looks like the carts range from 5-34 items, with the average being nearly 20 items per cart.

For our basket analysis, we’ll need counts of item occurrences.

We’ll need the number of occurrences of every item, Freq(A), and the number of occurrences of every pair of items, Freq(A, B). These will tell us how often particular products are purchased, and how often pairs of products are purchased.

item_counts = defaultdict(int)
pair_counts = defaultdict(int)

#count occurrences of each item
def get_item_counts(item_counts, itemset):
    for r in itemset:
      #add 1 to item_count for each item in each receipt
      for item in set(r):
        item_counts[item] += 1

#count occurrences of each pair
def get_pair_counts (pair_counts, item_counts, itemset): 
    for r in itemset:
      for combo in itertools.combinations(set(r),2):
        if combo[0] != combo[1]:
          pair_counts[combo] += 1
          
get_item_counts(item_counts, list_of_receipts)
get_pair_counts(pair_counts, item_counts, list_of_receipts)

print("Distinct Items: {}".format(len(item_counts.items())))
print("Distinct Pairs: {}".format(len(pair_counts.items())))

After making our counts, we can see that our receipts have 38 distinct items (must have been a rather small grocery store). We can also see there are 1,123 unique pairs of items.
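One thing worth checking here: 38 distinct items allow at most 38 × 37 / 2 = 703 unordered pairs, so a count of 1,123 suggests that some pairs are being stored in both orders, as (A, B) and as (B, A). That can happen because a set has no guaranteed iteration order, so the same two items may come out of `itertools.combinations` in a different order on different receipts. A minimal sketch of order-independent counting, assuming we sort each receipt's items before pairing them (the function name `count_unordered_pairs` and the toy receipts are illustrative, not part of the original notebook):

```python
import itertools
from collections import defaultdict

def count_unordered_pairs(receipts):
    """Count how many receipts contain each pair, ignoring item order."""
    counts = defaultdict(int)
    for r in receipts:
        # sorted() gives every receipt the same item order, so a pair
        # always maps to one key, e.g. ('eggs', 'milk'), never ('milk', 'eggs')
        for combo in itertools.combinations(sorted(set(r)), 2):
            counts[combo] += 1
    return counts

# Made-up receipts, purely for illustration
toy_receipts = [
    ["milk", "eggs", "bread"],
    ["eggs", "milk"],
    ["bread", "eggs", "eggs"],
]
toy_pair_counts = count_unordered_pairs(toy_receipts)
print(dict(toy_pair_counts))
```

With keys stored this way, any later lookup for a pair would need to use the same sorted order, whichever direction the rule points.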

This is a good point to make some observations on our data. Let’s look at what items sell least. We can probably eliminate these from our store if the counts are extremely low. Doing this would help us free up shelf space and reduce inventory with low turnover.

#get n highest/lowest items based on frequency
high_sellers = nlargest(3, item_counts, key = item_counts.get) 
low_sellers = nsmallest(3, item_counts, key = item_counts.get) 

print("--Low Sellers--")
for val in low_sellers: 
    print(val, ":", item_counts.get(val)) 

print("\n\n--High Sellers--")
for val in high_sellers: 
    print(val, ":", item_counts.get(val)) 

Nothing useful… We still have relatively large sales for even the items that have sold the least. And we can’t just not sell hand-soap or sandwich loaves in a grocery store…


Support

Let’s now calculate the item and pair supports. These measure how often we expect items (or pairs of items) to appear in random carts.

Support(A) = \frac{frequency(A)}{N}

Support(A \cup B) = \frac{frequency(A \cup B)}{N}

#Item Support
item_support = dict()
for k,v in item_counts.items():
  item_support[k] = v/ len(list_of_receipts)

#Pair Support
pair_support = dict()
for k,v in pair_counts.items():
  pair_support[k] = v/ len(list_of_receipts)


#sort our rules based on support 
sorted_item_support = sorted(item_support.items(), key=lambda kv: kv[1])  
sorted_pair_support = sorted(pair_support.items(), key=lambda kv: kv[1])

#print top 5 rules
print_rules(sorted_pair_support[::-1],5)

As we can see, the average shopper has nearly a 74% chance of purchasing vegetables. We also see there’s a 33% chance of purchasing both Eggs and Vegetables.
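To make the two support formulas concrete, here is a small worked example on four made-up receipts. The data and the helper name `support` are illustrative only and are not part of the analysis above.

```python
# Four made-up receipts, purely for illustration
toy_receipts = [
    {"eggs", "vegetables", "milk"},
    {"eggs", "vegetables"},
    {"milk", "bagels"},
    {"vegetables", "bagels"},
]

def support(items, receipts):
    """Fraction of receipts that contain every item in `items`."""
    items = set(items)
    # `items <= r` is True when items is a subset of receipt r
    return sum(items <= r for r in receipts) / len(receipts)

support_veg = support({"vegetables"}, toy_receipts)           # 3 of 4 receipts
support_pair = support({"eggs", "vegetables"}, toy_receipts)  # 2 of 4 receipts

print("Support(vegetables):", support_veg)
print("Support(eggs, vegetables):", support_pair)
```

The support of a pair can never exceed the support of either item in it, since every receipt containing both items also contains each one.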

Confidence

Now we’ll focus on calculating the confidence rules. Here we want to determine how often item B is purchased when item A is already being purchased.

Conf(A \Rightarrow B) = \frac{Support(A \cup B)}{Support(A)}

#Generate Confidence Rules

confidence_rules = dict()

for k,v in pair_counts.items():
  #calculate confidence
  confidence = pair_support[k] / item_support[k[0]]
  if confidence < 1:
    confidence_rules[k] = confidence

#sort rules by confidence
sorted_confidence = sorted(confidence_rules.items(), key=lambda kv: kv[1])

#print top 20 rules
print_rules(sorted_confidence[::-1], 20)

As we saw before, it’s common for Eggs and Vegetables to be purchased at the same time. Indeed, if we are purchasing eggs, there’s a nearly 84% chance we’ll also purchase vegetables.
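Confidence is directional: Conf(A ⇒ B) and Conf(B ⇒ A) generally differ, because each divides the same pair support by a different item support. A small sketch on made-up receipts (the data and the function name `confidence` are illustrative) computes both directions directly from counts, which matches the formula above because N cancels out of the ratio:

```python
# Five made-up receipts, purely for illustration
toy_receipts = [
    {"eggs", "vegetables", "milk"},
    {"eggs", "vegetables"},
    {"eggs", "milk"},
    {"vegetables", "bagels"},
    {"vegetables", "milk"},
]

def confidence(a, b, receipts):
    """Conf(a => b): share of receipts containing a that also contain b."""
    with_a = [r for r in receipts if a in r]
    if not with_a:
        return 0.0  # a never appears, so the rule has no evidence
    return sum(b in r for r in with_a) / len(with_a)

conf_eggs_to_veg = confidence("eggs", "vegetables", toy_receipts)  # 2 of 3
conf_veg_to_eggs = confidence("vegetables", "eggs", toy_receipts)  # 2 of 4

print("Conf(eggs => vegetables):", round(conf_eggs_to_veg, 3))
print("Conf(vegetables => eggs):", round(conf_veg_to_eggs, 3))
```

When ranking rules, it can therefore be worth evaluating each pair in both directions rather than only with the first item as the antecedent.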
