RFM Analysis

Image by Author

Any good data scientist is (or at least should be) adept at taking complex mathematical and statistical models and explaining them in a simple and concise manner. In the end, our job is to create value for our company or client. Even if we have a model with 99.9999% accuracy, management is unlikely to use it to make decisions unless they understand (at the very least) the basics of the model.

Our Problem

A large part of any business is built around understanding the company’s clients, and ensuring their needs and wants are being satisfied. This helps us ensure our clients are actually using the products we’re creating/providing and that we’re spending our own resources in optimal business segments. A common approach to understanding our clients is to segregate them into distinct groups. Instead of trying to understand and develop products for hundreds of thousands of individual people or companies, we can instead focus our efforts on a few distinct groups which represent our underlying clients. This allows us to make more informed, targeted decisions that will have a greater impact. In short, it ensures we see the forest and not the trees.

If you’ve been around machine learning over the past few years, you’re brain has automatically already switched into unsupervised learning mode and you’re already thinking of coding up a k-means or nearest-neighbor model. I can’t deny that I’m not usually right there with you. But let’s take a step back. Is there a simpler method? One that requires almost zero explanation to management? One that’s much less computationally expensive?

RFM to the Rescue

RFM analysis began in the mid 90’s when companies where trying to find optimal groups for Direct Mail. Although there it’s a reference to any specific research paper, most people site Jan Bult and Tom Wansbeek’s article “Optimal Selection for Direct Mail” in Marketing Science as the first emergence of the idea. And the idea is simple.

RFM stands for Recency, Frequency, and Monetary Value. In short we want to group customers based on:

  1. How recent was their last transaction?
  2. How frequently do they purchase?
  3. How much money have they spent with us?

Customers who have purchased three items within the last month are more important than customers who purchased two items over the last three years. Customers who have spent $10,000 on our products/services are more important than customers who have spent $50. As you can see, this methodology requires little (if any) explanation.

The implementation of the analysis is equally as simple. For each group (R,F,M) we’ll divide our customers into N segments. We’ll then issue a rank to each group from 1-N (with 1 being optimal). At the end, we can use our RFM scores as-is, or we can use some aggregation method to develop one super-score for each customer.

Let’s get started

First, we’ll need a list of all invoices over the time period we’re analyzing. For each invoice we’ll only need three things:

  1. Client Name
  2. Invoice Amount
  3. Invoice Date

For any analyst, this should be nothing more than a simple SQL query.

SELECT IV.client__description as "client_description",
IV.invoice_date,
IV.amount
FROM Invoice_View IV
view raw invoice_pull.sql hosted with ❤ by GitHub

We’ll then pull our SQL data into a pandas dataframe. (The data in this example has been anonymized.)

# Load our data and ensure dates are parese correctly
data = pd.read_csv('sql_anon_invoices.csv')
data['invoice_date'] = data['invoice_date'].astype('datetime64[ns]')
data.head()
print("Rows: {}".format(len(data)))
view raw RFM_data_load.py hosted with ❤ by GitHub
Image by Author

We’ll now create a new dataframe (rfm) where we’ll house our RFM metrics. Our new dataframe will contain one row for each client, along with their individual R,F,M ratings.

We’ll need a few helper functions to generate our metrics. We can use Pandas’ apply to use these on our data.

#get days since last invoice (Recency)
def get_history(x):
d = data[data['client_id'] == x.client_id]
d_max = max(d['invoice_date'])
t = datetime.today()
y = t d_max
return y.days
#count number of invoices for client (Frequency)
c_invoices_dict = collections.Counter(data['client_description'])
def get_counts(x):
return c_invoices_dict[x.client_description]
#get total amount of all invoices for client (Monetary Value)
def total_invoices(x):
d = data[data['client_description'] == x.client_description]
total = d.amount.sum()
return total

Finally, we’ll need one last bit of code to split our customers into N groups and assign a ranking. For this example I’ve chosen 4 groups and am splitting based on simple quartiles.

#split metric X into quartiles
def get_ranks(x):
pct_75 = np.percentile(x, 75)
pct_50 = np.percentile(x, 50)
pct_25 = np.percentile(x, 25)
return {'max':x.max(), '75':pct_75, '50':pct_50, '25':pct_25, 'min':x.min()}
#rank Recency metric
def rank_r(x):
i = x.R_days_since_last_invoice
r = 0
if i <= R_pct['25']:
r = 1
if (i >= R_pct['25']) and (i < R_pct['50']):
r = 2
if (i >= R_pct['50']) and (i < R_pct['75']):
r = 3
if i >= R_pct['75']:
r = 4
return r
#…. F and M are ranked similarly, see GitHub for full code.
view raw RFM_rank.py hosted with ❤ by GitHub

Our ranking is now complete. Now we can dive into our analysis and pull out different segments that are useful for management.

Image by Author.

Final Analysis

With RFM analysis, there are many different types of customer segments we can choose. Of course our best customers (with recent, frequent purchases who have spent a large amount of money) are denoted by R=1, F=1, M=1 – or more concisely as (1-1-1). A listing of some other notable groups is below:

  1. Low-spending but Active, Loyal clients – (1-1-3|4)
  2. Best customer we let get away (4-1|2-1|2)
  3. New big, spending customers (1-4-1|2)
  4. the list goes on and on….

All of these segments are easily accessible with a simple pandas filter.

best_clients = rfm[(rfm['R'] ==1) & (rfm['F'] == 1) & (rfm['M']==1)]

Of course, what would an analysis be without a visualization? A small side-advantage of RFM analysis is that it provides 3 dimensions – making it easy to visualize. Let’s step into it with plotly.

import plotly.express as px
fig = px.scatter_3d(rfm, x='R_days_since_last_invoice', y='F_count_invoices', z='M_total_invoiced', hover_name='client_description', color='M_total_invoiced')
fig.update_layout(scene = dict(
xaxis_title='Days Since Last Invoice',
yaxis_title='Count Total Invoices',
zaxis_title='Total Amount Invoiced'))
fig.show()
view raw RFM_plot.py hosted with ❤ by GitHub
Image by Author

Thank you for taking the time to read. I truly appreciate it. The full code can be found on my GitHub.

%d bloggers like this: