Anomaly Detection using Benford’s Law

Are these transactions real?

What are the chances of rolling a die and getting a 5? Of course, 1/6. What are the chances that a randomly selected number between 1 and 100 is 32? 1/100.

Suppose you download your bank transactions for 2020. What are the chances that a random transaction’s amount begins with 3? Considering that there are 9 possible digits (omitting 0 as a 1st digit), you’d logically guess 1/9, or about 11%. Surprisingly, this is wrong. The true probability is actually around 12.5%. And the probability that the first digit is a 1 is an amazing 30.1%.

So where did this rule come from, and how can we use it?

History

Although commonly known as Benford’s Law, like many famous laws, it’s not named after the first person to discover it. It was actually an astronomer named Simon Newcomb who noticed in 1881 that in logarithm tables, some pages were worn much more than others, particularly the first few pages. His finding was rediscovered in 1938 by Frank Benford, who went on to run many more empirical tests to validate the theory.

In short, the law states that the leading digits of numbers in a “real” dataset do not occur with uniform probability. What do we mean by “real”? Here, we mean naturally occurring sets of numbers: bank account transactions, street addresses, mathematical constants. The probability that a number begins with d (1, 2, 3, …, 9) is given by the following formula:

P(d) = log10(1 + 1/d)
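As a quick sanity check of the first-digit rule, here is a minimal sketch that compares the expected probabilities with the leading digits of the powers of 2, a sequence known to follow Benford’s Law closely. The function name `benford_first_digit` and the choice of 1,000 powers are my own for illustration; they are not part of the analysis later in this post.

```python
import math
from collections import Counter


def benford_first_digit(d: int) -> float:
    """Expected probability that the leading digit equals d (1-9)."""
    return math.log10(1 + 1 / d)


# Leading digits of 2**1 through 2**1000
leading = Counter(int(str(2 ** n)[0]) for n in range(1, 1001))

for d in range(1, 10):
    expected = benford_first_digit(d)
    observed = leading[d] / 1000
    print(f"{d}: expected {expected:.3f}, observed {observed:.3f}")
```

The nine expected probabilities sum to 1, and the observed shares for the powers of 2 should land close to them without any tuning.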

If we plot the expected frequencies for first digits, we obtain the following plot:


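Similarly, we can do the same with the second digits. Here 0 is a possible value, and the distribution is much flatter: the expected frequency falls from about 12% for 0 to about 8.5% for 9.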

So how can we use our newfound knowledge? Its most logical applications are within accounting (particularly auditing), but uses can also be found elsewhere. The law is so widely documented that evidence based on it has been admitted in federal, state, and local court cases in the United States.

Further examples:

  • If you’ve watched the movie The Accountant starring Ben Affleck, you can see him use Benford’s Law to uncover fraud.
  • Ciaponi and Mandanici used the law in a research paper to investigate whether Italian universities intentionally used fraudulent data to bolster their books.
  • Benford’s Law was applied to the results of the 2009 Iranian election to look for evidence of voter fraud.

Utilizing the Law

Now that you understand the theory of Benford’s Law, let’s put it to work. In our example, we’ll be reviewing around 2,100 financial transactions to see if they conform. If we detect digits that fall outside of the expected frequencies, we’ll tag the corresponding transactions as suspicious and have others investigate them further. (Code and example can be found on my GitHub.)

First, we need to create a listing of the expected frequencies. For our purposes, we’ll focus on the first two digits and utilize numpy to create a 10×2 matrix. Rows will represent the digit (0-9). Columns will represent the digit position (1st or 2nd).

import math

import numpy as np


# Probability of digit (d) occurring in nth position
def calc_expected_probability(d: int, n: int) -> float:
    # Generalization of Benford's Law: sum over k of log10(1 + 1 / (10k + d)),
    # where d = digit and k runs from 10^(n-2) up to 10^(n-1) - 1
    # source: https://en.wikipedia.org/wiki/Benford%27s_law
    prob = 0.0
    if (d == 0) and (n == 1):
        prob = 0.0  # a leading digit is never 0
    elif n == 1:
        prob = math.log10(1 + (1 / d))
    else:
        l_bound = 10 ** (n - 2)
        u_bound = 10 ** (n - 1)
        for k in range(l_bound, u_bound):
            prob += math.log10(1 + (1 / (10 * k + d)))
    return round(prob, 3)


# Populate matrix with Benford's Law (10×2 matrix: row = digit, column = position - 1)
expected_prob_matrix = np.zeros((10, 2))
for d_i in range(expected_prob_matrix.shape[0]):
    for n_i in range(expected_prob_matrix.shape[1]):
        expected_prob_matrix[d_i, n_i] = calc_expected_probability(d_i, n_i + 1)

Now we need to calculate the frequencies of the various digits in each position from our data, the 2,100 transactions. We’ll first load our data into a pandas DataFrame. We’ll then clean the numbers (remove negative signs, commas, etc.) and extract the first and second digits.
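The cleaning step isn’t shown in this post, so here is a minimal sketch of one way to pull out the two leading digits. It is not the code from my repository: the helper name `leading_digits` is made up for illustration, and it assumes amounts arrive as plain decimal strings (no scientific notation) that are formatted consistently, since trailing decimal zeros count as digits.

```python
import re
from typing import Optional, Tuple


def leading_digits(amount: str) -> Tuple[Optional[int], Optional[int]]:
    """Return (first_digit, second_digit) of an amount given as text.

    Signs, currency symbols, commas, the decimal point, and leading zeros
    are ignored. A digit that does not exist is returned as None.
    """
    digits = re.sub(r"\D", "", amount)  # keep digit characters only
    digits = digits.lstrip("0")         # leading zeros carry no information
    first = int(digits[0]) if len(digits) >= 1 else None
    second = int(digits[1]) if len(digits) >= 2 else None
    return first, second


print(leading_digits("-1,234.50"))  # (1, 2)
print(leading_digits("0.05"))       # (5, None)
```

With pandas, a helper like this could be applied to a text column using `Series.map`, for example on a hypothetical `amount` column, and the results split into `first_digit` and `second_digit` columns.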

For easy comparison to our precomputed Expected Probabilities, we’ll also calculate the probabilities of the observed data within another numpy array.

# Compute observed probability matrix
# (columns 2 and 3 of `data` are taken to hold first_digit and second_digit)
observed_prob_matrix = np.zeros((10, 2))
# Number of usable values per position; rows without a second digit are excluded
data_cols_length = [data.first_digit.count(), data[data.second_digit != 'NaN'].second_digit.count()]
for d_i in range(observed_prob_matrix.shape[0]):
    for n_i in range(observed_prob_matrix.shape[1]):
        digit_col = data.iloc[:, n_i + 2]
        observed_prob_matrix[d_i, n_i] = (digit_col == d_i).sum() / data_cols_length[n_i]

Based on “The Use of Benford’s Law as an Aid in Analytical Procedures” by Nigrini and Mittermaier, we can calculate a confidence interval around our expected values for a given confidence level.

We can save our upper and lower bounds in two numpy arrays and then compare our observations to see if they fall outside of the calculated confidence interval. If so, we have data that looks suspicious and deserves a closer look.

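from scipy import stats

# Arrays for the half-lengths and interval bounds (assumed to be preallocated like this)
h_length_matrix = np.zeros((10, 2))
u_bound_matrix = np.zeros((10, 2))
l_bound_matrix = np.zeros((10, 2))

# Calculate Confidence Intervals for probability matrix
def calc_confidence_interval(alpha):
    # alpha is the confidence level (e.g. 0.95), so this is the two-sided z value
    alpha_stat = stats.norm.ppf(1 - (1 - alpha) / 2)
    for d_i in range(expected_prob_matrix.shape[0]):
        for n_i in range(expected_prob_matrix.shape[1]):
            prop_exp = expected_prob_matrix[d_i, n_i]
            # normal-approximation half-length plus a 1/(2n) continuity correction
            half_length = alpha_stat * (prop_exp * (1 - prop_exp) / data_cols_length[n_i]) ** .5 + (
                1 / (2 * data_cols_length[n_i]))
            u_bound = prop_exp + half_length
            l_bound = prop_exp - half_length
            h_length_matrix[d_i, n_i] = half_length
            u_bound_matrix[d_i, n_i] = u_bound
            l_bound_matrix[d_i, n_i] = max(l_bound, 0)

alpha_level = 0.95  # assumed value for illustration; the listing does not show it
calc_confidence_interval(alpha_level)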
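The comparison of observed frequencies against the bounds isn’t shown in the listing, so here is a rough, self-contained sketch of that step for first digits only. It uses the standard library instead of numpy and scipy, the function names are hypothetical, and the counts are made up purely for illustration (they are not the transaction data from this post).

```python
import math
from statistics import NormalDist


def benford_interval(p_expected: float, n: int, confidence: float = 0.95):
    """Bounds around an expected proportion, with a 1/(2n) continuity correction."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    half = z * math.sqrt(p_expected * (1 - p_expected) / n) + 1 / (2 * n)
    return max(p_expected - half, 0.0), p_expected + half


def flag_first_digits(counts: dict, confidence: float = 0.95) -> list:
    """Return the first digits whose observed share falls outside the interval."""
    n = sum(counts.values())
    flagged = []
    for d in range(1, 10):
        lower, upper = benford_interval(math.log10(1 + 1 / d), n, confidence)
        observed = counts.get(d, 0) / n
        if not lower <= observed <= upper:
            flagged.append(d)
    return flagged


# Made-up counts for 2,100 amounts: roughly Benford-shaped, with too many 8s
example_counts = {1: 600, 2: 350, 3: 250, 4: 195, 5: 160,
                  6: 135, 7: 115, 8: 205, 9: 90}
print(flag_first_digits(example_counts))  # [8]
```

A flagged digit is only a pointer: the transactions that start with it are the ones to hand over for a closer look.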

To visualize our findings, we can plot the distributions of the first and second digits, along with the confidence intervals.

As we can see, in the first digit, 8 occurs much more frequently than predicted by Benford’s Law. In the second digit, 5 also occurs much more frequently than predicted, and is in fact the most frequent second digit of all.

I hope you learned something and are able to find another useful application of Benford’s Law. Again, the code is on my GitHub.

Thank you for reading.
