**Benford’s Law Introduction**

## What is Benford’s Law?

Benford’s Law is a statistical phenomenon that describes the distribution of leading digits in many naturally occurring datasets. The law states that the leading digit of a value in such a dataset is more likely to be small (e.g., 1, 2, or 3) than large (e.g., 8 or 9). This pattern is observed in a wide range of datasets, including stock prices, geographic data, scientific measurements, and financial transactions.

For example, the probability that the first digit is 1 is about 30.1%, while the probability that it is 9 is only about 4.6%. The pattern tends to appear in data that spans several orders of magnitude, such as financial statements, population statistics, and scientific data.

## Benford’s Law formula

The formula is given as: `P(d) = log10(1 + 1/d)`

and can be expressed as:

\begin{equation*}
\begin{aligned}
&P(d) = \log_{10}\left(1 + \frac{1}{d}\right), \\
&d \in \{1, 2, \ldots, 9\}
\end{aligned}
\end{equation*}

Where `P(d)` is the probability that the first digit in a number is `d`, and `d` is any integer between 1 and 9. This formula shows that the probability of observing a small digit as the first digit is much higher than observing a large digit.

In simpler terms, the formula suggests that the first digit of numbers in many naturally occurring datasets is not uniformly distributed, but instead follows a predictable pattern. This pattern is characterized by a high frequency of smaller digits (1, 2, 3), and a low frequency of larger digits (7, 8, 9).
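To make the formula concrete, the short sketch below evaluates `P(d)` for each digit and checks that the nine probabilities add up to 1. The names `benford` and `total` are chosen here purely for illustration.

```python
import math

# Benford probability for each leading digit d = 1..9
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

for d, p in benford.items():
    print(f"{d}: {p:.3f}")

# The nine terms telescope to log10(10), so they should sum to 1
total = sum(benford.values())
print(f"total: {total:.3f}")
```

The sum works out because each term equals `log10(d + 1) - log10(d)`, so everything cancels except `log10(10) - log10(1)`.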

## Python code

The following Python code evaluates the Benford’s Law formula for each digit from 1 to 9 and plots the resulting distribution:

```
import numpy as np
import matplotlib.pyplot as plt


def benfords_law(x):
    # Expected probability of leading digit x under Benford's Law
    return np.log10(1 + 1 / x)


x = np.arange(1, 10)  # leading digits 1..9
y = benfords_law(x)

fig, ax = plt.subplots(figsize=(9, 5))
bars = ax.bar(x, y, color='c')
plt.title("Benford's Law Distribution")
plt.xlabel("First Digit")
plt.ylabel("Probability")
plt.ylim([0, 0.4])
plt.xticks(x)

# Label each bar with its probability, rounded to two decimals
for bar in bars:
    h = bar.get_height()
    ax.text(bar.get_x() + bar.get_width() / 2, h, f"{h:.2f}", ha='center', va='bottom')
plt.show()
```

## Validation of Benford’s law on different data sets

The next example defines functions that check the values in a pandas DataFrame against Benford’s Law by comparing the observed distribution of first digits with the expected one.

The `get_first_digits()` function extracts the first digit of each value in the input data, while the `calculate_frequency()` function calculates the frequency of each digit in the extracted first digits.

The `calculate_benford_freq()` function calculates the expected frequency of each digit according to Benford’s Law, and the `plot_benford_law()` function generates a bar chart that compares the observed frequency of each digit in the data with the expected frequency according to Benford’s Law.

```
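import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


def get_first_digits(data):
    # Leading significant digit of each numeric cell, so 0.042 counts as 4.
    # Booleans, zeros, NaN/inf, and non-numeric cells are skipped.
    digits = []
    for value in data.values.flatten():
        if isinstance(value, bool) or not isinstance(value, (int, float, np.number)):
            continue
        if np.isfinite(value) and value != 0:
            # 17 significant digits avoid rounding up across a power of ten
            digits.append(int(f"{abs(float(value)):.16e}"[0]))
    return digits


def calculate_frequency(first_digits):
    # Count the occurrences of each digit 1..9
    freq = np.zeros(9)
    for i in range(1, 10):
        freq[i - 1] = sum(1 for j in first_digits if j == i)
    return freq


def calculate_benford_freq():
    # Expected relative frequency of each digit under Benford's Law
    return [np.log10(1 + 1 / d) for d in range(1, 10)]


def plot_benford_law(data, ds_name):
    first_digits = get_first_digits(data)
    freq = calculate_frequency(first_digits)
    benford_freq = calculate_benford_freq()
    plt.figure(figsize=(10, 5))
    # Bars: observed relative frequencies; red line: Benford expectation
    plt.bar(range(1, 10), freq / sum(freq), label='Observed', alpha=0.5)
    plt.grid(axis='both')
    plt.plot(range(1, 10), benford_freq, 'r', label='Expected')
    plt.legend()
    plt.xlabel('First Digit')
    plt.ylabel('Frequency')
    plt.title(f"Benford's Law Applied to {ds_name} dataset")
    plt.xticks(range(1, 10))
    plt.show()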
```
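Before trying real datasets, it can help to sanity-check the idea on synthetic data where the outcome is known. The sketch below is an illustration only, not part of the functions above: it assumes log-uniform samples covering six full decades, a case in which the leading digits should follow Benford’s Law closely. The helper `leading_digits()` and the variable names are hypothetical.

```python
import numpy as np


def leading_digits(values):
    """Return the leading significant digit (1-9) of each positive value."""
    values = np.asarray(values, dtype=float)
    exponent = np.floor(np.log10(values))
    digits = np.floor(values / 10.0 ** exponent).astype(int)
    # Guard against floating-point rounding right at a power of ten
    return np.clip(digits, 1, 9)


# Assumption: log-uniform samples covering six full decades (1 to 1,000,000)
rng = np.random.default_rng(seed=0)
samples = 10.0 ** rng.uniform(0, 6, size=100_000)

# Relative frequency of each leading digit 1..9
counts = np.bincount(leading_digits(samples), minlength=10)[1:]
observed = counts / counts.sum()
expected = np.log10(1 + 1 / np.arange(1, 10))

for d, (obs, exp) in enumerate(zip(observed, expected), start=1):
    print(f"{d}: observed={obs:.3f} expected={exp:.3f}")

max_gap = np.abs(observed - expected).max()
print(f"largest gap: {max_gap:.4f}")
```

Data confined to a narrow range (for example, values that all lie between 10 and 20) would not be expected to match, which is worth keeping in mind when reading the plots for the datasets that follow.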

```
from sklearn.datasets import load_wine
wine = load_wine()
wine_df = pd.DataFrame(data=wine.data, columns=wine.feature_names)
plot_benford_law(wine_df, 'WINE')
```

```
from sklearn.datasets import load_breast_cancer
breast_cancer = load_breast_cancer()
breast_cancer_df = pd.DataFrame(data=breast_cancer.data, columns=breast_cancer.feature_names)
plot_benford_law(breast_cancer_df, 'BREAST CANCER')
```

```
from seaborn import load_dataset
pg = load_dataset('attention')
plot_benford_law(pg, 'ATTENTION')
```

```
from seaborn import load_dataset
pg = load_dataset('flights')
plot_benford_law(pg, 'FLIGHTS')
```
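The plots above compare observed and expected frequencies by eye. One possible way to condense that comparison into a single number is the mean absolute deviation between the two distributions. The sketch below is a hedged illustration, not part of the original analysis: the function name `benford_mad()` is hypothetical, and the two integer sequences are used only because their first-digit behavior is easy to reason about.

```python
import math


def benford_mad(first_digits):
    """Mean absolute deviation between observed and Benford digit frequencies."""
    total = len(first_digits)
    gaps = []
    for d in range(1, 10):
        observed = first_digits.count(d) / total
        expected = math.log10(1 + 1 / d)
        gaps.append(abs(observed - expected))
    return sum(gaps) / len(gaps)


# Powers of two span many orders of magnitude
powers_of_two = [int(str(2 ** n)[0]) for n in range(1, 1001)]
# The integers 1..999 have uniformly distributed first digits (111 of each)
uniform_ints = [int(str(n)[0]) for n in range(1, 1000)]

mad_powers = benford_mad(powers_of_two)
mad_uniform = benford_mad(uniform_ints)
print(f"powers of two: {mad_powers:.4f}")
print(f"1..999:        {mad_uniform:.4f}")
```

A smaller value indicates a closer match to the Benford distribution. To score one of the DataFrames above, the list returned by `get_first_digits()` could be passed to the same function after filtering it to the digits 1 through 9.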