Lesson 15: How to Catch a Liar With Math

Part 1

A Strange Question

Quick — think of a number. Any number. The population of a city. The price of a house. The number of followers someone has. The distance to a star.

Now look at the first digit of that number. Is it a 1? A 5? A 9?

You'd think every digit (1 through 9) would show up equally often as the first digit — about 11% each. That seems logical, right?

It's completely wrong.

In the real world, the digit 1 appears as the first digit about 30% of the time. The digit 2 appears about 18% of the time. The digit 9? Only about 5%.

This isn't a coincidence. It's a mathematical law discovered over 100 years ago, and it's so reliable that investigators use it to catch people who fake their numbers. It's called Benford's Law.

Part 2

Seeing It With Real Data

Let's look at the first digits of real-world numbers. We'll start with country populations:

# Populations of 30 countries (approximate, in millions)
populations = [
    1412, 1408, 334, 277, 223, 216, 211, 171, 167, 149,
    129, 127, 115, 112, 105, 99, 87, 84, 83, 70,
    68, 67, 66, 60, 59, 55, 52, 47, 46, 45
]

# Count the first digits
digit_counts = {1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0}

for pop in populations:
    first_digit = int(str(pop)[0])
    digit_counts[first_digit] = digit_counts[first_digit] + 1

# Display
total = len(populations)
print(f"FIRST DIGITS OF {total} COUNTRY POPULATIONS")
print("═" * 40)
print()

for digit in range(1, 10):
    count = digit_counts[digit]
    pct = count / total * 100
    bar = "█" * int(pct / 3) + "░" * (12 - int(pct / 3))
    print(f"  {digit}: {bar} {pct:5.1f}% ({count})")

print()
print("Notice: 1 appears WAY more than 9!")
print("This isn't a coincidence...")

What Just Happened

str(pop)[0] converts the number to text, then grabs the first character. So 1412 → "1412" → "1".

int(str(pop)[0]) converts it back to a number so we can use it as a dictionary key.

Look at the result: the digit 1 dominates. This is real data from real countries. The pattern is already visible.
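
The two conversion steps above can be tried on a few numbers. The helper name first_digit is made up for this sketch, and the abs() call is an extra assumption (it lets the helper cope with negative numbers, which the population list never contains).

```python
def first_digit(n):
    """Return the first digit of a whole number (illustrative helper)."""
    text = str(abs(n))      # abs() drops any minus sign: -250 -> "250"
    return int(text[0])     # first character, converted back to an int

for value in [1412, 99, 7, -250]:
    print(value, "->", first_digit(value))
```

Wrapping the steps in a function means the same logic can be reused on any list of whole numbers.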

Part 3

The Law

Benford's Law predicts exactly how often each first digit should appear:

# Benford's Law: expected percentage for each first digit
benford = {
    1: 30.1, 2: 17.6, 3: 12.5, 4: 9.7, 5: 7.9,
    6: 6.7,  7: 5.8,  8: 5.1,  9: 4.6
}

print("BENFORD'S LAW — Expected First Digits")
print("═" * 40)
print()

for digit in range(1, 10):
    pct = benford[digit]
    bar = "█" * int(pct / 2) + "░" * (16 - int(pct / 2))
    print(f"  {digit}: {bar} {pct:.1f}%")

print()
print("  1 appears 30% of the time!")
print("  9 appears only 5% of the time!")
print()
print("  This pattern shows up in populations,")
print("  prices, street addresses, river lengths,")
print("  financial data, and much more.")

The pattern is dramatic: 1 is more than six times as likely as 9 to be the first digit. That sounds impossible, but it shows up in data set after data set across many different fields.
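
Those percentages are not arbitrary. They come from a short formula: the chance that the first digit is d equals log10(1 + 1/d). The sketch below recomputes the table with Python's math.log10. The function name benford_percent is made up for this example.

```python
import math

def benford_percent(d):
    """Expected percentage of numbers whose first digit is d (1-9)."""
    return math.log10(1 + 1 / d) * 100

# Rebuild the table, rounded to one decimal place
table = {d: round(benford_percent(d), 1) for d in range(1, 10)}
for d in range(1, 10):
    print(f"  {d}: {table[d]:.1f}%")

# The nine exact percentages add up to 100
print("Total:", round(sum(benford_percent(d) for d in range(1, 10)), 1))
```

Computing the values this way avoids typing the table by hand each time.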

But Why?

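Here's the intuition: think about counting upward. After 99 you have to pass through all of the 100s (100-199) before you reach a single number starting with 2. If you stop counting anywhere in that stretch, numbers starting with 1 (1, 10-19, 100-199) are already far ahead. Numbers starting with 9 (9, 90-99, 900-999) only catch up at the very end of each round, just before the next power of 10 hands the lead back to 1. Averaged over all the places you might stop, the digit 1 is in front most of the time. It has more "room" to appear.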
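
Another way to see why 1 gets more room is to watch something grow by a fixed percentage. Going from 100 to 200 means doubling, while going from 900 to 1000 takes only about 11% more, so a growing number spends far longer starting with 1. The sketch below uses a made-up savings balance that starts at 100 and grows 5% per step (both values are arbitrary choices for this example) and counts the first digits along the way.

```python
balance = 100.0                          # hypothetical starting amount
counts = {d: 0 for d in range(1, 10)}
steps = 1000

for _ in range(steps):
    first = int(str(int(balance))[0])    # first digit of the whole-number part
    counts[first] += 1
    balance = balance * 1.05             # grow by 5%

for d in range(1, 10):
    print(f"  {d}: {counts[d] / steps * 100:5.1f}%")
```

No randomness is involved, yet the digit 1 should come out at roughly 30% and the digit 9 at roughly 5%.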

Part 4

Building a Benford Detector

Let's build a function that analyzes any list of numbers and checks if they follow Benford's Law:

def get_first_digits(numbers):
    """Extract the first digit from each number."""
    digits = []
    for n in numbers:
        n = abs(n)  # handle negative numbers
        if n == 0:
            continue
        first = int(str(n)[0])
        digits.append(first)
    return digits

def benford_analysis(numbers, label="Data"):
    """Compare first-digit distribution to Benford's Law."""
    benford = {1: 30.1, 2: 17.6, 3: 12.5, 4: 9.7, 5: 7.9,
               6: 6.7,  7: 5.8,  8: 5.1,  9: 4.6}
    
    digits = get_first_digits(numbers)
    total = len(digits)
    
    counts = {d: 0 for d in range(1, 10)}
    for d in digits:
        counts[d] = counts[d] + 1
    
    print(f"BENFORD ANALYSIS: {label}")
    print(f"({total} numbers analyzed)")
    print("═" * 45)
    print(f"  {'Digit':>5s}  {'Observed':>8s}  {'Expected':>8s}  {'Diff':>6s}")
    print("─" * 45)
    
    total_diff = 0
    for d in range(1, 10):
        observed = counts[d] / total * 100 if total > 0 else 0
        expected = benford[d]
        diff = abs(observed - expected)
        total_diff = total_diff + diff
        
        flag = " ✓" if diff < 5 else " ⚠️" if diff < 10 else " 🚨"
        print(f"  {d:>5d}  {observed:>7.1f}%  {expected:>7.1f}%  {diff:>5.1f}%{flag}")
    
    return total_diff

# Test it with populations
pops = [1412, 1408, 334, 277, 223, 216, 211, 171, 167, 149,
        129, 127, 115, 112, 105, 99, 87, 84, 83, 70,
        68, 67, 66, 60, 59, 55, 52, 47, 46, 45]

diff = benford_analysis(pops, "Country Populations")
print("─" * 45)
print(f"  Total deviation: {diff:.1f}%")
if diff < 40:  # only 30 numbers, so allow some wiggle room
    print("  ✅ VERDICT: Follows Benford's Law")
else:
    print("  🚨 VERDICT: Suspicious!")
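
How large a total deviation counts as "too big" depends on how many numbers you have. The sketch below is an experiment added for illustration, with made-up function names and an arbitrary seed. It draws numbers that follow Benford's Law by construction, using 10 ** random.uniform(0, 6), and measures the average total deviation for small and large samples.

```python
import math
import random

BENFORD = {d: math.log10(1 + 1 / d) * 100 for d in range(1, 10)}

def total_deviation(numbers):
    """Sum of |observed % - expected %| over the digits 1-9."""
    counts = {d: 0 for d in range(1, 10)}
    for n in numbers:
        counts[int(str(n)[0])] += 1
    return sum(abs(counts[d] / len(numbers) * 100 - BENFORD[d])
               for d in range(1, 10))

def average_deviation(sample_size, trials=100):
    """Average deviation of samples that DO follow Benford's Law."""
    total = 0
    for _ in range(trials):
        # Whole numbers spread evenly across six powers of 10
        sample = [int(10 ** random.uniform(0, 6)) for _ in range(sample_size)]
        total += total_deviation(sample)
    return total / trials

random.seed(15)  # arbitrary seed so repeated runs print the same numbers
for size in [30, 300, 3000]:
    print(f"  {size:>5} numbers: average deviation {average_deviation(size):.1f}%")
```

Small samples wander much further from the expected percentages purely by chance, so a cutoff that works for thousands of numbers can be too strict for a list of 30.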

Part 5

Testing Different Data Sets

Let's test Benford's Law on different kinds of numbers. Some should follow the law. Some shouldn't. Can we tell the difference?

import random

def get_first_digits(numbers):
    digits = []
    for n in numbers:
        n = abs(n)
        if n == 0: continue
        digits.append(int(str(n)[0]))
    return digits

def quick_benford(numbers, label):
    benford = {1: 30.1, 2: 17.6, 3: 12.5, 4: 9.7, 5: 7.9,
               6: 6.7, 7: 5.8, 8: 5.1, 9: 4.6}
    
    digits = get_first_digits(numbers)
    total = len(digits)
    counts = {d: 0 for d in range(1, 10)}
    for d in digits:
        counts[d] = counts[d] + 1
    
    total_diff = 0
    print(f"\n{label} ({total} numbers)")
    print("─" * 40)
    for d in range(1, 10):
        obs = counts[d] / total * 100 if total > 0 else 0
        exp = benford[d]
        total_diff = total_diff + abs(obs - exp)
        bar_obs = "█" * int(obs / 3)  # bar for the observed percentage
        print(f"  {d}: {bar_obs:<12s} {obs:5.1f}% (expected {exp:.1f}%)")
    
    verdict = "✅ Follows Benford" if total_diff < 40 else "🚨 SUSPICIOUS"
    print(f"  Deviation: {total_diff:.1f}% → {verdict}")
    return total_diff

# DATA SET 1: Powers of 2 (should follow Benford's!)
powers = [2**i for i in range(1, 80)]
quick_benford(powers, "Powers of 2")

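# DATA SET 2: Simulated house prices (follows Benford's)
# Prices are spread evenly across powers of 10 (10,000 to 1,000,000),
# not evenly along the number line
prices = []
for i in range(200):
    prices.append(int(10 ** random.uniform(4, 6)))
quick_benford(prices, "Simulated House Prices")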

# DATA SET 3: Fibonacci numbers (follows Benford's!)
fib = [1, 1]
for i in range(60):
    fib.append(fib[-1] + fib[-2])
quick_benford(fib, "Fibonacci Numbers")

# DATA SET 4: Made-up "random" data (WON'T follow Benford's)
fake = [random.randint(100, 999) for _ in range(200)]
quick_benford(fake, "Uniformly Random 100-999")

What Follows Benford's Law?

Follows: Populations, financial transactions, street addresses, river lengths, powers of 2, Fibonacci numbers, stock prices, earthquake depths, file sizes on your computer.

Doesn't follow: Numbers assigned by humans (phone numbers, zip codes, ID numbers), numbers from a narrow range (test scores 0-100), truly uniform random numbers.

The key pattern: Benford's Law applies to data that spans many orders of magnitude (ones, tens, hundreds, thousands, etc.) and grows naturally.
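
A quick way to check the "many orders of magnitude" condition before running a Benford test is to compare the largest and smallest values. The helper name magnitude_span is hypothetical, and the test scores are made-up sample values. As a rough rule of thumb assumed for this sketch (not a hard rule), a span below about 2 suggests the data is too narrow for the test to mean much.

```python
import math

def magnitude_span(numbers):
    """How many powers of 10 separate the smallest and largest values."""
    sizes = [abs(n) for n in numbers if n != 0]
    return math.log10(max(sizes) / min(sizes))

test_scores = [52, 67, 71, 88, 93, 100]          # made-up, narrow range
powers_of_2 = [2 ** i for i in range(1, 80)]     # huge range

print(f"Test scores: {magnitude_span(test_scores):.1f} orders of magnitude")
print(f"Powers of 2: {magnitude_span(powers_of_2):.1f} orders of magnitude")
```

The test scores span well under one order of magnitude, while the powers of 2 span more than twenty.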

Part 6

Catching a Liar

Here's where it gets exciting. When people make up numbers — to fake financial records, cheat on taxes, or fabricate research — they usually pick digits roughly equally. They don't know about Benford's Law. And that's how they get caught.

Let's simulate this. We'll create "real" data and "fake" data, then see if our detector can tell the difference:

import random

def get_first_digit(n):
    return int(str(abs(n))[0])

def benford_score(numbers):
    """Return how well data matches Benford's Law (lower = better match)."""
    benford = {1: 30.1, 2: 17.6, 3: 12.5, 4: 9.7, 5: 7.9,
               6: 6.7, 7: 5.8, 8: 5.1, 9: 4.6}
    counts = {d: 0 for d in range(1, 10)}
    for n in numbers:
        if n == 0: continue
        counts[get_first_digit(n)] = counts[get_first_digit(n)] + 1
    total = sum(counts.values())
    deviation = 0
    for d in range(1, 10):
        observed = counts[d] / total * 100 if total > 0 else 0
        deviation = deviation + abs(observed - benford[d])
    return deviation

# "REAL" company expenses (simulated to span many magnitudes)
real_expenses = []
for i in range(300):
    # 10 ** uniform(1, 5) spreads values evenly across powers of 10,
    # from 10 up to 100,000, which is how Benford-style data behaves
    expense = int(10 ** random.uniform(1, 5))
    real_expenses.append(expense)

# FAKE expenses (someone made them up — too uniform)
fake_expenses = []
for i in range(300):
    # Faker picks "realistic looking" numbers but distributes digits evenly
    fake_expenses.append(random.randint(100, 9999))

print("🔍 FRAUD DETECTION CHALLENGE")
print("═" * 40)
print()
print("Two companies submitted expense reports.")
print("One is real. One is fabricated.")
print("Can Benford's Law tell them apart?")
print()

score_a = benford_score(real_expenses)
score_b = benford_score(fake_expenses)

# Randomize which is shown first
if random.choice([True, False]):
    real_expenses, fake_expenses = fake_expenses, real_expenses
    score_a, score_b = score_b, score_a

print(f"COMPANY A — Deviation from Benford: {score_a:.1f}%")
bar_a = "█" * int(score_a / 3)
print(f"  [{bar_a}]")
if score_a < 40:
    print(f"  ✅ Matches Benford's Law — looks legitimate")
else:
    print(f"  🚨 Doesn't match — SUSPICIOUS")

print()

print(f"COMPANY B — Deviation from Benford: {score_b:.1f}%")
bar_b = "█" * int(score_b / 3)
print(f"  [{bar_b}]")
if score_b < 40:
    print(f"  ✅ Matches Benford's Law — looks legitimate")
else:
    print(f"  🚨 Doesn't match — SUSPICIOUS")

print()
print("═" * 40)
print("The suspicious one has digits that are")
print("too evenly distributed — a sign of")
print("human-made numbers, not natural ones.")

Run it several times. The detector should flag the fabricated data nearly every time. Forensic accountants and tax auditors use the same idea to decide which financial records deserve a closer look.

Part 7

Seeing the Difference

Let's build a side-by-side visual that shows exactly where fake data deviates from Benford's Law:

import random

benford = {1: 30.1, 2: 17.6, 3: 12.5, 4: 9.7, 5: 7.9,
           6: 6.7, 7: 5.8, 8: 5.1, 9: 4.6}

def count_first_digits(numbers):
    counts = {d: 0 for d in range(1, 10)}
    for n in numbers:
        if n == 0: continue
        counts[int(str(abs(n))[0])] += 1
    total = sum(counts.values())
    return {d: counts[d] / total * 100 for d in range(1, 10)}

# Generate data
real_data = []
for _ in range(500):
    # Spread values evenly across powers of 10 (1 up to 1,000,000)
    real_data.append(int(10 ** random.uniform(0, 6)))

fake_data = [random.randint(10, 99999) for _ in range(500)]

real_dist = count_first_digits(real_data)
fake_dist = count_first_digits(fake_data)

print("SIDE-BY-SIDE: Real Data vs Fake Data vs Benford")
print("═" * 55)
print(f"  {'Digit':>5s}  {'Benford':>8s}  {'Real':>8s}  {'Fake':>8s}")
print("─" * 55)

for d in range(1, 10):
    exp = benford[d]
    real = real_dist[d]
    fake = fake_dist[d]
    
    # Flags
    real_ok = "✓" if abs(real - exp) < 5 else "⚠"
    fake_ok = "✓" if abs(fake - exp) < 5 else "🚨"
    
    print(f"  {d:>5d}  {exp:>7.1f}%  {real:>6.1f}% {real_ok}  {fake:>6.1f}% {fake_ok}")

print("─" * 55)
print()
print("  Notice: Real data's distribution matches Benford's.")
print("  Fake data has digits that are too evenly spread.")
print()
print("  A fraudster who doesn't know Benford's Law")
print("  will make numbers that look 'random' to them")
print("  but scream 'FAKE!' to a mathematician.")

Part 8

The Complete Data Forensics Tool

Here's everything from this lesson combined into one interactive forensics tool. Choose a dataset — real city populations, simulated transactions, a fake expense report someone clearly made up, or your own numbers — and the tool applies Benford's Law analysis and gives you a verdict.

Option 5 is especially revealing: it generates a clean dataset and a fraudulent one side by side so you can see exactly what the difference looks like in the numbers.

import random

# === BENFORD'S LAW TOOLKIT ===

BENFORD = {1: 30.1, 2: 17.6, 3: 12.5, 4: 9.7, 5: 7.9,
           6: 6.7, 7: 5.8, 8: 5.1, 9: 4.6}

def first_digit(n):
    return int(str(abs(int(n)))[0])

def analyze(numbers, label="Data"):
    counts = {d: 0 for d in range(1, 10)}
    clean = [n for n in numbers if n != 0]
    for n in clean:
        counts[first_digit(n)] += 1
    total = len(clean)
    
    print(f"\n🔬 ANALYSIS: {label}")
    print(f"   {total} numbers analyzed")
    print("═" * 50)
    
    total_dev = 0
    for d in range(1, 10):
        obs = counts[d] / total * 100 if total > 0 else 0
        exp = BENFORD[d]
        diff = abs(obs - exp)
        total_dev += diff
        
        # Visual bar for the observed percentage
        obs_bar = "█" * int(obs / 2.5)
        
        flag = "  " if diff < 3 else "⚠ " if diff < 7 else "🚨"
        print(f"  {d}: {obs_bar:<13s} {obs:5.1f}% (expect {exp:4.1f}%) {flag}")
    
    print("═" * 50)
    
    # Verdict
    if total_dev < 25:
        verdict = "✅ CLEAN — Closely follows Benford's Law"
    elif total_dev < 45:
        verdict = "⚠️  UNCERTAIN — Some deviation detected"
    else:
        verdict = "🚨 SUSPICIOUS — Significant deviation from Benford's Law"
    
    print(f"  Total deviation: {total_dev:.1f}%")
    print(f"  {verdict}")
    return total_dev

# === MAIN PROGRAM ===

def header(text, char="═", width=50):
    print(char * width)
    print(f"  {text}")
    print(char * width)

header("🕵️ DATA FORENSICS LAB")
print()
print("Choose a dataset to investigate:")
print("  1 - City populations (real)")
print("  2 - Financial transactions (simulated)")
print("  3 - Suspicious expense report (fake!)")
print("  4 - Enter your own numbers")
print("  5 - Generate real vs fake comparison")
print()
choice = input("Your choice (1-5): ")

if choice == "1":
    data = [8336, 3979, 2746, 2693, 2320, 1603, 1386, 1304, 998, 961,
            936, 918, 906, 876, 873, 856, 839, 743, 689, 675,
            641, 632, 617, 603, 585, 574, 567, 553, 535, 515,
            504, 486, 467, 451, 429, 413, 398, 377, 362, 345]
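    # Note: a "top 40" list is cut off at the bottom, so it covers a
    # narrow range. Expect a poor match here (see Part 10).
    analyze(data, "US City Populations (thousands)")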

elif choice == "2":
    # Simulated: values spread evenly across powers of 10 (10 to 100,000)
    data = []
    for _ in range(400):
        data.append(int(10 ** random.uniform(1, 5)))
    analyze(data, "Financial Transactions (simulated)")

elif choice == "3":
    data = [random.randint(200, 8000) for _ in range(400)]
    analyze(data, "Suspicious Expense Report")

elif choice == "4":
    print("Enter numbers separated by spaces:")
    raw = input("> ")
    nums = []
    for item in raw.split():
        try:
            nums.append(int(item))
        except ValueError:
            pass  # skip anything that isn't a whole number
    if len(nums) > 0:
        analyze(nums, "Your Numbers")
    else:
        print("No valid numbers found!")

elif choice == "5":
    print()
    real = []
    for _ in range(400):
        # Spread values evenly across powers of 10 (10 up to 1,000,000)
        real.append(int(10 ** random.uniform(1, 6)))
    
    fake = [random.randint(100, 99999) for _ in range(400)]
    
    score_real = analyze(real, "Company A (Natural Data)")
    score_fake = analyze(fake, "Company B (Fabricated Data)")
    
    print()
    header("COMPARISON", char="─")
    print(f"  Company A deviation: {score_real:.1f}%")
    print(f"  Company B deviation: {score_fake:.1f}%")
    if score_real < score_fake:
        print("  → Company B's data looks fabricated!")
    else:
        print("  → Company A's data looks fabricated!")

print()
header("Benford's Law: the math that catches liars", char="─")

Try all five options. Option 4 lets you type your own numbers — try entering random numbers you make up and see if the detector catches you!

Part 9

Real-World Fraud Detection

Benford's Law isn't just a math curiosity. It's used in real investigations:

Real Cases

Tax fraud: Tax agencies can use Benford's Law as a screening tool. If the income, expenses, or deductions on a return don't follow the expected distribution, that can be one reason to take a closer look.

Election fraud: Researchers have analyzed vote counts from elections worldwide. When vote totals don't follow Benford's Law, it can indicate manipulation — though this is just one piece of evidence, not proof by itself.

Accounting fraud: After Enron collapsed in one of the biggest corporate frauds in history, analysts reported that its published figures deviated from Benford's Law. Forensic accountants now commonly use first-digit tests as one of their checks.

Scientific fraud: Researchers who fabricate experimental data often get caught because their numbers are "too random" — they lack the natural patterns Benford's Law predicts.

Part 10

When Benford's Law Doesn't Apply

Like any tool, Benford's Law has limits. It's important to know when it works and when it doesn't:

import random

def quick_check(numbers, label):
    counts = {d: 0 for d in range(1, 10)}
    clean = [n for n in numbers if n > 0]
    for n in clean:
        counts[int(str(n)[0])] += 1
    total = len(clean)
    print(f"\n{label}:")
    for d in range(1, 10):
        pct = counts[d] / total * 100 if total > 0 else 0
        bar = "█" * int(pct / 3)
        print(f"  {d}: {bar:<12s} {pct:.0f}%")

# These DON'T follow Benford's — and that's OK!

# Test scores (limited range: 50-100)
scores = [random.randint(50, 100) for _ in range(200)]
quick_check(scores, "Test Scores (50-100) — NOT expected to follow")

# Phone numbers (assigned by humans)
phones = [random.randint(2000000000, 9999999999) for _ in range(200)]
quick_check(phones, "Phone Numbers — NOT expected to follow")

print()
print("═" * 40)
print("Benford's Law only works for data that:")
print("  • Spans multiple orders of magnitude")
print("  • Grows naturally (not assigned)")
print("  • Isn't limited to a narrow range")
print("Using it on the wrong data = false alarms!")

Critical Thinking

Benford's Law is a powerful tool, but it's not proof of fraud by itself. Data that deviates from Benford's could be fake — or it could just be the wrong kind of data.

A good data detective doesn't jump to conclusions. They use Benford's Law as a screening tool — a reason to investigate further, not a final verdict. This is the same caution scientists use: one piece of evidence is a clue, not a conclusion.

Part 11

What You Just Did

You learned a mathematical law that most adults don't know about — and you built a tool that applies it. You can now analyze any set of numbers and check if they follow the natural pattern of real-world data.

This lesson combines everything you've learned: lists and dictionaries to store data, loops to process it, functions to organize the analysis, and statistical thinking (from Lesson 14) to interpret the results.

The Big Idea

The real world has hidden patterns. Numbers that look random actually follow mathematical laws. When those patterns are broken — when someone makes up data — the math can catch them. The ability to spot what's real and what's fake is one of the most valuable skills you can have.

Thinking Questions

Why don't human-invented numbers follow Benford's Law? What does this tell us about how we think about randomness?
If someone knew about Benford's Law, could they fake data that passes the test? Would it be easy or hard?
Why is it important to know when Benford's Law does NOT apply? What could happen if you used it on the wrong kind of data?

Challenges

Level Up

Challenge 1 · Test Your Friends

Ask 5 friends to each write down 20 "random" numbers between 1 and 10,000. Run each friend's numbers through the Benford analyzer. Do human-generated numbers follow the law? Compare their results to the computer's random numbers.

Challenge 2 · Smart Faker

Write a program that generates fake data that actually follows Benford's Law. Hint: instead of picking numbers uniformly, weight your random choices so digit 1 appears 30% of the time, digit 2 appears 17.6%, etc. Then test it with the analyzer — can you fool your own detector?

Challenge 3 · Second Digit Analysis

Benford's Law also predicts the distribution of the second digit (it's more even than the first, but still not uniform — 0 is most common at 12%). Build an analyzer for second digits and test it on the same data sets. Does it catch the same fake data?
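
As a starting point for this challenge, the expected second-digit percentages can be computed with the same logarithm idea: for a second digit d, add up log10(1 + 1/(10*k + d)) over every possible first digit k from 1 to 9. The sketch below only builds the expected table (the name second_digit_percent is made up here). The analyzer itself is still yours to write.

```python
import math

def second_digit_percent(d):
    """Expected percentage of numbers whose second digit is d (0-9)."""
    total = 0
    for k in range(1, 10):               # k is the first digit
        total += math.log10(1 + 1 / (10 * k + d))
    return total * 100

for d in range(0, 10):
    print(f"  {d}: {second_digit_percent(d):.1f}%")
```

Notice that the second digit can be 0, so this table has ten entries instead of nine.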

Summary

What You Learned

Thinking Skill

Data Forensics — using mathematical patterns to verify if data is genuine. Ask: "Does this data follow the patterns we'd expect from real-world numbers?"

Key Concept

Benford's Law: In naturally occurring data, the first digit is 1 about 30% of the time, 2 about 17.6%, declining to 9 at about 4.6%.

Data that deviates from this pattern may be fabricated — but always investigate further before concluding fraud.

Python Concepts

Extracting digits — int(str(number)[0])

Distribution analysis — counting occurrences, calculating percentages

Deviation scoring — measuring how far observed data is from expected

Data forensics pipeline — extract → count → compare → verdict

Combining all skills — lists, dicts, loops, functions, f-strings, conditionals
