Lesson 15 · Data Forensics

How to Catch a Liar With Math

What You'll Learn

A bizarre mathematical law that lets you detect fake data — used by real fraud investigators, the IRS, and forensic accountants — and how to build your own lie detector in Python.

A Strange Question

Quick — think of a number. Any number. The population of a city. The price of a house. The number of followers someone has. The distance to a star.

Now look at the first digit of that number. Is it a 1? A 5? A 9?

You'd think every digit (1 through 9) would show up equally often as the first digit — about 11% each. That seems logical, right?

It's completely wrong.

In many real-world data sets, the digit 1 appears as the first digit about 30% of the time. The digit 2 appears about 18% of the time. The digit 9? Less than 5%.

This isn't a coincidence. It's a mathematical law discovered over 100 years ago, and it's so reliable that investigators use it to catch people who fake their numbers. It's called Benford's Law.

Seeing It With Real Data

Let's look at the first digits of real-world numbers. We'll start with country populations:

Python · First Digits of Populations
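
A minimal sketch of this step, assuming a small hand-typed list of approximate country populations in millions. The figures are rounded, recalled values for illustration only (check a current source before relying on them), and the names populations and first_digit_counts are assumptions made for this example.

```python
# Approximate populations in millions (rounded, illustrative figures only).
populations = [
    1412,  # China
    1408,  # India
    332,   # United States
    274,   # Indonesia
    231,   # Pakistan
    214,   # Brazil
    213,   # Nigeria
    169,   # Bangladesh
    143,   # Russia
    127,   # Mexico
    126,   # Japan
    120,   # Ethiopia
    114,   # Philippines
    109,   # Egypt
    97,    # Vietnam
]

# Count how often each digit 1-9 appears as the first digit.
first_digit_counts = {d: 0 for d in range(1, 10)}
for pop in populations:
    first_digit = int(str(pop)[0])   # 1412 -> "1412" -> "1" -> 1
    first_digit_counts[first_digit] += 1

# Print a small text bar chart, one row per digit.
for digit, count in first_digit_counts.items():
    print(f"{digit}: {'#' * count} ({count})")
```

Fifteen numbers is a tiny sample, so treat the bar chart as a first look rather than a test of the law.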

What Just Happened

str(pop)[0] converts the number to text, then grabs the first character. So 1412 → "1412" → "1".

int(str(pop)[0]) converts it back to a number so we can use it as a dictionary key.

Look at the result: the digit 1 dominates. This is real data from real countries. The pattern is already visible.

The Law

Benford's Law predicts exactly how often each first digit should appear:

Python · Benford's Expected Distribution
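
The expected percentages come from the standard Benford formula: the chance that the first digit is d equals log10(1 + 1/d). A short sketch that computes and prints them (the name benford_expected is just a label chosen for this example):

```python
import math

# Benford's Law: P(d) = log10(1 + 1/d) for first digits d = 1..9.
benford_expected = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

for digit, prob in benford_expected.items():
    bar = "#" * round(prob * 100 / 2)          # one '#' per 2 percentage points
    print(f"{digit}: {prob * 100:5.1f}%  {bar}")
```

The nine probabilities add up to exactly 1, because every number has to start with one of the nine digits.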

The pattern is dramatic: 1 is more than six times as likely as 9 to be the first digit. This seems impossible, but it shows up in data set after data set across many different fields.

But Why?

Here's the intuition: think about counting from 1 to 1000. You start at 1 and have to pass through all the 100s (100-199) before reaching the 200s. The numbers starting with 1 come first in every range (1, 10-19, 100-199, 1000-1999), so anything that is still growing reaches them before any other digit. Numbers starting with 9 come last (9, 90-99, 900-999), right before the next power of 10 puts a 1 back in front. The digit 1 has more "room" to appear.
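
One way to check this intuition is to watch a quantity grow and count how long it spends behind each first digit. The sketch below is a hypothetical example, not part of the original lesson code: it assumes a value that starts at 1 and grows by 1% per step until it reaches one million.

```python
# Hypothetical growth experiment: start at 1, grow 1% per step, stop at a million.
value = 1.0
steps_per_digit = {d: 0 for d in range(1, 10)}

while value < 1_000_000:
    first_digit = int(str(int(value))[0])   # drop the decimals, take the first character
    steps_per_digit[first_digit] += 1
    value *= 1.01

total_steps = sum(steps_per_digit.values())
share = {d: 100 * n / total_steps for d, n in steps_per_digit.items()}

for digit, pct in share.items():
    print(f"first digit {digit}: {pct:4.1f}% of the steps")
```

Getting from 1 to 2 means doubling, while getting from 9 to 10 only takes an 11% increase, so the value lingers behind a leading 1 far longer than behind a leading 9.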

Building a Benford Detector

Let's build a function that analyzes any list of numbers and checks if they follow Benford's Law:

Python · Benford Analyzer
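
A minimal sketch of such an analyzer. The function names (first_digit, analyze_benford), the scoring method (mean absolute deviation in percentage points), and the cut-off of 3.0 are all assumptions made for this example, not an official standard.

```python
import math

# Expected first-digit percentages from Benford's Law.
BENFORD = {d: 100 * math.log10(1 + 1 / d) for d in range(1, 10)}
THRESHOLD = 3.0  # assumed cut-off in percentage points; tune it for your own data

def first_digit(number):
    """Return the first non-zero digit of a number, or None if there is none."""
    for ch in str(abs(number)):
        if ch in "123456789":
            return int(ch)
    return None

def analyze_benford(numbers):
    """Return (observed percentages by digit, mean absolute deviation in points)."""
    counts = {d: 0 for d in range(1, 10)}
    for n in numbers:
        d = first_digit(n)
        if d is not None:
            counts[d] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one non-zero number")
    observed = {d: 100 * c / total for d, c in counts.items()}
    deviation = sum(abs(observed[d] - BENFORD[d]) for d in BENFORD) / 9
    return observed, deviation

# Try it on the first 200 powers of 2.
powers_of_two = [2 ** k for k in range(1, 201)]
observed, deviation = analyze_benford(powers_of_two)
follows_benford = deviation < THRESHOLD

for d in range(1, 10):
    print(f"{d}: observed {observed[d]:5.1f}%   expected {BENFORD[d]:5.1f}%")
print(f"deviation score: {deviation:.2f} -> follows Benford: {follows_benford}")
```

A small score means the observed first digits sit close to the expected ones. Small data sets wobble a lot, so the score is only meaningful with a few hundred numbers or more.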

Testing Different Data Sets

Let's test Benford's Law on different kinds of numbers. Some should follow the law. Some shouldn't. Can we tell the difference?

Python · Multiple Data Sets
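
A sketch of this comparison, assuming four example data sets chosen for illustration: two mathematical sequences that should follow the law, and two simulated sets (uniform random numbers and test scores) that should not. The helper benford_deviation and the random seed are assumptions of this example.

```python
import math
import random

BENFORD = [100 * math.log10(1 + 1 / d) for d in range(1, 10)]  # percent, digits 1-9

def benford_deviation(numbers):
    """Mean absolute gap (percentage points) between observed and expected first digits."""
    counts = [0] * 9
    for n in numbers:
        counts[int(str(n)[0]) - 1] += 1        # assumes positive integers
    total = len(numbers)
    return sum(abs(100 * c / total - e) for c, e in zip(counts, BENFORD)) / 9

random.seed(15)  # fixed seed so repeated runs print the same numbers

fibonacci = [1, 1]
while len(fibonacci) < 200:
    fibonacci.append(fibonacci[-1] + fibonacci[-2])

datasets = {
    "Powers of 2": [2 ** k for k in range(1, 201)],
    "Fibonacci numbers": fibonacci,
    "Uniform random 100-999": [random.randint(100, 999) for _ in range(2000)],
    "Test scores 50-100 (simulated)": [random.randint(50, 100) for _ in range(2000)],
}

results = {name: benford_deviation(data) for name, data in datasets.items()}
for name, dev in results.items():
    print(f"{name:32s} deviation = {dev:5.2f} points")
```

The two sequences span dozens of orders of magnitude and score low. The two narrow-range sets score high even though nobody faked them.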

What Follows Benford's Law?

Follows (at least approximately): Populations, financial transactions, street addresses, river lengths, powers of 2, Fibonacci numbers, stock prices, file sizes on your computer.

Doesn't follow: Numbers assigned by humans (phone numbers, zip codes, ID numbers), numbers from a narrow range (test scores 0-100), truly uniform random numbers.

The key pattern: Benford's Law applies to data that spans many orders of magnitude (ones, tens, hundreds, thousands, etc.) and grows naturally.

Catching a Liar

Here's where it gets exciting. When people make up numbers — to fake financial records, cheat on taxes, or fabricate research — they usually pick digits roughly equally. They don't know about Benford's Law. And that's how they get caught.

Let's simulate this. We'll create "real" data and "fake" data, then see if our detector can tell the difference:

Python · Real vs. Fake
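
A sketch of the simulation under stated assumptions: "real" data is imitated by amounts spread evenly across five orders of magnitude, and "fake" data by a faker who picks every leading digit about equally often. The function names, the sample size of 2000, and the threshold of 3.0 are all choices made for this example.

```python
import math
import random

THRESHOLD = 3.0  # assumed cut-off in percentage points

def deviation_score(numbers):
    """Mean absolute gap, in percentage points, from Benford's expected first digits."""
    counts = {d: 0 for d in range(1, 10)}
    for n in numbers:
        counts[int(str(n)[0])] += 1            # assumes positive integers
    total = len(numbers)
    return sum(abs(100 * counts[d] / total - 100 * math.log10(1 + 1 / d))
               for d in range(1, 10)) / 9

def simulate_real(n):
    # Stand-in for real data: values from 1 up to 100,000, spread evenly on a log scale.
    return [int(10 ** random.uniform(0, 5)) for _ in range(n)]

def simulate_fake(n):
    # A faker who types "random-looking" amounts: every first digit is equally likely.
    return [random.randint(100, 99_999) for _ in range(n)]

random.seed(2024)  # delete this line to get different numbers on every run

real_score = deviation_score(simulate_real(2000))
fake_score = deviation_score(simulate_fake(2000))

for label, score in [("real", real_score), ("fake", fake_score)]:
    verdict = "SUSPICIOUS" if score > THRESHOLD else "looks natural"
    print(f"{label:5s} deviation = {score:5.2f} -> {verdict}")
```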

Run it several times. The detector consistently flags the fake data. Auditors and fraud investigators use this technique as a first screen to flag suspicious tax returns and financial reports for further investigation.

Seeing the Difference

Let's build a side-by-side visual that shows exactly where fake data deviates from Benford's Law:

Python · Visual Comparison
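
A text-only sketch of such a comparison, assuming simulated fake data with evenly spread first digits. The helper names digit_percentages and comparison_rows, and the scale of one bar character per 2 percentage points, are choices made for this example.

```python
import math
import random

def digit_percentages(numbers):
    """Percentage of numbers starting with each digit 1-9 (positive integers assumed)."""
    counts = {d: 0 for d in range(1, 10)}
    for n in numbers:
        counts[int(str(n)[0])] += 1
    return {d: 100 * c / len(numbers) for d, c in counts.items()}

def comparison_rows(numbers, scale=2):
    """Build one text row per digit: expected bar, observed bar, and the gap."""
    observed = digit_percentages(numbers)
    rows = []
    for d in range(1, 10):
        expected = 100 * math.log10(1 + 1 / d)
        gap = observed[d] - expected
        rows.append(
            f"{d} | expected {'#' * round(expected / scale):<16}"
            f"| observed {'*' * round(observed[d] / scale):<16}| {gap:+6.1f}"
        )
    return rows

random.seed(83)  # fixed seed so the picture is the same on every run
fake_data = [random.randint(100, 99_999) for _ in range(2000)]

rows = comparison_rows(fake_data)
print("\n".join(rows))
```

In the output, the '#' bars shrink from digit 1 to digit 9 while the '*' bars stay about the same length. The last column shows the gap in percentage points: strongly negative for digit 1, positive for the high digits.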

The Complete Data Forensics Tool

Here's everything from this lesson combined into one interactive forensics tool. Choose a dataset — real city populations, simulated transactions, a fake expense report someone clearly made up, or your own numbers — and the tool applies Benford's Law analysis and gives you a verdict.

Option 5 is especially revealing: it generates a clean dataset and a fraudulent one side by side so you can see exactly what the difference looks like in the numbers.

Python · Data Forensics Tool
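
A compact sketch of how such a tool could be organized, not a reproduction of the lesson's original tool. Everything here is an assumption made for illustration: the function names, the minimum of 30 numbers, the threshold of 3.0, and the simulated data. No real city populations are bundled, so options 1 and 4 both analyze a list that you pass in.

```python
import math
import random

EXPECTED = {d: 100 * math.log10(1 + 1 / d) for d in range(1, 10)}
THRESHOLD = 3.0   # assumed cut-off in percentage points, not an official standard
MIN_COUNT = 30    # assumed minimum sample size before giving a verdict

def leading_digit(n):
    """First non-zero digit of a number, or None if it has none."""
    for ch in str(abs(n)):
        if ch in "123456789":
            return int(ch)
    return None

def verdict(numbers):
    """Return (deviation score, verdict text) for a list of numbers."""
    digits = [d for d in map(leading_digit, numbers) if d is not None]
    if len(digits) < MIN_COUNT:
        return None, "NOT ENOUGH DATA"
    score = sum(abs(100 * digits.count(d) / len(digits) - EXPECTED[d])
                for d in EXPECTED) / 9
    return score, ("SUSPICIOUS" if score > THRESHOLD else "CONSISTENT WITH BENFORD")

def simulated_transactions(n=1000):
    # Amounts from 1.00 up to about 10,000, spread evenly on a log scale.
    return [round(10 ** random.uniform(0, 4), 2) for _ in range(n)]

def made_up_expenses(n=1000):
    # A faker who likes three-digit amounts and picks first digits evenly.
    return [random.randint(100, 999) for _ in range(n)]

def run_option(choice, numbers=None):
    """Options 1 and 4 analyze a list you supply; 2, 3 and 5 generate their own data."""
    if choice in ("1", "4"):
        return {"your data": verdict(numbers or [])}
    if choice == "2":
        return {"simulated transactions": verdict(simulated_transactions())}
    if choice == "3":
        return {"made-up expense report": verdict(made_up_expenses())}
    if choice == "5":
        return {"clean": verdict(simulated_transactions()),
                "fraudulent": verdict(made_up_expenses())}
    raise ValueError("choose an option from 1 to 5")

# Example: the side-by-side comparison from option 5.
for name, (score, text) in run_option("5").items():
    print(f"{name:12s} score = {score:5.2f} -> {text}")
```

To make it interactive, read the choice with input() and pass it to run_option. For option 4, split the typed text into numbers first.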

Try all five options. Option 4 lets you type your own numbers — try entering random numbers you make up and see if the detector catches you!

Real-World Fraud Detection

Benford's Law isn't just a math curiosity. It's used in real investigations:

Real Cases

Tax fraud: The IRS uses Benford's Law as a screening tool. If someone's reported income, expenses, or deductions stray far from the expected distribution, the return can be flagged for a closer look.

Election fraud: Researchers have analyzed vote counts from elections worldwide. When vote totals don't follow Benford's Law, it can indicate manipulation — though this is just one piece of evidence, not proof by itself.

Accounting fraud: Enron — one of the biggest corporate frauds in history — had financial statements that deviated significantly from Benford's Law. Forensic accountants now routinely use this technique.

Scientific fraud: Researchers who fabricate experimental data often get caught because their numbers are "too random" — they lack the natural patterns Benford's Law predicts.

When Benford's Law Doesn't Apply

Like any tool, Benford's Law has limits. It's important to know when it works and when it doesn't:

Python · Where It Fails
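
A sketch of three cases where honest data fails the test. All three data sets are simulated stand-ins chosen for this example (adult heights, five-digit ID codes, dice rolls), and deviation_score is a helper written for this sketch.

```python
import math
import random

def deviation_score(numbers):
    """Mean absolute gap, in percentage points, from Benford's expected first digits."""
    counts = {d: 0 for d in range(1, 10)}
    for n in numbers:
        counts[int(str(n)[0])] += 1            # assumes positive integers
    total = len(numbers)
    return sum(abs(100 * counts[d] / total - 100 * math.log10(1 + 1 / d))
               for d in range(1, 10)) / 9

random.seed(113)  # fixed seed so repeated runs print the same numbers

# None of this data is "faked", but none of it suits Benford's Law either.
honest_but_unsuitable = {
    "Adult heights in cm (simulated)": [round(random.gauss(170, 8)) for _ in range(1000)],
    "Five-digit ID codes (simulated)": [random.randint(10_000, 99_999) for _ in range(1000)],
    "Dice rolls": [random.randint(1, 6) for _ in range(1000)],
}

scores = {name: deviation_score(data) for name, data in honest_but_unsuitable.items()}
for name, score in scores.items():
    print(f"{name:34s} deviation = {score:5.2f} points")
```

Nearly every height starts with 1, ID codes start with every digit equally often, and dice never show a 7, 8, or 9. A detector that only looks at the score would call all three suspicious, which is why the type of data has to be checked first.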

Critical Thinking

Benford's Law is a powerful tool, but it's not proof of fraud by itself. Data that deviates from Benford's Law could be fake, or it could just be the wrong kind of data.

A good data detective doesn't jump to conclusions. They use Benford's Law as a screening tool — a reason to investigate further, not a final verdict. This is the same caution scientists use: one piece of evidence is a clue, not a conclusion.

What You Just Did

You learned a mathematical law that most adults don't know about — and you built a tool that applies it. You can now analyze any set of numbers and check if they follow the natural pattern of real-world data.

This lesson combines everything you've learned: lists and dictionaries to store data, loops to process it, functions to organize the analysis, and statistical thinking (from Lesson 14) to interpret the results.

The Big Idea

The real world has hidden patterns. Numbers that look random actually follow mathematical laws. When those patterns are broken — when someone makes up data — the math can catch them. The ability to spot what's real and what's fake is one of the most valuable skills you can have.

Thinking Questions

  1. Why don't human-invented numbers follow Benford's Law? What does this tell us about how we think about randomness?
  2. If someone knew about Benford's Law, could they fake data that passes the test? Would it be easy or hard?
  3. Why is it important to know when Benford's Law does NOT apply? What could happen if you used it on the wrong kind of data?

Level Up

Challenge 1 · Test Your Friends

Ask 5 friends to each write down 20 "random" numbers between 1 and 10,000. Run each friend's numbers through the Benford analyzer. Do human-generated numbers follow the law? Compare their results to the computer's random numbers.

Challenge 2 · Smart Faker

Write a program that generates fake data that actually follows Benford's Law. Hint: instead of picking numbers uniformly, weight your random choices so digit 1 appears 30% of the time, digit 2 appears 17.6%, etc. Then test it with the analyzer — can you fool your own detector?

Challenge 3 · Second Digit Analysis

Benford's Law also predicts the distribution of the second digit (it's more even than the first, but still not uniform — 0 is most common at 12%). Build an analyzer for second digits and test it on the same data sets. Does it catch the same fake data?

What You Learned

Thinking Skill

Data Forensics — using mathematical patterns to verify if data is genuine. Ask: "Does this data follow the patterns we'd expect from real-world numbers?"

Key Concept

Benford's Law: In naturally occurring data, the first digit is 1 about 30% of the time, 2 about 17.6%, declining to 9 at about 4.6%.

Data that deviates from this pattern may be fabricated — but always investigate further before concluding fraud.

Python Concepts

Extracting digits — int(str(number)[0])

Distribution analysis — counting occurrences, calculating percentages

Deviation scoring — measuring how far observed data is from expected

Data forensics pipeline — extract → count → compare → verdict

Combining all skills — lists, dicts, loops, functions, f-strings, conditionals