We're supplied with data every day, but how do we know whether it's accurate, adjusted, inaccurate, or just made up?
Imagine you're the person that your family and friends turn to whenever they encounter figures, because "you know about this stuff". How do you help them? What if your manager asks you to check whether some data is credible or not? What about the figures the media publish? Is there a way to tell who to trust?
Thankfully there are plenty of ways to get an idea of how credible data is, but one that's fairly easy to start with is one that was discovered in the late 19th century, and formalized in the 20th: Benford's Law.
What’s Benford’s Law?
If you make a list of random numbers, it seems reasonable to assume that the first digits of each number are equally likely to be any of the digits 1-9. After all, that’s pretty much the definition of “random”, and if you use a computer to generate a list, that’s exactly what you’ll get.
There’s a problem with this though: real-world numbers might look random, but they’re not really that random at all. Well, they are random, but they follow patterns. If you analyze a large number of datasets, what you’ll see is that numbers starting with 1 are the most common, followed by numbers starting with 2, then 3, and so on to 9 being the least likely. The distribution looks a lot like this:
That’s Benford’s Law.
Who Discovered Benford’s Law?
This sounds like a silly question, but Benford’s Law was discovered, of course, by Simon Newcomb in 1881. He noticed that in the logarithm tables he and his colleagues were using, the pages at the start of the books, for numbers starting with 1, were much more worn than the other pages. Professor Newcomb published a paper on this effect, and included details of the distribution of the second digits, which gets much more complicated. He even proposed a law that the probability of the first digit of a number being N is:
log10(N + 1) - log10(N)
which simplifies to:
log10(1 + 1/N)
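If you'd like to see the numbers behind that formula, here is a minimal Python sketch that tabulates the expected first-digit probabilities in base 10. The function name `benford_probability` is purely my own illustrative choice; it doesn't come from Newcomb's or Benford's papers.

```python
import math


def benford_probability(digit: int) -> float:
    """Expected probability that a number's first digit is `digit` (base 10)."""
    if not 1 <= digit <= 9:
        raise ValueError("first digit must be between 1 and 9")
    return math.log10(1 + 1 / digit)


# Probability for each possible first digit, 1 to 9
probabilities = {d: benford_probability(d) for d in range(1, 10)}

for d, p in probabilities.items():
    print(f"{d}: {p:.1%}")
```

The nine values add up to 1, because the sum of log10(N + 1) - log10(N) for N from 1 to 9 collapses to log10(10) - log10(1). Numbers starting with 1 come out at roughly 30%, and numbers starting with 9 at under 5%.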
Frank Benford rediscovered the law in 1938, tested it on a huge number of different datasets, and published his paper “The Law Of Anomalous Numbers”, and that’s when it was named after him. The law is also known sometimes as the Newcomb-Benford law, so Professor Newcomb isn’t entirely forgotten.
Just out of interest, the law shown above works for any number base, so for example, if you’re using base 16, the probability p of the first digit being N (anything from 1 to F) is:
p = log16(1 + 1/N)
And don’t forget that if you want to convert logs between different bases, where x is the number you want to calculate the log of:
logb(x) = loga(x) / loga(b)
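As a rough sketch of how the change-of-base rule and the any-base version of the law fit together, the Python below computes first-digit probabilities for an arbitrary base. The helper names `log_base` and `benford_probability` are hypothetical, chosen only for this example.

```python
import math


def log_base(x: float, b: float) -> float:
    """Change of base: log_b(x) = ln(x) / ln(b)."""
    return math.log(x) / math.log(b)


def benford_probability(digit: int, base: int = 10) -> float:
    """Expected probability of `digit` being the first digit in the given base."""
    if base < 2:
        raise ValueError("base must be at least 2")
    if not 1 <= digit < base:
        raise ValueError("first digit must be between 1 and base - 1")
    return log_base(1 + 1 / digit, base)


# Base 16: the first digit can be anything from 1 to F (15)
hex_probabilities = [benford_probability(d, base=16) for d in range(1, 16)]

for d, p in zip(range(1, 16), hex_probabilities):
    print(f"{d:X}: {p:.1%}")
```

In base 16 the leading digit 1 turns up exactly a quarter of the time, because log16(2) = 0.25.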
How, Or Rather Why, Does This Happen?
I've read a few explanations about why this happens, so here are two of them:
Imagine a river, running from its source to the sea. If the river ran in a straight line, its length would be random, in the same way that the numbers generated by a computer's random number generator are.
But rivers don't work like that. They flow around large rocks, they erode soft rocks and create waterfalls, they move silt around, changing their paths and creating oxbow lakes (why yes, I did do Geography at school!). Every one of these things changes the length of the river by a certain percentage, so its length consists of its minimum possible length, plus an almost infinite number of percentage increments, one for every diversion.
It's because of this that river lengths adhere to Benford's law.
Incidentally, why are we using percentages there, instead of units? The reason is that the actual units (miles, kilometres, yards, furlongs, chains, etc.) don't matter; what we care about is the pattern in the numbers themselves. This confused me a little at first, mostly because I'm an engineer not a mathematician, and I usually prefer real things to abstract things, but having thought about it, the overall effect is independent of whether you use Imperial, metric or some other measuring system.
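To make the point about units a bit more concrete, here is a small sketch in Python. Everything in it is an assumption for illustration: the "river lengths" are synthetic values spread evenly on a log scale between 1 and 10,000 miles (not real measurements), and `first_digit` and `digit_frequencies` are hypothetical helper names. It converts the same lengths to kilometres and compares the first-digit frequencies.

```python
import random
from collections import Counter


def first_digit(x: float) -> int:
    """Leading non-zero digit of a positive number, read from scientific notation."""
    return int(f"{x:.15e}"[0])


def digit_frequencies(values):
    """Fraction of values starting with each digit 1 to 9."""
    counts = Counter(first_digit(v) for v in values)
    total = len(values)
    return {d: counts[d] / total for d in range(1, 10)}


rng = random.Random(42)  # fixed seed so the run is repeatable

# Synthetic lengths in miles, evenly spread over four orders of magnitude (log scale)
lengths_miles = [10 ** rng.uniform(0, 4) for _ in range(50_000)]
# The same lengths expressed in kilometres
lengths_km = [m * 1.609344 for m in lengths_miles]

freq_miles = digit_frequencies(lengths_miles)
freq_km = digit_frequencies(lengths_km)

print("digit   miles      km")
for d in range(1, 10):
    print(f"{d:>5}  {freq_miles[d]:6.1%}  {freq_km[d]:6.1%}")
```

Individual values change their first digit when you switch units, but the overall spread of first digits should stay close to the same shape in both columns.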
Changes in Numbers
This is another theory I read, which also hints at why this entire law was discovered in the first place. Benford's law is all about combining numbers, so let's see how much difference adding numbers together actually makes. This sounds a little odd, but let's go with it.
To get to the next starting digit in the series 1-9, you just need to add 1. If you're using hundreds then you need to add 100, millions you need to add a million, but that's just a matter of magnitude.
Starting with the number 1, how much do we need to add AS A PERCENTAGE to move to 2?
((2 - 1) / 1) * 100 = 100%
No surprises there. How about from 2 to 3?
((3 - 2) / 2) * 100 = 50%
And from 3 to 4 (my Maths teachers at school always told us we only need 3 numbers to identify a series):
((4 - 3) / 3) * 100 = 33.3%
So from here we can see this looks logarithmic, which is the kind of pattern that drew attention to this law in the first place. Let's put the complete set of values in a table so we can see the overall effect:
|First Digit||Percentage Change To Next Digit|
Now let's scale those percentages and put them alongside the first digit probabilities from the earlier graph illustrating Benford's law:
There's not much difference there at all, is there? It’s not EXACTLY the same, but it IS logarithmic.
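For anyone who wants to reproduce that comparison, here is a minimal Python sketch. It makes two assumptions of my own: the step from 9 is taken to be 9 to 10, and "scaling" means dividing each percentage by the total so that the nine values add up to 100%.

```python
import math

digits = range(1, 10)

# Percentage increase needed to move from each first digit to the next one
percent_change = {d: ((d + 1) - d) / d * 100 for d in digits}

# Scale so the nine values sum to 1 (assumed meaning of "scale" here)
total = sum(percent_change.values())
scaled = {d: percent_change[d] / total for d in digits}

# First-digit probabilities from Benford's law, for comparison
benford = {d: math.log10(1 + 1 / d) for d in digits}

print("digit  change%  scaled  benford")
for d in digits:
    print(f"{d:>5}  {percent_change[d]:7.1f}  {scaled[d]:6.1%}  {benford[d]:7.1%}")
```

Both columns fall away steadily from 1 to 9. The scaled percentages start a little higher and finish a little lower than the Benford figures.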
Say you have a member of staff who is responsible for purchasing in your organization. Because of their position, there's a rule that says that they must get approval from someone (the accounts manager, CFO, it doesn't matter) for any amount over 5000 dollars. You might think this rule shouldn't have any effect on the values of orders they place, and you'd be right. So, you run a quick analysis on their orders and you see this:
There's some inconsistency there - more orders starting with 4 than you'd expect, and fewer starting with 5. That corresponds to their authorization limit, and it's quite a serious discrepancy, so it's worth taking a closer look. It may be a coincidence, it may be that they're negotiating costs down because their manager is a pain to deal with, or it may be that their Lamborghini wasn't really a Christmas gift from their rich granny.
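Here is one way a check like that could be sketched in Python. To be clear about the assumptions: the order values are synthetic (generated, not real purchasing data), the `dodge_approval` behaviour is a made-up model of someone shaving orders to get under the limit, and the chi-square comparison against Benford's law is a common choice of test that I've picked for illustration, not something prescribed by the law itself.

```python
import math
import random
from collections import Counter

# Expected first-digit probabilities under Benford's law
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(x: float) -> int:
    """Leading non-zero digit of a positive number."""
    return int(f"{x:.15e}"[0])


def chi_square_vs_benford(values) -> float:
    """Chi-square statistic comparing observed first digits with Benford's law."""
    counts = Counter(first_digit(v) for v in values)
    n = len(values)
    return sum((counts[d] - n * p) ** 2 / (n * p) for d, p in BENFORD.items())


rng = random.Random(7)  # fixed seed so the run is repeatable

# Synthetic order values between 100 and 100,000 dollars (hypothetical data)
clean_orders = [10 ** rng.uniform(2, 5) for _ in range(20_000)]


def dodge_approval(amount: float) -> float:
    """Hypothetical behaviour: most orders just over the 5000 limit get shaved below it."""
    if 5000 < amount < 6000 and rng.random() < 0.7:
        return amount * 0.8  # lands between 4000 and 4800
    return amount


suspicious_orders = [dodge_approval(a) for a in clean_orders]

chi_clean = chi_square_vs_benford(clean_orders)
chi_suspicious = chi_square_vs_benford(suspicious_orders)

# Chi-square critical value for 8 degrees of freedom at the 5% level
CRITICAL_5_PERCENT = 15.51

print(f"clean orders:      chi-square = {chi_clean:.1f}")
print(f"suspicious orders: chi-square = {chi_suspicious:.1f}")
print(f"flag for review above {CRITICAL_5_PERCENT}")
```

A statistic well above the critical value only says the first digits don't look like Benford's law; it doesn't say why, so it's a prompt to look closer, not proof of anything.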
How About A Demonstration?
Theory is fine, but it's always good to show things working in practice. So, below is a button. When you press it, it will do two things:
1. It will create a series of random numbers using your browser's random number generator and work out the frequency of each first digit
2. It will create another series of random numbers in the same way, but this time will adjust them with random factors, in the same way that river flows or populations get "adjusted" in nature. It will then work out the frequency of each first digit for this series as well.
Go on, give it a try:
|First Digit||Random Number||Benford Random Number|
|1||( %)||( %)|
|2||( %)||( %)|
|3||( %)||( %)|
|4||( %)||( %)|
|5||( %)||( %)|
|6||( %)||( %)|
|7||( %)||( %)|
|8||( %)||( %)|
|9||( %)||( %)|
|Total simulation count:|| || |
|Total simulation duration:||ms||ms|
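If you'd rather run the same kind of experiment yourself, here is a rough Python equivalent of the two steps described above. It is a sketch under my own assumptions, not the code behind the button: the sample size, the range of 1 to 1000, the 50 adjustments per value and the "shrink or grow by up to 50%" factor are all arbitrary choices.

```python
import random
from collections import Counter


def first_digit(x: float) -> int:
    """Leading non-zero digit of a positive number."""
    return int(f"{x:.15e}"[0])


def digit_frequencies(values):
    """Fraction of values starting with each digit 1 to 9."""
    counts = Counter(first_digit(v) for v in values)
    total = len(values)
    return {d: counts[d] / total for d in range(1, 10)}


rng = random.Random(2024)  # fixed seed so the run is repeatable
SAMPLES = 20_000
ADJUSTMENTS = 50  # random percentage tweaks applied to each value (an assumption)

# Series 1: plain random numbers between 1 and 1000
plain = [rng.uniform(1, 1000) for _ in range(SAMPLES)]


def adjusted_value() -> float:
    """A random number that is then nudged up or down many times, like a river's length."""
    value = rng.uniform(1, 1000)
    for _ in range(ADJUSTMENTS):
        value *= rng.uniform(0.5, 1.5)  # shrink or grow by up to 50%
    return value


# Series 2: random numbers adjusted by random factors
adjusted = [adjusted_value() for _ in range(SAMPLES)]

plain_freq = digit_frequencies(plain)
adjusted_freq = digit_frequencies(adjusted)

print("digit   plain  adjusted")
for d in range(1, 10):
    print(f"{d:>5}  {plain_freq[d]:6.1%}  {adjusted_freq[d]:8.1%}")
```

The plain series should show each first digit about equally often, while the adjusted series should lean heavily towards 1 and tail off towards 9.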
Benford's law isn't a hard and fast rule - in mathematics a law isn't quite the same as you might expect. However, it IS a very good "red flag rule". If the first digits in a member of staff's expenses claims stray a long way from it, take a closer look.
Of course, it's not just purchasing that Benford's law can be used to check. Any set of financial data should follow it - sales, investments, tax returns, even corporate financial statements. If you think that's an exaggeration, it's worth remembering that deviation from the expected distribution is enough to trigger audits from many financial authorities and government revenue departments.
Like any other law, there are times when it doesn’t really apply. Any numbers from within a closely defined range are unlikely to fit very well. However, if you have a set of numbers covering several magnitudes, and which are themselves made up from other numbers, there’s a good chance Benford’s Law will apply.