Benford or not Benford: new results on digits beyond the first

(1)

HAL Id: hal-01781021

https://hal.archives-ouvertes.fr/hal-01781021

Preprint submitted on 28 Apr 2018

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires

Benford or not Benford: new results on digits beyond the first

Stéphane Blondeau da Silva

To cite this version:

Stéphane Blondeau da Silva. Benford or not Benford: new results on digits beyond the first. 2018.

�hal-01781021�

(2)

Benford or not Benford: new results on digits beyond the first

April 28, 2018

Abstract

In this paper, we will see that the proportion of dasp^th digit, where p >1 and d ∈ J0,9K, in data (obtained thanks to the hereunder devel- oped model) is more likely to follow a law whose probability distribution is determined by a specific upper bound, rather than the generalization of Benford’s Law to digits beyond the first one. These probability dis- tributions fluctuate around theoretical values determined by Hill in 1995.

Knowing beforehand the value of the upper bound can be a way to find a better adjusted law than Hill’s one.

Introduction

Benford’s Law is really amazing: according to it, the first digitd,d∈J1,9K, of numbers in many naturally occurring collections of data does not follow a discrete uniform distribution; it rather follows a logarithmic distribution. Having been discovered by Newcomb in 1881 ([11]), this law was definitively brought to light by Benford in 1938 ([2]). He proposed the following probability distribution: the probability fordto be the first digit of a number is equal to log(1 +¹_d).

Most of the empirical data, as physical data (Knuth in [10] or Burke and Kin- canon in [4]), demographic and economic data (Nigrini and Wood in [12]) or genome data (Friaret al. in [6]), follow approximately Benford’s Law. To such an extent that this law is used to detect possible frauds in lists of socio-economic data ([15]) or in scientific publications ([5]).

In [3], Blondeau Da Silva, building a rather relevant representative model, showed that, in this case, the proportion of eachd as leading digit,d∈J0,9K, structurally fluctuates. It strengthens the fact that, concerning empirical data sets, this law often appears to be a good approximation of the reality, but no more than an approximation ([7]). We can note that there also exist distribu- tions known to disobey Benford’s Law ([13] and [1]).

Generalizing Benford’s Law, Hill ([8]) extends the law to digits beyond the first one: the probability for d, d ∈ J0,9K, to be the p^th digit of a number is equal toP10^p−1−1

j=10^p−2log(1 +_10j+d¹ ).

Building a very similar model to that described in [3], the naturally occurring data will be considered as the realizations of independant random variables following the hereinafter constraints: (a) the data is strictly positive and is upper-bounded by an integern, constraint which is often valid in data sets, the

(3)

physical, biological and economical quantities being limited ; (b) each random variable is considered to follow a discrete uniform distribution whereby the first strictly positive p-digits integers (p > 1) are equally likely to occur (i being uniformly randomly selected inJ10^p−1, nK). This model relies on the fact that the random variables are not always the same.

Through this article we will demonstrate that the predominance of 0 over 1 (and of 1 over 2, and so on), as p^th, (p > 1) digit is all but surprising.

Hill’s probabilities became standard values that should exactly be followed by most of naturally occurring collections of data. However the reality is that the proportion of each d as leading digit structurally fluctuates. There is not a single law but numerous distinct laws that we will hereafter examine.

1 Notations and probability space

Letpanddbe two strictly positive integers such thatp >1 andd∈J0,9K. Let m be a strictly positive integer such thatm≥10^p−1. Let U {10^p−1, m} denote the discrete uniform distribution whereby integers between 10^p−1 and m are equally likely to be observed.

Let n be a strictly positive integer such that n ≥10^p−1. Let us consider the random experiment En of tossing two independent dice. The first one is a fair (n+ 1−10^p−1)-sided die showingn+ 1−10^p−1 different numbers from 1 ton+ 1−10^p−1. The numberi rolled on it defines the number of faces on the second die. It thus showsidifferent numbers from 10^p−1toi+ 10^p−1−1.

Let us now define the probability space Ωn as follows: Ωn = {(i, j) : i ∈ J1, n+ 1−10^p−1Kandj ∈J10^p−1, i+ 10^p−1−1K}. Our probability measure is denoted by P.

Let us denote by D_(n,p) the random variable from Ωn to J0,9K that maps each elementω of Ω_n to thep^th digit of the second component of ω.

As our aim is to determine the probability that thep^th digit of the integer obtained thanks to the second throw is d, it can be considered with no con- sequence on our results that we first select an integer i equal to or less than n among at least p-digits integers (following the U {10^p−1, n} discrete uniform distribution); afterwards we select an other at least p-digits integer equal to or less than i(following theU {10^p−1, i}discrete uniform distribution).

2 Proportion of d

Through the below proposition, we will express the value of P(D_(n,p)=d)i.e.

the probability that thep^thdigit of our second throw in the random experiment En isd.

Proposition 2.1. Let k denote the integer such that:

k= max{i∈N: 10^i+p≤n}. Let l denote the positive integer such that:

l=bⁿ⁻⁽¹⁰^p−1₁₀k+2^+d)10^k+1c+ 10^p−2.

(4)

The value of P(D_(n,p)=d)is:

1 n+ 1−10^p−1

X^k

i=0

10^p−1−1

X

j=10^p−2

(10j+(d+1))10ⁱ−1

X

b=(10j+d)10ⁱ

b−((9j+d)10ⁱ+ 10^p−2−1) b+ 1−10^p−1

+

10^p−1−1

X

j=10^p−2−1

min(10^p+i−1,(10(j+1)+d)10ⁱ−1)

X

a=max(10^p+i−1,(10j+(d+1))10ⁱ)

10ⁱ(j+ 1)−10^p−2 a+ 1−10^p−1

+r(n,d,p)

,

wherer_(n,d,p)is, if thep^th digit of nisd:

l

X

j=10^p−2

min(n,(10j+(d+1))10^k+1−1)

X

b=(10j+d)10^k+1

b−((9j+d)10^k+1+ 10^p−2−1) b+ 1−10^p−1

+

l−1

X

j=10^p−2−1

(10(j+1)+d)10^k+1−1

X

a=max(10^p+k,(10j+(d+1))10^k+1)

10^k+1(j+ 1)−10^p−2 a+ 1−10^p−1 , or where r(n,d,p) is, if thep^th digit ofn is all butd:

l

X

j=10^p−2

(10j+(d+1))10^k+1−1

X

b=(10j+d)10^k+1

b−((9j+d)10^k+1+ 10^p−2−1) b+ 1−10^p−1

+

l

X

j=10^p−2−1

min(n,(10(j+1)+d)10^k+1−1)

X

a=max(10^p+k,(10j+(d+1))10^k+1)

10^k+1(j+ 1)−10^p−2 a+ 1−10^p−1 .

Proof. Let us denote byF_(n,p)the random variable from Ω_ntoJ1, n+ 1−10^p−1K that maps each element ω of Ω_n to the first component of ω. It returns the number obtained on the first throw of the unbiased (n+ 1−10^p−1)-sided die.

For each q∈J1, n+ 1−10^p−1K, we have:

P(F_(n,p)=q) = 1

n+ 1−10^p−1. (1)

According to the Law of total probability, we state:

P(D_(n,p)=d) =

n+1−10^p−1

X

q=1

P(D_(n,p)=d|F_(n,p)=q) P(F_(n,p)=q). (2) Thereupon two cases appear in determining the value, for q ∈ J1, n+ 1− 10^p−1K, of P(D_(n,p) = d|F_(n,p) = q). Let kq be the integer such that kq = max{k∈N: 10^p+k≤q+ 10^p−1−1} in both cases.

Let us study the first case where the p^th digit of q+ 10^p−1−1 is d. For alliin J0, k_qK, there exist 9×10^p−2 sequences of 10ⁱ consecutive integers from (10j+d)10ⁱ to (10j+ (d+ 1))10ⁱ−1, wherej∈J10^p−2,10^p−1−1K, whosep^th digit isd. The higher of these integers is (10(10^p−1−1) + (d+ 1))10^k^q−1, the

(5)

last (p+k_q)-digit number in this case. Thus, from 10^p−1 to 10^p+k^q −1, the number of integers whosep^th digit isdis:

k_q

X

i=0 10^p−1−1

X

j=10^p−2

(10j+(d+1))10ⁱ−1

X

(10j+d)10ⁱ

1 = 9×10^p−2

k_q

X

i=0

10ⁱ = 10^p−2(10^k^q⁺¹−1).

This equality still holds true forkq=−1. Such types of sum would be considered null in the rest of the article. From 10^p+k^qtoq+10^p−1−1, there existtsequences of 10^k^q⁺¹consecutive integers from (10j+d)10^k^q⁺¹to (10j+ (d+ 1))10^k^q⁺¹−1, where j ∈ J10^p−2,10^p−2+t−1K, whose p^th digit is d. There also exist q+ 10^p−1−1−(10(10^p−2+t) +d)10^k^q⁺¹+ 1 additional integers in this case between (10(10^p−2+t) +d)10^k^q⁺¹andq+ 10^p−1−1. Finally the total number of integers whosep^th digit isdis:

10^p−2(10^k^q⁺¹−1) +t×10^k^q⁺¹+q+ 10^p−1−1−(10(10^p−2+t) +d)10^k^q⁺¹+ 1 i.e. q+ 10^p−1−1−

9(10^p−2+t) +d

10^k^q⁺¹−1 . It may be inferred that:

P(D(n,p)=d|F(n,p)=q) =

q+ 10^p−1−1−

9(10^p−2+t) +d

10^k^q⁺¹−1

q ,

(3) thep^th digit ofq+ 10^p−1−1 being hered.

In the second case, we consider the integersq+ 10^p−1−1 whosep^thdigits are different fromd. On the basis of the previous case, the total number of integers whosep^thdigit isdis, wheretis the number of sequences of consecutive integers lower thanq+ 10^p−1−1:

10^p−2(10^k^q⁺¹−1) +t×10^k^q⁺¹ i.e. 10^k^q⁺¹(10^p−2+t)−10^p−2. It can be concluded that:

P(D_(n,p)=d|F_(n,p)=q) = 10^k^q⁺¹(10^p−2+t)−10^p−2

q , (4)

thep^th digit ofq+ 10^p−1−1 being here different from d.

Using equalities (1), (2), (3) and (4), we get our result.

For example, we get:

Examples 2.2. Let us first determine the value of P(D_(10003,5)= 2). The probability that the fifth digit of a randomly selected number in J10000,10000K is 2 is ⁰₁, those in J10000,10001K is ⁰₂, those in J10000,10002K is ¹₃ and those in J10000,10003Kis ¹₄. Hence we have:

P(D_(10003,5)= 2) = 1 4

0 1 +0

2+1 3+1

4

≈0.1458.

It is the second case of Proposition 2.1, wheren= 10003,d= 2, p= 5,k=−1 andl= 1000.

(6)

Let us now determine the value of P(D_(1113,3) = 1) (first case of Proposition 2.1); in this case, we havek= 0 and l= 11.

P(D_(1113,3)= 1) = 1 1014

X⁹⁹

j=10

j−9 10j−98+

98

X

j=10 10(j+1)

X

a=10j+2

j−9 a−99+

999

X

a=992

90 a−99+

1009

X

a=1000

90 a−99 +

1019

X

b=1010

b−919 b−99 +

1109

X

a=1020

100 a−99+

1113

X

b=1110

b−1009 b−99

= 1

1014 1

2+1 3+1

4+...+ 1 11+ 2

12+ 2

13+...+ 89 891+ 90

892+ 90

893+...+ 90 910 + 91

911+...+100 920+100

921+...+ 100 1010+ 101

1011+...+ 104 1014

≈0.1028.

Let us determine the value of P(D(212,2)= 9) (second case of Proposition 2.1);

in this case, we havek= 0 and l= 1.

P(D_(212,2)= 9) = 1 203

9 10+

8

X

j=1

10(j+1)+8

X

a=10(j+1)

j a−9+

189

X

a=100

9 a−9+

199

X

b=190

b−180 b−9 +

212

X

a=200

19 a−9

= 1 203

1 10+ 1

11+...+ 1 19+ 2

20+ 2

21+...+ 8 89+ 9

90+ 9

91+...+ 9 180+ 10

181 + 11

182+...+ 19 190+ 19

191+...+ 19 203

≈0.0759.

3 Study of a particular subsequence

It is natural that we take a specific look at the values ofnpositioned one rank before the integers for which the number of digits has just increased.

To this end we will consider the sequence P(Dn,p=d)

n∈N\J0,10^p−1−1K

. In the interests of simplifying notation, we will denote by (P_(d,n,p))_n∈_N_\

J0,10^p−1−1K

this sequence. Let us study the subsequence (P_(d,φ_(d,p)_(n),p))_n∈_N_\

J0,p−1K where φ_(d,p) is the function fromN\J0, p−1K to Nthat mapsnto 10ⁿ−1. We get the below result:

Proposition 3.1. The subsequence (P(d,φ_(d,p)(n),p))_n∈N\

J0,p−1K converges to:

10⁻¹+n_(d,p)+m_(d,p)−9l_(d,p)−d×k_(d,p)

9×10^p−1 + 1

90ln(10^p−1+d 10^p−1 ) +1

9ln( 10^p 10^p−10 +d+ 1),

where:











k(d,p) =P10^p−1−1

j=10^p−2ln(^10j+(d+1)_10j+d ) l_(d,p) =P10^p−1−1

j=10^p−2jln(^10j+(d+1)_10j+d ) m_(d,p)=P10^p−1−2

j=10^p−2ln(^10(j+1)+d_10j+(d+1)) n(d,p)=P10^p−1−2

j=10^p−2jln(^10(j+1)+d_10j+(d+1)).

Proof. Let nbe a positive integer such that n ≥p. According to Proposition 2.1, we have P(d,φ_(d,p)(n),p) = P(d,10ⁿ−1,p) i.e., knowing that in this case k =

(7)

max{i∈N: 10^i+p ≤10ⁿ−1}=n−p−1:

1 10ⁿ−10^p−1

^n−pX

i=0

10^p−1−1

X

j=10^p−2

(10j+(d+1))10ⁱ−1

X

b=(10j+d)10ⁱ

b−((9j+d)10ⁱ+ 10^p−2−1) b+ 1−10^p−1

+

10^p−1−1

X

j=10^p−2−1

min(10^p+i−1,(10(j+1)+d)10ⁱ−1)

X

a=max(10^p+i−1,(10j+(d+1))10ⁱ)

10ⁱ(j+ 1)−10^p−2 a+ 1−10^p−1

.

Let us denote byb_(i,d,p)the positive number:

10^p−1−1

X

j=10^p−2

(10j+(d+1))10ⁱ−1

X

b=(10j+d)10ⁱ

b−((9j+d)10ⁱ+ 10^p−2−1) b+ 1−10^p−1 , and bya_(i,d,p) the positive number:

10^p−1−1

X

j=10^p−2−1

min(10^p+i−1,(10(j+1)+d)10ⁱ−1)

X

a=max(10^p+i−1,(10j+(d+1))10ⁱ)

10ⁱ(j+ 1)−10^p−2 a+ 1−10^p−1 . Thus we have:

P_(d,φ_(d,p)_(n),p)= 1 10ⁿ−10^p−1

n−p

X

i=0

b_(i,d,p)+a_(i,d,p) .

Let us first find an appropriate lower bound ofP_(d,φ_(d,p)_(n),p). We have:

b_(i,d,p)=

10p−1−1

X

j=10p−2

10ⁱ−

(10j+(d+1))10i−1

X

b=(10j+d)10i

(9j+d)10ⁱ+ 10^p−2−10^p−1 b+ 1−10^p−1

= 9×10^p+i−2−

10p−1−1

X

j=10p−2

((9j+d)10ⁱ+ 10^p−2−10^p−1)

(10j+(d+1))10i−1

X

b=(10j+d)10i

1 b+ 1−10^p−1

Recall that for all integers (p, q), such that 1< p < q:

ln(q+ 1 p )≤

q

X

k=p

1

k ≤ln( q

p−1). (5)

Consequently, we obtain, fori≥1:

b_(i,d,p)≥9×10^p+i−2−

10p−1−1

X

j=10p−2

(9j+d)10ⁱln((10j+ (d+ 1))10ⁱ−10^p−1 (10j+d)10ⁱ−10^p−1 )

≥9×10^p+i−2−

10p−1−1

X

j=10p−2

(9j+d)10ⁱ ln(10j+ (d+ 1)

10j+d ) + ln(1 +

10p−1 10j+(d+1)

10ⁱ(10j+d)−10^p−1)

≥9×10^p+i−2−d×10ⁱ

10p−1−1

X

j=10p−2

ln(10j+ (d+ 1)

10j+d )−9×10ⁱ

10p−1−1

X

j=10p−2

jln(10j+ (d+ 1) 10j+d )

−

10p−1−1

X

j=10p−2

(9j+d)10ⁱln(1 +

10p−1 10j+(d+1)

10ⁱ(10j+d)−10^p−1).

(8)

Let us denote byk_(d,p)the positive numberP10^p−1−1

j=10^p−2ln(^10j+(d+1)_10j+d ) andl_(d,p)the positive numberP10^p−1−1

j=10^p−2jln(^10j+(d+1)_10j+d ). Knowing that for allx∈]−1; +∞[, we have ln(1 +x)≤x, we obtain:

b_(i,d,p)≥9×10^p+i−2−d×10ⁱk_(d,p)−9×10ⁱl_(d,p)−

10p−1−1

X

j=10p−2

(9j+d)10ⁱ

10p−1 10j+(d+1)

10ⁱ(10j+d)−10^p−1

≥9×10^p+i−2−d×10ⁱk_(d,p)−9×10ⁱl_(d,p)−

10p−1−1

X

j=10p−2

10ⁱ 10^p−1 10ⁱ×10^p−1−10^p−1

≥9×10^p+i−2−d×10ⁱk_(d,p)−9×10ⁱl_(d,p)−9×10^p−2 10ⁱ 10ⁱ−1.

Similarly, we have thanks to inequalities (5):

a_(i,d,p)≥

10p−1−2

X

j=10p−2

(10ⁱ(j+ 1)−10^p−2) ln((10(j+ 1) +d)10ⁱ+ 1−10^p−1 (10j+ (d+ 1))10ⁱ+ 1−10^p−1) + (10^p−2+i−10^p−2) ln((10^p−1+d)10ⁱ+ 1−10^p−1

10^p+i−1+ 1−10^p−1 ) + (10^p−1+i−10^p−2) ln( 10^p+i+ 1−10^p−1

(10^p−10 +d+ 1)10ⁱ+ 1−10^p−1)

≥10ⁱ

10p−1−2

X

j=10p−2

jln(10(j+ 1) +d 10j+ (d+ 1)) + 10ⁱ

10p−1−2

X

j=10p−2

jln(1 +

9×(10p−1−1) 10(j+1)+d

(10j+d+ 1)10ⁱ+ 1−10^p−1)

+ (10ⁱ−10^p−2)¹⁰

p−1−2

X

j=10p−2

ln(10(j+ 1) +d

10j+ (d+ 1)) + ln(1 +

9×(10p−1−1) 10(j+1)+d

(10j+d+ 1)10ⁱ+ 1−10^p−1)

+ (10^p−2+i−10^p−2) ln(10^p−1+d

10^p−1 ) + ln(1 +

d(10p−1−1) 10p−1 +d

10^p−110ⁱ+ 1−10^p−1) + (10^p−1+i−10^p−2)(ln( 10^p

10^p−10 +d+ 1) + ln(1 +

(10p−1−1)(10−d−1) 10p

(10^p−10 +d+ 1)10ⁱ+ 1−10^p−1)).

Let us denote by m_(d,p) the positive numberP10^p−1−2

j=10^p−2ln(^10(j+1)+d_10j+(d+1)) andn_(d,p) the positive number P10^p−1−2

j=10^p−2jln(^10(j+1)+d_10j+(d+1)):

a_(i,d,p)≥10ⁱn_(d,p)+ (10ⁱ−10^p−2) m_(d,p)+

10p−1−2

X

j=10p−2

ln(1 +

9×(10p−1−1) 10(j+1)+d

(10j+d+ 1)10ⁱ+ 1−10^p−1) + (10^p−2+i−10^p−2) ln(10^p−1+d

10^p−1 ) + (10^p−1+i−10^p−2) ln( 10^p 10^p−10 +d+ 1).

Hence we have:

P_(d,φ

(d,p) (n),p)≥ 1 10ⁿ

a_(0,d,p)+b_(0,d,p)+

n−p

X

i=1

9×10^p+i−2−d×10ⁱk_(d,p)−9×10ⁱl_(d,p)

+ 10ⁱn_(d,p)+ 10ⁱm_(d,p)+ 10^p−2+iln(10^p−1+d

10^p−1 ) + 10^p−1+iln( 10^p 10^p−10 +d+ 1)

−9×10^p−210

9 −10^p−2 m_(d,p)+ ln(10^p−1+d

10^p−1 ) + ln( 10^p 10^p−10 +d+ 1) + (10ⁱ−10^p−2)

10p−1−2

X

j=10p−2

ln(1 +

9×(10p−1−1) 10(j+1)+d

(10j+d+ 1)10ⁱ+ 1−10^p−1) .

(9)

In light of the following equalityPn−p

i=1 10ⁱ= ¹⁰^n−p+1₉ ⁻¹⁰, we have:

P_(d,φ

(d,p) (n),p)≥10⁻¹+10^−p+1(n_(d,p)+m_(d,p)−9l_(d,p)−dk_(d,p))

9 +10⁻¹

9 ln(10^p−1+d 10^p−1 ) +1

9ln( 10^p

10^p−10 +d+ 1) +_(d,n,p),

where_(d,n,p)is:

a_(0,d,p)+b_(0,d,p) 10ⁿ −10^p−1

10ⁿ +dk_(d,p)+ 9l_(d,p)−n_(d,p)−m_(d,p)

9×10ⁿ⁻¹ − 10^p−1

9×10ⁿln(10^p−1+d 10^p−1 )

− 10^p

9×10ⁿln( 10^p

10^p−10 +d+ 1)−10^p−1(n−p)

10ⁿ −10^p−2(n−p)

10ⁿ m_(d,p)+ ln(10^p−1+d 10^p−1 ) + ln( 10^p

10^p−10 +d+ 1) + 1

10ⁿ

p−3

X

i=1

(10ⁱ−10^p−2)

10p−1−2

X

j=10p−2

ln(1 +

9×(10p−1−1) 10(j+1)+d

(10j+d+ 1)10ⁱ+ 1−10^p−1).

Knowing that for all x∈]−1; +∞[, we have ln(1 +x) ≤x, we obtain, for all i∈ {1, ..., p−3}:

10p−1−2

X

j=10p−2

ln(1 +

9×(10p−1−1) 10(j+1)+d

(10j+d+ 1)10ⁱ+ 1−10^p−1)≤

10p−1−2

X

j=10p−2

9×(10p−1−1) 10(j+1)+d

(10j+d+ 1)10ⁱ+ 1−10^p−1

≤10^p−1

10p 10p−1

d+ 2 ≤10^p

From the above upper bound and the definition of (d,n,p), it may be deduced that lim

n→+∞_(d,n,p)= 0.

Let us now find an appropriate upper bound of P_(d,φ_(d,p)_(n),p). Thanks to inequalities (5):

b_(i,d,p)≤9×10^p+i−2−

10p−1−1

X

j=10p−2

((9j+d)10ⁱ+ 10^p−2−10^p−1)

ln((10j+ (d+ 1))10ⁱ+ 1−10^p−1 (10j+d)10ⁱ+ 1−10^p−1 )

≤9×10^p+i−2−

10p−1−1

X

j=10p−2

((9j+d)10ⁱ+ 10^p−2−10^p−1) ln(10j+ (d+ 1) 10j+d )

+ ln(1 +

10p−1−1 10j+(d+1)

10ⁱ(10j+d) + 1−10^p−1)

≤9×10^p+i−2−d×10ⁱk_(d,p)−9×10ⁱl_(d,p)+ 10^p−1k_(d,p).