Scholarly article on topic 'The size distributions of all Indian cities'

The size distributions of all Indian cities Academic research paper on "Sociology"

{"Indian city size distribution" / "Lognormal body" / "Lower-tail reverse Pareto" / "Transitions points" / "Upper-tail Pareto"}

Abstract of research paper on Sociology, author of scientific article — Jeff Luckstead, Stephen Devadoss, Diana Danforth

Abstract We apply five distributions–lognormal, double-Pareto lognormal, lognormal-upper tail Pareto, Pareto tails-lognormal, and Pareto tails-lognormal with differentiability restrictions–to estimate the size distribution of all Indian cities. Since India contains numerous small cities, it is important to explicitly model the lower-tail behavior for studying the distribution of all Indian cities. Our results rigorously confirm, using both graphical and formal statistical tests, that among these five distributions, Pareto tails-lognormal is a better suited parametrization of the Indian city size data, verifying that the Indian city size distribution exhibits a strong reverse Pareto in the lower tail, lognormal in the mid-range body, and Pareto in the upper tail.


Contents lists available at ScienceDirect

Physica A


The size distributions of all Indian cities

Jeff Luckstead a,*, Stephen Devadoss b, Diana Danforth a

a University of Arkansas, United States b Texas Tech University, United States



• We propose a composite distribution to estimate the lower tail, body, and upper tail.

• We apply this distribution to all Indian cities.

• The lower-tail is modeled with reverse Pareto and the upper-tail with Pareto.

• The body is modeled with lognormal.


Article history:

Received 9 September 2016

Received in revised form 12 January 2017

Available online 19 January 2017


Keywords: Indian city size distribution; Lognormal body; Lower-tail reverse Pareto; Transition points; Upper-tail Pareto


© 2017 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license.

1. Introduction

Though the size distribution of large cities in the upper tail has been extensively studied, the size distribution of all cities has only recently been gaining attention in the literature.1 Reed [7] observed power-law behavior in both the upper and lower tails of 1998 data on US settlements and proposed the double Pareto distribution, which accommodates these behaviors. Reed [8] generalized this distribution to the double Pareto lognormal (DPLN) and applied it to city size data for two US states (California and West Virginia) and two provinces in Spain (Barcelona and Cantabria). Though Reed formulated DPLN in 2002, other researchers did not use it to estimate the size distribution of all cities until Giesen et al. [9] applied it to eight countries2 and found that DPLN fits the data better than the lognormal distribution. More recently, González-Val et al. [10] fitted DPLN along with lognormal, q-exponential, and log-logistic distributions to untruncated city size data for the United States and Spain and concluded that DPLN performs better than the other three distributions.3

* Corresponding author.

E-mail addresses: (J. Luckstead), (S. Devadoss), (D. Danforth).

1 Upper-tail cities are generally shown to follow Zipf's law [1,2]. However, the size of the sample plays an important role in determining whether Zipf's law holds or not [3-5]. Using a revised t-statistic, Nishiyama et al. [6] established that 23 of 24 countries follow Zipf's law in the upper tail.

2 The countries and years studied are Germany (2006), United States (2000), France (2006), Brazil (2007), Czech Republic (2009), Hungary (2009), Italy (2009), and Switzerland (2008).

Eeckhout [19] showed that US cities follow proportionate growth (Gibrat's law) for the Census years 1990 and 2000, and that the lognormal (LN) distribution with mean of 7.28 and standard deviation of 1.75 accurately represents the 2000 Census data for all US cities.4 Based on these observations, Eeckhout developed a dynamic general equilibrium model in which population dynamics follow Gibrat's law and the asymptotic size distribution is LN. However, Levy [20] argued that a power law, not the lognormal, accurately represents the upper tail when analyzing data for all US cities. Rozenfeld et al. [21] provided a unique analysis of US upper-tail cities by defining area clusters as opposed to administrative cities and found that upper-tail US cities adhere to Zipf's law for a cut-off point of 12,000, which is considerably smaller than in most other studies.

Given the controversial nature of the power laws in the upper-tail cities, Ioannides and Skouras [22] proposed a lognormal-upper tail Pareto (LNUP) distribution that models small and medium cities by lognormal and upper-tail cities by Pareto, and that also estimates the switching point from the body to the upper tail.5 Their results provide statistical evidence that the distribution of all US cities for 2000 Census data does indeed transition from lognormal to Pareto for city sizes above 60,290. However, LNUP is limited in that it does not model the power law and the transition point in the lower tail (i.e., reverse Pareto), if this tail behavior exists. Luckstead and Devadoss [25] proposed a distribution that combines the approaches of DPLN and LNUP to model the lower tail with reverse Pareto, the body with lognormal, and the upper tail with Pareto. Their distribution, known as Pareto tails and lognormal body (PTLN), endogenously determines the transition cities from the lower tail to the body and from the body to the upper tail. The PTLN distribution is highly suitable for developing countries, which generally have a larger number of cities in the lower tail.

The LNUP and PTLN distributions allow for continuity, but not differentiability, at the transition points. In this study, we also extend the PTLN distribution by imposing differentiability conditions (labeled PTLND) to model the size distribution of all cities in India.

According to the most recent census in 2011 [26], India is the second largest country in the world with more than 1.2 billion people, and it is predominantly a rural country with about 68% of the population living in rural areas in 2012 [27]. Consequently, in contrast to developed countries, India has a large number of small cities. For instance, of the 605,654 total cities, 338,713 have a population of less than 1000; these make up more than half of all cities and account for 151 million people. Only 536 urban cities have a population above 100,000, and they account for 227 million inhabitants. The remaining 266,405 cities have populations between 1000 and 100,000 and are where the majority of the population (833 million) lives. Thus, analyzing the size distribution of all Indian cities by explicitly modeling the lower tail, middle part, and upper tail provides important insight into the organization of the cities that contain 17.5% of the world population.

We apply the five distributions described above (LN, DPLN, LNUP, PTLN, PTLND) to all cities in India and evaluate which one fits the data most accurately. To our knowledge, our study is the first to apply these distributions to all Indian cities. Previous studies that examined the city size distribution for India mainly focused on large cities in the upper tail [28,29] and found evidence of Pareto behavior. Devadoss et al. [30] showed that lower-tail Indian cities do exhibit power-law behavior for a robust set of arbitrarily selected cutoff points. Our results rigorously confirm, using both graphical and formal statistical tests, that among these five distributions, PTLN provides a better parametrization of the Indian city size data, verifying that the Indian city size distribution exhibits a strong reverse Pareto in the lower tail, lognormal in the mid-range body, and Pareto in the upper tail.

2. Methodology

In this section, we present the five distributions (LN, DPLN, LNUP, PTLN, and PTLND) and their likelihood functions, and discuss estimation procedures.

2.1. Lognormal

The PDF of the LN distribution with mean $\mu$ and variance $\sigma^2$ is

$$h_L(x; \mu, \sigma) = \frac{1}{x \sigma (2\pi)^{1/2}} \exp\left(-\frac{(\log x - \mu)^2}{2\sigma^2}\right).$$

3 A series of important studies by Eliazar and Cohen [11-13] employed geometric Langevin dynamics to develop the anchoring method, which allows for a wide range of distributions that follow power laws in both upper and lower tails. They applied these distributions to study income and wealth phenomena. Rank size distributions can also be established by utilizing the "Lorenzian Limit Law," which is derived from Lorenz curves by applying the central limit theorem [14-16]. This comes under the statistical classification known as universal macroscopic statistics and can be applied to both physical and economic variables to study their distributions.

Another recent strand of studies by Puente-Ajovin and Ramos [17] and Ramos [18] developed new distributions. Puente-Ajovin and Ramos [17] applied the double Pareto Singh-Maddala distribution to French, German, Italian, and Spanish communes and showed Pareto behavior in the lower and upper tails with a Singh-Maddala body. Ramos [18] modeled the log-growth rate of the population of US incorporated cities using a double-mixture exponential Generalized Beta 2 and found this distribution to be more appropriate than the normal.

4 The 2000 US Census was the first one to report data for all cities, which are denoted as ''Census Designated Places''.

5 Calderin-Ojeda [23] developed a composite Lognormal-Pareto distribution by extending LNUP model to study French communes. Fazio and Modica [24] also used LNUP to study US city size data.

For n i.i.d. samples of x, the joint log likelihood is

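$$L_L(\mu, \sigma \mid x) = -\sum_{i=1}^{n} \log x_i - \frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(\log x_i - \mu)^2.$$

Optimization of this function with respect to $\mu$ and $\sigma^2$ yields

$$\hat{\mu} = \frac{\sum_{i=1}^{n}\log x_i}{n} \quad \text{and} \quad \hat{\sigma}^2 = \frac{\sum_{i=1}^{n}(\log x_i - \hat{\mu})^2}{n},$$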

which can be estimated using the sample data for x.
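As a minimal sketch of these closed-form estimators, the following Python snippet computes them on a synthetic lognormal sample. The sample, the seed, and the function names are illustrative assumptions and are not the census data or the authors' code.

```python
import math
import random


def lognormal_mle(sizes):
    """Closed-form ML estimates (mu_hat, sigma2_hat) for a lognormal sample."""
    logs = [math.log(x) for x in sizes]
    n = len(logs)
    mu_hat = sum(logs) / n
    # The ML variance estimator divides by n, not n - 1.
    sigma2_hat = sum((v - mu_hat) ** 2 for v in logs) / n
    return mu_hat, sigma2_hat


def lognormal_loglik(sizes, mu, sigma2):
    """Joint lognormal log likelihood evaluated at (mu, sigma2)."""
    logs = [math.log(x) for x in sizes]
    n = len(logs)
    return (-sum(logs)
            - 0.5 * n * math.log(2.0 * math.pi)
            - 0.5 * n * math.log(sigma2)
            - sum((v - mu) ** 2 for v in logs) / (2.0 * sigma2))


# Synthetic "city sizes" (an assumption for illustration only).
random.seed(0)
sample = [random.lognormvariate(6.63, 1.33) for _ in range(50_000)]

mu_hat, sigma2_hat = lognormal_mle(sample)
print(mu_hat, math.sqrt(sigma2_hat))
```

Evaluating `lognormal_loglik` at the estimates and at a slightly perturbed point is a quick way to confirm that the closed-form solution sits at the maximum.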

2.2. Double-Pareto lognormal

The DPLN distribution proposed by Reed [8] has a strong statistical underpinning based on Gibrat's law with constant mean and instantaneous variance and assumes that the number of settlements is exponentially distributed at a given time after the initial settlement. This distribution is a weighted sum of the lower- and upper-tail Pareto with normal cumulative and countercumulative distribution functions serving as the respective weights. For a random sample, the DPLN density is

$$h_d(x \mid \zeta_d, \beta_d, \mu_d, \sigma_d) = \frac{\zeta_d \beta_d}{\zeta_d + \beta_d} \left[ x^{-\zeta_d - 1} \exp\left(\zeta_d \mu_d + \tfrac{1}{2}\zeta_d^2 \sigma_d^2\right) \Phi\left(\frac{\log(x) - \mu_d - \zeta_d \sigma_d^2}{\sigma_d}\right) + x^{\beta_d - 1} \exp\left(-\beta_d \mu_d + \tfrac{1}{2}\beta_d^2 \sigma_d^2\right) \Phi^c\left(\frac{\log(x) - \mu_d + \beta_d \sigma_d^2}{\sigma_d}\right) \right],$$

where $\zeta_d$ and $\beta_d$ are the shape parameters for the upper and lower tails, respectively, $\mu_d$ and $\sigma_d$ are the mean and standard deviation of the lognormal distribution, $\Phi(\cdot)$ is the CDF of the standard normal distribution, and $\Phi^c(\cdot) = 1 - \Phi(\cdot)$. Thus, the DPLN distribution captures power-law behavior in both tails and lognormal behavior in the middle range. As $x$ becomes small, $h_d(\cdot)$ approximates a reverse Pareto distribution $x^{\beta_d - 1}$, and as $x$ becomes large, $h_d(\cdot)$ approaches the upper-tail Pareto distribution $x^{-\zeta_d - 1}$. $\Phi(\cdot)$ places more (less) weight on the upper tail for large (small) cities, whereas $\Phi^c(\cdot)$ applies more (less) weight on the lower tail for small (large) cities. However, this distribution, as pointed out by Giesen et al. [9], does not parametrically estimate lower- and upper-tail cutoff points, and thus we cannot determine where the lognormal body transitions from the lower tail and to the upper tail. As a result, DPLN does not help to quantify the proportion of population and cities in the lower-tail small cities and upper-tail large cities.
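The tail behavior described above can be checked numerically. The sketch below evaluates the DPLN density with the Table 1 DPLN estimates and measures its log-log slope far out in each tail; the function names and the evaluation points are illustrative assumptions, and the complementary error function is used for both normal weights to avoid cancellation.

```python
import math


def std_normal_cdf(z):
    """Standard normal CDF, Phi(z)."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def std_normal_ccdf(z):
    """Standard normal complementary CDF, 1 - Phi(z)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def dpln_pdf(x, zeta, beta, mu, sigma):
    """DPLN density: upper-tail and lower-tail Pareto terms with normal weights."""
    lx = math.log(x)
    upper = (x ** (-zeta - 1.0)
             * math.exp(zeta * mu + 0.5 * zeta ** 2 * sigma ** 2)
             * std_normal_cdf((lx - mu - zeta * sigma ** 2) / sigma))
    lower = (x ** (beta - 1.0)
             * math.exp(-beta * mu + 0.5 * beta ** 2 * sigma ** 2)
             * std_normal_ccdf((lx - mu + beta * sigma ** 2) / sigma))
    return zeta * beta / (zeta + beta) * (upper + lower)


def loglog_slope(x0, x1, **params):
    """Slope of log density against log size between x0 and x1."""
    num = math.log(dpln_pdf(x1, **params)) - math.log(dpln_pdf(x0, **params))
    return num / (math.log(x1) - math.log(x0))


# DPLN estimates as reported in Table 1.
params = dict(zeta=1.65, beta=0.99, mu=7.03, sigma=0.61)

upper_slope = loglog_slope(1e9, 1e10, **params)  # approaches -zeta - 1
lower_slope = loglog_slope(1.0, 10.0, **params)  # approaches beta - 1
print(upper_slope, lower_slope)
```

With these parameter values, the slopes should settle near $-\zeta_d - 1$ in the upper tail and $\beta_d - 1$ in the lower tail, which is the sense in which the density is Pareto in both tails.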

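For n i.i.d. samples, the log-likelihood is

$$L(\zeta_d, \beta_d, \mu_d, \sigma_d \mid x) = n \log\left(\frac{\zeta_d \beta_d}{\zeta_d + \beta_d}\right) + \sum_{i=1}^{n} \log\left[ x_i^{-\zeta_d - 1} \exp\left(\zeta_d \mu_d + \tfrac{1}{2}\zeta_d^2 \sigma_d^2\right) \Phi\left(\frac{\log(x_i) - \mu_d - \zeta_d \sigma_d^2}{\sigma_d}\right) + x_i^{\beta_d - 1} \exp\left(-\beta_d \mu_d + \tfrac{1}{2}\beta_d^2 \sigma_d^2\right) \Phi^c\left(\frac{\log(x_i) - \mu_d + \beta_d \sigma_d^2}{\sigma_d}\right) \right]. \tag{3}$$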

Because optimization of (3) does not yield an analytical solution, we numerically estimate the parameters $\zeta_d$, $\beta_d$, $\mu_d$, and $\sigma_d$.

2.3. Lognormal-upper tail Pareto

Given the economic and statistical importance of the upper tail, Ioannides and Skouras [22] proposed the LNUP distribution which identifies the switching point from lognormal to upper-tail Pareto:

$$h_u(x; \mu_u, \sigma_u, \tau_u, \zeta_u) = \begin{cases} b(\tau_u, \mu_u, \sigma_u, \zeta_u)\, f(x; \mu_u, \sigma_u), & x_{\min} < x < \tau_u \\ a(\tau_u, \mu_u, \sigma_u, \zeta_u)\, b(\tau_u, \mu_u, \sigma_u, \zeta_u)\, g(x; \zeta_u, \tau_u), & \tau_u \le x < \infty, \end{cases}$$

where $\tau_u$ is the cutoff parameter; $f(x; \mu_u, \sigma_u)$ is a lognormal PDF with mean $\mu_u$ and standard deviation $\sigma_u$:

$$f(x; \mu_u, \sigma_u) = \frac{1}{x \sigma_u (2\pi)^{1/2}} \exp\left(-\frac{(\log x - \mu_u)^2}{2\sigma_u^2}\right), \quad x_{\min} < x < \tau_u;$$

$g(x; \zeta_u, \tau_u)$ is a power law function with shape parameter $\zeta_u$:

$$g(x; \zeta_u, \tau_u) = \frac{1}{x^{\zeta_u + 1}}, \quad \tau_u \le x, \ \zeta_u > 1;$$

$a(\tau_u, \mu_u, \sigma_u, \zeta_u)$ and $b(\tau_u, \mu_u, \sigma_u, \zeta_u)$ are given by

$$a(\tau_u, \mu_u, \sigma_u, \zeta_u) = \frac{f(\tau_u; \mu_u, \sigma_u)}{g(\tau_u; \zeta_u, \tau_u)} = f(\tau_u; \mu_u, \sigma_u)\, \tau_u^{\zeta_u + 1},$$

$$b(\tau, \mu, \sigma, \zeta) = \frac{1}{\Phi(\tau; \mu, \sigma) - \Phi(x_{\min}; \mu, \sigma) + f(\tau; \mu, \sigma)\, \dfrac{\tau}{\zeta}},$$

and $\Phi(\cdot)$ is the CDF of the lognormal distribution. Here, $a(\cdot)$ serves as a normalization constant to maintain continuity in $h_u(\cdot)$ at $\tau_u$, where lognormal ends and Pareto begins, and $b(\cdot)$ is a normalization constant to ensure that the density function $h_u(\cdot)$ integrates to one. To maintain probabilities between zero and one and a unimodal distribution, $\tau_u$ must be finite and strictly greater than $\exp(\mu_u)$. As the switching parameter $\tau_u$ converges to the largest city in the dataset, the distribution turns into lognormal, while as $\tau_u$ tends to one the distribution becomes Pareto. Thus, $\tau_u$ specifies the transition point, if it occurs, where the distribution switches from lognormal to Pareto. This model nests Zipf's law when the Pareto coefficient $\zeta_u = 1$.
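The roles of the two constants can be illustrated with a short numerical check: $a(\cdot)$ should make the density continuous at the cutoff and $b(\cdot)$ should make it integrate to one. The parameter values below are hypothetical (chosen so that the cutoff exceeds $\exp(\mu_u)$), and the function names and the integration routine are assumptions made for this sketch.

```python
import math


def ln_pdf(x, mu, sigma):
    """Lognormal density f(x; mu, sigma)."""
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2.0 * math.pi))


def ln_cdf(x, mu, sigma):
    """Lognormal CDF."""
    return 0.5 * math.erfc(-(math.log(x) - mu) / (sigma * math.sqrt(2.0)))


def lnup_pdf(x, mu, sigma, tau, zeta, x_min=1.0):
    """LNUP density: lognormal below tau, power law x^-(zeta+1) from tau on."""
    f_tau = ln_pdf(tau, mu, sigma)
    a = f_tau * tau ** (zeta + 1.0)  # continuity constant f(tau) / g(tau)
    b = 1.0 / (ln_cdf(tau, mu, sigma) - ln_cdf(x_min, mu, sigma)
               + f_tau * tau / zeta)  # makes the density integrate to one
    if x < tau:
        return b * ln_pdf(x, mu, sigma)
    return a * b * x ** (-(zeta + 1.0))


def integrate_on_log_grid(pdf, lo, hi, steps=100_000):
    """Trapezoid rule for the integral of pdf over [lo, hi], using u = log x."""
    u0, u1 = math.log(lo), math.log(hi)
    h = (u1 - u0) / steps
    total = 0.0
    for i in range(steps + 1):
        x = math.exp(u0 + i * h)
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * pdf(x) * x  # dx = x du
    return total * h


# Hypothetical parameter values for illustration (tau > exp(mu)).
params = dict(mu=7.0, sigma=1.5, tau=60_000.0, zeta=1.2)

left = lnup_pdf(params["tau"] * (1.0 - 1e-9), **params)
right = lnup_pdf(params["tau"], **params)
mass = integrate_on_log_grid(lambda x: lnup_pdf(x, **params), 1.0, 1e20)
print(left, right, mass)
```

Integrating on a logarithmic grid keeps the trapezoid rule accurate over the many orders of magnitude that a Pareto tail spans.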

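For n i.i.d. samples of x, the joint log likelihood is

$$L(\tau_u, \mu_u, \sigma_u, \zeta_u \mid x) = \begin{cases} \sum_{i=1}^{n_{\tau_u}-1} \log[b(\cdot)] + \sum_{i=1}^{n_{\tau_u}-1} \log[f(x_i; \cdot)], & x_{\min} < x < \tau_u \\ \sum_{i=n_{\tau_u}}^{n} \log[a(\cdot)] + \sum_{i=n_{\tau_u}}^{n} \log[b(\cdot)] + \sum_{i=n_{\tau_u}}^{n} \log[g(x_i; \cdot)], & \tau_u \le x < \infty. \end{cases} \tag{5}$$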

Since an analytical solution does not exist for this likelihood function, we numerically estimate the parameters $\tau_u$, $\mu_u$, $\sigma_u$, and $\zeta_u$.

2.4. Pareto Tails-lognormal

Though DPLN models the power-law behavior in both tails, it does not pin down the switching points in both tails. Also, the LNUP distribution does capture the transition point from lognormal to upper-tail Pareto, but not from the lower-tail Pareto to lognormal. We consider a distribution that extends the LNUP distribution to identify the switching point in the lower tail also [25]. The PDF of this new distribution is

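$$h_b(x; \mu_b, \sigma_b, \eta_b, \beta_b, \tau_b, \zeta_b) = \begin{cases} c(\mu_b, \sigma_b, \eta_b, \beta_b)\, b(\mu_b, \sigma_b, \beta_b, \eta_b, \tau_b, \zeta_b)\, g_l(x; \beta_b, \eta_b), & x_{\min} < x < \eta_b \\ b(\mu_b, \sigma_b, \beta_b, \eta_b, \tau_b, \zeta_b)\, f(x; \mu_b, \sigma_b), & \eta_b \le x < \tau_b \\ a(\mu_b, \sigma_b, \tau_b, \zeta_b)\, b(\mu_b, \sigma_b, \beta_b, \eta_b, \tau_b, \zeta_b)\, g_u(x; \zeta_b, \tau_b), & \tau_b \le x < \infty, \end{cases}$$

which consists of three components: lower-tail power law, lognormal body, and upper-tail power law. The lower-tail power law with shape parameter $\beta_b$ and lower-tail transition point $\eta_b$ is

$$g_l(x; \beta_b, \eta_b) = x^{\beta_b - 1}, \quad x_{\min} < x < \eta_b, \ \beta_b > 0.$$

The lognormal body with mean $\mu_b$ and variance $\sigma_b^2$ is

$$f(x; \mu_b, \sigma_b) = \frac{1}{x \sigma_b (2\pi)^{1/2}} \exp\left(-\frac{(\log x - \mu_b)^2}{2\sigma_b^2}\right), \quad \eta_b \le x < \tau_b,$$

where $\tau_b$ pinpoints the upper-tail switching point. The upper-tail power law with shape parameter $\zeta_b$ is

$$g_u(x; \zeta_b, \tau_b) = \frac{1}{x^{\zeta_b + 1}}, \quad \tau_b \le x, \ \zeta_b > 0.$$

The terms

$$c(\mu_b, \sigma_b, \eta_b, \beta_b) = \frac{f(\eta_b; \mu_b, \sigma_b)}{g_l(\eta_b; \beta_b, \eta_b)}, \qquad a(\mu_b, \sigma_b, \tau_b, \zeta_b) = \frac{f(\tau_b; \mu_b, \sigma_b)}{g_u(\tau_b; \zeta_b, \tau_b)}$$

are normalization constants which, respectively, preserve continuity at the transition points from lower-tail Pareto to lognormal body and from lognormal body to upper-tail Pareto. The normalization constant

$$b(\mu_b, \sigma_b, \beta_b, \eta_b, \tau_b, \zeta_b) = \frac{1}{f(\eta_b; \mu_b, \sigma_b)\, \dfrac{\eta_b^{1-\beta_b}\left(\eta_b^{\beta_b} - x_{\min}^{\beta_b}\right)}{\beta_b} + \Phi(\tau_b; \mu_b, \sigma_b) - \Phi(\eta_b; \mu_b, \sigma_b) + f(\tau_b; \mu_b, \sigma_b)\, \dfrac{\tau_b}{\zeta_b}}$$

ensures that the integration of the density function $h_b(\cdot)$ over its support yields one, where $\Phi(\cdot)$ is the CDF of the lognormal. For a unimodal distribution, $x_{\min} < \eta_b < \exp(\mu_b) < \tau_b$. Also, $\tau_b$ must be finite to ensure that the probabilities are between zero and one. When $\eta_b = x_{\min}$, the entire lower tail is captured by the lognormal and PTLN becomes LNUP. As in the LNUP, this distribution also nests Zipf's law in the upper tail for $\zeta_b = 1$. When $\eta_b \to x_{\min}$ and $\tau_b \to \max(x)$, the Pareto tails are not relevant and PTLN becomes LN.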
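A small sketch can make the three constants concrete. It evaluates $a$, $b$, and $c$ at the PTLN estimates reported in Table 1, splits the unit probability mass into lower tail, body, and upper tail, and checks the nesting property that setting $\eta_b = x_{\min}$ reproduces the LNUP constant $b$. The function names and the choice $x_{\min} = 1$ (the smallest city size in the data) are assumptions of this illustration, not the authors' code.

```python
import math


def ln_pdf(x, mu, sigma):
    """Lognormal density f(x; mu, sigma)."""
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2.0 * math.pi))


def ln_cdf(x, mu, sigma):
    """Lognormal CDF."""
    return 0.5 * math.erfc(-(math.log(x) - mu) / (sigma * math.sqrt(2.0)))


def ptln_constants(mu, sigma, eta, beta, tau, zeta, x_min=1.0):
    """Return (a, b, c) and the unnormalized mass of each segment."""
    f_eta = ln_pdf(eta, mu, sigma)
    f_tau = ln_pdf(tau, mu, sigma)
    c = f_eta * eta ** (1.0 - beta)  # f(eta) / g_l(eta), with g_l(x) = x^(beta-1)
    a = f_tau * tau ** (zeta + 1.0)  # f(tau) / g_u(tau), with g_u(x) = x^-(zeta+1)
    lower = c * (eta ** beta - x_min ** beta) / beta
    body = ln_cdf(tau, mu, sigma) - ln_cdf(eta, mu, sigma)
    upper = f_tau * tau / zeta
    b = 1.0 / (lower + body + upper)
    return a, b, c, (lower, body, upper)


# PTLN estimates as reported in Table 1.
est = dict(mu=6.84, sigma=1.00, eta=353.0, beta=0.96, tau=9871.0, zeta=1.34)
a, b, c, masses = ptln_constants(**est)
lower_share, body_share, upper_share = (b * m for m in masses)
print(lower_share, body_share, upper_share)

# Nesting check: with eta = x_min the lower-tail term vanishes and b equals
# the LNUP normalization constant.
nested = dict(est, eta=1.0)
_, b_nested, _, _ = ptln_constants(**nested)
b_lnup = 1.0 / (ln_cdf(est["tau"], est["mu"], est["sigma"])
                - ln_cdf(1.0, est["mu"], est["sigma"])
                + ln_pdf(est["tau"], est["mu"], est["sigma"]) * est["tau"] / est["zeta"])
print(b_nested, b_lnup)
```

The three shares are model-implied probabilities, so they can be compared with the empirical shares of cities that fall below $\eta_b$ and above $\tau_b$.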

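For n i.i.d. samples of x, the joint log likelihood is

$$L(\mu_b, \sigma_b, \eta_b, \beta_b, \tau_b, \zeta_b \mid x) = \begin{cases} \sum_{i=1}^{n_{\eta_b}-1} \log[c(\cdot)] + \sum_{i=1}^{n_{\eta_b}-1} \log[b(\cdot)] + \sum_{i=1}^{n_{\eta_b}-1} \log[g_l(x_i; \cdot)], & x_{\min} < x < \eta_b \\ \sum_{i=n_{\eta_b}}^{n_{\tau_b}} \log[b(\cdot)] + \sum_{i=n_{\eta_b}}^{n_{\tau_b}} \log[f(x_i; \cdot)], & \eta_b \le x < \tau_b \\ \sum_{i=n_{\tau_b}+1}^{n} \log[a(\cdot)] + \sum_{i=n_{\tau_b}+1}^{n} \log[b(\cdot)] + \sum_{i=n_{\tau_b}+1}^{n} \log[g_u(x_i; \cdot)], & \tau_b \le x < \infty. \end{cases} \tag{6}$$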

Since the maximization of this log-likelihood function does not yield an analytical solution, we numerically estimate the parameters $\mu_b$, $\sigma_b$, $\eta_b$, $\beta_b$, $\tau_b$, and $\zeta_b$. As documented by Urzua [3] and Luckstead and Devadoss [5], the truncation point influences the slope parameter of the upper-tail Pareto, allowing the analyst to home in on the truncation point, and thus the sample size, that results in a Pareto estimate close to one to support Zipf's law. The advantage of PTLN is that it parametrically estimates the lower- and upper-tail cutoff points, and thus one does not have to select these truncation points arbitrarily.

2.5. Pareto tails-lognormal with differentiability

The PTLN distribution imposes continuity, but not differentiability, at the transition points. For the PTLND, in addition to the continuity conditions $c(\mu_c, \sigma_c, \eta_c, \beta_c)$ and $a(\mu_c, \sigma_c, \tau_c, \zeta_c)$, we also impose the differentiability conditions at the transition points, which results in the restrictions

$$\beta_c = -\frac{\log \eta_c - \mu_c}{\sigma_c^2} \tag{7}$$

$$\zeta_c = \frac{\log \tau_c - \mu_c}{\sigma_c^2} \tag{8}$$

on the estimation. However, because the Indian city size dataset contains over 600,000 observations, the differentiability conditions are not required for the asymptotic efficiency of the maximum likelihood estimates based on the Fisher information matrix; the parameters are asymptotically efficient without them. Consequently, in this case, the differentiability conditions only help to ensure that the model is not overfitted, because they reduce the number of parameters from six ($\mu_b$, $\sigma_b$, $\eta_b$, $\beta_b$, $\tau_b$, and $\zeta_b$) in PTLN to four ($\mu_c$, $\sigma_c$, $\eta_c$, and $\tau_c$) in PTLND, as $\beta_c$ and $\zeta_c$ are restricted by the above two equations. Henceforth, we denote the PTLN distribution with the differentiability conditions as PTLND. We impose the two restrictions (7) and (8) in the likelihood function (6) to estimate the four PTLND parameters: $\mu_c$, $\sigma_c$, $\eta_c$, and $\tau_c$.
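Restrictions (7) and (8) follow from matching the log-log slope of the lognormal density, $-1 - (\log x - \mu_c)/\sigma_c^2$, to the slopes $\beta_c - 1$ and $-\zeta_c - 1$ of the two power laws at $\eta_c$ and $\tau_c$. As a brief sketch (the function name is an assumption), the implied tail exponents can be computed from the four PTLND estimates reported in Table 1:

```python
import math


def ptlnd_slopes(mu, sigma, eta, tau):
    """Tail exponents implied by the differentiability restrictions (7) and (8)."""
    beta = -(math.log(eta) - mu) / sigma ** 2   # restriction (7)
    zeta = (math.log(tau) - mu) / sigma ** 2    # restriction (8)
    return beta, zeta


# PTLND estimates as reported in Table 1.
beta_c, zeta_c = ptlnd_slopes(mu=6.84, sigma=0.98, eta=375.06, tau=4385.43)
print(round(beta_c, 2), round(zeta_c, 2))  # prints 0.95 1.61
```

Because the inputs are the rounded table values, the result is only accurate to about two decimals.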

3. Empirical analysis

3.1. Data

Since India has been a predominantly agrarian country for most of the twentieth century, census data did not include population data for all cities. But in recent decades, with economic reforms and technological development, enumerating the population for the census became possible. As a result, the two most recent censuses were able to report population data for all cities. We classify cities as towns or villages according to the Census of India [26] based on the following criteria.6 All places with a municipality, corporation, or cantonment board are defined as urban areas or towns. These towns contain a minimum population of 5000, have at least 75% of the male workers employed in non-agricultural occupations, and have a population density of at least 400 per square km. Places that do not meet the criteria for a town are classified as rural areas or villages.

6 In this paper, we utilize the census definition of cities and towns. However, using spatial distributions of city populations, clustering of metropolitan areas has been employed to define areas without relying on administrative boundaries. Makse et al. [31] develop a physical model where "units" are correlated to clusters. Rozenfeld et al. [32] propose the City Clustering Algorithm (CCA), as an alternative to Metropolitan Statistical Areas covered by the Census Bureau, to define cities based on the spatial distribution of their population. The results provide evidence that Gibrat's law (the mean and standard deviation of city growth rates are constant and do not depend on city size) fails to hold when CCA is utilized. Oliveira et al. [33] employ the CCA and carbon data and show that cities with a large and more productive population emit proportionally more CO2 relative to cities with a small population.

Fig. 1. Histogram of the log of Indian cities (horizontal axis: log city size).

According to the 2011 census [26], India is the second largest country with a total population of 1.21 billion people, and it is projected to surpass China by 2028 as the most populous country [34]. Even with economic reforms and higher growth, about two-thirds of the population lived in rural villages and one-third of the population lived in urban towns in 2011 [26]. Thus, India provides a unique dataset to analyze the city size distribution for all cities.

We collected data for all Indian cities from the 2011 Indian Census [26]. The total number of cities in India is 605,686,7 ranging in size from the smallest city of one to the largest city (Mumbai) with a population of 12,442,373. Fig. 1 plots the histogram of Indian city sizes in log scale. While this histogram appears fairly lognormal, the Anderson normality statistic8 of 3631 (p-value < 0.001) strongly rejects lognormality. A closer examination of the histogram shows it is skewed to the left (third moment of -0.59), and furthermore the median city size is 6.74, which is less than the mean of 7.60.

3.2. Parameter estimates and discussion

Table 1 reports parameter estimates for the five distributions. For lognormal, the estimated mean $\mu$ and standard deviation $\sigma$ are, respectively, 6.63 and 1.33, which are highly significant as indicated by the small standard errors. Our estimates for the mean and standard deviation for Indian cities are smaller than the mean of 7.28 and standard deviation of 1.75 reported by Eeckhout [19] for the US full sample. Because India is a predominantly rural country with a large number of villages, the average city size is smaller than the average US city size. Furthermore, even though the difference between the largest and smallest Indian cities is greater than the same difference for US cities, there is a greater concentration of cities around the mean in India than in the United States, and hence, the standard deviation for Indian cities is smaller than that for the US cities.

For the DPLN, the estimated mean $\mu_d$ is 7.03 and standard deviation $\sigma_d$ is 0.61, with lower-tail Pareto slope $\beta_d$ of 0.99 and upper-tail Pareto slope $\zeta_d$ of 1.65. All four estimated parameter values are highly significant as shown by the small values of their respective standard errors. Recent papers that have studied the Indian upper-tail Pareto distribution have reported differing values for the Pareto estimates. For example, Gangopadhyay and Basu [28] used urban agglomeration data from the 2001 Indian census and found upper-tail Pareto shape parameter estimates of 2.03 and 1.88 for the truncation points of 203,380 and 10,000, respectively. In contrast, Luckstead and Devadoss [29] used United Nations urban agglomeration data for 2010 and obtained a Pareto shape parameter estimate of 1.16 for the truncation point of 746,000. As Urzua [3] and Luckstead and Devadoss [5] have shown, the truncation points influence the value of the Pareto estimates, which is evident from the differing estimates from Gangopadhyay and Basu [28] and Luckstead and Devadoss [29]. As pointed out by Giesen et al. [9], one drawback of this distribution is that we cannot ascertain the exact cutoff points where lower-tail Pareto ends and lognormal body starts and where lognormal body ends and upper-tail Pareto starts. Despite this problem, Giesen et al. [9] and González-Val et al. [10] found that DPLN is the preferred model for the size distribution of all cities. Given the high precision of the upper-tail Pareto estimate, Zipf's law ($\zeta_d = 1$) is easily rejected based on the likelihood ratio test with a p-value of <0.0001. The lower-tail estimate of 0.99 is consistent with a recent study by Devadoss et al. [30], whose estimates range from 0.995 to 1.003 for truncation points of 60 and 120, respectively. This suggests that the lower-tail Pareto estimate is not as sensitive to the truncation point as the upper-tail estimate.

7 The observations in the dataset were population totals for 606,017 individual towns/villages, each of which was identified as rural or urban. We identified 331 of the original urban observations as having duplicate identifiers. Based on our analysis, the population for a city was reported for both a part of the entity and for the entire entity. To address this problem, we kept only one of the duplicate identifiers in the dataset, the entry with the larger population reported. Our final dataset contained 605,686 observations after the 331 duplicates were removed. Furthermore, population totals in the analysis dataset were compared to the published abstract by the Ministry: the rural population exactly matched the report; the urban population was 0.004% higher than that in the report.

8 The null hypothesis for this test is that the data are from a normal population.

Table 1
Parameter estimates for the five distributions.

| Model | Parameter | Est. value | Std. error (a) | Test statistics |
|-------|-----------|------------|----------------|-----------------|
| LN | $\mu$ | 6.63 | $3.17 \times 10^{-8}$ | AIC = 10,092,686.08 |
| | $\sigma$ | 1.33 | $9.67 \times 10^{-8}$ | BIC = 10,092,708.71 |
| | | | | BF < 0.001 |
| DPLN | $\beta_d$ | 0.99 | $9.86 \times 10^{-5}$ | AIC = 10,040,429.62 |
| | $\mu_d$ | 7.03 | $1.90 \times 10^{-6}$ | BIC = 10,040,474.88 |
| | $\sigma_d$ | 0.61 | $1.36 \times 10^{-4}$ | BF < 0.001 |
| | $\zeta_d$ | 1.65 | $2.31 \times 10^{-4}$ | |
| LNUP | $\mu_u$ | 9.60 | $4.15 \times 10^{-3}$ | AIC = 10,049,007.07 |
| | $\sigma_u$ | 2.38 | $9.70 \times 10^{-4}$ | BIC = 10,049,052.32 |
| | $\tau_u$ | 1425.00 | 0.71 | BF < 0.001 |
| | $\zeta_u$ | 1.38 | $3.22 \times 10^{-4}$ | |
| PTLN | $\eta_b$ | 353.00 | 2.94 | AIC = 10,038,960.43 |
| | $\beta_b$ | 0.96 | $6.77 \times 10^{-5}$ | BIC = 10,039,028.31 |
| | $\mu_b$ | 6.84 | $1.95 \times 10^{-4}$ | |
| | $\sigma_b$ | 1.00 | $5.33 \times 10^{-5}$ | |
| | $\tau_b$ | 9871.00 | 8.78 | |
| | $\zeta_b$ | 1.34 | $3.27 \times 10^{-5}$ | |
| PTLND | $\eta_c$ | 375.06 | $4.28 \times 10^{-2}$ | AIC = 10,039,433.41 |
| | $\mu_c$ | 6.84 | $1.20 \times 10^{-4}$ | BIC = 10,039,478.67 |
| | $\sigma_c$ | 0.98 | $1.63 \times 10^{-4}$ | BF < 0.001 |
| | $\tau_c$ | 4385.43 | 6.46 | |

(a) Standard errors are calculated based on the Fisher information matrix.

The LNUP provides more flexibility in the upper tail than DPLN by endogenously estimating the switching point in the upper tail. For the LNUP distribution, the estimated parameter values are 9.60 for the mean $\mu_u$, 2.38 for the standard deviation $\sigma_u$, 1425 for the upper-tail cutoff $\tau_u$, and 1.38 for the Pareto slope $\zeta_u$, all of which are highly significant as indicated by the small values of their respective standard errors. The estimated mean and standard deviation of this distribution are different from those of the LN and DPLN distributions, and the reason for these differences is explained in the subsequent sections. Even though India has a much larger population with substantially more cities than the United States, the estimate indicates that the upper-tail switching point from lognormal to Pareto occurs at 1425 for India compared to 60,290 for the United States as found by Ioannides and Skouras [22]. The reason for this result is that India is a predominantly rural country with relatively more small cities than the United States; consequently, the truncation point is at a lower point with an elongated upper tail. The cutoff city size of 1425 implies that 190,884 (31.52%) of cities are in the upper tail with a population of 968,602,917 (80.00% of the total population). For the United States, Ioannides and Skouras [22] report that only 2% of places, or 501 cities with 46% of the population, are in the upper tail. As in DPLN, the upper-tail Pareto slope estimate of 1.38 is highly significant, implying that it does not adhere to Zipf's law, which is also confirmed by the likelihood ratio test with a p-value < 0.0001.

The PTLN distribution includes parameter estimates for the lower tail with cutoff point $\eta_b$ of 353 and reverse Pareto slope $\beta_b$ of 0.96, lognormal body with mean $\mu_b$ of 6.84 and standard deviation $\sigma_b$ of 1.00, and upper tail with cutoff point $\tau_b$ of 9871 and Pareto slope $\zeta_b$ of 1.34. Since LNUP combines both the lower tail and body with lognormal, it does not explicitly model the lower tail as Pareto and overestimates the scale and shape parameters ($\mu_u = 9.60$ and $\sigma_u = 2.38$) compared to PTLN with $\mu_b = 6.84$ and $\sigma_b = 1.00$. Since the cutoff point in the lower tail is considerably higher than the smallest city size and the cutoff point in the upper tail is markedly lower than the largest city size and both are highly significant, the tail behavior is clearly represented by the Pareto distribution. This result is also supported by both graphical and more formal statistical tests, which are presented below. This suggests that PTLN provides flexibility in accurately estimating the Indian city size distribution without undue complexity or overfitting, as confirmed by AIC and BIC statistics below.
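The information criteria in Table 1 can be cross-checked with a short calculation. The sketch below assumes the standard definitions AIC = 2k - 2 log L and BIC = k log n - 2 log L (the text does not spell these out), under which the gap BIC - AIC equals k(log n - 2) for a model with k free parameters. The parameter counts are taken from the number of estimated parameters listed for each model in Table 1.

```python
import math

N_OBS = 605_686  # number of cities in the final dataset

# (model, free parameters k, AIC, BIC) as reported in Table 1.
TABLE_1 = [
    ("LN", 2, 10_092_686.08, 10_092_708.71),
    ("DPLN", 4, 10_040_429.62, 10_040_474.88),
    ("LNUP", 4, 10_049_007.07, 10_049_052.32),
    ("PTLN", 6, 10_038_960.43, 10_039_028.31),
    ("PTLND", 4, 10_039_433.41, 10_039_478.67),
]


def bic_minus_aic(k, n):
    """Gap implied by AIC = 2k - 2 log L and BIC = k log n - 2 log L."""
    return k * (math.log(n) - 2.0)


for name, k, aic, bic in TABLE_1:
    # Reported gap next to the gap implied by the assumed definitions.
    print(name, round(bic - aic, 2), round(bic_minus_aic(k, N_OBS), 2))

best_by_aic = min(TABLE_1, key=lambda row: row[2])[0]
best_by_bic = min(TABLE_1, key=lambda row: row[3])[0]
print(best_by_aic, best_by_bic)
```

Under these assumed definitions, a lower value indicates the preferred model, so the last line reports which distribution each criterion selects despite the penalty on additional parameters.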

The lower-tail cutoff point of 353 indicates that there are 143,302 cities (23.66% of total cities) with a population of 25 million in the lower tail. In contrast, Devadoss and Luckstead [35] presented, for the 2010 census data of the United States, a lower-tail cutoff estimate of 158 with 3582 cities (12.25% of cities) and a population of 311,769. These results show that there are considerably more cities with a larger population in the lower tail for India than in the United States. The lower-tail Pareto slope of 0.96 is close to 0.99 for the DPLN. The mean (standard deviation) estimate of the lognormal body is larger (smaller) than that of LN, while the opposite is true for the DPLN. The upper-tail cutoff estimate is significantly higher than that of the LNUP, and the Pareto slope parameter is about the same as that of LNUP but smaller than that of DPLN; the rationale for these results is explained in the subsequent sections. The upper-tail cutoff estimate of 9871 indicates that 10,171 cities (1.68%) with a population of 433 million are in the Indian upper tail.

The PTLND distribution estimates the lower tail with reverse Pareto, yielding the lower-tail cutoff and power-law exponents. The lower-tail cutoff estimate $\eta_c = 375$ indicates that 151,214 cities with 27.9 million inhabitants are in the lower tail. This long lower tail is due to India being a predominantly rural country with numerous small cities and is in sharp contrast to the short lower tail (only 3582 small cities) for the United States as found by Luckstead and Devadoss [25]. The reverse-Pareto exponent $\beta_c = 0.95$ is calculated from the differentiability restriction (7). Comparison of these estimates to those of PTLN reveals that the lower-tail cutoff and reverse-Pareto exponent are similar under both distributions.

Unlike LNUP, both PTLN and PTLND estimate the lower tail and distinguish the body from the lower tail. Consequently, both PTLN and PTLND have similar estimates for the mean and standard deviation of the body. However, the estimates for the upper tail differ significantly between the two distributions, with upper-tail cutoff estimates of $\tau_b = 9871$ for PTLN and $\tau_c = 4385$ for PTLND. These cutoff estimates imply that PTLN includes 10,171 cities in the upper tail with 433 million people, while PTLND predicts 38,438 cities with 607 million people. While the slope parameter for PTLN, $\zeta_b = 1.34$, is lower than that for PTLND, $\zeta_c = 1.61$ (calculated from Eq. (8)), both estimates are statistically different from one and fail to confirm Zipf's law for the upper tail. These results indicate that imposing the differentiability conditions causes PTLND to underestimate the upper-tail cutoff city and overestimate the Pareto slope parameter. Thus, the added flexibility of the unrestricted slope parameters in PTLN improves the estimation of the upper tail.
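The tail shares quoted in this section follow from simple arithmetic on the reported city counts. The snippet below is a small illustration (the dictionary layout is an assumption) that converts the counts into percentages of the 605,686 cities in the final dataset.

```python
TOTAL_CITIES = 605_686  # observations in the final dataset

# Number of cities in each estimated tail, as reported in the text.
tail_counts = {
    ("LNUP", "upper"): 190_884,
    ("PTLN", "lower"): 143_302,
    ("PTLN", "upper"): 10_171,
    ("PTLND", "lower"): 151_214,
    ("PTLND", "upper"): 38_438,
}

# Share of all cities, in percent, for each model and tail.
shares = {key: 100.0 * count / TOTAL_CITIES for key, count in tail_counts.items()}

for (model, tail), pct in shares.items():
    print(f"{model:6s} {tail:5s} tail: {pct:6.2f}% of cities")

# Cities left in the PTLN lognormal body.
ptln_body = TOTAL_CITIES - tail_counts[("PTLN", "lower")] - tail_counts[("PTLN", "upper")]
print(ptln_body)
```

The same calculation applied to the PTLND counts shows how much larger its upper tail is than the one estimated by PTLN.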

For four distributions (DPLN, LNUP, PTLN, PTLND), the upper-tail Pareto slope estimates range from 1.34 to 1.65, which underscores the relatively robust Pareto behavior of the large Indian cities. Based on these estimates and their significance, the upper tail clearly does not follow Zipf's law. The lower-tail Pareto estimate ranges from 0.95 for PTLND to 0.99 for DPLN, indicating the robust reverse-Pareto behavior of small Indian cities.
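To make the comparison of a slope estimate with Zipf's law (slope equal to one) concrete, the sketch below uses the textbook conditional maximum-likelihood (Hill-type) estimator of a Pareto exponent above a known cutoff. This is a swapped-in simplification: the paper estimates cutoffs and slopes jointly within each composite distribution. The function name, the synthetic data, and the tail size of 2000 are illustrative assumptions.

```python
import math


def pareto_slope_mle(sizes, cutoff):
    """Maximum-likelihood Pareto exponent for sizes above a known cutoff.

    Returns (slope, standard_error, z_statistic_against_one).
    """
    tail = [s for s in sizes if s > cutoff]
    n = len(tail)
    slope = n / sum(math.log(s / cutoff) for s in tail)
    std_err = slope / math.sqrt(n)       # asymptotic standard error
    z_zipf = (slope - 1.0) / std_err     # distance from Zipf's law (slope = 1)
    return slope, std_err, z_zipf


# Synthetic upper tail: exact Pareto quantiles with slope 1.34 above 9871.
n_tail, true_slope, cutoff = 2000, 1.34, 9871.0
synthetic = [cutoff * (1 - (i - 0.5) / n_tail) ** (-1 / true_slope)
             for i in range(1, n_tail + 1)]
slope, std_err, z_zipf = pareto_slope_mle(synthetic, cutoff)
print(round(slope, 3), round(std_err, 3), round(z_zipf, 1))
```

A z-statistic well above conventional critical values, as in this synthetic case, is the kind of evidence that leads to rejecting a slope of one.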

3.3. Graphical analyses

To assess the goodness of fit of the five distributions, we first present several graphical analyses and then provide formal statistical results. Fig. 2 depicts the rank-size plot in log-log scale in ascending order, which accentuates the lower tail (see footnote 9). The solid black line represents the Indian city size data. LN deviates from the data at a log city size of about 5.8, where the Pareto behavior in the lower tail, which LN fails to capture, begins. This deviation also delineates the transition point between the lower-tail Pareto and the lognormal body and is accurately estimated by the lower-tail cutoff point of log(353) = 5.87 from the PTLN distribution. LNUP performs considerably better than LN; however, it does not replicate the lower-tail data as well as PTLN and PTLND do. The performance of DPLN in the lower tail is comparable to that of PTLN and PTLND. The better performance of LNUP compared to LN is due to modeling the upper tail, which allows for more flexibility in estimating the lognormal body. The large standard deviation estimate of 2.38 indicates that LNUP extends the lognormal body further into the lower tail. Because the Indian city size data have pronounced power-law behavior in the lower tail, DPLN, PTLN, and PTLND, which explicitly model the reverse Pareto behavior, follow the data very closely, although PTLN and PTLND deliver a slightly better fit.
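The construction of an ascending rank-size plot can be sketched as follows. Under a reverse Pareto lower tail the number of cities below size x grows like a power of x, so log rank is linear in log size with a slope equal to the reverse-Pareto exponent. The helper names and the synthetic data are assumptions for illustration, and the least-squares slope is only a visual check, not the estimation method used in the paper.

```python
import math


def rank_size_ascending(sizes):
    """Return (log size, log rank) pairs with rank 1 for the smallest city."""
    ordered = sorted(sizes)
    return [(math.log(s), math.log(rank)) for rank, s in enumerate(ordered, start=1)]


def ols_slope(points):
    """Least-squares slope of the second coordinate on the first."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


# Synthetic lower tail in which rank grows as size ** 0.96 (hypothetical data).
beta = 0.96
toy_sizes = [rank ** (1 / beta) for rank in range(1, 201)]
points = rank_size_ascending(toy_sizes)
slope = ols_slope(points)
print(round(slope, 4))
```

Sorting in descending order instead gives the plot that accentuates the upper tail, where the slope of log rank on log size is the negative of the Pareto exponent.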

Fig. 3 graphs in ascending order the mid-range lognormal body between the lower-tail cutoff point of log(353) = 5.87 and the upper-tail cutoff point of log(9871) = 9.20 as estimated by PTLN. Thus, this part of the lognormal body corresponds to the range of 5.87-9.20 on the horizontal axis in Fig. 2. Because LN misspecifies the tail behavior, it accommodates the tails by adjusting the mean and standard deviation; consequently, it fails to accurately model the lognormal body, as is evident from its overestimation of the data. A similar bias in the mean and standard deviation occurs for LNUP as it does not model the lower-tail Pareto behavior, which results in under- and overestimation of the mid-range body. In contrast, as DPLN, PTLN, and PTLND model the Pareto-tail behavior, these distributions accurately predict the lognormal body data.

Much of the discussion in the literature is concerned with whether the upper tail of the distribution can be represented by lognormal or Pareto, to which we turn our attention next. Fig. 4 depicts the rank-size plot in descending order, which accentuates the upper tail. As in the previous two figures, LN predicts the data less accurately than the other four distributions, and PTLN provides a better fit over the full range of the upper tail. Though the PTLN and LNUP distributions predict the extreme upper-tail data well, LNUP overestimates the city sizes in the range of about 8.5-11. The upper-tail cutoff point for the PTLN (log(9871) = 9.20) is markedly higher than that for LNUP (log(1425) = 7.26). The reason for the early cutoff point and the different estimates for the lognormal body by LNUP is that it does not explicitly model the lower tail, which affects its prediction of the lognormal body and ultimately the upper tail. From the comparison of LN to the data, we can observe a clear deviation from the lognormal body around 9, indicating that the PTLN estimate of the cutoff point of 9.20 is more accurate than the LNUP estimate of 7.26.

9 For LN, LNUP, PTLN, and PTLND, we inverted the CDF to generate predicted city sizes. Because the CDF of DPLN cannot be inverted, we used the Accept-Reject method to generate predicted city sizes for the DPLN [36]. We used both the LN and PTLN as source distributions, and both generated very similar datasets. We graph the data based on the PTLN source distribution.
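The two simulation devices mentioned in footnote 9 can be sketched generically. The first function inverts a lognormal CDF at evenly spaced probabilities; the second is a plain accept-reject sampler that needs a finite bound on the ratio of target to source density over the range being simulated. The plotting positions (i - 0.5)/n, the function names, and the toy target density are assumptions; the DPLN density itself is not implemented here.

```python
import math
import random
from statistics import NormalDist


def lognormal_quantile(p, mu, sigma):
    """Inverse CDF of a lognormal with log-mean mu and log-sd sigma."""
    return math.exp(mu + sigma * NormalDist().inv_cdf(p))


def predicted_sizes(n, mu, sigma):
    """Predicted sizes at plotting positions (i - 0.5) / n (assumed convention)."""
    return [lognormal_quantile((i - 0.5) / n, mu, sigma) for i in range(1, n + 1)]


def accept_reject(target_pdf, source_pdf, source_draw, bound, n, rng):
    """Draw n values from target_pdf, given target_pdf(x) <= bound * source_pdf(x)."""
    draws = []
    while len(draws) < n:
        x = source_draw(rng)
        # Accept x with probability target_pdf(x) / (bound * source_pdf(x)).
        if rng.random() * bound * source_pdf(x) <= target_pdf(x):
            draws.append(x)
    return draws


# Toy target f(x) = 2x on [0, 1] with a uniform source; neither is from the paper.
rng = random.Random(0)
sample = accept_reject(
    target_pdf=lambda x: 2.0 * x,
    source_pdf=lambda x: 1.0,
    source_draw=lambda r: r.random(),
    bound=2.0,
    n=20000,
    rng=rng,
)
sample_mean = sum(sample) / len(sample)   # theoretical mean is 2/3
predicted = predicted_sizes(5, mu=7.0, sigma=2.0)
```

In the setting of the footnote, the target would be the DPLN density and the source a fitted LN or PTLN, with the source draws produced by the inverse-CDF step.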

Fig. 2. Rank-size plot, ascending city size.

Fig. 3. Rank size plot for lognormal body, ascending city size.

In the range between 9 and 11, the data appear to follow Pareto behavior, as expected. However, after 11, the data show a slight deviation from the estimated Pareto slope. This deviation seems to be unique to the Indian city size data and is not captured by any of the distributions. LNUP seems to accommodate this deviation by estimating a higher lognormal mean and standard deviation and a much earlier cutoff. By doing so, it averages out the deviation in the Pareto tail by overestimating in the data range of 8.5-11 and underestimating in the range 11-16. In contrast, PTLN accurately predicts the lognormal body, the transition to Pareto at about 9, and the Pareto slope until the data deviate around 11. Thus, based on these graphical analyses, we can conclude that PTLN is a better predictor of Indian city sizes than LNUP. The reason for this result is that, given the large proportion of inhabitants in the lower tail, PTLN, by explicitly modeling the lower tail, allows for more flexibility and more accurate estimates of the lognormal body and consequently of the upper-tail cutoff point. DPLN does not perform well in predicting the data in the upper tail as it does not model the switching points. Similarly, PTLND does not capture the upper tail well because of the restriction on the upper-tail transition point.

As Reed [8] observed in his study of US settlements, our rank-size plots show both the lower- and upper-tails exhibit linearity, confirming the Pareto behavior of both tails, and thus, it is important to model not only the upper-tail power law, but also the reverse Pareto behavior.

Fig. 4. Rank-size plot, descending city size.


Fig. 5. QQplot. Horizontal axis: quantiles of actual city size.

Fig. 5 presents the QQplot of simulated city sizes from the five distributions against the actual data. If the predicted city sizes replicate the actual data generating process, then the resulting line will be straight along the 45° line. It is clear from the figure that LN systematically deviates from the actual data both in the lower- and upper-tail. LNUP performs considerably better than LN, but not as well as PTLN, whereas DPLN and PTLND miss the upper tail. Thus, the QQ plot reinforces the graphical findings presented in Figs. 2-4.
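A QQ comparison of this kind pairs the quantiles of the actual data with those of a simulated sample at common probabilities; points on the 45° line indicate agreement. The sketch below uses a nearest-rank quantile rule and synthetic log sizes, both of which are assumptions made for illustration rather than details from the paper.

```python
def empirical_quantile(sorted_values, p):
    """Nearest-rank empirical quantile for 0 < p < 1."""
    index = min(len(sorted_values) - 1, int(p * len(sorted_values)))
    return sorted_values[index]


def qq_pairs(actual, simulated, n_points=99):
    """Pair quantiles of actual and simulated data at common probabilities."""
    a, s = sorted(actual), sorted(simulated)
    probs = [(k + 0.5) / n_points for k in range(n_points)]
    return [(empirical_quantile(a, p), empirical_quantile(s, p)) for p in probs]


def max_gap_from_diagonal(pairs):
    """Largest vertical distance between the QQ points and the 45-degree line."""
    return max(abs(sim - act) for act, sim in pairs)


# Hypothetical log city sizes; the second sample is shifted up by 0.5.
actual_logs = [5.0 + 0.01 * i for i in range(1000)]
shifted_logs = [x + 0.5 for x in actual_logs]
gap_same = max_gap_from_diagonal(qq_pairs(actual_logs, actual_logs))
gap_shift = max_gap_from_diagonal(qq_pairs(actual_logs, shifted_logs))
print(gap_same, round(gap_shift, 3))
```

A distribution that misses a tail would show up as a gap that grows toward that end of the plot rather than the constant shift used in this toy case.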

Fig. 6 graphs the absolute difference between the CDF based on the kernel density estimates for the city size data and that for the simulated city sizes from each of the five distributions. The point-wise differences are larger for LN and LNUP, but much smaller for DPLN, PTLN, and PTLND. The Kolmogorov-Smirnov (KS) test considers the goodness of fit by analyzing the supremum of the difference between the theoretical and empirical CDF. The supremum differences for LN, LNUP, DPLN, PTLND, and PTLN are, respectively, 0.039, 0.017, 0.003, 0.003, and 0.002, indicating that the KS test would reject LN first and PTLN last. Fig. 7 plots the cumulated deviations of the absolute delta depicted in Fig. 6. Clearly, the cumulated deviations of DPLN, PTLN, and PTLND are below those of LN and LNUP over the range of the data. Comparing the cumulated delta for DPLN, PTLN, and PTLND, the three are very similar over the range 0 to about 7; however, PTLN is below DPLN and PTLND (see footnote 10) over the range 7 to about 17.

10 Since the cumulated deviations of the absolute delta are identical for DPLN and PTLND, they are indistinguishable in Fig. 7.

Fig. 6. Absolute delta. Horizontal axis: log city size.

Fig. 7. Cumulated delta. Horizontal axis: log city size.

The maximum cumulated differences are 0.068 for DPLN, 0.067 for PTLND, and 0.043 for PTLN, indicating that DPLN has 58% and PTLND has 55% higher cumulated deviation than that of PTLN, and for PTLN the theoretical CDF better matches the empirical CDF.
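The absolute delta, its supremum, and the cumulated delta can be sketched with empirical step CDFs. This is a simplification: the paper builds the CDFs from kernel density estimates, and the exact accumulation rule behind Fig. 7 is not stated, so the running integral used here is an assumption, as are the function names and the toy samples.

```python
from bisect import bisect_right


def ecdf(sorted_values, x):
    """Empirical CDF: share of observations less than or equal to x."""
    return bisect_right(sorted_values, x) / len(sorted_values)


def delta_curves(actual, simulated):
    """Absolute CDF differences on the pooled grid, plus their running integral."""
    a, s = sorted(actual), sorted(simulated)
    grid = sorted(set(a) | set(s))
    absolute = [abs(ecdf(a, x) - ecdf(s, x)) for x in grid]
    cumulated, total = [], 0.0
    for i, delta in enumerate(absolute):
        cumulated.append(total)   # integral of the absolute delta up to grid[i]
        if i + 1 < len(grid):
            # Both step CDFs are constant between grid points, so this is exact.
            total += delta * (grid[i + 1] - grid[i])
    return grid, absolute, cumulated


# Hypothetical samples, not the city size data.
grid, absolute, cumulated = delta_curves([1, 2, 3, 4], [3, 4, 5, 6])
ks_supremum = max(absolute)
max_cumulated = cumulated[-1]
print(ks_supremum, max_cumulated)
```

Ranking distributions by `ks_supremum` mirrors the ordering by supremum difference, while `max_cumulated` plays the role of the maximum cumulated difference compared across DPLN, PTLND, and PTLN.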

3.4. Statistical tests

We next present more formal statistical results for the goodness of fit of the five distributions. Because PTLN is more flexible than the other four distributions, we analyze model selection using AIC and BIC, which weigh model precision against the number of specified parameters to evaluate whether the additional precision is statistically relevant. The criteria are $\mathrm{AIC}_i = 2\,(k_i - \log L_i)$ and $\mathrm{BIC}_i = k_i \log(n) - 2 \log L_i$, where $k_i$ is the number of estimated parameters for the $i$th distribution, $L_i$ is the value of the maximized likelihood function, and $n$ is the number of observations. The distribution with the lowest AIC and BIC values is the most favored distribution. As presented in Table 1, PTLN has the lowest AIC and BIC statistics, followed by PTLND, DPLN, LNUP, and LN. As one would expect, the distribution with the more flexible functional form estimates the city size distribution with more precision. Furthermore, these tests indicate that, even after being penalized for the additional parameters, PTLN is still the preferred distribution.
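As a minimal sketch of the two criteria, the functions below compute AIC = 2(k - log L) and BIC = k log(n) - 2 log L for each candidate. The parameter counts, log-likelihood values, and sample size are hypothetical and are not the Table 1 results.

```python
import math


def aic(k, log_likelihood):
    """AIC = 2 (k - log L)."""
    return 2.0 * (k - log_likelihood)


def bic(k, log_likelihood, n):
    """BIC = k log(n) - 2 log L."""
    return k * math.log(n) - 2.0 * log_likelihood


# Hypothetical fits: (number of parameters, maximized log-likelihood).
fits = {"simple": (2, -5150.0), "flexible": (6, -5100.0)}
n_obs = 1000
table = {name: (aic(k, ll), bic(k, ll, n_obs)) for name, (k, ll) in fits.items()}
best_by_aic = min(table, key=lambda name: table[name][0])
best_by_bic = min(table, key=lambda name: table[name][1])
print(table, best_by_aic, best_by_bic)
```

In this toy case the more flexible model wins under both criteria because its gain in log-likelihood outweighs the penalty for four extra parameters.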

Based on the graphical, AIC, and BIC evidence, PTLN clearly performs better than the other four distributions. To analyze whether PTLN is a statistical improvement over the other four distributions, we conduct a formal test based on the Bayes factor. This test is a Bayesian counterpart to the likelihood ratio test; however, it is more general than the likelihood ratio test as it can compare the performance of two distributions that do not nest each other, such as PTLN and DPLN. The Bayes factor is approximated using the BIC: $\mathrm{BF} \approx \exp\left(\tfrac{1}{2}\left(\mathrm{BIC}_{\mathrm{PTLN}} - \mathrm{BIC}_i\right)\right)$, where $i$ = LN, LNUP, DPLN, and PTLND. The BF is judged based on Jeffreys' scale, which indicates strong support for the PTLN distribution if $\mathrm{BF} < 1/10$, moderate evidence if $1/10 < \mathrm{BF} < 1/3$, and weak evidence if $1/3 < \mathrm{BF} < 1$ [37]. Given the small value of BF (Table 1), the results show that PTLN is the better suited parametrization of the Indian city size distribution.
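The classification step can be sketched as follows, assuming the BIC approximation BF ≈ exp((BIC_PTLN - BIC_i)/2) and the thresholds 1/10, 1/3, and 1 on the Bayes factor. Working with log BF avoids numerical underflow or overflow when BIC gaps are large. The function names and BIC values are hypothetical and are not those of Table 1.

```python
import math


def log_bayes_factor(bic_ptln, bic_other):
    """Log of BF ~ exp((BIC_PTLN - BIC_other) / 2); negative values favor PTLN."""
    return 0.5 * (bic_ptln - bic_other)


def jeffreys_label(log_bf):
    """Classify support for PTLN using thresholds 1/10, 1/3, and 1 on BF."""
    if log_bf < math.log(1 / 10):
        return "strong"
    if log_bf < math.log(1 / 3):
        return "moderate"
    if log_bf < 0.0:
        return "weak"
    return "none"


# Hypothetical BIC values (PTLN first, competitor second).
examples = {"gap 10": (100.0, 110.0), "gap 3": (100.0, 103.0), "gap 1": (100.0, 101.0)}
labels = {name: jeffreys_label(log_bayes_factor(*pair)) for name, pair in examples.items()}
print(labels)
```

With samples as large as the one studied here, BIC gaps are typically far larger than 10, which is why the resulting Bayes factors are very small.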

4. Conclusion

India is the second most populous country with a much larger rural demography than any other country in the world [27]. In this study, we apply five distributions to model the Indian city size data. Among these five distributions, PTLN consistently delivers a closer fit of the Indian city size data. Thus, while a relatively small percentage of the total population lives in the lower tail, our results show that PTLN, by explicitly modeling both the cutoff and slope of the lower tail, allows for more flexibility leading to unbiased and precise estimates of the lognormal body and predicts the upper-tail Pareto better than other distributions. The added flexibility is justified based on AIC and BIC test results, and PTLN is a statistically better model relative to the other distributions based on the Bayes factor.

Given the significant estimates of the Pareto behavior in both tails, the estimation of all city size distribution by only lognormal clearly misses the Pareto behavior in the tails. Our results also suggest that the improved performance of PTLN relative to LNUP is important, not only to model the upper-tail power law behavior, but also the lower-tail power law behavior, particularly for a developing country such as India where two-thirds of the population live in small villages in rural areas. Though DPLN and PTLN performances are fairly similar in the lower tail and mid-range lognormal body, PTLN has the added advantage of pinpointing the lower- and upper-tail cutoff points. Consequently, it allows us to estimate the proportion of population living in the tail cities and also the number of cities located in these tails. The performance of PTLND is on par with PTLN in the lower tail and body; however, PTLND does not capture the upper tail very well because of the additional restriction on the upper-tail cutoff parameter.

Because of their economic significance, and because about 36% of the population lives in the upper-tail large cities, it is important to model the Pareto behavior of these cities. However, 64% of the population inhabits the mid-range body and lower-tail small cities because of the predominantly rural nature of the Indian demography. Therefore, it is equally important to precisely model the lognormal behavior of the mid-range and the reverse-Pareto behavior of lower-tail cities. The PTLN model predicts well the frequency of all city sizes in these three parts of the Indian city size distribution.


[1] P. Krugman, The Self-Organizing Economy, Blackwell Publishers, Cambridge, Massachusetts, 1996.

[2] X. Gabaix, Zipf's law for cities: An explanation, Quart. J. Econ. 114 (3) (1999) 739-767.

[3] C.M. Urzúa, A simple and efficient test for Zipf's law, Econom. Lett. 66 (3) (2000) 257-260.

[4] R. González-Val, The evolution of US city size distribution from a long-term perspective (1900-2000), J. Reg. Sci. 50 (5) (2010) 952-972.

[5] J. Luckstead, S. Devadoss, Do the world's largest cities follow Zipf's and Gibrat's laws?, Econom. Lett. 125 (2) (2014) 182-186.

[6] Y. Nishiyama, S. Osada, Y. Sato, OLS estimation and the t test revisited in rank-size rule regression, J. Reg. Sci. 48 (4) (2008) 691-716.

[7] W.J. Reed, The Pareto, Zipf and other power laws, Econom. Lett. 74 (1) (2001) 15-19.

[8] W.J. Reed, On the rank-size distribution for human settlements, J. Reg. Sci. 42 (1) (2002) 1-17.

[9] K. Giesen, A. Zimmermann, J. Suedekum, The size distribution across all cities-double Pareto lognormal strikes, J. Urban Econ. 68 (2) (2010) 129-137.

[10] R. González-Val, A. Ramos, F. Sanz-Gracia, M. Vera-Cabello, Size distributions for all cities: Which one is best?, Pap. Reg. Sci. 94 (1) (2015) 177-196.

[11] I. Eliazar, M.H. Cohen, A Langevin approach to the Log-Gauss-Pareto composite statistical structure, Physica A 391 (22) (2012) 5598-5610.

[12] I.I. Eliazar, M.H. Cohen, Econophysical anchoring of unimodal power-law distributions, J. Phys. A 46 (36) (2013) 365001.

[13] I.I. Eliazar, M.H. Cohen, On the physical interpretation of statistical data from black-box systems, Physica A 392 (13) (2013) 2924-2939.

[14] I. Eliazar, M.H. Cohen, The universal macroscopic statistics and phase transitions of rank distributions, Physica A 390 (23) (2011) 4293-4303.

[15] I. Eliazar, M.H. Cohen, Inverted rank distributions: Macroscopic statistics, universality classes, and critical exponents, Physica A 393 (2014) 450-459.

[16] I.I. Eliazar, M.H. Cohen, Rank distributions: A panoramic macroscopic outlook, Phys. Rev. E 89 (1) (2014) 012111.

[17] M. Puente-Ajovin, A. Ramos, On the parametric description of the French, German, Italian and Spanish city size distributions, Ann. Reg. Sci. 54 (2) (2015) 489-509.

[18] A. Ramos, Are the log-growth rates of city sizes distributed normally? Empirical evidence for the USA, Empir. Econom. (2016) 1-15.

[19] J. Eeckhout, Gibrat's law for (All) cities, Amer. Econ. Rev. 94 (5) (2004) 1429-1451.

[20] M. Levy, Gibrat's law for (all) cities: Comment, Amer. Econ. Rev. 99 (4) (2009) 1672-1675.

[21] H. Rozenfeld, D. Rybski, X. Gabaix, H. Makse, The area and population of cities: New insights from a different perspective on cities, Amer. Econ. Rev. 101 (5) (2011) 2205-2225.

[22] Y. Ioannides, S. Skouras, US city size distribution: Robustly Pareto, but only in the tail, J. Urban Econ. 73 (1) (2013) 18-29.

[23] E. Calderín-Ojeda, The distribution of all French communes: A composite parametric approach, Physica A 450 (2016) 385-394.

[24] G. Fazio, M. Modica, Pareto or log-normal? Best fit and truncation in the distribution of all cities, J. Reg. Sci. 55 (5) (2015) 736-756.

[25] J. Luckstead, S. Devadoss, Pareto tails and lognormal body of US cities size distribution, Physica A 465 (2017) 573-578.

[26] Census of India, Population enumeration data (final population), 2014.

[27] The World Bank, World development indicators database, 2014.

[28] K. Gangopadhyay, B. Basu, City size distributions for India and China, Physica A 388 (13) (2009) 2682-2688.

[29] J. Luckstead, S. Devadoss, A comparison of city size distributions for China and India from 1950 to 2010, Econom. Lett. 124 (2) (2014) 290-295.

[30] S. Devadoss, J. Luckstead, D. Danforth, S. Akhundjanov, The power law distribution for lower tail cities in India, Physica A 442 (2016) 193-196.

[31] H.A. Makse, J.S. Andrade, M. Batty, S. Havlin, H.E. Stanley, et al., Modeling urban growth patterns with correlated percolation, Phys. Rev. E 58 (6) (1998) 7054.

[32] H.D. Rozenfeld, D. Rybski, J.S. Andrade, M. Batty, H.E. Stanley, H.A. Makse, Laws of population growth, Proc. Natl. Acad. Sci. 105 (48) (2008) 18702-18707.

[33] E.A. Oliveira, J.S. Andrade Jr., H.A. Makse, Large cities are less green, Sci. Rep. 4 (2014).

[34] United Nations, World population prospects: The 2012 revision, Population Division of the Department of Economic and Social Affairs of the United Nations Secretariat, 2014.

[35] S. Devadoss, J. Luckstead, Size distribution of US lower-tail cities, Econom. Lett. 135 (1) (2015) 12-14.

[36] R.C. Mittelhammer, G.G. Judge, D. Miller, Econometric Foundations, Cambridge University Press, 2000.

[37] R.E. Kass, A.E. Raftery, Bayes factors, J. Amer. Statist. Assoc. 90 (430) (1995) 773-795.