Scholarly article on topic 'A Family of Nonparametric Density Estimation Algorithms'

A Family of Nonparametric Density Estimation Algorithms Academic research paper on "Art (arts, history of arts, performing arts, music)"

0
0
Share paper
Keywords
{""}

Academic research paper on topic "A Family of Nonparametric Density Estimation Algorithms"

A Family of Nonparametric Density Estimation Algorithms

E. G. TABAK

Courant Institute

CRISTINA V. TURNER

FaMAF, Universidad Nacional de Córdoba

Abstract

A new methodology for density estimation is proposed. The methodology, which builds on the one developed by Tabak and Vanden-Eijnden, normalizes the data points through the composition of simple maps. The parameters of each map are determined through the maximization of a local quadratic approximation to the log-likelihood. Various candidates for the elementary maps of each step are proposed; criteria for choosing one includes robustness, computational simplicity, and good behavior in high-dimensional settings. A good choice is that of localized radial expansions, which depend on a single parameter: all the complexity of arbitrary, possibly convoluted probability densities can be built through the composition of such simple maps. © 2012 Wiley Periodicals, Inc.

1 Introduction

A central problem in the analysis of data is density estimation: given a set of independent observations xj, j = 1,... ,m, estimate its underlying probability distribution. This article is concerned with the case in which x is a continuous, possibly multidimensional variable, typically in Rn, and its distribution is specified by a probability density p(x). Among the many uses of density estimation are its application to classification, clustering, and dimensional reduction, as well as more field-specific applications such as medical diagnosis, option pricing, and weather prediction [2, 7, 14].

Parametric density estimation is often based on maximal likelihood: a family of candidate densities is proposed, p(x; ft), where ft denotes parameters from an admissible set A. Then these parameters are chosen so as to maximize the log-likelihood L of the available observations:

(1.1) ft = argmax L = V" log(p(xj; ft)).

A typical example is a family p(x; ft) of Gaussian mixtures, with ft including free parameters in the means and covariance matrices of the individual Gaussians and

Communications on Pure and Applied Mathematics, 0001-0020 (PREPRINT) © 2012 Wiley Periodicals, Inc.

their weights in the mixture. Parametric density estimation is a practical tool of wide applicability, yet it suffers from the arbitrariness in the choice of the parametric family and number of parameters involved. Ideally, the form of the density function would emerge from the data, not from arbitrary a priori choices, unless these are guided by a deeper knowledge of the processes that gave rise to the probability distribution under study.

The simplest methodology for nonparametric density estimation is the histogram [18], whereby space is divided into regular bins, and the estimated density within each bin is assigned a uniform value, proportional to the number of observations that fall within. Histogram estimates are not smooth and suffer greatly from the curse of dimensionality. A smoother version, first developed in [12, 13], uses a sum of kernel functions centered at each observation, with a bandwidth adapted to the level of resolution desired. Particular kernels have been devised to handle properties of the target distributions; for instance, when these are known to have support only in the positive half-line, Gamma kernels have been proposed as a substitute for their more widely used symmetric Gaussian counterpart [4]. Many kernels, including the Gaussian, can be conceptualized as smoothers arising from a diffusion process [3], paving the road for a unified, systematic treatment of bandwidth selection and boundary bias.

In nonparametric estimation, one must be careful not to overresolve the density, for which one needs to calibrate the smoothing parameters to the data [11]. The most universal methodology for this is cross-validation [6], in which the available data are partitioned into subsets, used alternatively for training and out-of-sample testing of the estimation procedure. A related algorithm is the bootstrap [8], which creates training and testing populations by drawing samples with replacement from the available data points.

An alternative methodology for nonparametric density estimation was developed in [17], based on normalizing flows in feature-space. Other procedures based on normalization include the exploratory projection pursuit [5], a methodology originally developed for the visualization of high-dimensional data, which normalizes selected small-dimensional cross sections of the space of features, and copula-based density estimation [10, 15], which normalizes the one-dimensional marginal distribution of each individual feature, and then couples these marginals through a multidimensional copula, typically Gaussian or Archimedean. Normalizing the data xj is finding a map y(x) such that the yj = y (xj ) have a prescribed distribution ¡J,(y), for which we shall adopt here the isotropic Gaussian

If such map is known, then the probability density p(x) underlying the original data is given by

p(x) = Jy (x)ß(y(x)),

where Jy (x) is the Jacobian of the map y(x) evaluated at the point x. In view of (1.3), density estimation can be rephrased as the search for a normalizing map.

There is more than semantics to this rephrasing: normalizing the data is often a goal per se. It allows us, for instance, to compare observations from different datasets, to define robust metrics in phase space, and to use standard statistical tools, often applicable only to normal distributions. More important for us here, however, is that it leads to the development of a novel family of density estimation techniques.

1.1 Density Estimation through Normalizing Flows

The simplest idea for finding a normalizing map y (x) is to propose a parametric family, y = yft (x), and maximize the log-likelihood of the data, combining (1.1), (1.2), and (1.3) into

" (x)||2"

(1.4) ß = arg max L =

ß 2A jDi

log J (Xj )) -

where we have omitted from the log-likelihood L the ft-independent term — n log(2^). In particular, if y (x) is chosen among all linear functions of the form

(1.5) yft(x) = A(x — b) (ft = {A e R"x", b e Rn}), then the output of the maximization in (1.4) is

(1.6) b = x, A = E" 2,

where x = m xj is the empirical mean and E = m xj xj the empirical covariance matrix of the data. In other words, a linear choice for y^ (x) yields the standard normalization procedure of subtracting the mean and dividing by the square root of the covariance matrix. In terms of density estimation, it yields the Gaussian

(1 7) o(x) = -1-2 (x"x)Ts-1(x-x)

(1.7) P(x) (2^)n/2|S|i/2e •

Yet this, as all parametric procedures, suffers from the extra structure it imposes on the data, by assuming that it has an underlying probability density of a particular form.

One way to approach the algorithm proposed in [17] is to factor the map y(x) into N parametric maps (z),

(1.8) yN(x) = 00N M (x),

since the composition of many simple maps can be made arbitrarily complex, thus overcoming the limitations of parametric maps. If this is considered as a function y ft (x), depending on the indexed family of parameters ft = (fti • • • ftN) on which to perform the maximization in (1.4), then we have just complicated matters without resolving any issue. Yet the following two realizations help us move forward:

• We can calculate the various fit sequentially: first find fi\ using y\(x) = <pp1 (x) in (1.4), then fi2 using y2(x) = <fip2 (x)), with fi\ fixed at the value found in the prior step, and so on. If the identity map is included in each elementary family for fit = 0,

0o(z) = z,

then each new step can only increase the value of the log-likelihood L, so even though we are not maximizing L over fi = (fi\ ... fiN), we are still ascending it through the sequence of maps.

• Switching perspective from density estimation to normalization, we can at each step i forget all prior steps and just deal with the currently normalized states zj = yt-\(xj) of the observations as if these were the original ones. In order to be able to compute at the end the estimated density of the original variables xj , we just need to update at each step the global Jacobian of the map, through

(1.9) J yi (xj) ! J * (zj )J yi-1 (xj),

i.e., by multiplying it by the Jacobian of the current elementary map. With this new perspective, all steps adopt the simple form in (1.4), with fi = fit and each xj replaced by the current zj. This duality is the basis of our algorithm: rather than set out to estimate the density p(x), we seek a normalizing map y(x). This we factor into many elementary maps, with parameters determined through a local density estimation, in which (1.4) is applied not to x but to the current state z(x) of the map.

Pushing this idea to the limit, we may think of a continuous flow z = 0t (x) in an algorithmic time t, with velocity field

(1.10) u(z) =

driven by the variational gradient of the log-likelihood L. From this perspective, the observations xj give rise to active Lagrangian markers zj (t), with zj (0) = xj, that move with the flow and guide it through their contribution to the local-in-time log-likelihood L. It was proved in [17] that, as the number of observations grows, y(x) = limtz(x, t) converges to a normal distribution, and the density p(x) estimated through (1.3) to the actual density of the data.

For the observations xj, the active Lagrangian markers that guide the flow leading to the map y(x), we know at the end their normalized values y(xj) and the corresponding estimated densities p(xj) from (1.3). Yet one is typically interested in evaluating the estimated density at other points xg. These might be points on a regular grid—hence the "g" in xg —required to plot or manipulate p(x). They can also represent events whose likelihood one would like to know, or points whose probability under various distributions is required for a classification problem. In order to evaluate the density at these extra points xtg , it is enough to add them as

passive Lagrangian markers that move with the flow but do not influence it, since they are not included in the likelihood function.

1.2 Individual Maps

The building blocks <pt (x) proposed in [17] were simple one-dimensional maps, centered at a random point xo, oriented in a random direction; they depended on three parameters fi. These parameters were found by ascent of the log-likelihood, i.e., through

(1.11) fi = vrp L\p=o, with a learning rate v given by

(1.12) v = f

^2 + IV L\\2'

and € — 1 prescribed. This simple formula for the learning rate guarantees that the size ||fi|| of all steps is bounded by € and decreases near a maximum of L. It was proved in [17] that the composition of such one-dimensional maps suffices to guarantee convergence to arbitrary distributions p(x), based on the fact that two distributions with the same marginals in all directions are necessarily identical. This procedure was further developed in [1] to address clustering and classification problems.

Yet the procedure just described suffers from some computational drawbacks:

• Exploring all directions through one-dimensional maps requires a number of steps that grows exponentially with the dimension of phase space. In many applications, such as to microarray data, this dimension can be very large. Moreover, performing random rotations—i.e., orthogonal transformations—in high dimensions is costly.

• In order to have a smooth ascent process, the step size € needs to be small, hence requiring the algorithm to perform a large number of steps to reach convergence.

In this paper, we address both of these issues. On the one hand, we propose elementary transformations that do not deteriorate when the dimensionality of phase space grows, the simplest and most effective of which is based on radial expansions. On the other, we exploit the fact that the elementary transformations have a very simple analytical form to go beyond straightforward gradient descent, and instead maximize in each step the local quadratic approximation to the log-likelihood in terms of the parameters fi. This allows us to take much larger steps and hence reduces significantly the total number of steps that the algorithm requires.

Extension of the local maps

Figure 2.1. Dependence of the length scale a on the center x0. In order not to overresolve the estimation, the maps need to include a sufficient number of observations within their typical length scale. Thus, for maps centered in relatively unpopulated areas, the radius a must be larger than in areas with high probability density so as to encompass a similar number of points. The two circles in the figure exemplify this: a point in the tail of the distribution is assigned a larger domain of influence than one in the core.

2 General Methodological Aspects

2.1 Center x0 and Length Scale a

All the elementary maps that we propose are of the form

centered at a random point x0. The parameter a, measuring the length scale of the map, has a value that depends on the selected node x0: in areas with small probability, the length scale must be large, not to overfit the data. We start by choosing a number np of points that we would like to have within a ball of radius a around x0. Then a is given by the expression

which results from inverting the density of the target normal distribution. Here m is the total number of data points, n the dimension of feature space, and Qn the volume of the unit ball in Rn. The concept is illustrated in two dimensions in Figure 2.1.

We can think of two candidate methodologies for selecting the point x0: to pick it at random from the actual observations—at their current normalized state—or to sample the normal distribution to which y (x) is converging. The former choice has the advantage of sampling the actual current density, not the estimated one; on the

negative side, it never picks points away from the observations, so it may be ineffective at reducing overestimated densities at points far from the observed set. The latter choice, on the other hand, will sample all points proportionally to their current estimated density, so it will detect and help correct points with overestimated probability, yet it may fail to sample points in areas with underestimated probability density, so these may never be corrected. We have implemented a balanced solution, whereby we randomly alternate between the two sampling methodologies just described.

2.2 Local Ascent

In [17], we proposed picking the parameters fi of each time step by gradient ascent of the log-likelihood, through (1.11) and (1.12). Yet such a procedure does not exploit to its full extent the simplicity of each elementary map. The explicit nature of these maps allows us to compute analytically not just the first but also the second derivatives of the log-likelihood L with respect to fi. With this is hand, we can take larger and more effective steps by maximizing the quadratic local approximation to L,

L « Lo + Gß + 1 ß'Hß,

Lo = L

ß =0'

and Hj =

are the log-likelihood, its gradient, and its Hessian matrix evaluated at fi Maximizing (2.2) yields

(2.4) fi = —H ~1G.

A little care is required, though, not to take steps that are too long, incompatible with the local quadratic approximation. If the Hessian matrix H were not negative definite, the quadratic form (2.2) would have no maximum, and regular gradient descent would be called upon. Luckily, this is never the case with the maps that we propose, whose Hessian H is always negative definite. It may happen though that L is locally quite flat, leading to a large value of ||fi || from (2.4). To avoid pushing the quadratic approximation too far from its domain of validity, we limit the step size to a maximum learning rate €. We adopt, if || H~1G || > €, the capped step

ß = ~e

H ~1G IIH"1G|| •

2.3 Preconditioning

It may be convenient to do some simple initial transformations that map the data points toward a normal distribution in the bulk. Reasons for this range from the general to the specific:

• For data points far from the origin, the gradient of their likelihood under a normal distribution could reach machine zero, at which point the algorithm will lack any guidance as to how to move them to improve their likelihood.

• Movements in the bulk may require a coarse resolution, as measured by the length scale a, at odds with the finer one needed for a more detailed resolution of the probability density.

• We may have some a priori knowledge of a family of distributions that should capture much of the data's variability. Using this to do first a simple parametric estimation may save much computational time.

• In some cases, we might be interested in how much the actual distribution differs from a conventional one, such as the log-norm for investment returns. Then we can first do a fit to the conventional distribution and then quantify the extent and nature of the subsequent maps.

This first set of maps can be thought of as a preconditioning step of the algorithm, which only differs from the subsequent steps in the form of the proposed maps or in the scale adopted. Two preconditioning steps that we include by default in the algorithm are subtracting the mean of the data,

(2.6) x ! x — a, a = — 7 Xj,

and dividing by the average standard deviation,

1 /1 (2.7) x ! -x, a = —T ||xj||2,

a \ mn

with corresponding initial estimation

Ilx-Mll2

(2.8) po(x) = (2^a2)" 2 e

o „ n__

Proposing a general Gaussian as in (1.7) is not generally advisable in high dimensions unless the sample size m is big enough to allow for a robust estimation of the covariance matrix £.

Another preconditioning candidate generally applicable consists of carrying out a few steps of the regular core procedure but with coarser resolution, i.e., with larger np in (2.1). More generally, we can have a value of np that decreases mono-tonically throughout the procedure, from an initial coarse value to the finest resolution desired or allowed by the data, thus blurring the boundary between preconditioning and the algorithm's core.

In specific cases, where a family of probability densities of specific form p0(x, ft) is known or conjectured to provide a sensible fit of the data, and a map y(x, ft) is known such that

po(x,ft) = Jy (x)u(y(x,ft)),

then the preconditioning step should consist of a parametric fit of these parameters fi followed by the map. The popular procedure of taking the log of a series of returns fits within this framework, where the conjecture p0(x) is a log-normal distribution.

Often p(x) has bounded or semi-infinite support, which may be known even though p itself is not. For instance, some components of x may be known to be positive or, if x denotes geographical location, p(x) may be known to vanish over seas or in other unpopulated areas [16]. When this is the case, it may be convenient to perform as preconditioning a first map that fills out all of space, such as

for one-dimensional data with x > 0. The advantage of such a preconditioning step goes beyond moving the data toward Gaussianity: it also guarantees that the estimated pest(x) will vanish outside the support of p(x).

In order to complete the description of the algorithm, we need to provide a form for the elementary maps of each computational step, the "building blocks" of the general map y(x) defining the estimated density p(x) through (1.3). In order to be useful, these elementary maps must satisfy some properties:

• They must include the identity map for fi = 0.

• They must constitute a basis, through composition, for quite general maps. Thus linear maps are not good, since their composition never leaves the group of linear transformations.

• For robustness, they should have a simple spatial structure without unnecessary oscillations. Our choices below are local, typically the identity plus localized bumps times linear functions.

• They must have a simple analytical dependence on the parameters fi, leading to first and second derivatives of the likelihood function with respect to fi that are not computationally intensive. We find below that a scalar fi, the simplest of all choices, works best, since no complexity is needed at the level of the elementary maps: any complexity of the actual y(x) can be built by the composition of simple maps.

• They should not deteriorate in high dimensions. The maps proposed in [17] require the laborious construction of general maps through the composition of one-dimensional transformations. This is always doable, as proved in [17], but not computationally efficient in high dimensions.

3.1 Radial Expansions

Among the simplest elementary transformations suitable for building general maps are isotropic local expansion centered at a random point xo of the form

(2.10)

x ! erf-1.! - 2e~x)

3 Elementary Building Blocks

y = x + ßf.IIx - %o||)(x - xo)'

depending on the single parameter f, positive for local expansions and negative for contractions.

A typical localization function f is given by

fi ^ ft \ 1 erf(r/a)

(3.2) f(r) =----,

where r = ||x — xo ||. Another choice is

(3.3) f(r) = 1

Even though the two are similar in shape, each choice has its advantages: the former is smoother and more localized, while the latter is faster to compute and, more importantly, the corresponding map (3.1) can be inverted in closed form, yielding

x - xo y - Xo

where s = ||y — xo|| and

s-ia+f)^ (s — .a + f)\2

r =-2-+ Vl-2-' + as

This is useful in a number of applications that involve finding the inverse x(y) of the normalizing map y.x): producing synthetic extra sample points xj from p.x), for instance, can be achieved by obtaining samples yj from the Gaussian ¡i{y) and writing xj = x.yj )). Still one more choice is

( O-rM2 for r <a (3.4) f.r) = « ior r<a'

( 0 otherwise.

This has the advantage of its compact support, which permits the easy superposition of various such maps simultaneously. All three families require f > —a for the maps to be one-to-one; the last one also requires that f < 3a. Figure 3.1 compares the three functions for x0 = 0 and a = 1. The map in (3.1) has Jacobian

J = | (1 + if)n-1i1 + iif + rf ')) |

and corresponding log-likelihood function L

Xlog(p(xj)) = X — 1 [Ixol2 + 2(xo, (1 + ff)(xj — xo)) + ((1 + if)rj)2] j j + (n — 1) log(1 + if) + log (1 + f(f + rj f '))|.

E([n — (xo,xj — xo) — rj] f + rj f '}

Figure 3.1. Three radial building blocks. The upper panels display /(|x|), the lower ones x/(|x|). On the left, a smooth, analytic block based on the error function; in the center, one with algebraic decay and closed-form inversion; and on the right, one with compact support.

E[(n + rj2)/2 + 2rj //0 + (j/')2] < 0,

so we may replace L by its quadratic approximation at ft = 0, yielding the following approximation to the maximizer:

ft=—

9fi 2 fi =0

3.2 One-Dimensional Maps

One may gain extra degrees of freedom by changing the map above into the component-wise

y' — x0 = (1 + ft'" / ( | x' — x0 | ))(x' — x0),

the composition of n one-dimensional maps, each depending on a parameter ft'. In this case,

L = XX log[1 + ft' (/ + I xj — x0 | /0)]

1 [ i2

— - [x0 + 2x0 (1 + ft' /') (xj — x0) + (xj. — x0)2( 1 + ft' /-)2],

+ | xj x0 | Ti x0 /- (xj x0) (xj x0) }

| /-f C fi2(xj - x0)2

so we must pick

fti =-

92L | ' 9fi2 Ifi =0

This family of maps is not isotropic, since it privileges the coordinate axes. To restore isotropy, one can rotate the axes every time step through a random orthogonal matrix. With this extra ingredient, this building block agrees with the one originally implemented in [17]; the only differences are the specific form of the stretching function, which in [17] was a more complex function depending on three parameters per dimension, and the maximization procedure, which is carried out here through a local quadratic approximation, not by first-order ascent of the log-likelihood.

3.3 Localized Linear Transformations

The radial expansions in (3.1) and, except for a minor twist, also the one-dimensional maps in (3.5) can be thought of as particular instances of a more general localized linear transformation of the form

y = x C /(||x - X0ll)A(x - X0).

For the radial expansions, we have A = ft/ and, for each one-dimensional map, A = ftnnT, where n is the column vector of direction cosines of the direction considered; in the latter case f applies not to ||x — x0|| but to |n • (x — x0)|.

For a general matrix A in (3.6), we have the following quadratic approximation to the logarithm of the density p at each point x:

Q j (x) = / 2(X)$ j c if (xj - xj) (xl - x0))

log(p(x)) « -

Lj (x) =

cE Lj (x)Aj C £ Qj (x)Aj Af;

. - /(x)x'

(xj - xj) C /(x)lj

2/(x) 9x^(xl - x^ ) C ^ ^ (xj - xj)(xl - x0).

3xi 3xk

Hence maximizing over A the quadratic approximation to the log-likelihood L = pm log(p(xm)) yields the system

X{X(Q/ (xm) + Qj (Xm))}Alk = - X L (Xm). kl m m

Notice that this building block requires much more computational work than the isotropic expansions; hence its use would only be justified if it yielded better accuracy in a much smaller number of steps. We found in the experiments below that this is not typically the case, so we conclude that simpler maps, with only a handful of parameters f—such as the single one for the radial expansions—are to be preferred.

4 Examples

In this section, we use some synthetic examples to illustrate the procedure and to compare the efficiency of the various building blocks proposed above. In all examples, we have used for preconditioning only the two steps in (2.6)-(2.7), which re-center the observations at x = 0 and stretch them isotropically so as to produce a unitary average standard deviation.

As a first example, consider the two-dimensional probability density displayed in Figure 4.1, given by

- s 1 ir-1\2

(4.1) p(x,y) = e 2 ° e 2(0.1) ,

where r and 9 are the radius and angle in polar coordinates: a distribution concentrated in a small neighborhood of the unit circle, with maximal density at (x, y) = (1,0). Such distribution, with thin support and pronounced curvature, would be hard to capture with any parametric approach. Yet the proposed algorithm does a very good job, as shown in Figure 4.2.

For the experiment displayed in Figures 4.1 and 4.2, we have taken a sample of size m = 1000, used the radial expansion in (3.1) with f(r) from (3.2), and adopted a value np = 500 for the calculation of the length scale a in (2.1). The Kullback-Leibler divergence [9] between the exact and the estimated distributions, displayed in the last panel of Figure 4.1, is given by

(4.2) KL = / ^ (^j) Pex(y)dy,

which is integrated numerically on the same grid used for the plots, a set of points carried passively by the algorithm, where pest is known. Another possibility, much more efficient in high dimensions, is to estimate KL through Monte Carlo simulation:

(4.3) KL « N I] log (Pex (xj ) - log (Pest (xj )).

-0.5 -1 -1.5

Real density

Observations

Real density

-0.5 -1 -1.5

' -tort*

-1 -0.5 0 0.5 1 1.5

Kullback-Leibler divergence

0.8 0.6 0.4 0.2

0 100 200 300 400 500 600

Figure 4.1. A synthetic two-dimensional example. On the left, the proposed probability density, displayed through contours and in perspective. On the right, the 1000-point sample used to test the procedure, and the evolution of Kullback-Leibler divergence between the analytical density and the one discovered by the algorithm.

This also reveals the connection between the Kullback-Leibler divergence between estimated and exact densities and the log-likelihood of the estimated density, which the algorithm ascends.

Experimenting with the other building blocks proposed above yields entirely similar results. We conclude that the radial expansions are to be preferred, since their use is much less computationally intensive. Moreover, the simplicity of the radial expansions brings in an extra degree of robustness, as revealed by a much smaller sensitivity to the choice of , the only free parameter of the algorithm.

Next we compare the procedure developed here with kernel density estimation, the most popular nonparametric methodology in use [18]. We have adopted Gaussian kernels of the form

(4.4) =

1 _U lly-xA 2

-e H h )

and proposed the estimate

(4.5) p(y) = - X Kh(xj ,y).

Figure 4.2. Evolution of the estimated density and normalized observations through three snapshots: on the left, the onset of the algorithm, after a preconditioning step that re-centers the observations and rescales them isotropically; in the center, the situation after 200 steps; and, on the right, a final estimation after 600 steps. The top two rows display the estimated density; the third row the normalized observations.

Hence each observation xj contributes to the local density within a neighborhood whose size scales with h. This bandwidth plays a similar role to the np of our procedure, the typical number of points affected by each map.

Figure 4.3 displays the results of applying both procedures, at different band-widths, to a sample with m = 500 points from the distribution in (4.1). We have picked two values for h and np, one too large, slightly underresolving the distribution, and one too small, slightly overresolving it. Comparing the results from the two procedures, we can make the following observations:

• Both procedures are robust, capturing the main features of the probability density p.x), whereas most parametric approaches would have done poorly.

• The mapping procedure yields smoother and tighter density profiles, and correspondingly smaller values of the Kullback-Leibler divergence between the exact and estimated densities.

The computational costs of both procedures are comparable: estimating the density at q points requires m x q evaluations of the kernel and ns x q map applications, respectively. Since the number ns of iterations before convergence scales with

Figure 4.3. Comparison between density estimation through the mapping procedure and through Gaussian kernels at various bandwidths. On the left, two estimates performed using radial expansions; on the right, two Gaussian kernel density estimations. The top row uses values of np and h that slightly underresolve the distribution, while the corrrespond-ing values on the bottom row slightly overresolve it. Both methodologies are robust and yield comparable results, yet the mapping procedure gets estimates that are both tighter and smoother, with corresponding lower values for the Kullback-Leibler divergence with the exact density underlying the sample.

the number m of observations, these two numbers of evaluations are of the same order. The mapping procedure has the additional cost of determining the optimal parameter ft for each step, but this is comparatively unimportant when q is much larger than m.

Beyond the comparison of effectiveness, which depends on the actual problem in hand, one can describe the main differences between the two procedures:

• The estimated density is expressed in terms of the sum of kernel functions in one case and of the composition of elementary maps in the other.

• In the implementations discussed here, kernel density estimation is explicit and deterministic, while there is a stochastic element to the choice of the centers for the elementary maps.

• Kernel density estimation is conceptually simpler, while the normalizing maps have a richer structure and more versatility.

• The kernels provide just an estimated density, while the new procedure also produces a normalizing map. This can be used for a variety of purposes, such as sampling.

DENSITY ESTIMATION ALGORITHMS Real density Observations

Real density

Kullback-Leibler divergence

Figure 4.4. A second synthetic two-dimensional example, the mixture of two Gaussians. On the left, the proposed probability density, displayed through contours and in perspective. On the right, the 2000-point sample used to test the procedure, and the evolution of Kullback-Leibler divergence between the analytical density and the one discovered by the algorithm.

Figures 4.4 and 4.5 show another two-dimensional experiment. In this case, the proposed density is the mixture of two anisotropic Gaussians, and, for illustration, the building block utilized is the general localized linear transformation in (3.6). Notice in Figure 4.5 a feature associated with the dual nature of the algorithm: since the normalizing procedure cannot fully eliminate the gap between the two Gaussians without overresolving, as shown in the three bottom panels, the corresponding density estimation, displayed in the top panels, cannot fully separate the two. A kernel estimator would also be unable to fully separate the two components, but here the reason would be more straightforward: a bandwidth h small enough to separate them would overresolve the estimation, particularly at the less populated

tails of the distribution. "Leibler" misspelled on

The examples above are two-dimensional to facilitate their display, yet the full calloutin Fig. 4.4. power of the algorithm manifests itself in high-dimensional situations. Thus we consider next the equal-weight mixture of two n -dimensional normal distributions, centered at x = ±2e1. Here we have used a sample of m = 1000 points, np = 500, and again the isotropic radial expansion in (3.1)-(3.2). Figure 4.6 compares the evolution of the Kullback-Leibler divergence between the exact and estimated density in dimensions n = 2, n = 5, and n = 10. In order to enable a meaningful comparison between problems in different dimensions, the KL from (4.2) in the plots is normalized by , the surface area of the n-dimensional unit sphere.

Figure 4.5. Evolution of the estimated density and normalized observations through three snapshots: on the left, the onset of the algorithm, after a preconditioning step that re-centers the observations and rescales them isotropically; in the center, the situation after 100 steps; and on the right, a final estimation after 500 steps. The top two rows display the estimated density; the third row the normalized observations.

Notice that the rate of convergence does not deteriorate significantly with the dimension n—nor does the time per step, which is nearly independent of n for radial expansions. The value of n = 10 is beyond the largest one might have hoped to resolve with a sample of size m = 1000, since 210 = 1024: one has on average one observation per the 10-dimensional equivalent of a quadrant! Thus it is surprising that the algorithm resolves this density so well.

5 Conclusions

We have developed a methodology for nonparametric density estimation. Based on normalizing flows, the new procedure improves on the one developed in [17] in that it is more robust and efficient in high dimensions and ascends the log-likelihood function through larger steps, based on a quadratic approximation rather than gradient ascent. It requires only one external parameter, np, with a clear interpretation: the level of resolution sought, measured in number of observations per localized feature of the estimated density. We have found that the simplest elementary transformations, such as localized radial expansions, are also the most

0 08 0 07 O.OS 0 05 ^ 0 04 0 03 0.02 0.01 0.

i- r r -1— -1-

! \ J \ _

1\ n=5 n=~IO -

_1_ _1_ __4- 4- - _[_1_ _L_

Figure 4.6. Evolution of the Kullback-Leibler divergence between the real and the estimated density for a Gaussian mixture in dimensions 2 (solid blue), 5 (dashed red), and 10 (dashed black).

efficient and robust building blocks from which to form the map that normalizes the data points.

Density estimation appears often in applications as a tool for more specific tasks. One advantage of the methodology developed here is its flexibility, which allows for easy adaptation to such tasks. Thus, in [1], we have adapted the algorithm from [17], a direct ancestor to the one in this paper, to do classification and clustering. Along similar lines, projects underway employ variations of the methodology proposed here to perform tasks as varied as medical diagnosis, relating behavioral traits to neuron classes in worms, Monte Carlo simulation, time series analysis, estimation of risk-neutral measures, and transportation theory. It is in the context of these more specific procedures that examples with real data make sense. In this paper, we have purposefully concentrated instead on "pure" density estimation, illustrating the new procedure only with synthetic examples. The advantage of using synthetic examples is that the knowledge of the precise distribution from which the observations are drawn permits quantifying the accuracy of its estimation. Thus, in small-dimensional problems, one can compare the real and estimated distributions visually and, in larger dimensions, through their Kullback-Leibler divergence.

Acknowledgment. The work of C. V. Turner was partly supported by grants from CONICET, SECYT-UNC, and PICT-FONCYT; the work of E. G. Tabak was partly supported by the National Science Foundation under grant DMS 0908077.

Bibliography

[1] Agnelli, J. P.; Cadeiras, M.; Tabak, E. G.; Turner, C. V.; Vanden-Eijnden, E. Clustering and classification through normalizing flows in feature space. Multiscale Model. Simul. 8 (2010), no. 5, 1784-1802.

[2] Bishop, C. M. Pattern recognition and machine learning. Information Science and Statistics. Springer, New York, 2006.

[3] Botev, Z. I.; Grotowski, J. F.; Kroese, D. P. Kernel density estimation via diffusion. Ann. Statist. 38 (2010), no. 5, 2916-2957.

[4] Chen, S. X. Probability density function estimation using Gamma kernels. Ann. Inst. Statist. Math. 52 (2000), no. 3, 471-480.

[5] Friedman, J. Exploratory projection pursuit. J. Amer. Statist. Assoc. 82 (1987), no. 397, 249266.

[6] Hall, P.; Racine, J.; Li, Q. Cross-validation and the estimation of conditional probability densities. J. Amer. Statist. Assoc. 99 (2004), no. 468, 1015-1026.

[7] Hastie, T.; Tibshirani, R.; Friedman, J. The elements of statistical learning. Data mining, inference, and prediction. Springer Series in Statistics. Springer, New York, 2001.

[8] Kohavi, R. A study of cross-validation and bootstrap for accuracy estimation and model selection. Proceedings of the 14th International Joint Conference on Artificial Intelligence, vol. 2, 1137-1143. Morgan Kaufmann, San Francisco, 1995.

[9] Kullback, S.; Leibler, R. A. On information and sufficiency. Ann. Math. Statistics 22 (1951), 79-86.

[10] Nelsen, R. B. An introduction to copulas. Lecture Notes in Statistics, 139. Springer, New York, 1999.

[11] Park, B. U.; Marron, J. S. Comparison of data-driven bandwidth selectors. J. Amer. Statist. Assoc. 85 (1990), no. 409, 66-72.

[12] Parzen, E. On estimation of a probability density function and mode. Ann. Math. Statist. 33 (1962), 1065-1076.

[13] Rosenblatt, M. Remarks on some nonparametric estimates of a density function. Ann. Math. Statist. 27 (1956), 832-837.

[14] Silverman, B. W. Density estimation for statistics and data analysis. Monographs on Statistics and Applied Probability. Chapman & Hall, London, 1986.

[15] Sklar, A. Fonctions de répartition à n dimensions et leurs marges. Publ. Inst. Statist. Univ. Paris 8 (1959), 229-231.

[16] Smith, L. M.; Keegan, M. S.; Wittman, T.; Mohler, G. O.; Bertozzi, A. L. Improving density estimation by incorporating spatial information. EURASIP Journal on Advances in Signal Processing 2010 (2010), 1-12.

[17] Tabak, E. G.; Vanden-Eijnden, E. Density estimation by dual ascent of the log-likelihood. Commun. Math. Sci. 8 (2010), no. 1, 217-233.

[18] Wasserman, L. All of nonparametric statistics. Springer Texts in Statistics. Springer, New York, 2006.

ESTEBAN G. Tabak Courant Institute 251 Mercer St. New York, NY 10012 E-mail: tabak@cims.nyu.edu

Cristina V. Turner FaMAF

Universidad Nacional de Córdoba

Medina Allende s / n

Ciudad Universitaria

5000 Córdoba

ARGENTINA

E-mail: cristina.turnerig gmail.com

Received March 2011.

Revised June 2011, December 2011.