The Minimax Distortion Redundancy in Empirical Quantizer Design

Peter L. Bartlett, Member, IEEE, Tamas Linder, Member, IEEE, and Gabor Lugosi

Abstract—We obtain minimax lower and upper bounds for the expected distortion redundancy of empirically designed vector quantizers. We show that the mean-squared distortion of a vector quantizer designed from $n$ independent and identically distributed (i.i.d.) data points using any design algorithm is at least $\Omega(n^{-1/2})$ away from the optimal distortion for some distribution on a bounded subset of $\mathbb{R}^d$. Together with existing upper bounds this result shows that the minimax distortion redundancy for empirical quantizer design, as a function of the size of the training data, is asymptotically on the order of $n^{-1/2}$. We also derive a new upper bound for the performance of the empirically optimal quantizer.

Index Terms—Distortion redundancy, empirical quantizer design, lower bounds, minimax convergence rate, vector quantization.

I. Introduction

One basic problem of data compression is the design of a vector quantizer without the knowledge of the source statistics. In this situation, a collection of sample vectors (called the training data) is given and the objective is to find a vector quantizer of a given rate whose average distortion on the source is as close as possible to the distortion of the optimal (i.e., minimum distortion) quantizer of the same rate.

Most existing design algorithms (see, e.g., [7], [9], [19], and [23]) attempt to implement, in various ways, the principle of empirical error minimization in the vector quantization context. According to this principle, a good quantizer can be found by searching for one that minimizes the distortion over the training data. If the training data represents the source well, this empirically optimal quantizer will hopefully perform near optimally also on the real source. The problem of quantifying how good empirically designed quantizers are compared to the truly optimal ones has been extensively

Manuscript received February 16, 1997; revised March 10, 1998. This work was supported in part by OTKA under Grant F 014174, under a DIST Bilateral Science and Technology Collaboration Grant, by DGES under Grant PB96-0300, and by the National Science Foundation. The material in this work was presented in part at the EUROCOLT'97, Jerusalem, Israel, 1997, and at the IEEE International Symposium on Information Theory, Ulm, Germany, June 1997.

P. L. Bartlett is with the Department of Systems Engineering, Research School of Information Sciences and Engineering, Australian National University, Canberra 0200, Australia (e-mail: Peter.Bartlett@anu.edu.au).

T. Linder was with the Department of Electrical and Computer Engineering, University of California, San Diego, CA, USA on leave from the Technical University of Budapest, Hungary. He is now with the Department of Mathematics and Statistics, Queen's University, Kingston, Ont., Canada K7L 3N6 (e-mail: linder@code.ucsd.edu).

G. Lugosi is with the Department of Economics, Pompeu Fabra University, Ramon Trias Fargas 25-27, 08005 Barcelona, Spain (e-mail: lugosi@upf.es).

Publisher Item Identifier S 0018-9448(98)04787-7.

studied for the case when the training data consists of $n$ vectors independently drawn from the source distribution. It was shown by Pollard [16], [18] under general conditions that the method of empirical error minimization is consistent in the following sense. Let $D_n$ be the mean-squared error (MSE) of the empirically optimal quantizer, when measured on the real source, and let $D^*$ be the minimum MSE achieved by an optimal quantizer. An empirically designed quantizer is consistent if the quantity $D_n - D^*$ (called the distortion redundancy) converges to zero as $n$ tends to infinity.

Of course, mere consistency does not give any indication of how large the training data should be so that the distortion of the designed quantizer is close to the optimum. This question can only be answered by analyzing the finite sample behavior of $D_n$. In this direction, it was shown in [10] and [15] that there exists a constant $c$ such that $D_n - D^* \le c\sqrt{\log n/n}$ for all sources over a bounded region. This result has since been extended to empirical quantizer design for vector quantizers operating on "noisy" sources and for vector quantizers for noisy channels [11]. An extension to unbounded sources is given in [13].

A deeper analysis of the method used to obtain the above upper bound shows that, at the price of considerable technical difficulties, the $\sqrt{\log n}$ factor can be eliminated. Indeed, using a result of Alexander [1] the above upper bound can be sharpened to $O(1/\sqrt{n})$.

Two basic questions relating to the finite sample behavior of quantizer design algorithms have remained unanswered. The first is whether the $O(1/\sqrt{n})$ upper bound on the distortion redundancy $D_n - D^*$ is actually tight. The second, more general question is whether there exist methods, other than empirical error minimization, which provide smaller distortion redundancy (and thus use less training data to achieve the same distortion). The results of this paper answer both questions in a minimax sense.

There are indications that the upper bound can be tightened to $O(1/n)$. Indeed, for the special case of a one-codepoint scalar quantizer one can define the codepoint to be the average of the $n$ independent and identically distributed (i.i.d.) training samples, a choice which actually minimizes the squared error on the training data. It is easy to see that $D_n - D^* = c/n$, where $c$ is the variance of the source. Another indication that an $O(1/n)$ rate might be achieved comes from a result of Pollard [17]. He showed that for sources with some specially smooth and regular densities, the difference between the codepoints of the empirically designed quantizers and the codepoints of the optimal quantizer obeys a multidimensional

0018-9448/98S10.00 © 1998 IEEE

central limit theorem. As Chou [3] pointed out, this implies that within the class of sources in the scope of this result, the distortion redundancy decreases at a rate $O(1/n)$ in probability.
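The $c/n$ behavior of the one-codepoint example can be checked exactly on a small case. The following Python sketch is an illustration and not part of the paper: it assumes a Bernoulli($p$) source (our choice), takes the sample mean as the single codepoint, and enumerates all binomial outcomes with exact rational arithmetic. The function name `expected_redundancy` is hypothetical.

```python
from fractions import Fraction
from math import comb


def expected_redundancy(p: Fraction, n: int) -> Fraction:
    """Exact expected excess distortion of the sample-mean codepoint.

    Assumed setting: X ~ Bernoulli(p), one codepoint y = sample mean of n
    i.i.d. samples. For any codepoint y, E(X - y)^2 = var + (y - p)^2 and the
    optimal distortion is var, so the excess distortion is (y - p)^2.
    """
    total = Fraction(0)
    for s in range(n + 1):  # s = number of ones among the n samples
        prob = comb(n, s) * p**s * (1 - p) ** (n - s)
        y = Fraction(s, n)
        total += prob * (y - p) ** 2
    return total


p = Fraction(3, 10)
for n in (1, 5, 20):
    # The expected redundancy equals variance / n, with variance p(1 - p).
    assert expected_redundancy(p, n) == p * (1 - p) / n
```

Exact fractions are used so that the identity holds without floating-point tolerance.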

In the main result of this paper (Theorem 1) we show that despite these suggestive facts, the conjectured $O(1/n)$ distortion redundancy rate does not hold in the minimax sense. Let $B > 0$ and consider the class $\mathcal{B}$ of $d$-dimensional source distributions $\mu$ such that if $X$ is distributed according to $\mu$, then $(1/d)\|X\|^2 \le B$ with probability one. We show that for any $d$-dimensional $k$-codepoint ($k > 2$) quantizer $Q_n$ which is designed by any method from $n$ independent training samples, there exists a distribution in $\mathcal{B}$ for which the per-dimension MSE of $Q_n$ is bounded away from the optimal distortion by a constant times $B\sqrt{k^{1-4/d}/n}$. Thus the gap between this lower bound and the existing upper bound is reduced to a constant factor, if the parameters $k$ and $d$ are kept constant.

In addition to this general lower bound, a new minimax upper bound for the empirically optimal quantizer is derived in Theorem 2. The bound is a constant times

$$\sqrt{\frac{k^{1-2/d}\log n}{n}}.$$

The main merit of this bound is that it partially explains the curious dependence of the lower bound on $k$: the bound decreases in $k$ for very small values of $d$. Also, for realistic values of quantizer dimension and rate, it is tighter than the $O(1/\sqrt{n})$ bound obtained via Alexander's inequality, and yet its proof is rather elementary and accessible.

II. Main Results

A $d$-dimensional $k$-point quantizer $Q$ is a mapping

$$Q(x) = y_i \quad \text{if } x \in B_i$$

where $B_1, \ldots, B_k$ form a measurable partition of $\mathbb{R}^d$ and $y_i \in \mathbb{R}^d$, $1 \le i \le k$. The $y_i$'s are called codepoints, and the collection of codepoints $\{y_1, \ldots, y_k\}$ is the codebook. If $\mu$ is a probability measure on $\mathbb{R}^d$, the distortion of $Q$ with respect to $\mu$ is

$$D(Q) = \int_{\mathbb{R}^d} \|x - Q(x)\|^2\,\mu(dx)$$

where $\|x - Q(x)\|$ is the Euclidean distance between $x$ and $Q(x)$.

An empirically designed $k$-point quantizer is a measurable function $Q_n: (\mathbb{R}^d)^{n+1} \to \mathbb{R}^d$ such that for each fixed $x_1, \ldots, x_n \in \mathbb{R}^d$, $Q_n(\cdot, x_1, \ldots, x_n)$ is a $k$-point quantizer. Thus an "empirically designed quantizer" consists of a family of quantizers and an "algorithm" which chooses one of them for each value of the training data $x_1, \ldots, x_n$.

In our investigation, $X, X_1, \ldots, X_n$ are i.i.d. random variables in $\mathbb{R}^d$ distributed according to some probability measure $\mu$ with $\mu(S(0, \sqrt{d})) = 1$, where $S(x, r) \subset \mathbb{R}^d$ denotes the closed ball of radius $r > 0$ centered at $x \in \mathbb{R}^d$. In other words, we assume that the normalized squared norm $(1/d)\|X\|^2$ of $X$ is bounded by one with probability one. (By straightforward scaling one can generalize our results to cases with $\mu(S(0, \sqrt{dB})) = 1$ for some fixed $B < \infty$.) The distortion of $Q_n$ is the random variable

$$D(Q_n) = \mathbb{E}\left[\|X - Q_n(X, X_1, \ldots, X_n)\|^2 \,\middle|\, X_1, \ldots, X_n\right].$$

Let $D^*(k, \mu)$ be the minimum distortion achievable by the best $k$-point quantizer under the source distribution $\mu$. That is,

$$D^*(k, \mu) = \min_Q D(Q) = \min_Q \mathbb{E}\|X - Q(X)\|^2$$

where the minimum is taken over all $d$-dimensional, $k$-point quantizers. The following quantity is in the focus of our attention:

$$J(Q_n, \mu) = \mathbb{E}D(Q_n) - D^*(k, \mu)$$

that is, the expected excess distortion of $Q_n$ over the optimal quantizer for $\mu$. In particular, we are interested in the minimax expected distortion redundancy, defined by

$$J^*(n, k, d) = \inf_{Q_n} \sup_{\mu} J(Q_n, \mu) \qquad (1)$$

where the infimum is taken over all $d$-dimensional, $k$-point empirical quantizers trained on $n$ samples, and the supremum is taken over all distributions over the ball $S(0, \sqrt{d})$ in $\mathbb{R}^d$. The minimax expected distortion redundancy expresses the minimal worst case excess distortion that an empirical quantizer can have.

A quantizer $Q$ is a nearest neighbor quantizer if for all $x$, $\|x - Q(x)\| \le \|x - y_i\|$ for all codepoints $y_i$ of $Q$. It is well known that for each quantizer $Q$ and distribution $\mu$ there exists a nearest neighbor quantizer which has the same codebook as $Q$ but less than or equal distortion. Therefore, when investigating the minimax distortion redundancy, it suffices to consider nearest neighbor quantizers.

The empirically optimal quantizer, denoted $Q_n^*$, is an empirically designed quantizer which minimizes the empirical error

$$D_n(Q) = \frac{1}{n}\sum_{i=1}^{n} \|X_i - Q(X_i)\|^2$$

over all $k$-point nearest neighbor quantizers $Q$.
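To make the definition concrete, the following Python sketch computes the empirical error of a nearest neighbor quantizer and finds an empirically optimal codebook by brute force. It is only an illustration for scalar data ($d = 1$) and small samples, under the standard fact that an optimal scalar quantizer for squared error has interval cells with centroid codepoints; the function names are hypothetical and the paper does not prescribe any particular search procedure.

```python
from itertools import combinations


def nearest_neighbor_quantize(x, codebook):
    """Map x to the closest codepoint (nearest neighbor rule)."""
    return min(codebook, key=lambda y: (x - y) ** 2)


def empirical_distortion(samples, codebook):
    """Average squared error of the nearest neighbor quantizer on the samples."""
    return sum(
        (x - nearest_neighbor_quantize(x, codebook)) ** 2 for x in samples
    ) / len(samples)


def empirically_optimal_codebook(samples, k):
    """Brute-force search over contiguous partitions of the sorted samples.

    Assumes scalar samples and 1 <= k <= len(samples). Each cell's codepoint
    is its centroid. Returns (empirical distortion, codebook).
    """
    xs = sorted(samples)
    n = len(xs)
    best = None
    for cuts in combinations(range(1, n), k - 1):
        bounds = (0,) + cuts + (n,)
        codebook = [sum(xs[a:b]) / (b - a) for a, b in zip(bounds, bounds[1:])]
        dist = empirical_distortion(xs, codebook)
        if best is None or dist < best[0]:
            best = (dist, codebook)
    return best


dist, codebook = empirically_optimal_codebook([0.0, 0.1, 0.9, 1.0], k=2)
```

The search is exponential in the sample size, so it is meant for checking small examples rather than for practical design.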

The first result upper-bounding the minimax distortion redundancy was given in [10], where it was proved that for the empirically optimal quantizer

$$J(Q_n^*, \mu) \le c\,d^{3/2}\sqrt{\frac{k\log n}{n}} \qquad (2)$$

for all $\mu$, where $c$ is a universal constant. The main message of the above inequality is that there exists a sequence of empirical quantizers such that for all distributions supported on a given $d$-dimensional sphere the expected distortion redundancy decreases as $O(\sqrt{\log n/n})$. Another application of this result, which uses the dependence of this bound on $k$, was pointed out in [13] (see the discussion after Theorem 2).

With analysis based on sophisticated uniform large-deviation inequalities of Alexander [1] or Talagrand [21] it is possible to get rid of the $\sqrt{\log n}$ factor. More precisely, one can prove that

$$J(Q_n^*, \mu) \le c'\,d^{3/2}\sqrt{\frac{k}{n}} \qquad (3)$$

for all $\mu$, where $c'$ is another universal constant (see the discussion in [10] and [6, Problem 12.10]).

The theorem below, the main result of this paper, shows that for any empirical quantizer $Q_n$ (i.e., for any design method whose input is $X_1, \ldots, X_n$ and output is a $d$-dimensional, $k$-codepoint quantizer $Q_n$) the excess distortion is as large as a constant times $d\sqrt{k^{1-4/d}/n}$ for some distribution. Let $\Phi$ denote the distribution function of a standard normal random variable.

Theorem 1: For any dimension $d$, number of codepoints $k \ge 3$, and sample size $n \ge 16k/(3\Phi(-2)^2)$, and for any empirically designed $k$-point quantizer $Q_n$, there exists a distribution $\mu$ on $S(0, \sqrt{d})$ such that

$$J(Q_n, \mu) \ge c_0\,d\sqrt{\frac{k^{1-4/d}}{n}} \qquad (4)$$

where $c_0$ is a universal constant which may be taken to be $\Phi(-2)^4\,2^{-12}/\sqrt{6}$.

The proof of the theorem is given in the next section.

Remarks:

i) In the proof of the theorem, for the sake of simplicity, we consider a family of distributions concentrated on a finite set of points in $S(0, \sqrt{d})$. It is then demonstrated that for each $Q_n$ there exists a $\mu$ in this family for which (4) holds. Since these distributions can be arbitrarily well approximated (for our purposes) by distributions with smooth (say infinitely many times differentiable) densities, essentially the same argument shows that for each $Q_n$ there exists a $\mu$ with a smooth density such that (4) holds.

ii) The constant $c_0$ of the theorem is rather small (note that $\Phi(-2) \approx 0.0228$), and it can probably be improved upon at the expense of a more complicated analysis.

The above theorem, together with (3), essentially describes the convergence rate of the minimax expected distortion redundancy in terms of the sample size $n$. Using definition (1) we obtain that

$$c_1 \le \liminf_{n\to\infty} \sqrt{n}\,J^*(n, k, d) \le \limsup_{n\to\infty} \sqrt{n}\,J^*(n, k, d) \le c_2$$

for some constants $c_1, c_2 > 0$ depending on $d$ and $k$. However, there is still a gap if the bounds are viewed in terms of the number of codepoints $k$. For large $d$ the difference is small. In fact, if, according to the usual information-theoretic asymptotic view, the number of codepoints is set as $k = 2^{dR}$ for some constant rate $R > 0$, then the difference between the upper and lower bounds is asymptotically negligible in an exponential sense. Indeed, (3) and Theorem 1 imply that for large $d$, the per-dimension minimax distortion redundancy is sandwiched as

$$\frac{2^{\frac{1}{2}d(R - O(d^{-1}))}}{\sqrt{n}} \le \frac{1}{d}J^*(n, k, d) \le \frac{2^{\frac{1}{2}d(R + O(d^{-1}\log d))}}{\sqrt{n}}.$$

The difference is more essential for small $d$, not only because of the difference in the exponents of $k$ in the two bounds, but also because the constant $c'$ in (3) is large (it is of the order of $10^3$), a price paid for eliminating the $\sqrt{\log n}$ factor in (2). For this reason, we now present a new minimax upper bound on the distortion redundancy of empirically optimal quantizers.

Theorem 2: For the class of sources considered in Theorem 1, if $n \ge k^{4/d}$, $\sqrt{d}\,k^{1-2/d}\log n \ge 15$, $kd \ge 8$, $n \ge 8d$, and $n/\log n \ge dk^{1+2/d}$, then

$$J(Q_n^*, \mu) \le 32\,d^{3/2}\sqrt{\frac{k^{1-2/d}\log n}{n}}$$

where $Q_n^*$ is the empirically optimal quantizer.

The relatively simple proof of this result is given in Section III-B. Note that this upper bound is always better than (3) if

$$n \le 2^{2^{2R}}$$

where $R$ is the rate of the quantizer defined by $R = (1/d)\log_2 k$. For practical values of the training set size, this condition is satisfied for medium bit rates. For example, for $n = 10^6$, the new upper bound is smaller than (3) if $R > 2.16$.

Just like the lower bound of Theorem 1, the new upper bound is also a decreasing function of the number of codepoints $k$ if $d = 1$. Comparing the two bounds leads to the conjecture that for very small values of $d$ (i.e., for $d = 1$ and perhaps for $d = 2, 3, 4$) the minimax distortion redundancy is a decreasing function of $k$, while for large values of $d$ it is an increasing function of $k$. We cannot prove this conclusion because of the gap between the upper and lower bounds, but for $d = 1$ it is possible to show values of $k_1 < k_2$ and $n$ such that the minimax distortion redundancy for $k_1$ codepoints is larger than that for $k_2$ codepoints. Intuitively, one might expect the minimax distortion redundancy to increase with $k$ since the number of unknown parameters (i.e., $kd$) is increasing with $k$. On the other hand, the distortion of an optimal quantizer becomes small as $k$ increases, and "smaller" quantities can be estimated with smaller variance. (The effect is the same as encountered in estimating the parameter $p$ based on $n$ Bernoulli($p$) random variables, where the MSE of the best unbiased estimate is $p(1-p)/n$.) Since the distortion of a vector quantizer decreases with $k$ typically as $O(k^{-2/d})$, this effect becomes negligible for large $d$. This might explain why our upper bound is decreasing in $k$ for $d = 1$ but is increasing in $k$ for $d > 2$. The proof of Theorem 2 provides further insight. The exact dependence of the minimax distortion redundancy on $k$ and $d$ is still a challenging open problem.
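The dependence on $k$ discussed in this section can be tabulated directly. The Python sketch below is an illustration only: it evaluates the factors $\sqrt{k^{1-4/d}}$ (from the lower bound) and $\sqrt{k^{1-2/d}}$ (from the new upper bound), ignoring all constants, the $\log n$ term, and the sample size. The helper names `k_factor` and `trend` are hypothetical.

```python
import math


def k_factor(k: int, d: int, c: int) -> float:
    """Return sqrt(k^(1 - c/d)); c = 4 for the lower bound, c = 2 for the upper."""
    return math.sqrt(k ** (1.0 - c / d))


def trend(d: int, c: int, ks=(4, 16, 64, 256)) -> str:
    """Classify how the k-dependent factor behaves as k grows."""
    vals = [k_factor(k, d, c) for k in ks]
    if all(b < a for a, b in zip(vals, vals[1:])):
        return "decreasing"
    if all(b > a for a, b in zip(vals, vals[1:])):
        return "increasing"
    return "flat or mixed"


for d in (1, 2, 3, 4, 8):
    print(d, "lower:", trend(d, 4), "upper:", trend(d, 2))
```

Under these simplifications the upper-bound factor decreases in $k$ only for $d = 1$, while the lower-bound factor decreases for $d < 4$, which is the gap the conjecture refers to.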

In recent work, Merhav and Ziv [13] studied a problem closely related to quantizer design. In their setup, the "design algorithm" is given $N$ bits of information (called side information bits) about the source. The question is how many side information bits are necessary and sufficient to obtain a $d$-dimensional rate $R$ quantizer ($R = (1/d)\log_2 k$, where $k$ is the number of codepoints) whose distortion is close to the optimum. Their main result gives the answer $N = 2^{dR}$ in an exponential sense, if $d$ is large. The sufficiency part of this statement was proved using (2). Note that this problem is more general than the problem we consider. The information bits are allowed to represent an arbitrary description of the source, of which discretized independent training samples are a special case. While the necessity part of this result does not translate directly to a lower bound on the convergence rate we study, it does have implications on how the minimax bounds can depend on the rate $R$ and dimension $d$. For example, it is not hard to see that the fact that $2^{d(R-\epsilon)}$ side information bits are not enough implies that the minimax distortion redundancy convergence rate cannot be upper-bounded in the form $c(2^{d(R-\epsilon)}/n)^{\delta}$ for any constants $c, \epsilon, \delta > 0$.

Our setting is slightly different from that studied in [13]. While Merhav and Ziv concentrated on stationary and ergodic sources, we only restrict the distribution to have support in a bounded subset of $\mathbb{R}^d$. It is not hard to see that in general there does not exist a real stationary process whose $d$-dimensional marginals have exactly our counterexample distribution. We presently do not see a way of constructing stationary and ergodic sources (as was done in [13] for determining the number of necessary side information bits) whose $d$-dimensional marginals approximate the counterexample distributions well enough so that the rather fine analysis of the lower bound carries over without destroying the $n^{-1/2}$ rate.

Finally, we would like to point out that our formulation of minimax redundancy has close connections with universal lossy coding. In particular, following Davisson's [5] definitions of various types of universality for lossless coding, Neuhoff et al. [14] defined three main types of universality in fixed-rate universal lossy coding. Of these three definitions, the one called strong minimax universality parallels our minimax redundancy formulation. A sequence of fixed-rate block codes is called strongly minimax universal with respect to a given class of sources if the distortion and rate of the codes converge with increasing blocklength to their respective OPTA (optimal performance theoretically attainable) functions uniformly over the source class. Thus by choosing sufficiently large blocklength for a strongly minimax universal code, one can achieve a preassigned level of performance regardless of which source in the class is encoded. In our case, the minimax distortion redundancy $J^*(n, k, d)$ converges to zero with increasing $n$ if and only if there exists a sequence of empirically designed quantizers $Q_n$ such that $J(Q_n, \mu)$ converges to zero uniformly over all $\mu$ in the given source class. The implication is similar to the universal coding case; by choosing the number of training samples large enough, the distortion redundancy of the empirically designed quantizer will be arbitrarily small for all sources in the class.

Neuhoff et al. [14] also defined a weaker notion of universality. In this definition, a sequence of codes with increasing blocklength is weakly minimax universal with respect to a class of sources if the rate and distortion converge (not necessarily uniformly) to their OPTA functions for each source in the class. Refining this definition, Shields [20] defined the notion of weak minimax convergence rates in universal coding. Using Shields's formulation, we can define weak minimax convergence rates in empirical quantizer design in the following way.

A nondecreasing positive function $n \mapsto f(n)$ is called a weak rate for empirical quantizer design for a class of $d$-dimensional sources $\mathcal{P}$ if the following simultaneously hold.

i) There exists a sequence of $k$-point empirical quantizers $\{Q_n\}$ such that for each $\mu \in \mathcal{P}$ there is a finite number $M(\mu)$ for which

$$J(Q_n, \mu) \le M(\mu)f(n), \quad \text{for all } n \ge 1. \qquad (5)$$

ii) For any sequence of $k$-point empirical quantizers $\{Q_n\}$ and function $g(n) = o(f(n))$, there exists a source $\mu \in \mathcal{P}$ such that $J(Q_n, \mu)/g(n)$ is unbounded as $n \to \infty$.

Note that the constant $M(\mu)$ in (5) can depend on the source distribution $\mu$. For this reason, the minimax lower bound in Theorem 1 does not imply that the weak rate for the class of sources over $S(0, \sqrt{d})$ cannot be less than $n^{-1/2}$. It is an interesting and challenging problem to find the weak rate for this source class.

III. Proofs

A. Proof of Theorem 1

The basic idea of the proof may be illustrated by the following simple example: let $d = 1$, $k = 3$, and assume that $\mu$ is concentrated on four points: $0$, $\epsilon$, $1 - \epsilon$, and $1$, such that either

$$\mu(0) = \mu(\epsilon) = \frac{1+\delta}{4}, \quad \mu(1-\epsilon) = \mu(1) = \frac{1-\delta}{4}$$

or

$$\mu(0) = \mu(\epsilon) = \frac{1-\delta}{4}, \quad \mu(1-\epsilon) = \mu(1) = \frac{1+\delta}{4}.$$

Then if $\epsilon$ is sufficiently small, the codepoints of the optimal quantizer are $0$, $\epsilon$, $1 - \epsilon/2$ in the first case, and $\epsilon/2$, $1 - \epsilon$, $1$ in the second case. Therefore, an empirical quantizer should "learn" from the data which of the two distributions generates the data. This leads to a hypothesis testing problem, whose error may be estimated by appropriate inequalities for the binomial distribution. Proper choice of the parameters $\epsilon$, $\delta$ yields the desired $\Omega(n^{-1/2})$ lower bound for the minimax expected distortion redundancy. The general case, $d \ge 1$, $k \ge 3$, is more complicated, but the basic idea is the same.
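The four-point example can be verified numerically. The Python sketch below is an illustration, not the authors' code: it assumes that each point of the heavier pair has mass $(1+\delta)/4$ and each point of the lighter pair has mass $(1-\delta)/4$ (mirroring the general construction in Step 2), and it finds the optimal three-point quantizer by brute force over interval partitions with centroid codepoints. All function names are hypothetical.

```python
from itertools import combinations


def optimal_codebook(points, masses, k):
    """Brute-force optimal k-point quantizer for a discrete scalar source.

    Searches all partitions of the sorted support into k intervals and uses
    the centroid of each cell as its codepoint. Returns (distortion, codebook).
    """
    order = sorted(range(len(points)), key=lambda i: points[i])
    xs = [points[i] for i in order]
    ps = [masses[i] for i in order]
    best = None
    for cuts in combinations(range(1, len(xs)), k - 1):
        bounds = (0,) + cuts + (len(xs),)
        codebook, dist = [], 0.0
        for a, b in zip(bounds, bounds[1:]):
            mass = sum(ps[a:b])
            c = sum(p * x for p, x in zip(ps[a:b], xs[a:b])) / mass
            codebook.append(c)
            dist += sum(p * (x - c) ** 2 for p, x in zip(ps[a:b], xs[a:b]))
        if best is None or dist < best[0]:
            best = (dist, codebook)
    return best


def four_point_source(eps, delta, heavy_left):
    """Support {0, eps, 1 - eps, 1} with the assumed masses (1 +/- delta)/4."""
    hi, lo = (1 + delta) / 4, (1 - delta) / 4
    masses = [hi, hi, lo, lo] if heavy_left else [lo, lo, hi, hi]
    return [0.0, eps, 1 - eps, 1.0], masses


for heavy_left in (True, False):
    pts, ms = four_point_source(eps=0.1, delta=0.2, heavy_left=heavy_left)
    print(heavy_left, optimal_codebook(pts, ms, k=3))
```

In both cases the optimal quantizer spends two codepoints on the heavier pair and merges the lighter pair at its midpoint, so choosing the codebook amounts to deciding which pair is heavier.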

We present the proof in several steps. Some of the technical details are given in the Appendix.

Step 1: First observe that we can restrict our attention to nearest neighbor quantizers, that is, to $Q_n$'s with the property that for all $x_1, \ldots, x_n$, the corresponding quantizer is a nearest neighbor quantizer. This follows from the fact that for any $Q_n$ not satisfying this property, we can find a nearest neighbor quantizer $Q_n'$ such that for all $\mu$, $J(Q_n', \mu) \le J(Q_n, \mu)$.

Step 2: Clearly,

$$\sup_{\mu} J(Q_n, \mu) \ge \max_{\mu \in \mathcal{D}} J(Q_n, \mu)$$

where $\mathcal{D}$ is any restricted class of distributions on $S(0, \sqrt{d})$. We define $\mathcal{D}$ as follows: each member of $\mathcal{D}$ is concentrated on the set of $2m = 4k/3$ fixed points $\{z_i, z_i + w: i = 1, \ldots, m\}$, where $w = (\Delta, 0, 0, \ldots, 0)$ is a fixed $d$-vector, and $\Delta$ is a small positive number to be determined later. The positions of $z_1, \ldots, z_m \in S(0, \sqrt{d})$ satisfy the property that the distance between any two of them is greater than $A\Delta$, where the value of $A$ is determined in Step 5 below. For the sake of simplicity, we assume that $k$ is divisible by 3. (This assumption is clearly insignificant.) Let $\delta \le 1/2$ be a positive number. For each $i$, set

$$\mu(z_i) = \mu(z_i + w) = \text{either } \frac{1-\delta}{2m} \text{ or } \frac{1+\delta}{2m}$$

such that exactly half of the pairs $(z_i, z_i + w)$ have mass $(1-\delta)/m$, and the other half of the pairs have mass $(1+\delta)/m$, so that the total mass adds up to one. Let $\mathcal{D}$ contain all such distributions. The cardinality of $\mathcal{D}$ is $M = \binom{m}{m/2}$. Denote the members of $\mathcal{D}$ by $\mu_1, \mu_2, \ldots, \mu_M$.

Step 3: Let $\mathcal{Q}$ denote the collection of $k$-point quantizers $Q$ such that for $m/2$ values of $i \in \{1, \ldots, m\}$, $Q$ has codepoints at both $z_i$ and $z_i + w$, and for the remaining $m/2$ values of $i$, $Q$ has a single codepoint at $z_i + w/2$. If $A \ge \sqrt{2/(1-\delta)} + 1$, then for any $k$-point quantizer $\hat{Q}$ there exists a $Q$ in $\mathcal{Q}$ such that, for all $\mu$ in $\mathcal{D}$, $D(Q) \le D(\hat{Q})$. The proof of this is given in the Appendix.

Step 4: Consider a distribution $\mu_j \in \mathcal{D}$ and the corresponding optimal quantizer $Q^{(j)}$. Clearly, from Step 3, if $A \ge \sqrt{2/(1-\delta)} + 1$, then for the $m/2$ values of $i$ in $\{1, \ldots, m\}$ that have $\mu_j(\{z_i, z_i + w\}) = (1+\delta)/m$, $Q^{(j)}$ has codepoints at both $z_i$ and $z_i + w$. For the remaining $m/2$ values of $i$ there is a single codepoint at $z_i + w/2$.

For any distribution in $\mathcal{D}$ and any quantizer in $\mathcal{Q}$, it is easy to see that the distortion of the quantizer is between $(1-\delta)\Delta^2/8$ and $(1+\delta)\Delta^2/8$.

Step 5: Let $\mathcal{Q}_n$ denote the family of empirically designed quantizers such that for every fixed $x_1, \ldots, x_n$, we have $Q_n(\cdot, x_1, \ldots, x_n) \in \mathcal{Q}$. Since $\delta \le 1/2$, the property of the optimal quantizer described in Step 4 is always satisfied if we take $A = 3$. In particular, if $A = 3$, we have

$$\inf_{Q_n} \max_{\mu \in \mathcal{D}} J(Q_n, \mu) = \min_{Q_n \in \mathcal{Q}_n} \max_{\mu \in \mathcal{D}} J(Q_n, \mu)$$

and it suffices to lower-bound the quantity on the right-hand side.

Step 6: Let $Z$ be a random variable which is uniformly distributed on the set of integers $\{1, 2, \ldots, M\}$. Then, for any $Q_n$, we obviously have

$$\max_{\mu \in \mathcal{D}} J(Q_n, \mu) \ge \mathbb{E}J(Q_n, \mu_Z) = \frac{1}{M}\sum_{i=1}^{M} J(Q_n, \mu_i).$$

Step 7:

$$\min_{Q_n \in \mathcal{Q}_n} \mathbb{E}J(Q_n, \mu_Z) = \mathbb{E}J(Q_n^*, \mu_Z)$$

where $Q_n^*$ is the "empirically optimal" (or "maximum-likelihood") quantizer from $\mathcal{Q}_n$, that is, if $N_i$ denotes the number of $X_j$'s falling in $\{z_i, z_i + w\}$, then $Q_n^*$ has a codepoint at both $z_i$ and $z_i + w$ if the corresponding $N_i$ is one of the $m/2$ largest values. For the other $i$'s (i.e., those with the $m/2$ smallest $N_i$'s) $Q_n^*$ has a codepoint at $z_i + w/2$.

The proof is given in the Appendix.

Step 8: By symmetry, we have

$$\mathbb{E}J(Q_n^*, \mu_Z) = J(Q_n^*, \mu_1).$$

The rest of the proof involves bounding $J(Q_n^*, \mu_1)$ from below, where $Q_n^*$ is the empirically optimal quantizer.
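The distortion range in Step 4 can be checked by direct enumeration. The Python sketch below is illustrative and rests on one assumption taken from the construction: a pair $\{z_i, z_i + w\}$ that carries its own two codepoints contributes no distortion, while a pair served by a single codepoint at $z_i + w/2$ contributes its mass times $(\Delta/2)^2$. The names are hypothetical and the parameter values are arbitrary.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def distortion(single_pairs, heavy_pairs, m, delta, Delta):
    """Distortion of a quantizer in the restricted class under a restricted source.

    single_pairs: indices i served by one codepoint at z_i + w/2
    heavy_pairs:  indices i whose pair has total mass (1 + delta)/m
    """
    total = Fraction(0)
    for i in single_pairs:
        mass = (1 + delta) / m if i in heavy_pairs else (1 - delta) / m
        total += mass * (Delta / 2) ** 2
    return total


m, delta, Delta = 4, Fraction(1, 4), Fraction(1, 10)
low = (1 - delta) * Delta**2 / 8
high = (1 + delta) * Delta**2 / 8
step = Delta**2 * delta / (2 * m)  # extra distortion per misplaced pair

subsets = [set(s) for s in combinations(range(m), m // 2)]
assert len(subsets) == comb(m, m // 2)  # cardinality M of the source class
for heavy in subsets:
    for single in subsets:
        mistakes = len(single & heavy)
        d = distortion(single, heavy, m, delta, Delta)
        assert low <= d <= high
        assert d == low + mistakes * step
```

The last assertion also shows that each heavy pair wrongly served by a single codepoint adds the same amount, $\Delta^2\delta/(2m)$, to the distortion.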

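Step 9: Recall that the vector of random integers $(N_1, \ldots, N_m)$ is multinomially distributed with parameters $(n, q_1, \ldots, q_m)$, where

$$q_1 = \cdots = q_{m/2} = \frac{1-\delta}{m}, \qquad q_{m/2+1} = \cdots = q_m = \frac{1+\delta}{m}.$$

Let $N_{\sigma(1)}, \ldots, N_{\sigma(m)}$ be a reordering of the $N_i$'s such that $N_{\sigma(1)} \le N_{\sigma(2)} \le \cdots \le N_{\sigma(m)}$. (In case of equal values, break ties according to indices.) Let $p_j$ ($j = 1, \ldots, m/2$) be the probability of the event that among $N_{\sigma(1)}, \ldots, N_{\sigma(m/2)}$, there are exactly $j$ of the $N_i$'s with $i > m/2$ (i.e., the "maximum-likelihood" estimate makes $j$ mistakes). Then it is easy to see that

$$J(Q_n^*, \mu_1) = \frac{\Delta^2\delta}{2m}\sum_{j=1}^{m/2} j\,p_j$$

since one "mistake" increases the distortion by $\Delta^2\delta/(2m)$.

Step 10: From now on, we investigate the quantity

$$\sum_{j=1}^{m/2} j\,p_j$$

that is, the expected number of mistakes. First we use the trivial bound

$$\sum_{j=1}^{m/2} j\,p_j \ge j_0 \sum_{j=j_0}^{m/2} p_j$$

with $j_0$ to be chosen later. $\sum_{j=j_0}^{m/2} p_j$ is the probability that the maximum-likelihood decision makes at least $j_0$ mistakes. The key observation is that this probability may be bounded below by the probability that at least $2j_0$ of the events $A_1, \ldots, A_{m/2}$ hold, where

$$A_i = \{N_i > N_{m/2+i}\}.$$

In other words,

$$\sum_{j=j_0}^{m/2} p_j \ge P\left\{\sum_{i=1}^{m/2} 1_{A_i} \ge 2j_0\right\}.$$

Proof: Define the following sets of indices:

$$S_1 = \{\sigma(1), \ldots, \sigma(m/2)\} \cap \{m/2+1, \ldots, m\}, \qquad S_2 = \{\sigma(1), \ldots, \sigma(m/2)\} \cap \{1, \ldots, m/2\}.$$

Then the maximum-likelihood decision makes $|S_1|$ mistakes.

If $i \in S_2$ and $N_i > N_{m/2+i}$, then $m/2 + i \in S_1$. Thus the number of indices $i$ for which $N_i > N_{m/2+i}$ is bounded from above by $|S_1| + m/2 - |S_2| = 2|S_1|$, since $|S_2| = m/2 - |S_1|$. □

Step 11: Thus we need a lower bound on the tail of the distribution of the random variable $\sum_{i=1}^{m/2} 1_{A_i}$. First we obtain a suitable lower bound for its expected value. By symmetry,

$$\mathbb{E}\sum_{i=1}^{m/2} 1_{A_i} = \frac{m}{2}\,P\{N_1 > N_{m/2+1}\}. \qquad (7)$$

Now, bounding conservatively, we have

$$P\{N_1 > N_{m/2+1}\} \ge P\{N_1 > n/m \text{ and } N_{m/2+1} < n/m\} \ge P\{N_1 > n/m\}\,P\{N_{m/2+1} < n/m\}.$$

The last inequality follows by Mallows' inequality (see Mallows [12]) which states that if $N_1, \ldots, N_m$ are multinomially distributed, then

$$P\{N_1 \ge t_1, N_2 \ge t_2, \ldots, N_m \ge t_m\} \le \prod_{i=1}^{m} P\{N_i \ge t_i\}.$$

Finally, we approximate the last two binomial probabilities by normals. To this end, we use the Berry-Esseen inequality (see, e.g., Chow and Teicher [4]), which states that if $Z_1, \ldots, Z_n$ are i.i.d. random variables with $\mathbb{E}Z_1 = 0$, $\mathbb{E}[Z_1^2] = \sigma^2$, and $\mathbb{E}|Z_1|^3 < \infty$, then

$$\sup_x \left| P\left\{\frac{1}{\sigma\sqrt{n}}\sum_{i=1}^{n} Z_i \le x\right\} - \Phi(x) \right| \le \frac{c\,\mathbb{E}|Z_1|^3}{\sigma^3\sqrt{n}}$$

for a universal constant $c$, where $\Phi$ is the distribution function of a standard normal random variable. Choose $\delta = \sqrt{m/n}$. Observe that $N_1$ is the sum of $n$ i.i.d. Bernoulli$((1-\delta)/m)$ random variables. Then the Berry-Esseen inequality implies that if $n \ge 8m/\Phi(-2)^2$, then

$$P\{N_1 > n/m\} \ge \frac{\Phi(-2)}{2}$$

and similarly

$$P\{N_{m/2+1} < n/m\} \ge \frac{\Phi(-2)}{2}.$$

Therefore, by (7) we get

$$\mathbb{E}\sum_{i=1}^{m/2} 1_{A_i} \ge \frac{m\,\Phi(-2)^2}{8}. \qquad (8)$$

Step 12: To obtain the desired lower bound for

$$P\left\{\sum_{i=1}^{m/2} 1_{A_i} \ge 2j_0\right\}$$

we use the following elementary inequality: if the random variable $Z$ satisfies $P\{Z \in [0, B]\} = 1$, then

$$P\left\{Z \ge \frac{\mathbb{E}Z}{2}\right\} \ge \frac{\mathbb{E}Z}{2B}. \qquad (9)$$

To see this, notice that for $a$ in $[0, B]$

$$\mathbb{E}Z \le a + B\,P\{Z \ge a\}$$

and substitute $a = \mathbb{E}Z/2$.

Step 13: To apply this inequality, choose $j_0 = m\,\Phi(-2)^2/32$. Then (8) implies that

$$2j_0 \le \frac{1}{2}\,\mathbb{E}\sum_{i=1}^{m/2} 1_{A_i}$$

and, therefore,

$$P\left\{\sum_{i=1}^{m/2} 1_{A_i} \ge 2j_0\right\} \ge P\left\{\sum_{i=1}^{m/2} 1_{A_i} \ge \frac{1}{2}\,\mathbb{E}\sum_{i=1}^{m/2} 1_{A_i}\right\} \ge \frac{1}{m}\,\mathbb{E}\sum_{i=1}^{m/2} 1_{A_i} \ge \frac{\Phi(-2)^2}{8}$$

where the second inequality follows from (9) and the last inequality follows from (8).

Step 14: Collecting everything, we have that

$$\inf_{Q_n}\sup_{\mu} J(Q_n, \mu) \ge \frac{\Phi(-2)^4}{512}\,\Delta^2\sqrt{\frac{m}{n}}$$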

where $\Delta$ is any positive number with the property that $m$ pairs of points $\{z_i, z_i + w\}$ can be placed in $S(0, \sqrt{d})$ such that the distance between any two of the $z_i$'s is at least $3\Delta$. In other words, we need to find a (desirably large) $\Delta$ such that $m$ points $z_1, \ldots, z_m$ can be packed into the ball $S(0, \sqrt{d} - \Delta)$. (We decrease the radius of the ball by $\Delta$ to make sure that the $(z_i + w)$'s also fall in the ball $S(0, \sqrt{d})$.) Thus we need a good lower bound for the cardinality of the maximal $3\Delta$-packing of $S(0, \sqrt{d} - \Delta)$. It is well known (see Kolmogorov and Tikhomirov [8]) that the cardinality of the maximal packing is lower-bounded by the cardinality of the minimal covering, that is, by the minimal number of balls of radius $3\Delta$ whose union covers $S(0, \sqrt{d} - \Delta)$. But this number is clearly bounded from below by the ratio of the volume of $S(0, \sqrt{d} - \Delta)$ and that of $S(0, 3\Delta)$. Therefore, $m$ points can certainly be packed in $S(0, \sqrt{d} - \Delta)$ as long as

$$m \le \left(\frac{\sqrt{d} - \Delta}{3\Delta}\right)^d.$$

If $\Delta \le \sqrt{d}/4$ (which is satisfied by our choice of $\Delta$ below), the above inequality holds if

$$m \le \left(\frac{\sqrt{d}}{4\Delta}\right)^d.$$

Thus the choice

$$\Delta = \frac{\sqrt{d}}{4m^{1/d}}$$

satisfies the required property. Resubstitution of this value proves the theorem. □

B. Proof of Theorem 2

The first step in the analysis of the performance of the empirical quantizer $Q_n^*$ is the following lemma.

Lemma 1: Let $S(x, r)$ denote the closed $d$-dimensional sphere of radius $r$ centered at $x$. Let $\rho > 0$ and let $N(\rho)$ denote the cardinality of the minimum $\rho$-covering of $S(0, r)$, that is, $N(\rho)$ is the smallest integer $N$ such that there exist points $\{y_1, \ldots, y_N\} \subset S(0, r)$ with the property

$$\max_{x \in S(0, r)} \min_{1 \le i \le N} \|x - y_i\| \le \rho. \qquad (10)$$

Then, for all $\rho \le 2r$ we have

$$N(\rho) \le \left(\frac{4r}{\rho}\right)^d.$$

Proof: By a classical observation of Kolmogorov and Tikhomirov [8] the covering (10) exists if it is impossible to construct another set $\{z_1, \ldots, z_{N+1}\} \subset S(0, r)$ which is $\rho$-separated, that is,

$$\|z_i - z_j\| > \rho \quad \text{for all } i \ne j.$$

Let us now consider an arbitrary $\rho$-separated set of cardinality $N + 1$. Then the open balls of radius $\rho/2$ centered at the $z_i$ are disjoint and their union is included in $S(0, r + \rho/2)$. Also, if $\rho/2 \le r$, then $S(0, r + \rho/2) \subset S(0, 2r)$. Thus such a separating set cannot exist as long as $N + 1$ is greater than the ratio of the volumes of $S(0, 2r)$ and $S(0, \rho/2)$, that is,

$$\left(\frac{4r}{\rho}\right)^d < N + 1.$$

Since there exists an integer $N \le (4r/\rho)^d$ which satisfies the above inequality, the lemma is proved. □

Corollary 1: Let $0 < \epsilon < 8d$. There exists a finite collection of $k$-point quantizers $\mathcal{Q}_\epsilon$ such that

i) the cardinality of $\mathcal{Q}_\epsilon$ is bounded as

$$|\mathcal{Q}_\epsilon| \le \left(\frac{16d}{\epsilon}\right)^{kd};$$

ii) all quantizers in $\mathcal{Q}_\epsilon$ have their codepoints inside $S(0, \sqrt{d})$;

iii) for any $k$-point nearest neighbor quantizer $Q$ whose codepoints are contained in $S(0, \sqrt{d})$, there exists a $Q' \in \mathcal{Q}_\epsilon$ such that for all $x \in S(0, \sqrt{d})$

$$\left|\,\|x - Q(x)\|^2 - \|x - Q'(x)\|^2\,\right| \le \epsilon.$$

Proof: Let $\rho = \epsilon/(4\sqrt{d})$. Then $0 < \rho < 2\sqrt{d}$, and by Lemma 1 there exists a $\rho$-covering set of points $\{y_1, \ldots, y_N\} \subset S(0, \sqrt{d})$ with $N \le (4\sqrt{d}/\rho)^d = (16d/\epsilon)^d$. Let $\mathcal{Q}_\epsilon$ be the collection of all $k$-point nearest neighbor quantizers whose codepoints are from the covering set $\{y_1, \ldots, y_N\}$. Then

$$|\mathcal{Q}_\epsilon| \le N^k \le \left(\frac{16d}{\epsilon}\right)^{kd}.$$

If $\{x_1, \ldots, x_k\}$ are the codepoints of $Q$, then there exists a quantizer $Q' \in \mathcal{Q}_\epsilon$ with codepoints $\{x_1', \ldots, x_k'\}$ such that $\|x_i - x_i'\| \le \rho$ for all $i$. If $Q(x) = x_j$, we have by the nearest neighbor property that

$$\|x - Q'(x)\|^2 - \|x - Q(x)\|^2 \le \|x - x_j'\|^2 - \|x - x_j\|^2 \le \left(\|x - x_j'\| + \|x - x_j\|\right)\rho \le 4\sqrt{d}\,\rho = \epsilon.$$

The inequality $\|x - Q(x)\|^2 - \|x - Q'(x)\|^2 \le \epsilon$ may be proved similarly. □

Corollary 2: For all distributions such that $P\{\|X\| \le \sqrt{d}\} = 1$, there exists a $k$-point quantizer ($k \ge 1$) whose codepoints are contained in $S(0, \sqrt{d})$ and whose distortion satisfies

$$D(Q) \le 16dk^{-2/d}.$$

Proof: If $k < 2^d$, then the statement trivially holds for the quantizer having one codepoint at the origin. Otherwise, let $\rho = 4\sqrt{d}\,k^{-1/d}$. Then $\rho \le 2\sqrt{d}$ and by Lemma 1 there exists a set of points $\{y_1, \ldots, y_k\} \subset S(0, \sqrt{d})$ that $\rho$-covers $S(0, \sqrt{d})$. Letting $Q$ be the nearest neighbor quantizer with these codepoints, we get $D(Q) \le \rho^2 = 16dk^{-2/d}$. □

Let $0 < \epsilon < 8d$, and let $\mathcal{Q}_\epsilon$ be a set of quantizers satisfying properties i), ii), and iii) of Corollary 1. Let $\bar{Q} \in \mathcal{Q}_\epsilon$ denote a quantizer whose distortion is minimal in $\mathcal{Q}_\epsilon$, that is,

$$D(\bar{Q}) \le D(Q) \quad \text{for all } Q \in \mathcal{Q}_\epsilon.$$

Then it is clear that $D(\bar{Q}) \le D^* + \epsilon$, where $D^*$ denotes the minimum distortion achievable by any quantizer. Let $\hat{Q}_n$ be a quantizer in $\mathcal{Q}_\epsilon$ such that for all $x \in S(0, \sqrt{d})$

$$\|x - \hat{Q}_n(x)\|^2 \le \|x - Q_n^*(x)\|^2 + \epsilon.$$

Such a quantizer exists by Corollary 1. Then clearly, by the definition of the empirically optimal quantizer $Q_n^*$,

$$D_n(\hat{Q}_n) \le D_n(Q_n^*) + \epsilon \le D_n(Q) + \epsilon$$

for all $Q \in \mathcal{Q}_\epsilon$.
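Corollary 2 can be sanity-checked in the scalar case. The Python sketch below is an illustration only: for $d = 1$ the ball $S(0, \sqrt{d})$ is the interval $[-1, 1]$, and a uniform codebook (our choice, not the covering used in the proof) already has worst-case squared error $k^{-2}$, well within the $16dk^{-2/d}$ guarantee. Since the worst-case squared error upper-bounds the distortion under any distribution on $[-1, 1]$, this is enough for the check. The function names are hypothetical.

```python
def uniform_codebook(k: int):
    """Centers of k equal-length cells partitioning [-1, 1]."""
    return [-1 + (2 * i + 1) / k for i in range(k)]


def worst_case_squared_error(codebook, grid: int = 2001):
    """Largest nearest-neighbor squared error over a fine grid of [-1, 1].

    The grid always contains the endpoints -1 and 1, where the error of the
    uniform codebook attains its maximum 1/k.
    """
    xs = [-1 + 2 * j / (grid - 1) for j in range(grid)]
    return max(min((x - y) ** 2 for y in codebook) for x in xs)


d = 1
for k in range(1, 11):
    err = worst_case_squared_error(uniform_codebook(k))
    assert err <= 16 * d * k ** (-2 / d)
```

The slack between $k^{-2}$ and $16k^{-2}$ reflects that the corollary is stated for all dimensions with a simple covering argument, not tuned for $d = 1$.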

The next lemma is based on ideas of Vapnik and Chervonenkis [22].

Lemma 2: For all b > <.. we have

P{D(Qn) -.

+ P< max

Proof: If

■ 26

then for each () e Q„

Therefore,

P< min Dn{Q) <

I Q: D(Q)>D(Q)+26

< P< max QCQc

- D„

But if D{Qn)-L P{D(Qn) -

min Dn(Q) <Dn(Q) + e

Q: D{Q)>D{Q)+26

[ Q: D{Q)>D{Q)+26 ^ +

< P< max

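Lemma 3: Let $Q \in \mathcal{Q}_\epsilon$. Then for all $\gamma > 0$

$$P\left\{\frac{D(Q) - D_n(Q)}{\sqrt{D(Q)}} > \gamma\right\} \le e^{-3n\gamma^2/(32d)}.$$

Proof: The probability is clearly zero if $\gamma > \sqrt{D(Q)}$. If $\gamma \le \sqrt{D(Q)}$, we may use Bernstein's inequality [2]:

$$P\left\{D(Q) - D_n(Q) > \gamma\sqrt{D(Q)}\right\} \le e^{-n\gamma^2 D(Q)/\left[2\sigma^2 + (2/3)4d\gamma\sqrt{D(Q)}\right]}$$

where $\sigma^2 = \operatorname{var}(\|X - Q(X)\|^2)$. But observe that $\|X - Q(X)\|^2 \le 4d$ with probability one, and, therefore, $\sigma^2 \le 4dD(Q)$, and the statement follows. □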
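The last step of this proof, from the Bernstein exponent to the constant $3/(32d)$, can be verified numerically. The sketch below assumes the one-sided Bernstein bound $\exp(-nt^2/(2\sigma^2 + (2/3)bt))$ for the mean of $n$ i.i.d. variables with variance $\sigma^2$ and deviations bounded by $b$, taken here with $b = 4d$, $t = \gamma\sqrt{D(Q)}$, and the worst case $\sigma^2 = 4dD(Q)$. The function names and the grid of test values are illustrative only.

```python
import math


def bernstein_exponent(n: int, t: float, sigma2: float, b: float) -> float:
    """Exponent n t^2 / (2 sigma^2 + (2/3) b t) of Bernstein's inequality."""
    return n * t * t / (2.0 * sigma2 + (2.0 / 3.0) * b * t)


def lemma3_exponent(n: int, gamma: float, d: int) -> float:
    """Simplified exponent 3 n gamma^2 / (32 d) stated in the lemma."""
    return 3.0 * n * gamma * gamma / (32.0 * d)


def compare_exponents(n: int, d: int, D: float, gamma: float):
    """Return (Bernstein exponent, lemma exponent) for distortion D."""
    t = gamma * math.sqrt(D)   # deviation of D(Q) - D_n(Q)
    sigma2 = 4.0 * d * D       # worst-case variance bound
    return bernstein_exponent(n, t, sigma2, 4.0 * d), lemma3_exponent(n, gamma, d)


# For gamma <= sqrt(D) the Bernstein exponent dominates the simplified one,
# so the simplified bound on the probability is valid (though weaker).
for D in (0.5, 1.0, 3.0):
    for frac in (0.1, 0.5, 1.0):
        b_exp, l_exp = compare_exponents(1000, 4, D, frac * math.sqrt(D))
        print(D, frac, b_exp >= l_exp * (1 - 1e-12))
```

The two exponents coincide at $\gamma = \sqrt{D(Q)}$, which is the largest value of $\gamma$ for which the probability can be nonzero.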

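Corollary 3: For all $\delta > \epsilon$

$$P\{D(\bar{Q}_n) - D(\bar{Q}) > 2\delta\} \le (|\mathcal{Q}_\epsilon| + 1)\, e^{-3n(\delta-\epsilon)^2/(32d(D(\bar{Q}) + 2(\delta-\epsilon)))}.$$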

Proof: By Lemma 3 we have

If, in addition, $Q$ is such that $D(Q) > D(\bar{Q}) + 2\delta$, then by the monotonicity of the function $x - c\sqrt{x}$ (for $c > 0$ and $x > c^2/4$)

P< max

< \Qe\ max P

Q€Qc 1

< \Qe\e

-3n<52 / (32d(D(Q)+26)) o_3n(i-e)2/(32d(D(Q)+2(6-i)))

> 26, then there exists an () € Q„- such

■ 26 and Dn(Q) < Dn(Q) + e. Thus

On the other hand, by Bernstein's inequality

and applying Lemma 2 finishes the proof. □

Proof of Theorem 2: Since the distribution of $X$ is supported on $S(0, \sqrt{d})$, we have that with probability one, $D(\bar{Q}_n) - D(\bar{Q}) \le 4d$, hence for every $u > 0$

Thus it follows from Corollary 3 that for any $u > \epsilon$

-3n(M-e)2/(32d(D(Q)+2(«-e)))

If D(Q) > (32of log(8of|Qt\^/n)/n), then with

' Z2dD(Q)

we have u — a < D(Q). In such a case

<u + i

,-n(«-e)2/(32dD(Q))

' Z2dD(Q)

On the other hand, if take

< 32 d \

_l_ 1 _l_

/n)/n, then

> 7 > < e~3nr /(-32d\

, and, therefore

-n(u-e)/(32d)

Noting that we obtain

;)-£>* <3e+max

l32dD(Q)

Take e = 16dn 1//2. and also recall that by Corollary 2

kl-2/d

IV. Concluding Remarks

I jA.—4./d

cod\j- < J*(n, /;, d)

ki-2/d

if $k = 2^{dR}$ for a constant rate $R$, we obtain that the per-dimension minimax distortion redundancy is approximately

for large d and n.

However, some interesting questions remain unanswered. We conjecture that the factor of $\sqrt{\log n}$ in the upper bound of Theorem 2 might be eliminated, and the minimax expected distortion redundancy is some constant times

< D* + € < 16dk~2'd + 32dk~2/d

whenever $n \ge k^{4/d}$. Substituting these values into the above inequality, we obtain the inequalities shown at the bottom of this page, if $dk^{1-2/d}\log n \ge 15$, $kd \ge 8$, and $n \ge 8d$. In particular, if $n/\log n \ge dk^{1+2/d}$, then

The main results of the paper are new upper and lower bounds for the minimax expected distortion redundancy of empirical quantizers. Combining these with previously known bounds we see that for some universal constants $c_0, c_1 > 0$

for some values of $a \in [1, 3/2]$ and $b \in [2, 4]$.

Another challenging problem is to find (or give bounds on) the weak minimax convergence rate defined at the end of Section II. In particular, Pollard's result [16] suggests that the weak minimax rate can still be $O(1/n)$ for a class of sources with sufficiently regular and smooth densities. We have no conjecture at present, however, as to what the weak rate might be for the class of all sources concentrated on $S(0, \sqrt{d})$.

Appendix

Proof of Step 3: Let $C = \{y_1, \ldots, y_k\}$ be the codebook of $Q$. Consider the Voronoi partition of $\mathbb{R}^d$ induced by the set of points $\{z_i, z_i + w: 1 \le i \le m\}$ and for each $i$ define $V_i$ as the union of the two Voronoi cells belonging to $z_i$ and $z_i + w$. Furthermore, let $m_i$ be the cardinality of $C \cap V_i$. A new nearest neighbor quantizer $\hat{Q}$ with codebook $\hat{C}$ is constructed as follows. Start with $\hat{C}$ empty. For all $i$

• if $m_i \ge 2$, put $z_i$ and $z_i + w$ into $\hat{C}$.

• if $m_i = 1$ or $m_i = 0$, put $z_i + w/2$ into $\hat{C}$.
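The two rules above amount to a simple procedure. The following is a minimal sketch, assuming the counts $m_i$ (the number of original codepoints falling in each $V_i$) have already been computed; the function names and the example points are hypothetical and serve only to illustrate the construction.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def add(p: Point, q: Point) -> Point:
    """Componentwise sum of two points."""
    return tuple(a + b for a, b in zip(p, q))


def build_codebook(z: Sequence[Point], w: Point, counts: Sequence[int]) -> List[Point]:
    """Build the new codebook from the points z_i, the offset w,
    and the counts m_i of original codepoints in each cell V_i."""
    half_w = tuple(0.5 * c for c in w)
    codebook: List[Point] = []
    for z_i, m_i in zip(z, counts):
        if m_i >= 2:
            # Two or more original codepoints in V_i: keep both z_i and z_i + w.
            codebook.append(tuple(z_i))
            codebook.append(add(z_i, w))
        else:
            # Zero or one original codepoint in V_i: use the midpoint z_i + w/2.
            codebook.append(add(z_i, half_w))
    return codebook


# Hypothetical example with m = 3 pairs on a line and offset w = (1, 0).
z = [(0, 0), (10, 0), (20, 0)]
w = (1, 0)
print(build_codebook(z, w, counts=[2, 1, 0]))
```

A cell with $m_i = 0$ contributes one codepoint although the original codebook spent none there, which is why the new codebook can be larger than the original.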

Note that $\hat{C}$ may contain more than $k$ codepoints, but this will be fixed later. Define

K* + w)

{Zi + w}).

For most practical values of the dimension $d$, the number of codepoints $k$, and the number of training vectors $n$, the two bounds are fairly close to each other, essentially describing the behavior of the minimax distortion. For example, it follows that the minimax distortion redundancy, as a function of the number of training samples $n$, is on the order of $n^{-1/2}$. Also,

Then we have the following:

• if $m_i \ge 2$, then $D_i(\hat{Q}) = 0$ so that $D_i(Q) \ge D_i(\hat{Q})$;

• if $m_i = 1$, then there are two cases:

1) $Q(z_i) = Q(z_i + w) \in V_i$. Then $D_i(Q) \ge D_i(\hat{Q})$ since $\hat{Q}(z_i) = \hat{Q}(z_i + w) = z_i + w/2$ is the optimal choice under the condition that both $z_i$ and $z_i + w$ are mapped into the same codepoint;

^ fn I V n \/n

kl-2/d

32 kd2

2) either $z_i$ or $z_i + w$ is mapped by $Q$ to a codepoint outside $V_i$. Say $Q(z_i) \notin V_i$. Then

2m 1 ±6 2m

( , j Zi

where the second inequality follows by the triangle inequality. (Here $\pm$ means $+$ if $\mu$ puts mass $(1 + \delta)/m$ on $\{z_i, z_i + w\}$, and $-$ otherwise.) On the other hand, $D_i(\hat{Q}) = (1 \pm \delta)\Delta^2/(4m)$ so that

• if $m_i = 0$, then both $Q(z_i)$ and $Q(z_i + w)$ are outside $V_i$. Thus

which implies

1 ± <S A2

so that D,:

if A > 2.

< D(Q) (k k) ^ 4 '

-1)2-1).

Therefore,

'+ ^ ^ 4m

and this is no more than L

Proof of Step 7: Let $(Y, Y_1, \ldots, Y_n)$ be jointly distributed as the mixture

where $\mu_\gamma^{n+1}$ is the $(n+1)$-fold product of $\mu_\gamma$. Then for any $Q_n$,

quantizer $Q_n$ achieving the minimum in (6) chooses its codebook as a function of the vector $(N_1, \ldots, N_m)$. Thus it suffices to restrict our attention to empirical quantizers that choose their codebook only as a function of $(N_1, \ldots, N_m)$. Recall that each quantizer in $\mathcal{Q}$ is such that for each $i$ it either has one codepoint at $z_i + w/2$ or has codepoints at both $z_i$ and $z_i + w$. Since $k = 3m/2$, there must be $m/2$ codepoints of the first kind, and $m$ of the second.

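We will represent the distribution $\mu_\gamma$ as an $m$-vector, $\gamma = (\gamma_1, \ldots, \gamma_m) \in \Gamma_m \subset \{-1, 1\}^m$, with

$$\Gamma_m = \left\{\gamma \in \{-1, 1\}^m : \sum_{i=1}^m \gamma_i = 0\right\}.$$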

We write P

Thus we conclude that $D(Q) \ge D(\hat{Q})$, and we are done if $\hat{C}$ has no more than $k$ codepoints. If $\hat{C}$ contains $\hat{k} > k$ codepoints, pick $\hat{k} - k$ arbitrary pairs $\{z_i, z_i + w\} \subset \hat{C}$ and replace them with the corresponding codepoint $z_i + w/2$. We thus obtain a nearest neighbor quantizer $\tilde{Q}$. Each such replacement increases the distortion by no more than $(1 + \delta)\Delta^2/(4m)$, so that

to denote the probability of the event under the multinomial distribution with parameters $(n, q_1, \ldots, q_m)$, where

$$q_i = \frac{1 + \gamma_i\delta}{m}.$$

We will represent a quantizer's choice of the codebook as a vector $\alpha = (\alpha_1, \ldots, \alpha_m) \in \Gamma_m$, with $\alpha_i = -1$ indicating one codepoint at $z_i + w/2$ and $\alpha_i = 1$ indicating codepoints at both $z_i$ and $z_i + w$.
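The sign-vector encodings can be made concrete with a short sketch. It assumes that the pair $\{z_i, z_i + w\}$ carries total mass $(1 + \gamma_i\delta)/m$, in line with the masses $(1 \pm \delta)/m$ used in the proof of Step 3, and that the counts $N_i$ record how many training points fall on each pair. All function names and example values are illustrative and not from the paper.

```python
import random
from typing import List, Sequence


def pair_masses(gamma: Sequence[int], delta: float) -> List[float]:
    """Mass (1 + gamma_i * delta) / m placed on the i-th pair."""
    m = len(gamma)
    return [(1.0 + g * delta) / m for g in gamma]


def num_codepoints(alpha: Sequence[int]) -> int:
    """alpha_i = 1 uses two codepoints (z_i and z_i + w); alpha_i = -1 uses one."""
    return sum(2 if a == 1 else 1 for a in alpha)


def sample_counts(gamma: Sequence[int], delta: float, n: int,
                  rng: random.Random) -> List[int]:
    """Draw multinomial counts (N_1, ..., N_m) for n training points."""
    q = pair_masses(gamma, delta)
    counts = [0] * len(q)
    for i in rng.choices(range(len(q)), weights=q, k=n):
        counts[i] += 1
    return counts


# Hypothetical example with m = 4 pairs; both vectors have entries summing to zero.
gamma = [1, -1, 1, -1]
alpha = [1, 1, -1, -1]
print(sum(pair_masses(gamma, 0.5)))   # total mass of the source
print(num_codepoints(alpha))          # codebook size for this alpha
print(sample_counts(gamma, 0.5, 100, random.Random(0)))
```

When the entries of $\gamma$ sum to zero the masses sum to one, and when the entries of $\alpha$ sum to zero the codebook has $3m/2$ codepoints, matching $k = 3m/2$.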

Represent the quantizer $Q_n^*(\cdot, X_1, \ldots, X_n)$ by

$$\alpha^* = (\alpha_1^*, \ldots, \alpha_m^*) \in \Gamma_m$$

for the corresponding values of $N_i$. Define $\alpha$ similarly in terms of $Q_n$. Then it suffices to show that (with suitable abuse of notation)

On the other hand, there must be $\hat{k} - k$ indices $i$ for which $m_i = 0$. For each of these (12) holds, so that

Yn) be jointly distributed

for all $m$-tuples of nonnegative integers $(n_1, \ldots, n_m)$ that sum to $n$ and for all functions $\alpha$.

For the numbers $n_1, \ldots, n_m$, let $\alpha = \alpha(n_1, \ldots, n_m)$ and $\alpha^* = \alpha^*(n_1, \ldots, n_m)$. Define $\beta \in \{-1, 0, 1\}^m$ by $\beta_i = (\alpha_i^* - \alpha_i)/2$. Note that $\sum_i \beta_i = 0$. It is easy to see that

Since $Y, Y_1, \ldots, Y_n$ are exchangeable random variables, the distribution of $Y$ given $(Y_1, \ldots, Y_n)$ depends only on the empirical counts $(N_1, \ldots, N_m)$. It follows that the empirical

hence the difference $D(\alpha) - D(\alpha^*)$ is some positive constant times $\sum_i \beta_i\gamma_i$, and so it suffices to show that

To prove this inequality, we shall split the outer sum into several parts, and show that each part is nonnegative. Each part corresponds to a set of distributions that satisfy a convenient symmetry property. First, divide the components of $\beta$ into $m/2$ pairs $(i, j)$, with $\beta_i = -\beta_j$. Without loss of generality, suppose

$$\beta_{2i-1} \le 0 \quad \text{and} \quad \beta_{2i} \ge 0 \qquad \text{for all } 1 \le i \le m/2. \tag{13}$$

Then for $\gamma \in \{-1, 1\}^m$, let $S(\gamma)$ denote the set of all permuted versions of $\gamma$ obtained by swapping the components $\gamma_{2i-1}$ and $\gamma_{2i}$ for all $i$ in some subset of $\{1, \ldots, m/2\}$. Clearly, it suffices to show that for all $\gamma \in \Gamma_m$,

E P7,n(Vi, Ni = m) Pili > 0-But we have

Without loss of generality, we can assume that $\beta_j \ne 0$ for all $j$. Indeed, suppose that $\beta_{2i-1} = \beta_{2i} = 0$ for some $i$. Then we can split the sum over $\gamma$ in (14) into a sum over the pair $(\gamma_{2i-1}, \gamma_{2i})$ and a sum over the other components of $\gamma$, and the corresponding factors in the product can be taken outside the outermost sum, since $\sum_{j=1}^m \beta_j\gamma_j$ is identical for both values of the pair $(\gamma_{2i-1}, \gamma_{2i})$.

Now, $\beta_{2i-1} = -1$ and $\beta_{2i} = 1$ imply that $n_{2i-1} \le n_{2i}$. So to show that (14) holds for the cases of interest, it suffices to show that for all even $m$, for all $n_1, \ldots, n_m$ satisfying $n_{2i-1} \le n_{2i}$, and all $b \in \{-1, 1\}^m$, we have

E E^^II^0

We can ignore the nonnegative constant factor, and the other probabilities are of independent events, so we can write

■E^i

— n "^>(72i-l572i))n2i-l+»2i

So it suffices to show that for all $\gamma \in \{-1, 1\}^m$, all $n_1, \ldots, n_m$ summing to $n$, and all $\beta \in \{-1, 0, 1\}^m$ satisfying (13), we have

First suppose $m = 2$. If $b_1 = b_2$, the expression is clearly zero. Otherwise, it is equal to

which is clearly nonnegative, since $n_2 \ge n_1$. Next, suppose the expression is nonnegative up to some even number $m$. Let $b \in \{-1, 1\}^{m+2}$. Then

m+2 m/2+1

E E!-|Wv n n

(m/2 \

H Pi Pm/2+1

(m m/2

6E E'-|Wv II

m/2 / m+2

+ tE E E+i(-1)j№/2+11

and both of these terms are nonnegative, since the expressions in parentheses are nonnegative by the inductive hypothesis. □

References

[1] K. Alexander, "Probability inequalities for empirical processes and a law of the iterated logarithm," Ann. Prob., vol. 4, pp. 1041-1067, 1984.

[2] S. N. Bernstein, The Theory of Probabilities. Moscow, USSR: Gastehizdat, 1946.

[3] P. A. Chou, "The distortion of vector quantizers trained on n vectors decreases to the optimum as $O_p(1/n)$," in Proc. IEEE Int. Symp. Information Theory, (Trondheim, Norway, 1994).

[4] Y. S. Chow and H. Teicher, Probability Theory, Independence, Interchangeability, Martingales. New York: Springer-Verlag, 1978.

[5] L. D. Davisson, "Universal lossless coding," IEEE Trans. Inform. Theory, vol. IT-19, pp. 783-795, Nov. 1973.

[6] L. Devroye, L. Gyorfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition. New York: Springer-Verlag, 1996.

[7] R. M. Gray, J. C. Kieffer, and Y. Linde, "Locally optimum block quantizer design," Inform. Contr., vol. 45, pp. 178-198, 1980.

[8] A. N. Kolmogorov and V. M. Tikhomirov, "ε-entropy and ε-capacity of sets in function spaces," Transl. Amer. Math. Soc., vol. 17, pp. 277-364, 1961.

[9] Y. Linde, A. Buzo, and R. M. Gray, "An algorithm for vector quantizer design," IEEE Trans. Commun., vol. COM-28, pp. 84-95, 1980.

[10] T. Linder, G. Lugosi, and K. Zeger, "Rates of convergence in the source coding theorem, in empirical quantizer design, and in universal lossy source coding," IEEE Trans. Inform. Theory, vol. 40, pp. 1728-1740, Nov. 1994.

[11] _, "Empirical quantizer design in the presence of source noise or channel noise," IEEE Trans. Inform. Theory, vol. 43, pp. 612-623, Mar. 1997.

[12] C. L. Mallows, "An inequality involving multinomial probabilities," Biometrika, vol. 55, pp. 422-424, 1968.

[13] N. Merhav and J. Ziv, "On the amount of side information required for lossy data compression," IEEE Trans. Inform. Theory, vol. 43, pp. 1112-1121, July 1997.

[14] D. L. Neuhoff, R. M. Gray, and L. D. Davisson, "Fixed rate universal block source coding with a fidelity criterion," IEEE Trans. Inform. Theory, vol. IT-21, pp. 511-523, Sept. 1975.

[15] A. Nobel and R. Olshen, personal communication.

[16] D. Pollard, "Strong consistency of k-means clustering," Ann. Statist., vol. 9, pp. 135-140, 1981.

[17] _, "A central limit theorem for k-means clustering," Ann. Prob., vol. 10, pp. 919-926, 1982.

[18] _, "Quantization and the method of k-means," IEEE Trans. Inform. Theory, vol. IT-28, pp. 199-205, 1982.

[19] K. Rose, E. Gurewitz, and G. C. Fox, "Vector quantization by deterministic annealing," IEEE Trans. Inform. Theory, vol. 38, pp. 1249-1257, July 1992.

[20] P. C. Shields, "When is the weak rate equal to the strong rate?" in Proc. 1994 IEEE-IMS Workshop on Information Theory and Statistics. New York: IEEE, 1994, p. 16.

[21] M. Talagrand, "Sharper bounds for Gaussian and empirical processes," Ann. Prob., vol. 22, pp. 28-76, 1994.

[22] V. N. Vapnik and A. Ya. Chervonenkis, Theory of Pattern Recognition. Moscow, USSR: Nauka, 1974, in Russian. (German translation: Theorie der Zeichenerkennung. Berlin: Akademie Verlag, 1979.)

[23] E. Yair, K. Zeger, and A. Gersho, "Competitive learning and soft competition for vector quantizer design," IEEE Trans. Signal Processing, vol. 40, pp. 294-309, Feb. 1992.