Entropy 2008, 10, 493-506; DOI: 10.3390/e10040493

OPEN ACCESS

entropy

ISSN 1099-4300

www.mdpi.com/journal/entropy

Article

Entropy and Uncertainty

Derek W. Robinson

Mathematical Sciences Institute, Australian National University, Canberra, ACT 0200, Australia E-mail: Derek.Robinson@anu.edu.au

Received: 17 June 2008 / Accepted: 6 August 2008 / Published: 16 October 2008

Abstract: We give a survey of the basic statistical ideas underlying the definition of entropy in information theory and their connections with the entropy in the theory of dynamical systems and in statistical mechanics.

Keywords: Entropy, relative entropy, uncertainty, information theory.

1. Introduction

The concept of entropy originated in the physical and engineering sciences but now plays a ubiquitous role in all areas of science and in many non-scientific disciplines. A quick search of the ANU library catalogue gives books on entropy in mathematics, physics, chemistry, biology, communication theory and engineering but also in economics, linguistics, music, architecture, urban planning, social and cultural theory and even in creationism. Many of the scientific applications will be described in the lectures over the next three weeks. In this brief introductory lecture we describe some of the theoretical ideas that underpin these applications.

Entropy is an encapsulation of the rather nebulous notions of disorder or chaos, uncertainty or randomness. It was introduced by Clausius in the 19th century in thermodynamics and was an integral part of Boltzmann's theory. In the thermodynamic context the emphasis was on entropy as a measure of disorder. Subsequently the probabilistic nature of the concept emerged more clearly with Gibbs' work on statistical mechanics. Entropy then had a major renaissance in the middle of the 20th century with the development of Shannon's mathematical theory of communication. It was one of Shannon's great insights that entropy could be used as a measure of information content. There is some ambiguity in the notion of information, and in Shannon's theory it is less a measure of what one communicates than of what one could communicate, i.e. it is a measure of one's freedom of choice when one selects a message to be communicated. Shannon also stressed the importance of the relative entropy as a measure of redundancy. The relative entropy gives a comparison between two probabilistic systems and typically compares the actual entropy with the maximal possible entropy. It is the relative entropy that has played the key role in many of the later developments and applications.

Another major landmark in the mathematical theory of the entropy was the construction by Kolmogorov and Sinai of an isomorphy invariant for dynamical systems. The invariant corresponds to a mean value over time of the entropy of the system. It was remarkable as it differed in character from all previous spectral invariants and it provided a mechanism for the complete classification of some important systems. Other landmarks were the definition of mean entropy as an affine functional over the state space of operator algebras describing models of statistical mechanics and the application of this functional to the characterization of equilibrium states. Subsequently entropy became a useful concept in the classification of operator algebras independently of any physical background.

In the sequel we discuss entropy, relative entropy and conditional entropy in the general framework of probability theory. In this context the entropy is best interpreted as a measure of uncertainty. Subsequently we develop some applications, give some simple examples and indicate how the theory of entropy has been extended to non-commutative settings such as quantum mechanics and operator algebras.

2. Entropy and uncertainty

First consider a probabilistic process with n possible outcomes. If n = 2 this might be something as simple as the toss of a coin or something as complex as a federal election. Fortunately the details of the process are unimportant for the sequel. Initially we make the simplifying assumption that all n outcomes are equally probable, i.e. each outcome has probability p = 1/n. It is clear that the inherent uncertainty of the system, the uncertainty that a specified outcome actually occurs, is an increasing function of n. Let us denote the value of this function by f(n). It is equally clear that f(1) = 0 since there is no uncertainty if there is only one possible outcome. Moreover, if one considers two independent systems with n and m outcomes, respectively, then the combined system has nm possible outcomes and one would expect the uncertainty to be the sum of the individual uncertainties. In symbols one would expect

f (nm) = f (n) + f (m) .

The additivity reflects the independence of the two processes. But if this property is valid for all positive integers n, m then it is easy to deduce that

f (n) = log n

although the base of logarithms remains arbitrary. (It is natural to choose base 2 as this ensures that f(2) = 1, i.e. a system such as coin tossing is defined to have unit uncertainty, but other choices might be more convenient.) Thus the uncertainty per outcome is given by (1/n) log n or, expressed in terms of the probability,

$$\text{uncertainty per outcome} = -p \log p .$$
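One way to fill in the deduction that additivity forces $f(n) = \log n$ is sketched below. It uses only the two properties stated above, namely that $f$ is increasing and that $f(nm) = f(n) + f(m)$; the integers $k$ and $r$ are auxiliary symbols introduced for this sketch.

```latex
\begin{align*}
& f(n^k) = k\, f(n) \quad (k \in \mathbb{N})
  && \text{by additivity,} \\
& m^r \le n^k < m^{r+1} \;\Longrightarrow\; r\, f(m) \le k\, f(n) \le (r+1)\, f(m)
  && \text{since } f \text{ is increasing,} \\
& m^r \le n^k < m^{r+1} \;\Longrightarrow\; r \log m \le k \log n < (r+1) \log m ,
  && \\
& \left| \frac{f(n)}{f(m)} - \frac{\log n}{\log m} \right| \le \frac{1}{k}
  \;\xrightarrow[k \to \infty]{}\; 0
  && \text{for all } n, m \ge 2 .
\end{align*}
```

Both ratios lie in the interval $[r/k, (r+1)/k]$, which gives the last line. Hence $f(n)/\log n$ is a positive constant independent of $n$, and this constant is exactly the freedom in the choice of the base of logarithms.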

Secondly, consider a process with n possible outcomes with probabilities $p_1, p_2, \ldots, p_n$, respectively. Then it is natural to ascribe uncertainty $-p_i \log p_i$ to the i-th outcome. This is consistent with the foregoing discussion and leads to the hypothesis

$$\text{total uncertainty} = -\sum_{i=1}^{n} p_i \log p_i .$$

This is the standard expression for the entropy of the probabilistic process and we will denote it by the symbol $H(p)$, or $H(p_1, \ldots, p_n)$. (This choice of notation dates back to Boltzmann.) Explicitly, the entropy is defined by

$$H(p) = H(p_1, \ldots, p_n) = -\sum_{i=1}^{n} p_i \log p_i . \qquad (1)$$

Although the argument we have given to identify the entropy with the inherent uncertainty is of a rather ad hoc nature, applications establish that it gives a surprisingly efficient description. This will be illustrated in the sequel. It is a case of 'the proof of the pudding is in the eating'.

Before proceeding we note that the entropy $H(p)$ satisfies two simple bounds. If $x \in (0,1]$ then $-x \log x \in [0,1]$ and one can extend the function $x \mapsto -x \log x$ to the closed interval $[0,1]$ by setting $-0 \log 0 = 0$. Then $H(p) \ge 0$ with equality if and only if one outcome has probability one and the others have probability zero. It is also straightforward to deduce that $H(p)$ is maximal if and only if the probabilities are all equal, i.e. if and only if $p_1 = \ldots = p_n = 1/n$. Therefore one has bounds

$$0 \le H(p_1, \ldots, p_n) \le \log n . \qquad (2)$$

There is a third less precise principle: the most probable outcomes give the major contribution to the total entropy.
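The definition (1) and the bounds (2) are easy to check numerically. The following is a minimal sketch, not part of the original lecture; the function name `entropy` and the sample distributions are chosen here purely for illustration, and base 2 logarithms are assumed so that a fair coin has unit entropy.

```python
import math

def entropy(p, base=2.0):
    """Entropy H(p) = -sum_i p_i log p_i, with the convention 0 log 0 = 0."""
    if abs(sum(p) - 1.0) > 1e-9 or any(x < 0 for x in p):
        raise ValueError("p must be a probability vector")
    return -sum(x * math.log(x, base) for x in p if x > 0)

fair_coin = entropy([0.5, 0.5])          # unit uncertainty in base 2
certain = entropy([1.0, 0.0, 0.0])       # lower bound in (2): no uncertainty
uniform4 = entropy([0.25] * 4)           # upper bound in (2): log 4 = 2 bits
skewed4 = entropy([0.7, 0.1, 0.1, 0.1])  # strictly between the two bounds
```

The skewed distribution illustrates that any departure from equal probabilities lowers the entropy below $\log n$.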

The entropy enters the Boltzmann-Gibbs description of equilibrium statistical mechanics through the prescription that the state of equilibrium is given by the microscopic particle configurations which maximize the entropy under the constraints imposed by the observation of macroscopic quantities such as the energy and density. Thus if the configuration with assigned probability $p_i$ has energy $e_i$ the idea is to maximize $H(p)$ with $E(p) = \sum_{i=1}^{n} p_i e_i$ held fixed. If one formulates this problem in terms of a Lagrange multiplier $\beta \in \mathbb{R}$ then one must maximize the function

$$p = (p_1, \ldots, p_n) \mapsto H(p) - \beta E(p) . \qquad (3)$$

We will discuss this problem later in the lecture.

3. Entropy and multinomial coefficients

In applications to areas such as statistical mechanics one is usually dealing with systems with a large number of possible configurations. The significance of entropy is that it governs the asymptotic behaviour of the multinomial coefficients

$${}^{n}C_{n_1 \ldots n_m} = \frac{n!}{n_1! \cdots n_m!}$$

where $n_1 + \ldots + n_m = n$. These coefficients express the number of ways one can divide n objects into m subsets of $n_1, n_2, \ldots, n_m$ objects, respectively. If n is large then it is natural to examine the number of partitions into m subsets with fixed proportions $p_1 = n_1/n, \ldots, p_m = n_m/n$. Thus one examines

$$P_n(p) = \frac{n!}{(p_1 n)! \cdots (p_m n)!}$$

with $p_1 + \ldots + p_m = 1$ as $n \to \infty$. Since the sum over all possible partitions,

$$\sum_{n_1, \ldots, n_m} {}^{n}C_{n_1 \ldots n_m} = m^n ,$$

increases exponentially with n one expects a similar behaviour for $P_n(p)$. Therefore the asymptotic behaviour will be governed by the function $n^{-1} \log P_n(p)$. But this is easily estimated by use of the Stirling-type bounds

$$(2\pi n)^{1/2}\, n^n e^{-n} e^{1/(12n+1)} < n! < (2\pi n)^{1/2}\, n^n e^{-n} e^{1/(12n)}$$

for the factorials. One finds

$$n^{-1} \log P_n(p) = H(p) + O(n^{-1} \log n) \qquad (4)$$

as $n \to \infty$ where $H(p)$ is the entropy. Thus the predominant asymptotic feature of the $P_n$ is an exponential increase $\exp(nH)$ with $H$ the entropy of the partition $p_1, \ldots, p_m$.
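The asymptotic statement can be observed numerically. The sketch below is an illustration under stated assumptions: the proportions `p` and the values of n are arbitrary choices, natural logarithms are used, and the factorials are evaluated through `math.lgamma` (the logarithm of the gamma function) to avoid overflow.

```python
import math

def entropy_nat(p):
    """Entropy with natural logarithms and the convention 0 log 0 = 0."""
    return -sum(x * math.log(x) for x in p if x > 0)

def log_multinomial_rate(counts):
    """n^{-1} log( n! / (n_1! ... n_m!) ), using log-gamma for the factorials."""
    n = sum(counts)
    log_coeff = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)
    return log_coeff / n

p = [0.5, 0.3, 0.2]          # hypothetical fixed proportions
gaps = []
for n in (10, 100, 1000, 10000):
    counts = [round(x * n) for x in p]   # p_i n is an integer for these n
    gaps.append(entropy_nat(p) - log_multinomial_rate(counts))
```

The entries of `gaps` are positive and shrink roughly like $n^{-1} \log n$, in line with the error term in the estimate.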

Next consider n independent repetitions of an experiment with m possible outcomes and corresponding probabilities $q_1, \ldots, q_m$. The probability that these outcomes occur with frequencies $p_1, \ldots, p_m$ is given by

$$P_n(p|q) = P_n(p)\, q_1^{p_1 n} \cdots q_m^{p_m n} = \frac{n!}{(p_1 n)! \cdots (p_m n)!}\, q_1^{p_1 n} \cdots q_m^{p_m n} .$$

Then estimating as before one finds

$$n^{-1} \log P_n(p|q) = H(p|q) + O(n^{-1} \log n) \qquad (5)$$

as $n \to \infty$ where $H(p|q)$ is given by

$$H(p|q) = -\sum_{i=1}^{m} (p_i \log p_i - p_i \log q_i) . \qquad (6)$$

The latter expression is the relative entropy of the frequencies $p_i$ with respect to the probabilities $q_i$. Now it is readily established that $H(p|q) \le 0$ with equality if, and only if, $p_i = q_i$ for all $i \in \{1, \ldots, m\}$. Thus if the $p_i$ and $q_i$ are not equal then $P_n(p|q)$ decreases exponentially as $n \to \infty$. Therefore the only results which effectively occur are those for which the frequencies closely approximate the probabilities.
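A small numerical experiment makes the exponential decay visible. This is a sketch only: the fair coin `q` and the observed frequencies `p` are hypothetical, natural logarithms are used, and the sign convention is the one of the relative entropy defined above (so the value is non-positive).

```python
import math

def relative_entropy(p, q):
    """H(p|q) = -sum_i (p_i log p_i - p_i log q_i); non-positive in this convention."""
    return -sum(x * (math.log(x) - math.log(y)) for x, y in zip(p, q) if x > 0)

def log_prob_rate(counts, q):
    """n^{-1} log P_n(p|q) for observed counts n_i = p_i n."""
    n = sum(counts)
    log_coeff = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)
    log_weights = sum(c * math.log(y) for c, y in zip(counts, q))
    return (log_coeff + log_weights) / n

q = [0.5, 0.5]   # hypothetical true probabilities (fair coin)
p = [0.7, 0.3]   # hypothetical observed frequencies, different from q
rate_limit = relative_entropy(p, q)
rates = [log_prob_rate([7 * k, 3 * k], q) for k in (1, 10, 100, 1000)]
```

The values in `rates` approach `rate_limit`, a strictly negative number, so the probability of seeing 70% heads from a fair coin decays like $\exp(n H(p|q))$.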

One can relate the variational principle (3) defining the Boltzmann-Gibbs equilibrium states to the relative entropy. Set $q_i = e^{-\beta e_i}/Z$ where $Z = \sum_i e^{-\beta e_i}$. Then

$$H(p) - \beta E(p) = H(p|q) + \log Z .$$

Therefore the maximum is obtained for pi = qi and the maximal value is log Z.
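The identity and the resulting maximum can be verified on a toy system. The sketch below assumes three hypothetical energy levels, an arbitrary value of the multiplier and an arbitrary trial distribution; none of these numbers come from the text.

```python
import math

def entropy(p):
    """H(p) = -sum_i p_i log p_i (natural logarithm, 0 log 0 = 0)."""
    return -sum(x * math.log(x) for x in p if x > 0)

def relative_entropy(p, q):
    """H(p|q) = -sum_i (p_i log p_i - p_i log q_i)."""
    return -sum(x * (math.log(x) - math.log(y)) for x, y in zip(p, q) if x > 0)

def gibbs(energies, beta):
    """Return the weights q_i = exp(-beta e_i) / Z together with Z."""
    weights = [math.exp(-beta * e) for e in energies]
    z = sum(weights)
    return [w / z for w in weights], z

def functional(p, energies, beta):
    """H(p) - beta E(p), the function to be maximized."""
    return entropy(p) - beta * sum(x * e for x, e in zip(p, energies))

energies = [-1.0, 0.0, 2.0]   # hypothetical energy levels e_i
beta = 0.8                    # hypothetical value of the multiplier
q, z = gibbs(energies, beta)
trial = [0.2, 0.5, 0.3]       # an arbitrary competing distribution

value_at_gibbs = functional(q, energies, beta)
value_at_trial = functional(trial, energies, beta)
identity_gap = value_at_trial - (relative_entropy(trial, q) + math.log(z))
```

Here `value_at_gibbs` agrees with `math.log(z)`, `identity_gap` vanishes up to rounding, and the trial distribution gives a strictly smaller value.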

4. Conditional entropy and information

Next we consider two processes $\alpha$ and $\beta$ and introduce the notation $A_1, \ldots, A_n$ for the possible outcomes of $\alpha$ and $B_1, \ldots, B_m$ for the possible outcomes of $\beta$. The corresponding probabilities are denoted by $p(A_1), \ldots, p(A_n)$ and $p(B_1), \ldots, p(B_m)$, respectively. The joint process, $\alpha$ followed by $\beta$, is denoted by $\alpha\beta$ and the probability of the outcome $A_iB_j$ is given by $p(A_iB_j)$. We assume that $p(A_iB_j) = p(B_jA_i)$ although this condition has to be relaxed in the subsequent non-commutative settings. If $\alpha$ and $\beta$ are independent processes then $p(A_iB_j) = p(A_i)p(B_j)$ and the identity is automatic.

The probability that $B_j$ occurs given the prior knowledge that $A_i$ has occurred is called the conditional probability for $B_j$ given $A_i$. It is denoted by $p(B_j|A_i)$. A moment's reflection establishes that

$$p(A_i)\,p(B_j|A_i) = p(A_iB_j) .$$

Since the possible outcomes $B_jA_i$ of the process $\beta\alpha$ are the same as the outcomes $A_iB_j$ of the process $\alpha\beta$ one has the relation

$$p(A_i)\,p(B_j|A_i) = p(B_j)\,p(A_i|B_j)$$

for the conditional probabilities. If the two processes are independent one obviously has $p(B_j|A_i) = p(B_j)$.

The conditional entropy of the process $\beta$ given the outcome $A_i$ for $\alpha$ is then defined by

$$H(\beta|A_i) = -\sum_{j=1}^{m} p(B_j|A_i) \log p(B_j|A_i) \qquad (7)$$

in direct analogy with the (unconditional) entropy of $\beta$ defined by

$$H(\beta) = -\sum_{j=1}^{m} p(B_j) \log p(B_j) .$$

In fact if $\alpha$ and $\beta$ are independent then $H(\beta|A_i) = H(\beta)$. Since $A_i$ occurs with probability $p(A_i)$ it is natural to define the conditional entropy of $\beta$ dependent on $\alpha$ by

$$H(\beta|\alpha) = \sum_{i=1}^{n} p(A_i)\, H(\beta|A_i) . \qquad (8)$$

The entropy $H(\beta)$ is interpreted as the uncertainty in the process $\beta$ and $H(\beta|\alpha)$ is the residual uncertainty after the process $\alpha$ has occurred.

Based on the foregoing intuition Shannon defined the difference

$$I(\beta|\alpha) = H(\beta) - H(\beta|\alpha) \qquad (9)$$

as the information about $\beta$ gained by knowledge of the outcome of $\alpha$. It corresponds to the reduction in uncertainty of the process $\beta$ arising from knowledge of $\alpha$.

The usefulness of these various concepts depends on a range of properties that are all traced back to simple features of the function $x \mapsto -x \log x$. First, one has the key relation

$$H(\beta|\alpha) = H(\alpha\beta) - H(\alpha) . \qquad (10)$$

This follows by calculating that

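$$\begin{aligned}
H(\beta|\alpha) &= -\sum_{i=1}^{n}\sum_{j=1}^{m} p(A_iB_j) \log p(B_j|A_i) \\
&= -\sum_{i=1}^{n}\sum_{j=1}^{m} p(A_iB_j) \log\big(p(A_iB_j)/p(A_i)\big) \\
&= -\sum_{i=1}^{n}\sum_{j=1}^{m} p(A_iB_j) \big(\log p(A_iB_j) - \log p(A_i)\big) = H(\alpha\beta) - H(\alpha) .
\end{aligned}$$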

Note that if $\alpha$ and $\beta$ are independent then $H(\beta|\alpha) = H(\beta)$ and the relation (10) asserts that $H(\alpha\beta) = H(\alpha) + H(\beta)$. Note also that it follows from (10) that the information is given by

$$I(\beta|\alpha) = H(\beta) + H(\alpha) - H(\alpha\beta) . \qquad (11)$$

This latter identity establishes the symmetry

$$I(\beta|\alpha) = I(\alpha|\beta) . \qquad (12)$$

Next remark that $H(\beta|\alpha) \ge 0$ by definition. Hence $H(\alpha\beta) \ge H(\alpha)$ and, by symmetry, $H(\alpha\beta) \ge H(\beta)$. Thus one has the lower bounds

$$H(\alpha\beta) \ge H(\alpha) \vee H(\beta) . \qquad (13)$$

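But it also follows by a convexity argument that $H(\beta|\alpha) \le H(\beta)$. The argument is as follows. Since the function $x > 0 \mapsto -\log x$ is convex one has

$$\sum_{i=1}^{n} \lambda_i \log x_i \le \log\Big(\sum_{i=1}^{n} \lambda_i x_i\Big)$$

for all $\lambda_i \ge 0$ with $\sum_{i=1}^{n} \lambda_i = 1$ and all $x_i > 0$. Therefore

$$\begin{aligned}
H(\beta|\alpha) &= -\sum_{i=1}^{n}\sum_{j=1}^{m} p(A_iB_j) \log p(B_j|A_i) \\
&= \sum_{j=1}^{m} p(B_j) \sum_{i=1}^{n} \big(p(A_iB_j)/p(B_j)\big) \log\big(1/p(B_j|A_i)\big) \\
&\le \sum_{j=1}^{m} p(B_j) \log \sum_{i=1}^{n} \big(p(A_i)/p(B_j)\big) = H(\beta)
\end{aligned}$$

because $\sum_{i=1}^{n} \big(p(A_iB_j)/p(B_j)\big) = 1$. (Here I have been rather cavalier in assuming $p(B_j|A_i)$ and $p(B_j)$ are strictly positive but it is not difficult to fill in the details.) It follows from $H(\beta|\alpha) \le H(\beta)$ that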

$$I(\beta|\alpha) \ge 0 , \qquad (14)$$

i.e. the information is positive. But then using the identity (11) one deduces that

$$H(\alpha\beta) \le H(\alpha) + H(\beta) . \qquad (15)$$

This is a generalization of the property of subadditivity $f(x + y) \le f(x) + f(y)$ for functions of a real variable. It is subsequently of fundamental importance.
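The relations (10) to (15) can all be checked on a small joint distribution. The following sketch assumes a hypothetical 3 x 2 table of joint probabilities; rows index the outcomes of the first process and columns those of the second, and base 2 logarithms are used.

```python
import math

def h(probs):
    """Entropy in bits of a collection of probabilities (0 log 0 = 0)."""
    return -sum(x * math.log2(x) for x in probs if x > 0)

# Hypothetical joint probabilities p(A_i B_j); the entries sum to one.
joint = [[0.30, 0.10],
         [0.05, 0.25],
         [0.10, 0.20]]

p_a = [sum(row) for row in joint]            # marginal of the first process
p_b = [sum(col) for col in zip(*joint)]      # marginal of the second process

h_a, h_b = h(p_a), h(p_b)
h_ab = h([x for row in joint for x in row])  # entropy of the joint process

# Conditional entropy computed directly from definitions (7) and (8).
h_b_given_a = sum(pa * h([x / pa for x in row]) for pa, row in zip(p_a, joint))
info = h_b - h_b_given_a                     # information, definition (9)
```

With these numbers the conditional entropy agrees with the difference of the joint and marginal entropies, and the information is strictly positive because the two processes are not independent.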

Finally we note that the information is increasing in the sense that

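$$I(\beta\gamma|\alpha) \ge I(\beta|\alpha) \qquad (16)$$

or, equivalently,

$$H(\alpha\beta\gamma) - H(\beta\gamma) \le H(\alpha\beta) - H(\beta) . \qquad (17)$$

The latter property is established by calculating that

$$\begin{aligned}
H(\alpha\beta\gamma) - H(\alpha\beta) - H(\beta\gamma) + H(\beta)
&= \sum_{i,j,k} p(A_iB_jC_k) \log\left(\frac{p(A_iB_j)\,p(B_jC_k)}{p(A_iB_jC_k)\,p(B_j)}\right) \\
&\le \sum_{i,j,k} \frac{p(A_iB_j)\,p(B_jC_k)}{p(B_j)} \left(1 - \frac{p(A_iB_jC_k)\,p(B_j)}{p(A_iB_j)\,p(B_jC_k)}\right) = 0
\end{aligned}$$

where we have used the bound $-x \log x \le 1 - x$. This property (17) is usually referred to as strong subadditivity as it reduces to the subadditive condition (15) if $\beta$ is the trivial process with only one outcome.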
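Strong subadditivity can be tested on randomly generated joint distributions of three processes. This is a sketch under assumptions: the numbers of outcomes (2, 3 and 2), the seed and the number of trials are arbitrary, and a numerical check is of course no substitute for the proof.

```python
import itertools
import math
import random

def h(probs):
    """Entropy (natural logarithm) of a collection of probabilities."""
    return -sum(x * math.log(x) for x in probs if x > 0)

def marginal(joint, keep):
    """Marginal of a joint distribution {(i, j, k): prob} onto the coordinates in `keep`."""
    out = {}
    for outcome, prob in joint.items():
        key = tuple(outcome[c] for c in keep)
        out[key] = out.get(key, 0.0) + prob
    return out

def ssa_gap(joint):
    """Right side minus left side of strong subadditivity; should be >= 0.

    Coordinates 0, 1, 2 play the roles of the first, second and third process.
    """
    ent = lambda keep: h(marginal(joint, keep).values())
    return (ent((0, 1)) - ent((1,))) - (ent((0, 1, 2)) - ent((1, 2)))

rng = random.Random(0)
gaps = []
for _ in range(200):
    weights = {o: rng.random() for o in itertools.product(range(2), range(3), range(2))}
    total = sum(weights.values())
    gaps.append(ssa_gap({o: w / total for o, w in weights.items()}))
```

Every entry of `gaps` is non-negative up to rounding error.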

Example There are two cities, for example Melbourne and Canberra, and the citizens of one always tell the truth but the citizens of the other never tell the truth. An absent-minded mathematician forgets where he is and attempts to find out by asking a passerby, who could be from either city. What is the least number of questions he must ask if the only replies are 'yes' and 'no'? Alternatively, how many questions must he pose to find out where he is and where the passerby lives?

Since there are two towns there are two possible outcomes to the experiment $\alpha$ of questioning. If the mathematician really has no idea where he is then the entropy $H(\alpha) = \log 2$ represents the total information. Then if one uses base 2 logarithms $H(\alpha) = 1$. So the problem is to ask a question $\beta$ that gives unit information, i.e. such that $I(\alpha, \beta) = H(\alpha) = 1$ or, equivalently, $H_\beta(\alpha) = 0$, where $H_\beta(\alpha)$ denotes the conditional entropy of $\alpha$ given $\beta$. Thus the question must be unconditional. This could be achieved by asking 'Do you live here?'.

Alternatively to find out where he is and to also decide where the passerby lives he needs to resolve the outcome of a joint experiment $\alpha_1\alpha_2$ where $\alpha_1$ consists of finding his own location and $\alpha_2$ consists of finding the residence of the passerby. But then the total information is $H(\alpha_1\alpha_2) = H(\alpha_1) + H_{\alpha_1}(\alpha_2) > 1$. Hence one question will no longer suffice but two clearly do suffice: he can find out where he is with one question and then find out where the passerby lives with a second question. This is consistent with the fact that $H(\alpha_1\alpha_2) = \log 2^2 = 2$.
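The claim that 'Do you live here?' carries one full bit about the location, but only one of the two bits needed for the joint experiment, can be confirmed by enumeration. The sketch below makes two assumptions not fixed by the text: that Melbourne is the truthful city, and that the four combinations of location and residence are equally likely.

```python
import math
from collections import Counter
from itertools import product

CITIES = ("Melbourne", "Canberra")
TRUTHFUL = "Melbourne"   # assumption: the text does not say which city tells the truth

def answer(location, home):
    """Reply of the passerby to 'Do you live here?' (True means 'yes')."""
    truth = (home == location)
    return truth if home == TRUTHFUL else not truth

def h(probs):
    """Entropy in bits."""
    return -sum(x * math.log2(x) for x in probs if x > 0)

def information(target):
    """H(target) + H(answer) - H(target, answer) for equally likely (location, home)."""
    states = list(product(CITIES, CITIES))
    w = 1.0 / len(states)
    t, a, ta = Counter(), Counter(), Counter()
    for loc, home in states:
        key, rep = target(loc, home), answer(loc, home)
        t[key] += w
        a[rep] += w
        ta[(key, rep)] += w
    return h(t.values()) + h(a.values()) - h(ta.values())

info_location = information(lambda loc, home: loc)          # about where he is
info_both = information(lambda loc, home: (loc, home))      # about location and residence
```

Both quantities equal one bit: the reply is 'yes' exactly when the mathematician is in the truthful city, so the location is determined, while the residence of the passerby remains completely undetermined and requires a second question.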

This is all rather easy. The following problem is slightly more complicated but can be resolved by similar reasoning.

Problem Assume in addition there is a third city, Sydney say, in which the inhabitants alternately tell the truth or tell a lie. Argue that the mathematician can find out where he is with two questions but needs four questions to find out in addition where the passerby lives.

5. Dynamical systems

Let $(X, \mathcal{B})$ denote a $\sigma$-finite measure space, i.e. a set $X$ equipped with a $\sigma$-algebra $\mathcal{B}$ of subsets of $X$. Further let $\mu$ denote a probability measure on $(X, \mathcal{B})$. Then $(X, \mathcal{B}, \mu)$ is called a probability space. A finite partition $\alpha = (A_1, \ldots, A_n)$ of the space is a collection of a finite number of disjoint elements $A_i$ of $\mathcal{B}$ such that $\bigcup_{i=1}^{n} A_i = X$. Given two partitions $\alpha = (A_1, \ldots, A_n)$ and $\beta = (B_1, \ldots, B_m)$ the join $\alpha \vee \beta$ is defined as the partition composed of the subsets $A_i \cap B_j$ with $i \in \{1, \ldots, n\}$ and $j \in \{1, \ldots, m\}$.

If $\alpha = (A_1, \ldots, A_n)$ is a partition of the space then $0 \le \mu(A_i) \le 1$ and $\sum_{i=1}^{n} \mu(A_i) = 1$ because $\mu$ is a probability measure. Thus the $\mu(A_i)$ correspond to the probabilities introduced earlier and $\alpha \vee \beta$ now corresponds to the joint process $\alpha\beta$. Therefore we can now use the definitions of entropy, conditional entropy, etc. introduced previously but with the replacements $p(A_i) \to \mu(A_i)$, $\alpha\beta \to \alpha \vee \beta$ etc. Note that since $\mu(A_i \cap B_j) = \mu(B_j \cap A_i)$ one automatically has $p(A_iB_j) = p(B_jA_i)$.

Next let $T$ be a measure-preserving invertible transformation of the probability space $(X, \mathcal{B}, \mu)$. In particular $T\mathcal{B} = \mathcal{B}$ and $\mu(TA) = \mu(A)$ for all $A \in \mathcal{B}$. Then $(X, \mathcal{B}, \mu, T)$ is called a dynamical system. In applications one also encounters dynamical systems in which $T$ is replaced by a measure-preserving flow, i.e. a one-parameter family of measure-preserving transformations $\{T_t\}_{t \in \mathbb{R}}$ such that $T_sT_t = T_{s+t}$, $T_{-s} = (T_s)^{-1}$ and $T_0 = I$ is the identity. The flow is usually interpreted as a description of the change with time $t$ of the observables $A$. The single automorphism $T$ can be thought of as the change with unit time and $T^n$ is the change after n units of time.

The entropy of the partition a = (A1,..., An) of the probability space is given by

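$$H(\alpha; \mu) = -\sum_{i=1}^{n} \mu(A_i) \log \mu(A_i)$$

and we now define the mean entropy of the partition of the dynamical system by

$$H(\alpha; \mu, T) = \lim_{n \to \infty} n^{-1} H(\alpha \vee T\alpha \vee \ldots \vee T^{n}\alpha; \mu) \qquad (18)$$

where $T\alpha = (TA_1, \ldots, TA_n)$. Then the mean entropy of the automorphism $T$ is defined by

$$H(\mu, T) = \sup_{\alpha} H(\alpha; \mu, T) \qquad (19)$$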

where the supremum is over all finite partitions.

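It is of course necessary to establish that the limit in (18) exists. But this is a consequence of subadditivity. Set

$$f(n) = H(\alpha \vee T\alpha \vee \ldots \vee T^{n-1}\alpha; \mu) .$$

Then

$$\begin{aligned}
f(n + m) &= H\big((\alpha \vee T\alpha \vee \ldots \vee T^{n-1}\alpha) \vee T^{n}(\alpha \vee T\alpha \vee \ldots \vee T^{m-1}\alpha); \mu\big) \\
&\le H(\alpha \vee T\alpha \vee \ldots \vee T^{n-1}\alpha; \mu) + H\big(T^{n}(\alpha \vee T\alpha \vee \ldots \vee T^{m-1}\alpha); \mu\big) = f(n) + f(m)
\end{aligned}$$

for all $n, m \in \mathbb{N}$ where we have used (15) and the $T$-invariance of $\mu$. It is, however, an easy consequence of subadditivity that the limit of $n^{-1}f(n)$ exists as $n \to \infty$.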
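The subadditivity argument can be watched at work in a concrete case. The sketch below is an illustration only and rests on several assumptions not made in the text: the system is the shift on sequences of two symbols, the invariant measure is that of a stationary two-state Markov chain with a hypothetical transition matrix `P`, and the partition is the one determined by the symbol at the origin, so that the joins are partitions into cylinder sets (words) of length n.

```python
import math
from itertools import product

P = [[0.9, 0.1],
     [0.4, 0.6]]   # hypothetical transition probabilities
PI = [0.8, 0.2]    # stationary distribution of P, i.e. PI P = PI

def block_entropy(n):
    """f(n): entropy of the partition into cylinder sets of length n."""
    total = 0.0
    for word in product(range(2), repeat=n):
        prob = PI[word[0]]
        for a, b in zip(word, word[1:]):
            prob *= P[a][b]
        total -= prob * math.log(prob)
    return total

f = {n: block_entropy(n) for n in range(1, 13)}

# Known closed form for the limit of f(n)/n in the Markov case.
entropy_rate = -sum(PI[a] * P[a][b] * math.log(P[a][b])
                    for a in range(2) for b in range(2))
```

In this example `f` is subadditive, the ratios `f[n] / n` decrease with n, and they approach `entropy_rate` from above.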

The entropy $H(\mu, T)$ was introduced by Kolmogorov and Sinai and is often referred to as the Kolmogorov-Sinai invariant. This terminology arises since $H(\mu, T)$ is an isomorphy invariant. Two dynamical systems $(X_1, \mathcal{B}_1, \mu_1, T_1)$ and $(X_2, \mathcal{B}_2, \mu_2, T_2)$ are defined to be isomorphic if there is an invertible measure-preserving map $U$ of $(X_1, \mathcal{B}_1, \mu_1)$ onto $(X_2, \mathcal{B}_2, \mu_2)$ which intertwines $T_1$ and $T_2$, i.e. which has the property $UT_1 = T_2U$. If this is the case then $H(\mu_1, T_1) = H(\mu_2, T_2)$. Thus to show that two dynamical systems are not isomorphic it suffices to prove that $H(\mu_1, T_1) \ne H(\mu_2, T_2)$. Of course this is not necessarily easy since it requires calculating the entropies. This is, however, facilitated by a result of Kolmogorov and Sinai which establishes that the supremum in (19) is attained for a special class of partitions. The partition $\alpha$ is defined to be a generator if $\bigvee_{k=-\infty}^{\infty} T^k\alpha$ is the partition of $X$ into points. Then

$$H(\mu, T) = H(\alpha; \mu, T) \qquad (20)$$

for each finite generator a.

Another way of formulating the isomorphy property is in Hilbert space terms. Let $\mathcal{H} = L_2(X; \mu)$. Then it follows from the $T$-invariance property of $\mu$ that there is a unitary operator $U$ on $\mathcal{H}$ such that $Uf = f \circ T^{-1}$ for all $f \in C_b(X)$. Now the two dynamical systems $(X_1, \mathcal{B}_1, \mu_1, T_1)$ and $(X_2, \mathcal{B}_2, \mu_2, T_2)$ are isomorphic if there is a unitary operator $V$ from $\mathcal{H}_1$ to $\mathcal{H}_2$ which intertwines the two unitary representatives $U_1$ and $U_2$ of the maps $T_1$ and $T_2$. Since one has $U_2 = VU_1V^{-1}$ the spectra of $U_1$ and $U_2$ are also isomorphy invariants. The Kolmogorov-Sinai entropy was the first invariant which was not of a spectral nature.

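The mean entropy has another interesting property as a function of the measure $\mu$. The probability measures form a convex subset of the dual of the bounded continuous functions $C_b(X)$. Now if $\mu_1$ and $\mu_2$ are two probability measures and $\lambda \in [0,1]$ then

$$H(\alpha; \lambda\mu_1 + (1 - \lambda)\mu_2) \ge \lambda H(\alpha; \mu_1) + (1 - \lambda) H(\alpha; \mu_2) . \qquad (21)$$

The concavity inequality (21) is a direct consequence of the definition of $H(\alpha; \mu)$ and the concavity of the function $x \mapsto -x \log x$. Conversely, one has inequalities

$$-\log\big(\lambda\mu_1(A_i) + (1 - \lambda)\mu_2(A_i)\big) \le -\log\lambda - \log\mu_1(A_i)$$

$$-\log\big(\lambda\mu_1(A_i) + (1 - \lambda)\mu_2(A_i)\big) \le -\log(1 - \lambda) - \log\mu_2(A_i)$$

because $x \mapsto -\log x$ is decreasing. Multiplying the first by $\lambda\mu_1(A_i)$ and the second by $(1 - \lambda)\mu_2(A_i)$, adding, and summing over $i$, one obtains the 'convexity' bound

$$H(\alpha; \lambda\mu_1 + (1 - \lambda)\mu_2) \le \lambda H(\alpha; \mu_1) + (1 - \lambda) H(\alpha; \mu_2) - \lambda\log\lambda - (1 - \lambda)\log(1 - \lambda) . \qquad (22)$$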

Now replacing $\alpha$ by $\alpha \vee T\alpha \vee \ldots \vee T^{n-1}\alpha$ in (21), dividing by n and taking the limit $n \to \infty$ gives

$$H(\alpha; \lambda\mu_1 + (1 - \lambda)\mu_2, T) \ge \lambda H(\alpha; \mu_1, T) + (1 - \lambda) H(\alpha; \mu_2, T) .$$

Similarly from (22), since $-(\lambda\log\lambda + (1 - \lambda)\log(1 - \lambda))/n \to 0$ as $n \to \infty$, one deduces the converse inequality

$$H(\alpha; \lambda\mu_1 + (1 - \lambda)\mu_2, T) \le \lambda H(\alpha; \mu_1, T) + (1 - \lambda) H(\alpha; \mu_2, T) .$$

Hence one concludes that the map $\mu \mapsto H(\alpha; \mu, T)$ is affine, i.e.

$$H(\alpha; \lambda\mu_1 + (1 - \lambda)\mu_2, T) = \lambda H(\alpha; \mu_1, T) + (1 - \lambda) H(\alpha; \mu_2, T) \qquad (23)$$

for each partition $\alpha$, each pair $\mu_1$ and $\mu_2$ of probability measures and each $\lambda \in [0,1]$.

Finally it follows from the identification (20) that the mean entropy is also affine,

$$H(\lambda\mu_1 + (1 - \lambda)\mu_2, T) = \lambda H(\mu_1, T) + (1 - \lambda) H(\mu_2, T) \qquad (24)$$

for each pair $\mu_1$ and $\mu_2$ of probability measures and each $\lambda \in [0,1]$. This is somewhat surprising and is of great significance in the application of mean entropy in statistical mechanics.
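The two-sided estimate behind the affine property, concavity from below and the 'convexity' bound with a mixing term from above, can be tested on random data. The sketch assumes a partition with four elements and represents each measure simply by the vector of probabilities it assigns to those elements; the seed and the number of trials are arbitrary.

```python
import math
import random

def h(probs):
    """Entropy (natural logarithm) of a probability vector."""
    return -sum(x * math.log(x) for x in probs if x > 0)

def mix(mu1, mu2, lam):
    """Convex combination lam * mu1 + (1 - lam) * mu2 of two probability vectors."""
    return [lam * a + (1 - lam) * b for a, b in zip(mu1, mu2)]

rng = random.Random(1)

def random_measure(n):
    """A random probability vector on n partition elements."""
    weights = [rng.random() for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

violations = 0
for _ in range(500):
    mu1, mu2 = random_measure(4), random_measure(4)
    lam = rng.random()
    mixed = h(mix(mu1, mu2, lam))
    lower = lam * h(mu1) + (1 - lam) * h(mu2)   # concavity bound from below
    upper = lower + h([lam, 1 - lam])           # upper bound with the mixing entropy
    if not (lower - 1e-12 <= mixed <= upper + 1e-12):
        violations += 1
```

The mixing entropy `h([lam, 1 - lam])` never exceeds $\log 2$, whatever the partition, which is why it disappears after division by n in the passage to the mean entropy.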

6. Mean entropy and statistical mechanics

The simplest model of statistical mechanics is the one-dimensional ferromagnetic Ising model. This describes atoms at the points of a one-dimensional lattice $\mathbb{Z}$ with two degrees of freedom which we label as 0, 1 and which we think of as a spin orientation. Thus $X = \{0,1\}^{\mathbb{Z}}$ and a point $x \in X$ is a doubly-infinite array of 0s and 1s. The labels indicate whether a particle at a given point of the lattice has negative or positive spin orientation. Two neighbouring atoms with identical orientation are ascribed a negative unit energy and neighbouring atoms with opposite orientation are ascribed a positive unit of energy. Therefore it is energetically favourable for the atoms to align and provide a spontaneous magnetism. Since the configurations $x$ of particles are doubly infinite the total energy ascribed to each $x$ is usually infinite but the mean energy, i.e. the energy per lattice site, is always finite.

The model generalizes to d dimensions in an obvious way. Then $X = \{0,1\}^{\mathbb{Z}^d}$ and a point $x \in X$ corresponds to a d-dimensional array of 0s and 1s. If one now assigns a negative unit energy to each pair of nearest neighbours in the lattice $\mathbb{Z}^d$ with similar orientations and positive unit energy to the pairs with opposite orientation then the energy of a configuration of particles on a cubic subset of $\mathbb{Z}^d$ with side length $L$ grows as $L^d$, i.e. as the d-dimensional volume. Therefore the energy per lattice site is again finite.

The group $\mathbb{Z}^d$ of shifts acts in an obvious manner on $X$. Let $T_1, \ldots, T_d$ denote the unit shifts in each of the d directions. Further let $\mu$ denote a $\mathbb{Z}^d$-invariant probability measure over $X$. Then the energy $E(\mu)$ per lattice site is well defined, it has a value in $[-1,1]$ and $\mu \mapsto E(\mu)$ is an affine function. Now consider the entropy per lattice site.

Let $\alpha$ denote the partition of $X$ into two subsets, the subset $A_0$ of configurations with a 0 at the origin of $\mathbb{Z}^d$ and the subset $A_1$ of configurations with a 1 at the origin. Then $\bigvee_{k_1=-\infty}^{\infty} \cdots \bigvee_{k_d=-\infty}^{\infty} T_1^{k_1} \cdots T_d^{k_d}\alpha = X$ and the partition $\alpha$ is a generator of $X$. Now the previous definition of the mean entropy generalizes and

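$$H(\mu) = \lim_{L_1, \ldots, L_d \to \infty} (L_1 \cdots L_d)^{-1}\, H\Big(\bigvee_{k_1=0}^{L_1} \cdots \bigvee_{k_d=0}^{L_d} T_1^{k_1} \cdots T_d^{k_d}\alpha; \mu\Big) \qquad (25)$$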

exists by an extension of the earlier subadditivity argument to the d-dimensional setting.

The Boltzmann-Gibbs approach sketched earlier would designate the equilibrium state of the system at fixed mean energy as the measure which maximizes the functional

$$\mu \mapsto H(\mu) - \beta E(\mu) .$$

This resembles the earlier algorithm but there is one vital difference. Now the supremum is taken over the infinite family of invariant probability measures $\mu$ over $X$. There is no reason that the supremum is uniquely attained. In fact this is not usually the case.

There is a competition between two effects. Assuming $\beta > 0$ the energy term $-\beta E(\mu)$ is larger if $E(\mu)$ is negative and this requires alignment of the spins, i.e. ordered configurations are preferred. But the entropy term $H(\mu)$ is largest if the system is disordered, i.e. if all possible configurations are equally probable. If $\beta$ is large the energy term tends to prevail but if $\beta$ is small then the entropy term prevails. In fact $\beta$ is interpretable as the inverse temperature and there is a tendency to ordering at low temperatures and to disorder at high temperatures. Since there are two possible directions of alignment of the spins this indicates that there are two distinct maximising measures at low temperature and only one at high temperatures. The advantage of this description is that it reflects reality. The Ising model, with $d \ge 2$, indeed gives a simple description of a phase transition for which there is a spontaneous magnetization at low temperatures.
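The competition between energy and entropy can be seen already on a small finite system. The sketch below is only an illustration and makes assumptions beyond the text: it uses a ring of 10 spins with periodic boundary conditions, the Gibbs measure on that ring, and two arbitrary inverse temperatures. A finite one-dimensional ring exhibits the trade-off but not the phase transition, which requires the infinite lattice in dimension at least two.

```python
import math
from itertools import product

def ring_energy(spins):
    """Nearest-neighbour energy on a ring: -1 per aligned pair, +1 per opposite pair."""
    n = len(spins)
    return sum(-1 if spins[i] == spins[(i + 1) % n] else 1 for i in range(n))

def gibbs_per_site(n_sites, beta):
    """Energy and entropy per site of the Gibbs measure on a ring of n_sites spins."""
    configs = list(product((0, 1), repeat=n_sites))
    energies = [ring_energy(c) for c in configs]
    weights = [math.exp(-beta * e) for e in energies]
    z = sum(weights)
    probs = [w / z for w in weights]
    energy = sum(p * e for p, e in zip(probs, energies)) / n_sites
    entropy = -sum(p * math.log(p) for p in probs if p > 0) / n_sites
    return energy, entropy

hot = gibbs_per_site(10, 0.1)    # small inverse temperature: entropy dominates
cold = gibbs_per_site(10, 2.0)   # large inverse temperature: energy dominates
```

At the small value of the inverse temperature the entropy per site is close to its maximum $\log 2$ and the energy per site is close to zero; at the large value the energy per site is close to $-1$ and the entropy per site is small.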

Although we have described the model with a nearest neighbour interaction which favours alignment of the model atoms the same general features pertain if the interaction favours anti-parallel alignment, i.e. if the alignment of neighbours has positive energy and the anti-alignment negative energy. Then it is still energetically favourable to have an ordered state but the type of ordering is different. The model then describes a phenomenon called antiferromagnetism.

The description of the invariant equilibrium states as the invariant measures which maximize the mean entropy at fixed mean energy has many other positive aspects. Since $\mu \mapsto H(\mu) - \beta E(\mu)$ is an affine function it tends to attain its maximum at extremal points of the convex weakly* compact set of invariant measures $E$. In fact if the maximum is unique then the maximizing measure is automatically extremal. If, however, the maximum is not uniquely attained then the maximizing measures form a face $\Delta_\beta$ of the convex set $E$ and each $\mu \in \Delta_\beta$ has a unique decomposition as a convex combination of extremal measures in $\Delta_\beta$. This indicates that the extremal measures correspond to pure phases and in the case of a phase transition there is a unique prescription of the phase separation. This interpretation is corroborated by the observation that the extremal invariant states are characterized by the absence of long range correlations.

The foregoing description of the thermodynamic phases of macroscopic systems was successfully developed in the 1970s and 1980s and also extended to the description of quantum systems. But the latter extension requires the development of a non-commutative generalization of the entropy.

7. Quantum mechanics and non-commutativity

The Ising model has a simple quantum-mechanical extension. Again one envisages atoms at the points of a cubic lattice $\mathbb{Z}^d$ but each atom now has more structure. The simplest assumption is that the observables corresponding to the atom at the point $x \in \mathbb{Z}^d$ are described by an algebra $\mathcal{A}_{\{x\}}$ of $2 \times 2$-matrices. Then the observables corresponding to the atoms at the points of a finite subset $\Lambda \subset \mathbb{Z}^d$ are described by an algebra $\mathcal{A}_\Lambda$ of $2^{|\Lambda|} \times 2^{|\Lambda|}$-matrices where $|\Lambda|$ indicates the number of points in $\Lambda$. Thus

$$\mathcal{A}_\Lambda = \prod_{x \in \Lambda} \mathcal{A}_{\{x\}}$$

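where the product is a tensor product of matrices. A quantum-mechanical state $\omega_\Lambda$ of the subsystem $\Lambda$ is then determined by a positive matrix $\rho_\Lambda$ with $\mathrm{Tr}_\Lambda(\rho_\Lambda) = 1$ where $\mathrm{Tr}_\Lambda$ denotes the trace over the matrices $\mathcal{A}_\Lambda$. The value of an observable $A \in \mathcal{A}_\Lambda$ in the state $\omega_\Lambda$ is then given by

$$\omega_\Lambda(A) = \mathrm{Tr}_\Lambda(\rho_\Lambda A) .$$

Now if $\Lambda \subset \Lambda'$ one can identify $\mathcal{A}_\Lambda$ as a subalgebra of $\mathcal{A}_{\Lambda'}$ and for consistency the matrices $\rho_\Lambda$ that determine the state must satisfy the condition

$$\rho_\Lambda = \mathrm{Tr}_{\Lambda' \setminus \Lambda}(\rho_{\Lambda'}) . \qquad (26)$$

The natural generalization of the classical entropy is now given by the family of entropies

$$H_\Lambda(\omega) = -\mathrm{Tr}_\Lambda(\rho_\Lambda \log \rho_\Lambda) \qquad (27)$$

as $\Lambda$ varies over the bounded subsets of $\mathbb{Z}^d$. The previous mean entropy should then be defined by

$$H(\omega) = \lim_{\Lambda \to \infty} H_\Lambda(\omega)/|\Lambda|$$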

if the limit exists. The existence of the limit is now a rather different problem than before. Nevertheless it can be established for translation invariant states by an extension of the earlier subadditivity argument which we now briefly describe.

First if $\rho$ and $\sigma$ are two positive matrices both with unit trace then the entropy of $\rho$ relative to $\sigma$ is defined by

$$H(\rho|\sigma) = -\mathrm{Tr}(\rho \log \rho - \rho \log \sigma)$$

in direct analogy with the earlier definition (6). The key point is that one still has the property $H(\rho|\sigma) \le 0$. This is established as follows. Let $\rho_i$ and $\sigma_i$ denote the eigenvalues of $\rho$ and $\sigma$. Further let $\psi_i$ denote an orthonormal family of eigenfunctions of $\rho$ corresponding to the eigenvalues $\rho_i$. Then

$$\begin{aligned}
-\mathrm{Tr}(\rho \log \rho - \rho \log \sigma) &= -\sum_i \big(\rho_i \log \rho_i - \rho_i (\psi_i, (\log \sigma)\psi_i)\big) \\
&\le -\sum_i \big(\rho_i \log \rho_i - \rho_i \log(\psi_i, \sigma\psi_i)\big) \\
&= -\sum_i (\psi_i, \sigma\psi_i)\, \frac{\rho_i}{(\psi_i, \sigma\psi_i)} \log\frac{\rho_i}{(\psi_i, \sigma\psi_i)} \\
&\le \sum_i (\psi_i, \sigma\psi_i) \Big(1 - \frac{\rho_i}{(\psi_i, \sigma\psi_i)}\Big) = 1 - 1 = 0
\end{aligned}$$

where we have used concavity of the logarithm and the inequality $-x \log x \le 1 - x$.

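Now suppose that $\Lambda_1$ and $\Lambda_2$ are two disjoint subsets of $\mathbb{Z}^d$. Set $\rho = \rho_{\Lambda_1 \cup \Lambda_2}$ and $\sigma = \rho_{\Lambda_1} \otimes \rho_{\Lambda_2}$. Then it follows from the foregoing that

$$-\mathrm{Tr}_{\Lambda_1 \cup \Lambda_2}\big(\rho_{\Lambda_1 \cup \Lambda_2} \log \rho_{\Lambda_1 \cup \Lambda_2} - \rho_{\Lambda_1 \cup \Lambda_2} \log(\rho_{\Lambda_1} \otimes \rho_{\Lambda_2})\big) \le 0 .$$

But using (26) and the identity

$$\log(\rho_{\Lambda_1} \otimes \rho_{\Lambda_2}) = \log(\rho_{\Lambda_1}) \otimes 1_{\Lambda_2} + 1_{\Lambda_1} \otimes \log(\rho_{\Lambda_2})$$

one immediately deduces that

$$H_{\Lambda_1 \cup \Lambda_2}(\omega) \le H_{\Lambda_1}(\omega) + H_{\Lambda_2}(\omega) .$$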

This corresponds to the earlier subadditivity and suffices to prove the existence of the mean entropy.
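A standard two-site example, not taken from the text, may help to show both what survives and what is lost in the non-commutative setting. Assume $\Lambda_1$ and $\Lambda_2$ are single points, let $e_0, e_1$ be an orthonormal basis of $\mathbb{C}^2$, and take the state on the pair of sites to be the pure state given by the vector $\Phi$ below.

```latex
\begin{align*}
\Phi &= \tfrac{1}{\sqrt{2}} \big( e_0 \otimes e_0 + e_1 \otimes e_1 \big) ,
  \qquad \rho_{\Lambda_1 \cup \Lambda_2} = |\Phi\rangle\langle\Phi| , \\
\rho_{\Lambda_1} &= \mathrm{Tr}_{\Lambda_2}\big(\rho_{\Lambda_1 \cup \Lambda_2}\big) = \tfrac{1}{2}\, 1_{\Lambda_1} ,
  \qquad
\rho_{\Lambda_2} = \mathrm{Tr}_{\Lambda_1}\big(\rho_{\Lambda_1 \cup \Lambda_2}\big) = \tfrac{1}{2}\, 1_{\Lambda_2} , \\
H_{\Lambda_1 \cup \Lambda_2}(\omega) &= 0 ,
  \qquad
H_{\Lambda_1}(\omega) = H_{\Lambda_2}(\omega) = \log 2 , \\
H_{\Lambda_1 \cup \Lambda_2}(\omega) &= 0 \;\le\; 2 \log 2 = H_{\Lambda_1}(\omega) + H_{\Lambda_2}(\omega) .
\end{align*}
```

Subadditivity holds, as it must, but the entropy of the joint system is strictly smaller than the entropy of either subsystem. The classical lower bound (13) therefore has no direct counterpart for matrix algebras, which is one reason the non-commutative theory requires separate arguments.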

These simple observations on matrix algebras are the starting point of the development of a non-commutative entropy theory.

Bibliography

There is an enormous literature on entropy but the 1948 paper A mathematical theory of communication by Claude Shannon in the Bell System Technical Journal remains one of the most readable accounts of its properties and significance [1]. This paper can be downloaded from

http://plan9.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf

Note that in later versions it became The mathematical theory of communication.

Another highly readable account of the entropy and its applications to communication theory is given in the book Probability and Information by A. M. Yaglom and I. M. Yaglom. This book was first published in Russian in 1956 as Verojatnost' i informacija and republished in 1959 and 1973. It has also been translated into French, German, English and several other languages. The French version was published by Dunod and an expanded English version was published in 1983 by the Hindustan Publishing Corporation [2]. I was unable to find any downloadable version.

The example of the absentminded mathematician was adapted from this book which contains many more recreational examples and a variety of applications to language, music, genetics etc.

The discussion of the asymptotics of the multinomial coefficients is apocryphal. I have followed the discussion in the Notes and Remarks to Chapter VI of Operator Algebras and Quantum Statistical Mechanics 2 by Bratteli and Robinson, Springer-Verlag, 1981 [3]. This book is now available as two searchable pdf files:

http://folk.uio.no/bratteli/bratrob/VOL-1S 1.pdf

http://folk.uio.no/bratteli/bratrob/VOL-2.pdf.

There are now many books that describe the ergodic theory of dynamical systems including the theory of entropy. The earliest source in English which covered the Kolmogorov-Sinai theory was, I believe, the 1962-63 Aarhus lecture notes of Jacobs, Ergodic theory I and II. These are now difficult to find but are worth reading if you can locate a copy. Another early source is the book by Arnold and Avez, Problèmes ergodiques de la mécanique classique, Gauthier-Villars, Paris 1967 [4].

More recent books which I have found useful are An Introduction to Ergodic Theory by Ya. G. Sinai, Princeton University Press, Mathematical Notes, 1976 [5]: An Introduction to Ergodic Theory by P. Walters, Springer-Verlag, Graduate Text in Mathematics, 1981 [6]: Topics in Ergodic Theory by W. Parry, Cambridge University Press, 1981 [7]. But there are many more.

Chapter VI of Operator Algebras and Quantum Statistical Mechanics 2 by Bratteli and Robinson [3] contains a description of the applications of entropy to spin systems but the theory has moved on since then. Another source which covers more recent developments in the quantum-mechanical applications is Quantum entropy and its use by M. Ohya and D. Petz, Springer-Verlag, 1993 [8]. Finally a recent comprehensive treatment of the extension of the theory to the structural study of operator algebras is given in Dynamical Entropy in Operator Algebras by S. Neshveyev and E. Størmer, Springer-Verlag, 2006 [9].

References and Notes

1. Shannon, C. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379-423.

2. Yaglom, A. M.; Yaglom, I. M. Probability and information; Hindustan Publishing Corporation: India, 1983.

3. Bratteli, O.; Robinson, D. W. Operator algebras and quantum statistical mechanics 2; Springer-Verlag: New York, 1981.

4. Arnold, V. I.; Avez, A. Problèmes ergodiques de la mécanique classique; Gauthier-Villars: Paris, 1967.

5. Sinai, Ya. G. An introduction to ergodic theory; Princeton University Press: New Jersey, 1976.

6. Walters, P. An introduction to ergodic theory; Springer-Verlag: New York, 1981.

7. Parry, W. Topics in ergodic theory; Cambridge University Press: UK, 1981.

8. Ohya, M.; Petz, D. Quantum entropy and its use; Springer-Verlag: Berlin, 1993.

9. Neshveyev, S.; Størmer, E. Dynamical entropy in operator algebras; Springer-Verlag: Berlin, 2006.

© 2008 by the authors; licensee Molecular Diversity Preservation International, Basel, Switzerland. This article is an open-access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).
