So how do we evaluate *Prob(C|D)*? The multinomial
naive Bayes method proceeds as follows:

First, as before, *Prob(C|D) = Prob(D|C) Prob(C)/Prob(D)*.

Second, let *T1, T2 ... Tm* be the sequence of
lexical terms in *D*. We will
assume that the occurrence of a term *T* in the *I*th place of *D*
depends only on the category *C* and, given *C*, is conditionally
independent of all the other terms in *D* and of the position *I*.
Therefore

*Prob(D|C) = Prob(T1|C)*Prob(T2|C) * ... * Prob(Tm|C)*

where *Prob(Ti|C)* means "the probability that a randomly chosen word
token in a document of type *C* is the word *Ti*".

Third, we estimate *Prob(Ti|C)* and *Prob(C)*
using the training set.

*Prob(Ti|C)* is estimated as the relative frequency
of *Ti* in documents of category *C*:

(number of occurrences of *Ti* in *C*) /
(total number of words in *C*).

*Prob(C)* is estimated as the fraction of documents in the training set that
are of category *C*.
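As a concrete illustration of this estimation step, here is a minimal
sketch in Python. The data layout (a list of (word-list, category)
pairs) and the name *estimate_multinomial* are illustrative choices,
not part of the method as stated above.

```python
from collections import Counter, defaultdict

def estimate_multinomial(training_set):
    """Estimate Prob(C) and Prob(Ti|C) by relative frequencies.

    training_set: a list of (words, category) pairs, where words is the
    list of word tokens in one training document.
    Returns (prior, cond): prior[c] = Prob(C=c), and cond[c][t] is the
    relative frequency of token t among all tokens in category c.
    """
    doc_count = Counter()               # number of documents per category
    token_count = defaultdict(Counter)  # token counts per category
    total_tokens = Counter()            # total number of tokens per category

    for words, cat in training_set:
        doc_count[cat] += 1
        for w in words:
            token_count[cat][w] += 1
            total_tokens[cat] += 1

    n_docs = sum(doc_count.values())
    prior = {c: doc_count[c] / n_docs for c in doc_count}
    cond = {c: {w: token_count[c][w] / total_tokens[c]
                for w in token_count[c]}
            for c in token_count}
    return prior, cond
```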

Fourth, as usual in Naive Bayes, we can ignore the denominator *Prob(D)*,
since it is independent of the value of *C*.
Since we are
given *D* and are just trying to compare different categories, we need
not calculate *Prob(D)*. (This is known as a *normalizing factor*;
its function is to ensure that the probabilities add up to 1.)

We can now calculate the product

*Prob(Ci)*Prob(T1|Ci)*Prob(T2|Ci) * ... * Prob(Tm|Ci)* for each
category *Ci* and choose the category that maximizes this.
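A corresponding sketch of this final step, using the output of the
hypothetical *estimate_multinomial* above. It computes the product as
a sum of logarithms, which is equivalent for choosing the maximum but
avoids numerical underflow; a word never seen in a category simply
rules that category out here (a simplification of this sketch, not
something the method above prescribes).

```python
import math

def classify(words, prior, cond):
    """Choose the category Ci that maximizes
    Prob(Ci) * Prob(T1|Ci) * ... * Prob(Tm|Ci),
    computed as a sum of logs for numerical stability."""
    best_cat, best_score = None, float("-inf")
    for cat in prior:
        score = math.log(prior[cat])
        for w in words:
            p = cond.get(cat, {}).get(w, 0.0)
            if p == 0.0:
                score = float("-inf")   # unseen word rules out this category
                break
            score += math.log(p)
        if score > best_score:
            best_cat, best_score = cat, score
    return best_cat
```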

An alternative is to use only the presence or absence of each term
in *D*, rather than its frequency (this is sometimes called the
Bernoulli model). The procedure is as follows:

First, as before, *Prob(C|D) = Prob(D|C) Prob(C)/Prob(D)*.

Second, let *{T1, T2 ... Tm}* be the set of *distinct*
lexical terms in *D*. We will
assume that the *appearance*
of a term *Ti* depends only on the category *C* and
is independent of all the other terms in *D*. Therefore

*Prob(D|C) = Prob(T1|C)*Prob(T2|C) * ... * Prob(Tm|C)*

(where *Prob(Ti|C)*
means "the probability that a document of category *C* contains at least
one occurrence of *Ti*").

Third, we estimate *Prob(Ti|C)* and *Prob(C)* using the training set.

*Prob(Ti|C)* is estimated as the fraction of documents in *C* that contain
*Ti*.

The rest of the algorithm proceeds as above.
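The estimation step of this variant can be sketched in the same way
(again, the data layout and names are illustrative). The only changes
are that each word is counted at most once per document and that
*Prob(Ti|C)* is a fraction of documents rather than of word tokens;
classification then proceeds as before, but over the set of distinct
terms in *D*.

```python
from collections import Counter, defaultdict

def estimate_bernoulli(training_set):
    """Estimate Prob(C) and Prob(Ti|C) for the presence/absence variant.

    Here Prob(t|c) is the fraction of documents of category c that
    contain at least one occurrence of t.
    """
    doc_count = Counter()                  # documents per category
    docs_with_word = defaultdict(Counter)  # per category: docs containing w

    for words, cat in training_set:
        doc_count[cat] += 1
        for w in set(words):               # each word counted once per doc
            docs_with_word[cat][w] += 1

    n_docs = sum(doc_count.values())
    prior = {c: doc_count[c] / n_docs for c in doc_count}
    cond = {c: {w: docs_with_word[c][w] / doc_count[c]
                for w in docs_with_word[c]}
            for c in docs_with_word}
    return prior, cond
```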

The stochastic generative model corresponding to this procedure generates documents by a random process. The question then becomes: given a document generated in this way, what is the conditional probability that it belongs to a given category?

As in the discussion of Naive Bayes generally as a stochastic generative model, the model here consists of two levels of roulette wheels.

Given the particular structure of these roulette wheels, the assignment of probabilities that makes the training set most likely is the one that sets each probability equal to the corresponding observed frequency in the training set -- which is exactly what the estimates above do.

First, you spin a roulette wheel to choose the value *v* of the
classification attribute, with probability *Prob(X.C=v)*.

In the frequency-based (multinomial) version, you now arbitrarily
choose the number *k*
of words in the document to be generated. For each category *v*
you have a separate roulette wheel, with all possible words as values,
where the probability of obtaining word *u* is
*Prob(X.T=u|X.C=v)*. You spin that roulette wheel *k*
times. That gives you all the words in the document. You put them in random
order, and there's your document.

(The fact that the number of words, *k*, is an arbitrary
rather than a probabilistic choice corresponds to the fact that we're not
using document length as a feature in categorization. Of course, one
could, and that might often be reasonable -- some kinds of documents
are typically longer than others. In that case, there would have to be another
roulette wheel here for document length.)
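Here is a minimal sketch of that two-level roulette-wheel process,
reusing the hypothetical *prior* and *cond* tables from the sketches
above; the weighted draw of random.choices plays the role of a
roulette wheel.

```python
import random

def generate_multinomial(prior, cond, k):
    """Generate a k-word document: spin once for the category, then
    spin that category's word wheel k times and shuffle the result."""
    cats = list(prior)
    cat = random.choices(cats, weights=[prior[c] for c in cats])[0]
    vocab = list(cond[cat])
    doc = random.choices(vocab, weights=[cond[cat][w] for w in vocab], k=k)
    random.shuffle(doc)      # the words are put in random order
    return cat, doc
```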

In the presence/absence version, for each category *v* and for each
word *w* in the language
you have a separate roulette wheel, with two values: *TRUE* and
*FALSE*. Having chosen the category *v*, for every word
*w* you spin the corresponding roulette wheel. If it comes up TRUE,
you include the word *w* in the document at least once; the number of
times to include it is arbitrary.
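And a sketch of this presence/absence process; the vocabulary
argument and the cap on repetitions are illustrative choices standing
in for the "arbitrary" number of occurrences.

```python
import random

def generate_bernoulli(prior, cond, vocabulary, max_repeats=3):
    """Generate a document: spin once for the category, then for each
    word w spin a TRUE/FALSE wheel with Prob(TRUE) = Prob(w|category).
    Included words appear an arbitrary number of times (here 1..max_repeats)."""
    cats = list(prior)
    cat = random.choices(cats, weights=[prior[c] for c in cats])[0]
    doc = []
    for w in vocabulary:
        if random.random() < cond[cat].get(w, 0.0):
            doc.extend([w] * random.randint(1, max_repeats))
    random.shuffle(doc)      # the words are put in random order
    return cat, doc
```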