Available online at www.sciencedirect.com

ScienceDirect

Procedia - Social and Behavioral Sciences 147 (2014) 307 - 312

ICININFO

Combining Probabilistic Classifiers for Text Classification

Kostas Fragos, Petros Belsis, Christos Skourlas*

Department of Informatics, TEI of Athens, Ag. Spyridonos 12210 Athens GREECE

Abstract

Probabilistic classifiers are among the most popular classifiers in the machine learning community and are used in many applications. Although popular probabilistic classifiers exhibit very good performance when used individually in a specific classification task, very little work has been done on assessing the performance of two or more classifiers used in combination in the same classification task. In this work, we classify documents using two probabilistic approaches: the naive Bayes classifier and the Maximum Entropy classification model. Then, we combine the results of the two classifiers to improve the classification performance, using two merging operators, Max and Harmonic Mean. The proposed method was evaluated using the "ModApte" split of the Reuters-21578 dataset, and the evaluation results show a measurable improvement in the final evaluation accuracy.

© 2014 Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/3.0/).

Selection and peer-review under responsibility of the 3rd International Conference on Integrated Information.

Keywords:

1. Introduction

Text classification can be seen as the task of applying a learning model to extract the categories of a collection of documents. The model is then applied to each new document, which is eventually assigned to one or more categories. Text classification is important for many applications, e.g. spam filtering, e-mail routing, web directory maintenance and news filtering. Efficient training and application, performance tuning, and the building of understandable classifiers have long been common topics in text classification research. Statistical classification and machine learning techniques have been applied to text categorization, including multivariate regression models, nearest neighbor classifiers, probabilistic Bayesian models, decision trees and neural networks (Dumais et al., 1998). The use of Support Vector Machines (SVMs) for text classification has also been explored (Dumais et al., 1998), (Galathiya, 2012). Techniques for text classification fall into two main approaches: firstly, discriminative methods such as Logistic Regression (LR) and Support Vector Machines (SVMs) and, secondly, probabilistic methods related to the aspect model (Hofmann, 1999), the maximum entropy model (Fragos et al., 2005), Latent Dirichlet Allocation (Blei et al., 2002), and Bayesian classification (Hamad, 2007), (Grossman et al., 2005).

Although popular classifiers exhibit very good performance when used individually in a specific classification task, very little work has been done on assessing the performance of two or more classifiers when used in combination in the same classification task. In this work, we classify documents using two probabilistic approaches based on the naive Bayes classifier and the Maximum Entropy classification model, respectively. To improve classification performance, we propose two merging operators, Max and Harmonic Mean, to combine the results of the two classifiers.

* Corresponding author. Tel.: +30-2105910974; fax: +30-2105910975. E-mail address: cskourlas@teiath.gr

doi: 10.1016/j.sbspro.2014.07.098

2. Two probabilistic approaches for document classification

2.1. Naïve Bayes classifier

A text classifier can be defined as a function that maps a document $d$ of words (features), $d = (x_1, x_2, x_3, \ldots, x_n)$, to a confidence that the document $d$ belongs to a text category. If the features $x_1, \ldots, x_n$ are conditionally independent given the category variable $c$, then the Naïve Bayes classifier (Al-Aidaroos et al., 2010) is often used to estimate the probability of each category. Bayes' theorem can be used to estimate the probabilities:

$$\Pr(c \mid d) = \frac{\Pr(d \mid c)\,\Pr(c)}{\Pr(d)} \qquad (1)$$

Fragos et al. (2005) used training data to estimate the model parameters in order to find the best class, $\arg\max_c \Pr(c)\Pr(d \mid c)$, for the documents of the test set. This technique was based on the one proposed by McCallum and Nigam (1998).
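The decision rule above can be illustrated with a minimal sketch. It assumes a multinomial word-count model with add-one (Laplace) smoothing, which the paper does not specify; the function names and the toy documents are hypothetical and serve only to show how $\arg\max_c \Pr(c)\Pr(d \mid c)$ might be computed in log space.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels):
    """Estimate log Pr(c) and log Pr(x | c) from tokenized training documents."""
    vocab = {w for doc in docs for w in doc}
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, c in zip(docs, labels):
        word_counts[c].update(doc)
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_like = {}
    for c in class_docs:
        total = sum(word_counts[c].values())
        # Add-one (Laplace) smoothing: an assumption, not stated in the paper.
        log_like[c] = {
            w: math.log((word_counts[c][w] + 1) / (total + len(vocab)))
            for w in vocab
        }
    return log_prior, log_like


def classify(doc, log_prior, log_like):
    """Return argmax_c Pr(c) Pr(d | c); Pr(d) is the same for every class."""
    def score(c):
        # Words never seen in training are skipped.
        return log_prior[c] + sum(log_like[c][w] for w in doc if w in log_like[c])
    return max(log_prior, key=score)


# Toy, hypothetical training data (not taken from Reuters-21578).
docs = [["shares", "merger", "stake"], ["wheat", "harvest", "tonnes"],
        ["merger", "bid", "shares"], ["wheat", "export", "tonnes"]]
labels = ["acq", "grain", "acq", "grain"]
log_prior, log_like = train_naive_bayes(docs, labels)
predicted = classify(["merger", "stake", "bid"], log_prior, log_like)
```

Working with sums of logarithms rather than products of probabilities avoids numerical underflow on long documents; the denominator $\Pr(d)$ of Equation 1 is dropped because it does not affect the arg max.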

2.2. Maximum Entropy Classification

Entropy was introduced by Shannon in communication theory. The entropy $H$ measures the average uncertainty of a single random variable $X$:

$$H(p) = H(X) = -\sum_{x \in X} p(x) \log_2 p(x) \qquad (2)$$

where $p(x)$ is the probability mass function of the random variable $X$. In a different context, entropy has been used in natural language processing tasks. Della Pietra et al. (1997) showed that there is always a unique distribution with maximum entropy and that this distribution has an exponential form. Fragos et al. (2005) used the improved iterative scaling (IIS) algorithm, a hill-climbing algorithm for estimating the parameters of the maximum entropy model, specially adjusted for text classification.

In Section 3, we explain how the chi-square goodness-of-fit statistical test can be used as an alternative relatedness measure for text classification purposes. Section 4 describes how two merging operators of the classification results can be used to improve classification performance. In Section 5, we present the data used in the experiments and discuss the evaluation results. Finally, Section 6 gives our conclusions, followed by some directions for future work.
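As a small numerical illustration of the entropy definition in Equation 2, the following sketch evaluates $H$ for two hypothetical distributions over four outcomes. The distributions are made up for illustration; the point is only that, with no constraints beyond normalization, the uniform distribution has the highest entropy.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; outcomes with p(x) = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


uniform = [0.25, 0.25, 0.25, 0.25]   # hypothetical distribution
skewed = [0.7, 0.1, 0.1, 0.1]        # hypothetical distribution
h_uniform = entropy(uniform)         # 2 bits for four equally likely outcomes
h_skewed = entropy(skewed)           # about 1.36 bits
```

The maximum entropy model applies the same principle under feature constraints: among all distributions consistent with the training data, it prefers the one with the largest $H$.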

3. Chi-Square Test for Feature Selection

The chi-square ($\chi^2$) test has been used in the past for feature selection in the text classification field. Yang and Pedersen (1997) compared five measures for term selection and found that chi-square and information gain gave the best performance. Fragos et al. (2005) proposed a new method of applying Maximum Entropy modeling to text classification, using weights both to select the features of the model and to evaluate the importance of each feature in the classification task. Instead of using Maximum Entropy modeling in the classical way, they used $\chi^2$ values to weight the features of the model and their importance. Their method was evaluated on the Reuters-21578 dataset for text classification tasks.

Example

Given the two disjoint categories $c_1 = \text{'Acq'}$ and $c_2 \neq \text{'Acq'}$ from the Reuters-21578 'ModApte' split training dataset, we want to decide whether the word 'usa' is a good feature for classification in the category 'Acq'. After all stopwords are removed, the word 'usa' occurs 1,238 times in the category 'Acq' and 4,464 times in the other categories, i.e. 5,702 times in total. The class 'Acq' contains 125,907 terms (words) and the other classes contain 664,241, for a total of 790,148 terms. The null hypothesis is that the word 'usa' and the class label 'Acq' occur independently. We can compute the expected frequencies:

$$E_{11} = \frac{5{,}702 \times 125{,}907}{790{,}148} = 908.59 \qquad (w = \text{'usa'},\ c = \text{'Acq'})$$

$$E_{12} = \frac{5{,}702 \times 664{,}241}{790{,}148} = 4{,}793.4 \qquad (w = \text{'usa'},\ c \neq \text{'Acq'})$$

$$E_{21} = \frac{784{,}446 \times 125{,}907}{790{,}148} = 124{,}998.4 \qquad (w \neq \text{'usa'},\ c = \text{'Acq'})$$

$$E_{22} = \frac{784{,}446 \times 664{,}241}{790{,}148} = 659{,}447.6 \qquad (w \neq \text{'usa'},\ c \neq \text{'Acq'})$$

Then we calculate the $\chi^2$ value:

$$\chi^2 = \frac{(1{,}238 - 908.59)^2}{908.59} + \frac{(4{,}464 - 4{,}793.4)^2}{4{,}793.4} + \frac{(124{,}669 - 124{,}998.4)^2}{124{,}998.4} + \frac{(659{,}777 - 659{,}447.6)^2}{659{,}447.6} = 143.096$$

Looking up the $\chi^2$ distribution for significance level $\alpha = 0.05$ and one degree of freedom gives a critical value of 3.841; if the calculated value is greater than the critical value, we can reject the null hypothesis. The calculated value of 143.096 is far larger, so we have strong evidence for the pair ('usa', 'Acq'), and the word 'usa' is a good feature for classification in the category 'Acq'.
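The worked example can be reproduced with a short sketch. The counts are those given in the text; the function name and the variable names are illustrative, and the critical value 3.841 is the standard $\chi^2$ threshold for $\alpha = 0.05$ with one degree of freedom.

```python
def chi_square_2x2(o11, o12, o21, o22):
    """Pearson chi-square statistic for a 2x2 term/category contingency table."""
    n = o11 + o12 + o21 + o22
    row_totals = (o11 + o12, o21 + o22)   # term present / term absent
    col_totals = (o11 + o21, o12 + o22)   # in category / not in category
    observed = ((o11, o12), (o21, o22))
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under independence of term and category.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


# Counts from the 'usa' / 'Acq' example in the text.
usa_acq, usa_other = 1238, 4464
acq_terms, other_terms = 125907, 664241
chi2 = chi_square_2x2(usa_acq, usa_other,
                      acq_terms - usa_acq, other_terms - usa_other)

CRITICAL_005_1DF = 3.841  # chi-square critical value, alpha = 0.05, 1 d.o.f.
is_good_feature = chi2 > CRITICAL_005_1DF
```

The statistic comes out at roughly 143.1, in agreement with the hand calculation, so the null hypothesis of independence is rejected.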

4. Merging Operators for the Naïve Bayes and Maximum Entropy classifiers

We use two operators to combine the results of the Naïve Bayes Classifier (NBC) and the Maximum Entropy Classifier (MEC) to compensate for errors in each classifier, and to improve the classification performance.

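$$\mathrm{MaxC}(d) = \max\{\mathrm{NBC}(d), \mathrm{MEC}(d)\} \qquad (3)$$

$$\mathrm{HarmonicC}(d) = \frac{2.0 \times \mathrm{NBC}(d) \times \mathrm{MEC}(d)}{\mathrm{NBC}(d) + \mathrm{MEC}(d)} \qquad (4)$$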

Equation 3 shows that the MaxC(d) operator chooses the maximum of the results of the Naïve Bayes (NBC(d)) and Maximum Entropy (MEC(d)) classifiers for an input document d. In Equation 4, the HarmonicC(d) operator computes the harmonic mean of the results of these two classifiers. Jongwoo et al. (2010) used these merging operators to classify sentences containing Databank Accession Numbers, a key piece of bibliographic information, from online biomedical articles.
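A minimal sketch of the two operators in Equations 3 and 4 follows. It assumes that NBC(d) and MEC(d) are non-negative confidence scores for the same category; the example scores are hypothetical, and the zero-denominator guard is an added assumption, since the paper does not say how that case is handled.

```python
def max_c(nbc, mec):
    """Equation 3: keep the larger of the two classifier scores."""
    return max(nbc, mec)


def harmonic_c(nbc, mec):
    """Equation 4: harmonic mean of the two classifier scores."""
    if nbc + mec == 0:
        return 0.0  # assumption: both scores zero gives a merged score of zero
    return 2.0 * nbc * mec / (nbc + mec)


# Hypothetical confidence scores for one document and one category.
nbc_score, mec_score = 0.6, 0.9
merged_max = max_c(nbc_score, mec_score)
merged_harmonic = harmonic_c(nbc_score, mec_score)
```

The harmonic mean is pulled toward the smaller score, so it is high only when both classifiers are confident, whereas Max accepts the verdict of whichever classifier is more confident.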

5. Evaluation

The proposed classification technique was evaluated using the "ModApte" split of the Reuters-21578 dataset. The corpus includes 9,603 training documents and 3,299 test documents. Ten categories out of 135 potential categories were chosen. If a document belongs to a specific category, it is placed in the "Yes" group for that category; otherwise it is placed in the "No" group. The 10 categories, with the number of documents in the training and test phases, are shown in Table 1.

Table 1. 10 categories from the "ModApte" split of the Reuters-21578 dataset with the number of documents for the Training phase and the Test phase

| Category | Training Set (YES) | Training Set (NO) | Test Set (YES) | Test Set (NO) |
|---|---|---|---|---|
| Acq | 1615 | 7988 | 719 | 2580 |
| Corn | 175 | 9428 | 56 | 3243 |
| Crude | 383 | 9220 | 189 | 3110 |
| Earn | 2817 | 6786 | 1087 | 2212 |
| Grain | 422 | 9181 | 149 | 3150 |
| Interest | 343 | 9260 | 131 | 3168 |
| Money-fx | 518 | 9085 | 179 | 3120 |
| Ship | 187 | 9416 | 89 | 3210 |
| Trade | 356 | 9247 | 117 | 3182 |
| Wheat | 206 | 9397 | 71 | 3228 |

In both the training phase and the test phase, all the documents were parsed and a list of stopwords was used. Eventually, a list of 32,412 distinct terms (out of a total of 790,148 words) was defined. Then, the chi-square test was applied to the corpus and the 2,000 highest-ranked words were selected for each category to be used in the maximum entropy model. Table 2 presents the 10 top-ranked terms calculated by the chi-square test for three categories.

Table 2. The 10 top-ranked words calculated by the chi-square test for three categories of the ModApte Reuters-21578 training dataset

| Corn | Crude | Earn |
|---|---|---|
| values | crude | earn |
| july | comment | usa |
| egypt | spoke | convertible |
| agreed | stabilizing | moody |
| shipment | cancel | produce |
| belgium | shipowners | former |
| oilseeds | foresee | borrowings |
| finding | sites | caesars |
| february | techniques | widespread |
| permitted | stayed | honduras |
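The per-category selection of the highest-ranked terms might be sketched as follows. The helper name is hypothetical, and all scores except the 'usa' value from the worked example in Section 3 are made-up illustrative numbers, not results from the paper; in the experiments the cut-off was 2,000 terms per category.

```python
import heapq


def top_k_terms(chi2_scores, k):
    """Return the k terms with the largest chi-square score for one category."""
    return heapq.nlargest(k, chi2_scores, key=chi2_scores.get)


# Illustrative scores for the category 'Acq'; only 'usa' comes from the text.
acq_scores = {"usa": 143.096, "merger": 310.0, "weather": 1.2, "stake": 95.5}
selected = top_k_terms(acq_scores, k=2)
```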

The features of the maximum entropy model were instantiated using the 2,000 highest-ranked words (terms) for each category. To evaluate the classification performance of the classifiers we used the following measures: micro-Recall ($\mu\mathrm{Re}$), micro-Precision ($\mu\mathrm{Pr}$) and the micro-averaged F1 measure (micro-F1). For each class, let $a$ denote the number of documents correctly classified in the class by the system, let $b$ denote the overall number of documents classified in the class, and let $d$ denote the overall number of documents belonging to the class. We define $\mu\mathrm{Pr}$ and $\mu\mathrm{Re}$ as

$$\mu\mathrm{Pr} = \frac{\sum a}{\sum b} \quad \text{and} \quad \mu\mathrm{Re} = \frac{\sum a}{\sum d}$$

where the summing is over all the classes.

The micro-F1 measure is then computed as the harmonic mean of $\mu\mathrm{Pr}$ and $\mu\mathrm{Re}$:

$$\text{micro-F1} = \frac{2 \times \mu\mathrm{Pr} \times \mu\mathrm{Re}}{\mu\mathrm{Pr} + \mu\mathrm{Re}}$$
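The micro-averaged measures defined above can be computed as in the following sketch. The per-class counts (a, b, d) are hypothetical values chosen for illustration, not figures from the experiments, and the function name is an assumption.

```python
def micro_scores(per_class_counts):
    """Micro-averaged precision, recall and F1.

    per_class_counts is a list of (a, b, d) tuples, one per class:
    a = documents correctly assigned to the class,
    b = documents assigned to the class by the system,
    d = documents that truly belong to the class.
    """
    sum_a = sum(a for a, _, _ in per_class_counts)
    sum_b = sum(b for _, b, _ in per_class_counts)
    sum_d = sum(d for _, _, d in per_class_counts)
    precision = sum_a / sum_b
    recall = sum_a / sum_d
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for two classes.
counts = [(80, 100, 90), (40, 50, 60)]
micro_pr, micro_re, micro_f1 = micro_scores(counts)
```

Because the counts are pooled before dividing, frequent categories such as Earn and Acq weigh more heavily in the micro-averaged scores than rare ones such as Corn.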

Table 3 shows the micro-averaged F1 performance of the Naive Bayes and Maximum Entropy classifiers and of the Max and Harmonic Mean merging operators.

Table 3. Micro-averaged F1 measure performance for Naive Bayes and Maximum Entropy Classifiers and Max and harmonic Operators

| Algorithm | Performance |
|---|---|
| Naive Bayes | 0.81 |
| Maximum Entropy | 0.88 |
| MaxC | 0.90 |
| HarmonicC | 0.91 |

The Maximum Entropy classifier performs better than Naïve Bayes, with a micro-averaged F1 of 0.88 against 0.81. Both the MaxC(d) and HarmonicC(d) operators raise the micro-averaged F1 (to 0.90 and 0.91, respectively) above the scores of the individual Naïve Bayes and Maximum Entropy classifiers.

6. Conclusion

In this paper we describe a technique for combining two classifiers, based on Naïve Bayes and Maximum Entropy respectively, to classify documents of the "ModApte" split of the Reuters-21578 dataset. We use a chi-square feature selection strategy to select the most representative words (features), as proposed by Fragos et al. (2005). The Maximum Entropy model performs better than the Naive Bayes classifier. Two merging operators are used to combine the results of the Naïve Bayes and Maximum Entropy classifiers to improve performance, especially for the Recall rate. The merging operators do improve the performance, as seen in the results for the micro-averaged F1 measure (0.90 and 0.91 for the MaxC and HarmonicC operators, respectively). As future work, we intend to find additional methods of collecting sets of words (features) and different merging operators to further improve the performance.

Acknowledgements

This research has been co-funded by the European Union (Social Fund) and Greek national resources under the framework of the "Archimedes III: Funding of Research Groups in TEI of Athens" project of the "Education & Lifelong Learning" Operational Programme.

References

Al-Aidaroos, K. M., Bakar, A. A., and Othman, Z., 2010. Naive Bayes variants in classification learning. In Proceedings of the International Conference on Information Retrieval and Knowledge Management, March 17-18, 2010, Shah Alam, Selangor, pp. 276-281.

Blei, D., Ng, A., and Jordan, M., 2002. Latent Dirichlet allocation. In Proceedings of NIPS 14.

Della Pietra, S., Della Pietra, V., and Lafferty, J., 1997. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4).

Dumais, S. T., Platt, J., Heckerman, D., and Sahami, M., 1998. Inductive learning algorithms and representations for text categorization. In Proceedings of the Seventh International Conference on Information and Knowledge Management, pp. 148-155. ACM Press.

Fragos, K., Maistros, I., and Skourlas, C., 2005. A χ²-Weighted Maximum Entropy Model for Text Classification. In Proceedings of the 2nd International Conference on Natural Language Understanding and Cognitive Science, Miami, Florida, pp. 22-23.

Galathiya, A. S., Ganatra, A. P., and Bhensdadia, C. K., 2012. An improved decision tree induction algorithm with feature selection, cross validation, model complexity & reduced error pruning. IJSCIT, March 2012.

Grossman, D., and Domingos, P., 2005. Learning Bayesian network classifiers by maximizing conditional likelihood. In Proceedings of the Twenty-First International Conference on Machine Learning, pp. 361-368. ACM Press.

Hamad, A., 2007. Weighted Naive Bayesian Classifier. IEEE/ACS International Conference on Computer Systems and Applications, AICCSA '07, Volume 1, Issue 1, pp. 437-441.

Hofmann, T., 1999. Probabilistic latent semantic analysis. In Proceedings of UAI.

Jongwoo, K., Daniel, X. L., and George, R. T., 2010. Naive Bayes and SVM Classifiers for Classifying Databank Accession Number Sentences from Online Biomedical Articles. National Library of Medicine.

McCallum, A., and Nigam, K., 1998. A comparison of event models for naive Bayes text classification. In AAAI/ICML-98 Workshop on Learning for Text Categorization.

Reuters-21578. http://www.daviddlewis.com/resources/testcollections/reuters21578/

Yang, Y., and Pedersen, J., 1997. A comparative study on feature selection in text categorization. In Proceedings of the Fourteenth International Conference on Machine Learning (ICML'97), pp. 412-420.