Available online at www.sciencedirect.com

SciVerse ScienceDirect

Procedia Engineering 15 (2011) 3785 - 3790

www.elsevier.com/locate/procedia

Advanced in Control Engineering and Information Science

A Cluster Tree Method For Text Categorization

Zhaocai Sun , Yunming Ye, Weiru Deng, Zhexue Huang

_Harbin Institute of Technology, Shenzhen Graduate School, Shenzhen, China_

Abstract

The decision tree is a flexible and useful classification tool, but it runs into problems on high-dimensional data. Most current decision tree algorithms select and use only the single "best" feature when splitting a node. Because the remaining features are ignored, classification accuracy suffers. To address this problem, this paper uses a cluster tree for text categorization. Unlike familiar decision trees (e.g. CART, C4.5), the cluster tree uses clustering results as the splitting rule, so more features are considered at each split. The clustering algorithm used is therefore very important to the cluster tree, and a text clustering algorithm is proposed to improve its performance. Experiments show that the cluster tree copes with the high-dimensionality problem and outperforms C4.5 and CART on text data. On some data sets it even does better than LibSVM, which may be the most powerful tool for text categorization. © 2011 Published by Elsevier Ltd. Selection and/or peer-review under responsibility of [CEIS 2011]

Keywords: Text categorization, cluster tree, decision tree

1. Introduction

With the ever-increasing volume of text data from various online sources, categorizing these documents into manageable and meaningful categories is an important task[1]. Many text categorization techniques have been proposed so far, such as the centroid-based classifier[2], the Bayesian classifier[3], support vector machines[4], and the decision tree[5]. However, previous work has found that the classification accuracy of the decision tree is not high on text categorization.

The difficulty in dealing with text data is its high dimensionality[6], on which the decision tree performs very poorly. This is caused by the way familiar decision trees (e.g. C4.5, CART) operate. When a decision tree is generated, nodes are split recursively according to some criterion (e.g. information entropy, Gini index) until each node is class-pure enough. In most current decision tree algorithms, only the single "best" feature is selected and used in each splitting step. However, in the Vector Space Model (VSM)[8], which is generally used for text representation, the number of features is very high, sometimes reaching tens or hundreds of thousands. Hence, too many features are ignored, which leads to the poor performance.

* Corresponding author. Tel.: +86 13713703451 E-mail address: sunnykiller@126.com

1877-7058 © 2011 Published by Elsevier Ltd. doi:10.1016/j.proeng.2011.08.709

The motivation of this paper is to use more features to build a tree classifier. Unlike C4.5[9] or CART[10], the cluster tree runs clustering (cluster analysis) on the leaf nodes of the tree. The tree grows as the resulting clusters are repeatedly added to it. The performance of the cluster tree therefore relies strongly on the chosen clustering algorithm, so we also propose a new text clustering algorithm to improve it.

This paper is organized as follows. Section 2 introduces the framework of the cluster tree. Section 3 proposes a text clustering algorithm to enhance the cluster tree. Comparative experiments are presented in Section 4. Section 5 concludes the paper.

2. The Framework of Cluster Tree

The cluster tree is a flowchart-like tree structure (Fig. 1). Its nodes are clusters, i.e. subsets of the training set. In internal nodes the class purity is not high, but in leaf nodes it is generally high. Hence, if a new object falls into a leaf cluster, its class label can be estimated from the information in that cluster. The cluster tree algorithm consists of two steps: generating the tree and classifying with the tree.

Fig.1 An example of cluster tree

2.1. Generating cluster tree

In generating a cluster tree, the class-purity and the sample-capacity are two important parameters. In this paper, we define them as follows,

Definition 1. The capacity of a cluster is the total number of samples in the cluster.

$$\delta(C_k) = |C_k| = \sum_{x} \mathbf{1}(x \in C_k) \tag{1}$$

Definition 2. The purity of a cluster is the maximum sample frequency over the classes in the cluster.

$$p(C_k) = \frac{\max_{\omega} \sum_{x \in C_k} \mathbf{1}(y = \omega)}{\delta(C_k)} \tag{2}$$

where $y$ is the class label of sample $x$ and $\mathbf{1}(\cdot)$ is the indicator function.

A cluster tree is generated by controlling these two parameters. In our algorithm, the capacity threshold $\delta_{th}$ and the purity threshold $p_{th}$ are preset. For a cluster node $C_k$, if the purity is high enough, $p(C_k) > p_{th}$, the cluster is considered reliable enough that partitioning it is unnecessary. It is likewise unreasonable to partition a cluster whose capacity is too small: if $\delta(C_k) < \delta_{th}$, no clustering is performed and $C_k$ is left as a leaf of the tree. The details of the cluster tree framework are given in Algorithm 1.
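As a minimal sketch of Definitions 1 and 2 and of the stopping rule just described, the following Python functions compute the capacity and purity of a cluster from its class labels. The function names and the default threshold values are illustrative assumptions, not part of the original algorithm description.

```python
from collections import Counter


def capacity(labels):
    """Capacity of a cluster (Eq. 1): the number of samples it holds."""
    return len(labels)


def purity(labels):
    """Purity of a cluster (Eq. 2): share of the most frequent class."""
    if not labels:
        raise ValueError("purity is undefined for an empty cluster")
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


def should_split(labels, purity_threshold=0.85, capacity_threshold=10):
    """A node is partitioned only if it is both impure and large enough.

    The default thresholds are illustrative values chosen for this sketch.
    """
    return (purity(labels) < purity_threshold
            and capacity(labels) > capacity_threshold)


# A toy cluster with 20 samples from three classes.
labels = ["a"] * 12 + ["b"] * 6 + ["c"] * 2
print(capacity(labels), purity(labels), should_split(labels))
```

With these toy labels the cluster is large but impure, so it would be partitioned further; a cluster holding a single class, or only a handful of samples, would be kept as a leaf.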

Algorithm 1 Generate a cluster tree

Input: Training set (X, Y)

Output: The tree T

Parameter: Purity threshold $p_{th}$, capacity threshold $\delta_{th}$

1: initialize tree T with X;

2: select the leaf-node set of T, $L = \{C_1, C_2, \ldots\}$;

3: for each $C_k \in L$ do

4: get the purity and the size of $C_k$ (Eq. 1 and Eq. 2);

5: if $p(C_k) < p_{th} \wedge \delta(C_k) > \delta_{th}$ then

6: run a clustering algorithm on set $C_k$;

7: grow T with the new clusters on node $C_k$;

8: goto 2;

9: end if

10: end for
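The loop of Algorithm 1 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the `Node` class, the `cluster_fn` callback (which returns groups of local positions) and the toy splitting function are hypothetical, and a stack replaces the "goto 2" re-scan of the leaves, which visits the same nodes.

```python
from collections import Counter


class Node:
    """A cluster node; `indices` are positions of its training samples."""

    def __init__(self, indices):
        self.indices = indices
        self.children = []  # sub-clusters; empty for a leaf


def purity(labels):
    return Counter(labels).most_common(1)[0][1] / len(labels)


def grow_cluster_tree(X, y, cluster_fn, purity_threshold, capacity_threshold):
    root = Node(list(range(len(X))))
    stack = [root]  # leaves that still have to be examined
    while stack:
        node = stack.pop()
        node_labels = [y[i] for i in node.indices]
        # Step 5 negated: small or pure clusters stay leaves.
        if (len(node.indices) <= capacity_threshold
                or purity(node_labels) >= purity_threshold):
            continue
        groups = cluster_fn([X[i] for i in node.indices])
        groups = [g for g in groups if g]  # drop empty clusters
        if len(groups) < 2:
            continue  # nothing was separated; stop to guarantee termination
        for g in groups:
            child = Node([node.indices[j] for j in g])
            node.children.append(child)
            stack.append(child)
    return root


def split_on_first_feature(points):
    """Toy stand-in for a real clustering algorithm (assumption)."""
    left = [j for j, p in enumerate(points) if p[0] < 0.5]
    right = [j for j, p in enumerate(points) if p[0] >= 0.5]
    return [left, right]


X = [[0.0], [0.1], [0.2], [0.9], [1.0], [0.8]]
y = ["a", "a", "a", "b", "b", "b"]
tree = grow_cluster_tree(X, y, split_on_first_feature,
                         purity_threshold=0.9, capacity_threshold=2)
print([sorted(child.indices) for child in tree.children])
```

Passing the clustering routine as a parameter mirrors the framework's design: the tree-growing loop is independent of the clustering algorithm that is plugged in.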

2.2. Classification using cluster tree

For classification, each cluster node of the tree must be labeled with a class. If a cluster contains only one class (its purity equals 1), it is labeled with that class. If it contains several classes, only the most frequent class is considered.

Definition 3. In cluster $C_k$, if the samples with class $\Omega_k$ are the most numerous, then class $\Omega_k$ is called the most frequent class of $C_k$.

$$\Omega_k = \arg\max_{\omega} \sum_{x \in C_k} \mathbf{1}(y = \omega) \tag{3}$$

Given a new object x, a path in the tree is traced until it reaches a leaf node. If x reaches the leaf node $C_k$, x is labeled with the most frequent class $\Omega_k$. Unlike CART or C4.5, however, the cluster tree chooses the path by comparing distances. In general, the distance measure depends on the clustering algorithm used, which is discussed in the next section.
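The classification step can be sketched as a descent through the tree. The sketch below assumes that every node stores a centroid and that the path is chosen by the largest cosine value between the object and the child centroids; the `ClusterNode` class, the toy centroids and the class names are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


class ClusterNode:
    def __init__(self, centroid, labels, children=()):
        self.centroid = centroid
        # Most frequent class of the cluster (Definition 3).
        self.majority = Counter(labels).most_common(1)[0][0]
        self.children = list(children)


def classify(root, x):
    """Follow the closest child at every level and return the leaf's class."""
    node = root
    while node.children:
        node = max(node.children, key=lambda c: cosine(x, c.centroid))
    return node.majority


sports = ClusterNode([1.0, 0.0, 0.2], ["sports", "sports", "news"])
finance = ClusterNode([0.0, 1.0, 0.1], ["finance", "finance"])
root = ClusterNode([0.5, 0.5, 0.15],
                   ["sports", "sports", "news", "finance", "finance"],
                   [sports, finance])
print(classify(root, [0.9, 0.1, 0.0]))
```

Because each comparison uses the whole feature vector, every step of the descent takes all terms of the document into account, in contrast to the single-feature tests of C4.5 or CART.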

3. Text Clustering Algorithm

In fact, the performance of the cluster tree strongly relies on the clustering algorithm. For that reason, this section proposes a text clustering algorithm for better performance of cluster tree.

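Let $X = (X_1, X_2, \ldots, X_n)$ be a set of $n$ objects, where each $X_i = (x_{i,1}, x_{i,2}, \ldots, x_{i,m})$ is characterized by a set of $m$ variables (attributes). Our algorithm aims to partition the $n$ objects into $K$ clusters using the cosine distance,

$$\mathrm{dist}(X_i, D_k) = \frac{\sum_{j=1}^{m} x_{i,j}\, d_{k,j}}{\sqrt{\sum_{j=1}^{m} x_{i,j}^2}\, \sqrt{\sum_{j=1}^{m} d_{k,j}^2}} \tag{4}$$

where $D_k = (d_{k,1}, d_{k,2}, \ldots, d_{k,m})$ is the centroid of cluster $C_k$. Since this measure is the cosine of the angle between $X_i$ and $D_k$, a larger value means that the object is closer to the centroid.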

Because text data are sparse, features differ in importance. For example, if a term appears in almost all documents, it is not important for clustering or classification. Conversely, if a term appears in only one cluster, it can be asserted that the term is significant. Hence, we calculate the centroids as follows.

dk.j .JCLvkj log kK k.J -E^k.j los^k.j) (5)

k "1 K k "1

where $\mu_k$ is the mean of cluster $C_k$,

Vk, j = d tx^x j (6)
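The intuition behind the weighted centroids, namely that a term spread over all clusters is uninformative while a term concentrated in one cluster is significant, can be illustrated with an entropy-based score. The sketch below is a generic illustration of that idea and is not claimed to reproduce Eq. 5; the function names and the toy values are assumptions.

```python
import math


def cluster_entropy(term_means):
    """Shannon entropy of a term's normalised mean weight across clusters."""
    total = sum(term_means)
    if total == 0:
        return 0.0
    probs = [m / total for m in term_means if m > 0]
    return -sum(p * math.log(p) for p in probs)


def discrimination(term_means):
    """log K minus the entropy.

    Close to 0 for a term spread evenly over the K clusters and equal to
    log K for a term that occurs in a single cluster.
    """
    return math.log(len(term_means)) - cluster_entropy(term_means)


common_term = [0.2, 0.2, 0.2, 0.2]    # appears equally in all four clusters
specific_term = [0.0, 0.8, 0.0, 0.0]  # appears in one cluster only
print(discrimination(common_term), discrimination(specific_term))
```

Multiplying a cluster mean by such a score would shrink the centroid components of ubiquitous terms and keep those of cluster-specific terms, which is the qualitative behaviour the text asks for.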

Like k-means, the algorithm consists of two steps. In the first step, the centroids $D_k$ are updated by Eq. 5. In the second step, each object $X_i$ is assigned to the nearest cluster, that is, to $\arg\max_{C_k} \mathrm{dist}(X_i, D_k)$. The two steps are repeated until no object changes its assignment. The details of the algorithm are given in Algorithm 2.

Algorithm 2 Cluster-feature-centroid clustering

Input: Training set X and the number of clusters K

Output: A set of K clusters

1: repeat

2: update the centroids of the clusters (Eq. 5);

3: assign each object $X_i$ to the nearest cluster;

4: until no change
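The two-step loop of Algorithm 2 can be sketched as below. Two simplifications are assumptions of this sketch and should be read as such: the centroid is the plain cluster mean, used as a stand-in for the weighted centroid of Eq. 5, and the initial assignment is round-robin because the text does not specify one.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def cosine_clustering(X, k, max_iter=100):
    """Alternate centroid updates and reassignments until nothing changes."""
    assign = [i % k for i in range(len(X))]  # assumed initialisation
    for _ in range(max_iter):
        # Step 1: update the centroids (plain mean as a stand-in for Eq. 5).
        centroids = []
        for c in range(k):
            members = [x for x, a in zip(X, assign) if a == c]
            centroids.append(mean_vector(members) if members
                             else [0.0] * len(X[0]))
        # Step 2: assign each object to the centroid with the largest cosine.
        new_assign = [max(range(k), key=lambda c: cosine(x, centroids[c]))
                      for x in X]
        if new_assign == assign:
            break
        assign = new_assign
    return assign


X = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0],
     [0.0, 0.8, 0.2], [1.0, 0.1, 0.0]]
print(cosine_clustering(X, k=2))
```

On this toy input the documents dominated by the first term end up in one cluster and those dominated by the second term in the other, after a single reassignment.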

4. Experiments

In this work, experiments were conducted on 19MclassTextWC, which contains 19 multi-class text datasets collected by George Forman. We compared our algorithm with several well-known methods. The C4.5 and CART implementations are from the open-source Weka toolkit. LibSVM is an implementation of support vector machines. DCC is a k-means cluster tree[7].

4.1. Tree comparison

In the first experiment, the data set oh10 is selected. It has 3228 features and 10 classes. We use 80% of the data as training data to generate the tree. For the cluster tree and DCC, the parameters are set to $p_{th} = 0.85$ and $\delta_{th} = 10$. C4.5 and CART use their default parameters. Four trees are generated, one with each of the four methods.

The details of the four trees are listed in Table 1. As noted, CART and C4.5 use only one feature in each splitting step, but text data need many features to build the classification model. In other words, many splitting steps are needed to generate a tree, and the tree grows bigger and bigger to reach better performance. For example, the C4.5 tree has size 201 with 101 leaves. A tree built by clustering does not have this problem in principle, but because the clustering algorithm used by DCC is not suitable for text, the DCC tree is also fairly large (size 31). In contrast, our cluster tree uses more features in each splitting step together with a clustering algorithm suited to text, so it is not surprising that it is the smallest of the four (size 20). Owing to its smaller size, the time needed to generate the cluster tree is also very low: only 0.515 seconds, approximately 1/80 of the time taken by CART.

Table 1. Tree comparison

| | Cluster Tree | C4.5 | CART | DCC |
|---|---|---|---|---|
| Number of leaves | 15 | 101 | 27 | 19 |
| Size of tree | 20 | 201 | 53 | 31 |
| Time taken (seconds) | 0.515 | 20.1 | 43.5 | 26.8 |

4.2. Comparison results

In the second experiment, we select several subsets of 19MclassTextWC, whose properties are listed in Table 2. The five algorithms are evaluated with 5-fold cross-validation on these data. For the cluster tree, C4.5, CART and DCC, the parameters are set as in Section 4.1. LibSVM uses a linear kernel with the regularization parameter set to 0.03125, a setting generally used in text categorization.

Table 2. Properties of data

| Dataset | Features | Classes | Number of samples for each class |
|---|---|---|---|
| Oh0 | 3180 | 10 | 57:71:76:181:115:136:194:66:51:56 |
| Oh5 | 3012 | 10 | 74:72:61:149:85:120:93:144:61:59 |
| Oh10 | 3228 | 10 | 126:165:87:165:148:60:70:61:52:116 |
| Oh15 | 3101 | 10 | 56:53:69:98:106:154:56:98:157:66 |
| Tr21 | 7902 | 6 | 231:16:9:41:35:4 |
| Tr41 | 7454 | 10 | 26:174:95:243:35:9:162:33:18:83 |
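For readers who want to reproduce the 5-fold cross-validation protocol, the following sketch generates train and test index splits. It assumes a shuffled, non-stratified split with a fixed seed, since the text does not state how the folds were drawn; the function name and the seed are hypothetical. The sample count of 1050 is the sum of the Oh10 class sizes in Table 2.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal parts
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test


# Oh10 has 126+165+87+165+148+60+70+61+52+116 = 1050 samples.
splits = list(k_fold_indices(1050))
print([(len(train), len(test)) for train, test in splits])
```

Each fold holds out one fifth of the data, which matches the 80% training share used in Section 4.1. Note that the smallest class of Tr21 has only 4 samples, so with five folds at least one test fold contains no sample of that class.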

The comparison results are shown in Table 3. Because of the high-dimensionality problem, the classical decision trees (C4.5, CART) need to grow large to fit the data, and even so their accuracy is not high. DCC is also a cluster tree algorithm, but it generates the tree by k-means, which is not a good clustering method for text data. It is therefore hard to assert that DCC outperforms the traditional decision trees. In contrast, the clustering method proposed in this paper performs well on text, so it is not surprising that our cluster tree outperforms CART and C4.5 on every data set. Moreover, our method does better than LibSVM on five of the six data sets (all except Tr41).

Table 3. The classification accuracy

| Dataset | ClusterTree | C4.5 | CART | LibSVM | DCC |
|---|---|---|---|---|---|
| Oh0 | .8408 | .8136 | .8185 | .8404 | .8106 |
| Oh5 | .8913 | .7941 | .8409 | .8584 | .7989 |
| Oh10 | .8095 | .6990 | .7752 | .7933 | .7428 |
| Oh15 | .8196 | .7327 | .7722 | .8127 | .7322 |
| Tr21 | .9104 | .8125 | .8125 | .8779 | .8806 |
| Tr41 | .9488 | .9191 | .9282 | .9498 | .8863 |
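The accuracies in Table 3 can be summarised with a few lines of Python. The sketch below counts, for each method, the number of data sets on which it is the most accurate, and computes the unweighted mean accuracy over the six data sets; the mean is a simple summary added here for illustration and is not a figure reported in the paper.

```python
# Accuracies copied from Table 3.
accuracy = {
    "Oh0":  {"ClusterTree": 0.8408, "C4.5": 0.8136, "CART": 0.8185, "LibSVM": 0.8404, "DCC": 0.8106},
    "Oh5":  {"ClusterTree": 0.8913, "C4.5": 0.7941, "CART": 0.8409, "LibSVM": 0.8584, "DCC": 0.7989},
    "Oh10": {"ClusterTree": 0.8095, "C4.5": 0.6990, "CART": 0.7752, "LibSVM": 0.7933, "DCC": 0.7428},
    "Oh15": {"ClusterTree": 0.8196, "C4.5": 0.7327, "CART": 0.7722, "LibSVM": 0.8127, "DCC": 0.7322},
    "Tr21": {"ClusterTree": 0.9104, "C4.5": 0.8125, "CART": 0.8125, "LibSVM": 0.8779, "DCC": 0.8806},
    "Tr41": {"ClusterTree": 0.9488, "C4.5": 0.9191, "CART": 0.9282, "LibSVM": 0.9498, "DCC": 0.8863},
}
methods = ["ClusterTree", "C4.5", "CART", "LibSVM", "DCC"]

# Number of data sets on which each method has the highest accuracy.
wins = {m: 0 for m in methods}
for scores in accuracy.values():
    wins[max(scores, key=scores.get)] += 1

# Unweighted mean accuracy over the six data sets.
mean_accuracy = {
    m: sum(scores[m] for scores in accuracy.values()) / len(accuracy)
    for m in methods
}

print(wins)
print({m: round(v, 4) for m, v in mean_accuracy.items()})
```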

5. Conclusions

This paper uses a cluster tree method to address the high-dimensionality problem of the decision tree classifier. Moreover, we propose a text clustering algorithm to improve the performance of the cluster tree. Experiments show that our cluster tree outperforms C4.5 and CART on text data. On some data sets it also does better than LibSVM, which may currently be the most powerful tool for text categorization.

Acknowledgements

This research is supported in part by NSFC under Grant no.61073195, and Shenzhen Science and Technology Program under Grant no.CXB201005250024A and Natural Scientific Research Innovation Foundation in HIT under Grant no. HIT.NSFIR.2010128.

References

[1] Fabrizio Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 2002, 34(1):1-47.

[2] Han, E.H., Karypis, G. Centroid-based document classification: Analysis and experimental results. Principles of Data Mining and Knowledge Discovery, Springer, 2000, 116-223.

[3] Wang, Q., Garrity, G.M., Tiedje, J.M., Cole, J.R. Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Applied and Environmental Microbiology, 2007, 73:5261-5267.

[4] Joachims, T. Text categorization with support vector machines: Learning with many relevant features. Machine Learning: ECML-98, Springer 1998, 137-142.

[5] Lewis, D.D., Ringuette, M. A comparison of two learning algorithms for text categorization. Third annual symposium on document analysis and information retrieval, 1994, 33:81-93.

[6] Haoran Wu, Bing Liu, Tong Heng, Xiaoli Liu. A refinement approach to handling model misfit in text categorization. KDD '02 Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, New York, 2002.

[7] Y. Li, E. Hung, K. Chung, J.Z. Huang. Building a Decision Cluster Classification Model for High Dimensional Data by a Variable Weighting k-Means Method. AI 2008: Advances in Artificial Intelligence, Springer, 2008, 5360:337-347.

[8] Salton G, Wong A, Yang C. S. A vector space model for automatic indexing. Communications of the ACM, Nov. 1975, 18(11):613-620.

[9] Quinlan, J. R. C4.5: Programs for Machine Learning. San Francisco: Morgan Kaufmann Publishers, 1993.

[10] Breiman, L., Friedman, J. H., Olshen, R. A., Stone, C. J. Classification and regression trees. Monterey, CA: Wadsworth & Brooks/Cole Advanced Books & Software, 1984.