
Towards an Enhanced Arabic Text Classification Using Cosine Similarity and Latent Semantic Indexing

Fawaz S. Al-Anzi and Dia AbuZeina Department of Computer Engineering, Kuwait University fawaz.alanzi@ku.edu.kw ; abuzeina@ku.edu.kw


Abstract: Cosine similarity is one of the most popular distance measures in text classification problems. In this paper, we used this important measure to investigate the performance of Arabic language text classification. For textual features, the vector space model (VSM) is generally used to represent textual information as numerical vectors. However, Latent Semantic Indexing (LSI) is a better textual representation technique, as it maintains the semantic information between words. Hence, we used the singular value decomposition (SVD) method to extract textual features based on LSI. In our experiments, we conducted a comparison between some of the well-known classification methods, such as Naïve Bayes, k-Nearest Neighbors, Neural Network, Random Forest, Support Vector Machine, and classification tree. We used a corpus that contains 4,000 documents of ten topics (400 documents per topic). The corpus contains 2,127,197 words with about 139,168 unique words. The testing set contains 400 documents, 40 documents per topic. As a weighting scheme, we used Term Frequency.Inverse Document Frequency (TF.IDF). This study reveals that the classification methods that use LSI features significantly outperform the TF.IDF-based methods. It also reveals that k-Nearest Neighbors (based on the cosine measure) and support vector machine are the best performing classifiers.

Keywords: Arabic Text, Classification, Supervised Learning, Cosine Similarity, Latent Semantic Indexing.

1. Introduction

Recently, text classification (TC) for the Arabic language has been widely investigated. Manning and Schütze (1999) defined text classification as the task of classifying texts into one of a pre-specified set of classes based on their contents. According to Sebastiani (2002), text classification is the activity of labelling natural language texts with thematic categories from a predefined set. In the big data environment, researchers have worked hard to address the text classification problem in this huge information era. With the massive growth of text search transactions, effective algorithms are needed to satisfy retrieval-time and relevance constraints. In today's market, achieving user satisfaction within this astronomical growth of online data is becoming very appealing to business investment. Search engines, e.g., Google and other high-traffic query-processing portals, are expected to meet and satisfy today's user demands.

Supervised machine learning (ML) approaches are widely used for text classification. The most popular machine learning algorithms include Naïve Bayes (NB), k-Nearest Neighbor (k-NN), Support Vector Machines (SVM), Neural Networks (NN), Classification Trees (CT), Logistic Regression (LR), Random Forest (RF), and Maximum Entropy (ME). In addition, similarity or distance measures are used for text classification as well as serving as the basis for some classifiers. For example, the k-NN algorithm uses a similarity function such as the Euclidean distance or cosine similarity to find neighbors, Torunoglu et al. (2011).

In text classification problems, large feature sets are a challenge that should be handled for better performance. Therefore, utilizing feature reduction techniques is important for an efficient representation of textual features. Harrag et al. (2010) presented a number of dimensionality reduction techniques such as root-based stemming, light stemming, and singular value decomposition (SVD). In this work, we use SVD as a feature reduction technique as well as for producing semantically rich features. SVD is a linear algebra method that is used to truncate the term-document matrix produced by Latent Semantic Indexing (LSI), a well-known indexing and retrieval method. Even though the vector space model (VSM) is widely used for textual feature representation, it suffers from semantic loss, whereas LSI with SVD maintains the semantic information. Rosario (2000) showed that SVD can be used to estimate the structure in word usage across documents based on LSI, which captures the underlying structure in word choice. Kantardzic (2011) indicated that LSI gives better results when used in text classification, as it enables a better representation of a document's semantics.

This paper contains two parts. First, the LSI-SVD techniques were used to generate the textual features of a corpus that contains 4,000 documents. The generated features were then used along with the cosine similarity measure to classify the testing-set documents. Second, a number of classification methods were employed for comparison purposes. The classifiers include NN, NB, k-NN, SVM, RF, CT, LR, and CN2 (induction rules). In the implementation,

Gensim ("Gensim", 2016) and Orange tool ("Orange", 2016) were used. Gensim is a Python library for natural langue processing (NLP) while Orange is an open source machine-learning tool for data visualization and analysis.

In the next section, a literature review is presented. In section 3, we present the singular value decomposition, followed by the theoretical background of the cosine similarity in section 4. The experimental setup is presented in section 5, and the results are discussed in section 6. Finally, the conclusion and future work are presented in section 7.

2. Literature Review

The cosine similarity measure has been widely used in pattern recognition and text classification. For example, Nguyen et al. (2011) used the cosine similarity measure for face verification. In this work, the focus is on the cosine measure for linguistic applications. Among such applications, Silber et al. (2002) used the cosine measure for text summarization. El Gohary et al. (2013) used the cosine measure to detect emotions in Arabic language text. Takçı and Güngör (2012) indicated that cosine is the most commonly used similarity measure in the language identification problem. Sobh et al. (2006) used the cosine measure for Arabic language text summarization. Roberts et al. (2005) used cosine similarity for Arabic language concordance. Al-Kharashi et al. (1994) used the cosine measure for indexing and retrieval processes for Arabic bibliographic data. Lin et al. (1996) used the cosine measure to extract concept descriptors (terms or keywords) from a Chinese-English bibliographic database. Elberrichi et al. (2012) indicated that cosine similarity is the dominant measure in information retrieval (IR) and text classification.

For the Arabic text classification domain, various feature extraction and classification methods have been proposed in the literature, as shown in Table 1. In this table, TF.IDF is shorthand for Term Frequency Inverse Document Frequency, the well-known weighting scheme for text features. TF.IDF is a combination of two parts: TF, the frequency of the word in the document, and IDF, the inverse of the frequency of the word across all documents. ANSI is shorthand for the American National Standards Institute.
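For illustration, the following is a minimal sketch of this weighting scheme (not the code used in our experiments; the logarithm base and the toy documents are assumptions):

```python
import math
from collections import Counter

# A minimal sketch of TF.IDF weighting: TF is the word count inside a document,
# IDF down-weights words that occur in many documents (log base is an assumption).
def tfidf(docs):
    n = len(docs)
    df = Counter(word for doc in docs for word in set(doc))  # document frequencies
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return weighted

# Toy tokenized documents (hypothetical).
docs = [["economy", "market", "oil"], ["oil", "price", "market"], ["health", "clinic"]]
print(tfidf(docs)[0])  # 'economy' gets a higher weight than the more widespread 'oil'
```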

Even though LSI is a powerful feature representation for word semantics, the literature summarized in Table 1 shows that LSI has made very little contribution to Arabic text classification. Therefore, an effort was made to address this deficiency by utilizing semantic information for Arabic text classification. The cosine similarity measure was chosen for the classification process. The highlighted cells in Table 1 indicate that only two research works have the same scope as this research (LSI and cosine). However, the first work, i.e., Froud et al. (2013), was conducted for document

clustering, while our proposed research is for text classification. In addition, we used a larger data set containing 4,000 documents, while they used 278 documents. We also compared the results using eight well-known classifiers and explored the performance of LSI using a wide range of rank approximations. Regarding the other work, i.e., Harrag et al. (2010), they used NN for classification, while we used the cosine similarity measure.

Table 1. Summary of Arabic text features and classifiers

References | Features | Classifier
Syiam et al. (2006), Thabtah et al. (2008), Gharib et al. (2009), Hmeidi et al. (2008), Ababneh et al. (2014), Kanaan et al. (2009), Duwairi (2007), Zrigui et al. (2012), Moh'd Mesleh (2011) | TF.IDF | k-NN
Elberrichi et al. (2012) | ANSI | k-NN
Al-Shalabi et al. (2008) | N-gram | k-NN
Syiam et al. (2006), Gharib et al. (2009), Omar et al. (2013), Kanaan et al. (2009), Moh'd Mesleh (2011) | TF.IDF | Rocchio
Jbara (2010), Larkey et al. (2004), Al-Eid et al. (2010), Alghamdi et al. (2012), Ezzat et al. (2012), Al-Kabi and Al-Sinjilawi (2007), Erkan et al. (2004) | TF.IDF | Cosine
Froud et al. (2013) | LSI | Cosine
Gharib et al. (2009), Omar et al. (2013), Hmeidi et al. (2008), Al-Shargabi et al. (2011), Zrigui et al. (2012), Alsaleem (2011), Khorsheed and Al-Thubaity (2013), Hadni et al. (2013), Moh'd Mesleh (2011), Al-Shammari (2010), Harrag et al. (2009), Raheel et al. (2009) | TF.IDF | SVM
Al-Harbi et al. (2008) | Chi-Squared | SVM
Al-Kabi and Al-Sinjilawi (2007), Duwairi (2007) | TF.IDF | Dice distance
Gharib et al. (2009), Omar et al. (2013), Al-Shargabi et al. (2011), Al-Kabi and Al-Sinjilawi (2007), Kanaan et al. (2009), Duwairi (2007), Zrigui et al. (2012), Alsaleem (2011), Khorsheed and Al-Thubaity (2013), Hadni et al. (2013), Moh'd Mesleh (2011), Al-Shammari (2010), Harrag et al. (2009), Raheel et al. (2009) | TF.IDF | NB
Al-Shargabi et al. (2011), Khorsheed and Al-Thubaity (2013), Harrag et al. (2009), Raheel et al. (2009) | TF.IDF | CT
Al-Harbi et al. (2008) | Chi-Squared | CT
Harrag et al. (2009) | TF.IDF | ME
Harrag et al. (2010) | LSI | NN

As this research demonstrates a comparative study of the different text classification algorithms, we present a comparison between the supervised machine learning algorithms found in the literature. Table 2 shows that SVM outperforms most of the classification algorithms for Arabic language text classification. The information presented in Table 2 is arranged by the researchers, the classifiers used, the best-performing classifier, and the corpus size. However, the information provided in Table 2 is not judgemental, as we agree with Sebastiani (2002), who illustrated that comparisons are only reliable when they are based on experiments performed by the same author under carefully controlled conditions; that is, no learning algorithm is universally best for all problems and datasets.

Table 2. A Performance Comparison of Arabic Text Classification

Researchers | Classifiers | Best classifier | Corpus size (doc., cat.)
Al-Shargabi et al. (2011) | NB, SVM, and CT | SVM | 2,356, 6
Zrigui et al. (2012) | SVM, NB, and k-NN | SVM | 1,500, 9
Gharib et al. (2009) | k-NN, NB, and SVM | SVM | 1,132, 6
Alsaleem (2011) | NB and SVM | SVM | 5,121, 7
Khorsheed and Al-Thubaity (2013) | SVM, NB, and CT | SVM | 2 corpora: Islamic, Poems
Hadni et al. (2013) | NB and SVM | SVM | 415, 12
Moh'd Mesleh (2011) | k-NN, SVM, NB, Rocchio | SVM | 7,842, 10
Al-Shammari (2010) | NB and SVM | SVM | 2,966, 3
Hmeidi et al. (2008) | k-NN and SVM | SVM | 2,066, 2
Raheel et al. (2009) | NB, SVM, and CT | SVM | 6,825, 7
Harrag et al. (2009) | NB, CT, SVM, and ME | CT | 2 corpora: 350, 8; 280, 14
Al-Harbi et al. (2008) | SVM and CT | CT | Seven corpora
Al-Kabi and Al-Sinjilawi (2007) | Cosine, NB, and Euclidean | NB | 80, 12
Kanaan et al. (2009) | k-NN, NB, and Rocchio | NB | 1,445, 9
Duwairi (2007) | NB, k-NN, and Distance | NB | 1,000, 10

3. Singular Value Decomposition

In general, text classification features can be modeled using the VSM proposed by Salton and Buckley (1988). In the VSM, the document is represented as a "bag of words" vector in which each word corresponds to one independent dimension. Usually, the elements of the vector are weights indicating the importance of the words in the document. The weight can be a binary value indicating the presence or absence of the word. Other representations can also be employed, such as n-grams, keywords, or longer sentences, Zrigui et al. (2012). It is clear that the VSM representation yields huge feature vectors that should be carefully handled to avoid problems with hardware limitations, software capabilities, and computational time complexity.


In this work, we used the LSI and SVD methods that were developed to improve the accuracy and effectiveness of IR techniques. LSI focuses on the semantic meaning of words across a series of usage contexts, as opposed to using simple string-matching operations, Kantardzic (2011). LSI has been utilized for many natural language processing applications, such as search engines, Carpineto et al. (2009), and other domains such as digital image processing, Andrews and Patterson (1976). The goal of using the LSI and SVD decomposition technique is to find the relationships between the terms and documents. That is, LSI generates a term-by-document matrix that is mathematically decomposed to identify the semantic correlations between concepts and documents in unstructured text, with no (or minimal) loss of information.

SVD is based on a theorem from linear algebra which says that a rectangular m-by-n matrix A can be broken down into the product of three matrices: an orthogonal matrix U, a diagonal matrix S, and the transpose of an orthogonal matrix V. The theorem is usually presented as $A_{m \times n} = U_{m \times m} S_{m \times n} V^{T}_{n \times n}$. Figure 1 demonstrates the reduced-rank SVD. The bold k in the shaded regions of U, S, and $V^{T}$ represents the values retained in computing the rank-k approximation. Many free and commercial software packages are available for LSI. We initially used MATLAB for decomposing the term-by-document matrix; however, due to hardware limitations, we used Gensim, which is characterized by its efficient use of memory.

Figure 1. SVD Representation of the document-term(s) matrix
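For illustration, a minimal sketch of the rank-k truncated SVD shown in Figure 1 is given below (the toy term-by-document matrix is hypothetical, and Gensim performs this step in our actual experiments):

```python
import numpy as np

# A minimal sketch of rank-k truncated SVD on a term-by-document matrix.
def truncated_svd(A, k):
    # Thin SVD: A = U S V^T
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Keep only the k largest singular values and their singular vectors.
    return U[:, :k], np.diag(s[:k]), Vt[:k, :]

# Toy 5-term x 4-document matrix with hypothetical counts.
A = np.array([[2., 0., 1., 0.],
              [1., 1., 0., 0.],
              [0., 3., 0., 1.],
              [0., 0., 2., 2.],
              [1., 0., 0., 3.]])

U_k, S_k, Vt_k = truncated_svd(A, k=2)
A_k = U_k @ S_k @ Vt_k            # rank-2 approximation of A
lsi_features = (S_k @ Vt_k).T     # one k-dimensional LSI vector per document
```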

4. Cosine Similarity Measure

The objective of this work is to investigate the performance of the cosine measure as one of the most popular machine learning methods for Arabic language text classification. More precisely, we evaluated the performance of cosine similarity against NN, NB, k-NN, SVM, RF, CT, LR, and CN2 Rules. Theodoridis and Koutroumbas (2008) define the cosine similarity measure as $S_{\cos}(x, y) = \frac{x^{T} y}{\|x\| \, \|y\|}$, where $\|x\| = \sqrt{\sum_{i=1}^{l} x_i^2}$ and $\|y\| = \sqrt{\sum_{i=1}^{l} y_i^2}$ are the lengths of the vectors $x$ and $y$, respectively. Both $x$ and $y$ are $l$-dimensional vectors. Since the cosine measure is easy to interpret and simple to compute for sparse vectors, it is widely used in text mining and information retrieval, Dhillon et al. (2001). Cosine similarity can also be defined by the angle, or the cosine of the angle, between two vectors. This allows documents with the same composition but different totals to be treated identically, which makes it the most popular measure for text documents, Strehl et al. (2000).
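For illustration, a minimal sketch of this measure (the example vectors are hypothetical):

```python
import numpy as np

# Cosine similarity: the dot product of x and y divided by the product of their lengths.
def cosine_similarity(x, y):
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

x = np.array([1.0, 2.0, 0.0])
y = np.array([2.0, 4.0, 0.0])      # same composition, different totals
print(cosine_similarity(x, y))     # ~1.0: the two vectors point in the same direction
```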

5. The Experiments Setup

This section presents two subsections, the data set and the proposed method.

5.1 The Data set

We created a corpus that contains 4,000 documents belonging to 10 different categories. The corpus contains 2,127,197 words that include more than 139,168 unique words. We got the documents from Alqabas newspaper in Kuwait ("Alqabas", 2016). Table 3 shows the statistics of the corpus used.

Table 3. The Corpus and the Categories Distribution

# Category # Doc. # words # Unique words

1 Health 400 218,214 29,574

2 Economy 400 181,366 29,443

3 Crimes and courts 400 172,145 29,416

4 Education 400 259,127 37,515

5 Technology 400 209,319 36,103

6 Sports 400 168,934 29,568

7 Tourism 400 270,142 40,488

8 Islam and Sharia 400 242,943 45,843

9 Parliament 400 182,503 31,183

10 Political Affairs 400 222,504 37,649

Total 4,000 2,127,197 346,782*

* The total number of unique words in the entire corpus is 139,168.

The testing set contains 400 documents, 40 documents per category. Hence, the total number of documents in the prepared corpus is 4,400.

5.2 The proposed method

The proposed method is summarized using the following algorithm:

Step 1: For the entire corpus (4,000 documents for training and 400 documents for testing), a preprocessing step is performed to prepare the text for the classification process. Cleaning the text involves the following three steps (a sketch of this cleaning step is given after the list):

• The stoplist is declared to remove insignificant words. These words are common and have no discriminative meaning. The stoplist includes words that are found in almost all documents, such as the name of the newspaper, the source of the document, and the serial number of the document.

• The ignore characters are specified; they include { ' , ! , @ , # , £ , € , $ , % , ° , ^ , & , * , ( , ) , - , _ , + , = , » , « , { , } , [ , ] , | , \ , / , : , ; , 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 }. Any word that contains one or more of the listed ignore characters is edited to remove the character(s).

• For all documents in the corpus, the hamza-bearing forms of the letter alif (e.g., "أ" and "إ") are replaced by the bare alif "ا".
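The following is a rough sketch of this cleaning step (for illustration only; the stoplist entry, the regular expressions, and the exact normalization rules are assumptions, not our exact lists):

```python
import re

IGNORE_CHARS = r"[!@#£€$%°^&*()\-_+=»«{}\[\]|\\/:;'0-9]"   # approximate ignore list
STOPLIST = {"القبس"}   # hypothetical entry: the name of the newspaper

def clean(text):
    text = re.sub(IGNORE_CHARS, " ", text)    # remove the ignore characters
    text = re.sub("[أإ]", "ا", text)           # normalize alif forms (assumed rule)
    return [w for w in text.split() if w not in STOPLIST]

print(clean("القبس: ارتفاع أسعار النفط 5%"))   # tokenized, cleaned words
```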

Step 2: Using Python, the Gensim library is utilized to generate the TF.IDF features and the LSI features using the following steps (a Gensim sketch follows the list):

• Create the dictionary that contains all words.

• The documents are converted to vectors using the information in the dictionary and the documents' word counts.

• The vectors are weighted using TF.IDF. The number of features in each weighted vector is the same as the number of unique words in the documents after passing all filtering processes (i.e., the stoplist).

• The LSI vectors (features) are created using the TF.IDF vectors. The number of features in each LSI vector is the k that is used for the LSI transformation.
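A minimal Gensim sketch of this step follows (for illustration only; the two toy documents and the variable names are ours, not the experimental corpus):

```python
from gensim import corpora, models

# Toy tokenized training documents standing in for the output of Step 1.
train_texts = [["economy", "oil", "market"], ["health", "clinic", "doctor"]]

dictionary = corpora.Dictionary(train_texts)                # word <-> id mapping
bow_corpus = [dictionary.doc2bow(t) for t in train_texts]   # word-count vectors
tfidf = models.TfidfModel(bow_corpus)                       # TF.IDF weighting
tfidf_corpus = tfidf[bow_corpus]
# LSI features built on top of the TF.IDF vectors; num_topics is the rank k
# (k = 46 gave the best accuracy in our experiments; 2 is used for the toy corpus).
lsi = models.LsiModel(tfidf_corpus, id2word=dictionary, num_topics=2)
lsi_corpus = lsi[tfidf_corpus]
```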

Step 3: The performance is measured using the TF.IDF features and the LSI features generated in the previous step. The cosine similarity measure is used to perform the classification. The rank-k approximation (a suitable number of singular values) is selected; hence, different k values should be investigated to find the optimum performance. Bradford (2008) indicated that for real corpora, a target dimensionality (k) of 200-500 is recommended as a "golden standard". The classification process in this step is like a k-NN classifier that finds the similarity between the document to be tested and all training documents. The k-NN classifier finds the k nearest documents, but the cosine similarity classifier returns the label of the nearest document (i.e., k = 1). Gensim facilitates using cosine similarity with both TF.IDF and LSI features.
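Continuing the Gensim sketch above, the classification described in this step can be written as follows (an illustrative sketch, not our exact code; the labels are hypothetical):

```python
from gensim import similarities

# Cosine similarities between a query vector and every training document.
index = similarities.MatrixSimilarity(lsi_corpus, num_features=lsi.num_topics)

def classify(test_tokens, labels):
    vec = lsi[tfidf[dictionary.doc2bow(test_tokens)]]   # project the test document
    sims = index[vec]                                   # cosine similarity to all training docs
    return labels[int(sims.argmax())]                   # label of the nearest document (k = 1)

print(classify(["oil", "market"], ["Economy", "Health"]))   # expected: Economy
```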

Step 4: The Orange tool is used to measure the performance using the LSI features. The classification is performed using the following classifiers: {NN, NB, k-NN, SVM, RF, CT, LR, and CN2 Rules}.

Step 5: The performance is evaluated using the confusion matrix to find the performance metrics such as accuracy, precision, recall, and the F1 measure, where F1 = 2((precision*recall) / (precision+recall)). Sokolova and Lapalme (2009) provide a comprehensive review of these measures.
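As an illustration of this step, the following sketch computes the metrics from a confusion matrix (the two-class counts are hypothetical):

```python
import numpy as np

def metrics(C):
    # C[i, j] counts documents of true class i predicted as class j.
    C = np.asarray(C, dtype=float)
    tp = np.diag(C)
    precision = tp / C.sum(axis=0)          # per-class precision
    recall = tp / C.sum(axis=1)             # per-class recall
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = tp.sum() / C.sum()
    return accuracy, precision, recall, f1

print(metrics([[35, 5],
               [3, 37]]))   # accuracy 0.9, plus per-class precision/recall/F1
```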

Figure 2 shows the proposed method in visual form. The figure shows that the TF.IDF features are compared with the LSI features using the cosine measure. Then, the LSI features are compared using all of the presented classifiers, including the cosine measure.


Figure 2. The Framework of the Proposed Method

6. Experimental Results

In this section, the experimental results are presented. Before conducting any experiment, three parameters should be set: the rank-k approximation (for the LSI cases), the small-word threshold, and the word-frequency threshold. Regarding small words, it is possible to remove any word that is shorter than a certain character length. This option makes it possible to remove single-character tokens, which are considered noise and do not help in the classification process. In fact, ignoring small words removes many common words, such as those that Google Translate ("Google", 2016) renders as {to, in, on, of}. However, many other small words are key terms for some categories, such as the words meaning {art, soccer, blood, bank, oil}. Nevertheless, the experimental results show that discarding small words enhances the performance in some cases, as demonstrated in this section (Table 4).

Regarding the word frequency, it is the number of occurrences of each word in the entire corpus. For example, it is possible to select only the words that appear more than one time as feature entities; otherwise, the word is not selected. This is important for two reasons: SVD finds better correlations between words, and a word that appears only once provides no correlation; the other reason is to remove misspelled words, which usually appear once. The results are presented in two subsections; the first considers the cosine classifier using TF.IDF and LSI features, and the other compares the cosine classifier with the other classifiers.

6.1 Performance of TF.IDF and LSI features

In this experiment, we used cosine similarity to measure the classification performance using the TF.IDF and LSI features. The required parameters were set as follows. The word-frequency threshold was set to 1 (remove words that appear only once). The small-word threshold was set to 1 (remove words whose length is less than or equal to one character). For the rank-k approximation, a range was selected to find the best performance using different k values. The range starts at 10 and increases in steps of 2 (10, 12, 14, ..., 100). Figure 3 shows that the best accuracy was achieved at k = 46. This value (k = 46) was taken as the baseline and used when comparing the cosine classifier with the other listed classifiers in the next subsection.

Figure 3. Accuracy of Different Singular Values

At k = 46, the accuracy was 82.5% based on the LSI features. For the TF.IDF performance, the same parameters were used (i.e., the word-frequency threshold = 1 and the small-word threshold = 1; TF.IDF does not require a k value), and the accuracy of TF.IDF was 67.25%, as shown in Table 4.

Table 4. The Performance of the Cosine Classifier

Features | Accuracy | Rank-k | Word frequency | Small words
TF.IDF | 67.25% | Not needed | 1 | 1
LSI | 82.50% | 46 | 1 | 1

To investigate whether LSI significantly outperforms TF.IDF, the significance detection method proposed by Plotz (2005) was used. The confidence interval [ε_l, ε_u] has to be computed first. Figure 4 shows how to find the confidence interval. N is set to 400, the number of documents in the testing set. If the changed classification error rate is outside the confidence interval, the change can be interpreted as statistically significant; otherwise, it was most likely caused by chance. We used a 95% level of confidence. We used the error probability of the TF.IDF method, 32.75% (100% - 67.25%), as reported in Table 4. Since we used a 95% level of confidence, z equals 1.96 from the standard normal distribution; this can be interpreted as a 95% probability that a standard normal variable, z, will fall between -1.96 and 1.96.

Figure 4. Confidence Interval Calculation Formula

The confidence interval is found to be [32.75% - 4.42%, 32.75% + 4.74%] ≈ [28.33%, 37.49%]. Since the error probability of the LSI method is 17.5% (100% - 82.5%), which lies outside the confidence interval, we conclude that the LSI features significantly outperform the TF.IDF features.
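The reported bounds coincide with the Wilson score interval for a binomial proportion; the sketch below is our reconstruction of the Figure 4 calculation (not a copy of it) and reproduces the values above:

```python
import math

def confidence_interval(error_rate, n, z=1.96):
    # Wilson score interval for an error rate estimated from n test documents.
    center = error_rate + z * z / (2 * n)
    margin = z * math.sqrt(error_rate * (1 - error_rate) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (center - margin) / denom, (center + margin) / denom

low, high = confidence_interval(0.3275, 400)        # TF.IDF error rate, 400 test documents
print(round(low * 100, 2), round(high * 100, 2))    # 28.33 37.49
```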

To investigate the effect of removing small words, we performed experiments at k = 46 with the small-word threshold set to different values {2, 3, 4, 5, 6, 7, 8}, as indicated in Table 5. The first row in the table is the baseline setting for the small-word threshold. The word-frequency threshold was set to 1. The results are presented in Table 5.

Table 5. The Performance of Removing Small Words

Small words threshold # of removed words # of remaining words TF.IDF accuracy LSI accuracy at k=46

baseline =1 8,409 2,118,788 67.25% 82.50%

<=2 271,836 1,855,361 69.25% 83.00%

<=3 590,960 1,536,237 69.00% 80.75%

<=4 948,454 1,178,743 68.25% 83.50%

<=5 1,339,592 787,605 71.25% 83.50%

<=6 1,664,139 463,058 70.75% 80.75%

<=7 1,923,017 204,180 66.75% 73.75%

Table 5 shows that, in the case of TF.IDF, the performance was enhanced by removing small words of length up to less than or equal to 6 characters; beyond that, it starts decreasing, as many of the discriminative words are removed. In the case of LSI, there are cases in which the performance increased, such as when removing words of length less than or equal to 2, 4, and 5 characters. The information provided in Table 5 shows that a large number of small words can be discarded while obtaining better performance. In the case of less than or equal to five characters (<=5), 1,339,592 small words were removed with better performance. In fact, this is extremely important in the proposed method, as the time complexity is linear in the size of the training set, which scales poorly to large datasets. More research can be conducted to find the optimal performance when discarding such a large amount of noise.

6.2 Performance of LSI using Different Classifiers

In this section, the performance was compared between the cosine classifier and the other eight classifiers {NN, NB, k-NN, SVM, RF, CT, LR, and CN2 Rules}. The

LSI features were used in this evaluation. The performance of the cosine classifier was already obtained using the Gensim as indicated in the previous subsection. The Orange tool was used for the other classifiers. Figure 5 shows a snapshot of the Orange tool with the implemented classifiers.

Figure 5. A Snapshot of Orange Tool

In the experiments, we used accuracy as the performance metric. Since the testing data set has an equal number of documents per class (i.e., 40 for each class), the accuracy and the F1 measure have equal values. Therefore, accuracy is used, which is simply the ratio of correctly predicted documents to the total number of documents.

We performed the comparison using the following parameters: {word frequency = 1, small word length = 1, k = 46}. Table 6 shows that the SVM classifier outperforms all other classifiers, followed by the cosine measure.

Table 6. The Performance of the Classifiers

Classifier Accuracy

SVM 84.75%

Cosine 82.50%

LR 81.25%

k-NN 77.25%

NN 76.75%

RF 74.75%

NB 65.26%

CT 54.00%

CN2 47.25%

The confidence interval was calculated using the cosine accuracy and found to be [14.09%, 21.53%]. Table 7 shows that the SVM error rate is inside the confidence interval, which means that cosine, SVM, and LR have the same performance from a statistical point of view. Hence, the top three classifiers in Table 7 (SVM, cosine, and LR) were found to score the best performance.

Table 7. The Significance Test of the Classifiers

Classifier Accuracy Error Inside the confidence interval

SVM 84.75% 15.25% Yes

Cosine 82.50% 17.50% Yes - the reference

LR 81.25% 18.75% Yes

k-NN 77.25% 22.75% No

NN 76.75% 23.25% No

RF 74.75% 25.25% No

NB 65.26% 34.74% No

CT 54.00% 46.00% No

CN2 47.25% 52.75% No

The k-NN performance shown in Table 7 was measured using the Manhattan distance (with k = 10 neighbors). The Manhattan distance was found to achieve better performance than other distance measures such as the Euclidean, Hamming, and Maximal distances. However, as this research mainly focuses on the cosine similarity measure, we conducted further experiments on the performance of k-NN based on the cosine measure. Neither the Orange nor the Weka ("Weka", 2016) tool provides the option to implement k-NN with the cosine measure. Therefore, we used the RapidMiner ("RapidMiner", 2016) machine-learning tool, which was found to provide this option (i.e., k-NN with the cosine measure). The performance was then measured using different values of k in the k-NN classifier. Figure 6 shows the accuracies achieved using k-NN based on the cosine measure. The results were found to be better than those of other measures, such as the Manhattan distance, which was the best-performing measure in the Orange tool.

Figure 6. The Performance with Different k Values in k-NN

In Figure 6, the highest accuracy was 84.5% at k = 7. We also reinvestigated the performance of SVM using the RapidMiner tool and found it to be 84.5% (almost the same as the accuracy achieved using the Orange tool, 84.75%). Hence, our findings indicate that k-NN based on the cosine measure scored the same accuracy as SVM, a powerful classification method.
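For clarity, a minimal sketch of k-NN with the cosine measure (majority vote among the k most similar training documents) is given below; it mirrors the RapidMiner setting but is not RapidMiner code, and the vectors would be the LSI features:

```python
import numpy as np
from collections import Counter

def knn_cosine(test_vec, train_vecs, train_labels, k=7):
    X = np.asarray(train_vecs, dtype=float)
    t = np.asarray(test_vec, dtype=float)
    sims = (X @ t) / (np.linalg.norm(X, axis=1) * np.linalg.norm(t))   # cosine similarities
    nearest = np.argsort(sims)[::-1][:k]               # indices of the k most similar documents
    votes = Counter(train_labels[i] for i in nearest)  # majority vote among the neighbors
    return votes.most_common(1)[0][0]
```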

7. Conclusion and Future Work

This paper shows that the cosine similarity measure is a good option to consider for Arabic language text classification. It also provides an experimental comparison between eight text classification methods. The results show that the SVM and k-NN (cosine-measure-based) classifiers have almost the same performance.

As a future direction, we propose to investigate the performance of multi-level and multi-label Arabic text classification. We also propose investigating feature reduction methods, such as the small-word threshold proposed in this research, and weighting schemes.

Acknowledgements

This work was supported by the Kuwait Foundation for the Advancement of Sciences (KFAS), Research Grant Number P11418EO01, and Kuwait University Research Administration Research Project Number EO06/12.

References


Ababneh, J., Almomani, O., Hadi, W., El-Omari, N. K. T., & Al-Ibrahim, A. (2014). Vector space models to classify Arabic text. International Journal of Computer Trends and Technology (IJCTT), 7(4), 219-223.

Al-Eid, B., Al-Khalifa, R. S., & Al-Salman, A. S. (2010, July). Measuring the credibility of Arabic text content in Twitter. In Digital Information Management (ICDIM), 2010 Fifth International Conference on (pp. 285-291). IEEE.

Alghamdi, H. M., & Selamat, A. (2012, September). Topic detections in Arabic dark websites using improved vector space model. In Data Mining and Optimization (DMO), 2012 4th Conference on (pp. 6-12). IEEE.

Al-Harbi, S., Almuhareb, A., Al-Thubaity, A., Khorsheed, M. S., & Al-Rajeh, A. (2008). Automatic Arabic text classification.

Al-Kabi, M., & Al-Sinjilawi, S. (2007). A comparative study of the efficiency of different measures to classify Arabic text. University of Sharjah Journal of Pure and Applied Sciences, 4(2), 13-26.

Al-Kharashi, I. A., & Evens, M. W. (1994). Comparing words, stems, and roots as index terms in an Arabic information retrieval system. Journal of the American Society for Information Science, 45(8), 548-560.

Alqabas. (2016, January). Retrieved from http://www.alqabas.com.kw/Default.aspx

Alsaleem, S. (2011). Automated Arabic text categorization using SVM and NB. Int. Arab J. e-Technol., 2(2), 124-128.

Al-Shalabi, R., & Obeidat, R. (2008, March). Improving KNN Arabic text classification with n-grams based document indexing. In Proceedings of the Sixth International Conference on Informatics and Systems, Cairo, Egypt (pp. 108-112).

Al-Shammari, E. T. (2010, November). Improving Arabic document categorization: Introducing local stem. In Intelligent Systems Design and Applications (ISDA), 2010 10th International Conference on (pp. 385-390). IEEE.

Al-Shargabi, B., Al-Romimah, W., & Olayah, F. (2011, April). A comparative study for Arabic text classification algorithms based on stop words elimination. In Proceedings of the 2011 International Conference on Intelligent Semantic Web-Services and Applications (p. 11). ACM.

Andrews, H. C., & Patterson, C. L. (1976). Singular value decompositions and digital image processing. Acoustics, Speech and Signal Processing, IEEE Transactions on, 24(1), 26-53.

Bradford, R. B. (2008, October). An empirical study of required dimensionality for large-scale latent semantic indexing applications. In Proceedings of the 17th ACM Conference on Information and Knowledge Management (pp. 153-162). ACM.

Carpineto, C., Osinski, S., Romano, G., & Weiss, D. (2009). A survey of web clustering engines. ACM Computing Surveys (CSUR), 41(3), 17.

Dhillon, I. S., & Modha, D. S. (2001). Concept decompositions for large sparse text data using clustering. Machine learning, 42(1-2), 143-175.

Duwairi, R. M. (2007). Arabic Text Categorization. Int. Arab J. Inf. Technol.,4(2), 125-132.

El Gohary, A. F., Sultan, T. I., Hana, M. A., & El, M. M. (2013). A Computational Approach for Analyzing and Detecting Emotions in Arabic Text.International Journal of Engineering Research and Applications (IJERA), 3, 100-107.

Elberrichi, Z., & Abidi, K. (2012). Arabic text categorization: a comparative study of different representation modes. Int. Arab J. Inf. Technol., 9(5), 465-470.

Erkan, G., & Radev, D. R. (2004). LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research, 457-479.

Ezzat, H., Ezzat, S., El-Beltagy, S., & Ghanem, M. (2012, March). Topicanalyzer: A system for unsupervised multi-label arabic topic categorization. In Innovations in Information Technology (IIT), 2012 International Conference on (pp. 220-225). IEEE.

Froud, H., Lachkar, A., & Ouatik, S. A. (2013). Arabic text summarization based on latent semantic analysis to enhance Arabic documents clustering. arXiv preprint arXiv:1302.1612.

Gensim. (2016, January). Retrieved from https://radimrehurek.com/gensim/

Gharib, T. F., Habib, M. B., & Fayed, Z. T. (2009). Arabic Text Classification Using Support Vector Machines. IJ Comput. Appl., 16(4), 192-199.

Google Translate. (2016, January). Retrieved from https://translate.google.com/

Hadni, M., Ouatik, S. A., & Lachkar, A. (2013). Effective Arabic Stemmer Based Hybrid Approach for Arabic Text Categorization. International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol, 3.

Harrag, F., & Al-Qawasmah, E. (2010). Improving Arabic Text Categorization Using Neural Network with SVD. JDIM, 8(4), 233-239.

Harrag, F., El-Qawasmeh, E., & Pichappan, P. (2009, July). Improving Arabic text categorization using decision trees. In Networked Digital Technologies, 2009. NDT'09. First International Conference on (pp. 110-115). IEEE.

Hmeidi, I., Hawashin, B., & El-Qawasmeh, E. (2008). Performance of KNN and SVM classifiers on full word Arabic articles. Advanced Engineering Informatics,22(1), 106-111.

Jbara, K. (2010). Knowledge discovery in Al-Hadith using text classification algorithm. Journal of American Science, 6(11), 409-419.

Kanaan, G., Al-Shalabi, R., Ghwanmeh, S., & Al-Ma'adeed, H. (2009). A comparison of text-classification techniques applied to Arabic text. Journal of the American society for information science and technology, 60(9), 1836-1844.

Kantardzic, M. (2011). Data mining: concepts, models, methods, and algorithms. John Wiley & Sons.

Khorsheed, M. S., & Al-Thubaity, A. O. (2013). Comparative evaluation of text classification techniques using a large diverse Arabic dataset. Language resources and evaluation, 47(2), 513-538.

Larkey, L. S., Feng, F., Connell, M., & Lavrenko, V. (2004, July). Language-specific models in multilingual topic tracking. In Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 402-409). ACM.

Lin, C. H., & Chen, H. (1996). An automatic indexing and neural network approach to concept retrieval and classification of multilingual (Chinese-English) documents. Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, 26(1), 75-88.

Manning, C. D., & Schütze, H. (1999). Foundations of statistical natural language processing. MIT press.

Moh'd Mesleh, A. (2011). Feature sub-set selection metrics for Arabic text classification. Pattern Recognition Letters, 32(14), 1922-1929.

Nguyen, H. V., & Bai, L. (2011). Cosine similarity metric learning for face verification. In Computer Vision-ACCV 2010 (pp. 709-720). Springer Berlin Heidelberg.

Omar, N., Albared, M., Al-Shabi, A., & Al-Moslmi, T. (2013). Ensemble of Classification Algorithms for Subjectivity and Sentiment Analysis of Arabic

Customers' Reviews. International Journal of Advancements in Computing Technology, 14(5), 77-85.

Orange. (2016, January). Retrieved from http://orange.biolab.si/

Plotz, T. (2005). Advanced stochastic protein sequence analysis.

Raheel, S., Dichy, J., & Hassoun, M. (2009, November). The Automatic Categorization of Arabic Documents by Boosting Decision Trees. In Signal-Image Technology & Internet-Based Systems (SITIS), 2009 Fifth International Conference on (pp. 294-301). IEEE.

Roberts, A., Al-Sulaiti, L., & Atwell, E. (2005, July). aConCorde: Towards a proper concordance of Arabic. In Proceedings of the Corpus Linguistics 2005 Conference, University of Birmingham, UK.

RapidMiner. (2016, January). Retrieved from https://rapidminer.com/

Rosario, B. (2000). Latent semantic indexing: An overview. Techn. rep. INFOSYS, 240.

Salton, G., & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information processing & management, 24(5), 513-523.

Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys (CSUR), 34(1), 1-47.

Silber, H. G., & McCoy, K. F. (2002). Efficiently computed lexical chains as an intermediate representation for automatic text summarization. Computational Linguistics, 28(4), 487-496.

Sobh, I., Darwish, N., & Fayek, M. (2006). A trainable Arabic Bayesian extractive generic text summarizer. In Proceedings of the Sixth Conference on Language Engineering ESLEC (pp. 49-154).

Sokolova, M., & Lapalme, G. (2009). A systematic analysis of performance measures for classification tasks. Information Processing & Management,45(4), 427-437.

Strehl, A., Ghosh, J., & Mooney, R. (2000, July). Impact of similarity measures on web-page clustering. In Workshop on Artificial Intelligence for Web Search (AAAI 2000) (pp. 58-64).

Syiam, M. M., Fayed, Z. T., & Habib, M. B. (2006). An intelligent system for Arabic text categorization. International Journal of Intelligent Computing and Information Sciences, 6(1), 1-19.

Takçı, H., & Güngör, T. (2012). A high performance centroid-based classification approach for language identification. Pattern Recognition Letters, 33(16), 2077-2084.

Thabtah, F., Mahazah, M., & Hadi, W. (2008). VSMs with K-Nearest Neighbour to Categorise Arabic Text Data. In Proceedings of the International Conference on Machine Learning and Data Analysis 2008 (ICMLDA), Oct. 2008, San Francisco, CA, USA.

Theodoridis, S., & Koutroumbas, K. (2008). Pattern Recognition, Fourth Edition. Academic Press.

Torunoglu, D., Cakirman, E., Ganiz, M. C., Akyokuş, S., & Gurbuz, M. Z. (2011, June). Analysis of preprocessing methods on classification of Turkish texts. In Innovations in Intelligent Systems and Applications (INISTA), 2011 International Symposium on (pp. 112-117). IEEE.

Weka. (2016, January). Retrieved from http://www.cs.waikato.ac.nz/ml/weka/

Zrigui, M., Ayadi, R., Mars, M., & Maraoui, M. (2012). Arabic text classification framework based on latent dirichlet allocation. CIT. Journal of Computing and Information Technology, 20(2), 125-140.