
Fuzzy Information and Engineering

http://www.elsevier.com/locate/fiae


ORIGINAL ARTICLE

A Fuzzy Mutual Information-based Feature Selection Method for Classification

N. Hoque • H.A. Ahmed • D.K. Bhattacharyya • J.K. Kalita

Received: 1 April 2015 / Revised: 24 December 2015 / Accepted: 23 March 2016

Abstract In this paper, we present a feature selection method called Fuzzy Mutual Information-based Feature Selection with Non-Dominated solution (FMIFS-ND). It uses a fuzzy mutual information measure to select features based on feature-class fuzzy mutual information and feature-feature fuzzy mutual information. To evaluate the classification accuracy of the proposed method, a modification of the k-nearest neighbor (KNN) classifier is also presented, which classifies instances based on the distance or similarity between individual features. The performance of both methods is evaluated on multiple UCI datasets using four classifiers. We compare the accuracy of our feature selection method with existing feature selection methods and validate the accuracy of the proposed classifier against decision trees, random forests, naive Bayes, KNN and support vector machines (SVM). Experimental results show that the feature selection method gives high classification accuracy on most high-dimensional datasets and that the proposed classifier outperforms the traditional KNN classifier in accuracy.

Keywords Feature selection • Mutual information • Classification • Accuracy

© 2016 Fuzzy Information and Engineering Branch of the Operations Research Society of China. Hosting by Elsevier B.V. All rights reserved.

N. Hoque • H.A. Ahmed • D.K. Bhattacharyya (✉)

Department of Computer Science and Engineering, Tezpur University, Napaam, Sonitpur-784028, Assam, India

email: dkb@tezu.ernet.in

J.K. Kalita

Department of Computer Science, University of Colorado at Colorado Springs, USA

Peer review under responsibility of Fuzzy Information and Engineering Branch of the Operations Research Society of China.


This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

http://dx.doi.org/10.1016/j.fiae.2016.09.004


1. Introduction

Feature selection is widely used in many applications of machine learning and pattern recognition. It is also known as variable selection, attribute selection or variable subset selection, and its task is to select a subset of relevant features from a large feature space. The selected subset of features is used to construct a model for effective classification of objects. The objective of feature selection is to select the most important features that can classify objects with high accuracy and low computational cost. Therefore, the selected set should contain fewer features with high relevance but low redundancy among them. Feature selection also helps address the curse of dimensionality, in addition to providing improved model interpretability, reduced overfitting and improved predictive accuracy.

In many pattern recognition tasks, for example in bioinformatics, the dimensionality of objects is usually very large. As a result, both supervised and unsupervised classification not only consume a significant amount of time but also usually produce a high classification error rate. To overcome these problems, the dimensionality of objects is reduced using either of two techniques, viz., feature selection and feature extraction. In feature selection, only the relevant features are selected from a large feature space, whereas in feature extraction new features are created by combining and transforming the original feature set.

1.1. Feature Selection Approaches

Many feature selection approaches have been widely used to assist in classification of objects [1-5]. Feature selection methods themselves are classified into four categories based on their selection mechanisms, viz., filter, wrapper, embedded and hybrid. The filter approach selects a subset of features without using a learning algorithm. It is used with many real datasets where the number of features is high. The wrapper approach uses a learning algorithm to evaluate the accuracy produced by the use of the selected features in classification. Wrapper methods can give high classification accuracy for particular classifiers but generally have high computational complexity. An embedded approach [6] performs feature selection during the process of training and is specific to the applied learning algorithms. Finally, the hybrid [7] approach is a combination of both filter and wrapper methods. It combines the advantages of these two approaches.

1.2. Contribution

The main contribution of this paper is a fuzzy mutual information-based feature selection method for use in classifying objects. The method is validated using gene expression and UCI machine learning datasets. The classification accuracy obtained with the selected features is evaluated using four well-known classifiers, namely, decision trees, random forests, KNN and SVM. A modified KNN algorithm is also proposed, and its classification accuracy is evaluated on several benchmark datasets. We provide an experimental comparison of our feature selection method with several well-regarded feature selection methods in terms of classification accuracy, and we compare the accuracy of the proposed KNN classifier with those of decision trees, random forests, naive Bayes, KNN and SVM classifiers.

1.3. Motivation

Probability, or the entropy derived from it, is a primary measure of the information content of a random variable. Feature selection using mutual information uses entropy to estimate the uncertainty of a random variable or a feature. We calculate the mutual information of two variables X and Y from their joint probability distribution p(x, y) and their marginal probability distributions p(x) and p(y), as shown in Eq. (1).

$$I(X; Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}. \tag{1}$$

For computing the probability of a random variable of categorical type, this formula is intuitively justifiable. But for numeric types, because distances can be computed among the possible values of the random variable, effective probability computation using Eq. (1) is not straightforward. Let us consider two random variables X and Y, which are of numerical and categorical type, respectively. From the observations of the random variables shown in Table 1, we find that the probability P(1) is 1/10 and the probability P(5) is also 1/10 for variable X. This means the values 1 and 5 are equally probable. But if we consider distances among the observations, we can easily conclude that the value 1 or a value close to 1 is more likely to appear again, as most of the remaining observations are very close to 1 and far from the value 5. Though this problem can be addressed by discretization or binning, the process is not obvious. In the case of variable Y, no distance computation is involved and hence the above problem does not occur for categorical types. To analyze the probability of numeric random variables, we can use a fuzzy set to define a membership value for each instance of the variable. The proposed method uses a fuzzy entropy measure to calculate feature-feature and feature-class mutual information in order to select an optimal feature set from a large feature space.
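The counting argument above can be illustrated with a short Python sketch. It estimates probabilities by relative frequency, as a plug-in use of Eq. (1) would, for the ten observations of X in Table 1, and then counts how many observations lie near 1 and near 5. The function names and the tolerance of 0.2 used for "close to" are assumptions made only for this illustration; they are not part of the proposed method.

```python
import math
from collections import Counter


def empirical_probabilities(values):
    """Relative frequency of each distinct value."""
    n = len(values)
    return {v: c / n for v, c in Counter(values).items()}


def mutual_information(xs, ys):
    """Plug-in mutual information (natural log) from joint and marginal counts."""
    n = len(xs)
    count_x, count_y = Counter(xs), Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        p_xy = c / n
        mi += p_xy * math.log(p_xy / ((count_x[x] / n) * (count_y[y] / n)))
    return mi


def count_within(values, center, tol):
    """Number of observations whose distance to `center` is at most `tol`."""
    return sum(1 for v in values if abs(v - center) <= tol)


# Observations of X from Table 1.
X = [1, 1.1, 0.9, 5, 1.12, 5.1, 1.13, 0.89, 0.87, 1.15]

probs = empirical_probabilities(X)
near_one = count_within(X, 1, 0.2)    # hypothetical tolerance of 0.2
near_five = count_within(X, 5, 0.2)

print(probs[1], probs[5])   # both values have relative frequency 1/10
print(near_one, near_five)  # far more observations lie near 1 than near 5
```

Frequency counting treats 1 and 5 as equally probable, while the neighbourhood counts show that the data are concentrated around 1, which is the gap that fuzzy membership values are meant to close.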

To validate the usefulness of the feature selection method, we use the KNN classifier. KNN is a very simple but effective supervised learning algorithm used in many applications of data mining and machine learning. During classification of a test instance, the KNN algorithm simply determines the k nearest training objects using a distance or similarity measure. In distance or similarity computation, most KNN algorithms simply find the distance between objects considering the full set of features. However, if we consider the full set of features when using KNN, noise or unexpected values of a feature may significantly affect the computed similarity. The most widely used proximity measures, such as Euclidean distance, cosine distance and Manhattan distance, are very sensitive to the presence of noisy or unusual values. The problem caused by noisy or unusual values is further inflated by the fact that the KNN classifier typically computes proximity over the full set of features. So, instead of giving equal importance to the full set of features in distance computation, we give high priority to those features that match exactly and impose a penalty score on those features that do not match exactly. If a feature contains noise or unusual values, the penalty score will be very high, which will decrease the similarity between two objects.

Table 1: Distribution pattern of values for two random variables.

X   1   1.1   0.9   5   1.12   5.1   1.13   0.89   0.87   1.15
Y   a   a     b     a   a      a     b      b      a      b

In real-life datasets, imprecision or vagueness in data is a common problem for continuous random variables, and imprecision may affect a model during analysis of data behavior. So, fuzzy logic is used to analyze imprecise data using fuzzy sets that define a membership value for each interval pattern of a variable. If we apply Eq. (1) to calculate the mutual information for continuous random variables, in many situations feature-feature and feature-class mutual information may turn out to be 0. As a consequence, the classifier may produce a high misclassification rate. So, to overcome the problem, a fuzzy mutual information measure is used for effective classification with continuous features. Fuzzy mutual information is calculated based on a fuzzy set that defines the intervals of a continuous random variable. Using the fuzzy set, the membership value of a variable is computed with respect to its class labels.

The rest of the paper is organized as follows. In Section 2, we discuss the related work on fuzzy feature selection. Section 3 describes the proposed method with an example. The proposed KNN-ND classifier is presented in Section 4. Experimental results are presented and analyzed in Section 5. Finally, conclusion and future work are discussed in Section 6.

2. Related Work

In the literature, we find many statistical and soft computing based feature selection methods [8-13]. An efficient fuzzy classifier with feature selection based on fuzzy entropy was proposed by Lee et al. [14]. In that work, a fuzzy entropy based feature selection method is used to select a subset of relevant features that serve as input to a fuzzy classifier. Fuzzy entropy is used to evaluate the distribution pattern of a random variable, and the pattern space of a variable is partitioned into non-overlapping fuzzy decision regions to obtain smooth boundaries for effective classification. Due to the non-overlapping decision regions, both the computational complexity and the load on the classifier are reduced. Jensen et al. [15] proposed a new method for feature selection using fuzzy and rough sets to deal with imprecision and uncertainty. Based on the statistical properties of data, a simple fuzzification process is used to derive the fuzzy sets that assign membership values to the objects. Similarly, Qian et al. [16] introduced a hybrid approach for attribute reduction based on indiscernibility and discernibility relations. Bhatt et al. [17] used fuzzy rough set theory for feature selection based on natural properties of fuzzy t-norms and t-conorms.

A mutual information-based feature selection algorithm called MIFS was introduced by Battiti [18] to select a subset of features. This algorithm considers both feature-feature and feature-class mutual information for feature selection. It uses a greedy technique to select a feature that maximizes the information about the class label. Kwak and Choi [19] developed an algorithm called MIFS-U to overcome the limitations of MIFS and to obtain a better estimate of the mutual information between input features and output classes than MIFS. In another algorithm called mRMR, Peng et al. [20] used mutual information to minimize redundancy and to maximize relevance among features. This algorithm combines two wrapper schemes. In the first scheme, it selects a candidate feature set; in the second scheme, a compact feature subset is selected from the candidate set to maximize the classification accuracy. Amiri et al. [21] proposed a feature selection method called modified mutual information-based feature selection (MMIFS). The method uses mutual information to select a subset of features with maximum relevancy and minimum redundancy for intrusion detection systems. The method gives high classification accuracy on the KDD CUP 99 dataset. Moreover, a normalized mutual information based feature selection method was proposed by Estevez et al. [22]. The filter method they proposed, known as NMIFS, is a modification of Battiti's MIFS, MIFS-U and mRMR methods. The NMIFS method outperforms MIFS, MIFS-U and mRMR on several benchmark as well as synthetic datasets.

3. Proposed Method

For a given dataset D of d features $f_1, f_2, f_3, \ldots, f_d$ represented by F, the problem is to select a subset F' of k features (where k < d) for a given class $C_i$, so that (i) the features $f_i \in F'$ are most relevant for $C_i$, (ii) F' is optimal and (iii) the classification accuracy for F' is higher than that obtained by using other subsets of features. The proposed feature selection method uses fuzzy set theory and mutual information to select features using a greedy approach. Fuzzy mutual information [23] is defined as shown in Eq. (2).

$$FMI(X, Y) = H(X) + H(Y) - H(X, Y), \tag{2}$$

where X and Y are two fuzzy variables, H(X) and H(Y) are the fuzzy entropy values for X and Y, respectively, and H(X, Y) is the fuzzy joint entropy for X and Y. To select a feature, the method computes two values: (i) feature-class fuzzy mutual information and (ii) feature-feature fuzzy mutual information. Feature-class fuzzy mutual information is used to compute the correlation of a feature with respect to its class label and to select the feature that has the highest correlation score. A high score means that the feature can identify the class of an object well. On the other hand, feature-feature mutual information is computed to find how similar two features are. High mutual information indicates a large reduction in uncertainty; low mutual information indicates a small reduction; and zero mutual information between two random variables means the variables are independent. We select features whose feature-feature mutual information values are very low so that the redundancy among the features can be eliminated. Let $X = \{x_1, x_2, x_3, x_4, \ldots, x_n\}$ be a random variable with n elements, and let A and B be two fuzzy sets defined on X. The fuzzy membership value of the k-th feature for the i-th class, represented as $\mu_{ik}$, can be computed using the membership function given in Eq. (3), which is defined by Khushaba et al. [24]. In Eq. (3), m is the fuzzification coefficient, $\epsilon > 0$ is a small value to avoid singularity, $\sigma$ is the standard deviation involved in the distance computation, $\bar{x}_i$ denotes the mean of the data objects that belong to class i, and the radius of the data is $r = \max(\lVert x_k - \bar{x}_i \rVert_{\sigma})$. According to Ding et al. [25], fuzzy entropy and fuzzy joint entropy on X can be defined as shown in Eqs. (4) and (6), respectively. Fuzzy entropy:

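$$H(A) = -\frac{1}{n} \sum_{x \in X} \big[ \mu_A(x) \log \mu_A(x) + (1 - \mu_A(x)) \log (1 - \mu_A(x)) \big], \tag{4}$$

$$H(B) = -\frac{1}{n} \sum_{x \in X} \big[ \mu_B(x) \log \mu_B(x) + (1 - \mu_B(x)) \log (1 - \mu_B(x)) \big]. \tag{5}$$

Fuzzy joint entropy:

$$H(A \cup B) = -\frac{1}{n} \sum_{x \in X} \Big\{ [\mu_A(x) \vee \mu_B(x)] \log [\mu_A(x) \vee \mu_B(x)] + [1 - \mu_A(x) \vee \mu_B(x)] \log [1 - \mu_A(x) \vee \mu_B(x)] \Big\}. \tag{6}$$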

3.1. Example

Let $A(X) = \{0.2/x_1, 0.5/x_2, 0.7/x_3, 0.5/x_4, 0.1/x_5, 0.8/x_6\}$ and $B(X) = \{0.5/x_1, 0.3/x_2, 0/x_3, 0.5/x_4, 0.2/x_5, 0.8/x_6\}$. Now,

$$H(A) = -\tfrac{1}{6} \big[ (0.2 \log 0.2 + (1 - 0.2) \log(1 - 0.2)) + (0.5 \log 0.5 + (1 - 0.5) \log(1 - 0.5)) + (0.7 \log 0.7 + (1 - 0.7) \log(1 - 0.7)) + (0.5 \log 0.5 + (1 - 0.5) \log(1 - 0.5)) + (0.1 \log 0.1 + (1 - 0.1) \log(1 - 0.1)) + (0.8 \log 0.8 + (1 - 0.8) \log(1 - 0.8)) \big] = 0.5538.$$

Similarly, $H(B) = 0.4997$.

$$H(A \cup B) = -\tfrac{1}{6} \big[ (0.5 \log 0.5 + (1 - 0.5) \log(1 - 0.5)) + (0.5 \log 0.5 + (1 - 0.5) \log(1 - 0.5)) + (0.7 \log 0.7 + (1 - 0.7) \log(1 - 0.7)) + (0.5 \log 0.5 + (1 - 0.5) \log(1 - 0.5)) + (0.2 \log 0.2 + (1 - 0.2) \log(1 - 0.2)) + (0.8 \log 0.8 + (1 - 0.8) \log(1 - 0.8)) \big] = 0.6152,$$

$$FMI(A, B) = H(A) + H(B) - H(A \cup B) = 0.5538 + 0.4997 - 0.6152 = 0.4383.$$
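The arithmetic of this example can be checked with a minimal Python sketch of Eqs. (2) and (4) to (6). It assumes natural logarithms, the convention 0 log 0 = 0, and the maximum as the union operator; these assumptions reproduce the values 0.5538, 0.4997, 0.6152 and 0.4383 of the example. The function names are illustrative only.

```python
import math


def fuzzy_entropy(mu):
    """Fuzzy entropy of a list of membership values, as in Eqs. (4) and (5)."""
    def term(p):
        # Convention: 0 * log(0) = 0, so memberships of exactly 0 or 1 add nothing.
        return sum(q * math.log(q) for q in (p, 1.0 - p) if q > 0.0)
    return -sum(term(p) for p in mu) / len(mu)


def fuzzy_joint_entropy(mu_a, mu_b):
    """Fuzzy joint entropy of Eq. (6), taking the element-wise maximum as the union."""
    return fuzzy_entropy([max(a, b) for a, b in zip(mu_a, mu_b)])


def fuzzy_mutual_information(mu_a, mu_b):
    """FMI(A, B) = H(A) + H(B) - H(A u B), as in Eq. (2)."""
    return (fuzzy_entropy(mu_a) + fuzzy_entropy(mu_b)
            - fuzzy_joint_entropy(mu_a, mu_b))


# Membership values of the two fuzzy sets from the example.
A = [0.2, 0.5, 0.7, 0.5, 0.1, 0.8]
B = [0.5, 0.3, 0.0, 0.5, 0.2, 0.8]

h_a = fuzzy_entropy(A)
h_b = fuzzy_entropy(B)
h_ab = fuzzy_joint_entropy(A, B)
fmi = fuzzy_mutual_information(A, B)

print(round(h_a, 4), round(h_b, 4), round(h_ab, 4), round(fmi, 4))
# prints 0.5538 0.4997 0.6152 0.4383
```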

3.2. Our Feature Selection Method

We start by computing the feature-class fuzzy mutual information of every feature and select the feature that has the highest value. The feature is then put in the selected feature subset and removed from the original feature set. Next, for each of the non-selected features, we compute the feature-class fuzzy mutual information and the average feature-feature fuzzy mutual information with respect to the selected features. At this point, each non-selected feature has a feature-class fuzzy mutual information value and an average feature-feature fuzzy mutual information value. From these values, we want to select a feature that has the highest feature-class fuzzy mutual information but the lowest feature-feature fuzzy mutual information. This can be framed as an optimization problem. So, to select a feature that satisfies these two conditions, we use an optimization algorithm known as the non-dominated sorting genetic algorithm (NSGA-II) [26] to calculate the domination count CD and the dominated count FD for the feature-class and feature-feature values, respectively. The domination count of a feature is the number of features whose feature-class mutual information value is lower than its own, and the dominated count is the number of features whose average feature-feature mutual information value is lower than its own. We select the feature that has the maximum difference between domination count and dominated count. The selected feature is put into the selected feature subset and removed from the original feature set. This procedure continues until the required number of features is selected. The symbols and notations used to describe the algorithms are given in Table 2.

Table 2: Symbols used and their meanings.

Symbol                        Meaning
D                             Dataset
F                             Set of features
F'                            Selected subset of features
CD                            Domination count
FD                            Dominated count
A, B                          Two fuzzy sets defined on X
CMI                           Feature-class mutual information
FMI                           Feature-feature mutual information
$O_i, O_j$                    i-th and j-th objects
$f_k$                         k-th feature of an object
$\Lambda_{f_k}(O_i, O_j)$     Distance between objects $O_i$ and $O_j$ over the feature $f_k$
$\alpha(O_i, O_j)$            Similarity count between objects $O_i$ and $O_j$
$\beta(O_i, O_j)$             Penalty count between objects $O_i$ and $O_j$
$\omega(O_i, O_j)$            Penalty value between objects $O_i$ and $O_j$
$\theta(O_i, O_j)$            Penalty grade between objects $O_i$ and $O_j$

Let $F = \{f_1, f_2, f_3, f_4, f_5\}$ be a set of five features. First, we compute the feature-class fuzzy mutual information for every feature and select the feature that has the maximum fuzzy mutual information value. Let us assume that feature $f_3$ has the highest mutual information value. Hence, $f_3$ is removed from F and is put into the optimal feature set F', i.e., $F' = \{f_3\}$ and $F = F - \{f_3\}$. Next, for a feature $f_j \in F$, we compute the feature-feature fuzzy mutual information with every feature in F' and store the average as the average feature-feature fuzzy mutual information for $f_j$. This way, we compute the feature-feature fuzzy mutual information for $f_1$, $f_2$, $f_4$ and $f_5$. Again, we compute the feature-class fuzzy mutual information for $f_1$, $f_2$, $f_4$ and $f_5$. Consider the scenario shown in Table 3. Here, feature $f_2$ has the maximum difference between CD and FD, which is 3. Hence feature $f_2$ is selected.

In case of a tie in the values of (CD - FD), we pick the feature that has the maximum feature-class fuzzy mutual information. This procedure is continued until we get a subset with k features. The objective function to select a feature from m features can be defined as follows.

$$CD_i = \sum_{j=1,\, j \neq i}^{m} \phi(CMI_i, CMI_j), \qquad FD_i = \sum_{j=1,\, j \neq i}^{m} \phi(FMI_i, FMI_j),$$

$$\phi(a, b) = \begin{cases} 1, & \text{if } a > b, \\ 0, & \text{otherwise.} \end{cases}$$

Select the feature $f_i$ for which $CD_i - FD_i$ is maximum.

Table 3: Domination count of a feature.

Feature   Feature-class MI (CMI)   Average feature-feature MI (FMI)   CD   FD   CD-FD
$f_1$     0.54                     0.17                               2    1    1
$f_2$     0.76                     0.09                               3    0    3
$f_4$     0.17                     0.78                               0    3    -3
$f_5$     0.33                     0.27                               1    2    -1
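The counts in Table 3 can be reproduced with a small Python sketch. It assumes that CD counts the features with a lower feature-class MI and FD counts the features with a lower average feature-feature MI, which is the reading that matches every row of the table. The function names are illustrative.

```python
def domination_counts(cmi, ffmi):
    """Return (CD, FD) dictionaries for features keyed by name."""
    names = list(cmi)
    cd = {f: sum(1 for g in names if g != f and cmi[f] > cmi[g]) for f in names}
    fd = {f: sum(1 for g in names if g != f and ffmi[f] > ffmi[g]) for f in names}
    return cd, fd


def select_feature(cmi, ffmi):
    """Pick the feature with maximum CD - FD; ties go to the higher feature-class MI."""
    cd, fd = domination_counts(cmi, ffmi)
    return max(cmi, key=lambda f: (cd[f] - fd[f], cmi[f]))


# Values taken from Table 3.
cmi = {"f1": 0.54, "f2": 0.76, "f4": 0.17, "f5": 0.33}
ffmi = {"f1": 0.17, "f2": 0.09, "f4": 0.78, "f5": 0.27}

cd, fd = domination_counts(cmi, ffmi)
selected = select_feature(cmi, ffmi)
print(cd, fd, selected)
```

Feature f2 has CD = 3 and FD = 0, so it is the one selected, in agreement with the table.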

3.3. Fuzzy-MI Based Feature Selection (FMIFS) Algorithm

The proposed FMIFS-ND algorithm selects k relevant features using feature-feature and feature-class mutual information. The steps of the algorithm are shown in Algorithm 1.

3.4. Complexity Analysis of FMIFS-ND

The function compute FCFMI() takes O(d) time to compute the feature-class fuzzy mutual information, where d is the total number of features. The inner for loops of the proposed method take O(d × m) time to compute the feature-class and feature-feature mutual information for every non-selected feature with m already selected features (m < k). Finally, the optimization method takes O(d²) time to select the most relevant but non-redundant feature. Therefore, the overall complexity of the method is O(d) + O(m × d) + O(d²). If d ≫ m, the complexity of FMIFS-ND is O(d²); otherwise it is O(d² + d × m).

Next, we introduce an improved version of KNN classifier which exploits an instance-based learning method to compute dissimilarity between a pair of objects over each individual feature.

Algorithm 1: FMIFS-ND

Input: d, the number of features; dataset D; F, the set of features {f_1, f_2, ..., f_d}.

Output: F', an optimal subset of features.

Steps:

for i = 1 to d do
    Compute FMI(f_i, C)
end
Select the feature f_i with maximum FMI(f_i, C)
F' = F' ∪ {f_i}
F = F - {f_i}
count = 1
while count < k do
    for each feature f_j ∈ F do
        FFFMI = 0
        for each feature f_i ∈ F' do
            FFFMI = FFFMI + compute FFFMI(f_i, f_j)
        end
        AFFFMI = avg(FFFMI)
        FFCMI = compute FFCMI(f_j, C)
    end
    Select the next feature f_j that has minimum AFFFMI but maximum FFCMI
    F' = F' ∪ {f_j}
    F = F - {f_j}
    i = j
    count = count + 1
end
Return feature set F'
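The greedy loop of Algorithm 1 can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' MATLAB implementation: the two fuzzy mutual information measures are passed in as callables (`fc_mi` and `ff_mi`, both hypothetical names), the selection step uses the CD - FD rule of Section 3.2 with ties broken by feature-class MI, and the toy values for f3 (0.90) and for feature pairs not involving f3 (0.5) are made up for the demonstration.

```python
def fmifs_nd(features, k, fc_mi, ff_mi):
    """Greedily select k features following the steps of Algorithm 1.

    fc_mi(f)    -> feature-class fuzzy MI of feature f (supplied by the caller)
    ff_mi(f, g) -> feature-feature fuzzy MI of features f and g (supplied by the caller)
    """
    remaining = list(features)
    # Step 1: start with the feature of maximum feature-class MI.
    first = max(remaining, key=fc_mi)
    selected = [first]
    remaining.remove(first)
    while len(selected) < k and remaining:
        cmi = {f: fc_mi(f) for f in remaining}
        # Average feature-feature MI with the already selected features.
        affmi = {f: sum(ff_mi(f, s) for s in selected) / len(selected)
                 for f in remaining}
        # Domination count (relevance) and dominated count (redundancy).
        cd = {f: sum(cmi[f] > cmi[g] for g in remaining) for f in remaining}
        fd = {f: sum(affmi[f] > affmi[g] for g in remaining) for f in remaining}
        # Maximum CD - FD; ties are broken by the higher feature-class MI.
        best = max(remaining, key=lambda f: (cd[f] - fd[f], cmi[f]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy inputs: CMI and MI-with-f3 values follow Table 3; the rest is made up.
CMI = {"f1": 0.54, "f2": 0.76, "f3": 0.90, "f4": 0.17, "f5": 0.33}
FF_WITH_F3 = {"f1": 0.17, "f2": 0.09, "f4": 0.78, "f5": 0.27}


def toy_fc_mi(f):
    return CMI[f]


def toy_ff_mi(f, g):
    if g == "f3":
        return FF_WITH_F3[f]
    if f == "f3":
        return FF_WITH_F3[g]
    return 0.5  # hypothetical value for pairs that do not involve f3


chosen = fmifs_nd(list(CMI), 2, toy_fc_mi, toy_ff_mi)
print(chosen)  # ['f3', 'f2']
```

Passing the measures as callables keeps the selection logic independent of how the fuzzy memberships are obtained, so any fuzzy entropy estimator can be plugged in.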

4. KNN-ND Classifier

The following definitions help us discuss the proposed KNN-ND in a theoretical manner. For a test object $O_i$ that has n features (excluding the class label information), KNN-ND computes the nearest neighbors of $O_i$ using the following definitions.

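Definition 4.1 Distance between objects: Let $O_i$ and $O_j$ be two objects and $f_k$ be a feature of the two objects. The distance between the two objects over the feature $f_k$ is defined as

$$\Lambda_{f_k}(O_i, O_j) = \frac{\lvert O_i^{f_k} - O_j^{f_k} \rvert}{O_i^{f_k} + O_j^{f_k}}. \tag{8}$$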

Definition 4.2 Similarity count: The similarity count $\alpha(O_i, O_j)$ between two objects $O_i$ and $O_j$ is the total number of features for which $\Lambda_{f_k}(O_i, O_j) = 0$, where $1 \le k \le n$.

Definition 4.3 Penalty count: The penalty count $\beta(O_i, O_j)$ between two objects $O_i$ and $O_j$ is the total number of features for which $\Lambda_{f_k}(O_i, O_j) \neq 0$, where $1 \le k \le n$.

Definition 4.4 Penalty value: The penalty value $\omega(O_i, O_j)$ between two objects $O_i$ and $O_j$ over the dissimilar features is defined as

$$\omega(O_i, O_j) = \sum_{k} \Lambda_{f_k}(O_i, O_j) \quad \text{if } \Lambda_{f_k}(O_i, O_j) \neq 0, \text{ where } 1 \le k \le n. \tag{9}$$

Definition 4.5 Similarity grade: The similarity grade $\sigma$ between two objects $O_i$ and $O_j$ is defined as

$$\sigma(O_i, O_j) = \frac{\alpha(O_i, O_j)}{n}. \tag{10}$$

Definition 4.6 Penalty grade: The penalty grade $\theta$ between two objects $O_i$ and $O_j$ is defined as

$$\theta(O_i, O_j) = \frac{\omega(O_i, O_j)}{n}. \tag{11}$$

The proposed classification method is an instance-based learning method that computes the similarity between two objects over each individual feature using the distance function given in Definition 4.1. This method is similar to the traditional k-nearest neighbor classifier except for the following two properties.

Property 4.1 It computes distance/similarity between two objects based on the distance/similarity of each individual feature of an object.

Property 4.2 It also considers a penalty value for dissimilar features to measure distance/similarity between two objects.

The proposed method gives equal importance to each individual feature in measuring the distance/similarity between two objects. It counts the number of features $\alpha$ that are matched exactly (i.e., $\Lambda_{f_k}(O_i, O_j) = 0$) and the number of features $\beta$ that are not matched exactly (i.e., $\Lambda_{f_k}(O_i, O_j) \neq 0$). It also computes a penalty value $\omega(O_i, O_j)$ from those features that are not matched exactly. If the number of exactly matched features is high, it concludes a high degree of similarity between the two objects. On the other hand, if the number of dissimilar features between two objects is more than the number of similar features, it concludes a low degree of similarity between the two objects. In the proposed method, the degree of similarity between any two objects $O_i$ and $O_j$, represented as similarity score$(O_i, O_j)$, is computed as follows:

$$\text{similarity score} = \alpha(O_i, O_j) + (\sigma(O_i, O_j) - \theta(O_i, O_j)), \quad \text{when } \alpha > \beta, \tag{12}$$

$$\text{similarity score} = \alpha(O_i, O_j) - (\sigma(O_i, O_j) + \theta(O_i, O_j)), \quad \text{otherwise.} \tag{13}$$

The similarity score is used to find the objects most similar to a test object $O_i$ during its nearest neighbor computation. The similarity measure computes the distance between two objects considering each individual feature value instead of computing the distance based on the entirety of features together. In similarity computation, if the number of exactly matched feature pairs for two objects is greater than the number of dissimilar feature pairs, then the similarity value between the two objects is high. This way the method computes the similarity score between a test object $O_i \in D_{test}$ and a training object $O_j \in D_{train}$. Based on the similarity score, it finds the k training objects with the highest similarity scores and assigns to the test object $O_i$ the class label that occurs most often among these k objects.

Table 4: Similarity score values between a test object and the training objects.

Object   Label   f1    f2    f3    f4    f5    f6    α   β   ω        σ        θ        similarity score
O1       A       0.2   0.5   0.7   0.1   0.2   0.3   6   0   0        1        0        7
O2       A       0.2   0.5   0.7   0.1   0.3   0.2   4   2   0.4      0.6667   0.0667   4.6000
O3       B       0.6   0.5   0.7   0.8   0.9   0.1   2   4   2.436    0.333    0.4061   1.261
O4       C       0.1   0.2   0.3   0.4   0.5   0.6   0   6   2.5235   0        0.4206   -0.421
O5       A       0.2   0.5   0.1   0.3   0.2   0.3   4   2   1.25     0.6667   0.2083   4.4584
O6       B       0.2   0.5   0.7   0.8   0.3   0.2   3   3   1.1778   0.5000   0.1963   3.3037
O7       C       0.1   0.7   0.6   0.4   0.1   0.5   0   6   1.76     0        0.2934   -0.293
O8       C       0.9   0.1   0.3   0.5   0.6   0.7   0   6   3.2697   0        0.545    -0.545
O9       A       0.2   0.3   0.4   0.5   0.6   0.7   1   5   2.0894   0.1667   0.3482   1.1815
O10      B       0.1   0.9   0.7   0.6   0.5   0.3   2   4   1.7619   0.3333   0.2937   1.9604

4.1. Comparison of KNN-ND with Other Versions of KNN Classifier

The proposed KNN-ND classifier differs from other versions of the KNN classifier in the following points.

1) The proposed KNN-ND classifier computes a similarity score between any two instances based on individual feature weights whereas the simple KNN classifier computes similarity between any two objects without considering the weights of individual features.

2) Unlike Gou et al. [27], the proposed KNN-ND algorithm computes similarity based on the distance between individual features whereas [27] computes dual weights of k-nearest neighbors.

3) Like Bhattacharya et al. [28], our method computes distance or similarity between a test instance and a training instance using a new distance measure.

4.2. Example

Let us consider a test object $O_i = \{0.2, 0.5, 0.7, 0.1, 0.2, 0.3\}$ with six features and a training dataset with 10 objects. The proposed KNN-ND algorithm computes the similarity of the test object $O_i$ with the objects $O_j$, $\forall j = 1, 2, \ldots, 10$, as shown in Table 4. The similarity score between the test object $O_i$ and a training object $O_j$ is computed using the definitions given above.
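The quantities of Definitions 4.1 to 4.6 and the similarity score can be sketched in Python as follows. The sketch assumes that the per-feature distance is |a - b| / (a + b) with non-negative feature values, an assumption that reproduces the penalty values 0.4 and 1.25 listed in Table 4 for O2 and O5, and it follows Eqs. (12) and (13) with the strict condition α > β. The function names are illustrative, and the check below covers only the rows O1, O2 and O5.

```python
def feature_distance(a, b):
    """Per-feature distance, assumed to be |a - b| / (a + b) for non-negative values."""
    if a == b:
        return 0.0
    return abs(a - b) / (a + b)


def similarity_components(x, y):
    """Return (alpha, beta, omega, sigma, theta) for two equal-length objects."""
    n = len(x)
    dists = [feature_distance(a, b) for a, b in zip(x, y)]
    alpha = sum(1 for d in dists if d == 0.0)   # exactly matched features
    beta = n - alpha                            # features that do not match
    omega = sum(dists)                          # penalty value
    sigma = alpha / n                           # similarity grade
    theta = omega / n                           # penalty grade
    return alpha, beta, omega, sigma, theta


def similarity_score(x, y):
    """Similarity score following Eqs. (12) and (13)."""
    alpha, beta, _, sigma, theta = similarity_components(x, y)
    if alpha > beta:
        return alpha + (sigma - theta)
    return alpha - (sigma + theta)


test_obj = [0.2, 0.5, 0.7, 0.1, 0.2, 0.3]
o1 = [0.2, 0.5, 0.7, 0.1, 0.2, 0.3]
o2 = [0.2, 0.5, 0.7, 0.1, 0.3, 0.2]
o5 = [0.2, 0.5, 0.1, 0.3, 0.2, 0.3]

for name, obj in (("O1", o1), ("O2", o2), ("O5", o5)):
    print(name, similarity_components(test_obj, obj), similarity_score(test_obj, obj))
```

For these three rows the sketch agrees with the α, β and ω columns of Table 4, and the scores match the tabulated values up to rounding.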

4.3. Algorithm

We propose a modified KNN classifier called KNN-ND that computes similarity or distance between any two objects using individual features. The steps of the proposed method are shown in Algorithm 2.

Algorithm 2: KNN-ND classification algorithm

N ← number of instances; D_train ← set of training objects; D_test ← set of test objects

for a test object O_i ∈ D_test do
    for each object O_j ∈ D_train do
        compute α(O_i, O_j) using Definition 4.2
        compute β(O_i, O_j) using Definition 4.3
        if α(O_i, O_j) > β(O_i, O_j) then
            similarity score = α(O_i, O_j) + (σ(O_i, O_j) - θ(O_i, O_j))
        else
            similarity score = α(O_i, O_j) - (σ(O_i, O_j) + θ(O_i, O_j))
        end
    end
    Find the k objects with the highest similarity score
    Assign to O_i the majority class label of its k most similar objects
end
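A compact Python sketch of Algorithm 2 is given below, using the training objects of Table 4. As before, the per-feature distance |a - b| / (a + b) is an assumption inferred from the penalty values in Table 4, feature values are assumed non-negative, and the names `similarity_score` and `knn_nd_predict` are illustrative rather than taken from the authors' implementation.

```python
from collections import Counter


def similarity_score(x, y):
    """Similarity score of Eqs. (12) and (13) for two equal-length objects."""
    n = len(x)
    # Assumed per-feature distance: |a - b| / (a + b), 0 for an exact match.
    dists = [0.0 if a == b else abs(a - b) / (a + b) for a, b in zip(x, y)]
    alpha = sum(d == 0.0 for d in dists)      # similarity count
    beta = n - alpha                          # penalty count
    sigma, theta = alpha / n, sum(dists) / n  # similarity and penalty grades
    if alpha > beta:
        return alpha + (sigma - theta)
    return alpha - (sigma + theta)


def knn_nd_predict(test_obj, train, k):
    """Return the majority label among the k training objects most similar to test_obj.

    `train` is a list of (feature_vector, label) pairs.
    """
    ranked = sorted(train, key=lambda item: similarity_score(test_obj, item[0]),
                    reverse=True)
    top_labels = [label for _, label in ranked[:k]]
    return Counter(top_labels).most_common(1)[0][0]


# Training objects and labels from Table 4.
train = [
    ([0.2, 0.5, 0.7, 0.1, 0.2, 0.3], "A"),
    ([0.2, 0.5, 0.7, 0.1, 0.3, 0.2], "A"),
    ([0.6, 0.5, 0.7, 0.8, 0.9, 0.1], "B"),
    ([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], "C"),
    ([0.2, 0.5, 0.1, 0.3, 0.2, 0.3], "A"),
    ([0.2, 0.5, 0.7, 0.8, 0.3, 0.2], "B"),
    ([0.1, 0.7, 0.6, 0.4, 0.1, 0.5], "C"),
    ([0.9, 0.1, 0.3, 0.5, 0.6, 0.7], "C"),
    ([0.2, 0.3, 0.4, 0.5, 0.6, 0.7], "A"),
    ([0.1, 0.9, 0.7, 0.6, 0.5, 0.3], "B"),
]
test_obj = [0.2, 0.5, 0.7, 0.1, 0.2, 0.3]

predicted = knn_nd_predict(test_obj, train, k=3)
print(predicted)  # 'A'
```

With k = 3 the three most similar training objects are O1, O2 and O5, all labelled A, so the test object is assigned class A.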

4.4. Complexity Analysis of KNN-ND

The proposed KNN-ND algorithm takes O(N × k) time to find the k nearest neighbors of a test instance, where N is the total number of training instances. If d is the dimensionality of an instance, the overall complexity of the method is O(N × k × d), where the value of k is smaller than both N and d.

4.5. Performance Measures

To validate the classification accuracy of our proposed algorithm, we use four performance analysis metrics, viz., accuracy (ACC), balanced accuracy (BAC), posterior distribution of the balanced accuracy (PDBAC) and kappa coefficient. Each of these metrics is explained here briefly. Accuracy of a classifier is defined as: Accuracy = (TP + TN)/(TP + TN + FP + FN), where TP = true positive, TN = true negative, FP = false positive and FN = false negative. Balanced accuracy [29] for multiclass classification is defined as $bac = \frac{1}{l} \sum_{i=1}^{l} \frac{k_i}{n_i}$, where $k_i$ is the number of correctly predicted samples for class i, l is the total number of classes and $n_i$ is the number of samples for class i. In the Bayesian framework, inference on classification performance is based on the prior distribution and the posterior distribution, represented as p(A) and p(A|D), respectively [29]. It uses probabilities to express uncertainty about classification performance before and after observing actual classification outcomes. The kappa coefficient [30] is represented as $K = (p_o - p_e)/(1 - p_e)$, where $p_e$ and $p_o$ represent expected agreement and observed agreement, respectively. This value is standardized to lie on a -1 to 1 scale.
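Three of these metrics can be computed directly from a confusion matrix, as the following Python sketch shows. It assumes that rows hold the true classes and columns the predicted classes, and that the expected agreement in the kappa coefficient is obtained from the row and column totals (the usual Cohen form). The PDBAC is not covered by this sketch, and the 2 × 2 matrix is a made-up example.

```python
def accuracy(cm):
    """Fraction of correctly classified samples (diagonal over total)."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def balanced_accuracy(cm):
    """Mean over classes of k_i / n_i (per-class recall)."""
    recalls = [cm[i][i] / sum(cm[i]) for i in range(len(cm))]
    return sum(recalls) / len(recalls)


def kappa(cm):
    """Kappa coefficient K = (p_o - p_e) / (1 - p_e)."""
    total = sum(sum(row) for row in cm)
    p_o = accuracy(cm)
    row_totals = [sum(row) for row in cm]
    col_totals = [sum(row[j] for row in cm) for j in range(len(cm))]
    # Expected agreement from the marginal totals of true and predicted labels.
    p_e = sum(r * c for r, c in zip(row_totals, col_totals)) / (total * total)
    return (p_o - p_e) / (1 - p_e)


# Hypothetical confusion matrix: rows are true classes, columns are predictions.
cm = [[40, 10],
      [5, 45]]

acc = accuracy(cm)
bac = balanced_accuracy(cm)
kap = kappa(cm)
print(acc, bac, kap)
```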

5. Experimental Results

The proposed FMIFS-ND feature selection method was implemented in MATLAB 2008. We carried out the experiments on a workstation with 12 GB main memory, a 2.26 GHz Intel(R) Xeon processor and the 64-bit Windows 7 operating system. We also use a freely available toolbox called Weka, in which many feature selection algorithms are available. The proposed feature selection method is evaluated on six UCI datasets and three gene expression datasets. To evaluate the performance of both proposed methods, we apply 10-fold cross validation. The classification accuracy on the subsets of features selected by the proposed FMIFS-ND approach is plotted on graphs.

Fuzzy Inf. Eng. (2016) 8: 355-384

5.1. Performance Evaluation of FMIFS-ND

The performance of our feature selection method is evaluated on gene expression and UCI datasets. In both categories of datasets, our method shows satisfactory performance. In these categories, the performance of our method is compared with existing feature selection methods, viz., chi square, gain ratio, info gain, ReliefF, maxRel (maximum relevance) and mRMR (minimal-redundancy-maximal-relevance). The maxRel feature selection method selects a subset of features based on the mutual information between each individual feature and the class. An analysis of classification accuracy for the proposed FMIFS-ND method is shown in Figs. 1 and 12 on UCI and gene expression datasets, respectively.

5.1.1. On UCI Dataset

The accuracy of our proposed method on UCI datasets is shown in Figs. 1 to 3. As shown in Figs. 1(a) to 1(d), the proposed method gives high classification accuracy on the balance scale dataset for decision trees, random forests, KNN and SVM classifiers. The classification accuracy of the proposed feature selection method on the balance scale dataset is equivalent to that of the other two competing mutual information based feature selection methods, viz., mRMR and maxRel. The proposed feature selection method gives classification accuracy equivalent to the competing feature selection methods on the diabetes dataset, as shown in Figs. 1(e) to 1(h). For the acute dataset, the proposed FMIFS-ND method gives better classification accuracy than mRMR and maxRel using decision trees, random forests, SVM and KNN classifiers, as shown in Fig. 2. However, classifiers other than SVM and KNN give slightly lower classification accuracy for the proposed feature selection method than for the chi square, gain ratio, info gain and ReliefF methods on the acute dataset. As shown in Fig. 2, decision trees, random forests and the KNN classifier give very high classification accuracy on the iris dataset, but the accuracy of SVM is less than 80%. Moreover, compared to mRMR and maxRel, our method gives better accuracy on all the selected features except feature number 2. The proposed method gives low classification accuracy compared to other competing feature selection methods on the sonar dataset, as shown in Figs. 3(a) to 3(c). However, the method yields high accuracy on the sonar dataset using the SVM classifier, as shown in Fig. 3(d). As shown in Figs. 3(e) to 3(h), the FMIFS-ND feature selection method gives better classification accuracy than mRMR and maxRel using all the competing classifiers except SVM on the liver dataset; however, ReliefF outperforms the proposed method.

Fig. 1 Result analysis of FMIFS-ND on balance scale and diabetes datasets

Fig. 2 Result analysis of FMIFS-ND on acute and iris datasets

[Panels: (a) decision tree, (b) random forest, (c) KNN and (d) SVM accuracy on the acute dataset; (e) decision tree, (f) random forest and (h) SVM accuracy on the iris dataset. Legend: ChiSquare, GainRatio, InfoGain, ReliefF, mRMR, Proposed Method.]

Fig. 3 Result analysis of FMIFS-ND on liver and sonar datasets

Fig. 4 PDBAC for statlog heart using different classifiers

[Recoverable panel: PDBAC for KNN (chance = 50.0%, mean = 72.1%, upper limit of the 95% CI = 0.79).]

Fig. 5 PDBAC for breast cancer using different classifiers

[Panels: decision tree, KNN-ND, KNN, naive Bayes, SVM and one panel whose title is illegible; chance level 50.0%. Legible annotations: KNN, mean 91.5%; untitled panel, 95% CI [0.92, 0.97], mean 94.8%; SVM, mean 96.1%.]

Fig. 6 PDBAC for hayes roth using different classifiers

[Recoverable panel title: PDBAC for KNN.]

Fig. 7 PDBAC for iris dataset using different classifiers

[Panels: decision tree, KNN-ND, KNN, naive Bayes, random forest, SVM; chance level 33.3%. Legible annotations: KNN, mean 93.7%; naive Bayes, 95% CI [0.86, 0.97], mean 92.6%; SVM, mean 71.2%.]

Fig. 8 PDBAC for colon cancer dataset using different classifiers

[Recoverable panel title: PDBAC for KNN.]

Fig. 9 PDBAC for balance scale dataset using different classifiers

[Panels: decision tree, KNN-ND, random forest, KNN, naive Bayes, SVM; chance level 33.3%. Legible annotations: decision tree, mean 62.4%; random forest, 95% CI [0.54, 0.61]; KNN, 95% CI [0.60, 0.68], mean 63.3%; naive Bayes, mean 71.0%.]

Fig. 10 PDBAC for wine dataset using different classifiers

[Panels: decision tree, KNN, KNN-ND, random forest; chance level 33.3%. Legible annotations: KNN-ND, mean 91.9%; random forest, mean 90.1%.]

Fig. 11 PDBAC for liver dataset using different classifiers

[Panels: decision tree, KNN, KNN-ND, naive Bayes, random forest, SVM; chance level 50.0%. Legible annotations: KNN, 95% CI [0.59, 0.72], mean 65.0%; random forest, 95% CI [0.56, 0.70], mean 61.9%; SVM, 95% CI [0.60, 0.73], mean 66.6%.]

5.1.2. On Gene Expression Datasets

For the breast cancer dataset, FMIFS-ND gives better classification accuracy than ReliefF with decision trees, random forests and KNN, and with SVMs its accuracy is higher than that of all the compared feature selection methods, as shown in Figs. 12(a) to 12(d). The method outperforms the five compared feature selection methods in terms of classification accuracy on the colon cancer dataset, as shown in Figs. 12(e) to 12(h).

Fig. 12 Result analysis of FMIFS-ND on gene expression datasets

[Panels: (a) decision tree, (b) random forest and (c) KNN accuracy on breast cancer; (e) decision tree, (f) random forest and (h) SVM accuracy on colon cancer. Legend: ChiSquare, GainRatio, InfoGain, ReliefF, mRMR, Proposed Method.]

5.2. Performance Evaluation of KNN-ND

In this section, we evaluate the performance of KNN-ND on gene expression and UCI datasets.

5.2.1. On Gene Expression Datasets

The proposed classifier shows high classification accuracy on gene expression datasets. In particular, on the breast cancer dataset, the proposed KNN-ND classifier gives better classification accuracy than KNN, decision trees and naive Bayes. On the colon cancer dataset, the classifier gives the highest classification accuracy among the compared classifiers, as shown in Table 6. Similarly, on the Leukemia dataset, the classifier gives better classification accuracy than decision trees, random forests, naive Bayes and KNN, whereas its accuracy is slightly lower than that of SVM.

5.2.2. On UCI Datasets

From the experimental results, we observe that the proposed KNN-ND classifier gives high classification accuracy on the wine dataset. It gives better accuracy than the decision tree, random forest and KNN classifiers. However, its accuracy is slightly lower than that of naive Bayes and SVMs. For the liver dataset, the classification accuracy of KNN-ND is equivalent to that of KNN, and it outperforms naive Bayes. On the balance scale dataset, the classification accuracy of KNN-ND is lower than that of KNN, random forests and SVMs, whereas it outperforms decision trees and the naive Bayes classifier. On the Statlog heart dataset, the classification accuracy of KNN-ND is better than that of KNN but worse than that of the other classifiers. Similarly, the proposed classifier gives better accuracy than KNN, SVMs and naive Bayes on the Hayes Roth dataset. On the iris dataset, KNN-ND outperforms all other classifiers as shown in Table 5. On the ionosphere dataset, the proposed KNN-ND classifier gives better classification accuracy than KNN, SVMs, naive Bayes and decision trees. However, the classification accuracy of the proposed algorithm is worse than that of all other classifiers on the cloud dataset. Finally, on the acute dataset, KNN-ND gives better accuracy than KNN only. We evaluate the PDBAC for various UCI datasets as shown in Figs. 4 to 11, and we observe that the PDBAC value of KNN-ND is better than that of KNN for most of the UCI datasets.
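As an illustration of how a PDBAC summary such as the mean and 95% interval annotated in the figures can be obtained, the sketch below follows the idea of Brodersen et al. [29] for a two-class problem: each class-wise accuracy receives a Beta posterior, and the balanced accuracy is the average of the two. The flat Beta(1, 1) priors, the Monte Carlo approximation, the function name and the confusion-matrix counts are assumptions made for this example; the multi-class treatment used in our experiments is not reproduced here.

```python
import random

def pdbac_binary(tp, fn, tn, fp, n_samples=20000, seed=0):
    """Monte Carlo summary of the posterior of the balanced accuracy.

    Assumes a binary problem and flat Beta(1, 1) priors on the two
    class-wise accuracies, which are averaged sample by sample.
    """
    rng = random.Random(seed)
    draws = sorted(
        0.5 * (rng.betavariate(tp + 1, fn + 1) + rng.betavariate(tn + 1, fp + 1))
        for _ in range(n_samples)
    )
    mean = sum(draws) / n_samples
    lower = draws[int(0.025 * n_samples)]       # 2.5% empirical quantile
    upper = draws[int(0.975 * n_samples) - 1]   # 97.5% empirical quantile
    return mean, (lower, upper)

# Hypothetical confusion-matrix counts, not taken from the paper.
mean, (lower, upper) = pdbac_binary(tp=45, fn=5, tn=30, fp=20)
print(f"posterior mean = {mean:.3f}, 95% interval = [{lower:.3f}, {upper:.3f}]")
```

If the lower limit of the interval lies above the chance level (50% for two classes), the classifier can be regarded as performing better than chance on both classes taken together.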

6. Conclusion

We propose a feature selection method called FMIFS-ND that uses fuzzy mutual information, and we evaluate its classification accuracy on different datasets. The selected feature subset contains the features which have high relevance to their respective class labels. The performance of the method is evaluated on UCI and gene expression datasets. We use 10-fold cross validation to determine the accuracy of selected features using decision tree, random forest, KNN and SVM classifiers. From experimental analysis, we find that the proposed method gives high classification accuracy. In addition, we validate the classification accuracy of the KNN-ND classifier using many UCI datasets and three gene expression datasets. Experimental results show comparatively good classification results on all the datasets. As future work, we are planning to design a DDoS attack detection method [31, 32] incorporating an incremental fuzzy feature selection technique for classification of DDoS attack traffic.
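The 10-fold cross validation protocol mentioned above can be sketched as follows. This is only an illustrative outline under stated assumptions: the folds are random rather than stratified, a plain 1-nearest-neighbour rule stands in for the classifiers used in the experiments, and the helper names and synthetic data are hypothetical.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Shuffle range(n) and split it into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def one_nn_predict(train_x, train_y, x):
    """Predict with a plain 1-nearest-neighbour rule (squared Euclidean distance)."""
    dists = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in train_x]
    return train_y[dists.index(min(dists))]

def cross_val_accuracy(X, y, selected, k=10, seed=0):
    """Mean and standard deviation of fold accuracies using only `selected` columns."""
    Xs = [[row[j] for j in selected] for row in X]
    accs = []
    for fold in k_fold_indices(len(Xs), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(Xs)) if i not in held_out]
        train_x = [Xs[i] for i in train]
        train_y = [y[i] for i in train]
        hits = sum(one_nn_predict(train_x, train_y, Xs[i]) == y[i] for i in fold)
        accs.append(hits / len(fold))
    mean = sum(accs) / k
    sd = (sum((a - mean) ** 2 for a in accs) / (k - 1)) ** 0.5
    return mean, sd

# Synthetic two-class data (assumed): column 0 is informative, column 1 is noise.
rng = random.Random(1)
y = [i % 2 for i in range(100)]
X = [[label + rng.gauss(0, 0.1), rng.gauss(0, 1)] for label in y]
acc_mean, acc_sd = cross_val_accuracy(X, y, selected=[0])
print(f"accuracy = {acc_mean:.3f} +/- {acc_sd:.3f}")
```

The mean and standard deviation over folds correspond to the Acc and SD columns reported in Tables 5 and 6, although the exact fold construction used in the experiments may differ from this sketch.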

Table 5: Performance analysis on UCI datasets.

| Data set | Classifier | Acc | SD | BAC | PDBAC | Kappa |
|---|---|---|---|---|---|---|
| Wine | Random forest | 0.7766 | 0.0231 | 0.9551 | 0.892 | 0.8260 |
| | Naive Bayes | 0.9547 | 0.0414 | 0.9615 | 0.932 | 0.9485 |
| | SVM | 0.985 | 0.021 | 0.9457 | 0.938 | 0.9478 |
| | KNN | 0.7612 | 0.0846 | 0.6786 | 0.656 | 0.5114 |
| | KNN-ND | 0.9482 | 0.0474 | 0.9363 | 0.929 | 0.9302 |
| Liver | Decision tree | 0.6335 | 0.0586 | 0.6239 | 0.683 | 0.3839 |
| | Random forest | 0.7348 | 0.0427 | 0.6829 | 0.629 | 0.2657 |
| | Naive Bayes | 0.5759 | 0.0626 | 0.5804 | 0.582 | 0.1609 |
| | SVM | 0.698 | 0.069 | 0.6658 | 0.666 | 0.3283 |
| | KNN | 0.6268 | 0.0652 | 0.6519 | 0.656 | 0.3482 |
| | KNN-ND | 0.6250 | 0.0609 | 0.6151 | 0.548 | 0.3026 |
| Balance scale | Decision tree | 0.7877 | 0.0253 | 0.5635 | 0.624 | 0.7343 |
| | Random forest | 0.8445 | 0.0252 | 0.6093 | 0.578 | 0.6108 |
| | Naive Bayes | 0.7406 | 0.0439 | 0.7200 | 0.72 | 0.5574 |
| | SVM | 0.871 | 0.033 | 0.6345 | 0.6052 | 0.7194 |
| | KNN | 0.8397 | 0.0173 | 0.6145 | 0.633 | 0.7243 |
| | KNN-ND | 0.8194 | 0.0350 | 0.5588 | 0.553 | 0.5453 |
| Statlog heart | Decision tree | 0.7515 | 0.0356 | 0.7167 | 0.820 | 0.6645 |
| | Random forest | 0.8241 | 0.0314 | 0.8233 | 0.721 | 0.4535 |
| | Naive Bayes | 0.8311 | 0.0601 | 0.8400 | 0.842 | 0.7091 |
| | SVM | 0.814 | 0.05 | 0.8417 | 0.837 | 0.6972 |
| | KNN | 0.6574 | 0.0378 | 0.6533 | 0.659 | 0.3324 |
| | KNN-ND | 0.7248 | 0.065 | 0.7533 | 0.752 | 0.5246 |
| Hayes roth | Decision tree | 0.8015 | 0.0502 | 0.8824 | 0.798 | 0.7090 |
| | Random forest | 0.8285 | 0.0498 | 0.7987 | 0.857 | 0.8109 |
| | Naive Bayes | 0.5592 | 0.0692 | 0.6588 | 0.666 | 0.4731 |
| | SVM | 0.465 | 0.103 | 0.591 | 0.504 | 0.283 |
| | KNN | 0.4100 | 0.0698 | 0.6013 | 0.537 | 0.2787 |
| | KNN-ND | 0.6938 | 0.0837 | 0.5111 | 0.478 | 0.2155 |
| Iris | Decision tree | 0.9480 | 0.0227 | 0.9600 | 0.913 | 0.9189 |
| | Random forest | 0.9520 | 0.0301 | 0.9333 | 0.937 | 0.9595 |
| | Naive Bayes | 0.9580 | 0.0269 | 0.9600 | 0.926 | 0.9400 |
| | SVM | 0.766 | 0.77 | 0.7200 | 0.712 | 0.5952 |
| | KNN | 0.9553 | 0.0189 | 0.9333 | 0.937 | 0.9589 |
| | KNN-ND | 0.9573 | 0.0417 | 0.8667 | 0.872 | 0.8542 |
| Ionosphere | Decision tree | 0.8777 | 0.0453 | 0.8489 | 0.888 | 0.7996 |
| | Random forest | 0.9343 | 0.0207 | 0.8929 | 0.846 | 0.7108 |
| | Naive Bayes | 0.8529 | 0.0221 | 0.8189 | 0.867 | 0.7988 |
| | SVM | 0.879 | 0.039 | 0.7876 | 0.818 | 0.6249 |
| | KNN | 0.8366 | 0.0298 | 0.7689 | 0.784 | 0.6304 |
| | KNN-ND | 0.8826 | 0.0387 | 0.5040 | 0.506 | 0.7631 |
| Cloud | Decision tree | 0.9999 | 0.0001 | 0.9951 | 0.999 | 1.0000 |
| | Random forest | 0.9998 | 0.0001 | 0.9951 | 0.999 | 0.9980 |
| | Naive Bayes | 0.9995 | 0.0008 | 0.9951 | 0.999 | 1.0000 |
| | SVM | 1.0 | 0 | 1 | 0.999 | 1.0000 |
| | KNN | 0.9995 | 0.0002 | 0.9951 | 0.999 | 0.9978 |
| | KNN-ND | 0.9871 | 0.0023 | 0.9951 | 0.9972 | 0.9893 |
| Acute | Decision tree | 1 | 0 | 0.9833 | 0.969 | 1.0000 |
| | Random forest | 1 | 0 | 0.9833 | 0.969 | 1.0000 |
| | Naive Bayes | 0.9825 | 0.0217 | 0.9833 | 0.925 | 1.0000 |
| | SVM | 1 | 0 | | 0.969 | 1.0000 |
| | KNN | 0.7858 | 0.0411 | 0.7341 | 0.523 | 0.5264 |
| | KNN-ND | 0.8394 | 0.0015 | 0.9862 | 0.809 | 0.6629 |

Acc: accuracy, SD: standard deviation, BAC: balanced accuracy, PDBAC: posterior distribution of the balanced accuracy, Kappa: kappa coefficient
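For reference, the accuracy, balanced accuracy and kappa columns can all be derived from a confusion matrix. The sketch below uses the standard definitions (balanced accuracy as the mean of the per-class recalls, Cohen's kappa as agreement corrected for chance [30]); the function name and the example counts are hypothetical and do not correspond to any row of the tables.

```python
def classification_metrics(confusion):
    """Accuracy, balanced accuracy and Cohen's kappa from a square confusion matrix.

    confusion[i][j] counts instances of true class i predicted as class j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n_classes))
    accuracy = correct / total

    # Balanced accuracy: mean of the per-class recalls.
    recalls = [confusion[i][i] / sum(confusion[i]) for i in range(n_classes)]
    balanced_accuracy = sum(recalls) / n_classes

    # Cohen's kappa: agreement beyond what the marginals give by chance.
    col_sums = [sum(row[j] for row in confusion) for j in range(n_classes)]
    expected = sum(sum(confusion[i]) * col_sums[i] for i in range(n_classes)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected)
    return accuracy, balanced_accuracy, kappa

# Hypothetical imbalanced two-class result (not from the paper's experiments).
acc, bac, kappa = classification_metrics([[90, 10], [5, 15]])
print(round(acc, 4), round(bac, 4), round(kappa, 4))
```

On imbalanced data the three measures can diverge noticeably, which is why a classifier may rank differently under Acc than under BAC or Kappa.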

Table 6: Performance analysis on gene expression datasets.

| Data set | Classifier | Acc ± SD | BAC | PDBAC | Kappa |
|---|---|---|---|---|---|
| Breast cancer | Decision tree | 0.9280±0.0181 | 0.9510 | 0.939 | 0.9006 |
| | Random forest | 0.9618±0.0127 | 0.9416 | 0.948 | 0.9091 |
| | Naive Bayes | 0.9398±0.0152 | 0.9111 | 0.909 | 0.8456 |
| | SVM | 0.974±0.015 | 0.9633 | 0.961 | 0.9392 |
| | KNN | 0.9184±0.0159 | 0.9143 | 0.915 | 0.8539 |
| | KNN-ND | 0.9546±0.01 | 0.9161 | 0.914 | 0.8605 |
| Colon | Decision tree | 0.7400±0.0622 | 0.6023 | 0.636 | 0.3077 |
| | Random forest | 0.7742±0.0645 | 0.6273 | 0.631 | 0.2965 |
| | Naive Bayes | 0.6700±0.1478 | 0.6095 | 0.421 | |
| | SVM | 0.7419±0.0323 | 0.7432 | 0.762 | 0.6092 |
| | KNN | 0.7883±0.0511 | 0.6869 | 0.705 | 0.5000 |
| | KNN-ND | 0.7983±0.1343 | 0.6957 | 0.601 | 0.2468 |
| Leukemia | Decision tree | 0.5375±0.0630 | 0.4553 | 0.547 | 0.0948 |
| | Random forest | 0.61429±0.1416 | 0.5268 | 0.474 | 0.0570 |
| | Naive Bayes | 0.6333±0.0472 | 0.5004 | 0.519 | 0.0296 |
| | SVM | 0.6571±0.0514 | 0.6255 | 0.634 | 0.3110 |
| | KNN | 0.5431±0.0676 | 0.5030 | 0.523 | 0.0354 |
| | KNN-ND | 0.6528±0.1409 | 0.4894 | 0.516 | 0.3267 |

Acc: accuracy, SD: standard deviation, BAC: balanced accuracy, PDBAC: posterior distribution of the balanced accuracy, Kappa: kappa coefficient

Acknowledgments

This work is supported by the Ministry of Human Resource Development under the FAST proposal scheme and by UGC, Government of India, under SAP Level-II. The authors are thankful to both the funding agencies.

References

[1] D.K. Bhattacharyya, J.K. Kalita, Network Anomaly Detection: A Machine Learning Perspective, CRC Press, 2013.

[2] N. Hoque, D. Bhattacharyya, J. Kalita, MIFS-ND: A mutual information-based feature selection method, Expert Systems with Applications 41(14) (2014) 6371-6385.

[3] F. Min, Q. Hu, W. Zhu, Feature selection with test cost constraint, International Journal of Approximate Reasoning 55(1) (2014) 167-179.

[4] S. Tabakhi, P. Moradi, F. Akhlaghian, An unsupervised feature selection algorithm based on ant colony optimization, Engineering Applications of Artificial Intelligence 32 (2014) 112-123.

[5] Z. Hu, Y. Bao, T. Xiong, R. Chiong, Hybrid filter-wrapper feature selection for short-term load forecasting, Engineering Applications of Artificial Intelligence 40 (2015) 17-27.

[6] I. Guyon, A. Elisseeff, An introduction to variable and feature selection, The Journal of Machine Learning Research 3 (2003) 1157-1182.

[7] E.P. Xing, M.I. Jordan, R.M. Karp, et al., Feature selection for high-dimensional genomic microarray data, in: ICML, Vol. 1, 2001, pp.601-608.

[8] S.M. Vieira, L.F. Mendonça, G.J. Farinha, J. Sousa, Modified binary PSO for feature selection using SVM applied to mortality prediction of septic patients, Applied Soft Computing 13(8) (2013) 3494-3504.

[9] X. Wang, J. Yang, X. Teng, W. Xia, R. Jensen, Feature selection based on rough sets and particle swarm optimization, Pattern Recognition Letters 28(4) (2007) 459-471.

[10] Y. Chen, D. Miao, R. Wang, K. Wu, A rough set approach to feature selection based on power set tree, Knowledge-Based Systems 24(2) (2011) 275-281.

[11] D. Tian, X.j. Zeng, J. Keane, Core-generating approximate minimum entropy discretization for rough set feature selection in pattern classification, International Journal of Approximate Reasoning 52(6) (2011) 863-880.

[12] T. Chakraborti, A. Chatterjee, A novel binary adaptive weight GSA based feature selection for face recognition using local gradient patterns, modified census transform, and local binary patterns, Engineering Applications of Artificial Intelligence 33 (2014) 80-90.

[13] M. Han, W. Ren, X. Liu, Joint mutual information-based input variable selection for multivariate time series modeling, Engineering Applications of Artificial Intelligence 37 (2015) 250-257.

[14] H.M. Lee, C.M. Chen, J.M. Chen, Y.L. Jou, An efficient fuzzy classifier with feature selection based on fuzzy entropy, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 31(3) (2001) 426-432.

[15] R. Jensen, Q. Shen, New approaches to fuzzy-rough feature selection, IEEE Transactions on Fuzzy Systems 17(4) (2009) 824-838.

[16] J. Qian, D. Miao, Z. Zhang, W. Li, Hybrid approaches to attribute reduction based on indiscernibility and discernibility relation, International Journal of Approximate Reasoning 52(2) (2011) 212-230.

[17] R.B. Bhatt, M. Gopal, On fuzzy-rough sets approach to feature selection, Pattern Recognition Letters 26(7) (2005) 965-975.

[18] R. Battiti, Using mutual information for selecting features in supervised neural net learning, IEEE Transactions on Neural Networks 5(4) (1994) 537-550.

[19] N. Kwak, C.H. Choi, Input feature selection for classification problems, IEEE Transactions on Neural Networks 13(1) (2002) 143-159.

[20] H. Peng, F. Long, C. Ding, Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy, IEEE Transactions on Pattern Analysis and Machine Intelligence 27(8) (2005) 1226-1238.

[21] F. Amiri, M.R. Yousefi, C. Lucas, A. Shakery, N. Yazdani, Mutual information-based feature selection for intrusion detection systems, Journal of Network and Computer Applications 34(4) (2011) 1184-1199.

[22] P. Estevez, M. Tesmer, C. Perez, J. M. Zurada, et al., Normalized mutual information feature selection, IEEE Transactions on Neural Networks 20(2) (2009) 189-201.

[23] E. Hancer, B. Xue, M. Zhang, D. Karaboga, B. Akay, A multi-objective artificial bee colony approach to feature selection using fuzzy mutual information, 2015 IEEE Congress on Evolutionary Computation (CEC), IEEE, 2015, pp.2420-2427.

[24] R.N. Khushaba, S. Kodagoda, S. Lal, G. Dissanayake, Driver drowsiness classification using fuzzy wavelet-packet-based feature-extraction algorithm, IEEE Transactions on Biomedical Engineering 58(1) (2011) 121-131.

[25] S. Ding, Z. Shi, S. Xia, F. Jin, Studies on fuzzy information measures, Fourth International Conference on Fuzzy Systems and Knowledge Discovery, Vol. 3, IEEE, 2007, pp.376-380.

[26] K. Deb, S. Agrawal, A. Pratap, T. Meyarivan, A fast elitist non-dominated sorting genetic algorithm for multi-objective optimization: NSGA-II, Lecture Notes in Computer Science 1917 (2000) 849-858.

[27] J. Gou, L. Du, Y. Zhang, T. Xiong, A new distance-weighted k-nearest neighbor classifier, Journal of Information & Computational Science 9(6) (2012) 429-436.

[28] G. Bhattacharya, K. Ghosh, A.S. Chowdhury, An affinity-based new local distance function and similarity measure for knn algorithm, Pattern Recognition Letters 33(3) (2012) 356-363.

[29] K.H. Brodersen, C.S. Ong, K.E. Stephan, J.M. Buhmann, The balanced accuracy and its posterior distribution, 20th International Conference on Pattern Recognition (ICPR), IEEE, 2010, pp.3121-3124.

[30] A.J. Viera, J.M. Garrett, Understanding interobserver agreement: The kappa statistic, Family Medicine 37(5) (2005) 360-363.

[31] N. Hoque, D.K. Bhattacharyya, J.K. Kalita, FFSc: A novel measure for low-rate and high-rate DDoS attack detection using multivariate data analysis, Security and Communication Networks 9(13) (2016) 2032-2041.

[32] N. Hoque, D.K. Bhattacharyya, J.K. Kalita, Botnet in DDoS attacks: Trends and challenges, IEEE Communications Surveys and Tutorials 17(4) (2015) 2242-2270.