Author's Accepted Manuscript

Incremental partial least squares analysis of big streaming data

Xue-Qiang Zeng, Guo-Zheng Li

PATTERN RECOGNITION

THE JOURNAL OF THE PATTERN RECOGNITION SOCIETY

www.elsevier.com/locate/pr

PII: S0031-3203(14)00224-6

DOI: http://dx.doi.org/10.1016/j.patcog.2014.05.022

Reference: PR5130

To appear in: Pattern Recognition

Received date: 22 January 2014 Revised date: 18 May 2014 Accepted date: 31 May 2014

Cite this article as: Xue-Qiang Zeng, Guo-Zheng Li, Incremental partial least squares analysis of big streaming data, Pattern Recognition, http://dx.doi.org/ 10.1016/j.patcog.2014.05.022


Incremental Partial Least Squares Analysis of Big Streaming Data

Xue-Qiang Zenga,b, Guo-Zheng Lic*

aComputer Center, Nanchang University, Nanchang 330031, China b Key Laboratory of Embedded System and Service Computing, Ministry of Education, Tongji University,

Shanghai 201804, China c Department of Control Science & Engineering, Tongji University, Shanghai 201804, China

Abstract

Incremental feature extraction is effective for facilitating the analysis of large-scale streaming data. However, most current incremental feature extraction methods are not suitable for processing streaming data with high feature dimensions because only a few methods have low time complexity, which is linear in both the number of samples and the number of features. In addition, feature extraction methods should improve the performance of subsequent classification. Therefore, incremental feature extraction methods need to be more efficient and effective. Partial least squares (PLS) is known to be an effective dimension reduction technique for classification. However, the application of PLS to streaming data is still an open problem. In this study, we propose a highly efficient and powerful dimension reduction algorithm called incremental PLS (IPLS), which comprises a two-stage extraction process. In the first stage, the PLS target function is adapted so it is incremental by updating the historical mean to extract the leading projection direction. In the second stage, the other projection directions are calculated based on the equivalence between the PLS vectors and the Krylov sequence. We compared the performance of IPLS with other state-of-the-art incremental feature extraction methods such as incremental principal components analysis, incremental maximum margin criterion, and incremental inter-class scatter using real streaming datasets. Our empirical results showed that IPLS performed better than the other methods in terms of its efficiency and further classification accuracy.

* Corresponding author at: Room 605, Dianxin College Building, Caoan Road 4800, China. Tel.: +86 21 6958 3706; fax: +86 21 6958 9241.

Email address: gzli@tongji.edu.cn (Guo-Zheng Li)

Preprint submitted to Pattern Recognition, June 5, 2014

Keywords: Feature Extraction, Incremental Learning, Large-scale Data, Partial Least Squares, Streaming Data

1. Introduction

The rapid development of large-scale data acquisition and storage techniques means that there are increasing requirements for mining streaming data [1, 2, 3, 4]. Streaming data differ from conventional data in the following ways: (1) the data are not all available at once because they arrive as continuous streams; (2) the order in which the data elements arrive cannot be controlled; (3) the data streams are potentially unending; (4) an element from a data stream cannot be retrieved repeatedly, i.e., each sample can be processed only once; and (5) the scale of the data is vast. These characteristics mean that the processing of streaming data is a challenging problem. Feature extraction has been one of the main techniques used to facilitate the processing of large-scale data. However, normal feature extraction methods cannot meet the requirements of streaming data because they require that all the samples are loaded into memory.

To address this problem, the most common approach is to modify feature extraction algorithms into incremental versions [5]. For example, the extension of traditional principal components analysis (PCA) into incremental PCA (IPCA) has been studied widely in the last few decades [6, 7]. Many types of IPCA algorithms have been proposed, where the main difference is the incremental representation of the covariance matrix. Most are based on singular value decomposition (SVD), which is time consuming because SVD updating is required when each single instance arrives. In addition to IPCA, other incremental methods have been proposed. For example, Pang et al. proposed an incremental linear discriminant analysis (ILDA) algorithm for online face classification where the scatter matrices are updated incrementally [8]. Hiraoka et al. designed a gradient descent incremental linear discriminant analysis (GDILDA) algorithm, which is based on the theory of neural networks [9]. Lu designed an incremental complete linear discriminant analysis (ICLDA) algorithm for face recognition, where incremental QR decomposition is used to obtain the orthonormal bases of the

range and the null spaces of the within-class scatter matrix [10]. Some nonlinear incremental feature extraction methods have also been reported. Law and Jain proposed an incremental nonlinear mapping algorithm by modifying the ISOMAP algorithm [11]. Guan et al. reported an online nonnegative matrix factorization (ONMF) with robust stochastic approximation [12].

All of these methods are suitable for application to small-scale streaming data, where the model is built incrementally as the training samples arrive and the feature dimension is not a problem. However, they cannot meet the requirements of large-scale streaming data applications because their computational complexity is relatively high with respect to the number of features.

In the last few years, a few highly efficient incremental feature extraction methods have been proposed. A well-known fast version of IPCA with a convergence proof is called candid covariance-free IPCA (CCIPCA), which is not based on SVD and does not need to reconstruct the covariance matrix at each iteration step [13]. Yan et al. [14] proposed an incremental maximum margin criterion (IMMC) algorithm based on the maximum margin criterion (MMC) [15], which computes the difference between the between-class and within-class scatter matrices incrementally. Based on previous work [14], Yan et al. designed an incremental inter-class scatter (IIS) algorithm that only optimizes the between-class distances [16, 17].

CCIPCA, IMMC, and IIS satisfy the requirements for processing streaming data because their time complexity is linear in the numbers of instances and features. However, there are few such highly efficient models, and more feature extraction methods are still needed because the performance of existing methods is not satisfactory for streaming data in terms of efficiency and further classification accuracy.

Partial least squares (PLS) is a wide class of methods for modeling the relations between sets of observed variables using latent variables [18]. PLS comprises regression tasks and dimension reduction techniques. As a feature extraction method, PLS is known to be effective for classification [19, 20, 21, 22, 23]. For example, Barker and Rayens demonstrated the superiority of PLS over PCA as a dimension reduction method in a formal statistical manner [24], and Liu and Rayens also illustrated its superiority using real datasets [25]. PLS has also been compared with some state-of-the-art dimension

reduction methods [26, 27]. All of these studies have demonstrated the outstanding performance of PLS. However, the application of PLS-based dimension reduction to streaming data is still an open problem. In this study, we propose an incremental partial least squares (IPLS) algorithm and compare the performance of our proposed method with some state-of-the-art incremental feature extraction methods using real streaming data.

It should be noted that Helland et al. [28] proposed an incremental PLS regression method called recursive partial least squares (RPLS). Furthermore, improved methods based on RPLS have been developed to solve the online regression problem [29, 30]. However, RPLS-based methods are not suitable for dimension reduction because the projection directions are not maintained during the incremental learning process.

This paper is organized as follows. Section 2 briefly introduces some related methods, such as IPCA, IMMC, IIS, and PLS. Our proposed method is explained in detail in Section 3. The experimental settings, empirical results, and discussion are presented in Section 4. Finally, our conclusions are given in Section 5.

2. Related Methods

At present, linear feature extraction approaches are used widely in real tasks such as document classification and face recognition. These methods aim to find a projection matrix that can efficiently project the data from the original high-dimensional feature space to a much lower-dimensional representation under a particular criterion. Different criteria will yield different subspace learning algorithms with different properties. PCA and linear discriminant analysis (LDA) are two of the most widely used linear subspace learning approaches. Recently, PLS analysis, which is an efficient and robust subspace learning approach, has also been applied to many real tasks, with excellent performance.

2.1. Principal Components Analysis

PCA is the most popular feature extraction method and it is an unsupervised technique that tends to find a set of orthonormal basis vectors that maximizes the variance

over all the data [31, 32, 33, 34]. Suppose that the data sample points x(1), x(2), ..., x(n) are p-dimensional centralized vectors. The goal of PCA is to find a subspace whose basis vectors correspond to the directions with the maximal variances. Let C = (1/n) Σ_{i=1}^{n} x(i) x(i)^T denote the covariance matrix of the sample data. We define the objective function as

J(W) = trace(W^T C W).   (1)

Then, PCA aims to maximize the objective function J(W) in a solution space H^{p×K} = {W ∈ R^{p×K} : W^T W = I}. It has been proved that the column vectors of W are the K leading eigenvectors of the covariance matrix C. PCA projects the original data into a K-dimensional (K ≪ p) subspace. The new low-dimensional feature vector is computed directly as W^T x. Although PCA is the optimal solution in terms of the minimum reconstruction error, it is not optimal for classification because no class information is used when computing the principal components.
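As a concrete illustration, the batch procedure above can be sketched in a few lines of NumPy (a hypothetical sketch, not code from this paper): center X, form the covariance matrix C, and take its K leading eigenvectors as W.

```python
import numpy as np

def pca_project(X, K):
    """Batch PCA: project rows of X onto the K leading principal directions."""
    Xc = X - X.mean(axis=0)                 # centralize the samples
    C = Xc.T @ Xc / len(X)                  # covariance matrix C
    vals, vecs = np.linalg.eigh(C)          # eigenvalues in ascending order
    W = vecs[:, ::-1][:, :K]                # K leading eigenvectors
    return Xc @ W, W                        # projected data W^T x and W

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
T, W = pca_project(X, 2)
```

The columns of W are orthonormal, and the variance of the first projected coordinate is the largest, as the objective J(W) requires.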

The computational cost of PCA is attributable mainly to SVD [31] processing, which has a time complexity of O(m^3), where m = min(p, n). Thus, it is difficult or even impossible to perform PCA on a large-scale dataset with high-dimensional representations.

PCA is a batch algorithm, thus it does not meet the requirements of many real-world problems. IPCA algorithms have attracted much attention in recent decades [6, 7, 13, 35]. Many SVD-based IPCA algorithms have been proposed, where the main difference lies in the incremental representation of the covariance matrix [6, 7]. However, these IPCA methods are not suitable for applications with huge streaming data because SVD-based updating is too time consuming when instances arrive at high speed [17].

A rapid and practical version of IPCA with a proof of convergence is called CCIPCA, which is not based on SVD and does not need to reconstruct the covariance matrix at each iteration step (thus it is covariance-free) [13]. CCIPCA was motivated by the concept of statistical efficiency (the estimate has the smallest variance given the observed data). It maintains the number of samples and directly updates the eigenvectors incrementally, which yields efficient estimates for some well-known distributions (e.g.,

Gaussian). The time complexity of CCIPCA is linear with the number of samples and the number of features, and it converges rapidly on high-dimensional data.

Like traditional PCA, CCIPCA requires that the mean value of the training samples is fixed or that a zero mean is an inherent feature of the dataset. In applications, CCIPCA usually adopts an approximate centralization process to meet the zero-mean requirement, where only the current sample is centered correctly, whereas the historical data are not. In most cases, where the inherent mean value of the training samples converges, the approximate centralization process works sufficiently well. The detailed derivation of CCIPCA is not given here due to space limits, but the full procedure is given in Algorithm 1. Note that x(n) denotes the nth training sample and x_1(n) is the corresponding centered sample.

2.2. Linear Discriminant Analysis

LDA is used to find a lower-dimensional space that separates samples from different classes. LDA aims to maximize the Fisher criterion, i.e., the objective function

J(W) = |W^T S_b W| / |W^T S_w W|,

S_b = Σ_i p_i (m_i − m)(m_i − m)^T,   (2)

S_w = Σ_i p_i E_{x∈c_i}{(x − m_i)(x − m_i)^T},

where S_b and S_w are called the interclass scatter matrix and the intraclass scatter matrix, respectively. E denotes the expectation and p_i = n_i/n is the prior probability that a sample belongs to class i. W is obtained by solving W* = arg max J(W) in the solution space H^{p×K} = {W ∈ R^{p×K} : W^T W = I}, i.e., the following generalized eigenvalue decomposition problem: S_b w = λ S_w w.

Some ILDA algorithms have been proposed that meet the requirements of streaming data [8, 9]. LDA and ILDA explicitly utilize the label information of the samples, which is suitable for classification problems. However, there are at most c − 1 nonzero eigenvalues, so the upper bound of K is c − 1, and at least p + c samples are required to ensure that S_w is not singular, which limits the applications of LDA and ILDA.

Algorithm 1 CCIPCA Algorithm

Input: x(n), n = 1, 2, ... // streaming data
       K // target dimension
       l // amnesic parameter
Output: Projection matrix W

x̄(1) = x(1), u_1(1) = x(1);
for n = 2, 3, ... do
    x̄(n) = ((n − 1)/n) x̄(n − 1) + (1/n) x(n);
    x_1(n) = x(n) − x̄(n); U(n) = [];
    for i = 1, 2, ..., min(n, K) do
        if i = n then
            u_i(n) = x_i(n);
        else
            u_i(n) = ((n − 1 − l)/n) u_i(n − 1) + ((1 + l)/n) x_i(n) x_i(n)^T u_i(n − 1)/||u_i(n − 1)||;
            x_{i+1}(n) = x_i(n) − x_i(n)^T (u_i(n)/||u_i(n)||) (u_i(n)/||u_i(n)||);
        end if
        U(n) = [U(n), u_i(n)];
    end for
end for
W = normalized U(n);
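The update rule of Algorithm 1 can be sketched as follows (a hypothetical NumPy sketch with the amnesic parameter fixed to l = 0 and the approximate running-mean centralization described above; not the authors' code):

```python
import numpy as np

def ccipca(stream, K):
    """Sketch of Algorithm 1 (CCIPCA) with amnesic parameter l = 0."""
    it = iter(stream)
    x1 = next(it)
    mean = x1.astype(float).copy()
    U = [x1.astype(float).copy()]          # u_1(1) = x(1)
    for n, x in enumerate(it, start=2):
        mean = (n - 1) / n * mean + x / n  # running mean
        u_res = x - mean                   # approximately centered sample
        for i in range(min(n, K)):
            if i == len(U):
                U.append(u_res.copy())     # initialize a new direction
            else:
                u = U[i]
                # u_i(n) = ((n-1)/n) u + (1/n) (x . u/||u||) x
                U[i] = (n - 1) / n * u + (u_res @ u) / (n * np.linalg.norm(u)) * u_res
                uh = U[i] / np.linalg.norm(U[i])
                u_res = u_res - (u_res @ uh) * uh   # deflate the residual
    return np.column_stack([u / np.linalg.norm(u) for u in U])

rng = np.random.default_rng(1)
# leading eigendirection is the first axis (largest per-axis variance)
X = rng.normal(size=(2000, 4)) * np.array([5.0, 2.0, 1.0, 0.5])
W = ccipca(X, 2)
```

With a clear eigenvalue gap, the first estimated direction aligns with the leading eigenvector of the data covariance after a few thousand samples.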

2.3. Maximum Margin Criterion

MMC [15] is a supervised feature extraction algorithm, which is similar to LDA. Based on the same representation as LDA, MMC aims to maximize the following target function,

J(W) = trace(W^T (S_b − S_w) W),   (3)

where W ∈ R^{p×K}, W^T W = I.

MMC is easier to compute than LDA because it does not include an inverse matrix operation. The projection matrix W is obtained by solving the eigenvalue decomposition problem (S_b − S_w) w = λw. Similar to other batch feature extraction approaches, MMC is not efficient for large-scale data or streaming data problems. The IMMC [14] algorithm was proposed to solve this problem, but IMMC is not stable because the criterion matrix is not guaranteed to be nonnegative definite in some cases. IMMC borrowed the idea of the CCIPCA algorithm proposed in [13] and directly updates the eigenvectors incrementally, which makes IMMC run very rapidly. Based on IMMC, Yan et al. designed the IIS algorithm, which only optimizes the interclass matrix S_b [16, 17]. IMMC and IIS both have time complexity linear in the number of instances and features.
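A minimal batch MMC sketch (hypothetical NumPy code, not from the paper): build S_b and S_w as in eq. (2), then take the leading eigenvectors of S_b − S_w.

```python
import numpy as np

def mmc(X, y, K):
    """Leading eigenvectors of Sb - Sw, the MMC criterion of eq. (3)."""
    m = X.mean(axis=0)
    p = X.shape[1]
    Sb, Sw = np.zeros((p, p)), np.zeros((p, p))
    for c in np.unique(y):
        Xc = X[y == c]
        prior = len(Xc) / len(X)                 # prior probability p_i
        d = (Xc.mean(axis=0) - m)[:, None]
        Sb += prior * (d @ d.T)                  # between-class scatter
        Sw += prior * np.cov(Xc.T, bias=True)    # within-class scatter
    vals, vecs = np.linalg.eigh(Sb - Sw)         # ascending eigenvalues
    return vecs[:, ::-1][:, :K]                  # K leading eigenvectors

rng = np.random.default_rng(2)
# two tight clusters separated along the first axis
X = np.vstack([rng.normal(0.0, 0.1, (50, 3)) + np.array([3.0, 0.0, 0.0]),
               rng.normal(0.0, 0.1, (50, 3)) - np.array([3.0, 0.0, 0.0])])
y = np.array([0] * 50 + [1] * 50)
W = mmc(X, y, 1)
```

Unlike LDA, no matrix inverse is needed, which is why MMC remains well defined even when S_w is singular.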

2.4. Partial Least Squares Analysis

PLS is a class of techniques for modeling the relations between blocks of observed variables using latent variables. The underlying assumption of PLS is that the observed data are generated by a system or process that is driven by a small number of latent (not directly observed or measured) variables. Therefore, PLS aims to find uncorrelated linear transformations (latent components) of the original predictor variables that have high covariance with the response variables. The latent components of PLS are similar to the principal components of PCA; both are regarded as extracted linear features. Based on these latent components, PLS predicts the response variable y, i.e., the regression task, and reconstructs the original matrix X, i.e., the data modeling task, at the same time.

A key point of PLS is the construction of the component t by projecting X on the weight w as t = Xw. The classical criterion of PLS aims to sequentially maximize the covariance between the response y and the latent components. Some variant PLS approaches have been proposed to solve this problem [18]. Ignoring the minor differences among these algorithms, we next provide a brief description of the most frequently used approach, PLS1 [36].

PLS1 determines the first latent component t_1 = Xw_1 by maximizing the covariance between y and t_1 under the constraint ||w_1|| = 1. The corresponding objective function is

w_1 = arg max_{w^T w = 1} Cov(Xw, y).   (4)

We then introduce a Lagrange function

L(w, λ) = w^T X^T y − λ(w^T w − 1),   (5)

where λ is a Lagrange multiplier. At the saddle point, the derivatives of L must vanish, thereby leading to

X^T y = 2λw.   (6)

Thus, the exact solution for w is given as

w_1 = X^T y / ||X^T y||.   (7)

Note that the original data X and y are both assumed to have been centered in this subsection.

To extract the other latent components, PLS1 sequentially models the residual matrices, which could not be modeled by the previous latent variables, as the new X and y. To obtain the residuals, PLS1 deflates the matrix X by subtracting its rank-one approximation based on t_1:

E_1 = X − t_1 (t_1^T t_1)^{-1} t_1^T X,   (8)

where E_1 is used as the new X to extract the next latent variable t_2. Note that the deflation of y is not compulsory (the results are the same) [18, 36], so we do not include it in our illustration.

The extraction and deflation processes are applied alternately, so some information is represented and one latent component is extracted during each PLS iteration. The number of iterations K is also the number of components and it is the only PLS parameter; it is fixed by the user or determined by a model selection method. In general, the maximal value of K is the rank of the matrix X that has non-zero covariance with y. The iteration process of PLS dimension reduction is summarized in Algorithm 2.

Algorithm 2 PLS1 Dimension Reduction

Input: Feature set X
       Target variable y
       Target dimension K
Output: Projection matrix W

1: E_0 = X, W = [];
2: for i = 1 to K do
3:     w_i = E_{i-1}^T y / ||E_{i-1}^T y||;
4:     W = [W, w_i];
5:     t_i = E_{i-1} w_i;
6:     E_i = E_{i-1} − t_i (t_i^T t_i)^{-1} t_i^T E_{i-1};
7: end for
8: Output the projection matrix W.
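Algorithm 2 translates almost line by line into code. The following hypothetical NumPy sketch assumes X and y are already centered; it also collects the latent components t_i so that their mutual orthogonality, a standard property of this deflation, can be checked.

```python
import numpy as np

def pls1(X, y, K):
    """PLS1 dimension reduction following Algorithm 2 (X, y assumed centered)."""
    E = X.astype(float)
    Ws, Ts = [], []
    for _ in range(K):
        w = E.T @ y                            # step 3: weight direction
        w = w / np.linalg.norm(w)
        t = E @ w                              # step 5: latent component
        E = E - np.outer(t, t @ E) / (t @ t)   # step 6: deflation of X
        Ws.append(w); Ts.append(t)
    return np.column_stack(Ws), np.column_stack(Ts)

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 5)); X -= X.mean(axis=0)
y = X @ np.array([1.0, -2.0, 0.0, 0.0, 0.0]) + 0.1 * rng.normal(size=60)
y -= y.mean()
W, T = pls1(X, y, 3)
```

The first column of W is exactly X^T y normalized, matching eq. (7), and the score vectors t_i are pairwise orthogonal.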

3. Incremental Partial Least Squares

Clearly, traditional PLS operates as a batch process and it is not practical for real-world streaming data. In this section, we propose a highly scalable incremental feature extraction algorithm based on the PLS algorithm, which we call the IPLS algorithm.

3.1. Leading Projection Direction

As in the above section, suppose that the sample vectors are acquired sequentially, {x(1), y_1}, {x(2), y_2}, ..., possibly infinitely. Each x(n), n = 1, 2, ..., is a p-dimensional vector, where p may be 10,000 or higher, and y_n is its corresponding class label, the value of which is 1 for a positive sample and −1 for a negative sample. Unless stated otherwise, a variable S at step n is denoted by S(n) in the remainder of this study.

PLS assumes that x(n) and y_n have zero means across samples, but the means change dynamically as huge volumes of data are received. As mentioned above, the deflation of y is not technically required during the iterations of PLS1, and neither is the centralization of y. Thus, we focus on the centralization of x(n) in the present study. Let x̄(n) = (1/n) Σ_{i=1}^{n} x(i) denote the mean of x(n) at step n; then the ith centralized sample x̄_n(i) at step n is given as

x̄_n(i) = x(i) − x̄(n),   (9)

where x̄(n) is computed incrementally as

x̄(n) = ((n − 1)/n) x̄(n − 1) + (1/n) x(n).   (10)

A critical problem of centralization across samples with streaming data is that the arriving instances should be centralized by the current mean, but all of the historical centralized samples also need to be updated. However, this update is difficult to perform because there are too many historical streaming instances to store. This is why many incremental methods, such as CCIPCA, adopt an approximate centralization process to meet the zero-mean requirement. In most cases, where the inherent mean value of the training samples converges, the approximate centralization process works sufficiently well. However, exact historical centralization is obviously more suitable when the intrinsic data model is not stable or the sample number is small.

Thus, we propose a novel method for updating the feature extraction model without explicitly re-centralizing the historical data. Based on the solution of PLS1 in (6), we define a new variable v = X^T y = 2λw, and the estimate of v at step n is given as

v(n) = X(n)^T y(n)
     = Σ_{i=1}^{n−1} y_i x̄_n(i) + y_n x̄_n(n)
     = Σ_{i=1}^{n−1} y_i (x̄_{n−1}(i) − Δ(n)) + y_n x̄_n(n)
     = v(n − 1) − (n − 1) ȳ(n − 1) Δ(n) + y_n x̄_n(n),   (11)

where Δ(n) is defined as the increment of the mean vector x̄(n) and ȳ(n) is the mean of y at step n:

Δ(n) = x̄(n) − x̄(n − 1),
ȳ(n) = (1/n) Σ_{i=1}^{n} y_i.   (12)

At initialization, we set v(0) = 0. After the value of v(n) is determined, the projection direction w(n) is computed directly as w(n) = v(n)/||v(n)||.
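The update (11)-(12) can be checked numerically. The hypothetical NumPy sketch below maintains v(n) in one pass and, for verification only, also stores the stream so that v(n) can be compared against the batch quantity X̄(n)^T y(n) with every sample re-centered by the final mean.

```python
import numpy as np

rng = np.random.default_rng(4)
p = 6
v = np.zeros(p)            # v(0) = 0
mean = np.zeros(p)         # running mean of x, eq. (10)
ybar = 0.0                 # running mean of y, eq. (12)
xs, ys = [], []            # history kept only for the verification below
for n in range(1, 201):
    x = rng.normal(size=p) + 5.0               # deliberately non-zero mean
    yn = 1.0 if rng.random() < 0.5 else -1.0   # class label +1 / -1
    xs.append(x); ys.append(yn)
    new_mean = (n - 1) / n * mean + x / n            # eq. (10)
    delta = new_mean - mean                          # Delta(n), eq. (12)
    v = v - (n - 1) * ybar * delta + yn * (x - new_mean)   # eq. (11)
    mean = new_mean
    ybar = (n - 1) / n * ybar + yn / n
```

After any number of steps, v equals the exactly centralized cross-product vector, even though no historical sample was ever revisited.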

3.2. Other Projection Directions

Equation (11) only estimates the first projection direction, so we need a different method to compute the other, higher order projection directions incrementally. Based on similar incremental feature extraction methods, such as CCIPCA [13], IMMC [14], and IIS [16, 17], we propose a convenient method for generating "observations" only in a complementary space to compute the higher order projection directions, which is deduced from the property that the projection directions must be orthogonal to each other.

To compute the (j + 1)th projection direction, we simply need to subtract the projection of x(n) on the estimated jth latent component from the data, which yields a new x(n) (the residual) for the next iteration:

x_{j+1}(n) = x_j(n) − x_j(n)^T (v_j(n)/||v_j(n)||) (v_j(n)/||v_j(n)||),   (13)

where x_1(n) = x(n). This avoids the time-consuming orthonormalization process, and the orthogonality of v_j(n), j = 1, 2, ..., K, is always ensured.

However, this less costly method is not suitable for our incremental PLS algorithm. Equation (13) is a type of deflation based on the projection direction w. The deflation of PLS in (8) is based on the latent variable t, where t = Xw. Both (8) and (13) guarantee the orthogonality of w_j, j = 1, 2, ..., K, but the residual matrices of X are different. Therefore, the projection directions obtained are also different.

Unfortunately, we cannot perform exactly the same deflation as (8) using a one-pass algorithm because the dimension of t is the number of instances, which may be infinite with streaming data. Therefore, we need to design a novel method that differs from other incremental feature extraction methods.

It is well known that PLS1 is closely related to the Krylov sequence, as has been demonstrated previously [36]. The following equivalence holds between the space spanned by the PLS vectors and the space spanned by the Krylov sequence:

{W} = {s, Cs, C^2 s, ..., C^{K−1} s},   (14)

where s = X^T y is the cross-product vector and C = (1/n) X^T X is the covariance matrix. The exact PLS projection directions are computed by a Gram-Schmidt orthonormalization of the Krylov sequence [37]. We represent this as

W = [s, Cs\s, C^2 s\{s, Cs}, ..., C^{K−1} s\{s, Cs, ..., C^{K−2} s}],   (15)

where the notation f\{g, h, ...} refers to the component of f that is orthogonal to the space spanned by {g, h, ...}, i.e., the residual of f after multiple regression on {g, h, ...}.

Based on (15), only the covariance matrix C and the leading vector v_1 are required to compute the whole projection direction sequence. This computation is not related directly to the original streaming data, which reduces the computational costs greatly because only the variables C and v_1 need to be updated when each new sample arrives. The overall projection direction sequence is computed only when necessary.
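The equivalence (14)-(15) is easy to verify numerically: applying QR (i.e., Gram-Schmidt) to the Krylov sequence {s, Cs, C^2 s} spans the same subspace as the first three PLS1 weight vectors. A hypothetical NumPy sketch:

```python
import numpy as np

def pls1_weights(X, y, K):
    """PLS1 weight vectors via the deflation of Algorithm 2."""
    E, Ws = X.astype(float), []
    for _ in range(K):
        w = E.T @ y
        w = w / np.linalg.norm(w)
        t = E @ w
        E = E - np.outer(t, t @ E) / (t @ t)
        Ws.append(w)
    return np.column_stack(Ws)

rng = np.random.default_rng(5)
X = rng.normal(size=(40, 6)); X -= X.mean(axis=0)
y = rng.normal(size=40); y -= y.mean()

s = X.T @ y                     # cross-product vector s = X^T y
C = X.T @ X / len(X)            # covariance matrix (the 1/n scaling
                                # does not change the spanned subspace)
Krylov = np.column_stack([s, C @ s, C @ C @ s])
Q, _ = np.linalg.qr(Krylov)     # Gram-Schmidt orthonormalization, eq. (15)
W = pls1_weights(X, y, 3)
```

Since the PLS1 weight vectors are themselves orthonormal, the two orthogonal projectors onto the respective subspaces coincide.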

Although the time complexity of incrementally updating C is low, the storage requirements for C are too great if the original feature space is huge. In our experiments, the size of C exceeds 3 GB because the feature dimension is more than 20,000. Fortunately, we can reconstruct an approximate covariance matrix using a few PCA principal components, which can be computed efficiently with the CCIPCA algorithm. In this manner, the time and space complexity both become manageable.

Given the L leading PCA principal components u_j, j = 1, 2, ..., L, and the first projection direction v_1, the second projection direction v_2 is computed directly as

v_2 = C v_1 ≈ (Σ_{j=1}^{L} λ_j u_j u_j^T) v_1,
v_2 = v_2 − v_2^T (v_1/||v_1||) (v_1/||v_1||),   (16)

where λ_j is the corresponding eigenvalue of u_j. When the number L of principal components in the PCA model is of full rank, v_2 computed by (16) is identical to that derived by the traditional PLS1 algorithm. Our experiments showed that the top 20 PCA principal components are sufficient. The higher order directions are obtained in the same manner, by repeated multiplication with C followed by orthogonalization against the previously extracted directions.

3.3. Algorithm Summary

By embedding the CCIPCA algorithm naturally into our IPLS algorithm, we obtain the overall computational process. The detailed IPLS algorithm is shown in Algorithm 3.

The IPLS algorithm can be roughly divided into two stages, i.e., the online processing stage and the finalization stage. In the online processing stage, a preprocessing procedure is conducted first, where the means, the centered vector, and Δ(n) are updated as each instance arrives. Next, the leading PLS projection vector v_1 is computed incrementally as in (11). Finally, the PCA eigenvectors u_i, i = 1, 2, ..., L, are extracted using the CCIPCA algorithm. The finalization stage is needed only when the projection matrix is required. IPLS extracts the overall sequence of K projection vectors from the leading vector v_1 and the top L PCA eigenvectors as in (16).

Algorithm 3 IPLS Algorithm

Input: {x(n), y_n}, n = 1, 2, ... // streaming data
       K // number of dimensions extracted
       L // number of PCA components
       l // amnesic parameter of CCIPCA
Output: Projection matrix W

x̄(1) = x(1), ȳ(1) = y_1; v_1(1) = 0, u_1(1) = x(1);
for n = 2, 3, ... do
    x̄(n) = ((n − 1)/n) x̄(n − 1) + (1/n) x(n); ȳ(n) = ((n − 1)/n) ȳ(n − 1) + (1/n) y_n;
    x̄_n(n) = x(n) − x̄(n); Δ(n) = x̄(n) − x̄(n − 1);
    v_1(n) = v_1(n − 1) − (n − 1) ȳ(n − 1) Δ(n) + y_n x̄_n(n);
    for i = 1, 2, ..., min(n, L) do
        if i = n then
            u_i(n) = x_i(n);
        else
            u_i(n) = ((n − 1 − l)/n) u_i(n − 1) + ((1 + l)/n) x_i(n) x_i(n)^T u_i(n − 1)/||u_i(n − 1)||;
            x_{i+1}(n) = x_i(n) − x_i(n)^T (u_i(n)/||u_i(n)||) (u_i(n)/||u_i(n)||);
        end if
    end for
end for
// finalization stage
V(n) = [v_1(n)]; w = v_1(n);
for i = 2, 3, ..., K do
    w = Σ_{j=1}^{L} u_j(n) u_j(n)^T w / ||u_j(n)||;
    v_i(n) = w;
    for j = 1, 2, ..., i − 1 do
        v_i(n) = v_i(n) − v_i(n)^T (v_j(n)/||v_j(n)||) (v_j(n)/||v_j(n)||);
    end for
    V(n) = [V(n), v_i(n)];
end for
W = normalized V(n);

where x_1(n) = x̄_n(n).

The time complexity for IPLS to train N input samples is O(NLp + K^2 p), where p is the dimension of the original data space, L is the number of PCA principal components, and K is the target dimension; this is linear in both the number of samples N and the dimension p. In most cases, the time complexity is only O(NLp) since K ≪ N. Furthermore, when handling each input sample, IPLS only needs to maintain the learned leading projection direction, some PCA projection weight vectors, and several first-order statistics of the previous samples, such as the mean and sample number. Thus, IPLS can handle large-scale streaming data.

It should be noted that PLS is a supervised feature extraction method, so it is not suitable when all of the samples belong to the same class, because the covariance of the label vector y with any other variable is then zero. Therefore, the projection directions computed by IPLS are invalid in this situation.
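To make the two-stage structure concrete, the following hypothetical sketch implements IPLS end to end, except that for clarity it maintains the exact covariance matrix in place of the rank-L CCIPCA approximation (so it is only practical for small p); the online and finalization stages otherwise follow Algorithm 3.

```python
import numpy as np

class IPLS:
    """Two-stage IPLS sketch (exact covariance instead of the CCIPCA rank-L model)."""
    def __init__(self, p, K):
        self.p, self.K, self.n = p, K, 0
        self.mean = np.zeros(p)
        self.ybar = 0.0
        self.v1 = np.zeros(p)              # leading direction, eq. (11)
        self.S = np.zeros((p, p))          # running sum of x x^T

    def partial_fit(self, x, yn):
        # online processing stage: first-order statistics and v1 only
        self.n += 1
        n = self.n
        new_mean = (n - 1) / n * self.mean + x / n
        delta = new_mean - self.mean
        self.v1 = self.v1 - (n - 1) * self.ybar * delta + yn * (x - new_mean)
        self.mean = new_mean
        self.ybar = (n - 1) / n * self.ybar + yn / n
        self.S += np.outer(x, x)

    def projection(self):
        # finalization stage: Krylov sequence from v1, then Gram-Schmidt
        C = self.S / self.n - np.outer(self.mean, self.mean)
        V = [self.v1]
        for _ in range(self.K - 1):
            V.append(C @ V[-1])
        Q, _ = np.linalg.qr(np.column_stack(V))
        return Q

rng = np.random.default_rng(6)
model = IPLS(p=5, K=3)
xs, ys = [], []
for _ in range(300):
    x = rng.normal(size=5) + 2.0
    yn = 1.0 if rng.random() < 0.5 else -1.0
    model.partial_fit(x, yn)
    xs.append(x); ys.append(yn)
Q = model.projection()
```

Because eq. (11) tracks the exactly centralized cross-product vector, the first extracted direction matches the batch PLS1 leading direction up to sign.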

4. Experiments

Two experiments were performed to evaluate the proposed IPLS algorithm. The first was a handwritten digit recognition task, where the dataset was split into training and test sets. In this experiment, we evaluated the classification performance while varying the target dimension and the number of training instances, and we also report the time costs. The second experiment compared the performance of IPLS with existing algorithms using the Reuters-21578 corpus. We performed a novel experiment to simulate a real streaming application by utilizing the time-stamp feature. The class distribution of Reuters-21578 is severely skewed, so the performance levels with the common and rare classes are discussed. The experiments are described in the following subsections.

4.1. Handwriting Recognition

The GISETTE dataset is a handwritten digit recognition problem, which is one of five datasets in the NIPS 2003 feature selection challenge [38]. The problem requires the separation of the highly confusing digits '4' and '9', which occur in equal numbers. The data were split into training, validation, and test sets, with 6,000, 1,000, and 6,500 instances, respectively. The label information for the test set is not public, so we used the validation set as the test set in our experiments.

The digits have been size-normalized and centered in a fixed-size image of dimension 28×28. The original data were modified for the feature selection challenge. In particular, pixels were sampled at random from the middle top part of the image, which contains the information required to disambiguate '4' from '9', and higher order features were created from these pixels to move the problem into a higher-dimensional feature space. Some distractor features called 'probes' were also added, which had no predictive power. The orders of the features and the patterns were randomized. Finally, 5,000 features were included in this problem.

To strengthen the conclusions, several widely used classification methods were tested in our experiments, which are listed as follows. The classification performance was recorded based on the accuracy scores obtained by these classifiers using the test set.

• SVM: Linear support vector machine with C = 10.

• SMO: Support vector machines using sequential minimal optimization.

• ANN: Artificial neural network with one hidden layer. The maximum number of training iterations was 500 and the number of hidden nodes was set to half of the problem dimension.

• LD: Logistic discrimination.

• J48: The J48 decision tree algorithm.

• RR: Ridge regression.

• kNN: k-nearest neighbor algorithm with k = 5.

We used a DELL PC workstation with 24×Intel(R) Xeon(R) CPU X5680 at 3.33 GHz and 64 GB of RAM to conduct the experiments. The programming language used was Java, and the open source machine learning toolkit WEKA [39] was employed.

4.1.1. Variation of the target dimension

The target dimension is a critical parameter for dimension reduction methods. We varied the reduced dimension from 1 to 200, i.e., 1, 2, 3, 4, 5, 6, 8, 10, 15, 20, 30, 50, 80, 100, 150, and 200. For each dimension, we compared the following methods in our experiments.

• IPCA: The CCIPCA algorithm proposed by Weng et al. [13]. The weighting parameter l of CCIPCA was set to 2 when the number of samples exceeded 20, and l = 0 otherwise.

• IMMC: The IMMC algorithm proposed by Yan et al. [14]. The parameter θ = 0 was used for large-scale data, as reported by the authors.

• IIS: The IIS algorithm designed by Yan et al. [16, 17].

• IPLS: Our proposed IPLS algorithm. The number of PCA projection directions was set to 20; the weighting parameter l of the embedded CCIPCA was set to 2 when the number of samples exceeded 20, but l = 0 otherwise.

The detailed accuracy scores obtained with the SVM and J48 classification models are shown in Fig. 1. The results obtained using the other classifiers were similar. The abscissa in Fig. 1 is on a log2 scale.

To perform more rigorous comparisons, we provide the results of paired two-tailed t-tests in Table 1. The values (W/T/L) in Table 1 are the wins/ties/losses times from the t-tests using the accuracy scores with all classifiers (at a significance level of 0.05). The t-test result in any table cell is based on a comparison of the column method with the corresponding row method. The scores in the last row are the total wins/ties/losses times.

Table 1: Results of the paired two-tailed t-tests used to compare the IPCA, IMMC, IIS, and IPLS methods with various target dimensions

       IPCA     IMMC     IIS      IPLS
IMMC   0-2-5    -
IIS    0-2-5    6-1-0    -
IPLS   0-0-7    0-2-5    0-0-7    -
Total  0-4-17   11-5-5   5-3-13   19-2-0
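The W/T/L bookkeeping described above can be sketched as follows. This is an illustrative helper, not the authors' code: it applies a paired two-tailed t-test to the matched accuracy scores of two methods and classifies the outcome for method A. The default critical value t = 2.131 assumes α = 0.05 and df = 15 (the 16 target dimensions used here).

```python
from math import sqrt
from statistics import mean, stdev

def paired_t_wtl(scores_a, scores_b, t_crit=2.131):
    """Classify method A vs. method B as a 'win', 'tie', or 'loss' for A,
    using a paired two-tailed t-test on matched accuracy scores.
    t_crit = 2.131 is the two-tailed 5% critical value for df = 15."""
    d = [a - b for a, b in zip(scores_a, scores_b)]
    s = stdev(d)
    if s == 0:  # no variance in the differences; decide by their sign
        return "win" if d[0] > 0 else "loss" if d[0] < 0 else "tie"
    t = mean(d) / (s / sqrt(len(d)))
    if t > t_crit:
        return "win"
    if t < -t_crit:
        return "loss"
    return "tie"
```

A cell such as 0-2-5 in Table 1 is then the count of these outcomes over the seven classifiers.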

Figure 1: Classification accuracy obtained with (a) SVM and (b) J48 using the GISETTE dataset with various target dimensions.

Based on Fig. 1 and Table 1, some interesting observations can be made. In Fig. 1, the performance levels of all the feature extraction methods combined with SVM increased monotonically as the target dimension increased. By contrast, the accuracy curves decreased slightly when J48 was used as the classifier. This is partly because SVM is less sensitive to the addition of weakly relevant features than J48.

When the target dimension was less than 10, IPCA performed worse than the other methods. This is because PCA does not consider the label information, and the leading principal components of PCA mainly reflect the major information in X. However, when the target dimension was large, the performance of IPCA was relatively similar to that of the other methods because more training information was included by the addition of further principal components.

The other methods, i.e., IMMC, IIS, and IPLS, are designed for classification tasks, thus they performed much better than IPCA. The performance of IIS was clearly worse than that of IMMC and IPLS. This is because IIS only considers the inter-class information, which makes the target function much simpler but reduces the classification performance.

Strictly speaking, IPLS was among the best when combined with most classifiers. The performance of IMMC was similar to that of IPLS. It should be noted that IMMC performed much better than all of the other methods when the dimension was one. However, when the target dimension exceeded two, the accuracy scores with IPLS were better than those with IMMC in most cases. This suggests that the target function of IMMC is effective for obtaining the leading projection direction, but less so for the higher order directions.

4.1.2. Variation of the number of instances

The number of training instances is another critical parameter for dimension reduction methods. For the incremental learning models, we added the training instances continuously to the dimension reduction models. We then recorded the performance using the test set, where the instance number varied from 10 to 6,000, i.e., 10, 15, 20, 30, 50, 80, 100, 150, 200, 300, 500, 800, 1,000, 1,500, 2,000, 3,000, 4,000, and 6,000. For each instance number, we compared the accuracy scores obtained with IPCA, IMMC, IIS, and IPLS. The target dimension of these methods was set to 10.

The detailed accuracy scores obtained with the SVM and J48 classification models are shown in Fig. 2. The results were similar with the other classifiers. The abscissa in Fig. 2 is on a log2 scale. The results of the corresponding paired two-tailed t-tests are given in Table 2.

Table 2: Results of the paired two-tailed t-tests used to compare IPCA, IMMC, IIS, and IPLS with various numbers of instances

       IPCA     IMMC     IIS      IPLS
IMMC   0-5-2    -
IIS    0-6-1    3-4-0    -
IPLS   0-2-5    0-2-5    0-2-5    -
Total  0-13-8   5-11-5   1-12-8   15-6-0

The results in Fig. 2 and Table 2 are similar to those shown in Fig. 1 and Table 1. IPLS and IPCA performed the best and worst, respectively. The performance of IMMC was better than that of IIS.

Fig. 2 shows that the performance levels of all methods increased as more training samples arrived. When the training data were abundant, the performance curves began to converge. Some fluctuations occurred with each method, which indicates that changes in the data's intrinsic model have a major influence on incremental methods.

Figure 2: Classification accuracy obtained with (a) SVM and (b) J48 using the GISETTE dataset with various numbers of training instances.

4.1.3. Time efficiency

The time efficiency of the feature extraction methods was evaluated based on the CPU runtime. We recorded the values in milliseconds as the number of instances increased, and the results are shown in Table 3. The results show clearly that the time complexity was nearly linear with the instance number for IPCA, IMMC, IIS, and IPLS. For any given dimension, IPCA was the fastest method and IMMC was the slowest. In most cases, IPLS ran slightly slower than IPCA but faster than the other two methods.

We do not present the time costs with various reduced dimensions; from the algorithm descriptions, the time cost is also a linear function of the instance number in that setting. It should be noted that the cost of IPLS is almost constant with respect to the reduced dimension. As shown in Algorithm 3, most of the computational time required by IPLS is accounted for by the embedded CCIPCA algorithm. When the number of principal components was fixed at a given threshold, i.e., 20, the CPU time required by the other processes was relatively low when N was larger than K.

Table 3: Time costs (ms) of each incremental feature extraction method

Instance Num.   IPCA    IMMC     IIS      IPLS
10              4       12       9        6
20              15      90       30       22
50              41      318      92       74
100             78      499      192      151
200             149     704      363      290
500             654     1,414    977      693
1,000           1,030   2,710    1,953    1,321
2,000           2,280   4,698    3,952    2,627
4,000           4,608   8,623    7,860    5,380
6,000           6,954   13,920   12,522   8,168

4.1.4. The parameter L

As described in Algorithm 3, CCIPCA is embedded in IPLS. Thus, the number of PCA components has a major effect on the performance of IPLS. We tested the performance by varying the parameter L from 1 to 200, i.e., 1, 2, 3, 4, 5, 6, 8, 10, 15, 20, 30, 50, 80, 100, 150, and 200. The accuracy scores obtained using SVM are shown in Fig. 3. The target dimension was set to 10.

Figure 3: Accuracy of IPLS with SVM using the GISETTE dataset with various values of the parameter L.

Based on Fig. 3, we can see that 80, 100, and 200 were the best thresholds, which all obtained accuracy scores of 0.977. The performance curves began to converge when the parameter exceeded 15. In our view, the covariance matrix reconstructed using the top 20 PCA principal components is adequate for IPLS, because the corresponding accuracy was 0.973.

We also found that using only the top few PCA components was not helpful; the worst score was obtained with the top three PCA components. We consider that the top PCA components mainly reflect the global training information and have no critical discriminatory power. Compared with the global data information, local or minor data differences have a greater discriminatory capacity. This is why adding further PCA components was beneficial in Fig. 3.
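The covariance reconstruction discussed above can be illustrated as a rank-L eigen-approximation. This is a generic NumPy sketch with a function name of our own choosing, not the CCIPCA update itself: given the top-L eigenvalues and eigenvectors, the covariance matrix is approximated as the sum of lambda_i * v_i v_i^T over the leading components.

```python
import numpy as np

def reconstruct_cov(eigvals, eigvecs, L):
    """Rank-L approximation of a covariance matrix from its leading
    principal components: C_L = sum_{i<L} eigvals[i] * v_i v_i^T.
    eigvecs holds the eigenvectors as columns, sorted by eigenvalue."""
    V = eigvecs[:, :L]
    return V @ np.diag(np.asarray(eigvals)[:L]) @ V.T
```

With L equal to the full dimension the original matrix is recovered exactly; smaller L keeps only the dominant variance directions, which is why very small L discards the local differences that carry discriminatory power.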

4.2. Text Classification

We performed experiments using the Reuters-21578 corpus, a collection of documents that appeared on the Reuters newswire during 1987. The Reuters-21578 corpus is a well-known dataset that has been used widely in the text classification area [40]. After removing some corrupted documents, we combined the training documents and test documents into a single dataset, which yielded 11,359 documents.

The documents obtained comprised 120 different classes. The distribution of the classes was skewed and most of the classes were rare. The most common class had a document frequency of 3,986, whereas 97 classes had fewer than 100 positive instances and 15 classes had only one positive instance. The 20 most common classes were used in our experiments, for which the positive instance numbers varied from 120 to 3,986. These classes are listed in Table 4. Each class was regarded as a binary classification problem, where the label was set to 1 for positive instances and -1 for negative instances.
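The per-class binary labeling described above can be sketched as follows (an illustrative helper, not the authors' code):

```python
def one_vs_rest_labels(labels, positive_class):
    # Map a multi-class label list to the +/-1 encoding used for
    # each one-versus-rest binary classification problem.
    return [1 if y == positive_class else -1 for y in labels]
```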

We preprocessed the documents in a standard manner: all numbers and stopwords were removed, words were converted into lowercase, and word stemming was performed using the Porter stemmer. This procedure generated 22,049 unique terms. The ltc weighting, a form of TF × IDF weighting [41], was then used to compute the weight vectors.
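A minimal sketch of the ltc weighting scheme (SMART notation: logarithmic term frequency, idf, cosine normalization). The function name and dictionary interface are ours, and since the logarithm base used in the original experiments is not stated, natural logarithms are assumed:

```python
import math

def ltc_weights(doc_tf, doc_freq, n_docs):
    """ltc TF-IDF weights for one document:
       l: 1 + log(tf)    t: log(N / df)    c: cosine normalization."""
    w = {t: (1.0 + math.log(tf)) * math.log(n_docs / doc_freq[t])
         for t, tf in doc_tf.items() if tf > 0}
    norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
    return {t: v / norm for t, v in w.items()}
```

The cosine normalization step makes each document vector unit length, so long and short documents are comparable.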

Table 4: Top 20 common classes in the Reuters-21578 collection

Class No   Class Name      Positive Instance Number
1          earn            3,986
2          acq             2,448
3          money-fx        799
4          crude           632
5          grain           627
6          trade           552
7          interest        511
8          wheat           306
9          ship            304
10         corn            253
11         dlr             217
12         oilseed         192
13         money-supply    190
14         sugar           184
15         gnp             163
16         coffee          145
17         veg-oil         137
18         gold            135
19         nat-gas         130
20         soybean         120

The Reuters-21578 collection is a typical streaming dataset because each document has a time-stamp, although this property has not been exploited in most previous studies. We consider that the standard cross-validation procedure is not suitable for streaming data because the critical time sequence property is lost when the documents are shuffled randomly. Thus, we designed a new experimental procedure. We divided the overall collection into 20 equal blocks according to the time course. The data blocks were then added incrementally to the learning model as a data sequence, and the next data block was used as the test set. The classification performance was recorded 19 times for each class and each method. The detailed experimental procedure is given in the Appendix.

Similar to the experimental settings used in the last subsection, the performance levels of IPCA, IMMC, IIS, and IPLS were examined with seven classification models, i.e., SVM, SMO, ANN, LD, J48, RR, and kNN. The target dimension was set to 10. The performance was measured based on the F1 value for a given class, which is more suitable for imbalanced text classification than the accuracy score. Most of the classes used in the experiments were relatively rare: although they were the top 20 common classes in the Reuters-21578 collection, only the top two classes had more than 1,000 positive instances, which is small compared with the overall set of 11,359 documents. The F1 score is defined as

F1 = (2 × precision × recall) / (precision + recall),

where precision is the number of true positive instances divided by the number of all positive instances returned, and recall is the number of true positive instances divided by the number of positive instances that should have been returned. The F1 value is zero when all positive instances are predicted incorrectly.

Table 5 shows the results of the paired two-tailed t-tests used to compare the different feature extraction methods across all classifiers. The values (W/T/L) in Table 5 are the wins/ties/losses times of the t-tests using the F1 scores (at a significance level of 0.05).

Table 5: Results of the paired two-tailed t-tests used to compare the IPCA, IMMC, IIS, and IPLS methods with the Reuters-21578 dataset

       IPCA        IMMC         IIS           IPLS
IMMC   3-20-117    -
IIS    5-25-110    71-67-2      -
IPLS   2-7-131     14-89-37     0-58-82       -
Total  10-52-358   202-176-42   112-150-158   250-154-16
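The F1 computation described above is straightforward; a minimal sketch from prediction counts (the helper name is ours):

```python
def f1_score(tp, fp, fn):
    # F1 = 2 * precision * recall / (precision + recall),
    # with F1 = 0 when no positive instance is predicted correctly.
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```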

The t-test results show clearly that our proposed IPLS performed the best of all the methods, which is similar to the results of the handwriting recognition problem. IMMC had the second best performance, where its t-test results (W/T/L) compared with IPLS were 14-89-37. We also provide the detailed F1 scores for the different classes using the SVM and J48 classification models in Fig. 4. The results obtained using the other classifiers were similar. The F1 scores shown in Fig. 4 were averaged over all the test data blocks.

Figure 4: F1 results for the top 20 classes in Reuters-21578 using (a) SVM and (b) J48. Note that Class No. is ordered by the positive instance number.

Figure 4 shows that none of the methods performed well on all classes; the detailed F1 values appear to depend greatly on the differences among the classes. Clearly, IPCA performed the worst because it is an unsupervised method. In general, IPLS and IMMC were the best two methods and their performance levels were very similar; the best method for each class was either IPLS or IMMC. Overall, we suggest that IPLS performed slightly better than IMMC based on Fig. 4, which is supported by the t-test results. The performance of IIS was worse than that of IMMC in the classification task, as was also reported by the original authors [17].

Furthermore, we observed that the F1 performance for a given class was related to the number of positive training documents. Most of the methods performed well when the class distribution was almost balanced. Indeed, IPCA performed almost as well as the other methods on the most common class, which comprised 3,986 positive documents. For the minor classes, however, IPCA performed extremely badly because it is difficult to extract discriminatory information for rare classes using an unsupervised method.

The classification performance had an almost linear relationship with the number of positive samples. We can see that the F1 scores of the IMMC, IIS, and IPLS methods decreased almost monotonically as the class number increased from 1 to 11, where the positive instance number decreased from 3,986 to 217. For the other classes, the F1 values depended greatly on the differences among the classes, because their numbers of positive samples were similar.

5. Conclusions

Incremental feature extraction is an effective technique that facilitates data mining from large-scale streaming data. PLS is known to be a powerful feature extraction technique for the classification of traditional data. Therefore, we propose the IPLS method to exploit the advantages of both PLS and incremental feature extraction, thereby improving the generalization performance of streaming data analysis. The leading projection direction extracted by IPLS agrees with that of traditional PLS because exact historical centralization is applied by IPLS. The extraction of higher order projection directions by a deflation scheme is a distinguishing feature of PLS, which is difficult to modify incrementally. Based on the equivalence between the space spanned by the PLS vectors and the space of the Krylov sequence, we relate the deflation process to the leading projection direction and the covariance matrix, rather than the original streaming data, because the covariance matrix can be estimated by IPCA of the streaming data. CCIPCA is embedded naturally into the IPLS algorithm, which has a high discriminatory capacity and low time complexity, being linear with both the number of samples and the number of features. We compared the performance of IPLS with state-of-the-art methods such as CCIPCA using real streaming data in handwritten digit recognition and text classification tasks. Our empirical results showed that IPLS outperformed the other methods in terms of its efficiency and further classification accuracy.
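For reference, the Krylov equivalence invoked here can be stated for the single-response (PLS1) case as follows (see Helland [36]); the notation below assumes w_i are the PLS weight vectors:

```latex
\operatorname{span}\{w_1, \dots, w_k\}
  = \operatorname{span}\{s,\; Cs,\; \dots,\; C^{k-1}s\},
\qquad s = X^{\mathsf{T}} y, \quad C = X^{\mathsf{T}} X .
```

This is what allows the higher order directions to be computed from the leading direction and the (incrementally estimated) covariance matrix, without revisiting the original data.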

The rapid emergence of large-scale streaming data applications means that increasing amounts of streaming data from scientific applications will have nonlinear properties and may shift over time, so linear methods such as IPLS and CCIPCA have limitations when handling this type of data. Therefore, highly efficient nonlinear incremental feature extraction methods will be in great demand in the future.

ACKNOWLEDGMENTS

This research was supported by the Natural Science Foundation of China under grant nos. 61105053 and 61273305, the Natural Science Foundation of Jiangxi Province under grant no. 20132BAB201043, and the China Postdoctoral Science Foundation under grant no. 2013M541540.

Appendix: Experimental Procedure

The following pseudo-code describes the experimental procedure used to collect and process the data.

1: Divide the dataset into 20 equal blocks according to the time series

2: for each class do

3: for i=1 to 19 do

4: Train the dimension reduction model on the ith data block incrementally

5: Merge the ith data block into the training set

6: Set the i + 1th data block as the test set

7: Transform the training set and the test set into the learned low-dimensional space

8: Train the classification model on the reduced training set

9: Measure the classification performance using the reduced test set

10: end for

11: end for
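The pseudo-code above can be rendered as a runnable sketch. The reducer/classifier interfaces (`partial_fit`, `transform`, `fit`, `predict`) and the function name are our assumptions for illustration, not WEKA's API:

```python
def streaming_evaluation(blocks, reducer, make_classifier, metric):
    """Block-wise streaming protocol: update the dimension reduction
    model incrementally on block i, retrain the classifier on all
    blocks seen so far, and test on block i+1."""
    scores = []
    train = []
    for i in range(len(blocks) - 1):
        for x, y in blocks[i]:
            reducer.partial_fit(x, y)       # incremental update (step 4)
        train.extend(blocks[i])             # step 5
        test = blocks[i + 1]                # step 6
        clf = make_classifier()             # steps 7-8: reduce, then fit
        clf.fit([reducer.transform(x) for x, _ in train],
                [y for _, y in train])
        preds = [clf.predict(reducer.transform(x)) for x, _ in test]
        scores.append(metric([y for _, y in test], preds))  # step 9
    return scores
```

With 20 blocks this yields 19 scores per class, matching the procedure described in Section 4.2.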

References

[1] M. M. Gaber, A. Zaslavsky, S. Krishnaswamy, Mining data streams: a review, SIGMOD Rec. 34 (2) (2005) 18-26.

[2] J. H. Friedman, J. J. Meulman, Clustering objects on subsets of attributes, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 66 (4) (2004) 815-849.

[3] D. Steinley, M. J. Brusco, Selection of variables in cluster analysis: An empirical comparison of eight procedures, Psychometrika 73 (1) (2008) 125-144.

[4] S. Lipovetsky, Finding cluster centers and sizes via multinomial parameterization, Applied Mathematics and Computation 221 (2013) 571-580.

[5] S. Chen, H. He, Towards incremental learning of nonstationary imbalanced data stream: a multiple selectively recursive approach, Evolving Systems 2 (1) (2011) 35-50.

[6] M. Artac, M. Jogan, A. Leonardis, Incremental PCA for on-line visual learning and recognition, in: Proceedings of the 16th International Conference on Pattern Recognition, Vol. 3, 2002, pp. 781-784.

[7] Y. Li, On incremental and robust subspace learning, Pattern Recognition 37 (7) (2004) 1509-1518.

[8] S. Pang, S. Ozawa, N. Kasabov, Incremental linear discriminant analysis for classification of data streams, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 35 (5) (2005) 905-914.

[9] K. Hiraoka, S. Yoshizawa, K. Hidai, M. Hamahira, H. Mizoguchi, T. Mishima, Convergence analysis of online linear discriminant analysis, in: Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN 2000), Vol. 3, 2000, pp. 387-391.

[10] G.-F. Lu, J. Zou, Y. Wang, Incremental learning of complete linear discriminant analysis for face recognition, Knowledge-Based Systems 31 (0) (2012) 19-27.

[11] M. Law, A. Jain, Incremental nonlinear dimensionality reduction by manifold learning, IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (3) (2006) 377-391.

[12] N. Guan, D. Tao, Z. Luo, B. Yuan, Online nonnegative matrix factorization with robust stochastic approximation, IEEE Transactions on Neural Networks and Learning Systems 23 (7) (2012) 1087-1099.

[13] J. Weng, Y. Zhang, W.-S. Hwang, Candid covariance-free incremental principal component analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (8) (2003) 1034-1040.

[14] J. Yan, B. Zhang, S. Yan, Q. Yang, H. Li, Z. Chen, W. Xi, W. Fan, W.-Y. Ma, Q. Cheng, IMMC: incremental maximum margin criterion, in: Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004, pp. 725-730.

[15] H. Li, T. Jiang, K. Zhang, Efficient and robust feature extraction by maximum margin criterion, IEEE Transactions on Neural Networks 17 (1) (2006) 157-165.

[16] J. Yan, B. Zhang, N. Liu, S. Yan, Q. Cheng, W. Fan, Q. Yang, W. Xi, Z. Chen, Effective and efficient dimensionality reduction for large-scale and streaming data preprocessing, IEEE Transactions on Knowledge and Data Engineering 18 (3) (2006) 320-333.

[17] J. Yan, B. Zhang, S. Yan, N. Liu, Q. Yang, Q. Cheng, H. Li, Z. Chen, W.-Y. Ma, A scalable supervised algorithm for dimensionality reduction on streaming data, Information Sciences 176 (14) (2006) 2042-2065.

[18] R. Rosipal, N. Kramer, Overview and recent advances in partial least squares, in: Subspace, Latent Structure and Feature Selection Techniques, Springer, 2006, pp. 34-51.

[19] D. V. Nguyen, D. M. Rocke, Tumor classification by partial least squares using microarray gene expression data, Bioinformatics 18 (1) (2002) 39-50.

[20] L. Shen, E. C. Tan, Dimension reduction-based penalized logistic regression for cancer classification using microarray data, IEEE Transactions on Computational Biology and Bioinformatics 2 (2) (2005) 166-175.

[21] H.-N. Qu, G.-Z. Li, W.-S. Xu, An asymmetric classifier based on partial least squares, Pattern Recognition 43 (10) (2010) 3448-3457.

[22] X.-Q. Zeng, G.-Z. Li, G.-F. Wu, J. Y. Yang, M. Q. Yang, Irrelevant gene elimination for partial least squares based dimension reduction by using feature probes, International Journal of Data Mining and Bioinformatics 3 (1) (2009) 85-103.

[23] G. Guo, G. Mu, Simultaneous dimensionality reduction and human age estimation via kernel partial least squares regression, in: 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 657-664.

[24] M. Barker, W. Rayens, Partial least squares for discrimination, Journal of Chemometrics 17 (3) (2003) 166-173.

[25] Y. Liu, W. Rayens, PLS and dimension reduction for classification, Computational Statistics 22 (2007) 189-208.

[26] J. J. Dai, L. Lieu, D. Rocke, Dimension reduction for classification with gene expression data, Statistical Applications in Genetics and Molecular Biology 5 (1) (2006) Article 6.

[27] A.-L. Boulesteix, PLS dimension reduction for classification of microarray data, Statistical Applications in Genetics and Molecular Biology 3 (1) (2004) Article.

[28] K. Helland, H. E. Berntsen, O. S. Borgen, H. Martens, Recursive algorithm for partial least squares regression, Chemometrics and Intelligent Laboratory Systems 14 (1-3) (1992) 129-137.

[29] B. McWilliams, G. Montana, Sparse partial least squares regression for on-line variable selection with multivariate data streams, Statistical Analysis and Data Mining 3 (3) (2010) 170-193.

[30] W. Ni, S. K. Tan, W. J. Ng, S. D. Brown, Localized, adaptive recursive partial least squares regression for dynamic system modeling, Industrial & Engineering Chemistry Research 51 (23) (2012) 8025-8039.

[31] I. T. Jolliffe, Principal Component Analysis, Springer, 2002.

[32] J. C. Lv, K.-K. Tan, Z. Yi, S. Huang, A family of fuzzy learning algorithms for robust principal component analysis neural networks, IEEE Transactions on Fuzzy Systems 18 (1) (2010) 217-226.

[33] D. Zhang, Z.-H. Zhou, S. Chen, Diagonal principal component analysis for face recognition, Pattern Recognition 39 (1) (2006) 140-142.

[34] J. Yang, D. Zhang, A. Frangi, J.-Y. Yang, Two-dimensional PCA: a new approach to appearance-based face representation and recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (1) (2004) 131-137.

[35] T.-J. Chin, D. Suter, Incremental kernel principal component analysis, IEEE Transactions on Image Processing 16 (6) (2007) 1662-1674.

[36] I. S. Helland, On the structure of partial least squares regression, Communications in Statistics - Simulation and Computation 17 (2) (1988) 581-607.

[37] S. De Jong, SIMPLS: an alternative approach to partial least squares regression, Chemometrics and Intelligent Laboratory Systems 18 (3) (1993) 251-263.

[38] I. Guyon, J. Li, T. Mader, P. A. Pletscher, G. Schneider, M. Uhr, Competitive baseline methods set new standards for the NIPS 2003 feature selection benchmark, Pattern Recognition Letters 28 (12) (2007) 1438-1444.

[39] I. H. Witten, E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd Edition, Morgan Kaufmann Publishers Inc., San Francisco, 2005.

[40] Y. Yang, X. Liu, A re-examination of text categorization methods, in: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM Press, 1999, pp. 42-49.

[41] C. Buckley, G. Salton, J. Allan, A. Singhal, Automatic query expansion using SMART: TREC 3, in: Proceedings of the Third Text Retrieval Conference (TREC-3), 1994.