Hindawi Publishing Corporation Journal of Applied Mathematics Volume 2014, Article ID 942520, 16 pages http://dx.doi.org/10.1155/2014/942520

Research Article

An Interior Point Method for L1/2-SVM and Application to Feature Selection in Classification

Lan Yao,1 Xiongji Zhang,2 Dong-Hui Li,2 Feng Zeng,3 and Haowen Chen1

1 College of Mathematics and Econometrics, Hunan University, Changsha 410082, China

2 School of Mathematical Sciences, South China Normal University, Guangzhou 510631, China

3 School of Software, Central South University, Changsha 410083, China

Correspondence should be addressed to Dong-Hui Li; dhli@scnu.edu.cn

Received 4 November 2013; Revised 12 February 2014; Accepted 18 February 2014; Published 10 April 2014

Academic Editor: Frank Werner

Copyright © 2014 Lan Yao et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

This paper studies feature selection for support vector machine (SVM). By the use of the L1/2 regularization technique, we propose a new model L1/2-SVM. To solve this nonconvex and non-Lipschitz optimization problem, we first transform it into an equivalent quadratic constrained optimization model with linear objective function and then develop an interior point algorithm. We establish the convergence of the proposed algorithm. Our experiments with artificial data and real data demonstrate that the L1/2-SVM model works well and the proposed algorithm is more effective than some popular methods in selecting relevant features and improving classification performance.

1. Introduction

Feature selection plays an important role in solving the classification problems with high dimension features, such as text categorization [1, 2], gene expression array analysis [3-5], and combinatorial chemistry [6, 7]. The advantages of feature selection include (i) ignoring noisy or irrelevant features would prevent overfitting and improve the generalization performance; (ii) a sparse classifier can reduce the computation cost; (iii) a small set of important features is desirable for interpretability.

We address the embedded feature selection methods in the context of linear support vector machines (SVMs). Existing feature selection methods embedded in SVMs fall into three approaches [8]. In the first approach, greedy search strategies are applied to iteratively add or remove features from the data. Guyon et al. [3] developed a recursive feature elimination (RFE) algorithm, which has shown good performance on gene selection for microarray data. Beginning with the full feature subset, SVM-RFE trains an SVM at each iteration and then eliminates the feature that decreases the margin the least. Rakotomamonjy et al. [9] extended this method by using other ranking criteria, including the radius margin bound and the span estimate.

The second approach is to optimize a scaling parameter vector $a \in [0,1]^n$ that indicates the importance of each feature. Weston et al. [10] proposed an iterative method to optimize the scaling parameters by minimizing the bounds on leave-one-out error. Peleg and Meir [11] learned the scaling factors based on the global minimization of a data-dependent generalization error bound.

The third category of approaches is to minimize the number of features by adding a sparsity term to the SVM formulation. Though the standard SVM based on $\|w\|_2$ can be solved easily by convex quadratic programming, its solution may not be a desirable sparse solution. A popular way to deal with this problem is the use of the Lp regularization technique, which results in an Lp-SVM. It minimizes $\|w\|_p^p$ subject to some linear constraints, where $p \in [0,1]$. When $p \in (0,1]$,

$$\|w\|_p = \Bigl(\sum_{j=1}^{m} |w_j|^p\Bigr)^{1/p}. \qquad (1)$$

When $p = 0$, $\|w\|_0 = \sum_{j=1}^{m} I(w_j \neq 0)$. The L0-SVM can find the sparsest classifier by minimizing $\|w\|_0$, the number of nonzero elements in $w$. However, it is discrete and NP-hard. From a computational point of view, it is very difficult to develop efficient numerical methods to solve the problem. A widely used technique in dealing with the L0-SVM is to employ a smoothing technique so that the discrete model is approximated by a smooth problem [4, 12, 13]. However, as the function $\|w\|_0$ is not even continuous, a smoothing-based method cannot be expected to work well. Chan et al. [14] explored a convex relaxation of the cardinality constraint and obtained a relaxed convex problem that is close to but different from the previous L0-SVM. An alternative method is to minimize the convex envelope of $\|w\|_0$, such as the L1-SVM. The L1-SVM is a convex problem and can yield sparse solutions. It is equivalent to a linear program and hence can be solved efficiently. Indeed, L1 regularization has become quite popular in SVM [12, 15, 16] and is well known as the LASSO [17] in the statistics literature. However, the L1 regularization problem often leads to suboptimal sparsity in practice [18]. In many cases, the solutions yielded by the L1-SVM are less sparse than those of the L0-SVM. The Lp problem with $0 < p < 1$ can find sparser solutions than the L1 problem, as evidenced by extensive computational experiments [19-21]. It has become a popular strategy in sparse SVM [22-27].

In this paper, we focus on the L1/2 regularization and propose a novel L1/2-SVM. Recently, Xu et al. [28] justified that the sparsity-promotion ability of the L1/2 problem is the strongest among the Lp minimization problems with $p \in [1/2, 1)$ and similar for $p \in (0, 1/2]$. So the L1/2 problem can be taken as a representative of the Lp ($0 < p < 1$) problems. However, as proved by Ge et al. [29], finding the global minimal value of the L1/2 problem is still strongly NP-hard, whereas computing a local minimizer of the problem can be done in polynomial time. The contributions of this paper are twofold. The first is to derive a smooth constrained optimization reformulation to the L1/2-SVM. The objective function of the problem is a linear function and the constraints are quadratic and linear. We will establish the equivalence between the constrained problem and the L1/2-SVM. We will also show the existence of the KKT condition of the constrained problem. Our second contribution is to develop an interior point method to solve the constrained optimization reformulation and establish its global convergence. We will also test and verify the effectiveness of the proposed method using artificial data and real data.

The rest of this paper is organized as follows. In Section 2, we first briefly introduce the model of the standard SVM (L2-SVM) and the sparse regularization SVMs. We then reformulate the L1/2-SVM into a smooth constrained optimization problem. In Section 3, we propose an interior point method to solve the constrained optimization reformulation and establish its global convergence. In Section 4, we report numerical experiments that test the proposed method. Section 5 gives the concluding remarks.

2. A Smooth Constrained Optimization Reformulation to the L1/2-SVM

In this section, after briefly reviewing the model of the standard SVM (L2-SVM) and the sparse regularization SVMs, we derive an equivalent smooth optimization problem to the L1/2-SVM model. The smooth optimization problem is to minimize a linear function subject to some simple quadratic constraints and linear constraints.

2.1. Standard SVM. In a two-class classification problem, we are given a training data set $D = \{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^m$ is the feature vector and $y_i \in \{-1, +1\}$ is the class label. The linear classifier is to construct the following decision function:

$$f(x) = w^T x + b, \qquad (2)$$

where $w = (w_1, w_2, \ldots, w_m)$ is the weight vector and $b$ is the bias. The prediction label is +1 if $f(x) > 0$ and -1 otherwise. The standard SVM (L2-SVM) [30] aims to find the separating hyperplane $f(x) = 0$ between two classes with maximal margin $2/\|w\|_2$ and minimal training errors, which leads to the following convex optimization problem:

$$\min \ \tfrac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{n} \xi_i \quad \text{s.t. } y_i(w^T x_i + b) \ge 1 - \xi_i,\ \xi_i \ge 0,\ i = 1, \ldots, n, \qquad (3)$$

where $\|w\|_2 = (\sum_{j=1}^{m} w_j^2)^{1/2}$ is the L2 norm of $w$, $\xi_i$ is the loss function that allows training errors for data that may not be linearly separable, and $C$ is a user-specified parameter to balance the margin and the losses. As the problem is a convex quadratic program, it can be solved efficiently by existing methods, such as the interior point method and the active set method.
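As a small illustration of problem (3), the following sketch evaluates its objective for a fixed pair $(w, b)$, using the smallest slack that each constraint allows. The function name `svm_objective` and the toy data are assumptions made for this example and are not taken from the paper.

```python
def svm_objective(w, b, X, y, C):
    """Objective of problem (3) with every slack at its smallest feasible value."""
    slacks = []
    for x_i, y_i in zip(X, y):
        margin = y_i * (sum(wj * xj for wj, xj in zip(w, x_i)) + b)
        # Smallest xi_i with y_i (w^T x_i + b) >= 1 - xi_i and xi_i >= 0.
        slacks.append(max(0.0, 1.0 - margin))
    half_sq_norm = 0.5 * sum(wj * wj for wj in w)
    return half_sq_norm + C * sum(slacks), slacks


# Hypothetical toy data: two points in R^2 with labels in {-1, +1}.
X = [[2.0, 0.0], [-0.5, 0.0]]
y = [1, -1]
value, slacks = svm_objective([1.0, 0.0], 0.0, X, y, C=1.0)
```

The first point lies outside the margin and needs no slack, while the second violates it by 0.5, so only the second contributes to the loss term.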

2.2. Sparse SVM. The L2-SVM is a nonsparse regularizer in the sense that the learned decision hyperplane often utilizes all the features. In practice, one prefers a sparse SVM so that only a few features are used to make a decision. For this purpose, the following Lp-SVM has become very popular:

$$\min \ \|w\|_p^p + C\sum_{i=1}^{n} \xi_i \quad \text{s.t. } y_i(w^T x_i + b) \ge 1 - \xi_i,\ \xi_i \ge 0,\ i = 1, \ldots, n, \qquad (4)$$

where $p \in [0,1]$, $\|w\|_0$ stands for the number of nonzero elements of $w$, and for $p \in (0,1]$, $\|w\|_p^p$ is defined by (1).

Problem (4) is obtained by replacing the L2 penalty ($\|w\|_2^2$) by the Lp penalty ($\|w\|_p^p$) in (3). The standard SVM (3) corresponds to the model (4) with $p = 2$.

Figure 1 plots the Lp penalty in one dimension. We can see from the figure that the smaller $p$ is, the larger the penalties imposed on small coefficients ($|w| < 1$). Therefore, the Lp penalties with $p < 1$ may achieve sparser solutions than the L1 penalty. In addition, the L1 penalty imposes large penalties on large coefficients, which may lead to biased estimation for large coefficients. Consequently, the Lp ($0 < p < 1$) penalties become attractive due to their good properties in sparsity, unbiasedness [31], and the oracle property [32]. We are particularly interested in the L1/2 penalty. Recently, Xu et al. [21] revealed the representative role of the L1/2 penalty in the Lp regularization with $p \in (0,1)$. We will apply the L1/2 penalty to SVM to perform feature selection and classification jointly.

Figure 1: Lp penalty in one dimension (curves for p = 2, p = 1, and p = 0.5).
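The behaviour described for Figure 1 can be checked numerically. The sketch below evaluates the one-dimensional penalty $|w|^p$ at one small and one large coefficient; the two magnitudes are illustrative values chosen for this example, not values from the paper.

```python
def lp_penalty(w, p):
    """One-dimensional Lp penalty |w|^p."""
    return abs(w) ** p


small, large = 0.01, 4.0  # illustrative coefficient magnitudes (assumed)
for p in (2, 1, 0.5):
    print(p, lp_penalty(small, p), lp_penalty(large, p))
```

For the small coefficient the penalty grows as $p$ decreases, whereas for the large coefficient it shrinks, which is the pattern that favours sparsity while limiting bias on large weights.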

2.3. L1/2-SVM Model. We pay particular attention to the L1/2-SVM, namely, problem (4) with $p = 1/2$. We will derive a smooth constrained optimization reformulation to the L1/2-SVM so that it is relatively easy to design numerical methods. We first specify the L1/2-SVM:

$$\min \ \sum_{j=1}^{m} |w_j|^{1/2} + C\sum_{i=1}^{n} \xi_i \quad \text{s.t. } y_i(w^T x_i + b) \ge 1 - \xi_i,\ i = 1, \ldots, n, \quad \xi_i \ge 0,\ i = 1, \ldots, n. \qquad (5)$$

Denote by $u = (w, \xi, b)$ the vector of variables, by $\phi(u)$ the objective function of (5), and by $D$ the feasible region of the problem; that is,

$$D = \{(w, \xi, b) \mid y_i(w^T x_i + b) \ge 1 - \xi_i,\ \xi_i \ge 0,\ i = 1, \ldots, n\}. \qquad (6)$$

Then the L1/2-SVM can be written in the compact form

$$\min \ \phi(u), \quad u \in D. \qquad (7)$$

It is a nonconvex and non-Lipschitz problem. Due to the term $|w_j|^{1/2}$, the objective function is not even directionally differentiable at a point with some $w_j = 0$, which makes the problem very difficult to solve. Existing numerical methods that are very efficient for smooth problems cannot be used directly. One possible way to develop numerical methods for solving (7) is to smooth the term $|w_j|^{1/2}$ using a smoothing function such as $(w_j^2 + \epsilon^2)^{1/4}$ with some $\epsilon > 0$. However, it is easy to see that the derivative of this function becomes unbounded as $w_j \to 0$ and $\epsilon \to 0$. Consequently, numerical methods based on such a smoothing function cannot be expected to work well.
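The unboundedness of the derivative can be seen with a few evaluations. The sketch below uses the closed-form derivative of $(w^2 + \epsilon^2)^{1/4}$ and follows the path $w = \epsilon$ as both tend to zero; the function name and the chosen values of $\epsilon$ are assumptions made for illustration.

```python
def smoothed_sqrt_abs_derivative(w, eps):
    """d/dw of (w^2 + eps^2)^(1/4), that is w / (2 (w^2 + eps^2)^(3/4))."""
    return w / (2.0 * (w * w + eps * eps) ** 0.75)


# Follow the path w = eps while both shrink towards zero.
slopes = [smoothed_sqrt_abs_derivative(eps, eps) for eps in (1e-2, 1e-4, 1e-6)]
```

Along this path the slope equals $2^{-7/4}\epsilon^{-1/2}$, so it grows by a factor of 10 each time $\epsilon$ shrinks by a factor of 100.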

Recently, Tian and Yang [33] proposed an interior point L1/2-penalty function method to solve general nonlinear programming problems by using a quadratic relaxation scheme for their L1/2-lower order penalty problems. We will follow the idea of [33] to develop an interior point method for solving the L1/2-SVM. To this end, in the next subsection, we reformulate problem (7) as a smooth constrained optimization problem.

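2.4. A Reformulation to the L1/2-SVM Model. Consider the following constrained optimization problem:

$$\min \ \sum_{j=1}^{m} t_j + C\sum_{i=1}^{n} \xi_i$$
$$\text{s.t. } t_j^2 - w_j \ge 0, \quad t_j^2 + w_j \ge 0, \quad t_j \ge 0, \quad j = 1, \ldots, m,$$
$$y_i(w^T x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, n. \qquad (8)$$

It is obtained by letting $t_j = |w_j|^{1/2}$ in the objective function and adding the constraints $t_j^2 - w_j \ge 0$ and $t_j^2 + w_j \ge 0$, $j = 1, \ldots, m$, in (7). Denote by $F$ the feasible region of the problem; that is,

$$F = \{(w, \xi, t, b) \mid t_j^2 - w_j \ge 0,\ t_j^2 + w_j \ge 0,\ t_j \ge 0,\ j = 1, \ldots, m\} \cap \{(w, \xi, t, b) \mid y_i(w^T x_i + b) \ge 1 - \xi_i,\ \xi_i \ge 0,\ i = 1, \ldots, n\}. \qquad (9)$$

Let $z = (w, \xi, t, b)$ and let $f(z)$ denote the objective function of (8). Then the above problem can be written as

$$\min \ f(z), \quad z \in F. \qquad (10)$$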

The following theorem establishes the equivalence between the L1/2-SVM and (10).

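Theorem 1. If $u^* = (w^*, \xi^*, b^*) \in \mathbb{R}^{m+n+1}$ is a solution of the L1/2-SVM (7), then $z^* = (w^*, \xi^*, |w^*|^{1/2}, b^*) \in \mathbb{R}^{m+n+m+1}$ is a solution of the optimization problem (10). Conversely, if $(w^*, \xi^*, t^*, b^*)$ is a solution of the optimization problem (10), then $(w^*, \xi^*, b^*)$ is a solution of the L1/2-SVM (7).

Proof. Let $u^* = (w^*, \xi^*, b^*)$ be a solution of the L1/2-SVM (7) and let $\bar z = (\bar w, \bar\xi, \bar t, \bar b)$ be a solution of the constrained optimization problem (10). It is clear that $z^* = (w^*, \xi^*, |w^*|^{1/2}, b^*) \in F$. Moreover, we have $\bar t_j^2 \ge |\bar w_j|$ for all $j = 1, 2, \ldots, m$, and $(\bar w, \bar\xi, \bar b) \in D$, and hence

$$f(\bar z) = \sum_{j=1}^{m} \bar t_j + C\sum_{i=1}^{n} \bar\xi_i \ \ge\ \sum_{j=1}^{m} |\bar w_j|^{1/2} + C\sum_{i=1}^{n} \bar\xi_i \ \ge\ \sum_{j=1}^{m} |w_j^*|^{1/2} + C\sum_{i=1}^{n} \xi_i^* = \phi(u^*). \qquad (11)$$

Since $z^* \in F$, we have

$$\phi(u^*) = f(z^*) \ge f(\bar z). \qquad (12)$$

This together with (11) implies that $f(\bar z) = \phi(u^*)$. The proof is complete. □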
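The key step of the proof, that the lifted point with $t_j = |w_j|^{1/2}$ is feasible and gives the two objectives the same value, can be illustrated as follows. This is only a sketch on a hypothetical point: the helper names and the numbers are assumptions, and the margin constraints, which are identical in both problems, are left out.

```python
import math


def phi(w, xi, C):
    """Objective of the L1/2-SVM: sum_j |w_j|^(1/2) + C * sum_i xi_i."""
    return sum(math.sqrt(abs(wj)) for wj in w) + C * sum(xi)


def f(t, xi, C):
    """Linear objective of the reformulated problem: sum_j t_j + C * sum_i xi_i."""
    return sum(t) + C * sum(xi)


def satisfies_t_constraints(w, t, tol=1e-12):
    """Check t_j^2 - w_j >= 0, t_j^2 + w_j >= 0 and t_j >= 0 up to rounding."""
    return all(tj * tj - wj >= -tol and tj * tj + wj >= -tol and tj >= 0.0
               for wj, tj in zip(w, t))


w, xi, C = [0.25, -4.0, 0.0], [0.5, 0.0], 2.0   # hypothetical point
t_tight = [math.sqrt(abs(wj)) for wj in w]      # t_j = |w_j|^(1/2)
t_loose = [tj + 0.1 for tj in t_tight]          # feasible, but not tight
```

Any feasible `t` other than `t_tight` can only increase the linear objective, which is why minimizing over `t` recovers the original penalty.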

It is clear that the constraint functions of (10) are convex. Consequently, at any feasible point, the set of all feasible directions is the same as the set of all linearized feasible directions.

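As a result, the KKT point exists. The KKT system of problem (10) can be written as the following system of nonlinear equations:

$$R(w, \xi, t, b, \lambda) = \begin{pmatrix} \lambda^{(1)} - \lambda^{(2)} - X^T Y^T \lambda^{(4)} \\ C e_n - \lambda^{(4)} - \lambda^{(5)} \\ e_m - 2T\lambda^{(1)} - 2T\lambda^{(2)} - \lambda^{(3)} \\ y^T \lambda^{(4)} \\ \min\{t_j^2 - w_j,\ \lambda_j^{(1)}\}_{j=1,2,\ldots,m} \\ \min\{t_j^2 + w_j,\ \lambda_j^{(2)}\}_{j=1,2,\ldots,m} \\ \min\{t_j,\ \lambda_j^{(3)}\}_{j=1,2,\ldots,m} \\ \min\{p_i,\ \lambda_i^{(4)}\}_{i=1,2,\ldots,n} \\ \min\{\xi_i,\ \lambda_i^{(5)}\}_{i=1,2,\ldots,n} \end{pmatrix} = 0, \qquad (13)$$

where $\lambda = (\lambda^{(1)}, \lambda^{(2)}, \lambda^{(3)}, \lambda^{(4)}, \lambda^{(5)})$ are the Lagrangian multipliers, $X = (x_1, x_2, \ldots, x_n)^T$, $Y = \mathrm{diag}(y)$ and $T = \mathrm{diag}(t)$ are diagonal matrices, and $p_i = y_i(w^T x_i + b) - (1 - \xi_i)$, $i = 1, 2, \ldots, n$.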

For the sake of simplicity, the properties of the reformulation to L1/2-SVM are shown in Appendix A.

3. An Interior Point Method

In this section, we develop an interior point method to solve the equivalent constrained problem (10) of the L1/2-SVM (7).

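3.1. Auxiliary Function. Following the idea of the interior point method, the constrained problem (10) can be solved by minimizing a sequence of logarithmic barrier functions as follows:

$$\min \ \Phi_\mu(w, \xi, t, b) = \sum_{j=1}^{m} t_j + C\sum_{i=1}^{n} \xi_i - \mu\sum_{j=1}^{m} \bigl[\log(t_j^2 - w_j) + \log(t_j^2 + w_j) + \log t_j\bigr] - \mu\sum_{i=1}^{n} (\log p_i + \log \xi_i)$$
$$\text{s.t. } t_j^2 - w_j > 0, \quad t_j^2 + w_j > 0, \quad t_j > 0, \quad j = 1, 2, \ldots, m,$$
$$p_i = y_i(w^T x_i + b) + \xi_i - 1 > 0, \quad \xi_i > 0, \quad i = 1, 2, \ldots, n, \qquad (14)$$

where $\mu$ is the barrier parameter, converging to zero from above.

The KKT system of problem (14) is the following system of nonlinear equations:

$$\lambda^{(1)} - \lambda^{(2)} - X^T Y^T \lambda^{(4)} = 0,$$
$$C e_n - \lambda^{(4)} - \lambda^{(5)} = 0,$$
$$e_m - 2T\lambda^{(1)} - 2T\lambda^{(2)} - \lambda^{(3)} = 0,$$
$$y^T \lambda^{(4)} = 0,$$
$$(T^2 - W)\lambda^{(1)} - \mu e_m = 0, \qquad (15)$$
$$(T^2 + W)\lambda^{(2)} - \mu e_m = 0,$$
$$T\lambda^{(3)} - \mu e_m = 0,$$
$$P\lambda^{(4)} - \mu e_n = 0,$$
$$\Xi\lambda^{(5)} - \mu e_n = 0,$$

where $\lambda = (\lambda^{(1)}, \lambda^{(2)}, \lambda^{(3)}, \lambda^{(4)}, \lambda^{(5)})$ are the Lagrangian multipliers, $T = \mathrm{diag}(t)$, $W = \mathrm{diag}(w)$, $\Xi = \mathrm{diag}(\xi)$, and $P = \mathrm{diag}(Y(Xw + b e_n) + \xi - e_n)$ are diagonal matrices, and $e_m \in \mathbb{R}^m$ and $e_n \in \mathbb{R}^n$ stand for the vectors whose elements are all ones.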
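To make the barrier problem concrete, the sketch below evaluates the logarithmic barrier function at a given point and returns infinity outside the strict interior of the feasible region. The function name, the argument layout, and the toy point are assumptions made for this example.

```python
import math


def barrier_value(w, xi, t, b, X, y, C, mu):
    """Barrier function: sum(t) + C*sum(xi) - mu * (sum of logs of all slacks)."""
    slacks = []
    for wj, tj in zip(w, t):
        slacks += [tj * tj - wj, tj * tj + wj, tj]
    for x_i, y_i, xi_i in zip(X, y, xi):
        p_i = y_i * (sum(wj * xj for wj, xj in zip(w, x_i)) + b) + xi_i - 1.0
        slacks += [p_i, xi_i]
    if min(slacks) <= 0.0:
        return math.inf  # not strictly feasible
    return sum(t) + C * sum(xi) - mu * sum(math.log(s) for s in slacks)


# Hypothetical one-sample, one-feature point: every slack is 1 except xi = 2.
inside = barrier_value([0.0], [2.0], [1.0], 0.0, [[1.0]], [1], C=1.0, mu=0.5)
outside = barrier_value([0.0], [2.0], [0.0], 0.0, [[1.0]], [1], C=1.0, mu=0.5)
```

Returning infinity for infeasible points is one simple way to let a line search reject steps that leave the interior.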

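3.2. Newton's Method. We apply Newton's method to solve the nonlinear system (15) in the variables $w$, $\xi$, $t$, $b$, and $\lambda$. The subproblem of the method is the following system of linear equations:

$$M \begin{pmatrix} \Delta w \\ \Delta\xi \\ \Delta t \\ \Delta b \\ \Delta\lambda^{(1)} \\ \Delta\lambda^{(2)} \\ \Delta\lambda^{(3)} \\ \Delta\lambda^{(4)} \\ \Delta\lambda^{(5)} \end{pmatrix} = - \begin{pmatrix} \lambda^{(1)} - \lambda^{(2)} - X^T Y^T \lambda^{(4)} \\ C e_n - \lambda^{(4)} - \lambda^{(5)} \\ e_m - 2T\lambda^{(1)} - 2T\lambda^{(2)} - \lambda^{(3)} \\ y^T \lambda^{(4)} \\ (T^2 - W)\lambda^{(1)} - \mu e_m \\ (T^2 + W)\lambda^{(2)} - \mu e_m \\ T\lambda^{(3)} - \mu e_m \\ P\lambda^{(4)} - \mu e_n \\ \Xi\lambda^{(5)} - \mu e_n \end{pmatrix}, \qquad (16)$$

where $M$ is the Jacobian of the left-hand side function in (15) and takes the form

$$M = \begin{pmatrix}
0 & 0 & 0 & 0 & I & -I & 0 & -X^T Y^T & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & -I & -I \\
0 & 0 & -2(D_1 + D_2) & 0 & -2T & -2T & -I & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & y^T & 0 \\
-D_1 & 0 & 2TD_1 & 0 & T^2 - W & 0 & 0 & 0 & 0 \\
D_2 & 0 & 2TD_2 & 0 & 0 & T^2 + W & 0 & 0 & 0 \\
0 & 0 & D_3 & 0 & 0 & 0 & T & 0 & 0 \\
D_4 Y X & D_4 & 0 & D_4 y & 0 & 0 & 0 & P & 0 \\
0 & D_5 & 0 & 0 & 0 & 0 & 0 & 0 & \Xi
\end{pmatrix}, \qquad (17)$$

where $D_1 = \mathrm{diag}(\lambda^{(1)})$, $D_2 = \mathrm{diag}(\lambda^{(2)})$, $D_3 = \mathrm{diag}(\lambda^{(3)})$, $D_4 = \mathrm{diag}(\lambda^{(4)})$, and $D_5 = \mathrm{diag}(\lambda^{(5)})$. We can rewrite (16) as

$$(\lambda^{(1)} + \Delta\lambda^{(1)}) - (\lambda^{(2)} + \Delta\lambda^{(2)}) - X^T Y^T (\lambda^{(4)} + \Delta\lambda^{(4)}) = 0,$$
$$(\lambda^{(4)} + \Delta\lambda^{(4)}) + (\lambda^{(5)} + \Delta\lambda^{(5)}) = C e_n,$$
$$2(D_1 + D_2)\Delta t + 2T(\lambda^{(1)} + \Delta\lambda^{(1)}) + 2T(\lambda^{(2)} + \Delta\lambda^{(2)}) + (\lambda^{(3)} + \Delta\lambda^{(3)}) = e_m,$$
$$y^T(\lambda^{(4)} + \Delta\lambda^{(4)}) = 0,$$
$$-D_1\Delta w + 2TD_1\Delta t + (T^2 - W)(\lambda^{(1)} + \Delta\lambda^{(1)}) = \mu e_m, \qquad (18)$$
$$D_2\Delta w + 2TD_2\Delta t + (T^2 + W)(\lambda^{(2)} + \Delta\lambda^{(2)}) = \mu e_m,$$
$$D_3\Delta t + T(\lambda^{(3)} + \Delta\lambda^{(3)}) = \mu e_m,$$
$$D_4 Y X\Delta w + D_4\Delta\xi + D_4 y\,\Delta b + P(\lambda^{(4)} + \Delta\lambda^{(4)}) = \mu e_n,$$
$$D_5\Delta\xi + \Xi(\lambda^{(5)} + \Delta\lambda^{(5)}) = \mu e_n.$$

It follows from the last five equations that the vector $\hat\lambda = \lambda + \Delta\lambda$ can be expressed as

$$\hat\lambda^{(1)} = (T^2 - W)^{-1}(\mu e_m + D_1\Delta w - 2TD_1\Delta t),$$
$$\hat\lambda^{(2)} = (T^2 + W)^{-1}(\mu e_m - D_2\Delta w - 2TD_2\Delta t),$$
$$\hat\lambda^{(3)} = T^{-1}(\mu e_m - D_3\Delta t), \qquad (19)$$
$$\hat\lambda^{(4)} = P^{-1}(\mu e_n - D_4 Y X\Delta w - D_4\Delta\xi - D_4 y\,\Delta b),$$
$$\hat\lambda^{(5)} = \Xi^{-1}(\mu e_n - D_5\Delta\xi).$$

Substituting (19) into the first four equations of (18), we obtain

$$S \begin{pmatrix} \Delta w \\ \Delta\xi \\ \Delta t \\ \Delta b \end{pmatrix} = \begin{pmatrix} -\mu(T^2 - W)^{-1}e_m + \mu(T^2 + W)^{-1}e_m + \mu X^T Y^T P^{-1}e_n \\ -C e_n + \mu P^{-1}e_n + \mu\Xi^{-1}e_n \\ -e_m + 2\mu T(T^2 - W)^{-1}e_m + 2\mu T(T^2 + W)^{-1}e_m + \mu T^{-1}e_m \\ \mu y^T P^{-1}e_n \end{pmatrix}, \qquad (20)$$

where the matrix $S$ takes the form

$$S = \begin{pmatrix} S_{11} & S_{12} & S_{13} & S_{14} \\ S_{21} & S_{22} & S_{23} & S_{24} \\ S_{31} & S_{32} & S_{33} & S_{34} \\ S_{41} & S_{42} & S_{43} & S_{44} \end{pmatrix} \qquad (21)$$

with blocks

$$S_{11} = U + V + X^T Y^T P^{-1} D_4 Y X, \quad S_{12} = X^T Y^T P^{-1} D_4, \quad S_{13} = -2(U - V)T, \quad S_{14} = X^T Y^T P^{-1} D_4 y,$$
$$S_{21} = P^{-1} D_4 Y X, \quad S_{22} = P^{-1} D_4 + \Xi^{-1} D_5, \quad S_{23} = 0, \quad S_{24} = P^{-1} D_4 y,$$
$$S_{31} = -2(U - V)T, \quad S_{32} = 0, \quad S_{33} = 4T(U + V)T + T^{-1} D_3 - 2(D_1 + D_2), \quad S_{34} = 0, \qquad (22)$$
$$S_{41} = y^T P^{-1} D_4 Y X, \quad S_{42} = y^T P^{-1} D_4, \quad S_{43} = 0, \quad S_{44} = y^T P^{-1} D_4 y,$$

and $U = (T^2 - W)^{-1} D_1$ and $V = (T^2 + W)^{-1} D_2$.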
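As a worked check of the elimination, the third block row of the reduced system can be derived by hand. The sketch below writes $\hat\lambda = \lambda + \Delta\lambda$ (notation assumed here), substitutes the expressions for $\hat\lambda^{(1)}$, $\hat\lambda^{(2)}$, and $\hat\lambda^{(3)}$ into the third equation of (18), and uses the fact that diagonal matrices commute.

```latex
\begin{aligned}
e_m &= 2(D_1 + D_2)\Delta t + 2T\hat\lambda^{(1)} + 2T\hat\lambda^{(2)} + \hat\lambda^{(3)} \\
    &= 2(D_1 + D_2)\Delta t
       + 2T(T^2 - W)^{-1}\bigl(\mu e_m + D_1\Delta w - 2TD_1\Delta t\bigr) \\
    &\quad + 2T(T^2 + W)^{-1}\bigl(\mu e_m - D_2\Delta w - 2TD_2\Delta t\bigr)
       + T^{-1}\bigl(\mu e_m - D_3\Delta t\bigr) \\
    &= 2(U - V)T\,\Delta w
       - \bigl[\,4T(U + V)T + T^{-1}D_3 - 2(D_1 + D_2)\bigr]\Delta t \\
    &\quad + 2\mu T(T^2 - W)^{-1}e_m + 2\mu T(T^2 + W)^{-1}e_m + \mu T^{-1}e_m ,
\end{aligned}
\qquad U = (T^2 - W)^{-1}D_1,\quad V = (T^2 + W)^{-1}D_2 .
```

Moving the terms in $\Delta w$ and $\Delta t$ to the left-hand side gives $-2(U - V)T\,\Delta w + [4T(U + V)T + T^{-1}D_3 - 2(D_1 + D_2)]\Delta t = -e_m + 2\mu T(T^2 - W)^{-1}e_m + 2\mu T(T^2 + W)^{-1}e_m + \mu T^{-1}e_m$, which matches the blocks $S_{31}$ and $S_{33}$ and the third right-hand side entry.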

3.3. The Interior Point Algorithm. Let $z = (w, \xi, t, b)$ and $\lambda = (\lambda^{(1)}, \lambda^{(2)}, \lambda^{(3)}, \lambda^{(4)}, \lambda^{(5)})$; we first present the interior point algorithm to solve the barrier problem (14) and then discuss the details of the algorithm.

Algorithm 2. The interior point algorithm (IPA) is as follows.

Step 0. Given tolerance $\epsilon_\mu$, set $\tau_1 \in (0, 1/2)$, $l > 0$, $\gamma_1 > 1$, $\beta \in (0,1)$. Let $k = 0$.

Step 1. Stop if the KKT condition (15) holds.

Step 2. Compute $\Delta z^k$ from (20) and $\hat\lambda^{k+1}$ from (19). Compute $z^{k+1} = z^k + \alpha_k\Delta z^k$. Update the Lagrangian multipliers to obtain $\lambda^{k+1}$.

Step 3. Let $k := k + 1$. Go to Step 1.

In Step 2, a step length $\alpha_k$ is used to calculate $z^{k+1}$. We estimate $\alpha_k$ by the Armijo line search [34], in which $\alpha_k = \max\{\beta^j \mid j = 0, 1, 2, \ldots\}$ for some $\beta \in (0,1)$ and satisfies the following inequalities (the first three understood componentwise):

$$(t^{k+1})^2 - w^{k+1} > 0,$$
$$(t^{k+1})^2 + w^{k+1} > 0,$$
$$t^{k+1} > 0,$$
$$\Phi_\mu(z^{k+1}) - \Phi_\mu(z^k) \le \tau_1\alpha_k \nabla\Phi_\mu(z^k)^T \Delta z^k, \qquad (23)$$

where $\tau_1 \in (0, 1/2)$.
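A backtracking search of this kind can be sketched generically. The code below is an illustration under stated assumptions, not the authors' implementation: `phi` and `is_strictly_feasible` are caller-supplied callables, `slope` is the directional derivative of `phi` along `dz`, and the default constants are placeholders. The toy problem is a one-dimensional barrier function chosen for this example.

```python
import math


def armijo_step(phi, is_strictly_feasible, z, dz, slope,
                tau=0.25, beta=0.6, max_backtracks=60):
    """Largest alpha = beta**j that keeps z + alpha*dz strictly feasible and
    satisfies phi(z + alpha*dz) - phi(z) <= tau * alpha * slope."""
    phi0 = phi(z)
    alpha = 1.0
    for _ in range(max_backtracks):
        trial = [zi + alpha * di for zi, di in zip(z, dz)]
        # Feasibility is tested first so phi is never evaluated outside its domain.
        if is_strictly_feasible(trial) and phi(trial) - phi0 <= tau * alpha * slope:
            return alpha, trial
        alpha *= beta
    raise RuntimeError("no acceptable step length found")


# Toy barrier function x - log(x) at x = 4 with its Newton direction.
toy_phi = lambda z: z[0] - math.log(z[0])
grad = 1.0 - 1.0 / 4.0             # derivative at x = 4
newton_dir = -grad / (1.0 / 16.0)  # minus gradient over second derivative
alpha, trial = armijo_step(toy_phi, lambda z: z[0] > 0.0,
                           [4.0], [newton_dir], slope=grad * newton_dir)
```

In the toy run the full Newton step leaves the domain, so the search backtracks three times before a strictly feasible point with sufficient decrease is found.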

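To avoid ill-conditioned growth of $\lambda^k$ and guarantee strict dual feasibility, the Lagrangian multipliers $\lambda^k$ should be sufficiently positive and bounded from above. Following a similar idea of [33], we first update the dual multipliers by

$$\tilde\lambda_j^{(1)(k+1)} = \begin{cases}
\min\Bigl\{l, \dfrac{\mu}{(t_j^k)^2 - w_j^k}\Bigr\}, & \text{if } \hat\lambda_j^{(1)(k+1)} < \min\Bigl\{l, \dfrac{\mu}{(t_j^k)^2 - w_j^k}\Bigr\}, \\[2mm]
\hat\lambda_j^{(1)(k+1)}, & \text{if } \min\Bigl\{l, \dfrac{\mu}{(t_j^k)^2 - w_j^k}\Bigr\} \le \hat\lambda_j^{(1)(k+1)} \le \dfrac{\mu\gamma_1}{(t_j^k)^2 - w_j^k}, \\[2mm]
\dfrac{\mu\gamma_1}{(t_j^k)^2 - w_j^k}, & \text{if } \hat\lambda_j^{(1)(k+1)} > \dfrac{\mu\gamma_1}{(t_j^k)^2 - w_j^k},
\end{cases} \qquad (24)$$

and in the same way for $\tilde\lambda_j^{(2)(k+1)}$ and $\tilde\lambda_j^{(3)(k+1)}$, $j = 1, \ldots, m$, with the denominator $(t_j^k)^2 - w_j^k$ replaced by $(t_j^k)^2 + w_j^k$ and $t_j^k$, respectively. Similarly,

$$\tilde\lambda_i^{(4)(k+1)} = \begin{cases}
\min\Bigl\{l, \dfrac{\mu}{p_i^k}\Bigr\}, & \text{if } \hat\lambda_i^{(4)(k+1)} < \min\Bigl\{l, \dfrac{\mu}{p_i^k}\Bigr\}, \\[2mm]
\hat\lambda_i^{(4)(k+1)}, & \text{if } \min\Bigl\{l, \dfrac{\mu}{p_i^k}\Bigr\} \le \hat\lambda_i^{(4)(k+1)} \le \dfrac{\mu\gamma_1}{p_i^k}, \\[2mm]
\dfrac{\mu\gamma_1}{p_i^k}, & \text{if } \hat\lambda_i^{(4)(k+1)} > \dfrac{\mu\gamma_1}{p_i^k},
\end{cases} \qquad (25)$$

where $p_i = y_i(w^T x_i + b) + \xi_i - 1$, and in the same way for $\tilde\lambda_i^{(5)(k+1)}$, $i = 1, \ldots, n$, with $p_i^k$ replaced by $\xi_i^k$. Here the parameters $l$ and $\gamma_1$ satisfy $l > 0$ and $\gamma_1 > 1$.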

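Since positive definiteness of the matrix $S$ is demanded in this method, the Lagrangian multipliers $\lambda^{k+1}$ should satisfy the following condition (componentwise):

$$\lambda^{(3)} - 2(\lambda^{(1)} + \lambda^{(2)})\,t \ge 0. \qquad (26)$$

For the sake of simplicity, the proof is given in Appendix B.

Therefore, if $\tilde\lambda^{k+1}$ satisfies (26), we let $\lambda^{k+1} = \tilde\lambda^{k+1}$. Otherwise, we further update it by the following setting:

$$\lambda^{(1)(k+1)} = \gamma_2\tilde\lambda^{(1)(k+1)}, \quad \lambda^{(2)(k+1)} = \gamma_2\tilde\lambda^{(2)(k+1)}, \quad \lambda^{(3)(k+1)} = \gamma_3\tilde\lambda^{(3)(k+1)}, \qquad (27)$$

where the constants $\gamma_2 \in (0,1)$ and $\gamma_3 > 1$ satisfy

$$\frac{\gamma_3}{\gamma_2} = \max_{j \in E} \frac{2t_j^{k+1}\bigl(\tilde\lambda_j^{(1)(k+1)} + \tilde\lambda_j^{(2)(k+1)}\bigr)}{\tilde\lambda_j^{(3)(k+1)}}, \qquad (28)$$

with $E = \{1, 2, \ldots, m\}$. It is not difficult to see that the vector $(\lambda^{(1)(k+1)}, \lambda^{(2)(k+1)}, \lambda^{(3)(k+1)})$ determined by (27) satisfies (26).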
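The two multiplier safeguards can be sketched as follows. This is an illustration under assumptions rather than the paper's code: the first helper clips a multiplier estimate to the interval $[\min\{l, \mu/s\},\ \mu\gamma_1/s]$ for a constraint with current slack $s > 0$, and the second rescales the first three multiplier blocks so that $\lambda_j^{(3)} - 2(\lambda_j^{(1)} + \lambda_j^{(2)})t_j \ge 0$ for every $j$. Taking $\gamma_2$ and $\gamma_3$ as the reciprocal square root and square root of the largest ratio is one admissible choice introduced here; the source only requires $\gamma_2 \in (0,1)$, $\gamma_3 > 1$, and the stated value of $\gamma_3/\gamma_2$.

```python
def clip_multiplier(lam_hat, slack, mu, l=1e-5, gamma1=1e5):
    """Clip lam_hat to [min(l, mu/slack), mu*gamma1/slack] for a slack > 0."""
    lower = min(l, mu / slack)
    upper = mu * gamma1 / slack
    return max(lower, min(lam_hat, upper))


def rescale_multipliers(lam1, lam2, lam3, t):
    """Return multipliers with lam3_j - 2*(lam1_j + lam2_j)*t_j >= 0 for all j."""
    ratio = max(2.0 * tj * (a + c) / d for a, c, d, tj in zip(lam1, lam2, lam3, t))
    if ratio <= 1.0:                 # the condition already holds
        return list(lam1), list(lam2), list(lam3)
    gamma2 = ratio ** -0.5           # assumed choice, lies in (0, 1)
    gamma3 = ratio ** 0.5            # assumed choice, exceeds 1
    return ([gamma2 * a for a in lam1],
            [gamma2 * c for c in lam2],
            [gamma3 * d for d in lam3])


new1, new2, new3 = rescale_multipliers([1.0], [1.0], [1.0], [2.0])
```

With the sample values the largest ratio is 8, so the first two blocks are scaled down and the third is scaled up until the condition holds with equality.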

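In practice, the KKT conditions (15) are allowed to be satisfied within a tolerance $\epsilon_\mu$. That is, the iterative process stops when the following inequalities hold:

$$\|\mathrm{Res}(z, \lambda, \mu)\| = \left\| \begin{pmatrix} \lambda^{(1)} - \lambda^{(2)} - X^T Y^T \lambda^{(4)} \\ C e_n - \lambda^{(4)} - \lambda^{(5)} \\ e_m - 2T\lambda^{(1)} - 2T\lambda^{(2)} - \lambda^{(3)} \\ y^T \lambda^{(4)} \\ (T^2 - W)\lambda^{(1)} - \mu e_m \\ (T^2 + W)\lambda^{(2)} - \mu e_m \\ T\lambda^{(3)} - \mu e_m \\ P\lambda^{(4)} - \mu e_n \\ \Xi\lambda^{(5)} - \mu e_n \end{pmatrix} \right\| < \epsilon_\mu, \qquad \lambda \ge -\epsilon_\mu e, \qquad (29)$$

where $\epsilon_\mu$ is related to the current barrier parameter $\mu$ and satisfies $\epsilon_\mu \downarrow 0$ as $\mu \to 0$.

Since the function $\Phi_\mu$ in (14) is convex, we have the following lemma, which shows that Algorithm 2 is well defined (the proof is given in Appendix B).

Lemma 3. Let $z^k$ be strictly feasible for problem (10). If $\Delta z^k = 0$, then (23) is satisfied for all $\alpha_k \ge 0$. If $\Delta z^k \ne 0$, then there exists an $\bar\alpha_k \in (0, 1]$ such that (23) holds for all $\alpha_k \in (0, \bar\alpha_k]$.

The proposed interior point method successively solves the barrier subproblem (14) with a decreasing sequence $\{\mu_j\}$. We simply reduce both $\epsilon_\mu$ and $\mu$ by a constant factor $\theta \in (0,1)$. Finally, we test optimality for problem (10) by means of the residual norm $\|\mathrm{Res}(z, \lambda, 0)\|$.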

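Here, we present the whole algorithm to solve the L1/2-SVM problem (10).

Algorithm 4. The algorithm for solving the L1/2-SVM problem is as follows.

Step 0. Set $w^0 \in \mathbb{R}^m$, $b^0 \in \mathbb{R}$, $\xi^0 \in \mathbb{R}^n$, $\lambda^0 = \hat\lambda^0 > 0$, and $t_i^0 \ge \sqrt{|w_i^0|} + 1/2$, $i \in \{1, 2, \ldots, m\}$. Given constants $\mu_0 > 0$, $\epsilon_{\mu_0} > 0$, $\theta \in (0,1)$, and $\epsilon > 0$, let $j = 0$.

Step 1. Stop if $\|\mathrm{Res}(z^j, \lambda^j, 0)\| \le \epsilon$ and $\lambda^j \ge 0$.

Step 2. Starting from $(z^j, \lambda^j)$, apply Algorithm 2 to solve (14) with barrier parameter $\mu_j$ and stopping tolerance $\epsilon_{\mu_j}$. Set $z^{j+1} = z^{j,k}$, $\hat\lambda^{j+1} = \hat\lambda^{j,k}$, and $\lambda^{j+1} = \lambda^{j,k}$.

Step 3. Set $\mu_{j+1} = \theta\mu_j$ and $\epsilon_{\mu_{j+1}} = \theta\epsilon_{\mu_j}$. Let $j := j + 1$ and go to Step 1.

In Algorithm 4, the index $j$ denotes an outer iteration, while $k$ denotes the last inner iteration of Algorithm 2.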
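The outer and inner structure can be illustrated on a deliberately tiny problem. The sketch below is not the L1/2-SVM solver: it minimizes $x$ subject to $x \ge 0$ with the barrier function $x - \mu\log x$, whose minimizer is $x = \mu$, and it shrinks the barrier parameter and the inner tolerance by a common factor after every inner solve. All names, the factor `theta`, and the starting values are assumptions made for this example.

```python
def solve_barrier_1d(x, mu, tol):
    """Damped Newton method on x - mu*log(x); stops when |1 - mu/x| <= tol."""
    while abs(1.0 - mu / x) > tol:
        step = -(1.0 - mu / x) / (mu / (x * x))  # Newton direction
        alpha = 1.0
        while x + alpha * step <= 0.0:           # keep the iterate strictly feasible
            alpha *= 0.5
        x += alpha * step
    return x


def outer_loop(x0=1.0, mu=1.0, tol=0.1, theta=0.5, eps=1e-8):
    """Solve a sequence of barrier problems with decreasing mu and tolerance."""
    x = x0
    # For this toy problem the multiplier is mu/x, so the complementarity
    # residual at an inner solution equals mu; it serves as the stopping test.
    while mu > eps:
        x = solve_barrier_1d(x, mu, tol)         # inner solve, warm started
        mu, tol = theta * mu, theta * tol        # shrink both parameters
    return x


x_final = outer_loop()
```

Each inner solve is warm started from the previous solution, which is what keeps the number of inner iterations small as the barrier parameter decreases.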

The convergence of the proposed interior point method can be proved. We list the theorem here and give the proof in Appendix C.

Theorem 5. Let $\{(z^k, \lambda^k)\}$ be generated by Algorithm 2. Then any limit point of the sequence $\{(z^k, \lambda^k)\}$ satisfies the first-order optimality conditions (15).

Theorem 6. Let $\{(z^j, \lambda^j)\}$ be generated by Algorithm 4 by ignoring its termination condition. Then the following statements are true.

(i) The limit point of $\{(z^j, \lambda^j)\}$ satisfies the first-order optimality condition (13).

(ii) The limit point $z^*$ of a convergent subsequence $\{z^j\}_J \subset \{z^j\}$ with unbounded multipliers $\{\lambda^j\}_J$ is a Fritz-John point [35] of problem (10).

4. Experiments

In this section, we tested the constrained optimization reformulation to the L1/2-SVM and the proposed interior point method. We compared the performance of the L1/2-SVM with the L2-SVM [30], L1-SVM [12], and L0-SVM [12] on artificial data and ten UCI data sets (http://archive.ics.uci.edu/ml/). These four problems were solved in the primal, with reference to the machine learning toolbox Spider (http://people.kyb.tuebingen.mpg.de/spider/). The L2-SVM and L1-SVM were solved directly by quadratic programming and linear programming, respectively. For the NP-hard L0-SVM, a commonly cited approximation method, Feature Selection Concave (FSV) [12], was applied, and the FSV problem was then solved by a Successive Linear Approximation (SLA) algorithm. All the experiments were run on a personal computer (1.6 GHz CPU, 4 GB of RAM) with MATLAB R2010b on 64-bit Windows 7.

In the proposed interior point method (Algorithms 2 and 4), we set $\tau_1 = 0.5$, $l = 0.00001$, and $\gamma_1 = 100000$; the two remaining constants were set to 0.6 and 0.5. The balance parameter $C$ was selected by 5-fold cross-validation on the training set over the range $\{2^{-10}, 2^{-9}, \ldots, 2^{10}\}$. After training, the weights that did not satisfy the relative-magnitude criterion of [14], a threshold on $|w_j| / \max_i |w_i|$, were set to zero. Then the cardinality of the hyperplane was computed as the number of nonzero weights.
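The post-processing step can be sketched as below. The threshold value is not recoverable from the text, so `rel_tol` is a hypothetical parameter and the value used in the example call is purely illustrative.

```python
def prune_and_count(w, rel_tol):
    """Zero out weights with |w_j| / max_i |w_i| < rel_tol and count the rest."""
    w_max = max(abs(wj) for wj in w)
    if w_max == 0.0:
        return [0.0] * len(w), 0
    pruned = [wj if abs(wj) / w_max >= rel_tol else 0.0 for wj in w]
    return pruned, sum(1 for wj in pruned if wj != 0.0)


# Illustrative weights and tolerance (both assumed).
pruned, card = prune_and_count([2.0, -1e-7, 0.5, 0.0], rel_tol=1e-4)
```

Using a relative rather than an absolute threshold makes the count insensitive to the overall scale of the weight vector.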

4.1. Artificial Data. First, we took an artificial binary linear classification problem as an example. The problem is similar to that in [13]. The labels $y = 1$ and $y = -1$ are equally probable. The first 6 features are relevant but redundant. In 70% of the samples, the first three features $\{x_1, x_2, x_3\}$ were drawn as $x_i = yN(i, 1)$ and the second three features $\{x_4, x_5, x_6\}$ as $x_i = N(0, 1)$. Otherwise, the first three were drawn as $x_i = N(0, 1)$ and the second three as $x_i = yN(i - 3, 1)$. The remaining features are noise, $x_i = N(0, 20)$, $i = 7, \ldots, m$. Here, $m$ is the dimension of the input features. The inputs were scaled to zero mean and unit standard deviation. In each trial, 500 points were generated for testing, and the average results were estimated over 30 trials.
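A generator for data of this kind might look as follows. This is a sketch under assumptions: the helper name is hypothetical, the second argument of $N(\cdot,\cdot)$ is treated as a standard deviation (the text does not say whether 20 is a variance or a standard deviation), and the final standardization step is omitted.

```python
import random


def make_sample(m, rng):
    """Draw one (x, y) pair with 6 relevant features and m - 6 noise features."""
    y = rng.choice((-1, 1))
    x = [0.0] * m
    first_group_informative = rng.random() < 0.7
    for i in range(1, 7):                    # features x_1 .. x_6 (1-based)
        if i <= 3:
            x[i - 1] = (y * rng.gauss(i, 1.0) if first_group_informative
                        else rng.gauss(0.0, 1.0))
        else:
            x[i - 1] = (rng.gauss(0.0, 1.0) if first_group_informative
                        else y * rng.gauss(i - 3, 1.0))
    for i in range(7, m + 1):                # noise features
        x[i - 1] = rng.gauss(0.0, 20.0)      # assumption: 20 is a standard deviation
    return x, y


rng = random.Random(0)
samples = [make_sample(30, rng) for _ in range(200)]
```

Within each group the feature with the largest mean, $x_3$ or $x_6$, carries the strongest signal, which is why these two are treated as the most important features.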

In the first experiment, we consider the cases with the fixed feature size $m = 30$ and different training sample sizes $n = 10, 20, \ldots, 100$. The average results over the 30 trials are shown in Table 1 and Figure 2. Figure 2 (left) plots the average cardinality of each classifier. Since the artificial data sets have 2 relevant and nonredundant features, the ideal average cardinality is 2. Figure 2 (left) shows that the three sparse SVMs, L1-SVM, L1/2-SVM, and L0-SVM, can achieve sparse solutions, while the L2-SVM uses almost all features in each data set. Furthermore, the solutions of the L1/2-SVM and L0-SVM are much sparser than those of the L1-SVM. As shown in Table 1, the L1-SVM selects more than 6 features in all cases, which implies that some redundant or irrelevant features are selected. The average cardinalities of the L1/2-SVM and L0-SVM are similar and close to 2. However, when $n = 10$ and 20, the L0-SVM has average cardinalities of 1.43 and 1.87, respectively. This means that the L0-SVM sometimes selects only one feature on small-sample data sets and may ignore a truly relevant feature. Consequently, with cardinalities between 2.05 and 2.90, the L1/2-SVM gives a more reliable solution than the L0-SVM. In short, as far as the number of selected features is concerned, the L1/2-SVM behaves better than the other three methods.

Figure 2 (right) plots the trend of the prediction accuracy versus the size of the training sample. The classification performance of all methods generally improves as the training sample size $n$ increases. The L1-SVM has the best prediction performance in all cases and is slightly better than the L1/2-SVM. The L1/2-SVM is more accurate in classification than the L2-SVM and L0-SVM, especially for $n = 10, \ldots, 50$. As shown in Table 1, when there are only 10 training samples, the average accuracy of the L1/2-SVM is 88.05%, while the results of the L2-SVM and L0-SVM are 84.65% and 77.65%, respectively. Compared with the L2-SVM and L0-SVM, the L1/2-SVM thus increases the average accuracy by 3.4 and 10.4 percentage points, respectively, which can be explained as follows. In the L2-SVM, all features are selected without discrimination, and the prediction is misled by the irrelevant features. In the L0-SVM, few features are selected, and some relevant features are not included, which has a negative impact on the prediction result. As a tradeoff between the L2-SVM and L0-SVM, the L1/2-SVM performs better than both.

The average results over the ten artificial data sets in the first experiment are shown at the bottom of Table 1. On average, the accuracy of the L1/2-SVM is 0.87 percentage points lower than that of the L1-SVM, while the L1/2-SVM selects 74.14% fewer features than the L1-SVM. This indicates that the L1/2-SVM can achieve a much sparser solution than the L1-SVM at little cost in accuracy. Moreover, the average accuracy of the L1/2-SVM over the 10 data sets is 2.22 percentage points higher than that of the L0-SVM with similar cardinality. To sum up, the L1/2-SVM provides the best balance between accuracy and sparsity among the three sparse SVMs.

To further evaluate the feature selection performance of the L1/2-SVM, we investigate whether the features are correctly selected. Since the L2-SVM is not designed for feature selection, it is not included in this comparison. Since our artificial data sets have 2 best features ($x_3$, $x_6$), the best result should rank these two features at the top according to the absolute values of their weights $|w_j|$. In the experiment, we select the top 2 features with the maximal $|w_j|$ for each method and calculate the frequency with which the top 2 features are $x_3$ and $x_6$ in 30 runs. The results are listed in Table 2. When the training sample size is too small, it is difficult for all sparse SVMs to discriminate the two most important features. For example, when $n = 10$, the selected frequencies of the L1-SVM, L0-SVM, and L1/2-SVM are 7, 3, and 9, respectively. When $n$ increases, all methods tend to make more correct selections. Moreover, Table 2 shows that the L1/2-SVM matches or outperforms the other two methods in all cases. For example, when $n = 100$, the selected frequencies of the L1-SVM and L0-SVM are 22 and 25, respectively, and the result of the L1/2-SVM is 27. The L1-SVM selects too many redundant or irrelevant features, which may influence the ranking to some extent. Therefore, the L1-SVM is not as good as the L1/2-SVM at distinguishing the critical features. The L0-SVM has a lower hit frequency than the L1/2-SVM, which is probably due to the excessively small feature subset it obtains. Above all, Tables 1 and 2 and Figure 2 clearly show that the L1/2-SVM is a promising sparsity-driven classification method.

In the second simulation, we consider the cases with various dimensions of the feature space $m = 20, 40, \ldots, 180, 200$ and the fixed training sample size $n = 100$. The average results over 30 trials are shown in Figure 3 and Table 3. Since there are still only 6 relevant features, a larger $m$ means more noise features. Figure 3 (left) shows that as the dimension increases from 20 to 200, the number of features selected by the L1-SVM increases from 8.26 to 23.1. However, the cardinalities of the L1/2-SVM and L0-SVM remain stable (from 2.2 to 2.95). This indicates that the L1/2-SVM and L0-SVM are more suitable for feature selection than the L1-SVM.

Figure 3 (right) shows that with the increasing of the noise features, the accuracy of L2-SVM drops significantly (from 98.68% to 87.47%). On the contrary, to the other three sparse

Ir----f

20 40 60 80

Number of training samples

20 40 60 80

Number of training samples

L2-SVM Lj-svm

L0-SVM L1/2-SVM

L2-SVM L1-SVM

L0-SVM

l1/2-svm

Figure 2: Results comparison on 10 artificial data sets with various training samples.

Table 1: Results comparison on 10 artificial data sets with various training sample sizes.

L j-SVM

L1/2-SVM

Card Acc Card Acc Card Acc Card Acc

10 29.97 84.65 6.93 88.55 1.43 77.65 2.25 88.05

20 30.00 92.21 8.83 96.79 1.87 93.17 2.35 95.50

30 30.00 94.27 8.47 98.23 2.03 94.67 2.05 97.41

40 29.97 95.99 9.00 98.75 2.07 95.39 2.10 97.75

50 30.00 96.61 9.50 98.91 2.13 96.57 2.45 97.68

60 29.97 97.25 10.00 98.94 2.20 97.61 2.45 97.98

70 30.00 97.41 11.00 98.98 2.20 97.56 2.40 98.23

80 30.00 97.68 10.03 99.09 2.23 97.61 2.60 98.50

90 29.97 97.89 9.20 99.21 2.30 97.41 2.90 98.30

100 30.00 98.13 10.27 99.09 2.50 97.93 2.55 98.36

On average 29.99 95.21 9.32 97.65 2.10 94.56 2.41 96.78

"n" is the number of training samples, "Card" represents the number of selected features, and "Acc" is the classification accuracy.

Table 2: Frequency of the most important features (x3, x6) ranked on top 2 in 30 runs.

n 10 20 30 40 50 60 70 80 90 100

L1-SVM 7 14 16 18 22 23 20 22 21 22

L 0-SVM 3 11 12 15 19 22 22 23 22 25

L 1/2-SVM 9 16 19 24 24 23 27 26 26 27

L 2-SVM

L 0-SVM

Table 3: Average results over 10 artificial data sets with varies dimension.

L 2-SVM L1-SVM L 0-SVM L 1/2-SVM

Cardinality 109.95 15.85 2.38 2.41

Accuracy 92.93 99.07 97.76 98.26

Table 4: Feature selection performance comparison on UCI data sets (Cardinality).

UCI Data Set

L j-SVM

L 1/2-SVM

Mean Std Mean Std Mean Std Mean Std

1 Pima diabetes 768 8 8.00 0.00 7.40 0.89 6.20 1.10 7.60 1.52

2 Breast Cancer 683 9 9.00 0.00 8.60 1.22 6.40 2.07 8.40 1.67

3 Wine (3) 178 13 13.00 0.00 9.73 0.69 3.73 0.28 4.07 0.49

4 Image (7) 2100 19 19.00 0.00 8.20 0.53 3.66 0.97 5.11 0.62

5 SPECT 267 22 22.00 0.00 17.40 5.27 9.80 4.27 9 1.17

6 WDBC 569 30 30.00 0.00 9.80 1.52 9.00 3.03 5.60 2.05

7 Ionosphere 351 34 33.20 0.45 25.00 3.35 24.60 10.21 21.80 8.44

8 SPECTF 267 44 44.00 0.00 38.80 10.98 24.00 11.32 25.60 7.75

9 Sonar 208 60 60.00 0.00 31.40 13.05 27.00 7.18 13.60 10.06

10 Muskl 476 166 166.00 0.00 85.80 18.85 48.60 4.28 41 14.16

Average cardinality 40.50 40.42 24.21 16.30 14.18

Table 5: Classification performance comparison on UCI data sets (accuracy).

| No. | UCI data set | L2-SVM Mean | L2-SVM Std | L1-SVM Mean | L1-SVM Std | L0-SVM Mean | L0-SVM Std | L1/2-SVM Mean | L1/2-SVM Std |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Pima diabetes | 76.99 | 3.47 | 76.73 | 0.85 | 76.73 | 2.94 | 77.12 | 1.52 |
| 2 | Breast cancer | 96.18 | 2.23 | 97.06 | 0.96 | 96.32 | 1.38 | 96.47 | 1.59 |
| 3 | Wine (3) | 97.71 | 3.73 | 98.29 | 1.28 | 93.14 | 2.39 | 97.14 | 2.56 |
| 4 | Image (7) | 88.94 | 1.49 | 91.27 | 1.77 | 86.67 | 4.58 | 91.36 | 1.94 |
| 5 | SPECT | 83.40 | 2.15 | 78.49 | 2.31 | 78.11 | 3.38 | 82.34 | 5.10 |
| 6 | WDBC | 96.11 | 2.11 | 96.46 | 1.31 | 95.58 | 1.15 | 96.46 | 1.31 |
| 7 | Ionosphere | 83.43 | 5.44 | 88.00 | 3.70 | 84.57 | 5.38 | 87.43 | 5.94 |
| 8 | SPECTF | 78.49 | 3.91 | 74.34 | 4.50 | 73.87 | 3.25 | 78.49 | 1.89 |
| 9 | Sonar | 73.66 | 5.29 | 75.61 | 5.82 | 74.15 | 7.98 | 76.10 | 4.35 |
| 10 | Musk1 | 84.84 | 1.73 | 82.11 | 2.42 | 76.00 | 5.64 | 82.32 | 3.38 |
| | Average accuracy | 85.97 | | 85.84 | | 83.51 | | 86.52 | |

SVMs, there is little change in the accuracy. This shows that SVMs can benefit from feature reduction.

Table 3 shows the average results over all data sets in the second experiment. On average, L1/2-SVM yields a much sparser solution than L1-SVM and a slightly better accuracy than L0-SVM.

4.2. UCI Data Sets. We further tested the reformulation and the proposed interior point method for L1/2-SVM on 10 UCI data sets [36]. There are 8 binary classification problems and 2 multiclass problems (Wine, Image). Each feature of the input data was normalized to zero mean and unit variance, and the instances with missing values were deleted. Then, the data were randomly split into a training set (80%) and a testing set (20%). For the two multiclass problems, a one-against-rest method was applied to construct a binary classifier for each class. We repeated the training and testing procedure 10 times, and the average results are shown in Tables 4 and 5 and Figure 4.
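The preprocessing protocol described above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' code: the function names `standardize`, `split_80_20`, and `one_vs_rest_labels`, the random seed, the synthetic data, and the handling of constant features are all hypothetical choices.

```python
import numpy as np

def standardize(X):
    """Scale each feature (column) to zero mean and unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0.0] = 1.0  # assumption: leave constant features unscaled
    return (X - mean) / std

def split_80_20(X, y, rng):
    """Randomly split the samples into 80% training and 20% testing."""
    perm = rng.permutation(X.shape[0])
    n_train = int(round(0.8 * X.shape[0]))
    tr, te = perm[:n_train], perm[n_train:]
    return X[tr], y[tr], X[te], y[te]

def one_vs_rest_labels(y, positive_class):
    """Relabel a multiclass target as +1 (chosen class) and -1 (rest)."""
    return np.where(y == positive_class, 1, -1)

rng = np.random.default_rng(0)                    # hypothetical seed
X = rng.normal(loc=3.0, scale=2.0, size=(50, 4))  # synthetic stand-in data
y = rng.integers(0, 3, size=50)                   # three synthetic classes
Xs = standardize(X)
X_tr, y_tr, X_te, y_te = split_80_20(Xs, y, rng)
y_bin = one_vs_rest_labels(y_tr, positive_class=0)
```

For a problem with k classes, `one_vs_rest_labels` would be called once per class, giving k binary training problems.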

Tables 4 and 5 summarize the feature selection and classification performance of the numerical experiments on the UCI data sets, respectively. Here, n is the number of samples and m is the number of input features. For the two multiclass data sets, the numbers of classes are given in parentheses after their names. Sparsity is defined as card/m, and a small value of sparsity is preferred. The data sets are arranged in ascending order of dimension. The lowest cardinality and the best accuracy rate for each problem are bolded.
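As a small illustration of the two quantities reported here, cardinality and sparsity (card/m) can be computed from a weight vector as follows. The threshold `tol` used to decide whether a weight counts as nonzero is an assumption; the text does not state the threshold used in the experiments, and the weight vector is made up for the example.

```python
import numpy as np

def cardinality(w, tol=1e-8):
    """Number of features whose weight is nonzero up to a tolerance."""
    return int(np.sum(np.abs(w) > tol))

def sparsity(w, tol=1e-8):
    """Sparsity as defined in the text: card / m (smaller is sparser)."""
    return cardinality(w, tol) / w.size

# Hypothetical weight vector with m = 8 features
w = np.array([0.0, 1.3, 0.0, -0.4, 1e-12, 0.0, 0.0, 2.1])
card = cardinality(w)
ratio = sparsity(w)
print(card, ratio)
```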

As shown in Tables 4 and 5, the three sparse SVMs encourage sparsity in all data sets while maintaining roughly the same accuracy as L2-SVM. Among the three sparse methods, L1/2-SVM has the lowest cardinality (14.18) and the highest classification accuracy (86.52%) on average, whereas L1-SVM has the worst feature selection performance, with the highest average cardinality (24.21), and L0-SVM has the lowest average classification accuracy (83.51%).

Figure 3: Results comparison on artificial data set with various dimensions (x-axis: dimension; curves: L2-SVM, L1-SVM, L0-SVM, L1/2-SVM).

Figure 4 plots the sparsity (left) and classification accuracy (right) of each classifier on the UCI data sets. In three data sets (6, 9, 10), L1/2-SVM has the best performance in both feature selection and classification among the three sparse SVMs. Compared with L1-SVM, L1/2-SVM achieves a sparser solution in nine data sets. For example, in the data set "8 SPECTF," L1/2-SVM selects 34% fewer features than L1-SVM, while at the same time the accuracy of L1/2-SVM is 4.1% higher than that of L1-SVM. In five data sets (4, 5, 6, 9, 10), the cardinality of L1/2-SVM drops significantly (by at least 37%) with an equal or slightly better accuracy. Only in the data set "3 Wine" is the accuracy of L1/2-SVM decreased, by 1.1%, but the sparsity provided by L1/2-SVM is a 58.2% improvement over L1-SVM. In the remaining three data sets (1, 2, 7), the two methods have similar results in feature selection and classification. As seen above, L1/2-SVM can provide a lower dimensional representation than L1-SVM with competitive prediction performance.
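The relative reductions quoted in this comparison follow directly from the mean cardinalities in Table 4. The short check below recomputes them; the helper name `relative_reduction` is illustrative, and the numbers are copied from Table 4.

```python
# Mean cardinalities from Table 4: (L1-SVM, L1/2-SVM)
card = {
    "3 Wine": (9.73, 4.07),
    "4 Image": (8.20, 5.11),
    "5 SPECT": (17.40, 9.00),
    "6 WDBC": (9.80, 5.60),
    "8 SPECTF": (38.80, 25.60),
    "9 Sonar": (31.40, 13.60),
    "10 Musk1": (85.80, 41.00),
}

def relative_reduction(baseline, value):
    """Percentage by which `value` falls below `baseline`."""
    return 100.0 * (baseline - value) / baseline

cut = {name: relative_reduction(l1, lh) for name, (l1, lh) in card.items()}
smallest_cut = min(
    cut[k] for k in ("4 Image", "5 SPECT", "6 WDBC", "9 Sonar", "10 Musk1")
)

for name, value in cut.items():
    print(f"{name}: {value:.1f}% fewer features than L1-SVM")
```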

Figure 4 (right) shows that, compared with L0-SVM, L1/2-SVM improves the classification accuracy in all data sets. For instance, in four data sets (3, 4, 8, 10), the classification accuracy of L1/2-SVM is at least 4.0% higher than that of L0-SVM. In particular, in the data set "10 Musk1," L1/2-SVM gives a 6.3% rise in accuracy over L0-SVM. Meanwhile, it can be observed from Figure 4 (left) that L1/2-SVM selects fewer features than L0-SVM in five data sets (5, 6, 7, 9, 10). For example, in data sets 6 and 9, the cardinalities of L1/2-SVM are 37.8% and 49.6% lower than those of L0-SVM, respectively. In summary, L1/2-SVM presents better classification performance than L0-SVM, while remaining effective in choosing relevant features.
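The data-set counts in this comparison can be reproduced from the mean values in Tables 4 and 5. The sketch below is only a bookkeeping aid; the variable names are illustrative, and the lists hold the L0-SVM and L1/2-SVM columns for data sets 1 to 10.

```python
# Mean cardinality (Table 4) and mean accuracy (Table 5), data sets 1..10
card_l0 = [6.20, 6.40, 3.73, 3.66, 9.80, 9.00, 24.60, 24.00, 27.00, 48.60]
card_lh = [7.60, 8.40, 4.07, 5.11, 9.00, 5.60, 21.80, 25.60, 13.60, 41.00]
acc_l0 = [76.73, 96.32, 93.14, 86.67, 78.11, 95.58, 84.57, 73.87, 74.15, 76.00]
acc_lh = [77.12, 96.47, 97.14, 91.36, 82.34, 96.46, 87.43, 78.49, 76.10, 82.32]

# Data sets on which L1/2-SVM keeps fewer features than L0-SVM
fewer = [i + 1 for i in range(10) if card_lh[i] < card_l0[i]]

# Whether L1/2-SVM is more accurate than L0-SVM on every data set
always_more_accurate = all(a > b for a, b in zip(acc_lh, acc_l0))

# Relative cardinality reduction (in %) on data sets 6 and 9
cut = {
    k: 100.0 * (card_l0[k - 1] - card_lh[k - 1]) / card_l0[k - 1]
    for k in (6, 9)
}

# Accuracy gain on Musk1, in percentage points
musk_gain = acc_lh[9] - acc_l0[9]
print(fewer, always_more_accurate, cut, musk_gain)
```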

5. Conclusions

In this paper, we proposed an L1/2 regularization technique for simultaneous feature selection and classification in the SVM. We reformulated the L1/2-SVM as an equivalent smooth constrained optimization problem. The problem possesses a very simple structure, which makes it relatively easy to develop numerical methods for it. Using this reformulation, we proposed an interior point method and established its convergence. Our numerical results supported the reformulation and the proposed method. The L1/2-SVM can obtain sparser solutions than L1-SVM with comparable classification accuracy. Furthermore, the L1/2-SVM can achieve more accurate classification results than L0-SVM (FSV).

Inspired by the good performance of the smooth optimization reformulation of L1/2-SVM, there are some interesting topics deserving further research. For example, developing more efficient algorithms for solving the reformulation, studying nonlinear L1/2-SVM, and exploring various applications of the L1/2-SVM to further validate its effectiveness are interesting research topics. Some of them are under our current investigation.

Appendix

A. Properties of the Reformulation to L1/2-SVM

Let $D$ be the set of all feasible points of problem (10). For any $z \in D$, we let $FD(z, D)$ and $LFD(z, D)$ be the sets of all feasible directions and linearized feasible directions of $D$ at $z$, respectively. Since the constraint functions of (10) are all convex, we immediately have the following lemma.

Lemma A.1. For any feasible point z of (10), one has

FD (z, D) = LFD (z, D). (A.1)
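To make the feasible set D concrete, the following sketch evaluates the constraint functions at a candidate point. Problem (10) is stated in the main text, not in this appendix, so the sketch assumes that its constraints are the five inequality families appearing in the KKT conditions of Theorem A.2 (t_j^2 - w_j >= 0, t_j^2 + w_j >= 0, t_j >= 0, y_i(w^T x_i + b) + xi_i - 1 >= 0, xi_i >= 0) and that its objective is the sum of the t_j plus C times the sum of the xi_i. The function names and the toy data are hypothetical.

```python
import numpy as np

def constraints(w, t, xi, b, X, y):
    """Stack the 3m + 2n constraint values; z is feasible iff all are >= 0."""
    margin = y * (X @ w + b) + xi - 1.0
    return np.concatenate([t**2 - w, t**2 + w, t, margin, xi])

def objective(t, xi, C):
    """Assumed objective of (10): sum_j t_j + C * sum_i xi_i."""
    return t.sum() + C * xi.sum()

# Toy data: n = 3 samples, m = 2 features
X = np.array([[1.0, 0.0], [-1.0, 0.5], [2.0, -1.0]])
y = np.array([1.0, -1.0, 1.0])
w = np.array([1.0, 0.0])
b = 0.0

t = np.sqrt(np.abs(w))                       # tightest choice: t_j = |w_j|^(1/2)
xi = np.maximum(0.0, 1.0 - y * (X @ w + b))  # smallest feasible slack
g = constraints(w, t, xi, b, X, y)
f = objective(t, xi, C=1.0)
print(g.min() >= 0.0, f)
```

With the tightest choice of t, the objective value equals the L1/2 penalty of w plus C times the hinge loss, which is the sense in which the reformulation reproduces the L1/2-SVM objective.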

Figure 4: Results comparison on UCI data sets (left: sparsity; right: classification accuracy; x-axis: data set number; curves: L2-SVM, L1-SVM, L0-SVM, L1/2-SVM).

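Based on the above lemma, we can easily derive a first order necessary condition for (10). The Lagrangian function of (10) is

$$
\begin{aligned}
L(w, t, \xi, b, \lambda, \mu, \nu, p, \beta) ={}& \sum_{j=1}^{m} t_j + C \sum_{i=1}^{n} \xi_i - \sum_{j=1}^{m} \lambda_j \left(t_j^2 - w_j\right) - \sum_{j=1}^{m} \mu_j \left(t_j^2 + w_j\right) \\
& - \sum_{j=1}^{m} \nu_j t_j - \sum_{i=1}^{n} p_i \left( y_i \left(w^T x_i + b\right) + \xi_i - 1 \right) - \sum_{i=1}^{n} \beta_i \xi_i,
\end{aligned}
\tag{A.2}
$$

where $\lambda$, $\mu$, $\nu$, $p$, $\beta$ are the Lagrangian multipliers. By the use of Lemma A.1, we immediately have the following theorem about the first order necessary condition.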

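Theorem A.2. Let $(w, t, \xi)$ be a local solution of (10). Then there are Lagrangian multipliers $(\lambda, \mu, \nu, p, \beta) \in \mathbb{R}^{m+m+m+n+n}$ such that the following KKT conditions hold:

$$
\begin{aligned}
&\frac{\partial L}{\partial w_j} = \lambda_j - \mu_j - \sum_{i=1}^{n} p_i y_i x_{ij} = 0, && j = 1, 2, \ldots, m, \\
&\frac{\partial L}{\partial t_j} = 1 - 2\lambda_j t_j - 2\mu_j t_j - \nu_j = 0, && j = 1, 2, \ldots, m, \\
&\frac{\partial L}{\partial \xi_i} = C - p_i - \beta_i = 0, && i = 1, 2, \ldots, n, \\
&\frac{\partial L}{\partial b} = -\sum_{i=1}^{n} p_i y_i = 0, \\
&\lambda_j \ge 0, \quad t_j^2 - w_j \ge 0, \quad \lambda_j \left(t_j^2 - w_j\right) = 0, && j = 1, 2, \ldots, m, \\
&\mu_j \ge 0, \quad t_j^2 + w_j \ge 0, \quad \mu_j \left(t_j^2 + w_j\right) = 0, && j = 1, 2, \ldots, m, \\
&\nu_j \ge 0, \quad t_j \ge 0, \quad \nu_j t_j = 0, && j = 1, 2, \ldots, m, \\
&p_i \ge 0, \quad y_i \left(w^T x_i + b\right) + \xi_i - 1 \ge 0, \quad p_i \left( y_i \left(w^T x_i + b\right) + \xi_i - 1 \right) = 0, && i = 1, 2, \ldots, n, \\
&\beta_i \ge 0, \quad \xi_i \ge 0, \quad \beta_i \xi_i = 0, && i = 1, 2, \ldots, n.
\end{aligned}
\tag{A.3}
$$

The following theorem shows that the level set will be bounded.

Theorem A.3. For any given constant $c > 0$, the level set

$$
\Omega_c = \{ z = (w, t, \xi) \mid f(z) \le c \} \cap D \tag{A.4}
$$

is bounded.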

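Proof. For any $(w, t, \xi) \in \Omega_c$, we have

$$
\sum_{i=1}^{m} t_i + C \sum_{j=1}^{n} \xi_j \le c. \tag{A.5}
$$

Combining with $t_i \ge 0$, $i = 1, \ldots, m$, and $\xi_j \ge 0$, $j = 1, \ldots, n$, we have

$$
0 \le t_i \le c, \quad i = 1, \ldots, m, \qquad 0 \le \xi_j \le \frac{c}{C}, \quad j = 1, \ldots, n. \tag{A.6}
$$

Moreover, for any $(w, t, \xi) \in D$,

$$
|w_i| \le t_i^2, \quad i = 1, \ldots, m. \tag{A.7}
$$

Consequently, $w$, $t$, $\xi$ are bounded. What is more, from the condition

$$
y_i \left(w^T x_i + b\right) \ge 1 - \xi_i, \quad i = 1, \ldots, n, \tag{A.8}
$$

we have

$$
\begin{aligned}
w^T x_i + b &\ge 1 - \xi_i, && \text{if } y_i = 1, \quad i = 1, \ldots, n, \\
w^T x_i + b &\le -1 + \xi_i, && \text{if } y_i = -1, \quad i = 1, \ldots, n.
\end{aligned}
\tag{A.9}
$$

Thus if the feasible region $D$ is not empty, then we have

$$
\max_{y_i = 1} \left(1 - \xi_i - w^T x_i\right) \le b \le \min_{y_i = -1} \left(-1 + \xi_i - w^T x_i\right). \tag{A.10}
$$

Hence $b$ is also bounded. The proof is complete. □

B. Proof of Lemma 3

Lemma 3 (see Section 3) shows that Algorithm 2 is well defined. We first introduce the following proposition, which shows that the matrix $S$ defined by (21) is always positive definite; this ensures that Algorithm 2 is well defined.

Proposition B.1. Let the inequality (26) hold; then the matrix $S$ defined by (21) is positive definite.

Proof. By an elementary deduction, we have for any $v^{(1)} \in \mathbb{R}^m$, $v^{(2)} \in \mathbb{R}^n$, $v^{(3)} \in \mathbb{R}^m$, $v^{(4)} \in \mathbb{R}$ with $(v^{(1)}, v^{(2)}, v^{(3)}, v^{(4)}) \neq 0$

$$
\begin{aligned}
&\left(v^{(1)T}, v^{(2)T}, v^{(3)T}, v^{(4)T}\right) S \begin{pmatrix} v^{(1)} \\ v^{(2)} \\ v^{(3)} \\ v^{(4)} \end{pmatrix}
= \sum_{i=1}^{4} \sum_{j=1}^{4} v^{(i)T} S_{ij} v^{(j)} \\
&\quad = v^{(1)T} \left(U + V + X^T Y^T P^{-1} D_4 Y X\right) v^{(1)} + v^{(1)T} \left(X^T Y^T P^{-1} D_4\right) v^{(2)} \\
&\qquad + v^{(1)T} \left(-2 (U - V) T\right) v^{(3)} + v^{(1)T} \left(X^T Y^T P^{-1} D_4 y\right) v^{(4)} \\
&\qquad + v^{(2)T} P^{-1} D_4 Y X v^{(1)} + v^{(2)T} \left(P^{-1} D_4 + \Xi^{-1} D_5\right) v^{(2)} + v^{(2)T} P^{-1} D_4 y\, v^{(4)} \\
&\qquad - 2 v^{(3)T} (U - V) T v^{(1)} + v^{(3)T} \left(4 T (U + V) T + T^{-1} D_3 - 2 (D_1 + D_2)\right) v^{(3)} \\
&\qquad + v^{(4)T} y^T P^{-1} D_4 Y X v^{(1)} + v^{(4)T} y^T P^{-1} D_4 v^{(2)} + v^{(4)T} y^T P^{-1} D_4 y\, v^{(4)} \\
&\quad = v^{(1)T} U v^{(1)} - 4 v^{(3)T} U T v^{(1)} + 4 v^{(3)T} T U T v^{(3)} \\
&\qquad + v^{(1)T} V v^{(1)} + 4 v^{(3)T} V T v^{(1)} + 4 v^{(3)T} T V T v^{(3)} \\
&\qquad + v^{(2)T} \Xi^{-1} D_5 v^{(2)} + v^{(3)T} \left(T^{-1} D_3 - 2 D_1 - 2 D_2\right) v^{(3)} \\
&\qquad + \left(Y X v^{(1)} + v^{(2)} + v^{(4)} y\right)^T P^{-1} D_4 \left(Y X v^{(1)} + v^{(2)} + v^{(4)} y\right) \\
&\quad = e_m^T U \left(D_{v1} - 2 T D_{v3}\right)^2 e_m + e_m^T V \left(D_{v1} + 2 T D_{v3}\right)^2 e_m \\
&\qquad + v^{(2)T} \Xi^{-1} D_5 v^{(2)} + v^{(3)T} \left(T^{-1} D_3 - 2 D_1 - 2 D_2\right) v^{(3)} \\
&\qquad + \left(Y X v^{(1)} + v^{(2)} + v^{(4)} y\right)^T P^{-1} D_4 \left(Y X v^{(1)} + v^{(2)} + v^{(4)} y\right) > 0,
\end{aligned}
$$

where $D_{v1} = \operatorname{diag}(v^{(1)})$, $D_{v2} = \operatorname{diag}(v^{(2)})$, $D_{v3} = \operatorname{diag}(v^{(3)})$, and $D_{v4} = \operatorname{diag}(v^{(4)})$. □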
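The last step of the proof of Proposition B.1 completes a square in the diagonal blocks. A quick numerical check of that identity is sketched below. It assumes, as the notation suggests, that U and T are diagonal matrices with positive entries; the actual definitions are given with (21) in the main text, and the random test data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)   # hypothetical seed
m = 5
u = rng.uniform(0.1, 2.0, m)     # assumed positive diagonal of U
t = rng.uniform(0.1, 2.0, m)     # assumed positive diagonal of T
v1 = rng.normal(size=m)
v3 = rng.normal(size=m)
U, T = np.diag(u), np.diag(t)

# Expanded form: v1'U v1 - 4 v3'U T v1 + 4 v3'T U T v3
expanded = v1 @ U @ v1 - 4 * (v3 @ U @ T @ v1) + 4 * (v3 @ T @ U @ T @ v3)

# Completed square: e'U (D_v1 - 2 T D_v3)^2 e
e = np.ones(m)
M = np.diag(v1) - 2 * T @ np.diag(v3)
square = e @ U @ M @ M @ e

print(expanded, square)
```

Because every term of the completed square is a positive weight times a squared number, the block is nonnegative, which is the property the proof relies on.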

B.1. Proof of Lemma 3

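Proof. It is easy to see that if $\Delta z^k = 0$, then (23) holds for all $\alpha_k > 0$.

Suppose $\Delta z^k \neq 0$. We have

$$
\begin{aligned}
&\nabla_w \Phi_\mu(z^k)^T \Delta w^k + \nabla_t \Phi_\mu(z^k)^T \Delta t^k + \nabla_\xi \Phi_\mu(z^k)^T \Delta \xi^k + \nabla_b \Phi_\mu(z^k)^T \Delta b^k \\
&\quad = -\left(\Delta w^{kT}, \Delta \xi^{kT}, \Delta t^{kT}, \Delta b^k\right) S \begin{pmatrix} \Delta w^k \\ \Delta \xi^k \\ \Delta t^k \\ \Delta b^k \end{pmatrix}.
\end{aligned}
$$

Since matrix $S$ is positive definite and $\Delta z^k \neq 0$, the last equation implies

$$
\nabla_w \Phi_\mu(z^k)^T \Delta w^k + \nabla_t \Phi_\mu(z^k)^T \Delta t^k + \nabla_\xi \Phi_\mu(z^k)^T \Delta \xi^k + \nabla_b \Phi_\mu(z^k)^T \Delta b^k < 0.
$$

Consequently, there exists an $\bar{\alpha}_k > 0$ such that the fourth inequality in (23) is satisfied for all $\alpha_k \in (0, \bar{\alpha}_k]$.

On the other hand, since $z^k$ is strictly feasible, the point $z^k + \alpha \Delta z^k$ will be feasible for all $\alpha > 0$ sufficiently small. The proof is complete. □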
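The argument above is the usual justification for a backtracking line search on a barrier function: a descent direction admits a step that both keeps the iterate strictly feasible and gives sufficient decrease. The sketch below illustrates this on a toy log-barrier problem. It is not Algorithm 2: the toy objective, the steepest descent direction used in place of the Newton-type step, and the parameter names `tau1` and `beta` are all assumptions made for illustration.

```python
import numpy as np

def barrier(z, mu):
    """Toy log-barrier: f(z) = sum(z) with constraints g_i(z) = z_i > 0."""
    return z.sum() - mu * np.log(z).sum()

def barrier_grad(z, mu):
    return 1.0 - mu / z

def backtrack(z, dz, mu, tau1=0.25, beta=0.5, max_halvings=60):
    """Return the largest step beta**j that keeps z + a*dz strictly feasible
    and passes the sufficient-decrease test with tau1 in (0, 1/2)."""
    slope = barrier_grad(z, mu) @ dz   # negative for a descent direction
    phi0 = barrier(z, mu)
    a = 1.0
    for _ in range(max_halvings):
        z_new = z + a * dz
        # Feasibility is tested first so the log is never taken of a value <= 0
        if np.all(z_new > 0) and barrier(z_new, mu) <= phi0 + tau1 * a * slope:
            return a
        a *= beta
    raise RuntimeError("no acceptable step found")

mu = 0.1
z = np.array([2.0, 0.05])        # strictly feasible starting point
dz = -barrier_grad(z, mu)        # steepest descent stands in for the Newton step
alpha = backtrack(z, dz, mu)
z_next = z + alpha * dz
print(alpha, z_next)
```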

C. Convergence Analysis

This appendix is devoted to the global convergence of the interior point method. We first show the convergence of Algorithm 2 when a fixed $\mu$ is applied to the barrier subproblem (14).

Lemma C.1. Let $\{(z^k, \lambda^k)\}$ be generated by Algorithm 2. Then $\{z^k\}$ are strictly feasible for problem (10) and the Lagrangian multipliers $\{\lambda^k\}$ are bounded from above.

Proof. For the sake of convenience, we use $g_i(z)$, $i = 1, 2, \ldots, 3m + 2n$, to denote the constraint functions of the constrained problem (10).

We first show that $\{z^k\}$ are strictly feasible. Suppose on the contrary that there exist an infinite index subset $K$ and an index $i \in \{1, 2, \ldots, 3m + 2n\}$ such that $\{g_i(z^k)\}_K \downarrow 0$. By the definition of $\Phi_\mu(z)$ and the fact that $f(z) = \sum_{j=1}^{m} t_j + C \sum_{l=1}^{n} \xi_l$ is bounded from below on the feasible set, it must hold that $\{\Phi_\mu(z^k)\}_K \to \infty$.

However, the line search rule implies that the sequence $\{\Phi_\mu(z^k)\}$ is decreasing. So, we get a contradiction. Consequently, for any $i \in \{1, 2, \ldots, 3m + 2n\}$, $\{g_i(z^k)\}$ is bounded away from zero. The boundedness of $\{\lambda^k\}$ then follows from (24)-(27). □

Lemma C.2. Let $\{(z^k, \lambda^k)\}$ and $\{(\Delta z^k, \hat{\lambda}^{k+1})\}$ be generated by Algorithm 2. If $\{z^k\}_K$ is a convergent subsequence of $\{z^k\}$, then the sequence $\{(\Delta z^k, \hat{\lambda}^{k+1})\}_K$ is bounded.

Proof. Again, we use $g_i(z)$, $i = 1, 2, \ldots, 3m + 2n$, to denote the constraint functions of the constrained problem (10).

We suppose on the contrary that there exists an infinite subset $K' \subseteq K$ such that the subsequence $\{\|(\Delta z^k, \hat{\lambda}^{k+1})\|\}_{K'}$ tends to infinity. Let $\{z^k\}_{K'} \to z^*$. It follows from Lemma C.1 that there is an infinite subset $\bar{K} \subseteq K'$ such that $\{\lambda^k\}_{\bar{K}} \to \lambda^*$ and $g_i(z^*) > 0$, $\forall i \in \{1, 2, \ldots, 3m + 2n\}$. Thus, we have

$$
S^k \to S^*
$$

as $k \to \infty$ with $k \in \bar{K}$. By Proposition B.1, $S^*$ is positive definite. Since the right hand side of (18) is bounded and continuous, the unboundedness of $\{(\Delta z^k, \hat{\lambda}^{k+1})\}_{\bar{K}}$ implies that the limit of the coefficient matrices of (18) is singular, which yields a contradiction. The proof is complete. □

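Lemma C.3. Let $\{(z^k, \lambda^k)\}$ and $\{(\Delta z^k, \hat{\lambda}^{k+1})\}$ be generated by Algorithm 2. If $\{z^k\}_K$ is a convergent subsequence of $\{z^k\}$, then one has $\{\Delta z^k\}_K \to 0$.

Proof. It follows from Lemma C.2 that the sequence $\{\Delta z^k\}_K$ is bounded. Suppose on the contrary that there exists an infinite subset $K' \subseteq K$ such that $\{\Delta z^k\}_{K'} \to \Delta z^* \neq 0$. Since the subsequences $\{z^k\}_{K'}$, $\{\lambda^k\}_{K'}$, and $\{\hat{\lambda}^{k+1}\}_{K'}$ are all bounded, there are points $z^*$, $\lambda^*$, and $\hat{\lambda}^*$, as well as an infinite index set $\bar{K} \subseteq K'$, such that $\{z^k\}_{\bar{K}} \to z^*$, $\{\lambda^k\}_{\bar{K}} \to \lambda^*$, and $\{\hat{\lambda}^{k+1}\}_{\bar{K}} \to \hat{\lambda}^*$. By Lemma C.1, we have $g_i(z^*) > 0$, $\forall i \in \{1, 2, \ldots, 3m + 2n\}$. Similar to the proof of Lemma 3, it is not difficult to get

$$
\begin{aligned}
&\nabla_w \Phi_\mu(z^*)^T \Delta w^* + \nabla_t \Phi_\mu(z^*)^T \Delta t^* + \nabla_\xi \Phi_\mu(z^*)^T \Delta \xi^* + \nabla_b \Phi_\mu(z^*)^T \Delta b^* \\
&\quad = -\left(\Delta w^{*T}, \Delta \xi^{*T}, \Delta t^{*T}, \Delta b^*\right) S^* \begin{pmatrix} \Delta w^* \\ \Delta \xi^* \\ \Delta t^* \\ \Delta b^* \end{pmatrix} < 0.
\end{aligned}
\tag{C.2}
$$

Since $g_i(z^*) > 0$, $\forall i \in \{1, 2, \ldots, 3m + 2n\}$, there exists an $\bar{\alpha} \in (0, 1]$ such that, for all $\alpha \in (0, \bar{\alpha}]$,

$$
g_i(z^* + \alpha \Delta z^*) > 0, \quad \forall i \in \{1, 2, \ldots, 3m + 2n\}. \tag{C.3}
$$

Taking into account $\tau_1 \in (0, 1/2)$, we claim that there exists an $\hat{\alpha} \in (0, \bar{\alpha}]$ such that the following inequality holds for all $\alpha \in (0, \hat{\alpha}]$:

$$
\Phi_\mu(z^* + \alpha \Delta z^*) - \Phi_\mu(z^*) \le 1.1\, \tau_1 \alpha \nabla_z \Phi_\mu(z^*)^T \Delta z^*. \tag{C.4}
$$

Let $m^* = \min\{ j \mid \beta^j \in (0, \hat{\alpha}],\ j = 0, 1, \ldots \}$ and $\alpha^* = \beta^{m^*}$. It follows from (C.3) that the following inequality is satisfied for all $k \in \bar{K}$ sufficiently large and any $i \in \{1, 2, \ldots, 3m + 2n\}$:

$$
g_i(z^k + \alpha^* \Delta z^k) > 0. \tag{C.5}
$$

Moreover, we have

$$
\begin{aligned}
\Phi_\mu(z^k + \alpha^* \Delta z^k) - \Phi_\mu(z^k)
&= \Phi_\mu(z^* + \alpha^* \Delta z^*) - \Phi_\mu(z^*) \\
&\quad + \Phi_\mu(z^k + \alpha^* \Delta z^k) - \Phi_\mu(z^* + \alpha^* \Delta z^*) + \Phi_\mu(z^*) - \Phi_\mu(z^k) \\
&\le 1.05\, \tau_1 \alpha^* \nabla_z \Phi_\mu(z^*)^T \Delta z^*.
\end{aligned}
\tag{C.6}
$$

By the backtracking line search rule, the last inequality together with (C.5) yields the inequality $\alpha_k \ge \alpha^*$ for all $k \in \bar{K}$ large enough. Consequently, when $k \in \bar{K}$ is sufficiently large, we have from (23) and (C.3) that

$$
\begin{aligned}
\Phi_\mu(z^{k+1}) &\le \Phi_\mu(z^k) + \tau_1 \alpha_k \nabla_z \Phi_\mu(z^k)^T \Delta z^k \\
&\le \Phi_\mu(z^k) + \tau_1 \alpha^* \nabla_z \Phi_\mu(z^k)^T \Delta z^k \\
&\le \Phi_\mu(z^k) + \tfrac{1}{2} \tau_1 \alpha^* \nabla_z \Phi_\mu(z^*)^T \Delta z^*.
\end{aligned}
\tag{C.7}
$$

This shows that $\{\Phi_\mu(z^k)\}_{\bar{K}} \to -\infty$, contradicting the fact that $\{\Phi_\mu(z^k)\}_{\bar{K}}$ is bounded from below. The proof is complete. □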

Then, we will establish the convergence of Algorithms 2 and 4, that is, Theorems 5 and 6 in Section 3.

C.1. Proof of Theorem 5

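Proof. We let $(z^*, \lambda^*)$ be a limit point of $\{(z^k, \lambda^k)\}$ and let the subsequence $\{(z^k, \lambda^k)\}_K$ converge to $(z^*, \lambda^*)$.

Recall that $(\Delta z^k, \hat{\lambda}^{k+1})$ is the solution of (18) with $(z, \lambda) = (z^k, \lambda^k)$. Taking limits on both sides of (18) with $(z, \lambda) = (z^k, \lambda^k)$ as $k \to \infty$ with $k \in K$, by the use of Lemma C.1, we obtain

$$
\begin{aligned}
&\lambda^{*(1)} - \lambda^{*(2)} - X^T Y^T \lambda^{*(4)} = 0, \\
&C e_n - \lambda^{*(4)} - \lambda^{*(5)} = 0, \\
&e_m - 2 T^* \lambda^{*(1)} - 2 T^* \lambda^{*(2)} - \lambda^{*(3)} = 0, \\
&y^T \lambda^{*(4)} = 0, \\
&\left(T^{*2} - W^*\right) \lambda^{*(1)} - \mu e_m = 0, \\
&\left(T^{*2} + W^*\right) \lambda^{*(2)} - \mu e_m = 0, \\
&T^* \lambda^{*(3)} - \mu e_m = 0, \\
&P^* \lambda^{*(4)} - \mu e_n = 0, \\
&\Xi^* \lambda^{*(5)} - \mu e_n = 0.
\end{aligned}
\tag{C.8}
$$

This shows that $(z^*, \lambda^*)$ satisfies the first-order optimality conditions (15). □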
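The perturbed optimality system obtained in this proof is easy to evaluate numerically, which can be useful as a stopping test. The sketch below stacks the nine residual blocks of (C.8). It assumes that W, T, and Ξ are the diagonal matrices of w, t, and ξ, that Y = diag(y), and that P is the diagonal matrix of the margin constraint values y_i(w^T x_i + b) + ξ_i − 1; these definitions come from the main text and are inferred here from the constraints. The function name and the toy data are hypothetical.

```python
import numpy as np

def perturbed_kkt_residual(X, y, w, t, xi, b, lam, mu, C):
    """Residual blocks of system (C.8); lam = (l1, l2, l3, l4, l5)."""
    l1, l2, l3, l4, l5 = lam
    p = y * (X @ w + b) + xi - 1.0           # assumed diagonal of P
    return np.concatenate([
        l1 - l2 - X.T @ (y * l4),            # stationarity in w
        C - l4 - l5,                         # stationarity in xi
        1.0 - 2 * t * l1 - 2 * t * l2 - l3,  # stationarity in t
        [y @ l4],                            # stationarity in b
        (t**2 - w) * l1 - mu,                # perturbed complementarity blocks
        (t**2 + w) * l2 - mu,
        t * l3 - mu,
        p * l4 - mu,
        xi * l5 - mu,
    ])

# Toy strictly feasible point: n = 3 samples, m = 2 features
X = np.array([[1.0, 0.0], [-1.0, 0.5], [2.0, -1.0]])
y = np.array([1.0, -1.0, 1.0])
w = np.array([0.5, -0.2])
t = np.array([1.0, 1.0])
xi = np.array([1.0, 1.0, 1.0])
b = 0.1
mu, C = 0.1, 1.0

# Multipliers chosen so that the five complementarity blocks vanish exactly
p = y * (X @ w + b) + xi - 1.0
lam = (mu / (t**2 - w), mu / (t**2 + w), mu / t, mu / p, mu / xi)
res = perturbed_kkt_residual(X, y, w, t, xi, b, lam, mu, C)
m, n = w.size, y.size
print(np.linalg.norm(res))
```

At this toy point only the stationarity blocks are nonzero, since the multipliers were chosen to satisfy the perturbed complementarity equations and not the full system.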

C.2. Proof of Theorem 6

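Proof. (i) Without loss of generality, we suppose that the bounded subsequence $\{(z^j, \lambda^j)\}_J$ converges to some point $(z^*, \lambda^*)$. It is clear that $z^*$ is a feasible point of (10). Since $\mu_j \to 0$, it follows from (29) and (30) that

$$
\{\operatorname{Res}(z^j, \lambda^j, \mu_j)\}_J \to \operatorname{Res}(z^*, \lambda^*, 0) = 0, \qquad \lambda^* \ge 0. \tag{C.9}
$$

Consequently, $(z^*, \lambda^*)$ satisfies the KKT conditions (13).

(ii) Let $\tilde{\lambda}^j = \lambda^j / \max\{\|\lambda^j\|_\infty, 1\}$. Obviously, $\{\tilde{\lambda}^j\}$ is bounded. Hence, there exists an infinite subset $J' \subseteq J$ such that $\{\tilde{\lambda}^j\}_{J'} \to \tilde{\lambda} \neq 0$ and $\|\tilde{\lambda}^j\|_\infty = 1$ for large $j \in J'$. From (30) we know that $\tilde{\lambda} \ge 0$. Dividing both sides of the first equality of (18) with $(z, \lambda) = (z^j, \lambda^j)$ by $\max\{\|\lambda^j\|_\infty, 1\}$ and then taking limits as $j \to \infty$ with $j \in J'$, we get

$$
\begin{aligned}
&\tilde{\lambda}^{(1)} - \tilde{\lambda}^{(2)} - X^T Y^T \tilde{\lambda}^{(4)} = 0, \\
&\tilde{\lambda}^{(4)} + \tilde{\lambda}^{(5)} = 0, \\
&2 T^* \tilde{\lambda}^{(1)} + 2 T^* \tilde{\lambda}^{(2)} + \tilde{\lambda}^{(3)} = 0, \\
&y^T \tilde{\lambda}^{(4)} = 0, \\
&\left(T^{*2} - W^*\right) \tilde{\lambda}^{(1)} = 0, \\
&\left(T^{*2} + W^*\right) \tilde{\lambda}^{(2)} = 0, \\
&T^* \tilde{\lambda}^{(3)} = 0, \quad P^* \tilde{\lambda}^{(4)} = 0, \quad \Xi^* \tilde{\lambda}^{(5)} = 0.
\end{aligned}
\tag{C.10}
$$

Since $z^*$ is feasible, the above equations show that $z^*$ is a Fritz-John point of problem (10). □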

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

The authors would like to acknowledge support for this project from the National Science Foundation (NSF Grants 11371154, 11071087, 11271069, and 61103202) of China.

References

[1] Y. M. Yang and J. O. Pedersen, "A comparative study on feature selection in text categorization," in Proceedings of the 14th International Conference on Machine Learning (ICML '97), vol. 97, pp. 412-420, 1997.

[2] G. Forman, "An extensive empirical study of feature selection metrics for text classification," The Journal of Machine Learning Research, vol. 3, pp. 1289-1305, 2003.

[3] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46, no. 1-3, pp. 389-422, 2002.

[4] H. H. Zhang, J. Ahn, X. Lin, and C. Park, "Gene selection using support vector machines with non-convex penalty," Bioinformatics, vol. 22, no. 1, pp. 88-95, 2006.

[5] Y. Saeys, I. Inza, and P. Larranaga, "A review of feature selection techniques in bioinformatics," Bioinformatics, vol. 23, no. 19, pp. 2507-2517, 2007.

[6] J. Weston, F. Pérez-Cruz, O. Bousquet, O. Chapelle, A. Elisseeff, and B. Schölkopf, "Feature selection and transduction for prediction of molecular bioactivity for drug design," Bioinformatics, vol. 19, no. 6, pp. 764-771, 2003.

[7] Y. Liu, "A comparative study on feature selection methods for drug discovery," Journal of Chemical Information and Computer Sciences, vol. 44, no. 5, pp. 1823-1828, 2004.

[8] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, Eds., Feature Extraction: Foundations and Applications, vol. 207, Springer, 2006.

[9] A. Rakotomamonjy, "Variable selection using SVM-based criteria," Journal of Machine Learning Research, vol. 3, no. 7-8, pp. 1357-1370, 2003.

[10] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik, "Feature selection for SVMs," in Proceedings of the Conference on Neural Information Processing Systems, vol. 13, pp. 668-674, Denver, Colo, USA, 2000.

[11] D. Peleg and R. Meir, "A feature selection algorithm based on the global minimization of a generalization error bound," in Proceedings of the Conference on Neural Information Processing Systems, Vancouver, Canada, 2004.

[12] P. S. Bradley and O. L. Mangasarian, "Feature selection via concave minimization and support vector machines," in Proceedings of the International Conference on Machine Learning, pp. 82-90, San Francisco, Calif, USA, 1998.

[13] J. Weston, A. Elisseeff, B. Schölkopf, and M. Tipping, "Use of the zero-norm with linear models and kernel methods," Journal of Machine Learning Research, vol. 3, pp. 1439-1461, 2003.

[14] A. B. Chan, N. Vasconcelos, and G. R. G. Lanckriet, "Direct convex relaxations of sparse SVM," in Proceedings of the 24th International Conference on Machine Learning (ICML '07), pp. 145-153, Corvallis, Ore, USA, June 2007.

[15] G. M. Fung and O. L. Mangasarian, "A feature selection Newton method for support vector machine classification," Computational Optimization and Applications, vol. 28, no. 2, pp. 185-202, 2004.

[16] J. Zhu, S. Rosset, T. Hastie, and R. Tibshirani, "1-norm support vector machines," Advances in Neural Information Processing Systems, vol. 16, no. 1, pp. 49-56, 2004.

[17] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society B: Methodological, vol. 58, no. 1, pp. 267-288, 1996.

[18] T. Zhang, "Analysis of multi-stage convex relaxation for sparse regularization," Journal of Machine Learning Research, vol. 11, pp. 1081-1107, 2010.

[19] R. Chartrand, "Exact reconstruction of sparse signals via non-convex minimization," IEEE Signal Processing Letters, vol. 14, no. 10, pp. 707-710, 2007.

[20] R. Chartrand, "Nonconvex regularization for shape preservation," in Proceedings of the 14th IEEE International Conference on Image Processing (ICIP '07), vol. 1, pp. I293-I296, San Antonio, Tex, USA, September 2007.

[21] Z. Xu, H. Zhang, Y. Wang, X. Chang, and Y. Liang, "L1/2 regularization," Science China: Information Sciences, vol. 53, no. 6, pp. 1159-1169, 2010.

[22] W. J. Chen and Y. J. Tian, "Lp-norm proximal support vector machine and its applications," Procedia Computer Science, vol. 1, no. 1, pp. 2417-2423, 2010.

[23] J. Liu, J. Li, W. Xu, and Y. Shi, "A weighted Lq adaptive least squares support vector machine classifiers-Robust and sparse approximation," Expert Systems with Applications, vol. 38, no. 3, pp. 2253-2259, 2011.

[24] Y. Liu, H. H. Zhang, C. Park, and J. Ahn, "Support vector machines with adaptive Lq penalty," Computational Statistics & Data Analysis, vol. 51, no. 12, pp. 6380-6394, 2007.

[25] Z. Liu, S. Lin, and M. Tan, "Sparse support vector machines with Lp penalty for biomarker identification," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 7, no. 1, pp. 100-107, 2010.

[26] A. Rakotomamonjy, R. Flamary, G. Gasso, and S. Canu, "ℓp-ℓq penalty for sparse linear and sparse multiple kernel multitask learning," IEEE Transactions on Neural Networks, vol. 22, no. 8, pp. 1307-1320, 2011.

[27] J. Y. Tan, Z. Zhang, L. Zhen, C. H. Zhang, and N. Y. Deng, "Adaptive feature selection via a new version of support vector machine," Neural Computing and Applications, vol. 23, no. 3-4, pp. 937-945, 2013.

[28] Z. B. Xu, X. Y. Chang, F. M. Xu, and H. Zhang, "L1/2 regularization: an iterative half thresholding algorithm," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 7, pp. 1013-1027, 2012.

[29] D. Ge, X. Jiang, and Y. Ye, "A note on the complexity of Lp minimization," Mathematical Programming, vol. 129, no. 2, pp. 285-299, 2011.

[30] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, NY, USA, 1995.

[31] J. Fan and H. Peng, "Nonconcave penalized likelihood with a diverging number of parameters," The Annals of Statistics, vol. 32, no. 3, pp. 928-961, 2004.

[32] K. Knight and W. Fu, "Asymptotics for lasso-type estimators," The Annals of Statistics, vol. 28, no. 5, pp. 1356-1378, 2000.

[33] B. S. Tian and X. Q. Yang, "An interior-point ℓ1/2-penalty method for nonlinear programming," Technical Report, Department of Applied Mathematics, Hong Kong Polytechnic University, 2013.

[34] L. Armijo, "Minimization of functions having Lipschitz continuous first partial derivatives," Pacific Journal of Mathematics, vol. 16, pp. 1-3, 1966.

[35] F. John, "Extremum problems with inequalities as side-conditions," in Studies and Essays. Courant Anniversary Volume, pp. 187-204, 1948.

[36] D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz, "UCI repository of machine learning databases," Technical Report 9702, Department of Information and Computer Science, University of California, Irvine, Calif, USA, 1998, http://archive.ics.uci.edu/ml/.
