
Author's Accepted Manuscript

Enhanced Gene Ranking Approaches using Modified Trace Ratio Algorithm for Gene Expression Data

Shruti Mishra, Debahuti Mishra

www.elsevier.com/locate/imu

PII: S2352-9148(16)30024-7

DOI: http://dx.doi.org/10.1016/j.imu.2016.09.005

Reference: IMU18

To appear in: Informatics in Medicine Unlocked

Received date: 10 May 2016 Revised date: 25 September 2016 Accepted date: 26 September 2016

Cite this article as: Shruti Mishra and Debahuti Mishra, Enhanced Gene Ranking Approaches using Modified Trace Ratio Algorithm for Gene Expression Data, Informatics in Medicine Unlocked, http://dx.doi.org/10.1016/j.imu.2016.09.005

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting galley proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Enhanced Gene Ranking Approaches using Modified Trace Ratio Algorithm for

Gene Expression Data

Shruti Mishra, Debahuti Mishra

Siksha 'O' Anusandhan University, Bhubaneswar-751030, Odisha, India

shruti_m2129@yahoo.co.in, mishradebahuti@gmail.com

Abstract

Microarray technology enables the understanding and investigation of gene expression levels by analyzing high dimensional datasets that contain few samples. Over time, microarray expression data have been collected for studying the underlying biological mechanisms of disease. One such application for understanding the mechanism is the construction of a gene regulatory network (GRN). One of the foremost criteria for GRN discovery is gene selection, and choosing a generous set of genes for the structure of the network is highly desirable. For this role, two suitable methods are proposed for the selection of appropriate genes. The first approach comprises a gene selection method called information gain, where the dataset is reformed and fused with another distinct algorithm called Trace Ratio (TR). Our second method is the implementation of our proposed modified TR algorithm, where the scoring base for finding the weight matrices has been re-designed. The efficiency of both methods was demonstrated with different classifiers, including variants of the Artificial Neural Network classifier (Resilient Propagation, Quick Propagation, Back Propagation, Manhattan Propagation and the Radial Basis Function Neural Network) as well as the Support Vector Machine (SVM) classifier. The study confirmed that both proposed methods work well and offer high accuracy with fewer iterations than the original Trace Ratio algorithm.

Keywords

Gene Regulatory Network; Gene Selection; Information gain; Trace Ratio; Canonical Correlation Analysis; Classification

1. Introduction

Genes, as well as their products (proteins), are the essential building blocks of life, and they do not function autonomously. Rather, for a cell to function appropriately, they act together and form an intricate network [1]. One application for understanding the behaviour of genes and their expression levels is to construct a gene network that signifies the relationships between sets of genes that coordinate to achieve different tasks. For the understanding of core biological processes and their molecular machinery, the Gene Regulatory Network (GRN) [2] plays a crucial part. However, modeling these networks is a significant challenge that needs to be addressed.

Apart from this, understanding the construction and functionalities of GRNs is a basic problem in biology. With the accessibility of gene expression data and whole genome sequences, several computational approaches have been developed to discover regulatory networks by enabling the recognition of their regulatory state components [3]. In the current era, the formation of precise GRN models [4] is gaining major importance in biomedical research. Microarray gene expression data monitor the behavior of thousands of genes simultaneously, providing an excellent opportunity to look into large scale regulatory networks. Lastly, an accurate GRN model allows us to incorporate experimental facts about the elements and interactions of the factors, which leads to knowing the final state or the dynamical behavior of the network.

Gene selection [5-6] acts as a major criterion. Gene selection from microarray data (a high dimensional dataset) is a statistically difficult problem: the number of samples is usually quite small compared to the thousands of genes whose expression levels are measured. Hence, it is important to narrow down from thousands of microarray genes to a few disease-related genes through selection or ranking. There are many gene selection or feature selection methods [7-8] that deal with the curse of dimensionality in microarray data; they also help to reduce the time and memory complexities that always create issues. Generally, gene selection or feature selection methods are split into two categories: classifier independent and classifier dependent. Filter methods [9] are classifier independent, as the choice is based on some heuristic criterion or score, whereas wrapper and embedded methods are classifier dependent. A wrapper method [10] assesses a subset of variables according to their usefulness to a given predictor, whereas in embedded methods variable selection is performed as part of the learning process and is usually specific to a given learning machine. Other than gene selection, gene ranking is also an important consideration, for which different methods are available in the literature for the study of class data, such as Fold Change (FC), moderated t-statistics and Significance Analysis of Microarrays (SAM). Another method, the Rank Products (RP) method, is the only rank-based non-parametric method; it independently handles up-regulated and down-regulated genes under one class and therefore produces two separate ranked gene lists.

Apart from these existing techniques, there are various computational techniques and methods for gene selection. Model et al. [11] showed how phenotypic classes can be predicted by combining feature selection methods and discriminant analysis for methylation-pattern-based discrimination between acute lymphoblastic leukemia and acute myeloid leukemia; they applied SVM to the methylation data, using every CpG position as a separate dimension. Li et al. [12] studied the problem of building a multi-class classifier for tissue classification based on gene expression datasets; they stated that for datasets with a small number of classes the results are good, while for datasets with a large number of classes the accuracy is moderately lower. Mundra and Rajapakse [13] used the well-known t-statistic for gene ranking in the analysis of microarray data; here, they divided the data into two parts, relevant and irrelevant sample points. A backward-elimination-based iterative approach was proposed to rank genes using only the relevant sample points and t-statistics, and the proposed method was found to perform considerably better than the standard t-statistic approach. Kira et al. [14] partitioned the data points into clusters using a k-d tree and chose a random data point from each cluster, then performed feature selection by means of Relief, which looks for frontier points to estimate feature weights. Pechenizkiy et al. [15] used principal component analysis for dimensionality reduction after partitioning large datasets with a k-d tree. Cavill et al. proposed a GA/k-NN based approach for concurrent feature and sample selection from metabolic profiling data [16].

Similarly, Cawley et al. [17] proposed a straightforward Bayesian approach which eliminates the regularization parameter entirely by integrating it out analytically using an uninformative Jeffrey's prior. The resulting algorithm (BLogReg) is two or three orders of magnitude faster than the original algorithm, as there is no longer any need for a model selection step. Two new dimensionality reduction techniques were proposed by Fitzgerald et al. [18]. These methods use minimum and maximum information models, which are information-theoretic extensions of Spike-Triggered Covariance (STC) that can be applied with non-Gaussian stimulus distributions to locate relevant linear subspaces of arbitrary dimensionality. Piao et al. [19] proposed an Ensemble Correlation-Based Gene Selection algorithm based on symmetrical uncertainty and the Support Vector Machine. In this method, symmetrical uncertainty was used to analyze the importance of the genes, diverse starting points of the relevant subset were used to produce the gene subsets, and the Support Vector Machine was used as the evaluation criterion of the wrapper.

Nie et al. [20] proposed an optimized subset-level score and an algorithm to efficiently discover the globally optimal feature subset such that the subset-level score is maximized. This algorithm, called Trace Ratio (TR), uses the Fisher or Laplacian score as its evaluation criterion; it is essentially a graph-based feature selection algorithm. Zhao et al. [21] introduced the trace ratio linear discriminant analysis (TR-LDA) algorithm for dementia diagnosis, along with an iterative algorithm (iITR) to solve the TR-LDA problem. This method integrates a sophisticated missing-value imputation method and is used for the analysis of the nonlinear datasets found in many real-world medical diagnosis problems. Wang et al. [22] proposed a unified objective that seamlessly combines the trace ratio formulation with the k-means clustering process, in a manner that extends the trace ratio criterion to the unsupervised setting. They also proposed an unsupervised feature selection method integrating the unsupervised trace ratio formulation with structured sparsity-inducing norm regularization. This method is able to harness the discriminant power of the trace ratio criterion, and thus tends to select discriminating features. The major disadvantage of the trace ratio algorithm [23] is that, although in theory the algorithm converges and the global optimum of the solution is achieved, extensive study shows that the algorithm sometimes fails to converge because its basic stopping criterion is not met. Hence, the algorithm must be forcefully terminated by imposing an additional stopping criterion.

In our study, we have proposed two methods in which the trace ratio algorithm is explored thoroughly. In the first method, we have not altered any criterion of the TR algorithm; instead, we restructured the dataset on the basis of information gain values. In the second method, we have modified the original TR algorithm by changing the scoring criterion, one of its fundamental steps: instead of the Fisher score, the canonical correlation analysis score is used to calculate the within-class and between-class weight matrices. The canonical correlation score, being a statistical technique, aims to provide a better rank list when merged with the TR algorithm than the existing Fisher score, and is expected to provide a far better classification accuracy rate than the original TR algorithm. Both of the proposed methods are examined and evaluated on five datasets: Colon [24], Leukemia [25], Medulloblastoma [26], Lymphoma [27] and Prostate Cancer [28]. These datasets are quite large in terms of the number of genes but have small sample sizes. It was found that both information gain with the original TR algorithm and the modified TR algorithm provided promising results compared to the unmodified TR algorithm.

The rest of the paper is organized as follows: the first section describes the materials and methods used in this work, such as the datasets, the methods and the algorithms (information gain, the TR algorithm, canonical correlation analysis, performance metrics etc.). The next section deals with the experimental evaluation, covering pre-processing of the data, a parametric discussion and the schema diagram of the proposed model. Following this section, the results of the proposed techniques along with the original technique are critically analyzed and summarized. Lastly, the conclusion of the work is given with some future directions.

2. Materials and Methods

2.1 Datasets Used

Expression profiling of colon cancer (colorectal adenomas) and normal mucosa from 32 patients was downloaded from Gene Expression Omnibus [24] (SOFT matrix files were downloaded, and log transformation was applied since the data were mostly skewed to the right). This set consists of 32 adenoma and 32 normal mucosa samples (64 samples) with 43,237 genes. To illustrate the molecular developments underlying the transformation of normal colonic epithelium, the transcriptomes of 32 prospectively collected adenomas were measured along with those of normal mucosa from the same individuals. Similarly, the Leukemia dataset was collected from [25]; it consists of 10,056 genes with 48 samples of both ALL and AML (24 ALL, Acute Lymphocytic Leukemia, and 24 AML, Acute Myeloid Leukemia). Apart from these two, a few more datasets were considered: the Medulloblastoma dataset [26] with 5893 genes and 34 samples (25 group C and 9 group D; medulloblastoma has four molecular subtypes, of which the two less well defined subtypes are group C and group D), the Lymphoma dataset [27] with 7070 genes and 77 samples (58 DLBCL, Diffuse Large B-cell Lymphoma, and 19 FL, Follicular Lymphoma; Affymetrix HuGeneFL array), and the Prostate Cancer dataset [28] with 12,533 genes and 102 samples (50 normal and 52 tumor; Affymetrix Human Genome U95Av2 array platform). These large-scale gene expression datasets were first statistically measured and then used for the assessment of the existing TR algorithm and the modified TR algorithm.

2.2 Information Gain

Information gain [29] is a synonym for Kullback-Leibler divergence. In the context of decision trees, however, the phrase is sometimes used synonymously with mutual information, which is the expected value of the Kullback-Leibler divergence of a conditional probability distribution. Further, the information gain ratio is the ratio of the information gain to the intrinsic information; it is used to diminish the bias towards multi-valued attributes by taking the number and size of branches into account when choosing an attribute [30]. One of its most vital characteristics is to bias the decision tree against attributes with a large number of distinct values; that is, it helps in deciding which attributes are the most relevant. Information gain, an important concept in information theory, is widely applied in the field of machine learning. In a classification system for microarray data, the information gain [31] is computed for each gene as a numerical amount of information that the gene provides to the classification system, which determines the relevance of the gene of interest. This method can quickly rule out a large amount of non-critical noise and inappropriate genes, narrowing the search space for the most favourable subset of genes. Entropy is the measure used to quantify this information and to compute the degree of uncertainty of a random variable. Let node N represent or hold the tuples of partition D. The expected information needed to classify a tuple in D is given by eq (1):

$Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$   (1)

where $Info(D)$ is the entropy of D and $p_i$ is the probability that an arbitrary tuple in D belongs to class $C_i$.

Suppose the tuples in D are partitioned on some attribute A having v distinct values $\{a_1, a_2, \ldots, a_v\}$. If A is discrete valued, these values correspond directly to the v outcomes of a test on A. Attribute A can be used to split D into v partitions or subsets $\{D_1, D_2, \ldots, D_v\}$, where $D_j$ contains those tuples in D that have outcome $a_j$ of A. This amount can be measured as shown in eq (2):

$Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)$   (2)

Here, the term $\frac{|D_j|}{|D|}$ acts as the weight of the jth partition, and $Info_A(D)$ is the expected information required to classify a tuple from D based on the partitioning by A. The information gain is the difference between the original information requirement, based on the proportion of classes, and the new requirement obtained after partitioning on A. This is shown in eq (3):

$Gain(A) = Info(D) - Info_A(D)$   (3)

The larger the divergence, the stronger the correlation. As a result, the differential-entropy-defined information gain (shown in Algorithm I) represents the quantity of information obtained after the removal of uncertainty. Evidently, the larger the information gain value of a feature, the larger the contribution it makes and the more vital it is for classification [32]. Hence, when choosing genes, those with great information gain are selected first to represent the original high-dimensional gene set, and are used as a base for further gene selection.

Algorithm I: Information Gain [29]
Input: Original dataset, D
Output: Reordered gene sets as per the information gain values obtained for each attribute in D.
Step 1: Find the probability of each category of known samples.
Step 2: Compute the entropy of the classification system using eq (1): $Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$
Step 3: Compute the probability and conditional probability of all values for each gene.
Step 4: Calculate the conditional entropy, i.e. the expected information required to classify a tuple from D, using eq (2): $Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)$
Step 5: Compute the information gain for all genes using eq (3): $Gain(A) = Info(D) - Info_A(D)$
Step 6: Sort the genes in descending order of the gain obtained in Step 5.
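To make the ranking step concrete, below is an illustrative Python/NumPy sketch of Algorithm I (the paper's experiments used MATLAB; the function names and the equal-width binning used to discretize continuous expression values are our assumptions, since the algorithm does not fix a discretization):

```python
import numpy as np

def entropy(labels):
    # Info(D) = -sum_i p_i * log2(p_i), eq (1)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(gene, labels, bins=10):
    # Discretize one gene's continuous expression values into equal-width bins
    # (assumed here; the paper does not specify a discretization scheme).
    edges = np.histogram_bin_edges(gene, bins=bins)
    binned = np.digitize(gene, edges[1:-1])
    # Info_A(D) = sum_j (|D_j| / |D|) * Info(D_j), eq (2)
    info_a = sum((binned == v).mean() * entropy(labels[binned == v])
                 for v in np.unique(binned))
    # Gain(A) = Info(D) - Info_A(D), eq (3)
    return entropy(labels) - info_a

def rank_genes_by_gain(X, y):
    # Step 6: sort genes (columns of the samples x genes matrix X)
    # in descending order of information gain.
    gains = np.array([information_gain(X[:, j], y) for j in range(X.shape[1])])
    order = np.argsort(-gains)
    return order, gains[order]
```

Re-ordering the columns of the expression matrix as X[:, order] then yields the reformed dataset used in section 2.3.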

2.3 Significance and Analysis of Information Gain

Five datasets were considered that contain a moderately good number of samples and genes. The datasets were initially pre-processed using min-max normalization. On the normalized dataset, the information gain procedure was used to obtain the information gain value of each attribute (i.e. gene), which was then used to sort and re-order the dataset in descending order. Statistically, the most informative genes (genes with a high information gain) are selected and placed foremost, with the other genes following.

2.4 Trace Ratio

Feature reduction is a major issue in many machine learning and pattern recognition applications, and the trace ratio problem is an optimization problem that arises in many dimensionality reduction algorithms. Traditionally, the solution is approximated via generalized eigenvalue decomposition due to the intricacy of the original problem. The Fisher and Laplacian scores [33-34] are two famous gene selection criteria that belong to the graph-based gene selection setting. TR [35] is a graph-based gene or feature selection algorithm that uses these two scores (Fisher and Laplacian) as its evaluation criteria.

Let us consider two undirected graphs $G_w$ and $G_b$ for within-class and between-class relations, constructed using the Fisher score, with corresponding adjacency matrices $M_w$ and $M_b$. For a dataset X, when two instances $x_i$ and $x_j$ belong to the same class, the within-class relationship is higher. So, the feature subset selection should minimize (eq (4))

$\sum_{ij} \|S_v^T x_i - S_v^T x_j\|^2 (M_w)_{ij}$   (4)

for instances of the same class, and maximize it otherwise. The between-class relationship between $x_i$ and $x_j$ is higher when they belong to different classes. So, the selected gene or feature subset should maximize (eq (5))

$\sum_{ij} \|S_v^T x_i - S_v^T x_j\|^2 (M_b)_{ij}$   (5)

for instances of different classes, and minimize it otherwise. Here, $x_i$ denotes the ith instance and $S_v$ the selection matrix defined below. To find the weight matrices $M_w$ and $M_b$, the Fisher score or the Laplacian score is used, depending on whether the feature selection is supervised or unsupervised. The weight matrices for the Fisher score are given in eq (6) and eq (7):

$(M_w)_{ij} = \begin{cases} \frac{1}{num_{l_i}}, & \text{if } l_i = l_j \\ 0, & \text{if } l_i \neq l_j \end{cases}$   (6)

$(M_b)_{ij} = \begin{cases} \frac{1}{num} - \frac{1}{num_{l_i}}, & \text{if } l_i = l_j \\ \frac{1}{num}, & \text{if } l_i \neq l_j \end{cases}$   (7)

where $l_i$ denotes the class label of the ith instance $x_i$, $num$ denotes the total number of instances, and $num_{l_i}$ denotes the number of data or records belonging to class $l_i$. The adjacency matrices using the Laplacian score can be calculated as shown in eq (8) and eq (9):

$(M_w)_{ij} = \begin{cases} e^{-\frac{\|x_i - x_j\|^2}{t}}, & \text{if } x_i \text{ and } x_j \text{ are neighbours} \\ 0, & \text{otherwise} \end{cases}$   (8)

$(M_b)_{ij} = \frac{(D_{M_w}\mathbf{1})(\mathbf{1}^T D_{M_w})}{\mathbf{1}^T D_{M_w}\mathbf{1}}$   (9)

where eq (8) denotes a radial (Gaussian) distance and t denotes a constant. In order to unite both objectives in a single function, their ratio is considered and maximized. The ratio is given by eq (10) and eq (11):

$\varphi(S_v) = \frac{\sum_{ij} \|S_v^T x_i - S_v^T x_j\|^2 (M_b)_{ij}}{\sum_{ij} \|S_v^T x_i - S_v^T x_j\|^2 (M_w)_{ij}}$   (10)

$\varphi(S_v) = \frac{tr(S_v^T X L_{M_b} X^T S_v)}{tr(S_v^T X L_{M_w} X^T S_v)}$   (11)

where $S_v = [s_{i_1}, s_{i_2}, \ldots, s_{i_k}]$ denotes the selection matrix, $i_1, i_2, \ldots, i_k$ being the first k elements of a permutation of $[1, 2, \ldots, n]$ (n is the number of genes or features). $s_i$ denotes a column vector with all zeros except a 1 in the ith position, and $tr(\cdot)$ is the trace of a matrix. Let $L_{M_w}$ and $L_{M_b}$ be Laplacian matrices of the form given in eq (12) and eq (13):

$L_{M_w} = D_{M_w} - M_w$   (12)

$L_{M_b} = D_{M_b} - M_b$   (13)

where $D_{M_w}$ and $D_{M_b}$ are diagonal matrices given in eq (14) and eq (15):

$(D_{M_w})_{ii} = \sum_{j} (M_w)_{ij}$   (14)

$(D_{M_b})_{ii} = \sum_{j} (M_b)_{ij}$   (15)

Let $Y = X L_{M_b} X^T$ and $Z = X L_{M_w} X^T$. The score of the feature or gene set for a particular selection matrix $S_v$ is calculated as per the TR criterion, given in eq (16):

$\beta = \varphi(S_v) = \frac{tr(S_v^T Y S_v)}{tr(S_v^T Z S_v)}$   (16)

The score of each gene or feature $f_i$ is computed using eq (17):

$F(f_i) = m_i^T (Y - \beta Z) m_i$   (17)

where $m_i$ is the column vector with all zeros except a 1 at the ith position, and $F(f_i)$ is the score used to select the feature or gene set. The trace ratio algorithm is stated below (Algorithm II):

Algorithm II: Trace Ratio [35]
Step 1: Calculate the adjacency matrices within the class ($M_w$) and between the classes ($M_b$) using the Fisher score, as in eq (6) and eq (7).
Step 2: Calculate the diagonal matrices $D_{M_w}$ and $D_{M_b}$ for the above adjacency matrices, as in eq (14) and eq (15).
Step 3: Calculate the Laplacian matrices $L_{M_w}$ and $L_{M_b}$ using eq (12) and eq (13).
Step 4: Construct a matrix of k features by initially selecting k features at random from the original dataset (say $R_k$).
Step 5: Declare an empty matrix (say $N_k$) to store the top k features after finding the score of each feature.
Step 6: Repeat steps 7 to 11 until $R_k = N_k$.
Step 7: Calculate $Y = X L_{M_b} X^T$ and $Z = X L_{M_w} X^T$.
Step 8: Calculate the trace ratios $TR_Y = tr(R_k^T Y R_k)$ and $TR_Z = tr(R_k^T Z R_k)$.
Step 9: Calculate $\beta = TR_Y / TR_Z$.
Step 10: Calculate the score of each feature as $F(f_i) = m_i^T (Y - \beta Z) m_i$.
Step 11: Select the new top k features based on the score and store them in $N_k$.
Step 12: Store the final k features $R_k$ for further processing.
Step 13: Stop.
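As a minimal illustration of how Algorithm II iterates, the following Python/NumPy sketch implements the supervised (Fisher score) case; it is our reading of the algorithm, not the authors' code, and the max_iter cap mirrors the forced termination discussed in the introduction:

```python
import numpy as np

def fisher_weight_matrices(y):
    # Eqs (6)-(7): within-class (Mw) and between-class (Mb) adjacency matrices
    y = np.asarray(y)
    n = len(y)
    same = y[:, None] == y[None, :]
    num_li = np.array([np.sum(y == c) for c in y])      # num_{l_i} for each sample
    Mw = np.where(same, 1.0 / num_li[:, None], 0.0)
    Mb = np.where(same, 1.0 / n - 1.0 / num_li[:, None], 1.0 / n)
    return Mw, Mb

def laplacian(M):
    # Eqs (12)-(15): L_M = D_M - M, with (D_M)_ii = sum_j M_ij
    return np.diag(M.sum(axis=1)) - M

def trace_ratio_selection(X, y, k, max_iter=100):
    # X: d x n matrix (genes x samples); returns indices of the top-k genes.
    Mw, Mb = fisher_weight_matrices(y)
    Y = X @ laplacian(Mb) @ X.T                          # Y = X L_Mb X^T
    Z = X @ laplacian(Mw) @ X.T                          # Z = X L_Mw X^T
    sel = np.arange(k)                                   # Step 4: initial k genes
    for _ in range(max_iter):                            # forced stopping criterion
        beta = np.trace(Y[np.ix_(sel, sel)]) / np.trace(Z[np.ix_(sel, sel)])
        scores = np.diag(Y) - beta * np.diag(Z)          # F(f_i) = m_i^T (Y - beta*Z) m_i
        new_sel = np.argsort(-scores)[:k]                # Step 11: new top-k genes
        if set(new_sel) == set(sel):                     # Step 6: R_k = N_k
            break
        sel = new_sel
    return sel
```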

2.5 Canonical Correlation Analysis

Canonical Correlation Analysis (CCA) [36] is a well-known statistical method that has been broadly used in information fusion to capture the correlation between two variables. CCA is essentially used to find discriminative features or genes and to reduce superfluous information for gene selection. It is also a well-known multivariate analysis method for quantifying the correlation between two sets of multi-dimensional variables [37]. One of the main intents of CCA is to find and quantify the correlation between two sets of multi-dimensional variables: it uses two views of the same pattern and projects them onto a lower dimensional space in which they are maximally correlated. The traditional CCA algorithm requires calculating both the inverse and the eigendecomposition of a D × D matrix [38].

Let us consider two sets of variables X and Y, containing r variables in set X and q variables in set Y:

$X = (X_1, X_2, \ldots, X_r)^T$ and $Y = (Y_1, Y_2, \ldots, Y_q)^T$

We label X and Y based on the number of variables in each set so that $r \leq q$. A set of linear combinations U and V is defined, where U corresponds to linear combinations of X and V corresponds to Y. Each member of U is paired with a member of V, which leads to the sets of combinations given below:

$U_1 = a_{11}X_1 + \cdots + a_{1r}X_r$
$\vdots$
$U_r = a_{r1}X_1 + \cdots + a_{rr}X_r$

$V_1 = b_{11}Y_1 + \cdots + b_{1q}Y_q$
$\vdots$
$V_r = b_{r1}Y_1 + \cdots + b_{rq}Y_q$

Hence, $(U_i, V_i)$ is defined as the ith canonical variate pair. The variance of $U_i$ can be computed using eq (16):

$var(U_i) = \sum_{k=1}^{r} \sum_{l=1}^{r} a_{ik} a_{il}\, cov(X_k, X_l)$   (16)

Similarly, the variance of $V_j$ is computed using eq (17):

$var(V_j) = \sum_{k=1}^{q} \sum_{l=1}^{q} b_{jk} b_{jl}\, cov(Y_k, Y_l)$   (17)

Now, the covariance between $U_i$ and $V_j$ can be computed as shown in eq (18):

$cov(U_i, V_j) = \sum_{k=1}^{r} \sum_{l=1}^{q} a_{ik} b_{jl}\, cov(X_k, Y_l)$   (18)

The canonical correlation between $U_i$ and $V_j$ can be calculated using eq (19):

$\rho_{ij} = \frac{cov(U_i, V_j)}{\sqrt{var(U_i)\, var(V_j)}}$   (19)
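When $U_i$ and $V_j$ are available as score vectors over the samples, eq (19) reduces to a simple ratio of sample moments; a minimal sketch (our illustration, not part of the original formulation):

```python
import numpy as np

def canonical_correlation_score(u, v):
    # rho = cov(U_i, V_j) / sqrt(var(U_i) * var(V_j)), eq (19)
    u = u - u.mean()
    v = v - v.mean()
    return float(np.mean(u * v) / np.sqrt(np.mean(u * u) * np.mean(v * v)))
```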

2.6 Performance Metrics Used

Stability of the selected features is a significant aspect when the task is knowledge discovery and not simply returning an accurate classifier. For the validation and assessment of the proposed methods, three different metrics were applied. Though several validation indexes are available, for our domain Kuncheva's Stability Index (KSI) [39], the Balanced Classification Rate (BCR) [40] and the Balanced Error Rate (BER) [41] were used. The detailed explanation of the three metrics is given below:

a. Kuncheva's Stability Index [39]

Let A and B be two feature subsets. KSI is a stability measure that assumes A and B have the same size (cardinality), i.e. $|A| = |B| = k$, where k denotes the number of features in A or B. For two such subsets drawn from n features, with $r = |A \cap B|$, KSI is defined as (eq (20)):

$KSI(A, B) = \frac{\text{observed } r - \text{expected } r}{\max r - \text{expected } r} = \frac{rn - k^2}{k(n - k)}$   (20)

KSI is the average of the pairwise consistency. A value of 0 indicates the highest possible instability, whereas a value of 1 indicates the highest possible stability, i.e. all feature subsets have the same cardinality and all subsets are identical.
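Eq (20) transcribes directly into code; in this hypothetical helper, A and B are index sets of equal size k drawn from n features:

```python
def kuncheva_stability_index(A, B, n):
    # KSI(A, B) = (r*n - k^2) / (k*(n - k)), eq (20), with r = |A ∩ B|
    k = len(A)
    assert len(B) == k, "Kuncheva's index assumes equal-size subsets"
    r = len(set(A) & set(B))
    return (r * n - k * k) / (k * (n - k))
```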

b. Balanced Classification Rate [40]

BCR is the mean of sensitivity and specificity, which introduces a balance between the classification of the two classes (eq (21)):

$BCR = \frac{1}{2}(Sensitivity + Specificity) = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right)$   (21)

where TP is true positive, FP is false positive, TN is true negative and FN is false negative.

c. Balanced Error Rate [41]

It is the average of the errors on each class and is also called the Half Total Error Rate. It is stated in eq (22):

$BER = \frac{1}{2}\left(\frac{FP}{N} + \frac{FN}{P}\right) = 1 - BCR$   (22)

where FP is false positive, FN is false negative, P is the total number of positives, and N is the total number of negatives.
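Both rates follow directly from the confusion matrix counts; a minimal sketch of eqs (21) and (22):

```python
def bcr_ber(tp, fp, tn, fn):
    # BCR = (sensitivity + specificity) / 2, eq (21); BER = 1 - BCR, eq (22)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    bcr = 0.5 * (sensitivity + specificity)
    return bcr, 1.0 - bcr
```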

2.7 Proposed methodologies of TR algorithm for gene selection and ranking

Here, two methodologies based on the TR algorithm are proposed for our work. The first method, IG-TR Gene Ranking, uses information gain as the base for evaluation along with the original, existing TR algorithm. The second method, CCA-TR Gene Ranking, aims at modifying the existing TR algorithm's scoring criterion.

Method I: IG-TR Gene Ranking

In this process, we keep the original TR algorithm intact. Rather than modifying the algorithm, we change the base dataset. This change is not arbitrary but is based on the information gain content: after the pre-processing step, the dataset is re-arranged and re-structured using the extracted information gain values. We calculate the information gain for the dataset; the higher the information gain, the better the information content of the attribute, so we re-order the entire dataset by this attribute content value, sorted in descending order. Once the dataset is redefined, the TR algorithm is applied over it to rank the genes or attributes, and these ranked genes are then passed to the classifier for accuracy measurement, as in the sketch below.
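Reusing the sketches from sections 2.2 and 2.4, Method I amounts to a short pipeline (a hypothetical end-to-end sketch under the same assumptions as those blocks):

```python
# X: samples x genes expression matrix (already min-max normalized), y: class labels
order, _ = rank_genes_by_gain(X, y)                    # reorder genes by descending information gain
X_reordered = X[:, order]
top_k = trace_ratio_selection(X_reordered.T, y, k=50)  # unmodified TR on the reformed data
selected_genes = order[top_k]                          # indices mapped back to the original gene space
```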

Method II: CCA-TR Gene Ranking

There is another method that we propose for the better performance of the TR algorithm. The TR algorithm usually uses the standard Fisher score or Laplacian score to find the weight matrices. In place of this scoring criterion, we chose a replacement: for our work, we substituted the usual Fisher score with the Canonical Correlation Analysis (CCA) score. That is, the evaluation criterion for generating the TR or rank of genes is changed from the Fisher score to the canonical correlation score. Using this score, we generated the TR scores of the genes, which were then passed to different classifiers for accuracy estimation. The evaluation criteria for finding the weight or adjacency matrices of the new TR algorithm are stated in eq (23) and eq (24):

$(M_w)_{ij} = \begin{cases} \frac{cov(U_i, V_j)}{\sqrt{var(U_i)\, var(V_j)}}, & \text{if } l_i = l_j \\ 0, & \text{otherwise} \end{cases}$   (23)

$(M_b)_{ij} = \begin{cases} \frac{cov(U_i, V_j)}{\sqrt{var(U_i)\, var(V_j)}}, & \text{if } l_i = l_j \\ 1, & \text{otherwise} \end{cases}$   (24)

where $\frac{cov(U_i, V_j)}{\sqrt{var(U_i)\, var(V_j)}}$ is the canonical correlation score, and $l_i$ and $l_j$ denote the class labels.

The detailed restructured algorithm is stated below (Algorithm III):

Algorithm III: Modified Trace Ratio Algorithm
Step 1: Calculate the adjacency matrices within the class ($M_w$) and between the classes ($M_b$) using the canonical correlation score, as in eq (23) and eq (24).
Step 2: Calculate the diagonal matrices $D_{M_w}$ and $D_{M_b}$ for the above adjacency matrices, as in eq (14) and eq (15).
Step 3: Calculate the Laplacian matrices $L_{M_w}$ and $L_{M_b}$ using eq (12) and eq (13).
Step 4: Construct a matrix of k features by initially selecting k features at random from the original dataset (say $R_k$).
Step 5: Declare an empty matrix (say $N_k$) to store the top k features after finding the score of each feature.
Step 6: Repeat steps 7 to 11 until $R_k = N_k$.
Step 7: Calculate $Y = X L_{M_b} X^T$ and $Z = X L_{M_w} X^T$.
Step 8: Calculate the trace ratios $TR_Y = tr(R_k^T Y R_k)$ and $TR_Z = tr(R_k^T Z R_k)$.
Step 9: Calculate $\beta = TR_Y / TR_Z$.
Step 10: Calculate the score of each feature as $F(f_i) = m_i^T (Y - \beta Z) m_i$.
Step 11: Select the new top k features based on the score and store them in $N_k$.
Step 12: Store the final k features $R_k$ for further processing.
Step 13: Stop.
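Relative to Algorithm II, only Step 1 changes. Below is a sketch of the CCA-based weight matrices of eqs (23) and (24), reusing canonical_correlation_score from section 2.5 and treating each sample's expression profile as the paired views (an assumption made for illustration, since the paper does not spell out how the two views are formed):

```python
import numpy as np

def cca_weight_matrices(X, y, cca_score):
    # Eqs (23)-(24): the Fisher score of Algorithm II, Step 1 is replaced
    # by the canonical correlation score between pairs of samples.
    n = len(y)
    Mw = np.zeros((n, n))
    Mb = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            rho = cca_score(X[:, i], X[:, j])   # score between samples i and j
            if y[i] == y[j]:
                Mw[i, j] = rho                  # within-class weight, eq (23)
                Mb[i, j] = rho                  # between-class weight, eq (24)
            else:
                Mb[i, j] = 1.0                  # eq (24), otherwise branch
    return Mw, Mb
```

The remaining steps (Laplacians, trace ratios, scores) proceed exactly as in the trace_ratio_selection sketch of section 2.4.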

3. Experimental Analysis

In this section, we begin with the basic pre-processing step that is a prerequisite for normalizing the five datasets. This is followed by the parametric discussion and the measures taken into consideration. We also present a schematic view of the proposed model. For the evaluation and analysis, MATLAB version R2014a was used on a system with 8 GB RAM.

3.1 Preprocessing

A primary and essential stage of pre-processing is normalization. Normalization transforms the data into a layout that can be processed more simply and effectively for the purposes of the user. Here, the datasets were normalized using min-max normalization [42]. Min-max normalization is a straightforward technique that fits the data into a pre-defined boundary. In other words, it linearly transforms the real data values so that the minimum and maximum of the transformed data take certain values. The technique can be represented as shown in eq (25):

$x' = \frac{x - x_{min}}{x_{max} - x_{min}}$   (25)

where $x_{min}$ is the minimal data value appearing and $x_{max}$ is the maximal data value appearing.
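As a sketch, eq (25) applied per gene (column-wise over a samples x genes matrix); the epsilon guard for constant-valued genes is our addition:

```python
import numpy as np

def min_max_normalize(X, eps=1e-12):
    # x' = (x - x_min) / (x_max - x_min), eq (25), per gene (column)
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    return (X - x_min) / (x_max - x_min + eps)
```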

3.2 Parameter Discussion

In section 2.7, we proposed two different methods (IG-TR Gene Ranking and CCA-TR Gene Ranking) for finding the TR of a data matrix. Each produces a new rank list of genes, which is then passed to the variants of the ANN algorithm for classification and accuracy measurement. The covariance and variance factors have been used to provide better results. The top 50, 100, 150 and 200 genes were selected for use in the TR algorithm to generate the TR and rank list; this selection was a crucial criterion on which the entire rank list was generated. Information gain is also a significant factor considered here for generating gene sets with high information content, so the selection of such genes played a major role in finding the TR and rank list. These two processes enhance the chance of finding an appropriate rank list for classification and thus a better performance value. TR itself is a well-defined algorithm, and merging these extra parameters only improves its performance further; the merger takes fewer iterations to generate the rank list than the original, unmodified TR algorithm.

3.3 Implementation and Performance Analysis

The proposed schematic model is described herewith (shown in figure 1):

Figure. 1. Schematic Proposed Model

In section 2.7, two methodologies for generating the TR were proposed. Both involve a series of steps beginning with the normalization step common to the two methods (IG-TR Gene Ranking and CCA-TR Gene Ranking): min-max normalization was applied to the five datasets to linearly transform the data into specified boundaries. We now analyze the first and second methods individually (as shown in method I and method II of figure 1).

In the first method, IG-TR Gene Ranking (figure 1), we compute the information gain of the dataset to determine how important each gene vector is. This is then used to select the gene vectors or attributes with the highest information content. Based on this information content, the data matrix is sorted and reordered in descending order, with genes of high information content kept first and genes of the least information content kept last. Studies usually consider attributes of low information content unimportant, but in our study every gene vector is given equal importance and all are kept in the data matrix. This reformed dataset is used as the base input to the TR algorithm, where it was eventually found that, with just this modest change in the base data, the entire algorithm behaves differently. The difference is mainly the smaller number of iterations required for convergence of the algorithm (where, based on the k genes selected, the rank set of all the genes remains the same). The difference between the original TR algorithm and this approach was also realized through classification: the generated gene rank list was given as input to the classifiers (Resilient Propagation, Back Propagation, Manhattan Propagation and Support Vector Machines), which offered exceptional accuracy compared to the original, unmodified TR algorithm.

In the second method, CCA-TR Gene Ranking (figure 1), we changed the scoring criterion of the TR algorithm instead of changing the base input. As a substitute for the Fisher score, we chose the canonical correlation score to determine the weight matrices. CCA is a statistical method employed to capture the correlation between two variables. Using the new scoring method, we computed the TR with k values (numbers of genes) of 50, 100, 150 and 200. This modified TR algorithm also generates the rank list in few iterations while converging suitably. The new rank list was passed to the variants of the classification algorithm, where the classifiers' accuracy improved considerably compared to the original TR algorithm. Hence, we state that this modification produces a substantial difference in the performance of the TR algorithm: it provided better accuracy values than the unmodified TR algorithm, and the number of iterations required for convergence was consistently lower. For most of the datasets, both methods gave 100% accuracy, i.e. a zero error factor.

4. Results and Discussion

As stated earlier, five datasets were considered for the purpose of assessment. Each method's results are discussed with proper tabular and graphical representation. It was found that both methods responded positively and showed remarkable results compared to the original algorithm. The proposed methods were evaluated with different classification algorithms, and it was observed that the results varied by either a large or a small margin. Table 1 shows the characteristics of the datasets considered for the experimental analysis. A random 70% of each dataset was selected for training and the remaining 30% for testing.

Table. 1. Description of the datasets used in experimental analysis

Data | No. of genes | No. of samples (Class 1 / Class 2) | Training data | Testing data | References
Colon | 2000 | 31 (N) / 31 (T) | 70% | 30% | [24]
Leukemia | 10056 | 24 (ALL) / 24 (AML) | 70% | 30% | [25]
Medulloblastoma | 5893 | 25 (C) / 9 (D) | 70% | 30% | [26]
Lymphoma | 7070 | 58 (DLBCL) / 19 (FL) | 70% | 30% | [27]
Prostate Cancer | 12533 | 50 (N) / 52 (T) | 70% | 30% | [28]

4.1 IG-TR Gene Ranking Algorithm

One of the major issues in the gene selection process is that, from a huge ordered gene subset, we need to find a (re-ordered) set for which the classification accuracy is higher. Based on this concept of gene selection, we reformed the dataset or data matrix into a properly ranked set and compared it with the original dataset. In other words, a clear comparison was drawn between the original pre-processed dataset and the reframed dataset passed as base input to the existing TR algorithm, and this was further validated with different classification algorithms. The results are shown in tables 2 to 6, where table 6 shows the average accuracy obtained with 10-fold cross-validation. From tables 2 to 6, it was observed that the proposed method-I provided good results in terms of accuracy with different classifiers. A clear implication can also be extracted: Resilient Propagation and SVM provided better classification accuracy with fewer iterations than Back Propagation and Manhattan Propagation.

Table. 2. Performance assessment of reframed input with original pre-processed base input where K=50 for TR algorithm

Dataset Resilient Propagation Back Propagation Manhattan Propagation SVM

Original Infogain Original Infogain Original Infogain Original Infogain

Acc Itr Acc Itr Acc Itr Acc Itr Acc Itr Acc Itr Acc Itr Acc Itr

Colon 98.36 122 100 111 79.03 2024 100 628 80.78 2145 87.45 2043 98.33 120 100 98

Leukemia 91.6 12 100 12 91.66 14 100 13 89.47 321 92.5 224 90.47 14 99.14 16

Medulloblastoma 99.10 44 100 35 76.47 725 100 602 79.65 856 85 601 100 48 100 32

Lymphoma 98.70 2335 98.70 1347 97.40 4901 100 3568 95.23 3568 97.23 2864 99.10 2214 99.58 1087

Prostate Cancer 99.01 8197 100 983 50.98 4987 71.28 4029 65.25 3687 76 3402 100 8055 100 912

Table. 3. Performance assessment of reframed input with original pre-processed base input where K=100 for TR algorithm

Dataset Resilient Propagation Back Propagation Manhattan Propagation SVM

Original Infogain Original Infogain Original Infogain Original Infogain

Acc Itr Acc Itr Acc Itr Acc Itr Acc Itr Acc Itr Acc Itr Acc Itr

Colon 98.3 108 100 99 97.24 1042 98.3 967 82 2256 88 1568 98 114 100 92

Leukemia 100 16 100 12 98.45 25 100 9 92.01 426 95.24 368 100 12 100 10

Medulloblastoma 99.01 41 100 38 98.24 3988 99.01 4011 86.35 965 88.25 867 100 38 100 32
Lymphoma 98.70 1452 98.70 1321 99 4254 100 3987 92.48 4781 94.23 3892 98.54 1487 98.01 1235

Prostate Cancer 99.01 4496 100 1337 95.25 2471 99.01 1761 72 3874 82.89 3471 100 4520 100 1022

Table. 4. Performance assessment of reframed input with original pre-processed base input where K=150 for TR algorithm

Dataset Resilient Propagation Back Propagation Manhattan Propagation SVM

Original Infogain Original Infogain Original Infogain Original Infogain

Acc Itr Acc Itr Acc Itr Acc Itr Acc Itr Acc Itr Acc Itr Acc Itr

Colon 99.01 123 100 89 97.58 1234 99.25 1047 86.47 2458 95.25 2110 99 118 95.21 84

Leukemia 91.6 12 100 12 93.57 38 98.20 23 93.78 528 100 403 95.88 10 100 9

Medulloblastoma 99.01 56 100 30 98.65 4078 100 3854 89.41 913 93.58 804 100 48 100 22

Lymphoma 98.70 2190 98.70 1576 93.47 4378 96.58 3821 91.47 4582 93.58 4036 99.74 2054 99 1478

Prostate Cancer 99.01 4108 100 1128 88.25 2854 92.42 2103 78.58 3451 85.34 3241 100 4187 100 1158

Table. 5. Performance assessment of reframed input with original pre-processed base input where K=200 for TR algorithm

Dataset Resilient Propagation Back Propagation Manhattan Propagation SVM

Original Infogain Original Infogain Original Infogain Original Infogain

Acc Itr Acc Itr Acc Itr Acc Itr Acc Itr Acc Itr Acc Itr Acc Itr

Colon 98.15 145 100 105 95.85 1354 97.25 1204 88 2147 93.45 1543 99 124 100 98
Leukemia 100 16 100 14 92.01 42 95.21 33 91.47 682 95.47 541 100 12 100 14
Medulloblastoma 98.36 51 100 32 — — — — — — — — 100 48 100 22
Lymphoma 98.70 1304 100 1730 92.58 3256 96.24 2543 90.47 4421 95.3 4102 97.25 1385 100 1601
Prostate Cancer 99.01 4482 100 1575 86.21 2745 90.01 2235 80.25 3647 85.78 3278 100 4568 100 1489

Table. 6. Average performance assessment of reframed input with original pre-processed base input where K=500 for the TR algorithm, with 10-fold cross-validation

Dataset Resilient Propagation Back Propagation Manhattan Propagation SVM

Original Infogain Original Infogain Original Infogain Original Infogain

Acc Itr Acc Itr Acc Itr Acc Itr Acc Itr Acc Itr Acc Itr Acc Itr

Colon 97.25 125 100 104 93.24 1147 95.24 1045 84 1687 93.25 1457 98.24 104 100 93

Leukemia 95.44 10 100 9 93.24 59 95.24 50 92.44 654 93.58 521 96.96 12 100 10

Medulloblastoma 98.24 40 100 24 95.54 3457 98.11 2987 90.21 885 93.66 654 98 35 100 22

Lymphoma 97 1085 100 1478 89.35 3325 92.54 3256 89 3956 90 3321 98.55 985 100 1325

Prostate Cancer 100 3547 98 985 88 2310 92.35 2185 77 3584 83 3321 100 3104 99.25 954

For more computational feasibility and a better view of the proposed technique's efficiency, table 7 presents an analysis of the original TR algorithm along with ReliefF and the proposed method-I. It was observed that there is a slight gain in accuracy for the proposed methodology when compared with ReliefF.

Table. 7. Performance assessment of the original TR algorithm and the proposed IG-TR algorithm with ReliefF, for the Resilient Propagation and SVM classifiers

Dataset Resilient Propagation SVM
Original Infogain ReliefF Original Infogain ReliefF
Acc Itr Acc Itr Acc Itr Acc Itr Acc Itr Acc Itr

K=50
Colon 98.36 122 100 111 99.89 113 98.33 120 100 98 98.25 133
Leukemia 91.6 12 100 12 98.44 11 90.47 14 99.14 16 98.55 13
Medulloblastoma 99.10 44 100 35 100 39 100 48 100 32 100 38
Lymphoma 98.70 2335 98.70 1347 97.85 1459 99.10 2214 99.58 1087 98.02 1542
Prostate Cancer 99.01 8197 100 983 100 1058 100 8055 100 912 — —

K=100
Colon 98.3 108 100 99 99.54 110 98 114 100 92 99.87 102
Leukemia 100 16 100 12 100 18 100 12 100 10 100 15
Medulloblastoma 99.01 41 100 38 100 52 100 38 100 32 100 41
Lymphoma 98.70 1452 98.70 1321 97.45 1256 98.54 1487 98.01 1235 97.24 1358
Prostate Cancer 99.01 4496 100 1337 98.65 1029 100 4520 100 1022 99.89 3658

K=150
Colon 99.01 123 100 89 99.55 93 99 118 95.21 84 97.85 98
Leukemia 91.6 12 100 12 96.88 18 95.88 10 100 9 100 15
Medulloblastoma 99.01 56 100 30 99.97 36 100 48 100 22 100 39
Lymphoma 98.70 2190 98.70 1576 99.01 1856 99.74 2054 99 1478 98.32 1874
Prostate Cancer 99.01 4108 100 1128 100 1257 100 4187 100 1158 100 2568

K=200
Colon 98.15 145 100 105 100 142 99 124 100 98 99.86 127
Leukemia 100 16 100 14 99.87 17 100 12 100 14 100 15
Medulloblastoma 98.36 51 100 32 99.68 38 100 48 100 22 100 34
Lymphoma 98.70 1304 100 1730 100 1587 97.25 1385 100 1601 98.65 1784
Prostate Cancer 99.01 4482 100 1575 100 3542 100 4568 100 1489 100 3698

4.2 CCA-TR Gene Ranking Algorithm

As discussed in section 3.3, in this method we evaluated the results of the existing TR algorithm and the proposed TR algorithm on the five selected datasets. The results were again validated on different classifiers, as shown in tables 8 to 12, where table 12 depicts the average accuracy obtained with 10-fold cross-validation. From tables 8 to 12, it was observed that the accuracy obtained by the proposed method-II is quite appreciable for the different ANN classifier variants and SVM, and that Resilient Propagation and SVM provided better classification accuracy with fewer iterations than Back Propagation and Manhattan Propagation.

Table. 8. Performance assessment of the existing TR algorithm and proposed TR algorithm where K=50 for the TR algorithm

Dataset Resilient Propagation Back Propagation Manhattan Propagation SVM

Original CCA-TR Original CCA-TR Original CCA-TR Original CCA-TR

Acc Itr Acc Itr Acc Itr Acc Itr Acc Itr Acc Itr Acc Itr Acc Itr

Colon 98.36 122 100 133 79.03 2024 83.58 1524 80.78 2145 89.55 2048 98.14 118 100 124

Leukemia 91.6 12 100 11 91.66 14 94.88 14 89.47 321 92.66 256 89.87 10 100 10

Medulloblastoma 99.10 44 100 29 76.47 725 85.86 540 79.65 856 86.34 785 98.24 48 100 27

Lymphoma 98.70 2335 100 604 97.40 4901 98.11 3124 95.23 3568 98.35 2475 99 2354 100 564

Prostate Cancer 99.01 8197 99.01 1777 50.98 4987 86.47 2589 65.25 3687 79.36 3549 100 8058 100 1659

Table. 9. Performance assessment of the existing TR algorithm and proposed TR algorithm where K=100 for the TR algorithm

Dataset Resilient Propagation Back Propagation Manhattan Propagation SVM

Original CCA-TR Original CCA-TR Original CCA-TR Original CCA-TR

Acc Itr Acc Itr Acc Itr Acc Itr Acc Itr Acc Itr Acc Itr Acc Itr

Colon 98.3 108 100 93 97.24 1042 99.04 856 82 2256 88.33 1689 99.25 116 94.45 89

Leukemia 100 16 100 26 98.45 25 100 20 92.01 426 94.56 354 100 10 100 18

Medulloblastoma 99.01 41 100 29 98.24 3988 100 3865 86.35 965 90.01 892 100 38 100 28

Lymphoma 98.70 1452 98.7 764 99 4254 99.58 3358 92.48 4781 97.68 3658 98.14 1385 99.66 659

Prostate Cancer 99.01 4496 99.01 921 95.25 2471 99.41 2045 72 3874 80.56 2546 100 4236 100 936

Table. 10. Performance assessment of the existing TR algorithm and proposed TR algorithm where K=150 for the TR algorithm

Dataset Resilient Propagation Back Propagation Manhattan Propagation SVM

Original CCA-TR Original CCA-TR Original CCA-TR Original CCA-TR

Acc Itr Acc Itr Acc Itr Acc Itr Acc Itr Acc Itr Acc Itr Acc Itr

Colon 99.01 123 100 79 97.58 1234 100 1025 86.47 2458 96.87 2053 99.89 128 82.25 82

Leukemia 91.6 12 100 21 93.57 38 98.69 29 93.78 528 97.25 423 90.47 12 100 18

Medulloblastoma 99.01 56 100 24 98.65 4078 100 3569 89.41 913 95.21 821 100 48 100 19

Lymphoma 98.70 2190 98.7 828 93.47 4378 98.77 3698 91.47 4582 96.22 3548 99.77 2065 99.45 796

Prostate Cancer 99.01 4108 99.01 608 88.25 2854 93.20 2264 78.58 3451 83.27 3025 100 3965 100 587

Table. 11. Performance assessment of the existing TR algorithm and proposed TR algorithm where K=200 for the TR algorithm

Dataset Resilient Propagation Back Propagation Manhattan Propagation SVM

Original CCA-TR Original CCA-TR Original CCA-TR Original CCA-TR

Acc Itr Acc Itr Acc Itr Acc Itr Acc Itr Acc Itr Acc Itr Acc Itr

Colon 98.15 145 100 88 95.85 1354 98.65 1023 88 2147 92.35 2014 99 135 100 90

Leukemia 100 16 100 23 92.01 42 96.35 32 91.47 682 96.85 586 100 11 100 17

Medulloblastoma 98.36 51 100 38 97.36 3954 100 2542 92.54 1054 98.31 993 99.56 45 100 26

Lymphoma 98.70 1304 98.7 669 92.58 3256 96.87 3105 90.47 4421 93.54 3214 99.10 1287 99.23 584

Prostate Cancer 99.01 4482 99.01 515 86.21 2745 92.85 2105 80.25 3647 89.32 2598 100 4325 100 486

Table. 12. Average performance assessment of the existing TR algorithm and the proposed TR algorithm where K=500 for the TR algorithm, with 10-fold cross-validation

Dataset Resilient Propagation Back Propagation Manhattan Propagation SVM

Original CCA-TR Original CCA-TR Original CCA-TR Original CCA-TR

Acc Itr Acc Itr Acc Itr Acc Itr Acc Itr Acc Itr Acc Itr Acc Itr

Colon 98.21 132 100 70 94.25 1249 99.36 1156 89 1785 93.68 1622 98.22 135 100 65

Leukemia 97.9 13 100 13 93.78 65 98.32 52 92.56 742 97.85 689 96.11 10 100 10

Medulloblastoma 99.02 45 99.70 28 96.98 3675 99 3458 91.54 985 95.84 862 100 39 100 25

Lymphoma 98.7 1110 99.61 1056 90.14 3412 95.32 2596 88.45 4085 92.54 3845 99.58 1024 100 995
Prostate Cancer 99.11 3713 99.31 638 89.56 2450 94.58 2150 78.62 3742 83.57 3548 100 3542 100 558

Table 13 depicts the computational accuracy and the total number of iterations of the original TR algorithm along with the proposed method-II (the CCA-TR method) and the ReliefF method. It is observed that the accuracy of ReliefF is, in most cases, somewhat lower than that of the proposed method. Hence, the proposed method can be said to provide better accuracy with fewer iterations.

Table. 13. Performance assessment of the original TR algorithm and the proposed CCA-TR algorithm with ReliefF, for the Resilient Propagation and SVM classifiers

Dataset Resilient Propagation SVM
Original CCA-TR ReliefF Original CCA-TR ReliefF
Acc Itr Acc Itr Acc Itr Acc Itr Acc Itr Acc Itr

K=50
Colon 98.36 122 100 133 100 125 98.14 118 100 124 99.88 129
Leukemia 91.6 12 100 11 96.51 13 89.87 10 100 10 96.71 16
Medulloblastoma 99.10 44 100 29 100 35 98.24 48 100 27 100 36
Lymphoma 98.70 2335 100 604 99.65 1985 99 2354 100 564 100 642
Prostate Cancer 99.01 8197 99.01 1777 99.54 2958 100 8058 100 1659 100 3543

K=100
Colon 98.3 108 100 93 99.21 98 99.25 116 94.45 89 93.24 95
Leukemia 100 16 100 26 100 15 100 10 100 18 100 12
Medulloblastoma 99.01 41 100 29 100 32 100 38 100 28 100 35
Lymphoma 98.70 1452 98.7 764 99.01 854 98.14 1385 99.66 659 99.36 892
Prostate Cancer 99.01 4496 99.01 921 98.35 1541 100 4236 100 936 100 2513

K=150
Colon 99.01 123 100 79 100 84 99.89 128 82.25 82 90.21 105
Leukemia 91.6 12 100 21 99.54 25 90.47 12 100 18 99.75 16
Medulloblastoma 99.01 56 100 24 99.65 34 100 48 100 19 100 22
Lymphoma 98.70 2190 98.7 828 99.00 1069 99.77 2065 99.45 796 99.02 849
Prostate Cancer 99.01 4108 99.01 608 97.84 758 100 3965 100 587 100 782

K=200
Colon 98.15 145 100 88 99.66 124 99 135 100 90 100 98
Leukemia 100 16 100 23 100 18 100 11 100 17 100 15
Medulloblastoma 98.36 51 100 38 99.53 45 99.56 45 100 26 100 32
Lymphoma 98.70 1304 98.7 669 99.08 754 99.10 1287 99.23 584 99.54 984
Prostate Cancer 99.01 4482 99.01 515 100 874 100 4325 100 486 100 1198

4.3 Performance Assessment of Method -I and Method- II

In order to assess the proposed methods, three different evaluation metrics were considered for the five datasets. Though several performance indexes and metrics exist, Kuncheva's Stability Index (KSI), the Balanced Classification Rate (BCR) and the Balanced Error Rate (BER) were chosen for the assessment. It was observed that the proposed methods provided suitable results compared to the original, unmodified algorithm. Tables 14-18 depict the results obtained by the metric evaluation for the five datasets: Colon, Leukemia, Medulloblastoma, Lymphoma and Prostate Cancer.


Table. 14. Performance Assessment for Colon dataset using Kuncheva's Stability Index(KSI), Balanced Classification Rate (BCR)

and Balanced Error rate (BER)

K-Value Index Metrics Resilient Propagation Back Propagation Manhattan Propagation
Original IG-TR CCA-TR Original IG-TR CCA-TR Original IG-TR CCA-TR

50 KSI 0.26 0.62 0.72 0.19 0.55 0.69 0.25 0.46 0.55

BCR 0.96 0.99 0.99 0.76 0.99 0.81 0.75 0.83 0.86

BER 0.04 0.01 0.01 0.24 0.01 0.19 0.25 0.17 0.14

100 KSI 0.28 0.57 0.69 0.24 0.52 0.70 0.28 0.44 0.53

BCR 0.93 0.99 0.99 0.93 0.96 0.97 0.77 0.84 0.81

BER 0.07 0.01 0.01 0.07 0.04 0.03 0.23 0.16 0.19

150 KSI 0.25 0.49 0.62 0.22 0.50 0.65 0.23 0.39 0.54

BCR 0.94 0.99 0.99 0.92 0.98 0.99 0.82 0.93 0.94

BER 0.06 0.01 0.01 0.08 0.02 0.01 0.18 0.07 0.06

200 KSI 0.25 0.51 0.63 0.24 0.45 0.66 0.25 0.47 0.51

BCR 0.94 0.99 0.99 0.92 0.94 0.96 0.86 0.88 0.87

BER 0.06 0.01 0.01 0.08 0.06 0.04 0.14 0.12 0.13

Table. 15. Performance Assessment for Leukemia dataset using Kuncheva's Stability Index(KSI), Balanced Classification Rate

(BCR) and Balanced Error rate (BER)

K-Value Index Metrics Resilient Propagation Back Propagation Manhattan Propagation
Original IG-TR CCA-TR Original IG-TR CCA-TR Original IG-TR CCA-TR

50 KSI 0.30 0.57 0.63 0.25 0.42 0.60 0.30 0.44 0.51

BCR 0.93 0.99 0.99 0.87 0.99 0.91 0.76 0.89 0.88

BER 0.07 0.01 0.01 0.13 0.01 0.09 0.24 0.11 0.12

100 KSI 0.26 0.46 0.62 0.21 0.39 0.52 0.32 0.45 0.49

BCR 0.97 0.99 0.99 0.93 0.99 0.99 0.89 0.93 0.90

BER 0.03 0.01 0.01 0.07 0.01 0.01 0.11 0.07 0.10

150 KSI 0.22 0.47 0.65 0.15 0.36 0.53 0.27 0.44 0.45

BCR 0.89 0.99 0.99 0.91 0.97 0.96 0.90 0.99 0.91

BER 0.11 0.01 0.01 0.09 0.03 0.04 0.10 0.01 0.09

200 KSI 0.19 0.48 0.62 0.18 0.33 0.49 0.29 0.48 0.50

BCR 0.99 0.99 0.99 0.89 0.90 0.92 0.83 0.90 0.90

BER 0.01 0.01 0.01 0.11 0.10 0.08 0.17 0.10 0.10

Table. 16. Performance Assessment for Medulloblastoma dataset using Kuncheva's Stability Index(KSI), Balanced Classification

Rate (BCR) and Balanced Error rate (BER)

K-Value Index Metrics Resilient Propagation Back Propagation Manhattan Propagation
Original IG-TR CCA-TR Original IG-TR CCA-TR Original IG-TR CCA-TR

50 KSI 0.22 0.61 0.69 0.22 0.57 0.72 0.23 0.42 0.68

BCR 0.98 0.99 0.99 0.73 0.99 0.82 0.62 0.81 0.82

BER 0.02 0.01 0.01 0.27 0.01 0.18 0.38 0.19 0.18

100 KSI 0.23 0.53 0.70 0.25 0.55 0.55 0.29 0.36 0.66

BCR 0.94 0.99 0.99 0.95 0.98 0.99 0.83 0.86 0.86

BER 0.06 0.01 0.01 0.05 0.02 0.01 0.17 0.14 0.14

150 KSI 0.27 0.55 0.72 0.21 0.49 0.52 0.28 0.31 0.62

BCR 0.95 0.99 0.99 0.94 0.99 0.99 0.86 0.90 0.92

BER 0.05 0.01 0.01 0.06 0.01 0.01 0.14 0.10 0.08

200 KSI 0.24 0.53 0.70 0.22 0.42 0.49 0.26 0.32 0.56

BCR 0.96 0.99 0.99 0.96 0.99 0.99 0.88 0.93 0.99

BER 0.04 0.01 0.01 0.04 0.01 0.01 0.12 0.07 0.01

Table 17. Performance Assessment for the Lymphoma dataset using Kuncheva's Stability Index (KSI), Balanced Classification Rate (BCR) and Balanced Error Rate (BER)

K-Value  Metric   Resilient Propagation        Back Propagation             Manhattan Propagation
                  Original  IG-TR  CCA-TR      Original  IG-TR  CCA-TR      Original  IG-TR  CCA-TR
50       KSI      0.29      0.55   0.68        0.18      0.47   0.71        0.22      0.42   0.55
         BCR      0.99      0.97   0.99        0.95      0.99   0.96        0.92      0.95   0.96
         BER      0.01      0.03   0.01        0.05      0.01   0.04        0.08      0.05   0.04
100      KSI      0.31      0.52   0.72        0.25      0.51   0.70        0.21      0.45   0.54
         BCR      0.94      0.97   0.96        0.97      0.99   0.97        0.90      0.91   0.93
         BER      0.06      0.03   0.04        0.03      0.01   0.03        0.10      0.09   0.07
150      KSI      0.27      0.55   0.71        0.27      0.55   0.68        0.22      0.49   0.52
         BCR      0.95      0.95   0.94        0.91      0.92   0.97        0.86      0.91   0.94
         BER      0.05      0.05   0.06        0.09      0.08   0.03        0.14      0.09   0.06
200      KSI      0.26      0.60   0.77        0.25      0.50   0.71        0.16      0.47   0.44
         BCR      0.97      0.99   0.94        0.91      0.92   0.92        0.85      0.94   0.89
         BER      0.03      0.01   0.06        0.09      0.08   0.08        0.15      0.06   0.11

Table 18. Performance Assessment for the Prostate Cancer dataset using Kuncheva's Stability Index (KSI), Balanced Classification Rate (BCR) and Balanced Error Rate (BER)

K-Value  Metric   Resilient Propagation        Back Propagation             Manhattan Propagation
                  Original  IG-TR  CCA-TR      Original  IG-TR  CCA-TR      Original  IG-TR  CCA-TR
50       KSI      0.13      0.36   0.52        0.20      0.34   0.52        0.24      0.28   0.48
         BCR      0.95      0.99   0.97        0.48      0.69   0.82        0.61      0.74   0.72
         BER      0.05      0.01   0.03        0.52      0.31   0.18        0.39      0.26   0.28
100      KSI      0.20      0.42   0.55        0.20      0.31   0.48        0.21      0.32   0.42
         BCR      0.96      0.99   0.97        0.91      0.97   0.97        0.68      0.81   0.75
         BER      0.04      0.01   0.03        0.09      0.03   0.03        0.32      0.19   0.25
150      KSI      0.21      0.50   0.58        0.19      0.34   0.50        0.19      0.32   0.47
         BCR      0.97      0.99   0.97        0.84      0.88   0.89        0.72      0.82   0.79
         BER      0.03      0.01   0.03        0.16      0.12   0.11        0.28      0.18   0.21
200      KSI      0.22      0.55   0.51        0.16      0.32   0.45        0.10      0.33   0.42
         BCR      0.96      0.99   0.97        0.83      0.85   0.89        0.74      0.82   0.85
         BER      0.04      0.01   0.03        0.17      0.15   0.11        0.26      0.18   0.15

From Tables 14-18, it can be observed that the two proposed methods provide quite satisfying outcomes compared to the original algorithm; indeed, the results are encouraging for all datasets at K = 50, 100, 150 and 200. In certain cases the two proposed methods do not differ much, whereas in others one of the two provides a clearly better result. For KSI, the results approach 1 (a large intersection between two subsets, i.e. a good gene subset selection) rather than 0 (an empty intersection, i.e. a bad selection). In our algorithm, the rank of the genes changes at each iteration, and in the last iteration the rank coincides exactly with the K selected genes. Hence, an average over all iterations except the last is considered: if n iterations take place in the TR algorithm (both the existing and the proposed versions), we average the KSI over the first (n-1) iterations. The last iteration is intentionally left out because the intersection of the subsets there is complete, which yields a KSI of 1 and would bias the average towards 1. The second metric, BCR, is another parameter for evaluating the proposed methods: the closer the result is to 1, the better the classification rate and the better the gene selection. The third metric, BER, can be obtained as (1 - BCR); the closer it is to 0, the better the gene subset selection, i.e. the error rate of any proposed method should approach 0. Also, for varying K there is some fluctuation in the results, and it can be stated that smaller values of K give better results than larger ones.
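As a concrete illustration of this evaluation protocol, the following minimal Python sketch computes KSI, BCR and BER as defined above. The function names are our own, and the reading that each iteration's top-K subset is compared against the final ranking, averaging over the first (n-1) iterations, is one plausible interpretation of the averaging described in the text, not a transcription of the authors' implementation.

```python
import numpy as np

def kuncheva_index(a, b, n_genes):
    # Kuncheva's stability index between two equal-size gene subsets a and b
    # drawn from n_genes candidates: (r - k^2/n) / (k - k^2/n), r = |a & b|.
    k = len(a)
    r = len(set(a) & set(b))
    expected = k ** 2 / n_genes
    return (r - expected) / (k - expected)

def average_ksi(iteration_subsets, n_genes):
    # Average KSI of each iteration's top-K subset against the final one,
    # excluding the last iteration itself: comparing the final subset with
    # itself always yields 1 and would bias the average towards 1.
    final = iteration_subsets[-1]
    kept = iteration_subsets[:-1]          # the first (n-1) iterations
    return float(np.mean([kuncheva_index(s, final, n_genes) for s in kept]))

def bcr_ber(tp, tn, fp, fn):
    # Balanced Classification Rate and Balanced Error Rate (BER = 1 - BCR).
    bcr = 0.5 * (tp / (tp + fn) + tn / (tn + fp))
    return bcr, 1.0 - bcr
```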

The two proposed methods allow us to select the few most relevant genes (e.g. with K = 50, 100, 150 or 200), based on which a visualization of the whole network can be created (using the GRN) and the interactions and relationships among the genes depicted. This may further be used to find the hub genes (i.e. genes that have the maximum number of interaction pathways, or edges, to other genes), which we may presumably assume to be the disease-causing genes; alternatively, we may try regulating the hub genes and observe the effect that other genes experience due to the alteration of these genes.
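Once a GRN has been built over the selected genes, hub detection can be as simple as ranking nodes by degree. The sketch below shows this under stated assumptions: the edge list, the gene names in the example and the networkx dependency are illustrative, not part of the proposed methods.

```python
import networkx as nx

def hub_genes(edges, top_n=10):
    # Build an undirected GRN from (gene_a, gene_b) interaction pairs and
    # return the top_n highest-degree nodes as candidate hub genes.
    g = nx.Graph()
    g.add_edges_from(edges)
    ranked = sorted(g.degree, key=lambda node_deg: node_deg[1], reverse=True)
    return [gene for gene, _ in ranked[:top_n]]

# Hypothetical usage with made-up interactions:
# hub_genes([("TP53", "MDM2"), ("TP53", "BAX"), ("MDM2", "BAX")], top_n=1)
```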

5. Summary

This paper can be summarized as follows:

1. To start with, we normalized the dataset using the min-max normalization process so that the original, existing TR algorithm could be applied.

2. From the existing TR algorithm, the ranks of the genes were extracted and the dataset was reordered according to the newly generated ranks. This dataset was then passed to a classifier and the classification accuracy was assessed.

3. As the TR algorithm is a powerful ranking technique, we restructured it slightly. Two different approaches were considered, and hence two new methods, IG-TR Gene Ranking and CCA-TR Gene Ranking, were proposed.

4. In the first method, the information gain of the (normalized) datasets was computed and the genes with the highest information content were selected. Based on this selection, the dataset was reordered in descending order of information gain. The existing TR algorithm was then applied to this re-designed dataset and ranks were generated (for randomly chosen values of K). The newly generated ranked set was passed to the classifiers (variants of ANN and SVM) and its accuracy was tested. It was found that by slightly changing the input pattern of the TR algorithm, the accuracy obtained was quite high.

5. Building on this finding, another method was proposed in which the dataset was left intact; instead, an entirely new scoring and ranking criterion for the TR algorithm itself was proposed.

6. Instead of Fisher's score, the scoring and ranking criterion of the traditional TR algorithm, we selected another statistical technique, the canonical correlation score, for generating the new rank list. It was observed that the ranks generated by this technique, and the new dataset formed from them, produced better classification accuracy than the traditional TR algorithm (minimal sketches of the normalization, information-gain and CCA-scoring steps are given after this list).

7. Finally, although the two methods provided better results than the traditional TR algorithm, they needed to be validated and assessed properly. Hence, KSI, BCR and BER were chosen as the three performance metrics, and it was shown that the two proposed techniques provided better validation results than the original algorithm.
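To make steps 1, 4 and 6 concrete, the sketch below shows plausible implementations of the three building blocks. The function names are ours; information gain is estimated here via mutual information with the class label, and for a single gene against a single class vector the first canonical correlation reduces to the absolute Pearson correlation. This is a sketch under those assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def min_max_normalize(x):
    # Step 1: scale each gene (column) of the expression matrix x to [0, 1].
    lo, hi = x.min(axis=0), x.max(axis=0)
    return (x - lo) / np.where(hi > lo, hi - lo, 1.0)

def reorder_by_information_gain(x, y):
    # Step 4 (Method-I preprocessing): rank genes by information gain with
    # respect to the class label y and reorder columns in descending order
    # before handing the data to the TR algorithm.
    ig = mutual_info_classif(x, y)
    order = np.argsort(ig)[::-1]
    return x[:, order], order

def cca_scores(x, y):
    # Step 6 (Method-II scoring ingredient): per-gene canonical correlation
    # with the class label. For one gene versus one response vector this
    # equals the absolute Pearson correlation.
    xc = x - x.mean(axis=0)
    yc = y - y.mean()
    num = np.abs(xc.T @ yc)
    den = np.linalg.norm(xc, axis=0) * np.linalg.norm(yc)
    return num / np.where(den > 0, den, 1.0)
```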

6. Conclusion and Future Directions

In this paper, two methods, IG-TR Gene Ranking and CCA-TR Gene Ranking, were proposed and assessed on five datasets. The basic aim of the proposed techniques was to rank the genes using a small number of randomly selected genes. The ranks generated by these techniques proved to be quite good, which was further validated by passing them through a classification stage. The accuracy of the classifiers provided a suitable and valid means of assessing the two proposed techniques. For our work, variants of the ANN classifier (Resilient Propagation, Back Propagation and Manhattan Propagation) together with the SVM classifier were selected, although any other classification technique could be chosen for measuring classification accuracy. It was observed that the accuracies of all the classifiers are more or less the same, but the proposed methods' accuracy was far better than that of the existing algorithm. For rank generation, different K values were considered, and it was concluded that choosing a small K yields a better ranking pattern for the genes. The two methods, along with the existing method, were further validated using different performance metrics, namely KSI, BCR and BER. Ultimately, the genes obtained from these methods can be used to construct Gene Regulatory Networks (GRNs), which we will consider in our future work.

References

[1] S. Mishra, D. Mishra, SVM-BT-RFE: An improved gene selection framework using Bayesian T-test embedded in support vector machine (recursive feature elimination) algorithm, Karbala International Journal of Modern Science, 1(2), pp.86-96, 2015.

[2] S. Mishra, D. Mishra, Methodologies for Modeling Gene Regulatory Networks, Encyclopaedia of Information Science and Technology, 3rd Edition, pp. 426-436, 2014.

[3] F. Leitner, M. Krallinger, S. L. Tripathi, M. Kuiper, A. Lægreid and A. Valencia, Mining cis-Regulatory Transcription Networks from Literature, Proc. of BioLINK Special Interest Group (ISMB/ECCB), pp. 5-12, 2013.

[4] G. Karlebach and R. Shamir, Modeling and analysis of gene regulatory networks, Nature Reviews Molecular Cell Biology, 9, pp. 770-780, 2008.

[5] V. Tyagi, A. Mishra, A survey on different feature selection methods for microarray data analysis, International Journal of Computer Applications, 67 (16), pp. 36-40, 2013.

[6] H.M. Alshamlan, G.H. Badr, Y.A. Alohali, The performance of bio-inspired evolutionary gene selection methods for cancer classification using microarray dataset, International Journal of Bioscience, Biochemistry and Bioinformatics, 4 (3), pp. 166-170, 2014.

[7] C.P. Lee, Y. Leu, A novel hybrid feature selection method for microarray data analysis, Applied Soft Computing, 11 (1), pp. 208-213, 2011.

[8] S. Maldonado, R. Weber, F. Famili, Feature selection for high dimensional class-imbalanced datasets using support vector machines, Information Science, 286, pp. 228-246, 2014.

[9] C. Lazar, J. Taminau, S. Meganck, D. Steenhoff, A. Coletta, C. Molter, V. de Schaetzen, R. Dugue, H. Bersini, A. Nowe, A survey on filter techniques for feature selection in gene expression microarray analysis, IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), 9(4), pp. 1106-1119, 2012.

[10] A. Abu Shanab, T.M. Khoshgoftaar, R. Wald, Evaluation of wrapper-based feature selection using hard, moderate, and easy bioinformatics data, Proc. on IEEE International Conference on Bioinformatics and Bioengineering (BIBE), pp. 149-155, 2014.

[11] F. Model, P. Adorjan, A. Olek, C. Piepenbrock, Feature selection for DNA methylation based cancer classification, Bioinformatics, 17 (Suppl. 1), pp. S157-S164, 2001.

[12] T. Li, C. Zhang and M. Ogihara, A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression, Bioinformatics, 20(15), pp. 2429-2437, 2004.

[13] P. A. Mundra, J. C. Rajapakse, Gene and sample selection for cancer classification with support vectors based t-statistic, Neurocomputing, 73, pp. 2353-2362, 2010.

[14] K. Kira, L.A. Rendell, The feature selection problem: traditional methods and a new algorithm, Proc. of the 10th National Conference on Artificial Intelligence, pp. 129-134, 1992.

[15] M. Pechenizkiy, S. Puuronen, A. Tsymbal, The impact of sample reduction on PCA-based feature extraction for supervised learning, Proc. of the 21st ACM Symposium on Applied Computing, pp. 553-558, 2006.

[16] R. Cavill, H. Keun, E. Holmes, J. Lindon, J. Nicholson, T. Ebbels, Genetic algorithms for simultaneous variable and sample selection in metabonomics, Bioinformatics, 25 (1), pp. 112-118, 2009.

[17] G.C. Cawley and N. L. C. Talbot, Gene selection in cancer classification using sparse logistic regression with Bayesian regularization, Bioinformatics, 22(19), pp. 2348-2355, 2006.

[18] J. D. Fitzgerald, R. J. Rowekamp, L. C. Sincich, T. O. Sharpee, Second Order Dimensionality Reduction Using Minimum and Maximum Mutual Information Models, PLoS ONE, 7(11), pp. 1-9, 2011.

[19] Y. Piao, M. Piao, K. Park and K. H. Ryu, An ensemble correlation-based gene selection algorithm for cancer classification with gene expression data, Bioinformatics, 28(24), pp.3306-3315, 2012.

[20] F.Nie, S. Xiang, Y. Jia, C. Zhang and S. Yan, Trace Ratio Criterion for Feature Selection, Proc. of the Twenty-Third AAAI Conference on Artificial Intelligence, pp. 671-676, 2008.

[21] M. Zhao, R. H. M. Chan, P. Tang, T. W. S. Chow, S. W. H. Wong, Trace Ratio Linear Discriminant Analysis for Medical Diagnosis: A Case Study of Dementia, IEEE Signal Processing Letters, pp. 1-10, 2013.

[22] D. Wang, F. Nie, and H. Huang, Unsupervised Feature Selection via Unified Trace Ratio Formulation and K-means Clustering (TRACK), Machine Learning and Knowledge Discovery in Databases, 8726, pp. 306-314, 2014.

[23] Y. Jia, F. Nie and C. Zhang, Trace Ratio Problem Revisited, IEEE Transactions on Neural Networks, 20(4), pp. 729-735, 2009.

[24] Gene Expression Omnibus (GEO), GSE8671 series, http://www.ncbi.nlm.nih.gov/geo/.

[25] Leukemia Set, http://www.github.com/Leukemia.gct.

[26] Broad institute, http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi.

[27] M.A. Shipp, K.N. Ross, D.G. Jackson, P. Tamayo, A.P. Weng, J.L. Kutok, R.C.T. Aguiar, M. Gaasenbeek, M. Angelo, M. Reich, G.S. Pinkus, T.S. Ray, M.A. Koval, K.W. Last, A. Norton, T.A. Lister, J. Mesirov, D.S. Neuberg, E.S. Lander, J.C. Aster, T.R. Golub, Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning, Nature Medicine, 8, pp. 68-74, 2002.

[28] D. Singh, P.G. Febbo, K. Ross, D.G. Jackson, J. Manola, C. Ladd, P. Tamayo, A.A. Renshaw, A.V. D'Amico, J.P. Richie, E.S. Lander, M. Loda, P.W. Kantoff, T.R. Golub, W.R. Sellers, Gene expression correlates of clinical prostate cancer behaviour, Cancer Cell, 1, pp. 203-209, 2002.

[29] G. Wu, J. Xu, Optimized Approach of Feature Selection Based on Information Gain, Proc. of 2015 International Conference on Computer Science and Mechanical Automation (CSMA), pp. 157-161, 2015.

[30] J. Dai, Q. Xu, Attribute selection based on information gain ratio in fuzzy rough set theory with application to tumor classification, Applied Soft Computing, 13(1), pp. 211-221, 2013.

[31] C. Shang, M. Li, S. Feng, Q. Jiang, J. Fan, Feature selection via maximizing global information gain for text classification, Knowledge-Based Systems, 54, pp. 298-309, 2013.

[32] Q. Long, M. Scavino, R. Tempone, S. Wang, Fast estimation of expected information gains for Bayesian experimental designs based on Laplace approximations, Computer Methods in Applied Mechanics and Engineering, 259, pp. 24-39, 2013.

[33] B. R. Frieden, R. A. Gatenby, Principle of maximum Fisher information from Hardy's axioms applied to statistical systems, Physical Review E, 88 (4), pp. 1-13, 2013.

[34] L. Zhu, L. Miao, D. Zhang, Iterative Laplacian Score for Feature Selection, Pattern Recognition, Communications in Computer and Information Science, 321, pp. 80-87, 2012.

[35] H. Wang, S. Yan, D. Xu, X. Tang, T. Huang, Trace Ratio vs. Ratio Trace for Dimensionality Reduction, Proc. of 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-8, 2007.

[36] C. Zu, D. Zhang, Canonical sparse cross-view correlation analysis, Neurocomputing, 191, pp. 263-272, 2016.

[37] S. Wang, J. Lu, X. Gu, B.A. Weyori, J. Yang, Unsupervised discriminant canonical correlation analysis based on spectral clustering, Neurocomputing, 171, pp. 425-433, 2016.

[38] A. Tenenhaus, C. Philippe, V. Frouin, Kernel Generalized Canonical Correlation Analysis, Computational Statistics & Data Analysis, 90, pp. 114-131, 2015.

[39] L. I. Kuncheva, A Stability Index for Feature Selection, Proc. of 25th IASTED Int. Conf. on Artificial Intelligence and Applications, pp. 390-395, 2007.

[40] B. R. Lauwerys, D. Hernández-Lobato, P. Gramme, J. Ducreux, A. Dessy, I. Focant, J. Ambroise, B. Bearzatto, A. N. Toukap, B. J. V. E. Elewaut, J. Gala, P. Durez, F. A. Houssiau, T. Helleputte, P. Dupont, Heterogeneity of Synovial Molecular Patterns in Patients with Arthritis, PLoS ONE, pp. 1-18, 2015.

[41] D.R. Pai, K.D. Lawrence, R.K. Klimberg, S.M. Lawrence, Analyzing the balancing of error rates for multi-group classification, Expert Systems with Applications, 39(17), pp. 12869-12875, 2012.

[42] M.M. Suarez-Alvarez, D.T. Pham, M.Y. Prostov, Y.I. Prostov, Statistical approach to normalization of feature vectors and clustering of mixed datasets, Proceedings of the Royal Society A, 468(2145), 2012.
