

Procedia Engineering 38 (2012) 1788 - 1792

International Conference on Modelling Optimization and Computing (ICMOC-2012)

Rough Set based Attribute Clustering for Sample Classification of

Gene Expression Data

Rudra Kalyan Nayak*a, Debahuti Mishrab, Kailash Shawc and Sashikala Mishrad

a,b,d Institute of Technical Education and Research, Siksha 'O' Anusandhan Deemed to be University, Bhubaneswar, Odisha, India
c Gandhi Engineering College, Bhubaneswar, Odisha, India

Abstract

Attribute clustering is an unsupervised data mining technique used to identify statistical dependence between subsets of variables, such that attributes within the same cluster have high similarity while attributes in different clusters have high dissimilarity. In this paper, we focus our discussion on rough set theory for attribute clustering. Rough set theory is a formalism for dealing with rough and uncertain knowledge: it analyzes the clusters and discovers data regularities even when no prior knowledge is available, providing a new method for data classification. Although numerous rough set and cluster analysis methods exist, data objects change continuously, so these technologies must be improved over time and new theory proposed in response to meet the demands of applications. Finally, the experimental results of our proposed algorithm, Rough Set based attribute Clustering for Sample Classification (RSCSC), are compared with some of the traditional attribute clustering methods, and the algorithm is shown to be efficient in finding meaningful, feasible and compact patterns.

© 2012 Published by Elsevier Ltd. Selection and/or peer-review under responsibility of Noorul Islam Centre for Higher Education

Keywords: Attribute clustering; Clustering; Rough set theory; k-means; k-medoids; DBSCAN


1. Introduction

Attribute clustering [1] plays a key role in data mining and is necessary for discovering meaningful patterns for sample analysis in biological research. Clustering is a main task of exploratory data mining and a common technique for statistical data analysis used in many fields, including machine learning, pattern recognition, information retrieval, and bioinformatics. Many classical clustering algorithms have been developed since cluster analysis was first studied [2-3], for example: hierarchy-based methods (AGNES, DIANA) [4]; partition-based methods (c-means, c-centre); density-based methods (DBSCAN, OPTICS, DENCLUE) [5]; grid-based methods (STING, WaveCluster, CLIQUE); model-based methods (statistical methods, neural network methods); constraint-based methods; and so on. After attribute clustering, the smaller number of samples can be analysed very easily. Rough set theory has received considerable attention in machine learning and pattern recognition in the last decade. It has been found useful in dealing with imperfect and inconsistent information [6], which is quite often encountered

* Corresponding author. Tel.: +91-9861366884; fax: +91-674-2351880. E-mail address: rudrakalyannayak@gmail.com

1877-7058 © 2012 Published by Elsevier Ltd. doi:10.1016/j.proeng.2012.06.219

in machine learning and data mining. The classical rough set model, proposed by Pawlak [6], relies on Boolean equivalence relations, where objects taking the same feature values are said to be indiscernible or equivalent. Pawlak's rough set model has been widely exploited in feature selection, attribute clustering, rule extraction, and reasoning in the presence of uncertainty [7-8]. In rough set-based [9-10] attribute clustering, the clusters are formed in such a way that a sample is not confined to a single cluster: the lower and upper approximations of the clusters are computed, and a sample may belong to more than one cluster. In this paper, we propose a rough set based algorithm for clustering features/attributes, which can further be used for sample classification. The layout of this paper is as follows: Section 2 discusses related work on attribute clustering; Section 3 covers preliminary concepts of attribute clustering and rough sets; Section 4 gives a schematic representation of our proposed model; Section 5 presents the experimental evaluation and result analysis; and finally, Section 6 gives the conclusion and future work.

2. Related Work

K. Thangavel et al. [11] considered an information system without any decision attribute and applied the k-means algorithm to cluster the given information system for different values of k. Shuchita Upadhyaya et al. [12] used reducts from rough set theory to generate patterns: they remove the reduct attributes and formulate a pattern by concatenating the most contributing attributes, using the benchmark mushroom dataset from the UCI repository. Kun Niu et al. [13] discuss the problem of automatic subspace clustering, proposing a subspace-clustering-via-attribute-clustering model to find clusters embedded in subspaces of high-dimensional datasets. Shifei Ding et al. [14] proposed a fuzzy kernel clustering algorithm based on improved rough set attribute reduction, in which they optimised the cluster samples and obtained the clustering result with the fuzzy c-means algorithm.

3. Preliminaries

3.1. Attribute Clustering: Cluster analysis is the task of assigning a set of objects into groups (called clusters) so that the objects in the same cluster are more similar (in some sense or another) to each other than to those in other clusters. Clustering can be roughly distinguished into (a) hard clustering, in which each object either belongs to a cluster or not, and (b) soft clustering (also called fuzzy clustering), in which each object belongs to each cluster to a certain degree [15].

3.2. Rough Set Theory: Rough set theory [1] is meant for dealing with vague and imprecise knowledge. Suppose we are given a set of objects U, called the universe, and an indiscernibility relation R ⊆ U × U representing our lack of knowledge about elements of U. For the sake of simplicity we assume that R is an equivalence relation. Let X be a subset of U. We want to characterize the set X with respect to R. The lower approximation of X with respect to R is the set of all objects which can be classified with certainty as belonging to X with respect to R. The upper approximation of X with respect to R is the set of all objects which can possibly be classified as belonging to X with respect to R. The boundary region of X with respect to R is the set of all objects which can be classified neither as X nor as not-X with respect to R. The set X is crisp (exact with respect to R) if its boundary region is empty, and rough (inexact with respect to R) if its boundary region is nonempty. Thus, a set is rough (imprecise) if it has a nonempty boundary region; otherwise the set is crisp (precise). Formal definitions of the approximations and the boundary region are as follows:

R-lower approximation of X: $\underline{R}(X) = \bigcup_{x \in U} \{ R(x) : R(x) \subseteq X \}$  (1)

R-upper approximation of X: $\overline{R}(X) = \bigcup_{x \in U} \{ R(x) : R(x) \cap X \neq \emptyset \}$  (2)

R-boundary region of X: $RN_R(X) = \overline{R}(X) - \underline{R}(X)$  (3)
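To make the approximation operators in Eqs. (1)-(3) concrete, the following minimal Python sketch computes the lower approximation, upper approximation and boundary region of a target set from a small hypothetical attribute table. The table and the target set are illustrative examples only, not data from the paper (whose experiments were implemented in MATLAB).

```python
# Minimal sketch of Pawlak's approximations over an illustrative attribute table.
from collections import defaultdict

def equivalence_classes(table):
    """Group objects that take identical attribute values (indiscernibility relation R)."""
    classes = defaultdict(set)
    for obj, values in table.items():
        classes[tuple(values)].add(obj)
    return list(classes.values())

def approximations(table, X):
    """Return the R-lower approximation, R-upper approximation and boundary region of X."""
    X = set(X)
    lower, upper = set(), set()
    for eq in equivalence_classes(table):
        if eq <= X:      # R(x) ⊆ X  -> certainly in X
            lower |= eq
        if eq & X:       # R(x) ∩ X ≠ ∅ -> possibly in X
            upper |= eq
    return lower, upper, upper - lower

# Hypothetical example: four objects described by two attributes
table = {"o1": (1, 0), "o2": (1, 0), "o3": (0, 1), "o4": (1, 1)}
print(approximations(table, {"o1", "o3"}))   # lower {o3}, upper {o1, o2, o3}, boundary {o1, o2}
```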

3.3. Rough set-based attribute clustering: Rough set-based attribute clustering derives patterns from the clusters. The rough set technique (RST) divides the data into indiscernible classes. RST has a natural appeal for clustering, since these indiscernible classes can be treated as clusters. Moreover, RST also performs automatic concept approximation by producing a minimal subset of attributes (a reduct) that can distinguish all the entities in the dataset. Our aim is to generate a pattern for each individual cluster; the attributes in those clusters play a significant role in pattern generation. A pattern is then formulated as the conjunction of the major contributing attributes. The effectiveness of the approach is established with the help of the Leukemia dataset [16].

4. Proposed Attribute Clustering Model

Fig. 1. Proposed Rough set-based model for attribute clustering

In our proposed approach, as shown in Fig. 1, the Leukemia dataset is first collected from the Broad Institute [16]. Secondly, the traditional clustering algorithms such as k-means [17], k-medoids [18] and hierarchical [4] methods are implemented on this data set. Then the proposed rough set based method is applied for attribute clustering and, finally, the comparison between the traditional attribute clustering methods and the proposed method is carried out.

5. Experimental Evaluation and Result Analysis

In this section, we perform an extensive experimental study of the efficiency and effectiveness of our proposed algorithm on the real-life leukemia data set composed of 72 genes and 3859 samples [16]. All the experiments were carried out on an Intel Dual Core machine with 2 GB HDD. The operating system used is Microsoft Windows XP and all programs are written in MATLAB 8.0. Our experiment is arranged in four steps.

Step 1: Reducing the data matrix using Principal Component Analysis (PCA): Using PCA, the input dimensions are transformed to a new co-ordinate system in which the resulting dimensions, called PCs (principal components), are ordered by decreasing variance [19]. To eliminate the weaker components from the PC set, the corresponding variance, percentage of variance and cumulative variance in percentage are calculated. Then only the PCs having variances greater than the mean variance are retained, ignoring the others. As the mean variance obtained for the data set is 1, the first four PCs are retained for interpretation. A transformation matrix is then created from these retained PCs and applied to the normalized data set to produce a new reduced projected dataset with 72 genes and 72 samples, which is used for further analysis in Step 2.
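The sketch below illustrates the PC-selection rule of Step 1 (retain components whose variance exceeds the mean variance) in Python/NumPy. The file name, matrix orientation and shapes are assumptions made for illustration; this is not the paper's actual preprocessing script, which was written in MATLAB.

```python
# Illustrative PCA reduction: keep PCs with above-average variance and project the data.
import numpy as np

X = np.loadtxt("leukemia_expression.csv", delimiter=",")  # hypothetical samples x genes matrix
Xc = X - X.mean(axis=0)                                    # center each column

# PCA via SVD of the centered matrix
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
variances = (s ** 2) / (Xc.shape[0] - 1)                   # variance explained by each PC

keep = variances > variances.mean()                        # retain PCs with above-mean variance
X_reduced = Xc @ Vt[keep].T                                # projected data set used in Step 2
print("retained components:", keep.sum(), "reduced shape:", X_reduced.shape)
```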

Step 2: Normalizing the reduced data set: Using the normalization process, the reduced data values are scaled so as to fall within a small specified range of 0 to 1, so that an attribute with a higher domain value does not dominate an attribute with a lower domain value. An attribute value v of an attribute A is normalized to v' using (4):

$v' = (v - \min_A) / (\max_A - \min_A)$  (4)
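A minimal sketch of the min-max normalization of Eq. (4), applied column-wise; the guard for constant columns is our own addition and is not described in the paper.

```python
# Min-max normalization per Eq. (4): scale each attribute (column) of X into [0, 1].
import numpy as np

def min_max_normalize(X):
    mins, maxs = X.min(axis=0), X.max(axis=0)
    span = np.where(maxs > mins, maxs - mins, 1.0)   # avoid division by zero for constant columns
    return (X - mins) / span
```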

Step 3: Application of traditional clustering algorithms for attribute clustering: We have applied the k-means [1-2], k-medoids [1-3] and hierarchical [1-4] clustering methods to our reduced data set. The results of these attribute clusterings are shown in Fig. 2(a), Fig. 2(b) and Fig. 2(c).
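For illustration, the sketch below runs two of the traditional algorithms (k-means and agglomerative hierarchical clustering) on a placeholder reduced matrix, clustering the attributes by transposing the data; k-medoids can be run analogously with a suitable library. The random matrix, the choice of k = 4 and the linkage method are assumptions, not the paper's actual settings.

```python
# Sketch of Step 3: traditional clustering applied to attributes (columns) of the reduced matrix.
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(72, 72)     # placeholder for the reduced 72 x 72 data set
A = X.T                        # rows now correspond to attributes

kmeans_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(A)
hier_labels = fcluster(linkage(A, method="average"), t=4, criterion="maxclust")

print(kmeans_labels[:10], hier_labels[:10])
```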

Step 4: Implementation of the proposed RSCSC for attribute clustering: This phase is composed of several sub-phases, as described below. Using our proposed algorithm RSCSC, the reduced dataset is iteratively compared with the threshold value. If the distance of a data object is less than the threshold value, it is put in the upper approximation; otherwise it is put in the lower approximation. Then the new mean of each cluster is computed, until there are no more samples to assign. Our proposed algorithm, Rough Set-based Attribute Clustering for Sample Classification (RSCSC), is useful for rule generation from incomplete data sets. This method does not confine a sample to one cluster; rather, a sample may belong to more than one cluster (overlapping is permitted), which gives greater expressive power for analysis.


Fig. 2. Attribute clustering for the reduced data set using (a) k-means, (b) k-medoids, (c) hierarchical clustering

The Proposed Algorithm

Algorithm: RSCSC (reduced gene expression matrix)
Return value: number of clusters, similarity matrix

Step 1: Normalize the reduced dataset.
Step 2: Call the clustering function db.
Step 3: Set the cluster count to zero.
Step 4: After the clusters are found, compute the mean of the n-1 clusters.
Step 5: For each sample s_i, compute the similarity matrix gsim.
Step 6: n = 1;
        for each sample s_i
            n = n + 1;
            place the ith sample in the nth cluster;
            for each sample j ≠ i
                compute the similarity gsim(i, j) of the ith sample with the jth sample using the correlation coefficient metric;
                if gsim(i, j) > p then assign j to the nth cluster;   // p is the threshold value
Step 7: Go to Step 2.
Step 8: Allocate each data object D to the lower approximation or the upper approximation by finding the difference of its distances from the centroids of each cluster pair m_i and m_j: | d(D, m_i) - d(D, m_j) |.
Step 9: If this difference is less than some threshold, D is placed in the upper approximation and not in the lower approximation; otherwise D is placed in the lower approximation.
Step 10: Compute the new mean for each cluster and iterate until there are no more samples for assignment.
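To clarify how the steps above fit together, here is a condensed, hypothetical Python sketch of the RSCSC idea: correlation-based seeding of clusters (Step 6) followed by the rough lower/upper assignment rule (Steps 8-10). The threshold values p and delta, and all implementation details, are assumptions for illustration; the paper's own implementation is in MATLAB and is not reproduced here.

```python
# Condensed sketch of the RSCSC idea (assumed parameters, not the paper's code).
import numpy as np

def rscsc_sketch(X, p=0.8, delta=0.1):
    n = X.shape[0]
    gsim = np.corrcoef(X)                        # similarity matrix via correlation coefficient

    # Step 6: greedy seeding -- each unassigned sample opens a cluster and pulls in
    # all samples whose correlation with it exceeds the threshold p.
    clusters, assigned = [], np.zeros(n, dtype=bool)
    for i in range(n):
        if not assigned[i]:
            members = np.where((gsim[i] > p) | (np.arange(n) == i))[0]
            clusters.append(list(members))
            assigned[members] = True

    means = np.array([X[c].mean(axis=0) for c in clusters])

    # Steps 8-10: rough assignment to lower/upper approximations.
    lower = [set() for _ in clusters]
    upper = [set() for _ in clusters]
    for d in range(n):
        dist = np.linalg.norm(X[d] - means, axis=1)
        order = np.argsort(dist)
        nearest = order[0]
        second = order[1] if len(order) > 1 else order[0]
        if abs(dist[nearest] - dist[second]) < delta:
            upper[nearest].add(d)                # ambiguous object: upper approximations only
            upper[second].add(d)
        else:
            lower[nearest].add(d)                # clear-cut object: lower (and upper) approximation
            upper[nearest].add(d)
    return gsim, lower, upper
```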

The aforesaid algorithm takes the gene expression matrix of the Leukemia dataset [16] as input. The algorithm iterates until there is no remaining sample to assign. During the iteration process it periodically revises the similarity matrix and recomputes the centroid by taking the mean of the lower and upper approximations of each cluster. Finally, it returns the number of clusters formed, shown in Fig. 3, and the time taken (in milliseconds), shown in Fig. 4. From the experiment, we can see that when the rough set-based attribute clustering method is applied, the clustering time is reduced compared with the traditional methods, and the clusters generated from the samples are more numerous and more compact. The experimental comparison is shown in Fig. 4.


Fig. 3. Clusters obtained using the proposed RSCSC method

Fig. 4. Performance comparison of the proposed model with existing methods

6. Conclusion and Future Work

The proposed RSCSC algorithm for gene expression data has emerged as an efficient approach for finding meaningful patterns in very high-dimensional datasets. We achieved comparably better results than the traditional clustering methods. The proposed work can be enhanced by embedding optimization techniques in our approach to optimize it in terms of both time complexity and space complexity.

References

[1] Jiang, D., C. Tang and A. Zhang. Cluster analysis for gene expression data: A survey. IEEE Trans. Knowledge Data Eng. 2004; 16: 1370-1386.

[2] Sun Jigui, Li Jie and Zhao Lianyu. Clustering Algorithms Research. Journal of Software. 2008; 19:1: 48-61.

[3] Ding Shifei, Xu Li, Zhu Hong. Research and Progress of Cluster Algorithms based on Granular Computing. JDCTA. 2010: 4: 5: 96-104.

[4] E. B. Fowlkes and C. L. Mallows. A Method for Comparing Two Hierarchical Clustering. Journal of the American Statistical Association. 1983; 78: 553-569.

[5] Hans-Peter Kriegel, Peer Kröger, Jörg Sander, Arthur Zimek. Density-based Clustering. WIREs Data Mining and Knowledge Discovery. 2011; 1: 3: 231-240

[6] Z. Pawlak. Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers. 1991.

[7] R. Jensen and Q. Shen. Fuzzy-Rough Sets Assisted Attribute Selection. IEEE Trans. Fuzzy Systems. 2007; 15:1: 73-89.

[8] Q.H. Hu, D.R. Yu, and Z.X. Xie. Information-Preserving Hybrid Data Reduction Based on Fuzzy-Rough Techniques. Pattern Recognition Letters. 2006; 27: 5: 414-423.

[9] D. Dubois and H. Prade. Rough Fuzzy Sets and Fuzzy Rough Sets. Int. J. General Systems. 1990; 17: 2/3: 191-209.

[10] A.M. Radzikowska and E.E. Kerre. A Comparative Study of Fuzzy Rough Sets. Fuzzy Sets and Systems. 2002; 126: 137-155.

[11] K. Thangavel, Qiang Shen, A. Pethalakshmi. Application of Clustering for Feature Selection Based on Rough Set Theory Approach. The International Journal of Artificial Intelligence and Machine Learning (AIML) Journal. 2006; 6: 1: 19-27.

[12] Shuchita Upadhyaya, Alka Arora, Rajni Jain, Deriving Cluster Knowledge Using Rough set Theory. Journal of Theoretical and Applied Information Technology. 2008; 4:8: 688-696.

[13] Kun Niu, Shubo Zhang, Junliang Chen. Subspace clustering through attribute clustering. 2008; 3: 1: 44-48.

[14] Shifei Ding, Li Xu, Hong Zhu, Fengxiang Jin. A Fuzzy Kernel Clustering Algorithm Based On Improved Rough Set Attribute Reduction. 2011; 3: 6: 199-206.

[15] Zhang Huizhe, Wang Jian, Mei Hongbiao. Attribute Reduction of Fuzzy Rough Sets based on Variable Similar degree. Pattern Recognition and Artificial Intelligence. 2009; 22: 3:393-399.

[16] http://www.broad.nrit.edu/cgi-bin/cancer/publications/pub_paper.cgi?mode=view&paper_id=43

[17] Z. Huang. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery. 1998; 2: 283-304.

[18] R. Ng and J. Han. Efficient and effective clustering method for spatial data mining. In: Proceedings of the 20th VLDB Conference. 1994; 144-155.

[19] D. Mishra, K. Shaw, S. Mishra, A. K .Rath, M . Acharya. Gene Expression Network Discovery: A Pattern based Biclustering. International Conference on Communication, Computing & Security, 2011; 307-312.