

Available online at www.sciencedirect.com


Procedia Engineering 38 (2012) 3270 - 3277

www.elsevier.com/locate/procedia

International Conference On Modelling, Optimization And Computing (ICMOC 2012)

User Profile Tracking by Web Usage Mining in Cloud Computing

In a service-oriented domain such as cloud computing, composite service selection is hard to accomplish effectively. Given the high scalability of distributed systems and the large volume of transactional databases, it is vitally important to identify efficient methods for distributed mining. Web researchers have given great attention to web usage mining, an evolution of the internet that evaluates user profiles and the knowledge sharing of web content. This paper describes the relationship between local and global transaction data items. It extracts valuable information from the large amount of data available in user web sessions. To segment the data, the distance between two user sessions is measured with the help of similarity distance measures. The aim of this paper is to introduce a technique called the registry federation mechanism to meet the associated business requirements, and it also describes a distributed algorithm for sequential mining within the registry federation.

© 2012 Published by Elsevier Ltd. Selection and/or peer-review under responsibility of Noorul Islam Centre for Higher Education

Keywords: Cloud computing, web usage mining, registry federation, distributed algorithm

1. INTRODUCTION

With the explosive growth of knowledge sources available on the World Wide Web, it has become important to find useful information within the huge amount of data. At the same time, the growing number of websites presents a challenging task for web designers: organizing the contents of a website to meet the needs of its users. Path analysis using web user navigation patterns can provide solutions to these problems.

Joan M. John (Research Scholar), G. Venifa Mini (Assistant Professor), E. Arun (Associate Professor), Department of Computer Science and Engineering, Noorul Islam University, Kumaracoil, Thuckalay, Tamil Nadu, India

1877-7058 © 2012 Published by Elsevier Ltd. doi:10.1016/j.proeng.2012.06.378

Cloud computing [2] enables users to access shared resources somewhere on the internet. A user can search for data and retrieve the files containing it. Different users can access different resources at any time. The user profile is created by tracking the frequent accessing of resources [3]. The similarity of sessions and accessing details can be obtained by finding the distance between two user sessions. This is enabled and processed with the help of web log data.

Web usage mining is described as the use of data mining techniques to automatically discover and extract useful information from web documents and services [8]. Web usage mining draws on details such as web server access logs, proxy server logs, referrer logs, browser logs, error logs, user profiles, registration data, cookies, user queries, and bookmark data. When people visit a website, they leave behind data such as IP address, pages visited, visiting time and so on; web usage mining collects, analyzes and processes the logs that record these data. By analyzing these log files we can discover interesting usage patterns and information.
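As a minimal illustration of the kind of log data described above, the sketch below parses one access-log line in the common Apache/NCSA format into the fields (IP address, visiting time, page, status) that the mining steps rely on. The regular expression, field names, and sample line are illustrative assumptions for this sketch, not taken from the paper.

```python
import re
from datetime import datetime

# Illustrative pattern for Apache/NCSA "common log format" lines.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<page>\S+) \S+" (?P<status>\d+)'
)

def parse_log_line(line):
    """Extract (ip, timestamp, page, status) from one access-log entry."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None  # malformed entry; a real pipeline would log and skip it
    ts = datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z")
    return m.group("ip"), ts, m.group("page"), int(m.group("status"))

# Hypothetical log entry:
line = '10.0.0.1 - - [21/Jul/2012:10:15:32 +0000] "GET /catalog/item3 HTTP/1.1" 200'
ip, ts, page, status = parse_log_line(line)
```

In practice the remaining fields of the log record (referrer, user agent, cookies) would be captured the same way and fed into the pre-treatment step described in Section 3.2.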

2. CLOUD COMPUTING

2.1. Concept of cloud computing

Cloud computing is a technology which provides vast computing resources that can be accessed as a pay-per-use service. Virtualization plays an important role in cloud computing. Everything is moving into the cloud and can be maintained by the cloud service provider.

The architecture contains three types of participants: the RM-Cloud (registry management cloud), the SL-Cloud (services log cloud) and the cloud customer [7].

The RM-Cloud is responsible for administering the whole infrastructure, creating an SL-Cloud for each registered service registry, and establishing and maintaining the service registry federation. For every service registry centre, an SL-Cloud has to be created to record and extract accessing information.

Fig 1. (a) Cloud based service Registry Federation (RF). (b) Features of Web Usage Mining

SOAP intermediaries collect the information required for service accessing, which is kept in a log database. Service mining tools use the log database to support selection and ensemble of services. A DSR (distributed service registry) is a type of service registration centre on the internet. The registries-repository (R-R database) provides service registration information and maintains a user sessions repository.

3. WEB USAGE MINING

3.1. Concept of web usage mining

Web servers record and accumulate data about user interactions whenever requests for resources are received. Analyzing the web access logs of different web sites can help understand the user behaviour and the web structure, thereby improving the design of this colossal collection of resources. There are two main tendencies in Web Usage Mining driven by the applications of the discoveries: General Access Pattern Tracking and Customized Usage Tracking.

Web Usage Mining is to mine data from log record on web page [5]. Log records give useful information such as URL, IP address and time and so on. Analyzing and discovering Log could help us to find more potential customers and trace service quality and so on.

Web usage mining is the process of applying data mining technology to web data; it extracts the patterns that users are interested in from their network behaviours. When a person visits a website, he leaves data such as IP address, pages visited, visiting time and so on; web usage mining collects, analyzes and processes the log and the recorded data. From these, mathematical methods are used to establish models of the users' behaviour and interests, and these models are used to understand user behaviour and thus improve the website structure, finally providing a better characteristic information service for the user.

3.2. Approach of web usage mining

Web usage mining generally includes the following steps: data collection, data pre-treatment, establishing an interest model, and pattern analysis.

• Data collection

Data collection is the first step of web usage mining; the authenticity and integrity of the data directly affect how smoothly the following steps are carried out and the quality of the final recommendation of the characteristic service. Therefore scientific, reasonable and advanced technology must be used to gather the various data. At present, web usage mining technology has three main data origins: server data, client data and middle data (agent server data and packet detecting).

• Data pre-treatment

Some databases are insufficient, inconsistent and noisy. Data pre-treatment carries out a unifying transformation on those databases so that the database becomes integrated and consistent, thus establishing a database that may be mined. Data pre-treatment work mainly includes data cleaning, user recognition, user session recognition and data formatting.

• Establishing an interest model

Statistical methods are used to analyze and mine the pre-treated data. We may discover the interests of a user or a user community and then construct an interest model. At present the commonly used machine learning methods are clustering, classification, relation discovery and sequential pattern discovery. Each method has its own strengths and shortcomings, but the most effective methods at present are classification and clustering.

• Pattern analysis

Carry out further analysis and induction on the interesting patterns that have already been established. First, delete the less significant rules or models from the interest model repository; next, use technologies such as OLAP to carry out comprehensive mining and analysis; then make the discovered data or knowledge visible; finally, provide the characteristic service to the electronic commerce website.
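Among the steps above, user session recognition (part of data pre-treatment) is the one the later distance measure depends on. A minimal sketch follows, assuming users are recognised by IP address and that a conventional 30-minute inactivity timeout closes a session; both are common heuristics assumed here, not prescribed by the paper.

```python
from datetime import timedelta

# 30-minute inactivity timeout: a common sessionization heuristic.
SESSION_TIMEOUT = timedelta(minutes=30)

def sessionize(records):
    """Group (ip, timestamp, page) records into per-user page sessions.

    Users are recognised by IP address; a new session starts whenever the
    gap between two consecutive requests from the same IP exceeds the
    timeout. Returns {ip: [session, session, ...]} with each session a
    list of pages in visiting order.
    """
    sessions = {}   # ip -> list of sessions
    last_seen = {}  # ip -> timestamp of that user's previous request
    for ip, ts, page in sorted(records, key=lambda r: r[1]):
        if ip not in sessions or ts - last_seen[ip] > SESSION_TIMEOUT:
            sessions.setdefault(ip, []).append([])  # open a new session
        sessions[ip][-1].append(page)
        last_seen[ip] = ts
    return sessions
```

The resulting page sequences are exactly the sessions s1, s2, ... compared by the distance measure of Section 5.3.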

4. DEVELOPMENT OF SERVICE REGISTRY FEDERATION

Different service registration centres are strongly associated in cross-organizational and cross-field business integration. It is very useful to divide these centres into many RFs, which tremendously decreases the mining space and increases the mining accuracy. The Registry Ontology (RO) is established and controlled by the Registry Management Cloud (RM-Cloud). The RM-Cloud develops a Service Log Cloud (SL-Cloud) for each new registry centre, and issues the updated RO to the other SL-Clouds in the federation.

5. ASSOCIATE MINING RULES BASED ON SERVICE REGISTRY (AMR-SR)

In the federation, SL-Cloud information is recorded in a log database, which also contains interaction information of different centres. Effectively mining such information can therefore greatly increase the efficiency of service selection and ensemble. This section provides methods for association rule mining over the relevant distributed log databases in the federation.

5.1 Definition of Record of Log Database

Various pieces of information, as specified by Definition 1, are recorded for each interaction among SOAP intermediaries in order to obtain the log database used by the mining process. The same recording format is used in the log database of every SL-Cloud to simplify the association mining algorithm.

Definition 1: An XML Record is one data record of a single service execution in the log database and may be expressed as a six-tuple (Composite Services ID, Instance ID, Services ID, Type, Time Stamp, Status).

Composite Services ID: This composite service ID may be those BPEL document-based modes representing an abstract process of composite service.

Instance ID: This composite service execution instance ID is a unique execution example of identifying composite service.

Services ID: This service instance corresponding to abstract service in composite service may be uniquely identified by URI of example service.

Type: Type of SOAP information represents modes of request or response.

Time Stamp: It represents current moment of executing Services ID service.

Status: It represents the condition of success or failure of service request or response.
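The six-tuple of Definition 1 could be represented as a simple record type, sketched below. The Python field names are paraphrases of the paper's field names, and the sample values are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum

class SoapType(Enum):
    """Mode of the SOAP message: request or response."""
    REQUEST = "request"
    RESPONSE = "response"

@dataclass
class XmlRecord:
    """Six-tuple log record of Definition 1 (field names paraphrased)."""
    composite_service_id: str  # abstract BPEL process of the composite service
    instance_id: str           # unique execution instance of that process
    service_id: str            # URI uniquely identifying the concrete service
    soap_type: SoapType        # request or response
    time_stamp: float          # moment at which the service was executed
    success: bool              # success or failure of the request/response

# Hypothetical record for one service invocation:
rec = XmlRecord("CS-travel-booking", "inst-0042",
                "http://example.org/services/flight-search",
                SoapType.REQUEST, 1341792000.0, True)
```

Each SOAP intermediary would append one such record to its SL-Cloud log database per request and per response.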

5.2 AMR-SR

Suppose there are n SL-Clouds in the federation with relevant log databases {DB1, DB2, ..., DBn}. The transaction set of the association mining process is the set of composite service execution instances for every DBi, where Ii = {WSi(j) ∈ DBi | j = 1, 2, ..., m} and WSi(j) represents the specific service corresponding to the DBi transaction set.

Apriori, the classical algorithm for association rule mining, is limited to centralized data sets. The key problem of association rule mining over a distributed data set is how to reduce the quantity of information transmitted among different nodes. Several classical algorithms are available; the proposed work selects FDM as the prototype to construct the AMR-SR algorithm in the context of the cloud.

Table 1: AMR-SR notations and explanations

Notation   Explanation
Di         Number of transactions in DBi
S          Support threshold (minsup)
L(k)       Globally large k-itemsets
X.Sup      Global support count of X
CA(k)      Candidate set generated from L(k-1): CA(k) = Apriori_gen(L(k-1))
GLi(k)     Globally large k-itemsets at site Si
CGi(k)     Candidate sets generated from GLi(k-1)
LLi(k)     Locally large k-itemsets in CGi(k)
X.Supi     Local support count of X at site Si

The aim of the AMR-SR algorithm [6] is to generate a local candidate set for each node, from which the overall candidate set is created according to Theorem 1. To reduce information transmission between nodes, two techniques are used: local candidate set pruning and overall candidate set pruning.

Theorem 1: For all k > 1, the following formula is true:

L(k) ⊆ CG(k) = ∪(i=1..n) CGi(k) = ∪(i=1..n) Apriori_gen(GLi(k-1))

AMR-SR algorithm:

Input: transaction database DBi, support threshold S
Output: L, the set of all globally large itemsets
Method: iteratively execute the following program fragment distributively at each cloud. The algorithm terminates when either L(k) = ∅ or the set of candidate sets CG(k) = ∅.

Begin
  /* generate candidate sets */
  if k = 1 then
    Ti(1) = get_local_count(DBi, ∅, 1)
  else {
    CGi(k) = Apriori_gen(GLi(k-1));
    Ti(k) = get_local_count(DBi, CGi(k), i)
  }
  /* local candidate set pruning and overall candidate set pruning */
  for all X ∈ Ti(k) do
    if X.Supi >= S * Di then
      insert <X, X.Supi> into LLi(k)
  /* broadcast LLi(k); compute globally large k-itemsets */
  for j = 1 to n do send LLi(k) to cloud Sj;
  receive LLj(k) from the other clouds;
  for all X ∈ ∪j LLj(k) do {
    X.Sup = Σ(i=1..n) X.Supi;
    if X.Sup >= S * D then insert X into Gi(k)
  }
  /* broadcast results */
  broadcast Gi(k);
  receive Gj(k) from all other clouds Sj (j ≠ i);
  L(k) = ∪i Gi(k);
  divide L(k) into GLi(k);
  return L(k)
End
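The core idea of one AMR-SR/FDM round — each site keeps only its locally large itemsets before "broadcasting" them, so that global support is summed only for surviving candidates — can be simulated on a single machine as below. The helper names (`local_counts`, `globally_large`) and the toy databases are this sketch's own assumptions; the real algorithm exchanges messages between clouds rather than lists in memory.

```python
from collections import Counter
from itertools import combinations

def local_counts(db, k):
    """Support counts of all k-itemsets occurring in one site's transactions."""
    counts = Counter()
    for txn in db:
        for itemset in combinations(sorted(set(txn)), k):
            counts[itemset] += 1
    return counts

def globally_large(site_dbs, minsup, k):
    """One AMR-SR/FDM-style round, simulated on a single machine.

    Each site keeps only its locally large k-itemsets (X.Sup_i >= S * D_i)
    and 'broadcasts' them; global support is then summed only for the
    surviving candidates, which is what cuts the transmission volume.
    """
    d_total = sum(len(db) for db in site_dbs)
    counts = [local_counts(db, k) for db in site_dbs]
    candidates = set()
    for db, c in zip(site_dbs, counts):  # local candidate set pruning
        candidates |= {x for x, n in c.items() if n >= minsup * len(db)}
    return {x for x in candidates        # global check: X.Sup >= S * D
            if sum(c[x] for c in counts) >= minsup * d_total}

db1 = [["a", "b"], ["a", "b"], ["a", "c"]]   # site S1
db2 = [["a", "b"], ["b", "c"], ["a", "b"]]   # site S2
large = globally_large([db1, db2], minsup=0.5, k=2)
```

With minsup = 0.5 only the itemset ("a", "b") is locally large at both sites and survives the global check; ("a", "c") and ("b", "c") are pruned locally and never transmitted.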

5.3 Sequence Alignment Based Distance Measure (SABDM)

Two sessions are compared to get the distance between them [1]. If the pages are the same, a match is found; otherwise they mismatch, indicating that an insertion, deletion or substitution operation is required to make both sessions exactly similar to each other. Suitable scores must be assigned for match and mismatch.

Scores are assigned for both match and mismatch suitably, say match = 2 and mismatch = -1, in a score matrix. The distance between two web user sessions is computed based on the number of alignments obtained for pages in the two sessions being compared.

Algorithm: Sequence Alignment Based Distance Measure (SABDM)

Input: sessions s1 = (pr1, pr2, ..., prm), s2 = (pc1, pc2, ..., pcn), match = 2, mismatch = -1, similarity-count = 0
Output: distance d between s1 and s2

Method:

1. Construct score matrix of size (m+1, n+1) and initialize as follows:

score (i, 0) = mismatch where 0 < i <= m+1
score (0, j) = mismatch where 0 < j <= n+1
score (i, j) = match if pi = pj
score (i, j) = mismatch if pi != pj

2. Compute distance matrix Dist and pointer matrix Pointer of size (m+1, n+1) each

Pointer (0, i) = 0 where 0 <= i <= m+1
Pointer (i, 0) = 0 where 0 <= i <= n+1
Dist (0, i) = Dist (0, i-1) + mismatch for 0 < i <= m+1
Dist (i, 0) = Dist (i-1, 0) + mismatch for 0 < i <= n+1

3. Dist (i, j) = max (0, Dist (i-1, j) + mismatch, Dist (i, j-1) + mismatch, Dist (i-1, j-1) + score (i, j))

4. Trace the distance matrix back by finding the position of the cell with the maximum value, and check for match or mismatch from the score matrix. Use the Pointer matrix to move to the next location. Whenever a match is found, increment the similarity-count.

5. Repeat the tracing process till a cell with value zero is encountered in Dist matrix

6. If more than one cell in the Dist matrix contains the same maximum value, repeat steps 4 and 5 for each such cell.

7. Find the normalized distance between s1 and s2 as given below:

distance = (max (m, n) - similarity-count) / max (m, n)
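The steps above can be sketched as runnable code. This is a minimal implementation assuming a Smith-Waterman style local alignment with the paper's scores (match = 2, mismatch = -1); reusing the mismatch score as the insertion/deletion penalty, and preferring the diagonal pointer on ties, are assumptions of this sketch.

```python
def sabdm_distance(s1, s2, match=2, mismatch=-1):
    """Normalized SABDM distance between two page sessions.

    Builds a local-alignment Dist matrix with a Pointer matrix
    (1 = top, 2 = left, 3 = top-left), traces back from the maximum cell
    counting aligned matching pages, and returns
    (max(m, n) - similarity_count) / max(m, n).
    """
    m, n = len(s1), len(s2)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    ptr = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_cell = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            score = match if s1[i - 1] == s2[j - 1] else mismatch
            # Candidates are tagged with their pointer code; ties prefer
            # the diagonal (highest tag).
            dist[i][j], ptr[i][j] = max(
                (0, 0),
                (dist[i - 1][j] + mismatch, 1),   # deletion
                (dist[i][j - 1] + mismatch, 2),   # insertion
                (dist[i - 1][j - 1] + score, 3),  # match / substitution
            )
            if dist[i][j] > best:
                best, best_cell = dist[i][j], (i, j)
    # Trace back until a cell with value zero is encountered.
    similarity = 0
    i, j = best_cell
    while dist[i][j] > 0:
        if ptr[i][j] == 3:
            if s1[i - 1] == s2[j - 1]:
                similarity += 1
            i, j = i - 1, j - 1
        elif ptr[i][j] == 2:
            j -= 1
        else:
            i -= 1
    longest = max(m, n)
    return (longest - similarity) / longest
```

For the example worked through below, four pages align, giving distance (5 - 4) / 5 = 0.2.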

Illustration through an example

Consider a simple example of finding the distance between two sessions s1 and s2 by applying the sequence alignment based distance measure (SABDM) algorithm, to understand the proposed technique of finding the distance between any two web user sessions.

Let s1 = (p1, p2, p3, p4, p5) and s2 = (p1, p3, p4, p2, p5) be two sessions to be aligned so as to compute the distance between them. Let m and n be the lengths of sessions s1 and s2 respectively. In this example the lengths of both s1 and s2 are the same; therefore, m = 5 and n = 5. The score matrix is constructed as given in Table 2 with size (m+1, n+1).

The score values considered for match and mismatch are 2 and -1 respectively because, matches should be rewarded and mismatches should be penalized.

Initialize the first row and first column of the score matrix with the value -1, as required to use the concept of dynamic programming. Fill the rest of the cells with the value of either match or mismatch according to the comparison of the pair of pages considered at a time: enter the match value whenever s1(i) equals s2(j), and the mismatch value whenever s1(i) does not equal s2(j), for all i from 1 to m+1 and all j from 1 to n+1.

Construct a distance matrix with size (m+1, n+1).

Dist (0, i) = Dist (0, i-1) + mismatch for 0 < i <= m+1

Dist (i, 0) = Dist (i-1, 0) + mismatch for 0 < i <= n+1

Dist (i, j) = max { 0, Dist (i-1, j) + mismatch, Dist (i, j-1) + mismatch, Dist (i-1, j-1) + score (i, j) } for 0 < i <= m+1, 0 < j <= n+1

Construct a pointer matrix of size (m+1, n+1) to store the positions of cells from which a value is obtained in distance matrix so that, these pointers can be used to trace back the distance matrix at later stage.

Table 4 shows the pointer matrix for the example considered. Here the value 1 indicates a link to the top cell, the value 2 indicates a link to the left cell and the value 3 indicates a link to the top-left cell.

Table 2: Score matrix

         p1   p3   p4   p2   p5
    -1   -1   -1   -1   -1   -1
p1  -1    2   -1   -1   -1   -1
p2  -1   -1   -1   -1    2   -1
p3  -1   -1    2   -1   -1   -1
p4  -1   -1   -1    2   -1   -1
p5  -1   -1   -1   -1   -1    2

Table 3: Distance matrix

         p1   p3   p4   p2   p5
     0   -1   -2   -3   -4   -5
p1  -1    2    1    0    0    0
p2  -2    1    1    0    2    1
p3  -3    0    3    2    1    1
p4  -4    0    2    5    4    3
p5  -5    0    1    4    4    6

Table 4: Pointer matrix

          p1   p3   p4   p2   p5
     0    0    0    0    0    0
p1   0    3    2    2    0    0
p2   0    1    3   2,3    3    2
p3   0    1    3    2   1,2    3
p4   0    0    1    3    2    2
p5   0    0    1    1    3    3

A cell holding two values records two alternative pointers for that position.

Check for match or mismatch by referring to the score matrix, and increment the similarity-count value by 1 whenever a match is found. The distance value always lies in the range of 0 to 1: the value 1 indicates that the two sequences are entirely different, and the value 0 indicates that the two sequences are exactly similar to each other. A distance tending to 0 means the sequences are close; a distance of 1 means the sessions are completely different, and such a session is considered an outlier. SABDM is thus also useful for finding outliers.
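The outlier observation above can be sketched as a small filter: flag sessions whose mean distance to every other session is high. The 0.8 cutoff and the helper names below are illustrative assumptions; any session-distance function in [0, 1] (such as SABDM) can be plugged in as `dist`.

```python
def find_outliers(sessions, dist, threshold=0.8):
    """Indices of sessions whose mean distance to all others exceeds threshold.

    `dist` is any session-distance function returning values in [0, 1];
    the 0.8 cutoff is an illustrative choice, not taken from the paper.
    """
    outliers = []
    for i, s in enumerate(sessions):
        others = [dist(s, t) for j, t in enumerate(sessions) if j != i]
        if others and sum(others) / len(others) > threshold:
            outliers.append(i)
    return outliers

# Toy distance for demonstration: 0 for identical sessions, 1 otherwise.
toy = lambda a, b: 0.0 if a == b else 1.0
```

With the toy distance, a session unlike every other session in the collection is flagged, mirroring the "distance tends to 1" outlier criterion described above.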

6. CONCLUSION

When customers visit the web, they leave log information such as IP address, credit card details, logging sessions, etc. By analyzing this information it is easy to understand customer behaviour, which is useful for improving customer relationships and system performance; it also scales up the economics of the cloud infrastructure. The proposed algorithm is simple, effective and easy to realize for suitable web usage mining demands. As future work, it can be enhanced with cluster recovery to provide highly accurate prediction of a web user's future visits, if the user's cluster can be exactly determined within a specific time interval. With the help of clusters, meaningful visualizations of typical user browsing patterns can be constructed effortlessly.

REFERENCES

1. Poornalatha G., Prakash Raghavendra (2011) "Alignment based similarity distance measure for better web sessions clustering", 2nd International Conference on ANT, ScienceDirect, pp. 450-457.

2. Lingjuan Li, Min Zhang (2011) "The strategy of mining association rule based on cloud computing", International Conference on Business Computing and Global Informatization, pp. 475-478.

3. Jianzong Wang, Jiguang Wan, Zhuo Liu, Peng Wang (2010) "Data Mining of Mass Storage based on Cloud Computing", Ninth International Conference on Grid and Cloud Computing, pp. 426-431.

4. Costantinos Dimopoulos, Christos Makris, Yannis Panagis, Evangelos Theodoridis, Athanasios Tsakalidis (2010) "A web page usage prediction scheme using sequence indexing and clustering techniques", Data & Knowledge Engineering (69), pp. 371-382.

5. Yu-Hui Tao, Tzung-Pei Hong, Wen-Yang Lin, Wen-Yuan Chiu (2009) "A practical extension of web usage mining with intentional browsing data toward usage", Expert Systems with Applications (36), pp. 3937-3945.

6. Qingtian Han, Xiaoyan Gao (2009) "Research of distributed algorithm based on usage mining", 2nd International Workshop on Knowledge Discovery and Data Mining, pp. 211-214.

7. Kunal Verma, Kaarthik Sivashanmugam, Amit Sheth (2005) "METEOR-S WSDI: A Scalable P2P Infrastructure of Registries for Semantic Publication and Discovery of Web Services", Information Technology and Management, 6(1), pp. 17-39.

8. Facca, F. M., & Lanzi, P. L. (2005) "Mining interesting knowledge from web logs: a survey", Journal of Data and Knowledge Engineering (53), ScienceDirect, pp. 225-241.