Available online at www.sciencedirect.com

Procedia Engineering 15 (2011) 4073-4078

www.elsevier.com/locate/procedia

Advanced in Control Engineering and Information Science

The research and application of web log mining based on the platform Weka

Xiu-yu Zhong

School of Computer Science, Jiaying University, Meizhou, Guangdong, China

Abstract

Weka is an open-source, Java-based data mining platform that gathers many machine learning algorithms, covering data pretreatment, classification, clustering, association rule mining and interactive visualization. When users visit a website, the server records massive web log files; each entry records information about the visitor, such as the time, IP address, action, requested page and status. The system mines web logs on the Weka platform: by processing and analyzing the raw data, it obtains the visiting habits of different users, mines unusual information, and provides a reference for decision-making and for the construction of the website. Simulation results show that applying Weka to web log mining yields the frequent patterns with which users visit the website, which helps to optimize the website structure and makes it easier to build an intelligent website. © 2011 Published by Elsevier Ltd. Selection and/or peer-review under responsibility of [CEIS 2011]

Keywords: Weka; log analysis; association rule; application of mining

1. Introduction

Web log files record the behavior of visitors. By mining them, the interests and visiting habits of visitors can be discovered, and the frequency and behavior patterns with which users visit the website can be obtained. Once the important patterns in the web log records have been studied and analyzed, network managers can enhance the performance and security of the system by improving the website, for example by refining its structure and services, providing personalized service for users, and preparing for contingencies.

Many scholars have studied the related theory, and some research on web log mining has produced preliminary results. Y.T. Wang proposed a compact graph structure, termed a path traversal graph, to record information about the navigation paths of website visitors [1]. Michal Munk proposed optimizing a web portal, at the level of portal adaptation, on the basis of the user and the access hour [2]. Resul Das proposed pre-processing web log files and using a path analysis technique to investigate the URL information concerning access to electronic sources [3]. Nicolas Poggi proposed using machine learning techniques and Markov-chain models to build self-adaptive utility-based web session management [4]. Kleinberg and Tomkins gave a method for distinguishing web pages by mining the web link structure [5]. Yu-Hui Tao proposed a unified intention-based web transaction mining algorithm that can efficiently process the whole data set simultaneously with multiple intentional browsing data types, and that transforms the intentional browsing data counts into easily understood linguistic items using the fuzzy set concept [6]. Enrique Lazcorreta proposed a new method for automatic personalized recommendation, based on the behavior of a single user in relation to all other users of a web-based information system, using the Apriori algorithm [7]. Jianping Zeng proposed an easily implemented framework for analyzing user activity on an interactive website; in this framework the user activity model is represented by a hidden Markov model (HMM), and a method for computing user interest is provided [8].

a Corresponding author. Tel.: 13622940533, 0753-2359938. E-mail address: wch@jyu.edu.cn

1877-7058 © 2011 Published by Elsevier Ltd. doi:10.1016/j.proeng.2011.08.764

Apriori is the most commonly used algorithm for mining association rules, and it can be embedded in Weka. The associations in a web log can therefore be analyzed on the Weka platform to discover users' visit patterns and other information.

2. Weka platform

The full name of Weka is Waikato Environment for Knowledge Analysis; its source code can be obtained from http://www.cs.waikato.ac.nz/ml/weka. Weka is widely accepted and is one of the most complete data mining tools. As a public data mining platform, Weka gathers many machine learning algorithms to mine data, covering data pretreatment, classification, regression, clustering, association rule mining and visualization in a new interface.

Weka wraps standard machine learning techniques in a unified set of graphical interfaces. It integrates many pretreatment and post-processing methods, so that many different learning algorithms can be applied to a data set and the corresponding results evaluated. For example, Weka 3.5.8 provides five menus in its main window. The most commonly used is "Applications", which covers switching between mining tasks, opening and saving data, editing, data pretreatment, displaying the data set, selecting attributes of the data set, the data mining functions and so on.

3. Web log mining

The web server log, which records the requests that users send to the server, is the main data source for web log mining. The system uses the web log of a shopping website that is built with ASP technology on Windows XP and uses IIS as its web server. The log is in the extended format. Each record contains the request date, request time, client IP address, service name, server name, server IP address, server port, method, requested page URL, protocol status, bytes sent, bytes received, time taken, protocol version, host name, user agent, cookie, the visitor's previous page (referrer) and so on.
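As an illustration of the record layout described above, the following Python sketch reads a log in the W3C extended layout, in which a "#Fields:" directive names the columns. It is only a sketch under stated assumptions: the sample lines, the IP addresses and the helper name parse_extended_log are invented for illustration and are not taken from the paper's log.

```python
# Hypothetical sample in the W3C extended log layout. The "#Fields:" directive
# names the columns; every value below is invented for illustration.
SAMPLE_LOG = """\
#Version: 1.0
#Fields: date time c-ip cs-method cs-uri-stem sc-status
2011-03-01 08:15:02 192.168.0.11 GET /WRQ/Shopxp/web/index.asp 200
2011-03-01 08:15:03 192.168.0.11 GET /WRQ/Shopxp/web/images/logo.gif 200
"""


def parse_extended_log(text):
    """Return one dict per log entry, keyed by the names in the #Fields line."""
    fields = []
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#Fields:"):
            # Remember the column names announced by the directive.
            fields = line[len("#Fields:"):].split()
        elif line.startswith("#"):
            continue  # skip other directives such as #Version or #Date
        else:
            values = line.split()
            # Keep only entries whose value count matches the declared fields.
            if len(values) == len(fields):
                records.append(dict(zip(fields, values)))
    return records


records = parse_extended_log(SAMPLE_LOG)
```

Keying each record by field name keeps the later steps independent of the column order, which can differ between server configurations.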

3.1 Data pretreatment

Web log pretreatment reorganizes the web log, through a series of steps, into a model suitable for mining. It usually includes the following processes: data cleaning, user identification, session identification and path addition. The pretreatment process in this system is as follows:

(1) Data cleaning

This system uses Microsoft SQL Server 2000 as its database. First, the web log in text format is loaded into the database with a data loading tool. Then the redundant records, those for files whose extension is gif, jpg, js, css or ico, are deleted with Query Analyzer. The SQL statements are as follows:

Delete from log where csuristem like '%gif';
Delete from log where csuristem like '%jpg';
Delete from log where csuristem like '%js';
Delete from log where csuristem like '%css';
Delete from log where csuristem like '%ico';
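The cleaning step can be tried out without SQL Server. The sketch below is a minimal Python example that assumes an in-memory SQLite database as a stand-in for SQL Server 2000; the table name log and the columns cip, time and csuristem follow the statements above, while the sample rows are invented. Unlike the statements above, the pattern here includes the dot before the extension, which is a slightly stricter assumption.

```python
import sqlite3

# In-memory SQLite database standing in for SQL Server 2000 (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (cip TEXT, time TEXT, csuristem TEXT)")

# Invented sample requests: two page views mixed with embedded resources.
rows = [
    ("192.168.0.11", "08:15:02", "/WRQ/Shopxp/web/index.asp"),
    ("192.168.0.11", "08:15:03", "/WRQ/Shopxp/web/images/logo.gif"),
    ("192.168.0.11", "08:15:04", "/WRQ/Shopxp/web/css/style.css"),
    ("192.168.0.11", "08:15:05", "/WRQ/Shopxp/web/js/menu.js"),
    ("192.168.0.11", "08:15:06", "/WRQ/Shopxp/web/favicon.ico"),
    ("192.168.0.11", "08:15:20", "/WRQ/Shopxp/web/productshopxp.asp"),
]
conn.executemany("INSERT INTO log VALUES (?, ?, ?)", rows)

# Delete every request whose URL stem ends with a resource extension.
for ext in ("gif", "jpg", "js", "css", "ico"):
    conn.execute("DELETE FROM log WHERE csuristem LIKE ?", ("%." + ext,))

remaining = [r[0] for r in conn.execute("SELECT csuristem FROM log ORDER BY time")]
```

After the loop, only the two page requests are left in the table, which is the input the later identification steps expect.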

(2) User identification

A shopping website was set up for the test. All users are on the local network, and each user has a specific IP address with no proxy in between. Each distinct IP address therefore represents a distinct user, which simplifies user identification.

(3) Session identification

To obtain the sequence of pages requested from the same IP address, ordered by visit time, the log records are sorted with the user's IP address and the visit time as the sort keys. The SQL statement is as follows:

Select * from log order by cip, time
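The grouping described above can be sketched in a few lines of Python. This is a simplified illustration, assuming as the text does that one IP address corresponds to one user; the sample requests and the function name build_sessions are invented, and no session timeout is applied.

```python
from collections import defaultdict

# Invented sample requests as (time, client IP, page) tuples, deliberately
# out of order to show the effect of sorting.
requests = [
    ("08:15:09", "192.168.0.12", "/WRQ/Shopxp/web/xpdongtai.asp"),
    ("08:15:02", "192.168.0.11", "/WRQ/Shopxp/web/index.asp"),
    ("08:15:30", "192.168.0.11", "/WRQ/Shopxp/web/xpnewp.asp"),
    ("08:16:01", "192.168.0.12", "/WRQ/Shopxp/web/shopxp_news.asp"),
]


def build_sessions(requests):
    """Group pages by client IP, keeping each group ordered by visit time."""
    sessions = defaultdict(list)
    # Sort by IP first and by time second, so each IP's pages arrive in order.
    for time, ip, page in sorted(requests, key=lambda r: (r[1], r[0])):
        sessions[ip].append(page)
    return dict(sessions)


sessions = build_sessions(requests)
```

A production system would normally also split one IP's requests into several sessions when the gap between two requests exceeds a timeout; that refinement is left out here.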

(4) Path addition

When a visitor returns to a cached page during a session and follows a link from it to a new page, the request for the cached page does not reach the server. The session saved in the session table is therefore incomplete, and the missing path must be added according to the website structure. Adding the path yields the user's full visit sequence. Some visit paths from the simulation test are shown in Figure 1. So that they can be loaded into the database and recognized, the symbol "#" is used to separate the items.

[Figure 1: a list of numbered visit paths. Each line holds a path number followed by "#"-separated page URLs, for example 12#/WRQ/Shopxp/web/index.asp#/WRQ/Shopxp/web/xpnewp.asp?action_key_order=tejia#/WRQ/Shopxp/web/productshopxp.asp?id=462#/WRQ/Shopxp/web/xpbuy.asp]

Fig. 1. Some visit paths in the simulation test
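The paper does not spell out its path addition procedure. The Python sketch below shows one common referrer-based heuristic, offered as an assumption rather than as the method used here: when the referrer of a request is not the page visited last, the visitor is assumed to have stepped back through cached pages to the referrer, and those pages are re-inserted. The function names complete_path and format_path and the sample session are hypothetical.

```python
def complete_path(requests):
    """Rebuild a visit path from (page, referrer) pairs.

    When a request's referrer is an earlier page rather than the last one,
    the pages walked back through (served from cache) are re-inserted.
    """
    path = []
    for page, referrer in requests:
        if path and referrer is not None and path[-1] != referrer and referrer in path:
            # Position of the most recent visit to the referrer.
            last = len(path) - 1 - path[::-1].index(referrer)
            # Re-add the pages between the current page and the referrer,
            # in the order a "back" button would show them.
            path.extend(reversed(path[last:-1]))
        path.append(page)
    return path


def format_path(number, pages):
    """Serialize a path as 'number#page#page...', the layout used in Fig. 1."""
    return "#".join([str(number)] + list(pages))


# Invented session: the last request comes from index.asp, not from the
# product page that was logged just before it.
session = [
    ("/WRQ/Shopxp/web/index.asp", None),
    ("/WRQ/Shopxp/web/xpnewp.asp", "/WRQ/Shopxp/web/index.asp"),
    ("/WRQ/Shopxp/web/productshopxp.asp", "/WRQ/Shopxp/web/xpnewp.asp"),
    ("/WRQ/Shopxp/web/xpdongtai.asp", "/WRQ/Shopxp/web/index.asp"),
]
completed = complete_path(session)
line = format_path(1, completed)
```

In this invented session the completed path revisits xpnewp.asp and index.asp before reaching xpdongtai.asp, which is the navigation the raw log alone would not show.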

3.2 Association rule mining

Once the web log data has been pretreated, it is well suited to data mining. It is loaded into Weka, a mining algorithm is selected, and the related parameters are set. For example, when the Apriori algorithm is selected, the mining parameters can be set as shown in Figure 2.

[Figure 2: the weka.gui.GenericObjectEditor dialog for weka.associations.Apriori ("Class implementing an Apriori-type algorithm"), listing the parameters car (False), classIndex, lowerBoundMinSupport, metricType (Confidence), minMetric, numRules, outputItemSets, removeAllMissingCols, significanceLevel, upperBoundMinSupport and verbose.]

Fig. 2. Setting the mining parameters in Weka
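To make the roles of the minimum support and the minimum confidence concrete, the following Python sketch implements a small level-wise Apriori search over page sets and then derives rules from the frequent itemsets. It is an illustrative re-implementation under stated assumptions, not Weka's code: the function names, the shortened page names and the four transactions are invented, and support is expressed as an absolute count rather than as the fraction that Weka's lowerBoundMinSupport takes.

```python
from itertools import combinations


def apriori(transactions, min_support_count):
    """Return {itemset: support count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    candidates = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while candidates:
        # Count how many transactions contain each candidate itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support_count}
        frequent.update(level)
        k += 1
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent


def derive_rules(frequent, min_confidence):
    """Return (antecedent, consequent, support count, confidence) tuples."""
    rules = []
    for itemset, count in frequent.items():
        for r in range(1, len(itemset)):
            for subset in combinations(sorted(itemset), r):
                antecedent = frozenset(subset)
                # confidence = support(antecedent and consequent) / support(antecedent)
                confidence = count / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, count, confidence))
    return rules


# Invented transactions: each set holds the pages seen in one visit path.
transactions = [
    {"index.asp", "productshopxp.asp", "xpbuy.asp"},
    {"index.asp", "productshopxp.asp", "xpbuy.asp"},
    {"index.asp", "shopxp_news.asp"},
    {"productshopxp.asp", "xpbuy.asp"},
]
frequent = apriori(transactions, min_support_count=2)
rules = derive_rules(frequent, min_confidence=0.9)
```

In this toy data the product page and the purchase page occur together in three of the four paths and never apart, so the rule from the product page to the purchase page has a support count of 3 and a confidence of 100%.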

3.3 Experimental result analysis

After the mining parameters have been set, data mining begins. The result of association rule mining is shown in Figure 3.

[Figure 3: the Associate panel of the Weka 3.5.8 Explorer. The associator is configured as "Apriori -N 100 -T 0 -C 0.3 -D 0.05 -U 1.0 -M 0.1 -S -1.0", and the output lists rules over the attributes Col002 to Col005, each with a support count of 7, for example "Col004=/WRQ/Shopxp/web/productshopxp.asp?id=462 7 ==> Col005=/WRQ/Shopxp/web/xpbuy.asp".]

Fig. 3. Some results of association rule mining

Some of the results in Figure 3 are selected for analysis in Table 1.

Table 1. Some association rules

Number  Support count  Confidence (%)  Association rule of homepage
1       7              100             /WRQ/Shopxp/web/productshopxp.asp?id=462 ==> /WRQ/Shopxp/web/xpbuy.asp
2       4              100             /WRQ/Shopxp/web/productshopxp.asp?id=461 ==> /WRQ/Shopxp/web/xplistpl.asp
3       4              100             /WRQ/Shopxp/web/xpdongtai.asp ==> /WRQ/Shopxp/web/shopxp_news.asp
4       8              50              /WRQ/Shopxp/web/fankuixpliuyan.asp ==> /WRQ/Shopxp/web/shopxphelp.asp?action=qiyexingxiang

The first association rule shows that nearly all users make a purchase after viewing the goods with id=462. On the actual page, these goods turn out to be the latest notebook computer on the market. According to the mining information, the merchant should stock more of this notebook and advertise it in a key position to raise sales.

The second association rule shows that most users who have seen the id=462 goods go on to look at the reviews of the id=461 goods, which are close to the id=462 goods, but do not make a purchase.

Therefore, from the first and second association rules, the reviews of this kind of goods should be checked promptly, and more detailed product parameters should be provided to attract consumers to purchase. In addition, the stock of notebooks should be increased as people's living standards rise.

The third association rule shows that customers are very interested in the shopping site's latest developments, especially its news. The developments and news should therefore be updated promptly, and this page can be used to advertise the site, so that customers do not find its service slow and form a good impression of it.

The fourth association rule shows that customers reach the enterprise's mailbox from the feedback message board, and that many customers do so. Therefore, to receive customers' feedback promptly and provide better service, a mailbox link should be placed on the website homepage. This makes it convenient for customers to complain, persuades existing customers to stay and attracts new ones.

4. Conclusion

This system uses the data mining platform Weka. With the classical mining algorithm Apriori embedded in the Weka platform, association rule mining has been realized. The simulation test takes the visit log of a shopping website as its data source, collects the characteristic attributes of users' behavior and discovers the characteristics of users' visits, providing decision support and service for the website manager.

Acknowledgment

The authors thank the Natural Science Foundation of Guangdong Province and the Science and Technology Plan Project of Guangdong Province for their support of this research (grants No. 9151009001000043 and No. 0911050400004, respectively).

References

[1] Yao-Te Wang, Anthony J.T. Lee. Mining Web navigation patterns with a path traversal graph[J]. Expert Systems with Applications, 2011, 38(6): p. 7112-7122.

[2] Michal Munk, Jozef Kapusta. Data preprocessing evaluation for web log mining: reconstruction of activities of a web visitor[J]. Procedia Computer Science, 2010, 1(1): p. 2273-2280.

[3] Resul Das, Ibrahim Turkoglu. Creating meaningful data from web logs for improving the impressiveness of a website by using path analysis method[J]. Expert Systems with Applications, 2009, 36(3): p. 6635-6644.

[4] Nicolas Poggi, Toni Moren. Self-adaptive utility-based web session management[J]. Computer Networks, 2009, 53(10): p. 1712-1721.

[5] J.M. Kleinberg, A. Tomkins. Application of linear algebra in information retrieval and hypertext analysis[J]. In Proc. 18th ACM Symp. Principles of Database Systems (PODS), Philadelphia, PA, 1999(5): p. 185-193.

[6] Yu-Hui Tao, Tzung-Pei Hong, Yu-Ming Su. Web usage mining with intentional browsing data[J]. Expert Systems with Applications, 2008, 34(3): p. 1893-1904.

[7] Enrique Lazcorreta, Federico Botella. Towards personalized recommendation by two-step modified Apriori data mining algorithm[J]. Expert Systems with Applications, 2008, 35(3): p. 1422-1429.

[8] Jianping Zeng, Shiyong Zhang. A framework for WWW user activity analysis based on user interest[J]. Knowledge-Based Systems, 2008, 21(8): p. 905-910.