
Available online at www.sciencedirect.com

Procedia - Social and Behavioral Sciences 73 (2013) 232-239

The 2nd International Conference on Integrated Information

Atomic data mining numerical methods, source code SQlite with Python

Ali Khwaldeh a, Amani Tahat b,*, Jordi Marti c, Mofleh Tahat d

a Department of Computer Engineering, Faculty of Engineering, Philadelphia University, 19392 Amman, Jordan
b,c Department of Physics and Nuclear Engineering, Technical University of Catalonia - Barcelona Tech, B5-209 North Campus UPC, 08034 Barcelona, Catalonia, Spain
d Software Development Engineer, American Airlines, Addison TX | HDQ2 3N2D-63

* Corresponding author. E-mail address: amani.tahat@upc.edu

Abstract

This paper introduces a recently published Python data mining book (its chapters, topics and samples of Python source code written by its authors), intended for data mining on the world wide web and on specific databases in several disciplines (economics, physics, education, marketing, etc.). The book opens with an introduction to data mining that explains some of the principal data mining tasks: classification, dependence modelling, clustering and discovery of association rules. It notes that Python has been gaining interest in the data mining community because it is an open-source, general-purpose programming and web-scripting language; furthermore, it is cross-platform and runs on a wide variety of operating systems such as Linux, Windows, FreeBSD, Macintosh, Solaris, OS/2, Amiga, AROS, AS/400, BeOS, OS/390, z/OS, Palm OS, QNX, VMS, Psion, Acorn RISC OS, VxWorks, PlayStation, Sharp Zaurus, Windows CE and even PocketPC. The book can thus serve as a teaching textbook for data mining, in which methods such as machine learning and statistics are used to extract high-level knowledge from real-world datasets.

© 2013 The Authors. Published by Elsevier Ltd.

Selection and peer-review under responsibility of The 2nd International Conference on Integrated Information. doi:10.1016/j.sbspro.2013.02.046

Keywords: Python; atomic data; database; data mining algorithms; data model; collaborative intelligence; machine learning

1. Introduction

The process of automatically discovering models, patterns, changes, associations and anomalies in massive databases is known as data mining (DM). Basically, it is the use of computer programs to help make sense


of information, in order to answer questions that cannot be addressed through simple query and reporting techniques, drawing on methods such as machine learning and statistics to extract high-level knowledge from real-world datasets. Information about data mining is widely available: whatever the reader's level of expertise, helpful books and articles on data mining can be found in several disciplines, e.g., references [1-4]. This paper presents a short overview of the book [5], "Atomic data mining numerical methods, source code SQlite with Python" by Amani Tahat and Ali Khwaldeh. The book is available online via http://www.aims-thinkenergy.org and may be freely distributed, copied and modified under the Creative Commons Attribution-NonCommercial 3.0 Unported License. The main goal of the book is to introduce sample data sources (e.g., those presented in Fig. 1) and Python implementations of several data mining algorithms, prepared and used for data mining tutorials in the authors' own data mining research laboratory and database projects [6, 7], covering the following areas: parallel and distributed mining, mining of dynamic and stream data, bioinformatics and scientific data, atomic physics, and anomaly detection. To get the most out of these topics, readers need a good knowledge of Python and should be familiar with text and language processing concepts, or be ready to read up on them.

Fig. 1: Data mining resources.

The book is divided into seven chapters and packed with practical examples of how to retrieve, explore, analyze and visualize data from databases as well as from social networks like Twitter and the now-abandoned Google Buzz. It runs through these topics, accompanied by many code examples, at a fast pace and an advanced level. The book gives a general description of data mining and some related issues but does not describe all aspects of the data mining process; instead, at the end of some sections it provides "Further readings" pointers for readers interested in knowing more about a topic. It starts by defining "atomic data" as data that has not been processed for use (i.e., raw data) and that requires selective extraction, organization and analysis, in addition to formatting and presentation, in order to become "information". Accordingly, chapter one provides a short introduction to data mining and knowledge discovery that facilitates the understanding of the following chapters. Figure 2 presents the outcome of processing "atomic data": the data ends up in an electronic database, where it becomes accessible for further processing and analysis in a number of different ways using several data mining techniques, including computer software for analysing atomic data and for finding patterns by identifying the underlying rules and features in the data.

Fig. 2: The phases of atomic data mining.
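As a toy illustration of this raw-data-to-information step, consider the following sketch (ours, not an example from the book; the three-field record layout of element, wavelength in nm and relative intensity is hypothetical), which turns raw text records of atomic spectral lines into structured rows ready to be loaded into a database:

raw_records = """
Fe 371.99 290
Fe 373.49 250
Na 588.99 980
"""

def parse_records(text):
    # Split each whitespace-separated record into typed fields.
    rows = []
    for line in text.strip().splitlines():
        element, wavelength, intensity = line.split()
        rows.append({'element': element,
                     'wavelength_nm': float(wavelength),
                     'intensity': int(intensity)})
    return rows

print(parse_records(raw_records))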

Because a basic understanding of data mining functions and algorithms is required for achieving the fastest Python DM implementation, chapter two describes DM operations, techniques and algorithms, focusing on Python integration techniques. Chapter 3 then works as an introduction to Python programming for beginners. It starts with some fundamental concepts of programming and is carefully designed to define all terms when they are first used and to develop each new concept in a logical progression, building on other freely distributed open Python books from the literature, such as [1], and adding examples and exercises based on Python packages written by the book's authors to demonstrate aspects of software design and to give readers a chance to experiment with simple graphics and animation. Chapters 4 and 5 cover the DM process with Python. There, a DM process needs to apply an algorithm that "learns" something about the data. Such algorithms are referred to as machine learning algorithms and are the ones used in DM (e.g., classification learning, association learning, numeric estimation, clustering); a minimal clustering sketch is given below. These two chapters can be used as a programmer's guide to DM in Python, including machine-learning algorithms in Python via several case studies (e.g., building decision trees, implementing a genetic algorithm for optimization, clustering algorithms). Furthermore, as the book does not assume any familiarity with DM, statistical techniques or machine learning, brief introductions to the different modeling approaches are provided, since they are necessary for understanding the case studies presented in several chapters (e.g., 2, 3, 4 and 5). The descriptions of these models aim to give readers a basic understanding of their merits and drawbacks in addition to the analysis objectives. Where further theoretical insight is required, the book refers readers to other existing Python DM books.
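To make the flavour of such algorithms concrete, here is the clustering sketch promised above (our illustration, not code from the book): a basic k-means loop over one-dimensional data, alternating an assignment step and a centroid-update step.

import random

def kmeans(points, k, iterations=20):
    # Start from k randomly chosen points as the initial centroids.
    centroids = random.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = sum(cluster) / float(len(cluster))
    return centroids, clusters

centroids, clusters = kmeans([1.0, 1.2, 0.8, 8.0, 8.3, 7.9], k=2)
print(centroids)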

Since size is one of the key issues in DM, chapter 6 describes typical DM problems that arise in a large database from which users seek to extract useful knowledge. The chapter explains that using SQLite (http://sqlite.org/index.html) as the core database management system enables the complete database to be stored in a single disk file, which greatly reduces its footprint. The chapter also briefly describes SQLite's valuable features for handling very large databases: it is freely available for several computer platforms (Unix, Mac OS X, Windows, etc.); it requires zero configuration, with no setup or administration tasks; and it supports large databases, up to one terabyte (TB) in size, through a relatively simple application programmer's interface (API), being a self-contained, server-less, transactional SQL database engine with no external dependencies. The chapter then covers the basics of database programming with the Python language (a minimal sqlite3 sketch is given below) via a successful practical example of designing an online atomic physics database management system from the authors' previous DM projects [7]. Finally, chapter 7 simplifies learning how to write smart Python programs that access datasets of interest on other web sites, collect data from users of specific applications, and analyze and understand the data once it has been found, using collaborative filtering with Python and the sophisticated algorithms presented in the previous chapters of the book. Each algorithm is described clearly and concisely, with code that can be used immediately on a reader's web site, blog, wiki or specialized application. The next section provides some case studies of Python web mining, followed by some conclusions.
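The sketch promised above (ours, with a deliberately simplified, hypothetical table layout rather than the schema of the authors' atomic database [7]) uses Python's built-in sqlite3 module to create a single-file database, store a few atomic line records and query them back:

import sqlite3

# The complete database lives in this single disk file.
conn = sqlite3.connect('atomic.db')
cur = conn.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS lines '
            '(element TEXT, wavelength_nm REAL, intensity INTEGER)')
cur.executemany('INSERT INTO lines VALUES (?, ?, ?)',
                [('Fe', 371.99, 290), ('Na', 588.99, 980)])
conn.commit()

# Query back the strongest lines, ordered by wavelength.
for row in cur.execute('SELECT element, wavelength_nm FROM lines '
                       'WHERE intensity > ? ORDER BY wavelength_nm', (500,)):
    print(row)
conn.close()

Because the engine is embedded and server-less, moving the whole database between platforms amounts to copying one file.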

2. Collecting data from websites using Python

Data mining in Python has been gaining interest in the data mining community, which regards Python as a near-perfect language for mining data, either on the web or elsewhere. Owing to its features (open source, a variety of tools for general-purpose programming and web scripting), Python DM tools and methods may be classified according to the function they perform or according to the class of application in which they can be used; this issue is fully described in chapter two, supported by practical examples. Moreover, Python comes with modules for accessing the Internet, such as urllib and urllib2 (http://docs.python.org/library/urllib2.html), Scrapy (http://scrapy.org/), Pattern (http://www.clips.ua.ac.be/pages/pattern) and the mechanize package (http://pypi.python.org/pypi/mechanize/). Using the last of these is recommended when time is short, since it needs far fewer script lines than urllib2, which can require 30 to 100 lines or more for the same task.
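To see why, compare the mechanize scripts in the next subsection with a bare-urllib2 sketch of the same kind of task (ours, not from the book; the form field 's' is taken from the Yahoo! Finance example below, while the /q endpoint is an assumption read off the quote-page URL, since with urllib2 the form's target and fields must be discovered by hand):

import urllib, urllib2, cookielib

# Cookie and session handling must be wired up manually;
# mechanize does all of this for us.
cookies = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookies))

# Encode the form fields by hand and issue the request.
query = urllib.urlencode({'s': 'MSFT'})
response = opener.open('http://finance.yahoo.com/q?' + query)
print(response.read())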

2.1 Mechanize module

Here is an example of using the mechanize module for text mining of the AJAC Atomic Calculation tool via the website (http://aims-thinkenergy.org/amani/index.lua/section/47), in a few Python script lines. First, users need to install the latest version of the module from its official website. Second, using the module is as simple as driving a web browser, as the following Python script shows:

#!/usr/bin/env python
from mechanize import Browser

br = Browser()
br.open('http://aims-thinkenergy.org/amani/index.lua/section/47')
response = br.follow_link(text='AJAC Atomic Calculation tool')
print response.read()

This script performs a simple process: it takes the user to the web site, goes to the selected article via its link, downloads the HTML document and prints it. The following example further illustrates the simplicity of the mechanize module; the very same methods can be used to log in to web sites and start one's own DM. Mechanize handles all of the fun of cookies and sessions, so data miners need only tell it where to go, as follows:

#!/usr/bin/env python
from mechanize import Browser

br = Browser()
br.open('http://finance.yahoo.com/')
br.select_form(name='quote')
br['s'] = 'MSFT'        # 's' is the input field in the quote form
response = br.submit()  # Submit the form just like a web browser
print response.read()

This script goes to http://finance.yahoo.com, enters "MSFT" into the Get Quotes field, as shown in figure 3, submits the form, and prints out the quote-page HTML for Microsoft Corporation (MSFT).

Fig. 3: DM using the Python module "mechanize": the Yahoo! Finance quote summary page for Microsoft Corporation (MSFT).

2.2 The urllib2 module and some other related tools (re, time, pprint, csv)

Unlike mechanize, urllib2 ships with the Python 2 standard library, so nothing needs to be installed before starting work. The following problem shows another example of using Python for web DM, based on the website of the National Football League (www.nfl.com) and its professional American football teams. Imagine that someone wanted to compare the "New York Giants" draft picks with the league as a whole. How would he go about obtaining data on the rest of the league's players? The old-fashioned manual way would be to collect the player data team by team from the nfl.com website: the first step would be to find the list of team rosters (http://www.nfl.com/players/search?category=team&playerType=current) and then to click through each team's roster. For instance, a reader from Ann Arbor might be a Lions fan (http://www.nfl.com/players/search?category=team&filter=1540&playerType=current); the list of current players for the Detroit Lions would appear. In order to collect the desired player information, however, one would again have to follow the link to each player's profile page, e.g., to check out the Lions' own first-round pick at (http://www.nfl.com/players/matthewstafford/profile?id=STA134157). At last, Stafford's statistics could simply be copied down; this might take 30 seconds, including page-load times and spreadsheet entry. The Lions have more than 70 players rostered (more than just active players). Assuming this is representative, and given that there are 32 teams in the NFL, even a conservative estimate puts the total at over 2,000 players; at 30 seconds each, about 17 hours would be needed to collect the required data. One might hand this data entry over to a team of bored undergraduate or graduate students, but would then need to worry about double-coding and the cost of labour. Furthermore, what if one wanted to extend the analysis to historical players as well? Better start looking for a source of funding. What if there were an easier way? Python DM solves the problem for free. The solution requires around 51 lines of code, as presented below. Code of this kind can be produced in half an hour by an experienced Python programmer, and the program itself can download the entire data set in less than half an hour, so the data set is the product of less than an hour of total time. The end result is a spreadsheet with the name, weight, age, height in inches, college, and NFL team of 2,520 players. This is not even the full list: for the purposes of this tutorial, players with missing data, e.g., unknown height, are not recorded. We will not cover here how to visualize and analyze this data; all details are available in the book, where the data is visualized with both standard statistical models and network models. The following script presents the Python solution:

import urllib2, re, time, pprint, csv

# This regular expression extracts links to team rosters.
reTeamData = re.compile('/players/search\?category=team&filter=([0-9]+)&playerType=current">([^<]+)')

def getTeamList():
    """
    Returns a list of tuples where each tuple looks like
    (teamID, teamName).
    """
    # Download the list of teams and return all matches to the
    # reTeamData regular expression.
    return reTeamData.findall(urllib2.urlopen('http://www.nfl.com/players/search?category=team&playerType=current').read())

# This regular expression extracts a player's profile ID and their first and last names.
rePlayerData = re.compile('profile\?id=([^"]+)">([^,]+), ([^<]+)')

# This regular expression extracts the link to the "next" page of the team roster.
reNextPageURL = re.compile('href="([^"]+)">next</a>')

def getTeamPlayers(teamID):
    """
    Return the list of players for a given team, where each player is
    (playerID, playerLastName, playerFirstName).
    """
    # Download the first page of the team roster and store the list of players.
    teamPageHTML = urllib2.urlopen('http://www.nfl.com/players/search?category=team&filter=%s&playerType=current' % teamID).read()
    players = rePlayerData.findall(teamPageHTML)
    # Check for a "next" page. If one is found, then download that page and
    # add the players on it to the previous list. Continue checking for more
    # pages and storing players until no more pages are found.
    nextURL = reNextPageURL.findall(teamPageHTML)
    while len(nextURL) > 0:
        teamPageHTML = urllib2.urlopen('http://www.nfl.com' + nextURL[0].replace('&amp;', '&')).read()
        players.extend(rePlayerData.findall(teamPageHTML))
        nextURL = reNextPageURL.findall(teamPageHTML)
    return players

# The following regular expressions extract the desired information
# from the player's profile page.
reHeight = re.compile('Height: ([^ \r\n]+)')
reWeight = re.compile('Weight: ([^ \r\n]+)')
reAge = re.compile('Age: ([^ \r\n]+)')
reCollege = re.compile('College: ([^<]+)')
reName = re.compile('<title>([^<]+)')
reTeam = re.compile('team=[^"]+">([^<]+)</a>')
rePosition = re.compile('\| ([A-Z]{1,4})')

def getPlayerInfo(playerID):
    """
    Returns the player's info as a dictionary.
    """
    try:
        pageData = urllib2.urlopen('http://www.nfl.com/players/profile?id=' + playerID).read()
        # Height is listed as feet-inches (e.g. '6-2'); convert it to inches.
        heightTokens = reHeight.findall(pageData)[0].split('-')
        height = int(heightTokens[0]) * 12 + int(heightTokens[1])
        return {'name': reName.findall(pageData)[0],
                'position': rePosition.findall(pageData)[0],
                'height': height,
                'weight': int(reWeight.findall(pageData)[0]),
                'age': int(reAge.findall(pageData)[0]),
                'college': reCollege.findall(pageData)[0],
                'team': reTeam.findall(pageData)[0]}
    except:
        print 'Failed to load', playerID

# Open the CSV file for output.
csvFile = csv.writer(open('players.csv', 'w'), delimiter=',', quotechar='"')

# Download the list of teams.
teams = getTeamList()

# For each team, download the list of players.
for team in teams:
    print 'Retrieving players from the', team[1]
    players = getTeamPlayers(team[0])
    # For each player, download their info and write it to the CSV file.
    for player in players:
        playerInfo = getPlayerInfo(player[0])
        if playerInfo:
            csvFile.writerow(playerInfo.values())
        # Wait between each player.
        time.sleep(0.1)
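One caveat worth adding (our observation, not from the book): Python 2 dictionaries are unordered, so csvFile.writerow(playerInfo.values()) leaves the CSV column order up to the interpreter, consistent within a run but not guaranteed across versions. A small hypothetical variant pins the columns; it would replace the writerow call above, after emitting csvFile.writerow(FIELDS) once as a header row:

FIELDS = ['name', 'position', 'height', 'weight', 'age', 'college', 'team']

def writePlayer(csvFile, playerInfo):
    # Emit values in a fixed, documented column order instead of dict order.
    csvFile.writerow([playerInfo[f] for f in FIELDS])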

3. Conclusion

In summary, our target readers are users of data analysis tools rather than researchers or developers, in line with our aim of testing our source code and improving it towards publishing a comprehensive new Python library. Still, we hope the latter also find the book useful as a way of entering the "world" of Python and DM. The goal of this free book is not to describe all facets of the DM process, because many existing books already cover that ground; it mainly focuses on presenting our experience of using Python and its tools in atomic DM as well as in database programming. One might think that expensive tools necessarily mean better tools. On the contrary, this book shows that freely distributed Python tools make it possible to perform "serious" DM for free, with no compromise in the quality of the obtained solutions. The book is accompanied by a set of freely available Python source files that can be obtained from the book web site. These files include all the code used in the case studies, written by the book's authors, in addition to code collected from other Python books, scientific papers and Python blogs. They support the "do it yourself" philosophy followed throughout. The authors strongly recommend that readers install the Python modules and try the code as they read the book, as explained in this paper. All data used in the case studies are available at the book web site as well.

References

[1] Allen B. Downey (2012). Think Python (1st edition). O'Reilly Media.

[2] Jiawei Han & Micheline Kamber (2006). Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers. ISBN 1558609016.

[3] Matthew A. Russell (2011). Mining the Social Web. O'Reilly Media.

[4] Wes McKinney (2012). Python for Data Analysis. O'Reilly Media.

[5] Amani Tahat & Ali Khwaldeh (2011). Atomic Data Mining Numerical Methods, Source Code SQlite with Python (1st edition). Jordan: Dar AL-Ketab AL-Thaqafi. ISBN 978-9957-550-22-6.

[6] Tahat, A. & Ling, M. (2011). Mapping relational operations onto hypergraph model. The Python Papers, 6(1): 04.

[7] Tahat, A. & Salah, W. (2011). Comprehensive online atomic database management system (DBMS) with highly qualified computing capabilities. Int. J. Database Manage. Syst., 3: 1-20.