


2012 International Conference on Medical Physics and Biomedical Engineering

XML Structural Join Based on Extended Region Coding

YANG Yang¹, LI Hai-ge²

¹ Computing Center, Henan University, Kaifeng, Henan, China, yangyang@henu.edu.cn
² Equipment Department, Kaifeng Construction Design Institute, Kaifeng, Henan, China, lihaigekf@126.com

Abstract

XML has become a standard technology for exchanging a wide variety of data on the web because of its structure, labeling, portability and extensibility, so querying XML documents efficiently has become an urgent task. At present, most XML indexing and query methods are based on encoding the XML document tree. Many encoding schemes have been proposed, and region coding (for example Dietz coding, Li-Moon coding, Zhang coding and Wan coding) is the most commonly used. This paper proposes an extended region coding built on region coding: the XML document tree is traversed in preorder, and the range of preorder numbers of all descendants of a node is taken as the node's region. During a structural join, if the preorder number of a node falls within this region, the structural relation holds. The extended region coding therefore decides structural relations without traversing the XML document tree. Structural join algorithms for XML path queries have also received considerable attention recently, and several good algorithms have been proposed. One of them, the Stack-Tree-Desc algorithm, scans the ancestor list and the descendant list once each to decide ancestor/descendant relations, but it still scans some nodes that do not take part in the join. Query efficiency improves if the elements of the ancestor list and descendant list that need not participate in the structural join can be skipped. This paper therefore proposes an improved algorithm based on Stack-Tree-Desc that introduces an index structure to avoid scanning unwanted nodes, so that scanning every node in order is unnecessary and query time is shortened accordingly. The improved algorithm decides structural relations quickly by using the extended region coding presented in this paper. Experiments are conducted to test the effectiveness of the extended region coding and the Indexed Stack-Tree-Desc algorithm, and the results show that the method is effective.

© 2012 Published by Elsevier B.V. Selection and/or peer review under responsibility of ICMPBE International Committee. Physics Procedia 33 (2012) 1374-1380. ISSN 1875-3892. doi:10.1016/j.phpro.2012.05.225

Keywords: xml; region coding; path query; structural join

1. Introduction

With the rapid development of the internet, more and more data is described, stored, exchanged and represented in XML, so information retrieval over XML documents is increasingly important. A structural query on an XML document is usually transformed into structural join operations, based on containment or document position relations, between two lists (an ancestor list and a descendant list). Keyword queries are likewise transformed into containment-based structural joins between two lists. Efficient support for structural joins is therefore the key factor in querying XML data. The core of XML querying is the evaluation of XPath expressions.

Many algorithms have been proposed to evaluate XPath queries. Reference [1] presents the multi-predicate merge join (MPMGJN) algorithm, which performs poorly when elements with the same name are nested inside one another. The EE-join algorithm of [3] has the same problem.

Reference [2] puts forward the Stack-Tree-Desc algorithm, which maintains a stack of ancestor nodes that might participate in the structural join. The I/O cost of Stack-Tree-Desc is that of scanning both AList (the ancestor list) and DList (the descendant list). However, the structure of the XML document makes it possible to determine beforehand that some nodes of AList and DList will not take part in the join, and many push and pop operations can also be avoided.

Coding schemes have been put forward to decide the structural relation of any pair of nodes in the XML document tree without traversing the tree, which accelerates structural joins. Typical coding schemes include Dietz coding, Li-Moon coding, Zhang coding and Wan coding. Building on the features of typical region coding, this paper offers an extended region coding in which all descendants of a node lie in the interval [min, max]; this property helps decide structural relations quickly when an XPath query is evaluated. The properties of this encoding scheme and their proofs are given in this paper.

This paper aims to match the structural relation of any pair of nodes and thereby accelerate XPath evaluation. To this end it presents the Indexed Stack-Tree-Desc (ISTD) algorithm, an improvement of the Stack-Tree-Desc algorithm. By making full use of an index structure, ISTD skips nodes that do not participate in the structural join, and with the help of extended region coding the structural relation of any two nodes can be decided quickly and easily. Query efficiency improves accordingly.

2. Extended region coding

In [2], the operation of finding the node pairs in an XML document that satisfy a specific structural relation is called a structural join. To accelerate structural joins, researchers have put forward various index schemes. Their basic idea is to encode every node so that the structural relation between nodes can be decided from the codes alone, without traversing the XML document.

Region coding exploits the ordering of an XML document: every node of the XML document tree is encoded according to alphabetical order or visit order. The coding in this paper extends Wan region coding, so it is called extended region coding. An XML document is modeled as a document tree, and each node of the tree is represented by a 4-tuple (preorder, min, max, level): preorder is the node's position in a preorder traversal, min is the smallest preorder number among all of the node's descendants, max is the largest preorder number among all of its descendants, and level is the node's depth in the XML document tree. Within one document, preorder uniquely identifies a node. For multiple XML documents, the 5-tuple (docID, preorder, min, max, level) is used to distinguish nodes from different documents. If a node is a leaf, its preorder, min and max are equal. If a node has exactly one descendant (a single child that is itself a leaf), its min and max are equal.
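The encoding described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the names `Code` and `encode` are hypothetical, the root is assumed to be at level 1, preorder numbering is assumed to start at 1, and only element nodes are numbered (the paper also counts attributes).

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple


class Code(NamedTuple):
    """Extended region code of one element (hypothetical helper type)."""
    preorder: int
    min: int      # smallest preorder number among the descendants
    max: int      # largest preorder number among the descendants
    level: int
    name: str


def encode(root: ET.Element) -> list:
    """Return one Code per element, listed in preorder."""
    codes = {}
    counter = 0

    def visit(elem: ET.Element, level: int) -> None:
        nonlocal counter
        counter += 1
        pre = counter
        for child in elem:
            visit(child, level + 1)
        last = counter  # preorder number of the last descendant, or pre for a leaf
        if last == pre:
            lo, hi = pre, pre          # leaf: preorder = min = max
        else:
            lo, hi = pre + 1, last     # first child is the smallest descendant
        codes[pre] = Code(pre, lo, hi, level, elem.tag)

    visit(root, 1)  # assumption: the root is at level 1
    return [codes[k] for k in sorted(codes)]


doc = ET.fromstring("<book><title/><chapter><section/><section/></chapter></book>")
for code in encode(doc):
    print(code)
```

A single depth-first pass is enough because the largest descendant number of a node is simply the value of the counter when the traversal leaves that node.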

The properties of the encoding scheme are as follows:

Property 1 For any two nodes u and v of the XML document tree, node u is an ancestor of node v if and only if docID(u) = docID(v) ∧ preorder(v) ∈ u[min, max].

Proof: docID(u) = docID(v) means that u and v are in the same XML document. The interval u[min, max] is bounded by the descendants with the smallest and the largest preorder numbers in the subtree rooted at u, namely min and max, so all descendants of u lie in [min, max]. Hence, if preorder(v) ∈ u[min, max], then u is an ancestor of v.

Property 2 For any two nodes u and v of the XML document tree, node u is the parent of node v if and only if docID(u) = docID(v) ∧ preorder(v) ∈ u[min, max] ∧ level(v) - level(u) = 1.

Proof: By Property 1, preorder(v) ∈ u[min, max] means that v is a descendant of u. Since level(v) - level(u) = 1, u lies exactly one layer above v in the tree. Therefore u is the parent of v.

Property 3 For any two nodes u and v of the XML document tree whose parent is node p, u and v are in a preceding-sibling or following-sibling relation if and only if docID(u) = docID(v) ∧ preorder(u), preorder(v) ∈ p[min, max] ∧ level(u) = level(v).

Proof: docID(u) = docID(v) means that u and v are in the same XML document. Since preorder(u) and preorder(v) both lie in p[min, max], u and v are both descendants of p, and level(u) = level(v) indicates that they are in the same layer of the XML document tree. So u and v are siblings.
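Properties 1 to 3 translate directly into constant-time predicates. The sketch below is illustrative only; `Node`, `is_ancestor`, `is_parent` and `are_siblings` are hypothetical names. Two assumptions go slightly beyond the text: a node is never reported as its own ancestor (for a leaf, the interval [min, max] contains only the leaf's own preorder number), and the sibling test checks that p is the parent of both nodes, which is a stricter reading of Property 3.

```python
from typing import NamedTuple


class Node(NamedTuple):
    """5-tuple (docID, preorder, min, max, level) from the text."""
    doc_id: int
    preorder: int
    min: int
    max: int
    level: int


def is_ancestor(u: Node, v: Node) -> bool:
    """Property 1, plus the assumption that a node is not its own ancestor."""
    return (u.doc_id == v.doc_id
            and u.preorder != v.preorder
            and u.min <= v.preorder <= u.max)


def is_parent(u: Node, v: Node) -> bool:
    """Property 2: ancestor exactly one level above."""
    return is_ancestor(u, v) and v.level - u.level == 1


def are_siblings(u: Node, v: Node, p: Node) -> bool:
    """Property 3, read strictly: p is the parent of both distinct nodes."""
    return u.preorder != v.preorder and is_parent(p, u) and is_parent(p, v)


# Codes of <book><title/><chapter><section/><section/></chapter></book>
book = Node(1, 1, 2, 5, 1)
title = Node(1, 2, 2, 2, 2)
chapter = Node(1, 3, 4, 5, 2)
section1 = Node(1, 4, 4, 4, 3)
section2 = Node(1, 5, 5, 5, 3)

print(is_ancestor(book, section1))
print(is_parent(chapter, section1))
print(are_siblings(section1, section2, chapter))
```

No tree traversal is involved: each test compares a handful of integers, which is the point of the coding scheme.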

3. Indexed Stack-Tree-Desc algorithm

The Stack-Tree-Desc (STD) algorithm [2] was put forward by S. Al-Khalifa, H. V. Jagadish and colleagues. It needs to scan AList and DList only once each to perform a structural join on the containment relation.

The basic idea of the Indexed Stack-Tree-Desc (ISTD) algorithm is as follows.

The improved algorithm is also stack-based: the stack stores ancestor nodes that may be needed in the structural join. AList and DList each need to be scanned only once to complete the structural join.

AList is the list of potential ancestors and DList is the list of potential descendants, both sorted by preorder. If the stack is empty and the current node of AList is an ancestor of the current node of DList, the current AList node is pushed onto the stack. If the stack is not empty and the current AList node is a descendant of the node on top of the stack, the current AList node is also pushed. If a DList node is a descendant of the stack top, it must be a descendant of every node in the stack, and it cannot be a descendant of any other AList node, because all of its ancestors are already in the stack. The DList node is therefore matched with every node in the stack, and the resulting pairs are appended to the result set. If the current DList node is not a descendant of the stack top, the stack top has no further descendants in DList and is popped.
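The stack discipline described above can be sketched as follows. This is a minimal variant written against the extended region code, not the authors' code: `Node` and `stack_tree_desc` are hypothetical names, all nodes are assumed to share one docID, and an AList node is pushed as soon as its preorder number is smaller than that of the current DList node (nodes that turn out to have no descendant are simply popped later).

```python
from typing import NamedTuple


class Node(NamedTuple):
    preorder: int
    min: int
    max: int


def stack_tree_desc(alist, dlist):
    """Ancestor/descendant join; both inputs sorted by preorder.

    Returns (ancestor, descendant) pairs in descendant order.
    """
    out, stack = [], []
    i = j = 0
    while j < len(dlist):
        d = dlist[j]
        if i < len(alist) and alist[i].preorder < d.preorder:
            a = alist[i]
            # Drop stack entries whose region ends before a, so that the
            # stack always holds a chain of nested ancestors.
            while stack and stack[-1].max < a.preorder:
                stack.pop()
            stack.append(a)
            i += 1
        else:
            # Drop entries whose region ends before d.
            while stack and stack[-1].max < d.preorder:
                stack.pop()
            # Every remaining entry contains d in its [min, max] interval.
            out.extend((s, d) for s in stack)
            j += 1
    return out


# Tree <a><b><a><b/></a></b><b/></a>: a nodes at preorder 1 and 3,
# b nodes at preorder 2, 4 and 5.
alist = [Node(1, 2, 5), Node(3, 4, 4)]
dlist = [Node(2, 3, 4), Node(4, 4, 4), Node(5, 5, 5)]
for anc, desc in stack_tree_desc(alist, dlist):
    print(anc.preorder, desc.preorder)
```

Each list is read once from front to back, and every node is pushed and popped at most once, so the work is linear in the input size plus the output size.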

During a structural join of AList and DList in the ISTD algorithm, a list pointer does not always advance to the next node; it may skip several nodes. Because AList and DList are sorted by preorder, an index structure can be used to find the nearest ancestor or descendant node and skip the unnecessary nodes in between. Each entry of the index table has the form node (preorder, name, min, max, level). As Figure 1 shows, ISTD uses the index structure to locate the descendants of a node, which avoids scanning every node of AList and DList and raises the efficiency of the algorithm. In the figure, thick line segments indicate the [min, max] intervals of AList nodes and thin line segments indicate the [min, max] intervals of DList nodes. Solid arrows show nodes being skipped, and dashed lines show positions located through the index structure.

Figure 1. ISTD algorithm skipping unmatched nodes with index structure: (a) skipping descendant nodes; (b) skipping ancestor nodes

The Indexed Stack-Tree-Desc algorithm is shown in Figure 2:

Input: AList and DList, the lists participating in the structural join.
Output: the join result, sorted in descendant order.

Improved-Stack-Tree-Desc (AList, DList)
/* Assume that all nodes in AList and DList have the same docID */
/* AList is the list of potential ancestors, in sorted order of preorder */
/* DList is the list of potential descendants, in sorted order of preorder */
a = AList->firstNode; d = DList->firstNode; Output = NULL;
while (AList and DList are not empty or the stack is not empty) {
    if (a.min > stack->top.max && d.min > stack->top.max) {
        tuple = stack->pop();
    } else if (a.min < d.min) {
        stack->push(a);
        a = next AList node participating in the structural join, found using the index structure;
    } else {
        match all nodes of the stack to d, appending each (a, d) pair to Output;
        if (stack is empty)
            d = next DList node participating in the structural join, found using the index structure;
        else
            d = next node of DList;
    }
}

Figure 2. Indexed Stack-Tree-Desc algorithm
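One way to realize the skipping idea in runnable form is sketched below. It is an interpretation under assumptions rather than a transcription of Figure 2: the index structure is replaced by binary search (Python's `bisect`) over the preorder-sorted lists, all nodes are assumed to share one docID, and the names `Node` and `indexed_stack_tree_desc` are hypothetical. Two skips are used: an AList node whose region ends before the current DList node is skipped together with its whole subtree, and when the stack is empty the DList pointer jumps past every node that precedes the next AList candidate.

```python
from bisect import bisect_right
from typing import NamedTuple


class Node(NamedTuple):
    preorder: int
    min: int
    max: int


def indexed_stack_tree_desc(alist, dlist):
    """Stack-based join that skips nodes via binary search on preorder."""
    a_keys = [a.preorder for a in alist]  # stand-in for the index on AList
    d_keys = [d.preorder for d in dlist]  # stand-in for the index on DList
    out, stack = [], []
    i = j = 0
    while j < len(dlist):
        d = dlist[j]
        if i < len(alist) and alist[i].preorder < d.preorder:
            a = alist[i]
            if a.max < d.preorder:
                # a ends before d: skip a and all AList nodes inside a.
                i = bisect_right(a_keys, a.max, lo=i)
            else:
                while stack and stack[-1].max < a.preorder:
                    stack.pop()
                stack.append(a)
                i += 1
            continue
        while stack and stack[-1].max < d.preorder:
            stack.pop()
        if stack:
            out.extend((s, d) for s in stack)
            j += 1
        elif i < len(alist):
            # No open ancestor: jump to the first DList node that comes
            # after the next AList candidate in preorder.
            j = bisect_right(d_keys, alist[i].preorder, lo=j)
        else:
            break  # no ancestors left, so no further matches
    return out


# Made-up codes: two early ancestors with no descendants in DList,
# and one ancestor at preorder 10 that contains nodes 11 and 12.
alist = [Node(2, 3, 4), Node(3, 4, 4), Node(10, 11, 12)]
dlist = [Node(6, 6, 6), Node(7, 7, 7), Node(11, 11, 11), Node(12, 12, 12)]
for anc, desc in indexed_stack_tree_desc(alist, dlist):
    print(anc.preorder, desc.preorder)
```

In this example the two early AList nodes are skipped in one jump and the DList nodes at preorder 6 and 7 are never matched against the stack. A disk-based index such as a B+-tree would play the role that the sorted key lists play here.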

4. Performance test

4.1 Experiment environment

The performance of the Stack-Tree-Desc and Indexed Stack-Tree-Desc algorithms is tested in this paper. The test platform is an Intel Core 2 Duo T5870 at 2.00 GHz with 2.00 GB of memory and a 160 GB HDD, running Microsoft Windows XP Professional Service Pack 2 and NetBeans IDE 6.9.

Figure 3 shows the XML DTD of the test data. An edge labeled "A" points at an attribute. An edge labeled "+", "*" or "?" indicates that the element it points at appears one or more times, zero or more times, or zero or one time in the XML document, respectively. The DTD covers common XML document structures and is therefore general.

Using an XML generator, an XML document of 81.6 MB is generated. The document has 1896865 elements and attributes, including 2743 "book" element nodes, 501223 "title" element nodes, 6974 "chapter" element nodes, 467653 "section" element nodes, 171134 "description" element nodes and 121428 "keyword" element nodes.

This paper uses four query instances: book/price, chapter/section, section/description and book/child::*. The tests vary the selectivity: starting from the original parent list and child list, a percentage of the element nodes is deleted from the parent list so as to change the percentage of parent nodes, and the performance of the ISTD structural join algorithm is tested under each setting. The percentage of parent nodes retained is 100%, 70%, 50%, 20% and 1%, respectively.

Figure 3. The XML DTD of test data (element names recoverable from the figure include address, email, phone number, title, section and description)

4.2 Experiment analysis

Two performance measures are selected. One is the number of element nodes scanned, which reflects the ability of an algorithm to skip irrelevant nodes. The other is the running time.

With the number of child nodes held constant and the percentage of parent nodes varied, the numbers of nodes scanned by the two algorithms (STD and ISTD) are shown in Table I. Comparing the two algorithms at the same percentage, ISTD scans fewer nodes than STD in every case. Comparing across percentages, the numbers of nodes scanned by both STD and ISTD decrease as the percentage of parent nodes decreases, but by different amounts: the decrease for ISTD is much larger. This is because the index structure allows unwanted nodes to be skipped instead of scanning the nodes one by one.

Figure 4 shows the running time of the two algorithms when the percentage of parent (or ancestor) nodes is varied and the percentage of child (or descendant) nodes remains the same. The abscissa is the percentage of parent or ancestor nodes, and the ordinate is the running time of the algorithms in seconds. In terms of running time, ISTD is superior to STD, and its advantage becomes more pronounced as the percentage decreases.

Figure 4. Comparison of running time between STD and ISTD (recoverable panel titles: book/price, chapter/section)

5. Conclusion

Judging the structural relations of an XML document quickly is essential for effectively supporting XPath expressions, which are the core of XML querying. Region coding is the mainstream encoding approach for XML documents, and many encoding schemes have been built on it. This paper analyzes several region coding schemes and puts forward an extended region coding to decide structural relations and improve efficiency during XML query evaluation; the properties of this encoding scheme and their proofs are given. It also proposes the Indexed Stack-Tree-Desc algorithm, based on the Stack-Tree-Desc algorithm, which effectively supports XPath queries by introducing an index structure to skip unnecessary nodes and improve execution efficiency. Experimental results show that the extended region coding and the Indexed Stack-Tree-Desc algorithm are effective.

References

[1] Zhang C, Naughton J, DeWitt D, et al. On supporting containment queries in relational database management systems. In: Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data. Santa Barbara, California, United States. May 21-24, 2001. ACM Press. 425-436.

[2] Al-Khalifa S, Jagadish HV, Koudas N, Patel JM, Srivastava D, Wu Y. Structural joins: A primitive for efficient XML query pattern matching. In: Agrawal R, Dittrich K, Ngu AHH, eds. Proc. of the 18th Int'l Conf. on Data Engineering (ICDE). San Jose: IEEE Computer Society, 2002. 141-152.

[3] Li Q, Moon B. Indexing and querying XML data for regular path expressions. In: Proceedings of the VLDB Conference, Roma, 2001, 361-370.

[4] Chien S-Y, Vagena Z, Zhang D, et al. Efficient Structural Joins on Indexed XML Documents. In: Proceedings of 28th International Conference on Very Large Data Bases (VLDB'02). Hong Kong, China. August 20-23, 2002. Morgan Kaufmann. 263-274.

[5] Jiang H, Lu H, Wang W, et al. XR-Tree: Indexing XML Data for Efficient Structural Joins. In: Proceedings of the 19th International Conference on Data Engineering (ICDE'03). Bangalore, India. March 5-8, 2003. IEEE Computer Society. 253-263.

[6] WAN Chang-xuan, LIU Yun-sheng, XU Sheng-hua. Indexing XML Data Based on Region Coding for Efficient Processing of Structural Joins. Chinese Journal of Computers, 2005, 28(1): 113-127.

[7] WANG Jing, MENG Xiao-feng, WANG Shan. Structural Join of XML Based on Range Partitioning. Journal of Software, 2004, 15(5): 720-729.

[8] WAN Chang-xuan, LIU Xi-ping. Structural Join and Staircase Join Algorithms of Sibling Relationship. Journal of Computer Science & Technology, 2007, 22(2): 171-181.

[9] LIU Xi-ping, WAN Chang-xuan, CHEN Lei. Effective XML Content and Structure Retrieval with Relevance Ranking. In: Proceedings of the 18th ACM Int'l Conference on Information and Knowledge Management (ACM CIKM2009), Hong Kong, November 2-6, 2009. 147-156.

[10] WANG Hong-qiang, LI Jian-zhong, WANG Hong-zhi. Processing XPath over F&B-Index. Journal of Computer Research and Development, 2010(05).

[11] JIANG Jin-hua, CHEN Ke, LI Xiao-yan, et al. Efficient processing of ordered XML twig pattern matching based on extended Dewey. Journal of Zhejiang University Science A(An International Applied Physics & Engineering Journal), 2009(12).

[12] LI Guo-liang, FENG Jian-hua. An Effective Semantic Cache for Exploiting XPath Query/View Answerability. Journal of Computer Science & Technology, 2010(02).

[13] ZHOU Jun-feng, MENG Xiao-feng, LING TokWang. Efficient processing of partially specified twig pattern queries. Science in China(Series F:Information Sciences), 2009(10).

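TABLE I. Comparison of the numbers of nodes actually scanned by STD and ISTD (unit: thousand). Column headings give the percentage of parent nodes and the algorithm.

| XPath | 100% STD | 100% ISTD | 70% STD | 70% ISTD | 50% STD | 50% ISTD | 20% STD | 20% ISTD | 1% STD | 1% ISTD |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| book/price | 499 | 401 | 498 | 352 | 497 | 248 | 495 | 124 | 490 | 8 |
| chapter/section | 529 | 478 | 527 | 379 | 525 | 263 | 524 | 187 | 522 | 15 |
| section/description | 698 | 630 | 564 | 523 | 432 | 399 | 310 | 278 | 202 | 47 |
| book/child::* | 2023 | 1880 | 2020 | 1520 | 2019 | 1110 | 2018 | 480 | 2010 | 32 |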
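To make the trend in Table I easier to read, the relative saving of ISTD over STD, (STD - ISTD) / STD, can be computed for each cell. The snippet below only re-expresses the numbers already in the table; the names `TABLE_I`, `PERCENTAGES` and `reduction` are introduced here for illustration.

```python
# Scanned-node counts from Table I (unit: thousand), as (STD, ISTD) pairs
# for parent-node percentages 100%, 70%, 50%, 20% and 1%.
PERCENTAGES = ["100%", "70%", "50%", "20%", "1%"]
TABLE_I = {
    "book/price": [(499, 401), (498, 352), (497, 248), (495, 124), (490, 8)],
    "chapter/section": [(529, 478), (527, 379), (525, 263), (524, 187), (522, 15)],
    "section/description": [(698, 630), (564, 523), (432, 399), (310, 278), (202, 47)],
    "book/child::*": [(2023, 1880), (2020, 1520), (2019, 1110), (2018, 480), (2010, 32)],
}


def reduction(std: float, istd: float) -> float:
    """Share of STD's scanned nodes that ISTD avoids, in percent."""
    return 100.0 * (std - istd) / std


for query, row in TABLE_I.items():
    cells = ", ".join(
        f"{pct}: {reduction(std, istd):.1f}%"
        for pct, (std, istd) in zip(PERCENTAGES, row)
    )
    print(f"{query:22s} {cells}")
```

For book/price, for instance, the saving grows from about 20% when all parent nodes are kept to about 98% when only 1% of them are kept.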