

Available online at www.sciencedirect.com

SciVerse ScienceDirect

Procedia Engineering 24 (2011) 861-865

2011 International Conference on Advances in Engineering

The Research on Coding Scheme of Binary-Tree for XML

Xiao Ke *

Hunan University of Science and Engineering, Computer and Communication Engineering Department, Yongzhou, Hunan 425100

Abstract

There are usually four options for storing and querying XML data: the file system; a relational database system; an object-oriented database system; or a dedicated native XML database system, such as Tamino or TextML. To support structural queries over XML efficiently, one method builds a path index over the XML document tree and uses it to speed up structural query evaluation. Another method encodes the nodes of the XML document tree: each node receives a unique code, so that the structural relationship between nodes can be identified directly from the codes without traversing the original XML document. In other words, coding turns the evaluation of structural queries over XML into the evaluation of structural joins.

©2011 Published by Elsevier Ltd. Selection and/or peer-review under responsibility of ICAE2011

Keywords: XML; Coding Scheme; file system

1. Introduction

With the emergence of large volumes of XML data, how to store and query XML data safely and effectively has become an important issue in the database field. Node coding has become the mainstream approach, and a variety of coding schemes have been proposed for XML query processing, mainly interval-based and path-based schemes. Interval-based schemes exploit the fact that XML documents are ordered: each node receives a code according to the position of its element, in dictionary order, in the original XML document. Path-based schemes exploit the nested structure of XML documents: starting from the root node, every path and every element node receives a code that follows the nesting. Many coding schemes exist so far, such as interval coding [1-2], prefix coding [3-5] and bit vector coding [6]. Some of them are not conducive to updating the document; others are unfavorable for storage because their codes are too long. This paper proposes a coding scheme based on Binary-Tree which, to some extent, can not only determine the ancestor-descendant relationship of two nodes in linear time but also optimize storage.

* Corresponding author. Tel.: +86-15201298261. E-mail address: ieee2010@foxmail.com

1877-7058 © 2011 Published by Elsevier Ltd. doi: 10.1016/j.proeng.2011.12.414

2. Indexed Stack-tree-desc algorithm

The Stack-Tree-Desc algorithm (STD) [2] was proposed by S. Al-Khalifa and H. V. Jagadish. It needs only one scan each of AList and DList to compute a structural join on the containment relation.

The basic idea of the Indexed Stack-Tree-Desc algorithm (ISTD) is as follows:

The improved algorithm is also stack-based: the stack stores ancestor nodes that may be needed in the structural join. AList and DList each need only a single scan to complete the structural join.

AList is the list of potential ancestors, sorted in preorder. DList is the list of potential descendants, sorted in preorder. If the stack is empty and the current node of AList is an ancestor of the current node of DList, the current AList node is pushed onto the stack. If the stack is not empty and the current AList node is a descendant of the node on top of the stack, the current AList node is also pushed onto the stack. If a DList node is a descendant of the stack top, it must be a descendant of every node in the stack, and it cannot be a descendant of any other AList node, because all of its ancestors are already in the stack. The DList node is therefore matched with every node in the stack, and the resulting pairs are added to the result set. If the current DList node is not a descendant of the stack top, the stack top has no further descendants in DList, so it is popped.
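The stack-based scan described above can be sketched in Python. This is a minimal illustration of the plain (non-indexed) Stack-Tree-Desc join, not the paper's implementation: the `Node` type, the function name `stack_tree_desc` and the sample tree are assumptions made for the example, and the comparisons use each node's own preorder number together with its `max` value.

```python
from typing import NamedTuple


class Node(NamedTuple):
    # Hypothetical node record; field names follow the tuple (preorder, min, max, level).
    preorder: int
    min: int
    max: int
    level: int


def stack_tree_desc(alist, dlist):
    """Return (ancestor, descendant) pairs, ordered by descendant.

    Both lists must be sorted by preorder and come from one document.
    """
    out, stack = [], []
    i = j = 0
    while j < len(dlist) and (i < len(alist) or stack):
        d = dlist[j]
        a = alist[i] if i < len(alist) else None
        # Smallest preorder number still to be processed.
        nxt = d.preorder if a is None else min(a.preorder, d.preorder)
        # Pop stack nodes whose subtree ends before that position.
        while stack and stack[-1].max < nxt:
            stack.pop()
        if a is not None and a.preorder < d.preorder:
            stack.append(a)  # a may be an ancestor of later DList nodes
            i += 1
        else:
            # Every node left on the stack is an ancestor of d.
            out.extend((s, d) for s in stack)
            j += 1
    return out


# Sample tree (preorder numbers): 1 -> {2 -> {3, 4}, 5 -> {6}, 7}
alist = [Node(1, 2, 7, 0), Node(2, 3, 4, 1), Node(7, 7, 7, 1)]
dlist = [Node(3, 3, 3, 2), Node(4, 4, 4, 2), Node(6, 6, 6, 2)]
pairs = [(a.preorder, d.preorder) for a, d in stack_tree_desc(alist, dlist)]
print(pairs)
```

Each list is read once from front to back, and a node leaves the stack as soon as its interval ends before the next unprocessed position, which is what keeps the join to a single pass over both inputs.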

During the structural join of AList and DList, the Indexed Stack-Tree-Desc algorithm does not always advance a pointer to the next node; it can skip several nodes at once. Because AList and DList are sorted in preorder, an index structure is used to find the nearest ancestor or descendant node and skip the unnecessary nodes in between. Each entry of the index table has the form node (preorder, name, min, max, level). As Figure 1 shows, the ISTD algorithm uses the index structure to locate the descendants of a node and so avoids scanning every node of AList and DList, which improves efficiency. In the figure, thick line segments indicate the [min, max] intervals of AList nodes, and thin line segments indicate the [min, max] intervals of DList nodes. Solid arrows show nodes being skipped, and dashed lines show positions located through the index structure.
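The skipping idea can be illustrated with a binary search over the sorted preorder numbers. This is only a sketch under stated assumptions: the paper does not specify its index structure, so a sorted Python list searched with `bisect` stands in for it, and the helper names `skip_descendants` and `skip_ancestor_subtree` are hypothetical.

```python
from bisect import bisect_right


def skip_descendants(d_keys, j, a_preorder):
    """Index of the first DList entry at or after j whose preorder exceeds a_preorder.

    Used when the stack is empty: DList nodes that precede the current
    AList node cannot have an ancestor left in AList, so they are skipped.
    """
    return bisect_right(d_keys, a_preorder, lo=j)


def skip_ancestor_subtree(a_keys, i, a_max):
    """Index of the first AList entry at or after i that lies outside [.., a_max].

    Used when the current AList node's interval ends before the current
    DList node: the node and everything inside its interval are skipped.
    """
    return bisect_right(a_keys, a_max, lo=i)


# Preorder numbers of the two lists (assumed sample data).
d_keys = [3, 4, 6, 9, 12, 15]
a_keys = [2, 3, 5, 20]

# Current ancestor has preorder 10: jump straight to the DList node 12.
next_d = skip_descendants(d_keys, 0, 10)
# Current ancestor 2 has max 8 but the current descendant is at 10:
# skip 2 and the AList nodes nested inside it (3 and 5).
next_a = skip_ancestor_subtree(a_keys, 0, 8)
print(d_keys[next_d], a_keys[next_a])
```

A tree-shaped index would serve the same purpose; the point of the sketch is only that sorted preorder numbers let the pointer jump in logarithmic time instead of stepping node by node.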

Figure 1. ISTD algorithm skipping unmatched nodes with index structure: (a) skipping descendant nodes; (b) skipping ancestor nodes.

The Indexed Stack-Tree-Desc algorithm is shown in Figure 2:

Input: AList and DList, the lists participating in the structural join.

Output: join results, sorted in descendant order.

Improved Stack-Tree-Desc (AList, DList)
/* Assume that all nodes in AList and DList have the same docID */
/* AList is the list of potential ancestors, in sorted order of preorder */
/* DList is the list of potential descendants, in sorted order of preorder */
a = AList->firstNode; d = DList->firstNode; Output = NULL;
while (AList and DList are not empty or the stack is not empty) {
    if (a.min > stack->top.max && d.min > stack->top.max) {
        tuple = stack->pop();
    } else if (a.min < d.min) {
        stack->push(a);
        a = next AList node participating in the structural join, found using the index structure;
    } else {
        match all nodes of the stack to d, appending each pair (a, d) to Output;
        if (stack is empty)
            d = next DList node participating in the structural join, found using the index structure;
        else
            d = next node of DList;
    }
}

3. Extended region coding

In [2], the operation that finds node pairs satisfying a specific structural relation in an XML document is called a structural join. To accelerate structural joins, researchers have proposed various index schemes. Their basic idea is to encode each node so that the structural relationship between nodes can be decided from the codes alone, without traversing the XML document.

Region coding makes full use of the order of an XML document: every node of the XML document tree is encoded in alphabetical order or visit order. The coding below extends Wan region coding, so this paper calls it extended region coding. The XML document is represented by a document tree model in which each node is a 4-tuple (preorder, min, max, level): preorder is the node's position in a preorder traversal, min is the smallest preorder number among all of the node's descendants, max is the largest preorder number among all of the node's descendants, and level is the node's depth in the XML document tree. Within the coding scheme, preorder is the unique identifier of a node. For multiple XML documents, the 5-tuple (docID, preorder, min, max, level) identifies nodes from different XML documents. If a node is a leaf, its preorder, min and max are the same. If a node's only descendant is a single child, its min and max are the same.
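One way to compute these tuples is a single preorder walk of the parsed document. The sketch below is an illustration under assumptions the paper does not state: numbering starts at 1, the root has level 0, only element nodes are numbered (attributes and text are ignored), and the names `Code` and `encode` are invented for the example.

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple


class Code(NamedTuple):
    name: str
    preorder: int
    min: int
    max: int
    level: int


def encode(xml_text):
    """Return one Code per element, in preorder."""
    codes = []
    counter = 0

    def visit(elem, level):
        nonlocal counter
        counter += 1
        pre = counter
        slot = len(codes)
        codes.append(None)  # reserve this node's position in the output
        for child in elem:
            visit(child, level + 1)
        last = counter  # largest preorder number inside this subtree
        if last == pre:
            lo = hi = pre  # leaf: preorder, min and max coincide
        else:
            lo, hi = pre + 1, last  # first and last descendant
        codes[slot] = Code(elem.tag, pre, lo, hi, level)

    visit(ET.fromstring(xml_text), 0)
    return codes


codes = encode("<a><a><d/><d/></a><b><d/></b><a/></a>")
for c in codes:
    print(c)
```

In this sample the element `b` has a single leaf child, so its min and max are both 6, and every leaf has equal preorder, min and max, matching the two special cases described above.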

The properties of the encoding scheme are as follows:

Property 1 For any two nodes u and v of the XML document tree, node u is an ancestor of node v if and only if docID(u) = docID(v) ∧ the preorder of node v ∈ interval u[min, max].

Proof: For any two nodes u and v of the XML document tree, docID(u) = docID(v) shows that u and v are in the same XML document. The interval u[min, max] describes the subtree rooted at u: min is the preorder of its minimum descendant and max is the preorder of its maximum descendant, so all descendants of u lie in the interval [min, max]. If the preorder of node v ∈ interval u[min, max], then node u is an ancestor of node v.

Property 2 For any two nodes u and v of the XML document tree, node u is the parent of node v if and only if docID(u) = docID(v) ∧ the preorder of node v ∈ interval u[min, max] ∧ level(v) - level(u) = 1.

Proof: For any two nodes u and v of the XML document tree, by Property 1 the preorder of node v ∈ interval u[min, max] means that v is a descendant of u. Since level(v) - level(u) = 1, node u lies exactly one layer above node v in the tree; therefore node u is the parent of node v.

Property 3 For any two nodes u and v of the XML document tree whose parent is node p, u and v are in the preceding-sibling or following-sibling relation if and only if docID(u) = docID(v) ∧ the preorders of nodes u and v ∈ interval p[min, max] ∧ level(u) = level(v).

Proof: For any two nodes u and v of the XML document tree, docID(u) = docID(v) shows that u and v are in the same XML document. The preorders of u and v ∈ interval p[min, max], so u and v are both descendants of node p. level(u) = level(v) indicates that u and v are in the same layer of the XML document tree, and since p is their parent that layer is the one directly below p, so u and v are siblings.
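The three properties translate into short predicates over the 5-tuple. The following Python sketch is illustrative only: the `Node` type and function names are assumptions, the ancestor test adds an explicit check that u and v are different nodes (a leaf's interval contains its own preorder), and the sibling test uses the stricter reading that both nodes are children of p in the sense of Property 2.

```python
from typing import NamedTuple


class Node(NamedTuple):
    doc_id: int
    preorder: int
    min: int
    max: int
    level: int


def is_ancestor(u, v):
    """Property 1, with an added guard so a node is not its own ancestor."""
    return (u.doc_id == v.doc_id
            and u.preorder != v.preorder
            and u.min <= v.preorder <= u.max)


def is_parent(u, v):
    """Property 2: ancestor exactly one level above."""
    return is_ancestor(u, v) and v.level - u.level == 1


def are_siblings(u, v, p):
    """Property 3, read as: u and v are distinct children of p (assumption)."""
    return u.preorder != v.preorder and is_parent(p, u) and is_parent(p, v)


# Sample tree (preorder numbers): 1 -> {2 -> {3, 4}, 5 -> {6}, 7}
root = Node(1, 1, 2, 7, 0)
a2 = Node(1, 2, 3, 4, 1)
d3 = Node(1, 3, 3, 3, 2)
d4 = Node(1, 4, 4, 4, 2)
d6 = Node(1, 6, 6, 6, 2)

print(is_ancestor(root, d3), is_parent(root, d3), is_parent(a2, d3))
print(are_siblings(d3, d4, a2), are_siblings(d3, d6, root))
```

Each test needs only a few integer comparisons on the two codes, so no traversal of the document tree is involved.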

4. Experimental analysis of stru-coding

At present, XML document coding methods include interval coding, bit vector coding, prefix coding and PBitree coding. Interval coding, bit vector coding and PBitree coding do not support document updates. It has been suggested that interval coding be improved by reserving space in the code <order, size>, but there is no good standard for how much space to reserve, and once the reserved space is used up the document still has to be re-encoded.

Prefix coding supports document updates, but its storage cost is very large. The Stru-code coding in this paper can quickly determine the ancestor-descendant relationship with an average code length of log n and, when the XML document is updated, requires only small changes rather than re-coding.

The experiments in this paper use the standard XMark and Sharks experimental data. The experimental environment is a Windows 2000 Server platform with an AMD2600 CPU and 1 GB of RAM; the code is written in standard C++.

Experiment I: determining the size of B (the divided blocks)

When the Binary-Tree is coded without block coding, storage performance declines sharply, mainly because of too much redundant information. With block coding, the storage space is reduced significantly and is even lower than that of interval coding. The main reason is that interval coding has to store both the pre and post intervals, so it needs more storage than the Binary-Tree coding divided into blocks. Sub-block coding, however, requires the size of B to be determined first. For this experiment, Sharks uses the Shakespeare 2.00 standard data (http://www.XML.com/pub/r7396), and XMark is generated from standard data.

5. Conclusions

With the wide use of XML, demand for XML query processing keeps growing. Most XML query languages, such as XQuery, XPath, Lorel and XML-QL, use regular path expressions as the core technology for structural queries over XML documents. Implementing this technology usually requires the query system to decide quickly whether an ancestor-descendant or parent-child relationship holds between nodes of the XML document, and coding the position of each node in the document tree is a mainstream approach to this. This paper proposes an XML coding method based on Binary-Tree that uses addition and shift operations, can decide the parent-child or ancestor-descendant relationship between nodes in constant time, and maintains the order among sibling nodes. It addresses the problem that existing methods need to re-code on update. In addition, a new storage method based on Binary-Tree is put forward that reduces the average code length to O(log n). Finally, experimental analysis shows that Stru-code coding performs very well in both time and space.

References

[1] Zhang C, Naughton J, DeWitt D, et al. On Supporting Containment Queries in Relational Database Management Systems [A]. In: Proc. of the ACM SIGMOD Conf [C]. Santa Barbara, California, May 2001, pp. 426-437.

[2] Amagasa T, Yoshikawa M, Uemura S. QRS: A Robust Numbering Scheme for XML Documents. In: Proc. of ICDE, 2003, pp. 705-707.

[3] Cohen E, Kaplan H, Milo T. Labeling Dynamic XML Trees. In: Proc. of PODS, 2002, pp. 271-281.

[4] A. Sheth, J. Larson and E. Watkins. TAILOR, A tool for updating views. LNCS v.303, pp. 190-213, 1988.

[5] Y. Masunaga. A relational database view update translation mechanism. Proceedings of the 10th International Conference on Very Large Data Bases, pp. 309-320, 1984.

[6] M. Keller. Algorithms for translating view updates to database updates for views involving selections, projections and Joins. 4th PODS, 1985.

[7] A. M. Keller. Choosing a view update translator by dialog at view definition time. Proceedings of the 12th International Conference on Very Large Data Bases, pp. 467-474, August 25-28, 1986.

[8] L. Wang, M. Mulchandani and E.A. Rundensteiner. Updating XQuery views published over relational data: a roundtrip case study. In XML Database Symposium, pp. 223-237, 2003.

[9] V.P. Braganholo, S.B. Davidson and C.A. Heuser. From XML view updates to relational view updates: old solution to a new problem. In VLDB, pp. 276-287, 2004.

[10] S Abiteboul, D Quass, J McHugh, et al. The Lorel query language for semi-structured data. Int'l Journal on Digital Libraries, vol. 1, no. 1, 1997, pp. 68-80.

[11] Joins. Chinese Journal of Computers, 2005, 28(1): 113-127.

[12] WANG Jing, MENG Xiao-feng, WANG Shan. Structural Join of XML Based on Range Partitioning. Journal of Software, 2004, 15(5): 720-729.

[13] WAN Chang-xuan, LIU Xi-ping. Structural Join and Staircase Join Algorithms of Sibling Relationship. Journal of Computer Science & Technology, 2007, 22(2): 171-181.

[14] LIU Xi-ping, WAN Chang-xuan, CHEN Lei. Effective XML Content and Structure Retrieval with Relevance Ranking. In: Proceedings of the 18th ACM Int'l Conference on Information and Knowledge Management (ACM CIKM2009), Hong Kong, November 2-6, 2009. 147-156.

[15] WANG Hong-qiang, LI Jian-zhong, WANG Hong-zhi. Processing XPath over F&B-Index. Journal of Computer Research and Development, 2010(05).

[16] JIANG Jin-hua, CHEN Ke, LI Xiao-yan, et al. Efficient processing of ordered XML twig pattern matching based on extended Dewey. Journal of Zhejiang University Science A (An International Applied Physics & Engineering Journal), 2009(12).

[17] LI Guo-liang, FENG Jian-hua. An Effective Semantic Cache for Exploiting XPath Query/View Answerability. Journal of Computer Science & Technology, 2010(02).

[18] ZHOU Jun-feng, MENG Xiao-feng, LING TokWang. Efficient processing of partially specified twig pattern queries. Science in China (Series F: Information Sciences), 2009(10).