Available online at www.sciencedirect.com


Electronic Notes in Theoretical Computer Science 150 (2006) 3-19

www.elsevier.com/locate/entcs

A Data Model for Data Integration

James J. Lu1

Department of Mathematics and Computer Science Emory University Atlanta, GA 30332. U.S.A.

Abstract

Data integration systems provide a uniform query interface (UQI) to multiple, autonomous data sources [4]. This paper presents the universal data model (UDM) that captures the semantically salient aspects of relational, entity-relationship, and XML data models. As a consequence, UDM — including its accompanying query language — provides a simple and elegant UQI for integrating data represented in some of the most widely adopted data models.

Keywords: Data Model, Data Integration, Query Languages

1 Introduction and Motivation

Data integration systems provide a uniform query interface (UQI) to multiple, autonomous data sources [4]. A UQI is achievable in different ways and at different levels of abstraction.

• Perhaps the most popular approach is to separate the user of the integrated system from the details of the data sources at the schema level, by specifying a mediated schema. A user needs to understand the structure and semantics of the mediated schema in order to pose meaningful queries. The key steps in this approach involve setting up the mediated schema (either manually or automatically [13]) and specifying its relationships to the local schemas (in either GAV or LAV fashion).

1 Email: jlu@mathcs.emory.edu

1571-0661/$ - see front matter © 2006 Elsevier B.V. All rights reserved. doi:10.1016/j.entcs.2005.11.031

• A more dynamic approach is to extend query languages with powerful constructs to facilitate integration without the explicit creation of the mediated schema. Systems such as Hilog [2], SchemaSQL [7], and IDL [6] are examples of this approach. The language serves as the UQI, and it allows users to write queries with only partial knowledge of the "implied" mediated schema.

In this paper, we consider a third alternative: a universal data model (UDM) as the UQI. Introduced in [8], the UDM unifies relational (RDB), entity-relationship (ERD), and XML data by capturing the most semantically salient aspects of these data models while abstracting out structural — often artificial — variations. A database in the UDM is a collection of units related by contexts. Rough equivalents to these notions in RDB and XML are, respectively, tuples related by relations, and sub-elements related by elements. However, unlike RDB in which tuples are partitioned into relations, units in a Universal Database (UDB) naturally belong to multiple contexts. Moreover, units related by the same context may have very different arities. The query language of UDM is called the Context-Based Query Language (CBQL), and it facilitates structured queries (in the sense of SQL and XPath), information retrieval (in the sense of keyword searches), and combinations thereof.

In this paper, we apply UDM to address the issue of semantic heterogeneity in data integration. Specifically, we show that an integration of several database instances amounts to viewing these data sources through the UDM "looking glass". No wrapping, cleansing, trimming, stretching, or other transformations of the data are necessary. Furthermore, no mediated schema, implicit or explicit, needs to be created to query the data sources. The idea is depicted in the following diagrams.

[Figure: the Universal Database as an abstraction over Relational Data, XML Data, and Object Data, shown in two panels: "An Isometric View" (left) and "A Bird's-eye View" (right).]

The left illustration emphasizes the fact that UDM is an abstraction that allows one to see RDB, XML, and other forms of data at a higher level.

The right illustration depicts that through the UDM abstraction, RDB tuples, XML trees, and OODB objects appear as units (represented as puzzle pieces) that, while of different shapes and sizes, can nevertheless interact with one another. Users may pose queries in CBQL, but SQL and XPath queries written for local data sources also have precise interpretations over the entire UDB. For handling ontological variations (e.g., name and interpretation discrepancies, data/metadata conflicts), CBQL relies on the information retrieval paradigm, which may extract irrelevant answers to queries, to avoid the need to preprocess metadata.

This paper is a continuation of the work described in [8], in which we presented the foundation of UDM and the query language CBQL, and demonstrated the use of CBQL for both structured and information-retrieval queries. Here, our emphasis is on data integration, and several key definitions of UDM are generalized in Section 2. An algebra for UDB (abbreviated UA) is presented in Section 3, along with numerous examples to illustrate simultaneous querying over relational and XML data. Section 5 analyzes translations of relational algebraic and XPath queries to the UA.

2 The Universal Data Model

A basic tenet of most data models is that data exist in contexts and in relationships to one another. In relational databases, contexts are given through relation and attribute names, and relationships are expressed by grouping data into tuples and linking them via foreign keys. In XML, contexts are given through element and attribute names, and relationships are expressed by organizing data into trees and subtrees, and through id references.

Contexts provide the means through which information can be queried, but traditional data management systems associate, in addition, specific roles to contexts. In relational databases, the roles of attributes and relations are reflected in the basic SQL query: select <attributes> from <relations> .... One must be aware of which contexts have been assigned the role of attribute, and which the role of relation in order to write a correct query. A relation name in the select clause, for instance, is not allowed. Similarly, in XPath queries, notions of root, parent, descendent, and other axes reflect the roles that are, either absolutely or relatively, associated with elements and attributes.

In practice, roles impose unnecessary restrictions and have hindered the development of query languages that are transparent to the logical structures of the data [8]. Early work to address this issue in the relational setting involves the notion of a universal relation [9, 10]. Here we broaden this conception to heterogeneous data models.

We assume disjoint countably infinite sets V and S called values and context identifiers, respectively. Intuitively, values are the data in the database including strings, integers, booleans, etc., and context identifiers are the building blocks for contexts.

A context is a subset of S. Each context $c$ has an associated subset, $Dom(c)$, of V, called the domain of $c$. A unit is a function $u$ that maps a finite number of contexts to non-empty subsets of V while satisfying the condition $u(c) \subseteq Dom(c)$, for every context $c$.² If $u(c) \neq \emptyset$, we say that $c$ is a well-defined context for $u$, and denote it $u(c)\downarrow$. A unit $u$ is well-defined if $u(c)\downarrow$ for some context $c$.

A partial ordering $\preceq$ exists on a set of units: unit $u_1$ is a subunit of unit $u_2$, written $u_1 \preceq u_2$, iff for each context $c$ such that $u_1(c)\downarrow$, there is a context $c' \supseteq c$ such that $u_2(c') = u_1(c)$.³

We adopt the notational convention of relational databases for representing units. Suppose $c_1, \ldots, c_m$ are all the well-defined contexts for $u$. Then $u$ can be written $\langle c_1 : v_1, \ldots, c_m : v_m \rangle$ where $u(c_i) = v_i$ for $1 \le i \le m$. Each pair $c_i : v_i$ is a component of $u$.

Definition 2.1 A universal database (UDB) is a quadruple (V, S, Dom, U) where U is a finite collection of units.
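The definitions of units and of the subunit ordering can be made concrete with a small sketch. The representation below (a Python dictionary from contexts to value sets) and the helper names `make_unit`, `is_well_defined`, and `is_subunit` are assumptions made for illustration only; the domain function Dom is not modeled.

```python
from typing import Dict, FrozenSet, Iterable, Tuple

Context = FrozenSet[str]               # a set of context identifiers
Unit = Dict[Context, FrozenSet[str]]   # only well-defined contexts are stored


def make_unit(*components: Tuple[str, Iterable[str]]) -> Unit:
    """Build a unit from (context, values) pairs.

    A context is written as a space-separated string of identifiers,
    so "a n" stands for the context {a, n} (hypothetical convention).
    """
    unit: Unit = {}
    for context, values in components:
        if isinstance(values, str):    # singleton set written without braces
            values = [values]
        unit[frozenset(context.split())] = frozenset(values)
    return unit


def is_well_defined(u: Unit) -> bool:
    """A unit is well-defined if some context maps to a non-empty set."""
    return any(values for values in u.values())


def is_subunit(u1: Unit, u2: Unit) -> bool:
    """u1 is a subunit of u2 if each component c : v of u1 reappears
    in u2 under some context c' that is a superset of c."""
    return all(
        any(c <= c2 and v == v2 for c2, v2 in u2.items())
        for c, v in u1.items()
    )


tom = make_unit(("a n", "Tom"), ("a s", "4321"))
partial = make_unit(("a", "Tom"))
```

Here `partial` is a subunit of `tom` because its single context {a} is contained in {a, n} and both carry the value Tom; the reverse does not hold.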

Example 2.2 (Modeling RDB) Consider the relational database DB1, which contains the following two relations.

Author

    Name    SSN
    Diana   1234
    Mary    1111
    Tom     4321

Contract

    Publisher      Book               Author
    Prentice       Diana              1111
    Addison        XML                1111
    St. Martins    Little Children    4321

We can model DB1 as the UDB (V, S, Dom, U) where

(i) V is the set of all data that can appear in the two relations (e.g., Diana, 1234, Prentice, XML, etc.);

(ii) S = {Contract, Name, SSN, Publisher, Book, Author} (we abbreviate symbols in the set as c, n, s, p, b, and a, respectively);

(iii) Dom maps contexts to non-empty sets of values. For example, Dom({a, n}) contains {Diana, Mary, Tom}.

2 Singleton sets are often written without braces.

3 Note that assuming sets of values are incomparable, this ordering is essentially the approximation ordering adopted in the study of denotational semantics [14].

(iv) The well-defined units of the UDB are as follows.

<{a,n}:Diana,{a,s}:1234>
<{a,n}:Mary,{a,s}:1111>
<{a,n}:Tom,{a,s}:4321>
<{c,a}:1111,{c,p}:Prentice,{c,b}:Diana>
<{c,a}:1111,{c,p}:Addison,{c,b}:XML>
<{c,a}:4321,{c,p}:St. Martins,{c,b}:Little Children>

The mapping to UDB from any relational database is now straightforward. Suppose D is a relational database and $t$ is a tuple of a relation $r \in D$ over the schema $\{a_1, \ldots, a_m\}$. We denote by $\phi(t)$ the unit $\langle \{r, a_1\} : t(a_1), \ldots, \{r, a_m\} : t(a_m) \rangle$.

Definition 2.3 Given a relational database D, $\phi(D) = (V, S, Dom, U)$ is a UDB that satisfies the following.

(i) V is the union of the domains that appear in D.

(ii) S is the set of all relation and attribute names in D.

(iii) For each context {r, a} where r is the name of a relation in D and a an attribute in the schema of r, Dom({r, a}) equals the domain of r.a.

(iv) $U = \bigcup_{r \in D} \{\phi(t) \mid t \in r\}$.
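As a rough illustration of Definition 2.3, the sketch below maps the database DB1 of Example 2.2 to its units. The dictionary-based encoding of relations and the function names `phi_tuple` and `phi` are assumptions of this sketch, and only the unit collection U is computed (V, S, and Dom are left implicit).

```python
def phi_tuple(relation, row):
    """Map one tuple of `relation` to a unit: each attribute a with value v
    becomes the component {relation, a} : v."""
    return {
        frozenset({relation, attr}): frozenset({value})
        for attr, value in row.items()
    }


def phi(database):
    """Collect the units of every tuple of every relation in the database."""
    return [
        phi_tuple(name, row)
        for name, rows in database.items()
        for row in rows
    ]


# DB1 from Example 2.2, encoded as relation name -> list of rows.
DB1 = {
    "Author": [
        {"Name": "Diana", "SSN": "1234"},
        {"Name": "Mary", "SSN": "1111"},
        {"Name": "Tom", "SSN": "4321"},
    ],
    "Contract": [
        {"Publisher": "Prentice", "Book": "Diana", "Author": "1111"},
        {"Publisher": "Addison", "Book": "XML", "Author": "1111"},
        {"Publisher": "St. Martins", "Book": "Little Children", "Author": "4321"},
    ],
}

units = phi(DB1)
```

Each of the six tuples yields one unit, matching the six units listed in Example 2.2 once the identifiers are abbreviated.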

The advantage of the UDM, however, goes beyond modeling "traditional" data. Units are not required to possess regular structures: they may have different shapes and sizes while belonging to the same context, as the following simple example illustrates.

<{emp,name}:John, {assigned,dept}:electronics>
<{emp,name}:Joe, {assigned,dept}:{toy,appliance}>
<{dept,name}:appliance, {phone}:4321, {floor}:{4,5}>
<{dept,name}:electronics, {phone}:1234>
<{emp,name}:Jim, {manages,dept}:electronics, {phone}:1111>

One may view the first two units and the last one as data related by the context employee, view all the units as data related by the context dept, and view the last three units as data related by the context phone. Indeed, this flexibility gives us a way of modeling entity-relationship diagrams more intuitively and succinctly than the typical representation in relational databases.
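The three views just described can be checked mechanically. The following sketch encodes the five units above and lists, for a given context identifier, the positions of the units that mention it; the helper names `unit` and `related_by` are hypothetical.

```python
def unit(*components):
    """Build a unit from (context, values) pairs; a context is written as a
    space-separated string of identifiers (hypothetical convention)."""
    out = {}
    for context, values in components:
        values = [values] if isinstance(values, str) else values
        out[frozenset(context.split())] = frozenset(values)
    return out


units = [
    unit(("emp name", "John"), ("assigned dept", "electronics")),
    unit(("emp name", "Joe"), ("assigned dept", ["toy", "appliance"])),
    unit(("dept name", "appliance"), ("phone", "4321"), ("floor", ["4", "5"])),
    unit(("dept name", "electronics"), ("phone", "1234")),
    unit(("emp name", "Jim"), ("manages dept", "electronics"), ("phone", "1111")),
]


def related_by(identifier, units):
    """1-based positions of the units with a context mentioning `identifier`."""
    return [
        i for i, u in enumerate(units, start=1)
        if any(identifier in context for context in u)
    ]


by_emp = related_by("emp", units)
by_dept = related_by("dept", units)
by_phone = related_by("phone", units)
```

The units differ in arity and in which contexts they define, yet a single identifier such as dept still relates all of them.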

Example 2.4 (Modeling ERD) While ERDs are most widely adopted for designing relational databases, the graphical notations themselves form a logical data model that need not be tied to any particular database model.

Consider the simple example shown in the figure below. The diagram depicts entities Employee (with attributes name, birthdate, and social security number), Department (with attributes name and address), and the many-to-many "Works for" relationship between Employee and Department.

We can model the diagram as the UDB where S = {W, E, D, n, s, b, a}, and the "employee" and "department" entities are characterized by units for which each context in the sets {{E,n}, {E,s}, {E,b}} and {{D,n}, {D,a}}, respectively, is well-defined. As the relationship "Works for" is a many-to-many relationship, a typical representation in a relational database requires a separate relation. In the UDM, we may capture "Works for" succinctly by extending each employee unit u with the assignment of all departments for which u works as the value for the context {E,W}. This representation is conceptually similar to how relationships are modeled in the Object Definition Language. An example is

<{E,n}:John,{E,s}:1111,{E,b}:1-1-2005,{E,W}:{HR,VP}>

It captures the idea that John works for both the HR and VP departments. A component for each "department" unit that represents the inverse relationship can be similarly specified. For this paper, we will restrict attention to UDBs that arise as the result of RDB and XML sources.

A particularly simple UDB model of XML is to create one unit for each document. First, given an XML tree $T = \langle V, E \rangle$ (corresponding to some document) and a vertex $v \in V$, let $path(v)$ denote the set of vertices along the path from the root of $T$ to $v$, excluding $v$. The label of $v$ (i.e., an element or attribute name if $v$ is an internal node, or a data value otherwise) is denoted $lab(v)$, and the context associated with $v$, $con(v)$, is the set $\bigcup_{v' \in path(v)} lab(v')$.

Definition 2.5 Given an XML tree $T$, define $\psi(T) = (V, S, Dom, U)$ to be the UDB that satisfies the following. V is the set of labels associated with the leaves of $T$, S is the set of labels associated with non-leaf nodes of $T$, and the only unit in U maps, for each leaf node $l$, $con(l)$ to $lab(l)$, while all other contexts map to $\emptyset$.

Example 2.6 (Modeling XML) Suppose V includes, among other values, "Election", "Fiction", "Tom", "Perrota", and "1997", and S = {b, a, f, l, t, d, y, m, @g, @i, @a} represents the elements book, author, firstname, lastname, title, date, year, and month, and the attributes @genre, @id, and @authorid, respectively. Then the units

<{b,@g}:Fiction, {b,@a}:4321, {b,t}:Election, {b,d,y}:1997, {b,d,m}:July>
<{a,@i}:4321, {a,f}:Tom, {a,l}:Perrota>

model documents 1 and 2 of Figure 1, respectively.

Doc. 1

    <book genre="Fiction" authorid="4321">
      <title> Election </title>
      <date>
        <year> 1997 </year>
        <month> July </month>
      </date>
    </book>

Doc. 2

    <author id="4321">
      <firstname> Tom </firstname>
      <lastname> Perrota </lastname>
    </author>

Fig. 1. An XML Database
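The document-to-unit mapping of Definition 2.5 can be sketched with Python's standard `xml.etree.ElementTree` parser. The function name `psi` is an assumption, full element names are used instead of the one-letter abbreviations, and values that share a context are simply merged into one set, which is a simplification of this sketch.

```python
import xml.etree.ElementTree as ET


def psi(xml_text):
    """Map one XML document to a single unit (dict: context -> set of values)."""
    unit = {}

    def add(context, value):
        unit[context] = unit.get(context, frozenset()) | {value}

    def walk(elem, path):
        here = path | {elem.tag}            # labels from the root down to elem
        for name, value in elem.attrib.items():
            add(here | {"@" + name}, value)  # attribute value is a leaf
        text = (elem.text or "").strip()
        if text:                             # text leaf directly below elem
            add(here, text)
        for child in elem:
            walk(child, here)

    walk(ET.fromstring(xml_text.strip()), frozenset())
    return unit


DOC1 = """
<book genre="Fiction" authorid="4321">
  <title> Election </title>
  <date>
    <year> 1997 </year>
    <month> July </month>
  </date>
</book>
"""

DOC2 = """
<author id="4321">
  <firstname> Tom </firstname>
  <lastname> Perrota </lastname>
</author>
"""

doc1_unit = psi(DOC1)
doc2_unit = psi(DOC2)
```

Note that the nesting of year and month under date survives only as membership of the identifier date in their contexts, in line with the remark that hierarchical relationships are not recaptured.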

To accommodate multiple sub-elements of the same name, we introduce additional context identifiers. For instance, if author is a subelement of book and a book has multiple authors, we may identify groupings among the first and last names of authors via unique context identifiers associated with the <author> element. Note that the only purpose for these identifiers is to allow repeated elements to be grouped appropriately within units. They do not attempt to recapture structural details of the input XML document, and in particular, the mapping $\psi$ ignores hierarchical relationships among elements.⁴

To summarize the section, $\phi$ and $\psi$ enable RDB and XML databases to be viewed as sets of units with no distinguishable structural variations. This blurring of representation not only facilitates integration of independently designed data sources, but also promotes data design in which the most suitable data model can be applied to represent different parts of the database. In the next section, we show example queries that simultaneously compute over units from Examples 2.2 and 2.6.

4 We acknowledge that there are situations when structural relationships reflect important semantics in the data, and hence stronger notions of contexts (e.g., [11]) may be desirable. This is a topic of ongoing research.

3 UA: An Algebra for UDB

A SQL-like query language for UDM was proposed in [8]. We do not reintroduce the language syntax here but only present its algebraic foundation. In our examples, we assume U is the UDB formed by the union of all units in the relational database of Example 2.2 and the units in the XML database of Example 2.6.

Selection: The operation $\Lambda$ selects units based on conditions over context identifiers and values. First, we extend the syntax of ordinary boolean expressions. A term is either a context or a set of values.

Definition 3.1 A boolean expression is formed recursively:

(i) A context is a boolean expression.

(ii) If $\bullet \in \{=, <, >, \le, \ge, \ne, in\}$ and $b_1$ and $b_2$ are terms, then $b_1 \bullet b_2$ is a boolean expression.

(iii) If $b_1$ and $b_2$ are boolean expressions, then so are $b_1 \wedge b_2$, $b_1 \vee b_2$, and $\neg b_1$.

Definition 3.2 A unit $u$ satisfies the boolean expression $b$, written $u \models b$, if the following conditions hold.

(i) If $b$ is a context, then there exists a context $c \supseteq b$ such that $u(c)\downarrow$. Remark. We note that more complex tests will be useful in practice and may replace set inclusion as the condition for satisfaction. In particular, when querying across independently developed data sources, stemming context identifiers and using synonyms to test for satisfaction will be useful for handling semantic heterogeneities due to variations in names and interpretations.

(ii) If $b$ has the form $b_1 \bullet b_2$, then

• if $b_1$, $b_2$ are contexts, then $u(c_1) \bullet u(c_2)$ for some $c_1 \supseteq b_1$ and $c_2 \supseteq b_2$;

• if $b_1$ is a value and $b_2$ is a context, then $b_1 \bullet u(c_2)$ for some $c_2 \supseteq b_2$;

• if $b_1$ is a context and $b_2$ is a value, then $u(c_1) \bullet b_2$ for some $c_1 \supseteq b_1$;

• if $b_1$, $b_2$ are values, then $b_1 \bullet b_2$.

Remark: The operator in represents the containment relationship $\subseteq$.

(iii) If $b$ has the form $b_1 \wedge b_2$, then $u$ satisfies both $b_1$ and $b_2$.

(iv) If $b$ has the form $b_1 \vee b_2$, then $u$ satisfies either $b_1$ or $b_2$.

(v) If $b$ has the form $\neg b_1$, then $u$ does not satisfy $b_1$.

Proposition 3.3 Suppose $u \models b$ and $u \preceq u'$. Then $u' \models b$.

Definition 3.4 Suppose $b$ is a boolean expression, $c$ is a context, and U is a set of units. The query $\Lambda_b U$ is defined to be the set $\{u \in U \mid u \models b\}$.

Example 3.5 When writing queries in the UA, the query designer need not have complete knowledge of the well-defined contexts for each unit. The less knowledge one incorporates into the query, the less precise the returned result.

(i) Suppose we are interested in finding information about authors named Diana, and we know that {a,n} (from Example 2.2), {a,f}, and {a,l} are possible contexts in which author names may arise in U. We may write the query $\Lambda_{\{a,n\} = 'Diana' \vee \{a,f\} = 'Diana' \vee \{a,l\} = 'Diana'} U$, and the result is <{a,n}:Diana,{a,s}:1234>.

(ii) Suppose we are interested in finding information about Tom, but only know that Author is a valid context for author names. Then the query $\Lambda_{\{a\} = 'Tom'} U$ results in the units <{a,n}:Tom,{a,s}:4321> and <{a,@i}:4321, {a,f}:Tom, {a,l}:Perrota>.
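The two queries of Example 3.5 can be replayed with a minimal sketch of selection. Conditions are modeled as Python predicates over units; the combinator names `equals`, `either`, and `select` are assumptions of this sketch, and only the "context = value" and disjunction cases of Definition 3.2 are covered.

```python
def unit(*components):
    """Unit from (context, value) pairs; "a n" stands for the context {a, n}."""
    return {frozenset(c.split()): frozenset([v]) for c, v in components}


def ctx(text):
    return frozenset(text.split())


def equals(b, value):
    """Condition 'b = value': some context c that contains b has u(c) = value."""
    return lambda u: any(b <= c and u[c] == frozenset([value]) for c in u)


def either(*conditions):
    """Disjunction of conditions."""
    return lambda u: any(cond(u) for cond in conditions)


def select(condition, units):
    """Keep the units that satisfy the condition."""
    return [u for u in units if condition(u)]


# U: the units of Example 2.2 followed by the units of Example 2.6.
U = [
    unit(("a n", "Diana"), ("a s", "1234")),
    unit(("a n", "Mary"), ("a s", "1111")),
    unit(("a n", "Tom"), ("a s", "4321")),
    unit(("c a", "1111"), ("c p", "Prentice"), ("c b", "Diana")),
    unit(("c a", "1111"), ("c p", "Addison"), ("c b", "XML")),
    unit(("c a", "4321"), ("c p", "St. Martins"), ("c b", "Little Children")),
    unit(("b @g", "Fiction"), ("b @a", "4321"), ("b t", "Election"),
         ("b d y", "1997"), ("b d m", "July")),
    unit(("a @i", "4321"), ("a f", "Tom"), ("a l", "Perrota")),
]

diana = select(
    either(equals(ctx("a n"), "Diana"),
           equals(ctx("a f"), "Diana"),
           equals(ctx("a l"), "Diana")),
    U,
)
tom = select(equals(ctx("a"), "Tom"), U)
```

The second query names only the identifier a, so it matches both the relational author unit and the XML author unit, as stated in part (ii).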

As a special case of $\Lambda$, if $c$ is a context, we write U@c to denote $\Lambda_c U$. For the first example above, we may compute the same answer with the query

$\Lambda_{n = 'Diana'} U$

Suppose ans(Q) denotes the set of units computed from a query Q. The following is an immediate consequence of Proposition 3.3.

Corollary 3.6 Suppose $u, u' \in U$ and $u \preceq u'$. If $u \in ans(\Lambda_b U)$, then $u' \in ans(\Lambda_b U)$.

Projection: Given a unit u and a set of contexts C, we denote by (u|C) the unit that satisfies the following.

(i) If $c \supseteq c'$ for some $c' \in C$, then $(u|C)(c) = u(c)$.

(ii) $(u|C)(c'') = \emptyset$ for all other contexts $c''$.

Definition 3.7 Suppose $C$ is a set of contexts and U is a set of units. The query $\Pi_C U$ consists of the set $\{u' \mid u \in U \text{ and } u' = (u|C)\}$.

Example 3.8 The query $\Pi_s U@a$ finds all SSNs of authors in U (ignoring for now that @id is a synonym for SSN). The result is the set of units {<{a,s}:1234>, <{a,s}:1111>, <{a,s}:4321>}.
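A sketch of projection, applied to the query of Example 3.8, follows. The function names are hypothetical, only part of U is encoded, and units whose projection has no well-defined context are dropped, which is an assumption made here so that the output matches the result stated in the example.

```python
def unit(*components):
    """Unit from (context, value) pairs; "a s" stands for the context {a, s}."""
    return {frozenset(c.split()): frozenset([v]) for c, v in components}


def ctx(text):
    return frozenset(text.split())


def at(b, units):
    """U@b: units having some well-defined context that contains b."""
    return [u for u in units if any(b <= c for c in u)]


def project(contexts, units):
    """Keep, in each unit, the components whose context contains a member of
    `contexts`. Units left with no component are dropped (assumption)."""
    result = []
    for u in units:
        kept = {c: v for c, v in u.items()
                if any(c >= wanted for wanted in contexts)}
        if kept:
            result.append(kept)
    return result


# A fragment of U: the author units, one contract unit, the XML author unit.
U = [
    unit(("a n", "Diana"), ("a s", "1234")),
    unit(("a n", "Mary"), ("a s", "1111")),
    unit(("a n", "Tom"), ("a s", "4321")),
    unit(("c a", "1111"), ("c p", "Prentice"), ("c b", "Diana")),
    unit(("a @i", "4321"), ("a f", "Tom"), ("a l", "Perrota")),
]

authors = at(ctx("a"), U)
ssns = project([ctx("s")], authors)
```

All five units pass the U@a step, since the contract unit has the context {c, a}; only the three relational author units contribute a component under a context containing s.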

If $C$ is a singleton $\{c\}$, we abbreviate $\Pi_C U$ as U.c.

Renaming: Suppose $u$ is a unit and $c_1$, $c_2$ are contexts. Then $\xi_{c_1 \to c_2}(u)$ is the unit that satisfies the following.

(i) For each context $c$ such that $c_1 \subseteq c$ and $((c - c_1) \cup c_2)$ is not well-defined for $u$, $\xi_{c_1 \to c_2}(u)((c - c_1) \cup c_2) = u(c)$ and $\xi_{c_1 \to c_2}(u)(c) = \emptyset$.

(ii) For all other contexts $c'$, $\xi_{c_1 \to c_2}(u)(c') = u(c')$.

Definition 3.9 Suppose U is a set of units and $c_1$, $c_2$ are contexts. Then $\alpha_{c_1 \to c_2} U$ is the set $\{\xi_{c_1 \to c_2}(u) \mid u \in U\}$. In case $c_2 = \emptyset$, we abbreviate $\alpha_{c_1 \to \emptyset} U$ by U!c1.

Observe that renaming allows us to "switch", to remove, as well as to add subcontexts to existing well-defined contexts of units.
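The three uses of renaming mentioned above (switching, removing, and adding subcontexts) can be illustrated with a short sketch. The function name `rename` and the identifiers FullName and DB1 used as targets are hypothetical.

```python
def ctx(text):
    """A context written as a space-separated string of identifiers."""
    return frozenset(text.split())


def rename(u, c1, c2):
    """Replace the subcontext c1 by c2 in every context of u that contains c1,
    unless the resulting context is already well-defined for u."""
    out = dict(u)
    for c, values in u.items():
        target = (c - c1) | c2
        if c1 <= c and target not in u:
            del out[c]              # the old context becomes undefined
            out[target] = values    # the new context takes over its values
    return out


u = {
    ctx("Author Name"): frozenset({"Diana"}),
    ctx("Author SSN"): frozenset({"1234"}),
}

removed = rename(u, ctx("Author"), frozenset())       # remove a subcontext
switched = rename(u, ctx("Name"), ctx("FullName"))    # switch a subcontext
added = rename(u, frozenset(), ctx("DB1"))            # add a subcontext
```

Removing corresponds to the shorthand U!c1 applied to a single unit; adding uses the empty context as the source, which every context contains.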

We adopt the convention that each of the shorthands @, ., and ! has higher precedence than operators written using regular notations.

Join and Set Operations: Join and set theoretic operators can be similarly defined for sets of units. The latter operators involve no special notation: given unit sets $U_1$, $U_2$, the expressions $U_1 \cup U_2$, $U_1 \cap U_2$, and $U_1 - U_2$ are all well-defined.

We say two units $u_1$ and $u_2$ are consistent if whenever $u_1(c)\downarrow$ and $u_2(c)\downarrow$ for some context $c$, then $u_1(c) = u_2(c)$. Given a unit $u$, define $graph(u) = \{(c, u(c)) \mid u(c)\downarrow\}$. The join of $U_1$ and $U_2$ on the boolean expression $b$, denoted $U_1 \circ_b U_2$, is the set of all units $u$ such that there exist $u_1 \in U_1$, $u_2 \in U_2$, $u_1$ and $u_2$ are consistent, $graph(u) = graph(u_1) \cup graph(u_2)$, and $u \models b$.
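A minimal sketch of the join follows, using the author and contract units of Example 2.2 and the condition {a,s} = {c,a}. The function names are hypothetical. Because the two input sets here have disjoint contexts, the sketch does not tag the operands with extra identifiers to tell copies of the same unit apart; a self-join would need that.

```python
def unit(*components):
    """Unit from (context, value) pairs; "a s" stands for the context {a, s}."""
    return {frozenset(c.split()): frozenset([v]) for c, v in components}


def ctx(text):
    return frozenset(text.split())


def consistent(u1, u2):
    """Two units are consistent if they agree on every shared context."""
    return all(u1[c] == u2[c] for c in u1.keys() & u2.keys())


def same_value(b1, b2):
    """Condition 'b1 = b2' between contexts: some c1 containing b1 and some
    c2 containing b2 carry equal values."""
    return lambda u: any(
        b1 <= c1 and b2 <= c2 and u[c1] == u[c2]
        for c1 in u for c2 in u
    )


def join(units1, units2, condition):
    """Merge each consistent pair and keep the merged units that satisfy
    the condition."""
    out = []
    for u1 in units1:
        for u2 in units2:
            if consistent(u1, u2):
                merged = {**u1, **u2}   # union of the two graphs
                if condition(merged):
                    out.append(merged)
    return out


authors = [
    unit(("a n", "Diana"), ("a s", "1234")),
    unit(("a n", "Mary"), ("a s", "1111")),
    unit(("a n", "Tom"), ("a s", "4321")),
]
contracts = [
    unit(("c a", "1111"), ("c p", "Prentice"), ("c b", "Diana")),
    unit(("c a", "1111"), ("c p", "Addison"), ("c b", "XML")),
    unit(("c a", "4321"), ("c p", "St. Martins"), ("c b", "Little Children")),
]

books = join(authors, contracts, same_value(ctx("a s"), ctx("c a")))
pairs = [(next(iter(u[ctx("a n")])), next(iter(u[ctx("c p")]))) for u in books]
```

With the fully specified contexts, each contract is paired only with the author whose SSN it references.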

Example 3.10 To find books and their authors in the UDB U, we join units over the condition {a,s} = {c,a} (for units from Example 2.2) and on the condition {b,@a} = {a,@i} (for units from Example 2.6).

(1) $(U) \circ_{\{u_1,a,s\} = \{u_2,c,a\} \vee \{u_1,b,@a\} = \{u_2,a,@i\}} (U)$

Note that the context identifiers $u_1$ and $u_2$ give us a way to "differentiate" between copies of the same unit.

Suppose we know only the relation names but none of the attribute names of Example 2.2, and only the root element names of Example 2.6. Then an attempt at the same query appears as follows.

Observe that not only do we get all the units computed by Query (1), but we also get cross matching between units from the relational database and units from the XML database. The complete set of results is shown in Figure 2 (we omit the identifiers $u_1$ and $u_2$ in the illustration). The example illustrates how data that originate from relational sources and XML sources can be easily joined to produce meaningful results (e.g., units 6 and 7). For each unit, the solid underline indicates the part that comes from Example 2.2, while the dotted underline indicates the part that comes from Example 2.6. Note that unit 1 is computed erroneously due to the inadvertent match between the name of the author and the book title. This illustrates the information-retrieval paradigm that UA adopts when insufficient context is provided. For a similar reason, units 8 and 9 are also incorrect.

1. <{a,n}:Diana,{a,s}:1234,{c,a}:1111,{c,p}:Prentice,{c,b}:Diana>

2. <{a,n}:Mary,{a,s}:1111,{c,a}:1111,{c,p}:Prentice,{c,b}:Diana>

3. <{a,n}:Mary,{a,s}:1111,{c,a}:1111,{c,p}:Addison,{c,b}:XML>

4. <{a,n}:Tom,{a,s}:4321,{c,a}:4321,{c,p}:St. Martins,{c,b}:Little Children>

5. <{a,@i}:4321,{a,f}:Tom,{a,l}:Perrota,{b,@g}:Fiction,{b,@a}:4321,{b,t}:Election,{b,d,y}:1997,{b,d,m}:July>

6. <{a,n}:Tom,{a,s}:4321,{b,@g}:Fiction,{b,@a}:4321,{b,t}:Election,{b,d,y}:1997,{b,d,m}:July>

7. <{a,@i}:4321,{a,f}:Tom,{a,l}:Perrota,{c,a}:4321,{c,p}:St. Martins,{c,b}:Little Children>

8. <{b,@g}:Fiction,{b,@a}:4321,{b,t}:Election,{b,d,y}:1997,{b,d,m}:July,{c,a}:4321,{c,p}:St. Martins,{c,b}:Little Children>

9. <{c,a}:1111,{c,p}:Prentice,{c,b}:Diana,{c,a}:1111,{c,p}:Addison,{c,b}:XML>

Fig. 2. Output from Query (2)

4 Related Work

The attention of most data integration techniques is on integrating local schemas into a mediated schema. In particular, numerous schema matching techniques have been developed to address various forms of schema heterogeneity (see [3] for a sample of recent approaches). In many cases, heuristics and machine learning techniques are employed to determine similarity among schemas.

The UDM avoids the issue of schema integration altogether. In this respect, our work closely resembles the work on Universal Contextual Queries (UCQ) of Norrie and Kerr [12]. The UDM and the UCQ differ in two ways. First, the UDM applies to a variety of data models, while the UCQ has been discussed only with respect to the relational data model. Second, no formalization of the data model behind the UCQ exists. As a consequence, no independent semantics for the UCQ exist; the meaning of each UCQ query depends operationally on its heuristic translation to SQL (or some other query language).

Another line of research relevant to the UDM is the class of higher-order language extensions that have been proposed for managing structural heterogeneity in local schemas (e.g., SchemaSQL [7] and FISQL [15]). Compared to previous approaches which typically limit their scope to multi-relational databases, our work is more general simply because the data model subsumes several data models including relational. Secondly, within the relational data model, restructuring among database, relation, and attribute names in the UDM is a non-issue.⁵ More importantly, since role designations of metadata are ignored in the UDM, UA queries that manipulate data among sources with such schematic discrepancies tend to be more generic. For instance, consider a simple example adapted from [7]: assume a database DB with the schemas below, which describe the minimum salaries of faculty of each rank in the CS and Math departments.

5 Adding database names simply introduces one more context identifier to each component of every unit.

DB::CS(rank,salInfo), DB::Math(rank,salInfo)

To list the minimum salaries of full professors across the database in SchemaSQL:

    select R.salInfo
    from DB -> Rel, DB::Rel R
    where R.rank = "Full"

Suppose the schema is redesigned as DB::salInfo(rank,Math,CS), where salary information for each department is listed under the attributes Math and CS. Then the above query must be rewritten to reflect the change. In UA, the simple query $\Pi_{salInfo}(\Lambda_{rank = "Full"}(U))$ will return appropriate answers, albeit in different forms, for both the original and the modified schemas.
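To see how one UA query can serve both schema designs, consider the following sketch. The salary figures are made-up illustrative values, not data from the paper, and the helper names are hypothetical. Selection and projection follow the set-inclusion reading of Definitions 3.2 and 3.7.

```python
def unit(*components):
    """Unit from (context, value) pairs; "CS rank" stands for {CS, rank}."""
    return {frozenset(c.split()): frozenset([v]) for c, v in components}


def ctx(text):
    return frozenset(text.split())


def select_equals(b, value, units):
    """Units in which some context containing b carries exactly `value`."""
    return [u for u in units
            if any(b <= c and u[c] == frozenset([value]) for c in u)]


def project(b, units):
    """Keep the components whose context contains b; drop empty results."""
    kept = [{c: v for c, v in u.items() if b <= c} for u in units]
    return [k for k in kept if k]


def query(units):
    """The same UA query for either schema: project salInfo, select rank Full."""
    return project(ctx("salInfo"), select_equals(ctx("rank"), "Full", units))


def values(units):
    return {v for u in units for vs in u.values() for v in vs}


# Original design: DB::CS(rank, salInfo), DB::Math(rank, salInfo).
original = [
    unit(("CS rank", "Full"), ("CS salInfo", "90")),      # made-up salaries
    unit(("CS rank", "Assoc"), ("CS salInfo", "70")),
    unit(("Math rank", "Full"), ("Math salInfo", "80")),
]
# Redesign: DB::salInfo(rank, Math, CS).
redesigned = [
    unit(("salInfo rank", "Full"), ("salInfo Math", "80"), ("salInfo CS", "90")),
    unit(("salInfo rank", "Assoc"), ("salInfo Math", "65"), ("salInfo CS", "70")),
]

from_original = query(original)
from_redesigned = query(redesigned)
```

Both runs surface the salaries 80 and 90. The forms differ: the original design yields two one-component units, while the redesign yields a single unit that also carries the rank, because every context of that unit contains the identifier salInfo.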

To handle semantic discrepancies, we have noted in Definition 3.2 that a more general notion of satisfaction may be useful. In this respect, we may regard the UDM as a data model framework parameterized by a domain-appropriate definition of satisfaction, and various retrieval models studied in the information retrieval community are interesting candidates for the definition. Somewhat intriguing in this consideration is that retrieval is based on similarities over the metadata (i.e., contexts) of units, not data. This is a topic of current study.

Lastly, we note that an important motivation for exploring the UDM is to facilitate logical data independence — the same motivation that prompted the study of the Universal Relation (UR) [10].6 Specifically, the UR aims to relieve "the user of the need for logical navigation among relations" by assuming that principle connections exist among entities. In contrast, we approach this issue by allowing partial contexts to be specified in joining conditions. While the price we pay for this flexibility is that queries may result in incorrect answers, we regain the important ability to choose other, often equally reasonable, connections among entities. Another issue that the UDM addresses is the complementary question of how one can relieve the user of the need to navigate among relation and attribute designations.7

6 The UR has been the inspiration for other work on data integration (e.g., [16, 12])

7 The UR addresses the issue by imposing the sometimes troublesome unique role assumption [1].


5 Interfacing Relational and XML Databases

In this section, we investigate more precisely translations of relational algebraic (RA) and XPath queries into UA queries. Recall that ans(Q) denotes the computed answers for the query Q.

5.1 The Relational Algebra

In Definition 2.3, $\phi(D)$ defines the UDB that corresponds to the RDB D. We extend $\phi$ to results of RA-queries as follows. Suppose Q is an RA-query over D with schema $\{a_1, \ldots, a_m\}$. Then for each tuple $t \in ans(Q)$, $\phi(t)$ is the unit $\langle \{a_1\} : t(a_1), \ldots, \{a_m\} : t(a_m) \rangle$.⁸ Moreover, $\phi(ans(Q)) = \{\phi(t) \mid t \in ans(Q)\}$.

Suppose Q is an RA-query over a relational database and ans(Q) is its computed result. Even under the most elaborate translation scheme from Q to a UA-query Q', there is no guarantee that the results of the two queries will coincide (i.e., that $ans(Q') = \phi(ans(Q))$). Consider for example the database D that contains the two relations $r = \{\langle a : v_1 \rangle\}$ and $a = \{\langle r : v_2 \rangle\}$, and the RA-query $Q = \pi_a(r)$. The UDB $\phi(D)$ is the set $\{\langle \{r, a\} : v_1 \rangle, \langle \{r, a\} : v_2 \rangle\}$. As the only well-defined contexts of the two units are indistinguishable, no UA query can produce the same result as $\phi(ans(Q))$ without explicitly incorporating the values $v_1$ and $v_2$ into the selection condition.
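The counterexample can be reproduced in a few lines. The sketch below applies a tuple-to-unit mapping in the style of Definition 2.3 (under the hypothetical name `phi`) to the two one-attribute relations r and a, and compares the contexts of the resulting units.

```python
def phi(database):
    """Map each tuple of each relation to a unit with contexts {relation, attr}."""
    return [
        {frozenset({name, attr}): frozenset({value}) for attr, value in row.items()}
        for name, rows in database.items()
        for row in rows
    ]


# Relation r has the single attribute a; relation a has the single attribute r.
D = {
    "r": [{"a": "v1"}],
    "a": [{"r": "v2"}],
}

units = phi(D)
contexts = [set(u) for u in units]
```

Both units end up with the single context {r, a}, so a condition phrased only in terms of context identifiers cannot separate the r-tuple from the a-tuple; only the values v1 and v2 differ.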

Ignoring the pathological cases of relations of arity one or relations whose names match one of their attributes, however, we can obtain more precise translations.

Assumption 5.1 For the remainder of the section, we assume relational databases that satisfy the following.

(i) Each relation has arity at least two.

(ii) No relation has a name that equals one of its attribute names.

We overload $\phi$ to map RA- to UA-queries.

Definition 5.2 Suppose Q is a relational algebraic query of some relational database D. We denote by $\phi(Q)$ the UA-query obtained from Q as follows.⁹

(i) Replace the operators $\sigma$, $\pi$, $\times$, $\rho$ by $\Lambda$, $\Pi$, $\circ$, and $\alpha$, respectively.

(ii) Replace each base relation $r$ in Q by $\phi(D)!r$.

Theorem 5.3 Suppose D is a relational database and Q is an RA-query over D. Then $\phi(ans(Q)) \subseteq ans(\phi(Q))$ where $\phi(Q)$ is a UA query over $\phi(D)$.

8 Since the result of Q is an anonymous relation [5].

9 We make the simplifying assumption that selection conditions are of the form $b_1 = b_2$ where $b_1$, $b_2$ are attributes, and that joins are expressed as a combination of cross product followed by selection [5].

Proof. Suppose that the schema of ans(Q) is $\{a_1, \ldots, a_m\}$ and $u$ is a unit in $\phi(ans(Q))$. There exists a tuple $t \in ans(Q)$ such that $u = \phi(t)$. We show by induction on the structure of Q that $u \in ans(\phi(Q))$.

(Base Case) Since Q corresponds to a base relation $r \in D$, $t$ is an r-tuple. By Definition 2.3, $\phi(t) = \langle \{r, a_1\} : v_1, \ldots, \{r, a_m\} : v_m \rangle \in \phi(D)$. As $\phi(Q) = \phi(D)!r$, it follows that $u \in ans(\phi(Q))$.

(Selection) The query Q has the form $\sigma_{a=b} Q_1$ for some RA-query $Q_1$ and $t \in ans(Q_1)$. By the induction hypothesis, $\phi(t) \in ans(\phi(Q_1))$. As $t(a) = t(b)$, $u \models (a = b)$. It follows by the definition of $\Lambda$ that $u \in ans(\phi(Q))$.

(Projection) The query Q has the form $\pi_{a_1, \ldots, a_m} Q_1$ for some RA-query $Q_1$ and $t$ is a subtuple of a tuple $t_1 \in ans(Q_1)$. By the induction hypothesis, $\phi(t_1) \in ans(\phi(Q_1))$. It follows by the definition of $\Pi$ that $(\phi(t_1) | \{\{a_1\}, \ldots, \{a_m\}\}) = u \in ans(\phi(Q))$.

(Product) The query Q has the form $Q_1 \times Q_2$ for some RA-queries $Q_1$ and $Q_2$ with disjoint schemas. The tuple $t$ has the form $\langle b_1 : v_1, \ldots, b_i : v_i, c_1 : w_1, \ldots, c_j : w_j \rangle$, $i + j = m$, where $\langle b_1 : v_1, \ldots, b_i : v_i \rangle \in ans(Q_1)$ and $\langle c_1 : w_1, \ldots, c_j : w_j \rangle \in ans(Q_2)$. By the induction hypothesis, $\langle \{b_1\} : v_1, \ldots, \{b_i\} : v_i \rangle \in ans(\phi(Q_1))$ and $\langle \{c_1\} : w_1, \ldots, \{c_j\} : w_j \rangle \in ans(\phi(Q_2))$. Clearly, it follows that $u$, which is the "concatenation" of these two units, is an element of $ans(\phi(Q_1) \circ \phi(Q_2)) = ans(\phi(Q))$.

(Renaming) The query Q has the form $\rho_{a \to a_i}(Q_1)$, $1 \le i \le m$, for some RA-query $Q_1$. There exists a tuple $t_1 \in ans(Q_1)$ such that $t_1$ is the tuple

$\langle a_1 : t(a_1), \ldots, a_{i-1} : t(a_{i-1}), a : t(a_i), a_{i+1} : t(a_{i+1}), \ldots, a_m : t(a_m) \rangle$.

By the induction hypothesis, $\phi(t_1) \in ans(\phi(Q_1))$. As the only well-defined context for $\phi(t_1)$ that contains {a} is {a} itself, $\xi_{a \to a_i}(\phi(t_1)) = \phi(t)$ and is an element of $ans(\alpha_{a \to a_i}(\phi(Q_1)))$, or $ans(\phi(Q))$. □

The reverse of the theorem does not hold, and the imprecision arises from the possibility that a base relation name may be used as an attribute name in a different relation (e.g., Author in Example 2.2). To strengthen the translation, therefore, requires some modifications to Definition 5.2.

Definition 5.4 Suppose $r$ is a relation in a relational database over the schema $A$. The set $\{\{r\} \cup \{a\} \mid a \in A\}$ is called the dictionary of $r$.¹⁰

By the fact that names of relations are unique in a relational database and Assumption 5.1, we have the following.

10 This is in reference to the notion of data dictionaries in relational database systems.

Proposition 5.5 Suppose D is a relational database and d(r1) and d(r2) are the dictionaries of relations r1 and r2 such that d(r1) = d(r2). Then r1 = r2.

We now strengthen Step (ii) of Definition 5.2 as follows.

(ii) Replace each base relation $r$ in Q by $(\Lambda_{d(r)} \phi(D))!r$.

We denote this stronger form of query translation by $\phi^+$.

Theorem 5.6 Suppose Q is a relational algebraic query over the relational database D. Then $ans(\phi^+(Q))$ over $\phi(D)$ is contained in $\phi(ans(Q))$.

Proof. As before, let $\{a_1, \ldots, a_m\}$ be the schema of ans(Q). The proof proceeds by induction on the structure of $\phi^+(Q)$. We show only the base step.

The query $\phi^+(Q)$ has the form $(\Lambda_{d(r)} \phi(D))!r$ where $r$ is a base relation. We have Q = r. Consider a unit $u \in ans(\phi^+(Q))$. There is a unit $u_1 \in ans(\Lambda_{d(r)} \phi(D))$ such that $u = \xi_{r \to \emptyset}(u_1)$. By the construction of $\phi(D)$ and Proposition 5.5, $u_1 = \phi(t_1)$ for some r-tuple $t_1$. As all well-defined contexts for $u_1$ contain the context identifier r, it follows that $u \in \phi(ans(Q))$. □
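The effect of the dictionary in the base step can be observed on DB1 of Example 2.2 with the base relation Author. In the sketch below, all function names are hypothetical, and selecting on a dictionary is read as requiring every context of the dictionary to be satisfied, which is an assumption of this sketch.

```python
def phi(database):
    """Units of a relational database: contexts are {relation, attribute}."""
    return [
        {frozenset({name, attr}): frozenset({value}) for attr, value in row.items()}
        for name, rows in database.items()
        for row in rows
    ]


def drop_identifier(units, r):
    """U!r: remove identifier r from each context, unless the shortened
    context is already well-defined for the unit."""
    out = []
    for u in units:
        renamed = dict(u)
        for c, values in u.items():
            target = c - {r}
            if r in c and target not in u:
                del renamed[c]
                renamed[target] = values
        out.append(renamed)
    return out


def dictionary(r, attributes):
    """The dictionary of r: one context {r, a} per attribute a."""
    return [frozenset({r, a}) for a in attributes]


def select_all(contexts, units):
    """Units that satisfy every context in `contexts` (assumed reading)."""
    return [u for u in units
            if all(any(b <= c for c in u) for b in contexts)]


DB1 = {
    "Author": [
        {"Name": "Diana", "SSN": "1234"},
        {"Name": "Mary", "SSN": "1111"},
        {"Name": "Tom", "SSN": "4321"},
    ],
    "Contract": [
        {"Publisher": "Prentice", "Book": "Diana", "Author": "1111"},
        {"Publisher": "Addison", "Book": "XML", "Author": "1111"},
        {"Publisher": "St. Martins", "Book": "Little Children", "Author": "4321"},
    ],
}

weak = drop_identifier(phi(DB1), "Author")
strong = drop_identifier(
    select_all(dictionary("Author", ["Name", "SSN"]), phi(DB1)), "Author"
)
```

The weaker translation returns all six units, including the Contract units, which are extraneous because Author is also an attribute name of Contract. The strengthened translation returns exactly the three units that correspond to the tuples of Author.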

5.2 XPath

A semantically exact translation from XPath queries to UA is less straightforward since there is a basic discrepancy between what XPath and UA can compute. The mapping ^ models an XML document as a single unit u. Hence any UA-query over u computes at most a single unit unless joins or unions are applied. On the other hand, an XPath query, which performs no union or join, can still produce multiple XML fragments from a single document.

Instead, a reasonable goal is to establish the following correspondence. Given an XPath query X, construct a UA-query ^(X) such that each XML fragment in the result of X corresponds to a subset of the components in the unit computed by ^(X). By the very nature of UA, the unit may contain information unrelated to the query. Hence the challenge in the construction is to compute the smallest possible subunit that satisfies the above condition. This is a subject of ongoing study.
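The discrepancy and the proposed correspondence can be illustrated with a small Python sketch. It relies on an encoding that is assumed here purely for illustration and is not the mapping defined in this paper: a document is flattened into one unit, represented as a set of (context, value) pairs, where the context of a text value is the set of tag names on its path. The sample document and the name `index_components` are likewise hypothetical. The XPath query is evaluated with the limited XPath support of the standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document.
DOC = (
    "<library>"
    "<book><title>Logic</title><author>Ann</author></book>"
    "<book><title>Data</title><author>Bob</author></book>"
    "</library>"
)


def index_components(elem, path, table):
    """Record, for every element, the (context, value) pairs found beneath it.

    Assumed encoding: the context of a text value is the set of tag names
    on the path from the root to the element holding that value.
    """
    context = path | {elem.tag}
    found = set()
    text = (elem.text or "").strip()
    if text:
        found.add((context, text))
    for child in elem:
        found |= index_components(child, context, table)
    table[elem] = found
    return found


root = ET.fromstring(DOC)
table = {}
# The whole document becomes a single unit.
unit = index_components(root, frozenset(), table)

# A single XPath query, with no union or join, yields several fragments.
fragments = root.findall("./book/title")

# Each fragment corresponds to a subset of the components of the unit.
covered = all(table[f] <= unit for f in fragments)
```

In this sample the query returns two fragments while the document is a single unit of four components. The two author components are unrelated to the query, which is the kind of excess a smallest-subunit construction would aim to remove.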

6 Conclusion and Future Work

We have presented a universal data model that unifies several of the most popular data models, and have shown its potential in addressing long-standing issues in data integration. The data model facilitates a simple and intuitive UQI over relational, XML, and ERD data sources. Aside from research issues that have been noted throughout the paper, there are a number of interesting topics for future work.

(i) Certain types of structural information may be semantically important. In order to capture such structure, a more general definition of context may be desirable. One possibility is to generalize the notion of contexts to sets of regular expressions over S. This would allow UDBs to closely mimic the hierarchical nature of data models such as object databases and XML, yet retain the simplicity and generality needed for modeling relational databases. On a related note, it will be important to incorporate into the UDM ways of expressing constraints among units.

(ii) In this paper, we have focused on modeling relational and XML data in the UDB. Formalizing connections to other data models (e.g., object and binary relational models) will be important for extending the impact of the UDM.

(iii) A simple UDB prototype has been implemented in Prolog. The design focuses on questions of how base units can be represented in a relational database, and how units that result from intermediate queries can be compactly represented — without materializing the entire unit. A more elaborate implementation effort is ongoing and will be reported.

References

[1] Tzy-Hey Chang and Edward Sciore. A universal relation data model with semantic abstractions. IEEE Transactions on Knowledge and Data Engineering, 4(1):23-33, 1992.

[2] Weidong Chen, Michael Kifer, and David Scott Warren. HILOG: A foundation for higher-order logic programming. Journal of Logic Programming, 15(3):187-230, 1993.

[3] AnHai Doan, Natalya Fridman Noy, and Alon Y. Halevy. Introduction to the special issue on semantic integration. SIGMOD Record, 33(4):11-13, 2004.

[4] Alon Y. Halevy. Answering queries using views: A survey. The VLDB Journal, 10(4):270-294, 2001.

[5] P. C. Kanellakis. Elements of relational database theory. In J. van Leeuwen, editor, Handbook of Theoretical Computer Science: Volume B: Formal Models and Semantics, pages 1073-1156. Elsevier, Amsterdam, 1990.

[6] Ravi Krishnamurthy, Witold Litwin, and William Kent. Interoperability of heterogeneous databases with schematic discrepancies. In RIDE-IMS, 1991.

[7] Laks V. S. Lakshmanan, Fereidoon Sadri, and Subbu N. Subramanian. SchemaSQL: An extension to SQL for multidatabase interoperability. ACM Transactions on Database Systems, 26(4):476-519, 2001.

[8] James J. Lu. Logical data independence reconsidered. In Proceedings of ISMIS. Springer-Verlag, 2005.

[9] David Maier, David Rozenshtein, Sharon C. Salveter, Jacob Stein, and David Scott Warren. Toward logical data independence: A relational query language without relations. In Proceedings of SIGMOD. ACM Press, 1982.

[10] David Maier, Jeffrey D. Ullman, and Moshe Y. Vardi. On the foundations of the universal relation model. ACM Transactions on Database Systems, 9(2):283-308, 1984.

[11] John McCarthy and Sasa Buvac. Formalizing context (expanded notes). Technical report, Stanford University, 1994.

[12] Moira C. Norrie and D. Kerr. Universal contextual queries in database networks. In Proceedings of CoopIS, 1995.

[13] Erhard Rahm and Philip A. Bernstein. A survey of approaches to automatic schema matching. The VLDB Journal, 10(4):334-350, 2001.

[14] Joseph Stoy. Denotational Semantics: The Scott-Strachey Approach to Programming Language Theory. MIT Press, 1977.

[15] Catharine M. Wyss and Edward L. Robertson. Relational languages for metadata integration. ACM Transactions on Database Systems, 30(2):1-33, 2005.

[16] J. Leon Zhao, Arie Segev, and Abhirup Chatterjee. A universal relation approach to federated database management. In Proceedings of the Eleventh International Conference on Data Engineering. IEEE Computer Society, 1995.