
Electronic Notes in Theoretical Computer Science 7 (1998)

URL: http://www.elsevier.nl/locate/entcs/volume7.html 27 pages

On the Formal Specification and Derivation of Relational Database Applications

Roberto Souto Maior de Barros

Departamento de Informática, Universidade Federal de Pernambuco. Caixa Postal 7851, Cidade Universitaria, CEP 50.732-970, Recife-PE, Brazil.

E-mail: roberto@di.ufpe.br

Abstract

The development of database applications is usually carried out informally. The derivation of database programs directly from formal specifications is a well known and unsolved problem. Most of the previous work on the area either tried to solve the problem too generally or was restricted to some trivial aspects, for example deriving the database structure and/or simple operations. However difficult in general, deriving relational database applications directly from Z specifications satisfying a certain set of rules (the method) is not arduous. With appropriate tool support, writing formal specifications according to the method and deriving the corresponding relational database programs can be straightforward. Moreover, it should produce code which is standardized and thus easier to understand and maintain.

1 Introduction

Having worked in the formal specification area for a number of years, I have devoted my attention mainly to the application of formal methods in the development of real life software. In particular, my M.Sc. thesis [1] involved the formal specification of a large system, namely UFPE's Student Records Control System.

In addition, it is unlikely that a generic solution to the problem of deriving real applications will be proposed in the near future. Hence, it was advisable to restrict the scope of the research to some well understood domain. The database area, and especially the relational database model [2,3], seemed to be the perfect target for the utilization of formal methods in this context.

Although some work has already been published, the utilization of formal or semi-formal techniques in the development of real life database applications has not been seriously attempted yet. A common drawback in some of the previous attempts to solve this problem has been to try to solve it too generally, either by not restricting it to applications based on a specific database model or by trying to refine a wide variety of application programs.


Another frequent mistake has been to overlook the vital need to specify constraints and to verify that they are satisfied at all times, so that the consistency of the database is guaranteed. Work with this limitation normally addresses only the correct behaviour of simple atomic operations and usually leaves the false impression that deriving database applications is fairly simple.

On the other hand, experts on the database area tend to think the automatic derivation of real database applications is too difficult, especially because the programs must guarantee the constraints are satisfied.

The main objective of this paper is to present a summary of the research carried out at The University of Glasgow as part of a Ph.D. degree [4]. The investigation was restricted to the specification and reification of relational database applications. Furthermore, it also considered all relevant kinds of constraints as well as more complicated transactions.

The first part of this work provided a general method for the specification of relational database applications. The method allows for an abstract formal specification of the applications, focusing on the important aspects of the relational model and applications, without considering that specific Relational Database Management Systems (RDBMSs) and query languages may not support some features. It provides the formal basis in terms of which applications can be specified, verified (using formal reasoning) and implemented (reified).

The formal specification language used in this work is Z [5], for a number of reasons. Firstly, model-oriented languages seem to be more appropriate for specifying database transactions, because of the convenient notion of state. Secondly, Z is probably the most widely used formal specification language; it has been under development for over a decade, is now well established, and has an extensive literature. Finally, Z is very flexible and permits the adoption of different levels of abstraction, even within the same specification document. This gives the specifier the necessary freedom to adopt the most appropriate level of abstraction for each part of the specification.

The second part of this work investigated the derivation of relational database programs directly from formal specifications written according to the method and presented a simple mapping. The mapping discusses the problems involved in the derivation of database programs from the specifications without binding the investigation to any specific system or language. In other words, the mapping is general and should be applicable to many RDBMSs.

Finally, this work includes a prototype tool built to support the method and implement the mapping. The prototype was built to show that the specification of relational database applications using the method and the construction of the corresponding database programs can be reasonably simple if appropriate tool support is provided; to provide evidence that the syntax and semantics of the method are sound and that it is possible to build a full scale syntactic editor for the method; and to demonstrate that the mapping can be adapted to specific RDBMSs, that it is possible to derive database programs automatically, at least for a large number of applications, and that building a tool to implement the mapping for a particular RDBMS is not too difficult.

The RDBMS chosen for this experiment was DBPL [6], an academic tool developed at the University of Hamburg. The database programming language DBPL extends the programming language Modula-2 [7] with a new persistent data type called relation and high-level relational expressions based on the predicate calculus. The main reason for adopting DBPL is the fact that the new type relation and the corresponding access expressions are well integrated with the Modula-2 language. As a consequence, it avoids the impedance mismatch which is common in the case of query languages such as SQL [8] being embedded in programming languages such as C or COBOL. In addition, the DBPL system implements a bigger subset of the theoretical relational model than most systems currently available.

The research was restricted to the relational model for a number of reasons. Firstly, the specification method developed is reasonably simple and does not enforce any constraints on either the real implementation of relations or the choice of a specific database system and language. Also, the proof of properties about such specifications, though not investigated in detail, seems to be fairly easy, involving only first order logic and set theory. Finally, given the very high-level nature of its query languages, it seemed likely that refinement techniques would not be necessary to derive application programs and, moreover, that this process would not be arduous.

The rest of this paper is organized as follows: Section 2 describes related work. Sections 3 and 4 present concise summaries of the method and the mapping, respectively. Section 5 covers details of the functionality and implementation of the prototype tool. Section 6 gives suggestions for further work. Finally, some conclusions are presented in Section 7.

2 Related work

This section presents a literature survey of the formal specification and derivation of database applications. The emphasis is specifically put on the derivation of database transactions in general and of relational database transactions in particular.

The approaches most related to the work described in this paper are the work of Xiaolei Qian and, especially, the work done by the database group at the University of Hamburg, which is referred to as the Hamburg work.

2.1 The work of Pastor and Olivé

Pastor and Olivé [9] propose a method for the generation of transaction specifications concerned with updating views and guaranteeing the integrity of the database. The context of their work is deductive databases [10] and their method augments the deductive database schema with a set of transition rules and internal event rules. A transition rule is a predicate defined in terms of the current state and the integrity constraints of the database, whereas an event rule is a predicate that specifies which operations (usually insertions and deletions) can happen as a result of a database update operation. The authors also describe a prototype tool, implemented in Prolog, which is capable of producing pseudo-code written in English and in Catalan, as well as Prolog implementation code written according to their method.

2.2 The work of Sheard and Stemple

Sheard and Stemple [11] present a thorough and theoretically sound treatment for the verification of database transactions safety. They describe a theorem prover that can be used to prove that database transactions are safe in the sense that they do not violate the set of specified database constraints.

The formal theory used by the tool is based on the Boyer and Moore [12] style but is extended with higher order functions and theorems. The specification language is called the Abstract Database Type Programming Language (ADABTPL).

The authors claim that both the theory used to build the tool and the ADABTPL specification language are not restricted to the relational model. However, the specification language does include a number of features which are specifically based on the relational model and the example presented in the paper is an extensive relational database example.

On the other hand, the ADABTPL language does not cover the specification of dynamic constraints (called transition constraints by the authors), only covers the two simplest aggregate functions (count and sum), and does not provide an explicit structure to capture the foreign key constraints, even though these can still be specified.

2.3 The work of Steinberg, Faley, and Chinn

Steinberg, Faley, and Chinn [13] describe a more practical approach. The main problem they propose to address is the fact that software developers often do not meet the needs of end users in a timely fashion.

The authors claim that one of the approaches to solve the problem is to encourage end users to get more involved in the design and development of the software they use. They also claim that one of the difficulties in achieving this goal is the fact that traditional CASE tools were developed primarily for the trained professional rather than the end user. The proposed solution is to use their tool, which is called The Analyst. In addition, they assert the tool can be used by novice end users to design and implement customized relational database prototypes. Moreover, they claim that this is achieved simply by writing English sentences.

Allegedly, the user would provide the entities, attributes, and possible queries, in addition to the attributes which should be listed in the results, using some restricted form of English sentences (e.g. pronouns are not accepted). The system then performs some validations and, when the prototype is acceptable to the user, the system generates the corresponding implementation code for either dBASE or Paradox.

According to the authors, the time taken to develop the applications is reduced, because the process depends less on the availability of developers.

They also claim this shorter development time, together with standardization and automatic generation, diminishes the possibility of misunderstandings in the systems requirements and reduces the cost of software maintenance.

Finally, they state that the results of an experiment using graduate business students with no previous experience in systems analysis or programming showed that, in little more than an hour, users could produce solutions that almost exactly matched the model solution to the task they were given.

It is very difficult to assess the merits and limitations of this work without actually seeing the tool or the problems used in the described experiment. Nevertheless, it is clear that all these claims seem too good to be true. I suspect the class of problems that can be solved using the tool is very limited. Moreover, the treatment of database constraints must be very rudimentary, if it exists at all.

2.4 The work of Xiaolei Qian

Xiaolei Qian [14] discusses the use of transaction synthesis (refinement techniques) to transform declarative specifications into procedural implementations. In this work, the transaction synthesis is the process of finding a transaction that satisfies the specification. This synthesis is formalized as the process of finding constructive proofs of specification theorems and extracting appropriate transactions from these proofs.

Proofs are represented as tables called deductive tableaux, which consist of three lists of formulas: an assertion list, a goal list, and a transaction entry list. The synthesis system consists of deduction rules that manipulate the tableaux while preserving their validity.

The proof system used to carry out the transaction synthesis is an extended version of the deductive-tableau proof system for first-order logic developed by Manna and Waldinger [15].

There are a number of aspects of this work which are similar to the research described in this paper. These are:

• The work is driven by the belief that the automatic generation of database transactions is both desirable and feasible. The author claims the automatic generation of programs in a restricted but well understood and important domain is desirable, to take advantage of the well defined semantics of the database transactions and avoid the violation of the integrity constraints; and feasible, because such transactions are usually dominated by data manipulations rather than complex computations.

• The database state is explicitly characterized as a finite set of relations. The author claims it is "relatively simple" to define it this way and that, often, "it is possible to specify precisely the effect of every language construct on database states".

• The work assumes that "transactions are always executed in valid databases where integrity constraints are satisfied", i.e., the database is assumed to be in a valid state before any transactions are executed.

Nevertheless, there are a number of important aspects which are different in the two approaches. The main differences are:

• This approach is much more formal than the one adopted in the research described in this paper, with a lot of emphasis being put on reasoning about state transitions and proving that the resulting transactions satisfy the specifications.

• There is no explicit method and/or guidelines to help users write the formal specifications and to carry out the proofs from which the transactions are extracted. In other words, this approach requires much greater knowledge of mathematics and is unlikely to be usable by developers of real database applications.

• The resulting transactions are not explicitly targeted at any existing RDBMS, only at a hypothetical system supporting the transaction language described by the author.

2.5 The Hamburg work

The approach to the derivation of database applications developed by the database group at the University of Hamburg advocates the use of a formal method together with a conceptual design language and an implementation language in an integrated framework [16].

In their main approach they suggest that conceptual designs should be written using an expressive semantic data and transaction model, namely the TDL language [17], which is derived from TAXIS [18]. In particular, TAXIS has been enriched with constructs for a predicative specification style. The extensions include multi-valued attributes, a set-oriented expression language, and the predicative techniques for specifying the dynamic parts of the system, i.e. transactions, functions, and derived classes and attributes.

Also, the database structures and constraints, initially written in TDL, should then be formally transformed into equivalent abstract machines, as defined by Abrial [19], using the B-Method. The transactions are modelled by operations in the abstract machines. The proof obligations for guaranteeing consistency are semi-automatically verified using the B-Tool [20].

In the next step, these abstract machines should be refined into other abstract machines that are equivalent to programs written in the strongly typed programming language DBPL. In other words, they provide specific B specifications that are sufficiently refined to be directly translated to DBPL. According to the authors, it was the explicit specification of state and invariants and the possibility of stepwise refinement within the same language that made the abstract machine approach a natural choice for the specification of database applications.

Finally, these final B specifications should be translated to DBPL syntax. The automatic transformation of TDL designs into abstract machines was described in a paper by Schewe, Schmidt, and Wetzel [16]. This paper also provided a small set of refinement rules which formalize the transformation of these initial abstract machines into other machines which are equivalent to DBPL programs. It also describes which properties must be verified to guarantee transaction consistency and correct refinement, and indicates how to use a mechanical theorem proving assistant to guide the proofs.

A more recent paper by Günther, Schewe, and Wetzel [21] characterized the final B specifications that are equivalent to DBPL programs. In addition, it describes an automatic transformation of final B specifications into DBPL syntax. In the first part, the authors show that DBPL programs are indeed equivalent to certain B specifications. In the second part, they use the algebraic specification language and term rewriting system OBJ [22] to implement the mapping to DBPL syntax.

A considerable part of this work was carried out as part of DAIDA, an ESPRIT project funded by the EEC under research contract number 892.

An alternative approach

An alternative approach based on a slight variation of the aforementioned scenario was also considered by Schewe, Schmidt, and Wetzel [23]. However, I believe it was never investigated in detail. Basically, they proposed a new database specification language called SAMT (Structured Abstract Module Types) that would add strong types to the abstract machine formalism and would support the idea of modules with import and export constructs, similar to Modula-2 modules.

The main aim was to design a language that could be used to write modular strongly-typed specifications already at the conceptual level, and also to refine these specifications into executable database programs. Hence, SAMT would substitute both TDL and the abstract machines in their original approach and, thus, would eliminate some of the complexity issues of the multi-language approach.

The motivation to design SAMT was their will to overcome the following problems:

• All objects that are part of the state are necessarily persistent. The reason this was considered a problem is the fact that they do not consider their approach to be restricted to the relational model.

• Their inability to automatically derive appropriate DBPL final data structures. The main problem is that they found it difficult to generate appropriate types for the structures since their specifications are untyped (B does not support types).

Comparison to my approach

There are some similarities between the Hamburg approach and mine. Firstly, it is in both cases possible to prove, already at the specification level, that transactions maintain the consistency of the database. This possibility was not pursued as part of my Ph.D. thesis though.

Secondly, the relational model has, in both approaches, been used as the main target for the generation of database applications. However, they do not provide any method or facilities to support specific features of the relational model, mainly because they do not consider their approach to be restricted to the relational model.

Finally, the implementation language used in both works is DBPL. Nevertheless, their approach to the mapping is specific to DBPL and is not easily adaptable to be used with another language. In my approach, DBPL is just the chosen example of a target database language which is used to instantiate the generic mapping.

Despite these similarities, the means used to achieve the main objective are different. Their emphasis was on the derivation of efficient DBPL programs and on proving, formally, that these programs do not violate the database constraints. My emphasis was on a specific method aimed at helping practitioners with the formal specification of relational applications and on a generic mapping that can be adapted to generate implementations for any RDBMS.

The two problems present in their approach are not problems in my approach. Firstly, the fact that all the state is persistent is not a problem, because only the relations are part of the state and these must be persistent. Secondly, although Z is not strongly typed, the method adopts the strategy of having strongly typed domains based on their names, which avoids their second problem, the mapping of the data structures.

3 The Method

As already mentioned, an important first part of this research was the development of a method for the formal specification of relational database applications. The method provided a formal starting point for the investigation of all other aspects of the work. Therefore, it was vital to improve it as much as possible before proceeding to investigate the other parts because a weak method would probably make the whole work fail.

The method is aimed at formalizing the design of real relational database transactions and, so, it should help practitioners in the development of real world applications. In addition, the method is generic and may be the first step in the direction of the formal development of database applications and of specification standardization in this context. Moreover, it should improve the system documentation and the quality of the application programs which should contain fewer errors.

This section presents a very brief summary of the specification method. The method addresses the definition of domains and relations, the specification of constraints, and querying and updating of relations, including error handling. More advanced features such as transactions, sorting of results, aggregate functions, etc. are also addressed. The complete description of the method was given in [4]. Previous versions were published elsewhere [24,25].

Domains are sets of values from which one or more attributes draw their values. The aim is to prevent comparisons of attributes which are not based on the same domain by strongly type-checking domains based on their names. Some examples of domain definitions are presented below.

ENUM == N

DNUM == { n : N | n > 100 }

SEX ::= Male | Female | NULLSEX

Relations are specified as sets of tuples, which respects the original relational model defined by Codd [3]. Also, the method per se does not enforce any constraints on the way relations and operations may be implemented.

The formal definition of a relation is split into two parts: the relation intention and the relation extension. The intention is a record (Z tuple type [26]) which defines its attributes ("variables" of the tuple type), each of which must be of a valid domain. For example:

T_EMPL == [ ENum : ENUM; Sex : SEX;
            Age : AGE; DNum : DNUM; ... ]

The relation extension is a variable of finite set type (F) over the intention type defined earlier, and it is declared in a schema. The predicate of such a schema specifies all static constraints that only depend on the relation being defined. This includes the required (not null) and candidate key constraints, specified using the operators REQUIRED and KEY_OF respectively, as well as other static attribute constraints. For example:

_Employee_

empls : F T_EMPL

REQUIRED empls ENum ∧ REQUIRED empls Sex ∧
KEY_OF empls ENum ∧
∀ e : empls • e.Age > 25

The state schema, e.g. DB, which represents the database as a whole, groups all database definitions by including the relation extension schemas; see the example below.

_DB_

Employee
Depart
...

FOR_KEY depts ManENum empls ENum ∧
∀ e : empls • (∃ w : works • w.ENum = e.ENum) ∧
∀ d : depts • d.NEmp = # { e : empls | e.DNum = d.DNum }

The predicate of the schema above specifies all static constraints depending on more than one relation and this includes the foreign key constraints specified using the FOR_KEY operator.

In particular, the foreign key specification above means that, for all tuples of relation depts, attribute ManENum must either be null or match the primary key attribute ENum of some tuple of relation empls.
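Although the formal definition of the FOR_KEY operator is given in [4] and omitted here, a plausible expansion of the specification above, consistent with this informal reading, is the following predicate (where NULLENUM is assumed to denote the null value of the domain of ManENum; the name is illustrative only):

∀ d : depts • d.ManENum = NULLENUM ∨ (∃ e : empls • e.ENum = d.ManENum)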

As in standard Z, other state schemas called ΔDB and ΞDB as well as an initialization schema called Init_DB are defined. The details are omitted.

Read-only operations, i.e. select, join, and project, are specified by schemas such that: (1) they include the ΞDB schema; (2) they declare the input (if any) and output variables of the operations; (3) their output variables are usually relations, i.e., their types are ℙ A, where A is the intention (type of the tuples) of some relation; and (4) their predicates describe the result of the operations using a set comprehension. Specific constraints involving the input variable(s) of the schema may also be specified.

In the select operation, the set comprehension is used to describe the result as a set of tuples of a given relation based on a select condition using its attributes. Theta-joins, the most general form of joins, are described similarly but more than one relation is used and a join condition is specified, based on attribute(s) of all relations. The (extended) project operation is similar to the select operation, but it explicitly specifies the result based on computations of some attributes of the qualifying tuples.

An example combining select, project, and join operations is given below.

_Empls_salary_hours_

p1? : PNUM
sempl_work! : ℙ EMPL_WORK

∃ pj : projs • pj.PNum = p1? ∧
sempl_work! = { e : empls; w : works |
    w.PNum = p1? ∧ e.ENum = w.ENum • (e.ENum, e.Salary, w.Hours) }

Update operations are specified by schemas such that they (1) include the ΔDB schema; (2) declare the input variables (if any) of the operations (normally there are no output variables); (3) specify what relations are changed by the operations, using a schema expression based on the ΞDB schema; and (4) describe, in their predicates, the updates in one or more relations of the database.

Given that DELETE is another predefined operator, examples of predicates of update operations include: empls' = empls ∪ sempl1? for inserts; empls' = DELETE empls ENum sel? for deletes based on the primary keys; and works' = { w : works • if w ∈ swork1? then w ⊕ (PNum = p1?) else w } for updates of attributes based on a select condition.

In the specific cases of deletes based on the primary key (and updates of the primary key attributes) the method also covers the specification of the foreign key compensating actions. Specifically, when the primary key of the relation is the target of one or more foreign keys, the predicate of such schemas must also specify what happens to all references for deleted tuples to prevent violations of the foreign key constraints.

In general, three possibilities are considered [2]. When Restricted is chosen, deletes are performed only if there is no foreign key reference to any of the tuples selected for deletion. When Cascades is specified, every tuple where there is a foreign key reference to a deleted tuple is also deleted. Finally, Nullifies changes all foreign references for deleted tuples to contain the null value. Again, the specification details are omitted.
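For illustration only, and reusing the DELETE operator and the notation of the earlier update examples (with sel? assumed to stand for the set of ENum values selected for deletion and NULLENUM for the null value of the corresponding domain; these names are not necessarily those used in [4]), the compensating part of the predicate for the foreign key depts.ManENum might be sketched as:

Cascades:   depts' = DELETE depts ManENum sel?

Nullifies:  depts' = { d : depts • if d.ManENum ∈ sel? then d ⊕ (ManENum = NULLENUM) else d }

Under Restricted, no compensating update is specified; instead, ∃ d : depts • d.ManENum ∈ sel? becomes one of the error conditions of the corresponding transaction.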

More complicated transactions are specified using the schema piping (>>) of basic operations written according to other rules of the method. Notice that the version of the piping operator (>>) used here is not part of standard Z. It allows for the output and primed state variables (all results) of the first schema to be matched against the input and unprimed state variables of the second schema, respectively.

In addition, renaming variables of the component schemas is usually necessary to make variables of different operations be the same variable, avoid name clashes, and/or keep the ? and ! naming conventions for input and output variables valid in the transaction. Extra parentheses are sometimes needed to enforce an order in the association of the schemas. Occasionally, additional predicates are needed to specify constraints depending on the inputs of more than one subtransaction and/or to make the value of a variable refer to the value of an attribute of a tuple variable.

A generic transaction definition is presented below, where Transac1_Ok is the correct behaviour of the transaction, Oper_1, ..., Oper_n are the components of the transaction, and <condition> is an additional predicate as mentioned above.

Transac1_Ok = ( Oper_1 [b1 / a1, ...] >> ... >>
                Oper_n [bn / an, ...] | <condition> )
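For instance, the correct behaviour of the transaction Salary_dept used as an example in Section 4 can be built by piping its two subtransactions. A sketch, assuming that the output variable of Empls_of_dept and the input variable of Sum_salary_empls share the same base name (sempl1), so that no renaming and no additional predicate are needed:

Salary_dept_Ok = ( Empls_of_dept >> Sum_salary_empls )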

The method also covers other advanced features such as sorting of results, aggregate functions, composite attributes, and views but, once again, the details are omitted.

The specifications of the extended transactions, which allow for error handling, explicitly state what the possible errors are and give a specific error message for each of them. To identify all possible error conditions, the negations of the preconditions of the corresponding schemas that describe the correct behaviour of the transactions must be simplified.

For each transaction using the database, there will be a corresponding error schema, which describes the possible errors that may occur. Each of these schemas (1) includes the ΞDB schema, because no change is made to the relations when errors occur; (2) declares the variable result! to keep the error message; (3) declares all input variables (if any) declared by the corresponding schema that deals with the correct behaviour of the transaction, because they will be involved in some of the possible errors; and (4) describes, in its predicate, what the possible errors are and which messages correspond to each of them. A sketch of such an error schema is presented below.

_Transac1_Error_

result! : STRING
<input_vars>

( <error_1> ∧ result! = "message 1" ) ∨ ... ∨ ( <error_n> ∧ result! = "message n" )

In the schema above, <input_vars> is the declaration of all input variables of the schema that deals with the correct behaviour of the transaction, <error_1>, etc. are the possible errors, and "message 1", etc. are the corresponding error messages.

Now, the extended transactions, e.g. Transac1, are specified by extending their original specifications, i.e. Transac1_Ok, to describe what happens if any error occurs. Basically, if the preconditions are satisfied the result is "success" (schema Ok); otherwise no change is made to the database and a specific message is put in result! (Transac1_Error).

Transac1 = (Transac1_Ok ∧ Ok) ∨ Transac1_Error

Finally, in addition to the method per se, a number of guidelines on how to write relational database specifications in Z using the method were provided [4]. Basically, it is postulated that the users should not write the complete specification at once but rather split the task into a number of steps. They should write a first specification containing only a small subset of the details. The information left out of this first specification would then be gradually added in several steps. Once again, the details are omitted.

4 The Mapping

The second major problem addressed by this work was the derivation of database programs directly from formal specifications. The investigation was restricted to the relational model and considered all relevant kinds of constraints as well as more complicated transactions.

Specifically, a generic mapping aimed at generating relational database programs directly from formal specifications written according to the method was proposed. The mapping described, for a comprehensive subset of the method, what the target implementation code should look like, without binding it to any particular database system and/or language. Nevertheless, most examples were written in DBPL [6], the RDBMS used to implement the prototype.

The efficiency of the generated code, though taken into account, was not a primary concern. In fact, it was sometimes disregarded in order to make the mapping as smooth as possible. However, this does not mean the generated programs are going to be terribly slow because a number of these operations are optimized by the compiler.

Finally, although an effort was made to keep the generated programs as close to the specifications as possible so that the mapping is simple, it was not always possible to achieve this simplicity. In some cases, in addition to the relevant data from the corresponding section of the specification, the implementation includes data from other parts of the specification method. It was also sometimes necessary to incorporate design decisions into the mapping so that the generated programs were syntactically correct.

It would not be appropriate to include the complete description of the generic mapping in this paper due to limitations of space. However, some of the more interesting parts of the mapping process are summarized below. Full details of the mapping are given in [4]. A simplified version was also published elsewhere [27].

4.1 Domains

If the DBMS/query language does not support domains (or user type definitions), then all domains should be enforced by means of explicit constraints. Note that, in this case, it will not be possible to prevent operations between attributes and/or variables of different domains that happen to be implemented by the same basic data type.

All domains in the specification (attributes and variables as well) will, ultimately, be implemented by one of the basic data types offered by the DBMS and/or query language. Sometimes, these data types include a parameter giving the size they will occupy and, so, the mapping will have to incorporate some specific value as default.

If necessary, explicit constraints are to be enforced on the values that attributes and variables (drawn from the domains) can take. This depends on which kind of domain definition is used in the specification and the details are omitted.
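As an illustration, suppose the domain DNUM == { n : N | n > 100 } of Section 3 is implemented by one of the basic numeric types of the DBMS (an assumption; the actual type and any default size depend on the instantiation of the mapping). The restriction to values greater than 100 would then have to be checked explicitly in every transaction that receives a department number as input, for example in the DBPL style used in the examples later in this section:

IF NOT ( d > 100 ) THEN
  result := "Invalid value for domain DNUM";
  RETURN;
END;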

4.2 Constraints

If the DBMS supports the specification of constraints directly, the appropriate syntax should be used and the translation should be simple.

It is possible that some special kinds of constraints — e.g. required attributes, primary key, and foreign key constraints — are supported, even though static and/or dynamic constraints in general are not. Strictly speaking, these are simply special cases of static attribute constraints. Therefore, if not supported, they should be treated as any other constraints. Otherwise the translation should be simple.

In the specific case of relational systems which do not support constraints directly, the necessary constraints will have to be enforced explicitly in the implementation of the transactions that can possibly violate the integrity of the database (it is assumed the database is in a valid state before the operations).

The most direct way of generating the necessary conditional expressions in the application programs is from the preconditions of the transactions, more specifically, from the error schema(s) associated with the transaction. In fact, even when constraints are supported, it is usually necessary to test the DBMS return codes so that specific error messages can be reported to the user.

Even though constraints of the form ∀ t : rel • <condition> can be expressed in DBPL as ALL t IN rel ( <condition> ), where <condition> must be of type BOOLEAN and might use OR, AND, NOT, ALL (∀), and SOME (∃), the DBPL system does not provide a way of enforcing them. Thus, the appropriate tests that guarantee the consistency of the database should still be generated from the error schemas. Hence, part of the discussion about mapping the constraints is presented in Subsection 4.4.
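For example, the static constraint ∀ e : empls • e.Age > 25 from the Employee schema of Section 3 can be written in DBPL as ALL e IN empls ( e.Age > 25 ), but the system will not check it automatically. A transaction inserting a set of employees would therefore test the corresponding error condition explicitly before updating the relation; a sketch in the style of Subsection 4.4, where sempl is assumed to hold the tuples to be inserted:

IF SOME e IN sempl ( e.Age <= 25 ) THEN
  result := "Invalid employee age";
  RETURN;
END;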

4.3 Transactions

Transactions are more complicated operations, potentially involving a number of simpler operations, which must execute as a whole or fail completely. Most RDBMSs support the definition of transactions and the appropriate syntax should be the target code.

The most common way of supporting transactions is by delimiting their scope with two special commands provided to the user: one to start a transaction and the other to end it successfully. Some DBMSs implicitly insert these delimiters in the beginning and at the end of application programs so that, by default, each program is a transaction. In others, these delimiters are written as a special kind of procedure.

A third command usually allows the user to abort the transaction at any time and will normally undo all the database updates already done. The component operations are simply written within the transaction scope using the normal syntax.

Regardless of these implementation details, the mapping should be simple for most DBMSs. If procedures are supported by the query/host language, they should be used to separate the correct behaviour of the transactions from the error handling code.

In DBPL, a transaction is just a special kind of procedure, the difference being it starts with the keyword TRANSACTION. Also, there is no automatic undo facility for unsuccessful transactions.

The chosen approach to the mapping of the correct behaviour of the transactions to DBPL is to write them and their subtransactions (basic operations) as procedures, named after the corresponding specifications. Input and output variables are to be passed as value and variable parameters, respectively. Parameters of the subtransactions which are not parameters of the transaction should be mapped to local variables.

For example, transaction Salary_dept of the company database example returns the sum of the salaries of all employees hired by department d1?. It was specified using two subtransactions: Empls_of_dept, which returns all employees sempl1 hired by department d1?, and Sum_salary_empls, which receives a set of employees sempl1 and returns the sum of their salaries tot_sal!. The corresponding DBPL implementation code, excluding error handling, is presented below.

TRANSACTION Salary_dept ( d: DNUM; VAR tot_sal: SALARY;
                          VAR result: STRING );

  PROCEDURE Salary_dept_Ok;  (** Correct Behaviour **)
  VAR sempl: REL_EMPL;
  BEGIN
    Empls_of_dept ( d, sempl );
    Sum_salary_empls ( sempl, tot_sal );
  END Salary_dept_Ok;

BEGIN
  ... error handling code ...
  Salary_dept_Ok;
  result := "Success";
END Salary_dept;

4.4 Error handling

According to the specification method, the predicates of the error schemas associated with the transactions are written as sequences of expressions of the form presented below, connected to each other by logical disjunctions (∨).

( <condition> ∧ result! = "error message" )

In the expression above, <condition> stands for generic predicates representing the logical conditions which violate the precondition of the transaction. In general, each of them involves a combination of predicates connected by logical conjunctions (∧), disjunctions (∨), and/or negations (¬), as well as existential quantifiers (∃). Expressions involving the universal quantifier (∀) can be rewritten using the existential quantifier, because ∀ x : T • p is equivalent to ¬ (∃ x : T • ¬ p) for any predicate p. Sometimes, set comprehensions are also included.

The approach here is to map each of these generic predicates to the appropriate piece of implementation code that evaluates it and verifies, using a conditional statement, whether the precondition is violated. Whenever one of these predicates is true, the transaction must fail. This means that all changes which might have already been made must be undone, so that the consistency of the database is guaranteed. If an undo facility is supported it should be used whenever appropriate. Otherwise, no change should be made to the database before all such predicates are checked.

For most DBMSs, it should be simple to generate conditional statements of the implementation language from the aforementioned predicates, except for the existential quantifiers. When the DBMS supports existential quantifiers, the appropriate syntax should be used and the translation ought to be simple. Otherwise, the result of the evaluation of the existential quantifiers should be assigned to auxiliary boolean variables. These variables should then be used in the conditional statement.

These existential quantifiers are always based on relations and attributes, since they are derived from the precondition of the transactions. Therefore, the evaluation of these expressions can be implemented by checking whether the relevant read-only operations (i.e. select, join, and/or project) implicitly specified by their predicates actually return any data. If they do, the result is true, else the result is false.

For example, suppose Salary_dept is a transaction that returns the total salary of the employees working for a given department d1?. The predicate of the corresponding error schema Salary_dept_Error is given by:

( ¬ (∃ dp : depts • dp.DNum = d1?) ∧
  result! = "Invalid department number" )

In DBPL, the universal and existential quantifiers can be mapped directly to the FORALL and SOME commands, which simplifies the problem. Thus, the corresponding DBPL error handling implementation code should be:

IF NOT ( SOME dp IN depts ( dp.DNum = d ) ) THEN
  result := "Invalid Department Number";
  RETURN;
END;

If DBPL did not support the existential quantifier, it would be simulated by testing whether the select operation { dp : depts | dp.DNum = d1? } returns any tuples, and the result would be assigned to an auxiliary boolean variable as follows:

IF REL_DEPT { EACH dp IN depts : dp.DNum = d }
   <> REL_DEPT { } THEN
  exist_aux := TRUE;
ELSE
  exist_aux := FALSE;
END;

The error handling implementation code presented before would then follow, but the auxiliary variable exist_aux would substitute the existential quantifier (SOME . . .) in the conditional statement. The resulting code is presented below.

IF NOT exist_aux THEN
  result := "Invalid Department Number";
  RETURN;
END;

5 The Prototype

This research also involved a substantial piece of implementation. Specifically, a prototype tool was developed. It aims to support the method and instantiate the mapping for a particular RDBMS, namely the DBPL system [6].

Specifically, the prototype is meant to automatically generate relational database applications to be run on the DBPL system. Nevertheless, it is worth emphasizing that DBPL is just the chosen example of a target database system/language.

The rest of this section summarizes the functionality and implementation details of the prototype tool.

5.1 Design and implementation strategy

The prototype is composed of a syntactic editor for the method and a built-in tool which translates the specifications to database commands. Its outputs are specifications written in Z (using the syntax provided by the zed.sty [28] style option for LaTeX) and relational database applications written in DBPL, respectively.

Since the tool is only a prototype, it does not support the full method. For instance, the syntactic editor accepts a large subset of all possible specifications which are correct according to the method, even though many of the incorrect ones are not rejected.

Another design decision was to embed part of the semantics of the method in the editor to generate the specifications automatically as much as possible, so that the actual typing done by the user would be restricted to a minimum.

One of the design decisions which proved to be very useful was to write the target DBPL programs beforehand, because it provided a concrete target for the mapping process. It also helped to find errors and omissions on the description of the mapping. Sometimes, the ideal implementation code proved to be too difficult to map and, so, these programs were changed so that the mapping could be as smooth as possible.

The last design decision made was not to use formal methods techniques in the implementation of the tool, mainly because it is not a production tool, only a prototype. However, since the prototype tool was built using the Synthesizer Generator and the inputs to this system are formal specifications, it is still correct to assert that the tool was formally specified.

Although the prototype has not been finished, a considerable part of its functionality has been implemented. Except for the specification of more complicated transactions, the code which generates the syntactic editor for the method and the corresponding DBPL implementation code has been written.

5.2 Tool support

The prototype was developed using the Synthesizer Generator [29,30], which is a powerful tool for implementing language-based editors. It allows for the generation of syntactic editors fairly quickly, as long as the syntax and semantics of the target language are well defined. In particular, the view facility of the Synthesizer Generator was used to automatically generate parts of relational database programs written in DBPL. The syntactic editor that supports and enforces the method is a bonus.

The effort to learn the basic features of the system was also fairly small. It took about two weeks to get the first specification running and another two weeks to experiment with most of the features of the Synthesizer Generator system.

As far as I can see, the main challenge was to come up with a good design for instantiating the general mapping to DBPL, the particular RDBMS chosen, within the time available. Writing the Synthesizer Specification Language (SSL) code for the syntactic editor and using the view facility to generate database programs per se were reasonably straightforward.

To give a better idea of what the prototype tool looks like, snapshots of a number of screens are provided at the end. For instance, Figures 1 and 2 present consecutive snapshots of the specification window. These include the intention and extension of the relations as well as part of the database state schema of the company database example [4].

Figure 3 is a snapshot of the TYPES_D view and shows the generated DBPL implementation code corresponding to the specifications presented in Figures 1 and 2. This view generates the definition module that contains all the global types. The snapshot shows the types of the intentions and extensions of all the relations, and this includes the primary keys of the relations.

6 Further work

This section presents a number of suggestions for future extensions and further work. In fact, some of them are the subject of current investigation.

One of these activities envisages the automatic generation of relational database programs to be run on one or more database systems which are actually used in the development of real world applications.

The starting point for this activity would be adapting the generic mapping to generate code for another relational database system, written in one of its target query languages. Obviously, a system offering SQL [8] as (one of) its query languages would be a natural candidate for the investigation.

This activity would possibly include the construction of one or more prototype systems before the construction of a production-quality tool is attempted. In addition, these prototypes should be more ambitious than the one developed during my Ph.D.

Another natural extension foresees the use of modularization techniques together with the specification method, aimed at a better organization of the large specifications which would result from applying the method to real world database applications.

This could be achieved by using the modularization structures Document and Chapter of Zc [31], also proposed for incorporation in Z [32]. The idea is to split the specification of systems (documents) into several modules (chapters) based on the connections between objects. Specifically, the specification of complex relational database applications should be split into several chapters based on the connections between the relations, i.e. the foreign keys. The problems that may arise from such a separation and a detailed explanation of what is needed to avoid them would be the subject of the investigation.

Another activity refers to a more detailed investigation of the treatment for error handling. The main objective would be to try to identify, for each of the operations prescribed by the method, all the possible kinds of constraints that might be violated.

Moreover, it might even be possible to identify specific equations in the simplified precondition of the transactions that correspond to certain operation and constraint pairs. The results of such investigation could then lead to a more straightforward way of developing the precondition of database transactions written according to the method. The automatic generation of parts of the predicate of the error schemas associated with the transactions might also be feasible.

Although this activity can be investigated independently of the others, it is obvious that the results of such an investigation will depend on and/or influence the results of some of the others, for instance the modularization activity.

Another possibility is the investigation of how reasoning techniques could be applied to specifications written according to the method. The aim would be to come up with a comprehensive set of theorems about common properties of such specifications and prove them so that the users would be discharged from proving them again.

One possible approach to prove such theorems is to generate, using another view in the prototype tool, a version of the specifications written in the specification language adopted by some other system supporting theorem proving, e.g. ADABTPL [11].

The main benefit of such an exercise would be to prove, formally, that all the structures chosen for the specification method are sound and satisfy the relational database theory and, moreover, that transactions specified according to the method do indeed maintain the consistency of the database.

Again, this activity can be investigated independently of the others. However, its results ought to be directly influenced by the results of the modularization phase.

A fairly simple activity, which would probably make the previous activity easier, is the construction of a syntactic editor that supports and enforces the method in full. Ideally, this syntactic editor would be built with the same tool used in the construction of the prototype, i.e. the Synthesizer Generator.

There are a number of other directions in which this research could advance. One of them would be to work on guidelines aimed at maximizing the reuse of specifications of subtransactions.

Another way to proceed would be to use a controlled experiment to compare the specification of simple relational database applications written using the method against others written without the method. To be meaningful, such an experiment would have to be carried out using several groups of people with different backgrounds. Its results might enable an easier identification of the strengths and weaknesses of the method and, thus, help to improve it.

Finally, it is possible that the ideas and results of this research can be adapted for developing object-oriented database applications.

7 Conclusion

In summary, this paper presented an overview of the research carried out during the last four years at The University of Glasgow on the utilization of formal techniques for the development of relational database applications. In particular, it argued that the formal specification and derivation of relational database programs can be made reasonably simple by the provision of appropriate methods and tool support.

In an ideal world, applications ought to be formally specified and modularization techniques used, when necessary, to make the formal specifications easier to understand. In addition, reasoning and/or refinement techniques could be applied before the implementation is actually developed.

This work has addressed the problems of specifying relational database applications formally and of deriving relational database programs directly from the specifications.

It was claimed most previous approaches to the derivation of database programs had not properly addressed the problem, because the problem was either kept too general, without being restricted to any particular database model, or greatly simplified, by not addressing the specification of the database constraints and/or more complicated transactions. The work described here is restricted to the relational database model and addresses all possible constraints as well as generic transactions.

Specifically, a method for the formal specification of relational database applications was provided. The method is aimed at formalizing the design of real relational database transactions and, thus, it ought to help practitioners in the development of real world applications. In addition, the method is generic and may be the first step in the direction of the formal development of database applications and of specification standardization in this context. Moreover, it should improve the system documentation and the quality of the application programs which should contain fewer errors.

One of the conclusions of this research was that the choice of Z as the formal language for the specification of relational database transactions was an appropriate decision. In the main, the specification method uses only standard Z [5]. Still, most aspects of the method are clear and simple and are defined using a suitable level of abstraction. Also, the extensions to Z used or suggested in this work were kept to a minimum.

Nevertheless the choice of Z in this work does not preclude using other model-oriented specification languages. In other words, the method is generic and different users may use different languages to specify applications. In particular, a previous paper on this method [33] was written in Zc [31], a strongly-typed Z-like language, with minor modifications only.

This work has also proposed a simple translation process to map specifications that result from using the method to relational database applications. The mapping addresses the problems involved in such a process without binding the investigation to any specific database system or language. Also, it is not restricted to the correct behaviour of simple atomic operations. On the contrary, it considers all the relevant kinds of constraints as well as more complicated transactions.

In general, there is more than one way of writing correct database commands to implement any particular operation. The mapping standardizes the database operations contained in the application programs, which ought to make the programs easier to understand. As a consequence, the costs of testing and maintenance might also be reduced.
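As an illustration only (the prototype derives DBPL rather than SQL), the employee relation above might map to standardized relational code along the following lines. The SQL column types are assumptions, since the Z given sets do not fix them, and the actual derivation rules are those of [4,27].

  -- Hypothetical SQL rendering of the employee relation; column types are assumed.
  CREATE TABLE EMPL (
      ENum    INTEGER      NOT NULL,  -- REQUIRED attribute, part of the key
      Sex     CHAR(1)      NOT NULL,  -- REQUIRED attribute
      Salary  DECIMAL(9,2) NOT NULL,  -- REQUIRED attribute
      SupENum INTEGER,                -- optional attribute
      DNum    INTEGER      NOT NULL,  -- REQUIRED attribute
      PRIMARY KEY (ENum)              -- KEY_OF constraint
  );

  -- A standardized insert operation: the NOT EXISTS guard implements the key
  -- precondition, so a duplicate ENum leaves the table unchanged (assumes a
  -- dialect, such as PostgreSQL, that allows SELECT without FROM).
  INSERT INTO EMPL (ENum, Sex, Salary, SupENum, DNum)
  SELECT 1234, 'F', 2500.00, NULL, 10
  WHERE NOT EXISTS (SELECT 1 FROM EMPL WHERE ENum = 1234);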

A prototype syntactic tool which aims to support and enforce a reasonably large subset of the method was also developed.

The prototype was built using the Synthesizer Generator [29,30], which is a powerful tool for implementing language-based editors. The mapping was implemented using the view facility of the Synthesizer Generator. Writing the SSL code per se was straightforward. The effort to learn the basic features of the system was also reasonably small.

The Synthesizer Generator helped to create appropriate support both for using the method to specify relational database applications and for deriving relational database programs from the specifications.

To conclude, it is believed that the mapping process described here, as well as its prototype implementation for DBPL (and indeed for most RDBMSs), is neither too easy nor too difficult. Moreover, it is claimed that this work provides evidence that the application of formal techniques in the development of real-life software is feasible. Even though there is no formal proof that the mapping preserves all the properties of the method, the well-defined semantics of the relational model, together with extensive testing of the prototype, suggests that this is indeed the case.

Acknowledgements

This paper summarizes material from my Ph.D. thesis [4]. Therefore, it is a pleasure to thank again my supervisors, Ray Welland and David Harper. The financial support for the Ph.D. course was provided by CAPES (Brazilian Federal Agency for Postgraduate Education) and by UFPE (Federal University of Pernambuco).

References

[1] Barros R. S. M. Formal specification of very large software: a real example. Master's thesis, Federal University of Pernambuco (UFPE), Departamento de Informática, Recife, Brazil, October 1988. In Portuguese.

[2] Date C. J. An Introduction to Database Systems, volume 1. Addison-Wesley Publishing Company Inc., Reading, Massachusetts, USA, sixth edition, 1995.

[3] Codd E. F. A relational model of data for large shared data banks. Communications of the ACM, 13(6):377-387, June 1970.

[4] Barros R. S. M. On the Formal Specification and Derivation of Relational Database Applications. PhD thesis, The University of Glasgow, Department of Computing Science, November 1994.

[5] Spivey J. M. The Z Notation: A Reference Manual. Prentice Hall International (UK) Ltd., Hemel Hempstead, UK, second edition, 1992.

[6] Schmidt J. W. and Matthes F. The DBPL project: Advances in modular database programming. Information Systems, 19(2):121-140, 1994.

[7] Wirth N. Programming in Modula-2. Texts and Monographs in Computer Science. Springer-Verlag, Berlin, Germany, third edition, 1985.

[8] Melton J. and Simon A. R. Understanding the New SQL: A Complete Guide. Morgan Kaufmann Series in Data Management Systems. Morgan Kaufmann Publishers Inc., San Francisco, California, USA, 1993.

[9] Pastor J. A. and Olivé A. An approach to the synthesis of update transactions in deductive databases. In Proceedings of CISMOD'94 (Fifth International Conference on Information Systems and Management of Data), Madras, India, October 1994.

[10] Minker J. Foundations of Deductive Databases and Logic Programming. Morgan Kaufmann Series in Data Management Systems. Morgan Kaufmann Publishers Inc., San Francisco, California, USA, 1988.

[11] Sheard T. and Stemple D. Automatic verification of database transaction safety. ACM Transactions on Database Systems, 14(3):322-368, September 1989.

[12] Boyer R. S. and Moore J. S. A Computational Logic. Academic Press, New York, USA, 1979.

[13] Steinberg G., Faley R., and Chinn S. Automatic database generation by novice end-users using English sentences. Journal of Systems Management, 45(3):10-15, March 1994.

[14] Qian X. The deductive synthesis of database transactions. ACM Transactions on Database Systems, 18(4):626-677, December 1993.

[15] Manna Z. and Waldinger R. The Logical Basis for Computer Programming, volume 2. Addison-Wesley Publishing Company Inc., Reading, Massachusetts, USA, 1990.

[16] Schewe K.-D., Schmidt J., and Wetzel I. Specification and refinement in an integrated database application environment. In Prehn S. and Toetenel W. J., editors, VDM'91: Formal Software Development Methods, volume 551 of Lecture Notes in Computer Science, pages 496-510. Springer-Verlag, October 1991.

[17] Borgida A., Mylopoulos J., and Schmidt J. W. Final version on TDL design. DAIDA Deliverable DES1.2, ESPRIT Project 892, September 1987.

[18] Mylopoulos J., Bernstein P. A., and Wong H. K. T. A language facility for designing interactive database-intensive applications. ACM Transactions on Database Systems, 5(2):185-207, June 1980.

[19] Abrial J.-R. The B-Book. Cambridge University Press, 1996. To appear.

[20] Edinburgh Portable Compilers Ltd., UK. B-Tool Reference Manual, 1991.

[21] Günther T., Schewe K.-D., and Wetzel I. On the derivation of executable database programs from formal specifications. In Woodcock J. C. P. and Larsen P. G., editors, FME'93: Industrial-Strength Formal Methods, volume 670 of Lecture Notes in Computer Science, pages 351-366. Springer-Verlag, April 1993.

[22] Goguen J. A. and Winkler T. Introducing OBJ3. Technical Report SRI-CSL-88-9, SRI International, Menlo Park, USA, 1988.

[23] Schewe K.-D., Wetzel I., and Schmidt J. Towards a structured specification language for database applications. In Harper D. J. and Norrie M. C., editors, Specification of Database Systems, Glasgow 1991, Workshops in Computing Series, pages 255-274. Springer-Verlag, 1992.

[24] Barros R. S. M. and Harper D. J. A method for the specification of relational database applications. In Nicholls J. E., editor, Z User Workshop, York 1991, Workshops in Computing Series, pages 261-286. Springer-Verlag, 1992.

[25] Barros R. S. M. Formal specification of relational database applications: A method and an example. Research Report GE-93-02, Department of Computing Science, The University of Glasgow, UK, September 1993.

[26] van Diepen M. J. and van Hee K. M. A formal semantics for Z and the link between Z and the relational algebra. In Bjørner D., Hoare C. A. R., and Langmaack H., editors, VDM and Z - Formal Methods in Software Development, volume 428 of Lecture Notes in Computer Science, pages 526-551. Springer-Verlag, 1990.

[27] Barros R. S. M. Deriving relational database programs from formal specifications. In Naftalin M., Denvir T., and Bertran M., editors, FME'94: Industrial Benefit of Formal Methods, volume 873 of Lecture Notes in Computer Science, pages 703-723. Springer-Verlag, October 1994.

[28] Spivey J. M. A guide to the zed style option. Unpublished, December 1990.

[29] Reps T. W. and Teitelbaum T. The Synthesizer Generator: A System for Constructing Language-Based Editors. Texts and Monographs in Computer Science. Springer-Verlag, New York, USA, 1989.

[30] Grammatech, Inc., Ithaca, NY, USA. The Synthesizer Generator Reference Manual, fourth edition, 1992.

[31] Sampaio A. C. and Meira S. L. Zc: A notation for complex systems specification. In Proceedings of XV SEMISH, SBC Brazilian Congress, Rio de Janeiro, Brazil, 1988. In Portuguese.

[32] Sampaio A. C. and Meira S. L. Modular extensions to Z. In Bjørner D., Hoare C. A. R., and Langmaack H., editors, VDM and Z - Formal Methods in Software Development, volume 428 of Lecture Notes in Computer Science, pages 211-232. Springer-Verlag, 1990.

[33] Barros R. S. M. and Harper D. J. Formal development of relational database applications. In Harper D. J. and Norrie M. C., editors, Specification of Database Systems, Glasgow 1991, Workshops in Computing Series, pages 21-43. Springer-Verlag, 1992.

Fig. 1. The prototype - specification window - part 1

Fig. 2. The prototype - specification window - part 2

Fig. 3. The prototype - DBPL database structure window