
Information and Software Technology 44 (2002) 13-24

www.elsevier.com/locate/infsof

The question of scale economies in software—why cannot researchers agree?

Barbara A. Kitchenham

Department of Computer Science, University of Keele, Staffordshire ST5 5BG, UK
Tel.: +44-1622-820-484; e-mail address: barbara@cs.keele.ac.uk (B.A. Kitchenham)
Received 27 June 2000; revised 20 September 2001; accepted 4 October 2001

Abstract

This paper investigates the different research results obtained when different researchers have investigated the issue of economies and diseconomies of scale in software projects. Although researchers have used broadly similar sets of software project data sets, the results of their analyses and the conclusions they have drawn have differed. The paper highlights methodological differences that have led to the conflicting results and shows how in many cases the differing results can be reconciled. It discusses the application of econometric concepts such as production frontiers and data envelopment analysis (DEA) to software data sets. It concludes that the assumptions underlying DEA may make it unsuitable for most software data sets but stochastic production frontiers may be relevant. It also raises some statistical issues that suggest testing hypotheses about economies and diseconomies of scale may be much more difficult than has been appreciated. The paper concludes with a plea for agreed standards for research synthesis activities. © 2002 Elsevier Science B.V. All rights reserved.

Keywords: Software estimation models; Production functions; Scale economies; Research synthesis; Data envelopment analysis

1. Introduction

Ever since Boehm [1] raised the issue of diseconomies of scale in software development in his COCOMO-1 model, other researchers have studied different data sets to investigate the nature of the relationship between production effort and product size.

Banker and Kemerer [2] made a significant methodological contribution to the debate on economies and diseconomies of scale by assembling different published data sets and investigating the extent to which different data sets revealed similar relationships between effort and size. Initially, they investigated the relationship between effort and size proposed by Boehm, i.e.

Effort = β₀ × Size^β₁ (1)

In order to estimate β₁, they applied a natural logarithmic transformation to the effort and size variables and used least-squares analysis to find a linear model of the form:

ln(Effort) = ln(β₀) + β₁ ln(Size) (2)

They found 2 of 8 data sets had statistically significant diseconomies of scale. However, they criticized use of model (2) because it models returns to scale using a single parameter. They hypothesised that individual data sets exhibit mixed economies of scale, where smaller projects exhibit economies of scale and larger projects exhibit diseconomies of scale. They suggested that this could be modelled by using quadratic forms of the linear and logarithmic models:

Effort = β₀ + β₁ × Size + β₂ × Size² (3)

ln(Effort) = β₀ + β₁ ln(Size) + β₂ ln(Size)² (4)

Using model (3), five of the eight quadratic terms were significant (p < 0.10, two-tailed test). Using model (4) with one-tailed tests and a significance level of 0.10, none of the quadratic terms were significant. However, they warned that the quadratic form of the model caused collinearity among the independent variables, which is a problem for least-squares regression. To avoid the problem of collinearity, they used data envelopment analysis (DEA) to investigate their hypotheses about the functional form of the relationship between effort and size. This work was continued in a later paper [3]. They concluded that the DEA analysis results were consistent with their hypothesis of mixed returns to scale.
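To make the fitting procedure concrete, the sketch below applies models (2) and (4) to a small synthetic data set, tests whether the exponent differs from 1, and checks the collinearity that the quadratic term introduces. The data and the statsmodels calls are illustrative assumptions, not the analyses reported in the papers cited above.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Synthetic stand-in for a published project data set (size in KLOC, effort in
# person-months); the real analyses used the data sets cited in the text.
size = rng.uniform(2, 200, 25)
effort = 2.5 * size ** 1.05 * rng.lognormal(0, 0.5, 25)
ln_size, ln_effort = np.log(size), np.log(effort)

# Model (2): ln(Effort) = ln(b0) + b1*ln(Size)
m2 = sm.OLS(ln_effort, sm.add_constant(ln_size)).fit()
t_b1 = (m2.params[1] - 1.0) / m2.bse[1]   # test statistic for H0: b1 = 1
print("model (2): b1 =", round(m2.params[1], 3), "t for H0 b1=1:", round(t_b1, 2))

# Model (4): ln(Effort) = b0 + b1*ln(Size) + b2*ln(Size)^2
X4 = sm.add_constant(np.column_stack([ln_size, ln_size ** 2]))
m4 = sm.OLS(ln_effort, X4).fit()
print("model (4): p-value of quadratic term =", round(m4.pvalues[2], 3))

# Collinearity check: ln(Size) and ln(Size)^2 are usually highly correlated,
# which is the problem noted for the quadratic forms.
print("corr(ln Size, ln Size^2) =", round(np.corrcoef(ln_size, ln_size ** 2)[0, 1], 3))
```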

I applied model (2) to 17 data sets [4] and found that the hypothesis of a linear relationship (i.e. no economies or diseconomies of scale) could be rejected for only two of the data sets. In contrast, Hu [5] analysed 9 data sets using models (3) and (4) and compared them with models (1) and (2). He concluded that a linear model could be firmly rejected and that model (4) was the model that best fitted the different data sets.

The purpose of this paper is not to analyse yet more data sets but to investigate why it does not seem possible for researchers to agree about a research hypothesis when there appears to be ample empirical data available to test it. This issue should be of concern to researchers because it raises basic methodological issues about how we test hypotheses and how we accumulate scientific evidence.

It must be noted that the data sets discussed in this paper are extremely old. This would be of concern if the paper were intended to discuss productivity levels in software projects. However, I do not believe this is a significant weakness, because the nature of software data sets has not changed in the last 20 years and this paper is concerned solely with the problems associated with data analysis.

2. The reasons for different conclusions

In order to understand why different researchers have come to different conclusions about scale economies, it is necessary to compare the analysis procedures used in each paper in more detail.

2.1. Data set selection

Although there is overlap among all four papers, i.e. Refs. [2-5], each paper used a different selection of data sets to test the hypothesis of scale economies. Furthermore, I based my conclusions on analysing homogeneous subsets of the reported data sets, even if they originated from a single source. I split four of the data sets as follows:

1. The COCOMO data set was analysed as three separate data subsets corresponding to the embedded, semidetached and organic project classification.

2. The Kitchenham-Taylor data set was analysed as three subsets since the data came from an ICL mainframe system, BT System X software, and BT Software House applications [6].

3. The MERMAID-1 dataset [7] was analysed in three subsets based on the language/environment classification. This identified projects as COBOL, Advanced COBOL or 4GL projects.

4. The MERMAID-2 data set was analysed in two subsets: new projects and enhancement projects.

Banker et al. [3] investigated subsets of the COCOMO data set and the MERMAID-1 data set but used the full data sets for their main hypotheses tests.

The different choice of data sets, and the use or otherwise of data subsets, makes a substantial difference to the proportion of data sets claimed to exhibit specific features, so it can have a significant impact on the conclusions drawn from the analyses.

There are two opposing viewpoints about whether or not to partition data sets. The argument for partitioning data sets is that mixed, heterogeneous data sets result in misleading analysis. For example, if we are concerned with the height of adults and we do not separate height data on the basis of sex, we would observe a misleading double-peaked distribution. Furthermore, correlations between body size measures such as the relationship between waist and hip size are different between the sexes.

However, the counter-argument for not partitioning software data sets is that there is no well-agreed basis for such partitioning. For example, if we partitioned height or body size data sets on the basis of eye colour, we would not improve the precision of our data analyses. So, if we have no agreed basis for partitioning, software data sets should be analysed as they are reported.

Which argument you find most compelling is a matter of opinion; clearly, I favour the use of partitioning while other researchers do not. Nonetheless, with respect to the Kitchenham-Taylor data set (which has been analysed by all researchers), there is a strong argument for partitioning, since the data set was reported as coming from three different sources. Furthermore, as can be seen in the figure reported by Pickard et al. [8], the quadratic effect is entirely due to data from one of the partitions. If it is argued that the data set is too small to be partitioned, I believe that the data set should be omitted from any research synthesis activities.
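As a sketch of what partitioned versus pooled analysis involves, the following fits model (2) to a pooled data set and then to each homogeneous subset separately, using a hypothetical 'source' label; the data and parameter values are invented purely for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)

# Hypothetical pooled data set: a 'source' column marks the sub-population each
# project came from (e.g. different companies or development environments).
rows = []
for source, (b0, b1) in {"A": (3.0, 0.9), "B": (8.0, 1.1)}.items():
    size = rng.uniform(5, 150, 15)
    effort = b0 * size ** b1 * rng.lognormal(0, 0.3, 15)
    rows.append(pd.DataFrame({"source": source, "size": size, "effort": effort}))
data = pd.concat(rows, ignore_index=True)

# Pooled analysis: the data set taken 'as reported'.
pooled = smf.ols("np.log(effort) ~ np.log(size)", data=data).fit()
print("pooled exponent:", round(pooled.params["np.log(size)"], 3))

# Partitioned analysis: the same model fitted within each homogeneous subset.
for source, subset in data.groupby("source"):
    m = smf.ols("np.log(effort) ~ np.log(size)", data=subset).fit()
    print("subset", source, "exponent:", round(m.params["np.log(size)"], 3))
```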

2.2. The choice of hypothesis test

A notable difference between the hypothesis tests used by Banker and Kemerer [2] to test model (2) and the hypothesis tests I used [4] was the level of significance selected and the choice of one-tailed or two-tailed tests. Banker and Kemerer used a significance level of p < 0.10 and one-tailed tests; I used a significance level of p < 0.05 and two-tailed tests.

The effect of using a one-tailed as opposed to a two-tailed test is to lower the critical value of the test statistic. Similarly, the effect of using a significance level of p < 0.10 as opposed to a level of p < 0.05 is also to lower the critical value of the test statistic. (Note: in their later paper, Banker et al. [3] used a 5% level of significance.)

The choice of one-tailed or two-tailed test depends on the nature of the alternative hypothesis. Banker and Kemerer were initially testing whether or not large projects exhibited diseconomies of scale (i.e. whether β₁ was greater than 1), which implies a one-tailed test. Banker and Kemerer's analysis showed four data sets with slight diseconomies of scale (β₁ > 1) and four with slight economies of scale (β₁ < 1). They also discussed another paper that reported a diseconomy of scale but did not report its data set. Thus, I thought it was more appropriate to test whether or not the parameter β₁ was different from one before testing the direction of any returns to scale. Testing whether β₁ ≠ 1 implies a two-tailed test.

Table 1
Applying a logarithmic model to the Belady-Lehman data set using different dependent variables (33 projects)

Independent variable    Parameter value    Standard error    t         P > |t|    [95% Confidence interval]

Dependent variable = ln(Effort)
Constant                1.541              0.3808            4.046     0.000      [0.7640, 2.3172]
ln(Size)                1.061              0.1008            10.522    0.000      [0.8552, 1.2664]

Dependent variable = ln(Size)
Constant                -0.3608            0.3868            -0.933    0.358      [-1.1497, 0.4281]
ln(Effort)              0.7365             0.0700            10.522    0.000      [0.5937, 0.8792]

The selection of a 0.10 or 0.05 level of significance is a matter of personal choice. In principle, a 0.10 significance level might be used by a researcher who is averse to missing a significant effect, while a 0.05 significance level might be used by a researcher who is averse to reporting a spurious effect.

It should be noted that many researchers prefer to report the actual probability value (p). However, when undertaking hypothesis testing or constructing confidence limits, it is still necessary to specify the significance level (alpha value) that will be used prior to analysis, in order to avoid the possibility of bias.
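The practical effect of these choices can be seen with a small numeric example. For a made-up slope estimate and standard error, the sketch below computes the one-tailed and two-tailed p-values for the test of β₁ = 1; the numbers are hypothetical.

```python
from scipy import stats

# Illustrative values only: a slope estimate b1 and its standard error, as they
# might be reported for model (2) on one data set.
b1, se, df = 1.18, 0.11, 20

t = (b1 - 1.0) / se                        # test statistic for H0: b1 = 1
p_two_tailed = 2 * stats.t.sf(abs(t), df)  # H1: b1 != 1
p_one_tailed = stats.t.sf(t, df)           # H1: b1 > 1 (diseconomies of scale)

print("t =", round(t, 2), " two-tailed p =", round(p_two_tailed, 3),
      " one-tailed p =", round(p_one_tailed, 3))
# For a positive t the one-tailed p-value is half the two-tailed one, so a result
# can clear a one-tailed 0.10 threshold while failing a two-tailed 0.05 threshold.
```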

2.3. The treatment of model parameters

An important issue when analysing data sets using models of the form of (3) and (4) is whether or not the significance of the model parameters is considered. Pickard et al. [8] criticized Hu [5] because he did not give any consideration to whether or not the quadratic terms in his models were significantly different from zero. Banker and Kemerer [2] also used models (2) and (4) without commenting on the significance of the parameters although they did present the values of the t statistics from which the significance can be assessed.

However, it remains a matter of personal choice whether researchers use Hu's approach and test the model as a whole irrespective of the significance of the independent variables, or use Pickard et al.'s approach and reject a model if some of the parameter values are not significant.

2.4. The use of production functions

2.4.1. Size as the dependent variable

One of the characteristics of an econometric production function is that it assumes that output is determined by input, which is usually measured in terms of labour and capital. In software cost estimation terms this means that size is determined by effort. Following the econometric approach, Hu [5] used models where size was the dependent variable and effort was the independent variable, so instead of models (1)-(4) he used the following models:

Size = β₀ × Effort^β₁ (5)

ln(Size) = ln(β₀) + β₁ ln(Effort) (6)

Size = β₀ + β₁ × Effort + β₂ × Effort² (7)

ln(Size) = β₀ + β₁ ln(Effort) + β₂ ln(Effort)² (8)

An implication of this is that his parameter estimates are likely to be completely different from those of Banker and Kemerer [2] and Kitchenham [4], as shown below. The two variables x and y are related by the models:

y = αy + βy(x - xm) (9)

x = αx + βx(y - ym) (10)

where ym is the average y value and xm is the average x value. These models are equivalent to models (2) and (6) with an adjusted constant. Using least squares:

βy = Σ(x - xm)(y - ym) / Σ(x - xm)² (11)

βx = Σ(x - xm)(y - ym) / Σ(y - ym)² (12)

Now, the squared correlation between the two variables is:

r² = [Σ(x - xm)(y - ym)]² / [Σ(x - xm)² Σ(y - ym)²] (13)

which implies that

βy² = r² Σ(y - ym)² / Σ(x - xm)² (14)

βx² = r² Σ(x - xm)² / Σ(y - ym)² (15)

Thus βy = βx if and only if Σ(y - ym)² = Σ(x - xm)².

For example, if we consider the Belady-Lehman data set and use the logarithmic transformation models (2) and (6), the regression analyses are given in Table 1. Using the analysis shown in the upper part of Table 1, Kitchenham [4] concluded that a linear model could not be rejected, since the parameter value for the independent variable is not significantly different from 1. Using the analysis shown in the lower part of Table 1, Hu [5] concluded that a linear model could be rejected, since the parameter value for the independent variable is significantly different from 1. Both are correct, but they are correct relative to different models.
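The algebra above is easy to confirm numerically. The sketch below regresses each variable on the other for a synthetic data set and checks that the product of the two slopes equals r²; the data are invented and merely stand in for any of the published data sets.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic ln(Size) and ln(Effort) values (made up; see Table 1 for the actual
# Belady-Lehman estimates).
x = np.log(rng.uniform(5, 300, 33))            # ln(Size)
y = 1.5 + 1.0 * x + rng.normal(0, 0.6, 33)     # ln(Effort)

def slope(a, b):
    """Least-squares slope of b regressed on a."""
    return np.sum((a - a.mean()) * (b - b.mean())) / np.sum((a - a.mean()) ** 2)

b_y = slope(x, y)   # ln(Effort) on ln(Size), as in Eq. (2)
b_x = slope(y, x)   # ln(Size) on ln(Effort), as in Eq. (6)
r2 = np.corrcoef(x, y)[0, 1] ** 2

print("b_y =", round(b_y, 3), " b_x =", round(b_x, 3))
print("b_y * b_x =", round(b_y * b_x, 3), " r^2 =", round(r2, 3))
# b_y * b_x equals r^2, so b_y = 1/b_x only when r^2 = 1; testing whether the
# exponent equals 1 is therefore a different hypothesis in the two directions.
```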

2.4.2. Assumptions underlying production functions and data envelopment analysis

The production function approach embodies a very different viewpoint of the relationship between effort and size from the viewpoint implicit in regression analysis. To clarify this issue, I include a brief discussion of production functions and efficiency assessment based on Coelli et al. [10].

A production function defines the maximum output(s) attainable from a given vector of inputs. A production function is also referred to as a production frontier because of its maximal property. Firms (or more formally 'decision-making entities') either operate on the production frontier, in which case they are called technically efficient, or beneath the frontier if they are not technically efficient. Coelli et al. identify four methods used to study productivity and efficiency:

• least-squares econometric production models;

• total factor productivity indexes;

• data envelopment analysis;

• stochastic frontiers.

Both least-squares econometric models and productivity indexes assume that all firms in a business sector are technically efficient. They are usually used to assess the impact of technical change over different time periods. DEA and stochastic frontiers do not assume all firms are technically efficient. They are often used to compare the efficiency of different firms at one point in time.

The main difference between DEA and stochastic frontiers concerns the form of error term used for each model. Assume a production function takes the form:

ln(y) = xβ (16)

where ln(y) is the logarithm of the (scalar) output of a firm; x is a (k + 1) row vector whose first element is 1 and whose remaining elements are the logarithms of the k input variables used by the firm; and β = (β₀, β₁, ..., βₖ) is a (k + 1) column vector of unknown parameters that need to be estimated. Eq. (16) is a generalization of Eq. (2).

Given a sample of N data points, for DEA, the model including the error term is:

ln(yᵢ) = xᵢβ - uᵢ (17)

where i = 1, ..., N indexes the data points.

In this model, the error term uᵢ is a non-negative number, which in econometric terms represents the technical inefficiency in production of firms in the industry being modelled. Thus, DEA constructs a frontier function where no data point is above the frontier and all data points are on or below the frontier. (Note that in this discussion I am considering an output-oriented production function. For an input-oriented production function we look for a frontier representing the minimum input necessary to achieve a particular output. In this case, the x and y axes are inverted and inefficient firms operate above the frontier.)

The DEA frontier is deterministic, in the sense that the observed output yᵢ is bounded above by the non-stochastic quantity exp(xᵢβ). It takes no account of the possible impact of measurement errors, model incompleteness, or other noise in the data.

A stochastic frontier production function attempts to allow for such noise by adding a further random error term, vᵢ, to the model, to give:

ln(yᵢ) = xᵢβ - uᵢ + vᵢ (18)

The random error term accounts for measurement error, model incompleteness and other random factors (e.g. luck, staff ability, user motivation etc.). Thus, using a stochastic frontier model, we do not expect the frontier to bound all the data points. Some data points may be above the frontier and some below it. DEA derives a frontier from the most productive points in the data set, whereas a stochastic frontier is obtained by considering the values of all points in the data set.
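To make the contrast concrete, the sketch below fits an ordinary regression line to synthetic project data and then derives a simple deterministic frontier by shifting that line until every project lies on or above it, i.e. a minimum-effort frontier with a one-sided error term as in Eq. (17). This corrected ordinary least squares (COLS) construction is only a stand-in for DEA, which builds a piecewise-linear envelope by linear programming, and for stochastic frontiers, which require maximum-likelihood estimation; the data are invented.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic project data (hypothetical): effort grows roughly linearly with size,
# with multiplicative noise.
size = rng.uniform(5, 300, 40)                      # KLOC
effort = 6.0 * size * rng.lognormal(0.0, 0.4, 40)   # person-hours

x, y = np.log(size), np.log(effort)

# Ordinary least squares: the central tendency of the data set.
X = np.column_stack([np.ones_like(x), x])
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_ols

# Corrected OLS: shift the intercept by the most negative residual so that no
# project lies below the line, giving a deterministic minimum-effort frontier.
beta_cols = beta_ols.copy()
beta_cols[0] += residuals.min()

print("OLS slope (central tendency):   ", round(beta_ols[1], 3))
print("COLS frontier slope (unchanged):", round(beta_cols[1], 3))
print("intercept shift to the frontier:", round(residuals.min(), 3))
```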

Banker et al. [3] incorporated a statistical assessment into their DEA analysis in order to test whether a DEA model assuming constant returns to scale was a better fit to the data than a model based on mixed returns to scale. They did this by allowing the error term, u, to be distributed either as a half normal or as an exponential distribution. However, they did not include a random error term.

If we want to use DEA to determine whether or not software project data sets exhibit certain properties, we need to consider whether or not the assumptions inherent in DEA apply to software projects. Several DEA assumptions are defined in Table 2 and compared with the situation found in software data sets. Table 2 indicates that data sets of software projects cannot be assumed to conform to the assumptions underlying DEA. Therefore, I do not believe that the results of a DEA analysis can be used to investigate economies or diseconomies of scale in software projects except in very special circumstances. In most cases, I believe we need to use a regression line or a stochastic frontier.

A stochastic frontier might be appropriate if we had a data set comprising either groups of projects from different companies or groups of projects from the same company that use different development methods.

If we have projects from the same company using the same development methods, the technical efficiency term would effectively be a constant (U), not a variable. This means that our model would be:

ln(yᵢ) = xᵢβ - U + vᵢ (19)

The constant U can be regarded as a part of the intercept parameter, β₀, and we now have a simple multiple regression model.

Table 2
DEA assumptions compared with the characteristics of software data sets

Assumption: There is no measurement error or noise in input or output variables.
Situation for software data sets: In the case of size measured in function points, there is strong evidence of measurement error [14]. In the case of effort, in my experience there is measurement error due to factors such as post-hoc reporting, unpaid overtime and budget manipulation (i.e. assigning costs that belong to over-spending projects to projects that are under-spending).

Assumption: All relevant input variables are included in the model.
Situation for software data sets: There is no general agreement in commercial cost models about which variables need to be included. Most, however, include variables other than size alone. Cost models based solely on size usually perform badly [18], suggesting other factors are important.

Assumption: All the data points in a data set represent independent decision-making entities.
Situation for software data sets: A decision-making entity is one that has complete autonomy regarding the use of inputs. In many firms this would mean being able to decide the proportion of investment allocated to capital and labour and how capital and labour are used to produce the firm's outputs. The technical efficiency of the firm would relate to these management decisions. In the software industry, such decisions might relate to the proportion of contractors to employees, investment in technology and training, etc. In some firms, software project managers have this sort of autonomy, but in others they do not. For example, the Kitchenham-Taylor data set [6] includes data from ICL VME projects. All these projects were enhancements to an existing mainframe operating system. All projects used the same development methods and tools, and project managers had no authority to change working methods. In such a case, I do not believe projects can be regarded as 'decision-making entities'. In some organizations, project managers may have complete autonomy, but certainly not in all organizations. Furthermore, our data sets do not provide sufficient information to know whether they do or do not.

It is also important to note that whatever the nature of a data set, there is no reason why the DEA frontier should be comparable to a stochastic production frontier or a regression line. It is quite possible for the central tendency of a data set to be linear while, at the same time, a function fitted to the most productive data points is non-linear. Thus, in my view, it is not possible for the results of a DEA analysis to formally disprove the results of a regression analysis. They do not measure the same properties of the data set.

There is also a problem with the statistical tests Banker et al. applied to their DEA analysis. They attempted to test whether a variable return to scale was a better fit to the DEA frontier than a constant return to scale. However, they based their F tests on 2n by 2n degrees of freedom where n is the number of projects in each data set. Thus, they appear to have assumed both that the outputs from different models derived from the same data set were independent, and that effort and size for each data point can be considered independent. However, Stensrud and Myrtveit [11] have pointed out that alternative predictions made using the same data must be treated as paired data points not independent data points. Assuming that Banker et al. should have used n/2 by n/2 degrees of freedom, their tests of the null hypothesis of constant returns to scale based on a DEA viewpoint change substantially. For the exponential inefficiency model, the hypothesis would be rejected for only two of the 11 data sets (p < 0.05), rather than rejected for eight data sets as stated in the paper. For the half-normal inefficiency model, the hypothesis would be rejected for five of the data sets (p < 0.05) rather than rejected for nine of the data sets as stated in the paper.

2.5. Sensitivity analysis

Pickard et al. [8] criticised Hu because he did not perform any sensitivity analysis, i.e. he did not check that the models he obtained for each data set were robust with respect to outliers. This criticism can also be levelled at Banker and Kemerer [2] and Kitchenham [4].

Pickard et al. [8] pointed out that the quadratic models appeared very sensitive to outliers. Banker and Kemerer [2] noted that the introduction of quadratic terms could produce collinearity problems. In their subsequent paper, Banker et al. [3] found that the log-quadratic model (4) showed evidence of severe collinearity but the simple quadratic model (3) did not. Thus, they used model (3) and performed further sensitivity analyses by assessing the impact of removing 'outliers'. Again, they based their analysis on the complete reported data sets rather than homogeneous subsets.

The interesting issue here is to consider which points they identified as outliers. For example, they concluded that the Kemerer data set did not have an outlier, the Belady-Lehman data set had two outliers and the MERMAID-2 data set had one outlier. The effort against size scatter plots for these data sets are shown in Figs. 1-3, respectively. It is clear that in all the data sets there are more unusual data points than the outliers Banker et al. found. Banker et al. [3] used the Belsley-Kuh-Welsch criteria to determine whether or not a point was an outlier. The problem with this approach is that such criteria detect outliers relative to a specified model.

2.5.1. The Kemerer data set

Fig. 1. Kemerer data set.

The point labelled A in Fig. 1 may not be an outlier with respect to the quadratic model; it is, however, a high leverage point. Furthermore, as shown in Table 3, if it is omitted from the analysis, the quadratic term is no longer significant. In fact, the quadratic term, the linear term and the constant are all not significantly different from zero. Table 3 also shows the simple linear model derived from the Kemerer data set omitting point A. In this case, the linear term is significantly different from zero. Banker et al. [3] criticised the use of the simple log linear model because estimates of coefficients may be biased and standard errors inflated if the model is mis-specified by the omission of a relevant variable. However, the results given in Table 3 suggest that the inclusion of spurious variables can also distort statistical analyses.
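To illustrate this kind of check, the sketch below fits a quadratic model to synthetic data containing one isolated large project, identifies the highest-leverage point using standard regression influence measures, and refits without it. The data are made up, and leverage and Cook's distance are used here simply as convenient diagnostics; they are not the Belsley-Kuh-Welsch criteria applied by Banker et al.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)

# Synthetic data set with one isolated large project, loosely mimicking the
# high-leverage situation described for point A in Fig. 1 (values are invented).
size = np.append(rng.uniform(50, 500, 14), 2300.0)
effort = np.append(1000 + 18 * size[:-1] + rng.normal(0, 2500, 14),
                   1000 + 18 * 2300 + 45000)

X = sm.add_constant(np.column_stack([size, size ** 2]))   # quadratic model (3)
quad = sm.OLS(effort, X).fit()

infl = quad.get_influence()
leverage = infl.hat_matrix_diag
worst = int(np.argmax(leverage))
print("highest-leverage point:", worst, " leverage =", round(leverage[worst], 2),
      " Cook's distance =", round(infl.cooks_distance[0][worst], 2))

# Refit without the high-leverage point and compare the quadratic term.
keep = np.arange(len(size)) != worst
quad_wo = sm.OLS(effort[keep], X[keep]).fit()
print("quadratic p-value, all points:   ", round(quad.pvalues[2], 3))
print("quadratic p-value, point removed:", round(quad_wo.pvalues[2], 3))
```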

2.5.2. The Belady-Lehman data set

Analysis of the Belady-Lehman data set suggests that Banker et al. [3] identified points A and C as outliers, although it would appear from Fig. 2 that point B is also a high influence data point. Table 4 shows the quadratic analysis when all the data are used, when points A and C are omitted, and when points A, B and C are omitted. The quadratic analysis with all data points indicates weak support for a quadratic model (assuming a 0.10 significance level). The quadratic model omitting points A and C suggests strong support for a quadratic model. However, the quadratic model omitting points A, B and C suggests no support for a quadratic model, with all the parameters being non-significant. Table 4 also shows a simple linear regression on the Belady-Lehman data set excluding points A, B and C, which suggests that there is a strong linear relationship between effort and size. The results in Table 4 again confirm that the inclusion of an unnecessary quadratic term can distort the analysis.

2.5.3. The MERMAID data set

For the MERMAID-2 data set shown in Fig. 3, Banker et al. [3] treated point B as an outlier but not point A. However, the quadratic term is significant if only project B is removed from the analysis, but is no longer significant if both projects A and B are removed (see Table 5).

3. Statistical issues

There are several technical concerns with hypothesis testing that raise problems when attempting to test hypotheses about the parameters in regression models and when attempting to synthesize results from several different data sets. These factors do not explain why researchers disagree; they illustrate some problems with hypothesis testing procedures.

3.1. The relationship between correlation and regression

As shown in Eqs. (14) and (15), there is a functional relationship between the parameter β₁ in a linear model and the correlation coefficient. Thus, if we have messy data with a low correlation coefficient, we will underestimate the value of β₁.

Fig. 2. Belady-Lehman data set.

3.2. Errors in the independent variables

Ordinary least-squares regression assumes that the independent variable(s) are measured without error (see, for example, Ref. [12], Section 9.14 or Ref. [13], Section 3.4). It may be reasonable to assume size is error free if it is measured in lines of code after system testing and all products were developed from scratch. It is more difficult to accept the assumption if size is measured in function points [14] and/or the product is an enhancement of an existing product or includes reused code.

If there are errors in the independent variables, ordinary least squares will underestimate the values of the parameters. If an independent variable X is measured without error, the usual regression model is:

Y = α + β(X - Xm) + δ (20)

where Xm is the mean of X.

If X is measured with error, we actually measure X′ = X + ε. If δ, ε and X are all normally and independently distributed, Y and X′ follow a bivariate normal distribution and the regression coefficient is:

β′ = β/(1 + λ) (21)

where λ is the ratio of the measurement error variance (the variance of ε) to the model error variance (the variance of δ). In this case, β′ is called the structural regression coefficient.

Snedecor and Cochran [12] point out that if the objective is to predict the population regression line or the value of an individual Y from the sample of values (Y, X′), ordinary least squares can be used with X′ instead of X. The effect of the errors in X is to decrease the accuracy of predictions, since the residual error is increased.

Fig. 3. The Mermaid-2 data set (1 = New projects, 2 = Enhancement projects, 3 = Type unknown).

Table 3
The Kemerer data set (17 projects)

Independent variable    Value       Standard error    t        P > |t|    [95% Confidence interval]

Quadratic model using all data points
Constant                26267.62    13504.25          1.945    0.072      [-2696.11, 55231.36]
Size FP                 -58.62      24.952            -2.349   0.034      [-112.1381, -5.1032]
Size²                   0.0487      0.01058           4.605    0.000      [0.02603, 0.07142]

Quadratic model omitting point A
Constant                2747.66     11240.1           0.244    0.811      [-21535.11, 27030.43]
Size FP                 25.20       27.70             0.910    0.379      [-34.63, 85.03]
Size²                   0.0036      0.0153            0.237    0.816      [-0.03671, 0.02945]

Linear model omitting point A
Constant                4808.6      6870.6            0.700    0.495      [-9927.38, 19544.55]
Size FP                 18.850      6.6572            2.832    0.013      [4.572, 33.129]

However, if we want to use regression analysis to decide whether there are economies or diseconomies of scale, we are interested in the value of β, so we need to be aware that our estimate will be biased if we have measurement errors in our input variable(s).
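A small simulation makes the attenuation visible: increasing amounts of random measurement error are added to an error-free regressor and the ordinary least-squares slope is re-estimated each time. The numbers are arbitrary; only the direction of the bias matters.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5000

# Error-free regressor (think ln(Size)) and a response with true slope 1.0.
x_true = rng.normal(4.0, 1.0, n)
y = 0.5 + 1.0 * x_true + rng.normal(0, 0.4, n)   # model error only

def ols_slope(x, y):
    return np.cov(x, y, bias=True)[0, 1] / np.var(x)

print("slope with error-free X:", round(ols_slope(x_true, y), 3))

# Add measurement error to X and watch the estimated slope shrink towards zero.
# For prediction from the observed X' this attenuated slope is the appropriate
# one; the bias matters when beta itself is the quantity of interest.
for sd_err in (0.5, 1.0, 2.0):
    x_obs = x_true + rng.normal(0, sd_err, n)
    print("measurement error sd =", sd_err, " slope =", round(ols_slope(x_obs, y), 3))
```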

3.3. The problem of combining evidence

Both Hu [5] and Banker et al. [3] attempted to combine the evidence from the individual data sets into an overall test of the null hypothesis. Banker et al. used a method based on aggregating evidence about the significance levels of the tests of the null hypothesis for each individual test. Hu accumulated all the data into a single large data set and analysed that data set in the same way that he analysed the individual data sets.

It would be extremely useful to have a standard method of accumulating evidence from independent experiments. Medical statisticians have devoted considerable research to such techniques, but they have also recognised that there are significant problems [9]. Several software engineering researchers have looked at the techniques available for combining evidence and noted the same problems [9,15,16].

Table 4
Applying a quadratic model to the Belady-Lehman data set (all values)

Independent variable    Value       Standard error    t        P > |t|    [95% Confidence interval]

Quadratic model, all data points
Constant                -309.06     369.822           -0.836   0.410      [-1064.33, 446.22]
Size (KLOC)             17.5713     5.5486            3.167    0.004      [6.239, 28.903]
Size²                   -0.0156     0.0089            -1.757   0.089      [-0.0338, 0.00254]

Quadratic model excluding points A and C
Constant                -39.80      71.612            -0.556   0.583      [-186.49, 106.89]
Size (KLOC)             9.755       1.2437            7.844    0.000      [7.2078, 12.3028]
Size²                   -0.0093     0.0018            -5.098   0.000      [-0.0130, -0.0055]

Quadratic model excluding points A, B and C
Constant                69.41       86.45             0.803    0.429      [-107.97, 246.80]
Size (KLOC)             4.187       2.9731            1.408    0.170      [-1.9130, 10.287]
Size²                   0.0181      0.01354           1.339    0.192      [-0.0097, 0.0459]

Linear model excluding points A, B and C
Constant                -8.37       64.914            -0.129   0.898      [-141.34, 124.60]
Size (KLOC)             8.002       0.8603            9.302    0.000      [6.240, 9.764]

Table 5
The Mermaid-2 data set (30 projects)

Independent variable      Value       Standard error    t        P > |t|    [95% Confidence interval]

Quadratic model excluding point B
Constant                  3502.969    2319.039          1.511    0.143      [-1263.883, 8269.822]
Size (Adjusted FPs)       -6.6921     17.9740           -0.372   0.713      [-43.6382, 30.2538]
Size²                     0.04982     0.02290           2.176    0.039      [0.00275, 0.09688]

Quadratic model excluding points A and B
Constant                  -397.85     2185.956          -0.182   0.857      [-4899.913, 4104.207]
Size (Adjusted FPs)       53.840      22.190            2.426    0.023      [8.138, 99.541]
Size²                     -0.0561     0.0345            -1.624   0.117      [-0.1272, 0.0150]

Four particular problems affect Banker et al.'s and Hu's research:

1. If you attempt to aggregate test statistics from different models, you may be able to reject the null hypothesis, but that does not imply you can accept the alternative hypothesis. For example, if 50% of the individual studies rejected the null hypothesis of a linear model, an aggregate test would reject the null hypothesis of a linear model. However, with the same data sets, if you used the null hypothesis of a quadratic model, that null hypothesis would also be rejected.

2. Standard meta-analysis techniques are based on estimating the 'size of the response variable' by combining the data from different studies. The standard techniques do not encourage accumulating the data into a single data set. Response size measures are usually derived from weighted averages of the values obtained in each study, where the weights are based on the number of data points in each study and/or the standard error of the response size measure (see the sketch after this list).

3. Meta-analysis is more appropriate for formal experiments than for observational studies.

4. Meta-analysis is not reliable if the phenomenon under study is heterogeneous, i.e. response effects are not consistent among the studies.
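For completeness, the sketch below shows the weighted-average form of pooling mentioned in point 2: a fixed-effect, inverse-variance combination of per-data-set estimates of the exponent β₁, with Cochran's Q as a heterogeneity check. The estimates are hypothetical, and this is not the procedure used by either Banker et al. or Hu.

```python
import numpy as np

# Hypothetical per-data-set estimates of the exponent b1 from model (2), with
# their standard errors (made-up numbers standing in for published results).
b1 = np.array([1.06, 0.72, 1.18, 0.95, 1.30])
se = np.array([0.10, 0.07, 0.22, 0.15, 0.25])

# Fixed-effect (inverse-variance) pooling, the usual meta-analytic weighting.
w = 1.0 / se ** 2
b1_pooled = np.sum(w * b1) / np.sum(w)
se_pooled = np.sqrt(1.0 / np.sum(w))
print("pooled b1 =", round(b1_pooled, 3), "+/-", round(1.96 * se_pooled, 3))

# Cochran's Q: a value large relative to k-1 signals heterogeneity, in which
# case a single pooled estimate is misleading (problem 4 above).
Q = np.sum(w * (b1 - b1_pooled) ** 2)
print("Cochran's Q =", round(Q, 2), "with", len(b1) - 1, "degrees of freedom")
```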

These factors suggest that although it would be extremely useful to have a formal means of accumulating evidence, the currently accepted methods are unlikely to be relevant to the problem of determining whether or not there are economies or diseconomies of scale in software products.

4. Discussion

The major difference between Hu's regression results and the regression results obtained by Kitchenham [4] and Banker et al. [3] is the use of size as the dependent variable rather than the independent variable. The difference between the regression results obtained by Kitchenham [4] and Banker et al. [3] arises because they have different approaches to hypothesis testing and because Banker et al.'s sensitivity analysis favoured quadratic models. However, the results are actually more similar than they first appear. Banker et al. [3] found that the linearity hypothesis could be rejected on seven out of 11 data sets (64%). Omitting the Kitchenham-Taylor data set on the basis that it is composed of three small data sets, and reclassifying the Belady-Lehman, Kemerer and MERMAID-2 data sets as supporting the linear hypothesis, the linearity hypothesis can be rejected in only three out of 10 data sets (30%). Furthermore, two of those data sets (Berens and Albrecht-Gaffney) were not included in the data sets analysed by Kitchenham [4]. The remaining data set is the Bailey-Basili data set. This is shown in Fig. 4 with the linear model and the quadratic model. The quadratic model looks a superior fit to the data, but neither model is very satisfactory because of the lack of data points in the size region 15-45 KLOC.

Fig. 4. The Bailey-Basili data set.

The difference between DEA results and regression-based results is due to the fact that DEA constructs a production frontier based on the most productive projects while regression measures the central tendency of the data set. Since they measure different data set properties, it is invalid to suggest that DEA results can confirm or disconfirm the regression results. It is more appropriate to consider which technique is most appropriate given the nature of the data sets being analysed. In my opinion, the assumptions made by DEA are not valid for most software data sets, so we need to use regression models or stochastic production frontiers to investigate economies or diseconomies of scale.

With respect to any final conclusions about economies or diseconomies of scale, I believe the nature of software data sets makes it difficult to come to any firm conclusion. Many of our data sets are highly skewed with a few large data points. In such circumstances, it is almost impossible to be sure that any indication of economies or diseconomies of scale is not an artefact of the data set. Furthermore, many statisticians are concerned about the validity of hypothesis testing and tests of significance [17]. For example, even though I have argued that in most cases it is not valid to reject the linear hypothesis, the 95% confidence intervals for the parameter β₁ given in Table 1 make it clear that substantial economies or diseconomies of scale cannot be ruled out. The other data sets have similarly large confidence intervals for β₁.

It is also worth asking whether the choice of model makes any practical difference. For example, Dolado [18] compared linear regression models with models based on genetic programming on 12 data sets. Six of the data sets were old data sets (i.e. the Belady and Lehman, the Boehm, the Albrecht and Gaffney, the Kemerer, and the Kitchenham and Taylor data sets); the other six data sets were more recent. Dolado used genetic programming to derive cost models because it permits a large variety of different models to be assessed. However, in spite of genetic algorithms finding very diverse models, marginal cost analysis of the equations did not detect any significant deviations from the linear model.

I would suggest that any software data set should be treated on its own merits. Any analysis should start by viewing a scatter plot to get a visual understanding of the nature of the data set and, in particular, to identify whether there are any unusual data points. I do not believe that detection of 'outliers' can be delegated to the selection provided automatically by a statistical procedure. In addition, the logarithmic transformation is the best starting point for analysis of software data sets that exhibit an unstable variance, i.e. where the scatter plot of size against effort is fan-shaped. The logarithmic transformation stabilizes the variance, reduces skewness and reduces the impact of outliers. Thus, in practice, the log transformation appears to be a sensible preliminary to analysis even if the relationship is linear. It should, however, be noted that experienced statisticians might feel more comfortable using a generalized linear model that allows the analyst to choose a log link in combination with a different distribution for the dependent variable, such as a gamma distribution [19].
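As a sketch of that alternative, the following compares a log-log ordinary least-squares fit with a generalised linear model using a gamma error distribution and a log link, on fan-shaped synthetic data. The statsmodels calls are standard, but the data and parameter values are invented for illustration.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)

# Fan-shaped synthetic data: the spread of effort grows with size (hypothetical).
size = rng.uniform(10, 400, 60)
effort = 8.0 * size * rng.gamma(shape=4.0, scale=0.25, size=60)

X = sm.add_constant(np.log(size))

# Option 1: log-transform the dependent variable and use ordinary least squares.
log_ols = sm.OLS(np.log(effort), X).fit()

# Option 2: generalised linear model with a gamma error distribution and a log
# link, which keeps effort on its original scale while handling the unstable
# variance.
glm_gamma = sm.GLM(effort, X,
                   family=sm.families.Gamma(link=sm.families.links.Log())).fit()

print("log-OLS exponent:  ", round(log_ols.params[1], 3))
print("GLM gamma exponent:", round(glm_gamma.params[1], 3))
```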

5. Conclusions

I have shown that the reason why different researchers have drawn different conclusions when analysing the same data sets is that each researcher based his/her conclusions on different assumptions. Some of the assumptions are explicit, such as the choice of significance level; others are implicit in the choice of analysis technique, such as the use of the production function approach or the choice of outlier detection method.

I believe it is important to analyse sets of data sets in order to identify common trends. However, I believe that the software engineering research community needs to accept common standards for such research. In particular:

1. We need to agree criteria for including and excluding data sets.

2. We need to agree criteria for deciding whether or not data sets should be partitioned.

3. We need researchers to publish their data sets. For this reason, the size and effort data for the MERMAID-2 data set is included in the Appendix A. I must also apologise to the research community because I am not permitted to publish one of the other data sets used in my previous paper [4]. Personally, I have decided that in the future, I will not publish any articles relating to data sets that I am not permitted either to publish or to make available for independent audit.

4. When using analysis techniques from other disciplines, we need to explain their underlying assumptions and be certain that the assumptions apply in our circumstances.

5. We need to agree criteria for sensitivity analysis.

Acknowledgements

I am indebted to Ingunn Myrtveit and Eric Stensrud for helping me to understand DEA and production functions.

Appendix A. MERMAID-2 data set (n/a = information not available)

See Table A1.

Table A1

Project number    Adjusted FP    Raw FP    Total effort (h)    Total duration (months)    Project type: N (new), E (enhancement)

1     23     23     238      3.45     E
2     38     42     490      6.75     E
3     36     44     616      2.9      E
4     57     51     910      2.55     E
5     36     47     1540     10       E
6     29     38     1680     10       E
7     23     34     1750     10.5     E
8     99     115    3234     9        N
9     605    550    3360     2        N
10    34     42     3850     5.5      E
11    338    371    5460     15       N
12    133    157    5110     16.25    E
13    118    107    6440     11       E
14    653    634    17920    35       N
15    502    528    18620    20       E
16    306    268    21280    27       n/a
17    170    179    24850    11.6     N
18    911    884    48230    29.6     N
19    221    235    3415     7.5      E
20    613    626    11551    7        n/a
21    1507   1408   4860     8.5      N
22    559    n/a    14224    26       E
23    218    291    9080     9        E
24    479    499    1635     9        N
25    26     33     296      4        E
26    125    137    3720     5        E
27    205    n/a    4672     6        E
28    105    109    2065     8        E
29    114    107    1690     6        E
30    36     38     504      4        E

References

[1] B.W. Boehm, Software Engineering Economics, Prentice-Hall Inc., Englewood Cliffs, NJ, 1981.

[2] R.D. Banker, C.F. Kemerer, Scale economies in new software development, IEEE Transactions on Software Engineering SE-15 (10) (1989) 1199-1205.

[3] R.D. Banker, H. Chang, C.F. Kemerer, Evidence on economies of scale in software development, Information and Software Technology 36 (5) (1994) 275-282.

[4] B.A. Kitchenham, Empirical studies of assumptions that underlie software cost estimation models, Information and Software Technology 34 (4) (1992) 304-310.

[5] Q. Hu, Evaluating alternative software production functions, IEEE Transactions on Software Engineering 23 (6) (1997) 379-387.

[6] B.A. Kitchenham, N.R. Taylor, Software project development cost estimation models, Journal of Systems and Software 5 (1985) 267-278.

[7] J.-M. Desharnais, Analyse statistique de la productivité des projets de développement en informatique à partir de la technique des points de fonction, Université du Québec à Montréal, Masters thesis, 1989.

[8] L. Pickard, B.A. Kitchenham, P. Jones, Comments on evaluating alternative software production functions, IEEE Transactions on Software Engineering 25 (2) (1999) 282-284.

[9] L. Pickard, B.A. Kitchenham, P. Jones, Combining empirical results in software engineering, Information and Software Technology 40 (14) (1998) 811-821.

[10] T. Coelli, D.S. Prasada Rao, G.E. Battese, An Introduction to Efficiency and Productivity Analysis, Kluwer Academic Publishers, Dordrecht, The Netherlands, 1998.

[11] E. Stensrud, I. Myrtveit, Human performance estimating with analogy and regression models: an empirical validation, Proceedings of the Fifth International Software Metrics Symposium (METRICS 1998), IEEE Computer Society Press, Los Alamitos, CA, 1998.

[12] G.W. Snedecor, W. Cochran, Statistical Methods, seventh ed, The Iowa State University Press, Ames, Iowa, 1980, pp. 171-172.

[13] N.R. Draper, H. Smith, Applied Regression Analysis, Wiley Series in Probability and Statistics, third ed. Wiley, New York, 1998.

[14] C.F. Kemerer, Reliability of function points measurement: a field experiment, Communications ACM 36 (2) (1993) 85-97.

[15] W. Hayes, Research synthesis in software engineering: a case for meta-analysis, Sixth International Software Metrics Symposium (METRICS 1999), IEEE Computer Society Press, Los Alamitos, CA, 1999, pp. 142-151.

[16] J. Miller, Can results in software engineering experiments be safely combined?, Proceedings of the Sixth International Software Metrics Symposium, (METRICS 1999), IEEE Computer Society Press, Los Alamitos, CA, 1999, pp. 152-158.

[17] M.R. Nester, An applied statistician's creed, Applied Statistics 45 (4) (1996) 401-410.

[18] J.J. Dolado, On the problem of the software cost function, Information and Software Technology 43 (1) (2001) 61-72.

[19] R. Fewster, E. Mendes, Empirical evaluation and prediction of web applications, Fourth International Conference on Empirical Assessment and Evaluation in Software Engineering, EASE 2000, Keele University, 2000.