Contents lists available at ScienceDirect

Regulatory Toxicology and Pharmacology

journal homepage: www.elsevier.com/locate/yrtph

Current modeling practice may lead to falsely high benchmark dose ■. estimates

CrossMark

Joakim Ringblom *, Gunnar Johanson, Mattias Oberg

Karolinska Instituted Institute of Environmental Medicine, Unit of Work Environment Toxicology, P.O. Box 210, SE-171 77 Stockholm, Sweden

ARTICLE INFO

Article history:

Received 10 December 2013 Available online 21 March 2014

Keywords: Benchmark dose Simulation Risk assessment Dose-effect Point of departure

ABSTRACT

Benchmark dose (BMD) modeling is increasingly used as the preferred approach to define the point-of-departure for health risk assessment of chemicals. As data are inherently variable, there is always a risk to select a model that defines a lower confidence bound of the BMD (BMDL) that, contrary to expected, exceeds the true BMD. The aim of this study was to investigate how often and under what circumstances such anomalies occur under current modeling practice. Continuous data were generated from a realistic dose-effect curve by Monte Carlo simulations using four dose groups and a set of five different dose placement scenarios, group sizes between 5 and 50 animals and coefficients of variations of 5-15%. The BMD calculations were conducted using nested exponential models, as most BMD software use nested approaches. "Non-protective" BMDLs (higher than true BMD) were frequently observed, in some scenarios reaching 80%. The phenomenon was mainly related to the selection of the non-sigmoidal exponential model (Effect = a ■ eb dose). In conclusion, non-sigmoid models should be used with caution as it may underestimate the risk, illustrating that awareness of the model selection process and sound identification of the point-of-departure is vital for health risk assessment.

© 2014 The Authors. Published by Elsevier Inc. This is an open access article under the CCBY-NC-ND license

(http://creativecommons.org/licenses/by-nc-nd/3XI/).

1. Introduction

Health risk assessment of chemicals is the process of characterizing and quantifying the potential adverse health effects associated with exposure to chemicals (NRC, 1994). Characterizing the hazardous properties of a chemical agent in quantitative terms includes selection of a relevant dataset and description of the dose-effect relationship for the critical effect. With the exception of direct acting carcinogenic substances it is assumed that there are some doses that do not result in adverse effects (Dybing et al., 2002; Edler et al., 2002). An important part of the risk assessment is therefore the process of finding the threshold dose, below which toxicity is not expected. Experimental data are used to describe the dose-effect relationship and to define the point of departure (POD), for development of reference or limit values such as acceptable or tolerable daily intakes, occupational exposure limits, derived no

Abbreviations: BMD, benchmark dose; BMDL, lower confidence bound of the benchmark dose; NOAEL, no observed adverse effect level; LOAEL, lowest observed adverse effect level; EFSA, European Food Safety Authority; OECD, Organisation for Economic Co-operation and Development; USEPA, United States Environmental Protection Agency.

* Corresponding author. Fax: +46 8 31 41 24.

E-mail addresses: joakim.ringblom@ki.se (J. Ringblom), gunnar.johanson@ki.se (G. Johanson), mattias.oberg@ki.se (M. Öberg).

effect levels, population adjusted doses and acute guidance values. Such values are key elements in the regulatory process to define the quality standards for water, food, air, work places etc. It is therefore of vital interest that the methods used to interpret toxi-cological data are continuously examined and further developed to ensure the best use of data.

Traditionally, the no-observed-adverse-effect level (NOAEL) method is used as the POD in regulatory toxicity testing. The NOAEL is defined as the highest tested dose that does not give an effect which is statistically significant from that in the control group (WHO, 1999). The benchmark dose (BMD), on the other hand, is defined as the dose that results in a predefined effect level, ideally at a level that is close to, but not yet, adverse. The BMD method is suggested as a more scientifically sound alternative to the NOAEL method and has been implemented in most regulatory guidance documents as an alternative or preferred approach (ECHA, 2008; EFSA, 2009; NAC/AEGL, 2001; Solecki et al., 2005; USEPA, 1995; WHO, 2009). The BMD method involves fitting a dose-effect curve, or a set of curves, to the data of interest. A critical effect size, also referred to as benchmark response, of 5% is often used as a default for continuous data. The lower bound of the 90% confidence interval of the BMD (BMDL) is suggested to be used as the POD (Crump, 1984; Sand et al., 2008; USEPA, 2012).

http://dx.doi.org/10.1016/j.yrtph.2014.03.004 0273-2300/© 2014 The Authors. Published by Elsevier Inc.

This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/3.0/).

The BMD approach overcomes many of the weaknesses of the NOAEL (Allen et al., 1994; Crump, 1984). BMD incorporates information on the sample size and the shape of the dose-effect curve. In addition, BMD values are not constrained to one of the experimental doses, and are less dependent on the study design. Furthermore, high variability in the data and low number of subjects both tend to result in a higher NOAEL, i.e. a less precautious POD, whereas the BMDL values tend to become lower, i.e. more precau-tious, as variability and sample size decreases. BMD aligned design of experiments have also been suggested as a mean to refine and reduce animal experiments (Oberg, 2010).

Several dose-effect or dose-response models may be fitted to the data. Historically, power models and polynomials were used for continuous data, but in recent years more attention has been given to the nested exponential models and nested Hill models that describe sigmoidal curves that level off at higher doses and are assumed to be more biologically relevant (Sand et al., 2008). Nested models are included in the most frequently used software for the statistical analysis of dose-effect data, i.e. BMDS (USEPA, 2013), developed by the US EPA, and PROAST (Slob, 2011), developed by the Dutch National Institute for Public Health and the Environment (RIVM). These models are also recommended by EFSA (EFSA, 2009). In the nested approaches, the software algorithm moves from simpler (few parameters) to more complex models (more parameters) according to statistically based decision rules.

In theory, the BMDL should be lower than the "true" BMD in 95%, and higher in 5% of all cases, provided that the correct model is used. The assumption of model correctness is easily ignored by risk assessors as a possible source of error. However, considering the variability of the experimental data, there is an obvious risk that the use of any approach results in an inappropriate model selection (i.e. selection of a model that deviates from the true dose-effect relationship). This might in turn result in a misleading calculation of the BMD and BMDL. Thus, there is a risk that estimated BMDL values become higher than the true BMD far more frequently than the expected 5%. A high frequency of "false" or "non-protective" BMDLs would be unfortunate, as the resulting POD would then be less protective than assumed. The aim of this study was to use simulated data to investigate how frequently and under what circumstances non-protective BMDL values occur with continuous data, and to quantify the possible impact on quantitative health risk assessment.

2. Materials and methods

The occurrence of non-protective BMDLs was investigated using one fixed dose-effect curve with added variability. The OECD standard of four dose group was tested, i.e. control, low, medium and high, with logarithmic dose spacing (1, 3, 10). Five dose-placement scenarios (A-E) were tested, in these the doses were placed on

different parts of the dose-effect curve, while maintaining the 13-10 dose spacing. Different dose placements and variabilities were used to take into account that in a real toxicity study the potency of the tested substance is not known beforehand and that several similar endpoints with different sensitivity and variability may be evaluated from the same study.

2.1. Scenarios

Five different dose placement scenarios with the same relative logarithmic (control, 1, 3, 10) dose spacing were used, ranging from a scenario with the three dose groups at the low end of the dose-effect curve where the highest dose were corresponding to a dose that caused 50% of the maximum effect (ED50) to a scenario with the dose group at the high end where the lowest dose corresponding to ED50 (Fig. 1). Simulations were performed with 5,10, 20 and 50 animals per dose group. Both the number of dose groups and the number of animals per group were chosen according to standard OECD Guidelines for the testing of chemicals (OECD, 2012).

2.2. Generation of dose-effect data

1200 datasets were generated for each of the five scenarios by Monte Carlo simulation. The effect size of each individual in these datasets was calculated with the exponential model (as described by Moerbeek et al., 2004):

y = a x(c -(c - 1)xe-bx" ) + e (1)

where (y) is the effect and (x) is the dose. The background effect (a) and the shape parameter (d) were both set to 1, whereas the maximum effect (c) was set to 1.3. These values are realistic (Slob and Setzer, 2013) and the same values for (a) and (c) were also used by Slob et al. (2005). The potency parameter (b) was set to 0.231 to obtain an ED50 value of 3. The critical effect size was set to 5% increase over background corresponding to a "true" BMD of 0.79. A log-normally distributed residual error (e), was added to each sample to account for inter-individual variability and measurement error. The error was defined as

(e = erZ) (2)

where r is the standard deviation of the logarithm and Z is the standard normal deviate. Coefficients of variation of 5%, 10% or 15%, were investigated, by varying r. The choice of CVs was based on earlier studies (Slob et al., 2005), where it was noted that median CV where around 10% in historical data analyzed with the benchmark dose approach.

Fig. 1. Dose placement scenarios used in the Monte Carlo generation of dose-effect data. The dose placements ranged from the lowest dose placement (dose placement A) where the highest dose was equal to the EC50 to the highest dose placement (dose placement E) where the dose in the lowest dose group was equal to the EC50. The error bars indicate the 95% confidence interval of the group means with 20 animals/group and a coefficient of variation (CV) of 10%. Data were also generated for 5,10 and 50 animals/ group and CVs of 5% and 15%.

2.3. Model selection process

The BMD calculations were conducted using the nested exponential models described by Slob (2002) (Fig. 2). The models are termed nested as the simpler models are derived from the more complex ones by setting the parameters (b), (c) and/or (d) to zero or one.

A commonly used model selection procedure suggested by Slob (2002), henceforth called the standard method, was used to select the most appropriate model for the BMD calculations. First, if no higher model gave a better fit than model 1, it was concluded that there was no dose-effect trend and no BMDL was calculated. Second, if a dose-effect trend was observed, model 2 was compared to models 3, 4 and 5 and the most favorable model according to likelihood ratio tests was then chosen. This likelihood ratio tests penalizes models with more parameters in order to reduce the risk of overparameterization and is implemented in both PROAST and BMDS as a tool for model selection within nested models.

An alternative model selection procedure was tested for comparison. This procedure was based on the same set of nested exponential models and selection via the likelihood ratio test. However,

it did not allow model 2 as final choice of model. If model 2 was the selected model using the standard method, the alternative method selected a model with a ceiling parameter (model 4 or 5) instead.

In the model fitting process some restriction was applied to the model parameters. The parameter (d) was restricted to p 1 in line with the parameter restrictions in the BMDS software. The parameter (c) was limited to values between 1.055 and 1000 as they are defaults in PROAST (v28.1) if a critical effect size of 5% is used.

2.4. Global and local goodness of fit tests

A global and a local goodness of fit test were applied to the results of the model fitting. The local and global goodness of fit tests are available in some BMD software to guide the selection of a reasonable model. The global fit of the selected model was evaluated by comparison with the fit of the full model according to the likelihood ratio test with a p-value 60.1. The full model estimates the mean and the variance of the effect for each dose group without making any interpolations between the dose groups. While it is not possible to calculate a BMDL by the full model, it may still be used as a reference.

Fig. 2. The nested exponential dose-effect models used in the Benchmark-Dose calculations (Slob, 2002). The variables (x) and (y) represent dose and effect, respectively. Model 1 is obtained from model 2 by setting b = 0, model 2 is obtained from model 3,4 or 5 by setting c = 0 and/or d = 1. The axes have no numerical values, as the curve shape (base line, steepness, maximum) will depend on the choice of model parameters.

The local fit was evaluated by calculating scaled residuals for the estimated effects:

scaled residual =

jlobserved :

i 1predicted group mean I

where 1 is the mean effect at a specific dose and the SEM is the standard error of the mean of the effect at that dose, both being estimated on log-scale. If the full model is significantly better than the selected model or if a scaled residual is p2, the selected model is generally recommended to be questioned by the assessor performing the benchmark dose analysis. In our study, a model was still selected regardless of whether the full model was significantly better than the selected model or had a scaled residual of p2, but it was highlighted that the calculated BMDL might be unreliable.

2.5. BMDL-calculations

The lower bound of the two-sided 90% confidence intervals for the BMD was defined as the BMDL and compared to the ''true'' BMD. In general, a BMDL is considered to be conservative and health protective as it is interpreted as an underestimate of the true BMD and, hence, an overestimate of the health risk. Following the definition of BMDL, the BMDL/trueBMD-ratio is expected to fall below 1 in 95% of the cases. Conversely, the occurrence of ratios above 1 in more than 5% of the cases suggests that the health risk is underestimated.

Different methods can be used to estimate the confidence interval for the BMD. In this paper the likelihood ratio method was used in all simulations as it is the most common for calculating confidence intervals for the BMD. Since the likelihood ratio method can give unreliable results in some situations (Moerbeek, 2004), the bootstrap method was used to verify the occurrence of non-protective BMDLs from the likelihood ratio method. The bootstrap method was implemented as described in Moerbeek et al. (2004).

In addition to the BMD estimates, NOAEL values for all simulations were calculated by multiple comparisons of the effects in the

different groups by a two-sided Dunnett's test at the 5% significance level. The Dunnett's test was performed on log-scale since the data were log-normally distributed. If the middle dose group was the only dose that had an effect that was significantly different from controls, the low dose group was still considered as the NOAEL even though the high dose group was not different from controls. It was also noted how often the NOAEL exceeded the true BMD.

2.6. Software

The Monte Carlo simulations were carried out using R (version 2.13.1) (R Core Development Team, 2011).The Dunnett's tests were performed using the glht function in the multcomp package (Hothorn et al., 2008) and the BMD calculations were performed using PROAST version 28.1 (Slob, 2011) with some added code. The calculations were parallelized using the snowfall package (Knaus, 2010). The research code can be provided upon request.

3. Results

The percentage of BMDLs found to be higher than the ''true'' BMD, hereafter called non-protective BMDLs, occurred in more than the expected 5% in nearly all tested scenarios. The frequency of non-protective BMDLs was clearly related to the assumed variability as well as the dose placement scenario (Fig. 3, Table 1). In general, higher frequencies of non-protective BMDLs were associated with higher variability and dose-placements in the lower end of the dose-effect curve. Thus, in the simulation with medium variability (CV = 10%), doses placed in the low region of the dose-effect curve (scenario A), and a high number of animals (50/group) the percentage of non-protective BMDLs was 76%. Maintaining all settings but decreasing the number of animals to 5 resulted in a decreased number of non-protective BMDLs down to 5%, instead the frequency of no dose-effect trend observed increased from

Fig. 3. The frequency of simulations with non-protective BMDLs (trBMDLd > 1), where BMDL is the lower bound of the two-sided 90% confidence interval of the benchmark dose and true BMD is the ''true'' benchmark dose. The BMDL values were calculated using either the standard or alternative procedure for selecting between five nested models. Dose-effect data were generated by Monte Carlo simulation from a predefined dose-effect curve with added lognormal variation (CV) of 5%, 10% or 15%. The shades of the columns indicate the number of animals per dose group. The lighter foremost columns represent 50 animals/group, followed by 20, 10 and 5 animals/group.

Table 1

confidence interval of the benchmark dose calculated using the standard procedure with nested exponential models. Dose-effect data were generated by Monte Carlo simulation from a predefined dose-effect curve with a coefficient of variation of 10%. The dose placement scenarios (A-E) are illustrated in Fig. 1. Additional tables for all scenarios can be found in Supplements.

Dose scenario Animals/group 1 < Ratio 6 3 (%) 3< Ratio 6 10(%)

A 5 4.9 0.6

10 23 0.3

20 48 0.4

50 76 0.4

B 5 74 0.4

10 80 0.4

20 67 0.0

50 35 0.0

C 5 68 1.8

10 58 1.2

20 33 0.5

50 8.2 0.0

D 5 18 27

10 5.4 15

20 4.7 2.2

50 6.1 0.0

E 5 3.8 14

10 5.5 2.3

20 7.8 0.0

50 6.8 0.0

0% to 32%. When doses were placed closer to the center of the dose-effect curve (scenarios B and C), the frequency of non-protective BMDLs ranged from 8% (scenario C, 50 animals/group) to 80% (scenario B, 10 animals/group). When doses were placed in the upper region of the dose-effect curve (scenarios D and E), the non-protective BMDLs were not as frequent compared to the situation at lower doses, ranging from 6% (scenario D, 50 animals/ group) to 46% (scenario D, 5 animals/group).

As expected, non-protective BMDLs were generally less frequent at low variability, as compared to the situation with higher variability. Still, as high as up to 84% were non-protective for dose placement A and 20 animals/group using a CV of 5% (Fig. 3). The simulations with higher variability (CV= 15%) generally showed the same tendencies under the standard procedure as those with medium variability. However, at higher dose placement (scenarios D and E) the non-protective BMDLs were more frequent than in the simulations with medium variability.

More than 90% of the simulations with non-protective BMDLs, calculated using the standard procedure for model selection,

passed the extra global and local goodness of fit tests indicating that these tests alone are not enough to safeguard against the non-protective BMDLs (data not shown).

In most cases where non-protective BMDLs were observed, the ratio between the estimated BMDL and the true BMD was usually within a factor of three (Table 1, Supplement Tables S1, S4 and S7). In the simulations where the variability was low (CV = 5%), a ratio between 3 and 10 was rarely observed. As the variability increased (CV = 10% and 15%), the frequency where the BMDLs exceeded the true BMD by a factor between 3 and 10 increased, most evident at dose placements in the upper region (scenario D and E). Under scenario D with a medium variability (CV = 10%), every fourth simulation with 5 animals/group resulted in a ratio above 3 (Table 1). In the same scenario (i.e. D) with even higher variability (CV= 15%) and 10 animals/group almost every third simulation showed a ratio above 3.

When the confidence intervals were calculated with the bootstrap method (tested in parallel), similar frequencies of non-protective BMDLs were obtained as with the likelihood ratio method (data not shown), indicating that the observed phenomenon was not a result of the likelihood ratio method.

With the alternative method, excluding model 2, the frequencies of non-protective BMDLs were never higher than 11% (Fig. 3) when the confidence intervals were estimated using the likelihood ratio method. In almost all of these simulations with the alternative method, the ratio between BMDL and the true BMD was less than a factor of 3 (Supplement Tables S2, S5 and S8). The frequencies of non-protective BMDLs were closer to the nominal 5% when the confidence intervals were estimated using the bootstrap method (data not shown).

The traditional point-of-departure (i.e. NOAEL) was also calculated and related to the true BMD at the critical effect size of 5% increase from background (Fig. 4). As expected, lower numbers of animals was related to a decreased probability of detecting any significant effect, i.e. no effect detected at any dose. It was observed that with dose placements in the higher end of the dose-effect curve and with higher number of animals per dose group the probability of only detecting a LOAEL increased substantially. In addition, a considerable percentage of the simulations resulted in NOAEL values that were 3-10 times higher than the true BMD (Supplemental Tables S3, S6 and S9). In some scenarios, with few animals per dose group, the NOAEL even exceeded the true BMD more than tenfold. For example, every second simulation of scenario E with a high variability (CV=15%) and five animals per group resulted in NOAEL values more than ten times higher than the true BMD, i.e. higher than the effect defined as close to, but not yet, adverse.

Distribution of „BMbmd ratios, where BMDL is the lower bound of the two-sided 90%

Fig. 4. The frequency of simulations with non-protective NOAELs (¿NOAmb > 1), where true BMD is the "true" benchmark dose. Dose-effect data were generated by Monte Carlo simulation from a predefined dose-effect curve with added lognormal variation (CV) of 5%, 10% or 15%. The shades of the columns indicate the number of animals per dose group. The lighter foremost columns represent 50 animals/group, followed by 20,10 and 5 animals/group. Note that NOAELs and BMDLs are different concepts so this figure is not directly comparable to Fig. 3.

4. Discussion

Theoretically, the BMDL should be higher than the true BMD value in 5% of all cases, assuming that the selected model describes a true dose-effect relationship. BMDLs that are higher than the underlying BMD are unfortunate, as the resulting POD is less protective than assumed. According to our simulations, the frequency of non-protective BMDLs (i.e. BMDL/trueBMD > 1) are high under most scenarios and reach as high as 80% under some scenarios (Fig. 3). Our results challenge the interpretation of BMDL values as being equivalent to conservative estimates of no-effect threshold levels. For the non-protective BMDLs, the ratios between the BMDL and the true BMD were in most cases found to be within a factor of three. However, in scenarios with few animals per group and dose placements in the upper region of the dose-effect curve, the ratios ranged from 3 to 10 in 31% of the simulations, thereby clearly underestimating the risk at the POD.

Because of the increased awareness of the BMD approach as a preferred method for establishing the POD in quantitative risk assessment, it is of interest that the model selection procedure used to interpret toxicological data is continuously examined and further developed, thereby to ensure the best use of data. Derived BMDL values that frequently underestimate the risk for health effects are certainly unsatisfactory.

We evaluated the underlying reason for non-protective BMDLs in our simulations by tracking which equation was used following a common algorithm for model selection. Simulations resulting in non-protective BMDLs were found to be strongly associated with selection of the second exponential model in the nested structure for the BMD calculations. In 60% of the simulations with scenario C, 10 animals/group and an assumed CV of 10%, the BMDLs were non-protective. In almost all these situations, model 2 was the selected model (for example see Fig. 5). Similar results (i.e. model 2 results in non-protective BMDLs) were also observed with the other scenarios (data not shown). Fig. 6 illustrates how the selection of model 2 can result in falsely high BMDL-values; a single simulation in which model 2 is selected as the base for BMDL estimates are compared with the true dose-effect curve (scenario C, CV=10% and 10 animals/group). Model 2 does not include a parameter that enables the dose-effect curve to level-off at high doses. In the standard procedure for model selection, model 2 is frequently selected when there is insufficient information to identify the effect plateau at high doses. This can explain the

> o 10%

[3 Model 3,4 or 5

□ Vlodel 2

T~H......r-rl

BMDL true BMD

Fig. 5. The distribution of 5ur§L5-ratios, where BMDL is the lower bound of the 90% confidence interval of the benchmark dose and true BMD is the "true" benchmark dose. BMDLs were calculated by the standard procedure for model selection using 1200 sets of dose-effect data with a coefficient of variation(CV) of 10% generated by Monte Carlo simulation with dose placement scenario C and 10 animals/group. The lighter grey striped bars shows the ratio for the simulations where model 2 was the selected model. The darker grey bars shows the ratios for the model 3, 4 or 5. The vast majority of simulations resulting in non-protective BMDL values (BMDL/ trueBMD > 1) are associated with the selection of model 2.

Fig. 6. Example of a simulated dataset using dose scenario C, CV = 10% and 10 animals/group. The upper solid curve is the underlying "true" dose-effect curve and the lower dashed curve is the estimated curve using model 2 from the nested set of five exponential models. The filled triangles are the geometric mean at each dose level. The error bars indicate the 95% confidence interval for the geometric mean effect for a simulated group. The vertical lines indicate (from left to right) the true BMD, the estimated BMDL, the estimated BMD and the BMDU (the upper bound of the confidence interval for the BMD) respectively. Note that the estimated BMDL exceeds the true BMD.

observation in Fig. 3, where non-protective BMDLs were mostly found in scenarios with few animals and doses placed in the low effect region.

Larger groups generally resulted in fewer non-protective BMDLs, with low dose placements (scenario A) as the exception (Fig. 3). Dose placement scenario A resulted in a situation where the non-protective BMDLs were most frequent in the simulations with more animals. As the confidence interval of the BMD became narrower with a high number of animals, the probability of a non-protective BMDL also increased. However, as the confidence intervals were narrow, the ratio between the estimated BMDL and the true BMD was only slightly higher than one. The confidence intervals also got narrow with more animals at higher dose placements, but in those cases the high number of animals also increased the frequency of cases where the plateau was identified (i.e. model 2 was not selected) leading to lower BMDL values. However, in simulations when the plateau effect was not identified (i.e. model 2 was selected), the BMDL/trueBMD-ratio was often found to be >3 (see Supplement Table S7 for example).

The observation that BMDLs from the simulated data can be higher than the ''true'' BMD in a large number of simulations has previously been mentioned in a study on continuous data (Slob et al., 2005) and investigated in a study on quantal data (West et al., 2012). Slob et al. (2005) mentioned that non-protective BMDLs occurred frequently when the total number of animals in a study was 50 or less and the residual variation was large (CV = 18%). The potential impact on risk assessment was, however, not further commented on by the authors. The present simulations show high frequencies of non-protective BMDLs with lower variation (CV = 5% and 10%) and with more animals (up to 200, i.e. 50 per dose group).

West et al. (2012) recently showed that the frequencies of non-protective BMDLs can be as high as 100% for quantal data. They concluded that the true extra risk at the BMDL under an incorrectly selected model can surpass the target benchmark response, exposing the potential dangers of traditional strategies for model selection when calculating BMDs and BMDLs. Our conclusions on continuous data are similar, although the fractions of non-protective BMDLs in our study seem less worrying than those presented by West et al. (2012). It has to be noted that the case with 100% non-protective BMDLs in their study comes from a situation where

only one model were considered in the estimations, thereby not taking into account that the fitted model might be incorrect. Furthermore, local fit measures such as scaled residuals were not taken into account. Therefore, the frequency of non-protective BMDLs might be misleading, since models with poor local fit is not likely to be accepted by an assessor.

In the present study, we assume the true dose-effect curve for continuous data to be sigmoid on the log dose scale and thus to level off at high doses. This is in accordance with most biological observations, from enzyme kinetics to population models of infectious diseases and has been confirmed in a recent study on a variety of endpoints (Slob and Setzer, 2013). A plateau might of course not be reached within the studied dose range. However, with the current use of nested models, it is assumed that there is no ceiling effect unless there is clear evidence of the opposite. An alternative approach, that in our opinion is more biologically plausible, is to assume that a plateau exist unless there are good reasons to believe the opposite. If prior knowledge about a plateau is available this should be taken into account in the BMD modeling. One possibility would be to use models with scientifically justified narrow bounds for the plateau parameter. Another possibility to identify the plateau level would be to fit the dataset of interest simultaneously with historical data on the same endpoint but with other substances, as done by Slob and Setzer (2013).

Even though it might make sense from a biological viewpoint to always include a plateau parameter in the dose-effect analysis, this approach can be problematic if all doses appear at the lower part of the dose effect curve. Thus, if it is of special importance to determine the plateau parameter, for instance in order to calculate the ED50, the study should obviously include dose groups that inform about the plateau. Also, including an unrestricted plateau parameter does increase the risk of obtaining BMDLs that are too high as a result of numerical problems in the estimations. If the profile likelihood method is used to calculate confidence intervals it is particularly important to examine the profile likelihood curves before accepting a BMDL. It is also possible to estimate the confidence interval using another method, such as the bootstrap method. However, the bootstrap method is more time-consuming than the likelihood ratio method.

None of the studies observing falsely high BMD estimates (Slob et al., 2005; West et al., 2012, present study) take into account that in a real life situation the risk assessor should always make a visual inspection of the BMD curves and their fit to the experimental data. Yet, even if a model fits the data perfectly it does not necessarily mean that that model describes the underlying reality equally well. In fact, without knowing the "true" dose-effect relationship it is impossible to observe how well the estimated BMD describe the risk for health effects.

Since the NOAEL approach is still used in many regulatory settings, it was evaluated in comparison with the true BMD at an effect size of 5%. Our simulations clearly illustrate the disadvantages with the NOAEL approach. Thus, the NOAEL value (in relation to the true effect level) is highly dependent on dose placement, especially when the variability is high and the number of animals is low. The NOAEL frequently exceeds the true BMD, being higher in as many as 96% of the simulations in one scenario. This supports the notion by several agencies that the NOAEL approach should be used with caution and that the BMD can be considered as a more robust alternative (EFSA, 2009; NAC/AEGL, 2001; WHO, 2009) and that the NOAEL should not be interpreted and used as a no-effect-level, especially if the numbers of animals are low and/or the variability is high.

In conclusion, the current practice of using nested model selection in BMD modeling of continuous data can result in high frequencies of falsely high BMDL estimates eventually resulting in non-protective POD for quantitative risk assessment. The phenomenon is mainly related to the selection of a non-sigmoidal

exponential model. Such models should be used with caution as they may underestimate the risk, illustrating that awareness of the model selection process and sound identification of the point-of-departure is vital for health risk assessment.

Acknowledgments

This work was supported by the Swedish National Research Council [2012-1869], the Swedish Board of Agriculture and the IMM Junior Faculty Program at the Institute of Environmental Medicine. The computations were carried out at the Swedish National Infrastructure for Computing [SNIC 001/11-226] via the National Super Computer Centre in Linkoping. We are grateful to Dr. Salomon Sand for valuable advice regarding benchmark dose analysis and to Dr. Ian Jarvis for help with language editing.

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/jj.yrtph.2014.03.004.

References

Allen, B.C. et al., 1994. Dose-response assessment for developmental toxicity. II. Comparison of generic benchmark dose estimates with no observed adverse effect levels. Fundam. Appl. Toxicol. 23, 487-495.

Crump, K.S., 1984. A new method for determining allowable daily intakes. Fundam. Appl. Toxicol. 4, 854-871.

Dybing, E. et al., 2002. Hazard characterisation of chemicals in food and diet. Dose response, mechanisms and extrapolation issues. Food Chem. Toxicol. 40, 237282.

ECHA, 2008. In: European Chemicals Agency (Ed.), Guidance on Information Requirements and Chemical Safety Assessment, Chapter R.8: Characterisation of Dose [Concentration]-Response for Human Health. Helsinki, Finland.

Edler, L. et al., 2002. Mathematical modelling and quantitative methods. Food Chem. Toxicol. 40, 283-326.

EFSA, 2009. Guidance of the Scientific Committee on a request from EFSA on the use of the benchmark dose approach in risk assessment. EFSA J. 2009,1-72.

Hothorn, T. et al., 2008. Simultaneous inference in general parametric models. Biometric. J. 50, 346-363.

Knaus, J., 2010. Snowfall: easier cluster computing (based on snow).

Moerbeek, M. et al., 2004. A comparison of three methods for calculating confidence intervals for the benchmark dose. Risk Anal. 24, 31-40.

NAC/AEGL, 2001. In: Subcommittee on Acute Exposure Guideline Levels, N.R.C. (Ed.), Standing Operating Procedures for Developing Acute Exposure Guideline Levels for Hazardous Chemicals. Washington, DC.

NRC, 1994. In: NR Council (Ed.), Science and Judgment in Risk Assessment. Washington, DC.

Oberg, M., 2010. Benchmark dose approaches in chemical health risk assessment in relation to number and distress of laboratory animals. Regul. Toxicol. Pharmacol. 58, 451-454.

OECD, 2012. OECD Guidelines for the Testing of Chemicals.

R Core Development Team, 2011. R: A Language and Environment for Statistical Computing. Vienna.

Sand, S. et al., 2008. The current state of knowledge on the use of the benchmark dose concept in risk assessment. J. Appl. Toxicol. 28, 405-421.

Slob, W., 2002. Dose-response modeling of continuous endpoints. Toxicol. Sci. 66, 298-312.

Slob, W., 2011. BMD approach. R package version 28.1.

Slob, W. et al., 2005. A statistical evaluation of toxicity study designs for the estimation of the benchmark dose in continuous endpoints. Toxicol. Sci. 84, 167-185.

Slob, W., Setzer, R.W., 2013. Shape and steepness of toxicological dose-response relationships of continuous endpoints. Crit. Rev. Toxicol.

Solecki, R. et al., 2005. Guidance on setting of acute reference dose (ARfD) for pesticides. Food Chem. Toxicol. 43, 1569-1593.

USEPA, 1995. The use of the benchmark dose (BMD) approach in health risk assessment. Final report EPA/630/R-94/007. In: USEPA Risk Assessment Forum (Ed.). Washington, DC.

USEPA, 2013. Benchmark Dose Software (BMDS) Version 2.4 R70. National Center for Environmental Assessment.

USEPA, 2012. Benchmark dose technical guidance document. EPA/100/R-12/001. In: US Environmental Protection Agency R.A.F. (Ed.), Washington, DC.

West, R.W. et al., 2012. The impact of model uncertainty on benchmark dose estimation. Environmetrics 23, 706-716.

WHO, 1999. Principles for the Assessment of Risks to Human Health from Exposure to Chemicals. Environmental Health Criteria 210, Geneva.

WHO, 2009. Principles for modeling dose-response for the risk assessment of chemicals. In: Environmental Health Criteria 239, Geneva.