Biostatistics (2016), 17, 4, pp. 634-649

doi:10.1093/biostatistics/kxw012

Advance Access publication on March 18, 2016

Accounting for selection and correlation in the analysis of two-stage genome-wide association studies

DAVID S. ROBERTSON* MRCBiostatistics Unit, IPHForvie Site, Robinson Way, Cambridge CB2 0SR, UK david.robertson@mrc-bsu.cam.ac.uk

A. TOBY PREVOST

Imperial College London, 1st Floor, Stadium House, 68 Wood Lane, London W12 7RH, UK

JACK BOWDEN

MRC Integrative Epidemiology Unit, University of Bristol, Oakfield House, Bristol BS8 2BN, UK and MRC Biostatistics Unit, IPH Forvie Site, Robinson Way, Cambridge CB2 0SR, UK

Summary

The problem of selection bias has long been recognized in the analysis of two-stage trials, where promising candidates are selected in stage 1 for confirmatory analysis in stage 2. To efficiently correct for bias, uniformly minimum variance conditionally unbiased estimators (UMVCUEs) have been proposed for a wide variety of trial settings, but where the population parameter estimates are assumed to be independent. We relax this assumption and derive the UMVCUE in the multivariate normal setting with an arbitrary known covariance structure. One area of application is the estimation of odds ratios (ORs) when combining a genome-wide scan with a replication study. Our framework explicitly accounts for correlated single nucleotide polymorphisms, as might occur due to linkage disequilibrium. We illustrate our approach on the measurement of the association between 11 genetic variants and the risk of Crohn's disease, as reported in Parkes and others (2007. Sequence variants in the autophagy gene IRGM and multiple other replicating loci contribute to Crohn's disease susceptibility. Nat. Gen. 39(7), 830-832.), and show that the estimated ORs can vary substantially if both selection and correlation are taken into account.

Keywords: Correlated outcomes; Genome-wide scan; Selection bias; Two-stage sample; Uniformly minimum variance conditionally unbiased estimator.

1. Introduction

The two-stage "learn and confirm" strategy is widely implemented across biomedical research. For example, in the sphere of randomized clinical trials, the vast majority of adaptive trial designs follow a two-stage strategy, where design adaptations are made at the first interim analysis, with the trial then proceeding under a fixed design until its completion. See Bauer and others (2016), Bretz and others (2009), and Posch and others (2005) for a thorough review of the literature.

*To whom correspondence should be addressed.

© The Author 2016. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

A key application has been in two-stage adaptive seamless trials incorporating mid-trial treatment selection (Bowden and Glimm, 2014; Kimani and others, 2013; Sampson and Sill, 2005). Typically, multiple experimental treatments are simultaneously compared with a control in stage 1, and the most promising treatment is selected for confirmatory analysis in stage 2.

In biomedical science more generally, there is an increasing focus on the identification of genetic markers predictive of phenotypic variation, in order to better understand disease etiology and to target future drug development. The design of choice is the genome-wide association study (GWAS), in which upwards of one million single nucleotide polymorphisms (SNPs) can be tested simultaneously for their association with a given phenotype. Those SNPs achieving "genome-wide" significance (typically defined as a p-value of < 10-8) are then validated in a replication study (Ioannidis and others, 2009).

A whole variety of formal two-stage design procedures have been proposed in this setting; see, for example, Satagopan and Elston (2003), Satagopan and others (2004, 2002), Thomas and others (2004), with an overview provided in Thomas and others (2009). In practice, the two-stage GWAS approach has been successfully applied to many disease areas, including sclerosis (Chio and others, 2009), epilepsy (Guo and others, 2012), and breast cancer (Sapkota and others, 2012).

However, ranking and selecting candidates in a two-stage trial can induce bias in the final estimates of the parameter of interest. Indeed, as a general rule, the more aggressive the selection, the worse is the bias (Bauer and others, 2010). Specifically, a candidate has to perform "well" in stage 1 in order to proceed to stage 2, which often leads to overly optimistic estimates, or the so-called "winner's curse". When there is heavy selection from many candidates of roughly equivalent parameters, then the selected candidate is typically based on chance variability rather than true superiority (Sill and Sampson, 2007). All this calls into question the generalizability of the trial's findings to future patients.

The need to correct for this selection bias in the analysis of two-stage trials has long been recognized. An efficient and unbiased estimator can be obtained through the technique of Rao-Blackwellization. This involves taking the unbiased stage 2 data and conditioning on a complete, sufficient statistic for the parameter in question. By the Lehmann-Scheffe theorem, this gives the uniformly minimum variance conditionally unbiased estimator (UMVCUE).

This two-stage framework was introduced by Cohen and Sackrowitz (1989), who derived the UMVCUE for the selected mean, where the stage 1 populations are independent and normally distributed. The method lends itself naturally to estimation in two-stage adaptive seamless phase II/III clinical trials (Bowden and Glimm, 2008; Kimani and others, 2013), with equivalent work carried out in biomarker research (Pepe and others, 2009; Koopmeiners and others, 2012; Robertson and others, 2015). In the GWAS setting, Bowden and Dudbridge (2009) extended the Cohen and Sackrowitz method to derive the UMVCUE for the SNP-disease log odds ratios (ORs), where either a one-sided or two-sided p-value test is used for ranking the SNPs.

A crucial assumption in all of the above analyses is that the stage 1 population parameter estimates are independent random variables. However, this may not be a reasonable assumption to make. For example, in the GWAS setting, SNPs in the same chromosomal region may be in linkage disequilibrium (LD).

The aim of this paper is to present general theory to enable the calculation of UMVCUEs for each setting that explicitly accounts for correlation. We achieve this by deriving the UMVCUE in the multivariate normal setting with an arbitrary known covariance structure. We illustrate the utility of the new theory using an application in the GWAS setting.

Since the focus of this paper is on point estimation, our primary criterion for comparing the performance of different estimators is the magnitude of the bias. Of course, bias is not the only criterion with which to judge an estimator, especially considering the bias-variance trade-off. However, it is important to quantify the bias of an estimator, not least due to regulatory concerns. As a secondary criterion, we also compare the mean squared error (MSE) of the estimators, which takes into account the variance. In practice, the relative importance attached to the criterion of bias and MSE will vary from setting to setting.

In Section 2, we describe the general model framework and derive the form of the multivariate normal UMVCUE. We then present a short simulation study in Section 3 for the bivariate normal case. We extend our framework to estimating ORs in genome-wide scans in Section 4, and apply the resulting UMVCUE to data reported in Parkes and others (2007). We conclude with a discussion in Section 5.

2. General framework for the UMVCUE 2.1 Model description

Consider the following two-stage trial design. Suppose that we have K correlated continuous stage 1 parameter estimates, defined by X = (X1,..., XK), for the candidate treatments, biomarkers or genetic variants C1, ...,CK. Let X follow a multivariate normal distribution, defined by X ~ N (f, V), where f is the vector of (unknown) means and V = (Vij), i, j e{1,..., K} is the covariance matrix. We assume that V is known and is positive definite. For notational convenience, we let Vu = o2 for i = 1,..., K.

We denote the ordered stage 1 estimates as Xa), where the ordering is simply by effect size: X(1) > X(2) > ■ > X(K). We consider other ordering mechanisms in due course. Let Yi be the stage 2 estimate of the i th ranked candidate, and so Yi ~ N (¡(i), t2). At the end of stage 2, the aim is to efficiently estimate ) for some j e{1,..., K}.

The MLE for ¡(j) is a weighted average of the first- and second-stage data:

rj X (j) + o(j )Yj

(j) = „2 , T2 . o(j) + T2

This estimator may be biased, because it does not take into account the first-stage selection procedures or the correlation. An unbiased estimator can easily be found by just using the stage 2 data Yj. However, this estimator suffers from lower precision since we are neglecting the stage 1 data. Hence we look for an unbiased estimator that uses data from both stages.

2.2 Calculating the UMVCUE

Let Q be the event {X: X 1 > X2 > ••• > XK}, which implies that X (i) = Xi for i = 1,..., K. Without loss of generality, we condition on Q for the remainder of the derivation of the UMVCUE.

To start with, consider estimating the mean of the highest ranked population, ¿1. For notational convenience, we let Y1 = Y and r1 = t .In Section 1 of the supplementary materials (available at Biostatistics online), we prove the following theorem for the complete and sufficient statistic for fi = (¡1 ).

Theorem 2.1 The statistic Z = (Z1,..., ZK) is sufficient and complete for f = (¡1, ), where

Zi = Xi + (V1i It 2)Y for i = 1,...,K.

Then, by the Lehmann-Scheffe theorem, U = E(Y | Z = z, Q) is the UMVCUE for under Q. Hence we have the following result, as proved in Section 2 of the supplementary materials (available at Biostatis-tics online).

Theorem 2.2 The UMVCUE for given Q is

U= t2Z1 t2 $(W1) - $(W2)

o2 + t2 ^o2 + t 2 $m) - $(W2), (.)

kiVvf + T2

T2 VoF+T2 k1 = min( Ai), k2 = max( A2),

for i = 1, 2,

T 2( Zj - Zj+1) .

V1j - V1,j+1

: V1j > V1,j+1; j = 1,

t 2( Zj - Zj+1) .

V1j - V1,j+1

: V1j < V1,j+1; j = 1,

and we define min{0} = and max{0} = -to.

,.,K - 1 ,.,K - 1

Remark In the independent normal setting, Cohen and Sackrowitz (1989) note that all dependency on observations X3, X4,..., XK vanished in the expression for the UMVCUE—which they interpreted as a negative result when K > 2. However, we see in Equation (2.1) that accounting for correlation means that all of the observations can contribute to the sets A1 and A2 and be implicitly used in the estimator. In fact, any of the observations X3, X4,..., XK can be used explicitly, depending on which feature in min( A1) and max (A2).

Remark 2 It is often the case that the stage 1 estimates are asymptotically normal rather than precisely normal. For example, log ORs have an asymptotic normal distribution. In these settings, our theorems hold asymptotically.

Suppose that, instead of only selecting the single top-ranking treatment or biomarker for continuation to stage 2, we take forward the top K from a larger group of K'. For example, in the GWAS setting, typically K'« 500 000 and K < 100. We now require the UMVCUE for the jth best candidate out of K and the following corollary covers this case.

Corollary 2.3 For a given value of j e{1,..., K}, the statistic Zj = (Z1 j, complete for i = \xK), where

Zij = Xi + ~2 Yj Tj

for i = 1,..., K.

Also, the UMVCUE for fZj given Q is

.., ZKj) is sufficient and (2.2)

T j zjj a) + Tj2

0(W1) - $(W2)

a2 + 2 ®(W1) - $(W2)

aj + Tj2

aj + Tj2

for i = 1, 2,

k1 = min( A1), k2 = max( A2),

j Zij - Zi + 1,j ) Vij - Vi+u

jZij - Zi + 1,j) Vij - Vi + 1,j

: Vij > Vi+i,

= 1,..., K - 1

: Vij < Vi+i,j; i = 1,..., K - 1

and we define min{0} = +to and max{0} = -to.

Remark As a simple check of our methodological development's consistency with previous results, if we set the off-diagonal terms of the covariance matrix V equal to zero, then we recover the independent normal setting. The UMVCUE was derived in this special case by Bowden and Glimm (2008). We compare our results in Section 3 of the supplementary materials (available at Biostatistics online) and show that they are the same.

In Section 4, we show the form of the UMVCUE when ranking by p-value, and subsequently show how to construct bootstrapped confidence intervals (CIs) in this setting.

3. Simulation study: bivariate normal case

3.1 Bivariate normal UMVCUE

We now consider the UMVCUE derived in the bivariate normal setting, where K = 2. Suppose we have two correlated continuous stage 1 outcome measures, defined by X = (X1, X2), for the candidates C1 and C2. We let X follow a bivariate normal distribution, with

( of P0102\

\POT02 022 )

where p is the correlation coefficient between X1 and X2, and we assume p = ±1. From Equations (2.1), the statistic (Z1, Z2) is sufficient for ¡x2), where

Z1 = X1 + 4 Y, Z2 = X2 +

P0102 T 2

The UMVCUE depends on whether Vn > V12 ^^ a1/a2 > p. Since t2(Z1 - Z2)/(Vn - V12): t2(Z1 - Z2)/(aj2 - pa1a2), if follows that, from Equation (2.1), the UMVCUE for given Q is

t 2 Z1

O2 + t2 yOf+T2 Q(W) 02

t 2 z1

.,°1 if — > p,

of + t2 Vo2 + t2 ®(-W) t 2 z1

ifa <p,

if — = P, O2

(Z1 - Z2)v/O12 + t2

o12 - PO1O2

VOÎ+T2

Remark Note that if 0\ /a2 = p, then the UMVCUE is in fact equal to the MLE fx\.

3.2 Simulation results

In this subsection, we compare various estimators for the largest mean f \ in the bivariate normal setting:

Y = the observed stage 2 data for the highest ranked population. ff \ = the MLE for f \, which ignores selection and correlation.

UB = the UMVCUE derived by Bowden and Glimm (2008), which accounts for selection but ignores correlation. Note that this is equivalent to U when we set p = 0. U = our UMVCUE given in Equation (3.1), which accounts for selection and correlation.

To evaluate the performance of a generic estimator for f \, say f we calculate the bias and MSE as defined in Posch and others (2005), Sill and Sampson (2007), and Bowden and Glimm (2008):

bsel(fi) = ^ E[fi - fi I X\ = X]P(X\ = Xi

MSEsel(f\) = 53 E[(f\ - fi) | X\ = Xi]P(X\ = Xi). t

i = \ i

In our simulation study, we set a\ = 0.05 and a2 = 0. \ and then consider four simple scenarios (i)-(iv) X

with varying values of the true means i = (f\,f2) and stage 2 variance t2: r

(i) i = (0.\ 0T), t = 0.05, (ii) i = (0.\, 0.3), t = 0.05, I

(iii) i = (0.\0.\), t = o\, (iv) i = (0.\, 0.3), t = o\. rr

Note that, for scenarios (i) and (ii), MSEsel(Y) = 0.052. For scenarios (iii) and (iv), the exact MSE of U

Y is as follows: t

MSEsel(Y) = E[(Y - f\)2 I X\ > X2]P(X\ > X2) + E[(Y - f2)2 I X\ < X2]P(X\ < X2) H

^ / f\ - f2^ + o22 V v0\2 + O22 - 2p0\02 )

^ $ / f \ - f2

\J0\2 + o22 - 2p0\02

Figure \ shows the mean relative bias (defined as bsel/f\) of the MLE (fx\) and the UMVCUE (UB) (which ignores correlation) as a function of p for the four scenarios. Since the mean bias of the stage 2 data (Y) and the UMVCUE (U) are virtually identical to zero for all values of p (by definition), we do not show the simulation results here.

The plots show that the bias of both estimators is at a maximum for p close to - \ The MLE is positively biased for p < 0.5, and in scenarios (ii) is in fact substantially biased for all values of p. We see that UUB is positively biased for p < 0, unbiased when p = 0 (as would be expected), and negatively biased for p > 0. The bias of both estimators is worse when t = 0\ compared with t = 0.05.

Figure 2 shows the MSE of the estimators (fx \, Y, UB, U) for the four scenarios. As would be expected, the MLE always has the lowest MSE, while Y always has the highest. We also see that there is a substantial reduction in the MSE when using the UMVCUEs U or UB.

In scenarios (i)—(iii), the MSE of U decreases as p approaches 0.5, and then increases again. This is to be expected given Equation (3.!), which shows that U reduces down to the MLE when 0\ = p02.

& 0.00

-0.05 -

-0.5 0.0

Correlation

-0.5 0.0

Correlation

-0.5 0.0

Correlation

-0.5 0.0

Correlation

Fig. 1. Mean relative bias of the MLE (fi 1, blue), and the UMVCUE (Ub , black) that ignores the correlation. Plots show the mean of 100 000 simulations for each value of p. The horizontal line shows where the bias equals zero.

In scenarios (iii) and (iv), we also see that, for p close to 1, the MSE of U and the MLE are very close.

In all of the scenarios, the MSE of U is greater than that for UB when p < 0, equal when p = 0 (as would be expected), and less than that for UB when p > 0. However, these differences in the MSE are much less than the differences between the MSE of Y and either U or UB. In conclusion, there is little or no advantage in terms of MSE when using the estimator UB as opposed to our UMVCUE U, which properly accounts for correlation.

4. Application to the Parkes and others' GWAS data

We now apply our methodology to data from the genome-wide association scan for Crohn's disease published in 2007 by the Wellcome Trust Case Control Consortium (2007) (WTCCC). There were 1748 cases for the disease, and 2938 controls in this cohort, which identified 12 SNPs associated with disease status at genome-wide significance.

A replication study was then reported by Parkes and others (2007) in a follow-up cohort of 1182 cases and 2024 controls, in which 12 SNPs were successfully replicated in this study. Following

0.0014

(ii) 0.009

m 0.007 w

Correlation

Correlation

-1.0 -0.5 0.0 0.5

Correlation

m 0.007-w

-1.0 -0.5 0.0 0.5

Correlation

Fig. 2. MSE of the MLE (blue), Y (green), UB (black), and U (red). Plots show the mean of 100 000 simulations for each value of p, except for Y, which is the theoretical value of the MSE.

Bowden and Dudbridge (2009), we exclude one (rs6887695) from our illustration (as its significance in the WTCCC scan was severely reduced after data cleaning).

Table 1 shows the estimated ORs for stages 1 and 2 (i.e. the WTCCC data and the replication data), as well as the overall MLE. This is then followed by two previously described estimators that combine the data from the two stages in a way that attempts to correct for bias. The first estimator (denoted by ZP) is the corrected MLE of Zhong and Prentice (2008). The second (denoted by UB) is the estimator considered in Bowden and Dudbridge (2009), who calculated the UMVCUE assuming that all of the log ORs for the SNPs are uncorrelated.

As the bold entries in Table \ indicate, two pairs of SNPs are on the same chromosomal region. The first- and second-ranked SNPs (rs17234657 and rs9292777) are both on chromosomal region 5p13, while the fifth- and seventh-ranked SNPs (rs13361189 andrs4958847) are both on chromosomal region 5q33.

It can be safely assumed that SNPs from distinct genomic regions are in linkage equilibrium. However, in an effort to assess the potential correlation between SNPs in the same genomic region, we used the data from 498 Europeans in the 1000 genomes project (1000 Genomes Project Consortium, 2012). The empirical correlation between the SNPs (encoded as 0,1,2 to indicate the number of risk increasing alleles) on chromosomal regions 5p13 and 5q33 were -0.28 and 0.78, respectively.

Table 1. The ORs (95% CIs) of 11 SNPs estimated from allele frequencies reported by Parkes and others (2007) in an initial scan (stage 1) and replication study (stage 2). SNPs are ranked in order of stage 1 statistical significance, with those on the same chromosomal region highlighted in bold

Chr SNP Stage 1 OR Stage 2 OR MLE ZP UB

5p13 rs17234657 1.55 (1.38, 1.74) 1.16 (1.00, 1.35) 1.39 1.39 1.16

5p13 rs9292777 1.38 (1.26, 1.51) 1.34 (1.20, 1.50) 1.37 1.37 1.39

10q24 rs10883365 1.27 (1.17, 1.38) 1.18 (1.05, 1.32) 1.24 1.24 1.16

18p11 rs2542151 1.35 (1.21, 1.50) 1.15 (1.00, 1.32) 1.27 1.25 1.15

5q33 rs13361189 1.51 (1.30, 1.76) 1.38 (1.15, 1.66) 1.46 1.45 1.40

3p21 rs9858542 1.26 (1.15, 1.38) 1.17 (1.04, 1.31) 1.22 1.21 1.17

5q33 rs4958847 1.35 (1.20, 1.53) 1.36 (1.17, 1.59) 1.36 1.35 1.35

5q23 rs10077785 1.29 (1.16, 1.43) 1.19 (1.05, 1.36) 1.25 1.22 1.19

1q24 rs12035082 1.22 (1.12, 1.33) 1.14 (1.02, 1.27) 1.19 1.17 1.15

21q22 rs2836754 1.22 (1.12, 1.33) 1.15 (1.03, 1.28) 1.19 1.16 1.16

1q31 rs10801047 1.38 (1.18, 1.61) 1.47 (1.22, 1.76) 1.42 1.39 1.44

Although the populations considered by Parkes and others (2007) and the 1000 genomes project are different, these correlations do suggest LD exists between these SNPs. Hence a natural question to ask is how the OR estimates are affected by correlation. To do so, we first need to extend our framework to account for ranking by p-value.

4.1 Calculating the UMVCUE

Consider a two-stage design with a genome-wide association scan (stage 1) followed by a replication study (stage 2). Only those SNPs that meet selection criteria (see below) in stage 1 continue on to stage 2. A common approach in the genome-wide setting is to rank the SNPs according to the statistical significance of the effects, with a p-value threshold pcrit. We assume that K of the SNPs pass pcrit and are then ranked. For a one-sided test, we condition (WLOG) on the event

Q1 = ( X: X1 > X2 > Xk > $-1(1 - pcrit)!.

a1 a2 aK crit

The UMVCUE in this case is given in Section 4 the supplementary materials (available at Biostatistics online).

For a two-sided test, we instead condition (WLOG) on the event

a XM > ™ Hk!

0\ O2 Ok

> - W2)}

Then the UMVCUE for fij is as follows, with the proof found in Section 5 of the supplementary materials (available at Biostatistics online):

Theorem 4.1 The UMVCUE for ß2 given Q2 is

_2 v^M

fr_e;=i ) - $(w2i)

Uj = „2 , _2 !—:-- v^M , ™ - , s' (41)

22 02 + T2

yOJTT^EM=1 *(Wu) - ®(W2i)'

W 02 + T2

ai\/02 + t2

LJ[ai, bi] = [) (A\i n A2i) U (A3i n A4i ) n (A5 U A6),

A\i = {Y : (0i Vi+1,2 - 0i+\ Vij)Y > tj(0iZi+\^j - 0i+\Zij)}, A2i = {Y : (0i Vi+1,j + 0i + 1 Vij)Y < tj(0, + 1 Zij + 0iZi +1,j )} A3i = {Y : (0i Vi+\,j + 0+1 Vij)Y > 0+\Zij + 0iZi+\,j)}, A4i = {Y : (0i Vi+\,j - 0i+\ Vij)Y < tj(0iZi+\j - 0i+\Zij)},

A5 = {Y : VKjY < T2[ZK - 0K$-1(1 - Pcrit/2)]}, A6 = {Y : VKjY > Tj2[Zk + 0k$-1(1 - Pcrit/2)]}

and we define min{0} = and max{0} = -to.

0^ + -j2

Remark The introduction of the variable M above is for notational convenience. In particular, we reexpress the intersection of sets (p|fi-1 (A1i n A2i) U (A3i n A4i)) n (A5 U A6) as the disjoint union of M closed intervals on the real line [a\, b\],..., [aM, bM], where M > 1. Note that possibly a\ = -to and/or

bM = TO.

For either test, if all the off-diagonal terms of the covariance matrix are equal to zero, then we recover the independent normal setting as in Bowden and Dudbridge (2009). In Section 6 of the supplementary materials (available at Biostatistics online), we compare our estimators and show they are the same.

4.2 Results

Rather than assuming we know p, we take each pair of SNPs that are on the same chromosomal region and calculate the UMVCUEs as the correlation between the log ORs varies between ± 1. More explicitly, given a pair of SNPs on the same chromosomal region which have stage 1 ranks j\ and j2, respectively, then we assume that the log ORs X21 and Xj2 follow a bivariate normal distribution, with correlation coefficient p

fj\ fj2

0ji p0j\ 0j2 p02\ °/2 0j2

Starting with the SNPs on chromosomal region 5p13, the upper half of Figure 3 shows the UMVCUEs for the ORs as a function of the correlation coefficient p. The first plot for SNP rs17234657 shows that the UMVCUE can vary substantially if there is positive correlation between the SNPs. As p increases from 0 to 0.88, the UMVCUE increases from 1.16 to 1.32, with a particularly rapid increase for p greater than about 0.5. For p > 0. 88 we actually see a (small) decrease in the UMVCUE. In contrast, for negative values of p, the UMVCUE is hardly affected.

The second plot for SNP rs9292777 shows that the UMVCUE decreases as p increases, but in absolute terms correlation has little effect. The main reason for this is because the stage 1 and stage 2 estimates

Chr 5p13

0.0 Correlation

-0.5 0.0 0.5 Correlation

Chr 5q33

0.0 Correlation

■0.5 0.0 0.5 Correlation

Fig. 3. The UMVCUEs for the ORs of the SNPs on chromosomal region 5p13 (rs17234657 and rs9292777) and 5q33 (rs13361189 and rs4958847) as a function of the correlation coefficient.

for the OR were very similar (1.38 and 1.34, respectively), so we would not expect the UMVCUE to be dramatically different regardless of any correlation.

Performing the analysis for the SNPs on chromosomal region 5q33 gives the results displayed in the lower half of Figure 3. Looking at the first plot for SNP rs13361189, in absolute terms correlation has little effect on the UMVCUE. However, note the piecewise nature of the UMVCUE: it increases for -1 < p < -0.58, decreases for -0.57 < p < -0.33, remains constant for -0.33 < p < 0.1, and then (rapidly) decreases again for p> 0.1. Also note that this time, there is a large discrepancy between the stage 1 and stage 2 log ORs (1.51 and 1.38, respectively). The fact that the UMVCUE hardly varies indicates that the substantial correction to the stage 1 estimate is in fact robust to correlation.

As for the second plot for SNP rs4958847, again the correlation hardly changes the value of the UMVCUE (in fact, the UMVCUE remains constant until p > 0.5). However, this should be expected since the stage 1 and stage 2 estimates for the OR were very similar (1.35 and 1.36, respectively). Constructing CIs. After constructing a point estimate for the OR, it is natural to construct CIs as well. We use a parametric bootstrap procedure inspired by Bowden and Glimm (2008) and Pepe and others (2009), where we assume that the true log OR mean fij = Uj (the UMVCUE calculated from the observed data).

Starting with the SNPs on chromosomal region 5q33, these had the fifth and seventh stage 1 ranks. Given the observed stage 1 log ORs x = (x1,...,x11) and the assumed correlation coefficient p between

Fig. 4. The UMVCUE for the OR of the highest ranked SNP (rs17234657) as a function of the correlation coefficient with the second-ranked SNP (rs9292777). Also plotted are the 95% bootstrapped CIs for the UMVCUE, using 10 000 bootstrapped replicates per data point. The 95% CIs for the stage 2 OR (in red) and the MLE (in black) are displayed for comparison purposes.

X5 and X7, we use the following procedure:

1. Simulate "bootstrapped" stage 1 data X(5B), X(7B) by drawing from a bivariate normal distribution

ст52 po5oq\ PCT5CT7 ст72 )

conditional on X4I/04 > |X^I/05 > I^IM and talM > IXy7B> I/07 > |x8|/08.

2. Simulate bootstrapped stage 2 data y5B) , y7B) from independent normal distributions

y(B) - N(U5, a52) and Y7B - N(U7, a72). 3. Calculate the bootstrapped UMVCUEs U5 observed values x\,. ..,x4, x6, x8,..., хц.

(B) fT(B)

u7B using X(B), X7B), y(b), y(b) and the original

These steps are then repeated for a large value of B, with the a/2 and (1 - a/2) empirical quantiles then used as the (1 - a)% CIs for U5 and U7. Bootstrapped CIs for the MLE, as well as the Zhong & Prentice estimator (ZP) are also then straightforwardly available.

Further details of how the variables are generated in step \ can be found in Section 7 of the supplementary materials (available at Biostatistics online), as well as an extension to the general multivariate case.

As for the SNPs on chromosomal region 5p13, these had the first and second stage 1 ranks. The same bootstrap procedure described above can be used, except that in step 1, we condition on |X\B)I/0\ > IX^I/02 > IX3I/03.

Figure 4 shows the bootstrapped 95% CIs for SNP rs17234657 on chromosomal region 5p13. When the correlation p = 0, the CI for the UMVCUE is virtually identical to that achieved by just using the stage 2 data. However, we see that, for p > 0.5, the CIs become appreciably different, since the stage 2 OR ignores the correlation.

Table 2. Bootstrapped 95% CIs for the ORs of the two highest ranked SNPs when the correlation p = 0. There were 10 000 bootstrapped replicates used

Stage 2 UMVCUE ZP MLE

rs17234657 (1.002, 1.348) (1.003, 1.348) (l.289, 1.529) (1.290, 1.526)

rs9292777 (1.243, 1.547) (1.299, 1.566) (1.308, 1.487) (1.307, 1.485)

Table 2 compares the bootstrapped 95% CIs for the various estimators for the two highest ranked SNPs 0

(rs17234657 and rs9292777) when the correlation p = 0. For the SNP rs9292777, there is a 12% reduction |

in the CI width (from 0.304 to 0.268) when using the UMVCUE compared to just using the stage 2 data. a

For both SNPs, the ZP estimator and the MLE give very similar CIs. d

5. Discussion Jj

In this paper, we present a general framework for accounting for selection and correlation in two-stage s

trials, with an emphasis on the GWAS setting. We achieve this by deriving the UMVCUE in the multivariate g

normal setting with an arbitrary known covariance structure. Our framework relaxes the assumption present in the literature that the stage 1 population parameter estimates are independent. x

As the bivariate normal simulation study demonstrated, it is important to correctly account for cor- d

relation. Indeed, if the correlation coefficient is not close to zero, then the UMVCUE that ignores cor- g'

relation can be substantially biased. Due to the bias-variance trade-off, the MSE of the UMVCUE is |

greater than that of the MLE, but is substantially less than the MSE of using the (unbiased) stage 2 r

data alone. b

The GWAS example showed how our estimation strategy can be applied in practice. We also described g

how to construct CIs using a parametric bootstrap procedure. Our results demonstrate how correcting §

for correlation is necessary in the GWAS setting. Indeed, for the Parkes and others (2007) data, the §

OR estimate for the highest ranking SNP varied substantially for high positive values of the correlation g

coefficient. y

In the GWAS setting, our framework assumes that the LD structure of the SNPs is known. §§

In practice, we would envisage using large external datasets such as the International HapMap 1

Project (The International HapMap Consortium, 2003) to give an estimate of SNP LD. Alternatively, the correlation could be estimated based on the stage 1 data itself if individual participant data were available, but further research is needed to give guidance as to whether doing so would induce bias into our UMVCUE and/or appreciably increase the MSE.

Even if it is not possible to have reliable estimates of the correlation between the (log) ORs, a sensitivity analysis (like the one we carried out) is still useful to probe the robustness of the reported ORs to any LD between the SNPs.

Due to the structure of the Parkes and others (2007) data, our results in Section 4.2 focused on the bivariate setting. Of course, in other GWAS studies the data may mean that it is necessary to consider multiple (i.e. K > 2) SNPs in a local region that are in LD. Our UMVCUE, although still tractable, would then have a much more complex dependence on K (K — 1)/2 correlation coefficients.

We note that some progress was made to account for correlated outcomes by Sill and Sampson (2007), who looked at two-stage trials with mid-trial treatment selection based on a surrogate endpoint. They derived the UMVCUE for the final endpoint when the surrogate and final endpoint follow a bivariate

normal distribution. However, our framework is a more general one, and allows application to other trial settings. Indeed, as current work, we have applied our estimator to two-stage adaptive seamless phase II/III clinical trials, allowing for unequal stage 1 (and stage 2) variances as well as generalized selection rules.

One of the limitations of our work is the construction of CIs. In the bivariate case, we can simply use a parametric bootstrap procedure. As for the general multivariate case, in the supplementary materials (available at Biostatistics online) we show theoretically how to extend our method. However, it is not clear what the coverage of such CIs would be in practice. An alternative approach could be based on the work of Sampson and Sill (2005), who developed exact CIs in the independent normal setting. It may be possible to extend their analysis to correlated multivariate normal outcomes.

In the framework for our UMVCUE, we treat the stage 2 data as coming from independent univariate normal distributions, i.e. we ignore the potential stage 2 correlation structure. As current work, we are deriving efficient unbiased estimators that make use of, and fully account for, all the available correlated stage 2 outcomes. However, preliminary results appear to show that no UMVCUE exists for this case, despite the obvious appeal of making use of more data, when available.

Finally, another extension would be trials with more than two stages. Bowden and Glimm (2014) derived Rao-Blackwellized estimators for multistage drop-the-loser trials assuming independent normal outcomes, and this could be a starting point for looking at multistage trials with multiply correlated outcomes.

Supplementary material Supplementary Material is available at http://biostatistics.oxfordjournals.org.

Acknowledgments

The authors would like to thank the anonymous referees for their helpful comments, and Paul Newcombe for his help in accessing the 1000 genomes project data. Conflict of Interest: None declared.

Funding

This work was supported by a Medical Research Council Methodology Research Fellowship (grant number MR/N501906/1 to J.B.). Funding to pay the Open Access publication charges for this article was provided by the MRC BSU, grant codes MC_UP_1302/2 and MC_UP_1302/3.

References

1000 Genomes Project Consortium. (2012). An integrated map of genetic variation from 1092 human genomes. Nature 491(7422), 56-65.

Bauer, P., Bretz, F., Dragalin, V., Konig, F. and Wassmer, G. (2016). Twenty-five years of confirmatory adaptive designs: opportunities and pitfalls. Statistics in Medicine 35(3), 325-347.

Bauer, P., Koenig, F., Brannath, W. and Posch, M. (2010). Selection and bias—two hostile brothers. Statistics in Medicine 29(1), 1-13.

Bowden, J. and Dudbridge, F. (2009). Unbiased estimation of odds ratios: combining genome-wide association scans with replication studies. Genetic Epidemiology 33(5), 406-418.

Bowden, J. and Glimm, E. (2008). Unbiased estimation of selected treatment means in two-stage trials. Biometrical Journal 50(4), 515-527.

Bowden, J. and Glimm, E. (2014). Conditionally unbiased and near unbiased estimation of the selected treatment mean for multistage drop-the-losers trials. Biometrical Journal 56(2), 332-349.

Bretz, F., Koenig, F., Brannath, W., Glimm, E. and Posch, M. (2009). Adaptive designs for confirmatory clinical trials. Statistics in Medicine 28, 1181-1217.

Chio , A., Schymick, J. C., Restagno, G., Scholz, S. W., Lombardo, F., Lai, S. L., Mora, G., Fung, H. C., Britton, A., Arepalli, S. and others (2009). A two-stage genome-wide association study of sporadic amyotrophic lateral sclerosis. Human Molecular Genetics 18(8), 1524-1532.

Cohen, A. and Sackrowitz, H. B. (1989). Two stage conditionally unbiased estimators of the selected mean. Statistics & Probability Letters 8(3), 273-278.

Guo, Y., Baum, L. W., Sham, P. C., Wong, V., Ng, P. W., Lui, C. H. T., Sin, N. C., Tsoi, T. H., Tang, C. S. M, Kwan, J. S. H. and others (2012). Two-stage genome-wide association study identifies variants in CAMSAP1L1 as susceptibility loci for epilepsy in Chinese. Human Molecular Genetics 21(5), 1184-1189.

Ioannidis, J. P. A., Thomas, G. and Daly, M. J. (2009). Validating, augmenting and refining genome-wide association signals. Nature Reviews Genetics 10(5), 318-329.

Kimani, P. K., Todd, S. and Stallard, N. (2013). Conditionally unbiased estimation in phase II/III clinical trials with early stopping for futility. Statistics in Medicine 32(17), 2893-2910.

Koopmeiners, J. S., Feng, Z. and Pepe, M. S. (2012). Conditional estimation after a two-stage diagnostic biomarker study that allows early termination for futility. Statistics in Medicine 31(5), 420-435.

Parkes, M., Barrett, J. C., Prescott, N. J., Tremelling, M., Anderson, C. A., Fisher, S. A., Roberts, R. G., Nimmo, E. R., Cummings, F. R., Soars, D. and others (2007). Sequence variants in the autophagy gene IRGM and multiple other replicating loci contribute to Crohn's disease susceptibility. Nature Genetics 39(7), 830-832.

Pepe, M. S., Feng, Z., Longton, G. and Koopmeiners, J. (2009). Conditional estimation of sensitivity and specificity from a phase 2 biomarker study allowing early termination for futility. Statistics in Medicine 28, 762-779.

Posch, M., Koenig, F., Branson, M., Brannath, W., Dunger-Baldauf, C. and Bauer, P. (2005). Testing and estimation in flexible group sequential designs with adaptive treatment selection. Statistics in Medicine 24(24), 3697-3714.

Robertson, D. S., Prevost, A. T. and Bowden, J. (2015). Correcting for bias in the selection and validation of informative diagnostic tests. Statistics in Medicine 34, 1417-1437.

Sampson, A. R. and Sill, M. W. (2005). Drop-the-losers design: Normal case. Biometrical Journal 47(3), 257-268.

Sapkota, Y., Robson, P., Lai, R., Cass, C. E., Mackey, J. R. and Damaraju, S. (2012). A two-stage association study identifies methyl-CpG-binding domain protein 2 gene polymorphisms as candidates for breast cancer susceptibility. European Journal of Human Genetics 20(6), 682-689.

Satagopan, J. M. and Elston, R. C. (2003). Optimal two-stage genotyping in population-based association studies. Genetic Epidemiology 25(2), 149-157.

Satagopan, J. M., Venkatraman, E. S. and Begg, C. B. (2004). Two-stage designs for gene-disease association studies with sample size constraints. Biometrics 60(3), 589-597.

Satagopan, J. M., Verbel, D. A., Venkatraman, E. S., Offit, K. E. and Begg, C. B. (2002). Two-stage designs for gene-disease association studies. Biometrics 58(1), 163-170.

Sill, M. W. and Sampson, A. R. (2007). Extension of a two-stage conditionally unbiased estimator of the selected population to the bivariate normal case. Communications in Statistics - Theory and Methods 36(4), 801-813.

The International HapMap Consortium. (2003). The international hapmap project. Nature 426, 789-796.

Thomas, D. C., Casey, G., Conti, D. V, Haile, R. W., Lewinger, J. P. and Stram, D. O. (2009). Methodological issues in multistage genome-wide association studies. Statistical Science 24(4), 414.

Thomas, D., Xie, R. and Gebregziabher, M. (2004). Two-stage sampling designs for gene association studies. Genetic Epidemiology 27(4), 401-414.

Wellcome Trust Case Control Consortium. (2007). Genome-wide association study of 14 000 cases of seven common diseases and 3000 shared controls. Nature 447, 661-678.

Zhong, H. and Prentice, R. L. (2008). Bias-reduced estimators and confidence intervals for odds ratios in genome-wide association studies. Biostatistics 9(4), 621-634.

[Received June 23, 2015; revised November 11, 2015; accepted for publication January 15, 2016]