Hindawi Publishing Corporation Journal of Probability and Statistics Volume 2009, Article ID 487194,37 pages doi:10.1155/2009/487194

Research Article

Model and Variable Selection Procedures for Semiparametric Time Series Regression

Risa Kato and Takayuki Shiohama

Department of Management Science, Faculty of Engineering, Tokyo University of Science, Kudankita 1-14-6, Chiyoda, Tokyo 102-0073, Japan

Correspondence should be addressed to Takayuki Shiohama, shiohama@ms.kagu.tus.ac.jp Received 13 March 2009; Accepted 26 June 2009 Recommended by Junbin Gao

Semiparametric regression models are very useful for time series analysis. They facilitate the detection of features resulting from external interventions. The complexity of semiparametric models poses new challenges for issues of nonparametric and parametric inference and model selection that frequently arise from time series data analysis. In this paper, we propose penalized least squares estimators which can simultaneously select significant variables and estimate unknown parameters. An innovative class of variable selection procedure is proposed to select significant variables and basis functions in a semiparametric model. The asymptotic normality of the resulting estimators is established. Information criteria for model selection are also proposed. We illustrate the effectiveness of the proposed procedures with numerical simulations.

Copyright © 2009 R. Kato and T. Shiohama. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

Non- and semiparametric regression has become a rapidly developing field of statistics in recent years. Various types of nonlinear model such as neural networks, kernel methods, as well as spline method, series estimation, local linear estimation have been applied in many fields. Non- and semiparametric methods, unlike parametric methods, make no or only mild assumptions about the trend or seasonal components and are, therefore, attractive when the data on hand does not meet the criteria for classical time series models. However, the price of this flexibility can be high; when multiple predictor variables are included in the regression equation, nonparametric regression faces the so-called curse of dimensionality.

A major problem associated with non- and semiparametric trend estimation involves the selection of a smoothing parameter and the number of basis functions. Most literature on nonparametric regression with dependent errors focuses on the kernel estimator of the trend function (see, e.g., Altman [1], Hart [2] and Herrmann et al. [3]). These results have been extended to the case with long-memory errors by Hall and Hart [4], Ray and Tsay [5],

and Beran and Feng [6]. Kernel methods are affected by the so-called boundary effect. A well-known estimator with automatic boundary correction is the local polynomial approach which is asymptotically equivalent to some kernel estimates. For detailed discussions on local polynomial fitting see, for example, Fan and Gijbels [7] and Fan and Yao [8].

For semiparametric models with serially correlated errors, Gao [9] proposed the semiparametric least-square estimators (SLSEs) for the parametric component and studied its asymptotic properties. You and Chen [10] constructed a semiparametric generalized least-square estimator (SGLSE) with autoregressive errors. Aneiros-Perez and Vilar-Fernandez [11] constructed SLSE with correlated errors.

Like parametric regression models, variable selection of the smoothing parameter for the basis functions is important problem in non- and semiparametric models. It is common practice to include only important variables in the model to enhance predictability. The general approach to finding sensible parameters is to choose an optimal subset determined according to the model selection criterion. Several information criteria for evaluating models constructed by various estimation procedures have been proposed, see, for example, Konishi and Kitagawa [12]. The commonly used criteria are generalized cross-validation, the Akaike information criterion (AIC), and the Bayesian information criterion (BIC). Although best subset selection is practically useful, these selection procedures ignore stochastic errors inherited between the stages of variable selection. Furthermore, best subset selection lacks stability, see, for example, Breiman [13]. Nonconcave penalized likelihood approaches for selecting significant variables for parametric regression models have been proposed by Fan and Li [14]. This methodology can be extended to semiparametric generalized regression models with dependent errors. One of the advantages of this procedure is the simultaneous selection of variables and the estimation of unknown parameters.

The rest of this paper is organized as follows. In Section 2.1 we introduce our semiparametric regression models and explain classical partial ridge regression estimation. Rather than focus on the kernel estimator of the trend function, we use the basis functions to fit the trend component of time series. In Section 2.2, we propose a penalized weighted least-square approach with information criteria for estimation and variable selection. The estimation algorithms are explained in Section 2.3. In Section 2.4, the GIC proposed by Konishi and Kitagawa [15], the BICm proposed by Hastie and Tibshirani [16], and the BICp proposed by Konishi et al. [17] are applied to the evaluation of models estimated by penalized weighted least-square. Section 2.5 contains the asymptotic results of proposed estimators. In Section 3 the performance of these information criteria is evaluated by simulation studies. Section 4 contains the real data analysis. Section 5 concludes our results, and proofs of the theorems are given in the appendix.

2. Estimation Procedures

In this section, we present our semiparametric regression model and estimation procedures.

2.1. The Model and Penalized Estimation

We consider the semiparametric regression model:

yi = a(ti) + fixi + ei, i = 1,...,n,

where yi is the response variable and x, is the d x 1 covariate vector at time i, a(ti) is an unspecified baseline function of ti with ti = i/n, p is a vector of unknown regression coefficients, and £i is a Gaussian and zero mean covariance stationary process.

We assume the following properties for the error terms £i and vectors of explanatory variables x,.

(A.1) It holds that {ei} is a linear process given by

ei = X bieH> (2.2)

where b0 = 1 and {ei} is an i.i.d. Gaussian random variable with E{ei} = 0 and E{e2} = o\.

(A.2) The coefficients bj satisfy the conditions that for all \z\ < 1, 2y=o bjzj / 0 and j j2\bj\ <

We define y(k) = cov(et,et+k) = E{etei+k}.

The assumptions on covariate variables are as follows.

(B.1) Also x, = (xn,... ,xid)' e Rd and {xij },j = 1,...d, have mean zero and variance 1.

The trend function a(ti) is expressed as a linear combination of a set of m underlying basis functions:

a(ti) = £ Wkfa (ti) = w'^(t), (2.3)

where {^(ti) = (^i(ti),...,^m(ti))'} is an m-dimensional vector constructed from basis functions {$k(ti); k = 1,.. .,m}, and w = (w1,.. .,wm)' is an unknown parameter vector to be estimated. The examples of basis functions are B-spline, P-spline, and radial basis functions. A P-spline basis is given by

*(ti) = (ti.....tp, (ti - K1)p+.....(ti - Kk)p+)', (2.4)

where {Kk}k=1r.rK are spline knots. This specification uses the so-called truncated power function basis. The choice of the number of knots K and the knot locations are discussed by Yu and Ruppert [18].

Radial basis function (RBF) emerged as a variant of artificial renewal network in late 80s. Nonlinear specification of using RBF has been widely used in cognitive science, engineering, biology, linguistics, and so on. If we consider the RBF modeling, a basis function can take the form

*k (ti)= exp(- ^T^), (2.5)

where ¡ik determines the location and s2k determines the width of the basis function.

Selecting appropriate basis functions, then the semiparametric regression model (2.1) can be expressed as a linear model

y = Xp + Bw + e, (2.6)

where X = (x1,...,xn)', y = (y\,...,yn)', B = (^1,...,^n)' with = (§\(Hri),...,$m(i/n))'. The penalized least-square estimator is then a minimizer of the function

- (y - Xp - Bw)'(y - Xp - Bw) + nlw'Kw, (2.7)

where I is the smoothing parameter controlling the tradeoff between the goodness-of-fit measured by weighted least-square and the roughness of the estimated function. Also K is an appropriate positive semidefinite symmetric matrix. For example, if K satisfies w'Kw = j0[a"(u)]2du, we have the usual quadratic integral penalty (see, e.g., Green and Silverman [19]). By simple calculus, (2.7) is minimized when p and w satisfy the block matrix equation

XX X'B \/ p \ /X'\

B'X B'B + nlK/ Vw/ VByy. (2 8)

This equation can be solved without any iteration (see, e.g., Green [20]). First, we find Bw = S(y - Xp), where S = B(B'B + aK)-1B' is usually called the smoothing matrix. Substituting Bw into (2.6), we obtain

y = X p + e, (2.9)

where y = (I-S)y, X = (I-S)X ,and I is the identity matrix of order n. Applying least-square to the linear model (2.9), we obtain the semiparametric ordinary least-square estimator (SOLSE) result:

PSOLSE = (X'X)-1X'y , (2.10)

wsolse = (B'B + n£K)-1B'(y - Xpsolse). (2.11)

Speckman [21] studied similar solutions for partial linear models with independent observations. Since the errors are serially correlated in model (2.1), pSOLSE is not asymptotically efficient. To obtain an asymptotically efficient estimator for p, we use the prewhitening transformation. Note that the errors {si] in (2.6) are invertible. Let b(L) = ^J=1 bjei-j, where L is the lag operator and a(L) = b(L)-1 = a0 - 2j=i ajLj with a0 = 1. Applying a(L) to the model (2.6) and rewriting the corresponding equation, we obtain the new model:

y = Xp + Bw + e, (2.12)

where y = (y )', X = (x1,...,xn )', B = (t1,...,tn)' and e = (ei,.../e„)'. Here

yi = yi " Xaiyi-i' & = & " Xai

j=1 j=1

(2.13)

xi = xi — aj xi-j. j=1

The regression errors in (2.12) are i.i.d. Because, in practice, the response variable yi is unknown, we use a reasonable approximation by y based on the work by Xiao et al. [22] and Aneiros-Perez and Vilar-Fernandez [11].

Under the usual regularity conditions the coefficients aj decrease geometrically so, letting t = t(n) denote a truncation parameter, we may consider the truncated autoregression on ei:

ei = Si ajSi-j, (2.14)

where ei are i.i.d. random variables with E(ei) = 0. We make the following assumption about the truncation parameter.

(C.1) The truncation parameter t satisfies t(n) = c log n for some c> 0.

The expansion rate of the truncation parameter given in (C.1) is also for convenience. Let Tt be the n x n transformation matrix such that eT = Tte. Then the model (2.12) can be expressed as

Tt y = Tt Xp + Tt Bw + Tt e,

(2.15)

/611 0 •••

621 -622 0

-a1 1/

(2.16)

with 611 = adVY(0)> ^22 = Oe/^(1 - p2(1))y(0), 621 = (1 - p2(1))r(0)).....Here

p(h) = J(h)/p(0) denotes the lag h autocorrelation function of [Si],

Now our estimation problem for the semiparametric time series regression model can

be expressed as the minimization of the function

L(p, w) = 2 (y - Xp - Bw)'V-1(y - Xp - Bw) + aw'Kw, (2.17)

where V-1 = oe~2TTTT and ae2 = n 1 ||TTe||2. Based on the work by Aneiros-Perez and Vilar-Fernandez [11], an estimator for Tx is constructed as follows. We use the residuals e = y -XpSOLSE - BwSOLSE to construct an estimate of TT using the ordinary least square method applied to the model

Si = a1£i_1 + ••• + aT£i-T + residuali. (2.18)

Define the estimate aT = (a1, a2,..., aT)' of aT = (a1, a2,..., aT)', where

aT = (e; ET) e, (2.19)

where e = (ST+1, ...,Sn) and ET is the (n - t ) x t matrix of regressors with the typical element Si-j. Then TT is obtained from TT by replacing aj with aj, a2e with a2e, and so forth. Applying least-square to the linear model, we obtain

TT y = TT Xp + TT Bw + TT e. (2.20)

P SGLSE = (X TXf) XT yT,

-1 , (2.21) wSGLSE = (BTBP + nlK) BT (yp - XPPSGLSE),

where XT = (I - S)XT and yT = (I - S)yT, with yT = TTy and XT = TTX. The following theorem shows that the loss in efficiency associated with the estimation of the autocorrelation structure is modest in large samples.

Theorem 2.1. Lei the conditions of (A.1), (A.2), (B.1), and (C.1) hold, and assume that X = limn-(X,n-1X'V-1X is nonsingular. Let p0 denote the true value of p, then

p - p^ = V^(psGLSE - p^ + 12), (2.22)

V^(pSGLSE - p^ -n(0,X-1), (2.23)

where — denotes convergence in distribution and fi = (X'TXT)-1xTyT. Assume that = lim„-(X)n-1B'V-1B is nonsingular and let w0 denote the true value of w, then one has

Vw(W - wo) = ^ñ(ivscLSE - wo) + Op^Çn) J '

Vñ(wscLSE - wo) ^ n(0, , (2'25)

wfere W = (BTBT + (yT - XTfi).

2.2. Variable Selection and Penalized Least Squares

Variable and model selection are an indispensable tool for statistical data analysis. However, it has rarely been studied in the semiparametric context. Fan and Li [23] studied penalized weighted least-square estimation with variable selection in semiparametric models for longitudinal data. In this section, we introduce the penalized weighted least-square approach. We propose an algorithm for calculating the penalized weighted least-square estimator of 0 = {ft, w')' in Section 2.3. In Section 2.4 we present the information criteria for the model selection.

From now on, we assume that the matrices XT and BT are standardized so that each column has mean 0 and variance 1. The first term in {2.7) can be regarded as a loss function of p and w, which we will denote by ¡{ft, w). Then expression {2.7) can be written as

L{p, w) = ¡{p, w) + n^w'Kw. {2.26)

The methodology in the previous section can be applied to the variable selection via penalized least-square. A form of penalized weighted least-square is

S{p,w) = ¡{fi,w) + n(jZpM (\pt\) +

TjPhi\wjI) ) + n¿,w'Kw-

(2.27)

where p^ (•) are penalty functions and Xi are regularization parameters, which control the model complexity. By minimizing (2.27) with a special construction of the penalty function given in what following some coefficients are estimated as 0, which deletes the corresponding variables, whereas others are not. Thus, the procedure selects variables and estimates coefficients simultaneously. The resulting estimate is called a penalized weighted least-square estimate.

Many penalty functions have been used for penalized least-square and penalized likelihood in various non- and semiparametric models. There are strong connections between the penalized weighted least-square and the variable selection. Denote by 0 = (p', w')' and z = (zi,...,zd+m)' the true parameters and the estimates, respectively. By taking the hard thresholding penalty function

p.x(|0|) = I2 + (|0|- X)2I(|0| <X),

(2.28)

we obtain the hard thresholding rule

p = zI(|z| > 1). (2.29)

The L2 penalty f\(|0|) = 1|e|2 results in a ridge regression and the L1 penalty Pi(|0|) = 1|e| yields a soft thresholding rule

e = sgn(z)I(|z| > 1)+. (2.30)

This solution gives the best subset selection via stepwise deletion and addition. Tibshirani [24, 25] has proposed LASSO, which is the penalized least-square estimate with the L1 penalty, in the general least-square and likelihood settings.

2.3. An Estimation Algorithm

In this section we describe an algorithm for calculating the penalized least-square estimator of 0 = (p', w') . The estimate of 0 minimizes the penalized sum of squares L(0) given by (2.17). First we obtain 0SOLSE in Step 1. In Step 2, we estimate TT by using s obtained in Step

1. Then 0SGLSE is obtained using TT (Step 3). Here the penalty parameters 1, and I, and the number of basis functions m are chosen using information criteria that will be discussed in Section 2.4.

Step 1. First we obtain pSOLSE and wSOLSE by (2.10) and (2.11), respectively. Then we have the model

y = bwsolse + Xp SOLSE + e. (2.31)

Step 2. An estimator for TT is constructed followings the work of Aneiros-Perez and Vilar-Fernandez [4]. We use the residuals e = y - BwSOLSE - XpSOLSE to construct an estimate of TT using the ordinary least square method applied to the model

Si = a1Si-1 + ••• + aTSi-T + residuali. (2.32)

The estimator TT is obtained from TT by replacing parameters with their estimates. Step 3. Our SGLSE of 0 is obtained by using the model

yT = Bt w + Xfp + ef, (2.33)

where yT = TTy, BT = TTB, XT = TTX, and eT = TTe. Finding the solution of the penalized least-square of (2.27) needs the local quadratic approximation, because the L1 and hard thresholding penalty are irregular at the origin and may not have second derivatives at some points. We follow the methodology of Fan and Li [14]. Suppose that we are given an initial

value 0(0) that is close to the minimizer of (2.27). If e(0) is very close to 0, then set e(0) = 0. Otherwise they can be locally approximated by a quadratic function as

fP f|e(°)|) 1

[pi-,e)]' = P'X](e|)sgn(] « | (07)| jey, (2.34)

when ej0) / 0. Therefore, the minimization problem (2.27) can be reduced to a quadratic minimization problem and the Newton-Raphson algorithm can be used. The right-hand side of equation (2.27) can be locally approximated by

l(p°,w°) + Vlp(p, w)'(p- p„) + Vlw(p0,w°) (w - w0)

'0) T- v lw\p0

+ 2 (p -p0)'w„) (p - p°) + 2(w - w°)lVlw(p°,w°)(w - w0), (2.35) 1

2 (p° - p)'V^ro(w - w°) + np'Zh(p°)p + nw'Xi2(wc)w,

vi,(p°, w°) = , viro (p°, w°) = ^ w°)

dp ' wvr dw

V2l Cp w\ ^p°' w°) V2l Cp ^ d2l(p0, w°)

V ^= dpdp' , V Mp°,= dwdw' '

V2lprW(p°, w°) =

d2l(p °, w°)

f p ik(0)b

£11 (p°) = diag

1 K1 ..... i,

f p1 (|wi0)D P'x (|w0?

£12(w°) = diag j Pl2VI 1 U Pl2VI 0

, (0) (0) ^ W1' wm

(2.36)

The solution can be found by iteratively computing the block matrix equation:

Xx* + n£i1(p(0)) XlB, \/p^

BTXt BTBt + aK + n£Xl (w(0) ) / \w,

This gives the estimators

wSGLSE =

fisglse = (XtXt + nzh (fi(0)))-1Xtyf, whtlse = (b;~Bt + n^K + nXl2 (w(0)^(-1)B^(^yf - Xffi,

fiHT \

fi SGLSE )'

(2.38)

where yf = (I - St)yfr Xf = (I - St)Xfr and St = Bf(btBt + n£K + nLkl(w(0)))-1BT. 2.4. Information Criteria

Selecting suitable values for the penalty parameters and number of basis functions is crucial to obtaining good curve fitting and variable selection. The estimate of 0 minimizes the penalized sum of squares L(0) given by (2.17). In this section, we express the model (2.15) as

where AT = (XT, BT) and 0 = (fi', w')'. In many applications, the number of basis functions m needs to be large to adequately capture the trend. To determine the number of basis functions, all models with m < mmax are fitted and the preferred model minimizes some model selection criteria.

The Schwarz BIC is given by

where al is the least-square estimate of ol without a degree of freedom correction. Hastie and Tibshirani [16] used the trace of the smoother matrix as an approximation to the effective number of parameters. By replacing the number of parameters in BIC by trS^, we formally obtain information criteria for the basis function Gaussian regression model in the form

yT = AT 0 + e,

(2.39)

BIC = nlog{2na^j + logn(the number of parameters),

(2.40)

BICm = n log(2noe) + (trS0) log n,

(2.41)

where o^ = n 1 ||y - S0y||2 and

Here XA(0) is defined by (2.44) in what follows.

We also consider the use of the BICp criterion to choose appropriate values for these unknown parameters. Denote

2Xl (p) = diag{pli (|fiic|).....Pi

(2.43)

Xl2 (w) = diag{ pl2 (|roio ,

XA(0) = (XAi (p), 2Xl (w)). (2.44)

Let N and N2 be the number of zero components in po and w0, respectively. Then the BICp criterion is

BICp = n log(lno2) + n0 2a(0)0 + nlQK0 + log|j^0) |

i i (2.45)

- log | k| - log |2a(0) |+ - (m - N2) log I + Const,

where Jg(0) is the (d + m + 1) x (d + m + 1) matrix of second derivatives of the penalized likelihood defined by

(a;At + nSA(0) + n^K

lnAA; n

(2.46)

\ 0e2 2o1 /

Here A is a diagonal matrix with ¿th element Aj = diag[e1,...,en] and 1n = (1,..., 1)'. The n-dimensional vector q has ¿th element (Tyyy - AT/Jj0j)2/2dj -1/2^ where Tjy is the element in the ¿th row and jth column of TT. Also K is the (d + m) x (d + m) matrix defined by

, K Od,m .

K M O O ), (2 47)

and |K|+ and |2x(0)|+ are the product of the (m-N1) and (d+m-Ni-N2) nonzero eigenvalues of K and 2i(0), respectively.

Konishi and Kitagawa [15] proposed a framework of Generalized Information Criteria (GIC) to the case where the models are not estimated by maximum likelihood. Hence, we also consider the use of GIC for the model evaluations. The GIC for the hard thresholding penalty function is given by

GIC = n log(2n02) + n + 2tr j IgJg} ,

(2.48)

where Ig is a (m + d + 1) x (m + d + 1) matrix. Also Ig is basically the product of the empirical influence function and the score function. It is defined by

Ic . J_ " - {"»•^AAT.ah). (2.49)

The number of basis functions m, penalty parameters l,X1,X2 are determined by minimizing BICm, BICp or GIC.

2.5. Sampling Properties

We now study the asymptotic properties of the estimate resulting from the penalized least-square function (2.27).

First we establish the convergence rate of the penalized profile least-square estimator. Assume that penalty functions p'Xy (■) and p'Xi (■) are negative and nondecreasing with Px (0) = Px (0) = 0. Let ff0 and w0 denote the true values of ff and w, respectively. Also let1'

ain = m^pj\pj0\)\ : j f 0}. a2n = max^pfj\wj0 ^ : wj0 / 0}.

(2.50)

bin = max] |fto|)| : j f o}. b2n = ma«( ^ (|w;o|)| : Wjo f o}.

Theorem 2.2. Under the conditions of Theorem 2.1, if a1n,b1n,a2n, and b2n tend to 0 as n ^ to,

---- -"-HT

then with probability tending to 1, there exist local minimizers f and W of L(ff, w) such that || f SC3LSE-foil = °p(n-1/2 + ain) and |WHLTOSE - wo| = Op(n-1/2 + a2n).

Theorem 2.2 demonstrates how the rate of convergence of the penalized least-square

---HT ---'HT '

estimator 0SGLSE = (fSGLSE, wsSGLse) of L(0) depends on X' for i = 1,2. To achieve the root n convergence rate, we have to take Xij small enough so that an = OP(n-1/2).

Next we establish the oracle property for the penalized least-square estimator. Let fSi consist of all nonzero components of ff0 and let fNi consist of all zero components. Let wS2 consist of all nonzero components of w0 and let wN2 consist of all zero components. Let

*(0'P0 f Psi + *Wi (OPNi f XSi (i)'PSi.

^(i)'wo f <p's2wS2 + <N2(0WN f <S2(t)'Ws2.

bP f (p'xin (|Pio|)sgn(Pio).....p'xSin (|Psi0|)sgn(Psi0^'.

bw f (p'x2n(lwio|)sgn(wio).....p'xSnn(\ws2o\)sgn(ws2o)) .

(2.5i)

Further, let p1 consist of the first S1 components of p and let p2 consist of the last d - S1

components of PSGLSE. Let w1 consist of the first S2 components of w and let w2 consist of the

SGLSE'

last m - S2 components of wHT

Theorem 2.3. Assume that for j = 1,...,d and k = 1,...,m, one has A1 ^ 0, -JnA1 ^ to, A2 ^ 0 and -JnA2 ^ to. Assume that the penalty functions p'A dfy |) and p'A (|wk |) satisfy

PA (Pi)

lim inf lim inf—^-> 0,

n^<x> ¡¡j^o+ Ai

(2.53)

Pa fak ) lim inf lim inf——-> 0.

n > to Wk ^ 0+ A 2

If a1n = a2n = Op (n 1/2) then, under the conditions of Theorem 2.1, with probability tending to 1, the

root n consistent local minimizers ^HTLSE = (p1, p2) and wHJLSE = (w'^ w2)' in Theorem 2.2 must satisfy the following:

(1) (sparsity) p2 = w2 = 0;

(2) (asymptotic normality)

Vn(Is1 + £i1 (ft))(K - $10 + (IS1 + (P))-1b^ NsJo,X^),

(2.54)

Vn(Is2 + XA1 (w)) (w - w10 + (Is2 + (w) + ¿K)-1bro) Ns^0,X2(11))

Here and consist of the first S1 and S2 rows and columns of and Z2 defined in Theorem 2.1, respectively.

3. Numerical Simulations

We now assess the performance of semiparametric estimators of the proposed in previous section via simulations. We generate simulation data from the model

yj = a(tj) + p'xj + ej, (3.1)

where a(tj) = exp(-3(j/n)) sin(3nj/n), p = (3,1.5,0,0.2,0,0,0)' and e(t) is a Gaussian AR(1) process with autoregressive coefficient p. We used the radial basis function network modeling to fit the trend component. We simulate the covariate vector x from a normal distribution with mean 0 and cov(xj,xj) = 0.5|j-j|. In each case, the autoregressive coefficient is set to 0, 0.25, 0.5 or 0.75 and the sample size n is set to 50, 100 or 200. Figure 1 depicts some examples of simulation data.

We compare the effectiveness of our proposed procedure (PLS + HT) with an existing procedure (PLS). We also compare the performance of the information criteria BICm, GIC and BICp for evaluating the models. As discussed in Section 3, the proposed procedure (PLS + HT) excludes basis functions as well as explanatory variables.

0 10 20 30 40 50 (a)

0 20 40 60 80 100 (b)

0 50 100 150 200

Figure 1: Simulation data with (a) n = 50 and p = 0.5, (b) n = 100 and p = 0.5, (c) n = 200 and p = 0.5. The dotted lines represent a(t); the solid lines a(t) + e(t).

First we assess the performance of a(t) by the square root of average squared errors (RASEa):

RASE« =

ng1idX i"(tk) - a(tk)}2,

where {tk,k = 1, ...,ngrid} are the grid points at which the baseline function a(-) is estimated. In our simulation, we use ngr;d = 200. Table 1 shows the means and standard deviations of RASEa for p = 0,0.25,0.5,0.75 based on 500 simulations. RASEa increases as the autoregressive coefficient increases but decreases as the sample size increases. From Table 1, we see that the proposed procedure (PLS + HT) works better than PLS and that models evaluated by BICp work better than those based on BICm or GIC.

Then the performance of ß is assessed by the square root of average squared errors (RASE^):

RASE^ =

a \ 2 - ß). (3.3)

The means and standard deviations of RASE^ for p = 0,0.25,0.5,0.75 based on 500 simulations are shown in Table 2. We can see that the proposed procedure (PLS + HT) works better than the existing procedure. There is almost no change in RASE^ as the autoregressive coefficient changes (unlike the procedure of You and Chen [10]), whereas RASE^ depends strongly on the information, BICP works the best among the criteria. We can also confirm the consistency of the estimator, that is RASE^ decreases as the sample size increases.

Table 1: Means (standard deviations) of RASE,

p = 0.0 p = 0.25 p = 0.50 p = 0.75

n = 50

PLS BICm 0.069 (0.013) 0.081(0.017) 0.114 (0.037) 0.232 (0.115)

GIC 0.047 (0.013) 0.062 (0.015) 0.106 (0.037) 0.229 (0.120)

BICp 0.042 (0.010) 0.060 (0.019) 0.103 (0.039) 0.226 (0.124)

PLS + HT BICm 0.061 (0.040) 0.070 (0.021) 0.101 (0.038) 0.226 (0.103)

GIC 0.053 (0.017) 0.068 (0.020) 0.101 (0.034) 0.218 (0.097)

BICp 0.046 (0.015) 0.060 (0.019) 0.093 (0.034) 0.214 (0.101)

n = 100

PLS BICm 0.041(0.008) 0.052 (0.012) 0.080 (0.025) 0.172 (0.080)

GIC 0.034 (0.008) 0.044 (0.011) 0.074 (0.026) 0.170 (0.080)

BICp 0.036 (0.010) 0.044 (0.010) 0.070 (0.024) 0.163 (0.079)

PLS + HT BICm 0.042 (0.008) 0.051 (0.016) 0.080 (0.024) 0.172 (0.079)

GIC 0.040 (0.015) 0.048 (0.016) 0.073 (0.024) 0.168 (0.078)

BICp 0.037 (0.011) 0.041 (0.011) 0.068 (0.023) 0.158 (0.075)

n = 200

PLS BICm 0.029 (0.005) 0.040 (0.016) 0.058 (0.018) 0.129 (0.056)

GIC 0.025 (0.008) 0.033 (0.010) 0.056 (0.018) 0.125 (0.057)

BICp 0.029 (0.006) 0.031 (0.007) 0.050 (0.015) 0.114 (0.052)

PLS + HT BICm 0.030 (0.005) 0.040 (0.016) 0.058 (0.019) 0.127 (0.053)

GIC 0.027 (0.009) 0.033 (0.011) 0.054 (0.015) 0.123 (0.054)

BICp 0.019 (0.008) 0.028 (0.009) 0.047 (0.018) 0.109 (0.048)

Table 2: Means (standard deviations) of RASE^.

p = 0.0 p = 0.25 p = 0.50 p = 0.75

n = 50

PLS BICm 0.022 (0.007) 0.023 (0.007) 0.022 (0.007) 0.020 (0.007)

GIC 0.021 (0.006) 0.023 (0.007) 0.023 (0.010) 0.021 (0.007)

BICp 0.021(0.006) 0.022 (0.007) 0.022 (0.009) 0.020 (0.007)

PLS + HT BICm 0.011(0.005) 0.013 (0.007) 0.012 (0.007) 0.010 (0.005)

GIC 0.010 (0.004) 0.013 (0.007) 0.013 (0.009) 0.011(0.006)

_BICp_0.010 (0.004)_0.011 (0.005)_0.011 (0.006)_0.010 (0.005)

n = 100

PLS BICm 0.014 (0.004) 0.014 (0.004) 0.014 (0.005) 0.012 (0.004)

GIC 0.013 (0.004) 0.014 (0.004) 0.013 (0.004) 0.012 (0.004)

BICp 0.014 (0.004) 0.014 (0.004) 0.013 (0.004) 0.011 (0.004)

PLS + HT BICm 0.007 (0.003) 0.008 (0.004) 0.007 (0.004) 0.006 (0.003)

GIC 0.007 (0.003) 0.008 (0.004) 0.007 (0.003) 0.006 (0.003)

_BICp_0.007 (0.003)_0.007 (0.003)_0.006 (0.003)_0.006 (0.003)

n = 200

PLS BICm 0.009 (0.003) 0.009 (0.003) 0.009 (0.003) 0.007 (0.002) GIC 0.009 (0.003) 0.009 (0.003) 0.008 (0.003) 0.007 (0.002) BICp 0.009 (0.003) 0.009 (0.003) 0.008 (0.002) 0.007 (0.002) PLS + HT BICm 0.004 (0.002) 0.005 (0.002) 0.005 (0.002) 0.005 (0.002) GIC 0.005 (0.002) 0.005 (0.002) 0.005 (0.002) 0.004 (0.002) _BICp_0.005 (0.002)_0.005 (0.002)_0.004 (0.002)_0.005 (0.002)

Table 3: Means (standard deviations) of first step ahead prediction errors.

p = 0.0_p = 0.25_p = 0.50_p = 0.75

n = 50

PLS BICm 0.136 (0.115) 0.150 (0.116) 0.140 (0.120) 0.158 (0.117)

GIC 0.111 (0.088) 0.127 (0.097) 0.134 (0.098) 0.149 (0.122)

BICp 0.111 (0.088) 0.127 (0.097) 0.131 (0.095) 0.149 (0.122)

PLS + HT BICm 0.121 (0.096) 0.106 (0.086) 0.119 (0.092) 0.139 (0.112)

GIC 0.094 (0.071) 0.118 (0.093) 0.126 (0.094) 0.139 (0.112)

BICp 0.095 (0.071) 0.116 (0.092) 0.124 (0.093) 0.139 (0.112)

n = 100 PLS BICm 0.101 (0.086) 0.105 (0.082) 0.130 (0.112) 0.145 (0.124)

GIC 0.090 (0.070) 0.101 (0.078) 0.105 (0.082) 0.137 (0.109)

BICp 0.091 (0.070) 0.096 (0.072) 0.105 (0.092) 0.137 (0.109)

PLS + HT BICm 0.097 (0.082) 0.096 (0.078) 0.098 (0.088) 0.140 (0.162)

GIC 0.084 (0.063) 0.091 (0.071) 0.103 (0.081) 0.130 (0.111)

BICp 0.084 (0.063) 0.091 (0.071) 0.103 (0.081) 0.130 (0.111)

n = 200

PLS BICm 0.091 (0.070) 0.105 (0.081) 0.114 (0.087) 0.174 (0.129) GIC 0.087 (0.068) 0.095 (0.072) 0.102 (0.077) 0.139 (0.114) BICp 0.086 (0.068) 0.095 (0.072) 0.102 (0.077) 0.139 (0.114) PLS + HT BICm 0.084 (0.066) 0.090 (0.069) 0.091 (0.068) 0.123 (0.096) GIC 0.083 (0.063) 0.090 (0.069) 0.098 (0.076) 0.126 (0.100) _BICp_0.082 (0.063)_0.092 (0.070)_0.098 (0.076)_0.126 (0.100)

The first step ahead prediction error (PE), which is defined as

PE = V(yn+1 - yn+1|n)2, (3.4)

is also investigated. Table 3 shows the means and standard errors of PE for p = 0,0.25,0.5,0.75 based on 500 simulations. The PE increases as the autoregressive coefficient increases, but the PE decreases as the sample size increases. From Table 3, we see that PLS + HT works better than the existing procedures and there is almost no difference in the PE depending on the information criteria. The models evaluated by BICm perform well for large sample sizes.

The means and standard deviations of the number and deviation of basis functions are shown in Tables 4 and 5. The BICp gives a smaller number of basis functions than the other information criteria. The models evaluated by BICp also give smaller standard deviations of the number of basis functions. The models determined by BICp tend to choose larger deviations of basis functions than those based on BICm and GIC. The number of basis functions increases gradually as the sample size or p increase. From Table 4, it appears that the number of basis functions does not depend on the sample size n. From Table 5, it also appears that the deviations of basis functions do not depend on the sample size n and p.

We now compare the performance of our procedure with existing procedures in terms of the reduction of model complexity. Table 6 shows simulation results of the means and standard deviations of the number of parameters excluded (ft = 0 or w = 0) by the proposed procedure. The results indicate that the proposed procedure reduces model complexity. From Table 6, It appears that the models determined by BICp tend to exclude fewer parameters

Table 4: Means (standard deviations) of the number of basis functions.

p = 0.0 p = 0.25 p = 0.50 p = 0.75

n = 50

BICm 7.87 (1.38) 8.85 (1.13) 8.76 (1.24) 8.82 (1.17)

GIC 8.06 (1.44) 8.75 (1.27) 8.84 (1.20) 8.84 (1.24)

BICp 6.02 (0.14) 6.15 (0.53) 6.17 (0.37) 6.21 (0.48)

n = 100

BICm 7.98 (1.31) 8.83 (1.17) 8.71 (1.30) 8.71 (1.30)

GIC 8.01 (1.37) 8.91 (1.18) 8.67 (1.29) 8.95 (1.20)

BICp 6.20 (0.50) 6.22 (0.44) 6.31 (0.60) 6.35 (0.66)

n = 200

BICm 7.93 (1.33) 8.18 (1.44) 8.25 (1.48) 8.20 (1.39)

GIC 8.11 (1.35) 8.11 (1.52) 8.39 (1.41) 8.55 (1.37)

BICp 6.15 (0.66) 6.22 (0.73) 6.46 (1.03) 6.93 (1.43)

Table 5: Means (standard deviations) of the deviations of basis functions.

p = 0.0 p = 0.25 p = 0.50 p = 0.75

n = 50

BICm 0.10 (0.02) 0.10 (0.02) 0.10 (0.02) 0.10 (0.02)

GIC 0.11 (0.03) 0.10 (0.03) 0.10 (0.03) 0.10 (0.03)

BICp 0.14 (0.02) 0.18 (0.03) 0.16 (0.03) 0.16 (0.03)

n = 100

BICm 0.10 (0.02) 0.09 (0.02) 0.09 (0.02) 0.09 (0.03)

GIC 0.11 (0.03) 0.09 (0.02) 0.10 (0.03) 0.09 (0.02)

BICp 0.15 (0.02) 0.15 (0.04) 0.15 (0.03) 0.13 (0.03)

n = 200

BICm 0.10 (0.02) 0.11 (0.03) 0.11 (0.03) 0.10 (0.03)

GIC 0.11 (0.03) 0.12 (0.04) 0.11 (0.04) 0.10 (0.03)

BICp 0.15 (0.03) 0.17 (0.02) 0.16 (0.03) 0.14 (0.04)

and give smaller standard deviations for the number of parameters excluded. This is due to the selection of a smaller number of basis functions compared to the selection based on the other criteria (see Table 4). There is almost no dependence of the number of excluded parameters on p. The models evaluated by BICp give a larger number of excluded parameters as the sample size increases. On the other hand, the models evaluated by BICm or GIC give a smaller number of excluded parameters as the sample size increases.

Table 7 shows the means and standard deviations of the number of basis functions excluded as w = 0 by the proposed procedure. From Table 7 it appears that the models evaluated by BICp tend to exclude fewer basis functions than those based on GIC and BIC. Again this is due to the selection of a smaller number of basis functions (see Table 4). The models determined by BICp also give smaller standard deviations of the number of basis functions than the other criteria. There is almost no dependence of the number of basis functions on p.

Table 8 shows the means and standard deviations of the number of basis functions excluded as ¡5 = 0 by the proposed procedure. The number of ¡5 which values really 0 was five. From Table 8 we see that the proposed procedure gives nearly five. The models determined

Table 6: Means (standard deviations) of the number of parameters excluded.

p = 0.0 p = 0.25 p = 0.50 p = 0.75

n = 50

PLS + HT BICm 7.715 (0.915) 6.910 (1.087) 7.300 (1.364) 6.888 (1.343)

GIC 8.345 (1.568) 7.404 (1.850) 7.620 (1.715) 7.337 (1.598)

BICp 4.950 (0.419) 5.020 (0.502) 5.070 (0.492) 5.092 (0.421)

n = 100

PLS + HT BICm 7.506 (0.784) 7.334 (1.251) 5.698 (0.772) 5.460 (0.700)

GIC 7.916 (1.239) 7.718 (1.435) 5.906 (0.919) 5.740 (0.866)

BICp 4.990 (0.184) 5.076 (0.332) 5.092 (0.316) 5.086 (0.327)

n = 200

PLS + HT BICm 7.062 (0.723) 5.594 (0.744) 5.544 (0.736) 5.460 (0.702)

GIC 7.450 (1.116) 5.764 (0.847) 5.656 (0.864) 5.586 (0.802)

BICp 5.008 (0.109) 5.152 (0.359) 5.162 (0.385) 5.086 (0.356)

Table 7: Means (stnadard deviations) of the number of basis functions excluded.

p = 0.0 p = 0.25 p = 0.50 p = 0.75

n = 50

BICm 3.52 (2.29) 4.21 (2.23) 3.98 (1.60) 3.96 (1.49)

GIC 3.74 (2.15) 4.40 (1.90) 4.18 (1.51) 4.26 (1.46)

BICp 1.03 (0.22) 1.20 (0.60) 1.28 (0.54) 1.24 (0.49)

n = 100

BICm 3.35 (2.19) 4.49 (2.04) 3.78 (1.58) 3.95 (1.60)

GIC 3.67 (2.15) 4.62 (1.84) 3.91 (1.53) 4.30 (1.60)

BICp 1.06 (0.31) 1.78 (0.96) 1.31 (0.60) 1.36 (0.66)

n = 200

BICm 3.64 (2.13) 3.26 (1.71) 3.26 (1.71) 3.61 (1.60)

GIC 3.86 (2.02) 3.43 (1.81) 3.65 (1.69) 3.89 (1.76)

BICp 1.12 (0.34) 1.23 (0.75) 1.46 (1.03) 1.93 (1.44)

by BICp give results more close to five and smaller standard deviations of the number of basis functions than the other criteria. The number of basis functions approaches five as the sample size increases. The standard deviations of the number of basis functions excluded decrease as p increases. These results indicate that the proposed procedure reduces model complexity.

Table 9 shows the percentage of times that various fa were estimated as being zero. As for the parameters fa = 0, j = 1,2,5, these parameters were not estimated zero for every simulations, we omit to show the corresponding results on Table 9. The results indicate that the proposed procedure excludes insignificant variables and selects significant variables. It can be seen that the proposed procedure gives a better performance as the sample size increases and that BICp is superior to the other criteria.

4. Real Data Analysis

In this section we present the consequence of analyzing the real-time series data using proposed procedure. We use two data in this study; the data about the spirit consumption data in United Kingdom and the association between fertility and female employment in Japan.

1870 1890 1910 1930 Time (a)

2.1 2.05 0

1.95 1.9 1.85 1.8

1870 1890 1910 1930 Time (b)

1870 1890 1910 1930 Time

Figure 2: Application data-set: (a): yi = log(the annual per capita consumption of spirits); (b): xi1 = log(per capita income); (c): xi2 = log(price of spirits).

Table 8: Means (standard deviations) of the number of explanatory variables excluded.

p = 0.0 p = 0.25 p = 0.50 p = 0.75

n = 50

BICm 4.14 (1.60) 4.15 (1.63) 4.69 (0.92) 4.79 (0.74)

GIC 4.28 (1.47) 4.35 (1.41) 4.70 (0.89) 4.72 (0.87)

BICp 4.97 (0.21) 4.95 (0.26) 4.97 (0.23) 4.99 (0.14)

n = 100

BICm 4.15 (1.59) 4.17 (1.55) 4.72 (0.92) 4.77 (0.87)

GIC 4.22 (1.51) 4.29 (1.47) 4.77 (0.84) 4.65 (1.03)

BICp 4.98 (0.14) 4.95 (0.26) 5.00 (0.04) 5.00 (0.06)

n = 200

BICm 4.14 (1.59) 4.78 (0.82) 4.78 (0.82) 4.72 (0.86)

GIC 4.16 (1.55) 4.68 (1.01) 4.75 (0.88) 4.66 (1.04)

BICp 4.99 (0.11) 4.99 (0.15) 5.00 (0.00) 5.00 (0.04)

4.1. The Spirit Consumption Data in the United Kingdom

We now illustrate our theory through an application to spirit consumption data for the United Kingdom from 1870 to 1938. The data-set can be found in Fuller [26, page 523]. In this dataset, the dependent variable yi is the logarithm of the annual per capita consumption of spirits. The explanatory variables xi1 and xi2 are the logarithms of per capita income and the price of spirits, respectively, and xi3 = xiixi2. Figure 2 shows that there is a change-point at the start of the First World War (1914). Therefore, we prepare a variable z: z = 0 from 1870 to 1914 and

Journal of Probability and Statistics Table 9: Percentage of times pi is estimated as zero.

p3 = 0 p4 = 0 p6 = 0 p7 = 0 P8 = 0

n = 50 p = 0

BICm 0.84 0.83 0.82 0.83 0.82

GIC 0.87 0.85 0.87 0.86 0.83

BICp 1.00 0.99 0.99 1.00 1.00

p = 0.25

BICm 0.83 0.83 0.84 0.83 0.83

GIC 0.86 0.86 0.86 0.89 0.87

BICp 0.99 0.99 0.99 0.99 0.98

p = 0.50

BICm 0.95 0.93 0.94 0.94 0.93

GIC 0.94 0.94 0.93 0.94 0.95

BICp 0.99 0.99 1.00 1.00 0.99

p = 0.75

BICm 0.96 0.96 0.95 0.95 0.97

GIC 0.94 0.93 0.95 0.94 0.96

BICp 1.00 1.00 1.00 1.00 1.00

n = 100 p = 0

BICm 0.83 0.83 0.84 0.82 0.82

GIC 0.85 0.84 0.85 0.84 0.84

BICp 1.00 0.99 1.00 0.99 1.00

p = 0.25

BICm 0.83 0.84 0.83 0.82 0.85

GIC 0.87 0.85 0.88 0.85 0.85

BICp 0.99 0.99 0.99 0.99 1.00

p = 0.50

BICm 0.95 0.93 0.95 0.95 0.94

GIC 0.96 0.95 0.94 0.96 0.95

BICp 1.00 1.00 1.00 1.00 1.00

p = 0.75

BICm 0.96 0.95 0.95 0.95 0.95

GIC 0.93 0.94 0.94 0.92 0.92

BICp 1.00 1.00 1.00 1.00 1.00

n = 200 p = 0

BICm 0.92 0.93 0.92 0.91 0.94

GIC 0.94 0.94 0.94 0.95 0.95

BICp 1.00 1.00 1.00 1.00 0.99

Table 9: Continued.

ß3 = 0 ß4 = 0 ß6 = 0 ß7 = 0 ß8 = 0

p = 0.25

BICm 0.95 0.94 0.94 0.95 0.93

GIC 0.94 0.94 0.94 0.94 0.93

BICp 1.00 1.00 1.00 1.00 1.00

p = 0.50

BICm 0.97 0.95 0.95 0.96 0.95

GIC 0.96 0.95 0.95 0.94 0.95

BICp 1.00 1.00 1.00 1.00 1.00

p = 0.75

BICm 0.96 0.94 0.95 0.95 0.93

GIC 0.93 0.93 0.93 0.94 0.94

BICp 1.00 1.00 1.00 1.00 1.00

Table 10: Estimated Coefficients for Model 4.1.

PLS estimators SE PLS + HT estimators SE

ßi -0.653 3.080 0

ß2 -1.121 5.962 0

ß3 1.842 9.164 0

ß4 3.570 3.761 2.395 0.421

ß5 -2.553 4.455 0

ß6 -1.25 7.763 -2.411 0.524

z = 1 from 1915 to 1933. From this we derive another three explanatory variables: Xi4 — Xi 1Z, xi5 — xi2z, and xi6 — xi3z. We consider the semiparametric model:

yi — a(ti)+ + £U i — 1,...,69. (4.1)

We computed the basis function estimate for a using the existing procedure (PLS) and the proposed procedure (PLS + HT) with BICp. The resulting estimates and standard errors (SEs) of {$ are given in Table 10. The selected number of basis function is seven with one excluded basis function and the spread parameter s is estimated as 0.12. Therefore, we obtain the model

yi — a(ti)+ 2.395xi4 - 2.411xi5, i — 1,...,69. (4.2)

The results indicate that the proposed procedure excludes some variable and reduces model complexity. Table 10 shows that PLS + HT selects only ¡4 and ¡6. That indicates possible interactions between consumption and income and between consumption and incomexprice after 1915. Consumption increases as income increases; however, as income increases and the price also increases, consumption decreases. We plot the estimated trend curve, residuals and autocorrelations functions in Figures 3 to 5. The residual mean square is 1.7 x 10-4.

0 \)J\

0.5 lAi. 1

-1 - Ym-, VM

1.5 x

1870 1880 1890 1900 1910 1920 1930 1940

Figure 3: Plots of estimated curves. The solid line represents y; the dotted lines are the estimates of y; the dashed lines are the estimated curve a.

Figure 4: Plot of residuals.

You and Chen [10] used the following semiparametric partially linear regression

model:

yi = a(ti) + p1xi1 + p2xi2 + ei, i = 1,...,69. (4.3)

The semiparametric least-square (SLS) regression gives yi = a(ti) + 0.65xi1 - 0.95xi2. The residual mean square is 2.2 x 10-4, which is more than that of our SGLSE fit. For a fair comparison, we use model (4.3) to revise You and Chen's estimation. Our semiparametric generalized least-square gives yi = a(ti) - 0.71xi2. The result indicates that xi1 is insignificant in model (4.3).

-0.5 -1 1.5

Figure 5: ACF plot of residuals.

Figure 6: Plots of standardized total fertility rate and female labor force participation rate for women aged 15 to 49 in Japan, 1968-2007. The solid line represents standardized TRF; the dotted lines are standardized FLP.

4.2. The Association between Fertility and Female Employment in Japan

Recent literature finds that in OECD countries the cross-country correlation between the total fertility rate and the female labor force participation rate, which until the beginning of the 1980s had a negative value, has since acquired a positive value. See for example, Brewster and Rindfuss [27], Ahn and Mira [28], and Engelhardt et al. [29]. This result is often interpreted as evidence for a changing sign in the time series association between fertility and female employment within OECD countries.

However, OECD countries, including Japan, have different cultural backgrounds. We investigate whether or not the relation between the total fertility rate (TFR) and the female labor force participation rate (FLP) has changed in Japan from a negative value to a positive value. This application challenges previous findings and could be good news for policy

Figure 7: Plots of estimated curves, the solid line represents y; the dotted lines are the estimated curves of y; the dashed lines are the estimated curves of a.

Table 11: Estimated coefficients for Model 4.4.

log (FLP) PLS estimators SE PLS + HT estimators SE

for 1968-1984 -0.32 -1.99 -0.36 -2.16

for 1984-2007 -0.28 -2.00 -0.31 -2.18

for 1968-1979 0.02 0.17 0

for 1980-1989 0 1.37 0

for 1990-1999 -0.04 -0.51 0

for 2000-2007 0.04 0.17 0

makers, as a positive relationship implies that a rising FLP is associated with an increasing TFR.

Usually, FLP contains all women aged 15 to 64. However, TFR is a combination of fertility rates for ages 15-49, so we use the FLP of women aged 15 to 49 instead of women aged 15 to 64. We take the TFR from 1968 to 2007 in Japan. The estimation is a semiparametric regression of log(TFRj) on log(FLPj). As the law of the Equal Employment Act came into force in 1985, we use the interaction variables "dummy for 1968-1984 x log(TFR)" (xi2) and for 1985-2007 (xi3). We also use dummy variables for 1990-1999 and 2000-2007 (xi4, xi5) and consider the semiparametric model

log (TFR) = a(ti) + p'xi + ei, i = 1,...,40. (4.4)

We applied the existing procedure (PLS) and proposed procedure (PLS + HT) with BICp. The resulting estimates and standard errors (SE) of f are given in Table 11. Therefore, we obtain the model

y = a(ti) - 0.27xii - 0.20xi2, i = 1,..., 40. (4.5)

The residual mean square of PLS + HT is 2.24 x 10-2 and that of PLS is 2.47 x 10-2. The selected number of basis functions is six with one excluded basis function and the spread parameter s

0.1 0.05 0

-0.05 -0.1 -0.15

0.8 0.6 0.4 0.2 0 0.2

1970 1980 1990 2000

Figure 8: Plot of residuals.

0 5 10 15

Figure 9: ACF plot of residuals.

is estimated as 0.30. Table 11 shows that PLS + HT selects only log(FLPi) 1968-1984 and 19852007. That indicates a negative correlation between TFR and FLP for 1968-2007, especially for 1968-1984, which means TFR decreases as FLP increases. We could not see the positive association in 80s which has been reported in recent studies, see, for example, Brewster and Rindfuss [27], Ahn and Mira [28], and Engelhardt et al. [29]. We plot the estimated trend curve, residuals and autocorrelation functions in Figures 7 to 9.

5. Concluding Remarks

In this article we have proposed variable and model selection procedures for the semiparametric time series regression. We used the basis functions to fit the trend component. An algorithm of estimation procedures is provided and the asymptotic properties are investigated. From the numerical simulations, we have confirmed that the models determined by

the proposed procedure are superior to those based on the existing procedure. They reduce the complexity of models and give good fitting by excluding basis functions and nuisance explanatory variables.

The development here is limited to the case with Gaussian stationary errors, but it seems likely our approach can be extended to the case with non-Gaussian long-range dependent errors, along with the lines explored in recent work by Aneiros-Perez et al. [30]. However, the efficient estimation for regression parameter is an open question in case of long-range dependence. This is a question we hope to address in future work. We also plan to explore the question of whether the proposed techniques can be extended to the cointegrating regression models with an autoregressive distributed lag framework.

Appendix Proofs

In this appendix we give the proofs of the theorems in Section 2. We use ||x|| to denote the Euclidian norm of x.

Let aT,n = (a1n,..., aT/n)' be the infeasible estimator for aT = (a1f..., aT)' constructed using OLS methods. That is aT,n = (a1 a2,n,..., aT/n)' = (ET Et )-1ET e, where e = (eT+1,... , £n)' and ET = [sirj], i = 1,..., n, j = 1,..., t with ei/j = ei-j-T. For ease of notation, we set ajrn = aj,n = 0 for j > t, and ao,n = ao,n = 1. We write r(k) for cov(s0, ek). Then we can construct the infeasible estimate V using aT,n and r(k), k = 0,..., t. The following lemma states that the estimators ft and w given in Theorem 2.1 have asymptotically normal distributions.

Lemma .1. Under the assumptions of Theorem 2.1, one has

- ft) -N(0,X-1), (A.1)

Vn(w - w) - n(0, X4), (A.2)

where X-1 and X-1 are defined in Theorem 2.1.

Proof of Lemma .1. From model (2.6), y - Xft - Bw can be written as

y - Xft - Bw = y - Bw + (y - Xft) - (y - Xft) - Xft

= (y - Xft) + (X - X) ft - (y - y) - Bw

= (y - Xft) + S(y - Xft) - Bw

= (y - Xft) - B(w - w),

where y, X, and w are given by y = (I-S)y, X = (I-S)X, and w = (B'B+n£K)-1B'e, respectively. Hence w can be expressed without using ß. Then the minimization function L(ß, w) in (2.17) can be written as

L(ß, w) = i{ (y- Xß) V (y- Xß) - 2(w - w)'B'V-1 (y- Xß)

+ (w - w)'B'V-1B(w - w)} + aw'Kw (A.4)

= Ii(ß)+12 (ß, w) + I3 (w) + I4 (w).

First we consider asymptotic normality for w, using the model

y = Xß 0 + Bwo + e. (A.5)

The estimators w minimize the function L(f, w), which yields

^LW;W) = I2 (ff, w) +13 (w) +14 (w)

= -B'V-1 (y - Xf) + BV-1B(w - w0) + B'V-1B(w0 - w) + 2n£K(w - w0) + n£Kw0

Then the minimization of this quadratic function is given by

w - w0 = (B'V-1B + niK} Vv-1 { (y - Xf) - B(w - w0) - n£VB-1Kw0}

= (b'v-1b + niKj-1B'V-^y - Xf)

+ (B'V-1B + n'iKj Vv-1 B(w0 - w)

- n^B'V-1B + n'iKj BV-1VB-1Kw0 = A1 + A2 + A3.

We now deal with A1, A2, and A3. First we evaluate A1. From the expansion (A + aB) 1 = A-1 - aA-1BA-1 + O(a2), we can see that

A1 = (b'V-1 B + nlKj BV-1£

BV-1B „ \ 1 BV-1£ -+ ¿K -

f/b'v-1b\-1 /b'v-1b\-1 /b'v-1b\| BY-1

^^"TT) KW + < ^H^) Kw, + OW J^. (A.8)

( )- 1 SVli £ _ ¿( )- 1 k(EY-iBY" BY-1 £ + c^ BY-1 £

n n n n n n

O(£) + O(?).

B'V-1B\ 1 BV-1

Similarly, we obtain

A2 = (B'V-1B + n'iKj B'V-1B(w0 - w)

BV-1B „ \-1 B'V-1B,

¿K 1 -(wo - w)

{(^V-1)'-<' + O(

B'V-1B ,,

(wo - w)

/b'v-1b\ , /b v-1b

= (wo - w) - ¿1 - 1 K(wo - w) + OU2)( - l(wo - w).

Finally, we can evaluate A3 as follows:

A3 = -(B V-1B + nlKj Vv-1B-1n£Kwo 1

B'v-1B

Bv-B + ¿K) ¿Kwo

= {( - ¿(^H^ ¿Kwo

= -¿( ^ )-'kwo + f( SY* )k( bV-Î)kwo + O(?) Kwo

(A.1o)

We can also observe that the weighted least-square estimates w have a normal distribution. Hence

w - w0 = Op(n-1/2). (A.11)

If i = O(nn) and n < -1/2, then A1, A2, and A3 become

A1 = ( ^ )-1 HT' + O(i) + O^

A2 = (w0 - w) + O(i) x Op (w0 - w) + o(i2) X Op(w0 - w) = Op (n-1/2), and A3 = O(i) + O(i2) + O(i3) = O(i). Therefore, (A.7) can be written as

(A.12)

B' V-1B\ 1 B' V-1

w - w0 = I - ) -£

Op n-1/2 . (A.13)

By the law of large numbers and the central limit theorem,

Vn(w - w0) ^ n(O, X-1). (A.14)

Next we deal with the estimators ff. These minimize the function L(f, w), which yields

df>( f w) _

= I1 (f) + I2(f, w) = -X'V-1£ + X'V-1X(f - ft,) + X'V-1B(w - w). (A.15)

(A.16)

The minimization of this quadratic function is given by

$ = f0 + (x'v-1X)-1{ X'v-1£ + X'v-1B(w - w)} = f0 + (X'V-1X)-1X'V-1{£ + B(w- w)}.

If we substitute w for its estimator w0, from (A.14) and (A.11), we have

$ = f0 + (x'v-1X)-1{ X'v-1£ + X'v-1B(w0 - w)} = f0 + (X'v-1X)-1X'v-1£ + Op(V1/2).

Similarly, by the law of large numbers and the central limit theorem,

-h) DN(0,X-0. (A.18)

This completes the proof of the lemma. □

(A.17)

Proof of Theorem 2.1. Let the estimator aT,n = (a1/n, ...aTrn)' be the ordinary least-square estimate applied to model (2.18). For the ease of notation, we set aj,n = aj,n = 0 for j > t and a0 n = a0 n = 1. Then we write

ei, n = ei Si,n + Ri,n + Qi, n ,

(A.19)

Si,n = jr aj{ ($ - p) xi-j +(w - w)'^i-j j

j=0 ^ J

Ri,n = X (j - aj rn) (yi-j - $Xi-j - w'tyi-^j,

(A.20)

Qi,n = (aj,n - j (yi-j - $ Xi-j - w'^i-0 .

From assumptions (A.1), (A.2), and Lemma .1 we can see that under the assumptions about t and by the Caucy-Schwarz inequality

\Sirn\ <f>;\ ($- f) Xi-j + (w- w)'fa

E aj xi-j + ||w - w||

j=0 j=0

(A.21)

Op n-1/2 .

Next we evaluate Rin. In An et al. [31, proof of Theorem 5]: it is shown that, under the assumptions about t(n),

V^ ^ ((log(n) V/2

j=0 ($> - Hn) =

(A.22)

Thus, by the Cauchy-Schwarz inequality

\Ri,n\ < ( x(aj,n - ajrn)2

1/2 / t A 1/2

I (l (yi-j - $ Xi-j - wv'^i-^ j ,

(A.23)

which yields R^ = o((log(n)/n)1/4))Op(T1/2) = Op (1). Finally, we evaluate Qi/n. By the extended Baxter inequality from Buhlmann [32, proof of Theorem 3.1], we have

ajn - an < C ^ | aj |.

aj,n aj

(A.24)

j=T + 1

Notice that y- - // xi-j - W^- = e,n. Since ein is a stationary and invertible process whose linear process coefficients satisfy the given summability assumptions, we have for some

\Qi,n\ < M £Iaj,n - aj| < \aj\ = op(1). (A.25)

j=0 j=T+1

From the above decomposition and evaluation, we can see that

y - Xft - Bw = yT - Xtft - Btw + Op(1). (A.26)

Therefore, in order to prove the second equation in Theorem 2.1 we just need to show

n (x'v;' X - X 'v;1X) = Or(Q)1 1 (B'V - B - B'VB) = Op((I)1 -L X'(V-1 - V-1)e = Op (( n f)

(A.27)

-L„(V, - v,)£ = o/Q) "2

To see the above results are true, let yTii be the ith element yT of model (2.20). We have for TTii (the ith row of TT), XTii (the ith column of XT), and BTii (the ith column of BT)

ei = T^e = ei + ^(aj - a^e-, j=1

XT ,ij = TT, j • XT, i = ij + X ajXXi-j,i + X (aj - ah)XXi-j,i

j=1 j=1

= XXTiij + X(2/ - aj)Xi-j,i, (A.28)

BTAj = TT,j • BT,i = BT,ij + XajBi-jii + ^(aj - a^Bi-j^ j=1 j=1

= Bt,/ + X (aj - aj)Bi-ji j=1

for i = t + 1,t + 2,...,n, with similar expressions holding for i = 1, 2,...,t. By (A.26) and the fact that 11at - a|| = Op((T/n)1/2) (see Xiao et al. [22]), it follows that

1 n 1 n ( / TV 1/2)

^= ^^+Op((n) ), n X X^X^k = i jj Y^fc + O/ (n) ),

(A.29)

1 n 1 n / /t\ 1/2\

^= ^^+°-((n) ),

nsft-®«=nsBT,ii<+°p((n))•

Therefore, using the expansion (A + aB) 1 = A 1 - aA 1BA 1 + 0(a2) and from (A.17), (A.13) and (A.27), we have

^(feu* - ft) = ( "'( ;£={ X'V-£ + 0„(„-'2)})

x^x + / ( "

*(' ^£ + or((")1!1) + 0,(,r')

Vn p\\n

= V"(/ - A,) + op((")1/2

V"(WSGLSE - wo) = ({ -1n (b'V-1£ + Op (n~1/2

B'V-1B ^ /(T\1/2\\-1/B'V-1 ^ /(T\-1/2

Op -) + Op -

n ■ ^v\\n) )) \ vn p\\nJ

= V"(/ - /o) + op((")

(0" + O^"-1)

(A.3o)

This completes the proof of Theorem 2.1. □

Proof of Theorem 2.2. We write an = n 1/2 + an. It is sufficient to show that, for any given Z> 0, there exist large constants C such that

M inf S(00 + «nu) > S(0oH < 1 - Z- (A.31)

This implies, with probability at least 1 - Z, that there exists a local minimizer in the balls {00 + anu : ||u|| < C}. Define

Dn(u) = S(0o + anu) -S(0o). (A.32)

Note that p\jn (0) = 0 and that p\jn (\dj|) is nonnegative, so Dn(u) > n-1{l(00 + anu) -1(00)}

+ X {pj«(I j + anUj0 - px,n(|fyj|)} (A.33)

+ 1(00 + anu)'!K(00 + anu) - 0K00,

where l(0) is the first term of (2.7) and K is defined in (2.47). Now 1

2n-1{l(00 + anu) -1(0)}

a2„ , f B'VB 1 , f B'V

Op( , , - - i , n1n

in ' fB'VB ' fB'V f _1/2\]

-—— + Opui -ainu^ — £ + Op{n ' Jj (A.34)

' f X'V- 1X .„ 1 ' f X'V- 1 / 1

u2j —n— + Op(1) - a2nuA n £ + Op[n 1

Note that B'V-1B/n ^ X1, B'V-1£/n ^ ¿1, X'V-1X/n ^ X2, and X'V-1£/n ^ ¿2 are finite positive definite matrices in probability. So the first term in the right side of (A.34) is of order Op(C^a!n), and the second term is of order Op(C1n-1/2a1n) = Op(Ca2n). Similarly, the third term of (A.34) is of order Op (C^n) and the fourth term is order Op(C2n-1/2a2n). Furthermore,

Iwj0 + a1nujD -pjni1 wj0D}, (A.35)

X{pv(1 j + a2nUj0 -px,ln(1 jD }' (A.36)

are bounded by

Vma1na1n||u|| + a^buMI = Ca2n(-Jm + b^C),

, v (A.37)

Vda2na2n||u|| + anb2n|u|2 = Ca^Vd + b2nCj

by the Taylor expansion and the Cauchy-Schwarz inequality. As bn ^ 0, the first term on the right side of (A.34) will dominate (A.35) and (A.36) as well as the second term on the right side of (A.34), by taking C sufficiently large. Hence (A.31) holds for sufficiently large C. This completes the proof of the theorem. □

Lemma .2. Under the conditions of Theorem 2.3, with probability tending 1,for any given ft and w, satisfying Hft1 - ft10|| = ||w1 - w10|| = Op(n-1/2) and any constant C,

j{ (ft„»')', (w„ „')} = 1ftilsCin-,i;!,ilwilsCl„-,/,S{ (ft-, ft2)',« w2)'} - (A38)

Proof. We show that with probability tending to 1, as n ^ to, for any ft1 and w1 satisfying ||ft1 - ftlol = lwl - Wlol = Op(n-1/2), ||ft2l < C1 n-1/2, and lw21 < C2n-1/2, dl(ft, w)/dfy and fy have the same signs for fy e (-C1n-1/2,C1n-1/2), for j = S1 + 1,...,d. Also dl(ft,w)/dwj and Wj have the same signs for Wj e (-C2n-1/2,C2n-1/2), for j = S2 + 1,...,m. Thus the minimization is attained at ft2 = w2 = 0. For fy / 0 and j = S1 + 1,...,d,

= lj(fy) + npjn (|fy|)sgn(fy), (A.39)

where lj (ft) = dl(ft)/dfyj. By the proof of Theorem 2.1,

lj (ft) = -n{&j - (ft - ft0)'Z1j + Op(n-1/2) }, (A.40)

where l2j is the jth component of ¿2n and is the jth column of X1. From ||ft - ft0l = Op (n-1/2), n-1 lj (ft) is of order Op(n-1/2). Therefore,

dSft) = n\j2n{Xj1np'Xnn (|fy;|)sgn(fyj) + °p(n-1/2/A1n)}. (A.41)

Because lim infn^^.liminffy ^0+A-11np^i (\ftj\) > 0and n-1/2A;1n ^ 0, the sign of the derivative is completely determined by that of fy. For Wj / 0 and j = S1 + 1,...,m,

dS (w)

-Wi = lj (w) + npxnn( | Wj |) sgn(w;) + 2n¿Kw, (A.42)

where Ij (w) = dl(w)/dwj. By the proof of Theorem 2.1,

j(w) = -n{$1j - (w - w0)'X2j + Op(n-1/2) }, (A.43)

where is the jth component of ¿1n and X2j is the jth column of X2. From ||w - w0|| = Op(n-1/2), n-1lj(w) is of order Op(n-1/2). Therefore,

nXj2n{Xj2np'Xjln(\wj\)sgn(wj) + Op(V1/2A2n) }. (A.44)

Because liminfn^^liminfWj^0+jnp^2 (\wj\) > 0 and n 1/2Xj2n ^ 0, n 1/2Xj2n ^ 0, the sign of the derivative is completely determined by that of wj. This completes the proof. □

Proof of Theorem 2.3. Part (a) follows directly from follows by Lemma .2. Now we prove part (b). Using an argument similar to the proof of Theorem 2.1, it can be shown that there exist a w1 and f1 in Theorem 2.3 that are a root-n consistent local minimizer of Sf(w'1;0')'} and S{(f', 0') }, satisfying the penalized least-square equations:

dS( w', 0'Y

—^-- = o,

ds(f i, 0')'

(A.45)

Following the proof of Theorem 2.1, we have

aVw'", 0')'

-Kdw 7 = n [$2(1) + Op(n~1/2J + n^Kw + 2(1) + Op(1)\ (w 1 - w10)]

+ n [bn + XA1 (w){ 1 + Op(1^(w1 - ww)],

' (A.46)

aVf'", 0')

' = n [$1(1) + Op(n~1/2^ + {x 1(1) + Op (1)} ($ 1 - pw)] n [bn + X + Xx2 (f){1 + Op(1)} ($ 1 - P10)]

where ¿1(1) and ¿2(1) consist of the first Sj, j = 1,2,...,S1 and j = 1,2,...,S2 components of and |2 respectively. Also X1(1) and X2(1) consist of the first Sj, j = 1,2,...,S1 and j = 1,2,...,S2 rows and columns of X1 and X2, respectively.

Therefore, similar to the proof of Theorem 2.1 and by Slutsky's theorem, it follows that

Acknowledgments

The authors are grateful to two anonymous referees whose probing questions have led to a

substantial improvement of the paper. This research was supported by the Norinchukin Bank

and Nochu Information System endowed chair of Financial Engineering in the Department

of Management Science, Tokyo University of Science.

References

[1] N. S. Altman, "Kernel smoothing of data with correlated errors," Journal of the American Statistical Association, vol. 85, pp. 749-759,1990.

[2] J. D. Hart, "Kernel regression estimation with time series errors," Journal of the Royal Statistical Society: Series B, vol. 53, pp. 173-188,1991.

[3] E. Herrmann, T. Gasser, and A. Kneip, "Choice of bandwidth for kernel regression when residuals are correlated," Biometrika, vol. 79, pp. 783-795,1992.

[4] P. Hall and J. D. Hart, "Nonparametric regression with long-range dependence," Stochastic Processes and Their Applications, vol. 36, pp. 339-351,1990.

[5] B. K. Ray and R. S. Tsay, "Bandwidth selection for kernel regression with long-range dependence," Biometrika, vol. 84, pp. 791-802, 1997.

[6] J. Beran and Y. Feng, "Local polynomial fitting with long-memory, short-memory and antipersistent errors," Annals of the Institute of Statistical Mathematics, vol. 54, no. 2, pp. 291-311, 2002.

[7] J. Fan and I. Gijbels, Local Polynomial Modeling and Its Applications, Chapman and Hall, London, UK, 1996.

[8] J. Fan and Q. Yao, Nonlinear Time Series: Nonparametric and Parametric Methods, Springer, New York, NY, USA, 2005.

[9] J. T. Gao, "Asymptotic theory for partially linear models," Communications in Statistics: Theory and Methods, vol. 22, pp. 3327-3354,1995.

[10] J. You and G. Chen, "Semiparametric generalized least squares estimation in partially linear regression models with correlated errors," Journal of Statistical Planning and Inference, vol. 137, no. 1, pp. 117-132, 2007.

[11] G. Aneiros-Perez and J. M. Vilar-Fernandez, "Local polynomial estimation in partial linear regression models under dependence," Journal of Statistical Planning and Inference, vol. 52, pp. 2757-2777, 2008.

[12] S. Konishi and G. Kitagawa, Information Criteria and Statistical Modeling, Springer, New York, NY, USA, 2008.

[13] L. Breiman, "Heuristics of instability and stabilization in model selection," The Annals of Statistics, vol. 24, no. 6, pp. 2350-2383,1996.

[14] J. Fan and R. Li, "Variable selection via nonconcave penalized likelihood and its oracle properties," Journal of the American Statistical Association, vol. 96, no. 456, pp. 1348-1360, 2001.

[15] S. Konishi and G. Kitagawa, "Generalised information criteria in model selection," Biometrika, vol. 83, no. 4, pp. 875-890,1996.

[16] T. J. Hastie and R. Tibshirani, Generalized Additive Models, Chapman and Hall, London, UK, 1990.

[17] S. Konishi, T. Ando, and S. Imoto, "Bayesian information criteria and smoothing parameter selection in radial basis function networks," Biometrika, vol. 91, no. 1, pp. 27-43, 2004.

Vn(ISl + (ß))(ß! -ß10 + (ISl + NSl (0,Z-1)),

Vn(ls2 + (w)) (w1 - W10 + (IS2 + Xl2(w) + ¿K)-1b^ Ns2(0,Z-(\))

(A.47)

This completes the proof of Theorem 2.3.

[18] Y. Yu and D. Ruppert, "Penalized spline estimation for partially linear single-index models," Journal of the American Statistical Association, vol. 97, no. 460, pp. 1042-1054, 2002.

[19] P. J. Green and B. W. Silverman, Nonparametric Regression and Generalized Linear Models, Chapman and Hall, London, UK, 1994.

[20] J. Green, "Penalized likelihood for generalized semi-parametric regression models," International Statistical Review, vol. 55, pp. 245-259,1987.

[21] P. Speckman, "Kernel smoothing in partial linear models," Journal of the Royal Statistical Society: Series B, vol. 50, pp. 413-436,1988.

[22] Z. Xiao, O. B. Linton, R. J. Carroll, and E. Mammen, "More efficient local polynomial estimation in nonparametric regression with autocorrelated errors," Journal of the American Statistical Association, vol. 98, no. 464, pp. 980-992, 2003.

[23] J. Fan and R. Li, "New estimation and model selection procedures for semiparametric modeling in longitudinal data analysis," Journal of the American Statistical Association, vol. 99, no. 467, pp. 710-723, 2004.

[24] R. Tibshirani, "Regression shrinkage and selection via the LASSO," Journal of the Royal Statistical Society: Series B, vol. 58, pp. 267-288,1996.

[25] R. Tibshirani, "The LASSO method for variable selection in the Cox model," Statistics in Medicine, vol. 16, no. 4, pp. 385-395,1997.

[26] A. Fuller, Introduction to Statistical Time Series, John Wiley & Sons, New York, NY, USA, 2nd edition, 1996.

[27] K. L. Brewster and R. R. Rindfuss, "Fertility and women's employment in industrialized countries," Annual Review of Sociology, vol. 26, pp. 271-286, 2000.

[28] N. Ahn and P. Mira, "A note on the changing relationship between fertility and female employment rates in developed countries," Journal of Population Economics, vol. 15, no. 4, pp. 667-682, 2002.

[29] H. Engelhardt, T. Kogel, and A. Prskawetz, "Fertility and women's employment reconsidered: a macro-level time-series analysis for developed countries, 1960-2000," Population Studies, vol. 58, no. 1, pp. 109-120, 2004.

[30] G. Aneiros-Perez, W. Gonzalez-Manteiga, and P. Vieu, "Estimation and testing in a partial linear regression model under long-memory dependence," Bernoulli, vol. 10, no. 1, pp. 49-78, 2004.

[31] H. Z. An, Z. G. Chen, and E. J. Hannan, "Autocorrelation, autoregression and autoregressive approximation," The Annals of Statistics, vol. 10, pp. 926-936,1982.

[32] P. Buhlmann, "Moving-average representation of autoregressive approximations," Stochastic Processes and Their Applications, vol. 60, no. 2, pp. 331-342,1995.

Copyright of Journal of Probability & Statistics is the property of Hindawi Publishing Corporation and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use.