Flexible Regression Models and Relative Forecast Performance∗


Christian M. Dahl, Department of Economics, Purdue University†
Svend Hylleberg, Department of Economics, University of Aarhus

January 20, 2003

Abstract

In this paper four alternative flexible nonlinear regression model approaches are reviewed and their performance evaluated based on various measures of out of sample forecast accuracy. The class of flexible regression models considered includes Neural Networks, Projection Pursuit models and the Random Field regression model approach recently suggested by Hamilton (2001, Econometrica, 69, 537-573). An empirical illustration is provided, showing that linear models for the US unemployment rate and the growth rate in US industrial production cannot outperform the "best" flexible nonlinear regression models in terms of out of sample forecast accuracy. The results indicate a possible presence of a nonlinear component in the conditional mean function of both time series.

JEL classification: C10; C45; C50
Keywords: Flexible regression models; Real time forecast accuracy

∗ Comments and suggestions from seminar participants at the University of Aarhus, the University of Venezia Ca'Foscari, and the European University Institute in Florence are gratefully acknowledged. We would also like to thank Norman Swanson, an editor and two anonymous referees for their valuable comments.
† Correspondence to: Christian M. Dahl, Department of Economics, Krannert School of Management, Purdue University, West Lafayette, IN 47907, USA. E-mail: [email protected].

1 Introduction

Due to thresholds, capacity constraints, rationing, institutional restrictions like tax brackets, and asymmetries of different kinds, nonlinear relations are an integral part of many

economic theories. Nonetheless, most empirical econometric models are basically linear. Several explanations for this state of affairs can be given in addition to the obvious one of familiarity and convenience. One possible explanation is based on the procedure often applied when deciding between a nonlinear and a linear specification. In most cases a linear model is specified at the outset and a nonlinear specification is only considered if some test for nonlinearity indicates that the linear specification may be in doubt. Unfortunately, very little information as to what kind of nonlinear model should be applied can be extracted from most or all of the existing tests for nonlinearity, implying that the actual choice of nonlinear model is rather arbitrary and tailored to the data in the sample applied. In fact, it often turns out that the out of sample forecast performance of the rejected linear model is far better than the out of sample forecast performance of the nonlinear alternative. As out of sample forecast performance is one of the preferred means to guard against the inherent danger of overparameterized nonlinear models, such evidence typically implies that the nonlinear specification of the model is in doubt. Hence, the linear model often ends up as the preferred specification.

On the other hand, even if the nonlinear specification adopted should have better forecast performance than the linear specification, there may exist another nonlinear specification which represents the theoretical and empirical information available in a better and more parsimonious way. If this is the case, how do we avoid ending up with a nonlinear specification which at best has little credibility, as it has only been chosen because it performs better than a specific linear model, and which may be at odds with a more general data coherent nonlinear model?

In accordance with the general-to-specific modelling strategy, one might argue that the best way to proceed is to start by applying a specification flexible enough to contain the linear and a wide range of nonlinear specifications as special cases. The flexible nonlinear specification can then play the role of an unrestricted model encompassing the class of models within which more specific and interpretable models must be found. The more specific models must then be tested against the general model and not rejected in order to be applied in the subsequent analysis. Two major problems exist. Firstly, how do we find a good flexible model in a feasible and cost effective way, and secondly, how do we avoid that the flexible model is overparameterized and overfits the sample applied? The answer could be to apply one of the flexible nonlinear regression models available, specified in a computer intensive but labor saving way, and base the choice of model on the relative forecast accuracy of the model in an out of sample context.

A related approach to the one suggested here is advocated by Swanson and White (1995, 1997a, 1997b) and Stock and Watson (1998). The approach is applied in finding nonlinear components in US macroeconomic series. Both Swanson and White and Stock and Watson select the preferred model based on the BIC criterion, and while

Swanson and White apply a neural network nonlinear model, Stock and Watson also

apply exponential smoothing and smooth transition autoregressions. In both cases the results are quite favorable to the linear model, but we will argue below that this result may be due to the choice of model selection criterion and to the limited class of flexible nonlinear regression models used. Our results indicate that the best flexible regression model should be chosen among several flexible nonlinear regression models available in the literature, since each of the individual flexible approaches seems to possess model specific approximation abilities that may differ according to the underlying nonlinear patterns.

In particular, we consider flexible nonlinear regression models such as Hamilton's Flexible Regression Model (FNL), see Hamilton (2001), the Neural Network Regression Model (ANN), see White (1992), and two versions of the Projection Pursuit Regression Model (PPR). The first version of the PPR model is based on the algorithm suggested by Friedman and Stueltze (1981) (PPR1) and the second is suggested by Aldrin, Boelviken, and Schweder (1993) to be applied in cases of moderate nonlinearities (PPR2). The FNL is a parametric model while the ANN and the PPR models are nonparametric models. Although the applications of the flexible models in the present context are restricted to univariate models, the flexible models are chosen among the class of models which can easily be generalized to the multivariate case. We argue that model selection within each of the four classes of flexible models should be made by a forward stepwise procedure, where a method of simultaneous model selection and model estimation is applied at each step, and where the model selection criteria applied are the AIC criterion, the BIC criterion or Cross Validation (CV). Our results indicate that no selection criterion dominates uniformly in terms of choosing the best forecasting model. In order to find the best flexible regression model we propose that the predictive precision of the competing flexible regression models is evaluated by use of simple absolute forecast performance measures, such as for instance the mean squared error, and simple directional forecast measures, such as the degree of diagonal concentration.

The approach suggested is applied to the growth rates in US industrial production and US unemployment in an attempt to make an additional contribution to the ongoing discussion of the possible existence of asymmetries and nonlinearities in the US business cycle.1 For both series the results indicate that the forecast accuracy of the flexible regression models is in general better than the forecast accuracy of the linear models. Among the nonlinear models, Hamilton's flexible regression model and the projection pursuit model applicable in case of moderate nonlinearities in general seem most accurate. Hence, contrary to the results reported by Swanson and White (1997) and Stock and Watson (1998), our results support the existence of a nonlinear component in the US business cycle. There are at least two reasons why the approach we suggest is more

powerful in detecting nonlinear components than the ones advocated by Swanson and White (1997) and Stock and Watson (1998): firstly, the two novel flexible regression model approaches seem to outperform existing methods, and secondly, in addition to the BIC selection criterion, the cross validation selection criterion is advocated.2

The outline of the paper is as follows: Each of the four flexible regression model approaches is presented in Section 2. Section 3 contains a presentation of various linearity tests used for preliminary identification of the nonlinear models. Section 4 provides a discussion of flexible nonlinear regression model evaluation procedures. Section 5 offers two empirical illustrations and finally Section 6 contains concluding remarks.

1 All software and data used in this paper can be obtained from the corresponding author's webpage at http://www.mgmt.purdue.edu/faculty/dahlc/
2 Based on Monte Carlo simulations, Dahl (2002) shows that Hamilton's flexible regression model approach outperforms the neural network approach for a wide range of the most common linear and nonlinear regression models used in econometric time series modelling.

2 Flexible regression models

Four flexible regression models will be considered. Three of these - the Neural Network Regression Model, see White (1992), the Projection Pursuit Regression Model, see Friedman and Stueltze (1981), Huber (1985) and Härdle (1990), and the Projection Pursuit Regression Model for moderate nonlinearities, see Aldrin, Boelviken and Schweder (1993) - are already well known in the literature, although applications in the field of dynamic time series analysis in areas other than financial markets are limited. The fourth approach - denoted Hamilton's Flexible Regression Model approach - is novel and due to Hamilton (2001). While the Neural Network model and the Projection Pursuit models specify the nonlinear components directly as part of the mean function, the model suggested by Hamilton introduces the nonlinear components through a parameterized covariance function that uniquely determines the properties of the zero-mean Gaussian random function representing the unobserved conditional mean function.

2.1 Hamilton's Flexible Regression Model

The basic idea underlying the flexible regression model approach suggested by Hamilton (2001) is to view not only the endogenous variable as a realization of a stochastic process but also to consider the functional form of the conditional mean function itself as the outcome of a random process. Consider the model

    y_t = \mu_{fnl}(x_t, \delta) + \varepsilon_t,    (1)

where \varepsilon_t is a sequence of independent N(0, \sigma^2)-distributed error terms and \mu_{fnl}(x_t, \delta) is a function of a k \times 1 vector x_t, which may include lagged dependent variables.3

3 For the sake of convenience it is assumed in the following that all variables are demeaned.

Let the

mean of the conditional distribution, i.e. \mu_{fnl}(x_t, \delta), be represented as having a linear part and a stochastic nonlinear part, i.e., as

    \mu_{fnl}(x_t, \delta) = x_t'\beta + \lambda m(g \odot x_t),    (2)

where for any choice of z, m(z) is a realization from a random field with the asymptotic distribution given by

    m(z) \sim N(0, 1),    (3)
    E(m(z) m(w)) = H_k(h),    (4)

and where h is defined as h \equiv \frac{1}{2}[(z - w)'(z - w)]^{1/2}.4 The realization of m(\cdot) is viewed as being predetermined with respect to \{x_1, .., x_T, \varepsilon_1, .., \varepsilon_T\} and m(\cdot) is therefore considered to be independent of \{x_1, .., x_T, \varepsilon_1, .., \varepsilon_T\}. The covariance function H_k(h) is defined by

    H_k(h) = \begin{cases} G_{k-1}(h, 1)/G_{k-1}(0, 1) & \text{if } h \le 1 \\ 0 & \text{if } h > 1 \end{cases},    (5)

where G_k(h, r), 0 < h \le r, is5

    G_k(h, r) = \int_h^r (r^2 - z^2)^{k/2}\, dz.    (6)

Closed form expressions for H_k(h) for k = \{1, .., 5\} are provided by Hamilton (2001), who also gives a general description of the statistical properties of the random field.6 7

Since it is not possible to directly observe m(z) - for any choice of z - we cannot observe the functional form of \mu_{fnl}(x_t, \delta). Hence, inference about the unknown parameters of the model, summarized by \delta = \{\beta, \lambda, g, \sigma\}, must be based on observing the realizations of y_t and x_t only. For that purpose rewrite model (1) as

    y = X\beta + u,    (7)

where y is a T \times 1 vector with tth element equal to y_t, X a T \times k matrix with tth row equal to x_t' and u a T \times 1 vector with tth element equal to \lambda m(g \odot x_t) + \varepsilon_t.

4 Here g is a k \times 1 vector of parameters and \odot denotes element-by-element multiplication, i.e. g \odot x_t is the Hadamard product. \beta is a k \times 1 vector of coefficients.
5 Notice G_0(h, r) = r - h, and G_k(h, r) can then be computed recursively by

    G_k(h, r) = \frac{-h (r^2 - h^2)^{k/2} + k r^2 G_{k-2}(h, r)}{1 + k}

for k = 2, 3, ....
6 The correlation between m(z_t) and m(w_s) is given by the volume of the intersection of a k-dimensional unit spheroid centered at z_t and a k-dimensional unit spheroid centered at w_s relative to the volume of a k-dimensional unit spheroid. Hence, the correlation between m(z_t) and m(w_s) is zero if the Euclidean distance between z_t and w_s is \ge 2.
7 The reader interested in a critical review on the choice of an appropriate random function is referred to Dahl and González-Rivera (2002).

Conditional on an initial set of parameters \lambda, g, and by defining \zeta \equiv \lambda/\sigma and W(X; g, \zeta) = \zeta^2 H + I_T, we may obtain the GLS estimates of the parameters of the linear part of the model, consisting of \beta and \sigma^2, as

    \beta_T(g, \zeta) = [X'W^{-1}(X; g, \zeta)X]^{-1} X'W^{-1}(X; g, \zeta)y,    (8)
    \sigma^2_T(g, \zeta) = \frac{1}{T}[y - X\beta_T(g, \zeta)]'W^{-1}(X; g, \zeta)[y - X\beta_T(g, \zeta)],    (9)

where I_T is the identity matrix of dimension (T \times T) and the \{t, s\} entry of the matrix H - denoted H(t, s) - is equal to

    H(t, s) = \begin{cases} H_k(h_{ts}) & \text{if } h_{ts} \le 1 \\ 0 & \text{if } h_{ts} > 1 \end{cases},    (10)

    h_{ts} = \frac{1}{2}[(\tilde{x}_t - \tilde{x}_s)'(\tilde{x}_t - \tilde{x}_s)]^{1/2}, \qquad \tilde{x}_t = g \odot x_t.

Based on the ideas of Wecker and Ansley (1983), Hamilton (2001) shows that the concentrated log likelihood function can be written as

    \eta(y, X; g, \zeta) = -\frac{T}{2}\ln(2\pi) - \frac{T}{2}\ln \sigma^2_T(g, \zeta) - \frac{1}{2}\ln|W(X; g, \zeta)| - \frac{T}{2}.    (11)

Once the estimates of (g, \zeta) maximizing equation (11) have been obtained, the estimates of \beta_T and \sigma^2_T are given from (8) and (9).
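To make the estimation step concrete, the fragment below sketches equations (5)-(11) in Python. It is only a minimal illustration and not the authors' code: the function names (G, H_k, concentrated_loglik), the use of numpy, and the brute-force matrix inversion are our own choices, no numerical safeguards are included, and the outer maximization over (g, \zeta), e.g. with a generic numerical optimizer applied to the negative of the returned value, is left out.

import numpy as np

def G(k, h, r):
    # G_k(h, r) = int_h^r (r^2 - z^2)^(k/2) dz, via the recursion in footnote 5
    if k == 0:
        return r - h
    if k == 1:
        return 0.5 * (r**2 * (np.pi / 2 - np.arcsin(h / r)) - h * np.sqrt(r**2 - h**2))
    return (-h * (r**2 - h**2)**(k / 2) + k * r**2 * G(k - 2, h, r)) / (1 + k)

def H_k(h, k):
    # correlation function of the random field, equation (5)
    return G(k - 1, h, 1.0) / G(k - 1, 0.0, 1.0) if h <= 1 else 0.0

def concentrated_loglik(params, y, X):
    # params = (g_1, ..., g_k, zeta); returns eta(y, X; g, zeta) of equation (11)
    T, k = X.shape
    g, zeta = params[:k], params[k]
    xt = X * g                                   # rows are g (elementwise) x_t
    d = xt[:, None, :] - xt[None, :, :]
    h = 0.5 * np.sqrt((d ** 2).sum(axis=2))      # pairwise distances h_ts
    H = np.vectorize(lambda hh: H_k(hh, k))(h)
    W = zeta**2 * H + np.eye(T)
    Winv = np.linalg.inv(W)
    beta = np.linalg.solve(X.T @ Winv @ X, X.T @ Winv @ y)   # equation (8)
    e = y - X @ beta
    sigma2 = (e @ Winv @ e) / T                              # equation (9)
    _, logdetW = np.linalg.slogdet(W)
    return (-T / 2 * np.log(2 * np.pi) - T / 2 * np.log(sigma2)
            - 0.5 * logdetW - T / 2)                         # equation (11)

Evaluating the same function at the maximizing (g, \zeta) then delivers the GLS estimates of \beta and \sigma^2, as stated above.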

From (7) it is clear that the nonlinearities are introduced into Hamilton's model through the specification of the covariance function associated with the unobserved part of the model. As already mentioned, this is feasible since the spherical random field has a zero mean and is Gaussian distributed asymptotically and therefore uniquely determined by the parameterized covariance function, see Yaglom (1962) and Dahl and González-Rivera (2002). The estimator of the conditional mean function \mu_{fnl}(x_t, \delta) is given by the tth row of

    X\hat{\beta}_T + \hat{P}_0(\hat{P}_0 + \hat{\sigma}^2_T I_T)^{-1}[y - X\hat{\beta}_T],    (12)

where the \{t, s\} entry of the matrix P_0 - denoted P_0(t, s) - is equal to

    P_0(t, s) = \begin{cases} \lambda^2 H_k(h_{ts}) & h_{ts} \le 1 \\ 0 & h_{ts} > 1 \end{cases},    (13)
    h_{ts} = \frac{1}{2}[(\tilde{x}_t - \tilde{x}_s)'(\tilde{x}_t - \tilde{x}_s)]^{1/2},    (14)
    \tilde{x}_t = g \odot x_t,    (15)

as shown by Hamilton (2001). The estimator of the conditional mean function will be consistent for \mu(\cdot) belonging to a very broad class of deterministic nonlinear functions. This result will also apply in the case of \mu(\cdot) being linear. Since we are going to evaluate

the forecast accuracy of the model out of sample and equation (12) only works for cases where the conditional mean function is evaluated at points observed in the sample, a modification must be made. To be more specific, we seek to calculate \hat{\mu}_{fnl}(x^*, \delta), where x^* = \{x^*_1, x^*_2, .., x^*_k\}' does not belong to the sample. If we let P^*_0(t) denote the covariance between \mu_{fnl}(x_t, \delta) and \mu_{fnl}(x^*, \delta) for t = 1, .., T, we can obtain an estimate of \mu_{fnl}(x^*, \delta) as

    \hat{\mu}_{fnl}(x^*, \delta) = x^{*\prime}\hat{\beta}_T + \hat{P}^{*\prime}_0(\hat{P}_0 + \hat{\sigma}^2_T I_T)^{-1}[y - X\hat{\beta}_T],    (16)

where

    P^*_0 = \{P^*_0(t), t = 1, 2, ..., T\}',    (17)
    P^*_0(t) = \begin{cases} \lambda^2 H_k(h^*_t) & h^*_t \le 1 \\ 0 & h^*_t > 1 \end{cases},    (18)
    h^*_t = \frac{1}{2}[(\tilde{x}_t - \tilde{x}^*)'(\tilde{x}_t - \tilde{x}^*)]^{1/2},    (19)
    \tilde{x}_t = g \odot x_t,    (20)
    \tilde{x}^* = g \odot x^*.    (21)

In equation (16) \hat{P}_0 and \hat{P}^*_0 denote P_0 and P^*_0 evaluated at the maximum likelihood estimates of \lambda and g. Within the general model setup given by equations (1) and (2), \hat{\mu}_{fnl}(x^*, \delta) is the predictor that minimizes the prediction variance over all unbiased, homogeneous and linear predictors, e.g. Dahl (1999).8
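As an illustration of the prediction rule (16)-(21), the following sketch computes the out-of-sample estimate for a single new point in the simplest case k = 1, where the closed form H_1(h) = 1 - h (for h \le 1) applies. It is only a schematic example under assumed, not estimated, parameter values; the function names and toy data are ours.

import numpy as np

def h1_corr(h):
    # H_1(h) = 1 - h for h <= 1, zero otherwise (k = 1 closed form)
    return np.where(h <= 1.0, 1.0 - h, 0.0)

def predict_new_point(y, x, x_star, beta, lam, g, sigma2):
    # Hamilton-type out-of-sample predictor, equations (16)-(21), scalar regressor
    xt = g * x                                       # tilde x_t
    xs = g * x_star                                  # tilde x*
    h = 0.5 * np.abs(xt[:, None] - xt[None, :])      # h_ts, equation (14)
    P0 = lam**2 * h1_corr(h)                         # equation (13)
    h_star = 0.5 * np.abs(xt - xs)                   # equation (19)
    P0_star = lam**2 * h1_corr(h_star)               # equation (18)
    resid = y - x * beta
    A = P0 + sigma2 * np.eye(len(y))
    return x_star * beta + P0_star @ np.linalg.solve(A, resid)   # equation (16)

# toy usage with assumed parameter values
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 0.5 * x + np.sin(x) + 0.1 * rng.normal(size=50)
print(predict_new_point(y, x, x_star=0.3, beta=0.5, lam=1.0, g=1.0, sigma2=0.01))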

2.2 The Neural Network Regression Model

Neural network models are nonlinear models that can be specified to fit past and future values of a time series, hereby extracting hidden structures and relationships governing the data. In a traditional statistical context neural networks can be considered a nonlinear, non-parametric inference technique that in the unconstrained form is data driven and model free. A priori the relationships between input variables and output variables are unconstrained and no predetermined parameters are required to specify the model. Let us consider the single hidden layer feedforward network in which the output y_t given inputs x_t is determined as9

    y_t = \mu_{ann}(x_t, \kappa) + \varepsilon_t,    (22)

where

    \mu_{ann}(x_t, \kappa) = x_t'\beta + \sum_{j=1}^{q} \theta_j \psi_j(x_t'\gamma_j),    (23)
    \kappa = \{\beta, \theta_1, \theta_2, ...., \theta_q, \gamma_1, .., \gamma_q\}.    (24)

8 For a derivation of the prediction rule given by equation (16) and a discussion of its basic properties the reader is referred to Dahl (1999).
9 The model is denoted feedforward because signals flow from input to output and not vice versa.

Following White (1989) we take the activation function to be a logistic function and to be identical for all hidden units, i.e. \psi_j(x_t'\gamma_j) = \psi(x_t'\gamma_j) = (1 + \exp(-x_t'\gamma_j))^{-1} for j = 1, .., q. Furthermore, we augment the single hidden layer network by direct links from the input units to a single output with weights \beta = \{\beta_1, .., \beta_k\}, implying that the neural network model will have a linear component, and assume that the output also contains a white noise term \varepsilon_t \sim nid(0, \sigma^2). Finally, we let \theta = \{\theta_1, .., \theta_q\} denote the hidden-units-to-output weights. The parameters of the model, and hence the estimate of the conditional mean function, are obtained by applying nonlinear least squares, NLS, i.e. by solving

    \min_{\kappa} E(y_t - \mu_{ann}(x_t, \kappa))^2.    (25)

The NLS procedure may converge to a local rather than a global optimum and therefore proper starting values are of great importance. For every single specification under consideration we therefore worked with five different sets of starting values for the parameter vector and iterated from these until convergence.10 The iterated parameter vector corresponding to the smallest value of the objective function given by equation (25) was chosen. As shown by White (1992), the single hidden layer feedforward neural network model possesses the universal approximation property, and can approximate any nonlinear function to an arbitrary degree of accuracy with a suitable number of hidden units. Of course this tells us nothing about the performance of such techniques in practice, and for a given set of data it is possible for one technique to dominate another in terms of accuracy.
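A minimal numerical sketch of the ANN specification (22)-(25) is given below. It is not the authors' implementation: the helper names, the use of scipy's generic least squares routine and the simple multi-start loop are our own choices, and the stepwise selection of regressors and hidden units described in Section 4 is omitted.

import numpy as np
from scipy.optimize import least_squares

def ann_mean(params, X, q):
    # mu_ann(x_t, kappa) of equation (23): linear part plus q logistic hidden units
    T, k = X.shape
    beta = params[:k]
    theta = params[k:k + q]
    gamma = params[k + q:].reshape(q, k)
    hidden = 1.0 / (1.0 + np.exp(-X @ gamma.T))      # psi(x_t' gamma_j)
    return X @ beta + hidden @ theta

def fit_ann(y, X, q, n_starts=5, seed=0):
    # NLS estimation of kappa, equation (25), with several random starting values
    rng = np.random.default_rng(seed)
    T, k = X.shape
    best = None
    for _ in range(n_starts):
        start = rng.uniform(-1, 1, size=k + q + q * k)
        res = least_squares(lambda p: y - ann_mean(p, X, q), start)
        if best is None or res.cost < best.cost:
            best = res
    return best.x

# usage: kappa_hat = fit_ann(y, X, q=2); fitted = ann_mean(kappa_hat, X, 2)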

2.3 The Projection Pursuit Regression Model

One approach very closely related to the parametric Neural Network Regression Model is the nonparametric Projection Pursuit Regression Model proposed by Friedman and Stueltze (1981) and Huber (1985), see also Härdle (1990). For a single output variable y_t and an input vector given by x_t the Projection Pursuit Regression Model can be written in the form

    y_t = \mu_{ppr}(x_t, \iota) + \varepsilon_t,    (26)

where

    \mu_{ppr}(x_t, \iota) = x_t'\beta + \sum_{j=1}^{v} \omega_j \varphi_j(x_t'\Phi_j),    (27)
    \iota = \{\beta, \omega_1, .., \omega_v, \Phi_1, .., \Phi_v\}.    (28)

10 In order to generate the initial/first out-of-sample forecast we estimated the model using 50 starting values and made sure that the resulting parameter estimates looked reasonable. Secondly, for every subsequent estimation - where the only difference in the information set was one additional observation - we used the previous set of parameter estimates as starting values and then in addition tried 5 other randomly generated starting values.

The parameters \Phi_j define the projection of the input vector x_t onto a set of planes labelled by j = 1, .., v. These projections are transformed by the nonlinear activation functions denoted \varphi_j(\cdot) and these in turn are linearly combined with weights \omega_j and added to the linear part, x_t'\beta, to form the output variable y_t. The first algorithm considered for obtaining an estimate of \iota is the original algorithm suggested by Friedman and Stueltze (1981), but with a few modifications. First, the algorithm is augmented such that the estimation of the weights \omega_j for j = 1, .., v can be obtained using simple ordinary least squares. Secondly, least squares techniques are used for concentrating out the linear part of y_t. Thirdly, cubic splines with automatic data dependent determination of the smoothing parameter are applied in the estimation of the empirical activation functions \varphi_j, j = 1, .., v. Finally, AIC, BIC, and CV in turn are used as stopping rules with respect to the choice of the appropriate number of activation functions given by v, but also as determinants of the number of regressors, k_j for j = 1, .., v, included in every individual activation function.

Let y and X be as defined in equation (7). The set of regressors included in the various activation functions is allowed to differ. Let X_j denote the matrix of the k_j regressors included in activation function \varphi_j. How to choose the proper dimension of X_j is of course one of our main concerns and it will be discussed in detail later. Furthermore, we define z_{v+1} = X_{v+1}\Phi_{v+1}. The algorithm denoted PPR1 can then be described as follows:

1. Condition on the regressors by computing the residuals r_v for v = 0 and provide initial starting values for the parameters \Phi_v and \omega_v. In particular, let

    r_v = y - X_v(X_v'X_v)^{-1}(X_v'y),    (29)
    \Phi_v \sim U(-1, 1),    (30)
    \omega_v \sim U(-1, 1),    (31)

where U is the uniform distribution. In order to explore the nature of the local optima, repeated runs are important and necessary when estimating projection pursuit models. As for the ANN model we considered five different sets of starting values every time a new hidden unit was introduced. These were all drawn from a uniform distribution defined on the interval [-1; 1].

2. Find the projection vector \Phi_{v+1} \in R^{k_{v+1}} (||\Phi_{v+1}|| = 1) that maximizes the goodness of fit measure R^2_{v+1}(\Phi_{v+1}) defined as

    R^2_{v+1}(\Phi_{v+1}) = 1 - (r_v'r_v)^{-1}(r_v - \hat{\omega}_{v+1}\hat{\varphi}_{v+1}(z_{v+1}))'(r_v - \hat{\omega}_{v+1}\hat{\varphi}_{v+1}(z_{v+1})),    (32)

where the estimated empirical activation function \hat{\varphi}_{v+1}(z_{v+1}) is determined by the cubic spline approach

    \hat{\varphi}_{v+1}(z_{v+1}) = \arg\min_{w(z_{v+1})} S_{\lambda_s}(w),
    S_{\lambda_s}(w) = (y - w(z_{v+1}))'(y - w(z_{v+1})) + \lambda_s \int (w''(z_{v+1}))^2\, dz_{v+1},    (33)

with the weight \hat{\omega}_{v+1} obtained from a linear regression of r_v on \hat{\varphi}_{v+1}(z_{v+1}).

3. If R^2_{v+1} is sufficiently small, stop the algorithm, otherwise go to step 4.

4. Construct a new set of residuals

    r_{v+1} = r_v - \hat{\varphi}_{v+1}(z_{v+1}),    (34)

and add an additional activation function. Furthermore, update v = v + 1 and go back through step 2 and step 3.
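One pass of step 2 can be illustrated with the short sketch below: given a residual vector and a candidate direction, it smooths the projection with a cubic spline and returns the goodness-of-fit measure. This is only a schematic illustration, not the authors' code; the function name is ours, scipy's spline smoother replaces the generalized cross validation choice of smoothing parameter used in the paper, and the search over candidate directions is left to the caller.

import numpy as np
from scipy.interpolate import UnivariateSpline

def ppr_stage_fit(r, X, Phi, s=None):
    # One PPR1 stage: project on Phi, smooth the residual with a cubic spline
    # (the role of equation (33)), regress r on the smooth to obtain omega,
    # and return the goodness-of-fit measure of equation (32).
    z = X @ Phi
    order = np.argsort(z)                        # spline smoother needs sorted abscissae
    spline = UnivariateSpline(z[order], r[order], k=3, s=s)
    phi_hat = spline(z)                          # empirical activation function at z_t
    omega = (phi_hat @ r) / (phi_hat @ phi_hat)  # OLS weight of r on phi_hat
    fit = r - omega * phi_hat
    R2 = 1.0 - (fit @ fit) / (r @ r)             # equation (32)
    return R2, omega, phi_hat

# a direction is then chosen by maximizing R2 over candidate Phi vectors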

One important difference between the neural network model described above and the projection pursuit model of this section is that each hidden unit in the projection pursuit regression is allowed a different activation function and that these functions are not prescribed in advance, but are determined from the data as part of the estimation procedure. Another difference is that the parameters in the projection pursuit regression are optimized cyclically in groups while those in the neural network are optimized simultaneously. Specifically, estimation in the Projection Pursuit Regression Model takes place for one hidden unit at a time, and for each hidden unit the second-layer weights are optimized first, followed by the activation function and the first layer weights. The process is repeated for each hidden unit in turn until a sufficiently small value of the error function is achieved or until some other stopping criterion is satisfied. Since the output y_t depends linearly on the second-layer parameters, these can be optimized by linear least squares techniques. Optimization of the activation functions \varphi_j(\cdot) represents a problem of one-dimensional curve fitting for which a variety of techniques can be used, as for instance the one based on cubic splines or the one based on the Nadaraya-Watson type kernel smoother, see Härdle (1990) for a discussion. Here we have chosen to work with the cubic spline smoother.

Finally, we will consider a flexible regression model denoted PPR2 which is very closely related to the PPR model outlined above. However, it is much simpler at the expense of not being quite as flexible as the standard algorithm. Aldrin, Boelviken and Schweder (1993) argue that nonlinear structures in practice often are only moderately deviant from a linear structure. These moderate nonlinear structures include S-shapes and other moderate curvatures, slight jumps in derivatives or local dips or bumps. Based on this argument they are able to implement a very simple and fast algorithm for estimation of the conditional mean function given by equation (27) under the assumption that it is nonlinear but monotone or approximately monotone. Aldrin, Boelviken and Schweder (1993) provide extensive numerical evidence showing that the PPR2 model outperforms the PPR1 model when the nonlinearity is moderate and the signal to noise ratio is small. In economics the use of aggregated data is quite common and as the degree of nonlinearity can diminish with aggregation, see Granger and Teräsvirta (1993), a projection pursuit model like PPR2 is an obvious possibility. Consider a simple version of the model given by equations (26) - (28) such as

    y = \varphi(X\Phi) + \varepsilon = \varphi(z) + \varepsilon,    (35)

where y and \varepsilon are vectors with y_t and \varepsilon_t as the tth elements, respectively. Assume without loss of generality that the columns of the matrix X with the tth row equal to x_t' have zero mean and that the coefficient vector \Phi is standardized so that \Phi'\Sigma_x\Phi = 1 where \Sigma_x = E(X'X). Furthermore, denote the residual vector as r = y - X\Phi_{ols} where \Phi_{ols} = \Sigma_x^{-1}\Sigma_{xy}, \Sigma_{xy} = E(X'y). In addition, let the predictor be z = X\Phi, and consider u = (X - z\Phi'\Sigma_x) and notice that z and u are uncorrelated as

    E(z'u) = E(z'(X - z\Phi'\Sigma_x)) = \Phi'\Sigma_x - \Phi'\Sigma_x\Phi\Phi'\Sigma_x = \Phi'\Sigma_x - \Phi'\Sigma_x = 0.

Premultiplying y = \varphi(X\Phi) + \varepsilon by X' then implies

    X'y = X'\varphi(z) + X'\varepsilon = X'\varphi(z) + u'\varphi(z) - u'\varphi(z) + X'\varepsilon = \Sigma_x\Phi z'\varphi(z) + u'\varphi(z) + X'\varepsilon,    (36)

and by applying the expectation operator we obtain

    \Sigma_{xy} = \Sigma_x\Phi E(z'\varphi(z)) + E(u'\varphi(z)).    (37)

Finally, premultiplication by \Sigma_x^{-1} gives

    \Phi_{ols} = \eta\Phi + B,    (38)

where

    \eta = E[z'\varphi(z)],    (39)
    B = \Sigma_x^{-1}E[u'\varphi(z)] = \Sigma_x^{-1}E[(X - z\Phi'\Sigma_x)'\varphi(z)].    (40)

Equation (38) writes the OLS estimate Φols as a sum of one term proportional to the true direction vector Φ and another term B. As u and z are uncorrelated, the correlation between u and the transformation ϕ(z) may be expected to be small. Indeed, it can 11

be shown, see Aldrin, Boelviken and Schweder (1993), that if X follows an elliptically contoured distribution like the Gaussian, E[u'\varphi(z)] = 0, whereby B = 0 and \Phi_{ols} = \eta\Phi. Also notice that if \varphi(z) is linear, i.e. \varphi(z) = X\Phi, then \eta = E(z'\varphi(z)) = \Phi'\Sigma_x\Phi = 1 and the ordinary least squares estimator will be equal to the true parameter, \Phi_{ols} = \Phi. Heuristically, \eta = E[z'\varphi(z)] measures the correlation between the linear model z = X\Phi and the nonlinear model \varphi(X\Phi). If this correlation is low then \Phi_{ols} must be expected to be far from \Phi, but if the correlation is reasonably high, which is true in the case of moderate nonlinearities, \Phi_{ols} must be expected to be close to \Phi. This suggests to take the linear ordinary least squares estimate \hat{\Phi}_{ols} for \Phi and then obtain \hat{\varphi}(z) by smoothing y in this direction. In the general case where we have a linear part - as in equation (26) - \hat{\varphi}(z) is obtained by smoothing r in the direction governed by \hat{\Phi}_{ols}.

To sum up, the simple algorithm based on the ordinary least squares estimator will work as long as (1) \eta is not too close to zero, implying that only moderate nonlinear structures can be analyzed, and (2) B is of a small magnitude, requiring that X should be approximately Gaussian distributed.

= ηΦ + B + (X X ) 1 (X j ), ηj = T −1 zj ϕ  j (zj ), − 1 Bj = T (X j X j ) 1 (ϕ j (zj ) (X j − (X j X j ) 1 Φ )z ),  . z = X Φ

Φ

ols j

j

j

j −







j

j









ols j

ols j

j

(41) (42)

(43) (44)

 ols is already close to the true If the conditions discussed above are satisfied, such that Φ j direction, Aldrin et al. (1993) suggest the following iterative scheme: 1. Center the response rv for v = 0. The response is centered by de-meaning y by a linear combination of the regressors X v . In particular, rv

= y − X v (X v X v ) 1 (X v y ). 





(45)

2. Estimate Φ0v+1 by ordinary least squares as

Φ 0v+1 = (X v+1 X v+1 ) 

−1

(

X v+1 rv ). 

(46)

and obtain ϕ  v+1 by using the cubic spline smoother defined by equation (33) with the smoothing parameter determined by generalized cross validation.

12

vj +1 according to equation (42) and (43) and update the direc3. Compute ηjv+1 and B tion vector according to Φ j −1 − B j (47) Φ jv+1 = v+1 j v+1 , ηv+1   jv+1 (X v+1 X v+1 )−1 Φ  jv+1 and standardize such that Φ

= 1.

4. Find the smoothing function \hat{\varphi}^j_{v+1}(X_{v+1}\hat{\Phi}^j_{v+1}) and obtain the associated optimal weight as

    \hat{\omega}^j_{v+1} = \left(\hat{\varphi}^j_{v+1}(X_{v+1}\hat{\Phi}^j_{v+1})'\hat{\varphi}^j_{v+1}(X_{v+1}\hat{\Phi}^j_{v+1})\right)^{-1}\left(\hat{\varphi}^j_{v+1}(X_{v+1}\hat{\Phi}^j_{v+1})'r_v\right).    (48)

5. Repeat step 3 and 4 until the scalar difference given by

    D = \left\| \frac{\hat{\Phi}^{j-1}_{v+1}}{\hat{\Phi}^{j-1\prime}_{v+1}(X_{v+1}'X_{v+1})^{-1}\hat{\Phi}^{j-1}_{v+1}} - \frac{\hat{\Phi}^j_{v+1}}{\hat{\Phi}^{j\prime}_{v+1}(X_{v+1}'X_{v+1})^{-1}\hat{\Phi}^j_{v+1}} \right\|    (49)

becomes sufficiently small.

6. If the criterion of fit is satisfied stop the algorithm, otherwise go to step 7.

7. Construct a new set of residuals

    r_{v+1} = r_v - \hat{\varphi}_{v+1}(X_{v+1}\hat{\Phi}^j_{v+1}),    (50)

and add an additional activation function. Furthermore, update v = v + 1 and go back through step 2 - step 7.

In terms of representational capability, the standard Projection Pursuit Regression can be regarded as a generalization of the multilayer neural network model, since the activation functions are more flexible. It is therefore not surprising that Projection Pursuit Regression Models should have the same universal approximation capabilities as Neural Network Regression Models.
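The core of the PPR2 update, steps 2 and 3 above, can be illustrated with a short sketch. It is a simplified illustration only: it assumes no linear part, a single hidden unit and a fixed spline smoothing parameter rather than generalized cross validation, the bias correction is the sample analogue of equations (40) and (43), and the function name is ours.

import numpy as np
from scipy.interpolate import UnivariateSpline

def ppr2_direction(r, X, n_iter=10, s=None):
    # Iterate the bias-corrected direction estimate of equations (46)-(47).
    T, k = X.shape
    XtX = X.T @ X
    Phi = np.linalg.solve(XtX, X.T @ r)               # OLS start, equation (46)
    for _ in range(n_iter):
        z = X @ Phi
        order = np.argsort(z)
        phi_hat = UnivariateSpline(z[order], r[order], k=3, s=s)(z)
        eta = (z @ phi_hat) / T                        # equation (42)
        u = X - np.outer(z, Phi) @ (XtX / T)           # sample analogue of X - z Phi' Sigma_x
        B = np.linalg.solve(XtX, u.T @ phi_hat)        # sample analogue of equation (43)
        Phi = (Phi - B) / eta                          # equation (47)
        Phi /= np.sqrt(Phi @ np.linalg.solve(XtX, Phi))  # standardization of step 3
    return Phi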

3 Preliminary identification of the flexible nonlinear model

In order to justify the need for a flexible nonlinear model - and to check whether the parameters \lambda in equation (2), \theta_j in equation (23), and \omega_j in equation (27) all are different from zero, ensuring statistical identification - we suggest testing the null hypothesis of linearity of the series by use of a battery of general specification tests. The collection of test statistics for neglected nonlinearity will include Hamilton's test, Hamilton (2001), the Regression Error Specification Test or RESET test, Ramsey (1969), a test denoted the Tsay test, see Tsay (1986), the Neural Network test, see Lee, White and Granger (1993), and a particular version of White's information matrix test, White (1987, 1992). The tests are chosen because of their relatively good performance with respect to size and power against an unspecified nonlinear-in-mean alternative model, see Lee, White and Granger (1993) and Dahl (2002).

3.1 Hamilton's Lagrange multiplier test

The Lagrange multiplier test suggested by Hamilton considers the null hypothesis H_0: \lambda = 0 in equation (2). However, g is not identified when \lambda equals zero, but Hamilton (2001) solves this problem by assuming that the ith element of g, i.e. g_i, is proportional to the standard deviation of the ith row in x_t. Fixing the nonidentified parameters to the scale of the variables implies that the Lagrange multiplier statistic for neglected nonlinearity becomes

    H_{LM} = \frac{[\hat{\varepsilon}'H\hat{\varepsilon} - \hat{\sigma}^2 tr(MHM)]^2}{\hat{\sigma}^4\left[2 tr\{[MHM - (T - k)^{-1}M\, tr(MHM)]^2\}\right]},    (51)

where

    \hat{\varepsilon} = My,    (52)
    \hat{\sigma}^2 = (T - k)^{-1}\hat{\varepsilon}'\hat{\varepsilon},    (53)
    M = I_T - X(X'X)^{-1}X',    (54)

and the (t, s) element of the T \times T covariance matrix H of m(g \odot x_t) is given by

    H(t, s) = \begin{cases} H_k(h_{ts}) & h_{ts} \le 1 \\ 0 & h_{ts} > 1 \end{cases},    (55)
    h_{ts} = \frac{1}{2}\left[k^{-1}\sum_{i=1}^{k}\frac{(x_{i,t} - x_{i,s})^2}{s_i^2}\right]^{1/2},    (56)
    s_i^2 = T^{-1}\sum_{t=1}^{T}\left(x_{i,t} - T^{-1}\sum_{t=1}^{T}x_{i,t}\right)^2,    (57)

where H_k(\cdot) is defined in equation (5). The Lagrange multiplier statistic H_{LM} is asymptotically \chi^2(1) distributed.
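The statistic can be computed directly from the pieces in (51)-(57). The fragment below is a schematic sketch only, not the authors' code; the function name is ours, and the covariance function is passed in as a callable (for instance the H_k function from the earlier sketch, or lambda h: max(0.0, 1.0 - h) in the k = 1 case).

import numpy as np

def hamilton_lm_test(y, X, hk):
    # Hamilton-type LM statistic, equations (51)-(57); `hk` maps h -> H_k(h).
    T, k = X.shape
    M = np.eye(T) - X @ np.linalg.solve(X.T @ X, X.T)     # equation (54)
    e = M @ y                                             # equation (52)
    sigma2 = (e @ e) / (T - k)                            # equation (53)
    s2 = X.var(axis=0)                                    # equation (57)
    d = (X[:, None, :] - X[None, :, :])**2 / s2
    h = 0.5 * np.sqrt(d.mean(axis=2))                     # equation (56)
    H = np.vectorize(hk)(h)                               # equation (55)
    MHM = M @ H @ M
    num = (e @ H @ e - sigma2 * np.trace(MHM))**2
    A = MHM - M * np.trace(MHM) / (T - k)
    den = sigma2**2 * 2.0 * np.trace(A @ A)
    return num / den                                      # asymptotically chi2(1)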

3.2 The Neural Network test

When the null hypothesis of linearity is true, i.e. H_0: Pr[E(y_t|X_t) = x_t'\beta^*] = 1 for some choice of \beta^* and X_t = \{x_1, x_2, .., x_t\}, the optimal network weights \theta_j in equation (23) are zero for j = 1, .., q. The neural network test for neglected nonlinearity can therefore be interpreted as testing the hypothesis H_0: \theta_1 = \theta_2 = .. = \theta_q = 0 for particular choices of q and \gamma_j. As in Lee et al. (1993) we set q equal to 10 and draw the direction vectors \gamma_j (= \hat{\gamma}_j) independently from a uniform distribution on the interval [-2; 2] after having normalized y_t to take values on the interval [-1; 1] only. The test is then carried out by regressing \hat{\varepsilon}_{(T \times 1)} = y - X_T(X_T'X_T)^{-1}(X_T'y) on 1_{(T \times 1)} and \Psi_{(T \times q)} = \{\psi(X_T\hat{\gamma}_1), .., \psi(X_T\hat{\gamma}_q)\}, where y = \{y_1, y_2, .., y_T\}'. The Lagrange multiplier test statistic is given by

    NN_{LM} = TR^2 \rightarrow \chi^2(q),    (58)

where R^2 is the coefficient of determination from the auxiliary regression. Because the observed components of \Psi_t typically are highly correlated, Lee et al. (1993) recommend using a small number of principal components instead of the q original variables. Using the q^* < q principal components of \Psi_t, denoted \Psi^*_t, not collinear with x_t, an equivalent test statistic is given by

    NN^*_{LM} = TR^2_{pc} \rightarrow \chi^2(q^*),    (59)

where R^2_{pc} is the coefficient of determination from a regression of \hat{\varepsilon}_{(T \times 1)} on \Psi^*_{(T \times q^*)}.
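A compact sketch of the principal-components version (59) is given below. It is an illustration under our own simplifying choices (function name, how the principal components are extracted, and applying the whole procedure to the normalized series), not the exact implementation used in the paper.

import numpy as np

def neural_network_test(y, X, q=10, q_star=2, seed=0):
    # Lee-White-Granger type neural network test, equations (58)-(59), as a sketch.
    rng = np.random.default_rng(seed)
    T, k = X.shape
    y_n = 2 * (y - y.min()) / (y.max() - y.min()) - 1       # normalize y to [-1, 1]
    gamma = rng.uniform(-2, 2, size=(q, k))
    Psi = 1.0 / (1.0 + np.exp(-X @ gamma.T))                # logistic hidden units
    e = y_n - X @ np.linalg.lstsq(X, y_n, rcond=None)[0]    # linear-model residuals
    Psi_c = Psi - Psi.mean(axis=0)
    _, _, Vt = np.linalg.svd(Psi_c, full_matrices=False)
    pc = Psi_c @ Vt[:q_star].T                              # q* principal components
    Z = np.column_stack([np.ones(T), pc])
    resid = e - Z @ np.linalg.lstsq(Z, e, rcond=None)[0]
    R2 = 1.0 - (resid @ resid) / ((e - e.mean()) @ (e - e.mean()))
    return T * R2                                           # compare with chi2(q_star)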

3.3 The RESET test and Tsay's test

Consider the linear model

    y_t = x_t'\beta + u_t,    (60)

where y_t is the dependent variable and x_t a k \times 1 vector of regressors.11

11 Notice, the regressors may be lagged dependent variables.

The first step consists of regressing y_t on x_t in order to obtain an estimate of \beta, say \hat{\beta}, the predictions f_t = x_t'\hat{\beta}, and the residuals \hat{u}_t = y_t - f_t, whereby the sum of squared residuals is SSR_0 = \sum_{t=1}^{T}\hat{u}_t^2. In the second step, regress \hat{u}_t on x_t and on the s \times 1 vector M_t, to be defined later, and compute the residuals from this regression, \hat{v}_t = \hat{u}_t - x_t'\hat{\alpha}_1 - M_t'\hat{\alpha}_2, and the residual sum of squares SSR = \sum_{t=1}^{T}\hat{v}_t^2. Finally, in the third step compute the F statistic given by

    F = \frac{(SSR_0 - SSR)/s}{SSR/(T - k - s)} \sim F(s, T - k - s).    (61)

Under the linearity hypothesis the F statistic above is approximately F-distributed with s and T - k - s degrees of freedom. The difference between the RESET test and the Tsay test lies in the choice of M_t. The RESET test defines M_t = \{f_t^2, .., f_t^{s+1}\}. Because f_t^i, i = 2, .., s + 1, tends to be highly correlated with x_t and with themselves, the test is conducted using the s^* < s largest principal components of f_t^2, .., f_t^{s+1} not perfectly collinear with x_t and therefore not with the linear combination f_t = x_t'\hat{\beta}. Tsay (1986) suggests using M_t = vech(x_t x_t'), where the operator vech implies that M_t contains the elements on and below the diagonal of the matrix x_t x_t', i.e. the squared explanatory variables and the cross-products of these.
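The three steps leading to the F statistic in (61) are easily coded. The sketch below uses the RESET choice of M_t (powers of the fitted values) and omits the principal-components step described above; the function name is ours and Tsay's variant would simply replace M by the vech of x_t x_t'.

import numpy as np

def reset_test(y, X, s=2):
    # RESET-type F statistic, equations (60)-(61)
    T, k = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    f = X @ beta                                   # fitted values
    u = y - f                                      # first-step residuals
    ssr0 = u @ u
    M = np.column_stack([f**(i + 2) for i in range(s)])   # f^2, ..., f^(s+1)
    Z = np.column_stack([X, M])
    v = u - Z @ np.linalg.lstsq(Z, u, rcond=None)[0]
    ssr = v @ v
    return ((ssr0 - ssr) / s) / (ssr / (T - k - s))        # ~ F(s, T-k-s) under H0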

3.4 White's dynamic information matrix test

The information matrix test is developed from the observation that if a model is well specified the information matrix equality holds, while this is not the case in a misspecified model. The version of White's dynamic misspecification test applied in this paper will be based on the covariance of the conditional score functions. For a Gaussian linear model the log likelihood function can be written as

    \eta_t(x_t, \beta, \sigma) = -\frac{1}{2}\log(2\pi) - \log(\sigma) - \frac{1}{2}u_t^2,    (62)

where u_t = \sigma^{-1}(y_t - x_t'\beta). The conditional score function is then given by

    s_t(x_t, \beta, \sigma) = \sigma^{-1}(u_t, u_t x_t', u_t^2 - 1)'.    (63)

Evaluating the conditional score at the quasi maximum likelihood estimators of the correctly specified model under H_0 gives \hat{s}_t = s_t(x_t, \hat{\beta}, \hat{\sigma}). The information matrix test is based on forming the q \times 1 indicator \hat{m}_t = S\, vec(\hat{s}_t\hat{s}_t'), where S is a selection matrix. In particular we obtain the test statistic denoted "White3" in Lee et al. (1993) by the auxiliary regression of \hat{u}_t = \hat{\sigma}^{-1}(y_t - x_t'\hat{\beta}) on x_t and \hat{k}_t, where \hat{k}_t is defined to satisfy \hat{m}_t = \hat{k}_t\hat{u}_t. The test statistic and its asymptotic distribution are then given by

    W_{IM} = TR^2 \rightarrow \chi^2(q),    (64)

where R2 is the coefficient of determination from the auxiliary regression.

4 Evaluating flexible regression models

To evaluate the flexible regression models, recursive model selection and estimation procedures are needed for each of the four flexible regression approaches, and the purpose of this section is to describe computationally feasible methods. For simplicity the appropriate number of regressors and nonlinear components included in each of the four models is chosen according to the AIC, BIC, and cross validation (CV) criteria, see e.g. Akaike (1969), Schwartz (1978), Stone (1974, 1977) and Wahba and Wold (1975). When evaluating the flexible regression models we suggest using simple measures of forecast ability such as the mean squared error (MSE), the mean absolute deviation (MAD), the mean absolute percentage error (MAPE), and simple directional measures based on a contingency table such as the Henriksson and Merton (1981) test (HM) and the \chi^2 test for independence (\chi^2), the confusion rate (CR), and the degree of diagonal concentration (\phi), see, e.g., Pesaran and Timmermann (1991). In addition we report Theil's U-statistic (U) and the Granger-Newbold version of the Mincer-Zarnowitz regression, where the actual value is regressed on the forecast, and the coefficient of determination (R^2) is applied as a measure of forecast accuracy if the regression has an intercept of zero and a slope of one. The objective is to find the best model based on the precision of the out of sample forecast. We may also obtain - as a by-product - useful information about the performance of the various model selection criteria with respect to choosing the best forecasting model.
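Several of the absolute and directional measures just listed are simple functions of the forecast errors and of the signs of forecasts and realizations. The fragment below sketches a few of them; it is only illustrative, the function name is ours, the Theil U shown is one common no-change-benchmark version, and the HM test, the \chi^2 independence test and \phi are omitted.

import numpy as np

def forecast_measures(actual, forecast):
    # Simple out of sample accuracy measures used in Section 4
    err = actual - forecast
    mse = np.mean(err**2)
    mad = np.mean(np.abs(err))
    mape = np.mean(np.abs(err / actual)) * 100            # assumes actual != 0
    # 2x2 contingency table of realized vs. predicted signs
    up_a, up_f = actual > 0, forecast > 0
    table = np.array([[np.sum(up_a & up_f), np.sum(~up_a & up_f)],
                      [np.sum(up_a & ~up_f), np.sum(~up_a & ~up_f)]])
    confusion_rate = (table[0, 1] + table[1, 0]) / table.sum()
    theil_u = np.sqrt(np.mean(err**2)) / np.sqrt(np.mean(actual**2))
    return {"MSE": mse, "MAD": mad, "MAPE": mape,
            "CR": confusion_rate, "U": theil_u}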

4.1 Recursive model selection and estimation procedures

Although modern computers are very efficient the computations involved in the different nonlinear approaches discussed here can be excessive. Furthermore, the procedure applied must be somewhat automatic, and in addition parsimony is an important objective. The procedure applied here in the specification and estimation of linear and Flexible Regression Models is a forward stepwise procedure, with simultaneous model selection and estimation at each step. The exact procedure applied is somewhat dependent upon which of the flexible approaches is considered and we will therefore provide a detailed description of the alternative procedures applied in all of the four cases.

4.1.1 Hamilton's Flexible Regression Model

From equations (1) and (2) it is seen that the model contains a linear and a nonlinear part. The first step consists of performing a forward stepwise linear regression with regressors (lags in the univariate case) added one at a time until no additional regressor improves upon the model selection criterion applied. The number of regressors in the linear part is then fixed. Next the number of regressors in the nonlinear part - consisting of the random function m(\cdot) - is to be determined. As in the linear part this is done by including regressors one at a time until the model selection criterion cannot be improved upon. If, after adding the first regressor to the nonlinear part of the model, the model selection criterion is not improved upon, this implies that the preferred model will be linear. When applied recursively to different but consecutive time periods, the Flexible Regression Model approach allows for the preferred model to be linear in some periods and nonlinear in others. Furthermore, a key feature of the model selection and estimation procedure is that every time a new regressor is added to the model all the parameters in the linear and nonlinear part are reestimated by maximum likelihood.

4.1.2 The Neural Network Regression Model

First, the number of regressors in the linear part of the model is determined and fixed in exactly the same way as described above. Secondly, a single hidden unit is added and regressors are selected one by one as part of the first hidden unit until the model selection criterion no longer can be improved. The number of regressors included in the

first hidden unit is thereafter fixed and a second hidden unit is added and the process repeated until five hidden units have been tried or the model selection criterion cannot be improved upon by adding additional hidden units. Again all the parameters of the model are reestimated by nonlinear least square every time a new regressor is included.

4.1.3 The Projection Pursuit Regression Model

The model selection procedure applied in connection with the Projection Pursuit Regression Model is similar to the model selection in the neural network case. However, there is one main difference. Since the parameters of the model are estimated in groups, not all the parameters of the model are reestimated every time a new regressor is included. In order to cut down the computational burden we do not consider backfitting, as is also apparent from the description of the PPR1 and PPR2 algorithms. The rule implies that only the subset of parameters in the hidden unit in which the new regressor is added is reestimated. The model selection procedure is as follows. Add a hidden unit - which in this case is an empirically determined univariate function - including a constant term and one regressor. Add regressors to this hidden unit and reestimate the model every time a new regressor is included until the model selection criterion cannot be improved. When the number of regressors in the first hidden unit is determined, fix both the number of regressors and the parameters at their estimated values. Add the second hidden unit and repeat the process until five hidden units have been tried or the model selection criterion cannot be improved upon by adding additional hidden units.

4.2 h steps ahead real time forecasts

In order to evaluate the forecast ability of the four flexible regression model approaches, sequences of out of sample one step ahead forecasts \hat{y}_{t_1+1} are generated by use of a data window containing a sample from the starting point in period t_0 to period t_1. In the next step a second one step ahead forecast \hat{y}_{t_1+2} is computed using a data window beginning at time t_0 and terminating at time t_1 + 1. Continuing this procedure, rolling the data window forward one period every time, enables us to simulate sequences of true out of sample forecasts. The sequences we generate contain n data points. The forecast period must be long enough to include periods where the nonlinear characteristics are present. For instance, if asymmetric dynamics over the phases of the business cycle is the expected cause of the nonlinearities, recessions as well as expansionary economic phases must exist in the out of sample forecast period.

In the following we will apply both one step ahead and four steps ahead forecasts. The four steps ahead forecast sequence for each flexible regression method is constructed in a way analogous to the sequences constructed for the one step ahead forecast described above. The motivation for also considering four steps ahead forecasts for each flexible regression method is that linear models might locally approximate nonlinear patterns reasonably well. Hence, the one step ahead forecast measure may not unveil the nonlinear components, while a four steps ahead forecast could. A drawback is that the overall forecast ability of all the econometric models may fall dramatically as a consequence of extending the forecast horizon. Hence, we may end up with the difficult task of comparing forecasts which are all of a rather low quality.

The method we use here to produce the one and four steps ahead forecasts is the so-called direct method, e.g. Granger and Teräsvirta (1993, p. 132). The primary reason for this particular choice is the conceptual and computational simplicity of the direct method in regression models that include nonlinear components in the conditional mean function. According to the direct method the h steps ahead forecasts are calculated simply as

    \hat{y}_{t+h} = \hat{\mu}_h(x_t, \hat{\varsigma}),    (65)

where \hat{\varsigma} is obtained from the general regression model y_{t+h} = \mu_h(x_t, \varsigma) + \varepsilon_t and hence will equal (a) the least squares estimator of the linear regression coefficients in cases where \mu_h(\cdot) is a linear function in the regressors, (b) the maximum likelihood estimator of \delta when applied to Hamilton's Flexible Regression Model, or (c) the nonlinear least squares estimator of \kappa and \iota in the Neural Network Regression Model or Projection Pursuit Regression Models, respectively. Notice that the estimated conditional mean function \hat{\mu}_h(\cdot) typically will depend on the forecast horizon.
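The rolling-window, direct-method forecast exercise can be sketched as a generic loop that takes any pair of fitting and prediction routines. The fragment below is our own schematic illustration for a univariate autoregression of order p; the function names and the fit/predict interface are assumptions, not the authors' code.

import numpy as np

def rolling_direct_forecasts(y, p, h, t1, fit, predict):
    # Rolling-window direct h-step forecasts as in Section 4.2: re-fit on
    # windows ending at t1, t1+1, ... and each time predict y_{t+h} from x_t.
    # `fit(X, target)` returns a parameter object, `predict(params, x)` a scalar.
    forecasts, actuals = [], []
    for end in range(t1, len(y) - h + 1):
        X = np.column_stack([y[p - 1 - j:end - h - j] for j in range(p)])  # lags at t
        target = y[p - 1 + h:end]                                          # y_{t+h}
        params = fit(X, target)
        x_last = y[end - p:end][::-1]          # most recent p observations as regressors
        forecasts.append(predict(params, x_last))
        actuals.append(y[end + h - 1])
    return np.array(forecasts), np.array(actuals)

# linear benchmark example:
#   fit = lambda X, t: np.linalg.lstsq(X, t, rcond=None)[0]
#   predict = lambda b, x: b @ x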

5 Empirical Illustrations

In this section we apply the described procedures in order to evaluate the flexible (possibly) nonlinear regression models for the growth rates in US industrial production and US unemployment. First, we use the tests for nonlinearities to motivate using the flexible nonlinear regression models. Secondly, we evaluate the flexible regression models in terms of forecast accuracy and comparisons are made to the linear model.

5.1 Identifying nonlinear time series components

In order to obtain the correct size, the tests for linearity should be based on the residuals from the best linear model. In practice, this is done by calculating each of the test statistics based on the best linear model being selected by the three model selection criteria in turn. Furthermore, we are conditioning the test statistics on the whole sample period.

[Table 1 about here]

We begin by considering the series of first differences in US unemployment (seasonally adjusted). We find that the best linear model based on AIC and CV consists of a constant

term and four lags, whereas the best linear model chosen by the BIC criterion includes a constant term but only two lags. The results presented in Table 1 indicate that the null hypothesis of linearity in all cases is rejected at the 5% level except in the case where inference is based on the outcome from the RESET test. However, based on the RESET test, rejection of linearity is still supported at a 10% level. Hence, the applied tests indicate the presence of a nonlinear component, suggesting some kind of nonlinear specification of the univariate model for the change in unemployment. For the growth rate of US industrial production (seasonally adjusted) the three model selection criteria agree on the best univariate linear model which includes a constant and two lags, see Table 1. Based on Hamilton’s linearity test and the RESET test it is not possible to reject the null of linearity unless the level of significance is raised to 15%. This result, however, is in strict disagreement with the outcome of the Neural Network test, Tsay’s and White’s test where the null hypothesis of linearity is rejected at the 5% level. It could be due to low power of Hamilton’s test and the RESET test against a specific kind of nonlinearity inherited in the industrial production series or it could be due to moderate nonlinearities difficult to detect. Hence, the applied tests disagree on the necessity for a non-linear specification of the univariate model for the growth rates of industrial production, and the evidence could imply that the nonlinearity in the US industrial production is nonexistent or of a moderate nature.

5.2 Relative forecast accuracy

To evaluate and find the best flexible regression model, sequences of real time forecasts from the alternative flexible regression models are generated as described in Section 4.2. In order to base decisions on reliable statistical grounds a sufficiently large sequence of real time forecasts is needed. However, a large in-sample period for "initial" estimation is also required in order to obtain reliable forecasts based on the flexible regression models. Consequently, when faced with observational data of a limited size one has to decide carefully on how long the sequence of real time forecasts should be, bearing in mind that a too long sequence may lead to too many rejections of the flexible regression model. We have chosen to produce a sequence of real time forecasts starting in 1980q1 and ending in 1998q2, hereby consisting of 76 observations. This leaves more than 120 observations for estimating the initial flexible regression model and for producing the first reliable real time forecast for 1980q1.

In recent work on the search for nonlinear components in US macroeconomic time series, real time forecast comparisons between linear and flexible regression models have been done by the use of the exponential smoothing approach, see Stock and Watson (1998), the neural network model, see Swanson and White (1995, 1997a, 1997b) and Stock

and Watson (1998), or the smooth transition autoregressions, see Teräsvirta (1995) and Stock and Watson (1998). In case of forecast ability based on MSE in particular, the evidence from these studies has been in favor of the linear model. However, as also pointed out by Swanson and White, this could be due to the chosen model selection criterion, and in fact they question the use of BIC for selecting the best model with respect to forecast performance.

[Table 2 about here]

The results of the model selection and forecasting exercise for the change in the US unemployment rate are given in Table 2. The first row indicates the model selection criterion used (i.e. the criterion that produced the best performing model), while the second row gives the number of cases (as a frequency) where the model was improved by adding a nonlinear component. The absolute measures of forecast performance are presented in the second block of rows of Table 2, while the third block contains the measures of the directional forecast performance. For h = 1, Hamilton's FNL model based on the CV criterion performs the best overall, particularly when the absolute measures of predictive accuracy are considered. The projection pursuit model PPR2, which is especially suitable in cases of only moderate nonlinearities, also performs quite well, especially with respect to the directional measures. Notice that the nonlinear component in the PPR2 model is imposed by constraining v to v \ge 1 in equation (27). While PPR2 therefore always includes the nonlinear component, PPR1 rarely does for h = 1. The FNL model suggests that the nonlinear component is necessary in 50% of all cases while the ANN applies a nonlinear component in 91% of all cases. For h = 4, the PPR models are dominant in terms of forecast performance. The PPR1 is the only model that has satisfying properties with respect to the directional measures, while the PPR2 outperforms all other models in terms of the absolute measures of forecast performance. For all the nonlinear models the frequency with which the nonlinear component is included increases with the forecast horizon.

[Table 3 about here]

The results conditional on the growth rate of US industrial production are presented in Table 3. When h = 1, the best performing model with respect to the absolute measures is the PPR2 using the CV criterion, while the ANN model and the CV criterion work best with respect to the directional measures. From the linearity test results it is not surprising that the FNL model does not perform any better than the linear model, as Hamilton's test, HLM, did not reject linearity. The ANN model finds that a nonlinear component should be added in all cases, and this result is also in agreement with the outcome from the neural network based linearity test.

For h = 4, it is again the FNL and the PPR2 that perform best. As for the unemployment rate, AIC seems the preferable selection procedure when using the PPR2, while BIC selects the best performing FNL model. Notice, in particular, the results based on the R2 measure, which indicate that the accuracy of the FNL approach is about five times better than the accuracy of the linear model. In addition, notice that the FNL model includes a nonlinear component in all cases. Finally, notice that the ANN model and the PPR1 model offer no or very little improvement relative to the linear model.

Summing up, the empirical results indicate that for both time series under consideration the forecast accuracy of the best flexible regression models is, in general, better than the forecast accuracy of the linear model. Among the nonlinear models, Hamilton's flexible regression model and the projection pursuit model applicable in case of moderate nonlinearities seem most accurate. A general specification test of nonlinear economic models for the US unemployment rate and/or the growth rate in US industrial production should therefore involve a comparison related to flexible nonlinear regression models and not be limited only to the linear model.

6 Conclusions

Based on real time forecast accuracy we have considered the task of evaluating and finding the best flexible regression model among alternative parametric as well as nonparametric regression model approaches. From the limited empirical evidence obtained here it is tentatively suggested to find a good flexible regression model for a univariate time series by the following procedure:

1. Generate sequences of h steps ahead real time forecasts by recursive specification and estimation of the flexible regression models. We suggest - as a minimum - to use Hamilton's (2001) flexible regression model approach and the Projection Pursuit model approach suggested by Aldrin, Boelviken and Schweder (1993). For recursive model selection we recommend also considering the cross validation model selection criterion, even though it is more CPU intensive relative to AIC and BIC.

2. Based on the sequences of real time forecasts, select the best flexible regression models by using simple measures of absolute forecast performance as well as simple measures of the directional forecast performance.

Based on the described procedures we demonstrate how to apply flexible regression models as a tool for identifying nonlinear components by comparing the real time forecast accuracy of the flexible regression models with the real time forecast accuracy of the linear specification. The evidence shows that flexible regression models may be a potentially powerful new instrument for identifying general nonlinear components in economic time series.


References

Akaike, H. (1969), 'Fitting autoregressions for prediction', Annals of the Institute of Statistical Mathematics 21, 243-247.

Aldrin, M., Boelviken, E. & Schweder, T. (1993), 'Projection pursuit regression for moderate non-linearities', Computational Statistics and Data Analysis 16, 379-403.

Dahl, C. M. (1999), 'An investigation of tests for linearity and the accuracy of flexible nonlinear inference'. Unpublished manuscript, Department of Economics, University of Aarhus.

Dahl, C. M. (2002), 'An investigation of tests for linearity and the accuracy of likelihood based inference', Econometrics Journal 5, 263-284.

Dahl, C. M. & González-Rivera, G. (2002), 'Testing for neglected nonlinearity in regression models based on the theory of random fields', Journal of Econometrics (forthcoming).

Friedman, J. & Stueltze, W. (1981), 'Projection pursuit regression', Journal of the American Statistical Association 76, 817-823.

Granger, C. W. J. & Teräsvirta, T. (1993), Modelling Nonlinear Economic Relationships, Oxford University Press, Oxford, New York.

Härdle, W. (1990), Applied Nonparametric Regression, Cambridge University Press, New York, NY.

Hamilton, J. D. (2001), 'A parametric approach to flexible nonlinear inference', Econometrica 69, 537-573.

Huber, P. (1985), 'Projection pursuit', Annals of Statistics 13, 435-475.

Lee, T.-H., White, H. & Granger, C. W. J. (1993), 'Testing for neglected nonlinearity in time series models', Journal of Econometrics 56, 269-290.

Ramsey, J. B. (1969), 'Tests for specification errors in classical linear least squares regression analysis', Journal of the Royal Statistical Society, Series B 31, 350-371.

Schwartz, G. (1978), 'Estimating the dimension of a model', Annals of Statistics 6, 461-464.

Stock, J. H. & Watson, M. W. (1998), 'A comparison of linear and nonlinear univariate models for forecasting macroeconomic time series', NBER Working Paper 6607.

Stone, C. (1977), 'Consistent nonparametric regression (with discussion)', Annals of Statistics 5, 595-645.

Stone, M. (1974), 'Cross-validatory choice and assessment of statistical predictions (with discussion)', Journal of the Royal Statistical Society, Series B 36, 111-147.

Swanson, N. R. & White, H. (1995), 'A model-selection approach to assessing the information in the term structure using linear models and artificial neural networks', Journal of Business and Economic Statistics 13(3), 265-275.

Swanson, N. R. & White, H. (1997a), 'Forecasting economic time series using flexible versus fixed specification and linear versus nonlinear econometric models', International Journal of Forecasting 13, 439-461.

Swanson, N. R. & White, H. (1997b), 'A model selection approach to real-time macroeconomic forecasting using linear models and artificial neural networks', The Review of Economics and Statistics, 541-550.

Teräsvirta, T. (1995), 'Modelling nonlinearity in U.S. gross national product 1889-1987', Empirical Economics 20, 577-597.

Tsay, R. S. (1986), 'Nonlinearity tests for time series', Biometrika 73, 461-466.

Wahba, G. & Wold, S. (1975), 'A completely automatic French curve: fitting spline functions by cross validation', Communications in Statistics, Series A 4, 1-17.

Wecker, W. E. & Ansley, C. F. (1983), 'The signal extraction approach to nonlinear regression and spline smoothing', Journal of the American Statistical Association 78, 81-89.

White, H. (1989), An additional hidden unit test for neglected nonlinearity in multilayer feedforward networks, in 'Proceedings of the International Joint Conference on Neural Networks', IEEE Press, New York, NY, Washington, DC, pp. 451-455.

White, H. (1992), Estimation, Inference and Specification Analysis, Cambridge University Press, New York, NY.

Yaglom, A. M. (1962), An Introduction to the Theory of Stationary Random Functions, Prentice-Hall, Englewood Cliffs, N.J.
