FORECAST EVALUATION AND COMBINATION

by Francis X. Diebold and Jose A. Lopez

Federal Reserve Bank of New York
Research Paper No. 9525
November 1995

This paper is being circulated for purposes of discussion and comment only. The contents should be regarded as preliminary and not for citation or quotation without permission of the author. The views expressed are those of the author and do not necessarily reflect those of the Federal Reserve Bank of New York or the Federal Reserve System.

Single copies are available on request to: Public Information Department, Federal Reserve Bank of New York, New York, NY 10045

Francis X. Diebold, Department of Economics, University of Pennsylvania, 3718 Locust Walk, Philadelphia, PA 19104-6297

Jose A. Lopez, Research and Market Analysis Group, Federal Reserve Bank of New York, 33 Liberty Street, New York, NY 10045

Print date: October 23, 1995

ABSTRACT: Forecasts are of great importance and widely used in economics and finance. Quite simply, good forecasts lead to good decisions. The importance of forecast evaluation and combination techniques follows immediately: forecast users naturally have a keen interest in monitoring and improving forecast performance. Here we provide a five-part selective account of forecast evaluation and combination methods. In the first, we discuss evaluation of a single forecast, and in particular, evaluation of whether and how it may be improved. In the second, we discuss the evaluation and comparison of the accuracy of competing forecasts. In the third, we discuss whether and how a set of forecasts may be combined to produce a superior composite forecast. In the fourth, we describe a number of forecast evaluation topics of particular relevance in economics and finance, including methods for evaluating direction-of-change forecasts, probability forecasts and volatility forecasts. In the fifth, we conclude.
Acknowledgements: The views expressed here are those of the authors and not those of the Federal Reserve Bank of New York or the Federal Reserve System. We thank Clive Granger for useful comments, and we thank the National Science Foundation, the Sloan Foundation and the University of Pennsylvania Research Foundation for financial support.

It is obvious that forecasts are of great importance and widely used in economics and finance. Quite simply, good forecasts lead to good decisions. The importance of forecast evaluation and combination techniques follows immediately: forecast users naturally have a keen interest in monitoring and improving forecast performance. More generally, forecast evaluation figures prominently in many questions in empirical economics and finance, such as:

Are expectations rational? (e.g., Keane and Runkle, 1990; Bonham and Cohen, 1995)
Are financial markets efficient? (e.g., Fama, 1970, 1991)
Do macroeconomic shocks cause agents to revise their forecasts at all horizons, or just at short- and medium-term horizons? (e.g., Campbell and Mankiw, 1987; Cochrane, 1988)
Are observed asset returns "too volatile"? (e.g., Shiller, 1979; LeRoy and Porter, 1981)
Are asset returns forecastable over long horizons? (e.g., Fama and French, 1988; Mark, 1995)
Are forward exchange rates unbiased and/or accurate forecasts of future spot prices at various horizons? (e.g., Hansen and Hodrick, 1980)
Are government budget projections systematically too optimistic, perhaps for strategic reasons? (e.g., Auerbach, 1994; Campbell and Ghysels, 1995)
Are nominal interest rates good forecasts of future inflation? (e.g., Fama, 1975; Nelson and Schwert, 1977)

Here we provide a five-part selective account of forecast evaluation and combination methods. In the first, we discuss evaluation of a single forecast, and in particular, evaluation of whether and how it may be improved.
In the second, we discuss the evaluation and comparison of the accuracy of competing forecasts. In the third, we discuss whether and how a set of forecasts may be combined to produce a superior composite forecast. In the fourth, we describe a number of forecast evaluation topics of particular relevance in economics and finance, including methods for evaluating direction-of-change forecasts, probability forecasts and volatility forecasts. In the fifth, we conclude.

In treating the subject of forecast evaluation, a tradeoff emerges between generality and tedium. Thus, we focus for the most part on linear least-squares forecasts of univariate covariance stationary processes, or we assume normality so that linear projections and conditional expectations coincide. We leave it to the reader to flesh out the remainder. However, in certain cases of particular interest, we do focus explicitly on nonlinearities that produce divergence between the linear projection and the conditional mean, as well as on nonstationarities that require special attention.

I. Evaluating a Single Forecast

The properties of optimal forecasts are well known; forecast evaluation essentially amounts to checking those properties. First, we establish some notation and recall some familiar results. Denote the covariance stationary time series of interest by $y_t$. Assuming that the only deterministic component is a possibly nonzero mean, $\mu$, the Wold representation is

$$y_t = \mu + \varepsilon_t + b_1 \varepsilon_{t-1} + b_2 \varepsilon_{t-2} + \ldots,$$

where $\varepsilon_t \sim WN(0, \sigma^2)$, and WN denotes serially uncorrelated (but not necessarily Gaussian, and hence not necessarily independent) white noise. We assume invertibility throughout, so that an equivalent one-sided autoregressive representation exists.

The k-step-ahead linear least-squares forecast is

$$\hat{y}_{t+k,t} = \mu + b_k \varepsilon_t + b_{k+1} \varepsilon_{t-1} + \ldots,$$

and the corresponding k-step-ahead forecast error is

$$e_{t+k,t} = y_{t+k} - \hat{y}_{t+k,t} = \varepsilon_{t+k} + b_1 \varepsilon_{t+k-1} + \ldots + b_{k-1} \varepsilon_{t+1}. \qquad (1)$$

Finally, the k-step-ahead forecast error variance is

$$\sigma_k^2 = \mathrm{var}(e_{t+k,t}) = \sigma^2 \sum_{i=0}^{k-1} b_i^2, \quad \text{with } b_0 = 1. \qquad (2)$$

Four key properties of errors from optimal forecasts, which we discuss in greater detail below, follow immediately:

(a) Optimal forecast errors have a zero mean (follows from (1));
(b) 1-step-ahead optimal forecast errors are white noise (special case of (1) corresponding to k = 1);
(c) k-step-ahead optimal forecast errors are at most MA(k-1) (general case of (1));
(d) The k-step-ahead optimal forecast error variance is non-decreasing in k (follows from (2)).

Before proceeding, we now describe some exact distribution-free nonparametric tests for whether an independently (but not necessarily identically) distributed series has a zero median. The tests are useful in evaluating the properties of optimal forecast errors listed above, as well as other hypotheses that will concern us later. Many such tests exist; two of the most popular, which we use repeatedly, are the sign test and the Wilcoxon signed-rank test.

Denote the series being examined by $x_t$, and assume that T observations are available. The sign test proceeds under the null hypothesis that the observed series is independent with a zero median.[1] The intuition and construction of the test statistic are straightforward: under the null, the number of positive observations in a sample of size T has the binomial distribution with parameters T and 1/2. The test statistic is therefore simply

$$S = \sum_{t=1}^{T} I_+(x_t), \quad \text{where } I_+(x_t) = 1 \text{ if } x_t > 0 \text{ and } 0 \text{ otherwise}.$$

In large samples, the studentized version of the statistic is standard normal,

$$\frac{S - T/2}{\sqrt{T/4}} \overset{a}{\sim} N(0,1).$$

Thus, significance may be assessed using standard tables of the binomial or normal distributions. Note that the sign test does not require distributional symmetry.

The Wilcoxon signed-rank test, a related distribution-free procedure, does require distributional symmetry, but it can be more powerful than the sign test in that case.
Apart from the additional assumption of symmetry, the null hypothesis is the same, and the test statistic is the sum of the ranks of the absolute values of the positive observations,

$$W = \sum_{t=1}^{T} I_+(x_t)\, \mathrm{Rank}(|x_t|),$$

where the ranking is in increasing order (e.g., the largest absolute observation is assigned a rank of T, and so on). The intuition of the test is simple: if the underlying distribution is symmetric about zero, a "very large" (or "very small") sum of the ranks of the absolute values of the positive observations is "very unlikely." The exact finite-sample null distribution of the signed-rank statistic is free from nuisance parameters and invariant to the true underlying distribution, and it has been tabulated. Moreover, in large samples, the studentized version of the statistic is standard normal,

$$\frac{W - T(T+1)/4}{\sqrt{T(T+1)(2T+1)/24}} \overset{a}{\sim} N(0,1).$$

[1] If the series is symmetrically distributed, then a zero median of course corresponds to a zero mean.

Testing Properties of Optimal Forecasts

Given a track record of forecasts, $\hat{y}_{t+k,t}$, and corresponding realizations, $y_{t+k}$, forecast users will naturally want to assess forecast performance. The properties of optimal forecasts, cataloged above, can readily be checked.

a. Optimal Forecast Errors Have a Zero Mean

A variety of standard tests of this hypothesis can be performed, depending on the assumptions one is willing to maintain. For example, if $e_{t+k,t}$ is Gaussian white noise (as might be the case for 1-step-ahead errors), then the standard t-test is the obvious choice because it is exact and uniformly most powerful. If the errors are non-Gaussian but remain independent and identically distributed (iid), then the t-test is still useful asymptotically. However, if more complicated dependence or heterogeneity structures are (or may be) operative, then alternative tests are required, such as those based on the generalized method of moments.
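As a rough illustration, the sign and signed-rank statistics defined earlier in this section, together with their studentized large-sample versions, might be computed as in the following Python sketch. The function names and the sample of forecast errors are hypothetical and chosen only for illustration, and the sketch assumes no ties and no exact zeros among the observations, a case the text does not discuss.

```python
import math


def sign_test(x):
    """Return the sign statistic S and its studentized version.

    S counts the positive observations; under the null it is
    Binomial(T, 1/2), so the studentized form is (S - T/2) / sqrt(T/4).
    """
    T = len(x)
    S = sum(1 for v in x if v > 0)
    return S, (S - T / 2) / math.sqrt(T / 4)


def signed_rank_test(x):
    """Return the signed-rank statistic W and its studentized version.

    Absolute values are ranked in increasing order (largest gets rank T).
    Assumption: no ties and no exact zeros, so ranks are unambiguous.
    """
    T = len(x)
    order = sorted(range(T), key=lambda t: abs(x[t]))
    rank = [0] * T
    for r, t in enumerate(order, start=1):
        rank[t] = r
    W = sum(rank[t] for t in range(T) if x[t] > 0)
    mean = T * (T + 1) / 4
    var = T * (T + 1) * (2 * T + 1) / 24
    return W, (W - mean) / math.sqrt(var)


# Hypothetical 1-step-ahead forecast errors, for illustration only.
errors = [0.5, -1.2, 0.3, 2.1, -0.4, 0.9, -0.1, 1.5]
S, z_S = sign_test(errors)
W, z_W = signed_rank_test(errors)
```

With a track record this short, one would normally refer S and W to their exact tabulated null distributions rather than to the normal approximation; the studentized values are shown only because the formulas appear in the text.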
It would be unfortunate if non-normality or richer dependence/heterogeneity structures mandated the use of asymptotic tests, because sometimes only short track records are available. Such is not the case, however, because exact distribution-free nonparametric tests are often applicable, as pointed out by Campbell and Ghysels (1995). Although the distribution-free tests do require independence (sign test) and independence and symmetry (signed-rank test), they do not require normality or identical distributions over time. Thus, the tests are automatically robust to a variety of forecast error distributions, and to heteroskedasticity of the independent but not identically distributed type.

For k > 1, however, even optimal forecast errors are likely to display serial correlation, so the nonparametric tests must be modified. Under the assumption that the forecast errors are (k-1)-dependent, each of the following k series of forecast errors will be free of serial dependence:

$$\{e_{1+k,1},\, e_{1+2k,1+k},\, e_{1+3k,1+2k},\, \ldots\},\ \{e_{2+k,2},\, e_{2+2k,2+k},\, e_{2+3k,2+2k},\, \ldots\},\ \{e_{3+k,3},\, e_{3+2k,3+k},\, e_{3+3k,3+2k},\, \ldots\},\ \ldots,\ \{e_{2k,k},\, e_{3k,2k},\, e_{4k,3k},\, \ldots\}.$$

Thus, a Bonferroni bounds test (with size bounded above by $\alpha$) is obtained by performing k tests, each of size $\alpha/k$, one on each of the k error series, and rejecting the null hypothesis if the null is rejected for any of the series. This procedure is conservative, even asymptotically. Alternatively, one could use just one of the k error series and perform an exact test at level $\alpha$, at the cost of reduced power due to the discarded observations.

In concluding this section, let us stress that the nonparametric distribution-free tests are neither unambiguously "better" nor "worse" than the more common tests; rather, they are useful in different situations and are therefore complementary.
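The Bonferroni bounds procedure for (k-1)-dependent errors described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the errors are assumed to be stored in order of forecast origin, and the exact sign test is taken to be two-sided, which the text does not specify.

```python
from math import comb


def sign_test_pvalue(x):
    """Exact two-sided sign-test p-value (assumed two-sided for illustration).

    Under the null, the number of positive observations is Binomial(T, 1/2).
    """
    T = len(x)
    S = sum(1 for v in x if v > 0)
    tail = min(S, T - S)
    p = 2 * sum(comb(T, j) for j in range(tail + 1)) / 2 ** T
    return min(1.0, p)


def split_into_k_series(errors, k):
    """Split k-step-ahead errors into k subseries of every k-th error.

    Subseries i collects the errors made at origins i+1, i+1+k, i+1+2k, ...
    """
    return [errors[i::k] for i in range(k)]


def bonferroni_sign_test(errors, k, alpha=0.05):
    """Reject the zero-median null if any subseries rejects at level alpha/k."""
    pvalues = [sign_test_pvalue(s) for s in split_into_k_series(errors, k)]
    return any(p < alpha / k for p in pvalues), pvalues


# Hypothetical 3-step-ahead errors that are all positive, for illustration.
reject, pvalues = bonferroni_sign_test([0.3] * 30, k=3)
```

The same splitting step can be reused with the signed-rank test in place of the sign test; only the per-series p-value function would change.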
To their credit, they are often exact finite-sample tests with good finite-sample power, and they are insensitive to deviations from the standard assumptions of normality and homoskedasticity required to justify more standard tests in small samples. Against them, however, is the fact that they require independence of the forecast errors, an assumption even stronger than conditional-mean independence, let alone linear-projection independence. Furthermore, although the nonparametric tests can be modified to allow for k-dependence, a possibly substantial price must be paid either in terms of inexact size or reduced power.

b. 1-Step-Ahead Optimal Forecast Errors are White Noise

More precisely, the errors from linear least squares forecasts are linear-projection independent, and the errors from least squares forecasts are conditional-mean independent. The errors never need be fully serially independent, because dependence can always enter through higher moments, as for example with the conditional-variance dependence of GARCH processes.

Under various sets of maintained assumptions, standard asymptotic tests may be used to test the white noise hypothesis. For example, the sample autocorrelation and partial autocorrelation functions, together with Bartlett asymptotic standard errors, may be useful graphical diagnostics in that regard. Standard tests based on the serial correlation coefficient, as well as the Box-Pierce and related statistics, may be useful as well.

Dufour (1981) presents adaptations of the sign and Wilcoxon signed-rank tests that yield exact tests for serial dependence in 1-step-ahead forecast errors, without requiring normality or identical forecast error distributions. Consider, for example, the null hypothesis that the forecast errors are independent and symmetrically distributed with zero median.
Then $\mathrm{median}(e_{t+1,t}\, e_{t+2,t+1}) = 0$; that is, the product of two symmetric independent random variables with zero median is itself symmetric with zero median. Under the alternative of positive serial dependence, $\mathrm{median}(e_{t+1,t}\, e_{t+2,t+1}) > 0$, and under the alternative of negative serial dependence, $\mathrm{median}(e_{t+1,t}\, e_{t+2,t+1}) < 0$. This suggests examining the cross-product series $z_t = e_{t+1,t}\, e_{t+2,t+1}$ for symmetry about zero, the obvious test for which is the signed-rank test,

$$W_D = \sum_{t=1}^{T} I_+(z_t)\, \mathrm{Rank}(|z_t|).$$

Note that the $z_t$ sequence will be serially dependent even if the $e_{t+1,t}$ sequence is not, in apparent violation of the conditions required for validity of the signed-rank test (applied to $z_t$). Hence the importance of Dufour's contribution: Dufour shows that the serial correlation is of no consequence and that the distribution of $W_D$ is the same as that of $W$.

c. k-Step-Ahead Optimal Forecast Errors are at Most MA(k-1)

Cumby and Huizinga (1992) develop a useful asymptotic test for serial dependence of order greater than k-1. The null hypothesis is that the $e_{t+k,t}$ series is MA(q) ($0 \le q \le k-1$) against the alternative hypothesis that at least one autocorrelation is nonzero at a lag greater than k-1. Under the null, the sample autocorrelations of $e_{t+k,t}$, $\hat{\rho} = [\hat{\rho}_{q+1}, \ldots, \hat{\rho}_{q+s}]$, are asymptotically distributed $\sqrt{T}\, \hat{\rho} \sim N(0, V)$.[2] Thus,

$$C = T\, \hat{\rho}'\, \hat{V}^{-1} \hat{\rho}$$

is asymptotically distributed as $\chi^2_s$ under the null, where $\hat{V}$ is a consistent estimator of $V$.

[2] s is a cutoff lag selected by the user.

Dufour's (1981) distribution-free nonparametric tests may also be adapted to provide a finite-sample bounds test for serial dependence of order greater than k-1. As before, separate the forecast errors into k series, each of which is serially independent under the null of (k-1)-dependence. Then, for each series, take $z_{k,t} = e_{t+k,t}\, e_{t+2k,t+k}$ and reject at significance level bounded above by $\alpha$ if one or more of the subset test statistics rejects at the $\alpha/k$ level.

d. The k-Step-Ahead Optimal Forecast Error Variance is Non-Decreasing in k

The k-step-ahead forecast error variance, $\sigma_k^2 = \mathrm{var}(e_{t+k,t}) = \sigma^2 \sum_{i=0}^{k-1} b_i^2$ (with $b_0 = 1$), is non-decreasing in k. Thus, it is often useful simply to examine the sample k-step-ahead forecast error variances as a function of k, both to be sure the condition appears satisfied and to see the pattern with which the forecast error variance grows with k, which often conveys useful information.[3] Formal inference may also be done, so long as one takes care to allow for dependence of the sample variances across horizons.

[3] Extensions of this idea to nonstationary long-memory environments are developed in Diebold and Lindner (1995).

Assessing Optimality with Respect to an Information Set

The key property of optimal forecast errors, from which all others follow (including those cataloged above), is unforecastability on the basis of information available at the time the forecast was made. This is true regardless of whether linear-projection optimality or conditional-mean optimality is of interest, regardless of whether the relevant loss function is quadratic, and regardless of whether the series being forecast is stationary.

Following Brown and Maital (1981), it is useful to distinguish between partial and full optimality. Partial optimality refers to unforecastability of forecast errors with respect to some subset, as opposed to all subsets, of available information, $\Omega_t$. Partial optimality, for example, characterizes a situation in which a forecast is optimal with respect to the information used to construct it, but the information used was not all that could have been used. Thus, each of a set of competing forecasts may have the partial optimality property if each is optimal with respect to its own information set.

One may test partial optimality via regressions of the form $e_{t+k,t} = \alpha' x_t + u_t$, where $x_t \subset \Omega_t$.
The particular case of testing partial optimality with respect to $\hat{y}_{t+k,t}$ has received a good deal of attention, as in Mincer and Zarnowitz (1969). The relevant regression is $e_{t+k,t} = \alpha_0 + \alpha_1 \hat{y}_{t+k,t} + u_t$ or $y_{t+k} = \beta_0 + \beta_1 \hat{y}_{t+k,t} + u_t$, where partial optimality corresponds to $(\alpha_0, \alpha_1) = (0, 0)$ or $(\beta_0, \beta_1) = (0, 1)$.[4] One may also expand the regression to allow for various sorts of nonlinearity. For example, following Ramsey (1969), one may test whether all coefficients in the regression

$$e_{t+k,t} = \sum_{j=0}^{J} \alpha_j\, \hat{y}_{t+k,t}^{\,j} + u_t$$

are zero.

[4] In such regressions, the disturbance should be white noise for 1-step-ahead forecasts but may be serially correlated for multi-step-ahead forecasts.

Full optimality, in contrast, requires the forecast error to be unforecastable on the basis of all information available when the forecast was made (that is, the entirety of $\Omega_t$). Conceptually, one could test full rationality via regressions of the form $e_{t+k,t} = \alpha' x_t + u_t$. If $\alpha = 0$ for all $x_t \subset \Omega_t$, then the forecast is fully optimal. In practice, one can never test for full optimality, but rather only partial optimality with respect to increasing information sets.

Distribution-free nonparametric methods may also be used to test optimality with respect to various information sets. The sign and signed-rank tests, for example, are readily adapted to test orthogonality between forecast errors and available information, as proposed by Campbell and Dufour (1991, 1995). If, for example, $e_{t+1,t}$ is linear-projection independent of $x_t \in \Omega_t$, then $\mathrm{cov}(e_{t+1,t}, x_t) = 0$. Thus, in the symmetric case, one may use the signed-rank test for whether $E[z_t] = E[e_{t+1,t}\, x_t] = 0$, and more generally, one may use the sign test for whether $\mathrm{median}(z_t) = \mathrm{median}(e_{t+1,t}\, x_t) = 0$.
The relevant sign and signed-rank statistics[5] are

$$S_\perp = \sum_{t=1}^{T} I_+(z_t) \quad \text{and} \quad W_\perp = \sum_{t=1}^{T} I_+(z_t)\, \mathrm{Rank}(|z_t|).$$

Moreover, one may allow for nonlinear transformations of the elements of the information set, which is useful for assessing conditional-mean as opposed to simply linear-projection independence, by taking $z_t = e_{t+1,t}\, g(x_t)$, where $g(\cdot)$ is a nonlinear function of interest. Finally, the tests can be generalized to allow for k-step-ahead forecast errors as before. Simply take $z_t = e_{t+k,t}\, g(x_t)$, divide the $z_t$ series into the usual k subsets, and reject the orthogonality null at significance level bounded by $\alpha$ if any of the subset test statistics are significant at the $\alpha/k$ level.[6]

[5] Again, it is not obvious that the conditions required for application of the sign or signed-rank test to $z_t$ are satisfied, but they are; see Campbell and Dufour (1995) for details.

[6] Our discussion has implicitly assumed that both $e_{t+1,t}$ and $g(x_t)$ are centered at zero. This will hold for $e_{t+1,t}$ if the forecast is unbiased, but there is no reason why it should hold for $g(x_t)$. Thus, in general, the test is based on $g(x_t) - \mu_t$, where $\mu_t$ is a centering parameter such as the mean, median or trend of $g(x_t)$. See Campbell and Dufour (1995) for details.

II. Comparing the Accuracy of Multiple Forecasts

Measures of Forecast Accuracy

In practice, it is unlikely that one will ever stumble upon a fully-optimal forecast; instead, situations often arise in which a number of forecasts (all of them suboptimal) are compared and possibly combined. The crucial object in measuring forecast accuracy is the loss function, $L(y_{t+k}, \hat{y}_{t+k,t})$, often restricted to $L(e_{t+k,t})$, which charts the "loss," "cost" or "disutility" associated with various pairs of forecasts and realizations. In addition to the shape of the loss function, the forecast horizon (k) is also of crucial importance. Rankings of forecast accuracy may be very different across different loss functions and/or different horizons.
This result has led some to argue the virtues of various "universally applicable" accuracy measures. Clements and Hendry (1993), for example, argue for an accuracy measure under which forecast rankings are invariant to certain transformations.

Ultimately, however, the appropriate loss function depends on the situation at hand. As stressed by Diebold (1993) among many others, forecasts are usually constructed for use in particular decision environments; for example, policy decisions by government officials or trading decisions by market participants. Thus, the appropriate accuracy measure arises from the loss function faced by the forecast user. Economists, for example, may be interested in the profit streams (e.g., Leitch and Tanner, 1991, 1995; Engle et al., 1993) or utility streams (e.g., McCulloch and Rossi, 1990; West, Edison and Cho, 1993) flowing from various forecasts.

Nevertheless, let us discuss a few stylized statistical loss functions, because they are used widely and serve as popular benchmarks. Accuracy measures are usually defined on the forecast errors, $e_{t+k,t} = y_{t+k} - \hat{y}_{t+k,t}$, or percent errors, $p_{t+k,t} = (y_{t+k} - \hat{y}_{t+k,t}) / y_{t+k}$. For example, the mean error, $ME = \frac{1}{T} \sum_{t=1}^{T} e_{t+k,t}$, and mean percent error, $MPE = \frac{1}{T} \sum_{t=1}^{T} p_{t+k,t}$, provide measures of bias, which is one component of accuracy.
The most common overall accuracy measure, by far, is mean squared error, $MSE = \frac{1}{T} \sum_{t=1}^{T} e_{t+k,t}^2$, or mean squared percent error, $MSPE = \frac{1}{T} \sum_{t=1}^{T} p_{t+k,t}^2$. Often the square roots of these measures are used to preserve units, yielding the root mean squared error, $RMSE = \sqrt{\frac{1}{T} \sum_{t=1}^{T} e_{t+k,t}^2}$, and the root mean squared percent error, $RMSPE = \sqrt{\frac{1}{T} \sum_{t=1}^{T} p_{t+k,t}^2}$. Somewhat less popular, but nevertheless common, accuracy measures are mean absolute error, $MAE = \frac{1}{T} \sum_{t=1}^{T} |e_{t+k,t}|$, and mean absolute percent error, $MAPE = \frac{1}{T} \sum_{t=1}^{T} |p_{t+k,t}|$.

MSE admits an informative decomposition into the sum of the variance of the forecast error and its squared bias,

$$MSE = E[(y_{t+k} - \hat{y}_{t+k,t})^2] = \mathrm{var}(y_{t+k} - \hat{y}_{t+k,t}) + (E[y_{t+k}] - E[\hat{y}_{t+k,t}])^2,$$

or equivalently

$$MSE = \mathrm{var}(y_{t+k}) + \mathrm{var}(\hat{y}_{t+k,t}) - 2\,\mathrm{cov}(y_{t+k}, \hat{y}_{t+k,t}) + (E[y_{t+k}] - E[\hat{y}_{t+k,t}])^2.$$

This result makes clear that MSE depends only on the second moment structure of the joint distribution of the actual and forecasted series. Thus, as noted in Murphy and Winkler (1987, 1992), although MSE is a useful summary statistic for the joint distribution of $y_{t+k}$ and $\hat{y}_{t+k,t}$, in general it contains substantially less information than the actual joint distribution itself. Other statistics highlighting different aspects of the joint distribution may therefore be useful as well. Ultimately, of course, one may want to focus directly on estimates of the joint distribution, which may be available if the sample size is large enough to permit relatively precise estimation.

Measuring Forecastability

It is natural and informative to evaluate the accuracy of a forecast. We hasten to add, however, that actual and forecasted values may be dissimilar even for very good forecasts. To take an extreme example, note that the linear least squares forecast for a zero-mean white noise process is simply zero: the paths of forecasts and realizations will look very different, yet there does not exist a better linear forecast under quadratic loss.
This example highlights the inherent limits to forecastability, which depends on the process being forecast; some processes are inherently easy to forecast, while others are hard to forecast. In other words, sometimes the information on which the forecaster optimally conditions is very valuable, and sometimes it isn't.

The issue of how to quantify forecastability arises at once. Granger and Newbold (1976) propose a natural definition of forecastability for covariance stationary series under squared-error loss, patterned after the familiar $R^2$ of linear regression,

$$G = \frac{\mathrm{var}(\hat{y}_{t+1,t})}{\mathrm{var}(y_{t+1})} = 1 - \frac{\mathrm{var}(e_{t+1,t})}{\mathrm{var}(y_{t+1})},$$

where both the forecast and forecast error refer to the optimal (that is, linear least squares or conditional mean) forecast.

In closing this section, we note that although measures of forecastability are useful constructs, they are driven by the population properties of processes and their optimal forecasts, so they don't help one to evaluate the "goodness" of an actual reported forecast, which may be far from optimal. For example, if the variance of $e_{t+1,t}$ is not much lower than the variance of the covariance stationary series $y_{t+1}$, it could be that either the forecast is poor, the series is inherently almost unforecastable, or both.

Statistical Comparison of Forecast Accuracy[7]

Once a loss function has been decided upon, it is often of interest to know which of the competing forecasts has smallest expected loss. Forecasts may of course be ranked according to average loss over the sample period, but one would like to have a measure of the sampling variability in such average losses. Alternatively, one would like to be able to test the hypothesis that the difference of expected losses between forecasts i and j is zero (i.e., $E[L(y_{t+k}, \hat{y}^i_{t+k,t})] = E[L(y_{t+k}, \hat{y}^j_{t+k,t})]$), against the alternative that one forecast is better.

[7] This section draws heavily upon Diebold and Mariano (1995).
Stekler (1987) proposes a rank-based test of the hypothesis that each of a set of forecasts has equal expected loss.[8] Given N competing forecasts, assign to each forecast at each time a rank according to its accuracy (the best forecast receives a rank of N, the second-best receives a rank of N-1, and so forth). Then aggregate the period-by-period ranks for each forecast,

$$H^i = \sum_{t=1}^{T} \mathrm{Rank}(L(y_{t+k}, \hat{y}^i_{t+k,t})), \quad i = 1, \ldots, N,$$

and form the chi-squared goodness-of-fit test statistic, $H$. Under the null, $H \sim \chi^2_{N-1}$. As described here, the test requires the rankings to be independent over space and time, but simple modifications along the lines of the Bonferroni bounds test may be made if the rankings are temporally (k-1)-dependent. Moreover, exact versions of the test may be obtained by exploiting Fisher's randomization principle.[9]

[8] Stekler uses RMSE, but other loss functions may be used.

[9] See, for example, Bradley (1968), Chapter 4.

One limitation of Stekler's rank-based approach is that information on the magnitude of differences in expected loss across forecasters is discarded. In many applications, one wants to know not only whether the difference of expected losses differs from zero (or the ratio differs from 1), but also by how much it differs. Effectively, one wants to know the sampling distribution of the sample mean loss differential (or of the individual sample mean losses), which in addition to being directly informative would enable Wald tests of the hypothesis that the expected loss differential is zero. Diebold and Mariano (1995), building on earlier work by Granger and Newbold (1986) and Meese and Rogoff (1988), develop a test for a zero expected loss differential that allows for forecast errors that are nonzero mean, non-Gaussian, serially correlated and contemporaneously correlated.
In general, the loss function is $L(y_{t+k}, \hat{y}^i_{t+k,t})$. Because in many applications the loss function will be a direct function of the forecast error, $L(y_{t+k}, \hat{y}^i_{t+k,t}) = L(e^i_{t+k,t})$, we write $L(e^i_{t+k,t})$ from this point on to economize on notation, while recognizing that certain loss functions (such as direction-of-change) don't collapse to the $L(e^i_{t+k,t})$ form.[10] The null hypothesis of equal forecast accuracy for two forecasts is $E[L(e^i_{t+k,t})] = E[L(e^j_{t+k,t})]$, or $E[d_t] = 0$, where $d_t = L(e^i_{t+k,t}) - L(e^j_{t+k,t})$ is the loss differential.

[10] In such cases, the $L(y_{t+k}, \hat{y}^i_{t+k,t})$ form should be used.

If $d_t$ is a covariance stationary, short-memory series, then standard results may be used to deduce the asymptotic distribution of the sample mean loss differential,

$$\sqrt{T}\,(\bar{d} - \mu) \overset{a}{\sim} N(0,\, 2\pi f_d(0)),$$

where $\bar{d} = \frac{1}{T} \sum_{t=1}^{T} [L(e^i_{t+k,t}) - L(e^j_{t+k,t})]$ is the sample mean loss differential, $f_d(0) = \frac{1}{2\pi} \sum_{\tau=-\infty}^{\infty} \gamma_d(\tau)$ is the spectral density of the loss differential at frequency zero, $\gamma_d(\tau) = E[(d_t - \mu)(d_{t-\tau} - \mu)]$ is the autocovariance of the loss differential at displacement $\tau$, and $\mu$ is the population mean loss differential. The formula for $f_d(0)$ shows that the correction for serial correlation can be substantial, even if the loss differential is only weakly serially correlated, due to the cumulation of the autocovariance terms. In large samples, the obvious statistic for testing the null hypothesis of equal forecast accuracy is the standardized sample mean loss differential,

$$B = \frac{\bar{d}}{\sqrt{2\pi \hat{f}_d(0) / T}},$$

where $\hat{f}_d(0)$ is a consistent estimate of $f_d(0)$.

It is useful to have available exact finite-sample tests of forecast accuracy to complement the asymptotic tests. As usual, variants of the sign and signed-rank tests are applicable. When using the sign test, the null hypothesis is that the median of the loss differential is zero, $\mathrm{median}(L(e^i_{t+k,t}) - L(e^j_{t+k,t})) = 0$.
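To make the standardized sample mean loss differential concrete, the following Python sketch computes B from two series of losses. It is an illustration under an explicit assumption: the text requires only a consistent estimate of $f_d(0)$, and here $2\pi f_d(0)$ is estimated by an equally weighted sum of sample autocovariances up to lag k-1, a choice made for this example. The function name and the error series are hypothetical.

```python
import math


def diebold_mariano(loss_i, loss_j, k=1):
    """Return the standardized sample mean loss differential B.

    Assumption (not specified in the text): 2*pi*f_d(0) is estimated by
    gamma(0) + 2 * sum of sample autocovariances at lags 1, ..., k-1.
    """
    d = [a - b for a, b in zip(loss_i, loss_j)]
    T = len(d)
    d_bar = sum(d) / T

    def autocov(lag):
        # Sample autocovariance of the loss differential at the given lag.
        return sum((d[t] - d_bar) * (d[t - lag] - d_bar)
                   for t in range(lag, T)) / T

    long_run_var = autocov(0) + 2 * sum(autocov(lag) for lag in range(1, k))
    if long_run_var <= 0:
        raise ValueError("non-positive long-run variance estimate")
    return d_bar / math.sqrt(long_run_var / T)


# Hypothetical 1-step-ahead errors from two forecasts, squared-error loss.
e_i = [0.5, -1.0, 0.8, -0.3, 1.2, -0.7, 0.4, -0.9]
e_j = [1.0, -1.5, 1.1, -0.2, 1.9, -1.0, 0.9, -1.4]
B = diebold_mariano([e * e for e in e_i], [e * e for e in e_j], k=1)
```

A negative B indicates that forecast i has the lower average loss in the sample; in large samples B would be compared with standard normal critical values. The equally weighted estimate is not guaranteed to be positive when k > 1, which is why the sketch raises an error in that case instead of returning a value.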
Note that the null of a zero median loss differential is not the same as the null of zero difference between median losses; that is,

$$\mathrm{median}(L(e^i_{t+k,t}) - L(e^j_{t+k,t})) \ne \mathrm{median}(L(e^i_{t+k,t})) - \mathrm{median}(L(e^j_{t+k,t})).$$

For this reason, the null differs slightly in spirit from that associated with the asymptotic Diebold-Mariano test, but nevertheless, it has the intuitive and meaningful interpretation that $P(L(e^i_{t+k,t}) > L(e^j_{t+k,t})) = P(L(e^i_{t+k,t}) < L(e^j_{t+k,t}))$.

When using the Wilcoxon signed-rank test, the null hypothesis is that the loss differential series is symmetric about a zero median (and hence mean), which corresponds precisely to the null of the asymptotic Diebold-Mariano test. Symmetry of the loss differential will obtain, for example, if the distributions of $L(e^i_{t+k,t})$ and $L(e^j_{t+k,t})$ are the same up to a location shift. Symmetry is ultimately an empirical matter and may be assessed using standard procedures.

The construction and intuition of the distribution-free nonparametric test statistics are straightforward. The sign test statistic is $S_B = \sum_{t=1}^{T} I_+(d_t)$, and the signed-rank test statistic is $W_B = \sum_{t=1}^{T} I_+(d_t)\, \mathrm{Rank}(|d_t|)$. Serial correlation may be handled as before via Bonferroni bounds. It is interesting to note that, in multi-step forecast comparisons, forecast error serial correlation may be a "common feature" in the terminology of Engle and Kozicki (1993), because it is induced largely by the fact that the forecast horizon is longer than the interval at which the data are sampled and may therefore not be present in loss differentials even if present in the forecast errors themselves. This possibility can of course be checked empirically.

West (1994) takes an approach very much related to, but nevertheless different from, that of Diebold and Mariano. The main difference is that West assumes that forecasts are computed from an estimated regression model and explicitly accounts for the effects of parameter uncertainty within that framework.
When the estimation sample is small, the tests can lead to different results. However, as the estimation period grows in length relative to the forecast period, the effects of parameter uncertainty vanish, and the Diebold-Mariano and West statistics are identical.

West's approach is both more general and less general than the Diebold-Mariano approach. It is more general in that it corrects for nonstationarities induced by the updating of parameter estimates. It is less general in that those corrections are made within the confines of a more rigid framework than that of Diebold and Mariano, in whose framework no assumptions need be made about the often unknown or incompletely known models that underlie forecasts.

In closing this section, we note that it is sometimes informative to compare the accuracy of a forecast to that of a "naive" competitor. A simple and popular such comparison is achieved by Theil's (1961) U statistic, which is the ratio of the 1-step-ahead MSE for a given forecast relative to that of a random walk forecast $\hat{y}_{t+1,t} = y_t$; that is,

$$U = \frac{\sum_{t=1}^{T} (y_{t+1} - \hat{y}_{t+1,t})^2}{\sum_{t=1}^{T} (y_{t+1} - y_t)^2}.$$

Generalization to other loss functions and other horizons is immediate. The statistical significance of the MSE comparison underlying the U statistic may be ascertained using the methods just described. One must remember, of course, that the random walk is not necessarily a naive competitor, particularly for many economic and financial variables, so that values of the U statistic near one are not necessarily "bad." Several authors, including Armstrong and Fildes (1995), have advocated using the U statistic and close relatives for comparing the accuracy of various forecasting methods across series.

III. Combining Forecasts

In forecast accuracy comparison, one asks which forecast is best with respect to a particular loss function.
Regardless of whether one forecast is "best," however, the question arises as to whether competing forecasts may be fruitfully combined -- in similar fashion to the construction of an asset portfolio -- to produce a composite forecast superior to all the original forecasts. Thus, forecast combination, although obviously related to forecast accuracy comparison, is logically distinct and of independent interest.

Forecast Encompassing Tests

Forecast encompassing tests enable one to determine whether a certain forecast incorporates (or encompasses) all the relevant information in competing forecasts. The idea dates at least to Nelson (1972) and Cooper and Nelson (1975), and was formalized and extended by Chong and Hendry (1986). For simplicity, let us focus on the case of two forecasts, $\hat{y}^1_{t+k,t}$ and $\hat{y}^2_{t+k,t}$. Consider the regression

$$y_{t+k} = \beta_0 + \beta_1 \hat{y}^1_{t+k,t} + \beta_2 \hat{y}^2_{t+k,t} + \varepsilon_{t+k,t}.$$

If $(\beta_0, \beta_1, \beta_2) = (0,1,0)$, one says that model 1 forecast-encompasses model 2, and if $(\beta_0, \beta_1, \beta_2) = (0,0,1)$, then model 2 forecast-encompasses model 1. For any other $(\beta_0, \beta_1, \beta_2)$ values, neither model encompasses the other, and both forecasts contain useful information about $y_{t+k}$. Under certain conditions, the encompassing hypotheses can be tested using standard methods [footnote 11]. Moreover, although it does not yet seem to have appeared in the forecasting literature, it would be straightforward to develop exact finite-sample tests (or bounds tests when k > 1) of the hypothesis using simple generalizations of the distribution-free tests discussed earlier.

Fair and Shiller (1989, 1990) take a different but related approach based on the regression

$$(y_{t+k} - y_t) = \beta_0 + \beta_1(\hat{y}^1_{t+k,t} - y_t) + \beta_2(\hat{y}^2_{t+k,t} - y_t) + \varepsilon_{t+k,t}.$$

As before, forecast-encompassing corresponds to coefficient values of (0,1,0) or (0,0,1). Under the null of forecast encompassing, the Chong-Hendry and Fair-Shiller regressions are identical.
When the variable being forecast is integrated, however, the Fair-Shiller framework may prove more convenient, because the specification in terms of changes facilitates the use of Gaussian asymptotic distribution theory.

Footnote 11: Note that MA(k-1) serial correlation will typically be present in $\varepsilon_{t+k,t}$ if k > 1.

Forecast Combination

Failure of one model's forecasts to encompass other models' forecasts indicates that all the models examined are misspecified. It should come as no surprise that such situations are typical in practice, because all forecasting models are surely misspecified -- they are intentional abstractions of a much more complex reality.

What, then, is the role of forecast combination techniques? In a world in which information sets can be instantaneously and costlessly combined, there is no role; it is always optimal to combine information sets rather than forecasts. In the long run, the combination of information sets may sometimes be achieved by improved model specification. But in the short run -- particularly when deadlines must be met and timely forecasts produced -- pooling of information sets is typically either impossible or prohibitively costly. This simple insight motivates the pragmatic idea of forecast combination, in which forecasts rather than models are the basic object of analysis, due to an assumed inability to combine information sets. Thus, forecast combination can be viewed as a key link between the short-run, real-time forecast production process, and the longer-run, ongoing process of model development.

Many combining methods have been proposed, and they fall roughly into two groups, "variance-covariance" methods and "regression-based" methods. Let us consider first the variance-covariance method due to Bates and Granger (1969).
Suppose one has two unbiased forecasts from which a composite is formed as [footnote 12]

$$\hat{y}^c_{t+k,t} = \omega\,\hat{y}^1_{t+k,t} + (1-\omega)\,\hat{y}^2_{t+k,t}.$$

Footnote 12: The generalization to the case of M > 2 competing unbiased forecasts is straightforward, as shown in Newbold and Granger (1974).

Because the weights sum to unity, the composite forecast will necessarily be unbiased. Moreover, the combined forecast error will satisfy the same relation as the combined forecast; that is,

$$e^c_{t+k,t} = \omega\,e^1_{t+k,t} + (1-\omega)\,e^2_{t+k,t},$$

with a variance

$$\sigma_c^2 = \omega^2\sigma_{11}^2 + (1-\omega)^2\sigma_{22}^2 + 2\omega(1-\omega)\sigma_{12},$$

where $\sigma_{11}^2$ and $\sigma_{22}^2$ are unconditional forecast error variances and $\sigma_{12}$ is their covariance. The combining weight that minimizes the combined forecast error variance (and hence the combined forecast error MSE, by unbiasedness) is

$$\omega^* = \frac{\sigma_{22}^2 - \sigma_{12}}{\sigma_{11}^2 + \sigma_{22}^2 - 2\sigma_{12}}.$$

Note that the optimal weight is determined by both the underlying variances and covariances. Moreover, it is straightforward to show that, except in the case where one forecast encompasses the other, the forecast error variance from the optimal composite is less than $\min(\sigma_{11}^2, \sigma_{22}^2)$. Thus, in population, one has nothing to lose by combining forecasts and potentially much to gain.

In practice, one replaces the unknown variances and covariances that underlie the optimal combining weights with consistent estimates; that is, one estimates $\omega^*$ by replacing the population moments with their sample counterparts,

$$\hat{\sigma}_{ij} = \frac{1}{T}\sum_{t=1}^{T} e^i_{t+k,t}\,e^j_{t+k,t}, \quad i, j = 1, 2,$$

so that $\hat{\sigma}_{11}$ and $\hat{\sigma}_{22}$ estimate the variances $\sigma_{11}^2$ and $\sigma_{22}^2$, and $\hat{\sigma}_{12}$ estimates the covariance. In finite samples of the size typically available, sampling error contaminates the combining weight estimates, and the problem of sampling error is exacerbated by the collinearity that typically exists among primary forecasts. Thus, while one hopes to reduce out-of-sample forecast MSE by combining, there is no guarantee. In practice, however, it turns out that forecast combination techniques often perform very well, as documented in Clemen's (1989) review of the vast literature on forecast combination.

Now consider the "regression method" of forecast combination.
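Before turning to that method, the variance-covariance weight just described can be illustrated numerically. The following is a minimal Python sketch, assuming two lists of forecast errors from unbiased forecasts and using uncentered sample second moments as the consistent estimates; the function names and the error series are hypothetical.

```python
def sample_moment(e_a, e_b):
    """Uncentered sample second moment (1/T) * sum of e_a[t] * e_b[t]."""
    return sum(x * y for x, y in zip(e_a, e_b)) / len(e_a)

def variance_covariance_weight(e1, e2):
    """Estimated weight on forecast 1 that minimizes the combined error variance."""
    s11 = sample_moment(e1, e1)
    s22 = sample_moment(e2, e2)
    s12 = sample_moment(e1, e2)
    return (s22 - s12) / (s11 + s22 - 2 * s12)

def combined_error_variance(w, s11, s22, s12):
    """Variance of w*e1 + (1-w)*e2 given the two variances and the covariance."""
    return w ** 2 * s11 + (1 - w) ** 2 * s22 + 2 * w * (1 - w) * s12

# Hypothetical forecast errors: forecast 1 is more accurate, errors are uncorrelated
e1 = [1.0, -1.0, 1.0, -1.0]
e2 = [2.0, 2.0, -2.0, -2.0]

w = variance_covariance_weight(e1, e2)
v = combined_error_variance(w, sample_moment(e1, e1), sample_moment(e2, e2),
                            sample_moment(e1, e2))
print(w, v)
```

In this example the more accurate forecast receives the larger weight, and the composite error variance falls below the smaller of the two individual variances. The weights are estimated in-sample, so the caveat about sampling error applies.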
The form of the Chong-Hendry and Fair-Shiller encompassing regressions immediately suggests combining forecasts by simply regressing realizations on forecasts. Granger and Ramanathan (1984) showed that the optimal variance-covariance combining weight vector has a regression interpretation as the coefficient vector of a linear projection of $y_{t+k}$ onto the forecasts, subject to two constraints: the weights sum to unity, and no intercept is included. In practice, of course, one simply runs the regression on available data.

In general, the regression method is simple and flexible. There are many variations and extensions, because any "regression tool" is potentially applicable. The key is to use generalizations with sound motivation. We shall give four examples: time-varying combining weights, dynamic combining regressions, Bayesian shrinkage of combining weights toward equality, and nonlinear combining regressions.

a. Time-Varying Combining Weights

Time-varying combining weights were proposed in the variance-covariance context by Granger and Newbold (1973) and in the regression context by Diebold and Pauly (1987). In the regression framework, for example, one may undertake weighted or rolling estimation of combining regressions, or one may estimate combining regressions with explicitly time-varying parameters.

The potential desirability of time-varying weights stems from a number of sources. First, different learning speeds may lead to a particular forecast improving over time relative to others. In such situations, one naturally wants to weight the improving forecast progressively more heavily. Second, the design of various forecasting models may make them relatively better forecasting tools in some situations than in others. For example, a structural model with a highly developed wage-price sector may substantially outperform a simpler model during times of high inflation. In such times, the more sophisticated model should receive higher weight.
Third, the parameters in agents' decision rules may drift over time, and certain forecasting techniques may be relatively more vulnerable to such drift.

b. Dynamic Combining Regressions

Serially correlated errors arise naturally in combining regressions. Diebold (1988) considers the covariance stationary case and argues that serial correlation is likely to appear in unrestricted regression-based forecast combining regressions when $\beta_1 + \beta_2 \neq 1$. More generally, it may be a good idea to allow for serial correlation in combining regressions to capture any dynamics in the variable to be forecast not captured by the various forecasts. In that regard, Coulson and Robins (1993), following Hendry and Mizon (1978), point out that a combining regression with serially correlated disturbances is a special case of a combining regression that includes lagged dependent variables and lagged forecasts, which they advocate.

c. Bayesian Shrinkage of Combining Weights Toward Equality

Simple arithmetic averages of forecasts are often found to perform very well, even relative to "optimal" composites [footnote 13]. Obviously, the imposition of an equal weights constraint eliminates variation in the estimated weights at the cost of possibly introducing bias. However, the evidence indicates that, under quadratic loss, the benefits of imposing equal weights often exceed this cost. With this in mind, Clemen and Winkler (1986) and Diebold and Pauly (1990) propose Bayesian shrinkage techniques to allow for the incorporation of varying degrees of prior information in the estimation of combining weights; least-squares weights and the prior weights then emerge as polar cases for the posterior-mean combining weights. The actual posterior mean combining weights are a matrix weighted average of those for the two polar cases.

Footnote 13: See Winkler and Makridakis (1983), Clemen (1989), and many of the references therein.
For example, using a natural conjugate normal-gamma prior, the posterior-mean combining weight vector is

$$\bar{\beta}^{posterior} = (Q + F'F)^{-1}(Q\beta^{prior} + F'F\hat{\beta}),$$

where $\beta^{prior}$ is the prior mean vector, Q is the prior precision matrix, F is the design matrix for the combining regression, and $\hat{\beta}$ is the vector of least squares combining weights. The obvious shrinkage direction is toward a measure of central tendency (e.g., the arithmetic mean). In this way, the combining weights are coaxed toward the arithmetic mean, but the data are still allowed to speak, when (and if) they have something to say.

d. Nonlinear Combining Regressions

There is no reason, of course, to force combining regressions to be linear, and various of the usual alternatives may be entertained. One particularly interesting possibility is proposed by Deutsch, Granger and Teräsvirta (1994), who suggest

$$\hat{y}^c_{t+k,t} = I(s_t = 1)(\beta_{11}\hat{y}^1_{t+k,t} + \beta_{12}\hat{y}^2_{t+k,t}) + I(s_t = 2)(\beta_{21}\hat{y}^1_{t+k,t} + \beta_{22}\hat{y}^2_{t+k,t}).$$

The states that govern the combining weights can depend on past forecast errors from one or both models or on various economic variables. Furthermore, the indicator weight need not be simply a binary variable; the transition between states can be made more gradual by allowing weights to be functions of the forecast errors or economic variables.

IV. Special Topics in Evaluating Economic and Financial Forecasts

Evaluating Direction-of-Change Forecasts

Direction-of-change forecasts are often used in financial and economic decision-making (e.g., Leitch and Tanner, 1991, 1995; Satchell and Timmermann, 1992). The question of how to evaluate such forecasts immediately arises. Our earlier results on tests for forecast accuracy comparison remain valid, appropriately modified, so we shall not restate them here. Instead, we note that one frequently sees assessments of whether direction-of-change forecasts "have value," and we shall discuss that issue.
The question as to whether a direction-of-change forecast has value by necessity involves comparison to a naive benchmark -- the direction-of-change forecast is compared to a "naive" coin flip (with success probability equal to the relevant marginal). Consider a 2x2 contingency table. For ease of notation, call the two states into which forecasts and realizations fall "i" and "j". Commonly, for example, i = "up" and j = "down." Figures 1 and 2 make clear our notation regarding observed cell counts and unobserved cell probabilities. The null hypothesis that a direction-of-change forecast has no value is that the forecasts and realizations are independent, in which case $P_{ij} = P_{i\cdot}P_{\cdot j}$, $\forall\, i, j$. As always, one proceeds under the null. The true cell probabilities are of course unknown, so one uses the consistent estimates $\hat{P}_{i\cdot} = O_{i\cdot}/O$ and $\hat{P}_{\cdot j} = O_{\cdot j}/O$. Then one consistently estimates the expected cell counts under the null, $E_{ij} = P_{i\cdot}P_{\cdot j}O$, by $\hat{E}_{ij} = \hat{P}_{i\cdot}\hat{P}_{\cdot j}O = O_{i\cdot}O_{\cdot j}/O$. Finally, one constructs the statistic

$$C = \sum_{i,j=1}^{2} \frac{(O_{ij} - \hat{E}_{ij})^2}{\hat{E}_{ij}}.$$

Under the null, $C \xrightarrow{d} \chi^2_1$.

An intimately-related test of forecast value was proposed by Merton (1981) and Henriksson and Merton (1981), who assert that a forecast has value if $P_{ii}/P_{i\cdot} + P_{jj}/P_{j\cdot} > 1$. They therefore develop an exact test of the null hypothesis that $P_{ii}/P_{i\cdot} + P_{jj}/P_{j\cdot} = 1$ against the inequality alternative. A key insight, noted in varying degrees by Schnader and Stekler (1990) and Stekler (1994), and formalized by Pesaran and Timmermann (1992), is that the Henriksson-Merton null is equivalent to the contingency-table null if the marginal probabilities are fixed at the observed relative frequencies, $O_{i\cdot}/O$ and $O_{\cdot j}/O$. The same unpalatable assumption is necessary for deriving the exact finite-sample distribution of the Henriksson-Merton test statistic.
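The contingency-table statistic C described above is simple to compute. The sketch below is a minimal Python illustration, assuming a square table of observed counts with rows indexing forecasts and columns indexing realizations (the orientation does not affect C); the function names and the counts are hypothetical. For the 2x2 case, the asymptotic p-value uses the identity that a chi-squared variate with one degree of freedom is the square of a standard normal.

```python
import math

def contingency_statistic(table):
    """Chi-squared statistic comparing observed counts with expected counts under independence."""
    n = len(table)
    total = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(n)) for j in range(n)]
    c = 0.0
    for i in range(n):
        for j in range(n):
            expected = row_totals[i] * col_totals[j] / total
            c += (table[i][j] - expected) ** 2 / expected
    return c

def chi2_one_df_pvalue(c):
    """Upper-tail probability of a chi-squared(1) variate: P(Z^2 > c) = erfc(sqrt(c / 2))."""
    return math.erfc(math.sqrt(c / 2))

# Hypothetical counts: rows = forecast (up, down), columns = realization (up, down)
observed = [[30, 10],
            [10, 30]]
C = contingency_statistic(observed)
print(C, chi2_one_df_pvalue(C))
```

A table whose cells all equal their expected counts, such as `[[20, 20], [20, 20]]`, yields C = 0 and a p-value of one.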
Asymptotically, however, all is well; the square of the Henriksson-Merton statistic, appropriately normalized, is asymptotically equivalent to C, the chi-squared contingency table statistic.

Moreover, the 2x2 contingency table test generalizes trivially to the NxN case, with

$$C_N = \sum_{i,j=1}^{N} \frac{(O_{ij} - \hat{E}_{ij})^2}{\hat{E}_{ij}}.$$

Under the null, $C_N \xrightarrow{d} \chi^2_{(N-1)(N-1)}$. A subtle point arises, however, as pointed out by Pesaran and Timmermann (1992). In the 2x2 case, one must base the test on the entire table, as the off-diagonal elements are determined by the diagonal elements, because the two elements of each row must sum to one. In the NxN case, in contrast, there is more latitude as to which cells to examine, and for purposes of forecast evaluation, it may be desirable to focus only on the diagonal cells.

In closing this section, we note that although the contingency table tests are often of interest in the direction-of-change context (for the same reason that tests based on Theil's U-statistic are often of interest in more standard contexts), forecast "value" in that sense is neither a necessary nor sufficient condition for forecast value in terms of a profitable trading strategy yielding significant excess returns. For example, one might beat the marginal forecast but still earn no excess returns after adjusting for transactions costs. Alternatively, one might do worse than the marginal but still make huge profits if the "hits" are "big," a point stressed by Cumby and Modest (1987).

Evaluating Probability Forecasts

Oftentimes economic and financial forecasts are issued as probabilities, such as the probability that a business cycle turning point will occur in the next year, the probability that a corporation will default on a particular bond issue this year, or the probability that the return on the S&P 500 stock index will be more than ten percent this year. A number of specialized considerations arise in the evaluation of probability forecasts, to which we now turn.
Let $P_{t+k,t}$ be a probability forecast made at time t for an event at time t+k, and let $R_{t+k} = 1$ if the event occurs and zero otherwise. $P_{t+k,t}$ is a scalar if there are only two possible events. More generally, if there are N possible events, then $P_{t+k,t}$ is an (N-1)x1 vector [footnote 14]. For notational economy, we shall focus on scalar probability forecasts.

Footnote 14: The probability forecast assigned to the Nth event is implicitly determined by the restriction that the probabilities sum to 1.

Accuracy measures for probability forecasts are commonly called "scores," and the most common is Brier's (1950) quadratic probability score, also called the Brier score,

$$QPS = \frac{1}{T}\sum_{t=1}^{T} 2(P_{t+k,t} - R_{t+k})^2.$$

Clearly, $QPS \in [0,2]$, and it has a negative orientation (smaller values indicate more accurate forecasts) [footnote 15]. To understand the QPS, note that the accuracy of any forecast refers to the expected loss when using that forecast, and typically loss depends on the deviation between forecasts and realizations. It seems reasonable, then, in the context of probability forecasting under quadratic loss, to track the average squared divergence between $P_{t+k,t}$ and $R_{t+k}$, which is what the QPS does. Thus, the QPS is a rough probability-forecast analog of MSE.

The QPS is only a rough analog of MSE, however, because $P_{t+k,t}$ is in fact not a forecast of the outcome (which is 0-1), but rather a probability assigned to it. A more natural and direct way to evaluate probability forecasts is simply to compare the forecasted probabilities to observed relative frequencies -- that is, to assess calibration. An overall measure of calibration is the global squared bias,

$$GSB = 2(\bar{P} - \bar{R})^2,$$

where $\bar{P} = \frac{1}{T}\sum_{t=1}^{T} P_{t+k,t}$ and $\bar{R} = \frac{1}{T}\sum_{t=1}^{T} R_{t+k}$. $GSB \in [0,2]$ with a negative orientation.

Calibration may also be examined locally in any subset of the unit interval. For example, one might check whether the observed relative frequency corresponding to probability forecasts between .6 and .7 is also between .6 and .7.
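As a concrete illustration of the two measures just described, the following minimal Python sketch computes the quadratic probability score and the global squared bias for scalar probability forecasts. The function names and the example forecasts are hypothetical.

```python
def qps(p, r):
    """Quadratic probability score: average of 2 * (P - R)^2 over the sample."""
    return sum(2 * (pt - rt) ** 2 for pt, rt in zip(p, r)) / len(p)

def gsb(p, r):
    """Global squared bias: 2 * (mean forecast - mean realization)^2."""
    p_bar = sum(p) / len(p)
    r_bar = sum(r) / len(r)
    return 2 * (p_bar - r_bar) ** 2

# Hypothetical probability forecasts and 0-1 realizations
p = [0.9, 0.1, 0.8, 0.3]
r = [1, 0, 1, 0]
print(qps(p, r), gsb(p, r))

# The bounds of the score: perfect forecasts give 0, perfectly wrong forecasts give 2
print(qps([1.0, 0.0], [1, 0]), qps([1.0, 0.0], [0, 1]))
```

Both measures have a negative orientation, so smaller values are better. A small GSB by itself does not imply an accurate forecast, since offsetting local miscalibration can leave the two sample means close.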
One may go farther to form a weighted average of local calibration across all cells of a partition of the unit interval into J subsets chosen according to the user's interest and the specifics of the situation [footnote 16]. This leads to the local squared bias measure,

$$LSB = \frac{1}{T}\sum_{j=1}^{J} 2T_j(\bar{P}_j - \bar{R}_j)^2,$$

where $T_j$ is the number of probability forecasts in set j, $\bar{P}_j$ is the average forecast in set j, and $\bar{R}_j$ is the average realization in set j, j = 1, ..., J. Note that $LSB \in [0,2]$, and LSB = 0 implies that GSB = 0, but not conversely.

Footnote 15: The "2" that appears in the QPS formula is an artifact from the full vector case. We could of course drop it without affecting the QPS rankings of competing forecasts, but we leave it to maintain comparability to other literature.

Footnote 16: For example, Diebold and Rudebusch (1989) split the unit interval into ten equal parts.

Testing for adequate calibration is a straightforward matter, at least under independence of the realizations. For a given event and a corresponding sequence of forecasted probabilities $\{P_{t+k,t}\}_{t=1}^{T}$, create J mutually exclusive and collectively exhaustive subsets of forecasts, and denote the midpoint of each range $\pi_j$, j = 1, ..., J. Let $R_j$ denote the number of observed events when the forecast was in set j, and define "range j" calibration statistics,

$$Z_j = \frac{R_j - T_j\pi_j}{(T_j\pi_j(1-\pi_j))^{1/2}} \equiv \frac{R_j - e_j}{w_j^{1/2}}, \quad j = 1, ..., J,$$

and an overall calibration statistic,

$$Z_0 = \frac{R_+ - e_+}{w_+^{1/2}},$$

where $R_+ = \sum_{j=1}^{J} R_j$, $e_+ = \sum_{j=1}^{J} T_j\pi_j$, and $w_+ = \sum_{j=1}^{J} T_j\pi_j(1-\pi_j)$. $Z_0$ is a joint test of adequate local calibration across all cells, while the $Z_j$ statistics test cell-by-cell local calibration [footnote 17]. Under independence, the binomial structure would obviously imply that $Z_0 \overset{a}{\sim} N(0,1)$, and $Z_j \overset{a}{\sim} N(0,1)$, $\forall\, j = 1, ..., J$.
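The local calibration measures above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: cells are half-open intervals defined by a list `edges` (with the right endpoint included in the last cell), empty cells are skipped, and all names and data are hypothetical.

```python
import math

def cell_summaries(p, r, edges):
    """For each non-empty cell: count T_j, mean forecast, number of events R_j, midpoint pi_j."""
    J = len(edges) - 1
    cells = [[] for _ in range(J)]
    for pt, rt in zip(p, r):
        # first cell whose upper edge exceeds the forecast; the last cell also takes pt == edges[-1]
        j = next(j for j in range(J) if pt < edges[j + 1] or j == J - 1)
        cells[j].append((pt, rt))
    out = []
    for j, cell in enumerate(cells):
        if cell:
            out.append({"T": len(cell),
                        "P_bar": sum(pt for pt, _ in cell) / len(cell),
                        "R": sum(rt for _, rt in cell),
                        "pi": (edges[j] + edges[j + 1]) / 2})
    return out

def lsb(p, r, edges):
    """Local squared bias: (1/T) * sum over cells of 2 * T_j * (P_bar_j - R_bar_j)^2."""
    return sum(2 * c["T"] * (c["P_bar"] - c["R"] / c["T"]) ** 2
               for c in cell_summaries(p, r, edges)) / len(p)

def calibration_z(p, r, edges):
    """Cell-by-cell statistics Z_j and the overall statistic Z_0."""
    cells = cell_summaries(p, r, edges)
    z_cells = [(c["R"] - c["T"] * c["pi"]) / math.sqrt(c["T"] * c["pi"] * (1 - c["pi"]))
               for c in cells]
    r_plus = sum(c["R"] for c in cells)
    e_plus = sum(c["T"] * c["pi"] for c in cells)
    w_plus = sum(c["T"] * c["pi"] * (1 - c["pi"]) for c in cells)
    return z_cells, (r_plus - e_plus) / math.sqrt(w_plus)

# Hypothetical forecasts, with one well-calibrated and one badly calibrated set of outcomes
p = [0.2, 0.3, 0.2, 0.3, 0.8, 0.7, 0.8, 0.7]
r_good = [0, 0, 0, 1, 1, 1, 1, 0]
r_bad = [1, 1, 1, 1, 1, 1, 1, 1]
edges = [0.0, 0.5, 1.0]
print(lsb(p, r_good, edges), calibration_z(p, r_good, edges))
print(lsb(p, r_bad, edges), calibration_z(p, r_bad, edges))
```

The normal approximation behind the Z statistics is asymptotic, so with cells as small as those in this toy example the statistics are indicative only.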
In a fascinating development, Seillier-Moiseiwitsch and Dawid (1993) show that the asymptotic normality holds much more generally, including in the dependent situations of practical relevance.

Footnote 17: One may of course test for adequate global calibration by using a trivial partition of the unit interval -- the unit interval itself.

One additional feature of probability forecasts (or more precisely, of the corresponding realizations), called resolution, is of interest:

$$RES = \frac{1}{T}\sum_{j=1}^{J} 2T_j(\bar{R}_j - \bar{R})^2.$$

RES is simply the weighted average squared divergence between $\bar{R}$ and the $\bar{R}_j$'s, a measure of how much the observed relative frequencies move across cells. $RES \geq 0$ and has a positive orientation. As shown by Murphy (1973), an informative decomposition of QPS exists,

$$QPS = QPS_{\bar{R}} + LSB - RES,$$

where $QPS_{\bar{R}}$ is the QPS evaluated at $P_{t+k,t} = \bar{R}$. This decomposition highlights the tradeoffs between the various attributes of probability forecasts.

Just as with Theil's U-statistic for "standard" forecasts, it is sometimes informative to compare the performance of a particular probability forecast to that of a benchmark. Murphy (1974), for example, proposes the statistic

$$M = QPS - QPS_{\bar{R}} = LSB - RES,$$

which measures the difference in accuracy between the forecast at hand and the benchmark forecast $\bar{R}$. Using the earlier-discussed Diebold-Mariano approach, one can also assess the significance of differences in QPS and $QPS_{\bar{R}}$, differences in QPS or various other measures of probability forecast accuracy across forecasters, or differences in local or global calibration across forecasters.

Evaluating Volatility Forecasts

Many interesting questions in finance, such as options pricing, risk hedging and portfolio management, explicitly depend upon the variances of asset prices. Thus, a variety of methods have been proposed for generating volatility forecasts.
As opposed to point or probability forecasts, evaluation of volatility forecasts is complicated by the fact that actual conditional variances are unobservable. A standard "solution" to this unobservability problem is to use the squared realization $\varepsilon^2_{t+k}$ as a proxy for the true conditional variance $h_{t+k}$, because $E[\varepsilon^2_{t+k} \mid \Omega_{t+k-1}] = E[h_{t+k}v^2_{t+k} \mid \Omega_{t+k-1}] = h_{t+k}$, where $v_{t+k} \sim WN(0,1)$ [footnote 18]. Thus, for example,

$$MSE = \frac{1}{T}\sum_{t=1}^{T}(\varepsilon^2_{t+k} - \hat{h}_{t+k,t})^2.$$

Footnote 18: Although $\varepsilon^2_{t+k}$ is an unbiased estimator of $h_{t+k}$, it is an imprecise or "noisy" estimator. For example, if $v_{t+k} \sim N(0,1)$, $\varepsilon^2_{t+k} = h_{t+k}v^2_{t+k}$ has a conditional mean of $h_{t+k}$ because $v^2_{t+k} \sim \chi^2_1$. Yet, because the median of a $\chi^2_1$ distribution is 0.455, $\varepsilon^2_{t+k} < \frac{1}{2}h_{t+k}$ more than fifty percent of the time.

Although MSE is often used to measure volatility forecast accuracy, Bollerslev, Engle and Nelson (1994) point out that MSE is inappropriate, because it penalizes positive volatility forecasts and negative volatility forecasts (which are meaningless) symmetrically. Two alternative loss functions that penalize volatility forecasts asymmetrically are the logarithmic loss function employed in Pagan and Schwert (1990),

$$LL = \frac{1}{T}\sum_{t=1}^{T}\big[\ln(\varepsilon^2_{t+k}) - \ln(\hat{h}_{t+k,t})\big]^2,$$

and the heteroskedasticity-adjusted MSE of Bollerslev and Ghysels (1994),

$$HMSE = \frac{1}{T}\sum_{t=1}^{T}\left[\frac{\varepsilon^2_{t+k}}{\hat{h}_{t+k,t}} - 1\right]^2.$$

Bollerslev, Engle and Nelson (1994) suggest the loss function implicit in the Gaussian quasi-maximum likelihood function often used in fitting volatility models; that is,

$$GMLE = \frac{1}{T}\sum_{t=1}^{T}\left[\ln(\hat{h}_{t+k,t}) + \frac{\varepsilon^2_{t+k}}{\hat{h}_{t+k,t}}\right].$$

As with all forecast evaluations, the volatility forecast evaluations of most interest to forecast users are those conducted under the relevant loss function. West, Edison and Cho (1993) and Engle et al. (1993) make important contributions along those lines, proposing economic loss functions based on utility maximization and profit maximization, respectively. Lopez (1995) proposes a framework for volatility forecast evaluation that allows for a variety of economic loss functions. The framework is based on transforming volatility forecasts into probability forecasts by integrating over the assumed or estimated distribution of $\varepsilon_t$.
By selecting the range of integration corresponding to an event of interest, a forecast user can incorporate elements of her loss function into the probability forecasts. For example, given $\varepsilon_{t+k} \mid \Omega_t \sim D(0, h_{t+k,t})$ and a volatility forecast $\hat{h}_{t+k,t}$, an options trader interested in the event $\varepsilon_{t+k} \in [L_{\varepsilon,t+k}, U_{\varepsilon,t+k}]$ would generate the probability forecast

$$P_{t+k,t} = \Pr(L_{\varepsilon,t+k} < \varepsilon_{t+k} < U_{\varepsilon,t+k}) = \int_{l_{\varepsilon,t+k}}^{u_{\varepsilon,t+k}} f(z_{t+k})\,dz_{t+k},$$

where $z_{t+k}$ is the standardized innovation, $f(z_{t+k})$ is the functional form of D(0,1), and $[l_{\varepsilon,t+k}, u_{\varepsilon,t+k}]$ is the standardized range of integration. In contrast, a forecast user interested in the behavior of the underlying asset, $y_{t+k} = \mu_{t+k,t} + \varepsilon_{t+k}$ where $\mu_{t+k,t} = E[y_{t+k} \mid \Omega_t]$, might generate the probability forecast

$$P_{t+k,t} = \Pr(L_{y,t+k} < y_{t+k} < U_{y,t+k}) = \int_{l_{y,t+k}}^{u_{y,t+k}} f(z_{t+k})\,dz_{t+k},$$

where $\hat{\mu}_{t+k,t}$ is the forecasted conditional mean used in the standardization and $[l_{y,t+k}, u_{y,t+k}]$ is the standardized range of integration.

Once generated, these probability forecasts can be evaluated using the scoring rules described above, and the significance of differences across models can be tested using the Diebold-Mariano tests. The key advantage of this framework is that it allows the evaluation to be based on observable events and thus avoids proxying for the unobservable true variance.

The Lopez approach to volatility forecast evaluation is based on time-varying probabilities assigned to a fixed interval. Alternatively, one may fix the probabilities and vary the widths of the intervals, as in traditional confidence interval construction. In that regard, Christoffersen (1995) suggests exploiting the fact that if a $(1-\alpha)$% confidence interval (denoted $[L_{y,t+1}, U_{y,t+1}]$) is correctly calibrated, then the "hits" are iid Bernoulli($1-\alpha$). That is, if one defines

$$I_{t+1,t} = \begin{cases} 1, & \text{if } y_{t+1} \in [L_{y,t+1}, U_{y,t+1}] \\ 0, & \text{otherwise,} \end{cases}$$

then $I_{t+1,t}$ is one with probability $(1-\alpha)$ and zero with probability $\alpha$. Given the T values of the indicator variable for the T forecast intervals, one can determine whether the forecasted intervals are well calibrated by testing the hypothesis that the indicator variable is an iid Bernoulli($1-\alpha$) random variable.
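Both constructions in this passage can be sketched briefly in Python. The sketch below is illustrative only: it assumes that the standardized distribution D(0,1) is Gaussian (the framework allows other choices), that the range of integration is standardized by subtracting the forecasted conditional mean and dividing by the square root of the volatility forecast, and that the function names and numbers are hypothetical.

```python
import math

def std_normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def event_probability(lower, upper, mean_forecast, variance_forecast):
    """Probability forecast for the event lower < y < upper under an assumed Gaussian D."""
    sd = math.sqrt(variance_forecast)
    l = (lower - mean_forecast) / sd   # standardized lower limit of integration
    u = (upper - mean_forecast) / sd   # standardized upper limit of integration
    return std_normal_cdf(u) - std_normal_cdf(l)

def hit_sequence(y, lower, upper):
    """Indicator series: 1 if the realization falls inside its forecast interval, else 0."""
    return [1 if lo <= yt <= up else 0 for yt, lo, up in zip(y, lower, upper)]

# Fixed interval, probability implied by a variance forecast of 4 (standard deviation 2)
prob = event_probability(-3.92, 3.92, 0.0, 4.0)

# Fixed-coverage intervals, checked against hypothetical realizations
y = [0.1, 2.5, -0.3, -3.0]
hits = hit_sequence(y, [-2.0] * 4, [2.0] * 4)
coverage = sum(hits) / len(hits)
print(prob, hits, coverage)
```

The probability forecasts from `event_probability` can be scored against the corresponding 0-1 realizations, and the sequence from `hit_sequence` is the indicator series on which the Bernoulli hypothesis is tested.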
The iid property can be checked using the group test of David (1947), which is uniformly most powerful against first-order dependence. Define a group as a string of consecutive zeros or ones, and let k be the number of groups in the sequence $\{I_{t+1,t}\}$. Under the null that the sequence is iid, the distribution of k given the total number of ones, $n_1$, and the total number of zeros, $n_0$, is

$$P(k \mid n_0, n_1) = \frac{f_k}{\binom{n}{n_0}}, \quad \text{for } k \geq 2,$$

where $n = n_0 + n_1$, and

$$f_{2s} = 2\binom{n_0-1}{s-1}\binom{n_1-1}{s-1} \quad \text{for } k = 2s \text{ even},$$

$$f_{2s+1} = \binom{n_0-1}{s}\binom{n_1-1}{s-1} + \binom{n_0-1}{s-1}\binom{n_1-1}{s} \quad \text{for } k = 2s+1 \text{ odd}.$$

A likelihood ratio test of the Bernoulli hypothesis (that is, a joint test of iid behavior and correct coverage) is readily constructed by comparing the maximized log likelihoods of restricted and unrestricted Markov processes for the indicator series $\{I_{t+1,t}\}$. With rows indexing the current state and columns the next state (zero first, then one), the unrestricted transition probability matrix is

$$\Pi = \begin{bmatrix} \pi_{00} & 1-\pi_{00} \\ 1-\pi_{11} & \pi_{11} \end{bmatrix},$$

and the restricted transition probability matrix, under which the indicator is one with probability $(1-\alpha)$ regardless of the current state, is

$$\Pi_R = \begin{bmatrix} \alpha & 1-\alpha \\ \alpha & 1-\alpha \end{bmatrix}.$$

The corresponding approximate likelihood functions are [footnote 19]

$$L(\Pi \mid I) = \pi_{00}^{n_{00}}(1-\pi_{00})^{n_{01}}(1-\pi_{11})^{n_{10}}\pi_{11}^{n_{11}}$$

and

$$L(\Pi_R \mid I) = \alpha^{(n_{00}+n_{10})}(1-\alpha)^{(n_{01}+n_{11})},$$

where $n_{ij}$ is the number of observed transitions from i to j and I is the indicator sequence. The likelihood ratio statistic is

$$LR = 2\big[\ln L(\hat{\Pi} \mid I) - \ln L(\hat{\Pi}_R \mid I)\big],$$

where $\hat{\Pi}$ and $\hat{\Pi}_R$ are the maximum-likelihood estimates. Under the null hypothesis, $LR \sim \chi^2_2$.

Footnote 19: The likelihoods are approximate because the initial terms are dropped.

V. Concluding Remarks

Three modern themes permeate this survey, so it is worth highlighting them explicitly. The first theme is that various types of forecasts, such as probability forecasts and volatility forecasts, are becoming more integrated into economic and financial decision making, leading to a derived demand for new types of forecast evaluation procedures.

The second theme is the use of exact finite-sample hypothesis tests, typically based on distribution-free nonparametrics.
We explicitly sketched such tests in the context of forecast-error unbiasedness, k-dependence, orthogonality to available information, and, when more than one forecast is available, in the context of testing equality of expected loss, testing whether a direction-of-change forecast has value, etc.

The third theme is use of the relevant loss function. This idea arose in many places, such as in forecastability measures and forecast accuracy comparison tests, and may readily be introduced in others, such as orthogonality tests, encompassing tests and combining regressions. In fact, an integrated tool kit for estimation, forecasting, and forecast evaluation (and hence model selection and nonnested hypothesis testing) under the relevant loss function is rapidly becoming available; see Weiss and Andersen (1984), Weiss (1995), Diebold and Mariano (1995), Christoffersen and Diebold (1994), and Diebold, Ohanian and Berkowitz (1995).

References

Armstrong, J.S. and Fildes, R., 1995. "On the Selection of Error Measures for Comparisons Among Forecasting Methods," Journal of Forecasting, 14, 67-71.
Auerbach, A., 1994. "The U.S. Fiscal Problem: Where We Are, How We Got Here and Where We're Going," NBER Macro Annual. Cambridge, Mass.: MIT Press.
Bates, J.M. and Granger, C.W.J., 1969. "The Combination of Forecasts," Operational Research Quarterly, 20, 451-468.
Bollerslev, T., Engle, R.F. and Nelson, D.B., 1994. "ARCH Models," in R.F. Engle and D. McFadden (eds.), Handbook of Econometrics, Volume 4. Amsterdam: North-Holland.
Bollerslev, T. and Ghysels, E., 1994. "Periodic Autoregressive Conditional Heteroskedasticity," Working Paper #178, Department of Finance, Kellogg School, Northwestern University.
Bonham, C. and Cohen, R., 1995. "Testing the Rationality of Price Forecasts: Comment," American Economic Review, 85, 284-289.
Bradley, J.V., 1968. Distribution-Free Statistical Tests. Englewood Cliffs, New Jersey: Prentice-Hall.
Brier, G.W., 1950. "Verification of Forecasts Expressed in Terms of Probability," Monthly Weather Review, 75, 1-3.
Brown, B.W. and Maital, S., 1981. "What Do Economists Know? An Empirical Study of Experts' Expectations," Econometrica, 49, 491-504.
Campbell, B. and Dufour, J.-M., 1991. "Over-Rejections in Rational Expectations Models: A Nonparametric Approach to the Mankiw-Shapiro Problem," Economics Letters, 35, 285-290.
Campbell, B. and Dufour, J.-M., 1995. "Exact Nonparametric Orthogonality and Random Walk Tests," Review of Economics and Statistics, 77, 1-16.
Campbell, B. and Ghysels, E., 1995. "Federal Budget Projections: A Nonparametric Assessment of Bias and Efficiency," Review of Economics and Statistics, 77, 17-31.
Campbell, J.Y. and Mankiw, N.G., 1987. "Are Output Fluctuations Transitory?," Quarterly Journal of Economics, 102, 857-880.
Chong, Y.Y. and Hendry, D.F., 1986. "Econometric Evaluation of Linear Macroeconomic Models," Review of Economic Studies, 53, 671-690.
Christoffersen, P.F., 1995. "Predicting Uncertainty in the Foreign Exchange Markets," Manuscript, Department of Economics, University of Pennsylvania.
Christoffersen, P.F. and Diebold, F.X., 1994. "Optimal Prediction under Asymmetric Loss," Technical Working Paper #167, National Bureau of Economic Research, Cambridge, Mass.
Clemen, R.T., 1989. "Combining Forecasts: A Review and Annotated Bibliography," International Journal of Forecasting, 5, 559-581.
Clemen, R.T. and Winkler, R.L., 1986. "Combining Economic Forecasts," Journal of Business and Economic Statistics, 4, 39-46.
Clements, M.P. and Hendry, D.F., 1993. "On the Limitations of Comparing Mean Squared Forecast Errors," Journal of Forecasting, 12, 617-638.
Cochrane, J.H., 1988. "How Big is the Random Walk in GNP?," Journal of Political Economy, 96, 893-920.
Cooper, D.M. and Nelson, C.R., 1975. "The Ex-Ante Prediction Performance of the St. Louis and F.R.B.-M.I.T.-Penn Econometric Models and Some Results on Composite Predictors," Journal of Money, Credit and Banking, 7, 1-32.
Coulson, N.E. and Robins, R.P., 1993. "Forecast Combination in a Dynamic Setting," Journal of Forecasting, 12, 63-67.
Cumby, R.E. and Huizinga, J., 1992. "Testing the Autocorrelation Structure of Disturbances in Ordinary Least Squares and Instrumental Variables Regressions," Econometrica, 60, 185-195.
Cumby, R.E. and Modest, D.M., 1987. "Testing for Market Timing Ability: A Framework for Forecast Evaluation," Journal of Financial Economics, 19, 169-189.
David, F.N., 1947. "A Power Function for Tests of Randomness in a Sequence of Alternatives," Biometrika, 34, 335-339.
Deutsch, M., Granger, C.W.J. and Teräsvirta, T., 1994. "The Combination of Forecasts Using Changing Weights," International Journal of Forecasting, 10, 47-57.
Diebold, F.X., 1988. "Serial Correlation and the Combination of Forecasts," Journal of Business and Economic Statistics, 6, 105-111.
Diebold, F.X., 1993. "On the Limitations of Comparing Mean Square Forecast Errors: Comment," Journal of Forecasting, 12, 641-642.
Diebold, F.X. and Lindner, P., 1995. "Fractional Integration and Interval Prediction," Manuscript, Department of Economics, University of Pennsylvania.
Diebold, F.X. and Mariano, R., 1995. "Comparing Predictive Accuracy," Journal of Business and Economic Statistics, forthcoming.
Diebold, F.X., Ohanian, L. and Berkowitz, J., 1995. "Dynamic Equilibrium Economies: A Framework for Comparing Models and Data," Technical Working Paper No. 174, National Bureau of Economic Research, Cambridge, Mass.
Diebold, F.X. and Pauly, P., 1987. "Structural Change and the Combination of Forecasts," Journal of Forecasting, 6, 21-40.
Diebold, F.X. and Pauly, P., 1990. "The Use of Prior Information in Forecast Combination," International Journal of Forecasting, 6, 503-508.
Diebold, F.X. and Rudebusch, G.D., 1989. "Scoring the Leading Indicators," Journal of Business, 62, 369-391.
Dufour, J.-M., 1981. "Rank Tests for Serial Dependence," Journal of Time Series Analysis, 2, 117-128.
Engle, R.F., Hong, C.-H., Kane, A. and Noh, J., 1993. "Arbitrage Valuation of Variance Forecasts with Simulated Options," in D. Chance and R. Tripp (eds.), Advances in Futures and Options Research. Greenwich, CT: JAI Press.
Engle, R.F. and Kozicki, S., 1993. "Testing for Common Features," Journal of Business and Economic Statistics, 11, 369-395.
Fair, R.C. and Shiller, R.J., 1989. "The Informational Content of Ex Ante Forecasts," Review of Economics and Statistics, 71, 325-331.
Fair, R.C. and Shiller, R.J., 1990. "Comparing Information in Forecasts from Econometric Models," American Economic Review, 80, 375-389.
Fama, E.F., 1970. "Efficient Capital Markets: A Review of Theory and Empirical Work," Journal of Finance, 25, 383-417.
Fama, E.F., 1975. "Short-Term Interest Rates as Predictors of Inflation," American Economic Review, 65, 269-282.
Fama, E.F., 1991. "Efficient Markets II," Journal of Finance, 46, 1575-1617.
Fama, E.F. and French, K.R., 1988. "Permanent and Temporary Components of Stock Prices," Journal of Political Economy, 96, 246-273.
Granger, C.W.J. and Newbold, P., 1973. "Some Comments on the Evaluation of Economic Forecasts," Applied Economics, 5, 35-47.
Granger, C.W.J. and Newbold, P., 1976. "Forecasting Transformed Series," Journal of the Royal Statistical Society B, 38, 189-203.
Granger, C.W.J. and Newbold, P., 1986. Forecasting Economic Time Series, Second Edition. San Diego: Academic Press.
Granger, C.W.J. and Ramanathan, R., 1984. "Improved Methods of Forecasting," Journal of Forecasting, 3, 197-204.
Hansen, L.P. and Hodrick, R.J., 1980. "Forward Exchange Rates as Optimal Predictors of Future Spot Rates: An Econometric Investigation," Journal of Political Economy, 88, 829-853.
Hendry, D.F. and Mizon, G.E., 1978. "Serial Correlation as a Convenient Simplification, Not a Nuisance: A Comment on a Study of the Demand for Money by the Bank of England," Economic Journal, 88, 549-563.
Henriksson, R.D. and Merton, R.C., 1981. "On Market Timing and Investment Performance II: Statistical Procedures for Evaluating Forecast Skills," Journal of Business, 54, 513-533.
Keane, M.P. and Runkle, D.E., 1990. "Testing the Rationality of Price Forecasts: New Evidence from Panel Data," American Economic Review, 80, 714-735.
Leitch, G. and Tanner, J.E., 1991. "Economic Forecast Evaluation: Profits Versus the Conventional Error Measures," American Economic Review, 81, 580-590.
Leitch, G. and Tanner, J.E., 1995. "Professional Economic Forecasts: Are They Worth Their Costs?," Journal of Forecasting, 14, 143-157.
LeRoy, S.F. and Porter, R.D., 1981. "The Present Value Relation: Tests Based on Implied Variance Bounds," Econometrica, 49, 555-574.
Lopez, J.A., 1995. "Evaluating the Predictive Accuracy of Volatility Models," Manuscript, Department of Economics, University of Pennsylvania.
Mark, N.C., 1995. "Exchange Rates and Fundamentals: Evidence on Long-Horizon Predictability," American Economic Review, 85, 201-218.
McCulloch, R. and Rossi, P.E., 1990. "Posterior, Predictive and Utility-Based Approaches to Testing the Arbitrage Pricing Theory," Journal of Financial Economics, 28, 7-38.
Meese, R.A. and Rogoff, K., 1988. "Was it Real? The Exchange Rate - Interest Differential Relation Over the Modern Floating-Rate Period," Journal of Finance, 43, 933-948.
Merton, R.C., 1981. "On Market Timing and Investment Performance I: An Equilibrium Theory of Value for Market Forecasts," Journal of Business, 54, 513-533.
Mincer, J. and Zarnowitz, V., 1969. "The Evaluation of Economic Forecasts," in J. Mincer (ed.), Economic Forecasts and Expectations. New York: National Bureau of Economic Research.
Murphy, A.H., 1973.
"A New Vector Partition of the Probability Score, " Journal of Applied Meteorology, 12, 595-600. Murphy, A.H., 1974. "A Sample Skill Score for Probability Forecasts," Monthly Weather Review, 102, 48-55. Murphy, A.H. and Winkler, R.L., 1987. "A General Framework for Forecast Evaluation," Momhly Weather Review, 115, 1330-1338. Murphy, A.H. and Winkler, R.L., 1992. "Diagnostic Verification of Probability Forecasts," Intemational Journal of Forecasting, 7, 435-455. Nelson, C.R., 1972. "The Prediction Perfonnance of the F.R.B.-M.I. T.-Penn Model of the U.S. Economy," American Economic Review, 62, 902-917. Nelson, C.R. and Schwert, G.W., 1977. "Short Tenn Interest Rates as Predictors of Inflation: On Testing the Hypothesis that the Real Rate of Interest is Consta nt," American Economic Review, 67, 478-486. Newbold, P. and Granger, C.W.J ., 1974. "Experience with Forecasting Univariate Time Series and the Combination of Forecasts," Journal of the Royal Statistical Society A, 137, 131-146. Pagan, A.R. and Schwert, G.W., 1990. "Alternative Models for Condi tional Stock Volatility," Journal of Econometrics, 45, 267-290. Pesaran, M.H., 1974. "On the General Problem of Model Selection," Review of Economic Studies, 41, 153-171. 41 Pesaran, M.H . and Timmermann, A., 1992. "A Simple Nonparametric Test of Predictive Perfonnance," Journal of Business and Econom ic Statistics, 10, 461-465. Ramsey, J.B., 1969. "Tests for Specification Errors in Classical Least-Squares Regression Analysis," Journal of the Royal Statistical Soci ery B, 2, 350-371. Satchell, S. and Timmennann, A., 1992. "An Assessment of the Economic Value of Nonlinear Foreign Exchange Rate Forecasts," Birkbeck College, Cambridge University, Financial Economics Discussion Paper FE-6/92. Schnader, M.H . and Stekler, H.O ., 1990. "Eva luating Predictions of Change," Journal of Business, 63, 99-107. Seillier-Moiseiwitsch, F. and Dawid, A.P ., 1993. 
"On Testing the Validity of Sequentia l Probability Forecasts," Journal of the America n Statistical Association, 88, 355-359. Shiller, R.J. , 1979. "The Volatility of Long Ten n Interest Rates and Expectations Models of the Tenn Structure," Journal of Political Eco nomy, 87, 1190-1219. Stekler, H. 0., 1987. "Who Forecasts Better?," Journal of Business and Economic Statistics, 5, 155-158. Stekler, H 0., 1994. "Are Economic Forecast s Valuable?," Journal of Forecasting, 13, 495-505. Theil, H., 1961. Economic Forecasts and Poli cy. Amsterdam: North-Holland. Weiss, A.A ., I 995. "Estimating Time Series Models Using the Relevant Cost Function," Manuscript, Department of Economics, Univ ersity of Southern California. Weiss, A.A. and Andersen, A.P ., 1984. "Est imating Forecasting Models Using the Relevant Forecast Evaluation Criterion," Journal of the Royal Statistical Sociery A, 137, 484-487. West, K.D ., I 994. "Asymptotic Inference Abo ut Predictive Ability," Manuscript, Department of Economics, University of Wis consin. West, K.D ., Edison, H.J. and Cho, D., 1993 . "A Utility-Based Comparison of Some Mod els of Exchange Rate Volatility," Journal of Inte rnational Economics, 35, 23-45. Winkler, R.L. and Makridakis, S., 1983. "The Combination of Forecasts," Journal of the Royal Statistical Society A, 146, 150-157. 42 Figure 1 Observed Cell Counts Actual i Actual j Marginal Forecast i oii O;; 0 ,. Forecast j O; O;; o,. Marginal 0 ., 0, Total: 0 Figure 2 Unobserved Cell Probabilities Actual i Actual j Marginal Forecast i pii P;- p ,. Forecast j P;i P,, P,. Marginal p ., P_, Total: I 43