
FORECAST EVALUATION AND COMBINATION

by
Francis X. Diebold and Jose A. Lopez

Federal Reserve Bank of New York
Research Paper No. 9525

November 1995

This paper is being circulated for purposes of discussion and comment only.
The contents should be regarded as preliminary and not for citation or quotation without
permission of the author. The views expressed are those of the author and do not necessarily
reflect those of the Federal Reserve Bank of New York or the Federal Reserve System.
Single copies are available on request to:
Public Information Department
Federal Reserve Bank of New York
New York, NY 10045

Forecast Evaluation and Combination
Francis X. Diebold

Jose A. Lopez

Department of Economics
University of Pennsylvania
3718 Locust Walk
Philadelphia, PA 19104-6297

Research and Market Analysis Group
Federal Reserve Bank of New York
33 Liberty Street
New York, NY 10045

Print date: October 23, 1995
ABSTRACT: Forecasts are of great importance and widely used in economics and finance. Quite simply, good forecasts lead to good decisions. The importance of forecast evaluation and combination techniques follows immediately -- forecast users naturally have a keen interest in monitoring and improving forecast performance. Here we provide a five-part selective account of forecast evaluation and combination methods. In the first, we discuss evaluation of a single forecast, and in particular, evaluation of whether and how it may be improved. In the second, we discuss the evaluation and comparison of the accuracy of competing forecasts. In the third, we discuss whether and how a set of forecasts may be combined to produce a superior composite forecast. In the fourth, we describe a number of forecast evaluation topics of particular relevance in economics and finance, including methods for evaluating direction-of-change forecasts, probability forecasts and volatility forecasts. In the fifth, we conclude.

Acknowledgements: The views expressed here are those of the authors and not those of the Federal Reserve Bank of New York or the Federal Reserve System. We thank Clive Granger for useful comments, and we thank the National Science Foundation, the Sloan Foundation and the University of Pennsylvania Research Foundation for financial support.

It is obvious that forecasts are of great importance and widely used in economics and finance. Quite simply, good forecasts lead to good decisions. The importance of forecast evaluation and combination techniques follows immediately -- forecast users naturally have a keen interest in monitoring and improving forecast performance. More generally, forecast evaluation figures prominently in many questions in empirical economics and finance, such as:

Are expectations rational? (e.g., Keane and Runkle, 1990; Bonham and Cohen, 1995)

Are financial markets efficient? (e.g., Fama, 1970, 1991)

Do macroeconomic shocks cause agents to revise their forecasts at all horizons, or just at short- and medium-term horizons? (e.g., Campbell and Mankiw, 1987; Cochrane, 1988)

Are observed asset returns "too volatile"? (e.g., Shiller, 1979; LeRoy and Porter, 1981)

Are asset returns forecastable over long horizons? (e.g., Fama and French, 1988; Mark, 1995)

Are forward exchange rates unbiased and/or accurate forecasts of future spot prices at various horizons? (e.g., Hansen and Hodrick, 1980)

Are government budget projections systematically too optimistic, perhaps for strategic reasons? (e.g., Auerbach, 1994; Campbell and Ghysels, 1995)

Are nominal interest rates good forecasts of future inflation? (e.g., Fama, 1975; Nelson and Schwert, 1977)
Here we provide a five-part selective account of forecast evaluation and combination methods. In the first, we discuss evaluation of a single forecast, and in particular, evaluation of whether and how it may be improved. In the second, we discuss the evaluation and comparison of the accuracy of competing forecasts. In the third, we discuss whether and how a set of forecasts may be combined to produce a superior composite forecast. In the fourth, we describe a number of forecast evaluation topics of particular relevance in economics and finance, including methods for evaluating direction-of-change forecasts, probability forecasts and volatility forecasts. In the fifth, we conclude.

In treating the subject of forecast evaluation, a tradeoff emerges between generality and tedium. Thus, we focus for the most part on linear least-squares forecasts of univariate covariance stationary processes, or we assume normality so that linear projections and conditional expectations coincide. We leave it to the reader to flesh out the remainder. However, in certain cases of particular interest, we do focus explicitly on nonlinearities that produce divergence between the linear projection and the conditional mean, as well as on nonstationarities that require special attention.

I. Evaluating a Single Forecast
The properties of optimal forecasts are well known; forecast evaluation essentially amounts to checking those properties. First, we establish some notation and recall some familiar results. Denote the covariance stationary time series of interest by y_t. Assuming that the only deterministic component is a possibly nonzero mean, \mu, the Wold representation is

y_t = \mu + \varepsilon_t + b_1 \varepsilon_{t-1} + b_2 \varepsilon_{t-2} + ...,

where \varepsilon_t ~ WN(0, \sigma^2), and WN denotes serially uncorrelated (but not necessarily Gaussian, and hence not necessarily independent) white noise. We assume invertibility throughout, so that an equivalent one-sided autoregressive representation exists.
The k-step-ahead linear least-squares forecast is \hat{y}_{t+k,t}, and the corresponding k-step-ahead forecast error is

e_{t+k,t} = y_{t+k} - \hat{y}_{t+k,t} = \varepsilon_{t+k} + b_1 \varepsilon_{t+k-1} + ... + b_{k-1} \varepsilon_{t+1}.   (1)

Finally, the k-step-ahead forecast error variance is

\sigma_k^2 = var(e_{t+k,t}) = \sigma^2 \sum_{i=0}^{k-1} b_i^2,   (2)

where b_0 = 1.

Four key properties of errors from optimal forecasts, which we discuss in greater detail below, follow immediately:

(a) Optimal forecast errors have a zero mean (follows from (1));

(b) 1-step-ahead optimal forecast errors are white noise (special case of (1) corresponding to k = 1);

(c) k-step-ahead optimal forecast errors are at most MA(k-1) (general case of (1));

(d) The k-step-ahead optimal forecast error variance is non-decreasing in k (follows from (2)).
Before proceeding, we now describe some exact distribution-free nonparametric tests for whether an independently (but not necessarily identically) distributed series has a zero median. The tests are useful in evaluating the properties of optimal forecast errors listed above, as well as other hypotheses that will concern us later. Many such tests exist; two of the most popular, which we use repeatedly, are the sign test and the Wilcoxon signed-rank test.

Denote the series being examined by x_t, and assume that T observations are available. The sign test proceeds under the null hypothesis that the observed series is independent with a zero median.^1 The intuition and construction of the test statistic are straightforward -- under the null, the number of positive observations in a sample of size T has the binomial distribution with parameters T and 1/2. The test statistic is therefore simply

S = \sum_{t=1}^T I_+(x_t),

where I_+(x_t) = 1 if x_t > 0, and 0 otherwise. In large samples, the studentized version of the statistic is standard normal,

(S - T/2) / \sqrt{T/4} ~ N(0, 1).

Thus, significance may be assessed using standard tables of the binomial or normal distributions.
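The sign test is a few lines of code; here is a minimal pure-Python sketch (the function name and example data are ours, purely for illustration):

```python
import math

def sign_test(x):
    """Sign test for a zero median of an independent series.

    S counts the positive observations; under the null,
    S ~ Binomial(T, 1/2), and in large samples the studentized
    statistic (S - T/2) / sqrt(T/4) is approximately N(0, 1).
    Returns (S, studentized statistic).
    """
    T = len(x)
    S = sum(1 for xt in x if xt > 0)
    z = (S - T / 2) / math.sqrt(T / 4)
    return S, z

# Example: errors roughly symmetric about zero give a small |z|.
errors = [0.5, -0.3, 0.2, -0.1, 0.4, -0.6, 0.1, -0.2, 0.3, -0.4]
S, z = sign_test(errors)
```

In small samples one would compare S against exact binomial tables rather than rely on the normal approximation.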
Note that the sign test does not require distributional symmetry. The Wilcoxon signed-rank test, a related distribution-free procedure, does require distributional symmetry, but it can be more powerful than the sign test in that case. Apart from the additional assumption of symmetry, the null hypothesis is the same, and the test statistic is the sum of the ranks of the absolute values of the positive observations,

W = \sum_{t=1}^T I_+(x_t) Rank(|x_t|),

^1 If the series is symmetrically distributed, then a zero median of course corresponds to a zero mean.

where the ranking is in increasing order (e.g., the largest absolute observation is assigned a rank of T, and so on). The intuition of the test is simple -- if the underlying distribution is symmetric about zero, a "very large" (or "very small") sum of the ranks of the absolute values of the positive observations is "very unlikely." The exact finite-sample null distribution of the signed-rank statistic is free from nuisance parameters and invariant to the true underlying distribution, and it has been tabulated. Moreover, in large samples, the studentized version of the statistic is standard normal,

(W - T(T+1)/4) / \sqrt{T(T+1)(2T+1)/24} ~ N(0, 1).
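A minimal sketch of the signed-rank statistic and its studentized version (ties in |x_t| are broken arbitrarily by the sort here; an exact treatment would handle ties explicitly):

```python
import math

def signed_rank_test(x):
    """Wilcoxon signed-rank test for symmetry about zero.

    W = sum of the ranks of |x_t| over positive observations,
    ranks assigned in increasing order of |x_t| (smallest gets 1).
    In large samples, (W - T(T+1)/4) / sqrt(T(T+1)(2T+1)/24)
    is approximately N(0, 1).  Returns (W, studentized statistic).
    """
    T = len(x)
    order = sorted(range(T), key=lambda t: abs(x[t]))
    rank = [0] * T
    for r, t in enumerate(order, start=1):
        rank[t] = r
    W = sum(rank[t] for t in range(T) if x[t] > 0)
    z = (W - T * (T + 1) / 4) / math.sqrt(T * (T + 1) * (2 * T + 1) / 24)
    return W, z
```

For short track records one would use the tabulated exact distribution of W rather than the normal approximation.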
Testing Properties of Optimal Forecasts
Given a track record of forecasts, \hat{y}_{t+k,t}, and corresponding realizations, y_{t+k}, forecast users will naturally want to assess forecast performance. The properties of optimal forecasts, cataloged above, can readily be checked.

a. Optimal Forecast Errors Have a Zero Mean
A variety of standard tests of this hypothesis can be performed, depending on the assumptions one is willing to maintain. For example, if e_{t+k,t} is Gaussian white noise (as might be the case for 1-step-ahead errors), then the standard t-test is the obvious choice because it is exact and uniformly most powerful. If the errors are non-Gaussian but remain independent and identically distributed (iid), then the t-test is still useful asymptotically. However, if more complicated dependence or heterogeneity structures are (or may be) operative, then alternative tests are required, such as those based on the generalized method of moments.
It would be unfortunate if non-normality or richer dependence/heterogeneity structures mandated the use of asymptotic tests, because sometimes only short track records are available. Such is not the case, however, because exact distribution-free nonparametric tests are often applicable, as pointed out by Campbell and Ghysels (1995). Although the distribution-free tests do require independence (sign test) and independence and symmetry (signed-rank test), they do not require normality or identical distributions over time. Thus, the tests are automatically robust to a variety of forecast error distributions, and to heteroskedasticity of the independent but not identically distributed type.
For k > 1, however, even optimal forecast errors are likely to display serial correlation, so the nonparametric tests must be modified. Under the assumption that the forecast errors are (k-1)-dependent, each of the following k series of forecast errors will be free of serial correlation: {e_{1+k,1}, e_{1+2k,1+k}, e_{1+3k,1+2k}, ...}, {e_{2+k,2}, e_{2+2k,2+k}, e_{2+3k,2+2k}, ...}, ..., {e_{2k,k}, e_{3k,2k}, e_{4k,3k}, ...}. Thus, a Bonferroni bounds test (with size bounded above by \alpha) is obtained by performing k tests, each of size \alpha/k, on each of the k error series, and rejecting the null hypothesis if the null is rejected for any of the series. This procedure is conservative, even asymptotically. Alternatively, one could use just one of the k error series and perform an exact test at level \alpha, at the cost of reduced power due to the discarded observations.
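The Bonferroni bounds procedure can be sketched as follows, using the large-sample sign test on each of the k subseries (the normal critical value is a rough stand-in for the exact binomial critical value one would use with short track records; the function name is ours):

```python
import math
from statistics import NormalDist

def bonferroni_sign_test(errors, k, alpha=0.05):
    """Bonferroni bounds test for a zero median of (k-1)-dependent
    k-step-ahead forecast errors.

    Splits the errors into k subseries errors[j::k], each free of
    serial correlation under (k-1)-dependence, applies the
    large-sample sign test to each at size alpha/k, and rejects
    (size bounded above by alpha) if any subtest rejects.
    """
    crit = NormalDist().inv_cdf(1 - alpha / (2 * k))  # two-sided
    reject = False
    for j in range(k):
        sub = errors[j::k]          # every k-th error
        T = len(sub)
        S = sum(1 for e in sub if e > 0)
        z = (S - T / 2) / math.sqrt(T / 4)
        if abs(z) > crit:
            reject = True
    return reject
```

As the text notes, this is conservative: the true size is below \alpha even asymptotically.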
In concluding this section, let us stress that the nonparametric distribution-free tests are neither unambiguously "better" nor "worse" than the more common tests; rather, they are useful in different situations and are therefore complementary. To their credit, they are often exact finite-sample tests with good finite-sample power, and they are insensitive to deviations from the standard assumptions of normality and homoskedasticity required to justify more standard tests in small samples. Against them, however, is the fact that they require independence of the forecast errors, an assumption even stronger than conditional-mean independence, let alone linear-projection independence. Furthermore, although the nonparametric tests can be modified to allow for k-dependence, a possibly substantial price must be paid either in terms of inexact size or reduced power.

b. 1-Step-Ahead Optimal Forecast Errors are White Noise

More precisely, the errors from linear least squares forecasts are linear-projection independent, and the errors from least squares forecasts are conditional-mean independent. The errors never need be fully serially independent, because dependence can always enter through higher moments, as for example with the conditional-variance dependence of GARCH processes.

Under various sets of maintained assumptions, standard asymptotic tests may be used to test the white noise hypothesis. For example, the sample autocorrelation and partial autocorrelation functions, together with Bartlett asymptotic standard errors, may be useful graphical diagnostics in that regard. Standard tests based on the serial correlation coefficient, as well as the Box-Pierce and related statistics, may be useful as well.
Dufour (1981) presents adaptations of the sign and Wilcoxon signed-rank tests that yield exact tests for serial dependence in 1-step-ahead forecast errors, without requiring normality or identical forecast error distributions. Consider, for example, the null hypothesis that the forecast errors are independent and symmetrically distributed with zero median. Then

median(e_{t+1,t} e_{t+2,t+1}) = 0;

that is, the product of two symmetric independent random variables with zero median is itself symmetric with zero median. Under the alternative of positive serial dependence, median(e_{t+1,t} e_{t+2,t+1}) > 0, and under the alternative of negative serial dependence, median(e_{t+1,t} e_{t+2,t+1}) < 0. This suggests examining the cross-product series z_t = e_{t+1,t} e_{t+2,t+1} for symmetry about zero, the obvious test for which is the signed-rank test,

W^0 = \sum_t I_+(z_t) Rank(|z_t|).

Note that the z_t sequence will be serially dependent even if the e_{t+1,t} sequence is not, in apparent violation of the conditions required for validity of the signed-rank test (applied to z_t). Hence the importance of Dufour's contribution -- Dufour shows that the serial correlation is of no consequence and that the distribution of W^0 is the same as that of W.
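A sketch of the cross-product construction (tie-breaking among equal |z_t| is arbitrary here; exact critical values come from the tabulated signed-rank distribution, and the function name is ours):

```python
def dufour_cross_product_stat(e):
    """Dufour (1981)-style check for serial dependence in 1-step
    forecast errors.

    Forms z_t = e_{t+1,t} * e_{t+2,t+1} from consecutive errors and
    returns the signed-rank statistic of z.  Under independence and
    symmetry, z_t has zero median; Dufour shows the serial dependence
    in z does not affect the null distribution of the statistic.
    """
    z = [e[t] * e[t + 1] for t in range(len(e) - 1)]
    n = len(z)
    order = sorted(range(n), key=lambda t: abs(z[t]))
    rank = [0] * n
    for r, t in enumerate(order, start=1):
        rank[t] = r
    return sum(rank[t] for t in range(n) if z[t] > 0)
```

An error series that alternates in sign yields all-negative cross products and a statistic of zero, the extreme negative-dependence case.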
c. k-Step-Ahead Optimal Forecast Errors are at Most MA(k-1)

Cumby and Huizinga (1992) develop a useful asymptotic test for serial dependence of order greater than k-1. The null hypothesis is that the e_{t+k,t} series is MA(q) (0 <= q <= k-1), against the alternative hypothesis that at least one autocorrelation is nonzero at a lag greater than k-1. Under the null, the sample autocorrelations of e_{t+k,t}, \hat{\rho} = (\hat{\rho}_{q+1}, ..., \hat{\rho}_{q+s})', are asymptotically distributed \sqrt{T} \hat{\rho} ~ N(0, V).^2 Thus,

C = T \hat{\rho}' \hat{V}^{-1} \hat{\rho}

is asymptotically distributed as \chi^2_s under the null, where \hat{V} is a consistent estimator of V.

Dufour's (1981) distribution-free nonparametric tests may also be adapted to provide a finite-sample bounds test for serial dependence of order greater than k-1. As before, separate the forecast errors into k series, each of which is serially independent under the null of (k-1)-dependence. Then, for each series, take z_{k,t} = e_{t+k,t} e_{t+2k,t+k} and reject at significance level bounded above by \alpha if one or more of the subset test statistics rejects at the \alpha/k level.

^2 s is a cutoff lag selected by the user.
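The bounds variant is straightforward to sketch; here we return a studentized sign statistic for the cross products of adjacent errors within each of the k subseries (function name ours; each statistic would be compared with the \alpha/k critical value):

```python
import math

def ma_bounds_sign_stats(e, k):
    """Bounds check that k-step-ahead errors are at most MA(k-1).

    Splits the errors into k subseries e[j::k], each serially
    independent under (k-1)-dependence, forms within-subseries cross
    products z_t = e_t * e_{t+1}, and returns the studentized sign
    statistic for each subseries.  Reject (at level bounded by alpha)
    if any statistic is significant at the alpha/k level.
    """
    stats = []
    for j in range(k):
        sub = e[j::k]
        z = [sub[t] * sub[t + 1] for t in range(len(sub) - 1)]
        T = len(z)
        S = sum(1 for zt in z if zt > 0)
        stats.append((S - T / 2) / math.sqrt(T / 4))
    return stats
```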
d. The k-Step-Ahead Optimal Forecast Error Variance is Non-Decreasing in k

The k-step-ahead forecast error variance,

\sigma_k^2 = var(e_{t+k,t}) = \sigma^2 \sum_{i=0}^{k-1} b_i^2,

is non-decreasing in k. Thus, it is often useful simply to examine the sample k-step-ahead forecast error variances as a function of k, both to be sure the condition appears satisfied and to see the pattern with which the forecast error variance grows with k, which often conveys useful information.^3 Formal inference may also be done, so long as one takes care to allow for dependence of the sample variances across horizons.
Assessing Optimality with Respect to an Information Set
The key property of optimal forecast errors, from which all others follow (including those cataloged above), is unforecastability on the basis of information available at the time the forecast was made. This is true regardless of whether linear-projection optimality or conditional-mean optimality is of interest, regardless of whether the relevant loss function is quadratic, and regardless of whether the series being forecast is stationary.

Following Brown and Maital (1981), it is useful to distinguish between partial and full optimality. Partial optimality refers to unforecastability of forecast errors with respect to some subset, as opposed to all subsets, of available information, \Omega_t. Partial optimality, for example, characterizes a situation in which a forecast is optimal with respect to the information used to construct it, but the information used was not all that could have been used. Thus, each of a

^3 Extensions of this idea to nonstationary long-memory environments are developed in Diebold and Lindner (1995).

set of competing forecasts may have the partial optimality property if each is optimal with respect to its own information set.

One may test partial optimality via regressions of the form e_{t+k,t} = \alpha' x_t + u_t, where x_t \in \Omega_t. The particular case of testing partial optimality with respect to \hat{y}_{t+k,t} has received a good deal of attention, as in Mincer and Zarnowitz (1969). The relevant regression is e_{t+k,t} = \alpha_0 + \alpha_1 \hat{y}_{t+k,t} + u_t, or equivalently y_{t+k} = \beta_0 + \beta_1 \hat{y}_{t+k,t} + u_t; optimality corresponds to (\alpha_0, \alpha_1) = (0, 0), or (\beta_0, \beta_1) = (0, 1).^4
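The Mincer-Zarnowitz regression is plain OLS; a minimal sketch of the point estimates (the actual test also needs standard errors, with a serial-correlation correction for k > 1, which we omit; function name ours):

```python
def mincer_zarnowitz(y, yhat):
    """OLS estimates of y_{t+k} = b0 + b1 * yhat_{t+k,t} + u_t.

    Partial optimality with respect to the forecast corresponds to
    (b0, b1) = (0, 1).  Returns (b0, b1).
    """
    T = len(y)
    my = sum(y) / T
    mx = sum(yhat) / T
    sxy = sum((yhat[t] - mx) * (y[t] - my) for t in range(T))
    sxx = sum((yhat[t] - mx) ** 2 for t in range(T))
    b1 = sxy / sxx
    b0 = my - b1 * mx
    return b0, b1
```

A perfect track record (yhat identical to y) returns exactly (0, 1).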
One may also expand the regression to allow for various sorts of nonlinearity. For example, following Ramsey (1969), one may test whether all coefficients in the regression

e_{t+k,t} = \sum_{j=0}^{J} \alpha_j \hat{y}_{t+k,t}^j + u_t

are zero.

Full optimality, in contrast, requires the forecast error to be unforecastable on the basis of all information available when the forecast was made (that is, the entirety of \Omega_t). Conceptually, one could test full rationality via regressions of the form e_{t+k,t} = \alpha' x_t + u_t. If \alpha = 0 for all x_t \in \Omega_t, then the forecast is fully optimal. In practice, one can never test for full optimality, but rather only partial optimality with respect to increasing information sets.
Distribution-free nonparametric methods may also be used to test optimality with respect to various information sets. The sign and signed-rank tests, for example, are readily adapted to test orthogonality between forecast errors and available information, as proposed by Campbell and Dufour (1991, 1995). If, for example, e_{t+1,t} is linear-projection independent of x_t \in \Omega_t, then cov(e_{t+1,t}, x_t) = 0. Thus, in the symmetric case, one may use the signed-rank test for whether E[z_t] = E[e_{t+1,t} x_t] = 0, and more generally, one may use the sign test for

^4 In such regressions, the disturbance should be white noise for 1-step-ahead forecasts but may be serially correlated for multi-step-ahead forecasts.

whether median(z_t) = median(e_{t+1,t} x_t) = 0.^5 The relevant sign and signed-rank statistics are

S^a = \sum_{t=1}^T I_+(z_t)   and   W^a = \sum_{t=1}^T I_+(z_t) Rank(|z_t|).

Moreover, one may allow for nonlinear transformations of the elements of the information set, which is useful for assessing conditional-mean as opposed to simply linear-projection independence, by taking z_t = e_{t+1,t} g(x_t), where g(.) is a nonlinear function of interest. Finally, the tests can be generalized to allow for k-step-ahead forecast errors as before. Simply take z_t = e_{t+k,t} g(x_t), divide the z_t series into the usual k subsets, and reject the orthogonality null at significance level bounded by \alpha if any of the subset test statistics are significant at the \alpha/k level.^6
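These orthogonality checks reduce to sign or signed-rank statistics on the product series z_t; a minimal sketch of the sign version (names ours; as footnote 6 notes, g(x_t) must be centered before use):

```python
def orthogonality_sign_stat(e, x, g=lambda v: v):
    """Campbell-Dufour style orthogonality check.

    Forms z_t = e_{t+1,t} * g(x_t) and returns the sign statistic
    (count of positive z_t).  Under linear-projection independence
    of the errors from x_t (and with g(x_t) centered at zero), z_t
    has zero median, so the count is Binomial(T, 1/2) under the null.
    """
    z = [et * g(xt) for et, xt in zip(e, x)]
    return sum(1 for zt in z if zt > 0)
```

Passing a nonlinear g (e.g., a squaring transformation of a centered regressor) probes conditional-mean rather than merely linear-projection independence.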

II. Comparing the Accuracy of Multiple Forecasts
Measures of Forecast Accuracy
In practice, it is unlikely that one will ever stumble upon a fully-optimal forecast; instead, situations often arise in which a number of forecasts (all of them suboptimal) are compared and possibly combined. The crucial object in measuring forecast accuracy is the loss function, L(y_{t+k}, \hat{y}_{t+k,t}), often restricted to L(e_{t+k,t}), which charts the "loss," "cost" or "disutility" associated with various pairs of forecasts and realizations. In addition to the shape of the loss function, the forecast horizon (k) is also of crucial importance. Rankings of forecast accuracy may be very different across different loss functions and/or different
^5 Again, it is not obvious that the conditions required for application of the sign or signed-rank test to z_t are satisfied, but they are; see Campbell and Dufour (1995) for details.

^6 Our discussion has implicitly assumed that both e_{t+1,t} and g(x_t) are centered at zero. This will hold for e_{t+1,t} if the forecast is unbiased, but there is no reason why it should hold for g(x_t). Thus, in general, the test is based on g(x_t) - \mu_t, where \mu_t is a centering parameter such as the mean, median or trend of g(x_t). See Campbell and Dufour (1995) for details.

horizons. This result has led some to argue the virtues of various "universally applicable" accuracy measures. Clements and Hendry (1993), for example, argue for an accuracy measure under which forecast rankings are invariant to certain transformations.

Ultimately, however, the appropriate loss function depends on the situation at hand. As stressed by Diebold (1993) among many others, forecasts are usually constructed for use in particular decision environments; for example, policy decisions by government officials or trading decisions by market participants. Thus, the appropriate accuracy measure arises from the loss function faced by the forecast user. Economists, for example, may be interested in the profit streams (e.g., Leitch and Tanner, 1991, 1995; Engle et al., 1993) or utility streams (e.g., McCulloch and Rossi, 1990; West, Edison and Cho, 1993) flowing from various forecasts.
Nevertheless, let us discuss a few stylized statistical loss functions, because they are used widely and serve as popular benchmarks. Accuracy measures are usually defined on the forecast errors, e_{t+k,t}, or percent errors, p_{t+k,t}. The mean error,

ME = (1/T) \sum_{t=1}^T e_{t+k,t},

and mean percent error,

MPE = (1/T) \sum_{t=1}^T p_{t+k,t},

provide measures of bias, which is one component of accuracy. The most common overall accuracy measure, by far, is mean squared error,

MSE = (1/T) \sum_{t=1}^T e_{t+k,t}^2,

or mean squared percent error,

MSPE = (1/T) \sum_{t=1}^T p_{t+k,t}^2.

Often the square roots of these measures are used to preserve units, yielding the root mean squared error,

RMSE = \sqrt{(1/T) \sum_{t=1}^T e_{t+k,t}^2},

and the root mean squared percent error,

RMSPE = \sqrt{(1/T) \sum_{t=1}^T p_{t+k,t}^2}.

Somewhat less popular, but nevertheless common, accuracy measures are mean absolute error,

MAE = (1/T) \sum_{t=1}^T |e_{t+k,t}|,

and mean absolute percent error,

MAPE = (1/T) \sum_{t=1}^T |p_{t+k,t}|.
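These measures are one-liners; a compact sketch (realizations and forecasts are assumed to be aligned lists of equal length, and the function name is ours):

```python
import math

def accuracy_measures(y, yhat):
    """Standard accuracy measures on errors e_t = y_t - yhat_t and
    percent errors p_t = e_t / y_t."""
    e = [a - f for a, f in zip(y, yhat)]
    T = len(e)
    mse = sum(et ** 2 for et in e) / T
    return {
        "ME": sum(e) / T,                                  # bias
        "MSE": mse,
        "RMSE": math.sqrt(mse),                            # same units as y
        "MAE": sum(abs(et) for et in e) / T,
        "MAPE": sum(abs((a - f) / a) for a, f in zip(y, yhat)) / T,
    }
```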

MSE admits an informative decomposition into the sum of the variance of the forecast error and its squared bias,

MSE = E[(y_{t+k} - \hat{y}_{t+k,t})^2] = var(y_{t+k} - \hat{y}_{t+k,t}) + (E[y_{t+k}] - E[\hat{y}_{t+k,t}])^2,

or equivalently,

MSE = var(y_{t+k}) + var(\hat{y}_{t+k,t}) - 2 cov(y_{t+k}, \hat{y}_{t+k,t}) + (E[y_{t+k}] - E[\hat{y}_{t+k,t}])^2.

This result makes clear that MSE depends only on the second moment structure of the joint distribution of the actual and forecasted series. Thus, as noted in Murphy and Winkler (1987, 1992), although MSE is a useful summary statistic for the joint distribution of y_{t+k} and \hat{y}_{t+k,t}, in general it contains substantially less information than the actual joint distribution itself. Other statistics highlighting different aspects of the joint distribution may therefore be useful as well. Ultimately, of course, one may want to focus directly on estimates of the joint distribution, which may be available if the sample size is large enough to permit relatively precise estimation.
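The variance-plus-squared-bias decomposition is easy to verify numerically; a sketch using sample moments with population (1/T) variances (function name ours):

```python
def mse_decomposition(y, yhat):
    """Returns (MSE, var(e) + bias^2), which are equal in sample
    when variances are computed with a 1/T divisor."""
    T = len(y)
    e = [a - f for a, f in zip(y, yhat)]
    mse = sum(et ** 2 for et in e) / T
    bias = sum(e) / T
    var_e = sum((et - bias) ** 2 for et in e) / T
    return mse, var_e + bias ** 2
```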
Measuring Forecastability

It is natural and informative to evaluate the accuracy of a forecast. We hasten to add, however, that actual and forecasted values may be dissimilar, even for very good forecasts. To take an extreme example, note that the linear least squares forecast for a zero-mean white noise process is simply zero -- the paths of forecasts and realizations will look very different, yet there does not exist a better linear forecast under quadratic loss. This example highlights the inherent limits to forecastability, which depends on the process being forecast; some processes are inherently easy to forecast, while others are hard to forecast. In other words, sometimes the information on which the forecaster optimally conditions is very valuable, and sometimes it isn't.
The issue of how to quantify forecastability arises at once. Granger and Newbold (1976) propose a natural definition of forecastability for covariance stationary series under squared-error loss, patterned after the familiar R^2 of linear regression,

G = var(\hat{y}_{t+1,t}) / var(y_{t+1}) = 1 - var(e_{t+1,t}) / var(y_{t+1}),

where both the forecast and forecast error refer to the optimal (that is, linear least squares or conditional mean) forecast.
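The Granger-Newbold measure is a one-line computation given realizations and optimal-forecast errors (function name ours; population 1/T variances are used):

```python
def forecastability(y, e):
    """Granger-Newbold forecastability under squared-error loss:
    G = 1 - var(e_{t+1,t}) / var(y_{t+1}), analogous to R^2.
    G = 0 for white noise (forecast errors as variable as the series),
    G near 1 for a highly predictable series."""
    def var(s):
        m = sum(s) / len(s)
        return sum((v - m) ** 2 for v in s) / len(s)
    return 1 - var(e) / var(y)
```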
In closing this section, we note that although measures of forecastability are useful constructs, they are driven by the population properties of processes and their optimal forecasts, so they don't help one to evaluate the "goodness" of an actual reported forecast, which may be far from optimal. For example, if the variance of the forecast errors e_{t+1,t} is not much lower than the variance of the covariance stationary series y_{t+1}, it could be that either the forecast is poor, the series is inherently almost unforecastable, or both.
Statistical Comparison of Forecast Accuracy^7
Once a loss function has been decided upon, it is often of interest to know which of the competing forecasts has smallest expected loss. Forecasts may of course be ranked according to average loss over the sample period, but one would like to have a measure of the sampling variability in such average losses. Alternatively, one would like to be able to test the hypothesis that the difference of expected losses between forecasts i and j is zero (i.e., E[L(y_{t+k}, \hat{y}^i_{t+k,t})] = E[L(y_{t+k}, \hat{y}^j_{t+k,t})]), against the alternative that one forecast is better.

^7 This section draws heavily upon Diebold and Mariano (1995).

Stekler (1987) proposes a rank-based test of the hypothesis that each of a set of forecasts has equal expected loss.^8 Given N competing forecasts, assign to each forecast at each time a rank according to its accuracy (the best forecast receives a rank of N, the second-best receives a rank of N-1, and so forth). Then aggregate the period-by-period ranks for each forecast,

H_i = \sum_{t=1}^T Rank(L(y_{t+k}, \hat{y}^i_{t+k,t})),

i = 1, ..., N, and form the chi-squared goodness-of-fit test statistic comparing the aggregate ranks H_i with their common expected value under the null; under the null, the statistic is asymptotically distributed as \chi^2_{N-1}. As described here, the test requires the rankings to be independent over space and time, but simple modifications along the lines of the Bonferroni bounds test may be made if the rankings are temporally (k-1)-dependent. Moreover, exact versions of the test may be obtained by exploiting Fisher's randomization principle.^9

One limitation of Stekler's rank-based approach is that information on the magnitude of differences in expected loss across forecasters is discarded. In many applications, one wants to know not only whether the difference of expected losses differs from zero (or the ratio differs from 1), but also by how much it differs. Effectively, one wants to know the sampling distribution of the sample mean loss differential (or of the individual sample mean losses),

^8 Stekler uses RMSE, but other loss functions may be used.

^9 See, for example, Bradley (1968), Chapter 4.
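The rank-aggregation step can be sketched as follows; the chi-squared statistic is then a standard goodness-of-fit computation on the resulting H_i (function name and data layout are ours):

```python
def stekler_aggregate_ranks(losses):
    """Stekler-style rank aggregation.

    losses[i][t] is the loss of forecast i at time t.  At each t the
    most accurate forecast (smallest loss) gets rank N, the next
    N-1, and so on.  Returns the aggregate ranks H_i; under the null
    of equal accuracy each H_i has expectation T*(N+1)/2.
    """
    N = len(losses)
    T = len(losses[0])
    H = [0] * N
    for t in range(T):
        # order forecasts from worst (largest loss) to best
        order = sorted(range(N), key=lambda i: -losses[i][t])
        for r, i in enumerate(order, start=1):
            H[i] += r
    return H
```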

which in addition to being directly informative would enable Wald tests of the hypothesis that the expected loss differential is zero. Diebold and Mariano (1995), building on earlier work by Granger and Newbold (1986) and Meese and Rogoff (1988), develop a test for a zero expected loss differential that allows for forecast errors that are nonzero mean, non-Gaussian, serially correlated and contemporaneously correlated.

In general, the loss function is L(y_{t+k}, \hat{y}^i_{t+k,t}). Because in many applications the loss function will be a direct function of the forecast error, L(y_{t+k}, \hat{y}^i_{t+k,t}) = L(e^i_{t+k,t}), we write L(e^i_{t+k,t}) from this point on to economize on notation, while recognizing that certain loss functions (such as direction-of-change) don't collapse to the L(e^i_{t+k,t}) form.^10 The null hypothesis of equal forecast accuracy for two forecasts is E[L(e^i_{t+k,t})] = E[L(e^j_{t+k,t})], or E[d_t] = 0, where d_t = L(e^i_{t+k,t}) - L(e^j_{t+k,t}) is the loss differential.

If d_t is a covariance stationary, short-memory series, then standard results may be used to deduce the asymptotic distribution of the sample mean loss differential,

\sqrt{T} (\bar{d} - \mu) ~ N(0, 2\pi f_d(0)),

where \bar{d} = (1/T) \sum_{t=1}^T [L(e^i_{t+k,t}) - L(e^j_{t+k,t})] is the sample mean loss differential,

f_d(0) = (1/2\pi) \sum_{\tau=-\infty}^{\infty} \gamma_d(\tau)

is the spectral density of the loss differential at frequency zero, \gamma_d(\tau) = E[(d_t - \mu)(d_{t-\tau} - \mu)] is the autocovariance of the loss differential at displacement \tau, and \mu is the population mean loss differential. The formula for f_d(0) shows that the correction for serial correlation can be substantial, even if the loss differential is only weakly serially correlated, due to the cumulation of the autocovariance terms. In large samples, the obvious

^10 In such cases, the L(y_{t+k}, \hat{y}^i_{t+k,t}) form should be used.

statistic for testing the null hypothesis of equal forecast accuracy is the standardized sample mean loss differential,

B = \bar{d} / \sqrt{2\pi \hat{f}_d(0) / T},

where \hat{f}_d(0) is a consistent estimate of f_d(0).
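A minimal sketch of the statistic: here the long-run variance 2\pi f_d(0) is estimated by an unweighted truncated sum of sample autocovariances out to lag k-1, whereas practical implementations typically use a kernel (e.g., Bartlett-weighted) estimator; the function name and defaults are ours:

```python
import math

def diebold_mariano(e1, e2, k=1, loss=lambda e: e ** 2):
    """Diebold-Mariano style statistic for equal forecast accuracy.

    d_t = loss(e1_t) - loss(e2_t); returns
    dbar / sqrt(lrv / T), where lrv approximates 2*pi*f_d(0) by the
    sum of autocovariances of d out to lag k-1 (truncated, unweighted).
    Asymptotically N(0, 1) under the null of a zero mean differential.
    """
    d = [loss(a) - loss(b) for a, b in zip(e1, e2)]
    T = len(d)
    dbar = sum(d) / T

    def gamma(tau):
        return sum((d[t] - dbar) * (d[t - tau] - dbar)
                   for t in range(tau, T)) / T

    lrv = gamma(0) + 2 * sum(gamma(tau) for tau in range(1, k))
    return dbar / math.sqrt(lrv / T)
```

For k = 1 the long-run variance reduces to the sample variance of d_t, matching the 1-step-ahead case in which the loss differential is (under the null and optimality) serially uncorrelated.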
It is useful to have available exact finite-sample tests of forecast accuracy to complement the asymptotic tests. As usual, variants of the sign and signed-rank tests are applicable. When using the sign test, the null hypothesis is that the median of the loss differential is zero, median(L(e^i_{t+k,t}) - L(e^j_{t+k,t})) = 0. Note that the null of a zero median loss differential is not the same as the null of zero difference between median losses; that is,

median(L(e^i_{t+k,t}) - L(e^j_{t+k,t})) \neq median(L(e^i_{t+k,t})) - median(L(e^j_{t+k,t})).

For this reason, the null differs slightly in spirit from that associated with the asymptotic Diebold-Mariano test, but nevertheless, it has the intuitive and meaningful interpretation that P(L(e^i_{t+k,t}) > L(e^j_{t+k,t})) = 1/2.

When using the Wilcoxon signed-rank test, the null hypothesis is that the loss differential series is symmetric about a zero median (and hence mean), which corresponds precisely to the null of the asymptotic Diebold-Mariano test. Symmetry of the loss differential will obtain, for example, if the distributions of L(e^i_{t+k,t}) and L(e^j_{t+k,t}) are the same up to a location shift. Symmetry is ultimately an empirical matter and may be assessed using standard procedures.
The construction and intuition of the distribution-free nonparametric test statistics are straightforward. The sign test statistic is

S_B = \sum_{t=1}^T I_+(d_t),

and the signed-rank test statistic is

W_B = \sum_{t=1}^T I_+(d_t) Rank(|d_t|).

Serial correlation may be handled as before via Bonferroni bounds. It is interesting to note that, in multi-step forecast comparisons, forecast error serial correlation may be a "common feature" in the terminology of Engle and Kozicki (1993), because it is induced largely by the fact that the forecast horizon is longer than the interval at which the data are sampled and may therefore not be present in loss differentials even if present in the forecast errors themselves. This possibility can of course be checked empirically.
West (1994) takes an approach very much related to, but nevertheless different from, that of Diebold and Mariano. The main difference is that West assumes that forecasts are computed from an estimated regression model and explicitly accounts for the effects of parameter uncertainty within that framework. When the estimation sample is small, the tests can lead to different results. However, as the estimation period grows in length relative to the forecast period, the effects of parameter uncertainty vanish, and the Diebold-Mariano and West statistics are identical.

West's approach is both more general and less general than the Diebold-Mariano approach. It is more general in that it corrects for nonstationarities induced by the updating of parameter estimates. It is less general in that those corrections are made within the confines of a more rigid framework than that of Diebold and Mariano, in whose framework no assumptions need be made about the often unknown or incompletely known models that underlie forecasts.
In closing this section, we note that it is sometimes informative to compare the accuracy of a forecast to that of a "naive" competitor. A simple and popular such comparison is achieved by Theil's (1961) U statistic, which is the ratio of the 1-step-ahead MSE for a given forecast relative to that of the random walk forecast \hat{y}_{t+1,t} = y_t; that is,

U = \frac{\sum_{t=1}^{T} (y_{t+1} - \hat{y}_{t+1,t})^2}{\sum_{t=1}^{T} (y_{t+1} - y_t)^2}.

Generalization to other loss functions and other horizons is immediate. The statistical significance of the MSE comparison underlying the U statistic may be ascertained using the methods just described. One must remember, of course, that the random walk is not necessarily a naive competitor, particularly for many economic and financial variables, so that values of the U statistic near one are not necessarily "bad." Several authors, including Armstrong and Fildes (1995), have advocated using the U statistic and close relatives for comparing the accuracy of various forecasting methods across series.
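In MSE terms the U statistic is a two-line computation; the sketch below (simulated data, illustrative only) checks the defining property that the random-walk forecast itself yields U = 1:

```python
import numpy as np

def theil_u(y, f):
    """Theil's U: ratio of 1-step-ahead forecast MSE to the MSE of the
    no-change (random walk) forecast y_{t+1,t} = y_t."""
    y, f = np.asarray(y, float), np.asarray(f, float)
    mse_f = np.mean((y[1:] - f[1:]) ** 2)    # f[t] is the forecast of y[t]
    mse_rw = np.mean((y[1:] - y[:-1]) ** 2)  # random-walk benchmark MSE
    return mse_f / mse_rw

rng = np.random.default_rng(1)
y = np.cumsum(rng.normal(size=300))          # a random-walk series
f_rw = np.concatenate([[y[0]], y[:-1]])      # the random-walk forecast itself
u = theil_u(y, f_rw)                         # equals 1 by construction
```

Values of U below one indicate that the forecast beats the no-change benchmark under quadratic loss.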

III. Combining Forecasts
In forecast accuracy comparison, one asks which forecast is best with respect to a particular loss function. Regardless of whether one forecast is "best," however, the question arises as to whether competing forecasts may be fruitfully combined -- in similar fashion to the construction of an asset portfolio -- to produce a composite forecast superior to all the original forecasts. Thus, forecast combination, although obviously related to forecast accuracy comparison, is logically distinct and of independent interest.
Forecast Encompassing Tests

Forecast encompassing tests enable one to determine whether a certain forecast incorporates (or encompasses) all the relevant information in competing forecasts. The idea dates at least to Nelson (1972) and Cooper and Nelson (1975), and was formalized and extended by Chong and Hendry (1986). For simplicity, let us focus on the case of two forecasts, \hat{y}^1_{t+k,t} and \hat{y}^2_{t+k,t}. Consider the regression

y_{t+k} = \beta_0 + \beta_1 \hat{y}^1_{t+k,t} + \beta_2 \hat{y}^2_{t+k,t} + \epsilon_{t+k,t}.

If (\beta_0, \beta_1, \beta_2) = (0, 1, 0), one says that model 1 forecast-encompasses model 2, and if (\beta_0, \beta_1, \beta_2) = (0, 0, 1), then model 2 forecast-encompasses model 1. For any other (\beta_0, \beta_1, \beta_2) values, neither model encompasses the other, and both forecasts contain useful information about y_{t+k}. Under certain conditions, the encompassing hypotheses can be tested using standard methods.¹¹ Moreover, although it does not yet seem to have appeared in the forecasting literature, it would be straightforward to develop exact finite-sample tests (or bounds tests when k > 1) of the hypothesis using simple generalizations of the distribution-free tests discussed earlier.
Fair and Shiller (1989, 1990) take a different but related approach based on the regression

(y_{t+k} - y_t) = \beta_0 + \beta_1 (\hat{y}^1_{t+k,t} - y_t) + \beta_2 (\hat{y}^2_{t+k,t} - y_t) + \epsilon_{t+k,t}.

As before, forecast encompassing corresponds to coefficient values of (0, 1, 0) or (0, 0, 1). Under the null of forecast encompassing, the Chong-Hendry and Fair-Shiller regressions are identical. When the variable being forecast is integrated, however, the Fair-Shiller framework may prove more convenient, because the specification in terms of changes facilitates the use of Gaussian asymptotic distribution theory.

¹¹ Note that MA(k-1) serial correlation will typically be present in \epsilon_{t+k,t} if k > 1.
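The Chong-Hendry regression is ordinary least squares of realizations on an intercept and the competing forecasts. A minimal simulated sketch (the data-generating process is an assumption for illustration, chosen so that forecast 1 is the conditional mean and should encompass the noisier forecast 2):

```python
import numpy as np

rng = np.random.default_rng(2)
T = 500
f1 = rng.normal(size=T)               # forecast 1: the conditional mean of y
y = f1 + 0.5 * rng.normal(size=T)     # realizations
f2 = f1 + 1.0 * rng.normal(size=T)    # forecast 2: forecast 1 plus noise

# encompassing regression y_{t+k} = b0 + b1*f1 + b2*f2 + e;
# estimates near (0, 1, 0) indicate that forecast 1 encompasses forecast 2
X = np.column_stack([np.ones(T), f1, f2])
b0, b1, b2 = np.linalg.lstsq(X, y, rcond=None)[0]
```

Formal inference would add standard errors (with an MA(k-1) correction for multi-step horizons, per footnote 11).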

Forecast Combination
Failure of one model's forecasts to encompass other models' forecasts indicates that all the models examined are misspecified. It should come as no surprise that such situations are typical in practice, because all forecasting models are surely misspecified -- they are intentional abstractions of a much more complex reality. What, then, is the role of forecast combination techniques? In a world in which information sets can be instantaneously and costlessly combined, there is no role; it is always optimal to combine information sets rather than forecasts. In the long run, the combination of information sets may sometimes be achieved by improved model specification. But in the short run -- particularly when deadlines must be met and timely forecasts produced -- pooling of information sets is typically either impossible or prohibitively costly. This simple insight motivates the pragmatic idea of forecast combination, in which forecasts rather than models are the basic object of analysis, due to an assumed inability to combine information sets. Thus, forecast combination can be viewed as a key link between the short-run, real-time forecast production process, and the longer-run, ongoing process of model development.

Many combining methods have been proposed, and they fall roughly into two groups, "variance-covariance" methods and "regression-based" methods. Let us consider first the variance-covariance method due to Bates and Granger (1969). Suppose one has two unbiased forecasts from which a composite is formed as¹²

\hat{y}^c_{t+k,t} = \omega \hat{y}^1_{t+k,t} + (1 - \omega) \hat{y}^2_{t+k,t}.

¹² The generalization to the case of M > 2 competing unbiased forecasts is straightforward, as shown in Newbold and Granger (1974).

Because the weights sum to unity, the composite forecast will necessarily be unbiased. Moreover, the combined forecast error will satisfy the same relation as the combined forecast; that is,

e^c_{t+k,t} = \omega e^1_{t+k,t} + (1 - \omega) e^2_{t+k,t},

with a variance

\sigma_c^2 = \omega^2 \sigma_{11} + (1 - \omega)^2 \sigma_{22} + 2 \omega (1 - \omega) \sigma_{12},

where \sigma_{11} and \sigma_{22} are unconditional forecast error variances and \sigma_{12} is their covariance. The combining weight that minimizes the combined forecast error variance (and hence the combined forecast error MSE, by unbiasedness) is

\omega^* = \frac{\sigma_{22} - \sigma_{12}}{\sigma_{11} + \sigma_{22} - 2 \sigma_{12}}.

Note that the optimal weight is determined by both the underlying variances and covariances. Moreover, it is straightforward to show that, except in the case where one forecast encompasses the other, the forecast error variance from the optimal composite is less than min(\sigma_{11}, \sigma_{22}). Thus, in population, one has nothing to lose by combining forecasts and potentially much to gain.

In practice, one replaces the unknown variances and covariances that underlie the optimal combining weights with consistent estimates; that is, one estimates \omega^* by replacing \sigma_{ij} with \hat{\sigma}_{ij} = \frac{1}{T} \sum_{t=1}^{T} e^i_{t+k,t} e^j_{t+k,t}.

In finite samples of the size typically available, sampling error contaminates the combining weight estimates, and the problem of sampling error is exacerbated by the collinearity that typically exists among primary forecasts. Thus, while one hopes to reduce out-of-sample forecast MSE by combining, there is no guarantee. In practice, however, it turns out that forecast combination techniques often perform very well, as documented in Clemen's (1989) review of the vast literature on forecast combination.
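A sketch of the estimated Bates-Granger weight computed from two simulated error series (the error processes are illustrative assumptions); in sample, the resulting combined error variance cannot exceed that of either primary forecast:

```python
import numpy as np

def bates_granger_weight(e1, e2):
    """w* = (s22 - s12) / (s11 + s22 - 2*s12), with population moments
    replaced by sample averages of products of forecast errors."""
    s11, s22 = np.mean(e1 * e1), np.mean(e2 * e2)
    s12 = np.mean(e1 * e2)
    return (s22 - s12) / (s11 + s22 - 2.0 * s12)

rng = np.random.default_rng(3)
e1 = rng.normal(0.0, 1.0, 1000)               # errors of forecast 1
e2 = 0.3 * e1 + rng.normal(0.0, 1.5, 1000)    # correlated errors of forecast 2

w = bates_granger_weight(e1, e2)
ec = w * e1 + (1.0 - w) * e2                  # combined forecast error
```

The in-sample gain is guaranteed by construction; the out-of-sample caveat in the text is exactly that this estimated weight need not transfer to fresh data.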
Now consider the "regression method" of forecast combination. The form of the Chong-Hendry and Fair-Shiller encompassing regressions immediately suggests combining forecasts by simply regressing realizations on forecasts. Granger and Ramanathan (1984) showed that the optimal variance-covariance combining weight vector has a regression interpretation as the coefficient vector of a linear projection of y_{t+k} onto the forecasts, subject to two constraints: the weights sum to unity, and no intercept is included. In practice, of course, one simply runs the regression on available data.
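A simulated sketch of the unrestricted combining regression (intercept included, weights unconstrained; the data-generating process is an illustrative assumption). By construction, its in-sample MSE is no worse than that of either primary forecast:

```python
import numpy as np

rng = np.random.default_rng(4)
T = 400
mu = rng.normal(size=T)                 # common predictable component
y = mu + 0.5 * rng.normal(size=T)       # realizations
f1 = mu + 0.4 * rng.normal(size=T)      # two competing forecasts
f2 = mu + 0.8 * rng.normal(size=T)

# unrestricted combining regression: y = b0 + b1*f1 + b2*f2 + e
X = np.column_stack([np.ones(T), f1, f2])
b = np.linalg.lstsq(X, y, rcond=None)[0]
combined = X @ b                        # combined forecast

mse = lambda f: np.mean((y - f) ** 2)   # in-sample quadratic loss
```

Imposing the sum-to-unity, no-intercept constraints instead recovers the variance-covariance weights of the previous subsection.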
In general, the regression method is simple and flexible. There are many variations and extensions, because any "regression tool" is potentially applicable. The key is to use generalizations with sound motivation. We shall give four examples: time-varying combining weights, dynamic combining regressions, Bayesian shrinkage of combining weights toward equality, and nonlinear combining regressions.

a. Time-Varying Combining Weights
Time-varying combining weights were proposed in the variance-covariance context by Granger and Newbold (1973) and in the regression context by Diebold and Pauly (1987). In the regression framework, for example, one may undertake weighted or rolling estimation of combining regressions, or one may estimate combining regressions with explicitly time-varying parameters.

The potential desirability of time-varying weights stems from a number of sources. First, different learning speeds may lead to a particular forecast improving over time relative to others. In such situations, one naturally wants to weight the improving forecast progressively more heavily. Second, the design of various forecasting models may make them relatively better forecasting tools in some situations than in others. For example, a structural model with a highly developed wage-price sector may substantially outperform a simpler model during times of high inflation. In such times, the more sophisticated model should receive higher weight. Third, the parameters in agents' decision rules may drift over time, and certain forecasting techniques may be relatively more vulnerable to such drift.

b. Dynamic Combining Regressions
Serially correlated errors arise naturally in combining regressions. Diebold (1988) considers the covariance stationary case and argues that serial correlation is likely to appear in unrestricted regression-based forecast combining regressions when \beta_1 + \beta_2 \neq 1. More generally, it may be a good idea to allow for serial correlation in combining regressions to capture any dynamics in the variable to be forecast not captured by the various forecasts. In that regard, Coulson and Robins (1993), following Hendry and Mizon (1978), point out that a combining regression with serially correlated disturbances is a special case of a combining regression that includes lagged dependent variables and lagged forecasts, which they advocate.
c. Bayesian Shrinkage of Combining Weights Toward Equality
Simple arithmetic averages of forecasts are often found to perform very well, even relative to "optimal" composites.¹³ Obviously, the imposition of an equal-weights constraint eliminates variation in the estimated weights at the cost of possibly introducing bias. However, the evidence indicates that, under quadratic loss, the benefits of imposing equal weights often exceed this cost. With this in mind, Clemen and Winkler (1986) and Diebold and Pauly (1990) propose Bayesian shrinkage techniques to allow for the incorporation of varying degrees of prior information in the estimation of combining weights; least-squares weights and the prior weights then emerge as polar cases for the posterior-mean combining weights. The actual posterior-mean combining weights are a matrix-weighted average of those for the two polar cases. For example, using a natural conjugate normal-gamma prior, the posterior-mean combining weight vector is

\beta^{post} = (Q + F'F)^{-1} (Q \beta^{prior} + F'F \hat{\beta}),

where \beta^{prior} is the prior mean vector, Q is the prior precision matrix, F is the design matrix for the combining regression, and \hat{\beta} is the vector of least-squares combining weights. The obvious shrinkage direction is toward a measure of central tendency (e.g., the arithmetic mean). In this way, the combining weights are coaxed toward the arithmetic mean, but the data are still allowed to speak, when (and if) they have something to say.

¹³ See Winkler and Makridakis (1983), Clemen (1989), and many of the references therein.
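A sketch of the matrix-weighted-average calculation, shrinking least-squares combining weights toward equal weights (the data and the prior tightness Q = 50I are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
T = 200
mu = rng.normal(size=T)
y = mu + 0.5 * rng.normal(size=T)
F = np.column_stack([mu + 0.6 * rng.normal(size=T),   # design matrix of
                     mu + 0.6 * rng.normal(size=T)])  # two forecasts

beta_ls = np.linalg.lstsq(F, y, rcond=None)[0]  # least-squares weights
beta_prior = np.array([0.5, 0.5])               # equal-weights prior mean
Q = 50.0 * np.eye(2)                            # prior precision (tightness)

# posterior mean: matrix-weighted average of prior and least-squares weights
beta_post = np.linalg.solve(Q + F.T @ F, Q @ beta_prior + F.T @ F @ beta_ls)
```

Larger prior precision pulls the posterior weights closer to the equal-weights prior; as Q shrinks toward zero the least-squares weights re-emerge.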

d. Nonlinear Combining Regressions
There is no reason, of course, to force combining regressions to be linear, and various of the usual alternatives may be entertained. One particularly interesting possibility is proposed by Deutsch, Granger and Teräsvirta (1994), who suggest

\hat{y}^c_{t+k,t} = I(s_t = 1)(\beta_{11} \hat{y}^1_{t+k,t} + \beta_{12} \hat{y}^2_{t+k,t}) + I(s_t = 2)(\beta_{21} \hat{y}^1_{t+k,t} + \beta_{22} \hat{y}^2_{t+k,t}).

The states that govern the combining weights can depend on past forecast errors from one or both models or on various economic variables. Furthermore, the indicator weight need not be simply a binary variable; the transition between states can be made more gradual by allowing weights to be functions of the forecast errors or economic variables.

IV. Special Topics in Evaluating Economic and Financial Forecasts

Evaluating Direction-of-Change Forecasts
Direction-of-change forecasts are often used in financial and economic decision-making (e.g., Leitch and Tanner, 1991, 1995; Satchell and Timmermann, 1992). The question of how to evaluate such forecasts immediately arises. Our earlier results on tests for forecast accuracy comparison remain valid, appropriately modified, so we shall not restate them here. Instead, we note that one frequently sees assessments of whether direction-of-change forecasts "have value," and we shall discuss that issue.

The question as to whether a direction-of-change forecast has value by necessity involves comparison to a naive benchmark -- the direction-of-change forecast is compared to a "naive" coin flip (with success probability equal to the relevant marginal). Consider a 2x2 contingency table. For ease of notation, call the two states into which forecasts and realizations fall "i" and "j". Commonly, for example, i = "up" and j = "down." Figures 1 and 2 make clear our notation regarding observed cell counts and unobserved cell probabilities.
The null hypothesis that a direction-of-change forecast has no value is that the forecasts and realizations are independent, in which case P_{ij} = P_{i.} P_{.j}, for all i, j. As always, one proceeds under the null. The true cell probabilities are of course unknown, so one uses the consistent estimates \hat{P}_{i.} = O_{i.}/O and \hat{P}_{.j} = O_{.j}/O. Then one consistently estimates the expected cell counts under the null by

\hat{E}_{ij} = O \hat{P}_{i.} \hat{P}_{.j} = \frac{O_{i.} O_{.j}}{O}.

Finally, one constructs the statistic

C = \sum_{i,j=1}^{2} \frac{(O_{ij} - \hat{E}_{ij})^2}{\hat{E}_{ij}}.

Under the null, C \stackrel{d}{\rightarrow} \chi^2_1.

An intimately related test of forecast value was proposed by Merton (1981) and Henriksson and Merton (1981), who assert that a forecast has value if \frac{P_{ii}}{P_{.i}} + \frac{P_{jj}}{P_{.j}} > 1. They therefore develop an exact test of the null hypothesis that \frac{P_{ii}}{P_{.i}} + \frac{P_{jj}}{P_{.j}} = 1 against the inequality alternative. A key insight, noted in varying degrees by Schnader and Stekler (1990) and Stekler (1994), and formalized by Pesaran and Timmermann (1992), is that the Henriksson-Merton null is equivalent to the contingency-table null if the marginal probabilities are fixed at the observed relative frequencies, O_{i.}/O and O_{.j}/O. The same unpalatable assumption is necessary for deriving the exact finite-sample distribution of the Henriksson-Merton test statistic. Asymptotically, however, all is well; the square of the Henriksson-Merton statistic, appropriately normalized, is asymptotically equivalent to C, the chi-squared contingency table statistic. Moreover, the 2x2 contingency table test generalizes trivially to the NxN case, with

C_N = \sum_{i,j=1}^{N} \frac{(O_{ij} - \hat{E}_{ij})^2}{\hat{E}_{ij}}.

Under the null, C_N \stackrel{a}{\sim} \chi^2_{(N-1)^2}. A subtle point arises, however, as pointed out by Pesaran and Timmermann (1992). In the 2x2 case, one must base the test on the entire table, as the off-diagonal elements are determined by the diagonal elements, because the two elements of each row must sum to one. In the NxN case, in contrast, there is more latitude as to which cells to examine, and for purposes of forecast evaluation, it may be desirable to focus only on the diagonal cells.
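A sketch of the chi-squared contingency-table calculation for hypothetical direction-of-change counts (rows: forecasts, columns: realizations; the counts are invented for illustration):

```python
import numpy as np
from scipy import stats

def contingency_test(O):
    """C = sum (O_ij - E_ij)^2 / E_ij with E_ij = O_i. * O_.j / O;
    under independence, C is asymptotically chi-squared((N-1)^2)."""
    O = np.asarray(O, float)
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()  # expected counts
    C = np.sum((O - E) ** 2 / E)
    df = (O.shape[0] - 1) ** 2
    return C, stats.chi2.sf(C, df)

O = np.array([[60., 20.],    # forecast "up":   60 hits, 20 misses
              [15., 55.]])   # forecast "down": 15 misses, 55 hits
C, p = contingency_test(O)
```

A small p-value rejects independence of forecasts and realizations, i.e., the no-value null.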
In closing this section, we note that although the contingency table tests are often of interest in the direction-of-change context (for the same reason that tests based on Theil's U-statistic are often of interest in more standard contexts), forecast "value" in that sense is neither a necessary nor sufficient condition for forecast value in terms of a profitable trading strategy yielding significant excess returns. For example, one might beat the marginal forecast but still earn no excess returns after adjusting for transactions costs. Alternatively, one might do worse than the marginal but still make huge profits if the "hits" are "big," a point stressed by Cumby and Modest (1987).
Evaluating Probability Forecasts
Oftentimes economic and financial forecasts are issued as probabilities, such as the probability that a business cycle turning point will occur in the next year, the probability that a corporation will default on a particular bond issue this year, or the probability that the return on the S&P 500 stock index will be more than ten percent this year. A number of specialized considerations arise in the evaluation of probability forecasts, to which we now turn.

Let P_{t+k,t} be a probability forecast made at time t for an event at time t+k, and let R_{t+k} = 1 if the event occurs and zero otherwise. P_{t+k,t} is a scalar if there are only two possible events. More generally, if there are N possible events, then P_{t+k,t} is an (N-1)x1 vector.¹⁴ For notational economy, we shall focus on scalar probability forecasts.

Accuracy measures for probability forecasts are commonly called "scores," and the most common is Brier's (1950) quadratic probability score, also called the Brier score,

QPS = \frac{1}{T} \sum_{t=1}^{T} 2 (P_{t+k,t} - R_{t+k})^2.

¹⁴ The probability forecast assigned to the Nth event is implicitly determined by the restriction that the probabilities sum to 1.

Clearly, QPS \in [0, 2], and it has a negative orientation (smaller values indicate more accurate forecasts).¹⁵ To understand the QPS, note that the accuracy of any forecast refers to the expected loss when using that forecast, and typically loss depends on the deviation between forecasts and realizations. It seems reasonable, then, in the context of probability forecasting under quadratic loss, to track the average squared divergence between P_{t+k,t} and R_{t+k}, which is what the QPS does. Thus, the QPS is a rough probability-forecast analog of MSE.

The QPS is only a rough analog of MSE, however, because P_{t+k,t} is in fact not a forecast of the outcome (which is 0-1), but rather a probability assigned to it. A more natural and direct way to evaluate probability forecasts is simply to compare the forecasted probabilities to observed relative frequencies -- that is, to assess calibration. An overall measure of calibration is the global squared bias,

GSB = 2 (\bar{P} - \bar{R})^2,

where \bar{P} = \frac{1}{T} \sum_{t=1}^{T} P_{t+k,t} and \bar{R} = \frac{1}{T} \sum_{t=1}^{T} R_{t+k}. GSB \in [0, 2] with a negative orientation.

Calibration may also be examined locally in any subset of the unit interval. For example, one might check whether the observed relative frequency corresponding to probability forecasts between .6 and .7 is also between .6 and .7. One may go farther to form a weighted average of local calibration across all cells of a partition of the unit interval into J subsets chosen according to the user's interest and the specifics of the situation.¹⁶ This leads to the local squared bias measure,

LSB = \frac{1}{T} \sum_{j=1}^{J} 2 T_j (\bar{P}_j - \bar{R}_j)^2,

where T_j is the number of probability forecasts in set j, \bar{P}_j is the average forecast in set j, and \bar{R}_j is the average realization in set j, j = 1, ..., J. Note that LSB \in [0, 2], and LSB = 0 implies that GSB = 0, but not conversely.

¹⁵ The "2" that appears in the QPS formula is an artifact from the full vector case. We could of course drop it without affecting the QPS rankings of competing forecasts, but we leave it to maintain comparability to other literature.

¹⁶ For example, Diebold and Rudebusch (1989) split the unit interval into ten equal parts.
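The scores are simple averages; a sketch with simulated, well-calibrated probability forecasts (the events are drawn with exactly the forecasted probabilities, an illustrative assumption):

```python
import numpy as np

def qps(p, r):
    """Quadratic probability (Brier) score, in [0, 2]; smaller is better."""
    return float(np.mean(2.0 * (p - r) ** 2))

def gsb(p, r):
    """Global squared bias: overall calibration of forecasts vs. frequency."""
    return float(2.0 * (np.mean(p) - np.mean(r)) ** 2)

rng = np.random.default_rng(6)
p = rng.uniform(size=2000)                        # probability forecasts
r = (rng.uniform(size=2000) < p).astype(float)    # calibrated 0/1 realizations

score, bias = qps(p, r), gsb(p, r)
```

For calibrated forecasts the global squared bias is near zero, while the QPS remains positive because individual outcomes are inherently uncertain.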
Testing for adequate calibration is a straightforward matter, at least under independence of the realizations. For a given event and a corresponding sequence of forecasted probabilities {P_{t+k,t}}, create J mutually exclusive and collectively exhaustive subsets of forecasts, and denote the midpoint of each range \Pi_j, j = 1, ..., J. Let R_j denote the number of observed events when the forecast was in set j, and define "range j" calibration statistics,

Z_j = \frac{R_j - T_j \Pi_j}{\sqrt{T_j \Pi_j (1 - \Pi_j)}}, j = 1, ..., J,

and an overall calibration statistic,

Z_0 = \frac{R_0 - e_0}{\sqrt{w_0}},

where R_0 = \sum_{j=1}^{J} R_j, e_0 = \sum_{j=1}^{J} T_j \Pi_j, and w_0 = \sum_{j=1}^{J} T_j \Pi_j (1 - \Pi_j). Z_0 is a joint test of adequate local calibration across all cells, while the Z_j statistics test cell-by-cell local calibration.¹⁷ Under independence, the binomial structure would obviously imply that Z_0 \stackrel{a}{\sim} N(0, 1) and Z_j \stackrel{a}{\sim} N(0, 1), for all j = 1, ..., J. In a fascinating development, Seillier-Moiseiwitsch and Dawid (1993) show that the asymptotic normality holds much more generally, including in the dependent situations of practical relevance.

¹⁷ One may of course test for adequate global calibration by using a trivial partition of the unit interval -- the unit interval itself.
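A sketch of the local calibration statistics on a ten-cell equal partition of the unit interval, using bin midpoints for the \Pi_j (simulated, calibrated-by-construction data):

```python
import numpy as np

def calibration_z(p, r, edges):
    """Cell-by-cell (Z_j) and overall (Z_0) calibration statistics for
    probability forecasts p and 0/1 realizations r, with bin midpoints
    standing in for the Pi_j."""
    p, r = np.asarray(p), np.asarray(r)
    zj, R0, e0, w0 = [], 0.0, 0.0, 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (p >= lo) & (p < hi)
        Tj, pij = mask.sum(), 0.5 * (lo + hi)    # cell count and midpoint
        if Tj == 0:
            continue
        Rj = r[mask].sum()                       # observed events in cell j
        zj.append((Rj - Tj * pij) / np.sqrt(Tj * pij * (1 - pij)))
        R0 += Rj; e0 += Tj * pij; w0 += Tj * pij * (1 - pij)
    return np.array(zj), (R0 - e0) / np.sqrt(w0)

rng = np.random.default_rng(7)
p = rng.uniform(size=5000)
r = (rng.uniform(size=5000) < p).astype(int)     # calibrated by construction
zj, z0 = calibration_z(p, r, np.linspace(0.0, 1.0, 11))
```

For well-calibrated forecasts, each statistic should behave approximately as a standard normal draw.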
One additional feature of probability forecasts (or more precisely, of the corresponding realizations), called resolution, is of interest:

RES = \frac{1}{T} \sum_{j=1}^{J} 2 T_j (\bar{R}_j - \bar{R})^2.

RES is simply the weighted average squared divergence between the \bar{R}_j's and \bar{R}, a measure of how much the observed relative frequencies move across cells. RES \geq 0 and has a positive orientation. As shown by Murphy (1973), an informative decomposition of QPS exists,

QPS = QPS_R + LSB - RES,

where QPS_R is the QPS evaluated at P_{t+k,t} = \bar{R}. This decomposition highlights the tradeoffs between the various attributes of probability forecasts.

Just as with Theil's U-statistic for "standard" forecasts, it is sometimes informative to compare the performance of a particular probability forecast to that of a benchmark. Murphy (1974), for example, proposes the statistic

M = QPS - QPS_R = LSB - RES,

which measures the difference in accuracy between the forecast at hand and the benchmark forecast \bar{R}. Using the earlier-discussed Diebold-Mariano approach, one can also assess the significance of differences in QPS or various other measures of probability forecast accuracy across forecasters, or differences in local or global calibration across forecasters.
Evaluating Volatility Forecasts
Many interesting questions in finance, such as options pricing, risk hedging and portfolio management, explicitly depend upon the variances of asset prices. Thus, a variety of methods have been proposed for generating volatility forecasts. As opposed to point or probability forecasts, evaluation of volatility forecasts is complicated by the fact that actual conditional variances are unobservable.

A standard "solution" to this unobservability problem is to use the squared realization \epsilon^2_{t+k} as a proxy for the true conditional variance h_{t+k}, because E[\epsilon^2_{t+k} | \Omega_t] = h_{t+k}, where v_{t+k} \equiv \epsilon_{t+k} / \sqrt{h_{t+k}} \sim WN(0, 1).¹⁸ Thus, for example, MSE = \frac{1}{T} \sum_{t=1}^{T} (\epsilon^2_{t+k} - \hat{h}_{t+k,t})^2. Although MSE is often used to measure volatility forecast accuracy, Bollerslev, Engle and Nelson (1994) point out that MSE is inappropriate, because it penalizes positive volatility forecasts and negative volatility forecasts (which are meaningless) symmetrically. Two alternative loss functions that penalize volatility forecasts asymmetrically are the logarithmic loss function employed in Pagan and Schwert (1990),

LL = \frac{1}{T} \sum_{t=1}^{T} \left( \ln \epsilon^2_{t+k} - \ln \hat{h}_{t+k,t} \right)^2,

and the heteroskedasticity-adjusted MSE of Bollerslev and Ghysels (1994),

HMSE = \frac{1}{T} \sum_{t=1}^{T} \left( \frac{\epsilon^2_{t+k}}{\hat{h}_{t+k,t}} - 1 \right)^2.

Bollerslev, Engle and Nelson (1994) suggest the loss function implicit in the Gaussian quasi-maximum likelihood function often used in fitting volatility models; that is,

GMLE = \frac{1}{T} \sum_{t=1}^{T} \left( \ln \hat{h}_{t+k,t} + \frac{\epsilon^2_{t+k}}{\hat{h}_{t+k,t}} \right).

¹⁸ Although \epsilon^2_{t+k} is an unbiased estimator of h_{t+k}, it is an imprecise or "noisy" estimator. For example, if v_{t+k} \sim N(0, 1), then \epsilon^2_{t+k} = h_{t+k} v^2_{t+k} has a conditional mean of h_{t+k} because v^2_{t+k} \sim \chi^2_1. Yet, because the median of a \chi^2_1 distribution is 0.455, \epsilon^2_{t+k} < \frac{1}{2} h_{t+k} more than fifty percent of the time.
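A sketch of these loss functions evaluated on simulated data (the variance process is an illustrative assumption); the heteroskedasticity-adjusted loss correctly favors the true conditional variance over a constant-variance forecast:

```python
import numpy as np

def vol_losses(eps2, h_hat):
    """MSE, logarithmic loss, and heteroskedasticity-adjusted MSE of a
    variance forecast h_hat against squared realizations eps2."""
    mse = np.mean((eps2 - h_hat) ** 2)
    ll = np.mean((np.log(eps2) - np.log(h_hat)) ** 2)
    hmse = np.mean((eps2 / h_hat - 1.0) ** 2)
    return mse, ll, hmse

rng = np.random.default_rng(8)
h = np.exp(rng.normal(0.0, 0.5, 5000))    # true conditional variances
eps2 = h * rng.chisquare(1, 5000)         # squared realizations: h * chi2_1

perfect = vol_losses(eps2, h)             # forecast equals the true variance
constant = vol_losses(eps2, np.full(5000, eps2.mean()))
```

Note that even the perfect forecast has strictly positive loss under every criterion, reflecting the noise in the squared-realization proxy discussed in footnote 18.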

As with all forecast evaluations, the volatility forecast evaluations of most interest to forecast users are those conducted under the relevant loss function. West, Edison and Cho (1993) and Engle et al. (1993) make important contributions along those lines, proposing economic loss functions based on utility maximization and profit maximization, respectively. Lopez (1995) proposes a framework for volatility forecast evaluation that allows for a variety of economic loss functions. The framework is based on transforming volatility forecasts into probability forecasts by integrating over the assumed or estimated distribution of \epsilon_t. By selecting the range of integration corresponding to an event of interest, a forecast user can incorporate elements of her loss function into the probability forecasts.
For example, given \epsilon_{t+k} | \Omega_t \sim D(0, h_{t+k,t}) and a volatility forecast \hat{h}_{t+k,t}, an options trader interested in the event \epsilon_{t+k} \in [L_{\epsilon,t+k}, U_{\epsilon,t+k}] would generate the probability forecast

P_{t+k,t} = \int_{l_{\epsilon,t+k}}^{u_{\epsilon,t+k}} f(z_{t+k}) \, dz_{t+k},

where z_{t+k} is the standardized innovation, f(z_{t+k}) is the functional form of D(0, 1), and [l_{\epsilon,t+k}, u_{\epsilon,t+k}] is the standardized range of integration. In contrast, a forecast user interested in the behavior of the underlying asset, y_{t+k} = \mu_{t+k,t} + \epsilon_{t+k}, where \mu_{t+k,t} = E[y_{t+k} | \Omega_t], might generate the probability forecast

P_{t+k,t} = \int_{l_{y,t+k}}^{u_{y,t+k}} f(z_{t+k}) \, dz_{t+k},

where \mu_{t+k,t} is the forecasted conditional mean and [l_{y,t+k}, u_{y,t+k}] is the standardized range of integration.
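A sketch of the transformation under an assumed Gaussian D(0, 1): the probability forecast is the integral over the standardized range, here computed with the normal CDF (the variance forecast and event bounds are invented for illustration):

```python
import numpy as np
from scipy import stats

def prob_forecast(h_hat, lo, hi, mu=0.0):
    """P(y_{t+k} in [lo, hi]) given conditional mean mu, variance forecast
    h_hat, and (assumed) standard normal standardized innovations."""
    sd = np.sqrt(h_hat)
    z_lo, z_hi = (lo - mu) / sd, (hi - mu) / sd   # standardized range
    return float(stats.norm.cdf(z_hi) - stats.norm.cdf(z_lo))

p = prob_forecast(h_hat=4.0, lo=-2.0, hi=2.0)     # a one-sigma event when sd = 2
```

Other assumed innovation distributions simply swap in a different CDF.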
Once generated, these probability forecasts can be evaluated using the scoring rules described above, and the significance of differences across models can be tested using the Diebold-Mariano tests. The key advantage of this framework is that it allows the evaluation to be based on observable events and thus avoids proxying for the unobservable true variance.

The Lopez approach to volatility forecast evaluation is based on time-varying probabilities assigned to a fixed interval. Alternatively, one may fix the probabilities and vary the widths of the intervals, as in traditional confidence interval construction. In that regard, Christoffersen (1995) suggests exploiting the fact that if a (1-\alpha)% confidence interval (denoted [L_{y,t+1}, U_{y,t+1}]) is correctly calibrated, then the "hits" are iid Bernoulli(1-\alpha). That is, if one defines

I_{t+1,t} = 1 if y_{t+1} \in [L_{y,t+1}, U_{y,t+1}], and I_{t+1,t} = 0 otherwise,

then I_{t+1,t} is one with probability (1-\alpha) and zero with probability \alpha. Given the T values of the indicator variable for the T forecast intervals, one can determine whether the forecasted intervals are well calibrated by testing the hypothesis that the indicator variable is an iid Bernoulli(1-\alpha) random variable.

The iid property can be checked using the group test of David (1947), which is uniformly most powerful against first-order dependence. Define a group as a string of consecutive zeros or ones, and let k be the number of groups in the sequence {I_{t+1,t}}. Under the null that the sequence is iid, the distribution of k given the total number of ones, n_1, and

the total number of zeros, n_0, is

P(k = 2t) = \frac{2 \binom{n_1 - 1}{t - 1} \binom{n_0 - 1}{t - 1}}{\binom{n}{n_1}} for k even,

and

P(k = 2t + 1) = \frac{\binom{n_1 - 1}{t - 1} \binom{n_0 - 1}{t} + \binom{n_1 - 1}{t} \binom{n_0 - 1}{t - 1}}{\binom{n}{n_1}} for k odd,

where n = n_0 + n_1.

A likelihood ratio test of the Bernoulli hypothesis (that is, a joint test of iid behavior and correct coverage) is readily constructed by comparing the maximized log likelihoods of restricted and unrestricted Markov processes for the indicator series {I_{t+1,t}}. The unrestricted transition probability matrix is

\Pi = [ \pi_{00}, 1 - \pi_{00} ; 1 - \pi_{11}, \pi_{11} ],

and the restricted transition probability matrix is

\Pi_R = [ \alpha, 1 - \alpha ; \alpha, 1 - \alpha ],

where rows index the current value of the indicator and columns its next value. The corresponding approximate likelihood functions are¹⁹

L(\Pi | I) = \pi_{00}^{n_{00}} (1 - \pi_{00})^{n_{01}} (1 - \pi_{11})^{n_{10}} \pi_{11}^{n_{11}}

and

L(\Pi_R | I) = \alpha^{(n_{00} + n_{10})} (1 - \alpha)^{(n_{01} + n_{11})},

where n_{ij} is the number of observed transitions from i to j and I is the indicator sequence. The likelihood ratio statistic is

LR = 2 [\ln L(\hat{\Pi} | I) - \ln L(\hat{\Pi}_R | I)].

Under the null hypothesis, LR \stackrel{a}{\sim} \chi^2_2, where \hat{\Pi} and \hat{\Pi}_R are the maximum-likelihood estimates.

¹⁹ The likelihoods are approximate because the initial terms are dropped.
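A sketch of the likelihood-ratio calculation from a simulated hit sequence for a well-calibrated 95% interval (the hit process is an illustrative assumption):

```python
import numpy as np

def interval_lr_test(hits, alpha):
    """LR test that interval hits (1 = realization inside) are iid
    Bernoulli(1 - alpha): a joint test of coverage and independence."""
    hits = np.asarray(hits, int)
    n = np.zeros((2, 2))
    for i, j in zip(hits[:-1], hits[1:]):   # count transitions n_ij
        n[i, j] += 1
    eps = 1e-12
    p01 = n[0, 1] / max(n[0].sum(), 1.0)    # MLE prob of moving 0 -> 1
    p11 = n[1, 1] / max(n[1].sum(), 1.0)    # MLE prob of staying at 1

    def loglik(q0, q1):                     # q_i = P(next hit = 1 | state i)
        return (n[0, 0] * np.log(max(1 - q0, eps)) + n[0, 1] * np.log(max(q0, eps))
                + n[1, 0] * np.log(max(1 - q1, eps)) + n[1, 1] * np.log(max(q1, eps)))

    return 2.0 * (loglik(p01, p11) - loglik(1 - alpha, 1 - alpha))

rng = np.random.default_rng(9)
hits = (rng.uniform(size=1000) < 0.95).astype(int)   # calibrated 95% interval
lr = interval_lr_test(hits, alpha=0.05)              # ~ chi-squared(2) under null
```

Large LR values indicate either incorrect coverage, first-order dependence in the hits, or both.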

V. Concluding Remarks
Three modern themes permeate this survey, so it is worth highlighting them explicitly. The first theme is that various types of forecasts, such as probability forecasts and volatility forecasts, are becoming more integrated into economic and financial decision making, leading to a derived demand for new types of forecast evaluation procedures.

The second theme is the use of exact finite-sample hypothesis tests, typically based on distribution-free nonparametrics. We explicitly sketched such tests in the context of forecast-error unbiasedness, k-dependence, orthogonality to available information, and, when more than one forecast is available, in the context of testing equality of expected loss, testing whether a direction-of-change forecast has value, etc.

The third theme is use of the relevant loss function. This idea arose in many places, such as in forecastability measures and forecast accuracy comparison tests, and may readily be introduced in others, such as orthogonality tests, encompassing tests and combining regressions. In fact, an integrated tool kit for estimation, forecasting, and forecast evaluation (and hence model selection and nonnested hypothesis testing) under the relevant loss function is rapidly becoming available; see Weiss and Andersen (1984), Weiss (1995), Diebold and Mariano (1995), Christoffersen and Diebold (1994), and Diebold, Ohanian and Berkowitz (1995).

References
Annstrong, J.S. and Fildes, R., I 995. "On the Selection of Error Measu
res for Comparisons
Among Forecasting Methods," Journal of Forecasting, 14, 67-71.
Auerbach, A., 1994. "The U.S. Fiscal Problem: Where We Are, How
We Got Here and
Where We're Going," NEER Macro Annual. Cambridge, Mass.: MIT
Press.
Bates, J.M. and Granger, C.W .J., 1969. "The Combination of Foreca
sts," Operations
Research Quarterly, 20, 451-468.
Bollerslev, T., Engle, R.F. and Nelson, D.B., 1994. "ARCH Model
s," in R.F. Engle and D.
McFadden (eds.), Handbook of Econometrics, Volume 4. Amsterdam:
North-Holland.
Bollerslev, T. and Ghysels, E., 1994. "Periodic Autoregressive Condi
tional
Heteroskedasticity," Working Paper #178, Department of Finance, Kellog
g School,
Northwestern University.
Bonham, C. and Cohen, R., 1995. "Testing the Rationality of Price
Forecasts: Comment,"
American Economic Review, 85, 284-289.
Bradley, J.V., 1968. Distribution-Free Statistical Tests. Englewood
Cliffs, New Jersey:
Prentice-Hall.
Brier, G.W., 1950. "Verification of Forecasts Expressed in Tenns of
Probability," Monthly
Weather Review, 75, 1-3.
Brown, B.W. and Maita!, S., 1981. "What Do Economists Know?
An Empirical Study of
Experts' Expectations," Econometrica, 49, 491-504.
Campbell, B. and Dufour, J.-M., 1991. "Over-Rejections in Rational
Expectations Models:
A Nonparametric Approach to the Mankiw-Shapiro Problem," Econo
mics Letters, 35,
285-290.
Campbell, B. and Dufour, J.-M., 1995. "Exact Nonparametric Orthog
onality and Random
Walk Tests," Review of Economics and Statistics, 77, 1-16.
Campbell, B. and Ghysels, E., 1995. "Federal Budget Projections:
A Nonparametric
Assessment of Bias and Efficiency," Review of Economics and Statist
ics, 77, 17-31.
Campbell, J.Y. and Mankiw, N.G., 1987. "Are Output Fluctuations
Transitory?," Quarterly
Journal of Economics, 102, 857-880.
Chong, Y.Y. and Hendry, D.F., 1986. "Econometric Evaluation of
Linear Macroeconomic
37

Models," Review of Economic Studies, 53, 671-690.
Christoffersen, P.F., 1995. "Predicting Uncertainty in the Foreign Exchange Markets," Manuscript, Department of Economics, University of Pennsylvania.
Christoffersen, P.F. and Diebold, F.X., 1994. "Optimal Prediction under Asymmetric Loss," Technical Working Paper #167, National Bureau of Economic Research, Cambridge, Mass.
Clemen, R.T., 1989. "Combining Forecasts: A Review and Annotated Bibliography," International Journal of Forecasting, 5, 559-581.
Clemen, R.T. and Winkler, R.L., 1986. "Combining Economic Forecasts," Journal of Business and Economic Statistics, 4, 39-46.
Clements, M.P. and Hendry, D.F., 1993. "On the Limitations of Comparing Mean Squared Forecast Errors," Journal of Forecasting, 12, 617-638.
Cochrane, J.H., 1988. "How Big is the Random Walk in GNP?," Journal of Political Economy, 96, 893-920.
Cooper, D.M. and Nelson, C.R., 1975. "The Ex-Ante Prediction Performance of the St. Louis and F.R.B.-M.I.T.-Penn Econometric Models and Some Results on Composite Predictors," Journal of Money, Credit and Banking, 7, 1-32.
Coulson, N.E. and Robins, R.P., 1993. "Forecast Combination in a Dynamic Setting," Journal of Forecasting, 12, 63-67.
Cumby, R.E. and Huizinga, J., 1992. "Testing the Autocorrelation Structure of Disturbances in Ordinary Least Squares and Instrumental Variables Regressions," Econometrica, 60, 185-195.
Cumby, R.E. and Modest, D.M., 1987. "Testing for Market Timing Ability: A Framework for Forecast Evaluation," Journal of Financial Economics, 19, 169-189.
David, F.N., 1947. "A Power Function for Tests of Randomness in a Sequence of Alternatives," Biometrika, 34, 335-339.
Deutsch, M., Granger, C.W.J. and Teräsvirta, T., 1994. "The Combination of Forecasts Using Changing Weights," International Journal of Forecasting, 10, 47-57.
Diebold, F.X., 1988. "Serial Correlation and the Combination of Forecasts," Journal of Business and Economic Statistics, 6, 105-111.

Diebold, F.X., 1993. "On the Limitations of Comparing Mean Square Forecast Errors: Comment," Journal of Forecasting, 12, 641-642.
Diebold, F.X. and Lindner, P., 1995. "Fractional Integration and Interval Prediction," Manuscript, Department of Economics, University of Pennsylvania.
Diebold, F.X. and Mariano, R., 1995. "Comparing Predictive Accuracy," Journal of Business and Economic Statistics, forthcoming.
Diebold, F.X., Ohanian, L. and Berkowitz, J., 1995. "Dynamic Equilibrium Economies: A Framework for Comparing Models and Data," Technical Working Paper #174, National Bureau of Economic Research, Cambridge, Mass.
Diebold, F.X. and Pauly, P., 1987. "Structural Change and the Combination of Forecasts," Journal of Forecasting, 6, 21-40.
Diebold, F.X. and Pauly, P., 1990. "The Use of Prior Information in Forecast Combination," International Journal of Forecasting, 6, 503-508.
Diebold, F.X. and Rudebusch, G.D., 1989. "Scoring the Leading Indicators," Journal of Business, 62, 369-391.
Dufour, J.-M., 1981. "Rank Tests for Serial Dependence," Journal of Time Series Analysis, 2, 117-128.
Engle, R.F., Hong, C.-H., Kane, A. and Noh, J., 1993. "Arbitrage Valuation of Variance Forecasts with Simulated Options," in D. Chance and R. Tripp (eds.), Advances in Futures and Options Research. Greenwich, CT: JAI Press.
Engle, R.F. and Kozicki, S., 1993. "Testing for Common Features," Journal of Business and Economic Statistics, 11, 369-395.
Fair, R.C. and Shiller, R.J., 1989. "The Informational Content of Ex Ante Forecasts," Review of Economics and Statistics, 71, 325-331.
Fair, R.C. and Shiller, R.J., 1990. "Comparing Information in Forecasts from Econometric Models," American Economic Review, 80, 375-389.
Fama, E.F., 1970. "Efficient Capital Markets: A Review of Theory and Empirical Work," Journal of Finance, 25, 383-417.
Fama, E.F., 1975. "Short-Term Interest Rates as Predictors of Inflation," American Economic Review, 65, 269-282.

Fama, E.F., 1991. "Efficient Capital Markets: II," Journal of Finance, 46, 1575-1617.
Fama, E.F. and French, K.R., 1988. "Permanent and Temporary Components of Stock Prices," Journal of Political Economy, 96, 246-273.
Granger, C.W.J. and Newbold, P., 1973. "Some Comments on the Evaluation of Economic Forecasts," Applied Economics, 5, 35-47.
Granger, C.W.J. and Newbold, P., 1976. "Forecasting Transformed Series," Journal of the Royal Statistical Society B, 38, 189-203.
Granger, C.W.J. and Newbold, P., 1986. Forecasting Economic Time Series, Second Edition. San Diego: Academic Press.
Granger, C.W.J. and Ramanathan, R., 1984. "Improved Methods of Combining Forecasts," Journal of Forecasting, 3, 197-204.
Hansen, L.P. and Hodrick, R.J., 1980. "Forward Exchange Rates as Optimal Predictors of Future Spot Rates: An Econometric Analysis," Journal of Political Economy, 88, 829-853.
Hendry, D.F. and Mizon, G.E., 1978. "Serial Correlation as a Convenient Simplification, Not a Nuisance: A Comment on a Study of the Demand for Money by the Bank of England," Economic Journal, 88, 549-563.
Henriksson, R.D. and Merton, R.C., 1981. "On Market Timing and Investment Performance II: Statistical Procedures for Evaluating Forecasting Skills," Journal of Business, 54, 513-533.
Keane, M.P. and Runkle, D.E., 1990. "Testing the Rationality of Price Forecasts: New Evidence from Panel Data," American Economic Review, 80, 714-735.
Leitch, G. and Tanner, J.E., 1991. "Economic Forecast Evaluation: Profits Versus the Conventional Error Measures," American Economic Review, 81, 580-590.
Leitch, G. and Tanner, J.E., 1995. "Professional Economic Forecasts: Are They Worth Their Costs?," Journal of Forecasting, 14, 143-157.
LeRoy, S.F. and Porter, R.D., 1981. "The Present-Value Relation: Tests Based on Implied Variance Bounds," Econometrica, 49, 555-574.
Lopez, J.A., 1995. "Evaluating the Predictive Accuracy of Volatility Models," Manuscript, Department of Economics, University of Pennsylvania.

Mark, N.C., 1995. "Exchange Rates and Fundamentals: Evidence on Long-Horizon Predictability," American Economic Review, 85, 201-218.
McCulloch, R. and Rossi, P.E., 1990. "Posterior, Predictive and Utility-Based Approaches to Testing the Arbitrage Pricing Theory," Journal of Financial Economics, 28, 7-38.
Meese, R.A. and Rogoff, K., 1988. "Was it Real? The Exchange Rate-Interest Differential Relation Over the Modern Floating-Rate Period," Journal of Finance, 43, 933-948.
Merton, R.C., 1981. "On Market Timing and Investment Performance I: An Equilibrium Theory of Value for Market Forecasts," Journal of Business, 54, 363-406.
Mincer, J. and Zarnowitz, V., 1969. "The Evaluation of Economic Forecasts," in J. Mincer (ed.), Economic Forecasts and Expectations. New York: National Bureau of Economic Research.
Murphy, A.H., 1973. "A New Vector Partition of the Probability Score," Journal of Applied Meteorology, 12, 595-600.
Murphy, A.H., 1974. "A Sample Skill Score for Probability Forecasts," Monthly Weather Review, 102, 48-55.
Murphy, A.H. and Winkler, R.L., 1987. "A General Framework for Forecast Verification," Monthly Weather Review, 115, 1330-1338.
Murphy, A.H. and Winkler, R.L., 1992. "Diagnostic Verification of Probability Forecasts," International Journal of Forecasting, 7, 435-455.
Nelson, C.R., 1972. "The Prediction Performance of the F.R.B.-M.I.T.-Penn Model of the U.S. Economy," American Economic Review, 62, 902-917.
Nelson, C.R. and Schwert, G.W., 1977. "Short-Term Interest Rates as Predictors of Inflation: On Testing the Hypothesis that the Real Rate of Interest is Constant," American Economic Review, 67, 478-486.
Newbold, P. and Granger, C.W.J., 1974. "Experience with Forecasting Univariate Time Series and the Combination of Forecasts," Journal of the Royal Statistical Society A, 137, 131-146.
Pagan, A.R. and Schwert, G.W., 1990. "Alternative Models for Conditional Stock Volatility," Journal of Econometrics, 45, 267-290.
Pesaran, M.H., 1974. "On the General Problem of Model Selection," Review of Economic Studies, 41, 153-171.

Pesaran, M.H. and Timmermann, A., 1992. "A Simple Nonparametric Test of Predictive Performance," Journal of Business and Economic Statistics, 10, 461-465.
Ramsey, J.B., 1969. "Tests for Specification Errors in Classical Linear Least-Squares Regression Analysis," Journal of the Royal Statistical Society B, 31, 350-371.
Satchell, S. and Timmermann, A., 1992. "An Assessment of the Economic Value of Nonlinear Foreign Exchange Rate Forecasts," Financial Economics Discussion Paper FE-6/92, Birkbeck College, Cambridge University.
Schnader, M.H. and Stekler, H.O., 1990. "Evaluating Predictions of Change," Journal of Business, 63, 99-107.
Seillier-Moiseiwitsch, F. and Dawid, A.P., 1993. "On Testing the Validity of Sequential Probability Forecasts," Journal of the American Statistical Association, 88, 355-359.
Shiller, R.J., 1979. "The Volatility of Long-Term Interest Rates and Expectations Models of the Term Structure," Journal of Political Economy, 87, 1190-1219.
Stekler, H.O., 1987. "Who Forecasts Better?," Journal of Business and Economic Statistics, 5, 155-158.
Stekler, H.O., 1994. "Are Economic Forecasts Valuable?," Journal of Forecasting, 13, 495-505.
Theil, H., 1961. Economic Forecasts and Policy. Amsterdam: North-Holland.
Weiss, A.A., 1995. "Estimating Time Series Models Using the Relevant Cost Function," Manuscript, Department of Economics, University of Southern California.
Weiss, A.A. and Andersen, A.P., 1984. "Estimating Forecasting Models Using the Relevant Forecast Evaluation Criterion," Journal of the Royal Statistical Society A, 147, 484-487.
West, K.D., 1994. "Asymptotic Inference About Predictive Ability," Manuscript, Department of Economics, University of Wisconsin.
West, K.D., Edison, H.J. and Cho, D., 1993. "A Utility-Based Comparison of Some Models of Exchange Rate Volatility," Journal of International Economics, 35, 23-45.
Winkler, R.L. and Makridakis, S., 1983. "The Combination of Forecasts," Journal of the Royal Statistical Society A, 146, 150-157.

Figure 1
Observed Cell Counts
(rows index the forecast direction, columns the actual direction;
a dot denotes summation over that index)

              Actual i    Actual j    Marginal
Forecast i      O_ii        O_ij        O_i.
Forecast j      O_ji        O_jj        O_j.
Marginal        O_.i        O_.j      Total: O

Figure 2
Unobserved Cell Probabilities
(rows index the forecast direction, columns the actual direction;
a dot denotes summation over that index)

              Actual i    Actual j    Marginal
Forecast i      P_ii        P_ij        P_i.
Forecast j      P_ji        P_jj        P_j.
Marginal        P_.i        P_.j      Total: 1

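The direction-of-change evaluation methods cited above (e.g., Henriksson and Merton, 1981; Schnader and Stekler, 1990; Pesaran and Timmermann, 1992) are built on exactly the 2x2 table of Figures 1 and 2. As a minimal sketch (function names are ours, not from the paper), the observed cell counts O_ij and the associated chi-square test of independence between forecast and actual directions can be computed as follows, with each direction coded 0 or 1:

```python
def contingency_2x2(forecasts, actuals):
    """Observed cell counts O_ij of Figure 1.

    Rows index the forecast direction, columns the actual direction;
    each direction is coded 0 or 1.
    """
    table = [[0, 0], [0, 0]]
    for f, a in zip(forecasts, actuals):
        table[f][a] += 1
    return table

def chi_square_stat(table):
    """Chi-square statistic for independence of forecast and actual direction.

    Expected counts are formed from the marginals, E_ij = O_i. * O_.j / O,
    and the statistic sums (O_ij - E_ij)^2 / E_ij over the four cells.
    Under the null of no directional forecasting ability, it is
    asymptotically chi-square with one degree of freedom.
    """
    n = sum(sum(row) for row in table)
    row_m = [sum(row) for row in table]        # O_i., O_j.
    col_m = [sum(col) for col in zip(*table)]  # O_.i, O_.j
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_m[i] * col_m[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    return stat
```

For a forecaster whose predicted direction always matches the realized direction, all mass sits on the diagonal and the statistic is large; for a forecaster whose directions are unrelated to the realizations, the observed counts are close to the expected counts and the statistic is near zero. (The sketch assumes every marginal is nonzero, so that the expected counts are well defined.)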