Center for Quantitative Economic Research

Comparing and Evaluating Bayesian
Predictive Distributions of Asset Returns
John Geweke and Gianni Amisano
CQER Working Paper 09-04
October 2009

Abstract: Bayesian inference in a time series model provides exact, out-of-sample predictive
distributions that fully and coherently incorporate parameter uncertainty. This study compares
and evaluates Bayesian predictive distributions from alternative models, using as an illustration
five alternative models of asset returns applied to daily S&P 500 returns from 1972 through 2005.
The comparison exercise uses predictive likelihoods and is inherently Bayesian. The evaluation
exercise uses the probability integral transform and is inherently frequentist. The illustration
shows that the two approaches can be complementary, each identifying strengths and weaknesses
in models that are not evident using the other.
JEL classification: C11, C53
Key words: forecasting, GARCH, inverse probability transform, Markov-mixture, predictive
likelihood, S&P 500 returns, stochastic volatility

The authors gratefully acknowledge financial support from NSF grant SBR-0720547. The views expressed here are
the authors’ and not necessarily those of the Federal Reserve Bank of Atlanta or the Federal Reserve System. Any
remaining errors are the authors’ responsibility.
Please address questions regarding content to John Geweke, Departments of Statistics and Economics, W210 Pappajohn Business Bldg., University of Iowa, Iowa City, IA 52242-1000, and Gianni Amisano, University of Brescia and European Central Bank.
Center for Quantitative Economic Research Working Papers from the Federal Reserve Bank of Atlanta are available on the Atlanta Fed's Web site. Click "Economic Research & Data," "CQER," and then "Publications." Use the WebScriber Service to receive e-mail notifications about new papers.


1 Introduction and motivation

Probability distributions for magnitudes that are unknown at the time a decision
is made, but will become known afterward, are required for the formal solutions of
most decision problems in economics – in the private and public sectors as well as
academic contexts. Increasing awareness of this setting, combined with advances in
modeling and computing, is leading to a sustained emphasis on these distributions in
econometric research (Diebold et al., 1998; Christoffersen, 1998); Corradi and Swanson
(2006) provides a survey. For important decisions there are typically competing
models and methods that produce predictive distributions. The question of how these
predictive distributions should be compared and evaluated then becomes relevant.
This study compares and evaluates the quality of predictive distributions over multiple horizons for asset returns using five different models. We use the daily returns of the Standard and Poor's (S&P) 500 index over the period 1972-2005, a series that is widely employed in academic work and is also one of the most important indexes in the finance industry. The models compared are two from the autoregressive conditional heteroscedasticity (ARCH) family, a stochastic volatility model, the Markov normal mixture model, and an extension of the last model that we have described in detail elsewhere (Geweke and Amisano, 2007).
The basis of comparison used in this study is the predictive likelihood function
– the model’s probability density for the return at the relevant horizon before it is
observed, evaluated at the actual value of the return after it is observed. This function
lies at the heart of the Bayesian calculus for posterior model probabilities, reflecting
the logical positivism of the Bayesian approach: a model is as good as its predictions.
Each model produces a predictive distribution for each return ex ante, and therefore
a predictive likelihood ex post. Comparison of these predictive likelihoods across
models decomposes the Bayes factor one observation at a time. One of the objectives
of this study is to illustrate how this decomposition provides insight into conventional
Bayesian model comparison. The study does this in Sections 3 and 4.
The basis of evaluation used in this study is the probability integral transform (PIT), which is the inverse of the sequence of ex ante predictive cumulative distribution functions (c.d.f.s) evaluated at the sequence of actual returns ex post. If returns are in fact generated from this c.d.f. sequence then the ex ante distribution of the PIT is i.i.d. uniform. As a practical matter this condition will not be met precisely even in ideal circumstances: while observed values might come from the model under consideration, uncertainty about parameter values implies that the predictive distributions will not be exactly the same as in the data generating process. Nevertheless the PIT provides a well-recognized and useful paradigm against which any sequence of predictive distributions can be evaluated. A second objective of this study is to illustrate how the PIT also provides insight into the deficiencies of models. The study does this in Sections 5 and 6.
Model comparison using predictive likelihoods and model evaluation using the PIT

are quite distinct methodologically. The predictive likelihood function is inherently
Bayesian: it is a component of the likelihood function, integrated over the posterior
distribution of the unobservables (parameters and latent variables) at the time the
prediction is made. The product of predictive likelihood functions over all observations in the sample is the marginal likelihood of the model over the same observations.
By contrast the PIT is inherently frequentist, comparing a function of the data with
the ex ante distribution that function would have if the data were generated by a
process coinciding with the model used by the analyst. While the methods are distinct, both can be applied to predictive distributions arising from Bayesian inference,
which we do in this study.
This study builds upon and is distinct from the most closely related recent work on
this topic. Amisano and Giacomini (2007) use frequentist tests based on weighted log
predictive distributions to compare alternative models. Their method can be applied
either to Bayesian or frequentist predictive distributions. However, they do not use the PIT to evaluate models. Other studies have employed both the predictive likelihood
and the PIT to compare and evaluate predictive densities, some with large samples
of daily returns like the one used in this study. Hong et al. (2004) is perhaps closest
in these dimensions; see also Bauwens et al. (2004). However none of these studies
incorporate parameter uncertainty in their predictive distributions. As discussed in
the next section, the coherent combination of intrinsic and parameter uncertainty is
the hallmark of Bayesian predictive distributions.


2 Data, models and Bayesian predictive distributions

This study compares and evaluates the performance of five alternative predictive distributions of asset returns using daily percent log returns of the S&P 500 index. The daily index p_t for 1972-2005 was collected from three different electronic sources: the Wharton WRDS data base; Thompson/Data Stream; and Yahoo Finance. For days on which all three sources did not agree we consulted the periodical publication Security Price Index Record of Standard & Poor's Statistical Service. From the price series {p_t} assembled in this way the daily percent log returns y_t = 100 log(p_t / p_{t-1}) were constructed. The total number of returns in the sample is 8574.
Each of the five alternative predictive distributions arises from a model A for the time series of S&P 500 asset returns y_T = (y_1, ..., y_T)'. Each model A for a time series y_T specifies a density p(y_T | θ_A, A) for the observables y_T conditional on a vector of unobservables θ_A ∈ Θ_A that may include latent variables as well as parameters. It also specifies a prior density p(θ_A | A), and through the usual Bayesian calculus the posterior distribution of θ_A from a sample of t ≤ T observations is

    p(θ_A | y_t^o, A) ∝ p(θ_A | A) p(y_t^o | θ_A, A).    (1)

The superscript o in (1) denotes the ex post, observed, value of y_t; ex post y_t = y_t^o is known and fixed whereas ex ante it is random. The posterior distribution represented in (1) can be accessed using a posterior simulator that produces ergodic sequences {θ_{A,t}^{(m)}} (m = 1, ..., M) for each t considered.
Conditional on the data y_{t-1}^o and the model A the predictive density for y_t is

    p(y_t | y_{t-1}^o, A) = ∫ p(y_t | y_{t-1}^o, θ_A, A) p(θ_A | y_{t-1}^o, A) dθ_A.    (2)

This distribution can be accessed by simulating one value y_t^{(m)} from each of the distributions represented by the density p(y_t | y_{t-1}^o, θ_{A,t-1}^{(m)}, A) (m = 1, ..., M). This simulation is usually straightforward and less demanding than the simulation of θ_{A,t}^{(m)} from (1). The predictive density (2) integrates uncertainty about the vector of unobservables θ_A and intrinsic uncertainty about the future value y_t, both conditional on the history of returns y_{t-1}^o and the assumptions of the model A. This integration is a hallmark of Bayesian predictive distributions. The use of simulation methods to produce {θ_{A,t}^{(m)}} and then {y_t^{(m)}} makes these predictive distributions applicable in real time. A key advantage of Bayesian predictive distributions is the combination of the two sources of uncertainty in a logically coherent framework.
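The simulation just described can be sketched in a few lines. The Gaussian returns model, the posterior draws, and all numeric values below are hypothetical stand-ins for a real posterior simulator's output; only the mechanics of mixing predictive draws over posterior draws, as in (2), are the point.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for posterior simulator output: M draws of
# theta = (mu, sigma) for a simple Gaussian returns model. The paper's
# models are richer, but the mechanics of (2) are identical.
M = 5000
mu_draws = rng.normal(0.03, 0.02, size=M)                   # posterior draws of the mean
sigma_draws = np.sqrt(rng.gamma(50.0, 1.0 / 50.0, size=M))  # posterior draws of the s.d.

# One predictive draw y_t^(m) per posterior draw theta^(m): averaging over
# the posterior folds parameter uncertainty into the predictive distribution.
y_pred = rng.normal(mu_draws, sigma_draws)

# The sample y_pred approximates the Bayesian predictive density (2).
pred_mean = y_pred.mean()
pred_sd = y_pred.std()
```

By construction the predictive spread combines intrinsic volatility (through sigma) with posterior uncertainty about mu, rather than conditioning on a single parameter value.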
To consider two alternatives suppose, first, that one were to use the predictive density

    p(y_t | y_{t-1}^o, θ̂_A^{(t-1)}, A)    (3)

where the estimate θ̂_A^{(t-1)}, a function of y_{t-1}^o, replaces the unknown θ_A. This does not account for parameter uncertainty at all. In a second alternative one could work with

    ∫ p(y_t | y_{t-1}^o, θ̂_A^{(t-1)}, A) p̂(θ̂_A^{(t-1)} | θ_A) dθ̂_A^{(t-1)}    (4)

where p̂(θ̂_A^{(t-1)} | θ_A) is an asymptotic approximation of the sampling distribution of the estimator θ̂_A^{(t-1)}. This alternative conditions on the actual history y_{t-1}^o in the first component of the integration, while treating the history as a random variable in the second component. The resulting distribution for y_t thus has no clear interpretation. For further discussion of these issues, see Geweke and Whiteman (2006).
The first model A considered in this study is the generalized autoregressive conditional heteroscedasticity model with parameters p = q = 1 in which the distribution of the innovations is Gaussian ("GARCH"). The second model is the same as the first, except that the distribution of the innovations is Student-t ("t-GARCH"). The third model is the stochastic volatility model of Jacquier et al. (1994) ("SV").
The fourth model is a Markov normal mixture model ("MNM"), which dates at least to Lindgren (1978) and has since been applied in statistics and econometrics (Tyssedal and Tjøstheim, 1988; Chib, 1996; Ryden et al., 1998; Weigend and Shi, 2001). In the MNM model a latent state variable s_t takes on discrete values s_t = 1, ..., m0 and obeys a first-order discrete Markov process P(s_t = j | s_{t-1} = i) = p_{ij}, with

    y_t | (s_t = j) ~ N(μ_j, σ_j²).    (5)

The model is used here with m0 = 4 components, which is the choice made by Weigend and Shi (2001) using S&P 500 return data.
The final model is a generalization of the MNM model proposed in Geweke and Amisano (2007). This generalization replaces the normal distribution in (5) with a conventional finite mixture of normal distributions. The latent variable s_t becomes the first component s_{t1} of a bivariate latent state vector s_t = (s_{t1}, s_{t2}); thus P(s_{t1} = j | s_{t-1,1} = i) = p_{ij} (i, j = 1, ..., m1). For the second component

    P(s_{t2} = j | s_{t1} = i) = r_{ij} (i = 1, ..., m1; j = 1, ..., m2),
    y_t | (s_{t1} = i, s_{t2} = j) ~ N(μ_{ij}, σ_{ij}²) (i = 1, ..., m1; j = 1, ..., m2).

This generalization is termed the hierarchical normal mixture model ("HMNM") in Geweke and Amisano (2007). The model is used here with m1 = m2 = 5, the choice being made based on predictive likelihoods as explained in the next section. The HMNM model can also be regarded as a first-order MNM model with m0 = m1 · m2 states and with substantial structure imposed on the Markov transition matrix. Yet a third interpretation is that of an artificial neural network with two hidden layers. Geweke and Amisano (2007) provides further detail about the model, prior distributions, and the posterior simulation algorithm.
These five models are illustrative examples. Bayesian predictive distributions arise naturally in any complete model for time series y_t that specifies a conditional distribution of the form p(y_t | y_{t-1}, θ_A, A) and a prior distribution of the form p(θ_A | A).


3 Model comparison with predictive likelihood functions

The one-step-ahead predictive likelihood, which can be evaluated only at time t or later, is the real number

    PL_A(t) = p(y_t^o | y_{t-1}^o, A) = ∫ p(y_t^o | y_{t-1}^o, θ_A, A) p(θ_A | y_{t-1}^o, A) dθ_A.    (6)

In most time series models evaluation of p(y_t^o | y_{t-1}^o, θ_A, A) is straightforward, leading to the approximation of (6),

    M^{-1} Σ_{m=1}^{M} p(y_t^o | y_{t-1}^o, θ_{A,t-1}^{(m)}, A),    (7)

using an ergodic sequence {θ_{A,t-1}^{(m)}} from a posterior simulator.
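Approximation (7) is just a Monte Carlo average of conditional densities over posterior draws. A minimal sketch, again with a hypothetical Gaussian model and simulated "posterior" draws standing in for MCMC output:

```python
import numpy as np

rng = np.random.default_rng(1)

def normal_pdf(y, mu, sigma):
    """Density of N(mu, sigma^2) at y (vectorized over mu, sigma)."""
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Hypothetical draws theta^(m) from p(theta_A | y_{t-1}^o, A).
M = 20000
mu_draws = rng.normal(0.0, 0.05, size=M)
sigma_draws = np.sqrt(rng.gamma(40.0, 1.0 / 40.0, size=M))

y_obs = -1.2  # the realized return y_t^o

# Approximation (7): average the conditional density over the posterior draws.
pl_t = normal_pdf(y_obs, mu_draws, sigma_draws).mean()
log_pl_t = np.log(pl_t)
```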
For the data set y_T^o the marginal likelihood of the model A is

    p(y_T^o | A) = Π_{t=1}^{T} p(y_t^o | y_{t-1}^o, A),

implying the additive decomposition

    log p(y_T^o | A) = Σ_{t=1}^{T} log PL_A(t).    (8)

Given two competing models A1 and A2, the log Bayes factor may be decomposed as

    log [p(y_T^o | A1) / p(y_T^o | A2)] = Σ_{t=1}^{T} log [PL_{A1}(t) / PL_{A2}(t)],    (9)

where PL_{A1}(t) / PL_{A2}(t) is the predictive Bayes factor in favor of A1 over A2 for observation t. Predictive Bayes factors may be approximated using the output of a posterior simulator by means of (7). These approximations are usually quite accurate; the cost is that the posterior simulator must be executed for each time period t.
The decomposition (8) shows the intimate relationship between the evaluation of the predictive performance of a model by means of the predictive likelihood, on the one hand, and the evidence in favor of a model in the conventional Bayesian comparison of models by means of Bayes factors, on the other. The corresponding decomposition (9) shows how individual observations contribute to the evidence in favor of one model versus a second. See Geweke (2001) or Geweke (2005), Section 2.6.2, for further details and elaboration.
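Decompositions (8)-(9) are straightforward to compute once per-period log predictive likelihoods are in hand. In this sketch the per-period values are simulated placeholders rather than output of (7):

```python
import numpy as np

rng = np.random.default_rng(2)

# Placeholder per-period log predictive likelihoods for two models A1 and A2;
# in practice each term is the log of approximation (7) at time t.
T = 200
log_pl_a1 = rng.normal(-1.40, 0.30, size=T)
log_pl_a2 = rng.normal(-1.45, 0.30, size=T)

# (9): the log Bayes factor is the sum of the per-observation
# log predictive Bayes factors.
log_pbf = log_pl_a1 - log_pl_a2
log_bayes_factor = log_pbf.sum()

# Running total: the kind of cumulative evidence plotted in Figure 2.
cumulative = np.cumsum(log_pbf)
```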
A generalization of (8) is

    log p(y_T^o | y_S^o, A) = Σ_{t=S+1}^{T} log PL_A(t)    (10)

for S < T, and the corresponding generalization of (9) is

    log [p(y_T^o | y_S^o, A1) / p(y_T^o | y_S^o, A2)] = Σ_{t=S+1}^{T} log [PL_{A1}(t) / PL_{A2}(t)].    (11)

In (10) and (11) the cumulation of evidence begins at time t = S + 1 rather than at time t = 1. If one were to regard p(θ_A | y_S^o, A) as the prior distribution for θ_A – that is, if y_S^o were interpreted as a training sample – then (10) would have the same interpretation as (8) and (11) would have the same interpretation as (9). The analysis in the next section uses (10) and (11) with S = 1250 (about five years of data) and T = 8574, so that there are 7324 terms in the sums in these two expressions. The same sample is used for the analysis in Section 6. For small values of t, PL_A(t) is sensitive to the prior density p(θ_A | A), whereas for t ≥ 1250 the results reported here are for all practical purposes invariant with respect to substantial changes in the prior distribution. This result is unsurprising when one interprets y_S^o as a training sample: the information in these 1250 observations dominates the information in the original prior distribution.
The decomposition (11) shows how individual observations contribute to the evidence in favor of one model versus a second. For example, it may show that a few observations are pivotal in the evidence y_T^o strongly favoring one model over another. Comparison of the predictive Bayes factors PL_{A1}(t) / PL_{A2}(t) with characteristics of the sample {y_s^o} for s = t and observations s leading up to t can provide insight into why the evidence favors one model over the other. The comparison can be carried out using predictions over horizons greater than one period, but the decomposition for multiple-period horizons is exactly the same as that for single-period horizons, as explained in Geweke (2001) and Geweke (2005), Section 2.6.2.
The generalization (10) of the marginal likelihood (8) amounts to the evaluation of the predictive densities p(y_t | y_{t-1}^o, A) (t = S + 1, ..., T) using a log scoring rule; see Gneiting and Raftery (2007), Section 7. Non-Bayesian predictive densities, like (3) and (4), may also be evaluated using a log scoring rule. In the case of (3), for example, the log score

    Σ_{t=S+1}^{T} log p(y_t^o | y_{t-1}^o, θ̂_A^{(t-1)}, A)

is directly comparable with (10). It is therefore possible to compare Bayesian and non-Bayesian methods directly by means of their difference in log scores,

    Σ_{t=S+1}^{T} log [p(y_t^o | y_{t-1}^o, A) / p(y_t^o | y_{t-1}^o, θ̂_A^{(t-1)}, A)].    (12)

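The difference in (12) tends to favor the posterior-averaged density precisely where parameter uncertainty matters, such as tail realizations. A hedged one-date illustration, with a fixed-mean Gaussian model and hypothetical posterior draws of the scale parameter:

```python
import numpy as np

rng = np.random.default_rng(3)

def normal_pdf(y, mu, sigma):
    """Density of N(mu, sigma^2) at y (vectorized over sigma)."""
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Hypothetical posterior draws of sigma versus a point (e.g. ML) estimate.
M = 50000
sigma_draws = np.sqrt(rng.gamma(20.0, 1.0 / 20.0, size=M))
sigma_hat = 1.0

y_obs = 2.5  # a tail realization, where the averaging matters most

# Bayesian log score: log of the posterior-averaged density, as in (6)-(7).
log_score_bayes = np.log(normal_pdf(y_obs, 0.0, sigma_draws).mean())
# Plug-in log score, as in (3): the density at the point estimate only.
log_score_plugin = np.log(normal_pdf(y_obs, 0.0, sigma_hat))

# One term of (12); positive values favor the Bayesian predictive density.
diff = log_score_bayes - log_score_plugin
```

The density is convex in the variance this far into the tail, so averaging over posterior scale draws raises the density at y_obs relative to plugging in the point estimate.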

4 Comparison of five models of S&P 500 returns

Figure 1 shows the familiar S&P 500 percent log return series for the period beginning with December 15, 1976, corresponding to S + 1 = 1251, and ending with December 16, 2005, corresponding to T = 8574. (Since the data set goes through the end of 2005, and because the analysis in Section 6 utilizes prediction horizons of up to 10 trading days, this exercise ends short of the last trading day of 2005.) Symbols in this figure identify particular dates for reference in the analyses that follow in this section.

Figure 1: S&P 500 percent log return observations for which predictive likelihood was evaluated. The symbols identify ten specific observations.
Corresponding to (11), the cumulative log predictive Bayes factor through period r, in favor of model A1 over model A2, is

    log [p(y_r^o | y_S^o, A1) / p(y_r^o | y_S^o, A2)] = Σ_{t=S+1}^{r} log [PL_{A1}(t) / PL_{A2}(t)].    (13)

Figure 2 shows these cumulative log predictive Bayes factors for r = S + 1, ..., T. For each prediction model posterior inference was carried out by Markov chain Monte Carlo in each of 7324 samples, applying (7) to approximate PL_A(t) = p(y_t^o | y_{t-1}^o, A). In each panel the comparison model A2 is GARCH, and the other model is the one indicated. All of these results are out-of-sample: that is, PL_A(t) reflects inference for the parameter vector θ_A using the sample consisting of observations 1, ..., t - 1.
The right endpoint of the plotted points in each panel of Figure 2 provides (13) with r = T. For SV versus GARCH the value is 144.36, for t-GARCH 208.44, for MNM 151.65, and for HMNM 199.31. The evidence strongly favors the t-GARCH and HMNM models, with SV and MNM rounding out the rankings. More than one-third of the log predictive likelihood in favor of each of the other four models over GARCH is due to returns on just two days: the record log return of -22.9% on October 19, 1987, and the log return of -6.3% on October 13, 1989. The returns of -3.7% on November 15, 1991, -7.1% on October 27, 1997, and -7.0% on August 31, 1998 also lead to predictive likelihoods for those days that strongly favor the other four models over GARCH.

Figure 2: Cumulative log predictive Bayes factors in favor of each of four models over GARCH. Symbols identify dates as indicated in Figure 1.
Figure 3 provides further comparison of the predictive performance of the t-GARCH and HMNM models as measured by predictive likelihoods. The sequence of cumulative log predictive Bayes factors, panel (a), is not dominated by any single date. Until May 23, 1984, predictive Bayes factors on average favor t-GARCH. From then until November 27, 1987, they favor HMNM on average. From July 20, 1993, through the end of 2005 predictive Bayes factors again favor t-GARCH on average. Log predictive Bayes factors for all ten individual dates marked by symbols in Figure 1 can be read from panels (b) and (d) of Figure 3.
Panel (b) shows all the log predictive likelihoods for the two models. Combinations above the 45° line favor the HMNM model and those below it favor t-GARCH. The symbols specifically designate all combinations for which the t-GARCH log predictive likelihood was less than -8 or the log predictive Bayes factor in favor of HMNM was less than -1.5. (That is how the dates indicated in Figure 1 were selected.) The record return of October 19, 1987, has by far the lowest log predictive likelihood in the t-GARCH model, whereas October 13, 1989, has the lowest predictive likelihood in the HMNM model. Panel (d) of Figure 3 shows that there is no simple relationship between returns that are large in magnitude and log predictive Bayes factors, and comparison of panels (b) and (d) shows that there is no simple relationship between these returns and log predictive likelihoods. Panels (b) through (d) show that for most days the predictive Bayes factor in favor of one model or the other is small.

Figure 3: Some comparisons of the t-GARCH and HMNM models using the log predictive Bayes factor in favor of HMNM. Symbols identify dates as indicated in Figure 1.
Panels (c) and (d) show a weak but systematic relationship between absolute
returns and log predictive Bayes factors: the HMNM model tends to be favored by
log Bayes factors when returns are less than 0.5% in magnitude, whereas t-GARCH
tends to be favored when return magnitude is between 0.5% and 1%. As return
magnitude rises above 1% the range of log predictive Bayes factors tends to increase,


with no systematic tendency for one model or the other to be favored. The important
exception to this pattern is October 19, 1987.
The exploratory analysis illustrated in Figure 3 can be used to compare the predictive performance of the two models (as captured by log predictive Bayes factors) and any function of returns over the preceding days. In all five models more volatile recent returns lead to greater dispersion in predictive distributions, but the mechanisms are distinct – especially in the HMNM model as opposed to models in the ARCH family. This characteristic of the models suggests that the magnitude of the return relative to recent magnitudes might be systematically related to log predictive Bayes factors. Figure 4 pursues this analysis, capturing return relative to recent magnitudes as the ratio of |y_t^o| to the standard deviation in {y_s^o} (s = t - 80, ..., t - 1). Call this ratio q.

Figure 4: Comparison of the ratio of absolute return to the standard deviation of returns in the past 80 days, with the predictive Bayes factor for HMNM over t-GARCH. Symbols identify dates as indicated in Figure 1.

For the ten dates initially identified in Figure 3(b), the correlation between q and the corresponding log predictive Bayes factors exceeds 0.9: the near-linear relationship is evident in panel (a) of Figure 4. Panel (b) plots log predictive Bayes factors against q for all days on which q is less than 4. (The rescaling of the vertical axis in panel (b) excludes no days.) The pattern in Figure 4(b) is similar to that in Figure 3(c).
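The ratio plotted in Figure 4 is easy to reproduce for any return series; in this sketch the series itself is simulated rather than the actual S&P 500 data:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical daily percent log returns standing in for the S&P 500 series.
y = rng.normal(0.0, 1.0, size=500)

window = 80
# q_t: absolute return divided by the standard deviation of the
# previous 80 returns, as constructed for Figure 4.
q = np.array([abs(y[t]) / y[t - window:t].std() for t in range(window, len(y))])
```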
As measured by log predictive Bayes factors, the predictive performance of the HMNM model dominates that of the t-GARCH model when returns are very large in magnitude relative to recent volatility – that is, for returns whose absolute value exceeds five standard deviations of returns over the past 80 days. Equivalently, extreme returns that occur roughly once or twice per decade are assigned substantially more probability in the HMNM model than in the t-GARCH model. Overall, the log predictive likelihoods of the two models are nearly identical. Elsewhere (Geweke and Amisano, 2008) we have shown that this implies that neither model corresponds to a true data generating process D, and there must exist models with higher log predictive likelihoods. Figure 4 suggests that in such models the predictive density function might resemble more the HMNM predictive density for returns that are small or quite large relative to recent volatility, and for the remainder might resemble more the t-GARCH model.
For the GARCH and t-GARCH models we prepared an alternative set of predictive densities for each of the 7324 days, using (3) and the maximum likelihood estimates θ̂_A^{(t-1)}. We then compared the Bayesian and maximum likelihood (ML) predictive densities using the difference in log scores (12). For the GARCH model the outcome is 18.09 and for the t-GARCH model it is 9.76: both comparisons favor the Bayesian predictive densities over the MLE predictive densities.
This outcome is not surprising. Bayesian predictive densities account for parameter uncertainty, whereas the MLE predictive density (3) does not. Figure 5 provides the daily decomposition of (12). One would expect the advantage of Bayesian predictive densities to be more pronounced in smaller samples, corresponding to earlier data in Figure 5, in which parameter uncertainty is relatively more important. The daily decomposition for t-GARCH is at least roughly consistent with this understanding. The decomposition for GARCH is dominated by the inferior performance of the MLE predictive density, relative to the Bayesian predictive density, on October 19, 1987.

Figure 5: Daily decomposition of (12) for the GARCH and t-GARCH models.


5 Model evaluation with probability integral transforms

Predictive likelihoods are local measures of the predictive performance of models: that is, they depend only on the predictive probability density evaluated at the realized return. Moreover, predictive densities measure only the relative performance of models – indeed, as discussed in Section 3 they are components of Bayes factors that are critical in Bayesian model comparison and model averaging.
The probability integral transform (PIT) provides an alternative assessment of the predictive performance of a model that is based on non-local assignment of predictive probability and is therefore complementary to the assessment based on predictive likelihoods discussed in the previous two sections. Unlike predictive likelihoods, however, the comparison is non-Bayesian, because the ideal i.i.d. uniform distribution of the PIT is ex ante and conditions on model parameters.
Suppose that a model A assigns one-step-ahead predictive densities p(y_t | y_{t-1}, A). Denote the corresponding sequence of cumulative distribution functions

    F(c | y_{t-1}, A) = P(y_t ≤ c | y_{t-1}, A).

The PIT corresponding to the model A and the sequence {y_t} is p_1(t, y_T; A) = F(y_t | y_{t-1}, A). If A = D, the true data generating process for {y_t}, then the sequence {p_1(t, y_T; A)} is distributed i.i.d. uniform (0, 1) ex ante. This result dates at least to Rosenblatt (1952), and was brought to the attention of the econometrics community by Diebold et al. (1998). Following Smith (1985) and Berkowitz (2001), if A = D then

    f_1(t, y_T; A) = Φ^{-1}[p_1(t, y_T; A)] ~ N(0, 1),

and for analytical purposes it is often more convenient to work with {f_1(t, y_T; A)} than with {p_1(t, y_T; A)}. For h-step-ahead predictive distributions, let F_h(c | y_{t-h}, A) = P(y_t ≤ c | y_{t-h}, A) and p_h(t, y_T; A) = F_h(y_t | y_{t-h}, A). If A = D then the distribution of p_h(t, y_T; A) is uniform on (0, 1) ex ante, but p_h(t, y_T; A) and p_h(s, y_T; A) are independent ex ante if and only if |t - s| ≥ h.
These characteristics of {p_h(t, y_T; D)} would only be approximately true of {p_h(t, y_T; A)} even if p(y_T | D) = p(y_T | θ_A, A) for some value of θ_A, because θ_A is unobservable. (The approximation would improve as T increased and p(y_T | A) incorporated an increasingly tight posterior distribution for θ_A.) More important, we know that A ≠ D. The departure of the sequence {p_h(t, y_T; A)} from these ideal characteristics provides an informal evaluation of A against an absolute standard.
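The uniformity of the PIT under a correctly specified model is easy to verify by simulation. In this illustrative sketch the data are generated from i.i.d. N(0, 1), the model A coincides with D, and the decile frequencies settle near 0.10:

```python
import math

import numpy as np

rng = np.random.default_rng(5)

# Generate returns from a known process D (i.i.d. standard normal, purely
# for illustration) and apply the PIT under the same model, so A = D.
T = 4000
y = rng.normal(0.0, 1.0, size=T)

def std_normal_cdf(x):
    """Phi(x) via the error function, avoiding extra dependencies."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

pit = np.array([std_normal_cdf(v) for v in y])  # p_1(t, y_T; A)

# Ex ante the PIT sequence is i.i.d. uniform(0, 1), so each decile
# should contain close to 10% of the observations.
decile_freq = np.histogram(pit, bins=10, range=(0.0, 1.0))[0] / T
```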
In many models, including all five in this study, analytical evaluation of P(y_t ≤ c | y_{t-1}^o, θ_A, A) is possible, and this is all that is required for a posterior simulation approximation of F(y_t^o | y_{t-1}^o, A), because

    M^{-1} Σ_{m=1}^{M} P(y_t ≤ y_t^o | y_{t-1}^o, θ_{A,t-1}^{(m)}, A) → p_1(t, y_T^o; A)    (14)

so long as {θ_{A,t-1}^{(m)}} is an ergodic sequence whose invariant distribution is the posterior. For h > 1 analytical evaluation of P(y_t ≤ c | y_{t-h}^o, θ_A, A) is very awkward or impossible in all of these models, and that is true of econometric prediction models generally. Instead we employ a simulation approximation using the recursion

    y_s^{(m)} ~ p(y_s | y_{t-h}^o, y_{t-h+1}^{(m)}, ..., y_{s-1}^{(m)}, θ_A^{(m)}, A)    (s = t - h + 1, ..., t).

Since the posterior MCMC sample is large, only one such recursion need be carried out for each θ_A^{(m)} in the posterior sample, and

    M^{-1} Σ_{m=1}^{M} I_{(-∞, y_t^o]}(y_t^{(m)}) → p_h(t, y_T^o; A),    (15)

where I_S(x) = 1 if x ∈ S and I_S(x) = 0 if x ∉ S.
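The h-step recursion and the indicator average (15) can be sketched for a hypothetical Gaussian AR(1) model with known parameters; in the paper each simulated path would instead condition on a posterior draw θ_A^{(m)}:

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical Gaussian AR(1): y_s = phi * y_{s-1} + eps_s, eps_s ~ N(0, sigma^2).
phi, sigma = 0.3, 1.0
h, M = 5, 2000

y_hist = 0.8   # y_{t-h}^o, the last observed value
y_obs = -0.4   # the realized y_t^o

# Recursion: simulate each of M paths forward h steps from y_{t-h}^o.
y_sim = np.full(M, y_hist)
for _ in range(h):
    y_sim = phi * y_sim + rng.normal(0.0, sigma, size=M)

# (15): the fraction of simulated paths at or below y_t^o approximates
# p_h(t, y_T^o; A), the h-step PIT value.
p_h = np.mean(y_sim <= y_obs)
```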


6 Evaluation of five models of S&P 500 returns

For the S&P 500 return series y_T^o we evaluated p_h(t, y_T^o; A) for each of the five models A described in Section 2, for h = 1, ..., 10, and for t = 1251, ..., 8574 = T, where T is the entire sample size. The computations are based on the 7324 Markov chain Monte Carlo posterior samples in each of the five models, one for each sample, using (14) and (15) for t = 1251, ..., 8574. We then computed the corresponding transformation to the standard normal distribution, f_h(t, y_T^o; A) = Φ^{-1}[p_h(t, y_T^o; A)] (t = 1251, ..., 8574).
To describe the departure from the PIT paradigm for each model, we determined the average values of

    I_{(a_{j-1}, a_j]}[p_h(1 + (s - 1)h, y_T^o; A)] (j = 1, ..., n; h = 1, ..., 10),    (16)

here with n = 10 and a_j = j/10; that is, we determined the fraction of p_h(t, y_T^o; A) in each decile using non-overlapping prediction horizons h. Figure 6 presents the values of (16) for a different model A in each of the five rows of panels, and for h = 1, 5 and 10 in each of the three columns of panels. Values for the deciles j = 1, ..., 10 are shown in each panel. The paradigm value 0.10 is indicated by the solid horizontal line, and the dotted horizontal lines provide a conventional 95% confidence interval for these values under the condition that {p_h(1 + (s - 1)h, y_T^o; A)} (s = 1, ..., [T/h]) is i.i.d. Bernoulli (p = 0.1).

Figure 6: Each panel shows the frequency of observations occurring in each decile of the predictive distribution (16), as a function of the prediction horizon h indicated on the horizontal axis. Solid lines indicate results using Bayesian inference, dotted lines results using maximum likelihood in the GARCH and t-GARCH models. Dashed lines provide centered 95% confidence intervals for the individual deciles under the PIT paradigm.

Plotted values above 0.1 indicate deciles in which more than 10% of the realized returns occurred; equivalently, the model underpredicts probability in this range of the predictive density. Values below 0.1 correspond to overprediction in the relevant range.
The performance of the GARCH model is markedly inferior to the other four, and the performance of SV is not quite as good as the t-GARCH, MNM and HMNM models. The tendency of the latter three models to over- or under-predict different deciles is roughly the same for all three horizons studied in Figure 6. At horizon h = 1 they assign too little probability in the interquartile range and too much in the lower and upper quartiles. At h = 5 and h = 10 they assign too much probability below the median (with an exception for the lowest decile in some cases) and too little above the median of their predictive distributions. These characteristics are also evident in the GARCH model, where the performance is markedly poorer, especially for h = 10.

Table 1: Formal PIT goodness of fit tests ("Deciles" and "Left tail" headings). Cell entries are p-values for the tests.

Table 1 provides two sets of chi-square goodness-of-fit test results for the PIT.
The entries in the table correspond to p-values for the tests. The results for horizon
h = 1 are conventional. The results for horizons h = 5 and h = 10 are based on h
separate tests, using (16) for k = 1, ..., h. These tests are not independent across k
because the horizons overlap; the entries in Table 1 are Bonferroni p-values for the h
separate tests in each case.
The first set of tests, under the "Deciles" heading, corresponds to the results shown
in Figure 6. The results reinforce the conclusion that the PIT fit of the GARCH and
SV models is inferior to those of the other three models. At horizon h = 1, only
the HMNM model comes close to passing a PIT goodness-of-fit test at conventional
significance levels. At longer horizons results are mixed, perhaps because there are
fewer intervals with nonoverlapping predictions and therefore the tests have lower
power.
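The Bonferroni construction for overlapping horizons can be sketched as follows. The function names are ours, and the Pearson chi-square test on decile counts is one standard way to implement a test based on (16), not necessarily the exact variant used in the paper:

```python
import numpy as np
from scipy.stats import chisquare

def pit_chisq_pvalue(pit, n_bins=10):
    """Pearson chi-square p-value for uniformity of PIT values on (0, 1)."""
    counts, _ = np.histogram(np.asarray(pit),
                             bins=np.linspace(0.0, 1.0, n_bins + 1))
    return chisquare(counts).pvalue  # equal expected counts by default

def bonferroni_pit_pvalue(pit, h, n_bins=10):
    """For horizon h, test each of the h non-overlapping subsequences
    u_k, u_{k+h}, u_{k+2h}, ... (k = 1, ..., h) separately, then report
    the Bonferroni-adjusted p-value min(1, h * min_k p_k)."""
    pit = np.asarray(pit)
    pvals = [pit_chisq_pvalue(pit[k::h], n_bins) for k in range(h)]
    return min(1.0, h * min(pvals))
```

Splitting into every-h-th subsequences sidesteps the dependence induced by overlapping forecast windows, at the cost of the conservatism inherent in the Bonferroni bound.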
Evaluation of the predictive distributions over particular regions may be of concern
in specific applications, particularly for negative returns; see the discussion in Diks
et al. (2008). Our second set of tests explores PIT goodness of fit in the lower tail of
the predictive distribution, based on (16) with n = 5, a_j = j/200 (j = 0, ..., n), and
h = 1, 5 and 10. As in the first set of tests, conventional p-values are given for h = 1
and Bonferroni p-values are given for h = 5 and h = 10. The power of these
tests is, of course, much lower than that of the decile tests. The GARCH and SV models fail
at horizon h = 1, MNM fails at h = 5, and GARCH again at h = 10; HMNM has
some difficulty at horizon h = 5. There is little in these results to suggest that the
evaluation of left-tail performance is very different from overall performance using
the decile tests.
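One plausible reading of this left-tail test uses cells (a_{j-1}, a_j] with a_j = j/200 covering the lowest 2.5% of the predictive distribution, plus a single catch-all cell for the remainder of the unit interval. This construction is our own sketch; the paper may allocate the remaining probability differently:

```python
import numpy as np
from scipy.stats import chisquare

def left_tail_pvalue(pit, n_cells=5, cell_prob=1.0 / 200.0):
    """Chi-square test on tail cells (a_{j-1}, a_j], a_j = j * cell_prob,
    plus one catch-all cell covering the rest of (0, 1)."""
    pit = np.asarray(pit)
    edges = np.arange(n_cells + 1) * cell_prob      # 0, .005, ..., .025
    counts, _ = np.histogram(pit, bins=edges)
    obs = np.append(counts, pit.size - counts.sum())
    exp = np.append(np.full(n_cells, pit.size * cell_prob),
                    pit.size * (1.0 - n_cells * cell_prob))
    return chisquare(obs, exp).pvalue
```

Because the expected count in each tail cell is only T/200, the test has much lower power than the decile test unless T is large, consistent with the caveat above.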
Figures 7 and 8 provide some additional evidence on the relationship of the predictive
distributions to the PIT paradigm. Each panel plots a different transformation
of p_h(t; y_T^o; A) as a function of the prediction horizon h shown on the horizontal
axis.

Figure 7: Each panel plots a transformation of p_h(Y_T; A) as a function of the
prediction horizon h shown on the horizontal axis. Solid lines indicate results using
Bayesian inference; dotted lines indicate results using maximum likelihood in the
GARCH and t-GARCH models. The dashed lines provide the .01, .05, .50, .95 and
.99 quantiles of the transformation under the PIT paradigm.

Figure 7(a) displays the interdecile range for each model: for each model A and
each horizon h, it is the difference between the maximum and minimum values of
(16) taken over j = 1, ..., 10. For each combination of A and h, larger values of
the interdecile range constitute greater evidence against the PIT paradigm. The
MNM and HMNM interdecile ranges display greater consistency with PIT than do
the ranges for the other three models.
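The interdecile-range statistic of Figure 7(a) reduces to a one-liner on the decile frequencies (a sketch; the function name is ours):

```python
import numpy as np

def interdecile_range(pit):
    """Max minus min of the ten decile frequencies of the PIT values;
    larger values are stronger evidence against the PIT paradigm."""
    counts, _ = np.histogram(np.asarray(pit), bins=np.linspace(0.0, 1.0, 11))
    freq = counts / counts.sum()
    return freq.max() - freq.min()
```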
The remaining panels of Figures 7 and 8 pertain to f_h(t; y_T^o; A) =
Φ^{-1}[p_h(t; y_T^o; A)]. Under the PIT paradigm the values

f_h(1 + (s - 1)h; y_T^o; A)   (s = 1, 2, ...)

are realizations of an i.i.d. standard normal process, implying that there are many
functions of f_h(1 + (s - 1)h; y_T^o; A) with well-established distributions. Panels (b)
of Figure 7 through (b) of Figure 8 pertain to the first four moments of f_h(t; y_T^o; A),
and therefore provide indications of the discrepancy between the predictive and observed
distributions. Figure 8(f) pertains to the first-order autocorrelation coefficient of (17)
and therefore provides an indication of the departure from the independence of successive
quantiles implied by PIT. Unlike the interdecile range in Figure 7(a), both large and
small values of the statistics in Figure 7(b, c) and Figure 8 constitute evidence that
the PIT paradigm is not appropriate.
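The normalized PIT f = Φ^{-1}(u) and the moment diagnostics examined in these panels can be sketched as follows (the function and its return keys are ours; under the PIT paradigm the benchmarks are mean 0, variance 1, skewness 0, kurtosis 3 and zero lag-1 autocorrelation):

```python
import numpy as np
from scipy.stats import norm

def normalized_pit_diagnostics(pit):
    """First four moments and lag-1 autocorrelation of f = Phi^{-1}(u).

    If the model's predictive distributions are correct, f is i.i.d.
    standard normal: mean ~ 0, var ~ 1, skew ~ 0, kurt ~ 3, rho1 ~ 0.
    """
    f = norm.ppf(np.asarray(pit))
    m, v = f.mean(), f.var()
    return {
        "mean": m,
        "var": v,
        "skew": ((f - m) ** 3).mean() / v ** 1.5,
        "kurt": ((f - m) ** 4).mean() / v ** 2,
        "rho1": np.corrcoef(f[:-1], f[1:])[0, 1],
    }
```

Each statistic can then be compared against the quantiles of its sampling distribution under i.i.d. normality, which is what the dashed bands in Figures 7 and 8 display.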

Figure 8: Each panel plots a transformation of p_h(Y_T; A) as a function of the
prediction horizon h shown on the horizontal axis. Solid lines indicate results using
Bayesian inference; dotted lines indicate results using maximum likelihood in the
GARCH and t-GARCH models. The dashed lines provide the .01, .05, .50, .95 and
.99 quantiles of the transformation under the PIT paradigm.

The evaluations of the t-GARCH, MNM and HMNM model fits in columns (b)
and (c) of Figure 7 are all similar. Means are slightly and insignificantly higher
than zero at all horizons. Variances are greater than one, with significant departures
for t-GARCH and marginally significant departures for MNM. The distribution for
t-GARCH is significantly negatively skewed (Figure 8(a)), whereas for MNM and
HMNM it is insignificantly positively skewed. The evaluation of all three models
using the kurtosis (Figure 8(b)) of the normalized PIT is satisfactory. By contrast,
the GARCH and stochastic volatility models both have severe departures from the
PIT-paradigm skewness and kurtosis of the normalized PIT.
The PIT evaluations of the ML and Bayesian predictive distributions do not provide
any striking systematic comparisons, in either the GARCH or the t-GARCH model.
Posterior distributions are concentrated about the maximum likelihood estimates,
and consequently deciles of the predictive distributions are much the same in the
Bayesian and ML predictive distributions.
For one-step-ahead predictive distributions (h = 1) the first-order autocorrelation
coefficient for the series (17) is about 0.06 in all five models, well outside the range
of values plausible under PIT. For larger values of h the autocorrelation coefficient
is smaller, with a notable tendency to be negative, in most cases, for horizons 3 or
greater. Moreover, the pattern of evaluations across horizons is roughly the same in
all cases. These results suggest that there is some persistence in day-to-day returns
that is not adequately captured by any of the five models.
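To see why 0.06 lies well outside the plausible range: under independence, the lag-1 sample autocorrelation of a series of length T is approximately N(0, 1/T), so the two-sided 95% band is about ±1.96/√T. A sketch (the figure of roughly 7,000 one-step predictions is our approximation to the paper's out-of-sample period):

```python
import math

def acf_null_band(T, z=1.96):
    """Half-width of the approximate 95% band for the lag-1 sample
    autocorrelation of an i.i.d. series of length T."""
    return z / math.sqrt(T)

# With about 7,000 observations the band is roughly +/- 0.023, so an
# estimated coefficient near 0.06 decisively rejects independence.
band = acf_null_band(7000)
```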


Summary and conclusions

This study compares and evaluates Bayesian predictive distributions from alternative
models, using as an illustration five alternative models of asset returns applied to a
time series of 7324 daily S&P 500 log returns. The comparison exercise uses predictive
likelihoods and is inherently Bayesian. The evaluation exercise uses the probability
integral transform (PIT) and is inherently frequentist. The illustration shows that
the two approaches can be complementary, each identifying strengths and weaknesses
in models that are not evident using the other.
Both the predictive likelihood and PIT analyses lead to the conclusion that the
GARCH Bayesian predictive distributions are inferior to the other four, and the same
is true of GARCH predictive distributions constructed from maximum likelihood
estimates. Each analysis also leads to the conclusion that the Bayesian stochastic
volatility (SV) and Markov normal mixture (MNM) predictive distributions are
dominated by the t-GARCH and hierarchical Markov normal mixture (HMNM)
predictive distributions. These comparisons are readily apparent in Figures 2, 6, 7
and 8, and in Table 1.
In the predictive likelihood analysis Bayes t-GARCH is favored over HMNM by
about 10 points (Figure 2), the same as its margin over ML t-GARCH (Figure 5).
Therefore the predictive likelihoods for ML t-GARCH and HMNM are nearly identical.
By contrast, the PIT analysis narrowly favors the HMNM predictive distributions
over those of Bayes t-GARCH. This is evident in Table 1 and Figures 6, 7 and 8.
However, the latter analysis also shows that the normalized PIT for the HMNM
model is not ideal. In particular, PITs for successive one-day predictions are not
independent, and the performance of HMNM in this dimension is no better than
that of any of the other four models.
A new predictive density can always be formed as a weighted average of predictive
densities from different models, the best-known example being Bayesian model
averaging (Geweke, 2005, Section 2.6). The analysis in Section 4 indicates that, for
most combinations of models and substantial subperiods of the sample considered
in this study, Bayesian model averaging is for all practical purposes equivalent to
model selection, with one model receiving a weight very close to 1. This is often
the outcome for Bayesian model averaging when the sample is large. The notable
exception arises when the models averaged include both t-GARCH and HMNM: in
that case these two models can have substantial weight in Bayesian model averaging,
depending on the days included in the sample. Geweke and Amisano (2008) show
that a weighted average of the HMNM and t-GARCH models compares quite
favorably with both models, using predictive likelihood. That paper also shows that, in
general, optimization of the log score function for multiple models leads to non-trivial
weights on several models, weights that are quite different from those that result from
conventional Bayesian model averaging.
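The contrast between log-score-optimal pooling and Bayesian model averaging can be illustrated for two models. Given each model's one-step predictive densities evaluated at the realized returns, the BMA weight is proportional to the product of those densities (and so collapses toward 0 or 1 in large samples), while the optimal pool maximizes the average log score of the mixture. A sketch under these assumptions (the function names and the grid search are ours):

```python
import numpy as np

def optimal_pool_weight(p1, p2, grid=1001):
    """Log-score-optimal weight w on model 1 in the pool w*p1 + (1-w)*p2.

    p1, p2: arrays of predictive densities evaluated at the realized
    returns. Simple grid search over w in [0, 1].
    """
    p1, p2 = np.asarray(p1), np.asarray(p2)
    ws = np.linspace(0.0, 1.0, grid)
    scores = [np.log(w * p1 + (1.0 - w) * p2).sum() for w in ws]
    return float(ws[int(np.argmax(scores))])

def bma_weight(p1, p2):
    """Posterior BMA weight on model 1 under equal prior probabilities,
    computed from the log predictive likelihoods."""
    ll1, ll2 = np.log(p1).sum(), np.log(p2).sum()
    return 1.0 / (1.0 + np.exp(ll2 - ll1))
```

When each model predicts well on different days, the optimal pool places interior weight on both while BMA still concentrates on the model with the higher predictive likelihood, which is the pattern the paragraph above describes.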
Analysis of the kind in this article provides guidance for model improvement. This
can be seen in the context of the application used in this study, looking both to the
past and to the future. Of the five models considered here, GARCH was the earliest
to be applied systematically to financial returns, followed shortly by SV. While
predictive likelihood analysis shows that SV substantially improves on GARCH, the
PIT indicates that neither wholly captures the distribution of S&P 500 returns,
especially for extreme events. This provides an agenda for model improvement,
addressed by the t-GARCH and MNM models. The predictive likelihood and PIT
analyses indicate that t-GARCH goes far in addressing the problems of GARCH and
SV, whereas MNM makes little progress. HMNM adds even more flexibility to MNM
and substantially outperforms it in both the predictive likelihood and PIT analyses.
Looking to the future, the persistent dependence in the PIT transformation in all of
the models considered emphasizes the need for more adequate dynamics. For example,
this could take the form of a t-GARCH(p, q) model. Awaiting such new research,
real-world demands for predictive distributions will continue. In this context, analyses
of predictive distributions of the kind conducted here provide indications of the model
limitations with which actual decisions based on these models must always be tempered.
The authors acknowledge many useful comments that have substantially improved
the paper from several seminar presentations and especially from the co-editor and
the referees. Responsibility for any remaining errors or points of confusion rests with
the authors. Geweke acknowledges financial support from US National Science
Foundation Grant SBR-0720547 and Amisano from the Italian Ministero dell'Istruzione,
dell'Università e della Ricerca.

References

Amisano, G., & Giacomini, R. (2007). Comparing density forecasts via weighted likelihood ratio tests. Journal of Business and Economic Statistics, 25, 177-190.
Bauwens, L., Giot, P., Grammig, J., & Veredas, D. (2004). A comparison of financial duration models via density forecasts. International Journal of Forecasting, 20, 589-609.
Berkowitz, J. (2001). Testing density forecasts with applications to risk management. Journal of Business and Economic Statistics, 19, 465-474.
Bollerslev, T. (1986). Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics, 31, 307-327.
Chib, S. (1996). Calculating posterior distributions and modal estimates in Markov mixture models. Journal of Econometrics, 75, 79-97.
Christoffersen, P. F. (1998). Evaluating interval forecasts. International Economic Review, 39, 841-862.
Corradi, V., & Swanson, N. R. (2006). Predictive density evaluation. In: G. Elliott et al. (Eds.), Handbook of Economic Forecasting. Amsterdam: North-Holland, pp. 197-284.
Diebold, F. X., Gunther, T. A., & Tay, A. S. (1998). Evaluating density forecasts with applications to financial risk management. International Economic Review, 39.
Diks, C., Panchenko, V., & van Dijk, D. (2008). Partial likelihood-based scoring rules for evaluating density forecasts in the tails. Tinbergen Institute Discussion paper.
Geweke, J. (2001). Bayesian econometrics and forecasting. Journal of Econometrics, 100, 11-15.
Geweke, J. (2005). Contemporary Bayesian Econometrics and Statistics. Hoboken: Wiley.
Geweke, J., & Amisano, G. (2007). Hierarchical Markov normal mixture models with applications to financial asset returns. European Central Bank working paper.
Geweke, J., & Amisano, G. (2008). Optimal prediction pools.
Geweke, J., & Whiteman, C. (2006). Bayesian forecasting. In: C. W. J. Granger et al. (Eds.), Handbook of Economic Forecasting. Amsterdam: North-Holland.
Gneiting, T., & Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102, 359-378.
Hong, Y. M., Li, H. T., & Zhao, F. (2004). Out of sample performance of discrete time spot interest rate models. Journal of Business and Economic Statistics, 22.
Jacquier, E., Polson, N. G., & Rossi, P. E. (1994). Bayesian analysis of stochastic volatility models. Journal of Business and Economic Statistics, 12, 371-389.
Lindgren, G. (1978). Markov regime models for mixed distributions and switching regressions. Scandinavian Journal of Statistics, 5, 81-91.
Rosenblatt, M. (1952). Remarks on a multivariate transformation. Annals of Mathematical Statistics, 23, 470-472.
Rydén, T., Teräsvirta, T., & Åsbrink, S. (1998). Stylized facts of daily return series and the hidden Markov model. Journal of Applied Econometrics, 13, 217-244.
Smith, J. Q. (1985). Diagnostic checks of non-standard time series models. Journal of Forecasting, 4, 283-291.
Tyssedal, J. S., & Tjøstheim, D. (1988). An autoregressive model with suddenly changing parameters and an application to stock market prices. Applied Statistics, 37, 353-369.
Weigend, A. S., & Shi, S. (2000). Predicting daily probability distributions of S&P 500 returns. Journal of Forecasting, 19, 375-392.