Center for Quantitative Economic Research

Comparing and Evaluating Bayesian
Predictive Distributions of Asset Returns
John Geweke and Gianni Amisano
CQER Working Paper 09-04
October 2009

Abstract: Bayesian inference in a time series model provides exact, out-of-sample predictive
distributions that fully and coherently incorporate parameter uncertainty. This study compares
and evaluates Bayesian predictive distributions from alternative models, using as an illustration
five alternative models of asset returns applied to daily S&P 500 returns from 1972 through 2005.
The comparison exercise uses predictive likelihoods and is inherently Bayesian. The evaluation
exercise uses the probability integral transform and is inherently frequentist. The illustration
shows that the two approaches can be complementary, each identifying strengths and weaknesses
in models that are not evident using the other.
JEL classification: C11, C53
Key words: forecasting, GARCH, inverse probability transform, Markov-mixture, predictive
likelihood, S&P 500 returns, stochastic volatility

The authors gratefully acknowledge financial support from NSF grant SBR-0720547. The views expressed here are
the authors’ and not necessarily those of the Federal Reserve Bank of Atlanta or the Federal Reserve System. Any
remaining errors are the authors’ responsibility.
Please address questions regarding content to John Geweke, Departments of Statistics and Economics, W210 Pappajohn Business Bldg., University of Iowa, Iowa City, IA 52242-1000, and Gianni Amisano, University of Brescia and European Central Bank.
Center for Quantitative Economic Research Working Papers from the Federal Reserve Bank of Atlanta are available on the Atlanta Fed's Web site. Click "Economic Research & Data," "CQER," and then "Publications." Use the WebScriber Service to receive e-mail notifications about new papers.


1 Introduction and motivation

Probability distributions for magnitudes that are unknown at the time a decision
is made, but will become known afterward, are required for the formal solutions of
most decision problems in economics – in the private and public sectors as well as
academic contexts. Increasing awareness of this setting, combined with advances in
modeling and computing, is leading to a sustained emphasis on these distributions in
econometric research (Diebold et al., 1998; Christoffersen, 1998); Corradi and Swanson
(2006) provides a survey. For important decisions there are typically competing
models and methods that produce predictive distributions. The question of how these
predictive distributions should be compared and evaluated then becomes relevant.
This study compares and evaluates the quality of predictive distributions over multiple horizons for asset returns using five different models. We use the daily returns of the Standard and Poor's (S&P) 500 index over the period 1972-2005, a series that is widely employed in academic work and is also one of the most important indexes in the finance industry. The models compared are two from the autoregressive conditional heteroscedasticity (ARCH) family, a stochastic volatility model, the Markov normal mixture model, and an extension of the last model that we have described in detail elsewhere (Geweke and Amisano, 2007).
The basis of comparison used in this study is the predictive likelihood function
– the model’s probability density for the return at the relevant horizon before it is
observed, evaluated at the actual value of the return after it is observed. This function
lies at the heart of the Bayesian calculus for posterior model probabilities, reflecting
the logical positivism of the Bayesian approach: a model is as good as its predictions.
Each model produces a predictive distribution for each return ex ante, and therefore
a predictive likelihood ex post. Comparison of these predictive likelihoods across
models decomposes the Bayes factor one observation at a time. One of the objectives
of this study is to illustrate how this decomposition provides insight into conventional
Bayesian model comparison. The study does this in Sections 3 and 4.
The basis of evaluation used in this study is the probability integral transform (PIT), which is the inverse of the sequence of ex ante predictive cumulative distribution functions (c.d.f.s) evaluated at the sequence of actual returns ex post. If returns are in fact generated from this c.d.f. sequence then the ex ante distribution of the PIT is i.i.d. uniform. As a practical matter this condition will not be met precisely even in ideal circumstances: while observed values might come from the model under consideration, uncertainty about parameter values implies that the predictive distributions will not be exactly the same as in the data generating process. Nevertheless the PIT provides a well-recognized and useful paradigm against which any sequence of predictive distributions can be evaluated. A second objective of this study is to illustrate how the PIT also provides insight into the deficiencies of models. The study does this in Sections 5 and 6.
Model comparison using predictive likelihoods and model evaluation using the PIT

are quite distinct methodologically. The predictive likelihood function is inherently
Bayesian: it is a component of the likelihood function, integrated over the posterior
distribution of the unobservables (parameters and latent variables) at the time the
prediction is made. The product of predictive likelihood functions over all observations in the sample is the marginal likelihood of the model over the same observations.
By contrast the PIT is inherently frequentist, comparing a function of the data with
the ex ante distribution that function would have if the data were generated by a
process coinciding with the model used by the analyst. While the methods are distinct, both can be applied to predictive distributions arising from Bayesian inference,
which we do in this study.
This study builds upon and is distinct from the most closely related recent work on
this topic. Amisano and Giacomini (2007) use frequentist tests based on weighted log
predictive distributions to compare alternative models. Their method can be applied
either to Bayesian or frequentist predictive distributions. However, they do not use the PIT to evaluate models. Other studies have employed both the predictive likelihood
and the PIT to compare and evaluate predictive densities, some with large samples
of daily returns like the one used in this study. Hong et al. (2004) is perhaps closest
in these dimensions; see also Bauwens et al. (2004). However none of these studies
incorporate parameter uncertainty in their predictive distributions. As discussed in
the next section, the coherent combination of intrinsic and parameter uncertainty is
the hallmark of Bayesian predictive distributions.


2 Data, models and Bayesian predictive distributions

This study compares and evaluates the performance of five alternative predictive distributions of asset returns using daily percent log returns of the S&P 500 index. The daily index p_t for 1972-2005 was collected from three different electronic sources: the Wharton WRDS data base; Thompson/Data Stream; and Yahoo Finance. For days on which all three sources did not agree we consulted the periodical publication Security Price Index Record of Standard & Poor's Statistical Service. From the price series {p_t} assembled in this way the daily percent log returns y_t = 100 log(p_t / p_{t-1}) were constructed. The total number of returns in the sample is 8574.
Each of the five alternative predictive distributions arises from a model A for the time series of S&P 500 asset returns y_T = (y_1, ..., y_T)'. Each model A for a time series y_T specifies a density p(y_T | θ_A, A) for the observables y_T conditional on a vector of unobservables θ_A ∈ Θ_A that may include latent variables as well as parameters. It also specifies a prior density p(θ_A | A), and through the usual Bayesian calculus the posterior distribution of θ_A from a sample of t ≤ T observations is

    p(θ_A | y_t^o, A) ∝ p(θ_A | A) p(y_t^o | θ_A, A).    (1)

The superscript o in (1) denotes the ex post, observed, value of y_t; ex post y_t = y_t^o is known and fixed whereas ex ante it is random. The posterior distribution represented in (1) can be accessed using a posterior simulator that produces ergodic sequences {θ_{A,t}^{(m)}} (m = 1, ..., M) for each t considered.
Conditional on the data y_{t-1}^o and the model A the predictive density for y_t is

    p(y_t | y_{t-1}^o, A) = ∫ p(y_t | y_{t-1}^o, θ_A, A) p(θ_A | y_{t-1}^o, A) dθ_A.    (2)

This distribution can be accessed by simulating one value y_t^{(m)} from each of the distributions represented by the density p(y_t | y_{t-1}^o, θ_{A,t-1}^{(m)}, A) (m = 1, ..., M). This simulation is usually straightforward and less demanding than the simulation of θ_{A,t}^{(m)} from (1). The predictive density (2) integrates uncertainty about the vector of unobservables θ_A and intrinsic uncertainty about the future value y_t, both conditional on the history of returns y_{t-1}^o and the assumptions of the model A. This integration is a hallmark of Bayesian predictive distributions. The use of simulation methods to produce {θ_{A,t}^{(m)}} and then {y_t^{(m)}} makes these predictive distributions applicable in real time. A key advantage of Bayesian predictive distributions is the combination of the two sources of uncertainty in a logically coherent framework.
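The simulation just described can be sketched in a few lines. The Gaussian returns model, the posterior draws, and all numeric values below are hypothetical stand-ins for a real posterior simulator's output; only the mechanics of mixing predictive draws over posterior draws, as in (2), are the point.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for posterior simulator output: M draws of
# theta = (mu, sigma) for a simple Gaussian returns model. The paper's
# models are richer, but the mechanics of (2) are identical.
M = 5000
mu_draws = rng.normal(0.03, 0.02, size=M)                   # posterior draws of the mean
sigma_draws = np.sqrt(rng.gamma(50.0, 1.0 / 50.0, size=M))  # posterior draws of the s.d.

# One predictive draw y_t^(m) per posterior draw theta^(m): averaging over
# the posterior folds parameter uncertainty into the predictive distribution.
y_pred = rng.normal(mu_draws, sigma_draws)

# The sample y_pred approximates the Bayesian predictive density (2).
pred_mean = y_pred.mean()
pred_sd = y_pred.std()
```

By construction the predictive spread combines intrinsic volatility (through sigma) with posterior uncertainty about mu, rather than conditioning on a single parameter value.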
To consider two alternatives suppose, first, that one were to use the predictive density

    p(y_t | y_{t-1}^o, θ̂_A^{(t-1)}, A)    (3)

where the estimate θ̂_A^{(t-1)}, a function of y_{t-1}^o, replaces the unknown θ_A. This does not account for parameter uncertainty at all. In a second alternative one could work with

    ∫ p(y_t | y_{t-1}^o, θ̂_A^{(t-1)}, A) p̂(θ̂_A^{(t-1)} | θ_A) dθ̂_A^{(t-1)}    (4)

where p̂(θ̂_A^{(t-1)} | θ_A) is an asymptotic approximation of the sampling distribution of the estimator θ̂_A^{(t-1)}. This alternative conditions on the actual history y_{t-1}^o in the first component of the integration, while treating the history as a random variable in the second component. The resulting distribution for y_t thus has no clear interpretation. For further discussion of these issues, see Geweke and Whiteman (2006).
The first model A considered in this study is the generalized autoregressive conditional heteroscedasticity model with parameters p = q = 1 in which the distribution of the innovations is Gaussian ("GARCH"). The second model is the same as the first, except that the distribution of the innovations is Student-t ("t-GARCH"). The third model is the stochastic volatility model of Jacquier et al. (1994) ("SV").
The fourth model is a Markov normal mixture model ("MNM"), which dates at least to Lindgren (1978) and has since been applied in statistics and econometrics (Tyssedal and Tjøstheim, 1988; Chib, 1996; Ryden et al., 1998; Weigend and Shi, 2001). In the MNM model a latent state variable s_t takes on discrete values s_t = 1, ..., m0 and obeys a first-order discrete Markov process P(s_t = j | s_{t-1} = i) = p_{ij}, with

    y_t | (s_t = j) ~ N(μ_j, σ_j²).    (5)

The model is used here with m0 = 4 components, which is the choice made by Weigend and Shi (2001) using S&P 500 return data.
The final model is a generalization of the MNM model proposed in Geweke and Amisano (2007). This generalization replaces the normal distribution in (5) with a conventional finite mixture of normal distributions. The latent variable s_t becomes the first component s_{t1} of a bivariate latent state vector s_t = (s_{t1}, s_{t2}); thus P(s_{t1} = j | s_{t-1,1} = i) = p_{ij} (i, j = 1, ..., m1). For the second component

    P(s_{t2} = j | s_{t1} = i) = r_{ij} (i = 1, ..., m1; j = 1, ..., m2),
    y_t | (s_{t1} = i, s_{t2} = j) ~ N(μ_{ij}, σ_{ij}²) (i = 1, ..., m1; j = 1, ..., m2).

This generalization is termed the hierarchical normal mixture model ("HMNM") in Geweke and Amisano (2007). The model is used here with m1 = m2 = 5, the choice being made based on predictive likelihoods as explained in the next section. The HMNM model can also be regarded as a first-order MNM model with m0 = m1 · m2 states and with substantial structure imposed on the Markov transition matrix. Yet a third interpretation is that of an artificial neural network with two hidden layers. Geweke and Amisano (2007) provides further detail about the model, prior distributions, and the posterior simulation algorithm.
These five models are illustrative examples. Bayesian predictive distributions arise naturally in any complete model for time series y_t that specifies a conditional distribution of the form p(y_t | y_{t-1}, θ_A, A) and a prior distribution of the form p(θ_A | A).


3 Model comparison with predictive likelihood functions

The one-step-ahead predictive likelihood, which can be evaluated only at time t or later, is the real number

    PL_A(t) = p(y_t^o | y_{t-1}^o, A) = ∫ p(y_t^o | y_{t-1}^o, θ_A, A) p(θ_A | y_{t-1}^o, A) dθ_A.    (6)

In most time series models evaluation of p(y_t^o | y_{t-1}^o, θ_A, A) is straightforward, leading to the approximation of (6),

    M^{-1} Σ_{m=1}^{M} p(y_t^o | y_{t-1}^o, θ_{A,t-1}^{(m)}, A),    (7)

using an ergodic sequence {θ_{A,t-1}^{(m)}} from a posterior simulator.
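Approximation (7) is just a Monte Carlo average of conditional densities over posterior draws. A minimal sketch, again with a hypothetical Gaussian model and simulated "posterior" draws standing in for MCMC output:

```python
import numpy as np

rng = np.random.default_rng(1)

def normal_pdf(y, mu, sigma):
    """Density of N(mu, sigma^2) at y (vectorized over mu, sigma)."""
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Hypothetical draws theta^(m) from p(theta_A | y_{t-1}^o, A).
M = 20000
mu_draws = rng.normal(0.0, 0.05, size=M)
sigma_draws = np.sqrt(rng.gamma(40.0, 1.0 / 40.0, size=M))

y_obs = -1.2  # the realized return y_t^o

# Approximation (7): average the conditional density over the posterior draws.
pl_t = normal_pdf(y_obs, mu_draws, sigma_draws).mean()
log_pl_t = np.log(pl_t)
```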
For the data set y_T^o the marginal likelihood of the model A is

    p(y_T^o | A) = Π_{t=1}^{T} p(y_t^o | y_{t-1}^o, A),

implying the additive decomposition

    log p(y_T^o | A) = Σ_{t=1}^{T} log PL_A(t).    (8)

Given two competing models A1 and A2, the log Bayes factor may be decomposed as

    log [p(y_T^o | A1) / p(y_T^o | A2)] = Σ_{t=1}^{T} log [PL_{A1}(t) / PL_{A2}(t)],    (9)

where PL_{A1}(t) / PL_{A2}(t) is the predictive Bayes factor in favor of A1 over A2 for observation t. Predictive Bayes factors may be approximated using the output of a posterior simulator by means of (7). These approximations are usually quite accurate; the cost is that the posterior simulator must be executed for each time period t.
The decomposition (8) shows the intimate relationship between the evaluation of the predictive performance of a model by means of the predictive likelihood, on the one hand, and the evidence in favor of a model in the conventional Bayesian comparison of models by means of Bayes factors, on the other. The corresponding decomposition (9) shows how individual observations contribute to the evidence in favor of one model versus a second. See Geweke (2001) or Geweke (2005), Section 2.6.2, for further details and elaboration.
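Decompositions (8)-(9) are straightforward to compute once per-period log predictive likelihoods are in hand. In this sketch the per-period values are simulated placeholders rather than output of (7):

```python
import numpy as np

rng = np.random.default_rng(2)

# Placeholder per-period log predictive likelihoods for two models A1 and A2;
# in practice each term is the log of approximation (7) at time t.
T = 200
log_pl_a1 = rng.normal(-1.40, 0.30, size=T)
log_pl_a2 = rng.normal(-1.45, 0.30, size=T)

# (9): the log Bayes factor is the sum of the per-observation
# log predictive Bayes factors.
log_pbf = log_pl_a1 - log_pl_a2
log_bayes_factor = log_pbf.sum()

# Running total: the kind of cumulative evidence plotted in Figure 2.
cumulative = np.cumsum(log_pbf)
```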
A generalization of (8) is

    log p(y_T^o | y_S^o, A) = Σ_{t=S+1}^{T} log PL_A(t)    (10)

for S < T, and the corresponding generalization of (9) is

    log [p(y_T^o | y_S^o, A1) / p(y_T^o | y_S^o, A2)] = Σ_{t=S+1}^{T} log [PL_{A1}(t) / PL_{A2}(t)].    (11)

In (10) and (11) the cumulation of evidence begins at time t = S + 1 rather than at time t = 1. If one were to regard p(θ_A | y_S^o, A) as the prior distribution for θ_A – that is, if y_S^o were interpreted as a training sample – then (10) would have the same interpretation as (8) and (11) would have the same interpretation as (9). The analysis in the next section uses (10) and (11) with S = 1250 (about five years of data) and T = 8574, so that there are 7324 terms in the sums in these two expressions. The same sample is used for the analysis in Section 6. For small values of t, PL_A(t) is sensitive to the prior density p(θ_A | A), whereas for t ≥ 1250 the results reported here are for all practical purposes invariant with respect to substantial changes in the prior distribution. This result is unsurprising when one interprets y_S^o as a training sample: the information in these 1250 observations dominates the information in the original prior distribution.
The decomposition (11) shows how individual observations contribute to the evidence in favor of one model versus a second. For example, it may show that a few observations are pivotal in the evidence y_T^o strongly favoring one model over another. Comparison of the predictive Bayes factors PL_{A1}(t) / PL_{A2}(t) with characteristics of the sample {y_s^o} for s = t and observations s leading up to t can provide insight into why the evidence favors one model over the other. The comparison can be carried out using predictions over horizons greater than one period, but the decomposition for multiple-period horizons is exactly the same as that for single-period horizons, as explained in Geweke (2001) and Geweke (2005), Section 2.6.2.
The generalization (10) of the marginal likelihood (8) amounts to the evaluation of the predictive densities p(y_t | y_{t-1}^o, A) (t = S + 1, ..., T) using a log scoring rule; see Gneiting and Raftery (2007), Section 7. Non-Bayesian predictive densities, like (3) and (4), may also be evaluated using a log scoring rule. In the case of (3), for example, the log score

    Σ_{t=S+1}^{T} log p(y_t^o | y_{t-1}^o, θ̂_A^{(t-1)}, A)

is directly comparable with (10). It is therefore possible to compare Bayesian and non-Bayesian methods directly by means of their difference in log scores,

    Σ_{t=S+1}^{T} log [p(y_t^o | y_{t-1}^o, A) / p(y_t^o | y_{t-1}^o, θ̂_A^{(t-1)}, A)].    (12)

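The difference in (12) tends to favor the posterior-averaged density precisely where parameter uncertainty matters, such as tail realizations. A hedged one-date illustration, with a fixed-mean Gaussian model and hypothetical posterior draws of the scale parameter:

```python
import numpy as np

rng = np.random.default_rng(3)

def normal_pdf(y, mu, sigma):
    """Density of N(mu, sigma^2) at y (vectorized over sigma)."""
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Hypothetical posterior draws of sigma versus a point (e.g. ML) estimate.
M = 50000
sigma_draws = np.sqrt(rng.gamma(20.0, 1.0 / 20.0, size=M))
sigma_hat = 1.0

y_obs = 2.5  # a tail realization, where the averaging matters most

# Bayesian log score: log of the posterior-averaged density, as in (6)-(7).
log_score_bayes = np.log(normal_pdf(y_obs, 0.0, sigma_draws).mean())
# Plug-in log score, as in (3): the density at the point estimate only.
log_score_plugin = np.log(normal_pdf(y_obs, 0.0, sigma_hat))

# One term of (12); positive values favor the Bayesian predictive density.
diff = log_score_bayes - log_score_plugin
```

The density is convex in the variance this far into the tail, so averaging over posterior scale draws raises the density at y_obs relative to plugging in the point estimate.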

4 Comparison of five models of S&P 500 returns

Figure 1 shows the familiar S&P 500 percent log return series for the period beginning with December 15, 1976, corresponding to S + 1 = 1251, and ending with December 16, 2005, corresponding to T = 8574. (Since the data set goes through the end of 2005, and because the analysis in Section 6 utilizes prediction horizons of up to 10 trading days, this exercise ends short of the last trading day of 2005.) Symbols in this figure identify particular dates for reference in the analyses that follow in this section.

Figure 1: S&P 500 percent log return observations for which predictive likelihood was evaluated. The symbols identify ten specific observations.
Corresponding to (11), the cumulative log predictive Bayes factor through period r, in favor of model A1 over model A2, is

    log [p(y_r^o | y_S^o, A1) / p(y_r^o | y_S^o, A2)] = Σ_{t=S+1}^{r} log [PL_{A1}(t) / PL_{A2}(t)].    (13)

Figure 2 shows these cumulative log predictive Bayes factors for r = S + 1, ..., T. For each prediction model posterior inference was carried out by Markov chain Monte Carlo in each of 7324 samples, applying (7) to approximate PL_A(t) = p(y_t^o | y_{t-1}^o, A). In each panel the comparison model A2 is GARCH, and the other model is the one indicated. All of these results are out-of-sample: that is, PL_A(t) reflects inference for the parameter vector θ_A using the sample consisting of observations 1, ..., t - 1.
The right endpoint of the plotted points in each panel of Figure 2 provides (13) with r = T. For SV versus GARCH the value is 144.36, for t-GARCH 208.44, for MNM 151.65, and for HMNM 199.31. The evidence strongly favors the t-GARCH and HMNM models, with SV and MNM rounding out the rankings. More than one-third of the log predictive likelihood in favor of each of the other four models over GARCH is due to returns on just two days: the record log return of -22.9% on October 19, 1987, and the log return of -6.3% on October 13, 1989. The returns of -3.7% on November 15, 1991, -7.1% on October 27, 1997, and -7.0% on August 31, 1998 also lead to predictive likelihoods for those days that strongly favor the other four models over GARCH.

Figure 2: Cumulative log predictive Bayes factors in favor of each of four models over GARCH. Symbols identify dates as indicated in Figure 1.
Figure 3 provides further comparison of the predictive performance of the t-GARCH and HMNM models as measured by predictive likelihoods. The sequence of cumulative log predictive Bayes factors, panel (a), is not dominated by any single date. Until May 23, 1984, predictive Bayes factors on average favor t-GARCH. From then until November 27, 1987, they favor HMNM on average. From July 20, 1993, through the end of 2005 predictive Bayes factors again favor t-GARCH on average. Log predictive Bayes factors for all ten individual dates marked by symbols in Figure 1 can be read from panels (b) and (d) of Figure 3.
Panel (b) shows all the log predictive likelihoods for the two models. Combinations above the 45° line favor the HMNM model and those below it favor t-GARCH. The symbols specifically designate all combinations for which the t-GARCH log predictive likelihood was less than -8 or the log predictive Bayes factor in favor of HMNM was less than -1.5. (That is how the dates indicated in Figure 1 were selected.) The record return of October 19, 1987, has by far the lowest log predictive likelihood in the t-GARCH model, whereas October 13, 1989, has the lowest predictive likelihood in the HMNM model. Panel (d) of Figure 3 shows that there is no simple relationship between returns that are large in magnitude and log predictive Bayes factors, and comparison of panels (b) and (d) shows that there is no simple relationship between these returns and log predictive likelihoods. Panels (b) through (d) show that for most days the predictive Bayes factor in favor of one model or the other is small.

Figure 3: Some comparisons of the t-GARCH and HMNM models using the log predictive Bayes factor in favor of HMNM. Symbols identify dates as indicated in Figure 1.
Panels (c) and (d) show a weak but systematic relationship between absolute
returns and log predictive Bayes factors: the HMNM model tends to be favored by
log Bayes factors when returns are less than 0.5% in magnitude, whereas t-GARCH
tends to be favored when return magnitude is between 0.5% and 1%. As return
magnitude rises above 1% the range of log predictive Bayes factors tends to increase,


with no systematic tendency for one model or the other to be favored. The important
exception to this pattern is October 19, 1987.
The exploratory analysis illustrated in Figure 3 can be used to compare the predictive performance of the two models (as captured by log predictive Bayes factors) and any function of returns over the preceding days. In all five models more volatile recent returns lead to greater dispersion in predictive distributions, but the mechanisms are distinct – especially in the HMNM model as opposed to models in the ARCH family. This characteristic of the models suggests that the magnitude of the return relative to recent magnitudes might be systematically related to log predictive Bayes factors. Figure 4 pursues this analysis, capturing return relative to recent magnitudes as the ratio of |y_t^o| to the standard deviation in {y_s^o} (s = t - 80, ..., t - 1). Call this ratio q.

Figure 4: Comparison of the ratio of absolute return to the standard deviation of returns in the past 80 days, with the predictive Bayes factor for HMNM over t-GARCH. Symbols identify dates as indicated in Figure 1.

For the ten dates initially identified in Figure 3(b), the correlation between q and the corresponding log predictive Bayes factors exceeds 0.9: the near-linear relationship is evident in panel (a) of Figure 4. Panel (b) plots log predictive Bayes factors against q for all days on which q is less than 4. (The rescaling of the vertical axis in panel (b) excludes no days.) The pattern in Figure 4(b) is similar to that in Figure 3(c).
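The ratio plotted in Figure 4 is easy to reproduce for any return series; in this sketch the series itself is simulated rather than the actual S&P 500 data:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical daily percent log returns standing in for the S&P 500 series.
y = rng.normal(0.0, 1.0, size=500)

window = 80
# q_t: absolute return divided by the standard deviation of the
# previous 80 returns, as constructed for Figure 4.
q = np.array([abs(y[t]) / y[t - window:t].std() for t in range(window, len(y))])
```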
As measured by log predictive Bayes factors, the predictive performance of the HMNM model dominates that of the t-GARCH model when returns are very large in magnitude relative to recent volatility – that is, for returns whose absolute value exceeds five standard deviations of returns over the past 80 days. Equivalently, extreme returns that occur roughly once or twice per decade are assigned substantially more probability in the HMNM model than in the t-GARCH model. Overall, the log predictive likelihoods of the two models are nearly identical. Elsewhere (Geweke and Amisano, 2008) we have shown that this implies that neither model corresponds to a true data generating process D, and there must exist models with higher log predictive likelihoods. Figure 4 suggests that in such models the predictive density function might resemble more the HMNM predictive density for returns that are small or quite large relative to recent volatility, and for the remainder might resemble more the t-GARCH model.
For the GARCH and t-GARCH models we prepared an alternative set of predictive densities for each of the 7324 days, using (3) and the maximum likelihood estimates θ̂_A^{(t-1)}. We then compared the Bayesian and maximum likelihood (ML) predictive densities using the difference in log scores (12). For the GARCH model the outcome is 18.09 and for the t-GARCH model it is 9.76: both comparisons favor the Bayesian predictive densities over the MLE predictive densities.
This outcome is not surprising. Bayesian predictive densities account for parameter uncertainty, whereas the MLE predictive density (3) does not. Figure 5 provides the daily decomposition of (12). One would expect the advantage of Bayesian predictive densities to be more pronounced in smaller samples, corresponding to earlier data in Figure 5, in which parameter uncertainty is relatively more important. The daily decomposition for t-GARCH is at least roughly consistent with this understanding. The decomposition for GARCH is dominated by the inferior performance of the MLE predictive density, relative to the Bayesian predictive density, on October 19, 1987.

Figure 5: Daily decomposition of (12) for the GARCH and t-GARCH models.


5 Model evaluation with probability integral transforms

Predictive likelihoods are local measures of the predictive performance of models: that is, they depend only on the predictive probability density evaluated at the realized return. Moreover, predictive densities measure only the relative performance of models – indeed, as discussed in Section 3 they are components of Bayes factors that are critical in Bayesian model comparison and model averaging.
The probability integral transform (PIT) provides an alternative assessment of the predictive performance of a model that is based on non-local assignment of predictive probability and is therefore complementary to the assessment based on predictive likelihoods discussed in the previous two sections. Unlike predictive likelihoods, however, the comparison is non-Bayesian, because the ideal i.i.d. uniform distribution of the PIT is ex ante and conditions on model parameters.
Suppose that a model A assigns one-step-ahead predictive densities p(y_t | y_{t-1}, A). Denote the corresponding sequence of cumulative distribution functions

    F(c | y_{t-1}, A) = P(y_t ≤ c | y_{t-1}, A).

The PIT corresponding to the model A and the sequence {y_t} is p_1(t, y_T; A) = F(y_t | y_{t-1}, A). If A = D, the true data generating process for {y_t}, then the sequence {p_1(t, y_T; A)} is distributed i.i.d. uniform (0, 1) ex ante. This result dates at least to Rosenblatt (1952), and was brought to the attention of the econometrics community by Diebold et al. (1998). Following Smith (1985) and Berkowitz (2001), if A = D then

    f_1(t, y_T; A) = Φ^{-1}[p_1(t, y_T; A)] ~ N(0, 1),

and for analytical purposes it is often more convenient to work with {f_1(t, y_T; A)} than with {p_1(t, y_T; A)}. For h-step-ahead predictive distributions, let F_h(c | y_{t-h}, A) = P(y_t ≤ c | y_{t-h}, A) and p_h(t, y_T; A) = F_h(y_t | y_{t-h}, A). If A = D then the distribution of p_h(t, y_T; A) is uniform on (0, 1) ex ante, but p_h(t, y_T; A) and p_h(s, y_T; A) are independent ex ante if and only if |t - s| ≥ h.
These characteristics of {p_h(t, y_T; D)} would only be approximately true of {p_h(t, y_T; A)} even if p(y_T | D) = p(y_T | θ_A, A) for some value of θ_A, because θ_A is unobservable. (The approximation would improve as T increased and p(y_T | A) incorporated an increasingly tight posterior distribution for θ_A.) More important, we know that A ≠ D. The departure of the sequence {p_h(t, y_T; A)} from these ideal characteristics provides an informal evaluation of A against an absolute standard.
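The uniformity of the PIT under a correctly specified model is easy to verify by simulation. In this illustrative sketch the data are generated from i.i.d. N(0, 1), the model A coincides with D, and the decile frequencies settle near 0.10:

```python
import math

import numpy as np

rng = np.random.default_rng(5)

# Generate returns from a known process D (i.i.d. standard normal, purely
# for illustration) and apply the PIT under the same model, so A = D.
T = 4000
y = rng.normal(0.0, 1.0, size=T)

def std_normal_cdf(x):
    """Phi(x) via the error function, avoiding extra dependencies."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

pit = np.array([std_normal_cdf(v) for v in y])  # p_1(t, y_T; A)

# Ex ante the PIT sequence is i.i.d. uniform(0, 1), so each decile
# should contain close to 10% of the observations.
decile_freq = np.histogram(pit, bins=10, range=(0.0, 1.0))[0] / T
```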
In many models, including all five in this study, analytical evaluation of P(y_t ≤ c | y_{t-1}^o, θ_A, A) is possible, and this is all that is required for a posterior simulation approximation of F(y_t^o | y_{t-1}^o, A), because

    M^{-1} Σ_{m=1}^{M} P(y_t ≤ y_t^o | y_{t-1}^o, θ_{A,t-1}^{(m)}, A) → p_1(t, y_T^o; A)    (14)

so long as {θ_{A,t-1}^{(m)}} is an ergodic sequence whose invariant distribution is the posterior. For h > 1 analytical evaluation of P(y_t ≤ c | y_{t-h}^o, θ_A, A) is very awkward or impossible in all of these models, and that is true of econometric prediction models generally. Instead we employ a simulation approximation using the recursion

    y_s^{(m)} ~ p(y_s | y_{t-h}^o, y_{t-h+1}^{(m)}, ..., y_{s-1}^{(m)}, θ_A^{(m)}, A)    (s = t - h + 1, ..., t).

Since the posterior MCMC sample is large, only one such recursion need be carried out for each θ_A^{(m)} in the posterior sample, and

    M^{-1} Σ_{m=1}^{M} I_{(-∞, y_t^o]}(y_t^{(m)}) → p_h(t, y_T^o; A),    (15)

where I_S(x) = 1 if x ∈ S and I_S(x) = 0 if x ∉ S.
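The h-step recursion and the indicator average (15) can be sketched for a hypothetical Gaussian AR(1) model with known parameters; in the paper each simulated path would instead condition on a posterior draw θ_A^{(m)}:

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical Gaussian AR(1): y_s = phi * y_{s-1} + eps_s, eps_s ~ N(0, sigma^2).
phi, sigma = 0.3, 1.0
h, M = 5, 2000

y_hist = 0.8   # y_{t-h}^o, the last observed value
y_obs = -0.4   # the realized y_t^o

# Recursion: simulate each of M paths forward h steps from y_{t-h}^o.
y_sim = np.full(M, y_hist)
for _ in range(h):
    y_sim = phi * y_sim + rng.normal(0.0, sigma, size=M)

# (15): the fraction of simulated paths at or below y_t^o approximates
# p_h(t, y_T^o; A), the h-step PIT value.
p_h = np.mean(y_sim <= y_obs)
```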


6 Evaluation of five models of S&P 500 returns

For the S&P 500 return series y_T^o we evaluated p_h(t, y_T^o; A) for each of the five models A described in Section 2, for h = 1, ..., 10, and for t = 1251, ..., 8574 = T, where T is the entire sample size. The computations are based on the 7324 Markov chain Monte Carlo posterior samples in each of the five models, one for each sample, using (14) and (15) for t = 1251, ..., 8574. We then computed the corresponding transformation to the standard normal distribution, f_h(t, y_T^o; A) = Φ^{-1}[p_h(t, y_T^o; A)] (t = 1251, ..., 8574).
To describe the departure from the PIT paradigm for each model, we determined the average values of

    I_{(a_{j-1}, a_j]}[p_h(1 + (s - 1)h, y_T^o; A)] (j = 1, ..., n; h = 1, ..., 10),    (16)

here with n = 10 and a_j = j/10; that is, we determined the fraction of p_h(t, y_T^o; A) in each decile using non-overlapping prediction horizons h. Figure 6 presents the values of (16) for a different model A in each of the five rows of panels, and for h = 1, 5 and 10 in each of the three columns of panels. Values for the deciles j = 1, ..., 10 are shown in each panel. The paradigm value 0.10 is indicated by the solid horizontal line, and the dotted horizontal lines provide a conventional 95% confidence interval for these values under the condition that {p_h(1 + (s - 1)h, y_T^o; A)} (s = 1, ..., [T/h]) is i.i.d. Bernoulli (p = 0.1).

Figure 6: Each panel shows the frequency of observations occurring in each decile of the predictive distribution (16), as a function of the prediction horizon h indicated on the horizontal axis. Solid lines indicate results using Bayesian inference, dotted lines results using maximum likelihood in the GARCH and t-GARCH models. Dashed lines provide centered 95% confidence intervals for the individual deciles under the PIT paradigm.

Plotted values above 0.1 indicate deciles in which more than 10% of the realized returns occurred; equivalently, the model underpredicts probability in this range of the predictive density. Values below 0.1 correspond to overprediction in the relevant range.
The performance of the GARCH model is markedly inferior to the other four, and the performance of SV is not quite as good as the t-GARCH, MNM and HMNM models. The tendency of the latter three models to over- or under-predict different deciles is roughly the same for all three horizons studied in Figure 6. At horizon h = 1 they assign too little probability in the interquartile range and too much in the lower and upper quartiles. At h = 5 and h = 10 they assign too much probability below the median (with an exception for the lowest decile in some cases) and too little above the median of their predictive distributions. These characteristics are also evident in the GARCH model, where the performance is markedly poorer, especially for h = 10.

Table 1: Formal PIT goodness of fit tests ("Deciles" and "Left tail" headings). Cell entries are p-values for the tests.

Table 1 provides two sets of chi-square goodness-of-fit test results for the PIT.
The entries in the table correspond to p-values for the tests. The results for horizon
h = 1 are conventional. The results for horizons h = 5 and h = 10 are based on h
separate tests, using (16) for k = 1, ..., h. These tests are not independent across k
because the horizons overlap; the entries in Table 1 are Bonferroni p-values for the h
separate tests in each case.
The first set of tests, under the "Deciles" heading, corresponds to the results shown
in Figure 6. The results reinforce the conclusion that the PIT fit of the GARCH and
SV models is inferior to those of the other three models. At horizon h = 1, only
the HMNM model comes close to passing a PIT goodness-of-fit test at conventional
significance levels. At longer horizons results are mixed, perhaps because there are
fewer intervals with nonoverlapping predictions and therefore the tests have lower
power.
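The Bonferroni construction for overlapping horizons can be sketched as follows. The function names are ours, and the Pearson chi-square test on decile counts is one standard way to implement a test based on (16), not necessarily the exact variant used in the paper:

```python
import numpy as np
from scipy.stats import chisquare

def pit_chisq_pvalue(pit, n_bins=10):
    """Pearson chi-square p-value for uniformity of PIT values on (0, 1)."""
    counts, _ = np.histogram(np.asarray(pit),
                             bins=np.linspace(0.0, 1.0, n_bins + 1))
    return chisquare(counts).pvalue  # equal expected counts by default

def bonferroni_pit_pvalue(pit, h, n_bins=10):
    """For horizon h, test each of the h non-overlapping subsequences
    u_k, u_{k+h}, u_{k+2h}, ... (k = 1, ..., h) separately, then report
    the Bonferroni-adjusted p-value min(1, h * min_k p_k)."""
    pit = np.asarray(pit)
    pvals = [pit_chisq_pvalue(pit[k::h], n_bins) for k in range(h)]
    return min(1.0, h * min(pvals))
```

Splitting into every-h-th subsequences sidesteps the dependence induced by overlapping forecast windows, at the cost of the conservatism inherent in the Bonferroni bound.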
Evaluation of the predictive distributions over particular regions may be of concern
in specific applications, particularly for negative returns; see the discussion in Diks
et al. (2008). Our second set of tests explores PIT goodness of fit in the lower tail of
the predictive distribution, based on (16) with n = 5, a_j = j/200 (j = 0, ..., n), and
h = 1, 5 and 10. As in the first set of tests, conventional p-values are given for h = 1
and Bonferroni p-values are given for h = 5 and h = 10. The power of these
tests is, of course, much lower than that of the decile tests. The GARCH and SV models fail
at horizon h = 1, MNM fails at h = 5, and GARCH again at h = 10; HMNM has
some difficulty at horizon h = 5. There is little in these results to suggest that the
evaluation of left-tail performance is very different from overall performance using
the decile tests.
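One plausible reading of this left-tail test uses cells (a_{j-1}, a_j] with a_j = j/200 covering the lowest 2.5% of the predictive distribution, plus a single catch-all cell for the remainder of the unit interval. This construction is our own sketch; the paper may allocate the remaining probability differently:

```python
import numpy as np
from scipy.stats import chisquare

def left_tail_pvalue(pit, n_cells=5, cell_prob=1.0 / 200.0):
    """Chi-square test on tail cells (a_{j-1}, a_j], a_j = j * cell_prob,
    plus one catch-all cell covering the rest of (0, 1)."""
    pit = np.asarray(pit)
    edges = np.arange(n_cells + 1) * cell_prob      # 0, .005, ..., .025
    counts, _ = np.histogram(pit, bins=edges)
    obs = np.append(counts, pit.size - counts.sum())
    exp = np.append(np.full(n_cells, pit.size * cell_prob),
                    pit.size * (1.0 - n_cells * cell_prob))
    return chisquare(obs, exp).pvalue
```

Because the expected count in each tail cell is only T/200, the test has much lower power than the decile test unless T is large, consistent with the caveat above.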
Figures 7 and 8 provide some additional evidence on the relationship of the predictive
distributions to the PIT paradigm. Each panel plots a different transformation
of p_h(t; y_T^o; A) as a function of the prediction horizon h shown on the horizontal
axis.

Figure 7: Each panel plots a transformation of p_h(Y_T; A) as a function of the
prediction horizon h shown on the horizontal axis. Solid lines indicate results using
Bayesian inference; dotted lines indicate results using maximum likelihood in the
GARCH and t-GARCH models. The dashed lines provide the .01, .05, .50, .95 and
.99 quantiles of the transformation under the PIT paradigm.

Figure 7(a) displays the interdecile range for each model: for each model A and
each horizon h, it is the difference between the maximum and minimum values of
(16) taken over j = 1, ..., 10. For each combination of A and h, larger values of
the interdecile range constitute greater evidence against the PIT paradigm. The
MNM and HMNM interdecile ranges display greater consistency with PIT than do
the ranges for the other three models.
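The interdecile-range statistic of Figure 7(a) reduces to a one-liner on the decile frequencies (a sketch; the function name is ours):

```python
import numpy as np

def interdecile_range(pit):
    """Max minus min of the ten decile frequencies of the PIT values;
    larger values are stronger evidence against the PIT paradigm."""
    counts, _ = np.histogram(np.asarray(pit), bins=np.linspace(0.0, 1.0, 11))
    freq = counts / counts.sum()
    return freq.max() - freq.min()
```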
The remaining panels of Figures 7 and 8 pertain to f_h(t; y_T^o; A) =
Φ^{-1}[p_h(t; y_T^o; A)]. Under the PIT paradigm the values

f_h(1 + (s - 1)h; y_T^o; A)   (s = 1, 2, ...)

are realizations of an i.i.d. standard normal process, implying that there are many
functions of f_h(1 + (s - 1)h; y_T^o; A) with well-established distributions. Panels (b)
of Figure 7 through (b) of Figure 8 pertain to the first four moments of f_h(t; y_T^o; A),
and therefore provide indications of the discrepancy between the predictive and observed
distributions. Figure 8(f) pertains to the first-order autocorrelation coefficient of (17)
and therefore provides an indication of the departure from the independence of successive
quantiles implied by PIT. Unlike the interdecile range in Figure 7(a), both large and
small values of the statistics in Figure 7(b, c) and Figure 8 constitute evidence that
the PIT paradigm is not appropriate.
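The normalized PIT f = Φ^{-1}(u) and the moment diagnostics examined in these panels can be sketched as follows (the function and its return keys are ours; under the PIT paradigm the benchmarks are mean 0, variance 1, skewness 0, kurtosis 3 and zero lag-1 autocorrelation):

```python
import numpy as np
from scipy.stats import norm

def normalized_pit_diagnostics(pit):
    """First four moments and lag-1 autocorrelation of f = Phi^{-1}(u).

    If the model's predictive distributions are correct, f is i.i.d.
    standard normal: mean ~ 0, var ~ 1, skew ~ 0, kurt ~ 3, rho1 ~ 0.
    """
    f = norm.ppf(np.asarray(pit))
    m, v = f.mean(), f.var()
    return {
        "mean": m,
        "var": v,
        "skew": ((f - m) ** 3).mean() / v ** 1.5,
        "kurt": ((f - m) ** 4).mean() / v ** 2,
        "rho1": np.corrcoef(f[:-1], f[1:])[0, 1],
    }
```

Each statistic can then be compared against the quantiles of its sampling distribution under i.i.d. normality, which is what the dashed bands in Figures 7 and 8 display.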

Figure 8: Each panel plots a transformation of p_h(Y_T; A) as a function of the
prediction horizon h shown on the horizontal axis. Solid lines indicate results using
Bayesian inference; dotted lines indicate results using maximum likelihood in the
GARCH and t-GARCH models. The dashed lines provide the .01, .05, .50, .95 and
.99 quantiles of the transformation under the PIT paradigm.

The evaluations of the t-GARCH, MNM and HMNM model fits in columns (b)
and (c) of Figure 7 are all similar. Means are slightly and insignificantly higher
than zero at all horizons. Variances are greater than one, with significant departures
for t-GARCH and marginally significant departures for MNM. The distribution for
t-GARCH is significantly negatively skewed (Figure 8(a)), whereas for MNM and
HMNM it is insignificantly positively skewed. The evaluation of all three models
using the kurtosis (Figure 8(b)) of the normalized PIT is satisfactory. By contrast,
the GARCH and stochastic volatility models both have severe departures from the
PIT-paradigm skewness and kurtosis of the normalized PIT.
The PIT evaluations of the ML and Bayesian predictive distributions do not provide
any striking systematic comparisons, in either the GARCH or the t-GARCH model.
Posterior distributions are concentrated about the maximum likelihood estimates,
and consequently deciles of the predictive distributions are much the same in the
Bayesian and ML predictive distributions.
For one-step-ahead predictive distributions (h = 1) the first-order autocorrelation
coefficient for the series (17) is about 0.06 in all five models, well outside the range
of values plausible under PIT. For larger values of h the autocorrelation coefficient
is smaller, with a notable tendency to be negative, in most cases, for horizons 3 or
greater. Moreover, the pattern of evaluations across horizons is roughly the same in
all cases. These results suggest that there is some persistence in day-to-day returns
that is not adequately captured by any of the five models.
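To see why 0.06 lies well outside the plausible range: under independence, the lag-1 sample autocorrelation of a series of length T is approximately N(0, 1/T), so the two-sided 95% band is about ±1.96/√T. A sketch (the figure of roughly 7,000 one-step predictions is our approximation to the paper's out-of-sample period):

```python
import math

def acf_null_band(T, z=1.96):
    """Half-width of the approximate 95% band for the lag-1 sample
    autocorrelation of an i.i.d. series of length T."""
    return z / math.sqrt(T)

# With about 7,000 observations the band is roughly +/- 0.023, so an
# estimated coefficient near 0.06 decisively rejects independence.
band = acf_null_band(7000)
```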


Summary and conclusions

This study compares and evaluates Bayesian predictive distributions from alternative
models, using as an illustration five alternative models of asset returns applied to a
time series of 7324 daily S&P 500 log returns. The comparison exercise uses predictive
likelihoods and is inherently Bayesian. The evaluation exercise uses the probability
integral transform (PIT) and is inherently frequentist. The illustration shows that
the two approaches can be complementary, each identifying strengths and weaknesses
in models that are not evident using the other.
Both the predictive likelihood and PIT analyses lead to the conclusion that the
GARCH Bayesian predictive distributions are inferior to the other four, and the same
is true of GARCH predictive distributions constructed from maximum likelihood
estimates. Each analysis also leads to the conclusion that the Bayesian stochastic
volatility (SV) and Markov normal mixture (MNM) predictive distributions are
dominated by the t-GARCH and hierarchical Markov normal mixture (HMNM)
predictive distributions. These comparisons are readily apparent in Figures 2, 6, 7
and 8, and in Table 1.
In the predictive likelihood analysis Bayes t-GARCH is favored over HMNM by
about 10 points (Figure 2), the same as its margin over ML t-GARCH (Figure 5).
Therefore the predictive likelihoods for ML t-GARCH and HMNM are nearly identical.
By contrast, the PIT analysis narrowly favors the HMNM predictive distributions
over those of Bayes t-GARCH. This is evident in Table 1 and Figures 6, 7 and 8.
However, the latter analysis also shows that the normalized PIT for the HMNM
model is not ideal. In particular, PITs for successive one-day predictions are not
independent, and the performance of HMNM in this dimension is no better than
that of any of the other four models.
A new predictive density can always be formed as a weighted average of predictive
densities from different models, the best-known example being Bayesian model
averaging (Geweke, 2005, Section 2.6). The analysis in Section 4 indicates that, for
most combinations of models and substantial subperiods of the sample considered
in this study, Bayesian model averaging is for all practical purposes equivalent to
model selection, with one model receiving a weight very close to 1. This is often
the outcome for Bayesian model averaging when the sample is large. The notable
exception arises when the models averaged include both t-GARCH and HMNM: in
that case these two models can have substantial weight in Bayesian model averaging,
depending on the days included in the sample. Geweke and Amisano (2008) show
that a weighted average of the HMNM and t-GARCH models compares quite
favorably with both models, using predictive likelihood. That paper also shows that, in
general, optimization of the log score function for multiple models leads to non-trivial
weights on several models, weights that are quite different from those that result from
conventional Bayesian model averaging.
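The contrast between log-score-optimal pooling and Bayesian model averaging can be illustrated for two models. Given each model's one-step predictive densities evaluated at the realized returns, the BMA weight is proportional to the product of those densities (and so collapses toward 0 or 1 in large samples), while the optimal pool maximizes the average log score of the mixture. A sketch under these assumptions (the function names and the grid search are ours):

```python
import numpy as np

def optimal_pool_weight(p1, p2, grid=1001):
    """Log-score-optimal weight w on model 1 in the pool w*p1 + (1-w)*p2.

    p1, p2: arrays of predictive densities evaluated at the realized
    returns. Simple grid search over w in [0, 1].
    """
    p1, p2 = np.asarray(p1), np.asarray(p2)
    ws = np.linspace(0.0, 1.0, grid)
    scores = [np.log(w * p1 + (1.0 - w) * p2).sum() for w in ws]
    return float(ws[int(np.argmax(scores))])

def bma_weight(p1, p2):
    """Posterior BMA weight on model 1 under equal prior probabilities,
    computed from the log predictive likelihoods."""
    ll1, ll2 = np.log(p1).sum(), np.log(p2).sum()
    return 1.0 / (1.0 + np.exp(ll2 - ll1))
```

When each model predicts well on different days, the optimal pool places interior weight on both while BMA still concentrates on the model with the higher predictive likelihood, which is the pattern the paragraph above describes.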
Analysis of the kind in this article provides guidance for model improvement. This
can be seen in the context of the application used in this study, looking both to the
past and to the future. Of the five models considered here, GARCH was the earliest
to be applied systematically to financial returns, followed shortly by SV. While
predictive likelihood analysis shows that SV substantially improves on GARCH, the
PIT indicates that neither wholly captures the distribution of S&P 500 returns,
especially for extreme events. This provides an agenda for model improvement,
addressed by the t-GARCH and MNM models. The predictive likelihood and PIT
analyses indicate that t-GARCH goes far in addressing the problems of GARCH and
SV, whereas MNM makes little progress. HMNM adds even more flexibility to MNM
and substantially outperforms it in both the predictive likelihood and PIT analyses.
Looking to the future, the persistent dependence in the PIT transformation in all of
the models considered emphasizes the need for more adequate dynamics. For example,
this could take the form of a t-GARCH(p, q) model. Awaiting such new research,
real-world demands for predictive distributions will continue. In this context, analyses
of predictive distributions of the kind conducted here provide indications of the model
limitations with which actual decisions based on these models must always be tempered.
The authors acknowledge many useful comments that have substantially improved
the paper from several seminar presentations and especially from the co-editor and
the referees. Responsibility for any remaining errors or points of confusion rests with
the authors. Geweke acknowledges financial support from US National Science
Foundation Grant SBR-0720547 and Amisano from the Italian Ministero dell'Istruzione,
dell'Università e della Ricerca.

References

Amisano, G., & Giacomini, R. (2007). Comparing density forecasts via weighted likelihood ratio tests. Journal of Business and Economic Statistics, 25, 177-190.
Bauwens, L., Giot, P., Grammig, J., & Veredas, D. (2004). A comparison of financial duration models via density forecasts. International Journal of Forecasting, 20, 589-609.
Berkowitz, J. (2001). Testing density forecasts with applications to risk management. Journal of Business and Economic Statistics, 19, 465-474.
Bollerslev, T. (1986). Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics, 31, 307-327.
Chib, S. (1996). Calculating posterior distributions and modal estimates in Markov mixture models. Journal of Econometrics, 75, 79-97.
Christoffersen, P. F. (1998). Evaluating interval forecasts. International Economic Review, 39, 841-862.
Corradi, V., & Swanson, N. R. (2006). Predictive density evaluation. In: G. Elliott et al. (Eds.), Handbook of Economic Forecasting. Amsterdam: North-Holland, pp. 197-284.
Diebold, F. X., Gunther, T. A., & Tay, A. S. (1998). Evaluating density forecasts with applications to financial risk management. International Economic Review, 39.
Diks, C., Panchenko, V., & van Dijk, D. (2008). Partial likelihood-based scoring rules for evaluating density forecasts in the tails. Tinbergen Institute Discussion paper.
Geweke, J. (2001). Bayesian econometrics and forecasting. Journal of Econometrics, 100, 11-15.
Geweke, J. (2005). Contemporary Bayesian Econometrics and Statistics. Hoboken: Wiley.
Geweke, J., & Amisano, G. (2007). Hierarchical Markov normal mixture models with applications to financial asset returns. European Central Bank working paper.
Geweke, J., & Amisano, G. (2008). Optimal prediction pools.
Geweke, J., & Whiteman, C. (2006). Bayesian forecasting. In: C. W. J. Granger et al. (Eds.), Handbook of Economic Forecasting. Amsterdam: North-Holland.
Gneiting, T., & Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102, 359-378.
Hong, Y. M., Li, H. T., & Zhao, F. (2004). Out of sample performance of discrete time spot interest rate models. Journal of Business and Economic Statistics, 22.
Jacquier, E., Polson, N. G., & Rossi, P. E. (1994). Bayesian analysis of stochastic volatility models. Journal of Business and Economic Statistics, 12, 371-389.
Lindgren, G. (1978). Markov regime models for mixed distributions and switching regressions. Scandinavian Journal of Statistics, 5, 81-91.
Rosenblatt, M. (1952). Remarks on a multivariate transformation. Annals of Mathematical Statistics, 23, 470-472.
Rydén, T., Teräsvirta, T., & Åsbrink, S. (1998). Stylized facts of daily return series and the hidden Markov model. Journal of Applied Econometrics, 13, 217-244.
Smith, J. Q. (1985). Diagnostic checks of non-standard time series models. Journal of Forecasting, 4, 283-291.
Tyssedal, J. S., & Tjøstheim, D. (1988). An autoregressive model with suddenly changing parameters and an application to stock market prices. Applied Statistics, 37, 353-369.
Weigend, A. S., & Shi, S. (2000). Predicting daily probability distributions of S&P 500 returns. Journal of Forecasting, 19, 375-392.