FEDERAL RESERVE BANK OF SAN FRANCISCO WORKING PAPER SERIES

Robust Bond Risk Premia

Michael D. Bauer, Federal Reserve Bank of San Francisco
James D. Hamilton, University of California, San Diego

January 2016

Working Paper 2015-15
http://www.frbsf.org/economic-research/publications/working-papers/wp2015-15.pdf

Suggested citation: Bauer, Michael D., and James D. Hamilton. 2015. “Robust Bond Risk Premia.” Federal Reserve Bank of San Francisco Working Paper 2015-15. http://www.frbsf.org/economic-research/publications/working-papers/wp2015-15.pdf

The views in this paper are solely the responsibility of the authors and should not be interpreted as reflecting the views of the Federal Reserve Bank of San Francisco or the Board of Governors of the Federal Reserve System.

Robust Bond Risk Premia∗

Michael D. Bauer† and James D. Hamilton‡

April 16, 2015. Revised: January 20, 2016

Abstract

A consensus has recently emerged that variables beyond the level, slope, and curvature of the yield curve can help predict bond returns. This paper shows that the statistical tests underlying this evidence are subject to serious small-sample distortions. We propose more robust tests, including a novel bootstrap procedure specifically designed to test the “spanning hypothesis.” We revisit the evidence in five published studies, find most rejections of the spanning hypothesis to be spurious, and conclude that the current consensus is wrong. Only the level and the slope of the yield curve are robust predictors of bond returns.

Keywords: yield curve, spanning, return predictability, robust inference, bootstrap

JEL Classifications: E43, E44, E47

∗ The views expressed in this paper are those of the authors and do not necessarily reflect those of others in the Federal Reserve System.
We thank John Cochrane, Graham Elliott, Robin Greenwood, Helmut Lütkepohl, Ulrich Müller, Hashem Pesaran and Glenn Rudebusch for useful suggestions, conference participants at the 7th Annual Volatility Institute Conference at the NYU Stern School of Business and at the NBER 2015 Summer Institute, as well as seminar participants at the Federal Reserve Bank of Boston and the Free University of Berlin for helpful comments, and Javier Quintero and Simon Riddell for excellent research assistance.

† Federal Reserve Bank of San Francisco, 101 Market St MS 1130, San Francisco, CA 94105, phone: 415-974-3299, e-mail: michael.bauer@sf.frb.org
‡ University of California at San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0508, phone: 858-534-5986, e-mail: jhamilton@ucsd.edu

1 Introduction

The nominal yield on a 10-year U.S. Treasury bond has been below 2% much of the time since 2011, a level never seen previously. To what extent does this represent unprecedentedly low expected interest rates extending through the next decade, and to what extent does it reflect an unusually low risk premium resulting from a flight to safety and large-scale asset purchases by central banks that depressed the long-term yield? Finding the answer is a critical input for monetary policy, investment strategy, and understanding the lasting consequences of the financial and economic disruptions of 2008.

In principle one can measure the risk premium by the difference between the current long rate and the expected value of future short rates. But what information should go into constructing that expectation of future short rates? A powerful argument can be made that the current yield curve itself should contain most (if not all) information useful for forecasting future interest rates and bond returns. Investors use information at time t—which we can summarize by a state vector z_t—to forecast future short-term interest rates and determine bond risk premia.
Hence current yields are necessarily a function of z_t, reflecting the general fact that current asset prices incorporate all current information. This suggests that we may be able to back out the state vector z_t from the observed yield curve.1 The “invertibility” or “spanning” hypothesis states that the current yield curve contains all the information that is useful for predicting future interest rates or determining risk premia. Notably, under this hypothesis, the yield curve is first-order Markov.

It has long been recognized that three yield-curve factors, such as the first three principal components (PCs) of yields, can provide an excellent summary of the information in the entire yield curve (Litterman and Scheinkman, 1991). While it is clear that these factors, which are commonly labeled level, slope, and curvature, explain almost all of the cross-sectional variance of yields, it is less clear whether they completely capture the relevant information for forecasting future yields and estimating bond risk premia. In this paper we investigate what we will refer to as the “spanning hypothesis,” which holds that all the relevant information for predicting future yields and returns is spanned by the level, slope and curvature of the yield curve. This hypothesis differs from the claim that the yield curve follows a first-order Markov process, as it adds the assumption that only these three yield-curve factors are useful in forecasting. For example, if higher-order yield-curve factors such as the 4th and 5th PC are informative about predicting yields and returns, yields would still be Markov, but the spanning hypothesis, as we define it here, would be violated.

1 Specifically, this invertibility requires that (a) we observe at least as many yields as there are state variables in z_t, and (b) there are no knife-edge cancellations or pronounced nonlinearities; see for example Duffee (2013b).

Note also that the spanning
hypothesis is much less restrictive than the expectations hypothesis, which states that bond risk premia are constant and hence excess bond returns are not predictable.

The question whether the spanning hypothesis is valid is of crucial importance for finance and macroeconomics. If it is valid, then the estimation of monetary policy expectations and bond risk premia would not require any data or models involving macroeconomic series, other asset prices or quantities, volatilities, or survey expectations. Instead, all the necessary information is in the shape of the current yield curve, summarized by the level, slope, and curvature. If, however, the spanning hypothesis is violated, then this would seem to invalidate a large body of theoretical work in asset pricing and macro-finance, since models in this literature generally imply that the state variables are spanned by the information in the term structure of interest rates.2 A growing literature on yield curve modeling is based on the premise that it is undesirable and potentially counterfactual to assume spanning.3 There appears to be a consensus, reflected in recent review articles by Gürkaynak and Wright (2012) and Duffee (2013a), that the spanning question is a central issue in macro-finance.

A number of recent studies have produced evidence that appears to contradict the spanning hypothesis. This evidence comes from predictive regressions for bond returns on various predictors, controlling for information in the current yield curve. The variables that have been found to contain additional predictive power in such regressions include measures of economic growth and inflation (Joslin et al., 2014), factors inferred from a large set of macro variables (Ludvigson and Ng, 2009, 2010), higher-order (fourth and fifth) PCs of bond yields (Cochrane and Piazzesi, 2005), the output gap (Cooper and Priestley, 2008), and measures of Treasury bond supply (Greenwood and Vayanos, 2014).
The results in each of these studies suggest that there might be unspanned or hidden information that is not captured by the level, slope, and curvature of the current yield curve but that is useful for forecasting. But the predictive regressions underlying all these results have a number of problematic features. First, the predictive variables are typically very persistent, in particular in relation to the small available sample sizes. Second, some of these predictors summarize the information in the current yield curve, and therefore are generally correlated with lagged forecast errors, i.e., they violate the condition of strict exogeneity. In such a setting, tests of the spanning hypothesis are necessarily oversized in small samples, as we show both analytically and using simulations. Third, the dependent variable is typically a bond return over an annual holding period, which introduces substantial serial correlation in the prediction errors. This worsens the size distortions and leads to large increases in R^2 even if irrelevant predictors are included.4

2 Key contributions to this large literature include Wachter (2006), Piazzesi and Schneider (2007), Rudebusch and Wu (2008), and Bansal and Shaliastovich (2013). For a recent example see Swanson (2015).
3 Examples are Wright (2011), Chernov and Mueller (2012), Priebsch (2014), and Coroneo et al. (2015).

We demonstrate that the procedures commonly used for inference about the spanning hypothesis do not appropriately address these issues, and are subject to serious small-sample distortions. We propose two procedures that give substantially more robust small-sample inference.
The first is a parametric bootstrap that generates data samples under the spanning hypothesis: We calculate the first three PCs of the observed set of yields and summarize their dynamics with a VAR fit to the observed PCs.5 Then we use a residual bootstrap to resample the PCs, and construct bootstrapped yields by multiplying the simulated PCs by the historical loadings of yields on the PCs and adding a small Gaussian measurement error. Thus by construction no variables other than the PCs are useful for predicting yields or returns in our generated data. We then fit a separate VAR to the proposed additional explanatory variables alone, and generate bootstrap samples for the predictors from this VAR. Using our novel bootstrap procedure, we can calculate the properties of any regression statistic under the spanning hypothesis. Our procedure notably differs from the bootstrap approach often employed in this literature, which generates artificial data under the expectations hypothesis.6

Applying our bootstrap reveals that the conventional tests reject the true null much too often. We show for example that the tests employed by Ludvigson and Ng (2009), which are intended to have a nominal size of five percent, can have a true size of up to 54%. We then ask whether under the null it would be possible to observe similar patterns of predictability as researchers have found in the data. We find that this is indeed the case, meaning that much of the above-cited evidence against the spanning hypothesis is in fact spurious. These results provide a strong caution against using conventional tests, and we recommend that researchers instead use the bootstrap procedure proposed in this paper. Despite the usual technical concerns about bootstrapping near-nonstationary variables, we present evidence that this procedure performs well in small samples.

A second procedure that we propose for inference in this context is the approach for robust testing of Ibragimov and Müller (2010).
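Before turning to the second procedure, the parametric bootstrap just described can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes a T × N panel of yields, uses a plain OLS VAR(1) rather than the bias-corrected estimation the paper considers, and the function and parameter names (`spanning_bootstrap`, `sigma_meas`) are ours.

```python
import numpy as np

def spanning_bootstrap(yields, n_pcs=3, sigma_meas=0.001, rng=None):
    """One bootstrap replication of a yield panel generated under the
    spanning hypothesis: only the first n_pcs PCs drive the simulated yields."""
    rng = np.random.default_rng(rng)
    T, N = yields.shape
    mu = yields.mean(axis=0)
    # PC loadings from an SVD of the demeaned yield panel
    _, _, Vt = np.linalg.svd(yields - mu, full_matrices=False)
    loadings = Vt[:n_pcs]                       # (n_pcs, N)
    pcs = (yields - mu) @ loadings.T            # (T, n_pcs)
    # Fit a VAR(1) with intercept to the PCs by OLS
    X = np.column_stack([np.ones(T - 1), pcs[:-1]])
    B = np.linalg.lstsq(X, pcs[1:], rcond=None)[0]
    resid = pcs[1:] - X @ B
    # Residual bootstrap: resample VAR innovations and iterate forward
    pcs_b = np.empty_like(pcs)
    pcs_b[0] = pcs[0]
    idx = rng.integers(0, T - 1, size=T - 1)
    for t in range(1, T):
        pcs_b[t] = B[0] + pcs_b[t - 1] @ B[1:] + resid[idx[t - 1]]
    # Map the simulated PCs back to yields, plus small measurement error
    return mu + pcs_b @ loadings + sigma_meas * rng.standard_normal((T, N))
```

By construction nothing beyond the PCs helps predict the simulated yields, so any regression statistic can be recomputed on such samples to obtain its distribution under the spanning hypothesis.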
The Ibragimov and Müller (2010) approach is to split the sample into subsamples, to estimate coefficients separately in each of these, and then to perform a simple t-test on the coefficients across subsamples. We have found this approach to have excellent size and power properties in settings similar to the ones encountered by researchers testing for predictive power for interest rates and bond returns. Applying this type of test to the predictive regressions for excess bond returns studied in the literature, we find that the only robust predictors are the level and the slope of the yield curve.

4 Lewellen et al. (2010) demonstrated that high R^2 in cross-sectional return regressions are, for different reasons, often unconvincing evidence of true predictability.
5 We consider bias-corrected estimation of the VAR, in light of the high persistence of the PCs.
6 This approach has been used, for example, by Bekaert et al. (1997), Cochrane and Piazzesi (2005), Ludvigson and Ng (2009, 2010), and Greenwood and Vayanos (2014).

After revisiting the evidence in the five influential papers cited above we draw two conclusions. First, the claims going back to Fama and Bliss (1987) and Campbell and Shiller (1991) that excess returns can be predicted from the level and slope of the yield curve remain quite robust. Second, the newer evidence on the predictive power of macro variables, higher-order PCs of the yield curve, or other variables, is subject to more serious econometric problems and appears weaker and much less robust. We further demonstrate that this predictive power is substantially weaker in samples that include subsequent data than in the samples originally analyzed. Overall, we do not find convincing evidence to reject the hypothesis that the current yield curve, and in particular three factors summarizing this yield curve, contains all the information necessary to infer interest rate forecasts and bond risk premia.
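The subsample test underlying these conclusions can be sketched as follows. The choice of q subsamples, the helper name `im_subsample_ttest`, and the use of plain OLS within each block are illustrative assumptions, not the exact implementation used in the paper.

```python
import numpy as np
from scipy import stats

def im_subsample_ttest(y, X, coef_idx, q=8):
    """Ibragimov-Mueller style test: estimate the coefficient of interest
    separately in q consecutive subsamples, then run an ordinary t-test
    on the q subsample estimates."""
    T = len(y)
    betas = []
    for block in np.array_split(np.arange(T), q):
        Xb = np.column_stack([np.ones(len(block)), X[block]])
        b = np.linalg.lstsq(Xb, y[block], rcond=None)[0]
        betas.append(b[coef_idx + 1])           # +1 skips the intercept
    betas = np.asarray(betas)
    tstat = np.sqrt(q) * betas.mean() / betas.std(ddof=1)
    pval = 2 * stats.t.sf(abs(tstat), df=q - 1)
    return tstat, pval
```

Because each subsample delivers an approximately independent estimate, the t-test across subsamples remains approximately valid even when the regressors are persistent and the errors serially correlated.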
In other words, the spanning hypothesis cannot be rejected, and the Markov property of the yield curve seems alive and well.

Our paper is related mainly to two strands of literature. The first is the literature on the spanning hypothesis, and most relevant studies were cited above. Bauer and Rudebusch (2015) also question the evidence against spanning, by showing that conventional macro-finance models can generate data in which the spanning hypothesis is spuriously rejected. Our paper is also related to the econometric literature on spurious results in return regressions. Mankiw and Shapiro (1986), Cavanagh et al. (1995), Stambaugh (1999) and Campbell and Yogo (2006), among others, studied short-horizon return predictability with a regressor that is not strictly exogenous. We point out a related econometric issue in bond return regressions, which is however distinct from Stambaugh bias. Ferson et al. (2003) and Deng (2013) studied the size distortions in a setting that is different from ours and more relevant for stock returns, namely when returns have an unobserved persistent component. In contrast to these studies, we focus on the econometric problems that arise in tests of the spanning hypothesis. In addition, we propose simple, easily implementable solutions to these problems.

2 Inference about the spanning hypothesis

The evidence against the spanning hypothesis in all of the studies cited in the introduction comes from regressions of the form

y_{t+h} = \beta_1' x_{1t} + \beta_2' x_{2t} + u_{t+h},   (1)

where the dependent variable y_{t+h} is the return or excess return on a long-term bond (or portfolio of bonds) that we wish to predict, x_{1t} and x_{2t} are vectors containing K_1 and K_2 predictors, respectively, and u_{t+h} is a forecast error.
The predictors x_{1t} contain a constant and the information in the yield curve, typically captured by the first three PCs of observed yields, i.e., level, slope, and curvature.7 The null hypothesis of interest is H_0: \beta_2 = 0, which says that the relevant predictive information is spanned by the information in the yield curve and that x_{2t} has no additional predictive power.

The evidence produced in these studies comes in two forms, the first based on simple descriptive statistics such as how much the R^2 of the regression increases when the variables x_{2t} are added, and the second from formal statistical tests of the hypothesis that \beta_2 = 0. In this section we show how key features of the specification can matter significantly for both forms of evidence. In Section 2.1 we show how serial correlation in the error term u_t and the proposed predictors x_{2t} can give rise to a large increase in R^2 when x_{2t} is added to the regression even if it is no help in predicting y_{t+h}. In Section 2.2 we show the consequences of lack of strict exogeneity of x_{1t}, which is necessarily correlated with u_t since it contains information in current yields. When x_{1t} and x_{2t} are highly persistent processes, as is usually the case in practice, conventional tests of H_0 generally will exhibit significant size distortions in finite samples. We then propose methods for robust inference about bond return predictability in Sections 2.3 and 2.4.

2.1 Implications of serially correlated errors based on first-order asymptotics

Our first observation is that in regressions in which x_{1t} and x_{2t} are strongly persistent and the error term is serially correlated—as is always the case with overlapping bond returns—we should not be surprised to see substantial increases in R^2 when x_{2t} is added to the regression even if the true coefficient is zero.
It is well known that in small samples serial correlation in the residuals can increase both the bias as well as the variance of a regression R^2 (see for example Koerts and Abrahamse (1969) and Carrodus and Giles (1992)). To see how much difference this could make in the current setting, consider the unadjusted R^2 defined as

R^2 = 1 - \frac{SSR}{\sum_{t=1}^{T} (y_{t+h} - \bar{y}_h)^2}   (2)

where SSR denotes the regression sum of squared residuals. The increase in R^2 when x_{2t} is added to the regression is thus given by

R_2^2 - R_1^2 = \frac{SSR_1 - SSR_2}{\sum_{t=1}^{T} (y_{t+h} - \bar{y}_h)^2}.   (3)

We show in Appendix A that when x_{1t}, x_{2t}, and u_{t+h} are stationary and satisfy standard regularity conditions, if the null hypothesis is true (\beta_2 = 0) and the extraneous regressors are uncorrelated with the valid predictors (E(x_{2t} x_{1t}') = 0), then

T(R_2^2 - R_1^2) \stackrel{d}{\rightarrow} r' Q^{-1} r / \gamma,   r \sim N(0, S),   (4), (5)

\gamma = E[y_t - E(y_t)]^2,   Q = E(x_{2t} x_{2t}'),   (6)

S = \sum_{v=-\infty}^{\infty} E(u_{t+h} u_{t+h-v} x_{2t} x_{2,t-v}').   (7)

7 We will always sign the PCs so that the yield with the longest maturity loads positively on all PCs. As a result, PC1 and PC2 correspond to what are commonly referred to as “level” and “slope” of the yield curve.

Result (4) implies that the difference R_2^2 - R_1^2 itself converges in probability to zero under the null hypothesis that x_{2t} does not belong in the regression, meaning that the two regressions asymptotically should have the same R^2. In a given finite sample, however, R_2^2 is larger than R_1^2 by construction, and the above results give us an indication of how much larger it would be in a given finite sample. If x_{2t} u_{t+h} is serially uncorrelated, then (7) simplifies to S_0 = E(u_{t+h}^2 x_{2t} x_{2t}'). On the other hand, if x_{2t} u_{t+h} is positively serially correlated, then S exceeds S_0 by a positive-definite matrix, and r exhibits more variability across samples.
This means R_2^2 - R_1^2, being a quadratic form in a vector with a higher variance, would have both a higher expected value as well as a higher variance when x_{2t} u_{t+h} is serially correlated compared to situations when it is not.

When the dependent variable y_{t+h} is a multi-period bond return, then the error term is necessarily serially correlated. In our empirical applications, y_{t+h} will typically be the h-period excess return on an n-period bond,

y_{t+h} = p_{n-h,t+h} - p_{nt} - h \, i_{ht},   (8)

for p_{nt} the log of the price of a pure discount n-period bond purchased at date t and i_{nt} = -p_{nt}/n the corresponding zero-coupon yield. In that case, E(u_t u_{t-v}) \neq 0 for v = 0, \ldots, h-1, due to the overlapping observations. At the same time, the explanatory variables x_{2t} often are highly serially correlated, so E(x_{2t} x_{2,t-v}') \neq 0. Thus even if x_{2t} is completely independent of u_t at all leads and lags, the product will be highly serially correlated,

E(u_{t+h} u_{t+h-v} x_{2t} x_{2,t-v}') = E(u_t u_{t-v}) E(x_{2t} x_{2,t-v}') \neq 0.

This serial correlation in x_{2t} u_{t+h} would contribute to larger values for R_2^2 - R_1^2 on average as well as to increased variability in R_2^2 - R_1^2 across samples. In other words, including x_{2t} could substantially increase the R^2 even if H_0 is true.8

These results on the asymptotic distribution of R_2^2 - R_1^2 could be used to design a test of H_0. However, we show in the next subsection that in small samples additional problems arise from the persistence of the predictors, with the consequence that the bias and variability of R_2^2 - R_1^2 can be even greater than predicted by (4). For this reason, in this paper we will rely on bootstrap approximations to the small-sample distribution of the statistic R_2^2 - R_1^2, and demonstrate that the dramatic values sometimes reported in the literature are not implausible under the spanning hypothesis.

Serial correlation of the residuals also affects the sampling distribution of the OLS estimate of \beta_2.
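The R^2-inflation mechanism from overlapping returns can be illustrated with a small Monte Carlo. The parameter values here are hypothetical, not calibrated to any study revisited in the paper, and for simplicity the baseline regression contains only a constant, so the reported gain is just the R^2 of the irrelevant regressor.

```python
import numpy as np

def avg_r2_increase(T=200, h=12, rho=0.99, n_sim=2000, seed=0):
    """Average R^2 gained by adding a persistent but irrelevant predictor
    when the dependent variable is an overlapping h-period sum of shocks,
    so the regression error is serially correlated by construction."""
    rng = np.random.default_rng(seed)
    gains = []
    for _ in range(n_sim):
        # overlapping h-period "return": moving sum of one-period shocks
        e = rng.standard_normal(T + h)
        y = np.convolve(e, np.ones(h), mode="valid")[:T]
        # persistent AR(1) predictor, truly unrelated to y
        x2 = np.zeros(T)
        eps = rng.standard_normal(T)
        for t in range(1, T):
            x2[t] = rho * x2[t - 1] + eps[t]
        ssr1 = ((y - y.mean()) ** 2).sum()       # constant-only model
        X = np.column_stack([np.ones(T), x2])
        u = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
        gains.append((ssr1 - (u ** 2).sum()) / ssr1)
    return float(np.mean(gains))
```

Comparing overlapping (for example h = 12) with non-overlapping (h = 1) observations shows a much larger average spurious R^2 gain in the overlapping case, even though the predictor is irrelevant in both.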
In Appendix A we verify using standard algebra that under the null hypothesis \beta_2 = 0 the OLS estimate b_2 can be written as

b_2 = \left( \sum_{t=1}^{T} \tilde{x}_{2t} \tilde{x}_{2t}' \right)^{-1} \sum_{t=1}^{T} \tilde{x}_{2t} u_{t+h}   (9)

where \tilde{x}_{2t} denotes the sample residuals from OLS regressions of x_{2t} on x_{1t}:

\tilde{x}_{2t} = x_{2t} - A_T x_{1t}   (10)

A_T = \left( \sum_{t=1}^{T} x_{2t} x_{1t}' \right) \left( \sum_{t=1}^{T} x_{1t} x_{1t}' \right)^{-1}.   (11)

If x_{2t} and x_{1t} are stationary and uncorrelated with each other, as the sample size grows, A_T \stackrel{p}{\rightarrow} 0 and b_2 has the same asymptotic distribution as

b_2^* = \left( \sum_{t=1}^{T} x_{2t} x_{2t}' \right)^{-1} \sum_{t=1}^{T} x_{2t} u_{t+h},   (12)

namely

\sqrt{T} \, b_2 \stackrel{d}{\rightarrow} N(0, Q^{-1} S Q^{-1}),   (13)

with Q and S the matrices defined in (6) and (7). Again we see that positive serial correlation causes S to exceed the value S_0 that would be appropriate for serially uncorrelated residuals. In other words, serial correlation in the error term increases the sampling variability of the OLS estimate b_2. The standard approach is to use heteroskedasticity- and autocorrelation-consistent (HAC) standard errors to try to correct for this, for example, the estimators proposed by Newey and West (1987) or Andrews (1991). However, in practice different HAC estimators of S can lead to substantially different empirical conclusions (Müller, 2014). Moreover, we show in the next subsection that even if the population value of S were known with certainty, expression (13) can give a poor indication of the true small-sample variance.

8 The same conclusions necessarily also hold for the adjusted \bar{R}^2 defined as

\bar{R}_i^2 = 1 - \frac{(T-1) \, SSR_i}{(T - k_i) \sum_{t=1}^{T} (y_{t+h} - \bar{y}_h)^2}

for k_i the number of coefficients estimated in model i, from which we see that

T(\bar{R}_2^2 - \bar{R}_1^2) = \frac{[T/(T-k_1)] SSR_1 - [T/(T-k_2)] SSR_2}{\sum_{t=1}^{T} (y_{t+h} - \bar{y}_h)^2 / (T-1)}

which has the same asymptotic distribution as (4). In our small-sample investigations below, we will analyze either R^2 or \bar{R}^2 as was used in the original study that we revisit.
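For concreteness, a minimal implementation of the Newey-West estimator with Bartlett-kernel weights is sketched below. This is a generic textbook version; the bandwidth choice is left to the user and is not the setting of any particular study revisited here.

```python
import numpy as np

def newey_west_se(X, u, lags):
    """Newey-West HAC standard errors for OLS coefficients, using
    Bartlett-kernel weights on the autocovariances of the scores."""
    T, k = X.shape
    Xu = X * u[:, None]                  # scores x_t * u_t
    S = Xu.T @ Xu / T                    # Gamma_0
    for v in range(1, lags + 1):
        w = 1.0 - v / (lags + 1.0)       # Bartlett weight
        G = Xu[v:].T @ Xu[:-v] / T       # Gamma_v
        S += w * (G + G.T)
    Q_inv = np.linalg.inv(X.T @ X / T)
    V = Q_inv @ S @ Q_inv / T            # sandwich variance of the estimates
    return np.sqrt(np.diag(V))
```

With `lags=0` this reduces to the White heteroskedasticity-robust estimator; larger bandwidths trade bias against variance, which is one source of the sensitivity to HAC choices noted above.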
We further demonstrate empirically in the subsequent sections that this is a serious problem when carrying out inference about bond return predictability.

2.2 Small-sample implications of lack of strict exogeneity

A second feature of the studies examined in this paper is that the valid explanatory variables x_{1t} are correlated with lagged values of the error term. That is, they are only weakly but not strictly exogenous. In addition, x_{1t} and x_{2t} are highly serially correlated. We will show that this can lead to substantial size distortions in tests of \beta_2 = 0.

The intuition of our result is the following: As noted above, the OLS estimate of \beta_2 in (1), b_2, can be thought of as being implemented in three steps: (i) regress x_{2t} on x_{1t}, (ii) regress y_{t+h} on x_{1t}, and (iii) regress the residuals from (ii) on the residuals of (i). When x_{1t} and x_{2t} are highly persistent, the auxiliary regression (i) behaves like a spurious regression in small samples, causing \sum \tilde{x}_{2t} \tilde{x}_{2t}' in (9) to be significantly smaller than \sum x_{2t} x_{2t}' in (12). When there is correlation between x_{1t} and u_t, this causes the usual asymptotic distribution to underestimate significantly the true variability of b_2. As a consequence, the t-test for \beta_2 = 0 rejects the true null too often. In the following, we demonstrate exactly why this occurs, first theoretically using local-to-unity asymptotics, and then in small-sample simulations. The issue we raise has to our knowledge not previously been recognized.
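The three-step construction of b_2 described above is the Frisch-Waugh-Lovell decomposition. A quick numerical check with simulated (hypothetical) data confirms that the two routes give the same coefficient:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200
x1 = rng.standard_normal(T)
x2 = rng.standard_normal(T)
y = 0.5 * x1 + rng.standard_normal(T)

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

# One-step: coefficient on x2 in the full regression with a constant
b2_full = ols(np.column_stack([np.ones(T), x1, x2]), y)[2]

# Three-step version: (i) residualize x2 on x1, (ii) residualize y on x1,
# (iii) regress the residuals of (ii) on those of (i)
X1 = np.column_stack([np.ones(T), x1])
x2_res = x2 - X1 @ ols(X1, x2)
y_res = y - X1 @ ols(X1, y)
b2_fwl = float(x2_res @ y_res / (x2_res @ x2_res))

assert np.isclose(b2_full, b2_fwl)
```

The decomposition makes the mechanism visible: when step (i) behaves like a spurious regression, the residual variation of x2 left for step (iii) is artificially small.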
Mankiw and Shapiro (1986) and Stambaugh (1999) studied tests of the hypothesis \beta_1 = 0 in a regression of y_{t+1} on x_{1t}, where the regressors x_{1t} are not strictly exogenous, and documented that when x_{1t} is persistent this leads to small-sample coefficient bias in the OLS estimate of \beta_1.9 By contrast, in our setting there is no coefficient bias present in estimates of \beta_2, and it is instead the inaccuracy of the standard errors, which we will refer to as “standard error bias,” that distorts the results of conventional inference. Another related line of work is by Ferson et al. (2003) and Deng (2013), who studied predictions of returns that have a persistent component that is unobserved. In our notation, their setting corresponds to the case where both x_{1t} and x_{2t} are strictly exogenous, x_{1t} is unobserved, and returns are predicted using x_{2t}. For predictive regressions of bond returns, however, we do have estimates of the persistent return component based on information in the current yield curve, x_{1t}, and instead the resulting lack of strict exogeneity causes a separate econometric problem from that considered by Ferson et al. (2003) and Deng (2013).

2.2.1 Theoretical analysis using local-to-unity asymptotics

We now demonstrate where the problem arises in the simplest example of our setting. Suppose that x_{1t} and x_{2t} are scalars that follow independent highly persistent processes,

x_{i,t+1} = \rho_i x_{it} + \varepsilon_{i,t+1},   i = 1, 2,   (14)

where \rho_i is close to one. Consider the consequences of OLS estimation of (1) in the special case where h = 1:

y_{t+1} = \beta_0 + \beta_1 x_{1t} + \beta_2 x_{2t} + u_{t+1}.   (15)

We assume that (\varepsilon_{1t}, \varepsilon_{2t}, u_t)' follows a martingale difference sequence with finite fourth moments and variance matrix

V = E\left[ \begin{pmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \\ u_t \end{pmatrix} \begin{pmatrix} \varepsilon_{1t} & \varepsilon_{2t} & u_t \end{pmatrix} \right] = \begin{pmatrix} \sigma_1^2 & 0 & \delta \sigma_1 \sigma_u \\ 0 & \sigma_2^2 & 0 \\ \delta \sigma_1 \sigma_u & 0 & \sigma_u^2 \end{pmatrix}.   (16)

Thus x_{1t} is not strictly exogenous when the correlation \delta is nonzero.
Note that for any \delta, x_{2t} u_{t+1} is serially uncorrelated and the standard OLS t-test of \beta_2 = 0 asymptotically has a N(0, 1) distribution when using the conventional first-order asymptotic approximation. This simple example illustrates the problems in a range of possible settings for yield-curve forecasting. In particular, if Var(u_{t+1}) substantially exceeds Var(\beta_1 x_{1t}), y_t could be viewed as a (one-period) bond return, where \beta_1 x_{1t} is a persistent component of the return that is small relative to the size of y_{t+1}.

9 Cavanagh et al. (1995) and Campbell and Yogo (2006) considered this problem using local-to-unity asymptotic theory.

One device for seeing how the results in a finite sample of some particular size T likely differ from those predicted by conventional first-order asymptotics is to use a local-to-unity specification as in Phillips (1988) and Cavanagh et al. (1995):

x_{i,t+1} = (1 + c_i/T) x_{it} + \varepsilon_{i,t+1},   i = 1, 2.   (17)

For example, if our data come from a sample of size T = 100 when \rho_i = 0.95, the idea is to represent this with a value of c_i = -5 in (17). The claim is that analyzing the properties as T \rightarrow \infty of a model characterized by (17) with c_i = -5 gives a better approximation to the actual distribution of regression statistics in a sample of size T = 100 and \rho_i = 0.95 than is provided by the first-order asymptotics used in the previous subsection, which treat \rho_i as a constant when T \rightarrow \infty; see for example Chan (1988) and Nabeya and Sørensen (1994). The local-to-unity asymptotics turn out to be described by Ornstein-Uhlenbeck processes. For example,

T^{-2} \sum_{t=1}^{T} (x_{it} - \bar{x}_i)^2 \Rightarrow \sigma_i^2 \int_0^1 [J_{c_i}^{\mu}(\lambda)]^2 \, d\lambda

where \Rightarrow denotes weak convergence as T \rightarrow \infty and

J_{c_i}(\lambda) = c_i \int_0^{\lambda} e^{c_i(\lambda - s)} W_i(s) \, ds + W_i(\lambda),   i = 1, 2,

J_{c_i}^{\mu}(\lambda) = J_{c_i}(\lambda) - \int_0^1 J_{c_i}(s) \, ds,   i = 1, 2,

with W_1(\lambda) and W_2(\lambda) denoting independent standard Brownian motions.
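Functionals of J_c^{\mu} such as the one above have no closed form but are straightforward to simulate: generate a near-integrated process on a fine grid and form the corresponding Riemann sum. The sketch below is a minimal illustration; the grid size, replication count, and function name are hypothetical choices.

```python
import numpy as np

def ou_functional_draws(c=-5.0, T_grid=1000, n_sim=2000, seed=0):
    """Monte Carlo draws of sigma^2 * integral_0^1 [J_c^mu(lambda)]^2 d lambda,
    approximated by simulating x_t = (1 + c/T)x_{t-1} + e_t on a grid of
    T_grid steps and forming the Riemann sum T^{-2} sum_t (x_t - xbar)^2."""
    rng = np.random.default_rng(seed)
    rho = 1.0 + c / T_grid
    draws = np.empty(n_sim)
    for i in range(n_sim):
        e = rng.standard_normal(T_grid)
        x = np.empty(T_grid)
        x[0] = e[0]
        for t in range(1, T_grid):
            x[t] = rho * x[t - 1] + e[t]
        draws[i] = ((x - x.mean()) ** 2).sum() / T_grid ** 2
    return draws
```

As a sanity check, for c = 0 the limit is the integral of the squared demeaned Brownian motion, whose mean is 1/6.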
When c_i = 0, (17) becomes a random walk and the local-to-unity asymptotics simplify to the standard unit-root asymptotics involving functionals of Brownian motion as a special case: J_0(\lambda) = W(\lambda).

Applying local-to-unity asymptotics to our setting reveals the basic econometric problem. We show in Appendix B that under local-to-unity asymptotics the coefficient from a regression of x_{2t} on x_{1t} has the following limiting distribution:

A_T = \frac{\sum (x_{1t} - \bar{x}_1)(x_{2t} - \bar{x}_2)}{\sum (x_{1t} - \bar{x}_1)^2} \Rightarrow \frac{\sigma_2 \int_0^1 J_{c_1}^{\mu}(\lambda) J_{c_2}^{\mu}(\lambda) \, d\lambda}{\sigma_1 \int_0^1 [J_{c_1}^{\mu}(\lambda)]^2 \, d\lambda} \equiv (\sigma_2/\sigma_1) A,   (18)

where we have defined A to be the random variable in the middle expression. Under first-order asymptotics the influence of A_T would vanish as the sample size grows. But using local-to-unity asymptotics we see that A_T behaves similarly to the coefficient in a spurious regression and does not converge to zero—the true correlation between x_{1t} and x_{2t} in this setting—but to a random variable proportional to A. Consequently, the t-statistic for \beta_2 = 0 can have a very different distribution from that predicted using first-order asymptotics. We demonstrate in Appendix B that this t-statistic has a local-to-unity asymptotic distribution under the null hypothesis that is given by

\frac{b_2}{\{s^2 / \sum \tilde{x}_{2t}^2\}^{1/2}} \Rightarrow \delta Z_1 + \sqrt{1 - \delta^2} \, Z_0   (19)

Z_1 = \frac{\int_0^1 K_{c_1,c_2}(\lambda) \, dW_1(\lambda)}{\left\{ \int_0^1 [K_{c_1,c_2}(\lambda)]^2 \, d\lambda \right\}^{1/2}}   (20)

Z_0 = \frac{\int_0^1 K_{c_1,c_2}(\lambda) \, dW_0(\lambda)}{\left\{ \int_0^1 [K_{c_1,c_2}(\lambda)]^2 \, d\lambda \right\}^{1/2}}   (21)

K_{c_1,c_2}(\lambda) = J_{c_2}^{\mu}(\lambda) - A J_{c_1}^{\mu}(\lambda)

for s^2 = (T-3)^{-1} \sum (y_{t+1} - b_0 - b_1 x_{1t} - b_2 x_{2t})^2 and W_i(\lambda) independent standard Brownian processes for i = 0, 1, 2. Conditional on the realizations of W_1(\cdot) and W_2(\cdot), the term Z_0 will be recognized as a standard Normal variable, and therefore Z_0 has an unconditional N(0, 1) distribution as well.10 In other words, if x_{1t} is strictly exogenous (\delta = 0) then the OLS t-test of \beta_2 = 0 will be valid in small samples even with highly persistent regressors.
By contrast, the term dW_1(\lambda) in the numerator of (20) is not independent of the denominator and this gives Z_1 a nonstandard distribution. In particular, Appendix B establishes that Var(Z_1) > 1. Moreover Z_1 and Z_0 are uncorrelated with each other.11 Therefore the t-statistic in (19) in general has a non-standard distribution with variance \delta^2 Var(Z_1) + (1 - \delta^2) > 1, which is monotonically increasing in |\delta|. This shows that whenever x_{1t} is correlated with u_t (\delta \neq 0) and x_{1t} and x_{2t} are highly persistent, in small samples the t-test of \beta_2 = 0 will reject too often when H_0 is true.

Expression (19) can be viewed as a straightforward generalization of result (2.1) in Cavanagh et al. (1995) and expression (11) in Campbell and Yogo (2006). In their case the explanatory variable is x_{1,t-1} - \bar{x}_1, which behaves asymptotically like J_{c_1}^{\mu}(\lambda). The component of u_t that is correlated with \varepsilon_{1t} leads to a contribution to the t-statistic given by the expression that Cavanagh et al. (1995) refer to as \tau_{1c}, which is labeled as \tau_c/\kappa_c by Campbell and Yogo (2006). This variable is a local-to-unity version of the Dickey-Fuller distribution with well-known negative bias. By contrast, in our case the explanatory variable is \tilde{x}_{2,t-1} = x_{2,t-1} - A_T x_{1,t-1}, which behaves asymptotically like K_{c_1,c_2}(\lambda).

10 The intuition is that for v_{0,t+1} \sim i.i.d. N(0, 1) and K = \{K_t\}_{t=1}^{T} any sequence of random variables that is independent of v_0, \sum_{t=1}^{T} K_t v_{0,t+1} has a distribution conditional on K that is N(0, \sum_{t=1}^{T} K_t^2), and \sum_{t=1}^{T} K_t v_{0,t+1} / \sqrt{\sum_{t=1}^{T} K_t^2} \sim N(0, 1). Multiplying by the density of K and integrating over K gives the identical unconditional distribution, namely N(0, 1). For a more formal discussion in the current setting, see Hamilton (1994, pp. 602-607).
11 The easiest way to see this is to note that conditional on W_1(\cdot) and W_2(\cdot) the product has expectation zero, so the unconditional expected product is zero as well.
Here the component of u_t that is correlated with \varepsilon_{1t} leads to a contribution to the t-statistic given by Z_1 in our expression (19). Unlike the Dickey-Fuller distribution, Z_1 has mean zero, but like the Dickey-Fuller distribution it has variance larger than one.

2.2.2 Simulation evidence

We now examine the implications of the theory developed above in a simulation study. We generate values for x_{1t} and x_{2t} using (14), with \varepsilon_{1t} and \varepsilon_{2t} serially independent Gaussian random variables with unit variance and covariance equal to \theta.12 We then calculate

y_{t+1} = \rho_1 x_{1t} + u_{t+1},   u_t = \delta \varepsilon_{1t} + \sqrt{1 - \delta^2} \, v_t,

where v_t is an i.i.d. standard normal random variable. This implies that in the predictive equation (15) the true parameters are \beta_0 = \beta_2 = 0 and \beta_1 = \rho_1, and that the correlation between u_t and \varepsilon_{1t} is \delta. Note that for \delta = 1 this corresponds to a model with a lagged dependent variable (y_t = x_{1t}), whereas for \delta = 0 both predictors are strictly exogenous as u_t is independent of both \varepsilon_{1t} and \varepsilon_{2t}. While in bond return regressions \delta is typically negative (as we discuss below in Section 3), we can focus here on 0 \leq \delta \leq 1, since only |\delta| matters for the distribution of the t-statistic.

We first set \theta = 0 as in our theory above, so that the variance matrix V is given by equation (16) with \sigma_1, \sigma_2, and \sigma_u equal to one, and x_{2t} is strictly exogenous. We investigate the effects of varying \delta, the persistence of the predictors (\rho_1 = \rho_2 = \rho), and the sample size T. We simulate 50,000 artificial data samples, and in each sample we estimate the regression in equation (15). Since our interest is in the inference about \beta_2 we use this simulation design to study the small-sample behavior of the t-statistic for the test of H_0: \beta_2 = 0.

12 We start the simulations at x_{1,0} = x_{2,0} = 0, following standard practice of making all inference conditional on date 0 magnitudes.

To give
12 conventional inference the best chance, we use OLS standard errors, which is the correct choice in this simulation setup as the errors are not serially correlated (h = 1) and there is no heteroskedasticity.13 In addition to the small-sample distribution of the t-statistic we also study its asymptotic distribution given in equation (19). While this is a non-standard distribution, we can draw from it using Monte Carlo simulation: for given values of c1 and c2 , we simulate samples of size T̃ from near-integrated processes and approximating the integrals using Rieman sums—see, for example, Chan (1988), Stock (1991), and Stock (1994). The literature suggests that such a Monte Carlo approach yields accurate approximations to the limiting distribution even for moderate sample sizes (Stock, 1991, uses T̃ = 500). We will use T̃ = 1000 and generate 50,000 Monte Carlo replications with c1 = c2 = T (ρ − 1) to calculate the predicted outcome for a sample of size T with serial dependence ρ. Table 1 reports the performance of the t-test of H0 with a nominal size of five percent. It shows the true size of this test, i.e., the frequency of rejections of H0 , according to both the small-sample distribution from our simulations and the asymptotic distribution in equation (19). We use critical values from a Student t-distribution with T − 3 degrees of freedom. Not surprisingly, the local-to-unity asymptotic distribution provides an excellent approximation to the exact small-sample distributions, as both indicate a very similar test size across parameter configurations and sample sizes. The main finding here is that the size distortions can be quite substantial with a true size of up to 17 percent—the t-test would reject the null more than three times as often as it should. When δ 6= 0, the size of the t-test increases with the persistence of the regressors. Table 1 also shows the dependence of the size distortion on the sample size. 
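The size calculations summarized in Table 1 can be illustrated with a short simulation. The following is a hedged sketch (function names and defaults are ours, and we use the asymptotic critical value 1.96 rather than the exact Student-t quantile with T − 3 degrees of freedom):

```python
import numpy as np

def tstat_beta2(y, x1, x2):
    """OLS t-statistic for H0: beta2 = 0 in a regression of y on (1, x1, x2)."""
    X = np.column_stack([np.ones_like(x1), x1, x2])
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ b
    s2 = resid @ resid / (len(y) - 3)                 # OLS error variance
    se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[2, 2])   # OLS standard error of b2
    return b[2] / se

def empirical_size(n_sims=2000, T=100, rho=0.99, delta=1.0, seed=0):
    """Fraction of simulated samples in which |t| > 1.96 although beta2 = 0."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        eps1, eps2, v = rng.standard_normal((3, T))
        x1, x2 = np.zeros(T + 1), np.zeros(T + 1)     # x1_0 = x2_0 = 0
        for t in range(T):
            x1[t + 1] = rho * x1[t] + eps1[t]
            x2[t + 1] = rho * x2[t] + eps2[t]
        # u_{t+1} is correlated (delta) with the innovation to x1 at t+1
        u = delta * eps1 + np.sqrt(1.0 - delta**2) * v
        y = rho * x1[:-1] + u                         # depends only on lagged x1
        rejections += abs(tstat_beta2(y, x1[:-1], x2[:-1])) > 1.96
    return rejections / n_sims
```

With δ = 0 the rejection frequency is close to the nominal five percent, while with δ = 1 and ρ = 0.99 it is substantially higher, in line with the pattern in Table 1.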
To visualize this, Figure 1 plots the empirical size of the t-test for the case with δ = 1 for different sample sizes from T = 50 to T = 1000.14 When ρ < 1, the size distortions decrease with the sample size—for example for ρ = 0.99 the size decreases from 15 percent to about 9 percent. In contrast, when ρ = 1 the size distortions are not affected by the sample size, as indeed in this case the non-Normal distribution corresponding to (19) with ci = 0 governs the distribution for arbitrarily large T. To understand better why conventional t-tests go so wrong in this setting, we use simulations to study the respective roles of bias in the coefficient estimates and of inaccuracy of the OLS standard errors for estimation of β1 and β2. Table 2 shows results for three different simulation settings, in all of which T = 100, ρ = 0.99, and x1t is correlated with past forecast errors (δ ≠ 0). In the first two settings, the correlation between the regressors is zero (θ = 0), and δ is either equal to one or 0.8. In the third setting, we investigate the effects of non-zero correlation between the predictors by setting δ = 0.8 and θ = 0.8.15 The results show that in all three simulation settings b1 is downward biased and b2 is unbiased. The problem with the hypothesis test of β2 = 0 does not arise from coefficient (Stambaugh) bias, but from the fact that the asymptotic standard errors underestimate the true sampling variability of both b1 and b2, i.e., from “standard error bias.” This is evident from comparing the standard deviation of the coefficient estimates across simulations—the true small-sample standard error—and the average OLS standard errors. The latter are between 22 and 31 percent too low.

13 If we instead use Newey-West standard errors, the size distortions become larger, as expected based on the well-known small-sample problems of HAC covariance estimators (e.g., Müller, 2014).
14 The lines in Figure 1 are based on 500,000 simulated samples in each case.
Because of this standard error bias, the tests for β2 = 0 reject much more often than their nominal size of five percent.

2.2.3 Relevance for tests of the spanning hypothesis

We have demonstrated that with persistent predictors, the lack of strict exogeneity of a subset of the predictors can have serious consequences for the small-sample inference on the remaining predictors, because it causes standard error bias for all predictors. Importantly, HAC standard errors do not help, because in such settings they cannot accurately capture the uncertainty surrounding the coefficient estimators. This econometric issue arises necessarily in all tests of the spanning hypothesis. First, in these regressions the predictors in x1t are by construction correlated with ut, because they correspond to information in current yields and the dependent variable is a future bond return. Second, the predictors are often highly persistent. Table 3, which we discuss in more detail below, reports the estimated autocorrelation coefficients for the predictors used in each published study, showing the high persistence of the predictors used in practice. Third, the sample sizes are necessarily small.16 In light of these observations, conventional hypothesis tests are likely to be misleading in all of the empirical studies that we consider in this paper. Predictive regressions for bond returns are “unbalanced” in the sense that the dependent variable has little serial correlation whereas the predictors are highly persistent. One might suppose that inclusion of additional lags solves the problem we point out. This, unfortunately, is not the case: including further lags of x2t and testing whether the coefficients on current and lagged values are jointly significant leads to a test with exactly the same small-sample size distortions as the t-test on x2t alone.17

15 Note that in this setting, x2t is not strictly exogenous, as the correlation between ut and ε2t is θδ. This is the natural implication of a model in which only x1t contains information useful for predicting yt. If instead we insisted on E(ut ε2t) = 0 while θ ≠ 0 (or, more generally, if E(ut ε2t) ≠ θδ), then E(yt|x1t, x1,t−1, x2t, x2,t−1) ≠ E(yt|x1t, x1,t−1), meaning that in effect yt would depend on both x1t and x2t.
16 Reliable interest rate data are only available since about the 1960s, which leads to situations with about 40-50 years of monthly data. Going to higher frequencies—such as weekly or daily—does not increase the effective sample sizes, since it typically increases the persistence of the series and introduces additional noise.

2.3 A bootstrap design for investigating the spanning hypothesis

The above analysis suggests that it is of paramount importance to base inference on the small-sample distributions of the relevant test statistics. We propose to do so using a parametric bootstrap under the spanning hypothesis.18 While some studies (Bekaert et al., 1997; Cochrane and Piazzesi, 2005; Ludvigson and Ng, 2009; Greenwood and Vayanos, 2014) use the bootstrap in a similar context, they typically generate data under the expectations hypothesis. Cochrane and Piazzesi (2005) and Ludvigson and Ng (2009, 2010) also calculated bootstrap confidence intervals under the alternative hypothesis, which in principle gives some indication of the small-sample significance of the coefficients on x2t. However, bootstrapping under the relevant null hypothesis—the spanning hypothesis—is much to be preferred, as it allows us to calculate the small-sample size of conventional tests and generally leads to better numerical accuracy and more powerful tests (Hall and Wilson, 1991; Horowitz, 2001). Our paper is the first to propose a bootstrap to test the spanning hypothesis H0: β2 = 0 by generating bootstrapped samples under the null.
Our bootstrap design is as follows: First, we calculate the first three PCs of observed yields, which we denote x1t = (PC1t, PC2t, PC3t)′, along with the weighting vector ŵn for the bond yield with maturity n: i_nt = ŵn′ x1t + v̂nt. That is, x1t = Ŵ it, where it = (i_{n1,t}, . . . , i_{nJ,t})′ is a J-vector with observed yields at t, and Ŵ = (ŵn1, . . . , ŵnJ)′ is the 3 × J matrix with rows equal to the first three eigenvectors of the variance matrix of it. We use normalized eigenvectors so that Ŵ Ŵ′ = I3.19 Fitted yields can be obtained using ît = Ŵ′ x1t. Three factors generally fit the cross section of yields very well, with fitting errors v̂nt (pooled across maturities) that have a standard deviation of only a few basis points.20 Then we estimate by OLS a VAR(1) for x1t:

x1t = φ̂0 + φ̂1 x1,t−1 + e1t,   t = 1, . . . , T.   (22)

This time-series specification for x1t completes our simple factor model for the yield curve. Though this model does not impose absence of arbitrage, it captures both the dynamic evolution and the cross-sectional dependence of yields. Studies that have documented that such a simple factor model fits and forecasts the yield curve well include Duffee (2011) and Hamilton and Wu (2014). Next we generate 5000 artificial yield data samples from this model, each with length T equal to the original sample length.

17 A closely related problem arises in classical spurious regression; see Hamilton (1994, p. 562).
18 An alternative approach would be a nonparametric bootstrap under the null hypothesis, using for example a moving-block bootstrap to re-sample x1t and x2t. However, Berkowitz and Kilian (2000) found that parametric bootstrap methods such as ours typically perform better than nonparametric methods.
19 We choose the eigenvectors so that the elements in the last column of Ŵ are positive—see also footnote 7.
20 For example, in the case study of Joslin et al. (2014) in Section 3, the standard deviation is 6.5 basis points.
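The factor-extraction step just described can be sketched as follows. This is an illustrative implementation on a hypothetical yield panel; the function name and the demeaning convention are ours:

```python
import numpy as np

def yield_pcs(yields):
    """First three principal components of a T x J panel of yields.
    Returns x1 (T x 3) and the 3 x J loading matrix W with W @ W.T = I_3,
    so fitted yields are x1 @ W. A sketch of the construction in the text;
    the demeaning convention may differ from the original."""
    centered = yields - yields.mean(axis=0)
    cov = np.cov(centered, rowvar=False)       # J x J variance matrix
    eigval, eigvec = np.linalg.eigh(cov)       # eigenvalues in ascending order
    W = eigvec[:, ::-1][:, :3].T               # rows: top three eigenvectors
    # sign convention: make the last element of each loading vector positive
    W = W * np.sign(W[:, -1])[:, None]
    x1 = centered @ W.T                        # T x 3 matrix of PCs
    return x1, W
```

Because the rows of W are orthonormal eigenvectors, W @ W.T recovers the identity, matching the normalization Ŵ Ŵ′ = I3 in the text.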
We first iterate21 on

x∗1τ = φ̂0 + φ̂1 x∗1,τ−1 + e∗1τ,

where e∗1τ denotes bootstrap residuals. Then we obtain the artificial yields using

i∗nτ = ŵn′ x∗1τ + v∗nτ,   (23)

for v∗nτ ∼ N(0, σv²). The standard deviation of the measurement errors, σv, is set to the sample standard deviation of the fitting errors v̂nt.22 We thus have generated an artificial sample of yields i∗nτ which by construction only three factors (the elements of x∗1τ) have any power to predict, but whose covariance and dynamics are similar to those of the observed data int. Notably, our bootstrapped yields are first-order Markov—under our bootstrap the current yield curve contains all the information necessary to forecast future yields. We likewise fit a VAR(1) to the observed data for the proposed predictors x2t,

x2t = α̂0 + α̂1 x2,t−1 + e2t,   (24)

from which we then bootstrap 5000 artificial samples x∗2τ in a similar fashion as for x∗1τ. The bootstrap residuals (e∗1τ′, e∗2τ′)′ are drawn from the joint empirical distribution of (e1t′, e2t′)′.23

21 We start the recursion with a draw from the unconditional distribution implied by the estimated VAR for x1t.
22 We can safely assume serially uncorrelated fitting errors, despite some evidence in the literature to the contrary (Adrian et al., 2013; Hamilton and Wu, 2014). Recall that our goal is to investigate the small-sample properties of previously calculated test statistics in an environment in which the null hypothesis holds by construction. Adding serial correlation in v∗nτ would only add yet another possible reason why the spanning hypothesis could have been spuriously rejected by earlier researchers.
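The sample-generation step above can be sketched as follows. Names and defaults are ours; note that the text initializes the recursion from the VAR's unconditional distribution, whereas this sketch uses a burn-in period to the same effect:

```python
import numpy as np

def bootstrap_yields(phi0, phi1, W, resids, sigma_v, T, n_burn=100, seed=0):
    """One artificial sample under the spanning hypothesis: iterate the VAR(1)
    x1* = phi0 + phi1 @ x1*_{-1} + e1*, with e1* resampled i.i.d. from the
    estimated VAR residuals, then form yields i* = x1* @ W + v* with
    v* ~ N(0, sigma_v^2). Returns the factors and the yields."""
    rng = np.random.default_rng(seed)
    k, J = W.shape                                 # k factors, J maturities
    x = np.zeros(k)
    factors = np.empty((T, k))
    for tau in range(n_burn + T):
        e = resids[rng.integers(len(resids))]      # resample a residual row
        x = phi0 + phi1 @ x + e
        if tau >= n_burn:
            factors[tau - n_burn] = x
    v = rng.normal(0.0, sigma_v, size=(T, J))      # measurement error
    return factors, factors @ W + v
```

By construction only the k factors have any power to predict the generated yields, which is the null hypothesis the bootstrap is designed to impose.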
Using the bootstrapped samples of predictors and yields, we can then investigate the properties of any proposed test statistic involving y∗τ+h, x∗1τ, and x∗2τ in a sample for which the dynamics and serial correlation of yields and explanatory variables are similar to those in the actual data, but in which by construction the null hypothesis is true that x∗2τ has no predictive power for future yields and bond returns.24 In particular, under our bootstrap there are no unspanned macro risks. To see how to test the spanning hypothesis using the bootstrap, consider for example a t-test for significance of a parameter in β2. Denote the t-statistic in the data by t and the corresponding t-statistic in bootstrap sample i as t∗i. We calculate the bootstrap p-value as the fraction of samples in which |t∗i| > |t|, and would reject the null if this is less than, say, five percent. In addition, we can estimate the true size of a conventional t-test as the fraction of samples in which |t∗i| exceeds the usual asymptotic critical value. One concern about this procedure is related to the well-known fact that under local-to-unity asymptotics, the bootstrap generally cannot provide a test of the correct nominal size.25 The reason is that the test statistics are not asymptotically pivotal, as their distribution depends on the nuisance parameters c1 and c2, which cannot be consistently estimated. For our purpose, however, this is not a concern, for two reasons. First, when the goal (as in this investigation) is to judge whether the existing evidence against the spanning hypothesis is compelling, we do not need to be worried about a test that is not conservative enough. Let's say our bootstrap procedure does not completely eliminate the size distortions and leads to a test that still rejects somewhat too often. If such a test nevertheless fails to reject the spanning hypothesis, we know this could not be attributed to the test being too conservative, but instead accurately conveys a lack of evidence against the null. Nor is a failure to reject a reflection of a lack of power. In additional, unreported results we have found that for those coefficients that are non-zero in our bootstrap DGP, we consistently and strongly reject the null. Moreover, we can directly evaluate the accuracy of our bootstrap procedure using simulations. It is straightforward to use the Monte Carlo simulations in Section 2.2.2 to calculate what the size of our bootstrap procedure would be if applied to a specified parametric model.

23 We also experimented with a Monte Carlo design in which e∗1τ was drawn from a Student-t dynamic conditional correlation GARCH model (Engle, 2002) fit to the residuals e1t, with similar results to those obtained using independently resampled e1t and e2t.
24 For example, if yt+h is an h-period excess return as in equation (8), then in our bootstrap
$$y^*_{\tau+h} = n i^*_{n\tau} - h i^*_{h\tau} - (n-h) i^*_{n-h,\tau+h}$$
$$= n(\hat{w}_n' x^*_{1\tau} + v^*_{n\tau}) - h(\hat{w}_h' x^*_{1\tau} + v^*_{h\tau}) - (n-h)(\hat{w}_{n-h}' x^*_{1,\tau+h} + v^*_{n-h,\tau+h})$$
$$= n(\hat{w}_n' x^*_{1\tau} + v^*_{n\tau}) - h(\hat{w}_h' x^*_{1\tau} + v^*_{h\tau}) - (n-h)\left[\hat{w}_{n-h}'\left(\hat{k}_h + e^*_{1,\tau+h} + \hat{\phi}_1 e^*_{1,\tau+h-1} + \cdots + \hat{\phi}_1^{h-1} e^*_{1,\tau+1} + \hat{\phi}_1^{h} x^*_{1\tau}\right) + v^*_{n-h,\tau+h}\right],$$
which replicates the date t predictable component and the MA(h−1) serial correlation structure of the holding returns that is both seen in the data and predicted under the spanning hypothesis.
25 This result goes back to Basawa et al. (1991). See also Hansen (1999) as well as Horowitz (2001) and the references therein.
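The bootstrap p-value and the estimated true size of the conventional test described above are both simple sample fractions; a minimal sketch (function names are ours):

```python
import numpy as np

def bootstrap_pvalue(t_data, t_boot):
    """Bootstrap p-value: fraction of replications whose |t*| exceeds |t|."""
    return np.mean(np.abs(np.asarray(t_boot)) > abs(t_data))

def estimated_size(t_boot, crit=1.96):
    """Estimated true size of the conventional test: fraction of bootstrap
    samples in which |t*| exceeds the usual asymptotic critical value."""
    return np.mean(np.abs(np.asarray(t_boot)) > crit)
```

For example, with bootstrap t-statistics [0.5, −2.5, 1.0, 3.0, −1.5] and a data t-statistic of 2.0, two of five replications exceed 2.0 in absolute value, giving a p-value of 0.4.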
In each sample i simulated from a known parametric model, we can: (i) calculate the usual t-statistic (denoted t̃i ) for testing the null hypothesis that β2 = 0; (ii) estimate the autoregressive models for the predictors by using OLS on that sample; (iii) generate a single bootstrap simulation using these estimated autoregressive coefficients; (iv) estimate the predictive regression on the bootstrap simulation;26 (v) calculate the t-test of β2 = 0 on this bootstrap predictive regression, denoted t∗i . We generate 5,000 samples from the maintained model, repeating steps (i)-(v), and then calculate the value c such that |t∗i | > c in 5% of the samples. Our bootstrap procedure amounts to the recommendation of rejecting H0 if |t̃i | > c, and we can calculate from the above simulation the fraction of samples in which this occurs. This number tells us the true size if we were to apply our bootstrap procedure to the chosen parametric model. This number is reported in the second-to-last row of Table 2. We find in these settings that our bootstrap has a size above but fairly close to five percent. The size distortion is always smaller for our bootstrap than for the conventional t-test. We will repeat the above procedure to estimate the size of our bootstrap test in each of our empirical applications, taking a model whose true coefficients are those of the VAR estimated in the sample as if it were the known parametric model, and estimating VAR’s from data generated using those coefficients. To foreshadow those results, we will find that the size is typically quite close to or slightly above five percent. In addition, we find that our bootstrap procedure has good power properties. The implication is that if our bootstrap procedure fails to reject the spanning hypothesis, we can safely conclude that the evidence against the spanning hypothesis in the original data is not persuasive. 
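Steps (i)-(v) can be sketched in the simple two-predictor setting of Section 2.2.2. This is an illustrative simplification (scalar AR(1) predictors without intercepts, residual-based resampling, and a single bootstrap draw per simulated sample, as in the text):

```python
import numpy as np

def tstat_beta2(y, x1, x2):
    """OLS t-statistic for H0: beta2 = 0 in a regression of y on (1, x1, x2)."""
    X = np.column_stack([np.ones_like(x1), x1, x2])
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    s2 = np.sum((y - X @ b) ** 2) / (len(y) - 3)
    return b[2] / np.sqrt(s2 * np.linalg.inv(X.T @ X)[2, 2])

def ar1_ols(x):
    """OLS AR(1) slope (no intercept, matching the zero-mean sketch DGP)."""
    return np.sum(x[:-1] * x[1:]) / np.sum(x[:-1] ** 2)

def bootstrap_size(n_sims=500, T=100, rho=0.99, delta=1.0, seed=0):
    """Sketch of steps (i)-(v): each simulated sample yields a pair (t_i, t*_i);
    c is the 95th percentile of |t*|, and the size of the bootstrap test is the
    fraction of samples with |t_i| > c."""
    rng = np.random.default_rng(seed)
    t_data, t_boot = [], []
    for _ in range(n_sims):
        # (i) simulate the DGP with beta2 = 0 and compute the t-statistic
        eps1, eps2, v = rng.standard_normal((3, T))
        x1, x2 = np.zeros(T + 1), np.zeros(T + 1)
        for t in range(T):
            x1[t + 1] = rho * x1[t] + eps1[t]
            x2[t + 1] = rho * x2[t] + eps2[t]
        y = rho * x1[:-1] + delta * eps1 + np.sqrt(1 - delta**2) * v
        t_data.append(tstat_beta2(y, x1[:-1], x2[:-1]))
        # (ii) estimate the autoregressions and the predictive regression
        r1, r2 = ar1_ols(x1), ar1_ols(x2)
        X = np.column_stack([np.ones(T), x1[:-1]])
        b = np.linalg.lstsq(X, y, rcond=None)[0]
        u_hat = y - X @ b
        # (iii) one bootstrap sample under the null: resample rows jointly
        idx = rng.integers(T, size=T)
        be1, be2, bu = eps1[idx], eps2[idx], u_hat[idx]
        bx1, bx2 = np.zeros(T + 1), np.zeros(T + 1)
        for t in range(T):
            bx1[t + 1] = r1 * bx1[t] + be1[t]
            bx2[t + 1] = r2 * bx2[t] + be2[t]
        by = b[0] + b[1] * bx1[:-1] + bu
        # (iv)-(v) t-test of beta2 = 0 in the bootstrap sample
        t_boot.append(tstat_beta2(by, bx1[:-1], bx2[:-1]))
    c = np.quantile(np.abs(t_boot), 0.95)
    return np.mean(np.abs(t_data) > c)
```

The returned number plays the role of the size estimate reported in the second-to-last row of Table 2.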
A separate but related issue is that least squares typically underestimates the autocorrelation of highly persistent processes due to small-sample bias (Kendall, 1954; Pope, 1990). Therefore the VAR we use in our bootstrap would typically be less persistent than the true data-generating process. For this reason, we might expect the bootstrap procedure to be slightly oversized.27 One way to deal with this issue is to generate samples not from the OLS estimates φ̂1 and α̂1 but instead use bias-corrected VAR estimates obtained with the bootstrap adopted by Kilian (1998). We refer to this below as the “bias-corrected bootstrap.”28

26 In this simple Monte Carlo setting, we bootstrap the dependent variable as y∗τ = φ̂1 x∗1,τ−1 + u∗τ, where u∗τ is resampled from the residuals in a regression of yt on x1,t−1, and is jointly drawn with ε∗1τ and ε∗2τ to maintain the same correlation as in the data. By contrast, in all our empirical analysis the bootstrapped dependent variable is obtained from (23) and the definition of yt+h (for example, equation (8)).
27 A test that would have size five percent if the serial correlation was given by ρ̂1 = 0.97 would have size greater than five percent if the true serial correlation is ρ1 = 0.99.
28 We have found in Monte Carlo experiments that the size of the bias-corrected bootstrap is closer to five percent than for the simple bootstrap.
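A bootstrap bias correction in the spirit of Kilian (1998) can be sketched for a scalar AR(1) coefficient. This is a hedged simplification: the paper applies the correction to the full VAR, and Kilian's stationarity adjustment is more elaborate than the simple truncation used here:

```python
import numpy as np

def bias_corrected_ar1(x, n_boot=1000, seed=0):
    """Estimate the small-sample bias of the OLS AR(1) slope by re-estimating
    on bootstrap samples generated from the OLS estimate, then subtract it.
    The result is capped below one to keep generated samples stationary
    (a simplified stand-in for Kilian's stationarity adjustment)."""
    rng = np.random.default_rng(seed)
    def ols_slope(z):
        return np.sum(z[:-1] * z[1:]) / np.sum(z[:-1] ** 2)
    rho_hat = ols_slope(x)
    resid = x[1:] - rho_hat * x[:-1]
    boot = np.empty(n_boot)
    for i in range(n_boot):
        e = resid[rng.integers(len(resid), size=len(x) - 1)]
        z = np.empty(len(x))
        z[0] = x[0]
        for t in range(len(x) - 1):
            z[t + 1] = rho_hat * z[t] + e[t]
        boot[i] = ols_slope(z)
    bias = boot.mean() - rho_hat        # negative for persistent processes
    return min(rho_hat - bias, 0.999)
```

Because the OLS bias is downward for persistent series, the corrected estimate is typically above the OLS estimate, producing more persistent bootstrap samples.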
We found that the most reliable existing test appears to be the one suggested by Ibragimov and Müller (2010), who proposed a novel method for testing a hypothesis about a scalar coefficient. The original dataset is divided into q subsamples and the statistic is estimated separately over each subsample. If these estimates across subsamples are approximately independent and Gaussian, then a standard t-test with q − 1 degrees of freedom can be carried out to test hypotheses about the parameter. Müller (2014) provided evidence that this test has excellent size and power properties in regression settings where standard HAC inference is seriously distorted. Our simulation results, to be discussed below, show that this test also performs very well in the specific settings that we consider in this paper, namely inference about the predictive power of certain variables for future interest rates and excess bond returns. Throughout this paper, we report two sets of results for the Ibragimov-Müller (IM) test, setting the number of subsamples q equal to either 8 or 16 (as in Müller, 2014). A notable feature of the IM test is that it allows us to carry out inference that is robust not only with respect to serial correlation but also with respect to parameter instability across subsamples, as we will discuss in Section 5. We use the same Monte Carlo simulation as before to estimate the size of the IM test in the simple setting with two scalar predictors. The results are shown in the last row of Table 2. The IM test has close to nominal size in all three settings. The reason is that the IM test is based on more accurate estimates of the sampling variability of the test statistic, obtained from the variation across subsamples. In this way, it solves the problem of standard error bias that conventional t-tests face.
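The IM procedure can be sketched as follows (an illustrative implementation with equally spaced block boundaries; we return only the t-statistic, leaving the Student-t p-value to the reader to avoid extra dependencies):

```python
import numpy as np

def im_tstat(y, X, coef_index, q=8):
    """Ibragimov-Mueller-style t-statistic: split the sample into q blocks,
    estimate the coefficient of interest by OLS in each block, and apply a
    one-sample t-test to the q block estimates (to be compared with a
    Student-t distribution with q - 1 degrees of freedom)."""
    T = len(y)
    edges = np.linspace(0, T, q + 1).astype(int)   # block boundaries
    est = []
    for j in range(q):
        ys = y[edges[j]:edges[j + 1]]
        Xs = X[edges[j]:edges[j + 1]]
        b = np.linalg.lstsq(Xs, ys, rcond=None)[0]
        est.append(b[coef_index])
    est = np.asarray(est)
    return est.mean() / (est.std(ddof=1) / np.sqrt(q))
```

Because the statistic uses only the dispersion of the q subsample estimates, it sidesteps the need for a HAC estimate of the long-run variance.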
Note, however, that coefficient bias would be a problem for the IM test, because it splits the (already small) sample into even smaller samples, which would magnify the small-sample coefficient bias. It is therefore important to assess whether the conditions are met for the IM test to work well in practice, which we will do below in our empirical applications. It will turn out that in our applications the IM test should perform very well.

3 Economic growth and inflation

In this section we examine the evidence reported by Joslin et al. (2014) (henceforth JPS) that macro variables may help predict bond returns. We will follow JPS and focus on predictive regressions as in equation (1) where yt+h is an excess bond return for a one-year holding period (h = 12), x1t is a vector consisting of a constant and the first three PCs of yields, and x2t consists of a measure of economic growth (the three-month moving average of the Chicago Fed National Activity Index, GRO) and of inflation (one-year CPI inflation expectations from the Blue Chip Financial Forecasts, INF). While JPS also presented model-based evidence in favor of unspanned macro risks, all of those results stem from the substantial in-sample predictive power of x2t in these excess return regressions. The sample contains monthly observations over the period 1985:1-2007:12.

3.1 Predictive power according to adjusted R̄²

JPS found that for the ten-year bond, the adjusted R̄² of regression (1) when x2t is excluded is only 0.20. But when they added x2t, the R̄² increased to 0.37. For the two-year bond, the change is even more striking, with R̄² increasing from 0.14 without the macro variables to 0.48 when they are included.
JPS interpreted these adjusted R̄² as strong evidence that macroeconomic variables have predictive power for excess bond returns beyond the information in the yield curve itself, and concluded from this evidence that “macroeconomic risks are unspanned by bond yields” (p. 1203). However, there are some warning flags for these predictive regressions, which we report in Table 3. First, the predictors in x2t are very persistent. The first-order sample autocorrelations for GRO and INF are 0.91 and 0.99, respectively. The yield PCs in x1t, in particular the level and slope, are of course highly persistent as well, which is a common feature of interest rate data. Second, to assess strict exogeneity of the predictors, we report estimated values for δ, the correlation between innovations to the predictors, ε1t and ε2t, and the lagged prediction error, ut.29 The innovations are obtained from the estimated VAR models for x1t and x2t, and the prediction error is calculated from least squares estimates of equation (1) for yt+h the average excess bond return for two- through ten-year maturities. For the first PC of yields, the level of the yield curve, strict exogeneity is strongly violated, as the absolute value of δ is substantial. Its sizable negative value is due to the mechanical relationship between bond returns and the level of the yield curve: a positive innovation to PC1 at t raises all yields and mechanically lowers bond returns from t − h to t. Hence such a violation of strict exogeneity will always be present in predictive regressions for bond returns that include the current level of the yield curve. In light of our results in Section 2, these warning flags suggest that small-sample issues are present, and we will use the bootstrap to address them.

29 While in our theory in Section 2.2 δ was the correlation of the (scalar) innovation of x1t with past prediction errors, here we calculate it for all predictors in x1t and x2t.
Table 4 shows R̄² of predictive regressions for the excess bond returns on the two- and ten-year bond, and for the average excess return across maturities. The first three columns are for the same data set as was used by JPS.30 The first row in each panel reports the actual R̄², and for the excess returns on the 2-year and 10-year bonds essentially replicates the results in JPS.31 The entry R̄₁² gives the adjusted R̄² for the regression with only x1t as predictors, and R̄₂² corresponds to the case when x2t is added to the regression. The second row reports the mean R̄² across 5000 replications of the bootstrap described in Section 2.3, that is, the average value we would expect to see for these statistics in a sample of the size used by JPS in which x2t in fact has no true ability to predict yt+h but whose serial correlation properties are similar to those of the observed data. The third row gives 95% confidence intervals, calculated from the bootstrap distribution of the test statistics. For all predictive regressions, the variability of the adjusted R̄² is very high. Values for R̄₂² up to about 63% would not be uncommon, as indicated by the bootstrap confidence intervals. Most notably, adding the regressors x2t often substantially increases the adjusted R̄², by 23 percentage points or more, although x2t has no predictive power in population by construction. For the ten-year bond, JPS report an increase of 17 percentage points when adding macro variables, but our results show that this increase is in fact not statistically significant at conventional significance levels. For the two-year bond, the increase in R̄² of 35 percentage points is statistically significant. However, the two-year bond seems to be special among the yields one could look at.
When we look for example at the average excess return across all maturities, our bootstrap finds no evidence that x2t has predictive power beyond the information in the yield curve, as reported in the last panel of Table 4. Since the persistence of x2t is high, it may be important to adjust for small-sample bias in the VAR estimates. For this reason we also carried out the bias-corrected (BC) bootstrap. The expected values and 95% confidence intervals are reported in the bottom two rows of each panel of Table 4. As expected, more serial correlation in the generated data (due to the bias correction) increases the mean and the variability of the adjusted R̄² and of their difference. In particular, while the difference R̄₂² − R̄₁² for the average excess return regression was marginally significant at the 10-percent level for the simple bootstrap, it is insignificant at this level for the BC bootstrap. The right half of Table 4 updates the analysis to include an additional 7 years of data. As expected under the spanning hypothesis, the value of R̄₂² that is observed in the data falls significantly when new data are added.

30 Their yield data set ends in 2008, with the last observation in their regression corresponding to the excess bond return from 2007:12 to 2008:12.
31 The yield data set of JPS includes the six-month and the one- through ten-year Treasury yields. After calculating annual returns for the two- to ten-year bonds, JPS discarded the six-, eight-, and nine-year yields before fitting PCs and their term structure models. Here, we need the fitted nine-year yield to construct the return on the ten-year bond, so we keep all 11 yield maturities. While our PCs are therefore slightly different than those in JPS, the only noticeable difference is that our adjusted R̄² in the regressions for the two-year bond with yield PCs and macro variables is 0.49 instead of their 0.48.
And although the bootstrap 95% confidence intervals for R̄₂² − R̄₁² are somewhat tighter with the longer data set, the conclusion that there is no statistically significant evidence of added predictability provided by x2t is even more compelling. For all bond maturities, the increases in adjusted R̄² from adding macro variables as predictors for excess returns lie comfortably inside the bootstrap confidence intervals.

3.2 Testing the spanning hypothesis

Is the predictive power of macro variables statistically significant? JPS only reported adjusted R̄² for their excess return regressions, but one is naturally interested in formal tests of the spanning hypothesis. The common approach to address the serial correlation in the residuals due to overlapping observations is to use the HAC standard errors and test statistics proposed by Newey and West (1987), typically using 18 lags (see among many others Cochrane and Piazzesi, 2005; Ludvigson and Ng, 2009). In the second row of Table 5 we report the resulting t-statistic for each coefficient, along with the Wald test of the hypothesis β2 = 0, calculated using Newey-West standard errors with 18 lags. The third row reports asymptotic p-values for these statistics. According to this popular test, GRO and INF appear strongly significant, both individually and jointly. In particular, the Wald statistic has a p-value below 0.1%. We then employ our bootstrap to carry out tests of the spanning hypothesis that account for the small-sample issues described in Section 2. Again, we use both a simple bootstrap based on OLS estimates of the VAR parameters, as well as a bias-corrected (BC) bootstrap. For each, we report five-percent critical values for the t- and Wald statistics, calculated as the 95th percentiles of the bootstrap distribution, as well as bootstrap p-values, i.e., the frequency of bootstrap replications in which the bootstrapped test statistics are at least as large as in the data.
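The Newey-West covariance estimator referred to above can be sketched as follows (a standard textbook implementation with a Bartlett kernel; the function name is ours):

```python
import numpy as np

def newey_west_se(X, resid, lags=18):
    """HAC (Newey-West, 1987) standard errors of OLS coefficients with a
    Bartlett kernel, as commonly used with 18 lags for overlapping annual
    returns in monthly data."""
    T, k = X.shape
    Xu = X * resid[:, None]                    # moment contributions x_t * u_t
    S = Xu.T @ Xu / T                          # lag-0 term
    for l in range(1, lags + 1):
        w = 1.0 - l / (lags + 1.0)             # Bartlett weight
        G = Xu[l:].T @ Xu[:-l] / T             # lag-l autocovariance
        S += w * (G + G.T)
    XtX_inv = np.linalg.inv(X.T @ X / T)
    V = XtX_inv @ S @ XtX_inv / T              # sandwich variance
    return np.sqrt(np.diag(V))
```

With lags set to zero this reduces to White's heteroskedasticity-robust standard errors; the point of the paper is that even a well-computed S does not repair the small-sample distortions documented in Section 2.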
Using the simple bootstrap, the coefficient on GRO is insignificant, while INF is still marginally significant. Using the BC bootstrap, however, the coefficients are both individually and jointly insignificant, in stark contrast to the conventional HAC tests. We also report in Table 5 the p-values for the IM test of the individual significance of the coefficients. The level and slope of the yield curve (PC1 and PC2) are strongly significant predictors according to both IM tests.32 This will turn out to be a consistent finding in all the data sets that we will look at—the level and slope of the yield curve appear to be robust predictors of bond returns, consistent with an old literature going back to Fama and Bliss (1987) and Campbell and Shiller (1991).33 By contrast, the coefficients on GRO and INF are not statistically significant at conventional significance levels based on the IM test. We then use the bootstrap to calculate the properties of the different tests for data with serial correlation properties similar to those observed in the sample. In particular, we estimate the true size of the HAC, bootstrap, and IM tests with nominal size of five percent, and report these in the last four rows of the top panel of Table 5. For the HAC tests, this is simply the frequency of bootstrap replications in which the t- and Wald statistics exceed the usual asymptotic critical values. The results reveal that the true size of the conventional tests is 21-38% instead of the presumed five percent.34 These substantial size distortions are also reflected in the bootstrap critical values, which far exceed the conventional critical values. The bootstrap and the IM tests, in contrast, have a size that is estimated to be very close to five percent, eliminating almost all of the size distortions of the more conventional tests. As in the originally published work, we study returns with twelve-month holding periods in all empirical applications of this paper.
One might be interested, however, in the magnitude of the size distortions for one-month bond returns. In such a setting, only the lack of strict exogeneity of x1t causes problems for small-sample inference, and not the serial correlation in the prediction errors. In additional, unreported results using the JPS data, we find that in regressions for one-month excess returns the bootstrap does not reject the spanning hypothesis. The conventional tests have serious size distortions, which are however smaller than in the presence of serially correlated errors.35 The implication is that the substantial small-sample size distortions we reported above for data with overlapping returns are due to a combination of both problems, serially correlated errors as well as lack of strict exogeneity. When we add more data to the sample, we again find that the statistical evidence of predictability declines substantially, as seen in the second panel of Table 5. When the data set is extended through 2013, the HAC test statistics are only marginally significant or insignificant, even if interpreted assuming the usual asymptotics. Using the bootstrap to take into account the small-sample size distortions of such tests, these test statistics are far from significant.

32 The low p-values are also consistent with the conclusion from our unreported Monte Carlo investigation that IM has good power to reject a false null hypothesis.
33 We have also calculated small-sample confidence intervals using the bootstrap, which confirm that the coefficients on PC1 and PC2 are significant.
34 Using the BC bootstrap gives an even higher estimate of the true size of the HAC Wald test, about 45%.
35 Specifically, if we use White standard errors, as Duffee (2013b) and others do for predictions of one-month excess returns, the BC bootstrap estimate of the true size of the Wald test of the spanning hypothesis is 15%.
Regarding the results for the IM test, we also find in this extended sample that the slope is an important predictor of excess bond returns, consistent with a large existing literature, whereas the coefficients on the macro variables are insignificant. We conclude that the evidence in JPS on the predictive power of macro variables for yields and bond returns is not altogether convincing. Nevertheless, JPS noted that theirs is only one of several papers claiming to have found such evidence. We turn to these studies in the following sections.

4 Factors of large macro data sets

Ludvigson and Ng (2009, 2010) found that factors extracted from a large macroeconomic data set are helpful in predicting excess bond returns, above and beyond the information contained in the yield curve, adding further evidence for the claim of unspanned macro risks and against the hypothesis of invertibility. Here we revisit this evidence, focusing on the results in Ludvigson and Ng (2010) (henceforth LN). LN started with a panel data set of 131 macro variables observed over 1964:1-2007:12 and extracted eight macro factors using the method of principal components. These factors, which we denote by F1 through F8, were then related to future one-year excess returns on two- through five-year Treasury bonds. The authors carried out an extensive specification search in which they considered many different combinations of the factors along with squared and cubic terms. They also included in their specification search the bond-pricing factor proposed by Cochrane and Piazzesi (2005), which is the linear combination of forward rates that best predicts the average excess return across maturities, and which we denote here by CP. LN's conclusion was that macro factors appear to help predict excess returns, even when controlling for the CP factor.
This conclusion is based mostly on comparisons of adjusted R̄2 in regressions with and without the macro factors, and on HAC inference using Newey-West standard errors.

4.1 Robust inference about coefficients on macro factors

One feature of LN's design obscures the evidence relevant for the null hypothesis that is the focus of our paper. Their null hypothesis is that the CP factor alone provides all the information necessary to predict bond yields, whereas our null hypothesis of interest is that the three variables (PC1, PC2, PC3) contain all the necessary information. Their regressions in which CP alone is used to summarize the information in the yield curve therefore cannot be used to test our null hypothesis. For this reason, we begin by examining predictive regressions similar to those in LN, in which excess bond returns are regressed on three PCs of the yields and all eight of the LN macro factors. We leave aside the specification search of LN in order to focus squarely on hypothesis testing for a given regression specification.[36] These regressions take the same form as (1), where now yt+h is the average one-year excess bond return for maturities of two through five years, x1t contains a constant and three yield PCs, and x2t contains eight macro PCs. As before, our interest is in testing the hypothesis H0: β2 = 0. The top panel of Table 6 reports regression results for LN's original sample. The first three rows show the coefficient estimates, HAC t- and Wald statistics (using Newey-West standard errors with 18 lags, as in LN), and p-values based on the asymptotic distributions of these test statistics. Five macro factors appear to be statistically significant at the ten-percent level, of which three are significant at the five-percent level. The Wald statistic for H0 far exceeds the critical values for conventional significance levels (the five-percent critical value for a χ2(8) distribution is 15.5).
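The HAC Wald test of H0: β2 = 0 can be sketched with a generic textbook implementation of Newey-West standard errors with Bartlett weights. This is not the authors' code, and the data below are simulated purely to exercise the functions; in real applications the lag choice (here 18, following LN) matters.

```python
import numpy as np

def newey_west_cov(X, resid, lags):
    """HAC (Newey-West) covariance of OLS coefficients, Bartlett weights."""
    T, k = X.shape
    u = X * resid[:, None]              # score contributions
    S = u.T @ u / T
    for j in range(1, lags + 1):
        w = 1.0 - j / (lags + 1)
        G = u[j:].T @ u[:-j] / T
        S += w * (G + G.T)
    XtX_inv = np.linalg.inv(X.T @ X / T)
    return XtX_inv @ S @ XtX_inv / T

def hac_wald(y, X, restr_idx, lags=18):
    """Wald statistic for H0: the coefficients indexed by restr_idx are all zero."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    V = newey_west_cov(X, y - X @ beta, lags)
    b = beta[restr_idx]
    Vb = V[np.ix_(restr_idx, restr_idx)]
    return b @ np.linalg.solve(Vb, b)

# simulated demo: x1 truly predicts y, x2 is irrelevant
rng = np.random.default_rng(5)
T = 500
x1 = rng.standard_normal(T)
x2 = rng.standard_normal(T)
y = 2.0 * x1 + rng.standard_normal(T)
X = np.column_stack([np.ones(T), x1, x2])
w_null = hac_wald(y, X, [2])   # tests the irrelevant regressor
w_alt = hac_wald(y, X, [1])    # tests the relevant regressor
```

Asymptotically the statistic is compared with a χ2 critical value with as many degrees of freedom as restrictions; the paper's point is that with persistent regressors and overlapping returns this asymptotic comparison can be badly misleading.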
Table 7 reports the adjusted R̄2 for the restricted (R̄2_1) and unrestricted (R̄2_2) regressions, and shows that this measure of fit increases by 10 percentage points when the macro factors are included. Taken at face value, this evidence suggests that macro factors have strong predictive power, above and beyond the information contained in the yield curve, consistent with the overall conclusions of LN. How robust are these econometric results? We first check the warning flags summarized in Table 3. As usual, the yield PCs are very persistent. The macro factors differ in their persistence, but even the most persistent ones have first-order autocorrelations of only around 0.75, so the persistence of x2t is lower than in the data of JPS but still considerable. Again the first PC of yields strongly violates strict exogeneity, for the reasons explained above. Based on these indicators, it appears that small-sample problems may well distort the results of conventional inference methods. To assess their potential importance in this context, we bootstrapped 5,000 data sets of artificial yields and macro data in which H0 is true in population. The samples each contain 516 observations, corresponding to the length of the original data sample. We report results only for the simple bootstrap without bias correction, because the bias in the VAR for x2t is estimated to be small. Before turning to the results, it is worth noting the differences between our bootstrap exercise and the bootstrap carried out by LN. Their bootstrap is designed to test the null hypothesis that excess returns are not predictable against the alternative that they are predictable by macro factors and the CP factor.
[36] We were able to closely replicate the results in LN's tables 4 through 7, and have also applied our techniques to those regressions, which led to qualitatively similar results.
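The mechanics of a bootstrap that imposes the spanning hypothesis can be sketched as follows. This is a stripped-down stand-in, not the paper's procedure: a VAR(1) is fit to simulated "yield PCs," artificial samples are built from iid-resampled residuals, returns in each artificial sample depend on the PCs alone (so H0 holds by construction), and the bootstrap critical value is the 95th percentile of the resulting |t|-statistics on an irrelevant persistent regressor. In this toy design there is little distortion; the paper's applications add overlapping returns and non-exogenous regressors.

```python
import numpy as np

rng = np.random.default_rng(2)

def var1_fit(Z):
    """Fit a VAR(1), Z_t = c + A Z_{t-1} + e_t, by OLS."""
    Y = Z[1:]
    X = np.column_stack([np.ones(len(Z) - 1), Z[:-1]])
    B, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return B[0], B[1:].T, Y - X @ B

def bootstrap_sample(Z, c, A, resid):
    """One artificial sample built from iid-resampled VAR residuals."""
    T = len(Z)
    Zb = np.empty_like(Z)
    Zb[0] = Z[0]
    draws = resid[rng.integers(0, len(resid), T - 1)]
    for t in range(1, T):
        Zb[t] = c + A @ Zb[t - 1] + draws[t - 1]
    return Zb

# simulated "PCs": three persistent series standing in for level/slope/curvature
T = 200
Z = np.zeros((T, 3))
for t in range(1, T):
    Z[t] = 0.97 * Z[t - 1] + rng.standard_normal(3) * [1.0, 0.5, 0.2]
c, A, resid = var1_fit(Z)

def tstat_spurious(Zb):
    """t-stat on a persistent regressor that is irrelevant by construction."""
    x2 = np.zeros(T)
    for t in range(1, T):
        x2[t] = 0.9 * x2[t - 1] + rng.standard_normal()
    y = Zb[:, 0] - Zb[:, 1] + rng.standard_normal(T)   # depends on PCs only: H0
    X = np.column_stack([np.ones(T), Zb, x2])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta
    V = np.linalg.inv(X.T @ X) * (e @ e) / (T - X.shape[1])
    return beta[-1] / np.sqrt(V[-1, -1])

tstats = np.array([tstat_spurious(bootstrap_sample(Z, c, A, resid))
                   for _ in range(500)])
crit95 = np.quantile(np.abs(tstats), 0.95)
```

Comparing the sample t-statistic with `crit95` rather than 1.96 is what distinguishes the bootstrap test from conventional HAC inference.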
Using this setting, LN produced convincing evidence that excess returns are predictable, which is fully consistent with all the results in our paper. Our null hypothesis of interest, however, is that excess returns are predictable only by current yields. While LN also reported results for a bootstrap under the alternative hypothesis, our bootstrap allows us to provide a more accurate assessment of the spanning hypothesis, and to estimate the size of conventional tests under the null. As seen in Table 6, our bootstrap finds that only three coefficients are significant at the ten-percent level (instead of five using conventional critical values), and one at the five-percent level (instead of three). While the Wald statistic is significant even compared to the critical value from the bootstrap distribution, the evidence is weaker than when using the asymptotic distribution. Table 7 shows that the observed increase in predictive power from adding macro factors to the regression, as measured by R̄2, would not be implausible if the null hypothesis were true: the increase in R̄2 is within the 95% bootstrap confidence interval. Table 6 also reports p-values for the two IM tests, using q = 8 and 16 subsamples. Only the coefficient on F7 is significant at the 5% level using this test, and then only for q = 16. The robustly significant predictors are the level and the slope of the yield curve. We again use the bootstrap to estimate the true size of the different tests with a nominal size of five percent. The results, reported in the bottom four rows of the top panel of Table 6, reveal that the conventional tests have serious size distortions. The true size of the t-tests is 9-14 percent, instead of the nominal five percent, and the size distortion for the Wald test is particularly large, with a true size of 34 percent. By contrast, the bootstrap and IM tests appear to have close to correct size according to our calculations.
The failure to reject the null based on the IM tests reflects the fact that the parameter estimates are often unstable across subsamples. Duffee (2013b, Section 7) has also noted problems with the stability of the results in Cochrane and Piazzesi (2005) and Ludvigson and Ng (2010) across different sample periods. To explore this further, we repeated our analysis using the same 1985-2013 sample period that was used in the second panel of Tables 4 and 5. Note that whereas in the case of JPS this was a strictly larger sample than the original, in the case of LN our second sample adds data at the end but leaves some out at the beginning. Reasons for interest in this sample period include the significant break in monetary policy in the early 1980s, the advantages of having a uniform sample period for comparison across all the different studies considered in our paper, and investigating the robustness of the original claims in describing data since the papers were originally published.[37] We used the macro data set of McCracken and Ng (2014) to extract macro factors in the same way as LN over the more recent data.[38] The bottom panels of Tables 6 and 7 display the results. Over the later sample period, the evidence for the predictive power of macro factors is even weaker. Notably, the Wald tests reject H0 for both bond maturities (at the ten-percent level for the five-year bond) when using asymptotic critical values, but are very far from significant when using bootstrap critical values. The increases in adjusted R̄2 in Table 7 are not statistically significant, and the IM tests find essentially no evidence of predictive power of the macro factors. These results imply that the evidence that macro factors have predictive power beyond the information already contained in yields is much weaker than the results in LN would initially have suggested.
[37] We also analyzed the full 1964-2013 sample and obtained results similar to those over the 1964-2007 sample.
For the original sample used by LN, our bootstrap procedure reveals substantial small-sample size distortions and weakens the statistical significance of the predictive power of macro variables, while the IM test indicates that only the level and slope are robust predictors. For the later sample, there is no evidence for unspanned macro risks at all. Our overall conclusion is that the predictive power of macro variables is much more tenuous than one would have thought from the published results, and that both small-sample issues and subsample instability raise serious robustness concerns.

4.2 Robust inference about return-forecasting factors

LN also constructed a single return-forecasting factor, using an approach similar to that of Cochrane and Piazzesi (2005). They regressed excess bond returns, averaged across the two- through five-year maturities, on the macro factors plus a cubed term of F1, which they found to be important. The fitted values of this regression produced their return-forecasting factor, denoted H8. The CP factor of Cochrane and Piazzesi (2005) is constructed similarly, using a regression on five forward rates. Adding H8 to a predictive regression with CP substantially increases the adjusted R̄2 and leads to a highly significant coefficient on H8. LN emphasized this result and interpreted it as further evidence that macro variables have predictive power beyond the information in the yield curve. Tables 8 and 9 replicate LN's results for these regressions on the macro-based (H8) and yield-based (CP) return-forecasting factors.[39] Table 8 shows coefficient estimates and statistical significance, while Table 9 reports R̄2. In LN's data, both CP and H8 are strongly significant, with HAC p-values below 0.1%. Adding H8 to the regression increases the adjusted R̄2 by 9-11 percentage points.
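The construction of such a return-forecasting factor (fitted values from a preliminary regression, as for CP and H8) can be sketched as follows. The data and the cubic signal below are invented for illustration; only the structure (eight factors plus a cubed F1, fitted values as the factor) follows the description above.

```python
import numpy as np

def return_forecasting_factor(avg_excess_returns, predictors):
    """Fitted values from regressing the average excess return on the
    predictors: the construction behind CP (five forward rates) and
    H8 (macro factors plus a cubic term in F1)."""
    X = np.column_stack([np.ones(len(predictors)), predictors])
    b, *_ = np.linalg.lstsq(X, avg_excess_returns, rcond=None)
    return X @ b

# simulated stand-ins for the macro factors and average excess returns
rng = np.random.default_rng(3)
T = 300
F = rng.standard_normal((T, 8))
rx = F[:, 0] + 0.3 * F[:, 0] ** 3 + rng.standard_normal(T)
X_h8 = np.column_stack([F, F[:, [0]] ** 3])   # eight factors plus cubed F1
H8 = return_forecasting_factor(rx, X_h8)
```

Because the factor is itself an in-sample fitted value, it inherits estimation noise from the preliminary step, which is one reason (discussed below) why conventional inference on its coefficient can be misleading.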
[38] Using this macro data set and the same sample period as LN, we obtained results that were very similar to those in the original paper, which gives us confidence in the consistency of the macro data set.
[39] These results correspond to those in column 9 of tables 4-7 in LN.
How plausible would it have been to obtain these results if macro factors in fact have no predictive power? To answer this question, we adjust our bootstrap design to handle regressions with the return-forecasting factors CP and H8. To this end, we simply add an additional step in the construction of our artificial data, calculating CP and H8 in each bootstrap data set as the fitted values from preliminary regressions in exactly the same way that LN did in the actual data. The results in Table 8 show that the bootstrap p-values are substantially larger than the asymptotic HAC p-values, and H8 is no longer significant at the 1% level. Table 9 shows that the observed increases in adjusted R̄2 when adding H8 to the regression are not statistically significant at the five-percent level, with the exception of the two-year bond maturity, where the observed value lies slightly outside the 95% bootstrap confidence interval. We report bootstrap estimates of the true size of conventional HAC tests and of our bootstrap test of the significance of the macro return-forecasting factor, for a nominal size of five percent, in the bottom two rows of the top panel of Table 8. The size distortions for conventional t-tests are very substantial: a test with nominal size of five percent based on asymptotic HAC p-values has a true size of 50-55 percent. In contrast, the size of our bootstrap test is estimated to be very close to the nominal size. We also examined the same regressions over the 1985-2013 sample period, with results shown in the bottom panel of Table 8 and in the right half of Table 9.
In this sample, the return-forecasting factors would again both appear to be highly significant based on HAC p-values, but the size distortions of these tests are again very substantial, and the coefficients on H8 are in fact not statistically significant when using the bootstrap p-values. The observed increases in R̄2 are squarely in line with what we would expect under the spanning hypothesis, as indicated by the confidence intervals in Table 9. This evidence suggests that conventional HAC inference can be particularly problematic when the predictors are return-forecasting factors. One reason for the substantially distorted inference is their high persistence: Table 3 shows that both H8 and CP have first-order autocorrelations near 0.8, which decline only slowly with the lag length. Another reason is that the return-forecasting factors are constructed in a preliminary estimation step, which introduces additional estimation uncertainty not accounted for by conventional inference. In such a setting, other econometric methods, preferably a bootstrap exercise designed to assess the relevant null hypothesis, are needed to carry out inference accurately. For the case at hand, we conclude that a return-forecasting factor based on macro factors exhibits only very tenuous predictive power, much weaker than indicated by LN's original analysis, and that this predictive power disappears completely over a different sample period.

5 Higher-order PCs of yields

Cochrane and Piazzesi (2005) (henceforth CP) documented several striking new facts about excess bond returns. Focusing on returns with a one-year holding period, they showed that the same linear combination of forward rates predicts excess returns on different long-term bonds, that the coefficients of this linear combination have a tent shape, and that predictive regressions using this one variable deliver R2 of up to 37% (and even up to 44% when lags are included).
Importantly for our context, CP found that the first three PCs of yields (level, slope, and curvature) did not fully capture this predictability, but that the fourth and fifth PCs were significant predictors of future bond returns (see CP's Table 4 on p. 147, row 3). In CP's data, the first three PCs explain 99.97% of the variation in the five Fama-Bliss yields (see page 147 of CP), consistent with the long-standing evidence that three factors almost fully capture the shape and evolution of the yield curve, a result going back at least to Litterman and Scheinkman (1991). CP found that the other two PCs, which explain only 0.03% of the variation in yields, are statistically important for predicting excess bond returns. In particular, the fourth PC appeared "very important for explaining expected returns" (p. 147). Here we assess the robustness of this finding by revisiting the null hypothesis that only the first three PCs predict yields and excess returns, and that higher-order PCs contain no additional predictive power. The first three rows of Table 10 replicate the relevant results of CP using their original data. We estimate the predictive regression for the average excess bond return using five PCs as predictors, and carry out HAC inference in this model using Newey-West standard errors, as in CP. The Wald statistic and the R2 of the restricted and unrestricted regressions (R2_1 and R2_2) are identical to those reported by CP. The p-values indicate that PC4 is very strongly statistically significant, so that the spanning hypothesis would be rejected. We then use our bootstrap procedure to obtain robust inference about the relevance of the predictors PC4 and PC5. In contrast to the results for JPS in Section 3 and LN in Section 4, our bootstrap finds that the CP results cannot be accounted for by small-sample size distortions. The main reason is that the t-statistic on PC4 is far too large to be explained by the kinds of factors identified in Section 2.
Likewise, the increase in R2 reported by CP would be quite implausible under the null hypothesis, falling far outside the 95% bootstrap interval. Interestingly, however, the IM tests fail to reject the null hypothesis that β2 = 0. They indicate that the coefficients on PC4 and PC5 are not statistically significant, and find only the level and slope to be robust predictors of excess bond returns. The bootstrap estimates of the size of the IM test, reported in the bottom two rows of the top panel of Table 10, indicate that these tests have close to nominal size, giving us added reason to pay attention to these results. Figure 2 provides some intuition about why the IM tests fail to reject. It shows the coefficients on each predictor across the q = 8 subsamples used in the IM test. The coefficients are standardized by dividing them by the sample standard deviation across the eight estimated coefficients for each predictor. Thus, the IM t-statistics, which are reported in the legend of Figure 2, are equal to the means of the standardized coefficients across subsamples, multiplied by √8. The figure shows that PC1 and PC2 had much more consistent predictive power across subsamples than PC4, whose coefficient switches sign several times. The strong association between PC4 and excess returns is mostly driven by the fifth subsample, which starts in September 1983 and ends in July 1988.[40] This illustrates that the IM test, which is designed to produce inference that is robust to serial correlation, at the same time delivers results that are robust to subsample instability. Only the level and slope have predictive power for excess bond returns in the CP data that is truly robust in both senses of the word. It is worth emphasizing the similarities and differences between the tests of interest to CP and those in our own paper.
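The IM statistic described above (mean of the subsample coefficients, standardized by their cross-subsample standard deviation and scaled by √q) can be sketched directly. The data below are simulated: one regressor with a stable true coefficient, one pure-noise regressor; the function itself is a generic implementation of the subsample t-test, not the authors' code.

```python
import numpy as np

def im_tstat(y, X, j, q=8):
    """Ibragimov-Mueller t-statistic for coefficient j: estimate the
    regression on q equal subsamples and t-test whether the mean of
    the q estimates is zero (compare with Student-t, q-1 df)."""
    betas = []
    for ys, Xs in zip(np.array_split(y, q), np.array_split(X, q, axis=0)):
        b, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
        betas.append(b[j])
    betas = np.array(betas)
    return np.sqrt(q) * betas.mean() / betas.std(ddof=1)

# simulated demo: a stable predictor versus an irrelevant one
rng = np.random.default_rng(6)
T = 400
x_stable = rng.standard_normal(T)
x_noise = rng.standard_normal(T)
X = np.column_stack([np.ones(T), x_stable, x_noise])
y = 1.0 * x_stable + rng.standard_normal(T)
t_stable = im_tstat(y, X, 1)
t_noise = im_tstat(y, X, 2)
```

With q = 8, the five-percent two-sided critical value from the t-distribution with 7 degrees of freedom is about 2.36; a predictor whose coefficient flips sign across subsamples, like PC4 in Figure 2, cannot generate a large statistic no matter how large its full-sample t-ratio.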
Their central claim, with which we concur, is that the factor they identified is a useful and stable predictor of bond returns. However, this factor is a function of all five PCs, and the first three of these account for 76% of the variation of the CP factor. Our claim is that it is the role of PC1-PC3 in the CP factor, and not the addition of PC4 and PC5, that makes the CP pricing factor a useful and stable predictor of yields. Thus our test for structural stability differs from those performed in CP and their accompanying online appendix. CP conducted tests of the usefulness of their return-forecasting factor for predicting returns across different subsamples, a result that we have been able to reproduce and confirm. Our tests, by contrast, look at the stability of the role of each individual PC. We agree with CP that the first three PCs indeed have a stable predictive relation, as we confirmed with the IM tests in Table 10 and Figure 2, and in additional, unreported subsample analysis similar to that in CP's appendix. The predictive power of the fourth and fifth PCs, on the other hand, is much more tenuous, and is insignificant in most of the subsample periods that CP considered. Duffee (2013b, Section 7) also documented that extending CP's sample period to 1952-2010 alters some of their key results, and we have found that over Duffee's sample period the predictive power of higher-order PCs disappears. In the bottom panel of Table 10 we report results for our preferred sample period, from 1985 to 2013. In this case, the coefficients on PC4 and PC5 are not significant for any method of inference, and the increases in R2 due to the inclusion of higher-order PCs are comfortably inside the 95% bootstrap intervals.
[40] Consistent with this finding, an influence analysis of the predictive power of PC4 indicates that the observations with the largest leverage and influence are almost all clustered in the early and mid-1980s.
At the same time, the predictive power of the level and slope of the yield curve is quite strong in this sample as well. Although the standard HAC t-test fails to reject that the coefficient on the level is zero, the same test finds the coefficient on the slope to be significant, and the IM tests imply that both coefficients are significant. Since CP used a sample period that ended more than ten years prior to the time of this writing, we can carry out a true out-of-sample test of our hypothesis of interest. We estimate the same predictive regressions as in CP, for excess returns on two- to five-year bonds as well as for the average excess return across bond maturities. The first two columns of Table 11 report the in-sample R2 for the restricted models (using only PC1 to PC3) and unrestricted models (using all PCs). We then construct expected future excess returns from these models using yield PCs[41] from 2003:1 through 2012:12, and compare these to realized excess returns for holding periods ending in 2004:1 through 2013:12. Table 11 shows the resulting root-mean-squared forecast errors (RMSEs). For all bond maturities, the model that leaves out PC4 and PC5 performs substantially better, with reductions in RMSEs of around 20 percent. The test for equal forecast accuracy of Diebold and Mariano (1995) rejects the null, indicating that the performance gains of the restricted model are statistically significant. Figure 3 shows the forecast performance graphically, plotting the realized and predicted excess bond returns. Clearly, neither model predicted future bond returns very well: both expected mostly negative excess returns over a period when these turned out to be positive. In fact, the unconditional mean, estimated over the CP sample period, was a better predictor of future returns. This is evident both from Figure 3, which shows this mean as a horizontal line, and from the RMSEs in the last column of Table 11.
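The RMSE comparison and the Diebold-Mariano test of equal forecast accuracy can be sketched with a generic implementation based on the squared-error loss differential, with a Newey-West variance for the serial correlation that overlapping twelve-month returns induce. The forecast-error series below are simulated, with the "unrestricted" errors noisier by construction; this is an illustration of the test, not the paper's data.

```python
import numpy as np

def diebold_mariano(e1, e2, lags=12):
    """DM statistic for equal forecast accuracy: t-test that the mean of
    the loss differential d_t = e1_t^2 - e2_t^2 is zero, using a
    Newey-West long-run variance (negative values favor model 1)."""
    d = e1 ** 2 - e2 ** 2
    T = len(d)
    dbar = d.mean()
    u = d - dbar
    s = u @ u / T
    for j in range(1, lags + 1):
        w = 1 - j / (lags + 1)
        s += 2 * w * (u[j:] @ u[:-j]) / T
    return dbar / np.sqrt(s / T)

# simulated forecast errors: the second model's errors are noisier
rng = np.random.default_rng(7)
e_restricted = rng.standard_normal(120)
e_unrestricted = e_restricted + 0.8 * rng.standard_normal(120)
rmse_r = np.sqrt(np.mean(e_restricted ** 2))
rmse_u = np.sqrt(np.mean(e_unrestricted ** 2))
dm = diebold_mariano(e_restricted, e_unrestricted)
```

A strongly negative statistic, compared with standard normal critical values, indicates that the first (restricted) model forecasts significantly better, which is the pattern Table 11 documents.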
Nevertheless, the unrestricted model implied expected excess returns that were more volatile and significantly farther off than those of the restricted model. Restricting the predictive model to use only the level, slope, and curvature leads to more stable and more accurate return predictions. We conclude from both our in-sample and out-of-sample results that the evidence for the predictive power of higher-order factors is tenuous and sample-dependent. To estimate bond risk premia in a robust way, we recommend using only those predictors that consistently show a strong association with excess bond returns, namely the level and the slope of the yield curve.
[41] PCs are calculated throughout using the loadings estimated over the original CP sample period.

6 Bond supply

In addition to macro-finance linkages, a separate literature studies the effects of the supply of bonds on prices and yields. The theoretical literature on the so-called portfolio balance approach to interest rate determination includes classic contributions going back to Tobin (1969) and Modigliani and Sutch (1966), as well as more recent work by Vayanos and Vila (2009) and King (2013). A number of empirical studies document the relation between bond supply and interest rates, during both normal times and over the recent period of near-zero interest rates and central bank asset purchases, including Hamilton and Wu (2012), D'Amico and King (2013), and Greenwood and Vayanos (2014). Both theoretical and empirical work has convincingly demonstrated that bond supply is related to bond yields and returns. Our question here, however, is whether measures of Treasury bond supply contain information that is not already reflected in the yield curve and that is useful for predicting future bond yields and returns. Is there evidence against the spanning hypothesis that involves measures of time variation in bond supply? At first glance, the answer seems to be yes.
Greenwood and Vayanos (2014) (henceforth GV) found that their measure of bond supply, a maturity-weighted debt-to-GDP ratio, predicts yields and bond returns, and that this holds true even when controlling for yield-curve information such as the term spread. Here we investigate whether this result holds up to closer scrutiny. The sample period used in Greenwood and Vayanos (2014) is 1952 to 2008.[42] To estimate the effects of bond supply on interest rates, GV estimate a broad variety of regression specifications with yields and returns of various maturities as dependent variables. Here we are most interested in those regressions that control for the information in the yield curve. In the top panel of Table 12 we reproduce their baseline specification, in which the one-year return on a long-term bond is predicted using the one-year yield and the bond supply measure alone. The second panel includes the spread between the long-term and one-year yield as an additional explanatory variable.[43] Like GV, we use Newey-West standard errors with 36 lags.[44] If we interpret the HAC t-test using the conventional asymptotic critical values, the coefficient on bond supply is significant in the baseline regression in the top panel but is no longer significant at the conventional five-percent level when the yield spread is included, as seen in the second panel. But once again there are some warning flags that raise doubts about the validity of HAC inference.
[42] As in JPS, the authors report a sample end date of 2007 but use yields up to 2008 to calculate one-year bond returns up to the end of 2007.
[43] These estimates are in GV's table 5, rows 1 and 6. Their baseline results are also in their table 2.
[44] There are small differences between our and their t-statistics that we cannot reconcile but which are unimportant for the results.
Table 3 shows that the bond supply variable is extremely persistent (its first-order autocorrelation is 0.998), and the one-year yield and yield spread are of course highly persistent as well. This leads us to suspect that the true p-value likely exceeds the purported 5.8%. The bond return that GV used as the dependent variable in these regressions is for a hypothetical long-term bond with a 20-year maturity. We do not apply our bootstrap procedure here, because this bond return is not constructed from the observed yield curve.[45] Instead we rely on IM tests to carry out robust inference. Neither of the IM tests finds the coefficient on bond supply to be statistically significant. In contrast, the coefficient on the term spread is strongly significant for the HAC test and both IM tests. We consider two additional regression specifications that are relevant in this context. The first controls for information in the yield curve by including, instead of a single term spread, the first three PCs of observed yields.[46] It also subtracts the one-year yield from the bond return in order to obtain an excess return. Both of these changes make this specification more closely comparable to those in the literature. The results are reported in the third panel of Table 12. Again, the coefficient on bond supply is only marginally significant for the HAC t-test, and insignificant for the IM tests. In contrast, the coefficients on both PC1 and PC2 are strongly significant for the IM tests. Finally, we consider the most common specification, in which yt+h is the one-year excess return averaged across two- through five-year maturities. The last panel of Table 12 shows that in this case the coefficient on bond supply is insignificant. Table 3 indicates that for this predictive regression both persistence and lack of strict exogeneity are warning flags, so we also apply our bootstrap procedure.
We find that there is a significant size distortion for this hypothesis test, and the bootstrap p-value is substantially higher than the conventional p-value. There is robust evidence that PC1 and PC2 have predictive power for bond returns, as judged by the IM test, whereas this test indicates that bond supply is not a robust predictor. Overall, the results in GV do not constitute evidence against the spanning hypothesis. While bond supply exhibits a strong empirical link with interest rates, its predictive power for future yields and returns appears to be fully captured by the current yield curve.
[45] GV obtained this series from Ibbotson Associates.
[46] These PCs are calculated from the observed Fama-Bliss yields with one- through five-year maturities.

7 Output gap

Another widely cited study that appears to provide evidence of predictive power of macro variables for asset prices is Cooper and Priestley (2008) (henceforth CPR). This paper focuses on one particular macro variable as a predictor of stock and bond returns, namely the output gap, a key business-cycle indicator. The authors concluded that "the output gap can predict next year's excess returns on U.S. government bonds" (p. 2803). Furthermore, they claimed that some of this predictive power is independent of the information in the yield curve, thereby implicitly rejecting the spanning hypothesis (p. 2828). We investigate the predictive regressions for excess bond returns yt+h using the output gap at date t−1 (gapt−1), measured as the deviation of the Fed's Industrial Production series from a quadratic time trend.[47] CPR lagged their measure by one month to account for the publication lag of the Industrial Production data. Table 13 shows our results for predictions of the excess return on the five-year bond; the results for other maturities closely parallel these.
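The detrending step behind the gap measure (residuals from regressing the production series on a constant, t, and t²) can be sketched as follows. The series below is simulated, with an invented trend and a persistent cyclical component standing in for log industrial production; the series length and parameters are illustrative, not CPR's data.

```python
import numpy as np

def quadratic_trend_gap(ip):
    """Output gap in the spirit of CPR's measure: the residual from
    regressing (log) industrial production on a constant, t, and t^2."""
    t = np.arange(len(ip), dtype=float)
    X = np.column_stack([np.ones_like(t), t, t ** 2])
    b, *_ = np.linalg.lstsq(X, ip, rcond=None)
    return ip - X @ b

# simulated stand-in for log industrial production: trend plus AR(1) cycle
rng = np.random.default_rng(4)
T = 528                                   # illustrative monthly sample length
trend = 0.003 * np.arange(T) + 1e-6 * np.arange(T) ** 2
cycle = np.zeros(T)
for t in range(1, T):
    cycle[t] = 0.97 * cycle[t - 1] + 0.01 * rng.standard_normal()
log_ip = 4.0 + trend + cycle
gap = quadratic_trend_gap(log_ip)
```

Because the cyclical component is itself highly persistent, the resulting gap series inherits a first-order autocorrelation close to one, which is exactly the warning flag for the predictive regressions discussed below.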
The top two panels correspond to the regression specifications that CPR estimated.[48] In the first specification, the only predictor is gapt−1. The second specification also includes CP̃t, which is the Cochrane-Piazzesi factor CPt after it is orthogonalized with respect to gapt.[49] We obtain coefficients and R̄2 that are close to those published in CPR. We calculate both OLS and HAC t-statistics, where in the latter case we use Newey-West standard errors with 22 lags, as described by CPR. Our OLS t-statistics are very close to the published numbers, and according to these the coefficient on gapt−1 is highly significant. However, the HAC t-statistics are only about a third of the OLS t-statistics, and indicate that the coefficient on gap is far from significant, with p-values above 20%.[50] Importantly, neither of the specifications in CPR can be used to test the spanning hypothesis, because the CP factor is first orthogonalized with respect to the output gap. This defeats the purpose of controlling for yield-curve information, since any predictive power that is shared by the CP factor and gap will be attributed exclusively to the latter.[51] One way to test the spanning hypothesis is to include CP instead of CP̃, for which we report the results in the third panel of Table 13. In this case, the coefficient on gap switches to a positive sign, and its Newey-West t-statistic remains insignificant. In contrast, both CP̃ and CP are strongly significant in these regressions.
[47] We thank Richard Priestley for sending us this real-time measure of the output gap.
[48] The relevant results in CPR are in the top panel of their table 9.
[49] Note that the predictors CP̃t and gapt−1 are therefore not completely orthogonal.
Note that the predictors CP 50 This indicates that CPR may have mistakenly reported the OLS instead of the Newey-West t-statistics 51 ˜ cannot justify the conclusion In particular, finding a significant coefficient on gap in a regression with CP that “gap is capturing risk that is independent of the financial market-based variable CP” (p. 2828). 48 34 Our preferred specification includes the first three PCs of the yield curve—see the last panel of Table 13. Importantly, the predictor gap is highly persistent, with a first-order autocorrelation coefficient of 0.975, as shown in Table 3, and the level PC is not strictly exogenous, so we need to worry about conventional t-tests to be substantially oversized. Hence we also include results for robust inference using the bootstrap and IM tests. The gap variable has a positive coefficient with a HAC p-value of 19%, which rises to 36% when using our bootstrap procedure. The conventional HAC t-test is substantially oversized, as evident by the bootstrap critical value that substantially exceeds the conventional critical value. The IM tests do not reject the null. Overall, there is no evidence that the output gap predicts bond returns. The level and in particular the slope of the yield curve, in contrast, are very strongly associated with future excess bond returns, in line with our finding throughout this paper. 8 Conclusion The methods developed in our paper confirm a well established finding in the earlier literature– the current level and slope of the yield curve are robust predictors of future bond returns. That means that in order to test whether any other variables may also help predict bond returns, the regression needs to include the current level and slope, which are highly persistent lagged dependent variables. 
If other proposed predictors are also highly persistent, conventional tests of their statistical significance can suffer substantial size distortions, and the $R^2$ of the regression can increase dramatically when the variables are added to the regression even if they have no true explanatory power. We proposed two strategies for dealing with this problem: the first is a simple bootstrap based on PCs, and the second is a robust t-test based on subsample estimates, proposed by Ibragimov and Müller (2010). We used these methods to revisit five different widely cited studies, and found in each case that the evidence that variables other than the current level, slope, and curvature predict excess bond returns is substantially less convincing than the original research would have led us to believe.

We emphasize that these results do not mean that fundamentals such as inflation, output, and bond supplies do not matter for interest rates. Instead, our conclusion is that any effects of these variables can be summarized in terms of the level, slope, and curvature. Once these three factors are included in predictive regressions, no other variables appear to have robust forecasting power for future yields or returns. Our results cast doubt on the claims for the existence of unspanned macro risks and support the view that it is not necessary to look beyond the information in the yield curve to estimate risk premia in bond markets.

References

Adrian, Tobias, Richard K. Crump, and Emanuel Moench (2013) "Pricing the Term Structure with Linear Regressions," Journal of Financial Economics, Vol. 110, pp. 110–138.
Andrews, Donald W. K. (1991) "Heteroskedasticity and Autocorrelation Consistent Covariance Matrix Estimation," Econometrica, Vol. 59, pp. 817–858.
Bansal, Ravi and Ivan Shaliastovich (2013) "A Long-Run Risks Explanation of Predictability Puzzles in Bond and Currency Markets," Review of Financial Studies, Vol. 26, pp. 1–33.
Basawa, Ishwar V., Asok K. Mallik, William P. McCormick, Jaxk H. Reeves, and Robert L. Taylor (1991) "Bootstrapping unstable first-order autoregressive processes," Annals of Statistics, pp. 1098–1101.
Bauer, Michael D. and Glenn D. Rudebusch (2015) "Resolving the Spanning Puzzle in Macro-Finance Term Structure Models," Working Paper 2015-01, Federal Reserve Bank of San Francisco.
Bekaert, G., R. J. Hodrick, and D. A. Marshall (1997) "On biases in tests of the expectations hypothesis of the term structure of interest rates," Journal of Financial Economics, Vol. 44, pp. 309–348.
Berkowitz, Jeremy and Lutz Kilian (2000) "Recent developments in bootstrapping time series," Econometric Reviews, Vol. 19, pp. 1–48.
Campbell, John Y. and Robert J. Shiller (1991) "Yield Spreads and Interest Rate Movements: A Bird's Eye View," Review of Economic Studies, Vol. 58, pp. 495–514.
Campbell, John Y. and Motohiro Yogo (2006) "Efficient tests of stock return predictability," Journal of Financial Economics, Vol. 81, pp. 27–60.
Carrodus, Mark L. and David E. A. Giles (1992) "The exact distribution of $R^2$ when the regression disturbances are autocorrelated," Economics Letters, Vol. 38, pp. 375–380.
Cavanagh, Christopher L., Graham Elliott, and James H. Stock (1995) "Inference in Models with Nearly Integrated Regressors," Econometric Theory, Vol. 11, pp. 1131–1147.
Chan, Ngai Hang (1988) "The parameter inference for nearly nonstationary time series," Journal of the American Statistical Association, Vol. 83, pp. 857–862.
Chernov, Mikhail and Philippe Mueller (2012) "The Term Structure of Inflation Expectations," Journal of Financial Economics, Vol. 106, pp. 367–394.
Cochrane, John H. and Monika Piazzesi (2005) "Bond Risk Premia," American Economic Review, Vol. 95, pp. 138–160.
Cooper, Ilan and Richard Priestley (2008) "Time-Varying Risk Premiums and the Output Gap," Review of Financial Studies, Vol. 22, pp. 2801–2833.
Coroneo, Laura, Domenico Giannone, and Michèle Modugno (2015) "Unspanned Macroeconomic Factors in the Yield Curve," Journal of Business and Economic Statistics, forthcoming.
D'Amico, Stefania and Thomas B. King (2013) "Flow and stock effects of large-scale treasury purchases: Evidence on the importance of local supply," Journal of Financial Economics, Vol. 108, pp. 425–448.
Deng, Ai (2013) "Understanding Spurious Regression in Financial Economics," Journal of Financial Econometrics, pp. 1–29.
Diebold, Francis X. and Robert S. Mariano (1995) "Comparing Predictive Accuracy," Journal of Business & Economic Statistics, Vol. 13, pp. 253–263.
Duffee, Gregory R. (2011) "Forecasting with the Term Structure: the Role of No-Arbitrage," Working Paper January, Johns Hopkins University.
Duffee, Gregory R. (2013a) "Bond Pricing and the Macroeconomy," in George M. Constantinides, Milton Harris, and Rene M. Stulz eds. Handbook of the Economics of Finance, Vol. 2, Part B: Elsevier, pp. 907–967.
Duffee, Gregory R. (2013b) "Forecasting Interest Rates," in Graham Elliott and Allan Timmermann eds. Handbook of Economic Forecasting, Vol. 2, Part A: Elsevier, pp. 385–426.
Engle, Robert (2002) "Dynamic conditional correlation: A simple class of multivariate generalized autoregressive conditional heteroskedasticity models," Journal of Business & Economic Statistics, Vol. 20, pp. 339–350.
Fama, Eugene F. and Robert R. Bliss (1987) "The Information in Long-Maturity Forward Rates," American Economic Review, Vol. 77, pp. 680–692.
Ferson, Wayne E., Sergei Sarkissian, and Timothy T. Simin (2003) "Spurious Regressions in Financial Economics?" Journal of Finance, Vol. 58, pp. 1393–1414.
Greenwood, Robin and Dimitri Vayanos (2014) "Bond Supply and Excess Bond Returns," Review of Financial Studies, Vol. 27, pp. 663–713.
Gürkaynak, Refet S. and Jonathan H. Wright (2012) "Macroeconomics and the Term Structure," Journal of Economic Literature, Vol. 50, pp. 331–367.
Hall, Peter and Susan R.
Wilson (1991) "Two Guidelines for Bootstrap Hypothesis Testing," Biometrics, Vol. 47, pp. 757–762.
Hamilton, James D. (1994) Time Series Analysis: Princeton University Press.
Hamilton, James D. and Jing Cynthia Wu (2012) "Identification and estimation of Gaussian affine term structure models," Journal of Econometrics, Vol. 168, pp. 315–331.
Hamilton, James D. and Jing Cynthia Wu (2014) "Testable Implications of Affine Term Structure Models," Journal of Econometrics, Vol. 178, pp. 231–242.
Hansen, Bruce E. (1999) "The grid bootstrap and the autoregressive model," Review of Economics and Statistics, Vol. 81, pp. 594–607.
Horowitz, Joel L. (2001) "The Bootstrap," in J. J. Heckman and E. E. Leamer eds. Handbook of Econometrics, Vol. 5: Elsevier, Chap. 52, pp. 3159–3228.
Ibragimov, Rustam and Ulrich K. Müller (2010) "t-Statistic Based Correlation and Heterogeneity Robust Inference," Journal of Business and Economic Statistics, Vol. 28, pp. 453–468.
Joslin, Scott, Marcel Priebsch, and Kenneth J. Singleton (2014) "Risk Premiums in Dynamic Term Structure Models with Unspanned Macro Risks," Journal of Finance, Vol. 69, pp. 1197–1233.
Kendall, M. G. (1954) "A note on bias in the estimation of autocorrelation," Biometrika, Vol. 41, pp. 403–404.
Kilian, Lutz (1998) "Small-sample confidence intervals for impulse response functions," Review of Economics and Statistics, Vol. 80, pp. 218–230.
King, Thomas B. (2013) "A Portfolio-Balance Approach to the Nominal Term Structure," Working Paper 2013-18, Federal Reserve Bank of Chicago.
Koerts, Johannes and Adriaan Pieter Johannes Abrahamse (1969) On the Theory and Application of the General Linear Model: Rotterdam University Press.
Lewellen, Jonathan, Stefan Nagel, and Jay Shanken (2010) "A skeptical appraisal of asset pricing tests," Journal of Financial Economics, Vol. 96, pp. 175–194.
Litterman, Robert and J. Scheinkman (1991) "Common Factors Affecting Bond Returns," Journal of Fixed Income, Vol. 1, pp. 54–61.
Ludvigson, Sydney C.
and Serena Ng (2009) "Macro Factors in Bond Risk Premia," Review of Financial Studies, Vol. 22, pp. 5027–5067.
Ludvigson, Sydney C. and Serena Ng (2010) "A Factor Analysis of Bond Risk Premia," Handbook of Empirical Economics and Finance, p. 313.
Mankiw, N. Gregory and Matthew D. Shapiro (1986) "Do we reject too often? Small sample properties of tests of rational expectations models," Economics Letters, Vol. 20, pp. 139–145.
McCracken, Michael W. and Serena Ng (2014) "FRED-MD: A Monthly Database for Macroeconomic Research," working paper, Federal Reserve Bank of St. Louis.
Modigliani, Franco and Richard Sutch (1966) "Innovations in interest rate policy," American Economic Review, pp. 178–197.
Müller, Ulrich K. (2014) "HAC Corrections for Strongly Autocorrelated Time Series," Journal of Business and Economic Statistics, Vol. 32.
Nabeya, Seiji and Bent E. Sørensen (1994) "Asymptotic distributions of the least-squares estimators and test statistics in the near unit root model with non-zero initial value and local drift and trend," Econometric Theory, Vol. 10, pp. 937–966.
Newey, Whitney K. and Kenneth D. West (1987) "A Simple, Positive Semi-definite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix," Econometrica, Vol. 55, pp. 703–708.
Phillips, Peter C. B. (1988) "Regression theory for near-integrated time series," Econometrica, pp. 1021–1043.
Piazzesi, Monika and Martin Schneider (2007) "Equilibrium Yield Curves," in NBER Macroeconomics Annual 2006, Volume 21: MIT Press, pp. 389–472.
Pope, Alun L. (1990) "Biases of Estimators in Multivariate Non-Gaussian Autoregressions," Journal of Time Series Analysis, Vol. 11, pp. 249–258.
Priebsch, Marcel (2014) "(Un)Conventional Monetary Policy and the Yield Curve," working paper, Federal Reserve Board, Washington, D.C.
Rudebusch, Glenn D. and Tao Wu (2008) "A Macro-Finance Model of the Term Structure, Monetary Policy, and the Economy," Economic Journal, Vol. 118, pp.
906–926.
Stambaugh, Robert F. (1999) "Predictive regressions," Journal of Financial Economics, Vol. 54, pp. 375–421.
Stock, James H. (1991) "Confidence intervals for the largest autoregressive root in US macroeconomic time series," Journal of Monetary Economics, Vol. 28, pp. 435–459.
Stock, James H. (1994) "Unit roots, structural breaks and trends," in Robert F. Engle and Daniel L. McFadden eds. Handbook of Econometrics, Vol. 4: Elsevier, Chap. 46, pp. 2739–2841.
Swanson, Eric T. (2015) "A macroeconomic model of equities and real, nominal, and defaultable debt," unpublished manuscript, University of California, Irvine.
Tobin, James (1969) "A general equilibrium approach to monetary theory," Journal of Money, Credit and Banking, Vol. 1, pp. 15–29.
Vayanos, Dimitri and Jean-Luc Vila (2009) "A Preferred-Habitat Model of the Term Structure of Interest Rates," NBER Working Paper 15487, National Bureau of Economic Research.
Wachter, Jessica A. (2006) "A Consumption-Based Model of the Term Structure of Interest Rates," Journal of Financial Economics, Vol. 79, pp. 365–399.
Wright, Jonathan H. (2011) "Term Premia and Inflation Uncertainty: Empirical Evidence from an International Panel Dataset," American Economic Review, Vol. 101, pp. 1514–1534.

Appendix

A First-order asymptotic results

Here we provide details of the claims made in Section 2.1. Let $b = (b_1', b_2')'$ denote the OLS coefficients when the regression includes both $x_{1t}$ and $x_{2t}$, and let $b_1^*$ denote the coefficients from an OLS regression that includes only $x_{1t}$. The SSR from the latter regression can be written

$$SSR_1 = \sum (y_{t+h} - x_{1t}' b_1^*)^2 = \sum (y_{t+h} - x_t' b + x_t' b - x_{1t}' b_1^*)^2 = \sum (y_{t+h} - x_t' b)^2 + \sum (x_t' b - x_{1t}' b_1^*)^2,$$

where all summations are over $t = 1, \ldots, T$ and the last equality follows from the orthogonality property of OLS. Thus the difference in SSR between the two regressions is

$$SSR_1 - SSR_2 = \sum (x_t' b - x_{1t}' b_1^*)^2. \qquad (25)$$
It is also not hard to show that the fitted values for the full regression could be calculated as

$$x_t' b = x_{1t}' b_1^* + \tilde{x}_{2t}' b_2, \qquad (26)$$

where $\tilde{x}_{2t}$ denotes the residuals from regressions of the elements of $x_{2t}$ on $x_{1t}$, and $b_2$ can be obtained from an OLS regression of $y_{t+h} - x_{1t}' b_1^*$ on $\tilde{x}_{2t}$.52 Thus from (25) and (26),

$$SSR_1 - SSR_2 = \sum (\tilde{x}_{2t}' b_2)^2.$$

If the true value of $\beta_2$ is zero, then by plugging (1) into the definition of $b_2$ and using the fact that $\sum \tilde{x}_{2t} x_{1t}' \beta_1 = 0$ (which follows from the orthogonality of $\tilde{x}_{2t}$ with $x_{1t}$) we see that

$$b_2 = \Big( \sum \tilde{x}_{2t} \tilde{x}_{2t}' \Big)^{-1} \Big( \sum \tilde{x}_{2t} u_{t+h} \Big) \qquad (27)$$

$$SSR_1 - SSR_2 = b_2' \Big( \sum \tilde{x}_{2t} \tilde{x}_{2t}' \Big) b_2 = \Big( T^{-1/2} \sum u_{t+h} \tilde{x}_{2t}' \Big) \Big( T^{-1} \sum \tilde{x}_{2t} \tilde{x}_{2t}' \Big)^{-1} \Big( T^{-1/2} \sum \tilde{x}_{2t} u_{t+h} \Big). \qquad (28)$$

52 That is, $b_2 = (\sum \tilde{x}_{2t} \tilde{x}_{2t}')^{-1} \sum \tilde{x}_{2t} (y_{t+h} - x_{1t}' b_1^*)$ for $\tilde{x}_{2t}$ defined in (10) and (11). The easiest way to confirm the claim is to show that the residuals implied by (26) satisfy the orthogonality conditions required of the original full regression, namely, that they are orthogonal to $x_{1t}$ and $x_{2t}$. That the residual $y_{t+h} - x_{1t}' b_1^* - \tilde{x}_{2t}' b_2$ is orthogonal to $x_{1t}$ follows from the fact that $y_{t+h} - x_{1t}' b_1^*$ is orthogonal to $x_{1t}$ by the definition of $b_1^*$, while $\tilde{x}_{2t}$ is orthogonal to $x_{1t}$ by the construction of $\tilde{x}_{2t}$. Likewise, orthogonality of $y_{t+h} - x_{1t}' b_1^* - \tilde{x}_{2t}' b_2$ to $\tilde{x}_{2t}$ follows directly from the definition of $b_2$. Since $y_{t+h} - x_{1t}' b_1^* - \tilde{x}_{2t}' b_2$ is orthogonal to both $x_{1t}$ and $\tilde{x}_{2t}$, it is also orthogonal to $x_{2t} = \tilde{x}_{2t} + A_T x_{1t}$.

If $x_t$ is stationary and ergodic, then it follows from the Law of Large Numbers that

$$T^{-1} \sum \tilde{x}_{2t} \tilde{x}_{2t}' = T^{-1} \sum x_{2t} x_{2t}' - \Big( T^{-1} \sum x_{2t} x_{1t}' \Big) \Big( T^{-1} \sum x_{1t} x_{1t}' \Big)^{-1} \Big( T^{-1} \sum x_{1t} x_{2t}' \Big) \stackrel{p}{\to} E(x_{2t} x_{2t}') - [E(x_{2t} x_{1t}')] [E(x_{1t} x_{1t}')]^{-1} [E(x_{1t} x_{2t}')],$$

which equals $Q$ in (6) in the special case when $E(x_{2t} x_{1t}') = 0$. For the last term in (28) we see from (10) and (11) that

$$T^{-1/2} \sum \tilde{x}_{2t} u_{t+h} = T^{-1/2} \sum x_{2t} u_{t+h} - A_T \Big( T^{-1/2} \sum x_{1t} u_{t+h} \Big).$$

But if $E(x_{2t} x_{1t}') = 0$, then $\mathrm{plim}(A_T) = 0$, meaning that $T^{-1/2} \sum \tilde{x}_{2t} u_{t+h}$ has the same limiting distribution as $T^{-1/2} \sum x_{2t} u_{t+h}$.
This will be recognized as $\sqrt{T}$ times the sample mean of a random vector with population mean zero, so from the Central Limit Theorem

$$T^{-1/2} \sum \tilde{x}_{2t} u_{t+h} \stackrel{d}{\to} r \sim N(0, S),$$

implying from (28) that

$$SSR_1 - SSR_2 \stackrel{d}{\to} r' Q^{-1} r.$$

Thus from (3),

$$T(R_2^2 - R_1^2) = \frac{SSR_1 - SSR_2}{\sum (y_{t+h} - \bar{y}_h)^2 / T} \stackrel{d}{\to} \frac{r' Q^{-1} r}{\gamma}$$

as claimed in (4). Expression (27) also implies that

$$\sqrt{T} \, b_2 = \Big( T^{-1} \sum \tilde{x}_{2t} \tilde{x}_{2t}' \Big)^{-1} \Big( T^{-1/2} \sum \tilde{x}_{2t} u_{t+h} \Big) \stackrel{d}{\to} Q^{-1} r,$$

from which (13) follows immediately.

B Local-to-unity asymptotic results

Here we provide details behind the claims made in Section 2.2. We know from Phillips (1988, Lemma 3.1(d)) that

$$T^{-2} \sum (x_{1t} - \bar{x}_1)^2 \Rightarrow \sigma_1^2 \left\{ \int_0^1 [J_{c_1}(\lambda)]^2 d\lambda - \left[ \int_0^1 J_{c_1}(\lambda) d\lambda \right]^2 \right\} = \sigma_1^2 \int [J_{c_1}^\mu]^2,$$

where in the sequel our notation suppresses the dependence on $\lambda$ and lets $\int$ denote integration over $\lambda$ from 0 to 1. The analogous operation applied to the numerator of (18) yields

$$A_T = \frac{T^{-2} \sum (x_{1t} - \bar{x}_1)(x_{2t} - \bar{x}_2)}{T^{-2} \sum (x_{1t} - \bar{x}_1)^2} \Rightarrow \frac{\sigma_1 \sigma_2 \int J_{c_1}^\mu J_{c_2}^\mu}{\sigma_1^2 \int [J_{c_1}^\mu]^2}$$

as claimed in (18). We also have from equation (2.17) in Stock (1994) that $T^{-1/2} x_{2,[T\lambda]} \Rightarrow \sigma_2 J_{c_2}(\lambda)$, where $[T\lambda]$ denotes the largest integer less than $T\lambda$. From the Continuous Mapping Theorem,

$$T^{-1/2} \bar{x}_2 = T^{-3/2} \sum x_{2t} = \int_0^1 T^{-1/2} x_{2,[T\lambda]} \, d\lambda \Rightarrow \sigma_2 \int_0^1 J_{c_2}(\lambda) d\lambda.$$

Since $\tilde{x}_{2t} = x_{2t} - \bar{x}_2 - A_T (x_{1t} - \bar{x}_1)$,

$$T^{-1/2} \tilde{x}_{2,[T\lambda]} \Rightarrow \sigma_2 \left\{ J_{c_2}(\lambda) - \int_0^1 J_{c_2}(s) ds - A \left[ J_{c_1}(\lambda) - \int_0^1 J_{c_1}(s) ds \right] \right\} = \sigma_2 \left[ J_{c_2}^\mu(\lambda) - A J_{c_1}^\mu(\lambda) \right] = \sigma_2 K_{c_1,c_2}(\lambda),$$

$$T^{-2} \sum \tilde{x}_{2t}^2 = \int_0^1 \left\{ T^{-1/2} \tilde{x}_{2,[T\lambda]} \right\}^2 d\lambda \Rightarrow \sigma_2^2 \int_0^1 \{ K_{c_1,c_2}(\lambda) \}^2 d\lambda. \qquad (29)$$

Note we can write

$$\begin{pmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \\ u_t \end{pmatrix} = \begin{pmatrix} \sigma_1 & 0 & 0 \\ 0 & \sigma_2 & 0 \\ \delta \sigma_u & 0 & \sqrt{1-\delta^2}\, \sigma_u \end{pmatrix} \begin{pmatrix} v_{1t} \\ v_{2t} \\ v_{0t} \end{pmatrix},$$

where $(v_{1t}, v_{2t}, v_{0t})'$ is a martingale-difference sequence with unit variance matrix. From Lemma 3.1(e) in Phillips (1988) we see

$$T^{-1} \sum \tilde{x}_{2t} u_{t+1} = T^{-1} \sum [x_{2t} - \bar{x}_2 - A_T(x_{1t} - \bar{x}_1)] \left( \delta \sigma_u v_{1,t+1} + \sqrt{1-\delta^2}\, \sigma_u v_{0,t+1} \right) \Rightarrow \delta \sigma_2 \sigma_u \int K_{c_1,c_2} dW_1 + \sqrt{1-\delta^2}\, \sigma_2 \sigma_u \int K_{c_1,c_2} dW_0. \qquad (30)$$
Recalling (27), under the null hypothesis the t-test of $\beta_2 = 0$ can be written as

$$\tau = \frac{\sum \tilde{x}_{2t} u_{t+1}}{\left\{ s^2 \sum \tilde{x}_{2t}^2 \right\}^{1/2}} = \frac{T^{-1} \sum \tilde{x}_{2t} u_{t+1}}{\left\{ s^2 \, T^{-2} \sum \tilde{x}_{2t}^2 \right\}^{1/2}}, \qquad (31)$$

where

$$s^2 \stackrel{p}{\to} \sigma_u^2. \qquad (32)$$

Substituting (32), (30), and (29) into (31) produces

$$\tau \Rightarrow \frac{\sigma_2 \sigma_u \left( \delta \int K_{c_1,c_2} dW_1 + \sqrt{1-\delta^2} \int K_{c_1,c_2} dW_0 \right)}{\left\{ \sigma_u^2 \sigma_2^2 \int (K_{c_1,c_2})^2 \right\}^{1/2}}$$

as claimed in (19). Last we demonstrate that the variance of the variable $Z_1$ defined in (20) exceeds unity. We can write

$$Z_1 = \frac{\int_0^1 J_{c_2}^\mu(\lambda) dW_1(\lambda)}{\left\{ \int_0^1 [K_{c_1,c_2}(\lambda)]^2 d\lambda \right\}^{1/2}} - \frac{A \int_0^1 J_{c_1}^\mu(\lambda) dW_1(\lambda)}{\left\{ \int_0^1 [K_{c_1,c_2}(\lambda)]^2 d\lambda \right\}^{1/2}}. \qquad (33)$$

Consider the denominator in these expressions, and note that

$$\int_0^1 [J_{c_2}^\mu(\lambda)]^2 d\lambda = \int_0^1 [J_{c_2}^\mu(\lambda) - A J_{c_1}^\mu(\lambda) + A J_{c_1}^\mu(\lambda)]^2 d\lambda = \int_0^1 [K_{c_1,c_2}(\lambda)]^2 d\lambda + \int_0^1 [A J_{c_1}^\mu(\lambda)]^2 d\lambda > \int_0^1 [K_{c_1,c_2}(\lambda)]^2 d\lambda,$$

where the cross-product term dropped out in the second equation by the definition of $A$ in (18). This means that the following inequality holds for all realizations:

$$\left| \frac{\int_0^1 J_{c_2}^\mu(\lambda) dW_1(\lambda)}{\left\{ \int_0^1 [K_{c_1,c_2}(\lambda)]^2 d\lambda \right\}^{1/2}} \right| > \left| \frac{\int_0^1 J_{c_2}^\mu(\lambda) dW_1(\lambda)}{\left\{ \int_0^1 [J_{c_2}^\mu(\lambda)]^2 d\lambda \right\}^{1/2}} \right|. \qquad (34)$$

Adapting the argument made in footnote 10, the magnitude inside the absolute-value operator on the right side of (34) can be seen to have a $N(0,1)$ distribution. Inequality (34) thus establishes that the first term in (33) has a variance that is greater than unity. The second term in (33) turns out to be uncorrelated with the first, and hence contributes additional variance to $Z_1$, although we have found that the first term appears to be the most important factor.53 In sum, these arguments show that $\mathrm{Var}(Z_1) > 1$.

53 These claims are based on moments of the respective functionals as estimated from discrete approximations to the Ornstein-Uhlenbeck processes.
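The size distortions derived above are easy to reproduce in a small Monte Carlo experiment. The sketch below is an illustration, not the paper's exact simulation design: it generates a persistent regressor $x_{1t}$ whose innovations coincide with the prediction error ($\delta = 1$, so $x_{1t}$ is not strictly exogenous) plus a persistent but truly irrelevant regressor $x_{2t}$, and records how often a conventional 5% t-test rejects the true null $\beta_2 = 0$.

```python
import numpy as np

rng = np.random.default_rng(1)
T, rho, delta, nrep = 100, 0.99, 1.0, 2000
reject = 0

for _ in range(nrep):
    e1 = rng.standard_normal(T + 1)
    e2 = rng.standard_normal(T + 1)
    e0 = rng.standard_normal(T + 1)
    # Prediction error correlated with the innovation to x1 (delta = 1
    # means x1 is not strictly exogenous); x2 is persistent but irrelevant.
    u = delta * e1 + np.sqrt(1.0 - delta**2) * e0
    x1 = np.zeros(T + 1)
    x2 = np.zeros(T + 1)
    for t in range(1, T + 1):
        x1[t] = rho * x1[t - 1] + e1[t]
        x2[t] = rho * x2[t - 1] + e2[t]
    y = x1[:-1] + u[1:]                      # y_{t+1} = x1_t + u_{t+1}
    X = np.column_stack([np.ones(T), x1[:-1], x2[:-1]])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    s2 = resid @ resid / (T - 3)
    V = s2 * np.linalg.inv(X.T @ X)
    t_stat = b[2] / np.sqrt(V[2, 2])
    reject += abs(t_stat) > 1.96

size = reject / nrep                         # nominal size is 0.05
```

With $\rho = 0.99$, $\delta = 1$, and $T = 100$, the rejection rate in this design comes out far above the nominal 5%, consistent with the double-digit true sizes reported in Table 1.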
Table 1: Simulation study: size distortions of conventional t-test

                       δ = 0                    δ = 0.8                  δ = 1
T                ρ=0.9  ρ=0.99  ρ=1      ρ=0.9  ρ=0.99  ρ=1      ρ=0.9  ρ=0.99  ρ=1
50   simulated    5.1    4.9    5.1       8.1   11.1   11.4      10.2   15.1   15.9
50   asymptotic   4.5    4.4    4.6       8.4   11.0   11.5      10.5   14.9   15.4
100  simulated    4.8    5.1    5.2       7.1   11.4   12.2       8.4   15.2   16.2
100  asymptotic   4.5    4.7    4.8       7.0   11.1   11.9       8.4   15.0   16.0
200  simulated    5.0    5.1    5.0       6.1   11.1   12.4       6.8   14.5   16.5
200  asymptotic   4.9    4.9    4.9       6.2   10.7   12.0       7.2   14.6   16.6
500  simulated    5.0    5.0    5.0       5.4    8.9   12.2       5.7   11.6   17.0
500  asymptotic   5.0    4.8    4.9       5.4    9.2   12.3       5.8   11.6   16.9

True size (in percentage points) of a conventional t-test of H0: $\beta_2 = 0$ with nominal size of 5%, in simulated small samples and according to the local-to-unity asymptotic distribution. $\delta$ determines the degree of endogeneity, i.e., the correlation of $x_{1t}$ with the lagged error term $u_t$. The persistence of the predictors is $\rho_1 = \rho_2 = \rho$. For details on the simulation study refer to the main text.

Table 2: Simulation study: coefficient bias and standard error bias

                        δ = 1, θ = 0      δ = 0.8, θ = 0    δ = 0.8, θ = 0.8
                         β1      β2        β1      β2        β1      β2
True coefficient        0.990   0.000     0.990   0.000     0.990   0.000
Mean estimate           0.921   0.000     0.936   0.000     0.935   0.000
Coefficient bias       -0.069   0.000    -0.054   0.000    -0.055   0.000
True standard error     0.053   0.055     0.049   0.049     0.082   0.083
Mean OLS std. error     0.038   0.038     0.038   0.038     0.064   0.064
Standard error bias    -0.015  -0.017    -0.011  -0.011    -0.018  -0.019
Size of t-test              0.155             0.111             0.112
Size of bootstrap test      0.080             0.072             0.067
Size of IM test             0.047             0.047             0.045

Analysis of bias in estimated coefficients and standard errors for regressions in small samples with $T = 100$ and $\rho_1 = \rho_2 = 0.99$, as well as estimated size of conventional t-test, bootstrap, and IM tests. For details on the simulation study refer to the main text.
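The IM test whose size is reported in Table 2 follows Ibragimov and Müller (2010): estimate the regression separately on q consecutive subsamples and run an ordinary one-sample t-test on the q block estimates. A minimal sketch (function name and block scheme are illustrative):

```python
import numpy as np

def im_tstat(y, X, coef_index, q=8):
    """Illustrative Ibragimov-Mueller subsample t-statistic: estimate the
    regression on q consecutive blocks and t-test whether the mean of the
    q block estimates is zero. The statistic is compared with Student-t
    critical values with q-1 degrees of freedom (about 2.36 for q = 8 at
    the two-sided 5% level)."""
    T = len(y)
    edges = np.linspace(0, T, q + 1).astype(int)
    est = np.empty(q)
    for j in range(q):
        blk = slice(edges[j], edges[j + 1])
        b, *_ = np.linalg.lstsq(X[blk], y[blk], rcond=None)
        est[j] = b[coef_index]
    return np.sqrt(q) * est.mean() / est.std(ddof=1)

# Example: a strong true coefficient yields a large subsample t-statistic
rng = np.random.default_rng(3)
T = 400
x = rng.standard_normal(T)
y = 1.0 * x + 0.05 * rng.standard_normal(T)
X = np.column_stack([np.ones(T), x])
t_im = im_tstat(y, X, coef_index=1, q=8)
```

Because each block estimate is computed on a separate stretch of data, the test is robust to heterogeneity and strong serial correlation, which is why its size in Table 2 stays close to the nominal 5%.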
45 Table 3: Warning flags for predictive regressions in published studies Study JPS LN CP GV CPR Predictor PC1 PC2 PC3 GRO INF PC1 PC2 PC3 F1 F2 F3 F4 F5 F6 F7 F8 CP H8 PC1 PC2 PC3 PC4 PC5 CP PC1 PC2 PC3 supply PC1 PC2 PC3 gap 1 0.974 0.973 0.849 0.910 0.986 0.984 0.944 0.601 0.766 0.748 -0.233 0.455 0.361 0.422 -0.111 0.225 0.773 0.777 0.980 0.940 0.592 0.425 0.227 0.767 0.988 0.942 0.582 0.998 0.986 0.939 0.590 0.975 ACF(l) 6 0.840 0.774 0.380 0.507 0.897 0.904 0.734 0.254 0.381 0.454 0.035 0.207 0.207 0.476 0.134 0.087 0.531 0.627 0.880 0.721 0.237 0.137 0.157 0.522 0.925 0.722 0.233 0.990 0.917 0.712 0.262 0.750 δ 12 0.696 0.467 0.216 0.260 0.815 0.821 0.537 0.113 0.088 0.188 -0.085 0.151 0.171 0.272 0.054 0.093 0.377 0.331 0.767 0.539 0.110 0.062 -0.135 0.361 0.860 0.521 0.094 0.974 0.841 0.528 0.153 0.475 -0.368 -0.048 0.202 -0.122 -0.189 -0.342 0.137 0.091 0.100 0.160 0.044 0.189 0.169 0.058 -0.079 0.048 -0.358 0.157 0.090 -0.020 0.121 -0.312 0.147 0.105 0.035 -0.338 0.179 0.055 -0.193 Measures of persistence and lack of strict exogeneity of the predictors. For the persistence we report autocorrelations of the predictors at lags of one, six, and twelve months. Lack of strict exogeneity is measured by δ, the correlation between the innovations to the predictors, ε1t or ε2t , and the lagged prediction error, ut . The innovations are obtained from estimated VAR(1) models for x1t (the principal components of yields) and x2t (the other predictors). The forecast error ut is calculated from a predictive regression of the average excess bond return across maturities. The predictors are described in the main text. The data and sample are the same as in the published studies. These are JPS (Joslin et al., 2014), LN (Ludvigson and Ng, 2010), CP (Cochrane and Piazzesi, 2005), GV (Greenwood and Vayanos, 2014), and CPR (Cooper and Priestley, 2008). 
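The $\delta$ diagnostic reported in Table 3 (the correlation between predictor innovations and the prediction error) can be computed along the following lines. The data-generating process here is purely illustrative: the predictor innovations are built to be negatively correlated with the return shock, a Stambaugh-type setup, and the AR(1) fit stands in for the VAR(1) used in the table notes.

```python
import numpy as np

rng = np.random.default_rng(2)
T, rho = 300, 0.98

# Persistent predictor x with innovations eps
eps = rng.standard_normal(T)
x = np.zeros(T)
for t in range(1, T):
    x[t] = rho * x[t - 1] + eps[t]

# Returns whose shocks are correlated with eps (illustrative endogeneity)
u_shock = -0.5 * eps[1:] + rng.standard_normal(T - 1)
y = 0.1 * x[:-1] + u_shock                  # y_{t+1}

# Innovations to the predictor from a fitted AR(1)
Xar = np.column_stack([np.ones(T - 1), x[:-1]])
phi, *_ = np.linalg.lstsq(Xar, x[1:], rcond=None)
innov = x[1:] - Xar @ phi

# Prediction error from the predictive regression of y on the predictor
b, *_ = np.linalg.lstsq(Xar, y, rcond=None)
u_hat = y - Xar @ b

# delta: correlation between predictor innovations and the prediction error
delta = np.corrcoef(innov, u_hat)[0, 1]
```

A markedly negative $\delta$, like the values near $-0.35$ for the level PCs in Table 3, is the warning flag: it signals the lack of strict exogeneity that drives the size distortions documented in the simulations.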
46 Table 4: Joslin-Priebsch-Singleton: R2 in excess return regressions Original sample: 1985–2008 R̄12 R̄22 R̄22 − R̄12 Two-year bond Data Simple bootstrap Later sample: 1985–2013 R̄22 R̄22 − R̄12 0.49 0.36 (0.11, 0.63) 0.44 (0.13, 0.75) 0.35 0.12 0.06 0.26 (-0.00, 0.22) (0.05, 0.51) 0.06 0.32 (-0.00, 0.23) (0.07, 0.60) 0.28 0.32 (0.09, 0.56) 0.38 (0.12, 0.64) 0.16 0.06 (-0.00, 0.21) 0.06 (-0.00, 0.21) 0.20 0.37 0.26 0.32 (0.07, 0.48) (0.12, 0.54) BC bootstrap 0.27 0.34 (0.06, 0.50) (0.12, 0.57) Average two- through ten-year bonds Data 0.19 0.39 Simple bootstrap 0.28 0.35 (0.08, 0.50) (0.12, 0.56) BC bootstrap 0.30 0.37 (0.06, 0.55) (0.13, 0.61) 0.17 0.20 0.07 0.24 (-0.00, 0.23) (0.06, 0.46) 0.08 0.26 (-0.00, 0.27) (0.06, 0.49) 0.28 0.30 (0.11, 0.51) 0.33 (0.11, 0.55) 0.08 0.06 (-0.00, 0.21) 0.07 (-0.00, 0.23) 0.20 0.17 0.07 0.24 (-0.00, 0.23) (0.05, 0.46) 0.07 0.27 (-0.00, 0.26) (0.05, 0.50) 0.25 0.30 (0.10, 0.52) 0.33 (0.12, 0.56) 0.08 0.06 (-0.00, 0.21) 0.07 (-0.00, 0.24) BC bootstrap Ten-year bond Data Simple bootstrap 0.14 0.30 (0.06, 0.58) 0.38 (0.07, 0.72) R̄12 Adjusted R̄2 for regressions of annual excess bond returns on three PCs of the yield curve (R̄12 ) and on three yield PCs together with the macro variables GRO and IN F (R̄22 ), as well as the difference in adjusted R̄2 . GRO is the three-month moving average of the Chicago Fed National Activity Index, and IN F is one-year expected inflation measured by Blue Chip inflation forecasts. The data used for the left half of the table is the original data set of Joslin et al. (2014); the data used in the right half is extended to December 2013. The last panel shows results for the average excess bond return for all bond maturities from two to ten years. The first row of each panel reports the values of the statistics in the original data. The next three rows report bootstrap small-sample mean, and the 95%-confidence intervals (in parentheses). 
The bootstrap simulations are obtained under the null hypothesis that the macro variables have no predictive power. The bootstrap procedure for the simple bootstrap and the bias-corrected (BC) bootstrap is described in the main text. 47 Table 5: Joslin-Priebsch-Singleton: inference in excess return regressions P C1 P C2 P C3 GRO IN F Wald Original sample: 1985–2008 Coefficient 1.064 1.988 3.342 -2.174 -6.494 HAC statistic 5.603 4.671 0.865 2.438 4.232 25.476 HAC p-value 0.000 0.000 0.388 0.015 0.000 0.000 Bootstrap 5% c.v. 3.203 3.950 24.410 Bootstrap p-value 0.129 0.038 0.046 BC bootstrap 5% c.v. 3.460 4.286 27.664 BC bootstrap p-value 0.140 0.052 0.061 IM q = 8 0.002 0.040 0.002 0.563 0.940 IM q = 16 0.003 0.002 0.063 0.244 0.500 Estimated size of tests HAC 0.209 0.285 0.382 Simple bootstrap 0.058 0.067 0.069 IM q = 8 0.049 0.054 IM q = 16 0.038 0.033 Later sample: 1985–2013 Coefficient 0.523 1.865 4.330 -0.271 -3.767 HAC statistic 2.524 3.755 1.345 0.323 2.408 5.799 HAC p-value 0.012 0.000 0.180 0.747 0.017 0.055 Bootstrap 5% c.v. 3.332 3.665 22.786 Bootstrap p-value 0.820 0.178 0.376 BC bootstrap 5% c.v. 3.420 3.919 24.471 BC bootstrap p-value 0.838 0.206 0.417 IM q = 8 0.275 0.030 0.003 0.550 0.325 IM q = 16 0.304 0.007 0.139 0.393 0.934 Predictive regressions for annual excess bond returns, averaged over two- through ten-year bond maturities, using yield PCs and macro variables (which are described in the notes to Table 4). The data used for the top panel is the original data set of Joslin et al. (2014); the data used for the bottom panel is extended to December 2013. HAC statistics and p-values are calculated using Newey-West standard errors with 18 lags. The column “Wald” reports results for the χ2 test that GRO and IN F have no predictive power; the other columns report results for individual t-tests. We obtain bootstrap distributions of the test statistics under the null hypothesis that GRO and IN F have no predictive power. 
Critical values (c.v.’s) are the 95th percentile of the bootstrap distribution of the test statistics, and p-values are the frequency of bootstrap replications in which the test statistics are at least as large as in the data. See the text for a description of the experimental design for the simple bootstrap and the bias-corrected (BC) bootstrap. We also report p-values for t-tests using the methodology of Ibragimov and Müller (2010) (IM), splitting the sample into either 8 or 16 blocks. The last four rows in the first panel report bootstrap estimates of the true size of different tests with 5% nominal coverage, calculated as the frequency of bootstrap replications in which the test statistics exceed their critical values, except for the size of bootstrap test which is calculated as described in the main text. p-values below 5% are emphasized with bold face. 48 49 0.068 0.788 2.725 0.682 0.496 0.651 1.652 0.099 2.817 0.224 0.139 0.831 -0.274 0.267 0.789 2.908 0.844 0.511 0.636 0.132 0.055 0.050 0.048 0.131 0.058 0.051 0.051 0.225 0.813 0.146 0.379 0.705 2.580 0.761 0.558 0.317 0.742 1.855 0.064 2.572 0.140 0.098 0.228 -5.014 2.724 0.007 F2 F1 P C3 0.147 0.690 0.491 2.516 0.587 0.537 0.923 0.097 0.053 0.051 0.051 -0.072 0.608 0.543 2.241 0.594 0.579 0.771 F3 -0.488 1.162 0.246 2.667 0.370 0.899 0.187 0.124 0.061 0.049 0.050 -0.528 1.912 0.056 2.513 0.128 0.088 0.327 F4 0.022 0.038 0.969 2.798 0.973 0.767 0.570 0.126 0.055 0.049 0.051 -0.321 1.307 0.192 2.497 0.301 0.703 0.358 F5 0.334 1.866 0.063 2.468 0.136 0.144 0.882 0.134 0.053 0.052 0.045 -0.576 2.220 0.027 2.622 0.092 0.496 0.209 F6 0.035 0.153 0.878 2.365 0.892 0.923 0.703 0.113 0.049 0.050 0.055 -0.401 2.361 0.019 2.446 0.057 0.085 0.027 F7 -0.075 0.423 0.673 2.298 0.718 0.398 0.239 0.086 0.046 0.042 0.046 0.551 3.036 0.003 2.242 0.010 0.324 0.502 F8 13.766 0.088 37.267 0.495 0.335 0.061 42.084 0.000 29.686 0.009 Wald Predictive regressions for annual excess bond returns, averaged over two- through five-year 
bond maturities, using yield PCs and factors from a large data set of macro variables, as in Ludvigson and Ng (2010). The top panel shows the results for the original data set used by Ludvigson and Ng (2010); the bottom panel uses a data sample that starts in 1985 and ends in 2013. The bootstrap is a simple bootstrap without bias correction. For a description of the statistics in each row, see the notes to Table 5. p-values below 5% are emphasized with bold face. P C1 P C2 A. Original sample: 1964–2007 Coefficient 0.136 2.052 HAC statistic 1.552 2.595 HAC p-value 0.121 0.010 Bootstrap 5% c.v. Bootstrap p-value IM q = 8 0.001 0.001 IM q = 16 0.000 0.052 Estimated size of tests HAC Bootstrap IM q = 8 IM q = 16 B. Later sample: 1985–2013 Coefficient 0.157 1.182 HAC statistic 1.506 1.111 HAC p-value 0.133 0.268 Bootstrap 5% c.v. Bootstrap p-value IM q = 8 0.014 0.005 IM q = 16 0.024 0.185 Table 6: Ludvigson-Ng: predicting excess returns using PCs and macro factors Table 7: Ludvigson-Ng: R̄2 for predicting excess returns using PCs and macro factors R̄12 R̄22 Original sample: 1964–2007 Data 0.25 0.35 Bootstrap 0.20 0.24 (0.05, 0.39) (0.08, 0.42) Later sample: 1985–2013 Data 0.14 0.18 Bootstrap 0.26 0.29 (0.05, 0.49) (0.08, 0.51) R̄22 − R̄12 0.10 0.03 (-0.00, 0.11) 0.04 0.03 (-0.01, 0.14) Adjusted R̄2 for regressions of annual excess bond returns, averaged over two- through five-year bonds, on three PCs of the yield curve (R̄12 ) and on three yield PCs together with eight macro factors (R̄22 ), as well as the difference in R̄2 . The top panel shows the results for the original data set used by Ludvigson and Ng (2010); the bottom panel uses a data sample that starts in 1985 and ends in 2013. For each data sample we report the values of the statistics in the data, and the mean and 95%-confidence intervals (in parentheses) of the bootstrap small-sample distributions of these statistics. 
The bootstrap simulations are obtained under the null hypothesis that the macro variables have no predictive power. The bootstrap procedure, which does not include bias correction, is described in the main text. 50 Table 8: Ludvigson-Ng: predicting excess returns using return-forecasting factors Two-year bond Three-year bond Four-year bond CP H8 CP H8 CP H8 Original sample: 1964–2007 Coefficient 0.335 0.331 0.645 0.588 0.955 0.776 HAC t-statistic 4.429 4.331 4.666 4.491 4.765 4.472 HAC p-value 0.000 0.000 0.000 0.000 0.000 0.000 Bootstrap 5% c.v. 3.809 3.799 3.874 Bootstrap p-value 0.022 0.015 0.017 Estimated size of tests HAC 0.514 0.538 0.545 Bootstrap 0.047 0.055 0.057 Later sample: 1985–2013 Coefficient 0.349 0.371 0.661 0.695 1.101 0.895 HAC t-statistic 2.644 3.348 2.527 3.409 3.007 3.340 HAC p-value 0.009 0.001 0.012 0.001 0.003 0.001 Bootstrap 5% c.v. 3.890 4.014 4.026 Bootstrap p-value 0.103 0.116 0.124 Five-year bond CP H8 1.115 4.371 0.000 0.937 4.541 0.000 3.898 0.014 0.539 0.050 1.320 2.946 0.003 1.021 3.270 0.001 3.942 0.128 Predictive regressions for annual excess bond returns, using return-forecasting factors based on yield-curve information (CP ) and macro information (H8), as in Ludvigson and Ng (2010). The first panel shows the results for the original data set used by Ludvigson and Ng (2010); the second panel uses a data sample that starts in 1985 and ends in 2013. HAC t-statistics and p-values are calculated using Newey-West standard errors with 18 lags. We obtain bootstrap distributions of the t-statistics under the null hypothesis that macro factors and hence H8 have no predictive power. We also report bootstrap critical values (c.v.’s) and p-values, as well as estimates of the true size of conventional t-tests and the bootstrap tests with 5% nominal coverage (see notes to Table 5). The bootstrap procedure, which does not include bias correction, is described in the main text. p-values below 5% are emphasized with bold face. 
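The HAC statistics in these tables use Newey-West standard errors (18 lags for the annual-return regressions). A self-contained sketch of the estimator, written here from the standard Bartlett-kernel formula rather than taken from the paper's code:

```python
import numpy as np

def newey_west_tstats(y, X, lags=18):
    """OLS t-statistics with Newey-West (Bartlett-kernel) HAC standard
    errors; a minimal sketch of the estimator behind the HAC statistics."""
    T, k = X.shape
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    Xe = X * e[:, None]                       # moment contributions x_t * e_t
    S = Xe.T @ Xe / T
    for l in range(1, lags + 1):
        w = 1.0 - l / (lags + 1.0)            # Bartlett weights
        G = Xe[l:].T @ Xe[:-l] / T
        S += w * (G + G.T)
    Q_inv = np.linalg.inv(X.T @ X / T)
    V = Q_inv @ S @ Q_inv / T                 # HAC covariance of b
    return b / np.sqrt(np.diag(V))

# Example on simulated data with serially correlated (overlapping) errors
rng = np.random.default_rng(4)
T = 300
x = rng.standard_normal(T)
err = np.convolve(rng.standard_normal(T + 11), np.ones(12) / 12.0, "valid")
y = 2.0 * x + err
t_nw = newey_west_tstats(y, np.column_stack([np.ones(T), x]), lags=18)
```

The lag length matters for overlapping annual returns in monthly data: with an 11-month moving-average error structure, too few lags understates the standard errors, which is one reason the tables report HAC rather than OLS statistics.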
Table 9: Ludvigson-Ng: R̄² for predicting excess returns using return-forecasting factors

                    Original sample: 1964–2007                Later sample: 1985–2013
                  R̄1²        R̄2²      R̄2² − R̄1²        R̄1²        R̄2²      R̄2² − R̄1²
Two-year bond
  Data            0.31        0.42        0.11             0.15        0.23        0.07
  Bootstrap       0.21        0.24        0.03             0.25        0.28        0.03
             (0.06, 0.39) (0.09, 0.41) (-0.00, 0.10)   (0.04, 0.50) (0.08, 0.52) (-0.00, 0.12)
Three-year bond
  Data            0.33        0.43        0.10             0.15        0.22        0.07
  Bootstrap       0.20        0.23        0.03             0.25        0.29        0.04
             (0.05, 0.38) (0.09, 0.40) (-0.00, 0.10)   (0.05, 0.48) (0.09, 0.51) (-0.00, 0.13)
Four-year bond
  Data            0.36        0.45        0.09             0.19        0.24        0.05
  Bootstrap       0.21        0.25        0.03             0.27        0.30        0.03
             (0.06, 0.40) (0.10, 0.42) (-0.00, 0.11)   (0.07, 0.50) (0.11, 0.52) (-0.00, 0.12)
Five-year bond
  Data            0.33        0.42        0.09             0.17        0.21        0.05
  Bootstrap       0.21        0.24        0.03             0.25        0.29        0.03
             (0.06, 0.39) (0.10, 0.41) (-0.00, 0.11)   (0.06, 0.48) (0.10, 0.50) (-0.00, 0.13)

Adjusted R̄² for regressions of annual excess bond returns on return-forecasting factors based on yield-curve information (CP) and macro information (H8), as in Ludvigson and Ng (2010). R̄1² is for regressions with only CP, while R̄2² is for regressions with both CP and H8. The table shows results both for the original data set used by Ludvigson and Ng (2010) and for a data sample that starts in 1985 and ends in 2013. For each data sample and bond maturity, we report the values of the statistics in the data, and for the bootstrap small-sample distributions of these statistics the mean and 95% confidence intervals (in parentheses). The bootstrap simulations are obtained under the null hypothesis that the macro variables have no predictive power. The bootstrap procedure, which does not include bias correction, is described in the main text.
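The statistics in Table 9 are adjusted R² values for nested predictor sets and their difference. As a generic sketch of that computation (illustrative names, not the authors' code), the R̄² gain from adding regressors to an OLS regression can be computed as:

```python
import numpy as np

def adjusted_r2(y, X):
    """R-bar-squared for an OLS regression of y on X (X includes a constant)."""
    n, k = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    u = y - X @ beta
    r2 = 1.0 - (u @ u) / np.sum((y - y.mean()) ** 2)
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k)   # degrees-of-freedom penalty

def r2_gain(y, X_small, X_big):
    """Difference in adjusted R-squared between nested predictor sets."""
    return adjusted_r2(y, X_big) - adjusted_r2(y, X_small)
```

Because the adjustment penalizes extra regressors, the gain can be negative when the added variables carry no predictive information, which is what the bootstrap distributions in Table 9 quantify.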
Table 10: Cochrane-Piazzesi: in-sample evidence

                       PC1      PC2      PC3       PC4      PC5     Wald      R1²          R2²       R2² − R1²
Original sample: 1964–2003
  Data                0.127    2.740   -6.307   -16.128   -2.038   31.919     0.26         0.35         0.09
  HAC statistic       1.724    5.205    2.950     5.626    0.748
  HAC p-value         0.085    0.000    0.003     0.000    0.455    0.000
  Bootstrap 5% c.v./
    mean R̄²                                      2.253    2.236    8.464     0.21         0.21         0.01
  Bootstrap p-value/
    95% CIs                                       0.000    0.507    0.000  (0.05, 0.40) (0.06, 0.41) (0.00, 0.03)
  IM q = 8            0.002    0.030    0.873     0.237    0.233
  IM q = 16           0.000    0.004    0.148     0.953    0.283
  Estimated size of tests
    HAC                                           0.085    0.083    0.114
    Bootstrap                                     0.046    0.053    0.055
    IM q = 8                                      0.040    0.050
    IM q = 16                                     0.043    0.049
Later sample: 1985–2013
  Data                0.104    1.586    3.962    -9.196   -9.983    4.174     0.14         0.17         0.03
  HAC statistic       1.619    2.215    1.073     1.275    1.351
  HAC p-value         0.106    0.027    0.284     0.203    0.178    0.124
  Bootstrap 5% c.v./
    mean R̄²                                      2.463    2.433    9.878     0.26         0.28         0.02
  Bootstrap p-value/
    95% CIs                                       0.301    0.273    0.272  (0.06, 0.49) (0.08, 0.50) (0.00, 0.05)
  IM q = 8            0.011    0.079    0.044     0.803    0.435
  IM q = 16           0.001    0.031    0.215     0.190    0.949

Predicting annual excess bond returns, averaged over two- through five-year bonds, using principal components (PCs) of yields. The null hypothesis is that the first three PCs contain all the relevant predictive information. The data used in the top panel is the same as in Cochrane and Piazzesi (2005)—see in particular their Table 4. HAC statistics and p-values are calculated using Newey-West standard errors with 18 lags. We also report the unadjusted R² for the regression using only three PCs (R1²) and for the regression including all five PCs (R2²), as well as the difference between the two. Bootstrap distributions are obtained under the null hypothesis, using the bootstrap procedure described in the main text (without bias correction). For the R²-statistics, we report means and 95% confidence intervals (in parentheses).
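The regressors in Table 10 are principal components of the yield curve. A minimal sketch of extracting PCs from a panel of yields via the eigendecomposition of their sample covariance matrix (illustrative, not the authors' code):

```python
import numpy as np

def yield_pcs(Y, k=5):
    """First k principal components of a T x m panel of yields.

    Returns the PC time series (T x k) and the loading matrix (m x k),
    with components ordered by decreasing explained variance.
    """
    Yc = Y - Y.mean(axis=0)                    # demean each maturity
    cov = Yc.T @ Yc / (len(Y) - 1)             # sample covariance of yields
    vals, vecs = np.linalg.eigh(cov)           # eigh: ascending eigenvalues
    order = np.argsort(vals)[::-1]             # largest eigenvalues first
    W = vecs[:, order[:k]]
    return Yc @ W, W
```

By construction the resulting PCs are mutually uncorrelated in sample, and the first three typically correspond to the familiar level, slope, and curvature factors.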
For the HAC test statistics, bootstrap critical values (c.v.'s) are the 95th percentile of the bootstrap distribution of the test statistics, and p-values are the frequency of bootstrap replications in which the test statistics are at least as large as the statistic in the data. We also report p-values for t-tests using the methodology of Ibragimov and Müller (2010) (IM), splitting the sample into either 8 or 16 blocks. The last four rows in the first panel report bootstrap estimates of the true size of different tests with 5% nominal coverage, calculated as the frequency of bootstrap replications in which the test statistics exceed their critical values, except for the size of the bootstrap test, which is calculated as described in the main text. p-values below 5% are emphasized with bold face.

Table 11: Cochrane-Piazzesi: out-of-sample forecast accuracy

    n       R2²      R1²     RMSE2    RMSE1      DM    p-value   RMSEmean
    2      0.321    0.260    2.120    1.769    2.149    0.034      1.067
    3      0.341    0.242    4.102    3.232    2.167    0.032      1.946
    4      0.371    0.266    5.848    4.684    2.091    0.039      2.989
    5      0.346    0.270    7.374    6.075    2.121    0.036      3.987
  average  0.351    0.264    4.845    3.917    2.133    0.035      2.385

In-sample vs. out-of-sample predictive power for excess bond returns of a restricted model with three PCs (model 1) and an unrestricted model with five PCs (model 2); the row labeled "average" refers to returns averaged across maturities. The in-sample period is from 1964 to 2002 (the last observation used by Cochrane-Piazzesi), and the out-of-sample period is from 2003 to 2013. The columns R2² and R1² show the in-sample R² of the two models, and RMSE2 and RMSE1 the corresponding out-of-sample root-mean-squared forecast errors (RMSEs). The column labeled "DM" reports the z-statistic of the Diebold-Mariano test for equal forecast accuracy, and the following column the corresponding p-value. The last column shows the RMSE when forecasts are the in-sample mean excess return.
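The DM column in Table 11 is a Diebold-Mariano z-statistic for equal forecast accuracy. A rough sketch under squared-error loss, using a Bartlett-kernel long-run variance for the loss differential (the sign convention and function name here are illustrative, not necessarily those used in the table):

```python
import numpy as np

def diebold_mariano(e1, e2, lags=0):
    """Diebold-Mariano z-statistic comparing squared forecast errors.

    d_t = e1_t^2 - e2_t^2; under this convention, large positive values
    indicate that the second forecast is more accurate.
    """
    d = np.asarray(e1, dtype=float) ** 2 - np.asarray(e2, dtype=float) ** 2
    T = len(d)
    dbar = d.mean()
    u = d - dbar
    s = u @ u / T                                  # lag-0 variance term
    for j in range(1, lags + 1):
        w = 1.0 - j / (lags + 1.0)                 # Bartlett weight
        s += 2.0 * w * (u[j:] @ u[:-j]) / T
    return dbar / np.sqrt(s / T)                   # compare with N(0, 1)
```

With overlapping annual-return forecasts, the lag truncation would be chosen to cover the forecast overlap, as with the HAC regressions above.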
Table 12: Greenwood-Vayanos: predictive power of Treasury bond supply

                      One-year   Term                                 Bond
                       yield    spread    PC1      PC2      PC3     supply
Dependent variable: return on long-term bond
  Coefficient          1.212                                         0.026
  HAC t-statistic      2.853                                         3.104
  HAC p-value          0.004                                         0.002
  IM q = 8             0.030                                         0.795
  IM q = 16            0.001                                         0.925
Dependent variable: return on long-term bond
  Coefficient          1.800    2.872                                0.014
  HAC t-statistic      5.208    4.596                                1.898
  HAC p-value          0.000    0.000                                0.058
  IM q = 8             0.006    0.013                                0.972
  IM q = 16            0.000    0.000                                0.557
Dependent variable: excess return on long-term bond
  Coefficient                            0.168    5.842   -6.089     0.013
  HAC t-statistic                        1.457    4.853    1.303     1.862
  HAC p-value                            0.146    0.000    0.193     0.063
  IM q = 8                               0.000    0.003    0.045     0.968
  IM q = 16                              0.000    0.000    0.023     0.854
Dependent variable: avg. excess return for 2-5 year bonds
  Coefficient                            0.085    1.669   -4.632     0.004
  HAC statistic                          1.270    3.156    2.067     1.154
  HAC p-value                            0.204    0.002    0.039     0.249
  Bootstrap 5% c.v.                                                  3.105
  Bootstrap p-value                                                  0.448
  IM q = 8                               0.005    0.134    0.714     0.494
  IM q = 16                              0.008    0.011    0.611     0.980

Predictive regressions for annual bond returns using Treasury bond supply, as in Greenwood and Vayanos (2014) (GV). The coefficients on bond supply in the first two panels are identical to those reported in rows (1) and (6) of Table 5 in GV. HAC t-statistics and p-values are constructed using Newey-West standard errors with 36 lags, as in GV. The last two rows in each panel report p-values for t-tests using the methodology of Ibragimov and Müller (2010), splitting the sample into either 8 or 16 blocks. The sample period is 1952 to 2008. p-values below 5% are emphasized with bold face.
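The IM rows above implement the Ibragimov and Müller (2010) approach: estimate the coefficient of interest separately on q subsamples, then apply a one-sample t-test to the q block estimates. A minimal sketch of that logic (the helper name is hypothetical, and this version splits the sample into equal consecutive blocks; the resulting statistic is compared with Student-t critical values with q − 1 degrees of freedom):

```python
import numpy as np

def im_tstat(y, X, coef_index, q):
    """Ibragimov-Mueller style t-statistic from q subsample estimates."""
    n = len(y)
    blocks = np.array_split(np.arange(n), q)   # q consecutive blocks
    b = []
    for idx in blocks:
        beta = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
        b.append(beta[coef_index])             # block estimate of the coefficient
    b = np.asarray(b)
    # one-sample t-statistic on the q block estimates
    return np.sqrt(q) * b.mean() / b.std(ddof=1)
```

The appeal of this construction is robustness: it requires only that the block estimates be approximately independent and Gaussian, not that a long-run variance be estimated well.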
Table 13: Cooper-Priestley: predictive power of the output gap

                      gap      CP̃       CP       PC1      PC2      PC3
  Coefficient       -0.126
  OLS t-statistic    3.224
  HAC t-statistic    1.077
  HAC p-value        0.282
  Coefficient       -0.120    1.588
  OLS t-statistic    3.479   13.541
  HAC t-statistic    1.244    4.925
  HAC p-value        0.214    0.000
  Coefficient        0.113             1.612
  OLS t-statistic    2.940            13.831
  HAC t-statistic    1.099             5.059
  HAC p-value        0.272             0.000
  Coefficient        0.147                      0.001    0.043   -0.067
  OLS t-statistic    3.524                      4.359   11.506    3.690
  HAC t-statistic    1.306                      1.354    4.362    2.507
  HAC p-value        0.192                      0.176    0.000    0.012
  Bootstrap 5% c.v.  2.933
  Bootstrap p-value  0.356
  IM q = 8           0.612                      0.002    0.011    0.234
  IM q = 16          0.243                      0.000    0.001    0.064

Predictive regressions for the one-year excess return on a five-year bond using the output gap, as in Cooper and Priestley (2008) (CPR). CP̃ is the Cochrane-Piazzesi factor after orthogonalizing it with respect to gap, whereas CP is the usual Cochrane-Piazzesi factor. For the predictive regression, gap is lagged one month, as in CPR. HAC standard errors are based on the Newey-West estimator with 22 lags. The bootstrap procedure, which does not include bias correction, is described in the main text. The sample period is 1952 to 2003. p-values below 5% are emphasized with bold face.

Figure 1: Simulation study: size of t-test and sample size

[Figure: empirical size of the test (vertical axis, 0.00 to 0.20) plotted against sample size (horizontal axis, 0 to 1,000), with four curves: ρ = 1 and ρ = 0.99, each for small-sample simulations and for the local-to-unity asymptotic distribution.]

True size of conventional t-test of H0: β2 = 0 with nominal size of 5%, in simulated small samples and according to the local-to-unity asymptotic distribution, for different sample sizes, with δ = 1. Regressors are either random walks (ρ = 1) or stationary but highly persistent AR(1) processes (ρ = 0.99). For details on the simulation study refer to the main text.
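The CP̃ factor in Table 13 is the Cochrane-Piazzesi factor purged of the information in gap, i.e. the residual from projecting it on gap. A minimal sketch of that orthogonalization step (illustrative, not the authors' code):

```python
import numpy as np

def orthogonalize(x, z):
    """Residual of x after an OLS projection on a constant and z.

    The result is uncorrelated with z in sample by construction.
    """
    Z = np.column_stack([np.ones(len(z)), z])
    beta = np.linalg.lstsq(Z, x, rcond=None)[0]
    return x - Z @ beta
```

Including the orthogonalized factor alongside gap leaves the fit of the joint regression unchanged but attributes their common variation to gap, which is why the gap coefficient differs across the CP̃ and CP specifications.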
Figure 2: Cochrane-Piazzesi: predictive power of PCs across subsamples

[Figure: standardized coefficients on PC1 through PC5 (vertical axis, roughly -1 to 3) plotted against the endpoint of each subsample (1970 through 2000). In-figure labels report, for the Ibragimov-Müller test with q = 8: PC1: t-stat = 4.74, p-value = 0.002; PC2: t-stat = 2.72, p-value = 0.030; PC3: t-stat = 0.17, p-value = 0.873; PC4: t-stat = 1.29, p-value = 0.237; PC5: t-stat = 1.31, p-value = 0.233.]

Standardized coefficients on principal components (PCs) across eight different subsamples, ending at the indicated point in time. Standardized coefficients are calculated by dividing through the sample standard deviation of the coefficient across the eight samples. Text labels indicate t-statistics and p-values of the Ibragimov-Müller test with q = 8. Note that the t-statistics are equal to the means of the standardized coefficients multiplied by √8. The data and sample period are the same as in Cochrane and Piazzesi (2005).

Figure 3: Cochrane-Piazzesi: out-of-sample forecasts

[Figure: realized excess bond returns (averaged across maturities) over 2004–2013, plotted against out-of-sample forecasts from the restricted model (Forecast 1), the unrestricted model (Forecast 2), and the in-sample mean; vertical axis from -8 to 6.]

Realizations vs. out-of-sample forecasts of excess bond returns (averaged across maturities) from restricted model (1) with three PCs and unrestricted model (2) with five PCs. The in-sample period is from 1964 to 2002 (the last observation used by Cochrane-Piazzesi), and the out-of-sample period is from 2003 to 2013. The figure also shows the in-sample mean excess return.
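Figure 2's standardization divides each subsample coefficient by the cross-sample standard deviation of the eight estimates, so the IM t-statistic is √q times the mean of the standardized coefficients. A tiny sketch of that bookkeeping (illustrative names, not the authors' code):

```python
import numpy as np

def standardize_block_coefs(b):
    """Standardized subsample coefficients and the implied IM t-statistic.

    b holds q subsample estimates of one coefficient; each is divided by
    the cross-sample standard deviation, and the t-statistic is sqrt(q)
    times the mean of the standardized values.
    """
    b = np.asarray(b, dtype=float)
    s = b / b.std(ddof=1)                # standardized coefficients
    return s, np.sqrt(len(b)) * s.mean() # IM t-statistic
```

This makes the figure directly readable: if the standardized coefficients for a PC scatter around zero, the corresponding IM t-statistic is small, as for PC3 through PC5.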