
FEDERAL RESERVE BANK OF SAN FRANCISCO
WORKING PAPER SERIES

Robust Bond Risk Premia
Michael D. Bauer
Federal Reserve Bank of San Francisco
James D. Hamilton
University of California, San Diego
January 2016

Working Paper 2015-15
http://www.frbsf.org/economic-research/publications/working-papers/wp2015-15.pdf

Suggested citation:
Bauer, Michael D., and James D. Hamilton. 2015. “Robust Bond Risk Premia.” Federal Reserve Bank of San Francisco Working Paper 2015-15. http://www.frbsf.org/economic-research/publications/working-papers/wp2015-15.pdf

The views in this paper are solely the responsibility of the authors and should not be interpreted as
reflecting the views of the Federal Reserve Bank of San Francisco or the Board of Governors of
the Federal Reserve System.

Robust Bond Risk Premia∗
Michael D. Bauer† and James D. Hamilton‡
April 16, 2015
Revised: January 20, 2016

Abstract
A consensus has recently emerged that variables beyond the level, slope, and curvature of
the yield curve can help predict bond returns. This paper shows that the statistical tests
underlying this evidence are subject to serious small-sample distortions. We propose
more robust tests, including a novel bootstrap procedure specifically designed to test
the “spanning hypothesis.” We revisit the evidence in five published studies, find most
rejections of the spanning hypothesis to be spurious, and conclude that the current
consensus is wrong. Only the level and the slope of the yield curve are robust predictors
of bond returns.
Keywords: yield curve, spanning, return predictability, robust inference, bootstrap
JEL Classifications: E43, E44, E47

∗ The views expressed in this paper are those of the authors and do not necessarily reflect those of others in the Federal Reserve System. We thank John Cochrane, Graham Elliott, Robin Greenwood, Helmut Lütkepohl, Ulrich Müller, Hashem Pesaran and Glenn Rudebusch for useful suggestions, conference participants at the 7th Annual Volatility Institute Conference at the NYU Stern School of Business and at the NBER 2015 Summer Institute, as well as seminar participants at the Federal Reserve Bank of Boston and the Free University of Berlin for helpful comments, and Javier Quintero and Simon Riddell for excellent research assistance.
† Federal Reserve Bank of San Francisco, 101 Market St MS 1130, San Francisco, CA 94105, phone: 415-974-3299, e-mail: michael.bauer@sf.frb.org
‡ University of California at San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0508, phone: 858-534-5986, e-mail: jhamilton@ucsd.edu

1 Introduction

The nominal yield on a 10-year U.S. Treasury bond has been below 2% much of the time since
2011, a level never seen previously. To what extent does this represent unprecedentedly low
expected interest rates extending through the next decade, and to what extent does it reflect
an unusually low risk premium resulting from a flight to safety and large-scale asset purchases
by central banks that depressed the long-term yield? Finding the answer is a critical input
for monetary policy, investment strategy, and understanding the lasting consequences of the
financial and economic disruptions of 2008.
In principle one can measure the risk premium by the difference between the current
long rate and the expected value of future short rates. But what information should go into
constructing that expectation of future short rates? A powerful argument can be made that
the current yield curve itself should contain most (if not all) information useful for forecasting
future interest rates and bond returns. Investors use information at time t—which we can
summarize by a state vector zt —to forecast future short-term interest rates and determine
bond risk premia. Hence current yields are necessarily a function of zt , reflecting the general
fact that current asset prices incorporate all current information. This suggests that we may
be able to back out the state vector zt from the observed yield curve.1 The “invertibility” or
“spanning” hypothesis states that the current yield curve contains all the information that
is useful for predicting future interest rates or determining risk premia. Notably, under this
hypothesis, the yield curve is first-order Markov.
It has long been recognized that three yield-curve factors, such as the first three principal
components (PCs) of yields, can provide an excellent summary of the information in the entire
yield curve (Litterman and Scheinkman, 1991). While it is clear that these factors, which
are commonly labeled level, slope, and curvature, explain almost all of the cross-sectional
variance of yields, it is less clear whether they completely capture the relevant information for
forecasting future yields and estimating bond risk premia. In this paper we investigate what
we will refer to as the “spanning hypothesis” which holds that all the relevant information
for predicting future yields and returns is spanned by the level, slope and curvature of the
yield curve. This hypothesis differs from the claim that the yield curve follows a first-order
Markov process, as it adds the assumption that only these three yield-curve factors are useful
in forecasting. For example, if higher-order yield-curve factors such as the 4th and 5th PC
are informative about predicting yields and returns, yields would still be Markov, but the
spanning hypothesis, as we define it here, would be violated. Note also that the spanning hypothesis is much less restrictive than the expectations hypothesis, which states that bond risk premia are constant and hence excess bond returns are not predictable.

1 Specifically, this invertibility requires that (a) we observe at least as many yields as there are state variables in zt, and (b) there are no knife-edge cancellations or pronounced nonlinearities; see for example Duffee (2013b).
The question whether the spanning hypothesis is valid is of crucial importance for finance
and macroeconomics. If it is valid, then the estimation of monetary policy expectations
and bond risk premia would not require any data or models involving macroeconomic series,
other asset prices or quantities, volatilities, or survey expectations. Instead, all the necessary
information is in the shape of the current yield curve, summarized by the level, slope, and
curvature. If, however, the spanning hypothesis is violated, then this would seem to invalidate
a large body of theoretical work in asset pricing and macro-finance, since models in this
literature generally imply that the state variables are spanned by the information in the term
structure of interest rates.2 A growing literature on yield curve modeling is based on the
premise that it is undesirable and potentially counterfactual to assume spanning.3 There
appears to be a consensus, reflected in recent review articles by Gürkaynak and Wright (2012)
and Duffee (2013a), that the spanning question is a central issue in macro-finance.
A number of recent studies have produced evidence that appears to contradict the spanning
hypothesis. This evidence comes from predictive regressions for bond returns on various
predictors, controlling for information in the current yield curve. The variables that have
been found to contain additional predictive power in such regressions include measures of
economic growth and inflation (Joslin et al., 2014), factors inferred from a large set of macro
variables (Ludvigson and Ng, 2009, 2010), higher-order (fourth and fifth) PCs of bond yields
(Cochrane and Piazzesi, 2005), the output gap (Cooper and Priestley, 2008), and measures of
Treasury bond supply (Greenwood and Vayanos, 2014). The results in each of these studies
suggest that there might be unspanned or hidden information that is not captured by the
level, slope, and curvature of the current yield curve but that is useful for forecasting.
But the predictive regressions underlying all these results have a number of problematic
features. First, the predictive variables are typically very persistent, in particular in relation
to the small available sample sizes. Second, some of these predictors summarize the information in the current yield curve, and therefore are generally correlated with lagged forecast
errors, i.e., they violate the condition of strict exogeneity. In such a setting, tests of the spanning hypothesis are necessarily oversized in small samples, as we show both analytically and
using simulations. Third, the dependent variable is typically a bond return over an annual
holding period, which introduces substantial serial correlation in the prediction errors. This
worsens the size distortions and leads to large increases in R² even if irrelevant predictors are included.4 We demonstrate that the procedures commonly used for inference about the spanning hypothesis do not appropriately address these issues, and are subject to serious small-sample distortions.

2 Key contributions to this large literature include Wachter (2006), Piazzesi and Schneider (2007), Rudebusch and Wu (2008), and Bansal and Shaliastovich (2013). For a recent example see Swanson (2015).
3 Examples are Wright (2011), Chernov and Mueller (2012), Priebsch (2014), and Coroneo et al. (2015).
We propose two procedures that give substantially more robust small-sample inference.
The first is a parametric bootstrap that generates data samples under the spanning hypothesis:
We calculate the first three PCs of the observed set of yields and summarize their dynamics
with a VAR fit to the observed PCs.5 Then we use a residual bootstrap to resample the PCs,
and construct bootstrapped yields by multiplying the simulated PCs by the historical loadings
of yields on the PCs and adding a small Gaussian measurement error. Thus by construction
no variables other than the PCs are useful for predicting yields or returns in our generated
data. We then fit a separate VAR to the proposed additional explanatory variables alone,
and generate bootstrap samples for the predictors from this VAR. Using our novel bootstrap
procedure, we can calculate the properties of any regression statistic under the spanning
hypothesis. Our procedure notably differs from the bootstrap approach often employed in
this literature, which generates artificial data under the expectations hypothesis.6 Applying our bootstrap reveals
that the conventional tests reject the true null much too often. We show for example that
the tests employed by Ludvigson and Ng (2009), which are intended to have a nominal size
of five percent, can have a true size of up to 54%. We then ask whether under the null it
would be possible to observe similar patterns of predictability as researchers have found in
the data. We find that this is indeed the case, meaning that much of the above-cited evidence
against the spanning hypothesis is in fact spurious. These results provide a strong caution
against using conventional tests, and we recommend that researchers instead use the bootstrap
procedure proposed in this paper. Despite the usual technical concerns about bootstrapping
near-nonstationary variables, we present evidence that this procedure performs well in small samples.
A second procedure that we propose for inference in this context is the approach for robust
testing of Ibragimov and Müller (2010). The approach is to split the sample into subsamples,
to estimate coefficients separately in each of these, and then to perform a simple t-test on the
coefficients across subsamples. We have found this approach to have excellent size and power
properties in settings similar to the ones encountered by researchers testing for predictive power
for interest rates and bond returns. Applying this type of test to the predictive regressions
for excess bond returns studied in the literature, we find that the only robust predictors are the level and the slope of the yield curve.

4 Lewellen et al. (2010) demonstrated that high R² in cross-sectional return regressions are, for different reasons, often unconvincing evidence of true predictability.
5 We consider bias-corrected estimation of the VAR, in light of the high persistence of the PCs.
6 This approach has been used, for example, by Bekaert et al. (1997), Cochrane and Piazzesi (2005), Ludvigson and Ng (2009, 2010), and Greenwood and Vayanos (2014).
After revisiting the evidence in the five influential papers cited above we draw two conclusions. First, the claims going back to Fama and Bliss (1987) and Campbell and Shiller (1991)
that excess returns can be predicted from the level and slope of the yield curve remain quite
robust. Second, the newer evidence on the predictive power of macro variables, higher-order
PCs of the yield curve, or other variables, is subject to more serious econometric problems
and appears weaker and much less robust. We further demonstrate that this predictive power
is substantially weaker in samples that include subsequent data than in the samples originally
analyzed. Overall, we do not find convincing evidence to reject the hypothesis that the current yield curve, and in particular three factors summarizing this yield curve, contains all the
information necessary to infer interest rate forecasts and bond risk premia. In other words,
the spanning hypothesis cannot be rejected, and the Markov property of the yield curve seems
alive and well.
Our paper is related mainly to two strands of literature. The first is the literature on
the spanning hypothesis, and most relevant studies were cited above. Bauer and Rudebusch
(2015) also question the evidence against spanning, by showing that conventional macro-finance models can generate data in which the spanning hypothesis is spuriously rejected. Our
paper is also related to the econometric literature on spurious results in return regressions.
Mankiw and Shapiro (1986), Cavanagh et al. (1995), Stambaugh (1999) and Campbell and
Yogo (2006), among others, studied short-horizon return predictability with a regressor that
is not strictly exogenous. We point out a related econometric issue in bond return regressions,
which is however distinct from Stambaugh bias. Ferson et al. (2003) and Deng (2013) studied
the size distortions in a setting that is different from ours and more relevant for stock returns,
namely when returns have an unobserved persistent component. In contrast to these studies,
we focus on the econometric problems that arise in tests of the spanning hypothesis. In
addition, we propose simple, easily implementable solutions to these problems.

2 Inference about the spanning hypothesis

The evidence against the spanning hypothesis in all of the studies cited in the introduction comes from regressions of the form
$$y_{t+h} = \beta_1' x_{1t} + \beta_2' x_{2t} + u_{t+h}, \qquad (1)$$

where the dependent variable yt+h is the return or excess return on a long-term bond (or
portfolio of bonds) that we wish to predict, x1t and x2t are vectors containing K1 and K2
predictors, respectively, and ut+h is a forecast error. The predictors x1t contain a constant
and the information in the yield curve, typically captured by the first three PCs of observed
yields, i.e., level, slope, and curvature.7 The null hypothesis of interest is
$$H_0: \quad \beta_2 = 0,$$

which says that the relevant predictive information is spanned by the information in the yield
curve and that x2t has no additional predictive power.
The evidence produced in these studies comes in two forms, the first based on simple descriptive statistics such as how much the R² of the regression increases when the variables x2t are added, and the second from formal statistical tests of the hypothesis that β2 = 0. In this section we show how key features of the specification can matter significantly for both forms of evidence. In Section 2.1 we show how serial correlation in the error term ut and the proposed predictors x2t can give rise to a large increase in R² when x2t is added to the regression even if it is no help in predicting yt+h. In Section 2.2 we show the consequences
of lack of strict exogeneity of x1t , which is necessarily correlated with ut since it contains
information in current yields. When x1t and x2t are highly persistent processes, as is usually
the case in practice, conventional tests of H0 generally will exhibit significant size distortions in
finite samples. We then propose methods for robust inference about bond return predictability
in Sections 2.3 and 2.4.

2.1 Implications of serially correlated errors based on first-order asymptotics

Our first observation is that in regressions in which x1t and x2t are strongly persistent and
the error term is serially correlated—as is always the case with overlapping bond returns—we
should not be surprised to see substantial increases in R² when x2t is added to the regression even if the true coefficient is zero. It is well known that in small samples serial correlation in the residuals can increase both the bias and the variance of a regression R² (see for example Koerts and Abrahamse (1969) and Carrodus and Giles (1992)).

7 We will always sign the PCs so that the yield with the longest maturity loads positively on all PCs. As a result, PC1 and PC2 correspond to what are commonly referred to as “level” and “slope” of the yield curve.

To see how much difference this could make in the current setting, consider the unadjusted R² defined as
$$R^2 = 1 - \frac{SSR}{\sum_{t=1}^{T}(y_{t+h} - \bar{y}_h)^2} \qquad (2)$$

where SSR denotes the regression sum of squared residuals. The increase in R² when x2t is added to the regression is thus given by
$$R_2^2 - R_1^2 = \frac{SSR_1 - SSR_2}{\sum_{t=1}^{T}(y_{t+h} - \bar{y}_h)^2}. \qquad (3)$$

We show in Appendix A that when x1t, x2t, and ut+h are stationary and satisfy standard regularity conditions, if the null hypothesis is true (β2 = 0) and the extraneous regressors are uncorrelated with the valid predictors (E(x2t x1t′) = 0), then
$$T(R_2^2 - R_1^2) \xrightarrow{d} r' Q^{-1} r / \gamma, \qquad r \sim N(0, S), \qquad (4)$$
$$\gamma = E[y_t - E(y_t)]^2, \qquad (5)$$
$$Q = E(x_{2t} x_{2t}'), \qquad (6)$$
$$S = \sum_{v=-\infty}^{\infty} E(u_{t+h} u_{t+h-v} x_{2t} x_{2,t-v}'). \qquad (7)$$

Result (4) implies that the difference R₂² − R₁² itself converges in probability to zero under the null hypothesis that x2t does not belong in the regression, meaning that the two regressions asymptotically should have the same R².
In a given finite sample, however, R₂² is larger than R₁² by construction, and the above results give us an indication of how much larger it would be in a given finite sample. If x2t ut+h is serially uncorrelated, then (7) simplifies to $S_0 = E(u_{t+h}^2 x_{2t} x_{2t}')$. On the other hand, if x2t ut+h is positively serially correlated, then S exceeds S₀ by a positive-definite matrix, and r exhibits more variability across samples. This means R₂² − R₁², being a quadratic form in a vector with a higher variance, would have both a higher expected value as well as a higher variance when x2t ut+h is serially correlated compared to situations when it is not.
When the dependent variable yt+h is a multi-period bond return, then the error term is
necessarily serially correlated. In our empirical applications, yt+h will typically be the h-period
excess return on an n-period bond,
$$y_{t+h} = p_{n-h,t+h} - p_{nt} - h\, i_{ht}, \qquad (8)$$

for pnt the log of the price of a pure discount n-period bond purchased at date t and int = −pnt/n the corresponding zero-coupon yield. In that case, E(ut ut−v) ≠ 0 for v = 0, . . . , h − 1, due to the overlapping observations. At the same time, the explanatory variables x2t often are highly serially correlated, so E(x2t x2,t−v′) ≠ 0. Thus even if x2t is completely independent of ut at all leads and lags, the product will be highly serially correlated:
$$E(u_{t+h} u_{t+h-v} x_{2t} x_{2,t-v}') = E(u_t u_{t-v})\, E(x_{2t} x_{2,t-v}') \neq 0.$$
This serial correlation in x2t ut+h would contribute to larger values for R₂² − R₁² on average as well as to increased variability in R₂² − R₁² across samples. In other words, including x2t could substantially increase the R² even if H0 is true.8
These results on the asymptotic distribution of R₂² − R₁² could be used to design a test of H0. However, we show in the next subsection that in small samples additional problems arise from the persistence of the predictors, with the consequence that the bias and variability of R₂² − R₁² can be even greater than predicted by (4). For this reason, in this paper we will rely on bootstrap approximations to the small-sample distribution of the statistic R₂² − R₁², and demonstrate that the dramatic values sometimes reported in the literature are not implausible under the spanning hypothesis.
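To get a feel for the magnitudes, the following minimal simulation (our own illustration, not from the paper) generates overlapping h-period returns that are pure noise, adds a persistent but irrelevant AR(1) predictor, and records the resulting increase in R²; the restricted model is a constant only, and all parameter values are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
T, h, rho2, n_sims = 200, 12, 0.99, 2000   # illustrative values (assumptions)

def r_squared(y, X):
    """Unadjusted R-squared from an OLS regression of y on X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

gains = np.empty(n_sims)
for i in range(n_sims):
    # Overlapping h-period "returns": pure noise with MA(h-1) serial correlation
    y = np.convolve(rng.standard_normal(T + h), np.ones(h), "valid")[:T]
    # Persistent AR(1) predictor that is irrelevant by construction (beta2 = 0)
    x2 = np.zeros(T)
    for t in range(1, T):
        x2[t] = rho2 * x2[t - 1] + rng.standard_normal()
    X1 = np.ones((T, 1))                    # restricted model: constant only
    X2 = np.column_stack([X1, x2])          # add the irrelevant predictor
    gains[i] = r_squared(y, X2) - r_squared(y, X1)

print("mean increase in R2:", gains.mean())
print("95th percentile:", np.quantile(gains, 0.95))
```

Even in this stripped-down setting, the simulated gains in R² are frequently far from zero, consistent with the asymptotic argument above.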
Serial correlation of the residuals also affects the sampling distribution of the OLS estimate
of β2 . In Appendix A we verify using standard algebra that under the null hypothesis β2 = 0
the OLS estimate b2 can be written as
$$b_2 = \left( \sum_{t=1}^{T} \tilde{x}_{2t} \tilde{x}_{2t}' \right)^{-1} \sum_{t=1}^{T} \tilde{x}_{2t} u_{t+h} \qquad (9)$$

where x̃2t denotes the sample residuals from OLS regressions of x2t on x1t :
$$\tilde{x}_{2t} = x_{2t} - A_T x_{1t} \qquad (10)$$
$$A_T = \left( \sum_{t=1}^{T} x_{2t} x_{1t}' \right) \left( \sum_{t=1}^{T} x_{1t} x_{1t}' \right)^{-1}. \qquad (11)$$

8 The same conclusions necessarily also hold for the adjusted R̄² defined as
$$\bar{R}_i^2 = 1 - \frac{T-1}{T-k_i} \cdot \frac{SSR_i}{\sum_{t=1}^{T}(y_{t+h} - \bar{y}_h)^2}$$
for ki the number of coefficients estimated in model i, from which we see that
$$T(\bar{R}_2^2 - \bar{R}_1^2) = \frac{[T/(T-k_1)]\, SSR_1 - [T/(T-k_2)]\, SSR_2}{\sum_{t=1}^{T}(y_{t+h} - \bar{y}_h)^2 / (T-1)},$$
which has the same asymptotic distribution as (4). In our small-sample investigations below, we will analyze either R² or R̄² as was used in the original study that we revisit.

If x2t and x1t are stationary and uncorrelated with each other, as the sample size grows, $A_T \xrightarrow{p} 0$ and b2 has the same asymptotic distribution as
$$b_2^* = \left( \sum_{t=1}^{T} x_{2t} x_{2t}' \right)^{-1} \sum_{t=1}^{T} x_{2t} u_{t+h}, \qquad (12)$$
namely
$$\sqrt{T}\, b_2 \xrightarrow{d} N(0, Q^{-1} S Q^{-1}), \qquad (13)$$

with Q and S the matrices defined in (6) and (7). Again we see that positive serial correlation
causes S to exceed the value S₀ that would be appropriate for serially uncorrelated residuals.
In other words, serial correlation in the error term increases the sampling variability of the
OLS estimate b2 .
The standard approach is to use heteroskedasticity- and autocorrelation-consistent (HAC)
standard errors to try to correct for this, for example, the estimators proposed by Newey and
West (1987) or Andrews (1991). However, in practice different HAC estimators of S can lead
to substantially different empirical conclusions (Müller, 2014). Moreover, we show in the next
subsection that even if the population value of S were known with certainty, expression (13) can
give a poor indication of the true small-sample variance. We further demonstrate empirically
in the subsequent sections that this is a serious problem when carrying out inference about
bond return predictability.
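For reference, HAC inference of this kind is typically implemented with a Newey-West lag truncation of at least h − 1; a minimal sketch using statsmodels on simulated placeholder data (variable names and parameter values are our own, not from the paper):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
T, h = 300, 12                                  # illustrative sizes (assumptions)
x1 = rng.standard_normal(T)
x2 = rng.standard_normal(T)
# Placeholder dependent variable with MA(h-1) errors, as with overlapping returns
y = 0.5 * x1 + np.convolve(rng.standard_normal(T + h), np.ones(h), "valid")[:T]

X = sm.add_constant(np.column_stack([x1, x2]))
# Newey-West (HAC) covariance with lag truncation tied to the holding period
fit = sm.OLS(y, X).fit(cov_type="HAC", cov_kwds={"maxlags": h - 1})
print(fit.params, fit.bse)                      # coefficients and HAC standard errors
```

As the text emphasizes, this correction targets the estimation of S only; it does not repair the small-sample discrepancy between (9) and (12) discussed next.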

2.2 Small-sample implications of lack of strict exogeneity

A second feature of the studies examined in this paper is that the valid explanatory variables x1t are correlated with lagged values of the error term. That is, they are only weakly but not strictly exogenous. In addition, x1t and x2t are highly serially correlated. We will show that this can lead to substantial size distortions in tests of β2 = 0. The intuition of our result is the following: As noted above, the OLS estimate of β2 in (1), b2, can be thought of as being implemented in three steps: (i) regress x2t on x1t, (ii) regress yt+h on x1t, and (iii) regress the residuals from (ii) on the residuals of (i). When x1t and x2t are highly persistent, the auxiliary regression (i) behaves like a spurious regression in small samples, causing $\sum \tilde{x}_{2t}\tilde{x}_{2t}'$ in (9) to be significantly smaller than $\sum x_{2t}x_{2t}'$ in (12). When there is correlation between x1t and ut, this causes the usual asymptotic distribution to underestimate significantly the true variability of b2. As a consequence, the t-test for β2 = 0 rejects the true null too often. In the following, we demonstrate exactly why this occurs, first theoretically using local-to-unity asymptotics, and then in small-sample simulations.
The issue we raise has to our knowledge not previously been recognized. Mankiw and
Shapiro (1986) and Stambaugh (1999) studied tests of the hypothesis β1 = 0 in a regression
of yt+1 on x1t , where the regressors x1t are not strictly exogenous, and documented that when
x1t is persistent this leads to small-sample coefficient bias in the OLS estimate of β1 .9 By
contrast, in our setting there is no coefficient bias present in estimates of β2 , and it is instead
the inaccuracy of the standard errors, which we will refer to as “standard error bias,” that
distorts the results of conventional inference. Another related line of work is by Ferson et al.
(2003) and Deng (2013), who studied predictions of returns that have a persistent component
that is unobserved. In our notation, their setting corresponds to the case where both x1t
and x2t are strictly exogenous, x1t is unobserved, and returns are predicted using x2t . For
predictive regressions of bond returns, however, we do have estimates of the persistent return
component based on information in the current yield curve, x1t , and instead the resulting lack
of strict exogeneity causes a separate econometric problem from that considered by Ferson
et al. (2003) and Deng (2013).
2.2.1 Theoretical analysis using local-to-unity asymptotics

We now demonstrate where the problem arises in the simplest example of our setting. Suppose
that x1t and x2t are scalars that follow independent highly persistent processes,
$$x_{i,t+1} = \rho_i x_{it} + \varepsilon_{i,t+1}, \quad i = 1, 2, \qquad (14)$$

where ρi is close to one. Consider the consequences of OLS estimation of (1) in the special
case where h = 1:
$$y_{t+1} = \beta_0 + \beta_1 x_{1t} + \beta_2 x_{2t} + u_{t+1}. \qquad (15)$$
We assume that (ε1t, ε2t, ut)′ follows a martingale difference sequence with finite fourth moments and variance matrix
$$V = E\left[ \begin{pmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \\ u_t \end{pmatrix} \begin{pmatrix} \varepsilon_{1t} & \varepsilon_{2t} & u_t \end{pmatrix} \right] = \begin{pmatrix} \sigma_1^2 & 0 & \delta \sigma_1 \sigma_u \\ 0 & \sigma_2^2 & 0 \\ \delta \sigma_1 \sigma_u & 0 & \sigma_u^2 \end{pmatrix}. \qquad (16)$$
Thus x1t is not strictly exogenous when the correlation δ is nonzero. Note that for any δ, x2t ut+1
is serially uncorrelated and the standard OLS t-test of β2 = 0 asymptotically has a N(0, 1) distribution when using the conventional first-order asymptotic approximation.

9 Cavanagh et al. (1995) and Campbell and Yogo (2006) considered this problem using local-to-unity asymptotic theory.

This simple
example illustrates the problems in a range of possible settings for yield-curve forecasting. In
particular, if Var(ut+1 ) substantially exceeds Var(β1 x1t ), yt could be viewed as a (one-period)
bond return, where β1 x1t is a persistent component of the return that is small relative to the
size of yt+1 .
One device for seeing how the results in a finite sample of some particular size T likely
differ from those predicted by conventional first-order asymptotics is to use a local-to-unity
specification as in Phillips (1988) and Cavanagh et al. (1995):
$$x_{i,t+1} = (1 + c_i/T)\, x_{it} + \varepsilon_{i,t+1}, \quad i = 1, 2. \qquad (17)$$

For example, if our data come from a sample of size T = 100 when ρi = 0.95, the idea is to
represent this with a value of ci = −5 in (17). The claim is that analyzing the properties as
T → ∞ of a model characterized by (17) with ci = −5 gives a better approximation to the
actual distribution of regression statistics in a sample of size T = 100 and ρi = 0.95 than is
provided by the first-order asymptotics used in the previous subsection which treat ρi as a
constant when T → ∞; see for example Chan (1988) and Nabeya and Sørensen (1994). The
local-to-unity asymptotics turn out to be described by Ornstein-Uhlenbeck processes. For
example
$$T^{-2} \sum_{t=1}^{T} (x_{it} - \bar{x}_i)^2 \Rightarrow \sigma_i^2 \int_0^1 [J_{c_i}^{\mu}(\lambda)]^2 \, d\lambda$$
where ⇒ denotes weak convergence as T → ∞ and
$$J_{c_i}(\lambda) = c_i \int_0^{\lambda} e^{c_i(\lambda - s)} W_i(s)\, ds + W_i(\lambda), \quad i = 1, 2,$$
$$J_{c_i}^{\mu}(\lambda) = J_{c_i}(\lambda) - \int_0^1 J_{c_i}(s)\, ds, \quad i = 1, 2,$$

with W1 (λ) and W2 (λ) denoting independent standard Brownian motion. When ci = 0, (17)
becomes a random walk and the local-to-unity asymptotics simplify to the standard unit-root
asymptotics involving functionals of Brownian motion as a special case: J0 (λ) = W (λ).
Applying local-to-unity asymptotics to our setting reveals the basic econometric problem.
We show in Appendix B that under local-to-unity asymptotics the coefficient from a regression
of x2t on x1t has the following limiting distribution:
$$A_T = \frac{\sum (x_{1t} - \bar{x}_1)(x_{2t} - \bar{x}_2)}{\sum (x_{1t} - \bar{x}_1)^2} \Rightarrow \frac{\sigma_2 \int_0^1 J_{c_1}^{\mu}(\lambda) J_{c_2}^{\mu}(\lambda)\, d\lambda}{\sigma_1 \int_0^1 [J_{c_1}^{\mu}(\lambda)]^2\, d\lambda} \equiv (\sigma_2/\sigma_1) A, \qquad (18)$$

where we have defined A to be the random variable in the middle expression. Under first-order asymptotics the influence of AT would vanish as the sample size grows. But using local-to-unity asymptotics we see that AT behaves similarly to the coefficient in a spurious regression and does not converge to zero—the true correlation between x1t and x2t in this setting—but to a random variable proportional to A. Consequently, the t-statistic for β2 = 0 can have a very different distribution from that predicted using first-order asymptotics. We demonstrate in Appendix B that this t-statistic has a local-to-unity asymptotic distribution under the null hypothesis that is given by
$$\frac{b_2}{\left\{ s^2 / \sum \tilde{x}_{2t}^2 \right\}^{1/2}} \Rightarrow \delta Z_1 + \sqrt{1 - \delta^2}\, Z_0 \qquad (19)$$
$$Z_1 = \frac{\int_0^1 K_{c_1,c_2}(\lambda)\, dW_1(\lambda)}{\left\{ \int_0^1 [K_{c_1,c_2}(\lambda)]^2\, d\lambda \right\}^{1/2}} \qquad (20)$$
$$Z_0 = \frac{\int_0^1 K_{c_1,c_2}(\lambda)\, dW_0(\lambda)}{\left\{ \int_0^1 [K_{c_1,c_2}(\lambda)]^2\, d\lambda \right\}^{1/2}} \qquad (21)$$
$$K_{c_1,c_2}(\lambda) = J_{c_2}^{\mu}(\lambda) - A\, J_{c_1}^{\mu}(\lambda)$$
for $s^2 = (T-3)^{-1} \sum (y_{t+1} - b_0 - b_1 x_{1t} - b_2 x_{2t})^2$ and Wi(λ) independent standard Brownian
processes for i = 0, 1, 2. Conditional on the realizations of W1 (.) and W2 (.), the term Z0 will
be recognized as a standard Normal variable, and therefore Z0 has an unconditional N (0, 1)
distribution as well.10 In other words, if x1t is strictly exogenous (δ = 0) then the OLS t-test
of β2 = 0 will be valid in small samples even with highly persistent regressors. By contrast,
the term dW1 (λ) in the numerator of (20) is not independent of the denominator and this
gives Z1 a nonstandard distribution. In particular, Appendix B establishes that Var(Z1 ) > 1.
Moreover Z1 and Z0 are uncorrelated with each other.11 Therefore the t-statistic in (19) in
general has a non-standard distribution with variance δ²Var(Z1) + (1 − δ²) > 1, which is
monotonically increasing in |δ|. This shows that whenever x1t is correlated with ut (δ ≠ 0) and x1t and x2t are highly persistent, in small samples the t-test of β2 = 0 will reject too often when H0 is true.

10 The intuition is that for v0,t+1 ∼ i.i.d. N(0, 1) and $K = \{K_t\}_{t=1}^T$ any sequence of random variables that is independent of v0, $\sum_{t=1}^T K_t v_{0,t+1}$ has a distribution conditional on K that is $N(0, \sum_{t=1}^T K_t^2)$, and $\sum_{t=1}^T K_t v_{0,t+1} / \sqrt{\sum_{t=1}^T K_t^2} \sim N(0, 1)$. Multiplying by the density of K and integrating over K gives the identical unconditional distribution, namely N(0, 1). For a more formal discussion in the current setting, see Hamilton (1994, pp. 602-607).
11 The easiest way to see this is to note that conditional on W1(.) and W2(.) the product has expectation zero, so the unconditional expected product is zero as well.
Expression (19) can be viewed as a straightforward generalization of result (2.1) in Cavanagh et al. (1995) and expression (11) in Campbell and Yogo (2006). In their case the
explanatory variable is x1,t−1 − x̄1 which behaves asymptotically like Jcµ1 (λ). The component of ut that is correlated with ε1t leads to a contribution to the t-statistic given by the
expression that Cavanagh et al. (1995) refer to as τ1c , which is labeled as τc /κc by Campbell
and Yogo (2006). This variable is a local-to-unity version of the Dickey-Fuller distribution with well-known negative bias. By contrast, in our case the explanatory variable is
x̃2,t−1 = x2,t−1 − AT x1,t−1 which behaves asymptotically like Kc1,c2(λ). Here the component of
ut that is correlated with ε1t leads to a contribution to the t-statistic given by Z1 in our expression (19). Unlike the Dickey-Fuller distribution, Z1 has mean zero, but like the Dickey-Fuller
distribution it has variance larger than one.
2.2.2 Simulation evidence

We now examine the implications of the theory developed above in a simulation study. We
generate values for x1t and x2t using (14), with ε1t and ε2t serially independent Gaussian
random variables with unit variance and covariance equal to θ.12 We then calculate
$$y_{t+1} = \rho_1 x_{1t} + u_{t+1}, \qquad u_t = \delta \varepsilon_{1t} + \sqrt{1 - \delta^2}\, v_t,$$

where vt is an i.i.d. standard normal random variable. This implies that in the predictive
equation (15) the true parameters are β0 = β2 = 0 and β1 = ρ1 , and that the correlation
between ut and ε1t is δ. Note that for δ = 1 this corresponds to a model with a lagged
dependent variable (yt = x1t ), whereas for δ = 0 both predictors are strictly exogenous as ut is
independent of both ε1t and ε2t. While in bond return regressions δ is typically negative
(as we discuss below in Section 3), we can focus here on 0 ≤ δ ≤ 1, since only |δ| matters for
the distribution of the t-statistic.
We first set θ = 0 as in our theory above, so that the variance matrix V is given by
equation (16) with σ1 , σ2 , and σu equal to one, and x2t is strictly exogenous. We investigate
the effects of varying δ, the persistence of the predictors (ρ1 = ρ2 = ρ), and the sample size
T . We simulate 50,000 artificial data samples, and in each sample we estimate the regression
in equation (15). Since our interest is in the inference about β2 we use this simulation design
to study the small-sample behavior of the t-statistic for the test of H0 : β2 = 0. To give
conventional inference the best chance, we use OLS standard errors, which is the correct choice in this simulation setup as the errors are not serially correlated (h = 1) and there is no heteroskedasticity.13

12 We start the simulations at x1,0 = x2,0 = 0, following standard practice of making all inference conditional on date 0 magnitudes.
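The following sketch mirrors this experiment on a smaller scale (our own code; the paper uses 50,000 replications and also varies ρ, δ, and T, whereas the values below are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
T, rho, delta, n_sims = 100, 0.99, 1.0, 5000   # illustrative settings (assumptions)
crit = stats.t.ppf(0.975, df=T - 3)
rejections = 0

for _ in range(n_sims):
    e1 = rng.standard_normal(T + 1)
    e2 = rng.standard_normal(T + 1)
    v = rng.standard_normal(T + 1)
    u = delta * e1 + np.sqrt(1 - delta**2) * v  # u_t = delta*eps1_t + sqrt(1-d^2)*v_t
    x1, x2 = np.zeros(T + 1), np.zeros(T + 1)
    for t in range(T):                          # AR(1) predictors, equation (14)
        x1[t + 1] = rho * x1[t] + e1[t + 1]
        x2[t + 1] = rho * x2[t] + e2[t + 1]
    y = rho * x1[:-1] + u[1:]                   # y_{t+1} = rho*x1_t + u_{t+1}
    X = np.column_stack([np.ones(T), x1[:-1], x2[:-1]])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    s2 = resid @ resid / (T - 3)
    se_b2 = np.sqrt(s2 * np.linalg.inv(X.T @ X)[2, 2])  # OLS standard error of b2
    rejections += abs(b[2] / se_b2) > crit

print("empirical size of nominal 5% t-test:", rejections / n_sims)
```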
In addition to the small-sample distribution of the t-statistic we also study its asymptotic distribution given in equation (19). While this is a non-standard distribution, we can draw from it using Monte Carlo simulation: for given values of c1 and c2, we simulate samples of size T̃ from near-integrated processes and approximate the integrals using Riemann sums—see, for example, Chan (1988), Stock (1991), and Stock (1994). The literature suggests that such a Monte Carlo approach yields accurate approximations to the limiting distribution even for moderate sample sizes (Stock, 1991, uses T̃ = 500). We will use T̃ = 1000 and generate 50,000 Monte Carlo replications with c1 = c2 = T(ρ − 1) to calculate the predicted outcome for a sample of size T with serial dependence ρ.
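A sketch of how one such draw from the limiting distribution (19) can be generated, approximating the Brownian functionals with Riemann/Ito sums on a grid of size T̃ (our own implementation; grid size and c values are illustrative):

```python
import numpy as np

def t_limit_draw(c1, c2, delta, T_tilde=1000, rng=None):
    """One draw from the limiting distribution (19), via Riemann-sum approximation."""
    rng = rng or np.random.default_rng()
    dW = rng.standard_normal((3, T_tilde)) / np.sqrt(T_tilde)  # increments of W0, W1, W2
    J1, J2 = np.zeros(T_tilde), np.zeros(T_tilde)
    for t in range(1, T_tilde):  # near-integrated recursions approximating J_c1, J_c2
        J1[t] = (1 + c1 / T_tilde) * J1[t - 1] + dW[1, t]
        J2[t] = (1 + c2 / T_tilde) * J2[t - 1] + dW[2, t]
    J1mu, J2mu = J1 - J1.mean(), J2 - J2.mean()       # demeaned processes J^mu
    A = np.sum(J1mu * J2mu) / np.sum(J1mu ** 2)       # limit of A_T in (18)
    K = J2mu - A * J1mu                               # K_{c1,c2}(lambda)
    scale = np.sqrt(np.sum(K ** 2) / T_tilde)         # {integral of K^2 dlambda}^(1/2)
    Z1 = np.sum(K[:-1] * dW[1, 1:]) / scale  # numerator uses dW1: non-standard
    Z0 = np.sum(K[:-1] * dW[0, 1:]) / scale  # dW0 independent of K: N(0,1)
    return delta * Z1 + np.sqrt(1 - delta ** 2) * Z0

# e.g. T = 100 and rho = 0.99 correspond to c1 = c2 = T(rho - 1) = -1
draws = np.array([t_limit_draw(-1.0, -1.0, delta=1.0) for _ in range(5000)])
print("implied size of a nominal 5% t-test:", np.mean(np.abs(draws) > 1.96))
```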
Table 1 reports the performance of the t-test of H0 with a nominal size of five percent. It
shows the true size of this test, i.e., the frequency of rejections of H0 , according to both the
small-sample distribution from our simulations and the asymptotic distribution in equation
(19). We use critical values from a Student t-distribution with T − 3 degrees of freedom. Not
surprisingly, the local-to-unity asymptotic distribution provides an excellent approximation to
the exact small-sample distributions, as both indicate a very similar test size across parameter
configurations and sample sizes. The main finding here is that the size distortions can be
quite substantial with a true size of up to 17 percent—the t-test would reject the null more
than three times as often as it should. When δ 6= 0, the size of the t-test increases with the
persistence of the regressors. Table 1 also shows the dependence of the size distortion on the
sample size. To visualize this, Figure 1 plots the empirical size of the t-test for the case with
δ = 1 for different sample sizes from T = 50 to T = 1000.14 When ρ < 1, the size distortions
decrease with the sample size—for example for ρ = 0.99 the size decreases from 15 percent to
about 9 percent. In contrast, when ρ = 1 the size distortions are not affected by the sample
size, as indeed in this case the non-Normal distribution corresponding to (19) with ci = 0
governs the distribution for arbitrarily large T .
To understand better why conventional t-tests go so wrong in this setting, we use simulations to study the respective roles of bias in the coefficient estimates and of inaccuracy of
the OLS standard errors for estimation of β1 and β2 . Table 2 shows results for three different
simulation settings, in all of which T = 100, ρ = 0.99, and x1t is correlated with past forecast
errors (δ ≠ 0). In the first two settings, the correlation between the regressors is zero (θ = 0), and δ equals either 1 or 0.8. In the third setting, we investigate the effects of non-zero correlation between the predictors by setting δ = 0.8 and θ = 0.8.15

13 If we instead use Newey-West standard errors, the size distortions become larger, as expected based on the well-known small-sample problems of HAC covariance estimators (e.g., Müller, 2014).
14 The lines in Figure 1 are based on 500,000 simulated samples in each case.

The results show that in
all three simulation settings b1 is downward biased and b2 is unbiased. The problem with the
hypothesis test of β2 = 0 does not arise from coefficient (Stambaugh) bias, but from the fact
that the asymptotic standard errors underestimate the true sampling variability of both b1 and
b2 , i.e., from “standard error bias.” This is evident from comparing the standard deviation of
the coefficient estimates across simulations—the true small-sample standard error—and the
average OLS standard errors. The latter are between 22 and 31 percent too low. Because of
this standard error bias, the tests for β2 = 0 reject much more often than their nominal size
of five percent.
2.2.3 Relevance for tests of the spanning hypothesis

We have demonstrated that with persistent predictors, the lack of strict exogeneity of a subset
of the predictors can have serious consequences for the small-sample inference on the remaining
predictors, because it causes standard error bias for all predictors. Importantly, HAC standard
errors do not help, because in such settings they cannot accurately capture the uncertainty
surrounding the coefficient estimators. This econometric issue arises necessarily in all tests of
the spanning hypothesis. First, in these regressions the predictors in x1t are by construction
correlated with ut , because they correspond to information in current yields and the dependent
variable is a future bond return. Second, the predictors are often highly persistent. Table 3,
which we discuss in more detail below, reports the estimated autocorrelation coefficients for
the predictors used in each published study, showing the high persistence of the predictors
used in practice. Third, the sample sizes are necessarily small.16 In light of these observations,
conventional hypothesis tests are likely to be misleading in all of the empirical studies that we
consider in this paper.
Predictive regressions for bond returns are “unbalanced” in the sense that the dependent
variable has little serial correlation whereas the predictors are highly persistent. One might
suppose that inclusion of additional lags solves the problem we point out. This, unfortunately,
is not the case: including further lags of x2t and testing whether the coefficients on current
and lagged values are jointly significant leads to a test with exactly the same small-sample
size distortions as the t-test on x2t alone.17

15 Note that in this setting, x2t is not strictly exogenous, as the correlation between ut and ε2t is θδ. This is the natural implication of a model in which only x1t contains information useful for predicting yt. If instead we insisted on E(ut ε2t) = 0 while θ ≠ 0 (or, more generally, if E(ut ε2t) ≠ θδ), then E(yt|x1t, x1,t−1, x2t, x2,t−1) ≠ E(yt|x1t, x1,t−1), meaning that in effect yt would depend on both x1t and x2t.
16 Reliable interest rate data are only available since about the 1960s, which leads to situations with about 40-50 years of monthly data. Going to higher frequencies—such as weekly or daily—does not increase the effective sample sizes, since it typically increases the persistence of the series and introduces additional noise.

2.3 A bootstrap design for investigating the spanning hypothesis

The above analysis suggests that it is of paramount importance to base inference on the small-sample distributions of the relevant test statistics. We propose to do so using a parametric bootstrap under the spanning hypothesis.18 While some studies (Bekaert et al., 1997; Cochrane and Piazzesi, 2005; Ludvigson and Ng, 2009; Greenwood and Vayanos, 2014) use the bootstrap in a similar context, they typically generate data under the expectations hypothesis. Cochrane and Piazzesi (2005) and Ludvigson and Ng (2009, 2010) also calculated bootstrap confidence intervals under the alternative hypothesis, which in principle gives some indication of the small-sample significance of the coefficients on x2t. However, bootstrapping under the relevant null hypothesis—the spanning hypothesis—is much to be preferred, as it allows us to calculate the small-sample size of conventional tests and generally leads to better numerical accuracy and more powerful tests (Hall and Wilson, 1991; Horowitz, 2001). Our paper is the first to propose a bootstrap to test the spanning hypothesis H0: β2 = 0 by generating bootstrapped samples under the null.
Our bootstrap design is as follows: First, we calculate the first three PCs of observed yields, which we denote
$$x_{1t} = (PC1_t, PC2_t, PC3_t)',$$
along with the weighting vector ŵn for the bond yield with maturity n:
$$i_{nt} = \hat{w}_n' x_{1t} + \hat{v}_{nt}.$$
That is, x1t = Ŵ it, where it = (in1t, . . . , inJt)′ is a J-vector with observed yields at t, and Ŵ = (ŵn1, . . . , ŵnJ)′ is the 3 × J matrix with rows equal to the first three eigenvectors of the variance matrix of it. We use normalized eigenvectors so that Ŵ Ŵ′ = I3.19 Fitted yields can be obtained using ı̂t = Ŵ′x1t. Three factors generally fit the cross section of yields very well, with fitting errors v̂nt (pooled across maturities) that have a standard deviation of only a few basis points.20
17 A closely related problem arises in classical spurious regression; see Hamilton (1994, p. 562).
18 An alternative approach would be a nonparametric bootstrap under the null hypothesis, using for example a moving-block bootstrap to re-sample x1t and x2t. However, Berkowitz and Kilian (2000) found that parametric bootstrap methods such as ours typically perform better than nonparametric methods.
19 We choose the eigenvectors so that the elements in the last column of Ŵ are positive—see also footnote 7.
20 For example, in the case study of Joslin et al. (2014) in Section 3, the standard deviation is 6.5 basis points.

Then we estimate by OLS a VAR(1) for x1t:
$$x_{1t} = \hat{\phi}_0 + \hat{\phi}_1 x_{1,t-1} + e_{1t}, \qquad t = 1, \ldots, T. \qquad (22)$$

This time-series specification for x1t completes our simple factor model for the yield curve.
Though this model does not impose absence of arbitrage, it captures both the dynamic evolution and the cross-sectional dependence of yields. Studies that have documented that such a
simple factor model fits and forecasts the yield curve well include Duffee (2011) and Hamilton
and Wu (2014).
Next we generate 5000 artificial yield data samples from this model, each with length T equal to the original sample length. We first iterate21 on
$$x^*_{1\tau} = \hat{\phi}_0 + \hat{\phi}_1 x^*_{1,\tau-1} + e^*_{1\tau},$$
where e*1τ denotes bootstrap residuals. Then we obtain the artificial yields using
$$i^*_{n\tau} = \hat{w}_n' x^*_{1\tau} + v^*_{n\tau} \qquad (23)$$
for v*nτ ∼ N(0, σv²). The standard deviation of the measurement errors, σv, is set to the sample standard deviation of the fitting errors v̂nt.22
We thus have generated an artificial sample of yields i∗nτ which by construction only three
factors (the elements of x∗1τ ) have any power to predict, but whose covariance and dynamics
are similar to those of the observed data int . Notably, our bootstrapped yields are first-order
Markov—under our bootstrap the current yield curve contains all the information necessary
to forecast future yields.
We likewise fit a VAR(1) to the observed data for the proposed predictors x2t,
$$x_{2t} = \hat{\alpha}_0 + \hat{\alpha}_1 x_{2,t-1} + e_{2t}, \qquad (24)$$
from which we then bootstrap 5000 artificial samples x*2τ in a similar fashion as for x*1τ. The bootstrap residuals (e*1τ′, e*2τ′)′ are drawn from the joint empirical distribution of (e1t′, e2t′)′.23
21 We start the recursion with a draw from the unconditional distribution implied by the estimated VAR for x1t.
22 We can safely assume serially uncorrelated fitting errors, despite some evidence in the literature to the contrary (Adrian et al., 2013; Hamilton and Wu, 2014). Recall that our goal is to investigate the small-sample properties of previously calculated test statistics in an environment in which the null hypothesis holds by construction. Adding serial correlation in v*nτ would only add yet another possible reason why the spanning hypothesis could have been spuriously rejected by earlier researchers.
23 We also experimented with a Monte Carlo design in which e*1τ was drawn from a Student-t dynamic conditional correlation GARCH model (Engle, 2002) fit to the residuals e1t, with similar results to those obtained using independently resampled e1t and e2t.
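A condensed sketch of the yield-bootstrap step (our own implementation outline, not the authors' code; the initial condition is drawn crudely from the observed PCs rather than from the VAR's unconditional distribution, and yields are assumed ordered by increasing maturity):

```python
import numpy as np

def fit_var1(x):
    """OLS VAR(1): x_t = phi0 + phi1 @ x_{t-1} + e_t. Returns (phi0, phi1, residuals)."""
    Y, X = x[1:], np.column_stack([np.ones(len(x) - 1), x[:-1]])
    B, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return B[0], B[1:].T, Y - X @ B

def bootstrap_yields(yields, sigma_v, rng):
    """One artificial T x J yield sample generated under the spanning hypothesis."""
    T, J = yields.shape
    # Loadings: first three eigenvectors of the yield covariance matrix (W-hat, 3 x J)
    evals, evecs = np.linalg.eigh(np.cov(yields, rowvar=False))
    W = evecs[:, ::-1][:, :3].T
    W *= np.sign(W[:, -1])[:, None]      # sign: longest-maturity yield loads positively
    pcs = yields @ W.T                   # x_{1t} = W i_t
    phi0, phi1, e1 = fit_var1(pcs)
    x_star = np.empty((T, 3))
    x_star[0] = pcs[rng.integers(T)]     # crude initial condition (cf. footnote 21)
    idx = rng.integers(0, len(e1), size=T)   # resample VAR residuals with replacement
    for t in range(1, T):
        x_star[t] = phi0 + phi1 @ x_star[t - 1] + e1[idx[t]]
    # Yields = loadings times bootstrapped PCs plus small Gaussian measurement error
    return x_star @ W + sigma_v * rng.standard_normal((T, J))
```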

Using the bootstrapped samples of predictors and yields, we can then investigate the properties of any proposed test statistic involving y*τ+h, x*1τ, and x*2τ in a sample for which the dynamic serial correlations of yields and explanatory variables are similar to those in the actual data but in which by construction the null hypothesis is true that x*2τ has no predictive power for future yields and bond returns.24 In particular, under our bootstrap there are no
unspanned macro risks. To see how to test the spanning hypothesis using the bootstrap,
consider for example a t-test for significance of a parameter in β2 . Denote the t-statistic in
the data by t and the corresponding t-statistic in bootstrap sample i as t∗i . We calculate the
bootstrap p-value as the fraction of samples in which |t∗i | > |t|, and would reject the null if
this is less than, say, five percent. In addition, we can estimate the true size of a conventional
t-test as the fraction of samples in which |t∗i | exceeds the usual asymptotic critical value.
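In code, these last two calculations are simple sample fractions (hypothetical names: `t_data` is the t-statistic in the actual data, `t_boot` collects the 5000 bootstrap t-statistics):

```python
import numpy as np

def bootstrap_pvalue_and_size(t_data, t_boot, crit=1.96):
    """Bootstrap p-value for H0: beta2 = 0, and estimated true size of the
    conventional t-test with asymptotic critical value `crit`."""
    t_boot = np.asarray(t_boot)
    p_value = np.mean(np.abs(t_boot) > abs(t_data))   # share of bootstrap |t*| above |t|
    true_size = np.mean(np.abs(t_boot) > crit)        # rejection rate of conventional test
    return p_value, true_size
```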
One concern about this procedure is related to the well-known fact that under local-to-unity
asymptotics, the bootstrap generally cannot provide a test of the correct nominal size.25 The
reason is that the test statistics are not asymptotically pivotal as their distribution depends on
the nuisance parameters c1 and c2 , which cannot be consistently estimated. For our purpose,
however, this is not a concern for two reasons. First, when the goal (as in this investigation) is
to judge whether the existing evidence against the spanning hypothesis is compelling, we do
not need to be worried about a test that is not conservative enough. Let’s say our bootstrap
procedure does not completely eliminate the size distortions and leads to a test that still
rejects somewhat too often. If such a test nevertheless fails to reject the spanning hypothesis,
we know this could not be attributed to the test being too conservative, but instead accurately
conveys a lack of evidence against the null. Nor is a failure to reject a reflection of a lack of
power. In additional, unreported results we have found that for those coefficients that are
non-zero in our bootstrap DGP, we consistently and strongly reject the null.
Moreover, we can directly evaluate the accuracy of our bootstrap procedure using simulations.
24 For example, if yt+h is an h-period excess return as in equation (8), then in our bootstrap
$$y^*_{\tau+h} = n\, i^*_{n\tau} - h\, i^*_{h\tau} - (n-h)\, i^*_{n-h,\tau+h}$$
$$= n(\hat{w}_n' x^*_{1\tau} + v^*_{n\tau}) - h(\hat{w}_h' x^*_{1\tau} + v^*_{h\tau}) - (n-h)\left( \hat{w}_{n-h}' x^*_{1,\tau+h} + v^*_{n-h,\tau+h} \right)$$
$$= n(\hat{w}_n' x^*_{1\tau} + v^*_{n\tau}) - h(\hat{w}_h' x^*_{1\tau} + v^*_{h\tau}) - (n-h)\left[ \hat{w}_{n-h}' \left( \hat{k}_h + e^*_{1,\tau+h} + \hat{\phi}_1 e^*_{1,\tau+h-1} + \cdots + \hat{\phi}_1^{h-1} e^*_{1,\tau+1} + \hat{\phi}_1^{h} x^*_{1\tau} \right) + v^*_{n-h,\tau+h} \right],$$

which replicates the date t predictable component and the MA(h−1) serial correlation structure of the holding
returns that is both seen in the data and predicted under the spanning hypothesis.
25 This result goes back to Basawa et al. (1991). See also Hansen (1999) as well as Horowitz (2001) and the references therein.


It is straightforward to use the Monte Carlo simulations in Section 2.2.2 to calculate
what the size of our bootstrap procedure would be if applied to a specified parametric model.
In each sample i simulated from a known parametric model, we can: (i) calculate the usual
t-statistic (denoted t̃i ) for testing the null hypothesis that β2 = 0; (ii) estimate the autoregressive models for the predictors by using OLS on that sample; (iii) generate a single bootstrap
simulation using these estimated autoregressive coefficients; (iv) estimate the predictive regression on the bootstrap simulation;26 (v) calculate the t-test of β2 = 0 on this bootstrap
predictive regression, denoted t∗i . We generate 5,000 samples from the maintained model,
repeating steps (i)-(v), and then calculate the value c such that |t∗i | > c in 5% of the samples.
Our bootstrap procedure amounts to the recommendation of rejecting H0 if |t̃i | > c, and we
can calculate from the above simulation the fraction of samples in which this occurs. This
number tells us the true size if we were to apply our bootstrap procedure to the chosen parametric model. This number is reported in the second-to-last row of Table 2. We find in these
settings that our bootstrap has a size above but fairly close to five percent. The size distortion
is always smaller for our bootstrap than for the conventional t-test.
We will repeat the above procedure to estimate the size of our bootstrap test in each of our
empirical applications, taking a model whose true coefficients are those of the VAR estimated
in the sample as if it were the known parametric model, and estimating VARs from data
generated using those coefficients. To foreshadow those results, we will find that the size is
typically quite close to or slightly above five percent. In addition, we find that our bootstrap
procedure has good power properties. The implication is that if our bootstrap procedure
fails to reject the spanning hypothesis, we can safely conclude that the evidence against the
spanning hypothesis in the original data is not persuasive.
A separate but related issue is that least squares typically underestimates the autocorrelation of highly persistent processes due to small-sample bias (Kendall, 1954; Pope, 1990).
Therefore the VAR we use in our bootstrap would typically be less persistent than the true
data-generating process. For this reason, we might expect the bootstrap procedure to be
slightly oversized.27 One way to deal with this issue is to generate samples not from the OLS
estimates φ̂1 and α̂1 but instead use bias-corrected VAR estimates obtained with the bootstrap
adopted by Kilian (1998). We refer to this below as the “bias-corrected bootstrap.”28
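A univariate sketch of this kind of bootstrap bias correction (our own simplification in the spirit of Kilian (1998); the full procedure applies to the VAR coefficient matrices and includes a stationarity adjustment that we replace here with a simple cap):

```python
import numpy as np

def ols_ar1(z):
    """OLS estimates (intercept, slope) of z_t = c + rho * z_{t-1} + e_t."""
    X = np.column_stack([np.ones(len(z) - 1), z[:-1]])
    b, *_ = np.linalg.lstsq(X, z[1:], rcond=None)
    return b

def bias_corrected_ar1(x, n_boot=1000, rng=None):
    """Bootstrap estimate of the small-sample bias of the AR(1) coefficient,
    subtracted from the OLS estimate (capped below one for stationarity)."""
    rng = rng or np.random.default_rng()
    c, rho_hat = ols_ar1(x)
    resid = x[1:] - c - rho_hat * x[:-1]
    rho_stars = np.empty(n_boot)
    for i in range(n_boot):
        z = np.empty(len(x))
        z[0] = x[0]
        e = rng.choice(resid, size=len(x), replace=True)
        for t in range(1, len(x)):
            z[t] = c + rho_hat * z[t - 1] + e[t]
        rho_stars[i] = ols_ar1(z)[1]
    bias = rho_stars.mean() - rho_hat    # typically negative (Kendall-type bias)
    return min(rho_hat - bias, 0.999)    # corrected estimate, kept stationary
```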
26 In this simple Monte Carlo setting, we bootstrap the dependent variable as y*τ = φ̂1 x*1,τ−1 + u*τ, where u*τ is resampled from the residuals in a regression of yt on x1,t−1, and is jointly drawn with ε*1τ and ε*2τ to maintain the same correlation as in the data. By contrast, in all our empirical analysis the bootstrapped dependent variable is obtained from (23) and the definition of yt+h (for example, equation (8)).
27 A test that would have size five percent if the serial correlation was given by ρ̂1 = 0.97 would have size greater than five percent if the true serial correlation is ρ1 = 0.99.
28 We have found in Monte Carlo experiments that the size of the bias-corrected bootstrap is closer to five percent than for the simple bootstrap.

2.4 An alternative robust test for predictability

There is of course a very large literature addressing the problem of HAC inference. This
literature is concerned with accurately estimating the matrix S in (7) but does not address
what we have identified as the key issue, which is the small-sample difference between the
statistics in (9) and (12). We have looked at a number of alternative approaches in terms of
how well they perform in our bootstrap experiments. We found that the most reliable existing
test appears to be the one suggested by Ibragimov and Müller (2010), who proposed a novel
method for testing a hypothesis about a scalar coefficient. The original dataset is divided into
q subsamples and the statistic is estimated separately over each subsample. If these estimates
across subsamples are approximately independent and Gaussian, then a standard t-test with q − 1 degrees of freedom can be carried out to test hypotheses about the parameter. Müller (2014)
provided evidence that this test has excellent size and power properties in regression settings
where standard HAC inference is seriously distorted. Our simulation results, to be discussed
below, show that this test also performs very well in the specific settings that we consider
in this paper, namely inference about predictive power of certain variables for future interest
rates and excess bond returns. Throughout this paper, we report two sets of results for the Ibragimov-Müller (IM) test, setting the number of subsamples q equal to either 8 or 16 (as in Müller, 2014). A notable feature of the IM test is that it allows us to carry out inference that is robust not only with respect to serial correlation but also with respect to parameter instability across subsamples, as we will discuss in Section 5.
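A minimal implementation of the IM test for a single regression coefficient might look as follows (our own sketch; subsamples are consecutive blocks of roughly equal size):

```python
import numpy as np
from scipy import stats

def im_test(y, X, coef_index, q=8):
    """Ibragimov-Mueller test: estimate the coefficient separately in q
    consecutive subsamples, then run a one-sample t-test (q-1 df) on the
    subsample estimates. Returns the t-statistic and two-sided p-value."""
    estimates = []
    for ys, Xs in zip(np.array_split(y, q), np.array_split(X, q)):
        b, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
        estimates.append(b[coef_index])
    estimates = np.asarray(estimates)
    t_stat = np.sqrt(q) * estimates.mean() / estimates.std(ddof=1)
    p_value = 2 * stats.t.sf(abs(t_stat), df=q - 1)
    return t_stat, p_value
```

Because the sampling variability is estimated from the dispersion of the subsample coefficients themselves, no HAC correction is needed.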
We use the same Monte Carlo simulation as before to estimate the size of the IM test in
the simple setting with two scalar predictors. The results are shown in the last row of Table
2. The IM test has close to nominal size in all three settings. The reason is that the IM test
is based on more accurate estimates of the sampling variability of the test statistic by using
variation across subsamples. In this way, it solves the problem of standard error bias that
conventional t-tests are faced with. Note, however, that coefficient bias would be a problem
for the IM test, because it splits the (already small) sample into even smaller samples, which
would magnify the small-sample coefficient bias. It is therefore important to assess whether
the conditions are met for the IM test to work well in practice, which we will do below in our
empirical applications. It will turn out that in our applications the IM test should perform
very well.


3 Economic growth and inflation

In this section we examine the evidence reported by Joslin et al. (2014) (henceforth JPS) that
macro variables may help predict bond returns. We will follow JPS and focus on predictive
regressions as in equation (1) where yt+h is an excess bond return for a one-year holding period
(h = 12), x1t is a vector consisting of a constant and the first three PCs of yields, and x2t
consists of a measure of economic growth (the three-month moving average of the Chicago Fed
National Activity Index, GRO) and of inflation (one-year CPI inflation expectations from the
Blue Chip Financial Forecasts, INF). While JPS also presented model-based evidence in favor
of unspanned macro risks, all of those results stem from the substantial in-sample predictive
power of x2t in these excess return regressions. The sample contains monthly observations
over the period 1985:1-2007:12.

3.1 Predictive power according to adjusted R̄²

JPS found that for the ten-year bond, the adjusted R̄² of regression (1) when x2t is excluded is only 0.20. But when they added x2t, the R̄² increased to 0.37. For the two-year bond, the change is even more striking, with R̄² increasing from 0.14 without the macro variables to 0.48 when they are included. JPS interpreted these adjusted R̄² as strong evidence that macroeconomic variables have predictive power for excess bond returns beyond the information in the yield curve itself, and concluded from this evidence that “macroeconomic risks are unspanned by bond yields” (p. 1203).
However, there are some warning flags for these predictive regressions, which we report in
Table 3. First, the predictors in x2t are very persistent. The first-order sample autocorrelations
for GRO and INF are 0.91 and 0.99, respectively. The yield PCs in x1t, in particular the level
and slope, are of course highly persistent as well, which is a common feature of interest rate
data. Second, to assess strict exogeneity of the predictors we report estimated values for δ, the
correlation between innovations to the predictors, ε1t and ε2t , and the lagged prediction error,
ut.29 The innovations are obtained from the estimated VAR models for x1t and x2t, and the
prediction error is calculated from least squares estimates of equation (1) with yt+h the average
excess bond return for two- through ten-year maturities. For the first PC of yields, the level of
the yield curve, strict exogeneity is strongly violated, as the absolute value of δ is substantial.
Its sizable negative value is due to the mechanical relationship between bond returns and the
level of the yield curve: a positive innovation to PC1 at t raises all yields and mechanically
lowers bond returns from t − h to t. Hence such a violation of strict exogeneity will always
be present in predictive regressions for bond returns that include the current level of the yield
curve. In light of our results in Section 2, these warning flags suggest that small-sample issues
are present, and we will use the bootstrap to address them.

29 While in our theory in Section 2.2 δ was the correlation of the (scalar) innovation of x1t with past prediction errors, here we calculate it for all predictors in x1t and x2t.
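As an illustration of how such a diagnostic can be computed, the following sketch (with illustrative names; the demeaned VAR(1) without intercept and the alignment conventions are simplifying assumptions) correlates each predictor's innovation with the lagged prediction error:

```python
# Rough sketch of the delta diagnostic; names and the VAR(1) specification
# (demeaned, no intercept) are simplifying assumptions.
import numpy as np

def delta_diagnostic(y_fwd, X, h=12):
    """y_fwd[t] is the h-period excess return from t to t+h; X[t] holds the
    predictors at t, with a constant in the first column."""
    T = len(y_fwd)
    beta = np.linalg.lstsq(X, y_fwd, rcond=None)[0]
    u = y_fwd - X @ beta                  # u[t]: error of the forecast made at t
    Z = X[:, 1:] - X[:, 1:].mean(axis=0)  # demeaned predictors (drop constant)
    A = np.linalg.lstsq(Z[:-1], Z[1:], rcond=None)[0]
    eps = Z[1:] - Z[:-1] @ A              # eps[i]: VAR innovation at date i+1
    dates = np.arange(h, T)               # dates where both series are defined
    e, uu = eps[dates - 1], u[dates - h]  # innovation at t vs. error realized at t
    return np.array([np.corrcoef(e[:, j], uu)[0, 1] for j in range(e.shape[1])])
```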
Table 4 shows R̄2 of predictive regressions for the excess bond returns on the two- and
ten-year bond, and for the average excess return across maturities. The first three columns
are for the same data set as was used by JPS.30 The first row in each panel reports the actual
R̄², which for the excess returns on the 2-year and 10-year bonds essentially replicates the results
in JPS.31 The entry R̄²₁ gives the adjusted R̄² for the regression with only x1t as predictors,
and R̄²₂ corresponds to the case when x2t is added to the regression. The second row reports
the mean R̄2 across 5000 replications of the bootstrap described in Section 2.3, that is, the
average value we would expect to see for these statistics in a sample of the size used by JPS in
which x2t in fact has no true ability to predict yt+h but whose serial correlation properties are
similar to those of the observed data. The third row gives 95% confidence intervals, calculated
from the bootstrap distribution of the test statistics.
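For concreteness, the adjusted R̄² statistics just defined can be computed as in the following sketch (illustrative; X1 is assumed to contain the constant and the yield PCs, X2 the additional predictors):

```python
# Sketch of the adjusted R-squared comparison; X1 is assumed to contain the
# constant and the yield PCs, X2 the additional predictors.
import numpy as np

def adj_r2(y, X):
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    T, k = X.shape                            # k counts the constant
    r2 = 1 - ((y - X @ b) ** 2).sum() / ((y - y.mean()) ** 2).sum()
    return 1 - (1 - r2) * (T - 1) / (T - k)

# statistic of interest: adj_r2(y, np.hstack([X1, X2])) - adj_r2(y, X1),
# compared with its bootstrap distribution under the spanning hypothesis
```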
For all predictive regressions, the variability of the adjusted R̄² is very high. Values for R̄²₂
of up to about 63% would not be uncommon, as indicated by the bootstrap confidence intervals.
Most notably, adding the regressors x2t often substantially increases the adjusted R̄², by
23 percentage points or more, although x2t has no predictive power in population by
construction. For the ten-year bond, JPS report an increase of 17 percentage points when
adding macro variables, but our results show that this increase is in fact not statistically
significant at conventional significance levels. For the two-year bond, the increase in R̄2 of 35
percentage points is statistically significant. However, the two-year bond seems to be special
among the yields one could look at. When we look for example at the average excess return
across all maturities, our bootstrap finds no evidence that x2t has predictive power beyond
the information in the yield curve, as reported in the last panel of Table 4.
Since the persistence of x2t is high, it may be important to adjust for small-sample bias
in the VAR estimates. For this reason we also carried out the bias-corrected (BC) bootstrap.
The expected values and 95% confidence intervals are reported in the bottom two rows of
each panel of Table 4. As expected, more serial correlation in the generated data (due to
the bias correction) increases the mean and the variability of the adjusted R̄² and of their
difference. In particular, while the difference R̄²₂ − R̄²₁ for the average excess return regression
was marginally significant at the 10-percent level for the simple bootstrap, it is insignificant
at this level for the BC bootstrap.

30 Their yield data set ends in 2008, with the last observation in their regression corresponding to the excess bond return from 2007:12 to 2008:12.

31 The yield data set of JPS includes the six-month and the one- through ten-year Treasury yields. After calculating annual returns for the two- to ten-year bonds, JPS discarded the six-, eight-, and nine-year yields before fitting PCs and their term structure models. Here, we need the fitted nine-year yield to construct the return on the ten-year bond, so we keep all 11 yield maturities. While our PCs are therefore slightly different than those in JPS, the only noticeable difference is that our adjusted R̄² in the regressions for the two-year bond with yield PCs and macro variables is 0.49 instead of their 0.48.
The right half of Table 4 updates the analysis to include an additional 7 years of data.
As expected under the spanning hypothesis, the value of R̄²₂ that is observed in the data
falls significantly when new data are added. And although the bootstrap 95% confidence
intervals for R̄²₂ − R̄²₁ are somewhat tighter with the longer data set, the conclusion that there
is no statistically significant evidence of added predictability provided by x2t is even more
compelling. For all bond maturities, the increases in adjusted R̄² from adding macro variables
as predictors for excess returns lie comfortably inside the bootstrap confidence intervals.

3.2 Testing the spanning hypothesis

Is the predictive power of macro variables statistically significant? JPS only reported adjusted
R̄² values, but one is naturally interested in formal tests of the spanning hypothesis in their
excess return regressions. The common approach to address the
serial correlation in the residuals due to overlapping observations is to use the HAC standard
errors and test statistics proposed by Newey and West (1987), typically using 18 lags (see
among many others Cochrane and Piazzesi, 2005; Ludvigson and Ng, 2009). In the second
row of Table 5 we report the resulting t-statistic for each coefficient along with the Wald test of
the hypothesis β2 = 0, calculated using Newey-West standard errors with 18 lags. The third
row reports asymptotic p-values for these statistics. According to this popular test, GRO
and IN F appear strongly significant, both individually and jointly. In particular, the Wald
statistic has a p-value below 0.1%.
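For reference, a minimal statsmodels sketch of this conventional test follows; the data frame df and its column names are assumptions for illustration only:

```python
# Sketch of the conventional HAC test in statsmodels; the data frame df and
# its column names are illustrative assumptions.
import statsmodels.api as sm

X = sm.add_constant(df[["PC1", "PC2", "PC3", "GRO", "INF"]])
res = sm.OLS(df["xr"], X).fit(cov_type="HAC", cov_kwds={"maxlags": 18})
print(res.tvalues[["GRO", "INF"]])        # Newey-West t-statistics
print(res.wald_test("GRO = 0, INF = 0"))  # Wald test of beta_2 = 0
```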
We then employ our bootstrap to carry out tests of the spanning hypothesis that account
for the small-sample issues described in Section 2. Again, we use both a simple bootstrap
based on OLS estimates of the VAR parameters, as well as a bias-corrected (BC) bootstrap.
For each, we report five-percent critical values for the t- and Wald statistics, calculated as the
95th percentiles of the bootstrap distribution, as well as bootstrap p-values, i.e., the frequency
of bootstrap replications in which the bootstrapped test statistics are at least as large as in the
data. Using the simple bootstrap, the coefficient on GRO is insignificant, while IN F is still
marginally significant. Using the BC bootstrap, however, the coefficients are both individually
and jointly insignificant, in stark contrast to the conventional HAC tests.
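The mechanics of the bootstrap critical values and p-values are simple; a sketch, assuming boot_stats collects the statistic (e.g., |t| or the Wald statistic) across replications generated under the null:

```python
# Sketch of bootstrap critical values and p-values, assuming boot_stats holds
# the statistic from each replication generated under the spanning hypothesis.
import numpy as np

def bootstrap_inference(stat_data, boot_stats, alpha=0.05):
    crit = np.quantile(boot_stats, 1 - alpha)        # e.g. 95th percentile
    pval = np.mean(boot_stats >= stat_data)          # share at least as large
    return crit, pval
```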
We also report in Table 5 the p-values for the IM test of the individual significance of the
coefficients. The level and slope of the yield curve (PC1 and PC2) are strongly significant
predictors according to both IM tests.32 This will turn out to be a consistent finding in all
the data sets that we will look at—the level and slope of the yield curve appear to be robust
predictors of bond returns, consistent with an old literature going back to Fama and Bliss
(1987) and Campbell and Shiller (1991).33 By contrast, the coefficients on GRO and INF
are not statistically significant at conventional significance levels based on the IM test.
We then use the bootstrap to calculate the properties of the different tests for data with
serial correlation properties similar to those observed in the sample. In particular, we estimate
the true size of the HAC, bootstrap, and IM tests with nominal size of five percent, and report
these in the last four rows of the top panel of Table 5. For the HAC tests, this is simply
the frequency of bootstrap replications in which the t- and Wald-statistics exceed the usual
asymptotic critical values. The results reveal that the true size of the conventional tests is
21-38% instead of the presumed five percent.34 These substantial size distortions are also
reflected in the bootstrap critical values, which far exceed the conventional critical values.
The bootstrap and the IM tests, in contrast, have a size that is estimated to be very close to
five percent, eliminating almost all of the size distortions of the more conventional tests.
As in the originally published work, we study returns with twelve-month holding periods
in all empirical applications of this paper. One might be interested, however, in the magnitude
of the size distortions for one-month bond returns. In such a setting, only the lack of strict
exogeneity of x1t causes problems for small-sample inference, and not the serial correlation in
the prediction errors. In additional, unreported results using the JPS data, we find that in
regressions for one-month excess returns the bootstrap does not reject the spanning hypothesis.
The conventional tests have serious size distortions, which are however smaller than in the
presence of serially correlated errors.35 The implication is that the substantial small-sample
size distortions we reported above for data with overlapping returns are due to a combination
of both problems, serially correlated errors as well as lack of strict exogeneity.
When we add more data to the sample, we again find that the statistical evidence of predictability declines substantially, as seen in the second panel of Table 5. When the data set is
extended through 2013, the HAC test statistics are only marginally significant or insignificant,
even if interpreted assuming the usual asymptotics. Using the bootstrap to take into account
the small-sample size distortions of such tests, these test statistics are far from significant.
Regarding the results for the IM test, we also find in this extended sample that the slope is an
important predictor of excess bond returns, consistent with a large existing literature, whereas
the coefficients on the macro variables are insignificant.

32 The low p-values are also consistent with the conclusion from our unreported Monte Carlo investigation that IM has good power to reject a false null hypothesis.

33 We have also calculated small-sample confidence intervals using the bootstrap, which confirm that the coefficients on PC1 and PC2 are significant.

34 Using the BC bootstrap gives an even higher estimate of the true size of the HAC Wald test, about 45%.

35 Specifically, if we use White standard errors, as Duffee (2013b) and others do for predictions of one-month excess returns, the BC bootstrap estimate of the true size of the Wald test of the spanning hypothesis is 15%.
We conclude that the evidence in JPS on the predictive power of macro variables for yields
and bond returns is not altogether convincing. Nevertheless, JPS noted that theirs is only
one of several papers claiming to have found such evidence. We turn to these studies in the
following sections.

4 Factors of large macro data sets

Ludvigson and Ng (2009, 2010) found that factors extracted from a large macroeconomic
data set are helpful in predicting excess bond returns, above and beyond the information
contained in the yield curve, adding further evidence for the claim of unspanned macro risks
and against the hypothesis of invertibility. Here we revisit this evidence, focusing on the
results in Ludvigson and Ng (2010) (henceforth LN).
LN started with a panel data set of 131 macro variables observed over 1964:1-2007:12 and
extracted eight macro factors using the method of principal components. These factors, which
we will denote by F1 through F8, were then related to future one-year excess returns on two- through five-year Treasury bonds. The authors carried out an extensive specification search
in which they considered many different combinations of the factors along with squared and
cubic terms. They also included in their specification search the bond-pricing factor proposed
by Cochrane and Piazzesi (2005), which is the linear combination of forward rates that best
predicts the average excess return across maturities, and which we denote here by CP. LN’s
conclusion was that macro factors appear to help predict excess returns, even when controlling
for the CP factor. This conclusion is mostly based on comparisons of adjusted R̄2 in regressions
with and without the macro factors and on HAC inference using Newey-West standard errors.
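A minimal sketch of the principal-components extraction is given below (assuming a balanced T × N panel with no missing values; LN's treatment of the raw series is more involved):

```python
# Sketch of principal-components extraction from a balanced T x N panel of
# macro series; `panel` is an assumed numpy array.
import numpy as np

def pc_factors(panel, k=8):
    Z = (panel - panel.mean(axis=0)) / panel.std(axis=0)  # standardize series
    eigval, eigvec = np.linalg.eigh(Z.T @ Z / len(Z))     # N x N covariance
    order = np.argsort(eigval)[::-1]                      # largest first
    return Z @ eigvec[:, order[:k]]                       # scores F1, ..., Fk
```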

4.1 Robust inference about coefficients on macro factors

One feature of LN’s design obscures the evidence relevant for the null hypothesis that is
the focus of our paper. Their null hypothesis is that the CP factor alone provides all the
information necessary to predict bond yields, whereas our null hypothesis of interest is that
the 3 variables (PC1, PC2, PC3) contain all the necessary information. Their regressions in
which CP alone is used to summarize the information in the yield curve could not be used to
test our null hypothesis. For this reason, we begin by examining predictive regressions similar
to those in LN, in which excess bond returns are regressed on three PCs of the yields and all
eight of the LN macro factors. We further leave aside the specification search of LN in order to
focus squarely on hypothesis testing for a given regression specification.36 These regressions
take the same form as (1), where now yt+h is the average one-year excess bond return for
maturities of two through five years, x1t contains a constant and three yield PCs, and x2t
contains eight macro PCs. As before, our interest is in testing the hypothesis H0 : β2 = 0.
The top panel of Table 6 reports regression results for LN’s original sample. The first three
rows show the coefficient estimates, HAC t- and Wald statistics (using Newey-West standard
errors with 18 lags as in LN), and p-values based on the asymptotic distributions of these test
statistics. There are five macro factors that appear to be statistically significant at the ten-percent level, among which three are significant at the five-percent level. The Wald statistic
for H0 far exceeds the critical values for conventional significance levels (the five-percent critical
value for a χ²₈ distribution is 15.5). Table 7 reports adjusted R̄² for the restricted (R̄²₁) and
unrestricted (R̄²₂) regressions, and shows that this measure of fit increases by 10 percentage
points when the macro factors are included. Taken at face value, this evidence suggests that
macro factors have strong predictive power, above and beyond the information contained in
the yield curve, consistent with the overall conclusions of LN.
How robust are these econometric results? We first check the warning flags summarized
in Table 3. As usual, the yield PCs are very persistent. The macro factors differ in their
persistence, but even the most persistent ones only have first-order autocorrelations of around
0.75, so the persistence of x2t is lower than in the data of JPS but still considerable. Again the
first PC of yields strongly violates strict exogeneity for the reasons explained above. Based
on these indicators, it appears that small-sample problems may well distort the results of
conventional inference methods.
To assess the potential importance of these problems in this context, we bootstrapped 5000 data sets of
artificial yields and macro data in which H0 is true in population. The samples each contain
516 observations, which corresponds to the length of the original data sample. We report
results only for the simple bootstrap without bias correction, because the bias in the VAR for
x2t is estimated to be small.
Before turning to the results, it is worth noting the differences between our bootstrap
exercise and the bootstrap carried out by LN. Their bootstrap is designed to test the null
hypothesis that excess returns are not predictable against the alternative that they are
predictable by macro factors and the CP factor. Using this setting, LN produced convincing
evidence that excess returns are predictable, which is fully consistent with all the results in
our paper as well. Our null hypothesis of interest, however, is that excess returns are
predictable only by current yields. While LN also reported results for a bootstrap under the
alternative hypothesis, our bootstrap allows us to provide a more accurate assessment of the
spanning hypothesis, and to estimate the size of conventional tests under the null.

36 We were able to closely replicate the results in LN's tables 4 through 7, and have also applied our techniques to those regressions, which led to qualitatively similar results.
As seen in Table 6, our bootstrap finds that only three coefficients are significant at the
ten-percent level (instead of five using conventional critical values), and one at the five-percent
level (instead of three). While the Wald statistic is significant even compared to the critical
value from the bootstrap distribution, the evidence is weaker than when using the asymptotic
distribution. Table 7 shows that the observed increase in predictive power from adding macro
factors to the regression, measured by R̄2 , would not be implausible if the null hypothesis were
true, as the increase in R̄2 is within the 95% bootstrap confidence interval.
Table 6 also reports p-values for the two IM tests using q = 8 and 16 subsamples. Only
the coefficient on F7 is significant at the 5% level using this test, and then only for q = 16.
The robustly significant predictors are the level and the slope of the yield curve.
We again use the bootstrap to estimate the true size of the different tests with a nominal
size of five percent. The results, which are reported in the bottom four rows of the top panel
of Table 6, reveal that the conventional tests have serious size distortions. The true size of
these t-tests is 9-14 percent, instead of the nominal five percent, and for the Wald test the
size distortion is particularly high with a true size of 34 percent. By contrast, the bootstrap
and IM tests, according to our calculations, appear to have close to correct size.
The failure to reject the null based on the IM tests is a reflection of the fact that the
parameter estimates are often unstable across subsamples. Duffee (2013b, Section 7) has also
noted problems with the stability of the results in Cochrane and Piazzesi (2005) and Ludvigson
and Ng (2010) across different sample periods. To explore this further we repeated our analysis
using the same 1985-2013 sample period that was used in the second panel of Tables 4 and 5.
Note that whereas in the case of JPS this was a strictly larger sample than the original, in
the case of LN our second sample adds data at the end but leaves some out at the beginning.
Reasons for interest in this sample period include the significant break in monetary policy
in the early 1980s, the advantages of having a uniform sample period for comparison across
all the different studies considered in our paper, and investigating robustness of the original
claims in describing data since the papers were originally published.37 We used the macro
data set of McCracken and Ng (2014) to extract macro factors in the same way as LN over
the more recent data.38

37 We also analyzed the full 1964-2013 sample and obtained similar results as over the 1964-2007 sample.
The bottom panels of Tables 6 and 7 display the results. Over the later sample period,
the evidence for the predictive power of macro factors is even weaker. Notably, the Wald
tests reject H0 for both bond maturities (at the ten-percent level for the five-year bond) when
using asymptotic critical values, but are very far from significant when using bootstrap critical
values. The increases in adjusted R̄2 in Table 7 are not statistically significant, and the IM
tests find essentially no evidence of predictive power of the macro factors.
These results imply that the evidence that macro factors have predictive power beyond the
information already contained in yields is much weaker than the results in LN would initially
have suggested. For the original sample used by LN, our bootstrap procedure reveals substantial small-sample size distortions and weakens the statistical significance of the predictive
power of macro variables, while the IM test indicates that only the level and slope are robust
predictors. For the later sample, there is no evidence for unspanned macro risks at all. Our
overall conclusion is that the predictive power of macro variables is much more tenuous than
one would have thought from the published results, and that both small-sample issues and
subsample instability raise serious robustness concerns.

4.2 Robust inference about return-forecasting factors

LN also constructed a single return-forecasting factor using an approach similar to that of Cochrane and
Piazzesi (2005). They regressed the excess bond returns, averaged across the two- through
five-year maturities, on the macro factors plus a cubed term of F1, which they found to
be important. The fitted values of this regression produced their return-forecasting factor,
denoted by H8. The CP factor of Cochrane and Piazzesi (2005) is constructed similarly using
a regression on five forward rates. Adding H8 to a predictive regression with CP substantially
increases the adjusted R̄2 , and leads to a highly significant coefficient on H8. LN emphasized
this result and interpreted it as further evidence that macro variables have predictive power
beyond the information in the yield curve.
Tables 8 and 9 replicate LN's results for these regressions on the macro-based (H8) and yield-based (CP) return-forecasting factors.39 Table 8 shows coefficient estimates and statistical
significance, while Table 9 reports R̄2 . In LN’s data, both CP and H8 are strongly significant
with HAC p-values below 0.1%. Adding H8 to the regression increases the adjusted R̄2 by
9-11 percentage points.
38 Using this macro data set and the same sample period as LN we obtained results that were very similar to those in the original paper, which gives us confidence in the consistency of the macro data set.

39 These results correspond to those in column 9 in tables 4-7 in LN.

How plausible would it have been to obtain these results if macro factors in fact had
no predictive power? In order to answer this question, we adjust our bootstrap design to
handle regressions with return-forecasting factors CP and H8. To this end, we simply add
an additional step in the construction of our artificial data by calculating CP and H8 in each
bootstrap data set as the fitted values from preliminary regressions in the exact same way
that LN did in the actual data. The results in Table 8 show that the bootstrap p-values are
substantially larger than the asymptotic HAC p-values, and H8 is no longer significant at
the 1% level. Table 9 shows that the observed increases in adjusted R̄2 when adding H8 to
the regression are not statistically significant at the five-percent level, with the exception of
the two-year bond maturity where the observed value lies slightly outside the 95% bootstrap
confidence interval.
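The extra step is straightforward; a sketch follows, with illustrative names (the underlying predictors would be forward rates for CP and macro factors for H8):

```python
# Sketch of rebuilding a return-forecasting factor inside each bootstrap
# sample: fitted values from a preliminary regression of the average excess
# return on the underlying predictors; names are illustrative.
import numpy as np

def forecasting_factor(avg_xr, predictors):
    X = np.column_stack([np.ones(len(avg_xr)), predictors])
    b = np.linalg.lstsq(X, avg_xr, rcond=None)[0]
    return X @ b                     # the constructed factor, e.g. CP or H8
```

Regenerating the factor within each artificial sample is what allows the bootstrap distribution to reflect the estimation uncertainty introduced by this preliminary step.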
We report bootstrap estimates of the true size of conventional HAC tests and of our
bootstrap test of the significance of the macro return-forecasting factor—for a nominal size
of five percent—in the bottom two rows of the top panel of Table 8. The size distortions
for conventional t-tests are very substantial: a test with nominal size of five percent based
on asymptotic HAC p-values has a true size of 50-55 percent. In contrast, the size of our
bootstrap test is estimated to be very close to the nominal size.
We also examined the same regressions over the 1985–2013 sample period with results
shown in the bottom panel of Table 8 and in the right half of Table 9. In this sample, the
return-forecasting factors would again both appear to be highly significant based on HAC
p-values, but the size distortions of these tests are again very substantial and the coefficients
on H8 are in fact not statistically significant when using the bootstrap p-values. The observed
increases in R̄2 are squarely in line with what we would expect under the spanning hypothesis,
as indicated by the confidence intervals in Table 9.
This evidence suggests that conventional HAC inference can be particularly problematic
when the predictors are return-forecasting factors. One reason for the substantially distorted
inference is their high persistence—Table 3 shows that both H8 and CP have autocorrelations
that are near 0.8 at first order, and decline only slowly with the lag length. Another reason
is that the return-forecasting factors are constructed in a preliminary estimation step, which
introduces additional estimation uncertainty not accounted for by conventional inference. In
such a setting other econometric methods—preferably a bootstrap exercise designed to assess
the relevant null hypothesis—are needed to accurately carry out inference. For the case at
hand, we conclude that a return-forecasting factor based on macro factors exhibits only very
tenuous predictive power, much weaker than LN's original analysis indicated, and it
disappears completely over a different sample period.


5 Higher-order PCs of yields

Cochrane and Piazzesi (2005) (henceforth CP) documented several striking new facts about
excess bond returns. Focusing on returns with a one-year holding period, they showed that
the same linear combination of forward rates predicts excess returns on different long-term
bonds, that the coefficients of this linear combination have a tent shape, and that predictive
regressions using this one variable deliver R2 of up to 37% (and even up to 44% when lags
are included). Importantly for our context, CP found that the first three PCs of yields—level,
slope, and curvature—did not fully capture this predictability, but that the fourth and fifth
PC were significant predictors of future bond returns (see CP’s Table 4 on p. 147, row 3).
In CP’s data, the first three PCs explain 99.97% of the variation in the five Fama-Bliss
yields (see page 147 of CP), consistent with the long-standing evidence that three factors
are sufficient to almost fully capture the shape and evolution of the yield curve, a result
going back at least to Litterman and Scheinkman (1991). CP found that the other two PCs,
which explain only 0.03% of the variation in yields, are statistically important for predicting
excess bond returns. In particular, the fourth PC appeared “very important for explaining
expected returns” (p. 147). Here we assess the robustness of this finding, by revisiting the null
hypothesis that only the first three PCs predict yields and excess returns and that higher-order
PCs do not contain additional predictive power.
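The variance decomposition behind these numbers is a short calculation; a sketch, assuming yields holds a T × 5 array of the Fama-Bliss yields:

```python
# Sketch of the share of yield variation captured by the first three PCs;
# `yields` is an assumed T x 5 array of the Fama-Bliss yields.
import numpy as np

eigval = np.linalg.eigvalsh(np.cov(yields, rowvar=False))[::-1]
share = eigval[:3].sum() / eigval.sum()      # close to 0.9997 in CP's data
```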
The first 3 rows of Table 10 replicate the relevant results of CP using their original data.
We estimate the predictive regression for the average excess bond return using five PCs as
predictors, and carry out HAC inference in this model using Newey-West standard errors as
in CP. The Wald statistic and the values of R²₁ and R²₂ are identical to those reported by CP. The p-values
indicate that P C4 is very strongly statistically significant, and that the spanning hypothesis
would be rejected.
We then use our bootstrap procedure to obtain robust inference about the relevance of
the predictors PC4 and PC5. In contrast to the results found for JPS in Section 3 and LN in
Section 4, our bootstrap finds that the CP results cannot be accounted for by small-sample
size distortions. The main reason for this is that the t-statistic on PC4 is far too large to
be accounted for by the kinds of factors identified in Section 2. Likewise the increase in R²
reported by CP would be quite implausible under the null hypothesis, falling far outside
the 95% bootstrap interval.
Interestingly, however, the IM tests fail to reject the null hypothesis that β2 = 0. They
indicate that the coefficients on PC4 and PC5 are not statistically significant, and
find only the level and slope to be robust predictors of excess bond returns. The bootstrap
estimates of the size of the IM test, reported in the bottom two rows of the top panel of Table
10, indicate that these tests have close to nominal size, giving us added reason to pay attention
to these results.
Figure 2 provides some intuition about why the IM tests fail to reject. It shows the
coefficients on each predictor across the q = 8 subsamples used in the IM test. The coefficients
are standardized by dividing them by the sample standard deviation across the eight estimated
coefficients for each predictor. Thus, the IM t-statistics, which are reported in the legend of
Figure 2, are equal to the means of the standardized coefficients across subsamples, multiplied
√
by 8. The figure shows that P C1 and P C2 had much more consistent predictive power across
subsamples than P C4, whose coefficient switches signs several times. The strong association
between P C4 and excess returns is mostly driven by the fifth subsample, which starts in
September 1983 and ends in July 1988.40 This illustrates that the IM test, which is designed
to produce inference that is robust to serial correlation, at the same time delivers results that
are robust to sub-sample instability. Only the level and slope have predictive power for excess
bond returns in the CP data that is truly robust in both meanings of the word.
It is worth emphasizing the similarities and differences between the tests of interest to CP
and in our own paper. Their central claim, with which we concur, is that the factor they have
identified is a useful and stable predictor of bond returns. However, this factor is a function
of all five PCs, and the first three of these account for 76% of the variation of the CP factor. Our
claim is that it is the role of PC1-PC3 in the CP factor, and not the addition of PC4 and
PC5, that makes the CP pricing factor a useful and stable predictor of yields.
Thus our test for structural stability differs from those performed in CP and their accompanying online appendix. CP conducted tests of the usefulness of their return-forecasting
factor for predicting returns across different subsamples, a result that we have been able to
reproduce and confirm. Our tests, by contrast, look at stability of the role of each individual
PC. We agree with CP that the first three PCs indeed have a stable predictive relation,
as we confirmed with the IM tests in Table 10 and Figure 2, and in additional, unreported
subsample analysis similar to that in CP’s appendix. On the other hand, the predictive power
of the 4th and 5th PC is much more tenuous, and is insignificant in most of the subsample
periods that CP considered. Duffee (2013b, Section 7) also documented that extending CP’s
sample period to 1952–2010 alters some of their key results, and we have found that over
Duffee’s sample period the predictive power of higher-order PCs disappears.
40 Consistent with this finding, an influence analysis of the predictive power of PC4 indicates that the observations with the largest leverage and influence are almost all clustered in the early and mid 1980s.

In the bottom panel of Table 10 we report results for our preferred sample period, from
1985 to 2013. In this case, the coefficients on PC4 and PC5 are not significant for any method
of inference, and the increases in R² due to the inclusion of higher-order PCs are comfortably inside
the 95% bootstrap intervals. At the same time, the predictive power of the level and slope of
the yield curve is also quite strong in this sample. Although the standard HAC t-test fails to
reject that the coefficient on the level is zero, the same test finds the coefficient on the slope
to be significant, and the IM tests imply that both coefficients are significant.
Since CP used a sample period that ended more than ten years prior to the time of this
writing, we can carry out a true out-of-sample test of our hypothesis of interest. We estimate
the same predictive regressions as in CP, for excess returns on two- to five-year bonds as well
as for the average excess return across bond maturities. The first two columns of Table 11
report the in-sample R2 for the restricted models (using only P C1 to P C3) and unrestricted
models (using all PCs). Then we construct expected future excess returns from these models
using yield PCs41 from 2003:1 through 2012:12, and compare these to realized excess returns
for holding periods ending in 2004:1 through 2013:12. Table 11 shows the resulting root-mean-squared forecast errors (RMSEs). For all bond maturities, the model that leaves out
PC4 and PC5 performs substantially better, with RMSE reductions of around 20 percent.
The test for equal forecast accuracy of Diebold and Mariano (1995) rejects the null, indicating
that the performance gains of the restricted model are statistically significant. Figure 3 shows
the forecast performance graphically, plotting the realized and predicted excess bond returns.
Clearly, neither model predicted future bond returns very well, expecting mostly negative
excess returns over a period when these turned out to be positive. In fact, the unconditional
mean, estimated over the CP sample period, was a better predictor of future returns. This is
evident both from Figure 3, which shows this mean as a horizontal line, and from the RMSEs
in the last column of Table 11. Nevertheless, the unrestricted model implied expected excess
returns that were more volatile and significantly farther off than those of the restricted model.
Restricting the predictive model to use only the level, slope and curvature leads to more stable
and more accurate return predictions.
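A sketch of the out-of-sample comparison follows: RMSEs plus a Diebold-Mariano statistic, implemented here (as is common) as a HAC t-test on the squared-error differential. The error series e_r and e_u (restricted and unrestricted forecast errors) and the lag choice are assumptions:

```python
# Sketch of RMSEs and a Diebold-Mariano test on the loss differential; e_r and
# e_u (restricted/unrestricted forecast errors) and the lag window are assumed.
import numpy as np
import statsmodels.api as sm

def dm_test(e_r, e_u, lags=11):
    d = e_u**2 - e_r**2                           # loss differential
    res = sm.OLS(d, np.ones(len(d))).fit(cov_type="HAC",
                                         cov_kwds={"maxlags": lags})
    return float(res.tvalues[0])                  # approximately N(0,1)

# rmse = lambda e: np.sqrt(np.mean(e**2))
```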
We conclude from both our in-sample and out-of-sample results that the evidence for
predictive power of higher-order factors is tenuous and sample-dependent. To estimate bond
risk premia in a robust way, we recommend using only those predictors that consistently show
a strong association with excess bond returns, namely the level and the slope of the yield
curve.
41 PCs are calculated throughout using the loadings estimated over the original CP sample period.


6 Bond supply

In addition to macro-finance linkages, a separate literature studies the effects of the supply
of bonds on prices and yields. The theoretical literature on the so-called portfolio balance
approach to interest rate determination includes classic contributions going back to Tobin
(1969) and Modigliani and Sutch (1966), as well as more recent work by Vayanos and Vila
(2009) and King (2013). A number of empirical studies document the relation between bond
supply and interest rates during both normal times and over the recent period of near-zero
interest and central bank asset purchases, including Hamilton and Wu (2012), D’Amico and
King (2013), and Greenwood and Vayanos (2014). Both theoretical and empirical work has
convincingly demonstrated that bond supply is related to bond yields and returns.
However, our question here is whether measures of Treasury bond supply contain information that is not already reflected in the yield curve and that is useful for predicting future bond
yields and returns. Is there evidence against the spanning hypothesis based on measures
of time variation in bond supply? At first glance, the answer seems to be yes. Greenwood
and Vayanos (2014) (henceforth GV) found that their measure of bond supply, a maturity-weighted debt-to-GDP ratio, predicts yields and bond returns, and that this holds true even
controlling for yield curve information such as the term spread. Here we investigate whether
this result holds up to closer scrutiny. The sample period used in Greenwood and Vayanos
(2014) is 1952 to 2008.42
To estimate the effects of bond supply on interest rates, GV estimate a broad variety of
different regression specifications with yields and returns of various maturities as dependent
variables. Here we are most interested in those regressions that control for the information
in the yield curve. In the top panel of Table 12 we reproduce their baseline specification
in which the one-year return on a long-term bond is predicted using the one-year yield and
bond supply measure alone. The second panel includes the spread between the long-term and
one-year yield as an additional explanatory variable.43 Like GV we use Newey-West standard
errors with 36 lags.44
If we interpret the HAC t-test using the conventional asymptotic critical values, the
coefficient on bond supply is significant in the baseline regression in the top panel but is no
longer significant at the conventional five-percent level when the yield spread
is included in the regression, as seen in the second panel. But once again there are some
warning flags that raise doubts about the validity of HAC inference. Table 3 shows that the
bond supply variable is extremely persistent (the first-order autocorrelation is 0.998), and the
one-year yield and yield spread are of course highly persistent as well. This leads us to suspect
that the true p-value likely exceeds the purported 5.8%.

42 As in JPS, the authors report a sample end date of 2007 but use yields up to 2008 to calculate one-year bond returns up to the end of 2007.

43 These estimates are in GV's table 5, rows 1 and 6. Their baseline results are also in their table 2.

44 There are small differences between our t-statistics and theirs that we cannot reconcile but which are unimportant for the results.
The bond return that GV used as the dependent variable in these regressions is for a hypothetical long-term bond with a 20-year maturity. We do not apply our bootstrap procedure
here because this bond return is not constructed from the observed yield curve.45 Instead we
rely on IM tests to carry out robust inference. Neither of the IM tests finds the coefficient on
bond supply to be statistically significant. In contrast, the coefficient on the term spread is
strongly significant for the HAC test and both IM tests.
We consider two additional regression specifications that are relevant in this context. The
first controls for information in the yield curve by including, instead of a single term spread,
the first three PCs of observed yields.46 It also subtracts the one-year yield from the bond
return to obtain an excess return. Both of these changes make this specification more
closely comparable to those in the literature. The results are reported in the third panel of
Table 12. Again, the coefficient on bond supply is only marginally significant for the HAC
t-test, and insignificant for the IM tests. In contrast, the coefficients on both PC1 and PC2
are strongly significant for the IM tests.
Finally, we consider the most common specification where yt+h is the one-year excess return,
averaged across two- through five-year maturities. The last panel of Table 12 shows that in
this case, the coefficient on bond supply is insignificant. Table 3 indicates that for this
predictive regression both persistence and lack of strict exogeneity are warning flags, so
we also apply our bootstrap procedure. We find that there is a significant size distortion for
this hypothesis test, and the bootstrap p-value is substantially higher than the conventional p-value. There is robust evidence that PC1 and PC2 have predictive power for bond returns, as
judged by the IM test, whereas this test indicates that bond supply is not a robust predictor.
Overall, the results in GV do not constitute evidence against the spanning hypothesis.
While bond supply exhibits a strong empirical link with interest rates, its predictive power
for future yields and returns seems to be fully captured by the current yield curve.
45 GV obtained this series from Ibbotson Associates.

46 These PCs are calculated from the observed Fama-Bliss yields with one- through five-year maturities.

7 Output gap

Another widely cited study that appears to provide evidence of predictive power of macro
variables for asset prices is Cooper and Priestley (2008) (henceforth CPR). This paper focuses
on one particular macro variable as a predictor of stock and bond returns, namely the output
gap, which is a key indicator of the business cycle. The authors concluded that
“the output gap can predict next year’s excess returns on U.S. government bonds” (p. 2803).
Furthermore, they also claimed that some of this predictive power is independent of the
information in the yield curve, and implicitly rejected the spanning hypothesis (p. 2828).
We investigate the predictive regressions for excess bond returns yt+h using the output gap
at date t−1 (gapt−1 ), measured as the deviation of the Fed’s Industrial Production series from a
quadratic time trend.47 CPR lagged their measure by one month to account for the publication
lag of the Fed’s Industrial Production data. Table 13 shows our results for predictions of the
excess return on the five-year bond; the results for other maturities closely parallel these. The
top two panels correspond to the regression specifications that CPR estimated.48 In the first
specification, the only predictor is gapt−1. The second specification also includes CP̃t, which is
the Cochrane-Piazzesi factor CPt after it is orthogonalized with respect to gapt.49 We obtain
coefficients and R̄² that are close to those published in CPR. We calculate both OLS and
HAC t-statistics, where in the latter case we use Newey-West with 22 lags as described by
CPR. Our OLS t-statistics are very close to the published numbers, and according to these
the coefficient on gapt−1 is highly significant. However, the HAC t-statistics are only about a
third of the OLS t-statistics, and indicate that the coefficient on gap is far from significant,
with p-values above 20%.50
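A sketch of the quadratic detrending follows; whether the series enters in logs, and the recursive re-estimation needed for a true real-time measure, are left as assumptions:

```python
# Sketch of the gap construction: deviation from a fitted quadratic time
# trend; log transformation and real-time (recursive) estimation are assumed
# design choices left to the user.
import numpy as np

def quadratic_detrend(series):
    t = np.arange(len(series), dtype=float)
    X = np.column_stack([np.ones_like(t), t, t**2])
    b = np.linalg.lstsq(X, series, rcond=None)[0]
    return series - X @ b            # gap_t: deviation from the trend
```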
Importantly, neither of the specifications in CPR can be used to test the spanning hypothesis, because the CP factor is first orthogonalized with respect to the output gap. This
defeats the purpose of controlling for yield-curve information, since any predictive power that
is shared by the CP factor and gap will be exclusively attributed to the latter.51 One way to
test the spanning hypothesis is to include CP instead of CP̃, for which we report the results in
the third panel of Table 13. In this case, the coefficient on gap switches to a positive sign, and
its Newey-West t-statistic remains insignificant. In contrast, both CP̃ and CP are strongly
significant in these regressions.
47 We thank Richard Priestley for sending us this real-time measure of the output gap.

48 The relevant results in CPR are in the top panel of their table 9.

49 Note that the predictors CP̃t and gapt−1 are therefore not completely orthogonal.

50 This indicates that CPR may have mistakenly reported the OLS instead of the Newey-West t-statistics.

51 In particular, finding a significant coefficient on gap in a regression with CP̃ cannot justify the conclusion that “gap is capturing risk that is independent of the financial market-based variable CP” (p. 2828).

Our preferred specification includes the first three PCs of the yield curve—see the last
panel of Table 13. Importantly, the predictor gap is highly persistent, with a first-order
autocorrelation coefficient of 0.975, as shown in Table 3, and the level PC is not strictly
exogenous, so we need to worry that conventional t-tests may be substantially oversized. Hence
we also include results for robust inference using the bootstrap and IM tests. The gap variable
has a positive coefficient with a HAC p-value of 19%, which rises to 36% when using our
bootstrap procedure. The conventional HAC t-test is substantially oversized, as evidenced by
the bootstrap critical value, which substantially exceeds the conventional critical value. The IM
tests do not reject the null.
Overall, there is no evidence that the output gap predicts bond returns. The level and in
particular the slope of the yield curve, in contrast, are very strongly associated with future
excess bond returns, in line with our finding throughout this paper.

8 Conclusion

The methods developed in our paper confirm a well-established finding in the earlier literature:
the current level and slope of the yield curve are robust predictors of future bond returns. That
means that in order to test whether any other variables may also help predict bond returns,
the regression needs to include the current level and slope, which are highly persistent lagged
dependent variables. If other proposed predictors are also highly persistent, conventional tests
of their statistical significance can have substantial size distortions, and the R² of the regression
can increase dramatically when such variables are added even if they have no true explanatory
power.
We proposed two strategies for dealing with this problem, the first of which is a simple
bootstrap based on PCs and the second a robust t-test based on subsample estimates proposed
by Ibragimov and Müller (2010). We used these methods to revisit five different widely cited
studies, and found in each case that the evidence that variables other than the current level,
slope and curvature predict excess bond returns is substantially less convincing than the
original research would have led us to believe.
We emphasize that these results do not mean that fundamentals such as inflation, output,
and bond supplies do not matter for interest rates. Instead, our conclusion is that any effects
of these variables can be summarized in terms of the level, slope, and curvature. Once these
three factors are included in predictive regressions, no other variables appear to have robust
forecasting power for future yields or returns. Our results cast doubt on the claims for the
existence of unspanned macro risks and support the view that it is not necessary to look
beyond the information in the yield curve to estimate risk premia in bond markets.

References
Adrian, Tobias, Richard K. Crump, and Emanuel Moench (2013) “Pricing the Term Structure
with Linear Regressions,” Journal of Financial Economics, Vol. 110, pp. 110–138.
Andrews, Donald W. K. (1991) “Heteroskedasticity and Autocorrelation Consistent Covariance Matrix Estimation,” Econometrica, Vol. 59, pp. 817–858.
Bansal, Ravi and Ivan Shaliastovich (2013) “A Long-Run Risks Explanation of Predictability
Puzzles in Bond and Currency Markets,” Review of Financial Studies, Vol. 26, pp. 1–33.
Basawa, Ishwar V, Asok K Mallik, William P McCormick, Jaxk H Reeves, and Robert L Taylor
(1991) “Bootstrapping unstable first-order autoregressive processes,” Annals of Statistics,
pp. 1098–1101.
Bauer, Michael D. and Glenn D. Rudebusch (2015) “Resolving the Spanning Puzzle in Macro-Finance Term Structure Models,” Working Paper 2015-01, Federal Reserve Bank of San
Francisco.
Bekaert, G., R.J. Hodrick, and D.A. Marshall (1997) “On biases in tests of the expectations
hypothesis of the term structure of interest rates,” Journal of Financial Economics, Vol.
44, pp. 309–348.
Berkowitz, Jeremy and Lutz Kilian (2000) “Recent developments in bootstrapping time series,”
Econometric Reviews, Vol. 19, pp. 1–48.
Campbell, John Y. and Robert J. Shiller (1991) “Yield Spreads and Interest Rate Movements:
A Bird’s Eye View,” Review of Economic Studies, Vol. 58, pp. 495–514.
Campbell, John Y and Motohiro Yogo (2006) “Efficient tests of stock return predictability,”
Journal of Financial Economics, Vol. 81, pp. 27–60.
Carrodus, Mark L and David EA Giles (1992) “The exact distribution of R2 when the
regression disturbances are autocorrelated,” Economics Letters, Vol. 38, pp. 375–380.
Cavanagh, Christopher L, Graham Elliott, and James H Stock (1995) “Inference in Models
with Nearly Integrated Regressors,” Econometric theory, Vol. 11, pp. 1131–1147.


Chan, Ngai Hang (1988) “The parameter inference for nearly nonstationary time series,”
Journal of the American Statistical Association, Vol. 83, pp. 857–862.
Chernov, Mikhail and Philippe Mueller (2012) “The Term Structure of Inflation Expectations,” Journal of Financial Economics, Vol. 106, pp. 367–394.
Cochrane, John H. and Monika Piazzesi (2005) “Bond Risk Premia,” American Economic
Review, Vol. 95, pp. 138–160.
Cooper, Ilan and Richard Priestley (2008) “Time-Varying Risk Premiums and the Output
Gap,” Review of Financial Studies, Vol. 22, pp. 2801–2833.
Coroneo, Laura, Domenico Giannone, and Michèle Modugno (2015) “Unspanned Macroeconomic Factors in the Yield Curve,” Journal of Business and Economic Statistics, forthcoming.
D’Amico, Stefania and Thomas B. King (2013) “Flow and stock effects of large-scale treasury
purchases: Evidence on the importance of local supply,” Journal of Financial Economics,
Vol. 108, pp. 425–448.
Deng, Ai (2013) “Understanding Spurious Regression in Financial Economics,” Journal of
Financial Econometrics, pp. 1–29.
Diebold, Francis X. and Robert S. Mariano (1995) “Comparing Predictive Accuracy,” Journal
of Business & Economic Statistics, Vol. 13, pp. 253–263.
Duffee, Gregory R. (2011) “Forecasting with the Term Structure: the Role of No-Arbitrage,”
Working Paper January, Johns Hopkins University.
(2013a) “Bond Pricing and the Macroeconomy,” in Milton Harris, George M. Constantinides, and Rene M. Stulz eds. Handbook of the Economics of Finance, Vol. 2, Part B:
Elsevier, pp. 907–967.
(2013b) “Forecasting Interest Rates,” in Graham Elliott and Allan Timmermann eds.
Handbook of Economic Forecasting, Vol. 2, Part A: Elsevier, pp. 385–426.
Engle, Robert (2002) “Dynamic conditional correlation: A simple class of multivariate generalized autoregressive conditional heteroskedasticity models,” Journal of Business & Economic
Statistics, Vol. 20, pp. 339–350.


Fama, Eugene F. and Robert R. Bliss (1987) “The Information in Long-Maturity Forward
Rates,” The American Economic Review, Vol. 77, pp. 680–692.
Ferson, Wayne E, Sergei Sarkissian, and Timothy T Simin (2003) “Spurious Regressions in
Financial Economics?” Journal of Finance, Vol. 58, pp. 1393–1414.
Greenwood, Robin and Dimitri Vayanos (2014) “Bond Supply and Excess Bond Returns,”
Review of Financial Studies, Vol. 27, pp. 663–713.
Gürkaynak, Refet S. and Jonathan H. Wright (2012) “Macroeconomics and the Term Structure,” Journal of Economic Literature, Vol. 50, pp. 331–367.
Hall, Peter and Susan R. Wilson (1991) “Two Guidelines for Bootstrap Hypothesis Testing,”
Biometrics, Vol. 47, pp. 757–762.
Hamilton, James D. (1994) Time Series Analysis: Princeton University Press.
Hamilton, James D. and Jing Cynthia Wu (2012) “Identification and estimation of Gaussian
affine term structure models,” Journal of Econometrics, Vol. 168, pp. 315–331.
(2014) “Testable Implications of Affine Term Structure Models,” Journal of Econometrics, Vol. 178, pp. 231–242.
Hansen, Bruce E (1999) “The grid bootstrap and the autoregressive model,” Review of Economics and Statistics, Vol. 81, pp. 594–607.
Horowitz, Joel L. (2001) “The Bootstrap,” in J.J. Heckman and E.E. Leamer eds. Handbook
of Econometrics, Vol. 5: Elsevier, Chap. 52, pp. 3159–3228.
Ibragimov, Rustam and Ulrich K. Müller (2010) “t-Statistic Based Correlation and Heterogeneity Robust Inference,” Journal of Business and Economic Statistics, Vol. 28, pp. 453–
468.
Joslin, Scott, Marcel Priebsch, and Kenneth J. Singleton (2014) “Risk Premiums in Dynamic
Term Structure Models with Unspanned Macro Risks,” Journal of Finance, Vol. 69, pp.
1197–1233.
Kendall, M. G. (1954) “A note on bias in the estimation of autocorrelation,” Biometrika, Vol.
41, pp. 403–404.
Kilian, Lutz (1998) “Small-sample confidence intervals for impulse response functions,” Review
of Economics and Statistics, Vol. 80, pp. 218–230.

King, Thomas B. (2013) “A Portfolio-Balance Approach to the Nominal Term Structure,”
Working Paper 2013-18, Federal Reserve Bank of Chicago.
Koerts, Johannes and Adriaan Pieter Johannes Abrahamse (1969) On the theory and application of the general linear model, Rotterdam: Rotterdam University Press.
Lewellen, Jonathan, Stefan Nagel, and Jay Shanken (2010) “A skeptical appraisal of asset
pricing tests,” Journal of Financial Economics, Vol. 96, pp. 175–194.
Litterman, Robert and J. Scheinkman (1991) “Common Factors Affecting Bond Returns,”
Journal of Fixed Income, Vol. 1, pp. 54–61.
Ludvigson, Sydney C. and Serena Ng (2009) “Macro Factors in Bond Risk Premia,” Review
of Financial Studies, Vol. 22, pp. 5027–5067.
Ludvigson, Sydney C and Serena Ng (2010) “A Factor Analysis of Bond Risk Premia,” Handbook of Empirical Economics and Finance, p. 313.
Mankiw, N. Gregory and Matthew D. Shapiro (1986) “Do we reject too often? Small sample
properties of tests of rational expectations models,” Economics Letters, Vol. 20, pp. 139–145.
McCracken, Michael W. and Serena Ng (2014) “FRED-MD: A Monthly Database for Macroeconomic Research,” working paper, Federal Reserve Bank of St. Louis.
Modigliani, Franco and Richard Sutch (1966) “Innovations in interest rate policy,” The American Economic Review, pp. 178–197.
Müller, Ulrich K. (2014) “HAC Corrections for Strongly Autocorrelated Time Series,” Journal
of Business and Economic Statistics, Vol. 32.
Nabeya, Seiji and Bent E Sørensen (1994) “Asymptotic distributions of the least-squares
estimators and test statistics in the near unit root model with non-zero initial value and
local drift and trend,” Econometric Theory, Vol. 10, pp. 937–966.
Newey, Whitney K and Kenneth D West (1987) “A Simple, Positive Semi-definite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix,” Econometrica, Vol. 55,
pp. 703–708.
Phillips, Peter C.B. (1988) “Regression theory for near-integrated time series,” Econometrica, Vol. 56, pp. 1021–1043.


Piazzesi, Monika and Martin Schneider (2007) “Equilibrium Yield Curves,” in NBER Macroeconomics Annual 2006, Volume 21: MIT Press, pp. 389–472.
Pope, Alun L. (1990) “Biases of Estimators in Multivariate Non-Gaussian Autoregressions,”
Journal of Time Series Analysis, Vol. 11, pp. 249–258.
Priebsch, Marcel (2014) “(Un)Conventional Monetary Policy and the Yield Curve,” working
paper, Federal Reserve Board, Washington, D.C.
Rudebusch, Glenn D. and Tao Wu (2008) “A Macro-Finance Model of the Term Structure,
Monetary Policy, and the Economy,” Economic Journal, Vol. 118, pp. 906–926.
Stambaugh, Robert F. (1999) “Predictive regressions,” Journal of Financial Economics, Vol.
54, pp. 375–421.
Stock, James H (1991) “Confidence intervals for the largest autoregressive root in US macroeconomic time series,” Journal of Monetary Economics, Vol. 28, pp. 435–459.
Stock, James H. (1994) “Unit roots, structural breaks and trends,” in Robert F. Engle and
Daniel L. McFadden eds. Handbook of Econometrics, Vol. 4: Elsevier, Chap. 46, pp. 2739–
2841.
Swanson, Eric T (2015) “A macroeconomic model of equities and real, nominal, and defaultable debt,” unpublished manuscript, University of California, Irvine.
Tobin, James (1969) “A general equilibrium approach to monetary theory,” Journal of money,
credit and banking, Vol. 1, pp. 15–29.
Vayanos, Dimitri and Jean-Luc Vila (2009) “A Preferred-Habitat Model of the Term Structure
of Interest Rates,” NBER Working Paper 15487, National Bureau of Economic Research.
Wachter, Jessica A. (2006) “A Consumption-Based Model of the Term Structure of Interest
Rates,” Journal of Financial Economics, Vol. 79, pp. 365–399.
Wright, Jonathan H. (2011) “Term Premia and Inflation Uncertainty: Empirical Evidence
from an International Panel Dataset,” American Economic Review, Vol. 101, pp. 1514–
1534.


Appendix

A First-order asymptotic results

Here we provide details of the claims made in Section 2.1. Let $b = (b_1', b_2')'$ denote the OLS coefficients when the regression includes both $x_{1t}$ and $x_{2t}$, and $b_1^*$ the coefficients from an OLS regression that includes only $x_{1t}$. The SSR from the latter regression can be written

$$SSR_1 = \sum (y_{t+h} - x_{1t}'b_1^*)^2 = \sum (y_{t+h} - x_t'b + x_t'b - x_{1t}'b_1^*)^2 = \sum (y_{t+h} - x_t'b)^2 + \sum (x_t'b - x_{1t}'b_1^*)^2$$

where all summations are over $t = 1, \ldots, T$ and the last equality follows from the orthogonality property of OLS. Thus the difference in SSR between the two regressions is

$$SSR_1 - SSR_2 = \sum (x_t'b - x_{1t}'b_1^*)^2. \qquad (25)$$

It is also not hard to show that the fitted values for the full regression could be calculated as

$$x_t'b = x_{1t}'b_1^* + \tilde{x}_{2t}'b_2 \qquad (26)$$

where $\tilde{x}_{2t}$ denotes the residuals from regressions of the elements of $x_{2t}$ on $x_{1t}$ and $b_2$ can be obtained from an OLS regression of $y_{t+h} - x_{1t}'b_1^*$ on $\tilde{x}_{2t}$.52 Thus from (25) and (26),

$$SSR_1 - SSR_2 = \sum (\tilde{x}_{2t}'b_2)^2.$$

If the true value of $\beta_2$ is zero, then by plugging (1) into the definition of $b_2$ and using the fact that $\sum \tilde{x}_{2t} x_{1t}'\beta_1 = 0$ (which follows from the orthogonality of $\tilde{x}_{2t}$ with $x_{1t}$) we see that

$$b_2 = \left(\sum \tilde{x}_{2t}\tilde{x}_{2t}'\right)^{-1}\left(\sum \tilde{x}_{2t} u_{t+h}\right) \qquad (27)$$

$$SSR_1 - SSR_2 = b_2'\left(\sum \tilde{x}_{2t}\tilde{x}_{2t}'\right)b_2 = \left(T^{-1/2}\sum u_{t+h}\tilde{x}_{2t}'\right)\left(T^{-1}\sum \tilde{x}_{2t}\tilde{x}_{2t}'\right)^{-1}\left(T^{-1/2}\sum \tilde{x}_{2t} u_{t+h}\right). \qquad (28)$$

52 That is, $b_2 = (\sum \tilde{x}_{2t}\tilde{x}_{2t}')^{-1}\left(\sum \tilde{x}_{2t}(y_{t+h} - x_{1t}'b_1^*)\right)$ for $\tilde{x}_{2t}$ defined in (10) and (11). The easiest way to confirm the claim is to show that the residuals implied by (26) satisfy the orthogonality conditions required of the original full regression, namely, that they are orthogonal to $x_{1t}$ and $x_{2t}$. That the residual $y_{t+h} - x_{1t}'b_1^* - \tilde{x}_{2t}'b_2$ is orthogonal to $x_{1t}$ follows from the fact that $y_{t+h} - x_{1t}'b_1^*$ is orthogonal to $x_{1t}$ by the definition of $b_1^*$ while $\tilde{x}_{2t}$ is orthogonal to $x_{1t}$ by the construction of $\tilde{x}_{2t}$. Likewise orthogonality of $y_{t+h} - x_{1t}'b_1^* - \tilde{x}_{2t}'b_2$ to $\tilde{x}_{2t}$ follows directly from the definition of $b_2$. Since $y_{t+h} - x_{1t}'b_1^* - \tilde{x}_{2t}'b_2$ is orthogonal to both $x_{1t}$ and $\tilde{x}_{2t}$, it is also orthogonal to $x_{2t} = \tilde{x}_{2t} + A_T x_{1t}$.

If $x_t$ is stationary and ergodic, then it follows from the Law of Large Numbers that

$$T^{-1}\sum \tilde{x}_{2t}\tilde{x}_{2t}' = T^{-1}\sum x_{2t}x_{2t}' - \left(T^{-1}\sum x_{2t}x_{1t}'\right)\left(T^{-1}\sum x_{1t}x_{1t}'\right)^{-1}\left(T^{-1}\sum x_{1t}x_{2t}'\right) \stackrel{p}{\to} E(x_{2t}x_{2t}') - [E(x_{2t}x_{1t}')][E(x_{1t}x_{1t}')]^{-1}[E(x_{1t}x_{2t}')]$$

which equals $Q$ in (6) in the special case when $E(x_{2t}x_{1t}') = 0$. For the last term in (28) we see from (10) and (11) that

$$T^{-1/2}\sum \tilde{x}_{2t}u_{t+h} = T^{-1/2}\sum x_{2t}u_{t+h} - A_T\left(T^{-1/2}\sum x_{1t}u_{t+h}\right).$$

But if $E(x_{2t}x_{1t}') = 0$, then $\mathrm{plim}(A_T) = 0$, meaning that $T^{-1/2}\sum \tilde{x}_{2t}u_{t+h}$ has the same asymptotic distribution as $T^{-1/2}\sum x_{2t}u_{t+h}$. This will be recognized as $\sqrt{T}$ times the sample mean of a random vector with population mean zero, so from the Central Limit Theorem

$$T^{-1/2}\sum \tilde{x}_{2t}u_{t+h} \stackrel{d}{\to} r \sim N(0, S)$$

implying from (28) that

$$SSR_1 - SSR_2 \stackrel{d}{\to} r'Q^{-1}r.$$

Thus from (3),

$$T(R_2^2 - R_1^2) = \frac{SSR_1 - SSR_2}{\sum (y_{t+h} - \bar{y}_h)^2/T} \stackrel{d}{\to} \frac{r'Q^{-1}r}{\gamma}$$

as claimed in (4). Expression (27) also implies that

$$\sqrt{T}\,b_2 = \left(T^{-1}\sum \tilde{x}_{2t}\tilde{x}_{2t}'\right)^{-1}\left(T^{-1/2}\sum \tilde{x}_{2t}u_{t+h}\right) \stackrel{d}{\to} Q^{-1}r$$

from which (13) follows immediately.

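The algebra in (25)–(28) is the Frisch–Waugh–Lovell decomposition, and it can be checked numerically. The sketch below is our illustration, not part of the paper; the data-generating process and dimensions are arbitrary. It verifies that the $b_2$ from the full regression coincides with the $b_2$ obtained from regressing $y_{t+h} - x_{1t}'b_1^*$ on the residualized regressors $\tilde{x}_{2t}$, and that the SSR identity (25) holds.

```python
# A minimal numerical check of equations (25)-(27), assuming a toy DGP:
# the coefficient on x2 in the full regression equals the coefficient from
# regressing y - x1 @ b1_star on the residualized regressors x2_tilde.
import numpy as np

rng = np.random.default_rng(0)
T, k1, k2 = 200, 2, 3
x1 = rng.standard_normal((T, k1))
x2 = rng.standard_normal((T, k2)) + x1 @ rng.standard_normal((k1, k2))  # correlated with x1
y = x1 @ np.array([1.0, -0.5]) + rng.standard_normal(T)                 # true beta2 = 0

def ls(X, z):
    """OLS coefficients of z on X."""
    return np.linalg.lstsq(X, z, rcond=None)[0]

b = ls(np.hstack([x1, x2]), y)        # full regression, b = (b1', b2')'
b1_star = ls(x1, y)                   # short regression on x1 only
x2_tilde = x2 - x1 @ ls(x1, x2)       # residuals of x2 on x1
b2 = ls(x2_tilde, y - x1 @ b1_star)   # construction in (26)-(27)
assert np.allclose(b[k1:], b2)        # the two versions of b2 coincide

# SSR identity (25): SSR1 - SSR2 equals the sum of squared fitted differences
ssr1 = np.sum((y - x1 @ b1_star) ** 2)
ssr2 = np.sum((y - np.hstack([x1, x2]) @ b) ** 2)
assert np.allclose(ssr1 - ssr2, np.sum((x2_tilde @ b2) ** 2))
print("FWL check passed")
```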
B    Local-to-unity asymptotic results

Here we provide details behind the claims made in Section 2.2. We know from Phillips (1988, Lemma 3.1(d)) that

$$T^{-2}\sum (x_{1t} - \bar{x}_1)^2 \Rightarrow \sigma_1^2\left\{\int_0^1 [J_{c_1}(\lambda)]^2 d\lambda - \left[\int_0^1 J_{c_1}(\lambda)\,d\lambda\right]^2\right\} = \sigma_1^2 \int [J_{c_1}^{\mu}]^2$$

where in the sequel our notation suppresses the dependence on $\lambda$ and lets $\int$ denote integration over $\lambda$ from 0 to 1. The analogous operation applied to the numerator of (18) yields

$$A_T = \frac{T^{-2}\sum (x_{1t} - \bar{x}_1)(x_{2t} - \bar{x}_2)}{T^{-2}\sum (x_{1t} - \bar{x}_1)^2} \Rightarrow \frac{\sigma_1\sigma_2\int J_{c_1}^{\mu}J_{c_2}^{\mu}}{\sigma_1^2\int [J_{c_1}^{\mu}]^2}$$
as claimed in (18). We also have from equation (2.17) in Stock (1994) that

$$T^{-1/2}x_{2,[T\lambda]} \Rightarrow \sigma_2 J_{c_2}(\lambda)$$

where $[T\lambda]$ denotes the largest integer less than $T\lambda$. From the Continuous Mapping Theorem,

$$T^{-1/2}\bar{x}_2 = T^{-3/2}\sum x_{2t} = \int_0^1 T^{-1/2}x_{2,[T\lambda]}\,d\lambda \Rightarrow \sigma_2\int_0^1 J_{c_2}(\lambda)\,d\lambda.$$

Since $\tilde{x}_{2t} = x_{2t} - \bar{x}_2 - A_T(x_{1t} - \bar{x}_1)$,

$$T^{-1/2}\tilde{x}_{2,[T\lambda]} \Rightarrow \sigma_2\left[J_{c_2}(\lambda) - \int_0^1 J_{c_2}(s)\,ds\right] - A\left[J_{c_1}(\lambda) - \int_0^1 J_{c_1}(s)\,ds\right] = \sigma_2 J_{c_2}^{\mu}(\lambda) - AJ_{c_1}^{\mu}(\lambda) = \sigma_2 K_{c_1,c_2}(\lambda)$$

$$T^{-2}\sum \tilde{x}_{2t}^2 = \int_0^1 \{T^{-1/2}\tilde{x}_{2,[T\lambda]}\}^2\,d\lambda \Rightarrow \sigma_2^2\int_0^1 \{K_{c_1,c_2}(\lambda)\}^2\,d\lambda. \qquad (29)$$

Note we can write

$$\begin{pmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \\ u_t \end{pmatrix} = \begin{pmatrix} \sigma_1 & 0 & 0 \\ 0 & \sigma_2 & 0 \\ \delta\sigma_u & 0 & \sqrt{1-\delta^2}\,\sigma_u \end{pmatrix} \begin{pmatrix} v_{1t} \\ v_{2t} \\ v_{0t} \end{pmatrix}$$

where $(v_{1t}, v_{2t}, v_{0t})'$ is a martingale-difference sequence with unit variance matrix. From Lemma 3.1(e) in Phillips (1988) we see

$$T^{-1}\sum \tilde{x}_{2t}u_{t+1} = T^{-1}\sum [x_{2t} - \bar{x}_2 - A_T(x_{1t} - \bar{x}_1)]\left(\delta\sigma_u v_{1,t+1} + \sqrt{1-\delta^2}\,\sigma_u v_{0,t+1}\right) \Rightarrow \delta\sigma_2\sigma_u\int K_{c_1,c_2}\,dW_1 + \sqrt{1-\delta^2}\,\sigma_2\sigma_u\int K_{c_1,c_2}\,dW_0. \qquad (30)$$

Recalling (27), under the null hypothesis the t-test of $\beta_2 = 0$ can be written as

$$\tau = \frac{\sum \tilde{x}_{2t}u_{t+1}}{\{s^2\sum \tilde{x}_{2t}^2\}^{1/2}} = \frac{T^{-1}\sum \tilde{x}_{2t}u_{t+1}}{\{s^2\,T^{-2}\sum \tilde{x}_{2t}^2\}^{1/2}} \qquad (31)$$

where

$$s^2 \stackrel{p}{\to} \sigma_u^2. \qquad (32)$$

Substituting (32), (30), and (29) into (31) produces

$$\tau \Rightarrow \frac{\sigma_2\sigma_u\left(\delta\int K_{c_1,c_2}\,dW_1 + \sqrt{1-\delta^2}\int K_{c_1,c_2}\,dW_0\right)}{\left\{\sigma_u^2\sigma_2^2\int (K_{c_1,c_2})^2\right\}^{1/2}}$$

as claimed in (19).
Last we demonstrate that the variance of the variable $Z_1$ defined in (20) exceeds unity. We can write

$$Z_1 = \frac{\int_0^1 J_{c_2}^{\mu}(\lambda)\,dW_1(\lambda)}{\left\{\int_0^1 [K_{c_1,c_2}(\lambda)]^2\,d\lambda\right\}^{1/2}} - \frac{A\int_0^1 J_{c_1}^{\mu}(\lambda)\,dW_1(\lambda)}{\left\{\int_0^1 [K_{c_1,c_2}(\lambda)]^2\,d\lambda\right\}^{1/2}}. \qquad (33)$$

Consider the denominator in these expressions, and note that

$$\int_0^1 [J_{c_2}^{\mu}(\lambda)]^2\,d\lambda = \int_0^1 [J_{c_2}^{\mu}(\lambda) - AJ_{c_1}^{\mu}(\lambda) + AJ_{c_1}^{\mu}(\lambda)]^2\,d\lambda = \int_0^1 [K_{c_1,c_2}(\lambda)]^2\,d\lambda + \int_0^1 [AJ_{c_1}^{\mu}(\lambda)]^2\,d\lambda > \int_0^1 [K_{c_1,c_2}(\lambda)]^2\,d\lambda$$

where the cross-product term dropped out in the second equation by the definition of $A$ in (18). This means that the following inequality holds for all realizations:

$$\left|\frac{\int_0^1 J_{c_2}^{\mu}(\lambda)\,dW_1(\lambda)}{\left\{\int_0^1 [K_{c_1,c_2}(\lambda)]^2\,d\lambda\right\}^{1/2}}\right| > \left|\frac{\int_0^1 J_{c_2}^{\mu}(\lambda)\,dW_1(\lambda)}{\left\{\int_0^1 [J_{c_2}^{\mu}(\lambda)]^2\,d\lambda\right\}^{1/2}}\right|. \qquad (34)$$

Adapting the argument made in footnote 10, the magnitude inside the absolute-value operator on the right side of (34) can be seen to have a $N(0,1)$ distribution: conditional on the path of $J_{c_2}^{\mu}$, which is driven by $W_2$ and is therefore independent of $W_1$, the numerator is normal with mean zero and variance $\int_0^1 [J_{c_2}^{\mu}(\lambda)]^2\,d\lambda$. Inequality (34) thus establishes that the first term in (33) has a variance that is greater than unity. The second term in (33) turns out to be uncorrelated with the first, and hence contributes additional variance to $Z_1$, although we have found that the first term appears to be the most important factor.⁵³ In sum, these arguments show that $\mathrm{Var}(Z_1) > 1$.

⁵³ These claims are based on moments of the respective functionals as estimated from discrete approximations to the Ornstein-Uhlenbeck processes.

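As footnote 53 indicates, the limiting functionals can be studied through discrete approximations to the Ornstein-Uhlenbeck processes. The sketch below is our illustration, not the authors' code; the values of c1, c2, and δ and the discretization settings are arbitrary assumptions. It draws from the limiting distribution of τ in (19) and reports the implied asymptotic size of a nominal 5% t-test.

```python
# Monte Carlo draws from the limiting distribution of tau in equation (19),
# using Euler-discretized Ornstein-Uhlenbeck processes (cf. footnote 53).
# c1, c2, delta and the discretization settings are illustrative assumptions.
import numpy as np

def tau_limit_draws(c1=-1.0, c2=-1.0, delta=0.8, n_steps=1000, n_sims=2000, seed=1):
    rng = np.random.default_rng(seed)
    dt = 1.0 / n_steps
    taus = np.empty(n_sims)
    for i in range(n_sims):
        dW1, dW2, dW0 = rng.standard_normal((3, n_steps)) * np.sqrt(dt)
        J1 = np.zeros(n_steps)
        J2 = np.zeros(n_steps)
        for t in range(1, n_steps):                  # dJ = c J dt + dW
            J1[t] = J1[t - 1] * (1 + c1 * dt) + dW1[t]
            J2[t] = J2[t - 1] * (1 + c2 * dt) + dW2[t]
        J1mu, J2mu = J1 - J1.mean(), J2 - J2.mean()  # demeaned processes
        A = np.sum(J1mu * J2mu) / np.sum(J1mu ** 2)  # limit of A_T, cf. (18)
        K = J2mu - A * J1mu                          # K_{c1,c2}
        num = (delta * np.sum(K[:-1] * dW1[1:])
               + np.sqrt(1 - delta ** 2) * np.sum(K[:-1] * dW0[1:]))
        taus[i] = num / np.sqrt(np.sum(K ** 2) * dt)
    return taus

taus = tau_limit_draws()
print("asymptotic size of nominal 5% t-test:", np.mean(np.abs(taus) > 1.96))
```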
Table 1: Simulation study: size distortions of conventional t-test

                        δ = 0                 δ = 0.8               δ = 1
T                 ρ=0.9  0.99    1      0.9   0.99    1       0.9   0.99    1
50   simulated     5.1    4.9   5.1     8.1   11.1   11.4    10.2   15.1   15.9
50   asymptotic    4.5    4.4   4.6     8.4   11.0   11.5    10.5   14.9   15.4
100  simulated     4.8    5.1   5.2     7.1   11.4   12.2     8.4   15.2   16.2
100  asymptotic    4.5    4.7   4.8     7.0   11.1   11.9     8.4   15.0   16.0
200  simulated     5.0    5.1   5.0     6.1   11.1   12.4     6.8   14.5   16.5
200  asymptotic    4.9    4.9   4.9     6.2   10.7   12.0     7.2   14.6   16.6
500  simulated     5.0    5.0   5.0     5.4    8.9   12.2     5.7   11.6   17.0
500  asymptotic    5.0    4.8   4.9     5.4    9.2   12.3     5.8   11.6   16.9

True size (in percentage points) of a conventional t-test of H0: β2 = 0 with nominal size of 5%, in simulated small samples and according to the local-to-unity asymptotic distribution. δ determines the degree of endogeneity, i.e., the correlation of x1t with the lagged error term ut. The persistence of the predictors is ρ1 = ρ2 = ρ. For details on the simulation study refer to the main text.

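A finite-sample simulation of the kind summarized in Table 1 can be sketched as follows. This is our reconstruction under stated assumptions (unit innovation variances, an intercept in the regression, and plain OLS standard errors for the one-step horizon); the exact design follows the main text.

```python
# Finite-sample size of the conventional t-test of H0: beta2 = 0, in the
# spirit of Table 1. Sketch only: x1 and x2 are AR(1) with persistence rho,
# and delta is the correlation between x1's innovation and the prediction
# error u; exact simulation settings follow the paper's main text.
import numpy as np

def t_test_size(T=100, rho=0.99, delta=0.8, beta1=1.0, n_sims=5000, seed=2):
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        v = rng.standard_normal((3, T + 1))
        u = delta * v[0] + np.sqrt(1 - delta ** 2) * v[2]   # corr(eps1, u) = delta
        x1 = np.zeros(T + 1)
        x2 = np.zeros(T + 1)
        for t in range(1, T + 1):
            x1[t] = rho * x1[t - 1] + v[0, t]
            x2[t] = rho * x2[t - 1] + v[1, t]
        y = beta1 * x1[:-1] + u[1:]                          # beta2 = 0 under the null
        X = np.column_stack([np.ones(T), x1[:-1], x2[:-1]])
        b = np.linalg.lstsq(X, y, rcond=None)[0]
        e = y - X @ b
        s2 = e @ e / (T - X.shape[1])
        se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[2, 2])
        rejections += abs(b[2] / se) > 1.96
    return rejections / n_sims

print(t_test_size())   # roughly 0.11 for T = 100, rho = 0.99, delta = 0.8
```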
Table 2: Simulation study: coefficient bias and standard error bias

                         δ = 1, θ = 0       δ = 0.8, θ = 0     δ = 0.8, θ = 0.8
                         β1       β2        β1       β2        β1       β2
True coefficient         0.990    0.000     0.990    0.000     0.990    0.000
Mean estimate            0.921    0.000     0.936    0.000     0.935    0.000
Coefficient bias        -0.069    0.000    -0.054    0.000    -0.055    0.000
True standard error      0.053    0.055     0.049    0.049     0.082    0.083
Mean OLS std. error      0.038    0.038     0.038    0.038     0.064    0.064
Standard error bias     -0.015   -0.017    -0.011   -0.011    -0.018   -0.019
Size of t-test               0.155              0.111              0.112
Size of bootstrap test       0.080              0.072              0.067
Size of IM test              0.047              0.047              0.045

Analysis of bias in estimated coefficients and standard errors for regressions in small samples with T = 100 and ρ1 = ρ2 = 0.99, as well as estimated size of conventional t-test, bootstrap, and IM tests. For details on the simulation study refer to the main text.


Table 3: Warning flags for predictive regressions in published studies

Study  Predictor   ACF(1)   ACF(6)   ACF(12)      δ
JPS    PC1          0.974    0.840    0.696    -0.368
       PC2          0.973    0.774    0.467    -0.048
       PC3          0.849    0.380    0.216     0.202
       GRO          0.910    0.507    0.260    -0.122
       INF          0.986    0.897    0.815    -0.189
LN     PC1          0.984    0.904    0.821    -0.342
       PC2          0.944    0.734    0.537     0.137
       PC3          0.601    0.254    0.113     0.091
       F1           0.766    0.381    0.088     0.100
       F2           0.748    0.454    0.188     0.160
       F3          -0.233    0.035   -0.085     0.044
       F4           0.455    0.207    0.151     0.189
       F5           0.361    0.207    0.171     0.169
       F6           0.422    0.476    0.272     0.058
       F7          -0.111    0.134    0.054    -0.079
       F8           0.225    0.087    0.093     0.048
       CP           0.773    0.531    0.377
       H8           0.777    0.627    0.331
CP     PC1          0.980    0.880    0.767    -0.358
       PC2          0.940    0.721    0.539     0.157
       PC3          0.592    0.237    0.110     0.090
       PC4          0.425    0.137    0.062    -0.020
       PC5          0.227    0.157   -0.135     0.121
       CP           0.767    0.522    0.361
GV     PC1          0.988    0.925    0.860    -0.312
       PC2          0.942    0.722    0.521     0.147
       PC3          0.582    0.233    0.094     0.105
       supply       0.998    0.990    0.974     0.035
CPR    PC1          0.986    0.917    0.841    -0.338
       PC2          0.939    0.712    0.528     0.179
       PC3          0.590    0.262    0.153     0.055
       gap          0.975    0.750    0.475    -0.193

Measures of persistence and lack of strict exogeneity of the predictors. For the persistence we report autocorrelations of the predictors at lags of one, six, and twelve months. Lack of strict exogeneity is measured by δ, the correlation between the innovations to the predictors, ε1t or ε2t, and the lagged prediction error, ut. The innovations are obtained from estimated VAR(1) models for x1t (the principal components of yields) and x2t (the other predictors). The forecast error ut is calculated from a predictive regression of the average excess bond return across maturities. The predictors are described in the main text. The data and sample are the same as in the published studies. These are JPS (Joslin et al., 2014), LN (Ludvigson and Ng, 2010), CP (Cochrane and Piazzesi, 2005), GV (Greenwood and Vayanos, 2014), and CPR (Cooper and Priestley, 2008).


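A δ of the kind reported in Table 3 can be computed along the following lines. The sketch below is our reading of the table notes; the inputs X and rx are hypothetical arrays, and the timing convention used to pair innovations with lagged prediction errors is an assumption rather than the paper's exact code.

```python
# Sketch of the delta measure in Table 3: correlation between each predictor's
# VAR(1) innovation and the lagged prediction error from a regression of
# average excess returns on the predictors. X is a T x k array of predictors,
# rx a length-T array with rx[t] the excess return realized over t to t+h.
# The alignment convention below is an assumption, not the paper's exact code.
import numpy as np

def delta_measure(X, rx):
    T = len(rx)
    # VAR(1) innovations for the predictors
    Z = np.hstack([np.ones((T - 1, 1)), X[:-1]])
    eps = X[1:] - Z @ np.linalg.lstsq(Z, X[1:], rcond=None)[0]
    # prediction errors from the return regression
    W = np.hstack([np.ones((T, 1)), X])
    u = rx - W @ np.linalg.lstsq(W, rx, rcond=None)[0]
    # innovation at t+1 paired with the prediction error made at t
    return [np.corrcoef(eps[:, j], u[:-1])[0, 1] for j in range(X.shape[1])]
```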
Table 4: Joslin-Priebsch-Singleton: R² in excess return regressions

                          Original sample: 1985–2008                      Later sample: 1985–2013
                          R̄1²          R̄2²          R̄2²−R̄1²             R̄1²          R̄2²          R̄2²−R̄1²
Two-year bond
Data                      0.14         0.49         0.35                 0.12         0.28         0.16
Simple bootstrap          0.30         0.36         0.06                 0.26         0.32         0.06
                     (0.06, 0.58) (0.11, 0.63) (-0.00, 0.22)        (0.05, 0.51) (0.09, 0.56) (-0.00, 0.21)
BC bootstrap              0.38         0.44         0.06                 0.32         0.38         0.06
                     (0.07, 0.72) (0.13, 0.75) (-0.00, 0.23)        (0.07, 0.60) (0.12, 0.64) (-0.00, 0.21)
Ten-year bond
Data                      0.20         0.37         0.17                 0.20         0.28         0.08
Simple bootstrap          0.26         0.32         0.07                 0.24         0.30         0.06
                     (0.07, 0.48) (0.12, 0.54) (-0.00, 0.23)        (0.06, 0.46) (0.11, 0.51) (-0.00, 0.21)
BC bootstrap              0.27         0.34         0.08                 0.26         0.33         0.07
                     (0.06, 0.50) (0.12, 0.57) (-0.00, 0.27)        (0.06, 0.49) (0.11, 0.55) (-0.00, 0.23)
Average two- through ten-year bonds
Data                      0.19         0.39         0.20                 0.17         0.25         0.08
Simple bootstrap          0.28         0.35         0.07                 0.24         0.30         0.06
                     (0.08, 0.50) (0.12, 0.56) (-0.00, 0.23)        (0.05, 0.46) (0.10, 0.52) (-0.00, 0.21)
BC bootstrap              0.30         0.37         0.07                 0.27         0.33         0.07
                     (0.06, 0.55) (0.13, 0.61) (-0.00, 0.26)        (0.05, 0.50) (0.12, 0.56) (-0.00, 0.24)

Adjusted R̄² for regressions of annual excess bond returns on three PCs of the yield curve (R̄1²) and on three yield PCs together with the macro variables GRO and INF (R̄2²), as well as the difference in adjusted R̄². GRO is the three-month moving average of the Chicago Fed National Activity Index, and INF is one-year expected inflation measured by Blue Chip inflation forecasts. The data used for the left half of the table is the original data set of Joslin et al. (2014); the data used in the right half is extended to December 2013. The last panel shows results for the average excess bond return for all bond maturities from two to ten years. The first row of each panel reports the values of the statistics in the original data. The remaining rows report the bootstrap small-sample means and 95%-confidence intervals (in parentheses). The bootstrap simulations are obtained under the null hypothesis that the macro variables have no predictive power. The bootstrap procedure for the simple bootstrap and the bias-corrected (BC) bootstrap is described in the main text.


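The "simple bootstrap" entries in Table 4 are generated under the null that the macro variables carry no predictive power beyond the yield PCs. A schematic version of such a procedure is sketched below; it is a stand-in under assumed design choices (iid residual resampling, a VAR(1) for the predictors, a small r2 helper), not the paper's exact algorithm, which is described in the main text.

```python
# Schematic bootstrap under the spanning null: predictors evolve according to
# an estimated VAR(1) with resampled residuals, while returns are generated
# from a model in which only the yield PCs (x1) predict. All design choices
# here (iid resampling, no bias correction) are illustrative assumptions.
import numpy as np

def r2(X, y):
    W = np.hstack([np.ones((len(y), 1)), X])
    e = y - W @ np.linalg.lstsq(W, y, rcond=None)[0]
    return 1.0 - e.var() / y.var()

def bootstrap_r2_diff(x1, x2, rx, n_boot=1000, seed=3):
    """Bootstrap distribution of R2(x1, x2) - R2(x1) under the null."""
    rng = np.random.default_rng(seed)
    T, k1 = x1.shape
    Z = np.hstack([x1, x2])
    W1 = np.hstack([np.ones((T, 1)), x1])
    g = np.linalg.lstsq(W1, rx, rcond=None)[0]      # null model: PCs only
    u = rx - W1 @ g
    L = np.hstack([np.ones((T - 1, 1)), Z[:-1]])
    Phi = np.linalg.lstsq(L, Z[1:], rcond=None)[0]  # VAR(1) for all predictors
    eps = Z[1:] - L @ Phi
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        Zb = np.empty_like(Z)
        Zb[0] = Z[0]
        for t in range(1, T):
            shock = eps[rng.integers(T - 1)]
            Zb[t] = np.concatenate(([1.0], Zb[t - 1])) @ Phi + shock
        rxb = np.hstack([np.ones((T, 1)), Zb[:, :k1]]) @ g + u[rng.integers(T, size=T)]
        diffs[b] = r2(Zb, rxb) - r2(Zb[:, :k1], rxb)
    return diffs
```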
Table 5: Joslin-Priebsch-Singleton: inference in excess return regressions

                          PC1     PC2     PC3     GRO     INF      Wald
Original sample: 1985–2008
Coefficient               1.064   1.988   3.342  -2.174  -6.494
HAC statistic             5.603   4.671   0.865   2.438   4.232   25.476
HAC p-value               0.000   0.000   0.388   0.015   0.000    0.000
Bootstrap 5% c.v.                                 3.203   3.950   24.410
Bootstrap p-value                                 0.129   0.038    0.046
BC bootstrap 5% c.v.                              3.460   4.286   27.664
BC bootstrap p-value                              0.140   0.052    0.061
IM q = 8                  0.002   0.040   0.002   0.563   0.940
IM q = 16                 0.003   0.002   0.063   0.244   0.500
Estimated size of tests
HAC                                               0.209   0.285    0.382
Simple bootstrap                                  0.058   0.067    0.069
IM q = 8                                          0.049   0.054
IM q = 16                                         0.038   0.033
Later sample: 1985–2013
Coefficient               0.523   1.865   4.330  -0.271  -3.767
HAC statistic             2.524   3.755   1.345   0.323   2.408    5.799
HAC p-value               0.012   0.000   0.180   0.747   0.017    0.055
Bootstrap 5% c.v.                                 3.332   3.665   22.786
Bootstrap p-value                                 0.820   0.178    0.376
BC bootstrap 5% c.v.                              3.420   3.919   24.471
BC bootstrap p-value                              0.838   0.206    0.417
IM q = 8                  0.275   0.030   0.003   0.550   0.325
IM q = 16                 0.304   0.007   0.139   0.393   0.934

Predictive regressions for annual excess bond returns, averaged over two- through ten-year bond maturities, using yield PCs and macro variables (which are described in the notes to Table 4). The data used for the top panel is the original data set of Joslin et al. (2014); the data used for the bottom panel is extended to December 2013. HAC statistics and p-values are calculated using Newey-West standard errors with 18 lags. The column "Wald" reports results for the χ² test that GRO and INF have no predictive power; the other columns report results for individual t-tests. We obtain bootstrap distributions of the test statistics under the null hypothesis that GRO and INF have no predictive power. Critical values (c.v.'s) are the 95th percentile of the bootstrap distribution of the test statistics, and p-values are the frequency of bootstrap replications in which the test statistics are at least as large as in the data. See the text for a description of the experimental design for the simple bootstrap and the bias-corrected (BC) bootstrap. We also report p-values for t-tests using the methodology of Ibragimov and Müller (2010) (IM), splitting the sample into either 8 or 16 blocks. The last four rows in the first panel report bootstrap estimates of the true size of different tests with 5% nominal coverage, calculated as the frequency of bootstrap replications in which the test statistics exceed their critical values, except for the size of the bootstrap test, which is calculated as described in the main text. p-values below 5% are emphasized with bold face.

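The Ibragimov-Müller entries in these tables come from splitting the sample into q blocks and testing whether the block-by-block coefficient estimates have mean zero. The compact sketch below is our implementation; equal-size blocks and an intercept in each block regression are assumptions.

```python
# Sketch of the Ibragimov-Mueller (2010) t-test used in the tables: estimate
# the coefficient of interest separately in q subsamples, then run an ordinary
# one-sample t-test on the q estimates with q - 1 degrees of freedom.
import numpy as np
from scipy import stats

def im_test(y, X, j, q=8):
    """Two-sided p-value for H0: coefficient on column j of X is zero."""
    estimates = []
    for idx in np.array_split(np.arange(len(y)), q):
        Xb = np.hstack([np.ones((len(idx), 1)), X[idx]])
        estimates.append(np.linalg.lstsq(Xb, y[idx], rcond=None)[0][j + 1])
    estimates = np.asarray(estimates)
    t = np.sqrt(q) * estimates.mean() / estimates.std(ddof=1)
    return 2 * stats.t.sf(abs(t), df=q - 1)
```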
Table 6: Ludvigson-Ng: predicting excess returns using PCs and macro factors

                    PC1     PC2     PC3     F1      F2      F3      F4      F5      F6      F7      F8      Wald
A. Original sample: 1964–2007
Coefficient         0.136   2.052  -5.014   0.742   0.146  -0.072  -0.528  -0.321  -0.576  -0.401   0.551
HAC statistic       1.552   2.595   2.724   1.855   0.379   0.608   1.912   1.307   2.220   2.361   3.036  42.084
HAC p-value         0.121   0.010   0.007   0.064   0.705   0.543   0.056   0.192   0.027   0.019   0.003   0.000
Bootstrap 5% c.v.                           2.572   2.580   2.241   2.513   2.497   2.622   2.446   2.242  29.686
Bootstrap p-value                           0.140   0.761   0.594   0.128   0.301   0.092   0.057   0.010   0.009
IM q = 8            0.001   0.001   0.225   0.098   0.558   0.579   0.088   0.703   0.496   0.085   0.324
IM q = 16           0.000   0.052   0.813   0.228   0.317   0.771   0.327   0.358   0.209   0.027   0.502
Estimated size of tests
HAC                                         0.132   0.131   0.097   0.124   0.126   0.134   0.113   0.086   0.335
Bootstrap                                   0.055   0.058   0.053   0.061   0.055   0.053   0.049   0.046   0.061
IM q = 8                                    0.050   0.051   0.051   0.049   0.049   0.052   0.050   0.042
IM q = 16                                   0.048   0.051   0.051   0.050   0.051   0.045   0.055   0.046
B. Later sample: 1985–2013
Coefficient         0.157   1.182   2.725  -0.274   0.651   0.147  -0.488   0.022   0.334   0.035  -0.075
HAC statistic       1.506   1.111   0.682   0.267   1.652   0.690   1.162   0.038   1.866   0.153   0.423  13.766
HAC p-value         0.133   0.268   0.496   0.789   0.099   0.491   0.246   0.969   0.063   0.878   0.673   0.088
Bootstrap 5% c.v.                           2.908   2.817   2.516   2.667   2.798   2.468   2.365   2.298  37.267
Bootstrap p-value                           0.844   0.224   0.587   0.370   0.973   0.136   0.892   0.718   0.495
IM q = 8            0.014   0.005   0.068   0.511   0.139   0.537   0.899   0.767   0.144   0.923   0.398
IM q = 16           0.024   0.185   0.788   0.636   0.831   0.923   0.187   0.570   0.882   0.703   0.239

Predictive regressions for annual excess bond returns, averaged over two- through five-year bond maturities, using yield PCs and factors from a large data set of macro variables, as in Ludvigson and Ng (2010). The top panel shows the results for the original data set used by Ludvigson and Ng (2010); the bottom panel uses a data sample that starts in 1985 and ends in 2013. The bootstrap is a simple bootstrap without bias correction. For a description of the statistics in each row, see the notes to Table 5. p-values below 5% are emphasized with bold face.

Table 7: Ludvigson-Ng: R̄² for predicting excess returns using PCs and macro factors

                          R̄1²            R̄2²            R̄2²−R̄1²
Original sample: 1964–2007
Data                      0.25           0.35           0.10
Bootstrap                 0.20           0.24           0.03
                     (0.05, 0.39)   (0.08, 0.42)   (-0.00, 0.11)
Later sample: 1985–2013
Data                      0.14           0.18           0.04
Bootstrap                 0.26           0.29           0.03
                     (0.05, 0.49)   (0.08, 0.51)   (-0.01, 0.14)

Adjusted R̄² for regressions of annual excess bond returns, averaged over two- through five-year bonds, on three PCs of the yield curve (R̄1²) and on three yield PCs together with eight macro factors (R̄2²), as well as the difference in R̄². The top panel shows the results for the original data set used by Ludvigson and Ng (2010); the bottom panel uses a data sample that starts in 1985 and ends in 2013. For each data sample we report the values of the statistics in the data, and the mean and 95%-confidence intervals (in parentheses) of the bootstrap small-sample distributions of these statistics. The bootstrap simulations are obtained under the null hypothesis that the macro variables have no predictive power. The bootstrap procedure, which does not include bias correction, is described in the main text.


Table 8: Ludvigson-Ng: predicting excess returns using return-forecasting factors

                    Two-year bond    Three-year bond   Four-year bond    Five-year bond
                    CP      H8       CP      H8        CP      H8        CP      H8
Original sample: 1964–2007
Coefficient         0.335   0.331    0.645   0.588     0.955   0.776     1.115   0.937
HAC t-statistic     4.429   4.331    4.666   4.491     4.765   4.472     4.371   4.541
HAC p-value         0.000   0.000    0.000   0.000     0.000   0.000     0.000   0.000
Bootstrap 5% c.v.           3.809            3.799             3.874             3.898
Bootstrap p-value           0.022            0.015             0.017             0.014
Estimated size of tests
HAC                         0.514            0.538             0.545             0.539
Bootstrap                   0.047            0.055             0.057             0.050
Later sample: 1985–2013
Coefficient         0.349   0.371    0.661   0.695     1.101   0.895     1.320   1.021
HAC t-statistic     2.644   3.348    2.527   3.409     3.007   3.340     2.946   3.270
HAC p-value         0.009   0.001    0.012   0.001     0.003   0.001     0.003   0.001
Bootstrap 5% c.v.           3.890            4.014             4.026             3.942
Bootstrap p-value           0.103            0.116             0.124             0.128

Predictive regressions for annual excess bond returns, using return-forecasting factors based on yield-curve information (CP) and macro information (H8), as in Ludvigson and Ng (2010). The first panel shows the results for the original data set used by Ludvigson and Ng (2010); the second panel uses a data sample that starts in 1985 and ends in 2013. HAC t-statistics and p-values are calculated using Newey-West standard errors with 18 lags. We obtain bootstrap distributions of the t-statistics under the null hypothesis that macro factors and hence H8 have no predictive power. We also report bootstrap critical values (c.v.'s) and p-values, as well as estimates of the true size of conventional t-tests and the bootstrap tests with 5% nominal coverage (see notes to Table 5). The bootstrap procedure, which does not include bias correction, is described in the main text. p-values below 5% are emphasized with bold face.


Table 9: Ludvigson-Ng: R̄² for predicting excess returns using return-forecasting factors

                   Original sample: 1964–2007                      Later sample: 1985–2013
                   R̄1²          R̄2²          R̄2²−R̄1²             R̄1²          R̄2²          R̄2²−R̄1²
Two-year bond
Data               0.31         0.42         0.11                 0.15         0.23         0.07
Bootstrap          0.21         0.24         0.03                 0.25         0.28         0.03
              (0.06, 0.39) (0.09, 0.41) (-0.00, 0.10)        (0.04, 0.50) (0.08, 0.52) (-0.00, 0.12)
Three-year bond
Data               0.33         0.43         0.10                 0.15         0.22         0.07
Bootstrap          0.20         0.23         0.03                 0.25         0.29         0.04
              (0.05, 0.38) (0.09, 0.40) (-0.00, 0.10)        (0.05, 0.48) (0.09, 0.51) (-0.00, 0.13)
Four-year bond
Data               0.36         0.45         0.09                 0.19         0.24         0.05
Bootstrap          0.21         0.25         0.03                 0.27         0.30         0.03
              (0.06, 0.40) (0.10, 0.42) (-0.00, 0.11)        (0.07, 0.50) (0.11, 0.52) (-0.00, 0.12)
Five-year bond
Data               0.33         0.42         0.09                 0.17         0.21         0.05
Bootstrap          0.21         0.24         0.03                 0.25         0.29         0.03
              (0.06, 0.39) (0.10, 0.41) (-0.00, 0.11)        (0.06, 0.48) (0.10, 0.50) (-0.00, 0.13)

Adjusted R̄² for regressions of annual excess bond returns on return-forecasting factors based on yield-curve information (CP) and macro information (H8), as in Ludvigson and Ng (2010). R̄1² is for regressions with only CP, while R̄2² is for regressions with both CP and H8. The table shows results both for the original data set used by Ludvigson and Ng (2010) and for a data sample that starts in 1985 and ends in 2013. For each data sample and bond maturity, we report the values of the statistics in the data, and for the bootstrap small-sample distributions of these statistics the mean and 95%-confidence intervals (in parentheses). The bootstrap simulations are obtained under the null hypothesis that the macro variables have no predictive power. The bootstrap procedure, which does not include bias correction, is described in the main text.


Table 10: Cochrane-Piazzesi: in-sample evidence

                             PC1     PC2     PC3      PC4      PC5     Wald     R1²          R2²          R2²−R1²
Original sample: 1964–2003
Data                         0.127   2.740  -6.307  -16.128   -2.038            0.26         0.35         0.09
HAC statistic                1.724   5.205   2.950    5.626    0.748  31.919
HAC p-value                  0.085   0.000   0.003    0.000    0.455   0.000
Bootstrap 5% c.v./mean R²                             2.253    2.236   8.464    0.21         0.21         0.01
Bootstrap p-value/95% CIs                             0.000    0.507   0.000   (0.05, 0.40) (0.06, 0.41) (0.00, 0.03)
IM q = 8                     0.002   0.030   0.873    0.237    0.233
IM q = 16                    0.000   0.004   0.148    0.953    0.283
Estimated size of tests
HAC                                                   0.085    0.083   0.114
Bootstrap                                             0.046    0.053   0.055
IM q = 8                                              0.040    0.050
IM q = 16                                             0.043    0.049
Later sample: 1985–2013
Data                         0.104   1.586   3.962   -9.196   -9.983            0.14         0.17         0.03
HAC statistic                1.619   2.215   1.073    1.275    1.351   4.174
HAC p-value                  0.106   0.027   0.284    0.203    0.178   0.124
Bootstrap 5% c.v./mean R²                             2.463    2.433   9.878    0.26         0.28         0.02
Bootstrap p-value/95% CIs                             0.301    0.273   0.272   (0.06, 0.49) (0.08, 0.50) (0.00, 0.05)
IM q = 8                     0.011   0.079   0.044    0.803    0.435
IM q = 16                    0.001   0.031   0.215    0.190    0.949

Predicting annual excess bond returns, averaged over two- through five-year bonds, using principal components (PCs) of yields. The null hypothesis is that the first three PCs contain all the relevant predictive information. The data used in the top panel is the same as in Cochrane and Piazzesi (2005); see in particular their table 4. HAC statistics and p-values are calculated using Newey-West standard errors with 18 lags. We also report the unadjusted R² for the regression using only three PCs (R1²) and for the regression including all five PCs (R2²), as well as the difference in these two. Bootstrap distributions are obtained under the null hypothesis, using the bootstrap procedure described in the main text (without bias correction). For the R²-statistics, we report means and 95%-confidence intervals (in parentheses). For the HAC test statistics, bootstrap critical values (c.v.'s) are the 95th percentile of the bootstrap distribution of the test statistics, and p-values are the frequency of bootstrap replications in which the test statistics are at least as large as the statistic in the data. We also report p-values for t-tests using the methodology of Ibragimov and Müller (2010) (IM), splitting the sample into either 8 or 16 blocks. The last four rows in the first panel report bootstrap estimates of the true size of different tests with 5% nominal coverage, calculated as the frequency of bootstrap replications in which the test statistics exceed their critical values, except for the size of the bootstrap test, which is calculated as described in the main text. p-values below 5% are emphasized with bold face.


Table 11: Cochrane-Piazzesi: out-of-sample forecast accuracy

   n       R2²     R1²     RMSE2   RMSE1     DM     p-value   RMSE_mean
  (1)      (2)     (3)      (4)     (5)      (6)      (7)        (8)
   2      0.321   0.260    2.120   1.769    2.149    0.034      1.067
   3      0.341   0.242    4.102   3.232    2.167    0.032      1.946
   4      0.371   0.266    5.848   4.684    2.091    0.039      2.989
   5      0.346   0.270    7.374   6.075    2.121    0.036      3.987
 average  0.351   0.264    4.845   3.917    2.133    0.035      2.385

In-sample vs. out-of-sample predictive power for excess bond returns (averaged across maturities) of a restricted model with three PCs and an unrestricted model with five PCs. The in-sample period is from 1964 to 2002 (the last observation used by Cochrane-Piazzesi), and the out-of-sample period is from 2003 to 2013. The second and third columns show the in-sample R² of the unrestricted five-PC model (R2²) and the restricted three-PC model (R1²). The fourth and fifth columns show root-mean-squared forecast errors (RMSEs) of the two models. The column labeled "DM" reports the z-statistic of the Diebold-Mariano test for equal forecast accuracy, and the following column the corresponding p-value. The last column shows the RMSE when forecasts are the in-sample mean excess return.


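The Diebold-Mariano statistic in Table 11 is the t-statistic on the mean loss differential between the two models' squared forecast errors, with a long-run variance that is robust to the serial correlation created by overlapping annual forecasts. The sketch below is our illustration; the Newey-West implementation and the lag choice are assumptions, not necessarily the paper's exact settings.

```python
# Sketch of the Diebold-Mariano test for equal forecast accuracy under
# squared-error loss. e1, e2 are forecast-error series from models 1 and 2;
# the Newey-West long-run variance accounts for serial correlation from the
# overlapping annual forecasts. The lag choice is an illustrative assumption.
import numpy as np

def newey_west_lrv(d, lags):
    """Long-run variance of the series d with Bartlett weights."""
    d = d - d.mean()
    T = len(d)
    v = d @ d / T
    for l in range(1, lags + 1):
        v += 2 * (1 - l / (lags + 1)) * (d[l:] @ d[:-l]) / T
    return v

def diebold_mariano(e1, e2, lags=11):
    """z-statistic; positive values mean model 2 forecasts worse than model 1."""
    d = e2 ** 2 - e1 ** 2        # loss differential
    return d.mean() / np.sqrt(newey_west_lrv(d, lags) / len(d))
```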
Table 12: Greenwood-Vayanos: predictive power of Treasury bond supply

                     One-year   Term      PC1     PC2     PC3     Bond
                     yield      spread                            supply
Dependent variable: return on long-term bond
Coefficient          1.212                                        0.026
HAC t-statistic      2.853                                        3.104
HAC p-value          0.004                                        0.002
IM q = 8             0.030                                        0.795
IM q = 16            0.001                                        0.925
Dependent variable: return on long-term bond
Coefficient          1.800      2.872                             0.014
HAC t-statistic      5.208      4.596                             1.898
HAC p-value          0.000      0.000                             0.058
IM q = 8             0.006      0.013                             0.972
IM q = 16            0.000      0.000                             0.557
Dependent variable: excess return on long-term bond
Coefficient                               0.168   5.842  -6.089   0.013
HAC t-statistic                           1.457   4.853   1.303   1.862
HAC p-value                               0.146   0.000   0.193   0.063
IM q = 8                                  0.000   0.003   0.045   0.968
IM q = 16                                 0.000   0.000   0.023   0.854
Dependent variable: avg. excess return for 2-5 year bonds
Coefficient                               0.085   1.669  -4.632   0.004
HAC statistic                             1.270   3.156   2.067   1.154
HAC p-value                               0.204   0.002   0.039   0.249
Bootstrap 5% c.v.                                                 3.105
Bootstrap p-value                                                 0.448
IM q = 8                                  0.005   0.134   0.714   0.494
IM q = 16                                 0.008   0.011   0.611   0.980

Predictive regressions for annual bond returns using Treasury bond supply, as in Greenwood and Vayanos (2014) (GV). The coefficients on bond supply in the first two panels are identical to those reported in rows (1) and (6) of Table 5 in GV. HAC t-statistics and p-values are constructed using Newey-West standard errors with 36 lags, as in GV. The last two rows in each panel report p-values for t-tests using the methodology of Ibragimov and Müller (2010), splitting the sample into either 8 or 16 blocks. The sample period is 1952 to 2008. p-values below 5% are emphasized with bold face.


Table 13: Cooper-Priestley: predictive power of the output gap

                     gap      C̃P       CP       PC1     PC2     PC3
Coefficient         -0.126
OLS t-statistic      3.224
HAC t-statistic      1.077
HAC p-value          0.282
Coefficient         -0.120    1.588
OLS t-statistic      3.479   13.541
HAC t-statistic      1.244    4.925
HAC p-value          0.214    0.000
Coefficient          0.113             1.612
OLS t-statistic      2.940            13.831
HAC t-statistic      1.099             5.059
HAC p-value          0.272             0.000
Coefficient          0.147                      0.001   0.043  -0.067
OLS t-statistic      3.524                      4.359  11.506   3.690
HAC t-statistic      1.306                      1.354   4.362   2.507
HAC p-value          0.192                      0.176   0.000   0.012
Bootstrap 5% c.v.    2.933
Bootstrap p-value    0.356
IM q = 8             0.612                      0.002   0.011   0.234
IM q = 16            0.243                      0.000   0.001   0.064

Predictive regressions for the one-year excess return on a five-year bond using the output gap, as in Cooper and Priestley (2008) (CPR). C̃P is the Cochrane-Piazzesi factor after orthogonalizing it with respect to gap, whereas CP is the usual Cochrane-Piazzesi factor. For the predictive regression, gap is lagged one month, as in CPR. HAC standard errors are based on the Newey-West estimator with 22 lags. The bootstrap procedure, which does not include bias correction, is described in the main text. The sample period is 1952 to 2003. p-values below 5% are emphasized with bold face.


Figure 1: Simulation study: size of t-test and sample size

[Line chart. Vertical axis: empirical size of test, 0.00 to 0.20. Horizontal axis: sample size, 0 to 1,000. Four series: ρ = 1, small-sample simulations; ρ = 0.99, small-sample simulations; ρ = 1, asymptotic distribution; ρ = 0.99, asymptotic distribution.]

True size of conventional t-test of H0: β2 = 0 with nominal size of 5%, in simulated small samples and according to the local-to-unity asymptotic distribution, for different sample sizes, with δ = 1. Regressors are either random walks (ρ = 1) or stationary but highly persistent AR(1) processes (ρ = 0.99). For details on the simulation study refer to the main text.


Figure 2: Cochrane-Piazzesi: predictive power of PCs across subsamples

[Scatter plot. Vertical axis: standardized coefficient, roughly -1 to 3. Horizontal axis: endpoint for subsample, 1970 to 2000. Points show the standardized coefficients on PC1 through PC5 in each of eight subsamples. Text labels: PC1: t-stat = 4.74, p-value = 0.002; PC2: t-stat = 2.72, p-value = 0.030; PC3: t-stat = 0.17, p-value = 0.873; PC4: t-stat = 1.29, p-value = 0.237; PC5: t-stat = 1.31, p-value = 0.233.]

Standardized coefficients on principal components (PCs) across eight different subsamples, ending at the indicated point in time. Standardized coefficients are calculated by dividing through the sample standard deviation of the coefficient across the eight samples. Text labels indicate t-statistics and p-values of the Ibragimov-Mueller test with q = 8. Note that the t-statistics are equal to means of the standardized coefficients multiplied by √8. The data and sample period is the same as in Cochrane and Piazzesi (2005).


Figure 3: Cochrane-Piazzesi: out-of-sample forecasts

[Line chart. Vertical axis: excess return, -8 to 6. Horizontal axis: year, 2004 to 2012. Four series: Realized, Forecast 1, Forecast 2, In-sample mean.]

Realizations vs. out-of-sample forecasts of excess bond returns (averaged across maturities) from restricted model (1) with three PCs and unrestricted model (2) with five PCs. The in-sample period is from 1964 to 2002 (the last observation used by Cochrane-Piazzesi), and the out-of-sample period is from 2003 to 2013. The figure also shows the in-sample mean excess return.
