
Averaging Impulse Responses Using Prediction Pools∗

WP 23-04

Paul Ho
Federal Reserve Bank of Richmond†

Thomas A. Lubik
Federal Reserve Bank of Richmond‡

Christian Matthes
Indiana University§

January 25, 2023

Abstract

Macroeconomists construct impulse responses using many competing time series models and different statistical paradigms (Bayesian or frequentist). We adapt optimal linear prediction pools to efficiently combine impulse response estimators for the effects of the same economic shock from this vast class of possible models. We thus alleviate the need to choose one specific model, obtaining weights that are typically positive for more than one model. Three Monte Carlo simulations and two monetary shock empirical applications illustrate how the weights leverage the strengths of each model by (i) trading off properties of each model depending on variable, horizon, and application and (ii) accounting for the full predictive distribution rather than being restricted to specific moments.

JEL Classification: C32, C52
Key Words: Prediction Pools, Model Averaging, Impulse Responses, Misspecification

∗ We are grateful to Mikkel Plagborg-Møller, Mark Watson, our discussant Ed Herbst, and workshop participants at the 2021 CEF conference, the Drautzburg-Nason workshop, the Advances in Macro-Econometrics Conference, the 2022 System Econometrics Meeting, the 2022 Midwest Econometrics meeting, Indiana University, and the Federal Reserve Bank of Richmond for helpful comments. Aubrey George, Colton Lapp, and Brennan Merone provided excellent research assistance. The views expressed herein are those of the authors and not necessarily those of the Federal Reserve Bank of Richmond or the Federal Reserve System.
† Research Department, P.O. Box 27622, Richmond, VA 23261. Email: paul.ho@rich.frb.org.
‡ Research Department, P.O. Box 27622, Richmond, VA 23261. Email: thomas.lubik@rich.frb.org.
§ Wylie Hall, 100 South Woodlawn Avenue, Bloomington, IN 47405. Email: matthesc@iu.edu.


1 Introduction

Impulse responses are a key tool in macroeconomists’ arsenal to trace out the effects of structural shocks on aggregate quantities and prices. When estimating these impulse responses,
economists have a wide range of options. For example, in the space of purely statistical
models, a researcher can choose between local projections (LPs) and vector autoregressions
(VARs), Bayesian and frequentist methods, and different specifications. Naturally, each
choice has its own drawbacks and benefits. It is well known that these choices can generate
significantly different results (see Ramey (2016) for several leading examples). While there
is a growing literature discussing conditions under which one approach might be preferred
over another (Stock and Watson, 2018; Herbst and Johannsen, 2020; Plagborg-Møller and
Wolf, 2021), in practical applications many of the conditions are likely difficult to verify.
In this paper, we introduce a method to average impulse responses from different estimators by extending the optimal prediction pools studied by Geweke and Amisano (2011) and
Hall and Mitchell (2007).1 In particular, we compute the optimal weights that maximize
the weighted average log score function for forecasts conditional on the structural shock of
interest. The only input required is a set of forecast densities that trace out the model-specific effects of the shock of interest. The individual impulse responses can be based upon
any method that delivers such a conditional forecast density for a given variable at a given
horizon.
Our approach is designed to appeal to empirical macroeconomists, who may find it difficult to choose between different methods for estimating impulse responses. LPs have become
popular because they allow for the introduction of extraneous variables in a straightforward manner. At the same time, confidence intervals of LP-based responses tend to be wide and cannot
strictly be interpreted as structural. In contrast, VAR-based impulse response estimates suffer from a well-known bias. Our proposed solution to these issues in practical applications is
to simply take all of these concerns at face value and compute combined responses that take
these trade-offs into account. Our use of prediction pools provides a systematic and computationally tractable method to account for these issues in a wide range of applications, where
the focus is on identifying plausible and robust dynamic behavior over time, irrespective of
the underlying models.
A key strength of our approach is its flexibility. In particular, it removes the necessity
to choose one model or even one statistical paradigm. Moreover, the methodology is applicable to a wide range of models. Typical methods such as Bayesian model averaging are
1 Opinion pools, i.e., a forecast density formed by averaging over model-specific forecast densities, were first introduced by Stone (1961).

unavailable when one of the estimators considered is based on LPs, as LPs are not ‘generative models’; that is, a set of LPs for different horizons does not form a consistent data-generating
process. Besides the aforementioned LPs and VARs, dynamic equilibrium models (Smets
and Wouters, 2007), dynamic factor models (Stock and Watson, 2016), or single equation
methods (Baek and Lee, 2022) can be used in our framework. In addition, our method
provides horizon- and variable-specific averages, thus exploiting each method’s strength as
much as possible.
As highlighted by Geweke and Amisano (2011), prediction pools have properties that
make them well-suited to average over models or estimators when it is clear that all included
models are misspecified. In contrast to Bayesian model averaging or related frequentist
methods, more than one model will generally receive a positive weight. This helps prediction
pools to outperform other model selection or model averaging approaches using various
measures of forecast accuracy. Our extension inherits these properties, as we demonstrate
with Monte Carlo exercises and with two applications that study the effects of U.S. monetary
policy shocks.
Prediction pools are also computationally straightforward to implement relative to alternative methods of averaging across models. Since each model-specific forecasting density
can be obtained separately, the most time-consuming part of forming prediction pools can
be parallelized. The second step is then a relatively simple numerical maximization problem with a concave objective function and convex constraints. Other methods that also
combine information from various models such as mixture models or composite-likelihood
estimators (Qu, 2018; Canova and Matthes, 2021) do not share this modularity and thus
have substantially higher computational complexity.
Overall, our paper highlights several broad messages for estimating impulse responses.
The theoretical properties of individual models are not sufficient criteria for the choice of
optimal weights in the prediction pool. Misspecified models can dominate correctly specified
(or more flexible) models in finite samples. On the other hand, models that produce tighter
estimates need not receive greater weight. The choice of models and their weights depend on
the entire predictive distribution and not only the point estimates. While our examples focus
on the mean and variance, higher-order moments or any other properties of the predictive
distribution can be important more generally. Finally, we also find that the optimal weights
on models depend on horizon, variable, and application, making it difficult to derive general
guidelines or rules-of-thumb.
We illustrate our methodology using three Monte Carlo experiments. The first is a
stylized univariate example motivated by Herbst and Johannsen (2020), which serves as a
proof of concept and implies reasonable optimal weights. They result in an average impulse

response with a bias similar to one chosen by minimizing a squared error criterion. The
second Monte Carlo compares a VAR and an LP in a setting where the VAR is misspecified,
but the LP produces noisier estimates and has finite sample bias that is of the opposite
sign from the VAR. While most of the weight is placed on the VAR, substantial weight is
also placed on the LP, reducing the bias of the averaged impulse response relative to the
VAR on its own, with the biases of the two models offsetting each other. The final Monte
Carlo simulates data from the DSGE model from Smets and Wouters (2007), illustrating
the weighting scheme's ability to trade off bias and variance for a relatively realistic data-generating process. These exercises highlight how our approach often gives positive weight
to all competing models, but is also consistent with previous theoretical results found in
Herbst and Johannsen (2020) and Li et al. (2022).
We then consider two empirical exercises. Our first application uses an instrument that
exploits high-frequency variation in asset prices around monetary policy decisions (Gertler
and Karadi, 2015; Caldara and Herbst, 2019). The second application follows Ramey (2016),
where we average across four models that use the same Romer and Romer (2004) narrative
instrument for monetary shocks. We find a range of results depending on application, horizon, and variable, emphasizing the flexibility of our methodology and the importance of
considering the full predictive distribution rather than individual statistics. In addition, we
find cases where the averaged impulse response delivers an economic message that is different from, and often more plausible than, that of any of the individual models.
Related Literature. Our approach is motivated by the vast array of choices for computing impulse responses available to practitioners. It is designed to allow researchers to
average optimally across multiple approaches rather than choosing just one. The two main
statistical models are VARs (Sims, 1980) and LPs (Jordà, 2005), which we focus on in our
Monte Carlos and applications. Within these two classes of models there are numerous
variations. For example, in VARs inference can be conducted using Bayesian or frequentist methods (Sims and Zha, 1999). The Bayesian approach requires the choice of priors
(Doan et al., 1984; Del Negro and Schorfheide, 2004; Giannone et al., 2015) while the frequentist approach requires choices about bias correction and the construction of confidence
intervals (Kilian, 1998; Pesavento and Rossi, 2006). With LPs, there is a growing literature
providing choices on the approach to inference (Herbst and Johannsen, 2020; Montiel Olea
and Plagborg-Møller, 2021; Lusompa, 2021; Bruns and Lütkepohl, 2022) and smoothing the
impulse responses (Barnichon and Brownlees, 2019; Miranda-Agrippino and Ricco, 2021a).
Having a general method that is flexible enough to cater to different models, variables,
and horizons is particularly useful given the range of conclusions in the literature about the

relative strengths of the different methods. While there are asymptotic results on the relative
performance of VARs and LPs (Stock and Watson, 2018; Plagborg-Møller and Wolf, 2021),
the conditions for these theorems may not be easily verifiable in practice. In finite sample
settings, the literature has also compared the performance of VARs and LPs (Kilian, 1998;
Marcellino et al., 2006; Li et al., 2022). However, it is difficult to draw general conclusions,
especially in empirical applications when the true model is not known. Our Monte Carlos and
empirical applications show that the relative weights on different models can vary drastically
not only with the data but also by variable and horizon. We therefore consider it important
to rely on a general method that is able to assign weights variable by variable and horizon
by horizon.
Prediction pools have been used to average models since their introduction by Geweke
and Amisano (2011) and subsequent follow-up work in Geweke and Amisano (2012) and
Amisano and Geweke (2017). The methodology has been extended by Waggoner and Zha
(2012) and Del Negro et al. (2016) to assign time-varying weights. Our key innovation is
that prediction pools can be used to average impulse responses, whereby we treat the impulse
responses as conditional forecasts. This allows for a flexible method that inherits desirable
properties of the original prediction pools.
Model averaging has a long tradition in economics, partially motivated by the observation that averages of forecasts across multiple models tend to outperform forecasts based on
an individual model (Bates and Granger, 1969). This is, in fact, one of the original motivations behind optimal prediction pools (Geweke and Amisano, 2011). Alternative averaging
methods exist in both Bayesian and frequentist frameworks. In the Bayesian setting, model
averaging is just another application of Bayes’ theorem (for an application to VARs, see, for
example, Strachan and van Dijk (2007)). As mentioned before, Bayesian model averaging
generally requires the use of generative models and, as such, rules out the use of LPs. Frequentist versions of model or forecast averaging such as Hansen (2007) also focus on specific
model combination of various restricted VARs estimated via least squares. He proposes to
find optimal model weights to minimize the mean squared error of a function of the VAR
parameters (which can be an impulse response at a specific horizon).
Outline. The rest of the paper is structured as follows. Section 2 introduces our methodology. Section 3 describes our three Monte Carlo exercises. In Section 4, we apply our method
to study the response of various macroeconomic aggregates to monetary shocks identified by
the instruments from Gertler and Karadi (2015) and Romer and Romer (2004). Section 5
concludes.

2 Prediction Pools

We use prediction pools to average impulse responses across different models, based on Geweke and Amisano (2011). In their framework, predictive densities $p(z_{t+h} \mid X_m^t; M_m)$ for each model $M_m$ are combined to create a predictive density for an observable $z_t$ conditional on model-specific predictive variables $X_m^t$, from which objects of interest, such as forecasts, can be computed.2 The individual predictive densities are taken as given; that is, in contrast with other approaches to model averaging such as the estimation of mixture models, the parameters of the specific models and the model weights are not estimated jointly.

Formally, for any given horizon $h$, the goal is to maximize the log predictive score function:

$$\max_{\sum_{m=1}^{M} w_{m,h}=1,\; w_{m,h}\geq 0} \sum_{t=1}^{T} \log\left[\sum_{m=1}^{M} w_{m,h}\, p\left(z_{t+h} \mid X_m^t; M_m\right)\right], \qquad (1)$$

where $z_{t+h}$ denotes the variable of interest, $X_m^t$ denotes the history of variables that $z_{t+h}$ depends on in model $M_m$, and $m = 1, \ldots, M$ indexes different models. The framework can be extended to the multivariate case, where $z_{t+h}$ can be a vector of observables, but for ease of exposition and in our empirical setting later on we find it useful to focus on one variable at a time.3
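To make this concrete, the following minimal sketch (our own, not the authors' code) solves the problem in equation (1) given a $T \times M$ matrix of model-specific predictive density values evaluated at the realized data; the function names and the use of scipy's SLSQP solver are our assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def pooled_log_score(w, dens):
    """Log predictive score of the pool: sum_t log( sum_m w_m * p_mt ).

    dens is a (T, M) array with dens[t, m] = p(z_{t+h} | X_m^t; M_m).
    """
    return np.log(dens @ w).sum()

def optimal_pool_weights(dens):
    """Maximize the log score over the unit simplex, as in equation (1)."""
    M = dens.shape[1]
    res = minimize(
        lambda w: -pooled_log_score(w, dens),   # minimize the negative score
        np.full(M, 1.0 / M),                    # start from equal weights
        method="SLSQP",
        bounds=[(0.0, 1.0)] * M,                # w_m >= 0 (and <= 1)
        constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],
    )
    return res.x

# Toy usage with two hypothetical models' density evaluations:
dens = np.random.default_rng(0).uniform(0.05, 1.0, size=(100, 2))
print(optimal_pool_weights(dens))
```

Because the objective is concave in the weights and the constraint set is convex, a generic constrained optimizer of this kind is sufficient in practice.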

2.1 Adapting Prediction Pools to Impulse Response Averaging

Prediction pools generally improve forecasting ability relative to individual models as judged
by the log predictive score (Geweke and Amisano, 2011, 2012). They do so by usually giving
more than one model a positive weight, in contrast with posterior model probabilities in a
Bayesian setting.
We leverage the useful properties of prediction pools for the problem of impulse response
estimation using the insight that impulse responses are nothing but conditional forecasts.
2 Subscripts generally denote period-specific outcomes (except for the subscript m, which denotes the model at hand), whereas superscripts denote histories up to and including the period specified in the superscript.
3 We treat the computation of the weights at different horizons as distinct problems. One could, alternatively, compute weights jointly and impose a penalty that forces the changes in the weights across horizons to be smoother than in our benchmark. For example, one could estimate the sets of weights $\{\{w_{m,h}\}_{m=1}^{M}\}_{h=1}^{H}$ by maximizing the following objective function:
$$\max_{\sum_{m=1}^{M} w_{m,h}=1,\; w_{m,h}\geq 0} \sum_{h=1}^{H}\sum_{t=1}^{T} \log\left[\sum_{m=1}^{M} w_{m,h}\, p\left(z_{t+h} \mid X_m^t; M_m\right)\right] - \lambda \sum_{h=2}^{H}\sum_{m=1}^{M} \left(w_{m,h} - w_{m,h-1}\right)^2,$$
where $\lambda$ controls how much smoother the weights will be relative to our benchmark. With $\lambda = 0$, we replicate our benchmark since then each horizon's weights can be solved for independently of all other horizons.

The impulse responses in our framework are averages of model-specific impulse response
estimators.4 Our approach thus rewards models that forecast well.
The primitives that we need for our approach are forecasting densities based on each
model. Where we differ from Geweke and Amisano (2011) is that we use a measure of
the shock of interest as a conditioning argument in our predictive densities. In general, we
form forecast densities that depend on observables up to time t − 1 and a measure of the
structural shock at time t. These measures of shocks can depend on time t data and model
parameters. They can also incorporate identification restrictions, as will become clear in
our examples. Whereas the forecasting densities used in Geweke and Amisano (2011) and
other papers that use prediction pools make no explicit use of identified shocks or specific
identification schemes for these shocks, incorporating these in our approach allows us to
discriminate between models with different identification schemes.5
This conditioning scheme makes our approach more relevant for empirical practice and the choices that researchers face. An alternative approach would be to use the Geweke
and Amisano (2011) approach for different reduced-form models directly and then impose
identifying restrictions ex-post after finding the optimal weights. However, many applications of LPs directly use information on structural shocks (or instruments thereof) in the
estimation, making this alternative less appealing when at least one of the models in our
pool is based on LPs.
Another distinctive component of our approach is how we implement the distribution
over each model's parameters, i.e., how our forecasting densities incorporate parameter uncertainty within a model. Geweke and Amisano (2011) use two approaches: either the
posterior distribution of parameters from a Bayesian estimation or, alternatively, fixed parameter values from some point estimate. We use a more general framework where the
parameters of model Mm are collected in a vector Ωm . We generate draws from a distribution pm (Ωm ) that captures the parameter uncertainty we want to consider. This could
be a posterior distribution, a point mass, a prior distribution, or a distribution derived using frequentist principles, say by appealing to standard asymptotic arguments or numerical
approaches such as the bootstrap.
We generally study a vector $y_t$ of macroeconomic variables and denote the $j$th variable of that vector by $y_{t,j}$. Since LPs are usually estimated for one specific variable and horizon at a time, we carry out our analysis variable by variable and horizon by horizon as well. This also gives us additional flexibility as different models might yield better forecasting ability for different variables or horizons.

4 Our focus in this paper is on linear models, but our approach could also be used in nonlinear settings.
5 For the different identification schemes to be comparable, we are implicitly assuming that they are indeed identifying the same shock.
With these definitions in hand, we define our forecasting density for model $m$, $p^*_m$:

$$p^*_m(y_{t+h,j}) = \int p\left(y_{t+h,j} \mid y^{t-1}, \varepsilon_t(\Omega_m, y^t), \Omega_m, M_m\right) p_m(\Omega_m)\, d\Omega_m, \qquad (2)$$

which replaces the forecasting densities $p(z_{t+h} \mid X_m^t; M_m)$ in Equation (1). We can approximate the integral on the right-hand side by Monte Carlo methods, as is often necessary in practice. We can extend our definition of $p^*$ by allowing different models to depend on different right-hand side variables. While we allow the shock measure $\varepsilon_t(\Omega_m, y^t)$ to be model-specific, in our applications we use the same shock (or an instrument for it) as a conditioning argument in all models and assume that the observed shock is one element of the vector $y_t$.
Geweke and Amisano (2011) use true out-of-sample forecasting densities, i.e., their densities $p(z_{t+h} \mid X_m^t; M_m)$ are generally re-estimated every period. While this is possible in our framework as well, we use an alternative approach inspired by cross validation. In particular, we split the sample in half and estimate the models for each subsample separately. We then use the implied out-of-sample forecasting densities for the parts of the sample that were not used for estimation to obtain model weights. More specifically, we first estimate each model using the first half of the sample, and then use those parameter estimates to forecast the second half. In the next step, we estimate using the second subsample, fix the parameter estimates, and forecast the first subsample. This produces two true out-of-sample forecast densities without having to re-estimate every period. We view this approach as trading off computing time and overfitting concerns, which would play a role if we did not split the sample at all.6

2.2 Properties of Prediction Pools

With forecasting densities (2) in hand, the theorems stated in Geweke and Amisano (2011)
all apply. In particular, as long as the expected average forecast densities do not take on
the same value for different models, the true model, should it be contained in the set of
models we consider, asymptotically receives a weight of 1. In contrast to Bayesian model
averaging, more than one model will receive positive weight even asymptotically if the true
model is not contained in the set of models (Geweke and Amisano, 2012). The individual
model with the highest log predictive score might not even receive a positive weight in the
6 We could extend our approach to allow for time-varying weights along the lines of Waggoner and Zha (2012) or Del Negro et al. (2016).

optimal pool if there are more than two models being considered.7 Furthermore, the weights
satisfy a number of consistency requirements that make their use appealing. We state these
consistency requirements as derived by Geweke and Amisano (2011) in Appendix B.
We consider the prediction pool framework as particularly well suited for applications
in empirical macroeconomics. First, by studying each horizon separately, we overcome the
issue that LPs are not generative models. In particular, there is no unique way to simulate
a sample of arbitrary length from LPs estimated using different horizons. The simulation
from one horizon is in general inconsistent with simulations from LPs for a different horizon. As a result, Bayesian model averaging is not possible. Second, prediction pools allow
us to compare Bayesian and frequentist approaches. In particular, the probability distribution pm (Ωm ) can be either Bayesian (i.e., a posterior distribution) or frequentist (i.e., an
asymptotic distribution).8 Finally, solving the optimization problem (1) is computationally
straightforward.

2.3 Implementation

We now present a step-by-step guide that summarizes our approach.
1. Split the estimation sample in half, so that each subsample has $T/2$ observations (we assume for simplicity that $T$ is even). We denote the subsample by $s = 1, 2$, where $s = 1$ means that periods dated $t = 1, \ldots, T/2$ are used in the estimation, whereas $s = 2$ means that periods $t = T/2 + 1, \ldots, T$ are used. In a slight abuse of notation, we define a function $s(t)$ that is equal to 1 if $t \leq T/2$ and equal to 2 if $t > T/2$. We now give densities additional superscripts that denote the estimation sample.
2. Estimate (or calibrate) each model $m = 1, \ldots, M$ for each subsample $s$. This means that for each model we get a distribution $p^s_m(\Omega_m)$ for each subsample. This is the most time-consuming step of the algorithm, but can be easily parallelized.
3. For each model and subsample construct $p^{*,s}_m(y_{t+h,j})$ by first constructing the forecast density conditional on parameters and a given shock (see Section 2.4 for an example of how to do this in VAR models). Then average over draws from the relevant $p^s_m(\Omega_m)$ density. This step can also be parallelized.
4. Compute model weights by solving the following maximization problem for each horizon $h$ and each variable $j$ separately (a numerical sketch follows after this list):

$$\max_{\sum_{m=1}^{M} w^j_{m,h}=1,\; w^j_{m,h}\geq 0} \sum_{t=1}^{T} \log\left[\sum_{m=1}^{M} w^j_{m,h}\, p^{*,3-s(t)}_m(y_{t+h,j})\right] \qquad (3)$$

The superscript of the density $p^{*,3-s(t)}_m$ clarifies that we use out-of-sample forecasts to construct the objective function. Geweke and Amisano (2011) provide conditions for the concavity of the objective function.

5. With model weights in hand, we can construct weighted averages of impulse responses and other statistics of interest from each model.9

7 The pooling weights thus do not necessarily represent a ranking or evaluation of the models. Rather, the weights are chosen to optimize the performance of the averaged model in terms of the log-score objective function.
8 By allowing for both Bayesian and frequentist models to enter our model pool, we implicitly equate the interpretations of uncertainty in Bayesian and frequentist frameworks. This is very much in the spirit of much applied work, which compares error bands across Bayesian and frequentist approaches, disregarding philosophical differences between the two approaches. Nevertheless, in our applications below, we do not compare across paradigms.
9 Once we have obtained the model weights, we re-estimate each model using the entire sample to obtain a final estimate of $p_m(\Omega_m)$ and use that distribution to construct our statistics of interest.
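As a sketch of how steps 1 through 4 fit together for one variable and horizon (our own code; density_fn is a hypothetical callable standing in for the model-specific construction in step 3), the cross-fitting scheme can be written as:

```python
import numpy as np

def crossfit_density_matrix(density_fn, T, M):
    """Build the (T, M) matrix of out-of-sample densities p*^{3-s(t)}_m.

    density_fn(m, s, t) is assumed to return the predictive density of
    y_{t+h,j} under model m with parameters estimated on subsample s,
    where s(t) = 1 for t <= T/2 and s(t) = 2 otherwise.
    """
    dens = np.empty((T, M))
    for t in range(T):
        s_t = 1 if t < T // 2 else 2            # subsample containing t
        for m in range(M):
            # Evaluate with parameters estimated on the *other* half.
            dens[t, m] = density_fn(m, 3 - s_t, t)
    return dens

# The weights in (3) then follow from the earlier sketch, e.g.
# w = optimal_pool_weights(crossfit_density_matrix(density_fn, T, M)),
# solved separately for each variable j and horizon h.
```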

2.4 Illustrative example: Constructing $p^*$ for a VAR(1)

For concreteness, we now illustrate how to construct the forecasting density $p^*_m(y_{t+h,j})$ in the context of a linear Gaussian VAR(1):

$$y_t = B y_{t-1} + u_t \qquad (4)$$
$$u_t = C \varepsilon_t, \qquad (5)$$

where $\varepsilon_t \sim N(0, I)$ is a vector of structural shocks and $V[u_t] = CC'$. In terms of the notation from the previous section and assuming this VAR is model 1, we have $\Omega_1 = [\mathrm{vec}(B)' \; \mathrm{vec}(C)']'$, where vec denotes columnwise vectorization of a matrix. The impulse response of $y_t$ to shock $j$ at horizon $h$ is then $B^h C_{\bullet,j}$, where $C_{\bullet,j}$ is the $j$th column of the matrix $C$.
Given $B$ and $C$, we can compute the on-impact conditional distributions:

$$E[y_t \mid y_{t-1}, \varepsilon_{t,j}] = B y_{t-1} + C_{\bullet,j}\, \varepsilon_{t,j} \qquad (6)$$
$$V[y_t \mid y_{t-1}, \varepsilon_{t,j}] = CC' - C_{\bullet,j} C_{\bullet,j}' \qquad (7)$$

and iterate forward:

$$E[y_{t+h} \mid y_{t-1}, \varepsilon_{t,j}] = B\, E[y_{t+h-1} \mid y_{t-1}, \varepsilon_{t,j}] \qquad (8)$$
$$V[y_{t+h} \mid y_{t-1}, \varepsilon_{t,j}] = B\, V[y_{t+h-1} \mid y_{t-1}, \varepsilon_{t,j}]\, B' + CC' \qquad (9)$$

The predictive density of the vector $y_t$ conditional on parameters $h$ periods ahead is then Gaussian with conditional means and variances defined above. Furthermore, the forecasting distribution of a specific variable $y_{t,j}$ conditional on parameters and the shock is given by a normal distribution where the mean and variance are the relevant elements of $E[y_{t+h} \mid y_{t-1}, \varepsilon_{t,j}]$ and $V[y_{t+h} \mid y_{t-1}, \varepsilon_{t,j}]$.
With the internal instrument VAR, which we use in our Monte Carlos and empirical applications in Sections 3 and 4, an econometrician observes the shock if she knows the parameters. More generally, we replace $\varepsilon_{t,j}$ with $\hat{\varepsilon}_{t,j}$, the $j$th element of $\hat{\varepsilon}_t \equiv C^{-1}(y_t - B y_{t-1})$, the fitted value of $\varepsilon_t$.
If B and C are estimated, we can account for parameter uncertainty by integrating over
their posterior or asymptotic distribution. We implement this by averaging the predictive
density across draws in a Bayesian framework, for example.
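For a single parameter draw, the recursions (6)-(9) can be coded directly; the sketch below (ours, with hypothetical names) returns the conditional density of $y_{t+h,j}$, and averaging it across draws of $(B, C)$ approximates $p^*_m$ in equation (2).

```python
import numpy as np
from scipy.stats import norm

def var1_conditional_density(B, C, y_prev, eps_j, j, h, y_future_j):
    """Evaluate p(y_{t+h,j} | y_{t-1}, eps_{t,j}) for the VAR(1) in (4)-(5)."""
    Cj = C[:, j]
    mean = B @ y_prev + Cj * eps_j          # on-impact mean, eq. (6)
    var = C @ C.T - np.outer(Cj, Cj)        # on-impact variance, eq. (7)
    for _ in range(h):                      # iterate forward, eqs. (8)-(9)
        mean = B @ mean
        var = B @ var @ B.T + C @ C.T
    # The marginal for variable j is Gaussian with the relevant elements.
    return norm.pdf(y_future_j, loc=mean[j], scale=np.sqrt(var[j, j]))
```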

2.5 Extensions

We close our description of the methodology with a brief discussion of straightforward ways
to adapt or extend our approach to several common settings.
In many scenarios, macroeconomists use identification schemes that do not point identify
the structural shock of interest (such as in the case of sign restrictions in VARs). To accommodate such cases, we can enlarge the parameter vector Ωm for any model m, where the
structural shock is not point-identified, to include a parameter that selects one possible value
of the structural shock consistent with the other parameters of the model. In a VAR, this
would be a rotation matrix that maps the covariance matrix of the one-step ahead forecast
error into the matrix of impact impulse responses. While this parameter is by definition not
point identified, it does not conflict with our approach.
Similarly, it is numerically straightforward to accommodate models with nonlinearities
where the models are conditionally linear and Gaussian. Key examples are VAR models with
parameters that follow discrete (Sims and Zha, 2006) or continuous (Cogley and Sargent,
2005; Primiceri, 2005) Markov processes, where the respective innovations are independent
of other innovations in the model. In these cases, we need to enlarge the parameter vector
Ωm to include estimates of the time t state of the Markov process.


Our approach typically gives larger weights to models that are better at forecasting the
series of interest at a given horizon. This forecasting ability can be due to the inclusion of
the structural shock or due to other features of each model. If a researcher wants to reward
models with a larger weight when the inclusion of the structural shock improves forecast
ability, the following alternative objective function could be used:
"

#
XX

F (φ) =

t

|

"

wm p∗m (yt+h,j ) + φ

#
XX

m

t

{z

standard objective

}

wm (p∗m (yt+h,j ) − pm (yt+h,j )) ,

m

{z

|

}

reward for forecast improvement due to structural shock

where we define
Z
pm (yt+h,j ) =


p yt+h,j |y t−1 , Ωm , Mm pm (Ωm )dΩm

(10)

as the forecast density based on model Mm when the structural shock is not used as a
predictive variable. The parameter φ governs how much the researcher rewards forecast
improvement due to the inclusion of a structural shock. For simplicity, we set φ = 0 and
ignore the forecast improvement from the structural shock, as is typical in most model
averaging in the literature. As a first pass to assess the viability of this modification, one
could also compare weights based on pm and p∗m .
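A minimal sketch of this modified objective, under our reconstruction and naming (dens_with_shock and dens_no_shock are $T \times M$ arrays of $p^*_m$ and $p_m$ values):

```python
import numpy as np

def modified_objective(w, dens_with_shock, dens_no_shock, phi):
    """F(phi): the standard pooled objective plus a reward, scaled by phi,
    for the density improvement from conditioning on the structural shock."""
    standard = (dens_with_shock @ w).sum()
    reward = ((dens_with_shock - dens_no_shock) @ w).sum()
    return standard + phi * reward
```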

3 Monte Carlo Simulations

We now present three Monte Carlo exercises to illustrate our methodology. First, we consider
a univariate example with two alternative models that produce consistent estimates but differ
in finite sample. Second, we consider a model in which the VAR is misspecified but the LP
produces consistent estimates. Third, we consider data simulated from a DSGE model, such
that both the VAR and LP are misspecified. We report biases and standard deviations of
our approach vis-a-vis individual models, following the common focus of the literature on
first and second moments. However, our approach targets the entire forecast distribution
and is not restricted to these moments.

3.1 AR(1)

As an initial proof of concept, we first consider the AR(1) Monte Carlo exercise from Herbst
and Johannsen (2020). We show that in this setting, our model averaging approach performs
close to optimally on a number of dimensions.

Data-Generating Process. We generate data from the univariate model:

$$y_t = \rho y_{t-1} + e_t + v_t, \qquad (11)$$

where $(e_t, v_t)' \overset{iid}{\sim} N(0, I)$. We take $\rho = 0.97$ and use $T = 80$ observations. We seek the impulse response of $y_t$ to a shock $e_t$.
Models. We estimate models of the form:

$$y_{t+h} = \beta_m^{(h)\prime} x_{m,t} + \varepsilon^{(h)}_{m,t+h}, \qquad (12)$$

and use our methodology to compare the two specifications for $x_{m,t}$ considered by Herbst and Johannsen (2020):

• With controls: $x_{m,t} = (e_t, y_{t-1})'$.
• Without controls: $x_{m,t} = e_t$.

Both specifications produce consistent estimates $\beta^{(h)}_{m,1}$ of the impulse response at horizon $h$. However, the second specification does not control for lagged $y_t$, resulting in differing finite sample performances between models. Herbst and Johannsen (2020) show that the two specifications produce different finite sample biases. The variances of the estimated impulse responses also differ.
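The following sketch (ours) carries out one Monte Carlo replication: it simulates the DGP (11) and computes the horizon-$h$ OLS impulse response estimate from each specification in (12).

```python
import numpy as np

rng = np.random.default_rng(1)
rho, T, h = 0.97, 80, 4

# Simulate the DGP (11): y_t = rho * y_{t-1} + e_t + v_t.
e, v = rng.standard_normal(T), rng.standard_normal(T)
y = np.zeros(T)
for t in range(1, T):
    y[t] = rho * y[t - 1] + e[t] + v[t]

def irf_estimate(with_controls):
    """OLS estimate of the first element of beta_m^{(h)} in (12)."""
    lhs = y[h + 1:]                                       # y_{t+h}
    if with_controls:
        X = np.column_stack([e[1:T - h], y[:T - h - 1]])  # (e_t, y_{t-1})'
    else:
        X = e[1:T - h, None]                              # e_t only
    return np.linalg.lstsq(X, lhs, rcond=None)[0][0]

print(irf_estimate(True), irf_estimate(False), rho ** h)  # vs. true IRF
```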
Results. Figure 1 shows the results averaged across $5 \times 10^4$ simulations. The weights produced are intuitive and perform well on a number of dimensions. The top left panel shows that optimal weights tend to favor the model with a smaller bias. The weights are closer to 0.5 when the biases of the two models are closer.

The remaining panels show that the resulting mixture model performs well. First, the bias of the mixture model is close to the optimum that one could get with each individual model horizon by horizon. Second, the standard deviation of the mean estimate from the mixture model is also close to the lower envelope of the two individual models.10 Third, we compute the density of the true impulse response under each of the models in the lower left panel.11 It shows that the average density is relatively high under the averaged model with optimal weights.

10 We compute the standard deviation of the impulse responses for each Monte Carlo sample and then average across all samples. Both the sample-specific bias and standard deviation depend on the estimated weights for that specific sample. The averages we report in our figure for the Monte Carlo experiments thus take into account sample variation in the estimated weights.
11 To be specific, we compute the model-specific predictive density evaluated at the true impulse response and then average across Monte Carlo repetitions.

Figure 1: Top left: Prediction pool and least-squares weights on model with controls; Top right: Estimated impulse responses under each specification, averaged models, and true model; Bottom left: Log average probability density of true impulse response under each specification and averaged models; Bottom right: Standard deviation of impulse response point estimates. Dashed lines correspond to individual models, solid lines correspond to averaged models, and the dotted line corresponds to the true impulse response. All plots show averages across all Monte Carlo repetitions.
We also compare the results to optimal weights computed using a least-squares objective function, which replaces the optimization problem in equation (3) with:

$$\min_{\sum_{m=1}^{M} w_{m,h}=1,\; w_{m,h}\geq 0} \sum_{t=1}^{T-h}\left(y_{t+h} - \sum_{m=1}^{M} w_{m,h}\, \hat{y}^{(h)}_{m,t+h}\right)^2, \qquad (13)$$

where $\hat{y}^{(h)}_{m,t+h} = \beta_m^{(h)\prime} X_{m,t}$ is the fitted value of $y_{t+h}$ in model $m$. We use the same sample-splitting scheme as with the prediction pool weights.
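A matching sketch (ours) of the least-squares weighting problem (13), with yhat the $(T-h) \times M$ matrix of model fitted values:

```python
import numpy as np
from scipy.optimize import minimize

def ls_pool_weights(y_future, yhat):
    """Weights minimizing the squared-error criterion (13) over the simplex."""
    M = yhat.shape[1]
    res = minimize(
        lambda w: ((y_future - yhat @ w) ** 2).sum(),
        np.full(M, 1.0 / M),
        method="SLSQP",
        bounds=[(0.0, 1.0)] * M,
        constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],
    )
    return res.x
```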
Even though the least-squares objective function targets the bias and the standard deviation of the averaged point estimates, the prediction pool performs similarly on both these
measures. The prediction pool is thus able to obtain close to optimal point estimates according to this least squares objective while taking into account the entire probability distribution


for the estimated impulse response in each simulation. In situations where the forecasting
density is more complicated, this is not necessarily guaranteed to be true and weights based
on such a least squares objective could miss important features of the data.

3.2 Misspecified Shock

We now present an example in which the VAR is misspecified but the LP produces consistent
estimates. The example highlights how the weights trade off the flexibility of the LP and the
structure and relatively tighter estimates of the VAR. In addition, in finite sample the two
models produce impulse responses with biases of opposite signs, offsetting each other once
we average their impulse responses.
Data-Generating Process. We consider data generated from the model:

$$y_t = \rho y_{t-1} + v_{1,t} + v_{2,t} \qquad (14)$$
$$v_{2,t} = \gamma v_{2,t-1} + e_{2,t}, \qquad (15)$$

where

$$\begin{bmatrix} v_{1,t} \\ e_{2,t} \end{bmatrix} \sim N\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & 0 \\ 0 & 1 - \gamma^2 \end{bmatrix} \right)$$

and $(\rho, \gamma) = (0.97, 0.75)$. Our parameterization ensures that the long-run variance of $v_{2,t}$ is one, which is equal to the variance of $v_{1,t}$.

We seek the impulse response of $y_t$ to a shock $v_{1,t}$. Under the data-generating process, the impulse response at horizon $h$ is $\rho^h$. We obtain results from 1000 simulations of 500 periods each.
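A sketch (ours) of one draw from this data-generating process:

```python
import numpy as np

rng = np.random.default_rng(2)
rho, gamma, T = 0.97, 0.75, 500

# Simulate (14)-(15); scaling e2 makes the unconditional variance of v2 one.
v1 = rng.standard_normal(T)
e2 = np.sqrt(1 - gamma ** 2) * rng.standard_normal(T)
v2, y = np.zeros(T), np.zeros(T)
for t in range(1, T):
    v2[t] = gamma * v2[t - 1] + e2[t]
    y[t] = rho * y[t - 1] + v1[t] + v2[t]

# The true impulse response of y to v1 at horizon h is rho ** h.
```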
Models. The first model we use to estimate the impulse response is an internal instrument VAR (Noh, 2018; Plagborg-Møller and Wolf, 2021):

$$\begin{bmatrix} z_t \\ y_t \end{bmatrix} = B \begin{bmatrix} z_{t-1} \\ y_{t-1} \end{bmatrix} + u_t, \qquad (16)$$

where $z_t$ is the shock of interest and $u_t$ is assumed to be independent over time. The impulse response at horizon $h$ is $B^h C_{\bullet,1}$, where $C$ is the lower triangular matrix satisfying $CC' = V[u_t]$, obtained using a Cholesky decomposition.12 As before, $C_{\bullet,1}$ is the first column of $C$. We assume for simplicity that the shock is perfectly observed, i.e., $z_t = v_{1,t}$. We estimate the model equation by equation using least squares, with standard errors computed using the “wild” bootstrap (Gonçalves and Kilian, 2004).

12 This is closely related to the VARX model from Bagliano and Favero (1999) and Paul (2020), where the instrument is included as an exogenous variable in a VAR.

Figure 2: Prediction pool weights, biases, and asymptotic standard deviations from Monte Carlo with persistent shocks. Left: Optimal weights on LP; Middle: Bias of impulse responses; Right: Standard deviation of impulse responses. Biases and standard deviations averaged across simulations. All plots show averages across all Monte Carlo repetitions.
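A minimal sketch of the internal instrument VAR point estimate just described (our own code; the wild bootstrap step for standard errors is omitted):

```python
import numpy as np

def var_irf(z, y, horizons):
    """IRF of y to the shock z in the bivariate VAR (16): B^h C_{.,1}."""
    X = np.column_stack([z, y])                      # x_t = (z_t, y_t)'
    lhs, rhs = X[1:], X[:-1]
    B = np.linalg.lstsq(rhs, lhs, rcond=None)[0].T   # x_t = B x_{t-1} + u_t
    u = lhs - rhs @ B.T
    C = np.linalg.cholesky(np.cov(u.T))              # CC' = V[u_t]
    irf, Bh = [], np.eye(2)
    for _ in range(horizons + 1):
        irf.append((Bh @ C[:, 0])[1])                # response of y each horizon
        Bh = Bh @ B
    return np.array(irf)
```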
The second model we consider is an LP:

$$y_{t+h} = \beta^{(h)} v_{1,t} + \gamma_v^{(h)} v_{1,t-1} + \gamma_y^{(h)} y_{t-1} + \varepsilon^{(h)}_{t+h}. \qquad (17)$$

The estimated impulse response at horizon $h$ is $\beta^{(h)}$. The model is estimated using least squares, with White standard errors (Montiel Olea and Plagborg-Møller, 2021).
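And a matching sketch of the LP estimator (17) with White standard errors, using statsmodels' heteroskedasticity-robust covariance (names and layout are ours):

```python
import numpy as np
import statsmodels.api as sm

def lp_irf(y, v1, h):
    """OLS estimate of beta^(h) in the LP (17), with HC ("White") errors."""
    T = len(y)
    lhs = y[h + 1:]                                  # y_{t+h}
    X = np.column_stack([v1[1:T - h],                # v1_t
                         v1[:T - h - 1],             # v1_{t-1}
                         y[:T - h - 1]])             # y_{t-1}
    fit = sm.OLS(lhs, X).fit(cov_type="HC0")
    return fit.params[0], fit.bse[0]                 # beta^(h) and its std. error
```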
The two models face a bias-variance trade-off highlighted by Li et al. (2022). The VAR
(16) is misspecified because the autocorrelation of the shock ut is assumed to be zero. This
induces bias even asymptotically. The LP produces consistent estimates, with a finite sample
bias that vanishes as the sample size goes to infinity. However, the structure of the VAR
induces a smaller variance than the LP. Our averaging approach balances both considerations
while also taking into account the finite sample performance of each method.
Results. The results are summarized in Figure 2. While the majority of the weight is
placed on the VAR, there is substantial weight of up to almost 0.4 placed on the LP. The
weight on the LP peaks around h = 4, but remains above 0.2 for all horizons after impact.
By averaging the two models, we obtain an impulse response that has a lower standard
deviation and only slightly larger bias than the LP. Since the VAR and LP have biases of
opposite signs, the two biases tend to offset each other in the average.13 In this case, the difference in standard deviations leads to a larger weight on the VAR.

13 When we use weights from in-sample predictive densities instead of splitting the sample, we find that the bias almost completely vanishes. See Figure A.2 in Appendix A.

More generally, a correctly specified or more flexible model need not dominate a misspecified model in finite sample. The finite sample performance of each model may not
correspond to their asymptotic behavior. Furthermore, these properties may differ across
impulse response horizon or the variable of interest. Our impulse response averaging approach flexibly accounts for these by constructing an optimal composite impulse response
variable by variable and horizon by horizon. Figure 2 also highlights the trade-off between
bias and standard deviation that is present in practically any model-averaging exercise (unless one model dominates in terms of both bias and standard deviation). Our approach
reduces the bias relative to the VAR, but does so by increasing the standard deviation.
However, our approach outperforms the individual models in terms of the log predictive
score by construction.

3.3 Medium-Scale New Keynesian Model

We next consider a Monte Carlo exercise with data generated from a quantitative dynamic
stochastic general equilibrium (DSGE) model to connect more closely to actual empirical
settings. We use a DSGE model as our data-generating process because it implies VARMA
(Vector Autoregressive Moving Average) dynamics for the vector of observables, so that both
models we consider, VARs and LPs, are misspecified. Despite using closely related models,
we find very different estimates in finite sample. The averaged impulse response balances
the bias-variance trade-off, and in some cases even has a smaller bias than either individual
model.
Data-Generating Process. We simulate data from the log-linearized medium-scale New
Keynesian model from Smets and Wouters (2007) with parameters fixed at the posterior
mode reported in the paper. We use the model to generate 150 periods of simulated data
for the seven observables used by Smets and Wouters (2007) to estimate the model: GDP
growth, consumption growth, investment growth, wage growth, hours, inflation, and the
federal funds rate. We will focus on the impulse response of each variable to a monetary
shock, which we assume to be observed by the econometrician. We obtain results across 200
simulations.
Models. We compare two models: an internal instrument VAR (16) estimated using Bayesian methods, and the Bayesian LP (Miranda-Agrippino and Ricco, 2021a). The Bayesian LP estimates:

$$\begin{bmatrix} z_{t+h} \\ y_{t+h} \end{bmatrix} = B^{(h)} \begin{bmatrix} z_t \\ y_t \end{bmatrix} + u^{(h)}_{t+h} \qquad (18)$$

for each horizon $h > 0$. The impulse response at horizon $h$ is $B^{(h)} C_{\bullet,1}$, where $C_{\bullet,1}$ is obtained from (16). Miranda-Agrippino and Ricco (2021a) show how to impose a prior on the model and estimate the LP impulse response analogously to a Bayesian VAR.14 Both models have one lag and use the same Minnesota prior. In addition, we assume that the shock $z_t$ is perfectly observed.
The two models are closely connected. First, if $B^{(h)} = B^h$, then the VAR and LP produce
identical impulse responses. In particular, given the same priors, the two models would
produce identical on-impact impulse responses. Next, as pointed out by Plagborg-Møller
and Wolf (2021), under the appropriate regularity conditions, the two models asymptotically
produce identical impulse responses. However, as the Monte Carlo exercises will show, in
finite sample and under misspecification the two models can lead to substantially different
estimates despite their close connections, emphasizing the need for a systematic way to
average across models.15
Results. The weights, averaged across simulations, are summarized in the left panels of
Figure 3. Overall, the prediction pools place greater weight on the VAR, with the LP
typically getting a weight of 0.2 or less. The weight on the LP tends to fall at longer
horizons. Nevertheless, there are non-trivial weights on the LP, especially for inflation, the
federal funds rate, and hours.
The middle and right panels of Figure 3 plot the average biases and standard deviations
of the impulse response functions, providing an explanation for the small weights on the
LP. First, even though the LP has greater flexibility, in many cases its bias tends to be
larger or is at most of similar magnitude relative to the VAR. This arises partly due to
the relatively short sample of 150 periods. Second, the right panels show that the LP has
substantially larger posterior standard deviation, which is consistent with evidence reported
in the literature (Miranda-Agrippino and Ricco, 2021b; Li et al., 2022). The difference in standard deviations is especially large at longer horizons, which accounts for the lower weights on the LP at those horizons.

14 Miranda-Agrippino and Ricco (2021a) do not explicitly model autocorrelation in the residual of their LP specification, but instead use a sandwich-type estimator for the posterior covariance matrix.
15 There are two differences of note relative to Plagborg-Møller and Wolf (2021). First, because we impose a prior, the estimated impulse responses at horizon h > 0 differ even if the least squares estimates are equivalent. In particular, for longer horizons, the likelihood of the LP becomes more dispersed, bringing the posterior closer to the prior. Second, the estimated system (18) differs from the LP setup used in Plagborg-Møller and Wolf (2021).

Figure 3: Prediction pool weights, biases, and posterior standard deviations from Smets and Wouters (2007) Monte Carlo. Rows: GDP, inflation, interest rate, consumption, investment, wages, hours. Biases and standard deviations averaged across simulations. Left: Optimal weights on LP; Middle: Bias of impulse responses; Right: Posterior standard deviation of impulse responses. All plots show averages across all Monte Carlo repetitions.
To get a better sense of what is driving the weights, we focus first on the impulse response
for inflation. The weights on the LP are relatively high, especially in the initial quarters.
Correspondingly, we find that the LP estimates a response that has a smaller bias than
the VAR. Nevertheless, the VAR continues to receive more than half the weight because its
impulse response standard deviation is smaller. At longer horizons, the weight on the LP

falls as the difference in biases shrinks while the difference in standard deviations increases.
The impulse response for hours further illustrates the behavior of the prediction pools.
The weight on the LP increases over the first four quarters but does not decay as quickly
as for other variables. Even though the standard deviations are similar initially, the LP
displays a substantially larger bias than the VAR at short horizons, reducing its optimal
weight. Subsequently, the LP and VAR have biases of opposite signs that offset each other
when averaged, as was the case in our previous Monte Carlo exercise. By averaging the
impulse responses, the prediction pool can produce an average impulse response that has
a smaller bias than either model, with the bias almost completely eliminated at horizon
h = 15. At longer horizons, the weights trade off two forces. First, the VAR bias begins to
increase while the LP bias begins to decrease. Second, the LP posterior standard deviation
increases while the VAR standard deviation remains relatively constant. On balance, the
weights begin to favor the LP less at longer horizons, but with a decline that is less steep
than in other variables.
Overall, the results here emphasize two key messages. First, the relative biases and
variances of the models differ depending on variable and horizon. Prediction pools offer
the flexibility to trade off these properties variable by variable and horizon by horizon, thus
making full use of the relative strengths of each model. Second, even when models have
similar asymptotic properties, there can be substantial gains from averaging over them in
finite sample. In particular, the bias of the average impulse response can in some cases be
lower than that of either individual model.

4 Empirical Applications

We now apply our methodology to estimating impulse responses to monetary shocks on actual
data. We consider two identification schemes, where we use external instruments that have
recently become popular in the macroeconomic literature, namely a high-frequency identification instrument and a narrative instrument. These instruments have featured prominently
in many LP applications, but have recently also been used in VARs. In both applications
we consider a range of plausible empirical models in our prediction pools. Overall, our applications indicate that prediction pools offer a more plausibly accurate assessment of the
dynamic effects of monetary shocks as they optimally resolve the bias-variance trade-off,
especially when, as is likely, the underlying models are misspecified.


4.1 High-Frequency Identification

The first empirical application uses a monthly VAR to study the effects of a monetary shock
using the high-frequency identification instrument from Gertler and Karadi (2015), similar
to Caldara and Herbst (2019). In particular, we use data on industrial production (IP),
unemployment, the producer price index for finished goods, the federal funds rate, and the
Baa corporate bond spread. We consider a sample with monthly data from March 1990
through November 2007. As in the Smets and Wouters (2007) Monte Carlo exercise, we
consider two models: a Bayesian internal instruments VAR and a Bayesian LP. We choose
12 lags for both models and use a Minnesota prior. The tightness parameter for the VAR
is selected to maximize the marginal likelihood, while the tightness parameter for the LP is
fixed at 1.16
Results. The empirical application results, shown in Figure 4, illustrate the importance
of computing the weights variable by variable and horizon by horizon, rather than having a
single weight for all variables and horizons. In particular, while almost all the weight for the
IP impulse response is placed on the VAR, the prediction pool places the majority of the weight on the LP at some horizons for each of the other variables. In addition, most of the variables feature weights on the LP that range from zero to one depending on the horizon. These results extend the typical finding that relative forecast performance depends on variable and horizon to our interpretation of impulse responses as conditional forecasts.
The large weights attached to the LP for certain variables and horizons contrast with
the Smets and Wouters (2007) Monte Carlo, where the average weights were never much
above 0.5. This reiterates the message that the relative performance of each model depends
on the setting, including the data-generating process and sample size.
The impulse response of unemployment provides one example of how the averaged impulse response may be qualitatively different from each individual model. The prediction pool
places approximately equal weight on the LP and VAR after the three-year horizon. The
resulting averaged impulse response has a peak response from unemployment approximately
two-and-a-half years after the shock that subsequently reverts to zero, in contrast to the
persistent negative and positive responses from the LP and VAR, respectively.
There are also cases where we find intuitive reasons for more uneven weights. For example,
in the IP impulse response, the prediction pool places almost all the weight on the VAR.
The average impulse response thus displays a persistent decline in the level of IP in response to a contractionary monetary shock. This is closer to predictions from economic theory than the persistent positive response that the LP finds.

16 The prior for the LP is chosen to be flatter than in the VAR because the likelihood for the LP is dominated by a Minnesota prior with tightness parameter matching the VAR. A tightness parameter of 1 provides some shrinkage while allowing the data to speak.

Figure 4: Prediction pool weights, posterior mean, and posterior standard deviation from high-frequency identification empirical application. Rows: IP, unemployment, PPI, Baa spread, FFR. Left: Optimal weights on LP; Middle: Posterior mean of impulse responses; Right: Posterior standard deviation of impulse responses.
An intermediate case is the impulse response for the federal funds rate at long horizons.
Here the LP is favored but the VAR continues to receive a nontrivial weight. The resulting
point estimate is positive, but with a large standard deviation. Once again, the averaged
impulse response is arguably more reasonable than each individual response. Relative to the
LP, the averaged impulse response has a mean that is slightly closer to zero. Unlike the VAR
that produced a relatively tightly estimated negative response, the averaged response has a
large standard deviation.
As in the Monte Carlo exercises, we see benefits to averaging the impulse responses using
prediction pools. The averaged impulse responses are more plausible than either the LP or
the VAR. Even more than the Monte Carlos, the flexibility of the prediction pools is critical,
with the weights varying dramatically across variables and horizons.

4.2 Narrative Instrument

Our second empirical application follows the study of the Romer and Romer (2004) shocks
in Ramey (2016). In particular, we use monthly data on the log of IP, the unemployment
rate, the log of the CPI, the log of a commodity price index, the federal funds rate, and the
Romer and Romer (2004) instrument for March 1969 through December 1996.
We consider four models, each estimated using frequentist methods:
1. Cholesky VAR. Following Coibion (2012), we estimate a VAR with the log of IP,
the unemployment rate, the log of the CPI, and the log of the commodity price index
in the first block, followed by the cumulated Romer and Romer (2004) instrument
ordered last. The monetary shock is assumed to be the last shock from a Cholesky
decomposition.
2. Internal Instrument VAR. We estimate a VAR with the Romer and Romer (2004)
instrument as the first variable, followed by the log of IP, the unemployment rate, the
log of the CPI, the log of the commodity price index, and the federal funds rate. The
monetary shock is assumed to be the first shock from a Cholesky decomposition.
3. LP With Recursiveness Assumption. We follow Ramey (2016) and estimate
regressions of the form
$$z_{t+h} = \alpha_h + \theta_h \cdot \mathrm{shock}_t + \text{control variables} + \varepsilon_{t+h}, \qquad (19)$$

where $z_{t+h}$ is the variable of interest and $\mathrm{shock}_t$ is the Romer and Romer (2004) instrument.
The control variables include lags of the Romer and Romer (2004) shock, the log of
IP, the unemployment rate, the log of the CPI, the log of the commodity price index,
and the federal funds rate, as well as contemporaneous values of the log of IP, the
unemployment rate, and the log price indices to preserve the recursiveness assumption,
as in the Cholesky VAR.
4. LP Without Recursiveness Assumption. This is identical to the LP with the
recursiveness assumption, except that we do not control for contemporaneous variables.
This makes the assumption that the Greenbook forecasts used by Romer and Romer
(2004) already include all information used by the Fed for setting interest rates.
Following Ramey (2016), both VARs use twelve lags and both LPs use two lags. In principle,
all four models estimate the same impulse response using the same instrument. However, the
models make different identifying assumptions, include different controls, and have different
23

[Figure 5 about here. Rows: IP, Unemployment, CPI, CPI (Commodities); columns: Weights, IRF Mean, IRF Standard Deviation; horizons 0-40 months; legend: VAR (Cholesky), VAR (internal instrument), LP (non-recursive), LP (recursive), average.]
Figure 5: Prediction pool weights, posterior mean, and posterior standard deviation from the narrative identification empirical application. Left: Optimal weights; Middle: Posterior mean of impulse responses; Right: Posterior standard deviation of impulse responses.
Results. The results are summarized in Figure 5. In general, the majority of the weight is placed on the LPs on impact, while the VARs are assigned greater weight at longer horizons. This is similar to the Smets and Wouters (2007) Monte Carlo and differs from the high-frequency instrument application in Section 4.1. However, unlike the Smets and Wouters (2007) Monte Carlo, the relative weights are arguably not tightly connected with the respective standard deviations. In particular, the internal instrument VAR receives a large weight even though it generally has a higher variance than the Cholesky VAR and a similar variance to the LP without the recursiveness assumption.
The impulse response for IP yields two interesting observations. First, at the one- to three-year horizon, the prediction pool places the majority of the weight first on the internal instrument VAR and then on the recursive LP, yielding a deeper and more prolonged contraction in response to the identified shock than implied by the Cholesky VAR. The fact that the latter is not favored despite its lower variance suggests that the average impulse response provides a better fit to the data. However, after three years, the Cholesky VAR becomes heavily weighted, implying a rebound in IP rather than a convergence back to trend.
The weights for the unemployment and CPI impulse responses are mostly on the internal instrument VAR. One notable exception is unemployment at long horizons, where, as with IP, the weights favor the Cholesky VAR and are associated with a rebound in economic activity, with unemployment undershooting its trend.
The average impulse response for commodity prices is fairly noisy, with substantial weight
placed on the LPs after the three-year horizon. This results in a high variance at those
horizons. In particular, the variance of the averaged impulse response is higher than that of any individual response at the four-year horizon. Given the volatility of commodity prices, it is plausible that the data are not informative about their response to the identified shock. As a result, an impulse response with high variance may fit the data best from the standpoint of the log predictive score function, which accounts for the full predictive density rather than just the point estimates.
The above results stress the fact that prediction pools utilize the predictive density and not only particular moments of the estimated impulse responses. While tighter estimates may reduce mean-square error, they need not increase the log predictive score function (3). The prediction pool not only favors models whose fit to the data sufficiently dominates the rest of the pool, but may even favor a large variance if prediction is, in fact, difficult.
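To see why a wider predictive density can score better, consider the following small numerical illustration (ours, not drawn from the paper): when outcomes are genuinely volatile, a tight but overconfident density pays a heavy log-score penalty in the tails.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.normal(loc=0.0, scale=1.0, size=200)   # genuinely noisy outcomes

# Tight but overconfident predictive density vs. a wide, honest one.
tight = norm(loc=0.0, scale=0.2)
wide = norm(loc=0.0, scale=1.0)

print("avg log score, tight:", tight.logpdf(y).mean())  # very negative: misses the tails
print("avg log score, wide: ", wide.logpdf(y).mean())   # higher despite larger variance
```

The wide density attains the higher average log score despite its much larger variance, which mirrors the mechanism at work in the commodity price result above.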
As a final exercise, we drop the Cholesky VAR from the set of models and repeat the
exercise. The results are shown in Figure 6. The main observation is that the weight that
was previously on the Cholesky VAR primarily gets transferred to the internal instrument
VAR, which is the closest model. The averaged impulse responses remain relatively similar
to the exercise with all four models, illustrating the consistency of the weights assigned as
we change the set of models (see Appendix B for details).

5 Conclusion

In this paper, we develop a methodology that delivers an encompassing approach to computing dynamic responses of macroeconomic variables to shocks. A fundamental challenge for empirical researchers is the choice of a specific model for such an exercise, since it is well known that each model comes with its own issues, such as bias or large standard errors. Based on the idea of prediction pools, as in Geweke and Amisano (2011), we show how to average across impulse responses from multiple models. Our framework thus presents an approach to incorporating evidence from a variety of models in a consistent and plausible manner, getting closer to the true dynamics in real economic data.
[Figure 6 about here. Rows: IP, Unemployment, CPI, CPI (Commodities); columns: Weights, IRF Mean, IRF Standard Deviation; horizons 0-40 months; legend: VAR (internal instrument), LP (non-recursive), LP (recursive), average.]
Figure 6: Prediction pool weights, posterior mean, and posterior standard deviation from the narrative identification empirical application, with the Cholesky VAR dropped from the pool. Left: Optimal weights; Middle: Posterior mean of impulse responses; Right: Posterior standard deviation of impulse responses.
General theorems about which class of models should be used in empirical macroeconomics are hard to come by once we make realistic assumptions. This has led to many alternative impulse response estimators coexisting in the literature. We exploit the fact that each of these can be useful in particular situations, which makes empirical macroeconomics an ideal setting for flexible model-averaging schemes. Our prediction-pool-based approach makes model averaging in these scenarios possible.
The key differences relative to existing methods are (i) much greater flexibility, which allows us to use, for example, estimators based on different statistical paradigms; (ii) a relatively small computational burden, especially since the most time-consuming task can be parallelized; and (iii) the ability to exploit each method’s strengths as much as possible by computing horizon- and variable-specific weights based on predictive densities instead of a few selected statistics.
Overall, our Monte Carlos and empirical applications highlight several broad messages
for estimating impulse responses:
1. The optimal weights on models depend on horizon, variable, and application.
2. The choice of model depends on the entire predictive distribution, not only the point
estimates. Our examples focus on the mean and variance, but skewness, kurtosis, or
any other property of the predictive distribution could be important more generally.
3. Theoretical properties of individual models are not sufficient criteria for the choice of weights. For instance, misspecified models may dominate correctly specified (or more flexible) models in finite samples. On the other hand, models that produce tighter estimates need not receive greater weight.
Our use of prediction pools provides a systematic and computationally tractable way to
account for these issues in a wide range of applications.
The flexibility of our methodology raises the possibility of numerous applications beyond
what we have explored here. For example, our approach can also be extended to average
across different identification schemes for the same shock (for example, studying the effects
of monetary policy shocks using a VAR identified with sign restrictions and a VAR identified
using an instrument). Furthermore, one could envision using our approach to discriminate
between various equilibrium models that encode different transmission mechanisms for the
shock of interest.


References
Amisano, Gianni and John Geweke (2017), “Prediction Using Several Macroeconomic Models.” Review of Economics and Statistics, 99, 912–925.
Baek, ChaeWon and Byoungchan Lee (2022), “A Guide to Autoregressive Distributed Lag
Models for Impulse Response Estimations.” Oxford Bulletin of Economics and Statistics.
Bagliano, Fabio C. and Carlo A. Favero (1999), “Information from Financial Markets and
VAR Measures of Monetary Policy.” European Economic Review, 43, 825–837.
Barnichon, Regis and Christian Brownlees (2019), “Impulse Response Estimation by Smooth
Local Projections.” Review of Economics and Statistics, 101, 522–530.
Bates, J. M. and C. W. J. Granger (1969), “The Combination of Forecasts.” Journal of the
Operational Research Society, 20, 451–468.
Bruns, Martin and Helmut Lütkepohl (2022), “Comparison of Local Projection Estimators for Proxy Vector Autoregressions.” Journal of Economic Dynamics and Control, 134,
104277.
Caldara, Dario and Edward Herbst (2019), “Monetary Policy, Real Activity, and Credit
Spreads: Evidence from Bayesian Proxy SVARs.” American Economic Journal: Macroeconomics, 11, 157–192.
Canova, Fabio and Christian Matthes (2021), “A Composite Likelihood Approach for Dynamic Structural Models.” Economic Journal, 131, 2447–2477.
Cogley, Timothy and Thomas J. Sargent (2005), “Drift and Volatilities: Monetary Policies
and Outcomes in the Post WWII U.S.” Review of Economic Dynamics, 8, 262–302.
Coibion, Olivier (2012), “Are the Effects of Monetary Policy Shocks Big or Small?” American
Economic Journal: Macroeconomics, 4, 1–32.
Del Negro, Marco, Raiden B. Hasegawa, and Frank Schorfheide (2016), “Dynamic Prediction
Pools: An Investigation of Financial Frictions and Forecasting Performance.” Journal of
Econometrics, 192, 391–405.
Del Negro, Marco and Frank Schorfheide (2004), “Priors from General Equilibrium Models
for VARs.” International Economic Review, 45, 643–673.

Doan, Thomas, Robert Litterman, and Christopher Sims (1984), “Forecasting and Conditional Projection Using Realistic Prior Distributions.” Econometric Reviews, 3, 1–100.
Gertler, Mark and Peter Karadi (2015), “Monetary Policy Surprises, Credit Costs, and
Economic Activity.” American Economic Journal: Macroeconomics, 7, 44–76.
Geweke, John and Gianni Amisano (2011), “Optimal Prediction Pools.” Journal of Econometrics, 164, 130–141.
Geweke, John and Gianni Amisano (2012), “Prediction with Misspecified Models.” American
Economic Review: Papers & Proceedings, 102, 482–86.
Giannone, Domenico, Michele Lenza, and Giorgio E Primiceri (2015), “Prior Selection for
Vector Autoregressions.” Review of Economics and Statistics, 97, 436–451.
Gonçalves, Sílvia and Lutz Kilian (2004), “Bootstrapping Autoregressions with Conditional
Heteroskedasticity of Unknown Form.” Journal of Econometrics, 123, 89–120.
Hall, Stephen G. and James Mitchell (2007), “Combining Density Forecasts.” International
Journal of Forecasting, 23, 1–13.
Hansen, Bruce E. (2007), “Least Squares Model Averaging.” Econometrica, 75, 1175–1189.
Hansen, Bruce E. (2016), “Stein Combination Shrinkage for Vector Autoregressions.” Working paper, University of Wisconsin, Madison.
Herbst, Edward and Benjamin K. Johannsen (2020), “Bias in Local Projections.” Finance
and Economics Discussion Series 2020-010, Washington: Board of Governors of the Federal
Reserve System.
Jordà, Òscar (2005), “Estimation and Inference of Impulse Responses by Local Projections.”
American Economic Review, 95, 161–182.
Kilian, Lutz (1998), “Small-Sample Confidence Intervals for Impulse Response Functions.”
Review of Economics and Statistics, 80, 218–230.
Li, Dake, Mikkel Plagborg-Møller, and Christian K. Wolf (2022), “Local Projections vs.
VARs: Lessons From Thousands of DGPs.” Working paper.
Lusompa, Amaze (2021), “Local Projections, Autocorrelation, and Efficiency.” Federal Reserve Bank of Kansas City Working Paper 21-01.

Marcellino, Massimiliano, James H. Stock, and Mark W. Watson (2006), “A Comparison of
Direct and Iterated Multistep AR Methods for Forecasting Macroeconomic Time Series.”
Journal of Econometrics, 135, 499–526.
Miranda-Agrippino, Silvia and Giovanni Ricco (2021a), “Bayesian Local Projections.” Working paper.
Miranda-Agrippino, Silvia and Giovanni Ricco (2021b), “The Transmission of Monetary
Policy Shocks.” American Economic Journal: Macroeconomics, 13, 74–107.
Montiel Olea, José Luis and Mikkel Plagborg-Møller (2021), “Local Projection Inference is
Simpler and More Robust Than You Think.” Econometrica, 89, 1789–1823.
Noh, Eul (2018), “Impulse-Response Analysis with Proxy Variables.” Working paper, University of California San Diego.
Paul, Pascal (2020), “The Time-Varying Effect of Monetary Policy on Asset Prices.” Review
of Economics and Statistics, 102, 690–704.
Pesavento, Elena and Barbara Rossi (2006), “Small-Sample Confidence Intervals for Multivariate Impulse Response Functions at Long Horizons.” Journal of Applied Econometrics,
21, 1135–1155.
Plagborg-Møller, Mikkel and Christian K. Wolf (2021), “Local Projections and VARs Estimate the Same Impulse Responses.” Econometrica, 89, 955–980.
Primiceri, Giorgio E. (2005), “Time Varying Structural Vector Autoregressions and Monetary Policy.” Review of Economic Studies, 72, 821–852.
Qu, Zhongjun (2018), “A Composite Likelihood Framework for Analyzing Singular DSGE
Models.” The Review of Economics and Statistics, 100, 916–932.
Ramey, Valerie A. (2016), “Macroeconomic Shocks and Their Propagation.” Handbook of
Macroeconomics, 2, 71–162.
Romer, Christina D. and David H. Romer (2004), “A New Measure of Monetary Shocks:
Derivation and Implications.” American Economic Review, 94, 1055–1084.
Sims, Christopher A. (1980), “Macroeconomics and Reality.” Econometrica, 48, 1–48.
Sims, Christopher A. and Tao Zha (1999), “Error Bands for Impulse Responses.” Econometrica, 67, 1113–1155.
Sims, Christopher A. and Tao Zha (2006), “Were There Regime Switches in U.S. Monetary
Policy?” American Economic Review, 96, 54–81.
Smets, Frank and Rafael Wouters (2007), “Shocks and Frictions in US Business Cycles: A
Bayesian DSGE approach.” American Economic Review, 97, 586–606.
Stock, James H. and Mark W. Watson (2016), “Dynamic Factor Models, Factor-Augmented Vector Autoregressions, and Structural Vector Autoregressions in Macroeconomics.” In Handbook of Macroeconomics (J. B. Taylor and Harald Uhlig, eds.), volume 2, 415–525, Elsevier.
Stock, James H. and Mark W. Watson (2018), “Identification and Estimation of Dynamic Causal Effects in Macroeconomics Using External Instruments.” The Economic Journal, 128, 917–948.
Stone, M. (1961), “The Opinion Pool.” The Annals of Mathematical Statistics, 32, 1339 –
1342.
Strachan, R.W. and H.K. van Dijk (2007), “Bayesian Model Averaging in Vector Autoregressive Processes With an Investigation of Stability of the US Great Ratios and Risk
of a Liquidity Trap in the USA, UK and Japan.” Econometric Institute Research Papers EI 2007-11, Erasmus University Rotterdam, Erasmus School of Economics (ESE),
Econometric Institute.
Waggoner, Daniel F. and Tao Zha (2012), “Confronting Model Misspecification in Macroeconomics.” Journal of Econometrics, 171, 167–184.


A Supplementary Figures

[Figure A.1 about here. Panels: Inflation, GDP, Investment, Consumption, Interest Rate, Hours, Wages; horizons 0-16; legend: LP, VAR, average, true.]

Figure A.1: Posterior mean estimates of the impulse response to the monetary policy shock in the New Keynesian model Monte Carlo, averaged across simulations. GDP, consumption, investment, and wages are in growth rates.

[Figure A.2 about here. Panels: Weight on LP, IRF Bias, IRF Standard Deviation; horizons 0-20; legend: LP, VAR, average.]

Figure A.2: Prediction pool weights, biases, and asymptotic standard deviations from the Monte Carlo with persistent shocks, without sample splitting. Biases and standard deviations are averaged across simulations. Left: Optimal weights on LP; Middle: Bias of impulse responses; Right: Standard deviation of impulse responses.


B Theoretical Results of Geweke and Amisano (2011)

In this section we state major theoretical results from Geweke and Amisano (2011) for the
sake of completeness. Let us first explicitly state our objective function:

f_T(w_h) = \sum_{t=1}^{T} \log\left[\sum_{m=1}^{M} w_{m,h}\, p^{*}_{m}(y_{t+h,j})\right], \qquad \max_{w_h} f_T(w_h) \;\text{ s.t. }\; \sum_{m=1}^{M} w_{m,h} = 1,\; w_{m,h} \geq 0, \qquad \text{(A.1)}

where w_h = [w_{1,h} \; w_{2,h} \; \cdots \; w_{M,h}]' is the vector of model weights for a given horizon h (and a given variable j, which we have not made explicit in this notation). We will generally assume that f_T(w_h) is concave, i.e., that \partial^2 f_T / \partial w_h \partial w_h' is negative definite. For the case of two models (M = 2), Geweke and Amisano (2011) show that the objective function is concave if the expected difference between the two predictive densities does not go to zero as the sample size increases.^{17} We will call a subset of models dominant if its weights sum to 1. A subset of models is excluded if each of those models has a weight of 0. Under the assumption of concavity, Geweke and Amisano (2011) show the following results:
1. If \{M_1, \ldots, M_m\} dominates the pool \{M_1, \ldots, M_n\}, then \{M_1, \ldots, M_m\} dominates \{M_1, \ldots, M_m, M_{j_1}, \ldots, M_{j_k}\} for all \{j_1, \ldots, j_k\} \subseteq \{m+1, \ldots, n\}.
2. If \{M_1, \ldots, M_m\} dominates all pools \{M_1, \ldots, M_m, M_j\} (j = m+1, \ldots, n), then \{M_1, \ldots, M_m\} dominates the pool \{M_1, \ldots, M_n\}.
3. The set of models \{M_1, \ldots, M_m\} is excluded in the pool \{M_1, \ldots, M_n\} if and only if M_j is excluded in each of the pools \{M_j, M_{m+1}, \ldots, M_n\} (j = 1, \ldots, m).
4. If the model M_1 is excluded in all pools \{M_1, M_i\} (i = 2, \ldots, n), then M_1 is excluded in the pool \{M_1, \ldots, M_n\}.
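To make the optimization in (A.1) concrete, here is a minimal sketch of how the weights might be computed for one variable and horizon; the function name, the solver choice, and the toy inputs are our illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize

def optimal_pool_weights(pred_dens):
    """Maximize sum_t log(sum_m w_m * p_mt) over the unit simplex.

    pred_dens: (M, T) array; entry (m, t) is model m's predictive density
    evaluated at the realized observation for period t (for a fixed
    variable j and horizon h, as in equation (A.1)).
    """
    M = pred_dens.shape[0]

    def neg_log_score(w):
        pooled = w @ pred_dens                     # (T,) pooled densities
        return -np.sum(np.log(np.maximum(pooled, 1e-300)))  # guard log(0)

    constraints = ({"type": "eq", "fun": lambda w: w.sum() - 1.0},)
    bounds = [(0.0, 1.0)] * M
    w0 = np.full(M, 1.0 / M)                       # start from equal weights
    res = minimize(neg_log_score, w0, bounds=bounds,
                   constraints=constraints, method="SLSQP")
    return res.x

# Toy example: model 0 fits the data much better at every t, so it should
# dominate and model 1 should be (numerically) excluded, consistent with
# results 3 and 4 above.
rng = np.random.default_rng(1)
good = np.exp(rng.normal(-1.0, 0.1, size=100))     # uniformly high densities
bad = np.exp(rng.normal(-3.0, 0.1, size=100))      # uniformly low densities
print(optimal_pool_weights(np.vstack([good, bad])))  # approx [1, 0]
```

Because f_T is concave under the assumption above and the feasible set is a simplex, any local maximum found this way is global; in practice one would loop this computation over variables j and horizons h.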

^{17} Note that even in the case of LPs and VARs with the same right-hand side variables, it is unlikely (at least at longer horizons) that the implied predictive densities are the same, even though the VAR specification is nested in the LPs.
