Global Robust Bayesian Analysis in Large Models∗
WP 20-07
Paul Ho†
Federal Reserve Bank of Richmond‡
paul.ho@rich.frb.org
June 30, 2020

Abstract

This paper develops a tool for global prior sensitivity analysis in large Bayesian models. Without imposing parametric restrictions, the methodology provides bounds for posterior means or quantiles given any prior close to the original in relative entropy, and reveals features of the prior that are important for the posterior statistics of interest. We develop a sequential Monte Carlo algorithm and use approximations to the likelihood and statistic of interest to implement the calculations. Applying the methodology to the error bands for the impulse response of output to a monetary policy shock in the New Keynesian model of Smets and Wouters (2007), we show that the upper bound of the error bands is very sensitive to the prior but the lower bound is not, with the prior on wage rigidity playing a particularly important role.

∗ Download the latest version of the paper here. I am indebted to Jaroslav Borovička, Chris Sims, and Mark Watson for their guidance. I thank Timothy Christensen, Liyu Dou, Ulrich Müller, Mikkel Plagborg-Møller, Frank Schorfheide, Denis Tkachenko, and numerous seminar and conference participants for comments and suggestions. I am also grateful for financial support from the Alfred P. Sloan Foundation, the CME Group Foundation, Fidelity Management & Research, the Macro Financial Modeling Initiative, and the International Association for Applied Econometrics, which I received for this project while working on it as a student at Princeton University.
‡ The views expressed herein are those of the author and do not necessarily represent the views of the Federal Reserve Bank of Richmond or the Federal Reserve System.

1 Introduction

In Bayesian estimation, we are confronted with the questions of how much our posterior estimates depend on our prior and which parts of the prior are most important. Tackling these questions analytically is difficult because both the likelihood and the statistics of interest may be complicated functions of the parameters over which we define the prior. On the other hand, it is infeasible to repeat the estimation for all possible priors. In particular, given the complicated dependence of the posterior statistic on the prior, one may wish to consider a vast range of nonparametric changes in the joint prior of multiple parameters, rather than limiting oneself to a restrictive class of priors that relies on assumptions such as independence or distributional form. Existing prior sensitivity tools either restrict one to infinitesimal parametric changes to the prior or are infeasible outside relatively simple or low-dimensional settings.

This paper develops a method to investigate the sensitivity of conclusions across a nonparametric set of priors, while remaining feasible in large models. We refer to this method as relative entropy prior sensitivity (REPS). The calculation searches across priors that are close to the original prior in relative entropy, then finds the worst-case prior that leads to the largest change in the reported posterior estimates. Even though the set of alternative priors is nonparametric, the solution for the worst-case prior and posterior requires solving for only one scalar, regardless of the number of parameters, then reweighting draws from the original prior and posterior.
To overcome cases where direct reweighting results in poor approximations for the worst-case distributions, we develop a sequential Monte Carlo (SMC) algorithm to obtain draws from these distributions.

The prior sensitivity analysis informs an econometrician of how sensitive her posterior results are to the prior, and identifies features of the prior that are important for these results. For example, if the econometrician reports the posterior mean of an elasticity, she would search for the priors that respectively minimize and maximize this mean for a given relative entropy, thus obtaining bounds on the posterior mean. The worst-case prior will differ most from the original prior in dimensions that are most important for the posterior mean. These are parts of the parameter space that are not well identified by the likelihood but matter for the posterior mean. One should be particularly concerned if the posterior is sensitive to features of the prior arising from ad hoc assumptions such as independence or distributional forms that were used solely for convenience.

To generate draws from the worst-case prior and posterior in complicated and high-dimensional settings, we adapt the SMC algorithm of Herbst and Schorfheide (2014) and use approximations of the likelihood and statistic of interest. In principle, one could take a large number of draws from the original prior and posterior, solve for the worst-case prior and posterior, then use importance sampling to reweight the draws from the original distributions. However, importance sampling performs poorly when the distribution of these weights has fat tails. SMC overcomes the challenge by solving for a sequence of intermediate priors and posteriors, and recursively obtaining draws from each of these intermediate distributions. Nevertheless, SMC can become computationally infeasible due to the need to repeatedly compute the likelihood and statistic of interest. To reduce the computational burden, we use approximations of the inputs of the REPS calculations in a procedure we refer to as approximate relative entropy prior sensitivity (AREPS). AREPS yields draws from an approximate worst-case prior and posterior, which can then be reweighted to obtain draws from the exact worst-case distributions. In existing applications, the computational time for AREPS is of the same order of magnitude as that of estimating the model once.

To gauge the sensitivity of one's posterior results to the prior, we provide a rule of thumb for what a large or small relative entropy is. The rule of thumb consists of a formula that has one free parameter, which we calibrate using the asymptotic behavior of the Gaussian location model. This allows a practitioner to quantify the sensitivity of her posterior to the prior by measuring how much a given change in the prior can affect the posterior estimate, and comparing the sensitivity to a Gaussian location model with a large number of observations. The comparison indicates the prior sensitivity of the estimation relative to what one would have concluded by looking only at prior and posterior variances.

Our main application is the impulse response of output to a monetary policy shock in the New Keynesian model of Smets and Wouters (2007). We use REPS to construct bounds that contain pointwise 68% error bands arising from any prior in a relative entropy ball around the original prior, and to compare the bounds when we distort the prior for the Taylor rule parameters to when we distort the prior for the nominal frictions parameters.
In contrast to Müller (2012), who finds that the impulse responses are relatively insensitive to the priors for the structural parameters, REPS reveals that the upper bound of the error bands is very sensitive to changes to the prior but the lower bound is not. The impulse response is more dependent on the prior at long horizons. The impulse response is more sensitive to the prior of the nominal frictions parameters than the Taylor rule parameters, and is especially sensitive to the prior on the wage rigidity parameter.

REPS detects dependence on the prior in the Smets and Wouters (2007) estimation that would be hard to discern with other approaches to prior sensitivity such as that of Müller (2012). The worst-case prior adds mass to certain regions of the tail of the posterior where the likelihood is large. In addition, the worst-case distortions show that joint changes in the prior can result in larger changes in the posterior than if one were to distort the marginals only. Without having to reestimate the model for different priors, we find results that are consistent with the extensive literature studying the New Keynesian model of Smets and Wouters (2007), providing support for the validity of the methodology and the numerical implementation.

Related literature. REPS overcomes key limitations of existing approaches to prior sensitivity analysis.¹ Local methods (e.g., Gustafson (2000); Müller (2012)) consider derivatives of specific posterior quantities with respect to the prior. These methods focus on only infinitesimal changes in the prior and posterior and are often restricted to parametric changes in the prior. Our main application and stylized examples show that these restrictions can result in misleading conclusions about prior sensitivity. REPS does not impose such restrictions, allowing for joint nonparametric distortions across parameters. Global methods (e.g., Berger and Berliner (1986); Moreno (2000)) allow for a wider class of priors and consider potentially large changes in the posterior, but are infeasible outside a limited range of applications. REPS is a global method that is feasible in settings with high-dimensional and complicated likelihoods such as dynamic stochastic general equilibrium (DSGE) models. Moreover, it can be applied to a range of statistics such as means and credible intervals of a wide class of functionals of the model parameters. REPS is also a more general methodology that applies to any Bayesian estimation problem, in contrast to the literature focusing on prior sensitivity in partial identification problems (e.g., Giacomini and Kitagawa (2018); Giacomini, Kitagawa, and Uhlig (2019)).

The use of relative entropy follows the robust control literature (Petersen et al. (2000); Hansen and Sargent (2001)). The key difference between REPS and the existing robust control literature is that the worst-case prior from REPS depends on the likelihood. Hansen and Sargent (2007) also solve for a worst-case prior that is constrained to be close to an economic agent's original prior in relative entropy. However, they consider an ex-ante problem that does not condition on the observed data. In contrast, here we consider an econometrician analyzing the prior after observing data, and therefore condition our worst-case prior on the data. Conditioning on the data allows REPS to account for characteristics of the likelihood that are important for the posterior results.
In related work, Giacomini, Kitagawa, and Uhlig (2019) construct a relative entropy ball around the prior for set-identified parameters conditional on the identified subvector of parameters. In contrast, when measuring the sensitivity of the posterior to the prior of a subvector of the parameters, REPS focuses on the marginal prior rather than the conditional prior.

¹ The econometrics literature on prior sensitivity analysis dates back to Chamberlain and Leamer (1976) and Leamer (1982). The early statistics literature on the topic is reviewed in Berger et al. (1994).

The importance of prior sensitivity analysis is especially salient in Bayesian DSGE models like our main application (see Herbst and Schorfheide (2015) and Fernández-Villaverde et al. (2016) for overviews). These models have many parameters connected by numerous equilibrium conditions. As a result, priors typically rely on simplifying assumptions such as independence or conjugacy, which potentially matter for the posterior. For example, Del Negro and Schorfheide (2008) show that the joint prior matters for the posterior estimates of the role of nominal rigidities. Moreover, some of the parameters do not have a tight range of values that is widely accepted. Systematic prior sensitivity analysis provides diagnostics for whether an audience with heterogeneous priors should be concerned about one's posterior estimates.

The need for prior sensitivity analysis is further motivated by the identification problems in DSGE models described by Canova and Sala (2009). In contrast to the subsequent literature on identification in DSGE models (e.g., Iskrev (2010); Komunjer and Ng (2011); Koop et al. (2013)), which primarily focuses on asymptotic identification, the framework here takes the Bayesian approach of conditioning on current observed data.

While our main application is a DSGE model, REPS can be applied to any Bayesian estimation. Ho (2020) applies the REPS methodology to an overlapping generations model, showing the importance of capital and lifecycle consumption data for identifying the effects of an aging population on interest rates. Bayesian methods are also widely used in the estimation of vector autoregressions (VARs). Del Negro and Schorfheide (2004) and Giannone et al. (2018) show that the priors in VARs play an important role for forecasting. Baumeister and Hamilton (2015) and Giacomini, Kitagawa, and Read (2019) show that priors are important when structural VARs are partially identified. Finally, Bayesian methods have been used by Abdulkadiroglu et al. (2017) and Avery et al. (2013) to estimate matching models.

Outline. I introduce the REPS framework in Section 2 and demonstrate the methodology using two stylized examples in Section 3. Section 4 derives the rule of thumb to quantify the difference between priors. I discuss implementation in Section 5. In Section 6, I apply the methodology to the DSGE model of Smets and Wouters (2007). Section 7 concludes.

2 Relative entropy prior sensitivity

2.1 Setting and notation

Consider the Bayesian estimation of a parameter θ ∈ Θ given data X. Bayes' rule states that the prior π(θ) and likelihood L(θ|X) imply the posterior p(θ|X) ∝ π(θ)L(θ|X). Suppose we are interested in the posterior of a function ψ : Θ → Ψ of the parameter θ, where Ψ may be multidimensional. For example, ψ(θ) could be an elasticity, a variance decomposition, or an impulse response function at an arbitrary range of horizons.
Denote the expectation under an arbitrary probability measure f by Ef[·], and define the objective function γψ : Ψ → R so that Ep[γψ(ψ)] captures the property of the posterior of ψ that we are interested in. For instance, if we set γψ(ψ) = ψ, then Ep[γψ(ψ)] is the posterior mean of ψ. Alternatively, if we take γψ(ψ) = 1{ψ ≤ ψ*}, then Ep[γψ(ψ)] is the cumulative distribution function of ψ evaluated at ψ*. We denote γ(θ) ≡ γψ(ψ(θ)), and study how Ep[γ(θ)] depends on π.

To study the dependence of the posterior p on the prior π, we need to describe the distorted posterior implied by an alternative prior. In particular, given an alternative prior π̃ that is absolutely continuous with respect to π, we can write:

    π̃(θ) ≡ M(θ) π(θ),    (2.1)

where M is the Radon-Nikodym derivative of π̃ with respect to π. Since π̃ is a probability distribution, we have M > 0 and Eπ[M] = 1. Given the likelihood L, the prior π̃ implies the distorted posterior:

    p̃(θ|X) = M(θ) p(θ|X) / Ep[M].    (2.2)

The normalization by Ep[M] ensures that p̃ integrates to one. For any function g : Θ → G, the prior and posterior expectations arising from the alternative prior π̃ can be written Eπ[M(θ)g(θ)] and Ep[M(θ)g(θ)]/Ep[M], respectively.

2.2 Setup

To analyze the sensitivity of the posterior estimates to the prior, we search across a set of alternative priors that are close to the original prior in relative entropy, seeking the worst-case prior that yields the largest change in the posterior mean of the objective function γ. Comparing the change in the prior to the change in the posterior mean of γ tells us how much the posterior mean of γ depends on the prior. Comparing the worst-case and original priors reveals parts of the prior that are important for determining the posterior mean of γ.

Primal problem. Formally, we consider:

    min over M(θ) with M > 0, Eπ[M] = 1 of  Ep[M(θ)γ(θ)] / Ep[M]    (2.3)
    s.t.  Eπ[M(θ) log M(θ)] ≤ R.    (2.4)

The minimization over M satisfying M > 0 and Eπ[M] = 1 is equivalent to minimizing over alternative priors, as the random variable M indexes the possible priors. We choose the prior that minimizes Ep[M(θ)γ(θ)]/Ep[M], the distorted posterior mean of γ. Replacing the minimization operator with maximization gives the upper bound for the posterior mean.

The left-hand side of (2.4) is the relative entropy or Kullback-Leibler divergence of the alternative prior relative to the original prior. The constant R ∈ R+ provides an upper bound on the relative entropy, limiting us to priors that are statistically difficult to distinguish from the original prior, which we implicitly assume to contain useful information about the distribution of θ. As R → 0, the worst-case and original priors converge, as we are restricted to choosing M = 1. Section 4 gives benchmarks for large and small values of R.

There are several reasons for using relative entropy to constrain the set of priors. Firstly, relative entropy has theoretical justification. It measures the information that a Bayesian with prior π needs to gather to change her beliefs to the alternative prior Mπ, and is invariant to the parameterization of θ. Secondly, relative entropy does not impose parametric restrictions on the prior distortions, allowing us to analyze how distributional assumptions on the prior affect the posterior estimate. For instance, even if π were an independent Gaussian prior, the relative entropy set of priors would include non-Gaussian and correlated priors.
Thirdly, the functional form for relative entropy delivers an analytic solution to (2.3)-(2.4) that allows REPS to maintain tractability in large models. Finally, the use of relative entropy implies relatively weak conditions for the solution to (2.3)-(2.4) to be well-defined. Section 2.3 elaborates on the latter two points.

The problem (2.3)-(2.4) is related to the constraint problem of Hansen and Sargent (2001) and the prior robustness problem from Hansen and Sargent (2007), but differs because the objective function in (2.3) and the relative entropy in (2.4) are taken under different probability measures. This difference in probability measures arises because we are interested in how ex-ante beliefs affect ex-post estimates. Since we wish to report the posterior estimate of Ep[γ(θ)], our objective function is the distorted posterior mean of γ, which conditions on the observed data. However, we wish to consider small changes in the prior, and are thus led to restrict the relative entropy of the alternative priors with respect to the original prior.

Dual problem. Instead of specifying the bound R on relative entropy, it is convenient to specify the worst-case posterior mean γ̃ and solve the dual problem:

    min over M(θ) with M > 0, Eπ[M] = 1 of  Eπ[M(θ) log M(θ)]    (2.5)
    s.t.  Ep[M(θ)γ(θ)] / Ep[M] = γ̃.    (2.6)

We now search across priors that imply that γ has posterior mean γ̃, picking the one that is closest to the original prior in terms of relative entropy.² We will justify the formulation (2.5)-(2.6), argue that it simplifies the solution, and explain how one can move seamlessly between the primal and dual problems in practice.

² Robertson et al. (2005) minimize relative entropy subject to moment constraints, in order to find the forecasting model satisfying the posterior moment constraints that is closest to some benchmark model. Unlike them, we take the relative entropy of the prior instead of the posterior.

2.3 Solution

The solution to (2.5)-(2.6) has the form:

    M(θ) ∝ exp[λ L(θ|X) (γ(θ) − γ̃)],    (2.7)

where λ ∈ R is a constant to be solved for from the constraint (2.6). We have therefore reduced the minimization over a nonparametric set of priors to a problem with one equation and one unknown, regardless of the dimensionality of θ. This is key to making REPS feasible in large models.

The distortion M depends on the parameter θ through the objective function γ and likelihood L. The worst-case distortion M reweights based on γ because the statistic of interest is the posterior mean of γ. The direction and degree of reweighting depends on λ, which is the Lagrange multiplier on (2.6) scaled by Ep[M].³ If γ̃ < Ep[γ(θ)], then in order to reduce the posterior mean of γ, we require λ < 0 so that the worst-case prior places more weight on smaller values of γ.

³ λ is analogous to the penalty parameter in the multiplier problem of Hansen and Sargent (2001).

The difference between the solution (2.7) and the worst-case distortion in Hansen and Sargent (2001) is that the distortion in (2.7) is scaled by the likelihood L, which captures the role of the data in (2.5)-(2.6). The worst-case distortion depends on the likelihood because the expectations in (2.5) and (2.6) are taken under different probability measures. Since the posterior is proportional to the product of the prior and the likelihood, concentrating distortions in the high likelihood regions generates large changes in the posterior from small distortions of the prior. If the likelihood is flat, we return to the standard exponential tilt of Hansen and Sargent (2001).

To solve for λ, substitute (2.7) into (2.6):

    γ̃ = Ep[ (exp[λL(θ|X)(γ(θ) − γ̃)] / Ep[exp[λL(θ|X)(γ(θ) − γ̃)]]) γ(θ) ],    (2.8)

where the ratio inside the expectation is a change of measure. The right-hand side is the posterior expectation of γ after a change of measure that depends on λ. As λ increases, the change of measure places more weight on large values of γ, increasing the right-hand side. Since the left-hand side is a constant and the right-hand side is increasing in λ, (2.8) implies a unique solution for λ that is straightforward to solve for numerically.⁴

⁴ While the worst-case distribution is unique, one may find an alternative prior in the same set that implies a change in the posterior statistic of interest that is only marginally smaller. Thus the worst-case distributions provide sufficient but not necessary distortions to generate large changes in the posterior results.

Comparison to the primal problem. Appendix A shows that (2.3)-(2.4) also produces a solution of the form (2.7), but requires us to solve for both λ and γ̃. The multiplier representation of (2.3)-(2.4), which is the Lagrangian problem, specifies λ and requires us to solve for γ̃. We favor the dual representation (2.5)-(2.6) because γ̃ has a straightforward interpretation, while λ has no clear economic interpretation and is difficult to specify ex-ante. It is therefore more convenient to specify a reasonable change in the posterior mean of γ, then check how much the prior needs to be distorted in order to generate this change. In practice, it will not matter whether one picks R and finds the worst-case posterior mean, or one picks γ̃ and finds the corresponding relative entropy. The sequential Monte Carlo algorithm in Section 5 yields a sequence of (R, γ̃) pairs, allowing one to trace out the mapping between R and γ̃. We thus obtain γ̃ as a function of R despite solving the dual problem initially.

Regularity conditions. For the solution (2.7) to be valid, we require that the Eπ[M] = 1 constraint in (2.3) and (2.5) can be satisfied, and that Ep[exp[λL(γ − γ̃)]] in (2.8) exists. A sufficient condition for this is that L(θ|X)(γ(θ) − γ̃) is bounded, so that the right-hand side of (2.7) is bounded for any given λ. As an illustration, consider the prior θ ∼ N(0, 1), and suppose we are interested in γ(θ) ≡ θᵏ, where k ≥ 3 is an odd integer. If the likelihood has Pareto tails proportional to |θ|^−(α+1), then the above condition requires that α + 1 > k. On the other hand, suppose the likelihood were flat. Then the solution to (2.3)-(2.4) is not well-defined.⁵ Accordingly, the constraint Eπ[M] = 1 can no longer be satisfied with M satisfying (2.7). However, the infeasibility is symptomatic of fat tails in the likelihood, and is thus an indication that the posterior mean is especially sensitive to the prior.

⁵ To see this, recall that the kth moment of a t-distribution with ν < k degrees of freedom does not exist. Hence the moment of interest would not exist in an alternative prior with the tail of that t-distribution. To obtain an arbitrarily small relative entropy, we can distort the prior appropriately far out in the tail.

Relative entropy allows for relatively weak regularity conditions because it heavily penalizes increases in the mass of the tail of the original prior, thus favoring distortions closer to the mode of the original prior.⁶ A statistical divergence that penalizes tail distortions less (e.g., total variation distance) would in general require more stringent regularity conditions.

⁶ Intuitively, express relative entropy as ∫ π̃(θ) log[π̃(θ)/π(θ)] dθ, and notice that log[π̃/π] → ∞ as π → 0. This also implies that we consider alternative priors with the same support as π.

2.4 Extensions

The REPS framework allows for flexibility in application. We now consider several prior sensitivity problems that can be incorporated into (2.5)-(2.6).

Credible intervals. One can adapt the constraint (2.6) to analyze the prior sensitivity of quantiles of the function of interest ψ(θ), producing bounds on the credible intervals for ψ given a set of deviations in the prior. In particular, suppose we are interested in the dependence of the qth quantile of ψ on the prior. Then we can solve (2.5) subject to:

    Ep[ (M(θ)/Ep[M]) 1{ψ(θ) < ψ̃} ] = q,    (2.9)

where ψ̃ is the worst-case quantile of ψ. Since (2.9) is (2.6) with γ̃ = q and γ(θ) = 1{ψ(θ) < ψ̃}, the solution has the form (2.7), with the appropriate substitution for γ̃ and γ. Fixing γ̃ = q, the sequential Monte Carlo algorithm yields a mapping between R and ψ̃.

With the above substitutions, we have γ(θ) − γ̃ ∈ {1 − γ̃, −γ̃}. The distortion M therefore only takes on extreme values if θ has high likelihood. As a result, the conditions that Eπ[M] = 1 and that Ep[exp[λL(γ − γ̃)]] exists are satisfied even if the likelihood is flat, since

    M(θ) ∝ exp[λ(1 − γ̃)]  if ψ(θ) < ψ̃,
           exp[−λγ̃]        if ψ(θ) ≥ ψ̃.    (2.10)

In contrast, if γ were not bounded, then M would take on extreme values either when θ has high likelihood or when θ implies an extreme value of γ.

Subspaces. Taking the expectations in (2.5)-(2.6) over the marginal prior and posterior of a subspace Θ* of Θ allows us to study the dependence of the posterior on the marginal prior over Θ* instead of the entire space Θ.⁷ Such an exercise can be useful if there is a natural partition for θ. For example, in a New Keynesian model, one may be especially concerned about a subset of parameters whose priors are hard to calibrate from existing data. The reduced dimensionality of Θ* also simplifies the analysis of the worst-case distortions.

⁷ In the context of a partially identified model, the current approach differs from Giacomini, Kitagawa, and Uhlig (2019), who would distort the prior of θ* conditional on the identified parameters.

More formally, consider

    min over M(θ*) with Eπ*[M] = 1 of  Eπ*[M(θ*) log M(θ*)]    (2.11)
    s.t.  Ep*[ (M(θ*)/Ep*[M]) Ep[γ(θ)|θ*] ] = γ̃,    (2.12)

where π* and p* are the marginal prior and posterior over Θ*. The constraint (2.12) arises by applying the law of iterated expectations to (2.6) and noting that M now depends on θ* only. Define γ*(θ*) ≡ Ep[γ(θ)|θ*]. The solution to (2.11)-(2.12) is:

    M(θ*) ∝ exp[λ L*(θ*|X) (γ*(θ*) − γ̃)],    (2.13)

where L*(θ*|X) ≡ p*(θ*|X)/π*(θ*) is the marginal likelihood of θ*. The solution (2.13) to the subspace problem is similar to the original solution (2.7), with the likelihood L replaced with the marginal likelihood L* and the objective function γ(θ) replaced by its expectation conditional on θ*.

Additional constraints. We can further restrict the set of permissible priors by including prior or posterior moment restrictions in (2.5)-(2.6), as in the "tilted robustness" problem of Bidder et al. (2016).
Each additional restriction produces one additional multiplier to solve for, while the moment restriction provides the additional equation with which to solve for the new unknown. See Appendix A for details.

One such moment restriction constrains the marginal data density. In particular, express the ratio of the marginal data density of the worst-case prior to that of the original prior as:

    Ep[M(θ)] = ∫ L(θ|X) π̃(θ) dθ / ∫ L(θ|X) π(θ) dθ,    (2.14)

and restrict Ep[M]. The quantity Ep[M] is easily computed by taking an average of M(θ) across Monte Carlo draws. Berger et al. (1994) discuss why one might want to include the marginal data density as a criterion to ensure that the alternative priors considered are plausible. A small Ep[M] ≪ 1 suggests that π̃ is strongly rejected by the data, a large Ep[M] ≫ 1 could be evidence of π̃ being overfitted to the data, while Ep[M] = 1 indicates that the data favor neither the original nor the alternative prior.

Figure 3.1: Likelihoods, priors, and posteriors for mixture and normal likelihoods. Blue dashed lines correspond to the mixture likelihood; red dotted lines correspond to the normal likelihood. Top left: Original prior and likelihoods; Top right: Original posteriors; Bottom left: Worst-case priors; Bottom right: Worst-case posteriors, with original and worst-case means in gray.

3 Two illustrative examples

We now present two stylized examples to illustrate how REPS can diagnose dependence on the prior that may be hard to detect otherwise and is prevalent in many applications. The first example shows that REPS accounts for behavior of the likelihood in the tail of the posterior, which may be hard to distinguish from visual inspection. The second shows that even if the prior and likelihood are Gaussian, REPS detects that the sensitivity to the prior depends on the object of interest ψ.

3.1 Example 1: multiple modes in the likelihood

Suppose θ ∈ R, and we have the prior θ ∼ N(1, 0.6²). We consider two alternative likelihoods: a mixture model

    X ∼ N(−θ, 0.6²) with probability 0.5,
        N(θ, 0.6²)  with probability 0.5,    (3.1)

with data X = 1, and a Gaussian model

    X ∼ N(θ, 0.678²),    (3.2)

with data X = 0.831. The parameters of the Gaussian model are picked so that both posteriors have mean 0.942 and standard deviation 0.485. The top right panel of Figure 3.1 shows that the two posteriors are hard to distinguish visually. On the other hand, the top left panel of Figure 3.1 shows that the mixture likelihood has modes at −1 and 1, while the normal likelihood has only one mode at 0.831.

REPS shows that the posterior mean of θ has greater sensitivity to changes in the prior under the mixture likelihood. In particular, fixing γ(θ) = θ and R = 1.25, we solve (2.3)-(2.4) for both models. Since the original prior and R are identical across the models, we are choosing from the same set of priors in both cases. However, the mixture model's worst-case posterior mean of 0.002 is substantially lower than the normal model's worst-case posterior mean of 0.224. Without REPS, the similarity of the two posteriors might mislead one to think that the posteriors are equally sensitive to changes in the prior. Local prior sensitivity methods may also fail to detect the difference in the prior sensitivity.
For instance, the derivative of the posterior mean with respect to the prior mean, as considered by Müller (2012), suggests that the posterior means in the two cases are equally sensitive to the prior.

The worst-case priors and posteriors, plotted in the bottom row of Figure 3.1, indicate the importance of the range of alternative priors considered by REPS. With the normal likelihood, the worst-case prior for θ is approximately normal, centered around 0. The worst-case posterior is also approximately normal, centered around the new mean of 0.2. In contrast, with the mixture likelihood, the worst-case prior is flatter and places a relatively large mass around θ = −1. The worst-case posterior is now bimodal, with a second mode around θ = −1 corresponding to the second mode in the likelihood, which was not visible from the original posterior. The worst-case distortions are informative about parts of the prior that one should be concerned about even if one does not regard the exact shape of the worst-case prior as being plausible.

This example illustrates how the robustness of a posterior estimate to changes in the prior can depend on peaks in the likelihood that are dampened by the original prior. In such cases, visual inspection of the prior and posterior can mislead one to believe a result is more robust than it actually is. Herbst and Schorfheide (2014) show that under more diffuse priors, the DSGE models of Smets and Wouters (2007) and Schmitt-Grohé and Uribe (2012) produce multimodal posteriors that can alter inference relative to a tighter prior. We find that these features matter for posterior inference in our main application to the DSGE model of Smets and Wouters (2007) in Section 6. Such multimodality is often hard to detect without reestimating the model for different priors. Flatter marginal priors may not reveal these modes, since the parameterization and independence assumptions matter for how flattening the marginals impacts the posterior of the object of interest. REPS provides a systematic approach to prior sensitivity analysis that accounts for potentially subtle features of the likelihood if they are important for one's posterior results.

Figure 3.2: Prior and posterior of θ and ψ(θ). Black solid lines correspond to the original distribution; blue dashed lines correspond to distributions that maximize the mean; red dotted lines correspond to distributions that minimize the mean. Top left: Prior of θ; Top right: Prior of ψ(θ); Bottom left: Posterior of θ; Bottom right: Posterior of ψ(θ).

3.2 Example 2: log-normal distribution

Suppose θ ∈ R, and we have prior θ ∼ N(0, 1), X ∼ N(θ, 1), and observe data X = 0. The posterior is θ ∼ N(0, 1/2). Suppose we wish to do REPS analysis on ψ(θ) = exp(θ), so that ψ is log-normal with mean 1.28 and standard deviation 1.03. Figure 3.2 shows that the posterior of ψ is skewed even though the posterior for θ is symmetric.

REPS shows that this asymmetry matters for the sensitivity of the posterior mean of ψ to changes in the prior. Fixing γ(θ) = ψ(θ) and taking R = 0.57, which corresponds to a one standard deviation change in the posterior mean of θ, the posterior mean of ψ has a maximum value of 2.50 (an increase of 1.18 standard deviations) and a minimum value of 0.59 (a decrease of 0.67 standard deviations).
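The mechanics of this example can be reproduced directly from Monte Carlo draws, since the solution (2.7), the constraint (2.8), and the relative entropy expression (5.1) in Section 5 only require the likelihood and γ evaluated at each draw. The following Python sketch is our own illustration, not the paper's implementation: it solves the dual problem (2.5)-(2.6) for this log-normal example with an illustrative target mean γ̃ = 2.0 (rather than the values reported above), and all variable names are chosen for exposition.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

rng = np.random.default_rng(0)
n_draws = 500_000

# Original prior theta ~ N(0, 1); one observation X = 0 with X | theta ~ N(theta, 1),
# so the posterior is theta ~ N(0, 1/2). Object of interest: psi(theta) = exp(theta).
theta_prior = rng.normal(0.0, 1.0, n_draws)
theta_post = rng.normal(0.0, np.sqrt(0.5), n_draws)
X = 0.0

def lik(theta):            # likelihood L(theta | X)
    return norm.pdf(X, loc=theta, scale=1.0)

def gamma(theta):          # gamma(theta) = psi(theta)
    return np.exp(theta)

gamma_tilde = 2.0          # illustrative worst-case posterior mean of psi

# Equation (2.8): at the solution for lambda, the tilted posterior mean equals gamma_tilde.
def constraint_gap(lam):
    w = np.exp(lam * lik(theta_post) * (gamma(theta_post) - gamma_tilde))
    return np.average(gamma(theta_post), weights=w) - gamma_tilde

lam = brentq(constraint_gap, -50.0, 50.0)

# Equation (2.7): reweighting the original draws by M gives the worst-case prior and posterior.
M_prior = np.exp(lam * lik(theta_prior) * (gamma(theta_prior) - gamma_tilde))
R = -np.log(M_prior.mean())   # relative entropy of the worst-case prior, equation (5.1)
M_prior /= M_prior.mean()     # normalize so that E_pi[M] = 1

print(f"lambda = {lam:.3f}, relative entropy R = {R:.3f}")
```

Fixing R instead and tracing out γ̃, as in the primal problem, amounts to repeating the root-finding step for a grid of targets and reading off the implied relative entropies.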
The asymmetry in sensitivity arises because ψ is bounded below by zero but has a posterior with a fat right tail. The worst-case distortions are also asymmetric. The prior that maximizes the mean distorts the tails more relative to the prior that minimizes the mean, because the convexity of the exponential function amplifies (dampens) the effect of distortions on the right (left) tail of θ on the mean of ψ. The asymmetry arises despite the symmetry of the Gaussian prior and likelihood. Since relative entropy is invariant to one-to-one transformations of θ, the set of priors does not depend on the parameterization of the problem.

These insights generalize to more complex settings where ψ may be a complicated function whose sensitivity to the prior may be hard to analyze. For example, if ψ is an impulse response in a DSGE model, one would need to solve the model and then evaluate the impulse response. The challenge is compounded when θ is high-dimensional. Even if visual inspection of the posterior of ψ reveals that it could be sensitive to the prior, further analysis is needed to determine which parts of the prior of θ are important. REPS accounts for the function of interest through the γ term in the solution (2.7) while checking across a wide range of alternative priors.

4 Quantifying the change in prior

To quantify prior sensitivity, we need to gauge how much the prior has changed to produce the specified change in the posterior mean. In this section, I provide a formula that summarizes these relationships and give practitioners a rule of thumb for what is a large or small value of R, taking the Gaussian location model as a benchmark.

4.1 Intuition

There are two challenges in gauging the size of R. Firstly, the worst-case distortions are nonparametric, making it hard for a practitioner to have an intuition for whether the change in the prior is large or small. Secondly, because the distortions are concentrated in the high likelihood region, existing approaches such as the error detection probabilities (Hansen and Sargent (2008)) produce misleading conclusions.⁸ Instead, one needs to account for the concentration of the likelihood and the pdf of the prior around the high likelihood region when interpreting R.

⁸ In our setting, such approaches can imply more sensitivity even when the likelihood is more concentrated.

To see this, notice that we can write the relative entropy as an integral over the prior, and recall that the worst-case distortions are concentrated in the high likelihood region. As the likelihood becomes more concentrated, the volume of the high likelihood region shrinks, reducing the effective region over which the integral is computed, thus decreasing the relative entropy. Within the high likelihood region, the integral is scaled by the prior. Intuitively, we have more prior knowledge about regions of high prior probability, making it more costly to change our beliefs about those regions.

4.2 Gaussian location model

The above intuition applies for the asymptotic behavior of the solution to (2.5)-(2.6) in the Gaussian location model, which we then use as a benchmark for R. Appendix C verifies that the asymptotics provide a good approximation even for a relatively small sample size. Consider a d-dimensional vector θ = (θ1, ..., θd)′ whose true value is θ0. Suppose we have prior θ ∼ N(0, Σπ) and observe T iid realizations of X ∼ N(θ0, Ω) with sample mean X̄T. Assume Ω is full rank.
Then we have the posterior θ ∼ N(θp,T, Σp,T), where

    θp,T = T Σp,T Ω⁻¹ X̄T,    (4.1)
    Σp,T = (Σπ⁻¹ + T Ω⁻¹)⁻¹.    (4.2)

Denote the posterior standard deviation of θi by σi,p,T.

Lemma 1. Suppose γ(θ) = θ1 and γ̃T = θ1,p,T − c σ1,p,T, where c ∈ R+. Then as T → ∞, the solution to (2.5)-(2.6) satisfies T^(d/2) RT → R̄ π(θ0) |Ω|^(1/2) almost surely, for some constant R̄.

Lemma 1 states that the relative entropy RT needed to shift the posterior mean by c posterior standard deviations declines at rate T^(d/2), and that it depends on the variance of the data and the prior at θ0. The scaling factor R̄ varies depending on the number of dimensions d and the amount of distortion c. See Appendix A for the proof. As the sample size increases, the distortions asymptotically concentrate around a small region whose volume shrinks at rate T^(d/2). The |Ω| term accounts for the dispersion of the likelihood for a given T. Since the likelihood concentrates around θ0, the asymptotic relative entropy is scaled by the prior π(θ0) at θ0. The same asymptotics apply for the mean or quantile of any linear combination of θ.⁹

⁹ For other models that satisfy the Bernstein-von Mises theorem, we have a similar asymptotic relative entropy, with |Ω| replaced by the Fisher information matrix.

4.3 Rule of thumb for R

We calibrate R by taking a Gaussian approximation of the prior and posterior, then comparing the relative entropy to the corresponding Gaussian location model. In particular, for prior and posterior means (µπ, µp) and variances (Σπ, Σp), consider the approximations N(µπ, Σπ) and N(µp, Σp) for the prior and posterior, respectively. Define the dispersion of the likelihood:

    Σℓ ≡ (Σp⁻¹ − Σπ⁻¹)⁻¹.    (4.3)

In a Gaussian location model with T observations of X ∼ N(θ, Ω), we have Σℓ = Ω/T. We parameterize the relative entropy as:

    R = 1.6^δ r [π(µp)/π(µπ)] √(|Σℓ|/|Σπ|),  where δ = 0 if d = 1 and δ = d if d > 1.    (4.4)

The last two terms arise from the asymptotics in Lemma 1, and account for the concentration of the likelihood (relative to the prior) and the pdf of the prior around the high likelihood region. The 1.6^δ scaling accounts for further differences across dimensions. The remaining free parameter r controls the size of the change in the prior and posterior. Intuitively, r captures the sensitivity to the prior given the prior and posterior variances. While the prior and posterior variances are sufficient to determine prior sensitivity in a Gaussian location model, they do not account for the behavior of the likelihood in the tail of the posterior or correlations in the likelihood. In Section 6, we show that such features are important for determining the sensitivity to the prior, and that REPS accounts for them.

A practitioner can compute r from (4.4), then compare the change in the posterior for the estimation of interest to the change that r would imply in a Gaussian location model (a computational sketch is given below). Figure 4.1 shows how r corresponds to changes in the mean and 84% quantile in the Gaussian location model with Σπ = Ω = I and T = 100.

Figure 4.1: Scaled relative entropy r and distortion in the Gaussian location model with θ ∼ N(0, I), X̄T = 0, T = 100, d ∈ {1, ..., 4}. All distortions are scaled by the posterior standard deviation. Left: Increase in the posterior mean of θ1 by one posterior standard deviation; Right: Increase in the 84% quantile of θ1 by one posterior standard deviation.
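As a rough illustration of how a practitioner might back out r, the following Python sketch applies (4.3) and (4.4) to Gaussian approximations of the prior and posterior fitted to Monte Carlo draws. The draw matrices `prior_draws` and `post_draws` are hypothetical stand-ins for the user's own output, and the function relies on the reconstruction of (4.4) displayed above; it is not the paper's code.

```python
import numpy as np

def scaled_relative_entropy_r(prior_draws, post_draws, R):
    """Back out r from equation (4.4), given a relative entropy R and Gaussian
    approximations of the prior and posterior fitted to draws of shape (n, d)."""
    mu_pi, Sigma_pi = prior_draws.mean(axis=0), np.atleast_2d(np.cov(prior_draws, rowvar=False))
    mu_p, Sigma_p = post_draws.mean(axis=0), np.atleast_2d(np.cov(post_draws, rowvar=False))
    d = prior_draws.shape[1]

    # Dispersion of the likelihood implied by the Gaussian approximations, equation (4.3).
    Sigma_l = np.linalg.inv(np.linalg.inv(Sigma_p) - np.linalg.inv(Sigma_pi))

    # Ratio of the prior pdf at the posterior mean to the prior pdf at the prior mean;
    # for a Gaussian prior this is exp(-0.5 * (mu_p - mu_pi)' Sigma_pi^{-1} (mu_p - mu_pi)).
    dev = mu_p - mu_pi
    density_ratio = np.exp(-0.5 * dev @ np.linalg.solve(Sigma_pi, dev))

    # sqrt(|Sigma_l| / |Sigma_pi|), computed via log-determinants for numerical stability.
    logdet = lambda A: np.linalg.slogdet(A)[1]
    det_ratio = np.exp(0.5 * (logdet(Sigma_l) - logdet(Sigma_pi)))

    delta = 0.0 if d == 1 else float(d)
    return R / (1.6**delta * density_ratio * det_ratio)
```

The same ingredients (µπ, µp, Σπ, Σp) also deliver the dispersion statistic √(|Σℓ|/|Σπ|) used to interpret the application in Section 6.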
As a rule of thumb, r < 0.05 is small and r > 0.50 is large. For the mean and 84% quantile of the Gaussian model, r = 0.05 and r = 0.50 correspond approximately to 1/4 standard deviation and one standard deviation changes, respectively.¹⁰

¹⁰ One can also obtain a Gaussian approximation using the Hessian of the prior and posterior at their modes. However, the local nature of this approximation is less consistent with the intuition that we should scale relative entropy according to the dispersion of the likelihood. For example, if a model is not point identified and the likelihood is flat in some identified set, such an approach may lead |Σℓ| to be large even if the identified set is small.

5 Implementation

We now discuss the numerical implementation of the calculations in Section 2. We assume that we have Monte Carlo draws from both the prior and the posterior.

5.1 Importance sampling

If the distribution of the worst-case distortion M does not have fat tails, then we can solve for M and evaluate the worst-case prior and posterior using importance sampling. In particular, for any λ, we can evaluate the right-hand side of (2.8), approximating the expectations with Monte Carlo sample averages. We can then solve (2.8) using this Monte Carlo approximation. Given the solution for λ, we can now compute M for any value of θ. Reweighting the original Monte Carlo draws by M then gives us the worst-case prior and posterior. Lemma 4 in Appendix A shows that we can evaluate the relative entropy using the expression:

    Eπ[M(θ) log M(θ)] = −log Eπ[exp[λL(θ|X)(γ(θ) − γ̃)]].    (5.1)

The right-hand side is the negative log of the normalizing constant that ensures Mπ integrates to one, which is straightforward to evaluate using draws from the prior. This produces more precise estimates than averaging log M across the draws from the worst-case prior, because such a calculation would require the normalizing constant as well.

Importance sampling performs poorly when the distribution of M has fat tails, which occurs when the likelihood is sharply peaked. This problem becomes more severe as we increase the dimensionality of Θ or the number of observations. The solution (2.7) shows that when the likelihood is sharply peaked, the distortions are concentrated in a small region of the parameter space, but the distortions in that region are large. As a result, in order to accurately approximate the prior and posterior distortions, we need an increasingly large number of Monte Carlo draws for the high likelihood region to be sufficiently well sampled.

5.2 Sequential Monte Carlo

When importance sampling fails, we can use sequential Monte Carlo (SMC) to generate draws from the worst-case prior and posterior. Rather than using importance sampling to move directly from the original to the worst-case distributions, SMC introduces a sequence of bridge distributions that serve as intermediate steps between the original and worst-case distributions. Beginning with draws from the original prior and posterior, referred to as particles, we iteratively construct particle approximations of the bridge distributions, before arriving at a particle approximation of the worst-case prior and posterior.

Our algorithm is based on Herbst and Schorfheide (2014), who begin with draws from a prior and transition to the posterior by constructing bridge distributions that are proportional to the product of the prior and the likelihood raised to an exponent. To adapt the algorithm, we construct bridge distributions for our setting and compute the corresponding weights between consecutive bridge distributions. We take the bridge distributions to be the worst-case priors {πi} and posteriors {pi}, i = 1, ..., NSMC, arising from the solution of (2.5)-(2.6) for a sequence of intermediate worst-case posterior means γ̃0 > ... > γ̃NSMC, with γ̃0 = Ep[γ(θ)] and γ̃NSMC = γ̃. When studying quantiles, we can fix the quantile of interest q and construct a sequence of intermediate worst-case quantiles ψ̃0 > ... > ψ̃NSMC, where ψ̃0 is the quantile under the original posterior and ψ̃NSMC = ψ̃ is the worst-case quantile. With the bridge distributions in hand, we sketch out the SMC procedure in Algorithm 1. We leave the details to Appendix B.

Algorithm 1: Sequential Monte Carlo for REPS
  Input: Draws {θπ,j} and {θp,j}, j = 1, ..., NP, from the original prior and posterior.
  Output: Draws {θπ̃,j} and {θp̃,j}, j = 1, ..., NP, from the worst-case prior and posterior.
  Initialize: Set {θπ0,j} = {θπ,j} and {θp0,j} = {θp,j}.
  for i = 1 to NSMC do
    compute weights: Solve for λi and compute mi ≡ πi/πi−1 for each draw.
    selection: Draw from {θπi−1,j} and {θpi−1,j} using a multinomial distribution with probability weights proportional to mi(θπi−1,j) and mi(θpi−1,j), respectively.
    mutation: For each draw, take NMH Metropolis-Hastings steps.
  end
  return {θπ̃,j} = {θπNSMC,j} and {θp̃,j} = {θpNSMC,j}.

A key by-product of the SMC algorithm is that it provides the choice of whether to fix the relative entropy and obtain worst-case means or quantiles, or fix the worst-case quantities and obtain the associated relative entropy. By producing a sequence of worst-case distortions and solving for the associated relative entropies, the SMC algorithm allows the user to map the relationship between the relative entropy R and the worst-case mean γ̃ (or quantile ψ̃). As emphasized in Section 2.3, the dual problem is then a computational device and does not force the user to choose the worst-case quantity instead of the relative entropy.

Algorithm 2: Approximate REPS
  Input: Draws {θπ,j} and {θp,j}, j = 1, ..., NP, from the original prior and posterior.
  Output: Draws {θπ̃,j} and {θp̃,j}, j = 1, ..., NP, from the approximate worst-case prior and posterior.
  1. Approximate π and p by π̂ and p̂.
  2. Use π̂ and p̂ to obtain an approximation L̂(θ*|X) ≈ cL*(θ*|X), where c is a constant.
  3. Obtain an estimate γ̂(θ*) ≈ Ep[γ(θ)|θ*].
  4. Run Algorithm 1, replacing (π, p, L*, γ) with (π̂, p̂, L̂, γ̂).
  return {θπ̃,j} = {θπNSMC,j} and {θp̃,j} = {θpNSMC,j}.

5.3 Approximate REPS

The main computational challenge in Algorithm 1 is the computation of L and γ in the mutation step. In particular, let the number of particles and Metropolis-Hastings mutation steps be NP and NMH, respectively. Obtaining particle approximations of the worst-case prior and posterior each requires us to compute L and γ for NP × NMH × NSMC different parameter values. Both L and γ may be computationally expensive to compute. If we are interested in more than one statistic, we also need to repeat the SMC algorithm for each objective function we are interested in.¹¹ In addition, to apply Algorithm 1 to (2.11)-(2.12), we require the marginal likelihood L* and conditional expectation γ*, both of which may be difficult to compute. To overcome these, we use an approximation to the REPS calculations, which we refer to as approximate relative entropy prior sensitivity (AREPS).

¹¹ For example, if γ is an impulse response function in a DSGE model, then one needs to solve the model for each draw of θ in order to compute the impulse response. To compute the likelihood L, one needs to run a Kalman filter using the solved model. If one were interested in the error bands for a set of impulse response functions, one would need to repeat Algorithm 1 for each impulse response at multiple horizons.
The main idea of AREPS is to replace π, p, L*, and γ* with approximations π̂, p̂, L̂, and γ̂. In particular, using the Monte Carlo draws from the original estimation as observations, we fit a set of basis functions to obtain the approximations π̂, p̂, and γ̂ of π, p, and γ, respectively.¹² If we are doing the REPS analysis over the entire Θ, then L̂ and γ̂ correspond to approximations for L and γ, respectively. Analogously to semiparametric methods, we use basis functions to allow for flexibility while mitigating the curse of dimensionality that arises in fully nonparametric estimation. From the practical perspective, many of these methods are straightforward to implement in most statistical software using built-in commands. The steps are described in Algorithm 2.

¹² The motivation for the approximations is similar in spirit to variational Bayesian inference. While variational Bayesian methods seek the approximating distribution that is closest to the true posterior in relative entropy, here we make use of the fact that we have existing Monte Carlo draws that we can use to directly approximate the distributions and functions.

With the appropriate approximations, the approximate likelihood L̂ for a set of particles can be computed in vectorized form. If one has multiple objectives (e.g., multiple horizons of an impulse response), one can parallelize across SMC algorithms. In addition, since we no longer need to compute the true L* or γ*, output from packages such as Dynare can be directly fed into the algorithm. For computational efficiency, the approximations π̂, p̂, and γ̂ should be fast to evaluate. In our application in Section 6, we use a Gaussian mixture model and a quadratic logit to approximate p and γ*, respectively.

For numerical accuracy, L* and γ* need to be well approximated in regions with the largest distortions. Since the approximations typically perform more poorly in the tails of the distributions, AREPS would provide misleading results if the distortions M take on extreme values in the tails of π*, p*, or γ*. This problem tends to be less severe when γ* is bounded. For example, AREPS would generally produce more accurate results when studying quantiles, since γ* ∈ [0, 1] and M takes on extreme values in the high likelihood regions, which tend to be near the posterior mode. If closed-form expressions for L* and γ* are available, we can reweight draws from the approximate worst-case prior and posterior to obtain draws from the true worst-case prior and posterior. See Appendix B for details.

6 Application: Smets and Wouters (2007)

Our main application is the workhorse DSGE model from Smets and Wouters (2007). Despite the size of the model, REPS is not only feasible, but also accounts for features of the likelihood that are especially hard to diagnose in such high-dimensional settings even with the use of local prior sensitivity methods. The upper bound of the error bands is highly sensitive to the prior, especially to the nominal frictions parameters, of which the prior on the wage rigidity parameter is particularly important.
We discuss the worst-case distortions in detail to provide support for the validity of the methodology and approximations.

6.1 Model and estimation

Smets and Wouters (2007) presents a medium-scale New Keynesian model with sticky wages and prices, wage and price indexation, habit formation, investment adjustment costs, variable capital utilization, and fixed costs in production. The model includes total factor productivity, risk premium, investment-specific, wage mark-up, price mark-up, government spending, and monetary policy shocks.¹³ The model has thirty-six parameters.

¹³ We estimate the equations as presented in the text of Smets and Wouters (2007). In their estimation, Smets and Wouters (2007) use an alternative scaling for their risk premium shock. The scaling does not change the estimation results materially.

We use quarterly data from 1984Q1 to 2007Q4 from Federal Reserve Economic Data (FRED) for GDP growth, consumption growth, investment growth, wage growth, hours, inflation, and the federal funds rate. The series are updated vintages of those used in Smets and Wouters (2007) for the period after the start of the Great Moderation. Our original prior is from Smets and Wouters (2007). We make 1.5 million Markov chain Monte Carlo draws after discarding 40,000 burn-in draws from the posterior using a standard Metropolis-Hastings algorithm.

6.2 Prior sensitivity

Object of interest. We construct 68% robust error bands for the impulse response of output to a one percentage point decrease in interest rates up to five years from impact. These are bounds that uniformly contain all error bands arising from the chosen set of priors. To that end, we solve the REPS problem separately for each horizon and posterior quantile determining the error band.

Parameters of interest. We consider the sensitivity of the error bands to changes in the prior of two groups of structural parameters. The first set of parameters is {ρ, rπ, ry, r∆y} from the monetary policy rule:

    r̂t = ρ r̂t−1 + (1 − ρ)[rπ π̂t + ry ∆ŷt + r∆y(∆ŷt − ∆ŷt−1)] + εʳt,    (6.1)

where r̂t is the interest rate, π̂t is the inflation rate, ∆ŷt is the output gap, and εʳt is an exogenous AR(1) shock process. The second set of parameters is {ξw, ξp, ιw, ιp}, which determine the level of wage rigidity, price rigidity, wage indexation, and price indexation. Each has a range [0, 1], with 0 corresponding to the flexible wage and price benchmarks.

Economic theory suggests that both are important for determining the response of output to monetary policy. The Taylor rule captures the persistence of the monetary policy shock, and how the monetary authority responds to dampen future deviations in inflation and the output gap arising from the initial monetary policy shock. Economic agents have rational expectations about this future path of interest rates, and make decisions that determine the response of output in equilibrium. Similarly, the nominal frictions determine how much prices adjust in response to monetary policy, and thus how much output responds in equilibrium. However, because the impulse response function is determined in equilibrium, it is hard to make precise analytical statements about the effect of changing any of these parameters on the impulse response.

Figure 6.1: Robust 68% error bands. Black lines show the original median (solid) and error bands (dotted); red dashed lines show robust error bands. Left: Taylor rule prior, with r = 5 × 10⁻³; Right: nominal frictions prior, with r = 2.5 × 10⁻⁴.

Computational details. We use the exact prior. The posterior is approximated using a Gaussian mixture with 40 components. The impulse response at each horizon is approximated using a quadratic regression, yielding R²s of between 0.97 and 0.98. Finally, the conditional probability that the impulse response is less than a given cutoff is estimated with a quadratic logit.¹⁴ For the SMC, we use 5 × 10⁴ particles, 250 SMC stages, and 10 Metropolis-Hastings draws at each stage. We average the results across 10 runs of the SMC. A single run of the SMC in MATLAB takes approximately three hours. We parallelize the computations by horizon across 21 cores. Appendix D provides further details, including a comparison of the exact and approximate posteriors.

¹⁴ As discussed in Section 5, MATLAB has built-in commands for these approximations.

6.3 Robust error bands

Figure 6.1 plots the original and robust 68% error bands for the impulse response. For each set of parameters, we fix the relative entropy so that the maximum distance between the original and robust error bands is approximately one posterior standard deviation. As noted in Sections 2.3 and 5.2, we could also have chosen a fixed relative entropy, since the SMC produces the full mapping between the relative entropy and worst-case quantiles. We use the ability to move between relative entropy and worst-case quantile in order to fix the relative entropy across horizons in Figure 6.1.

Our analysis adds to the literature documenting the sensitivity of Bayesian DSGE estimates to the prior. Even though a model may be weakly identified, the direction that lacks identification may not be important for the statistics one is most interested in. For the impulse response here, we see different degrees of sensitivity depending on the horizon, quantile, or subset of parameters considered.

Figure 6.2: Original prior and posterior of Taylor rule parameters {ρ, rπ, ry, r∆y} and nominal frictions parameters {ξw, ξp, ιw, ιp}.

Sensitivity. We use r = 5 × 10⁻³ for the Taylor rule and r = 2.5 × 10⁻⁴ for the nominal frictions prior, where r is defined in equation (4.4). These values of r are one and two orders of magnitude smaller than the r = 0.05 benchmark, respectively. In a Gaussian location model, r = 5 × 10⁻³ and r = 2.5 × 10⁻⁴ would respectively correspond asymptotically to 0.14 and 0.01 posterior standard deviation changes in the quantile. This is at least an order of magnitude smaller than the differences in the upper bound for most horizons, but similar to the differences in the lower bound shown in Figure 6.1.

The first reason for the sensitivity is that the likelihoods for both the Taylor rule and the nominal frictions parameters are dispersed, allowing a small change in the prior to generate a large change in the posterior of the parameters. Figure 6.2 shows that their marginal posteriors are not much more concentrated than their priors. To study the joint concentration of the likelihood, we measure the dispersion of the likelihood relative to the prior using the statistic √(|Σℓ|/|Σπ|), where Σℓ is defined in (4.3). This statistic is 0.23 for the Taylor rule and 1.53 for the nominal frictions parameters.
In the Gaussian location model, these values correspond to having a standard normal prior and observing just one observation of X ∼ N(θ, ω²I) with ω = 0.69 and ω = 1.11, respectively. With such a dispersed likelihood, the asymptotics in Lemma 1 overpredict the relative entropy needed to change the posterior estimates. In particular, in Figure C.1, the relative entropy with T = 1 is an order of magnitude smaller than what is predicted by the log-log trend for T > 10. The greater dispersion in the marginal likelihood of the nominal frictions parameters accounts for the greater sensitivity to their prior relative to the prior on the Taylor rule parameters.

In addition, both the Taylor rule and the nominal frictions parameters are good predictors of the impulse response function. For example, using the Monte Carlo posterior draws as observations, quadratic regressions of the impulse response 20 quarters after impact on the Taylor rule parameters and on the nominal frictions parameters yield R²s of 0.82 and 0.48, respectively. As a result, a given change in the posterior for either set of parameters shifts the posterior for the impulse response substantially.

Asymmetry. The sensitivity to the prior is uniform across neither horizons nor bounds. For both sets of parameters, the upper bound is substantially more sensitive to changes in the prior than the lower bound. In other words, given the model and data, a policymaker should be more concerned about underestimating rather than overestimating the effects of a surprise decrease in interest rates. Like the log-normal example shown in Figure 3.2, the original posterior of the impulse response is skewed to the right, which partially explains the greater sensitivity of the upper bound. However, the shape of the posterior for the impulse response does not fully capture the asymmetry in sensitivity. In particular, the asymmetry is more pronounced for the nominal frictions parameters, reflecting differences in the mapping from the two sets of parameters to the impulse response.

The robust and original error bands diverge as we increase the horizon, indicating that the impulse response depends more on the prior at longer horizons. Intuitively, since low-frequency fluctuations are estimated less precisely than high-frequency fluctuations, the marginal likelihood for the impulse response is more concentrated for short horizons.15

15 The impulse response at short horizons is also less well predicted by the parameters of interest. For example, quadratic regressions of the impulse response on impact on the Taylor rule and nominal frictions parameters yield R²s of 0.27 and 0.05, respectively, which are substantially smaller than those for the impulse response 20 quarters after impact.

6.4 Worst-case distributions

As an illustration, we now study the worst-case priors and posteriors for the impulse response one year after impact. For the Taylor rule prior, we consider the worst-case distortions that decrease the lower bound by 1/3 of a posterior standard deviation and increase the upper bound by one posterior standard deviation, respectively. For the nominal frictions prior, we consider the worst-case distortions that decrease the lower bound by 1/20 of a posterior standard deviation and increase the upper bound by one posterior standard deviation, respectively. These deviations correspond approximately to the robust error bands in Figure 6.1. We summarize the distortions by the changes in the prior and posterior means (normalized by the respective standard deviations) in Table 6.1.
                                      Lower bound                 Upper bound
Parameter                          prior      posterior       prior      posterior
Taylor rule
  ρ (persistence)                  0.007       –0.165        –0.003        0.125
  rπ (inflation coef.)             0.000        0.743        –0.004       –0.234
  ry (output gap coef.)            0.005        0.192        –0.001       –0.282
  r∆y (output gap growth coef.)   –0.002       –0.072        –0.005        0.204
Nominal frictions
  ξw (wage rigidity)              –0.009       –0.162         0.006        0.517
  ξp (price rigidity)             –0.040       –0.076        –0.000       –0.041
  ιw (wage indexation)            –0.001       –0.048        –0.002        0.037
  ιp (price indexation)            0.004        0.108         0.008       –0.212

Table 6.1: Difference between worst-case and original prior and posterior means, normalized by standard deviations. For each bound, the "prior" column reports (Eπ̃[·] − Eπ[·])/σπ[·] and the "posterior" column reports (Ep̃[·] − Ep[·])/σp[·]. Worst-case distributions correspond to the impulse response four quarters from impact. Taylor rule: lower bound decreased by 1/3 of a standard deviation and upper bound increased by one standard deviation under the worst case; Nominal frictions: lower bound decreased by 1/20 of a standard deviation and upper bound increased by one standard deviation under the worst case.

The prior means change less than the posterior means because the distortions are nonparametric and applied jointly to the various parameters. For example, changes in the skew of a distribution can leave both first and second moments unchanged, but lead to substantial changes in the posterior results if the likelihood is high at one tail of the prior distribution. In addition, because the changes in the prior are concentrated in the high likelihood regions, they may appear small when integrated out, but still lead to relatively large changes in the posterior means.

The distortions are asymmetric across the upper and lower bounds, emphasizing that the upper and lower bounds of the error bands depend on the prior in different ways. Among the nominal rigidity parameters, we find that the posterior is especially dependent on the wage rigidity parameter, as the upper bound worst-case prior is heavily distorted in the direction that changes the corresponding posterior mean. In contrast, the price rigidity and wage indexation priors do not appear as important for the impulse response. For the Taylor rule parameters, the prior on the inflation coefficient is especially important for the lower bound of the error band, but the worst-case prior for the upper bound does not distort especially strongly in the direction of any one of the parameters.

In what follows, we show that the worst-case distortions reveal information about the model and likelihood that is difficult to uncover using existing approaches. The methods used to interpret the results are useful but heuristic tools for ex-post analysis; they do not inform the researcher ex-ante about which parts of the prior are important for the posterior estimates. Much of the analysis conditions on the size of the impulse response. Comparing the behavior of the left and right tails of the impulse response reveals the reasons for the asymmetry in distortions.

To understand the worst-case distortions, we use the fact that the distortion (2.13) depends on the parameters through the objective function γ*(θ*) and the marginal likelihood L*(θ*|X). For each set of parameters, we run regressions of the impulse response on the parameters using the Monte Carlo draws from the original posterior, in order to analyze the relationship between the parameters and the impulse response.
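A minimal sketch of one such regression, restricted to the relevant tail of the impulse response as described below; the array names, the linear-in-parameters specification, and the handling of the one-standard-deviation cutoff are illustrative assumptions rather than the paper's exact code:

```python
import numpy as np

def tail_regression(irf, params, tail="upper"):
    """Regress the standardized impulse response on standardized parameters,
    keeping only draws at least one standard deviation above (or below) the mean."""
    y = (irf - irf.mean()) / irf.std()
    X = (params - params.mean(axis=0)) / params.std(axis=0)
    keep = y >= 1.0 if tail == "upper" else y <= -1.0
    Xk = np.column_stack([np.ones(keep.sum()), X[keep]])
    beta, *_ = np.linalg.lstsq(Xk, y[keep], rcond=None)
    return beta[1:]  # slope coefficients on the parameters

# irf_draws: (N,) impulse responses at one horizon; param_draws: (N, 4) posterior draws.
# upper_coefs = tail_regression(irf_draws, param_draws, tail="upper")
```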
Since γ*(θ*) in (2.13) is a conditional expectation, this regression captures both the direct effect of the parameters on the impulse response and the indirect effect from the conditional distribution of the remaining parameters under the posterior. In order to shift the lower (upper) bound of the error band, we need to shift mass to the left (right) tail of the distribution of the impulse response. We thus restrict the regressions to draws for which the impulse response is at least one standard deviation below (above) the mean to understand the lower (upper) bound of the impulse response. The impulse response and parameters are normalized to mean zero and standard deviation one, so each coefficient can be interpreted as the average number of standard deviations by which the impulse response increases in response to a one standard deviation increase in the corresponding parameter. The results are reported in Table 6.2. We analyze the original and worst-case distributions in order to understand the shape of the likelihood.

6.4.1 Taylor rule distortions

Dependence of impulse response on parameters. The regression results for the Taylor rule parameters depend on the size of the impulse response. When the impulse response is small, the coefficients and R² are smaller than when the impulse response is large. The Taylor rule parameters thus explain less of the variation in the left tail of the impulse response function, which leads to the lower bound being less sensitive to changes in the Taylor rule prior than the upper bound.

The coefficients indicate that the impulse response decreases in response to an increase in rπ or ry, and increases in response to an increase in ρ or r∆y. These results are consistent with both economic intuition and the signs of the distortions. The response to a monetary policy shock is stronger when the Taylor rule is more persistent or less responsive to changes in the output gap and inflation.

Parameter                          Lower bound    Upper bound
Taylor rule
  ρ (persistence)                     0.088          0.232
  rπ (inflation coef.)               –0.057         –0.170
  ry (output gap coef.)              –0.036         –0.233
  r∆y (output gap growth coef.)       0.024          0.112
Nominal frictions
  ξw (wage rigidity)                  0.049          0.260
  ξp (price rigidity)                 0.019         –0.060
  ιw (wage indexation)                0.016          0.053
  ιp (price indexation)              –0.000          0.015

Table 6.2: Regression of the impulse response four quarters from impact on the parameters. Lower bound: conditional on the impulse response being at least one standard deviation below its mean; Upper bound: conditional on the impulse response being at least one standard deviation above its mean.

Likelihood. The likelihood offers further insights on three features of the worst-case posterior. Firstly, the distortion for rπ is especially large for the lower bound relative to the magnitude of the regression coefficient. Secondly, the worst-case prior distorts ρ minimally even though the regression suggests that ρ has a relatively large effect on the impulse response. Finally, the relative distortions for ry and r∆y are larger for the upper bound.

We begin by comparing the original and worst-case marginal posteriors for rπ, shown in Figure 6.3. The worst-case posterior for the lower bound is bimodal, with an additional peak around rπ = 2.75, revealing a high likelihood in the right tail of rπ.
The observation corroborates results from Herbst and Schorfheide (2014), who find that the posterior mean of rπ moves from 2.04 to 2.78 when one replaces the prior from Smets and Wouters (2007) with a more diffuse one.16 The prior from Smets and Wouters (2007) shrinks toward smaller values of rπ, making it difficult to detect the possibility of an additional mode without reestimating the model with a new prior. REPS detects that such shrinkage is important for the posterior outcomes of interest relative to other features of the prior. On the other hand, the REPS analysis also eases concerns from the results of Herbst and Schorfheide (2014) by showing that a prior favoring larger values of rπ does not substantially change the posterior of the impulse response, as seen from the relatively narrow gap between the lower bounds of the robust and original error bands in Figure 6.1.

16 The additional mode in the likelihood arises from fitting the data to the Taylor rule, as evidenced by the large inflation coefficient of 2.59 when we use data for the federal funds rate, inflation, and the output gap to estimate the Taylor rule by linear regression (ignoring autocorrelation in the monetary policy shock). This large value arises due to the low-frequency variation in the data. We run the regression using the trend and cyclical components of HP-filtered data and find coefficients of 2.66 and 0.92, respectively. Sala (2015) estimates a similar DSGE model in the frequency domain and finds posterior estimates of 1.81 and 1.12 for the low-frequency and high-frequency components, respectively.

Figure 6.3: Original and worst-case posteriors of rπ. Black solid line: original posterior; blue dashed line: worst-case posterior for the upper bound; red dotted line: worst-case posterior for the lower bound.

The likelihood reveals two reasons for the small distortions in ρ. Firstly, Figure 6.2 suggests that ρ is relatively sharply identified by the likelihood: the ratio of posterior to prior standard deviation for ρ is smaller than that of rπ, ry, and r∆y, with a value of 0.23 as compared to 0.82, 0.66, and 0.60, respectively. As a consequence, larger changes in the prior are needed to produce the same change in the posterior of ρ, resulting in a greater relative entropy cost. In addition, the correlations of ρ with rπ, ry, and r∆y are in conflict with their effect on the impulse response. In particular, the regression results suggest that ρ should be distorted in the opposite direction from rπ and ry, but in the same direction as r∆y. Such distortions are costly in terms of relative entropy because they run against the likelihood: ρ has a positive correlation with rπ and ry of 0.18 and 0.33, respectively, and a negative correlation of −0.10 with r∆y. It is thus optimal to distort ρ less than other parameters. In general, one may understate the dependence of the posterior on the prior if one considers the effects of the parameters on the object of interest but not the likelihood.

The likelihood also supports the large distortions in ry and r∆y for the upper bound relative to the lower bound. Figure 6.2 shows that the posterior for ry is centered around small parameter values relative to the prior, while the posterior for r∆y is centered around large parameter values relative to the prior. Therefore, decreasing ry and increasing r∆y on average imply distortions around higher likelihood regions than increasing ry and decreasing r∆y.
Since the impulse response is on average decreasing in ry and increasing in r∆y, it is optimal to distort ry and r∆y more when increasing the upper bound of the error bands. The joint distortion is reinforced by the negative correlation of −0.26 between ry and r∆y, which is consistent with the two parameters being distorted in opposite directions.

The asymmetry further emphasizes the need to do the REPS computations separately for each bound. Even though both worst-case priors correspond to the same impulse response, the optimal distortions for the upper and lower bound can be very different due to asymmetry in the likelihood and in the mapping from parameters to the impulse response.

6.4.2 Nominal frictions distortions

We now analyze the worst-case distortions for the nominal frictions prior to understand several features of the worst-case posterior means in Table 6.1. Firstly, the wage rigidity parameter ξw is distorted relatively more, especially for the upper bound. Next, the posterior means for price rigidity ξp and wage indexation ιw move in contradictory directions when we go from the lower bound to the upper bound. Finally, the worst-case posterior mean for price indexation ιp increases for the lower bound and decreases for the upper bound, contradicting the standard intuition that reducing nominal frictions should dampen the impulse response.

Dependence of impulse response on parameters. The largest coefficient from the regression reported in Table 6.2 is the one on wage rigidity ξw, which partly explains the large distortion for ξw. Moreover, the regression coefficient on ξp is negative in the regression for the upper bound, rationalizing the counterintuitive direction of the distortion of the prior on ξp. The negative coefficient arises because of an omitted variable bias: ξp is correlated with other parameters that also affect the impulse response, biasing the regression coefficient relative to what we would have found if we controlled for all the parameters in the model. The coefficient obtained without controlling for the remaining parameters is the relevant one here because we keep the prior of all other parameters unchanged. REPS accounts for the fact that changing the prior for ξp changes the posterior of the impulse response through both the marginal effect of ξp and the effect of any other correlated parameters.

On the other hand, the coefficient on wage indexation ιw for the upper bound regression contradicts the shift of the posterior distribution towards smaller values. Moreover, the small regression coefficients on price indexation ιp are inconsistent with both the direction and magnitude of the worst-case distortions.

Likelihood. The worst-case posteriors for ξw, shown in Figure 6.4, provide an explanation for the especially large change in the posterior mean of ξw for the upper bound. In particular, the worst-case posterior for the upper bound is bimodal, with a new mode around ξw = 0.90.

Figure 6.4: Original and worst-case posteriors for ξw. Black solid line: original posterior; blue dashed line: worst-case posterior for the upper bound; red dotted line: worst-case posterior for the lower bound.

                 Lower bound                         Upper bound
         ξw      ξp      ιw      ιp           ξw      ξp      ιw      ιp
ξw        1     0.19   –0.13   –0.07           1    –0.20   –0.05   –0.08
ξp               1     –0.23   –0.23                   1    –0.31   –0.15
ιw                        1    –0.15                           1    –0.14
ιp                                1                                     1

Table 6.3: Posterior correlation of the nominal frictions parameters. Left: conditional on the impulse response being at least one standard deviation below its mean; Right: conditional on the impulse response being at least one standard deviation above its mean.
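A minimal sketch of how conditional correlations of this kind can be computed from the posterior draws (array and function names are hypothetical):

```python
import numpy as np

def conditional_corr(irf, frictions, tail="lower"):
    """Correlation matrix of (xi_w, xi_p, iota_w, iota_p) among posterior draws whose
    impulse response is at least one standard deviation below (or above) its mean."""
    z = (irf - irf.mean()) / irf.std()
    keep = z <= -1.0 if tail == "lower" else z >= 1.0
    return np.corrcoef(frictions[keep].T)

# frictions_draws: (N, 4) array of posterior draws of the nominal frictions parameters.
# lower_corr = conditional_corr(irf_draws, frictions_draws, tail="lower")
```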
As with the lower bound worst-case posterior of rπ, this is in line with the diffuse prior estimates of Herbst and Schorfheide (2014). In particular, the posterior mean of ξw shifts from 0.70 under the Smets and Wouters (2007) prior to 0.93 under the diffuse prior. This is a larger change than that of ξp, ιw, and ιp, whose posterior means move from 0.66, 0.59, and 0.22 to 0.72, 0.73, and 0.11, respectively, under the diffuse prior. Again, REPS accounts for peaks in the likelihood that are hard to detect without reestimating the model under the appropriate prior. Unlike the additional mode for rπ, this new mode in the posterior for ξw substantially shifts the error bands for the impulse response. Indeed, the regression coefficient for ξw in Table 6.2 is larger in magnitude than that for rπ.

The posterior correlations, reported in Table 6.3, help to account for the counterintuitive distortions in price indexation ιp. The negative correlation of ιp with ξw, ξp, and ιw implies that increases in these parameters correspond on average to a decrease in ιp. Hence the likelihood favors moving ιp in the opposite direction from ξw, ξp, and ιw. In addition, Figure 6.5 shows that under the worst-case posterior for the upper bound, the new mode for ξw corresponds to small values of ιp, decreasing the posterior mean of ιp.

Figure 6.5: Original and lower bound worst-case posteriors for (ξw, ιp). Gray dashed lines: original posterior; colored solid lines: worst-case posterior.

More generally, these distortions indicate that if we check the robustness of the posterior by changing the prior of a set of parameters in the direction suggested by economic intuition without accounting for correlations across parameters, we may understate the sensitivity of the posterior.

The posterior correlations for ξp further emphasize this point and provide an additional explanation for the inconsistent direction of distortions for the two worst-case posteriors of ξp. The rigidity parameters ξw and ξp have a positive correlation of 0.19 conditional on the impulse response being small, and a negative correlation of −0.20 conditional on the impulse response being large. This drives ξp in the same direction as ξw for the lower bound, and in the opposite direction from ξw for the upper bound. Given the distortions for ξw, it is therefore optimal to decrease ξp for both the upper and lower bound. In addition, ξp is negatively correlated with ιw for both the lower and upper bounds. This implies that distorting their priors in the same direction would concentrate distortions in low likelihood regions of the parameter space, which explains why the posterior distortions are small and why, for the upper bound, the prior distortions for ξp and ιw are in opposite directions.

6.5 Comparison to local methods

The wide robust error bands stand in contrast to Müller (2012), who finds that the impulse responses are relatively insensitive to the priors for the structural parameters. The difference arises because the REPS analysis allows for changes in the prior that are not considered by Müller (2012).
The worst-case priors reveal the importance of the correlations and tail behavior of the prior, neither of which is accounted for by the perturbations considered by Müller (2012). These features arise from assumptions made for convenience, making it crucial to understand how they matter for the posterior.

Local methods rely on derivatives, which also fail to capture the asymmetry in the sensitivity to the prior. Figure 6.1 shows that prior sensitivity depends on the direction in which we wish to change the impulse response. Such non-linearity in the prior sensitivity should be expected more broadly, given the irregularity of the likelihood and the complicated mapping from the parameters to the function of interest ψ in many applications.

7 Conclusion

To understand how the data inform the posterior estimates, one needs to disentangle the roles of the prior and the likelihood. Despite numerous assumptions made when writing down priors and frequent disagreement over what priors to use, such analysis is often either absent or ad hoc. Reestimating the model for a full set of alternative priors is often computationally demanding and even infeasible. Nevertheless, few prior sensitivity tools have been developed for broad economic applications.

REPS allows for global prior sensitivity analysis even in large models. The global nature of REPS accounts for features of the likelihood that may be neglected with local methods or simple inspection of the prior and posterior. The framework allows us to study the robustness of credible intervals to the prior or to focus on subvectors of the parameter that may be of special interest. REPS reduces the problem of checking across an infinite dimensional set of priors to that of solving for one unknown from one equation, allowing us to complete the computations in roughly the time it would have taken to estimate the model once.

The New Keynesian model of Smets and Wouters (2007) provides a laboratory to show how REPS can reveal which properties of the posterior are sensitive to changes in the prior and which parts of the prior are important for these properties. The worst-case distortions uncover features of the likelihood that are important for posterior inference yet hard to detect. These are useful diagnostics for any Bayesian estimation. In parallel work, Ho (2020) uses REPS to uncover the data required to identify the effects of demographic changes on long-run interest rates. There is much future work to be done applying REPS to a wider range of applications, both as a robustness check and as a tool to understand how the data inform our estimates.

References

Abdulkadiroglu, A., N. Agarwal, and P. Pathak (2017). The Welfare Effects of Coordinated School Assignment: Evidence from the New York City High School Match. American Economic Review 107 (12), 3635–3689.

Avery, C. N., M. E. Glickman, C. M. Hoxby, and A. Metrick (2013). A Revealed Preference Ranking of U.S. Colleges and Universities. Quarterly Journal of Economics 128 (1), 425–467.

Baumeister, C. and J. D. Hamilton (2015). Sign Restrictions, Structural Vector Autoregressions, and Useful Prior Information. Econometrica 83 (5), 1963–1999.

Berger, J. and L. M. Berliner (1986). Robust Bayes and Empirical Bayes Analysis with Contaminated Priors. The Annals of Statistics 14 (2), 461–486.

Berger, J. O., E. Moreno, L. R. Pericchi, M. J. Bayarri, J. M. Bernardo, J. A. Cano, J. De la Horra, J. Martín, D. Ríos-Insúa, B. Betrò, et al. (1994). An Overview of Robust Bayesian Analysis. Test 3 (1), 5–124.
Bidder, R., R. Giacomini, and A. McKenna (2016). Stress Testing with Misspecified Models. Federal Reserve Bank of San Francisco, Working Paper Series 2016-26.

Cai, M., M. Del Negro, E. Herbst, E. Matlin, R. Sarfati, and F. Schorfheide (2019). Online Estimation of DSGE Models. FRB of New York Staff Report (893).

Canova, F. and L. Sala (2009). Back to Square One: Identification Issues in DSGE Models. Journal of Monetary Economics 56 (4), 431–449.

Chamberlain, G. and E. E. Leamer (1976). Matrix Weighted Averages and Posterior Bounds. Journal of the Royal Statistical Society: Series B (Methodological) 38 (1), 73–84.

Del Negro, M. and F. Schorfheide (2004). Priors from General Equilibrium Models for VARs. International Economic Review 45 (2), 643–673.

Del Negro, M. and F. Schorfheide (2008). Forming Priors for DSGE Models (and how it affects the assessment of nominal rigidities). Journal of Monetary Economics 55 (7), 1191–1208.

Fernández-Villaverde, J., J. F. Rubio-Ramírez, and F. Schorfheide (2016). Solution and Estimation Methods for DSGE Models. In Handbook of Macroeconomics (1 ed.), Volume 2, pp. 527–724. Elsevier B.V.

Giacomini, R. and T. Kitagawa (2018). Robust Bayesian Inference for Set-Identified Models. Cemmap Working Paper.

Giacomini, R., T. Kitagawa, and M. Read (2019). Robust Bayesian Inference in Proxy SVARs. Working paper.

Giacomini, R., T. Kitagawa, and H. Uhlig (2019). Estimation under Ambiguity. Working paper.

Giannone, D., M. Lenza, and G. E. Primiceri (2018). Priors for the Long Run. Journal of the American Statistical Association.

Gustafson, P. (2000). Local Robustness in Bayesian Analysis. In Robust Bayesian Analysis, pp. 71–88. Springer.

Hansen, L. P. and T. J. Sargent (2001). Robust Control and Model Uncertainty. American Economic Review: Papers & Proceedings 91 (2).

Hansen, L. P. and T. J. Sargent (2007). Recursive Robust Estimation and Control Without Commitment. Journal of Economic Theory 136 (1), 1–27.

Hansen, L. P. and T. J. Sargent (2008). Robustness. Princeton University Press.

Herbst, E. and F. Schorfheide (2014). Sequential Monte Carlo Sampling for DSGE Models. Journal of Applied Econometrics 29, 1073–1098.

Herbst, E. and F. Schorfheide (2015). Bayesian Estimation of DSGE Models. Princeton University Press.

Ho, P. (2020). Estimating the Effects of Demographics on Interest Rates: A Robust Bayesian Perspective. Working paper.

Iskrev, N. (2010). Local Identification in DSGE Models. Journal of Monetary Economics 57 (2), 189–202.

Komunjer, I. and S. Ng (2011). Dynamic Identification of Dynamic Stochastic General Equilibrium Models. Econometrica 79 (6), 1995–2032.

Koop, G., M. H. Pesaran, and R. P. Smith (2013). On Identification of Bayesian DSGE Models. Journal of Business and Economic Statistics 31 (3), 300–314.

Leamer, E. E. (1982). Sets of Posterior Means with Bounded Variance Priors. Econometrica: Journal of the Econometric Society, 725–736.

Moreno, E. (2000). Global Bayesian Robustness for Some Classes of Prior Distributions. In Robust Bayesian Analysis, pp. 45–70. Springer.

Müller, U. K. (2012). Measuring Prior Sensitivity and Prior Informativeness in Large Bayesian Models. Journal of Monetary Economics 59 (6), 581–597.

Petersen, I. R., M. R. James, and P. Dupuis (2000). Minimax Optimal Control of Stochastic Uncertain Systems with Relative Entropy Constraints. IEEE Transactions on Automatic Control 45 (3), 398–412.

Robertson, J., E. W. Tallman, and C. H. Whiteman (2005). Forecasting Using Relative Entropy.
Journal of Money, Credit and Banking 37 (3), 383–401.

Sala, L. (2015). DSGE Models in the Frequency Domain. Journal of Applied Econometrics 30, 219–240.

Schmitt-Grohé, S. and M. Uribe (2012). What's News in Business Cycles. Econometrica 80 (6), 2733–2764.

Smets, F. and R. Wouters (2007). Shocks and Frictions in US Business Cycles: A Bayesian DSGE Approach. American Economic Review 97 (3), 586–606.

Appendix A Proofs

A.1 Solution to primal and dual problems

Lemma 2. The solutions for M in problems (2.3)-(2.4) and (2.5)-(2.6) both have the form (2.7).

Proof. Denote the marginal data density under prior π and likelihood L by ζ ≡ ∫ π(θ)L(θ|X) dθ. First recall that

p(θ|X) = π(θ)L(θ|X)/ζ.   (A.1)

Consider the dual problem (2.5)-(2.6). Attaching the multiplier λEp[M]ζ to (2.6) and the multiplier µ to the constraint Eπ[M] = 1, we have the first-order condition:

0 = µ + 1 + log M(θ) − λL(θ|X)(γ(θ) − γ̃),   (A.2)

which we can rearrange to obtain (2.7).

Now consider the primal problem (2.3)-(2.4). Attaching the multiplier 1/(λEp[M]ζ) to (2.4) and the multiplier µ/(λEp[M]ζ) to the constraint Eπ[M] = 1, we have the first-order condition:

0 = µ + 1 + log M(θ) − λEp[M]L(θ|X)( γ(θ)/Ep[M] − Ep[M(θ)γ(θ)]/Ep[M]² )
  = µ + 1 + log M(θ) − λL(θ|X)( γ(θ) − Ep[ (M(θ)/Ep[M]) γ(θ) ] ).   (A.3)

Rearranging (A.3), we have:

M(θ) ∝ exp[ λL(θ|X)( γ(θ) − Ep[ (M(θ)/Ep[M]) γ(θ) ] ) ],   (A.4)

which has the same form as (2.7) once we replace γ̃ with the full expression for the worst-case posterior mean of γ.

A.2 Subspaces

Lemma 3. The solution for M in problem (2.11)-(2.12) is (2.13).

Proof. Notice that

Ep*[M(θ*)γ(θ)] = Ep*[M(θ*)Ep[γ(θ)|θ*]].   (A.5)

It immediately follows that the first-order condition of (2.11)-(2.12) is:

0 = µ + 1 + log M(θ) − λL(θ|X)(Ep[γ(θ)|θ*] − γ̃),   (A.6)

which simplifies to (2.13).

A.3 Additional constraints

Consider the constrained optimization problem:

min_{M(θ): Eπ[M]=1} Eπ[M(θ) log M(θ)]   (A.7)
s.t. Ep[ (M(θ)/Ep[M]) γ(θ) ] = γ̃   (A.8)
Ep[ (M(θ)/Ep[M]) gp,k(θ) ] = g̃p,k for k = 1, ..., K   (A.9)
Eπ[M(θ)gπ,l(θ)] = g̃π,l for l = 1, ..., L   (A.10)

This is the problem (2.5)-(2.6) augmented by the additional moment conditions (A.9)-(A.10). Attaching multipliers λEp[M]ζ, µp,k Ep[M]ζ and µπ,l to constraints (A.8), (A.9) and (A.10) respectively, we obtain the first-order condition:

0 = µ + 1 + log M(θ) − λL(θ|X)(γ(θ) − γ̃) − ∑_{k=1}^{K} µp,k L(θ|X)(gp,k(θ) − g̃p,k) − ∑_{l=1}^{L} µπ,l (gπ,l(θ) − g̃π,l),   (A.11)

where µ is the multiplier on the constraint Eπ[M] = 1. We rearrange (A.11) to obtain the solution:

M(θ) ∝ exp[λL(θ|X)(γ(θ) − γ̃)] × exp[ L(θ|X) ∑_{k=1}^{K} µp,k (gp,k(θ) − g̃p,k) + ∑_{l=1}^{L} µπ,l (gπ,l(θ) − g̃π,l) ],   (A.12)

where the second term introduces K + L additional unknowns arising from the moment conditions (A.9)-(A.10).

A.4 Evaluating relative entropy

Lemma 4. The solution for the minimum relative entropy in (2.5)-(2.6) is:

Eπ[M(θ) log M(θ)] = −log Eπ[exp[λL(θ|X)(γ(θ) − γ̃)]].   (A.13)

Proof. Define κ ≡ 1/Eπ[exp[λL(θ|X)(γ(θ) − γ̃)]]. Taking logs of the solution (2.7) yields:

log M(θ) = log κ + λL(θ|X)(γ(θ) − γ̃).   (A.14)

Denote the marginal data density by ζ ≡ ∫ π(θ)L(θ|X) dθ and denote the worst-case prior and posterior by π̃ and p̃ respectively.
Expand the expression for relative entropy:

Eπ[M(θ) log M(θ)] = ∫ M(θ) log M(θ) π(θ) dθ
= ∫ [log κ + λL(θ|X)(γ(θ) − γ̃)] π̃(θ) dθ
= log κ + λζ ∫ (γ(θ) − γ̃) (π̃(θ)L(θ|X)/ζ) dθ
= log κ + λζ [ ∫ γ(θ) p̃(θ|X) dθ − γ̃ ]
= log κ.   (A.15)

The third equality uses the fact that π̃ integrates to one. The fourth equality uses the fact that p̃ = π̃L/ζ. The last equality uses (2.6) and the fact that Ep[(M(θ)/Ep[M])γ(θ)] = ∫ γ(θ) p̃(θ|X) dθ.

A.5 Asymptotics

Define Σℓ,T ≡ (1/T)Ω to be the variance of the likelihood.

Proof. (Lemma 1) Assume X̄T = θ0, and first consider the case with Ω diagonal, which implies Σℓ,T is diagonal. Define ∆(θ) ≡ θ − θ0. Abusing notation, we can write the likelihood as a function of ∆:

L(∆; Σℓ,T) = |2πΣℓ,T|^(−1/2) exp( −(1/2) ∆′ Σℓ,T^(−1) ∆ ) = |Σℓ,T|^(−1/2) L(Σℓ,T^(−1/2)∆; I).   (A.16)

Abusing notation again, write (2.7) as a function of ∆:

M(∆; Σℓ,T) ∝ exp[ λ(Σℓ,T) L(∆; Σℓ,T) ( ∆1 + (1 − Tσ²1,p,T/ω1²)θ0 + cσ1,p,T ) ].   (A.17)

As T → ∞, since Tσ²1,p,T/ω1² → 1, there exist λ*, M* such that:

|Σℓ,T|^(−1/2) λ(Σℓ,T) → λ*   (A.18)
M(Σℓ,T^(1/2)∆; Σℓ,T) → M*(∆)   (A.19)

for (2.6) and Eπ[M(∆)] = 1 to be satisfied. In particular, we have

M*(∆) ∝ exp[λ* L(∆; I)(∆1 + cω1)].   (A.20)

Denoting π̂(∆) ≡ π(∆ + θ0), the relative entropy is

T^(d/2) RT = T^(d/2) ∫ M(∆; Σℓ,T) log[M(∆; Σℓ,T)] π̂(∆) d∆   (A.21)
≈ T^(d/2) ∫ M*(Σℓ,T^(−1/2)∆) log[M*(Σℓ,T^(−1/2)∆)] π̂(∆) d∆   (A.22)
≈ R π(θ0) T^(d/2) |Σℓ,T|^(1/2) = R π(θ0) |Ω|^(1/2)   (A.23)

for some constant R. The second line follows because we can find, for any ε, some neighborhood Nε around zero such that M*(∆) log[M*(∆)] < ε for all ∆ ∉ Nε.

When Ω is not diagonal, we first note that we can decompose the likelihood of θ into the marginal likelihood of θ1 and the conditional likelihood of θ2:d|θ1, both of which remain Gaussian. An eigendecomposition of θ2:d|θ1 reparameterizes the likelihood in terms of orthogonal components, after which we can apply the proof for diagonal Ω. Finally, the proof follows through with general X̄T since X̄T → θ0 almost surely.

B Sequential Monte Carlo

B.1 Implementation details

Constructing bridge distributions. To define the sequence of worst-case means, one can take:

γ̃i = Ep[γ(θ)] + (γ̃ − Ep[γ(θ)]) (i/NSMC)^ν.   (B.1)

A smaller value of ν corresponds to larger initial steps, and smaller steps toward the end of the SMC algorithm.17 Substituting γ̃i into the right-hand side of (2.6) for each i yields a sequence of distortions:

Mi(θ) ∝ exp[λi L(θ|X)(γ(θ) − γ̃i)],   (B.2)

which in turn imply a sequence of intermediate worst-case priors {πi}, i = 0, ..., NSMC, and posteriors {pi}, i = 0, ..., NSMC.

17 Cai et al. (2019) propose an adaptive algorithm to select the step sizes.

Transition between bridge distributions. To transition iteratively through these bridge distributions, we use transition weights πi/πi−1 and pi/pi−1, both of which are proportional to:

mi(θ) ≡ Mi(θ)/Mi−1(θ) ∝ exp[L(θ|X)[λi(γ(θ) − γ̃i) − λi−1(γ(θ) − γ̃i−1)]].   (B.3)

Given λi−1 and draws from πi−1 and pi−1, the only unknown remaining is λi, which we solve for from (2.7), which we rewrite as:

γ̃i = Epi−1[ exp[L(θ|X)[λi(γ(θ) − γ̃i) − λi−1(γ(θ) − γ̃i−1)]] γ(θ) ] / Epi−1[ exp[L(θ|X)[λi(γ(θ) − γ̃i) − λi−1(γ(θ) − γ̃i−1)]] ].   (B.4)

With a sufficiently large NSMC, importance sampling of pi from pi−1 is feasible. We can then solve for λi in (B.4) by using the particle approximation of pi−1 to evaluate the expectation on the right-hand side.
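A minimal sketch of this step, solving (B.4) for λi by one-dimensional root-finding over the particle approximation of pi−1; the function name, the bracketing interval, and the use of SciPy's brentq are illustrative assumptions rather than the paper's implementation:

```python
import numpy as np
from scipy.optimize import brentq

def solve_lambda_i(lik, gamma, lam_prev, gamma_tilde_prev, gamma_tilde_i,
                   lam_lo=-1e6, lam_hi=1e6):
    """Solve (B.4) for lambda_i, given particles drawn from p_{i-1}.

    lik   : values of the likelihood L(theta|X) at the particles
    gamma : values of the statistic gamma(theta) at the same particles
    """
    def moment_gap(lam):
        # Log of the incremental weights m_i(theta), up to an additive constant.
        logw = lik * (lam * (gamma - gamma_tilde_i)
                      - lam_prev * (gamma - gamma_tilde_prev))
        w = np.exp(logw - logw.max())  # subtract the max to avoid overflow
        # Distorted posterior mean of gamma minus its target value.
        return np.average(gamma, weights=w) - gamma_tilde_i
    # Assumes the bracket [lam_lo, lam_hi] contains a sign change.
    return brentq(moment_gap, lam_lo, lam_hi)
```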
Number of particles, mutation steps, and SMC steps. Given the sequence {γ̃i}, i = 1, ..., NSMC, three parameters need to be chosen for Algorithm 1: the number of particles NP, the number of Metropolis-Hastings mutation steps NMH, and the number of SMC steps NSMC. Relative to Herbst and Schorfheide (2014), it is more important here to have a large number of particles, so that the expectations in equation (B.4) are evaluated accurately when solving for λi. Similarly, NMH must be sufficiently large in order to solve for λi accurately. If λi is computed accurately, the posterior mean of γ evaluated from the particles before and after the mutation step should be identical up to sampling error. We check if NSMC is sufficiently large by ensuring that at each stage, the distribution of mi is well-behaved in the tails.

Moving from approximate to true worst-case distributions. Algorithm 2 provides draws from an approximate worst-case prior and posterior, with distortions

M̃(θ*) ∝ exp[λ̃ L̂(θ*|X)(γ̂(θ*) − γ̃)]   (B.5)

instead of (2.13). To transform these draws into draws from the true worst-case distribution, notice that the Radon-Nikodym derivative between the true and approximate worst-case distributions is

M(θ*)/M̃(θ*) ∝ exp[λ* L*(θ*|X)(γ*(θ*) − γ̃)] / exp[λ̃ L̂(θ*|X)(γ̂(θ*) − γ̃)],   (B.6)

where we can solve for λ* using the constraint (2.12). Once we solve for λ*, we can begin with the approximate worst-case draws, then use the selection and mutation steps from Algorithm 1 to obtain draws from the true worst-case distributions.

B.2 Evaluating relative entropy

We now use the output from the SMC in Algorithm 1 together with Lemma 4 to evaluate the relative entropy of the worst-case prior relative to the original prior. To use Lemma 4, recall that the intermediate weights in Algorithm 1 have the form:

mi(θ) ≡ Mi(θ)/Mi−1(θ) = (κi/κi−1) exp[L(θ|X)[λi(γ(θ) − γ̃i) − λi−1(γ(θ) − γ̃i−1)]].   (B.7)

Since κ = κNSMC = ∏_{i=1}^{NSMC} (κi/κi−1), at each stage we evaluate:

κi/κi−1 = Eπi−1[exp[L(θ|X)[λi(γ(θ) − γ̃i) − λi−1(γ(θ) − γ̃i−1)]]],   (B.8)

from which we obtain:

Eπ[Mi(θ) log Mi(θ)] = ∑_{ι=1}^{i} log(κι/κι−1)   (B.9)

for i = 1, ..., NSMC. Directly evaluating the relative entropies from the particle approximations would itself require solving for κi, leading to greater numerical error.

C Gaussian location model finite sample performance

We now show that Lemma 1 provides a good approximation for the relative entropy in the Gaussian location model even for relatively small values of T. In each case, we set Σπ = Ω = I and show that the relative entropy needed to:
1. increase the posterior mean of θ1 by one posterior standard deviation; or
2. increase the 84% quantile of θ1 by one posterior standard deviation,
approximately scales with T^(−d/2) and π(θ0), as predicted by Lemma 1.

Figure C.1: Relative entropy for a given number of observations T in the Gaussian location model with θ ∼ N(0, I), X̄T = 0, d ∈ {1, ..., 4} (log-log scale). Left: increase in the posterior mean of θ1 by one posterior standard deviation; Right: increase in the 84% quantile of θ1 by one posterior standard deviation.

All calculations for the Gaussian location model for dimensionality d = 1 and d = 2 are done using grids. For d = 1, the grid has range [−8, 8] and 10⁵ + 1 uniformly spaced grid points. For d = 2, the grid has range [−5, 5] × [−5, 5] and 10³ + 1 uniformly spaced grid points in each direction.
For d > 2, I use the SMC Algorithm 1 with (NP, NMH, NSMC) = (d × 10⁵, 10, 100), and average across 25 runs.

Sample size and dimension. To show the dependence of R on T^(−d/2), we first fix X̄T = 0 and vary d ∈ {1, ..., 4} and T ∈ {1, ..., 10³}. Figure C.1 shows the relative entropy for different values of T and d. Firstly, notice that Lemma 1 gives an accurate approximation for the behavior of relative entropy as T increases for T ≥ 10, with a gradient of −d/2 when we plot log R against log T. When T is small, the data have not yet swamped the prior. The resulting greater sensitivity to the prior is reflected in the relative entropy for T < 10 being small compared to when T ≥ 10, relative to what is predicted by Lemma 1.

Sample mean. To show the dependence of R on π(θ0), we now fix d = 1 and vary X̄T such that θp,T ∈ [−2, 2]. We do this for T ∈ {10, 1000}. Figure C.2 compares the relative entropy for different values of θp,T to the prior π(θp,T) at that point, normalizing the values so all plots have a maximum of one. Even with T = 10, the relative entropy is almost proportional to the prior π(θp,T) at the posterior mean. With T = 1000, the scaled prior and relative entropy are visually indistinguishable.

We consider the prior at θp,T instead of X̄T for two reasons. Firstly, locating the maximum likelihood may be computationally involved in settings other than the Gaussian location model, while evaluating the posterior mean is trivial given Monte Carlo draws. Secondly, since θp,T → θ0, using the posterior mean is asymptotically equivalent to using X̄T.

Figure C.2: Relative entropy for a given posterior mean with θ ∼ N(0, 1), θp,T ∈ [−2, 2], compared to the scaled prior. Left: increase in the posterior mean of θ1 by one posterior standard deviation; Right: increase in the 84% quantile of θ1 by one posterior standard deviation.

D Smets and Wouters (2007)

D.1 Gaussian mixture approximation of posterior

To approximate p, I take a Gaussian mixture approximation of p(θ̂|X), where θ̂ is the following transformation of θ:

θ̂i = θi                      if θi ∈ (−∞, ∞)
      log(θi)                 if θi ∈ (0, ∞)        (D.1)
      log(θi/(1 − θi))        if θi ∈ (0, 1)

which is chosen so that all the components of θ̂ are bounded neither above nor below. This transformation improves the quality of the approximation, especially around the tails, because the marginals of the transformed parameters are closer to being Gaussian.

There is suggestive evidence that the Gaussian mixture approximates the posterior well. Figure D.1 plots the marginals of each parameter under the posterior and the Gaussian mixture approximation. We see that the marginals are visually indistinguishable. The first and second moments are also very similar. A (one component) Gaussian approximation would match these moments perfectly except for sampling error.

Figure D.1: Original and approximate marginal posteriors. Blue solid line corresponds to the true posterior; red dashed line corresponds to the Gaussian mixture approximation.
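A minimal sketch of this approximation step, assuming posterior draws stored in an (N, 36) array and using scikit-learn's GaussianMixture as a stand-in for the MATLAB routines used in the paper; the support labels and variable names are hypothetical:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def to_unbounded(theta, support):
    """Map posterior draws to an unbounded space, as in (D.1).
    support[i] is 'real', 'positive', or 'unit' for parameter i."""
    out = np.array(theta, dtype=float, copy=True)
    for i, s in enumerate(support):
        if s == "positive":
            out[:, i] = np.log(out[:, i])
        elif s == "unit":
            out[:, i] = np.log(out[:, i] / (1.0 - out[:, i]))
    return out

# theta_draws: (N, 36) array of posterior draws; support: list of 36 labels.
# mixture = GaussianMixture(n_components=40).fit(to_unbounded(theta_draws, support))
```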
D.2 Sequential Monte Carlo

Worst-case quantiles. Let ψh(θ) be the impulse response of output to a 100 basis point decrease in interest rates at horizon h. Let Qf[ψh; q] be the qth quantile of ψh under the distribution f, and let σf[·] denote the standard deviation under distribution f. We set the worst-case qth quantile ψ̃h^q for horizon h as follows. For the Taylor rule prior, for the lower bound we choose:

ψ̃h^0.16 = Qp[ψh; 0.16] − (1/8)σp[ψh]    if h ≤ 2 or h ≥ 16
           Qp[ψh; 0.16] − (1/5)σp[ψh]    if 13 ≤ h ≤ 15
           Qp[ψh; 0.16] − (1/4)σp[ψh]    if h = 3 or 10 ≤ h ≤ 12        (D.2)
           Qp[ψh; 0.16] − (1/3)σp[ψh]    otherwise

and for the upper bound we choose

ψ̃h^0.84 = Qp[ψh; 0.84] + (1/2)σp[ψh]    if h ≤ 4
           Qp[ψh; 0.84] + (3/4)σp[ψh]    if 5 ≤ h ≤ 8                   (D.3)
           Qp[ψh; 0.84] + σp[ψh]         otherwise

For the nominal frictions prior, for the lower bound we choose:

ψ̃h^0.16 = Qp[ψh; 0.16] − (1/240)σp[ψh]   if h ≤ 2
           Qp[ψh; 0.16] − (1/80)σp[ψh]    if h = 3
           Qp[ψh; 0.16] − (1/40)σp[ψh]    if 10 ≤ h ≤ 21                (D.4)
           Qp[ψh; 0.16] − (1/20)σp[ψh]    otherwise

and for the upper bound we choose

ψ̃h^0.84 = Qp[ψh; 0.84] + (1/2)σp[ψh]    if h ≤ 4
           Qp[ψh; 0.84] + (3/4)σp[ψh]    if 5 ≤ h ≤ 8                   (D.5)
           Qp[ψh; 0.84] + σp[ψh]         otherwise

The worst-case quantiles are chosen so that they imply similar sized distortions in terms of relative entropy.

Bridge distributions. To construct the bridge distributions, we consider the sequence of quantiles analogously to (B.1):

ψ̃h,i^q = Qp[ψh; q] + (ψ̃h^q − Qp[ψh; q]) (i/NSMC)^ν.   (D.6)

For the Taylor rule prior, we set ν = 1/2 and ν = 3/4 for the lower and upper bound respectively. For the nominal frictions prior, we set ν = 1/3 and ν = 3/4 for the lower and upper bound respectively.

D.3 Robust error bands

To generate the robust error bands, we use a quadratic regression for each bound and horizon to predict the worst-case quantile for a given relative entropy. In particular, at each stage of each SMC run, we evaluate the quantile and the relative entropy. Aggregating across the 10 SMC runs, we obtain 250 × 10 = 2,500 points, to which we fit a quadratic regression of the quantile on the relative entropy. The robust error bands are constructed from the fitted values for a given level of distortion. For robustness, I also fit a mixture regression to account more flexibly for nonlinearities, but do not find substantive differences.
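A minimal sketch of this last step (array and function names are hypothetical; the paper's MATLAB implementation may differ):

```python
import numpy as np

def robust_band_point(entropies, quantiles, target_entropy):
    """Fit quantile = a + b*R + c*R^2 across the pooled SMC stages and runs,
    then evaluate the fit at the targeted relative entropy."""
    coefs = np.polyfit(entropies, quantiles, deg=2)
    return np.polyval(coefs, target_entropy)

# entropies, quantiles: arrays of length 2,500 collected across stages and runs
# for one bound and horizon; the fitted value at the chosen relative entropy
# gives the corresponding point on the robust error band.
```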