Finance and Economics Discussion Series
Divisions of Research & Statistics and Monetary Affairs
Federal Reserve Board, Washington, D.C.

The Reliability of Inflation Forecasts Based on Output Gap Estimates in Real Time

Athanasios Orphanides and Simon van Norden∗

2004-68

November 2004

NOTE: Staff working papers in the Finance and Economics Discussion Series (FEDS) are preliminary materials circulated to stimulate discussion and critical comment. The analysis and conclusions set forth are those of the authors and do not indicate concurrence by other members of the research staff or the Board of Governors. References in publications to the Finance and Economics Discussion Series (other than acknowledgement) should be cleared with the author(s) to protect the tentative character of these papers.

Abstract

A stable predictive relationship between inflation and the output gap, often referred to as a Phillips curve, provides the basis for countercyclical monetary policy in many models. In this paper, we evaluate the usefulness of alternative univariate and multivariate estimates of the output gap for predicting inflation. Many of the ex post output gap measures we examine appear to be quite useful for predicting inflation. However, forecasts using real-time estimates of the same measures do not perform nearly as well. The relative usefulness of real-time output gap estimates diminishes further when compared to simple bivariate forecasting models which use past inflation and output growth. Forecast performance also appears to be unstable over time, with models often performing differently over periods of high and low inflation. These results call into question the practical usefulness of the output gap concept for forecasting inflation.

Keywords: Phillips curve, output gap, inflation forecasts, real-time data.
JEL Classification System: E37, C53.
Athanasios Orphanides is an adviser in the Division of Monetary Affairs at the Board of Governors of the Federal Reserve System, a research fellow of the Centre for Economic Policy Research, and a fellow of the Center for Financial Studies. E-mail: Athanasios.Orphanides@frb.gov. Simon van Norden is a Professeur Agrégé at the HEC Montréal and a CIRANO fellow. E-mail: simon.vannorden@hec.ca.

∗ We benefited from presentations of earlier drafts at the European Central Bank, CIRANO, the Federal Reserve Bank of Philadelphia Conference on Real Time Data Analysis, the Centre for Growth and Business Cycle Research, as well as at the annual meetings of the American Economics Association, the European Economics Association and the Canadian Economics Association. We would also like to thank Sharon Kozicki, Tim Cogley, Jeremy Piger, Todd Clark, Desire Vencatachellom, Ken West and two anonymous referees for useful comments and discussions. Athanasios Orphanides wishes to thank the Sveriges Riksbank and European Central Bank for their hospitality during September 2001 when part of this work was completed. Simon van Norden wishes to thank the SSHRC and the HEC Montréal for their financial support. The opinions expressed are those of the authors and do not necessarily reflect the views of the Board of Governors of the Federal Reserve System.

1 Introduction

A stable predictive relationship between inflation and a measure of deviations of aggregate demand from the economy’s potential supply (the “output gap”) provides the basis for many formulations of activist countercyclical stabilization policy. Such a relationship, referred to as a Phillips curve, is often seen as a helpful guide for policymakers aiming to maintain low inflation and stable economic growth. According to this paradigm, when aggregate demand exceeds potential output, the economy is subject to inflationary pressures and inflation should be expected to rise.
Under these circumstances, policymakers aiming to contain the acceleration in prices might wish to adopt policies restricting aggregate demand. Similarly, when aggregate demand falls short of potential supply, inflation should be expected to fall, prompting policymakers to consider the adoption of expansionary policies.1

Footnote 1: The widespread use of models featuring estimated “Phillips curves” of various forms for monetary policy analysis at numerous central banks and other institutions is evidence of the appeal of this paradigm. See Bryant, Hooper and Mann (1993) and Taylor (1999) for collections of monetary policy evaluations that feature such estimated models.

Even assuming that the theoretical motivation for a relationship between the output gap and inflation is fundamentally correct, a number of issues may complicate its use for forecasting in practice. First, the definition of “potential output” (and the accompanying “output gap”) that might be useful in practice is far from clear. Given a definition of the output gap, its exact empirical relationship with inflation is not known a priori and would need to be determined from the data. Second, even if the proper conceptual and empirical relationships were identified, the operational usefulness of the output gap will be limited by the availability of timely and reliable estimates of the identified concept. As is well known, empirical estimates of the output gap are generally subject to significant and highly persistent revisions. (For example, see Orphanides and van Norden (2002).) The subsequent evolution of the economy leads to improved historical estimates of the gap by providing useful information about the state of the business cycle. As a result, considerable uncertainty regarding the value of the gap remains even long after it would be needed for forecasting inflation.
This suggests that although the output gap may be quite useful for historical analysis, its practical usefulness for forecasting inflation in real time may be quite limited. In this paper we assess the usefulness of alternative estimation methods of the output gap for predicting inflation, paying particular attention to the distinction between suggested usefulness—based on ex post analysis using revised output gaps, and operational usefulness—based on simulated real-time out-of-sample analysis.2 First, using out-of-sample analysis based on ex post estimates of the output gap, we confirm that many concepts appear to be useful for predicting inflation. This is as would be expected since the implicit Phillips curve relationships recovered in this manner are similar to the relationships commonly found in empirical macroeconometric models. To assess their operational usefulness, we generate out-of-sample forecasts based on real-time output gap measures; those constructed using only data (and parameter estimates) available at the time forecasts are generated.3 We compare the resulting forecasts to both autoregressive forecasts of inflation and bivariate forecasts that employ information from output growth as well as past inflation. Our findings show that forecasts using ex post estimates of the output gap severely overstate the gap’s usefulness for predicting inflation. Real-time forecasts using the output gap are often less accurate than forecasts that abstract from the output gap concept altogether. And the relative usefulness of real-time output gap estimates diminishes further when compared to simple bivariate forecasting models which use past inflation and output growth. In some cases, we find certain measures of the output gap produce superior forecasts of inflation. However, relative performance seems to vary considerably over time, with models which perform relatively well in some periods performing relatively poorly in others. 
Thus, past forecast performance may provide little guidance in selecting an operationally useful definition of the output gap going forward.

Footnote 2: Our analysis is related to investigations of the usefulness of the unemployment gap for forecasting inflation, such as Stock and Watson (1999), Atkeson and Ohanian (2001), and Fisher, Liu and Zhou (2002). In some macroeconometric models, unemployment gaps and output gaps are related through Okun’s law.

Footnote 3: For this exercise, we rely on the real-time dataset for macroeconomists which was created and is maintained by the Federal Reserve Bank of Philadelphia. See Croushore and Stark (2001) for background information regarding this database.

The remainder of this paper is organized as follows. In sections 2 and 3 we define the output gap concepts used and detail the methodology of our forecasting exercise. The main results are presented in section 4 and section 5 concludes.

2 Trends and Cycles Ex Post and in Real Time

One way to define the output gap is as the difference between actual output and an underlying unobserved trend towards which output would revert in the absence of business cycle fluctuations. Let $q_t$ denote the (natural logarithm of) actual output during quarter $t$, and $\mu_t$ its trend. Then the output gap, $y_t$, can be defined as the cyclic component resulting from the decomposition of output into a trend and a cycle component:

$$q_t = \mu_t + y_t$$

Since the underlying trend is unobserved, its measurement, and the resulting measurement of the output gap, very much depends on the choice of estimation method, underlying assumptions and available data that are brought to bear on the measurement problem. For any given method, simple changes in historical data and the availability of additional data can change, sometimes drastically, the resulting estimates of the cycle for a given quarter. Evidence of the difference between historical and real-time estimates of output gaps has been presented by Orphanides and van Norden (2002).
In Table 1, we present some of the summary reliability indicators they examine for twelve alternative measures of the output gap which we employ in our analysis.4 These results mirror those of Orphanides and van Norden (2002). We find that revisions in real-time estimates are often of the same magnitude as the historical estimates themselves and that, for many of the alternative methods, historical and real-time estimates frequently have opposite signs.

Footnote 4: Brief descriptions of the various measures appear in Appendix A. Further details, including the output gaps used in this study, as well as the programs and data used to create them, are freely available from the authors at http://www.hec.ca/pages/simon.van-norden.

The importance of ex post revisions to output gap estimates suggests that the presence of a predictive relationship between inflation and ex post estimated output gap measures does not guarantee that the output gap will be useful for forecasting inflation in practice. Simply put, the ex post estimates of output gaps at a point in time may differ substantially from estimates which could be made without the benefit of hindsight. As well, these differences may hinder the real-time estimation of the presumed predictive relationship, further complicating the real-time forecasting problem.

2.1 Data Sources and Vintages

We use the term vintage to describe the values for data series as published at a particular point in time. Most of our data are taken from the real-time data set compiled by Croushore and Stark (2001); we use the quarterly vintages from 1965Q1 to 2003Q3 for real output. Construction of the output series and its revision over time is further described in Orphanides and van Norden (1999, 2002). We use 2003Q3 data as “final data” recognizing, of course, that “final” is very much an ephemeral concept in the measurement of output. To measure inflation, we use the change in the log of the consumer price index (CPI).
We use this both for our forecasting experiments and also to estimate measures of the output gap in multivariate models that include inflation. CPI data are revised much less than output data, with changes in seasonal factors causing most of the revisions. We therefore use the 2003Q3 vintage of CPI data for all of our analysis. This allows us to focus on the effects of revisions in the output data and the estimated output gap in our analysis. One of our models (Structural VAR) also uses data on interest rates, which are never revised.

2.2 Measuring Output Gaps

We construct output gap estimates using a variety of different models, as listed in Table 1. Each of the output gap models is used to produce gap estimates of varying vintages. Each output gap vintage uses precisely one vintage of the output data. An estimated output gap is called a final estimate if it uses the final data vintage. Note that all the output gap estimation techniques (aside from the Hodrick-Prescott filter) require that one or more parameters be estimated to fit the data. Such estimation was repeated for every combination of technique and vintage. This means, for example, that in constructing output gap vintages from an unobserved components (UC) model spanning the period 1969Q1-2003Q3 (139 quarters), we reestimate the model’s parameters 139 times, and then store 139 series of smoothed estimates.

3 A Forecasting Experiment

We are interested in quantifying the extent to which the output gap concept provides a practical means of improving forecasts of inflation. The answer will clearly depend on a large number of factors, such as the time period of interest, the way in which forecasts are constructed, the benchmark against which such forecasts are compared, and the loss function used to evaluate the quality of different forecasts. We restrict our attention to US CPI inflation since 1969 and use the mean-squared forecast error (MSFE) to compare forecast quality.
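Before turning to the forecasting equations, the vintage-by-vintage construction of gap estimates described in Section 2.2 can be illustrated with the one method that requires no parameter estimation, the Hodrick-Prescott filter. The sketch below is only an illustration under stated assumptions and is not the authors' code: the function names `hp_trend` and `real_time_and_final_gaps`, the layout of `vintages` (one array of log output per vintage, each one quarter longer than the last), and the smoothing value `lam=1600` (a conventional choice for quarterly data that is not stated in this section) are all hypothetical.

```python
import numpy as np

def hp_trend(q, lam=1600.0):
    """Hodrick-Prescott trend of a 1-D series q (lam is an assumed smoothing value)."""
    q = np.asarray(q, dtype=float)
    T = len(q)
    # Second-difference operator: (D @ mu)[i] = mu[i] - 2*mu[i+1] + mu[i+2]
    D = np.zeros((T - 2, T))
    for i in range(T - 2):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]
    # The trend minimises sum((q - mu)^2) + lam * sum((D @ mu)^2)
    return np.linalg.solve(np.eye(T) + lam * D.T @ D, q)

def real_time_and_final_gaps(vintages, lam=1600.0):
    """Compare end-of-sample gap estimates with estimates from the last vintage.

    vintages: list of log-output arrays (hypothetical layout); vintages[k]
    holds one more quarter than vintages[k-1].
    Returns (real_time, final) gaps for the quarter that ends each vintage.
    """
    final_series = np.asarray(vintages[-1], dtype=float)
    final_gap = final_series - hp_trend(final_series, lam)
    real_time, final = [], []
    for v in vintages:
        v = np.asarray(v, dtype=float)
        gap = v - hp_trend(v, lam)            # filter re-run on this vintage only
        real_time.append(gap[-1])             # estimate available at the time
        final.append(final_gap[len(v) - 1])   # same quarter, seen with hindsight
    return np.array(real_time), np.array(final)
```

The difference `final - real_time` is then the total revision for each quarter. For the estimated models, the filter call would be replaced by a full re-estimation of the model on each vintage.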
3.1 Forecasting Inflation and Benchmarks

Let $\pi_t^h = \log(P_t) - \log(P_{t-h})$ denote inflation over $h$ quarters ending in quarter $t$. We examined forecasts of inflation at various horizons but use one year ($h = 4$) as our baseline. Note that because of reporting lags, data for quarter $t$ first become available in quarter $t+1$. Thus, a four-quarter ahead forecast is a forecast five quarters ahead of the last quarter for which actual data are available.5 Our objective, therefore, is to forecast $\pi_{t+4}^4$ with data for quarter $t-1$ and earlier periods.

Footnote 5: Since the last datapoint in our sample is for the 2003Q2 quarter, this implies that 2002Q1 is the last datapoint available for forming a forecast we can use in our evaluation experiment.

We examine simple linear forecasting models of the form:

$$\pi_{t+h}^h = \alpha + \sum_{i=1}^{n} \beta_i \cdot \pi_{t-i}^1 + \sum_{i=1}^{m} \gamma_i \cdot y_{t-i} + e_{t+h} \qquad (1)$$

where $n$ and $m$ denote the number of lags of inflation and the output gap in the equation. We estimate the unknown coefficients $\{\alpha, \beta_i, \gamma_i\}$ by ordinary least squares. We set $n$ and $m$ using a variety of different methods; in the results presented here we use the Bayes Information Criterion (BIC). Results with other lag selection methods were found to give similar conclusions.

To provide a benchmark for comparison, we estimate a univariate forecasting model of inflation based on equation (1) but omitting the output gaps. We refer to this model as the autoregressive (AR) benchmark. Of course, the problem faced by forecasters in practice is more complex than the one we consider. One obvious and important difference is that the information set available to policymakers is much richer. It is therefore possible that output gaps might improve on simple univariate forecasts of inflation but not on forecasts using a broader range of inputs. For this reason, tests against an autoregressive forecast benchmark should be considered to be weak tests of the utility of empirical output gap models.
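As a rough illustration of how equation (1) might be estimated with BIC lag selection, consider the following minimal sketch. It is not the authors' implementation: the function names, the search range (`max_lag=4`), the choice to hold the estimation sample fixed across candidate lag lengths, and the particular form of the BIC are assumptions made for the example. Here `pi1` holds one-quarter inflation $\pi^1_t$ and `gap` holds $y_t$, both indexed by quarter.

```python
import numpy as np

def design(pi1, gap, h, n, m, start):
    """Targets pi^h_{t+h} and regressors [1, pi1_{t-1..t-n}, gap_{t-1..t-m}]."""
    pi1 = np.asarray(pi1, dtype=float)
    X, y = [], []
    for t in range(start, len(pi1) - h):
        row = [1.0]
        row += [pi1[t - i] for i in range(1, n + 1)]
        row += [gap[t - i] for i in range(1, m + 1)]
        X.append(row)
        # h-quarter inflation ending in t+h: log(P_{t+h}) - log(P_t)
        y.append(pi1[t + 1:t + h + 1].sum())
    return np.array(X), np.array(y)

def select_by_bic(pi1, gap, h=4, max_lag=4):
    """Return (n, m, coef) minimising an assumed BIC over the lag grid.

    m = 0 drops the gap entirely, which gives the AR benchmark.
    """
    best = None
    for n in range(1, max_lag + 1):
        for m in range(0, max_lag + 1):
            # Common start so every candidate is fitted on the same sample
            X, y = design(pi1, gap, h, n, m, start=max_lag)
            coef = np.linalg.lstsq(X, y, rcond=None)[0]   # ordinary least squares
            resid = y - X @ coef
            N, k = X.shape
            bic = N * np.log(resid @ resid / N) + k * np.log(N)
            if best is None or bic < best[0]:
                best = (bic, n, m, coef)
    return best[1], best[2], best[3]
```

Note that the regressors start at lag 1 rather than lag 0, which mirrors the reporting lag discussed above: data for quarter $t$ are not yet available when the forecast is formed.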
To provide a slightly stronger test, we also consider benchmark forecasts which replace the output gap in (1) with the first difference of the log of real output. As St-Amant and van Norden (1998) argue, using output growth in this way can be interpreted as implicitly defining an estimated output gap as a one-sided filter of output growth with weights based on the estimated coefficients of equation (1). van Norden (1995) refers to such estimates as TOFU gaps (Trivial Optimal Filter–Unrestricted). We refer to this as the TF benchmark forecast and interpret it as a simple reduced-form inflation forecast that uses a slightly larger information set than the AR benchmark, one which contains historical information on both prices and output growth. Comparing forecasts based on output gaps to the TF benchmark aids in isolating the usefulness (or lack thereof) of the economic structure and other restrictions embedded in the construction of the output gaps.

3.2 Forecasting and Output Gap Revisions

Several practical issues complicate the use of (1) for inflation forecasting. Since the suitable number of lags of inflation and the output gap, $n$ and $m$, and the coefficients of the equation are not known a priori, these need to be estimated with available data. As our sample increases and additional data become available, these estimates change. In addition, output gap estimates (like output data) are revised over time. This, in turn, can influence the selected number of lags and the coefficients of equation (1) estimated in any given sample. In addition, given the parameters of the equation, revisions in the output gap will directly change the forecast value of inflation.

We therefore use (1) to construct three to four different kinds of forecasts for each output gap model. These forecasts differ in the way lag lengths are determined and in the way the output gap model is used.
Let $y_t^{i,j}$ be an estimate of the output gap at time $t$ formed using data of vintage $i$, where $i > t$ and $j = t$ or $i-1$. For non-UC models (i.e. all except the Watson, Harvey-Clark, Harvey-Jaeger, Kuttner and Gerlach-Smets models) the index $j$ is irrelevant; $y_t^{i,t} = y_t^{i,i-1}$. For UC models, $j = t$ denotes a filtered output gap estimate; although the model parameters are estimated using data up to $i-1$, the Kalman filter recursions to estimate the gap do not use data beyond $t$. For these same models, $j = i-1$ denotes a smoothed estimate; although $y_t^{i,t}$ and $y_t^{i,i-1}$ use the same parameter estimates to calculate the output gap, the latter also uses the data after $t$ to recursively update its estimate of $y_t$. When $T$ = 2003Q3, the terminology of Orphanides and van Norden (2002) refers to the time series $\{y_t^{T,T-1}\}$ as Final estimates of the gap and to $\{y_t^{T,t}\}$ as Quasi-Final estimates. We will commonly refer to these as FL and QF estimates.

These different kinds of output gap estimates are used to construct different kinds of forecasts. The first of these uses fixed lag lengths with final estimates of the output gap to recursively estimate the forecasting equation

$$\pi_{t+h}^h = \hat{\alpha}^{t-1} + \sum_{i=1}^{\hat{n}} \hat{\beta}_i^{t-1} \cdot \pi_{t-i}^1 + \sum_{i=1}^{\hat{m}} \hat{\gamma}_i^{t-1} \cdot y_{t-i}^{T,T-1} + e_{t+h} \qquad (2)$$

where $T$ refers to 2003Q3. This replicates the kind of recursively-estimated, out-of-sample forecasting experiments which are commonly performed but which ignore output gap revision. These forecasts are infeasible because they require information (Final estimates of output gaps) which is not available at the time the forecast is made. They also estimate the optimal lag lengths $\hat{m}, \hat{n}$ ex post. We refer to this Fixed-Lag Final-estimate forecast as FL-FL.
In the case of UC models, we can construct similar forecasts using Quasi-Final rather than Final estimates of the output gap

$$\pi_{t+h}^h = \hat{\alpha}^{t-1} + \sum_{i=1}^{\hat{n}} \hat{\beta}_i^{t-1} \cdot \pi_{t-i}^1 + \sum_{i=1}^{\hat{m}} \hat{\gamma}_i^{t-1} \cdot y_{t-i}^{T,t} + e_{t+h} \qquad (3)$$

Orphanides and van Norden (2002) note that the difference between the Final and Quasi-Final estimates of the output gap accounts for the bulk of the revisions in the output gaps they examine. The difference between the accuracy of these and the Final gap forecasts above helps us to understand the relative importance of errors in gap estimation for forecast accuracy. Like the Final gap forecasts, these forecasts are infeasible. We refer to these as FL-QF forecasts.

We also construct feasible forecasts which attempt to mirror closely the forecasts which practitioners would construct using such output gap models. Specifically, in these forecasts the lag lengths for both explanatory variables vary over time and are estimated recursively. The output gap series is also updated with its latest available vintage every time the parameters of the forecasting equation are re-estimated. The resulting Variable-Lag Real-Time output gap (VL-RT) forecasting equation takes the form6

$$\pi_{t+h}^h = \hat{\alpha}^{t-1} + \sum_{i=1}^{\hat{n}^{t-1}} \hat{\beta}_i^{t-1} \cdot \pi_{t-i}^1 + \sum_{i=1}^{\hat{m}^{t-1}} \hat{\gamma}_i^{t-1} \cdot y_{t-i}^{t,t-1} + e_{t+h} \qquad (4)$$

where the superscripts on $(\hat{m}, \hat{n})$ indicate the information set used to estimate the lag lengths. While these are the most realistic forecasts we examine, they are also the most difficult to compute. Among other things, they require more than just the real-time gap estimates presented in Orphanides and van Norden (2002); they require all vintages of the complete estimated output gap series.
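The recursive scheme behind equations (2) to (4) can be sketched as a loop over forecast origins in which only the source of the gap series changes. The code below is a simplified illustration under stated assumptions, not the authors' procedure: the names `regressors`, `recursive_forecasts`, `gap_for_origin` and `msfe` are hypothetical, and lag lengths are held fixed for brevity, whereas the VL-RT forecasts would re-select them at every origin.

```python
import numpy as np

def regressors(pi1, gap, t, n, m):
    """Row [1, pi1_{t-1..t-n}, gap_{t-1..t-m}] for forecast origin t."""
    return np.array([1.0]
                    + [pi1[t - i] for i in range(1, n + 1)]
                    + [gap[t - i] for i in range(1, m + 1)])

def recursive_forecasts(pi1, gap_for_origin, h=4, n=2, m=2, first_origin=40):
    """Out-of-sample forecasts of h-quarter inflation, re-estimated at each origin.

    gap_for_origin(t) returns the gap series to use at origin t (it must cover
    quarters 0..t-1). Returning the vintage-t series mimics a real-time
    exercise; always returning the final series mimics an ex post exercise.
    """
    pi1 = np.asarray(pi1, dtype=float)
    p = max(n, m)
    forecasts, actuals = [], []
    for t in range(first_origin, len(pi1) - h):
        gap = gap_for_origin(t)
        # Only targets already observed at t (ending in t-1 or earlier) are used
        sample = range(p, t - h)
        X = np.array([regressors(pi1, gap, s, n, m) for s in sample])
        y = np.array([pi1[s + 1:s + h + 1].sum() for s in sample])
        coef = np.linalg.lstsq(X, y, rcond=None)[0]
        forecasts.append(regressors(pi1, gap, t, n, m) @ coef)
        actuals.append(pi1[t + 1:t + h + 1].sum())
    return np.array(forecasts), np.array(actuals)

def msfe(forecasts, actuals):
    """Mean-squared forecast error."""
    return float(np.mean((actuals - forecasts) ** 2))
```

Running the loop twice on the same inflation data, once with final gaps and once with vintage-specific gaps, and comparing the two MSFEs gives a direct measure of how much gap revisions cost in forecast accuracy.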
To summarize, we can construct two or three series of forecasts for each output gap model we analyze: (1) using recursive estimation, fixed lag lengths and final output gap estimates, (2) using recursive estimation, fixed lag lengths and quasi-final output gap estimates (which are only available for the 5 UC models we examine), and (3) using recursive estimation, variable lag lengths and all vintages of smoothed output gap estimates. We also examine one other type of forecast, one which uses variable lag lengths and final output gaps and which we refer to as VL-FL. Like the FL-QF forecast, this helps to isolate the contribution of output gap revision to forecast accuracy. As we will see below, however, these methods differ in the appropriate ways one should conduct inference.

Footnote 6: Note that in equation (4) we use smoothed estimates of the output gap ($y_{t-i}^{t,t-1}$) rather than filtered estimates ($y_{t-i}^{t,t-i}$). This reflects the common practice of practitioners, which is to use the most accurate possible estimate of the gap in estimating their forecast equations. Limited experiments which replaced these smoothed estimates with filtered estimates suggest that this does not have a major impact on forecast performance. Koenig, Dolmas and Piger (2003) discuss how the use of data of varying vintage affects forecast accuracy.

3.3 Forecast Evaluation

We wish to evaluate the quality of the resulting forecasts by testing the null hypothesis that a given pair of models have equal MSFEs. Various tests of equal forecast accuracy have been proposed in recent years, notably by Diebold and Mariano (1995) for forecasting models without estimated parameters and by West (1996) for models with estimated parameters. While such tests have been popular, the assumptions they require are unfortunately violated for some of the hypotheses of interest here.
First, the use of Diebold-Mariano statistics with standard normal critical values for asymptotic inference is justified only if the two models being compared are not nested. However, when using suitable lag lengths, the output gap models nest the AR benchmark model. Clark and McCracken (2001) suggest alternative tests for the case of nested models, while Clark and McCracken (2002) find that the limiting distribution of these statistics is non-pivotal for forecast horizons greater than one period. To compare these models, we therefore use the MSE-F statistic proposed by McCracken (2000), which takes the form

$$\text{MSE-F} = P \cdot \frac{MSFE_1 - MSFE_2}{MSFE_2} \qquad (5)$$

where $P$ is the number of forecasts, $MSFE_1$ is the MSFE of the restricted model and $MSFE_2$ is the MSFE of the unrestricted model. The distribution of the statistic under the null hypothesis of equal MSFE is estimated via a bootstrap experiment with 2000 replications, as detailed in Appendix B. Because these distributions are non-pivotal, the test statistics are bootstrapped anew for every different choice of $(P, h, y, m, n)$. This means that every p-value we report for the AR benchmark is based on its own set of 2000 bootstrap experiments.

Second, while the available asymptotic theory underlying all such tests allows for the coefficients in an equation like (1) to be re-estimated over time, it assumes that lag lengths are fixed during the recursive estimation, that the data remain fixed during the recursive estimation, and that the data are not estimated. All these assumptions are violated for the VL-RT forecasts we construct, so no p-values are presented for this case.

Inference in the case of the TF benchmark is more straightforward as the models of interest are no longer nested. Accordingly, we base our inference on the test statistics proposed by Diebold and Mariano (1995) and West (1996).
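Returning briefly to the nested comparison against the AR benchmark, the statistic in equation (5) is simple to compute from two series of out-of-sample forecast errors. The sketch below covers only the statistic itself; the function name `mse_f` and its argument names are hypothetical, and the bootstrap that supplies critical values (Appendix B) is not reproduced here.

```python
import numpy as np

def mse_f(errors_restricted, errors_unrestricted):
    """MSE-F statistic of equation (5) from two equal-length error series."""
    e1 = np.asarray(errors_restricted, dtype=float)    # e.g. AR benchmark errors
    e2 = np.asarray(errors_unrestricted, dtype=float)  # e.g. output gap model errors
    P = len(e1)                                        # number of forecasts
    msfe1 = np.mean(e1 ** 2)
    msfe2 = np.mean(e2 ** 2)
    return P * (msfe1 - msfe2) / msfe2
```

A positive value indicates that the unrestricted model has the smaller MSFE; whether the difference is statistically significant depends on the bootstrapped distribution, not on a standard table.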
Specifically, for the comparison with the TF benchmark, letting $d_t \equiv e_{it}^2 - e_{jt}^2$ be the difference in squared forecast errors between model $i$ and model $j$ at time $t$, $\bar{d} \equiv T^{-1} \cdot \sum_{t=1}^{T} d_t$ the mean difference, and $\rho_\tau \equiv T^{-1} \cdot \sum_{t=\tau+1}^{T} (d_t - \bar{d}) \cdot (d_{t-\tau} - \bar{d})$ the estimated autocovariance of $d_t$ at lag $\tau$, we compute the test statistic:

$$z = \frac{\bar{d}}{\sqrt{\Omega / T}} \qquad (6)$$

where $\Omega \equiv \sum_{l=-6}^{6} (1 - |l|/7) \cdot \rho_l$ is the Newey-West (1986) Heteroscedasticity and Autocorrelation (HAC) robust estimator of the long-run variance of $d_t$. West (1996) shows that under conventional assumptions this statistic is asymptotically normally distributed under the null hypothesis of equal forecast accuracy when the parameters of the forecast model are estimated by ordinary least squares. We therefore calculate and report 2-sided p-values for the TF benchmark using the standard normal distribution. Again, this asymptotic theory is not applicable to the VL-RT forecasts, so no p-values are reported in this case.

4 Does the Output Gap Improve Forecasts of Inflation?

4.1 Are Improvements in Forecast Accuracy Significant?

Our next step is to examine the results of the forecasting experiments described above. Table 2 shows the results of formal tests for differences in MSFE between the two benchmark models and the twelve output gap models. The upper panel of the table compares forecasts constructed using final output data, final estimates of the output gap, and constant lag lengths in the forecasting equation (FL-FL). The middle panel of the table shows the comparable results when using quasi-final rather than final (i.e. filtered rather than smoothed) estimates of the output gap (FL-QF). Since such estimates can only be constructed from UC models of the output gap, only results for the five UC models are presented.
In both cases, we see the MSFE of the benchmark models, the fractional improvement in MSFE relative to the benchmark models, $(MSFE_{Benchmark} - MSFE_{Gap})/MSFE_{Gap}$, and the p-value for the test of the null hypothesis that the MSFEs of the benchmark and the gap model are equal. Differences between these two panels are entirely due to the effects of ex post revisions of output gaps.

The first thing apparent from the top panel of the table is that all the gap models forecast better than the autoregressive benchmark model when using final output gaps. In all but one case the differences in MSFE are greater than 10 per cent, and in four of the twelve cases they are greater than 30 per cent. The suggested improvement is statistically significant at the 5 per cent level for all but the SVAR model and at the one per cent level for nine of the twelve models. These results confirm the conventional wisdom that ex post output gaps appear to help forecast inflation. They also show that out-of-sample tests have sufficient power to detect relevant differences in MSFE.

The evidence supporting the usefulness of output gaps is weakened when the benchmark model is changed by adding real output growth to the forecasting equation (the TF model). As can be seen on the right side of the top panel, three of the twelve gap models now have larger MSFEs than the benchmark, and only five of the twelve show an improvement of more than 10 per cent. The differences in MSFE are significant at the 10 per cent level in only three cases and are never significant at the 5 per cent level. However, comparison of the significance of the differences in MSFE across the two benchmarks is complicated by differences in the tests used for nested and non-nested models, as explained in section 3.3. Note, in particular, that the reported p-values for nested models (the AR benchmark) are based on one-sided tests, while those for non-nested models (the TF benchmark) are based on two-sided tests.
In addition, Clark and McCracken (2001, 2002) suggest that the MSE-F statistic, which is used for the AR benchmark, is more powerful than the z statistic used for the TF benchmark.

The apparent superiority of output-gap based forecasts is also weakened by the use of quasi-final rather than final estimates of the gap, shown in the middle panel. Improvements over the AR benchmark are now lower in every case, falling 10 to 20 per cent, and in one case output-gap-based forecasts are less accurate than the benchmark. However, improvements in forecast accuracy are still significant at or near the 5 per cent significance level in the four remaining cases. The situation changes further if we instead use the TF benchmark. Four of the five models now forecast less accurately than the benchmark model. Ignoring the effects of output gap revisions evidently tends to overstate the importance and significance of output gaps for forecasting inflation.

The bottom panel of Table 2 shows the results of tests for differences in MSFE between the two benchmark models and the twelve output gap models when the forecasts are constructed with time-varying lag lengths and real-time output gap estimates (VL-RT). This change also increases the MSFE of the benchmark AR model by a little over 10 per cent. The relative accuracy of these real-time forecasts is almost always lower than that of the ex post forecasts analysed in the top panel of the table. Drops in relative MSFE are substantial for many models. As noted earlier, the normal asymptotic theory results are not valid in this case so no p-values are reported. Crude simulations based on bootstrapped MSE-F statistics, however, suggested that several output gap models which appeared to forecast significantly better than the AR benchmark in the top panel no longer showed a significant difference in accuracy.

The reversal in the performance of the output gap models relative to the output growth (TF) benchmark is even more striking.
This can be seen by comparing the top and bottom panels on the right-hand side of the table. In real time, none of the output gap models examined forecasts better than the TF benchmark.

4.2 The Effect of Output Gap Revisions on Relative Forecast Accuracy

To better understand the causes for the changes in MSFE noted above, Table 3 compares the MSFEs of three different forecasting experiments. The first is identical to that documented in the upper panel of the previous table, using final output data and gap estimates as well as constant lag lengths in the forecasting equation (FL-FL). The second experiment uses the same output data and gap estimates, but now updates the lag lengths each time the forecast coefficients are recursively re-estimated (VL-FL). The third experiment is identical to that documented in the bottom panel of the previous table, using time-varying lag lengths and real-time output gap estimates (VL-RT). Differences in outcomes between the first two experiments isolate the effects of variations in lag length. Differences between the second and third experiments similarly isolate the effects of output gap revision.

The table shows that the introduction of time-varying lag lengths has important effects on forecast accuracy. A priori, such time-variation may improve forecasts if the underlying relationship is unstable over time. On the other hand, it may introduce another source of estimation error, which could reduce forecast accuracy. The table shows that all forecasts see a reduction in accuracy, averaging 15 per cent. The benchmark forecasts see changes in MSFEs which are very close to the average.

Moving from Final to real-time output gap estimates has no effect on the AR benchmark forecast, but tends to make other forecasts less accurate. While the average effects of this change are smaller than those of changes in lag length, the impact varies much more across models.
Four models see their accuracy improve while three see their MSFE rise by more than 20 per cent. Note that the TF benchmark sees the greatest improvement in accuracy. Evidently, revisions in output growth contain useful information about future inflation.

The net effect of the changes in lag length determination and data vintage worsens forecast accuracy in all but one case. The net effect on the AR benchmark is somewhat less than average, while the TF benchmark improves more than any other model.

The results above suggest that some output gap models forecast inflation more accurately than an autoregressive model, even when using real-time output gap estimates. However, none of the output gap models we examine forecasts inflation as well as simple models which use both past inflation and output growth. Further, the relative performance of different models is greatly affected by the use of real-time rather than ex post output gap estimates. Finally, uncertainty about the lag structure also adds considerably to MSFEs.

4.3 The Robustness of Changes in Forecast Accuracy

We now investigate the robustness of the results presented in Table 2. Table 4 examines the effects of changing the period over which forecasts are evaluated. The full 1969-2002 sample is split into two roughly equal halves, with the 1969-1983 portion characterized by relatively high and volatile inflation, whereas prices were more stable over the 1984-2002 period. The greater volatility of inflation in the former period implies that least-squares methods applied to the full sample tend to emphasize the fit of the model over the former period. Perhaps as a consequence, the full-sample results presented in Table 2 largely reflect forecast performance over the first half of the sample. Results for the low-inflation period after 1983 may be a more relevant guide for contemporary decision-making, but they differ from the full-sample results in several ways.
First, looking at forecasts with final output gaps, we see that the AR benchmark has become harder to beat. Nine of the 12 models see their relative improvement in MSFE decline, and only five can reject the null of equal forecast accuracy at the 5 per cent level (compared to 11 in the earlier portion of the sample). This decline in the predictability of inflation has been noted previously in other studies, for example, Atkeson and Ohanian (2001), and Fisher, Liu and Zhou (2002). The picture for the TF benchmark is less clear; while the relative performance of the output gap models improves somewhat in the latter sample, there is little evidence of significantly different forecast accuracy.

Second, looking at forecasts with real-time output gaps, it appears that it has become increasingly difficult to forecast as well as the benchmarks. Of the 12 models, 11 (10) have larger MSFEs than the AR (TF) benchmark in the post-1983 period. The Band-Pass filter is the only model to forecast inflation better than either benchmark in the recent period, giving over a 20 per cent reduction in MSFE. It is also interesting to note that, consistent with the reported decline in the predictability of inflation, the AR benchmark now forecasts slightly better in real time than the TF benchmark.

One possible explanation for the difference in results across the two sample periods is parameter instability, a feature which has been noted by other research on inflation forecasts, in particular, Stock and Watson (1996, 1999), and Clark and McCracken (2003). Indeed, examination of changes in the period over which the forecasting model is estimated suggested some evidence of such instability for some of our output gap forecasting models.

We also considered the effects of changing the forecasting horizons, forecasting changes rather than levels of inflation, using different lag selection criteria, and using nominal rather than real income growth as a benchmark.
(Detailed results are available from the authors upon request.) Based on a review of these findings, it appears that the results shown in Table 2 are among the best that can be obtained for inflation forecasts from simple linear forecasting models using output gaps.

Having considered this evidence, one might also ask which of the output gap models examined here a practitioner should use to forecast inflation (if forced to do so). It would appear that the deterministic trend models (Linear, Quadratic and Breaking) were often among the worst-performing in real time, and should probably be avoided for that reason. UC models which estimated Phillips curves (Kuttner and Gerlach-Smets) had some of the largest differences in performance when used with real-time rather than final estimates. The Band-Pass and the Beveridge-Nelson methods perform better in our simulated real-time experiments. However, their success appeared to be sensitive to the forecast horizon used. Rather than rely on any of these output gap models, our analysis suggests that a practitioner could do well by simply taking into account the information contained in real output growth without attempting to measure the level of the output gap, that is, by using the TF model. This model was consistently among the best performers, particularly over the post-1983 forecast sample.

5 Conclusion

Forecasting inflation is a difficult but essential task for the successful implementation of monetary policy. The hypothesis that a stable predictive relationship between inflation and the output gap (a Phillips curve) is present in the data suggests that output gap measures could be useful for forecasting inflation. This has served as the basis for empirical formulations of countercyclical monetary policy in many models. We find that many alternative measures of the output gap appear to be quite useful for forecasting inflation, on the basis of ex post analysis.
That is, a historical Phillips curve is suggested by the data, and final (constructed ex post) estimates of the output gap are useful for understanding subsequent movements in inflation. However, this historical usefulness does not imply a similar operational usefulness. Our simulated real-time forecasting experiment suggests, instead, that the predictive ability of many different output gap measures may be illusory. Output gaps typically cannot forecast inflation as well out of sample as simple linear models of inflation and output growth (although the differences are mostly not statistically significant). This is particularly true if we restrict our attention to the post-1983 period.

These rather pessimistic findings regarding the output gap mirror earlier investigations regarding the predictive power for forecasting inflation of “unemployment gaps,” that is, the difference between the rate of unemployment and estimates of the NAIRU. As demonstrated by Staiger, Stock and Watson (1997a,b) and Stock and Watson (1999), estimates of the NAIRU are inherently unreliable, and simulated out-of-sample forecasting exercises do not indicate a robust improvement in inflation forecasts from using information about unemployment. Stock and Watson (1999) also show that better inflation forecasts may be obtained by indicators other than the unemployment gap. Our analysis suggests similar conclusions regarding the output gap as well. Instead of using output gaps, forecasts of inflation which simply incorporate information from the growth rate of output appear to forecast inflation as well or better.

Finally, we note that these negative findings regarding the usefulness of real-time measures of the output gap do not necessarily invalidate the potential usefulness of the theoretical Phillips curve framework per se, nor that of ex post constructed output gaps for historical analysis.
That said, the dubious contribution of real-time measures of the output gap for forecasting inflation brings into question their role in the formulation of reliable real-time policy analysis.

References

Atkeson, Andrew and Lee E. Ohanian, “Are Phillips Curves Useful for Forecasting Inflation?” Federal Reserve Bank of Minneapolis Quarterly Review, 25(1), 2-11, Winter 2001.
Baxter, Marianne and Robert G. King, “Measuring Business Cycles: Approximate Band-Pass Filters for Economic Time Series,” The Review of Economics and Statistics, 81(4), November 1999.
Beveridge, Stephen and Charles R. Nelson, “A New Approach to Decomposition of Economic Time Series into Permanent and Transitory Components with Particular Attention to Measurement of the ‘Business Cycle’,” Journal of Monetary Economics, 7, 151-174, 1981.
Blanchard, Olivier and Danny Quah, “The Dynamic Effects of Aggregate Demand and Supply Disturbances,” American Economic Review, 79(4), 655-673, September 1989.
Bryant, Ralph C., Peter Hooper and Catherine Mann, eds., Evaluating Policy Regimes: New Research in Empirical Macroeconomics, Brookings: Washington DC, 1993.
Cayen, Jean-Philippe and Simon van Norden, “Fiabilité des estimations de l’écart de production au Canada,” Bank of Canada working paper 2002-10.
Clark, Peter K., “The Cyclical Component of U.S. Economic Activity,” Quarterly Journal of Economics, 102(4), 797-814, 1987.
Clark, Todd E. and Michael W. McCracken, “Tests of Equal Forecast Accuracy and Encompassing for Nested Models,” Journal of Econometrics, 105, 85-110, 2001.
Clark, Todd E. and Michael W. McCracken, “Evaluating Long-Horizon Forecasts,” Federal Reserve Bank of Kansas City mimeo, 2002.
Clark, Todd E. and Michael W. McCracken, “The Predictive Content of the Output Gap for Inflation: Resolving In-Sample and Out-of-Sample Evidence,” Federal Reserve Bank of Kansas City mimeo, 2003.
Croushore, Dean and Tom Stark, “A Real-Time Data Set for Macroeconomists,” Journal of Econometrics, 105, 111-130, November 2001.
Diebold, Francis X. and Roberto S. Mariano, “Comparing Predictive Accuracy,” Journal of Business and Economic Statistics, 13, 253-265, 1995.
Fisher, Jonas D. M., Chin Te Liu, and Ruilin Zhou, “When Can We Forecast Inflation?” Federal Reserve Bank of Chicago Economic Perspectives, 1Q/2002, 30-42, 2002.
Gerlach, Stefan and Frank Smets, “Output Gaps and Inflation: Unobservable-Components Estimates for the G-7 Countries,” Bank for International Settlements mimeo, Basel, 1997.
Harvey, Andrew C., “Trends and Cycles in Macroeconomic Time Series,” Journal of Business and Economic Statistics, 3, 216-227, 1985.
Hodrick, Robert and Ed Prescott, “Post-war Business Cycles: An Empirical Investigation,” Journal of Money, Credit, and Banking, 29, 1-16, 1997.
Koenig, Evan F., Sheila Dolmas and Jeremy Piger, “The Use and Abuse of ‘Real-Time’ Data in Economic Forecasting,” Review of Economics and Statistics, 85(3), 618-628, August 2003.
Kuttner, Kenneth N., “Estimating Potential Output as a Latent Variable,” Journal of Business and Economic Statistics, 12(3), 361-68, 1994.
McCracken, Michael W., “Asymptotics for Out-of-Sample Causality,” University of Missouri mimeo, 2000.
Newey, Whitney K. and Kenneth D. West, “A Simple, Positive Semi-definite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix,” Econometrica, 55(3), 703-08, May 1987.
Orphanides, Athanasios and Simon van Norden, “The Reliability of Output Gap Estimates in Real Time,” Finance and Economics Discussion Series 1999-38, August 1999.
Orphanides, Athanasios and Simon van Norden, “The Unreliability of Output Gap Estimates in Real Time,” Review of Economics and Statistics, 84(4), 569-583, November 2002.
Orphanides, Athanasios and Simon van Norden, “The Reliability of Inflation Forecasts Based on Output Gap Estimates in Real Time,” CIRANO working paper 2003s-01.
St-Amant, Pierre and Simon van Norden, “Measurement of the Output Gap: A Discussion of Recent Research at the Bank of Canada,” Bank of Canada Technical Report No. 79, 1998.
Staiger, Douglas, James H. Stock, and Mark W. Watson, “How Precise are Estimates of the Natural Rate of Unemployment?” in Romer, Christina and David Romer, eds., Reducing Inflation: Motivation and Strategy, Chicago: University of Chicago Press, 1997a.
Staiger, Douglas, James H. Stock, and Mark W. Watson, “The NAIRU, Unemployment and Monetary Policy,” Journal of Economic Perspectives, 11(1), 33-49, Winter 1997b.
Stock, James H. and Mark W. Watson, “Evidence on Structural Instability in Macroeconomic Time Series Relations,” Journal of Business and Economic Statistics, 14(1), 11-30, January 1996.
Stock, James H. and Mark W. Watson, “Business Cycle Fluctuations in U.S. Macroeconomic Time Series,” NBER Working Paper No. 6528, 1998, 83 p., prepared for The Handbook of Macroeconomics, edited by John B. Taylor and Michael Woodford.
Stock, James H. and Mark W. Watson, “Forecasting Inflation,” Journal of Monetary Economics, 44, 293-335, 1999.
Taylor, John B., Monetary Policy Rules, Chicago: University of Chicago, 1999.
van Norden, Simon, “Why is it so hard to measure the current output gap?” Bank of Canada mimeo, 1995.
West, Kenneth D., “Asymptotic Inference About Predictive Ability,” Econometrica, 64, 1067-84, 1996.

Table 1
Description of Alternative Output Gap Measures and Summary Reliability Statistics

| Method | Data | Details | COR | AR | NSR | OPSIGN |
|---|---|---|---|---|---|---|
| Linear Trend | Univariate | | 0.88 | 0.90 | 1.63 | 0.58 |
| Quadratic Trend | Univariate | | 0.51 | 0.97 | 1.06 | 0.42 |
| Breaking Trend | Univariate | Trend break in 1973Q1, starting in 1977Q1 | 0.77 | 0.87 | 0.81 | 0.28 |
| Hodrick-Prescott | Univariate | With λ = 1600 | 0.52 | 0.93 | 1.05 | 0.45 |
| Band-Pass | Univariate | 6–32 quarters, series padded with AR forecasts | 0.72 | 0.77 | 0.77 | 0.36 |
| Beveridge-Nelson | Univariate | Assumes ARIMA(1,1,2) | 0.84 | 0.09 | 0.63 | 0.30 |
| Structural VAR | Trivariate | Imposes long-run restrictions | 0.68 | 0.85 | 0.95 | 0.41 |
| Watson | Univariate | Local Level and AR(2) | 0.88 | 0.87 | 1.50 | 0.55 |
| Harvey-Clark | Univariate | Local Linear Trend and AR(2) | 0.75 | 0.92 | 0.91 | 0.39 |
| Harvey-Jaeger | Univariate | Local Linear Trend and Cycle | 0.56 | 0.86 | 0.92 | 0.50 |
| Kuttner | Bivariate | Watson model and inflation equation | 0.87 | 0.90 | 1.54 | 0.61 |
| Gerlach-Smets | Bivariate | Harvey-Clark model and inflation equation | 0.79 | 0.82 | 1.05 | 0.40 |

Notes: Univariate methods employ only real GNP/GDP data. Bivariate methods also employ CPI inflation. The trivariate method also employs treasury bill data. The last four columns present summary measures of the reliability of real-time estimates of the output gap. All statistics are for the 1969:1–2003:1 period. COR denotes the correlation of the real-time and final estimates of the output gap, AR the first-order serial correlation of the revision (the difference between the final and real-time series), NSR the ratio of the root mean square revision to the standard deviation of the final estimate of the gap, and OPSIGN the frequency with which the real-time and final gap estimates have opposite signs.

Table 2
Relative Improvement in MSFE

| Method | AR | AR p-value | TF | TF p-value |
|---|---|---|---|---|
| Fixed Lags, Final Gaps | | | | |
| Benchmark MSFE | 0.494 | | 0.436 | |
| Linear Trend | 0.302 | 0.009 | 0.148 | 0.164 |
| Quadratic Trend | 0.168 | 0.010 | 0.030 | 0.779 |
| Breaking Trend | 0.106 | 0.034 | −0.024 | 0.778 |
| Hodrick-Prescott | 0.149 | 0.000 | 0.013 | 0.900 |
| Band-Pass | 0.134 | 0.000 | 0.000 | 0.997 |
| Beveridge-Nelson | 0.139 | 0.000 | 0.004 | 0.309 |
| SVAR | 0.047 | 0.121 | −0.077 | 0.474 |
| Watson | 0.319 | 0.001 | 0.163 | 0.060 |
| Harvey-Clark | 0.270 | 0.002 | 0.120 | 0.162 |
| Harvey-Jaeger | 0.109 | 0.001 | −0.022 | 0.811 |
| Kuttner | 0.336 | 0.008 | 0.178 | 0.079 |
| Gerlach-Smets | 0.362 | 0.001 | 0.201 | 0.052 |
| Fixed Lags, Quasi-Final Gaps | | | | |
| Watson | 0.132 | 0.043 | −0.002 | 0.979 |
| Harvey-Clark | 0.070 | 0.068 | −0.056 | 0.374 |
| Harvey-Jaeger | −0.032 | 0.811 | −0.146 | 0.382 |
| Kuttner | 0.248 | 0.030 | 0.100 | 0.250 |
| Gerlach-Smets | 0.091 | 0.070 | −0.038 | 0.414 |
| Variable Lags, Real-time Gaps | | | | |
| Benchmark MSFE | 0.559 | | 0.416 | |
| Linear Trend | 0.045 | | −0.219 | |
| Quadratic Trend | 0.021 | | −0.237 | |
| Breaking Trend | 0.043 | | −0.221 | |
| Hodrick-Prescott | 0.132 | | −0.154 | |
| Band-Pass | 0.283 | | −0.042 | |
| Beveridge-Nelson | 0.211 | | −0.095 | |
| SVAR | −0.093 | | −0.323 | |
| Watson | 0.121 | | −0.163 | |
| Harvey-Clark | 0.147 | | −0.143 | |
| Harvey-Jaeger | 0.080 | | −0.193 | |
| Kuttner | 0.107 | | −0.173 | |
| Gerlach-Smets | 0.099 | | −0.179 | |

Notes: The AR benchmark is a univariate autoregressive forecast of inflation; the TF benchmark forecasts from a linear regression on lagged inflation and real output growth. Mean squared forecast errors (MSFE) for the two benchmark models are shown multiplied by 1000. The remaining figures in the AR and TF columns denote the relative improvements in MSFE for the output gap models, measured as (A − B)/B where A is the MSFE of the benchmark and B is that of the output gap model. The p-values for the AR benchmark are for the null that B ≥ A, based on the statistic in equation (5). The p-values shown for the TF benchmark are for a two-sided test of the null that A = B, based on the statistic in equation (6). See section 3.3 and Appendix B for further discussion of the construction and interpretation of the p-values. The forecast horizon is 4 quarters and forecast performance is evaluated over the period from 1969Q1 to 2002Q1. Forecast equation estimation starts in 1955Q1. Fixed lag lengths are (1,1) while varying lag lengths are reset every quarter using BIC.

Table 3
The Effect of Lag Selection and Data Vintage

| Method | MSFE: FL-FL | MSFE: VL-FL | MSFE: VL-RT | Change in MSFE (percent): FL to VL | Change in MSFE (percent): FL to RT | Change in MSFE (percent): Total |
|---|---|---|---|---|---|---|
| AR benchmark | 0.494 | 0.559 | 0.559 | −13.0 | 0.0 | −13.0 |
| TF benchmark | 0.436 | 0.496 | 0.416 | −13.7 | 16.0 | 4.6 |
| Linear Trend | 0.380 | 0.438 | 0.533 | −15.4 | −21.7 | −40.4 |
| Quadratic Trend | 0.423 | 0.500 | 0.545 | −18.1 | −9.0 | −28.8 |
| Breaking Trend | 0.447 | 0.494 | 0.534 | −10.6 | −8.0 | −19.5 |
| Hodrick-Prescott | 0.430 | 0.556 | 0.492 | −29.2 | 11.5 | −14.4 |
| Band-Pass | 0.436 | 0.502 | 0.434 | −15.2 | 13.5 | 0.4 |
| Beveridge-Nelson | 0.434 | 0.482 | 0.460 | −11.0 | 4.5 | −6.0 |
| SVAR | 0.472 | 0.502 | 0.614 | −6.4 | −22.3 | −30.1 |
| Watson | 0.375 | 0.433 | 0.497 | −15.4 | −14.9 | −32.6 |
| Harvey-Clark | 0.389 | 0.448 | 0.486 | −15.1 | −8.4 | −24.7 |
| Harvey-Jaeger | 0.446 | 0.577 | 0.516 | −29.5 | 10.7 | −15.7 |
| Kuttner | 0.370 | 0.402 | 0.503 | −8.5 | −25.3 | −36.0 |
| Gerlach-Smets | 0.363 | 0.426 | 0.507 | −17.3 | −19.0 | −39.6 |
| Mean | | | | −15.6 | −5.2 | −21.1 |
| Std Dev | | | | 6.7 | 14.5 | 14.4 |

Notes: MSFE denotes the mean squared forecast error (shown multiplied by 1000). FL-FL refers to forecasts using fixed lag lengths and final output gap estimates. VL-FL refers to forecasts using variable lag lengths and final output gap estimates. VL-RT refers to forecasts using variable lag lengths and real-time output gap estimates. FL to VL refers to the change from FL-FL to VL-FL. FL to RT refers to the change from VL-FL to VL-RT. Total refers to the change from FL-FL to VL-RT.
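The arithmetic behind Tables 2 and 3 can be checked with a short sketch. The relative improvement follows the (A − B)/B definition given in the notes to Table 2. The percent change used for Table 3 is an assumption inferred from the reported figures (the change is taken relative to the starting MSFE, so a negative value means a loss of accuracy). The function names are hypothetical, and small discrepancies with the tables arise because the published MSFEs are rounded to three decimals.

```python
def relative_improvement(benchmark_msfe, model_msfe):
    """Relative improvement in MSFE, (A - B) / B, as defined in the Table 2 notes."""
    return (benchmark_msfe - model_msfe) / model_msfe


def pct_change_in_accuracy(msfe_before, msfe_after):
    """Percent change in accuracy between two experiments.

    Assumed convention (inferred from Table 3): 100 * (before - after) / before,
    so a rise in MSFE shows up as a negative number.
    """
    return 100.0 * (msfe_before - msfe_after) / msfe_before


# Linear Trend model, MSFEs (times 1000) taken from Table 3.
fl_fl, vl_fl, vl_rt = 0.380, 0.438, 0.533
tf_benchmark_fl_fl = 0.436

# Improvement over the TF benchmark with fixed lags and final gaps (Table 2 reports 0.148).
print(round(relative_improvement(tf_benchmark_fl_fl, fl_fl), 3))

# Decomposition into lag-selection and data-vintage effects (Table 3 reports -15.4, -21.7, -40.4).
print(round(pct_change_in_accuracy(fl_fl, vl_fl), 1))
print(round(pct_change_in_accuracy(vl_fl, vl_rt), 1))
print(round(pct_change_in_accuracy(fl_fl, vl_rt), 1))
```

Note that the two component changes do not sum to the total, because each is expressed relative to a different starting MSFE.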
Table 4
Relative Improvement in MSFE: Sub-sample Evaluation

| Method | AR (1969Q1–1983Q4) | p-value | TF (1969Q1–1983Q4) | p-value | AR (1984Q1–2002Q1) | p-value | TF (1984Q1–2002Q1) | p-value |
|---|---|---|---|---|---|---|---|---|
| Fixed Lags, Final Gaps | | | | | | | | |
| Benchmark MSFE | 0.863 | | 0.739 | | 0.191 | | 0.187 | |
| Linear Trend | 0.247 | 0.025 | 0.068 | 0.573 | 0.555 | 0.003 | 0.517 | 0.043 |
| Quadratic Trend | 0.194 | 0.014 | 0.023 | 0.838 | 0.079 | 0.139 | 0.054 | 0.859 |
| Breaking Trend | 0.120 | 0.038 | −0.041 | 0.664 | 0.060 | 0.164 | 0.035 | 0.870 |
| Hodrick-Prescott | 0.178 | 0.000 | 0.009 | 0.942 | 0.051 | 0.063 | 0.025 | 0.868 |
| Band-Pass | 0.172 | 0.003 | 0.004 | 0.974 | 0.013 | 0.254 | −0.011 | 0.938 |
| Beveridge-Nelson | 0.174 | 0.000 | 0.005 | 0.202 | 0.025 | 0.086 | 0.000 | 0.953 |
| SVAR | 0.013 | 0.359 | −0.133 | 0.263 | 0.199 | 0.012 | 0.170 | 0.405 |
| Watson | 0.331 | 0.001 | 0.140 | 0.154 | 0.277 | 0.009 | 0.247 | 0.197 |
| Harvey-Clark | 0.320 | 0.002 | 0.131 | 0.152 | 0.113 | 0.070 | 0.086 | 0.674 |
| Harvey-Jaeger | 0.140 | 0.001 | −0.024 | 0.841 | 0.006 | 0.308 | −0.018 | 0.864 |
| Kuttner | 0.317 | 0.024 | 0.128 | 0.278 | 0.411 | 0.020 | 0.377 | 0.042 |
| Gerlach-Smets | 0.432 | 0.001 | 0.226 | 0.048 | 0.154 | 0.028 | 0.126 | 0.519 |
| Fixed Lags, Quasi-Final Gaps | | | | | | | | |
| Watson | 0.091 | 0.117 | −0.065 | 0.422 | 0.311 | 0.010 | 0.280 | 0.024 |
| Harvey-Clark | 0.081 | 0.074 | −0.074 | 0.326 | 0.032 | 0.267 | 0.007 | 0.931 |
| Harvey-Jaeger | 0.252 | 0.002 | 0.072 | 0.595 | −0.474 | 1.000 | −0.487 | 0.198 |
| Kuttner | 0.194 | 0.088 | 0.023 | 0.815 | 0.494 | 0.019 | 0.458 | 0.045 |
| Gerlach-Smets | 0.115 | 0.076 | −0.045 | 0.404 | 0.010 | 0.418 | −0.015 | 0.865 |
| Variable Lags, Real-time Gaps | | | | | | | | |
| Benchmark MSFE | 1.010 | | 0.689 | | 0.191 | | 0.196 | |
| Linear Trend | 0.225 | | −0.165 | | −0.357 | | −0.341 | |
| Quadratic Trend | 0.228 | | −0.163 | | −0.405 | | −0.390 | |
| Breaking Trend | 0.172 | | −0.201 | | −0.289 | | −0.272 | |
| Hodrick-Prescott | 0.508 | | 0.028 | | −0.451 | | −0.438 | |
| Band-Pass | 0.301 | | −0.113 | | 0.215 | | 0.244 | |
| Beveridge-Nelson | 0.288 | | −0.122 | | −0.035 | | −0.011 | |
| SVAR | −0.106 | | −0.391 | | −0.018 | | 0.006 | |
| Watson | 0.209 | | −0.176 | | −0.144 | | −0.123 | |
| Harvey-Clark | 0.205 | | −0.179 | | −0.046 | | −0.023 | |
| Harvey-Jaeger | 0.445 | | −0.015 | | −0.480 | | −0.468 | |
| Kuttner | 0.205 | | −0.179 | | −0.177 | | −0.158 | |
| Gerlach-Smets | 0.153 | | −0.214 | | −0.081 | | −0.059 | |

Notes: The AR benchmark is a univariate autoregressive forecast of inflation; the TF benchmark forecasts from a linear regression on lagged inflation and real output growth. Mean squared forecast errors (MSFE) for the two benchmark models are shown multiplied by 1000. The remaining figures in the AR and TF columns denote the relative improvements in MSFE for the output gap models, measured as (A − B)/B where A is the MSFE of the benchmark and B is that of the output gap model. The p-values for the AR benchmark are for the null that B ≥ A, based on the statistic in equation (5). The p-values shown for the TF benchmark are for a two-sided test of the null that A = B, based on the statistic in equation (6). See section 3.3 and Appendix B for further discussion of the construction and interpretation of the p-values. The forecast horizon is 4 quarters and forecast equation estimation starts in 1955Q1. Fixed lag lengths are (1,1) while varying lag lengths are reset every quarter using BIC.

Appendix A: The Construction of Real Time Output Gaps

The output gaps used in this study, as well as the data and programs used to create them, are freely available from the authors. The estimates examined here include all those examined in Orphanides and van Norden (2002) plus the Band-Pass, Beveridge-Nelson, Harvey-Jaeger and SVAR methods described below; this is identical to the list of models considered in Orphanides and van Norden (2003). The range of available estimates was updated so that the “final” data vintage now corresponds to 2003Q3 (i.e. data available as of mid-August 2003, so data series end in 2003Q2) rather than 2000Q1 as in these two earlier papers.

Data for real output were taken from the Real Time Data Archive of the Federal Reserve Bank of Philadelphia in September 2003. Observations span the period from 1947Q1 to 2003Q2. Vintages for output run from Nov. 1965 to August 2003. All CPI data are from the 2003Q3 vintage. The SVAR method also uses data for 3-month US treasury bills. Data for this rate (secondary market) from January 1934 to August 2003 were obtained from the FRED database of the Federal Reserve Bank of St Louis.
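To make the role of data vintages concrete, the following is a minimal sketch of how real-time and final gap estimates could be compared using the four reliability statistics defined in the notes to Table 1. It uses a linear-trend gap only and purely synthetic data; the function names, the synthetic vintages, and the use of the sample standard deviation (ddof=1) in NSR are assumptions made for illustration, not the authors' programs.

```python
import numpy as np


def linear_trend_gap(log_output):
    """Residuals from an OLS fit of log output on a constant and a linear trend."""
    log_output = np.asarray(log_output, dtype=float)
    t = np.arange(len(log_output), dtype=float)
    X = np.column_stack([np.ones_like(t), t])
    beta, *_ = np.linalg.lstsq(X, log_output, rcond=None)
    return log_output - X @ beta


def real_time_gaps(vintages):
    """Gap for the last observation of each vintage: one real-time estimate per vintage."""
    return np.array([linear_trend_gap(v)[-1] for v in vintages])


def reliability_stats(real_time, final):
    """COR, AR, NSR and OPSIGN as described in the Table 1 notes."""
    real_time = np.asarray(real_time, dtype=float)
    final = np.asarray(final, dtype=float)
    revision = final - real_time  # final minus real-time estimate
    return {
        "COR": np.corrcoef(real_time, final)[0, 1],
        "AR": np.corrcoef(revision[1:], revision[:-1])[0, 1],
        # ddof=1 is an assumption; the text does not specify the variance estimator.
        "NSR": np.sqrt(np.mean(revision ** 2)) / np.std(final, ddof=1),
        "OPSIGN": np.mean(np.sign(real_time) != np.sign(final)),
    }


# Synthetic illustration: each vintage is a slightly revised, shorter copy of the data.
rng = np.random.default_rng(0)
T, first = 160, 60
t = np.arange(T)
underlying = 0.008 * t + 0.03 * np.sin(t / 8.0) + np.cumsum(rng.normal(0.0, 0.004, T))
vintages = [underlying[:k] + rng.normal(0.0, 0.002, k) for k in range(first, T + 1)]

rt_gap = real_time_gaps(vintages)
final_gap = linear_trend_gap(vintages[-1])[first - 1:]  # last vintage plays the role of "final"
print(reliability_stats(rt_gap, final_gap))
```

In this sketch, revisions come both from the synthetic data revisions and from re-estimating the trend as the sample lengthens, which mirrors the two sources of revision that distinguish real-time from final estimates.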
All output gap models we consider decompose the logarithm of output into trend and cycle components. The linear trend (LT) and quadratic trend (QT) models are from OLS regressions with linear and quadratic deterministic trends. The breaking trend model is identical to the LT model until 1976Q4. Starting in 1977Q1, it allows for an estimated break in the trend at the end of 1973. The Hodrick-Prescott (HP) method is based on the filter proposed by Hodrick and Prescott (1997) with their recommended smoothing parameter of 1600 for quarterly data. The band-pass (BP) method is based on the Stock and Watson (1998) adaptation of the Baxter and King (1999) approach. Following Stock and Watson (1998), we use a filter 25 observations in width and pad the available observations with forecasts from an AR(4) model. The Beveridge-Nelson (BN) method follows Beveridge and Nelson (1981) in modelling output as an ARIMA(p,1,q) series. Based on results for the full sample, we use an ARIMA(1,1,2), with parameters re-estimated by maximum likelihood methods before each recalculation of the trend.

We examine five unobserved component (UC) models, all of which are estimated by maximum likelihood. Three of the five are univariate models. The Watson (WT) model is based on Watson (1986) and models the output trend as a random walk with drift while the cycle is assumed to follow a stationary AR(2) process. The Harvey-Clark (CL) model follows Harvey (1985) and Clark (1987), replacing the constant drift in the trend of the WT model with a random walk. The Harvey-Jaeger (HJ) model has the same trend as the CL model but replaces the AR(2) component with a stochastic cycle. All three of these univariate models require estimation of five parameters, including variances for the assumed Gaussian shocks. The Kuttner (KT) model appends a Phillips curve, as specified in Kuttner (1994), to the WT model, giving a bivariate model with eight more estimated parameters than its univariate counterpart.
The Gerlach-Smets (GS) model similarly adds the Phillips curve specified in Gerlach and Smets (1997) to the CL model, yielding a bivariate model with six more estimated parameters than its univariate counterpart.

The Structural VAR measure of the output gap (BQ) is based on a VAR identified via restrictions on the long-run effects of the structural shocks, as proposed by Blanchard and Quah (1989). Our implementation is identical to that of Cayen and van Norden (2002), who use a trivariate system including output, CPI and yields on 3-month treasury bills. Lag lengths for the VAR are selected using finite-sample corrected LR tests and a general-to-specific testing approach.

Appendix B: Evaluation of Forecast Performance

As noted in section 3.3, our statistical inference for the forecast performance of the output gap models relative to the AR benchmark model is based on the MSE-F statistic proposed by McCracken (2000). This takes the form

$$\text{MSE-F} = P \cdot \frac{MSFE_1 - MSFE_2}{MSFE_2} \qquad \text{(B.1)}$$

where $P$ is the number of forecasts, $MSFE_1$ is the mean squared forecast error (MSFE) of the restricted model and $MSFE_2$ is the MSFE of the unrestricted model. The distribution of the MSE-F statistic under the null hypothesis of equal MSFE is estimated via a bootstrap experiment. The bootstrap begins by estimating a constrained VAR(12) in $\pi^1_t$, $y^{T,T-1}_t$ in which we impose the restriction that $y$ does not Granger-cause $\pi$. 2000 simulated realizations of this DGP are created by simulating the estimated model with shocks randomly drawn with replacement from the estimated residuals. $\pi^h_{t+h}$ is then constructed as the sum of $h$ consecutive observations of $\pi^1$. For each simulation, the dynamic model is initialized with historical observations starting with $\pi^h_{k+h}$, $\pi^1_{k-i}$, $y^{T,T-1}_{k-i}$ for an independently drawn value of $k$. MSE-F statistics are then calculated for each simulated series and their empirical distribution is used to estimate p-values for the true data’s MSE-F statistics.
Because these distributions are non-pivotal, the distribution of the test statistics is bootstrapped anew for every different choice of (P, h, y, m, n). The p-values for every reported MSE-F are therefore based on independent bootstrap experiments.
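The logic of the bootstrap can be illustrated with a much simplified sketch. It is not the procedure used in the paper: it assumes a one-quarter horizon, a single lag, a restricted VAR(1) estimated equation by equation with OLS (rather than the constrained VAR(12) described above), and far fewer replications. All function names and the synthetic data are hypothetical.

```python
import numpy as np


def ols(X, y):
    """Least-squares coefficients."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


def recursive_msfe(pi, y, use_gap, first_forecast):
    """MSFE of recursive one-step-ahead forecasts of pi[t] from pi[t-1] (and y[t-1])."""
    errors = []
    for t in range(first_forecast, len(pi)):
        cols = [np.ones(t - 1), pi[:t - 1]]
        if use_gap:
            cols.append(y[:t - 1])
        beta = ols(np.column_stack(cols), pi[1:t])  # uses data through t-1 only
        x_new = [1.0, pi[t - 1]] + ([y[t - 1]] if use_gap else [])
        errors.append(pi[t] - np.dot(x_new, beta))
    return np.mean(np.square(errors)), len(errors)


def mse_f(pi, y, first_forecast):
    """MSE-F = P * (MSFE_1 - MSFE_2) / MSFE_2, restricted (1) vs unrestricted (2)."""
    msfe_r, P = recursive_msfe(pi, y, False, first_forecast)
    msfe_u, _ = recursive_msfe(pi, y, True, first_forecast)
    return P * (msfe_r - msfe_u) / msfe_u


def bootstrap_pvalue(pi, y, first_forecast, n_boot=200, seed=0):
    """Bootstrap p-value under the null that y does not Granger-cause pi."""
    rng = np.random.default_rng(seed)
    T = len(pi)
    Xr = np.column_stack([np.ones(T - 1), pi[:-1]])          # pi equation: own lag only
    Xu = np.column_stack([np.ones(T - 1), pi[:-1], y[:-1]])  # y equation: both lags
    b_pi, b_y = ols(Xr, pi[1:]), ols(Xu, y[1:])
    resid = np.column_stack([pi[1:] - Xr @ b_pi, y[1:] - Xu @ b_y])
    observed = mse_f(pi, y, first_forecast)
    exceed = 0
    for _ in range(n_boot):
        draws = resid[rng.integers(0, T - 1, size=T - 1)]  # resample residual pairs
        k = rng.integers(0, T)                             # random historical start
        sim_pi, sim_y = [pi[k]], [y[k]]
        for e_pi, e_y in draws:
            p_prev, y_prev = sim_pi[-1], sim_y[-1]
            sim_pi.append(b_pi[0] + b_pi[1] * p_prev + e_pi)
            sim_y.append(b_y[0] + b_y[1] * p_prev + b_y[2] * y_prev + e_y)
        stat = mse_f(np.array(sim_pi), np.array(sim_y), first_forecast)
        exceed += int(stat >= observed)
    return observed, exceed / n_boot


# Synthetic data in which y does help to predict pi.
rng = np.random.default_rng(1)
T = 120
y, pi = np.zeros(T), np.zeros(T)
for t in range(1, T):
    y[t] = 0.8 * y[t - 1] + rng.normal()
    pi[t] = 0.5 * pi[t - 1] + 0.3 * y[t - 1] + rng.normal()

stat, pval = bootstrap_pvalue(pi, y, first_forecast=40)
print(f"MSE-F = {stat:.2f}, bootstrap p-value = {pval:.3f}")
```

Because the statistic is recomputed on every simulated sample with the same forecast design, the resulting reference distribution depends on that design, which is why the paper bootstraps anew for each choice of (P, h, y, m, n).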