Federal Reserve Bank of Chicago

Causality, Causality, Causality: The View of Education Inputs and Outputs from Economics

Lisa Barrow and Cecilia Elena Rouse

WP 2005-15

Lisa Barrow, Federal Reserve Bank of Chicago
Cecilia Elena Rouse, Princeton University and NBER

November 1, 2005

Prepared for the Consortium for Policy Research in Education, State of Education Policy Research Meeting, February 14-15, 2005. We thank Brian Jacob, Jesse Rothstein and Diane Whitmore Schanzenbach for helpful conversations, Helen Ladd and the editors for insightful comments and Kyung-Hong Park for research assistance. Any errors in fact or interpretation are ours. The opinions in this paper do not reflect those of the Federal Reserve Bank of Chicago or the Federal Reserve System.

I. Introduction

Frustrated with decades of research on education that seemingly amounts to little accumulated knowledge on how to improve student academic outcomes, policymakers and researchers are taking stock of what we do and do not know about the effectiveness of educational inputs.1 As an example, in 2002 the U.S. Department of Education created the “What Works Clearinghouse” (WWC), a database meant to provide educators and policymakers with a “trusted source” of information on what “scientifically-based” education research has to say about what works and does not work in education. The fact that the federal government was willing to spend $18.5 million (U.S. Department of Education 2002) to fund such an enterprise reflects the view of many that ultimately we know little about which inputs matter for student success in education.

Why do we seem to know so little? Many economists would argue it is because research has not emphasized isolating causal relationships between education inputs and student outcomes (Angrist 2004).
Rather, education research has focused on other aspects of the issue, such as differences across settings, which usually have not been the major concern for researchers placing a priority on causality.2 If one believes student outcomes are uniquely tied to the educational setting, then it is fruitless to try to draw general conclusions about the “average effect” of an education input (which, in this view, does not apply to anyone in reality) (Cook 2001).3 However, the WWC as well as many others in the education research field have started to highlight research that emphasizes isolating the causal relationships between education inputs and student outcomes. Some refer to this emphasis as “identification” (i.e., “identifying” the impact of a particular input as distinct from other factors) or “internal validity,” as termed by Campbell and Stanley (1963).

1 We emphasize that “inputs” can be interpreted either narrowly or broadly. District organization (e.g., primary, middle, high school vs. K-8 and high school) can be interpreted as an input, as can the structure of teacher contracts. Similarly, inputs may be defined as class size, textbooks, or computers in the classroom. We take the broader interpretation but will only discuss the evidence regarding a few of the thousands of potential inputs to educational outcomes.

2 Of course all researchers attempt to estimate the causal (or unbiased) effect of an input on educational outcomes. However, all research necessarily demands sacrifice and some researchers will sacrifice causality rather than not estimate differences across settings; others would make the opposite decision. We attempt to draw a distinction between these emphases.

In this paper we discuss methodologies for estimating the causal effect of resources on education outcomes; we also review what we believe to be the best evidence from economics on a few important inputs: spending, class size, teacher quality, the length of the school year, and technology.
In general we conclude that while the number of papers using credible strategies is thin,4 there is certainly evidence that what schools do matters. But many unanswered questions remain.

3 See Cook (2001) for a thoughtful discussion of why education research has by and large rejected randomized experiments.

4 For example, in a recent review of curriculum-based interventions to improve middle school math achievement, the WWC staff found 77 studies. Of these 77 studies, only 10 (studying 5 interventions) were found to have met the WWC standards for evidence, which place an emphasis on causal inference (or internal validity).

II. The Theoretical and Empirical Ideal

A. Economists’ View of Education Resources5

5 Parts of this section are drawn from Rouse (2005).

Economists typically analyze a school’s performance and the effectiveness of its inputs using an “education production function.” The school produces education using inputs and a production technology. One can then measure the effect of particular inputs on the output (education), usually for each student.
Specifically, one can think of a production function as:

\[ E_{ist} = f(NS_{it}, R_{ist}, X_{ist}, \varepsilon_{ist}) \tag{1} \]

where Eist represents the output for student i in school s in year t; NSit represents non-school inputs into student i’s educational attainment, such as her natural “ability,” the extracurricular inputs provided by her parents (e.g., music lessons, extra tutoring in subjects), parental inputs (e.g., reading to their children, doing “educationally-rich” activities at home), and her educational history (that is, her achievement level in 4th grade is not only a function of her current school, but also of her schooling in kindergarten, 1st, 2nd, and 3rd grade);6 Rist represents the resources under the control of school s in year t (e.g., class sizes, quality of teaching staff, and the curriculum); Xist represents the school inputs that are not typically under the control of public schools (e.g., the quality of a student’s “peers”); and εist is an error term that represents all of the other “stuff” that is not otherwise represented (e.g., measurement error).7

6 Why do we categorize a student’s educational history as a “non-school” input? Because we are attempting to distinguish between contemporaneous inputs under the control of the current school and inputs over which the current school has no control. Obviously, a school cannot change what happened to a student in the past.

7 Other researchers in the social sciences often represent this educational production function using hierarchical linear models (HLM) in which they explicitly account for more of the organizational structure, i.e., schools are made up of classes which are made up of individual students (Bryk and Raudenbush 1992). We address issues of identifying causal relationships in the framework commonly used in economics rather than in HLM; however, the issues of identification discussed below are also relevant to identification in the HLM framework.
While HLM models allow for more nuanced and structured modeling of the parameters and error term, whether the coefficient estimates are unbiased continues to rest on whether the covariates included in the regression fully account for all confounding factors that might affect student achievement and are correlated with the school input in question. In this regard, HLM is similar to OLS estimation discussed below.

The function, f, represents the “production function” or the educational practices that transform the inputs into what a student actually learns. The formulation of the education production function depicted in equation (1) also highlights some of the issues that complicate the design of methodologies for estimating the effectiveness of specific school resources: few of the measures that one would ideally include are observable. Take, for example, educational output, Eist. We rely on our schools to help children learn academic subjects as well as to help them become fully functioning, happy adults by teaching democratic values, responsibility, cooperation, consideration, and other aspects of working well with others.8 As such, typical outcomes (such as test scores or labor market wages) clearly reflect only part of what we expect from schools. Further, standardized tests do not fully reflect the academic achievement of students. They typically focus on only a few subjects, and in order to keep the testing affordable and not-too-intrusive, are relatively short and mostly rely on multiple choice questions (which are less costly to score).
Thus, tests generally provide incomplete and noisy measures of educational output, which makes it much harder to detect the effectiveness of inputs.9

8 That said, in this paper we often refer to the educational output as “student achievement” for ease of exposition.

9 More formally, a noisy measure of educational output (the dependent variable in a regression model) will increase the residual variance, which will increase the size of the standard errors. As such, one will be less likely to reject the null hypothesis that an input has no effect on student outcomes.

Because of non-school factors and other inputs beyond the school’s control, one cannot easily generate a causal estimate of the effect of school quality on outcomes.10 For example, the test scores of more disadvantaged students (in School A) will likely be lower than the test scores of more advantaged students (in School B). If the quality of school resources in each school is correlated with the socioeconomic status of the students, it will be difficult to disentangle the role of school inputs from the influences of non-school factors. Suppose School B has more qualified teachers and more computers in the classroom than does School A. To study the effect of teacher quality and computers on school outputs, one must develop an analytical strategy that adequately controls for the non-school factors. In many cases we suspect the school serving more advantaged students will also have higher quality school inputs. Since (in this example) more computers and more qualified teachers are positively correlated with student family background, education production function estimates will overstate the effectiveness of school resources if one does not adequately control for family background.11

10 In this regard, while economists refer to equation (1) as a production function, in many respects it is not. In order to truly recover the parameters of a production function one would need to hold everything else constant. Thus, if one were studying the effect of lowering class size on a student’s achievement, one would require that all other educational inputs, such as teaching styles, curriculum, extracurricular activities, and non-school factors remain constant. Because there are no data that allow the researcher to control for all such inputs, the literature typically asks a more general question: what is the effect of an exogenous decrease in, say, class size, on student achievement, not requiring that all other inputs remain constant? In this example, teachers may change their teaching style in response to a smaller class size or parents may ease up on complementary educational activities such as tutoring (believing their child is receiving a higher quality education while in school). This more systemic response is what one might expect from an exogenous change in educational policy. See Todd and Wolpin (2003) for a more in-depth discussion of this conceptual issue.

11 In some cases the bias may be negative. For example, special needs and English language learner classes tend to be much smaller than classes for regular or gifted students (Boozer and Rouse 2001). In this case, one may erroneously conclude that smaller class sizes lead to worse student outcomes.

Why is identification so important? From a policy perspective, if one implements a program based on estimates of school effectiveness that are overstated (understated), then the benefits to society will be smaller (larger) than anticipated by the research. If the misstatement (bias) is small, this is not a big problem; however, in many cases the bias could be quite large, leading to no societal benefits or, worse, adverse outcomes. (Or, in the case of understatement, a potentially beneficial program may not be adopted.)

B. 
The Ideal Way to Measure the Impact of Schooling Inputs

To identify the causal impact of school resources, ideally one would begin with a group of students and educate them during the year with the first educational input in question (or the status quo). At the end of the year one would assess the students or administer an appropriate test, the results of which would perfectly reflect what the students know.12 Next, one would take the same group of students and revert them to their initial conditions at the beginning of the first year. That is, they would be the same age, have the same living conditions, etc. This second year, one would then educate the students with the second educational input in question. (For example, in the first year the input might be teachers with regular teaching credentials and in the second it might be teachers who have gone through an alternative certification program.) At the end of the year, one would again assess what each student knows.13 The difference between the students’ outcomes using the first input and those using the second would isolate the value of the second input relative to the first.

12 Note that while we describe the ideal outcome as a test, in theory one could use any other outcome (such as adult wages or voting behavior).

13 In the ideal methodology one need not administer a test at the beginning of each school year because the students (and all of their characteristics) are identical in both years. If one were to administer a test at the beginning of the year, the difference between what the students know at the beginning of the year and the end of the year could (mostly) be attributed to the input used that year since the students are the same in each testing period. This would constitute the input’s “value-added” in each year.

Why is this the ideal design for measuring the value of an input? First, because the same
students are educated using each educational input starting from the same initial conditions, one has guaranteed that all background characteristics of the students are the same, including their prior exposure to high and low quality schooling, their family situation, and their innate ability. In other words, one has effectively controlled for Xist and NSit. Second, because the assessments perfectly reflect what students know, there is no measurement error. The combination of these two features means that one can isolate the relative influence of the educational inputs. The key is that by observing how students fare under both regimes one has a “counterfactual” outcome against which to compare the outcome using the main input of interest. Namely, we observe how much the students learn using the first input as well as how much they would have learned had we instead used the second input. As we will discuss, it turns out that establishing a credible counterfactual outcome is among the most difficult tasks faced by the analyst.

Clearly, the ideal evaluation is impossible to implement. No one can turn back time to assess the students under the exact same conditions in each year. In addition, there has not been an assessment devised that perfectly reflects what students know. Rather, existing tests reflect only a part of what students know, and there are permanent confounding factors (such as different test-taking abilities) and random confounding factors (such as some students not feeling well on the day of the test or not getting enough sleep before the test). The analyst’s task is, nevertheless, to implement a methodology that comes as close to the ideal approach as possible.

One must also consider how generalizable the outcome of any evaluation would be (that is, whether the evaluation has “external validity”) (Campbell and Stanley 1963).
If we started with a “representative” group of students, then the ideal evaluation would uncover the average relative effectiveness of the input even if the effectiveness of the input varies over the population. If it is the case that the first or second input is more effective for some students than others, the students in our study must be representative of the population of students in order to assess the effectiveness of each input on average. Consider the extreme example in which the first input is effective for teaching girls but not effective for teaching boys, and the second input is as effective for teaching boys as the first input is for teaching girls but not at all effective for teaching girls. Further, assume that 50 percent of the student population is female and 50 percent is male. In this extreme example, the two inputs are equally effective on average. However, if the students in the study were disproportionately female, we would incorrectly conclude that, on average, the first input is more effective than the second input. Thus, only by starting with a group of students who are representative of the population (in general, or the population of interest for the policy) can one guarantee that the exercise will uncover the true average treatment effect of the input.

Of course, in this extreme but simple example with an ideal set up in which one knows all of the characteristics of the students, one could estimate the effectiveness of the inputs for different subgroups of the population and discover, in fact, that the first input was more effective for girls and the second input was more effective for boys. This issue of “heterogeneous treatment effects” is part of the reason many in education cast a jaundiced eye toward randomized experiments (and many other quantitative methodologies).
However, the average effect (even if there are different effects for different subpopulations or in different settings) is important for setting policy, especially if it is difficult for policymakers to target a policy narrowly or effectively. As such, the first-order question is whether an intervention works in general or for very broad categories of schools or students. Among the important subsequent questions is whether it is more effective for some groups or situations than for others.

In a related manner, one characteristic of many methodologies that emphasize causality is that the researcher does not delve into the intricacies of why the intervention may have mattered (or not mattered). For example, in the studies of class size reduction using the randomized Project STAR data from Tennessee, researchers have not identified why class size reduction mattered. Was it because there were fewer students in the class per se (i.e., a peer effect story) or because the teachers changed their teaching styles? It is important to note that this is not an inherent limitation of randomization (or other methodologies which emphasize causality). Rather it follows from the researcher placing a greater emphasis on causality such that he or she will not attempt to address issues of which subcomponents of an intervention might have mattered, unless there was randomization along those dimensions as well. That is, unless the researcher randomly assigned teachers to teaching styles in addition to differing class sizes, he or she will worry that teachers who adopted certain styles may be different from those who did not in ways that are not observable. Using survey data (or other observational data) and applying ordinary least squares regression will not solve this problem (unless, of course, they contain all of the relevant background variables).

In the next two sections we review methodologies researchers have used in their quest to study the effectiveness of school inputs.
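The girls-and-boys example from Section II.B can be checked with a few lines of arithmetic. The sketch below is purely illustrative: the effect sizes (1.0 and 0.0), the 80 percent share, and the function name are assumptions chosen to mirror the extreme example, not figures from any study.

```python
# Hypothetical effect sizes for the extreme example in the text:
# input 1 helps girls but not boys; input 2 helps boys but not girls.
EFFECTS = {
    "input_1": {"girls": 1.0, "boys": 0.0},
    "input_2": {"girls": 0.0, "boys": 1.0},
}


def average_effect(input_name, share_girls):
    """Average effect of an input in a group with the given share of girls."""
    effects = EFFECTS[input_name]
    return share_girls * effects["girls"] + (1.0 - share_girls) * effects["boys"]


# Representative sample (50 percent girls): the two inputs look equally effective.
representative_gap = average_effect("input_1", 0.5) - average_effect("input_2", 0.5)

# Sample that is 80 percent girls: input 1 wrongly looks better on average.
skewed_gap = average_effect("input_1", 0.8) - average_effect("input_2", 0.8)

print(representative_gap)
print(skewed_gap)
```

With a representative sample the gap between the two inputs is zero; with the disproportionately female sample the first input appears more effective, even though neither input has changed.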
III. Methods Using Observational Data

Observational data are gathered from observing existing situations in schools. That is, they contain the existing input levels in schools (e.g., class sizes and teacher qualifications) as well as information on students attending the schools. These are “observational” because there is no attempt on the part of the analyst to manipulate the situation generating the data. Most of the literature on the effect of educational inputs on student outcomes relies on observational data, typically because they are most readily available. However, the fundamental problem with observational data is that individuals and schools choose their situations, such that one must control for all factors that led the individual or school to their choice that might also be correlated with the outcome of interest. Each of the approaches discussed below attempts to address this fundamental problem in a different way.

A. Ordinary Least Squares Regression

Traditionally, researchers have used ordinary least squares regression (OLS) to study the impact of school resources on outcomes (e.g., Coleman et al. 1966). In the cross-sectional case, the analyst relies on a specification such as

\[ E_{ist} = \alpha + \beta R_{ist} + \lambda NS_{it} + \delta X_{ist} + \varepsilon_{ist} \tag{2} \]

where the variables are the same as those in equation (1) and α, β, λ, and δ are parameters to be estimated. If all of the other factors (NSit and Xist) are observed in the data set such that one can include them in the regression (thereby holding them constant), then one can generate an unbiased (causal) estimate of β. However, we know of no data that contain all of the other factors for which one must control. Rather, most cross-sectional data contain only limited information on important non-school and school factors. For example, in school administrative data one rarely, if ever, has an accurate measure of family income.
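A small simulation may help make the omitted-variable problem concrete. The sketch below is not taken from any study: the data-generating process, the coefficients (a true resource effect of 1.0, a family-background effect of 2.0), and all variable names are assumptions. It compares an OLS regression that omits family background with one that controls for it, using the Frisch-Waugh-Lovell partialling-out result to avoid any matrix library.

```python
import random


def slope(x, y):
    """OLS slope from regressing y on x (with an intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    return cov_xy / var_x


rng = random.Random(0)
n = 20000
TRUE_BETA = 1.0  # assumed causal effect of the school resource

# Family background raises both the school resource and the outcome.
family = [rng.gauss(0, 1) for _ in range(n)]
resource = [0.8 * f + rng.gauss(0, 1) for f in family]
outcome = [TRUE_BETA * r + 2.0 * f + rng.gauss(0, 1)
           for r, f in zip(resource, family)]

# Short regression: family background is omitted, so beta is overstated.
beta_short = slope(resource, outcome)

# Long regression via Frisch-Waugh-Lovell: strip from the resource the part
# explained by family background, then regress the outcome on what is left.
gamma = slope(family, resource)
mean_f = sum(family) / n
resource_resid = [r - gamma * (f - mean_f) for r, f in zip(resource, family)]
beta_long = slope(resource_resid, outcome)

print(round(beta_short, 2), round(beta_long, 2))
```

In this setup the short regression attributes part of the family-background effect to the school resource, while the regression that controls for family background recovers a value close to the assumed true effect. With real data the difficulty is that the relevant background variables are rarely all observed.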
Because family income is rarely measured accurately, researchers instead control for whether the student was eligible for the National School Lunch Program, a proxy for income that is very crude at best.

Today many researchers believe that cross-sectional OLS estimates are likely inaccurate (i.e., they are statistically biased). As a result, they turn to data that follow a student over time (longitudinal data). With longitudinal data one can control for observed and unobserved student characteristics, particularly those that do not change over time. In addition, because these data have information on students over multiple years, one comes closer to the comprehensive data required in equation (1). Because of accountability requirements in the No Child Left Behind Act of 2001, many states are beginning to collect such data on individual students statewide. This advance in data collection will be an invaluable resource for education researchers going forward.14

14 Texas, North Carolina, and Florida already have fairly rich databases and have provided researchers with access to them. That said, we know of no administrative data that contain all of the information that would be relevant for replicating the ideal research design.

Since these data have not been readily available, one approach that researchers have used is known as a “value-added” specification (e.g., Summers and Wolfe 1977). These equations take the form:

\[ E_{ist} = \alpha' + \beta' R_{ist} + \lambda' NS_{it} + \delta' X_{ist} + \theta E_{ist-1} + \varepsilon'_{ist} \tag{3} \]

where Eist-1 is the student’s outcome in the previous year. In this case, one estimates the effect of a (concurrent) resource Rist on the change in a student’s outcome. If Eist-1 fully captures the effect of all previous schooling and non-schooling inputs on the student’s achievement, then one can generate an unbiased estimate of β′, the effectiveness of the school input in question. However, it seems unlikely that a noisy measure of a student’s performance in the prior year (as reflected
in test scores) will fully control for all relevant factors.15

15 See Todd and Wolpin (2003) for a more comprehensive discussion of what different empirical models using longitudinal data identify and under what assumptions.

In general, the basic problem with using OLS regression to estimate the effect of school resources on student achievement (using either cross-sectional or longitudinal data) is that one is uncertain whether or not one has controlled for all important factors16 in the regression. As such, much of the latest and most compelling research on the impact of school resources on student achievement has moved away from simple OLS regression.

16 Statistically “important factors” are those that influence the student’s performance on the outcome measure and that are correlated with the resource in question.

C. Regression Discontinuity

Imagine that an educational input is assigned to students based on the value of some measure. For example, suppose that a state imposes a maximum class size of 25 students per teacher. If the number of students exceeds 25 students, then the students are to have a teacher plus a teacher’s aide. If there are more than 40 students then the school must create two classes (each with one teacher). Because of the cutoffs imposed by the law, students in schools with 39 students in, say, the 3rd grade will experience a much larger class size (39 students) than students in schools with 41 students in the 3rd grade who will be educated in class sizes of 20 and 21. The key is that the variation in whether a school has 39 students in the 3rd grade or 41 students likely occurs by chance. More specifically, it is unlikely there are other factors that determine whether there are 39 or 41 students in the 3rd grade that also affect student outcomes. As such, one can compare the outcomes of students in schools with 39 students to those of students in schools with 41 students and attribute any difference to the effect of class size.
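The enrollment rule in this example can be written as a small function, which makes the jump at the cutoff easy to see. This is a minimal sketch of the hypothetical state rule described above; the function name is an assumption, and the text does not say what happens at very large enrollments, so the code simply splits any grade above 40 students into two classes.

```python
def assign_classes(enrollment):
    """Return (class_sizes, has_aide) under the hypothetical state rule."""
    if enrollment <= 0:
        raise ValueError("enrollment must be positive")
    if enrollment <= 25:
        return [enrollment], False      # one teacher, no aide
    if enrollment <= 40:
        return [enrollment], True       # one teacher plus a teacher's aide
    # More than 40 students: two classes, split as evenly as possible,
    # each with one teacher.
    smaller = enrollment // 2
    return [smaller, enrollment - smaller], False


for enrollment in (39, 40, 41):
    sizes, has_aide = assign_classes(enrollment)
    print(enrollment, sizes, has_aide)
```

Moving from 39 to 41 enrolled students changes the class a student sits in from 39 students to 20 or 21, even though the two schools differ by only two students in the grade.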
This basic methodology is known as a “regression discontinuity” design (Cook and Campbell 1979), and it has grown in popularity in research on education quality. More generally, this design will work when the input in question (in this example, class size) is at least partly determined by a known discontinuous function17 of an observed characteristic (in this example, 3rd grade enrollment). Because of the discontinuous relationship between the input in question and the observed characteristic, the researcher can control directly for the observed characteristic while still identifying the effect of the input in question on student outcomes, making the strategy much more compelling than typical OLS.18

17 That is, class sizes and enrollment do not simply increase one-for-one forever, but there is a change in the relationship at some point. In this example, class sizes increase one-for-one until there are 40 students in the class and then class sizes abruptly (and discontinuously) decrease to 20 and 21 students.

18 In order for regression discontinuity design methods to provide credible estimates of the effect of educational inputs on student outcomes, the key individuals involved (e.g., parents, principals, teachers, students) must not have control over the exact value of the measure on which eligibility for the input will be based. Thus, these key individuals must not have control over the exact size of the 3rd grade class in our example. If the underlying measure can be manipulated, then one could manipulate the school enrollment to engineer the desired class size. If the desired class size is correlated with other unobserved determinants of student outcomes, such as commitment to education, then the estimate of the effectiveness of class size will be biased. In this example, the methodology seems more credible when applied to public schools (which do not have complete control over their enrollment) than to private schools.

An important disadvantage of regression discontinuity designs is that the range of values over which one gets identifying variation tends to be rather small. (For example, the most compelling comparison in the class size example is between schools with enrollments of 39 vs. 41; one can imagine that schools with 15 rather than 35 3rd-grade students also vary along other dimensions that may or may not be observable.) In addition, if the effect of class size on student achievement in the range for which one can generate unbiased estimates is different from that in other ranges, then the estimated parameters may not generalize.19 However, because regression discontinuity provides a credible way to estimate a parameter with internal validity, it provides an invaluable tool for education research. Further, while it may not appear to be very practical, regression discontinuity is a candidate analytical design whenever there are cutoffs for program participation. In Section V, below, we discuss papers that use this design to study the effectiveness of professional development, smaller class sizes, and summer school and grade retention.

D. Natural Experiments (Instrumental Variables)

“Natural experiments” provide another approach for analyzing observational data in a way that comes closer to the ideal experiment than OLS. In this approach, researchers attempt to locate determinants of schooling inputs that would not be expected to independently alter students’ educational outcomes. Here is the basic idea used in this methodology. Suppose we were interested in studying the effect of financial resources on student outcomes, and we knew of a determinant of financial resources, say, a change in the state education financing formula, that would increase the amount of money allocated to one group of schools.
Suppose further we were certain that this change in the financing formula did not have any direct effect on the students’ outcomes, except through the impact on the schools’ revenues.20 We would then estimate the effect of state aid on outcomes in two steps. In the first step we would estimate the effect of the state aid on school revenue. In the next step we would measure the effect of the change in state aid on students’ outcomes. If we found that the outcomes of the students improved, then we could be sure that increased revenues were the cause of the outcome improvement since we were certain that the change in state aid had no direct effect on outcomes. The ratio of the outcome improvement caused by the change in state aid to the change in the educational input caused by the state aid is a straightforward estimate of the causal effect of financial resources on student achievement. This instrumental variables (IV) estimator uses the “exogenous” event (a change in the state financing formula) as the instrumental variable.21 This is, indeed, the approach taken in the recent paper by Jonathan Guryan (2003) to study the impact of “money” on student outcomes (see Section V, below).

19 For example, when class sizes vary by only 1 or 2 students (which may be the viable range for policy changes) teachers may not change their teaching styles significantly. However, when class sizes change a lot (e.g., 39 students vs. 20 students) then many other educational practices may change, making it difficult to isolate the effect of class size, per se. Or, the effect of class size reduction may matter for, say, classes with over 30 students but not for classes with 20-25 students.

20 Thus, for example, one would need to be careful with state aid formulas that were designed to be redistributive. In this case, schools in poor areas would likely receive more state aid and yet their students would likely perform worse on tests than students in wealthier areas. As such, one would not want to simply use the level of state aid as the outside determinant of resources. However, some changes in the formula may have been driven by factors that are uncorrelated with the characteristics of the districts such that these changes would be valid outside determinants of state aid.

21 Some investigators refer to the fact that schools use varying levels of a particular input as a “natural experiment.” (For example, we have heard researchers propose studying the effect of a whole school reform model by “exploiting the natural variation” arising from the fact that some schools have adopted the model and others have not.) This is not what most economists would refer to as a “natural experiment,” particularly since the method of analysis that follows is simply OLS (in which one relates whether or not a school uses the whole school reform model in question to student outcomes). One is still left with the question as to why some schools adopted the model and others did not, and whether this same (unobserved) factor that led them to adopt it is correlated with student outcomes.

As another example of how this estimation strategy works, consider the recent paper by Angrist and Lavy (2002) that studies the effect of technology on student achievement. Angrist and Lavy note that schools in Israel that received a technology grant were more likely to use computer-assisted instruction (CAI) (again, see Section V, below). The grant program is a suitable instrumental variable so long as one can assume that any difference between schools that received funding through the program and those that did not is only the use of computers in the schools and not other observable characteristics of the schools (e.g., schools with more motivated principals were more likely to apply for the grant program and have better performing schools).
Angrist and Lavy study the correlation of participation in the program with other school environment characteristics (e.g., class size, hours of instruction, non-computer technology) and conclude that the program increased CAI instruction without changing other characteristics of the schools. That said, this highlights the main empirical challenge in IV strategies: the researcher must make the claim that the instrumental variable only affects the outcome through its effect on the educational input (the endogenous variable) in question. As such, the researcher must make assumptions about unobservable factors—which are inherently difficult to prove or disprove.22

22 In this regard, IV shares much in common with OLS. Both strategies rely on assuming that the error term is not correlated with either the input in question (in the case of OLS) or the instrumental variable (in the case of IV), conditional on the (other) observed covariates.

Another disadvantage of IV strategies is that, like regression discontinuity designs, if there is heterogeneity in the effect of an input on student outcomes, the estimated effect may not generalize to other segments of the population. The reason is that IV identifies the effect of an input on student achievement among those students (or schools) that are induced to change their behavior because of the instrumental variable (Imbens and Angrist 1994). Thus, in the case of the technology grant program in Israel, IV identifies an effect of CAI only for those schools that increased their intensity of CAI because of the grant program (there are others that may have received money from the program that would have increased their CAI intensity even without the program). If student achievement is particularly responsive to increases in CAI intensity in these schools, then the IV estimate of the effectiveness of technology will overstate the average effect across all schools.
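The two-step ratio described for the state aid example can be written compactly for the simplest case of a binary instrument. This is only a sketch in notation that does not appear in the paper: Z is assumed to indicate whether a district was affected by the formula change, R is school revenue (the educational input), and Y is the student outcome.

```latex
% Hypothetical notation, not the authors':
%   Z = 1 if the district was affected by the financing formula change, 0 otherwise
%   R = school revenue (the educational input), Y = student outcome
\[
\hat{\beta}_{IV}
  = \frac{\bar{Y}_{Z=1} - \bar{Y}_{Z=0}}{\bar{R}_{Z=1} - \bar{R}_{Z=0}}
\]
% Numerator: change in outcomes associated with the instrument.
% Denominator: change in the input associated with the instrument.
```

Under the assumption that Z affects Y only through R, this ratio estimates the effect of the input for those units whose input level responds to the instrument, which is why it need not equal the average effect across all schools.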
While IV strategies are quite popular, especially among economists, it is extremely difficult to find credible instrumental variables, so this methodology is unlikely to become a mainstay of education research.

IV. Methods Using Experimental Data

Finally we come to what many describe as the “gold standard” in evaluation methodology: randomized designs. In this case, one group of students is randomly assigned to be educated with the input in question (the treatment group) and a second group of students is (randomly) assigned to be educated with the status quo or another input (the control group). One then tests the students at the end of the evaluation. The difference in student outcomes between those in the treatment group and those in the control group represents the effect of the input in question relative to an alternative.

Why is this the ideal design for measuring the impact of school inputs on student outcomes? First, the random assignment of students to an educational input initially controls for all background characteristics of students, including their prior exposure to high and low quality schooling. That is, on average, both groups would have the same distribution of students along observable and unobservable dimensions such that one need not control for Xist and NSit.23 Further, because a random event determined to which group (treatment or control) a student was assigned, one need not be concerned that students who expected to benefit more from an input were those likely to be educated using it (a concern under OLS), and the error term is uncorrelated with the input in question. Randomization makes the identification of causality in experiments more transparent than many other methodologies; thus, experiments are quite compelling.

That said, experiments are not well-suited to answering all questions. First, the more aggregated the unit of observation, the more “cumbersome” a randomized evaluation becomes.
For example, to study the effect of district-wide open enrollment on student outcomes using an experimental design one must randomly assign districts to treatment and control groups. While there are more than 14,000 districts in the U.S. from which to choose, the logistics of getting a sufficient number of districts to cooperate, such that one would have a large enough sample with which to draw conclusions, would be quite daunting.

In addition, in an “ideal” experiment the random assignment process completely determines the status of individuals in the treatment and control groups. In many experimental settings involving human subjects, however, there is slippage between the random assignment status of experimental subjects and whether or not they actually receive the treatment.24 For example, in the Tennessee Project STAR experiment some students randomly assigned to small classes ended up in larger classes and vice versa.

23 While researchers will often also attempt to administer a pre-test at the beginning of a randomized evaluation it is not necessary to do so to generate an unbiased estimate of the effect of the input in question on student outcomes, if the randomization is conducted correctly. The reason is that the characteristics (both observed and unobserved) of the students in treatment and control groups are the same, on average, at the beginning of the evaluation. Hence, controlling for the student’s pre-test will not change the estimate of the effectiveness of the input in question. Controlling for the pre-test can, at times, improve statistical precision (i.e., lower the standard errors).

24 This problem is exacerbated in multi-year studies.
In a randomized experiment, if one simply compares the outcomes of students originally assigned to the treatment group to those originally assigned to the control group (regardless of which treatment the student actually received), one estimates the “intention-to-treat” effect (Rubin 1974; Efron and Feldman 1991). While of interest, the intention-to-treat effect may be unsatisfying for those educators and policymakers who desire an estimate of the effects of a particular input for students who are educated actually using the input—the effect of “treatment on the treated.” The estimated intention-to-treat effect does not establish whether the input in question—properly implemented—is better than an alternative. Note that the difference between the intention-to-treat and the treatment on the treated is “take up” (or implementation). Students randomly assigned to the treatment group may actually choose to use another input (that is not the input in question), and students randomly assigned to the control group may actually use the input in question (i.e., get “treated”).

While the effect of treatment on the treated is important, we believe there are at least two reasons why we should also be interested in intention-to-treat. First, it is the only policy instrument available to policymakers. If the state of Tennessee decides to lower class sizes for all students, all policymakers in Tennessee can do is mandate lower class sizes. If all schools comply with the new law, then one might expect results that follow from the treatment on the treated parameter. In reality, some schools may not be able to meet the new lower class size requirements. If so, any anticipated gains in student achievement will be diluted. Take the extreme case in which no schools are able to reduce class sizes (say, for example, because of lack of building capacity).
In this case, even if student achievement is much higher with smaller class sizes, there will be no achievement gains from the program because the program was not implemented. When considering a policy change, policymakers must consider both halves of the issue: both how the treatment affects the treated and whether the program can and will be implemented as desired.25 Because it combines both halves of the issue, the intention-to-treat estimates reflect the overall potential gains from an educational policy change.

Second, as in many experimental settings, the randomization only occurred in the intention-to-treat, and, as such, this estimate is the only unambiguously unbiased estimate that one can obtain from an OLS regression, assuming the initial selection was truly random. That said, one can estimate the effect of treatment on the treated by using an IV strategy in which one uses the random assignment as an instrumental variable for whether or not the student received the treatment (the input in question). This is a case in which, if the initial selection was truly random, the instrumental variable will not be correlated with the error term in the outcome equation and therefore will be valid.

Although experimental approaches probably come closest to the ideal evaluation design, they do have some analytical shortcomings which are worth highlighting. For example, they tend to be rather “blunt” instruments. One implements an experimental design out of concerns for obtaining an unbiased estimate of the effect of the input on student achievement. However, one can only truly get such an unbiased estimate from the point of random assignment, and unless the experimental design is sufficiently complicated, one can really only answer one question (whether those assigned to the treatment group fared differently from those assigned to the control group).
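The distinction drawn above between intention-to-treat and the IV estimate that uses random assignment as an instrument can be illustrated with a small sketch. Everything here is hypothetical: the function names, the ten-student data set, and the constant treatment effect of 10 points are assumptions chosen for illustration, not values from Project STAR or any study cited in the text.

```python
from statistics import mean

def mean_gap(assigned, values):
    """Mean of `values` for those assigned to treatment minus the mean for controls."""
    treat = [v for z, v in zip(assigned, values) if z == 1]
    control = [v for z, v in zip(assigned, values) if z == 0]
    return mean(treat) - mean(control)

def intention_to_treat(assigned, outcome):
    """Compare outcomes by ORIGINAL random assignment, ignoring actual take-up."""
    return mean_gap(assigned, outcome)

def iv_effect_of_treatment(assigned, received, outcome):
    """Scale the intention-to-treat effect by the difference in take-up rates,
    i.e. use random assignment as an instrument for receiving the treatment."""
    return mean_gap(assigned, outcome) / mean_gap(assigned, received)

# Hypothetical data: one student assigned to treatment does not receive it,
# and one student assigned to control does.
assigned = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
received = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
# Assumed data-generating process: baseline score 50, treatment adds 10 points.
outcome = [50 + 10 * d for d in received]

itt = intention_to_treat(assigned, outcome)
iv = iv_effect_of_treatment(assigned, received, outcome)
print(f"intention-to-treat: {itt:.2f}, IV estimate: {iv:.2f}")
```

In this toy setting the intention-to-treat estimate is diluted by incomplete take-up, while rescaling by the take-up gap recovers the assumed 10-point effect. With heterogeneous effects the rescaled estimate applies only to students whose treatment status was changed by their assignment.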
Cost concerns (and complicated implementation) often preclude comparing the effectiveness of more than two or three different inputs in one study. Note, however, that one can ask whether the effect differs for subgroups identified based on characteristics measured before random assignment takes place, provided sample sizes are large enough. Similarly, it is rare that researchers using experimental designs attempt to determine (at least causally) which dimensions of an intervention may have worked or not worked. Again, the reason is that unless random assignment also occurred along the “subdimensions” any such analysis will not necessarily yield causal estimates of the effectiveness of the subdimensions.

Keeping in mind these many empirical challenges, we now turn to a brief summary of what these methodologies suggest about the importance of schooling inputs on student outcomes.

25 This discussion is analogous to the distinction between “efficacy” (i.e., how well a drug might work in theory) and “effectiveness” (i.e., how well a drug works in practice given the fact that patients may not follow the protocol exactly).

V. Does School Quality Matter?

A. Coleman, Hanushek, and Card and Krueger

The Coleman Report (Coleman et al. 1966) is credited with launching an explosion of studies estimating the relationships between educational outcomes and school inputs. Many papers were written criticizing the methodology used in the Coleman Report including arguments that longitudinal studies or well-designed experiments were needed to make causal inferences (e.g., Sewell 1967). Further, even the Report’s authors note that their cross-sectional analysis does not provide a strong basis for causal interpretation. However, the Report was broadly interpreted to find that schools do not matter; instead, family background and peers explained most of the variation in education outcomes.
By the mid-1980s, Hanushek (1986) includes 147 studies in a survey of the literature relating educational outcomes to school inputs. Ten years later, Hanushek (1996) finds more than double the number of studies to survey. The reviews and conclusions of Hanushek’s analyses reinforced the findings of the Coleman Report, and by the early 1990s, many people were firmly convinced that “money does not matter,” namely, that once family inputs into schooling were taken into account, school resources did not matter. As Hanushek (1997) writes, “Simple resource policies hold little hope for improving student outcomes.” He further concludes, “Three decades of intensive research leave a clear picture that school resource variations are not closely related to variations in student outcomes and, by implication, that aggressive spending programs are unlikely to be good investment programs unless coupled with other fundamental reforms.” (Hanushek 1996)

Although Hanushek’s meta-analyses have been extremely influential, researchers have criticized them along a number of dimensions. Hedges, Laine, and Greenwald (1994) note that many of the studies Hanushek surveys can be faulted for methodological reasons similar to those discussed above. For example, many of the surveyed studies are based on cross-sectional, observational data and do not have longitudinal data on student outcomes or natural experiment features. Further, Hanushek relies on simple “vote counting” in his analysis. Using more sophisticated meta-analytical techniques, Hedges, Laine, and Greenwald conclude that among the studies surveyed in Hanushek (1989), per pupil expenditures, teacher experience, and teacher-pupil ratios are positively related to student outcomes. They also find that the effect sizes for per pupil expenditures are large and educationally important.
Krueger (2003) makes a more basic point in criticizing Hanushek for weighting all estimates equally and thus giving more weight to studies that publish more estimates. Focusing on the class size results included in Hanushek (1996), Krueger uses alternative weighting strategies, including giving equal weight to each study rather than equal weight to each estimate, and finds support for a positive relationship between smaller class sizes and better student outcomes.

Today, while researchers recognize the importance of family background and other nonschool inputs in determining educational outcomes, many have come to question the findings of the Hanushek meta-analyses as well as the validity of many of the individual studies estimating education production functions.

Card and Krueger (1992) perhaps marks the turning of the tide on the view that schools do not matter. Instead of focusing on direct education outcomes, Card and Krueger focus on how school quality affects the returns to schooling (i.e., the increase in earnings associated with an additional year of schooling). Assuming that school quality for men working in a given labor market varies exogenously by their state of birth and cohort, Card and Krueger find that men who were educated in states and years with higher quality schools—schools with lower pupil-teacher ratios, longer school years, and higher relative teacher pay—earn more for an additional year of education than men educated in states and years with lower quality schools. These results are consistent with earlier work finding positive relationships between school quality and earnings (e.g., Johnson and Stafford 1973; Rizzuto and Wachtel 1980) and work that attributes much of the closing of the black-white wage gap to improvements in school quality for African American students (e.g., Smith and Welch 1989).
While some researchers challenged the assumptions used in Card and Krueger (1992), others began to consider that school resources may affect students’ earnings after leaving school without having measurable effects on academic achievement while in school (Burtless 1996).

C. Recent Studies

Since Card and Krueger (1992), there have been many new papers examining the effects of school inputs on student achievement, several of which use estimation strategies aimed at identifying the causal relationships between school inputs and student outcomes. In this section we review a few of the best studies in economics assessing school spending, class size, teacher quality, time in school, and technology.

1. Spending

Because some students may be more expensive to educate than others and schools and districts differ in the types of students they serve, simply looking at the relationship between average student test scores and per pupil spending may indicate that greater school spending is associated with lower student achievement. As a result, researchers rely on alternative strategies for identifying the causal relationship between spending and student outcomes. Barrow and Rouse (2004) and Guryan (2003) use changes in state school financing aid formulas as instrumental variables to isolate plausibly exogenous changes in school spending.

Barrow and Rouse (2004) examine the general question of whether spending on schools is valued by the “market” by looking at the effects of increased school spending on local property values. Indeed, the authors find that school spending is valued, on average, since they estimate that property values increased by the expected amount in school districts that received an extra $1 per pupil in state school financing. If potential residents did not value the additional spending because school districts were viewed as spending excessively or wastefully, additional state aid should not have resulted in such large increases in property values.
Guryan (2003) looks more specifically at the relationship between school spending and student achievement in Massachusetts. He finds that additional state aid resulting from a change in the financing formulas led to a significant increase in math, reading, and science test scores for both 4th and 8th grade students. Specifically, he estimates that a $1000 increase in per-pupil spending leads to a one-third to one-half of a standard deviation increase in average test scores.

In sum, Barrow and Rouse (2004) and Guryan (2003) both suggest that money matters when it comes to public schools. Below we look at studies that examine more specifically whether different inputs matter.

2. Class size

Although the effect of class size on student achievement has most often been studied using observational data, Boozer and Rouse (2001) provide a clear demonstration of how estimates of class size effects can be misleading due to the relationship between class size and student ability as well as how school-level measures of pupil-teacher ratios can mask significant within-school variation in actual class size. Thus, one should be suspicious of estimates that do not make use of more sophisticated estimation techniques to uncover the causal relationship between class size and student achievement. Fortunately, class size is one of the education topics that has been studied using a variety of estimation techniques, including regression discontinuity, instrumental variables, and randomized evaluation.

Angrist and Lavy (1999) use a regression-discontinuity estimator to look at the effect of class size on student test scores in Israel. Public schools in Israel have a maximum class size of 40 pupils which generates a non-linear, non-monotonic relationship between grade enrollment and class size. As discussed above, this will generate large differences in class size between grades with enrollment of 39 students and grades with enrollment of 41 students.
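The non-monotonic relationship between grade enrollment and class size that a 40-pupil cap produces can be sketched as follows. The function name is hypothetical, and the rule that a grade is split evenly into the smallest number of classes satisfying the cap is an assumption made for illustration; actual class sizes need not follow it exactly.

```python
def predicted_class_size(enrollment: int, cap: int = 40) -> float:
    """Average class size if a grade is split evenly into the fewest
    classes that keep every class at or below `cap` pupils (assumed rule)."""
    if enrollment <= 0:
        raise ValueError("enrollment must be positive")
    n_classes = -(-enrollment // cap)  # ceiling division using integers only
    return enrollment / n_classes

# Class size rises with enrollment up to the cap, then drops sharply
# when one more pupil forces an additional class to open.
for enrollment in (39, 40, 41, 80, 81):
    print(enrollment, predicted_class_size(enrollment))
```

Grades just below and just above a multiple of the cap have nearly identical enrollment but very different predicted class sizes, and a regression discontinuity design exploits that jump.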
For fourth and fifth grade students, Angrist and Lavy (1999) find that reductions in class size increase test scores by statistically significant and educationally important amounts. They do not find similar effects for third grade students.

Two other papers have used regression discontinuity and/or instrumental variables as well. Hoxby (2000) uses class size minimums and maximums in Connecticut to look at the effect on student test scores of changes in class size driven by movements in enrollment populations that push schools over and under the class size thresholds. She finds mixed results on the relationship between class size and student performance. Boozer and Rouse (2001) use state class size maximums as an instrumental variable for student-level class size in the NELS88 and find that smaller classes improve student achievement.

Perhaps the best known and most convincing evidence on the impact of class size comes from the Tennessee Student/Teacher Achievement Ratio experiment (Project STAR) in which Tennessee kindergarten students were randomly assigned to small classes (13 to 17 students per teacher), regularly-sized classes (22 to 25 students per teacher), or regularly-sized classes with a teacher’s aide (22 to 25 students per teacher). The experiment continued through the third grade and then students were returned to regularly sized classes. Finn and Achilles (1990) and Krueger (1999) find that students in the smaller classes outperformed students in the larger classes on standardized tests. Additionally, in a longer-term follow-up of Project STAR, Krueger and Whitmore (2001) find that students who were randomly assigned to smaller classes were significantly more likely to take a college entrance exam and that this effect was greater for African American students.

At this point, many education researchers and policymakers have been convinced that smaller class sizes can improve student outcomes on average. However, many unanswered questions remain.
For example, we need to know more about the cost of class size reduction relative to other interventions and whether it is cost effective. In addition, California’s experience with class size reduction in the 1990s highlighted that implementation—especially on a large scale—can go awry (Bohrnstedt and Stecher 1999). Importantly, even the evidence from Project STAR suggests that the impact of class size reduction differs across schools and subpopulations of students (for example, Krueger (1999) found the largest effects for African American and low-income students). Clearly we need to know more about the conditions under which reducing class sizes will be most fruitful.

3. Teacher quality

The preponderance of evidence suggests that teachers matter for student outcomes. Hanushek, Rivkin, and Kain (2005) use Texas data on elementary students linked to teachers at the school-grade level in order to estimate the effect of teachers on student learning while Aaronson, Barrow, and Sander (2003) use Chicago Public Schools data on high school students linked to teachers at the classroom level to examine teacher quality. Both studies find large variation in teacher quality as measured by the effect of teachers on student test score gains. Hanushek, Rivkin, and Kain (2005) estimate that a one standard deviation increase in teacher quality at the grade level will increase student test scores by roughly 10 percent of a standard deviation while Aaronson, Barrow, and Sander (2003) find that a one standard deviation improvement in 9th grade math teacher quality for one semester is associated with a gain equal to 10 to 20 percent of the average math test score gain experienced in a typical school year.

When it comes to determining what makes a good teacher, the research is much less clear.
Research by Clotfelter, Ladd, and Vigdor (2004) in North Carolina illustrates the great tendency for the most qualified teachers to teach in schools with the most advantaged students as well as for parents of more advantaged children to get their children into classes with more qualified teachers. This sorting of teachers and students makes it difficult to disentangle the causal effects of various measures of teacher quality. In addition, the characteristics of teachers available in the large administrative data sets are typically limited to those that determine compensation, such as whether or not a teacher has a master’s degree and how many years she has been teaching in the school district.

Researchers have found some evidence that teacher quality improves sharply after one or two years of experience (e.g., Clotfelter, Ladd, and Vigdor 2004; Hanushek, Rivkin, and Kain 2005). However, new teachers exit teaching at fairly high rates, and Aaronson, Barrow, and Sander (2003) find that teachers in the lowest quality decile in one year are 26 percent less likely to be teaching in the next year than teachers in the highest quality decile suggesting that some of the experience results may be driven by selection if only the higher quality teachers stay beyond one or two years. Aaronson, Barrow, and Sander (2003) also find some evidence that undergraduate major may be related to teacher quality, while Clotfelter, Ladd, and Vigdor (2004) find evidence that teachers who score best on licensing tests are indeed higher quality teachers.

Using a regression discontinuity design, Jacob and Lefgren (2004a) take advantage of the nonlinear relationship between school-level student achievement in Chicago Public Schools and the assignment of schools to probationary status in order to examine the relationship between professional development and student achievement.
In 1996, elementary schools in which fewer than 15 percent of students met national norms on a standardized test of reading were placed on probation and given resources (up to $90,000 in the first year) to purchase staff development services. Schools with more than 15 percent of students meeting national norms were not placed on probation and not given the additional resources. Jacob and Lefgren thus assume that whether a school has just fewer than 15 percent of students that met the reading norm or slightly more than 15 percent of students that met the norm is by chance. The authors find that schools on probation primarily spent the additional resources on professional and staff development purchased from a wide variety of external sources including universities, nonprofit organizations, and independent consultants. The authors find that teachers report a 25 percent increase in the frequency of attending professional development programs, and others (e.g., Smylie et al. 2001) have reported a more substantial increase in the quality of the professional development teachers received. Unfortunately, however, the authors find no evidence that the increase in the quantity and quality of professional development induced by schools’ probationary status translated into improved student achievement.

In sum, the best evidence suggests that teachers matter; however, we still have much to learn about how to identify quality teachers when making hiring decisions or how to increase teacher productivity with training or professional development.

4. Time in school

The length of the school year in the U.S. is a frequent target of criticism in discussions of why students in the U.S. score badly on standardized tests relative to other developed countries. Several studies document erosion of students’ skills over the summer vacation (e.g., Cooper et al. 2000), and there is some evidence that summer school can improve student achievement (e.g., Jacob and Lefgren 2004b).
Jacob and Lefgren (2004b) utilize regression discontinuity to look at the effect of summer school and grade retention on student achievement in Chicago. In 1996, Chicago Public Schools instituted a policy of requiring 3rd and 6th grade students to attend summer school if they did not meet minimum test score thresholds. Students were then retained in grade if they did not achieve the minimum test score following summer school. The authors are able to use the discontinuity of the treatment rule in order to assess the benefits of the summer school and grade retention policy on student achievement. Namely, students scoring just below the minimum “passing” test score and students scoring just above the minimum passing test score are assumed to be quite similar except that those scoring just below the threshold are assigned to summer school. Jacob and Lefgren (2004b) find that the net effect of summer school and grade retention was to increase student achievement among 3rd grade students. However, the authors find no similar achievement gains for 6th grade students.26

Pischke (2003) specifically looks at the effect of school-year length by taking advantage of a natural experiment occurring in West Germany in the late 1960s. Adoption of a common fall-start to the school year led students in most states to experience two short school years, equivalent to roughly two-thirds of the standard length school year. In contrast, students in West Berlin and Hamburg attended one long school year. Pischke (2003) finds that the shorter school year increased grade repetition among elementary school students, but that the shorter school year had no effect on the number of students attending the highest secondary school track or subsequent earnings as adults. Thus, there is little evidence of long-lasting negative effects of a shorter school year.
The Pischke (2003) results point to an important difficulty in estimating educational policy effects with observational data even in the presence of a natural experiment. Although we may believe that the natural experiment is valid in that it generated exogenous variation in the length of a school year and that it should only affect student outcomes through the experiment’s effect on the length of a school year, it is quite likely that teachers changed their behavior to compensate for the temporarily shortened school year. Since the behavioral response to a short-term change in the school year may be different from the responses generated by a permanent change in the length of the school year, the results may lack external validity.

26 The authors also use regression discontinuity to look at the effect of grade retention alone using the post summer school test scores. Once again, they find that grade retention is beneficial to 3rd grade student achievement, but has no effect on 6th grade student achievement.

5. Technology

Research on the success of computer-aided instruction (CAI) has yielded mixed evidence at best. Some research using observational data has shown computers can offer highly individualized instruction, allow students to learn at their own paces, enhance assessment, and increase student motivation (e.g., Sandholz et al. 1997; Means and Olson 1995; Lepper 1985). In contrast, other research reports that computers are frequently poorly embraced by teachers, can disrupt classrooms, and fail to increase student achievement in any measurable way (e.g., Cuban 2001; Becker 2000; Angrist and Lavy 2002; Rouse and Krueger 2004). A common critique of the literature is that both student outcomes and what constitutes “computer use” are poorly defined (Cuban 2001).
For example, while Angrist and Lavy (2002) are able to use an instrumental variables estimator to look at the effect of CAI on student test scores, the intensity of computer use is defined by the teacher’s response to a rather vague question about how often they used “computer software or instructional computer programmes” (Angrist and Lavy 2002). The authors find no evidence that greater use of CAI improved student test scores in math or Hebrew.

Borman and Rachuba (2001) and Rouse and Krueger (2004) have the advantages of being able to evaluate the effect of much more specific computer use—the use of a particular instructional software—and to implement random assignment of students to treatment and control groups. Both studies evaluate the popular Fast ForWord (FFW) computerized reading instruction program using random assignment within schools in large urban school districts. The studies’ findings are remarkably similar: both rule out large impacts of computerized instruction with estimated effects that are not statistically different from zero.

While these studies suggest that CAI does not significantly improve student educational outcomes, one might find that different computerized reading programs were successful or that the use of CAI in other subjects significantly raised students’ learning in those subjects. Further, one might find that FFW was effective when used in other settings. Both randomized evaluations of FFW were conducted in schools, but it may be that schools are not the best environment in which to implement the program (FFW is also often used by psychologists and reading specialists in private practice). While the schools and teachers in the studies did their best to engage students and keep them on task, the many disruptions that occur during the semester may have compromised students’ ability to benefit from the program; the same students may have benefitted from the program in a different setting.
Currently, however, there is very little evidence that CAI is effective in schools.

VI. Conclusion

Educators and policy makers are increasingly intent on using scientifically-based evidence when making decisions about education policy. Thus, education research today must necessarily be focused on identifying the causal relationships between education inputs and student outcomes. The good news is that the body of credible research on causal relationships is growing, and we have started to gather evidence that some school inputs matter while others do not. As this body of knowledge grows, we can also get inside the “black box” of the inputs that work. Once we understand that an input improves student outcomes, on average, we can look at the next set of questions: Do all students benefit from a particular input? Who benefits most from a particular input? Which aspects of multidimensional programs are most beneficial? (A challenge will be to develop studies that also generate causal estimates of this next generation of questions!)

As we develop a knowledge base regarding what works in education, we will also need a better understanding about how to implement appropriate policies using that knowledge. In addition, policymakers need information with which to assess the tradeoffs between different inputs to make sensible decisions. For example, Jacob and Lefgren (2004b) find a small, but statistically significant, positive effect of summer school and grade retention on student reading skills at a cost of about $750 per student.27 This cost per student may be compared to other interventions, such as class size reduction, that have larger effects (more than three times as large) on student reading skills but also cost more than $2000 per student (Krueger 1999). As a result, implementing summer school and grade retention may be more cost-effective for some school districts than reducing class sizes. This conclusion could not be based on estimates of the effectiveness of grade retention and summer school alone. Clearly, more information on such tradeoffs in educational practice is critical.

Policymakers must also understand that it is much more difficult to credibly evaluate the effectiveness of school policies after the fact. Rather, if research and evaluation are part of a new policy from the beginning, then researchers can collect the necessary data (which are often difficult, if not impossible, to collect after the policy has been implemented). Further, if a policy change is only to be implemented in a small number of locations, researchers can help policymakers design the selection of locations in a way that meets both political and research needs. Indeed, some of our best opportunities for learning more about the impact of education resources on student outcomes will come from just such partnerships between policymakers and researchers.

Finally, we note that good policy is based not on the results of a single study but on a pattern of results extending over time and across a number of settings. Take the evidence on small class sizes as an example. The evidence from the Tennessee class size reduction experiment is important because it has been analyzed by multiple researchers, and the basic results have been found to be robust to alternative ways of analyzing the data. That said, without other credible evidence that smaller class sizes make a difference for students, one would not want to draw such conclusions.

27 Authors’ calculations based on the following assumptions. The current annual cost per pupil in the Chicago Public Schools is about $9000. If the current school year is about 180 days, then the cost per pupil per day is $50. The summer school program for 3rd graders ran for 6 weeks of half days, the equivalent of 15 full days. Thus, the cost per pupil for the summer school was about $750.
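
The arithmetic behind footnote 27 can be checked directly. The following Python sketch reproduces the roughly $750 figure from the inputs stated in the footnote. The function and variable names are illustrative and do not come from the paper, and the 5-day school week is an assumption implied by (but not stated in) the footnote's conversion of six half-day weeks into 15 days.

```python
def summer_school_cost(annual_cost, school_days, weeks, days_per_week, day_fraction):
    """Return (cost per pupil per day, full-day equivalents, program cost per pupil)."""
    cost_per_day = annual_cost / school_days           # annual cost spread over the school year
    full_days = weeks * days_per_week * day_fraction   # half days converted to full-day equivalents
    return cost_per_day, full_days, cost_per_day * full_days


# Inputs from footnote 27; days_per_week=5 is an assumption, not stated in the paper.
cost_per_day, full_days, program_cost = summer_school_cost(
    annual_cost=9000, school_days=180, weeks=6, days_per_week=5, day_fraction=0.5
)
print(f"Cost per pupil per day: ${cost_per_day:.0f}")
print(f"Full-day equivalents:   {full_days:.0f}")
print(f"Program cost per pupil: ${program_cost:.0f}")
```

Because the estimate scales linearly with the assumed annual cost and program length, a district with different figures could substitute its own values to obtain a comparable per-pupil cost.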
Another recent example of the caution with which one must approach a single study comes from the evidence on the Fast ForWord computerized language program. Results from Miller et al. (1999) suggest the program has a large and statistically significant effect on student outcomes. However, as discussed earlier, this finding did not prove robust in alternative settings. Indeed, the purpose of the federally funded WWC is to provide policymakers with summaries (or meta-analyses) of the best research on any particular topic. This effort reflects the fact that it is only by piecing together results from a variety of high-quality studies that we can begin to develop a picture of what does, and does not, work in education.

References

Aaronson, D., L. Barrow, and W. Sander. 2003. “Teachers and Student Achievement in the Chicago Public High Schools.” Unpublished manuscript. Federal Reserve Bank of Chicago, Chicago, Illinois. Available online at http://www.chicagofed.org/publications/workingpapers/papers/wp2002-28.pdf (accessed January 20, 2005).
Angrist, J. D. 2004. “American Education Research Changes Tack.” Oxford Review of Economic Policy 20: 198-212.
Angrist, J. D. and V. Lavy. 1999. “Using Maimonides’ Rule to Estimate the Effect of Class Size on Scholastic Achievement.” Quarterly Journal of Economics 114: 533-575.
Angrist, J. D. and V. Lavy. 2002. “New Evidence on Classroom Computers and Pupil Learning.” The Economic Journal 112: 735-765.
Barrow, L. and C. E. Rouse. 2004. “Using Market Valuation to Assess the Importance and Efficiency of Public School Spending.” Journal of Public Economics 88: 1747-1769.
Becker, H. J. 2000. “Who’s Wired and Who’s Not.” The Future of Children 10: 44-75.
Boozer, M. A. and C. E. Rouse. 2001. “Intraschool Variation in Class Size: Patterns and Implications.” Journal of Urban Economics 50: 163-189.
Bohrnstedt, G. W. and B. M. Stecher, eds. 1999. Class Size Reduction in California: Early Evaluation Findings, 1996-1999. http://www.classize.org/techreport/index.htm (accessed January 20, 2005).
Borman, G. D. and L. T. Rachuba. 2001. Evaluation of the Scientific Learning Corporation’s Fast ForWord Computer-Based Training Program in the Baltimore City Public Schools. A Report Prepared for the Abell Foundation.
Bryk, A. S. and S. W. Raudenbush. 1992. Hierarchical Linear Models: Applications and Data Analysis Methods. London: Sage Publications.
Burtless, G. 1996. Introduction and summary to Does Money Matter? The Effect of School Resources on Student Achievement and Adult Success. Edited by G. Burtless. Washington, D.C.: Brookings Institution Press.
Campbell, D. T. and J. C. Stanley. 1963. “Experimental and Quasi-Experimental Designs for Research on Teaching.” In Handbook of Research on Teaching. Edited by N. L. Gage. Chicago: Rand McNally.
Card, D. and A. B. Krueger. 1992. “Does School Quality Matter? Returns to Education and the Characteristics of Public Schools in the United States.” The Journal of Political Economy 100: 1-40.
Clotfelter, C. T., H. F. Ladd, and J. L. Vigdor. 2004. “Teacher Sorting, Teacher Shopping, and the Assessment of Teacher Effectiveness.” Unpublished manuscript, Duke University, Durham, North Carolina.
Coleman, J. S. and E. Q. Campbell with C. F. Hobson, J. McPartland, A. M. Mood, F. D. Weinfield, and R. L. York. 1966. Equality of Educational Opportunity. Washington, D.C.: U.S. Office of Education.
Cook, T. D. 2001. “A Critical Appraisal of the Case Against Using Experiments to Assess School (or Community) Effects.” Education Next Unabridged Articles. No. 3 (Fall), http://www.educationnext.org/unabridged/20013/cook.html (accessed January 17, 2005).
Cook, T. D. and D. T. Campbell. 1979. Quasi-Experimentation: Design and Analysis Issues for Field Settings. Boston: Houghton Mifflin Company.
Cooper, H., K. Charlton, J. C. Valentine, and L. Muhlenbruck. 2000. Making the Most of Summer School: A Meta-Analytic and Narrative Review. Malden, Massachusetts: Society for Research in Child Development Monograph.
Cuban, L. 2001. Oversold and Underused: Computers in the Classroom. Cambridge, Massachusetts: Harvard University Press.
Efron, B. and D. Feldman. 1991. “Compliance as an Explanatory Variable in Clinical Trials.” Journal of the American Statistical Association 86: 9-17.
Finn, J. D. and C. M. Achilles. 1990. “Answers and Questions About Class Size: A Statewide Experiment.” American Educational Research Journal 27: 557-577.
Guryan, J. 2003. “Does Money Matter? Estimates from Education Finance Reform in Massachusetts.” Unpublished manuscript, University of Chicago, Chicago, Illinois.
Hanushek, E. A. 1986. “The Economics of Schooling: Production and Efficiency in Public Schools.” Journal of Economic Literature 24: 1141-1177.
Hanushek, E. A. 1989. “The Impact of Differential Expenditures on School Performance.” Educational Researcher 18: 45-65.
Hanushek, E. A. 1996. “Measuring Investment in Education.” Journal of Economic Perspectives 10: 9-30.
Hanushek, E. A. 1997. “Assessing the Effects of School Resources on Student Performance: An Update.” Educational Evaluation and Policy Analysis 19: 141-164.
Hanushek, E. A., S. G. Rivkin, and J. F. Kain. 2005. “Teachers, Schools, and Academic Achievement.” Econometrica 73: 417-458.
Hedges, L. V., R. Laine, and R. Greenwald. 1994. “Does Money Matter? A Meta-Analysis of Studies of the Effects of Differential School Inputs on Student Outcomes.” Educational Researcher 23: 5-14.
Hoxby, C. M. 2000. “The Effects of Class Size on Student Achievement: New Evidence from Population Variation.” Quarterly Journal of Economics 115: 1239-1285.
Imbens, G. and J. Angrist. 1994. “Identification and Estimation of Local Average Treatment Effects.” Econometrica 62: 467-475.
Jacob, B. A. and L. Lefgren. 2004a. “The Impact of Teacher Training on Student Achievement: Quasi-Experimental Evidence from School Reform Efforts in Chicago.” Journal of Human Resources 39: 50-79.
Jacob, B. A. and L. Lefgren. 2004b. “Remedial Education and Student Achievement: A Regression-Discontinuity Analysis.” Review of Economics and Statistics 86: 226-244.
Johnson, G. E. and F. P. Stafford. 1973. “Social Returns to Quantity and Quality of Schooling.” Journal of Human Resources 8: 139-155.
Krueger, A. B. 1999. “Experimental Estimates of Education Production Functions.” Quarterly Journal of Economics 114: 497-531.
Krueger, A. B. 2003. “Economic Considerations and Class Size.” The Economic Journal 113: F34-F63.
Krueger, A. B. and D. M. Whitmore. 2001. “The Effect of Attending a Small Class in the Early Grades on College-Test Taking and Middle School Test Results: Evidence from Project STAR.” The Economic Journal 111: 1-28.
Lepper, M. R. 1985. “Microcomputers in Education, Motivational and Social Issues.” American Psychologist 40: 1-18.
Means, B. and K. Olson. 1997. Technology and Education Reform. Office of Educational Research and Improvement, Contract No. RP91-172010. Washington, D.C.: U.S. Department of Education.
Miller, S. L., M. M. Merzenich, P. Tallal, K. DeVivo, K. La-Rossa, N. Linn, A. Pycha, B. E. Peterson, and W. M. Jenkins. 1999. “Fast ForWord Training in Children with Low Reading Performance.” Nederlandse vereniging voor logopedie en foniatrie: 1999 jaarcongres auditieve vaardigheden en spraak-taal.
Pischke, J. 2003. “The Impact of Length of School Year on Student Performance and Earnings: Evidence from the German Short School Years.” National Bureau of Economic Research Working Paper 9964.
Rizzuto, R. and P. Wachtel. 1980. “Further Evidence on the Returns to School Quality.” Journal of Human Resources 15: 240-254.
Rouse, C. E. 2005. “Accounting for Schools: Econometric Issues in Measuring School Quality.” In Measurement and Research Issues in a New Accountability Era. Edited by Carol Anne Dwyer. New Jersey: Lawrence Erlbaum Associates.
Rouse, C. E. and A. B. Krueger with L. Markman. 2004. “Putting Computerized Instruction to the Test: A Randomized Evaluation of a ‘Scientifically-based’ Reading Program.” Economics of Education Review 23: 323-338.
Rubin, D. 1974. “Estimating Causal Effects of Treatments in Randomized and Non-randomized Studies.” Journal of Educational Psychology 66: 688-701.
Sandholtz, J. H., C. Ringstaff, and D. C. Dwyer. 1997. Teaching with Technology: Creating Student-Centered Classrooms. New York: Teachers College Press.
Sewell, W. H. 1967. Review of Equality of Educational Opportunity by J. S. Coleman and E. Q. Campbell with C. F. Hobson, J. McPartland, A. M. Mood, F. D. Weinfield, and R. L. York. American Sociological Review 32: 475-479.
Smith, J. P. and F. R. Welch. 1989. “Black Economic Progress After Myrdal.” Journal of Economic Literature 27: 519-564.
Smylie, M. A., E. Allensworth, R. C. Greenberg, R. Harris, and S. Luppescu. 2001. Teacher Professional Development in Chicago: Supporting Effective Practice. Chicago: Consortium on Chicago School Research. Also available online at http://www.consortium-chicago.org/publications/pdfs/p0d01.pdf.
Summers, A. A. and B. L. Wolfe. 1977. “Do Schools Make a Difference?” American Economic Review 67: 639-652.
Todd, P. E. and K. I. Wolpin. 2003. “On the Specification and Estimation of the Production Function for Cognitive Achievement.” The Economic Journal 113: F3-F33.
U.S. Department of Education. 2002. U.S. Department of Education Awards Contract for ‘What Works Clearinghouse.’ http://www.ed.gov/news/pressreleases/2002/08/08072002a.html (accessed January 20, 2005).