The GiveWell Blog

High-quality study of Head Start early childhood care program

Early this year, the U.S. Department of Health and Human Services released by far the highest-quality study to date of the Head Start early childhood care program. I’ve had a chance to review this study, and the results are very interesting.

  • The study’s quality is outstanding, in terms of design and analysis (as well as scale). If I were trying to give an example of a good study that can be held up as a model, this would now be one of the first that would come to mind.
  • The impact observed is generally positive but small, and fades heavily over time.

The study’s quality is outstanding.

This study has almost all the qualities I look for in a meaningful study of a program’s impact:

  • Impact-isolating, selection-bias-avoiding design. Many impact studies fall prey to selection bias, and may end up saying less about the program’s effects than about pre-existing differences between participants and non-participants. This study uses randomization (see pages 2-3) to separate a “treatment group” and “control group” that are essentially equivalent in all measured respects to begin with (see page 2-12), and follows both over time to determine the effects of Head Start itself.
  • Large sample size; long-term followup. The study is an ambitious attempt to get truly representative, long-term data on impact. “The nationally representative study sample, spread over 23 different states, consisted of a total of 84 randomly selected grantees/delegate agencies, 383 randomly selected Head Start centers, and a total of 4667 newly entering children: 2559 3-year-olds and 2108 4-year-olds” (xviii). Children were followed from entry into Head Start at ages 3 and 4 through the end of first grade, a total of 3-4 years (xix). Follow-up will continue through the third grade (xxxviii).
  • Meaningful and clearly described measures. Researchers used a variety of different measures to determine the impact of Head Start on children’s cognitive abilities, social/emotional development, health status, and treatment by parents. These measures are clearly described starting on page 2-15. The vast majority were designed around existing tools that seem (to me) to be focused on collecting factual, reliable information. For example, the “Social skills and positive approaches to learning” dimension assessed children by asking parents whether their child “Makes friends easily,” “Comforts or helps others,” “Accepts friends’ ideas in sharing and playing,” “Enjoys learning,” “Likes to try new things,” and “Shows imagination in work and play” (2-32). While subjective, such a tool seems much more reliable (and less loaded) to me than a less specified question like “Have your child’s social skills improved?”
  • Attempts to avoid and address “publication bias.” We have written before about “publication bias,” the concern that bad news is systematically suppressed in favor of good news. This study contains common-sense measures to reduce such a risk:
    • Public disclosure of many study details before impact-related data was collected. We have known this study was ongoing for a long time; baseline data was released in 2005, giving a good idea of the measures and design being used and making it harder for researchers to “fit the data to the hoped-for conclusions” after collection.
    • Explicit analysis of whether results are reliable in aggregate. This study examined a very large number of measures, so it was very likely to find “statistically significant” effects on some purely by chance. Unlike in many other studies we’ve seen, however, the authors address this issue explicitly, and (in the main body of the paper, not the executive summary) clearly mark the difference between effects that may be artifacts of chance (even though “statistically significant,” some effects of comparable size were quite likely given the sheer number of measures examined) and effects that are much less likely to be artifacts of chance. (See page 2-52)

  • Explicit distinction between “confirmatory” analysis (looking at the whole sample; testing the original hypotheses) and “exploratory” analysis (looking at effects on subgroups; looking to generate new hypotheses). Many studies present the apparent impact of a program on “subgroups” of the population (for example, effects on African-Americans or effects on higher-risk families); without hypotheses laid out in advance, it is often unclear just how the different subgroups are defined and to what extent subgroup analysis reflects publication bias rather than real impacts. This paper is explicit that the only effects that should be taken as a true test of the program are the ones applying to the full population; while subgroup analysis is presented, it is explicitly in the interest of generating new ideas to be tested in the future. (See page xvi)
  • Charts. Showing charts over time often elucidates the shape and nature of effects in a way that raw numbers cannot. See page 4-16 for an example (discussed more below).
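The logic behind the randomized design described above can be sketched in a few lines. With a sample this large, random assignment alone makes the two groups nearly identical at baseline, so later differences can be attributed to the program. The numbers below are purely illustrative, not study data:

```python
import random
import statistics

random.seed(0)

# Illustrative baseline scores for 4,000 hypothetical children
# (not study data): a bell curve with mean 100 and SD 15.
children = [random.gauss(100, 15) for _ in range(4000)]

# Randomly assign half to "treatment" and half to "control",
# mimicking the study's lottery.
random.shuffle(children)
treatment, control = children[:2000], children[2000:]

gap = abs(statistics.mean(treatment) - statistics.mean(control))
print(f"baseline gap between groups: {gap:.2f} points")
# With 2,000 children per group, the expected gap is a fraction of a
# point -- far smaller than any plausible program effect.
```

This is the sense in which the treatment and control groups were “essentially equivalent in all measured respects to begin with”: randomization does the balancing automatically.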

The least encouraging aspect of the study’s quality is response rates, which are in the 70%-90% range (2-19).

In my experience, it’s very rare for an evaluation of a social program – coming from academia or the nonprofit sector – to have even a few of the above positive qualities.

Some of these qualities can only be achieved for certain kinds of studies (for example, randomization is not always feasible), and/or can only be achieved with massive funding (a sample this large and diverse is out of reach for most). However, for many of the qualities above (particularly those related to publication bias), it seems to me that they could be present in almost any impact study, yet rarely are.

I find it interesting that this exemplary study comes not from a major foundation or nonprofit, but from the U.S. government. Years ago, I speculated that government work is superior in some respects to private philanthropic work; if true, I believe this is largely an indictment of the state of philanthropy.

The impact observed is positive, but small and fading heavily over time.

First off, the study appears meaningful in terms of assessing the effects of Head Start and quality child care. It largely succeeded in separating initially similar (see page 2-12) children such that the “treatment” group had significantly more participation in Head Start (and out-of-home child care overall) than the “control” group (see chart on page xx). The authors write that the “treatment” group ended up with meaningfully better child care, measured in terms of teacher qualifications, teacher-child ratios, and other measures of the care environment (page xxi). (Note that the study only examined the effects of one year of Head Start: as page xx shows, “treatment” 3-year-olds had much more Head Start participation than “control” 3-year-olds, but the next year the two groups had similar participation.)

The impacts themselves are best summarized by the tables on pages 4-10, 4-21, 5-4, 5-8, 6-3, 6-6. Unlike in the executive summary, these tables make clear which impacts are clearly distinguished from randomness (these are the ones in bold) and those that are technically “statistically significant” but could just be an artifact of the fact that so many different measures were examined (“*” means “statistically significant at p=0.1”; “**” means “statistically significant at p=0.05”; “***” means “statistically significant at p=0.01” and all *** effects also appear to be in bold).

The basic picture that emerges from these tables is that

  • Impact appeared encouraging at the end of the first year, i.e., immediately after participation in Head Start. Both 4-year-olds and 3-year-olds saw “bold” impact on many different measures of cognitive skills, as well as on the likelihood of receiving dental care.
  • That said, even at this point, effects on other measures of child health, social/emotional development, and parent behavior were more iffy. And all effects appear small in the context of later child development – for example, see the charts on page 4-16 (similar charts follow each table of impacts).
  • Impact appeared to fade out sharply after a year, and stay “faded out” through the first grade. Very few statistically significant effects of any kind, and fewer “bold” ones, can be seen at any point after the first year in the program. The charts following each table, tracking overall progress over time, make impact appear essentially invisible in context.
  • I don’t think it would be fair to claim that impact “faded out entirely” or that Head Start had “no effects.” Positive impacts far outnumber negative ones, even if these impacts are small and rarely statistically significant. It should also be kept in mind that many of the families who had been lotteried out of Head Start itself had found other sources of early child care (xv); because the study was comparing Head Start to alternative (though apparently inferior, as noted above) care, rather than to no care at all, effects should not necessarily be expected to be huge.

The impact of Head Start shown here is highly disappointing compared to many of its advocates’ hopes and promises. It is much weaker than the impact of projects like the Perry Preschool program and the Carolina Abecedarian program, which have been used in the past to estimate the social returns to early childhood care. It is much weaker than the impact that has been imputed from past lower-quality studies of Head Start. It provides strong evidence for the importance of high-quality studies and the Stainless Steel Law of Evaluation, as well as for “fading impacts” as a potential problem.

I don’t believe any of this makes it appropriate to call Head Start a “failure,” or even to reduce its government funding. As noted above, the small impacts noted were consistently more positive than negative, even several years after the program; it seems clear that Head Start is resulting in improved early childhood care and is accomplishing something positive for children.

I largely feel that anyone disappointed by this study must have an unrealistic picture of just how much a single year in a federal social program is likely to change a person. The U.S. achievement gap is complex and not well understood. From a government funding perspective, I’m happy to see a program at this level of effectiveness continued. When it comes to my giving, I continue to personally prefer developing-world aid, where a single intervention really can make huge, demonstrable, lasting differences in people’s lives (such as literally saving them) for not much money.


  • Alexander on August 23, 2010 at 12:10 am said:

    Thanks for summarizing the study. I’m curious about the fade-out effects, and whether the variables being used are ultimately the ones we care about.

    You probably saw the study (PDF) that the New York Times reported on last month about how big the impacts of kindergarten classes (and teachers) may be on eventual earnings. One of the really interesting things that you can see on pages 48 & 49 of the linked PDF is that the effect of kindergarten class quality on grade-level test scores appears to fade out rapidly, but manifests itself quite strongly in eventual earnings. The authors hypothesize that this is because the tests stop assessing some skills that may ultimately affect income, but which may have played a role in kindergarten tests. (As they always say: all I really need to know, I learned in kindergarten…)

    Could that phenomenon (of the tests ceasing to examine the relevant variables) also explain some of the fade-out observed in the Head Start study? It would seem to cohere with results of some other early childhood interventions which have limited to moderate academic effects, but extraordinary longer-term results.

    I only glanced at the Head Start study, but it looks like they’re using the same assessments over the entire data collection, so changes in what is being assessed probably could not explain the fade-out effects. That said, do you think it is possible that the same assessment administered to four- and seven-year-olds might tell you different things? More specifically, that an elementary reading assessment of a four-year-old would tell you about something like attention span or effort, which might impact income decades later, whereas with a seven-year-old the same assessment might say more about the quality of reading instruction?

  • Holden on August 27, 2010 at 7:21 am said:

    Alexander, this is an interesting idea and I think it is definitely possible. I haven’t had a chance to review the study on kindergarten teachers yet.

    One of the challenges of evaluation is finding appropriate measures. I feel that social programs are currently often oversold, to the point where people expect to see many sorts of impacts that are probably unrealistic – and focus studies on looking for these impacts. This phenomenon leaves us with a lot of disappointing results for programs that may nonetheless be successful, and may partially explain the Stainless Steel Law of Evaluation. We don’t conduct our own studies, so we can’t really do anything about this; we have to take what academia provides. What we can say is that results are “disappointing” relative to what was apparently hoped for.

  • Alexander on November 25, 2010 at 3:50 am said:

    Chetty et al. published a working paper about this in September. They make an interesting point about the fade-out research:

    Our results also complement the findings of studies on the long-term impacts of other early childhood interventions (reviewed in Almond and Currie 2010). For example, the Perry preschool program randomized 123 children into a control group and an intensive pre-school treatment group. Schweinhart, Barnes, and Weikhart (1993), Schweinhart et al. (2005), and Heckman et al. (2010a, 2010b) show that the Perry preschool program had extremely large impacts on earnings and other adult outcomes despite relatively rapid fade-out of test score impacts. Heckman et al. (2010c) show that the Perry intervention also improved non-cognitive skills and argue that this mechanism accounts for much of the long-term impact. Campbell et al. (2002) show that the Abecedarian project, which randomized 111 children into intensive early childhood programs from infancy to age 5, led to lasting improvements in education and other outcomes in early adulthood. At a larger scale, several studies have shown that the Head Start program leads to improvement in a variety of adult outcomes despite fade-out on test scores (e.g., Currie and Thomas 1995, Garces, Thomas, and Currie 2002, Ludwig and Miller 2007, Deming 2009). The results reported here are the first experimental evidence on the long-term impacts of a scalable intervention in a large sample with minimal attrition. In particular, we show that a better classroom environment from ages 5-8 can have substantial long-term benefits even without intervention at earlier ages. This result is consistent with the findings of Card and Krueger (1992), who show that better educational inputs have substantial long-term payoffs using state-by-cohort variation.

    Although the study you’re discussing is of incredibly high quality, I would want significantly more information about the longer-term effects before accepting that Head Start’s impact is as limited as the study suggests.

  • Alexander, agreed for the most part.

    I think it is always interesting and meaningful when a well-designed study fails to find the impact it was looking for. It implies that something doesn’t work the way we thought, or as well as we had thought. It isn’t the last word on the impact of Head Start, and the context you provide is important as well.

  • Carl Shulman on January 24, 2013 at 4:18 pm said:

    HHS has released a follow-up of the study, covering the children in third grade:

Comments are closed.