Early this year, the U.S. Department of Health and Human Services released what is by far the highest-quality study to date of the Head Start early childhood care program. I’ve had a chance to review this study, and I find the results very interesting.
- The study’s quality is outstanding, in terms of design and analysis (as well as scale). If I were trying to give an example of a good study that can be held up as a model, this would now be one of the first that would come to mind.
- The impact observed is generally positive but small, and fades heavily over time.
The study’s quality is outstanding.
This study has almost all the qualities I look for in a meaningful study of a program’s impact:
- Impact-isolating, selection-bias-avoiding design. Many impact studies fall prey to selection bias, and may end up saying less about the program’s effects than about pre-existing differences between participants and non-participants. This study uses randomization (see pages 2-3) to separate a “treatment group” and “control group” that are essentially equivalent in all measured respects to begin with (see page 2-12), and follows both over time to determine the effects of Head Start itself.
- Large sample size; long-term followup. The study is an ambitious attempt to get truly representative, long-term data on impact. “The nationally representative study sample, spread over 23 different states, consisted of a total of 84 randomly selected grantees/delegate agencies, 383 randomly selected Head Start centers, and a total of 4667 newly entering children: 2559 3-year-olds and 2108 4-year-olds” (xviii). Children were followed from entry into Head Start at ages 3 and 4 through the end of first grade, a total of 3-4 years (xix). Follow-up will continue through the third grade (xxxviii).
- Meaningful and clearly described measures. Researchers used a variety of measures to determine the impact of Head Start on children’s cognitive abilities, social/emotional development, health status, and treatment by parents. These measures are clearly described starting on page 2-15. The vast majority were designed around existing tools that seem (to me) to be focused on collecting factual, reliable information. For example, the “Social skills and positive approaches to learning” dimension assessed children by asking parents whether their child “Makes friends easily,” “Comforts or helps others,” “Accepts friends’ ideas in sharing and playing,” “Enjoys learning,” “Likes to try new things,” and “Shows imagination in work and play” (2-32). While subjective, such a tool seems much more reliable (and less loaded) to me than a less specific question like “Have your child’s social skills improved?”
- Attempts to avoid and address “publication bias.” We have written before about “publication bias,” the concern that bad news is systematically suppressed in favor of good news. This study contains common-sense measures to reduce such a risk:
- Public disclosure of many study details before impact-related data was collected. We have known this study was ongoing for a long time; baseline data was released in 2005, giving a good idea of the measures and design being used and making it harder for researchers to “fit the data to the hoped-for conclusions” after collection.
- Explicit analysis of whether results are reliable in aggregate. This study examined a very large number of measures, so it was very likely to find “statistically significant” effects on some of them purely by chance. However, unlike in many other studies we’ve seen, the authors address this issue explicitly, and (in the main body of the paper, not the executive summary) clearly mark the difference between effects that may be an artifact of chance and effects that are much less likely to be one. An effect in the first category may be “statistically significant” on its own terms, but with so many measures examined, finding some effects of comparable size was quite likely even if the program did nothing. (See page 2-52)
- Explicit distinction between “confirmatory” analysis (looking at the whole sample; testing the original hypotheses) and “exploratory” analysis (looking at effects on subgroups; looking to generate new hypotheses). Many studies present the apparent impact of a program on “subgroups” of the population (for example, effects on African-Americans or effects on higher-risk families). Without hypotheses laid out in advance, it is often unclear just how the different subgroups are defined and to what extent subgroup analysis reflects publication bias rather than real impacts. This paper is explicit that the only effects that should be taken as a true test of the program are the ones applying to the full population; while subgroup analysis is presented, it is explicitly in the interest of generating new ideas to be tested in the future. (See page xvi)
- Charts. Showing charts over time often elucidates the shape and nature of effects in a way that raw numbers cannot. See page 4-16 for an example (discussed more below).
The least encouraging aspect of the study’s quality is response rates, which are in the 70%-90% range (2-19).
In my experience, it’s very rare for an evaluation of a social program – coming from academia or the nonprofit sector – to have even a few of the above positive qualities.
Some of these qualities can only be achieved for certain kinds of studies (for example, randomization is not always feasible), and/or can only be achieved with massive funding (a sample this large and diverse is out of reach for most). However, for many of the qualities above (particularly those related to publication bias), it seems to me that they could be present in almost any impact study, yet rarely are.
I find it interesting that this exemplary study comes not from a major foundation or nonprofit, but from the U.S. government. Years ago, I speculated that government work is superior in some respects to private philanthropic work; if true, I believe this is largely an indictment of the state of philanthropy.
The impact observed is positive, but small and fading heavily over time.
First off, the study appears meaningful in terms of assessing the effects of Head Start and quality child care. It largely succeeded in separating initially similar (see page 2-12) children such that the “treatment” group had significantly more participation in Head Start (and out-of-home child care overall) than the “control” group (see chart on page xx). The authors write that the “treatment” group ended up with meaningfully better child care, measured in terms of teacher qualifications, teacher-child ratios, and other measures of the care environment (page xxi). (Note that the program only examined the effects of one year of Head Start: as page xx shows, “treatment” 3-year-olds had much more Head Start participation than “control” 3-year-olds, but the next year the two groups had similar participation.)
The impacts themselves are best summarized by the tables on pages 4-10, 4-21, 5-4, 5-8, 6-3, and 6-6. Unlike the executive summary, these tables make clear which impacts are clearly distinguished from randomness (these are the ones in bold) and which are technically “statistically significant” but could just be an artifact of the fact that so many different measures were examined (“*” means “statistically significant at p=0.1”; “**” means “statistically significant at p=0.05”; “***” means “statistically significant at p=0.01”; all *** effects also appear to be in bold).
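To make the "so many different measures" concern concrete, here is a minimal Python sketch. It assumes the tests are independent and that the program has no true effect on any measure. The measure counts are hypothetical, chosen only for illustration, and the Bonferroni threshold shown is a generic textbook adjustment, not necessarily the procedure the study's authors used.

```python
def prob_any_false_positive(num_measures: int, alpha: float) -> float:
    """Chance that at least one of `num_measures` independent tests comes out
    "statistically significant" at level `alpha` when no true effect exists."""
    return 1.0 - (1.0 - alpha) ** num_measures


def bonferroni_threshold(num_measures: int, alpha: float) -> float:
    """Per-test p-value cutoff that keeps the overall false-positive chance
    at or below `alpha` (a simple, conservative adjustment)."""
    return alpha / num_measures


if __name__ == "__main__":
    # Hypothetical numbers of measures; not taken from the study.
    for num_measures in (1, 10, 50, 100):
        chance = prob_any_false_positive(num_measures, alpha=0.05)
        cutoff = bonferroni_threshold(num_measures, alpha=0.05)
        print(
            f"{num_measures:>3} measures: "
            f"chance of a spurious 'significant' result = {chance:.3f}, "
            f"adjusted per-test cutoff = {cutoff:.5f}"
        )
```

Under these assumptions, 50 measures tested at p=0.05 give a better than 90% chance of at least one spurious "significant" result. That is why it matters whether an effect still stands out after accounting for the number of measures examined.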
The basic picture that emerges from these tables is as follows:
- Impact appeared encouraging at the end of the first year, i.e., immediately after participation in Head Start. Both 4-year-olds and 3-year-olds saw “bold” impact on many different measures of cognitive skills, as well as on the likelihood of receiving dental care.
- That said, even at this point, effects on other measures of child health, social/emotional development, and parent behavior were more iffy. And all effects appear small in the context of later child development – for example, see the charts on page 4-16 (similar charts follow each table of impacts).
- Impact appeared to fade out sharply after a year, and to stay “faded out” through the first grade. Very few statistically significant effects of any kind, and even fewer “bold” ones, can be seen at any point after the first year in the program. The charts following each table, tracking overall progress over time, make impact appear essentially invisible in context.
- I don’t think it would be fair to claim that impact “faded out entirely” or that Head Start had “no effects.” Positive impacts far outnumber negative ones, even if these impacts are small and rarely statistically significant. It should also be kept in mind that many of the families who had been lotteried out of Head Start itself had found other sources of early child care (xv); because the study was comparing Head Start to alternative (though apparently inferior, as noted above) care, rather than to no care at all, effects should not necessarily be expected to be huge.
The impact of Head Start shown here is highly disappointing compared to many of its advocates’ hopes and promises. It is much weaker than the impact of projects like the Perry Preschool program and the Carolina Abecedarian program, which have been used in the past to estimate the social returns to early childhood care. It is much weaker than the impact that has been imputed from past lower-quality studies of Head Start. It provides strong evidence for the importance of high-quality studies and the Stainless Steel Law of Evaluation, as well as for “fading impacts” as a potential problem.
I don’t believe any of this makes it appropriate to call Head Start a “failure,” or even to reduce its government funding. As noted above, the small impacts noted were consistently more positive than negative, even several years after the program; it seems clear that Head Start is resulting in improved early childhood care and is accomplishing something positive for children.
I largely feel that anyone disappointed by this study must have an unrealistic picture of just how much a single year in a federal social program is likely to change a person. The U.S. achievement gap is complex and not well understood. From a government funding perspective, I’m happy to see a program at this level of effectiveness continued. When it comes to my giving, I continue to personally prefer developing-world aid, where a single intervention really can make huge, demonstrable, lasting differences in people’s lives (such as literally saving them) for not much money.