The GiveWell Blog

High-quality study of Head Start early childhood care program

Early this year, the U.S. Department of Health and Human Services released what is by far the highest-quality study to date of the Head Start early childhood care program. I’ve had a chance to review this study, and I find the results very interesting.

  • The study’s quality is outstanding, in terms of design and analysis (as well as scale). If I were trying to give an example of a good study that can be held up as a model, this would now be one of the first that would come to mind.
  • The impact observed is generally positive but small, and fades heavily over time.

The study’s quality is outstanding.

This study has almost all the qualities I look for in a meaningful study of a program’s impact:

  • Impact-isolating, selection-bias-avoiding design. Many impact studies fall prey to selection bias, and may end up saying less about the program’s effects than about pre-existing differences between participants and non-participants. This study uses randomization (see pages 2-3) to separate a “treatment group” and “control group” that are essentially equivalent in all measured respects to begin with (see page 2-12), and follows both over time to determine the effects of Head Start itself.
  • Large sample size; long-term followup. The study is an ambitious attempt to get truly representative, long-term data on impact. “The nationally representative study sample, spread over 23 different states, consisted of a total of 84 randomly selected grantees/delegate agencies, 383 randomly selected Head Start centers, and a total of 4667 newly entering children: 2559 3-year-olds and 2108 4-year-olds” (xviii). Children were followed from entry into Head Start at ages 3 and 4 through the end of first grade, a total of 3-4 years (xix). Follow-up will continue through the third grade (xxxviii).
  • Meaningful and clearly described measures. Researchers used a variety of different measures to determine the impact of Head Start on children’s cognitive abilities, social/emotional development, health status, and treatment by parents. These measures are clearly described starting on page 2-15. The vast majority were designed around existing tools that seem (to me) to be focused on collecting factual, reliable information. For example, the “Social skills and positive approaches to learning” dimension assessed children by asking parents whether their child “Makes friends easily,” “Comforts or helps others,” “Accepts friends’ ideas in sharing and playing,” “Enjoys learning,” “Likes to try new things,” and “Shows imagination in work and play” (2-32). While subjective, such a tool seems much more reliable (and less loaded) to me than a less specified question like “Have your child’s social skills improved?”
  • Attempts to avoid and address “publication bias.” We have written before about “publication bias,” the concern that bad news is systematically suppressed in favor of good news. This study contains common-sense measures to reduce such a risk:
    • Public disclosure of many study details before impact-related data was collected. We have known this study was ongoing for a long time; baseline data was released in 2005, giving a good idea of the measures and design being used and making it harder for researchers to “fit the data to the hoped-for conclusions” after collection.
    • Explicit analysis of whether results are reliable in aggregate. This study examined a very large number of measures; it was very likely to find “statistically significant” effects on some purely by chance, just because so many were collected. However, unlike in many other studies we’ve seen, the authors address this issue explicitly, and (in the main body of the paper, not the executive summary) clearly mark the difference between effects that may be an artifact of chance (even though “statistically significant,” finding some effects of comparable size was quite likely due to the large number of measures examined) and effects that are much less likely to be an artifact of chance. (See page 2-52)

  • Explicit distinction between “confirmatory” analysis (looking at the whole sample; testing the original hypotheses) and “exploratory” analysis (looking at effects on subgroups; looking to generate new hypotheses). Many studies present the apparent impact of a program on “subgroups” of the population (for example, effects on African-Americans or effects on higher-risk families); without hypotheses laid out in advance, it is often unclear just how the different subgroups are defined and to what extent subgroup analysis reflects publication bias rather than real impacts. This paper is explicit that the only effects that should be taken as a true test of the program are the ones applying to the full population; while subgroup analysis is presented, it is explicitly in the interest of generating new ideas to be tested in the future. (See page xvi)
  • Charts. Showing charts over time often elucidates the shape and nature of effects in a way that raw numbers cannot. See page 4-16 for an example (discussed more below).

The least encouraging aspect of the study’s quality is response rates, which are in the 70%-90% range (2-19).

In my experience, it’s very rare for an evaluation of a social program – coming from academia or the nonprofit sector – to have even a few of the above positive qualities.

Some of these qualities can only be achieved for certain kinds of studies (for example, randomization is not always feasible), and/or can only be achieved with massive funding (a sample this large and diverse is out of reach for most). However, for many of the qualities above (particularly those related to publication bias), it seems to me that they could be present in almost any impact study, yet rarely are.

I find it interesting that this exemplary study comes not from a major foundation or nonprofit, but from the U.S. government. Years ago, I speculated that government work is superior in some respects to private philanthropic work; if true, I believe this is largely an indictment of the state of philanthropy.

The impact observed is positive, but small and fading heavily over time.

First off, the study appears meaningful in terms of assessing the effects of Head Start and quality child care. It largely succeeded in separating initially similar (see page 2-12) children such that the “treatment” group had significantly more participation in Head Start (and out-of-home child care overall) than the “control” group (see chart on page xx). The authors write that the “treatment” group ended up with meaningfully better child care, measured in terms of teacher qualifications, teacher-child ratios, and other measures of the care environment (page xxi). (Note that the program only examined the effects of one year of Head Start: as page xx shows, “treatment” 3-year-olds had much more Head Start participation than “control” 3-year-olds, but the next year the two groups had similar participation.)

The impacts themselves are best summarized by the tables on pages 4-10, 4-21, 5-4, 5-8, 6-3, 6-6. Unlike in the executive summary, these tables make clear which impacts are clearly distinguished from randomness (these are the ones in bold) and those that are technically “statistically significant” but could just be an artifact of the fact that so many different measures were examined (“*” means “statistically significant at p=0.1”; “**” means “statistically significant at p=0.05”; “***” means “statistically significant at p=0.01” and all *** effects also appear to be in bold).
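To see why the sheer number of measures matters, here is a minimal sketch of the arithmetic. It assumes the measures are statistically independent (a simplification; measures in a real study are usually correlated), and the count of 20 measures is a hypothetical round number, not a figure from the study. It is not the adjustment the study's authors used, only an illustration of the underlying problem.

```python
def prob_any_false_positive(n_measures, alpha):
    """Chance that at least one of n independent tests comes up
    'statistically significant' when the program has no real effect."""
    return 1 - (1 - alpha) ** n_measures


def expected_false_positives(n_measures, alpha):
    """Average number of spurious 'significant' results under no real effect."""
    return n_measures * alpha


# Hypothetical example: 20 independent measures tested at the 0.05 level.
n_measures = 20  # assumption for illustration only
alpha = 0.05

print(round(prob_any_false_positive(n_measures, alpha), 3))  # 0.642
print(round(expected_false_positives(n_measures, alpha), 2))  # 1.0
```

Under these assumptions, a study of a program with no effect at all would still more likely than not report at least one "significant" result, which is why the distinction between starred and bold effects is worth paying attention to.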

The basic picture that emerges from these tables is that

  • Impact appeared encouraging at the end of the first year, i.e., immediately after participation in Head Start. Both 4-year-olds and 3-year-olds saw “bold” impact on many different measures of cognitive skills, as well as on the likelihood of receiving dental care.
  • That said, even at this point, effects on other measures of child health, social/emotional development, and parent behavior were more iffy. And all effects appear small in the context of later child development – for example, see the charts on page 4-16 (similar charts follow each table of impacts).
  • Impact appeared to fade out sharply after a year, and stay “faded out” through the first grade. Very few statistically significant effects of any kind, and fewer “bold” ones, can be seen at any point after the first year in the program. The charts following each table, tracking overall progress over time, make impact appear essentially invisible in context.
  • I don’t think it would be fair to claim that impact “faded out entirely” or that Head Start had “no effects.” Positive impacts far outnumber negative ones, even if these impacts are small and rarely statistically significant. It should also be kept in mind that many of the families who had been lotteried out of Head Start itself had found other sources of early child care (xv); because the study was comparing Head Start to alternative (though apparently inferior, as noted above) care, rather than to no care at all, effects should not necessarily be expected to be huge.

Takeaways

The impact of Head Start shown here is highly disappointing compared to many of its advocates’ hopes and promises. It is much weaker than the impact of projects like the Perry Preschool program and the Carolina Abecedarian program, which have been used in the past to estimate the social returns to early childhood care. It is much weaker than the impact that has been imputed from past lower-quality studies of Head Start. It provides strong evidence for the importance of high-quality studies and the Stainless Steel Law of Evaluation, as well as for “fading impacts” as a potential problem.

I don’t believe any of this makes it appropriate to call Head Start a “failure,” or even to reduce its government funding. As noted above, the small impacts noted were consistently more positive than negative, even several years after the program; it seems clear that Head Start is resulting in improved early childhood care and is accomplishing something positive for children.

I largely feel that anyone disappointed by this study must have an unrealistic picture of just how much a single year in a federal social program is likely to change a person. The U.S. achievement gap is complex and not well understood. From a government funding perspective, I’m happy to see a program at this level of effectiveness continued. When it comes to my giving, I continue to personally prefer developing-world aid, where a single intervention really can make huge, demonstrable, lasting differences in people’s lives (such as literally saving them) for not much money.

Needed from major funders: More great organizations

In the wake of the recent Giving Pledges, we’ve been discussing what advice we’d give a major philanthropist (aside from our usual plea to conduct evaluations and share them publicly).

For the most part, our recommendations and criteria are aimed at individual donors, not major philanthropists. We stress the value of giving to proven, cost-effective, scalable organizations rather than funding experiments, but we don’t feel that this advice applies to major philanthropists – taking risks with small, untested organizations and approaches makes a great deal of sense when you have the time and funds to follow their work closely, hold them accountable, and perform the evaluation that will hopefully show you (and possibly/eventually the world) how things are going. However, we do have some thoughts on the kind of risk that’s worth taking.

One of our biggest frustrations in trying to help individual donors has been the difficulty of finding organizations, as opposed to programs or projects, we can be confident in. As we have discussed in our series on room for more funding, we feel that donors can’t take “restricted gifts” at face value, and that they must ultimately find either an organization they can be confident in as a whole or one with a clear and publicly disclosed agenda for what it would do with more funding. Such organizations have proven very difficult to find.

  • In the area of developing-world aid, we’ve found many organizations with activities so diverse that it’s impossible for us, or for them, to provide any kind of bird’s-eye view of their activities.
  • Meanwhile, we’ve also seen very promising intervention categories that we can’t support simply because we can’t match them to strong, focused organizations. See our past discussion of community-led total sanitation; we have similar issues with salt iodization.
  • In more informal investigations into other causes, we’ve found a multitude of organizations that seem to act as “umbrellas” for a cause, seemingly doing “many things related to the cause” rather than pursuing narrower, targeted agendas. For an example, see our discussion of anti-cancer organizations.
  • For another example, see the organizations listed at Philanthropedia’s report on global warming, which are mostly not focused solely on specific anti-global-warming strategies, but rather extremely broad environmental organizations simultaneously carrying out all manner of global-warming-related activities (forest conservation, political advocacy, research into new energy sources and more), as well as non-global-warming-related activities such as endangered species protection.

Of course, it could make sense for an organization to have varied activities, if there are synergies between them and a clear strategy underlying them. But in all the cases discussed above, that doesn’t appear to be what’s happening. In fact, my impression from the conversations I’ve had with major funders is that most large organizations are essentially loose coalitions of separate offices and projects, some excellent, some poor. Two major funders have stated to me, off the record, that one major international nonprofit does great work in some areas but that they would never endorse a contribution to it. One has stated to me that (paraphrasing) “I don’t think about what organization to fund – it all comes down to which people are good, and people move around a lot.” From scrutinizing nearly any major funder’s list of grants, or from examining the work of the Center for High-Impact Philanthropy at University of Pennsylvania (which aims to advise larger donors), it seems clear that the typical approach of a major funder is to evaluate projects and people, not organizations.

Unfortunately, this attitude is somewhat self-fulfilling. As long as major funders treat organizations as contractors to carry out their projects of choice, organizations will remain loose coalitions; successful projects will be isolated events. We’ll see none of the gains that come with organization-level culture, knowledge and training built around core competencies. And people giving smaller amounts will have no way to know what they’re really giving to.

We’ve argued before that great organizations are born, not made. Rather than trying to wrench existing organizations into their preferred projects, we’d like to see more major funders trying to “birth” great organizations, so that there’s something left over when they move on.

Philanthropy vouchers

We focus on finding charities that are doing demonstrably good work already, rather than on proposals for new sorts of projects. This post is an exception: we’ve been tossing around an idea for “philanthropy vouchers” that we think could be worth trying in a broad variety of contexts, and we’re interested in others’ thoughts.

The idea is a variation of the “development vouchers” idea put forth by William Easterly in The White Man’s Burden (see page 330). Prof. Easterly proposes that official aid agencies co-create an independent “voucher fund,” and issue vouchers to people in developing countries that can be redeemed for money from the fund. The basic appeal of the idea is that, like cash handouts, it may shift the power and choice to the hands of the people we’re trying to help, rather than the hands of well-meaning outsiders at charities; but the two major concerns with cash handouts (fraud/manipulation by less poor locals and poor/irresponsible use of the money) could be mitigated by some basic regulations on what sorts of services the vouchers can be spent on.

While Prof. Easterly proposes a coordinated effort by major aid agencies, our proposal can be carried out at very small scale, unilaterally, by a single funder. The funder would simply issue a set amount in vouchers, set its own rules for how they could be redeemed, and set aside the necessary funds.

Specifically, to carry out a philanthropy vouchers program, a funder would do the following:

  1. Determine how much “money” it wanted to inject into a community in the form of vouchers.
  2. Form a definition of a “philanthropic organization,” i.e., an organization that would be eligible for collecting these vouchers from people and trading them to the funder for cash. This classification could be formed in a variety of ways: the funder might lay out a set of general criteria for “philanthropic” organizations and take applications for formal designation as “philanthropic,” with approved organizations getting the right to trade vouchers to the funder for cash; or it might do something as simple as accepting vouchers from any organization classified as a charity in its country of origin.
  3. Print vouchers and distribute them to the people in an area (trying to target those in need, but the targeting wouldn’t be as high-stakes as it is with cash).
  4. From there, any organization classified as “philanthropic” could offer its goods and services, and all such organizations would effectively be competing for the funds embedded in the vouchers.
  5. The funder would still be well advised to do its own monitoring and evaluation of how the program is going – in particular, spot-interviewing participants to ensure that vouchers were obtained through transparent and mutually consensual transactions.

For a hypothetical example, consider an “alternative Millennium Villages” powered by philanthropy vouchers.

  • The funder would create a definition of “philanthropic organization” as any US-registered public charity, or local government agency, whose activities in the village consisted of providing or “selling” the following: vitamin and mineral supplements, health services, water, primary education, food meeting basic nutrition standards, or electricity. Organizations would apply to the funder for recognition as such an organization, a process that need not be nearly as involved as applying for direct funding. Organizations with other ideas for helping people, such as cellphones, could apply as well, and their status would be at the funder’s discretion.
  • The funder would print 5,000 vouchers for $50 each, and distribute them throughout a village of 5,000 with a rough goal of allocating one voucher per person (or N vouchers per family of N). (Assuming $50,000 in funder overhead, this would be equivalent in cost to Millennium Villages.) Alternatively, the funder might allocate some of the vouchers to a “common fund” allocated through a voting procedure among villagers, in order to encourage the purchase of “public goods” such as well construction (though of course the villagers could also arrange such a “common fund” themselves, or simply choose to “pay” à la carte for water).
  • Nonprofits and government agencies would then hopefully offer services in attempts to win clients’ vouchers. If a nonprofit perceived that others were focusing excessively on farmer training as opposed to water, it could invest in providing water and hope to take in more revenue in vouchers than its costs.
  • With each voucher submitted to the funder, an organization might submit a brief description of what was provided in exchange for the voucher, and to whom; the funder would then perform “spot interviews” to see if these descriptions were confirmed by villagers.
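The arithmetic behind this example can be sketched in a few lines. The voucher count, voucher value and overhead figure come from the example above; the function names and the list of redemptions are hypothetical, invented only to show how a funder might tally which services villagers chose.

```python
from collections import Counter


def voucher_program_cost(num_vouchers, voucher_value, overhead):
    """Total funder outlay: face value of all vouchers plus overhead."""
    return num_vouchers * voucher_value + overhead


def tally_redemptions(redemptions):
    """Count redeemed vouchers by the service they were exchanged for."""
    return Counter(redemptions)


# Figures from the example: 5,000 vouchers at $50 each, $50,000 overhead.
total = voucher_program_cost(num_vouchers=5000, voucher_value=50, overhead=50_000)
print(total)          # 300000
print(total / 5000)   # 60.0 dollars per villager, overhead included

# Hypothetical redemption reports, made up for illustration.
reports = ["water", "water", "health services",
           "primary education", "water", "health services"]
print(tally_redemptions(reports).most_common(1))  # [('water', 3)]
```

A tally like this is the kind of signal a funder could watch over time to see which services people value most, alongside spot interviews to check that the reports are accurate.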

Though the example given is for the developing world, I think the concept could as easily be used in poor communities in the U.S.

There would be many challenges involved in such a program. Tensions could arise between different “competing” organizations, and they may resort to misleading advertising or even coercion in order to win more vouchers. Vouchers wouldn’t be distributed perfectly fairly or evenly among participants. However, these issues could be monitored to some degree using spot interviews, and the concerns would be smaller than with a cash handout program. On the flip side, voucher revenues would provide strong indicators of which services people valued most and how that changed over time, and the actual services provided could adjust in real-time to these indicators. Incentives and possibilities for innovation and adaptation would likely be much greater than for a centrally planned project.

All in all, it seems to us like a project along these lines would be worth trying, hopefully accompanied (as with any pilot) by strong monitoring and evaluation. What do you think?

Invest in Kids

As part of our research into United States causes, we’ve been looking at Invest in Kids, an organization focused on implementing evidence-based programs in Colorado, and we recently had the chance to speak with Lisa Merlino, Invest in Kids’ Executive Director (edited transcript of our conversation (DOC)).

While our research is still in progress, we want to highlight some of the things we really like about Invest in Kids:

  • Founding story. Invest in Kids was started in the late 1990s by a group of mostly lawyers in Colorado who wanted to start an organization to help children in need. They considered their options and spoke with experts to identify programs with strong track records. Ultimately, they were convinced by the Nurse-Family Partnership’s strong evidence of effectiveness and decided to start an organization focused on implementing that evidence-based program. At that time, David Olds, NFP’s founder, was conducting the third randomized controlled trial of NFP’s model, and the NFP National Service Office (the NFP charity that GiveWell recommends) did not yet exist.
  • Ongoing program selection. After implementing NFP, Invest in Kids began looking for other evidence-based programs to implement. In 2003, they settled on the Incredible Years, another program that has been subject to rigorous evaluation. More recently, they participated in a clinical trial of the Good Behavior Game. According to Ms. Merlino, “This research trial was completed and although changes in child behavior trended in a positive direction, the preliminary data shows outcomes were not statistically significant for the children who received the intervention. Therefore, Invest in Kids has decided not to replicate the program at this time. However, anecdotally we heard powerful stories of improvement in teachers and children so we remain hopeful about the positive outcomes that may be seen from this intervention. We continue to await additional results from this and other trials around the country.”
  • Monitoring and evaluation. Ms. Merlino told us that they have ongoing monitoring of the programs they implement to assess whether the outcomes their programs achieve are in line with their expectations based on the research. Note that IIK has sent us these reports, but we haven’t yet had a chance to review them.

While our analysis of Invest in Kids is ongoing, we’re excited about them. Their general approach of looking to scale up what works should, in our view, serve as a model for other non-profit organizations. We’re looking forward to learning more about them over the next few months.

The Money for Good study

The Money for Good study’s headline finding is that “few donors do research before they give, and those that do look to the nonprofit itself to provide simple information about efficiency and effectiveness.”

That conclusion syncs up with our own experience talking to donors, but we aren’t discouraged by the results. That’s because where the Money for Good study answered the question “how do most donors behave?” we’re interested in answering a different question: is there a market for giving based on evidence of impact and how big is that market?

Hope Consulting shared their raw survey data with us, and we’ve done a rough estimate of the size of the potential “GiveWell market” by extrapolating the percentages in the survey to the size of the overall giving market. We estimate:

  • $4.1 billion from donors who report having done research to compare and evaluate multiple organizations (as opposed to researching a single organization or researching how much to give).
  • $3.8 billion if we further narrow the above set by looking at what factors are important to them, and eliminate any donors that rank what we consider “factors irrelevant to impact” (e.g., “ability to get involved with the organization” or “public recognition of my donation”) higher than what we consider “factors relevant to impact” (e.g., “organizational effectiveness”).
  • $554 million from donors who both did research to compare organizations (i.e., fit in the first group above) and reported that “amount of good the organization is accomplishing” was the most important piece of information sought in their research.
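For readers who want to follow the method, here is a rough sketch of this kind of extrapolation. The function names are ours, and the inputs in the first example are placeholder values chosen only to show the calculation; they are not figures from the survey. The second part uses only the three dollar estimates listed above.

```python
def extrapolate_market(segment_share, total_market):
    """Scale a segment's share of surveyed giving up to the whole market."""
    if not 0 <= segment_share <= 1:
        raise ValueError("segment_share must be between 0 and 1")
    return segment_share * total_market


def percent_of(narrow, broad):
    """Express a narrower estimate as a percentage of a broader one."""
    return round(100 * narrow / broad, 1)


# Placeholder inputs (NOT survey figures): a segment holding 2% of
# surveyed giving, applied to a market of 100 units.
print(extrapolate_market(0.02, 100.0))  # 2.0

# Using the estimates listed above:
researchers = 4.1e9      # donors who compared multiple organizations
impact_ranked = 3.8e9    # ...who also rank impact-relevant factors higher
impact_first = 554e6     # ...for whom "amount of good" mattered most

print(percent_of(impact_ranked, researchers))  # 92.7
print(percent_of(impact_first, researchers))   # 13.5
```

The last two lines show how much each narrowing step shrinks the first estimate: the second filter removes relatively little, while the third leaves a bit under one seventh of it.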

We still don’t have a great sense of the potential market size for GiveWell-style research, but it certainly hasn’t been established that the market is small.

Ultimately, we think it’s important to take the study’s conclusions with a grain of salt. If you polled all TV watchers on what they want, you’d conclude that only a very small percentage want something like The Wire, yet that show wasn’t exactly a failure. In fact, for most successful businesses I can think of, it’s still the case that most people aren’t customers of them.

Our goal isn’t to create a product that the majority of people like; it’s to create a product that some minority market loves. From what we’re seeing now, it’s still possible that the minority of donors interested in impact-focused research is quite large.

Slow spending

The Chronicle of Philanthropy and NPR note that charities don’t seem to have spent large percentages of the funds raised for Haiti to date. Here we (a) lay out the numbers, using the Chronicle of Philanthropy’s helpful public survey data; (b) discuss what it means for donors that most of the money raised seems to be reserved for long-term as opposed to immediate relief.

The numbers

The Chronicle of Philanthropy’s survey data gives a total of over $1.6 billion raised, and seems to include nearly all of the “big name” charities working in disaster relief. We have collected the data, for all charities that provided comparable “raised” and “spent” figures (i.e., either both worldwide or both non-worldwide), into this Excel file.

The chart below summarizes this data by sorting the charities in order of how much they’ve raised. Each bar represents a listed charity; the total length of the bar corresponds to funds raised, while the blue part corresponds to funds spent.

Notes:

  • 38 of the 48 charities have spent under 75% of the money they’ve raised; 29 of the 48 have spent less than half the money they’ve raised; and 22 of the 48 have spent less than a third.
  • The Mennonite Central Committee reports spending far more than it has raised, but its numbers are confusing (it is the only listed charity whose “worldwide” money raised figure is lower than the other figure it provides) and there is no summary of how it’s spent the funds.
  • The Entertainment Industry Foundation reports raising and spending exactly the same amount ($66 million). There is no summary of how it’s spent the funds.
  • Only two other charities, Population Services International and Fonkoze, report spending over 90% of what they’ve raised. Both cases involve relatively small amounts (around $2 million for Fonkoze and $211,000 for Population Services International). Update: this Fonkoze figure is for Fonkoze USA, not for Fonkoze as a whole.

Overall, about 38% of the ~$1.6 billion raised has been spent. In fact, the amount spent – around $627 million – is not much greater than the amount ($560 million) that was raised in the first 9 days after the earthquake hit.
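The percentages behind these notes are easy to reproduce. The sketch below uses only the counts and dollar figures quoted in this post; the function name is ours. Because the total raised is given only as "over $1.6 billion," the last line works backward from the 38% figure to see what total it implies.

```python
def percent(part, whole):
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100 * part / whole, 1)


charities = 48

print(percent(38, charities))  # 79.2 -> spent under 75% of funds raised
print(percent(29, charities))  # 60.4 -> spent less than half
print(percent(22, charities))  # 45.8 -> spent less than a third

spent = 627e6  # approximate total spent, from the post

# Total raised implied by "about 38% spent", in millions of dollars.
implied_raised_millions = round(spent / 0.38 / 1e6)
print(implied_raised_millions)  # 1650
```

The implied total of roughly $1.65 billion is consistent with the "over $1.6 billion" raised figure.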

Why this matters: “speed relief” vs. longer-term relief and recovery

We don’t believe that spending money slowly indicates irresponsibility. Shortly after the earthquake hit, we expressed doubts about whether there was “room for more funding”, and the Chronicle’s coverage implies that at this point there largely isn’t.

However, we do feel that it’s important for donors to note how much of their donations are likely paying for longer-term, as opposed to immediate, relief, because this has implications for what one should look for in a disaster relief charity in the future.

Immediately after the earthquake hit, many (including us) were stressing the importance of a charity’s existing capacity on the ground and its ability to respond quickly and efficiently. When we think about longer-term relief, though, we wish to focus less on capacity/speed and more on the things we usually focus on:

  • Is the organization clear about where the money is going?
  • Does the organization formally assess whether and to what extent its work is succeeding? (For disaster relief, in particular, we’d hope to see evidence that the organization is actively getting and acting on feedback from beneficiaries.)
  • Does the organization focus on activities that help as many people as possible, as much as possible, for as little money as possible?

In assessing disaster relief organizations, we plan on focusing on what they’ve accomplished over the longer term, because that’s what donations in these situations are most likely to be paying for.