Impact evaluation | The GiveWell Blog
October 18th, 2011

What it takes to evaluate impact

When someone asks me what makes GiveWell different from other third-party charity evaluators, I often answer by listing all the things we’ve done in order to investigate our current top-rated charity, VillageReach.

All in all, we’ve spent hundreds of hours examining VillageReach - yet we still feel very far from being “settled” on the question of how promising its activities are. Like any outstanding opportunity to do good, VillageReach’s work involves large and complex challenges. We’ll never have 100% of the relevant information or 100% certainty on its merits, but because we’ve recommended VillageReach so highly and moved over $1 million to it, it’s important to us that we do the best we can.

It isn’t realistic to do this kind of in-depth investigation for thousands (or even hundreds) of charities. We have to save our resources for the most promising charities if we want to have a reasonable level of confidence in our top recommendations. That means we take shortcuts on less promising charities, and we don’t put in the work it would take to distinguish between “worst,” “bad,” “mediocre” and “decent” groups - we’re laser-focused on the ones that we consider “best.”

Other independent charity evaluators tend to measure themselves by how many charities they rate. They exist largely for donors who already know where they want to give, and want a basic legitimacy check before they finalize the donation. To accommodate this goal, these other evaluators need to be far less thorough and more simplified than we are. That means - in our view - that they have no realistic chance of ever meaningfully rating impact, i.e., the degree to which a charity is succeeding at its mission.

GiveWell isn’t for everyone. Donors looking to check the charity they already want to give to are better off with other resources. But for donors who don’t already have a charity in mind and are looking to maximize their impact, we don’t know of any other group that provides a comparable product.

May 27th, 2011

In Defense of the Streetlight Effect

In a recent guest post for Development Impact, Martin Ravallion writes:

the current fashion [for evaluating aid projects] is for randomized control trials (RCTs) and social experiments more generally … The problem is that the interventions for which currently favored methods are feasible constitute a non-random subset of the things that are done by donors and governments in the name of development. It is unlikely that we will be able to randomize road building to any reasonable scale, or dam construction, or poor-area development programs, or public-sector reforms, or trade and industrial policies—all of which are commonly found in development portfolios. One often hears these days that opportunities for evaluation are being turned down by analysts when they find that randomization is not feasible. Similarly, we appear to invest too little in evaluating projects that are likely to have longer-term impacts; standard methods are not well suited to such projects … (Emphasis mine)

He concludes with a call for “‘central planning’ in terms of what gets evaluated” to ensure that evaluation doesn’t become concentrated among the projects that are easy to evaluate. His post could be seen as a direct retort to the kind of work emphasized in the recent books Poor Economics and More Than Good Intentions (our review). These books present ideas and evidence that are mostly drawn from high-quality studies, and have little to say on questions that high-quality studies cannot help answer.

My instinct is the opposite of Dr. Ravallion’s: I feel that the move toward high-quality evaluations is a good thing, even if it starts to cause bias in what sorts of programs are evaluated - and carried out. What follows is an attempt to explain my feeling on this. My feeling is a function of my worldview and biases, and this post should be taken less as a “rebuttal” than as an opportunity to explicate my worldview and biases.

My disagreement with Dr. Ravallion has to do with my experience as a “customer” of social science. The vast majority of studies I’ve come across have seemed so methodologically suspect to me that I’ve ended up not feeling they shed much light on anything at all; and many (not all) exceptions are studies that have come out of the “randomista” movement. (Another particularly helpful source of evidence has been Millions Saved, which focused on global health.) Given this situation, I’m not excited about using “central planning” to make sure that researchers continue to try answering questions that they simply don’t have the methods to answer well. I’d rather see them stick to areas where they can be helpful.

What does it look like when we build knowledge only where we’re best at building knowledge, rather than building knowledge on the “most important problems?” A few thoughts jump to mind:

  • Over the last several decades, I am not sure whether we’ve generated any useful and general knowledge about how to promote women’s empowerment and equality - from the outside - in developing-world countries. But we’ve generated a lot of knowledge about how to produce affordable, convenient birth control in a variety of forms. I would guess (though this is just a guess, as empowerment itself is so hard to measure) that the latter kind of knowledge generation has done much more for empowerment and equality than attempts to study empowerment/equality directly.
  • Similarly, what has done more for political engagement in the U.S.: studying how to improve political engagement, or studying the technology that led to the development of the Internet, the World Wide Web, and ultimately to sites like Change.org (as well as new campaign methods)?
  • More broadly, studying areas we’re good at studying and generating knowledge we’re good at generating has led to a lot of wealth generation and poverty reduction. I feel poverty reduction brings a lot of benefits that would be hard to bring about (or even fully understand) directly.

Bottom line - researching topics we’re good at researching can have a lot of benefits, some unexpected, some pertaining to problems we never expected such research to address. Researching topics we’re bad at researching doesn’t seem like a good idea no matter how important the topics are. Of course I’m in favor of thinking about how to develop new research methods to make research good at what it was formerly bad at, but I’m against applying current problematic research methods to current projects just because they’re the best methods available.

If we focus evaluations on what can be evaluated well, is there a risk that we’ll also focus on executing programs that can be evaluated well? Yes and no.

  • Some programs may be so obviously beneficial that they are good investments even without high-quality evaluations available; in these cases we should execute such programs and not evaluate them.
  • But when it comes to programs where evaluation seems both necessary and infeasible, I think it’s fair to simply de-emphasize these sorts of programs, even if they might be helpful and even if they address important problems. This reflects my basic attitude toward aid as “supplementing people’s efforts to address their own problems” rather than “taking responsibility for every problem clients face, whether or not such problems are tractable to outside donors.” I think there are some problems that outside donors can be very helpful on and others that they’re not well suited to helping on; thus, “helping with the most important problem” and “helping as much as possible” are not at all the same to me.

It’s common in our sphere to warn against the “streetlight effect,” i.e., “looking for your keys where there’s light, rather than where the keys are most likely to be.” In the context of aid, this means executing - and studying - the programs that are easiest to evaluate rather than the programs that are most likely to do good. (Chris Blattman uses this analogy in the context of Dr. Ravallion’s post.)

But for the aid world, the right analogy would acknowledge that there are a lot of keys to be found, and a lot of unexplored territory both in and outside the light. In that context, the “streetlight effect” seems like a good thing to me.

January 28th, 2010

Can choosing the right charity double your impact?

Reader Evan writes:

I’ve been thinking about how best to donate to Haiti, and I reviewed some of the materials on your website and found them pretty helpful and persuasive. So thank you! But then my law firm announced that it would match donations to the Red Cross or Doctors Without Borders. Given that, I think I have to donate to one of those orgs: even if my money would probably be better spent elsewhere, it’s hard to imagine that it would be more than twice as well spent. Do you disagree?

My intuition here is different than Evan’s. My guess would be that giving to one of our top-rated charities could easily accomplish more than twice as much good as supporting the efforts of the Red Cross or Doctors Without Borders in Haiti.

This guess is largely based on two factors:

  1. The large divergence in relative cost-effectiveness of different programs (which can approach a factor of 1,000, not just a factor of 2) combined with the reasonable position that disaster relief is not among the most cost-effective avenues for charitable funds.
  2. A back-of-the-envelope calculation for cost-effectiveness of efforts in Haiti which puts it well below the cost-effectiveness of our top charities.

In this post, I’ll look at the first factor. I’ll post more on the second issue in a future post.

Cost-effectiveness for different approaches to helping people varies widely

The most cost-effective programs are so much more impactful per dollar than other programs that a much smaller donation to a top program will likely help significantly more people. We’re careful about our use of cost-effectiveness figures, and the Disease Control Priorities Report’s (DCP) in particular (which we think constitute “best-case” scenarios rather than what a donor can expect from his donation), but we do think they give a reasonable basic sense of the differences between different kinds of programs.

Figures 2.2 and 2.3 on pages 41-2 of the DCP report provide cost-effectiveness estimates for many common programs charities run. (These are all presented using the $/DALY metric. For more information on what this is, see our overview for interpreting the DALY metric.) Some of the most cost-effective programs are deworming programs ($3/DALY), expanding immunization coverage ($7/DALY), and bednets to prevent malaria ($11/DALY).

Some of the least cost-effective (but common among charities) programs are improved water and sanitation to prevent diarrhea ($4,185/DALY), some types of maternal and neonatal care packages ($1,060/DALY), and Antiretroviral therapy to treat HIV/AIDS ($922/DALY).

These examples are not meant to demonstrate that the less cost-effective programs are necessarily less worthy, but they do illustrate that the impact per dollar a donor can expect from his gift can easily vary by 2-3 orders of magnitude, even under assumptions that programs are essentially being carried out as intended. Of course, if some programs are poorly executed or simply ineffective, the difference can be much larger still.

With that context, when choosing which charity to support, I wouldn’t trade much confidence-in-an-organization to merely double the size of my donation.
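
To make the arithmetic concrete, here is a minimal sketch using the DCP $/DALY figures quoted above. The $1,000 gift size, the function name, and the assumption that a match exactly doubles the gift are illustrative. The figures are best-case estimates for the programs listed, not estimates for relief work in Haiti.

```python
# $/DALY figures as quoted from the DCP report earlier in this post.
COST_PER_DALY = {
    "deworming": 3,
    "immunization": 7,
    "bednets": 11,
    "antiretroviral therapy": 922,
    "maternal/neonatal care": 1060,
    "water and sanitation": 4185,
}


def dalys_averted(dollars, cost_per_daly):
    """Best-case DALYs averted by a gift of `dollars` at a given $/DALY."""
    return dollars / cost_per_daly


gift = 1000  # hypothetical gift size, in dollars

# Unmatched gift to a program at the cost-effective end of the range.
unmatched = dalys_averted(gift, COST_PER_DALY["bednets"])

# Matched (doubled) gift to a program at the less cost-effective end.
matched = dalys_averted(2 * gift, COST_PER_DALY["antiretroviral therapy"])

# Ratio between the least and most cost-effective programs listed.
spread = max(COST_PER_DALY.values()) / min(COST_PER_DALY.values())

print(f"unmatched bednets gift: {unmatched:.1f} DALYs")
print(f"matched ART gift:       {matched:.1f} DALYs")
print(f"spread across programs: {spread:.0f}x")
```

Under these assumptions the unmatched gift still comes out far ahead. A match can close a gap of 2x, but not a gap of two or three orders of magnitude.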

As we’ve discussed before, with limited information we’d tentatively guess that disaster relief funds are closer to the less-cost-effective end of the range rather than the most-cost-effective end. With that in mind, I’d guess that a gift to VillageReach or Stop TB could easily accomplish more than twice as much good as a gift supporting the Red Cross or Doctors without Borders in Haiti.

The above is very general and, though relevant, not at all specific to the situation in Haiti. In a future post, I’ll write more on some specifics regarding Haiti and why I think they offer further support to the notion that donors can accomplish more good by giving to our top charities, even if they give less.

Two other small notes

There are a couple other factors that contribute (though in a relatively small way) to my conclusion here:

  • It doesn’t seem appropriate to consider causing one’s company to give a donation to be equivalent to doubling one’s cost-effectiveness. The firm may have taken matching funds from a pool already allocated to charitable giving, or the partners may have given the funds to charities themselves. Even if the funds wouldn’t otherwise go to charity, the firm likely has another motive for giving, which should lead you to consider how this program differs from other embedded giving programs, which we think are of dubious additional value.
  • Giving to a charity because it has demonstrated effectiveness has the additional benefit of signaling to other charities that effectiveness matters to donors. A core belief of ours at GiveWell is that rewarding charities for effectiveness in changing lives will incentivize other charities to improve their programs to compete for those donor funds. Proactive giving (i.e., trying to choose the best charity available) furthers this dynamic; passive giving (choosing from a predefined list) hampers it. It’s also quite possible that in a very direct sense telling your company that you’ve chosen to give in this way could influence them to adjust their own giving towards more considered and effective charities.

January 22nd, 2010

More on the microfinance “repayment rate”

We are concerned about the way repayment rates are often reported. We’ve written about this issue before, arguing that different delinquency indicators can easily be misleading and pointing to one example we found where a microfinance institution’s reported repayment rate substantially obscures the portion of its borrowers that have repaid loans.

Following the links from David Roodman’s recent post about Richard Rosenberg, we found another paper Mr. Rosenberg authored making all the same points, much better than we did. The paper is Richard Rosenberg, “Measuring microcredit delinquency: ratios can be harmful to your health,” CGAP Occasional Paper #3, 1999. Available online here (pdf).

Relevant quotes from Mr. Rosenberg’s paper

The importance of using the “right” delinquency measure:

MFIs use dozens of ratios to measure delinquency. Depending on which of them is being used, a “98 percent recovery rate” could describe a safe portfolio or one on the brink of meltdown. (Pg 1)

The measure we’ve been asking for seems to be equivalent to what he calls the “collection rate.”

Most of the discussion will be devoted to three broad types of delinquency indicators: (a) Collection rates measure amounts actually paid against amounts that have fallen due. (b) Arrears rates measure overdue amounts against total loan amounts. (c) Portfolio at risk rates measure the outstanding balance of loans that are not being paid on time against the outstanding balance of total loans. (Pg 2)

It’s essential to not only know which measure is being used, but precisely how an MFI calculates its version of the measure:

But the reader must be warned that there is no internationally consistent terminology for portfolio quality measures—for instance, what this paper calls a “collection rate” may be called a “recovery rate,” a “repayment rate,” or “loan recuperation” in other settings. No matter what name is used, the important point is that we can’t interpret what a measure is telling us unless we understand precisely the numerator and the denominator of the fraction. (Pg 2)
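
To illustrate why the numerator and denominator matter so much, here is a rough sketch that computes the three indicator families quoted above on one small portfolio. The `Loan` fields and the three loans are hypothetical. The sketch also assumes principal-only loans with no interest, which is a simplification of our own rather than anything the paper prescribes.

```python
from dataclasses import dataclass


@dataclass
class Loan:
    disbursed: float   # original loan amount
    fallen_due: float  # total amount that has come due so far
    paid: float        # total amount actually paid so far

    @property
    def outstanding(self):
        # Unpaid balance (principal only; interest is ignored here).
        return self.disbursed - self.paid

    @property
    def overdue(self):
        # Amount that has come due but has not been paid.
        return max(self.fallen_due - self.paid, 0.0)


def collection_rate(loans):
    # Amounts actually paid / amounts that have fallen due.
    return sum(l.paid for l in loans) / sum(l.fallen_due for l in loans)


def arrears_rate(loans):
    # Overdue amounts / outstanding balance of all loans.
    return sum(l.overdue for l in loans) / sum(l.outstanding for l in loans)


def portfolio_at_risk(loans):
    # Outstanding balance of loans with any overdue amount
    # / outstanding balance of all loans.
    at_risk = sum(l.outstanding for l in loans if l.overdue > 0)
    return at_risk / sum(l.outstanding for l in loans)


# Hypothetical portfolio: two loans on schedule, one behind by 100.
portfolio = [
    Loan(disbursed=1000.0, fallen_due=500.0, paid=500.0),
    Loan(disbursed=1000.0, fallen_due=300.0, paid=300.0),
    Loan(disbursed=500.0, fallen_due=200.0, paid=100.0),
]

print(f"collection rate:   {collection_rate(portfolio):.1%}")
print(f"arrears rate:      {arrears_rate(portfolio):.2%}")
print(f"portfolio at risk: {portfolio_at_risk(portfolio):.1%}")
```

The same three loans produce three very different numbers, so a single quoted "rate" says little until we know which fraction it is.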

Mr. Rosenberg describes different tests to which MFIs should subject various delinquency measures to determine which is most appropriate. For GiveWell’s purposes, one of the key tests is the “smoke and mirrors” test:

Can the delinquency measure be made to look better through inappropriate rescheduling or refinancing of loans, or manipulation of accounting policies? This is our smoke and mirrors test. (Pg 3)

The practice of rescheduling and renegotiating loans:

When a borrower runs into repayment problems, an MFI will often renegotiate the loan, either rescheduling it (that is, stretching out its original payment terms) or refinancing it (that is, replacing it—even though the client hasn’t really repaid it—with a new loan to the same client). These practices complicate the process of using a collection rate to estimate an annual loan loss rate. Before exploring those complications and suggesting alternative solutions for dealing with them, the author needs to issue a warning: any reader looking for a perfect solution will be disappointed. The suggested approaches all have drawbacks. It is important to recognize that heavy use of rescheduling or refinancing can cloud the MFI’s ability to judge its loan loss rate. This is one of many reasons why renegotiation of problem loans should be kept to a minimum—some MFIs simply prohibit the practice. (Pg 10)

The strengths of PAR (“portfolio at risk”) as a measure:

The international standard for measuring bank loan delinquency is portfolio at risk (PAR). This measure compares apples with apples. Both the numerator and the denominator of the ratio are outstanding balances. The numerator is the unpaid balance of loans with late payments, while the denominator is the unpaid balance on all loans. The PAR uses the same kind of denominator as an arrears rate, but its numerator captures all the amounts that are placed at increased risk by the delinquency. (Pg 13)

And its weaknesses:

Like many other delinquency measures, the PAR can be distorted by improper handling of renegotiated loans. MFIs sometimes reschedule—that is, amend the terms of—a problem loan, capitalizing unpaid interest and setting a new, longer repayment schedule. Or they may refinance a problem loan, issuing the client a new loan whose proceeds are used to pay off the old one. In both cases the delinquency is eliminated as a legal matter, but the resulting loan is clearly at higher risk than a normal loan. Thus a PAR report must age renegotiated loans separately, and provision such loans more aggressively. If this is not done, the PAR is subject to smoke and mirrors distortion: management can be tempted to give its portfolio an artificial facelift by inappropriate renegotiation. (Pg 16)

PAR can also be misleading in a situation where an MFI is growing rapidly (a key argument of our past posts):

Another potential distortion in PAR measures is worth mentioning. Arguably the PAR denominator should include only loans on which at least one payment has fallen due, so that late loans in the numerator are compared only to loans that have had a chance to be late. Nevertheless, it is customary to use the total outstanding loan balance for the denominator. The distortion involved is usually not large for MFIs, because the period before the first payment is a small fraction of the life of their loans. For instance, for a stable portfolio of loans paid in 16 weekly installments with no grace period, a PAR of 5.0 percent measured with the customary denominator (total outstanding portfolio) would rise only to 5.3 percent using the more precise denominator (excluding loans on which no payment has yet come due.) However, if a portfolio is growing very fast, or if there is a grace period or other long interval before the first payment is due, then the customary PAR denominator can seriously understate risk. (Pg 17)
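
The effect of the denominator can be sketched in a few lines. All balances below are hypothetical. The 6% "not yet due" share in the stable case was picked only so that the result lands near the 5.3 percent in the quote; it is not derived from the paper's 16-installment example.

```python
def par(late_balance, total_balance, not_yet_due_balance=0.0):
    """Balance of late loans divided by the balance in the denominator.

    With not_yet_due_balance=0 this uses the customary denominator (all
    outstanding loans). Passing the balance of loans on which no payment
    has come due yet gives the more precise denominator.
    """
    return late_balance / (total_balance - not_yet_due_balance)


# Hypothetical portfolio: 50 of 1,000 outstanding is on late loans.
late, total = 50.0, 1000.0

customary = par(late, total)
# Stable portfolio: only a small share of the balance is too new to be late.
stable = par(late, total, not_yet_due_balance=60.0)
# Fast-growing portfolio: a large share of the balance is brand-new loans.
growing = par(late, total, not_yet_due_balance=400.0)

print(f"customary denominator:      {customary:.1%}")
print(f"precise, stable portfolio:  {stable:.1%}")
print(f"precise, growing portfolio: {growing:.1%}")
```

The late balance is identical in all three cases. Only the share of loans that have not yet had a chance to be late changes, and the larger that share, the more the customary figure understates risk.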

Table 6 on Pg 19 summarizes the strengths and weaknesses of different measures.

Why is this important?

Given how complicated this all is, we think that MFIs need to be clear and transparent about (a) which measures they use and (b) precisely how they calculate them.

However, this isn’t the case. For example, we aren’t confident that most MFIs normally report rescheduled and renegotiated loans as at-risk in PAR measures.

On the one hand, Commenter Ben writes, “Best practice is to treat all loans that have been rescheduled as PAR.” (This is consistent with MixMarket’s glossary, which indicates that, “[A PAR measure] also includes loans that have been restructured or rescheduled.”)

Nevertheless, “best practice” may not correlate with “in practice.”

  • This Kiva document (its “Partnership Application”) is explicit in the definition of PAR 30: “The value of loans outstanding that have one or more repayments past due more than 30 days. This includes the entire unpaid balance of the loan, including both past due and future installments, but not accrued interest or renegotiated loans.” (emphasis mine) Note that, to Kiva’s credit, it explicitly asks for renegotiated loans separately in the application.
  • As Holden recently commented, “At least one MFI has indicated to us that it does not report [renegotiated loans in its PAR measures].”
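
A small sketch shows how much the treatment of renegotiated loans can matter. The loan records and the `par30` function are hypothetical. The sketch assumes that a renegotiated loan shows zero days late after renegotiation, in line with Mr. Rosenberg's point that the delinquency is eliminated as a legal matter.

```python
def par30(loans, count_renegotiated):
    """Share of outstanding balance on loans more than 30 days late.

    If count_renegotiated is True, renegotiated loans are also treated as
    at risk, following the "best practice" described above.
    """
    total = sum(loan["balance"] for loan in loans)
    at_risk = sum(
        loan["balance"]
        for loan in loans
        if loan["days_late"] > 30
        or (count_renegotiated and loan["renegotiated"])
    )
    return at_risk / total


# Hypothetical portfolio of three loans.
loans = [
    {"balance": 600.0, "days_late": 0, "renegotiated": False},
    {"balance": 200.0, "days_late": 45, "renegotiated": False},
    {"balance": 200.0, "days_late": 0, "renegotiated": True},
]

excluding = par30(loans, count_renegotiated=False)
including = par30(loans, count_renegotiated=True)

print(f"PAR 30 excluding renegotiated loans: {excluding:.0%}")
print(f"PAR 30 including renegotiated loans: {including:.0%}")
```

In this toy portfolio the reported figure doubles depending on which definition is applied, which is why we want to know which one an MFI is using.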

The definition you read today isn’t necessarily the one that MFIs are using.

What measure do we use and why?

We’ve written before that our preferred measure is what the paper discussed above calls the collection rate. While the collection rate measure fails to provide a warning to MFIs that their portfolio is in danger, it is the strongest on Mr. Rosenberg’s “Bottom-line” test because it simply and clearly measures failed repayments. It’s therefore less susceptible to obfuscation and manipulation.

For GiveWell’s purposes, we need a delinquency measure that most clearly reports borrowers’ situations. While PAR measures provide information, they are more valuable for evaluating the risk of an MFI’s portfolio, which, while relevant, is not our key concern.

November 13th, 2009

Chess in the Schools

The New York Times recently profiled Chess in the Schools:

The Chess-in-the-Schools program has sought to foster analytical skills on the theory that these will help students succeed academically. The group teaches 20,000 children a year and calculates that it has taught 425,000 children since 1986. Children gather to learn the game at the group’s headquarters in Manhattan.

It seems like 20 years and 425,000 children is quite a lot of investment in the “theory that [chess] will help students succeed academically.” The Times feature provides a calming justification for the investment: “Chess helps promote intellectual growth and has been shown to improve academic performance.” Let’s look at the evidence for this claim.

The study we found

An early-1990s study looks at achievement test scores of chess-playing students over two years at District 9 in the Bronx. It observes that (a) the overall average reading score improved among chessplayers by about 5 percentile points, but didn’t improve among the set of remaining District 9 students; (b) 15 of 22 second-year participants improved their reading scores by some amount, while only 491 of 1118 non-participants in the district, and 245 of 655 non-participants with high reading scores, improved.

This study is riddled with major problems:

  • The numbers the researchers choose to compare seem arbitrary and possibly cherry-picked. Why do the researchers look at the “percentage who improved” among second-year chessplayers but not for both years? Why do they compare the second-year students to “high-performing nonparticipants,” but not give the same comparison when looking at all students?
  • The problem of selection bias is unusually obvious here. They’re comparing kids who volunteered to play chess against those who didn’t. Think of the chess club members at your school, and ask yourself if they would have been just like all the other kids had chess club not been offered. There’s no reason to think these two groups of kids are otherwise similar or would be expected to respond similarly to school.
  • This is a study of somewhere between 22 and 53 students at a single district in the early 1990s. Even if the study were highly rigorous, it would still be a long way from “proof that chess helps promote intellectual growth.”
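
To give a feel for the sample-size point, the sketch below turns the counts reported in the study into percentages and attaches a rough 95% interval to each. The Wilson score interval is our choice for illustration, not something the study reports. It addresses only sampling noise and does nothing about the selection bias described above.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# (students who improved, group size), as reported in the study.
groups = {
    "second-year chess participants": (15, 22),
    "all non-participants": (491, 1118),
    "high-scoring non-participants": (245, 655),
}

for name, (improved, n) in groups.items():
    low, high = wilson_interval(improved, n)
    print(f"{name}: {improved / n:.1%} improved "
          f"(rough 95% interval {low:.1%} to {high:.1%})")
```

With only 22 participants, the interval for the chess group is more than 30 percentage points wide, so the headline percentage is a very imprecise estimate even before considering who chose to join.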

The studies we couldn’t find

The Chess-in-the-Schools website states:

In 1991 and 1996, Stuart M. Margulies, Ph.D., a noted educational psychologist, conducted two studies examining the effects of chess on children’s reading scores. The studies demonstrated that students who participated in the chess program showed improved scores on standardized tests. The gains were even greater among children with low or average initial scores. Children who were in the non-chess playing control group showed no gains.

Another study in 1999, measured the impact of chess on the emotional intelligence of fifth graders. The results of the study were striking. The overall success rate in handling real life situations with emotional intelligence was 91.4% for the children who participated in the Chess-in-the-Schools program. In contrast, those who were not involved with the chess program had an average overall success rate of only 64.4%.

We’re guessing that the study we’re looking at is an update of the 1991 study since it references no previous studies and discusses results from 1991 and 1992. We can’t find the other studies anywhere. Chess-in-the-Schools provides neither links nor citations.

Even in the best-case scenario, it’s apparently been at least a decade since the last test of the Chess-in-the-Schools model.

“Chess helps promote intellectual growth and has been shown to improve academic performance?”

In researching charities, one of the more discouraging things we’ve learned is how little support it takes for a statement like “Chess helps promote intellectual growth and has been shown to improve academic performance” to be repeated by charities, donors, and even the media.

As far as we can tell, Chess-in-the-Schools is not a demonstrated success story. It’s just been promoted and scaled up like one.

October 21st, 2009

Agriculture charity evaluation: incomes boosted are not the same as lives changed

What’s wrong with this “evidence of impact” for high-profile charities?

Among other possible problems, two major issues jump out:

1. No context on what “normal” variation in incomes looks like for poor farmers. Some years have more favorable weather - and local economic situations - than others. Enough that one year’s income or crop yield could be double another’s? 4x? 20x?

Unfortunately, one of the better pieces of “evidence” that jumps to mind is a 75-year-old novel, The Good Earth, whose farmer protagonist is comfortable one year and has literally zero income the next, for no other reason than the weather. If a given year’s yield were close enough to zero, the next year could be a huge increase (2x, 4x, 20x or more) simply by returning to normal.
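
The arithmetic behind that last sentence is simple enough to sketch. The yields below are hypothetical numbers in arbitrary units; the point is only that a return to a normal year after a bad one produces a large multiple with no program effect at all.

```python
def rebound_multiple(bad_year_yield, normal_yield):
    """Year-over-year multiple when yield returns to normal after a bad year."""
    return normal_yield / bad_year_yield


NORMAL_YIELD = 100  # hypothetical normal-year yield

# Bad years at 50%, 25% and 5% of the normal yield.
multiples = {bad: rebound_multiple(bad, NORMAL_YIELD) for bad in (50, 25, 5)}

for bad, multiple in multiples.items():
    print(f"bad year at {bad}% of normal -> next year is {multiple:.0f}x")
```

A report that shows a "4x increase in yields" therefore tells us little unless we also know how the baseline year compared with a typical one.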

I have seen little information on the local year-to-year volatility that poor farmers can experience, but I imagine that it (a) varies greatly from region to region and (b) could easily involve incomes falling and jumping by enormous amounts.

None of the above reports provide any context on this question, beyond qualitative statements about how favorable the rains were in each year examined. None of them employ any sort of “comparison group” of farmers (aside from one vague reference to “farms not using improved seeds and fertilizers” in the Malawi Millennium Village). Ultimately, none accomplish one of the most basic goals of an evaluation: giving a sense of how likely the “gains” they describe are to have arisen by pure chance.

With larger sample sizes, we might be able to use country-level volatility for context. But that brings me to the next problem.

2. We have no assurance that the described gains are representative, as opposed to “cherry-picked.”

All of the above organizations have reputations for consistent and thorough monitoring and evaluation, yet in all cases, we find ourselves looking for “impact” from a tiny subset of their projects.

Some ways to produce more compelling evidence of impact

  1. Be clear about what is being measured and what is being published, and when. It seems to us that in this area, charity evaluation lags far behind clinical trials, which are constantly registered before they are complete so people can track their progress. (The Poverty Action Lab is similarly transparent with its own ongoing projects.)
  2. More sample size; more context; use of comparison groups. Discussed above.
  3. Look for more sustained improvements in people’s lives. One measure I find superior to straight “income” or “crop yields” is asset accumulation. A jump in income could be temporary; if someone upgrades their roof or sanitation, it’s likely that at least they expect the gain to be a real and lasting one. The Village Enterprise Fund’s evaluation is one of the better charity evaluations I’ve seen in the area of economic empowerment, partly because it focuses on standard of living rather than a simple measure of income.

*It’s possible that the yields mentioned are for “clusters” of villages rather than individual villages; there are only 12 clusters. However, the source documents available for Sauri and Koraro appear to be at the village rather than the cluster level, and the details of how the measurements were made are unclear.