Our Principles for Assessing Evidence

For several years now we’ve been writing up our thoughts on the evidence behind particular charities and programs, but we haven’t written a great deal about the general principles we follow in distinguishing between strong and weak evidence. This post will:

  • Lay out the general properties that we think make for strong evidence: relevant reported effects, attribution, representativeness, and consonance with other observations.
  • Discuss how these properties apply to several common kinds of evidence: anecdotes, awards/recognition/reputation, testimony, “micro” data, and “macro” data.

This post focuses on broad principles that we apply to all kinds of “evidence,” not just studies. A future post will go into more detail on “micro” evidence (i.e., studies of particular programs in particular contexts), since this is the type of evidence that has generally been most prominent in our discussions.

General properties that we think make for strong evidence

We look for outstanding opportunities to accomplish good, and accordingly, we generally end up evaluating charities that make (or imply) relatively strong claims about the impact of their activities on the world. We think it’s appropriate to approach such claims with a skeptical prior and thus to require evidence in order to put weight on them. By “evidence,” we generally mean observations that are more easily reconciled with the charity’s claims about the world and its impact than with our skeptical default/“prior” assumption.
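
One way to make the idea of a skeptical prior concrete is Bayes' rule in odds form: an observation counts as evidence to the extent that it is more likely if the charity's claim is true than if it is false. The sketch below is only an illustration under that assumption, not a calculation described in this post; the function name `posterior_probability` and all numbers are hypothetical.

```python
def posterior_probability(prior, likelihood_ratio):
    """Update a prior probability using Bayes' rule in odds form.

    prior: probability (strictly between 0 and 1) that the claim is true
        before seeing the observation.
    likelihood_ratio: how many times more likely the observation is if the
        claim is true than if it is false (hypothetical input).
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    if likelihood_ratio < 0.0:
        raise ValueError("likelihood_ratio must be non-negative")

    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


# A skeptical prior of 10% that the charity's claim is true.
skeptical_prior = 0.10

# An observation equally likely either way (ratio 1) should not move us.
unmoved = posterior_probability(skeptical_prior, 1.0)

# An observation nine times more likely if the claim is true moves us a lot.
updated = posterior_probability(skeptical_prior, 9.0)
```

With these illustrative inputs, a likelihood ratio of 1 leaves the probability at 0.10, while a ratio of 9 raises it to about 0.50. An observation that fits the skeptical default just as well as it fits the charity's claim carries no weight.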

To us, the crucial properties of such evidence are:

  • Relevant reported effects. Reported effects should be plausible as outcomes of the charity’s activities and consistent with the theory of change the charity is presenting; they should also ideally get to the heart of the charity’s case for impact (for example, a charity focused on economic empowerment should show that it is raising incomes and/or living standards, not just that it is carrying out, e.g., agricultural training).
  • Attribution. Broadly speaking, the observations submitted as evidence should be easier to reconcile with the charity’s claims about the world than with other possible explanations. If a charity simply reports that its clients have higher incomes/living standards than non-participants, this could be attributed to selection bias (perhaps higher incomes cause people to be more likely to participate in the charity’s program, rather than the charity’s program causing higher incomes), or to data collection issues (perhaps clients are telling surveyors what they believe the surveyors want to hear), or to a variety of other factors.

    The randomized controlled trial is seen by many – including us – as a leading method (though not the only one) for establishing strong attribution. By randomly dividing a group of people into “treatment” (people who participate in a program) and “control” (people who don’t), a researcher can make a strong claim that any differences that emerge between the two groups can be attributed to the program.

  • Representativeness. We ask, “Would we expect the activities enabled by additional donations to have similar results to the activities that the evidence in question applies to?” In order to answer this well, it’s important to have a sense of a charity’s room for more funding; it’s also important to be cognizant of issues like publication bias and ask whether the cases we’re reviewing are likely to be “cherry-picked.”
  • Consonance with other observations. We don’t take studies in isolation: we ask about the extent to which their results are credible in light of everything else we know. This includes asking questions like “Why isn’t this intervention better known if its effects are as good as claimed?”
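
The attribution point above (selection bias versus random assignment) can be illustrated with a small simulation. This is a minimal sketch on made-up data, not an analysis of any real program: the income distribution, the true effect of 5, the self-selection rule, and the function name `simulate` are all assumptions chosen for illustration.

```python
import random
from statistics import mean

TRUE_EFFECT = 5.0  # hypothetical income gain actually caused by the program


def simulate(n=20000, seed=0):
    """Return (naive_estimate, randomized_estimate) of the program effect."""
    rng = random.Random(seed)
    # Hypothetical baseline incomes before anyone joins the program.
    baseline = [rng.gauss(100.0, 20.0) for _ in range(n)]

    # Scenario 1: self-selection. People with above-average baseline income
    # are the ones who join, so participants differ before the program starts.
    joined = [b > 100.0 for b in baseline]
    naive = (
        mean(b + TRUE_EFFECT for b, j in zip(baseline, joined) if j)
        - mean(b for b, j in zip(baseline, joined) if not j)
    )

    # Scenario 2: random assignment. A coin flip decides who is treated, so
    # the two groups have similar baseline incomes on average.
    treated = [rng.random() < 0.5 for _ in baseline]
    randomized = (
        mean(b + TRUE_EFFECT for b, t in zip(baseline, treated) if t)
        - mean(b for b, t in zip(baseline, treated) if not t)
    )
    return naive, randomized


naive_estimate, randomized_estimate = simulate()
print(f"naive comparison:  {naive_estimate:.1f}")
print(f"random assignment: {randomized_estimate:.1f}")
```

Under these assumptions the naive participant-versus-non-participant comparison comes out far above the true effect of 5, because it mixes the program's effect with pre-existing income differences, while the randomized comparison lands close to 5.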

Common kinds of evidence

  • Anecdotes and stories – often of individuals directly affected by charities’ activities – are the most common kind of evidence provided by charities we examine. We put essentially no weight on these, because (a) we believe the individuals’ stories could be exaggerated or misrepresented (either by the individuals, seeking to tell charity representatives what they want to hear and print, or by the charity representatives responsible for editing and translating individuals’ stories); and (b) we believe the stories are likely “cherry-picked” by charity representatives and thus not representative. Note that we have written in the past that we would be open to taking individual stories as evidence, if our “representativeness” concerns were addressed more effectively.
  • Awards, recognition, reputation. We feel that one should be cautious and highly context-sensitive in deciding how much weight to place on a charity’s awards, endorsements, reputation, etc. We have long been concerned that the nonprofit world rewards good stories, charismatic leaders, and strong performance on raising money (all of which are relatively easy to assess) rather than rewarding positive impact on the world (which is much harder to assess). We also suspect that in many cases, a small number of endorsements can quickly snowball into a large number, because many in the nonprofit world (having little else with which to assess a charity’s impact) base their own endorsements more or less exclusively on others’ endorsements. Because of these issues, we think this sort of evidence often is relatively weak on the criteria of “relevant reported effects” and “attribution.”

    We certainly feel that a strong reputation or referral is a good sign, and provides reason to prioritize investigating a charity; furthermore, there are particular contexts in which a strong reputation can be highly meaningful (for example, a hospital that is commonly visited by health professionals and has a strong reputation probably provides quality care, since it would be hard to maintain such a reputation if it did not). That said, we think it is often very important to try to uncover the basis for a charity’s reputation, and not simply rely on the reputation itself.

  • Testimony. We see value in interviewing people who are well-placed to understand how a particular change took place, and we have been making this sort of evidence a larger part of our process (for example, see our reassessment of VillageReach’s pilot project). When assessing this sort of evidence, we feel it is important to assess what the person in question is and isn’t well-positioned to know, and whether they have incentive to paint one sort of picture or another. How the person was chosen is another factor: we generally place more weight on the testimony of people we’ve sought out (using our own search process) than on the testimony of people we’ve been connected to by a charity looking to paint a particular picture.
  • “Micro” data. We often come across studies that attempt to use systematically collected data to argue that, e.g., a particular program improved people’s lives in a particular case. The strength of this sort of evidence is that researchers often put great care into the question of “attribution,” trying to establish that the observed effects are due to the program in question and not to something else. (“Attribution” is a frequent weakness of the other kinds of evidence listed here.) The strength of the case for attribution varies significantly, and we’ll discuss this in a future post.

    When examining “micro” data, we often have concerns around representativeness (is the case examined in a particular study representative of a charity’s future activities?) and around the question of relevant reported effects (these sorts of studies often need to quantify things that are difficult to quantify, such as standard of living, and as a result they often use data that may not capture the full reality of what happened).

  • “Macro” data. Some of the evidence we find most impressive is empirical analysis of broad (e.g., country-level) trends. While this sort of evidence is often weaker on the “attribution” front than “micro” data, it is often stronger on the “representativeness” front.

In general, we think the strongest cases use multiple forms of evidence, some addressing the weaknesses of others. For example, immunization campaigns are associated with both strong “micro” evidence (which shows that intensive, well-executed immunization programs can save lives) and “macro” evidence (which shows, less rigorously, that real-world immunization programs have led to drops in infant mortality and the elimination of various diseases).

Comments

Our Principles for Assessing Evidence — 5 Comments

  1. Hi, I am very encouraged by the great work you are doing to support the vulnerable members of society. I run a children’s home which is extending support to children affected by and infected with HIV and AIDS. We do this by providing education, etc., though we have had a lot of financial constraints. Can I send our proposal, mission, vision, and all we do for you to go through?

  2. I just posted the following comment (below the asterisks) at the above website. We’ll see if they post it. You folks at GiveWell need to do a lot better thinking about evidence and evaluation and get some more training in the differences among (1) internal validity, (2) external validity, (3) construct validity of the intervention, (4) construct validity of the measured variables, and (5) construct validity of the situation in which a particular intervention takes place. Methods other than RCTs are crucial to all of these. Evaluation is not as simple as your rhetoric suggests.

    Please, please stop saying silly stuff about disaster relief on your website. It makes you look unsophisticated.

    ***

    GiveWell seems pretty robotic to me, exemplifying the worst tendencies of the randomista crowd. They fell in love with RCTs, and interventions that are less amenable to that evaluation model become disfavored. Silliness; tail wagging the dog. Here’s what they say about famine relief (where there are unmet needs and the counterfactual is pretty simple… people go hungry in the context of food scarcity).

    http://www.givewell.org/international/disaster-relief

    “We feel that disaster relief may not be the ideal cause for people seeking to accomplish as much good as possible with their donations (more below). However, we feel it is an important cause because it is so emotionally compelling and because donors often have so little to go on in making a decision.”

    By all means, we’d rather have hungry Syrians than an intervention that doesn’t fit our preferred evaluation approach.

  3. Jon, I don’t think your portrayal of GiveWell is accurate. We are open to a variety of methodologies and do not only consider RCTs, though we consider RCTs to have major advantages. We stand by our statement on disaster relief, particularly for the high-profile disasters that people generally come to our website looking for advice on; we think there are ways to accomplish more good per dollar, and this is due to a holistic consideration of what we know.

  4. I’m afraid you don’t show sophistication on issues of external validity or construct validity, which are crucial for assessments of where to allocate the marginal dollar. Under the rubric of external validity, which has everything to do with assessments of the cost-effectiveness of the marginal dollar, you say “We’d guess that the programs taking place in studies are often unusually high-quality, in terms of personnel, execution, etc.” You’d guess? That’s not a very robust method for assessments of potential replication and scale-up. You look at “unusual conditions”? In practice, many other evaluation methods and highly contextual factors are involved in questions of whether to replicate or scale up a particular intervention in a particular circumstance and time, aside from the oversimplified randomistas, who tend to take external validity for granted. Do you think that is how doctors take account of studies that find *average* impacts in narrow populations, or populations in other places, when thinking about how to apply available evidence to decisions about individual patients? On the topic of disaster relief, you show a values-based rather than evidence-based preference for preventive interventions compared to therapeutic interventions. If you were to translate your preferences to policy, there would be no resources dedicated to disaster relief, including famine relief. I’m afraid that would result in a lot of deaths. These are among the issues that I find troubling when well-meaning but inexperienced and under-trained people start to pontificate about these things if they have not done their homework. Unfortunately, because you bring a strong randomista bias to your work, the randomista movement embraces your efforts.

  5. Jon – thanks for continuing the discussion. I want to address two issues:

    • External validity. We agree with the statement that “many other evaluation methods and highly contextual factors are involved in questions of whether to replicate or scale up a particular intervention in a particular circumstance and time,” and we apply it in our work. Looking at our reviews of, e.g., the Against Malaria Foundation or the Deworm the World Initiative, these kinds of contextual considerations play a huge role, and we try to understand them to the best of our ability.
    • Disaster relief and marginal funds. It is not the case that “If you were to translate [our] preferences to policy, there would be no resources dedicated to disaster relief, including famine relief.” GiveWell aims to direct marginal funding, rather than to dictate the overall flow of funds in the world. We believe that there are more cost-effective uses of funds at the current margin than additional support for famine relief, but it’s certainly not the case that we believe the optimal amount of funding for disaster relief is 0. A similar principle applies in considering the benefits of disaster relief versus preparedness: disaster relief currently receives far more funding than preparedness (PDF pages 6-7), so I would guess additional funding for preparedness is likely to be more cost-effective than additional funding for relief, absent evidence to the contrary. I don’t have a “values-based preference” for preventive interventions.

    Hope this helps clarify some of our views.