The GiveWell Blog

Updated thoughts on our key criteria

For years, the three key things we’ve looked for in a charity have been (a) evidence of effectiveness; (b) cost-effectiveness; (c) room for more funding. Over time, however, our attitude toward all three of these things – and the weight that we should put on our analysis of each – has changed. This post discusses why.

  • On the evidence of effectiveness front, we used to look for charities that collected their own data that could make a compelling case for impact. We no longer expect to see this in the near future. We believe that the best evidence for effectiveness is likely to come from independent literature (such as academic studies). We believe that if a program does not have a strong independent case, there is unlikely to be a charity that can demonstrate impact with such a program.
  • We have continually lowered our expectations for how large a role cost-effectiveness analysis will play in our decisions. We still believe that doing such analysis is worthwhile when possible – partly because of the questions it raises – but we believe the cases where it can meaningfully distinguish between two interventions are limited.
  • We have continually raised our expectations for how large a role “room for more funding” analysis will play in our decisions. Questions around “room for more funding” are now frequently the first – and most core – questions we ask about a giving opportunity.

Evidence for effectiveness
In our 2007-2008 search for outstanding charities, we took applications and asked charities to make their own case for impact. In 2009, we identified evidence-backed “priority programs” using independent literature, but still actively looked for charities (even outside these programs) with their own evidence of effectiveness. In 2011, we continued this hybrid approach.

In all of these searches, we’ve found very little in the way of “charities demonstrating effectiveness using their own data.”

We believe the underlying dynamic is as follows:

  • Evidence on these sorts of interventions is very difficult and expensive to collect.
  • It’s particularly difficult to collect such evidence in a way that addresses concerns that we believe to be very common and important in the context of evaluating charitable programs.
  • Studies that can adequately address these issues are generally “gold-standard” studies, and are therefore of general interest (and can be found by searching independent/academic literature).

Accordingly, our interest in “program evaluation” – the work that charities do to systematically and empirically evaluate their own programs – has greatly diminished. We are skeptical of the value of studies that fall below the “gold standard” bar that usually accompanies high-reputation independent literature.

This shift in our thinking has greatly influenced how our process works and what we expect it to find. Rather than putting a lot of time into scanning charities’ websites for empirical evidence, as we did previously, we now are focused on identifying the evidence-backed interventions, then finding the vehicles by which donors can fund these interventions.

Cost-effectiveness
The ultimate goal of a GiveWell recommendation is to help a donor accomplish as much good as possible, per dollar spent. Accordingly, we have long been interested in trying to estimate how much good is accomplished per dollar spent, in terms such as lives saved per dollar or DALYs averted per dollar.

Over the years, we’ve put a lot of effort into this sort of analysis, and learned a lot about it. In particular:

  • In sectors outside of global health and nutrition, it is generally impractical to connect measurable outcomes to meaningful outcomes (for example, we may observe that an education program raises test scores, but it is very difficult to connect this to something directly related to improvements in quality of life). Not surprisingly, the vast majority of attempts to do cost-effectiveness analysis (including both GiveWell’s attempts and others’ attempts) have been in the field of global health and nutrition.
  • Within global health and nutrition, even the most prominent, best-resourced attempts at cost-effectiveness analysis have had questionable quality and usefulness.
  • Our own attempts to do cost-effectiveness analysis have turned out to be very sensitive to small variations in basic assumptions. Such sensitivity is directly relevant to how much weight we should put on such estimates in decision-making.
  • That said, we continue to find cost-effectiveness analysis to be very useful when feasible, partly because it is a way of disciplining ourselves to make sure we’ve addressed every input and question that matters on the causal chain between interventions (e.g., nets) and morally relevant outcomes (e.g., lives saved). In addition, cost-effectiveness analysis can be useful for extreme comparisons, identifying interventions that are extremely unlikely to have competitive cost-effectiveness (for example, see our comparison of U.S. and international aid).
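The sensitivity point above can be made concrete with a toy cost-per-life-saved calculation. This is a sketch with entirely invented numbers (the function, its inputs, and its parameter values are all hypothetical, not GiveWell’s actual model or estimates):

```python
# Toy cost-effectiveness model. Every input is invented for illustration;
# none are GiveWell's actual figures.
def cost_per_life_saved(cost_per_net, nets_per_person_year,
                        baseline_mortality, mortality_reduction, usage_rate):
    """Estimated cost per child death averted by a hypothetical net program."""
    deaths_averted_per_person_year = (
        baseline_mortality * mortality_reduction * usage_rate
    )
    cost_per_person_year = cost_per_net * nets_per_person_year
    return cost_per_person_year / deaths_averted_per_person_year

# Base case: $5 nets, half a net per person-year, 0.5% baseline mortality,
# 20% mortality reduction, 80% of distributed nets actually used.
base = cost_per_life_saved(5.0, 0.5, 0.005, 0.20, 0.80)

# Change a single assumption (usage falls from 80% to 40%) and the
# bottom line doubles; the estimate is highly sensitive to its inputs.
low_usage = cost_per_life_saved(5.0, 0.5, 0.005, 0.20, 0.40)
```

Halving one plausible-looking input doubles the headline figure, which is one reason we treat such estimates as rough guides rather than precise rankings.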

While we still intend to work hard on cost-effectiveness analysis, and we still see value in it, we do not see it as holding out much promise for helping to resolve difficult decisions between one giving opportunity and another. We find other criteria to be easier to make distinctions on – criteria such as strength of evidence (discussed above) and room for more funding (discussed below).

Room for more funding
For the first few years of our history, we knew that the issue of room for more funding was important, but we made little headway on figuring out how to assess it. We tried asking charities directly how additional dollars would be used, but didn’t receive very helpful answers (see applications received for our 2007-2008 process).

In 2010, as a result of substantial conversations with VillageReach, we developed the basic approach of scenario analysis, and since then we’ve used this approach to reach some surprising conclusions, such as the lack of short-term room for more funding for the Nurse-Family Partnership and recommending KIPP Houston rather than the KIPP Foundation due to “room for more funding” issues.

By now, room for more funding is in some ways the “primary” criterion we look at, in the sense that it’s often the first thing we ask for and sits at the core of our view on an organization. This is because

  • Asking “what activities additional dollars would allow” determines what activities we focus on evaluating.
  • Many of the charities and programs that may seem to have the most “slam-dunk” case for impact also seem – not surprisingly – to have their funding needs already met by others. We’ve found it relatively challenging to find activities that are both highly appealing and truly underfunded.
  • In the absence of reliable explicit cost-effectiveness analysis, an alternative way of maximizing impact is to look for the most appealing activities that have funding gaps. The analytical, “sector-agnostic” approach we bring to giving seems well-suited to doing so in a way that other funders can’t or won’t.

Many people – including us early in our history – may be inclined to think that maximizing impact consists of laying out all the options, estimating their quantified impact-per-dollar, and ranking them. We’ve seen major limitations to this approach (though we still utilize it). We’ve also, however, come across another way of thinking about maximizing impact: finding where one can fit into the philanthropic ecosystem such that one is funding the best work that others won’t.

Surveying the research on a topic

We’ve previously discussed how we evaluate a single study. For the questions we try to answer, though, it’s rarely sufficient to consult a single study; studies are specific to a particular time, place, and context, and to get a robust answer to a question like “Do insecticide-treated nets reduce child mortality?” one should conduct – or ideally, find – a thorough and unbiased survey of the available research. Doing so is important: we feel it is easy (and common) to form an inaccurate view based on a problematic survey of research.

This post discusses what we feel makes for a good literature review: a report that surveys the available research on a particular question. Our preferred way to answer a research question is to find an existing literature review with strong answers to these questions; when necessary, we conduct our own literature review with the same questions in mind.

Our key questions for a literature review

  • What are the motivations of the literature reviewer? A biased survey of research can easily lead to a biased conclusion, if the reviewer is selective about which studies to include and which to focus on. We are generally highly wary of literature reviews commissioned by charities (for example, a 2005 survey of studies on microfinance commissioned by the Grameen Foundation) or advocacy groups. We prefer reviews that are done by parties with no obvious stake in coming to one recommendation or another, and with a stake in maintaining a reputation for neutrality (these can, in appropriate cases, include government agencies as well as independent groups such as the Cochrane Collaboration).
  • How did the literature reviewer choose which studies to include? Since one of the ways a literature review can be distorted is through selective inclusion of studies, we take interest in the question of whether it has included all (and only) sufficiently high-quality studies that bear on the question of interest. In some cases, there are only a few high-quality studies available on the question of interest, such that the reviewer can discuss each study individually, and the reader can hold the reviewer accountable if s/he knows of another high-quality study that has been left out. However, for a topic like the impact of insecticide-treated nets on malaria, there may be many high-quality studies available. In these cases, we prefer literature reviews in which the reviewer is clear about his/her search protocol, ideally such that the search could be replicated by a reader.
  • How thoroughly and consistently does the literature review discuss the strengths and weaknesses of each study? As we wrote previously, studies can vary a great deal in quality and importance. When we see a literature review simply asserting that a particular study supports a particular claim – without discussing the strengths and weaknesses of this study – we consider it a low-quality literature review and do not put weight on it. In our view, a good literature review is one that provides a maximally thorough, consistent, understandable summary of the strengths and weaknesses of each study it includes.
  • Does the literature review include meta-analysis, attempting to quantitatively combine the results of several studies? In some cases it is possible to perform meta-analysis: combining the results from multiple studies to get a single “pooled” quantitative result. In other cases a literature review limits itself to summarizing the strengths and weaknesses of each study reviewed and giving a qualitative conclusion.
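As a sketch of what quantitative pooling involves, here is the standard fixed-effect (inverse-variance) meta-analysis calculation, applied to invented study results (the effect sizes and standard errors below are made up for illustration):

```python
import math

# Fixed-effect (inverse-variance) meta-analysis: weight each study's effect
# by the inverse of its variance, so more precise studies count for more.
def pool_fixed_effect(effects, std_errors):
    """Return (pooled effect, pooled standard error)."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se

effects = [0.30, 0.10, 0.20]     # three hypothetical trials of one program
std_errors = [0.10, 0.05, 0.10]  # the middle study is the most precise

pooled, pooled_se = pool_fixed_effect(effects, std_errors)
# The pooled estimate sits closest to the most precise study, and the
# pooled standard error is smaller than any single study's.
```

This is why a pooled result can be more informative than any individual study, provided the studies included are themselves sound and comparable.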

Strong and weak literature reviews
In general, we feel that the Cochrane Collaboration performs strong literature reviews by the criteria above. Examples of its reviews include a review we discussed previously on deworming and a review on insecticide-treated nets to protect against malaria.

  • The Cochrane Collaboration is an independent group that aims to base its brand on unbiased research, and does not take commercial funding.
  • Cochrane reviews generally explicitly lay out their search strategy and selection criteria in their summaries.
  • Cochrane reviews generally list all of the studies considered along with relatively in-depth discussions of their methodology, strengths and weaknesses (full text is required to see these).
  • Cochrane reviews generally perform quantitative meta-analysis and include the conclusions of such analysis in their summaries.

An example of a more problematic literature review is King, Dickman and Tisch 2005, cited in our report on deworming. This review does well on some of our criteria: it is clear about its search and inclusion criteria (see Figure 1 on page 1562), and it performs quantified meta-analysis (see Table 1 on page 1565). However,

  • It provides a list of all studies included, but unlike the Cochrane reviews we’ve seen, it does not provide any information for these studies (methodology, sample size, etc.) other than the reference.
  • It does not discuss individual studies’ strengths and weaknesses at all.
  • It does not make it possible for the reader to connect the study’s conclusions (in Table 5) to specific studies. (Figures 2-4 break down a few, but not all, of the study’s conclusions with lists of individual studies.) Since over 100 studies were included, we do not see a practical way for a reader to vet the literature review’s conclusions.
  • There is also ambiguity in what the reported conclusions mean: for example, Table 5 does not specify whether it is examining the impact of deworming on the level or change of each listed outcome (i.e., impact on weight vs. impact on change in weight over time).

We have at times seen advocacy groups and/or foundations put out literature reviews that are far more flawed than the study discussed above. Though we generally don’t keep track of these, we provide one example, a paper entitled “What can we learn from playing interactive games?” A representative quote from this paper:

There is also evidence that game playing can improve cognitive processing skills such as visual discernment, which involves the ability to divide visual attention and allocate it to two or more simultaneous events (Greenfield et al., 1994b); parallel processing, the ability to engage in multiple cognitive tasks simultaneously (Gunter, 1998); and other forms of visual discrimination including the ability to process cluttered visual scenes and rapid sequences of images (Riesenhuber, 2004). Experiments have also found improvements in eye-hand coordination after playing video games (Rosenberg et al., 2005).

The paper does not discuss the selection, inclusion, strengths, or weaknesses of studies, or even their basic design and the nature/magnitude of their findings (for example, how is “parallel processing” measured?).

All else equal, we would prefer a world in which all literature reviews were more like Cochrane reviews than like the more problematic reviews discussed above. However, it’s worth noting that Cochrane reviews appear to be quite expensive, upwards of $100,000 each. Conducting a truly thorough and unbiased literature review is not necessarily easy or cheap, but we feel it is often necessary to get an accurate picture of what the research says on a given question.

How we evaluate a study

We previously wrote about our general principles for assessing evidence, where “evidence” is construed broadly (it may include awards/recognition/reputation, testimony, and broad trends in data as well as formal studies). Here we discuss our approach to a particular kind of evidence, what we call “micro data”: formal studies of the impact of a program on a particular population at a particular time, using quantitative data analysis.

We list several principles that are important to us in deciding how much weight to put on a study’s claims. A future post will discuss the application of these principles to some example studies.

Causal attribution
A study of a charity’s impact will generally highlight a particular positive change in the data – for example, improved school attendance or fewer health problems among children who were dewormed. One of the major challenges of a study is to argue that such a change was caused by the program being studied, as opposed to other factors. Many studies make simple before-and-after comparisons, which may conflate program effects with other unrelated changes over time (for example, generally improving wealth/education/sanitation/etc.). Many studies make simple participant-to-non-participant comparisons, which can face a significant problem of selection bias: the people who are chosen to participate in a program, or who choose to participate in a program, may be different from non-participants in many ways, so differences may emerge that can’t be attributed to the program.

One way to deal with the problem of causal attribution is via randomization. A randomized controlled trial (in this context) is a study in which a set of people is identified as potential program participants, and then randomly divided into one or more “treatment group(s)” (group(s) participating in the program in question) and a “control group” (a group that experiences no intervention). When this is done, it is generally presumed that any sufficiently large differences that emerge between the treatment and control groups were caused by the program.
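The logic of randomization can be sketched in a short simulation. Everything below – the population size, the outcome model, and the effect size – is invented purely for illustration:

```python
import random

# Randomly divide a population, treat one half, and compare average outcomes.
random.seed(0)

population = list(range(10000))
random.shuffle(population)            # random assignment: the key step
treatment = set(population[:5000])    # treated half; the rest are controls

TRUE_EFFECT = 2.0                     # the effect built into the simulation

outcomes = {}
for person in population:
    baseline = random.gauss(50, 10)   # pre-existing variation, unrelated to assignment
    outcomes[person] = baseline + (TRUE_EFFECT if person in treatment else 0.0)

treat_mean = sum(outcomes[p] for p in treatment) / 5000
control_mean = sum(outcomes[p] for p in population if p not in treatment) / 5000
estimate = treat_mean - control_mean  # recovers roughly the built-in effect
```

Because assignment is random, pre-existing differences wash out on average, and the difference in group means is an unbiased estimate of the program’s effect.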

Many, including us, consider the randomized controlled trial to be the “gold standard” in terms of causal attribution. However, there are often cases in which randomized controlled trials are politically, financially or practically non-feasible, and there are a variety of other techniques for attributing causality, including:

  • Instrumental variables. An “instrumental variable” is a variable that affects the outcome of interest (for example, income) only through its impact on the intervention/program of interest (for example, access to schooling). An example of such an approach is Duflo 2001, which examines a large-scale government school construction program; it reasons that people who lived in districts that the program reached earlier got better access to education, through a “luck of the draw” that could be thought of as similar to randomization, so any other differences between people who lived in such districts and other people could fairly be attributed to differences in access to education, rather than other differences. We are open to the possibility of a compelling instrumental-variables study, but in practice, we see very few instrumental variables that plausibly meet this criterion, and many that seem very questionable. For example, a paper by McCord, Conley and Sachs uses malaria ecology as an instrument for mortality, implying that the only way malaria ecology could affect the outcome of interest (fertility) is through its impact on mortality. However, Sachs has elsewhere argued that malaria ecology affects people in many ways other than through mortality, and we believe this to be the case.
  • Regression discontinuity. Sometimes there is a relatively arbitrary “cutoff point” for participation in a program, and a study may therefore compare people who “barely qualify” with people who “barely fail to qualify,” along the lines of this study on giving children vouchers to purchase computers. We believe this method to be a relatively strong method of causal attribution, but (a) there tend to be major issues with external validity, since comparing “people who barely qualified with people who barely failed to qualify” may not give results that are representative of the whole population being served; (b) this methodology appears relatively rare when it comes to the topics we focus on.
  • Using a regression to “control for” potential confounding variables. We often see studies that attempt to list possible “confounders” that could serve as alternative explanations for an observed effect, and “control” for each confounder using a regression. For example, a study might look at the relationship between education and later-life income, recognize that this relationship might be misleading because people with more education may have more income to begin with, and therefore examine the relationship between education and income while “controlling for” initial income. We believe that this approach is very rarely successful in creating a plausible case for causality. It is difficult to name all possible confounders and more difficult to measure them; in addition, the idea that such confounders are appropriately “controlled for” usually depends on subtle (and generally unjustified) assumptions about the “shape” of relationships between different variables. Details of our view are beyond the scope of this post, but we recommend Macro Aid Effectiveness Research: A Guide for the Perplexed (authored by David Roodman, whom we have written about before) as a good introduction to the common shortcomings of this sort of analysis.
  • Visual and informal reasoning. Researchers sometimes make informal arguments about the causal relationship between two variables, e.g., by using visual illustrations. An example of this: the case for VillageReach includes a chart showing that stock-outs of vaccines fell dramatically during the course of VillageReach’s program. Though no formal techniques were used to isolate the causal impact of VillageReach’s program, we felt at the time of our VillageReach evaluation that there was a relatively strong case in the combination of (a) the highly direct relationship between the “stock-outs” measure and the nature of VillageReach’s intervention and (b) the extent and timing of the drop in stock-outs, when juxtaposed with the timing of VillageReach’s program. (We have since tempered this conclusion.) We sometimes find this sort of reasoning compelling, and suspect that it may be an under-utilized method of making compelling causal inferences.

Publication bias
We’ve written at length about publication bias, which we define as follows:

“Publication bias” is a broad term for factors that systematically bias final, published results in the direction that the researchers and publishers (consciously or unconsciously) wish them to point.

Interpreting and presenting data usually involves a substantial degree of judgment on the part of the researcher; consciously or unconsciously, a researcher may present data in the most favorable light for his/her point of view. In addition, studies whose final conclusions aren’t what the researcher (or the study funder) hoped for may be less likely to be made public.

Publication bias is a major concern of ours. As non-academics, we aren’t easily able to assess the magnitude and direction of this sort of bias, but we suspect that there are major risks anytime there is a combination of (a) a researcher who has an agenda/”preferred outcome”; (b) a lot of leeway for the researcher to make decisions that aren’t transparent to the reader. We’d guess that both (a) and (b) are very common.

When we evaluate a study, we consider the following factors, all of which bear on the question of how worried we should be that the paper reflects “the conclusions the researcher wanted to find” rather than “the conclusions that the data, impartially examined, points to”:

  • What are the likely motivations and hopes of the authors? If a study is commissioned/funded by a charity, the researcher is probably looking for an interpretation that reflects well on the charity. If a study is published in an academic journal, the researcher is likely looking for an interpretation that could be considered “interesting” – which usually means finding “some effect” rather than “no effect” for a given intervention, though there are potential exceptions (for example, it seems to us that the relatively recent studies of microfinance would have been considered “interesting” whether they found strong effects or no effects, since the impacts of microfinance are widely debated).
  • Is the paper written in a neutral tone? Do the authors note possible alternate interpretations of the data and possible objections to their conclusions? When we saw a white paper commissioned by the Grameen Foundation (at the time, the most comprehensive review of the literature on microfinance we could find) making statements like “Unfortunately, rather than ending the debate over the effectiveness of microfinance, Pitt and Khandker’s paper merely fueled the fire” and “The previous section leaves little doubt that microfinance can be an effective tool to reduce poverty” (a statement that didn’t seem true to us), we questioned the intentions of the author, and were more inclined to be pessimistic where details were scarce. In general, we expect a high-quality paper to proactively identify counterarguments and limitations to its findings.
  • Is the study preregistered? Does it provide a link to the full details of its analysis, including raw data and code? As we have previously written, preregistration and data/code sharing are two important tools that can alleviate concerns around publication bias (by making it harder for questionable analysis decisions to go unnoticed). It seems to us that these practices are relatively rare in the field of economics, and less rare in the field of medicine.
  • How many outcomes does the study examine, and which outcomes does it emphasize in its summary? We often see studies that look for an intervention’s effect on a wide range of outcomes, find significant effects only on a few, and emphasize these few without acknowledging (or quantitatively analyzing) the fact that focusing on the “biggest measured effect size” is likely to overstate the true effect size. Preregistration (see above) would alleviate this issue by allowing researchers to credibly claim that the outcome they emphasize was the one they had intended to emphasize all along (or, if it wasn’t, to acknowledge as much). However, even if a study isn’t preregistered, it can acknowledge the issue and attempt to adjust for it quantitatively; studies frequently do not do so.
  • Is the study expensive? Were its data collected to answer a particular question? If a lot of money and attention is put into a study, it may be harder for the study to fall prey to one form of publication bias: the file drawer problem. Most of the field studies we come across involve collecting data on developing-world populations over a period of years, which is fairly expensive, for the purpose of answering a particular question; by contrast, studies that consist simply of analyzing already-publicly-available data, or of experiments that can be conducted in the course of a day (as with many psychology studies), seem to us to be more susceptible to the file-drawer problem.
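The “biggest measured effect” problem described above can be illustrated with a simulation: a treatment with no true effect on any outcome, measured across many outcomes, will reliably produce a large “headline” effect if only the largest measured effect is reported. All parameters below are invented:

```python
import random

# A treatment with NO true effect on any outcome, measured on many outcomes.
# Reporting only the largest measured effect reliably overstates the truth.
random.seed(1)

N_OUTCOMES = 20   # outcomes examined per study
N_STUDIES = 2000  # simulated studies

headline_effects = []
for _ in range(N_STUDIES):
    # Each measured effect is pure noise around the true effect of zero.
    measured = [random.gauss(0.0, 1.0) for _ in range(N_OUTCOMES)]
    headline_effects.append(max(measured))  # the effect that gets emphasized

avg_headline = sum(headline_effects) / N_STUDIES
# avg_headline lands well above zero (around 1.9 noise standard deviations),
# despite a true effect of zero on every outcome.
```

Preregistering a primary outcome, or quantitatively adjusting for the number of outcomes examined, is precisely what guards against this selection effect.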

Other considerations

  • Effect size and p-values. A study will usually report the “effect size” – the size of the effect it is reporting for the program/treatment – in some form, along with a p-value that expresses, roughly speaking, how likely it is that an effect size at least as big as the reported effect size would have been observed, by chance, if the treatment fundamentally had no effect. We find the effect size useful for obvious reasons – it tells us how much difference the program is reported to have made, and we can then put this in context with what we’ve seen of similar programs to gauge plausibility. We find the p-value (and, relatedly, reports of which effects are “statistically significant” – which, in the social sciences, generally means a p-value under 5%) useful for a couple of reasons:
    • Even a very large observed effect (if observed in a relatively small sample) could simply be random variation. We generally emphasize effects with p-values under 5%, which is a rough and common proxy for “unlikely to be random variation.”
    • The p-value tends to be considered important within academia: researchers generally emphasize the findings with p-values under a certain threshold (which varies by field). We would guess that most researchers, in designing their studies, seek to find a sample size high enough that they’ll get a sufficiently low p-value if they observe an effect as large as they hope/expect. Therefore, asking “is the p-value under the commonly accepted threshold?” can be considered a rough way of asking “Did the study find an effect as large as what the researcher hoped/expected to find?”
  • Sample size and attrition. “Sample size” refers to the number of observations in the study, both in terms of how many individuals were involved and how many “clusters” (villages, schools, etc.) were involved. “Attrition” refers to how many of the people originally included in the study were successfully tracked for reporting final outcomes. In general, we put more weight on a study when it has greater sample size and less attrition. In theory, the reported “confidence interval” around an effect size should capture what’s important about the sample size (larger sample sizes will generally lead to narrower confidence intervals, i.e., more precise estimates of effect size). But (a) we aren’t always confident that confidence intervals are calculated appropriately, especially in nonrandomized studies; (b) large sample size can be taken as a sign that a study was relatively expensive and prominent, which bears on “publication bias” as discussed above; (c) more generally, we intuitively see a big difference between a statistically significant impact from a study that randomized treatment between 72 clusters including a total of ~1 million individuals (as an unpublished study on the impact of vitamin A did) and a statistically significant impact from a study that included only 111 children (as the Abecedarian Project did), or a study that compared two villages receiving a nutrition intervention to two villages that did not (as with a frequently cited study on the long-term impact of childhood nutrition).
  • Effects of being studied? We think it’s possible that in some cases, the mere knowledge that one has been put into the “treatment group,” receiving a treatment that is supposed to improve one’s life, could be partly or fully responsible for an observed effect. One mechanism for this issue would be the well-known “placebo effect.” Another is the possibility that people might actively try to get themselves included in the treatment group, leading to a dynamic in which the most motivated or connected people become overrepresented in the treatment group. The ideal study is “double-blind”: neither the experimenters nor the subjects know which people are being treated and which aren’t. “Double-blind” studies aren’t always possible; when a study isn’t blinded, we note this, and ask how intuitively plausible it seems that the outcomes observed could have been due to the lack of blinding.
  • External validity. Most of the points above emphasize “internal validity”: the validity of the study’s claim that a certain effect occurred in the particular time and place that the study was carried out in. However, even if the study’s claims about what happened are fully valid, there is the additional question: “how will the effects seen in this study translate in other settings and larger-scale programs?” We’d guess that the programs taking place in studies are often unusually high-quality, in terms of personnel, execution, etc. (For example, see our discussion of studies on insecticide-treated nets: the formal studies of net distribution programs involved a level of promoting usage that large-scale campaigns do not and cannot include.) In addition, we often note something about a study that indicates that it took place under unusual conditions (for example, a prominent study of deworming took place while El Niño was bringing worm infections to unusually high levels).
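The small-sample point above can also be illustrated with a simulation: with no true effect at all, small studies frequently produce large apparent effects by chance alone. The sample sizes and the 0.5-standard-deviation threshold below are invented:

```python
import random

# With no true effect, how often does a "large" difference between groups
# appear by chance? Small studies: often. Large studies: essentially never.
random.seed(2)

def chance_effect(n_per_group):
    """Difference in means between two groups drawn from the SAME distribution."""
    treat = [random.gauss(0.0, 1.0) for _ in range(n_per_group)]
    control = [random.gauss(0.0, 1.0) for _ in range(n_per_group)]
    return sum(treat) / n_per_group - sum(control) / n_per_group

TRIALS = 2000
large_small_n = sum(abs(chance_effect(10)) > 0.5 for _ in range(TRIALS))
large_big_n = sum(abs(chance_effect(500)) > 0.5 for _ in range(TRIALS))
# With 10 people per group, a 0.5-standard-deviation "effect" appears in
# roughly a quarter of trials; with 500 per group, it essentially never does.
```

This is the intuition behind discounting large effects reported from very small samples, even when they come with an impressive-sounding narrative.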

A note on randomized controlled trials (RCTs)
The merits of randomized controlled trials (RCTs) have been debated, and in particular the question has arisen of whether the RCT should be considered the “gold standard.”

We believe that RCTs have multiple qualities that make them – all else equal – more credible than other studies. In addition to their advantages for causal attribution, RCTs tend to be relatively expensive and to be clearly aimed at answering a particular question, which has advantages regarding publication bias. In today’s social sciences environment – in which preregistration is rare – we think that the property of being an RCT is probably the single most encouraging (easily observed) property a study can have, which has a practical implication: we often conduct surveys of research by focusing/starting on finding RCTs (while also trying to include the strongest and most prominent non-RCTs).

That said, the above discussion hopefully makes it clear that we ask a lot of questions about a study besides whether it is an RCT. There are nonrandomized studies we find compelling as well as randomized studies we don’t find compelling. And we think it’s possible that if preregistration were more common, we’d consider preregistration to be a more important and encouraging property of a study than randomization.

Our principles for assessing evidence

For several years now we’ve been writing up our thoughts on the evidence behind particular charities and programs, but we haven’t written a great deal about the general principles we follow in distinguishing between strong and weak evidence. This post will:

  • Lay out the general properties that we think make for strong evidence: relevant reported effects, attribution, representativeness, and consonance with other observations. (More)
  • Discuss how these properties apply to several common kinds of evidence: anecdotes, awards/recognition/reputation, “micro” data and “macro” data. (More)

This post focuses on broad principles that we apply to all kinds of “evidence,” not just studies. A future post will go into more detail on “micro” evidence (i.e., studies of particular programs in particular contexts), since this is the type of evidence that has generally been most prominent in our discussions.

General properties that we think make for strong evidence
We look for outstanding opportunities to accomplish good, and accordingly, we generally end up evaluating charities that make (or imply) relatively strong claims about the impact of their activities on the world. We think it’s appropriate to approach such claims with a skeptical prior and thus to require evidence in order to put weight on them. By “evidence,” we generally mean observations that are more easily reconciled with the charity’s claims about the world and its impact than with our skeptical default/”prior” assumption.

To us, the crucial properties of such evidence are:

  • Relevant reported effects. Reported effects should be plausible as outcomes of the charity’s activities and consistent with the theory of change the charity is presenting; they should also ideally get to the heart of the charity’s case for impact (for example, a charity focused on economic empowerment should show that it is raising incomes and/or living standards, not just e.g. that it is carrying out agricultural training).
  • Attribution. Broadly speaking, the observations submitted as evidence should be easier to reconcile with the charity’s claims about the world than with other possible explanations. If a charity simply reports that its clients have higher incomes/living standards than non-participants, this could be attributed to selection bias (perhaps higher incomes cause people to be more likely to participate in the charity’s program, rather than the charity’s program causing higher incomes), or to data collection issues (perhaps clients are telling surveyors what they believe the surveyors want to hear), or to a variety of other factors.

    The randomized controlled trial is seen by many – including us – as a leading method (though not the only one) for establishing strong attribution. By randomly dividing a group of people into “treatment” (people who participate in a program) and “control” (people who don’t), a researcher can make a strong claim that any differences that emerge between the two groups can be attributed to the program.
  • Representativeness. We ask, “Would we expect the activities enabled by additional donations to have similar results to the activities that the evidence in question applies to?” In order to answer this well, it’s important to have a sense of a charity’s room for more funding; it’s also important to be cognizant of issues like publication bias and ask whether the cases we’re reviewing are likely to be “cherry-picked.”
  • Consonance with other observations. We don’t take studies in isolation: we ask about the extent to which their results are credible in light of everything else we know. This includes asking questions like “Why isn’t this intervention better known if its effects are as good as claimed?”
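
The “attribution” logic of random assignment can be sketched in a few lines. The data below are simulated and purely illustrative (the baseline incomes, sample sizes, and the “true effect” of 10 units are invented for the sketch, not drawn from any study or charity):

```python
import random

random.seed(0)

# Simulated baseline incomes for a hypothetical population (invented numbers).
population = [random.gauss(100, 15) for _ in range(10_000)]

# Random assignment: each person is equally likely to land in either group,
# so pre-existing differences (selection bias) wash out in expectation.
random.shuffle(population)

TRUE_EFFECT = 10.0  # the hypothetical program raises incomes by 10 units
treatment = [income + TRUE_EFFECT for income in population[:5000]]  # participate
control = population[5000:]                                         # do not

def mean(xs):
    return sum(xs) / len(xs)

# With random assignment, the difference in group means estimates the
# program's effect; here it comes out close to the true effect of 10.
estimated_effect = mean(treatment) - mean(control)
print(round(estimated_effect, 1))
```

Without the random assignment step – e.g., if the highest earners were more likely to enroll – the same difference-in-means calculation would conflate the program’s effect with selection.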

Common kinds of evidence

  • Anecdotes and stories – often of individuals directly affected by charities’ activities – are the most common kind of evidence provided by charities we examine. We put essentially no weight on these, because (a) we believe the individuals’ stories could be exaggerated and misrepresented (either by the individuals, seeking to tell charity representatives what they want to hear and print, or by the charity representatives responsible for editing and translating individuals’ stories); (b) we believe the stories are likely “cherry-picked” by charity representatives and thus not representative. Note that we have written in the past that we would be open to taking individual stories as evidence, if our “representativeness” concerns were addressed more effectively.
  • Awards, recognition, reputation. We feel that one should be cautious and highly context-sensitive in deciding how much weight to place on a charity’s awards, endorsements, reputation, etc. We have long been concerned that the nonprofit world rewards good stories, charismatic leaders, and strong performance on raising money (all of which are relatively easy to assess) rather than rewarding positive impact on the world (which is much harder to assess). We also suspect that in many cases, a small number of endorsements can quickly snowball into a large number, because many in the nonprofit world (having little else with which to assess a charity’s impact) decide their own endorsements more or less exclusively on the basis of others’ endorsements. Because of these issues, we think this sort of evidence is often relatively weak on the criteria of “relevant reported effects” and “attribution.”

    We certainly feel that a strong reputation or referral is a good sign, and provides reason to prioritize investigating a charity; furthermore, there are particular contexts in which a strong reputation can be highly meaningful (for example, a hospital that is commonly visited by health professionals and has a strong reputation probably provides quality care, since it would be hard to maintain such a reputation if it did not). That said, we think it is often very important to try to uncover the basis for a charity’s reputation, and not simply rely on the reputation itself.
  • Testimony. We see value in interviewing people who are well-placed to understand how a particular change took place, and we have been making this sort of evidence a larger part of our process (for example, see our reassessment of VillageReach’s pilot project). When assessing this sort of evidence, we feel it is important to assess what the person in question is and isn’t well-positioned to know, and whether they have incentive to paint one sort of picture or another. How the person was chosen is another factor: we generally place more weight on the testimony of people we’ve sought out (using our own search process) than on the testimony of people we’ve been connected to by a charity looking to paint a particular picture.
  • “Micro” data. We often come across studies that attempt to use systematically collected data to argue that, e.g., a particular program improved people’s lives in a particular case. The strength of this sort of evidence is that researchers often put great care into the question of “attribution,” trying to establish that the observed effects are due to the program in question and not to something else. (“Attribution” is a frequent weakness of the other kinds of evidence listed here.) The strength of the case for attribution varies significantly, and we’ll discuss this in a future post.

    When examining “micro” data, we often have concerns around representativeness (is the case examined in a particular study representative of a charity’s future activities?) and around relevant reported effects (these sorts of studies often need to quantify things that are difficult to quantify, such as standard of living, and as a result they often use data that may not capture the full reality of what happened).
  • “Macro” data. Some of the evidence we find most impressive is empirical analysis of broad (e.g., country-level) trends. While this sort of evidence is often weaker on the “attribution” front than “micro” data, it is often stronger on the “representativeness” front. (More.)

In general, we think the strongest cases use multiple forms of evidence, some addressing the weaknesses of others. For example, immunization campaigns are associated with both strong “micro” evidence (which shows that intensive, well-executed immunization programs can save lives) and “macro” evidence (which shows, less rigorously, that real-world immunization programs have led to drops in infant mortality and the elimination of various diseases).

Quick update: New way to follow GiveWell’s research progress

There are two types of materials we publish periodically throughout the year:

  • We frequently speak with charity representatives or other subject matter experts. We ask permission to take notes during these conversations so that we can publish them to our conversations page.
  • We publish new charity review or intervention report pages.

We’ve set up a Google Group so that those who want can get updated when we publish new material.

You can subscribe to this via RSS using this RSS feed. You can also sign up to receive updates via email at the group’s home page.

Revisiting the 2011 Japan disaster relief effort

Last year, Japan was hit by a severe earthquake and tsunami, and we recommended giving to Doctors Without Borders specifically because it was not soliciting funds for Japan. We reasoned that the relief effort did not appear to have room for more funding – i.e., we believed that additional funding would not lead to a better emergency relief effort. We made our case based on factors including the lack of an official appeal on ReliefWeb, reports from the U.N. Office for the Coordination of Humanitarian Affairs, statements by the Japanese Red Cross, the behavior of major funders including the U.S. government, and the language used by charities in describing their activities. We acknowledged that donations could have beneficial humanitarian impact in Japan, just as they could anywhere, but felt that the nature of the impact was likely to fall under what we characterized as “restitution” and “everyday aid” activities, as opposed to “relief” or “recovery” activities.

Since it’s now been over a year since the disaster, we made an effort to find one-year reports from relevant organizations and get a sense of how donations have been spent.

We have published a detailed and sourced set of notes on what reports we could find and what they revealed about activities. Our takeaways:

  • Very little information on expenditures was provided. Of the 11 organizations we examined, only 6 prominently reported (such that we could find it) the total amount raised or spent for Japan disaster relief. Of these, only Save the Children, the American Red Cross, and the Japanese Red Cross provided any breakdown of spending by category. The breakdowns provided by Save the Children and the American Red Cross were very broad, with 5 and 3 categories respectively; the Japanese Red Cross provided more detail.
  • The Japanese Red Cross spent most of the funds it received on two categories of expense: (1) cash transfers and (2) electrical appliances for those affected. It reports the equivalent of ~$4.2 billion in cash grants. Out of its ~$676 million budget for recovery activities, 49% was spent specifically on “sets of six electronic household appliances … distributed to 18,840 households in Iwate, 48,638 in Miyagi, 61,464 in Fukushima and 1,820 in other prefectures.” (This quote was the extent of the information provided on this activity.) The Japanese Red Cross also spent significant funds on reconstruction/rehabilitation of health centers and services and pneumonia vaccinations for the elderly.

    A relatively small amount of funding (the equivalent of ~$5.6 million) is reported for activities that the Japanese Red Cross puts under its “emergency” categories in its budget. (These include distribution of supplies, medical services, and psychosocial counseling). It is possible that there was a separate budget for emergency relief that is not included in the report.

    Note that the Japanese Red Cross raised and spent substantially more money than the other nonprofits we’ve listed, and also gave substantially more detail on its activities and expenses.

  • Other nonprofits reported a mix of traditional “relief” activities, cash-transfer-type activities, and entertainment/recreation-related activities. Of the groups that provided some concrete description of their activities (not all did), all reported engaging in distribution of basic supplies and/or provision of psychosocial counseling. Most reported some cash-transfer-type activities: cash-for-work; scholarships; grants to community organizations; support for fisheries, including re-branding efforts and provision of fishing vessels. And most reported some entertainment/recreation activities: festivals, performing arts groups, community-building activities, sporting equipment and sports programs for youth, weekend and summer camps. None reported only traditional “relief” activities. (We concede that all of these activities may have had substantial humanitarian impact, and that some may have been complementary to more traditional “relief” activities; however, we think it is important to note these distinctions, for reasons discussed below.)
  • Currently, Oxfam’s page on the Japan disaster states, “Oxfam has been ready to assist further but is not launching a major humanitarian response at this time. We usually focus our resources on communities where governments have been unable – or, in some cases, unwilling – to provide for their people. But the Japanese government has a tremendous capacity for responding in crises, and a clear commitment to using its resources to the fullest.” Note that on the day of the disaster, Oxfam featured a solicitation for this disaster on its front page.
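
The Japanese Red Cross figures above imply a rough per-household cost for the appliance sets. This back-of-the-envelope calculation is our own, not a figure reported by the Japanese Red Cross, and all amounts are approximate:

```python
# Figures as cited from the Japanese Red Cross one-year report (approximate).
recovery_budget_usd = 676e6   # ~$676 million recovery budget
appliance_share = 0.49        # 49% spent on six-appliance household sets

# Households receiving appliance sets: Iwate, Miyagi, Fukushima, other prefectures.
households = 18_840 + 48_638 + 61_464 + 1_820

appliance_spending = recovery_budget_usd * appliance_share
cost_per_household = appliance_spending / households

print(f"{households:,} households")                   # 130,762 households
print(f"~${appliance_spending / 1e6:.0f}M on appliance sets")
print(f"~${cost_per_household:,.0f} per household")
```

This works out to roughly $2,500 per household for the appliance sets, which is one way to see why we view such spending as closer to cash-transfer-type aid than to emergency relief.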

Based on our earlier conclusion that the relief effort did not have “room for more funding,” we expected to find (a) reports of the sort of activities that nonprofits could spend money on in non-disaster-relief settings (including cash-transfer-type programs, giving out either cash itself or items that could easily be resold, which could likely be carried out in any setting without objections); (b) reports that were relatively light on details and financial breakdowns. We observed both of these things in the reports discussed above, in nearly every case. In isolation, nothing about the above-described activities rules out the idea that nonprofits were carrying out important, beneficial activities that were core to recovery; but when combined with our earlier evidence of no “room for more funding,” we feel that the overall picture is consistent.

We were somewhat surprised by the degree to which many nonprofits funded entertainment/recreation activities; these sorts of activities aren’t what we think of as the core competency of international NGOs working mostly in the developing world, and we continue to feel that in a situation such as Japan’s, direct unconditional cash transfers make more sense than activities such as these. (This is a point in favor of the Japanese Red Cross, which – unlike other nonprofits – reported significant spending on cash transfers.)

We therefore stand by the conclusions we reached last year: that the relief and recovery effort did not have room for more funding, that those interested in emergency relief should have donated to Doctors Without Borders, and that those determined to help Japan specifically should have donated to the Japanese Red Cross.