Cost-effectiveness analysis | The GiveWell Blog
November 10th, 2011

Maximizing Cost-effectiveness via Critical Inquiry

We’ve recently been writing about the shortcomings of formal cost-effectiveness estimation (i.e., trying to estimate how much good, as measured in lives saved, DALYs or other units, is accomplished per dollar spent). After conceptually arguing that cost-effectiveness estimates can’t be taken literally when they are not robust, we found major problems in one of the most prominent sources of cost-effectiveness estimates for aid, and generalized from these problems to discuss major hurdles to usefulness faced by the endeavor of formal cost-effectiveness estimation.

Despite these misgivings, we would be determined to make cost-effectiveness estimates work, if we thought this were the only way to figure out how to allocate resources for maximal impact. But we don’t. This post argues that when information quality is poor, the best way to maximize cost-effectiveness is to examine charities from as many different angles as possible - looking for ways in which their stories can be checked against reality - and support the charities that have a combination of reasonably high estimated cost-effectiveness and maximally robust evidence. This is the approach GiveWell has taken since our inception, and it is more similar to investigative journalism or early-stage research (other domains in which people look for surprising but valid claims in low-information environments) than to formal estimation of numerical quantities.

The rest of this post

  • Conceptually illustrates (using the mathematical framework laid out previously) the value of examining charities from different angles when seeking to maximize cost-effectiveness.
  • Discusses how this conceptual approach matches the approach GiveWell has taken since inception.

Conceptual illustration

I previously laid out a framework for making a “Bayesian adjustment” to a cost-effectiveness estimate. I stated (and posted the mathematical argument) that when considering a given cost-effectiveness estimate, one must also consider one’s prior distribution (i.e., what is predicted for the value of one’s actions by other life experience and evidence) and the variance of the estimate error around the cost-effectiveness estimate (i.e., how much room for error the estimate has). This section works off of that framework to illustrate the potential importance of examining charities from multiple angles - relative to formally estimating their cost-effectiveness - in low-information environments.

I don’t wish to present this illustration either as official GiveWell analysis or as “the reason” that we believe what we do. This is more of an illustration/explication of my views than a justification; GiveWell has implicitly (and intuitively) operated consistent with the conclusions of this analysis, long before we had a way of formalizing these conclusions or the model behind them. Furthermore, while the conclusions are broadly shared by GiveWell staff, the formal illustration of them should only be attributed to me.

The model

Suppose that:

  • Your prior over the “good accomplished per $1000 given to a charity” is normally distributed with mean 0 and standard deviation 1 (denoted from this point on as N(0,1)). Note that I’m not saying that you believe the average donation has zero effectiveness; I’m just denoting whatever you believe about the impact of your donations in units of standard deviations, such that 0 represents the impact your $1000 has when given to an “average” charity and 1 represents the impact your $1000 has when given to “a charity one standard deviation better than average” (top 16% of charities).
  • You are considering a particular charity, and your back-of-the-envelope initial estimate of the good accomplished by $1000 given to this charity is represented by X. It is a very rough estimate and could easily be completely wrong: specifically, it has a normally distributed “estimate error” with mean 0 (the estimate is as likely to be too optimistic as too pessimistic) and standard deviation X (so about 16% of the time, the actual impact of your $1000 will be 0, i.e. “average,” or worse).* Thus, your estimate is denoted as N(X,X).

The implications

I use “initial estimate” to refer to the formal cost-effectiveness estimate you create for a charity - along the lines of the DCP2 estimates or Back of the Envelope Guide estimates. I use “final estimate” to refer to the cost-effectiveness you should expect, after considering your initial estimate and making adjustments for the key other factors: your prior distribution and the “estimate error” variance around the initial estimate. The following chart illustrates the relationship between your initial estimate and final estimate based on the above assumptions.

Note that there is an inflection point (X=1), past which point your final estimate falls as your initial estimate rises. With such a rough estimate, the maximum value of your final estimate is 0.5 no matter how high your initial estimate says the value is. In fact, once your initial estimate goes “too high” the final estimated cost-effectiveness falls.

This is in some ways a counterintuitive result. A couple of ways of thinking about it:

  • Informally: estimates that are “too high,” to the point where they go beyond what seems easily plausible, seem - by this very fact - more uncertain and more likely to have something wrong with them. Again, this point applies to very rough back-of-the-envelope style estimates, not to more precise and consistently obtained estimates.
  • Formally: in this model, the higher your estimate of cost-effectiveness goes, the higher the error around that estimate is (both are represented by X), and thus the less information is contained in this estimate in a way that is likely to shift you away from your prior. This will be an unreasonable model for some situations, but I believe it is a reasonable model when discussing very rough (“back-of-the-envelope” style) estimates of good accomplished by disparate charities. The key component of this model is that of holding the “probability that the right cost-effectiveness estimate is actually ‘zero’ [average]” constant. Thus, an estimate of 1 has a 68% confidence interval of 0-2; an estimate of 1000 has a 68% confidence interval of 0-2000; the former is a more concentrated probability distribution.
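The single-estimate adjustment described above can be sketched in a few lines. This is a minimal illustration, not the linked spreadsheet: it assumes the standard precision-weighted update for a normal prior and a normal estimate error, and the function name `final_estimate` is hypothetical. With a prior of N(0,1) and an initial estimate X whose error has standard deviation X, the final estimate reduces to X / (1 + X^2).

```python
def final_estimate(x, prior_mean=0.0, prior_sd=1.0):
    """Posterior mean after combining an N(prior_mean, prior_sd) prior with
    one initial estimate x whose error has standard deviation x.

    Assumes x > 0, as in the scenario described in the text.
    """
    prior_precision = 1.0 / prior_sd ** 2
    estimate_precision = 1.0 / x ** 2  # error sd equals the estimate itself
    weighted_sum = prior_mean * prior_precision + x * estimate_precision
    return weighted_sum / (prior_precision + estimate_precision)


for x in (0.5, 1.0, 2.0, 10.0, 1000.0):
    print(f"initial estimate {x:>7}: final estimate {final_estimate(x):.4f}")
```

Under these assumptions the final estimate peaks at 0.5 when X = 1 and shrinks back toward 0 as X grows, which is the pattern the chart is described as showing.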

Now suppose that you make another, independent estimate of the good accomplished by your $1000, for the same charity. Suppose that this estimate is equally rough and comes to the same conclusion: it again has a value of X and a standard deviation of X. So you have two separate, independent “initial estimates” of good accomplished, and both are N(X,X). Properly combining these two estimates into one yields an estimate with the same average (X) but less “estimate error” (standard deviation = X/sqrt(2)). Now the relationship between X and adjusted expected value changes:

Now you have a higher maximum (for the final estimated good accomplished) and a later inflection point - higher estimates can be taken more seriously. But it’s still the case that “too high” initial estimates lead to lower final estimates.

The following charts show what happens if you manage to collect even more independent cost-effectiveness estimates, each one as rough as the others, each one with the same midpoint as the others (i.e., each is N(X,X)).

The pattern here is that when you have many independent estimates, the key figure is X, or “how good” your estimates say the charity is. But when you have very few independent estimates, the key figure is K - how many different independent estimates you have. More broadly - when information quality is good, you should focus on quantifying your different options; when it isn’t, you should focus on raising information quality.
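The effect of K can be sketched the same way. This is an illustration under the stated model only (prior N(0,1), K fully independent estimates that are each N(X,X)); the function names are hypothetical. Pooling K such estimates leaves the midpoint at X and shrinks the error standard deviation to X/sqrt(K), so the final estimate works out to K*X / (K + X^2), with its peak at X = sqrt(K) and a peak value of sqrt(K)/2.

```python
import math


def final_estimate_k(x, k):
    """Posterior mean under an N(0,1) prior after pooling k independent
    estimates, each N(x, x); the pooled error sd is x / sqrt(k)."""
    pooled_sd = x / math.sqrt(k)
    pooled_precision = 1.0 / pooled_sd ** 2
    # The prior has mean 0 and precision 1, so only the estimate term remains.
    return (x * pooled_precision) / (1.0 + pooled_precision)


def peak(k):
    """Location and height of the maximum: X = sqrt(k), value sqrt(k) / 2."""
    return math.sqrt(k), math.sqrt(k) / 2.0


for k in (1, 2, 4, 16, 100):
    x_star, height = peak(k)
    print(f"K={k:>3}: peak at X={x_star:.2f} (final estimate {height:.2f}); "
          f"final estimate for X=5 is {final_estimate_k(5.0, k):.2f}")
```

For a fixed initial estimate of X = 5, the final estimate is about 0.19 with one estimate and 4.0 with a hundred: with few estimates K dominates, and with many the final estimate tracks X.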

A few other notes:

  • The full calculations behind the above charts are available here (XLS). We also provide another Excel file that is identical except that it assumes a standard deviation for each estimate of X/2, rather than X. This places “0” right at the edge of your 95% confidence interval for the “correct” version of your estimate. While the inflection points are later and higher, the basic picture is the same.
  • It is important to have a cost-effectiveness estimate. If the initial estimate is too low, then regardless of evidence quality, the charity isn’t a good one. In addition, very high initial estimates can imply higher potential gains to further investigation. However, “the higher the initial estimate of cost-effectiveness, the better” is not strictly true.
  • Independence of estimates is key to the above analysis. In my view, different formal estimates of cost-effectiveness are likely to be very far from independent because they will tend to use the same background data and assumptions and will tend to make the same simplifications that are inherent to cost-effectiveness estimation (see previous discussion of these simplifications here and here).

    Instead, when I think about how to improve the robustness of evidence and thus reduce the variance of “estimate error,” I think about examining a charity from different angles - asking critical questions and looking for places where reality may or may not match the basic narrative being presented. As one collects more data points that support a charity’s basic narrative (and weren’t known to do so prior to investigation), the variance of the estimate falls, which is the same thing that happens when one collects more independent estimates. (Though it doesn’t fall as much with each new data point as it would with one of the idealized “fully independent cost-effectiveness estimates” discussed above.)

  • The specific assumption of a normal distribution isn’t crucial to the above analysis. I believe (based mostly on a conversation with Dario Amodei) that for most commonly occurring distribution types, if you hold the “probability of 0 or less” constant, then as the midpoint of the “estimate/estimate error” distribution approaches infinity the distribution becomes approximately constant (and non-negligible) over the area where the prior probability is non-negligible, resulting in a negligible effect of the estimate on the prior.

    While other distributions may involve later/higher inflection points than normal distributions, the general point that there is a threshold past which higher initial estimates no longer translate to higher final estimates holds for many distributions.
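The note above about non-normal distributions can be checked numerically for one example. The sketch below is an assumption-laden illustration, not part of the original analysis: it swaps in a Laplace (double-exponential) estimate error as a stand-in for "another distribution type," picks its scale so that the probability of "0 or less" stays at roughly 16%, and computes the final estimate by brute-force summation over a grid. All names and grid settings are hypothetical choices.

```python
import math

# Same "probability of 0 or less" as a normal error with sd X (about 16%).
P_ZERO_OR_LESS = 0.1587


def laplace_scale(x):
    """Scale b such that a Laplace error centred on x puts P_ZERO_OR_LESS
    of its mass at or below 0, i.e. 0.5 * exp(-x / b) = P_ZERO_OR_LESS."""
    return x / math.log(0.5 / P_ZERO_OR_LESS)


def final_estimate_laplace(x, lo=-10.0, hi=10.0, steps=20001):
    """Posterior mean under an N(0,1) prior and a Laplace-shaped estimate
    error, summed over an evenly spaced grid of candidate true values."""
    b = laplace_scale(x)
    step = (hi - lo) / (steps - 1)
    weighted_total = 0.0
    total = 0.0
    for i in range(steps):
        theta = lo + i * step
        prior = math.exp(-0.5 * theta * theta)  # unnormalised N(0,1)
        likelihood = math.exp(-abs(x - theta) / b) / (2.0 * b)
        weight = prior * likelihood
        weighted_total += theta * weight
        total += weight
    return weighted_total / total


for x in (0.5, 1.0, 2.0, 5.0, 20.0, 100.0):
    print(f"initial estimate {x:>6}: final estimate {final_estimate_laplace(x):.3f}")
```

In this sketch the final estimate for X = 100 comes out far below the one for X = 2, so very high initial estimates are again discounted heavily. It covers a single alternative distribution and does not establish the general claim.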

The GiveWell approach

Since the beginning of our project, GiveWell has focused on maximizing the amount of good accomplished per dollar donated. Our original business plan (written in 2007 before we had raised any funding or gone full-time) lays out “ideal metrics” for charities such as

number of people whose jobs produce the income necessary to give them and their families a relatively comfortable lifestyle (including health, nourishment, relatively clean and comfortable shelter, some leisure time, and some room in the budget for luxuries), but would have been unemployed or working completely non-sustaining jobs without the charity’s activities, per dollar per year. (Systematic differences in family size would complicate this.)

Early on, we weren’t sure of whether we would find good enough information to quantify these sorts of things. After some experience, we came to the view that most cost-effectiveness analysis in the world of charity is extraordinarily rough, and we then began using a threshold approach, preferring charities whose cost-effectiveness is above a certain level but not distinguishing past that level. This approach is conceptually in line with the above analysis.

It has been remarked that “GiveWell takes a deliberately critical stance when evaluating any intervention type or charity.” This is true, and in line with how the above analysis implies one should maximize cost-effectiveness. We generally investigate charities whose estimated cost-effectiveness is quite high in the scheme of things, and so for these charities the most important input into their actual cost-effectiveness is the robustness of their case and the number of factors in their favor. We critically examine these charities’ claims and look for places in which they may turn out not to match reality; when we investigate these and find confirmation rather than refutation of charities’ claims, we are finding new data points that support what they’re saying. We’re thus doing something conceptually similar to “increasing K” according to the model above. We’ve recently written about all the different angles we examine when strongly recommending a charity.

We hope that the content we’ve published over the years, including recent content on cost-effectiveness (see the first paragraph of this post), has made it clear why we think we are in fact in a low-information environment, and why, therefore, the best approach is the one we’ve taken, which is more similar to investigative journalism or early-stage research (other domains in which people look for surprising but valid claims in low-information environments) than to formal estimation of numerical quantities.

As long as the impacts of charities remain relatively poorly understood, we feel that focusing on robustness of evidence holds more promise than focusing on quantification of impact.

*This implies that the variance of your estimate error depends on the estimate itself. I think this is a reasonable thing to suppose in the scenario under discussion. Estimating cost-effectiveness for different charities is likely to involve using quite disparate frameworks, and the value of your estimate does contain information about the possible size of the estimate error. In our model, what stays constant across back-of-the-envelope estimates is the probability that the “right estimate” would be 0; this seems reasonable to me.

November 4th, 2011

Some Considerations Against More Investment in Cost-Effectiveness Estimates

When we started GiveWell, we were very interested in cost-effectiveness estimates: calculations aiming to determine, for example, the “cost per life saved” or “cost per DALY saved” of a charity or program. Over time, we’ve found ourselves putting less weight on these calculations, because we’ve been finding that these estimates tend to be extremely rough (and in some cases badly flawed).

One can react to what we’ve been finding in different ways: one can take it as a sign that we need to invest more in cost-effectiveness estimation (in order to make it more accurate and robust), or one can take it as a sign that we need to invest less in cost-effectiveness estimation (if one believes that estimates are unlikely to become robust enough to take literally and that their limited usefulness can be achieved with less investment). At this point we are tentatively leaning more toward the latter view; this post lays out our thinking on why.

This post does not argue against the conceptual goal of maximizing cost-effectiveness, i.e., achieving the maximal amount of good per dollar donated. We strongly support this conceptual goal; rather, we are arguing that focusing on directly estimating cost-effectiveness is not the best way to maximize cost-effectiveness. We believe there are alternative ways of maximizing cost-effectiveness - in particular, making limited use of cost-effectiveness estimates while focusing on finding high-quality evidence (an approach we have argued for previously and will likely flesh out further in a future post).

In a nutshell, we argue that the best currently available cost-effectiveness estimates - despite having extremely strong teams and funding behind them - have the problematic combination of being extremely simplified (ignoring important but difficult-to-quantify factors), extremely sensitive (small changes in assumptions can lead to huge changes in the figures), and not reality-checked (large flaws can persist unchecked - and unnoticed - for years). We believe it is conceptually difficult to improve on all three of these at once: improving on the first two is likely to require substantially greater complexity, which in turn will worsen the ability of outsiders to understand and reality-check estimates. Given the level of resources that have been invested in creating the problematic estimates we see now, we’re not sure that really reliable estimates can be created using reasonable resources - or, perhaps, at all.

We expand on these points using the case study of deworming, the only DCP2 estimate that we have enough detail on to be able to fully understand and reconstruct.

Simplicity of the estimate

The estimate is extremely simplified. It consists of

  • Costs: two possible figures for “cost per child treated,” one for generic drugs and one for name-brand drugs. These figures are drawn from a single paper (a literature review published 3 years prior to the publication of the estimate); costs are assumed to scale linearly with the number of children treated, and to be constant regardless of the region.
  • Drug effectiveness: for each infection, a single “effectiveness” figure is used, i.e., treatment is assumed to reduce disease burden by a set percentage for a given disease. For each infection, a single paper is used as the source of this “effectiveness” figure.
  • Symptoms averted: the prevalence of different symptoms is assumed to be different by region, but the regions are broad (there are 6 total regions). Prevalence figures are taken from a single paper. The severity of each symptom is assumed to be constant regardless of context, using standard disability weights. Effective treatment is presumed to prevent symptoms for exactly one year, with no accounting for externalities, side effects, or long-term effects (in fact, in the original calculation even deaths are assumed to be averted for only one year).
  • Putting it all together: the estimate calculates benefits of deworming by estimating the number of children cured of each symptom for a single year (based on the six regional figures re: how common symptoms are), converting to DALYs using its single set of figures on how severe each symptom is, and multiplying by the single drug effectiveness figure. It divides these DALY-denominated benefits into the costs, which are again done using a single per-child figure.
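The structure of the calculation described in the list above can be sketched as follows. This is a deliberately simplified illustration: the class, function and symptom names are hypothetical, and every number is a placeholder chosen for readability, not a DCP2 figure. It treats DALYs averted per child as the sum over symptoms of prevalence × disability weight × effectiveness × years of benefit, and divides the per-child cost by that sum.

```python
from dataclasses import dataclass


@dataclass
class Symptom:
    name: str
    prevalence: float         # share of treated children with the symptom
    disability_weight: float  # 0 = full health, 1 = equivalent to death
    effectiveness: float      # share of the burden removed by treatment


def cost_per_daly(cost_per_child, symptoms, years_of_benefit=1.0):
    """Cost per DALY averted = cost per child / DALYs averted per child."""
    dalys_per_child = sum(
        s.prevalence * s.disability_weight * s.effectiveness * years_of_benefit
        for s in symptoms
    )
    return cost_per_child / dalys_per_child


# Placeholder inputs for illustration only; these are NOT DCP2 figures.
symptoms = [
    Symptom("symptom A", prevalence=0.10, disability_weight=0.20, effectiveness=0.9),
    Symptom("symptom B", prevalence=0.30, disability_weight=0.01, effectiveness=0.9),
]
base = cost_per_daly(0.50, symptoms)

# One-at-a-time sensitivity check: shrink a single disability weight 20-fold.
symptoms[0].disability_weight /= 20
revised = cost_per_daly(0.50, symptoms)
print(f"base: ${base:.2f} per DALY, revised: ${revised:.2f} per DALY")
```

Because the result is a ratio of a few multiplied inputs, changing one disability weight moves the whole figure several-fold, which is the kind of check a sensitivity analysis would make explicit.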

No sensitivity analysis is included to examine how cost-effectiveness would vary if certain figures or assumptions turned out to be off. No adjustments are made to address issues such as (a) the high uncertainty of many of the figures (which has implications for overall cost-effectiveness) and (b) the fact that figures are taken from a relatively small number of studies, and are thus likely to be based on unusually well-observed programs.

In our view, any estimate this simple and broad has very limited application when examining a specific charity operating in a specific context.

Sensitivity of the estimate

The estimate is extremely sensitive to changes in inputs. In the course of examining it and trying different approaches to estimating the cost-effectiveness of deworming, we arrived at each of the following figures at one point or another:

Cost per DALY for STH treatment | Key assumptions behind this cost
$3.41 | original DCP2 calculation
$23.92 | +corrected disability weight of ascariasis symptoms
$256 | -corrected disability weight of ascariasis symptoms; +corrected prevalence interpretation for all STHs and symptoms and disability weight of trichuriasis symptoms
$529 | +corrected disability weight of ascariasis symptoms
$385 | +incorrectly accounting for long-term effects
$326 | -incorrectly accounting for long-term effects; +corrected duration of trichuriasis symptoms
$138 | +correctly accounting for long-term effects
$82.54 | Jonah’s independent estimate, implicitly accounting for long-term effects and using lower drug costs

Our final corrected version of the DCP2’s estimate varies heavily across regions as well:

Cost per DALY for STH treatment | Region
$77.39 | East Asia & Pacific
$83.16 | Latin America & Caribbean
$412.22 | Middle East & North Africa
$202.69 | South Asian Seas
$259.57 | Sub-Saharan Africa
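To put the spread in the two tables above into proportion, the short script below expresses each figure as a multiple of the original $3.41 and compares the most and least expensive regions. It is arithmetic on the numbers already listed; the variable names are illustrative.

```python
# Cost-per-DALY figures copied from the two tables above (US dollars).
sth_estimates = [3.41, 23.92, 256.0, 529.0, 385.0, 326.0, 138.0, 82.54]
regional = {
    "East Asia & Pacific": 77.39,
    "Latin America & Caribbean": 83.16,
    "Middle East & North Africa": 412.22,
    "South Asian Seas": 202.69,
    "Sub-Saharan Africa": 259.57,
}

original = sth_estimates[0]
for value in sth_estimates:
    print(f"${value:>7.2f} per DALY = {value / original:6.1f}x the original figure")

widest = max(sth_estimates) / min(sth_estimates)
regional_spread = max(regional.values()) / min(regional.values())
print(f"largest vs. smallest figure: {widest:.0f}x")
print(f"most vs. least expensive region: {regional_spread:.1f}x")
```

The figures we arrived at span a factor of roughly 155, and the corrected regional figures alone differ by a factor of about 5.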

Lack of reality-checks

As we wrote previously, we believe that a helminth expert reviewing this calculation would have noticed the errors that we pointed to. This is because when one examines the details of the (uncorrected) estimate, it becomes clear that nearly all of the benefits of deworming are projected to come from a single symptom of a single disease - a symptom which is, in fact, only believed to be about 1/20 as severe as the calculation implies, and only about 1/100 as common.

So why wasn’t the error caught between its 2006 publication (and numerous citations) and our 2011 investigation? We can’t be sure, but we can speculate that

  • The DALY metric - while it has the advantage of putting all health benefits in the same units - is unintuitive. We don’t believe it is generally possible to look at a cost-per-DALY figure and compare it with one’s informal knowledge of an intervention’s costs and benefits (though it is more doable when the benefits are concentrated in preventing mortality, which eliminates one of the major issues with interpreting DALYs).
  • That means that in order to reality-check an estimate, one needs to look at the details of how it was calculated.
  • But looking at the details of how an estimate is calculated is generally a significant undertaking - even for an estimate as simple as this one. It requires a familiarity with the DALY framework and with the computational tools being used (in this case Excel) that a subject matter expert - the sort of person who would be best positioned to catch major problems - wouldn’t necessarily have. And it may require more time than such a subject matter expert will realistically have available.

In most domains, a badly flawed calculation - when used - will eventually produce strange results and be noticed. In aid, by contrast, one can use a completely wrong figure indefinitely without ever finding out. The only mechanism for catching problems is to have a figure that is sufficiently easy to understand that outsiders (i.e., those who didn’t create the calculation) can independently notice what’s off. It appears that the DCP2 estimates do not pass this test.

Our point here isn’t about the apparent lack of formal double-check in the DCP2’s process (though this does affect our view of the DCP2) but about the lack of reality-check in the 5 years since publication - the fact that at no point did anyone notice that the figure seemed off, and investigate its origin.

And the problem pertains to more than “catching errors”; it also pertains to being able to notice when the calculation becomes out of line with (for example) new technologies, new information about the diseases and interventions in question, or local conditions in a specific case. An estimate that can’t be - or simply isn’t - continually re-examined for its overall and local relevance may be “correct,” but its real-world usefulness seems severely limited.

The dilemma: the less simplified and sensitive, the more esoteric

It currently appears to us that the general structure of these estimates is too simplified and sensitive to be reliable without relatively constant reality-checks from outsiders (particularly subject matter experts), but so complex and esoteric that these reality-checks haven’t been taking place.

Improving the robustness and precision of the estimates would likely have to mean making them far more complex, which in turn could make it far more difficult for outsiders (including subject matter experts) to make sense of them, adapt them to new information and local conditions, and give helpful feedback.

The resources that have already been invested in these cost-effectiveness estimates are significant. Yet in our view, the estimates are still far too simplified, sensitive, and esoteric to be relied upon. If such a high level of financial and (especially) human-capital investment leaves us this far from having reliable estimates, it may be time to rethink the goal.

All that said - if this sort of analysis were the only way to figure out how to allocate resources for maximal impact, we’d be advocating for more investment in cost-effectiveness analysis and we’d be determined to “get it right.” But in our view, there are other ways of maximizing cost-effectiveness that can work better in this domain - in particular, making limited use of cost-effectiveness estimates while focusing on finding high-quality evidence (an approach we have argued for previously and will likely flesh out further in a future post).

September 29th, 2011

Errors in DCP2 cost-effectiveness estimate for deworming

Two notes on this post:

  • This post discusses flaws in a particular published cost-effectiveness estimate for deworming. It should not be taken as a general argument against deworming as a promising intervention, and it does not address various other publications on deworming including the 2003 paper by Edward Miguel and Michael Kremer.
  • Prior to publication, we sent a draft of this post to several relevant scholars including the authors of the estimate. They have reviewed our work and confirmed the major errors we point out.

Over the past few months, GiveWell has undertaken an in-depth investigation of the cost-effectiveness of deworming, a treatment for parasitic worms that are very common in some parts of the developing world. While our investigation is ongoing, we now believe that one of the key cost-effectiveness estimates for deworming is flawed, and contains several errors that overstate the cost-effectiveness of deworming by a factor of about 100. This finding has implications not just for deworming, but for cost-effectiveness analysis in general: we are now rethinking how we use published cost-effectiveness estimates for which the full calculations and methods are not public.

The cost-effectiveness estimate in question comes from the Disease Control Priorities in Developing Countries (DCP2), a major report funded by the Gates Foundation. This report provides an estimate of $3.41 per disability-adjusted life-year (DALY) for the cost-effectiveness of soil-transmitted-helminth (STH) treatment, implying that STH treatment is one of the most cost-effective interventions for global health. In investigating this figure, we have corresponded, over a period of months, with six scholars who had been directly or indirectly involved in the production of the estimate. Eventually, we were able to obtain the spreadsheet that was used to generate the $3.41/DALY estimate. That spreadsheet contains five separate errors that, when corrected, shift the estimated cost-effectiveness of deworming from $3.41 to $326.43 per DALY. We came to this conclusion a year after learning that the DCP2’s published cost-effectiveness estimate for schistosomiasis treatment - another kind of deworming - contained a crucial typo: the published figure was $3.36-$6.92 per DALY, but the correct figure is $336-$692 per DALY. (This figure appears, correctly, on page 46 of the DCP2.)

We do believe that the corrected DCP2 calculations are too harsh on deworming; our best estimate of the cost-effectiveness of deworming is in between the corrected and uncorrected DCP2 figures, at $30-$80 per DALY. In addition, there are strong arguments for deworming as an excellent intervention that do not depend on these figures. Overall we consider deworming a highly promising (though not the single most promising) intervention; we will be discussing our thoughts on this intervention further in the future. This post focuses not on deworming in general, but on the DCP2 figures and what lessons we should take from the flaws in them.

  • The estimates on deworming are the only DCP2 figures we’ve gotten enough information on to examine in-depth. Getting to this point took a lot of work and communication with a number of different scholars, so we aren’t sure of the extent to which other estimates might also turn out to be flawed if examined closely.
  • We believe that the errors we’ve found in the estimate would have been caught by a helminth expert independently examining the estimate. Therefore, the presence of these errors implies to us that there has been no such examination. If this is the case, it would argue against the reliability of the DCP2’s estimates in general.
  • We’ve previously argued for a limited role for cost-effectiveness estimates; we now think that the appropriate role may be even more limited, at least for opaque estimates (e.g., estimates published without the details necessary for others to independently examine them) like the DCP2’s.
  • More generally, we see this case as a general argument for expecting transparency, rather than taking recommendations on trust - no matter how pedigreed the people making the recommendations. Note that the DCP2 was published by the Disease Control Priorities Project, a joint enterprise of The World Bank, the National Institutes of Health, the World Health Organization, and the Population Reference Bureau, which was funded primarily by a $3.5 million grant from the Gates Foundation. The DCP2 chapter on helminth infections, which contains the $3.41/DALY estimate, has 18 authors, including many of the world’s foremost experts on soil-transmitted helminths.
  • It is possible that we have made errors in our corrections to the calculation. One of the reasons we go to great lengths to be transparent is because we want our errors to be caught as quickly as possible.

About the DCP2’s estimate

The DCP2 was published by the Disease Control Priorities Project, a joint enterprise of The World Bank, the National Institutes of Health, the World Health Organization, and the Population Reference Bureau, which was funded primarily by a $3.5 million grant from the Gates Foundation.

The Gates Foundation also appears to have invested substantially in the dissemination of the DCP2’s findings, including a $4.4 million grant to the Population Reference Bureau to “disseminate key messages from [the DCP2].”

The DCP2 aims to estimate the cost-effectiveness of different health interventions, in terms of dollars per disability-adjusted life-year (DALY) saved, in order to prioritize the most cost-effective interventions - the ones that will have the largest effects in reducing mortality and morbidity for a given amount of funding. The DCP2’s published estimates imply that soil-transmitted helminth (STH) treatment is one of the cheapest ways to improve health: the same “amount of health” could be provided by spending $1 on STH deworming or roughly $34 on family planning programs or more than $90 on treating drug-resistant tuberculosis. In fact, it appears that the DCP2 rates STH treatment as the second most cost-effective health intervention of all, behind only hygiene promotion (p. 41).

The DCP2’s cost-effectiveness estimates for deworming have been cited widely to advocate a greater focus on treating STH infections, including in:

  • an article (PDF) in The Lancet
  • a report (PDF) by REACH, a consortium of large international NGOs and other organizations working to end child hunger, which labeled deworming one of 11 “promoted interventions”
  • the most-cited paper (PDF) published in the journal International Health
  • an editorial by Peter Hotez, a co-founder of the Global Network for Neglected Tropical Diseases, which has received more than $40 million in funding from the Gates Foundation
  • work by charity evaluators, such as GiveWell, Giving What We Can, and the University of Pennsylvania’s Center for High Impact Philanthropy.

Why we decided to look into the DCP2’s deworming estimates

We undertook this research because:

  • We wanted to do a case study of a cost-effectiveness estimate from the DCP2, understanding the full details of what goes into it and where the room for error is.
  • We were particularly curious about the estimate for treatment of soil-transmitted helminths since the published $3.41 per DALY averted figure didn’t seem to sync with what we knew about the costs and effectiveness of STH treatment (or the independent estimate of $280/DALY given by another study, as we’ve mentioned previously).
  • We also wanted to focus on STH treatment since the DCP2 rates it as the second most cost-effective health intervention of all, behind only hygiene promotion.
  • Finally, we wanted to learn more about deworming after Elie visited the Schistosomiasis Control Initiative in London and we became more optimistic about this organization than we had been.

Our process for investigating the estimate

GiveWell took the following steps to investigate the DCP2’s estimate for the cost effectiveness of STH deworming:

  • We initially contacted Peter Hotez, the lead author of the DCP2 chapter on intestinal nematode infections; he sent us several papers on the costs and effectiveness of deworming and referred us to another scholar to explain the calculation that the DCP2 had published.
  • This scholar, in turn, referred us to two more, who sent us further references in response to our questions.
  • At this point we had an extended back-and-forth trying to understand the details of the calculation that had been done, and since we weren’t sure we would reach a conclusion on this, we asked volunteer Jonah Sinick to use all the references we’d been sent to create his own best-guess estimate of the cost-effectiveness of deworming. This estimate implied a significantly higher cost per DALY than the published figure, which seemed strange since we were now using the references and inputs suggested to us by the chapter authors.
  • The scholars we had been corresponding with sent us a spreadsheet with the full details of the calculation, as well as an accompanying table, which we will call Table 9, that had been used to input some of the figures in the spreadsheet. Here is the PDF of Table 9 that we were sent.
  • However, the interpretation of the numbers from Table 9 was still unclear to us. Table 9 is not clearly labeled; the scholars involved in the calculation appeared to have conflicting interpretations of what the numbers meant, and both meanings were highly counterintuitive to us (details below).
  • So we contacted another scholar who had worked on Table 9 to get her help in interpreting it. She sent us the full paper from which Table 9 was taken, Intestinal Nematode Infections, and this paper appeared to have a different interpretation of Table 9 than the spreadsheet’s. We confirmed this with her.
  • We also found the disability weights being used counterintuitive, and after some investigation we received confirmation that they were erroneous (details below).
  • All in all, we found five errors in the estimate, not all of which were attributable to the creator of the spreadsheet.

Problems with the official estimate of the cost-effectiveness of deworming

The basic approach of the estimate is to:

  • Calculate the benefits of deworming by
    • Starting from a population of schoolchildren being dewormed;
    • Estimating the percentage of these children suffering from different symptoms of infection;
    • Estimating, from the above, the number of children cured of these symptoms (the estimate assumes that they are cured for exactly one year, since reinfection can occur after deworming);
    • Incorporating the severity of symptoms to arrive at DALYs saved by the deworming
  • Separately calculate the costs of deworming this population of schoolchildren, and divide costs by DALYs to obtain the cost per DALY.
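The structure of this calculation can be sketched in a few lines of Python. This is only an illustration of the steps listed above, not the DCP2 spreadsheet: the function names are ours, and every number in the example (population, cost per child, prevalences, disability weights) is a made-up placeholder rather than a DCP2 input.

```python
def dalys_averted(population, prevalence, disability_weight, years_cured=1.0):
    """DALYs averted by curing one symptom in a treated population.

    cases cured = population * prevalence; each cured case averts
    disability_weight DALYs per year for years_cured years.
    """
    return population * prevalence * disability_weight * years_cured


def cost_per_daly(population, cost_per_child, symptoms):
    """Divide total treatment cost by total DALYs averted across symptoms.

    symptoms is a list of (prevalence, disability_weight) pairs.
    """
    total_cost = population * cost_per_child
    total_dalys = sum(
        dalys_averted(population, prevalence, weight)
        for prevalence, weight in symptoms
    )
    return total_cost / total_dalys


# Hypothetical inputs, chosen only to show the mechanics.
example_symptoms = [(0.10, 0.02), (0.01, 0.10)]
example_result = cost_per_daly(
    population=10_000, cost_per_child=0.50, symptoms=example_symptoms
)
print(f"${example_result:.2f} per DALY")
```

Because the result is a ratio with prevalence and disability weight multiplied together in the denominator, an error in either input passes straight through to the final figure.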

When we examined the details of the official estimate, it struck us that nearly all of the DALYs saved (i.e., nearly all of the benefit) were coming from the reduction of a single symptom of a single worm infection: cognitive impairment due to ascariasis (we abbreviate this as CIDTA). Specifically, the figures going into the estimate implied that:

  • In a hypothetical population of 208,530 children (age 5-14 in Latin America) treated, 45,060 suffer from CIDTA. (Cells C44 and L44 in “ascariasis” sheet). That’s about 22%.

  • The disability weight of CIDTA is 0.463 (cell E8). While these figures are difficult to interpret, this implies that having CIDTA is about half as bad as being dead (disability weight 1.0), and only slightly less debilitating than being blind (disability weight 0.6). (See the official list of disability weights published alongside the DCP2.) These figures implied (to us) that CIDTA was not a matter of subtle cognitive impairment, but of mental handicap so severe as to truly prevent normal functioning.
  • The intervention in question - a single dose of albendazole - could completely restore normal mental functioning (i.e., completely eliminate disability associated with CIDTA) for one year.

These implications didn’t sync with the information we had from other sources, such as the Global Burden of Disease (GBD) report published alongside the DCP2.

  • If ascariasis caused this sort of symptom, we’d expect to see much more focus on ascariasis (relative to other helminth infections) in the global health and deworming communities.
  • In addition (as we observed when trying to reconcile the official estimate with our own estimate), if 22% of the 110 million 5-14 year olds in Latin America (GBD, 198-199) had a disability with weight 0.463, then this - alone - would result in 11.2 million DALYs lost to ascariasis per year in this region (22% * 110 million * 0.463). However, the official DALY burden for ascariasis (all symptoms) among this population is only 31,000 (GBD, 198-199) - in fact, the worldwide DALY burden for ascariasis is only 915,000 (GBD, 180-181).
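The sanity check in the second point is simple enough to reproduce directly. The sketch below uses only the figures quoted above; the variable names are ours.

```python
# Figures quoted in the text (GBD page references are given there).
population_5_14_latin_america = 110_000_000
implied_cidta_prevalence = 0.22      # implied by the official spreadsheet
cidta_disability_weight = 0.463      # weight used in the official spreadsheet

# DALYs per year implied by the spreadsheet's inputs, for CIDTA alone.
implied_dalys = (
    population_5_14_latin_america
    * implied_cidta_prevalence
    * cidta_disability_weight
)

official_regional_dalys = 31_000     # ascariasis, all symptoms, same population
official_worldwide_dalys = 915_000   # ascariasis, all symptoms, worldwide

print(f"implied: {implied_dalys / 1e6:.1f} million DALYs per year")
print(f"ratio to official regional burden: {implied_dalys / official_regional_dalys:.0f}x")
```

The implied burden for one symptom in one region exceeds the official worldwide burden for the whole disease, which is what prompted us to look more closely at the inputs.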

We therefore did further investigation on the CIDTA symptom - both how prevalent it is and how severe it is. It turns out that the official calculation significantly overstates both. For example, among 5-14 year olds in Latin America, CIDTA affects about 0.23% of the population - not about 22% as the official calculation suggests - and its correct disability weight is 0.024 (the same severity as anemia), not 0.463.

Specifics of these errors:

  • Prevalence of CIDTA. The official calculation starts from a hypothetical population of 1 million people of all ages, then calculates the number of 5-14 year olds (per million people) using demographic data, then takes the number of CIDTA cases directly from Table 9 (this figure is multiplied by 10 before being put in the official spreadsheet). For example, for 5-14 year olds in Latin America, Table 9’s “A/B” column has the figure, “4506”; the official calculation records “45060” for the number of CIDTA cases among 5-14 year olds.

    The labeling of Table 9 is ambiguous and doesn’t make it clear whether this is the intended meaning of the figures. We contacted one of the original authors who wrote the paper from which Table 9 is taken, received a copy of the (unpublished) paper from her, discussed it with her, and found that this figure’s intended interpretation differs from the official calculation’s in two ways:

    • The figure in the “A/B” column refers to the number of people at risk for a given symptom, not the number of people suffering from that symptom. These are equivalent for Type A and Type C symptoms, but not for Type B symptoms including CIDTA. Intestinal Nematode Infections (PDF), the working paper that contains Table 9, says that “in any annual cohort of heavily infected children some 5% suffer [Type B symptoms, which are the only symptoms that have life-long effects]” (p. 26). Using the figures as the official calculation did would therefore lead to a 20x overstatement in the prevalence of CIDTA.

      This mistake applies not just to cognitive impairment due to ascariasis, but also to cognitive impairment due to trichuriasis and hookworms, similarly leading to a 20x overstatement of the prevalence of cognitive impairment due to those infections as well.

    • The figures in Table 9 refer to the number of children at risk, per 100,000 children of the age group indicated in the row. For 5-14 year olds in Latin America, the figure (for symptoms “A/B”) is “4506”; this means that 4506 out of 100,000 5-14 year olds are at risk for CIDTA. This in turn means that 45060 of every million 5-14 year olds are at risk. However, the official calculation assumes 45060 cases not for one million 5-14 year olds, but for only 208,530 5-14 year olds (which is the number of 5-14 year olds one would expect in a population of 1 million people across the three age groups). Thus, this difference results in overstating the prevalence of CIDTA by about 5x.

      This mistake applies to each of the symptoms of all three soil-transmitted helminths, not just to CIDTA, and therefore leads to an overstatement of the prevalence of every symptom of STHs by about 5x.

    Bottom line - the correct interpretation of Table 9 (for 5-14 year olds in Latin America) is that 45060 out of every million 5-14 year olds are at risk for CIDTA, and 5% of these actually have it - so 2253 out of every million 5-14 year olds have CIDTA. The official calculation assumes that in a population of 208,530 5-14 year olds, 45060 have CIDTA. The same types of errors apply to the other regions and conditions as well.

  • Severity of CIDTA. The disability weight of 0.463 is correctly transcribed from the Global Burden of Disease official disability weights, which in turn takes the figure from the earlier 1996 edition (which we examined in a library). However, we still found this figure odd because of the contrast with the other two kinds of helminth infections:
    | Helminth type | Symptom A - disability weight | Symptom A - description | Symptom B - disability weight | Symptom B - description | Symptom C - disability weight | Symptom C - description |
    | --- | --- | --- | --- | --- | --- | --- |
    | Ascariasis | 0.006 | Reduction in cognitive ability in school-age children, which occurs only while infection persists | 0.463 | Delayed psychomotor development and impaired performance in language skills, motor skills, and coordination equivalent to a 5- to 10-point deficit in IQ | 0.024 | Blockage of the intestines due to worm mass |
    | Trichuriasis | 0.006 | Reduction in cognitive ability in school-age children, which occurs only while infection persists | 0.024 | Delayed psychomotor development and impaired performance in language skills, motor skills, and coordination equivalent to a 5- to 10-point deficit in IQ | 0.114-0.138 | Rectal prolapse and/or tenesmus and/or bloody mucoid stools due to carpeting of intestinal mucosa by worms |
    | Hookworm | NA | NA | 0.024 | Delayed psychomotor development and impaired performance in language skills, motor skills, and coordination equivalent to a 5- to 10-point deficit in IQ | 0.024 | Anemia due to hookworm infection |

    It looked to us as though the weights may have been switched, in the case of ascariasis, for symptoms B and C. We contacted Colin Mathers, the second-listed author on the Global Burden of Disease publication, and he confirmed to us that the weights are in fact switched, stating, “We also noticed this and corrected it in the spreadsheets for WHO estimates, but possibly it has remained uncorrected in some of the summary tables of weights.” Thus, CIDTA’s correct disability weight is 0.024, but the published disability weight in both editions of the GBD - and the weight used in the official cost-effectiveness calculation - is 0.463.
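To see how the prevalence and severity errors compound for CIDTA among 5-14 year olds in Latin America, the arithmetic can be laid out as follows. This is a rough sketch using only figures quoted in this post; the variable names are ours, and the resulting factor applies to the DALYs attributed to this one symptom in this one group, not to the overall cost-effectiveness estimate.

```python
# Table 9 figure for 5-14 year olds in Latin America, symptoms "A/B".
at_risk_per_100k = 4506
at_risk_per_million = at_risk_per_100k * 10       # 45060 at risk per million
share_of_at_risk_affected = 0.05                  # 5% suffer Type B symptoms

# Correct reading: cases per million 5-14 year olds.
correct_cases_per_million = at_risk_per_million * share_of_at_risk_affected
correct_prevalence = correct_cases_per_million / 1_000_000

# Official spreadsheet: 45060 cases among only 208,530 children.
official_prevalence = 45_060 / 208_530

prevalence_overstatement = official_prevalence / correct_prevalence

# Disability weights: published (switched) vs. corrected.
weight_overstatement = 0.463 / 0.024

combined_overstatement = prevalence_overstatement * weight_overstatement

print(f"correct cases per million: {correct_cases_per_million:.0f}")
print(f"prevalence overstated by about {prevalence_overstatement:.0f}x")
print(f"disability weight overstated by about {weight_overstatement:.0f}x")
print(f"CIDTA DALYs overstated by about {combined_overstatement:.0f}x")
```

The prevalence factor is the product of the two interpretation errors (20x from treating "at risk" as "affected", and roughly 4.8x from the population denominator).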

We created a version of the official calculation that corrected for the above errors, as well as two other errors that we found in the process of checking the calculation as thoroughly as we could. (See Footnote 1 below.) Our version is here (XLS).

This calculation leads to a revised cost-effectiveness estimate of $326.43 per DALY, rather than the $3.41 per DALY in the original.

The DCP cost-effectiveness estimates only took into account short term effects of the three diseases, even though they have some long term effects. This seems to have been an intentional decision rather than an error, but our feeling is that a best estimate of the true cost-effectiveness of deworming would likely take these long-term effects into account. We therefore created another version of the estimate that does so, as best as we can. (See Footnote 2 below.) Taking these long-term effects into account, our cost-effectiveness estimate for STH treatment moves to $138.28 per DALY.

These corrections also have implications for the cost-effectiveness estimate for combination deworming (simultaneously addressing both STH and schistosomiasis, another type of infection). The DCP2 reports a cost-effectiveness estimate of $8-$19/DALY averted for combined treatment, depending on whether generic or brand-name drugs are used for schistosomiasis treatment. Using our overall best guess for the revised DCP2 estimate for STH of $138.28/DALY and the DCP2’s estimate for generic schistosomiasis drugs of $336/DALY (note that this is incorrectly presented as “$3.36/DALY” on page 476, but the correct figure - without the erroneous decimal point - appears on page 46), we estimate the cost-effectiveness of a combined program, according to the DCP2, as $177/DALY. Ignoring the long-term effects of STH treatment, as the DCP2 does, changes that figure to $272/DALY.

In our first email to the author of the spreadsheet, we had only caught the first four of the five errors mentioned above, and made substantial mistakes in our attempts to take long-term effects into account. It was only when we checked the figures later that we noticed both of these mistakes. Mistakes are easy to make in this type of situation (for an interesting study on spreadsheet mistakes, see here). Transparency is the best way we can think of to avoid such mistakes. Now that we’ve published the spreadsheets, we look forward to hearing about any other mistakes you find - in the original or ours.

Our independent estimate of the cost-effectiveness of STH treatment

At the same time we were working through the DCP cost-effectiveness estimate for STH deworming, Jonah Sinick, a GiveWell volunteer, was working on an independent set of cost-effectiveness estimates for deworming, separately for both STH and a second type of worm-based disease, schistosomiasis. His report on the results is now available here. His bottom-line best guess for the cost-effectiveness of STH deworming is $82.54/DALY. Jonah’s calculation implicitly takes long-term effects into account, as we do in our more optimistic version of the calculation (the one that comes to $138.28 per DALY). Most of the discrepancy between Jonah’s $82.54/DALY figure and our $138.28 figure can be explained by the DCP’s use of a much higher cost-per-child treated ($0.225 vs. $0.085), though Jonah also finds different levels of disease burden and treatment effectiveness. (See footnote 3 below.)

Jonah also found more promising results for schistosomiasis treatment, another form of deworming that (as mentioned above) can be combined with STH treatment. His estimate ranges from $28.19-$70.48/DALY for schistosomiasis deworming. This is much more optimistic than the DCP’s estimate of $336-$692/DALY because Jonah finds, following the current consensus in the literature, a much higher disability weight for schistosomiasis than the DCP used (0.02-0.05 vs. 0.005-0.006). The DCP’s higher figure ($692/DALY) also assumes much more expensive brand-name drugs, while its lower figure ($336/DALY), like Jonah’s estimate, assumes generics.

Conservatively combining Jonah’s estimates for the cost-effectiveness of schistosomiasis and STH deworming (by assuming that no delivery costs are saved), we reach an estimate of $32-72/DALY, depending on the disability weight of schistosomiasis. More liberally assuming that a combined program would eliminate delivery costs equal to half the per-child cost of STH treatment, Jonah’s estimate of the cost-effectiveness of a combined program ranges from $29/DALY to $66/DALY, depending on the disability weight of schistosomiasis.

Implications for donors interested in deworming

These estimates are only a small part of the picture, in our view, regarding how promising deworming is as an intervention. We will be writing more about this in the future.

However, we think it is important to note that the DCP2’s original published figures implied that deworming is among the most cost-effective interventions listed in the publication; with errors corrected, it appears comparable to treating drug-resistant tuberculosis; taking into account long-term effects, it seems comparable to providing family planning services. Neither of those interventions is traditionally considered especially cost-effective. (Note that according to the DCP2’s original estimate, STH deworming is 30-100X more cost-effective than those interventions.)
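These comparisons can be checked roughly against the ratios quoted near the top of this post ($1 on STH deworming buying the same health as roughly $34 on family planning or more than $90 on treating drug-resistant tuberculosis). The sketch below treats those ratios as exact multipliers of the published $3.41 figure, which is our simplifying assumption; the variable names are ours.

```python
published_sth = 3.41          # $/DALY, original DCP2 figure
corrected_sth = 326.43        # $/DALY, errors corrected, short-term effects only
corrected_sth_long = 138.28   # $/DALY, errors corrected, long-term effects included

# Implied $/DALY for the comparison interventions, assuming the quoted
# spending ratios can be applied directly to the published STH figure.
implied_family_planning = 34 * published_sth
implied_drug_resistant_tb_min = 90 * published_sth   # "more than $90", so a floor

print(f"family planning: about ${implied_family_planning:.0f}/DALY")
print(f"drug-resistant TB: more than ${implied_drug_resistant_tb_min:.0f}/DALY")
print(f"corrected vs. published: {corrected_sth / published_sth:.0f}x higher cost per DALY")
print(f"long-term vs. published: {corrected_sth_long / published_sth:.0f}x higher cost per DALY")
```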

Whether or not the long-term effects are taken into account, the corrected DCP2 estimate of STH treatment falls outside of the $100/DALY range that the World Bank initially labeled as highly cost-effective (see page 36 of the DCP2). With the corrections, a variety of interventions, including vaccinations and insecticide-treated bednets, become substantially more cost-effective than deworming.

The more important takeaway, for us, concerns the DCP2’s cost-effectiveness estimates in general. We believe that the errors we’ve found in the estimate - described above - would have been caught by a helminth expert independently examining the estimate. Therefore, the presence of these errors implies to us that there has been no such examination. If this is the case, it would argue against the reliability of the DCP2’s estimates in general. We have not done similar investigations of other DCP2 estimates, and given the process it took to get the details of this one, we are not planning to do many more until and unless the details of estimates become available publicly.

Our takeaways

  • We’re now much more hesitant to place any weight on DCP2 cost-effectiveness figures except where we can fully understand and check the calculations.
  • More generally, we feel this case illustrates how opaque, formal calculations can obscure important information and demonstrate high sensitivity to minor errors. We see this as support for our position that formalized cost-effectiveness analysis can do more harm than good in trying to maximize actual cost-effectiveness.
  • Explicit cost-effectiveness estimates will continue to play a relatively small role in our decisions between top charities, though we will still use them in deciding which charities are potential top candidates.
  • We’re continuing to investigate deworming as a promising intervention, but one of the most encouraging figures widely cited in its favor appears deeply flawed.
  • Transparency is crucial. Had the scholars we discussed these issues with been less willing to engage with us, or had we been unable to find Intestinal Nematode Infections or the spreadsheet, these substantial errors would not have come to light.

Footnote 1: The other two problems we found in the calculation both have to do with the burden of trichuriasis:

  • The spreadsheet swaps the disability weights for Type B and C symptoms of trichuriasis. In the Global Burden of Disease and Risk Factors (GBD) 1990, which the spreadsheet cites, the Type B symptom of trichuriasis is cognitive impairment, which has a disability weight of 0.024, while the Type C symptom is massive dysentery syndrome, with disability weights ranging from 0.116 to 0.138. In the ‘trichuriasis’ sheet of the spreadsheet, Type B morbidity has disability weights ranging from 0.116 to 0.138 while Type C morbidity has the lower disability weight of 0.024. In the original calculation, this leads to an overestimate of the burden of trichuriasis by nearly 4x, but once the main errors described above are corrected, correcting this error actually makes STH treatment appear more cost-effective.
  • The spreadsheet uses a duration of .05 years for trichuriasis symptom Type C, while Intestinal Nematode Infections suggests that the duration for trichuriasis symptom Type C should be 12 months (pg. 24). This mistake likely occurred because the duration for ascariasis symptom Type C is .05 years.

In the corrected spreadsheet, sheets ‘a.3’, ‘t.5’, and ‘h.3’ contain our corrections to all five of the issues we have identified (for ascariasis, trichuriasis, and hookworm respectively). Most of the corrections should be fairly self-explanatory, but please don’t hesitate to email us or comment here if you have questions. We corrected the second main error above by changing the population of 5-14 year olds treated to 1,000,000 (see, e.g., sheet ‘a.3’ cell C23).

Footnote 2: The Type B symptom of all three diseases treated by STH deworming is called “cognitive impairment,” has a disability weight of 0.024, and lasts a lifetime once it develops. Intestinal Nematode Infections implies that 3% of the population at risk for symptom B (that is, 3% of the population listed in the A/B columns in Table 9) newly acquires a lifelong disability each year (pg. 26). We therefore altered the calculation to reflect lifelong (not just 1-year) benefits for these 3% (replacing the 5% listed in #2 above because that 5% is the total proportion infected during a given year, not the total proportion newly infected). At the same time, we also changed DALYs saved due to prevented mortality to compound to the end of life, rather than just counting the one year of life saved during the treatment. (This, arguably, is an actual error in the DCP2 process, not just a disagreement about how to take long term effects into account. When an intervention prevents someone from dying, it does not seem reasonable to count just one extra year of life saved.)

Footnote 3: We also looked into the possibility that the disability weights for helminth infections are “too low,” as implied by a passage in the DCP2:

The Disease Control Priorities Project helminth working group has determined that the WHO global burden of disease estimates are low because they do not incorporate the full clinical spectrum of helminth-associated morbidity and chronic disability, including anemia, chronic pain, diarrhea, exercise intolerance, and undernutrition (King, Dickman, and Tisch 2005). (DCP2, pg. 471)

Based on our review of the literature and correspondence with relevant scholars, we believe this argument has never been raised specifically with respect to STHs; most of the papers about it are about schistosomiasis, another type of worm infection. There is one paper (Chan 1997) that appears to imply a higher disability burden for STHs than the standard burden, which gives rise to Jonah’s more optimistic STH cost-effectiveness estimate of $11.25/DALY. We think the data from that paper is no longer credible: it appears to have been based on a lower worm threshold for experiencing morbidity than further research has found appropriate (Brooker 2010). Furthermore, the cited source of the relevant data is a working paper, the published version of which does not contain the data cited.

August 18th, 2011

Why We Can’t Take Expected Value Estimates Literally (Even When They’re Unbiased)

While some people feel that GiveWell puts too much emphasis on the measurable and quantifiable, there are others who go further than we do in quantification, and justify their giving (or other) decisions based on fully explicit expected-value formulas. The latter group tends to critique us - or at least disagree with us - based on our preference for strong evidence over high apparent “expected value,” and based on the heavy role of non-formalized intuition in our decisionmaking. This post is directed at the latter group.

We believe that people in this group are often making a fundamental mistake, one that we have long had intuitive objections to but have recently developed a more formal (though still fairly rough) critique of. The mistake (we believe) is estimating the “expected value” of a donation (or other action) based solely on a fully explicit, quantified formula, many of whose inputs are guesses or very rough estimates. We believe that any estimate along these lines needs to be adjusted using a “Bayesian prior”; that this adjustment can rarely be made (reasonably) using an explicit, formal calculation; and that most attempts to do the latter, even when they seem to be making very conservative downward adjustments to the expected value of an opportunity, are not making nearly large enough downward adjustments to be consistent with the proper Bayesian approach.

This view of ours illustrates why - while we seek to ground our recommendations in relevant facts, calculations and quantifications to the extent possible - every recommendation we make incorporates many different forms of evidence and involves a strong dose of intuition. And we generally prefer to give where we have strong evidence that donations can do a lot of good rather than where we have weak evidence that donations can do far more good - a preference that I believe is inconsistent with the approach of giving based on explicit expected-value formulas (at least those that (a) have significant room for error and (b) do not incorporate Bayesian adjustments, which are very rare in these analyses and very difficult to do both formally and reasonably).

The rest of this post will:

  • Lay out the “explicit expected value formula” approach to giving, which we oppose, and give examples.
  • Give the intuitive objections we’ve long had to this approach, i.e., ways in which it seems intuitively problematic.
  • Give a clean example of how a Bayesian adjustment can be done, and can be an improvement on the “explicit expected value formula” approach.
  • Present a versatile formula for making and illustrating Bayesian adjustments that can be applied to charity cost-effectiveness estimates.
  • Show how a Bayesian adjustment avoids the Pascal’s Mugging problem that those who rely on explicit expected value calculations seem prone to.
  • Discuss how one can properly apply Bayesian adjustments in other cases, where less information is available.
  • Conclude with the following takeaways:
    • Any approach to decision-making that relies only on rough estimates of expected value - and does not incorporate preferences for better-grounded estimates over shakier estimates - is flawed.
    • When aiming to maximize expected positive impact, it is not advisable to make giving decisions based fully on explicit formulas. Proper Bayesian adjustments are important and are usually overly difficult to formalize.
    • The above point is a general defense of resisting arguments that both (a) seem intuitively problematic and (b) have thin evidential support and/or room for significant error.

The approach we oppose: “explicit expected-value” (EEV) decisionmaking

We term the approach this post argues against the “explicit expected-value” (EEV) approach to decisionmaking. It generally involves an argument of the form:

    I estimate that each dollar spent on Program P has a value of V [in terms of lives saved, disability-adjusted life-years, social return on investment, or some other metric]. Granted, my estimate is extremely rough and unreliable, and involves geometrically combining multiple unreliable figures - but it’s unbiased, i.e., it seems as likely to be too pessimistic as it is to be too optimistic. Therefore, my estimate V represents the per-dollar expected value of Program P.
    I don’t know how good Charity C is at implementing Program P, but even if it wastes 75% of its money or has a 75% chance of failure, its per-dollar expected value is still 25%*V, which is still excellent.

Examples of the EEV approach to decisionmaking:

  • In a 2010 exchange, Will Crouch of Giving What We Can argued:
    DtW [Deworm the World] spends about 74% on technical assistance and scaling up deworming programs within Kenya and India … Let’s assume (very implausibly) that all other money (spent on advocacy etc) is wasted, and assess the charity solely on that 74%. It still would do very well (taking DCP2: $3.4/DALY * (1/0.74) = $4.6/DALY – slightly better than their most optimistic estimate for DOTS (for TB), and far better than their estimates for insecticide treated nets, condom distribution, etc). So, though finding out more about their advocacy work is obviously a great thing to do, the advocacy questions don’t need to be answered in order to make a recommendation: it seems that DtW [is] worth recommending on the basis of their control programs alone.

  • The Back of the Envelope Guide to Philanthropy lists rough calculations for the value of different charitable interventions. These calculations imply (among other things) that donating for political advocacy for higher foreign aid is between 8x and 22x as good an investment as donating to VillageReach, and the presentation and implication are that this calculation ought to be considered decisive.
  • We’ve encountered numerous people who argue that charities working on reducing the risk of sudden human extinction must be the best ones to support, since the value of saving the human race is so high that “any imaginable probability of success” would lead to a higher expected value for these charities than for others.
  • “Pascal’s Mugging” is often seen as the reductio ad absurdum of this sort of reasoning. The idea is that if a person demands $10 in exchange for refraining from an extremely harmful action (one that negatively affects N people for some huge N), then expected-value calculations demand that one give in to the person’s demands: no matter how unlikely the claim, there is some N big enough that the “expected value” of refusing to give the $10 is hugely negative.

The crucial characteristic of the EEV approach is that it does not incorporate a systematic preference for better-grounded estimates over rougher estimates. It ranks charities/actions based simply on their estimated value, ignoring differences in the reliability and robustness of the estimates.
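A minimal sketch of what this looks like in practice, under our own illustrative assumptions: the first function reproduces the arithmetic in the Deworm the World example above, and the second part ranks two hypothetical options purely by estimated value. The option names, values and error sizes are invented for illustration.

```python
def cost_per_daly_after_waste(cost_per_daly, fraction_well_spent):
    """EEV-style adjustment: if only a fraction of spending does any good,
    the effective cost per DALY scales up by 1 / fraction."""
    return cost_per_daly / fraction_well_spent


# The figures from the 2010 exchange quoted above: $3.4/DALY, 74% well spent.
dtw_cost_per_daly = cost_per_daly_after_waste(3.4, 0.74)
print(f"${dtw_cost_per_daly:.1f} per DALY")

# Two hypothetical options. "estimate_sd" is a made-up measure of how
# rough each estimate is; EEV never looks at it.
options = [
    {"name": "well-studied program", "estimated_value": 10.0, "estimate_sd": 2.0},
    {"name": "speculative program", "estimated_value": 25.0, "estimate_sd": 200.0},
]

# EEV ranking: pick the highest point estimate, ignoring reliability.
eev_choice = max(options, key=lambda option: option["estimated_value"])
print(eev_choice["name"])
```

The speculative option wins no matter how large its error term is, because the ranking rule has no place to put that information.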

Informal objections to EEV decisionmaking

There are many ways in which the sort of reasoning laid out above seems (to us) to fail a common sense test.

  • There seems to be nothing in EEV that penalizes relative ignorance or relatively poorly grounded estimates, or rewards investigation and the forming of particularly well grounded estimates. If I can literally save a child I see drowning by ruining a $1000 suit, but in the same moment I make a wild guess that this $1000 could save 2 lives if put toward medical research, EEV seems to indicate that I should opt for the latter.
  • Because of this, a world in which people acted based on EEV would seem to be problematic in various ways.
    • In such a world, it seems that nearly all altruists would put nearly all of their resources toward helping people they knew little about, rather than helping themselves, their families and their communities. I believe that the world would be worse off if people behaved in this way, or at least if they took it to an extreme. (There are always more people you know little about than people you know well, and EEV estimates of how much good you can do for people you don’t know seem likely to have higher variance than EEV estimates of how much good you can do for people you do know. Therefore, it seems likely that the highest-EEV action directed at people you don’t know will have higher EEV than the highest-EEV action directed at people you do know.)
    • In such a world, when people decided that a particular endeavor/action had outstandingly high EEV, there would (too often) be no justification for costly skeptical inquiry of this endeavor/action. For example, say that people were trying to manipulate the weather; that someone hypothesized that they had no power for such manipulation; and that the EEV of trying to manipulate the weather was much higher than the EEV of other things that could be done with the same resources. It would be difficult to justify a costly investigation of the “trying to manipulate the weather is a waste of time” hypothesis in this framework. Yet it seems that when people are valuing one action far above others, based on thin information, this is the time when skeptical inquiry is needed most. And more generally, it seems that challenging and investigating our most firmly held, “high-estimated-probability” beliefs - even when doing so has been costly - has been quite beneficial to society.
  • Related: giving based on EEV seems to create bad incentives. EEV doesn’t seem to allow rewarding charities for transparency or penalizing them for opacity: it simply recommends giving to the charity with the highest estimated expected value, regardless of how well-grounded the estimate is. Therefore, in a world in which most donors used EEV to give, charities would have every incentive to announce that they were focusing on the highest expected-value programs, without disclosing any details of their operations that might show they were achieving less value than theoretical estimates said they ought to be.
  • If you are basing your actions on EEV analysis, it seems that you’re very open to being exploited by Pascal’s Mugging: a tiny probability of a huge-value outcome can come to dominate your decisionmaking in ways that seem to violate common sense. (We discuss this further below.)
  • If I’m deciding between eating at a new restaurant with 3 Yelp reviews averaging 5 stars and eating at an older restaurant with 200 Yelp reviews averaging 4.75 stars, EEV seems to imply (using Yelp rating as a stand-in for “expected value of the experience”) that I should opt for the former. As discussed in the next section, I think this is the purest demonstration of the problem with EEV and the need for Bayesian adjustments.

In the remainder of this post, I present what I believe is the right formal framework for my objections to EEV. However, I have more confidence in my intuitions - which are related to the above observations - than in the framework itself. I believe I have formalized my thoughts correctly, but if the remainder of this post turned out to be flawed, I would likely remain in objection to EEV until and unless one could address my less formal misgivings.

Simple example of a Bayesian approach vs. an EEV approach

It seems fairly clear that a restaurant with 200 Yelp reviews, averaging 4.75 stars, ought to outrank a restaurant with 3 Yelp reviews, averaging 5 stars. Yet this ranking can’t be justified in an EEV-style framework, in which options are ranked by their estimated average/expected value. How, in fact, does Yelp handle this situation?
Unfortunately, the answer appears to be undisclosed in Yelp’s case, but we can get a hint from a similar site: BeerAdvocate, a site that ranks beers using submitted reviews. It states:

Lists are generated using a Bayesian estimate that pulls data from millions of user reviews (not hand-picked) and normalizes scores based on the number of reviews for each beer. The general statistical formula is:
weighted rank (WR) = (v ÷ (v+m)) × R + (m ÷ (v+m)) × C
where:
R = review average for the beer
v = number of reviews for the beer
m = minimum reviews required to be considered (currently 10)
C = the mean across the list (currently 3.66)

In other words, BeerAdvocate does the equivalent of giving each beer a set number (currently 10) of “average” reviews (i.e., reviews with a score of 3.66, which is the average for all beers on the site). Thus, a beer with zero reviews is assumed to be exactly as good as the average beer on the site; a beer with one review will still be assumed to be close to average, no matter what rating the one review gives; as the number of reviews grows, the beer’s rating is able to deviate more from the average.

To illustrate this, the following chart shows how BeerAdvocate’s formula would rate a beer that has 0-100 five-star reviews. As the number of five-star reviews grows, the formula’s “confidence” in the five-star rating grows, and the beer’s overall rating gets further from “average” and closer to (though never fully reaching) 5 stars.

I find BeerAdvocate’s approach to be quite reasonable and I find the chart above to accord quite well with intuition: a beer with a small handful of five-star reviews should be considered pretty close to average, while a beer with a hundred five-star reviews should be considered to be nearly a five-star beer.
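
The quoted formula is simple enough to sketch directly. The snippet below is only an illustration: the function and variable names are mine, the defaults (m = 10, C = 3.66) are the figures quoted above, and applying those same defaults to the restaurant example is purely an assumption, since Yelp's actual method is undisclosed.

```python
def weighted_rank(review_avg: float, num_reviews: int,
                  min_reviews: float = 10, list_mean: float = 3.66) -> float:
    """BeerAdvocate-style weighted rank: WR = (v/(v+m))*R + (m/(v+m))*C."""
    v, m = num_reviews, min_reviews
    if v + m <= 0:
        raise ValueError("need at least one real or 'prior' review")
    return (v / (v + m)) * review_avg + (m / (v + m)) * list_mean


# A beer whose reviews are all five stars, as the number of reviews grows.
for v in (0, 1, 5, 10, 50, 100):
    print(f"{v:>3} five-star reviews -> weighted rank {weighted_rank(5.0, v):.2f}")

# The restaurant example, borrowing BeerAdvocate's m and C (an assumption).
new_place = weighted_rank(5.0, 3)     # 3 reviews averaging 5 stars
old_place = weighted_rank(4.75, 200)  # 200 reviews averaging 4.75 stars
print(f"new: {new_place:.2f}, old: {old_place:.2f}")
```

Under these borrowed parameters the older restaurant comes out ahead, which is the ranking intuition calls for and which a pure average cannot produce.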

However, there are a couple of complications that make it difficult to apply this approach broadly.

  • BeerAdvocate is making a substantial judgment call regarding what “prior” to use, i.e., how strongly to assume each beer is average until proven otherwise. It currently sets the m in its formula equal to 10, which is like giving each beer a starting point of ten average-level reviews; it gives no formal justification for why it has set m to 10 instead of 1 or 100. It is unclear what such a justification would look like.

    In fact, I believe that BeerAdvocate used to use a stronger “prior” (i.e., it used to set m to a higher value), which meant that beers needed larger numbers of reviews to make the top-rated list. When BeerAdvocate changed its prior, its rankings changed dramatically, as lesser-known, higher-rated beers overtook the mainstream beers that had previously dominated the list.

  • In BeerAdvocate’s case, the basic approach to setting a Bayesian prior seems pretty straightforward: the “prior” rating for a given beer is equal to the average rating for all beers on the site, which is known. By contrast, if we’re looking at the estimate of how much good a charity does, it isn’t clear what “average” one can use for a prior; it isn’t even clear what the appropriate reference class is. Should our prior value for the good-accomplished-per-dollar of a deworming charity be equal to the good-accomplished-per-dollar of the average deworming charity, or of the average health charity, or the average charity, or the average altruistic expenditure, or some weighted average of these? Of course, we don’t actually have any of these figures.

    For this reason, it’s hard to formally justify one’s prior, and differences in priors can cause major disagreements and confusions when they aren’t recognized for what they are. But this doesn’t mean the choice of prior should be ignored or that one should leave the prior out of expected-value calculations (as we believe EEV advocates do).

Applying Bayesian adjustments to cost-effectiveness estimates for donations, actions, etc.

As discussed above, we believe that both Giving What We Can and Back of the Envelope Guide to Philanthropy use forms of EEV analysis in arguing for their charity recommendations. However, when it comes to analyzing the cost-effectiveness estimates they invoke, the BeerAdvocate formula doesn’t seem applicable: there is no “number of reviews” figure that can be used to determine the relative weights of the prior and the estimate.

Instead, we propose a model in which there is a normally (or log-normally) distributed “estimate error” around the cost-effectiveness estimate (with a mean of “no error,” i.e., 0 for normally distributed error and 1 for log-normally distributed error), and in which the prior distribution for cost-effectiveness is normally (or log-normally) distributed as well. (I won’t discuss log-normal distributions in this post, but the analysis I give can be extended by applying it to the log of the variables in question.) The more one feels confident in one’s pre-existing view of how cost-effective a donation or action should be, the smaller the variance of the “prior”; the more one feels confident in the cost-effectiveness estimate itself, the smaller the variance of the “estimate error.”

Following up on our 2010 exchange with Giving What We Can, we asked Dario Amodei to write up the implications of the above model and the form of the proper Bayesian adjustment. You can see his analysis here. The bottom line is that when one applies Bayes’s rule to obtain a distribution for cost-effectiveness based on (a) a normally distributed prior distribution and (b) a normally distributed “estimate error,” one obtains a distribution with

  • Mean equal to the average of the two means weighted by their inverse variances
  • Variance equal to the harmonic sum of the two variances (i.e., the reciprocal of the sum of their reciprocals)

The following charts show what this formula implies in a variety of different simple hypotheticals. In all of these, the prior distribution has mean = 0 and standard deviation = 1, and the estimate has mean = 10, but the “estimate error” varies, with important effects: an estimate with little enough estimate error can almost be taken literally, while an estimate with large enough estimate error ought to be almost ignored.

In each of these charts, the black line represents a probability density function for one’s “prior,” the red line for an estimate (with the variance coming from “estimate error”), and the blue line for the final probability distribution, taking both the prior and the estimate into account. Taller, narrower distributions represent cases where probability is concentrated around the midpoint; shorter, wider distributions represent cases where the possibilities/probabilities are more spread out among many values. First, the case where the cost-effectiveness estimate has the same confidence interval around it as the prior:

If one has a relatively reliable estimate (i.e., one with a narrow confidence interval / small variance of “estimate error”), then the Bayesian-adjusted conclusion ends up very close to the estimate. When we estimate quantities using highly precise and well-understood methods, we can use them (almost) literally.

On the flip side, when the estimate is relatively unreliable (wide confidence interval / large variance of “estimate error”), it has little effect on the final expectation of cost-effectiveness (or whatever is being estimated). And at the point where the one-standard-deviation bands include zero cost-effectiveness (i.e., where there’s a pretty strong probability that the whole cost-effectiveness estimate is worthless), the estimate ends up having practically no effect on one’s final view.
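
For readers who want to reproduce the numbers behind these hypotheticals, here is a minimal sketch of the adjustment for the normal-normal case. The function name and the particular estimate-error values in the loop are my own choices for illustration; only the prior (mean 0, standard deviation 1) and the estimate midpoint (10) come from the setup described above.

```python
def bayes_adjust(prior_mean: float, prior_sd: float,
                 est_mean: float, est_sd: float) -> tuple[float, float]:
    """Combine a normal prior with a normally distributed estimate.

    Returns (posterior mean, posterior standard deviation). The mean is the
    inverse-variance-weighted average of the two means; the variance is the
    reciprocal of the sum of the reciprocal variances.
    """
    prior_precision = 1.0 / prior_sd**2
    est_precision = 1.0 / est_sd**2
    post_var = 1.0 / (prior_precision + est_precision)
    post_mean = post_var * (prior_precision * prior_mean + est_precision * est_mean)
    return post_mean, post_var**0.5


# Prior: mean 0, sd 1. Estimate: midpoint 10, with varying estimate error.
for est_sd in (0.25, 0.5, 1.0, 2.0, 5.0, 10.0):  # illustrative values only
    mean, sd = bayes_adjust(0.0, 1.0, 10.0, est_sd)
    print(f"estimate error sd {est_sd:>5}: adjusted mean {mean:6.2f}, sd {sd:5.2f}")
```

When the estimate error matches the prior's standard deviation, the adjusted mean lands halfway between prior and estimate; as the estimate error grows, the adjusted mean falls back toward the prior.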

The details of how to apply this sort of analysis to cost-effectiveness estimates for charitable interventions are outside the scope of this post, which focuses on our belief in the importance of the concept of Bayesian adjustments. The big-picture takeaway is that just having the midpoint of a cost-effectiveness estimate is not worth very much in itself; it is important to understand the sources of estimate error, and the degree of estimate error relative to the degree of variation in estimated cost-effectiveness for different interventions.

Pascal’s Mugging

Pascal’s Mugging refers to a case where a claim of extravagant impact is made for a particular action, with little to no evidence:

Now suppose someone comes to me and says, “Give me five dollars, or I’ll use my magic powers … to [harm an imaginably huge number of] people.”

Non-Bayesian approaches to evaluating these proposals often take the following form: “Even if we assume that this analysis is 99.99% likely to be wrong, the expected value is still high - and are you willing to bet that this analysis is wrong at 99.99% odds?”

However, this is a case where “estimate error” is probably accounting for the lion’s share of variance in estimated expected value, and therefore I believe that a proper Bayesian adjustment would correctly assign little value where there is little basis for the estimate, no matter how high the midpoint of the estimate.

Say that you’ve come to believe - based on life experience - in a “prior distribution” for the value of your actions, with a mean of zero and a standard deviation of 1. (The unit type you use to value your actions is irrelevant to the point I’m making; so in this case the units I’m using are simply standard deviations based on your prior distribution for the value of your actions). Now say that someone estimates that action A (e.g., giving in to the mugger’s demands) has an expected value of X (same units) - but that the estimate itself is so rough that the right expected value could easily be 0 or 2X. More specifically, say that the error in the expected value estimate has a standard deviation of X.

An EEV approach to this situation might say, “Even if there’s a 99.99% chance that the estimate is completely wrong and that the value of Action A is 0, there’s still a 0.01% probability that Action A has a value of X. Thus, overall Action A has an expected value of at least 0.0001X; the greater X is, the greater this value is, and if X is great enough, then you should take Action A unless you’re willing to bet at enormous odds that the framework is wrong.”

However, the same formula discussed above indicates that Action A actually has an expected value - after the Bayesian adjustment - of X/(X^2+1), or just under 1/X. In this framework, once X exceeds 1 (i.e., once the claim is more than one standard deviation outside the prior), the greater X is, the lower the expected value of Action A. This syncs well with my intuitions: if someone threatened to harm one person unless you gave them $10, this ought to carry more weight (because it is more plausible in the face of the “prior” of life experience) than if they threatened to harm 100 people, which in turn ought to carry more weight than if they threatened to harm 3^^^3 people (I’m using 3^^^3 here as a representation of an unimaginably huge number).
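
To make the contrast concrete, the sketch below tabulates the Bayesian-adjusted value X/(X^2+1) next to the EEV-style lower bound of 0.0001X. The function names and the sample values of X are hypothetical; the prior (mean 0, standard deviation 1) and the estimate error (standard deviation X) are the ones assumed in the paragraphs above.

```python
def adjusted_value(x: float) -> float:
    """Posterior mean for a N(0, 1) prior and an estimate x with error sd x.

    Inverse-variance weighting gives (x / x**2) / (1 + 1 / x**2) = x / (x**2 + 1).
    """
    return x / (x**2 + 1)


def eev_lower_bound(x: float, p_right: float = 0.0001) -> float:
    """EEV-style floor: probability the estimate is right times the claimed value."""
    return p_right * x


for x in (1, 10, 100, 1_000_000):  # illustrative claim sizes, in prior sds
    print(f"X = {x:>9,}: adjusted {adjusted_value(x):.6g}, "
          f"EEV floor {eev_lower_bound(x):.6g}")
```

The adjusted value peaks at X = 1 and shrinks roughly like 1/X afterward, while the EEV floor grows without limit, which is exactly the gap the mugger exploits.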

The point at which a threat or proposal starts to be called “Pascal’s Mugging” can be thought of as the point at which the claimed value of Action A is wildly outside the prior set by life experience (which may cause the feeling that common sense is being violated). If someone claims that giving him/her $10 will accomplish 3^^^3 times as much as a 1-standard-deviation life action from the appropriate reference class, then the actual post-adjustment expected value of Action A will be just under (1/3^^^3) (in standard deviation terms) - only trivially higher than the value of an average action, and likely lower than other actions one could take with the same resources. This is true without applying any particular probability that the person’s framework is wrong - it is simply a function of the fact that their estimate has such enormous possible error. An ungrounded estimate making an extravagant claim ought to be more or less discarded in the face of the “prior distribution” of life experience.

Generalizing the Bayesian approach

In the above cases, I’ve given quantifications of (a) the appropriate prior for cost-effectiveness; (b) the strength/confidence of a given cost-effectiveness estimate. One needs to quantify both (a) and (b) - not just quantify estimated cost-effectiveness - in order to formally make the needed Bayesian adjustment to the initial estimate.

But when it comes to giving, and many other decisions, reasonable quantification of these things usually isn’t possible. To have a prior, you need a reference class, and reference classes are debatable.

It’s my view that my brain instinctively processes huge amounts of information, coming from many different reference classes, and arrives at a prior; if I attempt to formalize my prior, counting only what I can name and justify, I can worsen the accuracy a lot relative to going with my gut. Of course there is a problem here: going with one’s gut can be an excuse for going with what one wants to believe, and a lot of what enters into my gut belief could be irrelevant to proper Bayesian analysis. There is an appeal to formulas, which is that they seem to be susceptible to outsiders’ checking them for fairness and consistency.

But when the formulas are too rough, I think the loss of accuracy outweighs the gains to transparency. Rather than using a formula that is checkable but omits a huge amount of information, I’d prefer to state my intuition - without pretense that it is anything but an intuition - and hope that the ensuing discussion provides the needed check on my intuitions.

I can’t, therefore, usefully say what I think the appropriate prior estimate of charity cost-effectiveness is. I can, however, describe a couple of approaches to Bayesian adjustments that I oppose, and can describe a few heuristics that I use to determine whether I’m making an appropriate Bayesian adjustment.

Approaches to Bayesian adjustment that I oppose

I have seen some argue along the lines of “I have a very weak (or uninformative) prior, which means I can more or less take rough estimates literally.” I think this is a mistake. We do have a lot of information by which to judge what to expect from an action (including a donation), and failure to use all the information we have is a failure to make the appropriate Bayesian adjustment. Even just a sense for the values of the small set of actions you’ve taken in your life, and observed the consequences of, gives you something to work with as far as an “outside view” and a starting probability distribution for the value of your actions; this distribution probably ought to have high variance, but when dealing with a rough estimate that has very high variance of its own, it may still be quite a meaningful prior.

I have seen some using the EEV framework who can tell that their estimates seem too optimistic, so they make various “downward adjustments,” multiplying their EEV by apparently ad hoc figures (1%, 10%, 20%). What isn’t clear is whether the size of the adjustment they’re making has the correct relationship to (a) the weakness of the estimate itself, (b) the strength of the prior, and (c) the distance of the estimate from the prior. An example of how this approach can go astray can be seen in the “Pascal’s Mugging” analysis above: assigning one’s framework a 99.99% chance of being totally wrong may seem to be amply conservative, but in fact the proper Bayesian adjustment is much larger and leads to a completely different conclusion.

Heuristics I use to address whether I’m making an appropriate prior-based adjustment

  • The more action is asked of me, the more evidence I require. Anytime I’m asked to take a significant action (giving a significant amount of money, time, effort, etc.), this action has to have higher expected value than the action I would otherwise take. My intuitive feel for the distribution of “how much my actions accomplish” serves as a prior - an adjustment to the value that the asker claims for my action.
  • I pay attention to how much of the variation I see between estimates is likely to be driven by true variation vs. estimate error. As shown above, when an estimate is rough enough so that error might account for the bulk of the observed variation, a proper Bayesian approach can involve a massive discount to the estimate.
  • I put much more weight on conclusions that seem to be supported by multiple different lines of analysis, as unrelated to one another as possible. If one starts with a high-error estimate of expected value, and then starts finding more estimates with the same midpoint, the variance of the aggregate estimate error declines; the less correlated the estimates are, the greater the decline in the variance of the error, and thus the lower the Bayesian adjustment to the final estimate. This is a formal way of observing that “diversified” reasons for believing something lead to more “robust” beliefs, i.e., beliefs that are less likely to fall apart with new information and can be used with less skepticism.
  • I am hesitant to embrace arguments that seem to have anti-common-sense implications (unless the evidence behind these arguments is strong) and I think my prior may often be the reason for this. As seen above, a too-weak prior can lead to many seemingly absurd beliefs and consequences, such as falling prey to “Pascal’s Mugging” and removing the incentive for investigation of strong claims. Strengthening the prior fixes these problems (while over-strengthening the prior results in simply ignoring new evidence). In general, I believe that when a particular kind of reasoning seems to me to have anti-common-sense implications, this may indicate that its implications are well outside my prior.
  • My prior for charity is generally skeptical, as outlined in this post. Giving well seems conceptually quite difficult to me, and it’s been my experience over time that the more we dig on a cost-effectiveness estimate, the more unwarranted optimism we uncover. Also, having an optimistic prior would mean giving to opaque charities, and that seems to violate common sense. Thus, we look for charities with quite strong evidence of effectiveness, and tend to prefer very strong charities with reasonably high estimated cost-effectiveness to weaker charities with very high estimated cost-effectiveness.
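
The heuristic about multiple lines of analysis can be given a rough numerical form. The sketch below assumes n estimates that share the same midpoint, the same error standard deviation, and the same pairwise error correlation, and that they are combined by simple averaging; all names and numbers are hypothetical and chosen only to show the direction of the effect.

```python
def aggregate_error_var(error_sd: float, n: int, rho: float) -> float:
    """Error variance of the simple average of n estimates.

    Assumes each estimate has error sd `error_sd` and every pair of errors
    has correlation `rho`: Var = sd^2 * (1 + (n - 1) * rho) / n.
    """
    return error_sd**2 * (1 + (n - 1) * rho) / n


def weight_on_estimate(prior_sd: float, error_var: float) -> float:
    """Share of the adjusted mean that comes from the estimate, not the prior."""
    return prior_sd**2 / (prior_sd**2 + error_var)


PRIOR_SD = 1.0  # hypothetical prior spread
ERROR_SD = 5.0  # hypothetical error in each individual estimate

for n in (1, 4, 16):
    for rho in (0.0, 0.5, 1.0):
        var = aggregate_error_var(ERROR_SD, n, rho)
        w = weight_on_estimate(PRIOR_SD, var)
        print(f"n={n:>2}, correlation={rho:.1f}: error variance {var:6.2f}, "
              f"weight on estimate {w:.3f}")
```

With perfectly correlated errors, extra estimates add nothing; with uncorrelated errors, each additional estimate shrinks the aggregate error and so reduces the discount applied to the shared midpoint.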

Conclusion

  • I feel that any giving approach that relies only on estimated expected-value - and does not incorporate preferences for better-grounded estimates over shakier estimates - is flawed.
  • Thus, when aiming to maximize expected positive impact, it is not advisable to make giving decisions based fully on explicit formulas. Proper Bayesian adjustments are important and are usually too difficult to formalize.
March 19th, 2010

Cost-effectiveness estimates: inside the sausage factory

We’ve long had mixed feelings about cost-effectiveness estimates of charitable programs, i.e., attempts to figure out “how much good is accomplished per dollar donated.”

The advantages of these estimates are obvious. If you can calculate that program A can help many more people - with the same funds, and in the same terms - than program B, that creates a strong case (arguably even a moral imperative) for funding program A over program B. The problem is that by the time you get the impact of two different programs into comparable “per-dollar” terms, you’ve often made so many approximations, simplifications and assumptions that a comparison isn’t much more meaningful than a roll of the dice. In such cases, we believe there are almost always better ways to decide between charities.

This post focuses on the drawbacks of cost-effectiveness estimates. I’m going to go through the details of what we know about one of the best-known, most often-cited cost-effectiveness figures there is: the cost per disability-adjusted life-year (DALY) for deworming schoolchildren. The DALY is probably the single most widely cited and accepted “standardized” measure of social impact within the unusually quantifiable area of health.

Note that various versions of this figure:

  • Occupy the “top spot” in the Disease Control Priorities Report’s chart of “Cost-effectiveness of Interventions Related to Low-Burden Diseases” (see page 42 of the full report). (I’ll refer to this report as “DCP” for the rest of this post.)
  • Are featured in a policy briefcase by the Poverty Action Lab (which we are fans of), calling deworming a “best buy for education and health.”
  • Appear to be the primary factor in the decision by Giving What We Can (a group that promotes both more generous and more intelligent giving) to designate deworming-related interventions as its top priority (see the conclusion of its report on neglected tropical diseases), and charities focused on these interventions as its two top-tier charities.

I don’t feel that all the above uses of this figure are necessarily inappropriate (details in the conclusion of this post). But I do feel that they point to the worthiness of inspecting this figure closely, and it is important to be aware of the following issues.

  1. The estimate is likely based on successful, thoroughly observed programs and may not be representative of what one would expect from an “average” deworming program.
  2. The estimate appears to rely on an assumption of continued successful treatment over time, an assumption which could easily be problematic in certain cases.
  3. A major input into the estimate is the prevalence of worm infections. In general, prevalence data is itself the product of yet more estimations and approximations.
  4. Many factors in cost-effectiveness, positive and negative, appear to be ignored in the estimate simply because they cannot be quantified.
  5. Different estimates of the same program’s cost-effectiveness appear to strongly contradict each other.

Details follow.

Issue 1: the estimate is likely based on successful, thoroughly observed programs.

The Poverty Action Lab estimate of $5 per DALY is based on a 2003 study by Miguel and Kremer of a randomized controlled trial in Kenya. As the subject of an unusually rigorous evaluation, this program likely had an unusual amount of scrutiny throughout (and may also have been picked in the first place partly for its likelihood of succeeding). In addition, this program was carried out by a partnership between the Kenyan government and a nonprofit, ICS (pg 165), that has figured prominently in numerous past evaluations (for example, see this 2003 review of rigorous studies on education interventions).

In this sense, it seems reasonable to view its results as “high-end/optimistic” rather than “representative of what one would expect on average from a large-scale government rollout.”

Note also that the program included a significant educational component (pg 169). The quality of hygiene education, in particular, might be much higher in a closely supervised experiment than in a large-scale rollout.

It is less clear whether the same issue applies to the DCP estimate, because the details and sources for the estimate are not disclosed (see box on page 476). However,

  • The other studies referenced throughout the chapter appear to be additional “micro-level” evaluations - i.e., carefully controlled and studied programs - as opposed to large-scale government-operated programs.
  • The DCP’s cost-effectiveness estimate for combination deworming (the program most closely resembling the program discussed in Miguel & Kremer) is very close to the Miguel & Kremer estimate of $5 per DALY. (There is some ambiguity on this point - more on this under Issue 5 below.)

Issue 2: the estimate appears to rely on an assumption of continued successful treatment over time, an assumption which could easily be problematic in certain cases.

Miguel & Kremer states:

single-dose oral therapies can kill the worms, reducing … infections by 99 percent … Reinfection is rapid, however, with worm burden often returning to eighty percent or more of its original level within a year … and hence geohelminth drugs must be taken every six months and schistosomiasis drugs must be taken annually. (pg 161)

Miguel & Kremer emphasizes the importance of externalities (i.e., the fact that eliminating some infections slows the overall transmission rate) in cost-effectiveness (pg 204), and it therefore seems important to ask whether the “$5 per DALY” estimate is made (a) assuming that periodic treatment will be sustained over time, or (b) assuming that it won’t be.

Miguel & Kremer doesn’t explicitly spell out the answer, but it seems fairly clear that (a) is in fact the assumption. The study states that the program averted 649 DALYs (pg 204) over two years (pg 165), of which 99% could be attributed to aversion of moderate-to-heavy schistosomiasis infections (pg 204). Such infections have a disability weight of 0.006 per year, so this is presumably equivalent to averting over 100,000 years ((649*99%)/0.006) of schistosomiasis infection - even though well under 10,000 children were even loosely included in the project (including control groups and including pupils near but not included in the program - see pg 167). Even if a higher-than-standard disability weight was used, it seems fairly clear that many years of “averted infection” were assumed per child.
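
The back-of-the-envelope arithmetic here can be checked in a few lines. This is only a sketch of the reasoning in the paragraph above: it treats DALYs averted as simply disability weight times years of infection averted (ignoring any discounting or age weighting the study may have applied), and it uses 10,000 as a deliberately generous upper bound on the number of children. Variable names are mine.

```python
dalys_averted = 649          # reported by Miguel & Kremer (pg 204)
schisto_share = 0.99         # share attributed to schistosomiasis infections
disability_weight = 0.006    # DALYs per year of moderate-to-heavy infection
children_upper_bound = 10_000  # "well under 10,000" children, so a ceiling

# Years of infection that would need to be averted to produce these DALYs.
infection_years_averted = dalys_averted * schisto_share / disability_weight

# Minimum implied years of averted infection per child.
years_per_child = infection_years_averted / children_upper_bound

print(f"implied infection-years averted: {infection_years_averted:,.0f}")
print(f"implied years per child (at least): {years_per_child:.1f}")
```

Because the true number of children is below the ceiling used here, the implied years of averted infection per child would be higher still, far more than the two years the program was observed.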

In my view, this is the right assumption to make in creating the cost-effectiveness estimate … as long as the estimate is used appropriately, i.e., as an estimate of how cost-effective a deworming program would be if carried out in a near-ideal way, including a sustained commitment over time.

However, it must be noted that sustaining a program over time is far from a given, especially for organizations hoping for substantial and increasing government buy-in over time. As we will discuss in a future post, one of the major deworming organizations appears to have aimed to pass its activities to the government, with unclear/possibly mixed results. And as we have discussed before, there are vivid examples of excellent, demonstrably effective projects failing to achieve sustainability in the past.

Does the DCP’s version of the estimate make a similar assumption? Again, we do not have the details of the estimate, but the DCP chapter - like the Miguel & Kremer paper - stresses the importance of “Regular chemotherapy at regular intervals” (pg 472).

One more concern along these lines: even if a program is sustained over time, there may be “diminished efficacy with frequent and repeated use … possibly because of anthelmintic resistance” (pg 472).

Extrapolation from a short-term trial to long-term effects is probably necessary to produce an estimate, but it further increases the uncertainty.

Issue 3: cost-effectiveness appears to rely on disease incidence/prevalence data that itself is the product of yet more estimations and approximations.

The Miguel & Kremer study took place in an area with extremely high rates of infection: 80% prevalence of schistosomiasis (where schistosomiasis treatment was applied), and 40-80% prevalence of three other infections (see pg 168). The DCP emphasizes the importance of carrying out the intervention in high-prevalence areas (for example, see the box on page 476). Presumably, the program should be carried out in as high-prevalence areas as possible for maximum cost-effectiveness.

The problem is that prevalence data may not be easy to come by. The Global Burden of Disease report describes using a variety of elaborate methods to estimate prevalence, including “environmental data derived from satellite remote sensing” as well as mathematical modeling (see pg 80). Though I don’t have a source for this statement, I recall either a conversation or a paper making a fairly strong case that data on neglected tropical diseases is particularly spotty and unreliable, likely because it is harder to measure morbidity than mortality (the latter can be collected from death records; the former requires more involved examinations and/or judgment calls and/or estimates).

Issue 4: many factors in cost-effectiveness appear to be ignored in the estimate simply because they cannot be quantified.

Both positive and negative factors have likely been ignored in the estimate, including:

  • Possible negative health effects of the deworming drugs themselves (DCP pg 479). (Negative impact on cost-effectiveness)
  • Possible development of resistance to the drugs, and thus diminishing efficacy, over time (mentioned above). (Negative impact on cost-effectiveness)
  • Possible interactions between worm infections and other diseases including HIV/AIDS (DCP pg 479), which may increase the cost-effectiveness of deworming. (Positive impact on cost-effectiveness)
  • The question of whether improving some people’s health leads them to contribute back to their families, communities, etc. and improve others’ lives. This question applies to any health intervention, but not necessarily to the same degree, since different programs affect different types of people. From what I’ve seen, there is very little available basis for making any sorts of estimates of such differences.

Issue 5: different estimates of the same program’s cost-effectiveness appear to strongly contradict each other.

The DCP’s summary of cost-effectiveness alone (box on pg 476) raises considerable confusion:

the cost per DALY averted is estimated at US $3.41 for STH infections [the type of infection treated with albendazole] … The estimate of cost per DALY is higher for schistosomiasis relative to STH infections because of higher drug costs and lower disability weights … the cost per DALY averted ranges from US$3.36 to US$6.92. However, in combination, treatment with both albendazole and PZQ proves to be extremely cost-effective, in the range of US$8 to US$19 per DALY averted.

The language seems to strongly imply that the combination program is more effective than treating schistosomiasis alone, but the numbers given imply the opposite. Our guess is actually that the numbers are inadvertently switched. To one taking the numbers too literally, the expected “cost-effectiveness” of a donation could be off by a factor of 2-5 depending on this question of copy editing.

Comparing this statement with the Miguel & Kremer study adds more confusion. The DCP estimates albendazole-only treatment at $3.41 per DALY, which appears to be better than (or at least at the better end of the range for) the combination program. However, the Miguel & Kremer study estimates that albendazole-only treatment is far less cost-effective than the combination program, at $280 per DALY (pg 204).
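As a quick illustration of how far apart the two albendazole-only figures are, the sketch below simply divides one by the other. It assumes the two figures measure the same quantity (cost per DALY averted, in US dollars), which the sources may not strictly guarantee; the variable names are illustrative.

```python
# Albendazole-only cost per DALY averted (US$), from the two sources cited above.
dcp_estimate = 3.41              # DCP, box on pg 476
miguel_kremer_estimate = 280.00  # Miguel & Kremer, pg 204

# How many times more expensive per DALY the Miguel & Kremer figure is.
disagreement = miguel_kremer_estimate / dcp_estimate

print(f"The two estimates differ by a factor of about {disagreement:.0f}")
```

The two estimates differ by a factor of roughly 80, far more than the factor of 2-5 at stake in the copy-editing question.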

Perhaps the DCP envisions albendazole treatment carried out in a different way or in a different type of environment. But given that the Miguel & Kremer study appears to be examining a fairly suitable environment for albendazole-only treatment (see above comments about high infection prevalence and strong program execution), this would indicate that cost-effectiveness is extremely sensitive to subtle changes in the environment or execution.

Bottom line

There is a lot of uncertainty in this estimate, and this uncertainty isn’t necessarily “symmetrical.” Estimates of different programs’ cost-effectiveness, in fact, could be colored by very different degrees of optimistic assumptions.

Despite all of the above issues, I don’t find the cost-effectiveness estimate discussed here to be meaningless or useless.

Researchers’ best guesses put the cost-effectiveness of deworming in the same ballpark as that of other high-priority interventions such as vaccines, tuberculosis treatment, etc. (I do note that many of these appear to have more robust evidence bases behind their cost-effectiveness - for example, estimated effects of large-scale government programs are sometimes available, giving an extra degree of context.)

I think it is appropriate to say that available evidence suggests that deworming can be as cost-effective as any other health intervention.

I think it is appropriate to call deworming a “best buy,” as the Poverty Action Lab does.

I do not think it is appropriate to conclude that deworming is more cost-effective than vaccinations, tuberculosis treatment, etc. I think it is especially inappropriate to conclude that deworming is several times more cost-effective than vaccinations, tuberculosis treatment, etc.

Most of all, I do not think it is appropriate to expect results in line with this estimate just because you donate to a deworming charity. I believe cost-effectiveness estimates usually represent “what you can achieve if the program goes well” more than they represent “what a program will achieve on average.”

In my view, the greatest factor behind the realized cost-effectiveness of a program is the specifics of who carries it out and how.