In September, we announced the Change Our Mind Contest for critiques of our cost-effectiveness analyses. Today, we’re excited to announce the winners!
We’re very grateful that so many people engaged deeply with our work. This contest was GiveWell’s most successful effort so far to solicit external criticism from the public, and it wouldn’t have been possible without the participation of people who share our goal of allocating funding to cost-effective programs.
Overall, we received 49 entries engaging with our prompts. We were very happy with the quality of entries we received—their authors brought a great deal of thought and expertise to engaging with our cost-effectiveness analyses.
Because we were impressed by the quality of entries, we’ve decided to award two first-place prizes and eight honorable mentions. (We stated in September that we would give a minimum of one first-place, one runner-up, and one honorable mention prize.) We also awarded $20,000 to the piece of criticism that inspired this contest.
Winners are listed below, followed by our reflections on this contest and responses to the winners.
The prize-winners
Given the overall quality of the entries we received, selecting a set of winners required a lot of deliberation.
We’re still in the process of determining which critiques to incorporate into our cost-effectiveness analyses and to what extent they’ll change the bottom line; we don’t agree with all the critiques in the first-place and honorable mention entries, but each prize-winner raised issues that we believe were worth considering. In several cases, we plan to further investigate the questions raised by these entries.
Within categories, the winners are listed alphabetically by the last name of the author who submitted the entry.
First-place prizes – $20,000 each[1]
- Noah Haber for “GiveWell’s Uncertainty Problem.” The author argues that without properly accounting for uncertainty, GiveWell is likely to allocate its portfolio of funding suboptimally, and proposes methods for addressing uncertainty.
- Matthew Romer and Paul Romer Present for “An Examination of GiveWell’s Water Quality Intervention Cost-Effectiveness Analysis.” The authors suggest several changes to GiveWell’s analysis of water chlorination programs, which overall make Dispensers for Safe Water’s program appear less cost-effective.
To give a general sense of the magnitude of the changes we currently anticipate, our best guess is that Matthew Romer and Paul Romer Present’s entry will change our estimate of the cost-effectiveness of Dispensers for Safe Water by very roughly 5 to 10%, and that Noah Haber’s entry may lead to an overall shift in how we account for uncertainty (but it’s too early to say how it would impact any given intervention). Overall, we currently expect that entries to the contest may shift the allocation of resources between programs but are unlikely to lead to us adding or removing any programs from our list of recommended charities.
Honorable mentions – $5,000 each
- Alex Bates for a critical review of GiveWell’s 2022 cost-effectiveness model
- Dr. Samantha Field and Dr. Yannish Naik for “A critique of GiveWell’s CEA model for Conditional Cash Transfers for vaccination in Nigeria (New Incentives)“
- Akash Kulgod for “Cost-effectiveness of iron fortification in India is lower than GiveWell’s estimates“
- Sam Nolan, Hannah Rokebrand, and Tanae Rao for “Quantifying Uncertainty in GiveWell Cost-Effectiveness Analyses“
- Isobel Phillips for “Improving GiveWell’s modelling of insecticide resistance may change their cost per life saved for AMF by up to 20%“
- Tanae Rao and Ricky Huang for “Hard Problems in GiveWell’s Moral Weights Approach“
- Dr. Dylan Walters, Alison Greig, Steve Gilbert, Dr. Mandana Arabi, and Sameen Ahsan for critiques of GiveWell’s overall methodology and conceptual approach as well as critiques of the methodology in our vitamin A supplementation analysis
- Trevor Woolley and Ethan Ligon for “GiveWell’s Moral Weights Underweight the Value of Transfers to the Poor“
Participation prizes – $500 each
39 entries, not individually listed here.
All entries that met our criteria will receive participation prizes if they didn’t win a larger prize. To meet our requirements, authors had to share a critique that addressed our cost-effectiveness analysis and proposed a change that could make a material difference to our bottom line—this is no small feat, and we really appreciate everyone who took the time to do so!
Prize for inspiring the Change Our Mind Contest – $20,000
Joel McGuire, Samuel Dupret, and Michael Plant for “Deworming and decay: replicating GiveWell’s cost-effectiveness analysis.”
In July 2022, these three researchers at the Happier Lives Institute shared a critique of how GiveWell models the long-term benefits of deworming; they argue we should treat those benefits as decaying over time rather than remaining constant. We responded to their critique here. We’re in the process of incorporating this critique, and our best guess is that it will lead to a 10% to 30% decrease in our estimate of the cost-effectiveness of deworming, which we roughly estimate would have influenced $2 to $8 million in funding.
Because this work influenced our thinking and played a role in prompting the Change Our Mind Contest, we decided to make a grant of $20,000 to the Happier Lives Institute.
Logistics for prize-winners
We will be emailing the author who submitted each prize-winning entry, including those that won participation prizes. If you and your co-authors have not received an email by early January, please feel free to reach out to change-our-mind@givewell.org.
Reflections on this contest
There’s a robust community of people who are excited to engage with our work.
We received 49 entries that met the contest criteria, all of which represented meaningful engagement with our work. These entries came from a wide range of people—from health economists and from people in entirely unrelated fields, from the global health community and from the effective altruism community, from students and from professionals with years of work experience.
People submitted entries on many different topics. We received at least two entries on each of the six cost-effectiveness analyses we pointed people toward, plus some entries on other programs, and many cross-cutting entries on issues like uncertainty, the discount rate for future benefits, our moral weights, and more.
In order to manage all the suggestions we received, one of our researchers reviewed all 49 entries and created a dashboard for tracking the 100 discrete suggestions we identified. For each of those, we’re tracking whether we plan to do additional work to address the suggestion and how high-priority that work is.
We’re so glad that people were excited to contribute to our decision-making, and we’ll be continuing to look for ways to collaborate with the public to improve our work.
We have room for improvement, particularly on transparency.
We’re proud of being an unusually transparent research organization; transparency is one of our core values. Transparency has two facets: making information publicly available and making it easy to understand. We generally succeed at publishing the information that drives our decisions. But we could do more to enable people to understand why we believe what we believe.
Some entries proposed changes that are actually very similar to what we’re already doing, but where the authors didn’t realize that because of the way our work is presented (e.g., a calculation takes place in a separate spreadsheet, or the name of a parameter doesn’t clearly represent its purpose). In other cases, entries flagged areas where the assumptions underlying our judgments aren’t apparent (e.g., in the case of development effects from averted cases of malaria). We appreciate these authors bringing those issues to our attention, and we hope to improve the clarity of our work!
People brought us new ideas—and old ones we hadn’t implemented.
Some entries covered ideas we hadn’t considered but found worth pursuing. For example, an entry arguing that iron fortification might be less cost-effective than we think inspired us to dig into the questions it raised about the prevalence of iron deficiency anemia in India. In this case, our current best guess is that our view on iron fortification won’t change much, but we believe it’s worthwhile for us to consider this issue.
Other entries covered issues we were aware of but hadn’t resolved. For example, we’ve known for a while that a calculation in our cost-effectiveness analysis for the Against Malaria Foundation is both poorly structured and presented in a confusing way. A few entries flagged perceived issues with how we calculate mortality in that analysis, and some authors thought the calculations were mistaken in a way that would have a significant impact on our bottom line (e.g., some entries understandably believed we’d failed to account for indirect deaths from malaria). We don’t think any of these entries captured the precise problem with the current calculation, but they homed in on a weak point in our analysis. Several months ago we created a revised version of our internal analysis that fixes the issue, but we haven’t yet finalized and published it. We’re likely to publish this revision in the next few months. In general, people flagging a known issue can help us prioritize changes.
This contest was worth doing.
We haven’t done anything like this before, and we weren’t really sure what to expect. We saw this as an opportunity to lean into our values, particularly transparency and truth-seeking, in service of helping people as much as we can with our funding decisions. The contest succeeded in that goal; we identified improvements we can make to our cost-effectiveness analyses in terms of both accuracy and clarity. And beyond that, this contest established that there are people who care deeply about our work and want to help us improve it. To everyone who participated—thank you!
Appendix: Discussion of winning entries
In this section, we share our initial thoughts on the two first-place entries. This appendix is probably more technical than will be of interest to most readers.
Noah Haber on uncertainty
Several entries focused on how GiveWell could improve its approach to uncertainty. This entry stood out for its clear demonstration of how failing to account for uncertainty can lead to suboptimal allocations, even in a risk-neutral framework.
In brief, this entry argues that when prioritizing by estimated expected value, one will sometimes select more uncertain programs whose true values are lower, over less uncertain programs whose true values are higher. This “optimizer’s curse” or “winner’s curse” can create a portfolio that is systematically less valuable than it could be if uncertainty was properly accounted for. This issue has been raised before, but we haven’t ever fully addressed it.[2]
We’d like to consider the issues presented in this post and other recent criticisms of our approach to uncertainty in more depth. In the meantime, we’ll share some initial thoughts:
- We really appreciated that this piece drew a clear link between incorporating uncertainty and the ranking of programs. It shows that if we don’t account for uncertainty explicitly, we may be allocating too much funding to more uncertain programs, which lowers the value of our overall funding allocation.
- Currently, we make ad hoc adjustments for uncertainty, such as our strict internal validity adjustment for deworming. However, we haven’t adopted any rules for penalizing more uncertain programs, either quantitatively or qualitatively. This entry updates us toward believing we should consider a more systematic approach.
- We’re not sure if conducting the full uncertainty modeling recommended by the entry is the right approach for GiveWell, and we’d like to explore alternative approaches to addressing this issue.
- The entry argues the best approach would be to model uncertainty using a probabilistic sensitivity analysis (PSA). This would involve selecting and parameterizing probability distributions for key parameters; running repeat simulations to obtain a distribution of potential outcomes; and using this distribution as the basis for decision-making.
- We’re not sure this is the right approach because we think there could be some important drawbacks. It could make our models less accessible to external readers, make it difficult to compare across models if uncertainty isn’t accounted for equally across programs, and make it harder to understand intuitively what’s driving our bottom line on which programs are more cost-effective. We would want to weigh those downsides against the benefits of PSA.
- On the other hand, if we find that this problem leads to a sufficiently large impact on the value of our allocations, it might be worth the costs of a more complicated modeling approach. We’d like to do more work to explore how big of an impact this problem has on our portfolio.
Overall, we think handling uncertainty is an important issue, and we appreciate the nudge to consider it more deeply!
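To make the optimizer’s curse concrete, here is a minimal sketch (our own illustration with made-up numbers, not Noah Haber’s code or GiveWell’s model). It simulates three hypothetical programs with identical true cost-effectiveness but different amounts of estimation noise, then always “funds” the one with the highest point estimate. The selected program’s point estimate is systematically inflated while its true value is not, and a PSA-style summary such as the 20th percentile of each program’s estimates penalizes the noisier programs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical programs: identical true cost-effectiveness (5x cash),
# but the noisier programs' estimates carry more error.
true_value = np.array([5.0, 5.0, 5.0])
estimate_sd = np.array([0.5, 1.5, 3.0])

n_sims = 100_000
estimates = rng.normal(true_value, estimate_sd, size=(n_sims, 3))

# Naive rule: fund whichever program has the highest point estimate.
picked = estimates.argmax(axis=1)

print("Share of simulations in which each program is picked:",
      np.bincount(picked) / n_sims)
print("Average point estimate of the picked program:",
      round(float(estimates.max(axis=1).mean()), 2))   # inflated by selection
print("Average true value of the picked program:",
      round(float(true_value[picked].mean()), 2))      # exactly 5.0

# A PSA-style view looks at the whole distribution, not just the mean;
# for example, the 20th percentile penalizes the noisier programs.
print("20th percentile of each program's estimates:",
      np.percentile(estimates, 20, axis=0).round(2))
```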
Matthew Romer and Paul Romer Present on water
This entry clearly presented a series of plausible changes to how we estimate the cost-effectiveness of water quality interventions, specifically Dispensers for Safe Water (DSW) and in-line chlorination (ILC). We believe the authors understand our analysis well and were able to identify some weak points in our cost-effectiveness analysis; we expect to make some but not all of the changes they propose.
To briefly summarize in our own words, this entry suggests that GiveWell should:
- Include Haushofer et al. (2021) in its meta-analysis on the effect of water chlorination on all-cause mortality.
- Use a formal Bayesian approach rather than a “plausibility cap” to estimate the effect of water chlorination on all-cause mortality.
- Revise our estimate for the age distribution of deaths averted by water chlorination, as well as for the medical costs averted.
- Discount both future costs and future benefits to account for changes over time.
- Revise the cost estimates for ILC.
- Review a calculation in our leverage and funging adjustment that they believe may contain an error.
We share some initial thoughts here, noting that we’re still in the process of deciding whether and how to incorporate these suggestions:
- Including Haushofer et al.: The choice to exclude Haushofer et al. was difficult, but we’re currently comfortable with our decision to exclude it. See more on this page, including in footnote 39. For our practical decision-making (versus a context like Cochrane meta-analyses where stricter decision rules might be needed), we think it makes sense to exclude it given (a) the fact that we find the implied intervention effect implausible across the 95% confidence interval; (b) the divergence of these results from the other strongest pieces of evidence we have; and (c) the large effect that including it would have on the pooled result.
- Using a formal Bayesian approach: We’re planning to consider this in more depth. Estimating effect sizes is difficult in cases where the point estimates from available evidence seem implausibly high to us. As the authors note, we’ve used a Bayesian approach in some of our other analyses (e.g., deworming), and it might be reasonable to use here. The plausibility cap we’re currently using seems like one reasonable approach in this context, but we haven’t fully explored other approaches. If we don’t move to a Bayesian approach, we may still make other changes inspired by this point.
- Revising the ages of deaths averted and medical costs averted:
- On the ages of deaths averted: We think it’s reasonable to use the age structure of direct deaths from enteric infections for indirect deaths as well, since those deaths are still linked to enteric infections; the underlying idea is that enteric infections increase the risk of other diseases.
- On medical costs averted: Our published cost-effectiveness analysis uses an outdated method to estimate medical costs averted by water quality interventions, and we’re now internally using a method that we believe is better aligned with our analyses for interventions like those conducted by our top charities.
- Discounting future benefits and costs: We hadn’t realized that we’re treating future deaths differently in the New Incentives analysis—thank you for flagging that. For grants where the benefits occur more than a few years in the future (like this Dispensers for Safe Water grant), we generally want to account for both changing disease burdens (in this case, a decline in deaths from diarrhea) and general uncertainty over time, and we didn’t do that in this case. We’re less sure that we’d want to discount costs (versus benefits) in the future, given (a) consistency with our other analyses and (b) the fact that from our perspective, the costs are “spent” when we decide to make a grant.
- Revising costs of ILC: It’s true that we’re using very rough cost figures for ILC in our published analysis. We expect to learn more over time and incorporate that in future analyses we publish.
- Reviewing funging calculation: This seems like a likely error (that makes a small difference to the bottom line). We’ll correct it if upon further review we confirm that it’s an error!
Our best guess overall is that after more thoroughly reviewing these suggestions, we’ll revise our water quality cost-effectiveness analysis and our estimate of the cost-effectiveness of Dispensers for Safe Water will change by very roughly 5 to 10%.
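On the Bayesian point above, here is a minimal sketch of the kind of formal update the authors suggest (our own illustration with made-up numbers, not their model or our analysis): a normal-normal conjugate update shrinks a surprisingly large study effect toward a skeptical prior by precision weighting, whereas a plausibility cap simply truncates the estimate.

```python
import numpy as np

# Illustrative numbers only: a study implies a 30% mortality reduction
# (relative risk 0.70) with a wide standard error on the log scale, while
# a skeptical prior, informed by other evidence, centers on a 10% reduction.
study_effect, study_se = np.log(0.70), 0.15
prior_mean, prior_sd = np.log(0.90), 0.08

# Normal-normal conjugate update: a precision-weighted average on the log scale.
post_precision = 1 / study_se**2 + 1 / prior_sd**2
post_mean = (study_effect / study_se**2 + prior_mean / prior_sd**2) / post_precision
post_sd = post_precision ** -0.5

print("Posterior relative risk: %.2f (95%% interval %.2f to %.2f)"
      % (np.exp(post_mean),
         np.exp(post_mean - 1.96 * post_sd),
         np.exp(post_mean + 1.96 * post_sd)))

# A plausibility cap, by contrast, just truncates the study estimate:
cap_rr = 0.85  # hypothetical cap: allow at most a 15% mortality reduction
print("Capped relative risk:", max(np.exp(study_effect), cap_rr))
```

One appeal of the shrinkage approach is that the final estimate moves smoothly with the strength of the evidence rather than jumping at a threshold.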
Notes
[1] Both of these entries were outstanding, and they represent very different approaches. Because they are similarly excellent, we are naming two winners rather than one winner and one runner-up.
[2] For example, a former GiveWell researcher wrote this post, which makes a different argument from Noah Haber’s piece but addresses a similar problem.
Comments
I appreciated Noah Haber’s piece on “GiveWell’s Uncertainty Problem” and GiveWell’s response, and I’m very interested to see what changes this leads to.
I’m curious about a couple things:
– Does the process of recommending charities account for uncertainty in any way that modeling them doesn’t? For example, would the cost-effectiveness estimate from a program with less or worse evidence behind it be taken more skeptically?
– Where are ad-hoc adjustments like the large adjustment for deworming already used?
For what it’s worth, I’m sure the statistical issues Haber’s piece highlights are real. These issues are well-understood in areas I have worked in. For example, teacher value-added models based on student test scores adjust for regression to the mean; if they didn’t, average teachers with little data would often appear extremely good or extremely bad. And in online experiments (A/B tests), small sample sizes make it much more likely that an effect will be overstated, and that a follow-up experiment would fail to replicate the initial one.
If a probabilistic parameter sensitivity analysis is too complicated, I think the New York Times buy or rent calculator is a great illustration of a simpler approach: https://www.nytimes.com/interactive/2014/upshot/buy-rent-calculator.html
You can play with the parameter sliders to see how they affect the final recommendation. Parameters can affect the cost of buying, renting, or both. This exercise makes transparent which parameters are really important. You can do the same thing with the existing spreadsheets by changing parameters and seeing what happens, but a UI like the NYT’s makes it a lot easier to build intuition.
One more thing on the topic of uncertainty: Uncertainty in denominators is a particular bugbear and can lead to huge biases. A probabilistic model can fix this in theory, but in practice can make issues much harder to spot.
Imagine the following situation, inspired by the real case of SMS reminders for vaccination. Say we know with certainty that the value of a reminder is $2, but we aren’t sure about the cost. It could be $0.10 or even less, since texting is cheap, but it could be as much as $1 if it’s hard to obtain phone numbers. The cost-effectiveness is $2 / cost. There are a few ways to estimate that:
A) Make a few guesses of what the cost might be, average them into a best-guess cost, then divide. So if the cost is between $0.10 and $1, a best guess is AVG($0.10, $1) = $0.55, and the cost-effectiveness is $2 / AVG($0.10, $1) = 3.6.
B) Compute cost-effectiveness in two different scenarios, one where the cost is $0.10 and one where it is $1. So in one scenario it’s $2 / $0.10 = 20, and in another it’s $2 / $1 = 2. Our final estimate is AVG($2 / $0.10, $2 / $1) = AVG(20, 2) = 11.
C) Use a statistical distribution over all possible costs and integrate over those scenarios. This is similar to (B), but represents all possibilities, not just two. Say our prior is that the cost is uniformly distributed between $0.001 (since texts might be really cheap) and $1. That gives a result of 13.8. (At least if you do it right! Monte Carlo analyses with fewer than 10,000 simulations will be noisy and unreliable.)
So different methods give wildly different results here. Which is right? I would say none should inspire confidence. But Method B, working all the way through a couple scenarios, makes the takeaway clearest: the possibility of very low costs drives the potential cost-effectiveness here, and getting better information on cost should be a high priority. The other approaches could obscure this.
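For readers who want to check the numbers, here is a minimal script for the three methods described above, using the same $2 value and the uniform cost assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
value = 2.0  # value of a reminder, in dollars

# A) divide by a best-guess (average) cost
print("A:", value / np.mean([0.10, 1.0]))          # ~3.6

# B) average the cost-effectiveness of two scenarios
print("B:", np.mean([value / 0.10, value / 1.0]))  # 11.0

# C) integrate over a uniform cost distribution via Monte Carlo
costs = rng.uniform(0.001, 1.0, size=1_000_000)
print("C:", round(float(np.mean(value / costs)), 1))  # ~13.8
```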
In “GiveWell’s Uncertainty Problem,” the tip to estimate the 80% lower-bound confidence interval (also known as the 20th percentile) instantly improved my interpretation of a GiveWell estimate. The tip would also inform my own estimates in my software engineering work.
In terms of how to handle the correct analysis being complicated when you need to explain your findings, one approach is to internally do the good analysis, but treat it as a double-check of a simpler analysis that (we hope) usually matches. When they really do differ, you do your best to boil down why, either generically (like “Often, a few studies will point to a strong effect which later studies can’t confirm. A close look at the uncertainties around this program makes us think it’s not yet the best bet for the highest impact”) or even digging into how a particular variable becomes more important once you do a more rigorous analysis.
Alternatively, you could just do the more rigorous analysis, but present it wrapped up in one ‘certainty’ factor so people have something to look at without opening up the black box.
Separately, not about GiveWell’s picks, but EA more broadly: it’s interesting to think about speculative extreme threats like AI risk through this lens. Aside from issues I have with AI arguments specifically, my gut wants to be skeptical when someone says “drop everything and focus on a possible harm we don’t yet see in the world around us.” There are things like climate change, but there we *do* have a pile of evidence that can clear a high bar.
There’s more to the speculation thing than just an analysis about probabilities; a lot of forces push people towards valuing their favorite idea too highly. (I’m sure there’s a ton to read about this that I’m ignorant of.) But the analysis seems relevant. It’s particularly interesting that, even in a world with just random uncertainty, looking towards the low end of the expected range can help avoid a winner’s-curse type situation.
Randall, well said!
(also posted on Noah’s website, though the comment doesn’t seem to be appearing there)
Congratulations Noah!
Three comments:
1. The way I think about the Optimizer’s Curse, the fundamental issue is that expected value does not commute with max. In symbols, for N distributions X_n: E(max(X_n)) >= max(E(X_n)) >= E(X_m). In our context, each X_n corresponds to some intervention. The LHS is a naive estimate of the cost-effectiveness of the naively best distribution, which will exceed the middle term, the true CE of the true best intervention; that in turn will usually exceed the RHS, the true CE of the naively best intervention X_m. (Mathematically, the first inequality is a consequence of the convexity of max, and the second is a trivial consequence of the definition of max.)
The bigger N is, the bigger the gaps will tend to be (of course it also depends on how concentrated the X_n are, and also how correlated they are). I think this matters, because the implicit N will vary by intervention. For example, I would argue that for deworming, N is roughly 3x what it is for SMC. If you thought about mass deworming of school-age children from scratch and had to guess what the main benefit was, it probably wouldn’t be long-term economic benefits without much if any measurable short-term benefit to health or education. In contrast, for SMC the estimated primary benefit — lives saved — is the first, most direct one would think of, not arguably the third as for deworming. If you accept that argument, then threshold-based approaches (like the three you discussed) should have a more punitive threshold for deworming than SMC.
2. Thresholding is a good, common-sense approach. And in practical terms, it might be the way to go. But as you know, of course, we really care about the full distribution. If hypothetically, after correcting for bias from the selection approach, we think intervention A has a 50% chance of being 0.95x cash and a 50% chance of being 20x cash, whereas intervention B has a 20% chance of being 0x cash and an 80% chance of being 2x cash, we’d clearly prefer A (even though B would win on cutoffs between 50% and 80%). Ultimately, what we would like (assuming we are risk neutral anyway) is an EV of CE, after correcting the distribution for the bias introduced by the selection procedure. But that would be hard! For one thing, the full selection procedure isn’t that well-defined, especially at the top of the funnel.
3. My general feeling is GiveWell tends to err a little bit on the conservative side in estimating parameters of its CE models, which may informally help partially correct for OC.
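To illustrate the first point above with a quick simulation (our own sketch with made-up numbers): if several candidate interventions have the same true cost-effectiveness and equally noisy estimates, the naive estimate of the apparent winner, E(max(X_n)), rises with the number of candidates N, while the true value of whatever is picked stays flat.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sims = 200_000

# N candidates, all with true cost-effectiveness 5x cash and estimate sd 2.
for n in (1, 3, 10, 30):
    estimates = rng.normal(5.0, 2.0, size=(n_sims, n))
    picked_estimate = estimates.max(axis=1).mean()
    print(f"N={n:2d}: average naive estimate of the apparent winner = "
          f"{picked_estimate:.2f} (true value of every candidate: 5.00)")
```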
Hi, Elizabeth,
Thank you for your thoughtful comments, and apologies that it took us a while to respond!
To take your questions and comments in turn:
Qualitative evaluations of uncertainty. When the evidence behind a program is more uncertain, we do view our cost-effectiveness estimates with greater skepticism (even when we’ve added quantitative adjustments for those uncertainties), and that plays out in our decision-making. This often happens when we’re evaluating a technical assistance program—where the program activities consist of training, monitoring, or other support instead of directly delivering goods and services—or when we’re looking at a program that advocates policy change. In those cases, we’ll still produce a quantitative estimate of cost-effectiveness, but we might give more weight to qualitative considerations, such as the strength of the implementing organization, than we would for a top charity or another program with a more solid evidence base. An example of a grant in this category is our recommendation of $15 million to RESET Alcohol Initiative to advocate for changes to alcohol policy.
When there are significant uncertainties in our analysis, we also might fund research on the program or additional monitoring activities to improve our data and help us refine our cost-effectiveness estimates. Recent examples include our grants to Bridges to Prosperity and University of Chicago for research on IRD’s mobile conditional cash transfer program. The results of these studies will inform whether we decide to direct further funding to these programs.
Quantitative adjustments for uncertainty, e.g., in deworming. You can see the quantitative adjustments we apply to deworming programs here. (We enter these on the Deworm the World tab of our cost-effectiveness analysis, then carry over an aggregate adjustment to the tabs for all other deworming programs in the spreadsheet.) The evidence for long-term development effects from deworming is mostly based on one study, so we apply a large downward adjustment in case that study is overstating the effects—there’s a chance that further studies conducted on the same population would find smaller effect sizes.
We apply quantitative adjustments similar to these in other cases where the evidence base is highly uncertain and where we have some reason to believe the effect is smaller than the headline estimates. The plausibility cap we used in our analysis of water treatment’s effects on mortality is another example (see discussion in our water quality intervention report under “Mortality reduction”). These adjustments do still depend on a number of subjective judgment calls, leaving a lot of room for disagreement.
We think the qualitative and quantitative adjustments we make for uncertainty may partially address the issues raised in Noah Haber’s critique, but we’d like to explore this further.
Uncertainty on costs. We typically create a best-guess estimate of the cost per person reached, using information on a charity’s spending (and the spending of other organizations or governments involved in the program) and the number of people reached in recent years. For an example, see our cost per output analysis of Malaria Consortium’s seasonal malaria chemoprevention program.
Costs are typically an important source of uncertainty early in investigations because they can vary widely depending on context, operational and programmatic design choices, and the number of people reached. We often find it productive to try to narrow these uncertainties by working with implementing organizations to understand the specifics of potential funding opportunities.
Parameter sensitivity analysis. Thanks for sharing the buy/rent calculator—something like this would be a fun tool for people engaging with our models! We don’t have anything quite as easy to use as that. However, we do try to do this type of parameter sensitivity check in our models internally, for the reasons you mentioned—to help us figure out the parameters our models are most sensitive to, so we can prioritize those for further research and be transparent with readers about the key judgment calls in our analysis. But we haven’t been as consistent about this as we want to be so far, and hope to make this a more systematic part of our process going forward.
I hope that’s helpful!
Best,
Miranda
Congratulations to the winners! That is awesome!