The GiveWell Blog

The importance of “gold standard” studies for consumers of research

There’s been some interesting discussion on how the world of social science experiments is evolving. Chris Blattman worries that there is too much of a tendency toward large, expensive, perfectionist studies, writing:

each study is like a lamp post. We might want to have a few smaller lamp posts illuminating our path, rather than the world’s largest and most awesome lamp post illuminating just one spot. I worried that our striving for perfect, overachieving studies could make our world darker on average.

My feeling – shared by most of the staff I’ve discussed this with – is that the trend toward “perfect, overachieving studies” is a good thing. Given the current state of the literature and the tradeoffs we perceive, I wish this trend were stronger and faster than it is. I think it’s worth briefly laying out my reasoning.

Our relationship to academic research is that of a “consumer.” We don’t carry out research; we try to use existing studies to answer action-relevant questions. The example question I’ll use here is “What are the long-term benefits of deworming?”

In theory, I’d prefer a large number of highly flawed studies on deworming to a small number of “perfectionist” studies, largely for the reasons Prof. Blattman lays out – a large number of studies would give me a better sense of how well an intervention generalizes, and what kinds of settings it is better and worse suited to. This would be my preference if flawed studies were flawed in different and largely unrelated ways.

The problem is my fear that studies’ flaws are systematically similar to each other. I fear this for a couple of reasons:

1. Correlated biases in research methods. One of the most pervasive issues with flawed studies is selection bias. For example, when trying to assess the impact of the infections treated by deworming, it’s relatively easy to compare populations with high infection levels to populations with low infection levels, and attribute any difference to the infections themselves. The problem is that differences in these populations could reflect the impact of deworming, or could reflect other confounding factors: the fact that populations with high infection rates tend to be systematically poorer, have systematically worse sanitation, etc.

If researchers decided to conduct 100 relatively easy, low-quality studies of deworming, it’s likely that nearly all of the studies would take this form, and therefore nearly all would be subject to the same risk of bias; practically no such studies would have unrelated, or opposite, biases. In order to conduct a study without this bias, one needs to run an experiment (or identify a “natural experiment”), which is a significant step in the direction of a “perfectionist” study.
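The confounding worry above can be made concrete with a toy simulation (the model and all numbers are hypothetical, not drawn from any real deworming data): infection has no effect at all on the outcome, yet a naive comparison of infected vs. uninfected people finds a large "effect," because an unobserved confounder (sanitation) drives both infection and outcomes:

```python
import math
import random

random.seed(0)

def simulate_person():
    # Unobserved confounder: worse sanitation -> more infection AND worse outcomes.
    sanitation = random.gauss(0.0, 1.0)
    p_infected = 1.0 / (1.0 + math.exp(2.0 * sanitation))
    infected = random.random() < p_infected
    # The outcome depends only on sanitation; infection itself has NO effect.
    outcome = 2.0 * sanitation + random.gauss(0.0, 1.0)
    return infected, outcome

data = [simulate_person() for _ in range(20000)]

def mean(xs):
    return sum(xs) / len(xs)

# Naive "study": compare outcomes of uninfected vs. infected people.
gap = mean([o for i, o in data if not i]) - mean([o for i, o in data if i])
print(round(gap, 2))  # a large apparent benefit, entirely due to confounding
```

Running 100 such naive studies would produce 100 versions of the same spurious gap, which is the sense in which their biases are correlated.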

Even if we restrict the universe of studies considered to experiments, I think analogous issues apply. For example:

  • Poorly conducted experiments tend to risk reintroducing selection bias. If randomization is poorly enforced, a study can end up looking at “people motivated to receive deworming” vs. “people not as motivated,” which might also be confounded with general wealth, education, sanitation, etc.
  • There are many short-term randomized studies of deworming, but few long-term randomized studies. If my concern is long-term effects, having a large number of short-term studies isn’t very helpful.

2. Correlated biases due to academic culture. The issue that worries me most about academic research is publication bias: the fact that researchers have a variety of ways to “find what they want to find,” from selective reporting of analyses to selective publication. I suspect that researchers have a lot in common in terms of what they “want to find”; they tend to share a common culture, common background assumptions, and common incentives. As a non-academic, I’m particularly worried about being misled because of my limited understanding of such factors. I think this issue is even more worrying when studies are done in collaboration with nonprofits, which systematically share incentives to exaggerate the impact of their programs.

The case of microlending seems like a strong example. When we first confronted the evidence on microlending in 2008, we found a large number of studies, almost all using problematic methodologies, and almost all concluding that microlending had strong positive effects. Since then, a number of “gold standard” studies have pointed to a very different conclusion.
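A toy simulation can sketch how such a literature arises (all numbers are made up for illustration): suppose the true effect of a program is zero, each small study produces a noisy estimate, and only "positive, significant" results get written up. The published record then shows a strong effect that does not exist:

```python
import random

random.seed(1)

TRUE_EFFECT = 0.0
STANDARD_ERROR = 0.5   # noisy, underpowered studies

published = []
for _ in range(1000):
    estimate = random.gauss(TRUE_EFFECT, STANDARD_ERROR)
    z = estimate / STANDARD_ERROR
    if z > 1.64:       # only "positive and significant" results get published
        published.append(estimate)

avg_published = sum(published) / len(published)
# Roughly 5% of studies clear the filter, and every one of them is
# misleadingly positive, so the published average is far above zero.
print(len(published), round(avg_published, 2))
```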

Many of the qualities that make a study “perfectionist” or “overachieving” – such as pre-analysis plans, or collection of many kinds of data that allow a variety of chances to spot anomalies – seem to me to reduce researchers’ ability to “find what they want to find,” and/or improve a critical reader’s ability to spot symptoms of selective reporting. Furthermore, the mere fact that a study is more expensive and time-consuming reduces the risk of publication bias, in my view: it means the study tends to get more scrutiny, and is less likely to be left unpublished if it returns unwanted results.

Some further thoughts on my general support of a trend toward “perfectionist” studies:

Synthesis and aggregation are much more feasible for perfectionist studies. Flawed studies are not only less reliable; they’re more time-consuming to interpret. The ideal study has high-quality long-term data, well-executed randomization, low attrition, and a pre-analysis plan; such a study has results that can largely be taken at face value, and if there are many such studies it can be helpful to combine their results in a meta-analysis, while also looking for differences in setting that may explain their different findings. By contrast, when I’m looking at a large number of studies that I suspect have similar flaws, it seems important to try to understand all the nuances of each study, and there is usually no clearly meaningful way to aggregate them.
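For instance, the standard fixed-effect meta-analysis is just an inverse-variance weighted average, and it is only meaningful when each input estimate can be taken at face value (the numbers below are hypothetical):

```python
def pooled_estimate(estimates, variances):
    """Fixed-effect meta-analysis: inverse-variance weighted average."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    pooled_mean = sum(w * e for w, e in zip(weights, estimates)) / total
    pooled_variance = 1.0 / total  # pooling shrinks the variance
    return pooled_mean, pooled_variance

# Three hypothetical well-run studies of the same effect.
mean, var = pooled_estimate([0.30, 0.10, 0.25], [0.04, 0.09, 0.02])
print(round(mean, 3), round(var, 4))  # 0.245 0.0116
```

If the studies share a common bias, this machinery happily averages the bias in along with the signal, which is why aggregation of suspect studies is not clearly meaningful.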

It’s difficult to assess external validity when internal validity is so much in question. In theory, I care about external validity as much as internal validity, but it’s hard to pick up anything about the former when I am so worried about the latter. When I see a large number of studies pointing in the same direction, I fear this is due to correlated biases rather than to a sign of generalizability; when I see studies finding different things, I suspect that this may be a product of different researchers’ goals and preferences rather than anything to do with differences in setting.

On the flipside, once I’m convinced of a small number of studies’ internal validity, it’s often possible to investigate external validity by using much more limited data that doesn’t even fall under the heading of “studies.” For example, when assessing the impact of a deworming campaign, we look at data on declines in worm infections, or even just data on whether the intervention was delivered appropriately; once we’ve been sold that the basic mechanism is plausible, data along these lines can fill in an important part of the external validity picture. This approach works best for interventions that seem inherently likely to generalize (e.g., health interventions).

Rather than seeing internal and external validity as orthogonal qualities that can be measured using different methods, I tend to see a baseline level of internal validity as a prerequisite to examining external validity. And since this baseline is seldom met, that’s what I’m most eager to see more of.

Bottom line. Under the status quo, I get very little value out of literatures that have large numbers of flawed studies – because I tend to suspect the flaws of running in the same direction. On a given research question, I tend to base my view on the very best, most expensive, most “perfectionist” studies, because I expect these studies to be the most fair and the most scrutinized, and I think focusing on them leaves me in a better position than trying to understand all the subtleties of a large number of flawed studies.

If there were more diversity of research methods, I’d worry less about pervasive and correlated selection bias. If I trusted academics to be unbiased, I would feel better about looking at the overall picture presented by a large number of imperfect studies. If I had the time to understand all the nuances of every study, I’d be able to make more use of large and flawed literatures. And if all of these issues were less concerning to me, I’d be more interested in moving beyond a focus on internal validity to broader investigations of external validity. But as things are, I tend to get more value out of the 1-5 best studies on a subject than out of all others combined, and I wish that perfectionist approaches were much more dominant than they currently are.


  • Colin Rust on January 19, 2016 at 8:21 pm said:

    I tend to agree that a smaller number of high-quality studies is more valuable than a larger number of flawed studies.

    That said, I do feel Blattman’s pain when he writes (in his initial blog post):

    I can tell you from experience it is excruciating to polish these papers to the point that a top journal and its exacting referees will accept them. I appreciate the importance of this polish, but I have a hard time believing the current state is the optimal allocation of scholarly effort. The opportunity cost of time is huge.

    I wonder if there are more collaborative models for conducting science — the study design and, separately, the data analysis — that could make achieving polish less painful? Instead of the traditional almost adversarial relationship between authors and referees leading to fine-tuning of the analysis (of course it’s too late for the study design), maybe a more collaborative approach by interested experts who are initially outsiders could more efficiently and less painfully achieve a good outcome.

    I’m just jawboning here, but I’ve been fascinated by the Polymath project, a radically collaborative project that has had some success in tackling some problems in pure mathematics. I wonder if there is something analogous that might be tried for various stages of some empirical studies.

  • Eli Brandt on January 20, 2016 at 12:24 am said:

    I could easily believe that reviewers are too stringent on issues of polish, and simultaneously too lax on fundamental assumptions that undercut the methodology.

    There are at least two things being talked about, polish and fundamental design. As a reader I care a lot about whether a study is, for example, a randomized trial or a correlational piece of junk. You can’t polish a correlational turd.

    My suspicion is that a lot of what authors call “polish” requests from decent reviewers are cases of “you know what you did, but you have to state it to somebody who doesn’t know.” These can feel like nitpicks to the author, but they are necessary if you want a publication that works for readers.

    Though there is a spectrum. Sometimes reviewers ask a paper to be not just self-explanatory but practically self-defending against all possible questions. There are diminishing returns here.

    And maybe not every communication needs to stand alone without its author. For posterity it would, for a large number of readers it practically would, but for a relatively ephemeral note to the twenty other workers in a field? Not so much.

    Maybe we should have more of a category of: peer-reviewed carefully for fundamentals, but only lightly for ‘polish’. (Or is that in broad use in some areas?)

  • Denis Drescher on January 20, 2016 at 1:34 am said:

    I recently stumbled across the paper “Evaluations with impact: decision-focused impact evaluation as a practical policymaking tool” on the Impact Evaluations blog. It distinguishes knowledge-focused evaluations (KFEs), “those primarily designed to build global knowledge about development interventions and theory,” and decision-focused evaluations (DFEs), those “driven by implementer demand, tailored to implementer needs and constraints, and embedded within implementer structures,” and recommends: “Where the primary need is for rigorous evidence to directly inform a particular development programme or policy, DFEs will usually be the more appropriate tool. KFEs, in turn, should be employed when the primary objective is to advance development theory or in instances when we expect high external validity, ex ante. This recalibration will require expanding the use of DFEs and greater targeting in the use of KFEs.”

    This taxonomy seems mostly orthogonal to the one into perfectionist and flawed studies, but it doesn’t seem to be entirely orthogonal since the KFEs are much more expensive (page 32) and much more likely to attract wide attention and thus, as you say, probably less likely to suffer from publication bias. Plus, the lowest-cost types of DFEs don’t even use randomization (page 26), so they’re of very limited use absent prior randomized studies.

    Would you still agree with their recommendation in full, or do you see greater need for KFEs than the authors?

  • Brett Keller on January 20, 2016 at 3:04 am said:

    One additional consideration is that while you are consumers of research, you tend to (from my outside observation) be the greatest consumers of specific areas of research that you are selecting in part based on the quantity and quality of the research. I.e., if an area doesn’t have any rigorous research, it’s unlikely to come to your attention, and even more unlikely to be something you really dig into. That’s definitely not meant as a criticism of your work, just a recognition that how you use research may be different from others’ uses.

    In other words, it’s possible that the optimal allocation of research resources for all consumers (policymakers, NGO folks, general public) might be for marginally better studies across a high number of fields, whereas the optimal allocation of research resources for your uses (deciding on best charities to accomplish specific goals) may be different.

  • Tim Ogden on January 20, 2016 at 12:08 pm said:

    Since you suspect that the less-perfectionist studies are systematically biased in the same way, and seem to have a strong assumption about what that bias is, wouldn’t a Bayesian adjustment to the less-perfectionist studies put you in largely the same place as the smaller adjustment to the more perfectionist studies?

    Or is your suspicion that the bias range includes negative and zero effects? In other words, is the bias so strong that even a large number of smaller/”not-bad but not-perfect” studies showing a positive effect would be obscuring an actually negative effect?

  • Holden on January 22, 2016 at 1:54 pm said:

    Thanks for the comments, all!

    Colin and Eli: I agree it’s worth distinguishing between perfectionism in designing/executing a study vs. perfectionism in writing it up for a journal. This post intended to focus on the former, and I perceived Prof. Blattman’s post as doing the same for the most part. When it comes to the latter, we’d probably be happy to see a maximally lax standard. We are generally happy to use a working paper when the peer-reviewed version isn’t out yet (especially if it is also possible to review the data and code).

    Denis, we don’t have a strong view on that question. I believe the KFE category is a better fit for the kind of study we generally rely on in establishing the basic case for impact, though we sometimes use things that might be considered DFEs when assessing how a particular charity’s execution compares to the program originally studied.

    Brett, that’s a good point. The aim of this post was to share our preferences and reasoning, and we recognize that others may have different needs.

    Tim: we think that flawed studies may have heavy bias, but we aren’t at all confident about the magnitude or direction of that bias. The more such uncertainty we have, the more a Bayesian adjustment reduces to simply sticking with our prior. Because “perfectionist” studies leave less room for uncertainty about bias (compared to large sets of flawed studies), they are better able to shift our estimates away from the prior.
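    This point can be sketched with a standard normal-normal Bayesian update (all numbers below are hypothetical): treat a study as measuring “true effect plus bias.” When we are very uncertain about the bias, the study’s effective variance is large, so the posterior barely moves from the prior; when bias uncertainty is small, the same estimate moves us a lot:

```python
def posterior_mean(prior_mean, prior_var, estimate, est_var, bias_var):
    """Normal-normal Bayesian update where the study's bias is itself uncertain.

    The study is treated as measuring (true effect + bias), with
    bias ~ N(0, bias_var), so its effective variance is est_var + bias_var.
    """
    effective_var = est_var + bias_var
    w = (1.0 / effective_var) / (1.0 / prior_var + 1.0 / effective_var)
    return w * estimate + (1.0 - w) * prior_mean

# Prior: no effect, modest uncertainty. A study reports a large positive effect.
prior_mean, prior_var = 0.0, 1.0
estimate, est_var = 2.0, 0.25

# "Perfectionist" study: little room for bias, so the estimate moves us a lot.
trusted = posterior_mean(prior_mean, prior_var, estimate, est_var, bias_var=0.1)
# Flawed study: huge bias uncertainty, so we mostly stick with the prior.
flawed = posterior_mean(prior_mean, prior_var, estimate, est_var, bias_var=25.0)
print(round(trusted, 2), round(flawed, 2))  # ≈ 1.48 and 0.08
```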

  • MWANGAZA KITABU on January 28, 2016 at 7:26 am said:

    Many of the qualities that make a study “perfectionist” or “overachieving” – such as pre-analysis plans, or collection of many kinds of data that allow a variety of chances to spot anomalies –
    Can’t agree more. Good job here!!!
