There’s been some interesting discussion on how the world of social science experiments is evolving. Chris Blattman worries that there is too much of a tendency toward large, expensive, perfectionist studies, writing:
“each study is like a lamp post. We might want to have a few smaller lamp posts illuminating our path, rather than the world’s largest and most awesome lamp post illuminating just one spot. I worried that our striving for perfect, overachieving studies could make our world darker on average.”
My feeling – shared by most of the staff I’ve discussed this with – is that the trend toward “perfect, overachieving studies” is a good thing. Given the current state of the literature and the tradeoffs we perceive, I wish this trend were stronger and faster than it is. I think it’s worth briefly laying out my reasoning.
Our relationship to academic research is that of a “consumer.” We don’t carry out research; we try to use existing studies to answer action-relevant questions. The example question I’ll use here is “What are the long-term benefits of deworming?”
In theory, I’d prefer a large number of highly flawed studies on deworming to a small number of “perfectionist” studies, largely for the reasons Prof. Blattman lays out – a large number of studies would give me a better sense of how well an intervention generalizes, and what kinds of settings it is better and worse suited to. This would be my preference if flawed studies were flawed in different and largely unrelated ways.
The problem is my fear that studies’ flaws are systematically similar to each other. I fear this for a couple of reasons:
1. Correlated biases in research methods. One of the most pervasive issues with flawed studies is selection bias. For example, when trying to assess the impact of the infections treated by deworming, it’s relatively easy to compare populations with high infection levels to populations with low infection levels, and attribute any difference to the infections themselves. The problem is that differences between these populations could reflect the impact of the infections, or could reflect confounding factors: populations with high infection rates tend to be systematically poorer, have systematically worse sanitation, etc.
If researchers decided to conduct 100 relatively easy, low-quality studies of deworming, it’s likely that nearly all of the studies would take this form, and therefore nearly all would be subject to the same risk of bias; practically no such studies would have unrelated, or opposite, biases. In order to conduct a study without this bias, one needs to run an experiment (or identify a “natural experiment”), which is a significant step in the direction of a “perfectionist” study.
Even if we restrict the universe of studies considered to experiments, I think analogous issues apply. For example:
- Poorly conducted experiments tend to risk reintroducing selection bias. If randomization is poorly enforced, a study can end up looking at “people motivated to receive deworming” vs. “people not as motivated,” which might also be confounded with general wealth, education, sanitation, etc.
- There are many short-term randomized studies of deworming, but few long-term randomized studies. If my concern is long-term effects, having a large number of short-term studies isn’t very helpful.
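To make the worry about correlated biases concrete, here is a minimal simulation sketch. It is an illustration only: the function names, the normal distributions, and all the numbers (bias sizes, 2,000 simulated "worlds") are assumptions chosen for the example, not estimates from any real literature. Each study's estimate is modeled as the true effect, plus a bias shared by every study of this type, plus a bias specific to that one study.

```python
import random
import statistics


def average_of_studies(n_studies, true_effect, shared_bias_sd, own_bias_sd, rng):
    """Average the estimates from n_studies flawed studies.

    Each estimate = true effect + a bias common to all studies
    + a bias specific to that study (both hypothetical, normally distributed).
    """
    shared_bias = rng.gauss(0.0, shared_bias_sd)  # one draw, shared by all studies
    estimates = [
        true_effect + shared_bias + rng.gauss(0.0, own_bias_sd)
        for _ in range(n_studies)
    ]
    return statistics.fmean(estimates)


def rms_error(n_studies, shared_bias_sd, own_bias_sd, n_worlds=2000, seed=0):
    """Root-mean-square error of the literature-wide average over many simulated worlds."""
    rng = random.Random(seed)
    true_effect = 0.0  # arbitrary; only the error around it matters here
    squared_errors = []
    for _ in range(n_worlds):
        avg = average_of_studies(n_studies, true_effect, shared_bias_sd, own_bias_sd, rng)
        squared_errors.append((avg - true_effect) ** 2)
    return statistics.fmean(squared_errors) ** 0.5


if __name__ == "__main__":
    # Both scenarios give a single study the same total bias variance (1.0).
    for n in (1, 10, 100):
        unrelated = rms_error(n, shared_bias_sd=0.0, own_bias_sd=1.0)
        correlated = rms_error(n, shared_bias_sd=0.8, own_bias_sd=0.6)
        print(f"{n:>3} studies: unrelated flaws {unrelated:.2f}, shared flaws {correlated:.2f}")
```

Under these assumptions, averaging 100 studies with unrelated flaws shrinks the typical error roughly tenfold, while averaging 100 studies with a mostly shared flaw barely helps: the shared component does not average out, no matter how many studies are added.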
2. Correlated biases due to academic culture. The issue that worries me most about academic research is publication bias: the fact that researchers have a variety of ways to “find what they want to find,” from selective reporting of analyses to selective publication. I suspect that researchers have a lot in common in terms of what they “want to find”; they tend to share a common culture, common background assumptions, and common incentives. As a non-academic, I’m particularly worried about being misled because of my limited understanding of such factors. I think this issue is even more worrying when studies are done in collaboration with nonprofits, which systematically share incentives to exaggerate the impact of their programs.
The case of microlending seems like a strong example. When we first confronted the evidence on microlending in 2008, we found a large number of studies, almost all using problematic methodologies, and almost all concluding that microlending had strong positive effects. Since then, a number of “gold standard” studies have pointed to a very different conclusion.
Many of the qualities that make a study “perfectionist” or “overachieving” – such as pre-analysis plans, or collection of many kinds of data that allow a variety of chances to spot anomalies – seem to me to reduce researchers’ ability to “find what they want to find,” and/or improve a critical reader’s ability to spot symptoms of selective reporting. Furthermore, the mere fact that a study is more expensive and time-consuming reduces the risk of publication bias, in my view: it means the study tends to get more scrutiny, and is less likely to be left unpublished if it returns unwanted results.
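The publication-bias concern can be sketched the same way. The toy model below is an assumption-laden illustration, not a description of any actual field: the true effect is set to zero, a "wanted" result is defined as a positive estimate with a z-score above 1.96, and the chance that an unwanted result still gets published (`p_publish_unwanted`) is a made-up parameter. Cheap studies are given a large standard error and a low chance of publishing unwanted results; expensive studies get the opposite.

```python
import random
import statistics


def published_mean(n_studies, true_effect, std_error, p_publish_unwanted, seed=0):
    """Mean estimate among the studies that end up published.

    A study is always published if its result is "wanted" (positive and
    conventionally significant); otherwise it is published only with
    probability p_publish_unwanted. All parameters are hypothetical.
    """
    rng = random.Random(seed)
    published = []
    for _ in range(n_studies):
        estimate = rng.gauss(true_effect, std_error)
        wanted = estimate / std_error > 1.96
        if wanted or rng.random() < p_publish_unwanted:
            published.append(estimate)
    # With very few studies this list could be empty, in which case fmean raises.
    return statistics.fmean(published)


if __name__ == "__main__":
    cheap = published_mean(20000, true_effect=0.0, std_error=0.5, p_publish_unwanted=0.1)
    costly = published_mean(20000, true_effect=0.0, std_error=0.1, p_publish_unwanted=0.9)
    print(f"published mean, cheap studies:     {cheap:.3f}")
    print(f"published mean, expensive studies: {costly:.3f}")
```

Even though the true effect in this sketch is exactly zero, the published record of cheap studies shows a clearly positive average effect, while the record of expensive studies stays close to zero. The sketch covers selective publication only; selective reporting of analyses within a study would add to the distortion.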
Some further thoughts on my general support of a trend toward “perfectionist” studies:
Synthesis and aggregation are much more feasible for perfectionist studies. Flawed studies are not only less reliable; they’re more time-consuming to interpret. The ideal study has high-quality long-term data, well-executed randomization, low attrition, and a pre-analysis plan; such a study has results that can largely be taken at face value, and if there are many such studies it can be helpful to combine their results in a meta-analysis, while also looking for differences in setting that may explain their different findings. By contrast, when I’m looking at a large number of studies that I suspect have similar flaws, it seems important to try to understand all the nuances of each study, and there is usually no clearly meaningful way to aggregate them.
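As an illustration of what combining results in a meta-analysis can look like, here is a minimal sketch of the standard fixed-effect, inverse-variance pooled estimate. It is not a method this post prescribes, and the input numbers are invented for the example. The estimator assumes every study is an unbiased measurement of one common effect, which is exactly the assumption that fails when studies share flaws.

```python
import math


def pool_fixed_effect(estimates, std_errors):
    """Fixed-effect pooled estimate using inverse-variance weights.

    Each study is weighted by 1 / SE^2, so more precise studies count more.
    Returns (pooled estimate, standard error of the pooled estimate).
    """
    if len(estimates) != len(std_errors) or not estimates:
        raise ValueError("need one standard error per estimate, and at least one study")
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * x for w, x in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled, pooled_se


if __name__ == "__main__":
    # Hypothetical effect sizes and standard errors from three trials.
    effects = [0.10, 0.25, 0.18]
    ses = [0.05, 0.10, 0.08]
    pooled, se = pool_fixed_effect(effects, ses)
    print(f"pooled effect: {pooled:.3f} (SE {se:.3f})")
```

The pooled standard error shrinks as studies are added, which is only reassuring if each input can be taken at face value. Fed a set of studies with a shared bias, the same arithmetic would return a tight interval around the wrong number.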
It’s difficult to assess external validity when internal validity is so much in question. In theory, I care about external validity as much as internal validity, but it’s hard to pick up anything about the former when I am so worried about the latter. When I see a large number of studies pointing in the same direction, I fear this reflects correlated biases rather than generalizability; when I see studies finding different things, I suspect that this may be a product of different researchers’ goals and preferences rather than anything to do with differences in setting.
On the flip side, once I’m convinced of a small number of studies’ internal validity, it’s often possible to investigate external validity by using much more limited data that doesn’t even fall under the heading of “studies.” For example, when assessing the impact of a deworming campaign, we look at data on declines in worm infections, or even just data on whether the intervention was delivered appropriately; once we’re persuaded that the basic mechanism is plausible, data along these lines can fill in an important part of the external validity picture. This approach works best for interventions that seem inherently likely to generalize (e.g., health interventions).
Rather than seeing internal and external validity as orthogonal qualities that can be measured using different methods, I tend to see a baseline level of internal validity as a prerequisite to examining external validity. And since this baseline is seldom met, that’s what I’m most eager to see more of.
Bottom line. Under the status quo, I get very little value out of literatures that have large numbers of flawed studies – because I tend to suspect that the flaws run in the same direction. On a given research question, I tend to base my view on the very best, most expensive, most “perfectionist” studies, because I expect these studies to be the most fair and the most scrutinized, and I think focusing on them leaves me in a better position than trying to understand all the subtleties of a large number of flawed studies.
If there were more diversity of research methods, I’d worry less about pervasive and correlated selection bias. If I trusted academics to be unbiased, I would feel better about looking at the overall picture presented by a large number of imperfect studies. If I had the time to understand all the nuances of every study, I’d be able to make more use of large and flawed literatures. And if all of these issues were less concerning to me, I’d be more interested in moving beyond a focus on internal validity to broader investigations of external validity. But as things are, I tend to get more value out of the 1-5 best studies on a subject than out of all others combined, and I wish that perfectionist approaches were much more dominant than they currently are.