As we look for evidence-backed programs, a major problem we’re grappling with is publication bias – the tendency of both researchers and publishers to skew the evidence to the optimistic side, before it ever gets out in the open where we can look at it. It sounds too scary to be real – how can we identify real good news if bad news is being buried? – but it’s a very real concern.
Publication bias takes several forms:
- Bias by publishers: journals are more likely to publish papers that “find” meaningful effects, as opposed to papers that find no effects (of a medicine, social program, etc.). A recent Cochrane review documents this problem in the field of medicine, finding a link between “positive” findings and likelihood of publication; a 1992 paper, “Are All Economic Hypotheses False?”, suggests that it affects economics journals as well.
- Bias by researchers: David Roodman writes (pg 13):
A researcher who has just labored to assemble a data set on civil wars in developing countries since 1970, or to build a complicated mathematical model of how aid raises growth in good policy environment, will feel a strong temptation to zero in on the preliminary regressions that show her variable to be important. Sometimes it is called “specification search” or “letting the data decide.” Researchers may challenge their own results obtained this way with less fervor than they ought … Research assistants may do all these things unbeknownst to their supervisors.
The effect of these problems has been thoroughly documented in many fields (for a few more, see these Overcoming Bias posts: one, two, three, four). And philanthropy-related research seems particularly vulnerable to this problem – a negative evaluation can mean less funding, giving charities every incentive to trumpet the good news and bury the bad.
How can we deal with this problem?
A few steps we are taking to account for the danger of publication bias:
- Place more weight on randomized (experimental) as opposed to non-randomized (quasi-experimental) evaluations. A randomized evaluation is one in which program participants are chosen by lottery, and lotteried-in people are then compared to lotteried-out people to look for program effects. In a non-randomized evaluation, the selection of which two groups to compare is generally done after the fact. As Esther Duflo argues in “Use of Randomization in the Evaluation of Development Effectiveness” (PDF):
Publication bias is likely to be a particular problem with retrospective studies. Ex post the researchers or evaluators define their own comparison group, and thus may be able to pick a variety of plausible comparison groups; in particular, researchers obtaining negative results with retrospective techniques are likely to try different approaches, or not to publish. In the case of “natural experiments” and instrumental variable estimates, publication bias may actually more than compensate for the reduction in bias caused by the use of an instrument because these estimates tend to have larger standard errors, and researchers looking for significant results will only select large estimates. For example, Ashenfelter, Harmon and Oosterbeek (1999) show that there is strong evidence of publication bias in instrumental variables estimates of the returns to education: on average, the estimates with larger standard errors also tend to be larger. This accounts for most of the oft-cited result that instrumental estimates of the returns to education are higher than ordinary least squares estimates.
In contrast, randomized evaluations commit in advance to a particular comparison group: once the work is done to conduct a prospective randomized evaluation the results are usually documented and published even if the results suggest quite modest effects or even no effects at all.
In short, a randomized evaluation is one where researchers determined in advance which two groups they were going to compare – leaving a lot less room for fudging the numbers (purposefully or subconsciously) later.
In the same vein, we favor “simple” results from such evaluations: we put more weight on studies that simply measured a set of characteristics for the two groups and published the results as is, rather than performing heavy statistical adjustments and/or claiming effects for sub-groups that are chosen after the fact. (Note that the Nurse-Family Partnership evaluations performed somewhat heavy after-the-fact statistical adjustments; the NYC Voucher Experiment original study claimed effects for a subgroup, African-American students, even though this effect was not hypothesized before the experiment.)
- Place more weight on studies that would likely have been published even if they’d shown no results. The Poverty Action Lab and Innovations for Poverty Action publish info on their projects in progress, making it much less feasible to “bury” their results if they don’t come out as hoped. By contrast, for every report published by a standard academic journal – or worse, a nonprofit – there could easily be several discouraging reports left in the filing cabinet.
I would also guess that highly costly and highly publicized (in advance) studies are less likely to be buried, and thus more reliable when they bear good news.
- Don’t rely only on “micro” evidence. The interventions I have the most confidence in are the ones that have both been rigorously studied on a small scale and have been associated with major success stories (such as the eradication of smallpox, the economic emergence of Asia, and more) whose size and impact are not in question. More on this idea in a future post.
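The mechanism Duflo describes – that when only “significant” results get published, noisier studies report larger effects – can be illustrated with a toy simulation. This sketch is not from the post or from Duflo’s paper; all the numbers (a true effect of zero, standard errors drawn uniformly, a 1.96 significance cutoff) are illustrative assumptions:

```python
import random
import statistics

random.seed(0)

TRUE_EFFECT = 0.0   # assume the program actually does nothing
N_STUDIES = 10_000  # hypothetical studies, each estimating the same effect

published = []  # (standard_error, estimate) pairs that clear the significance bar
for _ in range(N_STUDIES):
    se = random.uniform(0.1, 1.0)            # noisier designs have larger standard errors
    estimate = random.gauss(TRUE_EFFECT, se)  # unbiased estimate of the true effect
    if estimate / se > 1.96:                  # only "significant positive" results get written up
        published.append((se, estimate))

# Among published studies, every estimate is positive even though the true
# effect is zero, and the inflation is worst for the noisiest designs.
noisy = [est for se, est in published if se > 0.55]
precise = [est for se, est in published if se <= 0.55]
print("mean published estimate (noisy designs):   %.2f" % statistics.mean(noisy))
print("mean published estimate (precise designs): %.2f" % statistics.mean(precise))
```

Under these assumptions, the published record shows a positive “effect” that grows with the standard error – the same pattern Ashenfelter, Harmon and Oosterbeek found in the returns-to-education literature.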