The GiveWell Blog

Publication bias: Over-reporting good news

As we look for evidence-backed programs, a major problem we’re grappling with is publication bias – the tendency of both researchers and publishers to skew the evidence to the optimistic side, before it ever gets out in the open where we can look at it. It sounds too scary to be real – how can we identify real good news if bad news is being buried? – but it’s a very real concern.

Publication bias takes several forms:

  • Bias by publishers: journals are more likely to publish papers that “find” meaningful effects than papers that find no effects (of a medicine, social program, etc.). A recent Cochrane review documents this problem in the field of medicine, finding a link between “positive” findings and likelihood of publication; a 1992 paper, “Are All Economic Hypotheses False?”, suggests that it affects economics journals as well.
  • Bias by researchers: David Roodman writes (pg 13):

    A researcher who has just labored to assemble a data set on civil wars in developing countries since 1970, or to build a complicated mathematical model of how aid raises growth in good policy environment, will feel a strong temptation to zero in on the preliminary regressions that show her variable to be important. Sometimes it is called “specification search” or “letting the data decide.” Researchers may challenge their own results obtained this way with less fervor than they ought … Research assistants may do all these things unbeknownst to their supervisors.

The effect of these problems has been thoroughly documented in many fields (for a few more, see these Overcoming Bias posts: one, two, three, four). And philanthropy-related research seems particularly vulnerable to this problem – a negative evaluation can mean less funding, giving charities every incentive to trumpet the good news and bury the bad.

How can we deal with this problem?

A few steps we are taking to account for the danger of publication bias:

  1. Place more weight on randomized (experimental) as opposed to non-randomized (quasi-experimental) evaluations. A randomized evaluation is one in which program participants are chosen by lottery, and lotteried-in people are then compared to lotteried-out people to look for program effects. In a non-randomized evaluation, the selection of which two groups to compare is generally done after the fact. As Esther Duflo argues in “Use of Randomization in the Evaluation of Development Effectiveness” (PDF):

    Publication bias is likely to be a particular problem with retrospective studies. Ex post the researchers or evaluators define their own comparison group, and thus may be able to pick a variety of plausible comparison groups; in particular, researchers obtaining negative results with retrospective techniques are likely to try different approaches, or not to publish. In the case of “natural experiments” and instrumental variable estimates, publication bias may actually more than compensate for the reduction in bias caused by the use of an instrument because these estimates tend to have larger standard errors, and researchers looking for significant results will only select large estimates. For example, Ashenfelter, Harmon and Oosterbeek (1999) show that there is strong evidence of publication bias of instrumental variables estimates of the returns to education: on average, the estimates with larger standard errors also tend to be larger. This accounts for most of the oft-cited result that instrumental estimates of the returns to education are higher than ordinary least squares estimates.

    In contrast, randomized evaluations commit in advance to a particular comparison group: once the work is done to conduct a prospective randomized evaluation the results are usually documented and published even if the results suggest quite modest effects or even no effects at all.

    In short, a randomized evaluation is one where researchers determined in advance which two groups they were going to compare – leaving a lot less room for fudging the numbers (purposefully or subconsciously) later. (The short simulation after this list illustrates how reporting only the statistically significant estimates can inflate published results.)

    In the same vein, we favor “simple” results from such evaluations: we put more weight on studies that simply measured a set of characteristics for the two groups and published the results as is, rather than performing heavy statistical adjustments and/or claiming effects for sub-groups chosen after the fact. (Note that the Nurse-Family Partnership evaluations performed somewhat heavy after-the-fact statistical adjustments; the original study of the NYC Voucher Experiment claimed effects for a subgroup, African-American students, even though this effect was not hypothesized before the experiment.)

  2. Place more weight on studies that would likely have been published even if they’d shown no results. The Poverty Action Lab and Innovations for Poverty Action publish info on their projects in progress, making it much less feasible to “bury” their results if they don’t come out as hoped. By contrast, for every report published by a standard academic journal – or worse, a nonprofit – there could easily be several discouraging reports left in the filing cabinet. I would also guess that highly costly and highly publicized (in advance) studies are less likely to be buried, and thus more reliable when they bear good news.
  3. Don’t rely only on “micro” evidence. The interventions I have the most confidence in are the ones that have both been rigorously studied on a small scale and have been associated with major success stories (such as the eradication of smallpox, the economic emergence of Asia, and more) whose size and impact are not in question. More on this idea in a future post.
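
To see Duflo’s selection point mechanically, here is a minimal simulation sketch (the numbers – a true effect of 0.05, standard errors of 0.02 and 0.10, a 1.96 significance cutoff – are made up for illustration, not drawn from any of the studies above): when only statistically significant estimates are reported, the average published estimate overshoots the true effect, and the overshoot is far larger for the noisier design, much as with the instrumental-variables example.

```python
# Sketch: how publishing only "significant" results inflates estimates,
# especially for noisy designs. All numbers here are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
true_effect = 0.05        # assumed true effect
n_studies = 100_000       # hypothetical studies, each reporting one estimate

for se in (0.02, 0.10):   # a precise design vs. a noisy one (e.g. an IV estimate)
    estimates = rng.normal(true_effect, se, n_studies)
    # Only "positive and significant" results make it into print.
    significant = estimates / se > 1.96
    published = estimates[significant]
    print(f"SE = {se:.2f}: mean published estimate = {published.mean():.3f} "
          f"(true effect = {true_effect}, {significant.mean():.0%} of studies published)")
```

In this toy setup the precise design publishes most of its studies and its published average stays close to the true effect, while the noisy design publishes only the handful of outsized estimates – so its published average ends up several times the true effect.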

BBB standards: Accountability or technicalities?

Yesterday we got an email from someone looking for help on where to give, noting that two of our top charities do not meet the Better Business Bureau (BBB)’s 20 standards for charity accountability.

We believe that both of these organizations are reputable, accountable, and excellent, and were surprised to hear this. After checking out the BBB reports on them, we stand by our recommendations and feel that the BBB’s reservations come from technicalities, not legitimate issues.

Population Services International (PSI): financial transparency, but not in the BBB’s format

The BBB report on PSI states that PSI meets 19 of the BBB’s 20 standards. The missing one:

the detailed functional breakdown of expenses within the organization’s financial statements only included one program service category. It did not include a detailed breakdown of expenses for each of its major program activities.

PSI responds (same page) that it has one high-level program, Social Marketing, which breaks down into hundreds of sub-programs, and so it chose to list only that one program on the financial statement.

Is PSI stingy with financial information? We don’t think so – in fact, we think PSI stands out for its willingness to disclose meaningful, helpful information about where its money goes. In our full review of PSI, we’re able to break out its expenses not only by expense type (promotion, evaluation, materials, staff, etc.) but also by region and by product. Getting breakdowns from several different perspectives is useful for truly understanding their activities, and it’s something that many other charities can’t or won’t provide (for example, it’s common to be refused information on how much was spent in each country).

But the BBB doesn’t look at several different breakdowns – it looks only at the official audited financial statement, and apparently this one wasn’t broken out as they expected. To me this looks like a case of an organization that is more generous with financial data than most, but didn’t anticipate the BBB’s requirements on a particular form.

Partners in Health (PIH): the Board Chair is salaried

Like PSI, PIH meets 19 of the BBB’s 20 standards. The missing one (from the BBB report):

Standard 4 : Compensated Board Members – Not more than one or 10% (whichever is greater) directly or indirectly compensated person(s) serving as voting member(s) of the board. Compensated members shall not serve as the board’s chair or treasurer.

PIH does not meet this Standard since the paid chief executive officer (CEO) also serves as the chair of the board.

The Board of Directors votes on compensation, so I can see why the BBB likes to see some distance between the Board and salaried staff. But the standard does allow paid staff to serve on the Board as long as they don’t make up too many of its votes (as spelled out above); here the problem is simply that a salaried member holds the formal Chair position.

I don’t have a copy of PIH’s bylaws, but to my knowledge (and based on our own bylaws, which are pretty standard and available at the bottom of this page), the Board Chair is distinguished from other members by procedural responsibilities, primarily presiding over meetings. The Chair does not have the power to cast extra votes, approve compensation without voting, or anything along those lines.

It seems worth keeping in mind that PIH is the same organization whose founder has been extensively written about and is known for things like not taking vacations because of his devotion to his work. I don’t know him personally and wouldn’t presume to guarantee an organization’s ethicality, but I’m guessing that if you polled relevant people, you’d find that Partners in Health is one of the more trusted and respected nonprofits out there, and that its choice of putting its CEO as Board Chair wouldn’t give pause to anyone in the know.

Bottom line

I think the BBB’s standards are well-intentioned and that there are sound principles behind them, but ultimately, they are measuring charities’ conformance to formal technicalities. I don’t believe there’s any substitute for carefully examining a charity’s activities, using all the documentation that’s available rather than just the documentation that’s standardized (such as the audited financial statement and bylaws). We reiterate our recommendations of both PSI (full review here) and PIH (full review here).

Quick notes on our progress

A few updates for people interested in the nuts and bolts of GiveWell’s progress (some of these have been included in our email updates, but not yet flagged on our blog):

  • We’ve recently (this week) updated our research agenda – see the updated agenda here.
  • The William and Flora Hewlett Foundation awarded us $100,000 for general operating support (the grant was made in December).
  • The membership of our Board has changed, as two members have left and two have joined within the last few months – see our updated list.
  • Audio and materials for past Board meetings, through November 2008, have been uploaded – view them all here.
  • Final versions of our IRS Form 990 and audited financial statement for 2007 are available on our website here.
  • A full history of our business plans and changes of direction – including the most recent in November of 2008 – is now available here.
  • We now offer the GiveWell Advance Donation – implemented through a donor-advised fund – as a way for donors to give (and get their tax deduction) now, while deciding which of our recommended charities should get their funds after our next round of research.
  • In addition to our research email group, we’ve created a “general GiveWell project” email group for people who wish to discuss general GiveWell-related issues. Subscribers will receive our periodic email updates as well as alerts when we add substantial new content to our website or make substantial changes to our plans.

Beware just-published studies

A recent study on health care in Ghana has been making the rounds – first on Megan McArdle’s blog and then on Marginal Revolution and Overcoming Bias. McArdle says the study shows “little relationship between consumption and health outcomes”; the other two title their posts “The marginal value of health care in Ghana: is it zero?” and “Free Docs Not Help Poor Kids.” In other words, the blogosphere take is that this is a “scary study” showing that making primary care free doesn’t work (or even perhaps that primary care doesn’t work).

But wait a minute. Here’s what the study found:

  • It followed 2,592 Ghanaian children (age 6-59 months). Half were randomly selected to receive free medical care, via enrollment in a prepayment plan. The medical care included diagnosis, antimalarials and other drugs, but not deworming.
  • Children with free treatment got medical care for 12% more of their episodes (2.8 vs. 2.5 episodes per year per person).
  • Health outcomes were assessed after 6 months:
    • Moderate anemia (the main measure) afflicted 36 of the children who got free care, vs. 37 of the children who didn’t.
    • Severe anemia afflicted 2 of the children who got free care, vs. 3 of the children who didn’t.
    • There were 5 deaths among children who got free care, vs. 4 among children who didn’t.
    • Parasite prevalence and nutrition status were also measured but not considered to be good measures of the program’s effects (since it did not include deworming or nutrition-centered care).

Would you conclude from this that the free medical care was “ineffective”? I wouldn’t – I’d conclude that the study ended up with very few cases of the outcomes it measured, and therefore very low statistical “power,” because the children it studied were much healthier than expected. The researchers predicted an anemia prevalence of 10%, but the actual prevalence was just under 3%. Severe anemia and death were even rarer, making any comparison of those numbers (2 vs. 3 and 5 vs. 4) pretty meaningless. So in the end, we’re left with a control group of 37 kids with moderate anemia, and we’re trying to detect a significant difference in the other group – from a 6-month program that didn’t even address all possible causes of anemia (again, there was no deworming, and it doesn’t appear that there was iron supplementation – the only relevant treatment was antimalarials).
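
To put the “low power” point in rough numbers, here is a back-of-the-envelope power calculation (a sketch using assumed figures – roughly 1,300 children per arm and a hypothetical one-third reduction in moderate anemia – not numbers reported by the authors): at the 10% prevalence the researchers expected, a study this size would have a good chance of detecting such an effect; at the roughly 3% prevalence they actually observed, the chance drops well below a coin flip.

```python
# Rough two-proportion power calculation (normal approximation).
# Assumptions (mine, for illustration): ~1,300 children per arm and a
# hypothetical one-third reduction in moderate anemia from free care.
import numpy as np
from scipy.stats import norm

n = 1296                                # roughly half of the 2,592 children
z_crit = norm.ppf(1 - 0.05 / 2)         # two-sided test at alpha = 0.05

for p_control in (0.10, 0.03):          # expected vs. (roughly) observed prevalence
    p_treated = (2 / 3) * p_control     # assumed one-third reduction
    diff = p_control - p_treated
    p_pooled = (p_control + p_treated) / 2
    se_null = np.sqrt(2 * p_pooled * (1 - p_pooled) / n)      # SE under "no effect"
    se_alt = np.sqrt(p_control * (1 - p_control) / n
                     + p_treated * (1 - p_treated) / n)       # SE under the assumed effect
    power = norm.cdf((diff - z_crit * se_null) / se_alt)
    print(f"control-group prevalence {p_control:.0%}: power ≈ {power:.0%}")
```

The exact figures depend on the assumed effect size, but the pattern is the point: the same design that is adequately powered at the expected prevalence is badly underpowered at the prevalence actually observed.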

Bottom line, free medical care didn’t appear to lead to improvement, but there also didn’t appear to be much room for improvement in this particular group. A similar critique appears in the journal (and points out that we don’t even know how much of the anemia could be attributed to malaria as opposed to parasites or other factors).

Some possible explanations for the relatively low levels of anemia include:

  • The presence of observers led everyone to make more use of primary care (the “Hawthorne effect,” a possibility raised by a Marginal Revolution commenter).
  • Less healthy people (and/or people who used primary care less) were less likely to stay enrolled in the study (7-8% dropped out), so that the people who stayed in had better health.
  • Or for some other reason (selection of village?), the researchers studied an unusually or unexpectedly healthy group. Perhaps a group that already uses primary care when it’s very important to do so, such that the “extra” visits paid for by the intervention were lower-stakes ones, or just weren’t enough (again, only a 12% difference) to impact major health outcomes among the small number of afflicted children.

All of these seem like real possibilities to me, and the numerical results found don’t seem to strongly suggest much of anything because of the low power (as the critique observes).

I saw a similar dynamic play out a month ago: Marginal Revolution linked a new study claiming that vaccination progress has been overstated, but a Center for Global Development scholar raised serious methodological concerns about the study. I haven’t examined this debate enough to have a strong opinion on it, and overestimation seems like a real concern; but we want to see how the discussion and reactions play out before jumping to conclusions from the new study.

We’re all for healthy skepticism of aid programs, and we like reading new studies. But in drawing conclusions, we try to stick to studies that are a little older and have had some chance to be reviewed and discussed (and we generally look for responses and conflicting reactions). Doing so still leaves plenty of opportunities to be skeptical, as with the thoroughly discussed New York City Voucher Experiment and other ineffective social programs.

Why we prefer the carrot to the stick

A couple of the commenters on a previous post object to our idea of “rewarding failure” and prefer to focus on “putting the bad charities out of business.”

In theory, I’d like to see a world where all charities are evaluated meaningfully, and only the effective ones survive. But the world we’re in is just too far from that. The overwhelming majority of charities simply perform no meaningful evaluation one way or the other – their effects are a big question mark.

It’s not in our power to sway enough donors – at once – to starve the charities that don’t measure impact. (And even if it were, there are simply too many of these for starving them all to be desirable.) But it is in our power to reward the few that do measure impact, thus encouraging more of it and creating more organizations that can eventually outcompete the unproven ones.

Of course failure isn’t valuable by itself, and shouldn’t be rewarded. But showing that a program doesn’t work is expensive and valuable in and of itself, and should be rewarded. As Paul Brest says, “the real problem is that, unless they are doing direct services, most nonprofits don’t know whether they are succeeding or failing.”

Evaluation is valuable whether or not it turns out to have positive results. Yet currently, only positive results are rewarded – honest evaluation is riskier than it should be. This is the problem that the “failure grant” idea is aimed at.

Uncharitable

Dan Pallotta sent me a copy of Uncharitable about a month ago, and I’ve been late in taking a look at it.

I highly recommend it for people interested in general discussions of the nonprofit sector.

The discussion I’ve seen of the book so far (Nicholas Kristof and Sean Stannard-Stockton) has focused on how much we should be bothered when people make money off of charity. Personally, I feel that I’ve yet to see a good argument that we should care how much money do-gooders make – as opposed to how much good they do (and how much it costs).

The chapter closest to my heart, though, is the one called “Stop Asking This Question.” Mr. Pallotta slams donors who focus on “how much of my dollar goes straight to the programs,” devoting even more ink to the matter than we have. We need more people pounding on this point.

The book’s basic theme, as I understand it, is this: what matters in philanthropy is the good that gets done, not anything else. Attacking programs that are effective (at helping people, at raising money, etc.) because they don’t conform to some abstract idea of yours about how nonprofits should run/think/feel is simply hurtful to the people philanthropy seeks to help (not to mention arrogant). The book often illustrates this point with nonprofit/for-profit analogies that some would find inappropriate … but putting all analogies aside, it seems to me that this basic point shouldn’t be controversial.