The GiveWell Blog

What’s so hard about rigorous self-evaluation?

I’m not trying to be a jerk. I honestly want to know.

Most of the self-evaluation we see from charities looks at clients, but doesn’t compare them to (otherwise similar) non-clients. So it’s probably effective at preaching to the choir, but not at winning over skeptics. When we bring up the issues with it, we constantly hear things like “Of course we’d love to do a randomized/longitudinal study, but those are expensive and difficult and the funding isn’t there.”

This is how I imagine an interested charity could evaluate itself rigorously:

  1. Use a lottery to pick clients. Many/most humanitarian charities don’t have enough money to serve everyone in need. (And if they do have enough, there’s an argument that they don’t need more.) Instead of dealing with this issue by looking for the “most motivated” applicants, or using first-come, first-served (as most do), get everyone interested to sign up, fill out whatever information you need (including contact info), etc. – then roll the dice. (There’s a rough sketch of the whole setup, budget arithmetic included, right after this list.)

    Unethical? I don’t think so. I don’t see how this is any worse than “first-come, first-served.” It could be slightly worse than screening for interest/motivation … but (a) it’s also cheaper; (b) I’m skeptical of how much is to be gained by taking the top 60% of a population vs. randomly selecting from the top 80%; (c) generating knowledge of how to help people has value too.

    Cost: seems marginal, cheaper than using an interest/motivation screen. You do have to enter the names into Excel, but for 1000 people, that’s what, 5 hours of data entry?
    Staff time: seems marginal.

  2. Get contact information from everyone. This is actually part of step 1, as described. These days you can get on the Web in a library, homeless people blog, and I’m guessing that even very low-income people can and do check email. Especially if you give them a reason to (see below).

    Cost: see above
    Staff time: none

  3. Follow up with both clients and randomly selected non-clients, offering some incentive to respond. Incentives can vary (often, for clients, providing information can be made a condition of continuing support services). But if push comes to shove, I don’t think a ton of people would turn down $50.

    Cost: worst case, sample 100 clients and 100 non-clients per class and pay them $50 each. That’s a decent sample size. Follow up with each cohort at 2, 5, and 10 years; in any given year, that means three cohorts of 200 people each, for a total of 600 people followed up with = $30k/yr, absolute maximum.
    Staff time: you do have to decide how to follow up, but once you’ve done that, it’s a matter of sending emails.

  4. Check and record the followup responses. If possible, get applicants to provide proof (of employment, of test scores) for their $50. Have them mail it to you, and get temps to audit it and punch it into Excel.

    Cost: assuming each response takes 30 minutes to process and data entry costs $10/hr, that’s $3k/yr (600 responses × half an hour × $10).
    Staff time: none.

  5. And remember: Have Fun! Did you think I was going to put something about rigorous statistical analysis here? Forget it. Data can be analyzed by anyone at any time; only you, today, can collect it. When you have the time/funding/inclination, you can produce a report. But in the meantime, just having the data is incredibly valuable. If some wealthy foundation (or a punk like GiveWell) comes asking for results, dump it on them and say “Analyze it yourself.” They’re desperate for things to spend money on; they can handle it.

    (Also, I’m not a statistics expert, but it seems to me that if you have data that’s actually randomized like this, analyzing it is a matter of minutes not hours. Writing up your methodology nicely and footnoting it and getting peer-reviewed and all that is different, but you don’t need to do that if you’re just trying to gauge yourself internally.)
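To make steps 1-4 concrete, here’s a minimal sketch of the lottery and the follow-up budget, assuming Python and a made-up applicant pool; the slot counts and dollar figures are the worst-case numbers from above, and everything else is invented:

```python
import random

# Hypothetical applicant pool; in practice this comes out of your signup
# sheet or Excel file (names, contact info, etc.).
applicants = [f"applicant_{i}" for i in range(1000)]

random.seed(2007)            # fixed seed, so the lottery is reproducible/auditable
random.shuffle(applicants)

SLOTS = 100                  # however many clients the program can serve
clients = applicants[:SLOTS]
comparison_group = random.sample(applicants[SLOTS:], 100)  # step 3's non-clients

# Step 3's worst-case budget: 100 clients + 100 non-clients per class at $50
# each, with three classes (the 2-, 5-, and 10-year marks) active in any year.
followups_per_year = (100 + 100) * 3                 # 600 people
incentive_budget = followups_per_year * 50           # $30,000/yr

# Step 4: 30 minutes of processing per response at $10/hr for data entry.
processing_budget = followups_per_year * 0.5 * 10    # $3,000/yr

print(f"Incentives: ${incentive_budget:,}/yr; processing: ${processing_budget:,.0f}/yr")
```

The only design choice that matters is the shuffle up top: once selection is random, everything downstream is bookkeeping.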

The big key here, to me, is randomization. Trying to make a good study out of a non-randomized sample can get very complicated and problematic indeed. But if you separate people randomly and check up on their school grades or incomes (even if you just use proxies like public assistance), you have a data set that is probably pretty clean and easy to analyze in a meaningful way. And as a charity deciding whom to serve, you’re the only one who can take this step that makes everything else so much easier.
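In the same hedged spirit, here’s roughly what “minutes not hours” could look like once the data is randomized: a simple comparison of means. The outcome numbers below are invented, and SciPy’s t-test is just one tool that would do:

```python
import random
from scipy import stats

# Invented follow-up outcomes: 1 = employed at the 2-year mark, 0 = not.
# With real data, these columns come straight out of the Excel sheet.
random.seed(0)
clients     = [1 if random.random() < 0.55 else 0 for _ in range(100)]
non_clients = [1 if random.random() < 0.40 else 0 for _ in range(100)]

effect = sum(clients) / len(clients) - sum(non_clients) / len(non_clients)
t_stat, p_value = stats.ttest_ind(clients, non_clients)

# Because assignment was random, no controls or fancy matching are needed.
print(f"Estimated program effect: {effect:+.0%} employment; p = {p_value:.3f}")
```

A real write-up would add confidence intervals and a check for differential attrition, but none of that changes the one-afternoon scale of the job.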

I tried to be generous in estimating costs, and came out to ~$35k/yr, almost all of it in incentives to get people to respond. Nothing to sneeze at, but for a $10m+ charity, this doesn’t seem unworkable. (Maybe that’s what I’m wrong about?) And this isn’t $35k per study – this is $35k/yr to follow every cohort at 2, 5, and 10 years. That dataset wouldn’t be “decent,” it would be drool-worthy.

And the benefits just seem so enormous. First and foremost, for the organization itself – unless its directors are divinely inspired, don’t they want to know if their program is having an impact? Secondly, as a fundraising tool for the growing set of results-oriented foundations. Finally, just for the sake of knowledge and all that it can accomplish. Other charities can learn from you, and help other people better on their own dime.

The government can learn from you – stop worrying about charity replacing government and instead use charity (and its data) as an argument for expanding it. In debates from Head Start to charter schools to welfare, the research consensus is that we need more research – and the charities on the ground are the ones who are positioned to do it, just by adding a few tweaks onto the programs they’re already conducting.

So what’s holding everyone back? I honestly want to know. I haven’t spent much time around disadvantaged people and I’ve never run a large operation, so I could easily be missing something fundamental. I want to know how difficult and expensive good self-evaluation is, and why. Please share.

Cause 1: Where we stand

Now that Holden and I have finished drafting reviews for Cause 5 (to be made public in a couple of weeks), we’ve moved our focus to Cause 1: Help people in Africa avoid death and extreme debilitation.

Unlike Cause 5, in which organizations roughly followed the same model to help people, organizations applying for a Cause 1 grant take wildly divergent approaches. And in most cases, they’re not taking just one approach but running a huge set of projects that don’t always have a clear overarching theme. This presents a large challenge, and makes it impossible for us to compare organizations as directly and quantitatively as we did for Cause 5.

Here’s what we’re thinking so far. Mostly, our applicants fall into the following broad categories: comprehensive community aid, providing lots of different kinds of services to a small group of people; distribution, getting lots of small, inexpensive items to many people; corrective surgery, providing a relatively expensive but life-changing surgery to those with congenital deformities; and mammoths, which do just about everything for everyone everywhere.

Comprehensive community aid. These organizations go into a village and attempt to provide everything for the village, including primary health services (for childbirth, pneumonia, etc.), distributing necessary medicine/products (bednets, ORS, de-worming pills), education about hygiene and protected sex, economic aid including farming technology, and much more. This is the model with the most intuitive appeal to us. When you’re trying to help people thousands of miles away, in a culture you’ll probably never fully understand, it seems smart to work intensely with one group of people and document all the ways in which their lives change – that way you’re more likely to catch unintended consequences, adapt to changing problems, etc., and make sure you’re actually changing their lives for the better. (This seems far superior to deciding in advance on one problem, like AIDS, and attacking it furiously while leaving other problems unaddressed.)

But that documentation is essential – immersion doesn’t equal understanding, and if a charity isn’t measuring and reporting life change, we aren’t going to bet on it. So far (though we’re still working on it), we haven’t been able to get a real picture of the life change effected by an organization using this approach. Organizations do many activities for which they often offer no evidence of the eventual impact (e.g., they tell us how many families attended their HIV/AIDS awareness campaign, but don’t offer evidence for what effect they expect that to have).

Distributors. These organizations distribute cheap and potentially life-saving items: ORS to treat diarrhea, vitamin A supplements to prevent malnutrition and blindness, bednets to fight malaria, condoms to prevent HIV/AIDS, etc. The advantage of this approach, it seems, is the potential cost-effectiveness: by focusing on the cheapest, simplest diseases to treat, you can treat a lot more people.

We estimate that the stronger candidates in this area are saving lives for around $50-120 a pop … but this estimate is based largely on combining reports on the number of units sold/distributed with academic research on the effects of those units (e.g., studies of the effects of vitamin A in developed-world hospitals), plus a lot of guesswork about utilization (i.e., it’s one thing to sell or distribute condoms – but how much are they actually getting used?). Organizations only sometimes monitor the utilization of the products they distribute, and they rarely, if ever, measure the change in actual disease prevalence for the people they serve. So charities in this area may be helping a lot of people, but it’s hard for us to be confident in their effects.
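To show where an estimate like that comes from, here’s the shape of the calculation. Every figure below is invented for illustration, not taken from any applicant’s reports – which is exactly the problem:

```python
# All inputs are hypothetical, for illustration only.
program_cost    = 100_000    # total spent distributing, say, vitamin A capsules
units_delivered = 1_000_000  # from the charity's distribution reports
utilization     = 0.8        # guesswork: fraction of units actually used
lives_saved_per_unit_used = 0.00125  # stand-in for an academic effect size

lives_saved   = units_delivered * utilization * lives_saved_per_unit_used
cost_per_life = program_cost / lives_saved
print(f"~${cost_per_life:,.0f} per life saved")  # ~$100 here; every input is soft
```

Halve the utilization guess and the cost per life doubles – that’s how sensitive these figures are to the numbers charities don’t report.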

Corrective surgery. These organizations perform a very specific procedure (or set of procedures) for people suffering from an ailment. They perform surgeries to correct debilitating deformities (cleft palate, severe burns, etc.) or correct vision impairment. The best of these organizations can tell us how many surgeries they perform and what conditions they correct, which, along with their total expenses, gives us a picture of how many lives they’re changing for each dollar they spend. But these organizations don’t tell us a lot about how debilitating the conditions they fix truly are (leaving us to question the impact they’re having).

This model is attractive because each surgery affects a specific person. Knowing how many surgeries each organization performs tells you how many people’s lives have been changed – there’s not a lot of doubt. But this model doesn’t come close to achieving the cost-effectiveness of the (albeit somewhat theoretical) distribution model: its cost per life impacted runs 10-20x higher. And without a clear picture of the debilitation these surgeries prevent (i.e., to what degree they are partly cosmetic), we worry that their impact is even lower.

UNICEF. UNICEF does everything, everywhere. They distribute, perform surgeries, and sometimes just focus on providing all services to a set group of people. We won’t be able to evaluate the entirety of UNICEF’s programs, but we may be able to evaluate their Accelerated Child Survival and Development program, which takes an approach similar to the “comprehensive community aid” above, and which appears to be slated for a very large and growing role in UNICEF’s programming.

So, where are we going from here? At the moment, the only organizations that we can confidently say are helping people are those performing corrective surgeries. But we’re waiting on more information from distributors, which, we hope, will give us more confidence that people are actually using the products they receive. We’re disappointed in what we’ve seen thus far from comprehensive community aid. Even though it makes a lot of sense to provide everything to a small group of people (even at higher costs), we’re not convinced of the impact these programs (which are mostly more conventional Africa-aid organizations) are having.

One more note: with so many different approaches to helping people, there’s no way that they’re going to be close in terms of cost-effectiveness. There’s no reason to think that an organization distributing inexpensive items across a continent is in the same ballpark as an organization providing corrective surgeries to a few thousand children each year. Donors need to understand what they’re getting for their dollar.

Preview: Cause 5

Elie and I have just finished drafting our reviews for Cause 5: help disadvantaged adults become self-supporting. We can’t make them public yet because we need to give our applicants a chance to look and point out mistakes (and write any responses they want to write). But here’s a quick story about what we’ve been doing.

First, the moral of the story: deciding where to give is hard. Elie and I have gone through 3-4 completely different approaches before finding one we’re pretty happy with.

First we tried a pretty quantitative approach: look at how many people each finalist placed “sustainably” in a job (i.e., 12-month retention or above), then estimate how many people would likely have gotten similar jobs on their own, by slicing and dicing Census data to simulate the target population. The difference is “lives changed,” and lives changed divided by expenses should yield “lives changed per dollar,” which can generate a rough ranking. We gave up on this pretty quickly as we realized that many of the differences between our applicants’ populations can’t be captured in any way by the Census (differences in motivation, substance abuse history, etc.).
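For concreteness, here’s the shape of the calculation we tried (and abandoned). All figures below are invented, and the counterfactual rate is exactly the number the Census can’t really give you:

```python
# Invented figures; the real inputs were applicant reports plus Census slices.
placed_sustainably  = 300        # clients placed with 12-month retention or above
clients_served      = 1_000
counterfactual_rate = 0.15       # est. share who'd have found similar jobs anyway
expenses            = 2_000_000  # the program's annual budget

lives_changed = placed_sustainably - clients_served * counterfactual_rate
print(f"{lives_changed:.0f} lives changed, "
      f"or {lives_changed / expenses * 100_000:.1f} per $100k spent")
```

The formula is trivial; the killer is that counterfactual_rate depends on motivation, substance abuse history, and everything else the Census doesn’t measure.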

I then had the bright idea of clumping applicants together when their clients appeared similar. The HOPE Program and Catholic Charities both serve severely disadvantaged adults, similar in most of the ways we have data on; Vocational Foundation and Covenant House both serve disconnected (not employed or in school) youth. I created a big writeup putting the pairs side by side, and arguing that HOPE’s results are so much better than Catholic Charities’ (and VFI’s so much better than Covenant’s) as to imply true “program effects.”

I finished it around 8 this morning, at which point I went to sleep and Elie got up, took a look, and called BS. CCCS takes referrals from the government; HOPE is working with people who want to work. Covenant House’s clients are over 50% homeless; not so VFI’s. You just can’t compare them like this. The fact is that while we know how many people each charity placed in jobs, we have no way of knowing how a comparable population would do without help. We’ve got to go with what makes sense to us.

And what makes sense to us is that it’s really hard for a 3-12 month program to fundamentally change a person. HOPE’s numbers are strong enough (relative to CCCS’s) to make us think it might be happening, but not enough to blow us away. In the end, everyone’s numbers are consistent with the hypothesis that employment programs can’t help everyone, or even most people; those who are getting jobs are likely the more motivated ones. That doesn’t mean it’s impossible to help people – they might have the willingness, but benefit from picking up specific skills, certifications, or just help with knowing where to look.

So which would you bet on? A program trying to “reform” homeless people at great cost, placing about 30% of them, or a program that finds people who are already willing and able to be a Nurse’s Aide – or Environmental Remediation Technician – and gets them the certification they need? In the end, we bet on the latter. The certification model is simple, cost-effective, and makes sense. If I had to bet my life on whether getting people who want to be Nurse’s Aides certified as Nurse’s Aides is helping them, I’d say yes. If I had to bet my life on a 6-month course turning a person around, I’d need a lot more convincing data.

Right now we think the strongest two applicants are St. Nick’s Community Preservation Corp. and Highbridge Community Life Foundation, which follow exactly this model. Both see the vast majority of their clients take the jobs they’re trained for and hold onto these jobs. Both spend relatively little to accomplish this. Both do a million activities we have next to no information about, and both leave us wondering whether their clients could get similar jobs without help.

We prefer St. Nick’s, very slightly, because of the greater variety of jobs it trains for, some of which have much higher pay. A couple other organizations are still falling into our “recommended” category because they have strong numbers, and models that at least plausibly could be responsible for major life change (the HOPE Program is one of these).

We’re using a combination of intuition (our feeling about certification vs. general training), outcomes (we’re not recommending anyone if they don’t have retention numbers to back up the idea that they’re successfully placing clients), and calculations (a rough look at “cost per person placed sustainably” backs up our intuition that certification programs will be most cost-effective). There’s no one magic formula or metric that we’re hanging this decision on, and we know that we’ve made debatable leaps in judgment. But when I read over what we’ve written, and ask myself, “Holden, would you bet on this? If you were responsible for your donors’ karma, would this be your best shot at keeping them safe from lightning?” my answer is yes.

That said, I’ll feel a lot better about it once we put it out there and see what others think (should be within a couple weeks). I seriously cannot believe that other foundation people make these kinds of decisions talking to no one but each other. Does that really happen? That’s crazy.

Just say something, anything

Smarter Spending on AIDS: How the Big Funders Can Do Better. When I saw that title, linked here, I quickly opened the link expecting a report critically evaluating which strategies work in the fight against HIV/AIDS.

Should we fund condom distribution or programs promoting monogamy? Is ARV distribution enough, or do non-profits need to follow up with clients to make sure each takes their medication? What progress has been made on an AIDS vaccine – does that need more funding? Instead, I found a report full of corporate gobbledygook, which endorsed the following best practices – “working with the government; building local capacity; keeping funding flexible; selecting appropriate recipients; making the money move; and collecting and sharing data.”

Seriously? “Selecting appropriate recipients?” “Making the money move?” Does anyone think a paper like this can, will, or should change anyone’s behavior?

This is just the latest example I’ve seen of reports that seem to actually say nothing. By “nothing” I mean one of two things: either 1) the conclusions a paper offers are so general and vague, and backed by such scant evidence and reasoning, that they’re practically useless, or 2) the paper asserts conclusions which are so obvious that no one could possibly argue with them.

There’s the paper on practices of high-impact nonprofits that’s been floating around the blogosphere; I thought Albert’s post (linked) did a good job pointing out its shortcomings, but I also want to mention that its 6 attempts at “debunking myths” (pg 34-35) seem to come down to saying: “Effective nonprofits can come in all shapes and sizes.” Really? This changes everything!

There’s the Hard Lessons paper many have praised as a breakthrough in foundation self-criticism. Hard lessons taught here include “Allow room for the definition of success to shift and evolve as people learn what is possible and effective, as relationships deepen, and as the work matures”; “Match evaluation tools to their purposes”; and “Cultivate a flexible learning stance” (pg vii). They don’t, though, include any lessons about program design itself.

We often say we’d like to see more self-evaluation in the nonprofit sector. Papers like these are not what we’re referring to.