The GiveWell Blog

Experience vs. data, or, why I just muted the Yankees game

So I’ve been watching the ballgame, and it struck me how much sports announcers have impacted my outlook on charity. I can explain.

The most common form of “evidence” we get from charities goes something like this: “We don’t have the data, but we’re here, every day. We work with the children, personally. We’ve been doing this for decades and we’ve accumulated a lot of knowledge that doesn’t necessarily take the form of statistics.”

Put aside, for a minute, the fact that we get that same story from all 150 charities we’re deciding between (all of which presumably think their activities are most deserving of more funding). There’s another problem with the attitude above, one that occurs to me every time I hear Michael Kay announcing a baseball game. In sports, unlike in charity (and really unlike in most things, which is why I find it an interesting case study), the facts are available – and when you look at them, you realize just how little that “on the ground” experience can be worth.

The fact is that baseball announcers and sportswriters spend their entire lives watching, studying, and thinking about sports. Many of them are former athletes who have played the game themselves. They are respected, they are paid to do what they do, and they are more experienced (i.e., they’ve seen more) than I’ll ever be in my life. And yet so many of them truly know absolutely nothing.

“Jeter’s a whole different player in October,” says Mr. Kay (demonstrably false). “You don’t want young pitchers carrying you in the playoffs.” (Comically false – 3 of the last 5 World Series champions had rookie closers.) I’m not giving any more examples – this post would hit 30,000 words in a heartbeat. But I’m happy to refer you to sources that give 2-3 examples per day of seasoned professionals – who’ve spent their whole lives on this stuff – saying things that are obviously, intuitively, factually, empirically, demonstrably, completely wrong.

It hits me over and over again, and I still haven’t quite gotten used to it. My only explanation is that humans have an incredible ability to ignore what they actually see, in favor of (a) what they expect to see and (b) what they want to see. Now when I talk to an Executive Director or Development Officer whose life consists of running a charity and whose livelihood depends on convincing people that it’s the world’s best way to help people … I don’t know how much these factors cloud their judgment. Maybe not at all, in some (truly amazing, borderline inhuman) cases. But when they assure me that outcomes data isn’t necessary because they’ve been doing this for years, forgive me for having trouble swallowing it: I can’t help but think of Michael Kay, a man who’s done very little with his life but watch the Yankees, and still manages to know nothing about them.

Want to see what we mean by “bias”?

The US government commissioned an evaluation of its Talent Search program, designed to fight the college enrollment gap by providing advice (mostly on financial aid) to disadvantaged students. The evaluation is pretty thorough, rigorous, and straightforward about what it did and didn’t find, as we’ve come to expect with government-funded evaluations. (This is probably because the government is under pressure to actually produce something; “government” doesn’t tend to melt people’s hearts, and brains, the way “charity” seems to.)

This well-done study gives a great illustration of how misleading a poorly done study can be. What do I mean? Let’s focus on Texas (one of the three states it examined).

On page 47 of the PDF (page 23 of the study itself), you can see comparison statistics for Talent Search participants and other students in the state. Talent Search participants were far more likely to be economically disadvantaged … but … they also had much better academic records. I’d guess that’s because while money has its perks, it isn’t everything – motivation also matters a lot, and it’s easy to see why the Texans who bothered to participate in Talent Search might be more motivated than average kids.

How would you react if a charity running an afterschool program told you, “80% of our participants graduate high school, compared to 50% for the state as a whole”? Hopefully I just gave you a good reason to be suspicious. Kids who pick up an afterschool program and stick with it are already different from average kids.

But there’s more. The study authors spotted this problem, and did what they could to correct for it. Page 52 shows how they did it, and they did about as good a job as you could ask them to. They used an algorithm to match each Talent Search participant to a non-participant who was as similar as possible (what I dub the “evil twin” approach) … they ended up with a comparison group of kids who are, in all easily measured ways, exactly like the participants.
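
If you’re curious what that kind of matching actually involves, here is a minimal sketch of the general idea. This is not the study’s procedure – the covariates and student records below are hypothetical, and a real analysis would standardize the variables (and typically use propensity scores) before measuring similarity:

```python
# Toy illustration of nearest-neighbor ("evil twin") matching.
# NOT the study's actual procedure: covariates and records are hypothetical,
# and a real analysis would standardize variables before comparing them.

def distance(a, b, covariates):
    """Sum of squared differences across the measured covariates."""
    return sum((a[c] - b[c]) ** 2 for c in covariates)

def match_evil_twins(participants, non_participants, covariates):
    """Pair each participant with the most similar non-participant."""
    return [
        (p, min(non_participants, key=lambda n: distance(p, n, covariates)))
        for p in participants
    ]

# Hypothetical student records: prior GPA, attendance rate, low-income flag.
participants = [{"gpa": 3.1, "attendance": 0.95, "low_income": 1}]
non_participants = [
    {"gpa": 3.0, "attendance": 0.94, "low_income": 1},
    {"gpa": 2.1, "attendance": 0.80, "low_income": 0},
]

pairs = match_evil_twins(participants, non_participants,
                         ["gpa", "attendance", "low_income"])
# The comparison group is then simply the set of matched "twins."
```

The point of the exercise is that the comparison group ends up looking like the participants on everything you can measure – which, as the next paragraphs show, still isn’t everything that matters.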

So, if Talent Search participants outperformed their evil twins, Talent Search must be a good thing, right? Not so fast. As page 55 states, Talent Search participants had an 86% graduation rate, while their evil twins were only at 77%. The authors equivocate a bit on this, but to me it’s very clear that you can’t credit the Talent Search program for this difference at all. The program is centered on financial aid and college applications, not academics; to think that it would have any significant effect on graduation rates is a huge stretch.

The fact is, that invisible “motivation” factor is a tough thing to control for entirely. Even among students with the same academic record, some can be more motivated than others, and can thus have a brighter future, with or without any charity’s help.

This is why Elie and I are so prickly and demanding when it comes to evidence of effectiveness. These concerns about selection bias aren’t just some academic technicality – they’re a real issue when dealing with education, where motivation is so important and so hard to measure. If you believe, as we do, that closing or denting the achievement gap is hard, you have to demand convincing evidence of effectiveness, and studies with sloppily constructed comparison groups (or no comparison groups at all) don’t provide it. We want charities that understand these issues; that know how difficult it is to do what they’re doing; and that are unafraid – in fact, determined – to take a hard, unbiased look at what they’re accomplishing or failing to accomplish. It’s psychologically hard to measure yourself in a way that really might show failure, but how else are you going to get better?

Final finalists finalized

Our Cause 4 finalists are (in off-the-top-of-my-head order) KIPP, Achievement First, Replications Inc., New Visions for Public Schools, Student Sponsor Partners, Children’s Scholarship Fund, LEAP, Teach for America, the LEDA Scholars Program, the St. Aloysius School, Harlem Center for Education, and Double Discovery Center. Rather than put our reasoning here, I edited it into last week’s post so it’s all in one place. See Cause 4 under that post.

Two futures

My thoughts on the latest Giving Carnival:

Fundraising today is all about the pitch; 10 years from now, I hope it will be about the product.

Fundraising today reminds me a lot of car commercials. Car commercials try to convince you that a Ford is better than a Toyota or that a Volvo outclasses a Volkswagen by showing pictures of cars speeding around tight curves on steep mountains. Those commercials don’t tell you anything about what distinguishes one car from another. Similarly, my experience with fundraising materials from non-profits is that they’re virtually interchangeable (all are heavy on pictures of extremely cute 5-year-old children) and give me no information about what distinguishes one non-profit from another.

In the case of cars, the world has changed such that I have the ability to distinguish between powerful cars and less powerful cars, reliable cars and less reliable cars, safe cars and less safe cars using any number of resources (Edmunds, Consumer Reports, etc.). With charities the same is not yet true, but it will be.

In the future, fundraisers will have to support their programs by clearly explaining what their programs accomplish (as Holden and I keep learning, a pretty difficult enterprise) and providing evidence for their claims. They’ll have to do this because donors will demand it. No longer willing to accept vague descriptions of program accomplishments and no longer scared to criticize organizations which, though they mean well, accomplish little, donors will force a new age of fundraising, one where the conversation revolves around how best to help those in need.

That’s the utopian future I hope for and envision. The future that frightens me most (and the one that other carnival participants, like here and here, envision) is a stronger, faster, super-powered version of today, where more efficient strategies, super-slick presentations, and lightning-fast, fully linked technologies let people give more with virtually no thought at all. The world might be improving, or it might not, and it won’t even matter because no one will care. “People are giving more than ever!” they’ll say, and that will be that.

Finals

Over the last four weeks, Elie and I have read through the ~160 applications we received, and now we’re wrapping up Round One and getting ready to get deeper into the issues and the charities we’ve picked as finalists.

Unfortunately, putting the apps themselves on the web is going to take a while, just because of boring technological reasons (as well as the need to remove all the confidential stuff, which fortunately is mostly just salary info). We’ve been focusing entirely on picking finalists, so that we can get our Round 2 apps to them as quickly as possible. In the meantime, here’s a rundown of what we’ve done & where we stand.

I believe this is the most complete description you will find anywhere (in public) of what criteria a grantmaker used to pick between applicants. If I’m wrong, send me the link.

Basic approach

Round One is mostly about practicality. If we tried to do back-and-forth questioning with – and gain full understanding of – all our applicants, we’d have no grant decisions and nothing to show for years. So the basic question we’ve been asking is less “Which charities have great programs?” and more “Which charities are going to be able to demonstrate their effectiveness, using facts rather than impressions?”

It’s possible that if we spent 10 years working with a charity, we could be convinced of its greatness without any hard data (although I think this is less likely than most people assume: when you’re aiming to make a permanent difference in clients’ lives, you can’t just rely on how they behave during the time you know them). But because we seek to serve individual donors, we need to make decisions that not only ring true to us, but that we can justify to people who’ve never met us or our applicants.

We asked each applicant to pick one program to focus on for now – rather than trying to start broad, we are looking for charities that have at least one instance of documenting and evaluating a program rigorously.

Africa (Causes 1 and 2)

Saving lives and improving living standards are so intimately connected to each other – and so many of our finalists do quite a bit of both – that for now we’re treating this applicant pool as one. We had a total of 71 applications in these causes.

The Aga Khan Foundation, Food for the Hungry, Population Services International, Partners in Health, the American Red Cross, Opportunities Industrialization Centers International, and Project HOPE are all mega-charities that do many things in many places, but gave us reason to think we can understand them, by submitting very detailed, concrete accounts of their featured programs. In each case, we could get a feel for how many people benefited from the program and how, whether because the charity tracked clients directly (as Partners in Health did) or because it submitted a strong independent research case for similar programs (as the Red Cross did with its bednet distribution program). We sent each of these charities The Matrix, talked to each on the phone, and got mixed reactions: some are sending us what they have, others are sending us their own “dashboards,” and one (Red Cross) has decided that the time cost of Round 2 is too high.

The International Eye Foundation and Helen Keller International are less broad, each focusing on eye care and preventable blindness. Like the megacharities above, each sent us a very strong, clear, quantified account of an individual program. We sent them “mini-Matrix” applications to get a better sense of their eye care activities.

Opportunity International and the Grameen Foundation both focus on microfinance. Opportunity International did the best job of any microfinance organization of giving us a sense that its activities have led to improved lives (in other words, going beyond “number of loans” and giving living-standards-related figures); Grameen Foundation’s application led us, through a series of footnotes, to the best overview I’ve seen of the general effects of microfinance (although, as I’ll write in the future, it leaves much to be desired). We sent each a motherlode of questions about the regions they work in and the people they lend to, so that we can get a better understanding of how this complex strategy impacts their clients’ lives.

Interplast, KickStart, the HealthStore Foundation, and the Global Network for Neglected Tropical Diseases all have relatively simple, unified, and highly compelling models for helping people. KickStart develops and markets irrigation technology to improve living standards; Interplast treats congenital defects; the HealthStore Foundation franchises locals to sell medicine; and GNNTDC focuses specifically on intervention campaigns for the “diseases of poverty” that don’t get much mainstream attention (onchocerciasis, anyone?). Of these, only KickStart provided the depth of information and measurement that we wanted to see, but we are hopeful that we’ll be able to get there more easily with them than with the more sprawling charities above. We sent each of these four charities highly tailored applications, asking very specifically for the information we need to understand their effects.

And those are the 15 finalists for helping people in Africa. No other applicants gave us a concrete sense of how many people they’d helped and how – they gave descriptions of their activities, anecdotes, newspaper articles, survey data (which I’m very skeptical of, as I explained recently), and often very strong evidence of the size of the problems they were attacking (i.e., disease X kills Y people a year). My take is, when you’re helping people thousands of miles away and choosing between hundreds of possible strategies, none of that is enough, because none of that tells you whether (and how much, and how cost-effectively) your program has worked (improved people’s lives). You need a sense for how many lives your activities have changed … at least for one program. Others may disagree, but that’s how we made the cut.

Early childhood care (Cause 3)

We are basically stalled on this one. Our 14 applicants sent piles of data about how they interact with children … but how does this translate to later life outcomes? We asked in our application, but no one answered – we got nothing on test scores, grades, or anything else that happens once the children enter school.

We haven’t yet named finalists. We are going to continue our research on early childhood care, this time looking specifically for proven best practices, then return to the applications to compare these best practices with the practices of our applicants.

K-12 education (Cause 4)

Note: This section was edited on 9/8/2007 to reflect our announcement of finalists.

We received 50 applications for this cause. Because the achievement gap is such a thorny problem, we focused as much as possible on charities with rigorous evidence that they’ve improved academic outcomes for disadvantaged kids. We used the following principles:

  • What counts in this cause is academic performance. Graduation rates, college enrollment rates, attendance, grade promotion, test scores. Raising children’s self-esteem may be valuable, but if that can’t be shown to translate to better performance (specifically or generally), it doesn’t make a strong Cause 4 applicant. And we trust survey data (kids’, parents’ and teachers’ perceptions of their performance) much less than behavior data, which in turn we trust much less than performance data. I believe there are a million psychology studies that will back me up on this mistrust of surveys, but I don’t have time to compile them now – we will do so by the time we go live with our website.
  • Context is key. It isn’t enough to see that test scores improved over some time period; test scores usually improve with age (as you can see when you look at citywide and statewide data, which the stronger applicants provide). To us, evidence that a program worked means evidence that the participants outperformed some comparable “control group.” Some control groups are better designed than others: showing that your students had a higher graduation rate than the city as a whole is less compelling than showing that they had a higher graduation rate than students from similar areas.
  • Selection bias is dangerous. This famous paper is one example of the danger of selection bias: if you offer a program to help children succeed, the more motivated children (and families) will likely be the ones to take advantage. Child and family motivation is incredibly important in education (we will also be sure to provide the research case for this when the time comes, though it should intuitively click). One applicant referred us to studies showing that participants in chess programs outperform participants in other extracurriculars … common sense tells us that the teaching power of chess likely has less to do with this than the fact that, to put it bluntly, kids who are nerds get better grades. Our strongest applicants showed or referred us to evidence that at least attempts to get around this tricky problem, either through randomization or through the (less powerful, but still meaningful) technique of comparing changes in test scores – a rough sketch of that comparison follows this list.
  • We want evidence that a program has worked before. A couple applicants submitted extremely rigorous, methodologically strong studies showing practically no difference between their participants and a control group. I love these applicants, honestly. It’s fantastic to measure yourself, find failure where you didn’t expect it, and openly share it with others. I wish I saw more of it, and in fact I’ve been toying with the idea of offering a “Failure Grant” sometime in the future to organizations that can convincingly show why a program is failing and what they can do about it. But for our pilot year, we are looking for proven, effective, scalable ways of helping people, not just strong research techniques.
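
To make the comparison-group idea a bit more concrete, here is a minimal sketch of what comparing changes in test scores looks like. The numbers are invented, and a real evaluation would also need a defensibly constructed comparison group and statistical tests:

```python
# Toy gain-score comparison: did participants' scores rise MORE than a
# comparison group's? (Scores tend to rise with age, so participants' raw
# gain by itself tells you very little.) All numbers are hypothetical.

def mean(xs):
    return sum(xs) / len(xs)

def average_gain(pre_scores, post_scores):
    """Average change from the pre-test to the post-test."""
    return mean([post - pre for pre, post in zip(pre_scores, post_scores)])

# Hypothetical pre/post test scores.
participants_pre, participants_post = [60, 55, 70], [72, 66, 78]
comparison_pre, comparison_post = [61, 54, 69], [66, 59, 73]

participant_gain = average_gain(participants_pre, participants_post)
comparison_gain = average_gain(comparison_pre, comparison_post)

# What matters is the difference in gains, not either gain in isolation.
print(participant_gain - comparison_gain)
```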

Learning through an Expanded Arts Program (LEAP) promotes a specific in-school curriculum, and attached a study showing that children randomly selected to receive this curriculum tested better than those who had been randomly selected not to. Children’s Scholarship Fund provides scholarships for private school to K-8 students, and attached a study on very similar programs that also employed randomization (although it showed better results only for African-American children, something we find fishy – we need to investigate more to see whether this study may have cherry-picked results). Teach for America, which trains and subsidizes recent college graduates to teach in disadvantaged districts, attached a broad study of effects at several of its sites, finding that its teachers’ students perform about the same on reading and better on math. KIPP, Achievement First, and Replications Inc. are all in the business of creating new schools, and all examined the changes in their students’ test scores versus those of students in nearby districts – far from a perfect examination, but more convincing than anything else we’ve seen from similar programs. New Visions for Public Schools, a school support organization that we’ve known about for a long time (it’s even one of our old recommended orgs from the part-time version of our project last year), examined a series of schools it created, matching their students to students in similar districts with similar academic records, and found consistently improved attendance and grade promotion rates (though test score outcomes were mixed).

Student Sponsor Partners claims to seek out “academically below-average” children and pay their way through private high school – we need more detail on its selection process, but if it is avoiding bias, then its results (substantially higher graduation rates, and enrollment in better colleges, compared to kids from the same middle schools) are impressive. The LEDA Scholars program, an extremely selective program that attempts to raise college expectations for top disadvantaged/minority students, attached a study showing much better results for its scholars than for those who made its final round. (How comparable are these two groups? We need to find out a lot more about this.) Harlem Center for Education and Double Discovery Center both run the Federal Talent Search program, which, as HCE pointed out, has been shown preliminarily to increase college matriculation rates compared to a somewhat reasonable (though not perfect, as I’ll discuss next week) comparison group. Finally, the St. Aloysius School runs all the way from preschool through 8th grade, with high school support – an intensity of intervention that we haven’t seen in any other program – and claims a 98% graduation rate. We need more info, but we felt we had to check this one out a bit further.

That’s 12 finalists. Some of the studies leave a lot of questions unanswered, some have small sample sizes, etc., but in the end, we feel that this is a good representative group of different kinds of programs, and that these are the programs that have the best shot at really convincing us that they’ve improved academic outcomes. Other submissions either had no data on academic outcomes, provided no useful context for that data (i.e., no comparison group there’s at least some reason to believe is appropriate), or showed little to no apparent impact of their program.

We didn’t take any afterschool/summer programs, because none were able to show improved academic outcomes either for their own program or for a very similar one, and we haven’t encountered any independent evidence that afterschool/summer programs in general (whether academically or recreationally focused) have the kind of impact we’re looking for.

Helping NYC adults become self-supporting (Cause 5)

We’ve already been over our take on this cause: most applicants run similar programs, and we are continuing with the ones that sent us data not just on how many people have been placed in jobs, but on how many have remained in those jobs 3-24 months later. The HOPE Program, Vocational Foundation, Year Up, Catholic Charities of NY, and Covenant House all fit this bill; Highbridge Community Life Center and St. Nick’s Community Preservation Corp. did not provide retention data, but gave such a thorough description of the jobs (and job markets) they prepare clients for that we felt we had a strong sense of likely outcomes for their clients.

That’s seven finalists, from a field of 21.

A couple things we didn’t consider

We didn’t give any points for newspaper articles, third-party awards, etc. unless the relevant attachment linked us to some useful data. We believe that donors and the media (and possibly foundations too) evaluate charities on the wrong criteria, which is why we exist in the first place.

We took a mostly “black box” approach, paying much more attention to the evidence that a tactic has worked than to the details of the tactic. The fact is, nearly every charity we looked at is doing something that makes sense and could logically improve people’s lives. Almost none of them are doing something that makes tons more sense than the others. We read through all the descriptions of programs, and we made sure to include at least one organization for any model we found truly distinctive and compelling, but it was the evidence of effectiveness that carried most of the weight in determining finalists.

We paid as little attention as we could to how well written the application was, and anything else that would favor a good grantwriter over a bad one. We simply see no reason to think that the organizations with better fundraisers would be the ones with better programs. Of course, a grantwriter who didn’t send any evidence of effectiveness killed their org’s chances, which is something we have to live with. But I think we asked for this evidence pretty clearly in our application.

On fallibility, uncertainty, and the transient nature of truth

I think I said it best in my email to the organizations that didn’t pass Round 1:

Practical considerations have made it necessary to make our decisions based on limited information. We know we haven’t gotten a complete picture of the issues or of any organization. Our decision doesn’t reflect a belief that your organization is ineffective; it merely reflects that the application was not one of the strongest we received.

Might there be applicants who have all the data we want, and just misread the application? Yes. Might there be applicants who do wonderful work and don’t measure it? Absolutely. Might we be making faulty assumptions about the issues and what matters? You betcha.

But our goal isn’t perfection; our goal is to spend money as well as possible and start a dialogue about how we did it. We had to cut a field of over 150 applications down to something we can work with. What I’ve done above is give you a full account of how we did it, and once we post the applications (and announce it here), you’ll be able to compare our principles with the applications themselves. Is our reasoning clear? Concrete? Reasonable? Insane? Speak up.

Helping people: Easy or hard?

I wonder how much of the difference between our approach to charity and others’ approach comes from this very simple fact:

We think that improving people’s lives is really hard.

It might be easy to brighten someone’s day, even their week. But to get someone from poverty and misery to self-sufficiency and a world of opportunity … you have to change not just their resources, but their skills, often their attitude, and always their behavior. You can try to do this early in life or late; either way, it’s an uphill battle.

As we’ve said over and over, that’s why we think “money spent” is a poor proxy for “good accomplished,” and why we tend not to get as excited as others over dollars raised or dollars spent. But as we get deeper into reading apps, I’m finding there’s more and more to this rift.

I’m very skeptical of any program that claims great effects with relatively low amounts of intervention, whether it’s a one-time class on condom use or a once-a-week tutoring session. I think about how easy it is for the people I know to sit through some class, walk out full of ideas, and forget them a week later, and I think – if you’re doing anything meaningful for people with this little investment, you must be some sort of sorcerer.

I trust survey data about as far as I can throw it (note: data cannot be thrown). Katya’s post today gives a whiff of the argument that “people are notoriously bad predictors of their behavior”; there’s actually an enormous amount of literature out there ramming home this point (if you really want to see it, let me know). We’ve seen a lot of survey data along the lines of “97% of participants felt the program improved their ability to ___,” which would be great if wishes were horses and horses were great.

I tend to feel the same way about anecdotes and personal experiences. They can be useful to get a picture of a program, but as evidence that it works? People are easy to change in the moment and incredibly hard to change in the long run. How can you possibly tell from watching a group of children for a year that they’re headed for better lives?

And in the end, though I keep reminding myself that evidence of effectiveness can come in all different forms, I can’t keep down that longing for randomized studies, or at least studies that employ some sort of comparison group. These are very rare in the applications we’re receiving: charities tend to look only at their clients. While this clearly saves them a lot of money and hassle, it leaves me wondering whether they’re really helping people … or just picking out the ones who are going to succeed anyway (you’re probably familiar with this question in the context of Ivy League schools). I’m sure this is heresy, but if we started a Placebo Foundation that sought out the “most motivated” poor people as clients, did absolutely nothing for them, and then examined their outcomes, would we find the same “successes” that many charities claim?

And yet every study I’ve read from applicants – even when limited to survey data, even when lacking a comparison group of any kind – concludes that the program in question was a success.

Unless there’s willful deception going on here, this implies to me that the people who work at charities think helping people is really, really easy; that there’s no need to worry about all the questions above; that the flimsiest and most perfunctory of evidence is good enough to walk away from feeling that people in need have been truly helped.

Maybe we’re wrong. Maybe helping people is that simple. If that’s what you think, keep writing those checks to whoever sends you mail, and make sure they aren’t blowing a penny more than they have to on salaries. If you’re as concerned as I am, though, I’ve got some tough news for you: aside from GiveWell, I don’t think you have much company.