The GiveWell Blog

Not everyone under-evaluates …

Fundraisers seem to do a phenomenal job.

Somehow, you don’t see fundraisers making a lot of arguments like “Money spent on evaluation means fewer letters mailed out” or “Evaluation is difficult and you can never really isolate causality perfectly.” Instead, you see them testing. And testing. And testing. And learning things that are far from obvious. And testing again.

Maybe it’s because they’re the ones the nonprofits rely on to stay in business. The program side, on the other hand, doesn’t have to be as good as it can be … unless we demand it.

Literature reviews

All I was trying to say in the last post could be summed up like this:

This is an awesome literature review about early child care programs. It describes how the author found all the papers she discusses … it is totally straightforward about what the methodological strengths and weaknesses of each paper are … and, hold on to your seats – the tables on pages 218 and 223 can only be described as “rock star.” At a glance, you can see all the studies under discussion, who they looked at, how they looked, and what they found.

The literature review I discussed on Saturday is nothing of the sort. It’s unclear about study design, it makes broad claims whose support is unclear, and of course, there are no awesome tables.

If only people were as determined to take a tough look at microlending as they are at Head Start. But of course, Head Start is politics; it affects us all; it’s important; it’s difficult; it’s controversial; it needs to be argued about. Microlending is charity, so it’s none of those things. That all checks out, right?

Microlending: The mystery deepens

Goodness, this post is long and dry. The headline is: I read the paper everyone points to as “hard evidence of microfinance’s effectiveness,” and I came out with tons of questions and a need to visit the library. I’ve learned nothing about how microlending works (is it financing investment? Smoothing consumption? What kinds of activity is it making possible?), and all of the data on how well it works leaves me with about 1,000 methodological concerns, possibly just because of how vaguely the studies are described.

The paper is published by the Grameen Foundation and available here.

Concerns about bias

As in education, selection bias is a major concern in evaluating microfinance. If Person X is ready to take out a loan, confident that she will pay it back, while Person Y isn’t – who would you bet on, even if no loan is given? The Coleman study described on pages 20-21 gives an excellent illustration of this issue. It’s the only study in the paper that uses what I would call the “ideal” design: inviting twice as many participants to a program as it has slots, a year in advance, and then choosing the participants at random. The study found that participants in the credit program were generally wealthier than non-participants, but that once you controlled for this, the program didn’t appear to make them any better off.

The review author points out that the study was done in Thailand, which already has a government-sponsored subsidized credit system. So we agree that the paper doesn’t tell us much about the impact of microlending in general … but it does show the perils of selection bias, and I’ve led with it because this problem affects so many of the other studies.

The worst are the studies – and there are many – that simply compare living standards for clients vs. non-clients, without examining whether the two groups may have been different to begin with. These could easily just be showing the same effect as the study above: borrowers are wealthier before they ever borrow. Most of the studies discussed early in the paper look likely to have this problem (or at least the description doesn’t make clear how they deal with it): Hossain 1988 (page 16), the SEWA study (24-25), the Mibanco study (26 – it isn’t entirely clear who’s being compared to whom, but all the differences discussed are between clients and non-clients, with no discussion of where they started out), the ASHI study (27), and the FINCA studies (28). The last two look at changes in incomes, not just incomes, but if one group started off wealthier, I’d think they’d be more likely to increase their incomes too (regardless of any help from charities).
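To make the worry concrete, here’s a toy simulation – my own sketch with made-up numbers, not anything from the review – of how a snapshot comparison of clients vs. non-clients can “find” a benefit even when the loans do nothing at all:

```python
import random

random.seed(0)

# Made-up baseline incomes; the loan in this toy world has ZERO effect.
population = [random.gauss(100, 30) for _ in range(10_000)]

# Selection: the better-off (and more confident) are more likely to borrow.
borrowers, non_borrowers = [], []
for income in population:
    p_borrow = min(max((income - 70) / 100, 0), 1)
    (borrowers if random.random() < p_borrow else non_borrowers).append(income)

mean = lambda xs: sum(xs) / len(xs)
print(f"borrowers:     {mean(borrowers):.1f}")
print(f"non-borrowers: {mean(non_borrowers):.1f}")
# Borrowers come out well ahead even though the "loan" did nothing --
# exactly the gap a client vs. non-client snapshot would report as impact.
```

Randomizing who gets invited, as in the Coleman design, is what breaks the link between being better off and being a borrower in the first place.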

Incoming-client comparisons: still a long way from easing my mind

The vast majority of the studies discussed try to get around this problem by comparing incoming clients to existing clients. This seems better than simply comparing clients to non-clients in the same region: incoming clients presumably share most qualities of existing clients, aside from the fact that they haven’t yet benefited from microcredit. But this test is still miles from rigorous. Page 7 points out a couple of potential problems with it – eager borrowers may differ from “wait and see” types, and more importantly (to me), MFIs may loosen their admission standards over time, which would mean that incoming clients are systematically worse off than existing clients for reasons that have nothing to do with the benefits of microloans. And then, of course, there’s just the fact that times change. For example, if microloan programs systematically attract people above a certain income level (as the Coleman study implies), and an economic boom makes everyone wealthier, you’ll see existing clients (originally the only people wealthy enough to enter the program) doing better than new clients (who have just now become wealthy enough).

A relatively simple (imperfect) way to adjust for all this would be to compare incoming clients both to existing clients today and to those same existing clients at the time they entered the program. This would at least check whether existing clients were systematically better off to start with. Here’s the bad news: if the studies discussed do this, the literature review very rarely mentions it. It occasionally points to a divergence between existing clients and incoming clients on some quality like age (see page 36), rural/urban status (36), schooling (33), etc., implying that the two groups are often not very comparable … but the review is generally short on details, and in almost every case it does not address the issue I’ve highlighted here.
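Here’s roughly what that check would look like, with hypothetical records (the field names and numbers are my assumptions, not anything reported in the review):

```python
# Hypothetical records -- field names and numbers are assumptions, not data
# from the review.
existing_clients = [
    {"income_at_entry": 95, "income_now": 120},
    {"income_at_entry": 110, "income_now": 140},
    {"income_at_entry": 90, "income_now": 115},
]
incoming_clients = [{"income_now": 85}, {"income_now": 100}, {"income_now": 90}]

mean = lambda xs: sum(xs) / len(xs)
existing_now = mean([c["income_now"] for c in existing_clients])
existing_at_entry = mean([c["income_at_entry"] for c in existing_clients])
incoming_now = mean([c["income_now"] for c in incoming_clients])

print(f"existing clients today:    {existing_now:.0f}")
print(f"existing clients at entry: {existing_at_entry:.0f}")
print(f"incoming clients today:    {incoming_now:.0f}")
# If existing clients were already better off AT ENTRY than today's incoming
# clients, the simple existing-vs-incoming gap overstates the program's
# effect. That at-entry comparison is the (imperfect) check I'd want to see.
```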

Nearly all of these “incoming-clients” studies show significant positive differences between existing and incoming clients – which would imply that microfinance has improved its clients’ lives, but only if the concerns above have been addressed. I’m going to have to check out the papers myself before I feel very convinced.

Here’s what we’re left with:

These are all of the studies discussed that appear to address the concerns above in any way:

  1. A 2004 study of a Pakistan program (33-34) compared clients to non-clients starting at similar levels of income, and showed a much larger increase in income for clients (though both groups saw huge increases – roughly 30% for clients vs. roughly 20% for non-clients). This doesn’t account for “motivation bias” (the “optimism and drive” that taking out a loan may indicate), but at least it’s looking at people who started in about the same place.
  2. A similar study was done on a Bosnia/Herzegovina program in 2005 (35-36), again showing much larger income gains for participants in the programs, and again adjusting for starting income though not for the “optimism and drive” bias.
  3. The Second Impact Assessment of the BRAC program (29) compared changes for clients vs. non-clients: between 1993 and 1996, the % of clients with a sanitary latrine went from 9 to 26, while the % of non-clients with a sanitary latrine went from 10 to 9 (the simple difference-in-differences arithmetic is sketched just after this list). The latrine variable is the only one where the paper makes clear that the two groups started in the same place; the rest of the discussion of the study seems to imply that they were pretty different to start with, in other ways.
  4. Page 21 claims that Gwen Alexander “recreated” the design of the randomized Coleman study I led off with, using the same dataset that a bunch of the other studies were working from. It’s totally unclear to me how you recreate a randomized-design study using data that didn’t involve randomized design, and the paper doesn’t fill me in at all on this.
  5. Finally, pages 17-20 discuss a back-and-forth between two scholars, in which the details of what it means to own a “half acre of land” – as well as a debate over a complicated, unexplained methodology that actually appears to be called “weighted exogenous sampling maximum likelihood-limited information maximum likelihood-fixed effects” – appear to make the difference between “microfinance is phenomenal” and “microfinance accomplishes nothing.” The part I’m most interested in is the final paper in this series, Khandker (2005) (discussed on page 19), which draws incredibly optimistic conclusions (crediting microfinance with more than a 15% reduction in poverty over a 6-year period). Unfortunately, the review gives no description of the methodology here, particularly how all the concerns about bias were addressed: all it says is that the methodology was “simpler” than whatever wacky thing was done in the first paper.
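For item 3, the arithmetic that makes the latrine comparison the most persuasive of the bunch is just a difference-in-differences on the figures the review reports:

```python
# Difference-in-differences on the latrine figures the review reports for
# the BRAC study (% of households with a sanitary latrine, 1993 -> 1996).
clients_1993, clients_1996 = 9, 26
nonclients_1993, nonclients_1996 = 10, 9

client_change = clients_1996 - clients_1993            # +17 points
nonclient_change = nonclients_1996 - nonclients_1993   # -1 point
diff_in_diff = client_change - nonclient_change        # +18 points

print(f"change for clients:        {client_change:+d} points")
print(f"change for non-clients:    {nonclient_change:+d} points")
print(f"difference-in-differences: {diff_in_diff:+d} points")
# This isolates a program effect only if the two groups would have trended
# the same way without the program -- which still leaves the motivation
# question wide open.
```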

Bottom line:

So, bottom line: we have 3 studies (the first three above) showing promising impacts at particular sites, though they were not done by independent evaluators and may suffer both from “publication bias” (charities’ refraining from publishing negative reports) and from the “optimism/motivation bias.” We have 2 studies that the review claims found great results with a rigorous methodology, but its description leaves the details of that methodology completely unclear. And we have a host of studies that could easily have been doing nothing more than observing that (relatively) wealthier people are more likely to take advantage of microlending programs.

My conclusion? We have to get to the library and read these papers, especially the ones that are claimed to be rigorous.

Conclusion of the review? “The previous section leaves little doubt that microfinance can be an effective tool to reduce poverty” (see page 22 of the 47-page study – before 80% of the papers had even been discussed!). And in the end, that’s why I’m so annoyed right now. This paper does not, to me, live up to the promise it makes on page 6, to “[compile] all the studies … and present them together, in a rigorous and unbiased way, so that we could finally have informed discussions about impact rooted in empirical data rather than ideology and emotion.” It covers some truly low-information studies (like the first set I discussed) while presenting them as evidence of effectiveness; it discusses the most important studies without giving any idea of how (or whether) they corrected for the most dangerous forms of bias. It calls the fact that a paper stimulated debate, rather than unanimity, “unfortunate” (18). It’s peppered throughout with excited praise (like the quote above, and the super-annoying parenthetical on page 29). In the end, I don’t feel very inclined to take its claims at face value until I hit the library myself.

Without either the detail I need or a tone I trust, I don’t feel very convinced right now that microlending improves lives, let alone does so reliably. (I’d put it around 63%.) I’m surprised that this is the paper everyone points to; I’m tempted to say that if more people read it, as opposed to just counting the pages and references, something clearer and more neutral would have become available by now.

Averages

Averages really annoy me. Average income, average test score, etc. When we’re talking about any kind of analysis of people, I have a hard time thinking of any case where you should be looking at the average of anything.

I much prefer “% of people above (or below) some threshold” measures: % of students who graduated in 4 years or less, % of students scoring at proficiency level 3 or higher, % of families earning $20k/yr or less. This kind of metric is about 1.2x as complicated, and 2000x as meaningful, as an average.
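A toy illustration of why (made-up numbers, nothing to do with any particular program):

```python
# Made-up incomes: one family striking it rich moves the average a lot,
# and tells you nothing about the families you actually care about.
incomes = [12_000, 15_000, 18_000, 19_000, 22_000, 25_000, 30_000, 40_000]
incomes_next_year = incomes[:-1] + [400_000]

mean = lambda xs: sum(xs) / len(xs)
share_under_20k = lambda xs: sum(x <= 20_000 for x in xs) / len(xs)

print(f"average income:       ${mean(incomes):,.0f} -> ${mean(incomes_next_year):,.0f}")
print(f"% earning <= $20k/yr: {share_under_20k(incomes):.0%} -> {share_under_20k(incomes_next_year):.0%}")
# The average jumps from ~$22,600 to ~$67,600; the share of low-income
# families stays at 50%. The threshold measure answers the question I
# actually have.
```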

Just thought you’d like to know.

Experience vs. data, or, why I just muted the Yankees game

So I’ve been watching the ballgame, and it struck me how much sports announcers have impacted my outlook on charity. I can explain.

The most common form of “evidence” we get from charities goes something like this: “We don’t have the data, but we’re here, every day. We work with the children, personally. We’ve been doing this for decades and we’ve accumulated a lot of knowledge that doesn’t necessarily take the form of statistics.”

Put aside, for a minute, the fact that we get that same story from all 150 charities we’re deciding between (all of which presumably think their activities are most deserving of more funding). There’s another problem with the attitude above, one that occurs to me every time I hear Michael Kay announcing a baseball game. In sports, unlike in charity (and really unlike in most things, which is why I find it an interesting case study), the facts are available – and when you look at them, you realize just how little that “on the ground” experience can be worth.

The fact is that baseball announcers and sportswriters spend their entire lives watching, studying, and thinking about sports. Many of them are former athletes who have played the game themselves. They are respected, they are paid to do what they do, and they are more experienced (i.e., they’ve seen more) than I’ll ever be in my life. And yet so many of them truly know absolutely nothing.

“Jeter’s a whole different player in October,” says Mr. Kay (demonstrably false). “You don’t want young pitchers carrying you in the playoffs.” (Comically false – 3 of the last 5 World Series champions had rookie closers.) I’m not giving any more examples – this post would hit 30,000 words in a heartbeat. But I’m happy to refer you to sources that give 2-3 examples per day of seasoned professionals – who’ve spent their whole lives on this stuff – saying things that are obviously, intuitively, factually, empirically, demonstrably, completely wrong.

It hits me over and over again, and I still haven’t quite gotten used to it. My only explanation is that humans have an incredible ability to ignore what they actually see, in favor of (a) what they expect to see and (b) what they want to see. Now when I talk to an Executive Director or Development Officer whose life consists of running a charity and whose livelihood depends on convincing people that it’s the world’s best way to help people … I don’t know how much these factors cloud their judgment. Maybe not at all, in some (truly amazing, borderline inhuman) cases. But when they assure me that outcomes data isn’t necessary because they’ve been doing this for years, forgive me for having trouble swallowing this: I can’t help but think of Michael Kay, a man who’s done very little with his life but watch the Yankees, and still manages to know nothing about them.

Want to see what we mean by “bias”?

The US government commissioned an evaluation of its Talent Search program, designed to fight the college enrollment gap by providing advice (mostly on financial aid) to disadvantaged students. The evaluation is pretty thorough, rigorous, and straightforward about what it did and didn’t find, as we’ve come to expect with government-funded evaluations. (This is probably because the government is under pressure to actually produce something; “government” doesn’t tend to melt people’s hearts, and brains, the way “charity” seems to.)

This well-done study gives a great illustration of how misleading a poorly done study can be. What do I mean? Let’s focus on Texas (one of the three states it examined).

On page 47 of the PDF (page 23 of the study), you can see comparison statistics for Talent Search participants and other students in the state. Talent Search participants were far more likely to be economically disadvantaged … but … they also had much better academic records. I’d guess that’s because while money has its perks, it isn’t everything – motivation also matters a lot, and it’s easy to see why the Texans who bothered to participate in Talent Search might be more motivated than the average kid.

How would you react if a charity running an afterschool program told you, “80% of our participants graduate high school, compared to 50% for the state as a whole”? Hopefully I just gave you a good reason to be suspicious. Kids who sign up for an afterschool program and stick with it are already different from average kids.

But there’s more. The study authors spotted this problem, and did what they could to correct for it. Page 52 shows how they did it, and they did about as good a job as you could ask them to. They used an algorithm to match each Talent Search participant to a non-participant who was as similar as possible (what I dub the “evil twin” approach) … they ended up with a comparison group of kids who are, in all easily measured ways, exactly like the participants.
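To give a flavor of the matching idea, here’s a bare-bones sketch – my own, with invented fields and a crude distance measure, not the study’s actual (more sophisticated) algorithm:

```python
# Bare-bones "evil twin" matching: pair each participant with the most
# similar non-participant on the traits we can observe. Field names and
# the distance measure are my assumptions, not the study's.
participants = [
    {"gpa": 3.1, "low_income": 1, "attendance": 0.95},
    {"gpa": 2.6, "low_income": 1, "attendance": 0.88},
]
non_participants = [
    {"gpa": 3.0, "low_income": 1, "attendance": 0.93},
    {"gpa": 2.7, "low_income": 0, "attendance": 0.90},
    {"gpa": 2.5, "low_income": 1, "attendance": 0.85},
]

def distance(a, b):
    """Crude dissimilarity score across the observed traits."""
    return sum(abs(a[key] - b[key]) for key in a)

for p in participants:
    twin = min(non_participants, key=lambda candidate: distance(p, candidate))
    print(p, "->", twin)
# The catch: "motivation" never appears in these records, so a group matched
# perfectly on everything observable can still differ in the way that
# matters most.
```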

So, if Talent Search participants outperformed their evil twins, Talent Search must be a good thing, right? Not so fast. As page 55 states, Talent Search participants had an 86% graduation rate, while their evil twins were only at 77%. The authors equivocate a bit on this, but to me it’s very clear that you can’t credit the Talent Search program for this difference at all. The program is centered on financial aid and college applications, not academics; to think that it would have any significant effect on graduation rates is a huge stretch.

The fact is, that invisible “motivation” factor is a tough thing to control for entirely. Even among students with the same academic record, some can be more motivated than others, and can thus have a brighter future, with or without any charity’s help.

This is why Elie and I are so prickly and demanding when it comes to evidence of effectiveness. These concerns about selection bias aren’t just some academic technicality – they’re a real issue when dealing with education, where motivation is so important and so hard to measure. If you believe, as we do, that closing or denting the achievement gap is hard, you have to demand convincing evidence of effectiveness, and the fact is that studies with sloppily constructed comparison groups (or no comparison groups) are not this. We want charities that understand these issues; that know how difficult it is to do what they’re doing; and that are unafraid – in fact, determined – to take a hard, unbiased look at what they’re accomplishing or failing to accomplish. It’s psychologically hard to measure yourself in a way that really might show failure, but how else are you going to get better?