Microlending debate: an example of why academic research should be used with caution

We often use academic research to inform our work, but we try to do so with great caution, rather than simply taking reported results at face value. We believe it is a mistake to trust academic research just because it is peer-reviewed, published, and/or reputable.

A good example of why we’re concerned comes from the recent back-and-forth between David Roodman and Mark Pitt, which continues a debate begun in 1999 over what used to be considered the single best study on the social impact of microfinance.

It appears that the leading interpretation of this study swung wildly back and forth over the course of a decade, based not on major reinterpretations but on arguments over technical details, while those questioning the study were unable to view the full data and calculations of the original. We feel that this illustrates problems with taking academic research at face value and supports many of the principles we use in our approach to using academic research. Details follow.

Timeline/summary

1998-2005 studies by Khandker and Pitt

According to a 2005 white paper published by the Grameen Foundation (PDF), a 1998 book and accompanying paper released by Shahidur Khandker and Mark Pitt “were influential because they were the first serious attempt to use statistical methods to generate a truly accurate assessment of the impact of microfinance.”

Jonathan Morduch challenged these findings shortly after their publication, but a 2005 follow-up by Khandker appeared to answer the challenge and claimed that microlending had a very strong social impact:

each additional 100 taka of credit to women increased total annual household expenditures by more than 20 taka … microfinance accounted for 40 percent of the entire reduction of moderate poverty in rural Bangladesh.

As far as we can tell, this result stood for about four years as among the best available evidence that microlending helped bring people out of poverty. Our mid-2008 review of the evidence stated,

These studies rely heavily on statistical extrapolation about who would likely have participated in programs, and they are far from the strength and rigor of the Karlan and Zinman (2007) study listed above, but they provide somewhat encouraging support for the idea that the program studied had a widespread positive effect.

2009 response by Roodman and Morduch

A 2009 paper by David Roodman and Jonathan Morduch argued that

  • The Khandker and Pitt studies were seriously flawed in their attempts to attribute impact. The reduction in poverty they observed could have been an artifact of wealth driving borrowing, rather than the other way around.
  • The Khandker and Pitt studies could not be replicated: the full data and calculations they had used were not public, and Roodman and Morduch’s best attempts at a replication did not produce a remotely similar conclusion (they demonstrated no positive social impact of microlending, even a slight negative one).

This paper stood for the next two years as a prominent refutation of the Khandker and Pitt studies. Pitt writes that the work of Roodman and Morduch has become “well-known in academic circles” and “seems to have had a broad impact.” It appeared in a “new volume of the widely respected Handbook of Development Economics” as well as in congressional testimony.

2011 back-and-forth

Earlier this year,

  • Mark Pitt published a response arguing that Roodman and Morduch’s failure to replicate his study was due to Roodman and Morduch’s errors.
  • David Roodman replied, conceding an error in his original replication but defending his claim that the original study (by Khandker and Pitt) was not a valid demonstration of the impact of microlending.
  • Mark Pitt responded again and argued that the study was a valid demonstration.
  • David Roodman defended his statement that it was not and added, “this is the first time someone other than [Mark Pitt] has been able to run and scrutinize the headline regression in the much-discussed paper … If you anchor Pitt and Khandker’s regression properly in the half-acre rule … the bottom-line impact finding goes away.”
  • We had hoped to see a further response from Mark Pitt before discussing this matter, but Roodman also wrote that Mark Pitt is now traveling and that “this could be the last chapter in the saga for a while.”

Bottom line: as far as we can tell, we still have one researcher claiming that the original study strongly demonstrates a positive social impact of microfinance; another researcher claiming it demonstrates no such thing; and no end in sight, 13 years after the publication of the original study.

Disagreements among researchers are common, but this one is particularly worrisome for a few reasons.

Major concerns highlighted by this case

  • Conflicting interpretations of the study have each stood for several years at a time. The original study stood as the leading evidence about microlending’s social impact from 2005 to 2009; the challenge by Roodman and Morduch was highly prominent, and apparently not commented on at all by the original authors, from 2009 to 2011.

  • Disagreements have been technical, many concerning details that few understand and that still don’t seem resolved. David Roodman states that “the omission of the dummy for a household’s target status” is responsible for his estimated effect of microlending coming out negative instead of positive. Numerous other errors on both sides are alleged, and the remaining disagreements over causal inference are certainly beyond what I can easily follow (if a reader can explain them in clear terms I encourage doing so in the comments).
  • Resolution has been hampered by the fact that Roodman and Morduch could only guess at the calculations Pitt and Khandker performed. This is the biggest concern to me. Roodman writes that he was never able to obtain the original data set used in the paper; that the data set he did receive (upon request) was (in his view) confusingly labeled; and even that one of the original authors “fought our efforts to obtain the later round of survey data from the World Bank.” As a result, his attempt at replication was a “scientific whodunit,” and his April 2011 update represents “the first time someone other than [the original author] has been able to run and scrutinize the headline regression in the much-discussed paper.”

    If I weren’t already somewhat familiar with this field, I would be shocked that it’s even possible to have a study accepted to any journal (let alone a prestigious one) without sharing the full details of the data and calculations, and having the calculations replicated and checked. But in fact, disclosure of data – and replication/checking of calculations – appears to be the exception, not the rule, and is certainly not a standard part of the publication/peer review process.

Bottom line – the leading interpretation of a reputable and important study swung wildly back and forth over the course of a decade, based not on revolutionary reinterpretations but on quibbles over technical details, while no one was able to view the full data and calculations of the original. For anyone assuming that a prestigious journal’s review process – or even a paper’s reputation – is a sufficient stamp of reliability on a paper, this is a wake-up call.

Some principles we use in interpreting academic research

  • Never put too much weight on a single study. If nothing else, the issue of publication bias makes this an important guideline. (On this note, the 2009 Roodman and Morduch paper was rejected for publication; its sole peer reviewer was an author of the original paper that Roodman and Morduch were questioning.)
  • Strive to understand the details of a study before counting it as evidence. Many “headline claims” in studies rely on heavy doses of assumption and extrapolation. This is more true for some studies than for others.
  • If a study’s assumptions, extrapolations and calculations are too complex to be easily understood, this is a strike against the study. Complexity leaves more room for errors and judgment calls, and means it’s less likely that meaningful critiques have had the chance to emerge. Note that before the 2009 response to the study discussed here was ever published, GiveWell took it with a grain of salt due to its complexity (see quote above). Randomized controlled trials tend to be relatively easy to understand; this is a point in their favor.
  • If a study does not disclose the full details of its data and calculations, this is another strike against it – and this phenomenon is more common than one might think.
  • Context is key. We often see charities or their supporters citing a single study as “proof” of a strong statement (about, for example, the effectiveness of a program). We try not to do this – we generally create broad overviews of the evidence on a given topic and source our statements to these.

While a basic fact can be researched, verified and cited quickly, interpreting an impact study with appropriate care takes – in our view – concentrated time and effort and plenty of judgment calls. This is part of why we’re less optimistic than many about the potential for charity research based on (a) crowdsourcing; (b) objective formulas. Instead, our strategy revolves around transparency and external review.

Comments

  1. Your 1,2 & 5 points are useful. Your points 3 & 4 reveal a naivete about research and researchers. Firstly, the findings and methods, particularly in the analysis, of MOST research are too complex for most of the general public to understand. Research articles are written for other researchers, not some generic reader. Apart from that, putting full step-by-step details of any analysis into an article is generally unfeasible from a length perspective. Secondly, there are immense issues with releasing data sets. Why? Because the researcher has spent considerable time and money collecting them and they can be used for multiple analyses, multiple purposes. By publishing them you’re thereby permitting others to analyse them not just for the current studies, but to draw unrelated new findings and inferences from them. Why should I essentially “give” that away? If someone doubts my analysis of a data set they can fly to where I am and look over my shoulder as I walk them through it. Otherwise, they’re welcome to collect their own data and make a counter-argument to the one I made. Having said all that, I think you make some interesting points that contribute to the discussion of the issues in publicizing research.

  2. Holden,

    I think this is extremely well done and I have only minor points to add:

    * Good news: The Journal of Political Economy, which in 1998 published the original Pitt & Khandker paper at the center of the controversy, now *does* require sharing of data and code. The policy does not apply retroactively. At any rate, I believe it is still correct that most journals do not require such sharing. At this point, only the most prestigious do.

    * I would strongly second your point about the simplicity of randomized trials being a strength. Simplicity makes them both more credible and easier to understand.

    * I can’t resist adding that while our paper was rejected by JPE, this is like having a letter to the editor rejected by the New York Times. We are optimistic about eventual publication in another respected journal.

    * While I feel bad about the mistake we (mostly I) made in our working paper, I think it’s worth pointing out that we did share all of our data and code, which helped Mark Pitt find the mistake and illustrated how code+data sharing serve the public good.

    * Michael Bowen, I think you make a good point about why researchers will resist sharing data when they have worked so hard to collect it and want to mine it for all the papers they can to survive in a publish-or-perish world. But from the point of view of GiveWell and all donors, public and private, Holden’s argument stands, and is not naive. If the research methods and data are secret, they should be less trusted by people who make decisions about giving. Whatever the reasons, the sad fact is that research that is often justified as being “policy-relevant” is much less relevant than it could be because methods are kept secret.

  3. Michael,

    I’m long past my days as an academic, but I do have a Ph.D. and certainly have done my time in academia. I think you’re completely off-base in your criticism of point 3. I don’t see how that reflects naivete. It seems perfectly reasonable for non-academics to conclude that they should not rely heavily on a single study’s conclusions, if they cannot easily understand the methodologies of the study and analysis. That doesn’t mean non-experts can never rely on conclusions that they don’t fully understand. But it means they probably want to see multiple independent studies, all reaching the same conclusion, before they give significant weight to that conclusion.

    Not only does that not seem a naive position to me, it seems to me that it would be naive to do otherwise.

    As for your point on 4, when I was in grad school ~20 years ago, academia was still in what I considered to be the early transition to an institutionalized and systematically proprietary approach to research. There had always been many individuals who acted in a highly proprietary fashion, but this had not typically been viewed as a desirable or respectable behavior. I fear that by now, we are in much later stages of the transition, and your defense of treating data as wholly proprietary, seemingly indefinitely, certainly reflects that.

    The point of academic research is supposed to be, first and foremost, to advance knowledge. While it is certainly understandable that researchers should want to reserve the first opportunity for analysis to themselves, how long do you suppose that legitimate interest outweighs the societal interest in being able to mine the data?

    In this case, it sounds like all the data is at least a half-dozen years old, and much of it may be more than a dozen years old. And how was the research funded, in the first place? Unless things have changed dramatically in the last 20 years — and I can’t imagine they have — the vast majority of research is funded via grants, and the vast majority of grants are publicly funded in one way or another. Even private grants likely enjoy significant tax subsidies. (And certainly, in a domain such as microfinance, there can’t be a lot of corporate funding of research.)

    So how long should the original researchers properly reserve their opportunity to exclusively analyze the data, and deny others the opportunity both to validate their conclusions, and to potentially extract other findings — findings that the original researchers likely will never extract, and in many cases may never even attempt to extract?

    Should the original researchers forever lay sole claim to the data, merely because they might someday decide to go back and look for other findings, or merely because they would regret their own oversight if somebody else — perhaps even a rival — achieves renown using their data?

    I don’t doubt that academics have acted in such a proprietary fashion, to some degree, forever. And I don’t doubt that it’s even more common today, with academia having largely institutionalized that mindset (with modern changes like institutions laying claim to all IP developed by either faculty or students).

    But I do doubt that such behavior ought to be defended as a virtue, or really even as something that should be considered acceptable. And certainly I doubt that people who criticize such ought to be deemed naive.

  4. In the first instance you are conflating two issues; data literacy (and literacy about science methodologies) versus drawing conclusions from single studies. The media is particularly bad at drawing broad implications from single studies…and consequently so is the general public because much of their understanding of how to make sense of science implicitly derives from the media. I don’t disagree with you at all that doing such is a bad practice, I think it is too. The other issue tho’ is data literacy, and in general the public understanding of research methodology and data literacy is quite poor….far too insufficient to understand the methods and data in most science studies (not to mention to understand the nuances of the conclusions even). I have a PhD too, and part of my research involves examining data literacy and news media and their representations of science, so I write with some knowledge on those two issues. I think you are not well understanding how little the general public understands about data/science and how difficult it is to understand the nuances of work in another domain. An example of this is the recent kerfuffle within the climate change community when emails from the CRU were released. The researchers talked about using a data “trick”, which I immediately took to mean they were engaging in a mathematical transformation of the data to enhance analysis. However their use of that word was widely broadcast as suggesting they were doing something inappropriate in the analysis and therefore the analysis was wrong. They weren’t doing anything deceptive at all, engaging in data transformations is a common practice in science, but one would usually only know that if one had been an insider (such as I, and perhaps you, have been) and the impact on the general public wrt their belief about global warming was profoundly negative. Outsiders do not well understand, and often completely misinterpret, insider practices. 
    Now, I’d agree many things should be done to improve public science/data literacy, but expecting the public to understand field-specific practices is a bit of a reach.

    You raise good questions about data ownership, a topic I’ve played with in written notes to myself but have yet to write about conceptually. It’s a big issue in research, particularly in collaborative studies. Myself, I have data sets which are 20 years old, and which with new studies have gained a previously unknown relevance that has then resulted in new insights, new writing and subsequently new publications for me. They only exist because of the questions I asked, I pondered, and represent information I collected and went through the effort of keeping. Why, I would ask you, should someone else be able to use them for their own writing? I have had far too many papers written from data sets I collected before I had a chance to write those papers, or validate the data with other studies, because I foolishly shared data sets in the past. I put a lot of effort into collecting those data sets, and didn’t receive any form of recognition for it whatsoever. My career, my ability to get grants, is based on what I publish. There’s at least a half a dozen papers I can’t write now, which were planned for, because of that sort of tomfoolery. Basically, if I have to start releasing data sets I’ll just stop doing that type of research, because the payoff for me isn’t just “immediate” from a data set, it’s also in their long-term utility. Releasing them to everyone else means that the long-term utility purposes are lost so why should I go to that effort? As I said, anyone can fly to where I am, have me go over the data sets for them, and see how I do my analysis….and even THAT takes effort that I won’t be paid to do. Formatting data sets in a fashion that others can make sense of them also takes time, and time is money….that’s another issue wrt the release of data sets.
    Basically, if someone disagrees with some of my work they’re welcome to collect their own data and refute it….you’ve yet to give a compelling reason why I should release my data sets for others to use for their own purposes.

    Basically, what you’re arguing is that there should be some researchers that gain the credibility to get grants to collect their own data and write papers, but that there could be other researchers that never have to collect their own data, never have to go to the effort of writing grants and competing for funds, who could just write papers on the backs of others. Because whether you realize it or not, that’s exactly what would happen.

    Finally, much of my current data (I started in science, participated in research in that domain, and then gradually transitioned to social sciences) involves information from individuals. When I write my papers I ensure that none of the information available could be used to identify an individual, but there’s no easy way whatsoever to transform my data sets so that it wouldn’t be possible to identify an individual if someone went to the effort to do so…at least no easy way to guarantee it. I could easily see with current data mining techniques and how sophisticated they are that that wouldn’t hold true for collections of economic data. The responsibility of any researcher is to ensure that there is no possible way to identify the participants in their study. If researchers go overboard on that front then it is hard to criticize them. Thus, I would expect any researcher who does any work with humans whatsoever would be quite hesitant to release their data in any way, shape or form, because one never knows when publicly available data combined with other data sets and sophisticated analyses couldn’t be used to identify individuals.

    Nice chatting with you.

    mb

  5. Michael,

    The context of Holden’s comments was principles that GiveWell uses in considering academic research. These are well-educated individuals who have, for a non-trivial amount of time, made the evaluation of evidence of effectiveness their full-time profession. I don’t think it’s appropriate to interpret those comments as if they were tantamount to offering advice to the general public.

    (In fact, even if it were interpreted as advice to the readers of the GiveWell blog — rather than as the explanation of GiveWell’s methods that it explicitly purports to be — I still don’t think it would be appropriately treated as if it were advice to the general public, because I don’t think the readership here is at all representative of the general public.)

    We’re pretty far in the weeds, at this point, from what GiveWell is about…but, as for data publication, you seem to be suggesting that mandating publication of original data sets would eliminate almost all incentives to collect such data to begin with, and all but eliminate such research efforts.

    I beg to differ, and I don’t believe that you can provide evidence to support your belief in this case.

    Your argument is essentially identical to a wide variety of arguments made respecting various proprietary rights, or even basic economics. It’s quite common for companies (and individuals) to make exactly the same argument respecting intellectual property rights, regulatory policy, and taxes, for example.

    In many, if not most of these cases, the social goods under discussion are, in fact, highly inelastic — and certainly, much, much less elastic than the people making these arguments suggest.

    For example, if we eliminated the patent system wholesale, there are some domains where this would likely have a very significant impact (areas involving very high investments, like drug development), but there are many more areas where it would likely have almost no impact, or where, when viewed in a broader context, it would be likely to have a net positive impact (e.g., software algorithms, an area where the patent system very likely creates a significant net drag on innovation).

    I expect that requiring publication of data sets is no different, and depending on the specific areas, and the details of methodology adopted, would be likely to produce a net increase in research output.

    For example, you suggest that researchers would be divided into groups solely doing research involving original data collection, and groups solely doing research involving others’ data collection. That seems quite unlikely to me.

    Because of the relative absence of data publication requirements today, you view data publication as a net loss of publication opportunities for yourself. However, if everybody were publishing their data, you very likely would find many opportunities to publish based on others’ data, as an offset to lost opportunities with respect to your own data.

    (So while there could be a group of people who only conduct research on others’ data, it seems very unlikely that all others would only conduct research involving data they gathered themselves.)

    Furthermore, it’s reasonable to expect that with both the availability of data to many more researchers, and the competition created by such, more and higher quality analyses and conclusions would be extracted from that data, creating a substantial offset to any reduction in the amount of data collection research to begin with.

    What’s more, if a particular data publication requirement created too big a disincentive to investing in gathering data, simple adjustments could be made to address such — like creating some period of exclusivity, as I mentioned previously, before requiring publication. Perhaps the optimal period is 1 year, perhaps it’s 5 years…but it almost certainly doesn’t need to be 20 years (or an infinite period) in order to create a sufficient incentive for researchers to conduct original data collection. And the presence of a limited period of exclusivity is likely to encourage researchers to maximize use of their own data before they lose exclusivity, again producing a net increase in the social good coming from that data.

    In fact, I see that David Roodman noted above that some of the most prestigious journals are presently requiring publication of the original data used to produce the papers they publish — and even with the availability of many alternative avenues of publication, apparently the mere prestige of one forum vs. another is sufficient to entice many researchers to publish their underlying data, rather than keeping it as a proprietary asset.

    This certainly does not support the contention that mandatory data publication would undermine incentives to conduct original data collection.

    And while I think that concerns about avoiding publication of personally-identifiable data are important to address, I don’t believe it is anywhere close to an insurmountable problem. I work in an industry where there are both strong regulatory requirements to protect personally-identifiable data, and strong marketplace requirements to protect such, and I don’t see it as being systematically difficult to eliminate personally-identifiable data from data sets.

    In various social science research contexts, perhaps there will be occasions where such is challenging, but my strong expectation would be that such would be the exception, rather than the rule — and I expect that the issue could be largely addressed, on a going forward basis, through intelligent construction of the data collection methodology used for any given research project, to avoid collecting data in a fashion that will subsequently be hard to scrub for personally-identifiable information, while still preserving the intended research value.

  6. J.S.

    You suggest that the conversation started in the original article is somewhat specific to GiveWell. There’s an argument otherwise, based on this technological age. I got here on the basis of a twitter feed that was commenting on why much research should be ignored linking to a blog referencing the five “principles” on this page in a general fashion as pertaining to all research. So, lots of the “general public” could find their way to the blog to read those principles and then here to read these arguments. I’m engaging the ideas here because altho’ the intent might have been for it to relate to issues specific to GiveWell, the conversation had a broader audience than you might expect, and so I engaged it with consideration of that.

    Your argument about some journals requiring data publication now is a specious one. You do not know any longer who is not publishing, or who is selectively publishing, in those journals so they do not have to publish their data sets. As a researcher, unless there was a specifically limited data set I could publish without the issues I described above, I would publish in other places.

    Your comment on the data set being proprietary also warrants comment. In many cases a data set is no different than a novel….it merely exists because of the individual creativity and curiosity of an individual. Given that the author of a novel has proprietary rights (in the US, for at least 75 years), why shouldn’t a data set be the same? A patent is held for even longer. Your comment that data could be held under some license for a period of time, and that 20 years is too long, doesn’t well understand the ways in which a researcher can use their data over a fifty-year career. And that data set is *my* personal creation, only existing because of a myriad of factors specific to my skills, my interest, my connections even…you’ve yet to make a compelling argument about why I should release it to others. That others might want it isn’t a good reason. And that my work might be doubted without release of it is surely a decision that I, not others, am responsible for making. I’m actually remarkably generous…I’ve shared data sets I’ve worked on with others collaboratively and cooperatively to generate new questions, new work, new publications under their name with mine in a secondary role. But that was a choice I got to make.

    As for eliminating information that could lead to personal identification…that may hold true for some data sets more than others. However I’ll also maintain that you cannot *guarantee* that someone else couldn’t engage in an analysis using your data set, or other information, to identify individuals (or even small collections of individuals, which is equally problematic) in ways you hadn’t thought of. And that’s the issue…you might think you’ve blinded information, but that belief is guided by your personal understanding of analysis. Someone else’s might well exceed your own and glean information from the data set you didn’t think was available. Therein lies the problem.

    Perhaps this discussion is happening because economists have a different understanding of data than other researchers…such might be the case. But I know from my work in the social and natural sciences that both groups contain many individuals who would be loath to publish their data sets…in fact, when it’s been discussed I’ve never met anyone that supported the idea, for many of the reasons cited above. Research literature in the social sciences, studying scientists, even reports that within organizations, when the data literally belonged to the organization (not the individuals), the data hasn’t been shared within research groups with colleagues IN the organization. If there’s massive resistance to sharing data in that setting, one can well understand how the idea of sharing it “outside” of that setting would be greeted.

    Finally, with respect to secondary analysis; I know individuals who have built their entire careers on analyzing data collected by other people, other organizations, who have never collected the data in any of their research publications themselves because they rely on publicly available data sets collected by others. Those individuals already exist, it’s not much of a stretch to think their numbers might increase if there were more data sets available. You see, you don’t need grants to publish that sort of work…your only “cost” is sitting at your own computer in your office. It’s much easier to not have to get grants, to not have to construct and present your arguments for others on the path to getting money to conduct research. That there are people that spend their lives on secondary analysis hardly surprises me, but I don’t feel much like helping them by giving them access to my data sets.

    Cheers,

    mb

  7. Michael,

    You criticized Holden for being naive in stating a principle that clearly wasn’t naive in the context he made it. That his comments may have ended up being more widely disseminated than I (and perhaps even he) realized is irrelevant. That some people might (or even did) apply them in a different context, where you would argue that the application is naive is irrelevant. He’s not obligated to edit his comments for how others may apply or misapply them, and that he didn’t edit his comments for such doesn’t render his original point naive.

    The point about the most prestigious journals requiring data set publication is hardly specious. That they are successfully doing such tells us an awful lot. Surely they would reverse their policy if they found that they were being starved of high quality submissions, as a result. The question of whether some researchers choose to avoid them, given the choice of other avenues, is not relevant, because I don’t deny that many researchers may prefer to keep their data sets proprietary. But the simple fact that ANY researcher capable of meeting the standards of those prestigious journals actually chooses to publish with them, rather than in a less prestigious forum that does not require data set publication, clearly indicates much, much less resistance than you predict. If high quality research is already voluntarily subjected to such data publication requirements, clearly much high quality research can be expected to continue, even in the face of a hypothetical, across-the-board data publication requirement.

    As for proprietary rights in data sets and intellectual property, you make a number of factual and analytical errors. First, some factual points. Copyright in the US currently lasts for life of the author plus 70 years, or 95 years in the case of anonymous (e.g., corporate) works. (In my view — and that of many others — this is much, much too long. See http://www.thepublicdomain.org/ if you’re interested in delving into this topic in detail.) If I recall correctly, the original copyright statute provided for a 14 year initial copyright term, renewable once for a second 14 year term (only if the author was still alive at the time of renewal, I believe), so maximum copyright protection was 28 years. The term has steadily been extended over time. (More on this below.)

    Patents have much shorter terms than copyright terms – without getting into all the nitty-gritty details, issued patents expire roughly 20 years after the initial filing of a patent application.

    Regardless of these factual errors, there are fundamental differences between patented and copyrighted works, and proprietary data sets.

    First, a data set is not comparable to a copyrighted work such as a novel (or a patented invention), under intellectual property law. The kind of protection that would apply to a proprietary data set is trade secret protection. Basically, so long as you are diligent in protecting its secrecy, you have some rights (against theft of the secrets).

    The intellectual property basis for copyrights and patents is completely different. These draw their legal basis from Article I, Section 8 of the US Constitution: “To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries.” The point is that rights are granted to authors and inventors, for limited times, to promote the advancement of knowledge.

    Trade secret protection is the diametric opposite of advancing knowledge, and so is largely antithetical to what academia is supposed to be about.

    A second important difference between a data set emanating from academic research and a novel is that academic research (even at private institutions) is largely funded by public funding sources. Whether directly via a public grant, or indirectly via university tuition (the vast majority of which is publicly subsidized via either direct or tax subsidies), public subsidies directly to institutions, tax subsidies to private donors, etc., a very high percentage of academic funding derives from public sources and subsidies.

    Accordingly, the products of academia are properly viewed much differently than an individual author writing a novel.

    In fact, as patents emanating from academia have increased over the last 30 years, many have argued that inventions derived from publicly-funded research should not be eligible for patents. I think that’s a reasonable argument to have. Still, with patents (even with the serious problems the system currently has) at least you have the baseline social benefits of a limited duration, and the requirement that the invention actually be published, in order to advance the knowledge of others/society. Neither of these social goods pertains to a case where data sets are treated as wholly proprietary to a researcher.

    So let me turn the question around: why should an academic researcher be allowed to benefit from considerable (or total) public funding, and yet retain unlimited proprietary ownership of the products of that funding, to the detriment of the public?

    As to the argument that there is great resistance to sharing research data – I don’t think that’s a particularly controversial statement. But it’s also not at all relevant to the discussion. There are lots of things people and organizations are resistant to, or to flip it around lots of things that they want. People generally want to get the best deal for themselves that they can. That doesn’t mean that if they don’t get what they want, most or all of them will take their ball and go home. (You can’t always get what you want…it’s generally sufficient to merely get what you need. ;)

    Disney doesn’t want Steamboat Willie (the first version of Mickey Mouse) to go out of copyright. So every twenty years, when Steamboat Willie is about to enter the public domain, Disney lobbies Congress to extend copyright terms…and has been successful enough to get such, to this point.

    But does that mean that if Congress ever refused, Disney (and others) would refuse to continue making copyrighted works? Of course not.

    Sure, if you eliminated copyright protection altogether, it might be a different story – then you’d have a case where Disney would have no opportunity even to recoup their costs in producing a copyrighted work, let alone turn a profit.

    But if we returned copyright terms to their original maximum of 28 years, would Disney suddenly refuse to produce new creative works? Of course not.

    They would surely be lobbying hard against such a change. They would surely be arguing that the world would effectively come to an end (that they would have much less incentive to produce creative works), because they would prefer to retain the rights longer. But the fact is, revenues from 28+ years out, for a 28+ year-old work no less, are insignificant when discounted back to today, in any business model, so an objective observer would reasonably expect essentially no impact on Disney’s output. (This was my previous point about outputs often being quite inelastic, despite protestations to the contrary.)

    The same can reasonably be expected of researchers. Of course they will tell us that the world will come to an end if they are forced to release their data to others, because given their druthers, they would prefer to keep their ownership interest as strong as possible. But the reality is that, so long as they have some reasonable opportunity to benefit from their own hard work, they will continue to have incentives to do that work.

    Merely having exclusive access to the data until first publication might well be sufficient. Or having access for some reasonable period, say in the range of 1 to 5 years post collection might provide sufficient incentive. But sorry, when you talk about needing 20 or even 50 years, you lose me. That sounds no different than a Disney arguing that 95 years is not a long enough term for copyright, and terms really NEED to be extended to 115 years.

    In fact, the evidence you cite regarding strong resistance to sharing data is actually a strong argument that some kind of intervention is necessary, in order to promote the efficient exploitation of research data. For widespread, strong resistance suggests that if there isn’t some externally-imposed limit, many academic researchers will seek to maximize their own personal benefit, at the (considerable) expense of society. Given that society is largely footing the bill for that research, however, this doesn’t seem an even remotely reasonable outcome.

    You seem to view secondary research as somehow detrimental and parasitic. I do not. I view it as helping to maximize the social value extracted from prior work. In this regard it is highly efficient. It does not surprise me that people who specialize only in that exist, but I certainly don’t consider such problematic.

    Btw, I am not an economist. (My Ph.D. is in computer science. For the last decade+, I have worked in the cable/telecom industry.)

    Also, this discussion having become both quite extensive and quite far into the weeds relative to GiveWell, philanthropy and the original blog posting, I would suggest we probably ought to find someplace else to carry on the discussion, if we’re going to continue such.

  8. There are not many attractions to working as a public academic. One of them is you have the opportunity to have control over your work, your research, and its outcomes, over extended periods of time. Many academics publish significant work integrating across many of their studies after extended periods of time, in the latter parts of their career. If you take away one of the few tangible benefits of working in the public sphere, then one may as well work for a company or a think tank. One certainly doesn’t work in the public sphere for salary, benefits or retirement opportunities…so control of your data over the long-term for long-term purposes is one of the few benefits I see in the job.

    Basically we have different perspectives on what data is. I see it as a personally creative act that involves personal investment and consequently is a personal product. You see it basically as a non-personal commodity. Also, being American (which I am not, and hence the flaws in my cited numeric levels of protection offered in the US), you have a very market-driven view on most parts of your culture, as Americans tend to have….that you see data as a commodity is hardly surprising. Finally, you come from a discipline which is not really data-driven, in the sense that generative disciplines are, and therefore have a different view than those whose lives involve generating data. Thus, I see little opportunity for congruence on this issue.

    You might find it interesting to know, btw, that all of the software generated under my public grants has been released as open-source. I was heavily encouraged to copyright it and sell it, but made the argument that the public paid for its development and consequently it should be available to the public gratis. In my view, data is not a “product” in the same way that software is.

    All the best.

  9. Michael,

    I expect you’re correct that it is unlikely we’ll achieve agreement on this, but I think you’re incorrect in some of your characterizations of my perspective.

    I don’t view data as merely a commodity. I recognize it can be, and often is treated as, a proprietary asset. But in the same sense that you viewed your software developed under public funding as an asset that was not properly treated as (entirely) proprietary, I view data collected under public funding as also not properly treated as (entirely) proprietary.

    I certainly do have an American view of intellectual property. (And I wasn’t aware that we were talking across international boundaries. ;) ) However, I’ll note that worldwide intellectual property treatment is largely unified. To the best of my knowledge, most of the world has the same basic rules for copyright (although not all countries have yet adopted the most recent 20 year term extension adopted in the US), for patents (the US has mostly reconciled its terms to the world standard as of 1995), and respecting trade secret protections.

    The rest of what I express I would deny as representing a specifically “American” viewpoint. (I certainly would not expect all, or even a majority of Americans to think similarly.) I would also reject the characterization of a market bias. Rather, I would say my thoughts reflect a strong bias I hold in favor of economic efficiency. And certainly, it reflects the biases (or understanding, as the case may be) I have based on my own study of human psychology and behavioral economics.

    Finally, I would agree with your basic assertion that professional autonomy is a key characteristic that attracts people to academia. However, I would disagree with the implication that longtime/lifetime control of data (or other research products, such as copyrightable works, and patentable inventions) is the, or even an, essential aspect of that professional autonomy. I think that if the rules of the game for academic work imposed reasonable limitations on the proprietary treatment of research data (and other research products), there would still be an ample supply of highly-qualified individuals wanting to work in academia. (I can’t speak for the rest of the world, but for the record, in the US, academic research doesn’t pay badly at all, and for upper echelon researchers, it pays quite well. Certainly, there’s probably no better paying job one can get that offers anywhere close to the level of autonomy that academia does.)

    Regards,

    Jonathan