Open Science and Open Source
Last week at the 32nd Chaos Communication Congress in Hamburg I went to the Open Science workshop. For two hours I participated in a lively debate among a very diverse group of people interested in this topic, from computer security researchers and high-energy physicists to social scientists and psychologists. We talked and shared ideas about publishing and open access to scientific papers, modern ways of collaborating in academia, viable alternatives to the impact factor and so on. It was interesting to hear how things work in other fields of research and how differences in culture affect how open and accessible their work ends up being.
Image by Nicolas Wöhrl
There are some assorted notes from the workshop on the OKFN site. On the topic of publishing source code, I wanted to share some thoughts I had during my recent return to the study of cyclostationary signal theory. Since it was kind of an ad-hoc idea at the time, I probably did not express myself very clearly, so here's a more detailed account of what I tried to say, with some backstory.
I was previously frustrated about how papers on that topic are very vague about exact implementation details and do not publish their source code. It later turned out that, in fact, some of them do - kind of. I found a number of papers that more or less vaguely referenced autofam.m as their way of estimating the spectral correlation density, which is the key element I was interested in. While the papers themselves did not provide links to the contents of this file, Google does find a few unattributed copies floating around the Internet.
I went through this file with a fine-tooth comb, since all the theoretical deliberations I had read about this method (FAM stands for FFT Accumulation Method, by the way) still left me with several questions about how it is actually implemented in practice. I found the missing pieces of the puzzle I was looking for, but I also found what I'm pretty certain are bugs in the code: something that looks like an off-by-one error due to Matlab's unusual 1-based indexing, and a suspicious for-loop that attempts to write several different values into the same matrix element. I don't know how much these problems would affect the results the papers were publishing or, of course, whether the authors were even using the exact same code I was looking at. The question of bugs does, however, suggest an interesting problem with sharing source code in the scientific context.
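To illustrate the second issue, here is a small hypothetical sketch in Python - not the actual autofam.m code - of the final step of a FAM-style estimator, where point estimates indexed by channel pair and second-FFT bin are mapped onto a fixed (cycle frequency, frequency) grid. The grid sizes and the rounding rule are made up for the example; the point is only that several input indices can land in the same output cell, and a plain assignment then silently keeps just the last value.

```python
import numpy as np

# Hypothetical illustration only - not the code from autofam.m. It mimics the
# final mapping step of a FAM-style estimator: estimates indexed by channel
# pair (k1, k2) and second-FFT bin q are placed on a (cycle frequency,
# frequency) output grid.
rng = np.random.default_rng(0)

Np, P = 8, 4                                # channelizer bins, second FFT length
Sxx = rng.standard_normal((Np, Np, P))      # stand-in for the pair products

scd = np.zeros((4 * Np, Np))                # (alpha, f) output grid
hits = np.zeros_like(scd)                   # how many estimates land in each cell

for k1 in range(Np):
    for k2 in range(Np):
        for q in range(P):
            f = (k1 + k2) // 2                                   # frequency bin
            alpha = round((k1 - k2) + (q - P / 2) / P) + 2 * Np  # cycle frequency bin
            hits[alpha, f] += 1
            # Different (k1, k2, q) triples can round to the same (alpha, f)
            # cell. A plain assignment keeps only the last estimate that lands
            # there; averaging or otherwise combining the colliding estimates
            # is probably what one actually wants.
            scd[alpha, f] = Sxx[k1, k2, q]

print("cells written more than once:", int((hits > 1).sum()))
```

Running the sketch shows that a fair number of grid cells get written more than once, which is exactly the kind of behaviour that is easy to miss when the resulting plots still look plausible.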
Can code reuse cause implementation bugs to remain unseen for longer? Imagine that several seemingly independent peer-reviewed publications showed a phenomenon that was in fact due to an implementation bug in the code they shared. It seems such a discovery might easily become generally accepted as correct, especially in a field that is mostly focused on computer simulations.
In my case, it's obvious that implementing this method from its mathematical description is not trivial, and I think it's not an unreasonable assumption that the authors of these papers took an existing implementation without thoroughly checking the code they were using (a code comment sends kind of a mixed signal regarding that). If the source were not accessible and each author had to re-implement the method from scratch, the chances of them agreeing on some anomalous result would be much lower.
What I learned at the open science workshop is that the code is in fact sometimes hidden on purpose. The high-energy physics guys (unfortunately I didn't catch their names) mentioned that different groups of students are often independently working on the same research task to avoid exactly this kind of problem. I often find it impossible to write proper tests for my own code that does numerical calculations, and comparing multiple independent implementations sounds like a good way to gain more trust in the correctness of the results - provided you have enough PhD students. They also added that PhD students do sometimes talk with each other, which casts some doubt on the effectiveness of the whole thing.
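As a toy illustration of that idea, here is a sketch of what such a cross-check can look like in practice. The example is deliberately much simpler than the spectral correlation density: a naive direct DFT (standing in for "my code") is checked against numpy's FFT, which plays the role of the second, independently written implementation.

```python
import numpy as np

def naive_dft(x):
    """Direct O(N^2) evaluation of the DFT from its defining sum."""
    n = np.arange(len(x))
    k = n.reshape(-1, 1)
    return np.exp(-2j * np.pi * k * n / len(x)) @ x

def test_against_independent_implementation():
    # Compare the two implementations on a batch of random inputs. Agreement
    # doesn't prove either one is correct, but a disagreement is a very cheap
    # way to catch a bug in at least one of them.
    rng = np.random.default_rng(42)
    for _ in range(20):
        x = rng.standard_normal(64) + 1j * rng.standard_normal(64)
        assert np.allclose(naive_dft(x), np.fft.fft(x))

test_against_independent_implementation()
```

Of course, two implementations can still agree on a shared misunderstanding of the maths, which is why the independence of the people writing them matters as much as the comparison itself.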
At first glance, this sounds like an argument against revealing the code. However, apart from inconsistencies with previously published results, problems in implementations are likely to remain unseen forever if there is no source to inspect. These days nobody is motivated to publish the results of a re-implemented replica of an already published experiment anyway. If the source is published, there is at least a chance that someone will discover bugs in it by inspection. There is also no doubt that there are many other benefits to openly sharing code. If nothing else, a few lines of code can explain a detail that is lost in a hundred pages of text, as I learned in my study of FAM.
In the end, it seems the situation where you have researchers (or PhD students) secretly swapping USB drives around the water cooler and informally sharing code is the worst of both options. I don't really expect reviewers to look for off-by-one errors in Matlab code (although some might), but publishing code together with journal articles at least gives them that option. It also clearly shows which publications share a specific algorithm implementation, which should make it easier to question and review any suspicious results they have in common.
Everyone makes mistakes, even the best researchers, as was clearly demonstrated by the Reinhart and Rogoff controversy a couple of years ago (http://www.newyorker.com/news/john-cassidy/the-reinhart-and-rogoff-controversy-a-summing-up), so I can't think of a persuasive reason why code should ever be hidden.
Maybe it shouldn't be accepted without tests, but then again I doubt those would be that useful, judging by my (limited) experience with the quality of code written in academia.
I wonder how much of it could be tested by handing out re-implementation assignments to students in their last year before graduation? :)