02 May 2013

Totally Open Science: A Proposal for a New Type of Preregistration


Methodological rigor has been at the center of a growing debate in the behavioral and brain sciences. A big part of the problem is that, thus far, we've largely published only results. Preregistration forces us to publish methods and hypotheses ahead of time, which helps discourage p-value hacking, post-hoc storytelling, and the "file drawer" approach to negative or unwanted results. Even prominent journals like Cortex are getting in on preregistration with a publication guarantee, effectively focusing peer review on methods and hypotheses rather than on "interesting" results. Some journals, including Cortex in its new initiative, also require data sharing via public hosting services like FigShare.


I want to go one step further and suggest that it's time to share data, methods, and process. I want every box in the research workflow covered, and moreover, I want to be able to see how we get from box to box.

Preregistration is great and should help us avoid a lot of post-hoc tomfoolery. But it is difficult to apply to certain types of exploratory or simulation-based research. While the reporting of incidental results is still allowed under certain forms of preregistration (including the Cortex model), purely exploratory studies, including iterative simulation-development studies, don't fit well into preregistration. The Neuroskeptic agrees that such results don't fit the preregistration model per se, but argues that they should be marked as exploratory (perhaps implicitly, via their missing registration) so that it's clear that any interesting patterns could be the result of careful selection:
We all know that any unregistered finding might be a picked cherry from a rich fruit basket.
By opening up process, we can still learn a lot about the fruit left in the basket.

The following is my proposal for a variant of preregistration compatible with exploratory and simulation-based research. It is based on open-access and open-source principles and is designed to discourage the post-hoc practices that lead to unreliable results. The key idea is transparency at every step: making the context of an experiment and an analysis available and apparent not only encourages honesty in individual researchers' claims but also lets us move away from the binary world of "significance". This is just an initial proposal, so I won't go into all the details, and I am aware that there are a few kinks to work out.

Beyond Preregistration: Full Logging


My basic proposal is this: public, append-only tracking of the research process and its iterations via distributed version control systems like Mercurial and Git. In essence, this is a form of extensive, semi-automated logging and snapshotting. For the individual researcher, this also has the nice advantage of letting you go back to older versions, compare different versions, and even track down inconsistencies between analyses.
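As a rough illustration of what the semi-automated part could look like, here is a minimal sketch (the script name and workflow are hypothetical; it assumes git is installed and the analysis directory is already a repository) that records a snapshot after every analysis run:

```python
#!/usr/bin/env python
"""snapshot.py -- hypothetical helper: record the current state of an
analysis directory as a commit. Assumes git is installed and the
directory is already a git repository."""

import subprocess
import sys
from datetime import datetime


def snapshot(message=None):
    """Stage everything and commit; do nothing if the tree is unchanged."""
    if message is None:
        message = "analysis snapshot " + datetime.now().isoformat()
    subprocess.check_call(["git", "add", "--all"])
    # 'git status --porcelain' prints nothing when there is nothing to commit
    if subprocess.check_output(["git", "status", "--porcelain"]).strip():
        subprocess.check_call(["git", "commit", "-m", message])


if __name__ == "__main__":
    snapshot(" ".join(sys.argv[1:]) or None)
```

Run after each analysis step (e.g. `python snapshot.py "reran model with 4 free parameters"`), something like this builds exactly the kind of fine-grained, timestamped log I have in mind, with essentially no extra effort.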



The initial entry in the log should clearly state whether the study is confirmatory or exploratory. Whether a study is simulation-based is orthogonal to the confirmatory/exploratory distinction: if you're just testing whether a new model fits the data reasonably well, then you should define ahead of time what you mean by "reasonably well" and test that as you would any hypothesis in a real-world experimental investigation. If you're trying to develop a model or simulation in the first place and just want to see how good you can make it for the data at hand, then that is exploratory research and should be marked as such. Texas sharpshooting is just as problematic, if not more so, in simulation-based research as in real-world research.
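To make the confirmatory case concrete, here is a minimal sketch of what defining "reasonably well" ahead of time could look like (the threshold value and function names are illustrative assumptions, not a standard); the point is that the criterion lives in the initial commit, before any fitting happens:

```python
import numpy as np

# Criterion fixed in the initial commit, *before* the model is run:
# the simulation "fits reasonably well" if its RMSE against the observed
# data stays below a threshold justified from prior literature.
RMSE_THRESHOLD = 0.15  # illustrative value


def rmse(observed, simulated):
    observed, simulated = np.asarray(observed), np.asarray(simulated)
    return float(np.sqrt(np.mean((observed - simulated) ** 2)))


def fits_reasonably_well(observed, simulated, threshold=RMSE_THRESHOLD):
    """Confirmatory check: because the threshold is preregistered,
    it cannot be quietly lowered after seeing the fit."""
    return rmse(observed, simulated) < threshold
```

Lowering the threshold later would show up as a later commit, in plain view.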

This should then dovetail nicely into a Frontiers-type publishing model with iterative, interactive review. The review process would simply become part of the log.

Context and Curiosity


A fundamental problem with our statistics is that we think in binary, "significant" or "not significant", and often completely ignore the context and assumptions of statistical tests. Even xkcd has touched upon many of the common misunderstandings of statistics. Many issues arise from post-hoc thinking, which is exactly what preregistration tries to prevent, because post-hoc thinking violates statistical assumptions. Odds are there are some interesting patterns in your data that occurred purely by chance. If you test them after you've already seen that they're there, the test is circular. If you report that you found such a pattern while doing something else, then it presents a direction for future research. But if you present that pattern as the one you were looking for all along, then you've violated the assumption of randomness that null hypothesis testing is based on.
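A quick simulation makes the point. The sketch below (the numbers are arbitrary; it assumes numpy and scipy are available) generates pure noise for two groups across many variables, then tests only the variable with the largest observed difference, i.e. the "interesting pattern" you noticed post hoc:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)


def selected_test_fpr(n_variables=20, n_subjects=30, n_experiments=2000):
    """False positive rate when you test only the most striking of
    n_variables null effects, instead of a pre-specified one."""
    hits = 0
    for _ in range(n_experiments):
        a = rng.normal(size=(n_variables, n_subjects))  # group A, pure noise
        b = rng.normal(size=(n_variables, n_subjects))  # group B, pure noise
        # Look at the data first, pick the biggest difference, *then* test it.
        best = np.argmax(np.abs(a.mean(axis=1) - b.mean(axis=1)))
        _, p = stats.ttest_ind(a[best], b[best])
        hits += p < 0.05
    return hits / n_experiments


print(selected_test_fpr())  # well above the nominal 0.05
```

With these (arbitrary) settings the "significant" rate lands far above the nominal 5% even though there is nothing to find; an honest log of how the pattern was selected is what lets a reader calibrate the result.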

By recording the little steps, we give our data the full context needed to understand and interpret them, even if they are "just" exploratory. Exploratory data come with a different context, and it is that context, not some label like "significant", that we need in order to fully evaluate a result.

Isaac Asimov supposedly once said that "The most exciting phrase to hear in science, the one that heralds new discoveries, is not 'Eureka' but 'That’s funny...'".  Even if serendipitous success is the exception and not the rule, we need a forum to get all the data we have out in the open in a way that doesn't distort its meaning.

Some Details 


The following gets a tad more technical, but should make my idea a bit more concrete. There are a lot more details that I have given serious thought to, but won't address here.  

Implementation


More precisely, I'm suggesting something like GitHub or Bitbucket, but with the key difference that history is immutable and repositories cannot be deleted (to prevent ad-hoc mutation via deletion and recreation). The preregistration component would be the initial commit, in which a README-type document would outline the plan. For confirmatory research, this README would follow the same form as a preregistration protocol. (Indeed, the initial commit could even be generated automatically from a traditional, web-based preregistration form.) For exploratory research (e.g. mining data corpora for interesting patterns as hints for planning confirmatory research), the README would describe the corpora, or the planned corpora, including the planned size of the corpus in subjects and trials (optional stopping is bad). For simulation-based research, the README would include a description of the methodology for testing goodness of fit as well as an outline of the theoretical background being implemented computationally (lowering the bar for your model post hoc is bad). Exploratory dead ends would remain apparent as unmerged branches.
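On the hosting side, part of the append-only requirement can already be enforced with stock Git: the `receive.denyDeletes` and `receive.denyNonFastForwards` settings refuse branch deletions and forced updates, and a server-side hook can do the same thing explicitly. The following is a minimal sketch of such a pre-receive hook; it relies on Git's standard old-sha/new-sha/refname input on stdin, while the error messages and policy details are my own:

```python
#!/usr/bin/env python
"""Sketch of a server-side pre-receive hook for an append-only repository:
reject ref deletions and non-fast-forward (history-rewriting) pushes."""

import subprocess
import sys

ZERO = "0" * 40  # git's null SHA-1, used to signal ref creation/deletion


def is_fast_forward(old, new):
    """True if `old` is an ancestor of `new`, i.e. nothing was rewritten."""
    return subprocess.call(["git", "merge-base", "--is-ancestor", old, new]) == 0


for line in sys.stdin:
    old, new, ref = line.split()
    if new == ZERO:
        sys.exit("rejected: deleting %s would erase part of the public record" % ref)
    if old != ZERO and not is_fast_forward(old, new):
        sys.exit("rejected: non-fast-forward push to %s would rewrite history" % ref)

sys.exit(0)
```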

As stated above, this should tie nicely into a Frontiers-type publishing model with iterative, interactive review. Publications coming from a particular experimental undertaking would have to be included in the repository (or a fork thereof if you're analyzing somebody else's data), which would make it clear when somebody has been double-dipping, as well as quickly giving an overview of all relevant publications. As part of this, all the scripts that go into generating figures should be present in the repository. This of course requires that you write a script for everything, even if it's just one line documenting exactly which command-line options you used.
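Even a trivial figure script is worth committing. A hypothetical example of what a one-figure script might look like (the file names, paths, and plot itself are made up for illustration):

```python
#!/usr/bin/env python
"""make_fig2.py -- hypothetical example: every figure gets its own script,
however small, so the exact parameters live in the repository history."""

import matplotlib
matplotlib.use("Agg")  # render straight to file, no display needed
import matplotlib.pyplot as plt
import numpy as np

# Parameters are written down here instead of typed interactively,
# so the commit history records exactly how the figure was produced.
data = np.loadtxt("results/reaction_times.csv", delimiter=",")  # assumed path

fig, ax = plt.subplots(figsize=(4, 3))
ax.hist(data, bins=30, color="grey")
ax.set_xlabel("reaction time (ms)")
ax.set_ylabel("count")
fig.savefig("figures/fig2_rt_histogram.pdf", bbox_inches="tight")
```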

The open repository also supports reproducibility via extensive documentation/logging and openness of the original data. The latter is also important for "warm-ups" to confirmatory research: getting a good preregistration protocol outlined often requires playing with some existing data beforehand to work out kinks in the design. For exploratory and simulatory work, everything is documented: you know what was tried out, what did and didn't work, and you have both the results of statistical tests and their context, all of which is required to figure out useful future directions.

A Few Obvious Concerns


There are a few obvious problems that need to be addressed up front, despite my trying to avoid too many details.
  1. The log given by the full repository is far too big to be reviewed in its entirety. This is certainly true, but a full review should rarely be necessary, and the presence of such data would both discourage dishonesty and provide a better means of tracking it down. Of course, this assumes that people publish the intermediate steps where data were falsified or distorted, but then again, the presence of large, opaque jumps in the log would also be an indicator that something odd is going on. ("Jumps" in the logical sense and not necessarily in the temporal sense; commit messages can and should provide additional context.) For more general perusal, or for tracking down a particular error source, there are many well-known methods for finding a particular entry, and many of them are already part of both Mercurial and Git!
  2. Data often come in large binary formats. I know, neither Mercurial nor Git handles large binary files well; however, the original data files (the measurements) should be immutable, which means that there will be no changes to track. Derived data files (e.g., filtered data in EEG research, or anything else that can be regenerated from the original measurements) should not be tracked, but their creation should be trivial if all support scripts are included in the repository (a minimal sketch of such a regeneration script follows after this list). This also reduces the amount of data that has to be hosted somewhere.
  3. Even if we get people to sign on to this idea, they can still lie by modifying history before pushing to the central repository. I don't have a full answer to this yet beyond "cheating is always possible, but this system should make it harder." Even under traditional preregistration, it's still possible to cheat by playing with the timestamps on your files and preregistering afterwards. Nontrivial, but possible. And so it is here. However, as pointed out above, the shape of the record should give some indication that something fishy is going on. Moreover, the initial commit reduces to traditional preregistration in the case of confirmatory research. Finally, this approach is about getting everything out into the sunlight; it does not guarantee publication if, for example, there is a fundamental flaw in your methodology. But the openness may allow somebody to comment and help you before you've gone too far off the path!
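As promised in point 2, here is a minimal sketch of the kind of regeneration script I have in mind for derived data. The directory layout, file format, sampling rate, and filter settings are illustrative assumptions, not a recommendation:

```python
#!/usr/bin/env python
"""rebuild_derived.py -- hypothetical sketch: derived files are never
committed; they are rebuilt from the immutable raw measurements."""

import glob
import os

import numpy as np
from scipy.signal import butter, sosfiltfilt

RAW_DIR = "data/raw"          # committed once, never modified
DERIVED_DIR = "data/derived"  # listed in .gitignore, rebuilt on demand
FS = 1000.0                   # assumed sampling rate in Hz

os.makedirs(DERIVED_DIR, exist_ok=True)
# 4th-order Butterworth band-pass between 0.1 and 30 Hz
sos = butter(4, [0.1, 30.0], btype="band", fs=FS, output="sos")

for path in glob.glob(os.path.join(RAW_DIR, "*.npy")):
    raw = np.load(path)                 # one recording per file
    filtered = sosfiltfilt(sos, raw)    # zero-phase filtering
    np.save(os.path.join(DERIVED_DIR, os.path.basename(path)), filtered)
```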

Open (for Comments)


More so than even traditional preregistration, the system proposed above should encourage and enforce a radical openness in science. For the edge cases of preregistration (exploratory and simulatory work), you can avoid some of the rigidity of preregistration, but at a heavy price: everything is open, so it is very clear that your data are exploratory and exactly when you found something interesting. When a result turns up at the end of a long fishing expedition, that is plain to see, and so it is equally plain that the result should be taken with a grain of salt. But the same openness provides an unbelievably direct format for showing people interesting patterns in the data, patterns which potentially support existing research but also demand further investigation with a more focused experiment.

It's not science if it's not open.

(Special thanks to Jona Sassenhagen for his extensive feedback on previous drafts and for long discussions in the office. Long a fan of preregistration, he has made his first foray into the system proposed here, which you can find on GitHub.)

6 comments:

  1. A very nice idea!

    You mention that you would want a versioning system that doesn't allow for deletions etc. (unlike Git), but do you think that's really necessary? My understanding of versioning systems is that you can remove things from the history, but you cannot alter timestamps, right? So, at least to begin with, you could just use an existing versioning system (like Jona does), which saves a lot of effort.

    Speaking of effort: That would probably be a big problem. Working with versioning systems is quite tricky, I remember that from when I started with Git (and I still get confused sometimes). It might be difficult to convince researchers to learn this.

    (Incidentally, a while back I wrote a piece that also touched on the use of versioning in research, although in a different context: http://www.cogsci.nl/blog/miscellaneous/207-the-beauty-of-being-wrong-a-plea-for-post-publication-revision)

    Cheers!
    Sebastiaan

    Replies
    1. I don't think strict local immutability is necessary -- modifying history for things like editing a commit message (fixing typos) or deleting confidential data (e.g., non-anonymized EEG data, which in most countries qualifies as protected medical data) is important. But such corrections should all happen relatively soon afterwards, i.e., hopefully before you've pushed your data to the public repository. Phases in Mercurial support this model: you can change things until you've marked them public.

      You are correct: you can't alter individual timestamps without the tampering being evident further down the tree, because the SHA1 hash of every commit depends on its content, its timestamp, and its parents' hashes. So changing one commit involves changing everything downstream of it. Still, you can imagine a script that checks out a revision from an existing repo, fakes the timestamp (using e.g. the system clock), checks it into another repo, and then repeats that for all the children. But I don't see a usable way to prevent that level of intentional trickery.

      I think we should use the existing systems like Jona is doing, but place some additional restrictions on the public repository to encourage / enforce good practice. My big motivation for immutability is that 'dead ends' get published, so it's clear how many things you tried out (multiple comparisons and post-hoc reasoning), and the dead ends are also useful for other researchers who are considering doing the same thing. At the start of my doctorate, I went to my advisor with an idea for an experiment and was told that the idea was very good -- so good, in fact, that another PhD student had already tried it and gotten a null result, so the study was never published. The difficulty of getting null results published only encourages the abuse of p-values, which is really the center of this whole debate. Cortex's initiative helps solve some of the null-result problem, but only for confirmatory work.

      As far as effort goes, I know exactly what you mean. I'm a relatively experienced Mercurial user, and I still regularly have to look things up. I think that's been the big hurdle in winning converts to DVCS. But there are things like SparkleShare which make using Git (for linear history) as easy as Dropbox. The only downside is that the commit messages are mostly useless, but every save is a commit, so if you save often, the diffs are small enough that you can still track the thought process. (Incidentally, that's also a great way to track a BibTeX database -- I highly recommend it!)

  2. A very interesting proposal. One challenge, however (and this is already a big problem, even for more minimalist registration), is that people will see this as a chore. I think the solution would be to make a system that is useful in itself as well as implementing preregistration. A bit like how Wikipedia is an encyclopedia right now, but it's also a history of Wikipedia when you click the "History" tab - the two functions are inseparable.

    Could there be a registration service that was so useful, people would be willing to use it even if they didn't care about the registration?

    Replies
    1. SparkleShare goes some way toward reducing the chore factor by automatically doing a commit and push every time you save. (It's basically a Dropbox-style graphical front end for Git.) But that still doesn't get rid of the manual part of the documentation -- writing down your objectives, design, etc. I'm not sure there's really a way to avoid that, unless the preregistration / early documentation demonstrably makes writing up the manuscript for publication much easier. I would argue that it already does in a lot of ways, but like so many things that spare later effort at the cost of effort up front, it's a hard sell.

      I like the Wikipedia-style history idea, and that's indeed a lot of what I was going for -- you can see the history/evolution/development/process behind a particular study and/or manuscript. Versioning every save point (e.g., with SparkleShare) provides that kind of history.

      But what if we took the Wikipedia comparison one step further and had science that open? Start with a wiki for experiments. You outline your design in a public space, and everybody is free to comment or contribute. When the design is good enough, you freeze the design page. You upload all the raw experimental data to the Commons area and place a link to it on the design page. Then you open up another page for the manuscript, again crosslinked, and the review process happens interactively via the history mechanism.

      The downside (besides the well-known ones that Wikipedia deals with, like spamming) is that you 'risk' having a paper without authorship in the traditional sense. There will probably be a single lab where the experiment is actually carried out, but if your contribution / review process is very open, you could end up with a paper with dozens of (minor) contributors. I'm not sure we're ready to leave authorship on that level behind as individual scientists, and the academic employment system certainly isn't.

      This is happening to some extent in the software world with sites like the aforementioned GitHub and Bitbucket -- GitHub even goes so far as to make the comparison to social networks and call it 'social coding'. But most research doesn't work the same way software development does. And I think you're absolutely right -- we need a system so compelling in its own right that people want to use it for practical reasons and not just philosophical ones.

  3. Check out Open Science Framework (disclosure: I work for a foundation that is funding them). It provides a free way to do just about everything (I think) that you mention under "Implementation."

    Replies
    1. I had actually looked at the OSF a while back and found it initially a bit opaque, in part because I didn't watch the video. (This seems to be the universal criticism of all such proposals, including mine.) The website still seems incomplete in terms of documentation.

      But now that I've watched the video, I see that it does a lot of what I discussed here. Indeed, its user interface is clearly inspired by Bitbucket/Github and it runs on Git. The web interface and the documentation still need quite a bit of work, and I would love to see good integration with vanilla Git and a native client of some sort, but the project looks very promising and is quite similar to what I had in mind. I'll have to try it out for my next experiment. :-)

      (Tip for the OSF guys: make a readable version of the video with screenshots and text blurbs. I know lots of people are too lazy to read, but requiring sound is a problem in some settings, and it's much easier to skim text than video!)

      One big difference is the 'ontology' behind it all. I'm not sure how I feel about specifying a single category whenever you add a new node, especially given that a lot of the categories seem to overlap in my usual parlance (e.g., 'Methods and Measures' vs. 'Procedure').

      A minor difference (subject to change) is also the prominence of the log / commit history. I would like to have it at most one click away, but I'm also not sure how well that plays into the ontology differences. I suspect that every node is actually its own repository, which makes displaying a single history for the entire project a bit more challenging. But then again, the node-based approach certainly has other advantages and I'd be open to debate on that front.
