News
Bioscientists Publish their Research Data
Interview with Todd Vision of Dryad
Q: What is Dryad?
A: Dryad is a repository for the data that supports the findings in the scientific literature. Its focus is on biology, biomedicine, and related fields [1]. While there already exist primary archives for a few specialized kinds of biological data, like DNA sequences, the long tail of research data requires a repository such as Dryad that can accept data files of varying formats.
Dryad is also a membership organization that provides a platform for publishers, librarians, and other stakeholders to work together toward the common goal of having the evidence base of the scientific literature preserved and made openly available – not only for the validation of published findings but also to drive new research.
Q: Dryad’s contact with researchers is through the journals in which they publish rather than institutions where they work. Why is that?
A: I believe the overriding challenge in making research data available for reuse is winning the involvement of the researchers who collected the data. Since researchers already select and organize the most valuable and reliable of their data in preparing their articles for publication, it takes relatively little extra effort for them to then release that data to the journal or repository as part of the publication process. There is always an article that describes why the data were collected, the methods used, the results obtained, and so on. And the study has been judged by peer reviewers to be of value to the scientific record. So there is some assurance that the data are both reusable and merit preservation. Journals with strong data policies go a long way to ensure that data are deposited and meet community standards, and in fact Dryad was started as a way to support a group of biology journals that were jointly introducing a new data archiving policy [2].
Q: How are authors motivated to deposit data?
A: The majority of researchers favor research data being available for reuse but need to be assured that their colleagues will also release their data, that their journal or funder or institution cares, and that the professional credit they receive is going to outweigh the risks (such as getting the next paper scooped). One critical way to achieve this is for research organizations to send a loud and clear message that public data archiving is expected as a matter of course, and to live up to that message by evaluating the data contributions of each researcher alongside their publications. We also work to engineer the repository to maximize professional reward (e.g., through data citations and other means of impact tracking) and minimize risk (e.g., by allowing limited-duration data embargoes). There are many subtle ways we can support professional reward. For instance, in a large collaborative project, someone who is only a middle author on an article will be happy to list a dataset on their CV to which they can claim first-authorship.
Q: How does one repository handle the diversity of datatypes, formats, technical standards and so on?
A: When a journal publishes an article, it expects us to host any and all data, so we must be flexible about what we accept. We do review the data files upon deposit to ensure they meet minimal standards, but do not recode the data or reject legitimate content. In our view, the ultimate responsibility for reusability is with the author, the reviewers and the journal. This diversity of content does make it more challenging to take preservation actions through migration of file formats, but we are learning how to do that. What we can ensure at the repository is discoverability and uniformity of presentation - through high-quality bibliographic metadata, reciprocal links between the data and article, assignment of DataCite DOIs [3], getting datasets indexed by search engines, and so on. As a digital library, we see our role as providing a system for accessing books (aka data), not deciding what should be inside those books.
Q: How are the data licensed?
A: Researchers agree, upon deposit, that once the article is published - or, for about a third of the files, once the one-year post-publication embargo is over - the data are to be released into the public domain using a Creative Commons Zero waiver [4]. This allows us to make the data openly available in the sense of the Panton Principles for Open Data in Science: “freely available on the public internet permitting any user to download, copy, analyze, re-process, pass them to software or use them for any other purpose without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself” [5]. At the same time, we work on many fronts to ensure that authors are cited when their data is reused, and that those data citations are trackable. There are some hard cultural and technical problems to be overcome in making trackable data citations a reality, but in our view adding a legal requirement is not helpful.
Q: What is the relationship of data archiving to green and gold Open Access for journal articles?
A: In order to make the data available to users without cost, while having a sustainable organization that can look after data long term, Dryad must have a business model in which curation and preservation costs are met upfront - at the time of data deposit. In that way, Dryad’s business model is similar to that of gold Open Access. But it is important to note that the way access and preservation are funded for the data in Dryad is independent of the arrangement for the associated articles. Dryad hosts data associated with articles in many different kinds of journals, and does not, as an organization, favor one model over another.
Q: How do you think a data repository should measure success?
A: The ultimate goal is for the data to be reused – we are not interested in archiving for the sake of archiving. I think a repository like Dryad can only achieve this by supporting a virtuous circle in which researchers are incentivized to publish data because the benefits are apparent to them, and they see the value in using a repository to do it. This presupposes that the repository is technically and organizationally competent to handle preservation, which is what trusted repository audit frameworks aim to assess. But demonstrating such competence is still just a means to an end – the overriding goal is for data to be used.
We recently collaborated with the UK Digital Curation Centre to find out from a sample of our stakeholders (including publishers, researchers and librarians) what features they valued in a data repository [6]. Features that rose to the top of the list were trackable data citations, support for embargoes, an easy deposit process, and the ability to deposit a wide range of data types. Some of the features less widely valued were surprising to us, such as an assurance of peer review for data. Studies like this, while never definitive, are very helpful in calibrating repository priorities.
Contributor: Todd Vision
[1] http://datadryad.org
[2] Moore AJ, McPeek MA, Rausher MD, Rieseberg L, Whitlock MC (2010) The need for archiving data in evolutionary biology. Journal of Evolutionary Biology 23:659-660. http://dx.doi.org/10.1111/j.1420-9101.2010.01937.x
[3] http://datacite.org
[4] http://creativecommons.org/about/cc0
[5] http://pantonprinciples.org/
[6] http://www.dcc.ac.uk/news/how-can-we-evaluate-data-repositories-pointers-dryaduk
For more information:
Vision, T.J. (2010) Open Data and the Social Contract of Scientific Publishing. BioScience 60(5):330-330. http://dx.doi.org/10.1525/bio.2010.60.5.2