Edinburgh DataShare is an institutional research data repository, serving as the Open Access archiving solution of the University of Edinburgh's Research Data Service. It was established in 2008 as a proof of concept and became a funded part of the broader service in 2012. Built on the open-source repository platform DSpace, it is operated and supported by the university's library staff. It currently holds 3,055 items, including datasets, images, audio files, and software code. It is free at the point of use, allowing direct or batch deposits of items up to 100 GB in size. A DataCite DOI is attached to each item, and objects are linked to people, papers, and research projects in the university's CRIS (current research information system).
The University of Edinburgh is an internationally leading public research university, placed 20th in the 2020 QS World University Rankings, with more than 40,000 students from some 156 countries. Most subjects are covered, with 21 schools across three colleges: Arts, Humanities & Social Sciences; Medicine & Veterinary Medicine; and Science & Engineering. A central Information Services Group provides library and IT services to staff and students.
Who are the repository's users? Not the depositors…
An interesting question that arises from time to time within the Research Data Support team in Edinburgh University Library is: who are the users of our institutional data repository? Because we serve the research community of our own university, we sometimes think of the depositors as the users. But I always argue that ultimately the real users of the repository are the end-users of the data – the people who download the deposited datasets and put them to new purposes. It is these active, but largely invisible, users whose needs we must keep in mind when designing not just interfaces but policies, especially regarding data quality and provenance.
If a depositor – in their haste to get a DOI to add to their submitted paper, to clear off their storage drive, or to pursue a new research proposal – takes shortcuts with data documentation or metadata that reduce the usability of the resulting dataset, it is the end-user who will suffer the consequences. This is why the repository administrators must review each submission with care, putting themselves in the shoes of the data user and asking: is there sufficient information here to make the data re-usable? Data may of course be re-used for a number of purposes, some of them trivial. But if a researcher is going to base a publication on data they did not create themselves, they will need enough confidence in the data and its documentation to justify spending time working with it to create a new research output.
For this reason, the team conducts basic data and metadata quality checks on every deposit, to ensure it is as reusable as possible for that unknown end-user. Rejecting a submission from a university academic (in some cases, a senior one) in order to request more information about the dataset before accepting it into the collection requires good judgement and tact. However, it helps to make the data more FAIR (findable, accessible, interoperable, reusable), which aids the end-user and improves data quality to some extent.
To help the depositors make their data 'future-proof' and as FAIR as possible, we have developed a Checklist for deposit, to answer the questions, "How do you know if you are ready to share your data? What do you need to think about in advance of depositing?" See excerpt below and https://edin.ac/3cN4p2M for the full list. Topics include data granularity, data preparation and file formats, documentation, permissions, rights and open licensing, and embargoes.
We asked some of our depositors in video interviews about their experience of archiving their data in our repository and how they felt about their data being reused, either by other researchers or the general public.
Dr Marc Metzger from the School of GeoSciences speaks of the value of transparency and impact in sharing data publicly. Moreover, he saves himself time by making his climate mapping research data openly available so that others can download it for themselves, rather than having to send out copies in response to requests. This approach represents best practice – making the data openly available is also more convenient for users, removing a potential barrier to reuse. https://youtu.be/7GVNay4NL8U
In another video we hear from Dr Bert Remijsen, who has gathered a significant body of audio data – songs and stories – from individuals living in South Sudan as part of his linguistics research into the Shilluk and Dinka languages. He finds it very rewarding that not only can other language researchers access the data freely, but members of the South Sudanese community have also discovered parts of their lost heritage through the collections. He was also pleasantly surprised when a news organisation used the music as a backdrop to a piece of televised journalism. https://youtu.be/pQFWZV8g3jU
Because open access repositories require nothing from the user in return for access to the data, in some ways we know very little about our users, or the purposes to which they put the data. The user is free to do anything they like with it, provided that (under the default CC-BY licence) they attribute the data creator if they republish it. However, even this is not monitored. We often tell depositors that their data will be on the web, 'in the wild', and there are no university resources to enforce the licence conditions (this is why we generally discourage non-commercial, no-derivatives, or share-alike licences).
However, some things can be known. We have usage statistics, both natively within DSpace and via Google Analytics, that give us anonymised data about page views and file downloads. In fact, the DSpace interface allows us to share these statistics publicly: both usage and search statistics for any grouping (community), collection or dataset can be seen without a login from the right panel menu, as shown below. This view provides raw numbers for the last six months, plus country and city of origin, making it easy to compare with other collections or datasets.
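The country-of-origin breakdown in that panel is, at heart, a simple aggregation over anonymised download events. As a minimal illustrative sketch (the event records and the 'country' field here are hypothetical, not DSpace's actual export format), the ranking could be produced like this:

```python
from collections import Counter

def downloads_by_country(events):
    """Aggregate anonymised download events into per-country totals,
    ranked from most to fewest downloads, mirroring the kind of
    breakdown the public DSpace statistics panel displays.

    `events` is an iterable of dicts with a 'country' key
    (a hypothetical export shape, for illustration only).
    """
    totals = Counter(e["country"] for e in events)
    return totals.most_common()

# Example usage:
events = [{"country": "UK"}, {"country": "US"}, {"country": "UK"}]
print(downloads_by_country(events))  # [('UK', 2), ('US', 1)]
```

The point is only that no personal data is needed for this view – a country label per event is enough to build the comparison across collections.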
We have on occasion tried to determine the most popular datasets based on these statistics, and to recognise the data creators who produced (and crucially, shared!) them. This once culminated in the Edinburgh DataShare Awards, presented at our annual internal Dealing with Data event (alas, it did not become a tradition). We had lots of categories – the most data-sharing school, the most prolific data sharer from each of our three colleges, most popular shared data of the year, and most popular shared data ever! It was interesting to see how excited the winners were, even though the only thing we presented them with was a printed certificate embossed with our "databot", as seen in the image below. This was written up in a blog post, http://datablog.is.ed.ac.uk/2017/12/07/the-edinburgh-datashare-awards/, but if you are really curious about the most popular dataset ever, I'll provide the spoiler: it was Peter Sandercock's International Stroke Trial database (version 2) at http://dx.doi.org/10.7488/ds/104. This is a rare open access dataset of a clinical trial, and it may be downloaded as much by those curious about whether the data are disclosive as by those interested in the findings.
Other rewards for contributors, as well as insights for us as repository administrators, come in the form of published citations and altmetrics. Published citations can be tricky to track, but we are getting to grips with the tools offered by DataCite to track citations of our holdings, and watching for citations in the Data Citation Index, part of the Web of Science, through our library's subscription. One mystery we are working on is why the two sources differ: DataCite has found about a dozen citations, whereas the Data Citation Index has found about 50. Some of these are of course self-citations. I suspect these numbers will grow as researchers begin to take data citation seriously, and as publishers provide rules for proper citation that will make references appear in such counts.
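DataCite exposes DOI metadata through its public REST API, and the DOI record's attributes can include a citation count. A minimal sketch of looking that up for one of our DOIs might look like the following – the endpoint is real, but the exact attribute names should be verified against the current DataCite API documentation before relying on them:

```python
import json
import urllib.parse
import urllib.request

DATACITE_API = "https://api.datacite.org/dois/"

def citation_count(record: dict) -> int:
    """Extract the citation count from a DataCite DOI record
    (JSON:API shape: data -> attributes -> citationCount).
    Returns 0 if the field is absent or null."""
    attrs = record.get("data", {}).get("attributes", {})
    return int(attrs.get("citationCount") or 0)

def fetch_doi_record(doi: str) -> dict:
    """Fetch the public DataCite metadata record for a DOI (network call)."""
    url = DATACITE_API + urllib.parse.quote(doi, safe="")
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

# Example usage (requires network access):
# record = fetch_doi_record("10.7488/ds/104")
# print(citation_count(record))
```

Separating the parsing from the fetch keeps the extraction testable offline; a periodic run over all of the repository's DOIs would be one way to watch the DataCite side of the citation-count mystery over time.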