The NICEST2 hackathon on FAIR climate data took place on March 11, 16 and 17, 2021 (13:00 - 17:00 CET). It combined lectures/training with more practical sessions and discussions. The ReproHack was one of these practical sessions and spanned the two hackathon days.
A ReproHack - reproducibility hackathon - is an event where participants aim to reproduce scientific results detailed in published papers.
The aim of a ReproHack is absolutely not to undermine or discredit researchers or their work but to highlight their great effort and understand how to better support researchers towards Open Science.
ReproHack participants choose a paper and then try to understand it and to reproduce its results from the published code and data. At the end of the ReproHack, the results are reported back to the authors.
See https://github.com/reprohack/reprohack-hq for more information and links to past events.
The emphasis was not on reproducibility but on the FAIRness of scientific publications. We discuss what we mean by this in a later section.
We chose a scientific paper published by one of the hackathon's participants:
Humanitarian need drives multilateral disaster aid, by Lisa M. Dellmuth, Frida A.-M. Bender, Aiden R. Jönsson, Elisabeth L. Rosvold and Nina von Uexkull. PNAS January 26, 2021, 118 (4) e2018293118; https://doi.org/10.1073/pnas.2018293118.
Our objective is to analyze the current practices adopted by researchers and the inherent difficulties related to available infrastructures such as data archive providers and publishers.
Aiden Jönsson is a PhD Student at the Department of Meteorology (MISU), Stockholm University (Sweden).
Aiden is currently working on the "albedo symmetry" problem: while the surface and clear-sky reflectivity of the Earth are asymmetric, cloud contributions compensate to give a remarkable degree of symmetry in the all-sky reflectivity. Aiden wants to learn more about how dynamics and clouds interact with Earth's radiative energy balance in order to understand why and how this happens. You can find some of his code and notebooks at: codeberg.org/aidenrobert
Aiden is an early adopter of the Open Science principles and has made an impressive effort to make his research FAIR.
We opened the aforementioned paper: https://www.pnas.org/content/118/4/e2018293118.
PNAS stands for Proceedings of the National Academy of Sciences of the United States of America.
To get some information on the procedures to follow when submitting a paper, we clicked on the "Authors" tab.
PNAS highlights 4 main reasons why an author would choose PNAS:
FAIR (Findable, Accessible, Interoperable and Reusable) is not mentioned on the main page of the PNAS Author Center. However, when preparing to submit a paper, authors read the more detailed Information for Authors PDF, where an entire section is dedicated to "Materials and Data Availability".
Below is a summary of instructions given to authors for making materials, data, and associated protocols, including code and scripts, available to readers upon publication:
When working online (with the HTML version), we followed the Data availability link provided by the publisher, which in this case points to a Harvard Dataverse deployment containing 80+ data files (they can also be browsed in a folder view using the Tree tab). The information in the metadata tabs seems rather basic and is probably not used much. This is a very specific case, of course.
Any technical problem damages the reputation of the researchers rather than that of the repository itself.
We first read README.txt, as we expected it to give us some information on where to start.
So we started from undisasteraid_final.do, which is available online:
Stata has evidently been used for some of the analysis. Stata is not open source, which makes it difficult to reproduce or reuse the provided script.
So to download a file, we need to manually add the output filename:
wget https://dataverse.harvard.edu/api/access/datafile/4288994 -O accum_precip_calc.py
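Downloading dozens of files this way quickly becomes tedious. As far as we understand, the Dataverse native API can also be queried for the dataset's file list, so that each file is saved under its original name automatically. Below is a minimal sketch in Python; the requests package is assumed, and the DOI shown is a placeholder, not the dataset's actual identifier.

```python
# Minimal sketch: download all files of a Harvard Dataverse dataset,
# saving each one under its original filename.
# The DOI below is a placeholder, not the actual dataset identifier.
import requests

BASE = "https://dataverse.harvard.edu"
DOI = "doi:10.7910/DVN/XXXXXX"  # placeholder: replace with the dataset DOI

# Ask the native API for the dataset metadata, which includes the file list.
meta = requests.get(f"{BASE}/api/datasets/:persistentId/",
                    params={"persistentId": DOI}).json()

for f in meta["data"]["latestVersion"]["files"]:
    file_id = f["dataFile"]["id"]
    name = f["label"]  # the original filename, e.g. accum_precip_calc.py
    r = requests.get(f"{BASE}/api/access/datafile/{file_id}")
    r.raise_for_status()
    with open(name, "wb") as out:
        out.write(r.content)
    print(f"downloaded {name}")
```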
However:
It is a complex analysis, and even though a lot of information is given, we missed a summary of the software (packages, versions) needed before starting to reproduce the analysis. We had to "discover" the requirements at every step, which slows things down and can discourage potential users.
We suggest adding such information to the README.txt file, for instance generated as shown below.
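One way to produce such a summary is to record the environment automatically. A minimal sketch in Python; the package names listed are assumptions for illustration, not the paper's actual dependencies:

```python
# Minimal sketch: print the installed versions of the packages used by the
# analysis, in a form that can be pasted into README.txt.
# The package names below are assumptions, not the paper's actual dependencies.
import importlib.metadata as md

packages = ["numpy", "pandas", "xarray", "matplotlib"]
for pkg in packages:
    try:
        print(f"{pkg}=={md.version(pkg)}")
    except md.PackageNotFoundError:
        print(f"{pkg}: not installed")
```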
Choose a license for your code: the deposit carries no appropriate license for programs, as CC0 "Public Domain Dedication" is not suitable for software. We recommend that the authors add a specific license, such as the MIT license, to all the code they share.
Make your code citable: deposit the code in a version control repository (such as GitHub) and prepare a release to deposit, for instance, in Zenodo, which will assign it a citable DOI.
Add the code for all the steps, from data retrieval and pre-processing to visualization: all the code used for the analysis is present, but we did not find any of the code used for plotting (it may be in the Stata files, which we could not use as we did not have access to the Stata software).
It is good practice to provide the plotting routines used to generate the plots in the scientific paper. Make sure to also save the data used by these plotting routines: one should not need to redo the entire analysis to reproduce a single plot. A sketch of this pattern follows.
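The sketch below illustrates the practice in Python; the file names, column names, and data are placeholders, not the paper's actual outputs. The analysis step saves the exact data behind a figure, and a separate plotting step reproduces the figure from that file alone.

```python
# Minimal sketch of the practice described above (file names and data are
# placeholders): the analysis saves the data behind each figure, and a
# separate script reproduces the plot from that file alone.
import pandas as pd
import matplotlib.pyplot as plt

# Analysis step: save the exact data used by the figure.
results = pd.DataFrame({"year": range(2000, 2011),
                        "aid": range(11)})  # placeholder data
results.to_csv("figure1_data.csv", index=False)

# Plotting step: reproduce the figure without rerunning the analysis.
data = pd.read_csv("figure1_data.csv")
plt.plot(data["year"], data["aid"])
plt.xlabel("Year")
plt.ylabel("Multilateral aid (placeholder units)")
plt.savefig("figure1.png", dpi=300)
```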
Compared to many existing scientific papers, a lot here can be reused. However, contacting the authors is probably still necessary to be able to reuse the datasets.
We would like to thank all the authors and congratulate them on their effort to make their research FAIR. We have identified a few changes that we believe could both increase the FAIRness of their research and ease their day-to-day work:
Harvard Dataverse was not very reliable during our hackathon: we were not able to download all the files at once with the Download button. We tried many times, on different days, and it always failed. In the absence of a clear error message, such a problem can easily undermine the researchers' work and their effort to make their research FAIR.
Choosing a research data archive is very important and below we give a few criteria to take into consideration:
Harvard Dataverse fulfills many of these criteria but was overall very slow and difficult to use, mostly because the Download button was not working as expected. A week later, we managed to download the entire dataset without problems. A scripted download, as sketched below, is a useful fallback when the web interface misbehaves.
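In our understanding, the same whole-dataset zip bundle that the Download button produces can also be fetched through the Dataverse access API. A minimal sketch in Python, again with a placeholder DOI:

```python
# Minimal sketch: fetch the whole-dataset zip bundle through the
# Dataverse access API instead of the web interface's Download button.
# The DOI below is a placeholder, not the actual dataset identifier.
import requests

BASE = "https://dataverse.harvard.edu"
DOI = "doi:10.7910/DVN/XXXXXX"  # placeholder: replace with the dataset DOI

# Stream the response to avoid holding the full archive in memory.
with requests.get(f"{BASE}/api/access/dataset/:persistentId/",
                  params={"persistentId": DOI}, stream=True) as r:
    r.raise_for_status()
    with open("dataset.zip", "wb") as out:
        for chunk in r.iter_content(chunk_size=1 << 20):
            out.write(chunk)
```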
Funding
NICEST2 is a project within the Nordic e-Infrastructure Collaboration (NeIC). NeIC is an organisational unit under NordForsk.
Contact