EOSC Nordic Climate demonstrator: EOSC Nordic Climate tasks

EOSC Nordic WP5 tasks and status

Below, we describe the WP5 tasks as defined in the submitted application. For each task, we first give an overview of the status at the start of the project (state of the art), then its current status. The status of each task is updated regularly by adding a new dated section.

EOSC Nordic Task 5.2.1

T5.2.1: Cross-border data processing workflows (M1-36) - Lead: UT/ETAIS - Participants: UIO-USIT/SIGMA2, UICE, SNIC, UGOT/GGBC, FMI, SU/SNIC, UIO-GEO/SIGMA2

In this subtask, we will facilitate data pre- and post-processing workflows (High Performance computation or High Throughput computation) on distributed data and computing resources by enabling community specific or thematic portals, such as PlutoF and Galaxy-based geoportal, traditionally designed to submit jobs on local clusters, to allow scheduling of jobs on remote resources. The API modules that will be developed to support the most commonly and widely used Distributed Resource Managers (DRMs) will be designed to become a generic solution, i.e. independent from the architecture and the technology of any given portal.

What is known/available at the start of the project

Here we list what is relevant for the Climate science demonstrator only.

  • Galaxy Pulsar provides a wide job execution system distributed across several European datacenters, allowing Galaxy instances to scale their computing power over heterogeneous resources.
  • Galaxy workflows are text files that can be easily exchanged. Workflows can be searched on each Galaxy instance; on Galaxy Europe, for instance, shared workflows are published.
  • Galaxy shared histories. Galaxy allows users to share their “histories” (data, processing, etc.) via a link. Users can set permissions to restrict access to a group of users if necessary (or a single user).
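
Because workflows are plain text, they can be inspected outside Galaxy. The following is a minimal sketch, assuming the workflow was exported in Galaxy's JSON-based format (a .ga file) with a top-level "steps" mapping. The workflow content, the tool identifier and the summarize_workflow helper are hypothetical and only illustrate the idea.

```python
import json

# Hypothetical, heavily trimmed workflow export; real files carry many more fields.
WORKFLOW_TEXT = """
{
  "a_galaxy_workflow": "true",
  "name": "Example climate workflow",
  "steps": {
    "0": {"type": "data_input", "name": "Input dataset", "tool_id": null},
    "1": {"type": "tool", "name": "Example tool", "tool_id": "example_tool"}
  }
}
"""


def summarize_workflow(text):
    """Return the workflow name and a list of (index, type, tool_id) per step."""
    workflow = json.loads(text)
    steps = workflow.get("steps", {})
    # Step keys are strings ("0", "1", ...); sort them numerically.
    ordered = sorted(steps.items(), key=lambda item: int(item[0]))
    summary = [(int(key), step.get("type"), step.get("tool_id")) for key, step in ordered]
    return workflow.get("name"), summary


name, steps = summarize_workflow(WORKFLOW_TEXT)
print(name)
for index, step_type, tool_id in steps:
    print(index, step_type, tool_id)
```

A summary like this could help when comparing which tools a shared workflow needs against the tools installed on a given instance.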

Status: January 2020

Preliminary list of tasks to enable T5.2.1:

  • The plan is to install a single Galaxy instance for all the Nordics. Pulsar will be used to submit jobs on various platforms (HPCs and cloud computing). The objective is to use a setup similar to the one used by Galaxy Europe, to ease maintenance and facilitate the deployment of new tools by the Climate community.
  • The list of Galaxy tools available/needed is provided and maintained by the Climate community. T5.2.1 will install Galaxy tools that are made available in the Galaxy Toolshed or available as interactive environments in the Galaxy Europe github repository.
  • The list of available/needed training material is also provided and maintained by the Climate community (NICEST2). T5.2.1 will install the Galaxy tools and datasets (Galaxy data libraries) that users need to follow this training material on the Nordic Galaxy instance.

Status: April 2020

Galaxy Training material

New training material under review for publication:

New training material under development:

New training material planned:

  • Analyzing CMIP6 data with Galaxy Climate JupyterLab (in preparation; NICEST2 - not started yet)
  • ESMValTool with Galaxy Climate JupyterLab (in preparation; NICEST2 - not started yet)
  • Running CESM with Galaxy Climate JupyterLab (in preparation; it will be based on GEO4962, a course that is regularly given at the University of Oslo. See GEO4962).

Status: May 2020

Galaxy Training material

New training materials published:

New training material planned:

  • Analyzing CMIP6 data with Galaxy Climate JupyterLab: development has started here with corresponding PR (in progress).

Galaxy climate workbench framework for EOSC-Nordic

More information on Galaxy climate workbench can be found here.

EOSC Nordic Task 5.2.2

T5.2.2: Code Repositories, Containerization and “virtual laboratories” (M1-36) - Lead: SIGMA2 - Participants: UICE, CSC, UIO-GEO/SIGMA2, UIO-INF/SIGMA2, UH

In this subtask, we will pilot solutions for cross-border “virtual laboratories” to allow researchers to work in a common software and data environment regardless of which computing infrastructure the analysis is performed on, thus ensuring the highest reproducibility of the results. The work will encompass evaluations of different Docker Hub technologies provided by the EOSC-hub as well as mechanisms for build automation, package management and containerization. The subtask will focus on building a natural language processing laboratory, but the overall goal will be to create a generic recipe for building virtual laboratories.

What is known/available at the start of the project

  • Tools developed in the framework of Galaxy are available in the NordicESMHub github organization as galaxy-tools github repository.
  • The conda package manager has been used by the Norwegian Climate community for packaging tools (for instance cesm) in bioconda. conda-forge could be used too (but corresponding containers may not be created automatically).
  • Each package added to Bioconda also has a corresponding Docker BioContainer automatically created and uploaded to Quay.io. A list of these and other containers can be found at the Biocontainers Registry. For instance, CESM bioconda container can be found here with both docker and singularity containers available.
  • Tools/models developed outside the Galaxy framework are stored in various places. We do not have a full overview yet.
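
As an illustration of the Bioconda/BioContainers link described above, the sketch below builds the Quay.io image reference for a package and the matching pull commands. It assumes the usual BioContainers tag layout, in which the version and the build string are separated by two hyphens. The version and build values shown are placeholders, not a real CESM release, and the helper names are hypothetical.

```python
def biocontainer_image(package, version, build):
    """Build a Quay.io image reference, assuming the '<version>--<build>' tag layout."""
    return f"quay.io/biocontainers/{package}:{version}--{build}"


def pull_commands(image):
    """Return the shell commands that would fetch the image with each runtime."""
    return {
        "docker": f"docker pull {image}",
        "singularity": f"singularity pull docker://{image}",
    }


# Placeholder version and build string; look up the real tag in the registry.
image = biocontainer_image("cesm", "0.0.0", "0")
commands = pull_commands(image)
print(commands["docker"])
print(commands["singularity"])
```

The commands are only assembled as strings here; running them requires Docker or Singularity on the target system and a tag that actually exists in the registry.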

Status: January 2020

Preliminary list of tasks to enable T5.2.2:

  • Discussion on possible solutions for submitting jobs from Galaxy to different platforms (in Sweden and Norway). Pulsar seems to be the best solution for Galaxy. This is already what is used by Galaxy Europe where Galaxy Climate is currently deployed.
  • Use of the conda package manager is recommended, together with containers (as done with bioconda and biocontainers).
  • There is no equivalent container community repository for climate: should we set up something similar to biocontainers?

Status: April 2020

Target backend systems have been identified:

Norway:

Sweden:

A discussion has been initiated with SNIC (Sweden) on using HPC resources.

EOSC Nordic Task 5.3.1

T5.3.1: Integrated Data Management Workflows (M1-36) Lead: CSC – Participants: UIO-INF/SIGMA2, UH, SNIC, UICE, UIO-GEO/SIGMA2, FMI, SIGMA2

This task will provide solutions for facilitating complex data workflows involving discipline-specific repositories, data sharing portals (such as Earth System Grid Federation, ESGF) and storage for active computing. An emerging HTTP API solution integrated with B2SAFE workflows will be adopted to streamline the creation of replicas of community-specific data repositories towards the computing sites, where computations can be performed. This task will comprise also the adaptation of portals

What is known/available at the start of the project

Status: January 2020

CVMFS is used to replicate Galaxy reference data on any Galaxy instance. See the Galaxy Reference Data with CVMFS tutorial for more information on the usage of CVMFS in Galaxy for deploying/replicating reference data. This approach is probably suitable for small climate datasets (for instance teaching datasets, in-situ observations) but is not appropriate for the bulk of climate data. We suggest investigating other remote access solutions.

Preliminary list of tasks to enable T5.3.1:

Status: April 2020

  • Preliminary tests using ownCloud have been successful. However, no performance analysis has been performed yet. So far we only tested the functionalities. Larger datasets will be harvested for further testing.

Status: June 2020

Climate data relevant for EOSC-Nordic

The list of data relevant for the Climate community can be found here.

EOSC Nordic Task 5.3.2

T5.3.2: Machine actionable DMPs (M1-36) Lead: SIGMA2 – Participants: GFF, SNIC, UGOT/SND

Link DMP with storage & computing resource allocations.

What is known/available at the start of the project

  • Storage and computing resource allocation are usually based on scientific merit. Project managers submit storage and/or compute applications that are usually evaluated and granted by a resource allocation committee.
  • Data Management Plans are usually requested but not necessarily mandatory. In addition, DMPs are rarely assessed or monitored during the lifetime of a project.

Status: January 2020

Preliminary list of tasks to enable T5.3.2:

  • This task has not started yet.

Status: April 2020

  • Brainstorming activity to narrow down what we can do during EOSC-Nordic: the goal will be to “link” DMPs (for instance Easy DMP) with resource allocation.
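
One possible way to “link” a DMP with a resource allocation is to compare the storage declared in the DMP with the storage granted to the project. The sketch below assumes a machine-actionable DMP serialized as JSON with dataset, distribution and byte_size fields (loosely modelled on the RDA DMP Common Standard). The project data, the granted quota and the function names are hypothetical, and nothing here reflects how Easy DMP or any allocation system actually exposes this information.

```python
# Hypothetical machine-actionable DMP fragment (field names are an assumption).
EXAMPLE_DMP = {
    "dmp": {
        "title": "Example climate project",
        "dataset": [
            {"title": "Model output", "distribution": [{"byte_size": 5 * 10**12}]},
            {"title": "In-situ observations", "distribution": [{"byte_size": 2 * 10**11}]},
        ],
    }
}


def declared_storage_bytes(dmp):
    """Sum the byte_size of every distribution of every dataset in the DMP."""
    total = 0
    for dataset in dmp["dmp"].get("dataset", []):
        for distribution in dataset.get("distribution", []):
            total += distribution.get("byte_size", 0)
    return total


def compare_with_allocation(dmp, granted_bytes):
    """Report whether the storage declared in the DMP fits the granted quota."""
    declared = declared_storage_bytes(dmp)
    return {
        "declared_bytes": declared,
        "granted_bytes": granted_bytes,
        "within_allocation": declared <= granted_bytes,
    }


# Hypothetical granted quota of 6 TB (decimal).
report = compare_with_allocation(EXAMPLE_DMP, granted_bytes=6 * 10**12)
print(report)
```

A check of this kind could be run when an application is evaluated and again during the project, which would address the lack of DMP monitoring noted above.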

Status: May 2020

  • A first draft for EOSC-Nordic Climate Science roadmap has been released: