COSMO: a Research Data Service Platform

What is COSMO?

COSMO is a platform (web portals and a REST API) for easily sharing large datasets, together with a set of domain-agnostic recommendations for scientific data sharing that expedite access to information without requiring it to be transferred first.

Recommendations

Share the dataset structure so researchers can browse it before downloading anything. COSMO implements this via a Data-Sharing Portal, which presents the dataset structure at different levels of granularity.

Pros:

  • Researchers can navigate the dataset files and identify what is of relevance to them.

Cons:

  • The dataset structure has to be mapped onto the web portal so that researchers can navigate it intuitively (see the manifest sketch below).
  • Descriptions for the individual sections and files need to be put in place so researchers can navigate the dataset.
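
As an illustration, the portal-facing structure can be captured in a small, machine-readable manifest that the web portal renders into navigable sections. The format below is a hypothetical sketch, not a COSMO requirement:

    # Hypothetical manifest describing a dataset at two levels of granularity;
    # a web portal can render this into navigable sections with descriptions.
    DATASET_MANIFEST = {
        "name": "ExampleSimulation",
        "sections": [
            {
                "id": "snapshots",
                "description": "Raw simulation snapshots, one directory per output time.",
                "files": [
                    {"path": "snapshots/snap_000/", "size_gb": 512,
                     "description": "First snapshot output."},
                ],
            },
            {
                "id": "catalogs",
                "description": "Derived group catalogs for each snapshot.",
                "files": [
                    {"path": "catalogs/groups_000/", "size_gb": 8,
                     "description": "Halo properties for the first snapshot."},
                ],
            },
        ],
    }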

Enable the data index metadata (MBs) to be downloaded first to identify sections of interest before downloading the main content (GBs, TBs, PBs); a sketch of this workflow follows the pros and cons below.

Pros:

  • Researchers can download parts of the information to quickly determine if it is relevant for their own research.
  • Researchers can initiate individual transfers for different subsets of the files, and these transfers can be resumed if needed, making each batch of information available faster than downloading a single huge file.

Cons:

  • The dataset needs to be structured so that the information remains usable when only parts of it are downloaded. For example, metadata can be made downloadable per iteration, so researchers can first download and examine the metadata and then decide whether to download the rest of the information. The same concept can be used to split the dataset into smaller chunks, allowing for efficient data download.
  • Descriptions for each section need to be in place so researchers can download the right files.
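
A minimal sketch of this workflow from the researcher's side, assuming the dataset publishes a small JSON index at a known URL (the URL and the index format below are both hypothetical):

    import json
    import urllib.request

    INDEX_URL = "https://example.org/dataset/index.json"  # hypothetical index (MBs)

    # 1. Download only the lightweight index describing the available chunks.
    with urllib.request.urlopen(INDEX_URL) as resp:
        index = json.load(resp)

    # 2. Inspect the metadata to pick only the sections of interest.
    wanted = [entry for entry in index["chunks"] if entry["redshift"] == 8.0]

    # 3. Download just the selected chunks (GBs) instead of the full dataset (TBs).
    for entry in wanted:
        urllib.request.urlretrieve(entry["url"], entry["filename"])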

Enable different options for transferring data. For example, use anonymous Globus endpoints to support direct (HTTP) downloads, and regular Globus collections to allow automated transfers for anyone with a Globus Connect Personal account (a programmatic sketch follows the pros and cons below).

Pros:

  • Researchers accessing the data have more options for downloading it based on their specific environment and resources.

Cons:

  • Multiple services have to be installed, configured, and maintained for the full set of transfer options to be up and running.
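
For the second option, transfers can also be initiated programmatically with the globus-sdk Python package. A minimal sketch, where the client ID, endpoint UUIDs, and paths are placeholders to be replaced with real values:

    import globus_sdk

    CLIENT_ID = "YOUR-NATIVE-APP-CLIENT-ID"     # placeholder: register an app at developers.globus.org
    SOURCE_ENDPOINT = "SOURCE-COLLECTION-UUID"  # placeholder: the dataset's Globus collection
    DEST_ENDPOINT = "DEST-ENDPOINT-UUID"        # placeholder: e.g., your Globus Connect Personal endpoint

    # Log in through the Globus native-app flow to obtain a transfer token.
    auth = globus_sdk.NativeAppAuthClient(CLIENT_ID)
    auth.oauth2_start_flow()
    print("Log in at:", auth.oauth2_get_authorize_url())
    tokens = auth.oauth2_exchange_code_for_tokens(input("Auth code: "))
    transfer_token = tokens.by_resource_server["transfer.api.globus.org"]["access_token"]
    tc = globus_sdk.TransferClient(authorizer=globus_sdk.AccessTokenAuthorizer(transfer_token))

    # Submit a transfer for one subset of the files; Globus adds integrity
    # checks and can resume the task after disruptions.
    task = globus_sdk.TransferData(tc, SOURCE_ENDPOINT, DEST_ENDPOINT, label="dataset subset")
    task.add_item("/dataset/catalogs/groups_000/", "/home/me/groups_000/", recursive=True)
    print("Task ID:", tc.submit_transfer(task)["task_id"])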

Share descriptions of the data structures, as well as the relationships between data fields, so that users are equipped to access the data efficiently (an example follows the pros and cons below).

Pros:

  • It becomes easier to work with the data fields and the relationships between them.
  • Researchers need less hands-on support to start using the datasets.
  • Thorough documentation eases knowledge transfer when team members leave a project.

Cons:

  • Writing a thorough description of every field in a dataset is time-consuming.
  • Any change to the dataset also has to trigger an update to the documentation of the modified fields.
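
One lightweight way to keep such documentation in sync is a machine-readable field dictionary from which the documentation page is generated. A sketch with illustrative entries:

    # Illustrative field dictionary; each entry documents one dataset field,
    # its units and shape, and its relationships to other fields.
    FIELD_DOCS = {
        "FOFGroups/Mass": {
            "units": "1e10 Msun/h",
            "shape": "(N_groups,)",
            "description": "Total mass of each FOF group.",
            "related": ["FOFGroups/LengthByType"],
        },
        "FOFGroups/LengthByType": {
            "units": "dimensionless",
            "shape": "(N_groups, 6)",
            "description": "Number of member particles per particle type.",
            "related": ["FOFGroups/Mass"],
        },
    }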

Provide a REST API for querying the data. This approach allows researchers to browse and interact with the data via a commonly used programmatic protocol. COSMO includes data endpoints, instructions, tutorials, and a website that allows querying the API directly from the browser (a minimal endpoint sketch follows the pros and cons below).

Pros:

  • Researchers are not required to download the dataset to start inspecting it and getting preliminary results.
  • The REST API is programming-language-agnostic, so researchers can interact with the information and produce results using the tools they are already comfortable with.
  • Researchers need no storage for transferring and hosting the dataset, as the REST API accesses the data where it is hosted and returns only the results produced.

Cons:

  • It takes time to design, implement, and deploy the REST API.
  • Depending on the dataset, performance tuning and/or parallelization techniques may be needed to keep response times short.
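
A minimal sketch of such an endpoint using FastAPI (the framework used by the BlueTides API described below); the route and the in-memory stand-in for the dataset are hypothetical:

    from fastapi import FastAPI, HTTPException

    app = FastAPI(title="Dataset API (sketch)")

    # Hypothetical stand-in for a reader over the locally hosted dataset.
    GROUP_MASSES = {0: 1250.0, 1: 980.5}  # group_id -> mass, illustrative values

    @app.get("/groups/{group_id}/mass")
    def read_group_mass(group_id: int):
        """Return one derived quantity instead of shipping the raw files."""
        if group_id not in GROUP_MASSES:
            raise HTTPException(status_code=404, detail="Unknown group id")
        return {"group_id": group_id, "mass": GROUP_MASSES[group_id]}

    # Run with: uvicorn sketch:app --reload  (assuming this file is sketch.py)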

Datasets Leveraging COSMO

These are the datasets that are currently using COSMO or are being configured to do so. Before-and-after screenshots are provided to show the impact COSMO has had on them.

BlueTides

The BlueTides simulation project aims to understand how supermassive black holes and galaxies formed in the first billion years of the universe's history, using one of the largest cosmological hydrodynamic simulations ever performed: a box 400 cMpc/h on a side enclosing a total of 0.7 trillion simulated particles.

Main/Description (landing-page) Web Portal: https://bluetides.psc.edu/

Source code: https://github.com/pscedu/cosmo/tree/main/bluetides-landing-description-portal

Visual comparison before and after adopting COSMO:

Before

Landing page before

After

Landing page after


Data-sharing

The BlueTides simulation dataset files comprise three types of data: simulation snapshots, Friends-of-Friends (FOF) group catalogs, and Particles-in-Group (PIG) catalogs. The snapshot data is organized in blocks containing information about each particle type (dark matter, gas, star, and black hole). The FOF and PIG catalogs contain properties of the gravitationally bound particle groups (halos) and their member particles, as identified by the FOF group-finder algorithm.
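
As a simplified illustration of this organization (directory and block names follow MP-Gadget conventions but are abridged here):

    PART_208/                  # snapshot blocks at one output time
      0/Position/              #   gas particle positions
      1/Position/              #   dark matter particle positions
      4/Mass/                  #   star particle masses
      5/BlackholeMass/         #   black hole masses
    PIG_208/                   # particles-in-group catalog
      FOFGroups/Mass/          #   halo masses from the FOF finder
      FOFGroups/LengthByType/  #   member counts per particle type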

Live data-sharing portal: https://bluetides-portal.psc.edu/

Source code: https://github.com/pscedu/cosmo/tree/main/bluetides-data-sharing-portal

Visual comparison before and after adopting COSMO:

Before

Data-sharing before 1


Data-sharing before 2

After

Data-sharing after 1


Data-sharing after 2


Data Structure Description

A description of the BlueTides dataset fields is provided on a dedicated page, so there is clarity regarding each of the available fields and the relationships between them.

BlueTides Data Structure: https://bluetides.psc.edu/data-structure/

Source code: https://github.com/pscedu/cosmo/tree/main/bluetides-landing-description-portal

Visual comparison before and after adopting COSMO:

Before

BlueTides data structure before

After

BlueTides data structure after

API

The REST API for BlueTides was written using the FastAPI framework to provide users a convenient way to access key subsets of the simulation data. The API endpoint design follows the structure of the BlueTides simulation data: the canonical endpoints mirror the top-down structure of the data catalogs, while the advanced-query endpoints allow searching by specifying criteria for bulk queries.

API Frontend (interactive API documentation): https://bluetides-api.psc.edu/docs/
API Reference: https://bluetides.psc.edu/api-reference/
API Tutorial: https://bluetides.psc.edu/tutorial/
API URL (for querying data): https://bluetides-api.psc.edu/

Source code: https://github.com/pscedu/cosmo/tree/main/bluetides-api
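
For example, the API can be queried from Python with the requests package. The endpoint path below is illustrative; consult the API Reference above for the actual routes and parameters:

    import requests

    BASE_URL = "https://bluetides-api.psc.edu"

    # Illustrative route; see the API Reference for the real endpoint names.
    response = requests.get(f"{BASE_URL}/pig/208/fofgroup/Mass",
                            params={"offset": 0, "limit": 10})
    response.raise_for_status()
    print(response.json())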

Visual comparison before and after adopting COSMO:

Before (an API was not available prior to COSMO)

BlueTides API before

After

BlueTides API after 

ASTRID

This is a new dataset that is being made available through COSMO. It is still a work in progress.


System Design and Implementation

The following diagram shows the structure of COSMO as implemented on top of the Vera cluster, as an example. The main components shown are:

  • Landing-page/Description Web Portal: contains general information about the dataset and project, and also works as an entry point for navigating to other sections, such as team member introductions (People), publications (Results), and a gallery page showcasing the simulation results (Gallery).
  • Data-Sharing Web Portal: provides an overview of the data and individual access to the data files via Globus endpoints (Data Access). Globus provides a reliable file-transfer solution for large files, with features such as automatic data-integrity checks and options for resuming transfers after disruptions.
  • API Portal: includes the API Reference and the API Tutorial, which contain the API endpoint descriptions and a tutorial for using the API via Python scripts, respectively.

COSMO Architecture for BlueTides

How to Share your Dataset using COSMO

The main options for getting started with COSMO for your dataset are:

  • Implement it yourself by leveraging the available open-source repositories and examples.
  • Collaborate with PSC, which can provide expertise to assist with the project.
  • Have PSC train researchers to develop their own implementation of COSMO (training sessions).

Limitations, Computational Requirements, Constraints

Limitations: 

  • The API has to be coded to support your specific dataset.

Computational requirements

  • Disk storage
    • Decompressed dataset (bigfile format, easily usable by multiple workers)
    • Transfer-optimized files for downloads (zip files bundling the decompressed files)
    • Example for a 5 TB dataset:
      • Decompressed dataset (5 TB)
      • Transfer-optimized files (sized according to the percentage of the data you would like to make available)

Constraints

  • Disk storage space for the full dataset, including both transfer-optimized and Input/Output-optimized files.

Resources

These code repositories contain the source code written to implement the COSMO recommendations and best practices, using the BlueTides simulation as a proof-of-concept dataset: