Posted to dev@airavata.apache.org by "Pamidighantam, Sudhakar V" <sp...@illinois.edu> on 2019/05/15 17:50:43 UTC

Some Data requirements

Please see below some data needs we are seeing in the current gateways. Some of these are already handled, but several require additional development, integration, and operational changes.

Specific use cases should be documented as well. This list may not cover all unmet needs, and others are encouraged to add to it as we embark on providing first-class data management in Apache Airavata.

Airavata Data Requirements


A. Data Ingestion

Input data can be of different types and hierarchies, starting from:

  1. individual parameters,
  2. name lists in a simple/small file (KBs), typically instructions for the execution,
  3. data files, which could be large (up to 20 GB),
  4. directories containing multiple files each (100s of GB).
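
For concreteness, a rough Python sketch of how these input kinds could be modelled on the gateway side might look as follows (the class and field names are purely illustrative, not an existing Airavata API):

    from dataclasses import dataclass
    from enum import Enum, auto

    class InputKind(Enum):
        PARAMETER = auto()       # 1. individual parameter value
        NAMELIST_FILE = auto()   # 2. small (KB) instruction/name-list file
        DATA_FILE = auto()       # 3. large data file, up to ~20 GB
        DIRECTORY = auto()       # 4. directory tree, 100s of GB

    @dataclass
    class ExperimentInput:
        name: str
        kind: InputKind
        uri: str                 # upload path, remote pre-staged path, or third-party URI
        size_bytes: int = 0      # used later for quota and transfer checks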

These files can come in many forms: ASCII, binary, compressed (zip, tar), etc.

There may be data from databases that need to be extracted and presented for the user to choose, possibly modify, and then use as input in an experiment (e.g., Supercrtbl).

Data from a previous execution (result/restart data) may need to be used to restart an experiment along with modified inputs (this routinely happens in SEAGrid). In such cases, a way to refer to the previous job/experiment and/or its data locality is needed.
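
A minimal sketch of how such a restart reference could be assembled, assuming a hypothetical resolver that maps a previous experiment's output to a URI the staging layer understands (none of these names are existing Airavata calls):

    def build_restart_inputs(prev_experiment_id, resolve_output_uri, modified_inputs):
        """Combine restart data from a previous experiment with modified inputs.

        resolve_output_uri is assumed to map (experiment_id, output_name) to a
        URI that the staging layer can fetch, ideally preserving data locality.
        """
        inputs = dict(modified_inputs)
        inputs["restart_data"] = resolve_output_uri(prev_experiment_id, "restart")
        inputs["parent_experiment"] = prev_experiment_id  # provenance/reference link
        return inputs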

In the case of workflows with multiple tasks, the independent input data for the different tasks may have to be uploaded up front and thus need to be labelled appropriately.

In the case of job arrays, data for each of the independent tasks may be presented in different hierarchies as folders or compressed sets.

There is a use case where an input segment/field may have multiple files/parameters (file arrays, parameter arrays) associated with it, while other fields may have different types (e.g., the AMPGateway BSR3 application, stage 1).

Some data may be pre-staged on the remote HPC system (Future water) or brought from third party locations/services (Box, Data Repos, Instruments) and associated with the experiment.

The web, session, and other timeouts need to be tuned to make sure all the needed data is transferred in a usable condition.

B. Data Validation and Handling

There needs to be a way to validate that all the required inputs are available before an experiment/workflow is scheduled. Transferred files should be checked for completeness by checksum or other validation. The data need to be uploaded and organized appropriately for execution on the remote system, and likewise in any intermediate staging areas. If data need to be staged from third-party locations, or pre-staged data need to be used, a way to verify data accessibility needs to be provided. Restart data can be checked to confirm they contain the right data for a restart. The remote hosts may have quotas, and validation should consider whether there is sufficient space to move the data before scheduling the experiment.
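
As one possible shape for the completeness and space checks, a Python sketch using streaming SHA-256 checksums and a free-space probe (the staging layout, and the use of disk_usage as a stand-in for a real quota query, are assumptions):

    import hashlib
    import os
    import shutil

    def sha256_of(path, chunk_size=1 << 20):
        """Compute a SHA-256 checksum of a staged file in streaming fashion."""
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def validate_staged_inputs(staged_files, expected_checksums, staging_dir, pending_bytes):
        """Check transferred files against expected checksums and verify free space.

        staged_files: {name: path on the staging area}
        expected_checksums: {name: sha256 hex digest recorded at upload time}
        pending_bytes: total size of data still to be moved into staging_dir
        """
        errors = []
        for name, path in staged_files.items():
            if not os.path.exists(path):
                errors.append(f"{name}: missing on staging area")
            elif sha256_of(path) != expected_checksums.get(name):
                errors.append(f"{name}: checksum mismatch, transfer may be incomplete")
        free = shutil.disk_usage(staging_dir).free  # stand-in for a real quota query
        if pending_bytes > free:
            errors.append(f"insufficient space: need {pending_bytes}, have {free}")
        return errors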

C. Data Processing

In some cases, data need to be processed before being used in an experiment. Uncompressing a zip/tar archive needs to be handled. In some cases, specific preprocessing routines may need to run for the data to be prepared. In other cases, the data need to be organized for machine learning. A way to extend the extraction of critical attributes from the inputs, experiments, and results for learning may be very useful.

D. Data Dissemination

Data need to be made available both for users to monitor manually and for automatic validators and parsers. Once the experiment completes, the parsers should be able to pick up the data and complete a post-processing step. Output data could be large (10s of GB), and a failsafe way to provide it will be needed. Data may need to be compressed and organized for additional post-processing steps. Users also need a way to extract output data from multiple experiments in bulk, to process it with external programs and scripts. This requires a way to select a set of experiments and extract their logs/outputs, with sufficient warning about the size of the resulting download.
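
One way the bulk extraction could be sketched, assuming per-experiment output directories on gateway storage and a configurable size-warning threshold (all names are illustrative):

    import os
    import tarfile

    def bundle_experiment_outputs(experiment_dirs, bundle_path, warn_gb=10):
        """Pack selected experiments' outputs into one archive, warning about size.

        experiment_dirs: {experiment_id: output directory on gateway storage}
        """
        total = 0
        for exp_dir in experiment_dirs.values():
            for root, _dirs, files in os.walk(exp_dir):
                total += sum(os.path.getsize(os.path.join(root, f)) for f in files)
        if total > warn_gb * 1024 ** 3:
            print(f"warning: selected outputs total {total / 1024 ** 3:.1f} GB")
        with tarfile.open(bundle_path, "w:gz") as bundle:
            for exp_id, exp_dir in experiment_dirs.items():
                bundle.add(exp_dir, arcname=exp_id)  # one top-level folder per experiment
        return bundle_path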

E. Data Storage

Data need to be stored for immediate consumption and potential reuse in the gateway or other systems.

F. Data Archive and Retrieval

Data need to be archived to a tertiary storage device so that the primary storage service can be reused for newer data/experiments. A way to retrieve the data from the archive when needed should also be in place.
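
A rough sketch of the archive/retrieve pair, assuming the tertiary store is reachable as a filesystem path and a simple JSON manifest records where each experiment went (in practice this could be tape, object storage, etc.; the layout is an assumption):

    import json
    import shutil
    from pathlib import Path

    def archive_experiment(exp_dir, archive_root, manifest_path):
        """Move an experiment's data to tertiary storage and record where it went."""
        exp_dir, archive_root = Path(exp_dir), Path(archive_root)
        archive_root.mkdir(parents=True, exist_ok=True)
        target = archive_root / exp_dir.name
        shutil.move(str(exp_dir), str(target))
        manifest_file = Path(manifest_path)
        manifest = json.loads(manifest_file.read_text()) if manifest_file.exists() else {}
        manifest[exp_dir.name] = str(target)
        manifest_file.write_text(json.dumps(manifest, indent=2))

    def retrieve_experiment(exp_name, manifest_path, restore_root):
        """Copy archived data back to primary storage when it is needed again."""
        manifest = json.loads(Path(manifest_path).read_text())
        destination = Path(restore_root) / exp_name
        shutil.copytree(manifest[exp_name], destination)
        return destination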

G. Data Deletion/Hiding

Some data (erroneous, unwanted) need to be deleted so they do not interfere with new experiments or processing. A way to hide or delete data based on user choice would be useful to provide. Sometimes restart data become corrupted if a fixed checkpoint file is specified, and these need to be deleted or replaced with the immediately previous good copy.
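
A small sketch of that rollback: keep the last known-good copy of the checkpoint alongside it, refresh it while the checkpoint validates, and restore it when the current one is found corrupted (the validity check is assumed to come from the application):

    import shutil
    from pathlib import Path

    def rotate_checkpoint(checkpoint, is_valid, backup_suffix=".good"):
        """Keep a known-good copy of the checkpoint; roll back if the current one is bad.

        is_valid: application-provided callable that inspects the checkpoint
        contents (e.g. a magic-number or record-count check).
        """
        ckpt = Path(checkpoint)
        backup = ckpt.parent / (ckpt.name + backup_suffix)
        if is_valid(ckpt):
            shutil.copy2(ckpt, backup)    # current copy is good: refresh the backup
        elif backup.exists():
            shutil.copy2(backup, ckpt)    # corrupted: restore the previous good copy
        else:
            ckpt.unlink()                 # no good copy available: delete so it is not reused
        return ckpt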

Thanks for your attention.
Sudhakar.