Posted to dev@airavata.apache.org by "Douglas Chau (JIRA)" <ji...@apache.org> on 2015/03/25 21:58:54 UTC

[jira] [Commented] (AIRAVATA-1646) [GSoC] Brainstorm Airavata Data Management Needs

    [ https://issues.apache.org/jira/browse/AIRAVATA-1646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14380754#comment-14380754 ] 

Douglas Chau commented on AIRAVATA-1646:
----------------------------------------

Regarding this project, I would like to understand the scope of the work. I am interested in identifying the best ways to incorporate Cassandra into the project. Here are some questions I have, particularly about metadata and provenance:

- Do we have access to the Apache Thrift data model currently in use by Airavata? If so, can we modify this model?
- What other object store technologies are you interested in (Cassandra and MongoDB)?
- How will the metadata be used? The usage can affect which technologies, and which features of each technology, we should enable.
- What are some examples of the metadata being stored? Is the data structured or unstructured?
- What kind of provenance data is being stored?
- What kind of queries would you expect to be run on the provenance data?
- Do we need to look into Apache Storm for querying streaming data?
- Will we receive accounts on NSF XSEDE clusters for this project?

Thanks, 
Doug

> [GSoC] Brainstorm Airavata Data Management Needs
> ------------------------------------------------
>
>                 Key: AIRAVATA-1646
>                 URL: https://issues.apache.org/jira/browse/AIRAVATA-1646
>             Project: Airavata
>          Issue Type: Brainstorming
>            Reporter: Suresh Marru
>              Labels: gsoc, gsoc2015, mentor
>
> Currently Airavata focuses on Execution Management, and the Registry Sub-System (with app, resource and experiment catalogs) captures metadata about applications and executions. There were a few efforts (primarily from student projects) to explore the resulting gap in data management. It would be good to concretely propose data management solutions for input data registration, input and generated data retrieval, data transfers and replication management.
> Metadata Catalog: In addition, current metadata management is based on shredding Thrift data models into MySQL/Derby schemas. This is described in [1]. We have discussed using object store databases extensively, concluding that we need to understand the requirements more systematically. A good standalone task would be to understand current metadata management and propose alternative solutions with proof-of-concept implementations. Once the community is convinced, we can then plan on implementing them in production.
> Provenance: Airavata could be enhanced to capture provenance to organize the data for reuse, discovery, comparison and sharing. This is a well-explored field, and there might be compelling third-party solutions. It would be especially good to explore the big data space and identify what can be leveraged (either concepts or, even better, implementations).
> Auditing and Traceability: As Airavata mediates executions on behalf of gateways, it has to strike a balance between abstracting the compute resource interactions and providing a transparent execution trace. This will bloat the amount of data to be catalogued. A good effort would be to understand the current extent of Airavata audits and provide suggestions.
> BigData Leverage: Airavata needs to leverage the influx of tools in this space. Any suggestions on relevant tools that will enhance the Airavata experience will be a good fit.
> [1] - https://cwiki.apache.org/confluence/display/AIRAVATA/Airavata+Data+Models+0.12
> [2] - http://markmail.org/thread/4lguliiktjohjmsd



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)