Posted to dev@airavata.apache.org by "Kotabagi, Karan" <kk...@iu.edu> on 2018/09/06 04:09:03 UTC

Achieve the Pre-Data Staging and explore ways to reduce the data transfer between the compute resource and airavata server

Hi Dev,


As part of the Science Gateway Architecture course, we received a project proposal from Sudhakar to implement pre-data staging using Nextcloud.

Please find below the project proposal and the wiki link for the project ideation phase. Please review them and let us know of any points that would be useful as we start the project.

Project Proposal:

Achieve pre-staging of input files using Nextcloud file storage, and explore ways to reduce data movement between the compute resources and the local Airavata server.


Wiki Link:

https://github.com/airavata-courses/airavata-nextcoud/wiki/Project-Ideation


Regards

Karan


Re: Achieve the Pre-Data Staging and explore ways to reduce the data transfer between the compute resource and airavata server

Posted by "Kotabagi, Karan" <kk...@iu.edu>.
Hello Marcus,

If we register the data product URI at pre-staging time, Airavata becomes aware of the pre-staged data in Nextcloud. We believe this step brings us closer to the data-first approach mentioned by Supun: we avoid registering the data product URI during experiment creation and only associate the already registered data with the experiment.


Thanks!


Regards

Karan

________________________________
From: Christie, Marcus Aaron <ma...@iu.edu>
Sent: Monday, October 29, 2018 4:01 PM
To: Airavata Dev
Subject: Re: Achieve the Pre-Data Staging and explore ways to reduce the data transfer between the compute resource and airavata server

Hi Karan,

Just to clarify, what is the advantage of registering a data product URI at pre-staging time?

I think you have correctly identified the next steps. There isn't an API to get a data product URI from an existing replica location.

Thanks,

Marcus

On Oct 22, 2018, at 12:25 PM, Kotabagi, Karan <kk...@iu.edu> wrote:

Hi Dev,

As discussed with Marcus, we are now able to get the remote path picked up and registered. Airavata picks up the remote file from Nextcloud, the path is recognized, and the usual flow continues with the source and destination URIs.

We have given users of the seagrid-rich client an option to upload (pre-stage) a file to Nextcloud. During experiment creation, the user can fetch the remote file list from Nextcloud and select the pre-uploaded file; the path is then registered as part of experiment creation.

However, since we want Airavata to know that the file is pre-staged, we have another approach that we can start working on:

1> Register the file and get back the data product URI when the file is pre-staged. When a user uploads an input file through the pre-data-staging option, the path is registered and we obtain the data product URI for this unique path.

2> When the user selects the remote file in the Nextcloud interface with the "select pre-uploaded file" option, then at experiment creation time we need to associate the file path (which is unique for every file) with the data replication model and return the experiment model with the data product URI associated with this file path.

To achieve this, we need to get the data product URI from the unique file path. We therefore need to define an API in Airavata that returns the data product URI associated with a unique remote file path. This way, we can associate the pre-staged file with the experiment during experiment creation.

Flow diagram for this approach (please see the attachment if it is not viewable in the email).

<image002.jpg>

Please share your comments on this approach.

Regards
Karan Kotabagi
School of Informatics and Computing
Indiana University, Bloomington
From: Christie, Marcus Aaron <ma...@iu.edu>
Sent: Wednesday, October 3, 2018 10:44 AM
To: Airavata Dev <de...@airavata.apache.org>
Subject: Re: Achieve the Pre-Data Staging and explore ways to reduce the data transfer between the compute resource and airavata server

Hi Karan,

Sorry for taking so long to get back to you. You can call the registerDataProduct method to register the file and its location and this will return the product URI.



On Sep 25, 2018, at 9:35 AM, Kotabagi, Karan <kk...@iu.edu> wrote:

Hi Dev,

We are working on this and exploring ways to pre-stage the data.

Is there a way to call an Airavata API to register the product URI after the file has been uploaded from the client end?

Regards
Karan
________________________________
From: Kotabagi, Karan <kk...@iu.edu>
Sent: Sunday, September 9, 2018 5:51 PM
To: dev
Subject: Re: Achieve the Pre-Data Staging and explore ways to reduce the data transfer between the compute resource and airavata server

Hi Supun,

Thank you for the detailed suggestions and insights. They gave us an in-depth understanding of the future aspects of the project.

We will have more questions as we move along.

Regards
Karan
________________________________
From: Supun Nakandala <su...@gmail.com>
Sent: Saturday, September 8, 2018 7:39 PM
To: dev
Subject: Re: Achieve the Pre-Data Staging and explore ways to reduce the data transfer between the compute resource and airavata server

Hi Karan,

This will be a very useful addition to Airavata. I was thinking about this some time back, and below are some ideas that I think are worth sharing with you.

1. I think the value added by this project goes beyond the current SEAGrid use cases. As of now (if I am correct), Airavata follows an "application first" approach to creating experiments. This makes sense, as most of the existing experiments are simulation-type experiments. But there will be growing interest in "data first" experiments, such as machine learning and bioinformatics experiments, where you keep reusing the same inputs or datasets for a variety of ML models or applications.

2. To support a fully capable data-first approach, we need to maintain a catalog containing information about the data items (similar to the Application catalog that we currently have). This catalog should have at least the basic metadata describing the data format, origin, supported applications, etc. The best place to add this information is the data catalog. But if I am correct, that will require adding these capabilities to the data catalog, as it currently only supports cataloging output data from some selected applications.

3. To reduce the overhead of data movement, you will need a replica catalog. The same data files can be distributed to multiple places, and the replica catalog will help keep track of them. In an ideal scenario, based on the available locations of a data item and a network cost model, the system should be able to decide which replica to use for a particular experiment to minimize data movement cost. Also, in some cases the availability of a replica will be subject to an expiration time, since on some compute resources the scratch space is subject to purging. In an ideal scenario, the replica catalog should capture this information too.
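
As a rough illustration of point 3, the sketch below picks, among the unexpired replicas of a data item, the one with the lowest estimated transfer cost to a given compute resource. Everything in it is an assumption made for the example: the `Replica` record, its `valid_until` field, the `example-dp://` URI, and the cost table are hypothetical and are not taken from the Airavata replica catalog.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Replica:
    """One physical copy of a data item (hypothetical record)."""
    product_uri: str
    storage_id: str
    file_path: str
    valid_until: Optional[datetime] = None  # None: the copy is never purged


def choose_replica(
    replicas: List[Replica],
    compute_id: str,
    cost: Dict[Tuple[str, str], float],
    now: datetime,
) -> Optional[Replica]:
    """Return the unexpired replica that is cheapest to move to compute_id.

    cost maps (storage_id, compute_id) to an estimated transfer cost.
    Pairs missing from the table are treated as unreachable.
    """
    usable = [
        r for r in replicas
        if (r.valid_until is None or r.valid_until > now)
        and (r.storage_id, compute_id) in cost
    ]
    if not usable:
        return None
    return min(usable, key=lambda r: cost[(r.storage_id, compute_id)])


if __name__ == "__main__":
    now = datetime(2018, 9, 8, 12, 0)
    replicas = [
        Replica("example-dp://input-1", "nextcloud", "/staged/input.dat"),
        # This scratch copy expired before `now`, so it is skipped even
        # though its transfer cost would be the lowest.
        Replica("example-dp://input-1", "cluster-scratch", "/scratch/input.dat",
                valid_until=datetime(2018, 9, 1)),
    ]
    cost = {("nextcloud", "cluster"): 10.0, ("cluster-scratch", "cluster"): 0.0}
    print(choose_replica(replicas, "cluster", cost, now))
```

A real cost model would likely depend on file size and current network conditions; a static lookup table is used here only to keep the selection logic visible.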

As you can see, this project spans almost all aspects of the Airavata data infrastructure, and there are some interesting distributed systems problems. But as you have shown in the Wiki, you can start with SEAGrid as a concrete use case. I hope the big picture will give you more interesting ideas for extending your project further.

Best
-Supun

On Sat, Sep 8, 2018 at 3:29 PM Kotabagi, Karan <kk...@iu.edu> wrote:
**************Re-sending the previous email*****************
Hi Dev,

We have discussed a few changes with Sudhakar and updated the Wiki with the new Napkin Drawing and User Story. Please review them and let us know if you have any suggestions.

Wiki Link:
https://github.com/airavata-courses/airavata-nextcoud/wiki/Project-Ideation

Regards
Karan
________________________________
From: Kotabagi, Karan
Sent: Thursday, September 6, 2018 12:09 AM
To: dev@airavata.apache.org
Subject: Achieve the Pre-Data Staging and explore ways to reduce the data transfer between the compute resource and airavata server

Hi Dev,

As part of the Science Gateway Architecture course, we received a project proposal from Sudhakar to implement pre-data staging using Nextcloud.

Please find below the project proposal and the wiki link for the project ideation phase. Please review them and let us know of any points that would be useful as we start the project.

Project Proposal:
Achieve pre-staging of input files using Nextcloud file storage, and explore ways to reduce data movement between the compute resources and the local Airavata server.

Wiki Link:
https://github.com/airavata-courses/airavata-nextcoud/wiki/Project-Ideation

Regards
Karan


Re: Achieve the Pre-Data Staging and explore ways to reduce the data transfer between the compute resource and airavata server

Posted by "Christie, Marcus Aaron" <ma...@iu.edu>.
Hi Karan,

Just to clarify, what is the advantage of registering a data product URI at pre-staging time?

I think you have correctly identified the next steps. There isn't an API to get a data product URI from an existing replica location.

Thanks,

Marcus



RE: Achieve the Pre-Data Staging and explore ways to reduce the data transfer between the compute resource and airavata server

Posted by "Kotabagi, Karan" <kk...@iu.edu>.
Hi Dev,

As discussed with Marcus, we are now able to get the remote path picked up and registered. Airavata picks up the remote file from Nextcloud, the path is recognized, and the usual flow continues with the source and destination URIs.

We have given users of the seagrid-rich client an option to upload (pre-stage) a file to Nextcloud. During experiment creation, the user can fetch the remote file list from Nextcloud and select the pre-uploaded file; the path is then registered as part of experiment creation.

However, since we want Airavata to know that the file is pre-staged, we have another approach that we can start working on:

1> Register the file and get back the data product URI when the file is pre-staged. When a user uploads an input file through the pre-data-staging option, the path is registered and we obtain the data product URI for this unique path.

2> When the user selects the remote file in the Nextcloud interface with the "select pre-uploaded file" option, then at experiment creation time we need to associate the file path (which is unique for every file) with the data replication model and return the experiment model with the data product URI associated with this file path.

To achieve this, we need to get the data product URI from the unique file path. We therefore need to define an API in Airavata that returns the data product URI associated with a unique remote file path. This way, we can associate the pre-staged file with the experiment during experiment creation.
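
The two steps above can be sketched as follows. This is a minimal in-memory model and not Airavata code: `PreStagedRegistry`, `register`, `uri_for_path`, `create_experiment`, the `example-dp://` URI format, and the file path are all hypothetical names chosen for illustration. The lookup by file path stands in for the API that this message proposes to add.

```python
import uuid
from typing import Dict, Optional


class PreStagedRegistry:
    """Toy stand-in for a catalog that maps replica paths to product URIs."""

    def __init__(self) -> None:
        self._uri_by_path: Dict[str, str] = {}

    def register(self, file_path: str) -> str:
        """Step 1: called at pre-staging time; returns the data product URI.

        Registering the same path again returns the existing URI, so each
        unique path has exactly one data product.
        """
        if file_path not in self._uri_by_path:
            # Hypothetical URI format, used only for illustration.
            self._uri_by_path[file_path] = f"example-dp://{uuid.uuid4()}"
        return self._uri_by_path[file_path]

    def uri_for_path(self, file_path: str) -> Optional[str]:
        """Step 2: the proposed lookup from a unique remote path to its URI."""
        return self._uri_by_path.get(file_path)


def create_experiment(registry: PreStagedRegistry, name: str, file_path: str) -> dict:
    """Associate an already registered file with a new experiment."""
    uri = registry.uri_for_path(file_path)
    if uri is None:
        raise LookupError(f"{file_path} was not pre-staged")
    return {"name": name, "input_product_uri": uri}


if __name__ == "__main__":
    registry = PreStagedRegistry()
    registry.register("/nextcloud/user/input.dat")  # hypothetical path
    print(create_experiment(registry, "demo", "/nextcloud/user/input.dat"))
```

In this sketch, experiment creation never registers anything; it only resolves a path to a URI that already exists, which is the separation the second approach is aiming for.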

 

Flow diagram for this approach (please see the attachment if it is not viewable in the email).

Please share your comments on this approach.

Regards
Karan Kotabagi
School of Informatics and Computing
Indiana University, Bloomington


Re: Achieve the Pre-Data Staging and explore ways to reduce the data transfer between the compute resource and airavata server

Posted by "Christie, Marcus Aaron" <ma...@iu.edu>.
Hi Karan,

Sorry for taking so long to get back to you. You can call the registerDataProduct method to register the file and its location and this will return the product URI.
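
A sketch of how a client might wrap such a call after an upload is shown below. It does not use the Airavata SDK: the real `registerDataProduct` method takes Airavata's own model objects, while the `DataProduct` and `ReplicaLocation` dataclasses, their field names, the `DataProductClient` protocol, the `InMemoryClient` stand-in, and the `example-dp://` URI format are all assumptions made so that the example runs on its own.

```python
import posixpath
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Protocol


@dataclass
class ReplicaLocation:
    """Where one copy of the file lives (field names are assumptions)."""
    storage_resource_id: str
    file_path: str


@dataclass
class DataProduct:
    """Simplified description of a file to be registered."""
    gateway_id: str
    owner_name: str
    product_name: str
    replica_locations: List[ReplicaLocation] = field(default_factory=list)


class DataProductClient(Protocol):
    """Anything that offers a registerDataProduct-style call."""

    def register_data_product(self, product: DataProduct) -> str: ...


class InMemoryClient:
    """Stand-in so the example runs without an Airavata server."""

    def __init__(self) -> None:
        self.products: Dict[str, DataProduct] = {}

    def register_data_product(self, product: DataProduct) -> str:
        uri = f"example-dp://{uuid.uuid4()}"  # hypothetical URI format
        self.products[uri] = product
        return uri


def register_uploaded_file(
    client: DataProductClient,
    gateway_id: str,
    owner_name: str,
    storage_resource_id: str,
    file_path: str,
) -> str:
    """Register a file and its location; return the product URI."""
    product = DataProduct(
        gateway_id=gateway_id,
        owner_name=owner_name,
        product_name=posixpath.basename(file_path),
        replica_locations=[ReplicaLocation(storage_resource_id, file_path)],
    )
    return client.register_data_product(product)


if __name__ == "__main__":
    client = InMemoryClient()
    print(register_uploaded_file(
        client, "seagrid", "karan", "nextcloud", "/staged/input.dat"))
```

Keeping the client behind a small protocol means the upload code can be exercised with the in-memory stand-in and later pointed at a real Airavata client adapter.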






Re: Achieve the Pre-Data Staging and explore ways to reduce the data transfer between the compute resource and airavata server

Posted by "Kotabagi, Karan" <kk...@iu.edu>.
Hi Dev,


We are working on this and exploring ways to pre-stage the data.

Is there a way to call an Airavata API to register the product URI after the file has been uploaded from the client end?


Regards

Karan

________________________________
From: Kotabagi, Karan <kk...@iu.edu>
Sent: Sunday, September 9, 2018 5:51 PM
To: dev
Subject: Re: Achieve the Pre-Data Staging and explore ways to reduce the data transfer between the compute resource and airavata server


Hi Supun,


Thank you for the detailed suggestions and insights, this gave us a in-depth understanding with the future aspects of the project.


We will have more questions as we move along.


Regards

Karan​

________________________________
From: Supun Nakandala <su...@gmail.com>
Sent: Saturday, September 8, 2018 7:39 PM
To: dev
Subject: Re: Achieve the Pre-Data Staging and explore ways to reduce the data transfer between the compute resource and airavata server

Hi Karan,

This will be a very useful addition for Airavata. I was thinking about this some time back, and below are some ideas that I think are worth sharing with you.

1. I think the value added by this project goes beyond the current SEAGrid use cases. As of now (if I am correct), Airavata follows an "application first" approach for creating experiments. This makes sense, as most of the existing experiments are simulation-type experiments. But there will be growing interest in "data first" experiments, such as machine learning and bioinformatics experiments, where you keep reusing the same inputs/datasets for a variety of ML models or applications.

2. In order to support a fully capable data first approach, we need to maintain a catalog containing information about the data items (similar to the Application catalog that we currently have). This catalog should have at least the basic metadata describing the data format, origin, supported applications, etc. The best place to add this information will be the data catalog. But if I am correct, that will require adding these capabilities to the data catalog, as currently it only supports cataloging output data from some selected applications.

3. To reduce the overhead of data movement, you will need a replica catalog. The same data files can be distributed to multiple places, and a replica catalog will help keep track of them. In an ideal scenario, based on the available locations of a data item and a network cost model, the system should be able to decide which copy of the data item to use for a particular experiment to minimize data movement cost. Also, in some cases the availability of replicas will be subject to an expiration time, as on some compute resources the scratch space is subject to purging. In an ideal scenario, the replica catalog should capture this information too.

As you can see, this project spans almost all aspects of the Airavata data infrastructure, and there are some interesting distributed systems problems. But as you have shown in the Wiki, you can start with SEAGrid as a concrete use case. I hope the big picture will give you more interesting ideas for extending your project further.

Best
-Supun

On Sat, Sep 8, 2018 at 3:29 PM Kotabagi, Karan <kk...@iu.edu> wrote:

**************Re-sending the previous email*****************

Hi Dev,


We have discussed a few changes with Sudhakar and updated the Wiki with the new Napkin Drawing and User Story. Please review it and let us know if you have any suggestions.


Wiki Link:

https://github.com/airavata-courses/airavata-nextcoud/wiki/Project-Ideation


Regards

Karan



Re: Achieve the Pre-Data Staging and explore ways to reduce the data transfer between the compute resource and airavata server

Posted by "Kotabagi, Karan" <kk...@iu.edu>.
Hi Supun,


Thank you for the detailed suggestions and insights. They gave us an in-depth understanding of the future aspects of the project.


We will have more questions as we move along.


Regards

Karan

________________________________
From: Supun Nakandala <su...@gmail.com>
Sent: Saturday, September 8, 2018 7:39 PM
To: dev
Subject: Re: Achieve the Pre-Data Staging and explore ways to reduce the data transfer between the compute resource and airavata server

Hi Karan,

This will be a very useful addition for Airavata. I was thinking about this some time back, and below are some ideas that I think are worth sharing with you.

1. I think the value added by this project goes beyond the current SEAGrid use cases. As of now (if I am correct), Airavata follows an "application first" approach for creating experiments. This makes sense, as most of the existing experiments are simulation-type experiments. But there will be growing interest in "data first" experiments, such as machine learning and bioinformatics experiments, where you keep reusing the same inputs/datasets for a variety of ML models or applications.

2. In order to support a fully capable data first approach, we need to maintain a catalog containing information about the data items (similar to the Application catalog that we currently have). This catalog should have at least the basic metadata describing the data format, origin, supported applications, etc. The best place to add this information will be the data catalog. But if I am correct, that will require adding these capabilities to the data catalog, as currently it only supports cataloging output data from some selected applications.

3. To reduce the overhead of data movement, you will need a replica catalog. The same data files can be distributed to multiple places, and a replica catalog will help keep track of them. In an ideal scenario, based on the available locations of a data item and a network cost model, the system should be able to decide which copy of the data item to use for a particular experiment to minimize data movement cost. Also, in some cases the availability of replicas will be subject to an expiration time, as on some compute resources the scratch space is subject to purging. In an ideal scenario, the replica catalog should capture this information too.

As you can see, this project spans almost all aspects of the Airavata data infrastructure, and there are some interesting distributed systems problems. But as you have shown in the Wiki, you can start with SEAGrid as a concrete use case. I hope the big picture will give you more interesting ideas for extending your project further.

Best
-Supun



Re: Achieve the Pre-Data Staging and explore ways to reduce the data transfer between the compute resource and airavata server

Posted by Supun Nakandala <su...@gmail.com>.
Hi Karan,

This will be a very useful addition for Airavata. I was thinking about this
some time back, and below are some ideas that I think are worth sharing
with you.

1. I think the value added by this project goes beyond the current SEAGrid
use cases. As of now (if I am correct), Airavata follows an "application
first" approach for creating experiments. This makes sense, as most of the
existing experiments are simulation-type experiments. But there will be
growing interest in "data first" experiments, such as machine learning and
bioinformatics experiments, where you keep reusing the same inputs/datasets
for a variety of ML models or applications.

2. In order to support a fully capable data first approach, we need to
maintain a catalog containing information about the data items (similar to
the Application catalog that we currently have). This catalog should have
at least the basic metadata describing the data format, origin, supported
applications, etc. The best place to add this information will be the data
catalog. But if I am correct, that will require adding these capabilities
to the data catalog, as currently it only supports cataloging output data
from some selected applications.

3. To reduce the overhead of data movement, you will need a replica
catalog. The same data files can be distributed to multiple places, and a
replica catalog will help keep track of them. In an ideal scenario, based
on the available locations of a data item and a network cost model, the
system should be able to decide which copy of the data item to use for a
particular experiment to minimize data movement cost. Also, in some cases
the availability of replicas will be subject to an expiration time, as on
some compute resources the scratch space is subject to purging. In an ideal
scenario, the replica catalog should capture this information too.
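
To make point 3 a little more concrete, here is a minimal sketch of replica
selection with expiry, in Python. It is only an illustration and not Airavata
code: the Replica class, the pick_replica function, the location names and the
cost table are all hypothetical, and the "network cost model" is assumed to be
a simple lookup of (source, target) pairs.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional, Tuple


@dataclass
class Replica:
    """One physical copy of a data item (hypothetical record, not an Airavata type)."""
    location: str                          # assumed id of a storage or compute resource
    expires_at: Optional[datetime] = None  # None means the copy is never purged


def pick_replica(
    replicas: List[Replica],
    target: str,
    cost: Dict[Tuple[str, str], float],
    now: datetime,
) -> Optional[Replica]:
    """Return the unexpired replica that is cheapest to move to `target`, or None."""
    # Drop copies whose scratch space has already been purged.
    live = [r for r in replicas if r.expires_at is None or r.expires_at > now]
    if not live:
        return None
    # Unknown (source, target) pairs are treated as infinitely expensive.
    return min(live, key=lambda r: cost.get((r.location, target), float("inf")))


# Usage with made-up locations and costs.
now = datetime(2018, 9, 8)
replicas = [
    Replica("nextcloud"),
    Replica("cluster-a-scratch", expires_at=datetime(2018, 9, 1)),   # already purged
    Replica("cluster-b-scratch", expires_at=datetime(2018, 10, 1)),
]
cost = {
    ("nextcloud", "cluster-b"): 10.0,
    ("cluster-a-scratch", "cluster-b"): 1.0,
    ("cluster-b-scratch", "cluster-b"): 0.0,
}
best = pick_replica(replicas, "cluster-b", cost, now)
print(best.location)  # cluster-b-scratch
```

A real replica catalog would persist these records and would probably need a
richer cost model (bandwidth, file size, queue location), but the two pieces
of information above, location and expiration, are the ones point 3 asks the
catalog to capture.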

As you can see, this project spans almost all aspects of the Airavata data
infrastructure, and there are some interesting distributed systems
problems. But as you have shown in the Wiki, you can start with SEAGrid as
a concrete use case. I hope the big picture will give you more interesting
ideas for extending your project further.

Best
-Supun


Re: Achieve the Pre-Data Staging and explore ways to reduce the data transfer between the compute resource and airavata server

Posted by "Kotabagi, Karan" <kk...@iu.edu>.
**************Re-sending the previous email*****************

Hi Dev,


We have discussed a few changes with Sudhakar and updated the Wiki with the new Napkin Drawing and User Story. Please review it and let us know if you have any suggestions.


Wiki Link:

https://github.com/airavata-courses/airavata-nextcoud/wiki/Project-Ideation


Regards

Karan



Re: Achieve the Pre-Data Staging and explore ways to reduce the data transfer between the compute resource and airavata server

Posted by "Kotabagi, Karan" <kk...@iu.edu>.
Hi All,


We have discussed a few things with Sudhakar and updated the Wiki. Please review it and let us know if you have any comments.


