Posted to dev@airavata.apache.org by Suresh Marru <sm...@apache.org> on 2015/06/12 15:09:55 UTC

[DISCUSS] Data models for 0.16 and beyond

Hi All,

With the experience of adapting Thrift data models for Airavata over the past couple of years, it's time for us to revisit them. The most persistent criticism has been that the data models are complex. In addition, the data models and the architecture evolved in parallel, and the implementations did not always match the intended models. In an effort to address these issues, let's first discuss the minimal required data models.

We need to conform the models to the general principle of Experiments deriving into a Process or a Workflow. For a single application, a process can be derived directly from the Experiment Details. For workflows, multiple processes are created. Executing a process leads to the creation of multiple Tasks. A Task is a general type that is enacted at run time based on a generic execution sequence of environment setup, data input staging, application execution and monitoring, data output staging, and environment cleanup.
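
To make the derivation above concrete, here is a minimal Python sketch. It is only an illustration, not the Airavata Thrift model: the class names, the field names, and the idea of representing a workflow as a flat list of application names are all assumptions made for this example (a real workflow would be a graph).

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class TaskType(Enum):
    # The generic execution sequence named in the message, in order.
    ENV_SETUP = 1
    INPUT_STAGING = 2
    EXECUTION_AND_MONITORING = 3
    OUTPUT_STAGING = 4
    ENV_CLEANUP = 5


@dataclass
class Task:
    task_type: TaskType


@dataclass
class Process:
    application: str  # hypothetical identifier of the application to run
    tasks: List[Task] = field(default_factory=list)

    def plan_tasks(self) -> None:
        # Executing a process creates one task per step of the sequence.
        self.tasks = [Task(task_type) for task_type in TaskType]


@dataclass
class Experiment:
    name: str
    # One entry models a single application, several entries model a workflow.
    applications: List[str]

    def derive_processes(self) -> List[Process]:
        return [Process(app) for app in self.applications]


single = Experiment("single-app", ["app-a"])
workflow = Experiment("workflow", ["app-b", "app-c"])

for experiment in (single, workflow):
    for process in experiment.derive_processes():
        process.plan_tasks()
        print(experiment.name, process.application,
              [task.task_type.name for task in process.tasks])
```

The single-application experiment yields one process and the workflow yields one process per application; each process then expands into the same five-step task sequence.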

Please review the initial draft:
https://github.com/apache/airavata/tree/master/thrift-interface-descriptions/airavata-data-models

Assume lazy consensus and update the models; let’s iteratively review and update these Thrift IDLs. We don’t yet need to dive into code generation until these are close to final.

@Supun, maybe you can start thinking about the database representation of these models; assume the details will change but the general structure might remain.

Cheers,
Suresh

Re: [DISCUSS] Data models for 0.16 and beyond

Posted by Supun Nakandala <su...@gmail.com>.

As a starting point, you can check this class:
https://github.com/apache/airavata/blob/master/modules/registry/registry-core/src/main/java/org/apache/airavata/registry/core/experiment/catalog/impl/ExperimentCatalogImpl.java


On Wed, Jul 1, 2015 at 8:59 AM, John Weachock <jw...@gmail.com> wrote:

> Excellent! Do you have any pointers to where I can look to start reading
> the code and begin adding the necessary features to the registry?
>
> Thanks!


-- 
Thank you
Supun Nakandala
Dept. Computer Science and Engineering
University of Moratuwa

Re: [DISCUSS] Data models for 0.16 and beyond

Posted by John Weachock <jw...@gmail.com>.

Excellent! Do you have any pointers to where I can look to start reading
the code and begin adding the necessary features to the registry?

Thanks!

On Tue, Jun 30, 2015 at 5:05 AM, Supun Nakandala <su...@gmail.com>
wrote:

> Hi John,
>
> Even though the current Thrift models have only one status entry, in the
> database we maintain all the state transitions (i.e., all the status
> entries). But when retrieving an experiment, process, task, or job, only
> the latest status is returned, based on the creation timestamp. So at the
> registry level we can support your requirement. What is still needed is
> the Thrift models to transfer that data via the APIs/CPIs.

Re: [DISCUSS] Data models for 0.16 and beyond

Posted by Supun Nakandala <su...@gmail.com>.

Hi John,

Even though the current Thrift models have only one status entry, in the
database we maintain all the state transitions (i.e., all the status
entries). But when retrieving an experiment, process, task, or job, only
the latest status is returned, based on the creation timestamp. So at the
registry level we can support your requirement. What is still needed is
the Thrift models to transfer that data via the APIs/CPIs.
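
As a rough illustration of the behaviour described above, the following Python sketch contrasts returning only the latest status with returning the whole history. It is not the registry code: the `StatusEntry` type, the state names, and the integer timestamps are assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class StatusEntry:
    state: str       # hypothetical state name
    created_at: int  # creation timestamp, assumed to be epoch milliseconds


def latest_status(history: List[StatusEntry]) -> StatusEntry:
    # Retrieval as described: pick the entry with the newest creation timestamp.
    return max(history, key=lambda entry: entry.created_at)


def full_history(history: List[StatusEntry]) -> List[StatusEntry]:
    # What a list-valued model field could carry: every transition, oldest first.
    return sorted(history, key=lambda entry: entry.created_at)


# Rows as they might come back from the database, in no particular order.
rows = [
    StatusEntry("EXECUTING", 2000),
    StatusEntry("CREATED", 1000),
    StatusEntry("COMPLETED", 5000),
]

print(latest_status(rows).state)
print([entry.state for entry in full_history(rows)])
```

Since all entries are already stored, exposing the history would mostly be a matter of carrying the sorted list through the models instead of a single entry.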


On Tue, Jun 30, 2015 at 1:28 PM, John Weachock <jw...@gmail.com> wrote:

> Hi Supun,
>
> Sorry for sending this message so late!
>
> Last week I discussed a change to the data models with Suresh regarding
> task / job / experiment / etc. statuses. Currently, each item has a single
> status ID that points to a status that's updated on every change. However,
> if each item contained a *list* of status IDs, and each status change
> created a *new* status entry, we could record data about experiment run
> times, which could be used in future versions to assist in benchmarking
> and runtime prediction efforts. Additionally, users could be shown the
> progression of their experiment.
>
> Thanks,
>
> John


-- 
Thank you
Supun Nakandala
Dept. Computer Science and Engineering
University of Moratuwa

Re: [DISCUSS] Data models for 0.16 and beyond

Posted by John Weachock <jw...@gmail.com>.

Hi Supun,

Sorry for sending this message so late!

Last week I discussed a change to the data models with Suresh regarding
task / job / experiment / etc. statuses. Currently, each item has a single
status ID that points to a status that's updated on every change. However,
if each item contained a *list* of status IDs, and each status change
created a *new* status entry, we could record data about experiment run
times, which could be used in future versions to assist in benchmarking
and runtime prediction efforts. Additionally, users could be shown the
progression of their experiment.
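
A small Python sketch of how run-time data could be derived from such a list of status entries is given here. It assumes each entry is a (state, timestamp) pair; the state names and timestamps are invented for the example and do not come from the Airavata models.

```python
from typing import Dict, List, Tuple

# (state, creation timestamp in seconds); values are made up for illustration.
StatusEntry = Tuple[str, int]


def time_in_each_state(history: List[StatusEntry]) -> Dict[str, int]:
    """Sum the seconds spent in each state, using the next entry as the end time."""
    ordered = sorted(history, key=lambda entry: entry[1])
    durations: Dict[str, int] = {}
    # Pair every entry with its successor; the final state has no end time yet.
    for (state, start), (_, end) in zip(ordered, ordered[1:]):
        durations[state] = durations.get(state, 0) + (end - start)
    return durations


history = [
    ("CREATED", 0),
    ("QUEUED", 5),
    ("EXECUTING", 65),
    ("COMPLETED", 365),
]

print(time_in_each_state(history))
```

With a single overwritten status none of these intervals can be recovered, which is the motivation for keeping one entry per change.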

Thanks,

John

On Sun, Jun 14, 2015 at 1:56 PM, Supun Nakandala <su...@gmail.com>
wrote:

> Hi All,
>
> I came up with the initial version of the schema for the new experiment
> catalog. It is very similar to the existing model, with a few changes:
>
> 1. In Experiments I have used one text field for email addresses, with the
> intention of storing a comma-separated email list. The idea was to avoid
> another DB table join. Likewise, in the Errors tables I have used a single
> text field for storing parent error IDs, with the same intention.
>
> 2. I have used separate tables for ExperimentErrors, ProcessErrors, and
> TaskErrors rather than a single Errors table. The idea is to avoid
> composite IDs (with some IDs null) and to avoid filtering for the correct
> type of error at the code level (for example, when retrieving experiment
> errors). This also eases data retrieval at the JPA level. I have used the
> same concept for the Statuses, Inputs, and Outputs tables.
>
> 3. Since there are some performance issues in PGA-related operations when
> retrieving experiment-related data, I created a view called
> experiment_summaries, which underneath joins several tables and gives the
> required data in one view. We can create a JPA model for this view and use
> it for PGA-related operations (including some of the Admin Dashboard
> operations). I hope this will solve the issue.
>
> I have attached the schema diagram herewith. Please check it and let me
> know if anything is wrong or needs to be changed or improved.
>
> If things look good, as the next step I would like to suggest that we
> brainstorm different queries that we will run on this data and check
> whether the data model can support those queries and the expected
> performance.
>
> Thanks
> Supun

Re: [DISCUSS] Data models for 0.16 and beyond

Posted by Supun Nakandala <su...@gmail.com>.

Hi All,

I came up with the initial version of the schema for the new experiment
catalog. It is very similar to the existing model, with a few changes:

1. In Experiments I have used one text field for email addresses, with the
intention of storing a comma-separated email list. The idea was to avoid
another DB table join. Likewise, in the Errors tables I have used a single
text field for storing parent error IDs, with the same intention.

2. I have used separate tables for ExperimentErrors, ProcessErrors, and
TaskErrors rather than a single Errors table. The idea is to avoid
composite IDs (with some IDs null) and to avoid filtering for the correct
type of error at the code level (for example, when retrieving experiment
errors). This also eases data retrieval at the JPA level. I have used the
same concept for the Statuses, Inputs, and Outputs tables.

3. Since there are some performance issues in PGA-related operations when
retrieving experiment-related data, I created a view called
experiment_summaries, which underneath joins several tables and gives the
required data in one view. We can create a JPA model for this view and use
it for PGA-related operations (including some of the Admin Dashboard
operations). I hope this will solve the issue.
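
The three points above can be sketched with an in-memory SQLite database from Python. This is only an assumption-laden illustration, not the attached schema: apart from the view name experiment_summaries, every table name, column name, and sample value below is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE experiment (
        experiment_id TEXT PRIMARY KEY,
        name          TEXT,
        email_list    TEXT  -- point 1: comma-separated addresses, no extra table
    );
    -- Point 2: per-type status and error tables instead of one shared table.
    CREATE TABLE experiment_status (
        experiment_id TEXT,
        state         TEXT,
        created_at    INTEGER
    );
    CREATE TABLE experiment_error (
        experiment_id TEXT,
        message       TEXT
    );
    -- Point 3: one view that joins the tables a summary page needs.
    CREATE VIEW experiment_summaries AS
    SELECT e.experiment_id,
           e.name,
           s.state          AS latest_state,
           COUNT(x.message) AS error_count
    FROM experiment e
    LEFT JOIN experiment_status s
           ON s.experiment_id = e.experiment_id
          AND s.created_at = (SELECT MAX(s2.created_at)
                              FROM experiment_status s2
                              WHERE s2.experiment_id = e.experiment_id)
    LEFT JOIN experiment_error x
           ON x.experiment_id = e.experiment_id
    GROUP BY e.experiment_id, e.name, s.state;

    INSERT INTO experiment VALUES ('exp-1', 'demo', 'a@example.org,b@example.org');
    INSERT INTO experiment VALUES ('exp-2', 'broken', 'c@example.org');
    INSERT INTO experiment_status VALUES ('exp-1', 'CREATED', 1);
    INSERT INTO experiment_status VALUES ('exp-1', 'COMPLETED', 2);
    INSERT INTO experiment_status VALUES ('exp-2', 'FAILED', 3);
    INSERT INTO experiment_error VALUES ('exp-2', 'job submission failed');
""")

# A summary listing now needs a single query against the view.
rows = conn.execute(
    "SELECT experiment_id, latest_state, error_count "
    "FROM experiment_summaries ORDER BY experiment_id"
).fetchall()
print(rows)

# The comma-separated field is split in application code, not with a join.
emails = conn.execute(
    "SELECT email_list FROM experiment WHERE experiment_id = 'exp-1'"
).fetchone()[0].split(",")
print(emails)
```

The trade-off of the comma-separated field is that individual addresses cannot be indexed or constrained by the database, which seems acceptable if they are only ever read back together with the experiment.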

I have attached the schema diagram herewith. Please check it and let me
know if anything is wrong or needs to be changed or improved.

If things look good, as the next step I would like to suggest that we
brainstorm different queries that we will run on this data and check
whether the data model can support those queries and the expected
performance.

Thanks
Supun

On Fri, Jun 12, 2015 at 6:39 PM, Suresh Marru <sm...@apache.org> wrote:

> Hi All,
>
> With the experience of adapting Thrift data models for Airavata over the
> past couple of years, it's time for us to revisit them. The most
> persistent criticism has been that the data models are complex. In
> addition, the data models and the architecture evolved in parallel, and
> the implementations did not always match the intended models. In an effort
> to address these issues, let's first discuss the minimal required data
> models.
>
> We need to conform the models to the general principle of Experiments
> deriving into a Process or a Workflow. For a single application, a process
> can be derived directly from the Experiment Details. For workflows,
> multiple processes are created. Executing a process leads to the creation
> of multiple Tasks. A Task is a general type that is enacted at run time
> based on a generic execution sequence of environment setup, data input
> staging, application execution and monitoring, data output staging, and
> environment cleanup.
>
> Please review the initial draft:
>
> https://github.com/apache/airavata/tree/master/thrift-interface-descriptions/airavata-data-models
>
> Assume lazy consensus and update the models; let’s iteratively review and
> update these Thrift IDLs. We don’t yet need to dive into code generation
> until these are close to final.
>
> @Supun, maybe you can start thinking about the database representation of
> these models; assume the details will change but the general structure
> might remain.
>
> Cheers,
> Suresh
>

-- 
Thank you
Supun Nakandala
Dept. Computer Science and Engineering
University of Moratuwa