Posted to dev@apex.apache.org by Sandesh Hegde <sa...@datatorrent.com> on 2015/11/10 23:34:04 UTC

Apex-119 - Distributed Operator design discussion

Hello All,

Tim and I started working on APEX-119
<https://malhar.atlassian.net/browse/APEX-119> and came up with the
following design document.

The idea is to treat all the partitions of an operator as a single unit: they
all work on the same window, and if one of them fails, all the
partitions are brought back to a common checkpoint.
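
As a rough illustration of the recovery rule described above, the sketch below picks the most recent checkpoint that every partition holds. It assumes each partition's checkpoints can be represented as a set of window IDs; the class and method names are hypothetical and are not taken from the design document or the Apex API.

```java
import java.util.Map;
import java.util.OptionalLong;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration only; not part of the Apex API.
class CommonCheckpoint {

    // Returns the most recent window ID that every partition has checkpointed,
    // or empty if the partitions share no checkpoint at all.
    static OptionalLong latestCommon(Map<Integer, Set<Long>> checkpointsByPartition) {
        TreeSet<Long> common = null;
        for (Set<Long> checkpoints : checkpointsByPartition.values()) {
            if (common == null) {
                common = new TreeSet<>(checkpoints);   // start from the first partition
            } else {
                common.retainAll(checkpoints);         // keep only shared window IDs
            }
        }
        if (common == null || common.isEmpty()) {
            return OptionalLong.empty();
        }
        return OptionalLong.of(common.last());
    }

    public static void main(String[] args) {
        // Assumed example data: partition ID -> checkpointed window IDs.
        Map<Integer, Set<Long>> checkpoints = Map.of(
            0, Set.of(10L, 20L, 30L),
            1, Set.of(10L, 20L),
            2, Set.of(10L, 20L, 30L, 40L));
        // If any partition fails, all partitions would be reset to this window.
        System.out.println(latestCommon(checkpoints));
    }
}
```

With this rule, a partition that checkpoints less often than the others holds back the recovery point for the whole group.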

You can comment on the document. Once it is finalized, we will attach it
to the Jira.

https://docs.google.com/document/d/1Rau76WxAycyN9vQqP2bqDWZAwLw0u23xSh0_5fQ1980/edit?usp=sharing

Thanks
Sandesh

Re: Apex-119 - Distributed Operator design discussion

Posted by Chandni Singh <ch...@datatorrent.com>.

Hey Tim/Sandesh,

While searching for a doc, I stumbled on a document about the Chord protocol.
I don't know how relevant it is to this project, but I wanted to
share it.
Here is the doc link:
https://docs.google.com/document/d/12iOWPaA82g3JahjUflEyvirQA4_hL8SrGyit_Iunobk/edit

It is also a standard protocol, so you can look it up and find more
information.
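
For readers unfamiliar with Chord, here is a minimal sketch of its key placement rule: a key belongs to the first node whose identifier is equal to or follows the key on the identifier ring, wrapping around at the end. The sketch assumes a global view of the ring and small hand-picked integer IDs; real Chord hashes node and key names and routes lookups through finger tables, which are omitted here. The class name is hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Simplified illustration of Chord's successor rule; not a full Chord implementation.
class ChordRing {
    // Node ID -> node name, kept sorted by ID to model the identifier ring.
    private final TreeMap<Integer, String> nodes = new TreeMap<>();

    void join(int nodeId, String name) {
        nodes.put(nodeId, name);
    }

    // Returns the node responsible for keyId: the first node with ID >= keyId,
    // or the lowest-ID node if keyId is past the last node (wrap-around).
    String successor(int keyId) {
        if (nodes.isEmpty()) {
            throw new IllegalStateException("ring has no nodes");
        }
        Map.Entry<Integer, String> entry = nodes.ceilingEntry(keyId);
        return (entry != null ? entry : nodes.firstEntry()).getValue();
    }

    public static void main(String[] args) {
        ChordRing ring = new ChordRing();
        ring.join(10, "A");
        ring.join(40, "B");
        ring.join(90, "C");
        System.out.println(ring.successor(25)); // key 25 falls between nodes 10 and 40
        System.out.println(ring.successor(95)); // key 95 is past the last node and wraps
    }
}
```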

Thanks,
Chandni

On Fri, Nov 13, 2015 at 10:55 AM, Timothy Farkas <ti...@datatorrent.com> wrote:

> Sandesh and I have created some slides outlining some more possible design
> approaches along with their pros and cons.
>
> https://docs.google.com/presentation/d/1-gWwwq4Dd7g9Mai7XLlzA7R_F9nMqEWg3IKMD3OMbYE/edit?usp=sharing
>
> Please review and comment
>
> Thanks,
> Tim
>
> On Wed, Nov 11, 2015 at 11:20 PM, Amol Kekre <am...@datatorrent.com> wrote:
>
> > This feature should be false by default. That way it will need to be an
> > explicit user ask (attribute?) and then on degradation in performance etc.
> > is user chosen.
> >
> > Amol
> >
> > On Wed, Nov 11, 2015 at 10:57 PM, Gaurav Gupta <ga...@datatorrent.com> wrote:
> >
> > > Is there a way to disable/ enable this feature? Synchronizing all the
> > > partitions and bringing all the partitions to same common checkpoint post
> > > failure would affect performance.
> > >
> > > Thanks
> > > - Gaurav
> > >
> > > > On Nov 11, 2015, at 10:50 PM, Thomas Weise <th...@datatorrent.com> wrote:
> > > >
> > > > I would like to better understand the target use cases. This will also help
> > > > to analyze trade-offs.
> > > >
> > > > The proposal of synchronizing all partitions at a window boundary affects
> > > > scalability, adds latency and dictates reset of all partitions on operator
> > > > failure.
> > > >
> > > > There are different levels of support for such "distributed data
> > > > structure". For example, limiting each partition to single writer and
> > > > version based reads would allow for relaxation of synchronization needs.
> > > > Again, goals and pros and cons of different approaches need to be discussed.
> > > >
> > > > On Tue, Nov 10, 2015 at 2:34 PM, Sandesh Hegde <sandesh@datatorrent.com> wrote:
> > > >
> > > >> Hello All,
> > > >>
> > > >> Tim & I started working on Apex 119
> > > >> <https://malhar.atlassian.net/browse/APEX-119> and came up with the
> > > >> following design document.
> > > >>
> > > >> Idea is to treat all the partitions of an operator as a single unit, they
> > > >> all will work on the same window and if one of them fails all the
> > > >> partitions are brought back to common checkpoint.
> > > >>
> > > >> You can comment on the document, once it is finalized, we will attach the
> > > >> document to Jira.
> > > >>
> > > >> https://docs.google.com/document/d/1Rau76WxAycyN9vQqP2bqDWZAwLw0u23xSh0_5fQ1980/edit?usp=sharing
> > > >>
> > > >> Thanks
> > > >> Sandesh

Re: Apex-119 - Distributed Operator design discussion

Posted by Timothy Farkas <ti...@datatorrent.com>.

Sandesh and I have created some slides outlining some more possible design
approaches along with their pros and cons.

https://docs.google.com/presentation/d/1-gWwwq4Dd7g9Mai7XLlzA7R_F9nMqEWg3IKMD3OMbYE/edit?usp=sharing

Please review and comment

Thanks,
Tim

Re: Apex-119 - Distributed Operator design discussion

Posted by Amol Kekre <am...@datatorrent.com>.

This feature should be off by default. That way it has to be an
explicit user request (an attribute?), and any resulting degradation in
performance etc. is the user's choice.

Amol


Re: Apex-119 - Distributed Operator design discussion

Posted by Gaurav Gupta <ga...@datatorrent.com>.

Is there a way to enable/disable this feature? Synchronizing all the partitions and bringing them all back to the same common checkpoint after a failure would affect performance.

Thanks
- Gaurav

Re: Apex-119 - Distributed Operator design discussion

Posted by Thomas Weise <th...@datatorrent.com>.

I would like to better understand the target use cases. This will also help
to analyze trade-offs.

The proposal of synchronizing all partitions at a window boundary affects
scalability, adds latency, and dictates a reset of all partitions on operator
failure.

There are different levels of support for such a "distributed data
structure". For example, limiting each partition to a single writer and using
version-based reads would allow the synchronization requirements to be relaxed.
Again, the goals and the pros and cons of the different approaches need to be discussed.
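
To make the single-writer, version-based alternative concrete, here is one possible sketch. It assumes the version is the window ID and that each partition writes only to its own history, while any partition may read another's state as of a given window. All names are hypothetical; this is not an Apex API, and it is only one of several ways such a structure could be built.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical illustration of single-writer, versioned reads; not part of Apex.
class VersionedStore<V> {
    // One version history per owning partition: window ID -> value.
    private final Map<Integer, ConcurrentSkipListMap<Long, V>> histories =
        new ConcurrentHashMap<>();

    // Intended to be called only by the partition that owns partitionId (single writer).
    void write(int partitionId, long windowId, V value) {
        histories.computeIfAbsent(partitionId, id -> new ConcurrentSkipListMap<>())
                 .put(windowId, value);
    }

    // Any partition may read. Returns the newest value written at or before
    // windowId, or null if the owner has written nothing that early.
    V readAsOf(int partitionId, long windowId) {
        ConcurrentSkipListMap<Long, V> history = histories.get(partitionId);
        if (history == null) {
            return null;
        }
        Map.Entry<Long, V> entry = history.floorEntry(windowId);
        return entry == null ? null : entry.getValue();
    }

    public static void main(String[] args) {
        VersionedStore<String> store = new VersionedStore<>();
        store.write(0, 5L, "a");
        store.write(0, 8L, "b");
        // A reader working on window 7 sees partition 0's state as of window 5.
        System.out.println(store.readAsOf(0, 7L));
    }
}
```

Because a reader asks for a specific version, it does not need the writer to be on the same window at the same moment, which is where the relaxation of synchronization would come from.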
