You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@apex.apache.org by Ashwin Chandra Putta <as...@gmail.com> on 2015/12/15 08:05:17 UTC

operator recovery window

In the apex architecture there is concept of checkpointing and concept of
committed when all operator have crossed a common checkpoint.

So, in which scenarios does a given operator recover at last checkpoint
window vs last committed window vs some other checkpoint window in between?
-- 

Regards,
Ashwin.

Re: operator recovery window

Posted by Ashwin Chandra Putta <as...@gmail.com>.
Well, that is true in any case irrespective of recovery at committed window
or further checkpointed window; that we start reprocessing at a window
after recovered checkpointed window, as the checkpoint state is after the
end window of the window that is checkpointed and before the next window
begins.

Here is a summary for future reference.

At operator failure,

1. The operator recovers at its largest checkpointed window that is less
than or equal to the largest common checkpointed window of all the
downstream operators.
2. For output operators, they always recover at their largest checkpointed
window (as they have no dependency on downstream operators)
3. The operator will recover at committed window when there is no
checkpoint for the operator or for any of the downstream operators after
the committed window.

Regards,
Ashwin.

On Tue, Dec 15, 2015 at 2:35 PM, Timothy Farkas <ti...@datatorrent.com> wrote:

> Yes and no.
>
> Window 30 could be committed and we could restore to the corresponding
> checkpoint. The corresponding checkpoint for window 30 is taken after
> window 30 is done, but before window 31 has begun. So when we restore to
> the checkpoint we will not redo window 30, we will start at window 31. Once
> a window id is committed it will never be redone by any operator.
>
> On Tue, Dec 15, 2015 at 2:04 PM, Ashwin Chandra Putta <
> ashwinchandrap@gmail.com> wrote:
>
> > Siyuan,
> >
> > Yes, we are discussing at least once semantics.
> >
> > Tim,
> >
> > So it is indeed possible that we recover at committed window id in a case
> > where we just committed and there were no further checkpoints before
> > failure.
> >
> > Regards,
> > Ashwin.
> >
> > On Tue, Dec 15, 2015 at 1:54 PM, Timothy Farkas <ti...@datatorrent.com>
> > wrote:
> >
> > > Whoops my bad, that would never happen. There is a check that only
> allows
> > > purging of checkpoints for an operator if the operator has more than
> one
> > > checkpoint. :)
> > >
> > > On Tue, Dec 15, 2015 at 1:39 PM, Timothy Farkas <ti...@datatorrent.com>
> > > wrote:
> > >
> > > > Siyuan, then Ashwin may be right that there is an issue. Looking at
> the
> > > > code again I think this could happen:
> > > >
> > > > 1 - All operators reach checkpiont 30
> > > > 2 - Checkpoints are updated on heartbeat and committed window is now
> > 25,
> > > > everything before window 30 is purged
> > > > 3 - no new checkpoint is reached for any operator
> > > > 4 - Checkpoints are updated on heartbeat again and committed window
> is
> > > now
> > > > 30, now window 30 is purged.
> > > >
> > > > May be missing something again though.
> > > >
> > > > On Tue, Dec 15, 2015 at 1:32 PM, Siyuan Hua <si...@datatorrent.com>
> > > > wrote:
> > > >
> > > >> My understanding is the committed window could possibly be 30 as
> well,
> > > >> depends on whether container manager get heart beat from containers.
> > > >>
> > > >> And I guess the discussion is assuming at_least_once semantic? :)
> > > >> at_most_once should have different recovery window.
> > > >>
> > > >> On Tue, Dec 15, 2015 at 12:01 PM, Timothy Farkas <
> tim@datatorrent.com
> > >
> > > >> wrote:
> > > >>
> > > >> > Hi Ashwin,
> > > >> >
> > > >> > In your example, if A fails the recovery windows would be
> > > >> >
> > > >> > D - 15
> > > >> > C - 15
> > > >> > B - 15
> > > >> > A - 15
> > > >> >
> > > >> > If C fails the recovery windows would be
> > > >> >
> > > >> > D -15
> > > >> > C -15
> > > >> > B - 25
> > > >> > A - 30
> > > >> >
> > > >> > If every operator just reached window 30 and checkpointed, the
> > > committed
> > > >> > window would be 25, and all the checkpoints before window 30 would
> > be
> > > >> > purged, but the checkpoint for window 30 would not be purged.
> > > >> >
> > > >> > Thanks,
> > > >> > Tim
> > > >> >
> > > >> > On Tue, Dec 15, 2015 at 11:41 AM, Ashwin Chandra Putta <
> > > >> > ashwinchandrap@gmail.com> wrote:
> > > >> >
> > > >> > > Tim,
> > > >> > >
> > > >> > > Thanks, that is pretty much inline with what I was thinking. A
> > > little
> > > >> > > different thought though in terms of picking the checkpoint
> based
> > on
> > > >> > > downstream operators. For A, is it not going to be "the
> checkpoint
> > > >> with
> > > >> > the
> > > >> > > largest window id that is less than or equal to the checkpoint
> > with
> > > >> the
> > > >> > > largest common window id (instead of largest window id) among
> all
> > > the
> > > >> > > operators down stream to A"
> > > >> > >
> > > >> > > For example,
> > > >> > >
> > > >> > > If A -> B -> C -> D is the dag. And say, the checkpoint window
> > count
> > > >> is 5
> > > >> > > and the largest checkpoints are as follows.
> > > >> > >
> > > >> > > A - 30
> > > >> > > B - 25
> > > >> > > C - 20
> > > >> > > D - 15
> > > >> > >
> > > >> > > Does A recover at 25 (checkpoint with largest window id) or 15
> > > >> > (checkpoint
> > > >> > > with largest common window id)?
> > > >> > >
> > > >> > > Also, regarding recovering at committed window id. Is it not
> > > possible
> > > >> in
> > > >> > > the following scenario where all operators have checkpointed at
> 30
> > > and
> > > >> > got
> > > >> > > the committed window call back. And then an operator fails
> before
> > > any
> > > >> > > operator checkpoints further. In that case, the recovery window
> is
> > > 30
> > > >> > > right?
> > > >> > >
> > > >> > > Regards,
> > > >> > > Ashwin.
> > > >> > >
> > > >> > > On Mon, Dec 14, 2015 at 11:58 PM, Timothy Farkas <
> > > tim@datatorrent.com
> > > >> >
> > > >> > > wrote:
> > > >> > >
> > > >> > > > Hi Ashwin,
> > > >> > > >
> > > >> > > > The recovery checkpoint for operator A is computed by taking
> the
> > > >> > > checkpoint
> > > >> > > > with the largest window id that is less than or equal to the
> > > >> checkpoint
> > > >> > > > with the largest window id among all the operators down stream
> > to
> > > A.
> > > >> > The
> > > >> > > > output operators in a dag will always recover to their most
> > recent
> > > >> > > > checkpoint. The input operator of the dag may recover to the
> > > >> earliest
> > > >> > > > checkpoint. Operators between the input and ouput operators
> > could
> > > >> > recover
> > > >> > > > to a window in between.
> > > >> > > >
> > > >> > > > I don't think you can ever recover to a committed window, the
> > > >> earliest
> > > >> > I
> > > >> > > > think you can recover to is the window after the committed
> > window
> > > >> (may
> > > >> > be
> > > >> > > > wrong on this).
> > > >> > > >
> > > >> > > > On Mon, Dec 14, 2015 at 11:05 PM, Ashwin Chandra Putta <
> > > >> > > > ashwinchandrap@gmail.com> wrote:
> > > >> > > >
> > > >> > > > > In the apex architecture there is concept of checkpointing
> and
> > > >> > concept
> > > >> > > of
> > > >> > > > > committed when all operator have crossed a common
> checkpoint.
> > > >> > > > >
> > > >> > > > > So, in which scenarios does a given operator recover at last
> > > >> > checkpoint
> > > >> > > > > window vs last committed window vs some other checkpoint
> > window
> > > in
> > > >> > > > between?
> > > >> > > > > --
> > > >> > > > >
> > > >> > > > > Regards,
> > > >> > > > > Ashwin.
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> > > --
> > > >> > >
> > > >> > > Regards,
> > > >> > > Ashwin.
> > > >> > >
> > > >> >
> > > >>
> > > >
> > > >
> > >
> >
> >
> >
> > --
> >
> > Regards,
> > Ashwin.
> >
>



-- 

Regards,
Ashwin.

Re: operator recovery window

Posted by Timothy Farkas <ti...@datatorrent.com>.
Yes and no.

Window 30 could be committed and we could restore to the corresponding
checkpoint. The corresponding checkpoint for window 30 is taken after
window 30 is done, but before window 31 has begun. So when we restore to
the checkpoint we will not redo window 30, we will start at window 31. Once
a window id is committed it will never be redone by any operator.

On Tue, Dec 15, 2015 at 2:04 PM, Ashwin Chandra Putta <
ashwinchandrap@gmail.com> wrote:

> Siyuan,
>
> Yes, we are discussing at least once semantics.
>
> Tim,
>
> So it is indeed possible that we recover at committed window id in a case
> where we just committed and there were no further checkpoints before
> failure.
>
> Regards,
> Ashwin.
>
> On Tue, Dec 15, 2015 at 1:54 PM, Timothy Farkas <ti...@datatorrent.com>
> wrote:
>
> > Whoops my bad, that would never happen. There is a check that only allows
> > purging of checkpoints for an operator if the operator has more than one
> > checkpoint. :)
> >
> > On Tue, Dec 15, 2015 at 1:39 PM, Timothy Farkas <ti...@datatorrent.com>
> > wrote:
> >
> > > Siyuan, then Ashwin may be right that there is an issue. Looking at the
> > > code again I think this could happen:
> > >
> > > 1 - All operators reach checkpiont 30
> > > 2 - Checkpoints are updated on heartbeat and committed window is now
> 25,
> > > everything before window 30 is purged
> > > 3 - no new checkpoint is reached for any operator
> > > 4 - Checkpoints are updated on heartbeat again and committed window is
> > now
> > > 30, now window 30 is purged.
> > >
> > > May be missing something again though.
> > >
> > > On Tue, Dec 15, 2015 at 1:32 PM, Siyuan Hua <si...@datatorrent.com>
> > > wrote:
> > >
> > >> My understanding is the committed window could possibly be 30 as well,
> > >> depends on whether container manager get heart beat from containers.
> > >>
> > >> And I guess the discussion is assuming at_least_once semantic? :)
> > >> at_most_once should have different recovery window.
> > >>
> > >> On Tue, Dec 15, 2015 at 12:01 PM, Timothy Farkas <tim@datatorrent.com
> >
> > >> wrote:
> > >>
> > >> > Hi Ashwin,
> > >> >
> > >> > In your example, if A fails the recovery windows would be
> > >> >
> > >> > D - 15
> > >> > C - 15
> > >> > B - 15
> > >> > A - 15
> > >> >
> > >> > If C fails the recovery windows would be
> > >> >
> > >> > D -15
> > >> > C -15
> > >> > B - 25
> > >> > A - 30
> > >> >
> > >> > If every operator just reached window 30 and checkpointed, the
> > committed
> > >> > window would be 25, and all the checkpoints before window 30 would
> be
> > >> > purged, but the checkpoint for window 30 would not be purged.
> > >> >
> > >> > Thanks,
> > >> > Tim
> > >> >
> > >> > On Tue, Dec 15, 2015 at 11:41 AM, Ashwin Chandra Putta <
> > >> > ashwinchandrap@gmail.com> wrote:
> > >> >
> > >> > > Tim,
> > >> > >
> > >> > > Thanks, that is pretty much inline with what I was thinking. A
> > little
> > >> > > different thought though in terms of picking the checkpoint based
> on
> > >> > > downstream operators. For A, is it not going to be "the checkpoint
> > >> with
> > >> > the
> > >> > > largest window id that is less than or equal to the checkpoint
> with
> > >> the
> > >> > > largest common window id (instead of largest window id) among all
> > the
> > >> > > operators down stream to A"
> > >> > >
> > >> > > For example,
> > >> > >
> > >> > > If A -> B -> C -> D is the dag. And say, the checkpoint window
> count
> > >> is 5
> > >> > > and the largest checkpoints are as follows.
> > >> > >
> > >> > > A - 30
> > >> > > B - 25
> > >> > > C - 20
> > >> > > D - 15
> > >> > >
> > >> > > Does A recover at 25 (checkpoint with largest window id) or 15
> > >> > (checkpoint
> > >> > > with largest common window id)?
> > >> > >
> > >> > > Also, regarding recovering at committed window id. Is it not
> > possible
> > >> in
> > >> > > the following scenario where all operators have checkpointed at 30
> > and
> > >> > got
> > >> > > the committed window call back. And then an operator fails before
> > any
> > >> > > operator checkpoints further. In that case, the recovery window is
> > 30
> > >> > > right?
> > >> > >
> > >> > > Regards,
> > >> > > Ashwin.
> > >> > >
> > >> > > On Mon, Dec 14, 2015 at 11:58 PM, Timothy Farkas <
> > tim@datatorrent.com
> > >> >
> > >> > > wrote:
> > >> > >
> > >> > > > Hi Ashwin,
> > >> > > >
> > >> > > > The recovery checkpoint for operator A is computed by taking the
> > >> > > checkpoint
> > >> > > > with the largest window id that is less than or equal to the
> > >> checkpoint
> > >> > > > with the largest window id among all the operators down stream
> to
> > A.
> > >> > The
> > >> > > > output operators in a dag will always recover to their most
> recent
> > >> > > > checkpoint. The input operator of the dag may recover to the
> > >> earliest
> > >> > > > checkpoint. Operators between the input and ouput operators
> could
> > >> > recover
> > >> > > > to a window in between.
> > >> > > >
> > >> > > > I don't think you can ever recover to a committed window, the
> > >> earliest
> > >> > I
> > >> > > > think you can recover to is the window after the committed
> window
> > >> (may
> > >> > be
> > >> > > > wrong on this).
> > >> > > >
> > >> > > > On Mon, Dec 14, 2015 at 11:05 PM, Ashwin Chandra Putta <
> > >> > > > ashwinchandrap@gmail.com> wrote:
> > >> > > >
> > >> > > > > In the apex architecture there is concept of checkpointing and
> > >> > concept
> > >> > > of
> > >> > > > > committed when all operator have crossed a common checkpoint.
> > >> > > > >
> > >> > > > > So, in which scenarios does a given operator recover at last
> > >> > checkpoint
> > >> > > > > window vs last committed window vs some other checkpoint
> window
> > in
> > >> > > > between?
> > >> > > > > --
> > >> > > > >
> > >> > > > > Regards,
> > >> > > > > Ashwin.
> > >> > > > >
> > >> > > >
> > >> > >
> > >> > >
> > >> > >
> > >> > > --
> > >> > >
> > >> > > Regards,
> > >> > > Ashwin.
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>
>
>
> --
>
> Regards,
> Ashwin.
>

Re: operator recovery window

Posted by Ashwin Chandra Putta <as...@gmail.com>.
Siyuan,

Yes, we are discussing at least once semantics.

Tim,

So it is indeed possible that we recover at committed window id in a case
where we just committed and there were no further checkpoints before
failure.

Regards,
Ashwin.

On Tue, Dec 15, 2015 at 1:54 PM, Timothy Farkas <ti...@datatorrent.com> wrote:

> Whoops my bad, that would never happen. There is a check that only allows
> purging of checkpoints for an operator if the operator has more than one
> checkpoint. :)
>
> On Tue, Dec 15, 2015 at 1:39 PM, Timothy Farkas <ti...@datatorrent.com>
> wrote:
>
> > Siyuan, then Ashwin may be right that there is an issue. Looking at the
> > code again I think this could happen:
> >
> > 1 - All operators reach checkpiont 30
> > 2 - Checkpoints are updated on heartbeat and committed window is now 25,
> > everything before window 30 is purged
> > 3 - no new checkpoint is reached for any operator
> > 4 - Checkpoints are updated on heartbeat again and committed window is
> now
> > 30, now window 30 is purged.
> >
> > May be missing something again though.
> >
> > On Tue, Dec 15, 2015 at 1:32 PM, Siyuan Hua <si...@datatorrent.com>
> > wrote:
> >
> >> My understanding is the committed window could possibly be 30 as well,
> >> depends on whether container manager get heart beat from containers.
> >>
> >> And I guess the discussion is assuming at_least_once semantic? :)
> >> at_most_once should have different recovery window.
> >>
> >> On Tue, Dec 15, 2015 at 12:01 PM, Timothy Farkas <ti...@datatorrent.com>
> >> wrote:
> >>
> >> > Hi Ashwin,
> >> >
> >> > In your example, if A fails the recovery windows would be
> >> >
> >> > D - 15
> >> > C - 15
> >> > B - 15
> >> > A - 15
> >> >
> >> > If C fails the recovery windows would be
> >> >
> >> > D -15
> >> > C -15
> >> > B - 25
> >> > A - 30
> >> >
> >> > If every operator just reached window 30 and checkpointed, the
> committed
> >> > window would be 25, and all the checkpoints before window 30 would be
> >> > purged, but the checkpoint for window 30 would not be purged.
> >> >
> >> > Thanks,
> >> > Tim
> >> >
> >> > On Tue, Dec 15, 2015 at 11:41 AM, Ashwin Chandra Putta <
> >> > ashwinchandrap@gmail.com> wrote:
> >> >
> >> > > Tim,
> >> > >
> >> > > Thanks, that is pretty much inline with what I was thinking. A
> little
> >> > > different thought though in terms of picking the checkpoint based on
> >> > > downstream operators. For A, is it not going to be "the checkpoint
> >> with
> >> > the
> >> > > largest window id that is less than or equal to the checkpoint with
> >> the
> >> > > largest common window id (instead of largest window id) among all
> the
> >> > > operators down stream to A"
> >> > >
> >> > > For example,
> >> > >
> >> > > If A -> B -> C -> D is the dag. And say, the checkpoint window count
> >> is 5
> >> > > and the largest checkpoints are as follows.
> >> > >
> >> > > A - 30
> >> > > B - 25
> >> > > C - 20
> >> > > D - 15
> >> > >
> >> > > Does A recover at 25 (checkpoint with largest window id) or 15
> >> > (checkpoint
> >> > > with largest common window id)?
> >> > >
> >> > > Also, regarding recovering at committed window id. Is it not
> possible
> >> in
> >> > > the following scenario where all operators have checkpointed at 30
> and
> >> > got
> >> > > the committed window call back. And then an operator fails before
> any
> >> > > operator checkpoints further. In that case, the recovery window is
> 30
> >> > > right?
> >> > >
> >> > > Regards,
> >> > > Ashwin.
> >> > >
> >> > > On Mon, Dec 14, 2015 at 11:58 PM, Timothy Farkas <
> tim@datatorrent.com
> >> >
> >> > > wrote:
> >> > >
> >> > > > Hi Ashwin,
> >> > > >
> >> > > > The recovery checkpoint for operator A is computed by taking the
> >> > > checkpoint
> >> > > > with the largest window id that is less than or equal to the
> >> checkpoint
> >> > > > with the largest window id among all the operators down stream to
> A.
> >> > The
> >> > > > output operators in a dag will always recover to their most recent
> >> > > > checkpoint. The input operator of the dag may recover to the
> >> earliest
> >> > > > checkpoint. Operators between the input and ouput operators could
> >> > recover
> >> > > > to a window in between.
> >> > > >
> >> > > > I don't think you can ever recover to a committed window, the
> >> earliest
> >> > I
> >> > > > think you can recover to is the window after the committed window
> >> (may
> >> > be
> >> > > > wrong on this).
> >> > > >
> >> > > > On Mon, Dec 14, 2015 at 11:05 PM, Ashwin Chandra Putta <
> >> > > > ashwinchandrap@gmail.com> wrote:
> >> > > >
> >> > > > > In the apex architecture there is concept of checkpointing and
> >> > concept
> >> > > of
> >> > > > > committed when all operator have crossed a common checkpoint.
> >> > > > >
> >> > > > > So, in which scenarios does a given operator recover at last
> >> > checkpoint
> >> > > > > window vs last committed window vs some other checkpoint window
> in
> >> > > > between?
> >> > > > > --
> >> > > > >
> >> > > > > Regards,
> >> > > > > Ashwin.
> >> > > > >
> >> > > >
> >> > >
> >> > >
> >> > >
> >> > > --
> >> > >
> >> > > Regards,
> >> > > Ashwin.
> >> > >
> >> >
> >>
> >
> >
>



-- 

Regards,
Ashwin.

Re: operator recovery window

Posted by Timothy Farkas <ti...@datatorrent.com>.
Whoops my bad, that would never happen. There is a check that only allows
purging of checkpoints for an operator if the operator has more than one
checkpoint. :)

On Tue, Dec 15, 2015 at 1:39 PM, Timothy Farkas <ti...@datatorrent.com> wrote:

> Siyuan, then Ashwin may be right that there is an issue. Looking at the
> code again I think this could happen:
>
> 1 - All operators reach checkpiont 30
> 2 - Checkpoints are updated on heartbeat and committed window is now 25,
> everything before window 30 is purged
> 3 - no new checkpoint is reached for any operator
> 4 - Checkpoints are updated on heartbeat again and committed window is now
> 30, now window 30 is purged.
>
> May be missing something again though.
>
> On Tue, Dec 15, 2015 at 1:32 PM, Siyuan Hua <si...@datatorrent.com>
> wrote:
>
>> My understanding is the committed window could possibly be 30 as well,
>> depends on whether container manager get heart beat from containers.
>>
>> And I guess the discussion is assuming at_least_once semantic? :)
>> at_most_once should have different recovery window.
>>
>> On Tue, Dec 15, 2015 at 12:01 PM, Timothy Farkas <ti...@datatorrent.com>
>> wrote:
>>
>> > Hi Ashwin,
>> >
>> > In your example, if A fails the recovery windows would be
>> >
>> > D - 15
>> > C - 15
>> > B - 15
>> > A - 15
>> >
>> > If C fails the recovery windows would be
>> >
>> > D -15
>> > C -15
>> > B - 25
>> > A - 30
>> >
>> > If every operator just reached window 30 and checkpointed, the committed
>> > window would be 25, and all the checkpoints before window 30 would be
>> > purged, but the checkpoint for window 30 would not be purged.
>> >
>> > Thanks,
>> > Tim
>> >
>> > On Tue, Dec 15, 2015 at 11:41 AM, Ashwin Chandra Putta <
>> > ashwinchandrap@gmail.com> wrote:
>> >
>> > > Tim,
>> > >
>> > > Thanks, that is pretty much inline with what I was thinking. A little
>> > > different thought though in terms of picking the checkpoint based on
>> > > downstream operators. For A, is it not going to be "the checkpoint
>> with
>> > the
>> > > largest window id that is less than or equal to the checkpoint with
>> the
>> > > largest common window id (instead of largest window id) among all the
>> > > operators down stream to A"
>> > >
>> > > For example,
>> > >
>> > > If A -> B -> C -> D is the dag. And say, the checkpoint window count
>> is 5
>> > > and the largest checkpoints are as follows.
>> > >
>> > > A - 30
>> > > B - 25
>> > > C - 20
>> > > D - 15
>> > >
>> > > Does A recover at 25 (checkpoint with largest window id) or 15
>> > (checkpoint
>> > > with largest common window id)?
>> > >
>> > > Also, regarding recovering at committed window id. Is it not possible
>> in
>> > > the following scenario where all operators have checkpointed at 30 and
>> > got
>> > > the committed window call back. And then an operator fails before any
>> > > operator checkpoints further. In that case, the recovery window is 30
>> > > right?
>> > >
>> > > Regards,
>> > > Ashwin.
>> > >
>> > > On Mon, Dec 14, 2015 at 11:58 PM, Timothy Farkas <tim@datatorrent.com
>> >
>> > > wrote:
>> > >
>> > > > Hi Ashwin,
>> > > >
>> > > > The recovery checkpoint for operator A is computed by taking the
>> > > checkpoint
>> > > > with the largest window id that is less than or equal to the
>> checkpoint
>> > > > with the largest window id among all the operators down stream to A.
>> > The
>> > > > output operators in a dag will always recover to their most recent
>> > > > checkpoint. The input operator of the dag may recover to the
>> earliest
>> > > > checkpoint. Operators between the input and ouput operators could
>> > recover
>> > > > to a window in between.
>> > > >
>> > > > I don't think you can ever recover to a committed window, the
>> earliest
>> > I
>> > > > think you can recover to is the window after the committed window
>> (may
>> > be
>> > > > wrong on this).
>> > > >
>> > > > On Mon, Dec 14, 2015 at 11:05 PM, Ashwin Chandra Putta <
>> > > > ashwinchandrap@gmail.com> wrote:
>> > > >
>> > > > > In the apex architecture there is concept of checkpointing and
>> > concept
>> > > of
>> > > > > committed when all operator have crossed a common checkpoint.
>> > > > >
>> > > > > So, in which scenarios does a given operator recover at last
>> > checkpoint
>> > > > > window vs last committed window vs some other checkpoint window in
>> > > > between?
>> > > > > --
>> > > > >
>> > > > > Regards,
>> > > > > Ashwin.
>> > > > >
>> > > >
>> > >
>> > >
>> > >
>> > > --
>> > >
>> > > Regards,
>> > > Ashwin.
>> > >
>> >
>>
>
>

Re: operator recovery window

Posted by Timothy Farkas <ti...@datatorrent.com>.
Siyuan, then Ashwin may be right that there is an issue. Looking at the
code again I think this could happen:

1 - All operators reach checkpiont 30
2 - Checkpoints are updated on heartbeat and committed window is now 25,
everything before window 30 is purged
3 - no new checkpoint is reached for any operator
4 - Checkpoints are updated on heartbeat again and committed window is now
30, now window 30 is purged.

May be missing something again though.

On Tue, Dec 15, 2015 at 1:32 PM, Siyuan Hua <si...@datatorrent.com> wrote:

> My understanding is the committed window could possibly be 30 as well,
> depends on whether container manager get heart beat from containers.
>
> And I guess the discussion is assuming at_least_once semantic? :)
> at_most_once should have different recovery window.
>
> On Tue, Dec 15, 2015 at 12:01 PM, Timothy Farkas <ti...@datatorrent.com>
> wrote:
>
> > Hi Ashwin,
> >
> > In your example, if A fails the recovery windows would be
> >
> > D - 15
> > C - 15
> > B - 15
> > A - 15
> >
> > If C fails the recovery windows would be
> >
> > D -15
> > C -15
> > B - 25
> > A - 30
> >
> > If every operator just reached window 30 and checkpointed, the committed
> > window would be 25, and all the checkpoints before window 30 would be
> > purged, but the checkpoint for window 30 would not be purged.
> >
> > Thanks,
> > Tim
> >
> > On Tue, Dec 15, 2015 at 11:41 AM, Ashwin Chandra Putta <
> > ashwinchandrap@gmail.com> wrote:
> >
> > > Tim,
> > >
> > > Thanks, that is pretty much inline with what I was thinking. A little
> > > different thought though in terms of picking the checkpoint based on
> > > downstream operators. For A, is it not going to be "the checkpoint with
> > the
> > > largest window id that is less than or equal to the checkpoint with the
> > > largest common window id (instead of largest window id) among all the
> > > operators down stream to A"
> > >
> > > For example,
> > >
> > > If A -> B -> C -> D is the dag. And say, the checkpoint window count
> is 5
> > > and the largest checkpoints are as follows.
> > >
> > > A - 30
> > > B - 25
> > > C - 20
> > > D - 15
> > >
> > > Does A recover at 25 (checkpoint with largest window id) or 15
> > (checkpoint
> > > with largest common window id)?
> > >
> > > Also, regarding recovering at committed window id. Is it not possible
> in
> > > the following scenario where all operators have checkpointed at 30 and
> > got
> > > the committed window call back. And then an operator fails before any
> > > operator checkpoints further. In that case, the recovery window is 30
> > > right?
> > >
> > > Regards,
> > > Ashwin.
> > >
> > > On Mon, Dec 14, 2015 at 11:58 PM, Timothy Farkas <ti...@datatorrent.com>
> > > wrote:
> > >
> > > > Hi Ashwin,
> > > >
> > > > The recovery checkpoint for operator A is computed by taking the
> > > checkpoint
> > > > with the largest window id that is less than or equal to the
> checkpoint
> > > > with the largest window id among all the operators down stream to A.
> > The
> > > > output operators in a dag will always recover to their most recent
> > > > checkpoint. The input operator of the dag may recover to the earliest
> > > > checkpoint. Operators between the input and ouput operators could
> > recover
> > > > to a window in between.
> > > >
> > > > I don't think you can ever recover to a committed window, the
> earliest
> > I
> > > > think you can recover to is the window after the committed window
> (may
> > be
> > > > wrong on this).
> > > >
> > > > On Mon, Dec 14, 2015 at 11:05 PM, Ashwin Chandra Putta <
> > > > ashwinchandrap@gmail.com> wrote:
> > > >
> > > > > In the apex architecture there is concept of checkpointing and
> > concept
> > > of
> > > > > committed when all operator have crossed a common checkpoint.
> > > > >
> > > > > So, in which scenarios does a given operator recover at last
> > checkpoint
> > > > > window vs last committed window vs some other checkpoint window in
> > > > between?
> > > > > --
> > > > >
> > > > > Regards,
> > > > > Ashwin.
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > >
> > > Regards,
> > > Ashwin.
> > >
> >
>

Re: operator recovery window

Posted by Siyuan Hua <si...@datatorrent.com>.
My understanding is the committed window could possibly be 30 as well,
depends on whether container manager get heart beat from containers.

And I guess the discussion is assuming at_least_once semantic? :)
at_most_once should have different recovery window.

On Tue, Dec 15, 2015 at 12:01 PM, Timothy Farkas <ti...@datatorrent.com>
wrote:

> Hi Ashwin,
>
> In your example, if A fails the recovery windows would be
>
> D - 15
> C - 15
> B - 15
> A - 15
>
> If C fails the recovery windows would be
>
> D -15
> C -15
> B - 25
> A - 30
>
> If every operator just reached window 30 and checkpointed, the committed
> window would be 25, and all the checkpoints before window 30 would be
> purged, but the checkpoint for window 30 would not be purged.
>
> Thanks,
> Tim
>
> On Tue, Dec 15, 2015 at 11:41 AM, Ashwin Chandra Putta <
> ashwinchandrap@gmail.com> wrote:
>
> > Tim,
> >
> > Thanks, that is pretty much inline with what I was thinking. A little
> > different thought though in terms of picking the checkpoint based on
> > downstream operators. For A, is it not going to be "the checkpoint with
> the
> > largest window id that is less than or equal to the checkpoint with the
> > largest common window id (instead of largest window id) among all the
> > operators down stream to A"
> >
> > For example,
> >
> > If A -> B -> C -> D is the dag. And say, the checkpoint window count is 5
> > and the largest checkpoints are as follows.
> >
> > A - 30
> > B - 25
> > C - 20
> > D - 15
> >
> > Does A recover at 25 (checkpoint with largest window id) or 15
> (checkpoint
> > with largest common window id)?
> >
> > Also, regarding recovering at committed window id. Is it not possible in
> > the following scenario where all operators have checkpointed at 30 and
> got
> > the committed window call back. And then an operator fails before any
> > operator checkpoints further. In that case, the recovery window is 30
> > right?
> >
> > Regards,
> > Ashwin.
> >
> > On Mon, Dec 14, 2015 at 11:58 PM, Timothy Farkas <ti...@datatorrent.com>
> > wrote:
> >
> > > Hi Ashwin,
> > >
> > > The recovery checkpoint for operator A is computed by taking the
> > checkpoint
> > > with the largest window id that is less than or equal to the checkpoint
> > > with the largest window id among all the operators down stream to A.
> The
> > > output operators in a dag will always recover to their most recent
> > > checkpoint. The input operator of the dag may recover to the earliest
> > > checkpoint. Operators between the input and ouput operators could
> recover
> > > to a window in between.
> > >
> > > I don't think you can ever recover to a committed window, the earliest
> I
> > > think you can recover to is the window after the committed window (may
> be
> > > wrong on this).
> > >
> > > On Mon, Dec 14, 2015 at 11:05 PM, Ashwin Chandra Putta <
> > > ashwinchandrap@gmail.com> wrote:
> > >
> > > > In the apex architecture there is concept of checkpointing and
> concept
> > of
> > > > committed when all operator have crossed a common checkpoint.
> > > >
> > > > So, in which scenarios does a given operator recover at last
> checkpoint
> > > > window vs last committed window vs some other checkpoint window in
> > > between?
> > > > --
> > > >
> > > > Regards,
> > > > Ashwin.
> > > >
> > >
> >
> >
> >
> > --
> >
> > Regards,
> > Ashwin.
> >
>

Re: operator recovery window

Posted by Timothy Farkas <ti...@datatorrent.com>.
Yes 25 would be purged, but the operator would never get restored to that
window.

On Tue, Dec 15, 2015 at 12:37 PM, Ashwin Chandra Putta <
ashwinchandrap@gmail.com> wrote:

> Tim,
>
> You mean 25 is purged too?
>
> Regards,
> Ashwin.
>
> On Tue, Dec 15, 2015 at 12:01 PM, Timothy Farkas <ti...@datatorrent.com>
> wrote:
>
> > Hi Ashwin,
> >
> > In your example, if A fails the recovery windows would be
> >
> > D - 15
> > C - 15
> > B - 15
> > A - 15
> >
> > If C fails the recovery windows would be
> >
> > D -15
> > C -15
> > B - 25
> > A - 30
> >
> > If every operator just reached window 30 and checkpointed, the committed
> > window would be 25, and all the checkpoints before window 30 would be
> > purged, but the checkpoint for window 30 would not be purged.
> >
> > Thanks,
> > Tim
> >
> > On Tue, Dec 15, 2015 at 11:41 AM, Ashwin Chandra Putta <
> > ashwinchandrap@gmail.com> wrote:
> >
> > > Tim,
> > >
> > > Thanks, that is pretty much inline with what I was thinking. A little
> > > different thought though in terms of picking the checkpoint based on
> > > downstream operators. For A, is it not going to be "the checkpoint with
> > the
> > > largest window id that is less than or equal to the checkpoint with the
> > > largest common window id (instead of largest window id) among all the
> > > operators down stream to A"
> > >
> > > For example,
> > >
> > > If A -> B -> C -> D is the dag. And say, the checkpoint window count
> is 5
> > > and the largest checkpoints are as follows.
> > >
> > > A - 30
> > > B - 25
> > > C - 20
> > > D - 15
> > >
> > > Does A recover at 25 (checkpoint with largest window id) or 15
> > (checkpoint
> > > with largest common window id)?
> > >
> > > Also, regarding recovering at committed window id. Is it not possible
> in
> > > the following scenario where all operators have checkpointed at 30 and
> > got
> > > the committed window call back. And then an operator fails before any
> > > operator checkpoints further. In that case, the recovery window is 30
> > > right?
> > >
> > > Regards,
> > > Ashwin.
> > >
> > > On Mon, Dec 14, 2015 at 11:58 PM, Timothy Farkas <ti...@datatorrent.com>
> > > wrote:
> > >
> > > > Hi Ashwin,
> > > >
> > > > The recovery checkpoint for operator A is computed by taking the
> > > checkpoint
> > > > with the largest window id that is less than or equal to the
> checkpoint
> > > > with the largest window id among all the operators down stream to A.
> > The
> > > > output operators in a dag will always recover to their most recent
> > > > checkpoint. The input operator of the dag may recover to the earliest
> > > > checkpoint. Operators between the input and ouput operators could
> > recover
> > > > to a window in between.
> > > >
> > > > I don't think you can ever recover to a committed window, the
> earliest
> > I
> > > > think you can recover to is the window after the committed window
> (may
> > be
> > > > wrong on this).
> > > >
> > > > On Mon, Dec 14, 2015 at 11:05 PM, Ashwin Chandra Putta <
> > > > ashwinchandrap@gmail.com> wrote:
> > > >
> > > > > In the apex architecture there is concept of checkpointing and
> > concept
> > > of
> > > > > committed when all operator have crossed a common checkpoint.
> > > > >
> > > > > So, in which scenarios does a given operator recover at last
> > checkpoint
> > > > > window vs last committed window vs some other checkpoint window in
> > > > between?
> > > > > --
> > > > >
> > > > > Regards,
> > > > > Ashwin.
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > >
> > > Regards,
> > > Ashwin.
> > >
> >
>
>
>
> --
>
> Regards,
> Ashwin.
>

Re: operator recovery window

Posted by Ashwin Chandra Putta <as...@gmail.com>.
Tim,

You mean 25 is purged too?

Regards,
Ashwin.

On Tue, Dec 15, 2015 at 12:01 PM, Timothy Farkas <ti...@datatorrent.com>
wrote:

> Hi Ashwin,
>
> In your example, if A fails the recovery windows would be
>
> D - 15
> C - 15
> B - 15
> A - 15
>
> If C fails the recovery windows would be
>
> D -15
> C -15
> B - 25
> A - 30
>
> If every operator just reached window 30 and checkpointed, the committed
> window would be 25, and all the checkpoints before window 30 would be
> purged, but the checkpoint for window 30 would not be purged.
>
> Thanks,
> Tim
>
> On Tue, Dec 15, 2015 at 11:41 AM, Ashwin Chandra Putta <
> ashwinchandrap@gmail.com> wrote:
>
> > Tim,
> >
> > Thanks, that is pretty much inline with what I was thinking. A little
> > different thought though in terms of picking the checkpoint based on
> > downstream operators. For A, is it not going to be "the checkpoint with
> the
> > largest window id that is less than or equal to the checkpoint with the
> > largest common window id (instead of largest window id) among all the
> > operators down stream to A"
> >
> > For example,
> >
> > If A -> B -> C -> D is the dag. And say, the checkpoint window count is 5
> > and the largest checkpoints are as follows.
> >
> > A - 30
> > B - 25
> > C - 20
> > D - 15
> >
> > Does A recover at 25 (checkpoint with largest window id) or 15
> (checkpoint
> > with largest common window id)?
> >
> > Also, regarding recovering at committed window id. Is it not possible in
> > the following scenario where all operators have checkpointed at 30 and
> got
> > the committed window call back. And then an operator fails before any
> > operator checkpoints further. In that case, the recovery window is 30
> > right?
> >
> > Regards,
> > Ashwin.
> >
> > On Mon, Dec 14, 2015 at 11:58 PM, Timothy Farkas <ti...@datatorrent.com>
> > wrote:
> >
> > > Hi Ashwin,
> > >
> > > The recovery checkpoint for operator A is computed by taking the
> > checkpoint
> > > with the largest window id that is less than or equal to the checkpoint
> > > with the largest window id among all the operators down stream to A.
> The
> > > output operators in a dag will always recover to their most recent
> > > checkpoint. The input operator of the dag may recover to the earliest
> > > checkpoint. Operators between the input and ouput operators could
> recover
> > > to a window in between.
> > >
> > > I don't think you can ever recover to a committed window, the earliest
> I
> > > think you can recover to is the window after the committed window (may
> be
> > > wrong on this).
> > >
> > > On Mon, Dec 14, 2015 at 11:05 PM, Ashwin Chandra Putta <
> > > ashwinchandrap@gmail.com> wrote:
> > >
> > > > In the apex architecture there is concept of checkpointing and
> concept
> > of
> > > > committed when all operator have crossed a common checkpoint.
> > > >
> > > > So, in which scenarios does a given operator recover at last
> checkpoint
> > > > window vs last committed window vs some other checkpoint window in
> > > between?
> > > > --
> > > >
> > > > Regards,
> > > > Ashwin.
> > > >
> > >
> >
> >
> >
> > --
> >
> > Regards,
> > Ashwin.
> >
>



-- 

Regards,
Ashwin.

Re: operator recovery window

Posted by Timothy Farkas <ti...@datatorrent.com>.
Hi Ashwin,

In your example, if A fails the recovery windows would be

D - 15
C - 15
B - 15
A - 15

If C fails the recovery windows would be

D -15
C -15
B - 25
A - 30

If every operator just reached window 30 and checkpointed, the committed
window would be 25, and all the checkpoints before window 30 would be
purged, but the checkpoint for window 30 would not be purged.

Thanks,
Tim

On Tue, Dec 15, 2015 at 11:41 AM, Ashwin Chandra Putta <
ashwinchandrap@gmail.com> wrote:

> Tim,
>
> Thanks, that is pretty much inline with what I was thinking. A little
> different thought though in terms of picking the checkpoint based on
> downstream operators. For A, is it not going to be "the checkpoint with the
> largest window id that is less than or equal to the checkpoint with the
> largest common window id (instead of largest window id) among all the
> operators down stream to A"
>
> For example,
>
> If A -> B -> C -> D is the dag. And say, the checkpoint window count is 5
> and the largest checkpoints are as follows.
>
> A - 30
> B - 25
> C - 20
> D - 15
>
> Does A recover at 25 (checkpoint with largest window id) or 15 (checkpoint
> with largest common window id)?
>
> Also, regarding recovering at committed window id. Is it not possible in
> the following scenario where all operators have checkpointed at 30 and got
> the committed window call back. And then an operator fails before any
> operator checkpoints further. In that case, the recovery window is 30
> right?
>
> Regards,
> Ashwin.
>
> On Mon, Dec 14, 2015 at 11:58 PM, Timothy Farkas <ti...@datatorrent.com>
> wrote:
>
> > Hi Ashwin,
> >
> > The recovery checkpoint for operator A is computed by taking the
> checkpoint
> > with the largest window id that is less than or equal to the checkpoint
> > with the largest window id among all the operators down stream to A. The
> > output operators in a dag will always recover to their most recent
> > checkpoint. The input operator of the dag may recover to the earliest
> > checkpoint. Operators between the input and ouput operators could recover
> > to a window in between.
> >
> > I don't think you can ever recover to a committed window, the earliest I
> > think you can recover to is the window after the committed window (may be
> > wrong on this).
> >
> > On Mon, Dec 14, 2015 at 11:05 PM, Ashwin Chandra Putta <
> > ashwinchandrap@gmail.com> wrote:
> >
> > > In the apex architecture there is concept of checkpointing and concept
> of
> > > committed when all operator have crossed a common checkpoint.
> > >
> > > So, in which scenarios does a given operator recover at last checkpoint
> > > window vs last committed window vs some other checkpoint window in
> > between?
> > > --
> > >
> > > Regards,
> > > Ashwin.
> > >
> >
>
>
>
> --
>
> Regards,
> Ashwin.
>

Re: operator recovery window

Posted by Ashwin Chandra Putta <as...@gmail.com>.
Tim,

Thanks, that is pretty much inline with what I was thinking. A little
different thought though in terms of picking the checkpoint based on
downstream operators. For A, is it not going to be "the checkpoint with the
largest window id that is less than or equal to the checkpoint with the
largest common window id (instead of largest window id) among all the
operators down stream to A"

For example,

If A -> B -> C -> D is the dag. And say, the checkpoint window count is 5
and the largest checkpoints are as follows.

A - 30
B - 25
C - 20
D - 15

Does A recover at 25 (checkpoint with largest window id) or 15 (checkpoint
with largest common window id)?

Also, regarding recovering at committed window id. Is it not possible in
the following scenario where all operators have checkpointed at 30 and got
the committed window call back. And then an operator fails before any
operator checkpoints further. In that case, the recovery window is 30 right?

Regards,
Ashwin.

On Mon, Dec 14, 2015 at 11:58 PM, Timothy Farkas <ti...@datatorrent.com>
wrote:

> Hi Ashwin,
>
> The recovery checkpoint for operator A is computed by taking the checkpoint
> with the largest window id that is less than or equal to the checkpoint
> with the largest window id among all the operators down stream to A. The
> output operators in a dag will always recover to their most recent
> checkpoint. The input operator of the dag may recover to the earliest
> checkpoint. Operators between the input and ouput operators could recover
> to a window in between.
>
> I don't think you can ever recover to a committed window, the earliest I
> think you can recover to is the window after the committed window (may be
> wrong on this).
>
> On Mon, Dec 14, 2015 at 11:05 PM, Ashwin Chandra Putta <
> ashwinchandrap@gmail.com> wrote:
>
> > In the apex architecture there is concept of checkpointing and concept of
> > committed when all operator have crossed a common checkpoint.
> >
> > So, in which scenarios does a given operator recover at last checkpoint
> > window vs last committed window vs some other checkpoint window in
> between?
> > --
> >
> > Regards,
> > Ashwin.
> >
>



-- 

Regards,
Ashwin.

Re: operator recovery window

Posted by Timothy Farkas <ti...@datatorrent.com>.
Hi Ashwin,

The recovery checkpoint for operator A is computed by taking the checkpoint
with the largest window id that is less than or equal to the checkpoint
with the largest window id among all the operators down stream to A. The
output operators in a dag will always recover to their most recent
checkpoint. The input operator of the dag may recover to the earliest
checkpoint. Operators between the input and ouput operators could recover
to a window in between.

I don't think you can ever recover to a committed window, the earliest I
think you can recover to is the window after the committed window (may be
wrong on this).

On Mon, Dec 14, 2015 at 11:05 PM, Ashwin Chandra Putta <
ashwinchandrap@gmail.com> wrote:

> In the apex architecture there is concept of checkpointing and concept of
> committed when all operator have crossed a common checkpoint.
>
> So, in which scenarios does a given operator recover at last checkpoint
> window vs last committed window vs some other checkpoint window in between?
> --
>
> Regards,
> Ashwin.
>