Posted to dev@flink.apache.org by Gyula Fóra <gy...@apache.org> on 2016/01/08 11:49:08 UTC

Weird test-source issue

Hey,

I have encountered a weird issue in a checkpointing test I am trying to
write. The logic is the same as in the previous checkpointing tests:
there is a OnceFailingReducer.

My problem is that before the reducer fails, my job cannot take any
snapshots. The Runnables executing the checkpointing logic in the sources
keep waiting on some lock.

After the failure and the restart, everything is fine and the checkpointing
can succeed properly.

Also, if I remove the failure from the reducer, the job doesn't take any
snapshots (still waiting on the lock) and the job just finishes.

Here is the code:
https://github.com/gyfora/flink/blob/d1f12c2474413c9af357b6da33f1fac30549fbc3/flink-contrib/flink-streaming-contrib/src/test/java/org/apache/flink/contrib/streaming/state/TfileStateCheckpointingTest.java#L83

I assume there is no problem with the source, as the Thread.sleep(..) is
outside of the synchronized block (and, as I said, it works fine after the
failure).
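
For illustration, a minimal sketch of the source pattern described above:
records are emitted while holding a lock, and the sleep happens only after
the lock is released. This is not the code from the linked test. The class
name, the field names, and the assumption that the checkpointing runnable
synchronizes on the same object are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a test source; NOT the code from the linked test.
class SleepOutsideLockSource {
    // Assumption: the checkpointing runnable synchronizes on this same object.
    private final Object checkpointLock = new Object();
    private final List<Integer> emitted = new ArrayList<>();
    private int offset = 0; // the state a snapshot would capture

    // Emits `count` records, pausing between them without holding the lock.
    void run(int count, long sleepMillis) {
        for (int i = 0; i < count; i++) {
            synchronized (checkpointLock) {
                // Emit the record and advance the state in one atomic step.
                emitted.add(offset);
                offset++;
            }
            try {
                // The lock is free here, so a snapshot can run in between.
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    // What a checkpointing runnable would do: read the state under the lock.
    int snapshot() {
        synchronized (checkpointLock) {
            return offset;
        }
    }

    int emittedCount() {
        synchronized (checkpointLock) {
            return emitted.size();
        }
    }
}
```

With this shape the lock is only held for the short emit step, so a
snapshot thread should not be blocked by the sleep itself.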

Any ideas?

Thanks,
Gyula

Re: Weird test-source issue

Posted by Stephan Ewen <se...@apache.org>.

It is nice to see that we converge on the issues we find.

It means that this is getting pretty stable :-)

On Tue, Jan 19, 2016 at 8:17 PM, Stephan Ewen <se...@apache.org> wrote:

> Yeah, we saw this as well this morning, in a job that triggers checkpoints
> super fast (50msecs).
>
> I think we have a good fix figured out, let's solve this for 1.0...

Re: Weird test-source issue

Posted by Stephan Ewen <se...@apache.org>.

Yeah, we saw this as well this morning, in a job that triggers checkpoints
super fast (50msecs).

I think we have a good fix figured out, let's solve this for 1.0...

On Tue, Jan 19, 2016 at 3:25 PM, Gyula Fóra <gy...@gmail.com> wrote:

> I just got back to this issue. The problem wasn't with the locking; rather,
> the StreamTask wasn't in the running state yet when the first checkpoint
> trigger message arrived.
> I actually just saw your JIRA as well, funny... :)
>
> Regards,
> Gyula

Re: Weird test-source issue

Posted by Gyula Fóra <gy...@gmail.com>.

I just got back to this issue. The problem wasn't with the locking; rather,
the StreamTask wasn't in the running state yet when the first checkpoint
trigger message arrived.
I actually just saw your JIRA as well, funny... :)
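
To make the failure mode concrete, here is a minimal sketch of a task that
ignores checkpoint triggers arriving before it has entered its running
state. It is a hypothetical model, not Flink's StreamTask: the class, the
method names, and the counters are assumptions made for illustration.

```java
// Hypothetical model of the race; NOT Flink's actual StreamTask code.
class TaskRunningStateSketch {
    private final Object lock = new Object();
    private boolean running = false; // set once the task has finished starting up
    private int performed = 0;       // checkpoints actually taken
    private int dropped = 0;         // triggers that arrived too early

    // Called when the task has finished its setup and starts processing.
    void start() {
        synchronized (lock) {
            running = true;
        }
    }

    // Returns true if a snapshot was taken, false if the trigger was ignored.
    boolean triggerCheckpoint(long checkpointId) {
        synchronized (lock) {
            if (!running) {
                // A trigger that arrives before start() never completes.
                dropped++;
                return false;
            }
            performed++;
            return true;
        }
    }

    int performedCount() {
        synchronized (lock) {
            return performed;
        }
    }

    int droppedCount() {
        synchronized (lock) {
            return dropped;
        }
    }
}
```

In this model a trigger sent right after job submission is lost, while the
same trigger sent after start() succeeds.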

Regards,
Gyula

Stephan Ewen <se...@apache.org> wrote (on Fri, Jan 8, 2016, 15:36):

> Hmm, strange issue indeed.
>
> So, checkpoints are definitely triggered (log message by coordinator to
> trigger checkpoint) but are not completing?
> Can you check which is the first checkpoint to complete? Is it Checkpoint
> 1, or a later one (indicating that checkpoint 1 was somehow subsumed)?
>
> Can you check in the stack trace which lock the checkpoint runnables are
> waiting on, and who is holding that lock?
>
> Two thoughts:
>
> 1) What I mistakenly did once in one of my tests was to have the sleep() in
> a downstream task. That would simply prevent the fast-generated data
> elements (and the inline checkpoint barriers) from passing through and
> completing the checkpoint.
>
> 2) Is this another issue with the non-fair lock? Does the checkpoint
> runnable simply not get the lock before the checkpoint? Not sure why it
> would suddenly work after the failure. We could try swapping the lock
> Object for a "ReentrantLock(true)" and see what happens.
>
>
> Stephan

Re: Weird test-source issue

Posted by Stephan Ewen <se...@apache.org>.

Hmm, strange issue indeed.

So, checkpoints are definitely triggered (log message by coordinator to
trigger checkpoint) but are not completing?
Can you check which is the first checkpoint to complete? Is it Checkpoint
1, or a later one (indicating that checkpoint 1 was somehow subsumed)?

Can you check in the stack trace which lock the checkpoint runnables are
waiting on, and who is holding that lock?

Two thoughts:

1) What I mistakenly did once in one of my tests was to have the sleep() in
a downstream task. That would simply prevent the fast-generated data
elements (and the inline checkpoint barriers) from passing through and
completing the checkpoint.

2) Is this another issue with the non-fair lock? Does the checkpoint
runnable simply not get the lock before the checkpoint? Not sure why it
would suddenly work after the failure. We could try swapping the lock
Object for a "ReentrantLock(true)" and see what happens.
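
The lock swap from point 2 could look roughly like the sketch below, which
guards the emit path and the snapshot path with a fair ReentrantLock
instead of a plain Object monitor. The class and method names are
hypothetical; only java.util.concurrent.locks.ReentrantLock and its
fairness flag come from the JDK.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch; the names are illustrative, not taken from Flink.
class FairLockSketch {
    // `true` requests the fair ordering policy: under contention the lock
    // favors the thread that has been waiting the longest.
    private final ReentrantLock checkpointLock = new ReentrantLock(true);
    private long emitted = 0;

    // Source side: emit one record while holding the lock.
    void emitOne() {
        checkpointLock.lock();
        try {
            emitted++;
        } finally {
            checkpointLock.unlock();
        }
    }

    // Checkpoint side: read the state while holding the same lock.
    long snapshot() {
        checkpointLock.lock();
        try {
            return emitted;
        } finally {
            checkpointLock.unlock();
        }
    }

    boolean isFair() {
        return checkpointLock.isFair();
    }
}
```

A fair lock typically costs some throughput, so this is more of a
diagnostic experiment than a recommended default.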

Stephan

On Fri, Jan 8, 2016 at 11:49 AM, Gyula Fóra <gy...@apache.org> wrote:

> Hey,
>
> I have encountered a weird issue in a checkpointing test I am trying to
> write. The logic is the same as in the previous checkpointing tests:
> there is a OnceFailingReducer.
>
> My problem is that before the reducer fails, my job cannot take any
> snapshots. The Runnables executing the checkpointing logic in the sources
> keep waiting on some lock.
>
> After the failure and the restart, everything is fine and the checkpointing
> can succeed properly.
>
> Also, if I remove the failure from the reducer, the job doesn't take any
> snapshots (still waiting on the lock) and the job just finishes.
>
> Here is the code:
>
> https://github.com/gyfora/flink/blob/d1f12c2474413c9af357b6da33f1fac30549fbc3/flink-contrib/flink-streaming-contrib/src/test/java/org/apache/flink/contrib/streaming/state/TfileStateCheckpointingTest.java#L83
>
> I assume there is no problem with the source, as the Thread.sleep(..) is
> outside of the synchronized block (and, as I said, it works fine after the
> failure).
>
> Any ideas?
>
> Thanks,
> Gyula
>