Posted to issues@flink.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2016/01/21 18:15:39 UTC
[jira] [Commented] (FLINK-3261) Tasks should eagerly report back when they cannot start a checkpoint

    [ https://issues.apache.org/jira/browse/FLINK-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15110934#comment-15110934 ] 

ASF GitHub Bot commented on FLINK-3261:
---------------------------------------

GitHub user aljoscha opened a pull request:

    https://github.com/apache/flink/pull/1537

    [FLINK-3261] Allow Task to decline checkpoint request if not ready

    Previously, a StreamingTask could receive a Checkpoint Trigger
    message while it was internally not yet ready. The checkpoint
    coordinator would then wait for the specified timeout interval
    before continuing. Now, tasks can signal that they are not ready,
    and the checkpoint coordinator will discard any checkpoint for
    which this is the case and trigger new checkpoints if necessary.
    
    The newly triggered checkpoints will also release alignment locks in
    streaming tasks that are still waiting for barriers from failed
    checkpoints.
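
To make the decline flow concrete, here is a minimal sketch of the coordinator side. It is not the code in this pull request: the class SketchCoordinator and all of its methods are hypothetical names, and it assumes that a decline always triggers a replacement checkpoint (the description above only says "if necessary").

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical, simplified coordinator; not Flink's actual CheckpointCoordinator. */
class SketchCoordinator {
    private final Set<String> tasks;
    // Checkpoint id -> tasks that have not acknowledged yet.
    private final Map<Long, Set<String>> pending = new HashMap<>();
    private long nextId = 1;

    SketchCoordinator(Set<String> tasks) {
        this.tasks = new HashSet<>(tasks);
    }

    /** Starts a new checkpoint and returns its id. */
    long triggerCheckpoint() {
        long id = nextId++;
        pending.put(id, new HashSet<>(tasks));
        return id;
    }

    /** Records an acknowledgement; returns true once every task has acknowledged. */
    boolean acknowledge(long id, String task) {
        Set<String> waiting = pending.get(id);
        if (waiting == null) {
            return false; // unknown or already discarded checkpoint
        }
        waiting.remove(task);
        if (waiting.isEmpty()) {
            pending.remove(id);
            return true;
        }
        return false;
    }

    /**
     * A task reports that it cannot take part in checkpoint {@code id}.
     * The checkpoint is discarded immediately instead of waiting for the
     * timeout, and a replacement is triggered (assumption: always).
     * Returns the new checkpoint id, or -1 if {@code id} was not pending.
     */
    long decline(long id, String task) {
        if (pending.remove(id) == null) {
            return -1;
        }
        return triggerCheckpoint();
    }

    boolean isPending(long id) {
        return pending.containsKey(id);
    }
}
```

In this sketch, acknowledgements that arrive late for a discarded checkpoint are simply ignored, so a declined checkpoint can never complete.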

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/aljoscha/flink checkpoint-coordinator-decline

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/1537.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1537
    
----
commit c759e2a0a2a1310467c25d84912544abaf5ab29e
Author: Aljoscha Krettek <al...@gmail.com>
Date:   2016-01-21T16:21:09Z

    [FLINK-3261] Allow Task to decline checkpoint request if not ready
    
    Previously, a StreamingTask could receive a Checkpoint Trigger
    message while it was internally not yet ready. The checkpoint
    coordinator would then wait for the specified timeout interval
    before continuing. Now, tasks can signal that they are not ready,
    and the checkpoint coordinator will discard any checkpoint for
    which this is the case and trigger new checkpoints if necessary.
    
    The newly triggered checkpoints will also release alignment locks in
    streaming tasks that are still waiting for barriers from failed
    checkpoints.

----


> Tasks should eagerly report back when they cannot start a checkpoint
> --------------------------------------------------------------------
>
>                 Key: FLINK-3261
>                 URL: https://issues.apache.org/jira/browse/FLINK-3261
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Runtime
>    Affects Versions: 0.10.1
>            Reporter: Stephan Ewen
>            Assignee: Aljoscha Krettek
>            Priority: Blocker
>             Fix For: 1.0.0
>
>
> With very fast checkpoint intervals (a few hundred milliseconds), it can happen that a Task is not ready to start a checkpoint by the time it receives the first checkpoint trigger message.
> If some other tasks are already ready and commence a checkpoint, the stream alignment will make the non-participating task wait until the checkpoint expires (default: 10 minutes).
> A simple fix is for tasks to report back when they cannot start a checkpoint. The checkpoint coordinator can then abort that checkpoint and unblock the streams by starting a new checkpoint (in which all tasks will participate).
> An optimization would be to send a special "abort checkpoints barrier" that tells the barrier buffers for stream alignment to unblock a checkpoint.
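
The unblocking behaviour described in the issue can be sketched as follows. This is an illustration only, not Flink's barrier buffer: SketchBarrierAligner, onBarrier, and onAbort are hypothetical names, channels are plain integers, and buffering of blocked records is left out. A barrier from a newer checkpoint releases channels that are still blocked for an older one, and onAbort models the proposed "abort checkpoints barrier".

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical, simplified stream aligner for one task with several input channels. */
class SketchBarrierAligner {
    private final int numChannels;
    private final Set<Integer> blocked = new HashSet<>();
    private long currentId = -1;  // checkpoint currently being aligned
    private long ignoreUpTo = -1; // highest checkpoint id already completed or aborted

    SketchBarrierAligner(int numChannels) {
        this.numChannels = numChannels;
    }

    /** Handles a barrier; returns true when all channels are aligned for {@code id}. */
    boolean onBarrier(long id, int channel) {
        if (id <= ignoreUpTo || id < currentId) {
            return false; // stale barrier from a finished or superseded checkpoint
        }
        if (id > currentId) {
            blocked.clear(); // a newer checkpoint releases the old alignment
            currentId = id;
        }
        blocked.add(channel);
        if (blocked.size() == numChannels) {
            blocked.clear();
            ignoreUpTo = id;
            return true;
        }
        return false;
    }

    /** Models the proposed abort marker: unblock without waiting for a new checkpoint. */
    void onAbort(long id) {
        if (id == currentId) {
            blocked.clear();
        }
        ignoreUpTo = Math.max(ignoreUpTo, id);
    }

    boolean isBlocked(int channel) {
        return blocked.contains(channel);
    }
}
```

With two channels, a barrier for checkpoint 1 on channel 0 blocks that channel; a barrier for checkpoint 2 on channel 1 then releases channel 0 and starts aligning checkpoint 2 instead.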



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)