Posted to issues@flink.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/09/06 14:41:00 UTC

[jira] [Commented] (FLINK-9598) [Checkpoints] The config Minimum Pause Between Checkpoints doesn't work when there's a checkpoint failure

    [ https://issues.apache.org/jira/browse/FLINK-9598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605855#comment-16605855 ] 

ASF GitHub Bot commented on FLINK-9598:
---------------------------------------

Myasuka commented on issue #6346: [FLINK-9598] Refine java-doc about the min pause between checkpoints
URL: https://github.com/apache/flink/pull/6346#issuecomment-419119077
 
 
   @zentol I think the decision whether to change the Javadocs or to fix the checkpoint coordinator's behavior depends on whether we give an explicit definition of the minimum pause between checkpoints. Our docs on [Minimum Pause Between Checkpoints](https://ci.apache.org/projects/flink/flink-docs-release-1.6/monitoring/checkpoint_monitoring.html#configuration-tab) only describe the successful scenario:
   
   > After a checkpoint has completed successfully, we wait at least for this amount of time before triggering the next one.
   
   If we gave a clearer definition for the checkpoint-failure scenario, there might be nothing left to discuss.
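
To make the two possible definitions concrete, here is a minimal sketch in plain Java. It is not Flink's checkpoint coordinator code: the class `MinPauseSketch`, the `Outcome` enum, and both methods are hypothetical names made up for illustration, and the model ignores the periodic checkpoint interval. It only computes the earliest time the next checkpoint could be triggered under each reading of "minimum pause", using the 10 min pause and 40 min timeout from the issue description.

```java
public class MinPauseSketch {

    /** Hypothetical outcome of one checkpoint attempt in this simplified model. */
    enum Outcome { COMPLETED, FAILED }

    /**
     * Reading 1 (what the docs literally state): the pause is enforced only
     * after a checkpoint has completed successfully. After a failure the
     * next checkpoint may be triggered immediately.
     */
    static long nextTriggerSuccessOnly(long endMillis, Outcome outcome, long minPauseMillis) {
        return outcome == Outcome.COMPLETED ? endMillis + minPauseMillis : endMillis;
    }

    /**
     * Reading 2 (what the issue reporter expected): the pause is enforced
     * after every checkpoint attempt, whether it completed or failed.
     */
    static long nextTriggerAlways(long endMillis, Outcome outcome, long minPauseMillis) {
        return endMillis + minPauseMillis;
    }

    public static void main(String[] args) {
        long minPause = 10 * 60 * 1000L; // 10 min, from the issue description
        long timeout = 40 * 60 * 1000L;  // 40 min, from the issue description

        // A checkpoint triggered at t = 0 that times out ends at t = timeout.
        long failedAt = timeout;

        long a = nextTriggerSuccessOnly(failedAt, Outcome.FAILED, minPause);
        long b = nextTriggerAlways(failedAt, Outcome.FAILED, minPause);

        System.out.println("pause after success only: next trigger at minute " + a / 60000);
        System.out.println("pause after every attempt: next trigger at minute " + b / 60000);
    }
}
```

Under the first reading a timed-out checkpoint is followed by a new one right away, which matches the behavior reported in the issue. Under the second reading the job gets at least the configured pause to make progress. Which of the two the documentation should promise is exactly the open question.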

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> [Checkpoints] The config Minimum Pause Between Checkpoints doesn't work when there's a checkpoint failure
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-9598
>                 URL: https://issues.apache.org/jira/browse/FLINK-9598
>             Project: Flink
>          Issue Type: Bug
>    Affects Versions: 1.3.2
>            Reporter: Prem Santosh
>            Assignee: Yun Tang
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: Screen Shot 2018-06-20 at 7.44.10 AM.png
>
>
> We have set the config Minimum Pause Between Checkpoints to 10 min, but noticed that when a checkpoint fails (because it times out before it completes) the application immediately starts taking the next checkpoint. This basically stalls the application's progress, since it is always taking checkpoints.
> [^Screen Shot 2018-06-20 at 7.44.10 AM.png] is a screenshot of this issue.
> Details:
>  * Running Flink-1.3.2 on EMR
>  * checkpoint timeout duration: 40 min
>  * minimum pause between checkpoints: 10 min
> There is also a [relevant thread|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Having-a-backoff-while-experiencing-checkpointing-failures-td20618.html] that I found on the Flink users group.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)