Posted to issues@flink.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/11/08 14:54:00 UTC

[jira] [Commented] (FLINK-10753) Propagate and log snapshotting exceptions

    [ https://issues.apache.org/jira/browse/FLINK-10753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16679863#comment-16679863 ] 

ASF GitHub Bot commented on FLINK-10753:
----------------------------------------

StefanRRichter opened a new pull request #7064: [FLINK-10753] Improve propagation and logging of snapshot exceptions
URL: https://github.com/apache/flink/pull/7064
 
 
   ## What is the purpose of the change
   
   This PR improves the propagation and logging of exceptions that occur during state snapshots, as outlined in the JIRA issue. The exception is now logged at the task manager already, and we make sure that the correct cause reaches the job manager logs.



> Propagate and log snapshotting exceptions
> -----------------------------------------
>
>                 Key: FLINK-10753
>                 URL: https://issues.apache.org/jira/browse/FLINK-10753
>             Project: Flink
>          Issue Type: Bug
>          Components: State Backends, Checkpointing
>    Affects Versions: 1.6.2, 1.7.0
>            Reporter: Alexander Fedulov
>            Assignee: Stefan Richter
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.7.0
>
>         Attachments: Screen Shot 2018-11-01 at 16.27.01.png
>
>
> Upon failure, {{AbstractStreamOperator.snapshotState}} rethrows a new exception with the message "{{Could not complete snapshot {} for operator {}.}}" and the original exception as the cause. 
> While handling the error, the {{CheckpointCoordinator.discardCheckpoint}} method logs only this propagated message and not the original cause of the exception.
> In addition, {{pendingCheckpoint.abortDeclined()}}, called from {{discardCheckpoint}}, reports the failed checkpoint with a misleading message "{{Checkpoint was declined (tasks not ready)}}". This message is what will be displayed in the UI (see attached).
>  Proposal:
>  # Log the exception at the Task Manager (.snapshotState)
>  # Log the cause, instead of cause.getMessage(), at the JobManager (.discardCheckpoint)
>  # Pass the root cause to abortDeclined and propagate a more appropriate message to the UI.
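
The difference between logging a message and logging the throwable itself, which the second item of the proposal relies on, can be illustrated with a minimal sketch that does not depend on Flink. The class name, the method names, and the "Disk quota exceeded" root cause are all hypothetical stand-ins chosen for illustration; this is not the code from the pull request.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringWriter;

// Hypothetical, Flink-independent illustration; none of these names are Flink APIs.
class SnapshotLoggingSketch {

    // Stand-in for an operator's snapshot method: wraps the original failure
    // in a new exception, keeping the original as the cause.
    static void snapshotState(long checkpointId, String operatorName) throws Exception {
        try {
            throw new IOException("Disk quota exceeded"); // assumed root cause
        } catch (Exception e) {
            throw new Exception(
                "Could not complete snapshot " + checkpointId
                    + " for operator " + operatorName + ".", e);
        }
    }

    // What a log line built from getMessage() contains: only the wrapper text.
    static String messageOnly(Throwable t) {
        return t.getMessage();
    }

    // What logging the throwable itself provides: the stack trace,
    // including the "Caused by:" sections for the whole cause chain.
    static String fullTrace(Throwable t) {
        StringWriter sw = new StringWriter();
        t.printStackTrace(new PrintWriter(sw, true));
        return sw.toString();
    }

    // Walks the cause chain to the innermost exception, e.g. to build
    // a more specific message for display.
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        try {
            snapshotState(42L, "MyOperator");
        } catch (Exception e) {
            System.out.println("Message only: " + messageOnly(e));
            System.out.println("Root cause:   " + rootCause(e));
            System.out.println("Full trace:");
            System.out.println(fullTrace(e));
        }
    }
}
```

In this sketch the message-only output never mentions the I/O failure, while the full trace and the root cause do. Logging frameworks such as SLF4J behave analogously when the throwable is passed as the last argument instead of its message.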



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)