Posted to issues@flink.apache.org by "Stefan Richter (JIRA)" <ji...@apache.org> on 2018/05/28 11:55:00 UTC

[jira] [Commented] (FLINK-9450) Job hangs if S3 access it denied during checkpoints

    [ https://issues.apache.org/jira/browse/FLINK-9450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16492587#comment-16492587 ] 

Stefan Richter commented on FLINK-9450:
---------------------------------------

Do you have logs for this problem, or a thread dump from a "hanging" TM? Can you also check whether the unreachable S3 leads to any exception in the Presto client? You can configure whether a job fails or continues when a checkpoint fails, but it is unclear from your description whether the checkpoint actually fails or just waits on S3 access under the checkpointing lock. Even with asynchronous checkpoints, the job may not continue, because the timer service snapshots are not yet async (this will probably change in the next release), so that part of a checkpoint can still block.
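
To illustrate the "waits under the checkpointing lock" possibility, here is a minimal plain-Java sketch. It does not use any Flink API: the class, the methods, and the latch standing in for S3 are all hypothetical names chosen for the illustration, and it assumes a model in which record processing and the synchronous part of a snapshot share a single lock. Whether this is what actually happens in the reported job is exactly what the logs or a thread dump would need to confirm.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only; none of these names are Flink APIs.
class CheckpointLockSketch {
    private final Object checkpointLock = new Object();
    private final AtomicInteger emitted = new AtomicInteger();

    // In this model, emitting a record requires the checkpoint lock.
    void processRecord() {
        synchronized (checkpointLock) {
            emitted.incrementAndGet();
        }
    }

    // Synchronous snapshot part: keeps the lock while waiting on storage.
    void blockingSnapshot(CountDownLatch lockHeld, CountDownLatch storageReachable) {
        synchronized (checkpointLock) {
            lockHeld.countDown();
            try {
                // Stand-in for a write to storage that is currently unreachable.
                storageReachable.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    // Returns {records emitted while storage is unreachable,
    //          records emitted after storage becomes reachable again}.
    static int[] run() {
        CheckpointLockSketch task = new CheckpointLockSketch();
        CountDownLatch lockHeld = new CountDownLatch(1);
        CountDownLatch storageReachable = new CountDownLatch(1);
        Thread snapshot = new Thread(() -> task.blockingSnapshot(lockHeld, storageReachable));
        Thread processing = new Thread(task::processRecord);
        try {
            snapshot.start();
            lockHeld.await();             // the snapshot thread now owns the lock
            processing.start();
            processing.join(200);         // the record cannot be emitted yet
            int whileBlocked = task.emitted.get();
            storageReachable.countDown(); // storage "comes back"
            snapshot.join();
            processing.join();
            return new int[] {whileBlocked, task.emitted.get()};
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        int[] r = run();
        System.out.println("emitted while blocked: " + r[0] + ", after recovery: " + r[1]);
    }
}
```

In this sketch nothing is emitted while the snapshot thread waits, and processing resumes as soon as the wait ends. That would match a job that neither fails nor makes progress until the iptables rules are removed, but it is only one candidate explanation.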

> Job hangs if S3 access it denied during checkpoints
> ---------------------------------------------------
>
>                 Key: FLINK-9450
>                 URL: https://issues.apache.org/jira/browse/FLINK-9450
>             Project: Flink
>          Issue Type: Bug
>          Components: State Backends, Checkpointing
>    Affects Versions: 1.4.2
>            Reporter: Elias Levy
>            Priority: Major
>
> We have a streaming job that consumes from and writes to Kafka.  The job is configured to checkpoint to S3.  If we deny access to S3 by using iptables on the TM host to deny all outgoing connections to ports 80 and 443, whether using DROP or REJECT, and whether using REJECT with --reject-with tcp-reset or --reject-with icmp-port-unreachable, the job soon stops publishing to Kafka.
> This happens whether the Kafka sources have {{setCommitOffsetsOnCheckpoints}} set to true or false.
> The system is configured to use Presto for the S3 file system.  The job has a small amount of state, so it is configured to use {{FsStateBackend}} with asynchronous snapshots.
> If the iptables rules are removed, the job continues to function.
> I would expect the job to either fail or continue running if a checkpoint fails.


