Posted to issues@flink.apache.org by "Stephan Ewen (JIRA)" <ji...@apache.org> on 2018/11/15 22:48:01 UTC

[jira] [Commented] (FLINK-10664) Flink: Checkpointing fails with S3 exception - Please reduce your request rate

    [ https://issues.apache.org/jira/browse/FLINK-10664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16688750#comment-16688750 ] 

Stephan Ewen commented on FLINK-10664:
--------------------------------------

The flink-s3-fs-presto file system exposes itself as a Hadoop File System, but internally it is a different implementation. It is simpler and does not try to mimic a full file system on top of an object store, the way Hadoop's s3a does.

That mimicking is extremely expensive in terms of list requests, which are usually the first bottleneck.
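As a rough illustration of switching to the Presto-based file system (the jar name, bucket, and paths below are placeholders, not taken from this issue):

{code}
# Copy the bundled jar from opt/ into lib/ of the Flink distribution
# (this is the 1.5/1.6-era setup; later versions use a plugins/ directory)
cp ./opt/flink-s3-fs-presto-*.jar ./lib/

# flink-conf.yaml -- point checkpoints at S3; bucket/path are placeholders
state.checkpoints.dir: s3://my-bucket/flink/state-checkpoints
{code}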

> Flink: Checkpointing fails with S3 exception - Please reduce your request rate
> ------------------------------------------------------------------------------
>
>                 Key: FLINK-10664
>                 URL: https://issues.apache.org/jira/browse/FLINK-10664
>             Project: Flink
>          Issue Type: Improvement
>          Components: JobManager, TaskManager
>    Affects Versions: 1.5.4, 1.6.1
>            Reporter: Pawel Bartoszek
>            Priority: Major
>
> When a checkpoint is created for a job that has many operators, Flink may upload too many checkpoint files to S3 at the same time, resulting in throttling from S3.
>  
> {code:java}
> Caused by: org.apache.hadoop.fs.s3a.AWSS3IOException: saving output on flink/state-checkpoints/7bbd6495f90257e4bc037ecc08ba21a5/chk-19/4422b088-0836-4f12-bbbe-7e731da11231: com.amazonaws.services.s3.model.AmazonS3Exception: Please reduce your request rate. (Service: Amazon S3; Status Code: 503; Error Code: SlowDown; Request ID: XXXX; S3 Extended Request ID: XXX), S3 Extended Request ID: XXX: Please reduce your request rate. (Service: Amazon S3; Status Code: 503; Error Code: SlowDown; Request ID: 5310EA750DF8B949; S3 Extended Request ID: XXX)
> at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:178)
> at org.apache.hadoop.fs.s3a.S3AOutputStream.close(S3AOutputStream.java:121)
> at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:74)
> at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:108)
> at org.apache.flink.runtime.fs.hdfs.HadoopDataOutputStream.close(HadoopDataOutputStream.java:52)
> at org.apache.flink.core.fs.ClosingFSDataOutputStream.close(ClosingFSDataOutputStream.java:64)
> at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:311){code}
>  
> Can the upload be retried with some kind of backoff?
>  

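The backoff the reporter asks about could be sketched as follows. This is a minimal illustration, not Flink's API: the names {{backoffDelayMs}} and {{uploadWithRetry}} are hypothetical, and the exponential-backoff-with-full-jitter strategy is a common pattern for throttled S3 requests, not something this issue prescribes.

{code:java}
import java.util.concurrent.ThreadLocalRandom;

// Illustrative sketch: retry a throttled upload with exponential backoff
// and "full jitter". None of these names come from Flink's code base.
public class BackoffSketch {

    // Delay cap grows exponentially with the attempt number, bounded by
    // maxDelayMs; the actual sleep is uniformly random below that cap.
    static long backoffDelayMs(int attempt, long baseDelayMs, long maxDelayMs) {
        long cap = Math.min(maxDelayMs, baseDelayMs << Math.min(attempt, 20));
        return ThreadLocalRandom.current().nextLong(cap);
    }

    // Retry a flaky action a bounded number of times, sleeping in between.
    static void uploadWithRetry(Runnable upload, int maxAttempts)
            throws InterruptedException {
        for (int attempt = 0; ; attempt++) {
            try {
                upload.run();
                return;
            } catch (RuntimeException e) { // e.g. the 503 SlowDown error above
                if (attempt + 1 >= maxAttempts) {
                    throw e; // give up after maxAttempts failures
                }
                Thread.sleep(backoffDelayMs(attempt, 100, 10_000));
            }
        }
    }
}
{code}

Full jitter spreads retries from many parallel operators across time, which avoids the synchronized retry bursts that would otherwise keep tripping S3's rate limit.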


--
This message was sent by Atlassian JIRA
(v7.6.3#76005)