Posted to user@flink.apache.org by Tony Wei <to...@gmail.com> on 2018/08/29 03:35:51 UTC

checkpoint failed due to s3 exception: request timeout

Hi,

I met a checkpoint failure problem caused by an S3 exception.

org.apache.flink.fs.s3presto.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
> Your socket connection to the server was not read from or written to within
> the timeout period. Idle connections will be closed. (Service: Amazon S3;
> Status Code: 400; Error Code: RequestTimeout; Request ID:
> B8BE8978D3EFF3F5), S3 Extended Request ID:
> ePKce/MjMFPPNYi90rGdYmDw3blfvi0xR2CcJpCISEgxM92/6JZAU4whpfXeV6SfG62cnts0NBw=


The full stack trace and a screenshot are provided in the attachment.

My settings for the Flink cluster and job:

   - flink version 1.4.0
   - standalone mode
   - 4 slots for each TM
   - presto s3 filesystem
   - rocksdb statebackend
   - local ssd
   - enable incremental checkpoint
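
For reference, here is a minimal sketch of how this setup is wired in the job code
(the bucket name, paths, and interval below are placeholders rather than my real values):

    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointSetupSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            // checkpoint every 5 minutes (placeholder interval)
            env.enableCheckpointing(5 * 60 * 1000L);

            // RocksDB state backend: working files on the local SSD, checkpoints
            // written to S3 via flink-s3-fs-presto; the second constructor argument
            // enables incremental checkpoints
            RocksDBStateBackend backend =
                    new RocksDBStateBackend("s3://my-bucket/checkpoints", true);
            backend.setDbStoragePath("/mnt/ssd/rocksdb");
            env.setStateBackend(backend);

            // trivial placeholder pipeline, just so the sketch runs end to end
            env.fromElements(1, 2, 3).print();
            env.execute("checkpoint-setup-sketch");
        }
    }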

There is no unusual message besides the exception in the log file, and no high GC ratio
during the checkpoint procedure. Moreover, 3 of the 4 parts still uploaded successfully
on that TM. I couldn't find anything that would relate to this failure. Did anyone meet
this problem before?

Besides, I also found an issue in another AWS SDK [1] that mentioned this S3 exception
as well. One reply said you can passively avoid the problem by raising the max client
retries config, and I found the corresponding config in Presto [2]. Can I just add
s3.max-client-retries: xxx to flink-conf.yaml to configure it? If not, how should I
overwrite the default value of this configuration? Thanks in advance.

Best,
Tony Wei

[1] https://github.com/aws/aws-sdk-php/issues/885
[2]
https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/s3/HiveS3Config.java#L218

Re: checkpoint failed due to s3 exception: request timeout

Posted by Tony Wei <to...@gmail.com>.
Hi Andrey,

Cool! I will add it to my flink-conf.yaml. However, I'm still wondering whether anyone is
familiar with this problem or has any idea how to find the root cause. Thanks.

Best,
Tony Wei


Re: checkpoint failed due to s3 exception: request timeout

Posted by Andrey Zagrebin <an...@data-artisans.com>.
Hi,

the current Flink 1.6.0 version uses Presto Hive s3 connector 0.185 [1], which has this option:
S3_MAX_CLIENT_RETRIES = "presto.s3.max-client-retries";

If you add "s3.max-client-retries" to flink conf, flink-s3-fs-presto [2] should automatically prefix it and configure PrestoS3FileSystem correctly.
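
For example, a minimal flink-conf.yaml sketch (the retry count is only an illustrative
value, pick whatever suits your workload):

    # flink-s3-fs-presto translates this key to presto.s3.max-client-retries internally
    s3.max-client-retries: 10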

Cheers,
Andrey

[1] https://github.com/prestodb/presto/blob/0.185/presto-hive/src/main/java/com/facebook/presto/hive/PrestoS3FileSystem.java
[2] https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/aws.html#shaded-hadooppresto-s3-file-systems-recommended




Re: checkpoint failed due to s3 exception: request timeout

Posted by vino yang <ya...@gmail.com>.
Hi Tony,

Maybe you can consider looking at the documentation for this class; the class comes
from flink-s3-fs-presto. [1]

[1]:
https://ci.apache.org/projects/flink/flink-docs-release-1.6/api/java/org/apache/hadoop/conf/Configuration.html

Thanks, vino.


Re: checkpoint failed due to s3 exception: request timeout

Posted by Tony Wei <to...@gmail.com>.
Hi Vino,

I thought this config was for the AWS S3 client, but that client is inside
flink-s3-fs-presto.
So I guessed I should find a way to pass this config to that library.

Best,
Tony Wei


Re: checkpoint failed due to s3 exception: request timeout

Posted by vino yang <ya...@gmail.com>.
Hi Tony,

Sorry, I just saw the timeout; I thought they were similar because they both happened
on AWS S3.
Regarding this setting, isn't "s3.max-client-retries: xxx" set for the client?

Thanks, vino.


Re: checkpoint failed due to s3 exception: request timeout

Posted by Tony Wei <to...@gmail.com>.
Hi Vino,

Thanks for your quick reply, but I think these two questions are different. The checkpoint
in that question eventually finished, whereas my checkpoint failed due to an S3 client
timeout. You can see from my screenshot that the checkpoint failed within a short time.

Regarding the configuration, do you mean passing the configuration as the program's input
arguments? I don't think that will work; at the very least I would need to find a way to
pass it to the S3 filesystem builder in my program. Instead, I will ask for help passing
it via flink-conf.yaml, because that is what I use for the global S3 filesystem settings,
and I thought there might be a simple way to support this setting like the other s3.xxx
configs.
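
For example, my flink-conf.yaml already carries the other s3.* options in this style (the
keys and values below are only an illustration, not my actual settings), so I hoped this
new key could be expressed the same way:

    # existing global settings for the S3 filesystem (illustrative values)
    s3.access-key: MY_ACCESS_KEY
    s3.secret-key: MY_SECRET_KEY
    # the setting I hope will work the same way
    s3.max-client-retries: 10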

I very much appreciate your answer and help.

Best,
Tony Wei


Re: checkpoint failed due to s3 exception: request timeout

Posted by vino yang <ya...@gmail.com>.
Hi Tony,

A while ago, I answered a similar question. [1]

You can try to increase this value appropriately. You can't put this configuration in
flink-conf.yaml; you can put it in the submit command of the job [2], or in the
configuration file you specify.

[1]:
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Why-checkpoint-took-so-long-td22364.html#a22375
[2]:
https://ci.apache.org/projects/flink/flink-docs-release-1.6/ops/cli.html

Thanks, vino.
