Posted to issues@flink.apache.org by "Flink Jira Bot (Jira)" <ji...@apache.org> on 2022/01/04 10:41:00 UTC

[jira] [Updated] (FLINK-3431) Add retrying logic for RocksDB snapshots

     [ https://issues.apache.org/jira/browse/FLINK-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Flink Jira Bot updated FLINK-3431:
----------------------------------
    Labels: auto-deprioritized-critical auto-deprioritized-major auto-unassigned stale-minor  (was: auto-deprioritized-critical auto-deprioritized-major auto-unassigned)

I am the [Flink Jira Bot|https://github.com/apache/flink-jira-bot/] and I help the community manage its development. I see this issue has been marked as Minor, but it is unassigned and neither it nor its Sub-Tasks have been updated for 180 days. I have gone ahead and marked it "stale-minor". If this ticket is still Minor, please either assign yourself or give an update. Afterwards, please remove the label, or in 7 days the issue will be deprioritized.


> Add retrying logic for RocksDB snapshots
> ----------------------------------------
>
>                 Key: FLINK-3431
>                 URL: https://issues.apache.org/jira/browse/FLINK-3431
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / State Backends
>            Reporter: Gyula Fora
>            Priority: Minor
>              Labels: auto-deprioritized-critical, auto-deprioritized-major, auto-unassigned, stale-minor
>
> Currently, RocksDB snapshots rely on the HDFS copy not failing while the snapshot is being taken.
> In some cases, when the state size is large enough, the HDFS nodes can become so overloaded that the copy operation fails with errors like this:
> AsynchronousException{java.io.IOException: All datanodes 172.26.86.90:50010 are bad. Aborting...}
> at org.apache.flink.streaming.runtime.tasks.StreamTask$1.run(StreamTask.java:545)
> Caused by: java.io.IOException: All datanodes 172.26.86.90:50010 are bad. Aborting...
> at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1023)
> at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:838)
> at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:483)
> I think it is important that we don't immediately fail the job in these cases, but instead retry the copy operation after some random sleep time. It might also be good to do a random sleep before the copy, depending on the state size, to smooth out the IO a little bit. A sketch of such a retry loop is below.
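> A minimal sketch of what such a retry wrapper could look like (this is not the actual Flink snapshot code; the copy action, attempt count, and sleep bound are hypothetical placeholders):
>
> import java.io.IOException;
> import java.util.concurrent.Callable;
> import java.util.concurrent.ThreadLocalRandom;
>
> public final class RetryingCopy {
>
>     // Retries the given copy action a bounded number of times, sleeping a random
>     // interval between attempts so that all subtasks don't hit the datanodes at once.
>     public static <T> T copyWithRetries(Callable<T> copyAction, int maxAttempts, long maxSleepMillis)
>             throws Exception {
>         IOException lastFailure = null;
>         for (int attempt = 1; attempt <= maxAttempts; attempt++) {
>             try {
>                 return copyAction.call();
>             } catch (IOException e) {
>                 lastFailure = e;
>                 if (attempt < maxAttempts) {
>                     // Random back-off before the next attempt to smooth out the IO load.
>                     Thread.sleep(ThreadLocalRandom.current().nextLong(1, maxSleepMillis));
>                 }
>             }
>         }
>         throw lastFailure;
>     }
> }
>
> The same kind of random sleep could also be applied once before the first copy attempt, scaled by the state size, as suggested above.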



--
This message was sent by Atlassian Jira
(v8.20.1#820001)