Posted to dev@flink.apache.org by "Janick Wu (Jira)" <ji...@apache.org> on 2022/02/23 03:56:00 UTC

[jira] [Created] (FLINK-26315) Stream job with multiple regions does not recover when the connection to ZooKeeper/HBase is lost

Janick Wu created FLINK-26315:
---------------------------------

             Summary: Stream job with multiple regions does not recover when the connection to ZooKeeper/HBase is lost
                 Key: FLINK-26315
                 URL: https://issues.apache.org/jira/browse/FLINK-26315
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Coordination, Runtime / Task
    Affects Versions: 1.12.0
            Reporter: Janick Wu


  Our platform uses failure-rate (failure-rate-interval: 5 min, max-failures-per-interval: 6) as the default restart-strategy, and the failover-strategy is region.
  Assume a job with a parallelism of 10 in which all edges of the stream graph are FORWARD, so the number of regions equals the job parallelism. If more than 5 tasks fail because the TaskManagers lose their connection to an external system such as ZooKeeper or HBase, each failure is counted separately and the failure rate is exceeded immediately. As a result, the job never recovers from such a situation.
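
To make the arithmetic concrete, here is a minimal sketch of a sliding-window failure counter with the settings from this report. It is a simplified model and not Flink's actual restart strategy class: the name FailureRateModel and the rule that restarts stop once the count within the interval reaches the maximum are assumptions chosen to match the behaviour described above.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Simplified, hypothetical model of a failure-rate restart check (not Flink's class). */
public class FailureRateModel {
    private final int maxFailuresPerInterval;
    private final long intervalMs;
    private final Deque<Long> failureTimestamps = new ArrayDeque<>();

    public FailureRateModel(int maxFailuresPerInterval, long intervalMs) {
        this.maxFailuresPerInterval = maxFailuresPerInterval;
        this.intervalMs = intervalMs;
    }

    /** Records one task failure and reports whether a restart is still allowed. */
    public boolean recordFailureAndCheckRestart(long nowMs) {
        // Drop failures that have fallen out of the sliding window.
        while (!failureTimestamps.isEmpty()
                && nowMs - failureTimestamps.peekFirst() > intervalMs) {
            failureTimestamps.pollFirst();
        }
        failureTimestamps.addLast(nowMs);
        // Assumption: no more restarts once the count in the window reaches the maximum.
        return failureTimestamps.size() < maxFailuresPerInterval;
    }

    public static void main(String[] args) {
        // max-failures-per-interval: 6, failure-rate-interval: 5 min
        FailureRateModel model = new FailureRateModel(6, 5 * 60 * 1000L);
        // One outage makes all 10 regions fail within a few milliseconds.
        for (int region = 1; region <= 10; region++) {
            boolean canRestart = model.recordFailureAndCheckRestart(region);
            System.out.println("region " + region + " failed, restart allowed: " + canRestart);
        }
    }
}
```

Under these assumptions the first five region failures still allow a restart, and every failure from the sixth onward is rejected, although all ten stem from the same outage.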
h2. Possible solution:

Improve the failure-rate strategy: record the cause and timestamp of the last task failure. If the same failure cause occurs multiple times within a short period, count only the first occurrence and ignore the rest.
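
One way the proposal could look is sketched below. This is only an illustration of the idea, not a patch against Flink: the class DedupFailureRateModel, the dedupWindowMs parameter, and the use of a plain string as the failure cause key are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical failure-rate check that counts repeated causes only once per window. */
public class DedupFailureRateModel {
    private final int maxFailuresPerInterval;
    private final long intervalMs;
    private final long dedupWindowMs; // assumed "short period" from the proposal
    private final Deque<Long> countedFailures = new ArrayDeque<>();
    private String lastCauseKey;       // cause of the last counted failure
    private long lastCauseTimestampMs; // time of the last counted failure

    public DedupFailureRateModel(int maxFailuresPerInterval, long intervalMs, long dedupWindowMs) {
        this.maxFailuresPerInterval = maxFailuresPerInterval;
        this.intervalMs = intervalMs;
        this.dedupWindowMs = dedupWindowMs;
    }

    /** Records a failure unless it repeats the last cause too soon; returns whether a restart is allowed. */
    public boolean recordFailureAndCheckRestart(String causeKey, long nowMs) {
        // Drop counted failures that have fallen out of the sliding window.
        while (!countedFailures.isEmpty()
                && nowMs - countedFailures.peekFirst() > intervalMs) {
            countedFailures.pollFirst();
        }
        boolean duplicate = causeKey.equals(lastCauseKey)
                && nowMs - lastCauseTimestampMs <= dedupWindowMs;
        if (!duplicate) {
            lastCauseKey = causeKey;
            lastCauseTimestampMs = nowMs;
            countedFailures.addLast(nowMs);
        }
        return countedFailures.size() < maxFailuresPerInterval;
    }

    public int countedFailures() {
        return countedFailures.size();
    }

    public static void main(String[] args) {
        DedupFailureRateModel model = new DedupFailureRateModel(6, 5 * 60 * 1000L, 10_000L);
        for (int region = 1; region <= 10; region++) {
            boolean canRestart = model.recordFailureAndCheckRestart("ZooKeeper connection lost", region);
            System.out.println("region " + region + " failed, restart allowed: " + canRestart);
        }
        System.out.println("failures counted: " + model.countedFailures());
    }
}
```

In this sketch the deduplication window is anchored at the first counted occurrence and is not extended by duplicates, so an outage that lasts longer than the window is counted again and can still exhaust the failure rate eventually.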



--
This message was sent by Atlassian Jira
(v8.20.1#820001)