Posted to issues@flink.apache.org by "Flink Jira Bot (Jira)" <ji...@apache.org> on 2022/07/06 10:41:00 UTC

[jira] [Updated] (FLINK-26315) Stream job with multiple regions does not recover when the connection to ZooKeeper/HBase is lost.

     [ https://issues.apache.org/jira/browse/FLINK-26315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Flink Jira Bot updated FLINK-26315:
-----------------------------------
    Labels: auto-deprioritized-critical failure-recovery restart stale-major  (was: auto-deprioritized-critical failure-recovery restart)

I am the [Flink Jira Bot|https://github.com/apache/flink-jira-bot/] and I help the community manage its development. I see this issue has been marked as Major but is unassigned, and neither it nor its Sub-Tasks have been updated for 60 days. I have gone ahead and added a "stale-major" label to the issue. If this ticket is a Major, please either assign yourself or give an update. Afterwards, please remove the label, or in 7 days the issue will be deprioritized.


> Stream job with multiple regions does not recover when the connection to ZooKeeper/HBase is lost.
> -------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-26315
>                 URL: https://issues.apache.org/jira/browse/FLINK-26315
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.12.0
>            Reporter: Janick Wu
>            Priority: Major
>              Labels: auto-deprioritized-critical, failure-recovery, restart, stale-major
>         Attachments: [FLINK-26315]_improve_failure-rate_restart_strategy,_support_same_failure_cause_ignore.patch
>
>
>   Our platform uses failure-rate (failure-rate-interval: 5min, max-failures-per-interval: 6) as the default restart-strategy, and the failover-strategy is region level.
>   Let's assume a job with a parallelism of 10 in which all edges in the stream graph are FORWARD, so the region count equals the job parallelism. If more than 5 tasks fail because the connection between the TaskManager and an external system such as ZooKeeper or HBase is lost, the failure rate is exceeded immediately. Our job therefore never recovers from such a situation (which is very common when using ZooKeeper for HA).
> h2. Possible solution:
> Improve the failure-rate strategy: record the cause and timestamp of the last task failure. If the same failure cause occurs multiple times within a short period of time, the strategy ignores the repeats.
> I have already implemented this and it works well.
> Usage:
> {quote}restart-strategy: failure-rate
> restart-strategy.failure-rate.cause.insensitive: true
> restart-strategy.failure-rate.cause.insensitive-interval: 1min
> {quote}
> With this configuration, repeats of the same exception within 1min are ignored.
>  
>  
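The proposal in the description can be sketched roughly as follows. This is a minimal, hypothetical illustration, not the attached patch and not Flink's own restart strategy class: the class name `CauseInsensitiveFailureRate`, its methods, and the use of a plain string as the failure cause are all assumptions made for the example. It also assumes that the insensitive interval is measured from the first counted occurrence of a cause, and that restarts stop once the number of counted failures reaches max-failures-per-interval, which matches the "more than 5 tasks" behaviour described above.

```java
import java.time.Duration;
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch of a failure-rate check that ignores repeated causes. */
public class CauseInsensitiveFailureRate {
    private final int maxFailuresPerInterval;
    private final long failureRateIntervalMs;
    private final long insensitiveIntervalMs;

    // Timestamps of the failures that were actually counted.
    private final Deque<Long> countedFailures = new ArrayDeque<>();
    // Cause and timestamp of the last counted failure.
    private String lastCause;
    private long lastCauseTimestampMs;

    public CauseInsensitiveFailureRate(
            int maxFailuresPerInterval,
            Duration failureRateInterval,
            Duration insensitiveInterval) {
        this.maxFailuresPerInterval = maxFailuresPerInterval;
        this.failureRateIntervalMs = failureRateInterval.toMillis();
        this.insensitiveIntervalMs = insensitiveInterval.toMillis();
    }

    /**
     * Records a task failure and reports whether a restart is still allowed.
     * A failure with the same cause as the last counted one, arriving within
     * the insensitive interval, is ignored (assumption: the interval starts
     * at the first counted occurrence, so a persistent fault is still counted
     * once per interval).
     */
    public boolean notifyFailure(String cause, long nowMs) {
        boolean repeated = cause.equals(lastCause)
                && nowMs - lastCauseTimestampMs < insensitiveIntervalMs;
        if (!repeated) {
            lastCause = cause;
            lastCauseTimestampMs = nowMs;
            countedFailures.addLast(nowMs);
        }
        // Drop counted failures that have left the failure-rate window.
        while (!countedFailures.isEmpty()
                && nowMs - countedFailures.peekFirst() > failureRateIntervalMs) {
            countedFailures.removeFirst();
        }
        // Assumed threshold: reaching the configured maximum stops restarts.
        return countedFailures.size() < maxFailuresPerInterval;
    }

    /** Number of failures currently counted inside the failure-rate window. */
    public int countedFailureCount() {
        return countedFailures.size();
    }
}
```

With the settings from the description (6 failures per 5min), ten region failures caused by one ZooKeeper outage count as ten failures when the insensitive interval is zero, so restarts stop. With an insensitive interval of 1min the same ten failures count as one, and the job is allowed to restart.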



--
This message was sent by Atlassian Jira
(v8.20.10#820010)