Posted to issues-all@impala.apache.org by "Michael Ho (JIRA)" <ji...@apache.org> on 2019/03/22 22:28:00 UTC

[jira] [Commented] (IMPALA-7872) Extended health checks to mark node as down

    [ https://issues.apache.org/jira/browse/IMPALA-7872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16799395#comment-16799395 ] 

Michael Ho commented on IMPALA-7872:
------------------------------------

From the perspective of a coordinator node, an executor node can be considered available if:
- there are fragment instances running on it which are making progress (progress is a bit hard to define for blocking operators, but at a minimum the coordinator should be getting status reports from the query periodically; see IMPALA-2990).
- if there are no fragment instances on it coordinated by the current coordinator node, the coordinator may consider issuing a simple query to the executor to see if it's still functional. Presumably, the health check query shouldn't be too resource intensive.
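A rough sketch of how a coordinator could combine those two signals; the class, method names and timeout below are purely illustrative and not existing Impala code:

{code}
// Hypothetical coordinator-side tracker; names and thresholds are illustrative only.
#include <chrono>
#include <functional>
#include <string>
#include <unordered_map>

using Clock = std::chrono::steady_clock;

class ExecutorAvailability {
 public:
  // Called whenever a fragment instance running on 'executor' delivers a status report.
  void ReportReceived(const std::string& executor) {
    last_report_[executor] = Clock::now();
  }

  // 'probe' stands in for a cheap health-check query (e.g. "select 1") issued
  // only when this coordinator has no fragment instances on the executor.
  bool IsAvailable(const std::string& executor,
                   const std::function<bool(const std::string&)>& probe) const {
    auto it = last_report_.find(executor);
    if (it != last_report_.end()) {
      // Signal 1: fragment instances count as making progress if we heard
      // from them recently (periodic status reports, see IMPALA-2990).
      return Clock::now() - it->second < kReportTimeout;
    }
    // Signal 2: nothing running here from this coordinator; fall back to a probe.
    return probe(executor);
  }

 private:
  static constexpr std::chrono::seconds kReportTimeout{60};
  std::unordered_map<std::string, Clock::time_point> last_report_;
};
{code}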

> Extended health checks to mark node as down
> -------------------------------------------
>
>                 Key: IMPALA-7872
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7872
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Distributed Exec
>            Reporter: Tim Armstrong
>            Priority: Major
>              Labels: Availability, statestore
>
> This is an umbrella JIRA to improve handling of complex failure modes aside from fail-stop. The current statestore heartbeat mechanism assumes that an Impala daemon that responds to heartbeats is healthy and can be scheduled on. Memory-based admission control provides a bit more robustness here by not admitting queries on daemons where memory would be oversubscribed.
> Examples of failure modes of interest are:
> * Hangs, where a particular node can't make progress on some or all queries (the JVM hangs in IMPALA-7483 are a good example).
> * Repeated fragment instance startup failures, e.g. where coordinators can't successfully start fragments on an Impala daemon because of communication errors or other issues.
> We can't automatically handle all failure modes, but we could improve handling of some common ones, particularly repeated fragment startup failures or hangs. The goal would be to degrade more gracefully to avoid repeated failures causing a cluster-wide outage. The goal isn't to prevent all failures, just to recover to a healthy state automatically in more scenarios.
> IMPALA-1760 (graceful shutdown) may give us some better options here, since if a node notices that it is somehow unhealthy, it could gracefully remove itself from scheduling and restart itself.
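
As a purely illustrative sketch of that self-removal idea (none of these names are real Impala classes or flags), an executor-side watchdog might look roughly like this:

{code}
// Hypothetical executor-side watchdog illustrating "notice you are unhealthy,
// then remove yourself"; all names here are made up for illustration.
#include <atomic>
#include <chrono>
#include <functional>
#include <thread>

class SelfHealthWatchdog {
 public:
  // 'healthy' runs a cheap internal check (e.g. detecting a hung JVM);
  // 'quiesce' would stop accepting new fragments and start a graceful
  // shutdown along the lines of IMPALA-1760.
  SelfHealthWatchdog(std::function<bool()> healthy, std::function<void()> quiesce)
    : healthy_(std::move(healthy)), quiesce_(std::move(quiesce)) {}

  void Run() {
    int consecutive_failures = 0;
    while (!stopped_.load()) {
      if (healthy_()) {
        consecutive_failures = 0;
      } else if (++consecutive_failures >= kMaxFailures) {
        quiesce_();  // Degrade gracefully instead of failing queries indefinitely.
        return;
      }
      std::this_thread::sleep_for(std::chrono::seconds(5));
    }
  }

  void Stop() { stopped_.store(true); }

 private:
  static constexpr int kMaxFailures = 3;
  std::function<bool()> healthy_;
  std::function<void()> quiesce_;
  std::atomic<bool> stopped_{false};
};
{code}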


