You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@impala.apache.org by "Thomas Tauber-Marshall (Jira)" <ji...@apache.org> on 2020/02/26 00:36:00 UTC

[jira] [Created] (IMPALA-9425) Statestore may fail to report when an impalad has failed

Thomas Tauber-Marshall created IMPALA-9425:
----------------------------------------------

             Summary: Statestore may fail to report when an impalad has failed
                 Key: IMPALA-9425
                 URL: https://issues.apache.org/jira/browse/IMPALA-9425
             Project: IMPALA
          Issue Type: Bug
          Components: Distributed Exec
    Affects Versions: Impala 3.4.0
            Reporter: Thomas Tauber-Marshall
            Assignee: Thomas Tauber-Marshall


If an impalad fails and another is restarted at the same host:port combination quickly, the statestore may fail to report to the coordinators that the impalad went down.

The reason for this is that in the cluster membership topic, impalads are keyed by their statestore subscriber id, which is "impalad@host:port". If the new impalad registers itself before a topic update has been generated for a particular coordinator, the statestore has no way of knowing that the particular key was deleted and then re-added since the last update.

The result is that queries that were running on the impalad that failed may not be cancelled by the coordinator until they pass the unresponsive backend timeout, which by default is ~12 minutes.

I propose as a solution that we add a concept of uuids for impalads, where each impalad will generate its own uuid on startup. This allows us to differentiate between different impalads running at the same host:port combination.

It can also be used to simplify some logic in the scheduler and ExecutorGroup/ExecutorBlacklist etc. where we currently have data structures containing info about impalads that are keyed off host/port combinations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)