Posted to issues-all@impala.apache.org by "Sahil Takiar (Jira)" <ji...@apache.org> on 2019/09/13 16:40:00 UTC

[jira] [Commented] (IMPALA-8634) Catalog client should be resilient to temporary Catalog outage

    [ https://issues.apache.org/jira/browse/IMPALA-8634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16929353#comment-16929353 ] 

Sahil Takiar commented on IMPALA-8634:
--------------------------------------

The existing code actually already does this. The flags {{catalog_client_connection_num_retries}} and {{catalog_client_rpc_retry_interval_ms}} control how many times the client tries to re-connect to the catalog and how long it waits between attempts.

The issue is that connection establishment is retried, but individual RPCs are not (unless the RPC hits a connection reset). So a fix would be to use {{DoRpcWithRetry}} instead of {{DoRpc}} (similar to what was done in IMPALA-8904).
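
To make the idea of an RPC-level retry concrete, here is a minimal sketch of a wrapper that retries a failed call a bounded number of times with a fixed wait in between. {{RetryRpc}} and its parameters are hypothetical names chosen for illustration; this is not the signature or implementation of Impala's {{DoRpcWithRetry}}.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical stand-in for an RPC-level retry wrapper; this is NOT the
// signature of Impala's DoRpcWithRetry.
// Calls `rpc` until it succeeds or `max_attempts` attempts have been made,
// sleeping `retry_interval_ms` between attempts. Returns true on success.
// `attempts_made`, if non-null, receives the number of calls actually made.
inline bool RetryRpc(const std::function<bool()>& rpc, int max_attempts,
                     int retry_interval_ms, int* attempts_made = nullptr) {
  assert(max_attempts > 0 && retry_interval_ms >= 0);
  int attempts = 0;
  bool ok = false;
  while (attempts < max_attempts) {
    ++attempts;
    ok = rpc();
    if (ok) break;
    // Only wait if another attempt will follow.
    if (attempts < max_attempts) {
      std::this_thread::sleep_for(std::chrono::milliseconds(retry_interval_ms));
    }
  }
  if (attempts_made != nullptr) *attempts_made = attempts;
  return ok;
}
```

A caller would pass the RPC as a callable that returns whether it succeeded, for example a lambda wrapping the actual catalog call, with the attempt count and interval taken from the two flags above.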

There is some odd behavior with the retry logic though. If there is a cached client connection, the catalogd crashes, and then a query runs, the impalad will retry the connection {{2 * catalog_client_connection_num_retries}} times because both the RPC and the connection establishment are retried. One way to fix this would be to remove the connection establishment retry and let the RPC retry handle all retries. The problem is that, the way the code is written, any attempt to establish a new connection would then not be retried (an attempt that uses a cached connection would still be retried).

Ideally, the following scenarios are handled correctly (i.e. each is retried exactly {{catalog_client_connection_num_retries}} times):
* New connection establishment
* Cached connection resets
* RPC failures
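
One possible shape for this, sketched under assumptions: a single retry loop with one budget, where a failed connect, a reset of a cached connection, and a failed RPC each consume exactly one attempt. {{CatalogClientModel}} and {{CallWithSingleBudget}} are hypothetical names for illustration only and do not correspond to Impala's client classes; the wait between attempts is left out for brevity.

```cpp
#include <cassert>
#include <functional>

// Hypothetical client model, NOT Impala's ClientConnection: the two callbacks
// stand in for opening a connection to catalogd and sending one RPC over it.
struct CatalogClientModel {
  std::function<bool()> connect;   // returns true if a connection was opened
  std::function<bool()> send_rpc;  // returns true if the RPC succeeded
  bool connected = false;          // true models a cached connection
};

// One retry loop with a single budget. A failed connect, a reset of a cached
// connection, and a failed RPC each consume exactly one attempt, so no
// scenario is tried more than `max_attempts` times in total.
// Returns the number of attempts used, or -1 if the budget was exhausted.
inline int CallWithSingleBudget(CatalogClientModel* client, int max_attempts) {
  assert(client != nullptr && max_attempts > 0);
  for (int attempt = 1; attempt <= max_attempts; ++attempt) {
    if (!client->connected) {
      client->connected = client->connect();
      if (!client->connected) continue;  // connect failure uses this attempt
    }
    if (client->send_rpc()) return attempt;
    client->connected = false;  // drop the connection; reconnect next attempt
  }
  return -1;
}
```

With this structure, a stale cached connection followed by failed reconnects still stops after {{max_attempts}} attempts in total, instead of multiplying the connection retries by the RPC retries.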

Would be nice to rename {{catalog_client_connection_num_retries}} to {{catalog_client_rpc_num_retries}} as well.

> Catalog client should be resilient to temporary Catalog outage
> --------------------------------------------------------------
>
>                 Key: IMPALA-8634
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8634
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Catalog
>    Affects Versions: Impala 3.2.0
>            Reporter: Michael Ho
>            Assignee: Sahil Takiar
>            Priority: Critical
>
> Currently, when the catalog server is down, catalog clients will fail all RPCs sent to it. In essence, DDL queries will fail and the Impala service becomes a lot less functional. Catalog clients should consider retrying failed RPCs with some exponential backoff in between while catalog server is being restarted after crashing. We probably need to add [a test |https://github.com/apache/impala/blob/master/tests/custom_cluster/test_restart_services.py] to exercise the paths of catalog restart to verify coordinators are resilient to it.
> cc'ing [~stakiar], [~joemcdonnell], [~twm378]


