Posted to issues-all@impala.apache.org by "ASF subversion and git services (JIRA)" <ji...@apache.org> on 2019/06/03 16:37:00 UTC
[jira] [Commented] (IMPALA-7665) Bringing up stopped statestore causes queries to fail

    [ https://issues.apache.org/jira/browse/IMPALA-7665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16854796#comment-16854796 ] 

ASF subversion and git services commented on IMPALA-7665:
---------------------------------------------------------

Commit 6b3e5fe426a7cd8b13c18a54fe6c2726ab8667d8 in impala's branch refs/heads/master from Lars Volker
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=6b3e5fe ]

IMPALA-8460: Simplify cluster membership management

This change adds a class to track cluster membership called
ClusterMembershipMgr. It replaces the logic that was partially
duplicated between the ImpalaServer and the Coordinator and makes sure
that the local backend descriptor is consistent (IMPALA-8469).

The ClusterMembershipMgr maintains a view of the cluster membership and
incorporates incoming updates from the statestore. It also registers the
local backend with the statestore after startup. Clients can obtain a
consistent, immutable snapshot of the current cluster membership from
the ClusterMembershipMgr. Additionally, callbacks can be registered to
receive notifications of cluster membership changes. The ImpalaServer
and Frontend use this mechanism.
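The snapshot-plus-callback pattern described above can be sketched as follows. This is a minimal Python illustration, not Impala's C++ ClusterMembershipMgr. The class and method names (MembershipManager, apply_update, get_snapshot, register_callback) and the "host:port" backend-id format are hypothetical. Each update builds a new read-only snapshot instead of mutating the old one, so a client holding an earlier snapshot keeps a consistent view.

```python
import threading
from dataclasses import dataclass
from types import MappingProxyType
from typing import Callable, Dict, List, Mapping


@dataclass(frozen=True)
class Snapshot:
    """Immutable view of cluster membership at one point in time."""
    version: int
    # Read-only mapping from a (hypothetical) backend id to its address.
    backends: Mapping[str, str]


class MembershipManager:
    """Hypothetical sketch of a membership tracker; not Impala's API."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._snapshot = Snapshot(0, MappingProxyType({}))
        self._callbacks: List[Callable[[Snapshot], None]] = []

    def get_snapshot(self) -> Snapshot:
        # Readers get the current snapshot; it is never mutated afterwards.
        with self._lock:
            return self._snapshot

    def register_callback(self, cb: Callable[[Snapshot], None]) -> None:
        with self._lock:
            self._callbacks.append(cb)

    def apply_update(self, added: Dict[str, str], removed: List[str]) -> None:
        """Incorporate one membership delta received from the statestore."""
        with self._lock:
            # Copy, modify, then swap in a new snapshot; old ones stay intact.
            new_backends = dict(self._snapshot.backends)
            new_backends.update(added)
            for backend_id in removed:
                new_backends.pop(backend_id, None)
            self._snapshot = Snapshot(self._snapshot.version + 1,
                                      MappingProxyType(new_backends))
            snapshot, callbacks = self._snapshot, list(self._callbacks)
        # Notify outside the lock so callbacks may call get_snapshot().
        for cb in callbacks:
            cb(snapshot)
```

Because an old snapshot is never modified, a caller can iterate over it without holding the manager's lock, and running callbacks outside the lock avoids deadlocks if a callback reads the membership again.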

This change also generalizes the fix for IMPALA-7665: updates from the
statestore to the cluster membership topic are only made visible to the
rest of the local server after a post-recovery grace period has elapsed.
As part of this, the flag
'failed_backends_query_cancellation_grace_period_ms' is replaced with
'statestore_subscriber_recovery_grace_period_ms'. To distinguish the
initial startup from a post-recovery state, the daemon exposes a new
metric, 'statestore-subscriber.num-connection-failures', which tracks
the total number of connection failures to the statestore over the
process lifetime.
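The grace-period gating can be illustrated with the sketch below. It assumes (this is not taken from Impala's code) that the subscriber keeps the latest statestore view in a pending copy and only publishes it when no grace period is active, treating a failure count of zero as initial startup, which mirrors the role of 'statestore-subscriber.num-connection-failures'. RecoveryGate, maybe_publish and the injectable clock are hypothetical names; grace_period_ms plays the role of 'statestore_subscriber_recovery_grace_period_ms'.

```python
import time
from typing import Callable, Dict, Optional


class RecoveryGate:
    """Hypothetical sketch: hide statestore membership updates from the rest
    of the server during a post-recovery grace period."""

    def __init__(self, grace_period_ms: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.grace_period_s = grace_period_ms / 1000.0
        self.clock = clock  # injectable so the logic can be tested
        # Analogous to the 'statestore-subscriber.num-connection-failures' metric.
        self.num_connection_failures = 0
        self.last_recovery_time: Optional[float] = None
        self.pending: Dict[str, str] = {}  # latest view from the statestore
        self.visible: Dict[str, str] = {}  # view the rest of the server uses

    def on_connection_failure(self) -> None:
        self.num_connection_failures += 1
        self.last_recovery_time = None  # not recovered yet

    def on_reconnected(self) -> None:
        self.last_recovery_time = self.clock()

    def in_grace_period(self) -> bool:
        if self.num_connection_failures == 0:
            return False  # initial startup: publish updates immediately
        if self.last_recovery_time is None:
            return True  # disconnected: keep the last published view
        return self.clock() - self.last_recovery_time < self.grace_period_s

    def maybe_publish(self) -> None:
        # Called on every update; a periodic tick could call it as well.
        if not self.in_grace_period():
            self.visible = dict(self.pending)

    def on_membership_update(self, membership: Dict[str, str]) -> None:
        self.pending = dict(membership)
        self.maybe_publish()
```

In the IMPALA-7665 scenario, a restarted statestore may briefly publish a membership list that lacks backends which have not yet re-registered. With a gate like this, the rest of the server keeps its previous view until the grace period ends, so running queries are not cancelled on account of those backends.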

This change also unifies the naming of executor-related classes; in
particular, it renames "BackendConfig" to "ExecutorGroup". In
anticipation of a subsequent change (IMPALA-8484), it adds maps to store
multiple executor groups.
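One plausible shape for the executor-group maps mentioned above is a dictionary from group name to group. This Python sketch is an assumption for illustration only: ExecutorGroup here is a stand-in for, not a copy of, Impala's class, and ExecutorGroups, the group names and the backend ids are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class ExecutorGroup:
    """Hypothetical stand-in: a named set of executor backends."""
    name: str
    executors: Set[str] = field(default_factory=set)

    def add_executor(self, backend_id: str) -> None:
        self.executors.add(backend_id)

    def remove_executor(self, backend_id: str) -> None:
        self.executors.discard(backend_id)


class ExecutorGroups:
    """Map from group name to group, so one cluster can hold several groups."""

    def __init__(self) -> None:
        self.groups: Dict[str, ExecutorGroup] = {}

    def add(self, group_name: str, backend_id: str) -> None:
        # Create the group on first use, then add the executor to it.
        group = self.groups.setdefault(group_name, ExecutorGroup(group_name))
        group.add_executor(backend_id)

    def remove(self, backend_id: str) -> None:
        # A departing backend is removed from whichever group contains it.
        for group in self.groups.values():
            group.remove_executor(backend_id)
```

With a single group the map holds one entry; keying by name is what would let a later change address several groups independently.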

This change also disables the generation of default operators from the
thrift files and instead adds explicit implementations for the ones that
we rely on. This forces us to explicitly specify comparators when
manipulating containers of thrift structs and will help prevent
accidental bugs.
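Python has no thrift-generated operators, but a dataclass declared without equality or ordering shows the same effect by analogy: sorting the structs fails and equality falls back to object identity until the caller supplies an explicit key or comparator. This is an illustration of the idea, not Impala's thrift or C++ code; BackendDescriptor's fields and the helper names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


# eq=False keeps identity-based equality; order defaults to False, so no "<".
@dataclass(eq=False)
class BackendDescriptor:
    """Stand-in for a thrift struct; the fields are illustrative only."""
    hostname: str
    port: int
    is_coordinator: bool


def backend_key(b: BackendDescriptor) -> Tuple[str, int]:
    """Explicit comparator key: order backends by (hostname, port) only."""
    return (b.hostname, b.port)


def same_backend(a: BackendDescriptor, b: BackendDescriptor) -> bool:
    """Explicit equality: same address means same backend, whatever else differs."""
    return backend_key(a) == backend_key(b)


def sorted_backends(backends: List[BackendDescriptor]) -> List[BackendDescriptor]:
    # sorted(backends) alone raises TypeError because there is no default
    # ordering, so the caller has to state which fields define the order.
    return sorted(backends, key=backend_key)
```

Making the comparison explicit at each call site documents which fields matter, for example whether a flag like is_coordinator should affect equality.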

Testing: This change adds a backend unit test for the new cluster
membership manager. The observable behavior of Impala does not change,
and the existing scheduler unit test and end-to-end tests should verify
that.

Change-Id: Ib3cf9a8bb060d0c6e9ec8868b7b21ce01f8740a3
Reviewed-on: http://gerrit.cloudera.org:8080/13207
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> Bringing up stopped statestore causes queries to fail
> -----------------------------------------------------
>
>                 Key: IMPALA-7665
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7665
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Distributed Exec
>    Affects Versions: Impala 3.1.0
>            Reporter: Tim Armstrong
>            Assignee: Bikramjeet Vig
>            Priority: Critical
>              Labels: query-lifecycle, statestore
>             Fix For: Impala 3.3.0
>
>
> I can reproduce this by running a long-running query and then cycling the statestore:
> {noformat}
> tarmstrong@tarmstrong-box:~/Impala/incubator-impala$ impala-shell.sh -q "select distinct * from tpch10_parquet.lineitem"
> Starting Impala Shell without Kerberos authentication
> Connected to localhost:21000
> Server version: impalad version 3.1.0-SNAPSHOT DEBUG (build c486fb9ea4330e1008fa9b7ceaa60492e43ee120)
> Query: select distinct * from tpch10_parquet.lineitem
> Query submitted at: 2018-10-04 17:06:48 (Coordinator: http://tarmstrong-box:25000)
> {noformat}
> If I kill the statestore, the query runs fine, but if I start up the statestore again, it fails.
> {noformat}
> # In one terminal, start up the statestore
> $ /home/tarmstrong/Impala/incubator-impala/be/build/latest/statestore/statestored -log_filename=statestored -log_dir=/home/tarmstrong/Impala/incubator-impala/logs/cluster -v=1 -logbufsecs=5 -max_log_files=10
> # The running query then fails
> WARNINGS: Failed due to unreachable impalad(s): tarmstrong-box:22001, tarmstrong-box:22002
> {noformat}
> Note that I've seen different subsets of impalads reported as failed, e.g. "Failed due to unreachable impalad(s): tarmstrong-box:22001"


