Posted to issues-all@impala.apache.org by "ASF subversion and git services (JIRA)" <ji...@apache.org> on 2018/09/25 04:55:00 UTC

[jira] [Commented] (IMPALA-7305) membership entry for failed impalad gets stuck in statestore due to race between failure detection and update processing

    [ https://issues.apache.org/jira/browse/IMPALA-7305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16626769#comment-16626769 ] 

ASF subversion and git services commented on IMPALA-7305:
---------------------------------------------------------

Commit e38715e25297cc3643482be04e3b1b273e339b54 in impala's branch refs/heads/master from [~tarmstrong@cloudera.com]
[ https://git-wip-us.apache.org/repos/asf?p=impala.git;h=e38715e ]

IMPALA-7306: regression test for non-removed transient updates

Adds a test for IMPALA-7305 that reproduces the bug by delaying
heartbeats and updates.

Increased some timeouts in the test because they were hit
once after looping for ~12 hours.

Testing:
Manually reintroduced the bug by commenting out the code that
fixed it and confirmed that the test failed.

Change-Id: I6c2a39d8a76cb5371f394b5a97817d8231e473cc
Reviewed-on: http://gerrit.cloudera.org:8080/11470
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> membership entry for failed impalad gets stuck in statestore due to race between failure detection and update processing
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: IMPALA-7305
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7305
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Distributed Exec
>    Affects Versions: Impala 2.5.0, Impala 2.6.0, Impala 2.7.0, Impala 2.8.0, Impala 2.9.0, Impala 2.10.0, Impala 2.11.0
>            Reporter: Tim Armstrong
>            Assignee: Tim Armstrong
>            Priority: Critical
>             Fix For: Impala 2.12.0, Impala 3.1.0
>
>         Attachments: 0001-Repro-CDH-70703.patch
>
>
> I was able to reproduce this bug on a version of Impala pre-IMPALA-4953 with the attached patch that adds a sleep. The patch is a hack and only works on my system (it has a name hardcoded). The trick is to kill the third impalad manually while the cluster is starting up.
> Then the system gets stuck in a state where all impalads think 22002 is alive even though the process was actually killed. Queries fail because they keep getting scheduled on the dead impalad.
> {noformat}
> Known backend(s): 3
> Address	Coordinator	Executor
> tarmstrong-box:22002 	true 	true
> tarmstrong-box:22001 	true 	true
> tarmstrong-box:22000 	true 	true
> {noformat}
> The race seems quite exotic but could occur if there are intermittent transport errors (causing heartbeats to fail) or delays in processing topics, e.g. from lock contention.
> IMPALA-4953 fixes the problem by deleting newly-added transient entries if the subscriber got unregistered while the statestore was processing an update.
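
The race and the fix described above can be illustrated with a small toy model. This is only a sketch under assumptions: the real statestore is C++ and its code is not shown in this message, so the class ToyStatestore, its methods, the mid_update_hook parameter and the subscriber id "impalad-3" are all hypothetical names invented for illustration. The hook stands in for the delay (the sleep in the repro patch) and makes the interleaving deterministic: failure detection unregisters the subscriber after the update has started but before its transient entry is added.

```python
import threading


class ToyStatestore:
    """Toy model of the race; NOT Impala's actual statestore implementation."""

    def __init__(self, apply_fix):
        self.apply_fix = apply_fix
        self.lock = threading.Lock()
        self.registered = set()
        # Transient membership entries: key -> id of the owning subscriber.
        self.membership = {}

    def register(self, sub_id):
        with self.lock:
            self.registered.add(sub_id)

    def unregister(self, sub_id):
        """Failure-detection path: drop the subscriber and its transient entries."""
        with self.lock:
            self.registered.discard(sub_id)
            stale = [k for k, owner in self.membership.items() if owner == sub_id]
            for key in stale:
                del self.membership[key]

    def process_update(self, sub_id, key, mid_update_hook=None):
        """Update path: add a transient entry on behalf of a subscriber."""
        with self.lock:
            if sub_id not in self.registered:
                return
        # Window in which the lock is not held. The hook simulates a delay
        # during which failure detection can unregister the subscriber.
        if mid_update_hook is not None:
            mid_update_hook()
        with self.lock:
            self.membership[key] = sub_id
            # Fix as described in the issue: if the subscriber got unregistered
            # while the update was in flight, delete the newly added entry.
            if self.apply_fix and sub_id not in self.registered:
                del self.membership[key]


def run(apply_fix):
    """Replay the interleaving and return the membership keys left behind."""
    ss = ToyStatestore(apply_fix)
    ss.register("impalad-3")
    ss.process_update(
        "impalad-3",
        "tarmstrong-box:22002",
        mid_update_hook=lambda: ss.unregister("impalad-3"),
    )
    return sorted(ss.membership)


if __name__ == "__main__":
    print("without fix:", run(apply_fix=False))
    print("with fix:   ", run(apply_fix=True))
```

Without the fix the entry for the dead process stays in the membership map, because unregistration ran before the entry existed and nothing cleans it up afterwards. With the fix the map ends up empty.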


