You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Tim Armstrong (Code Review)" <ge...@cloudera.org> on 2018/04/03 21:20:41 UTC

[Impala-ASF-CR] IMPALA-6785: reset failed heartbeat count when re-registering

Tim Armstrong has uploaded this change for review. ( http://gerrit.cloudera.org:8080/9913


Change subject: IMPALA-6785: reset failed heartbeat count when re-registering
......................................................................

IMPALA-6785: reset failed heartbeat count when re-registering

The bug is a mix-up between whether the subscriber or registration ID is
the key for the failure detected. The symptom was that, if the first
topic update won the race with the first heartbeat, no subsequent topic
updates were delivered.

Testing:
Add a regression that reliably reproduces the issue by delaying the
heartbeat and checking that topic updates continue to be sent. The
test reliably fails before the fix and passes after.

Change-Id: I2ad409e2a8e22d081fce97b085b9469ab046bf07
---
M be/src/statestore/statestore.cc
M tests/statestore/test_statestore.py
2 files changed, 28 insertions(+), 3 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/13/9913/1
-- 
To view, visit http://gerrit.cloudera.org:8080/9913
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I2ad409e2a8e22d081fce97b085b9469ab046bf07
Gerrit-Change-Number: 9913
Gerrit-PatchSet: 1
Gerrit-Owner: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-6785: reset failed heartbeat count when re-registering

Posted by "Tianyi Wang (Code Review)" <ge...@cloudera.org>.
Tianyi Wang has posted comments on this change. ( http://gerrit.cloudera.org:8080/9913 )

Change subject: IMPALA-6785: reset failed heartbeat count when re-registering
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9913/3/be/src/statestore/statestore.cc
File be/src/statestore/statestore.cc:

http://gerrit.cloudera.org:8080/#/c/9913/3/be/src/statestore/statestore.cc@873
PS3, Line 873:     if (state == FailureDetector::FAILED) {
If the topic update is processed first, GetPeerState() will return UNKNOWN. How could it get into this branch?



-- 
To view, visit http://gerrit.cloudera.org:8080/9913
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2ad409e2a8e22d081fce97b085b9469ab046bf07
Gerrit-Change-Number: 9913
Gerrit-PatchSet: 3
Gerrit-Owner: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Wed, 04 Apr 2018 01:43:23 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6785: reset failed heartbeat count when re-registering

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/9913 )

Change subject: IMPALA-6785: reset failed heartbeat count when re-registering
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9913/4/be/src/statestore/statestore.h
File be/src/statestore/statestore.h:

http://gerrit.cloudera.org:8080/#/c/9913/4/be/src/statestore/statestore.h@520
PS4, Line 520:  The
             :   /// subscriber ID is used to identify peers for failure detection purposes.
why is it that using subscriber ID is the right thing, rather than go back to using the registration id everywhere for failure detection?



-- 
To view, visit http://gerrit.cloudera.org:8080/9913
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2ad409e2a8e22d081fce97b085b9469ab046bf07
Gerrit-Change-Number: 9913
Gerrit-PatchSet: 4
Gerrit-Owner: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Wed, 04 Apr 2018 16:42:35 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6785: reset failed heartbeat count when re-registering

Posted by "Sailesh Mukil (Code Review)" <ge...@cloudera.org>.
Sailesh Mukil has posted comments on this change. ( http://gerrit.cloudera.org:8080/9913 )

Change subject: IMPALA-6785: reset failed heartbeat count when re-registering
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9913/4/be/src/statestore/statestore.h
File be/src/statestore/statestore.h:

http://gerrit.cloudera.org:8080/#/c/9913/4/be/src/statestore/statestore.h@520
PS4, Line 520:  The
             :   /// subscriber ID is used to identify peers for failure detection purposes.
> I don't think it actually matters so long as its consistent. We need to evi
I think it does matter in certain cases. Since the registration ID is chosen by the statestore and the subscriber ID is chosen by the subscriber.

So if a subscriber has to re-register for some reason, the statestore would see the same subscriber ID and know that it's a re-registration vs. a new subscriber.



-- 
To view, visit http://gerrit.cloudera.org:8080/9913
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2ad409e2a8e22d081fce97b085b9469ab046bf07
Gerrit-Change-Number: 9913
Gerrit-PatchSet: 4
Gerrit-Owner: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Wed, 04 Apr 2018 16:59:55 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6785: reset failed heartbeat count when re-registering

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9913 )

Change subject: IMPALA-6785: reset failed heartbeat count when re-registering
......................................................................


Patch Set 4: Code-Review+1

Carry +1 from Bharath


-- 
To view, visit http://gerrit.cloudera.org:8080/9913
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2ad409e2a8e22d081fce97b085b9469ab046bf07
Gerrit-Change-Number: 9913
Gerrit-PatchSet: 4
Gerrit-Owner: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Wed, 04 Apr 2018 15:07:34 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6785: reset failed heartbeat count when re-registering

Posted by "Bharath Vissapragada (Code Review)" <ge...@cloudera.org>.
Bharath Vissapragada has posted comments on this change. ( http://gerrit.cloudera.org:8080/9913 )

Change subject: IMPALA-6785: reset failed heartbeat count when re-registering
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9913/3/be/src/statestore/statestore.cc
File be/src/statestore/statestore.cc:

http://gerrit.cloudera.org:8080/#/c/9913/3/be/src/statestore/statestore.cc@873
PS3, Line 873:     if (state == FailureDetector::FAILED) {
> If the topic update is processed first, GetPeerState() will return UNKNOWN.
I had the same question for Tim over chat (after reading the commit message), but it appears that the logging here is intended for resubscription of subscribers that are already marked FAILED. I got that after reading the added test (test_heartbeat_failure_reset). Tim, please correct me if I'm wrong.



-- 
To view, visit http://gerrit.cloudera.org:8080/9913
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2ad409e2a8e22d081fce97b085b9469ab046bf07
Gerrit-Change-Number: 9913
Gerrit-PatchSet: 3
Gerrit-Owner: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Wed, 04 Apr 2018 03:48:22 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6785: reset failed heartbeat count when re-registering

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9913 )

Change subject: IMPALA-6785: reset failed heartbeat count when re-registering
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9913/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/9913/2//COMMIT_MSG@16
PS2, Line 16: 
> nit: Add that this is an extension of IMPALA-3613, so it's easier to unders
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/9913
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2ad409e2a8e22d081fce97b085b9469ab046bf07
Gerrit-Change-Number: 9913
Gerrit-PatchSet: 2
Gerrit-Owner: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Tue, 03 Apr 2018 23:47:41 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6785: reset failed heartbeat count when re-registering

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Hello Bharath Vissapragada, Tianyi Wang, Sailesh Mukil, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/9913

to look at the new patch set (#4).

Change subject: IMPALA-6785: reset failed heartbeat count when re-registering
......................................................................

IMPALA-6785: reset failed heartbeat count when re-registering

When a subscriber re-registers with the same subscriber ID, we need
to reset failure detection info so that we don't erroneously think
that the subscriber has failed.

The bug is a mix-up between whether the subscriber or registration ID is
the key for the failure detected. The symptom was that, if the first
topic update won the race with the first heartbeat, no subsequent topic
updates were delivered. It was introduced when IMPALA-3613 updated some,
but not all, places to use the subscriber ID as the key. It wasn't
detected because we had no tests directly testing this code path, and
it appears that the race almost never happens on smaller clusters.

This patch fixes the problem in two places. Fixing either place is
actually sufficient to fix the bug.

Testing:
Add a regression that reliably reproduces the issue by delaying the
heartbeat and checking that topic updates continue to be sent. The
test reliably fails before the fix and passes after.

Change-Id: I2ad409e2a8e22d081fce97b085b9469ab046bf07
---
M be/src/statestore/statestore.cc
M be/src/statestore/statestore.h
M tests/statestore/test_statestore.py
3 files changed, 37 insertions(+), 5 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/13/9913/4
-- 
To view, visit http://gerrit.cloudera.org:8080/9913
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2ad409e2a8e22d081fce97b085b9469ab046bf07
Gerrit-Change-Number: 9913
Gerrit-PatchSet: 4
Gerrit-Owner: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-6785: reset failed heartbeat count when re-registering

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9913 )

Change subject: IMPALA-6785: reset failed heartbeat count when re-registering
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9913/4/be/src/statestore/statestore.h
File be/src/statestore/statestore.h:

http://gerrit.cloudera.org:8080/#/c/9913/4/be/src/statestore/statestore.h@520
PS4, Line 520:  The
             :   /// subscriber ID is used to identify peers for failure detection purposes.
> why is it that using subscriber ID is the right thing, rather than go back 
I don't think it actually matters so long as its consistent. We need to evict the entry from the cache when the subscriber is unregistered regardless so that we don't get stale state or unbounded growth in the entries, so for the lifetime of the failure detector entry there's a 1:1 relationship between subscriber and registration ID.

One advantage of using subscriber ID is that the test I added can detect if we mess this up. If we had a similar bug with registration ID we'd just get the failure detector growing unbounded.



-- 
To view, visit http://gerrit.cloudera.org:8080/9913
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2ad409e2a8e22d081fce97b085b9469ab046bf07
Gerrit-Change-Number: 9913
Gerrit-PatchSet: 4
Gerrit-Owner: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Wed, 04 Apr 2018 16:49:02 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6785: reset failed heartbeat count when re-registering

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/9913 )

Change subject: IMPALA-6785: reset failed heartbeat count when re-registering
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9913/4/be/src/statestore/statestore.h
File be/src/statestore/statestore.h:

http://gerrit.cloudera.org:8080/#/c/9913/4/be/src/statestore/statestore.h@520
PS4, Line 520:  The
             :   /// subscriber ID is used to identify peers for failure detection purposes.
> Right, but in either case we want to start with a clean failure detection s
Doesn't that argue that we should use the registeration id?  Using the subscriber id means that the failure detector will consider the history from the previous instance of this subscriber since the same key is used, no?



-- 
To view, visit http://gerrit.cloudera.org:8080/9913
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2ad409e2a8e22d081fce97b085b9469ab046bf07
Gerrit-Change-Number: 9913
Gerrit-PatchSet: 4
Gerrit-Owner: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Wed, 04 Apr 2018 17:10:33 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6785: reset failed heartbeat count when re-registering

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9913 )

Change subject: IMPALA-6785: reset failed heartbeat count when re-registering
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9913/1/be/src/statestore/statestore.cc
File be/src/statestore/statestore.cc:

http://gerrit.cloudera.org:8080/#/c/9913/1/be/src/statestore/statestore.cc@909
PS1, Line 909:   failure_detector_->EvictPeer(PrintId(subscriber->registration_id()));
Hmm, this is also suspicious



-- 
To view, visit http://gerrit.cloudera.org:8080/9913
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2ad409e2a8e22d081fce97b085b9469ab046bf07
Gerrit-Change-Number: 9913
Gerrit-PatchSet: 1
Gerrit-Owner: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Tue, 03 Apr 2018 21:21:49 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6785: reset failed heartbeat count when re-registering

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9913 )

Change subject: IMPALA-6785: reset failed heartbeat count when re-registering
......................................................................


Patch Set 5:

I'm pretty sure the UpdateHeartbeat call is redundant, since we remove any existing entry in Unregistersubscriber, which is guaranteed to run before reregistering a subscriber with the same id. There are probably some opportunities like this to simplify the logic.


-- 
To view, visit http://gerrit.cloudera.org:8080/9913
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2ad409e2a8e22d081fce97b085b9469ab046bf07
Gerrit-Change-Number: 9913
Gerrit-PatchSet: 5
Gerrit-Owner: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Wed, 04 Apr 2018 19:46:49 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6785: reset failed heartbeat count when re-registering

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/9913 )

Change subject: IMPALA-6785: reset failed heartbeat count when re-registering
......................................................................

IMPALA-6785: reset failed heartbeat count when re-registering

When a subscriber re-registers with the same subscriber ID, we need
to reset failure detection info so that we don't erroneously think
that the subscriber has failed.

The bug is a mix-up between whether the subscriber or registration ID is
the key for the failure detected. The symptom was that, if the first
topic update won the race with the first heartbeat, no subsequent topic
updates were delivered. It was introduced when IMPALA-3613 updated some,
but not all, places to use the subscriber ID as the key. It wasn't
detected because we had no tests directly testing this code path, and
it appears that the race almost never happens on smaller clusters.

This patch fixes the problem in two places. Fixing either place is
actually sufficient to fix the bug.

Testing:
Add a regression that reliably reproduces the issue by delaying the
heartbeat and checking that topic updates continue to be sent. The
test reliably fails before the fix and passes after.

Change-Id: I2ad409e2a8e22d081fce97b085b9469ab046bf07
Reviewed-on: http://gerrit.cloudera.org:8080/9913
Reviewed-by: Tim Armstrong <ta...@cloudera.com>
Reviewed-by: Dan Hecht <dh...@cloudera.com>
Tested-by: Impala Public Jenkins
---
M be/src/statestore/statestore.cc
M be/src/statestore/statestore.h
M tests/statestore/test_statestore.py
3 files changed, 37 insertions(+), 5 deletions(-)

Approvals:
  Tim Armstrong: Looks good to me, but someone else must approve
  Dan Hecht: Looks good to me, approved
  Impala Public Jenkins: Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/9913
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I2ad409e2a8e22d081fce97b085b9469ab046bf07
Gerrit-Change-Number: 9913
Gerrit-PatchSet: 5
Gerrit-Owner: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-6785: reset failed heartbeat count when re-registering

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9913 )

Change subject: IMPALA-6785: reset failed heartbeat count when re-registering
......................................................................


Patch Set 3:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/9913/3/be/src/statestore/statestore.cc
File be/src/statestore/statestore.cc:

http://gerrit.cloudera.org:8080/#/c/9913/3/be/src/statestore/statestore.cc@551
PS3, Line 551:     failure_detector_->UpdateHeartbeat(subscriber_id, true);
> I think we should update the header statestore.h that failure_detector_ is 
Yeah good idea.


http://gerrit.cloudera.org:8080/#/c/9913/3/be/src/statestore/statestore.cc@873
PS3, Line 873:     if (state == FailureDetector::FAILED) {
> I had the same question for Tim over chat (after reading the commit message
Yeah there was some context in the JIRA that wasn't super-clear in the commit message - the problematic case is when you restart an impala daemon and the failure info is left over from the previous subscription.



-- 
To view, visit http://gerrit.cloudera.org:8080/9913
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2ad409e2a8e22d081fce97b085b9469ab046bf07
Gerrit-Change-Number: 9913
Gerrit-PatchSet: 3
Gerrit-Owner: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Wed, 04 Apr 2018 04:05:45 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6785: reset failed heartbeat count when re-registering

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9913 )

Change subject: IMPALA-6785: reset failed heartbeat count when re-registering
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9913/1/be/src/statestore/statestore.cc
File be/src/statestore/statestore.cc:

http://gerrit.cloudera.org:8080/#/c/9913/1/be/src/statestore/statestore.cc@880
PS1, Line 880:       }
We're also missing logging here that would have made it much easier to narrow down the issue.



-- 
To view, visit http://gerrit.cloudera.org:8080/9913
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2ad409e2a8e22d081fce97b085b9469ab046bf07
Gerrit-Change-Number: 9913
Gerrit-PatchSet: 1
Gerrit-Owner: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Tue, 03 Apr 2018 21:50:17 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6785: reset failed heartbeat count when re-registering

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/9913 )

Change subject: IMPALA-6785: reset failed heartbeat count when re-registering
......................................................................


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9913/4/be/src/statestore/statestore.h
File be/src/statestore/statestore.h:

http://gerrit.cloudera.org:8080/#/c/9913/4/be/src/statestore/statestore.h@520
PS4, Line 520:  The
             :   /// subscriber ID is used to identify peers for failure detection purposes.
> It doesn't though - we reset the failure detection state when registering a
Yeah. Not sure I totally understand why it was sufficient to only update the UpdateHeartbeat() call and not the EvictPeer() to fix the bug (maybe it was fixing an instance of the bug but not the bug in general?). Seems like the EvictPeer() part of the change was also needed (which I now see is also called on reregistration no matter what).

Anyway, doesn't really matter, I agree the new code is fine.



-- 
To view, visit http://gerrit.cloudera.org:8080/9913
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2ad409e2a8e22d081fce97b085b9469ab046bf07
Gerrit-Change-Number: 9913
Gerrit-PatchSet: 5
Gerrit-Owner: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Wed, 04 Apr 2018 19:30:10 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6785: reset failed heartbeat count when re-registering

Posted by "Sailesh Mukil (Code Review)" <ge...@cloudera.org>.
Sailesh Mukil has posted comments on this change. ( http://gerrit.cloudera.org:8080/9913 )

Change subject: IMPALA-6785: reset failed heartbeat count when re-registering
......................................................................


Patch Set 2: Code-Review+1

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9913/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/9913/2//COMMIT_MSG@16
PS2, Line 16: 
nit: Add that this is an extension of IMPALA-3613, so it's easier to understand the context.



-- 
To view, visit http://gerrit.cloudera.org:8080/9913
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2ad409e2a8e22d081fce97b085b9469ab046bf07
Gerrit-Change-Number: 9913
Gerrit-PatchSet: 2
Gerrit-Owner: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Tue, 03 Apr 2018 23:42:30 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6785: reset failed heartbeat count when re-registering

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9913 )

Change subject: IMPALA-6785: reset failed heartbeat count when re-registering
......................................................................


Patch Set 5:

I'm pretty sure the UpdateHeartbeat call is redundant, since we remove any existing entry in Unregistersubscriber, which is guaranteed to run before reregistering a subscriber with the same id. There are probably some opportunities like this to simplify the logic.


-- 
To view, visit http://gerrit.cloudera.org:8080/9913
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2ad409e2a8e22d081fce97b085b9469ab046bf07
Gerrit-Change-Number: 9913
Gerrit-PatchSet: 5
Gerrit-Owner: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Wed, 04 Apr 2018 19:47:09 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6785: reset failed heartbeat count when re-registering

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Hello Bharath Vissapragada, Tianyi Wang, Sailesh Mukil, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/9913

to look at the new patch set (#3).

Change subject: IMPALA-6785: reset failed heartbeat count when re-registering
......................................................................

IMPALA-6785: reset failed heartbeat count when re-registering

The bug is a mix-up between whether the subscriber or registration ID is
the key for the failure detected. The symptom was that, if the first
topic update won the race with the first heartbeat, no subsequent topic
updates were delivered. It was introduced when IMPALA-3613 updated some,
but not all, places to use the subscriber ID as the key. It wasn't
detected because we had no tests directly testing this code path, and
it appears that the race almost never happens on smaller clusters.

This patch fixes the problem in two places. Fixing either place is
actually sufficient to fix the bug.

Testing:
Add a regression that reliably reproduces the issue by delaying the
heartbeat and checking that topic updates continue to be sent. The
test reliably fails before the fix and passes after.

Change-Id: I2ad409e2a8e22d081fce97b085b9469ab046bf07
---
M be/src/statestore/statestore.cc
M tests/statestore/test_statestore.py
2 files changed, 32 insertions(+), 4 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/13/9913/3
-- 
To view, visit http://gerrit.cloudera.org:8080/9913
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2ad409e2a8e22d081fce97b085b9469ab046bf07
Gerrit-Change-Number: 9913
Gerrit-PatchSet: 3
Gerrit-Owner: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-6785: reset failed heartbeat count when re-registering

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9913 )

Change subject: IMPALA-6785: reset failed heartbeat count when re-registering
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9913/4/be/src/statestore/statestore.h
File be/src/statestore/statestore.h:

http://gerrit.cloudera.org:8080/#/c/9913/4/be/src/statestore/statestore.h@520
PS4, Line 520:  The
             :   /// subscriber ID is used to identify peers for failure detection purposes.
> I think it does matter in certain cases. Since the registration ID is chose
Right, but in either case we want to start with a clean failure detection state for the subcriber upon reregistration.



-- 
To view, visit http://gerrit.cloudera.org:8080/9913
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2ad409e2a8e22d081fce97b085b9469ab046bf07
Gerrit-Change-Number: 9913
Gerrit-PatchSet: 4
Gerrit-Owner: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Wed, 04 Apr 2018 17:08:16 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6785: reset failed heartbeat count when re-registering

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/9913 )

Change subject: IMPALA-6785: reset failed heartbeat count when re-registering
......................................................................


Patch Set 4:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/2234/


-- 
To view, visit http://gerrit.cloudera.org:8080/9913
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2ad409e2a8e22d081fce97b085b9469ab046bf07
Gerrit-Change-Number: 9913
Gerrit-PatchSet: 4
Gerrit-Owner: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Wed, 04 Apr 2018 15:10:23 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6785: reset failed heartbeat count when re-registering

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9913 )

Change subject: IMPALA-6785: reset failed heartbeat count when re-registering
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9913/4/be/src/statestore/statestore.h
File be/src/statestore/statestore.h:

http://gerrit.cloudera.org:8080/#/c/9913/4/be/src/statestore/statestore.h@520
PS4, Line 520:  The
             :   /// subscriber ID is used to identify peers for failure detection purposes.
> Doesn't that argue that we should use the registeration id?  Using the subs
It doesn't though - we reset the failure detection state when registering a new subcriber or unregistering an existing one (making that work that was the whole point of this patch...).



-- 
To view, visit http://gerrit.cloudera.org:8080/9913
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2ad409e2a8e22d081fce97b085b9469ab046bf07
Gerrit-Change-Number: 9913
Gerrit-PatchSet: 4
Gerrit-Owner: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Wed, 04 Apr 2018 17:53:44 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6785: reset failed heartbeat count when re-registering

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has uploaded a new patch set (#2). ( http://gerrit.cloudera.org:8080/9913 )

Change subject: IMPALA-6785: reset failed heartbeat count when re-registering
......................................................................

IMPALA-6785: reset failed heartbeat count when re-registering

The bug is a mix-up between whether the subscriber or registration ID is
the key for the failure detected. The symptom was that, if the first
topic update won the race with the first heartbeat, no subsequent topic
updates were delivered.

This patch fixes the problem in two places. Fixing either place is
actually sufficient to fix the bug.

Testing:
Add a regression that reliably reproduces the issue by delaying the
heartbeat and checking that topic updates continue to be sent. The
test reliably fails before the fix and passes after.

Change-Id: I2ad409e2a8e22d081fce97b085b9469ab046bf07
---
M be/src/statestore/statestore.cc
M tests/statestore/test_statestore.py
2 files changed, 32 insertions(+), 4 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/13/9913/2
-- 
To view, visit http://gerrit.cloudera.org:8080/9913
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2ad409e2a8e22d081fce97b085b9469ab046bf07
Gerrit-Change-Number: 9913
Gerrit-PatchSet: 2
Gerrit-Owner: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-6785: reset failed heartbeat count when re-registering

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/9913 )

Change subject: IMPALA-6785: reset failed heartbeat count when re-registering
......................................................................


Patch Set 4: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/9913
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2ad409e2a8e22d081fce97b085b9469ab046bf07
Gerrit-Change-Number: 9913
Gerrit-PatchSet: 4
Gerrit-Owner: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Wed, 04 Apr 2018 16:50:54 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6785: reset failed heartbeat count when re-registering

Posted by "Bharath Vissapragada (Code Review)" <ge...@cloudera.org>.
Bharath Vissapragada has posted comments on this change. ( http://gerrit.cloudera.org:8080/9913 )

Change subject: IMPALA-6785: reset failed heartbeat count when re-registering
......................................................................


Patch Set 3: Code-Review+1

(1 comment)

Patch lgtm

http://gerrit.cloudera.org:8080/#/c/9913/3/be/src/statestore/statestore.cc
File be/src/statestore/statestore.cc:

http://gerrit.cloudera.org:8080/#/c/9913/3/be/src/statestore/statestore.cc@551
PS3, Line 551:     failure_detector_->UpdateHeartbeat(subscriber_id, true);
I think we should update the header statestore.h that failure_detector_ is based on subscriber_id



-- 
To view, visit http://gerrit.cloudera.org:8080/9913
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2ad409e2a8e22d081fce97b085b9469ab046bf07
Gerrit-Change-Number: 9913
Gerrit-PatchSet: 3
Gerrit-Owner: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Wed, 04 Apr 2018 01:27:18 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6785: reset failed heartbeat count when re-registering

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/9913 )

Change subject: IMPALA-6785: reset failed heartbeat count when re-registering
......................................................................


Patch Set 4: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/9913
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2ad409e2a8e22d081fce97b085b9469ab046bf07
Gerrit-Change-Number: 9913
Gerrit-PatchSet: 4
Gerrit-Owner: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Wed, 04 Apr 2018 18:58:41 +0000
Gerrit-HasComments: No