You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Todd Lipcon (Code Review)" <ge...@cloudera.org> on 2016/11/15 06:15:03 UTC

[kudu-CR] KUDU-1745. Avoid crashing during failed master lookup

Hello Adar Dembo, Alexey Serbin,

I'd like you to do a code review.  Please visit

    http://gerrit.cloudera.org:8080/5089

to review the following change.

Change subject: KUDU-1745. Avoid crashing during failed master lookup
......................................................................

KUDU-1745. Avoid crashing during failed master lookup

This addresses a SEGV seen when handling a simultaneous failure talking
to both the master and the tablet servers. In this case, the failed
tablet server object is NULL.

This is only a partial fix/workaround - it avoids the crash but it seems
like the retry backoff is not working as expected. Nonetheless, it is at
least an improvement.

Change-Id: I0a06c31c5ad4e26928a7fcabf24f9fd7a02a7e38
---
M src/kudu/client/meta_cache.cc
M src/kudu/integration-tests/client_failover-itest.cc
M src/kudu/rpc/retriable_rpc.h
M src/kudu/rpc/rpc.h
4 files changed, 43 insertions(+), 6 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/89/5089/1
-- 
To view, visit http://gerrit.cloudera.org:8080/5089
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I0a06c31c5ad4e26928a7fcabf24f9fd7a02a7e38
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>

[kudu-CR] KUDU-1745. Avoid crashing during failed master lookup

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change.

Change subject: KUDU-1745. Avoid crashing during failed master lookup
......................................................................


Patch Set 3: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/5089
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0a06c31c5ad4e26928a7fcabf24f9fd7a02a7e38
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] KUDU-1745. Avoid crashing during failed master lookup

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change.

Change subject: KUDU-1745. Avoid crashing during failed master lookup
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5089/1/src/kudu/rpc/retriable_rpc.h
File src/kudu/rpc/retriable_rpc.h:

PS1, Line 151: server != nullptr
Does it make sense to log something in the other case or we expect it would just crash somewhere in the call chain?


-- 
To view, visit http://gerrit.cloudera.org:8080/5089
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0a06c31c5ad4e26928a7fcabf24f9fd7a02a7e38
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-HasComments: Yes

[kudu-CR] KUDU-1745. Avoid crashing during failed master lookup

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: KUDU-1745. Avoid crashing during failed master lookup
......................................................................


Patch Set 2:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/5089/2/src/kudu/integration-tests/client_failover-itest.cc
File src/kudu/integration-tests/client_failover-itest.cc:

Line 222:   TestWorkload workload(cluster_.get());
> Couple questions about the workload:
will clean these items up.


Line 229:   SleepFor(MonoDelta::FromSeconds(1));
> Why is this needed?
Need to get the tablet server successfully looked up before crashing the servers. Otherwise the bug wasn't triggering. I'll add a comment.


-- 
To view, visit http://gerrit.cloudera.org:8080/5089
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0a06c31c5ad4e26928a7fcabf24f9fd7a02a7e38
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-1745. Avoid crashing during failed master lookup

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5089

to look at the new patch set (#3).

Change subject: KUDU-1745. Avoid crashing during failed master lookup
......................................................................

KUDU-1745. Avoid crashing during failed master lookup

This addresses a SEGV seen when handling a simultaneous failure talking
to both the master and the tablet servers. In this case, the failed
tablet server object is NULL.

This is only a partial fix/workaround - it avoids the crash but it seems
like the retry backoff is not working as expected. Nonetheless, it is at
least an improvement.

Change-Id: I0a06c31c5ad4e26928a7fcabf24f9fd7a02a7e38
---
M src/kudu/client/meta_cache.cc
M src/kudu/integration-tests/client_failover-itest.cc
M src/kudu/integration-tests/test_workload.cc
M src/kudu/integration-tests/test_workload.h
M src/kudu/rpc/retriable_rpc.h
M src/kudu/rpc/rpc.h
6 files changed, 65 insertions(+), 6 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/89/5089/3
-- 
To view, visit http://gerrit.cloudera.org:8080/5089
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0a06c31c5ad4e26928a7fcabf24f9fd7a02a7e38
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-1745. Avoid crashing during failed master lookup

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change.

Change subject: KUDU-1745. Avoid crashing during failed master lookup
......................................................................


Patch Set 2:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/5089/2/src/kudu/integration-tests/client_failover-itest.cc
File src/kudu/integration-tests/client_failover-itest.cc:

Line 222:   TestWorkload workload(cluster_.get());
Couple questions about the workload:
1. Any particular reason why Setup() splits half of the setter calls? If not, can you move it to just before Start()?
2. Some of the overrides make sense (i.e. num replicas). But why did you need to override the write batch size and timeout values?


Line 229:   SleepFor(MonoDelta::FromSeconds(1));
Why is this needed?


-- 
To view, visit http://gerrit.cloudera.org:8080/5089
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0a06c31c5ad4e26928a7fcabf24f9fd7a02a7e38
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-1745. Avoid crashing during failed master lookup

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has submitted this change and it was merged.

Change subject: KUDU-1745. Avoid crashing during failed master lookup
......................................................................


KUDU-1745. Avoid crashing during failed master lookup

This addresses a SEGV seen when handling a simultaneous failure talking
to both the master and the tablet servers. In this case, the failed
tablet server object is NULL.

This is only a partial fix/workaround - it avoids the crash but it seems
like the retry backoff is not working as expected. Nonetheless, it is at
least an improvement.

Change-Id: I0a06c31c5ad4e26928a7fcabf24f9fd7a02a7e38
Reviewed-on: http://gerrit.cloudera.org:8080/5089
Reviewed-by: Adar Dembo <ad...@cloudera.com>
Tested-by: Kudu Jenkins
---
M src/kudu/client/meta_cache.cc
M src/kudu/integration-tests/client_failover-itest.cc
M src/kudu/integration-tests/test_workload.cc
M src/kudu/integration-tests/test_workload.h
M src/kudu/rpc/retriable_rpc.h
M src/kudu/rpc/rpc.h
6 files changed, 65 insertions(+), 6 deletions(-)

Approvals:
  Adar Dembo: Looks good to me, approved
  Kudu Jenkins: Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/5089
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I0a06c31c5ad4e26928a7fcabf24f9fd7a02a7e38
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-1745. Avoid crashing during failed master lookup

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: KUDU-1745. Avoid crashing during failed master lookup
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5089/1/src/kudu/rpc/retriable_rpc.h
File src/kudu/rpc/retriable_rpc.h:

PS1, Line 151: server != nullptr
> Does it make sense to log something in the other case or we expect it would
I wasn't quite sure what to log, since the control flow is pretty weird here. That's why I left the TODO. Looking at the output of the unit test, though, it's not quiet - a bunch of logging output does result including the underlying connection refused errors.


-- 
To view, visit http://gerrit.cloudera.org:8080/5089
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0a06c31c5ad4e26928a7fcabf24f9fd7a02a7e38
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes