Posted to reviews@kudu.apache.org by "Adar Dembo (Code Review)" <ge...@cloudera.org> on 2016/08/16 01:35:14 UTC

[kudu-CR] catalog manager: avoid more races between Init() and GetTabletPeer()

Hello Todd Lipcon,

I'd like you to do a code review.  Please visit

    http://gerrit.cloudera.org:8080/3997

to review the following change.

Change subject: catalog_manager: avoid more races between Init() and GetTabletPeer()
......................................................................

catalog_manager: avoid more races between Init() and GetTabletPeer()

Commit 2525ad0 took a stab at this, but apparently it wasn't enough due to
the tablet_id() call. So here's another attempt, where sys_catalog_ is only
set when it is fully formed (i.e. when it has a functional TabletPeer).

Below I've included test output when the race hits.

master_replication-itest: /home/jenkins-slave/workspace/kudu-3/src/kudu/gutil/ref_counted.h:273: T *scoped_refptr<kudu::tablet::TabletPeer>::operator->() const [T = kudu::tablet::TabletPeer]: Assertion `ptr_ != __null' failed.
*** Aborted at 1471309445 (unix time) try "date -d @1471309445" if you are using GNU date ***
PC: @     0x7f330225dcc9 gsignal
*** SIGABRT (@0x3e800006e90) received by PID 28304 (TID 0x7f32f06eb700) from PID 28304; stack trace: ***
    @           0x42e687 __tsan::CallUserSignalHandler() at /home/jenkins-slave/workspace/kudu-3/thirdparty/llvm-3.8.0.src/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:1962
    @           0x42f4d3 rtl_sigaction() at /home/jenkins-slave/workspace/kudu-3/thirdparty/llvm-3.8.0.src/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:2039
    @     0x7f33090a4340 (unknown) at ??:0
    @     0x7f330225dcc9 gsignal at ??:0
    @     0x7f33022610d8 abort at ??:0
    @     0x7f3302256b86 (unknown) at ??:0
    @     0x7f3302256c32 __assert_fail at ??:0
    @     0x7f330ca13130 scoped_refptr<>::operator->() at ??:0
    @     0x7f330ca1a952 kudu::master::SysCatalogTable::tablet_id() at ??:0
    @     0x7f330ca0b136 kudu::master::CatalogManager::GetTabletPeer() at ??:0
    @     0x7f330c69214d kudu::tserver::(anonymous namespace)::LookupTabletPeerOrRespond<>() at ??:0
    @     0x7f330c691bab kudu::tserver::ConsensusServiceImpl::RequestConsensusVote() at ??:0
    @     0x7f3307c9fca5 kudu::consensus::ConsensusServiceIf::ConsensusServiceIf()::$_1::operator()() at ??:0
    @     0x7f3307c9fabf std::_Function_handler<>::_M_invoke() at ??:0
    @     0x7f3306bd7219 std::function<>::operator()() at ??:0
    @     0x7f3306bd6c8e kudu::rpc::GeneratedServiceIf::Handle() at ??:0
    @     0x7f3306bd8b3e kudu::rpc::ServicePool::RunThread() at ??:0
    @     0x7f3306bdaa27 boost::_mfi::mf0<>::operator()() at ??:0
    @     0x7f3306bda98b boost::_bi::list1<>::operator()<>() at ??:0
    @     0x7f3306bda934 boost::_bi::bind_t<>::operator()() at ??:0
    @     0x7f3306bda75a boost::detail::function::void_function_obj_invoker0<>::invoke() at ??:0
    @     0x7f3306b758b2 boost::function0<>::operator()() at ??:0
    @     0x7f3304962630 kudu::Thread::SuperviseThread() at ??:0

Change-Id: I43fdc6499cb84d2053bed08b689fe5a08a6761d6
---
M src/kudu/master/catalog_manager.cc
1 file changed, 9 insertions(+), 7 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/97/3997/1
-- 
To view, visit http://gerrit.cloudera.org:8080/3997
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I43fdc6499cb84d2053bed08b689fe5a08a6761d6
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] catalog manager: avoid more races between Init() and GetTabletPeer()

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Dinesh Bhat has posted comments on this change.

Change subject: catalog_manager: avoid more races between Init() and GetTabletPeer()
......................................................................


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/3997/1//COMMIT_MSG
Commit Message:

Line 15: master_replication-itest: /home/jenkins-slave/workspace/kudu-3/src/kudu/gutil/ref_counted.h:273: T *scoped_refptr<kudu::tablet::TabletPeer>::operator->() const [T = kudu::tablet::TabletPeer]: Assertion `ptr_ != __null' failed.
This is the first time I've seen a stack trace in a commit message. Well, as they say, there's a first time for everything :)


http://gerrit.cloudera.org:8080/#/c/3997/1/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

Line 706:   sys_catalog_.reset(new_catalog.release());
Looking purely at this context, would this race still be seen if we were to make sys_catalog_ a shared_ptr instead of a gscoped_ptr?



Gerrit-MessageType: comment
Gerrit-Change-Id: I43fdc6499cb84d2053bed08b689fe5a08a6761d6
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] catalog manager: avoid more races between Init() and GetTabletPeer()

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has abandoned this change.

Change subject: catalog_manager: avoid more races between Init() and GetTabletPeer()
......................................................................


Abandoned

After some more digging, it appears the race is between Shutdown() and an RPC in GetTabletPeer(). There shouldn't be any outstanding RPCs by the time we reach CatalogManager::Shutdown(), so I'm stumped.

Anyway, this patch only made things worse, so I'll throw it away.


Gerrit-MessageType: abandon
Gerrit-Change-Id: I43fdc6499cb84d2053bed08b689fe5a08a6761d6
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] catalog manager: avoid race in InitSysCatalogAsync() and GetTabletPeer()

Posted by "Kudu Jenkins (Code Review)" <ge...@cloudera.org>.
Kudu Jenkins has posted comments on this change.

Change subject: catalog_manager: avoid race in InitSysCatalogAsync() and GetTabletPeer()
......................................................................


Patch Set 2:

Build Started http://104.196.14.100/job/kudu-gerrit/2948/


Gerrit-MessageType: comment
Gerrit-Change-Id: I43fdc6499cb84d2053bed08b689fe5a08a6761d6
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] catalog manager: avoid more races between Init() and GetTabletPeer()

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has restored this change.

Change subject: catalog_manager: avoid more races between Init() and GetTabletPeer()
......................................................................


Restored

I found a way to hack through this.


Gerrit-MessageType: restore
Gerrit-Change-Id: I43fdc6499cb84d2053bed08b689fe5a08a6761d6
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] catalog manager: avoid race in InitSysCatalogAsync() and GetTabletPeer()

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: catalog_manager: avoid race in InitSysCatalogAsync() and GetTabletPeer()
......................................................................


Patch Set 2: Code-Review+2


Gerrit-MessageType: comment
Gerrit-Change-Id: I43fdc6499cb84d2053bed08b689fe5a08a6761d6
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] catalog manager: avoid race in InitSysCatalogAsync() and GetTabletPeer()

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has submitted this change and it was merged.

Change subject: catalog_manager: avoid race in InitSysCatalogAsync() and GetTabletPeer()
......................................................................


catalog_manager: avoid race in InitSysCatalogAsync() and GetTabletPeer()

Commit 2525ad0 took a stab at this, but it doesn't handle the case where
InitSysCatalogAsync() fails and leaves behind sys_catalog_ without a
functional tablet peer, as in the new integration test
MasterReplicationTest.TestMasterPeerSetsDontMatch. So here's another
attempt, where sys_catalog_ is only set when it is fully formed (i.e. when
it has a functional TabletPeer).
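
As a rough illustration of the "only set when fully formed" idea, here is a minimal, self-contained sketch. It is not the Kudu code: Catalog, Peer, Manager, and the bool return values are hypothetical stand-ins for SysCatalogTable, TabletPeer, CatalogManager, and Status, and std::unique_ptr and std::mutex stand in for Kudu's own pointer and lock types.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical stand-in for TabletPeer.
struct Peer {
  std::string tablet_id;
};

// Hypothetical stand-in for SysCatalogTable.
class Catalog {
 public:
  // Returns false on failure, in which case peer_ is left unset.
  bool Init(bool simulate_failure) {
    if (simulate_failure) return false;
    peer_ = std::make_shared<Peer>(Peer{"sys.catalog"});
    return true;
  }

  // Only safe to call once Init() has succeeded.
  const std::string& tablet_id() const {
    assert(peer_ != nullptr);
    return peer_->tablet_id;
  }

  std::shared_ptr<Peer> peer() const { return peer_; }

 private:
  std::shared_ptr<Peer> peer_;
};

// Hypothetical stand-in for CatalogManager.
class Manager {
 public:
  bool Init(bool simulate_failure) {
    // Build into a local first; readers cannot observe it yet.
    auto new_catalog = std::make_unique<Catalog>();
    if (!new_catalog->Init(simulate_failure)) {
      return false;  // catalog_ stays null, so no half-formed object leaks out
    }
    std::lock_guard<std::mutex> l(lock_);
    catalog_ = std::move(new_catalog);  // publish only when fully formed
    return true;
  }

  // Returns nullptr if the catalog is not published or the ID does not match.
  std::shared_ptr<Peer> GetPeer(const std::string& tablet_id) const {
    std::lock_guard<std::mutex> l(lock_);
    if (!catalog_ || catalog_->tablet_id() != tablet_id) return nullptr;
    return catalog_->peer();
  }

 private:
  mutable std::mutex lock_;
  std::unique_ptr<Catalog> catalog_;
};
```

With this shape, a failed Init(true) leaves GetPeer() returning nullptr rather than tripping the assertion in tablet_id().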

It turns out this isn't enough; we also need to prevent ElectedAsLeaderCb
from making progress until InitSysCatalogAsync() sets sys_catalog_. The
extra lock acquisition is hacky in that it doesn't explicitly protect
anything, but it gets the job done.
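
One possible reading of the "extra lock acquisition" is sketched below, under stated assumptions: Manager, Catalog, ElectedAsLeader(), and the sleep that simulates slow initialization are all hypothetical, and std::mutex stands in for whatever lock the real code uses. The callback takes and immediately drops the lock that Init() holds, so it acts purely as a barrier.

```cpp
#include <cassert>
#include <chrono>
#include <memory>
#include <mutex>
#include <thread>

struct Catalog {};  // hypothetical stand-in for SysCatalogTable

// Hypothetical stand-in for CatalogManager.
class Manager {
 public:
  ~Manager() { Join(); }

  void Init() {
    std::lock_guard<std::mutex> l(lock_);
    // Simulates the election callback firing on another thread before
    // Init() has had a chance to publish catalog_.
    callback_thread_ = std::thread([this] { ElectedAsLeader(); });
    std::this_thread::sleep_for(std::chrono::milliseconds(20));
    catalog_ = std::make_unique<Catalog>();
  }  // lock_ is released here, after catalog_ is set

  void Join() {
    if (callback_thread_.joinable()) callback_thread_.join();
  }

  bool callback_saw_catalog() const { return callback_saw_catalog_; }

 private:
  void ElectedAsLeader() {
    {
      // Acquire and immediately release: this guards no data, it only
      // blocks until Init() has dropped the lock.
      std::lock_guard<std::mutex> l(lock_);
    }
    callback_saw_catalog_ = (catalog_ != nullptr);
  }

  std::mutex lock_;
  std::unique_ptr<Catalog> catalog_;
  std::thread callback_thread_;
  bool callback_saw_catalog_ = false;
};
```

Because Init() already holds the lock when the callback thread is created, the callback cannot get past its own acquisition until catalog_ has been set, which is the ordering the commit message asks for.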

Below I've included test output when the race hits.

master_replication-itest: /home/jenkins-slave/workspace/kudu-3/src/kudu/gutil/ref_counted.h:273: T *scoped_refptr<kudu::tablet::TabletPeer>::operator->() const [T = kudu::tablet::TabletPeer]: Assertion `ptr_ != __null' failed.
*** Aborted at 1471309445 (unix time) try "date -d @1471309445" if you are using GNU date ***
PC: @     0x7f330225dcc9 gsignal
*** SIGABRT (@0x3e800006e90) received by PID 28304 (TID 0x7f32f06eb700) from PID 28304; stack trace: ***
    @           0x42e687 __tsan::CallUserSignalHandler() at /home/jenkins-slave/workspace/kudu-3/thirdparty/llvm-3.8.0.src/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:1962
    @           0x42f4d3 rtl_sigaction() at /home/jenkins-slave/workspace/kudu-3/thirdparty/llvm-3.8.0.src/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:2039
    @     0x7f33090a4340 (unknown) at ??:0
    @     0x7f330225dcc9 gsignal at ??:0
    @     0x7f33022610d8 abort at ??:0
    @     0x7f3302256b86 (unknown) at ??:0
    @     0x7f3302256c32 __assert_fail at ??:0
    @     0x7f330ca13130 scoped_refptr<>::operator->() at ??:0
    @     0x7f330ca1a952 kudu::master::SysCatalogTable::tablet_id() at ??:0
    @     0x7f330ca0b136 kudu::master::CatalogManager::GetTabletPeer() at ??:0
    @     0x7f330c69214d kudu::tserver::(anonymous namespace)::LookupTabletPeerOrRespond<>() at ??:0
    @     0x7f330c691bab kudu::tserver::ConsensusServiceImpl::RequestConsensusVote() at ??:0
    @     0x7f3307c9fca5 kudu::consensus::ConsensusServiceIf::ConsensusServiceIf()::$_1::operator()() at ??:0
    @     0x7f3307c9fabf std::_Function_handler<>::_M_invoke() at ??:0
    @     0x7f3306bd7219 std::function<>::operator()() at ??:0
    @     0x7f3306bd6c8e kudu::rpc::GeneratedServiceIf::Handle() at ??:0
    @     0x7f3306bd8b3e kudu::rpc::ServicePool::RunThread() at ??:0
    @     0x7f3306bdaa27 boost::_mfi::mf0<>::operator()() at ??:0
    @     0x7f3306bda98b boost::_bi::list1<>::operator()<>() at ??:0
    @     0x7f3306bda934 boost::_bi::bind_t<>::operator()() at ??:0
    @     0x7f3306bda75a boost::detail::function::void_function_obj_invoker0<>::invoke() at ??:0
    @     0x7f3306b758b2 boost::function0<>::operator()() at ??:0
    @     0x7f3304962630 kudu::Thread::SuperviseThread() at ??:0

Change-Id: I43fdc6499cb84d2053bed08b689fe5a08a6761d6
Reviewed-on: http://gerrit.cloudera.org:8080/3997
Tested-by: Kudu Jenkins
Reviewed-by: Todd Lipcon <to...@apache.org>
---
M src/kudu/master/catalog_manager.cc
1 file changed, 12 insertions(+), 6 deletions(-)

Approvals:
  Todd Lipcon: Looks good to me, approved
  Kudu Jenkins: Verified




Gerrit-MessageType: merged
Gerrit-Change-Id: I43fdc6499cb84d2053bed08b689fe5a08a6761d6
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] catalog manager: avoid race in InitSysCatalogAsync() and GetTabletPeer()

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change.

Change subject: catalog_manager: avoid race in InitSysCatalogAsync() and GetTabletPeer()
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/3997/1/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

Line 706:   sys_catalog_.reset(new_catalog.release());
> Purely looking from this context alone, would this race be seen if we were 
When I first wrote this patch, I thought the race was with Shutdown(), in which case I'd be inclined to agree (in that sys_catalog_ would be destroyed only when the last owner loses its ref).

But since then, I've found the root cause of the race and I don't think changing the ownership semantics would actually change anything.
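
To make that concrete, here is a small sketch of why shared ownership alone would probably not help, assuming the problem is a sys_catalog_ that is visible before its peer is set. Catalog, Peer, and the function name are hypothetical, not Kudu types.

```cpp
#include <cassert>
#include <memory>

struct Peer {};  // hypothetical stand-in for TabletPeer

// Hypothetical stand-in for SysCatalogTable; peer stays null until setup ends.
struct Catalog {
  std::shared_ptr<Peer> peer;
  bool has_peer() const { return peer != nullptr; }
};

// Returns true if a reader holding its own reference still finds a null peer.
bool SharedOwnershipStillSeesNullPeer() {
  auto published = std::make_shared<Catalog>();     // visible before setup ends
  std::shared_ptr<Catalog> reader_ref = published;  // reader takes its own ref
  published.reset();                                // original owner goes away
  // The Catalog is still alive thanks to reader_ref, but its inner peer was
  // never set, so dereferencing the peer would still fail.
  return reader_ref != nullptr && !reader_ref->has_peer();
}
```

Shared ownership keeps the object alive for the reader, but it says nothing about whether the object was fully initialized when it became visible.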



Gerrit-MessageType: comment
Gerrit-Change-Id: I43fdc6499cb84d2053bed08b689fe5a08a6761d6
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] catalog manager: avoid race in InitSysCatalogAsync() and GetTabletPeer()

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Hello Todd Lipcon, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/3997

to look at the new patch set (#2).

Change subject: catalog_manager: avoid race in InitSysCatalogAsync() and GetTabletPeer()
......................................................................

catalog_manager: avoid race in InitSysCatalogAsync() and GetTabletPeer()

Commit 2525ad0 took a stab at this, but it doesn't handle the case where
InitSysCatalogAsync() fails and leaves behind sys_catalog_ without a
functional tablet peer, as in the new integration test
MasterReplicationTest.TestMasterPeerSetsDontMatch. So here's another
attempt, where sys_catalog_ is only set when it is fully formed (i.e. when
it has a functional TabletPeer).

It turns out this isn't enough; we also need to prevent ElectedAsLeaderCb
from making progress until InitSysCatalogAsync() sets sys_catalog_. The
extra lock acquisition is hacky in that it doesn't explicitly protect
anything, but it gets the job done.

Below I've included test output when the race hits.

master_replication-itest: /home/jenkins-slave/workspace/kudu-3/src/kudu/gutil/ref_counted.h:273: T *scoped_refptr<kudu::tablet::TabletPeer>::operator->() const [T = kudu::tablet::TabletPeer]: Assertion `ptr_ != __null' failed.
*** Aborted at 1471309445 (unix time) try "date -d @1471309445" if you are using GNU date ***
PC: @     0x7f330225dcc9 gsignal
*** SIGABRT (@0x3e800006e90) received by PID 28304 (TID 0x7f32f06eb700) from PID 28304; stack trace: ***
    @           0x42e687 __tsan::CallUserSignalHandler() at /home/jenkins-slave/workspace/kudu-3/thirdparty/llvm-3.8.0.src/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:1962
    @           0x42f4d3 rtl_sigaction() at /home/jenkins-slave/workspace/kudu-3/thirdparty/llvm-3.8.0.src/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:2039
    @     0x7f33090a4340 (unknown) at ??:0
    @     0x7f330225dcc9 gsignal at ??:0
    @     0x7f33022610d8 abort at ??:0
    @     0x7f3302256b86 (unknown) at ??:0
    @     0x7f3302256c32 __assert_fail at ??:0
    @     0x7f330ca13130 scoped_refptr<>::operator->() at ??:0
    @     0x7f330ca1a952 kudu::master::SysCatalogTable::tablet_id() at ??:0
    @     0x7f330ca0b136 kudu::master::CatalogManager::GetTabletPeer() at ??:0
    @     0x7f330c69214d kudu::tserver::(anonymous namespace)::LookupTabletPeerOrRespond<>() at ??:0
    @     0x7f330c691bab kudu::tserver::ConsensusServiceImpl::RequestConsensusVote() at ??:0
    @     0x7f3307c9fca5 kudu::consensus::ConsensusServiceIf::ConsensusServiceIf()::$_1::operator()() at ??:0
    @     0x7f3307c9fabf std::_Function_handler<>::_M_invoke() at ??:0
    @     0x7f3306bd7219 std::function<>::operator()() at ??:0
    @     0x7f3306bd6c8e kudu::rpc::GeneratedServiceIf::Handle() at ??:0
    @     0x7f3306bd8b3e kudu::rpc::ServicePool::RunThread() at ??:0
    @     0x7f3306bdaa27 boost::_mfi::mf0<>::operator()() at ??:0
    @     0x7f3306bda98b boost::_bi::list1<>::operator()<>() at ??:0
    @     0x7f3306bda934 boost::_bi::bind_t<>::operator()() at ??:0
    @     0x7f3306bda75a boost::detail::function::void_function_obj_invoker0<>::invoke() at ??:0
    @     0x7f3306b758b2 boost::function0<>::operator()() at ??:0
    @     0x7f3304962630 kudu::Thread::SuperviseThread() at ??:0

Change-Id: I43fdc6499cb84d2053bed08b689fe5a08a6761d6
---
M src/kudu/master/catalog_manager.cc
1 file changed, 12 insertions(+), 6 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/97/3997/2

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I43fdc6499cb84d2053bed08b689fe5a08a6761d6
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] catalog manager: avoid more races between Init() and GetTabletPeer()

Posted by "Kudu Jenkins (Code Review)" <ge...@cloudera.org>.
Kudu Jenkins has posted comments on this change.

Change subject: catalog_manager: avoid more races between Init() and GetTabletPeer()
......................................................................


Patch Set 1:

Build Started http://104.196.14.100/job/kudu-gerrit/2937/


Gerrit-MessageType: comment
Gerrit-Change-Id: I43fdc6499cb84d2053bed08b689fe5a08a6761d6
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No