You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@kudu.apache.org by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org> on 2017/08/25 21:47:00 UTC

[jira] [Commented] (KUDU-1520) Possible race between alter schema lock release and tablet shutdown

    [ https://issues.apache.org/jira/browse/KUDU-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16142272#comment-16142272 ] 

Jean-Daniel Cryans commented on KUDU-1520:
------------------------------------------

[~adar] is this still an issue?

> Possible race between alter schema lock release and tablet shutdown
> -------------------------------------------------------------------
>
>                 Key: KUDU-1520
>                 URL: https://issues.apache.org/jira/browse/KUDU-1520
>             Project: Kudu
>          Issue Type: Bug
>          Components: tablet
>    Affects Versions: 0.9.1
>            Reporter: Adar Dembo
>
> I've been running a new stress that hammers a cluster with concurrent alter and delete table requests, and one of my test runs failed with the following:
> {noformat}
> F0707 18:59:34.311122   373 rw_semaphore.h:145] Check failed: base::subtle::NoBarrier_Load(&state_) == kWriteFlag (0 vs. 2147483648) 
> *** Check failure stack trace: ***
>     @     0x7f86cd37df5d  google::LogMessage::Fail() at ??:0
>     @     0x7f86cd37fe5d  google::LogMessage::SendToLog() at ??:0
>     @     0x7f86cd37da99  google::LogMessage::Flush() at ??:0
>     @     0x7f86cd3808ff  google::LogMessageFatal::~LogMessageFatal() at ??:0
>     @     0x7f86d4f77c78  kudu::rw_semaphore::unlock() at ??:0
>     @     0x7f86d3728de0  std::unique_lock<>::unlock() at ??:0
>     @     0x7f86d3727192  std::unique_lock<>::~unique_lock() at ??:0
>     @     0x7f86d3725582  kudu::tablet::AlterSchemaTransactionState::~AlterSchemaTransactionState() at ??:0
>     @     0x7f86d37255be  kudu::tablet::AlterSchemaTransactionState::~AlterSchemaTransactionState() at ??:0
>     @     0x7f86d4f68dce  std::default_delete<>::operator()() at ??:0
>     @     0x7f86d4f670b9  std::unique_ptr<>::~unique_ptr() at ??:0
>     @     0x7f86d374510e  kudu::tablet::AlterSchemaTransaction::~AlterSchemaTransaction() at ??:0
>     @     0x7f86d374514a  kudu::tablet::AlterSchemaTransaction::~AlterSchemaTransaction() at ??:0
>     @     0x7f86d373f532  kudu::DefaultDeleter<>::operator()() at ??:0
>     @     0x7f86d373df4a  kudu::internal::gscoped_ptr_impl<>::~gscoped_ptr_impl() at ??:0
>     @     0x7f86d373d552  gscoped_ptr<>::~gscoped_ptr() at ??:0
>     @     0x7f86d373d580  kudu::tablet::TransactionDriver::~TransactionDriver() at ??:0
>     @     0x7f86d3740ab4  kudu::RefCountedThreadSafe<>::DeleteInternal() at ??:0
>     @     0x7f86d3740405  kudu::DefaultRefCountedThreadSafeTraits<>::Destruct() at ??:0
>     @     0x7f86d373f928  kudu::RefCountedThreadSafe<>::Release() at ??:0
>     @     0x7f86d373e769  scoped_refptr<>::~scoped_refptr() at ??:0
>     @     0x7f86d37397cd  kudu::tablet::TabletPeer::SubmitAlterSchema() at ??:0
>     @     0x7f86d4f4e070  kudu::tserver::TabletServiceAdminImpl::AlterSchema() at ??:0
>     @     0x7f86d27a4e92  _ZZN4kudu7tserver26TabletServerAdminServiceIfC4ERK13scoped_refptrINS_12MetricEntityEEENKUlPKN6google8protobuf7MessageEPS9_PNS_3rpc10RpcContextEE1_clESB_SC_SF_ at ??:0
>     @     0x7f86d27a5d96  _ZNSt17_Function_handlerIFvPKN6google8protobuf7MessageEPS2_PN4kudu3rpc10RpcContextEEZNS6_7tserver26TabletServerAdminServiceIfC4ERK13scoped_refptrINS6_12MetricEntityEEEUlS4_S5_S9_E1_E9_M_invokeERKSt9_Any_dataS4_S5_S9_ at ??:0
>     @     0x7f86d22ce6e4  std::function<>::operator()() at ??:0
>     @     0x7f86d22ce19b  kudu::rpc::GeneratedServiceIf::Handle() at ??:0
>     @     0x7f86d22d0a97  kudu::rpc::ServicePool::RunThread() at ??:0
>     @     0x7f86d22d1d45  boost::_mfi::mf0<>::operator()() at ??:0
>     @     0x7f86d22d1b6c  boost::_bi::list1<>::operator()<>() at ??:0
>     @     0x7f86d22d1a61  boost::_bi::bind_t<>::operator()() at ??:0
>     @     0x7f86d22d1998  boost::detail::function::void_function_obj_invoker0<>::invoke() at ??:0
> {noformat}
> After looking through the code a bit, I suspect this happened because, in the event of failure, the AlterSchema transaction releases the tablet's schema lock implicitly (i.e. when AlterSchemaTransactionState is destroyed) _after_ the transaction itself is removed from the driver's TransactionTracker. Thus, the WaitForAllToFinish() performed during the tablet shutdown process thinks all the transactions are done and proceeds to free tablet state. Later, the last reference to the transaction is released (in TabletPeer::SubmitAlterSchema), the transaction is destroyed, and we try to unlock a lock whose memory has already been freed.
> If this analysis is correct, the broken invariant is: once the transaction has been released from the tracker, it may no longer access any tablet state.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)