Posted to issues@kudu.apache.org by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org> on 2017/08/25 21:47:00 UTC
[jira] [Commented] (KUDU-1520) Possible race between alter schema lock release and tablet shutdown
[ https://issues.apache.org/jira/browse/KUDU-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16142272#comment-16142272 ]
Jean-Daniel Cryans commented on KUDU-1520:
------------------------------------------
[~adar] is this still an issue?
> Possible race between alter schema lock release and tablet shutdown
> -------------------------------------------------------------------
>
> Key: KUDU-1520
> URL: https://issues.apache.org/jira/browse/KUDU-1520
> Project: Kudu
> Issue Type: Bug
> Components: tablet
> Affects Versions: 0.9.1
> Reporter: Adar Dembo
>
> I've been running a new stress test that hammers a cluster with concurrent alter and delete table requests, and one of my test runs failed with the following:
> {noformat}
> F0707 18:59:34.311122 373 rw_semaphore.h:145] Check failed: base::subtle::NoBarrier_Load(&state_) == kWriteFlag (0 vs. 2147483648)
> *** Check failure stack trace: ***
> @ 0x7f86cd37df5d google::LogMessage::Fail() at ??:0
> @ 0x7f86cd37fe5d google::LogMessage::SendToLog() at ??:0
> @ 0x7f86cd37da99 google::LogMessage::Flush() at ??:0
> @ 0x7f86cd3808ff google::LogMessageFatal::~LogMessageFatal() at ??:0
> @ 0x7f86d4f77c78 kudu::rw_semaphore::unlock() at ??:0
> @ 0x7f86d3728de0 std::unique_lock<>::unlock() at ??:0
> @ 0x7f86d3727192 std::unique_lock<>::~unique_lock() at ??:0
> @ 0x7f86d3725582 kudu::tablet::AlterSchemaTransactionState::~AlterSchemaTransactionState() at ??:0
> @ 0x7f86d37255be kudu::tablet::AlterSchemaTransactionState::~AlterSchemaTransactionState() at ??:0
> @ 0x7f86d4f68dce std::default_delete<>::operator()() at ??:0
> @ 0x7f86d4f670b9 std::unique_ptr<>::~unique_ptr() at ??:0
> @ 0x7f86d374510e kudu::tablet::AlterSchemaTransaction::~AlterSchemaTransaction() at ??:0
> @ 0x7f86d374514a kudu::tablet::AlterSchemaTransaction::~AlterSchemaTransaction() at ??:0
> @ 0x7f86d373f532 kudu::DefaultDeleter<>::operator()() at ??:0
> @ 0x7f86d373df4a kudu::internal::gscoped_ptr_impl<>::~gscoped_ptr_impl() at ??:0
> @ 0x7f86d373d552 gscoped_ptr<>::~gscoped_ptr() at ??:0
> @ 0x7f86d373d580 kudu::tablet::TransactionDriver::~TransactionDriver() at ??:0
> @ 0x7f86d3740ab4 kudu::RefCountedThreadSafe<>::DeleteInternal() at ??:0
> @ 0x7f86d3740405 kudu::DefaultRefCountedThreadSafeTraits<>::Destruct() at ??:0
> @ 0x7f86d373f928 kudu::RefCountedThreadSafe<>::Release() at ??:0
> @ 0x7f86d373e769 scoped_refptr<>::~scoped_refptr() at ??:0
> @ 0x7f86d37397cd kudu::tablet::TabletPeer::SubmitAlterSchema() at ??:0
> @ 0x7f86d4f4e070 kudu::tserver::TabletServiceAdminImpl::AlterSchema() at ??:0
> @ 0x7f86d27a4e92 _ZZN4kudu7tserver26TabletServerAdminServiceIfC4ERK13scoped_refptrINS_12MetricEntityEEENKUlPKN6google8protobuf7MessageEPS9_PNS_3rpc10RpcContextEE1_clESB_SC_SF_ at ??:0
> @ 0x7f86d27a5d96 _ZNSt17_Function_handlerIFvPKN6google8protobuf7MessageEPS2_PN4kudu3rpc10RpcContextEEZNS6_7tserver26TabletServerAdminServiceIfC4ERK13scoped_refptrINS6_12MetricEntityEEEUlS4_S5_S9_E1_E9_M_invokeERKSt9_Any_dataS4_S5_S9_ at ??:0
> @ 0x7f86d22ce6e4 std::function<>::operator()() at ??:0
> @ 0x7f86d22ce19b kudu::rpc::GeneratedServiceIf::Handle() at ??:0
> @ 0x7f86d22d0a97 kudu::rpc::ServicePool::RunThread() at ??:0
> @ 0x7f86d22d1d45 boost::_mfi::mf0<>::operator()() at ??:0
> @ 0x7f86d22d1b6c boost::_bi::list1<>::operator()<>() at ??:0
> @ 0x7f86d22d1a61 boost::_bi::bind_t<>::operator()() at ??:0
> @ 0x7f86d22d1998 boost::detail::function::void_function_obj_invoker0<>::invoke() at ??:0
> {noformat}
> After looking through the code a bit, I suspect this happened because, in the event of failure, the AlterSchema transaction releases the tablet's schema lock implicitly (i.e. when AlterSchemaTransactionState is destroyed) _after_ the transaction itself is removed from the driver's TransactionTracker. Thus, the WaitForAllToFinish() performed during the tablet shutdown process thinks all the transactions are done and proceeds to free tablet state. Later, the last reference to the transaction is released (in TabletPeer::SubmitAlterSchema), the transaction is destroyed, and we try to unlock a lock whose memory has already been freed.
> If this analysis is correct, the broken invariant is: once the transaction has been released from the tracker, it may no longer access any tablet state.
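
To illustrate the invariant described above, here is a minimal sketch, not Kudu's actual code. Tablet, Tracker, AlterState, and FinishTransaction are hypothetical stand-ins, and a std::mutex replaces the tablet's rw_semaphore. The idea shown is one possible ordering that would preserve the invariant: release the schema lock explicitly before the transaction is removed from the tracker, so that destroying the state later never touches tablet memory.

```cpp
#include <cstddef>
#include <mutex>
#include <set>

// Hypothetical stand-in for tablet state that owns the schema lock.
// (Assumption: a plain std::mutex instead of Kudu's rw_semaphore.)
struct Tablet {
  std::mutex schema_lock;
};

// Hypothetical stand-in for the tracker of in-flight transactions.
class Tracker {
 public:
  void Add(int id) {
    std::lock_guard<std::mutex> l(mu_);
    in_flight_.insert(id);
  }
  void Release(int id) {
    std::lock_guard<std::mutex> l(mu_);
    in_flight_.erase(id);
  }
  // Shutdown would wait for this to become true before freeing the tablet.
  bool AllFinished() {
    std::lock_guard<std::mutex> l(mu_);
    return in_flight_.empty();
  }

 private:
  std::mutex mu_;
  std::set<int> in_flight_;
};

// Hypothetical stand-in for the alter-schema transaction state. It takes
// the schema lock on construction and can give it up before destruction.
class AlterState {
 public:
  explicit AlterState(Tablet* tablet) : lock_(tablet->schema_lock) {}

  // Explicit release. After this call the destructor of lock_ is a no-op
  // and no longer dereferences the tablet's mutex.
  void ReleaseSchemaLock() {
    if (lock_.owns_lock()) lock_.unlock();
  }
  bool HoldsSchemaLock() const { return lock_.owns_lock(); }

 private:
  std::unique_lock<std::mutex> lock_;
};

// Ordering that preserves the invariant: drop every reference to tablet
// state first, and only then tell the tracker the transaction is done.
void FinishTransaction(Tracker* tracker, int id, AlterState* state) {
  state->ReleaseSchemaLock();
  tracker->Release(id);
}
```

With this ordering, a shutdown path that frees the tablet once AllFinished() returns true cannot race with the destruction of AlterState, because the state no longer owns the lock by the time the tracker lets go of the transaction.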