Posted to issues@kudu.apache.org by "Andrew Wong (Jira)" <ji...@apache.org> on 2020/06/17 21:38:00 UTC

[jira] [Created] (KUDU-3149) Lock contention between registering ops and computing maintenance op stats

Andrew Wong created KUDU-3149:
---------------------------------

             Summary: Lock contention between registering ops and computing maintenance op stats
                 Key: KUDU-3149
                 URL: https://issues.apache.org/jira/browse/KUDU-3149
             Project: Kudu
          Issue Type: Bug
          Components: perf, tserver
            Reporter: Andrew Wong


We saw many tablets bootstrapping extremely slowly, and many others apparently stuck bootstrapping without showing up as such on the {{/tablets}} page: we could only see INITIALIZED and RUNNING tablets, none in BOOTSTRAPPING.

Upon digging into the stacks, we saw many threads waiting in:

{code}
TID 46583(tablet-open [wo):
    @     0x7f1dd57147e0  (unknown)
    @     0x7f1dd5713332  (unknown)
    @     0x7f1dd570e5d8  (unknown)
    @     0x7f1dd570e4a7  (unknown)
    @          0x23b4058  kudu::Mutex::Acquire()
    @          0x23980ff  kudu::MaintenanceManager::RegisterOp()
    @           0xb85374  kudu::tablet::TabletReplica::RegisterMaintenanceOps()
    @           0xa0055b  kudu::tserver::TSTabletManager::OpenTablet()
    @          0x23f994c  kudu::ThreadPool::DispatchThread()
    @          0x23f3f8b  kudu::Thread::SuperviseThread()
    @     0x7f1dd570caa1  (unknown)
    @     0x7f1dd3b18bcd  (unknown)
{code}

Upon further inspection, that lock is held by the maintenance manager (MM) scheduler thread, which is itself waiting on another mutex while computing compaction stats:

{code}
Thread 4 (Thread 0x7f1d7d358700 (LWP 46999)):
#0  0x00007f1dd5713334 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f1dd570e5d8 in _L_lock_854 () from /lib64/libpthread.so.0
#2  0x00007f1dd570e4a7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000000000b51f29 in kudu::tablet::Tablet::UpdateCompactionStats(kudu::MaintenanceOpStats*) ()
#4  0x0000000000b7f435 in kudu::tablet::CompactRowSetsOp::UpdateStats(kudu::MaintenanceOpStats*) ()
#5  0x00000000023956e4 in kudu::MaintenanceManager::FindBestOp() ()
#6  0x0000000002396af9 in kudu::MaintenanceManager::FindAndLaunchOp(std::unique_lock<kudu::Mutex>*) ()
#7  0x0000000002397858 in kudu::MaintenanceManager::RunSchedulerThread() ()
{code}

A couple of things come to mind:
- We could probably take a snapshot of the ops while holding {{lock_}}, then release it before computing stats to find the best op to run.
- Additionally, we may want to consider disabling compactions entirely until the initial set of tablets finishes bootstrapping.
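
To illustrate the first idea, here is a minimal sketch of the snapshot-then-release pattern. It is not Kudu's actual code: {{SketchOp}} and {{SketchManager}} are hypothetical stand-ins, and the use of {{std::shared_ptr}} to keep ops alive after the manager lock is dropped is an assumption made for the sketch, as is picking the op with the highest score.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a maintenance op; not Kudu's MaintenanceOp.
class SketchOp {
 public:
  SketchOp(std::string name, double score)
      : name_(std::move(name)), score_(score) {}

  // Computing stats takes the op's own lock and may block for a while,
  // much like Tablet::UpdateCompactionStats() in the stack above.
  double UpdateStats() {
    std::lock_guard<std::mutex> l(stats_lock_);
    return score_;
  }

  const std::string& name() const { return name_; }

 private:
  const std::string name_;
  std::mutex stats_lock_;
  double score_;
};

// Hypothetical stand-in for the maintenance manager.
class SketchManager {
 public:
  // Registration only needs lock_ long enough to append to the list.
  void RegisterOp(std::shared_ptr<SketchOp> op) {
    std::lock_guard<std::mutex> l(lock_);
    ops_.push_back(std::move(op));
  }

  // Copies the op list under lock_, then computes stats with lock_ released,
  // so a slow UpdateStats() call no longer blocks RegisterOp().
  std::shared_ptr<SketchOp> FindBestOp() {
    std::vector<std::shared_ptr<SketchOp>> snapshot;
    {
      std::lock_guard<std::mutex> l(lock_);
      snapshot = ops_;
    }  // lock_ is released here.

    std::shared_ptr<SketchOp> best;
    double best_score = 0.0;
    for (const auto& op : snapshot) {
      const double score = op->UpdateStats();
      if (!best || score > best_score) {
        best = op;
        best_score = score;
      }
    }
    return best;  // nullptr if no ops are registered.
  }

  std::size_t num_ops() {
    std::lock_guard<std::mutex> l(lock_);
    return ops_.size();
  }

 private:
  std::mutex lock_;
  std::vector<std::shared_ptr<SketchOp>> ops_;
};
```

The tradeoff is that the snapshot can be stale: an op registered or unregistered after the copy is not reflected until the next scheduling pass, and a real change would need to make sure an op cannot be destroyed while the scheduler is still computing its stats.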

We used the {{set_flag}} tool to disable compactions on the node and noted significantly faster bootstrapping.


