Posted to issues@kudu.apache.org by "Todd Lipcon (JIRA)" <ji...@apache.org> on 2016/12/15 09:47:58 UTC

[jira] [Commented] (KUDU-1564) Deadlock on raft notification ThreadPool

    [ https://issues.apache.org/jira/browse/KUDU-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15750931#comment-15750931 ] 

Todd Lipcon commented on KUDU-1564:
-----------------------------------

This should be fixed by KUDU-699.

> Deadlock on raft notification ThreadPool
> ----------------------------------------
>
>                 Key: KUDU-1564
>                 URL: https://issues.apache.org/jira/browse/KUDU-1564
>             Project: Kudu
>          Issue Type: Bug
>          Components: consensus
>    Affects Versions: 0.10.0
>            Reporter: Todd Lipcon
>            Priority: Critical
>         Attachments: stacks.txt
>
>
> In a stress test on a cluster, one of the tablet servers got stuck in a deadlock. It appears that:
> - The Raft notification threadpool for a tablet has a maximum of 24 threads (corresponding to the number of cores)
> - One of the threads is in:
> {code}
> #1  0x00000000019019b2 in kudu::Semaphore::Acquire() ()
> #2  0x0000000000985159 in kudu::consensus::Peer::Close() ()
> #3  0x000000000099d909 in kudu::consensus::PeerManager::Close() ()
> #4  0x00000000009684bd in kudu::consensus::RaftConsensus::RefreshConsensusQueueAndPeersUnlocked() ()
> #5  0x000000000096eced in kudu::consensus::RaftConsensus::ReplicateConfigChangeUnlocked(kudu::consensus::RaftConfigPB const&, kudu::consensus::RaftConfigPB const&, kudu::Callback<void ()(kudu::Status const&)> const&) ()
> #6  0x00000000009795be in kudu::consensus::RaftConsensus::ChangeConfig(kudu::consensus::ChangeConfigRequestPB const&, kudu::Callback<void ()(kudu::Status const&)> const&, boost::optional<kudu::tserver::TabletServerErrorPB_Code>*) ()
> #7  0x0000000000978cd0 in kudu::consensus::RaftConsensus::TryRemoveFollowerTask(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, kudu::consensus::RaftConfigPB const&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
> {code}
> The rest are in:
> {code}
> #1  0x0000000001924e87 in base::SpinLock::SlowLock() ()
> #2  0x000000000097df28 in kudu::consensus::ReplicaState::LockForConfigChange(std::unique_lock<kudu::simple_spinlock>*) const ()
> #3  0x00000000009791dd in kudu::consensus::RaftConsensus::ChangeConfig(kudu::consensus::ChangeConfigRequestPB const&, kudu::Callback<void ()(kudu::Status const&)> const&, boost::optional<kudu::tserver::TabletServerErrorPB_Code>*) ()
> #4  0x0000000000978cd0 in kudu::consensus::RaftConsensus::TryRemoveFollowerTask(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, kudu::consensus::RaftConfigPB const&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
> {code}
> It appears that the thread holding the lock is waiting on a peer response (in order to close the peer), but that response is sitting in the ThreadPool's queue. It will never be processed, because every thread in the pool is blocked: one on the response itself, and the rest on the lock held by that thread.
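
The starvation pattern described above can be reproduced in miniature. The following is only a sketch under stated assumptions, not Kudu's code: `TinyPool` and `ResponseArrivesInTime` are hypothetical names invented for illustration, the pool is a bare-bones fixed-size FIFO pool, and a 300 ms timeout stands in for the wait in `Peer::Close()` so that the demo terminates (the reported deadlock had no such timeout).

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <future>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical minimal fixed-size FIFO pool (NOT Kudu's ThreadPool).
class TinyPool {
 public:
  explicit TinyPool(int n) {
    assert(n > 0);
    for (int i = 0; i < n; ++i) workers_.emplace_back([this] { Run(); });
  }
  // Drains the remaining queue, then joins all workers.
  ~TinyPool() {
    { std::lock_guard<std::mutex> l(mu_); stop_ = true; }
    cv_.notify_all();
    for (auto& t : workers_) t.join();
  }
  void Submit(std::function<void()> f) {
    { std::lock_guard<std::mutex> l(mu_); q_.push(std::move(f)); }
    cv_.notify_one();
  }

 private:
  void Run() {
    for (;;) {
      std::function<void()> f;
      {
        std::unique_lock<std::mutex> l(mu_);
        cv_.wait(l, [this] { return stop_ || !q_.empty(); });
        if (q_.empty()) return;  // stop requested and nothing left to run
        f = std::move(q_.front());
        q_.pop();
      }
      f();
    }
  }
  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> q_;
  std::vector<std::thread> workers_;
  bool stop_ = false;
};

// Submits `config_change_tasks` tasks that each take the same lock and then
// wait for a "peer response", followed by the task that delivers that
// response. Returns true if the response was seen before the timeout.
bool ResponseArrivesInTime(int pool_threads, int config_change_tasks) {
  std::mutex config_lock;  // stands in for the config-change lock
  std::promise<void> response_promise;
  std::shared_future<void> response = response_promise.get_future().share();
  std::atomic<bool> stalled{false};
  {
    TinyPool pool(pool_threads);
    for (int i = 0; i < config_change_tasks; ++i) {
      pool.Submit([&config_lock, &stalled, response] {
        std::lock_guard<std::mutex> l(config_lock);
        if (stalled) return;  // an earlier task already gave up
        // The lock holder waits for the response while holding the lock.
        if (response.wait_for(std::chrono::milliseconds(300)) !=
            std::future_status::ready) {
          stalled = true;
        }
      });
    }
    // The response is queued behind the tasks that are waiting for it.
    pool.Submit([&response_promise] { response_promise.set_value(); });
  }  // pool destructor drains the queue and joins the workers
  return !stalled;
}
```

With as many blocking tasks as pool threads (for example `ResponseArrivesInTime(2, 2)`), the response task cannot be scheduled until the lock holder times out, which mirrors the stacks above on a smaller scale. With one spare thread (`ResponseArrivesInTime(3, 2)`), the response is delivered and the lock holder proceeds.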



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)