Posted to user@kudu.apache.org by Lee King <yu...@gmail.com> on 2017/11/04 02:47:25 UTC

The service queue is full; it has 400 items.. Retrying in the next heartbeat period.

Hi,
    Our Kudu cluster has been running well for a long time, but writes have
recently become slow and clients are also hitting RPC timeouts. I checked the
warnings and found a huge number of errors like this:
W1104 10:25:16.833736 10271 consensus_peers.cc:365] T
149ffa58ac274c9ba8385ccfdc01ea14 P 59c768eb799243678ee7fa3f83801316 -> Peer
1c67a7e7ff8f4de494469766641fccd1 (cloud-sk-ds-08:7050): Couldn't send
request to peer 1c67a7e7ff8f4de494469766641fccd1 for tablet
149ffa58ac274c9ba8385ccfdc01ea14. Status: Timed out: UpdateConsensus RPC to
10.6.60.9:7050 timed out after 1.000s (SENT). Retrying in the next
heartbeat period. Already tried 5 times.
    I changed the configuration to rpc_service_queue_length=400 and
rpc_num_service_threads=40, but it had no effect.
    Our cluster has 5 masters and 10 tablet servers, about 3800 GB of data,
and roughly 800 tablets per tablet server. I checked one of the tablet server
machines: 14 GB of memory free (128 GB total), 4739 threads (max 32000),
28000 open files (max 65536), CPU utilization around 30% (32 cores), and disk
utilization under 30%.
    Any suggestions for this? Thanks!

Re: The service queue is full; it has 400 items.. Retrying in the next heartbeat period.

Posted by Lee King <yu...@gmail.com>.
We changed the configuration to rpc_service_queue_length=1024,
rpc_num_service_threads=128, and consensus_rpc_timeout_ms=30000, and the
cluster appears to be working well now.
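
For reference, here is roughly how we applied them; just a sketch assuming
the tablet servers read a gflags file (the path below is only an example,
and each tserver needs a restart after the change):

# example tserver gflags file, e.g. passed via --flagfile (path is hypothetical)
--rpc_service_queue_length=1024
--rpc_num_service_threads=128
--consensus_rpc_timeout_ms=30000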

2017-11-04 5:15 GMT+00:00 Todd Lipcon <to...@cloudera.com>:

> One thing you might try is to update the consensus rpc timeout to 30
> seconds instead of 1. We changed the default in later versions.
>
> I'd also recommend updating up 1.4 or 1.5 for other related fixes to
> consensus stability. I think I recall you were on 1.3 still?
>
> Todd
>
>
> On Nov 3, 2017 7:47 PM, "Lee King" <yu...@gmail.com> wrote:
>
> Hi,
>     Our kudu cluster have ran well a long time,  but write became slowly
> recently,client also come out rpc timeout. I check the warning and find
> vast error look this:
> W1104 10:25:16.833736 10271 consensus_peers.cc:365] T
> 149ffa58ac274c9ba8385ccfdc01ea14 P 59c768eb799243678ee7fa3f83801316 ->
> Peer 1c67a7e7ff8f4de494469766641fccd1 (cloud-sk-ds-08:7050): Couldn't
> send request to peer 1c67a7e7ff8f4de494469766641fccd1 for tablet
> 149ffa58ac274c9ba8385ccfdc01ea14. Status: Timed out: UpdateConsensus RPC
> to 10.6.60.9:7050 timed out after 1.000s (SENT). Retrying in the next
> heartbeat period. Already tried 5 times.
>     I change the configure rpc_service_queue_le
> ngth=400,rpc_num_service_threads=40, but it takes no effect.
>     Our cluster include 5 master , 10 ts. 3800G data, 800 tablet per ts. I
> check one of the ts machine's memory, 14G left(128 In all), thread 4739(max
> 32000), openfile 28000(max 65536), cpu disk utilization ratio about
> 30%(32 core), disk util  less than 30%.
>     Any suggestion for this? Thanks!
>
>
>

Re: The service queue is full; it has 400 items.. Retrying in the next heartbeat period.

Posted by Lee King <yu...@gmail.com>.
We disabled THP the day before yesterday, and the cluster started working
well soon afterwards. I have been watching our Spark jobs online for a while:
there is no delay now and task execution times look steady. We also ran some
tests with YCSB, and the results match what we'd get from a freshly deployed
cluster, or even better. I think THP was the primary cause of the cluster's
instability. We'll keep observing, and if anything interesting happens I will
report back here. Thanks @Todd.

    King Lee


On Tue, Nov 14, 2017 at 1:37 AM, Todd Lipcon <to...@cloudera.com> wrote:

> I caught up with King Lee on Slack over the weekend, but wanted to reply
> here so that if anyone else in the community hits similar issues they might
> check on the same things.
>
> - I asked for a dump of /metrics and /rpcz from the TS web server.
>
> From this I could see that a lot of the consensus requests were stuck a
> long time waiting on various locks.
>
> - I asked for 10 'pstack's separated by 10 seconds using the following
> command:
> for x in $(seq 1 10) ; do pstack $(pgrep kudu-tserver) >
> /tmp/stacks.$x.txt ; sleep 10 ; done
>
> In the stacks I found some interesting stuff:
>
> 1) Many threads in the process of leader elections
> - King Lee has an SSD available on his tablet server boxes, so I gave him
> a script to migrate the consensus-meta/ and tablet-meta/ directories to the
> SSD and replace them with symlinks on the original drive. This is basically
> a manual workaround for https://issues.apache.org/jira/browse/KUDU-1489
>
> 2) A lot of threads blocked in DNS lookups
> - He's using /etc/hosts only and not actual DNS.
> - The stack looks like:
>
> Thread 770 (Thread 0x7fa8ecd0d700 (LWP 7365)):
> #0  0x00007fabc504e334 in __lll_lock_wait () from /lib64/libpthread.so.0
> #1  0x00007fabc50495d8 in _L_lock_854 () from /lib64/libpthread.so.0
> #2  0x00007fabc50494a7 in pthread_mutex_lock () from /lib64/libpthread.so.0
> #3  0x00007fabbcfe2e07 in _nss_files_gethostbyname2_r () from
> /lib64/libnss_files.so.2
> #4  0x00007fabc3fed915 in gethostbyname2_r@@GLIBC_2.2.5 () from
> /lib64/libc.so.6
> #5  0x00007fabc3fbad62 in gaih_inet () from /lib64/libc.so.6
> #6  0x00007fabc3fbcfbf in getaddrinfo () from /lib64/libc.so.6
> #7  0x0000000001b3bd06 in
> kudu::HostPort::ResolveAddresses(std::vector<kudu::Sockaddr,
> std::allocator<kudu::Sockaddr> >*) const ()
>
> It seems like at least this version of glibc (from el6) has a lock around
> the '/etc/hosts'-based implementation of gethostbyname. It still struck me
> as odd that looking up a host in /etc/hosts would be slow, but I also found
> this stack:
>
> Thread 496 (Thread 0x7fa65d7bb700 (LWP 28798)):
> #0  0x00007fabc3fcf59a in mmap64 () from /lib64/libc.so.6
> #1  0x00007fabc3f5015c in _IO_file_doallocate_internal () from
> /lib64/libc.so.6
> #2  0x00007fabc3f5d61c in _IO_doallocbuf_internal () from /lib64/libc.so.6
> #3  0x00007fabc3f5bbd4 in _IO_new_file_underflow () from /lib64/libc.so.6
> #4  0x00007fabc3f5d65e in _IO_default_uflow_internal () from
> /lib64/libc.so.6
> #5  0x00007fabc3f51d3a in _IO_getline_info_internal () from
> /lib64/libc.so.6
> #6  0x00007fabc3f5ab71 in fgets_unlocked () from /lib64/libc.so.6
> #7  0x00007fabbcfe1ab1 in internal_getent () from /lib64/libnss_files.so.2
> #8  0x00007fabbcfe2e96 in _nss_files_gethostbyname2_r () from
> /lib64/libnss_files.so.2
> #9  0x00007fabc3fed915 in gethostbyname2_r@@GLIBC_2.2.5 () from
> /lib64/libc.so.6
> #10 0x00007fabc3fbad62 in gaih_inet () from /lib64/libc.so.6
> #11 0x00007fabc3fbcfbf in getaddrinfo () from /lib64/libc.so.6
>
> I dug into the source for el6's glibc and found that, indeed, it uses
> mmap() to read /etc/hosts on every host resolution.
>
> This led me to question why mmap was so slow. The most common reason I've
> seen for that is when transparent hugepages are enabled on el6. It turns
> out King's system does have THP enabled, so I asked him to disable it. See
> https://issues.apache.org/jira/browse/KUDU-1447
>
> - The last thing I noticed were a bunch of threads stuck in this stack:
>
> Thread 54 (Thread 0x7fabb95e5700 (LWP 29901)):
> #0  0x00007fabc504e334 in __lll_lock_wait () from /lib64/libpthread.so.0
> #1  0x00007fabc50495d8 in _L_lock_854 () from /lib64/libpthread.so.0
> #2  0x00007fabc50494a7 in pthread_mutex_lock () from /lib64/libpthread.so.0
> #3  0x0000000001b3669f in kudu::Mutex::Acquire() ()
> #4  0x0000000001b17193 in kudu::KernelStackWatchdog::Unregister() ()
> #5  0x0000000001b1736a in void
> kudu::threadlocal::internal::Destroy<kudu::KernelStackWatchdog::TLS>(void*)
> ()
> #6  0x0000000001b6c316 in
> kudu::threadlocal::internal::InvokeDestructors(void*) ()
> #7  0x00007fabc5047b39 in start_thread () from /lib64/libpthread.so.0
> #8  0x00007fabc3fd2bcd in clone () from /lib64/libc.so.6
>
> I filed KUDU-2215 for this one. This particular stuck thread issue
> wouldn't cause problems in Kudu 1.5, but in 1.4 and earlier the slow
> shutdown of a thread could cause consensus to get "stuck" as well in a
> stack like this:
> Thread 2137 (Thread 0x7fa4e4c1c700 (LWP 1564)):
> #0  0x00007fabc504ba5e in pthread_cond_timedwait@@GLIBC_2.3.2 () from
> /lib64/libpthread.so.0
> #1  0x0000000001ad9cdd in
> kudu::ConditionVariable::TimedWait(kudu::MonoDelta const&) const ()
> #2  0x0000000001b63d0c in kudu::ThreadJoiner::Join() ()
> #3  0x0000000001b55b4c in kudu::ResettableHeartbeater::Stop() ()
> #4  0x0000000000a65b42 in kudu::consensus::Peer::Close() ()
> #5  0x0000000000a819b9 in kudu::consensus::PeerManager::Close() ()
> #6  0x0000000000a45dde in
> kudu::consensus::RaftConsensus::BecomeReplicaUnlocked() ()
> #7  0x0000000000a51063 in
> kudu::consensus::RaftConsensus::HandleTermAdvanceUnlocked(long,
> kudu::consensus::ReplicaState::FlushToDisk) ()
> #8  0x0000000000a5145f in
> kudu::consensus::RaftConsensus::NotifyTermChange(long) ()
> #9  0x0000000000a6ceac in
> kudu::consensus::PeerMessageQueue::NotifyObserversOfTermChangeTask(long) ()
>
> which I also found in his stacks fairly often.
>
> To workaround this I suggested disabling the kernel_stack_watchdog using a
> hidden configuration. I'll look at a fix for this that we can backport to
> earlier versions as well.
>
> King -- please report back in a couple days whether the suggestions we
> talked about help resolve the problems on your cluster.
>
> Thanks
> -Todd
>
>
>
>
>
> On Thu, Nov 9, 2017 at 11:27 PM, Lee King <yu...@gmail.com> wrote:
>
>> Hi, Todd
>>     Our kudu cluster 's error/warning log just like the
>> https://issues.apache.org/jira/browse/KUDU-1078, and the issues's status
>> is reopen, I have upload log for analysis the issues, If you want to more
>> detail, just tell me 😄。
>> log files:
>> https://drive.google.com/open?id=1_1l2xpT3-NmumgI_sIdxch-6BocXqTCt
>>
>> https://drive.google.com/open?id=0B4-NyGFtYNboN3NYNW1pVWQwcFVLa083VkRIUTZHRk85WHY4
>>
>>
>>
>>
>>
>>
>> 2017-11-04 13:15 GMT+08:00 Todd Lipcon <to...@cloudera.com>:
>>
>>> One thing you might try is to update the consensus rpc timeout to 30
>>> seconds instead of 1. We changed the default in later versions.
>>>
>>> I'd also recommend updating up 1.4 or 1.5 for other related fixes to
>>> consensus stability. I think I recall you were on 1.3 still?
>>>
>>> Todd
>>>
>>>
>>> On Nov 3, 2017 7:47 PM, "Lee King" <yu...@gmail.com> wrote:
>>>
>>> Hi,
>>>     Our kudu cluster have ran well a long time,  but write became slowly
>>> recently,client also come out rpc timeout. I check the warning and find
>>> vast error look this:
>>> W1104 10:25:16.833736 10271 consensus_peers.cc:365] T
>>> 149ffa58ac274c9ba8385ccfdc01ea14 P 59c768eb799243678ee7fa3f83801316 -> Peer
>>> 1c67a7e7ff8f4de494469766641fccd1 (cloud-sk-ds-08:7050): Couldn't send
>>> request to peer 1c67a7e7ff8f4de494469766641fccd1 for tablet
>>> 149ffa58ac274c9ba8385ccfdc01ea14. Status: Timed out: UpdateConsensus RPC to
>>> 10.6.60.9:7050 timed out after 1.000s (SENT). Retrying in the next
>>> heartbeat period. Already tried 5 times.
>>>     I change the
>>> configure rpc_service_queue_length=400,rpc_num_service_threads=40, but it
>>> takes no effect.
>>>     Our cluster include 5 master , 10 ts. 3800G data, 800 tablet per ts.
>>> I check one of the ts machine's memory, 14G left(128 In all), thread
>>> 4739(max 32000), openfile 28000(max 65536), cpu disk utilization ratio
>>> about 30%(32 core), disk util  less than 30%.
>>>     Any suggestion for this? Thanks!
>>>
>>>
>>>
>>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>

Re: The service queue is full; it has 400 items.. Retrying in the next heartbeat period.

Posted by Todd Lipcon <to...@cloudera.com>.
I caught up with King Lee on Slack over the weekend, but wanted to reply
here so that if anyone else in the community hits similar issues they might
check on the same things.

- I asked for a dump of /metrics and /rpcz from the TS web server.

From this I could see that a lot of the consensus requests were stuck a
long time waiting on various locks.
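
(If anyone wants to grab the same dumps from their own tablet servers: these
are plain HTTP endpoints on the TS embedded web server, which listens on port
8050 by default. The hostname below is just the example from King's logs.)

curl -s http://cloud-sk-ds-08:8050/metrics > /tmp/ts-metrics.json
curl -s http://cloud-sk-ds-08:8050/rpcz    > /tmp/ts-rpcz.json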

- I asked for 10 'pstack's separated by 10 seconds using the following
command:
for x in $(seq 1 10); do pstack $(pgrep kudu-tserver) > /tmp/stacks.$x.txt; sleep 10; done
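
(pstack ships with the gdb package on el6, if I remember right; if it isn't
installed, something roughly equivalent, assuming the process is named
kudu-tserver, is:)

for x in $(seq 1 10); do gdb -batch -ex 'thread apply all bt' -p $(pgrep -x kudu-tserver) > /tmp/stacks.$x.txt 2>&1; sleep 10; done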

In the stacks I found some interesting stuff:

1) Many threads in the process of leader elections
- King Lee has an SSD available on his tablet server boxes, so I gave him a
script to migrate the consensus-meta/ and tablet-meta/ directories to the
SSD and replace them with symlinks on the original drive. This is basically
a manual workaround for https://issues.apache.org/jira/browse/KUDU-1489
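
(His script isn't attached here, but the idea is simple. A rough sketch,
assuming the metadata directories live under /data/kudu and the SSD is
mounted at /ssd -- those paths are just placeholders, and the tablet server
must be stopped first:)

service kudu-tserver stop                  # or however the daemon is managed
mkdir -p /ssd/kudu
for d in consensus-meta tablet-meta; do
  cp -a /data/kudu/$d /ssd/kudu/$d         # copy metadata to the SSD, preserving ownership
  mv /data/kudu/$d /data/kudu/$d.bak       # keep the original around until verified
  ln -s /ssd/kudu/$d /data/kudu/$d         # symlink back to the original location
done
service kudu-tserver start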

2) A lot of threads blocked in DNS lookups
- He's using /etc/hosts only and not actual DNS.
- The stack looks like:

Thread 770 (Thread 0x7fa8ecd0d700 (LWP 7365)):
#0  0x00007fabc504e334 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007fabc50495d8 in _L_lock_854 () from /lib64/libpthread.so.0
#2  0x00007fabc50494a7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007fabbcfe2e07 in _nss_files_gethostbyname2_r () from
/lib64/libnss_files.so.2
#4  0x00007fabc3fed915 in gethostbyname2_r@@GLIBC_2.2.5 () from
/lib64/libc.so.6
#5  0x00007fabc3fbad62 in gaih_inet () from /lib64/libc.so.6
#6  0x00007fabc3fbcfbf in getaddrinfo () from /lib64/libc.so.6
#7  0x0000000001b3bd06 in
kudu::HostPort::ResolveAddresses(std::vector<kudu::Sockaddr,
std::allocator<kudu::Sockaddr> >*) const ()

It seems like at least this version of glibc (from el6) has a lock around
the '/etc/hosts'-based implementation of gethostbyname. It still struck me
as odd that looking up a host in /etc/hosts would be slow, but I also found
this stack:

Thread 496 (Thread 0x7fa65d7bb700 (LWP 28798)):
#0  0x00007fabc3fcf59a in mmap64 () from /lib64/libc.so.6
#1  0x00007fabc3f5015c in _IO_file_doallocate_internal () from
/lib64/libc.so.6
#2  0x00007fabc3f5d61c in _IO_doallocbuf_internal () from /lib64/libc.so.6
#3  0x00007fabc3f5bbd4 in _IO_new_file_underflow () from /lib64/libc.so.6
#4  0x00007fabc3f5d65e in _IO_default_uflow_internal () from
/lib64/libc.so.6
#5  0x00007fabc3f51d3a in _IO_getline_info_internal () from /lib64/libc.so.6
#6  0x00007fabc3f5ab71 in fgets_unlocked () from /lib64/libc.so.6
#7  0x00007fabbcfe1ab1 in internal_getent () from /lib64/libnss_files.so.2
#8  0x00007fabbcfe2e96 in _nss_files_gethostbyname2_r () from
/lib64/libnss_files.so.2
#9  0x00007fabc3fed915 in gethostbyname2_r@@GLIBC_2.2.5 () from
/lib64/libc.so.6
#10 0x00007fabc3fbad62 in gaih_inet () from /lib64/libc.so.6
#11 0x00007fabc3fbcfbf in getaddrinfo () from /lib64/libc.so.6

I dug into the source for el6's glibc and found that, indeed, it uses
mmap() to read /etc/hosts on every host resolution.
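
(If you're curious whether your own servers are doing the same thing, a crude
check is to count how often /etc/hosts gets opened over ten seconds. Note that
attaching strace briefly slows the process, and this assumes the process is
named kudu-tserver:)

sudo timeout 10 strace -f -e trace=open -p $(pgrep -x kudu-tserver) 2>&1 | grep -c '/etc/hosts'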

This led me to question why mmap was so slow. The most common reason I've
seen for that is when transparent hugepages are enabled on el6. It turns
out King's system does have THP enabled, so I asked him to disable it. See
https://issues.apache.org/jira/browse/KUDU-1447
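
(On el6 the THP knobs live under a redhat_ prefix. Roughly, as root -- and
note this does not persist across reboots unless you also put it in rc.local
or on the kernel command line:)

echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag
cat /sys/kernel/mm/redhat_transparent_hugepage/enabled   # should now show [never]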

- The last thing I noticed was a bunch of threads stuck in this stack:

Thread 54 (Thread 0x7fabb95e5700 (LWP 29901)):
#0  0x00007fabc504e334 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007fabc50495d8 in _L_lock_854 () from /lib64/libpthread.so.0
#2  0x00007fabc50494a7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000000001b3669f in kudu::Mutex::Acquire() ()
#4  0x0000000001b17193 in kudu::KernelStackWatchdog::Unregister() ()
#5  0x0000000001b1736a in void
kudu::threadlocal::internal::Destroy<kudu::KernelStackWatchdog::TLS>(void*)
()
#6  0x0000000001b6c316 in
kudu::threadlocal::internal::InvokeDestructors(void*) ()
#7  0x00007fabc5047b39 in start_thread () from /lib64/libpthread.so.0
#8  0x00007fabc3fd2bcd in clone () from /lib64/libc.so.6

I filed KUDU-2215 for this one. This particular stuck thread issue wouldn't
cause problems in Kudu 1.5, but in 1.4 and earlier the slow shutdown of a
thread could cause consensus to get "stuck" as well in a stack like this:
Thread 2137 (Thread 0x7fa4e4c1c700 (LWP 1564)):
#0  0x00007fabc504ba5e in pthread_cond_timedwait@@GLIBC_2.3.2 () from
/lib64/libpthread.so.0
#1  0x0000000001ad9cdd in
kudu::ConditionVariable::TimedWait(kudu::MonoDelta const&) const ()
#2  0x0000000001b63d0c in kudu::ThreadJoiner::Join() ()
#3  0x0000000001b55b4c in kudu::ResettableHeartbeater::Stop() ()
#4  0x0000000000a65b42 in kudu::consensus::Peer::Close() ()
#5  0x0000000000a819b9 in kudu::consensus::PeerManager::Close() ()
#6  0x0000000000a45dde in
kudu::consensus::RaftConsensus::BecomeReplicaUnlocked() ()
#7  0x0000000000a51063 in
kudu::consensus::RaftConsensus::HandleTermAdvanceUnlocked(long,
kudu::consensus::ReplicaState::FlushToDisk) ()
#8  0x0000000000a5145f in
kudu::consensus::RaftConsensus::NotifyTermChange(long) ()
#9  0x0000000000a6ceac in
kudu::consensus::PeerMessageQueue::NotifyObserversOfTermChangeTask(long) ()

which I also found in his stacks fairly often.

To work around this, I suggested disabling the kernel_stack_watchdog using a
hidden configuration. I'll look at a fix for this that we can backport to
earlier versions as well.

King -- please report back in a couple days whether the suggestions we
talked about help resolve the problems on your cluster.

Thanks
-Todd





On Thu, Nov 9, 2017 at 11:27 PM, Lee King <yu...@gmail.com> wrote:

> Hi, Todd
>     Our kudu cluster 's error/warning log just like the
> https://issues.apache.org/jira/browse/KUDU-1078, and the issues's status
> is reopen, I have upload log for analysis the issues, If you want to more
> detail, just tell me 😄。
> log files:
> https://drive.google.com/open?id=1_1l2xpT3-NmumgI_sIdxch-6BocXqTCt
> https://drive.google.com/open?id=0B4-NyGFtYNboN3NYNW1pVWQwcFVLa083V
> kRIUTZHRk85WHY4
>
>
>
>
>
>
> 2017-11-04 13:15 GMT+08:00 Todd Lipcon <to...@cloudera.com>:
>
>> One thing you might try is to update the consensus rpc timeout to 30
>> seconds instead of 1. We changed the default in later versions.
>>
>> I'd also recommend updating up 1.4 or 1.5 for other related fixes to
>> consensus stability. I think I recall you were on 1.3 still?
>>
>> Todd
>>
>>
>> On Nov 3, 2017 7:47 PM, "Lee King" <yu...@gmail.com> wrote:
>>
>> Hi,
>>     Our kudu cluster have ran well a long time,  but write became slowly
>> recently,client also come out rpc timeout. I check the warning and find
>> vast error look this:
>> W1104 10:25:16.833736 10271 consensus_peers.cc:365] T
>> 149ffa58ac274c9ba8385ccfdc01ea14 P 59c768eb799243678ee7fa3f83801316 ->
>> Peer 1c67a7e7ff8f4de494469766641fccd1 (cloud-sk-ds-08:7050): Couldn't
>> send request to peer 1c67a7e7ff8f4de494469766641fccd1 for tablet
>> 149ffa58ac274c9ba8385ccfdc01ea14. Status: Timed out: UpdateConsensus RPC
>> to 10.6.60.9:7050 timed out after 1.000s (SENT). Retrying in the next
>> heartbeat period. Already tried 5 times.
>>     I change the configure rpc_service_queue_le
>> ngth=400,rpc_num_service_threads=40, but it takes no effect.
>>     Our cluster include 5 master , 10 ts. 3800G data, 800 tablet per ts.
>> I check one of the ts machine's memory, 14G left(128 In all), thread
>> 4739(max 32000), openfile 28000(max 65536), cpu disk utilization ratio
>> about 30%(32 core), disk util  less than 30%.
>>     Any suggestion for this? Thanks!
>>
>>
>>
>


-- 
Todd Lipcon
Software Engineer, Cloudera

Re: The service queue is full; it has 400 items.. Retrying in the next heartbeat period.

Posted by Lee King <yu...@gmail.com>.
Hi, Todd,
    Our Kudu cluster's error/warning logs look just like
https://issues.apache.org/jira/browse/KUDU-1078, and that issue's status is
reopened. I have uploaded logs to help analyze the issue; if you want more
detail, just tell me 😄.
log files:
https://drive.google.com/open?id=1_1l2xpT3-NmumgI_sIdxch-6BocXqTCt
https://drive.google.com/open?id=0B4-NyGFtYNboN3NYNW1pVWQwcFVLa083VkRIUTZHRk85WHY4






2017-11-04 13:15 GMT+08:00 Todd Lipcon <to...@cloudera.com>:

> One thing you might try is to update the consensus rpc timeout to 30
> seconds instead of 1. We changed the default in later versions.
>
> I'd also recommend updating up 1.4 or 1.5 for other related fixes to
> consensus stability. I think I recall you were on 1.3 still?
>
> Todd
>
>
> On Nov 3, 2017 7:47 PM, "Lee King" <yu...@gmail.com> wrote:
>
> Hi,
>     Our kudu cluster have ran well a long time,  but write became slowly
> recently,client also come out rpc timeout. I check the warning and find
> vast error look this:
> W1104 10:25:16.833736 10271 consensus_peers.cc:365] T
> 149ffa58ac274c9ba8385ccfdc01ea14 P 59c768eb799243678ee7fa3f83801316 ->
> Peer 1c67a7e7ff8f4de494469766641fccd1 (cloud-sk-ds-08:7050): Couldn't
> send request to peer 1c67a7e7ff8f4de494469766641fccd1 for tablet
> 149ffa58ac274c9ba8385ccfdc01ea14. Status: Timed out: UpdateConsensus RPC
> to 10.6.60.9:7050 timed out after 1.000s (SENT). Retrying in the next
> heartbeat period. Already tried 5 times.
>     I change the configure rpc_service_queue_le
> ngth=400,rpc_num_service_threads=40, but it takes no effect.
>     Our cluster include 5 master , 10 ts. 3800G data, 800 tablet per ts. I
> check one of the ts machine's memory, 14G left(128 In all), thread 4739(max
> 32000), openfile 28000(max 65536), cpu disk utilization ratio about
> 30%(32 core), disk util  less than 30%.
>     Any suggestion for this? Thanks!
>
>
>

Re: The service queue is full; it has 400 items.. Retrying in the next heartbeat period.

Posted by Lee King <yu...@gmail.com>.
Hi, Todd,
    I have changed consensus_rpc_timeout_ms from 1 s to 30 s, but it does not
seem to fix the problem either. Here is some error info from the
kudu-tserver.ERROR file:

E1106 11:10:46.077805 30426 consensus_queue.cc:428] T
f9eb9a8cd4cd4d26a9340928fb4e9327 P e817589db79348ad8722a697f1671720
[LEADER]: Error trying to read ahead of the log while preparing peer
request: Incomplete: Op with index 122 is ahead of the local log (next
sequential op: 122). Destination peer: Peer:
9229935dbf8f4f8f8f63b279fb0796ea, Is new: false, Last received: 149.122,
Next index: 123, Last known committed idx: 122, Last exchange result:
ERROR, Needs tablet copy: false


2017-11-04 13:15 GMT+08:00 Todd Lipcon <to...@cloudera.com>:

> One thing you might try is to update the consensus rpc timeout to 30
> seconds instead of 1. We changed the default in later versions.
>
> I'd also recommend updating up 1.4 or 1.5 for other related fixes to
> consensus stability. I think I recall you were on 1.3 still?
>
> Todd
>
>
> On Nov 3, 2017 7:47 PM, "Lee King" <yu...@gmail.com> wrote:
>
> Hi,
>     Our kudu cluster have ran well a long time,  but write became slowly
> recently,client also come out rpc timeout. I check the warning and find
> vast error look this:
> W1104 10:25:16.833736 10271 consensus_peers.cc:365] T
> 149ffa58ac274c9ba8385ccfdc01ea14 P 59c768eb799243678ee7fa3f83801316 ->
> Peer 1c67a7e7ff8f4de494469766641fccd1 (cloud-sk-ds-08:7050): Couldn't
> send request to peer 1c67a7e7ff8f4de494469766641fccd1 for tablet
> 149ffa58ac274c9ba8385ccfdc01ea14. Status: Timed out: UpdateConsensus RPC
> to 10.6.60.9:7050 timed out after 1.000s (SENT). Retrying in the next
> heartbeat period. Already tried 5 times.
>     I change the configure rpc_service_queue_le
> ngth=400,rpc_num_service_threads=40, but it takes no effect.
>     Our cluster include 5 master , 10 ts. 3800G data, 800 tablet per ts. I
> check one of the ts machine's memory, 14G left(128 In all), thread 4739(max
> 32000), openfile 28000(max 65536), cpu disk utilization ratio about
> 30%(32 core), disk util  less than 30%.
>     Any suggestion for this? Thanks!
>
>
>

Re: The service queue is full; it has 400 items.. Retrying in the next heartbeat period.

Posted by Lee King <yu...@gmail.com>.
We are on 1.4 now; I will try updating the parameter. Thanks, Todd.

2017-11-04 13:15 GMT+08:00 Todd Lipcon <to...@cloudera.com>:

> One thing you might try is to update the consensus rpc timeout to 30
> seconds instead of 1. We changed the default in later versions.
>
> I'd also recommend updating up 1.4 or 1.5 for other related fixes to
> consensus stability. I think I recall you were on 1.3 still?
>
> Todd
>
>
> On Nov 3, 2017 7:47 PM, "Lee King" <yu...@gmail.com> wrote:
>
> Hi,
>     Our kudu cluster have ran well a long time,  but write became slowly
> recently,client also come out rpc timeout. I check the warning and find
> vast error look this:
> W1104 10:25:16.833736 10271 consensus_peers.cc:365] T
> 149ffa58ac274c9ba8385ccfdc01ea14 P 59c768eb799243678ee7fa3f83801316 ->
> Peer 1c67a7e7ff8f4de494469766641fccd1 (cloud-sk-ds-08:7050): Couldn't
> send request to peer 1c67a7e7ff8f4de494469766641fccd1 for tablet
> 149ffa58ac274c9ba8385ccfdc01ea14. Status: Timed out: UpdateConsensus RPC
> to 10.6.60.9:7050 timed out after 1.000s (SENT). Retrying in the next
> heartbeat period. Already tried 5 times.
>     I change the configure rpc_service_queue_le
> ngth=400,rpc_num_service_threads=40, but it takes no effect.
>     Our cluster include 5 master , 10 ts. 3800G data, 800 tablet per ts. I
> check one of the ts machine's memory, 14G left(128 In all), thread 4739(max
> 32000), openfile 28000(max 65536), cpu disk utilization ratio about
> 30%(32 core), disk util  less than 30%.
>     Any suggestion for this? Thanks!
>
>
>

Re: The service queue is full; it has 400 items.. Retrying in the next heartbeat period.

Posted by Todd Lipcon <to...@cloudera.com>.
One thing you might try is to update the consensus rpc timeout to 30
seconds instead of 1. We changed the default in later versions.
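
(Concretely, the flag in question is --consensus_rpc_timeout_ms; a sketch,
with the flag file path being just an example:)

# add to each tserver's gflags file (e.g. --flagfile=/etc/kudu/conf/tserver.gflagfile) and restart
--consensus_rpc_timeout_ms=30000

Later releases raised the default, so after upgrading you may be able to drop
the override.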

I'd also recommend updating to 1.4 or 1.5 for other related fixes to
consensus stability. I think I recall you were on 1.3 still?

Todd


On Nov 3, 2017 7:47 PM, "Lee King" <yu...@gmail.com> wrote:

Hi,
    Our kudu cluster have ran well a long time,  but write became slowly
recently,client also come out rpc timeout. I check the warning and find
vast error look this:
W1104 10:25:16.833736 10271 consensus_peers.cc:365] T
149ffa58ac274c9ba8385ccfdc01ea14 P 59c768eb799243678ee7fa3f83801316 -> Peer
1c67a7e7ff8f4de494469766641fccd1 (cloud-sk-ds-08:7050): Couldn't send
request to peer 1c67a7e7ff8f4de494469766641fccd1 for tablet
149ffa58ac274c9ba8385ccfdc01ea14. Status: Timed out: UpdateConsensus RPC to
10.6.60.9:7050 timed out after 1.000s (SENT). Retrying in the next
heartbeat period. Already tried 5 times.
    I change the configure
rpc_service_queue_length=400,rpc_num_service_threads=40,
but it takes no effect.
    Our cluster include 5 master , 10 ts. 3800G data, 800 tablet per ts. I
check one of the ts machine's memory, 14G left(128 In all), thread 4739(max
32000), openfile 28000(max 65536), cpu disk utilization ratio about 30%(32
core), disk util  less than 30%.
    Any suggestion for this? Thanks!