You are viewing a plain text version of this content. The canonical link for it is here.

Posted to users@kafka.apache.org by Jaikiran Pai <ja...@gmail.com> on 2016/04/26 10:16:45 UTC

0.9.0.1 High CPU usage on broker - Why is the default heart beat interval set too low (3 seconds)?

We have been investigating an unreasonably high CPU usage of the Kafka 
process when there's no _real_ activity going on between the consumers 
and the broker. We had this issue in 0.8.x days and is exactly the same 
as what's being tracked in this JIRA 
https://issues.apache.org/jira/browse/KAFKA-493. We now use 0.9.0.1 
(both client libraries, new consumer APIs and the broker). However, we 
still see some CPU usage which looks a bit on the higher side when 
there's no real message production or consumption going on. Just 
connecting around 10-20 consumers on different topics of a single broker 
Kafka instance shows up this issue.

All our debugging so far points to the Processor thread on the broker 
side which has a high CPU usage. There are N such Processor threads, 
which always are in the RUNNABLE state doing this:

"kafka-network-thread-0-PLAINTEXT-0" #21 prio=5 os_prio=0 
tid=0x00007f1858c4a800 nid=0xc81 runnable [0x00007f18106cb000]
    java.lang.Thread.State: RUNNABLE
     at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
     at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
     at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
     at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
     - locked <0x00000006c0046128> (a sun.nio.ch.Util$2)
     - locked <0x00000006c0046118> (a java.util.Collections$UnmodifiableSet)
     - locked <0x00000006c0046068> (a sun.nio.ch.EPollSelectorImpl)
     at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
     at org.apache.kafka.common.network.Selector.select(Selector.java:425)
     at org.apache.kafka.common.network.Selector.poll(Selector.java:254)
     at kafka.network.Processor.run(SocketServer.scala:413)
     at java.lang.Thread.run(Thread.java:745)



 From what we have narrowed down so far, this thread in itself isn't a 
"culprit", since when they are no consumers connected, the CPU isn't 
high. However when a consumer connects to this and just waits for 
messages, these threads start playing a role in the high CPU usage. Our 
debugging shows that each of these X number of consumers that connect to 
the broker keep doing 2 things when they are "idle":

1) A delayed operation every Y seconds which does the auto commit of 
offsets.
2) Sending heartbeats every 3 seconds to the broker

We disabled auto commits of offsets since that's the semantic we wanted. 
So #1 isn't really an issue. However, #2 is. It looks like the default 
heartbeat interval is 3 seconds which is too low, IMO. This translates 
to a network socket operation every 3 seconds which then has to be 
processed by the broker side Processor thread. If there's just a single 
consumer, this doesn't make much of a difference. As soon as you add 
more consumers, the Processor on the broker side has to be start 
processing each of these incoming heartbeats which become too frequent. 
Even though the interval is 3 seconds, the incoming heartbeats to the 
broker can be much more frequent when more consumers are involved since 
the 3 second interval is just per consumer. So in practice there can be 
a heartbeat coming every second or few milli seconds from the X 
consumers to this broker which can contribute to this high CPU usage 
when the system is practically idle.

So coming to the real question - why is the default heart beat interval 
so low - 3 seconds? We increased it to 29 seconds (just 1 second less 
than the session timeout) per consumer (via consumer configs) and in 
addition to disabling auto commit, these changes have improved 
noticeably the CPU usage.

Ideally, what would be a better value for the heart beat interval that 
doesn't unnecessary flood these messages and cause the broker to 
continuous process them?

-Jaikiran

Re: 0.9.0.1 High CPU usage on broker - Why is the default heart beat interval set too low (3 seconds)?

Posted by Jaikiran Pai <ja...@gmail.com>.

Thanks for responding, Liquan. Just so I understand better, where does 
the coordinator reside? On the broker side? And the coordinator knows 
the heart beat interval configured on each consumer that's connected to 
the broker, I am guessing. For the usecase we have, having a higher 
value for heart beat works out fine for us.

I haven't yet fully thought of what I'm proposing, but would it be 
wise/feasible to introduce an option which disables heart beats 
altogether and instead relies on regular consumption/communication 
between the consumer and broker as a heart beat? That way, the usual 
consumption and offset commits from the consumer can be used as an 
inference of whether the consumer is alive or dead. That way, there 
won't be a need for regular heart beats being passed by the consumer. 
This helps in the case where you have many consumers connected to the 
broker and practically idle except for this heart beats being passed 
around. One implication of this would be that if a message isn't 
produced in the topics that the consumers listen to, for a while, then 
there won't be any communication (isn't it?) from the consumer and hence 
it might be considered dead and a re-balance triggered. Is that a 
bad/worse thing though?

-Jaikiran

On Tuesday 26 April 2016 02:10 PM, Liquan Pei wrote:
> Hi Jaikiran,
>
> Thanks for the email and the detailed analysis. One reason for setting the
> heartbeat interval to a lower value is for faster failure detection. On
> every received heartbeat, the coordinator starts (or resets) a timer. If no
> heartbeat is received when the timer expires, the coordinator marks the
> member dead and signals the rest of the group that they should rejoin so
> that partitions can be reassigned.
>
> I think the trade off here is the CPU usage and how fast you want to detect
> the consumer failure. Faster failure detection makes the partitions
> assigned to dead consumers to assign to other consumers.
>
> Best,
> Liquan
>
>
> On Tue, Apr 26, 2016 at 1:16 AM, Jaikiran Pai <ja...@gmail.com>
> wrote:
>
>> We have been investigating an unreasonably high CPU usage of the Kafka
>> process when there's no _real_ activity going on between the consumers and
>> the broker. We had this issue in 0.8.x days and is exactly the same as
>> what's being tracked in this JIRA
>> https://issues.apache.org/jira/browse/KAFKA-493. We now use 0.9.0.1 (both
>> client libraries, new consumer APIs and the broker). However, we still see
>> some CPU usage which looks a bit on the higher side when there's no real
>> message production or consumption going on. Just connecting around 10-20
>> consumers on different topics of a single broker Kafka instance shows up
>> this issue.
>>
>> All our debugging so far points to the Processor thread on the broker side
>> which has a high CPU usage. There are N such Processor threads, which
>> always are in the RUNNABLE state doing this:
>>
>> "kafka-network-thread-0-PLAINTEXT-0" #21 prio=5 os_prio=0
>> tid=0x00007f1858c4a800 nid=0xc81 runnable [0x00007f18106cb000]
>>     java.lang.Thread.State: RUNNABLE
>>      at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>>      at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
>>      at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
>>      at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
>>      - locked <0x00000006c0046128> (a sun.nio.ch.Util$2)
>>      - locked <0x00000006c0046118> (a java.util.Collections$UnmodifiableSet)
>>      - locked <0x00000006c0046068> (a sun.nio.ch.EPollSelectorImpl)
>>      at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
>>      at org.apache.kafka.common.network.Selector.select(Selector.java:425)
>>      at org.apache.kafka.common.network.Selector.poll(Selector.java:254)
>>      at kafka.network.Processor.run(SocketServer.scala:413)
>>      at java.lang.Thread.run(Thread.java:745)
>>
>>
>>
>>  From what we have narrowed down so far, this thread in itself isn't a
>> "culprit", since when they are no consumers connected, the CPU isn't high.
>> However when a consumer connects to this and just waits for messages, these
>> threads start playing a role in the high CPU usage. Our debugging shows
>> that each of these X number of consumers that connect to the broker keep
>> doing 2 things when they are "idle":
>>
>> 1) A delayed operation every Y seconds which does the auto commit of
>> offsets.
>> 2) Sending heartbeats every 3 seconds to the broker
>>
>> We disabled auto commits of offsets since that's the semantic we wanted.
>> So #1 isn't really an issue. However, #2 is. It looks like the default
>> heartbeat interval is 3 seconds which is too low, IMO. This translates to a
>> network socket operation every 3 seconds which then has to be processed by
>> the broker side Processor thread. If there's just a single consumer, this
>> doesn't make much of a difference. As soon as you add more consumers, the
>> Processor on the broker side has to be start processing each of these
>> incoming heartbeats which become too frequent. Even though the interval is
>> 3 seconds, the incoming heartbeats to the broker can be much more frequent
>> when more consumers are involved since the 3 second interval is just per
>> consumer. So in practice there can be a heartbeat coming every second or
>> few milli seconds from the X consumers to this broker which can contribute
>> to this high CPU usage when the system is practically idle.
>>
>> So coming to the real question - why is the default heart beat interval so
>> low - 3 seconds? We increased it to 29 seconds (just 1 second less than the
>> session timeout) per consumer (via consumer configs) and in addition to
>> disabling auto commit, these changes have improved noticeably the CPU usage.
>>
>> Ideally, what would be a better value for the heart beat interval that
>> doesn't unnecessary flood these messages and cause the broker to continuous
>> process them?
>>
>> -Jaikiran
>>
>
>

Re: 0.9.0.1 High CPU usage on broker - Why is the default heart beat interval set too low (3 seconds)?

Posted by Liquan Pei <li...@gmail.com>.

Hi Jaikiran,

Thanks for the email and the detailed analysis. One reason for setting the
heartbeat interval to a lower value is for faster failure detection. On
every received heartbeat, the coordinator starts (or resets) a timer. If no
heartbeat is received when the timer expires, the coordinator marks the
member dead and signals the rest of the group that they should rejoin so
that partitions can be reassigned.

I think the trade off here is the CPU usage and how fast you want to detect
the consumer failure. Faster failure detection makes the partitions
assigned to dead consumers to assign to other consumers.

Best,
Liquan


On Tue, Apr 26, 2016 at 1:16 AM, Jaikiran Pai <ja...@gmail.com>
wrote:

> We have been investigating an unreasonably high CPU usage of the Kafka
> process when there's no _real_ activity going on between the consumers and
> the broker. We had this issue in 0.8.x days and is exactly the same as
> what's being tracked in this JIRA
> https://issues.apache.org/jira/browse/KAFKA-493. We now use 0.9.0.1 (both
> client libraries, new consumer APIs and the broker). However, we still see
> some CPU usage which looks a bit on the higher side when there's no real
> message production or consumption going on. Just connecting around 10-20
> consumers on different topics of a single broker Kafka instance shows up
> this issue.
>
> All our debugging so far points to the Processor thread on the broker side
> which has a high CPU usage. There are N such Processor threads, which
> always are in the RUNNABLE state doing this:
>
> "kafka-network-thread-0-PLAINTEXT-0" #21 prio=5 os_prio=0
> tid=0x00007f1858c4a800 nid=0xc81 runnable [0x00007f18106cb000]
>    java.lang.Thread.State: RUNNABLE
>     at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>     at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
>     at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
>     at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
>     - locked <0x00000006c0046128> (a sun.nio.ch.Util$2)
>     - locked <0x00000006c0046118> (a java.util.Collections$UnmodifiableSet)
>     - locked <0x00000006c0046068> (a sun.nio.ch.EPollSelectorImpl)
>     at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
>     at org.apache.kafka.common.network.Selector.select(Selector.java:425)
>     at org.apache.kafka.common.network.Selector.poll(Selector.java:254)
>     at kafka.network.Processor.run(SocketServer.scala:413)
>     at java.lang.Thread.run(Thread.java:745)
>
>
>
> From what we have narrowed down so far, this thread in itself isn't a
> "culprit", since when they are no consumers connected, the CPU isn't high.
> However when a consumer connects to this and just waits for messages, these
> threads start playing a role in the high CPU usage. Our debugging shows
> that each of these X number of consumers that connect to the broker keep
> doing 2 things when they are "idle":
>
> 1) A delayed operation every Y seconds which does the auto commit of
> offsets.
> 2) Sending heartbeats every 3 seconds to the broker
>
> We disabled auto commits of offsets since that's the semantic we wanted.
> So #1 isn't really an issue. However, #2 is. It looks like the default
> heartbeat interval is 3 seconds which is too low, IMO. This translates to a
> network socket operation every 3 seconds which then has to be processed by
> the broker side Processor thread. If there's just a single consumer, this
> doesn't make much of a difference. As soon as you add more consumers, the
> Processor on the broker side has to be start processing each of these
> incoming heartbeats which become too frequent. Even though the interval is
> 3 seconds, the incoming heartbeats to the broker can be much more frequent
> when more consumers are involved since the 3 second interval is just per
> consumer. So in practice there can be a heartbeat coming every second or
> few milli seconds from the X consumers to this broker which can contribute
> to this high CPU usage when the system is practically idle.
>
> So coming to the real question - why is the default heart beat interval so
> low - 3 seconds? We increased it to 29 seconds (just 1 second less than the
> session timeout) per consumer (via consumer configs) and in addition to
> disabling auto commit, these changes have improved noticeably the CPU usage.
>
> Ideally, what would be a better value for the heart beat interval that
> doesn't unnecessary flood these messages and cause the broker to continuous
> process them?
>
> -Jaikiran
>



-- 
Liquan Pei
Software Engineer, Confluent Inc