Posted to users@kafka.apache.org by Mathias Söderberg <ma...@gmail.com> on 2014/12/08 10:25:50 UTC

Increased CPU usage with 0.8.2-beta

Good day,

I upgraded a Kafka cluster from v0.8.1.1 to v0.8.2-beta and noticed that
the CPU usage on the broker machines went up by roughly 40%, from ~60% to
~100% and am wondering if anyone else has experienced something similar?
The load average also went up by 2x-3x.

We're running on EC2 and the cluster currently consists of four m1.xlarge,
with roughly 1100 topics / 4000 partitions. Using Java 7 (1.7.0_65 to be
exact) and Scala 2.9.2. Configurations can be found over here:
https://gist.github.com/mthssdrbrg/7df34a795e07eef10262.

I'm assuming that this is not expected behaviour for 0.8.2-beta?

Best regards,
Mathias

Re: Increased CPU usage with 0.8.2-beta

Posted by Mathias Söderberg <ma...@gmail.com>.
Jun,

Yeah, sure, I'll take it for a spin tomorrow.

On Mon Feb 02 2015 at 11:08:42 PM Jun Rao <ju...@confluent.io> wrote:

> Mathias,
>
> Thanks for the info. I took a quick look. The biggest difference I saw is
> the org.xerial.snappy.SnappyNative.rawCompress() call. In 0.8.1.1, it uses
> about 0.05% of the CPU. In 0.8.2.0, it uses about 0.10% of the CPU. We did
> upgrade snappy from 1.0.5 in 0.8.1.1 to 1.1.1.6 in 0.8.2.0. Could you try
> to use the same version of snappy in 0.8.1.1 and 0.8.2.0 and rerun your
> tests?
>
> Jun
>
> On Mon, Feb 2, 2015 at 1:15 PM, Mathias Söderberg <
> mathias.soederberg@gmail.com> wrote:
>
> > Hi all,
> >
> > I ran the same hprof test on 0.8.1.1, and also did a re-run on
> > 0.8.2.0-rc2, attached logs from both runs. Both runs lasted for 30-40
> > minutes. The configurations used can be found over here:
> > https://gist.github.com/mthssdrbrg/5fcb9fbdb851d8cc66a2. The
> > configuration used for the first run (on 0.8.2-beta) can be found here:
> > https://gist.github.com/mthssdrbrg/7df34a795e07eef10262.
> >
> > The cluster is running on EC2, currently on 4 m1.xlarge, with a ZooKeeper
> > cluster consisting of 3 nodes. There's 1248 topics, and a total of 4143
> > partitions, using replication factor 3.
> >
> > Let me know if there's any other information I can provide.
> >
> > Best regards,
> > Mathias
> >
> > On Mon Feb 02 2015 at 6:50:54 PM Jay Kreps <ja...@gmail.com> wrote:
> >
> >> Looking at that profile:
> >>
> >> Misc. blocking socket activity, not actual CPU work:
> >>    1  5.24%  5.24%   67781 300447 java.net.PlainSocketImpl.socketAccept
> >>    2  5.24% 10.49%   67775 300515 java.net.PlainSocketImpl.socketAccept
> >>    3  5.24% 15.73%   67773 300567 java.net.PlainSocketImpl.socketAccept
> >>    4  5.24% 20.97%   67683 301396 sun.nio.ch.ServerSocketChannelImpl.accept0
> >>    5  5.23% 26.20%   67582 301395 sun.nio.ch.EPollArrayWrapper.epollWait
> >>    6  5.23% 31.42%   67529 301519 sun.nio.ch.EPollArrayWrapper.epollWait
> >>    7  5.13% 36.55%   66297 302447 sun.nio.ch.EPollArrayWrapper.epollWait
> >>    8  3.95% 40.51%   51071 302446 sun.nio.ch.EPollArrayWrapper.epollWait
> >>    9  3.65% 44.16%   47234 303479 sun.nio.ch.EPollArrayWrapper.epollWait
> >>   10  3.64% 47.80%   47019 302444 sun.nio.ch.EPollArrayWrapper.epollWait
> >>   11  3.62% 51.42%   46768 302445 sun.nio.ch.EPollArrayWrapper.epollWait
> >>   12  3.27% 54.69%   42237 303475 sun.nio.ch.EPollArrayWrapper.epollWait
> >>   13  3.16% 57.85%   40892 303476 sun.nio.ch.EPollArrayWrapper.epollWait
> >>   14  3.14% 60.99%   40556 303478 sun.nio.ch.EPollArrayWrapper.epollWait
> >>   15  3.05% 64.04%   39428 303480 sun.nio.ch.EPollArrayWrapper.epollWait
> >>   16  2.68% 66.72%   34673 303477 sun.nio.ch.EPollArrayWrapper.epollWait
> >>   17  1.23% 67.95%   15867 303520 sun.nio.ch.EPollArrayWrapper.epollWait
> >>   18  0.98% 68.93%   12663 303541 sun.nio.ch.EPollArrayWrapper.epollWait
> >>   19  0.92% 69.85%   11920 303536 sun.nio.ch.EPollArrayWrapper.epollWait
> >>   20  0.85% 70.70%   11015 303546 sun.nio.ch.EPollArrayWrapper.epollWait
> >>   21  0.82% 71.53%   10625 303534 sun.nio.ch.EPollArrayWrapper.epollWait
> >>   22  0.69% 72.21%    8866 303512 sun.nio.ch.EPollArrayWrapper.epollWait
> >>
> >>   TransferTo to write data to socket, not actual CPU work (mostly)
> >>   23  0.36% 72.57%    4592 304991 sun.nio.ch.FileDispatcherImpl.write0
> >>   24  0.36% 72.92%    4589 305021 sun.nio.ch.FileDispatcherImpl.write0
> >>   25  0.35% 73.27%    4503 304992 sun.nio.ch.FileDispatcherImpl.write0
> >>   26  0.32% 73.60%    4198 305022 sun.nio.ch.FileDispatcherImpl.write0
> >>   27  0.25% 73.85%    3250 305246 sun.nio.ch.FileDispatcherImpl.write0
> >>   28  0.25% 74.10%    3249 305497 sun.nio.ch.FileDispatcherImpl.write0
> >>
> >>   Request channel - Actual CPU
> >>   29  0.22% 74.32%    2862 305000 sun.misc.Unsafe.unpark
> >>   30  0.17% 74.49%    2163 304838 sun.misc.Unsafe.unpark
> >>   31  0.14% 74.63%    1795 305240 sun.misc.Unsafe.unpark
> >>
> >>   Purgatory - Actual CPU
> >>   32  0.12% 74.75%    1553 305137 scala.collection.immutable.HashMap.$plus
> >>   33  0.12% 74.87%    1546 305100 java.util.concurrent.ConcurrentHashMap.get
> >>   34  0.12% 74.98%    1531 305181 java.util.concurrent.ConcurrentHashMap.get
> >>   35  0.12% 75.10%    1526 305234 scala.collection.immutable.HashMap.$plus
> >>   36  0.12% 75.22%    1521 305401 scala.collection.immutable.HashMap.$plus
> >>   37  0.12% 75.34%    1519 305186 java.util.concurrent.ConcurrentHashMap.get
> >>   38  0.12% 75.46%    1517 305264 java.util.concurrent.ConcurrentHashMap.get
> >>   39  0.12% 75.57%    1514 305271 java.util.concurrent.ConcurrentHashMap.get
> >>   40  0.12% 75.69%    1511 305250 scala.collection.immutable.HashMap.$plus
> >>   41  0.12% 75.81%    1499 305155 java.util.concurrent.ConcurrentHashMap.get
> >>   42  0.12% 75.92%    1496 305113 scala.collection.immutable.HashMap.$plus
> >>   43  0.12% 76.04%    1496 305263 scala.collection.immutable.HashMap.$plus
> >>   44  0.11% 76.15%    1480 305235 scala.collection.immutable.HashMap.$plus
> >>   45  0.11% 76.26%    1444 305185 scala.collection.immutable.HashMap.$plus
> >>   46  0.11% 76.37%    1428 305102 java.util.concurrent.ConcurrentHashMap.get
> >>   47  0.11% 76.48%    1418 305320 java.util.concurrent.ConcurrentHashMap.get
> >>
> >>   Compression - Actual CPU
> >>   48  0.09% 76.58%    1187 306546 org.xerial.snappy.SnappyNative.rawCompress
> >>   49  0.09% 76.67%    1174 305584 org.xerial.snappy.SnappyNative.rawCompress
> >>   50  0.09% 76.76%    1173 305545 org.xerial.snappy.SnappyNative.rawCompress
> >>   51  0.09% 76.85%    1165 305661 org.xerial.snappy.SnappyNative.rawCompress
> >>   52  0.09% 76.94%    1152 307069 org.xerial.snappy.SnappyNative.rawCompress
> >>   53  0.09% 77.02%    1121 305997 org.xerial.snappy.SnappyNative.rawCompress
> >>   54  0.09% 77.11%    1117 307035 org.xerial.snappy.SnappyNative.rawCompress
> >>   55  0.09% 77.19%    1106 306190 org.xerial.snappy.SnappyNative.rawCompress
> >>
> >> On Mon, Feb 2, 2015 at 9:39 AM, Jay Kreps <ja...@gmail.com> wrote:
> >>
> >> > Ah, yeah, you're right. That is just wait time, not CPU time. We should
> >> > check that profile; it must be something else on the list.
> >> >
> >> > -Jay
> >> >
> >> > On Mon, Feb 2, 2015 at 9:33 AM, Jun Rao <ju...@confluent.io> wrote:
> >> >
> >> >> Hi, Mathias,
> >> >>
> >> >> From the hprof output, it seems that the top CPU consumers are
> >> >> socketAccept() and epollWait(). As far as I am aware, there hasn't
> been
> >> >> any
> >> >> significant changes in the socket server code btw 0.8.1 and 0.8.2.
> >> Could
> >> >> you run the same hprof test on 0.8.1 so that we can see the
> difference?
> >> >>
> >> >> Jaikiran,
> >> >>
> >> >> The fix you provided is probably unnecessary. The channel that we use
> >> in
> >> >> SimpleConsumer (BlockingChannel) is configured to be blocking. So
> even
> >> >> though the read from the socket is in a loop, each read blocks if
> >> there is
> >> >> no bytes received from the broker. So, that shouldn't cause extra CPU
> >> >> consumption.
> >> >>
> >> >> Thanks,
> >> >>
> >> >> Jun
> >> >>
> >> >> On Mon, Jan 26, 2015 at 10:05 AM, Mathias Söderberg <
> >> >> mathias.soederberg@gmail.com> wrote:
> >> >>
> >> >> > Hi Neha,
> >> >> >
> >> >> > I sent an e-mail earlier today, but noticed now that it didn't
> >> actually
> >> >> go
> >> >> > through.
> >> >> >
> >> >> > Anyhow, I've attached two files, one with output from a 10 minute
> run
> >> >> and
> >> >> > one with output from a 30 minute run. Realized that maybe I
> should've
> >> >> done
> >> >> > one or two runs with 0.8.1.1 as well, but nevertheless.
> >> >> >
> >> >> > I upgraded our staging cluster to 0.8.2.0-rc2, and I'm seeing the
> >> same
> >> >> CPU
> >> >> > usage as with the beta version (basically pegging all cores). If I
> >> >> manage
> >> >> > to find the time I'll do another run with hprof on the rc2 version
> >> later
> >> >> > today.
> >> >> >
> >> >> > Best regards,
> >> >> > Mathias
> >> >> >
> >> >> > On Tue Dec 09 2014 at 10:08:21 PM Neha Narkhede <neha@confluent.io
> >
> >> >> wrote:
> >> >> >
> >> >> >> The following should be sufficient
> >> >> >>
> >> >> >> java
> >> >> >> -agentlib:hprof=cpu=samples,depth=100,interval=20,lineno=
> >> >> >> y,thread=y,file=kafka.hprof
> >> >> >> <classname>
> >> >> >>
> >> >> >> You would need to start the Kafka server with the settings above
> for
> >> >> >> sometime until you observe the problem.
> >> >> >>
> >> >> >> On Tue, Dec 9, 2014 at 3:47 AM, Mathias Söderberg <
> >> >> >> mathias.soederberg@gmail.com> wrote:
> >> >> >>
> >> >> >> > Hi Neha,
> >> >> >> >
> >> >> >> > Yeah sure. I'm not familiar with hprof, so any particular
> options
> >> I
> >> >> >> should
> >> >> >> > include or just run with defaults?
> >> >> >> >
> >> >> >> > Best regards,
> >> >> >> > Mathias
> >> >> >> >
> >> >> >> > On Mon Dec 08 2014 at 7:41:32 PM Neha Narkhede <
> neha@confluent.io
> >> >
> >> >> >> wrote:
> >> >> >> >
> >> >> >> > > Thanks for reporting the issue. Would you mind running hprof
> and
> >> >> >> sending
> >> >> >> > > the output?
> >> >> >> > >
> >> >> >> > > On Mon, Dec 8, 2014 at 1:25 AM, Mathias Söderberg <
> >> >> >> > > mathias.soederberg@gmail.com> wrote:
> >> >> >> > >
> >> >> >> > > > Good day,
> >> >> >> > > >
> >> >> >> > > > I upgraded a Kafka cluster from v0.8.1.1 to v0.8.2-beta and
> >> >> noticed
> >> >> >> > that
> >> >> >> > > > the CPU usage on the broker machines went up by roughly 40%,
> >> from
> >> >> >> ~60%
> >> >> >> > to
> >> >> >> > > > ~100% and am wondering if anyone else has experienced
> >> something
> >> >> >> > similar?
> >> >> >> > > > The load average also went up by 2x-3x.
> >> >> >> > > >
> >> >> >> > > > We're running on EC2 and the cluster currently consists of
> >> four
> >> >> >> > > m1.xlarge,
> >> >> >> > > > with roughly 1100 topics / 4000 partitions. Using Java 7
> >> >> (1.7.0_65
> >> >> >> to
> >> >> >> > be
> >> >> >> > > > exact) and Scala 2.9.2. Configurations can be found over
> here:
> >> >> >> > > > https://gist.github.com/mthssdrbrg/7df34a795e07eef10262.
> >> >> >> > > >
> >> >> >> > > > I'm assuming that this is not expected behaviour for
> >> 0.8.2-beta?
> >> >> >> > > >
> >> >> >> > > > Best regards,
> >> >> >> > > > Mathias
> >> >> >> > > >
> >> >> >> > >
> >> >> >> > >
> >> >> >> > >
> >> >> >> > > --
> >> >> >> > > Thanks,
> >> >> >> > > Neha
> >> >> >> > >
> >> >> >> >
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> --
> >> >> >> Thanks,
> >> >> >> Neha
> >> >> >>
> >> >> >
> >> >>
> >> >
> >> >
> >>
> >
>

Re: Increased CPU usage with 0.8.2-beta

Posted by Jun Rao <ju...@confluent.io>.
Mathias,

Thanks for the info. I took a quick look. The biggest difference I saw is
the org.xerial.snappy.SnappyNative.rawCompress() call. In 0.8.1.1, it uses
about 0.05% of the CPU. In 0.8.2.0, it uses about 0.10% of the CPU. We did
upgrade snappy from 1.0.5 in 0.8.1.1 to 1.1.1.6 in 0.8.2.0. Could you try
to use the same version of snappy in 0.8.1.1 and 0.8.2.0 and rerun your
tests?

Jun
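
One way to run that comparison is to drop the older snappy jar onto the 0.8.2
broker's classpath. A rough sketch only, assuming a standard binary install
where the bundled jar sits under libs/ (exact paths and file names may differ):

    bin/kafka-server-stop.sh
    mv libs/snappy-java-1.1.1.6.jar /tmp/
    cp /path/to/snappy-java-1.0.5.jar libs/   # e.g. the jar shipped in the 0.8.1.1 distribution's libs/
    bin/kafka-server-start.sh config/server.properties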

On Mon, Feb 2, 2015 at 1:15 PM, Mathias Söderberg <
mathias.soederberg@gmail.com> wrote:

> Hi all,
>
> I ran the same hprof test on 0.8.1.1, and also did a re-run on
> 0.8.2.0-rc2, attached logs from both runs. Both runs lasted for 30-40
> minutes. The configurations used can be found over here:
> https://gist.github.com/mthssdrbrg/5fcb9fbdb851d8cc66a2. The
> configuration used for the first run (on 0.8.2-beta) can be found here:
> https://gist.github.com/mthssdrbrg/7df34a795e07eef10262.
>
> The cluster is running on EC2, currently on 4 m1.xlarge, with a ZooKeeper
> cluster consisting of 3 nodes. There's 1248 topics, and a total of 4143
> partitions, using replication factor 3.
>
> Let me know if there's any other information I can provide.
>
> Best regards,
> Mathias
>
> On Mon Feb 02 2015 at 6:50:54 PM Jay Kreps <ja...@gmail.com> wrote:
>
>> Looking at that profile:
>>
>> Misc. blocking socket activity, not actual CPU work:
>>    1  5.24%  5.24%   67781 300447 java.net.PlainSocketImpl.socketAccept
>>    2  5.24% 10.49%   67775 300515 java.net.PlainSocketImpl.socketAccept
>>    3  5.24% 15.73%   67773 300567 java.net.PlainSocketImpl.socketAccept
>>    4  5.24% 20.97%   67683 301396 sun.nio.ch.ServerSocketChannelImpl.accept0
>>    5  5.23% 26.20%   67582 301395 sun.nio.ch.EPollArrayWrapper.epollWait
>>    6  5.23% 31.42%   67529 301519 sun.nio.ch.EPollArrayWrapper.epollWait
>>    7  5.13% 36.55%   66297 302447 sun.nio.ch.EPollArrayWrapper.epollWait
>>    8  3.95% 40.51%   51071 302446 sun.nio.ch.EPollArrayWrapper.epollWait
>>    9  3.65% 44.16%   47234 303479 sun.nio.ch.EPollArrayWrapper.epollWait
>>   10  3.64% 47.80%   47019 302444 sun.nio.ch.EPollArrayWrapper.epollWait
>>   11  3.62% 51.42%   46768 302445 sun.nio.ch.EPollArrayWrapper.epollWait
>>   12  3.27% 54.69%   42237 303475 sun.nio.ch.EPollArrayWrapper.epollWait
>>   13  3.16% 57.85%   40892 303476 sun.nio.ch.EPollArrayWrapper.epollWait
>>   14  3.14% 60.99%   40556 303478 sun.nio.ch.EPollArrayWrapper.epollWait
>>   15  3.05% 64.04%   39428 303480 sun.nio.ch.EPollArrayWrapper.epollWait
>>   16  2.68% 66.72%   34673 303477 sun.nio.ch.EPollArrayWrapper.epollWait
>>   17  1.23% 67.95%   15867 303520 sun.nio.ch.EPollArrayWrapper.epollWait
>>   18  0.98% 68.93%   12663 303541 sun.nio.ch.EPollArrayWrapper.epollWait
>>   19  0.92% 69.85%   11920 303536 sun.nio.ch.EPollArrayWrapper.epollWait
>>   20  0.85% 70.70%   11015 303546 sun.nio.ch.EPollArrayWrapper.epollWait
>>   21  0.82% 71.53%   10625 303534 sun.nio.ch.EPollArrayWrapper.epollWait
>>   22  0.69% 72.21%    8866 303512 sun.nio.ch.EPollArrayWrapper.epollWait
>>
>>   TransferTo to write data to socket, not actual CPU work (mostly)
>>   23  0.36% 72.57%    4592 304991 sun.nio.ch.FileDispatcherImpl.write0
>>   24  0.36% 72.92%    4589 305021 sun.nio.ch.FileDispatcherImpl.write0
>>   25  0.35% 73.27%    4503 304992 sun.nio.ch.FileDispatcherImpl.write0
>>   26  0.32% 73.60%    4198 305022 sun.nio.ch.FileDispatcherImpl.write0
>>   27  0.25% 73.85%    3250 305246 sun.nio.ch.FileDispatcherImpl.write0
>>   28  0.25% 74.10%    3249 305497 sun.nio.ch.FileDispatcherImpl.write0
>>
>>   Request channel - Actual CPU
>>   29  0.22% 74.32%    2862 305000 sun.misc.Unsafe.unpark
>>   30  0.17% 74.49%    2163 304838 sun.misc.Unsafe.unpark
>>   31  0.14% 74.63%    1795 305240 sun.misc.Unsafe.unpark
>>
>>   Purgatory - Actual CPU
>>   32  0.12% 74.75%    1553 305137 scala.collection.immutable.HashMap.$plus
>>   33  0.12% 74.87%    1546 305100 java.util.concurrent.ConcurrentHashMap.get
>>   34  0.12% 74.98%    1531 305181 java.util.concurrent.ConcurrentHashMap.get
>>   35  0.12% 75.10%    1526 305234 scala.collection.immutable.HashMap.$plus
>>   36  0.12% 75.22%    1521 305401 scala.collection.immutable.HashMap.$plus
>>   37  0.12% 75.34%    1519 305186 java.util.concurrent.ConcurrentHashMap.get
>>   38  0.12% 75.46%    1517 305264 java.util.concurrent.ConcurrentHashMap.get
>>   39  0.12% 75.57%    1514 305271 java.util.concurrent.ConcurrentHashMap.get
>>   40  0.12% 75.69%    1511 305250 scala.collection.immutable.HashMap.$plus
>>   41  0.12% 75.81%    1499 305155 java.util.concurrent.ConcurrentHashMap.get
>>   42  0.12% 75.92%    1496 305113 scala.collection.immutable.HashMap.$plus
>>   43  0.12% 76.04%    1496 305263 scala.collection.immutable.HashMap.$plus
>>   44  0.11% 76.15%    1480 305235 scala.collection.immutable.HashMap.$plus
>>   45  0.11% 76.26%    1444 305185 scala.collection.immutable.HashMap.$plus
>>   46  0.11% 76.37%    1428 305102 java.util.concurrent.ConcurrentHashMap.get
>>   47  0.11% 76.48%    1418 305320 java.util.concurrent.ConcurrentHashMap.get
>>
>>   Compression - Actual CPU
>>   48  0.09% 76.58%    1187 306546 org.xerial.snappy.SnappyNative.rawCompress
>>   49  0.09% 76.67%    1174 305584 org.xerial.snappy.SnappyNative.rawCompress
>>   50  0.09% 76.76%    1173 305545 org.xerial.snappy.SnappyNative.rawCompress
>>   51  0.09% 76.85%    1165 305661 org.xerial.snappy.SnappyNative.rawCompress
>>   52  0.09% 76.94%    1152 307069 org.xerial.snappy.SnappyNative.rawCompress
>>   53  0.09% 77.02%    1121 305997 org.xerial.snappy.SnappyNative.rawCompress
>>   54  0.09% 77.11%    1117 307035 org.xerial.snappy.SnappyNative.rawCompress
>>   55  0.09% 77.19%    1106 306190 org.xerial.snappy.SnappyNative.rawCompress
>>
>> On Mon, Feb 2, 2015 at 9:39 AM, Jay Kreps <ja...@gmail.com> wrote:
>>
>> > Ah, yeah, you're right. That is just wait time, not CPU time. We should
>> > check that profile; it must be something else on the list.
>> >
>> > -Jay
>> >
>> > On Mon, Feb 2, 2015 at 9:33 AM, Jun Rao <ju...@confluent.io> wrote:
>> >
>> >> Hi, Mathias,
>> >>
>> >> From the hprof output, it seems that the top CPU consumers are
>> >> socketAccept() and epollWait(). As far as I am aware, there hasn't been
>> >> any
>> >> significant changes in the socket server code btw 0.8.1 and 0.8.2.
>> Could
>> >> you run the same hprof test on 0.8.1 so that we can see the difference?
>> >>
>> >> Jaikiran,
>> >>
>> >> The fix you provided is probably unnecessary. The channel that we use
>> in
>> >> SimpleConsumer (BlockingChannel) is configured to be blocking. So even
>> >> though the read from the socket is in a loop, each read blocks if
>> there is
>> >> no bytes received from the broker. So, that shouldn't cause extra CPU
>> >> consumption.
>> >>
>> >> Thanks,
>> >>
>> >> Jun
>> >>
>> >> On Mon, Jan 26, 2015 at 10:05 AM, Mathias Söderberg <
>> >> mathias.soederberg@gmail.com> wrote:
>> >>
>> >> > Hi Neha,
>> >> >
>> >> > I sent an e-mail earlier today, but noticed now that it didn't
>> actually
>> >> go
>> >> > through.
>> >> >
>> >> > Anyhow, I've attached two files, one with output from a 10 minute run
>> >> and
>> >> > one with output from a 30 minute run. Realized that maybe I should've
>> >> done
>> >> > one or two runs with 0.8.1.1 as well, but nevertheless.
>> >> >
>> >> > I upgraded our staging cluster to 0.8.2.0-rc2, and I'm seeing the
>> same
>> >> CPU
>> >> > usage as with the beta version (basically pegging all cores). If I
>> >> manage
>> >> > to find the time I'll do another run with hprof on the rc2 version
>> later
>> >> > today.
>> >> >
>> >> > Best regards,
>> >> > Mathias
>> >> >
>> >> > On Tue Dec 09 2014 at 10:08:21 PM Neha Narkhede <ne...@confluent.io>
>> >> wrote:
>> >> >
>> >> >> The following should be sufficient
>> >> >>
>> >> >> java
>> >> >> -agentlib:hprof=cpu=samples,depth=100,interval=20,lineno=
>> >> >> y,thread=y,file=kafka.hprof
>> >> >> <classname>
>> >> >>
>> >> >> You would need to start the Kafka server with the settings above for
>> >> >> sometime until you observe the problem.
>> >> >>
>> >> >> On Tue, Dec 9, 2014 at 3:47 AM, Mathias Söderberg <
>> >> >> mathias.soederberg@gmail.com> wrote:
>> >> >>
>> >> >> > Hi Neha,
>> >> >> >
>> >> >> > Yeah sure. I'm not familiar with hprof, so any particular options
>> I
>> >> >> should
>> >> >> > include or just run with defaults?
>> >> >> >
>> >> >> > Best regards,
>> >> >> > Mathias
>> >> >> >
>> >> >> > On Mon Dec 08 2014 at 7:41:32 PM Neha Narkhede <neha@confluent.io
>> >
>> >> >> wrote:
>> >> >> >
>> >> >> > > Thanks for reporting the issue. Would you mind running hprof and
>> >> >> sending
>> >> >> > > the output?
>> >> >> > >
>> >> >> > > On Mon, Dec 8, 2014 at 1:25 AM, Mathias Söderberg <
>> >> >> > > mathias.soederberg@gmail.com> wrote:
>> >> >> > >
>> >> >> > > > Good day,
>> >> >> > > >
>> >> >> > > > I upgraded a Kafka cluster from v0.8.1.1 to v0.8.2-beta and
>> >> noticed
>> >> >> > that
>> >> >> > > > the CPU usage on the broker machines went up by roughly 40%,
>> from
>> >> >> ~60%
>> >> >> > to
>> >> >> > > > ~100% and am wondering if anyone else has experienced
>> something
>> >> >> > similar?
>> >> >> > > > The load average also went up by 2x-3x.
>> >> >> > > >
>> >> >> > > > We're running on EC2 and the cluster currently consists of
>> four
>> >> >> > > m1.xlarge,
>> >> >> > > > with roughly 1100 topics / 4000 partitions. Using Java 7
>> >> (1.7.0_65
>> >> >> to
>> >> >> > be
>> >> >> > > > exact) and Scala 2.9.2. Configurations can be found over here:
>> >> >> > > > https://gist.github.com/mthssdrbrg/7df34a795e07eef10262.
>> >> >> > > >
>> >> >> > > > I'm assuming that this is not expected behaviour for
>> 0.8.2-beta?
>> >> >> > > >
>> >> >> > > > Best regards,
>> >> >> > > > Mathias
>> >> >> > > >
>> >> >> > >
>> >> >> > >
>> >> >> > >
>> >> >> > > --
>> >> >> > > Thanks,
>> >> >> > > Neha
>> >> >> > >
>> >> >> >
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Thanks,
>> >> >> Neha
>> >> >>
>> >> >
>> >>
>> >
>> >
>>
>

Re: Increased CPU usage with 0.8.2-beta

Posted by Mathias Söderberg <ma...@gmail.com>.
Hi all,

I ran the same hprof test on 0.8.1.1, and also did a re-run on 0.8.2.0-rc2,
attached logs from both runs. Both runs lasted for 30-40 minutes. The
configurations used can be found over here:
https://gist.github.com/mthssdrbrg/5fcb9fbdb851d8cc66a2. The configuration
used for the first run (on 0.8.2-beta) can be found here:
https://gist.github.com/mthssdrbrg/7df34a795e07eef10262.

The cluster is running on EC2, currently on 4 m1.xlarge, with a ZooKeeper
cluster consisting of 3 nodes. There's 1248 topics, and a total of 4143
partitions, using replication factor 3.

Let me know if there's any other information I can provide.

Best regards,
Mathias
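
For reference, the hprof output attached here was produced with the agent
settings Neha suggests further down the thread. A sketch of how those settings
can be attached to a broker, assuming the standard start scripts (which pass
$KAFKA_OPTS through to the JVM via kafka-run-class.sh):

    export KAFKA_OPTS="-agentlib:hprof=cpu=samples,depth=100,interval=20,lineno=y,thread=y,file=kafka.hprof"
    bin/kafka-server-start.sh config/server.properties
    # let the broker run until the high CPU usage shows up, then stop it;
    # the samples end up in kafka.hprof in the broker's working directory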

On Mon Feb 02 2015 at 6:50:54 PM Jay Kreps <ja...@gmail.com> wrote:

> Looking at that profile:
>
> Misc. blocking socket activity, not actual CPU work:
>    1  5.24%  5.24%   67781 300447 java.net.PlainSocketImpl.socketAccept
>    2  5.24% 10.49%   67775 300515 java.net.PlainSocketImpl.socketAccept
>    3  5.24% 15.73%   67773 300567 java.net.PlainSocketImpl.socketAccept
>    4  5.24% 20.97%   67683 301396 sun.nio.ch.ServerSocketChannelImpl.accept0
>    5  5.23% 26.20%   67582 301395 sun.nio.ch.EPollArrayWrapper.epollWait
>    6  5.23% 31.42%   67529 301519 sun.nio.ch.EPollArrayWrapper.epollWait
>    7  5.13% 36.55%   66297 302447 sun.nio.ch.EPollArrayWrapper.epollWait
>    8  3.95% 40.51%   51071 302446 sun.nio.ch.EPollArrayWrapper.epollWait
>    9  3.65% 44.16%   47234 303479 sun.nio.ch.EPollArrayWrapper.epollWait
>   10  3.64% 47.80%   47019 302444 sun.nio.ch.EPollArrayWrapper.epollWait
>   11  3.62% 51.42%   46768 302445 sun.nio.ch.EPollArrayWrapper.epollWait
>   12  3.27% 54.69%   42237 303475 sun.nio.ch.EPollArrayWrapper.epollWait
>   13  3.16% 57.85%   40892 303476 sun.nio.ch.EPollArrayWrapper.epollWait
>   14  3.14% 60.99%   40556 303478 sun.nio.ch.EPollArrayWrapper.epollWait
>   15  3.05% 64.04%   39428 303480 sun.nio.ch.EPollArrayWrapper.epollWait
>   16  2.68% 66.72%   34673 303477 sun.nio.ch.EPollArrayWrapper.epollWait
>   17  1.23% 67.95%   15867 303520 sun.nio.ch.EPollArrayWrapper.epollWait
>   18  0.98% 68.93%   12663 303541 sun.nio.ch.EPollArrayWrapper.epollWait
>   19  0.92% 69.85%   11920 303536 sun.nio.ch.EPollArrayWrapper.epollWait
>   20  0.85% 70.70%   11015 303546 sun.nio.ch.EPollArrayWrapper.epollWait
>   21  0.82% 71.53%   10625 303534 sun.nio.ch.EPollArrayWrapper.epollWait
>   22  0.69% 72.21%    8866 303512 sun.nio.ch.EPollArrayWrapper.epollWait
>
>   TransferTo to write data to socket, not actual CPU work (mostly)
>   23  0.36% 72.57%    4592 304991 sun.nio.ch.FileDispatcherImpl.write0
>   24  0.36% 72.92%    4589 305021 sun.nio.ch.FileDispatcherImpl.write0
>   25  0.35% 73.27%    4503 304992 sun.nio.ch.FileDispatcherImpl.write0
>   26  0.32% 73.60%    4198 305022 sun.nio.ch.FileDispatcherImpl.write0
>   27  0.25% 73.85%    3250 305246 sun.nio.ch.FileDispatcherImpl.write0
>   28  0.25% 74.10%    3249 305497 sun.nio.ch.FileDispatcherImpl.write0
>
>   Request channel - Actual CPU
>   29  0.22% 74.32%    2862 305000 sun.misc.Unsafe.unpark
>   30  0.17% 74.49%    2163 304838 sun.misc.Unsafe.unpark
>   31  0.14% 74.63%    1795 305240 sun.misc.Unsafe.unpark
>
>   Purgatory - Actual CPU
>   32  0.12% 74.75%    1553 305137 scala.collection.immutable.HashMap.$plus
>   33  0.12% 74.87%    1546 305100 java.util.concurrent.ConcurrentHashMap.get
>   34  0.12% 74.98%    1531 305181 java.util.concurrent.ConcurrentHashMap.get
>   35  0.12% 75.10%    1526 305234 scala.collection.immutable.HashMap.$plus
>   36  0.12% 75.22%    1521 305401 scala.collection.immutable.HashMap.$plus
>   37  0.12% 75.34%    1519 305186 java.util.concurrent.ConcurrentHashMap.get
>   38  0.12% 75.46%    1517 305264 java.util.concurrent.ConcurrentHashMap.get
>   39  0.12% 75.57%    1514 305271 java.util.concurrent.ConcurrentHashMap.get
>   40  0.12% 75.69%    1511 305250 scala.collection.immutable.HashMap.$plus
>   41  0.12% 75.81%    1499 305155 java.util.concurrent.ConcurrentHashMap.get
>   42  0.12% 75.92%    1496 305113 scala.collection.immutable.HashMap.$plus
>   43  0.12% 76.04%    1496 305263 scala.collection.immutable.HashMap.$plus
>   44  0.11% 76.15%    1480 305235 scala.collection.immutable.HashMap.$plus
>   45  0.11% 76.26%    1444 305185 scala.collection.immutable.HashMap.$plus
>   46  0.11% 76.37%    1428 305102 java.util.concurrent.ConcurrentHashMap.get
>   47  0.11% 76.48%    1418 305320 java.util.concurrent.ConcurrentHashMap.get
>
>   Compression - Actual CPU
>   48  0.09% 76.58%    1187 306546 org.xerial.snappy.SnappyNative.rawCompress
>   49  0.09% 76.67%    1174 305584 org.xerial.snappy.SnappyNative.rawCompress
>   50  0.09% 76.76%    1173 305545 org.xerial.snappy.SnappyNative.rawCompress
>   51  0.09% 76.85%    1165 305661 org.xerial.snappy.SnappyNative.rawCompress
>   52  0.09% 76.94%    1152 307069 org.xerial.snappy.SnappyNative.rawCompress
>   53  0.09% 77.02%    1121 305997 org.xerial.snappy.SnappyNative.rawCompress
>   54  0.09% 77.11%    1117 307035 org.xerial.snappy.SnappyNative.rawCompress
>   55  0.09% 77.19%    1106 306190 org.xerial.snappy.SnappyNative.rawCompress
>
> On Mon, Feb 2, 2015 at 9:39 AM, Jay Kreps <ja...@gmail.com> wrote:
>
> > Ah, yeah, you're right. That is just wait time, not CPU time. We should
> > check that profile; it must be something else on the list.
> >
> > -Jay
> >
> > On Mon, Feb 2, 2015 at 9:33 AM, Jun Rao <ju...@confluent.io> wrote:
> >
> >> Hi, Mathias,
> >>
> >> From the hprof output, it seems that the top CPU consumers are
> >> socketAccept() and epollWait(). As far as I am aware, there hasn't been
> >> any
> >> significant changes in the socket server code btw 0.8.1 and 0.8.2. Could
> >> you run the same hprof test on 0.8.1 so that we can see the difference?
> >>
> >> Jaikiran,
> >>
> >> The fix you provided is probably unnecessary. The channel that we use in
> >> SimpleConsumer (BlockingChannel) is configured to be blocking. So even
> >> though the read from the socket is in a loop, each read blocks if there
> is
> >> no bytes received from the broker. So, that shouldn't cause extra CPU
> >> consumption.
> >>
> >> Thanks,
> >>
> >> Jun
> >>
> >> On Mon, Jan 26, 2015 at 10:05 AM, Mathias Söderberg <
> >> mathias.soederberg@gmail.com> wrote:
> >>
> >> > Hi Neha,
> >> >
> >> > I sent an e-mail earlier today, but noticed now that it didn't
> actually
> >> go
> >> > through.
> >> >
> >> > Anyhow, I've attached two files, one with output from a 10 minute run
> >> and
> >> > one with output from a 30 minute run. Realized that maybe I should've
> >> done
> >> > one or two runs with 0.8.1.1 as well, but nevertheless.
> >> >
> >> > I upgraded our staging cluster to 0.8.2.0-rc2, and I'm seeing the same
> >> CPU
> >> > usage as with the beta version (basically pegging all cores). If I
> >> manage
> >> > to find the time I'll do another run with hprof on the rc2 version
> later
> >> > today.
> >> >
> >> > Best regards,
> >> > Mathias
> >> >
> >> > On Tue Dec 09 2014 at 10:08:21 PM Neha Narkhede <ne...@confluent.io>
> >> wrote:
> >> >
> >> >> The following should be sufficient
> >> >>
> >> >> java
> >> >> -agentlib:hprof=cpu=samples,depth=100,interval=20,lineno=
> >> >> y,thread=y,file=kafka.hprof
> >> >> <classname>
> >> >>
> >> >> You would need to start the Kafka server with the settings above for
> >> >> sometime until you observe the problem.
> >> >>
> >> >> On Tue, Dec 9, 2014 at 3:47 AM, Mathias Söderberg <
> >> >> mathias.soederberg@gmail.com> wrote:
> >> >>
> >> >> > Hi Neha,
> >> >> >
> >> >> > Yeah sure. I'm not familiar with hprof, so any particular options I
> >> >> should
> >> >> > include or just run with defaults?
> >> >> >
> >> >> > Best regards,
> >> >> > Mathias
> >> >> >
> >> >> > On Mon Dec 08 2014 at 7:41:32 PM Neha Narkhede <ne...@confluent.io>
> >> >> wrote:
> >> >> >
> >> >> > > Thanks for reporting the issue. Would you mind running hprof and
> >> >> sending
> >> >> > > the output?
> >> >> > >
> >> >> > > On Mon, Dec 8, 2014 at 1:25 AM, Mathias Söderberg <
> >> >> > > mathias.soederberg@gmail.com> wrote:
> >> >> > >
> >> >> > > > Good day,
> >> >> > > >
> >> >> > > > I upgraded a Kafka cluster from v0.8.1.1 to v0.8.2-beta and
> >> noticed
> >> >> > that
> >> >> > > > the CPU usage on the broker machines went up by roughly 40%,
> from
> >> >> ~60%
> >> >> > to
> >> >> > > > ~100% and am wondering if anyone else has experienced something
> >> >> > similar?
> >> >> > > > The load average also went up by 2x-3x.
> >> >> > > >
> >> >> > > > We're running on EC2 and the cluster currently consists of four
> >> >> > > m1.xlarge,
> >> >> > > > with roughly 1100 topics / 4000 partitions. Using Java 7
> >> (1.7.0_65
> >> >> to
> >> >> > be
> >> >> > > > exact) and Scala 2.9.2. Configurations can be found over here:
> >> >> > > > https://gist.github.com/mthssdrbrg/7df34a795e07eef10262.
> >> >> > > >
> >> >> > > > I'm assuming that this is not expected behaviour for
> 0.8.2-beta?
> >> >> > > >
> >> >> > > > Best regards,
> >> >> > > > Mathias
> >> >> > > >
> >> >> > >
> >> >> > >
> >> >> > >
> >> >> > > --
> >> >> > > Thanks,
> >> >> > > Neha
> >> >> > >
> >> >> >
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Thanks,
> >> >> Neha
> >> >>
> >> >
> >>
> >
> >
>

Re: Increased CPU usage with 0.8.2-beta

Posted by Jay Kreps <ja...@gmail.com>.
Looking at that profile:

Misc. blocking socket activity, not actual CPU work:
   1  5.24%  5.24%   67781 300447 java.net.PlainSocketImpl.socketAccept
   2  5.24% 10.49%   67775 300515 java.net.PlainSocketImpl.socketAccept
   3  5.24% 15.73%   67773 300567 java.net.PlainSocketImpl.socketAccept
   4  5.24% 20.97%   67683 301396 sun.nio.ch.ServerSocketChannelImpl.accept0
   5  5.23% 26.20%   67582 301395 sun.nio.ch.EPollArrayWrapper.epollWait
   6  5.23% 31.42%   67529 301519 sun.nio.ch.EPollArrayWrapper.epollWait
   7  5.13% 36.55%   66297 302447 sun.nio.ch.EPollArrayWrapper.epollWait
   8  3.95% 40.51%   51071 302446 sun.nio.ch.EPollArrayWrapper.epollWait
   9  3.65% 44.16%   47234 303479 sun.nio.ch.EPollArrayWrapper.epollWait
  10  3.64% 47.80%   47019 302444 sun.nio.ch.EPollArrayWrapper.epollWait
  11  3.62% 51.42%   46768 302445 sun.nio.ch.EPollArrayWrapper.epollWait
  12  3.27% 54.69%   42237 303475 sun.nio.ch.EPollArrayWrapper.epollWait
  13  3.16% 57.85%   40892 303476 sun.nio.ch.EPollArrayWrapper.epollWait
  14  3.14% 60.99%   40556 303478 sun.nio.ch.EPollArrayWrapper.epollWait
  15  3.05% 64.04%   39428 303480 sun.nio.ch.EPollArrayWrapper.epollWait
  16  2.68% 66.72%   34673 303477 sun.nio.ch.EPollArrayWrapper.epollWait
  17  1.23% 67.95%   15867 303520 sun.nio.ch.EPollArrayWrapper.epollWait
  18  0.98% 68.93%   12663 303541 sun.nio.ch.EPollArrayWrapper.epollWait
  19  0.92% 69.85%   11920 303536 sun.nio.ch.EPollArrayWrapper.epollWait
  20  0.85% 70.70%   11015 303546 sun.nio.ch.EPollArrayWrapper.epollWait
  21  0.82% 71.53%   10625 303534 sun.nio.ch.EPollArrayWrapper.epollWait
  22  0.69% 72.21%    8866 303512 sun.nio.ch.EPollArrayWrapper.epollWait

  TransferTo to write data to socket, not actual CPU work (mostly)
  23  0.36% 72.57%    4592 304991 sun.nio.ch.FileDispatcherImpl.write0
  24  0.36% 72.92%    4589 305021 sun.nio.ch.FileDispatcherImpl.write0
  25  0.35% 73.27%    4503 304992 sun.nio.ch.FileDispatcherImpl.write0
  26  0.32% 73.60%    4198 305022 sun.nio.ch.FileDispatcherImpl.write0
  27  0.25% 73.85%    3250 305246 sun.nio.ch.FileDispatcherImpl.write0
  28  0.25% 74.10%    3249 305497 sun.nio.ch.FileDispatcherImpl.write0

  Request channel - Actual CPU
  29  0.22% 74.32%    2862 305000 sun.misc.Unsafe.unpark
  30  0.17% 74.49%    2163 304838 sun.misc.Unsafe.unpark
  31  0.14% 74.63%    1795 305240 sun.misc.Unsafe.unpark

  Purgatory - Actual CPU
  32  0.12% 74.75%    1553 305137 scala.collection.immutable.HashMap.$plus
  33  0.12% 74.87%    1546 305100 java.util.concurrent.ConcurrentHashMap.get
  34  0.12% 74.98%    1531 305181 java.util.concurrent.ConcurrentHashMap.get
  35  0.12% 75.10%    1526 305234 scala.collection.immutable.HashMap.$plus
  36  0.12% 75.22%    1521 305401 scala.collection.immutable.HashMap.$plus
  37  0.12% 75.34%    1519 305186 java.util.concurrent.ConcurrentHashMap.get
  38  0.12% 75.46%    1517 305264 java.util.concurrent.ConcurrentHashMap.get
  39  0.12% 75.57%    1514 305271 java.util.concurrent.ConcurrentHashMap.get
  40  0.12% 75.69%    1511 305250 scala.collection.immutable.HashMap.$plus
  41  0.12% 75.81%    1499 305155 java.util.concurrent.ConcurrentHashMap.get
  42  0.12% 75.92%    1496 305113 scala.collection.immutable.HashMap.$plus
  43  0.12% 76.04%    1496 305263 scala.collection.immutable.HashMap.$plus
  44  0.11% 76.15%    1480 305235 scala.collection.immutable.HashMap.$plus
  45  0.11% 76.26%    1444 305185 scala.collection.immutable.HashMap.$plus
  46  0.11% 76.37%    1428 305102 java.util.concurrent.ConcurrentHashMap.get
  47  0.11% 76.48%    1418 305320 java.util.concurrent.ConcurrentHashMap.get

  Compression - Actual CPU
  48  0.09% 76.58%    1187 306546 org.xerial.snappy.SnappyNative.rawCompress
  49  0.09% 76.67%    1174 305584 org.xerial.snappy.SnappyNative.rawCompress
  50  0.09% 76.76%    1173 305545 org.xerial.snappy.SnappyNative.rawCompress
  51  0.09% 76.85%    1165 305661 org.xerial.snappy.SnappyNative.rawCompress
  52  0.09% 76.94%    1152 307069 org.xerial.snappy.SnappyNative.rawCompress
  53  0.09% 77.02%    1121 305997 org.xerial.snappy.SnappyNative.rawCompress
  54  0.09% 77.11%    1117 307035 org.xerial.snappy.SnappyNative.rawCompress
  55  0.09% 77.19%    1106 306190 org.xerial.snappy.SnappyNative.rawCompress
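
The six-digit numbers in the fifth column are hprof TRACE ids; the full stack
behind any entry can be pulled out of the dump itself, for example (assuming
the dump file is named kafka.hprof as in the suggested agent settings):

    grep -A 20 "TRACE 305137" kafka.hprof   # stack for the purgatory HashMap.$plus sample above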

On Mon, Feb 2, 2015 at 9:39 AM, Jay Kreps <ja...@gmail.com> wrote:

> Ah, yeah, you're right. That is just wait time, not CPU time. We should
> check that profile; it must be something else on the list.
>
> -Jay
>
> On Mon, Feb 2, 2015 at 9:33 AM, Jun Rao <ju...@confluent.io> wrote:
>
>> Hi, Mathias,
>>
>> From the hprof output, it seems that the top CPU consumers are
>> socketAccept() and epollWait(). As far as I am aware, there hasn't been
>> any
>> significant changes in the socket server code btw 0.8.1 and 0.8.2. Could
>> you run the same hprof test on 0.8.1 so that we can see the difference?
>>
>> Jaikiran,
>>
>> The fix you provided is probably unnecessary. The channel that we use in
>> SimpleConsumer (BlockingChannel) is configured to be blocking. So even
>> though the read from the socket is in a loop, each read blocks if there is
>> no bytes received from the broker. So, that shouldn't cause extra CPU
>> consumption.
>>
>> Thanks,
>>
>> Jun
>>
>> On Mon, Jan 26, 2015 at 10:05 AM, Mathias Söderberg <
>> mathias.soederberg@gmail.com> wrote:
>>
>> > Hi Neha,
>> >
>> > I sent an e-mail earlier today, but noticed now that it didn't actually
>> go
>> > through.
>> >
>> > Anyhow, I've attached two files, one with output from a 10 minute run
>> and
>> > one with output from a 30 minute run. Realized that maybe I should've
>> done
>> > one or two runs with 0.8.1.1 as well, but nevertheless.
>> >
>> > I upgraded our staging cluster to 0.8.2.0-rc2, and I'm seeing the same
>> CPU
>> > usage as with the beta version (basically pegging all cores). If I
>> manage
>> > to find the time I'll do another run with hprof on the rc2 version later
>> > today.
>> >
>> > Best regards,
>> > Mathias
>> >
>> > On Tue Dec 09 2014 at 10:08:21 PM Neha Narkhede <ne...@confluent.io>
>> wrote:
>> >
>> >> The following should be sufficient
>> >>
>> >> java
>> >> -agentlib:hprof=cpu=samples,depth=100,interval=20,lineno=
>> >> y,thread=y,file=kafka.hprof
>> >> <classname>
>> >>
>> >> You would need to start the Kafka server with the settings above for
>> >> sometime until you observe the problem.
>> >>
>> >> On Tue, Dec 9, 2014 at 3:47 AM, Mathias Söderberg <
>> >> mathias.soederberg@gmail.com> wrote:
>> >>
>> >> > Hi Neha,
>> >> >
>> >> > Yeah sure. I'm not familiar with hprof, so any particular options I
>> >> should
>> >> > include or just run with defaults?
>> >> >
>> >> > Best regards,
>> >> > Mathias
>> >> >
>> >> > On Mon Dec 08 2014 at 7:41:32 PM Neha Narkhede <ne...@confluent.io>
>> >> wrote:
>> >> >
>> >> > > Thanks for reporting the issue. Would you mind running hprof and
>> >> sending
>> >> > > the output?
>> >> > >
>> >> > > On Mon, Dec 8, 2014 at 1:25 AM, Mathias Söderberg <
>> >> > > mathias.soederberg@gmail.com> wrote:
>> >> > >
>> >> > > > Good day,
>> >> > > >
>> >> > > > I upgraded a Kafka cluster from v0.8.1.1 to v0.8.2-beta and
>> noticed
>> >> > that
>> >> > > > the CPU usage on the broker machines went up by roughly 40%, from
>> >> ~60%
>> >> > to
>> >> > > > ~100% and am wondering if anyone else has experienced something
>> >> > similar?
>> >> > > > The load average also went up by 2x-3x.
>> >> > > >
>> >> > > > We're running on EC2 and the cluster currently consists of four
>> >> > > m1.xlarge,
>> >> > > > with roughly 1100 topics / 4000 partitions. Using Java 7
>> (1.7.0_65
>> >> to
>> >> > be
>> >> > > > exact) and Scala 2.9.2. Configurations can be found over here:
>> >> > > > https://gist.github.com/mthssdrbrg/7df34a795e07eef10262.
>> >> > > >
>> >> > > > I'm assuming that this is not expected behaviour for 0.8.2-beta?
>> >> > > >
>> >> > > > Best regards,
>> >> > > > Mathias
>> >> > > >
>> >> > >
>> >> > >
>> >> > >
>> >> > > --
>> >> > > Thanks,
>> >> > > Neha
>> >> > >
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Thanks,
>> >> Neha
>> >>
>> >
>>
>
>

Re: Increased CPU usage with 0.8.2-beta

Posted by Jay Kreps <ja...@gmail.com>.
Ah, yeah, you're right. That is just wait time, not CPU time. We should
check that profile; it must be something else on the list.

-Jay

On Mon, Feb 2, 2015 at 9:33 AM, Jun Rao <ju...@confluent.io> wrote:

> Hi, Mathias,
>
> From the hprof output, it seems that the top CPU consumers are
> socketAccept() and epollWait(). As far as I am aware, there hasn't been any
> significant changes in the socket server code btw 0.8.1 and 0.8.2. Could
> you run the same hprof test on 0.8.1 so that we can see the difference?
>
> Jaikiran,
>
> The fix you provided is probably unnecessary. The channel that we use in
> SimpleConsumer (BlockingChannel) is configured to be blocking. So even
> though the read from the socket is in a loop, each read blocks if there is
> no bytes received from the broker. So, that shouldn't cause extra CPU
> consumption.
>
> Thanks,
>
> Jun
>
> On Mon, Jan 26, 2015 at 10:05 AM, Mathias Söderberg <
> mathias.soederberg@gmail.com> wrote:
>
> > Hi Neha,
> >
> > I sent an e-mail earlier today, but noticed now that it didn't actually
> go
> > through.
> >
> > Anyhow, I've attached two files, one with output from a 10 minute run and
> > one with output from a 30 minute run. Realized that maybe I should've
> done
> > one or two runs with 0.8.1.1 as well, but nevertheless.
> >
> > I upgraded our staging cluster to 0.8.2.0-rc2, and I'm seeing the same
> CPU
> > usage as with the beta version (basically pegging all cores). If I manage
> > to find the time I'll do another run with hprof on the rc2 version later
> > today.
> >
> > Best regards,
> > Mathias
> >
> > On Tue Dec 09 2014 at 10:08:21 PM Neha Narkhede <ne...@confluent.io>
> wrote:
> >
> >> The following should be sufficient
> >>
> >> java
> >> -agentlib:hprof=cpu=samples,depth=100,interval=20,lineno=
> >> y,thread=y,file=kafka.hprof
> >> <classname>
> >>
> >> You would need to start the Kafka server with the settings above for
> >> sometime until you observe the problem.
> >>
> >> On Tue, Dec 9, 2014 at 3:47 AM, Mathias Söderberg <
> >> mathias.soederberg@gmail.com> wrote:
> >>
> >> > Hi Neha,
> >> >
> >> > Yeah sure. I'm not familiar with hprof, so any particular options I
> >> should
> >> > include or just run with defaults?
> >> >
> >> > Best regards,
> >> > Mathias
> >> >
> >> > On Mon Dec 08 2014 at 7:41:32 PM Neha Narkhede <ne...@confluent.io>
> >> wrote:
> >> >
> >> > > Thanks for reporting the issue. Would you mind running hprof and
> >> sending
> >> > > the output?
> >> > >
> >> > > On Mon, Dec 8, 2014 at 1:25 AM, Mathias Söderberg <
> >> > > mathias.soederberg@gmail.com> wrote:
> >> > >
> >> > > > Good day,
> >> > > >
> >> > > > I upgraded a Kafka cluster from v0.8.1.1 to v0.8.2-beta and
> noticed
> >> > that
> >> > > > the CPU usage on the broker machines went up by roughly 40%, from
> >> ~60%
> >> > to
> >> > > > ~100% and am wondering if anyone else has experienced something
> >> > similar?
> >> > > > The load average also went up by 2x-3x.
> >> > > >
> >> > > > We're running on EC2 and the cluster currently consists of four
> >> > > m1.xlarge,
> >> > > > with roughly 1100 topics / 4000 partitions. Using Java 7 (1.7.0_65
> >> to
> >> > be
> >> > > > exact) and Scala 2.9.2. Configurations can be found over here:
> >> > > > https://gist.github.com/mthssdrbrg/7df34a795e07eef10262.
> >> > > >
> >> > > > I'm assuming that this is not expected behaviour for 0.8.2-beta?
> >> > > >
> >> > > > Best regards,
> >> > > > Mathias
> >> > > >
> >> > >
> >> > >
> >> > >
> >> > > --
> >> > > Thanks,
> >> > > Neha
> >> > >
> >> >
> >>
> >>
> >>
> >> --
> >> Thanks,
> >> Neha
> >>
> >
>

Re: Increased CPU usage with 0.8.2-beta

Posted by Todd Palino <tp...@gmail.com>.
I'm checking into this on our side. The version we're working on jumping to
right now is not the 0.8.2 release version, but it is significantly ahead
of 0.8.1.1. We've got it deployed on one cluster and I'm making sure it's
balanced right now before I take a look at all the metrics. I'll fill in
more detail once I have it.

-Todd

On Thu, Feb 12, 2015 at 9:51 PM, Jay Kreps <ja...@gmail.com> wrote:

> This is a serious issue, we'll take a look.
>
> -Jay
>
> On Thu, Feb 12, 2015 at 3:19 PM, Solon Gordon <so...@knewton.com> wrote:
>
> > I saw a very similar jump in CPU usage when I tried upgrading from
> 0.8.1.1
> > to 0.8.2.0 today in a test environment. The Kafka cluster there is two
> > m1.larges handling 2,000 partitions across 32 topics. CPU usage rose from
> > 40% into the 150%–190% range, and load average from under 1 to over 4.
> > Downgrading to 0.8.1.1 brought the CPU and load back to the previous
> > values.
> >
> > If there's more info that would be helpful, please let me know.
> >
> > On Thu, Feb 12, 2015 at 4:17 PM, Mathias Söderberg <
> > mathias.soederberg@gmail.com> wrote:
> >
> > > Jun,
> > >
> > > Pardon the radio silence. I booted up a new broker, created a topic
> with
> > > three (3) partitions and replication factor one (1) and used the
> > > *kafka-producer-perf-test.sh
> > > *script to generate load (using messages of roughly the same size as
> > ours).
> > > There was a slight increase in CPU usage (~5-10%) on 0.8.2.0-rc2
> compared
> > > to 0.8.1.1, but that was about it.
> > >
> > > I upgraded our staging cluster to 0.8.2.0 earlier this week or so, and
> > had
> > > to add an additional broker due to increased load after the upgrade
> (note
> > > that the incoming load on the cluster has been pretty much consistent).
> > > Since the upgrade we've been seeing a 2-3x increase in latency as
> well.
> > > I'm considering downgrading to 0.8.1.1 again to see if it resolves our
> > > issues.
> > >
> > > Best regards,
> > > Mathias
> > >
> > > On Tue Feb 03 2015 at 6:44:36 PM Jun Rao <ju...@confluent.io> wrote:
> > >
> > > > Mathias,
> > > >
> > > > The new hprof doesn't reveal anything new to me. We did fix the logic
> > in
> > > > using Purgatory in 0.8.2, which could potentially drive up the CPU
> > usage
> > > a
> > > > bit. To verify that, could you do your test on a single broker (with
> > > > replication factor 1) btw 0.8.1 and 0.8.2 and see if there is any
> > > > significant difference in cpu usage?
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Tue, Feb 3, 2015 at 5:09 AM, Mathias Söderberg <
> > > > mathias.soederberg@gmail.com> wrote:
> > > >
> > > > > Jun,
> > > > >
> > > > > I re-ran the hprof test, for about 30 minutes again, for
> 0.8.2.0-rc2
> > > with
> > > > > the same version of snappy that 0.8.1.1 used. Attached the logs.
> > > > > Unfortunately there wasn't any improvement as the node running
> > > > 0.8.2.0-rc2
> > > > > still had a higher load and CPU usage.
> > > > >
> > > > > Best regards,
> > > > > Mathias
> > > > >
> > > > > On Tue Feb 03 2015 at 4:40:31 AM Jaikiran Pai <
> > > jai.forums2013@gmail.com>
> > > > > wrote:
> > > > >
> > > > >> On Monday 02 February 2015 11:03 PM, Jun Rao wrote:
> > > > >> > Jaikiran,
> > > > >> >
> > > > >> > The fix you provided is probably unnecessary. The channel that
> we
> > > use
> > > > in
> > > > >> > SimpleConsumer (BlockingChannel) is configured to be blocking.
> So
> > > even
> > > > >> > though the read from the socket is in a loop, each read blocks
> if
> > > > there
> > > > >> is
> > > > >> > no bytes received from the broker. So, that shouldn't cause
> extra
> > > CPU
> > > > >> > consumption.
> > > > >> Hi Jun,
> > > > >>
> > > > >> Of course, you are right! I forgot that while reading the thread
> > dump
> > > in
> > > > >> hprof output, one has to be aware that the thread state isn't
> shown
> > > and
> > > > >> the thread need not necessarily be doing any CPU activity.
> > > > >>
> > > > >> -Jaikiran
> > > > >>
> > > > >>
> > > > >> >
> > > > >> > Thanks,
> > > > >> >
> > > > >> > Jun
> > > > >> >
> > > > >> > On Mon, Jan 26, 2015 at 10:05 AM, Mathias Söderberg <
> > > > >> > mathias.soederberg@gmail.com> wrote:
> > > > >> >
> > > > >> >> Hi Neha,
> > > > >> >>
> > > > >> >> I sent an e-mail earlier today, but noticed now that it didn't
> > > > >> actually go
> > > > >> >> through.
> > > > >> >>
> > > > >> >> Anyhow, I've attached two files, one with output from a 10
> minute
> > > run
> > > > >> and
> > > > >> >> one with output from a 30 minute run. Realized that maybe I
> > > should've
> > > > >> done
> > > > >> >> one or two runs with 0.8.1.1 as well, but nevertheless.
> > > > >> >>
> > > > >> >> I upgraded our staging cluster to 0.8.2.0-rc2, and I'm seeing
> the
> > > > same
> > > > >> CPU
> > > > >> >> usage as with the beta version (basically pegging all cores).
> If
> > I
> > > > >> manage
> > > > >> >> to find the time I'll do another run with hprof on the rc2
> > version
> > > > >> later
> > > > >> >> today.
> > > > >> >>
> > > > >> >> Best regards,
> > > > >> >> Mathias
> > > > >> >>
> > > > >> >> On Tue Dec 09 2014 at 10:08:21 PM Neha Narkhede <
> > neha@confluent.io
> > > >
> > > > >> wrote:
> > > > >> >>
> > > > >> >>> The following should be sufficient
> > > > >> >>>
> > > > >> >>> java
> > > > >> >>> -agentlib:hprof=cpu=samples,depth=100,interval=20,lineno=
> > > > >> >>> y,thread=y,file=kafka.hprof
> > > > >> >>> <classname>
> > > > >> >>>
> > > > >> >>> You would need to start the Kafka server with the settings
> above
> > > for
> > > > >> >>> sometime until you observe the problem.
> > > > >> >>>
> > > > >> >>> On Tue, Dec 9, 2014 at 3:47 AM, Mathias Söderberg <
> > > > >> >>> mathias.soederberg@gmail.com> wrote:
> > > > >> >>>
> > > > >> >>>> Hi Neha,
> > > > >> >>>>
> > > > >> >>>> Yeah sure. I'm not familiar with hprof, so any particular
> > > options I
> > > > >> >>> should
> > > > >> >>>> include or just run with defaults?
> > > > >> >>>>
> > > > >> >>>> Best regards,
> > > > >> >>>> Mathias
> > > > >> >>>>
> > > > >> >>>> On Mon Dec 08 2014 at 7:41:32 PM Neha Narkhede <
> > > neha@confluent.io>
> > > > >> >>> wrote:
> > > > >> >>>>> Thanks for reporting the issue. Would you mind running hprof
> > and
> > > > >> >>> sending
> > > > >> >>>>> the output?
> > > > >> >>>>>
> > > > >> >>>>> On Mon, Dec 8, 2014 at 1:25 AM, Mathias Söderberg <
> > > > >> >>>>> mathias.soederberg@gmail.com> wrote:
> > > > >> >>>>>
> > > > >> >>>>>> Good day,
> > > > >> >>>>>>
> > > > >> >>>>>> I upgraded a Kafka cluster from v0.8.1.1 to v0.8.2-beta and
> > > > noticed
> > > > >> >>>> that
> > > > >> >>>>>> the CPU usage on the broker machines went up by roughly
> 40%,
> > > from
> > > > >> >>> ~60%
> > > > >> >>>> to
> > > > >> >>>>>> ~100% and am wondering if anyone else has experienced
> > something
> > > > >> >>>> similar?
> > > > >> >>>>>> The load average also went up by 2x-3x.
> > > > >> >>>>>>
> > > > >> >>>>>> We're running on EC2 and the cluster currently consists of
> > four
> > > > >> >>>>> m1.xlarge,
> > > > >> >>>>>> with roughly 1100 topics / 4000 partitions. Using Java 7
> > > > (1.7.0_65
> > > > >> >>> to
> > > > >> >>>> be
> > > > >> >>>>>> exact) and Scala 2.9.2. Configurations can be found over
> > here:
> > > > >> >>>>>> https://gist.github.com/mthssdrbrg/7df34a795e07eef10262.
> > > > >> >>>>>>
> > > > >> >>>>>> I'm assuming that this is not expected behaviour for
> > > 0.8.2-beta?
> > > > >> >>>>>>
> > > > >> >>>>>> Best regards,
> > > > >> >>>>>> Mathias
> > > > >> >>>>>>
> > > > >> >>>>>
> > > > >> >>>>>
> > > > >> >>>>> --
> > > > >> >>>>> Thanks,
> > > > >> >>>>> Neha
> > > > >> >>>>>
> > > > >> >>>
> > > > >> >>>
> > > > >> >>> --
> > > > >> >>> Thanks,
> > > > >> >>> Neha
> > > > >> >>>
> > > > >>
> > > > >>
> > > >
> > >
> >
>
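
The single-broker, replication-factor-1 comparison Jun asks for in the quoted
thread above can be scripted roughly as follows. This is only a sketch: the
tool names are those shipped with the 0.8.x distributions, and the exact flags
may vary between versions.

    # create a topic with replication factor 1 on a lone test broker
    bin/kafka-topics.sh --create --zookeeper localhost:2181 \
      --topic cpu-test --partitions 3 --replication-factor 1
    # push the same fixed load against an 0.8.1.1 broker and an 0.8.2 broker
    bin/kafka-producer-perf-test.sh --broker-list localhost:9092 --topics cpu-test \
      --messages 1000000 --message-size 1000 --compression-codec 2
    # then compare broker CPU (top, mpstat) between the two runs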

Re: Increased CPU usage with 0.8.2-beta

Posted by Jay Kreps <ja...@gmail.com>.
This is a serious issue, we'll take a look.

-Jay

On Thu, Feb 12, 2015 at 3:19 PM, Solon Gordon <so...@knewton.com> wrote:

> I saw a very similar jump in CPU usage when I tried upgrading from 0.8.1.1
> to 0.8.2.0 today in a test environment. The Kafka cluster there is two
> m1.larges handling 2,000 partitions across 32 topics. CPU usage rose from
> 40% into the 150%–190% range, and load average from under 1 to over 4.
> Downgrading to 0.8.1.1 brought the CPU and load back to the previous
> values.
>
> If there's more info that would be helpful, please let me know.
>
> On Thu, Feb 12, 2015 at 4:17 PM, Mathias Söderberg <
> mathias.soederberg@gmail.com> wrote:
>
> > Jun,
> >
> > Pardon the radio silence. I booted up a new broker, created a topic with
> > three (3) partitions and replication factor one (1) and used the
> > *kafka-producer-perf-test.sh
> > *script to generate load (using messages of roughly the same size as
> ours).
> > There was a slight increase in CPU usage (~5-10%) on 0.8.2.0-rc2 compared
> > to 0.8.1.1, but that was about it.
> >
> > I upgraded our staging cluster to 0.8.2.0 earlier this week or so, and
> had
> > to add an additional broker due to increased load after the upgrade (note
> > that the incoming load on the cluster has been pretty much consistent).
> > Since the upgrade we've been seeing a 2-3x increase in latency as well.
> > I'm considering downgrading to 0.8.1.1 again to see if it resolves our
> > issues.
> >
> > Best regards,
> > Mathias
> >
> > On Tue Feb 03 2015 at 6:44:36 PM Jun Rao <ju...@confluent.io> wrote:
> >
> > > Mathias,
> > >
> > > The new hprof doesn't reveal anything new to me. We did fix the logic
> in
> > > using Purgatory in 0.8.2, which could potentially drive up the CPU
> usage
> > a
> > > bit. To verify that, could you do your test on a single broker (with
> > > replication factor 1) btw 0.8.1 and 0.8.2 and see if there is any
> > > significant difference in cpu usage?
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Tue, Feb 3, 2015 at 5:09 AM, Mathias Söderberg <
> > > mathias.soederberg@gmail.com> wrote:
> > >
> > > > Jun,
> > > >
> > > > I re-ran the hprof test, for about 30 minutes again, for 0.8.2.0-rc2
> > with
> > > > the same version of snappy that 0.8.1.1 used. Attached the logs.
> > > > Unfortunately there wasn't any improvement as the node running
> > > 0.8.2.0-rc2
> > > > still had a higher load and CPU usage.
> > > >
> > > > Best regards,
> > > > Mathias
> > > >
> > > > On Tue Feb 03 2015 at 4:40:31 AM Jaikiran Pai <
> > jai.forums2013@gmail.com>
> > > > wrote:
> > > >
> > > >> On Monday 02 February 2015 11:03 PM, Jun Rao wrote:
> > > >> > Jaikiran,
> > > >> >
> > > >> > The fix you provided is probably unnecessary. The channel that we
> > use
> > > in
> > > >> > SimpleConsumer (BlockingChannel) is configured to be blocking. So
> > even
> > > >> > though the read from the socket is in a loop, each read blocks if
> > > there
> > > >> is
> > > >> > no bytes received from the broker. So, that shouldn't cause extra
> > CPU
> > > >> > consumption.
> > > >> Hi Jun,
> > > >>
> > > >> Of course, you are right! I forgot that while reading the thread
> dump
> > in
> > > >> hprof output, one has to be aware that the thread state isn't shown
> > and
> > > >> the thread need not necessarily be doing any CPU activity.
> > > >>
> > > >> -Jaikiran
> > > >>
> > > >>
> > > >> >
> > > >> > Thanks,
> > > >> >
> > > >> > Jun
> > > >> >
> > > >> > On Mon, Jan 26, 2015 at 10:05 AM, Mathias Söderberg <
> > > >> > mathias.soederberg@gmail.com> wrote:
> > > >> >
> > > >> >> Hi Neha,
> > > >> >>
> > > >> >> I sent an e-mail earlier today, but noticed now that it didn't
> > > >> actually go
> > > >> >> through.
> > > >> >>
> > > >> >> Anyhow, I've attached two files, one with output from a 10 minute
> > run
> > > >> and
> > > >> >> one with output from a 30 minute run. Realized that maybe I
> > should've
> > > >> done
> > > >> >> one or two runs with 0.8.1.1 as well, but nevertheless.
> > > >> >>
> > > >> >> I upgraded our staging cluster to 0.8.2.0-rc2, and I'm seeing the
> > > same
> > > >> CPU
> > > >> >> usage as with the beta version (basically pegging all cores). If
> I
> > > >> manage
> > > >> >> to find the time I'll do another run with hprof on the rc2
> version
> > > >> later
> > > >> >> today.
> > > >> >>
> > > >> >> Best regards,
> > > >> >> Mathias
> > > >> >>
> > > >> >> On Tue Dec 09 2014 at 10:08:21 PM Neha Narkhede <
> neha@confluent.io
> > >
> > > >> wrote:
> > > >> >>
> > > >> >>> The following should be sufficient
> > > >> >>>
> > > >> >>> java
> > > >> >>> -agentlib:hprof=cpu=samples,depth=100,interval=20,lineno=
> > > >> >>> y,thread=y,file=kafka.hprof
> > > >> >>> <classname>
> > > >> >>>
> > > >> >>> You would need to start the Kafka server with the settings above
> > for
> > > >> >>> sometime until you observe the problem.
> > > >> >>>
> > > >> >>> On Tue, Dec 9, 2014 at 3:47 AM, Mathias Söderberg <
> > > >> >>> mathias.soederberg@gmail.com> wrote:
> > > >> >>>
> > > >> >>>> Hi Neha,
> > > >> >>>>
> > > >> >>>> Yeah sure. I'm not familiar with hprof, so any particular
> > options I
> > > >> >>> should
> > > >> >>>> include or just run with defaults?
> > > >> >>>>
> > > >> >>>> Best regards,
> > > >> >>>> Mathias
> > > >> >>>>
> > > >> >>>> On Mon Dec 08 2014 at 7:41:32 PM Neha Narkhede <
> > neha@confluent.io>
> > > >> >>> wrote:
> > > >> >>>>> Thanks for reporting the issue. Would you mind running hprof
> and
> > > >> >>> sending
> > > >> >>>>> the output?
> > > >> >>>>>
> > > >> >>>>> On Mon, Dec 8, 2014 at 1:25 AM, Mathias Söderberg <
> > > >> >>>>> mathias.soederberg@gmail.com> wrote:
> > > >> >>>>>
> > > >> >>>>>> Good day,
> > > >> >>>>>>
> > > >> >>>>>> I upgraded a Kafka cluster from v0.8.1.1 to v0.8.2-beta and
> > > noticed
> > > >> >>>> that
> > > >> >>>>>> the CPU usage on the broker machines went up by roughly 40%,
> > from
> > > >> >>> ~60%
> > > >> >>>> to
> > > >> >>>>>> ~100% and am wondering if anyone else has experienced
> something
> > > >> >>>> similar?
> > > >> >>>>>> The load average also went up by 2x-3x.
> > > >> >>>>>>
> > > >> >>>>>> We're running on EC2 and the cluster currently consists of
> four
> > > >> >>>>> m1.xlarge,
> > > >> >>>>>> with roughly 1100 topics / 4000 partitions. Using Java 7
> > > (1.7.0_65
> > > >> >>> to
> > > >> >>>> be
> > > >> >>>>>> exact) and Scala 2.9.2. Configurations can be found over
> here:
> > > >> >>>>>> https://gist.github.com/mthssdrbrg/7df34a795e07eef10262.
> > > >> >>>>>>
> > > >> >>>>>> I'm assuming that this is not expected behaviour for
> > 0.8.2-beta?
> > > >> >>>>>>
> > > >> >>>>>> Best regards,
> > > >> >>>>>> Mathias
> > > >> >>>>>>
> > > >> >>>>>
> > > >> >>>>>
> > > >> >>>>> --
> > > >> >>>>> Thanks,
> > > >> >>>>> Neha
> > > >> >>>>>
> > > >> >>>
> > > >> >>>
> > > >> >>> --
> > > >> >>> Thanks,
> > > >> >>> Neha
> > > >> >>>
> > > >>
> > > >>
> > >
> >
>

Re: Increased CPU usage with 0.8.2-beta

Posted by Mathias Söderberg <ma...@gmail.com>.
We've been running with the new patch since yesterday, and everything seems
to be working just fine (comparable to the previous patch).

Haven't looked into the memory consumption that much though; it doesn't look
like it varies too much between the patches.
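
A minimal way to spot-check broker heap usage between the two builds, for
reference (this assumes a single broker JVM per host; the sampling interval
and sample count are arbitrary):

# the broker's main class is kafka.Kafka, so find its JVM that way
BROKER_PID=$(pgrep -f kafka.Kafka)

# heap/GC occupancy summary, sampled every 5000 ms, 12 samples
jstat -gcutil $BROKER_PID 5000 12

# one-shot heap configuration and usage summary
jmap -heap $BROKER_PID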

On Mon Feb 16 2015 at 7:31:34 PM Solon Gordon <so...@knewton.com> wrote:

> I tested the new patch out and am seeing comparable CPU usage to the
> previous patch. As far as I can see, heap usage is also comparable between
> the two patches, though I will say that both look significantly better than
> 0.8.1.1 (~250MB vs. ~1GB).
>
> I'll report back if any new issues come up as I start adding more
> producer/consumer load.
>
> On Sun, Feb 15, 2015 at 6:38 PM, Jun Rao <ju...@confluent.io> wrote:
>
> > Solon, Mathias,
> >
> > Thanks for testing this out. I just uploaded a slightly modified patch in
> > https://issues.apache.org/jira/browse/KAFKA-1952. The new patch may not
> > improve the latency and CPU usage further, but will potentially improve
> > memory consumption. It would be great if you guys can test the new patch
> > out.
> >
> > Thanks,
> >
> > Jun
> >
> > On Sat, Feb 14, 2015 at 9:08 AM, Mathias Söderberg <
> > mathias.soederberg@gmail.com> wrote:
> >
> > > Jun,
> > >
> > > I updated our brokers earlier today with the mentioned patch. A week
> ago
> > > our brokers used ~380% CPU (out of 400%) quite consistently, and now
> > > they're varying between 250-325% (probably running a bit high right now
> > as
> > > we have some consumers catching up quite some lag), so there's
> definitely
> > > an improvement. The producer latency is still a bit higher than with
> > > 0.8.1.1, but I've been playing a bit with broker configuration as well
> as
> > > producer configuration lately so that probably plays in a bit.
> > >
> > > I'll keep an eye on our metrics, and am going to mess around a bit with
> > > configuration. Right now our traffic load is quite low, so it'd be
> > > interesting to see how this works over the next few days. With that
> said,
> > > we're at the same levels of CPU usage as with 0.8.1.1 (though with an
> > > additional broker), so everything looks pretty great.
> > >
> > > We're using acks = "all" (-1) by the way.
> > >
> > > Best regards,
> > > Mathias
> > >
> > > On Sat Feb 14 2015 at 4:40:31 AM Solon Gordon <so...@knewton.com>
> wrote:
> > >
> > > > Thanks for the fast response. I did a quick test and initial results
> > look
> > > > promising. When I swapped in the patched version, CPU usage dropped
> > from
> > > > ~150% to ~65%. Still a bit higher than what I see with 0.8.1.1 but
> much
> > > > more reasonable.
> > > >
> > > > I'll do more testing on Monday but wanted to get you some quick
> > feedback.
> > > > Hopefully Mathias will have good results as well.
> > > >
> > > > On Fri, Feb 13, 2015 at 9:14 PM, Jun Rao <ju...@confluent.io> wrote:
> > > >
> > > > > Mathias, Solon,
> > > > >
> > > > > We did identify a CPU issue and patched it in
> > > > > https://issues.apache.org/jira/browse/KAFKA-1952. Could you apply
> > the
> > > > > patch
> > > > > in the 0.8.2 branch and see if that addresses the issue?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > > On Fri, Feb 13, 2015 at 3:26 PM, Jay Kreps <ja...@gmail.com>
> > > wrote:
> > > > >
> > > > > > We can reproduce this issue, have a theory as to the cause, and
> are
> > > > > working
> > > > > > on a fix. Here is the ticket to track it:
> > > > > > https://issues.apache.org/jira/browse/KAFKA-1952
> > > > > >
> > > > > > I would recommend people hold off on 0.8.2 upgrades until we
> have a
> > > > > handle
> > > > > > on this.
> > > > > >
> > > > > > -Jay
> > > > > >
> > > > > > On Fri, Feb 13, 2015 at 1:47 PM, Solon Gordon <solon@knewton.com
> >
> > > > wrote:
> > > > > >
> > > > > > > The partitions nearly all have replication factor 2 (a few
> stray
> > > ones
> > > > > > have
> > > > > > > 1), and our producers use request.required.acks=-1. However, I
> > > should
> > > > > > note
> > > > > > > there were hardly any messages being produced when I did the
> > > upgrade
> > > > > and
> > > > > > > observed the high CPU load.
> > > > > > >
> > > > > > > I should have time to do some profiling on Monday and will get
> > back
> > > > to
> > > > > > you
> > > > > > > with the results.
> > > > > > >
> > > > > > > On Fri, Feb 13, 2015 at 1:00 PM, Jun Rao <ju...@confluent.io>
> > wrote:
> > > > > > >
> > > > > > > > Solon,
> > > > > > > >
> > > > > > > > What's the replication factor you used for those partitions?
> > > What's
> > > > > the
> > > > > > > > producer ack that you used? Also, could you do a bit of
> > profiling
> > > > on
> > > > > > the
> > > > > > > > broker to see which methods used the most CPU?
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > >
> > > > > > > > Jun
> > > > > > > >
> > > > > > > > On Thu, Feb 12, 2015 at 3:19 PM, Solon Gordon <
> > solon@knewton.com
> > > >
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > I saw a very similar jump in CPU usage when I tried
> upgrading
> > > > from
> > > > > > > > 0.8.1.1
> > > > > > > > > to 0.8.2.0 today in a test environment. The Kafka cluster
> > there
> > > > is
> > > > > > two
> > > > > > > > > m1.larges handling 2,000 partitions across 32 topics. CPU
> > usage
> > > > > rose
> > > > > > > from
> > > > > > > > > 40% into the 150%–190% range, and load average from under 1
> > to
> > > > over
> > > > > > 4.
> > > > > > > > > Downgrading to 0.8.1.1 brought the CPU and load back to the
> > > > > previous
> > > > > > > > > values.
> > > > > > > > >
> > > > > > > > > If there's more info that would be helpful, please let me
> > know.
> > > > > > > > >
> > > > > > > > > On Thu, Feb 12, 2015 at 4:17 PM, Mathias Söderberg <
> > > > > > > > > mathias.soederberg@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Jun,
> > > > > > > > > >
> > > > > > > > > > Pardon the radio silence. I booted up a new broker,
> > created a
> > > > > topic
> > > > > > > > with
> > > > > > > > > > three (3) partitions and replication factor one (1) and
> > used
> > > > the
> > > > > > > > > > *kafka-producer-perf-test.sh
> > > > > > > > > > *script to generate load (using messages of roughly the
> > same
> > > > size
> > > > > > as
> > > > > > > > > ours).
> > > > > > > > > > There was a slight increase in CPU usage (~5-10%) on
> > > > 0.8.2.0-rc2
> > > > > > > > compared
> > > > > > > > > > to 0.8.1.1, but that was about it.
> > > > > > > > > >
> > > > > > > > > > I upgraded our staging cluster to 0.8.2.0 earlier this
> week
> > > or
> > > > > so,
> > > > > > > and
> > > > > > > > > had
> > > > > > > > > > to add an additional broker due to increased load after
> the
> > > > > upgrade
> > > > > > > > (note
> > > > > > > > > > that the incoming load on the cluster has been pretty
> much
> > > > > > > consistent).
> > > > > > > > > > Since the upgrade we've been seeing an 2-3x increase in
> > > latency
> > > > > as
> > > > > > > > well.
> > > > > > > > > > I'm considering downgrading to 0.8.1.1 again to see if it
> > > > > resolves
> > > > > > > our
> > > > > > > > > > issues.
> > > > > > > > > >
> > > > > > > > > > Best regards,
> > > > > > > > > > Mathias
> > > > > > > > > >
> > > > > > > > > > On Tue Feb 03 2015 at 6:44:36 PM Jun Rao <
> jun@confluent.io
> > >
> > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Mathias,
> > > > > > > > > > >
> > > > > > > > > > > The new hprof doesn't reveal anything new to me. We did
> > fix
> > > > the
> > > > > > > logic
> > > > > > > > > in
> > > > > > > > > > > using Purgatory in 0.8.2, which could potentially drive
> > up
> > > > the
> > > > > > CPU
> > > > > > > > > usage
> > > > > > > > > > a
> > > > > > > > > > > bit. To verify that, could you do your test on a single
> > > > broker
> > > > > > > (with
> > > > > > > > > > > replication factor 1) btw 0.8.1 and 0.8.2 and see if
> > there
> > > is
> > > > > any
> > > > > > > > > > > significant difference in cpu usage?
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > >
> > > > > > > > > > > Jun
> > > > > > > > > > >
> > > > > > > > > > > On Tue, Feb 3, 2015 at 5:09 AM, Mathias Söderberg <
> > > > > > > > > > > mathias.soederberg@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Jun,
> > > > > > > > > > > >
> > > > > > > > > > > > I re-ran the hprof test, for about 30 minutes again,
> > for
> > > > > > > > 0.8.2.0-rc2
> > > > > > > > > > with
> > > > > > > > > > > > the same version of snappy that 0.8.1.1 used.
> Attached
> > > the
> > > > > > logs.
> > > > > > > > > > > > Unfortunately there wasn't any improvement as the
> node
> > > > > running
> > > > > > > > > > > 0.8.2.0-rc2
> > > > > > > > > > > > still had a higher load and CPU usage.
> > > > > > > > > > > >
> > > > > > > > > > > > Best regards,
> > > > > > > > > > > > Mathias
> > > > > > > > > > > >
> > > > > > > > > > > > On Tue Feb 03 2015 at 4:40:31 AM Jaikiran Pai <
> > > > > > > > > > jai.forums2013@gmail.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > >> On Monday 02 February 2015 11:03 PM, Jun Rao wrote:
> > > > > > > > > > > >> > Jaikiran,
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > The fix you provided in probably unnecessary. The
> > > > channel
> > > > > > that
> > > > > > > > we
> > > > > > > > > > use
> > > > > > > > > > > in
> > > > > > > > > > > >> > SimpleConsumer (BlockingChannel) is configured to
> be
> > > > > > blocking.
> > > > > > > > So
> > > > > > > > > > even
> > > > > > > > > > > >> > though the read from the socket is in a loop, each
> > > read
> > > > > > blocks
> > > > > > > > if
> > > > > > > > > > > there
> > > > > > > > > > > >> is
> > > > > > > > > > > >> > no bytes received from the broker. So, that
> > shouldn't
> > > > > cause
> > > > > > > > extra
> > > > > > > > > > CPU
> > > > > > > > > > > >> > consumption.
> > > > > > > > > > > >> Hi Jun,
> > > > > > > > > > > >>
> > > > > > > > > > > >> Of course, you are right! I forgot that while
> reading
> > > the
> > > > > > thread
> > > > > > > > > dump
> > > > > > > > > > in
> > > > > > > > > > > >> hprof output, one has to be aware that the thread
> > state
> > > > > isn't
> > > > > > > > shown
> > > > > > > > > > and
> > > > > > > > > > > >> the thread need not necessarily be doing any CPU
> > > activity.
> > > > > > > > > > > >>
> > > > > > > > > > > >> -Jaikiran
> > > > > > > > > > > >>
> > > > > > > > > > > >>
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > Thanks,
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > Jun
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > On Mon, Jan 26, 2015 at 10:05 AM, Mathias
> Söderberg
> > <
> > > > > > > > > > > >> > mathias.soederberg@gmail.com> wrote:
> > > > > > > > > > > >> >
> > > > > > > > > > > >> >> Hi Neha,
> > > > > > > > > > > >> >>
> > > > > > > > > > > >> >> I sent an e-mail earlier today, but noticed now
> > that
> > > it
> > > > > > > didn't
> > > > > > > > > > > >> actually go
> > > > > > > > > > > >> >> through.
> > > > > > > > > > > >> >>
> > > > > > > > > > > >> >> Anyhow, I've attached two files, one with output
> > > from a
> > > > > 10
> > > > > > > > minute
> > > > > > > > > > run
> > > > > > > > > > > >> and
> > > > > > > > > > > >> >> one with output from a 30 minute run. Realized
> that
> > > > > maybe I
> > > > > > > > > > should've
> > > > > > > > > > > >> done
> > > > > > > > > > > >> >> one or two runs with 0.8.1.1 as well, but
> > > nevertheless.
> > > > > > > > > > > >> >>
> > > > > > > > > > > >> >> I upgraded our staging cluster to 0.8.2.0-rc2,
> and
> > > I'm
> > > > > > seeing
> > > > > > > > the
> > > > > > > > > > > same
> > > > > > > > > > > >> CPU
> > > > > > > > > > > >> >> usage as with the beta version (basically pegging
> > all
> > > > > > cores).
> > > > > > > > If
> > > > > > > > > I
> > > > > > > > > > > >> manage
> > > > > > > > > > > >> >> to find the time I'll do another run with hprof
> on
> > > the
> > > > > rc2
> > > > > > > > > version
> > > > > > > > > > > >> later
> > > > > > > > > > > >> >> today.
> > > > > > > > > > > >> >>
> > > > > > > > > > > >> >> Best regards,
> > > > > > > > > > > >> >> Mathias
> > > > > > > > > > > >> >>
> > > > > > > > > > > >> >> On Tue Dec 09 2014 at 10:08:21 PM Neha Narkhede <
> > > > > > > > > neha@confluent.io
> > > > > > > > > > >
> > > > > > > > > > > >> wrote:
> > > > > > > > > > > >> >>
> > > > > > > > > > > >> >>> The following should be sufficient
> > > > > > > > > > > >> >>>
> > > > > > > > > > > >> >>> java
> > > > > > > > > > > >> >>>
> > > > > -agentlib:hprof=cpu=samples,depth=100,interval=20,lineno=
> > > > > > > > > > > >> >>> y,thread=y,file=kafka.hprof
> > > > > > > > > > > >> >>> <classname>
> > > > > > > > > > > >> >>>
> > > > > > > > > > > >> >>> You would need to start the Kafka server with
> the
> > > > > settings
> > > > > > > > above
> > > > > > > > > > for
> > > > > > > > > > > >> >>> sometime until you observe the problem.
> > > > > > > > > > > >> >>>
> > > > > > > > > > > >> >>> On Tue, Dec 9, 2014 at 3:47 AM, Mathias
> Söderberg
> > <
> > > > > > > > > > > >> >>> mathias.soederberg@gmail.com> wrote:
> > > > > > > > > > > >> >>>
> > > > > > > > > > > >> >>>> Hi Neha,
> > > > > > > > > > > >> >>>>
> > > > > > > > > > > >> >>>> Yeah sure. I'm not familiar with hprof, so any
> > > > > particular
> > > > > > > > > > options I
> > > > > > > > > > > >> >>> should
> > > > > > > > > > > >> >>>> include or just run with defaults?
> > > > > > > > > > > >> >>>>
> > > > > > > > > > > >> >>>> Best regards,
> > > > > > > > > > > >> >>>> Mathias
> > > > > > > > > > > >> >>>>
> > > > > > > > > > > >> >>>> On Mon Dec 08 2014 at 7:41:32 PM Neha Narkhede
> <
> > > > > > > > > > neha@confluent.io>
> > > > > > > > > > > >> >>> wrote:
> > > > > > > > > > > >> >>>>> Thanks for reporting the issue. Would you mind
> > > > running
> > > > > > > hprof
> > > > > > > > > and
> > > > > > > > > > > >> >>> sending
> > > > > > > > > > > >> >>>>> the output?
> > > > > > > > > > > >> >>>>>
> > > > > > > > > > > >> >>>>> On Mon, Dec 8, 2014 at 1:25 AM, Mathias
> > Söderberg
> > > <
> > > > > > > > > > > >> >>>>> mathias.soederberg@gmail.com> wrote:
> > > > > > > > > > > >> >>>>>
> > > > > > > > > > > >> >>>>>> Good day,
> > > > > > > > > > > >> >>>>>>
> > > > > > > > > > > >> >>>>>> I upgraded a Kafka cluster from v0.8.1.1 to
> > > > > v0.8.2-beta
> > > > > > > and
> > > > > > > > > > > noticed
> > > > > > > > > > > >> >>>> that
> > > > > > > > > > > >> >>>>>> the CPU usage on the broker machines went up
> by
> > > > > roughly
> > > > > > > > 40%,
> > > > > > > > > > from
> > > > > > > > > > > >> >>> ~60%
> > > > > > > > > > > >> >>>> to
> > > > > > > > > > > >> >>>>>> ~100% and am wondering if anyone else has
> > > > experienced
> > > > > > > > > something
> > > > > > > > > > > >> >>>> similar?
> > > > > > > > > > > >> >>>>>> The load average also went up by 2x-3x.
> > > > > > > > > > > >> >>>>>>
> > > > > > > > > > > >> >>>>>> We're running on EC2 and the cluster
> currently
> > > > > consists
> > > > > > > of
> > > > > > > > > four
> > > > > > > > > > > >> >>>>> m1.xlarge,
> > > > > > > > > > > >> >>>>>> with roughly 1100 topics / 4000 partitions.
> > Using
> > > > > Java
> > > > > > 7
> > > > > > > > > > > (1.7.0_65
> > > > > > > > > > > >> >>> to
> > > > > > > > > > > >> >>>> be
> > > > > > > > > > > >> >>>>>> exact) and Scala 2.9.2. Configurations can be
> > > found
> > > > > > over
> > > > > > > > > here:
> > > > > > > > > > > >> >>>>>>
> > > > > > https://gist.github.com/mthssdrbrg/7df34a795e07eef10262.
> > > > > > > > > > > >> >>>>>>
> > > > > > > > > > > >> >>>>>> I'm assuming that this is not expected
> > behaviour
> > > > for
> > > > > > > > > > 0.8.2-beta?
> > > > > > > > > > > >> >>>>>>
> > > > > > > > > > > >> >>>>>> Best regards,
> > > > > > > > > > > >> >>>>>> Mathias
> > > > > > > > > > > >> >>>>>>
> > > > > > > > > > > >> >>>>>
> > > > > > > > > > > >> >>>>>
> > > > > > > > > > > >> >>>>> --
> > > > > > > > > > > >> >>>>> Thanks,
> > > > > > > > > > > >> >>>>> Neha
> > > > > > > > > > > >> >>>>>
> > > > > > > > > > > >> >>>
> > > > > > > > > > > >> >>>
> > > > > > > > > > > >> >>> --
> > > > > > > > > > > >> >>> Thanks,
> > > > > > > > > > > >> >>> Neha
> > > > > > > > > > > >> >>>
> > > > > > > > > > > >>
> > > > > > > > > > > >>
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Increased CPU usage with 0.8.2-beta

Posted by Solon Gordon <so...@knewton.com>.
I tested the new patch out and am seeing comparable CPU usage to the
previous patch. As far as I can see, heap usage is also comparable between
the two patches, though I will say that both look significantly better than
0.8.1.1 (~250MB vs. ~1GB).

I'll report back if any new issues come up as I start adding more
producer/consumer load.
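
For generating that kind of load, the perf scripts that ship with Kafka are
one option, roughly along these lines (topic name, hosts and message counts
are placeholders, and the exact flags differ a bit between versions, so run
the scripts without arguments to see the options in your build):

# produce ~1M 500-byte messages against a test topic
bin/kafka-producer-perf-test.sh --broker-list broker1:9092 \
  --topics perf-test --messages 1000000 --message-size 500

# drain the same topic to exercise the fetch path as well
bin/kafka-consumer-perf-test.sh --zookeeper zk1:2181 \
  --topic perf-test --messages 1000000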

On Sun, Feb 15, 2015 at 6:38 PM, Jun Rao <ju...@confluent.io> wrote:

> Solon, Mathias,
>
> Thanks for testing this out. I just uploaded a slightly modified patch in
> https://issues.apache.org/jira/browse/KAFKA-1952. The new patch may not
> improve the latency and CPU usage further, but will potentially improve
> memory consumption. It would be great if you guys can test the new patch
> out.
>
> Thanks,
>
> Jun
>
> On Sat, Feb 14, 2015 at 9:08 AM, Mathias Söderberg <
> mathias.soederberg@gmail.com> wrote:
>
> > Jun,
> >
> > I updated our brokers earlier today with the mentioned patch. A week ago
> > our brokers used ~380% CPU (out of 400%) quite consistently, and now
> > they're varying between 250-325% (probably running a bit high right now
> as
> > we have some consumers catching up quite some lag), so there's definitely
> > an improvement. The producer latency is still a bit higher than with
> > 0.8.1.1, but I've been playing a bit with broker configuration as well as
> > producer configuration lately so that probably plays in a bit.
> >
> > I'll keep an eye on our metrics, and am going to mess around a bit with
> > configuration. Right now our traffic load is quite low, so it'd be
> > interesting to see how this works over the next few days. With that said,
> > we're at the same levels of CPU usage as with 0.8.1.1 (though with an
> > additional broker), so everything looks pretty great.
> >
> > We're using acks = "all" (-1) by the way.
> >
> > Best regards,
> > Mathias
> >
> > On Sat Feb 14 2015 at 4:40:31 AM Solon Gordon <so...@knewton.com> wrote:
> >
> > > Thanks for the fast response. I did a quick test and initial results
> look
> > > promising. When I swapped in the patched version, CPU usage dropped
> from
> > > ~150% to ~65%. Still a bit higher than what I see with 0.8.1.1 but much
> > > more reasonable.
> > >
> > > I'll do more testing on Monday but wanted to get you some quick
> feedback.
> > > Hopefully Mathias will have good results as well.
> > >
> > > On Fri, Feb 13, 2015 at 9:14 PM, Jun Rao <ju...@confluent.io> wrote:
> > >
> > > > Mathias, Solon,
> > > >
> > > > We did identify a CPU issue and patched it in
> > > > https://issues.apache.org/jira/browse/KAFKA-1952. Could you apply
> the
> > > > patch
> > > > in the 0.8.2 branch and see if that addresses the issue?
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Fri, Feb 13, 2015 at 3:26 PM, Jay Kreps <ja...@gmail.com>
> > wrote:
> > > >
> > > > > We can reproduce this issue, have a theory as to the cause, and are
> > > > working
> > > > > on a fix. Here is the ticket to track it:
> > > > > https://issues.apache.org/jira/browse/KAFKA-1952
> > > > >
> > > > > I would recommend people hold off on 0.8.2 upgrades until we have a
> > > > handle
> > > > > on this.
> > > > >
> > > > > -Jay
> > > > >
> > > > > On Fri, Feb 13, 2015 at 1:47 PM, Solon Gordon <so...@knewton.com>
> > > wrote:
> > > > >
> > > > > > The partitions nearly all have replication factor 2 (a few stray
> > ones
> > > > > have
> > > > > > 1), and our producers use request.required.acks=-1. However, I
> > should
> > > > > note
> > > > > > there were hardly any messages being produced when I did the
> > upgrade
> > > > and
> > > > > > observed the high CPU load.
> > > > > >
> > > > > > I should have time to do some profiling on Monday and will get
> back
> > > to
> > > > > you
> > > > > > with the results.
> > > > > >
> > > > > > On Fri, Feb 13, 2015 at 1:00 PM, Jun Rao <ju...@confluent.io>
> wrote:
> > > > > >
> > > > > > > Solon,
> > > > > > >
> > > > > > > What's the replication factor you used for those partitions?
> > What's
> > > > the
> > > > > > > producer ack that you used? Also, could you do a bit of
> profiling
> > > on
> > > > > the
> > > > > > > broker to see which methods used the most CPU?
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Jun
> > > > > > >
> > > > > > > On Thu, Feb 12, 2015 at 3:19 PM, Solon Gordon <
> solon@knewton.com
> > >
> > > > > wrote:
> > > > > > >
> > > > > > > > I saw a very similar jump in CPU usage when I tried upgrading
> > > from
> > > > > > > 0.8.1.1
> > > > > > > > to 0.8.2.0 today in a test environment. The Kafka cluster
> there
> > > is
> > > > > two
> > > > > > > > m1.larges handling 2,000 partitions across 32 topics. CPU
> usage
> > > > rose
> > > > > > from
> > > > > > > > 40% into the 150%–190% range, and load average from under 1
> to
> > > over
> > > > > 4.
> > > > > > > > Downgrading to 0.8.1.1 brought the CPU and load back to the
> > > > previous
> > > > > > > > values.
> > > > > > > >
> > > > > > > > If there's more info that would be helpful, please let me
> know.
> > > > > > > >
> > > > > > > > On Thu, Feb 12, 2015 at 4:17 PM, Mathias Söderberg <
> > > > > > > > mathias.soederberg@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Jun,
> > > > > > > > >
> > > > > > > > > Pardon the radio silence. I booted up a new broker,
> created a
> > > > topic
> > > > > > > with
> > > > > > > > > three (3) partitions and replication factor one (1) and
> used
> > > the
> > > > > > > > > *kafka-producer-perf-test.sh
> > > > > > > > > *script to generate load (using messages of roughly the
> same
> > > size
> > > > > as
> > > > > > > > ours).
> > > > > > > > > There was a slight increase in CPU usage (~5-10%) on
> > > 0.8.2.0-rc2
> > > > > > > compared
> > > > > > > > > to 0.8.1.1, but that was about it.
> > > > > > > > >
> > > > > > > > > I upgraded our staging cluster to 0.8.2.0 earlier this week
> > or
> > > > so,
> > > > > > and
> > > > > > > > had
> > > > > > > > > to add an additional broker due to increased load after the
> > > > upgrade
> > > > > > > (note
> > > > > > > > > that the incoming load on the cluster has been pretty much
> > > > > > consistent).
> > > > > > > > > Since the upgrade we've been seeing an 2-3x increase in
> > latency
> > > > as
> > > > > > > well.
> > > > > > > > > I'm considering downgrading to 0.8.1.1 again to see if it
> > > > resolves
> > > > > > our
> > > > > > > > > issues.
> > > > > > > > >
> > > > > > > > > Best regards,
> > > > > > > > > Mathias
> > > > > > > > >
> > > > > > > > > On Tue Feb 03 2015 at 6:44:36 PM Jun Rao <jun@confluent.io
> >
> > > > wrote:
> > > > > > > > >
> > > > > > > > > > Mathias,
> > > > > > > > > >
> > > > > > > > > > The new hprof doesn't reveal anything new to me. We did
> fix
> > > the
> > > > > > logic
> > > > > > > > in
> > > > > > > > > > using Purgatory in 0.8.2, which could potentially drive
> up
> > > the
> > > > > CPU
> > > > > > > > usage
> > > > > > > > > a
> > > > > > > > > > bit. To verify that, could you do your test on a single
> > > broker
> > > > > > (with
> > > > > > > > > > replication factor 1) btw 0.8.1 and 0.8.2 and see if
> there
> > is
> > > > any
> > > > > > > > > > significant difference in cpu usage?
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > >
> > > > > > > > > > Jun
> > > > > > > > > >
> > > > > > > > > > On Tue, Feb 3, 2015 at 5:09 AM, Mathias Söderberg <
> > > > > > > > > > mathias.soederberg@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Jun,
> > > > > > > > > > >
> > > > > > > > > > > I re-ran the hprof test, for about 30 minutes again,
> for
> > > > > > > 0.8.2.0-rc2
> > > > > > > > > with
> > > > > > > > > > > the same version of snappy that 0.8.1.1 used. Attached
> > the
> > > > > logs.
> > > > > > > > > > > Unfortunately there wasn't any improvement as the node
> > > > running
> > > > > > > > > > 0.8.2.0-rc2
> > > > > > > > > > > still had a higher load and CPU usage.
> > > > > > > > > > >
> > > > > > > > > > > Best regards,
> > > > > > > > > > > Mathias
> > > > > > > > > > >
> > > > > > > > > > > On Tue Feb 03 2015 at 4:40:31 AM Jaikiran Pai <
> > > > > > > > > jai.forums2013@gmail.com>
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > >> On Monday 02 February 2015 11:03 PM, Jun Rao wrote:
> > > > > > > > > > >> > Jaikiran,
> > > > > > > > > > >> >
> > > > > > > > > > >> > The fix you provided in probably unnecessary. The
> > > channel
> > > > > that
> > > > > > > we
> > > > > > > > > use
> > > > > > > > > > in
> > > > > > > > > > >> > SimpleConsumer (BlockingChannel) is configured to be
> > > > > blocking.
> > > > > > > So
> > > > > > > > > even
> > > > > > > > > > >> > though the read from the socket is in a loop, each
> > read
> > > > > blocks
> > > > > > > if
> > > > > > > > > > there
> > > > > > > > > > >> is
> > > > > > > > > > >> > no bytes received from the broker. So, that
> shouldn't
> > > > cause
> > > > > > > extra
> > > > > > > > > CPU
> > > > > > > > > > >> > consumption.
> > > > > > > > > > >> Hi Jun,
> > > > > > > > > > >>
> > > > > > > > > > >> Of course, you are right! I forgot that while reading
> > the
> > > > > thread
> > > > > > > > dump
> > > > > > > > > in
> > > > > > > > > > >> hprof output, one has to be aware that the thread
> state
> > > > isn't
> > > > > > > shown
> > > > > > > > > and
> > > > > > > > > > >> the thread need not necessarily be doing any CPU
> > activity.
> > > > > > > > > > >>
> > > > > > > > > > >> -Jaikiran
> > > > > > > > > > >>
> > > > > > > > > > >>
> > > > > > > > > > >> >
> > > > > > > > > > >> > Thanks,
> > > > > > > > > > >> >
> > > > > > > > > > >> > Jun
> > > > > > > > > > >> >
> > > > > > > > > > >> > On Mon, Jan 26, 2015 at 10:05 AM, Mathias Söderberg
> <
> > > > > > > > > > >> > mathias.soederberg@gmail.com> wrote:
> > > > > > > > > > >> >
> > > > > > > > > > >> >> Hi Neha,
> > > > > > > > > > >> >>
> > > > > > > > > > >> >> I sent an e-mail earlier today, but noticed now
> that
> > it
> > > > > > didn't
> > > > > > > > > > >> actually go
> > > > > > > > > > >> >> through.
> > > > > > > > > > >> >>
> > > > > > > > > > >> >> Anyhow, I've attached two files, one with output
> > from a
> > > > 10
> > > > > > > minute
> > > > > > > > > run
> > > > > > > > > > >> and
> > > > > > > > > > >> >> one with output from a 30 minute run. Realized that
> > > > maybe I
> > > > > > > > > should've
> > > > > > > > > > >> done
> > > > > > > > > > >> >> one or two runs with 0.8.1.1 as well, but
> > nevertheless.
> > > > > > > > > > >> >>
> > > > > > > > > > >> >> I upgraded our staging cluster to 0.8.2.0-rc2, and
> > I'm
> > > > > seeing
> > > > > > > the
> > > > > > > > > > same
> > > > > > > > > > >> CPU
> > > > > > > > > > >> >> usage as with the beta version (basically pegging
> all
> > > > > cores).
> > > > > > > If
> > > > > > > > I
> > > > > > > > > > >> manage
> > > > > > > > > > >> >> to find the time I'll do another run with hprof on
> > the
> > > > rc2
> > > > > > > > version
> > > > > > > > > > >> later
> > > > > > > > > > >> >> today.
> > > > > > > > > > >> >>
> > > > > > > > > > >> >> Best regards,
> > > > > > > > > > >> >> Mathias
> > > > > > > > > > >> >>
> > > > > > > > > > >> >> On Tue Dec 09 2014 at 10:08:21 PM Neha Narkhede <
> > > > > > > > neha@confluent.io
> > > > > > > > > >
> > > > > > > > > > >> wrote:
> > > > > > > > > > >> >>
> > > > > > > > > > >> >>> The following should be sufficient
> > > > > > > > > > >> >>>
> > > > > > > > > > >> >>> java
> > > > > > > > > > >> >>>
> > > > -agentlib:hprof=cpu=samples,depth=100,interval=20,lineno=
> > > > > > > > > > >> >>> y,thread=y,file=kafka.hprof
> > > > > > > > > > >> >>> <classname>
> > > > > > > > > > >> >>>
> > > > > > > > > > >> >>> You would need to start the Kafka server with the
> > > > settings
> > > > > > > above
> > > > > > > > > for
> > > > > > > > > > >> >>> sometime until you observe the problem.
> > > > > > > > > > >> >>>
> > > > > > > > > > >> >>> On Tue, Dec 9, 2014 at 3:47 AM, Mathias Söderberg
> <
> > > > > > > > > > >> >>> mathias.soederberg@gmail.com> wrote:
> > > > > > > > > > >> >>>
> > > > > > > > > > >> >>>> Hi Neha,
> > > > > > > > > > >> >>>>
> > > > > > > > > > >> >>>> Yeah sure. I'm not familiar with hprof, so any
> > > > particular
> > > > > > > > > options I
> > > > > > > > > > >> >>> should
> > > > > > > > > > >> >>>> include or just run with defaults?
> > > > > > > > > > >> >>>>
> > > > > > > > > > >> >>>> Best regards,
> > > > > > > > > > >> >>>> Mathias
> > > > > > > > > > >> >>>>
> > > > > > > > > > >> >>>> On Mon Dec 08 2014 at 7:41:32 PM Neha Narkhede <
> > > > > > > > > neha@confluent.io>
> > > > > > > > > > >> >>> wrote:
> > > > > > > > > > >> >>>>> Thanks for reporting the issue. Would you mind
> > > running
> > > > > > hprof
> > > > > > > > and
> > > > > > > > > > >> >>> sending
> > > > > > > > > > >> >>>>> the output?
> > > > > > > > > > >> >>>>>
> > > > > > > > > > >> >>>>> On Mon, Dec 8, 2014 at 1:25 AM, Mathias
> Söderberg
> > <
> > > > > > > > > > >> >>>>> mathias.soederberg@gmail.com> wrote:
> > > > > > > > > > >> >>>>>
> > > > > > > > > > >> >>>>>> Good day,
> > > > > > > > > > >> >>>>>>
> > > > > > > > > > >> >>>>>> I upgraded a Kafka cluster from v0.8.1.1 to
> > > > v0.8.2-beta
> > > > > > and
> > > > > > > > > > noticed
> > > > > > > > > > >> >>>> that
> > > > > > > > > > >> >>>>>> the CPU usage on the broker machines went up by
> > > > roughly
> > > > > > > 40%,
> > > > > > > > > from
> > > > > > > > > > >> >>> ~60%
> > > > > > > > > > >> >>>> to
> > > > > > > > > > >> >>>>>> ~100% and am wondering if anyone else has
> > > experienced
> > > > > > > > something
> > > > > > > > > > >> >>>> similar?
> > > > > > > > > > >> >>>>>> The load average also went up by 2x-3x.
> > > > > > > > > > >> >>>>>>
> > > > > > > > > > >> >>>>>> We're running on EC2 and the cluster currently
> > > > consists
> > > > > > of
> > > > > > > > four
> > > > > > > > > > >> >>>>> m1.xlarge,
> > > > > > > > > > >> >>>>>> with roughly 1100 topics / 4000 partitions.
> Using
> > > > Java
> > > > > 7
> > > > > > > > > > (1.7.0_65
> > > > > > > > > > >> >>> to
> > > > > > > > > > >> >>>> be
> > > > > > > > > > >> >>>>>> exact) and Scala 2.9.2. Configurations can be
> > found
> > > > > over
> > > > > > > > here:
> > > > > > > > > > >> >>>>>>
> > > > > https://gist.github.com/mthssdrbrg/7df34a795e07eef10262.
> > > > > > > > > > >> >>>>>>
> > > > > > > > > > >> >>>>>> I'm assuming that this is not expected
> behaviour
> > > for
> > > > > > > > > 0.8.2-beta?
> > > > > > > > > > >> >>>>>>
> > > > > > > > > > >> >>>>>> Best regards,
> > > > > > > > > > >> >>>>>> Mathias
> > > > > > > > > > >> >>>>>>
> > > > > > > > > > >> >>>>>
> > > > > > > > > > >> >>>>>
> > > > > > > > > > >> >>>>> --
> > > > > > > > > > >> >>>>> Thanks,
> > > > > > > > > > >> >>>>> Neha
> > > > > > > > > > >> >>>>>
> > > > > > > > > > >> >>>
> > > > > > > > > > >> >>>
> > > > > > > > > > >> >>> --
> > > > > > > > > > >> >>> Thanks,
> > > > > > > > > > >> >>> Neha
> > > > > > > > > > >> >>>
> > > > > > > > > > >>
> > > > > > > > > > >>
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Increased CPU usage with 0.8.2-beta

Posted by Jun Rao <ju...@confluent.io>.
Solon, Mathias,

Thanks for testing this out. I just uploaded a slightly modified patch in
https://issues.apache.org/jira/browse/KAFKA-1952. The new patch may not
improve the latency and CPU usage further, but will potentially improve
memory consumption. It would be great if you guys can test the new patch
out.

Thanks,

Jun
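
For anyone who hasn't applied a Kafka patch before, the rough sequence looks
something like this (the patch file name is a placeholder for whatever the
latest attachment on the JIRA is called, and the exact build targets are
described in the README on the 0.8.2 branch):

git clone https://github.com/apache/kafka.git
cd kafka
git checkout 0.8.2

# download the newest patch attached to KAFKA-1952 from the JIRA page,
# then apply it on top of the branch
git apply KAFKA-1952.patch

# bootstrap the gradle wrapper if needed, then build a release tarball
gradle
./gradlew releaseTarGz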

On Sat, Feb 14, 2015 at 9:08 AM, Mathias Söderberg <
mathias.soederberg@gmail.com> wrote:

> Jun,
>
> I updated our brokers earlier today with the mentioned patch. A week ago
> our brokers used ~380% CPU (out of 400%) quite consistently, and now
> they're varying between 250-325% (probably running a bit high right now as
> we have some consumers catching up quite some lag), so there's definitely
> an improvement. The producer latency is still a bit higher than with
> 0.8.1.1, but I've been playing a bit with broker configuration as well as
> producer configuration lately so that probably plays in a bit.
>
> I'll keep an eye on our metrics, and am going to mess around a bit with
> configuration. Right now our traffic load is quite low, so it'd be
> interesting to see how this works over the next few days. With that said,
> we're at the same levels of CPU usage as with 0.8.1.1 (though with an
> additional broker), so everything looks pretty great.
>
> We're using acks = "all" (-1) by the way.
>
> Best regards,
> Mathias
>
> On Sat Feb 14 2015 at 4:40:31 AM Solon Gordon <so...@knewton.com> wrote:
>
> > Thanks for the fast response. I did a quick test and initial results look
> > promising. When I swapped in the patched version, CPU usage dropped from
> > ~150% to ~65%. Still a bit higher than what I see with 0.8.1.1 but much
> > more reasonable.
> >
> > I'll do more testing on Monday but wanted to get you some quick feedback.
> > Hopefully Mathias will have good results as well.
> >
> > On Fri, Feb 13, 2015 at 9:14 PM, Jun Rao <ju...@confluent.io> wrote:
> >
> > > Mathias, Solon,
> > >
> > > We did identify a CPU issue and patched it in
> > > https://issues.apache.org/jira/browse/KAFKA-1952. Could you apply the
> > > patch
> > > in the 0.8.2 branch and see if that addresses the issue?
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Fri, Feb 13, 2015 at 3:26 PM, Jay Kreps <ja...@gmail.com>
> wrote:
> > >
> > > > We can reproduce this issue, have a theory as to the cause, and are
> > > working
> > > > on a fix. Here is the ticket to track it:
> > > > https://issues.apache.org/jira/browse/KAFKA-1952
> > > >
> > > > I would recommend people hold off on 0.8.2 upgrades until we have a
> > > handle
> > > > on this.
> > > >
> > > > -Jay
> > > >
> > > > On Fri, Feb 13, 2015 at 1:47 PM, Solon Gordon <so...@knewton.com>
> > wrote:
> > > >
> > > > > The partitions nearly all have replication factor 2 (a few stray
> ones
> > > > have
> > > > > 1), and our producers use request.required.acks=-1. However, I
> should
> > > > note
> > > > > there were hardly any messages being produced when I did the
> upgrade
> > > and
> > > > > observed the high CPU load.
> > > > >
> > > > > I should have time to do some profiling on Monday and will get back
> > to
> > > > you
> > > > > with the results.
> > > > >
> > > > > On Fri, Feb 13, 2015 at 1:00 PM, Jun Rao <ju...@confluent.io> wrote:
> > > > >
> > > > > > Solon,
> > > > > >
> > > > > > What's the replication factor you used for those partitions?
> What's
> > > the
> > > > > > producer ack that you used? Also, could you do a bit of profiling
> > on
> > > > the
> > > > > > broker to see which methods used the most CPU?
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Jun
> > > > > >
> > > > > > On Thu, Feb 12, 2015 at 3:19 PM, Solon Gordon <solon@knewton.com
> >
> > > > wrote:
> > > > > >
> > > > > > > I saw a very similar jump in CPU usage when I tried upgrading
> > from
> > > > > > 0.8.1.1
> > > > > > > to 0.8.2.0 today in a test environment. The Kafka cluster there
> > is
> > > > two
> > > > > > > m1.larges handling 2,000 partitions across 32 topics. CPU usage
> > > rose
> > > > > from
> > > > > > > 40% into the 150%–190% range, and load average from under 1 to
> > over
> > > > 4.
> > > > > > > Downgrading to 0.8.1.1 brought the CPU and load back to the
> > > previous
> > > > > > > values.
> > > > > > >
> > > > > > > If there's more info that would be helpful, please let me know.
> > > > > > >
> > > > > > > On Thu, Feb 12, 2015 at 4:17 PM, Mathias Söderberg <
> > > > > > > mathias.soederberg@gmail.com> wrote:
> > > > > > >
> > > > > > > > Jun,
> > > > > > > >
> > > > > > > > Pardon the radio silence. I booted up a new broker, created a
> > > topic
> > > > > > with
> > > > > > > > three (3) partitions and replication factor one (1) and used
> > the
> > > > > > > > *kafka-producer-perf-test.sh
> > > > > > > > *script to generate load (using messages of roughly the same
> > size
> > > > as
> > > > > > > ours).
> > > > > > > > There was a slight increase in CPU usage (~5-10%) on
> > 0.8.2.0-rc2
> > > > > > compared
> > > > > > > > to 0.8.1.1, but that was about it.
> > > > > > > >
> > > > > > > > I upgraded our staging cluster to 0.8.2.0 earlier this week
> or
> > > so,
> > > > > and
> > > > > > > had
> > > > > > > > to add an additional broker due to increased load after the
> > > upgrade
> > > > > > (note
> > > > > > > > that the incoming load on the cluster has been pretty much
> > > > > consistent).
> > > > > > > > Since the upgrade we've been seeing an 2-3x increase in
> latency
> > > as
> > > > > > well.
> > > > > > > > I'm considering downgrading to 0.8.1.1 again to see if it
> > > resolves
> > > > > our
> > > > > > > > issues.
> > > > > > > >
> > > > > > > > Best regards,
> > > > > > > > Mathias
> > > > > > > >
> > > > > > > > On Tue Feb 03 2015 at 6:44:36 PM Jun Rao <ju...@confluent.io>
> > > wrote:
> > > > > > > >
> > > > > > > > > Mathias,
> > > > > > > > >
> > > > > > > > > The new hprof doesn't reveal anything new to me. We did fix
> > the
> > > > > logic
> > > > > > > in
> > > > > > > > > using Purgatory in 0.8.2, which could potentially drive up
> > the
> > > > CPU
> > > > > > > usage
> > > > > > > > a
> > > > > > > > > bit. To verify that, could you do your test on a single
> > broker
> > > > > (with
> > > > > > > > > replication factor 1) btw 0.8.1 and 0.8.2 and see if there
> is
> > > any
> > > > > > > > > significant difference in cpu usage?
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > >
> > > > > > > > > Jun
> > > > > > > > >
> > > > > > > > > On Tue, Feb 3, 2015 at 5:09 AM, Mathias Söderberg <
> > > > > > > > > mathias.soederberg@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Jun,
> > > > > > > > > >
> > > > > > > > > > I re-ran the hprof test, for about 30 minutes again, for
> > > > > > 0.8.2.0-rc2
> > > > > > > > with
> > > > > > > > > > the same version of snappy that 0.8.1.1 used. Attached
> the
> > > > logs.
> > > > > > > > > > Unfortunately there wasn't any improvement as the node
> > > running
> > > > > > > > > 0.8.2.0-rc2
> > > > > > > > > > still had a higher load and CPU usage.
> > > > > > > > > >
> > > > > > > > > > Best regards,
> > > > > > > > > > Mathias
> > > > > > > > > >
> > > > > > > > > > On Tue Feb 03 2015 at 4:40:31 AM Jaikiran Pai <
> > > > > > > > jai.forums2013@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > >> On Monday 02 February 2015 11:03 PM, Jun Rao wrote:
> > > > > > > > > >> > Jaikiran,
> > > > > > > > > >> >
> > > > > > > > > >> > The fix you provided in probably unnecessary. The
> > channel
> > > > that
> > > > > > we
> > > > > > > > use
> > > > > > > > > in
> > > > > > > > > >> > SimpleConsumer (BlockingChannel) is configured to be
> > > > blocking.
> > > > > > So
> > > > > > > > even
> > > > > > > > > >> > though the read from the socket is in a loop, each
> read
> > > > blocks
> > > > > > if
> > > > > > > > > there
> > > > > > > > > >> is
> > > > > > > > > >> > no bytes received from the broker. So, that shouldn't
> > > cause
> > > > > > extra
> > > > > > > > CPU
> > > > > > > > > >> > consumption.
> > > > > > > > > >> Hi Jun,
> > > > > > > > > >>
> > > > > > > > > >> Of course, you are right! I forgot that while reading
> the
> > > > thread
> > > > > > > dump
> > > > > > > > in
> > > > > > > > > >> hprof output, one has to be aware that the thread state
> > > isn't
> > > > > > shown
> > > > > > > > and
> > > > > > > > > >> the thread need not necessarily be doing any CPU
> activity.
> > > > > > > > > >>
> > > > > > > > > >> -Jaikiran
> > > > > > > > > >>
> > > > > > > > > >>
> > > > > > > > > >> >
> > > > > > > > > >> > Thanks,
> > > > > > > > > >> >
> > > > > > > > > >> > Jun
> > > > > > > > > >> >
> > > > > > > > > >> > On Mon, Jan 26, 2015 at 10:05 AM, Mathias Söderberg <
> > > > > > > > > >> > mathias.soederberg@gmail.com> wrote:
> > > > > > > > > >> >
> > > > > > > > > >> >> Hi Neha,
> > > > > > > > > >> >>
> > > > > > > > > >> >> I sent an e-mail earlier today, but noticed now that
> it
> > > > > didn't
> > > > > > > > > >> actually go
> > > > > > > > > >> >> through.
> > > > > > > > > >> >>
> > > > > > > > > >> >> Anyhow, I've attached two files, one with output
> from a
> > > 10
> > > > > > minute
> > > > > > > > run
> > > > > > > > > >> and
> > > > > > > > > >> >> one with output from a 30 minute run. Realized that
> > > maybe I
> > > > > > > > should've
> > > > > > > > > >> done
> > > > > > > > > >> >> one or two runs with 0.8.1.1 as well, but
> nevertheless.
> > > > > > > > > >> >>
> > > > > > > > > >> >> I upgraded our staging cluster to 0.8.2.0-rc2, and
> I'm
> > > > seeing
> > > > > > the
> > > > > > > > > same
> > > > > > > > > >> CPU
> > > > > > > > > >> >> usage as with the beta version (basically pegging all
> > > > cores).
> > > > > > If
> > > > > > > I
> > > > > > > > > >> manage
> > > > > > > > > >> >> to find the time I'll do another run with hprof on
> the
> > > rc2
> > > > > > > version
> > > > > > > > > >> later
> > > > > > > > > >> >> today.
> > > > > > > > > >> >>
> > > > > > > > > >> >> Best regards,
> > > > > > > > > >> >> Mathias
> > > > > > > > > >> >>
> > > > > > > > > >> >> On Tue Dec 09 2014 at 10:08:21 PM Neha Narkhede <
> > > > > > > neha@confluent.io
> > > > > > > > >
> > > > > > > > > >> wrote:
> > > > > > > > > >> >>
> > > > > > > > > >> >>> The following should be sufficient
> > > > > > > > > >> >>>
> > > > > > > > > >> >>> java
> > > > > > > > > >> >>>
> > > -agentlib:hprof=cpu=samples,depth=100,interval=20,lineno=
> > > > > > > > > >> >>> y,thread=y,file=kafka.hprof
> > > > > > > > > >> >>> <classname>
> > > > > > > > > >> >>>
> > > > > > > > > >> >>> You would need to start the Kafka server with the
> > > settings
> > > > > > above
> > > > > > > > for
> > > > > > > > > >> >>> sometime until you observe the problem.
> > > > > > > > > >> >>>
> > > > > > > > > >> >>> On Tue, Dec 9, 2014 at 3:47 AM, Mathias Söderberg <
> > > > > > > > > >> >>> mathias.soederberg@gmail.com> wrote:
> > > > > > > > > >> >>>
> > > > > > > > > >> >>>> Hi Neha,
> > > > > > > > > >> >>>>
> > > > > > > > > >> >>>> Yeah sure. I'm not familiar with hprof, so any
> > > particular
> > > > > > > > options I
> > > > > > > > > >> >>> should
> > > > > > > > > >> >>>> include or just run with defaults?
> > > > > > > > > >> >>>>
> > > > > > > > > >> >>>> Best regards,
> > > > > > > > > >> >>>> Mathias
> > > > > > > > > >> >>>>
> > > > > > > > > >> >>>> On Mon Dec 08 2014 at 7:41:32 PM Neha Narkhede <
> > > > > > > > neha@confluent.io>
> > > > > > > > > >> >>> wrote:
> > > > > > > > > >> >>>>> Thanks for reporting the issue. Would you mind
> > running
> > > > > hprof
> > > > > > > and
> > > > > > > > > >> >>> sending
> > > > > > > > > >> >>>>> the output?
> > > > > > > > > >> >>>>>
> > > > > > > > > >> >>>>> On Mon, Dec 8, 2014 at 1:25 AM, Mathias Söderberg
> <
> > > > > > > > > >> >>>>> mathias.soederberg@gmail.com> wrote:
> > > > > > > > > >> >>>>>
> > > > > > > > > >> >>>>>> Good day,
> > > > > > > > > >> >>>>>>
> > > > > > > > > >> >>>>>> I upgraded a Kafka cluster from v0.8.1.1 to
> > > v0.8.2-beta
> > > > > and
> > > > > > > > > noticed
> > > > > > > > > >> >>>> that
> > > > > > > > > >> >>>>>> the CPU usage on the broker machines went up by
> > > roughly
> > > > > > 40%,
> > > > > > > > from
> > > > > > > > > >> >>> ~60%
> > > > > > > > > >> >>>> to
> > > > > > > > > >> >>>>>> ~100% and am wondering if anyone else has
> > experienced
> > > > > > > something
> > > > > > > > > >> >>>> similar?
> > > > > > > > > >> >>>>>> The load average also went up by 2x-3x.
> > > > > > > > > >> >>>>>>
> > > > > > > > > >> >>>>>> We're running on EC2 and the cluster currently
> > > consists
> > > > > of
> > > > > > > four
> > > > > > > > > >> >>>>> m1.xlarge,
> > > > > > > > > >> >>>>>> with roughly 1100 topics / 4000 partitions. Using
> > > Java
> > > > 7
> > > > > > > > > (1.7.0_65
> > > > > > > > > >> >>> to
> > > > > > > > > >> >>>> be
> > > > > > > > > >> >>>>>> exact) and Scala 2.9.2. Configurations can be
> found
> > > > over
> > > > > > > here:
> > > > > > > > > >> >>>>>>
> > > > https://gist.github.com/mthssdrbrg/7df34a795e07eef10262.
> > > > > > > > > >> >>>>>>
> > > > > > > > > >> >>>>>> I'm assuming that this is not expected behaviour
> > for
> > > > > > > > 0.8.2-beta?
> > > > > > > > > >> >>>>>>
> > > > > > > > > >> >>>>>> Best regards,
> > > > > > > > > >> >>>>>> Mathias
> > > > > > > > > >> >>>>>>
> > > > > > > > > >> >>>>>
> > > > > > > > > >> >>>>>
> > > > > > > > > >> >>>>> --
> > > > > > > > > >> >>>>> Thanks,
> > > > > > > > > >> >>>>> Neha
> > > > > > > > > >> >>>>>
> > > > > > > > > >> >>>
> > > > > > > > > >> >>>
> > > > > > > > > >> >>> --
> > > > > > > > > >> >>> Thanks,
> > > > > > > > > >> >>> Neha
> > > > > > > > > >> >>>
> > > > > > > > > >>
> > > > > > > > > >>
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Increased CPU usage with 0.8.2-beta

Posted by Mathias Söderberg <ma...@gmail.com>.
Jun,

I updated our brokers earlier today with the mentioned patch. A week ago
our brokers used ~380% CPU (out of 400%) quite consistently, and now
they're varying between 250-325% (probably running a bit high right now as
we have some consumers catching up quite some lag), so there's definitely
an improvement. The producer latency is still a bit higher than with
0.8.1.1, but I've been playing a bit with broker configuration as well as
producer configuration lately so that probably plays in a bit.

I'll keep an eye on our metrics, and am going to mess around a bit with
configuration. Right now our traffic load is quite low, so it'd be
interesting to see how this works over the next few days. With that said,
we're at the same levels of CPU usage as with 0.8.1.1 (though with an
additional broker), so everything looks pretty great.

We're using acks = "all" (-1) by the way.
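
(For reference, that setting maps to the producer config keys below; which
one applies depends on whether the old Scala producer or the new Java
producer introduced in 0.8.2 is in use.)

# old (Scala) producer: wait for acks from all in-sync replicas
request.required.acks=-1

# new (Java) producer equivalent
acks=all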

Best regards,
Mathias

On Sat Feb 14 2015 at 4:40:31 AM Solon Gordon <so...@knewton.com> wrote:

> Thanks for the fast response. I did a quick test and initial results look
> promising. When I swapped in the patched version, CPU usage dropped from
> ~150% to ~65%. Still a bit higher than what I see with 0.8.1.1 but much
> more reasonable.
>
> I'll do more testing on Monday but wanted to get you some quick feedback.
> Hopefully Mathias will have good results as well.
>
> On Fri, Feb 13, 2015 at 9:14 PM, Jun Rao <ju...@confluent.io> wrote:
>
> > Mathias, Solon,
> >
> > We did identify a CPU issue and patched it in
> > https://issues.apache.org/jira/browse/KAFKA-1952. Could you apply the
> > patch
> > in the 0.8.2 branch and see if that addresses the issue?
> >
> > Thanks,
> >
> > Jun
> >
> > On Fri, Feb 13, 2015 at 3:26 PM, Jay Kreps <ja...@gmail.com> wrote:
> >
> > > We can reproduce this issue, have a theory as to the cause, and are
> > working
> > > on a fix. Here is the ticket to track it:
> > > https://issues.apache.org/jira/browse/KAFKA-1952
> > >
> > > I would recommend people hold off on 0.8.2 upgrades until we have a
> > handle
> > > on this.
> > >
> > > -Jay
> > >
> > > On Fri, Feb 13, 2015 at 1:47 PM, Solon Gordon <so...@knewton.com>
> wrote:
> > >
> > > > The partitions nearly all have replication factor 2 (a few stray ones
> > > have
> > > > 1), and our producers use request.required.acks=-1. However, I should
> > > note
> > > > there were hardly any messages being produced when I did the upgrade
> > and
> > > > observed the high CPU load.
> > > >
> > > > I should have time to do some profiling on Monday and will get back
> to
> > > you
> > > > with the results.
> > > >
> > > > On Fri, Feb 13, 2015 at 1:00 PM, Jun Rao <ju...@confluent.io> wrote:
> > > >
> > > > > Solon,
> > > > >
> > > > > What's the replication factor you used for those partitions? What's
> > the
> > > > > producer ack that you used? Also, could you do a bit of profiling
> on
> > > the
> > > > > broker to see which methods used the most CPU?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > > On Thu, Feb 12, 2015 at 3:19 PM, Solon Gordon <so...@knewton.com>
> > > wrote:
> > > > >
> > > > > > I saw a very similar jump in CPU usage when I tried upgrading
> from
> > > > > 0.8.1.1
> > > > > > to 0.8.2.0 today in a test environment. The Kafka cluster there
> is
> > > two
> > > > > > m1.larges handling 2,000 partitions across 32 topics. CPU usage
> > rose
> > > > from
> > > > > > 40% into the 150%–190% range, and load average from under 1 to
> over
> > > 4.
> > > > > > Downgrading to 0.8.1.1 brought the CPU and load back to the
> > previous
> > > > > > values.
> > > > > >
> > > > > > If there's more info that would be helpful, please let me know.
> > > > > >
> > > > > > On Thu, Feb 12, 2015 at 4:17 PM, Mathias Söderberg <
> > > > > > mathias.soederberg@gmail.com> wrote:
> > > > > >
> > > > > > > Jun,
> > > > > > >
> > > > > > > Pardon the radio silence. I booted up a new broker, created a
> > topic
> > > > > with
> > > > > > > three (3) partitions and replication factor one (1) and used
> the
> > > > > > > *kafka-producer-perf-test.sh
> > > > > > > *script to generate load (using messages of roughly the same
> size
> > > as
> > > > > > ours).
> > > > > > > There was a slight increase in CPU usage (~5-10%) on
> 0.8.2.0-rc2
> > > > > compared
> > > > > > > to 0.8.1.1, but that was about it.
> > > > > > >
> > > > > > > I upgraded our staging cluster to 0.8.2.0 earlier this week or
> > so,
> > > > and
> > > > > > had
> > > > > > > to add an additional broker due to increased load after the
> > upgrade
> > > > > (note
> > > > > > > that the incoming load on the cluster has been pretty much
> > > > consistent).
> > > > > > > Since the upgrade we've been seeing an 2-3x increase in latency
> > as
> > > > > well.
> > > > > > > I'm considering downgrading to 0.8.1.1 again to see if it
> > resolves
> > > > our
> > > > > > > issues.
> > > > > > >
> > > > > > > Best regards,
> > > > > > > Mathias
> > > > > > >
> > > > > > > On Tue Feb 03 2015 at 6:44:36 PM Jun Rao <ju...@confluent.io>
> > wrote:
> > > > > > >
> > > > > > > > Mathias,
> > > > > > > >
> > > > > > > > The new hprof doesn't reveal anything new to me. We did fix
> the
> > > > logic
> > > > > > in
> > > > > > > > using Purgatory in 0.8.2, which could potentially drive up
> the
> > > CPU
> > > > > > usage
> > > > > > > a
> > > > > > > > bit. To verify that, could you do your test on a single
> broker
> > > > (with
> > > > > > > > replication factor 1) btw 0.8.1 and 0.8.2 and see if there is
> > any
> > > > > > > > significant difference in cpu usage?
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > >
> > > > > > > > Jun
> > > > > > > >
> > > > > > > > On Tue, Feb 3, 2015 at 5:09 AM, Mathias Söderberg <
> > > > > > > > mathias.soederberg@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Jun,
> > > > > > > > >
> > > > > > > > > I re-ran the hprof test, for about 30 minutes again, for
> > > > > 0.8.2.0-rc2
> > > > > > > with
> > > > > > > > > the same version of snappy that 0.8.1.1 used. Attached the
> > > logs.
> > > > > > > > > Unfortunately there wasn't any improvement as the node
> > running
> > > > > > > > 0.8.2.0-rc2
> > > > > > > > > still had a higher load and CPU usage.
> > > > > > > > >
> > > > > > > > > Best regards,
> > > > > > > > > Mathias
> > > > > > > > >
> > > > > > > > > On Tue Feb 03 2015 at 4:40:31 AM Jaikiran Pai <
> > > > > > > jai.forums2013@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > >> On Monday 02 February 2015 11:03 PM, Jun Rao wrote:
> > > > > > > > >> > Jaikiran,
> > > > > > > > >> >
> > > > > > > > >> > The fix you provided in probably unnecessary. The
> channel
> > > that
> > > > > we
> > > > > > > use
> > > > > > > > in
> > > > > > > > >> > SimpleConsumer (BlockingChannel) is configured to be
> > > blocking.
> > > > > So
> > > > > > > even
> > > > > > > > >> > though the read from the socket is in a loop, each read
> > > blocks
> > > > > if
> > > > > > > > there
> > > > > > > > >> is
> > > > > > > > >> > no bytes received from the broker. So, that shouldn't
> > cause
> > > > > extra
> > > > > > > CPU
> > > > > > > > >> > consumption.
> > > > > > > > >> Hi Jun,
> > > > > > > > >>
> > > > > > > > >> Of course, you are right! I forgot that while reading the
> > > thread
> > > > > > dump
> > > > > > > in
> > > > > > > > >> hprof output, one has to be aware that the thread state
> > isn't
> > > > > shown
> > > > > > > and
> > > > > > > > >> the thread need not necessarily be doing any CPU activity.
> > > > > > > > >>
> > > > > > > > >> -Jaikiran
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > > >> >
> > > > > > > > >> > Thanks,
> > > > > > > > >> >
> > > > > > > > >> > Jun
> > > > > > > > >> >
> > > > > > > > >> > On Mon, Jan 26, 2015 at 10:05 AM, Mathias Söderberg <
> > > > > > > > >> > mathias.soederberg@gmail.com> wrote:
> > > > > > > > >> >
> > > > > > > > >> >> Hi Neha,
> > > > > > > > >> >>
> > > > > > > > >> >> I sent an e-mail earlier today, but noticed now that it
> > > > didn't
> > > > > > > > >> actually go
> > > > > > > > >> >> through.
> > > > > > > > >> >>
> > > > > > > > >> >> Anyhow, I've attached two files, one with output from a
> > 10
> > > > > minute
> > > > > > > run
> > > > > > > > >> and
> > > > > > > > >> >> one with output from a 30 minute run. Realized that
> > maybe I
> > > > > > > should've
> > > > > > > > >> done
> > > > > > > > >> >> one or two runs with 0.8.1.1 as well, but nevertheless.
> > > > > > > > >> >>
> > > > > > > > >> >> I upgraded our staging cluster to 0.8.2.0-rc2, and I'm
> > > seeing
> > > > > the
> > > > > > > > same
> > > > > > > > >> CPU
> > > > > > > > >> >> usage as with the beta version (basically pegging all
> > > cores).
> > > > > If
> > > > > > I
> > > > > > > > >> manage
> > > > > > > > >> >> to find the time I'll do another run with hprof on the
> > rc2
> > > > > > version
> > > > > > > > >> later
> > > > > > > > >> >> today.
> > > > > > > > >> >>
> > > > > > > > >> >> Best regards,
> > > > > > > > >> >> Mathias
> > > > > > > > >> >>
> > > > > > > > >> >> On Tue Dec 09 2014 at 10:08:21 PM Neha Narkhede <
> > > > > > neha@confluent.io
> > > > > > > >
> > > > > > > > >> wrote:
> > > > > > > > >> >>
> > > > > > > > >> >>> The following should be sufficient
> > > > > > > > >> >>>
> > > > > > > > >> >>> java
> > > > > > > > >> >>>
> > -agentlib:hprof=cpu=samples,depth=100,interval=20,lineno=
> > > > > > > > >> >>> y,thread=y,file=kafka.hprof
> > > > > > > > >> >>> <classname>
> > > > > > > > >> >>>
> > > > > > > > >> >>> You would need to start the Kafka server with the
> > settings
> > > > > above
> > > > > > > for
> > > > > > > > >> >>> sometime until you observe the problem.
> > > > > > > > >> >>>
> > > > > > > > >> >>> On Tue, Dec 9, 2014 at 3:47 AM, Mathias Söderberg <
> > > > > > > > >> >>> mathias.soederberg@gmail.com> wrote:
> > > > > > > > >> >>>
> > > > > > > > >> >>>> Hi Neha,
> > > > > > > > >> >>>>
> > > > > > > > >> >>>> Yeah sure. I'm not familiar with hprof, so any
> > particular
> > > > > > > options I
> > > > > > > > >> >>> should
> > > > > > > > >> >>>> include or just run with defaults?
> > > > > > > > >> >>>>
> > > > > > > > >> >>>> Best regards,
> > > > > > > > >> >>>> Mathias
> > > > > > > > >> >>>>
> > > > > > > > >> >>>> On Mon Dec 08 2014 at 7:41:32 PM Neha Narkhede <
> > > > > > > neha@confluent.io>
> > > > > > > > >> >>> wrote:
> > > > > > > > >> >>>>> Thanks for reporting the issue. Would you mind
> running
> > > > hprof
> > > > > > and
> > > > > > > > >> >>> sending
> > > > > > > > >> >>>>> the output?
> > > > > > > > >> >>>>>
> > > > > > > > >> >>>>> On Mon, Dec 8, 2014 at 1:25 AM, Mathias Söderberg <
> > > > > > > > >> >>>>> mathias.soederberg@gmail.com> wrote:
> > > > > > > > >> >>>>>
> > > > > > > > >> >>>>>> Good day,
> > > > > > > > >> >>>>>>
> > > > > > > > >> >>>>>> I upgraded a Kafka cluster from v0.8.1.1 to
> > v0.8.2-beta
> > > > and
> > > > > > > > noticed
> > > > > > > > >> >>>> that
> > > > > > > > >> >>>>>> the CPU usage on the broker machines went up by
> > roughly
> > > > > 40%,
> > > > > > > from
> > > > > > > > >> >>> ~60%
> > > > > > > > >> >>>> to
> > > > > > > > >> >>>>>> ~100% and am wondering if anyone else has
> experienced
> > > > > > something
> > > > > > > > >> >>>> similar?
> > > > > > > > >> >>>>>> The load average also went up by 2x-3x.
> > > > > > > > >> >>>>>>
> > > > > > > > >> >>>>>> We're running on EC2 and the cluster currently
> > consists
> > > > of
> > > > > > four
> > > > > > > > >> >>>>> m1.xlarge,
> > > > > > > > >> >>>>>> with roughly 1100 topics / 4000 partitions. Using
> > Java
> > > 7
> > > > > > > > (1.7.0_65
> > > > > > > > >> >>> to
> > > > > > > > >> >>>> be
> > > > > > > > >> >>>>>> exact) and Scala 2.9.2. Configurations can be found
> > > over
> > > > > > here:
> > > > > > > > >> >>>>>>
> > > https://gist.github.com/mthssdrbrg/7df34a795e07eef10262.
> > > > > > > > >> >>>>>>
> > > > > > > > >> >>>>>> I'm assuming that this is not expected behaviour
> for
> > > > > > > 0.8.2-beta?
> > > > > > > > >> >>>>>>
> > > > > > > > >> >>>>>> Best regards,
> > > > > > > > >> >>>>>> Mathias
> > > > > > > > >> >>>>>>
> > > > > > > > >> >>>>>
> > > > > > > > >> >>>>>
> > > > > > > > >> >>>>> --
> > > > > > > > >> >>>>> Thanks,
> > > > > > > > >> >>>>> Neha
> > > > > > > > >> >>>>>
> > > > > > > > >> >>>
> > > > > > > > >> >>>
> > > > > > > > >> >>> --
> > > > > > > > >> >>> Thanks,
> > > > > > > > >> >>> Neha
> > > > > > > > >> >>>
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Increased CPU usage with 0.8.2-beta

Posted by Solon Gordon <so...@knewton.com>.
Thanks for the fast response. I did a quick test and initial results look
promising. When I swapped in the patched version, CPU usage dropped from
~150% to ~65%. Still a bit higher than what I see with 0.8.1.1 but much
more reasonable.

I'll do more testing on Monday but wanted to get you some quick feedback.
Hopefully Mathias will have good results as well.

On Fri, Feb 13, 2015 at 9:14 PM, Jun Rao <ju...@confluent.io> wrote:

> Mathias, Solon,
>
> We did identify a CPU issue and patched it in
> https://issues.apache.org/jira/browse/KAFKA-1952. Could you apply the
> patch
> in the 0.8.2 branch and see if that addresses the issue?
>
> Thanks,
>
> Jun
>
> On Fri, Feb 13, 2015 at 3:26 PM, Jay Kreps <ja...@gmail.com> wrote:
>
> > We can reproduce this issue, have a theory as to the cause, and are
> working
> > on a fix. Here is the ticket to track it:
> > https://issues.apache.org/jira/browse/KAFKA-1952
> >
> > I would recommend people hold off on 0.8.2 upgrades until we have a
> handle
> > on this.
> >
> > -Jay
> >
> > On Fri, Feb 13, 2015 at 1:47 PM, Solon Gordon <so...@knewton.com> wrote:
> >
> > > The partitions nearly all have replication factor 2 (a few stray ones
> > have
> > > 1), and our producers use request.required.acks=-1. However, I should
> > note
> > > there were hardly any messages being produced when I did the upgrade
> and
> > > observed the high CPU load.
> > >
> > > I should have time to do some profiling on Monday and will get back to
> > you
> > > with the results.
> > >
> > > On Fri, Feb 13, 2015 at 1:00 PM, Jun Rao <ju...@confluent.io> wrote:
> > >
> > > > Solon,
> > > >
> > > > What's the replication factor you used for those partitions? What's
> the
> > > > producer ack that you used? Also, could you do a bit of profiling on
> > the
> > > > broker to see which methods used the most CPU?
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Thu, Feb 12, 2015 at 3:19 PM, Solon Gordon <so...@knewton.com>
> > wrote:
> > > >
> > > > > I saw a very similar jump in CPU usage when I tried upgrading from
> > > > 0.8.1.1
> > > > > to 0.8.2.0 today in a test environment. The Kafka cluster there is
> > two
> > > > > m1.larges handling 2,000 partitions across 32 topics. CPU usage
> rose
> > > from
> > > > > 40% into the 150%–190% range, and load average from under 1 to over
> > 4.
> > > > > Downgrading to 0.8.1.1 brought the CPU and load back to the
> previous
> > > > > values.
> > > > >
> > > > > If there's more info that would be helpful, please let me know.
> > > > >
> > > > > On Thu, Feb 12, 2015 at 4:17 PM, Mathias Söderberg <
> > > > > mathias.soederberg@gmail.com> wrote:
> > > > >
> > > > > > Jun,
> > > > > >
> > > > > > Pardon the radio silence. I booted up a new broker, created a
> topic
> > > > with
> > > > > > three (3) partitions and replication factor one (1) and used the
> > > > > > *kafka-producer-perf-test.sh
> > > > > > *script to generate load (using messages of roughly the same size
> > as
> > > > > ours).
> > > > > > There was a slight increase in CPU usage (~5-10%) on 0.8.2.0-rc2
> > > > compared
> > > > > > to 0.8.1.1, but that was about it.
> > > > > >
> > > > > > I upgraded our staging cluster to 0.8.2.0 earlier this week or
> so,
> > > and
> > > > > had
> > > > > > to add an additional broker due to increased load after the
> upgrade
> > > > (note
> > > > > > that the incoming load on the cluster has been pretty much
> > > consistent).
> > > > > > Since the upgrade we've been seeing an 2-3x increase in latency
> as
> > > > well.
> > > > > > I'm considering downgrading to 0.8.1.1 again to see if it
> resolves
> > > our
> > > > > > issues.
> > > > > >
> > > > > > Best regards,
> > > > > > Mathias
> > > > > >
> > > > > > On Tue Feb 03 2015 at 6:44:36 PM Jun Rao <ju...@confluent.io>
> wrote:
> > > > > >
> > > > > > > Mathias,
> > > > > > >
> > > > > > > The new hprof doesn't reveal anything new to me. We did fix the
> > > logic
> > > > > in
> > > > > > > using Purgatory in 0.8.2, which could potentially drive up the
> > CPU
> > > > > usage
> > > > > > a
> > > > > > > bit. To verify that, could you do your test on a single broker
> > > (with
> > > > > > > replication factor 1) btw 0.8.1 and 0.8.2 and see if there is
> any
> > > > > > > significant difference in cpu usage?
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Jun
> > > > > > >
> > > > > > > On Tue, Feb 3, 2015 at 5:09 AM, Mathias Söderberg <
> > > > > > > mathias.soederberg@gmail.com> wrote:
> > > > > > >
> > > > > > > > Jun,
> > > > > > > >
> > > > > > > > I re-ran the hprof test, for about 30 minutes again, for
> > > > 0.8.2.0-rc2
> > > > > > with
> > > > > > > > the same version of snappy that 0.8.1.1 used. Attached the
> > logs.
> > > > > > > > Unfortunately there wasn't any improvement as the node
> running
> > > > > > > 0.8.2.0-rc2
> > > > > > > > still had a higher load and CPU usage.
> > > > > > > >
> > > > > > > > Best regards,
> > > > > > > > Mathias
> > > > > > > >
> > > > > > > > On Tue Feb 03 2015 at 4:40:31 AM Jaikiran Pai <
> > > > > > jai.forums2013@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > >> On Monday 02 February 2015 11:03 PM, Jun Rao wrote:
> > > > > > > >> > Jaikiran,
> > > > > > > >> >
> > > > > > > >> > The fix you provided is probably unnecessary. The channel
> > that
> > > > we
> > > > > > use
> > > > > > > in
> > > > > > > >> > SimpleConsumer (BlockingChannel) is configured to be
> > blocking.
> > > > So
> > > > > > even
> > > > > > > >> > though the read from the socket is in a loop, each read
> > blocks
> > > > if
> > > > > > > there
> > > > > > > >> is
> > > > > > > >> > no bytes received from the broker. So, that shouldn't
> cause
> > > > extra
> > > > > > CPU
> > > > > > > >> > consumption.
> > > > > > > >> Hi Jun,
> > > > > > > >>
> > > > > > > >> Of course, you are right! I forgot that while reading the
> > thread
> > > > > dump
> > > > > > in
> > > > > > > >> hprof output, one has to be aware that the thread state
> isn't
> > > > shown
> > > > > > and
> > > > > > > >> the thread need not necessarily be doing any CPU activity.
> > > > > > > >>
> > > > > > > >> -Jaikiran
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> >
> > > > > > > >> > Thanks,
> > > > > > > >> >
> > > > > > > >> > Jun
> > > > > > > >> >
> > > > > > > >> > On Mon, Jan 26, 2015 at 10:05 AM, Mathias Söderberg <
> > > > > > > >> > mathias.soederberg@gmail.com> wrote:
> > > > > > > >> >
> > > > > > > >> >> Hi Neha,
> > > > > > > >> >>
> > > > > > > >> >> I sent an e-mail earlier today, but noticed now that it
> > > didn't
> > > > > > > >> actually go
> > > > > > > >> >> through.
> > > > > > > >> >>
> > > > > > > >> >> Anyhow, I've attached two files, one with output from a
> 10
> > > > minute
> > > > > > run
> > > > > > > >> and
> > > > > > > >> >> one with output from a 30 minute run. Realized that
> maybe I
> > > > > > should've
> > > > > > > >> done
> > > > > > > >> >> one or two runs with 0.8.1.1 as well, but nevertheless.
> > > > > > > >> >>
> > > > > > > >> >> I upgraded our staging cluster to 0.8.2.0-rc2, and I'm
> > seeing
> > > > the
> > > > > > > same
> > > > > > > >> CPU
> > > > > > > >> >> usage as with the beta version (basically pegging all
> > cores).
> > > > If
> > > > > I
> > > > > > > >> manage
> > > > > > > >> >> to find the time I'll do another run with hprof on the
> rc2
> > > > > version
> > > > > > > >> later
> > > > > > > >> >> today.
> > > > > > > >> >>
> > > > > > > >> >> Best regards,
> > > > > > > >> >> Mathias
> > > > > > > >> >>
> > > > > > > >> >> On Tue Dec 09 2014 at 10:08:21 PM Neha Narkhede <
> > > > > neha@confluent.io
> > > > > > >
> > > > > > > >> wrote:
> > > > > > > >> >>
> > > > > > > >> >>> The following should be sufficient
> > > > > > > >> >>>
> > > > > > > >> >>> java
> > > > > > > >> >>>
> -agentlib:hprof=cpu=samples,depth=100,interval=20,lineno=
> > > > > > > >> >>> y,thread=y,file=kafka.hprof
> > > > > > > >> >>> <classname>
> > > > > > > >> >>>
> > > > > > > >> >>> You would need to start the Kafka server with the
> settings
> > > > above
> > > > > > for
> > > > > > > >> >>> sometime until you observe the problem.
> > > > > > > >> >>>
> > > > > > > >> >>> On Tue, Dec 9, 2014 at 3:47 AM, Mathias Söderberg <
> > > > > > > >> >>> mathias.soederberg@gmail.com> wrote:
> > > > > > > >> >>>
> > > > > > > >> >>>> Hi Neha,
> > > > > > > >> >>>>
> > > > > > > >> >>>> Yeah sure. I'm not familiar with hprof, so any
> particular
> > > > > > options I
> > > > > > > >> >>> should
> > > > > > > >> >>>> include or just run with defaults?
> > > > > > > >> >>>>
> > > > > > > >> >>>> Best regards,
> > > > > > > >> >>>> Mathias
> > > > > > > >> >>>>
> > > > > > > >> >>>> On Mon Dec 08 2014 at 7:41:32 PM Neha Narkhede <
> > > > > > neha@confluent.io>
> > > > > > > >> >>> wrote:
> > > > > > > >> >>>>> Thanks for reporting the issue. Would you mind running
> > > hprof
> > > > > and
> > > > > > > >> >>> sending
> > > > > > > >> >>>>> the output?
> > > > > > > >> >>>>>
> > > > > > > >> >>>>> On Mon, Dec 8, 2014 at 1:25 AM, Mathias Söderberg <
> > > > > > > >> >>>>> mathias.soederberg@gmail.com> wrote:
> > > > > > > >> >>>>>
> > > > > > > >> >>>>>> Good day,
> > > > > > > >> >>>>>>
> > > > > > > >> >>>>>> I upgraded a Kafka cluster from v0.8.1.1 to
> v0.8.2-beta
> > > and
> > > > > > > noticed
> > > > > > > >> >>>> that
> > > > > > > >> >>>>>> the CPU usage on the broker machines went up by
> roughly
> > > > 40%,
> > > > > > from
> > > > > > > >> >>> ~60%
> > > > > > > >> >>>> to
> > > > > > > >> >>>>>> ~100% and am wondering if anyone else has experienced
> > > > > something
> > > > > > > >> >>>> similar?
> > > > > > > >> >>>>>> The load average also went up by 2x-3x.
> > > > > > > >> >>>>>>
> > > > > > > >> >>>>>> We're running on EC2 and the cluster currently
> consists
> > > of
> > > > > four
> > > > > > > >> >>>>> m1.xlarge,
> > > > > > > >> >>>>>> with roughly 1100 topics / 4000 partitions. Using
> Java
> > 7
> > > > > > > (1.7.0_65
> > > > > > > >> >>> to
> > > > > > > >> >>>> be
> > > > > > > >> >>>>>> exact) and Scala 2.9.2. Configurations can be found
> > over
> > > > > here:
> > > > > > > >> >>>>>>
> > https://gist.github.com/mthssdrbrg/7df34a795e07eef10262.
> > > > > > > >> >>>>>>
> > > > > > > >> >>>>>> I'm assuming that this is not expected behaviour for
> > > > > > 0.8.2-beta?
> > > > > > > >> >>>>>>
> > > > > > > >> >>>>>> Best regards,
> > > > > > > >> >>>>>> Mathias
> > > > > > > >> >>>>>>
> > > > > > > >> >>>>>
> > > > > > > >> >>>>>
> > > > > > > >> >>>>> --
> > > > > > > >> >>>>> Thanks,
> > > > > > > >> >>>>> Neha
> > > > > > > >> >>>>>
> > > > > > > >> >>>
> > > > > > > >> >>>
> > > > > > > >> >>> --
> > > > > > > >> >>> Thanks,
> > > > > > > >> >>> Neha
> > > > > > > >> >>>
> > > > > > > >>
> > > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Increased CPU usage with 0.8.2-beta

Posted by Jun Rao <ju...@confluent.io>.
Mathias, Solon,

We did identify a CPU issue and patched it in
https://issues.apache.org/jira/browse/KAFKA-1952. Could you apply the patch
in the 0.8.2 branch and see if that addresses the issue?
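
Roughly, the steps would be something along these lines (the patch file
name is just an example -- use whatever attachment the JIRA ticket actually
carries, and see the README in the source tree for the exact build target):

  git clone https://github.com/apache/kafka.git
  cd kafka
  git checkout 0.8.2            # the 0.8.2 release branch
  git apply KAFKA-1952.patch    # patch file downloaded from the JIRA ticket
  ./gradlew jar                 # rebuild (you may need to bootstrap the gradle wrapper first)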

Thanks,

Jun

On Fri, Feb 13, 2015 at 3:26 PM, Jay Kreps <ja...@gmail.com> wrote:

> We can reproduce this issue, have a theory as to the cause, and are working
> on a fix. Here is the ticket to track it:
> https://issues.apache.org/jira/browse/KAFKA-1952
>
> I would recommend people hold off on 0.8.2 upgrades until we have a handle
> on this.
>
> -Jay
>
> On Fri, Feb 13, 2015 at 1:47 PM, Solon Gordon <so...@knewton.com> wrote:
>
> > The partitions nearly all have replication factor 2 (a few stray ones
> have
> > 1), and our producers use request.required.acks=-1. However, I should
> note
> > there were hardly any messages being produced when I did the upgrade and
> > observed the high CPU load.
> >
> > I should have time to do some profiling on Monday and will get back to
> you
> > with the results.
> >
> > On Fri, Feb 13, 2015 at 1:00 PM, Jun Rao <ju...@confluent.io> wrote:
> >
> > > Solon,
> > >
> > > What's the replication factor you used for those partitions? What's the
> > > producer ack that you used? Also, could you do a bit of profiling on
> the
> > > broker to see which methods used the most CPU?
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Thu, Feb 12, 2015 at 3:19 PM, Solon Gordon <so...@knewton.com>
> wrote:
> > >
> > > > I saw a very similar jump in CPU usage when I tried upgrading from
> > > 0.8.1.1
> > > > to 0.8.2.0 today in a test environment. The Kafka cluster there is
> two
> > > > m1.larges handling 2,000 partitions across 32 topics. CPU usage rose
> > from
> > > > 40% into the 150%–190% range, and load average from under 1 to over
> 4.
> > > > Downgrading to 0.8.1.1 brought the CPU and load back to the previous
> > > > values.
> > > >
> > > > If there's more info that would be helpful, please let me know.
> > > >
> > > > On Thu, Feb 12, 2015 at 4:17 PM, Mathias Söderberg <
> > > > mathias.soederberg@gmail.com> wrote:
> > > >
> > > > > Jun,
> > > > >
> > > > > Pardon the radio silence. I booted up a new broker, created a topic
> > > with
> > > > > three (3) partitions and replication factor one (1) and used the
> > > > > *kafka-producer-perf-test.sh
> > > > > *script to generate load (using messages of roughly the same size
> as
> > > > ours).
> > > > > There was a slight increase in CPU usage (~5-10%) on 0.8.2.0-rc2
> > > compared
> > > > > to 0.8.1.1, but that was about it.
> > > > >
> > > > > I upgraded our staging cluster to 0.8.2.0 earlier this week or so,
> > and
> > > > had
> > > > > to add an additional broker due to increased load after the upgrade
> > > (note
> > > > > that the incoming load on the cluster has been pretty much
> > consistent).
> > > > > Since the upgrade we've been seeing an 2-3x increase in latency as
> > > well.
> > > > > I'm considering downgrading to 0.8.1.1 again to see if it resolves
> > our
> > > > > issues.
> > > > >
> > > > > Best regards,
> > > > > Mathias
> > > > >
> > > > > On Tue Feb 03 2015 at 6:44:36 PM Jun Rao <ju...@confluent.io> wrote:
> > > > >
> > > > > > Mathias,
> > > > > >
> > > > > > The new hprof doesn't reveal anything new to me. We did fix the
> > logic
> > > > in
> > > > > > using Purgatory in 0.8.2, which could potentially drive up the
> CPU
> > > > usage
> > > > > a
> > > > > > bit. To verify that, could you do your test on a single broker
> > (with
> > > > > > replication factor 1) btw 0.8.1 and 0.8.2 and see if there is any
> > > > > > significant difference in cpu usage?
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Jun
> > > > > >
> > > > > > On Tue, Feb 3, 2015 at 5:09 AM, Mathias Söderberg <
> > > > > > mathias.soederberg@gmail.com> wrote:
> > > > > >
> > > > > > > Jun,
> > > > > > >
> > > > > > > I re-ran the hprof test, for about 30 minutes again, for
> > > 0.8.2.0-rc2
> > > > > with
> > > > > > > the same version of snappy that 0.8.1.1 used. Attached the
> logs.
> > > > > > > Unfortunately there wasn't any improvement as the node running
> > > > > > 0.8.2.0-rc2
> > > > > > > still had a higher load and CPU usage.
> > > > > > >
> > > > > > > Best regards,
> > > > > > > Mathias
> > > > > > >
> > > > > > > On Tue Feb 03 2015 at 4:40:31 AM Jaikiran Pai <
> > > > > jai.forums2013@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > >> On Monday 02 February 2015 11:03 PM, Jun Rao wrote:
> > > > > > >> > Jaikiran,
> > > > > > >> >
> > > > > > >> > The fix you provided is probably unnecessary. The channel
> that
> > > we
> > > > > use
> > > > > > in
> > > > > > >> > SimpleConsumer (BlockingChannel) is configured to be
> blocking.
> > > So
> > > > > even
> > > > > > >> > though the read from the socket is in a loop, each read
> blocks
> > > if
> > > > > > there
> > > > > > >> is
> > > > > > >> > no bytes received from the broker. So, that shouldn't cause
> > > extra
> > > > > CPU
> > > > > > >> > consumption.
> > > > > > >> Hi Jun,
> > > > > > >>
> > > > > > >> Of course, you are right! I forgot that while reading the
> thread
> > > > dump
> > > > > in
> > > > > > >> hprof output, one has to be aware that the thread state isn't
> > > shown
> > > > > and
> > > > > > >> the thread need not necessarily be doing any CPU activity.
> > > > > > >>
> > > > > > >> -Jaikiran
> > > > > > >>
> > > > > > >>
> > > > > > >> >
> > > > > > >> > Thanks,
> > > > > > >> >
> > > > > > >> > Jun
> > > > > > >> >
> > > > > > >> > On Mon, Jan 26, 2015 at 10:05 AM, Mathias Söderberg <
> > > > > > >> > mathias.soederberg@gmail.com> wrote:
> > > > > > >> >
> > > > > > >> >> Hi Neha,
> > > > > > >> >>
> > > > > > >> >> I sent an e-mail earlier today, but noticed now that it
> > didn't
> > > > > > >> actually go
> > > > > > >> >> through.
> > > > > > >> >>
> > > > > > >> >> Anyhow, I've attached two files, one with output from a 10
> > > minute
> > > > > run
> > > > > > >> and
> > > > > > >> >> one with output from a 30 minute run. Realized that maybe I
> > > > > should've
> > > > > > >> done
> > > > > > >> >> one or two runs with 0.8.1.1 as well, but nevertheless.
> > > > > > >> >>
> > > > > > >> >> I upgraded our staging cluster to 0.8.2.0-rc2, and I'm
> seeing
> > > the
> > > > > > same
> > > > > > >> CPU
> > > > > > >> >> usage as with the beta version (basically pegging all
> cores).
> > > If
> > > > I
> > > > > > >> manage
> > > > > > >> >> to find the time I'll do another run with hprof on the rc2
> > > > version
> > > > > > >> later
> > > > > > >> >> today.
> > > > > > >> >>
> > > > > > >> >> Best regards,
> > > > > > >> >> Mathias
> > > > > > >> >>
> > > > > > >> >> On Tue Dec 09 2014 at 10:08:21 PM Neha Narkhede <
> > > > neha@confluent.io
> > > > > >
> > > > > > >> wrote:
> > > > > > >> >>
> > > > > > >> >>> The following should be sufficient
> > > > > > >> >>>
> > > > > > >> >>> java
> > > > > > >> >>> -agentlib:hprof=cpu=samples,depth=100,interval=20,lineno=
> > > > > > >> >>> y,thread=y,file=kafka.hprof
> > > > > > >> >>> <classname>
> > > > > > >> >>>
> > > > > > >> >>> You would need to start the Kafka server with the settings
> > > above
> > > > > for
> > > > > > >> >>> sometime until you observe the problem.
> > > > > > >> >>>
> > > > > > >> >>> On Tue, Dec 9, 2014 at 3:47 AM, Mathias Söderberg <
> > > > > > >> >>> mathias.soederberg@gmail.com> wrote:
> > > > > > >> >>>
> > > > > > >> >>>> Hi Neha,
> > > > > > >> >>>>
> > > > > > >> >>>> Yeah sure. I'm not familiar with hprof, so any particular
> > > > > options I
> > > > > > >> >>> should
> > > > > > >> >>>> include or just run with defaults?
> > > > > > >> >>>>
> > > > > > >> >>>> Best regards,
> > > > > > >> >>>> Mathias
> > > > > > >> >>>>
> > > > > > >> >>>> On Mon Dec 08 2014 at 7:41:32 PM Neha Narkhede <
> > > > > neha@confluent.io>
> > > > > > >> >>> wrote:
> > > > > > >> >>>>> Thanks for reporting the issue. Would you mind running
> > hprof
> > > > and
> > > > > > >> >>> sending
> > > > > > >> >>>>> the output?
> > > > > > >> >>>>>
> > > > > > >> >>>>> On Mon, Dec 8, 2014 at 1:25 AM, Mathias Söderberg <
> > > > > > >> >>>>> mathias.soederberg@gmail.com> wrote:
> > > > > > >> >>>>>
> > > > > > >> >>>>>> Good day,
> > > > > > >> >>>>>>
> > > > > > >> >>>>>> I upgraded a Kafka cluster from v0.8.1.1 to v0.8.2-beta
> > and
> > > > > > noticed
> > > > > > >> >>>> that
> > > > > > >> >>>>>> the CPU usage on the broker machines went up by roughly
> > > 40%,
> > > > > from
> > > > > > >> >>> ~60%
> > > > > > >> >>>> to
> > > > > > >> >>>>>> ~100% and am wondering if anyone else has experienced
> > > > something
> > > > > > >> >>>> similar?
> > > > > > >> >>>>>> The load average also went up by 2x-3x.
> > > > > > >> >>>>>>
> > > > > > >> >>>>>> We're running on EC2 and the cluster currently consists
> > of
> > > > four
> > > > > > >> >>>>> m1.xlarge,
> > > > > > >> >>>>>> with roughly 1100 topics / 4000 partitions. Using Java
> 7
> > > > > > (1.7.0_65
> > > > > > >> >>> to
> > > > > > >> >>>> be
> > > > > > >> >>>>>> exact) and Scala 2.9.2. Configurations can be found
> over
> > > > here:
> > > > > > >> >>>>>>
> https://gist.github.com/mthssdrbrg/7df34a795e07eef10262.
> > > > > > >> >>>>>>
> > > > > > >> >>>>>> I'm assuming that this is not expected behaviour for
> > > > > 0.8.2-beta?
> > > > > > >> >>>>>>
> > > > > > >> >>>>>> Best regards,
> > > > > > >> >>>>>> Mathias
> > > > > > >> >>>>>>
> > > > > > >> >>>>>
> > > > > > >> >>>>>
> > > > > > >> >>>>> --
> > > > > > >> >>>>> Thanks,
> > > > > > >> >>>>> Neha
> > > > > > >> >>>>>
> > > > > > >> >>>
> > > > > > >> >>>
> > > > > > >> >>> --
> > > > > > >> >>> Thanks,
> > > > > > >> >>> Neha
> > > > > > >> >>>
> > > > > > >>
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Increased CPU usage with 0.8.2-beta

Posted by Jay Kreps <ja...@gmail.com>.
We can reproduce this issue, have a theory as to the cause, and are working
on a fix. Here is the ticket to track it:
https://issues.apache.org/jira/browse/KAFKA-1952

I would recommend people hold off on 0.8.2 upgrades until we have a handle
on this.

-Jay

On Fri, Feb 13, 2015 at 1:47 PM, Solon Gordon <so...@knewton.com> wrote:

> The partitions nearly all have replication factor 2 (a few stray ones have
> 1), and our producers use request.required.acks=-1. However, I should note
> there were hardly any messages being produced when I did the upgrade and
> observed the high CPU load.
>
> I should have time to do some profiling on Monday and will get back to you
> with the results.
>
> On Fri, Feb 13, 2015 at 1:00 PM, Jun Rao <ju...@confluent.io> wrote:
>
> > Solon,
> >
> > What's the replication factor you used for those partitions? What's the
> > producer ack that you used? Also, could you do a bit of profiling on the
> > broker to see which methods used the most CPU?
> >
> > Thanks,
> >
> > Jun
> >
> > On Thu, Feb 12, 2015 at 3:19 PM, Solon Gordon <so...@knewton.com> wrote:
> >
> > > I saw a very similar jump in CPU usage when I tried upgrading from
> > 0.8.1.1
> > > to 0.8.2.0 today in a test environment. The Kafka cluster there is two
> > > m1.larges handling 2,000 partitions across 32 topics. CPU usage rose
> from
> > > 40% into the 150%–190% range, and load average from under 1 to over 4.
> > > Downgrading to 0.8.1.1 brought the CPU and load back to the previous
> > > values.
> > >
> > > If there's more info that would be helpful, please let me know.
> > >
> > > On Thu, Feb 12, 2015 at 4:17 PM, Mathias Söderberg <
> > > mathias.soederberg@gmail.com> wrote:
> > >
> > > > Jun,
> > > >
> > > > Pardon the radio silence. I booted up a new broker, created a topic
> > with
> > > > three (3) partitions and replication factor one (1) and used the
> > > > *kafka-producer-perf-test.sh
> > > > *script to generate load (using messages of roughly the same size as
> > > ours).
> > > > There was a slight increase in CPU usage (~5-10%) on 0.8.2.0-rc2
> > compared
> > > > to 0.8.1.1, but that was about it.
> > > >
> > > > I upgraded our staging cluster to 0.8.2.0 earlier this week or so,
> and
> > > had
> > > > to add an additional broker due to increased load after the upgrade
> > (note
> > > > that the incoming load on the cluster has been pretty much
> consistent).
> > > > Since the upgrade we've been seeing an 2-3x increase in latency as
> > well.
> > > > I'm considering downgrading to 0.8.1.1 again to see if it resolves
> our
> > > > issues.
> > > >
> > > > Best regards,
> > > > Mathias
> > > >
> > > > On Tue Feb 03 2015 at 6:44:36 PM Jun Rao <ju...@confluent.io> wrote:
> > > >
> > > > > Mathias,
> > > > >
> > > > > The new hprof doesn't reveal anything new to me. We did fix the
> logic
> > > in
> > > > > using Purgatory in 0.8.2, which could potentially drive up the CPU
> > > usage
> > > > a
> > > > > bit. To verify that, could you do your test on a single broker
> (with
> > > > > replication factor 1) btw 0.8.1 and 0.8.2 and see if there is any
> > > > > significant difference in cpu usage?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > > On Tue, Feb 3, 2015 at 5:09 AM, Mathias Söderberg <
> > > > > mathias.soederberg@gmail.com> wrote:
> > > > >
> > > > > > Jun,
> > > > > >
> > > > > > I re-ran the hprof test, for about 30 minutes again, for
> > 0.8.2.0-rc2
> > > > with
> > > > > > the same version of snappy that 0.8.1.1 used. Attached the logs.
> > > > > > Unfortunately there wasn't any improvement as the node running
> > > > > 0.8.2.0-rc2
> > > > > > still had a higher load and CPU usage.
> > > > > >
> > > > > > Best regards,
> > > > > > Mathias
> > > > > >
> > > > > > On Tue Feb 03 2015 at 4:40:31 AM Jaikiran Pai <
> > > > jai.forums2013@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > >> On Monday 02 February 2015 11:03 PM, Jun Rao wrote:
> > > > > >> > Jaikiran,
> > > > > >> >
> > > > > >> > The fix you provided is probably unnecessary. The channel that
> > we
> > > > use
> > > > > in
> > > > > >> > SimpleConsumer (BlockingChannel) is configured to be blocking.
> > So
> > > > even
> > > > > >> > though the read from the socket is in a loop, each read blocks
> > if
> > > > > there
> > > > > >> is
> > > > > >> > no bytes received from the broker. So, that shouldn't cause
> > extra
> > > > CPU
> > > > > >> > consumption.
> > > > > >> Hi Jun,
> > > > > >>
> > > > > >> Of course, you are right! I forgot that while reading the thread
> > > dump
> > > > in
> > > > > >> hprof output, one has to be aware that the thread state isn't
> > shown
> > > > and
> > > > > >> the thread need not necessarily be doing any CPU activity.
> > > > > >>
> > > > > >> -Jaikiran
> > > > > >>
> > > > > >>
> > > > > >> >
> > > > > >> > Thanks,
> > > > > >> >
> > > > > >> > Jun
> > > > > >> >
> > > > > >> > On Mon, Jan 26, 2015 at 10:05 AM, Mathias Söderberg <
> > > > > >> > mathias.soederberg@gmail.com> wrote:
> > > > > >> >
> > > > > >> >> Hi Neha,
> > > > > >> >>
> > > > > >> >> I sent an e-mail earlier today, but noticed now that it
> didn't
> > > > > >> actually go
> > > > > >> >> through.
> > > > > >> >>
> > > > > >> >> Anyhow, I've attached two files, one with output from a 10
> > minute
> > > > run
> > > > > >> and
> > > > > >> >> one with output from a 30 minute run. Realized that maybe I
> > > > should've
> > > > > >> done
> > > > > >> >> one or two runs with 0.8.1.1 as well, but nevertheless.
> > > > > >> >>
> > > > > >> >> I upgraded our staging cluster to 0.8.2.0-rc2, and I'm seeing
> > the
> > > > > same
> > > > > >> CPU
> > > > > >> >> usage as with the beta version (basically pegging all cores).
> > If
> > > I
> > > > > >> manage
> > > > > >> >> to find the time I'll do another run with hprof on the rc2
> > > version
> > > > > >> later
> > > > > >> >> today.
> > > > > >> >>
> > > > > >> >> Best regards,
> > > > > >> >> Mathias
> > > > > >> >>
> > > > > >> >> On Tue Dec 09 2014 at 10:08:21 PM Neha Narkhede <
> > > neha@confluent.io
> > > > >
> > > > > >> wrote:
> > > > > >> >>
> > > > > >> >>> The following should be sufficient
> > > > > >> >>>
> > > > > >> >>> java
> > > > > >> >>> -agentlib:hprof=cpu=samples,depth=100,interval=20,lineno=
> > > > > >> >>> y,thread=y,file=kafka.hprof
> > > > > >> >>> <classname>
> > > > > >> >>>
> > > > > >> >>> You would need to start the Kafka server with the settings
> > above
> > > > for
> > > > > >> >>> sometime until you observe the problem.
> > > > > >> >>>
> > > > > >> >>> On Tue, Dec 9, 2014 at 3:47 AM, Mathias Söderberg <
> > > > > >> >>> mathias.soederberg@gmail.com> wrote:
> > > > > >> >>>
> > > > > >> >>>> Hi Neha,
> > > > > >> >>>>
> > > > > >> >>>> Yeah sure. I'm not familiar with hprof, so any particular
> > > > options I
> > > > > >> >>> should
> > > > > >> >>>> include or just run with defaults?
> > > > > >> >>>>
> > > > > >> >>>> Best regards,
> > > > > >> >>>> Mathias
> > > > > >> >>>>
> > > > > >> >>>> On Mon Dec 08 2014 at 7:41:32 PM Neha Narkhede <
> > > > neha@confluent.io>
> > > > > >> >>> wrote:
> > > > > >> >>>>> Thanks for reporting the issue. Would you mind running
> hprof
> > > and
> > > > > >> >>> sending
> > > > > >> >>>>> the output?
> > > > > >> >>>>>
> > > > > >> >>>>> On Mon, Dec 8, 2014 at 1:25 AM, Mathias Söderberg <
> > > > > >> >>>>> mathias.soederberg@gmail.com> wrote:
> > > > > >> >>>>>
> > > > > >> >>>>>> Good day,
> > > > > >> >>>>>>
> > > > > >> >>>>>> I upgraded a Kafka cluster from v0.8.1.1 to v0.8.2-beta
> and
> > > > > noticed
> > > > > >> >>>> that
> > > > > >> >>>>>> the CPU usage on the broker machines went up by roughly
> > 40%,
> > > > from
> > > > > >> >>> ~60%
> > > > > >> >>>> to
> > > > > >> >>>>>> ~100% and am wondering if anyone else has experienced
> > > something
> > > > > >> >>>> similar?
> > > > > >> >>>>>> The load average also went up by 2x-3x.
> > > > > >> >>>>>>
> > > > > >> >>>>>> We're running on EC2 and the cluster currently consists
> of
> > > four
> > > > > >> >>>>> m1.xlarge,
> > > > > >> >>>>>> with roughly 1100 topics / 4000 partitions. Using Java 7
> > > > > (1.7.0_65
> > > > > >> >>> to
> > > > > >> >>>> be
> > > > > >> >>>>>> exact) and Scala 2.9.2. Configurations can be found over
> > > here:
> > > > > >> >>>>>> https://gist.github.com/mthssdrbrg/7df34a795e07eef10262.
> > > > > >> >>>>>>
> > > > > >> >>>>>> I'm assuming that this is not expected behaviour for
> > > > 0.8.2-beta?
> > > > > >> >>>>>>
> > > > > >> >>>>>> Best regards,
> > > > > >> >>>>>> Mathias
> > > > > >> >>>>>>
> > > > > >> >>>>>
> > > > > >> >>>>>
> > > > > >> >>>>> --
> > > > > >> >>>>> Thanks,
> > > > > >> >>>>> Neha
> > > > > >> >>>>>
> > > > > >> >>>
> > > > > >> >>>
> > > > > >> >>> --
> > > > > >> >>> Thanks,
> > > > > >> >>> Neha
> > > > > >> >>>
> > > > > >>
> > > > > >>
> > > > >
> > > >
> > >
> >
>

Re: Increased CPU usage with 0.8.2-beta

Posted by Solon Gordon <so...@knewton.com>.
The partitions nearly all have replication factor 2 (a few stray ones have
1), and our producers use request.required.acks=-1. However, I should note
there were hardly any messages being produced when I did the upgrade and
observed the high CPU load.
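
For reference, the relevant part of our producer config is essentially just
the following (broker hostnames are placeholders; property names are the
standard 0.8.x producer ones):

  metadata.broker.list=broker1:9092,broker2:9092
  request.required.acks=-1    # wait for the full ISR to ack each request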

I should have time to do some profiling on Monday and will get back to you
with the results.

On Fri, Feb 13, 2015 at 1:00 PM, Jun Rao <ju...@confluent.io> wrote:

> Solon,
>
> What's the replication factor you used for those partitions? What's the
> producer ack that you used? Also, could you do a bit of profiling on the
> broker to see which methods used the most CPU?
>
> Thanks,
>
> Jun
>
> On Thu, Feb 12, 2015 at 3:19 PM, Solon Gordon <so...@knewton.com> wrote:
>
> > I saw a very similar jump in CPU usage when I tried upgrading from
> 0.8.1.1
> > to 0.8.2.0 today in a test environment. The Kafka cluster there is two
> > m1.larges handling 2,000 partitions across 32 topics. CPU usage rose from
> > 40% into the 150%–190% range, and load average from under 1 to over 4.
> > Downgrading to 0.8.1.1 brought the CPU and load back to the previous
> > values.
> >
> > If there's more info that would be helpful, please let me know.
> >
> > On Thu, Feb 12, 2015 at 4:17 PM, Mathias Söderberg <
> > mathias.soederberg@gmail.com> wrote:
> >
> > > Jun,
> > >
> > > Pardon the radio silence. I booted up a new broker, created a topic
> with
> > > three (3) partitions and replication factor one (1) and used the
> > > *kafka-producer-perf-test.sh
> > > *script to generate load (using messages of roughly the same size as
> > ours).
> > > There was a slight increase in CPU usage (~5-10%) on 0.8.2.0-rc2
> compared
> > > to 0.8.1.1, but that was about it.
> > >
> > > I upgraded our staging cluster to 0.8.2.0 earlier this week or so, and
> > had
> > > to add an additional broker due to increased load after the upgrade
> (note
> > > that the incoming load on the cluster has been pretty much consistent).
> > > Since the upgrade we've been seeing an 2-3x increase in latency as
> well.
> > > I'm considering downgrading to 0.8.1.1 again to see if it resolves our
> > > issues.
> > >
> > > Best regards,
> > > Mathias
> > >
> > > On Tue Feb 03 2015 at 6:44:36 PM Jun Rao <ju...@confluent.io> wrote:
> > >
> > > > Mathias,
> > > >
> > > > The new hprof doesn't reveal anything new to me. We did fix the logic
> > in
> > > > using Purgatory in 0.8.2, which could potentially drive up the CPU
> > usage
> > > a
> > > > bit. To verify that, could you do your test on a single broker (with
> > > > replication factor 1) btw 0.8.1 and 0.8.2 and see if there is any
> > > > significant difference in cpu usage?
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Tue, Feb 3, 2015 at 5:09 AM, Mathias Söderberg <
> > > > mathias.soederberg@gmail.com> wrote:
> > > >
> > > > > Jun,
> > > > >
> > > > > I re-ran the hprof test, for about 30 minutes again, for
> 0.8.2.0-rc2
> > > with
> > > > > the same version of snappy that 0.8.1.1 used. Attached the logs.
> > > > > Unfortunately there wasn't any improvement as the node running
> > > > 0.8.2.0-rc2
> > > > > still had a higher load and CPU usage.
> > > > >
> > > > > Best regards,
> > > > > Mathias
> > > > >
> > > > > On Tue Feb 03 2015 at 4:40:31 AM Jaikiran Pai <
> > > jai.forums2013@gmail.com>
> > > > > wrote:
> > > > >
> > > > >> On Monday 02 February 2015 11:03 PM, Jun Rao wrote:
> > > > >> > Jaikiran,
> > > > >> >
> > > > >> > The fix you provided is probably unnecessary. The channel that
> we
> > > use
> > > > in
> > > > >> > SimpleConsumer (BlockingChannel) is configured to be blocking.
> So
> > > even
> > > > >> > though the read from the socket is in a loop, each read blocks
> if
> > > > there
> > > > >> is
> > > > >> > no bytes received from the broker. So, that shouldn't cause
> extra
> > > CPU
> > > > >> > consumption.
> > > > >> Hi Jun,
> > > > >>
> > > > >> Of course, you are right! I forgot that while reading the thread
> > dump
> > > in
> > > > >> hprof output, one has to be aware that the thread state isn't
> shown
> > > and
> > > > >> the thread need not necessarily be doing any CPU activity.
> > > > >>
> > > > >> -Jaikiran
> > > > >>
> > > > >>
> > > > >> >
> > > > >> > Thanks,
> > > > >> >
> > > > >> > Jun
> > > > >> >
> > > > >> > On Mon, Jan 26, 2015 at 10:05 AM, Mathias Söderberg <
> > > > >> > mathias.soederberg@gmail.com> wrote:
> > > > >> >
> > > > >> >> Hi Neha,
> > > > >> >>
> > > > >> >> I sent an e-mail earlier today, but noticed now that it didn't
> > > > >> actually go
> > > > >> >> through.
> > > > >> >>
> > > > >> >> Anyhow, I've attached two files, one with output from a 10
> minute
> > > run
> > > > >> and
> > > > >> >> one with output from a 30 minute run. Realized that maybe I
> > > should've
> > > > >> done
> > > > >> >> one or two runs with 0.8.1.1 as well, but nevertheless.
> > > > >> >>
> > > > >> >> I upgraded our staging cluster to 0.8.2.0-rc2, and I'm seeing
> the
> > > > same
> > > > >> CPU
> > > > >> >> usage as with the beta version (basically pegging all cores).
> If
> > I
> > > > >> manage
> > > > >> >> to find the time I'll do another run with hprof on the rc2
> > version
> > > > >> later
> > > > >> >> today.
> > > > >> >>
> > > > >> >> Best regards,
> > > > >> >> Mathias
> > > > >> >>
> > > > >> >> On Tue Dec 09 2014 at 10:08:21 PM Neha Narkhede <
> > neha@confluent.io
> > > >
> > > > >> wrote:
> > > > >> >>
> > > > >> >>> The following should be sufficient
> > > > >> >>>
> > > > >> >>> java
> > > > >> >>> -agentlib:hprof=cpu=samples,depth=100,interval=20,lineno=
> > > > >> >>> y,thread=y,file=kafka.hprof
> > > > >> >>> <classname>
> > > > >> >>>
> > > > >> >>> You would need to start the Kafka server with the settings
> above
> > > for
> > > > >> >>> sometime until you observe the problem.
> > > > >> >>>
> > > > >> >>> On Tue, Dec 9, 2014 at 3:47 AM, Mathias Söderberg <
> > > > >> >>> mathias.soederberg@gmail.com> wrote:
> > > > >> >>>
> > > > >> >>>> Hi Neha,
> > > > >> >>>>
> > > > >> >>>> Yeah sure. I'm not familiar with hprof, so any particular
> > > options I
> > > > >> >>> should
> > > > >> >>>> include or just run with defaults?
> > > > >> >>>>
> > > > >> >>>> Best regards,
> > > > >> >>>> Mathias
> > > > >> >>>>
> > > > >> >>>> On Mon Dec 08 2014 at 7:41:32 PM Neha Narkhede <
> > > neha@confluent.io>
> > > > >> >>> wrote:
> > > > >> >>>>> Thanks for reporting the issue. Would you mind running hprof
> > and
> > > > >> >>> sending
> > > > >> >>>>> the output?
> > > > >> >>>>>
> > > > >> >>>>> On Mon, Dec 8, 2014 at 1:25 AM, Mathias Söderberg <
> > > > >> >>>>> mathias.soederberg@gmail.com> wrote:
> > > > >> >>>>>
> > > > >> >>>>>> Good day,
> > > > >> >>>>>>
> > > > >> >>>>>> I upgraded a Kafka cluster from v0.8.1.1 to v0.8.2-beta and
> > > > noticed
> > > > >> >>>> that
> > > > >> >>>>>> the CPU usage on the broker machines went up by roughly
> 40%,
> > > from
> > > > >> >>> ~60%
> > > > >> >>>> to
> > > > >> >>>>>> ~100% and am wondering if anyone else has experienced
> > something
> > > > >> >>>> similar?
> > > > >> >>>>>> The load average also went up by 2x-3x.
> > > > >> >>>>>>
> > > > >> >>>>>> We're running on EC2 and the cluster currently consists of
> > four
> > > > >> >>>>> m1.xlarge,
> > > > >> >>>>>> with roughly 1100 topics / 4000 partitions. Using Java 7
> > > > (1.7.0_65
> > > > >> >>> to
> > > > >> >>>> be
> > > > >> >>>>>> exact) and Scala 2.9.2. Configurations can be found over
> > here:
> > > > >> >>>>>> https://gist.github.com/mthssdrbrg/7df34a795e07eef10262.
> > > > >> >>>>>>
> > > > >> >>>>>> I'm assuming that this is not expected behaviour for
> > > 0.8.2-beta?
> > > > >> >>>>>>
> > > > >> >>>>>> Best regards,
> > > > >> >>>>>> Mathias
> > > > >> >>>>>>
> > > > >> >>>>>
> > > > >> >>>>>
> > > > >> >>>>> --
> > > > >> >>>>> Thanks,
> > > > >> >>>>> Neha
> > > > >> >>>>>
> > > > >> >>>
> > > > >> >>>
> > > > >> >>> --
> > > > >> >>> Thanks,
> > > > >> >>> Neha
> > > > >> >>>
> > > > >>
> > > > >>
> > > >
> > >
> >
>

Re: Increased CPU usage with 0.8.2-beta

Posted by Jun Rao <ju...@confluent.io>.
Solon,

What's the replication factor you used for those partitions? What's the
producer ack that you used? Also, could you do a bit of profiling on the
broker to see which methods used the most CPU?
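
For example, attaching hprof to the broker for a few minutes should be
enough -- something along these lines (paths are relative to the Kafka
installation, adjust as needed):

  export KAFKA_OPTS="-agentlib:hprof=cpu=samples,depth=100,interval=20,lineno=y,thread=y,file=kafka.hprof"
  bin/kafka-server-start.sh config/server.properties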

Thanks,

Jun

On Thu, Feb 12, 2015 at 3:19 PM, Solon Gordon <so...@knewton.com> wrote:

> I saw a very similar jump in CPU usage when I tried upgrading from 0.8.1.1
> to 0.8.2.0 today in a test environment. The Kafka cluster there is two
> m1.larges handling 2,000 partitions across 32 topics. CPU usage rose from
> 40% into the 150%–190% range, and load average from under 1 to over 4.
> Downgrading to 0.8.1.1 brought the CPU and load back to the previous
> values.
>
> If there's more info that would be helpful, please let me know.
>
> On Thu, Feb 12, 2015 at 4:17 PM, Mathias Söderberg <
> mathias.soederberg@gmail.com> wrote:
>
> > Jun,
> >
> > Pardon the radio silence. I booted up a new broker, created a topic with
> > three (3) partitions and replication factor one (1) and used the
> > *kafka-producer-perf-test.sh
> > *script to generate load (using messages of roughly the same size as
> ours).
> > There was a slight increase in CPU usage (~5-10%) on 0.8.2.0-rc2 compared
> > to 0.8.1.1, but that was about it.
> >
> > I upgraded our staging cluster to 0.8.2.0 earlier this week or so, and
> had
> > to add an additional broker due to increased load after the upgrade (note
> > that the incoming load on the cluster has been pretty much consistent).
> > Since the upgrade we've been seeing an 2-3x increase in latency as well.
> > I'm considering downgrading to 0.8.1.1 again to see if it resolves our
> > issues.
> >
> > Best regards,
> > Mathias
> >
> > On Tue Feb 03 2015 at 6:44:36 PM Jun Rao <ju...@confluent.io> wrote:
> >
> > > Mathias,
> > >
> > > The new hprof doesn't reveal anything new to me. We did fix the logic
> in
> > > using Purgatory in 0.8.2, which could potentially drive up the CPU
> usage
> > a
> > > bit. To verify that, could you do your test on a single broker (with
> > > replication factor 1) btw 0.8.1 and 0.8.2 and see if there is any
> > > significant difference in cpu usage?
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Tue, Feb 3, 2015 at 5:09 AM, Mathias Söderberg <
> > > mathias.soederberg@gmail.com> wrote:
> > >
> > > > Jun,
> > > >
> > > > I re-ran the hprof test, for about 30 minutes again, for 0.8.2.0-rc2
> > with
> > > > the same version of snappy that 0.8.1.1 used. Attached the logs.
> > > > Unfortunately there wasn't any improvement as the node running
> > > 0.8.2.0-rc2
> > > > still had a higher load and CPU usage.
> > > >
> > > > Best regards,
> > > > Mathias
> > > >
> > > > On Tue Feb 03 2015 at 4:40:31 AM Jaikiran Pai <
> > jai.forums2013@gmail.com>
> > > > wrote:
> > > >
> > > >> On Monday 02 February 2015 11:03 PM, Jun Rao wrote:
> > > >> > Jaikiran,
> > > >> >
> > > >> > The fix you provided is probably unnecessary. The channel that we
> > use
> > > in
> > > >> > SimpleConsumer (BlockingChannel) is configured to be blocking. So
> > even
> > > >> > though the read from the socket is in a loop, each read blocks if
> > > there
> > > >> is
> > > >> > no bytes received from the broker. So, that shouldn't cause extra
> > CPU
> > > >> > consumption.
> > > >> Hi Jun,
> > > >>
> > > >> Of course, you are right! I forgot that while reading the thread
> dump
> > in
> > > >> hprof output, one has to be aware that the thread state isn't shown
> > and
> > > >> the thread need not necessarily be doing any CPU activity.
> > > >>
> > > >> -Jaikiran
> > > >>
> > > >>
> > > >> >
> > > >> > Thanks,
> > > >> >
> > > >> > Jun
> > > >> >
> > > >> > On Mon, Jan 26, 2015 at 10:05 AM, Mathias Söderberg <
> > > >> > mathias.soederberg@gmail.com> wrote:
> > > >> >
> > > >> >> Hi Neha,
> > > >> >>
> > > >> >> I sent an e-mail earlier today, but noticed now that it didn't
> > > >> actually go
> > > >> >> through.
> > > >> >>
> > > >> >> Anyhow, I've attached two files, one with output from a 10 minute
> > run
> > > >> and
> > > >> >> one with output from a 30 minute run. Realized that maybe I
> > should've
> > > >> done
> > > >> >> one or two runs with 0.8.1.1 as well, but nevertheless.
> > > >> >>
> > > >> >> I upgraded our staging cluster to 0.8.2.0-rc2, and I'm seeing the
> > > same
> > > >> CPU
> > > >> >> usage as with the beta version (basically pegging all cores). If
> I
> > > >> manage
> > > >> >> to find the time I'll do another run with hprof on the rc2
> version
> > > >> later
> > > >> >> today.
> > > >> >>
> > > >> >> Best regards,
> > > >> >> Mathias
> > > >> >>
> > > >> >> On Tue Dec 09 2014 at 10:08:21 PM Neha Narkhede <
> neha@confluent.io
> > >
> > > >> wrote:
> > > >> >>
> > > >> >>> The following should be sufficient
> > > >> >>>
> > > >> >>> java
> > > >> >>> -agentlib:hprof=cpu=samples,depth=100,interval=20,lineno=
> > > >> >>> y,thread=y,file=kafka.hprof
> > > >> >>> <classname>
> > > >> >>>
> > > >> >>> You would need to start the Kafka server with the settings above
> > for
> > > >> >>> sometime until you observe the problem.
> > > >> >>>
> > > >> >>> On Tue, Dec 9, 2014 at 3:47 AM, Mathias Söderberg <
> > > >> >>> mathias.soederberg@gmail.com> wrote:
> > > >> >>>
> > > >> >>>> Hi Neha,
> > > >> >>>>
> > > >> >>>> Yeah sure. I'm not familiar with hprof, so any particular
> > options I
> > > >> >>> should
> > > >> >>>> include or just run with defaults?
> > > >> >>>>
> > > >> >>>> Best regards,
> > > >> >>>> Mathias
> > > >> >>>>
> > > >> >>>> On Mon Dec 08 2014 at 7:41:32 PM Neha Narkhede <
> > neha@confluent.io>
> > > >> >>> wrote:
> > > >> >>>>> Thanks for reporting the issue. Would you mind running hprof
> and
> > > >> >>> sending
> > > >> >>>>> the output?
> > > >> >>>>>
> > > >> >>>>> On Mon, Dec 8, 2014 at 1:25 AM, Mathias Söderberg <
> > > >> >>>>> mathias.soederberg@gmail.com> wrote:
> > > >> >>>>>
> > > >> >>>>>> Good day,
> > > >> >>>>>>
> > > >> >>>>>> I upgraded a Kafka cluster from v0.8.1.1 to v0.8.2-beta and
> > > noticed
> > > >> >>>> that
> > > >> >>>>>> the CPU usage on the broker machines went up by roughly 40%,
> > from
> > > >> >>> ~60%
> > > >> >>>> to
> > > >> >>>>>> ~100% and am wondering if anyone else has experienced
> something
> > > >> >>>> similar?
> > > >> >>>>>> The load average also went up by 2x-3x.
> > > >> >>>>>>
> > > >> >>>>>> We're running on EC2 and the cluster currently consists of
> four
> > > >> >>>>> m1.xlarge,
> > > >> >>>>>> with roughly 1100 topics / 4000 partitions. Using Java 7
> > > (1.7.0_65
> > > >> >>> to
> > > >> >>>> be
> > > >> >>>>>> exact) and Scala 2.9.2. Configurations can be found over
> here:
> > > >> >>>>>> https://gist.github.com/mthssdrbrg/7df34a795e07eef10262.
> > > >> >>>>>>
> > > >> >>>>>> I'm assuming that this is not expected behaviour for
> > 0.8.2-beta?
> > > >> >>>>>>
> > > >> >>>>>> Best regards,
> > > >> >>>>>> Mathias
> > > >> >>>>>>
> > > >> >>>>>
> > > >> >>>>>
> > > >> >>>>> --
> > > >> >>>>> Thanks,
> > > >> >>>>> Neha
> > > >> >>>>>
> > > >> >>>
> > > >> >>>
> > > >> >>> --
> > > >> >>> Thanks,
> > > >> >>> Neha
> > > >> >>>
> > > >>
> > > >>
> > >
> >
>

Re: Increased CPU usage with 0.8.2-beta

Posted by Solon Gordon <so...@knewton.com>.
I saw a very similar jump in CPU usage when I tried upgrading from 0.8.1.1
to 0.8.2.0 today in a test environment. The Kafka cluster there is two
m1.larges handling 2,000 partitions across 32 topics. CPU usage rose from
40% into the 150%–190% range, and load average from under 1 to over 4.
Downgrading to 0.8.1.1 brought the CPU and load back to the previous values.

If there's more info that would be helpful, please let me know.

On Thu, Feb 12, 2015 at 4:17 PM, Mathias Söderberg <
mathias.soederberg@gmail.com> wrote:

> Jun,
>
> Pardon the radio silence. I booted up a new broker, created a topic with
> three (3) partitions and replication factor one (1) and used the
> kafka-producer-perf-test.sh
> script to generate load (using messages of roughly the same size as ours).
> There was a slight increase in CPU usage (~5-10%) on 0.8.2.0-rc2 compared
> to 0.8.1.1, but that was about it.
>
> I upgraded our staging cluster to 0.8.2.0 earlier this week or so, and had
> to add an additional broker due to increased load after the upgrade (note
> that the incoming load on the cluster has been pretty much consistent).
> Since the upgrade we've been seeing a 2-3x increase in latency as well.
> I'm considering downgrading to 0.8.1.1 again to see if it resolves our
> issues.
>
> Best regards,
> Mathias
>
> On Tue Feb 03 2015 at 6:44:36 PM Jun Rao <ju...@confluent.io> wrote:
>
> > Mathias,
> >
> > The new hprof doesn't reveal anything new to me. We did fix the logic in
> > using Purgatory in 0.8.2, which could potentially drive up the CPU usage
> a
> > bit. To verify that, could you do your test on a single broker (with
> > replication factor 1) btw 0.8.1 and 0.8.2 and see if there is any
> > significant difference in cpu usage?
> >
> > Thanks,
> >
> > Jun
> >
> > On Tue, Feb 3, 2015 at 5:09 AM, Mathias Söderberg <
> > mathias.soederberg@gmail.com> wrote:
> >
> > > Jun,
> > >
> > > I re-ran the hprof test, for about 30 minutes again, for 0.8.2.0-rc2
> with
> > > the same version of snappy that 0.8.1.1 used. Attached the logs.
> > > Unfortunately there wasn't any improvement as the node running
> > 0.8.2.0-rc2
> > > still had a higher load and CPU usage.
> > >
> > > Best regards,
> > > Mathias
> > >
> > > On Tue Feb 03 2015 at 4:40:31 AM Jaikiran Pai <
> jai.forums2013@gmail.com>
> > > wrote:
> > >
> > >> On Monday 02 February 2015 11:03 PM, Jun Rao wrote:
> > >> > Jaikiran,
> > >> >
> > >> > The fix you provided is probably unnecessary. The channel that we
> use
> > in
> > >> > SimpleConsumer (BlockingChannel) is configured to be blocking. So
> even
> > >> > though the read from the socket is in a loop, each read blocks if
> > there
> > >> is
> > >> > no bytes received from the broker. So, that shouldn't cause extra
> CPU
> > >> > consumption.
> > >> Hi Jun,
> > >>
> > >> Of course, you are right! I forgot that while reading the thread dump
> in
> > >> hprof output, one has to be aware that the thread state isn't shown
> and
> > >> the thread need not necessarily be doing any CPU activity.
> > >>
> > >> -Jaikiran
> > >>
> > >>
> > >> >
> > >> > Thanks,
> > >> >
> > >> > Jun
> > >> >
> > >> > On Mon, Jan 26, 2015 at 10:05 AM, Mathias Söderberg <
> > >> > mathias.soederberg@gmail.com> wrote:
> > >> >
> > >> >> Hi Neha,
> > >> >>
> > >> >> I sent an e-mail earlier today, but noticed now that it didn't
> > >> actually go
> > >> >> through.
> > >> >>
> > >> >> Anyhow, I've attached two files, one with output from a 10 minute
> run
> > >> and
> > >> >> one with output from a 30 minute run. Realized that maybe I
> should've
> > >> done
> > >> >> one or two runs with 0.8.1.1 as well, but nevertheless.
> > >> >>
> > >> >> I upgraded our staging cluster to 0.8.2.0-rc2, and I'm seeing the
> > same
> > >> CPU
> > >> >> usage as with the beta version (basically pegging all cores). If I
> > >> manage
> > >> >> to find the time I'll do another run with hprof on the rc2 version
> > >> later
> > >> >> today.
> > >> >>
> > >> >> Best regards,
> > >> >> Mathias
> > >> >>
> > >> >> On Tue Dec 09 2014 at 10:08:21 PM Neha Narkhede <neha@confluent.io
> >
> > >> wrote:
> > >> >>
> > >> >>> The following should be sufficient
> > >> >>>
> > >> >>> java
> > >> >>> -agentlib:hprof=cpu=samples,depth=100,interval=20,lineno=
> > >> >>> y,thread=y,file=kafka.hprof
> > >> >>> <classname>
> > >> >>>
> > >> >>> You would need to start the Kafka server with the settings above
> for
> > >> >>> sometime until you observe the problem.
> > >> >>>
> > >> >>> On Tue, Dec 9, 2014 at 3:47 AM, Mathias Söderberg <
> > >> >>> mathias.soederberg@gmail.com> wrote:
> > >> >>>
> > >> >>>> Hi Neha,
> > >> >>>>
> > >> >>>> Yeah sure. I'm not familiar with hprof, so any particular
> options I
> > >> >>> should
> > >> >>>> include or just run with defaults?
> > >> >>>>
> > >> >>>> Best regards,
> > >> >>>> Mathias
> > >> >>>>
> > >> >>>> On Mon Dec 08 2014 at 7:41:32 PM Neha Narkhede <
> neha@confluent.io>
> > >> >>> wrote:
> > >> >>>>> Thanks for reporting the issue. Would you mind running hprof and
> > >> >>> sending
> > >> >>>>> the output?
> > >> >>>>>
> > >> >>>>> On Mon, Dec 8, 2014 at 1:25 AM, Mathias Söderberg <
> > >> >>>>> mathias.soederberg@gmail.com> wrote:
> > >> >>>>>
> > >> >>>>>> Good day,
> > >> >>>>>>
> > >> >>>>>> I upgraded a Kafka cluster from v0.8.1.1 to v0.8.2-beta and
> > noticed
> > >> >>>> that
> > >> >>>>>> the CPU usage on the broker machines went up by roughly 40%,
> from
> > >> >>> ~60%
> > >> >>>> to
> > >> >>>>>> ~100% and am wondering if anyone else has experienced something
> > >> >>>> similar?
> > >> >>>>>> The load average also went up by 2x-3x.
> > >> >>>>>>
> > >> >>>>>> We're running on EC2 and the cluster currently consists of four
> > >> >>>>> m1.xlarge,
> > >> >>>>>> with roughly 1100 topics / 4000 partitions. Using Java 7
> > (1.7.0_65
> > >> >>> to
> > >> >>>> be
> > >> >>>>>> exact) and Scala 2.9.2. Configurations can be found over here:
> > >> >>>>>> https://gist.github.com/mthssdrbrg/7df34a795e07eef10262.
> > >> >>>>>>
> > >> >>>>>> I'm assuming that this is not expected behaviour for
> 0.8.2-beta?
> > >> >>>>>>
> > >> >>>>>> Best regards,
> > >> >>>>>> Mathias
> > >> >>>>>>
> > >> >>>>>
> > >> >>>>>
> > >> >>>>> --
> > >> >>>>> Thanks,
> > >> >>>>> Neha
> > >> >>>>>
> > >> >>>
> > >> >>>
> > >> >>> --
> > >> >>> Thanks,
> > >> >>> Neha
> > >> >>>
> > >>
> > >>
> >
>

Re: Increased CPU usage with 0.8.2-beta

Posted by Jun Rao <ju...@confluent.io>.
Mathias,

What's the ack mode you used in the producer? Could you share the command
you used to run kafka-producer-perf-test.sh?
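
Something along these lines would do, so we can reproduce the same load
(option names are from the 0.8.x tool -- double-check them against --help):

  bin/kafka-producer-perf-test.sh --broker-list broker1:9092 \
    --topics perf-test --messages 1000000 --message-size 1000 --threads 4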

Thanks,

Jun

On Thu, Feb 12, 2015 at 1:17 PM, Mathias Söderberg <
mathias.soederberg@gmail.com> wrote:

> Jun,
>
> Pardon the radio silence. I booted up a new broker, created a topic with
> three (3) partitions and replication factor one (1) and used the
> kafka-producer-perf-test.sh
> script to generate load (using messages of roughly the same size as ours).
> There was a slight increase in CPU usage (~5-10%) on 0.8.2.0-rc2 compared
> to 0.8.1.1, but that was about it.
>
> I upgraded our staging cluster to 0.8.2.0 earlier this week or so, and had
> to add an additional broker due to increased load after the upgrade (note
> that the incoming load on the cluster has been pretty much consistent).
> Since the upgrade we've been seeing a 2-3x increase in latency as well.
> I'm considering downgrading to 0.8.1.1 again to see if it resolves our
> issues.
>
> Best regards,
> Mathias
>
> On Tue Feb 03 2015 at 6:44:36 PM Jun Rao <ju...@confluent.io> wrote:
>
> > Mathias,
> >
> > The new hprof doesn't reveal anything new to me. We did fix the logic in
> > using Purgatory in 0.8.2, which could potentially drive up the CPU usage
> a
> > bit. To verify that, could you do your test on a single broker (with
> > replication factor 1) btw 0.8.1 and 0.8.2 and see if there is any
> > significant difference in cpu usage?
> >
> > Thanks,
> >
> > Jun
> >
> > On Tue, Feb 3, 2015 at 5:09 AM, Mathias Söderberg <
> > mathias.soederberg@gmail.com> wrote:
> >
> > > Jun,
> > >
> > > I re-ran the hprof test, for about 30 minutes again, for 0.8.2.0-rc2
> with
> > > the same version of snappy that 0.8.1.1 used. Attached the logs.
> > > Unfortunately there wasn't any improvement as the node running
> > 0.8.2.0-rc2
> > > still had a higher load and CPU usage.
> > >
> > > Best regards,
> > > Mathias
> > >
> > > On Tue Feb 03 2015 at 4:40:31 AM Jaikiran Pai <
> jai.forums2013@gmail.com>
> > > wrote:
> > >
> > >> On Monday 02 February 2015 11:03 PM, Jun Rao wrote:
> > >> > Jaikiran,
> > >> >
> > >> > The fix you provided is probably unnecessary. The channel that we
> use
> > in
> > >> > SimpleConsumer (BlockingChannel) is configured to be blocking. So
> even
> > >> > though the read from the socket is in a loop, each read blocks if
> > there
> > >> is
> > >> > no bytes received from the broker. So, that shouldn't cause extra
> CPU
> > >> > consumption.
> > >> Hi Jun,
> > >>
> > >> Of course, you are right! I forgot that while reading the thread dump
> in
> > >> hprof output, one has to be aware that the thread state isn't shown
> and
> > >> the thread need not necessarily be doing any CPU activity.
> > >>
> > >> -Jaikiran
> > >>
> > >>
> > >> >
> > >> > Thanks,
> > >> >
> > >> > Jun
> > >> >
> > >> > On Mon, Jan 26, 2015 at 10:05 AM, Mathias Söderberg <
> > >> > mathias.soederberg@gmail.com> wrote:
> > >> >
> > >> >> Hi Neha,
> > >> >>
> > >> >> I sent an e-mail earlier today, but noticed now that it didn't
> > >> actually go
> > >> >> through.
> > >> >>
> > >> >> Anyhow, I've attached two files, one with output from a 10 minute
> run
> > >> and
> > >> >> one with output from a 30 minute run. Realized that maybe I
> should've
> > >> done
> > >> >> one or two runs with 0.8.1.1 as well, but nevertheless.
> > >> >>
> > >> >> I upgraded our staging cluster to 0.8.2.0-rc2, and I'm seeing the
> > same
> > >> CPU
> > >> >> usage as with the beta version (basically pegging all cores). If I
> > >> manage
> > >> >> to find the time I'll do another run with hprof on the rc2 version
> > >> later
> > >> >> today.
> > >> >>
> > >> >> Best regards,
> > >> >> Mathias
> > >> >>
> > >> >> On Tue Dec 09 2014 at 10:08:21 PM Neha Narkhede <neha@confluent.io
> >
> > >> wrote:
> > >> >>
> > >> >>> The following should be sufficient
> > >> >>>
> > >> >>> java
> > >> >>> -agentlib:hprof=cpu=samples,depth=100,interval=20,lineno=
> > >> >>> y,thread=y,file=kafka.hprof
> > >> >>> <classname>
> > >> >>>
> > >> >>> You would need to start the Kafka server with the settings above
> for
> > >> >>> sometime until you observe the problem.
> > >> >>>
> > >> >>> On Tue, Dec 9, 2014 at 3:47 AM, Mathias Söderberg <
> > >> >>> mathias.soederberg@gmail.com> wrote:
> > >> >>>
> > >> >>>> Hi Neha,
> > >> >>>>
> > >> >>>> Yeah sure. I'm not familiar with hprof, so any particular
> options I
> > >> >>> should
> > >> >>>> include or just run with defaults?
> > >> >>>>
> > >> >>>> Best regards,
> > >> >>>> Mathias
> > >> >>>>
> > >> >>>> On Mon Dec 08 2014 at 7:41:32 PM Neha Narkhede <
> neha@confluent.io>
> > >> >>> wrote:
> > >> >>>>> Thanks for reporting the issue. Would you mind running hprof and
> > >> >>> sending
> > >> >>>>> the output?
> > >> >>>>>
> > >> >>>>> On Mon, Dec 8, 2014 at 1:25 AM, Mathias Söderberg <
> > >> >>>>> mathias.soederberg@gmail.com> wrote:
> > >> >>>>>
> > >> >>>>>> Good day,
> > >> >>>>>>
> > >> >>>>>> I upgraded a Kafka cluster from v0.8.1.1 to v0.8.2-beta and
> > noticed
> > >> >>>> that
> > >> >>>>>> the CPU usage on the broker machines went up by roughly 40%,
> from
> > >> >>> ~60%
> > >> >>>> to
> > >> >>>>>> ~100% and am wondering if anyone else has experienced something
> > >> >>>> similar?
> > >> >>>>>> The load average also went up by 2x-3x.
> > >> >>>>>>
> > >> >>>>>> We're running on EC2 and the cluster currently consists of four
> > >> >>>>> m1.xlarge,
> > >> >>>>>> with roughly 1100 topics / 4000 partitions. Using Java 7
> > (1.7.0_65
> > >> >>> to
> > >> >>>> be
> > >> >>>>>> exact) and Scala 2.9.2. Configurations can be found over here:
> > >> >>>>>> https://gist.github.com/mthssdrbrg/7df34a795e07eef10262.
> > >> >>>>>>
> > >> >>>>>> I'm assuming that this is not expected behaviour for
> 0.8.2-beta?
> > >> >>>>>>
> > >> >>>>>> Best regards,
> > >> >>>>>> Mathias
> > >> >>>>>>
> > >> >>>>>
> > >> >>>>>
> > >> >>>>> --
> > >> >>>>> Thanks,
> > >> >>>>> Neha
> > >> >>>>>
> > >> >>>
> > >> >>>
> > >> >>> --
> > >> >>> Thanks,
> > >> >>> Neha
> > >> >>>
> > >>
> > >>
> >
>

Re: Increased CPU usage with 0.8.2-beta

Posted by Mathias Söderberg <ma...@gmail.com>.
Jun,

Pardon the radio silence. I booted up a new broker, created a topic with
three (3) partitions and replication factor one (1) and used the
kafka-producer-perf-test.sh script to generate load (using messages of
roughly the same size as ours).
There was a slight increase in CPU usage (~5-10%) on 0.8.2.0-rc2 compared
to 0.8.1.1, but that was about it.
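
For completeness, the topic for this test was created with something along the
lines of the following (the topic name and ZooKeeper address below are just
placeholders):

    bin/kafka-topics.sh --create --zookeeper zk1:2181 \
      --topic perf-test --partitions 3 --replication-factor 1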

I upgraded our staging cluster to 0.8.2.0 earlier this week or so, and had
to add an additional broker due to increased load after the upgrade (note
that the incoming load on the cluster has been pretty much consistent).
Since the upgrade we've been seeing a 2-3x increase in latency as well.
I'm considering downgrading to 0.8.1.1 again to see if it resolves our
issues.

Best regards,
Mathias

On Tue Feb 03 2015 at 6:44:36 PM Jun Rao <ju...@confluent.io> wrote:

> Mathias,
>
> The new hprof doesn't reveal anything new to me. We did fix the logic in
> using Purgatory in 0.8.2, which could potentially drive up the CPU usage a
> bit. To verify that, could you do your test on a single broker (with
> replication factor 1) btw 0.8.1 and 0.8.2 and see if there is any
> significant difference in cpu usage?
>
> Thanks,
>
> Jun
>
> On Tue, Feb 3, 2015 at 5:09 AM, Mathias Söderberg <
> mathias.soederberg@gmail.com> wrote:
>
> > Jun,
> >
> > I re-ran the hprof test, for about 30 minutes again, for 0.8.2.0-rc2 with
> > the same version of snappy that 0.8.1.1 used. Attached the logs.
> > Unfortunately there wasn't any improvement as the node running
> 0.8.2.0-rc2
> > still had a higher load and CPU usage.
> >
> > Best regards,
> > Mathias
> >
> > On Tue Feb 03 2015 at 4:40:31 AM Jaikiran Pai <ja...@gmail.com>
> > wrote:
> >
> >> On Monday 02 February 2015 11:03 PM, Jun Rao wrote:
> >> > Jaikiran,
> >> >
> >> > The fix you provided in probably unnecessary. The channel that we use
> in
> >> > SimpleConsumer (BlockingChannel) is configured to be blocking. So even
> >> > though the read from the socket is in a loop, each read blocks if
> there
> >> is
> >> > no bytes received from the broker. So, that shouldn't cause extra CPU
> >> > consumption.
> >> Hi Jun,
> >>
> >> Of course, you are right! I forgot that while reading the thread dump in
> >> hprof output, one has to be aware that the thread state isn't shown and
> >> the thread need not necessarily be doing any CPU activity.
> >>
> >> -Jaikiran
> >>
> >>
> >> >
> >> > Thanks,
> >> >
> >> > Jun
> >> >
> >> > On Mon, Jan 26, 2015 at 10:05 AM, Mathias Söderberg <
> >> > mathias.soederberg@gmail.com> wrote:
> >> >
> >> >> Hi Neha,
> >> >>
> >> >> I sent an e-mail earlier today, but noticed now that it didn't
> >> actually go
> >> >> through.
> >> >>
> >> >> Anyhow, I've attached two files, one with output from a 10 minute run
> >> and
> >> >> one with output from a 30 minute run. Realized that maybe I should've
> >> done
> >> >> one or two runs with 0.8.1.1 as well, but nevertheless.
> >> >>
> >> >> I upgraded our staging cluster to 0.8.2.0-rc2, and I'm seeing the
> same
> >> CPU
> >> >> usage as with the beta version (basically pegging all cores). If I
> >> manage
> >> >> to find the time I'll do another run with hprof on the rc2 version
> >> later
> >> >> today.
> >> >>
> >> >> Best regards,
> >> >> Mathias
> >> >>
> >> >> On Tue Dec 09 2014 at 10:08:21 PM Neha Narkhede <ne...@confluent.io>
> >> wrote:
> >> >>
> >> >>> The following should be sufficient
> >> >>>
> >> >>> java
> >> >>> -agentlib:hprof=cpu=samples,depth=100,interval=20,lineno=
> >> >>> y,thread=y,file=kafka.hprof
> >> >>> <classname>
> >> >>>
> >> >>> You would need to start the Kafka server with the settings above for
> >> >>> sometime until you observe the problem.
> >> >>>
> >> >>> On Tue, Dec 9, 2014 at 3:47 AM, Mathias Söderberg <
> >> >>> mathias.soederberg@gmail.com> wrote:
> >> >>>
> >> >>>> Hi Neha,
> >> >>>>
> >> >>>> Yeah sure. I'm not familiar with hprof, so any particular options I
> >> >>> should
> >> >>>> include or just run with defaults?
> >> >>>>
> >> >>>> Best regards,
> >> >>>> Mathias
> >> >>>>
> >> >>>> On Mon Dec 08 2014 at 7:41:32 PM Neha Narkhede <ne...@confluent.io>
> >> >>> wrote:
> >> >>>>> Thanks for reporting the issue. Would you mind running hprof and
> >> >>> sending
> >> >>>>> the output?
> >> >>>>>
> >> >>>>> On Mon, Dec 8, 2014 at 1:25 AM, Mathias Söderberg <
> >> >>>>> mathias.soederberg@gmail.com> wrote:
> >> >>>>>
> >> >>>>>> Good day,
> >> >>>>>>
> >> >>>>>> I upgraded a Kafka cluster from v0.8.1.1 to v0.8.2-beta and
> noticed
> >> >>>> that
> >> >>>>>> the CPU usage on the broker machines went up by roughly 40%, from
> >> >>> ~60%
> >> >>>> to
> >> >>>>>> ~100% and am wondering if anyone else has experienced something
> >> >>>> similar?
> >> >>>>>> The load average also went up by 2x-3x.
> >> >>>>>>
> >> >>>>>> We're running on EC2 and the cluster currently consists of four
> >> >>>>> m1.xlarge,
> >> >>>>>> with roughly 1100 topics / 4000 partitions. Using Java 7
> (1.7.0_65
> >> >>> to
> >> >>>> be
> >> >>>>>> exact) and Scala 2.9.2. Configurations can be found over here:
> >> >>>>>> https://gist.github.com/mthssdrbrg/7df34a795e07eef10262.
> >> >>>>>>
> >> >>>>>> I'm assuming that this is not expected behaviour for 0.8.2-beta?
> >> >>>>>>
> >> >>>>>> Best regards,
> >> >>>>>> Mathias
> >> >>>>>>
> >> >>>>>
> >> >>>>>
> >> >>>>> --
> >> >>>>> Thanks,
> >> >>>>> Neha
> >> >>>>>
> >> >>>
> >> >>>
> >> >>> --
> >> >>> Thanks,
> >> >>> Neha
> >> >>>
> >>
> >>
>

Re: Increased CPU usage with 0.8.2-beta

Posted by Jun Rao <ju...@confluent.io>.
Mathias,

The new hprof doesn't reveal anything new to me. We did fix the logic around
using the Purgatory in 0.8.2, which could potentially drive up the CPU usage a
bit. To verify that, could you run your test on a single broker (with
replication factor 1) on both 0.8.1 and 0.8.2 and see if there is any
significant difference in CPU usage?

Thanks,

Jun

On Tue, Feb 3, 2015 at 5:09 AM, Mathias Söderberg <
mathias.soederberg@gmail.com> wrote:

> Jun,
>
> I re-ran the hprof test, for about 30 minutes again, for 0.8.2.0-rc2 with
> the same version of snappy that 0.8.1.1 used. Attached the logs.
> Unfortunately there wasn't any improvement as the node running 0.8.2.0-rc2
> still had a higher load and CPU usage.
>
> Best regards,
> Mathias
>
> On Tue Feb 03 2015 at 4:40:31 AM Jaikiran Pai <ja...@gmail.com>
> wrote:
>
>> On Monday 02 February 2015 11:03 PM, Jun Rao wrote:
>> > Jaikiran,
>> >
>> > The fix you provided in probably unnecessary. The channel that we use in
>> > SimpleConsumer (BlockingChannel) is configured to be blocking. So even
>> > though the read from the socket is in a loop, each read blocks if there
>> is
>> > no bytes received from the broker. So, that shouldn't cause extra CPU
>> > consumption.
>> Hi Jun,
>>
>> Of course, you are right! I forgot that while reading the thread dump in
>> hprof output, one has to be aware that the thread state isn't shown and
>> the thread need not necessarily be doing any CPU activity.
>>
>> -Jaikiran
>>
>>
>> >
>> > Thanks,
>> >
>> > Jun
>> >
>> > On Mon, Jan 26, 2015 at 10:05 AM, Mathias Söderberg <
>> > mathias.soederberg@gmail.com> wrote:
>> >
>> >> Hi Neha,
>> >>
>> >> I sent an e-mail earlier today, but noticed now that it didn't
>> actually go
>> >> through.
>> >>
>> >> Anyhow, I've attached two files, one with output from a 10 minute run
>> and
>> >> one with output from a 30 minute run. Realized that maybe I should've
>> done
>> >> one or two runs with 0.8.1.1 as well, but nevertheless.
>> >>
>> >> I upgraded our staging cluster to 0.8.2.0-rc2, and I'm seeing the same
>> CPU
>> >> usage as with the beta version (basically pegging all cores). If I
>> manage
>> >> to find the time I'll do another run with hprof on the rc2 version
>> later
>> >> today.
>> >>
>> >> Best regards,
>> >> Mathias
>> >>
>> >> On Tue Dec 09 2014 at 10:08:21 PM Neha Narkhede <ne...@confluent.io>
>> wrote:
>> >>
>> >>> The following should be sufficient
>> >>>
>> >>> java
>> >>> -agentlib:hprof=cpu=samples,depth=100,interval=20,lineno=
>> >>> y,thread=y,file=kafka.hprof
>> >>> <classname>
>> >>>
>> >>> You would need to start the Kafka server with the settings above for
>> >>> sometime until you observe the problem.
>> >>>
>> >>> On Tue, Dec 9, 2014 at 3:47 AM, Mathias Söderberg <
>> >>> mathias.soederberg@gmail.com> wrote:
>> >>>
>> >>>> Hi Neha,
>> >>>>
>> >>>> Yeah sure. I'm not familiar with hprof, so any particular options I
>> >>> should
>> >>>> include or just run with defaults?
>> >>>>
>> >>>> Best regards,
>> >>>> Mathias
>> >>>>
>> >>>> On Mon Dec 08 2014 at 7:41:32 PM Neha Narkhede <ne...@confluent.io>
>> >>> wrote:
>> >>>>> Thanks for reporting the issue. Would you mind running hprof and
>> >>> sending
>> >>>>> the output?
>> >>>>>
>> >>>>> On Mon, Dec 8, 2014 at 1:25 AM, Mathias Söderberg <
>> >>>>> mathias.soederberg@gmail.com> wrote:
>> >>>>>
>> >>>>>> Good day,
>> >>>>>>
>> >>>>>> I upgraded a Kafka cluster from v0.8.1.1 to v0.8.2-beta and noticed
>> >>>> that
>> >>>>>> the CPU usage on the broker machines went up by roughly 40%, from
>> >>> ~60%
>> >>>> to
>> >>>>>> ~100% and am wondering if anyone else has experienced something
>> >>>> similar?
>> >>>>>> The load average also went up by 2x-3x.
>> >>>>>>
>> >>>>>> We're running on EC2 and the cluster currently consists of four
>> >>>>> m1.xlarge,
>> >>>>>> with roughly 1100 topics / 4000 partitions. Using Java 7 (1.7.0_65
>> >>> to
>> >>>> be
>> >>>>>> exact) and Scala 2.9.2. Configurations can be found over here:
>> >>>>>> https://gist.github.com/mthssdrbrg/7df34a795e07eef10262.
>> >>>>>>
>> >>>>>> I'm assuming that this is not expected behaviour for 0.8.2-beta?
>> >>>>>>
>> >>>>>> Best regards,
>> >>>>>> Mathias
>> >>>>>>
>> >>>>>
>> >>>>>
>> >>>>> --
>> >>>>> Thanks,
>> >>>>> Neha
>> >>>>>
>> >>>
>> >>>
>> >>> --
>> >>> Thanks,
>> >>> Neha
>> >>>
>>
>>

Re: Increased CPU usage with 0.8.2-beta

Posted by Mathias Söderberg <ma...@gmail.com>.
Jun,

I re-ran the hprof test, for about 30 minutes again, for 0.8.2.0-rc2 with
the same version of snappy that 0.8.1.1 used. Attached the logs.
Unfortunately there wasn't any improvement as the node running 0.8.2.0-rc2
still had a higher load and CPU usage.

Best regards,
Mathias

On Tue Feb 03 2015 at 4:40:31 AM Jaikiran Pai <ja...@gmail.com>
wrote:

> On Monday 02 February 2015 11:03 PM, Jun Rao wrote:
> > Jaikiran,
> >
> > The fix you provided in probably unnecessary. The channel that we use in
> > SimpleConsumer (BlockingChannel) is configured to be blocking. So even
> > though the read from the socket is in a loop, each read blocks if there
> is
> > no bytes received from the broker. So, that shouldn't cause extra CPU
> > consumption.
> Hi Jun,
>
> Of course, you are right! I forgot that while reading the thread dump in
> hprof output, one has to be aware that the thread state isn't shown and
> the thread need not necessarily be doing any CPU activity.
>
> -Jaikiran
>
>
> >
> > Thanks,
> >
> > Jun
> >
> > On Mon, Jan 26, 2015 at 10:05 AM, Mathias Söderberg <
> > mathias.soederberg@gmail.com> wrote:
> >
> >> Hi Neha,
> >>
> >> I sent an e-mail earlier today, but noticed now that it didn't actually
> go
> >> through.
> >>
> >> Anyhow, I've attached two files, one with output from a 10 minute run
> and
> >> one with output from a 30 minute run. Realized that maybe I should've
> done
> >> one or two runs with 0.8.1.1 as well, but nevertheless.
> >>
> >> I upgraded our staging cluster to 0.8.2.0-rc2, and I'm seeing the same
> CPU
> >> usage as with the beta version (basically pegging all cores). If I
> manage
> >> to find the time I'll do another run with hprof on the rc2 version later
> >> today.
> >>
> >> Best regards,
> >> Mathias
> >>
> >> On Tue Dec 09 2014 at 10:08:21 PM Neha Narkhede <ne...@confluent.io>
> wrote:
> >>
> >>> The following should be sufficient
> >>>
> >>> java
> >>> -agentlib:hprof=cpu=samples,depth=100,interval=20,lineno=
> >>> y,thread=y,file=kafka.hprof
> >>> <classname>
> >>>
> >>> You would need to start the Kafka server with the settings above for
> >>> sometime until you observe the problem.
> >>>
> >>> On Tue, Dec 9, 2014 at 3:47 AM, Mathias Söderberg <
> >>> mathias.soederberg@gmail.com> wrote:
> >>>
> >>>> Hi Neha,
> >>>>
> >>>> Yeah sure. I'm not familiar with hprof, so any particular options I
> >>> should
> >>>> include or just run with defaults?
> >>>>
> >>>> Best regards,
> >>>> Mathias
> >>>>
> >>>> On Mon Dec 08 2014 at 7:41:32 PM Neha Narkhede <ne...@confluent.io>
> >>> wrote:
> >>>>> Thanks for reporting the issue. Would you mind running hprof and
> >>> sending
> >>>>> the output?
> >>>>>
> >>>>> On Mon, Dec 8, 2014 at 1:25 AM, Mathias Söderberg <
> >>>>> mathias.soederberg@gmail.com> wrote:
> >>>>>
> >>>>>> Good day,
> >>>>>>
> >>>>>> I upgraded a Kafka cluster from v0.8.1.1 to v0.8.2-beta and noticed
> >>>> that
> >>>>>> the CPU usage on the broker machines went up by roughly 40%, from
> >>> ~60%
> >>>> to
> >>>>>> ~100% and am wondering if anyone else has experienced something
> >>>> similar?
> >>>>>> The load average also went up by 2x-3x.
> >>>>>>
> >>>>>> We're running on EC2 and the cluster currently consists of four
> >>>>> m1.xlarge,
> >>>>>> with roughly 1100 topics / 4000 partitions. Using Java 7 (1.7.0_65
> >>> to
> >>>> be
> >>>>>> exact) and Scala 2.9.2. Configurations can be found over here:
> >>>>>> https://gist.github.com/mthssdrbrg/7df34a795e07eef10262.
> >>>>>>
> >>>>>> I'm assuming that this is not expected behaviour for 0.8.2-beta?
> >>>>>>
> >>>>>> Best regards,
> >>>>>> Mathias
> >>>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Thanks,
> >>>>> Neha
> >>>>>
> >>>
> >>>
> >>> --
> >>> Thanks,
> >>> Neha
> >>>
>
>

Re: Increased CPU usage with 0.8.2-beta

Posted by Jaikiran Pai <ja...@gmail.com>.
On Monday 02 February 2015 11:03 PM, Jun Rao wrote:
> Jaikiran,
>
> The fix you provided in probably unnecessary. The channel that we use in
> SimpleConsumer (BlockingChannel) is configured to be blocking. So even
> though the read from the socket is in a loop, each read blocks if there is
> no bytes received from the broker. So, that shouldn't cause extra CPU
> consumption.
Hi Jun,

Of course, you are right! I forgot that while reading the thread dump in 
hprof output, one has to be aware that the thread state isn't shown and 
the thread need not necessarily be doing any CPU activity.

-Jaikiran


>
> Thanks,
>
> Jun
>
> On Mon, Jan 26, 2015 at 10:05 AM, Mathias Söderberg <
> mathias.soederberg@gmail.com> wrote:
>
>> Hi Neha,
>>
>> I sent an e-mail earlier today, but noticed now that it didn't actually go
>> through.
>>
>> Anyhow, I've attached two files, one with output from a 10 minute run and
>> one with output from a 30 minute run. Realized that maybe I should've done
>> one or two runs with 0.8.1.1 as well, but nevertheless.
>>
>> I upgraded our staging cluster to 0.8.2.0-rc2, and I'm seeing the same CPU
>> usage as with the beta version (basically pegging all cores). If I manage
>> to find the time I'll do another run with hprof on the rc2 version later
>> today.
>>
>> Best regards,
>> Mathias
>>
>> On Tue Dec 09 2014 at 10:08:21 PM Neha Narkhede <ne...@confluent.io> wrote:
>>
>>> The following should be sufficient
>>>
>>> java
>>> -agentlib:hprof=cpu=samples,depth=100,interval=20,lineno=
>>> y,thread=y,file=kafka.hprof
>>> <classname>
>>>
>>> You would need to start the Kafka server with the settings above for
>>> sometime until you observe the problem.
>>>
>>> On Tue, Dec 9, 2014 at 3:47 AM, Mathias Söderberg <
>>> mathias.soederberg@gmail.com> wrote:
>>>
>>>> Hi Neha,
>>>>
>>>> Yeah sure. I'm not familiar with hprof, so any particular options I
>>> should
>>>> include or just run with defaults?
>>>>
>>>> Best regards,
>>>> Mathias
>>>>
>>>> On Mon Dec 08 2014 at 7:41:32 PM Neha Narkhede <ne...@confluent.io>
>>> wrote:
>>>>> Thanks for reporting the issue. Would you mind running hprof and
>>> sending
>>>>> the output?
>>>>>
>>>>> On Mon, Dec 8, 2014 at 1:25 AM, Mathias Söderberg <
>>>>> mathias.soederberg@gmail.com> wrote:
>>>>>
>>>>>> Good day,
>>>>>>
>>>>>> I upgraded a Kafka cluster from v0.8.1.1 to v0.8.2-beta and noticed
>>>> that
>>>>>> the CPU usage on the broker machines went up by roughly 40%, from
>>> ~60%
>>>> to
>>>>>> ~100% and am wondering if anyone else has experienced something
>>>> similar?
>>>>>> The load average also went up by 2x-3x.
>>>>>>
>>>>>> We're running on EC2 and the cluster currently consists of four
>>>>> m1.xlarge,
>>>>>> with roughly 1100 topics / 4000 partitions. Using Java 7 (1.7.0_65
>>> to
>>>> be
>>>>>> exact) and Scala 2.9.2. Configurations can be found over here:
>>>>>> https://gist.github.com/mthssdrbrg/7df34a795e07eef10262.
>>>>>>
>>>>>> I'm assuming that this is not expected behaviour for 0.8.2-beta?
>>>>>>
>>>>>> Best regards,
>>>>>> Mathias
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Thanks,
>>>>> Neha
>>>>>
>>>
>>>
>>> --
>>> Thanks,
>>> Neha
>>>


Re: Increased CPU usage with 0.8.2-beta

Posted by Jun Rao <ju...@confluent.io>.
Hi, Mathias,

From the hprof output, it seems that the top CPU consumers are
socketAccept() and epollWait(). As far as I am aware, there haven't been any
significant changes in the socket server code between 0.8.1 and 0.8.2. Could
you run the same hprof test on 0.8.1 so that we can see the difference?

Jaikiran,

The fix you provided is probably unnecessary. The channel that we use in
SimpleConsumer (BlockingChannel) is configured to be blocking. So even
though the read from the socket is in a loop, each read blocks if there are
no bytes received from the broker. So, that shouldn't cause extra CPU
consumption.
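
To make that concrete, the receive path there is essentially a plain blocking
NIO socket, i.e. something like the simplified sketch below (not the actual
Kafka code, just the relevant JDK calls):

    import java.net.InetSocketAddress
    import java.nio.ByteBuffer
    import java.nio.channels.SocketChannel

    val channel = SocketChannel.open()
    channel.configureBlocking(true)   // reads park the thread until data arrives
    channel.connect(new InetSocketAddress("broker-host", 9092))

    val buffer = ByteBuffer.allocate(4096)
    // blocks until at least one byte is available (or returns -1 at end of
    // stream), so the surrounding read loop doesn't spin on an empty socket
    val bytesRead = channel.read(buffer)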

Thanks,

Jun

On Mon, Jan 26, 2015 at 10:05 AM, Mathias Söderberg <
mathias.soederberg@gmail.com> wrote:

> Hi Neha,
>
> I sent an e-mail earlier today, but noticed now that it didn't actually go
> through.
>
> Anyhow, I've attached two files, one with output from a 10 minute run and
> one with output from a 30 minute run. Realized that maybe I should've done
> one or two runs with 0.8.1.1 as well, but nevertheless.
>
> I upgraded our staging cluster to 0.8.2.0-rc2, and I'm seeing the same CPU
> usage as with the beta version (basically pegging all cores). If I manage
> to find the time I'll do another run with hprof on the rc2 version later
> today.
>
> Best regards,
> Mathias
>
> On Tue Dec 09 2014 at 10:08:21 PM Neha Narkhede <ne...@confluent.io> wrote:
>
>> The following should be sufficient
>>
>> java
>> -agentlib:hprof=cpu=samples,depth=100,interval=20,lineno=
>> y,thread=y,file=kafka.hprof
>> <classname>
>>
>> You would need to start the Kafka server with the settings above for
>> sometime until you observe the problem.
>>
>> On Tue, Dec 9, 2014 at 3:47 AM, Mathias Söderberg <
>> mathias.soederberg@gmail.com> wrote:
>>
>> > Hi Neha,
>> >
>> > Yeah sure. I'm not familiar with hprof, so any particular options I
>> should
>> > include or just run with defaults?
>> >
>> > Best regards,
>> > Mathias
>> >
>> > On Mon Dec 08 2014 at 7:41:32 PM Neha Narkhede <ne...@confluent.io>
>> wrote:
>> >
>> > > Thanks for reporting the issue. Would you mind running hprof and
>> sending
>> > > the output?
>> > >
>> > > On Mon, Dec 8, 2014 at 1:25 AM, Mathias Söderberg <
>> > > mathias.soederberg@gmail.com> wrote:
>> > >
>> > > > Good day,
>> > > >
>> > > > I upgraded a Kafka cluster from v0.8.1.1 to v0.8.2-beta and noticed
>> > that
>> > > > the CPU usage on the broker machines went up by roughly 40%, from
>> ~60%
>> > to
>> > > > ~100% and am wondering if anyone else has experienced something
>> > similar?
>> > > > The load average also went up by 2x-3x.
>> > > >
>> > > > We're running on EC2 and the cluster currently consists of four
>> > > m1.xlarge,
>> > > > with roughly 1100 topics / 4000 partitions. Using Java 7 (1.7.0_65
>> to
>> > be
>> > > > exact) and Scala 2.9.2. Configurations can be found over here:
>> > > > https://gist.github.com/mthssdrbrg/7df34a795e07eef10262.
>> > > >
>> > > > I'm assuming that this is not expected behaviour for 0.8.2-beta?
>> > > >
>> > > > Best regards,
>> > > > Mathias
>> > > >
>> > >
>> > >
>> > >
>> > > --
>> > > Thanks,
>> > > Neha
>> > >
>> >
>>
>>
>>
>> --
>> Thanks,
>> Neha
>>
>

how to fetch old message from kafka

Posted by Snehalata Nagaje <sn...@harbingergroup.com>.
Hi,


We are using Kafka for storing messages in a chat application.

Currently we have divided each topic into multiple partitions; each partition stores the data for a given customer who uses the application.

Right now, on the very first request, the application fetches the log from Kafka from the earliest valid offset, up to a maximum of 100000 bytes, hence it reads all messages for the given topic and partition. Now we want to apply pagination like LinkedIn and Facebook do: only the latest 10-15 messages should be displayed, and then on scroll down we fetch the next set of previous messages. We are using the SimpleConsumer to fetch messages.

Can you please guide on this?
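
What we are after, in rough terms, is something like the sketch below. The class
and method names are my best guess at the 0.8 SimpleConsumer / FetchRequestBuilder
API (and the host, topic and page size are placeholders), so please correct me if
the details are off:

    import kafka.api.{FetchRequestBuilder, OffsetRequest, PartitionOffsetRequestInfo}
    import kafka.common.TopicAndPartition
    import kafka.consumer.SimpleConsumer

    val consumer = new SimpleConsumer("broker-host", 9092, 100000, 64 * 1024, "chat-pager")
    val tp = TopicAndPartition("chat-topic", 0)

    // ask the broker for the latest offset of this partition
    val offsetRequest = OffsetRequest(
      Map(tp -> PartitionOffsetRequestInfo(OffsetRequest.LatestTime, 1)))
    val latest = consumer.getOffsetsBefore(offsetRequest)
      .partitionErrorAndOffsets(tp).offsets.head

    // fetch only the last pageSize messages instead of the whole log
    val pageSize = 15
    val startOffset = math.max(0L, latest - pageSize)
    val fetchRequest = new FetchRequestBuilder()
      .clientId("chat-pager")
      .addFetch("chat-topic", 0, startOffset, 100000)
      .build()
    val page = consumer.fetch(fetchRequest).messageSet("chat-topic", 0)
    // on the next scroll, repeat the fetch from startOffset - pageSize, and so on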


Thanks,
Snehalata











----- Original Message -----
From: "Jaikiran Pai" <ja...@gmail.com>
To: users@kafka.apache.org
Sent: Monday, February 2, 2015 12:47:19 PM
Subject: Re: Increased CPU usage with 0.8.2-beta

Hi Mathias,

Looking at that thread dump, I think the potential culprit is this one:

TRACE 303545: (thread=200049)
sun.nio.ch.EPollArrayWrapper.epollWait(EPollArrayWrapper.java:Unknown line)
     sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
     sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
     sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:221)
sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103)
java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:385)
     kafka.utils.Utils$.read(Utils.scala:380)
kafka.network.BoundedByteBufferReceive.readFrom(BoundedByteBufferReceive.scala:67)
kafka.network.Receive$class.readCompletely(Transmission.scala:56)
kafka.network.BoundedByteBufferReceive.readCompletely(BoundedByteBufferReceive.scala:29)
kafka.network.BlockingChannel.receive(BlockingChannel.scala:108)
kafka.consumer.SimpleConsumer.liftedTree1$1(SimpleConsumer.scala:72)
kafka.consumer.SimpleConsumer.kafka$consumer$SimpleConsumer$$sendRequest(SimpleConsumer.scala:69)
kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(SimpleConsumer.scala:113)
kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:113)
kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:113)
     kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply$mcV$sp(SimpleConsumer.scala:112)
kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:112)
kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:112)
     kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
     kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:111)
kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:97)
kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:89)
kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60)


I see many such threads all triggered through the SimpleConsumer and 
ending up polling. Looking at the code, in theory, I can see why there 
might be a busy CPU loop generated by that code path. If my guess is 
right, it could be because of an issue in the implementation of how data 
is read off a channel in a blocking manner and I think this patch might 
help overcome that problem:

diff --git a/core/src/main/scala/kafka/network/Transmission.scala b/core/src/main/scala/kafka/network/Transmission.scala
index 2827103..0bab9ed 100644
--- a/core/src/main/scala/kafka/network/Transmission.scala
+++ b/core/src/main/scala/kafka/network/Transmission.scala
@@ -54,8 +54,15 @@ trait Receive extends Transmission {
      var totalRead = 0
      while(!complete) {
        val read = readFrom(channel)
-      trace(read + " bytes read.")
-      totalRead += read
+      if (read > 0) {
+        trace(read + " bytes read.")
+        totalRead += read
+      } else if (read == 0) {
+        // it's possible that nothing was read (see javadoc of ReadableByteChannel#read), from the backing channel,
+        // so we wait for a while before polling again, so that we don't end up with a busy CPU loop
+        // TODO: For now, this 30 milli seconds is a random value.
+        Thread.sleep(30)
+      }
      }
      totalRead
    }

Is this something that you would be able to apply against the latest 
0.8.2 branch of Kafka, build the Kafka binary, try it out and see if it 
improves the situation?

-Jaikiran

On Monday 26 January 2015 11:35 PM, Mathias Söderberg wrote:
> Hi Neha,
>
> I sent an e-mail earlier today, but noticed now that it didn't 
> actually go through.
>
> Anyhow, I've attached two files, one with output from a 10 minute run 
> and one with output from a 30 minute run. Realized that maybe I 
> should've done one or two runs with 0.8.1.1 as well, but nevertheless.
>
> I upgraded our staging cluster to 0.8.2.0-rc2, and I'm seeing the same 
> CPU usage as with the beta version (basically pegging all cores). If I 
> manage to find the time I'll do another run with hprof on the rc2 
> version later today.
>
> Best regards,
> Mathias
>
> On Tue Dec 09 2014 at 10:08:21 PM Neha Narkhede <neha@confluent.io 
> <ma...@confluent.io>> wrote:
>
>     The following should be sufficient
>
>     java
>     -agentlib:hprof=cpu=samples,depth=100,interval=20,lineno=y,thread=y,file=kafka.hprof
>     <classname>
>
>     You would need to start the Kafka server with the settings above for
>     sometime until you observe the problem.
>
>     On Tue, Dec 9, 2014 at 3:47 AM, Mathias Söderberg <
>     mathias.soederberg@gmail.com
>     <ma...@gmail.com>> wrote:
>
>     > Hi Neha,
>     >
>     > Yeah sure. I'm not familiar with hprof, so any particular
>     options I should
>     > include or just run with defaults?
>     >
>     > Best regards,
>     > Mathias
>     >
>     > On Mon Dec 08 2014 at 7:41:32 PM Neha Narkhede
>     <neha@confluent.io <ma...@confluent.io>> wrote:
>     >
>     > > Thanks for reporting the issue. Would you mind running hprof
>     and sending
>     > > the output?
>     > >
>     > > On Mon, Dec 8, 2014 at 1:25 AM, Mathias Söderberg <
>     > > mathias.soederberg@gmail.com
>     <ma...@gmail.com>> wrote:
>     > >
>     > > > Good day,
>     > > >
>     > > > I upgraded a Kafka cluster from v0.8.1.1 to v0.8.2-beta and
>     noticed
>     > that
>     > > > the CPU usage on the broker machines went up by roughly 40%,
>     from ~60%
>     > to
>     > > > ~100% and am wondering if anyone else has experienced something
>     > similar?
>     > > > The load average also went up by 2x-3x.
>     > > >
>     > > > We're running on EC2 and the cluster currently consists of four
>     > > m1.xlarge,
>     > > > with roughly 1100 topics / 4000 partitions. Using Java 7
>     (1.7.0_65 to
>     > be
>     > > > exact) and Scala 2.9.2. Configurations can be found over here:
>     > > > https://gist.github.com/mthssdrbrg/7df34a795e07eef10262.
>     > > >
>     > > > I'm assuming that this is not expected behaviour for 0.8.2-beta?
>     > > >
>     > > > Best regards,
>     > > > Mathias
>     > > >
>     > >
>     > >
>     > >
>     > > --
>     > > Thanks,
>     > > Neha
>     > >
>     >
>
>
>
>     --
>     Thanks,
>     Neha
>


Re: Increased CPU usage with 0.8.2-beta

Posted by Jay Kreps <ja...@confluent.io>.
Actually that fetch call blocks on the server side. That is, if there is no
data, the server will wait until data arrives or the timeout occurs to send
a response. This is done to help simplify client development. If that
isn't happening, it is likely a bug or a configuration change in the timeout.
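
The settings I have in mind are the fetch wait / min-bytes knobs; from memory
the broker-side defaults look roughly like this:

    replica.fetch.wait.max.ms=500   # max time a replica fetch waits for data before the broker responds
    replica.fetch.min.bytes=1       # respond as soon as at least this many bytes are available

(the consumer-side equivalents are fetch.wait.max.ms and fetch.min.bytes). If
any of those were changed to something very small, the fetchers could end up
issuing fetch requests in a tight loop.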

I think we should try to ascertain how widespread this issue is; it could
be pretty serious if it is always happening.

Mathias, could you share your server configuration?

-Jay

On Sun, Feb 1, 2015 at 11:17 PM, Jaikiran Pai <ja...@gmail.com>
wrote:

> Hi Mathias,
>
> Looking at that thread dump, I think the potential culprit is this one:
>
> TRACE 303545: (thread=200049)
> sun.nio.ch.EPollArrayWrapper.epollWait(EPollArrayWrapper.java:Unknown
> line)
>     sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
> sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
>     sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
>     sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
> sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:221)
> sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103)
> java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:385)
>     kafka.utils.Utils$.read(Utils.scala:380)
> kafka.network.BoundedByteBufferReceive.readFrom(BoundedByteBufferReceive.
> scala:67)
> kafka.network.Receive$class.readCompletely(Transmission.scala:56)
> kafka.network.BoundedByteBufferReceive.readCompletely(
> BoundedByteBufferReceive.scala:29)
> kafka.network.BlockingChannel.receive(BlockingChannel.scala:108)
> kafka.consumer.SimpleConsumer.liftedTree1$1(SimpleConsumer.scala:72)
> kafka.consumer.SimpleConsumer.kafka$consumer$SimpleConsumer$
> $sendRequest(SimpleConsumer.scala:69)
> kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$
> apply$mcV$sp$1.apply$mcV$sp(SimpleConsumer.scala:113)
> kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$
> apply$mcV$sp$1.apply(SimpleConsumer.scala:113)
> kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$
> apply$mcV$sp$1.apply(SimpleConsumer.scala:113)
>     kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
> kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply$mcV$sp(
> SimpleConsumer.scala:112)
> kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(
> SimpleConsumer.scala:112)
> kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(
> SimpleConsumer.scala:112)
>     kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
>     kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:111)
> kafka.server.AbstractFetcherThread.processFetchRequest(
> AbstractFetcherThread.scala:97)
> kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:89)
> kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60)
>
>
> I see many such threads all triggered through the SimpleConsumer and
> ending up polling. Looking at the code, in theory, I can see why there
> might be a busy CPU loop generated by that code path. If my guess is right,
> it could be because of an issue in the implementation of how data is read
> off a channel in a blocking manner and I think this patch might help
> overcome that problem:
>
> diff --git a/core/src/main/scala/kafka/network/Transmission.scala
> b/core/src/main/scala/kafka/network/Transmission.scala
> index 2827103..0bab9ed 100644
> --- a/core/src/main/scala/kafka/network/Transmission.scala
> +++ b/core/src/main/scala/kafka/network/Transmission.scala
> @@ -54,8 +54,15 @@ trait Receive extends Transmission {
>      var totalRead = 0
>      while(!complete) {
>        val read = readFrom(channel)
> -      trace(read + " bytes read.")
> -      totalRead += read
> +      if (read > 0) {
> +        trace(read + " bytes read.")
> +        totalRead += read
> +      } else if (read == 0) {
> +        // it's possible that nothing was read (see javadoc of
> ReadableByteChannel#read), from the backing channel,
> +        // so we wait for a while before polling again, so that we don't
> end up with a busy CPU loop
> +        // TODO: For now, this 30 milli seconds is a random value.
> +        Thread.sleep(30)
> +      }
>      }
>      totalRead
>    }
>
> Is this something that you would be able to apply against the latest 0.8.2
> branch of Kafka, build the Kafka binary, try it out and see if it improves
> the situation?
>
> -Jaikiran
>
> On Monday 26 January 2015 11:35 PM, Mathias Söderberg wrote:
>
>> Hi Neha,
>>
>> I sent an e-mail earlier today, but noticed now that it didn't actually
>> go through.
>>
>> Anyhow, I've attached two files, one with output from a 10 minute run and
>> one with output from a 30 minute run. Realized that maybe I should've done
>> one or two runs with 0.8.1.1 as well, but nevertheless.
>>
>> I upgraded our staging cluster to 0.8.2.0-rc2, and I'm seeing the same
>> CPU usage as with the beta version (basically pegging all cores). If I
>> manage to find the time I'll do another run with hprof on the rc2 version
>> later today.
>>
>> Best regards,
>> Mathias
>>
>> On Tue Dec 09 2014 at 10:08:21 PM Neha Narkhede <neha@confluent.io
>> <ma...@confluent.io>> wrote:
>>
>>     The following should be sufficient
>>
>>     java
>>     -agentlib:hprof=cpu=samples,depth=100,interval=20,lineno=
>> y,thread=y,file=kafka.hprof
>>     <classname>
>>
>>     You would need to start the Kafka server with the settings above for
>>     sometime until you observe the problem.
>>
>>     On Tue, Dec 9, 2014 at 3:47 AM, Mathias Söderberg <
>>     mathias.soederberg@gmail.com
>>     <ma...@gmail.com>> wrote:
>>
>>     > Hi Neha,
>>     >
>>     > Yeah sure. I'm not familiar with hprof, so any particular
>>     options I should
>>     > include or just run with defaults?
>>     >
>>     > Best regards,
>>     > Mathias
>>     >
>>     > On Mon Dec 08 2014 at 7:41:32 PM Neha Narkhede
>>     <neha@confluent.io <ma...@confluent.io>> wrote:
>>     >
>>     > > Thanks for reporting the issue. Would you mind running hprof
>>     and sending
>>     > > the output?
>>     > >
>>     > > On Mon, Dec 8, 2014 at 1:25 AM, Mathias Söderberg <
>>     > > mathias.soederberg@gmail.com
>>     <ma...@gmail.com>> wrote:
>>     > >
>>     > > > Good day,
>>     > > >
>>     > > > I upgraded a Kafka cluster from v0.8.1.1 to v0.8.2-beta and
>>     noticed
>>     > that
>>     > > > the CPU usage on the broker machines went up by roughly 40%,
>>     from ~60%
>>     > to
>>     > > > ~100% and am wondering if anyone else has experienced something
>>     > similar?
>>     > > > The load average also went up by 2x-3x.
>>     > > >
>>     > > > We're running on EC2 and the cluster currently consists of four
>>     > > m1.xlarge,
>>     > > > with roughly 1100 topics / 4000 partitions. Using Java 7
>>     (1.7.0_65 to
>>     > be
>>     > > > exact) and Scala 2.9.2. Configurations can be found over here:
>>     > > > https://gist.github.com/mthssdrbrg/7df34a795e07eef10262.
>>     > > >
>>     > > > I'm assuming that this is not expected behaviour for 0.8.2-beta?
>>     > > >
>>     > > > Best regards,
>>     > > > Mathias
>>     > > >
>>     > >
>>     > >
>>     > >
>>     > > --
>>     > > Thanks,
>>     > > Neha
>>     > >
>>     >
>>
>>
>>
>>     --
>>     Thanks,
>>     Neha
>>
>>
>

Re: Increased CPU usage with 0.8.2-beta

Posted by Jaikiran Pai <ja...@gmail.com>.
Hi Mathias,

Looking at that thread dump, I think the potential culprit is this one:

TRACE 303545: (thread=200049)
sun.nio.ch.EPollArrayWrapper.epollWait(EPollArrayWrapper.java:Unknown line)
     sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
     sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
     sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:221)
sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103)
java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:385)
     kafka.utils.Utils$.read(Utils.scala:380)
kafka.network.BoundedByteBufferReceive.readFrom(BoundedByteBufferReceive.scala:67)
kafka.network.Receive$class.readCompletely(Transmission.scala:56)
kafka.network.BoundedByteBufferReceive.readCompletely(BoundedByteBufferReceive.scala:29)
kafka.network.BlockingChannel.receive(BlockingChannel.scala:108)
kafka.consumer.SimpleConsumer.liftedTree1$1(SimpleConsumer.scala:72)
kafka.consumer.SimpleConsumer.kafka$consumer$SimpleConsumer$$sendRequest(SimpleConsumer.scala:69)
kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(SimpleConsumer.scala:113)
kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:113)
kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:113)
     kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply$mcV$sp(SimpleConsumer.scala:112)
kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:112)
kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:112)
     kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
     kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:111)
kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:97)
kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:89)
kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60)


I see many such threads all triggered through the SimpleConsumer and 
ending up polling. Looking at the code, in theory, I can see why there 
might be a busy CPU loop generated by that code path. If my guess is 
right, it could be because of an issue in the implementation of how data 
is read off a channel in a blocking manner and I think this patch might 
help overcome that problem:

diff --git a/core/src/main/scala/kafka/network/Transmission.scala b/core/src/main/scala/kafka/network/Transmission.scala
index 2827103..0bab9ed 100644
--- a/core/src/main/scala/kafka/network/Transmission.scala
+++ b/core/src/main/scala/kafka/network/Transmission.scala
@@ -54,8 +54,15 @@ trait Receive extends Transmission {
      var totalRead = 0
      while(!complete) {
        val read = readFrom(channel)
-      trace(read + " bytes read.")
-      totalRead += read
+      if (read > 0) {
+        trace(read + " bytes read.")
+        totalRead += read
+      } else if (read == 0) {
+        // it's possible that nothing was read (see javadoc of ReadableByteChannel#read), from the backing channel,
+        // so we wait for a while before polling again, so that we don't end up with a busy CPU loop
+        // TODO: For now, this 30 milli seconds is a random value.
+        Thread.sleep(30)
+      }
      }
      totalRead
    }

Is this something that you would be able to apply against the latest 
0.8.2 branch of Kafka, build the Kafka binary, try it out and see if it 
improves the situation?
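
Roughly, assuming the diff above is saved to a file (the file name below is just
an example), that would be something like:

    git checkout 0.8.2
    git apply transmission-busy-loop.patch
    ./gradlew releaseTarGz   # builds a release tarball to run the patched broker from

(if the gradle wrapper isn't bootstrapped yet, a plain `gradle` run is needed
once first, as per the README).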

-Jaikiran

On Monday 26 January 2015 11:35 PM, Mathias Söderberg wrote:
> Hi Neha,
>
> I sent an e-mail earlier today, but noticed now that it didn't 
> actually go through.
>
> Anyhow, I've attached two files, one with output from a 10 minute run 
> and one with output from a 30 minute run. Realized that maybe I 
> should've done one or two runs with 0.8.1.1 as well, but nevertheless.
>
> I upgraded our staging cluster to 0.8.2.0-rc2, and I'm seeing the same 
> CPU usage as with the beta version (basically pegging all cores). If I 
> manage to find the time I'll do another run with hprof on the rc2 
> version later today.
>
> Best regards,
> Mathias
>
> On Tue Dec 09 2014 at 10:08:21 PM Neha Narkhede <neha@confluent.io 
> <ma...@confluent.io>> wrote:
>
>     The following should be sufficient
>
>     java
>     -agentlib:hprof=cpu=samples,depth=100,interval=20,lineno=y,thread=y,file=kafka.hprof
>     <classname>
>
>     You would need to start the Kafka server with the settings above for
>     sometime until you observe the problem.
>
>     On Tue, Dec 9, 2014 at 3:47 AM, Mathias Söderberg <
>     mathias.soederberg@gmail.com
>     <ma...@gmail.com>> wrote:
>
>     > Hi Neha,
>     >
>     > Yeah sure. I'm not familiar with hprof, so any particular
>     options I should
>     > include or just run with defaults?
>     >
>     > Best regards,
>     > Mathias
>     >
>     > On Mon Dec 08 2014 at 7:41:32 PM Neha Narkhede
>     <neha@confluent.io <ma...@confluent.io>> wrote:
>     >
>     > > Thanks for reporting the issue. Would you mind running hprof
>     and sending
>     > > the output?
>     > >
>     > > On Mon, Dec 8, 2014 at 1:25 AM, Mathias Söderberg <
>     > > mathias.soederberg@gmail.com
>     <ma...@gmail.com>> wrote:
>     > >
>     > > > Good day,
>     > > >
>     > > > I upgraded a Kafka cluster from v0.8.1.1 to v0.8.2-beta and
>     noticed
>     > that
>     > > > the CPU usage on the broker machines went up by roughly 40%,
>     from ~60%
>     > to
>     > > > ~100% and am wondering if anyone else has experienced something
>     > similar?
>     > > > The load average also went up by 2x-3x.
>     > > >
>     > > > We're running on EC2 and the cluster currently consists of four
>     > > m1.xlarge,
>     > > > with roughly 1100 topics / 4000 partitions. Using Java 7
>     (1.7.0_65 to
>     > be
>     > > > exact) and Scala 2.9.2. Configurations can be found over here:
>     > > > https://gist.github.com/mthssdrbrg/7df34a795e07eef10262.
>     > > >
>     > > > I'm assuming that this is not expected behaviour for 0.8.2-beta?
>     > > >
>     > > > Best regards,
>     > > > Mathias
>     > > >
>     > >
>     > >
>     > >
>     > > --
>     > > Thanks,
>     > > Neha
>     > >
>     >
>
>
>
>     --
>     Thanks,
>     Neha
>


Re: Increased CPU usage with 0.8.2-beta

Posted by Mathias Söderberg <ma...@gmail.com>.
Hi Neha,

I sent an e-mail earlier today, but noticed now that it didn't actually go
through.

Anyhow, I've attached two files, one with output from a 10 minute run and
one with output from a 30 minute run. Realized that maybe I should've done
one or two runs with 0.8.1.1 as well, but nevertheless.

I upgraded our staging cluster to 0.8.2.0-rc2, and I'm seeing the same CPU
usage as with the beta version (basically pegging all cores). If I manage
to find the time I'll do another run with hprof on the rc2 version later
today.

Best regards,
Mathias

On Tue Dec 09 2014 at 10:08:21 PM Neha Narkhede <ne...@confluent.io> wrote:

> The following should be sufficient
>
> java
> -agentlib:hprof=cpu=samples,depth=100,interval=20,lineno=
> y,thread=y,file=kafka.hprof
> <classname>
>
> You would need to start the Kafka server with the settings above for
> sometime until you observe the problem.
>
> On Tue, Dec 9, 2014 at 3:47 AM, Mathias Söderberg <
> mathias.soederberg@gmail.com> wrote:
>
> > Hi Neha,
> >
> > Yeah sure. I'm not familiar with hprof, so any particular options I
> should
> > include or just run with defaults?
> >
> > Best regards,
> > Mathias
> >
> > On Mon Dec 08 2014 at 7:41:32 PM Neha Narkhede <ne...@confluent.io>
> wrote:
> >
> > > Thanks for reporting the issue. Would you mind running hprof and
> sending
> > > the output?
> > >
> > > On Mon, Dec 8, 2014 at 1:25 AM, Mathias Söderberg <
> > > mathias.soederberg@gmail.com> wrote:
> > >
> > > > Good day,
> > > >
> > > > I upgraded a Kafka cluster from v0.8.1.1 to v0.8.2-beta and noticed
> > that
> > > > the CPU usage on the broker machines went up by roughly 40%, from
> ~60%
> > to
> > > > ~100% and am wondering if anyone else has experienced something
> > similar?
> > > > The load average also went up by 2x-3x.
> > > >
> > > > We're running on EC2 and the cluster currently consists of four
> > > m1.xlarge,
> > > > with roughly 1100 topics / 4000 partitions. Using Java 7 (1.7.0_65 to
> > be
> > > > exact) and Scala 2.9.2. Configurations can be found over here:
> > > > https://gist.github.com/mthssdrbrg/7df34a795e07eef10262.
> > > >
> > > > I'm assuming that this is not expected behaviour for 0.8.2-beta?
> > > >
> > > > Best regards,
> > > > Mathias
> > > >
> > >
> > >
> > >
> > > --
> > > Thanks,
> > > Neha
> > >
> >
>
>
>
> --
> Thanks,
> Neha
>

Re: Increased CPU usage with 0.8.2-beta

Posted by Neha Narkhede <ne...@confluent.io>.
The following should be sufficient

java
-agentlib:hprof=cpu=samples,depth=100,interval=20,lineno=y,thread=y,file=kafka.hprof
<classname>

You would need to start the Kafka server with the settings above for
some time, until you observe the problem.
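
For the broker specifically, the easiest way to pass that flag is usually via
KAFKA_OPTS when starting it, e.g.:

    export KAFKA_OPTS="-agentlib:hprof=cpu=samples,depth=100,interval=20,lineno=y,thread=y,file=kafka.hprof"
    bin/kafka-server-start.sh config/server.properties

The kafka.hprof file will be written to the directory the broker was started from.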

On Tue, Dec 9, 2014 at 3:47 AM, Mathias Söderberg <
mathias.soederberg@gmail.com> wrote:

> Hi Neha,
>
> Yeah sure. I'm not familiar with hprof, so any particular options I should
> include or just run with defaults?
>
> Best regards,
> Mathias
>
> On Mon Dec 08 2014 at 7:41:32 PM Neha Narkhede <ne...@confluent.io> wrote:
>
> > Thanks for reporting the issue. Would you mind running hprof and sending
> > the output?
> >
> > On Mon, Dec 8, 2014 at 1:25 AM, Mathias Söderberg <
> > mathias.soederberg@gmail.com> wrote:
> >
> > > Good day,
> > >
> > > I upgraded a Kafka cluster from v0.8.1.1 to v0.8.2-beta and noticed
> that
> > > the CPU usage on the broker machines went up by roughly 40%, from ~60%
> to
> > > ~100% and am wondering if anyone else has experienced something
> similar?
> > > The load average also went up by 2x-3x.
> > >
> > > We're running on EC2 and the cluster currently consists of four
> > m1.xlarge,
> > > with roughly 1100 topics / 4000 partitions. Using Java 7 (1.7.0_65 to
> be
> > > exact) and Scala 2.9.2. Configurations can be found over here:
> > > https://gist.github.com/mthssdrbrg/7df34a795e07eef10262.
> > >
> > > I'm assuming that this is not expected behaviour for 0.8.2-beta?
> > >
> > > Best regards,
> > > Mathias
> > >
> >
> >
> >
> > --
> > Thanks,
> > Neha
> >
>



-- 
Thanks,
Neha

Re: Increased CPU usage with 0.8.2-beta

Posted by Mathias Söderberg <ma...@gmail.com>.
Hi Neha,

Yeah sure. I'm not familiar with hprof, so any particular options I should
include or just run with defaults?

Best regards,
Mathias

On Mon Dec 08 2014 at 7:41:32 PM Neha Narkhede <ne...@confluent.io> wrote:

> Thanks for reporting the issue. Would you mind running hprof and sending
> the output?
>
> On Mon, Dec 8, 2014 at 1:25 AM, Mathias Söderberg <
> mathias.soederberg@gmail.com> wrote:
>
> > Good day,
> >
> > I upgraded a Kafka cluster from v0.8.1.1 to v0.8.2-beta and noticed that
> > the CPU usage on the broker machines went up by roughly 40%, from ~60% to
> > ~100% and am wondering if anyone else has experienced something similar?
> > The load average also went up by 2x-3x.
> >
> > We're running on EC2 and the cluster currently consists of four
> m1.xlarge,
> > with roughly 1100 topics / 4000 partitions. Using Java 7 (1.7.0_65 to be
> > exact) and Scala 2.9.2. Configurations can be found over here:
> > https://gist.github.com/mthssdrbrg/7df34a795e07eef10262.
> >
> > I'm assuming that this is not expected behaviour for 0.8.2-beta?
> >
> > Best regards,
> > Mathias
> >
>
>
>
> --
> Thanks,
> Neha
>

Re: Increased CPU usage with 0.8.2-beta

Posted by Neha Narkhede <ne...@confluent.io>.
Thanks for reporting the issue. Would you mind running hprof and sending
the output?

On Mon, Dec 8, 2014 at 1:25 AM, Mathias Söderberg <
mathias.soederberg@gmail.com> wrote:

> Good day,
>
> I upgraded a Kafka cluster from v0.8.1.1 to v0.8.2-beta and noticed that
> the CPU usage on the broker machines went up by roughly 40%, from ~60% to
> ~100% and am wondering if anyone else has experienced something similar?
> The load average also went up by 2x-3x.
>
> We're running on EC2 and the cluster currently consists of four m1.xlarge,
> with roughly 1100 topics / 4000 partitions. Using Java 7 (1.7.0_65 to be
> exact) and Scala 2.9.2. Configurations can be found over here:
> https://gist.github.com/mthssdrbrg/7df34a795e07eef10262.
>
> I'm assuming that this is not expected behaviour for 0.8.2-beta?
>
> Best regards,
> Mathias
>



-- 
Thanks,
Neha