Posted to users@kafka.apache.org by Jonathan Gordon <jo...@newrelic.com.INVALID> on 2018/11/06 17:49:19 UTC

Kafka Streams Session store performance degradation from 0.10.2.1 to 0.11.0.3

I have a Kafka Streams app that I'm trying to upgrade from 0.10.2.1 to
0.11.0.3, but when I do, CPU usage goes way up and consumption goes down. A
thread profile indicates that most of the time is spent in our aggregation,
fetching from the cache.

Thread profile with caching:
https://imgur.com/l5VEsC2

If I disable the cache, both performance and consumption are good, but we
then emit every single aggregation update downstream, which is not what we
want.
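
For readers following along: disabling the record cache usually means
setting `cache.max.bytes.buffering` to 0. The sketch below is a minimal
illustration using plain `java.util.Properties` and the standard Streams
config keys; the application id, broker address, and sizes are placeholder
assumptions, not the settings of the app discussed here.

```java
import java.util.Properties;

// Sketch only: builds Streams properties with the record cache off or on.
// "session-aggregator" and "localhost:9092" are hypothetical placeholders.
class CacheConfigSketch {
    static Properties withoutCaching() {
        Properties props = new Properties();
        props.setProperty("application.id", "session-aggregator");
        props.setProperty("bootstrap.servers", "localhost:9092");
        // 0 bytes of cache: every aggregation update is forwarded downstream.
        props.setProperty("cache.max.bytes.buffering", "0");
        return props;
    }

    static Properties withCaching(long cacheBytes, long commitIntervalMs) {
        Properties props = withoutCaching();
        // A non-zero cache lets updates to the same key be compacted
        // until the cache fills up or the commit interval elapses.
        props.setProperty("cache.max.bytes.buffering", Long.toString(cacheBytes));
        props.setProperty("commit.interval.ms", Long.toString(commitIntervalMs));
        return props;
    }

    public static void main(String[] args) {
        System.out.println(withoutCaching());
        System.out.println(withCaching(10L * 1024 * 1024, 30_000L));
    }
}
```

The resulting `Properties` object would be passed to the `KafkaStreams`
constructor in a real application.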

Thread profile without caching:
https://imgur.com/a/JK3nkou

I read this thread, which seems relevant:

https://lists.apache.org/thread.html/2b44e74eaec7172b107bcff96861cf8b4837f55a44714f69d033cc2e@%3Cusers.kafka.apache.org%3E

Notably: "Note, that caching was _not_ introduced to reduce the writes to
RocksDB, but to reduce the writes to the changelog topic and to reduce the
number of records sent downstream."
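
To make the quoted point concrete, here is a toy model of the effect. It is
not Kafka Streams code: `DedupCacheSketch` and its methods are hypothetical
names, and the model only shows how keeping the latest value per key until a
flush reduces the number of records forwarded downstream.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model, not Kafka Streams internals: a write-back cache that keeps only
// the latest value per key and forwards it downstream when flushed.
class DedupCacheSketch {
    private final Map<String, Long> dirty = new LinkedHashMap<>();
    final List<String> downstream = new ArrayList<>();

    void put(String key, long aggregate) {
        dirty.put(key, aggregate); // overwrites any earlier pending update
    }

    void flush() {
        dirty.forEach((k, v) -> downstream.add(k + "=" + v));
        dirty.clear();
    }

    public static void main(String[] args) {
        DedupCacheSketch cache = new DedupCacheSketch();
        for (long count = 1; count <= 1000; count++) {
            cache.put("session-a", count); // 1000 updates to the same key
        }
        cache.put("session-b", 1);
        cache.flush();
        // Only the final value of each key reaches downstream.
        System.out.println(cache.downstream);
    }
}
```

With the cache disabled, each of the 1001 updates in this model would be
forwarded individually, which matches the behavior described above.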

So how can we reduce the number of records sent downstream while
maintaining the same performance characteristics that we have with caching
turned off? Or put another way, how can I upgrade my app without taking a
hit in performance or behavior?

Thanks!

Re: Kafka Streams Session store performance degradation from 0.10.2.1 to 0.11.0.3

Posted by jo...@newrelic.com, jo...@newrelic.com.
On 2018/11/17 00:26:56, Guozhang Wang <wa...@gmail.com> wrote: 
> Could you create a JIRA with all the current available information uploaded
> on the ticket for me to further investigate the issue? This way we will not
> lose track of it (email list is not the best venue for potential bug
> investigation :).

Here you go. I've added some logs which show the issue pretty clearly:

https://issues.apache.org/jira/browse/KAFKA-7652

> In the meantime, I will try to compare the source code of 0.10.2 and 2.0
> and see if I can eyeball any obvious issues.

Great. Please let me know if there's any way I can assist.

Re: Kafka Streams Session store performance degradation from 0.10.2.1 to 0.11.0.3

Posted by Guozhang Wang <wa...@gmail.com>.
Hello Jonathan,

I've left a comment on https://issues.apache.org/jira/browse/KAFKA-7652
with a proposed fix for the bug discovered in trunk. If it proves to be the
right fix, I will push it to older branches as well.

Just FYI.


Guozhang


Re: Kafka Streams Session store performance degradation from 0.10.2.1 to 0.11.0.3

Posted by Guozhang Wang <wa...@gmail.com>.
Hi Jonathan,

Could you create a JIRA with all the currently available information
uploaded to the ticket so I can investigate the issue further? This way we
will not lose track of it (the email list is not the best venue for
potential bug investigation :).

In the meantime, I will try to compare the source code of 0.10.2 and 2.0
and see if I can eyeball any obvious issues.

Guozhang


Re: Kafka Streams Session store performance degradation from 0.10.2.1 to 0.11.0.3

Posted by jo...@newrelic.com, jo...@newrelic.com.
On 2018/11/08 00:13:39, "Matthias J. Sax" <ma...@confluent.io> wrote: 
> That is what I am trying to figure out. I went over the 0.10.2.2 to
> 0.11.0.3 Jiras but found nothing I could point out. There are a couple of
> SessionStore related tickets, but none of them should have an effect
> like this.
> 
> To narrow it down, it would be helpful to test with other versions, too.
> Maybe 0.10.2.2 and 0.11.0.0 to see when the issue was introduced.

Done. So far here's what my tests have shown:
With 0.10.2.1 (the version we're currently running) and 0.10.2.2, the local cache works properly and we see thread profiles similar to what I posted earlier, where the majority of time is spent in RocksDB and there's no lag.

Tests with 0.11.0.0, 0.11.0.3, 1.1.1, 2.0.0 and 2.0.1 all show us spending the majority of time in the local cache, and we lag considerably:

https://imgur.com/l5VEsC2

> Can you also profile v0.10.2.1 so we can compare?

Here's a recent profile for 0.10.2.1:

https://imgur.com/a/Sto636s

> > What would you recommend for our next steps? 
> 
> Not sure. If you could help us to track down the issue, that would be
> most helpful to get a fix (and you could run from a SNAPSHOT version to
> get the fix -- not sure if this would be an option for you).

Another developer took a look at the code and he had some thoughts:

"It appears we're scanning an order of magnitude more keys for every call to `findSessions`. You can see this manifest in the flush logs where version 0.11.0.3 and later will have a billion hits on the cache in 10 minutes, even though the number of events consumed is only 1M. It seems like when they made some fixes to make sure all possible windows for a session merge are found that resulted in having to scan every entry in the cache."

Is there a way for us to refine the cache search so we're not searching the entire key space? 
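
To illustrate the kind of refinement we mean, here is a simplified model. It
is not the actual Streams cache code: `SessionScanSketch` and `SessionKey`
are hypothetical names, and the lookup is reduced to bounds on the session
end time. The point is only that a sorted structure lets a per-key lookup
touch that key's entries instead of the whole cache.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simplified model, not the real Streams cache: sessions live in a sorted
// map ordered by record key, then by session end time.
class SessionScanSketch {
    static final class SessionKey implements Comparable<SessionKey> {
        final String key;
        final long endMs;
        SessionKey(String key, long endMs) { this.key = key; this.endMs = endMs; }
        @Override public int compareTo(SessionKey o) {
            int c = key.compareTo(o.key);
            return c != 0 ? c : Long.compare(endMs, o.endMs);
        }
    }

    final TreeMap<SessionKey, Long> cache = new TreeMap<>();
    long entriesVisited = 0;

    // Visits every cached entry and filters afterwards.
    List<Long> findSessionsFullScan(String key, long earliestEnd, long latestEnd) {
        List<Long> hits = new ArrayList<>();
        for (Map.Entry<SessionKey, Long> e : cache.entrySet()) {
            entriesVisited++;
            SessionKey k = e.getKey();
            if (k.key.equals(key) && k.endMs >= earliestEnd && k.endMs <= latestEnd) {
                hits.add(e.getValue());
            }
        }
        return hits;
    }

    // Visits only the sub-range for this key and these time bounds.
    List<Long> findSessionsRangeScan(String key, long earliestEnd, long latestEnd) {
        List<Long> hits = new ArrayList<>();
        for (Long v : cache.subMap(new SessionKey(key, earliestEnd), true,
                                   new SessionKey(key, latestEnd), true).values()) {
            entriesVisited++;
            hits.add(v);
        }
        return hits;
    }

    public static void main(String[] args) {
        SessionScanSketch s = new SessionScanSketch();
        for (int i = 0; i < 1000; i++) {
            s.cache.put(new SessionKey("key-" + i, 1000L), (long) i);
        }
        s.findSessionsFullScan("key-42", 0L, 2000L);
        System.out.println("full scan visited:  " + s.entriesVisited);
        s.entriesVisited = 0;
        s.findSessionsRangeScan("key-42", 0L, 2000L);
        System.out.println("range scan visited: " + s.entriesVisited);
    }
}
```

In this model the full scan cost grows with the total number of cached
sessions, while the range scan cost grows only with the number of sessions
for the requested key.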




Re: Kafka Streams Session store performance degradation from 0.10.2.1 to 0.11.0.3

Posted by "Matthias J. Sax" <ma...@confluent.io>.
Thanks for verifying.

>> From our perspective, it appears something happened after 0.10.2.1 that made the LRU Cache much slower for our use case. 

That is what I am trying to figure out. I went over the 0.10.2.2 to
0.11.0.3 Jiras but found nothing I could point out. There are a couple of
SessionStore related tickets, but none of them should have an effect
like this.

To narrow it down, it would be helpful to test with other versions, too.
Maybe 0.10.2.2 and 0.11.0.0 to see when the issue was introduced.

Can you also profile v0.10.2.1 so we can compare?

> What would you recommend for our next steps? 

Not sure. If you could help us to track down the issue, that would be
most helpful to get a fix (and you could run from a SNAPSHOT version to
get the fix -- not sure if this would be an option for you).


-Matthias





Re: Kafka Streams Session store performance degradation from 0.10.2.1 to 0.11.0.3

Posted by jo...@newrelic.com, jo...@newrelic.com.
Hi Matthias,

I upgraded to 2.0.0 and we're experiencing the same problem. I've posted a new screengrab of a thread profile:

https://imgur.com/a/2wncPHw

From our perspective, it appears something happened after 0.10.2.1 that made the LRU Cache much slower for our use case. What would you recommend for our next steps? 

Jonathan


Re: Kafka Streams Session store performance degradation from 0.10.2.1 to 0.11.0.3

Posted by "Matthias J. Sax" <ma...@confluent.io>.
Not sure atm why you see a performance degradation. Would need to dig
into the details.

However, did you consider upgrading to 2.0 instead of 0.11?

Also note that we added a new operator `suppress()` in the upcoming 2.1
release that allows you to do rate control without caching:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-328%3A+Ability+to+suppress+updates+for+KTables
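
For intuition only, rate control of this kind can be modeled as below. This
is a toy stand-in, not the KIP-328 implementation or its API:
`SuppressSketch` and its methods are hypothetical names. The model buffers
the latest value per key and emits it once a time limit has passed since the
key was first buffered.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of time-limited suppression; not the real suppress() operator.
class SuppressSketch {
    private static final class Pending {
        final long firstSeenMs;
        long value;
        Pending(long firstSeenMs, long value) {
            this.firstSeenMs = firstSeenMs;
            this.value = value;
        }
    }

    private final long limitMs;
    private final Map<String, Pending> buffer = new LinkedHashMap<>();
    final List<String> emitted = new ArrayList<>();

    SuppressSketch(long limitMs) { this.limitMs = limitMs; }

    // Called for every upstream update; nowMs stands in for the record time.
    void update(String key, long value, long nowMs) {
        Pending p = buffer.get(key);
        if (p == null) {
            buffer.put(key, new Pending(nowMs, value));
        } else {
            p.value = value; // keep only the latest value for this key
        }
        emitExpired(nowMs);
    }

    // Emits and removes every key that has been buffered for limitMs or longer.
    private void emitExpired(long nowMs) {
        Iterator<Map.Entry<String, Pending>> it = buffer.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Pending> e = it.next();
            if (nowMs - e.getValue().firstSeenMs >= limitMs) {
                emitted.add(e.getKey() + "=" + e.getValue().value);
                it.remove();
            }
        }
    }

    public static void main(String[] args) {
        SuppressSketch s = new SuppressSketch(100L);
        s.update("a", 1L, 0L);
        s.update("a", 2L, 50L);
        s.update("a", 3L, 100L); // limit reached: only the latest value is emitted
        System.out.println(s.emitted);
    }
}
```

The real operator is configured through the DSL rather than written by hand;
the KIP linked above describes the available options.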

Hope this helps.


-Matthias
