Posted to dev@samza.apache.org by Thunder Stumpges <ts...@ntent.com> on 2018/06/08 20:35:40 UTC

Urgent : Help with latency / backlog / topic lag

We have a new samza job which we just put into production. This job processes many topics (~30) but the total rate is not that high (~1200/sec in aggregate). I am unable to get above ~700/sec and have a growing backlog.

We are running samza 0.12 (I have an update to 0.14 that is not tested or pushed yet).  When we load tested with a single topic, we could easily do several thousand per second. The latency of a single message is about 0.5ms as recorded by our timer metric on our 'process' call.

What we believe may be happening is that most of the topics have no backlog, but one topic has all the backlog (this is because one of the topics accounts for ~60% of the total message rate).  Could something be inducing extra latency on the one topic with a backlog, simply because there are a bunch of other topics with NO backlog?

Some things I have tried:


  1.  Increasing thread pool (10->20->30), no change
  2.  Going from 1 container to 2, no help (the two containers run at half the speed and total is the same)
  3.  Increasing task.max.concurrency from 1 -> 2 -> 3  (this had some minor help going from 1 to 2, but not enough)
  4.  Increasing fetch.threshold.bytes (currently at 100,000 and we have pretty small messages)

Some observed metrics:


  *   "Pending Messages" are > 0  (15+ on some partitions)
  *   "Messages in flight" is almost always 0
  *   Polls rate is ~50/sec
  *   Message chooser "Choose Obj" is ~680-700/sec, matching our processing rate
  *   Message chooser "choose null" is ~50/sec

I'm somewhat at a loss because based on the actual processing latency we should easily be able to do 2000+ with just a small handful of threads.

Thanks in advance. This is in production and I really need a solution.
Thunder
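
As a rough sanity check on the numbers in this post, here is a minimal Python sketch of the arithmetic. It assumes messages are handled strictly one at a time on a single thread, and it uses only the figures quoted above (0.5 ms per 'process' call, ~700 messages/sec observed, ~50 polls/sec). The function names are hypothetical and are not part of Samza.

```python
def max_serial_rate(process_latency_ms: float) -> float:
    """Upper bound on messages/sec if one thread did nothing but call process()."""
    return 1000.0 / process_latency_ms


def per_message_overhead_ms(observed_rate: float, process_latency_ms: float) -> float:
    """Wall-clock time per message that is not spent inside process()."""
    return 1000.0 / observed_rate - process_latency_ms


ceiling = max_serial_rate(0.5)                  # 0.5 ms per process() call
overhead = per_message_overhead_ms(700.0, 0.5)  # ~700 msg/sec observed
messages_per_poll = 700.0 / 50.0                # chooser rate / poll rate

print(f"serial ceiling: {ceiling:.0f} msg/sec")
print(f"time outside process(): {overhead:.2f} ms per message")
print(f"messages chosen per poll: {messages_per_poll:.0f}")
```

If these assumptions hold, roughly two thirds of the time per message is spent outside the task's own code, which points at the event loop rather than the 'process' call itself.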

Re: Urgent : Help with latency / backlog / topic lag

Posted by Prateek Maheshwari <pr...@gmail.com>.

Good to hear it's working now. Please feel free to reach out if you run
into any issues.

When you upgrade the job to 0.14.1, please remember to change
single.thread.mode to false (or remove this configuration): The bug in
SAMZA-1599 only affects the AsyncRunLoop implementation. Setting
"single.thread.mode = true" reverts to the older and synchronous
RunLoop implementation. Since the bug is fixed for AsyncRunLoop in
0.14.1, you can go back to using it.

Thanks,
Prateek

On Fri, Jun 8, 2018 at 3:23 PM, Thunder Stumpges <ts...@ntent.com> wrote:
> I set job.container.single.thread.mode = true
>
> And actually I think we did catch up with that setting. I have since also completed the merge of 0.14.1 and we are able to keep up with the input now.
>
> Thanks again for the pointers and the fast response!
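
The advice in this thread can be summarized as a small Python sketch that picks the job.container.single.thread.mode override from the Samza version. It is only an illustration: run_loop_override and to_properties are hypothetical helpers, not part of Samza, and the sketch assumes that every release before 0.14.1 needs the workaround, which this thread confirms only for 0.12.

```python
def parse_version(version: str) -> tuple:
    """Turn a dotted version string such as '0.14.1' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def run_loop_override(samza_version: str) -> dict:
    """Return the config override suggested in this thread for a given version.

    Assumption: releases older than 0.14.1 are affected by SAMZA-1599 and
    should fall back to the legacy synchronous RunLoop.
    """
    fixed_in = (0, 14, 1)
    use_legacy_loop = parse_version(samza_version) < fixed_in
    return {"job.container.single.thread.mode": "true" if use_legacy_loop else "false"}


def to_properties(config: dict) -> str:
    """Render a config dict as key=value lines for a .properties file."""
    return "\n".join(f"{key}={value}" for key, value in sorted(config.items()))


for version in ("0.12", "0.14.0", "0.14.1"):
    print(f"# Samza {version}")
    print(to_properties(run_loop_override(version)))
```

Since false is the documented default, removing the key entirely after the upgrade has the same effect as setting it to false.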

RE: Urgent : Help with latency / backlog / topic lag

Posted by Thunder Stumpges <ts...@ntent.com>.

I set job.container.single.thread.mode = true

And actually I think we did catch up with that setting. I have since also completed the merge of 0.14.1 and we are able to keep up with the input now.

Thanks again for the pointers and the fast response!

-----Original Message-----
From: Prateek Maheshwari [mailto:prateekmi2@gmail.com] 
Sent: Friday, June 8, 2018 15:00
To: dev@samza.apache.org
Subject: Re: Urgent : Help with latency / backlog / topic lag

Just to clarify, when you say you tried single threaded mode, do you mean that you set job.container.thread.pool.size = 1, or that you set job.container.single.thread.mode = true?

Re: Urgent : Help with latency / backlog / topic lag

Posted by Prateek Maheshwari <pr...@gmail.com>.

Just to clarify, when you say you tried single threaded mode, do you
mean that you set job.container.thread.pool.size = 1, or that you set
job.container.single.thread.mode = true?

On Fri, Jun 8, 2018 at 2:53 PM, Thunder Stumpges <ts...@ntent.com> wrote:
> Thanks for the quick reply. That sounds very much like what I'm seeing. I'm merging in 0.14.1 to our branch now. I did try single threaded mode and unfortunately that didn't seem to make a significant difference. Perhaps I do need some multithreading? I'm seeing a task latency of 0.2ms per message but still only achieving ~700/sec.
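
The distinction behind this question can be sketched as a small Python check. It relies on the configuration reference quoted in this thread (single.thread.mode falls back to the legacy single-threaded event loop, default false) and on the assumption, implied but not stated here, that a thread pool size of 1 still leaves the AsyncRunLoop in place. The helper name uses_legacy_run_loop is hypothetical.

```python
def uses_legacy_run_loop(config: dict) -> bool:
    """True only when the job is explicitly switched to the legacy RunLoop.

    The thread pool size is deliberately ignored: under the assumption above,
    it changes how many worker threads exist, not which event loop is used.
    """
    value = config.get("job.container.single.thread.mode", "false")
    return value.strip().lower() == "true"


pool_of_one = {"job.container.thread.pool.size": "1"}
single_thread_mode = {"job.container.single.thread.mode": "true"}

print(uses_legacy_run_loop(pool_of_one))         # False
print(uses_legacy_run_loop(single_thread_mode))  # True
```

Under that assumption, only the second configuration would sidestep a bug that lives in the AsyncRunLoop.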

RE: Urgent : Help with latency / backlog / topic lag

Posted by Thunder Stumpges <ts...@ntent.com>.

Thanks for the quick reply. That sounds very much like what I'm seeing. I'm merging in 0.14.1 to our branch now. I did try single threaded mode and unfortunately that didn't seem to make a significant difference. Perhaps I do need some multithreading? I'm seeing a task latency of 0.2ms per message but still only achieving ~700/sec.


-----Original Message-----
From: Prateek Maheshwari [mailto:prateekmi2@gmail.com] 
Sent: Friday, June 8, 2018 13:54
To: dev@samza.apache.org
Subject: Re: Urgent : Help with latency / backlog / topic lag

Hi Thunder,

> What we believe may be happening is that most of the topics have no backlog, but one topic has all the backlog (this is because one of the topics accounts for ~60% of the total message rate).  Could there be something inducing extra latency on processing the one topic with a backlog just having a bunch of other topics with NO backlog?

This seems very similar to this issue:
https://issues.apache.org/jira/browse/SAMZA-1599
This was fixed in https://github.com/apache/samza/pull/436, and the fix should be available in the 0.14.1 version.
Would it be possible to try upgrading to 0.14.1? It should be backwards compatible with 0.14.0.

For something you can try without upgrading: try setting "job.container.single.thread.mode" to true. From the configuration reference
<https://samza.apache.org/learn/documentation/latest/jobs/configuration-table.html>:
"If set to true, samza will fallback to legacy single-threaded event loop.
Default is false, which enables the multithreading execution."

Let us know if this doesn't help.

Thanks,
Prateek

Re: Urgent : Help with latency / backlog / topic lag

Posted by Prateek Maheshwari <pr...@gmail.com>.

Hi Thunder,

> What we believe may be happening is that most of the topics have no
> backlog, but one topic has all the backlog (this is because one of the
> topics accounts for ~60% of the total message rate).  Could there be
> something inducing extra latency on processing the one topic with a backlog
> just having a bunch of other topics with NO backlog?

This seems very similar to this issue:
https://issues.apache.org/jira/browse/SAMZA-1599
This was fixed in https://github.com/apache/samza/pull/436, and the fix
should be available in the 0.14.1 version.
Would it be possible to try upgrading to 0.14.1? It should be backwards
compatible with 0.14.0.

For something you can try without upgrading: try setting
"job.container.single.thread.mode" to true. From the configuration reference
<https://samza.apache.org/learn/documentation/latest/jobs/configuration-table.html>:
"If set to true, samza will fallback to legacy single-threaded event loop.
Default is false, which enables the multithreading execution."

Let us know if this doesn't help.

Thanks,
Prateek