You are viewing a plain text version of this content. The canonical link for it is here.

Posted to users@kafka.apache.org by Wes Chow <we...@chartbeat.com> on 2014/11/11 19:50:00 UTC

spikes in producer requests/sec

We're seeing periodic spikes in req/sec rates across our nodes. Our 
cluster is 10 nodes, and the topic has a replication factor of 3. We 
push around 200k messages / sec into Kafka.


The machines are running the most recent version of Kafka and we're 
connecting via librdkafka. pingstream02-10 are using the CMS garbage 
collector, but I switched pingstream01 to use G1GC under the theory that 
maybe these were GC pauses. The graph shows that likely didn't improve 
the situation.

My next thought is that maybe this is the effect of log rolling. 
Checking in the logs, I see a lot of this:

[2014-11-11 13:46:45,836] 72952071 [ReplicaFetcherThread-0-7] INFO  
kafka.log.Log  - Rolled new log segment for 'pings-342' in 3 ms.
[2014-11-11 13:46:47,116] 72953351 [kafka-request-handler-0] INFO  
kafka.log.Log  - Rolled new log segment for 'pings-186' in 2 ms.
[2014-11-11 13:46:48,155] 72954390 [ReplicaFetcherThread-0-8] INFO  
kafka.log.Log  - Rolled new log segment for 'pings-253' in 3 ms.
[2014-11-11 13:46:48,408] 72954643 [ReplicaFetcherThread-0-4] INFO  
kafka.log.Log  - Rolled new log segment for 'pings-209' in 3 ms.
[2014-11-11 13:46:48,436] 72954671 [ReplicaFetcherThread-0-4] INFO  
kafka.log.Log  - Rolled new log segment for 'pings-299' in 2 ms.
[2014-11-11 13:46:48,687] 72954922 [kafka-request-handler-0] INFO  
kafka.log.Log  - Rolled new log segment for 'pings-506' in 2 ms.

The "pings" topic in question has 512 partitions, so it does this 512 
times every so often. We have an effective retention period of a bit 
less than 30 min, so rolling happens pretty frequently. Still, if I 
assume worst case that rolling locks up the process for 2ms and there 
are 512 rolls every few minutes, I'd expect halting to happen for about 
a second at a time. The graphs seem to indicate much longer dips, but 
it's hard for me to know if I'm looking at real data or some sort of 
artifact.

Fwiw, the producers are not reporting any errors, so it does not seem 
like we're losing data.

I'm new to Kafka. Should I be worried? If so, how should I be debugging 
this?

Thanks,
Wes

Re: spikes in producer requests/sec

Posted by Wes Chow <we...@chartbeat.com>.

Thanks for suggestions everyone. Will get some more metrics into place 
and report back.

Wes

> Jun Rao <ma...@gmail.com>
> November 12, 2014 at 12:24 PM
> You can also look into the producer time metrics and see if the time 
> goes up before the spike of the request rate. If the time does go up, 
> you can look at the breakdown of the time. Any latency due to local 
> I/O will be included in the local time.
>
> Thanks,
>
> Jun
>
>
> Wes Chow <ma...@chartbeat.com>
> November 11, 2014 at 1:50 PM
>
> We're seeing periodic spikes in req/sec rates across our nodes. Our 
> cluster is 10 nodes, and the topic has a replication factor of 3. We 
> push around 200k messages / sec into Kafka.
>
>
> The machines are running the most recent version of Kafka and we're 
> connecting via librdkafka. pingstream02-10 are using the CMS garbage 
> collector, but I switched pingstream01 to use G1GC under the theory 
> that maybe these were GC pauses. The graph shows that likely didn't 
> improve the situation.
>
> My next thought is that maybe this is the effect of log rolling. 
> Checking in the logs, I see a lot of this:
>
> [2014-11-11 13:46:45,836] 72952071 [ReplicaFetcherThread-0-7] INFO  
> kafka.log.Log  - Rolled new log segment for 'pings-342' in 3 ms.
> [2014-11-11 13:46:47,116] 72953351 [kafka-request-handler-0] INFO  
> kafka.log.Log  - Rolled new log segment for 'pings-186' in 2 ms.
> [2014-11-11 13:46:48,155] 72954390 [ReplicaFetcherThread-0-8] INFO  
> kafka.log.Log  - Rolled new log segment for 'pings-253' in 3 ms.
> [2014-11-11 13:46:48,408] 72954643 [ReplicaFetcherThread-0-4] INFO  
> kafka.log.Log  - Rolled new log segment for 'pings-209' in 3 ms.
> [2014-11-11 13:46:48,436] 72954671 [ReplicaFetcherThread-0-4] INFO  
> kafka.log.Log  - Rolled new log segment for 'pings-299' in 2 ms.
> [2014-11-11 13:46:48,687] 72954922 [kafka-request-handler-0] INFO  
> kafka.log.Log  - Rolled new log segment for 'pings-506' in 2 ms.
>
> The "pings" topic in question has 512 partitions, so it does this 512 
> times every so often. We have an effective retention period of a bit 
> less than 30 min, so rolling happens pretty frequently. Still, if I 
> assume worst case that rolling locks up the process for 2ms and there 
> are 512 rolls every few minutes, I'd expect halting to happen for 
> about a second at a time. The graphs seem to indicate much longer 
> dips, but it's hard for me to know if I'm looking at real data or some 
> sort of artifact.
>
> Fwiw, the producers are not reporting any errors, so it does not seem 
> like we're losing data.
>
> I'm new to Kafka. Should I be worried? If so, how should I be 
> debugging this?
>
> Thanks,
> Wes
>

Re: spikes in producer requests/sec

Posted by Jun Rao <ju...@gmail.com>.

You can also look into the producer time metrics and see if the time goes
up before the spike of the request rate. If the time does go up, you can
look at the breakdown of the time. Any latency due to local I/O will be
included in the local time.

Thanks,

Jun

On Tue, Nov 11, 2014 at 10:50 AM, Wes Chow <we...@chartbeat.com> wrote:

>
> We're seeing periodic spikes in req/sec rates across our nodes. Our
> cluster is 10 nodes, and the topic has a replication factor of 3. We push
> around 200k messages / sec into Kafka.
>
>
> The machines are running the most recent version of Kafka and we're
> connecting via librdkafka. pingstream02-10 are using the CMS garbage
> collector, but I switched pingstream01 to use G1GC under the theory that
> maybe these were GC pauses. The graph shows that likely didn't improve the
> situation.
>
> My next thought is that maybe this is the effect of log rolling. Checking
> in the logs, I see a lot of this:
>
> [2014-11-11 13:46:45,836] 72952071 [ReplicaFetcherThread-0-7] INFO
> kafka.log.Log  - Rolled new log segment for 'pings-342' in 3 ms.
> [2014-11-11 13:46:47,116] 72953351 [kafka-request-handler-0] INFO
> kafka.log.Log  - Rolled new log segment for 'pings-186' in 2 ms.
> [2014-11-11 13:46:48,155] 72954390 [ReplicaFetcherThread-0-8] INFO
> kafka.log.Log  - Rolled new log segment for 'pings-253' in 3 ms.
> [2014-11-11 13:46:48,408] 72954643 [ReplicaFetcherThread-0-4] INFO
> kafka.log.Log  - Rolled new log segment for 'pings-209' in 3 ms.
> [2014-11-11 13:46:48,436] 72954671 [ReplicaFetcherThread-0-4] INFO
> kafka.log.Log  - Rolled new log segment for 'pings-299' in 2 ms.
> [2014-11-11 13:46:48,687] 72954922 [kafka-request-handler-0] INFO
> kafka.log.Log  - Rolled new log segment for 'pings-506' in 2 ms.
>
> The "pings" topic in question has 512 partitions, so it does this 512
> times every so often. We have an effective retention period of a bit less
> than 30 min, so rolling happens pretty frequently. Still, if I assume worst
> case that rolling locks up the process for 2ms and there are 512 rolls
> every few minutes, I'd expect halting to happen for about a second at a
> time. The graphs seem to indicate much longer dips, but it's hard for me to
> know if I'm looking at real data or some sort of artifact.
>
> Fwiw, the producers are not reporting any errors, so it does not seem like
> we're losing data.
>
> I'm new to Kafka. Should I be worried? If so, how should I be debugging
> this?
>
> Thanks,
> Wes
>
>

Re: spikes in producer requests/sec

Posted by Jay Kreps <ja...@gmail.com>.

There are some fixes in 0.8.2-beta for periodic latency spikes if you are
using acks=-1 in the producer.

-Jay

On Tue, Nov 11, 2014 at 10:50 AM, Wes Chow <we...@chartbeat.com> wrote:

>
> We're seeing periodic spikes in req/sec rates across our nodes. Our
> cluster is 10 nodes, and the topic has a replication factor of 3. We push
> around 200k messages / sec into Kafka.
>
>
> The machines are running the most recent version of Kafka and we're
> connecting via librdkafka. pingstream02-10 are using the CMS garbage
> collector, but I switched pingstream01 to use G1GC under the theory that
> maybe these were GC pauses. The graph shows that likely didn't improve the
> situation.
>
> My next thought is that maybe this is the effect of log rolling. Checking
> in the logs, I see a lot of this:
>
> [2014-11-11 13:46:45,836] 72952071 [ReplicaFetcherThread-0-7] INFO
> kafka.log.Log  - Rolled new log segment for 'pings-342' in 3 ms.
> [2014-11-11 13:46:47,116] 72953351 [kafka-request-handler-0] INFO
> kafka.log.Log  - Rolled new log segment for 'pings-186' in 2 ms.
> [2014-11-11 13:46:48,155] 72954390 [ReplicaFetcherThread-0-8] INFO
> kafka.log.Log  - Rolled new log segment for 'pings-253' in 3 ms.
> [2014-11-11 13:46:48,408] 72954643 [ReplicaFetcherThread-0-4] INFO
> kafka.log.Log  - Rolled new log segment for 'pings-209' in 3 ms.
> [2014-11-11 13:46:48,436] 72954671 [ReplicaFetcherThread-0-4] INFO
> kafka.log.Log  - Rolled new log segment for 'pings-299' in 2 ms.
> [2014-11-11 13:46:48,687] 72954922 [kafka-request-handler-0] INFO
> kafka.log.Log  - Rolled new log segment for 'pings-506' in 2 ms.
>
> The "pings" topic in question has 512 partitions, so it does this 512
> times every so often. We have an effective retention period of a bit less
> than 30 min, so rolling happens pretty frequently. Still, if I assume worst
> case that rolling locks up the process for 2ms and there are 512 rolls
> every few minutes, I'd expect halting to happen for about a second at a
> time. The graphs seem to indicate much longer dips, but it's hard for me to
> know if I'm looking at real data or some sort of artifact.
>
> Fwiw, the producers are not reporting any errors, so it does not seem like
> we're losing data.
>
> I'm new to Kafka. Should I be worried? If so, how should I be debugging
> this?
>
> Thanks,
> Wes
>
>

Re: spikes in producer requests/sec

Posted by Magnus Edenhill <ma...@edenhill.se>.

Hi Wes,

are you monitoring librdkafka statistics as well?
If so, are there any correlating spikes in the per-broker and per-partition
statistics?
Such as:
 - brokers.<NAME>.rtt.avg  <--- broker round-trip-time (latency)
 - brokers.<NAME>.waitresp_cnt  <-- requests in flight
 - topics.<TOPIC>.partitions.<N>.msgq_cnt   <-- internal message queue
 - topics.<TOPIC>.partitions.<N>.xmitq_cnt   <-- transmit queue

Regards,
Magnus



2014-11-11 19:50 GMT+01:00 Wes Chow <we...@chartbeat.com>:

>
> We're seeing periodic spikes in req/sec rates across our nodes. Our
> cluster is 10 nodes, and the topic has a replication factor of 3. We push
> around 200k messages / sec into Kafka.
>
>
> The machines are running the most recent version of Kafka and we're
> connecting via librdkafka. pingstream02-10 are using the CMS garbage
> collector, but I switched pingstream01 to use G1GC under the theory that
> maybe these were GC pauses. The graph shows that likely didn't improve the
> situation.
>
> My next thought is that maybe this is the effect of log rolling. Checking
> in the logs, I see a lot of this:
>
> [2014-11-11 13:46:45,836] 72952071 [ReplicaFetcherThread-0-7] INFO
> kafka.log.Log  - Rolled new log segment for 'pings-342' in 3 ms.
> [2014-11-11 13:46:47,116] 72953351 [kafka-request-handler-0] INFO
> kafka.log.Log  - Rolled new log segment for 'pings-186' in 2 ms.
> [2014-11-11 13:46:48,155] 72954390 [ReplicaFetcherThread-0-8] INFO
> kafka.log.Log  - Rolled new log segment for 'pings-253' in 3 ms.
> [2014-11-11 13:46:48,408] 72954643 [ReplicaFetcherThread-0-4] INFO
> kafka.log.Log  - Rolled new log segment for 'pings-209' in 3 ms.
> [2014-11-11 13:46:48,436] 72954671 [ReplicaFetcherThread-0-4] INFO
> kafka.log.Log  - Rolled new log segment for 'pings-299' in 2 ms.
> [2014-11-11 13:46:48,687] 72954922 [kafka-request-handler-0] INFO
> kafka.log.Log  - Rolled new log segment for 'pings-506' in 2 ms.
>
> The "pings" topic in question has 512 partitions, so it does this 512
> times every so often. We have an effective retention period of a bit less
> than 30 min, so rolling happens pretty frequently. Still, if I assume worst
> case that rolling locks up the process for 2ms and there are 512 rolls
> every few minutes, I'd expect halting to happen for about a second at a
> time. The graphs seem to indicate much longer dips, but it's hard for me to
> know if I'm looking at real data or some sort of artifact.
>
> Fwiw, the producers are not reporting any errors, so it does not seem like
> we're losing data.
>
> I'm new to Kafka. Should I be worried? If so, how should I be debugging
> this?
>
> Thanks,
> Wes
>
>