Posted to users@kafka.apache.org by Alex Popiel <ap...@marchex.com> on 2016/04/26 01:40:31 UTC

Partition fetching stalls with 0.9.0 new consumer

Hello, folks.

I'm encountering a bizarre situation where it appears that fetching for specific partitions stalls when using the 0.9.0 new consumer.  I know that no partitions are paused for extended periods; I issue a resume for all assigned partitions immediately before doing a poll.  Despite this, I'm ending up with approximately 7 (it varies from 3-9) partitions where no records are delivered to the consumer, despite records continuing to be published to those partitions.  As a result, I routinely end up with partition lag in the thousands for this small subset of partitions, while all other partitions have a lag under twenty.

For scale, I have 3 brokers, 100 partitions, and 16 consumer instances.  Records range from 20k to 160k in size, typically around 30-40k.  Processing time is mostly linear with record size, on the order of 1 CPU-second per 6k of record data.  Because of the high processing time, processing is done multi-threaded across 34 cores, and if processing from a single poll hasn't completed within the heartbeat interval, I pause all assigned partitions, issue a poll(0) to force the heartbeat, and then resume all assigned partitions.
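
In case it's relevant, that keep-alive step looks roughly like the sketch below (simplified, with illustrative names, not my exact code; note that in the 0.9 API, pause() and resume() take varargs rather than a Collection):

    import java.util.Set;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class HeartbeatKeepAlive {
        // Pause every assigned partition so the next poll() returns no
        // records, poll(0) to drive the heartbeat to the coordinator,
        // then resume everything before the next real poll.
        public static void keepAlive(KafkaConsumer<?, ?> consumer) {
            Set<TopicPartition> assignment = consumer.assignment();
            TopicPartition[] assigned =
                assignment.toArray(new TopicPartition[assignment.size()]);
            consumer.pause(assigned);   // no fetching while paused
            consumer.poll(0);           // sends the heartbeat, returns quickly
            consumer.resume(assigned);  // fetch again on subsequent polls
        }
    }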

When partitions get wedged, bouncing one of the consumer instances (not necessarily the instance that would receive the partitions) will often unwedge them... but then other partitions get wedged instead.

I have more than sufficient CPU to process all the records, and much of the consumer instance time is spent waiting on a poll(60000) call that returns nothing from the partitions that are wedged.  Also, my brokers seem to be running cold, with less than 30% CPU utilization and less than 2 MB/sec of disk I/O.
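
For reference, the way I spot the wedged partitions is by comparing what each poll returns against the current assignment, roughly like this (an illustrative sketch, not my exact code):

    import java.util.HashSet;
    import java.util.Set;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class QuietPartitionCheck {
        // Report which assigned partitions delivered no records on this
        // poll; the same handful showing up here poll after poll, while
        // their lag keeps growing, is what I'm calling "wedged".
        public static void pollAndReport(KafkaConsumer<String, String> consumer) {
            ConsumerRecords<String, String> records = consumer.poll(60000);
            Set<TopicPartition> quiet = new HashSet<>(consumer.assignment());
            quiet.removeAll(records.partitions());
            System.out.println("No records this poll from: " + quiet);
        }
    }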

Has anyone seen anything like this?  Is it normal for the consumer fetcher to be biased in which partitions it fetches from?  Are there any suggestions on how to diagnose further?

- Alex

RE: Partition fetching stalls with 0.9.0 new consumer

Posted by Alex Popiel <ap...@marchex.com>.
Hello, Robert.

I upgraded to 0.9.0.1 and, after letting it bake for a day and a half, can confirm that the issue is resolved.  KAFKA-2978 was likely the culprit.

Thanks,
- Alex

-----Original Message-----
From: Underwood, Robert [mailto:Robert.Underwood@inin.com] 
Sent: Tuesday, April 26, 2016 2:51 PM
To: users@kafka.apache.org
Subject: Re: Partition fetching stalls with 0.9.0 new consumer

You may be hitting https://issues.apache.org/jira/browse/KAFKA-2978 if you're using 0.9.0.0.


Re: Partition fetching stalls with 0.9.0 new consumer

Posted by "Underwood, Robert" <Ro...@inin.com>.
You may be hitting https://issues.apache.org/jira/browse/KAFKA-2978 if you're using 0.9.0.0.
