Posted to users@kafka.apache.org by Martin Skøtt <ma...@falconsocial.com> on 2015/11/25 15:54:02 UTC

Consumer group disappears and consumers loops

Hi,

I'm experiencing some very strange issues with 0.9. I get these log
messages from the new consumer:

[main] ERROR
org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - Error
ILLEGAL_GENERATION occurred while committing offsets for group
aaa-bbb-reader
[main] WARN org.apache.kafka.clients.consumer.internals.ConsumerCoordinator
- Auto offset commit failed: Commit cannot be completed due to group
rebalance
[main] ERROR
org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - Error
ILLEGAL_GENERATION occurred while committing offsets for group
aaa-bbb-reader
[main] WARN org.apache.kafka.clients.consumer.internals.ConsumerCoordinator
- Auto offset commit failed:
[main] INFO org.apache.kafka.clients.consumer.internals.AbstractCoordinator
- Attempt to join group aaa-bbb-reader failed due to unknown member id,
resetting and retrying.

And this in the broker log:
[2015-11-25 15:41:01,542] INFO [GroupCoordinator 0]: Preparing to
restabilize group aaa-bbb-reader with old generation 1
(kafka.coordinator.GroupCoordinator)
[2015-11-25 15:41:01,544] INFO [GroupCoordinator 0]:
Group aaa-bbb-reader generation 1 is dead and removed
(kafka.coordinator.GroupCoordinator)
[2015-11-25 15:41:13,474] INFO [GroupCoordinator 0]: Preparing to
restabilize group aaa-bbb-reader with old generation 0
(kafka.coordinator.GroupCoordinator)
[2015-11-25 15:41:13,475] INFO [GroupCoordinator 0]: Stabilized
group aaa-bbb-reader generation 1 (kafka.coordinator.GroupCoordinator)
[2015-11-25 15:41:13,477] INFO [GroupCoordinator 0]: Assignment received
from leader for group aaa-bbb-reader for generation 1
(kafka.coordinator.GroupCoordinator)
[2015-11-25 15:41:43,478] INFO [GroupCoordinator 0]: Preparing to
restabilize group aaa-bbb-reader with old generation 1
(kafka.coordinator.GroupCoordinator)
[2015-11-25 15:41:43,478] INFO [GroupCoordinator 0]:
Group aaa-bbb-reader generation 1 is dead and removed
(kafka.coordinator.GroupCoordinator)

When this happens, the kafka-consumer-groups describe command keeps saying
that the group no longer exists or is rebalancing. What is probably even
worse is that my consumers appear to be looping constantly through
everything written to the topics!?
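(The describe invocation, for reference, would be along these lines with the
0.9 tooling, localhost:9092 being the local broker from the setup below:

bin/kafka-consumer-groups.sh --new-consumer --bootstrap-server localhost:9092 --describe --group aaa-bbb-reader
)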

Does anyone have any input on what might be happening?

I'm running 0.9 locally on my laptop using one Zookeeper and one broker,
both using the configuration provided in the distribution. I have 13 topics
with two partitions each and a replication factor of 1. I run one producer
and one consumer, also on the same machine.

-- 
Martin Skøtt

Re: Consumer group disappears and consumers loops

Posted by Phillip Walker <pw...@bandwidth.com>.
I've done some work isolating the problem a bit more, running the code
normally instead of through my IDE, etc. KIP-41 should resolve part of the
problem, but I found some potentially related issues. The UNKNOWN_MEMBER_ID
error does occur, as expected, when a given thread fails to call poll()
again within the session timeout period, and it generally happens when the
previous poll returned several times the number of records it normally
does.

My code launches a number of identical threads for processing, with each
thread having its own consumer instance. The topic being consumed has 120
partitions with millions of messages in each partition, and I am processing
from the beginning, so there's never a time before any error that a thread
is caught up.

What seems to be happening is that the higher the value of
max.partition.fetch.bytes (though still well below the default for that
setting), the more likely it is that poll() returns a fairly consistent
number of records for a while, then returns 0 records for two or three
passes (or more for larger max.partition.fetch.bytes values), and then
returns a batch many times the usual size which my code can't process in
time. These 0-record polls seem to be arbitrary; they don't correlate with
rebalancing, with anything in the server logs, or with what is happening in
other threads. I'm not sure why I get 0 records sometimes, given that the
code runs on a dedicated VM with huge bandwidth, but the longer the polling
time, the less likely I am to get 0 records (and thus the fewer oversized
batches and UNKNOWN_MEMBER_ID errors).
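
For a rough, back-of-the-envelope sense of scale (my arithmetic, not
measured): each poll can hand back up to max.partition.fetch.bytes for every
partition that returns data in the underlying fetch. With the 40 partitions
I typically see per call, that is about 40 x 102400 bytes = ~4 MB of records
at my current setting, and roughly 40 MB at the 1 MB default - which lines
up with small settings being easy to process in time and larger ones
occasionally producing a batch my code can't get through before the session
expires.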

Once a thread gets this error, it shuts down and is automatically restarted
by my code, and processing resumes. However, all the other threads that
were running at the time of the error then return 0 records from poll()
continuously until I kill and restart them, despite each having its own
consumer instance, which makes me wonder whether there is a server-side
issue or whether I am confused about something in the client. The server
log messages when the client-side error first occurs do tend to indicate a
rebalance, but show nothing unusual:

[2016-01-11 22:57:03,131] INFO [GroupCoordinator 0]: Group myGroup
generation 2 is dead and removed (kafka.coordinator.GroupCoordinator)
[2016-01-11 22:57:11,446] INFO [GroupCoordinator 0]: Preparing to
restabilize group myGroup with old generation 0
(kafka.coordinator.GroupCoordinator)
[2016-01-11 22:57:11,447] INFO [GroupCoordinator 0]: Stabilized group
myGroup generation 1 (kafka.coordinator.GroupCoordinator)
[2016-01-11 22:57:11,447] INFO [GroupCoordinator 0]: Preparing to
restabilize group myGroup with old generation 1
(kafka.coordinator.GroupCoordinator)
[2016-01-11 22:57:11,459] INFO [GroupCoordinator 0]: Stabilized group
myGroup generation 2 (kafka.coordinator.GroupCoordinator)
[2016-01-11 22:57:11,466] INFO [GroupCoordinator 0]: Assignment received
from leader for group myGroup for generation 2
(kafka.coordinator.GroupCoordinator)
[2016-01-11 22:57:52,715] INFO [GroupCoordinator 0]: Preparing to
restabilize group myGroup with old generation 2
(kafka.coordinator.GroupCoordinator)
[2016-01-11 22:57:58,309] INFO [GroupCoordinator 0]: Stabilized group
myGroup generation 3 (kafka.coordinator.GroupCoordinator)
[2016-01-11 22:57:58,622] INFO [GroupCoordinator 0]: Assignment received
from leader for group myGroup for generation 3
(kafka.coordinator.GroupCoordinator)


A basic outline of the thread code:
...
@Autowired
@Qualifier("kafkaConsumer.properties")
private Properties consumerConfig;

@Value("${consumerGroup}")
private String groupId;

@Value("${topic}")
private String kafkaTopic;

@Value("${pollingTime}")
private int pollingTime;

private AtomicBoolean shutdownIndicator;

public void setShutdownIndicator(final AtomicBoolean shutdownIndicator) {
    this.shutdownIndicator = shutdownIndicator;
}

...
@Override
public void run()
{
    final KafkaConsumer<String, String> consumer;

    try
    {
        // All threads use the same group id, so they share partition assignment.
        consumerConfig.put("group.id", groupId);
        consumer = new KafkaConsumer<>(consumerConfig);
        consumer.subscribe(Collections.singletonList(kafkaTopic));
    }
    catch (final Exception e)
    {
        logger.error(...);
        return;
    }

    try
    {
        logger.info("Launching consumer thread {} ({}) for Kafka topic '{}', group '{}'",
                threadNumber, Thread.currentThread().getName(), kafkaTopic, groupId);

        while (!shutdownIndicator.get())
        {
            logger.info("Polling consumer");
            final ConsumerRecords<String, String> records = consumer.poll(pollingTime);

            if (records.isEmpty())
            {
                logger.info("Consumer polled: No records returned");
                continue;
            }

            logger.info("Consumer polled: Records returned: " + records.count());
            for (final ConsumerRecord<String, String> record : records)
            {
                // process the record
            }
        }
    }
    catch (final Exception e)
    {
        logger.error("Exception in thread " + threadNumber, e);
    }
    finally
    {
        // Close in finally so the consumer is released on normal shutdown as well,
        // not only when an exception escapes the loop.
        consumer.close();
    }
}
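
A small sketch (not part of the running code above) of how the gap between
successive poll() calls could be timed against the 30-second session
timeout, to confirm whether the oversized batches are what gets a thread
evicted; it reuses the consumer, pollingTime and shutdownIndicator from the
loop above:

long previousPoll = System.currentTimeMillis();

while (!shutdownIndicator.get())
{
    final long gapMs = System.currentTimeMillis() - previousPoll;
    if (gapMs > 30000)  // session.timeout.ms from the settings below
    {
        logger.warn("Gap of {} ms since the last poll - longer than the session timeout,"
                + " so the coordinator has likely already evicted this member", gapMs);
    }

    final ConsumerRecords<String, String> records = consumer.poll(pollingTime);
    previousPoll = System.currentTimeMillis();
    // ... process records as above ...
}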

Consumer settings:

bootstrap.servers=...[3 servers]...
key.deserializer=org.apache.kafka.common.serialization.StringDeserializer
value.deserializer=org.apache.kafka.common.serialization.StringDeserializer
auto.commit.interval.ms=5000
auto.offset.reset=earliest
session.timeout.ms=30000
heartbeat.interval.ms=3000
retry.backoff.ms=1000
send.buffer.bytes=1048576
receive.buffer.bytes=1048576
# varies or is removed entirely depending on my tests; this particular
# value is very stable but returns few records per pass
max.partition.fetch.bytes=102400
enable.auto.commit=true
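
Another mitigation available on 0.9 (until KIP-41 lands) is simply a longer
session timeout; my understanding is that the broker caps the client value
via group.max.session.timeout.ms (30 seconds by default), so both sides
would have to change, roughly:

# consumer
session.timeout.ms=120000

# broker (server.properties) - client session timeouts must fall under this cap
group.max.session.timeout.ms=120000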


Phillip Walker

Software Developer

Bandwidth <http://www.bandwidth.com/>

o 919.238.1452

m 919.578.1964

e email@bandwidth.com

LinkedIn <https://www.linkedin.com/in/phillipwalker>  |  Twitter
<https://twitter.com/bandwidth>

On Fri, Jan 8, 2016 at 5:26 PM, Jason Gustafson <ja...@confluent.io> wrote:

> We have an active KIP which aims to give better options for avoiding this
> problem:
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-41%3A+KafkaConsumer+Max+Records
> .
> If you have any feedback, the discussion is on the dev list.
>
> The pause/resume trick should have also worked. The consumer should
> continue heartbeating as long as you call poll(). Can you share some more
> detail about how you implemented this and the failure you saw?
>
> Thanks,
> Jason
>
> On Thu, Dec 31, 2015 at 9:41 AM, Phillip Walker <pw...@bandwidth.com>
> wrote:
>
> > I've run into exactly the same problem, and a similar workaround, along
> > with a config change, seems to be working for now, but I can probably
> add a
> > little detail.
> >
> > Scenario: The code reads from a very high-volume topic on a remote server
> > (with 120 partitions and no SSL or security). I am currently debugging,
> so
> > I am using only a single thread. Several calls to consumer.poll(100) were
> > always required to start retrieving any records at all, and only 40
> > partitions were returned on each call that returns records. At first, I
> was
> > getting about 24,000 records per call and was not finished with
> processing
> > them (running through an IDE) within the session timeout period, getting
> > the same messages as Martin. That part, I would expect, since I wasn't
> > calling consumer.poll() often enough to prevent the session from timing
> > out.
> >
> > I initially tried a workaround where I paused all partitions I owned,
> > called consumer.poll(0) well within the heartbeat time, and then resumed
> > the partitions as a sort of keep-alive mechanism. I stilled timed out
> every
> > 30 seconds, so apparently I wasn't heartbeating that way.
> >
> > I dramatically decreased max.partitions.fetch.bytes so that I would get
> far
> > fewer records that I could easily process within the heartbeat time. With
> > just that change, after a call to consumer.poll(100) that returned
> records,
> > I would still have a number of subsequent consumer.poll(100) calls
> return 0
> > records (despite being nowhere near the end of any partition). In fact,
> > those calls that returned 0 records did not appear to be counting as a
> > heartbeat, because I was still getting the "Group testGroup generation 1
> is
> > dead and removed" message even as the poll method was being called
> > repeatedly and simultaneously and returning nothing.
> >
> > By lowering max.partitions.fetch.bytes and boosting the consumer.poll()
> > wait time to a number greater than the heartbeat time, I started
> receiving
> > records on every call and am having no timeout issues. I still get
> records
> > from only 40 partitions per call, though, despite consumer.assignment()
> > returning the full 120.
> >
> >
> >
> > [image: email-signature-logo.jpg]
> >
> > Phillip Walker
> >
> > Software Developer
> >
> > Bandwidth <http://www.bandwidth.com/>
> >
> > o 919.238.1452
> >
> > m 919.578.1964
> >
> > e email@bandwidth.com
> >
> > LinkedIn <https://www.linkedin.com/in/phillipwalker>  |  Twitter
> > <https://twitter.com/bandwidth>
> >
> > On Tue, Dec 22, 2015 at 11:31 AM, Rune Sørensen <ru...@falconsocial.com>
> > wrote:
> >
> > > Hi,
> > >
> > > Sorry for the long delay in replying.
> > >
> > > As for your questions:
> > > No we are not using SSL.
> > > The problem went away for Martin when he was running against his kafka
> > > instance locally on his development machine, but we are still seeing
> the
> > > issue when we run it in our testing environment, where the broker is
> on a
> > > remote machine.
> > > The network in our testing environment seems stable, from the
> > measurements
> > > I have made so far.
> > >
> > > *Rune Tor*
> > > Sørensen
> > > +45 3172 2097 <+4531722097>
> > > LinkedIn <https://www.linkedin.com/in/runets> Twitter
> > > <https://twitter.com/Areian>
> > > *Copenhagen*
> > > Falcon Social
> > > H.C. Andersens Blvd. 27
> > > 1553 Copenhagen
> > > *Budapest*
> > > Falcon Social
> > > Colabs Startup Center Zrt
> > > 1016 Budapest, Krisztina krt. 99
> > > [image: Falcon Social]
> > > <
> > >
> >
> https://www.falconsocial.com/?utm_source=Employee%20emails&utm_medium=email&utm_content=Rune%20Tor&utm_campaign=Mail%20signature
> > > >
> > > Social Media Management for Enterprise
> > >
> > > On Tue, Dec 1, 2015 at 11:56 PM, Jason Gustafson <ja...@confluent.io>
> > > wrote:
> > >
> > > > I've been unable to reproduce this issue running locally. Even with a
> > > poll
> > > > timeout of 1 millisecond, it seems to work as expected. It would be
> > > helpful
> > > > to know a little more about your setup. Are you using SSL? Are the
> > > brokers
> > > > remote? Is the network stable?
> > > >
> > > > Thanks,
> > > > Jason
> > > >
> > > > On Tue, Dec 1, 2015 at 10:06 AM, Jason Gustafson <jason@confluent.io
> >
> > > > wrote:
> > > >
> > > > > Hi Martin,
> > > > >
> > > > > I'm also not sure why the poll timeout would affect this. Perhaps
> the
> > > > > handler is still doing work (e.g. sending requests) when the record
> > set
> > > > is
> > > > > empty?
> > > > >
> > > > > As a general rule, I would recommend longer poll timeouts. I've
> > > actually
> > > > > tended to use Long.MAX_VALUE myself. I'll have a look just to make
> > sure
> > > > > everything still works with smaller values though.
> > > > >
> > > > > -Jason
> > > > >
> > > > >
> > > > >
> > > > > On Tue, Dec 1, 2015 at 2:35 AM, Martin Skøtt <
> > > > > martin.skoett@falconsocial.com> wrote:
> > > > >
> > > > >> Hi Jason,
> > > > >>
> > > > >> That actually sounds like a very plausible explanation. My current
> > > > >> consumer
> > > > >> is using the default settings, but I have previously used these
> > (taken
> > > > >> from
> > > > >> the sample in the Javadoc for the new KafkaConsumer):
> > > > >>  "auto.commit.interval.ms", "1000"
> > > > >>  "session.timeout.ms", "30000"
> > > > >>
> > > > >> My consumer loop is quite simple as it just calls a domain
> specific
> > > > >> service:
> > > > >>
> > > > >> while (true) {
> > > > >>     ConsumerRecords<String, Object> records =
> consumer.poll(10000);
> > > > >>     for (ConsumerRecord<String, Object> record : records) {
> > > > >>         serve.handle(record.topic(), record.value());
> > > > >>     }
> > > > >> }
> > > > >>
> > > > >> The domain service does a number of things (including lookups in a
> > > RDBMS
> > > > >> and saving to ElasticSearch). In my local test setup a poll will
> > often
> > > > >> result between 5.000 and 10.000 records and I can easily see the
> > > > >> processing
> > > > >> of those taking more than 30 seconds.
> > > > >>
> > > > >> I'll probably take a look at adding some threading to my consumer
> > and
> > > > add
> > > > >> more partitions to my topics.
> > > > >>
> > > > >> That is all fine, but it doesn't really explain why increasing
> poll
> > > > >> timeout
> > > > >> made the problem go away :-/
> > > > >>
> > > > >> Martin
> > > > >>
> > > > >> On 30 November 2015 at 19:30, Jason Gustafson <jason@confluent.io
> >
> > > > wrote:
> > > > >>
> > > > >> > Hey Martin,
> > > > >> >
> > > > >> > At a glance, it looks like your consumer's session timeout is
> > > > expiring.
> > > > >> > This shouldn't happen unless there is a delay between successive
> > > calls
> > > > >> to
> > > > >> > poll which is longer than the session timeout. It might help if
> > you
> > > > >> include
> > > > >> > a snippet of your poll loop and your configuration (i.e. any
> > > > overridden
> > > > >> > settings).
> > > > >> >
> > > > >> > -Jason
> > > > >> >
> > > > >> > On Mon, Nov 30, 2015 at 8:12 AM, Martin Skøtt <
> > > > >> > martin.skoett@falconsocial.com> wrote:
> > > > >> >
> > > > >> > > Well, I made the problem go away, but I'm not sure why it
> works
> > > :-/
> > > > >> > >
> > > > >> > > Previously I used a time out value of 100 for Consumer.poll().
> > > > >> Increasing
> > > > >> > > it to 10.000 makes the problem go away completely?! I tried
> > other
> > > > >> values
> > > > >> > as
> > > > >> > > well:
> > > > >> > >    - 0 problem remained
> > > > >> > >    - 3000, same as heartbeat.interval, problem remained, but
> > less
> > > > >> > frequent
> > > > >> > >
> > > > >> > > Not really sure what is going on, but happy that the problem
> > went
> > > > away
> > > > >> > :-)
> > > > >> > >
> > > > >> > > Martin
> > > > >> > >
> > > > >> > > On 30 November 2015 at 15:33, Martin Skøtt <
> > > > >> > martin.skoett@falconsocial.com
> > > > >> > > >
> > > > >> > > wrote:
> > > > >> > >
> > > > >> > > > Hi Guozhang,
> > > > >> > > >
> > > > >> > > > I have done some testing with various values of
> > > > >> heartbeat.interval.ms
> > > > >> > > and
> > > > >> > > > they don't seem to have any influence on the error messages.
> > > > Running
> > > > >> > > > kafka-consumer-groups also continues to return that the
> > consumer
> > > > >> groups
> > > > >> > > > does not exists or is rebalancing. Do you have any
> suggestions
> > > to
> > > > >> how I
> > > > >> > > > could debug this further?
> > > > >> > > >
> > > > >> > > > Regards,
> > > > >> > > > Martin
> > > > >> > > >
> > > > >> > > >
> > > > >> > > > On 25 November 2015 at 18:37, Guozhang Wang <
> > wangguoz@gmail.com
> > > >
> > > > >> > wrote:
> > > > >> > > >
> > > > >> > > >> Hello Martin,
> > > > >> > > >>
> > > > >> > > >> It seems your consumer's heartbeat.interval.ms config
> value
> > is
> > > > too
> > > > >> > > small
> > > > >> > > >> (default is 3 seconds) for your environment, consider
> > > increasing
> > > > it
> > > > >> > and
> > > > >> > > >> see
> > > > >> > > >> if this issue goes away.
> > > > >> > > >>
> > > > >> > > >> At the same time, we have some better error handling fixes
> in
> > > > trunk
> > > > >> > > which
> > > > >> > > >> will be included in the next point release.
> > > > >> > > >>
> > > > >> > > >> https://issues.apache.org/jira/browse/KAFKA-2860
> > > > >> > > >>
> > > > >> > > >> Guozhang
> > > > >> > > >>
> > > > >> > > >>
> > > > >> > > >>
> > > > >> > > >> On Wed, Nov 25, 2015 at 6:54 AM, Martin Skøtt <
> > > > >> > > >> martin.skoett@falconsocial.com> wrote:
> > > > >> > > >>
> > > > >> > > >> > Hi,
> > > > >> > > >> >
> > > > >> > > >> > I'm experiencing some very strange issues with 0.9. I get
> > > these
> > > > >> log
> > > > >> > > >> > messages from the new consumer:
> > > > >> > > >> >
> > > > >> > > >> > [main] ERROR
> > > > >> > > >> >
> > > > org.apache.kafka.clients.consumer.internals.ConsumerCoordinator -
> > > > >> > > Error
> > > > >> > > >> > ILLEGAL_GENERATION occurred while committing offsets for
> > > group
> > > > >> > > >> > aaa-bbb-reader
> > > > >> > > >> > [main] WARN
> > > > >> > > >>
> > org.apache.kafka.clients.consumer.internals.ConsumerCoordinator
> > > > >> > > >> > - Auto offset commit failed: Commit cannot be completed
> due
> > > to
> > > > >> group
> > > > >> > > >> > rebalance
> > > > >> > > >> > [main] ERROR
> > > > >> > > >> >
> > > > org.apache.kafka.clients.consumer.internals.ConsumerCoordinator -
> > > > >> > > Error
> > > > >> > > >> > ILLEGAL_GENERATION occurred while committing offsets for
> > > group
> > > > >> > > >> > aaa-bbb-reader
> > > > >> > > >> > [main] WARN
> > > > >> > > >>
> > org.apache.kafka.clients.consumer.internals.ConsumerCoordinator
> > > > >> > > >> > - Auto offset commit failed:
> > > > >> > > >> > [main] INFO
> > > > >> > > >>
> > org.apache.kafka.clients.consumer.internals.AbstractCoordinator
> > > > >> > > >> > - Attempt to join group aaa-bbb-reader failed due to
> > unknown
> > > > >> member
> > > > >> > > id,
> > > > >> > > >> > resetting and retrying.
> > > > >> > > >> >
> > > > >> > > >> > And this in the broker log:
> > > > >> > > >> > [2015-11-25 15:41:01,542] INFO [GroupCoordinator 0]:
> > > Preparing
> > > > to
> > > > >> > > >> > restabilize group aaa-bbb-reader with old generation 1
> > > > >> > > >> > (kafka.coordinator.GroupCoordinator)
> > > > >> > > >> > [2015-11-25 15:41:01,544] INFO [GroupCoordinator 0]:
> > > > >> > > >> > Group aaa-bbb-reader generation 1 is dead and removed
> > > > >> > > >> > (kafka.coordinator.GroupCoordinator)
> > > > >> > > >> > [2015-11-25 15:41:13,474] INFO [GroupCoordinator 0]:
> > > Preparing
> > > > to
> > > > >> > > >> > restabilize group aaa-bbb-reader with old generation 0
> > > > >> > > >> > (kafka.coordinator.GroupCoordinator)
> > > > >> > > >> > [2015-11-25 15:41:13,475] INFO [GroupCoordinator 0]:
> > > Stabilized
> > > > >> > > >> > group aaa-bbb-reader generation 1
> > > > >> > (kafka.coordinator.GroupCoordinator)
> > > > >> > > >> > [2015-11-25 15:41:13,477] INFO [GroupCoordinator 0]:
> > > Assignment
> > > > >> > > received
> > > > >> > > >> > from leader for group aaa-bbb-reader for generation 1
> > > > >> > > >> > (kafka.coordinator.GroupCoordinator)
> > > > >> > > >> > [2015-11-25 15:41:43,478] INFO [GroupCoordinator 0]:
> > > Preparing
> > > > to
> > > > >> > > >> > restabilize group aaa-bbb-reader with old generation 1
> > > > >> > > >> > (kafka.coordinator.GroupCoordinator)
> > > > >> > > >> > [2015-11-25 15:41:43,478] INFO [GroupCoordinator 0]:
> > > > >> > > >> > Group aaa-bbb-reader generation 1 is dead and removed
> > > > >> > > >> > (kafka.coordinator.GroupCoordinator)
> > > > >> > > >> >
> > > > >> > > >> > When this happens the kafka-consumer-groups describe
> > command
> > > > >> keeps
> > > > >> > > >> saying
> > > > >> > > >> > that the group no longer exists or is rebalancing. What
> is
> > > > >> probably
> > > > >> > > even
> > > > >> > > >> > worse is that my consumers appears to be looping
> constantly
> > > > >> through
> > > > >> > > >> > everything written to the topics!?
> > > > >> > > >> >
> > > > >> > > >> > Does anyone have any input on what might be happening?
> > > > >> > > >> >
> > > > >> > > >> > I'm running 0.9 locally on my laptop using one Zookeeper
> > and
> > > > one
> > > > >> > > broker,
> > > > >> > > >> > both using the configuration provided in the
> distribution.
> > I
> > > > >> have 13
> > > > >> > > >> topics
> > > > >> > > >> > with two partitions each and a replication factor of 1. I
> > run
> > > > one
> > > > >> > > >> producer
> > > > >> > > >> > and once consumer also on the same machine.
> > > > >> > > >> >
> > > > >> > > >> > --
> > > > >> > > >> > Martin Skøtt
> > > > >> > > >> >
> > > > >> > > >>
> > > > >> > > >>
> > > > >> > > >>
> > > > >> > > >> --
> > > > >> > > >> -- Guozhang
> > > > >> > > >>
> > > > >> > > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > > >
> > > > >
> > > >
> > >
> >
>

Re: Consumer group disappears and consumers loops

Posted by Jason Gustafson <ja...@confluent.io>.
We have an active KIP which aims to give better options for avoiding this
problem:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-41%3A+KafkaConsumer+Max+Records.
If you have any feedback, the discussion is on the dev list.
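
The rough idea of the KIP is a consumer-side cap on how many records a
single poll() returns, i.e. a configuration along the lines of the setting
below (not available in 0.9), so that however much data a fetch brings
back, the application only has to work through a bounded number of records
between polls:

max.poll.records=500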

The pause/resume trick should have also worked. The consumer should
continue heartbeating as long as you call poll(). Can you share some more
detail about how you implemented this and the failure you saw?

Thanks,
Jason

On Thu, Dec 31, 2015 at 9:41 AM, Phillip Walker <pw...@bandwidth.com>
wrote:

> I've run into exactly the same problem, and a similar workaround, along
> with a config change, seems to be working for now, but I can probably add a
> little detail.
>
> Scenario: The code reads from a very high-volume topic on a remote server
> (with 120 partitions and no SSL or security). I am currently debugging, so
> I am using only a single thread. Several calls to consumer.poll(100) were
> always required to start retrieving any records at all, and only 40
> partitions were returned on each call that returns records. At first, I was
> getting about 24,000 records per call and was not finished with processing
> them (running through an IDE) within the session timeout period, getting
> the same messages as Martin. That part, I would expect, since I wasn't
> calling consumer.poll() often enough to prevent the session from timing
> out.
>
> I initially tried a workaround where I paused all partitions I owned,
> called consumer.poll(0) well within the heartbeat time, and then resumed
> the partitions as a sort of keep-alive mechanism. I stilled timed out every
> 30 seconds, so apparently I wasn't heartbeating that way.
>
> I dramatically decreased max.partitions.fetch.bytes so that I would get far
> fewer records that I could easily process within the heartbeat time. With
> just that change, after a call to consumer.poll(100) that returned records,
> I would still have a number of subsequent consumer.poll(100) calls return 0
> records (despite being nowhere near the end of any partition). In fact,
> those calls that returned 0 records did not appear to be counting as a
> heartbeat, because I was still getting the "Group testGroup generation 1 is
> dead and removed" message even as the poll method was being called
> repeatedly and simultaneously and returning nothing.
>
> By lowering max.partitions.fetch.bytes and boosting the consumer.poll()
> wait time to a number greater than the heartbeat time, I started receiving
> records on every call and am having no timeout issues. I still get records
> from only 40 partitions per call, though, despite consumer.assignment()
> returning the full 120.
>
>
>
> [image: email-signature-logo.jpg]
>
> Phillip Walker
>
> Software Developer
>
> Bandwidth <http://www.bandwidth.com/>
>
> o 919.238.1452
>
> m 919.578.1964
>
> e email@bandwidth.com
>
> LinkedIn <https://www.linkedin.com/in/phillipwalker>  |  Twitter
> <https://twitter.com/bandwidth>
>
> On Tue, Dec 22, 2015 at 11:31 AM, Rune Sørensen <ru...@falconsocial.com>
> wrote:
>
> > Hi,
> >
> > Sorry for the long delay in replying.
> >
> > As for your questions:
> > No we are not using SSL.
> > The problem went away for Martin when he was running against his kafka
> > instance locally on his development machine, but we are still seeing the
> > issue when we run it in our testing environment, where the broker is on a
> > remote machine.
> > The network in our testing environment seems stable, from the
> measurements
> > I have made so far.
> >
> > *Rune Tor*
> > Sørensen
> > +45 3172 2097 <+4531722097>
> > LinkedIn <https://www.linkedin.com/in/runets> Twitter
> > <https://twitter.com/Areian>
> > *Copenhagen*
> > Falcon Social
> > H.C. Andersens Blvd. 27
> > 1553 Copenhagen
> > *Budapest*
> > Falcon Social
> > Colabs Startup Center Zrt
> > 1016 Budapest, Krisztina krt. 99
> > [image: Falcon Social]
> > <
> >
> https://www.falconsocial.com/?utm_source=Employee%20emails&utm_medium=email&utm_content=Rune%20Tor&utm_campaign=Mail%20signature
> > >
> > Social Media Management for Enterprise
> >
> > On Tue, Dec 1, 2015 at 11:56 PM, Jason Gustafson <ja...@confluent.io>
> > wrote:
> >
> > > I've been unable to reproduce this issue running locally. Even with a
> > poll
> > > timeout of 1 millisecond, it seems to work as expected. It would be
> > helpful
> > > to know a little more about your setup. Are you using SSL? Are the
> > brokers
> > > remote? Is the network stable?
> > >
> > > Thanks,
> > > Jason
> > >
> > > On Tue, Dec 1, 2015 at 10:06 AM, Jason Gustafson <ja...@confluent.io>
> > > wrote:
> > >
> > > > Hi Martin,
> > > >
> > > > I'm also not sure why the poll timeout would affect this. Perhaps the
> > > > handler is still doing work (e.g. sending requests) when the record
> set
> > > is
> > > > empty?
> > > >
> > > > As a general rule, I would recommend longer poll timeouts. I've
> > actually
> > > > tended to use Long.MAX_VALUE myself. I'll have a look just to make
> sure
> > > > everything still works with smaller values though.
> > > >
> > > > -Jason
> > > >
> > > >
> > > >
> > > > On Tue, Dec 1, 2015 at 2:35 AM, Martin Skøtt <
> > > > martin.skoett@falconsocial.com> wrote:
> > > >
> > > >> Hi Jason,
> > > >>
> > > >> That actually sounds like a very plausible explanation. My current
> > > >> consumer
> > > >> is using the default settings, but I have previously used these
> (taken
> > > >> from
> > > >> the sample in the Javadoc for the new KafkaConsumer):
> > > >>  "auto.commit.interval.ms", "1000"
> > > >>  "session.timeout.ms", "30000"
> > > >>
> > > >> My consumer loop is quite simple as it just calls a domain specific
> > > >> service:
> > > >>
> > > >> while (true) {
> > > >>     ConsumerRecords<String, Object> records = consumer.poll(10000);
> > > >>     for (ConsumerRecord<String, Object> record : records) {
> > > >>         serve.handle(record.topic(), record.value());
> > > >>     }
> > > >> }
> > > >>
> > > >> The domain service does a number of things (including lookups in a
> > RDBMS
> > > >> and saving to ElasticSearch). In my local test setup a poll will
> often
> > > >> result between 5.000 and 10.000 records and I can easily see the
> > > >> processing
> > > >> of those taking more than 30 seconds.
> > > >>
> > > >> I'll probably take a look at adding some threading to my consumer
> and
> > > add
> > > >> more partitions to my topics.
> > > >>
> > > >> That is all fine, but it doesn't really explain why increasing poll
> > > >> timeout
> > > >> made the problem go away :-/
> > > >>
> > > >> Martin
> > > >>
> > > >> On 30 November 2015 at 19:30, Jason Gustafson <ja...@confluent.io>
> > > wrote:
> > > >>
> > > >> > Hey Martin,
> > > >> >
> > > >> > At a glance, it looks like your consumer's session timeout is
> > > expiring.
> > > >> > This shouldn't happen unless there is a delay between successive
> > calls
> > > >> to
> > > >> > poll which is longer than the session timeout. It might help if
> you
> > > >> include
> > > >> > a snippet of your poll loop and your configuration (i.e. any
> > > overridden
> > > >> > settings).
> > > >> >
> > > >> > -Jason
> > > >> >
> > > >> > On Mon, Nov 30, 2015 at 8:12 AM, Martin Skøtt <
> > > >> > martin.skoett@falconsocial.com> wrote:
> > > >> >
> > > >> > > Well, I made the problem go away, but I'm not sure why it works
> > :-/
> > > >> > >
> > > >> > > Previously I used a time out value of 100 for Consumer.poll().
> > > >> Increasing
> > > >> > > it to 10.000 makes the problem go away completely?! I tried
> other
> > > >> values
> > > >> > as
> > > >> > > well:
> > > >> > >    - 0 problem remained
> > > >> > >    - 3000, same as heartbeat.interval, problem remained, but
> less
> > > >> > frequent
> > > >> > >
> > > >> > > Not really sure what is going on, but happy that the problem
> went
> > > away
> > > >> > :-)
> > > >> > >
> > > >> > > Martin
> > > >> > >
> > > >> > > On 30 November 2015 at 15:33, Martin Skøtt <
> > > >> > martin.skoett@falconsocial.com
> > > >> > > >
> > > >> > > wrote:
> > > >> > >
> > > >> > > > Hi Guozhang,
> > > >> > > >
> > > >> > > > I have done some testing with various values of
> > > >> heartbeat.interval.ms
> > > >> > > and
> > > >> > > > they don't seem to have any influence on the error messages.
> > > Running
> > > >> > > > kafka-consumer-groups also continues to return that the
> consumer
> > > >> groups
> > > >> > > > does not exists or is rebalancing. Do you have any suggestions
> > to
> > > >> how I
> > > >> > > > could debug this further?
> > > >> > > >
> > > >> > > > Regards,
> > > >> > > > Martin
> > > >> > > >
> > > >> > > >
> > > >> > > > On 25 November 2015 at 18:37, Guozhang Wang <
> wangguoz@gmail.com
> > >
> > > >> > wrote:
> > > >> > > >
> > > >> > > >> Hello Martin,
> > > >> > > >>
> > > >> > > >> It seems your consumer's heartbeat.interval.ms config value
> is
> > > too
> > > >> > > small
> > > >> > > >> (default is 3 seconds) for your environment, consider
> > increasing
> > > it
> > > >> > and
> > > >> > > >> see
> > > >> > > >> if this issue goes away.
> > > >> > > >>
> > > >> > > >> At the same time, we have some better error handling fixes in
> > > trunk
> > > >> > > which
> > > >> > > >> will be included in the next point release.
> > > >> > > >>
> > > >> > > >> https://issues.apache.org/jira/browse/KAFKA-2860
> > > >> > > >>
> > > >> > > >> Guozhang
> > > >> > > >>
> > > >> > > >>
> > > >> > > >>
> > > >> > > >> On Wed, Nov 25, 2015 at 6:54 AM, Martin Skøtt <
> > > >> > > >> martin.skoett@falconsocial.com> wrote:
> > > >> > > >>
> > > >> > > >> > Hi,
> > > >> > > >> >
> > > >> > > >> > I'm experiencing some very strange issues with 0.9. I get
> > these
> > > >> log
> > > >> > > >> > messages from the new consumer:
> > > >> > > >> >
> > > >> > > >> > [main] ERROR
> > > >> > > >> >
> > > org.apache.kafka.clients.consumer.internals.ConsumerCoordinator -
> > > >> > > Error
> > > >> > > >> > ILLEGAL_GENERATION occurred while committing offsets for
> > group
> > > >> > > >> > aaa-bbb-reader
> > > >> > > >> > [main] WARN
> > > >> > > >>
> org.apache.kafka.clients.consumer.internals.ConsumerCoordinator
> > > >> > > >> > - Auto offset commit failed: Commit cannot be completed due
> > to
> > > >> group
> > > >> > > >> > rebalance
> > > >> > > >> > [main] ERROR
> > > >> > > >> >
> > > org.apache.kafka.clients.consumer.internals.ConsumerCoordinator -
> > > >> > > Error
> > > >> > > >> > ILLEGAL_GENERATION occurred while committing offsets for
> > group
> > > >> > > >> > aaa-bbb-reader
> > > >> > > >> > [main] WARN
> > > >> > > >>
> org.apache.kafka.clients.consumer.internals.ConsumerCoordinator
> > > >> > > >> > - Auto offset commit failed:
> > > >> > > >> > [main] INFO
> > > >> > > >>
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator
> > > >> > > >> > - Attempt to join group aaa-bbb-reader failed due to
> unknown
> > > >> member
> > > >> > > id,
> > > >> > > >> > resetting and retrying.
> > > >> > > >> >
> > > >> > > >> > And this in the broker log:
> > > >> > > >> > [2015-11-25 15:41:01,542] INFO [GroupCoordinator 0]:
> > Preparing
> > > to
> > > >> > > >> > restabilize group aaa-bbb-reader with old generation 1
> > > >> > > >> > (kafka.coordinator.GroupCoordinator)
> > > >> > > >> > [2015-11-25 15:41:01,544] INFO [GroupCoordinator 0]:
> > > >> > > >> > Group aaa-bbb-reader generation 1 is dead and removed
> > > >> > > >> > (kafka.coordinator.GroupCoordinator)
> > > >> > > >> > [2015-11-25 15:41:13,474] INFO [GroupCoordinator 0]:
> > Preparing
> > > to
> > > >> > > >> > restabilize group aaa-bbb-reader with old generation 0
> > > >> > > >> > (kafka.coordinator.GroupCoordinator)
> > > >> > > >> > [2015-11-25 15:41:13,475] INFO [GroupCoordinator 0]:
> > Stabilized
> > > >> > > >> > group aaa-bbb-reader generation 1
> > > >> > (kafka.coordinator.GroupCoordinator)
> > > >> > > >> > [2015-11-25 15:41:13,477] INFO [GroupCoordinator 0]:
> > Assignment
> > > >> > > received
> > > >> > > >> > from leader for group aaa-bbb-reader for generation 1
> > > >> > > >> > (kafka.coordinator.GroupCoordinator)
> > > >> > > >> > [2015-11-25 15:41:43,478] INFO [GroupCoordinator 0]:
> > Preparing
> > > to
> > > >> > > >> > restabilize group aaa-bbb-reader with old generation 1
> > > >> > > >> > (kafka.coordinator.GroupCoordinator)
> > > >> > > >> > [2015-11-25 15:41:43,478] INFO [GroupCoordinator 0]:
> > > >> > > >> > Group aaa-bbb-reader generation 1 is dead and removed
> > > >> > > >> > (kafka.coordinator.GroupCoordinator)
> > > >> > > >> >
> > > >> > > >> > When this happens the kafka-consumer-groups describe
> command
> > > >> keeps
> > > >> > > >> saying
> > > >> > > >> > that the group no longer exists or is rebalancing. What is
> > > >> probably
> > > >> > > even
> > > >> > > >> > worse is that my consumers appears to be looping constantly
> > > >> through
> > > >> > > >> > everything written to the topics!?
> > > >> > > >> >
> > > >> > > >> > Does anyone have any input on what might be happening?
> > > >> > > >> >
> > > >> > > >> > I'm running 0.9 locally on my laptop using one Zookeeper
> and
> > > one
> > > >> > > broker,
> > > >> > > >> > both using the configuration provided in the distribution.
> I
> > > >> have 13
> > > >> > > >> topics
> > > >> > > >> > with two partitions each and a replication factor of 1. I
> run
> > > one
> > > >> > > >> producer
> > > >> > > >> > and once consumer also on the same machine.
> > > >> > > >> >
> > > >> > > >> > --
> > > >> > > >> > Martin Skøtt
> > > >> > > >> >
> > > >> > > >>
> > > >> > > >>
> > > >> > > >>
> > > >> > > >> --
> > > >> > > >> -- Guozhang
> > > >> > > >>
> > > >> > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > > >
> > > >
> > >
> >
>

Re: Consumer group disappears and consumers loops

Posted by Phillip Walker <pw...@bandwidth.com>.
I've run into exactly the same problem, and a similar workaround, along
with a config change, seems to be working for now, but I can probably add a
little detail.

Scenario: The code reads from a very high-volume topic on a remote server
(with 120 partitions and no SSL or security). I am currently debugging, so
I am using only a single thread. Several calls to consumer.poll(100) were
always required to start retrieving any records at all, and only 40
partitions were returned on each call that returned records. At first, I was
getting about 24,000 records per call and was not finished processing
them (running through an IDE) within the session timeout period, getting
the same messages as Martin. That part I would expect, since I wasn't
calling consumer.poll() often enough to prevent the session from timing out.

I initially tried a workaround where I paused all partitions I owned,
called consumer.poll(0) well within the heartbeat time, and then resumed
the partitions as a sort of keep-alive mechanism. I still timed out every
30 seconds, so apparently I wasn't heartbeating that way.
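
For clarity, the keep-alive attempt looked roughly like this (a
reconstructed sketch against the 0.9 API, where pause/resume take
TopicPartition varargs and heartbeats are only sent from within poll()):

// Between batches: stop fetching, poll(0) purely so the coordinator sees a heartbeat, resume.
final Set<TopicPartition> assigned = consumer.assignment();
final TopicPartition[] partitions = assigned.toArray(new TopicPartition[assigned.size()]);
consumer.pause(partitions);
consumer.poll(0);
consumer.resume(partitions);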

I dramatically decreased max.partition.fetch.bytes so that I would get far
fewer records, which I could easily process within the heartbeat time. With
just that change, after a call to consumer.poll(100) that returned records,
I would still have a number of subsequent consumer.poll(100) calls return 0
records (despite being nowhere near the end of any partition). In fact,
those calls that returned 0 records did not appear to be counting as a
heartbeat, because I was still getting the "Group testGroup generation 1 is
dead and removed" message even as the poll method was being called
repeatedly and simultaneously and returning nothing.

By lowering max.partition.fetch.bytes and boosting the consumer.poll()
wait time to a number greater than the heartbeat time, I started receiving
records on every call and am having no timeout issues. I still get records
from only 40 partitions per call, though, despite consumer.assignment()
returning the full 120.

Phillip Walker

Software Developer

Bandwidth <http://www.bandwidth.com/>

o 919.238.1452

m 919.578.1964

e email@bandwidth.com

LinkedIn <https://www.linkedin.com/in/phillipwalker>  |  Twitter
<https://twitter.com/bandwidth>

On Tue, Dec 22, 2015 at 11:31 AM, Rune Sørensen <ru...@falconsocial.com>
wrote:

> Hi,
>
> Sorry for the long delay in replying.
>
> As for your questions:
> No we are not using SSL.
> The problem went away for Martin when he was running against his kafka
> instance locally on his development machine, but we are still seeing the
> issue when we run it in our testing environment, where the broker is on a
> remote machine.
> The network in our testing environment seems stable, from the measurements
> I have made so far.
>
> *Rune Tor*
> Sørensen
> +45 3172 2097 <+4531722097>
> LinkedIn <https://www.linkedin.com/in/runets> Twitter
> <https://twitter.com/Areian>
> *Copenhagen*
> Falcon Social
> H.C. Andersens Blvd. 27
> 1553 Copenhagen
> *Budapest*
> Falcon Social
> Colabs Startup Center Zrt
> 1016 Budapest, Krisztina krt. 99
> [image: Falcon Social]
> <
> https://www.falconsocial.com/?utm_source=Employee%20emails&utm_medium=email&utm_content=Rune%20Tor&utm_campaign=Mail%20signature
> >
> Social Media Management for Enterprise
>
> On Tue, Dec 1, 2015 at 11:56 PM, Jason Gustafson <ja...@confluent.io>
> wrote:
>
> > I've been unable to reproduce this issue running locally. Even with a
> poll
> > timeout of 1 millisecond, it seems to work as expected. It would be
> helpful
> > to know a little more about your setup. Are you using SSL? Are the
> brokers
> > remote? Is the network stable?
> >
> > Thanks,
> > Jason
> >
> > On Tue, Dec 1, 2015 at 10:06 AM, Jason Gustafson <ja...@confluent.io>
> > wrote:
> >
> > > Hi Martin,
> > >
> > > I'm also not sure why the poll timeout would affect this. Perhaps the
> > > handler is still doing work (e.g. sending requests) when the record set
> > is
> > > empty?
> > >
> > > As a general rule, I would recommend longer poll timeouts. I've
> actually
> > > tended to use Long.MAX_VALUE myself. I'll have a look just to make sure
> > > everything still works with smaller values though.
> > >
> > > -Jason
> > >
> > >
> > >
> > > On Tue, Dec 1, 2015 at 2:35 AM, Martin Skøtt <
> > > martin.skoett@falconsocial.com> wrote:
> > >
> > >> Hi Jason,
> > >>
> > >> That actually sounds like a very plausible explanation. My current
> > >> consumer
> > >> is using the default settings, but I have previously used these (taken
> > >> from
> > >> the sample in the Javadoc for the new KafkaConsumer):
> > >>  "auto.commit.interval.ms", "1000"
> > >>  "session.timeout.ms", "30000"
> > >>
> > >> My consumer loop is quite simple as it just calls a domain specific
> > >> service:
> > >>
> > >> while (true) {
> > >>     ConsumerRecords<String, Object> records = consumer.poll(10000);
> > >>     for (ConsumerRecord<String, Object> record : records) {
> > >>         serve.handle(record.topic(), record.value());
> > >>     }
> > >> }
> > >>
> > >> The domain service does a number of things (including lookups in a
> RDBMS
> > >> and saving to ElasticSearch). In my local test setup a poll will often
> > >> result between 5.000 and 10.000 records and I can easily see the
> > >> processing
> > >> of those taking more than 30 seconds.
> > >>
> > >> I'll probably take a look at adding some threading to my consumer and
> > add
> > >> more partitions to my topics.
> > >>
> > >> That is all fine, but it doesn't really explain why increasing poll
> > >> timeout
> > >> made the problem go away :-/
> > >>
> > >> Martin
> > >>
> > >> On 30 November 2015 at 19:30, Jason Gustafson <ja...@confluent.io>
> > wrote:
> > >>
> > >> > Hey Martin,
> > >> >
> > >> > At a glance, it looks like your consumer's session timeout is
> > expiring.
> > >> > This shouldn't happen unless there is a delay between successive
> calls
> > >> to
> > >> > poll which is longer than the session timeout. It might help if you
> > >> include
> > >> > a snippet of your poll loop and your configuration (i.e. any
> > overridden
> > >> > settings).
> > >> >
> > >> > -Jason
> > >> >
> > >> > On Mon, Nov 30, 2015 at 8:12 AM, Martin Skøtt <
> > >> > martin.skoett@falconsocial.com> wrote:
> > >> >
> > >> > > Well, I made the problem go away, but I'm not sure why it works
> :-/
> > >> > >
> > >> > > Previously I used a time out value of 100 for Consumer.poll().
> > >> Increasing
> > >> > > it to 10.000 makes the problem go away completely?! I tried other
> > >> values
> > >> > as
> > >> > > well:
> > >> > >    - 0 problem remained
> > >> > >    - 3000, same as heartbeat.interval, problem remained, but less
> > >> > frequent
> > >> > >
> > >> > > Not really sure what is going on, but happy that the problem went
> > away
> > >> > :-)
> > >> > >
> > >> > > Martin
> > >> > >
> > >> > > On 30 November 2015 at 15:33, Martin Skøtt <
> > >> > martin.skoett@falconsocial.com
> > >> > > >
> > >> > > wrote:
> > >> > >
> > >> > > > Hi Guozhang,
> > >> > > >
> > >> > > > I have done some testing with various values of
> > >> heartbeat.interval.ms
> > >> > > and
> > >> > > > they don't seem to have any influence on the error messages.
> > Running
> > >> > > > kafka-consumer-groups also continues to return that the consumer
> > >> groups
> > >> > > > does not exists or is rebalancing. Do you have any suggestions
> to
> > >> how I
> > >> > > > could debug this further?
> > >> > > >
> > >> > > > Regards,
> > >> > > > Martin
> > >> > > >
> > >> > > >
> > >> > > > On 25 November 2015 at 18:37, Guozhang Wang <wangguoz@gmail.com
> >
> > >> > wrote:
> > >> > > >
> > >> > > >> Hello Martin,
> > >> > > >>
> > >> > > >> It seems your consumer's heartbeat.interval.ms config value is
> > too
> > >> > > small
> > >> > > >> (default is 3 seconds) for your environment, consider
> increasing
> > it
> > >> > and
> > >> > > >> see
> > >> > > >> if this issue goes away.
> > >> > > >>
> > >> > > >> At the same time, we have some better error handling fixes in
> > trunk
> > >> > > which
> > >> > > >> will be included in the next point release.
> > >> > > >>
> > >> > > >> https://issues.apache.org/jira/browse/KAFKA-2860
> > >> > > >>
> > >> > > >> Guozhang
> > >> > > >>
> > >> > > >>
> > >> > > >>
> > >> > > >> On Wed, Nov 25, 2015 at 6:54 AM, Martin Skøtt <
> > >> > > >> martin.skoett@falconsocial.com> wrote:
> > >> > > >>
> > >> > > >> > Hi,
> > >> > > >> >
> > >> > > >> > I'm experiencing some very strange issues with 0.9. I get
> these
> > >> log
> > >> > > >> > messages from the new consumer:
> > >> > > >> >
> > >> > > >> > [main] ERROR
> > >> > > >> >
> > org.apache.kafka.clients.consumer.internals.ConsumerCoordinator -
> > >> > > Error
> > >> > > >> > ILLEGAL_GENERATION occurred while committing offsets for
> group
> > >> > > >> > aaa-bbb-reader
> > >> > > >> > [main] WARN
> > >> > > >> org.apache.kafka.clients.consumer.internals.ConsumerCoordinator
> > >> > > >> > - Auto offset commit failed: Commit cannot be completed due
> to
> > >> group
> > >> > > >> > rebalance
> > >> > > >> > [main] ERROR
> > >> > > >> >
> > org.apache.kafka.clients.consumer.internals.ConsumerCoordinator -
> > >> > > Error
> > >> > > >> > ILLEGAL_GENERATION occurred while committing offsets for
> group
> > >> > > >> > aaa-bbb-reader
> > >> > > >> > [main] WARN
> > >> > > >> org.apache.kafka.clients.consumer.internals.ConsumerCoordinator
> > >> > > >> > - Auto offset commit failed:
> > >> > > >> > [main] INFO
> > >> > > >> org.apache.kafka.clients.consumer.internals.AbstractCoordinator
> > >> > > >> > - Attempt to join group aaa-bbb-reader failed due to unknown
> > >> member
> > >> > > id,
> > >> > > >> > resetting and retrying.
> > >> > > >> >
> > >> > > >> > And this in the broker log:
> > >> > > >> > [2015-11-25 15:41:01,542] INFO [GroupCoordinator 0]:
> Preparing
> > to
> > >> > > >> > restabilize group aaa-bbb-reader with old generation 1
> > >> > > >> > (kafka.coordinator.GroupCoordinator)
> > >> > > >> > [2015-11-25 15:41:01,544] INFO [GroupCoordinator 0]:
> > >> > > >> > Group aaa-bbb-reader generation 1 is dead and removed
> > >> > > >> > (kafka.coordinator.GroupCoordinator)
> > >> > > >> > [2015-11-25 15:41:13,474] INFO [GroupCoordinator 0]:
> Preparing
> > to
> > >> > > >> > restabilize group aaa-bbb-reader with old generation 0
> > >> > > >> > (kafka.coordinator.GroupCoordinator)
> > >> > > >> > [2015-11-25 15:41:13,475] INFO [GroupCoordinator 0]:
> Stabilized
> > >> > > >> > group aaa-bbb-reader generation 1
> > >> > (kafka.coordinator.GroupCoordinator)
> > >> > > >> > [2015-11-25 15:41:13,477] INFO [GroupCoordinator 0]:
> Assignment
> > >> > > received
> > >> > > >> > from leader for group aaa-bbb-reader for generation 1
> > >> > > >> > (kafka.coordinator.GroupCoordinator)
> > >> > > >> > [2015-11-25 15:41:43,478] INFO [GroupCoordinator 0]:
> Preparing
> > to
> > >> > > >> > restabilize group aaa-bbb-reader with old generation 1
> > >> > > >> > (kafka.coordinator.GroupCoordinator)
> > >> > > >> > [2015-11-25 15:41:43,478] INFO [GroupCoordinator 0]:
> > >> > > >> > Group aaa-bbb-reader generation 1 is dead and removed
> > >> > > >> > (kafka.coordinator.GroupCoordinator)
> > >> > > >> >
> > >> > > >> > When this happens the kafka-consumer-groups describe command
> > >> keeps
> > >> > > >> saying
> > >> > > >> > that the group no longer exists or is rebalancing. What is
> > >> probably
> > >> > > even
> > >> > > >> > worse is that my consumers appears to be looping constantly
> > >> through
> > >> > > >> > everything written to the topics!?
> > >> > > >> >
> > >> > > >> > Does anyone have any input on what might be happening?
> > >> > > >> >
> > >> > > >> > I'm running 0.9 locally on my laptop using one Zookeeper and
> > one
> > >> > > broker,
> > >> > > >> > both using the configuration provided in the distribution. I
> > >> have 13
> > >> > > >> topics
> > >> > > >> > with two partitions each and a replication factor of 1. I run
> > one
> > >> > > >> producer
> > >> > > >> > and once consumer also on the same machine.
> > >> > > >> >
> > >> > > >> > --
> > >> > > >> > Martin Skøtt
> > >> > > >> >
> > >> > > >>
> > >> > > >>
> > >> > > >>
> > >> > > >> --
> > >> > > >> -- Guozhang
> > >> > > >>
> > >> > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>

Re: Consumer group disappears and consumers loops

Posted by Rune Sørensen <ru...@falconsocial.com>.
Hi,

Sorry for the long delay in replying.

As for your questions:
No, we are not using SSL.
The problem went away for Martin when he was running against his Kafka
instance locally on his development machine, but we are still seeing the
issue when we run it in our testing environment, where the broker is on a
remote machine.
The network in our testing environment seems stable, based on the
measurements I have made so far.

*Rune Tor*
Sørensen
+45 3172 2097 <+4531722097>
LinkedIn <https://www.linkedin.com/in/runets> Twitter
<https://twitter.com/Areian>
*Copenhagen*
Falcon Social
H.C. Andersens Blvd. 27
1553 Copenhagen
*Budapest*
Falcon Social
Colabs Startup Center Zrt
1016 Budapest, Krisztina krt. 99
Social Media Management for Enterprise

On Tue, Dec 1, 2015 at 11:56 PM, Jason Gustafson <ja...@confluent.io> wrote:

> I've been unable to reproduce this issue running locally. Even with a poll
> timeout of 1 millisecond, it seems to work as expected. It would be helpful
> to know a little more about your setup. Are you using SSL? Are the brokers
> remote? Is the network stable?
>
> Thanks,
> Jason
>
> On Tue, Dec 1, 2015 at 10:06 AM, Jason Gustafson <ja...@confluent.io>
> wrote:
>
> > Hi Martin,
> >
> > I'm also not sure why the poll timeout would affect this. Perhaps the
> > handler is still doing work (e.g. sending requests) when the record set
> is
> > empty?
> >
> > As a general rule, I would recommend longer poll timeouts. I've actually
> > tended to use Long.MAX_VALUE myself. I'll have a look just to make sure
> > everything still works with smaller values though.
> >
> > -Jason
> >
> >
> >
> > On Tue, Dec 1, 2015 at 2:35 AM, Martin Skøtt <
> > martin.skoett@falconsocial.com> wrote:
> >
> >> Hi Jason,
> >>
> >> That actually sounds like a very plausible explanation. My current
> >> consumer
> >> is using the default settings, but I have previously used these (taken
> >> from
> >> the sample in the Javadoc for the new KafkaConsumer):
> >>  "auto.commit.interval.ms", "1000"
> >>  "session.timeout.ms", "30000"
> >>
> >> My consumer loop is quite simple as it just calls a domain specific
> >> service:
> >>
> >> while (true) {
> >>     ConsumerRecords<String, Object> records = consumer.poll(10000);
> >>     for (ConsumerRecord<String, Object> record : records) {
> >>         serve.handle(record.topic(), record.value());
> >>     }
> >> }
> >>
> >> The domain service does a number of things (including lookups in a RDBMS
> >> and saving to ElasticSearch). In my local test setup a poll will often
> >> result between 5.000 and 10.000 records and I can easily see the
> >> processing
> >> of those taking more than 30 seconds.
> >>
> >> I'll probably take a look at adding some threading to my consumer and
> add
> >> more partitions to my topics.
> >>
> >> That is all fine, but it doesn't really explain why increasing poll
> >> timeout
> >> made the problem go away :-/
> >>
> >> Martin
> >>
> >> On 30 November 2015 at 19:30, Jason Gustafson <ja...@confluent.io>
> wrote:
> >>
> >> > Hey Martin,
> >> >
> >> > At a glance, it looks like your consumer's session timeout is
> expiring.
> >> > This shouldn't happen unless there is a delay between successive calls
> >> to
> >> > poll which is longer than the session timeout. It might help if you
> >> include
> >> > a snippet of your poll loop and your configuration (i.e. any
> overridden
> >> > settings).
> >> >
> >> > -Jason
> >> >
> >> > On Mon, Nov 30, 2015 at 8:12 AM, Martin Skøtt <
> >> > martin.skoett@falconsocial.com> wrote:
> >> >
> >> > > Well, I made the problem go away, but I'm not sure why it works :-/
> >> > >
> >> > > Previously I used a time out value of 100 for Consumer.poll().
> >> Increasing
> >> > > it to 10.000 makes the problem go away completely?! I tried other
> >> values
> >> > as
> >> > > well:
> >> > >    - 0 problem remained
> >> > >    - 3000, same as heartbeat.interval, problem remained, but less
> >> > frequent
> >> > >
> >> > > Not really sure what is going on, but happy that the problem went
> away
> >> > :-)
> >> > >
> >> > > Martin
> >> > >
> >> > > On 30 November 2015 at 15:33, Martin Skøtt <
> >> > martin.skoett@falconsocial.com
> >> > > >
> >> > > wrote:
> >> > >
> >> > > > Hi Guozhang,
> >> > > >
> >> > > > I have done some testing with various values of
> >> heartbeat.interval.ms
> >> > > and
> >> > > > they don't seem to have any influence on the error messages.
> Running
> >> > > > kafka-consumer-groups also continues to return that the consumer
> >> groups
> >> > > > does not exists or is rebalancing. Do you have any suggestions to
> >> how I
> >> > > > could debug this further?
> >> > > >
> >> > > > Regards,
> >> > > > Martin
> >> > > >
> >> > > >
> >> > > > On 25 November 2015 at 18:37, Guozhang Wang <wa...@gmail.com>
> >> > wrote:
> >> > > >
> >> > > >> Hello Martin,
> >> > > >>
> >> > > >> It seems your consumer's heartbeat.interval.ms config value is
> too
> >> > > small
> >> > > >> (default is 3 seconds) for your environment, consider increasing
> it
> >> > and
> >> > > >> see
> >> > > >> if this issue goes away.
> >> > > >>
> >> > > >> At the same time, we have some better error handling fixes in
> trunk
> >> > > which
> >> > > >> will be included in the next point release.
> >> > > >>
> >> > > >> https://issues.apache.org/jira/browse/KAFKA-2860
> >> > > >>
> >> > > >> Guozhang
> >> > > >>
> >> > > >>
> >> > > >>
> >> > > >> On Wed, Nov 25, 2015 at 6:54 AM, Martin Skøtt <
> >> > > >> martin.skoett@falconsocial.com> wrote:
> >> > > >>
> >> > > >> > Hi,
> >> > > >> >
> >> > > >> > I'm experiencing some very strange issues with 0.9. I get these
> >> log
> >> > > >> > messages from the new consumer:
> >> > > >> >
> >> > > >> > [main] ERROR
> >> > > >> >
> org.apache.kafka.clients.consumer.internals.ConsumerCoordinator -
> >> > > Error
> >> > > >> > ILLEGAL_GENERATION occurred while committing offsets for group
> >> > > >> > aaa-bbb-reader
> >> > > >> > [main] WARN
> >> > > >> org.apache.kafka.clients.consumer.internals.ConsumerCoordinator
> >> > > >> > - Auto offset commit failed: Commit cannot be completed due to
> >> group
> >> > > >> > rebalance
> >> > > >> > [main] ERROR
> >> > > >> >
> org.apache.kafka.clients.consumer.internals.ConsumerCoordinator -
> >> > > Error
> >> > > >> > ILLEGAL_GENERATION occurred while committing offsets for group
> >> > > >> > aaa-bbb-reader
> >> > > >> > [main] WARN
> >> > > >> org.apache.kafka.clients.consumer.internals.ConsumerCoordinator
> >> > > >> > - Auto offset commit failed:
> >> > > >> > [main] INFO
> >> > > >> org.apache.kafka.clients.consumer.internals.AbstractCoordinator
> >> > > >> > - Attempt to join group aaa-bbb-reader failed due to unknown
> >> member
> >> > > id,
> >> > > >> > resetting and retrying.
> >> > > >> >
> >> > > >> > And this in the broker log:
> >> > > >> > [2015-11-25 15:41:01,542] INFO [GroupCoordinator 0]: Preparing
> to
> >> > > >> > restabilize group aaa-bbb-reader with old generation 1
> >> > > >> > (kafka.coordinator.GroupCoordinator)
> >> > > >> > [2015-11-25 15:41:01,544] INFO [GroupCoordinator 0]:
> >> > > >> > Group aaa-bbb-reader generation 1 is dead and removed
> >> > > >> > (kafka.coordinator.GroupCoordinator)
> >> > > >> > [2015-11-25 15:41:13,474] INFO [GroupCoordinator 0]: Preparing
> to
> >> > > >> > restabilize group aaa-bbb-reader with old generation 0
> >> > > >> > (kafka.coordinator.GroupCoordinator)
> >> > > >> > [2015-11-25 15:41:13,475] INFO [GroupCoordinator 0]: Stabilized
> >> > > >> > group aaa-bbb-reader generation 1
> >> > (kafka.coordinator.GroupCoordinator)
> >> > > >> > [2015-11-25 15:41:13,477] INFO [GroupCoordinator 0]: Assignment
> >> > > received
> >> > > >> > from leader for group aaa-bbb-reader for generation 1
> >> > > >> > (kafka.coordinator.GroupCoordinator)
> >> > > >> > [2015-11-25 15:41:43,478] INFO [GroupCoordinator 0]: Preparing
> to
> >> > > >> > restabilize group aaa-bbb-reader with old generation 1
> >> > > >> > (kafka.coordinator.GroupCoordinator)
> >> > > >> > [2015-11-25 15:41:43,478] INFO [GroupCoordinator 0]:
> >> > > >> > Group aaa-bbb-reader generation 1 is dead and removed
> >> > > >> > (kafka.coordinator.GroupCoordinator)
> >> > > >> >
> >> > > >> > When this happens the kafka-consumer-groups describe command
> >> keeps
> >> > > >> saying
> >> > > >> > that the group no longer exists or is rebalancing. What is
> >> probably
> >> > > even
> >> > > >> > worse is that my consumers appears to be looping constantly
> >> through
> >> > > >> > everything written to the topics!?
> >> > > >> >
> >> > > >> > Does anyone have any input on what might be happening?
> >> > > >> >
> >> > > >> > I'm running 0.9 locally on my laptop using one Zookeeper and
> one
> >> > > broker,
> >> > > >> > both using the configuration provided in the distribution. I
> >> have 13
> >> > > >> topics
> >> > > >> > with two partitions each and a replication factor of 1. I run
> one
> >> > > >> producer
> >> > > >> > and once consumer also on the same machine.
> >> > > >> >
> >> > > >> > --
> >> > > >> > Martin Skøtt
> >> > > >> >
> >> > > >>
> >> > > >>
> >> > > >>
> >> > > >> --
> >> > > >> -- Guozhang
> >> > > >>
> >> > > >
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>

Re: Consumer group disappears and consumers loops

Posted by Jason Gustafson <ja...@confluent.io>.
I've been unable to reproduce this issue running locally. Even with a poll
timeout of 1 millisecond, it seems to work as expected. It would be helpful
to know a little more about your setup. Are you using SSL? Are the brokers
remote? Is the network stable?
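
For reference, what I ran was essentially the sketch below: a bare 0.9
consumer against a single local broker, with auto-commit left at the
defaults. Class, topic and group names are just placeholders.

import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PollLoopRepro {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "repro-reader");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Arrays.asList("repro-topic"));
        while (true) {
            // 1 ms timeout: poll returns almost immediately, usually with an
            // empty record set, and the loop simply calls it again
            ConsumerRecords<String, String> records = consumer.poll(1);
            for (ConsumerRecord<String, String> record : records) {
                System.out.println(record.offset() + ": " + record.value());
            }
        }
    }
}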

Thanks,
Jason

On Tue, Dec 1, 2015 at 10:06 AM, Jason Gustafson <ja...@confluent.io> wrote:

> Hi Martin,
>
> I'm also not sure why the poll timeout would affect this. Perhaps the
> handler is still doing work (e.g. sending requests) when the record set is
> empty?
>
> As a general rule, I would recommend longer poll timeouts. I've actually
> tended to use Long.MAX_VALUE myself. I'll have a look just to make sure
> everything still works with smaller values though.
>
> -Jason
>
>
>
> On Tue, Dec 1, 2015 at 2:35 AM, Martin Skøtt <
> martin.skoett@falconsocial.com> wrote:
>
>> Hi Jason,
>>
>> That actually sounds like a very plausible explanation. My current
>> consumer
>> is using the default settings, but I have previously used these (taken
>> from
>> the sample in the Javadoc for the new KafkaConsumer):
>>  "auto.commit.interval.ms", "1000"
>>  "session.timeout.ms", "30000"
>>
>> My consumer loop is quite simple as it just calls a domain specific
>> service:
>>
>> while (true) {
>>     ConsumerRecords<String, Object> records = consumer.poll(10000);
>>     for (ConsumerRecord<String, Object> record : records) {
>>         serve.handle(record.topic(), record.value());
>>     }
>> }
>>
>> The domain service does a number of things (including lookups in a RDBMS
>> and saving to ElasticSearch). In my local test setup a poll will often
>> result between 5.000 and 10.000 records and I can easily see the
>> processing
>> of those taking more than 30 seconds.
>>
>> I'll probably take a look at adding some threading to my consumer and add
>> more partitions to my topics.
>>
>> That is all fine, but it doesn't really explain why increasing poll
>> timeout
>> made the problem go away :-/
>>
>> Martin
>>
>> On 30 November 2015 at 19:30, Jason Gustafson <ja...@confluent.io> wrote:
>>
>> > Hey Martin,
>> >
>> > At a glance, it looks like your consumer's session timeout is expiring.
>> > This shouldn't happen unless there is a delay between successive calls
>> to
>> > poll which is longer than the session timeout. It might help if you
>> include
>> > a snippet of your poll loop and your configuration (i.e. any overridden
>> > settings).
>> >
>> > -Jason
>> >
>> > On Mon, Nov 30, 2015 at 8:12 AM, Martin Skøtt <
>> > martin.skoett@falconsocial.com> wrote:
>> >
>> > > Well, I made the problem go away, but I'm not sure why it works :-/
>> > >
>> > > Previously I used a time out value of 100 for Consumer.poll().
>> Increasing
>> > > it to 10.000 makes the problem go away completely?! I tried other
>> values
>> > as
>> > > well:
>> > >    - 0 problem remained
>> > >    - 3000, same as heartbeat.interval, problem remained, but less
>> > frequent
>> > >
>> > > Not really sure what is going on, but happy that the problem went away
>> > :-)
>> > >
>> > > Martin
>> > >
>> > > On 30 November 2015 at 15:33, Martin Skøtt <
>> > martin.skoett@falconsocial.com
>> > > >
>> > > wrote:
>> > >
>> > > > Hi Guozhang,
>> > > >
>> > > > I have done some testing with various values of
>> heartbeat.interval.ms
>> > > and
>> > > > they don't seem to have any influence on the error messages. Running
>> > > > kafka-consumer-groups also continues to return that the consumer
>> groups
>> > > > does not exists or is rebalancing. Do you have any suggestions to
>> how I
>> > > > could debug this further?
>> > > >
>> > > > Regards,
>> > > > Martin
>> > > >
>> > > >
>> > > > On 25 November 2015 at 18:37, Guozhang Wang <wa...@gmail.com>
>> > wrote:
>> > > >
>> > > >> Hello Martin,
>> > > >>
>> > > >> It seems your consumer's heartbeat.interval.ms config value is too
>> > > small
>> > > >> (default is 3 seconds) for your environment, consider increasing it
>> > and
>> > > >> see
>> > > >> if this issue goes away.
>> > > >>
>> > > >> At the same time, we have some better error handling fixes in trunk
>> > > which
>> > > >> will be included in the next point release.
>> > > >>
>> > > >> https://issues.apache.org/jira/browse/KAFKA-2860
>> > > >>
>> > > >> Guozhang
>> > > >>
>> > > >>
>> > > >>
>> > > >> On Wed, Nov 25, 2015 at 6:54 AM, Martin Skøtt <
>> > > >> martin.skoett@falconsocial.com> wrote:
>> > > >>
>> > > >> > Hi,
>> > > >> >
>> > > >> > I'm experiencing some very strange issues with 0.9. I get these
>> log
>> > > >> > messages from the new consumer:
>> > > >> >
>> > > >> > [main] ERROR
>> > > >> > org.apache.kafka.clients.consumer.internals.ConsumerCoordinator -
>> > > Error
>> > > >> > ILLEGAL_GENERATION occurred while committing offsets for group
>> > > >> > aaa-bbb-reader
>> > > >> > [main] WARN
>> > > >> org.apache.kafka.clients.consumer.internals.ConsumerCoordinator
>> > > >> > - Auto offset commit failed: Commit cannot be completed due to
>> group
>> > > >> > rebalance
>> > > >> > [main] ERROR
>> > > >> > org.apache.kafka.clients.consumer.internals.ConsumerCoordinator -
>> > > Error
>> > > >> > ILLEGAL_GENERATION occurred while committing offsets for group
>> > > >> > aaa-bbb-reader
>> > > >> > [main] WARN
>> > > >> org.apache.kafka.clients.consumer.internals.ConsumerCoordinator
>> > > >> > - Auto offset commit failed:
>> > > >> > [main] INFO
>> > > >> org.apache.kafka.clients.consumer.internals.AbstractCoordinator
>> > > >> > - Attempt to join group aaa-bbb-reader failed due to unknown
>> member
>> > > id,
>> > > >> > resetting and retrying.
>> > > >> >
>> > > >> > And this in the broker log:
>> > > >> > [2015-11-25 15:41:01,542] INFO [GroupCoordinator 0]: Preparing to
>> > > >> > restabilize group aaa-bbb-reader with old generation 1
>> > > >> > (kafka.coordinator.GroupCoordinator)
>> > > >> > [2015-11-25 15:41:01,544] INFO [GroupCoordinator 0]:
>> > > >> > Group aaa-bbb-reader generation 1 is dead and removed
>> > > >> > (kafka.coordinator.GroupCoordinator)
>> > > >> > [2015-11-25 15:41:13,474] INFO [GroupCoordinator 0]: Preparing to
>> > > >> > restabilize group aaa-bbb-reader with old generation 0
>> > > >> > (kafka.coordinator.GroupCoordinator)
>> > > >> > [2015-11-25 15:41:13,475] INFO [GroupCoordinator 0]: Stabilized
>> > > >> > group aaa-bbb-reader generation 1
>> > (kafka.coordinator.GroupCoordinator)
>> > > >> > [2015-11-25 15:41:13,477] INFO [GroupCoordinator 0]: Assignment
>> > > received
>> > > >> > from leader for group aaa-bbb-reader for generation 1
>> > > >> > (kafka.coordinator.GroupCoordinator)
>> > > >> > [2015-11-25 15:41:43,478] INFO [GroupCoordinator 0]: Preparing to
>> > > >> > restabilize group aaa-bbb-reader with old generation 1
>> > > >> > (kafka.coordinator.GroupCoordinator)
>> > > >> > [2015-11-25 15:41:43,478] INFO [GroupCoordinator 0]:
>> > > >> > Group aaa-bbb-reader generation 1 is dead and removed
>> > > >> > (kafka.coordinator.GroupCoordinator)
>> > > >> >
>> > > >> > When this happens the kafka-consumer-groups describe command
>> keeps
>> > > >> saying
>> > > >> > that the group no longer exists or is rebalancing. What is
>> probably
>> > > even
>> > > >> > worse is that my consumers appears to be looping constantly
>> through
>> > > >> > everything written to the topics!?
>> > > >> >
>> > > >> > Does anyone have any input on what might be happening?
>> > > >> >
>> > > >> > I'm running 0.9 locally on my laptop using one Zookeeper and one
>> > > broker,
>> > > >> > both using the configuration provided in the distribution. I
>> have 13
>> > > >> topics
>> > > >> > with two partitions each and a replication factor of 1. I run one
>> > > >> producer
>> > > >> > and once consumer also on the same machine.
>> > > >> >
>> > > >> > --
>> > > >> > Martin Skøtt
>> > > >> >
>> > > >>
>> > > >>
>> > > >>
>> > > >> --
>> > > >> -- Guozhang
>> > > >>
>> > > >
>> > > >
>> > >
>> >
>>
>
>

Re: Consumer group disappears and consumers loops

Posted by Jason Gustafson <ja...@confluent.io>.
Hi Martin,

I'm also not sure why the poll timeout would affect this. Perhaps the
handler is still doing work (e.g. sending requests) when the record set is
empty?

As a general rule, I would recommend longer poll timeouts. I've actually
tended to use Long.MAX_VALUE myself. I'll have a look just to make sure
everything still works with smaller values though.
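
Concretely, reusing the names from your snippet, something like this is
what I had in mind (just a sketch; in the 0.9 consumer, heartbeats and
auto-commits are driven from inside poll(), so a long timeout doesn't by
itself starve them):

while (true) {
    // Block until records actually arrive instead of spinning on short timeouts
    ConsumerRecords<String, Object> records = consumer.poll(Long.MAX_VALUE);
    for (ConsumerRecord<String, Object> record : records) {
        serve.handle(record.topic(), record.value());
    }
}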

-Jason



On Tue, Dec 1, 2015 at 2:35 AM, Martin Skøtt <martin.skoett@falconsocial.com
> wrote:

> Hi Jason,
>
> That actually sounds like a very plausible explanation. My current consumer
> is using the default settings, but I have previously used these (taken from
> the sample in the Javadoc for the new KafkaConsumer):
>  "auto.commit.interval.ms", "1000"
>  "session.timeout.ms", "30000"
>
> My consumer loop is quite simple as it just calls a domain specific
> service:
>
> while (true) {
>     ConsumerRecords<String, Object> records = consumer.poll(10000);
>     for (ConsumerRecord<String, Object> record : records) {
>         serve.handle(record.topic(), record.value());
>     }
> }
>
> The domain service does a number of things (including lookups in a RDBMS
> and saving to ElasticSearch). In my local test setup a poll will often
> result between 5.000 and 10.000 records and I can easily see the processing
> of those taking more than 30 seconds.
>
> I'll probably take a look at adding some threading to my consumer and add
> more partitions to my topics.
>
> That is all fine, but it doesn't really explain why increasing poll timeout
> made the problem go away :-/
>
> Martin
>
> On 30 November 2015 at 19:30, Jason Gustafson <ja...@confluent.io> wrote:
>
> > Hey Martin,
> >
> > At a glance, it looks like your consumer's session timeout is expiring.
> > This shouldn't happen unless there is a delay between successive calls to
> > poll which is longer than the session timeout. It might help if you
> include
> > a snippet of your poll loop and your configuration (i.e. any overridden
> > settings).
> >
> > -Jason
> >
> > On Mon, Nov 30, 2015 at 8:12 AM, Martin Skøtt <
> > martin.skoett@falconsocial.com> wrote:
> >
> > > Well, I made the problem go away, but I'm not sure why it works :-/
> > >
> > > Previously I used a time out value of 100 for Consumer.poll().
> Increasing
> > > it to 10.000 makes the problem go away completely?! I tried other
> values
> > as
> > > well:
> > >    - 0 problem remained
> > >    - 3000, same as heartbeat.interval, problem remained, but less
> > frequent
> > >
> > > Not really sure what is going on, but happy that the problem went away
> > :-)
> > >
> > > Martin
> > >
> > > On 30 November 2015 at 15:33, Martin Skøtt <
> > martin.skoett@falconsocial.com
> > > >
> > > wrote:
> > >
> > > > Hi Guozhang,
> > > >
> > > > I have done some testing with various values of
> heartbeat.interval.ms
> > > and
> > > > they don't seem to have any influence on the error messages. Running
> > > > kafka-consumer-groups also continues to return that the consumer
> groups
> > > > does not exists or is rebalancing. Do you have any suggestions to
> how I
> > > > could debug this further?
> > > >
> > > > Regards,
> > > > Martin
> > > >
> > > >
> > > > On 25 November 2015 at 18:37, Guozhang Wang <wa...@gmail.com>
> > wrote:
> > > >
> > > >> Hello Martin,
> > > >>
> > > >> It seems your consumer's heartbeat.interval.ms config value is too
> > > small
> > > >> (default is 3 seconds) for your environment, consider increasing it
> > and
> > > >> see
> > > >> if this issue goes away.
> > > >>
> > > >> At the same time, we have some better error handling fixes in trunk
> > > which
> > > >> will be included in the next point release.
> > > >>
> > > >> https://issues.apache.org/jira/browse/KAFKA-2860
> > > >>
> > > >> Guozhang
> > > >>
> > > >>
> > > >>
> > > >> On Wed, Nov 25, 2015 at 6:54 AM, Martin Skøtt <
> > > >> martin.skoett@falconsocial.com> wrote:
> > > >>
> > > >> > Hi,
> > > >> >
> > > >> > I'm experiencing some very strange issues with 0.9. I get these
> log
> > > >> > messages from the new consumer:
> > > >> >
> > > >> > [main] ERROR
> > > >> > org.apache.kafka.clients.consumer.internals.ConsumerCoordinator -
> > > Error
> > > >> > ILLEGAL_GENERATION occurred while committing offsets for group
> > > >> > aaa-bbb-reader
> > > >> > [main] WARN
> > > >> org.apache.kafka.clients.consumer.internals.ConsumerCoordinator
> > > >> > - Auto offset commit failed: Commit cannot be completed due to
> group
> > > >> > rebalance
> > > >> > [main] ERROR
> > > >> > org.apache.kafka.clients.consumer.internals.ConsumerCoordinator -
> > > Error
> > > >> > ILLEGAL_GENERATION occurred while committing offsets for group
> > > >> > aaa-bbb-reader
> > > >> > [main] WARN
> > > >> org.apache.kafka.clients.consumer.internals.ConsumerCoordinator
> > > >> > - Auto offset commit failed:
> > > >> > [main] INFO
> > > >> org.apache.kafka.clients.consumer.internals.AbstractCoordinator
> > > >> > - Attempt to join group aaa-bbb-reader failed due to unknown
> member
> > > id,
> > > >> > resetting and retrying.
> > > >> >
> > > >> > And this in the broker log:
> > > >> > [2015-11-25 15:41:01,542] INFO [GroupCoordinator 0]: Preparing to
> > > >> > restabilize group aaa-bbb-reader with old generation 1
> > > >> > (kafka.coordinator.GroupCoordinator)
> > > >> > [2015-11-25 15:41:01,544] INFO [GroupCoordinator 0]:
> > > >> > Group aaa-bbb-reader generation 1 is dead and removed
> > > >> > (kafka.coordinator.GroupCoordinator)
> > > >> > [2015-11-25 15:41:13,474] INFO [GroupCoordinator 0]: Preparing to
> > > >> > restabilize group aaa-bbb-reader with old generation 0
> > > >> > (kafka.coordinator.GroupCoordinator)
> > > >> > [2015-11-25 15:41:13,475] INFO [GroupCoordinator 0]: Stabilized
> > > >> > group aaa-bbb-reader generation 1
> > (kafka.coordinator.GroupCoordinator)
> > > >> > [2015-11-25 15:41:13,477] INFO [GroupCoordinator 0]: Assignment
> > > received
> > > >> > from leader for group aaa-bbb-reader for generation 1
> > > >> > (kafka.coordinator.GroupCoordinator)
> > > >> > [2015-11-25 15:41:43,478] INFO [GroupCoordinator 0]: Preparing to
> > > >> > restabilize group aaa-bbb-reader with old generation 1
> > > >> > (kafka.coordinator.GroupCoordinator)
> > > >> > [2015-11-25 15:41:43,478] INFO [GroupCoordinator 0]:
> > > >> > Group aaa-bbb-reader generation 1 is dead and removed
> > > >> > (kafka.coordinator.GroupCoordinator)
> > > >> >
> > > >> > When this happens the kafka-consumer-groups describe command keeps
> > > >> saying
> > > >> > that the group no longer exists or is rebalancing. What is
> probably
> > > even
> > > >> > worse is that my consumers appears to be looping constantly
> through
> > > >> > everything written to the topics!?
> > > >> >
> > > >> > Does anyone have any input on what might be happening?
> > > >> >
> > > >> > I'm running 0.9 locally on my laptop using one Zookeeper and one
> > > broker,
> > > >> > both using the configuration provided in the distribution. I have
> 13
> > > >> topics
> > > >> > with two partitions each and a replication factor of 1. I run one
> > > >> producer
> > > >> > and once consumer also on the same machine.
> > > >> >
> > > >> > --
> > > >> > Martin Skøtt
> > > >> >
> > > >>
> > > >>
> > > >>
> > > >> --
> > > >> -- Guozhang
> > > >>
> > > >
> > > >
> > >
> >
>

Re: Consumer group disappears and consumers loops

Posted by Martin Skøtt <ma...@falconsocial.com>.
Hi Jason,

That actually sounds like a very plausible explanation. My current consumer
is using the default settings, but I have previously used these (taken from
the sample in the Javadoc for the new KafkaConsumer):
 "auto.commit.interval.ms", "1000"
 "session.timeout.ms", "30000"

My consumer loop is quite simple, as it just calls a domain-specific service:

while (true) {
    ConsumerRecords<String, Object> records = consumer.poll(10000);
    for (ConsumerRecord<String, Object> record : records) {
        serve.handle(record.topic(), record.value());
    }
}

The domain service does a number of things (including lookups in an RDBMS
and saving to Elasticsearch). In my local test setup a poll will often
return between 5,000 and 10,000 records, and I can easily see the processing
of those taking more than 30 seconds.

I'll probably take a look at adding some threading to my consumer and add
more partitions to my topics.
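
Roughly what I have in mind for the threading is the sketch below: hand
each batch to a small pool, wait for it to drain, then poll again so the
gap between polls stays well under the session timeout. Pool size and
error handling are still guesses on my part (java.util and
java.util.concurrent imports omitted).

ExecutorService pool = Executors.newFixedThreadPool(8);

while (true) {
    ConsumerRecords<String, Object> records = consumer.poll(10000);
    List<Future<?>> batch = new ArrayList<>();
    for (ConsumerRecord<String, Object> record : records) {
        batch.add(pool.submit(() -> serve.handle(record.topic(), record.value())));
    }
    // Wait for the whole batch before polling again, so the auto-commit on
    // the next poll should only cover records that were actually handled
    for (Future<?> f : batch) {
        try {
            f.get();
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e); // placeholder error handling
        }
    }
}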

That is all fine, but it doesn't really explain why increasing the poll
timeout made the problem go away :-/

Martin

On 30 November 2015 at 19:30, Jason Gustafson <ja...@confluent.io> wrote:

> Hey Martin,
>
> At a glance, it looks like your consumer's session timeout is expiring.
> This shouldn't happen unless there is a delay between successive calls to
> poll which is longer than the session timeout. It might help if you include
> a snippet of your poll loop and your configuration (i.e. any overridden
> settings).
>
> -Jason
>
> On Mon, Nov 30, 2015 at 8:12 AM, Martin Skøtt <
> martin.skoett@falconsocial.com> wrote:
>
> > Well, I made the problem go away, but I'm not sure why it works :-/
> >
> > Previously I used a time out value of 100 for Consumer.poll(). Increasing
> > it to 10.000 makes the problem go away completely?! I tried other values
> as
> > well:
> >    - 0 problem remained
> >    - 3000, same as heartbeat.interval, problem remained, but less
> frequent
> >
> > Not really sure what is going on, but happy that the problem went away
> :-)
> >
> > Martin
> >
> > On 30 November 2015 at 15:33, Martin Skøtt <
> martin.skoett@falconsocial.com
> > >
> > wrote:
> >
> > > Hi Guozhang,
> > >
> > > I have done some testing with various values of heartbeat.interval.ms
> > and
> > > they don't seem to have any influence on the error messages. Running
> > > kafka-consumer-groups also continues to return that the consumer groups
> > > does not exists or is rebalancing. Do you have any suggestions to how I
> > > could debug this further?
> > >
> > > Regards,
> > > Martin
> > >
> > >
> > > On 25 November 2015 at 18:37, Guozhang Wang <wa...@gmail.com>
> wrote:
> > >
> > >> Hello Martin,
> > >>
> > >> It seems your consumer's heartbeat.interval.ms config value is too
> > small
> > >> (default is 3 seconds) for your environment, consider increasing it
> and
> > >> see
> > >> if this issue goes away.
> > >>
> > >> At the same time, we have some better error handling fixes in trunk
> > which
> > >> will be included in the next point release.
> > >>
> > >> https://issues.apache.org/jira/browse/KAFKA-2860
> > >>
> > >> Guozhang
> > >>
> > >>
> > >>
> > >> On Wed, Nov 25, 2015 at 6:54 AM, Martin Skøtt <
> > >> martin.skoett@falconsocial.com> wrote:
> > >>
> > >> > Hi,
> > >> >
> > >> > I'm experiencing some very strange issues with 0.9. I get these log
> > >> > messages from the new consumer:
> > >> >
> > >> > [main] ERROR
> > >> > org.apache.kafka.clients.consumer.internals.ConsumerCoordinator -
> > Error
> > >> > ILLEGAL_GENERATION occurred while committing offsets for group
> > >> > aaa-bbb-reader
> > >> > [main] WARN
> > >> org.apache.kafka.clients.consumer.internals.ConsumerCoordinator
> > >> > - Auto offset commit failed: Commit cannot be completed due to group
> > >> > rebalance
> > >> > [main] ERROR
> > >> > org.apache.kafka.clients.consumer.internals.ConsumerCoordinator -
> > Error
> > >> > ILLEGAL_GENERATION occurred while committing offsets for group
> > >> > aaa-bbb-reader
> > >> > [main] WARN
> > >> org.apache.kafka.clients.consumer.internals.ConsumerCoordinator
> > >> > - Auto offset commit failed:
> > >> > [main] INFO
> > >> org.apache.kafka.clients.consumer.internals.AbstractCoordinator
> > >> > - Attempt to join group aaa-bbb-reader failed due to unknown member
> > id,
> > >> > resetting and retrying.
> > >> >
> > >> > And this in the broker log:
> > >> > [2015-11-25 15:41:01,542] INFO [GroupCoordinator 0]: Preparing to
> > >> > restabilize group aaa-bbb-reader with old generation 1
> > >> > (kafka.coordinator.GroupCoordinator)
> > >> > [2015-11-25 15:41:01,544] INFO [GroupCoordinator 0]:
> > >> > Group aaa-bbb-reader generation 1 is dead and removed
> > >> > (kafka.coordinator.GroupCoordinator)
> > >> > [2015-11-25 15:41:13,474] INFO [GroupCoordinator 0]: Preparing to
> > >> > restabilize group aaa-bbb-reader with old generation 0
> > >> > (kafka.coordinator.GroupCoordinator)
> > >> > [2015-11-25 15:41:13,475] INFO [GroupCoordinator 0]: Stabilized
> > >> > group aaa-bbb-reader generation 1
> (kafka.coordinator.GroupCoordinator)
> > >> > [2015-11-25 15:41:13,477] INFO [GroupCoordinator 0]: Assignment
> > received
> > >> > from leader for group aaa-bbb-reader for generation 1
> > >> > (kafka.coordinator.GroupCoordinator)
> > >> > [2015-11-25 15:41:43,478] INFO [GroupCoordinator 0]: Preparing to
> > >> > restabilize group aaa-bbb-reader with old generation 1
> > >> > (kafka.coordinator.GroupCoordinator)
> > >> > [2015-11-25 15:41:43,478] INFO [GroupCoordinator 0]:
> > >> > Group aaa-bbb-reader generation 1 is dead and removed
> > >> > (kafka.coordinator.GroupCoordinator)
> > >> >
> > >> > When this happens the kafka-consumer-groups describe command keeps
> > >> saying
> > >> > that the group no longer exists or is rebalancing. What is probably
> > even
> > >> > worse is that my consumers appears to be looping constantly through
> > >> > everything written to the topics!?
> > >> >
> > >> > Does anyone have any input on what might be happening?
> > >> >
> > >> > I'm running 0.9 locally on my laptop using one Zookeeper and one
> > broker,
> > >> > both using the configuration provided in the distribution. I have 13
> > >> topics
> > >> > with two partitions each and a replication factor of 1. I run one
> > >> producer
> > >> > and once consumer also on the same machine.
> > >> >
> > >> > --
> > >> > Martin Skøtt
> > >> >
> > >>
> > >>
> > >>
> > >> --
> > >> -- Guozhang
> > >>
> > >
> > >
> >
>

Re: Consumer group disappears and consumers loops

Posted by Jason Gustafson <ja...@confluent.io>.
Hey Martin,

At a glance, it looks like your consumer's session timeout is expiring.
This shouldn't happen unless there is a delay between successive calls to
poll which is longer than the session timeout. It might help if you include
a snippet of your poll loop and your configuration (i.e. any overridden
settings).
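
For the configuration, even just the overridden properties would be
useful, something along these lines (values here are only illustrative):

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "aaa-bbb-reader");
props.put("enable.auto.commit", "true");
props.put("auto.commit.interval.ms", "1000");
props.put("session.timeout.ms", "30000");
// ... plus the deserializers and any other overrides you use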

-Jason

On Mon, Nov 30, 2015 at 8:12 AM, Martin Skøtt <
martin.skoett@falconsocial.com> wrote:

> Well, I made the problem go away, but I'm not sure why it works :-/
>
> Previously I used a time out value of 100 for Consumer.poll(). Increasing
> it to 10.000 makes the problem go away completely?! I tried other values as
> well:
>    - 0 problem remained
>    - 3000, same as heartbeat.interval, problem remained, but less frequent
>
> Not really sure what is going on, but happy that the problem went away :-)
>
> Martin
>
> On 30 November 2015 at 15:33, Martin Skøtt <martin.skoett@falconsocial.com
> >
> wrote:
>
> > Hi Guozhang,
> >
> > I have done some testing with various values of heartbeat.interval.ms
> and
> > they don't seem to have any influence on the error messages. Running
> > kafka-consumer-groups also continues to return that the consumer groups
> > does not exists or is rebalancing. Do you have any suggestions to how I
> > could debug this further?
> >
> > Regards,
> > Martin
> >
> >
> > On 25 November 2015 at 18:37, Guozhang Wang <wa...@gmail.com> wrote:
> >
> >> Hello Martin,
> >>
> >> It seems your consumer's heartbeat.interval.ms config value is too
> small
> >> (default is 3 seconds) for your environment, consider increasing it and
> >> see
> >> if this issue goes away.
> >>
> >> At the same time, we have some better error handling fixes in trunk
> which
> >> will be included in the next point release.
> >>
> >> https://issues.apache.org/jira/browse/KAFKA-2860
> >>
> >> Guozhang
> >>
> >>
> >>
> >> On Wed, Nov 25, 2015 at 6:54 AM, Martin Skøtt <
> >> martin.skoett@falconsocial.com> wrote:
> >>
> >> > Hi,
> >> >
> >> > I'm experiencing some very strange issues with 0.9. I get these log
> >> > messages from the new consumer:
> >> >
> >> > [main] ERROR
> >> > org.apache.kafka.clients.consumer.internals.ConsumerCoordinator -
> Error
> >> > ILLEGAL_GENERATION occurred while committing offsets for group
> >> > aaa-bbb-reader
> >> > [main] WARN
> >> org.apache.kafka.clients.consumer.internals.ConsumerCoordinator
> >> > - Auto offset commit failed: Commit cannot be completed due to group
> >> > rebalance
> >> > [main] ERROR
> >> > org.apache.kafka.clients.consumer.internals.ConsumerCoordinator -
> Error
> >> > ILLEGAL_GENERATION occurred while committing offsets for group
> >> > aaa-bbb-reader
> >> > [main] WARN
> >> org.apache.kafka.clients.consumer.internals.ConsumerCoordinator
> >> > - Auto offset commit failed:
> >> > [main] INFO
> >> org.apache.kafka.clients.consumer.internals.AbstractCoordinator
> >> > - Attempt to join group aaa-bbb-reader failed due to unknown member
> id,
> >> > resetting and retrying.
> >> >
> >> > And this in the broker log:
> >> > [2015-11-25 15:41:01,542] INFO [GroupCoordinator 0]: Preparing to
> >> > restabilize group aaa-bbb-reader with old generation 1
> >> > (kafka.coordinator.GroupCoordinator)
> >> > [2015-11-25 15:41:01,544] INFO [GroupCoordinator 0]:
> >> > Group aaa-bbb-reader generation 1 is dead and removed
> >> > (kafka.coordinator.GroupCoordinator)
> >> > [2015-11-25 15:41:13,474] INFO [GroupCoordinator 0]: Preparing to
> >> > restabilize group aaa-bbb-reader with old generation 0
> >> > (kafka.coordinator.GroupCoordinator)
> >> > [2015-11-25 15:41:13,475] INFO [GroupCoordinator 0]: Stabilized
> >> > group aaa-bbb-reader generation 1 (kafka.coordinator.GroupCoordinator)
> >> > [2015-11-25 15:41:13,477] INFO [GroupCoordinator 0]: Assignment
> received
> >> > from leader for group aaa-bbb-reader for generation 1
> >> > (kafka.coordinator.GroupCoordinator)
> >> > [2015-11-25 15:41:43,478] INFO [GroupCoordinator 0]: Preparing to
> >> > restabilize group aaa-bbb-reader with old generation 1
> >> > (kafka.coordinator.GroupCoordinator)
> >> > [2015-11-25 15:41:43,478] INFO [GroupCoordinator 0]:
> >> > Group aaa-bbb-reader generation 1 is dead and removed
> >> > (kafka.coordinator.GroupCoordinator)
> >> >
> >> > When this happens the kafka-consumer-groups describe command keeps
> >> saying
> >> > that the group no longer exists or is rebalancing. What is probably
> even
> >> > worse is that my consumers appears to be looping constantly through
> >> > everything written to the topics!?
> >> >
> >> > Does anyone have any input on what might be happening?
> >> >
> >> > I'm running 0.9 locally on my laptop using one Zookeeper and one
> broker,
> >> > both using the configuration provided in the distribution. I have 13
> >> topics
> >> > with two partitions each and a replication factor of 1. I run one
> >> producer
> >> > and once consumer also on the same machine.
> >> >
> >> > --
> >> > Martin Skøtt
> >> >
> >>
> >>
> >>
> >> --
> >> -- Guozhang
> >>
> >
> >
>

Re: Consumer group disappears and consumers loops

Posted by Martin Skøtt <ma...@falconsocial.com>.
Well, I made the problem go away, but I'm not sure why it works :-/

Previously I used a timeout value of 100 ms for Consumer.poll(). Increasing
it to 10,000 ms makes the problem go away completely?! I tried other values as
well:
   - 0: problem remained
   - 3000 (same as heartbeat.interval.ms): problem remained, but less frequent

Not really sure what is going on, but happy that the problem went away :-)

Martin

On 30 November 2015 at 15:33, Martin Skøtt <ma...@falconsocial.com>
wrote:

> Hi Guozhang,
>
> I have done some testing with various values of heartbeat.interval.ms and
> they don't seem to have any influence on the error messages. Running
> kafka-consumer-groups also continues to return that the consumer groups
> does not exists or is rebalancing. Do you have any suggestions to how I
> could debug this further?
>
> Regards,
> Martin
>
>
> On 25 November 2015 at 18:37, Guozhang Wang <wa...@gmail.com> wrote:
>
>> Hello Martin,
>>
>> It seems your consumer's heartbeat.interval.ms config value is too small
>> (default is 3 seconds) for your environment, consider increasing it and
>> see
>> if this issue goes away.
>>
>> At the same time, we have some better error handling fixes in trunk which
>> will be included in the next point release.
>>
>> https://issues.apache.org/jira/browse/KAFKA-2860
>>
>> Guozhang
>>
>>
>>
>> On Wed, Nov 25, 2015 at 6:54 AM, Martin Skøtt <
>> martin.skoett@falconsocial.com> wrote:
>>
>> > Hi,
>> >
>> > I'm experiencing some very strange issues with 0.9. I get these log
>> > messages from the new consumer:
>> >
>> > [main] ERROR
>> > org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - Error
>> > ILLEGAL_GENERATION occurred while committing offsets for group
>> > aaa-bbb-reader
>> > [main] WARN
>> org.apache.kafka.clients.consumer.internals.ConsumerCoordinator
>> > - Auto offset commit failed: Commit cannot be completed due to group
>> > rebalance
>> > [main] ERROR
>> > org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - Error
>> > ILLEGAL_GENERATION occurred while committing offsets for group
>> > aaa-bbb-reader
>> > [main] WARN
>> org.apache.kafka.clients.consumer.internals.ConsumerCoordinator
>> > - Auto offset commit failed:
>> > [main] INFO
>> org.apache.kafka.clients.consumer.internals.AbstractCoordinator
>> > - Attempt to join group aaa-bbb-reader failed due to unknown member id,
>> > resetting and retrying.
>> >
>> > And this in the broker log:
>> > [2015-11-25 15:41:01,542] INFO [GroupCoordinator 0]: Preparing to
>> > restabilize group aaa-bbb-reader with old generation 1
>> > (kafka.coordinator.GroupCoordinator)
>> > [2015-11-25 15:41:01,544] INFO [GroupCoordinator 0]:
>> > Group aaa-bbb-reader generation 1 is dead and removed
>> > (kafka.coordinator.GroupCoordinator)
>> > [2015-11-25 15:41:13,474] INFO [GroupCoordinator 0]: Preparing to
>> > restabilize group aaa-bbb-reader with old generation 0
>> > (kafka.coordinator.GroupCoordinator)
>> > [2015-11-25 15:41:13,475] INFO [GroupCoordinator 0]: Stabilized
>> > group aaa-bbb-reader generation 1 (kafka.coordinator.GroupCoordinator)
>> > [2015-11-25 15:41:13,477] INFO [GroupCoordinator 0]: Assignment received
>> > from leader for group aaa-bbb-reader for generation 1
>> > (kafka.coordinator.GroupCoordinator)
>> > [2015-11-25 15:41:43,478] INFO [GroupCoordinator 0]: Preparing to
>> > restabilize group aaa-bbb-reader with old generation 1
>> > (kafka.coordinator.GroupCoordinator)
>> > [2015-11-25 15:41:43,478] INFO [GroupCoordinator 0]:
>> > Group aaa-bbb-reader generation 1 is dead and removed
>> > (kafka.coordinator.GroupCoordinator)
>> >
>> > When this happens the kafka-consumer-groups describe command keeps
>> saying
>> > that the group no longer exists or is rebalancing. What is probably even
>> > worse is that my consumers appears to be looping constantly through
>> > everything written to the topics!?
>> >
>> > Does anyone have any input on what might be happening?
>> >
>> > I'm running 0.9 locally on my laptop using one Zookeeper and one broker,
>> > both using the configuration provided in the distribution. I have 13
>> topics
>> > with two partitions each and a replication factor of 1. I run one
>> producer
>> > and once consumer also on the same machine.
>> >
>> > --
>> > Martin Skøtt
>> >
>>
>>
>>
>> --
>> -- Guozhang
>>
>
>

Re: Consumer group disappears and consumers loops

Posted by Martin Skøtt <ma...@falconsocial.com>.
Hi Guozhang,

I have done some testing with various values of heartbeat.interval.ms and
they don't seem to have any influence on the error messages. Running
kafka-consumer-groups also continues to report that the consumer group
does not exist or is rebalancing. Do you have any suggestions for how I
could debug this further?

Regards,
Martin

On 25 November 2015 at 18:37, Guozhang Wang <wa...@gmail.com> wrote:

> Hello Martin,
>
> It seems your consumer's heartbeat.interval.ms config value is too small
> (default is 3 seconds) for your environment, consider increasing it and see
> if this issue goes away.
>
> At the same time, we have some better error handling fixes in trunk which
> will be included in the next point release.
>
> https://issues.apache.org/jira/browse/KAFKA-2860
>
> Guozhang
>
>
>
> On Wed, Nov 25, 2015 at 6:54 AM, Martin Skøtt <
> martin.skoett@falconsocial.com> wrote:
>
> > Hi,
> >
> > I'm experiencing some very strange issues with 0.9. I get these log
> > messages from the new consumer:
> >
> > [main] ERROR
> > org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - Error
> > ILLEGAL_GENERATION occurred while committing offsets for group
> > aaa-bbb-reader
> > [main] WARN
> org.apache.kafka.clients.consumer.internals.ConsumerCoordinator
> > - Auto offset commit failed: Commit cannot be completed due to group
> > rebalance
> > [main] ERROR
> > org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - Error
> > ILLEGAL_GENERATION occurred while committing offsets for group
> > aaa-bbb-reader
> > [main] WARN
> org.apache.kafka.clients.consumer.internals.ConsumerCoordinator
> > - Auto offset commit failed:
> > [main] INFO
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator
> > - Attempt to join group aaa-bbb-reader failed due to unknown member id,
> > resetting and retrying.
> >
> > And this in the broker log:
> > [2015-11-25 15:41:01,542] INFO [GroupCoordinator 0]: Preparing to
> > restabilize group aaa-bbb-reader with old generation 1
> > (kafka.coordinator.GroupCoordinator)
> > [2015-11-25 15:41:01,544] INFO [GroupCoordinator 0]:
> > Group aaa-bbb-reader generation 1 is dead and removed
> > (kafka.coordinator.GroupCoordinator)
> > [2015-11-25 15:41:13,474] INFO [GroupCoordinator 0]: Preparing to
> > restabilize group aaa-bbb-reader with old generation 0
> > (kafka.coordinator.GroupCoordinator)
> > [2015-11-25 15:41:13,475] INFO [GroupCoordinator 0]: Stabilized
> > group aaa-bbb-reader generation 1 (kafka.coordinator.GroupCoordinator)
> > [2015-11-25 15:41:13,477] INFO [GroupCoordinator 0]: Assignment received
> > from leader for group aaa-bbb-reader for generation 1
> > (kafka.coordinator.GroupCoordinator)
> > [2015-11-25 15:41:43,478] INFO [GroupCoordinator 0]: Preparing to
> > restabilize group aaa-bbb-reader with old generation 1
> > (kafka.coordinator.GroupCoordinator)
> > [2015-11-25 15:41:43,478] INFO [GroupCoordinator 0]:
> > Group aaa-bbb-reader generation 1 is dead and removed
> > (kafka.coordinator.GroupCoordinator)
> >
> > When this happens the kafka-consumer-groups describe command keeps saying
> > that the group no longer exists or is rebalancing. What is probably even
> > worse is that my consumers appears to be looping constantly through
> > everything written to the topics!?
> >
> > Does anyone have any input on what might be happening?
> >
> > I'm running 0.9 locally on my laptop using one Zookeeper and one broker,
> > both using the configuration provided in the distribution. I have 13
> topics
> > with two partitions each and a replication factor of 1. I run one
> producer
> > and once consumer also on the same machine.
> >
> > --
> > Martin Skøtt
> >
>
>
>
> --
> -- Guozhang
>

Re: Consumer group disappears and consumers loops

Posted by Guozhang Wang <wa...@gmail.com>.
Hello Martin,

It seems your consumer's heartbeat.interval.ms config value (default is 3
seconds) is too small for your environment; consider increasing it and see
if this issue goes away.
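
For example (values are only illustrative; heartbeat.interval.ms has to
stay below session.timeout.ms):

props.put("session.timeout.ms", "30000");     // the default
props.put("heartbeat.interval.ms", "10000");  // default is 3000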

At the same time, we have some better error handling fixes in trunk which
will be included in the next point release.

https://issues.apache.org/jira/browse/KAFKA-2860

Guozhang



On Wed, Nov 25, 2015 at 6:54 AM, Martin Skøtt <
martin.skoett@falconsocial.com> wrote:

> Hi,
>
> I'm experiencing some very strange issues with 0.9. I get these log
> messages from the new consumer:
>
> [main] ERROR
> org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - Error
> ILLEGAL_GENERATION occurred while committing offsets for group
> aaa-bbb-reader
> [main] WARN org.apache.kafka.clients.consumer.internals.ConsumerCoordinator
> - Auto offset commit failed: Commit cannot be completed due to group
> rebalance
> [main] ERROR
> org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - Error
> ILLEGAL_GENERATION occurred while committing offsets for group
> aaa-bbb-reader
> [main] WARN org.apache.kafka.clients.consumer.internals.ConsumerCoordinator
> - Auto offset commit failed:
> [main] INFO org.apache.kafka.clients.consumer.internals.AbstractCoordinator
> - Attempt to join group aaa-bbb-reader failed due to unknown member id,
> resetting and retrying.
>
> And this in the broker log:
> [2015-11-25 15:41:01,542] INFO [GroupCoordinator 0]: Preparing to
> restabilize group aaa-bbb-reader with old generation 1
> (kafka.coordinator.GroupCoordinator)
> [2015-11-25 15:41:01,544] INFO [GroupCoordinator 0]:
> Group aaa-bbb-reader generation 1 is dead and removed
> (kafka.coordinator.GroupCoordinator)
> [2015-11-25 15:41:13,474] INFO [GroupCoordinator 0]: Preparing to
> restabilize group aaa-bbb-reader with old generation 0
> (kafka.coordinator.GroupCoordinator)
> [2015-11-25 15:41:13,475] INFO [GroupCoordinator 0]: Stabilized
> group aaa-bbb-reader generation 1 (kafka.coordinator.GroupCoordinator)
> [2015-11-25 15:41:13,477] INFO [GroupCoordinator 0]: Assignment received
> from leader for group aaa-bbb-reader for generation 1
> (kafka.coordinator.GroupCoordinator)
> [2015-11-25 15:41:43,478] INFO [GroupCoordinator 0]: Preparing to
> restabilize group aaa-bbb-reader with old generation 1
> (kafka.coordinator.GroupCoordinator)
> [2015-11-25 15:41:43,478] INFO [GroupCoordinator 0]:
> Group aaa-bbb-reader generation 1 is dead and removed
> (kafka.coordinator.GroupCoordinator)
>
> When this happens the kafka-consumer-groups describe command keeps saying
> that the group no longer exists or is rebalancing. What is probably even
> worse is that my consumers appears to be looping constantly through
> everything written to the topics!?
>
> Does anyone have any input on what might be happening?
>
> I'm running 0.9 locally on my laptop using one Zookeeper and one broker,
> both using the configuration provided in the distribution. I have 13 topics
> with two partitions each and a replication factor of 1. I run one producer
> and once consumer also on the same machine.
>
> --
> Martin Skøtt
>



-- 
-- Guozhang