Posted to users@kafka.apache.org by Rajiv Kurian <ra...@signalfx.com> on 2016/08/30 06:11:27 UTC

Kafka 0.9 consumer gets stuck in epoll

We had a Kafka 0.9 consumer stuck in the epoll native call under the
following circumstances.


1. The consumer was started, bootstrapped against a cluster of 3 brokers A,
B and C with ids 1,2,3.
2. The consumer's assignment was changed to some topic partitions, and it
was then made to seek to the beginning of each topic partition.
3. NO poll calls were made at all.
4. Brokers A, B and C were each replaced, one by one, by three new
brokers D, E and F with the same ids 1,2,3. The process of replacement was:
     1. Shut down broker A (has id 1).
     2. Bring up broker D (has id 1, i.e. same as A).
     3. Give it a minute or so and do the same with B and C.


5. So by this time none of the bootstrapped brokers were alive. They were
all replaced. I can imagine that this would cause a problem with the new
0.9 consumer since it doesn't have a watch on the brokers directory in ZK
any more.

6. Call poll finally on the consumer.

Expected result - Some kind of exception or just empty results, since
none of the brokers in the bootstrap list are present any more.

Observed result - The poll call is just blocked in Kafka. Even though a
timeout of 500ms was provided it never returned. I am not sure why this
would happen but the same thing happened on 45 hosts so I am guessing this
is pretty reproducible. This led to the thread just getting stuck. We had
to ultimately kill -9 our processes to recover from this. Ideally a Kafka
poll call with a given timeout should never block indefinitely. Here is the
stack trace I was able to get:

 java.lang.Thread.State: RUNNABLE
    at sun.nio.ch.EPollArrayWrapper.$$YJP$$epollWait(Native Method)
    at sun.nio.ch.EPollArrayWrapper.epollWait(EPollArrayWrapper.java)
    at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
    at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
    at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
    - locked <0x0000000504e58468> (a sun.nio.ch.Util$2)
    - locked <0x0000000504e58450> (a java.util.Collections$UnmodifiableSet)
    - locked <0x0000000504e029d8> (a sun.nio.ch.EPollSelectorImpl)
    at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
    at sf.org.apache.kafka9.common.network.Selector.select(Selector.java:425)
    at sf.org.apache.kafka9.common.network.Selector.poll(Selector.java:254)
    at sf.org.apache.kafka9.clients.NetworkClient.poll(NetworkClient.java:256)
    at sf.org.apache.kafka9.clients.consumer.internals.ConsumerNetworkClient.clientPoll(ConsumerNetworkClient.java:320)
    at sf.org.apache.kafka9.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:213)
    at sf.org.apache.kafka9.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:193)
    at sf.org.apache.kafka9.clients.consumer.internals.ConsumerNetworkClient.awaitMetadataUpdate(ConsumerNetworkClient.java:134)
    at sf.org.apache.kafka9.clients.consumer.internals.AbstractCoordinator.ensureCoordinatorKnown(AbstractCoordinator.java:184)
    at sf.org.apache.kafka9.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:886)
    at sf.org.apache.kafka9.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:853)

sf.org.apache.kafka9 is just our shaded jar but this is the stock Kafka 0.9
consumer code.

Is this a known issue? Even though this happened under extraordinary
circumstances (i.e. the entire bootstrap list was replaced), the blocking
ended up stalling the entire thread this code was running on.
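
One possible caller-side stopgap, so that a call which ignores its timeout
cannot stall the calling thread, is to run it on a helper thread and wait
for the result with a hard deadline. Below is a minimal sketch using only
the JDK. The class DeadlineCall, the method callWithDeadline and the
stand-in "stuck" task are hypothetical names for illustration, not part of
the Kafka API, and this has not been tried against the real consumer.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: bounds how long the caller waits for a blocking task.
final class DeadlineCall {
    // Daemon threads, so a task that never returns cannot keep the JVM alive.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "deadline-call");
        t.setDaemon(true);
        return t;
    });

    /**
     * Runs the task on a helper thread and waits at most timeoutMs for it.
     * Returns the task's result, or the fallback if the deadline passes first.
     */
    static <T> T callWithDeadline(Callable<T> task, long timeoutMs, T fallback)
            throws ExecutionException, InterruptedException {
        Future<T> future = POOL.submit(task);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Best effort only: this interrupts the helper thread, but the
            // task may or may not react to the interrupt.
            future.cancel(true);
            return fallback;
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a poll call that never returns on its own.
        Callable<String> stuck = () -> {
            Thread.sleep(Long.MAX_VALUE);
            return "records";
        };
        System.out.println(callWithDeadline(stuck, 500, "timed out"));
        System.out.println(callWithDeadline(() -> "records", 500, "timed out"));
    }
}
```

This only protects the caller. The helper thread itself can stay stuck, and
since KafkaConsumer is not safe for multi-threaded access, a real version
would have to route every consumer call through the same helper.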

Thanks,
Rajiv