Posted to users@kafka.apache.org by tao xiao <xi...@gmail.com> on 2015/07/08 14:07:49 UTC

Got conflicted ephemeral node exception for several hours

Hi team,

I have 10 high-level consumers connecting to Kafka, and one of them kept
complaining "conflicted ephemeral node" for about 8 hours. The log was
filled with the exception below:

[2015-07-07 14:03:51,615] INFO conflict in
/consumers/group/ids/test-1435856975563-9a9fdc6c data:
{"version":1,"subscription":{"test.*":1},"pattern":"white_list","timestamp":"1436275631510"}
stored data:
{"version":1,"subscription":{"test.*":1},"pattern":"white_list","timestamp":"1436275558570"}
(kafka.utils.ZkUtils$)
[2015-07-07 14:03:51,616] INFO I wrote this conflicted ephemeral node
[{"version":1,"subscription":{"test.*":1},"pattern":"white_list","timestamp":"1436275631510"}]
at /consumers/group/ids/test-1435856975563-9a9fdc6c a while back in a
different session, hence I will backoff for this node to be deleted by
Zookeeper and retry (kafka.utils.ZkUtils$)

In the meantime ZooKeeper reported the exception below over the same time span:

2015-07-07 22:45:09,687 [myid:3] - INFO  [ProcessThread(sid:3
cport:-1)::PrepRequestProcessor@645] - Got user-level KeeperException when
processing sessionid:0x44e657ff19c0019 type:create cxid:0x7a26
zxid:0x3015f6e77 txntype:-1 reqpath:n/a Error
Path:/consumers/group/ids/test-1435856975563-9a9fdc6c Error:KeeperErrorCode
= NodeExists for /consumers/group/ids/test-1435856975563-9a9fdc6c

In the end ZooKeeper timed out the session and the consumers triggered a
rebalance.

I know the conflicted ephemeral node warning is there to handle a ZooKeeper
bug in which session expiration and ephemeral node deletion are not done
atomically, but as the ZooKeeper log indicates, ZooKeeper never got a chance
to delete the ephemeral node, which makes me think the session had not
actually expired at that time. Yet for some reason ZooKeeper fired a
session-expired event, which subsequently invoked ZKSessionExpireListener.
I was just wondering whether anyone has encountered a similar issue before,
and what I can do on the ZooKeeper side to prevent this?

Another problem is that the createEphemeralPathExpectConflictHandleZKBug call
is wrapped in a while(true) loop which runs forever until the ephemeral node
is created. Would it be better to employ an exponential retry policy with a
maximum number of retries, so that the exception gets a chance to be
re-thrown to the caller and handled there in situations like the above?
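
Something along these lines is what I have in mind; it is just a rough,
untested sketch (the retryWithBackoff helper, the default retry count and
the backoff values are all made up, this is not the existing ZkUtils code):

import scala.util.{Failure, Success, Try}

object ZkRetrySketch {
  // Illustrative only: a bounded retry helper that gives up and re-throws
  // instead of spinning in while(true) forever.
  def retryWithBackoff[T](op: () => T,
                          maxRetries: Int = 10,
                          baseBackoffMs: Long = 500L): T = {
    var attempt = 0
    var result: Option[T] = None
    var lastError: Throwable = null
    while (result.isEmpty && attempt <= maxRetries) {
      Try(op()) match {
        case Success(v) => result = Some(v)
        case Failure(e) =>
          lastError = e
          attempt += 1
          if (attempt <= maxRetries) {
            // Exponential backoff, capped at 30 seconds between attempts.
            Thread.sleep(math.min(baseBackoffMs << (attempt - 1), 30000L))
          }
      }
    }
    // Once the retry budget is exhausted, surface the failure to the caller
    // instead of looping forever.
    result.getOrElse(throw new RuntimeException(
      s"Gave up after $attempt attempts", lastError))
  }
}

// Hypothetical usage, wrapping the ephemeral node creation so the caller
// finally sees the failure:
//   retryWithBackoff(() => zkClient.createEphemeral(path, data))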

Re: Got conflicted ephemeral node exception for several hours

Posted by Mayuresh Gharat <gh...@gmail.com>.
Bouncing the consumers should solve this issue in most cases.

Thanks,

Mayuresh

On Sun, Jul 12, 2015 at 8:21 PM, Jiangjie Qin <jq...@linkedin.com.invalid>
wrote:

> Hi Tao,
>
> We see this error from time to time but did not think of this as a big
> issue. Any reason it bothers you much?
> I'm not sure if throwing exception to user on this exception is a good
> handling or not. What are user supposed to do in that case other than
> retry?
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
> On 7/12/15, 7:16 PM, "tao xiao" <xi...@gmail.com> wrote:
>
> >We saw the error again in our cluster.  Anyone has the same issue before?
> >
> >On Fri, 10 Jul 2015 at 13:26 tao xiao <xi...@gmail.com> wrote:
> >
> >> Bump the thread. Any help would be appreciated.
> >>
> >> On Wed, 8 Jul 2015 at 20:09 tao xiao <xi...@gmail.com> wrote:
> >>
> >>> Additional info
> >>> Kafka version: 0.8.2.1
> >>> zookeeper: 3.4.6
> >>>
> >>> On Wed, 8 Jul 2015 at 20:07 tao xiao <xi...@gmail.com> wrote:
> >>>
> >>>> Hi team,
> >>>>
> >>>> I have 10 high level consumers connecting to Kafka and one of them
> >>>>kept
> >>>> complaining "conflicted ephemeral node" for about 8 hours. The log was
> >>>> filled with below exception
> >>>>
> >>>> [2015-07-07 14:03:51,615] INFO conflict in
> >>>> /consumers/group/ids/test-1435856975563-9a9fdc6c data:
> >>>>
> >>>>{"version":1,"subscription":{"test.*":1},"pattern":"white_list","timest
> >>>>amp":"1436275631510"}
> >>>> stored data:
> >>>>
> >>>>{"version":1,"subscription":{"test.*":1},"pattern":"white_list","timest
> >>>>amp":"1436275558570"}
> >>>> (kafka.utils.ZkUtils$)
> >>>> [2015-07-07 14:03:51,616] INFO I wrote this conflicted ephemeral node
> >>>>
> >>>>[{"version":1,"subscription":{"test.*":1},"pattern":"white_list","times
> >>>>tamp":"1436275631510"}]
> >>>> at /consumers/group/ids/test-1435856975563-9a9fdc6c a while back in a
> >>>> different session, hence I will backoff for this node to be deleted by
> >>>> Zookeeper and retry (kafka.utils.ZkUtils$)
> >>>>
> >>>> In the meantime zookeeper reported below exception for the same time
> >>>>span
> >>>>
> >>>> 2015-07-07 22:45:09,687 [myid:3] - INFO  [ProcessThread(sid:3
> >>>> cport:-1)::PrepRequestProcessor@645] - Got user-level KeeperException
> >>>> when processing sessionid:0x44e657ff19c0019 type:create cxid:0x7a26
> >>>> zxid:0x3015f6e77 txntype:-1 reqpath:n/a Error
> >>>> Path:/consumers/group/ids/test-1435856975563-9a9fdc6c
> >>>>Error:KeeperErrorCode
> >>>> = NodeExists for /consumers/group/ids/test-1435856975563-9a9fdc6c
> >>>>
> >>>> At the end zookeeper timed out the session and consumers triggered
> >>>> rebalance.
> >>>>
> >>>> I know that conflicted ephemeral node warning is to handle a zookeeper
> >>>> bug that session expiration and ephemeral node deletion are not done
> >>>> atomically but as indicated from zookeeper log the zookeeper never
> >>>>got a
> >>>> chance to delete the ephemeral node which made me think that the
> >>>>session
> >>>> was not expired at that time. And for some reason zookeeper fired
> >>>>session
> >>>> expire event which subsequently invoked ZKSessionExpireListener.  I
> >>>>was
> >>>> just wondering if anyone have ever encountered similar issue before
> >>>>and
> >>>> what I can do at zookeeper side to prevent this?
> >>>>
> >>>> Another problem is that createEphemeralPathExpectConflictHandleZKBug
> >>>> call is wrapped in a while(true) loop which runs forever until the
> >>>> ephemeral node is created. Would it be better that we can employ an
> >>>> exponential retry policy with a max number of retries so that it has a
> >>>> chance to re-throw the exception back to caller and let caller handle
> >>>>it in
> >>>> situation like above?
> >>>>
> >>>>
>
>


-- 
-Regards,
Mayuresh R. Gharat
(862) 250-7125

Re: Got conflicted ephemeral node exception for several hours

Posted by Jaikiran Pai <ja...@gmail.com>.
I am on Kafka 0.8.2.1 (Java 8) and have happened to run into this same 
issue, where the KafkaServer (broker) goes into an indefinite while loop 
writing out this message:

[2015-08-04 15:45:12,350] INFO conflict in /brokers/ids/0 data: 
{"jmx_port":-1,"timestamp":"1438661432074","host":"foo-bar","version":1,"port":9092} 
stored data: 
{"jmx_port":-1,"timestamp":"1438661429589","host":"foo-bar","version":1,"port":9092} 
(kafka.utils.ZkUtils$)
[2015-08-04 15:45:12,352] INFO I wrote this conflicted ephemeral node 
[{"jmx_port":-1,"timestamp":"1438661432074","host":"foo-bar","version":1,"port":9092}] 
at /brokers/ids/0 a while back in a different session, hence I will 
backoff for this node to be deleted by Zookeeper and retry 
(kafka.utils.ZkUtils$)

The above two lines have been repeating every few seconds for the past 
20-odd hours on this broker. The broker has been rendered useless and is 
contributing to high CPU usage.

As a result the consumers have gone into a state where they no longer 
consume messages. Furthermore, this continuous looping has put the Kafka 
process at the top of CPU usage. I understand that bouncing the consumer 
is an option and will probably "fix" this, but in our real production 
environments we won't be able to bounce the consumers. I currently have 
access to logs (some of which have been pasted here). Is there any chance 
these logs help in narrowing down the issue and fixing the root cause? 
Can we also please add a retry max limit of some kind to this Zookeeper 
node creation logic instead of going into an indefinite while loop?

I have maintained the original timestamps in the logs to help narrow 
down the issue. The 1438661432074 (millisecond) value in the log 
translates to Aug 03 2015 21:10:32 (PDT) and 1438661429589 translates to 
Aug 03 2015 21:10:30 (PDT); a quick conversion sketch is below. After it I 
have included the relevant part of the log snippet from the server.log of 
the broker (10.95.100.31).
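
For anyone cross-checking those values, a small standalone snippet (plain 
JDK classes, nothing Kafka-specific; the object name is made up) that does 
the epoch-millisecond to PDT conversion:

import java.text.SimpleDateFormat
import java.util.{Date, TimeZone}

object EpochToPdt {
  def main(args: Array[String]): Unit = {
    val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss z")
    fmt.setTimeZone(TimeZone.getTimeZone("America/Los_Angeles"))
    // The two "timestamp" values from the conflicting broker registrations.
    Seq(1438661432074L, 1438661429589L).foreach { millis =>
      println(s"$millis -> ${fmt.format(new Date(millis))}")
    }
  }
}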


[2015-08-03 21:10:29,805] ERROR Closing socket for /10.95.100.31 because 
of error (kafka.network.Processor)
java.io.IOException: Broken pipe
     at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
     at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
     at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
     at sun.nio.ch.IOUtil.write(IOUtil.java:65)
     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
     at kafka.api.TopicDataSend.writeTo(FetchResponse.scala:123)
     at kafka.network.MultiSend.writeTo(Transmission.scala:101)
     at kafka.api.FetchResponseSend.writeTo(FetchResponse.scala:231)
     at kafka.network.Processor.write(SocketServer.scala:472)
     at kafka.network.Processor.run(SocketServer.scala:342)
     at java.lang.Thread.run(Thread.java:745)
[2015-08-03 21:10:29,938] ERROR Closing socket for /10.95.100.31 because 
of error (kafka.network.Processor)
java.io.IOException: Connection reset by peer
     at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
     at sun.nio.ch.IOUtil.read(IOUtil.java:197)
     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
     at kafka.utils.Utils$.read(Utils.scala:380)
     at 
kafka.network.BoundedByteBufferReceive.readFrom(BoundedByteBufferReceive.scala:54)
     at kafka.network.Processor.read(SocketServer.scala:444)
     at kafka.network.Processor.run(SocketServer.scala:340)
     at java.lang.Thread.run(Thread.java:745)
[2015-08-03 21:10:30,045] ERROR Closing socket for /10.95.100.31 because 
of error (kafka.network.Processor)
java.io.IOException: Broken pipe
     at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
     at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
     at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
     at sun.nio.ch.IOUtil.write(IOUtil.java:65)
     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
     at kafka.api.TopicDataSend.writeTo(FetchResponse.scala:123)
     at kafka.network.MultiSend.writeTo(Transmission.scala:101)
     at kafka.api.FetchResponseSend.writeTo(FetchResponse.scala:231)
     at kafka.network.Processor.write(SocketServer.scala:472)
     at kafka.network.Processor.run(SocketServer.scala:342)
     at java.lang.Thread.run(Thread.java:745)

<<< a lot more similar exceptions >>>


[2015-08-03 21:10:31,304] INFO Closing socket connection to 
/10.95.100.31. (kafka.network.Processor)
[2015-08-03 21:10:31,397] INFO Closing socket connection to 
/10.95.100.31. (kafka.network.Processor)
[2015-08-03 21:10:31,399] INFO Closing socket connection to 
/10.95.100.31. (kafka.network.Processor)
[2015-08-03 21:10:31,445] INFO Closing socket connection to 
/10.95.100.31. (kafka.network.Processor)
....
<<<< bunch of similar logs as above >>>>

[2015-08-03 21:10:31,784] INFO [ReplicaFetcherManager on broker 0] 
Removed fetcher for partitions [<partition list>] 
(kafka.server.ReplicaFetcherManager)
[2015-08-03 21:10:31,860] INFO Closing socket connection to 
/10.95.100.31. (kafka.network.Processor)
[2015-08-03 21:10:31,861] INFO 0 successfully elected as leader 
(kafka.server.ZookeeperLeaderElector)
[2015-08-03 21:10:32,072] INFO Closing socket connection to 
/10.95.100.31. (kafka.network.Processor)
[2015-08-03 21:10:32,074] INFO re-registering broker info in ZK for 
broker 0 (kafka.server.KafkaHealthcheck)
[2015-08-03 21:10:32,103] INFO conflict in /brokers/ids/0 data: 
{"jmx_port":-1,"timestamp":"1438661432074","host":"foo-bar","version":1,"port":9092} 
stored data: 
{"jmx_port":-1,"timestamp":"1438661429589","host":"foo-bar","version":1,"port":9092} 
(kafka.utils.ZkUtils$)
[2015-08-03 21:10:32,104] INFO I wrote this conflicted ephemeral node 
[{"jmx_port":-1,"timestamp":"1438661432074","host":"foo-bar","version":1,"port":9092}] 
at /brokers/ids/0 a while back in a different session, hence I will 
backoff for this node to be deleted by Zookeeper and retry 
(kafka.utils.ZkUtils$)
[2015-08-03 21:10:38,107] INFO conflict in /brokers/ids/0 data: 
{"jmx_port":-1,"timestamp":"1438661432074","host":"foo-bar","version":1,"port":9092} 
stored data: 
{"jmx_port":-1,"timestamp":"1438661429589","host":"foo-bar","version":1,"port":9092} 
(kafka.utils.ZkUtils$)
[2015-08-03 21:10:38,109] INFO I wrote this conflicted ephemeral node 
[{"jmx_port":-1,"timestamp":"1438661432074","host":"foo-bar","version":1,"port":9092}] 
at /brokers/ids/0 a while back in a different session, hence I will 
backoff for this node to be deleted by Zookeeper and retry 
(kafka.utils.ZkUtils$)
[2015-08-03 21:10:44,118] INFO conflict in /brokers/ids/0 data: 
{"jmx_port":-1,"timestamp":"1438661432074","host":"foo-bar","version":1,"port":9092} 
stored data: 
{"jmx_port":-1,"timestamp":"1438661429589","host":"foo-bar","version":1,"port":9092} 
(kafka.utils.ZkUtils$)
[2015-08-03 21:10:44,120] INFO I wrote this conflicted ephemeral node 
[{"jmx_port":-1,"timestamp":"1438661432074","host":"foo-bar","version":1,"port":9092}] 
at /brokers/ids/0 a while back in a different session, hence I will 
backoff for this node to be deleted by Zookeeper and retry 
(kafka.utils.ZkUtils$)
[2015-08-03 21:10:50,123] INFO conflict in /brokers/ids/0 data: 
{"jmx_port":-1,"timestamp":"1438661432074","host":"foo-bar","version":1,"port":9092} 
stored data: 
{"jmx_port":-1,"timestamp":"1438661429589","host":"foo-bar","version":1,"port":9092} 
(kafka.utils.ZkUtils$)

<<< these above couple of lines continue for around 20 odd hours now >>>

I could probably get more logs if needed, assuming the VM isn't shut down 
by then.

-Jaikiran


On Monday 13 July 2015 10:34 AM, tao xiao wrote:
> Finding out root cause is definitely something we should do and I hope the
> new Java consumer API will not have such issues anymore. Here are my
> observation of the issue: the zk retry was caused by session timeout. In
> theory the ephemeral node should be removed after the session is terminated
> by zk but in fact the node still presented after the session was gone. I
> saw lots of slow fsync warning in zk log like below but I am not sure if
> the warning has anything to do with the error I encountered.
>
> 2015-07-12 22:01:21,195 [myid:2] - WARN  [SyncThread:2:FileTxnLog@334] -
> fsync-ing the write ahead log in SyncThread:2 took 1023ms which will
> adversely effect operation latency. See the ZooKeeper troubleshooting guide
>
>
> I know in Linkedin ZK is installed on SSD does it help with this?
>
> On Mon, 13 Jul 2015 at 12:50 Mayuresh Gharat <gh...@gmail.com>
> wrote:
>
>> That would solve this. But it looks like a work around. We need to check
>> why this happens exactly and get to the root cause. What do you think?
>> Getting to the root cause of this might be really useful.
>>
>> Thanks,
>>
>> Mayuresh
>>
>> On Sun, Jul 12, 2015 at 8:45 PM, tao xiao <xi...@gmail.com> wrote:
>>
>>> Restart the consumers does fix the issue. But since the zk retry is
>> wrapped
>>> in an infinite loop it doesn't give a chance to consumer to respond it
>>> until some one notices and restarts. Why I suggest to have a maximum
>> retry
>>> policy is if max retry is reached it can invoke a customer handler which
>> I
>>> can then inject a restart call so that it can remedy itself without
>>> people's attention.
>>>
>>> On Mon, 13 Jul 2015 at 11:36 Jiangjie Qin <jq...@linkedin.com.invalid>
>>> wrote:
>>>
>>>> Hi Tao,
>>>>
>>>> We see this error from time to time but did not think of this as a big
>>>> issue. Any reason it bothers you much?
> >>>> I'm not sure if throwing exception to user on this exception is a good
>>>> handling or not. What are user supposed to do in that case other than
>>>> retry?
>>>>
>>>> Thanks,
>>>>
>>>> Jiangjie (Becket) Qin
>>>>
>>>> On 7/12/15, 7:16 PM, "tao xiao" <xi...@gmail.com> wrote:
>>>>
>>>>> We saw the error again in our cluster.  Anyone has the same issue
>>> before?
>>>>> On Fri, 10 Jul 2015 at 13:26 tao xiao <xi...@gmail.com> wrote:
>>>>>
>>>>>> Bump the thread. Any help would be appreciated.
>>>>>>
>>>>>> On Wed, 8 Jul 2015 at 20:09 tao xiao <xi...@gmail.com> wrote:
>>>>>>
>>>>>>> Additional info
>>>>>>> Kafka version: 0.8.2.1
>>>>>>> zookeeper: 3.4.6
>>>>>>>
>>>>>>> On Wed, 8 Jul 2015 at 20:07 tao xiao <xi...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi team,
>>>>>>>>
>>>>>>>> I have 10 high level consumers connecting to Kafka and one of them
>>>>>>>> kept
>>>>>>>> complaining "conflicted ephemeral node" for about 8 hours. The log
>>> was
>>>>>>>> filled with below exception
>>>>>>>>
>>>>>>>> [2015-07-07 14:03:51,615] INFO conflict in
>>>>>>>> /consumers/group/ids/test-1435856975563-9a9fdc6c data:
>>>>>>>>
>>>>>> {"version":1,"subscription":{"test.*":1},"pattern":"white_list","timest
>>>>>>>> amp":"1436275631510"}
>>>>>>>> stored data:
>>>>>>>>
>>>>>> {"version":1,"subscription":{"test.*":1},"pattern":"white_list","timest
>>>>>>>> amp":"1436275558570"}
>>>>>>>> (kafka.utils.ZkUtils$)
>>>>>>>> [2015-07-07 14:03:51,616] INFO I wrote this conflicted ephemeral
>>> node
>>>>>> [{"version":1,"subscription":{"test.*":1},"pattern":"white_list","times
>>>>>>>> tamp":"1436275631510"}]
>>>>>>>> at /consumers/group/ids/test-1435856975563-9a9fdc6c a while back
>> in
>>> a
>>>>>>>> different session, hence I will backoff for this node to be
>> deleted
>>> by
>>>>>>>> Zookeeper and retry (kafka.utils.ZkUtils$)
>>>>>>>>
>>>>>>>> In the meantime zookeeper reported below exception for the same
>> time
>>>>>>>> span
>>>>>>>>
>>>>>>>> 2015-07-07 22:45:09,687 [myid:3] - INFO  [ProcessThread(sid:3
>>>>>>>> cport:-1)::PrepRequestProcessor@645] - Got user-level
>>> KeeperException
>>>>>>>> when processing sessionid:0x44e657ff19c0019 type:create
>> cxid:0x7a26
>>>>>>>> zxid:0x3015f6e77 txntype:-1 reqpath:n/a Error
>>>>>>>> Path:/consumers/group/ids/test-1435856975563-9a9fdc6c
>>>>>>>> Error:KeeperErrorCode
>>>>>>>> = NodeExists for /consumers/group/ids/test-1435856975563-9a9fdc6c
>>>>>>>>
>>>>>>>> At the end zookeeper timed out the session and consumers triggered
>>>>>>>> rebalance.
>>>>>>>>
>>>>>>>> I know that conflicted ephemeral node warning is to handle a
>>> zookeeper
>>>>>>>> bug that session expiration and ephemeral node deletion are not
>> done
>>>>>>>> atomically but as indicated from zookeeper log the zookeeper never
>>>>>>>> got a
>>>>>>>> chance to delete the ephemeral node which made me think that the
>>>>>>>> session
>>>>>>>> was not expired at that time. And for some reason zookeeper fired
>>>>>>>> session
>>>>>>>> expire event which subsequently invoked ZKSessionExpireListener.
>> I
>>>>>>>> was
>>>>>>>> just wondering if anyone have ever encountered similar issue
>> before
>>>>>>>> and
>>>>>>>> what I can do at zookeeper side to prevent this?
>>>>>>>>
>>>>>>>> Another problem is that
>> createEphemeralPathExpectConflictHandleZKBug
>>>>>>>> call is wrapped in a while(true) loop which runs forever until the
>>>>>>>> ephemeral node is created. Would it be better that we can employ
>> an
>>>>>>>> exponential retry policy with a max number of retries so that it
>>> has a
>>>>>>>> chance to re-throw the exception back to caller and let caller
>>> handle
>>>>>>>> it in
>>>>>>>> situation like above?
>>>>>>>>
>>>>>>>>
>>>>
>>
>>
>> --
>> -Regards,
>> Mayuresh R. Gharat
>> (862) 250-7125
>>


Re: Got conflicted ephemeral node exception for several hours

Posted by tao xiao <xi...@gmail.com>.
Finding out the root cause is definitely something we should do, and I hope
the new Java consumer API will not have such issues anymore. Here is my
observation of the issue: the zk retry was caused by a session timeout. In
theory the ephemeral node should be removed after the session is terminated
by zk, but in fact the node was still present after the session was gone. I
saw lots of slow fsync warnings in the zk log like the one below, but I am
not sure if the warning has anything to do with the error I encountered.

2015-07-12 22:01:21,195 [myid:2] - WARN  [SyncThread:2:FileTxnLog@334] -
fsync-ing the write ahead log in SyncThread:2 took 1023ms which will
adversely effect operation latency. See the ZooKeeper troubleshooting guide


I know that at LinkedIn ZK is installed on SSDs; does that help with this?

On Mon, 13 Jul 2015 at 12:50 Mayuresh Gharat <gh...@gmail.com>
wrote:

> That would solve this. But it looks like a work around. We need to check
> why this happens exactly and get to the root cause. What do you think?
> Getting to the root cause of this might be really useful.
>
> Thanks,
>
> Mayuresh
>
> On Sun, Jul 12, 2015 at 8:45 PM, tao xiao <xi...@gmail.com> wrote:
>
> > Restart the consumers does fix the issue. But since the zk retry is
> wrapped
> > in an infinite loop it doesn't give a chance to consumer to respond it
> > until some one notices and restarts. Why I suggest to have a maximum
> retry
> > policy is if max retry is reached it can invoke a customer handler which
> I
> > can then inject a restart call so that it can remedy itself without
> > people's attention.
> >
> > On Mon, 13 Jul 2015 at 11:36 Jiangjie Qin <jq...@linkedin.com.invalid>
> > wrote:
> >
> > > Hi Tao,
> > >
> > > We see this error from time to time but did not think of this as a big
> > > issue. Any reason it bothers you much?
> > > I'm not sure if throwing exception to user on this exception is a good
> > > handling or not. What are user supposed to do in that case other than
> > > retry?
> > >
> > > Thanks,
> > >
> > > Jiangjie (Becket) Qin
> > >
> > > On 7/12/15, 7:16 PM, "tao xiao" <xi...@gmail.com> wrote:
> > >
> > > >We saw the error again in our cluster.  Anyone has the same issue
> > before?
> > > >
> > > >On Fri, 10 Jul 2015 at 13:26 tao xiao <xi...@gmail.com> wrote:
> > > >
> > > >> Bump the thread. Any help would be appreciated.
> > > >>
> > > >> On Wed, 8 Jul 2015 at 20:09 tao xiao <xi...@gmail.com> wrote:
> > > >>
> > > >>> Additional info
> > > >>> Kafka version: 0.8.2.1
> > > >>> zookeeper: 3.4.6
> > > >>>
> > > >>> On Wed, 8 Jul 2015 at 20:07 tao xiao <xi...@gmail.com> wrote:
> > > >>>
> > > >>>> Hi team,
> > > >>>>
> > > >>>> I have 10 high level consumers connecting to Kafka and one of them
> > > >>>>kept
> > > >>>> complaining "conflicted ephemeral node" for about 8 hours. The log
> > was
> > > >>>> filled with below exception
> > > >>>>
> > > >>>> [2015-07-07 14:03:51,615] INFO conflict in
> > > >>>> /consumers/group/ids/test-1435856975563-9a9fdc6c data:
> > > >>>>
> > >
> >
> >>>>{"version":1,"subscription":{"test.*":1},"pattern":"white_list","timest
> > > >>>>amp":"1436275631510"}
> > > >>>> stored data:
> > > >>>>
> > >
> >
> >>>>{"version":1,"subscription":{"test.*":1},"pattern":"white_list","timest
> > > >>>>amp":"1436275558570"}
> > > >>>> (kafka.utils.ZkUtils$)
> > > >>>> [2015-07-07 14:03:51,616] INFO I wrote this conflicted ephemeral
> > node
> > > >>>>
> > >
> >
> >>>>[{"version":1,"subscription":{"test.*":1},"pattern":"white_list","times
> > > >>>>tamp":"1436275631510"}]
> > > >>>> at /consumers/group/ids/test-1435856975563-9a9fdc6c a while back
> in
> > a
> > > >>>> different session, hence I will backoff for this node to be
> deleted
> > by
> > > >>>> Zookeeper and retry (kafka.utils.ZkUtils$)
> > > >>>>
> > > >>>> In the meantime zookeeper reported below exception for the same
> time
> > > >>>>span
> > > >>>>
> > > >>>> 2015-07-07 22:45:09,687 [myid:3] - INFO  [ProcessThread(sid:3
> > > >>>> cport:-1)::PrepRequestProcessor@645] - Got user-level
> > KeeperException
> > > >>>> when processing sessionid:0x44e657ff19c0019 type:create
> cxid:0x7a26
> > > >>>> zxid:0x3015f6e77 txntype:-1 reqpath:n/a Error
> > > >>>> Path:/consumers/group/ids/test-1435856975563-9a9fdc6c
> > > >>>>Error:KeeperErrorCode
> > > >>>> = NodeExists for /consumers/group/ids/test-1435856975563-9a9fdc6c
> > > >>>>
> > > >>>> At the end zookeeper timed out the session and consumers triggered
> > > >>>> rebalance.
> > > >>>>
> > > >>>> I know that conflicted ephemeral node warning is to handle a
> > zookeeper
> > > >>>> bug that session expiration and ephemeral node deletion are not
> done
> > > >>>> atomically but as indicated from zookeeper log the zookeeper never
> > > >>>>got a
> > > >>>> chance to delete the ephemeral node which made me think that the
> > > >>>>session
> > > >>>> was not expired at that time. And for some reason zookeeper fired
> > > >>>>session
> > > >>>> expire event which subsequently invoked ZKSessionExpireListener.
> I
> > > >>>>was
> > > >>>> just wondering if anyone have ever encountered similar issue
> before
> > > >>>>and
> > > >>>> what I can do at zookeeper side to prevent this?
> > > >>>>
> > > >>>> Another problem is that
> createEphemeralPathExpectConflictHandleZKBug
> > > >>>> call is wrapped in a while(true) loop which runs forever until the
> > > >>>> ephemeral node is created. Would it be better that we can employ
> an
> > > >>>> exponential retry policy with a max number of retries so that it
> > has a
> > > >>>> chance to re-throw the exception back to caller and let caller
> > handle
> > > >>>>it in
> > > >>>> situation like above?
> > > >>>>
> > > >>>>
> > >
> > >
> >
>
>
>
> --
> -Regards,
> Mayuresh R. Gharat
> (862) 250-7125
>

Re: Got conflicted ephemeral node exception for several hours

Posted by Mayuresh Gharat <gh...@gmail.com>.
That would solve this. But it looks like a workaround. We need to check
why this happens exactly and get to the root cause. What do you think?
Getting to the root cause of this might be really useful.

Thanks,

Mayuresh

On Sun, Jul 12, 2015 at 8:45 PM, tao xiao <xi...@gmail.com> wrote:

> Restart the consumers does fix the issue. But since the zk retry is wrapped
> in an infinite loop it doesn't give a chance to consumer to respond it
> until some one notices and restarts. Why I suggest to have a maximum retry
> policy is if max retry is reached it can invoke a customer handler which I
> can then inject a restart call so that it can remedy itself without
> people's attention.
>
> On Mon, 13 Jul 2015 at 11:36 Jiangjie Qin <jq...@linkedin.com.invalid>
> wrote:
>
> > Hi Tao,
> >
> > We see this error from time to time but did not think of this as a big
> > issue. Any reason it bothers you much?
> > I'm not sure if throwing exception to user on this exception is a good
> > handling or not. What are user supposed to do in that case other than
> > retry?
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On 7/12/15, 7:16 PM, "tao xiao" <xi...@gmail.com> wrote:
> >
> > >We saw the error again in our cluster.  Anyone has the same issue
> before?
> > >
> > >On Fri, 10 Jul 2015 at 13:26 tao xiao <xi...@gmail.com> wrote:
> > >
> > >> Bump the thread. Any help would be appreciated.
> > >>
> > >> On Wed, 8 Jul 2015 at 20:09 tao xiao <xi...@gmail.com> wrote:
> > >>
> > >>> Additional info
> > >>> Kafka version: 0.8.2.1
> > >>> zookeeper: 3.4.6
> > >>>
> > >>> On Wed, 8 Jul 2015 at 20:07 tao xiao <xi...@gmail.com> wrote:
> > >>>
> > >>>> Hi team,
> > >>>>
> > >>>> I have 10 high level consumers connecting to Kafka and one of them
> > >>>>kept
> > >>>> complaining "conflicted ephemeral node" for about 8 hours. The log
> was
> > >>>> filled with below exception
> > >>>>
> > >>>> [2015-07-07 14:03:51,615] INFO conflict in
> > >>>> /consumers/group/ids/test-1435856975563-9a9fdc6c data:
> > >>>>
> >
> >>>>{"version":1,"subscription":{"test.*":1},"pattern":"white_list","timest
> > >>>>amp":"1436275631510"}
> > >>>> stored data:
> > >>>>
> >
> >>>>{"version":1,"subscription":{"test.*":1},"pattern":"white_list","timest
> > >>>>amp":"1436275558570"}
> > >>>> (kafka.utils.ZkUtils$)
> > >>>> [2015-07-07 14:03:51,616] INFO I wrote this conflicted ephemeral
> node
> > >>>>
> >
> >>>>[{"version":1,"subscription":{"test.*":1},"pattern":"white_list","times
> > >>>>tamp":"1436275631510"}]
> > >>>> at /consumers/group/ids/test-1435856975563-9a9fdc6c a while back in
> a
> > >>>> different session, hence I will backoff for this node to be deleted
> by
> > >>>> Zookeeper and retry (kafka.utils.ZkUtils$)
> > >>>>
> > >>>> In the meantime zookeeper reported below exception for the same time
> > >>>>span
> > >>>>
> > >>>> 2015-07-07 22:45:09,687 [myid:3] - INFO  [ProcessThread(sid:3
> > >>>> cport:-1)::PrepRequestProcessor@645] - Got user-level
> KeeperException
> > >>>> when processing sessionid:0x44e657ff19c0019 type:create cxid:0x7a26
> > >>>> zxid:0x3015f6e77 txntype:-1 reqpath:n/a Error
> > >>>> Path:/consumers/group/ids/test-1435856975563-9a9fdc6c
> > >>>>Error:KeeperErrorCode
> > >>>> = NodeExists for /consumers/group/ids/test-1435856975563-9a9fdc6c
> > >>>>
> > >>>> At the end zookeeper timed out the session and consumers triggered
> > >>>> rebalance.
> > >>>>
> > >>>> I know that conflicted ephemeral node warning is to handle a
> zookeeper
> > >>>> bug that session expiration and ephemeral node deletion are not done
> > >>>> atomically but as indicated from zookeeper log the zookeeper never
> > >>>>got a
> > >>>> chance to delete the ephemeral node which made me think that the
> > >>>>session
> > >>>> was not expired at that time. And for some reason zookeeper fired
> > >>>>session
> > >>>> expire event which subsequently invoked ZKSessionExpireListener.  I
> > >>>>was
> > >>>> just wondering if anyone have ever encountered similar issue before
> > >>>>and
> > >>>> what I can do at zookeeper side to prevent this?
> > >>>>
> > >>>> Another problem is that createEphemeralPathExpectConflictHandleZKBug
> > >>>> call is wrapped in a while(true) loop which runs forever until the
> > >>>> ephemeral node is created. Would it be better that we can employ an
> > >>>> exponential retry policy with a max number of retries so that it
> has a
> > >>>> chance to re-throw the exception back to caller and let caller
> handle
> > >>>>it in
> > >>>> situation like above?
> > >>>>
> > >>>>
> >
> >
>



-- 
-Regards,
Mayuresh R. Gharat
(862) 250-7125

Re: Got conflicted ephemeral node exception for several hours

Posted by tao xiao <xi...@gmail.com>.
Restarting the consumers does fix the issue. But since the zk retry is
wrapped in an infinite loop, it doesn't give the consumer a chance to respond
until someone notices and restarts it. The reason I suggest a maximum retry
policy is that once the max retry count is reached it can invoke a custom
handler, into which I can inject a restart call so that the consumer can
remedy itself without anyone's attention.
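
Roughly, the hook could look like this (the names are made up; this is not
an existing Kafka interface, just an illustration of the idea):

// Hypothetical callback invoked once the bounded retry gives up.
trait RegistrationFailureHandler {
  def onRetriesExhausted(path: String, cause: Throwable): Unit
}

// Example handler that bounces the consumer programmatically instead of
// waiting for a human to notice the stuck registration loop.
class RestartingHandler(restartConsumer: () => Unit)
    extends RegistrationFailureHandler {
  override def onRetriesExhausted(path: String, cause: Throwable): Unit = {
    System.err.println(s"Giving up on $path after retries: ${cause.getMessage}")
    restartConsumer()
  }
}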

On Mon, 13 Jul 2015 at 11:36 Jiangjie Qin <jq...@linkedin.com.invalid> wrote:

> Hi Tao,
>
> We see this error from time to time but did not think of this as a big
> issue. Any reason it bothers you much?
> I'm not sure if throwing exception to user on this exception is a good
> handling or not. What are user supposed to do in that case other than
> retry?
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
> On 7/12/15, 7:16 PM, "tao xiao" <xi...@gmail.com> wrote:
>
> >We saw the error again in our cluster.  Anyone has the same issue before?
> >
> >On Fri, 10 Jul 2015 at 13:26 tao xiao <xi...@gmail.com> wrote:
> >
> >> Bump the thread. Any help would be appreciated.
> >>
> >> On Wed, 8 Jul 2015 at 20:09 tao xiao <xi...@gmail.com> wrote:
> >>
> >>> Additional info
> >>> Kafka version: 0.8.2.1
> >>> zookeeper: 3.4.6
> >>>
> >>> On Wed, 8 Jul 2015 at 20:07 tao xiao <xi...@gmail.com> wrote:
> >>>
> >>>> Hi team,
> >>>>
> >>>> I have 10 high level consumers connecting to Kafka and one of them
> >>>>kept
> >>>> complaining "conflicted ephemeral node" for about 8 hours. The log was
> >>>> filled with below exception
> >>>>
> >>>> [2015-07-07 14:03:51,615] INFO conflict in
> >>>> /consumers/group/ids/test-1435856975563-9a9fdc6c data:
> >>>>
> >>>>{"version":1,"subscription":{"test.*":1},"pattern":"white_list","timest
> >>>>amp":"1436275631510"}
> >>>> stored data:
> >>>>
> >>>>{"version":1,"subscription":{"test.*":1},"pattern":"white_list","timest
> >>>>amp":"1436275558570"}
> >>>> (kafka.utils.ZkUtils$)
> >>>> [2015-07-07 14:03:51,616] INFO I wrote this conflicted ephemeral node
> >>>>
> >>>>[{"version":1,"subscription":{"test.*":1},"pattern":"white_list","times
> >>>>tamp":"1436275631510"}]
> >>>> at /consumers/group/ids/test-1435856975563-9a9fdc6c a while back in a
> >>>> different session, hence I will backoff for this node to be deleted by
> >>>> Zookeeper and retry (kafka.utils.ZkUtils$)
> >>>>
> >>>> In the meantime zookeeper reported below exception for the same time
> >>>>span
> >>>>
> >>>> 2015-07-07 22:45:09,687 [myid:3] - INFO  [ProcessThread(sid:3
> >>>> cport:-1)::PrepRequestProcessor@645] - Got user-level KeeperException
> >>>> when processing sessionid:0x44e657ff19c0019 type:create cxid:0x7a26
> >>>> zxid:0x3015f6e77 txntype:-1 reqpath:n/a Error
> >>>> Path:/consumers/group/ids/test-1435856975563-9a9fdc6c
> >>>>Error:KeeperErrorCode
> >>>> = NodeExists for /consumers/group/ids/test-1435856975563-9a9fdc6c
> >>>>
> >>>> At the end zookeeper timed out the session and consumers triggered
> >>>> rebalance.
> >>>>
> >>>> I know that conflicted ephemeral node warning is to handle a zookeeper
> >>>> bug that session expiration and ephemeral node deletion are not done
> >>>> atomically but as indicated from zookeeper log the zookeeper never
> >>>>got a
> >>>> chance to delete the ephemeral node which made me think that the
> >>>>session
> >>>> was not expired at that time. And for some reason zookeeper fired
> >>>>session
> >>>> expire event which subsequently invoked ZKSessionExpireListener.  I
> >>>>was
> >>>> just wondering if anyone have ever encountered similar issue before
> >>>>and
> >>>> what I can do at zookeeper side to prevent this?
> >>>>
> >>>> Another problem is that createEphemeralPathExpectConflictHandleZKBug
> >>>> call is wrapped in a while(true) loop which runs forever until the
> >>>> ephemeral node is created. Would it be better that we can employ an
> >>>> exponential retry policy with a max number of retries so that it has a
> >>>> chance to re-throw the exception back to caller and let caller handle
> >>>>it in
> >>>> situation like above?
> >>>>
> >>>>
>
>

Re: Got conflicted ephemeral node exception for several hours

Posted by Jiangjie Qin <jq...@linkedin.com.INVALID>.
Hi Tao,

We see this error from time to time but did not think of it as a big
issue. Any reason it bothers you so much?
I'm not sure whether throwing the exception to the user in this case is good
handling or not. What is the user supposed to do in that case other than
retry?

Thanks,

Jiangjie (Becket) Qin

On 7/12/15, 7:16 PM, "tao xiao" <xi...@gmail.com> wrote:

>We saw the error again in our cluster.  Anyone has the same issue before?
>
>On Fri, 10 Jul 2015 at 13:26 tao xiao <xi...@gmail.com> wrote:
>
>> Bump the thread. Any help would be appreciated.
>>
>> On Wed, 8 Jul 2015 at 20:09 tao xiao <xi...@gmail.com> wrote:
>>
>>> Additional info
>>> Kafka version: 0.8.2.1
>>> zookeeper: 3.4.6
>>>
>>> On Wed, 8 Jul 2015 at 20:07 tao xiao <xi...@gmail.com> wrote:
>>>
>>>> Hi team,
>>>>
>>>> I have 10 high level consumers connecting to Kafka and one of them
>>>>kept
>>>> complaining "conflicted ephemeral node" for about 8 hours. The log was
>>>> filled with below exception
>>>>
>>>> [2015-07-07 14:03:51,615] INFO conflict in
>>>> /consumers/group/ids/test-1435856975563-9a9fdc6c data:
>>>> 
>>>>{"version":1,"subscription":{"test.*":1},"pattern":"white_list","timest
>>>>amp":"1436275631510"}
>>>> stored data:
>>>> 
>>>>{"version":1,"subscription":{"test.*":1},"pattern":"white_list","timest
>>>>amp":"1436275558570"}
>>>> (kafka.utils.ZkUtils$)
>>>> [2015-07-07 14:03:51,616] INFO I wrote this conflicted ephemeral node
>>>> 
>>>>[{"version":1,"subscription":{"test.*":1},"pattern":"white_list","times
>>>>tamp":"1436275631510"}]
>>>> at /consumers/group/ids/test-1435856975563-9a9fdc6c a while back in a
>>>> different session, hence I will backoff for this node to be deleted by
>>>> Zookeeper and retry (kafka.utils.ZkUtils$)
>>>>
>>>> In the meantime zookeeper reported below exception for the same time
>>>>span
>>>>
>>>> 2015-07-07 22:45:09,687 [myid:3] - INFO  [ProcessThread(sid:3
>>>> cport:-1)::PrepRequestProcessor@645] - Got user-level KeeperException
>>>> when processing sessionid:0x44e657ff19c0019 type:create cxid:0x7a26
>>>> zxid:0x3015f6e77 txntype:-1 reqpath:n/a Error
>>>> Path:/consumers/group/ids/test-1435856975563-9a9fdc6c
>>>>Error:KeeperErrorCode
>>>> = NodeExists for /consumers/group/ids/test-1435856975563-9a9fdc6c
>>>>
>>>> At the end zookeeper timed out the session and consumers triggered
>>>> rebalance.
>>>>
>>>> I know that conflicted ephemeral node warning is to handle a zookeeper
>>>> bug that session expiration and ephemeral node deletion are not done
>>>> atomically but as indicated from zookeeper log the zookeeper never
>>>>got a
>>>> chance to delete the ephemeral node which made me think that the
>>>>session
>>>> was not expired at that time. And for some reason zookeeper fired
>>>>session
>>>> expire event which subsequently invoked ZKSessionExpireListener.  I
>>>>was
>>>> just wondering if anyone have ever encountered similar issue before
>>>>and
>>>> what I can do at zookeeper side to prevent this?
>>>>
>>>> Another problem is that createEphemeralPathExpectConflictHandleZKBug
>>>> call is wrapped in a while(true) loop which runs forever until the
>>>> ephemeral node is created. Would it be better that we can employ an
>>>> exponential retry policy with a max number of retries so that it has a
>>>> chance to re-throw the exception back to caller and let caller handle
>>>>it in
>>>> situation like above?
>>>>
>>>>


Re: Got conflicted ephemeral node exception for several hours

Posted by tao xiao <xi...@gmail.com>.
We saw the error again in our cluster. Has anyone seen the same issue before?

On Fri, 10 Jul 2015 at 13:26 tao xiao <xi...@gmail.com> wrote:

> Bump the thread. Any help would be appreciated.
>
> On Wed, 8 Jul 2015 at 20:09 tao xiao <xi...@gmail.com> wrote:
>
>> Additional info
>> Kafka version: 0.8.2.1
>> zookeeper: 3.4.6
>>
>> On Wed, 8 Jul 2015 at 20:07 tao xiao <xi...@gmail.com> wrote:
>>
>>> Hi team,
>>>
>>> I have 10 high level consumers connecting to Kafka and one of them kept
>>> complaining "conflicted ephemeral node" for about 8 hours. The log was
>>> filled with below exception
>>>
>>> [2015-07-07 14:03:51,615] INFO conflict in
>>> /consumers/group/ids/test-1435856975563-9a9fdc6c data:
>>> {"version":1,"subscription":{"test.*":1},"pattern":"white_list","timestamp":"1436275631510"}
>>> stored data:
>>> {"version":1,"subscription":{"test.*":1},"pattern":"white_list","timestamp":"1436275558570"}
>>> (kafka.utils.ZkUtils$)
>>> [2015-07-07 14:03:51,616] INFO I wrote this conflicted ephemeral node
>>> [{"version":1,"subscription":{"test.*":1},"pattern":"white_list","timestamp":"1436275631510"}]
>>> at /consumers/group/ids/test-1435856975563-9a9fdc6c a while back in a
>>> different session, hence I will backoff for this node to be deleted by
>>> Zookeeper and retry (kafka.utils.ZkUtils$)
>>>
>>> In the meantime zookeeper reported below exception for the same time span
>>>
>>> 2015-07-07 22:45:09,687 [myid:3] - INFO  [ProcessThread(sid:3
>>> cport:-1)::PrepRequestProcessor@645] - Got user-level KeeperException
>>> when processing sessionid:0x44e657ff19c0019 type:create cxid:0x7a26
>>> zxid:0x3015f6e77 txntype:-1 reqpath:n/a Error
>>> Path:/consumers/group/ids/test-1435856975563-9a9fdc6c Error:KeeperErrorCode
>>> = NodeExists for /consumers/group/ids/test-1435856975563-9a9fdc6c
>>>
>>> At the end zookeeper timed out the session and consumers triggered
>>> rebalance.
>>>
>>> I know that conflicted ephemeral node warning is to handle a zookeeper
>>> bug that session expiration and ephemeral node deletion are not done
>>> atomically but as indicated from zookeeper log the zookeeper never got a
>>> chance to delete the ephemeral node which made me think that the session
>>> was not expired at that time. And for some reason zookeeper fired session
>>> expire event which subsequently invoked ZKSessionExpireListener.  I was
>>> just wondering if anyone have ever encountered similar issue before and
>>> what I can do at zookeeper side to prevent this?
>>>
>>> Another problem is that createEphemeralPathExpectConflictHandleZKBug
>>> call is wrapped in a while(true) loop which runs forever until the
>>> ephemeral node is created. Would it be better that we can employ an
>>> exponential retry policy with a max number of retries so that it has a
>>> chance to re-throw the exception back to caller and let caller handle it in
>>> situation like above?
>>>
>>>

Re: Got conflicted ephemeral node exception for several hours

Posted by tao xiao <xi...@gmail.com>.
Bump the thread. Any help would be appreciated.

On Wed, 8 Jul 2015 at 20:09 tao xiao <xi...@gmail.com> wrote:

> Additional info
> Kafka version: 0.8.2.1
> zookeeper: 3.4.6
>
> On Wed, 8 Jul 2015 at 20:07 tao xiao <xi...@gmail.com> wrote:
>
>> Hi team,
>>
>> I have 10 high level consumers connecting to Kafka and one of them kept
>> complaining "conflicted ephemeral node" for about 8 hours. The log was
>> filled with below exception
>>
>> [2015-07-07 14:03:51,615] INFO conflict in
>> /consumers/group/ids/test-1435856975563-9a9fdc6c data:
>> {"version":1,"subscription":{"test.*":1},"pattern":"white_list","timestamp":"1436275631510"}
>> stored data:
>> {"version":1,"subscription":{"test.*":1},"pattern":"white_list","timestamp":"1436275558570"}
>> (kafka.utils.ZkUtils$)
>> [2015-07-07 14:03:51,616] INFO I wrote this conflicted ephemeral node
>> [{"version":1,"subscription":{"test.*":1},"pattern":"white_list","timestamp":"1436275631510"}]
>> at /consumers/group/ids/test-1435856975563-9a9fdc6c a while back in a
>> different session, hence I will backoff for this node to be deleted by
>> Zookeeper and retry (kafka.utils.ZkUtils$)
>>
>> In the meantime zookeeper reported below exception for the same time span
>>
>> 2015-07-07 22:45:09,687 [myid:3] - INFO  [ProcessThread(sid:3
>> cport:-1)::PrepRequestProcessor@645] - Got user-level KeeperException
>> when processing sessionid:0x44e657ff19c0019 type:create cxid:0x7a26
>> zxid:0x3015f6e77 txntype:-1 reqpath:n/a Error
>> Path:/consumers/group/ids/test-1435856975563-9a9fdc6c Error:KeeperErrorCode
>> = NodeExists for /consumers/group/ids/test-1435856975563-9a9fdc6c
>>
>> At the end zookeeper timed out the session and consumers triggered
>> rebalance.
>>
>> I know that conflicted ephemeral node warning is to handle a zookeeper
>> bug that session expiration and ephemeral node deletion are not done
>> atomically but as indicated from zookeeper log the zookeeper never got a
>> chance to delete the ephemeral node which made me think that the session
>> was not expired at that time. And for some reason zookeeper fired session
>> expire event which subsequently invoked ZKSessionExpireListener.  I was
>> just wondering if anyone have ever encountered similar issue before and
>> what I can do at zookeeper side to prevent this?
>>
>> Another problem is that createEphemeralPathExpectConflictHandleZKBug call
>> is wrapped in a while(true) loop which runs forever until the ephemeral
>> node is created. Would it be better that we can employ an exponential retry
>> policy with a max number of retries so that it has a chance to re-throw the
>> exception back to caller and let caller handle it in situation like above?
>>
>>

Re: Got conflicted ephemeral node exception for several hours

Posted by tao xiao <xi...@gmail.com>.
Additional info
Kafka version: 0.8.2.1
zookeeper: 3.4.6

On Wed, 8 Jul 2015 at 20:07 tao xiao <xi...@gmail.com> wrote:

> Hi team,
>
> I have 10 high level consumers connecting to Kafka and one of them kept
> complaining "conflicted ephemeral node" for about 8 hours. The log was
> filled with below exception
>
> [2015-07-07 14:03:51,615] INFO conflict in
> /consumers/group/ids/test-1435856975563-9a9fdc6c data:
> {"version":1,"subscription":{"test.*":1},"pattern":"white_list","timestamp":"1436275631510"}
> stored data:
> {"version":1,"subscription":{"test.*":1},"pattern":"white_list","timestamp":"1436275558570"}
> (kafka.utils.ZkUtils$)
> [2015-07-07 14:03:51,616] INFO I wrote this conflicted ephemeral node
> [{"version":1,"subscription":{"test.*":1},"pattern":"white_list","timestamp":"1436275631510"}]
> at /consumers/group/ids/test-1435856975563-9a9fdc6c a while back in a
> different session, hence I will backoff for this node to be deleted by
> Zookeeper and retry (kafka.utils.ZkUtils$)
>
> In the meantime zookeeper reported below exception for the same time span
>
> 2015-07-07 22:45:09,687 [myid:3] - INFO  [ProcessThread(sid:3
> cport:-1)::PrepRequestProcessor@645] - Got user-level KeeperException
> when processing sessionid:0x44e657ff19c0019 type:create cxid:0x7a26
> zxid:0x3015f6e77 txntype:-1 reqpath:n/a Error
> Path:/consumers/group/ids/test-1435856975563-9a9fdc6c Error:KeeperErrorCode
> = NodeExists for /consumers/group/ids/test-1435856975563-9a9fdc6c
>
> At the end zookeeper timed out the session and consumers triggered
> rebalance.
>
> I know that conflicted ephemeral node warning is to handle a zookeeper bug
> that session expiration and ephemeral node deletion are not done atomically
> but as indicated from zookeeper log the zookeeper never got a chance to
> delete the ephemeral node which made me think that the session was not
> expired at that time. And for some reason zookeeper fired session expire
> event which subsequently invoked ZKSessionExpireListener.  I was just
> wondering if anyone have ever encountered similar issue before and what I
> can do at zookeeper side to prevent this?
>
> Another problem is that createEphemeralPathExpectConflictHandleZKBug call
> is wrapped in a while(true) loop which runs forever until the ephemeral
> node is created. Would it be better that we can employ an exponential retry
> policy with a max number of retries so that it has a chance to re-throw the
> exception back to caller and let caller handle it in situation like above?
>
>