Posted to dev@helix.apache.org by DImuthu Upeksha <di...@gmail.com> on 2019/06/01 05:57:24 UTC

Re: Zookeeper connection errors in Helix Controller

Hi Lee,

Understood, and thanks for the heads up. We are currently in the middle of a
production deployment with 0.8.2, and most of the users have already been
notified of the schedule. Basically, we are happy with the stability and
functional correctness of 0.8.2, except for the above-mentioned case where we
pushed the cluster above its limits in stress testing. So we will go with
this version for this deployment. Once you have released the new version, we
will run the functional tests and stress tests on it in our staging
environment and, if it looks good, we will roll it out to the production
environment.

Thanks
Dimuthu

On Fri, May 31, 2019 at 5:07 PM Hunter Lee <na...@gmail.com> wrote:

> Hey Dimuthu -
>
> We are actually in the process of preparing a new release, and this will
> come with the previously mentioned bug fixes in Task Framework. It also
> contains various ZK-related fixes - I don't know what your deployment
> schedule is but it might be worth the wait of another week or so.
>
> Hunter
>
> On Fri, May 31, 2019 at 10:27 AM DImuthu Upeksha <dimuthu.upeksha2@gmail.com> wrote:
>
> > Now I'm seeing the following error in the controller log. Restarting the
> > controller fixed the issue. From time to time we see this in the controller
> > along with zk connection issues. Is this also something to do with the zk
> > client version?
> >
> > 2019-05-31 13:21:46,669 [Thread-0-SendThread(localhost:2181)] WARN  o.apache.zookeeper.ClientCnxn  - Session 0x16b0ebbee1d000e for server localhost/127.0.0.1:2181, unexpected error, closing socket connection and attempting reconnect
> > java.io.IOException: Broken pipe
> > at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
> > at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
> > at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
> > at sun.nio.ch.IOUtil.write(IOUtil.java:65)
> > at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
> > at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:102)
> > at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:291)
> > at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1041)
> >
> > Thanks
> > Dimuthu
> >
> > On Fri, May 31, 2019 at 1:14 PM DImuthu Upeksha <dimuthu.upeksha2@gmail.com> wrote:
> >
> > > Hi Lei,
> > >
> > > We use 0.8.2. We initially had 0.8.4, but it contains an issue with the task
> > > retry logic, so we downgraded to 0.8.2. We are planning to go into
> > > production with 0.8.2 by next week, so can you please advise a better way
> > > to solve this without upgrading to 0.8.4?
> > >
> > > Thanks
> > > Dimuthu
> > >
> > > On Fri, May 31, 2019 at 1:04 PM Lei Xia <xi...@gmail.com> wrote:
> > >
> > >> Which Helix version do you use?  This may be caused by this Zookeeper bug
> > >> (https://issues.apache.org/jira/browse/ZOOKEEPER-706).  We have upgraded
> > >> ZkClient in later Helix versions.
> > >>
> > >>
> > >> Lei
> > >>
> > >> On Fri, May 31, 2019 at 7:52 AM DImuthu Upeksha <
> > >> dimuthu.upeksha2@gmail.com> wrote:
> > >>
> > >>> Hi Folks,
> > >>>
> > >>> I'm getting the following error in the controller log, and it seems like the
> > >>> controller is not moving forward after that point.
> > >>>
> > >>> 2019-05-31 10:47:37,084 [main] INFO  o.a.a.h.i.c.HelixController  - Starting helix controller
> > >>> 2019-05-31 10:47:37,089 [main] INFO  o.a.a.c.u.ApplicationSettings  - Settings loaded from file:/home/airavata/staging-deployment/airavata-helix/apache-airavata-controller-0.18-SNAPSHOT/conf/airavata-server.properties
> > >>> 2019-05-31 10:47:37,091 [Thread-0] INFO  o.a.a.h.i.c.HelixController  - Connection to helix cluster : AiravataDemoCluster with name : helixcontroller2
> > >>> 2019-05-31 10:47:37,092 [Thread-0] INFO  o.a.a.h.i.c.HelixController  - Zookeeper connection string localhost:2181
> > >>> 2019-05-31 10:47:42,907 [GenericHelixController-event_process] ERROR o.a.h.c.GenericHelixController  - Exception while executing DEFAULTpipeline: org.apache.helix.controller.pipeline.Pipeline@408d6d26for cluster .AiravataDemoCluster. Will not continue to next pipeline
> > >>> org.apache.helix.api.exceptions.HelixMetaDataAccessException: Failed to get full list of /AiravataDemoCluster/CONFIGS/PARTICIPANT
> > >>> at org.apache.helix.manager.zk.ZkBaseDataAccessor.getChildren(ZkBaseDataAccessor.java:446)
> > >>> at org.apache.helix.manager.zk.ZKHelixDataAccessor.getChildValues(ZKHelixDataAccessor.java:406)
> > >>> at org.apache.helix.manager.zk.ZKHelixDataAccessor.getChildValuesMap(ZKHelixDataAccessor.java:467)
> > >>> at org.apache.helix.controller.stages.ClusterDataCache.refresh(ClusterDataCache.java:176)
> > >>> at org.apache.helix.controller.stages.ReadClusterDataStage.process(ReadClusterDataStage.java:62)
> > >>> at org.apache.helix.controller.pipeline.Pipeline.handle(Pipeline.java:63)
> > >>> at org.apache.helix.controller.GenericHelixController.handleEvent(GenericHelixController.java:432)
> > >>> at org.apache.helix.controller.GenericHelixController$ClusterEventProcessor.run(GenericHelixController.java:928)
> > >>> Caused by: org.apache.helix.api.exceptions.HelixMetaDataAccessException: Fail to read nodes for [/AiravataDemoCluster/CONFIGS/PARTICIPANT/helixparticipant]
> > >>> at org.apache.helix.manager.zk.ZkBaseDataAccessor.get(ZkBaseDataAccessor.java:414)
> > >>> at org.apache.helix.manager.zk.ZkBaseDataAccessor.getChildren(ZkBaseDataAccessor.java:479)
> > >>> at org.apache.helix.manager.zk.ZkBaseDataAccessor.getChildren(ZkBaseDataAccessor.java:442)
> > >>> ... 7 common frames omitted
> > >>>
> > >>> In the zookeeper log I can see the following warning being printed
> > >>> continuously. What could be the reason for that? I'm using helix 0.8.2
> > >>> and zookeeper 3.4.8.
> > >>>
> > >>> 2019-05-31 10:49:37,621 [myid:] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1008] - Closed socket connection for client /0:0:0:0:0:0:0:1:59056 which had sessionid 0x16b0e59877f0000
> > >>> 2019-05-31 10:49:37,773 [myid:] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /127.0.0.1:57984
> > >>> 2019-05-31 10:49:37,774 [myid:] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@893] - Client attempting to renew session 0x16b0e59877f0000 at /127.0.0.1:57984
> > >>> 2019-05-31 10:49:37,774 [myid:] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@645] - Established session 0x16b0e59877f0000 with negotiated timeout 30000 for client /127.0.0.1:57984
> > >>> 2019-05-31 10:49:37,790 [myid:] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end of stream exception
> > >>> EndOfStreamException: Unable to read additional data from client sessionid 0x16b0e59877f0000, likely client has closed socket
> > >>> at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:230)
> > >>> at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:203)
> > >>> at java.lang.Thread.run(Thread.java:748)
> > >>>
> > >>> Thanks
> > >>> Dimuthu
> > >>>
> > >>
> > >>
> > >> --
> > >> Lei Xia
> > >>
> > >
> >
>

Re: Zookeeper connection errors in Helix Controller

Posted by Lei Xia <xi...@gmail.com>.
Before our new release is out, if you see that this is a problem in your prod
deployment, one thing you may try is to add a newer zookeeper version as
an explicit dependency in your project. Then, at build time, maven
(or another build tool) will pick the new version instead of the one specified
in helix. This could be a workaround for you for now.
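
As a rough sketch of that workaround, assuming a Maven build: declaring
zookeeper directly in your own pom.xml should take precedence over the
version that comes in transitively through helix, because Maven resolves
version conflicts in favor of the nearest declaration. The version number
below (3.4.14) is only an illustration and has not been validated against
helix 0.8.2, so please substitute whichever release passes your own staging
tests.

```xml
<!-- Goes inside the <dependencies> section of your project's pom.xml. -->
<!-- The version is an assumption for illustration, not a tested recommendation. -->
<dependency>
  <groupId>org.apache.zookeeper</groupId>
  <artifactId>zookeeper</artifactId>
  <version>3.4.14</version>
</dependency>
```

Running "mvn dependency:tree -Dincludes=org.apache.zookeeper" afterwards
should show which zookeeper version actually ends up on the classpath.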


Lei

-- 
Lei Xia