Posted to user@zookeeper.apache.org by "Ahmed H." <ah...@gmail.com> on 2014/01/23 19:53:30 UTC

Problems with running ZK on a shared disk

Hello,

I am running ZK on a shared disk (I know, I shouldn't be, but I am
constrained right now) alongside Kafka 0.8 beta. What we are experiencing
is a problem where we get really long fsync times (according to the logs),
followed by a loss of connection of our Kafka clients. Kafka attempts to
reconnect a few times and eventually it dies because it hits the maximum
retry attempts.

The fsync error is seen below:

2014-01-23 13:18:38,746 [myid:] - WARN  [SyncThread:0:FileTxnLog@321] -
fsync-ing the write ahead log in SyncThread:0 took 12762ms which will
adversely effect operation latency. See the ZooKeeper troubleshooting guide
2014-01-23 13:23:41,332 [myid:] - WARN  [SyncThread:0:FileTxnLog@321] -
fsync-ing the write ahead log in SyncThread:0 took 7552ms which will
adversely effect operation latency. See the ZooKeeper troubleshooting guide
2014-01-23 13:28:49,656 [myid:] - WARN  [SyncThread:0:FileTxnLog@321] -
fsync-ing the write ahead log in SyncThread:0 took 6350ms which will
adversely effect operation latency. See the ZooKeeper troubleshooting guide
2014-01-23 13:33:45,063 [myid:] - WARN  [SyncThread:0:FileTxnLog@321] -
fsync-ing the write ahead log in SyncThread:0 took 1039ms which will
adversely effect operation latency. See the ZooKeeper troubleshooting guide
2014-01-23 13:34:00,024 [myid:] - WARN  [SyncThread:0:FileTxnLog@321] -
fsync-ing the write ahead log in SyncThread:0 took 9490ms which will
adversely effect operation latency. See the ZooKeeper troubleshooting guide
2014-01-23 13:44:09,003 [myid:] - WARN  [SyncThread:0:FileTxnLog@321] -
fsync-ing the write ahead log in SyncThread:0 took 8747ms which will
adversely effect operation latency. See the ZooKeeper troubleshooting guide


This is also followed by some of these for good measure:

2014-01-23 13:49:19,427 [myid:] - ERROR [SyncThread:0:NIOServerCnxn@180] -
Unexpected Exception:
java.nio.channels.CancelledKeyException
    at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
    at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
    at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:153)
    at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1076)
    at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:170)
    at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:167)
    at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:101)


The way I see it, I currently have two problems: 1) the ZK setup is an
issue because of the shared disk, and 2) the Kafka clients do not
automatically recover once they hit the maximum number of retries. I am
looking for a way to at least mitigate the ZooKeeper issue. Perhaps I can
adjust the timeouts so that the Kafka clients don't fail the way they do.

What are the best ways to mitigate the issue for now, given that I am
limited to a single disk? Increasing tickTime? My current ZK config is the
default that ships with version 3.4.5, so tickTime is 2000. My Kafka
clients set the zktimeout variable to 30000.
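
For context on whether a longer timeout can even take effect: by default
ZooKeeper clamps each client's requested session timeout to between
2x tickTime and 20x tickTime (so 4-40 seconds with tickTime=2000). A small
sketch of that negotiation; the 2x/20x defaults are from the ZooKeeper
admin guide, and minSessionTimeout/maxSessionTimeout can override them:

```python
# Sketch of ZooKeeper's default session-timeout negotiation.
# Defaults: minSessionTimeout = 2 * tickTime, maxSessionTimeout = 20 * tickTime.
def negotiated_session_timeout(requested_ms, tick_time_ms=2000):
    min_ms = 2 * tick_time_ms
    max_ms = 20 * tick_time_ms
    # The server clamps the client's requested timeout into [min, max].
    return max(min_ms, min(requested_ms, max_ms))

print(negotiated_session_timeout(30000))        # 30000: a 30s request is honored
print(negotiated_session_timeout(60000))        # 40000: capped at 20 * tickTime
print(negotiated_session_timeout(60000, 4000))  # 60000: raising tickTime raises the cap
```

So with the default tickTime of 2000, the clients' 30000ms timeout is
already within bounds; asking for anything beyond 40s would also require
raising tickTime or maxSessionTimeout on the server.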

I realize that this is a ZooKeeper mailing list, and I cannot yet pinpoint
the exact cause of my problems, but ZK appears to be the culprit.

Thanks

Re: Problems with running ZK on a shared disk

Posted by Neha Narkhede <ne...@gmail.com>.
The timeout to increase would be the zookeeper "session timeout". For
Kafka, the appropriate config is "zookeeper.session.timeout.ms".

Thanks,
Neha
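
On the Kafka side, that setting would go in the properties file. A hedged
sketch for 0.8 (the property names and values here are illustrative;
verify them against your exact 0.8 build, since some pre-release builds
used different `zk.*` names):

```properties
# consumer.properties / server.properties -- sketch only
zookeeper.connect=localhost:2181
zookeeper.session.timeout.ms=30000
zookeeper.connection.timeout.ms=30000
```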



Re: Problems with running ZK on a shared disk

Posted by "Ahmed H." <ah...@gmail.com>.
Thanks for the response Nikhil.

What about timeouts? I have been reading about increasing timeouts to
alleviate some of those symptoms but I am unsure of which timeouts they are
referring to. Can you provide some insight?

I currently have one Zookeeper instance so forceSync shouldn't have any
major downsides in this case. I will certainly give it a try when I get the
chance.

Thanks



Re: Problems with running ZK on a shared disk

Posted by Nikhil <mn...@gmail.com>.
Try forceSync=no

forceSync

(Java system property: *zookeeper.forceSync*)

Requires updates to be synced to media of the transaction log before
finishing processing the update. If this option is set to no, ZooKeeper
will not require updates to be synced to the media.


This is a risk unless your zookeeper nodes are in the same rack.
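
One way to set it, assuming the standard zkServer.sh launcher (which picks
up JVMFLAGS from the environment or conf/java.env); adjust for however
your ZooKeeper is actually started:

```shell
# Pass the system property to the ZooKeeper JVM.
# Risk: with forceSync=no, acknowledged transactions can be lost on a crash
# or power failure, since the txn log is no longer fsync-ed before the ack.
export JVMFLAGS="-Dzookeeper.forceSync=no"
bin/zkServer.sh restart
```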


Also check this:
http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/zookeeper_psuedo_scalability_and_absolute

