Posted to dev@nifi.apache.org by Mark Bean <ma...@gmail.com> on 2017/06/02 12:38:00 UTC

Re: unstable cluster

I tried to build master (1.3.0-SNAPSHOT) with the zookeeper dependency
updated to version 3.4.10, but I am not able to build successfully. A
compilation error results:

[ERROR] Failed to execute goal
org.apache.maven.plugins:maven-compiler-plugin:3.2:compile
(default-compile) on project nifi-framework-core: Compilation failure
[ERROR]
/nifi/nifi-nar/bundles/nifi-framework-bundle/nifi-framework/nifi-framework-core/src/main/java/org/apache/nifi/controller/state/server/ZooKeeperStateServer.java:
[106,25] error: no suitable constructor found for QuorumPeer(no arguments)



On Tue, May 30, 2017 at 11:33 PM, Joe Witt <jo...@gmail.com> wrote:

> Just scanning through the items currently on master that would show up
> in the 1.3.0 release we see numerous cluster related bug fixes.
>
> More consistent port alignment across cluster
>   https://issues.apache.org/jira/browse/NIFI-3981
>
> Ensure controller service lifecycle handled better with different
> timing/dependencies
>   https://issues.apache.org/jira/browse/NIFI-3972
>
> Insufficient heartbeat handling causing improper clustering behavior
>   https://issues.apache.org/jira/browse/NIFI-3933
>
> Improve timing of component startup relative to other lifecycle items
> when clustered
>   https://issues.apache.org/jira/browse/NIFI-3923
>
> Inconsistent scheduled state in some cluster settings
>   https://issues.apache.org/jira/browse/NIFI-3900
>
> Improved fingerprinted/non-fingerprinted settings enforcement and
> handling in clusters
>   https://issues.apache.org/jira/browse/NIFI-1963
>
> These are NiFi-specific cluster behavior items.  For NiFi and
> ZooKeeper interaction specifically, most of the focus thus far has been
> on NiFi itself, as the above JIRAs show, and of course on the cases
> where a given system is so resource contended that it will simply not
> have a nice embedded ZK/NiFi experience.
>
> MarkB, your testing above suggests you were using a nifi 1.x which
> means a zookeeper 3.4.6 client against a Zookeeper 3.4.10 server
> cluster and behavior was much better.  Could you possibly run the same
> cluster evaluation against the latest master but with an embedded
> zookeeper 3.4.10 version in nifi (which means both server and client
> are on latest zk 3.4.10 release)?  This would be helpful data.
> Assuming that goes well the only other concern that jumps to mind is
> if us using a zookeeper 3.4.10 client presents problems for us talking
> to older server versions (still 3.4 though so probably ok, i'd hope).
> In general we should be safe thanks to classloader isolation but we've
> seen some pretty magical JVM/system classloader level changes happen
> for Kerberized environments.
>
> Thanks
> Joe
>
>
>
> On Tue, May 30, 2017 at 3:21 PM, Juan Sequeiros <he...@gmail.com>
> wrote:
> > Hello all,
> >
> > I'd like to chime in on this interesting discussion thread.
> >
> > I'd like to add that my system(s) too have seen unstable ZK interaction
> > with both embedded and, eventually, external ZK (granted, external has
> > been better).
> > We have resolved these with NiFi restarts. It's to the point that we are
> > hesitant to roll up to NiFi 1.x, mainly because of this (we have a DEV
> > NiFi 1.x).
> >
> > I would also like to add that we are greatly anticipating the ZK 3.5.x
> > release for its TLS implementation, and as such have not voiced our
> > experience with NiFi / ZooKeeper, assuming that once ZooKeeper 3.5.x is
> > out of alpha it would be added to the NiFi NAR framework fairly fast and
> > fix the oddities.
> >
> > I would say, though, that we have been hoping for a newer client on the
> > NiFi ZK side, since the current one appears to be based on ZooKeeper
> > 3.4.6, which was released in *MAR 2014*.
> >
> > # jar tf nifi-framework-nar-1.1.1.nar | grep zoo
> > META-INF/bundled-dependencies/zookeeper-3.4.6.jar
> >
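The archive check above can also be done programmatically. The following is a minimal sketch, assuming the NAR can be read as an ordinary zip archive (which the `jar` listing above implies). The class name `NarDependencyFinder` and the method `findEntries` are hypothetical names chosen for illustration; they are not part of NiFi.

```java
import java.io.IOException;
import java.util.List;
import java.util.stream.Collectors;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper (not part of NiFi): lists archive entries whose
// names contain a given substring, similar to `jar tf <file> | grep <text>`.
public class NarDependencyFinder {

    public static List<String> findEntries(String archivePath, String needle) throws IOException {
        // A NAR is assumed here to be a plain zip/jar archive.
        try (ZipFile zip = new ZipFile(archivePath)) {
            return zip.stream()
                      .map(ZipEntry::getName)
                      .filter(name -> name.contains(needle))
                      .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: NarDependencyFinder <archive> <text>");
            return;
        }
        for (String name : findEntries(args[0], args[1])) {
            System.out.println(name);
        }
    }
}
```

Running it with the NAR path and `zookeeper` as arguments should print the bundled ZooKeeper jar entry, which shows the client version the framework NAR ships with.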
> > And now I wonder how long it would take for NiFi to release a client
> > based on 3.5.x once it goes official, given the hesitation over forward
> > compatibility.
> >
> >
> > On Tue, May 30, 2017 at 2:52 PM Jeff <jt...@gmail.com> wrote:
> >
> >> Joe,
> >>
> >> My own direct and indirect experiences with NiFi 1.x clustering have been
> >> good for both embedded and external ZooKeeper, but we have certainly seen
> >> some emails on the mailing list about it. Those have been for high-load
> >> cases where the embedded approach would be susceptible to timing issues,
> >> and were resolved by using an external system. Mark Bean's report is
> >> interesting, though, since it happens under no real load at all.
> >>
> >> I suspect ZOOKEEPER-2044 will help with that, though there are several
> >> comments [1] (and others on that JIRA) that describe the issue as
> >> minor/false reporting/cosmetic/an improvement. Updating to ZooKeeper 3.4.10
> >> suggests that this rare issue can be resolved in NiFi, but we'll have to do
> >> our due diligence to make sure that no new issues are raised with the
> >> upgrade for NiFi or its ability to interface with external systems. We'll
> >> have to test with other dependencies that use ZooKeeper 3.4.6 to ensure
> >> that forward compatibility.
> >>
> >> [1] https://issues.apache.org/jira/browse/ZOOKEEPER-2044?focusedCommentId=15024616&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15024616
> >>
> >> Thanks,
> >> Jeff
> >>
> >> On Tue, May 30, 2017 at 1:15 PM Joe Skora <js...@gmail.com> wrote:
> >>
> >> > Jeff,
> >> >
> >> > If I understand the issue correctly, this means NiFi 1.x has always been
> >> > broken for clustering with an embedded ZooKeeper.  That has never been
> >> > communicated until now; we clearly build for and explain how to use an
> >> > embedded ZooKeeper in the documentation.
> >> >
> >> > Any external non-NiFi elements that are considered in design and
> >> > dependency decisions need to be clearly understood by the entire
> >> > community.  What non-NiFi things are you thinking of that drive ZooKeeper
> >> > dependencies?
> >> >
> >> > Joe
> >> >
> >> > On Tue, May 30, 2017 at 9:11 AM, Jeff <jt...@gmail.com> wrote:
> >> >
> >> > > Mark, we can certainly take smaller steps rather than waiting for
> >> > > 3.5.2/3.6.0 to come out.  I was just bringing that JIRA up as another
> >> > > scenario that entices us to upgrade.
> >> > >
> >> > > Joe, I'm referring to NiFi, the toolkit, and things non-NiFi that
> >> > > provide a ZK server to which NiFi or the ZK Migration Toolkit are
> >> > > clients.  I'm not saying we can't or shouldn't upgrade, but we do need
> >> > > to test to make sure that no issues are introduced by NiFi shipping with
> >> > > ZK 3.4.10.  Being that it's a bugfix version change, it's probably fine.
> >> > >
> >> > > - Jeff
> >> > >
> >> > > On Tue, May 30, 2017 at 10:46 AM Joe Skora <js...@gmail.com> wrote:
> >> > >
> >> > > > Jeff,
> >> > > >
> >> > > > Does that mean NiFi 1.x will be unstable when using embedded ZooKeeper
> >> > > > until the ZK version is upgraded?
> >> > > >
> >> > > > By "components outside of NiFi" do you mean the NiFi toolkit and other
> >> > > > parts of the NiFi release?
> >> > > >
> >> > > > Joe
> >> > > >
> >> > > > On Tue, May 30, 2017 at 5:42 AM, Jeff <jt...@gmail.com> wrote:
> >> > > >
> >> > > > > Mark,
> >> > > > >
> >> > > > > I did report a JIRA [1] for upgrading to 3.5.2 or 3.6.0 (just due to
> >> > > > > log4j issues) once it's out and stable. There are issues with the way
> >> > > > > that ZK refers to log4j classes in the code that cause issues for
> >> > > > > NiFi and our Toolkit.  However, there has been some back and forth [2]
> >> > > > > (in 3.4.0, which doesn't fix the issue, but moves towards fixing it),
> >> > > > > [3], and [4] on the changes being implemented in versions 3.5.2 and
> >> > > > > 3.6.0.  Also, it looks like ZK 3.6.0 is headed toward using log4j 2
> >> > > > > [5].
> >> > > > >
> >> > > > > There are many components outside of NiFi that are still using ZK
> >> > > > > 3.4.6, so it may be a while before we can move to 3.4.10. I don't
> >> > > > > currently know anything about the forward compatibility of 3.4.6.  Are
> >> > > > > there improvements/fixes in 3.4.10 which you need?
> >> > > > >
> >> > > > > [1] https://issues.apache.org/jira/browse/NIFI-3067
> >> > > > > [2] https://issues.apache.org/jira/browse/ZOOKEEPER-850
> >> > > > > [3] https://issues.apache.org/jira/browse/ZOOKEEPER-1371
> >> > > > > [4] https://issues.apache.org/jira/browse/ZOOKEEPER-2393
> >> > > > > [5] https://issues.apache.org/jira/browse/ZOOKEEPER-2342
> >> > > > >
> >> > > > > - Jeff
> >> > > > >
> >> > > > > On Tue, May 30, 2017 at 8:15 AM Mark Bean <mark.o.bean@gmail.com> wrote:
> >> > > > >
> >> > > > > > Updated to external ZooKeeper last Friday. Over the weekend, there
> >> > > > > > were no reports of SUSPENDED or RECONNECTED.
> >> > > > > >
> >> > > > > > Are there plans to upgrade the embedded ZooKeeper to the latest
> >> > > > > > version, 3.4.10?
> >> > > > > >
> >> > > > > > Thanks,
> >> > > > > > Mark
> >> > > > > >
> >> > > > > > On Thu, May 25, 2017 at 11:56 AM, Joe Witt <joe.witt@gmail.com> wrote:
> >> > > > > >
> >> > > > > > > looked at a secured cluster and the send times are routinely at
> >> > > > > > > 100ms similar to yours.  I think what i was flagging as
> >> > > > > > > potentially interesting is not interesting at all.
> >> > > > > > >
> >> > > > > > > On Thu, May 25, 2017 at 11:34 AM, Joe Witt <joe.witt@gmail.com> wrote:
> >> > > > > > > > Ok.  Well as a point of comparison i'm looking at heartbeat logs
> >> > > > > > > > from another cluster and the times are consistently 1-3 millis
> >> > > > > > > > for the send.  Yours above show 100+ms typical with one north of
> >> > > > > > > > 900ms.  Not sure how relevant that is but something i noticed.
> >> > > > > > > >
> >> > > > > > > > On Thu, May 25, 2017 at 11:29 AM, Mark Bean <mark.o.bean@gmail.com> wrote:
> >> > > > > > > >> ping shows acceptably fast response time between servers,
> >> > > > > > > >> approximately 0.100-0.150 ms
> >> > > > > > > >>
> >> > > > > > > >>
> >> > > > > > > >> On Thu, May 25, 2017 at 11:13 AM, Joe Witt <joe.witt@gmail.com> wrote:
> >> > > > > > > >>
> >> > > > > > > >>> have you evaluated latency across the machines in your cluster?
> >> > > > > > > >>> I ask because 122ms is pretty long and 917ms is very long.  Are
> >> > > > > > > >>> these nodes across a WAN link?
> >> > > > > > > >>>
> >> > > > > > > >>> On Thu, May 25, 2017 at 11:08 AM, Mark Bean <mark.o.bean@gmail.com> wrote:
> >> > > > > > > >>> > Update: now all 5 nodes, regardless of ZK server, are indicating
> >> > > > > > > >>> > SUSPENDED -> RECONNECTED.
> >> > > > > > > >>> >
> >> > > > > > > >>> > On Thu, May 25, 2017 at 10:23 AM, Mark Bean <mark.o.bean@gmail.com> wrote:
> >> > > > > > > >>> >
> >> > > > > > > >>> >> I reduced the number of embedded ZooKeeper servers on the 5-Node
> >> > > > > > > >>> >> NiFi Cluster from 5 to 3. This has improved the situation. I do
> >> > > > > > > >>> >> not see any of the three Nodes which are also ZK servers
> >> > > > > > > >>> >> disconnecting/reconnecting to the cluster as before. However, the
> >> > > > > > > >>> >> two Nodes which are not running ZK continue to disconnect and
> >> > > > > > > >>> >> reconnect. The following is taken from one of the non-ZK Nodes.
> >> > > > > > > >>> >> It's curious that some messages are issued twice from the same
> >> > > > > > > >>> >> thread, but reference a different object.
> >> > > > > > > >>> >>
> >> > > > > > > >>> >> nifi-app.log
> >> > > > > > > >>> >> 2017-05-25 13:40:01,628 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: SUSPENDED
> >> > > > > > > >>> >> 2017-05-25 13:39:45,627 INFO [Clustering Tasks Thread-1] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-25 13:39:45,504 and sent to FQDN:PORT at 2017-05-25 13:39:45,627; send took 122 millis
> >> > > > > > > >>> >> 2017-05-25 13:39:50,862 INFO [Clustering Tasks Thread-1] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-25 13:39:50,732 and sent to FQDN:PORT at 2017-05-25 13:39:50,862; send took 122 millis
> >> > > > > > > >>> >> 2017-05-25 13:39:56,089 INFO [Clustering Tasks Thread-1] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-25 13:39:55,966 and sent to FQDN:PORT at 2017-05-25 13:39:56,089; send took 129 millis
> >> > > > > > > >>> >> 2017-05-25 13:40:01,629 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@68f8b6a2 Connection State changed to SUSPENDED
> >> > > > > > > >>> >> 2017-05-25 13:40:01,629 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@663f55cd Connection State changed to SUSPENDED
> >> > > > > > > >>> >> 2017-05-25 13:40:02,412 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: RECONNECTED
> >> > > > > > > >>> >> 2017-05-25 13:40:02,413 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@68f8b6a2 Connection State changed to RECONNECTED
> >> > > > > > > >>> >> 2017-05-25 13:40:02,413 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@663f55cd Connection State changed to RECONNECTED
> >> > > > > > > >>> >> 2017-05-25 13:40:02,550 INFO [Clustering Tasks Thread-1] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-25 13:40:01,632 and sent to FQDN:PORT at 2017-05-25 13:40:02,550; send took 917 millis
> >> > > > > > > >>> >> 2017-05-25 13:40:07,787 INFO [Clustering Tasks Thread-1] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-25 13:40:07,657 and sent to FQDN:PORT at 2017-05-25 13:40:07,787; send took 129 millis
> >> > > > > > > >>> >>
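To compare heartbeat timings across nodes, the "created at" and "sent to ... at" timestamps in the heartbeat messages above can be extracted and subtracted. The following is a minimal sketch, assuming each log message has been rejoined onto a single line and uses the timestamp layout shown in the excerpt. The class name `HeartbeatSendTime` and the method `sendMillis` are hypothetical and not part of NiFi. The created-to-sent gap it computes may differ slightly from the "send took" value that NiFi itself reports.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of NiFi): measures the gap between the
// "created" and "sent" timestamps in a heartbeat log message.
public class HeartbeatSendTime {

    // Timestamp layout as it appears in the excerpt, e.g. 2017-05-25 13:40:01,632
    private static final DateTimeFormatter TS =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    private static final String STAMP = "(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3})";

    // Accepts both "created at" and "create at" spellings.
    private static final Pattern HEARTBEAT = Pattern.compile(
            "Heartbeat created? at " + STAMP + " and sent to \\S+ at " + STAMP);

    /** Returns created-to-sent milliseconds, or -1 if the line is not a heartbeat message. */
    public static long sendMillis(String line) {
        Matcher m = HEARTBEAT.matcher(line);
        if (!m.find()) {
            return -1;
        }
        LocalDateTime created = LocalDateTime.parse(m.group(1), TS);
        LocalDateTime sent = LocalDateTime.parse(m.group(2), TS);
        return Duration.between(created, sent).toMillis();
    }

    public static void main(String[] args) {
        String line = "Heartbeat created at 2017-05-25 13:40:01,632 and sent to "
                + "FQDN:PORT at 2017-05-25 13:40:02,550; send took 917 millis";
        System.out.println(sendMillis(line)); // prints 918
    }
}
```

Feeding every heartbeat line from nifi-app.log through `sendMillis` makes outliers such as the roughly 900 ms send around the SUSPENDED/RECONNECTED transition easy to spot.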
> >> > > > > > > >>> >> I will work on setting up an external ZK next, but would still like
> >> > > > > > > >>> >> some insight into what is being observed with the embedded ZK.
> >> > > > > > > >>> >>
> >> > > > > > > >>> >> Thanks,
> >> > > > > > > >>> >> Mark
> >> > > > > > > >>> >>
> >> > > > > > > >>> >>
> >> > > > > > > >>> >>
> >> > > > > > > >>> >>
> >> > > > > > > >>> >> On Wed, May 24, 2017 at 3:57 PM, Mark Bean <mark.o.bean@gmail.com> wrote:
> >> > > > > > > >>> >>
> >> > > > > > > >>> >>> Yes, we are using the embedded ZK. We will try instantiating an
> >> > > > > > > >>> >>> external ZK and see if that resolves the problem.
> >> > > > > > > >>> >>>
> >> > > > > > > >>> >>> The load on the system is extremely small. Currently (as Nodes are
> >> > > > > > > >>> >>> disconnecting/reconnecting) all input ports to the flow are turned
> >> > > > > > > >>> >>> off. The only data in the flow is from a single GenerateFlowFile
> >> > > > > > > >>> >>> generating 5B every 30 secs.
> >> > > > > > > >>> >>>
> >> > > > > > > >>> >>> Also, it is a 5-node cluster with embedded ZK on each node. First, I
> >> > > > > > > >>> >>> will try reducing ZK to only 3 nodes. Then, I will try a 3-node
> >> > > > > > > >>> >>> external ZK.
> >> > > > > > > >>> >>>
> >> > > > > > > >>> >>> Thanks,
> >> > > > > > > >>> >>> Mark
> >> > > > > > > >>> >>>
> >> > > > > > > >>> >>> On Wed, May 24, 2017 at 11:49 AM, Joe Witt <joe.witt@gmail.com> wrote:
> >> > > > > > > >>> >>>
> >> > > > > > > >>> >>>> Are you using the embedded Zookeeper?  If yes we recommend using an
> >> > > > > > > >>> >>>> external zookeeper.
> >> > > > > > > >>> >>>>
> >> > > > > > > >>> >>>> What type of load are the systems under when this occurs (cpu,
> >> > > > > > > >>> >>>> network, memory, disk io)? Under high load the default timeouts for
> >> > > > > > > >>> >>>> clustering are too aggressive.  You can relax these for higher load
> >> > > > > > > >>> >>>> clusters and should see good behavior.  Even if the system overall
> >> > > > > > > >>> >>>> is not under all that high of load if you're seeing garbage
> >> > > > > > > >>> >>>> collection pauses that are lengthy and/or frequent it can cause the
> >> > > > > > > >>> >>>> same high load effect as far as the JVM is concerned.
> >> > > > > > > >>> >>>>
> >> > > > > > > >>> >>>> Thanks
> >> > > > > > > >>> >>>> Joe
> >> > > > > > > >>> >>>>
> >> > > > > > > >>> >>>> On Wed, May 24, 2017 at 9:11 AM, Mark Bean <mark.o.bean@gmail.com> wrote:
> >> > > > > > > >>> >>>> > We have a cluster which is showing signs of instability. The
> >> > > > > > > >>> >>>> > Primary Node and Coordinator are reassigned to different nodes
> >> > > > > > > >>> >>>> > every several minutes. I believe this is due to lack of heartbeat
> >> > > > > > > >>> >>>> > or other coordination. The following error occurs periodically in
> >> > > > > > > >>> >>>> > the nifi-app.log
> >> > > > > > > >>> >>>> >
> >> > > > > > > >>> >>>> > ERROR [CommitProcessor:1] o.apache.zookeeper.server.NIOServerCnxn
> >> > > > > > > >>> >>>> > Unexpected Exception:
> >> > > > > > > >>> >>>> > java.nio.channels.CancelledKeyException: null
> >> > > > > > > >>> >>>> >         at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
> >> > > > > > > >>> >>>> >         at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
> >> > > > > > > >>> >>>> >         at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:151)
> >> > > > > > > >>> >>>> >         at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1081)
> >> > > > > > > >>> >>>> >         at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:404)
> >> > > > > > > >>> >>>> >         at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:74)
> >> > > > > > > >>> >>>> >
> >> > > > > > > >>> >>>> > Apache NiFi 1.2.0
> >> > > > > > > >>> >>>> >
> >> > > > > > > >>> >>>> > Thoughts?
> >> > > > > > > >>> >>>>
> >> > > > > > > >>> >>>
> >> > > > > > > >>> >>>
> >> > > > > > > >>> >>
> >> > > > > > > >>>
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
>

Re: unstable cluster

Posted by Joe Witt <jo...@gmail.com>.
ok thanks Mark.  Yeah that is a good example of what is tricky about
even incremental upgrades with a system like that.  Not all projects
use the same incremental version change logic in terms of APIs,
backward compatibility, etc.

Thanks
joe

On Fri, Jun 2, 2017 at 8:38 AM, Mark Bean <ma...@gmail.com> wrote:
> I tried to build master (1.3.0-SNAPSHOT) with the zookeeper dependency
> updated to version 3.4.10, but I am not able to build successfully. A
> compilation error results:
>
> [ERROR] Failed to execute goal
> org.apache.maven.plugins:maven-compiler-plugin:3.2:compile
> (default-compile) on project nifi-framework-core: Compilation failure
> [ERROR]
> /nifi/nifi-nar/bundles/nifi-framework-bundle/nifi-framework/nifi-framework-core/src/main/java/org/apache/nifi/controller/state/server/ZooKeeperStateServer.java:
> [106,25] error: no suitable constructor found for QuorumPeer(no arguments)
>
>
>
> On Tue, May 30, 2017 at 11:33 PM, Joe Witt <jo...@gmail.com> wrote:
>
>> Just scanning through the items currently on master that would show up
>> in the 1.3.0 release we see numerous cluster related bug fixes.
>>
>> More consistent port alignment across cluster
>>   https://issues.apache.org/jira/browse/NIFI-3981
>>
>> Ensure controller service lifecycle handled better with different
>> timing/dependencies
>>   https://issues.apache.org/jira/browse/NIFI-3972
>>
>> Insufficient heartbeat handling causing improper clustering behavior
>>   https://issues.apache.org/jira/browse/NIFI-3933
>>
>> Improve timing of component startup relative to other lifecycle items
>> when clustered
>>   https://issues.apache.org/jira/browse/NIFI-3923
>>
>> Inconsistent scheduled state in some cluster settings
>>   https://issues.apache.org/jira/browse/NIFI-3900
>>
>> Improved fingerprinted/non-fingerprinted settings enforcement and
>> handling in clusters
>>   https://issues.apache.org/jira/browse/NIFI-1963
>>
>> These are NiFi-specific cluster behavior items.  For NiFi and
>> ZooKeeper interaction specifically, most of the focus thus far has been
>> on NiFi itself, as the above JIRAs show, and of course on the cases
>> where a given system is so resource contended that it will simply not
>> have a nice embedded ZK/NiFi experience.
>>
>> MarkB, your testing above suggests you were using a nifi 1.x which
>> means a zookeeper 3.4.6 client against a Zookeeper 3.4.10 server
>> cluster and behavior was much better.  Could you possibly run the same
>> cluster evaluation against the latest master but with an embedded
>> zookeeper 3.4.10 version in nifi (which means both server and client
>> are on latest zk 3.4.10 release)?  This would be helpful data.
>> Assuming that goes well the only other concern that jumps to mind is
>> if us using a zookeeper 3.4.10 client presents problems for us talking
>> to older server versions (still 3.4 though so probably ok, i'd hope).
>> In general we should be safe thanks to classloader isolation but we've
>> seen some pretty magical JVM/system classloader level changes happen
>> for Kerberized environments.
>>
>> Thanks
>> Joe
>>
>>
>>
>> On Tue, May 30, 2017 at 3:21 PM, Juan Sequeiros <he...@gmail.com>
>> wrote:
>> > Hello all,
>> >
>> > I'd like to chime in on this interesting discussion thread.
>> >
>> > I'd like to add that my system(s) too have seen unstable ZK interaction
>> > with both embedded and, eventually, external ZK (granted, external has
>> > been better).
>> > We have resolved these with NiFi restarts. It's to the point that we are
>> > hesitant to roll up to NiFi 1.x, mainly because of this (we have a DEV
>> > NiFi 1.x).
>> >
>> > I would also like to add that we are greatly anticipating the ZK 3.5.x
>> > release for its TLS implementation, and as such have not voiced our
>> > experience with NiFi / ZooKeeper, assuming that once ZooKeeper 3.5.x is
>> > out of alpha it would be added to the NiFi NAR framework fairly fast and
>> > fix the oddities.
>> >
>> > I would say, though, that we have been hoping for a newer client on the
>> > NiFi ZK side, since the current one appears to be based on ZooKeeper
>> > 3.4.6, which was released in *MAR 2014*.
>> >
>> > # jar tf nifi-framework-nar-1.1.1.nar | grep zoo
>> > META-INF/bundled-dependencies/zookeeper-3.4.6.jar
>> >
>> > And now I wonder how long it would take for NiFi to release a client
>> > based on 3.5.x once it goes official, given the hesitation over forward
>> > compatibility.
>> >
>> >
>> > On Tue, May 30, 2017 at 2:52 PM Jeff <jt...@gmail.com> wrote:
>> >
>> >> Joe,
>> >>
>> >> My own direct and indirect experiences with NiFi 1.x clustering have been
>> >> good for both embedded and external ZooKeeper, but we have certainly seen
>> >> some emails on the mailing list about it. Those have been for high-load
>> >> cases where the embedded approach would be susceptible to timing issues,
>> >> and were resolved by using an external system. Mark Bean's report is
>> >> interesting, though, since it happens under no real load at all.
>> >>
>> >> I suspect ZOOKEEPER-2044 will help with that, though there are several
>> >> comments [1] (and others on that JIRA) that describe the issue as
>> >> minor/false reporting/cosmetic/an improvement. Updating to ZooKeeper 3.4.10
>> >> suggests that this rare issue can be resolved in NiFi, but we'll have to do
>> >> our due diligence to make sure that no new issues are raised with the
>> >> upgrade for NiFi or its ability to interface with external systems. We'll
>> >> have to test with other dependencies that use ZooKeeper 3.4.6 to ensure
>> >> that forward compatibility.
>> >>
>> >> [1] https://issues.apache.org/jira/browse/ZOOKEEPER-2044?focusedCommentId=15024616&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15024616
>> >>
>> >> Thanks,
>> >> Jeff
>> >>
>> >> On Tue, May 30, 2017 at 1:15 PM Joe Skora <js...@gmail.com> wrote:
>> >>
>> >> > Jeff,
>> >> >
>> >> > If I understand the issue correctly, this means NiFi 1.x has always been
>> >> > broken for clustering with an embedded ZooKeeper.  That has never been
>> >> > communicated until now; we clearly build for and explain how to use an
>> >> > embedded ZooKeeper in the documentation.
>> >> >
>> >> > Any external non-NiFi elements that are considered in design and
>> >> > dependency decisions need to be clearly understood by the entire
>> >> > community.  What non-NiFi things are you thinking of that drive ZooKeeper
>> >> > dependencies?
>> >> >
>> >> > Joe
>> >> >
>> >> > On Tue, May 30, 2017 at 9:11 AM, Jeff <jt...@gmail.com> wrote:
>> >> >
>> >> > > Mark, we can certainly take smaller steps rather than waiting for
>> >> > > 3.5.2/3.6.0 to come out.  I was just bringing that JIRA up as another
>> >> > > scenario that entices us to upgrade.
>> >> > >
>> >> > > Joe, I'm referring to NiFi, the toolkit, and things non-NiFi that
>> >> > > provide a ZK server to which NiFi or the ZK Migration Toolkit are
>> >> > > clients.  I'm not saying we can't or shouldn't upgrade, but we do need
>> >> > > to test to make sure that no issues are introduced by NiFi shipping with
>> >> > > ZK 3.4.10.  Being that it's a bugfix version change, it's probably fine.
>> >> > >
>> >> > > - Jeff
>> >> > >
>> >> > > On Tue, May 30, 2017 at 10:46 AM Joe Skora <js...@gmail.com> wrote:
>> >> > >
>> >> > > > Jeff,
>> >> > > >
>> >> > > > Does that mean NiFi 1.x will be unstable when using embedded ZooKeeper
>> >> > > > until the ZK version is upgraded?
>> >> > > >
>> >> > > > By "components outside of NiFi" do you mean the NiFi toolkit and other
>> >> > > > parts of the NiFi release?
>> >> > > >
>> >> > > > Joe
>> >> > > >
>> >> > > > On Tue, May 30, 2017 at 5:42 AM, Jeff <jt...@gmail.com> wrote:
>> >> > > >
>> >> > > > > Mark,
>> >> > > > >
>> >> > > > > I did report a JIRA [1] for upgrading to 3.5.2 or 3.6.0 (just due to
>> >> > > > > log4j issues) once it's out and stable. There are issues with the way
>> >> > > > > that ZK refers to log4j classes in the code that cause issues for
>> >> > > > > NiFi and our Toolkit.  However, there has been some back and forth [2]
>> >> > > > > (in 3.4.0, which doesn't fix the issue, but moves towards fixing it),
>> >> > > > > [3], and [4] on the changes being implemented in versions 3.5.2 and
>> >> > > > > 3.6.0.  Also, it looks like ZK 3.6.0 is headed toward using log4j 2
>> >> > > > > [5].
>> >> > > > >
>> >> > > > > There are many components outside of NiFi that are still using ZK
>> >> > > > > 3.4.6, so it may be a while before we can move to 3.4.10. I don't
>> >> > > > > currently know anything about the forward compatibility of 3.4.6.  Are
>> >> > > > > there improvements/fixes in 3.4.10 which you need?
>> >> > > > >
>> >> > > > > [1] https://issues.apache.org/jira/browse/NIFI-3067
>> >> > > > > [2] https://issues.apache.org/jira/browse/ZOOKEEPER-850
>> >> > > > > [3] https://issues.apache.org/jira/browse/ZOOKEEPER-1371
>> >> > > > > [4] https://issues.apache.org/jira/browse/ZOOKEEPER-2393
>> >> > > > > [5] https://issues.apache.org/jira/browse/ZOOKEEPER-2342
>> >> > > > >
>> >> > > > > - Jeff
>> >> > > > >
>> >> > > > > On Tue, May 30, 2017 at 8:15 AM Mark Bean <mark.o.bean@gmail.com> wrote:
>> >> > > > >
>> >> > > > > > Updated to external ZooKeeper last Friday. Over the weekend, there
>> >> > > > > > were no reports of SUSPENDED or RECONNECTED.
>> >> > > > > >
>> >> > > > > > Are there plans to upgrade the embedded ZooKeeper to the latest
>> >> > > > > > version, 3.4.10?
>> >> > > > > >
>> >> > > > > > Thanks,
>> >> > > > > > Mark
>> >> > > > > >
>> >> > > > > > On Thu, May 25, 2017 at 11:56 AM, Joe Witt <joe.witt@gmail.com> wrote:
>> >> > > > > >
>> >> > > > > > > looked at a secured cluster and the send times are routinely at
>> >> > > > > > > 100ms similar to yours.  I think what i was flagging as
>> >> > > > > > > potentially interesting is not interesting at all.
>> >> > > > > > >
>> >> > > > > > > On Thu, May 25, 2017 at 11:34 AM, Joe Witt <joe.witt@gmail.com> wrote:
>> >> > > > > > > > Ok.  Well as a point of comparison i'm looking at heartbeat logs
>> >> > > > > > > > from another cluster and the times are consistently 1-3 millis
>> >> > > > > > > > for the send.  Yours above show 100+ms typical with one north of
>> >> > > > > > > > 900ms.  Not sure how relevant that is but something i noticed.
>> >> > > > > > > >
>> >> > > > > > > > On Thu, May 25, 2017 at 11:29 AM, Mark Bean <
>> >> > > mark.o.bean@gmail.com
>> >> > > > >
>> >> > > > > > > wrote:
>> >> > > > > > > >> ping shows acceptably fast response time between servers,
>> >> > > > > > approximately
>> >> > > > > > > >> 0.100-0.150 ms
>> >> > > > > > > >>
>> >> > > > > > > >>
>> >> > > > > > > >> On Thu, May 25, 2017 at 11:13 AM, Joe Witt <
>> >> > joe.witt@gmail.com>
>> >> > > > > > wrote:
>> >> > > > > > > >>
>> >> > > > > > > >>> have you evaluated latency across the machines in your
>> >> > cluster?
>> >> > > > I
>> >> > > > > > ask
>> >> > > > > > > >>> because 122ms is pretty long and 917ms is very long.
>> Are
>> >> > these
>> >> > > > > nodes
>> >> > > > > > > >>> across a WAN link?
>> >> > > > > > > >>>
>> >> > > > > > > >>> On Thu, May 25, 2017 at 11:08 AM, Mark Bean <
>> >> > > > mark.o.bean@gmail.com
>> >> > > > > >
>> >> > > > > > > wrote:
>> >> > > > > > > >>> > Update: now all 5 nodes, regardless of ZK server, are
>> >> > > > indicating
>> >> > > > > > > >>> SUSPENDED
>> >> > > > > > > >>> > -> RECONNECTED.
>> >> > > > > > > >>> >
>> >> > > > > > > >>> > On Thu, May 25, 2017 at 10:23 AM, Mark Bean <
>> >> > > > > mark.o.bean@gmail.com
>> >> > > > > > >
>> >> > > > > > > >>> wrote:
>> >> > > > > > > >>> >
>> >> > > > > > > >>> >> I reduced the number of embedded ZooKeeper servers on
>> >> the
>> >> > > > 5-Node
>> >> > > > > > > NiFi
>> >> > > > > > > >>> >> Cluster from 5 to 3. This has improved the
>> situation. I
>> >> do
>> >> > > not
>> >> > > > > see
>> >> > > > > > > any
>> >> > > > > > > >>> of
>> >> > > > > > > >>> >> the three Nodes which are also ZK servers
>> >> > > > > > > disconnecting/reconnecting to
>> >> > > > > > > >>> the
>> >> > > > > > > >>> >> cluster as before. However, the two Nodes which are
>> not
>> >> > > > running
>> >> > > > > ZK
>> >> > > > > > > >>> continue
>> >> > > > > > > >>> >> to disconnect and reconnect. The following is taken
>> from
>> >> > one
>> >> > > > of
>> >> > > > > > the
>> >> > > > > > > >>> non-ZK
>> >> > > > > > > >>> >> Nodes. It's curious that some messages are issued
>> twice
>> >> > from
>> >> > > > the
>> >> > > > > > > same
>> >> > > > > > > >>> >> thread, but reference a different object.
>> >> > > > > > > >>> >>
>> >> > > > > > > >>> >> nifi-app.log
>> >> > > > > > > >>> >> 2017-05-25 13:40:01,628 INFO [main-EventThread]
>> >> > > o.a.c.f.state.
>> >> > > > > > > >>> ConnectionStateManager
>> >> > > > > > > >>> >> State change: SUSPENDED
>> >> > > > > > > >>> >> 2017-05-25 13:39:45,627 INFO [Clustering Tasks
>> Thread-1]
>> >> > > > > > o.a.n.c.c.
>> >> > > > > > > >>> ClusterProtocolHeartbeater
>> >> > > > > > > >>> >> Heartbeat created at 2017-05-25 13:39:45,504 and sent
>> to
>> >> > > > > FQDN:PORT
>> >> > > > > > at
>> >> > > > > > > >>> >> 2017-05-25 13:39:45,627; send took 122 millis
>> >> > > > > > > >>> >> 2017-05-25 13:39:50,862 INFO [Clustering Tasks
>> Thread-1]
>> >> > > > > > o.a.n.c.c.
>> >> > > > > > > >>> ClusterProtocolHeartbeater
>> >> > > > > > > >>> >> Heartbeat created at 2017-05-25 13:39:50,732 and sent
>> to
>> >> > > > > FQDN:PORT
>> >> > > > > > at
>> >> > > > > > > >>> >> 2017-05-25 13:39:50,862; send took 122 millis
>> >> > > > > > > >>> >> 2017-05-25 13:39:56,089 INFO [Clustering Tasks
>> Thread-1]
>> >> > > > > > o.a.n.c.c.
>> >> > > > > > > >>> ClusterProtocolHeartbeater
>> >> > > > > > > >>> >> Heartbeat created at 2017-05-25 13:39:55,966 and sent
>> to
>> >> > > > > FQDN:PORT
>> >> > > > > > at
>> >> > > > > > > >>> >> 2017-05-25 13:39:56,089; send took 129 millis
>> >> > > > > > > >>> >> 2017-05-25 13:40:01,629 INFO
>> >> > > > [Curator-ConnectionStateManager-0]
>> >> > > > > > > >>> >> o.a.n.c.l.e.CuratorLeaderElectionManager
>> >> > > > > > > org.apache.nifi.controller.
>> >> > > > > > > >>> >> leader.election.CuratorLeaderElectionManager$
>> >> > > > > > > ElectionListener@68f8b6a2
>> >> > > > > > > >>> >> Connection State changed to SUSPENDED
>> >> > > > > > > >>> >> 2017-05-25 13:40:01,629 INFO
>> >> > > > [Curator-ConnectionStateManager-0]
>> >> > > > > > > >>> >> o.a.n.c.l.e.CuratorLeaderElectionManager
>> >> > > > > > > org.apache.nifi.controller.
>> >> > > > > > > >>> >> leader.election.CuratorLeaderElectionManager$
>> >> > > > > > > ElectionListener@663f55cd
>> >> > > > > > > >>> >> Connection State changed to SUSPENDED
>> >> > > > > > > >>> >> 2017-05-25 13:40:02,412 INFO [main-EventThread]
>> >> > > o.a.c.f.state.
>> >> > > > > > > >>> ConnectionStateManager
>> >> > > > > > > >>> >> State change: RECONNECTED
>> >> > > > > > > >>> >> 2017-05-25 13:40:02,413 INFO
>> >> > > > [Curator-ConnectionStateManager-0]
>> >> > > > > > > >>> >> o.a.n.c.l.e.CuratorLeaderElectionManager
>> >> > > > > > > org.apache.nifi.controller.
>> >> > > > > > > >>> >> leader.election.CuratorLeaderElectionManager$
>> >> > > > > > > ElectionListener@68f8b6a2
>> >> > > > > > > >>> >> Connection State changed to RECONNECTED
>> >> > > > > > > >>> >> 2017-05-25 13:40:02,413 INFO
>> >> > > > [Curator-ConnectionStateManager-0]
>> >> > > > > > > >>> >> o.a.n.c.l.e.CuratorLeaderElectionManager
>> >> > > > > > > org.apache.nifi.controller.
>> >> > > > > > > >>> >> leader.election.CuratorLeaderElectionManager$
>> >> > > > > > > ElectionListener@663f55cd
>> >> > > > > > > >>> >> Connection State changed to RECONNECTED
>> >> > > > > > > >>> >> 2017-05-25 13:40:02,550 INFO [Clustering Tasks
>> Thread-1]
>> >> > > > > > o.a.n.c.c.
>> >> > > > > > > >>> ClusterProtocolHeartbeater
>> >> > > > > > > >>> >> Heartbeat created at 2017-05-25 13:40:01,632 and sent
>> to
>> >> > > > > FQDN:PORT
>> >> > > > > > at
>> >> > > > > > > >>> >> 2017-05-25 13:40:02,550; send took 917 millis
>> >> > > > > > > >>> >> 2017-05-25 13:40:07,787 INFO [Clustering Tasks
>> Thread-1]
>> >> > > > > > o.a.n.c.c.
>> >> > > > > > > >>> ClusterProtocolHeartbeater
>> >> > > > > > > >>> >> Heartbeat created at 2017-05-25 13:40:07,657 and sent
>> to
>> >> > > > > FQDN:PORT
>> >> > > > > > at
>> >> > > > > > > >>> >> 2017-05-25 13:40:07,787; send took 129 millis
>> >> > > > > > > >>> >>
>> >> > > > > > > >>> >> I will work on setting up an external ZK next, but
>> would
>> >> > > still
>> >> > > > > > like
>> >> > > > > > > some
>> >> > > > > > > >>> >> insight to what is being observed with the embedded
>> ZK.
>> >> > > > > > > >>> >>
>> >> > > > > > > >>> >> Thanks,
>> >> > > > > > > >>> >> Mark
>> >> > > > > > > >>> >>
>> >> > > > > > > >>> >>
>> >> > > > > > > >>> >>
>> >> > > > > > > >>> >>
>> >> > > > > > > >>> >> On Wed, May 24, 2017 at 3:57 PM, Mark Bean <
>> >> > > > > mark.o.bean@gmail.com
>> >> > > > > > >
>> >> > > > > > > >>> wrote:
>> >> > > > > > > >>> >>
>> >> > > > > > > >>> >>> Yes, we are using the embedded ZK. We will try
>> >> > > instantiating
>> >> > > > > an
>> >> > > > > > > >>> external
>> >> > > > > > > >>> >>> ZK and see if that resolves the problem.
>> >> > > > > > > >>> >>>
>> >> > > > > > > >>> >>> The load on the system is extremely small. Currently
>> >> (as
>> >> > > > Nodes
>> >> > > > > > are
>> >> > > > > > > >>> >>> disconnecting/reconnecting) all input ports to the
>> flow
>> >> > are
>> >> > > > > > turned
>> >> > > > > > > >>> off. The
>> >> > > > > > > >>> >>> only data in the flow is from a single GenerateFlowFile
>> >> > > > generating
>> >> > > > > 5B
>> >> > > > > > > >>> every 30
>> >> > > > > > > >>> >>> secs.
>> >> > > > > > > >>> >>>
>> >> > > > > > > >>> >>> Also, it is a 5-node cluster with embedded ZK on
>> each
>> >> > node.
>> >> > > > > > First,
>> >> > > > > > > I
>> >> > > > > > > >>> will
>> >> > > > > > > >>> >>> try reducing ZK to only 3 nodes. Then, I will try a
>> >> > 3-node
>> >> > > > > > > external ZK.
>> >> > > > > > > >>> >>>
>> >> > > > > > > >>> >>> Thanks,
>> >> > > > > > > >>> >>> Mark
>> >> > > > > > > >>> >>>
>> >> > > > > > > >>> >>> On Wed, May 24, 2017 at 11:49 AM, Joe Witt <
>> >> > > > joe.witt@gmail.com
>> >> > > > > >
>> >> > > > > > > wrote:
>> >> > > > > > > >>> >>>
>> >> > > > > > > >>> >>>> Are you using the embedded Zookeeper?  If yes we
>> >> > recommend
>> >> > > > > using
>> >> > > > > > > an
>> >> > > > > > > >>> >>>> external zookeeper.
>> >> > > > > > > >>> >>>>
>> >> > > > > > > >>> >>>> What type of load are the systems under when this
>> >> occurs
>> >> > > > (cpu,
>> >> > > > > > > >>> >>>> network, memory, disk io)? Under high load the
>> default
>> >> > > > > timeouts
>> >> > > > > > > for
>> >> > > > > > > >>> >>>> clustering are too aggressive.  You can relax these
>> >> for
>> >> > > > higher
>> >> > > > > > > load
>> >> > > > > > > >>> >>>> clusters and should see good behavior.  Even if the
>> >> > system
>> >> > > > > > > overall is
>> >> > > > > > > >>> >>>> not under all that high of load if you're seeing
>> >> garbage
>> >> > > > > > > collection
>> >> > > > > > > >>> >>>> pauses that are lengthy and/or frequent it can
>> cause
>> >> the
>> >> > > > same
>> >> > > > > > high
>> >> > > > > > > >>> >>>> load effect as far as the JVM is concerned.
>> >> > > > > > > >>> >>>>
>> >> > > > > > > >>> >>>> Thanks
>> >> > > > > > > >>> >>>> Joe
>> >> > > > > > > >>> >>>>
>> >> > > > > > > >>> >>>> On Wed, May 24, 2017 at 9:11 AM, Mark Bean <
>> >> > > > > > mark.o.bean@gmail.com
>> >> > > > > > > >
>> >> > > > > > > >>> >>>> wrote:
>> >> > > > > > > >>> >>>> > We have a cluster which is showing signs of
>> >> > instability.
>> >> > > > The
>> >> > > > > > > Primary
>> >> > > > > > > >>> >>>> Node
>> >> > > > > > > >>> >>>> > and Coordinator are reassigned to different nodes
>> >> > every
>> >> > > > > > several
>> >> > > > > > > >>> >>>> minutes. I
>> >> > > > > > > >>> >>>> > believe this is due to lack of heartbeat or other
>> >> > > > > > coordination.
>> >> > > > > > > The
>> >> > > > > > > >>> >>>> > following error occurs periodically in the
>> >> > nifi-app.log
>> >> > > > > > > >>> >>>> >
>> >> > > > > > > >>> >>>> > ERROR [CommitProcessor:1]
>> o.apache.zookeeper.server.
>> >> > > > > > > NIOServerCnxn
>> >> > > > > > > >>> >>>> > Unexpected Exception:
>> >> > > > > > > >>> >>>> > java.nio.channels.CancelledKeyException: null
>> >> > > > > > > >>> >>>> >         at sun.nio.ch.SelectionKeyImpl.en
>> >> > > > > > > >>> >>>> sureValid(SelectionKeyImpl.java:73)
>> >> > > > > > > >>> >>>> >         at sun.nio.ch.SelectionKeyImpl.in
>> >> > > > > > > >>> >>>> terestOps(SelectionKeyImpl.java:77)
>> >> > > > > > > >>> >>>> >         at
>> >> > > > > > > >>> >>>> >
>> >> org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(
>> >> > > > > NIOServ
>> >> > > > > > > >>> >>>> erCnxn.java:151)
>> >> > > > > > > >>> >>>> >         at
>> >> > > > > > > >>> >>>> >
>> >> > org.apache.zookeeper.server.NIOServerCnxn.sendResponse(
>> >> > > > > NIOSe
>> >> > > > > > > >>> >>>> rverCnxn.java:1081)
>> >> > > > > > > >>> >>>> >         at
>> >> > > > > > > >>> >>>> > org.apache.zookeeper.server.
>> FinalRequestProcessor.
>> >> > > > > processReq
>> >> > > > > > > >>> >>>> uest(FinalRequestProcessor.java:404)
>> >> > > > > > > >>> >>>> >         at
>> >> > > > > > > >>> >>>> >
>> >> > org.apache.zookeeper.server.quorum.CommitProcessor.run(
>> >> > > > > Commi
>> >> > > > > > > >>> >>>> tProcessor.java:74)
>> >> > > > > > > >>> >>>> >
>> >> > > > > > > >>> >>>> > Apache NiFi 1.2.0
>> >> > > > > > > >>> >>>> >
>> >> > > > > > > >>> >>>> > Thoughts?
>> >> > > > > > > >>> >>>>
>> >> > > > > > > >>> >>>
>> >> > > > > > > >>> >>>
>> >> > > > > > > >>> >>
>> >> > > > > > > >>>
>> >> > > > > > >
>> >> > > > > >
>> >> > > > >
>> >> > > >
>> >> > >
>> >> >
>> >>
>>
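
The heartbeat lines quoted in this thread can be checked mechanically
rather than by eye. What follows is only a minimal sketch, assuming each
message has been rejoined onto a single line and follows the "Heartbeat
created at ... and sent to ... at ...; send took N millis" shape quoted
above. The function names and the 500 ms threshold are hypothetical
choices for illustration, not anything NiFi provides.

```python
import re
from datetime import datetime

# Shape of the heartbeat message as quoted in the thread.
# "created?" tolerates both "create" and "created".
_TS = r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}"
HEARTBEAT = re.compile(
    r"Heartbeat created? at (?P<created>" + _TS + r") "
    r"and sent to (?P<target>\S+) at (?P<sent>" + _TS + r"); "
    r"send took (?P<millis>\d+) millis"
)
TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def parse_heartbeat(line):
    """Return heartbeat timing details for one log line, or None if it is not a heartbeat."""
    match = HEARTBEAT.search(line)
    if match is None:
        return None
    created = datetime.strptime(match.group("created"), TS_FORMAT)
    sent = datetime.strptime(match.group("sent"), TS_FORMAT)
    return {
        "target": match.group("target"),
        # Latency recomputed from the two timestamps in the message.
        "elapsed_ms": round((sent - created).total_seconds() * 1000),
        # Latency as reported by the log line itself.
        "reported_ms": int(match.group("millis")),
    }


def slow_heartbeats(lines, threshold_ms=500):
    """Return parsed heartbeats whose reported send time meets or exceeds threshold_ms."""
    parsed = (parse_heartbeat(line) for line in lines)
    return [hb for hb in parsed if hb is not None and hb["reported_ms"] >= threshold_ms]
```

Feeding nifi-app.log lines to slow_heartbeats() would single out sends
such as the 917 ms one discussed above, which makes it easier to line
them up against the SUSPENDED/RECONNECTED state changes by timestamp.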