Posted to dev@nifi.apache.org by Mark Bean <ma...@gmail.com> on 2017/05/24 13:11:02 UTC

unstable cluster

We have a cluster which is showing signs of instability. The Primary Node
and Cluster Coordinator are reassigned to different nodes every few minutes. I
believe this is due to missed heartbeats or some other coordination failure. The
following error occurs periodically in nifi-app.log:

ERROR [CommitProcessor:1] o.apache.zookeeper.server.NIOServerCnxn
Unexpected Exception:
java.nio.channels.CancelledKeyException: null
        at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
        at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
        at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:151)
        at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1081)
        at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:404)
        at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:74)

Apache NiFi 1.2.0

Thoughts?
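
A quick way to check whether the embedded ZooKeeper quorum itself is healthy while
this is happening is ZooKeeper's standard four-letter-word commands. A minimal sketch,
assuming the default client port 2181 and placeholder hostnames:

    for host in node1 node2 node3; do
      echo "== $host =="
      echo ruok | nc "$host" 2181      # a healthy server answers "imok"
      echo stat | nc "$host" 2181 | grep -E 'Mode:|Latency|Connections'
    done

If a server does not answer, reports very high latency, or is not in leader/follower
mode, the instability is more likely on the ZooKeeper side than in NiFi's cluster
protocol.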

Re: unstable cluster

Posted by Joe Witt <jo...@gmail.com>.
OK, thanks Mark.  Yeah, that is a good example of what is tricky about
even incremental upgrades with a system like that.  Not all projects
use the same incremental version-change logic in terms of APIs,
backward compatibility, etc.

Thanks
Joe

On Fri, Jun 2, 2017 at 8:38 AM, Mark Bean <ma...@gmail.com> wrote:
> I tried to build master (1.3.0-SNAPSHOT) but updated the zookeeper
> dependency to version 3.4.10. I am not able to build successfully. A
> compilation error results:
>
> [ERROR] Failed to execute goal
> org.apache.maven.plugins:maven-compiler-plugin:3.2:compile
> (default-compile) on project nifi-framework-core: Compilation failure
> [ERROR]
> /nifi/nifi-nar/bundles/nifi-framework-bundle/nifi-framework/nifi-framework-core/src/main/java/org/apache/nifi/controller/state/server/ZooKeeperStateServer.java:
> [106,25] error: no suitable constructor found for QuorumPeer(no arguments)
>
>
>
> On Tue, May 30, 2017 at 11:33 PM, Joe Witt <jo...@gmail.com> wrote:
>
>> Just scanning through the items currently on master that would show up
>> in the 1.3.0 release we see numerous cluster related bug fixes.
>>
>> More consistent port alignment across cluster
>>   https://issues.apache.org/jira/browse/NIFI-3981
>>
>> Ensure controller service lifecycle handled better with different
>> timing/dependencies
>>   https://issues.apache.org/jira/browse/NIFI-3972
>>
>> Insufficient heartbeat handling causing improper clustering behavior
>>   https://issues.apache.org/jira/browse/NIFI-3933
>>
>> Improve timing of component startup relative to other lifecycle items
>> when clustered
>>   https://issues.apache.org/jira/browse/NIFI-3923
>>
>> Inconsistent scheduled state in some cluster settings
>>   https://issues.apache.org/jira/browse/NIFI-3900
>>
>> Improved fingerprinted/non-fingerprinted settings enforcement and
>> handling in clusters
>>   https://issues.apache.org/jira/browse/NIFI-1963
>>
>> These are NiFi-specific cluster behavior items.  For NiFi and
>> ZooKeeper interaction specifically, most of the focus thus far has been
>> on NiFi itself, as the above JIRAs show, and of course on the cases
>> where a system that is heavily resource contended will simply not
>> have a nice embedded ZK/NiFi experience.
>>
>> MarkB, your testing above suggests you were running NiFi 1.x, which
>> means a ZooKeeper 3.4.6 client against a ZooKeeper 3.4.10 server
>> cluster, and behavior was much better.  Could you possibly run the same
>> cluster evaluation against the latest master but with an embedded
>> ZooKeeper 3.4.10 in NiFi (which means both server and client
>> are on the latest zk 3.4.10 release)?  This would be helpful data.
>> Assuming that goes well, the only other concern that jumps to mind is
>> whether our using a ZooKeeper 3.4.10 client presents problems when talking
>> to older server versions (still 3.4, though, so probably OK, I'd hope).
>> In general we should be safe thanks to classloader isolation but we've
>> seen some pretty magical JVM/system classloader level changes happen
>> for Kerberized environments.
>>
>> Thanks
>> Joe
>>
>>
>>
>> On Tue, May 30, 2017 at 3:21 PM, Juan Sequeiros <he...@gmail.com>
>> wrote:
>> > Hello all,
>> >
>> > I'd like to chime in on this interesting discussion thread.
>> >
>> > I'd like to add that my system(s) too have seen unstable ZK interaction
>> > with both embedded and, eventually, external ZK (granted, external has been
>> > better).
>> > We have resolved these episodes with NiFi restarts, and it's to the point that
>> > we are hesitant to roll up to NiFi 1.X mainly because of this (we do have a
>> > DEV NiFi 1.X).
>> >
>> > I also would like to add that we are greatly anticipating ZK release 3.5.X
>> > for its TLS implementation, and as such have not voiced our experience with
>> > NiFi / ZooKeeper, assuming that once ZooKeeper 3.5.X is out of ALPHA it
>> > would be added to the NiFi NAR framework fairly fast and fix the oddities.
>> >
>> > I would say, though, that we have been hoping for a newer client on the NiFi ZK
>> > side, since the current one suggests it's based off ZooKeeper 3.4.6, which
>> > was released in *MAR 2014*.
>> >
>> > # jar tf nifi-framework-nar-1.1.1.nar | grep zoo
>> > META-INF/bundled-dependencies/zookeeper-3.4.6.jar
>> >
>> > And now I wonder how long it would take for NiFi to release a client
>> > based off 3.5.X once it goes official, given the hesitation about forward
>> > compatibility.
>> >
>> >
>> > On Tue, May 30, 2017 at 2:52 PM Jeff <jt...@gmail.com> wrote:
>> >
>> >> Joe,
>> >>
>> >> My own direct and indirect experiences with NiFi 1.x clustering have been
>> >> good for both embedded and external ZooKeeper, but we have certainly seen
>> >> some emails on the mailing list about it. Those have been for high-load cases
>> >> where the embedded approach would be susceptible to timing issues and were
>> >> resolved by using an external system. Mark Bean's report is interesting,
>> >> though, since it happens under no real load at all.
>> >>
>> >> I suspect ZOOKEEPER-2044 will help that, though there are several comments
>> >> [1] (and others on that JIRA) that describe the issue as minor/false
>> >> reporting/cosmetic/an improvement. Updating to ZooKeeper 3.4.10 suggests
>> >> that this rare issue can be resolved in NiFi, but we'll have to do our due
>> >> diligence to make sure that no new issues are raised with the upgrade for
>> >> NiFi or its ability to interface with external systems. We'll have to do
>> >> testing with other dependencies that use ZooKeeper 3.4.6 to ensure
>> >> forward compatibility.
>> >>
>> >> [1]
>> >>
>> >> https://issues.apache.org/jira/browse/ZOOKEEPER-2044?focusedCommentId=15024616&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15024616
>> >>
>> >> Thanks,
>> >> Jeff
>> >>
>> >> On Tue, May 30, 2017 at 1:15 PM Joe Skora <js...@gmail.com> wrote:
>> >>
>> >> > Jeff,
>> >> >
>> >> > If I understand the issue correctly, this means NiFi 1.x has always been
>> >> > broken for clustering with an embedded ZooKeeper.  That has never been
>> >> > communicated until now; we clearly build for and explain how to use an
>> >> > embedded ZooKeeper in the documentation.
>> >> >
>> >> > Any external non-NiFi elements that are considered in design and dependency
>> >> > decisions need to be clearly understood by the entire community.  What
>> >> > non-NiFi things are you thinking of that drive ZooKeeper dependencies?
>> >> >
>> >> > Joe
>> >> >
>> >> > On Tue, May 30, 2017 at 9:11 AM, Jeff <jt...@gmail.com> wrote:
>> >> >
>> >> > > Mark, we can certainly take smaller steps rather than waiting for
>> >> > > 3.5.2/3.6.0 to come out.  I was just bringing that JIRA up as another
>> >> > > scenario that entices us to upgrade.
>> >> > >
>> >> > > Joe, I'm referring to NiFi, the toolkit, and things non-NiFi that provide a
>> >> > > ZK server to which NiFi or the ZK Migration Toolkit are clients.  I'm not
>> >> > > saying we can't or shouldn't upgrade, but we do need to test to make sure
>> >> > > that no issues are introduced by NiFi shipping with ZK 3.4.10.  Being that
>> >> > > it's a bugfix version change, it's probably fine.
>> >> > >
>> >> > > - Jeff
>> >> > >
>> >> > > On Tue, May 30, 2017 at 10:46 AM Joe Skora <js...@gmail.com>
>> wrote:
>> >> > >
>> >> > > > Jeff,
>> >> > > >
>> >> > > > Does that mean NiFi 1.x will be unstable when using embedded ZooKeeper
>> >> > > > until the ZK version is upgraded?
>> >> > > >
>> >> > > > By "components outside of NiFi" do you mean the NiFi toolkit and other
>> >> > > > parts of the NiFi release?
>> >> > > >
>> >> > > > Joe
>> >> > > >
>> >> > > > On Tue, May 30, 2017 at 5:42 AM, Jeff <jt...@gmail.com> wrote:
>> >> > > >
>> >> > > > > Mark,
>> >> > > > >
>> >> > > > > I did report a JIRA [1] for upgrading to 3.5.2 or 3.6.0 (just due to
>> >> > > > > log4j issues) once it's out and stable.  There are issues with the way
>> >> > > > > that ZK refers to log4j classes in its code that cause problems for NiFi
>> >> > > > > and our toolkit.  However, there has been some back and forth [2] (in
>> >> > > > > 3.4.0, which doesn't fix the issue, but moves towards fixing it), [3],
>> >> > > > > and [4] on the changes being implemented in versions 3.5.2 and 3.6.0.
>> >> > > > > Also, it looks like ZK 3.6.0 is headed toward using log4j 2 [5].
>> >> > > > >
>> >> > > > > There are many components outside of NiFi that are still using ZK 3.4.6,
>> >> > > > > so it may be a while before we can move to 3.4.10.  I don't currently
>> >> > > > > know anything about the forward compatibility of 3.4.6.  Are there
>> >> > > > > improvements/fixes in 3.4.10 which you need?
>> >> > > > >
>> >> > > > > [1] https://issues.apache.org/jira/browse/NIFI-3067
>> >> > > > > [2] https://issues.apache.org/jira/browse/ZOOKEEPER-850
>> >> > > > > [3] https://issues.apache.org/jira/browse/ZOOKEEPER-1371
>> >> > > > > [4] https://issues.apache.org/jira/browse/ZOOKEEPER-2393
>> >> > > > > [5] https://issues.apache.org/jira/browse/ZOOKEEPER-2342
>> >> > > > >
>> >> > > > > - Jeff
>> >> > > > >
>> >> > > > > On Tue, May 30, 2017 at 8:15 AM Mark Bean <
>> mark.o.bean@gmail.com>
>> >> > > wrote:
>> >> > > > >
>> >> > > > > > Updated to external ZooKeeper last Friday. Over the weekend, there are
>> >> > > > > > no reports of SUSPENDED or RECONNECTED.
>> >> > > > > >
>> >> > > > > > Are there plans to upgrade the embedded ZooKeeper to the
>> latest
>> >> > > > version,
>> >> > > > > > 3.4.10?
>> >> > > > > >
>> >> > > > > > Thanks,
>> >> > > > > > Mark
>> >> > > > > >
>> >> > > > > > On Thu, May 25, 2017 at 11:56 AM, Joe Witt <
>> joe.witt@gmail.com>
>> >> > > wrote:
>> >> > > > > >
>> >> > > > > > > Looked at a secured cluster and the send times are routinely at
>> >> > > > > > > 100ms, similar to yours.  I think what I was flagging as potentially
>> >> > > > > > > interesting is not interesting at all.
>> >> > > > > > >
>> >> > > > > > > On Thu, May 25, 2017 at 11:34 AM, Joe Witt <
>> joe.witt@gmail.com
>> >> >
>> >> > > > wrote:
>> >> > > > > > > > Ok.  Well, as a point of comparison, I'm looking at heartbeat logs
>> >> > > > > > > > from another cluster and the times are consistently 1-3 millis for
>> >> > > > > > > > the send.  Yours above show 100+ ms typically, with one north of
>> >> > > > > > > > 900ms.  Not sure how relevant that is, but something I noticed.
>> >> > > > > > > >
>> >> > > > > > > > On Thu, May 25, 2017 at 11:29 AM, Mark Bean <
>> >> > > mark.o.bean@gmail.com
>> >> > > > >
>> >> > > > > > > wrote:
>> >> > > > > > > >> ping shows acceptably fast response time between servers,
>> >> > > > > > > >> approximately 0.100-0.150 ms
>> >> > > > > > > >>
>> >> > > > > > > >>
>> >> > > > > > > >> On Thu, May 25, 2017 at 11:13 AM, Joe Witt <
>> >> > joe.witt@gmail.com>
>> >> > > > > > wrote:
>> >> > > > > > > >>
>> >> > > > > > > >>> Have you evaluated latency across the machines in your cluster?
>> >> > > > > > > >>> I ask because 122ms is pretty long and 917ms is very long.  Are
>> >> > > > > > > >>> these nodes across a WAN link?
>> >> > > > > > > >>>
>> >> > > > > > > >>> On Thu, May 25, 2017 at 11:08 AM, Mark Bean <
>> >> > > > mark.o.bean@gmail.com
>> >> > > > > >
>> >> > > > > > > wrote:
>> >> > > > > > > >>> > Update: now all 5 nodes, regardless of ZK server, are indicating
>> >> > > > > > > >>> > SUSPENDED -> RECONNECTED.
>> >> > > > > > > >>> >
>> >> > > > > > > >>> > On Thu, May 25, 2017 at 10:23 AM, Mark Bean <
>> >> > > > > mark.o.bean@gmail.com
>> >> > > > > > >
>> >> > > > > > > >>> wrote:
>> >> > > > > > > >>> >
>> >> > > > > > > >>> >> I reduced the number of embedded ZooKeeper servers on the 5-Node NiFi
>> >> > > > > > > >>> >> Cluster from 5 to 3. This has improved the situation. I do not see any of
>> >> > > > > > > >>> >> the three Nodes which are also ZK servers disconnecting/reconnecting to the
>> >> > > > > > > >>> >> cluster as before. However, the two Nodes which are not running ZK continue
>> >> > > > > > > >>> >> to disconnect and reconnect. The following is taken from one of the non-ZK
>> >> > > > > > > >>> >> Nodes. It's curious that some messages are issued twice from the same
>> >> > > > > > > >>> >> thread, but reference a different object.
>> >> > > > > > > >>> >>
>> >> > > > > > > >>> >> nifi-app.log
>> >> > > > > > > >>> >> 2017-05-25 13:40:01,628 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: SUSPENDED
>> >> > > > > > > >>> >> 2017-05-25 13:39:45,627 INFO [Clustering Tasks Thread-1] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-25 13:39:45,504 and sent to FQDN:PORT at 2017-05-25 13:39:45,627; send took 122 millis
>> >> > > > > > > >>> >> 2017-05-25 13:39:50,862 INFO [Clustering Tasks Thread-1] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-25 13:39:50,732 and sent to FQDN:PORT at 2017-05-25 13:39:50,862; send took 122 millis
>> >> > > > > > > >>> >> 2017-05-25 13:39:56,089 INFO [Clustering Tasks Thread-1] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-25 13:39:55,966 and sent to FQDN:PORT at 2017-05-25 13:39:56,089; send took 129 millis
>> >> > > > > > > >>> >> 2017-05-25 13:40:01,629 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@68f8b6a2 Connection State changed to SUSPENDED
>> >> > > > > > > >>> >> 2017-05-25 13:40:01,629 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@663f55cd Connection State changed to SUSPENDED
>> >> > > > > > > >>> >> 2017-05-25 13:40:02,412 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: RECONNECTED
>> >> > > > > > > >>> >> 2017-05-25 13:40:02,413 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@68f8b6a2 Connection State changed to RECONNECTED
>> >> > > > > > > >>> >> 2017-05-25 13:40:02,413 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@663f55cd Connection State changed to RECONNECTED
>> >> > > > > > > >>> >> 2017-05-25 13:40:02,550 INFO [Clustering Tasks Thread-1] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-25 13:40:01,632 and sent to FQDN:PORT at 2017-05-25 13:40:02,550; send took 917 millis
>> >> > > > > > > >>> >> 2017-05-25 13:40:07,787 INFO [Clustering Tasks Thread-1] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-25 13:40:07,657 and sent to FQDN:PORT at 2017-05-25 13:40:07,787; send took 129 millis
>> >> > > > > > > >>> >>
>> >> > > > > > > >>> >> I will work on setting up an external ZK next, but would still like some
>> >> > > > > > > >>> >> insight into what is being observed with the embedded ZK.
>> >> > > > > > > >>> >>
>> >> > > > > > > >>> >> Thanks,
>> >> > > > > > > >>> >> Mark
>> >> > > > > > > >>> >>
>> >> > > > > > > >>> >>
>> >> > > > > > > >>> >>
>> >> > > > > > > >>> >>
>> >> > > > > > > >>> >> On Wed, May 24, 2017 at 3:57 PM, Mark Bean <
>> >> > > > > mark.o.bean@gmail.com
>> >> > > > > > >
>> >> > > > > > > >>> wrote:
>> >> > > > > > > >>> >>
>> >> > > > > > > >>> >>> Yes, we are using the embedded ZK. We will try instantiating an external
>> >> > > > > > > >>> >>> ZK and see if that resolves the problem.
>> >> > > > > > > >>> >>>
>> >> > > > > > > >>> >>> The load on the system is extremely small. Currently (as Nodes are
>> >> > > > > > > >>> >>> disconnecting/reconnecting) all input ports to the flow are turned off. The
>> >> > > > > > > >>> >>> only data in the flow is from a single GenerateFlowFile generating 5B every
>> >> > > > > > > >>> >>> 30 secs.
>> >> > > > > > > >>> >>>
>> >> > > > > > > >>> >>> Also, it is a 5-node cluster with embedded ZK on each node. First, I will
>> >> > > > > > > >>> >>> try reducing ZK to only 3 nodes. Then, I will try a 3-node external ZK.
>> >> > > > > > > >>> >>>
>> >> > > > > > > >>> >>> Thanks,
>> >> > > > > > > >>> >>> Mark
>> >> > > > > > > >>> >>>
>> >> > > > > > > >>> >>> On Wed, May 24, 2017 at 11:49 AM, Joe Witt <
>> >> > > > joe.witt@gmail.com
>> >> > > > > >
>> >> > > > > > > wrote:
>> >> > > > > > > >>> >>>
>> >> > > > > > > >>> >>>> Are you using the embedded ZooKeeper?  If yes, we recommend using an
>> >> > > > > > > >>> >>>> external ZooKeeper.
>> >> > > > > > > >>> >>>>
>> >> > > > > > > >>> >>>> What type of load are the systems under when this occurs (cpu, network,
>> >> > > > > > > >>> >>>> memory, disk io)?  Under high load the default timeouts for clustering
>> >> > > > > > > >>> >>>> are too aggressive.  You can relax these for higher-load clusters and
>> >> > > > > > > >>> >>>> should see good behavior.  Even if the system overall is not under all
>> >> > > > > > > >>> >>>> that high a load, if you're seeing garbage collection pauses that are
>> >> > > > > > > >>> >>>> lengthy and/or frequent, it can cause the same high-load effect as far
>> >> > > > > > > >>> >>>> as the JVM is concerned.
>> >> > > > > > > >>> >>>>
>> >> > > > > > > >>> >>>> Thanks
>> >> > > > > > > >>> >>>> Joe
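
For reference, the clustering and ZooKeeper timeouts Joe refers to above are set in
nifi.properties. A minimal sketch with illustratively relaxed values (the exact numbers
are assumptions, not recommendations, and the defaults differ by release):

    # cluster protocol heartbeats and node timeouts
    nifi.cluster.protocol.heartbeat.interval=5 secs
    nifi.cluster.node.connection.timeout=30 secs
    nifi.cluster.node.read.timeout=30 secs
    # ZooKeeper client timeouts used for leader election and cluster state
    nifi.zookeeper.connect.timeout=10 secs
    nifi.zookeeper.session.timeout=10 secs

These would typically be changed on every node, and NiFi has to be restarted for the
changes to take effect.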
>> >> > > > > > > >>> >>>>
>> >> > > > > > > >>> >>>> On Wed, May 24, 2017 at 9:11 AM, Mark Bean <
>> >> > > > > > mark.o.bean@gmail.com
>> >> > > > > > > >
>> >> > > > > > > >>> >>>> wrote:
>> >> > > > > > > >>> >>>> > We have a cluster which is showing signs of instability. The Primary Node
>> >> > > > > > > >>> >>>> > and Cluster Coordinator are reassigned to different nodes every few
>> >> > > > > > > >>> >>>> > minutes. I believe this is due to missed heartbeats or some other
>> >> > > > > > > >>> >>>> > coordination failure. The following error occurs periodically in
>> >> > > > > > > >>> >>>> > nifi-app.log:
>> >> > > > > > > >>> >>>> >
>> >> > > > > > > >>> >>>> > ERROR [CommitProcessor:1] o.apache.zookeeper.server.NIOServerCnxn
>> >> > > > > > > >>> >>>> > Unexpected Exception:
>> >> > > > > > > >>> >>>> > java.nio.channels.CancelledKeyException: null
>> >> > > > > > > >>> >>>> >         at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
>> >> > > > > > > >>> >>>> >         at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
>> >> > > > > > > >>> >>>> >         at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:151)
>> >> > > > > > > >>> >>>> >         at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1081)
>> >> > > > > > > >>> >>>> >         at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:404)
>> >> > > > > > > >>> >>>> >         at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:74)
>> >> > > > > > > >>> >>>> >
>> >> > > > > > > >>> >>>> > Apache NiFi 1.2.0
>> >> > > > > > > >>> >>>> >
>> >> > > > > > > >>> >>>> > Thoughts?
>> >> > > > > > > >>> >>>>
>> >> > > > > > > >>> >>>
>> >> > > > > > > >>> >>>
>> >> > > > > > > >>> >>
>> >> > > > > > > >>>
>> >> > > > > > >
>> >> > > > > >
>> >> > > > >
>> >> > > >
>> >> > >
>> >> >
>> >>
>>

Re: unstable cluster

Posted by Mark Bean <ma...@gmail.com>.
I tried to build master (1.3.0-SNAPSHOT) but updated the zookeeper
dependency to version 3.4.10. I am not able to build successfully. A
compilation error results:

[ERROR] Failed to execute goal
org.apache.maven.plugins:maven-compiler-plugin:3.2:compile
(default-compile) on project nifi-framework-core: Compilation failure
[ERROR]
/nifi/nifi-nar/bundles/nifi-framework-bundle/nifi-framework/nifi-framework-core/src/main/java/org/apache/nifi/controller/state/server/ZooKeeperStateServer.java:
[106,25] error: no suitable constructor found for QuorumPeer(no arguments)
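
For anyone trying to reproduce this, the dependency bump itself is ordinary Maven
dependency management; a minimal sketch, assuming it is applied in the pom of the module
that currently pulls in ZooKeeper 3.4.6 (the actual NiFi build may manage the version
through a shared property instead):

    <dependencyManagement>
      <dependencies>
        <dependency>
          <groupId>org.apache.zookeeper</groupId>
          <artifactId>zookeeper</artifactId>
          <version>3.4.10</version>
        </dependency>
      </dependencies>
    </dependencyManagement>

The compilation failure above is in NiFi's own ZooKeeperStateServer, which constructs the
embedded QuorumPeer directly, so bumping the dependency alone is not enough; that code
would also have to be adapted to the 3.4.10 QuorumPeer API.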




Re: unstable cluster

Posted by Joe Witt <jo...@gmail.com>.
Just scanning through the items currently on master that would show up
in the 1.3.0 release we see numerous cluster related bug fixes.

More consistent port alignment across cluster
  https://issues.apache.org/jira/browse/NIFI-3981

Ensure controller service lifecycle handled better with different
timing/dependencies
  https://issues.apache.org/jira/browse/NIFI-3972

Insufficient heartbeat handling causing improper clustering behavior
  https://issues.apache.org/jira/browse/NIFI-3933

Improve timing of component startup relative to other lifecycle items
when clustered
  https://issues.apache.org/jira/browse/NIFI-3923

Inconsistent scheduled state in some cluster settings
  https://issues.apache.org/jira/browse/NIFI-3900

Improved fingerprinted/non-fingerprinted settings enforcement and
handling in clusters
  https://issues.apache.org/jira/browse/NIFI-1963

These are NiFi-specific cluster behavior items.  For NiFi and
ZooKeeper interaction specifically, most of the focus thus far has been
on NiFi itself, as the above JIRAs show, and of course on the cases
where a system that is heavily resource contended will simply not
have a nice embedded ZK/NiFi experience.

MarkB, your testing above suggests you were running NiFi 1.x, which
means a ZooKeeper 3.4.6 client against a ZooKeeper 3.4.10 server
cluster, and behavior was much better.  Could you possibly run the same
cluster evaluation against the latest master but with an embedded
ZooKeeper 3.4.10 in NiFi (which means both server and client
are on the latest zk 3.4.10 release)?  This would be helpful data.
Assuming that goes well, the only other concern that jumps to mind is
whether our using a ZooKeeper 3.4.10 client presents problems when talking
to older server versions (still 3.4, though, so probably OK, I'd hope).
In general we should be safe thanks to classloader isolation but we've
seen some pretty magical JVM/system classloader level changes happen
for Kerberized environments.

Thanks
Joe



>> > > > > 5B
>> > > > > > > >>> every 30
>> > > > > > > >>> >>> secs.
>> > > > > > > >>> >>>
>> > > > > > > >>> >>> Also, it is a 5-node cluster with embedded ZK on each
>> > node.
>> > > > > > First,
>> > > > > > > I
>> > > > > > > >>> will
>> > > > > > > >>> >>> try reducing ZK to only 3 nodes. Then, I will try a
>> > 3-node
>> > > > > > > external ZK.
>> > > > > > > >>> >>>
>> > > > > > > >>> >>> Thanks,
>> > > > > > > >>> >>> Mark
>> > > > > > > >>> >>>
>> > > > > > > >>> >>> On Wed, May 24, 2017 at 11:49 AM, Joe Witt <
>> > > > joe.witt@gmail.com
>> > > > > >
>> > > > > > > wrote:
>> > > > > > > >>> >>>
>> > > > > > > >>> >>>> Are you using the embedded Zookeeper?  If yes we
>> > recommend
>> > > > > using
>> > > > > > > an
>> > > > > > > >>> >>>> external zookeeper.
>> > > > > > > >>> >>>>
>> > > > > > > >>> >>>> What type of load are the systems under when this
>> occurs
>> > > > (cpu,
>> > > > > > > >>> >>>> network, memory, disk io)? Under high load the default
>> > > > > timeouts
>> > > > > > > for
>> > > > > > > >>> >>>> clustering are too aggressive.  You can relax these
>> for
>> > > > higher
>> > > > > > > load
>> > > > > > > >>> >>>> clusters and should see good behavior.  Even if the
>> > system
>> > > > > > > overall is
>> > > > > > > >>> >>>> not under all that high of load if you're seeing
>> garbage
>> > > > > > > collection
>> > > > > > > >>> >>>> pauses that are lengthy and/or frequent it can cause
>> the
>> > > > same
>> > > > > > high
>> > > > > > > >>> >>>> load effect as far as the JVM is concerned.
>> > > > > > > >>> >>>>
>> > > > > > > >>> >>>> Thanks
>> > > > > > > >>> >>>> Joe
>> > > > > > > >>> >>>>
>> > > > > > > >>> >>>> On Wed, May 24, 2017 at 9:11 AM, Mark Bean <
>> > > > > > mark.o.bean@gmail.com
>> > > > > > > >
>> > > > > > > >>> >>>> wrote:
>> > > > > > > >>> >>>> > We have a cluster which is showing signs of
>> > instability.
>> > > > The
>> > > > > > > Primary
>> > > > > > > >>> >>>> Node
>> > > > > > > >>> >>>> > and Coordinator are reassigned to different nodes
>> > every
>> > > > > > several
>> > > > > > > >>> >>>> minutes. I
>> > > > > > > >>> >>>> > believe this is due to lack of heartbeat or other
>> > > > > > coordination.
>> > > > > > > The
>> > > > > > > >>> >>>> > following error occurs periodically in the
>> > nifi-app.log
>> > > > > > > >>> >>>> >
>> > > > > > > >>> >>>> > ERROR [CommitProcessor:1] o.apache.zookeeper.server.
>> > > > > > > NIOServerCnxn
>> > > > > > > >>> >>>> > Unexpected Exception:
>> > > > > > > >>> >>>> > java.nio.channels.CancelledKeyException: null
>> > > > > > > >>> >>>> >         at sun.nio.ch.SelectionKeyImpl.en
>> > > > > > > >>> >>>> sureValid(SectionKeyImpl.java:73)
>> > > > > > > >>> >>>> >         at sun.nio.ch.SelectionKeyImpl.in
>> > > > > > > >>> >>>> terestOps(SelctionKeyImpl.java:77)
>> > > > > > > >>> >>>> >         at
>> > > > > > > >>> >>>> >
>> org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(
>> > > > > NIOServ
>> > > > > > > >>> >>>> erCnxn.java:151)
>> > > > > > > >>> >>>> >         at
>> > > > > > > >>> >>>> >
>> > org.apache.zookeeper.server.NIOServerCnXn.sendResopnse(
>> > > > > NIOSe
>> > > > > > > >>> >>>> rverCnxn.java:1081)
>> > > > > > > >>> >>>> >         at
>> > > > > > > >>> >>>> > org.apache.zookeeper.server.FinalRequestProcessor.
>> > > > > processReq
>> > > > > > > >>> >>>> uest(FinalRequestProcessor.java:404)
>> > > > > > > >>> >>>> >         at
>> > > > > > > >>> >>>> >
>> > org.apache.zookeeper.server.quorum.CommitProcessor.run(
>> > > > > Commi
>> > > > > > > >>> >>>> tProcessor.java:74)
>> > > > > > > >>> >>>> >
>> > > > > > > >>> >>>> > Apache NiFi 1.2.0
>> > > > > > > >>> >>>> >
>> > > > > > > >>> >>>> > Thoughts?
>> > > > > > > >>> >>>>
>> > > > > > > >>> >>>
>> > > > > > > >>> >>>
>> > > > > > > >>> >>
>> > > > > > > >>>
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>

Re: unstable cluster

Posted by Juan Sequeiros <he...@gmail.com>.
Hello all,

I'd like to chime in on this interesting discussion thread.

I'd like to add that my system(s) too have seen unstable ZK interaction with
both embedded and, eventually, external ZK (granted, external has been
better). We have resolved them with NIFI restarts, and it has reached the
point that we are hesitant to roll up to NIFI 1.X mainly because of this (we
have NIFI 1.X in DEV).

I also would like to add that we are greatly anticipating ZK release 3.5.X
for its TLS implementation, and as such have not voiced our experience with
NIFI / ZOOKEEPER, assuming that once ZOOKEEPER 3.5.X is out of ALPHA it would
be added into the NIFI NAR framework fairly quickly and would fix the
oddities.

I would say, though, that we have been hoping for a newer client on the NIFI
ZK side, since the current one suggests it's based off ZOOKEEPER 3.4.6, which
was released in *MAR 2014*.

# jar tf nifi-framework-nar-1.1.1.nar | grep zoo
META-INF/bundled-dependencies/zookeeper-3.4.6.jar
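
(A similar check should work against an installed instance as well; this is a
minimal sketch, assuming the default layout where the framework NAR sits
under lib/ in the NiFi install directory:

  # list the ZooKeeper client jar bundled inside the framework NAR
  unzip -l lib/nifi-framework-nar-1.1.1.nar | grep -i zookeeper
)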

And now I wonder how long it would take for NIFI to release a client based
off 3.5.X once it goes official, given the hesitation about forward
compatibility.


On Tue, May 30, 2017 at 2:52 PM Jeff <jt...@gmail.com> wrote:

> Joe,
>
> My own direct and indirect experiences with NiFi 1.x clustering have been
> good for both embedded and external zookeeper but we have certainly seen
> some emails on mailing-list about it. Those have been for high load case
> where the embedded approach would be susceptible to timing issues and
> resolved by using an external system. Mark Bean's report is interesting
> though since it happens under no real load at all.
>
> I suspect ZOOKEEPER-2044 will help that though there are several comments
> [1] (and others on that JIRA) that describe the issue as minor/false
> reporting/cosmetic/an improvement. Updating to ZooKeeper 3.4.10 suggests
> that this rare issue can be resolved in NiFi, but we'll have to do our due
> diligence to make sure that no new issues are raised with the upgrade for
> NiFi or its ability to interface with external systems. We'll have to do
> testing with other dependencies that use ZooKeeper 3.4.6 to ensure that
> forward capability.
>
> [1]
>
> https://issues.apache.org/jira/browse/ZOOKEEPER-2044?focusedCommentId=15024616&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15024616
>
> Thanks,
> Jeff
>
> On Tue, May 30, 2017 at 1:15 PM Joe Skora <js...@gmail.com> wrote:
>
> > Jeff,
> >
> > If I understand the issue correctly, this means NiFi 1.x has always been
> > broken for clustering with an embedded ZooKeeper.  That has never
> > communicated until now, we clearly build for and explain how to use an
> > embedded ZooKeeper in documentation.
> >
> > Any external non-NiFi elements that are considered in design and
> dependency
> > decisions need to be clearly understood by the entire community.  What
> > things non-NiFi are you thinking of that drive ZooKeeper dependencies?
> >
> > Joe
> >
> > On Tue, May 30, 2017 at 9:11 AM, Jeff <jt...@gmail.com> wrote:
> >
> > > Mark, we can certainly take smaller steps rather than waiting for
> > > 3.5.2/3.6.0 to come out.  I was just bringing that JIRA up as another
> > > scenario that entices us to upgrade.
> > >
> > > Joe, I'm referring to NiFi, the toolkit, and things non-NiFi that
> > provide a
> > > ZK server to which NiFi or the ZK Migration Toolkit are clients.  I'm
> not
> > > saying we can't or shouldn't upgrade, but we do need to test to make
> sure
> > > that no issues are introduced by NiFi shipping with ZK 3.4.10.  Being
> > that
> > > it's a bugfix version change, it's probably fine.
> > >
> > > - Jeff
> > >
> > > On Tue, May 30, 2017 at 10:46 AM Joe Skora <js...@gmail.com> wrote:
> > >
> > > > Jeff,
> > > >
> > > > Does that mean NiFi 1.x will be unstable when using embedded
> ZooKeeper
> > > > until the ZK version is upgrade?
> > > >
> > > > By "components outside of NiFi" do you mean the NiFi toolkit and
> other
> > > > parts of the NiFi release?
> > > >
> > > > Joe
> > > >
> > > > On Tue, May 30, 2017 at 5:42 AM, Jeff <jt...@gmail.com> wrote:
> > > >
> > > > > Mark,
> > > > >
> > > > > I did report a JIRA [1] for upgrading to 3.5.2 or 3.6.0 (just due
> to
> > > > log4j
> > > > > issues) once it's out and stable, There are issues with the way
> that
> > ZK
> > > > > refers to log4j classes in the code that cause issues for NiFi and
> > our
> > > > > Toolkit..  However there has been some back and forth [2] (in
> 3.4.0,
> > > > which
> > > > > doesn't fix the issue, but moves towards fixing it), [3], and [4]
> on
> > > the
> > > > > changes being implemented in versions 3.5.2 and 3.6.0.  Also, it
> > looks
> > > > like
> > > > > ZK 3.6.0 is headed toward using log4j 2 [5].
> > > > >
> > > > > There are many components outside of NiFi that are still using ZK
> > > 3.4.6,
> > > > so
> > > > > it may be a while before we can move to 3.4.10. I don't currently
> > know
> > > > > anything about the forward compatibility of 3.4.6.  Are there
> > > > > improvements/fixes in 3.4.10 which you need?
> > > > >
> > > > > [1] https://issues.apache.org/jira/browse/NIFI-3067
> > > > > [2] https://issues.apache.org/jira/browse/ZOOKEEPER-850
> > > > > [3] https://issues.apache.org/jira/browse/ZOOKEEPER-1371
> > > > > [4] https://issues.apache.org/jira/browse/ZOOKEEPER-2393
> > > > > [5] https://issues.apache.org/jira/browse/ZOOKEEPER-2342
> > > > >
> > > > > - Jeff
> > > > >
> > > > > On Tue, May 30, 2017 at 8:15 AM Mark Bean <ma...@gmail.com>
> > > wrote:
> > > > >
> > > > > > Updated to external ZooKeeper last Friday. Over the weekend,
> there
> > > are
> > > > no
> > > > > > reports of SUSPENDED or RECONNECTED.
> > > > > >
> > > > > > Are there plans to upgrade the embedded ZooKeeper to the latest
> > > > version,
> > > > > > 3.4.10?
> > > > > >
> > > > > > Thanks,
> > > > > > Mark
> > > > > >
> > > > > > On Thu, May 25, 2017 at 11:56 AM, Joe Witt <jo...@gmail.com>
> > > wrote:
> > > > > >
> > > > > > > looked at a secured cluster and the send times are routinely at
> > > 100ms
> > > > > > > similar to yours.  I think what i was flagging as potentially
> > > > > > > interesting is not interesting at all.
> > > > > > >
> > > > > > > On Thu, May 25, 2017 at 11:34 AM, Joe Witt <joe.witt@gmail.com
> >
> > > > wrote:
> > > > > > > > Ok.  Well as a point of comparison i'm looking at heartbeat
> > logs
> > > > from
> > > > > > > > another cluster and the times are consistently 1-3 millis for
> > the
> > > > > > > > send.  Yours above show 100+ms typical with one north of
> 900ms.
> > > > Not
> > > > > > > > sure how relevant that is but something i noticed.
> > > > > > > >
> > > > > > > > On Thu, May 25, 2017 at 11:29 AM, Mark Bean <
> > > mark.o.bean@gmail.com
> > > > >
> > > > > > > wrote:
> > > > > > > >> ping shows acceptably fast response time between servers,
> > > > > > approximately
> > > > > > > >> 0.100-0.150 ms
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> On Thu, May 25, 2017 at 11:13 AM, Joe Witt <
> > joe.witt@gmail.com>
> > > > > > wrote:
> > > > > > > >>
> > > > > > > >>> have you evaluated latency across the machines in your
> > cluster?
> > > > I
> > > > > > ask
> > > > > > > >>> because 122ms is pretty long and 917ms is very long.  Are
> > these
> > > > > nodes
> > > > > > > >>> across a WAN link?
> > > > > > > >>>
> > > > > > > >>> On Thu, May 25, 2017 at 11:08 AM, Mark Bean <
> > > > mark.o.bean@gmail.com
> > > > > >
> > > > > > > wrote:
> > > > > > > >>> > Update: now all 5 nodes, regardless of ZK server, are
> > > > indicating
> > > > > > > >>> SUSPENDED
> > > > > > > >>> > -> RECONNECTED.
> > > > > > > >>> >
> > > > > > > >>> > On Thu, May 25, 2017 at 10:23 AM, Mark Bean <
> > > > > mark.o.bean@gmail.com
> > > > > > >
> > > > > > > >>> wrote:
> > > > > > > >>> >
> > > > > > > >>> >> I reduced the number of embedded ZooKeeper servers on
> the
> > > > 5-Node
> > > > > > > NiFi
> > > > > > > >>> >> Cluster from 5 to 3. This has improved the situation. I
> do
> > > not
> > > > > see
> > > > > > > any
> > > > > > > >>> of
> > > > > > > >>> >> the three Nodes which are also ZK servers
> > > > > > > disconnecting/reconnecting to
> > > > > > > >>> the
> > > > > > > >>> >> cluster as before. However, the two Nodes which are not
> > > > running
> > > > > ZK
> > > > > > > >>> continue
> > > > > > > >>> >> to disconnect and reconnect. The following is taken from
> > one
> > > > of
> > > > > > the
> > > > > > > >>> non-ZK
> > > > > > > >>> >> Nodes. It's curious that some messages are issued twice
> > from
> > > > the
> > > > > > > same
> > > > > > > >>> >> thread, but reference a different object
> > > > > > > >>> >>
> > > > > > > >>> >> nifi-app.log
> > > > > > > >>> >> 2017-05-25 13:40:01,628 INFO [main-EventTrhead]
> > > o.a.c.f.state.
> > > > > > > >>> ConnectionStateManager
> > > > > > > >>> >> State change: SUSPENDED
> > > > > > > >>> >> 2017-05-25 13:39:45,627 INFO [Clustering Tasks Thread-1]
> > > > > > o.a.n.c.c.
> > > > > > > >>> ClusterProtocolHeaertbeater
> > > > > > > >>> >> Heartbeat create at 2017-05-25 13:39:45,504 and sent to
> > > > > FQDN:PORT
> > > > > > at
> > > > > > > >>> >> 2017-05-25 13:39:45,627; send took 122 millis
> > > > > > > >>> >> 2017-05-25 13:39:50,862 INFO [Clustering Tasks Thread-1]
> > > > > > o.a.n.c.c.
> > > > > > > >>> ClusterProtocolHeaertbeater
> > > > > > > >>> >> Heartbeat create at 2017-05-25 13:39:50,732 and sent to
> > > > > FQDN:PORT
> > > > > > at
> > > > > > > >>> >> 2017-05-25 13:39:50,862; send took 122 millis
> > > > > > > >>> >> 2017-05-25 13:39:56,089 INFO [Clustering Tasks Thread-1]
> > > > > > o.a.n.c.c.
> > > > > > > >>> ClusterProtocolHeaertbeater
> > > > > > > >>> >> Heartbeat create at 2017-05-25 13:39:55,966 and sent to
> > > > > FQDN:PORT
> > > > > > at
> > > > > > > >>> >> 2017-05-25 13:39:56,089; send took 129 millis
> > > > > > > >>> >> 2017-05-25 13:40:01,629 INFO
> > > > [Curator-ConnectionStateManager-0]
> > > > > > > >>> >> o.a.n.c.l.e.CuratorLeaderElectionManager
> > > > > > > org.apache.nifi.controller.
> > > > > > > >>> >> leader.election.CuratorLeaderElectionManager$
> > > > > > > ElectionListener@68f8b6a2
> > > > > > > >>> >> Connection State changed to SUSPENDED
> > > > > > > >>> >> 2017-05-25 13:40:01,629 INFO
> > > > [Curator-ConnectionStateManager-0]
> > > > > > > >>> >> o.a.n.c.l.e.CuratorLeaderElectionManager
> > > > > > > org.apache.nifi.controller.
> > > > > > > >>> >> leader.election.CuratorLeaderElectionManager$
> > > > > > > ElectionListener@663f55cd
> > > > > > > >>> >> Connection State changed to SUSPENDED
> > > > > > > >>> >> 2017-05-25 13:40:02,412 INFO [main-EventThread]
> > > o.a.c.f.state.
> > > > > > > >>> ConnectinoStateManager
> > > > > > > >>> >> State change: RECONNECTED
> > > > > > > >>> >> 2017-05-25 13:40:02,413 INFO
> > > > [Curator-ConnectionStateManager-0]
> > > > > > > >>> >> o.a.n.c.l.e.CuratorLeaderElectionManager
> > > > > > > org.apache.nifi.controller.
> > > > > > > >>> >> leader.election.CuratorLeaderElectionManager$
> > > > > > > ElectionListener@68f8b6a2
> > > > > > > >>> >> Connection State changed to RECONNECTED
> > > > > > > >>> >> 2017-05-25 13:40:02,413 INFO
> > > > [Curator-ConnectionStateManager-0]
> > > > > > > >>> >> o.a.n.c.l.e.CuratorLeaderElectionManager
> > > > > > > org.apache.nifi.controller.
> > > > > > > >>> >> leader.election.CuratorLeaderElectionManager$
> > > > > > > ElectionListener@663f55cd
> > > > > > > >>> >> Connection State changed to RECONNECTED
> > > > > > > >>> >> 2017-05-25 13:40:02,550 INFO [Clustering Tasks Thread-1]
> > > > > > o.a.n.c.c.
> > > > > > > >>> ClusterProtocolHeaertbeater
> > > > > > > >>> >> Heartbeat create at 2017-05-25 13:40:01,632 and sent to
> > > > > FQDN:PORT
> > > > > > at
> > > > > > > >>> >> 2017-05-25 13:40:02,550; send took 917 millis
> > > > > > > >>> >> 2017-05-25 13:40:07,787 INFO [Clustering Tasks Thread-1]
> > > > > > o.a.n.c.c.
> > > > > > > >>> ClusterProtocolHeaertbeater
> > > > > > > >>> >> Heartbeat create at 2017-05-25 13:40:07,657 and sent to
> > > > > FQDN:PORT
> > > > > > at
> > > > > > > >>> >> 2017-05-25 13:40:07,787; send took 129 millis
> > > > > > > >>> >>
> > > > > > > >>> >> I will work on setting up an external ZK next, but would
> > > still
> > > > > > like
> > > > > > > some
> > > > > > > >>> >> insight to what is being observed with the embedded ZK.
> > > > > > > >>> >>
> > > > > > > >>> >> Thanks,
> > > > > > > >>> >> Mark
> > > > > > > >>> >>
> > > > > > > >>> >>
> > > > > > > >>> >>
> > > > > > > >>> >>
> > > > > > > >>> >> On Wed, May 24, 2017 at 3:57 PM, Mark Bean <
> > > > > mark.o.bean@gmail.com
> > > > > > >
> > > > > > > >>> wrote:
> > > > > > > >>> >>
> > > > > > > >>> >>> Yes, we are using the embedded ZK. We will try
> > > instantiating
> > > > > and
> > > > > > > >>> external
> > > > > > > >>> >>> ZK and see if that resolves the problem.
> > > > > > > >>> >>>
> > > > > > > >>> >>> The load on the system is extremely small. Currently
> (as
> > > > Nodes
> > > > > > are
> > > > > > > >>> >>> disconnecting/reconnecting) all input ports to the flow
> > are
> > > > > > turned
> > > > > > > >>> off. The
> > > > > > > >>> >>> only data in the flow is from a single GenerateFlow
> > > > generating
> > > > > 5B
> > > > > > > >>> every 30
> > > > > > > >>> >>> secs.
> > > > > > > >>> >>>
> > > > > > > >>> >>> Also, it is a 5-node cluster with embedded ZK on each
> > node.
> > > > > > First,
> > > > > > > I
> > > > > > > >>> will
> > > > > > > >>> >>> try reducing ZK to only 3 nodes. Then, I will try a
> > 3-node
> > > > > > > external ZK.
> > > > > > > >>> >>>
> > > > > > > >>> >>> Thanks,
> > > > > > > >>> >>> Mark
> > > > > > > >>> >>>
> > > > > > > >>> >>> On Wed, May 24, 2017 at 11:49 AM, Joe Witt <
> > > > joe.witt@gmail.com
> > > > > >
> > > > > > > wrote:
> > > > > > > >>> >>>
> > > > > > > >>> >>>> Are you using the embedded Zookeeper?  If yes we
> > recommend
> > > > > using
> > > > > > > an
> > > > > > > >>> >>>> external zookeeper.
> > > > > > > >>> >>>>
> > > > > > > >>> >>>> What type of load are the systems under when this
> occurs
> > > > (cpu,
> > > > > > > >>> >>>> network, memory, disk io)? Under high load the default
> > > > > timeouts
> > > > > > > for
> > > > > > > >>> >>>> clustering are too aggressive.  You can relax these
> for
> > > > higher
> > > > > > > load
> > > > > > > >>> >>>> clusters and should see good behavior.  Even if the
> > system
> > > > > > > overall is
> > > > > > > >>> >>>> not under all that high of load if you're seeing
> garbage
> > > > > > > collection
> > > > > > > >>> >>>> pauses that are lengthy and/or frequent it can cause
> the
> > > > same
> > > > > > high
> > > > > > > >>> >>>> load effect as far as the JVM is concerned.
> > > > > > > >>> >>>>
> > > > > > > >>> >>>> Thanks
> > > > > > > >>> >>>> Joe
> > > > > > > >>> >>>>
> > > > > > > >>> >>>> On Wed, May 24, 2017 at 9:11 AM, Mark Bean <
> > > > > > mark.o.bean@gmail.com
> > > > > > > >
> > > > > > > >>> >>>> wrote:
> > > > > > > >>> >>>> > We have a cluster which is showing signs of
> > instability.
> > > > The
> > > > > > > Primary
> > > > > > > >>> >>>> Node
> > > > > > > >>> >>>> > and Coordinator are reassigned to different nodes
> > every
> > > > > > several
> > > > > > > >>> >>>> minutes. I
> > > > > > > >>> >>>> > believe this is due to lack of heartbeat or other
> > > > > > coordination.
> > > > > > > The
> > > > > > > >>> >>>> > following error occurs periodically in the
> > nifi-app.log
> > > > > > > >>> >>>> >
> > > > > > > >>> >>>> > ERROR [CommitProcessor:1] o.apache.zookeeper.server.
> > > > > > > NIOServerCnxn
> > > > > > > >>> >>>> > Unexpected Exception:
> > > > > > > >>> >>>> > java.nio.channels.CancelledKeyException: null
> > > > > > > >>> >>>> >         at sun.nio.ch.SelectionKeyImpl.en
> > > > > > > >>> >>>> sureValid(SectionKeyImpl.java:73)
> > > > > > > >>> >>>> >         at sun.nio.ch.SelectionKeyImpl.in
> > > > > > > >>> >>>> terestOps(SelctionKeyImpl.java:77)
> > > > > > > >>> >>>> >         at
> > > > > > > >>> >>>> >
> org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(
> > > > > NIOServ
> > > > > > > >>> >>>> erCnxn.java:151)
> > > > > > > >>> >>>> >         at
> > > > > > > >>> >>>> >
> > org.apache.zookeeper.server.NIOServerCnXn.sendResopnse(
> > > > > NIOSe
> > > > > > > >>> >>>> rverCnxn.java:1081)
> > > > > > > >>> >>>> >         at
> > > > > > > >>> >>>> > org.apache.zookeeper.server.FinalRequestProcessor.
> > > > > processReq
> > > > > > > >>> >>>> uest(FinalRequestProcessor.java:404)
> > > > > > > >>> >>>> >         at
> > > > > > > >>> >>>> >
> > org.apache.zookeeper.server.quorum.CommitProcessor.run(
> > > > > Commi
> > > > > > > >>> >>>> tProcessor.java:74)
> > > > > > > >>> >>>> >
> > > > > > > >>> >>>> > Apache NiFi 1.2.0
> > > > > > > >>> >>>> >
> > > > > > > >>> >>>> > Thoughts?
> > > > > > > >>> >>>>
> > > > > > > >>> >>>
> > > > > > > >>> >>>
> > > > > > > >>> >>
> > > > > > > >>>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: unstable cluster

Posted by Jeff <jt...@gmail.com>.
Joe,

My own direct and indirect experiences with NiFi 1.x clustering have been
good for both embedded and external ZooKeeper, but we have certainly seen
some emails on the mailing list about it. Those have been for high-load cases
where the embedded approach would be susceptible to timing issues and were
resolved by using an external system. Mark Bean's report is interesting,
though, since it happens under no real load at all.

I suspect ZOOKEEPER-2044 will help that, though there are several comments
[1] (and others on that JIRA) that describe the issue as minor/false
reporting/cosmetic/an improvement. Updating to ZooKeeper 3.4.10 suggests
that this rare issue can be resolved in NiFi, but we'll have to do our due
diligence to make sure that no new issues are raised with the upgrade for
NiFi or its ability to interface with external systems. We'll have to do
testing with other dependencies that use ZooKeeper 3.4.6 to ensure forward
compatibility.

[1]
https://issues.apache.org/jira/browse/ZOOKEEPER-2044?focusedCommentId=15024616&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15024616
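
(For anyone who wants to experiment with such an upgrade locally before it is
settled, a minimal sketch would be to rebuild with the ZooKeeper dependency
bumped. This assumes the version is exposed as a Maven property in the root
pom; the property name below is hypothetical, and if the version is
hard-coded instead, the corresponding dependencyManagement entry would need
to be edited by hand:

  # rebuild NiFi against a newer ZooKeeper client, for testing only
  mvn clean install -DskipTests -Dzookeeper.version=3.4.10
)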

Thanks,
Jeff

On Tue, May 30, 2017 at 1:15 PM Joe Skora <js...@gmail.com> wrote:

> Jeff,
>
> If I understand the issue correctly, this means NiFi 1.x has always been
> broken for clustering with an embedded ZooKeeper.  That has never
> communicated until now, we clearly build for and explain how to use an
> embedded ZooKeeper in documentation.
>
> Any external non-NiFi elements that are considered in design and dependency
> decisions need to be clearly understood by the entire community.  What
> things non-NiFi are you thinking of that drive ZooKeeper dependencies?
>
> Joe
>
> On Tue, May 30, 2017 at 9:11 AM, Jeff <jt...@gmail.com> wrote:
>
> > Mark, we can certainly take smaller steps rather than waiting for
> > 3.5.2/3.6.0 to come out.  I was just bringing that JIRA up as another
> > scenario that entices us to upgrade.
> >
> > Joe, I'm referring to NiFi, the toolkit, and things non-NiFi that
> provide a
> > ZK server to which NiFi or the ZK Migration Toolkit are clients.  I'm not
> > saying we can't or shouldn't upgrade, but we do need to test to make sure
> > that no issues are introduced by NiFi shipping with ZK 3.4.10.  Being
> that
> > it's a bugfix version change, it's probably fine.
> >
> > - Jeff
> >
> > On Tue, May 30, 2017 at 10:46 AM Joe Skora <js...@gmail.com> wrote:
> >
> > > Jeff,
> > >
> > > Does that mean NiFi 1.x will be unstable when using embedded ZooKeeper
> > > until the ZK version is upgrade?
> > >
> > > By "components outside of NiFi" do you mean the NiFi toolkit and other
> > > parts of the NiFi release?
> > >
> > > Joe
> > >
> > > On Tue, May 30, 2017 at 5:42 AM, Jeff <jt...@gmail.com> wrote:
> > >
> > > > Mark,
> > > >
> > > > I did report a JIRA [1] for upgrading to 3.5.2 or 3.6.0 (just due to
> > > log4j
> > > > issues) once it's out and stable, There are issues with the way that
> ZK
> > > > refers to log4j classes in the code that cause issues for NiFi and
> our
> > > > Toolkit..  However there has been some back and forth [2] (in 3.4.0,
> > > which
> > > > doesn't fix the issue, but moves towards fixing it), [3], and [4] on
> > the
> > > > changes being implemented in versions 3.5.2 and 3.6.0.  Also, it
> looks
> > > like
> > > > ZK 3.6.0 is headed toward using log4j 2 [5].
> > > >
> > > > There are many components outside of NiFi that are still using ZK
> > 3.4.6,
> > > so
> > > > it may be a while before we can move to 3.4.10. I don't currently
> know
> > > > anything about the forward compatibility of 3.4.6.  Are there
> > > > improvements/fixes in 3.4.10 which you need?
> > > >
> > > > [1] https://issues.apache.org/jira/browse/NIFI-3067
> > > > [2] https://issues.apache.org/jira/browse/ZOOKEEPER-850
> > > > [3] https://issues.apache.org/jira/browse/ZOOKEEPER-1371
> > > > [4] https://issues.apache.org/jira/browse/ZOOKEEPER-2393
> > > > [5] https://issues.apache.org/jira/browse/ZOOKEEPER-2342
> > > >
> > > > - Jeff
> > > >
> > > > On Tue, May 30, 2017 at 8:15 AM Mark Bean <ma...@gmail.com>
> > wrote:
> > > >
> > > > > Updated to external ZooKeeper last Friday. Over the weekend, there
> > are
> > > no
> > > > > reports of SUSPENDED or RECONNECTED.
> > > > >
> > > > > Are there plans to upgrade the embedded ZooKeeper to the latest
> > > version,
> > > > > 3.4.10?
> > > > >
> > > > > Thanks,
> > > > > Mark
> > > > >
> > > > > On Thu, May 25, 2017 at 11:56 AM, Joe Witt <jo...@gmail.com>
> > wrote:
> > > > >
> > > > > > looked at a secured cluster and the send times are routinely at
> > 100ms
> > > > > > similar to yours.  I think what i was flagging as potentially
> > > > > > interesting is not interesting at all.
> > > > > >
> > > > > > On Thu, May 25, 2017 at 11:34 AM, Joe Witt <jo...@gmail.com>
> > > wrote:
> > > > > > > Ok.  Well as a point of comparison i'm looking at heartbeat
> logs
> > > from
> > > > > > > another cluster and the times are consistently 1-3 millis for
> the
> > > > > > > send.  Yours above show 100+ms typical with one north of 900ms.
> > > Not
> > > > > > > sure how relevant that is but something i noticed.
> > > > > > >
> > > > > > > On Thu, May 25, 2017 at 11:29 AM, Mark Bean <
> > mark.o.bean@gmail.com
> > > >
> > > > > > wrote:
> > > > > > >> ping shows acceptably fast response time between servers,
> > > > > approximately
> > > > > > >> 0.100-0.150 ms
> > > > > > >>
> > > > > > >>
> > > > > > >> On Thu, May 25, 2017 at 11:13 AM, Joe Witt <
> joe.witt@gmail.com>
> > > > > wrote:
> > > > > > >>
> > > > > > >>> have you evaluated latency across the machines in your
> cluster?
> > > I
> > > > > ask
> > > > > > >>> because 122ms is pretty long and 917ms is very long.  Are
> these
> > > > nodes
> > > > > > >>> across a WAN link?
> > > > > > >>>
> > > > > > >>> On Thu, May 25, 2017 at 11:08 AM, Mark Bean <
> > > mark.o.bean@gmail.com
> > > > >
> > > > > > wrote:
> > > > > > >>> > Update: now all 5 nodes, regardless of ZK server, are
> > > indicating
> > > > > > >>> SUSPENDED
> > > > > > >>> > -> RECONNECTED.
> > > > > > >>> >
> > > > > > >>> > On Thu, May 25, 2017 at 10:23 AM, Mark Bean <
> > > > mark.o.bean@gmail.com
> > > > > >
> > > > > > >>> wrote:
> > > > > > >>> >
> > > > > > >>> >> I reduced the number of embedded ZooKeeper servers on the
> > > 5-Node
> > > > > > NiFi
> > > > > > >>> >> Cluster from 5 to 3. This has improved the situation. I do
> > not
> > > > see
> > > > > > any
> > > > > > >>> of
> > > > > > >>> >> the three Nodes which are also ZK servers
> > > > > > disconnecting/reconnecting to
> > > > > > >>> the
> > > > > > >>> >> cluster as before. However, the two Nodes which are not
> > > running
> > > > ZK
> > > > > > >>> continue
> > > > > > >>> >> to disconnect and reconnect. The following is taken from
> one
> > > of
> > > > > the
> > > > > > >>> non-ZK
> > > > > > >>> >> Nodes. It's curious that some messages are issued twice
> from
> > > the
> > > > > > same
> > > > > > >>> >> thread, but reference a different object
> > > > > > >>> >>
> > > > > > >>> >> nifi-app.log
> > > > > > >>> >> 2017-05-25 13:40:01,628 INFO [main-EventTrhead]
> > o.a.c.f.state.
> > > > > > >>> ConnectionStateManager
> > > > > > >>> >> State change: SUSPENDED
> > > > > > >>> >> 2017-05-25 13:39:45,627 INFO [Clustering Tasks Thread-1]
> > > > > o.a.n.c.c.
> > > > > > >>> ClusterProtocolHeaertbeater
> > > > > > >>> >> Heartbeat create at 2017-05-25 13:39:45,504 and sent to
> > > > FQDN:PORT
> > > > > at
> > > > > > >>> >> 2017-05-25 13:39:45,627; send took 122 millis
> > > > > > >>> >> 2017-05-25 13:39:50,862 INFO [Clustering Tasks Thread-1]
> > > > > o.a.n.c.c.
> > > > > > >>> ClusterProtocolHeaertbeater
> > > > > > >>> >> Heartbeat create at 2017-05-25 13:39:50,732 and sent to
> > > > FQDN:PORT
> > > > > at
> > > > > > >>> >> 2017-05-25 13:39:50,862; send took 122 millis
> > > > > > >>> >> 2017-05-25 13:39:56,089 INFO [Clustering Tasks Thread-1]
> > > > > o.a.n.c.c.
> > > > > > >>> ClusterProtocolHeaertbeater
> > > > > > >>> >> Heartbeat create at 2017-05-25 13:39:55,966 and sent to
> > > > FQDN:PORT
> > > > > at
> > > > > > >>> >> 2017-05-25 13:39:56,089; send took 129 millis
> > > > > > >>> >> 2017-05-25 13:40:01,629 INFO
> > > [Curator-ConnectionStateManager-0]
> > > > > > >>> >> o.a.n.c.l.e.CuratorLeaderElectionManager
> > > > > > org.apache.nifi.controller.
> > > > > > >>> >> leader.election.CuratorLeaderElectionManager$
> > > > > > ElectionListener@68f8b6a2
> > > > > > >>> >> Connection State changed to SUSPENDED
> > > > > > >>> >> 2017-05-25 13:40:01,629 INFO
> > > [Curator-ConnectionStateManager-0]
> > > > > > >>> >> o.a.n.c.l.e.CuratorLeaderElectionManager
> > > > > > org.apache.nifi.controller.
> > > > > > >>> >> leader.election.CuratorLeaderElectionManager$
> > > > > > ElectionListener@663f55cd
> > > > > > >>> >> Connection State changed to SUSPENDED
> > > > > > >>> >> 2017-05-25 13:40:02,412 INFO [main-EventThread]
> > o.a.c.f.state.
> > > > > > >>> ConnectinoStateManager
> > > > > > >>> >> State change: RECONNECTED
> > > > > > >>> >> 2017-05-25 13:40:02,413 INFO
> > > [Curator-ConnectionStateManager-0]
> > > > > > >>> >> o.a.n.c.l.e.CuratorLeaderElectionManager
> > > > > > org.apache.nifi.controller.
> > > > > > >>> >> leader.election.CuratorLeaderElectionManager$
> > > > > > ElectionListener@68f8b6a2
> > > > > > >>> >> Connection State changed to RECONNECTED
> > > > > > >>> >> 2017-05-25 13:40:02,413 INFO
> > > [Curator-ConnectionStateManager-0]
> > > > > > >>> >> o.a.n.c.l.e.CuratorLeaderElectionManager
> > > > > > org.apache.nifi.controller.
> > > > > > >>> >> leader.election.CuratorLeaderElectionManager$
> > > > > > ElectionListener@663f55cd
> > > > > > >>> >> Connection State changed to RECONNECTED
> > > > > > >>> >> 2017-05-25 13:40:02,550 INFO [Clustering Tasks Thread-1]
> > > > > o.a.n.c.c.
> > > > > > >>> ClusterProtocolHeaertbeater
> > > > > > >>> >> Heartbeat create at 2017-05-25 13:40:01,632 and sent to
> > > > FQDN:PORT
> > > > > at
> > > > > > >>> >> 2017-05-25 13:40:02,550; send took 917 millis
> > > > > > >>> >> 2017-05-25 13:40:07,787 INFO [Clustering Tasks Thread-1]
> > > > > o.a.n.c.c.
> > > > > > >>> ClusterProtocolHeaertbeater
> > > > > > >>> >> Heartbeat create at 2017-05-25 13:40:07,657 and sent to
> > > > FQDN:PORT
> > > > > at
> > > > > > >>> >> 2017-05-25 13:40:07,787; send took 129 millis
> > > > > > >>> >>
> > > > > > >>> >> I will work on setting up an external ZK next, but would
> > still
> > > > > like
> > > > > > some
> > > > > > >>> >> insight to what is being observed with the embedded ZK.
> > > > > > >>> >>
> > > > > > >>> >> Thanks,
> > > > > > >>> >> Mark
> > > > > > >>> >>
> > > > > > >>> >>
> > > > > > >>> >>
> > > > > > >>> >>
> > > > > > >>> >> On Wed, May 24, 2017 at 3:57 PM, Mark Bean <
> > > > mark.o.bean@gmail.com
> > > > > >
> > > > > > >>> wrote:
> > > > > > >>> >>
> > > > > > >>> >>> Yes, we are using the embedded ZK. We will try
> > instantiating
> > > > and
> > > > > > >>> external
> > > > > > >>> >>> ZK and see if that resolves the problem.
> > > > > > >>> >>>
> > > > > > >>> >>> The load on the system is extremely small. Currently (as
> > > Nodes
> > > > > are
> > > > > > >>> >>> disconnecting/reconnecting) all input ports to the flow
> are
> > > > > turned
> > > > > > >>> off. The
> > > > > > >>> >>> only data in the flow is from a single GenerateFlow
> > > generating
> > > > 5B
> > > > > > >>> every 30
> > > > > > >>> >>> secs.
> > > > > > >>> >>>
> > > > > > >>> >>> Also, it is a 5-node cluster with embedded ZK on each
> node.
> > > > > First,
> > > > > > I
> > > > > > >>> will
> > > > > > >>> >>> try reducing ZK to only 3 nodes. Then, I will try a
> 3-node
> > > > > > external ZK.
> > > > > > >>> >>>
> > > > > > >>> >>> Thanks,
> > > > > > >>> >>> Mark
> > > > > > >>> >>>
> > > > > > >>> >>> On Wed, May 24, 2017 at 11:49 AM, Joe Witt <
> > > joe.witt@gmail.com
> > > > >
> > > > > > wrote:
> > > > > > >>> >>>
> > > > > > >>> >>>> Are you using the embedded Zookeeper?  If yes we
> recommend
> > > > using
> > > > > > an
> > > > > > >>> >>>> external zookeeper.
> > > > > > >>> >>>>
> > > > > > >>> >>>> What type of load are the systems under when this occurs
> > > (cpu,
> > > > > > >>> >>>> network, memory, disk io)? Under high load the default
> > > > timeouts
> > > > > > for
> > > > > > >>> >>>> clustering are too aggressive.  You can relax these for
> > > higher
> > > > > > load
> > > > > > >>> >>>> clusters and should see good behavior.  Even if the
> system
> > > > > > overall is
> > > > > > >>> >>>> not under all that high of load if you're seeing garbage
> > > > > > collection
> > > > > > >>> >>>> pauses that are lengthy and/or frequent it can cause the
> > > same
> > > > > high
> > > > > > >>> >>>> load effect as far as the JVM is concerned.
> > > > > > >>> >>>>
> > > > > > >>> >>>> Thanks
> > > > > > >>> >>>> Joe
> > > > > > >>> >>>>
> > > > > > >>> >>>> On Wed, May 24, 2017 at 9:11 AM, Mark Bean <
> > > > > mark.o.bean@gmail.com
> > > > > > >
> > > > > > >>> >>>> wrote:
> > > > > > >>> >>>> > We have a cluster which is showing signs of
> instability.
> > > The
> > > > > > Primary
> > > > > > >>> >>>> Node
> > > > > > >>> >>>> > and Coordinator are reassigned to different nodes
> every
> > > > > several
> > > > > > >>> >>>> minutes. I
> > > > > > >>> >>>> > believe this is due to lack of heartbeat or other
> > > > > coordination.
> > > > > > The
> > > > > > >>> >>>> > following error occurs periodically in the
> nifi-app.log
> > > > > > >>> >>>> >
> > > > > > >>> >>>> > ERROR [CommitProcessor:1] o.apache.zookeeper.server.
> > > > > > NIOServerCnxn
> > > > > > >>> >>>> > Unexpected Exception:
> > > > > > >>> >>>> > java.nio.channels.CancelledKeyException: null
> > > > > > >>> >>>> >         at sun.nio.ch.SelectionKeyImpl.en
> > > > > > >>> >>>> sureValid(SectionKeyImpl.java:73)
> > > > > > >>> >>>> >         at sun.nio.ch.SelectionKeyImpl.in
> > > > > > >>> >>>> terestOps(SelctionKeyImpl.java:77)
> > > > > > >>> >>>> >         at
> > > > > > >>> >>>> > org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(
> > > > NIOServ
> > > > > > >>> >>>> erCnxn.java:151)
> > > > > > >>> >>>> >         at
> > > > > > >>> >>>> >
> org.apache.zookeeper.server.NIOServerCnXn.sendResopnse(
> > > > NIOSe
> > > > > > >>> >>>> rverCnxn.java:1081)
> > > > > > >>> >>>> >         at
> > > > > > >>> >>>> > org.apache.zookeeper.server.FinalRequestProcessor.
> > > > processReq
> > > > > > >>> >>>> uest(FinalRequestProcessor.java:404)
> > > > > > >>> >>>> >         at
> > > > > > >>> >>>> >
> org.apache.zookeeper.server.quorum.CommitProcessor.run(
> > > > Commi
> > > > > > >>> >>>> tProcessor.java:74)
> > > > > > >>> >>>> >
> > > > > > >>> >>>> > Apache NiFi 1.2.0
> > > > > > >>> >>>> >
> > > > > > >>> >>>> > Thoughts?
> > > > > > >>> >>>>
> > > > > > >>> >>>
> > > > > > >>> >>>
> > > > > > >>> >>
> > > > > > >>>
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: unstable cluster

Posted by Joe Skora <js...@gmail.com>.
Jeff,

If I understand the issue correctly, this means NiFi 1.x has always been
broken for clustering with an embedded ZooKeeper.  That has never been
communicated until now; we clearly build for and explain how to use an
embedded ZooKeeper in the documentation.

Any external non-NiFi elements that are considered in design and dependency
decisions need to be clearly understood by the entire community.  What
non-NiFi things are you thinking of that drive ZooKeeper dependencies?

Joe

On Tue, May 30, 2017 at 9:11 AM, Jeff <jt...@gmail.com> wrote:

> Mark, we can certainly take smaller steps rather than waiting for
> 3.5.2/3.6.0 to come out.  I was just bringing that JIRA up as another
> scenario that entices us to upgrade.
>
> Joe, I'm referring to NiFi, the toolkit, and things non-NiFi that provide a
> ZK server to which NiFi or the ZK Migration Toolkit are clients.  I'm not
> saying we can't or shouldn't upgrade, but we do need to test to make sure
> that no issues are introduced by NiFi shipping with ZK 3.4.10.  Being that
> it's a bugfix version change, it's probably fine.
>
> - Jeff
>
> On Tue, May 30, 2017 at 10:46 AM Joe Skora <js...@gmail.com> wrote:
>
> > Jeff,
> >
> > Does that mean NiFi 1.x will be unstable when using embedded ZooKeeper
> > until the ZK version is upgrade?
> >
> > By "components outside of NiFi" do you mean the NiFi toolkit and other
> > parts of the NiFi release?
> >
> > Joe
> >
> > On Tue, May 30, 2017 at 5:42 AM, Jeff <jt...@gmail.com> wrote:
> >
> > > Mark,
> > >
> > > I did report a JIRA [1] for upgrading to 3.5.2 or 3.6.0 (just due to
> > log4j
> > > issues) once it's out and stable, There are issues with the way that ZK
> > > refers to log4j classes in the code that cause issues for NiFi and our
> > > Toolkit..  However there has been some back and forth [2] (in 3.4.0,
> > which
> > > doesn't fix the issue, but moves towards fixing it), [3], and [4] on
> the
> > > changes being implemented in versions 3.5.2 and 3.6.0.  Also, it looks
> > like
> > > ZK 3.6.0 is headed toward using log4j 2 [5].
> > >
> > > There are many components outside of NiFi that are still using ZK
> 3.4.6,
> > so
> > > it may be a while before we can move to 3.4.10. I don't currently know
> > > anything about the forward compatibility of 3.4.6.  Are there
> > > improvements/fixes in 3.4.10 which you need?
> > >
> > > [1] https://issues.apache.org/jira/browse/NIFI-3067
> > > [2] https://issues.apache.org/jira/browse/ZOOKEEPER-850
> > > [3] https://issues.apache.org/jira/browse/ZOOKEEPER-1371
> > > [4] https://issues.apache.org/jira/browse/ZOOKEEPER-2393
> > > [5] https://issues.apache.org/jira/browse/ZOOKEEPER-2342
> > >
> > > - Jeff
> > >
> > > On Tue, May 30, 2017 at 8:15 AM Mark Bean <ma...@gmail.com>
> wrote:
> > >
> > > > Updated to external ZooKeeper last Friday. Over the weekend, there
> are
> > no
> > > > reports of SUSPENDED or RECONNECTED.
> > > >
> > > > Are there plans to upgrade the embedded ZooKeeper to the latest
> > version,
> > > > 3.4.10?
> > > >
> > > > Thanks,
> > > > Mark
> > > >
> > > > On Thu, May 25, 2017 at 11:56 AM, Joe Witt <jo...@gmail.com>
> wrote:
> > > >
> > > > > looked at a secured cluster and the send times are routinely at
> 100ms
> > > > > similar to yours.  I think what i was flagging as potentially
> > > > > interesting is not interesting at all.
> > > > >
> > > > > On Thu, May 25, 2017 at 11:34 AM, Joe Witt <jo...@gmail.com>
> > wrote:
> > > > > > Ok.  Well as a point of comparison i'm looking at heartbeat logs
> > from
> > > > > > another cluster and the times are consistently 1-3 millis for the
> > > > > > send.  Yours above show 100+ms typical with one north of 900ms.
> > Not
> > > > > > sure how relevant that is but something i noticed.
> > > > > >
> > > > > > On Thu, May 25, 2017 at 11:29 AM, Mark Bean <
> mark.o.bean@gmail.com
> > >
> > > > > wrote:
> > > > > >> ping shows acceptably fast response time between servers,
> > > > approximately
> > > > > >> 0.100-0.150 ms
> > > > > >>
> > > > > >>
> > > > > >> On Thu, May 25, 2017 at 11:13 AM, Joe Witt <jo...@gmail.com>
> > > > wrote:
> > > > > >>
> > > > > >>> have you evaluated latency across the machines in your cluster?
> > I
> > > > ask
> > > > > >>> because 122ms is pretty long and 917ms is very long.  Are these
> > > nodes
> > > > > >>> across a WAN link?
> > > > > >>>
> > > > > >>> On Thu, May 25, 2017 at 11:08 AM, Mark Bean <
> > mark.o.bean@gmail.com
> > > >
> > > > > wrote:
> > > > > >>> > Update: now all 5 nodes, regardless of ZK server, are
> > indicating
> > > > > >>> SUSPENDED
> > > > > >>> > -> RECONNECTED.
> > > > > >>> >
> > > > > >>> > On Thu, May 25, 2017 at 10:23 AM, Mark Bean <
> > > mark.o.bean@gmail.com
> > > > >
> > > > > >>> wrote:
> > > > > >>> >
> > > > > >>> >> I reduced the number of embedded ZooKeeper servers on the
> > 5-Node
> > > > > NiFi
> > > > > >>> >> Cluster from 5 to 3. This has improved the situation. I do
> not
> > > see
> > > > > any
> > > > > >>> of
> > > > > >>> >> the three Nodes which are also ZK servers
> > > > > disconnecting/reconnecting to
> > > > > >>> the
> > > > > >>> >> cluster as before. However, the two Nodes which are not
> > running
> > > ZK
> > > > > >>> continue
> > > > > >>> >> to disconnect and reconnect. The following is taken from one
> > of
> > > > the
> > > > > >>> non-ZK
> > > > > >>> >> Nodes. It's curious that some messages are issued twice from
> > the
> > > > > same
> > > > > >>> >> thread, but reference a different object
> > > > > >>> >>
> > > > > >>> >> nifi-app.log
> > > > > >>> >> 2017-05-25 13:40:01,628 INFO [main-EventTrhead]
> o.a.c.f.state.
> > > > > >>> ConnectionStateManager
> > > > > >>> >> State change: SUSPENDED
> > > > > >>> >> 2017-05-25 13:39:45,627 INFO [Clustering Tasks Thread-1]
> > > > o.a.n.c.c.
> > > > > >>> ClusterProtocolHeaertbeater
> > > > > >>> >> Heartbeat create at 2017-05-25 13:39:45,504 and sent to
> > > FQDN:PORT
> > > > at
> > > > > >>> >> 2017-05-25 13:39:45,627; send took 122 millis
> > > > > >>> >> 2017-05-25 13:39:50,862 INFO [Clustering Tasks Thread-1]
> > > > o.a.n.c.c.
> > > > > >>> ClusterProtocolHeaertbeater
> > > > > >>> >> Heartbeat create at 2017-05-25 13:39:50,732 and sent to
> > > FQDN:PORT
> > > > at
> > > > > >>> >> 2017-05-25 13:39:50,862; send took 122 millis
> > > > > >>> >> 2017-05-25 13:39:56,089 INFO [Clustering Tasks Thread-1]
> > > > o.a.n.c.c.
> > > > > >>> ClusterProtocolHeaertbeater
> > > > > >>> >> Heartbeat create at 2017-05-25 13:39:55,966 and sent to
> > > FQDN:PORT
> > > > at
> > > > > >>> >> 2017-05-25 13:39:56,089; send took 129 millis
> > > > > >>> >> 2017-05-25 13:40:01,629 INFO
> > [Curator-ConnectionStateManager-0]
> > > > > >>> >> o.a.n.c.l.e.CuratorLeaderElectionManager
> > > > > org.apache.nifi.controller.
> > > > > >>> >> leader.election.CuratorLeaderElectionManager$
> > > > > ElectionListener@68f8b6a2
> > > > > >>> >> Connection State changed to SUSPENDED
> > > > > >>> >> 2017-05-25 13:40:01,629 INFO
> > [Curator-ConnectionStateManager-0]
> > > > > >>> >> o.a.n.c.l.e.CuratorLeaderElectionManager
> > > > > org.apache.nifi.controller.
> > > > > >>> >> leader.election.CuratorLeaderElectionManager$
> > > > > ElectionListener@663f55cd
> > > > > >>> >> Connection State changed to SUSPENDED
> > > > > >>> >> 2017-05-25 13:40:02,412 INFO [main-EventThread]
> o.a.c.f.state.
> > > > > >>> ConnectinoStateManager
> > > > > >>> >> State change: RECONNECTED
> > > > > >>> >> 2017-05-25 13:40:02,413 INFO
> > [Curator-ConnectionStateManager-0]
> > > > > >>> >> o.a.n.c.l.e.CuratorLeaderElectionManager
> > > > > org.apache.nifi.controller.
> > > > > >>> >> leader.election.CuratorLeaderElectionManager$
> > > > > ElectionListener@68f8b6a2
> > > > > >>> >> Connection State changed to RECONNECTED
> > > > > >>> >> 2017-05-25 13:40:02,413 INFO
> > [Curator-ConnectionStateManager-0]
> > > > > >>> >> o.a.n.c.l.e.CuratorLeaderElectionManager
> > > > > org.apache.nifi.controller.
> > > > > >>> >> leader.election.CuratorLeaderElectionManager$
> > > > > ElectionListener@663f55cd
> > > > > >>> >> Connection State changed to RECONNECTED
> > > > > >>> >> 2017-05-25 13:40:02,550 INFO [Clustering Tasks Thread-1]
> > > > o.a.n.c.c.
> > > > > >>> ClusterProtocolHeaertbeater
> > > > > >>> >> Heartbeat create at 2017-05-25 13:40:01,632 and sent to
> > > FQDN:PORT
> > > > at
> > > > > >>> >> 2017-05-25 13:40:02,550; send took 917 millis
> > > > > >>> >> 2017-05-25 13:40:07,787 INFO [Clustering Tasks Thread-1]
> > > > o.a.n.c.c.
> > > > > >>> ClusterProtocolHeaertbeater
> > > > > >>> >> Heartbeat create at 2017-05-25 13:40:07,657 and sent to
> > > FQDN:PORT
> > > > at
> > > > > >>> >> 2017-05-25 13:40:07,787; send took 129 millis
> > > > > >>> >>
> > > > > >>> >> I will work on setting up an external ZK next, but would
> still
> > > > like
> > > > > some
> > > > > >>> >> insight to what is being observed with the embedded ZK.
> > > > > >>> >>
> > > > > >>> >> Thanks,
> > > > > >>> >> Mark
> > > > > >>> >>
> > > > > >>> >>
> > > > > >>> >>
> > > > > >>> >>
> > > > > >>> >> On Wed, May 24, 2017 at 3:57 PM, Mark Bean <
> > > mark.o.bean@gmail.com
> > > > >
> > > > > >>> wrote:
> > > > > >>> >>
> > > > > >>> >>> Yes, we are using the embedded ZK. We will try
> instantiating
> > > and
> > > > > >>> external
> > > > > >>> >>> ZK and see if that resolves the problem.
> > > > > >>> >>>
> > > > > >>> >>> The load on the system is extremely small. Currently (as
> > Nodes
> > > > are
> > > > > >>> >>> disconnecting/reconnecting) all input ports to the flow are
> > > > turned
> > > > > >>> off. The
> > > > > >>> >>> only data in the flow is from a single GenerateFlow
> > generating
> > > 5B
> > > > > >>> every 30
> > > > > >>> >>> secs.
> > > > > >>> >>>
> > > > > >>> >>> Also, it is a 5-node cluster with embedded ZK on each node.
> > > > First,
> > > > > I
> > > > > >>> will
> > > > > >>> >>> try reducing ZK to only 3 nodes. Then, I will try a 3-node
> > > > > external ZK.
> > > > > >>> >>>
> > > > > >>> >>> Thanks,
> > > > > >>> >>> Mark
> > > > > >>> >>>
> > > > > >>> >>> On Wed, May 24, 2017 at 11:49 AM, Joe Witt <
> > joe.witt@gmail.com
> > > >
> > > > > wrote:
> > > > > >>> >>>
> > > > > >>> >>>> Are you using the embedded Zookeeper?  If yes we recommend
> > > using
> > > > > an
> > > > > >>> >>>> external zookeeper.
> > > > > >>> >>>>
> > > > > >>> >>>> What type of load are the systems under when this occurs
> > (cpu,
> > > > > >>> >>>> network, memory, disk io)? Under high load the default
> > > timeouts
> > > > > for
> > > > > >>> >>>> clustering are too aggressive.  You can relax these for
> > higher
> > > > > load
> > > > > >>> >>>> clusters and should see good behavior.  Even if the system
> > > > > overall is
> > > > > >>> >>>> not under all that high of load if you're seeing garbage
> > > > > collection
> > > > > >>> >>>> pauses that are lengthy and/or frequent it can cause the
> > same
> > > > high
> > > > > >>> >>>> load effect as far as the JVM is concerned.
> > > > > >>> >>>>
> > > > > >>> >>>> Thanks
> > > > > >>> >>>> Joe
> > > > > >>> >>>>
> > > > > >>> >>>> On Wed, May 24, 2017 at 9:11 AM, Mark Bean <
> > > > mark.o.bean@gmail.com
> > > > > >
> > > > > >>> >>>> wrote:
> > > > > >>> >>>> > We have a cluster which is showing signs of instability.
> > The
> > > > > Primary
> > > > > >>> >>>> Node
> > > > > >>> >>>> > and Coordinator are reassigned to different nodes every
> > > > several
> > > > > >>> >>>> minutes. I
> > > > > >>> >>>> > believe this is due to lack of heartbeat or other
> > > > coordination.
> > > > > The
> > > > > >>> >>>> > following error occurs periodically in the nifi-app.log
> > > > > >>> >>>> >
> > > > > >>> >>>> > ERROR [CommitProcessor:1] o.apache.zookeeper.server.
> > > > > NIOServerCnxn
> > > > > >>> >>>> > Unexpected Exception:
> > > > > >>> >>>> > java.nio.channels.CancelledKeyException: null
> > > > > >>> >>>> >         at sun.nio.ch.SelectionKeyImpl.en
> > > > > >>> >>>> sureValid(SectionKeyImpl.java:73)
> > > > > >>> >>>> >         at sun.nio.ch.SelectionKeyImpl.in
> > > > > >>> >>>> terestOps(SelctionKeyImpl.java:77)
> > > > > >>> >>>> >         at
> > > > > >>> >>>> > org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(
> > > NIOServ
> > > > > >>> >>>> erCnxn.java:151)
> > > > > >>> >>>> >         at
> > > > > >>> >>>> > org.apache.zookeeper.server.NIOServerCnXn.sendResopnse(
> > > NIOSe
> > > > > >>> >>>> rverCnxn.java:1081)
> > > > > >>> >>>> >         at
> > > > > >>> >>>> > org.apache.zookeeper.server.FinalRequestProcessor.
> > > processReq
> > > > > >>> >>>> uest(FinalRequestProcessor.java:404)
> > > > > >>> >>>> >         at
> > > > > >>> >>>> > org.apache.zookeeper.server.quorum.CommitProcessor.run(
> > > Commi
> > > > > >>> >>>> tProcessor.java:74)
> > > > > >>> >>>> >
> > > > > >>> >>>> > Apache NiFi 1.2.0
> > > > > >>> >>>> >
> > > > > >>> >>>> > Thoughts?
> > > > > >>> >>>>
> > > > > >>> >>>
> > > > > >>> >>>
> > > > > >>> >>
> > > > > >>>
> > > > >
> > > >
> > >
> >
>

Re: unstable cluster

Posted by Jeff <jt...@gmail.com>.
Mark, we can certainly take smaller steps rather than waiting for
3.5.2/3.6.0 to come out.  I was just bringing that JIRA up as another
scenario that entices us to upgrade.

Joe, I'm referring to NiFi, the toolkit, and non-NiFi things that provide a
ZK server to which NiFi or the ZK Migration Toolkit are clients.  I'm not
saying we can't or shouldn't upgrade, but we do need to test to make sure
that no issues are introduced by NiFi shipping with ZK 3.4.10.  Being that
it's a bugfix version change, it's probably fine.

- Jeff

On Tue, May 30, 2017 at 10:46 AM Joe Skora <js...@gmail.com> wrote:

> Jeff,
>
> Does that mean NiFi 1.x will be unstable when using embedded ZooKeeper
> until the ZK version is upgrade?
>
> By "components outside of NiFi" do you mean the NiFi toolkit and other
> parts of the NiFi release?
>
> Joe
>
> On Tue, May 30, 2017 at 5:42 AM, Jeff <jt...@gmail.com> wrote:
>
> > Mark,
> >
> > I did report a JIRA [1] for upgrading to 3.5.2 or 3.6.0 (just due to
> log4j
> > issues) once it's out and stable, There are issues with the way that ZK
> > refers to log4j classes in the code that cause issues for NiFi and our
> > Toolkit..  However there has been some back and forth [2] (in 3.4.0,
> which
> > doesn't fix the issue, but moves towards fixing it), [3], and [4] on the
> > changes being implemented in versions 3.5.2 and 3.6.0.  Also, it looks
> like
> > ZK 3.6.0 is headed toward using log4j 2 [5].
> >
> > There are many components outside of NiFi that are still using ZK 3.4.6,
> so
> > it may be a while before we can move to 3.4.10. I don't currently know
> > anything about the forward compatibility of 3.4.6.  Are there
> > improvements/fixes in 3.4.10 which you need?
> >
> > [1] https://issues.apache.org/jira/browse/NIFI-3067
> > [2] https://issues.apache.org/jira/browse/ZOOKEEPER-850
> > [3] https://issues.apache.org/jira/browse/ZOOKEEPER-1371
> > [4] https://issues.apache.org/jira/browse/ZOOKEEPER-2393
> > [5] https://issues.apache.org/jira/browse/ZOOKEEPER-2342
> >
> > - Jeff
> >
> > On Tue, May 30, 2017 at 8:15 AM Mark Bean <ma...@gmail.com> wrote:
> >
> > > Updated to external ZooKeeper last Friday. Over the weekend, there are
> no
> > > reports of SUSPENDED or RECONNECTED.
> > >
> > > Are there plans to upgrade the embedded ZooKeeper to the latest
> version,
> > > 3.4.10?
> > >
> > > Thanks,
> > > Mark
> > >
> > > On Thu, May 25, 2017 at 11:56 AM, Joe Witt <jo...@gmail.com> wrote:
> > >
> > > > looked at a secured cluster and the send times are routinely at 100ms
> > > > similar to yours.  I think what i was flagging as potentially
> > > > interesting is not interesting at all.
> > > >
> > > > On Thu, May 25, 2017 at 11:34 AM, Joe Witt <jo...@gmail.com>
> wrote:
> > > > > Ok.  Well as a point of comparison i'm looking at heartbeat logs
> from
> > > > > another cluster and the times are consistently 1-3 millis for the
> > > > > send.  Yours above show 100+ms typical with one north of 900ms.
> Not
> > > > > sure how relevant that is but something i noticed.
> > > > >
> > > > > On Thu, May 25, 2017 at 11:29 AM, Mark Bean <mark.o.bean@gmail.com
> >
> > > > wrote:
> > > > >> ping shows acceptably fast response time between servers,
> > > approximately
> > > > >> 0.100-0.150 ms
> > > > >>
> > > > >>
> > > > >> On Thu, May 25, 2017 at 11:13 AM, Joe Witt <jo...@gmail.com>
> > > wrote:
> > > > >>
> > > > >>> have you evaluated latency across the machines in your cluster?
> I
> > > ask
> > > > >>> because 122ms is pretty long and 917ms is very long.  Are these
> > nodes
> > > > >>> across a WAN link?
> > > > >>>
> > > > >>> On Thu, May 25, 2017 at 11:08 AM, Mark Bean <
> mark.o.bean@gmail.com
> > >
> > > > wrote:
> > > > >>> > Update: now all 5 nodes, regardless of ZK server, are
> indicating
> > > > >>> SUSPENDED
> > > > >>> > -> RECONNECTED.
> > > > >>> >
> > > > >>> > On Thu, May 25, 2017 at 10:23 AM, Mark Bean <
> > mark.o.bean@gmail.com
> > > >
> > > > >>> wrote:
> > > > >>> >
> > > > >>> >> I reduced the number of embedded ZooKeeper servers on the
> 5-Node
> > > > NiFi
> > > > >>> >> Cluster from 5 to 3. This has improved the situation. I do not
> > see
> > > > any
> > > > >>> of
> > > > >>> >> the three Nodes which are also ZK servers
> > > > disconnecting/reconnecting to
> > > > >>> the
> > > > >>> >> cluster as before. However, the two Nodes which are not
> running
> > ZK
> > > > >>> continue
> > > > >>> >> to disconnect and reconnect. The following is taken from one
> of
> > > the
> > > > >>> non-ZK
> > > > >>> >> Nodes. It's curious that some messages are issued twice from
> the
> > > > same
> > > > >>> >> thread, but reference a different object
> > > > >>> >>
> > > > >>> >> nifi-app.log
> > > > >>> >> 2017-05-25 13:40:01,628 INFO [main-EventTrhead] o.a.c.f.state.
> > > > >>> ConnectionStateManager
> > > > >>> >> State change: SUSPENDED
> > > > >>> >> 2017-05-25 13:39:45,627 INFO [Clustering Tasks Thread-1]
> > > o.a.n.c.c.
> > > > >>> ClusterProtocolHeaertbeater
> > > > >>> >> Heartbeat create at 2017-05-25 13:39:45,504 and sent to
> > FQDN:PORT
> > > at
> > > > >>> >> 2017-05-25 13:39:45,627; send took 122 millis
> > > > >>> >> 2017-05-25 13:39:50,862 INFO [Clustering Tasks Thread-1]
> > > o.a.n.c.c.
> > > > >>> ClusterProtocolHeaertbeater
> > > > >>> >> Heartbeat create at 2017-05-25 13:39:50,732 and sent to
> > FQDN:PORT
> > > at
> > > > >>> >> 2017-05-25 13:39:50,862; send took 122 millis
> > > > >>> >> 2017-05-25 13:39:56,089 INFO [Clustering Tasks Thread-1]
> > > o.a.n.c.c.
> > > > >>> ClusterProtocolHeaertbeater
> > > > >>> >> Heartbeat create at 2017-05-25 13:39:55,966 and sent to
> > FQDN:PORT
> > > at
> > > > >>> >> 2017-05-25 13:39:56,089; send took 129 millis
> > > > >>> >> 2017-05-25 13:40:01,629 INFO
> [Curator-ConnectionStateManager-0]
> > > > >>> >> o.a.n.c.l.e.CuratorLeaderElectionManager
> > > > org.apache.nifi.controller.
> > > > >>> >> leader.election.CuratorLeaderElectionManager$
> > > > ElectionListener@68f8b6a2
> > > > >>> >> Connection State changed to SUSPENDED
> > > > >>> >> 2017-05-25 13:40:01,629 INFO
> [Curator-ConnectionStateManager-0]
> > > > >>> >> o.a.n.c.l.e.CuratorLeaderElectionManager
> > > > org.apache.nifi.controller.
> > > > >>> >> leader.election.CuratorLeaderElectionManager$
> > > > ElectionListener@663f55cd
> > > > >>> >> Connection State changed to SUSPENDED
> > > > >>> >> 2017-05-25 13:40:02,412 INFO [main-EventThread] o.a.c.f.state.
> > > > >>> ConnectinoStateManager
> > > > >>> >> State change: RECONNECTED
> > > > >>> >> 2017-05-25 13:40:02,413 INFO
> [Curator-ConnectionStateManager-0]
> > > > >>> >> o.a.n.c.l.e.CuratorLeaderElectionManager
> > > > org.apache.nifi.controller.
> > > > >>> >> leader.election.CuratorLeaderElectionManager$
> > > > ElectionListener@68f8b6a2
> > > > >>> >> Connection State changed to RECONNECTED
> > > > >>> >> 2017-05-25 13:40:02,413 INFO
> [Curator-ConnectionStateManager-0]
> > > > >>> >> o.a.n.c.l.e.CuratorLeaderElectionManager
> > > > org.apache.nifi.controller.
> > > > >>> >> leader.election.CuratorLeaderElectionManager$
> > > > ElectionListener@663f55cd
> > > > >>> >> Connection State changed to RECONNECTED
> > > > >>> >> 2017-05-25 13:40:02,550 INFO [Clustering Tasks Thread-1]
> > > o.a.n.c.c.
> > > > >>> ClusterProtocolHeaertbeater
> > > > >>> >> Heartbeat create at 2017-05-25 13:40:01,632 and sent to
> > FQDN:PORT
> > > at
> > > > >>> >> 2017-05-25 13:40:02,550; send took 917 millis
> > > > >>> >> 2017-05-25 13:40:07,787 INFO [Clustering Tasks Thread-1]
> > > o.a.n.c.c.
> > > > >>> ClusterProtocolHeaertbeater
> > > > >>> >> Heartbeat create at 2017-05-25 13:40:07,657 and sent to
> > FQDN:PORT
> > > at
> > > > >>> >> 2017-05-25 13:40:07,787; send took 129 millis
> > > > >>> >>
> > > > >>> >> I will work on setting up an external ZK next, but would still
> > > like
> > > > some
> > > > >>> >> insight to what is being observed with the embedded ZK.
> > > > >>> >>
> > > > >>> >> Thanks,
> > > > >>> >> Mark
> > > > >>> >>
> > > > >>> >>
> > > > >>> >>
> > > > >>> >>
> > > > >>> >> On Wed, May 24, 2017 at 3:57 PM, Mark Bean <
> > mark.o.bean@gmail.com
> > > >
> > > > >>> wrote:
> > > > >>> >>
> > > > >>> >>> Yes, we are using the embedded ZK. We will try instantiating
> > and
> > > > >>> external
> > > > >>> >>> ZK and see if that resolves the problem.
> > > > >>> >>>
> > > > >>> >>> The load on the system is extremely small. Currently (as
> Nodes
> > > are
> > > > >>> >>> disconnecting/reconnecting) all input ports to the flow are
> > > turned
> > > > >>> off. The
> > > > >>> >>> only data in the flow is from a single GenerateFlow
> generating
> > 5B
> > > > >>> every 30
> > > > >>> >>> secs.
> > > > >>> >>>
> > > > >>> >>> Also, it is a 5-node cluster with embedded ZK on each node.
> > > First,
> > > > I
> > > > >>> will
> > > > >>> >>> try reducing ZK to only 3 nodes. Then, I will try a 3-node
> > > > external ZK.
> > > > >>> >>>
> > > > >>> >>> Thanks,
> > > > >>> >>> Mark
> > > > >>> >>>
> > > > >>> >>> On Wed, May 24, 2017 at 11:49 AM, Joe Witt <
> joe.witt@gmail.com
> > >
> > > > wrote:
> > > > >>> >>>
> > > > >>> >>>> Are you using the embedded Zookeeper?  If yes we recommend
> > using
> > > > an
> > > > >>> >>>> external zookeeper.
> > > > >>> >>>>
> > > > >>> >>>> What type of load are the systems under when this occurs
> (cpu,
> > > > >>> >>>> network, memory, disk io)? Under high load the default
> > timeouts
> > > > for
> > > > >>> >>>> clustering are too aggressive.  You can relax these for
> higher
> > > > load
> > > > >>> >>>> clusters and should see good behavior.  Even if the system
> > > > overall is
> > > > >>> >>>> not under all that high of load if you're seeing garbage
> > > > collection
> > > > >>> >>>> pauses that are lengthy and/or frequent it can cause the
> same
> > > high
> > > > >>> >>>> load effect as far as the JVM is concerned.
> > > > >>> >>>>
> > > > >>> >>>> Thanks
> > > > >>> >>>> Joe
> > > > >>> >>>>
> > > > >>> >>>> On Wed, May 24, 2017 at 9:11 AM, Mark Bean <
> > > mark.o.bean@gmail.com
> > > > >
> > > > >>> >>>> wrote:
> > > > >>> >>>> > We have a cluster which is showing signs of instability.
> The
> > > > Primary
> > > > >>> >>>> Node
> > > > >>> >>>> > and Coordinator are reassigned to different nodes every
> > > several
> > > > >>> >>>> minutes. I
> > > > >>> >>>> > believe this is due to lack of heartbeat or other
> > > coordination.
> > > > The
> > > > >>> >>>> > following error occurs periodically in the nifi-app.log
> > > > >>> >>>> >
> > > > >>> >>>> > ERROR [CommitProcessor:1] o.apache.zookeeper.server.
> > > > NIOServerCnxn
> > > > >>> >>>> > Unexpected Exception:
> > > > >>> >>>> > java.nio.channels.CancelledKeyException: null
> > > > >>> >>>> >         at sun.nio.ch.SelectionKeyImpl.en
> > > > >>> >>>> sureValid(SectionKeyImpl.java:73)
> > > > >>> >>>> >         at sun.nio.ch.SelectionKeyImpl.in
> > > > >>> >>>> terestOps(SelctionKeyImpl.java:77)
> > > > >>> >>>> >         at
> > > > >>> >>>> > org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(
> > NIOServ
> > > > >>> >>>> erCnxn.java:151)
> > > > >>> >>>> >         at
> > > > >>> >>>> > org.apache.zookeeper.server.NIOServerCnXn.sendResopnse(
> > NIOSe
> > > > >>> >>>> rverCnxn.java:1081)
> > > > >>> >>>> >         at
> > > > >>> >>>> > org.apache.zookeeper.server.FinalRequestProcessor.
> > processReq
> > > > >>> >>>> uest(FinalRequestProcessor.java:404)
> > > > >>> >>>> >         at
> > > > >>> >>>> > org.apache.zookeeper.server.quorum.CommitProcessor.run(
> > Commi
> > > > >>> >>>> tProcessor.java:74)
> > > > >>> >>>> >
> > > > >>> >>>> > Apache NiFi 1.2.0
> > > > >>> >>>> >
> > > > >>> >>>> > Thoughts?
> > > > >>> >>>>
> > > > >>> >>>
> > > > >>> >>>
> > > > >>> >>
> > > > >>>
> > > >
> > >
> >
>

Re: unstable cluster

Posted by Joe Skora <js...@gmail.com>.
Jeff,

Does that mean NiFi 1.x will be unstable when using embedded ZooKeeper
until the ZK version is upgraded?

By "components outside of NiFi" do you mean the NiFi toolkit and other
parts of the NiFi release?

Joe

On Tue, May 30, 2017 at 5:42 AM, Jeff <jt...@gmail.com> wrote:

> Mark,
>
> I did report a JIRA [1] for upgrading to 3.5.2 or 3.6.0 (just due to log4j
> issues) once it's out and stable, There are issues with the way that ZK
> refers to log4j classes in the code that cause issues for NiFi and our
> Toolkit..  However there has been some back and forth [2] (in 3.4.0, which
> doesn't fix the issue, but moves towards fixing it), [3], and [4] on the
> changes being implemented in versions 3.5.2 and 3.6.0.  Also, it looks like
> ZK 3.6.0 is headed toward using log4j 2 [5].
>
> There are many components outside of NiFi that are still using ZK 3.4.6, so
> it may be a while before we can move to 3.4.10. I don't currently know
> anything about the forward compatibility of 3.4.6.  Are there
> improvements/fixes in 3.4.10 which you need?
>
> [1] https://issues.apache.org/jira/browse/NIFI-3067
> [2] https://issues.apache.org/jira/browse/ZOOKEEPER-850
> [3] https://issues.apache.org/jira/browse/ZOOKEEPER-1371
> [4] https://issues.apache.org/jira/browse/ZOOKEEPER-2393
> [5] https://issues.apache.org/jira/browse/ZOOKEEPER-2342
>
> - Jeff
>
> On Tue, May 30, 2017 at 8:15 AM Mark Bean <ma...@gmail.com> wrote:
>
> > Updated to external ZooKeeper last Friday. Over the weekend, there are no
> > reports of SUSPENDED or RECONNECTED.
> >
> > Are there plans to upgrade the embedded ZooKeeper to the latest version,
> > 3.4.10?
> >
> > Thanks,
> > Mark
> >
> > On Thu, May 25, 2017 at 11:56 AM, Joe Witt <jo...@gmail.com> wrote:
> >
> > > looked at a secured cluster and the send times are routinely at 100ms
> > > similar to yours.  I think what i was flagging as potentially
> > > interesting is not interesting at all.
> > >
> > > On Thu, May 25, 2017 at 11:34 AM, Joe Witt <jo...@gmail.com> wrote:
> > > > Ok.  Well as a point of comparison i'm looking at heartbeat logs from
> > > > another cluster and the times are consistently 1-3 millis for the
> > > > send.  Yours above show 100+ms typical with one north of 900ms.  Not
> > > > sure how relevant that is but something i noticed.
> > > >
> > > > On Thu, May 25, 2017 at 11:29 AM, Mark Bean <ma...@gmail.com>
> > > wrote:
> > > >> ping shows acceptably fast response time between servers,
> > approximately
> > > >> 0.100-0.150 ms
> > > >>
> > > >>
> > > >> On Thu, May 25, 2017 at 11:13 AM, Joe Witt <jo...@gmail.com>
> > wrote:
> > > >>
> > > >>> have you evaluated latency across the machines in your cluster?  I
> > ask
> > > >>> because 122ms is pretty long and 917ms is very long.  Are these
> nodes
> > > >>> across a WAN link?
> > > >>>
> > > >>> On Thu, May 25, 2017 at 11:08 AM, Mark Bean <mark.o.bean@gmail.com
> >
> > > wrote:
> > > >>> > Update: now all 5 nodes, regardless of ZK server, are indicating
> > > >>> SUSPENDED
> > > >>> > -> RECONNECTED.
> > > >>> >
> > > >>> > On Thu, May 25, 2017 at 10:23 AM, Mark Bean <
> mark.o.bean@gmail.com
> > >
> > > >>> wrote:
> > > >>> >
> > > >>> >> I reduced the number of embedded ZooKeeper servers on the 5-Node
> > > NiFi
> > > >>> >> Cluster from 5 to 3. This has improved the situation. I do not
> see
> > > any
> > > >>> of
> > > >>> >> the three Nodes which are also ZK servers
> > > disconnecting/reconnecting to
> > > >>> the
> > > >>> >> cluster as before. However, the two Nodes which are not running
> ZK
> > > >>> continue
> > > >>> >> to disconnect and reconnect. The following is taken from one of
> > the
> > > >>> non-ZK
> > > >>> >> Nodes. It's curious that some messages are issued twice from the
> > > same
> > > >>> >> thread, but reference a different object
> > > >>> >>
> > > >>> >> nifi-app.log
> > > >>> >> 2017-05-25 13:40:01,628 INFO [main-EventTrhead] o.a.c.f.state.
> > > >>> ConnectionStateManager
> > > >>> >> State change: SUSPENDED
> > > >>> >> 2017-05-25 13:39:45,627 INFO [Clustering Tasks Thread-1]
> > o.a.n.c.c.
> > > >>> ClusterProtocolHeaertbeater
> > > >>> >> Heartbeat create at 2017-05-25 13:39:45,504 and sent to
> FQDN:PORT
> > at
> > > >>> >> 2017-05-25 13:39:45,627; send took 122 millis
> > > >>> >> 2017-05-25 13:39:50,862 INFO [Clustering Tasks Thread-1]
> > o.a.n.c.c.
> > > >>> ClusterProtocolHeaertbeater
> > > >>> >> Heartbeat create at 2017-05-25 13:39:50,732 and sent to
> FQDN:PORT
> > at
> > > >>> >> 2017-05-25 13:39:50,862; send took 122 millis
> > > >>> >> 2017-05-25 13:39:56,089 INFO [Clustering Tasks Thread-1]
> > o.a.n.c.c.
> > > >>> ClusterProtocolHeaertbeater
> > > >>> >> Heartbeat create at 2017-05-25 13:39:55,966 and sent to
> FQDN:PORT
> > at
> > > >>> >> 2017-05-25 13:39:56,089; send took 129 millis
> > > >>> >> 2017-05-25 13:40:01,629 INFO [Curator-ConnectionStateManager-0]
> > > >>> >> o.a.n.c.l.e.CuratorLeaderElectionManager
> > > org.apache.nifi.controller.
> > > >>> >> leader.election.CuratorLeaderElectionManager$
> > > ElectionListener@68f8b6a2
> > > >>> >> Connection State changed to SUSPENDED
> > > >>> >> 2017-05-25 13:40:01,629 INFO [Curator-ConnectionStateManager-0]
> > > >>> >> o.a.n.c.l.e.CuratorLeaderElectionManager
> > > org.apache.nifi.controller.
> > > >>> >> leader.election.CuratorLeaderElectionManager$
> > > ElectionListener@663f55cd
> > > >>> >> Connection State changed to SUSPENDED
> > > >>> >> 2017-05-25 13:40:02,412 INFO [main-EventThread] o.a.c.f.state.
> > > >>> ConnectinoStateManager
> > > >>> >> State change: RECONNECTED
> > > >>> >> 2017-05-25 13:40:02,413 INFO [Curator-ConnectionStateManager-0]
> > > >>> >> o.a.n.c.l.e.CuratorLeaderElectionManager
> > > org.apache.nifi.controller.
> > > >>> >> leader.election.CuratorLeaderElectionManager$
> > > ElectionListener@68f8b6a2
> > > >>> >> Connection State changed to RECONNECTED
> > > >>> >> 2017-05-25 13:40:02,413 INFO [Curator-ConnectionStateManager-0]
> > > >>> >> o.a.n.c.l.e.CuratorLeaderElectionManager
> > > org.apache.nifi.controller.
> > > >>> >> leader.election.CuratorLeaderElectionManager$
> > > ElectionListener@663f55cd
> > > >>> >> Connection State changed to RECONNECTED
> > > >>> >> 2017-05-25 13:40:02,550 INFO [Clustering Tasks Thread-1]
> > o.a.n.c.c.
> > > >>> ClusterProtocolHeaertbeater
> > > >>> >> Heartbeat create at 2017-05-25 13:40:01,632 and sent to
> FQDN:PORT
> > at
> > > >>> >> 2017-05-25 13:40:02,550; send took 917 millis
> > > >>> >> 2017-05-25 13:40:07,787 INFO [Clustering Tasks Thread-1]
> > o.a.n.c.c.
> > > >>> ClusterProtocolHeaertbeater
> > > >>> >> Heartbeat create at 2017-05-25 13:40:07,657 and sent to
> FQDN:PORT
> > at
> > > >>> >> 2017-05-25 13:40:07,787; send took 129 millis
> > > >>> >>
> > > >>> >> I will work on setting up an external ZK next, but would still
> > like
> > > some
> > > >>> >> insight to what is being observed with the embedded ZK.
> > > >>> >>
> > > >>> >> Thanks,
> > > >>> >> Mark
> > > >>> >>
> > > >>> >>
> > > >>> >>
> > > >>> >>
> > > >>> >> On Wed, May 24, 2017 at 3:57 PM, Mark Bean <
> mark.o.bean@gmail.com
> > >
> > > >>> wrote:
> > > >>> >>
> > > >>> >>> Yes, we are using the embedded ZK. We will try instantiating
> and
> > > >>> external
> > > >>> >>> ZK and see if that resolves the problem.
> > > >>> >>>
> > > >>> >>> The load on the system is extremely small. Currently (as Nodes
> > are
> > > >>> >>> disconnecting/reconnecting) all input ports to the flow are
> > turned
> > > >>> off. The
> > > >>> >>> only data in the flow is from a single GenerateFlow generating
> 5B
> > > >>> every 30
> > > >>> >>> secs.
> > > >>> >>>
> > > >>> >>> Also, it is a 5-node cluster with embedded ZK on each node.
> > First,
> > > I
> > > >>> will
> > > >>> >>> try reducing ZK to only 3 nodes. Then, I will try a 3-node
> > > external ZK.
> > > >>> >>>
> > > >>> >>> Thanks,
> > > >>> >>> Mark
> > > >>> >>>
> > > >>> >>> On Wed, May 24, 2017 at 11:49 AM, Joe Witt <joe.witt@gmail.com
> >
> > > wrote:
> > > >>> >>>
> > > >>> >>>> Are you using the embedded Zookeeper?  If yes we recommend
> using
> > > an
> > > >>> >>>> external zookeeper.
> > > >>> >>>>
> > > >>> >>>> What type of load are the systems under when this occurs (cpu,
> > > >>> >>>> network, memory, disk io)? Under high load the default
> timeouts
> > > for
> > > >>> >>>> clustering are too aggressive.  You can relax these for higher
> > > load
> > > >>> >>>> clusters and should see good behavior.  Even if the system
> > > overall is
> > > >>> >>>> not under all that high of load if you're seeing garbage
> > > collection
> > > >>> >>>> pauses that are lengthy and/or frequent it can cause the same
> > high
> > > >>> >>>> load effect as far as the JVM is concerned.
> > > >>> >>>>
> > > >>> >>>> Thanks
> > > >>> >>>> Joe
> > > >>> >>>>
> > > >>> >>>> On Wed, May 24, 2017 at 9:11 AM, Mark Bean <
> > mark.o.bean@gmail.com
> > > >
> > > >>> >>>> wrote:
> > > >>> >>>> > We have a cluster which is showing signs of instability. The
> > > Primary
> > > >>> >>>> Node
> > > >>> >>>> > and Coordinator are reassigned to different nodes every
> > several
> > > >>> >>>> minutes. I
> > > >>> >>>> > believe this is due to lack of heartbeat or other
> > coordination.
> > > The
> > > >>> >>>> > following error occurs periodically in the nifi-app.log
> > > >>> >>>> >
> > > >>> >>>> > ERROR [CommitProcessor:1] o.apache.zookeeper.server.
> > > NIOServerCnxn
> > > >>> >>>> > Unexpected Exception:
> > > >>> >>>> > java.nio.channels.CancelledKeyException: null
> > > >>> >>>> >         at sun.nio.ch.SelectionKeyImpl.en
> > > >>> >>>> sureValid(SectionKeyImpl.java:73)
> > > >>> >>>> >         at sun.nio.ch.SelectionKeyImpl.in
> > > >>> >>>> terestOps(SelctionKeyImpl.java:77)
> > > >>> >>>> >         at
> > > >>> >>>> > org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(
> NIOServ
> > > >>> >>>> erCnxn.java:151)
> > > >>> >>>> >         at
> > > >>> >>>> > org.apache.zookeeper.server.NIOServerCnXn.sendResopnse(
> NIOSe
> > > >>> >>>> rverCnxn.java:1081)
> > > >>> >>>> >         at
> > > >>> >>>> > org.apache.zookeeper.server.FinalRequestProcessor.
> processReq
> > > >>> >>>> uest(FinalRequestProcessor.java:404)
> > > >>> >>>> >         at
> > > >>> >>>> > org.apache.zookeeper.server.quorum.CommitProcessor.run(
> Commi
> > > >>> >>>> tProcessor.java:74)
> > > >>> >>>> >
> > > >>> >>>> > Apache NiFi 1.2.0
> > > >>> >>>> >
> > > >>> >>>> > Thoughts?
> > > >>> >>>>
> > > >>> >>>
> > > >>> >>>
> > > >>> >>
> > > >>>
> > >
> >
>

Re: unstable cluster

Posted by Mark Bean <ma...@gmail.com>.
Jeff,

The Nodes are disconnecting from the Cluster due to the problem reported in
[1]. ZooKeeper fixed this in 3.4.10, which was the reason for inquiring about
upgrading the embedded ZK to 3.4.10. I understand there are additional
reasons (log4j) to wait for a later ZK release so those fixes can be included
as well, but can we take two smaller steps (especially since the timeframe
for ZK 3.5.2 or 3.6.0 is somewhat unknown) rather than one big step?

[1] https://issues.apache.org/jira/browse/ZOOKEEPER-2044
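
Since the fix is version-specific, a quick way to confirm which ZooKeeper
server version a quorum member is actually running is ZooKeeper's standard
'stat' four-letter command (the host and client port below are placeholders
for one ensemble member):

  echo stat | nc zk-host-1 2181 | head -1
  # first line resembles: Zookeeper version: 3.4.6-<build>, built on ...

Anything still reporting 3.4.6 predates the ZOOKEEPER-2044 fix.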

On Tue, May 30, 2017 at 8:42 AM, Jeff <jt...@gmail.com> wrote:

> Mark,
>
> I did report a JIRA [1] for upgrading to 3.5.2 or 3.6.0 (just due to log4j
> issues) once it's out and stable, There are issues with the way that ZK
> refers to log4j classes in the code that cause issues for NiFi and our
> Toolkit..  However there has been some back and forth [2] (in 3.4.0, which
> doesn't fix the issue, but moves towards fixing it), [3], and [4] on the
> changes being implemented in versions 3.5.2 and 3.6.0.  Also, it looks like
> ZK 3.6.0 is headed toward using log4j 2 [5].
>
> There are many components outside of NiFi that are still using ZK 3.4.6, so
> it may be a while before we can move to 3.4.10. I don't currently know
> anything about the forward compatibility of 3.4.6.  Are there
> improvements/fixes in 3.4.10 which you need?
>
> [1] https://issues.apache.org/jira/browse/NIFI-3067
> [2] https://issues.apache.org/jira/browse/ZOOKEEPER-850
> [3] https://issues.apache.org/jira/browse/ZOOKEEPER-1371
> [4] https://issues.apache.org/jira/browse/ZOOKEEPER-2393
> [5] https://issues.apache.org/jira/browse/ZOOKEEPER-2342
>
> - Jeff
>
> On Tue, May 30, 2017 at 8:15 AM Mark Bean <ma...@gmail.com> wrote:
>
> > Updated to external ZooKeeper last Friday. Over the weekend, there are no
> > reports of SUSPENDED or RECONNECTED.
> >
> > Are there plans to upgrade the embedded ZooKeeper to the latest version,
> > 3.4.10?
> >
> > Thanks,
> > Mark
> >
> > On Thu, May 25, 2017 at 11:56 AM, Joe Witt <jo...@gmail.com> wrote:
> >
> > > looked at a secured cluster and the send times are routinely at 100ms
> > > similar to yours.  I think what i was flagging as potentially
> > > interesting is not interesting at all.
> > >
> > > On Thu, May 25, 2017 at 11:34 AM, Joe Witt <jo...@gmail.com> wrote:
> > > > Ok.  Well as a point of comparison i'm looking at heartbeat logs from
> > > > another cluster and the times are consistently 1-3 millis for the
> > > > send.  Yours above show 100+ms typical with one north of 900ms.  Not
> > > > sure how relevant that is but something i noticed.
> > > >
> > > > On Thu, May 25, 2017 at 11:29 AM, Mark Bean <ma...@gmail.com>
> > > wrote:
> > > >> ping shows acceptably fast response time between servers,
> > approximately
> > > >> 0.100-0.150 ms
> > > >>
> > > >>
> > > >> On Thu, May 25, 2017 at 11:13 AM, Joe Witt <jo...@gmail.com>
> > wrote:
> > > >>
> > > >>> have you evaluated latency across the machines in your cluster?  I
> > ask
> > > >>> because 122ms is pretty long and 917ms is very long.  Are these
> nodes
> > > >>> across a WAN link?
> > > >>>
> > > >>> On Thu, May 25, 2017 at 11:08 AM, Mark Bean <mark.o.bean@gmail.com
> >
> > > wrote:
> > > >>> > Update: now all 5 nodes, regardless of ZK server, are indicating
> > > >>> SUSPENDED
> > > >>> > -> RECONNECTED.
> > > >>> >
> > > >>> > On Thu, May 25, 2017 at 10:23 AM, Mark Bean <
> mark.o.bean@gmail.com
> > >
> > > >>> wrote:
> > > >>> >
> > > >>> >> I reduced the number of embedded ZooKeeper servers on the 5-Node
> > > NiFi
> > > >>> >> Cluster from 5 to 3. This has improved the situation. I do not
> see
> > > any
> > > >>> of
> > > >>> >> the three Nodes which are also ZK servers
> > > disconnecting/reconnecting to
> > > >>> the
> > > >>> >> cluster as before. However, the two Nodes which are not running
> ZK
> > > >>> continue
> > > >>> >> to disconnect and reconnect. The following is taken from one of
> > the
> > > >>> non-ZK
> > > >>> >> Nodes. It's curious that some messages are issued twice from the
> > > same
> > > >>> >> thread, but reference a different object
> > > >>> >>
> > > >>> >> nifi-app.log
> > > >>> >> 2017-05-25 13:40:01,628 INFO [main-EventTrhead] o.a.c.f.state.
> > > >>> ConnectionStateManager
> > > >>> >> State change: SUSPENDED
> > > >>> >> 2017-05-25 13:39:45,627 INFO [Clustering Tasks Thread-1]
> > o.a.n.c.c.
> > > >>> ClusterProtocolHeaertbeater
> > > >>> >> Heartbeat create at 2017-05-25 13:39:45,504 and sent to
> FQDN:PORT
> > at
> > > >>> >> 2017-05-25 13:39:45,627; send took 122 millis
> > > >>> >> 2017-05-25 13:39:50,862 INFO [Clustering Tasks Thread-1]
> > o.a.n.c.c.
> > > >>> ClusterProtocolHeaertbeater
> > > >>> >> Heartbeat create at 2017-05-25 13:39:50,732 and sent to
> FQDN:PORT
> > at
> > > >>> >> 2017-05-25 13:39:50,862; send took 122 millis
> > > >>> >> 2017-05-25 13:39:56,089 INFO [Clustering Tasks Thread-1]
> > o.a.n.c.c.
> > > >>> ClusterProtocolHeaertbeater
> > > >>> >> Heartbeat create at 2017-05-25 13:39:55,966 and sent to
> FQDN:PORT
> > at
> > > >>> >> 2017-05-25 13:39:56,089; send took 129 millis
> > > >>> >> 2017-05-25 13:40:01,629 INFO [Curator-ConnectionStateManager-0]
> > > >>> >> o.a.n.c.l.e.CuratorLeaderElectionManager
> > > org.apache.nifi.controller.
> > > >>> >> leader.election.CuratorLeaderElectionManager$
> > > ElectionListener@68f8b6a2
> > > >>> >> Connection State changed to SUSPENDED
> > > >>> >> 2017-05-25 13:40:01,629 INFO [Curator-ConnectionStateManager-0]
> > > >>> >> o.a.n.c.l.e.CuratorLeaderElectionManager
> > > org.apache.nifi.controller.
> > > >>> >> leader.election.CuratorLeaderElectionManager$
> > > ElectionListener@663f55cd
> > > >>> >> Connection State changed to SUSPENDED
> > > >>> >> 2017-05-25 13:40:02,412 INFO [main-EventThread] o.a.c.f.state.
> > > >>> ConnectinoStateManager
> > > >>> >> State change: RECONNECTED
> > > >>> >> 2017-05-25 13:40:02,413 INFO [Curator-ConnectionStateManager-0]
> > > >>> >> o.a.n.c.l.e.CuratorLeaderElectionManager
> > > org.apache.nifi.controller.
> > > >>> >> leader.election.CuratorLeaderElectionManager$
> > > ElectionListener@68f8b6a2
> > > >>> >> Connection State changed to RECONNECTED
> > > >>> >> 2017-05-25 13:40:02,413 INFO [Curator-ConnectionStateManager-0]
> > > >>> >> o.a.n.c.l.e.CuratorLeaderElectionManager
> > > org.apache.nifi.controller.
> > > >>> >> leader.election.CuratorLeaderElectionManager$
> > > ElectionListener@663f55cd
> > > >>> >> Connection State changed to RECONNECTED
> > > >>> >> 2017-05-25 13:40:02,550 INFO [Clustering Tasks Thread-1]
> > o.a.n.c.c.
> > > >>> ClusterProtocolHeaertbeater
> > > >>> >> Heartbeat create at 2017-05-25 13:40:01,632 and sent to
> FQDN:PORT
> > at
> > > >>> >> 2017-05-25 13:40:02,550; send took 917 millis
> > > >>> >> 2017-05-25 13:40:07,787 INFO [Clustering Tasks Thread-1]
> > o.a.n.c.c.
> > > >>> ClusterProtocolHeaertbeater
> > > >>> >> Heartbeat create at 2017-05-25 13:40:07,657 and sent to
> FQDN:PORT
> > at
> > > >>> >> 2017-05-25 13:40:07,787; send took 129 millis
> > > >>> >>
> > > >>> >> I will work on setting up an external ZK next, but would still
> > like
> > > some
> > > >>> >> insight to what is being observed with the embedded ZK.
> > > >>> >>
> > > >>> >> Thanks,
> > > >>> >> Mark
> > > >>> >>
> > > >>> >>
> > > >>> >>
> > > >>> >>
> > > >>> >> On Wed, May 24, 2017 at 3:57 PM, Mark Bean <
> mark.o.bean@gmail.com
> > >
> > > >>> wrote:
> > > >>> >>
> > > >>> >>> Yes, we are using the embedded ZK. We will try instantiating
> and
> > > >>> external
> > > >>> >>> ZK and see if that resolves the problem.
> > > >>> >>>
> > > >>> >>> The load on the system is extremely small. Currently (as Nodes
> > are
> > > >>> >>> disconnecting/reconnecting) all input ports to the flow are
> > turned
> > > >>> off. The
> > > >>> >>> only data in the flow is from a single GenerateFlow generating
> 5B
> > > >>> every 30
> > > >>> >>> secs.
> > > >>> >>>
> > > >>> >>> Also, it is a 5-node cluster with embedded ZK on each node.
> > First,
> > > I
> > > >>> will
> > > >>> >>> try reducing ZK to only 3 nodes. Then, I will try a 3-node
> > > external ZK.
> > > >>> >>>
> > > >>> >>> Thanks,
> > > >>> >>> Mark
> > > >>> >>>
> > > >>> >>> On Wed, May 24, 2017 at 11:49 AM, Joe Witt <joe.witt@gmail.com
> >
> > > wrote:
> > > >>> >>>
> > > >>> >>>> Are you using the embedded Zookeeper?  If yes we recommend
> using
> > > an
> > > >>> >>>> external zookeeper.
> > > >>> >>>>
> > > >>> >>>> What type of load are the systems under when this occurs (cpu,
> > > >>> >>>> network, memory, disk io)? Under high load the default
> timeouts
> > > for
> > > >>> >>>> clustering are too aggressive.  You can relax these for higher
> > > load
> > > >>> >>>> clusters and should see good behavior.  Even if the system
> > > overall is
> > > >>> >>>> not under all that high of load if you're seeing garbage
> > > collection
> > > >>> >>>> pauses that are lengthy and/or frequent it can cause the same
> > high
> > > >>> >>>> load effect as far as the JVM is concerned.
> > > >>> >>>>
> > > >>> >>>> Thanks
> > > >>> >>>> Joe
> > > >>> >>>>
> > > >>> >>>> On Wed, May 24, 2017 at 9:11 AM, Mark Bean <
> > mark.o.bean@gmail.com
> > > >
> > > >>> >>>> wrote:
> > > >>> >>>> > We have a cluster which is showing signs of instability. The
> > > Primary
> > > >>> >>>> Node
> > > >>> >>>> > and Coordinator are reassigned to different nodes every
> > several
> > > >>> >>>> minutes. I
> > > >>> >>>> > believe this is due to lack of heartbeat or other
> > coordination.
> > > The
> > > >>> >>>> > following error occurs periodically in the nifi-app.log
> > > >>> >>>> >
> > > >>> >>>> > ERROR [CommitProcessor:1] o.apache.zookeeper.server.
> > > NIOServerCnxn
> > > >>> >>>> > Unexpected Exception:
> > > >>> >>>> > java.nio.channels.CancelledKeyException: null
> > > >>> >>>> >         at sun.nio.ch.SelectionKeyImpl.en
> > > >>> >>>> sureValid(SectionKeyImpl.java:73)
> > > >>> >>>> >         at sun.nio.ch.SelectionKeyImpl.in
> > > >>> >>>> terestOps(SelctionKeyImpl.java:77)
> > > >>> >>>> >         at
> > > >>> >>>> > org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(
> NIOServ
> > > >>> >>>> erCnxn.java:151)
> > > >>> >>>> >         at
> > > >>> >>>> > org.apache.zookeeper.server.NIOServerCnXn.sendResopnse(
> NIOSe
> > > >>> >>>> rverCnxn.java:1081)
> > > >>> >>>> >         at
> > > >>> >>>> > org.apache.zookeeper.server.FinalRequestProcessor.
> processReq
> > > >>> >>>> uest(FinalRequestProcessor.java:404)
> > > >>> >>>> >         at
> > > >>> >>>> > org.apache.zookeeper.server.quorum.CommitProcessor.run(
> Commi
> > > >>> >>>> tProcessor.java:74)
> > > >>> >>>> >
> > > >>> >>>> > Apache NiFi 1.2.0
> > > >>> >>>> >
> > > >>> >>>> > Thoughts?
> > > >>> >>>>
> > > >>> >>>
> > > >>> >>>
> > > >>> >>
> > > >>>
> > >
> >
>

Re: unstable cluster

Posted by Jeff <jt...@gmail.com>.
Mark,

I did report a JIRA [1] for upgrading to 3.5.2 or 3.6.0 (just due to log4j
issues) once it's out and stable. There are issues with the way that ZK
refers to log4j classes in the code that cause problems for NiFi and our
Toolkit.  However, there has been some back and forth [2] (in 3.4.0, which
doesn't fix the issue, but moves towards fixing it), [3], and [4] on the
changes being implemented in versions 3.5.2 and 3.6.0.  Also, it looks like
ZK 3.6.0 is headed toward using log4j 2 [5].

There are many components outside of NiFi that are still using ZK 3.4.6, so
it may be a while before we can move to 3.4.10. I don't currently know
anything about the forward compatibility of 3.4.6.  Are there
improvements/fixes in 3.4.10 which you need?

[1] https://issues.apache.org/jira/browse/NIFI-3067
[2] https://issues.apache.org/jira/browse/ZOOKEEPER-850
[3] https://issues.apache.org/jira/browse/ZOOKEEPER-1371
[4] https://issues.apache.org/jira/browse/ZOOKEEPER-2393
[5] https://issues.apache.org/jira/browse/ZOOKEEPER-2342

- Jeff
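
As a side note, one way to see which ZooKeeper client version a given NiFi
install actually bundles is to look for the ZooKeeper jar once the NARs have
been unpacked (the install path below is a placeholder):

  find /opt/nifi-1.2.0 -name 'zookeeper-*.jar'
  # on a 1.2.0 install this typically turns up zookeeper-3.4.6.jar under the
  # framework NAR's bundled dependencies in the work/ directory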

On Tue, May 30, 2017 at 8:15 AM Mark Bean <ma...@gmail.com> wrote:

> Updated to external ZooKeeper last Friday. Over the weekend, there are no
> reports of SUSPENDED or RECONNECTED.
>
> Are there plans to upgrade the embedded ZooKeeper to the latest version,
> 3.4.10?
>
> Thanks,
> Mark
>
> On Thu, May 25, 2017 at 11:56 AM, Joe Witt <jo...@gmail.com> wrote:
>
> > looked at a secured cluster and the send times are routinely at 100ms
> > similar to yours.  I think what i was flagging as potentially
> > interesting is not interesting at all.
> >
> > On Thu, May 25, 2017 at 11:34 AM, Joe Witt <jo...@gmail.com> wrote:
> > > Ok.  Well as a point of comparison i'm looking at heartbeat logs from
> > > another cluster and the times are consistently 1-3 millis for the
> > > send.  Yours above show 100+ms typical with one north of 900ms.  Not
> > > sure how relevant that is but something i noticed.
> > >
> > > On Thu, May 25, 2017 at 11:29 AM, Mark Bean <ma...@gmail.com>
> > wrote:
> > >> ping shows acceptably fast response time between servers,
> approximately
> > >> 0.100-0.150 ms
> > >>
> > >>
> > >> On Thu, May 25, 2017 at 11:13 AM, Joe Witt <jo...@gmail.com>
> wrote:
> > >>
> > >>> have you evaluated latency across the machines in your cluster?  I
> ask
> > >>> because 122ms is pretty long and 917ms is very long.  Are these nodes
> > >>> across a WAN link?
> > >>>
> > >>> On Thu, May 25, 2017 at 11:08 AM, Mark Bean <ma...@gmail.com>
> > wrote:
> > >>> > Update: now all 5 nodes, regardless of ZK server, are indicating
> > >>> SUSPENDED
> > >>> > -> RECONNECTED.
> > >>> >
> > >>> > On Thu, May 25, 2017 at 10:23 AM, Mark Bean <mark.o.bean@gmail.com
> >
> > >>> wrote:
> > >>> >
> > >>> >> I reduced the number of embedded ZooKeeper servers on the 5-Node
> > NiFi
> > >>> >> Cluster from 5 to 3. This has improved the situation. I do not see
> > any
> > >>> of
> > >>> >> the three Nodes which are also ZK servers
> > disconnecting/reconnecting to
> > >>> the
> > >>> >> cluster as before. However, the two Nodes which are not running ZK
> > >>> continue
> > >>> >> to disconnect and reconnect. The following is taken from one of
> the
> > >>> non-ZK
> > >>> >> Nodes. It's curious that some messages are issued twice from the
> > same
> > >>> >> thread, but reference a different object
> > >>> >>
> > >>> >> nifi-app.log
> > >>> >> 2017-05-25 13:40:01,628 INFO [main-EventTrhead] o.a.c.f.state.
> > >>> ConnectionStateManager
> > >>> >> State change: SUSPENDED
> > >>> >> 2017-05-25 13:39:45,627 INFO [Clustering Tasks Thread-1]
> o.a.n.c.c.
> > >>> ClusterProtocolHeaertbeater
> > >>> >> Heartbeat create at 2017-05-25 13:39:45,504 and sent to FQDN:PORT
> at
> > >>> >> 2017-05-25 13:39:45,627; send took 122 millis
> > >>> >> 2017-05-25 13:39:50,862 INFO [Clustering Tasks Thread-1]
> o.a.n.c.c.
> > >>> ClusterProtocolHeaertbeater
> > >>> >> Heartbeat create at 2017-05-25 13:39:50,732 and sent to FQDN:PORT
> at
> > >>> >> 2017-05-25 13:39:50,862; send took 122 millis
> > >>> >> 2017-05-25 13:39:56,089 INFO [Clustering Tasks Thread-1]
> o.a.n.c.c.
> > >>> ClusterProtocolHeaertbeater
> > >>> >> Heartbeat create at 2017-05-25 13:39:55,966 and sent to FQDN:PORT
> at
> > >>> >> 2017-05-25 13:39:56,089; send took 129 millis
> > >>> >> 2017-05-25 13:40:01,629 INFO [Curator-ConnectionStateManager-0]
> > >>> >> o.a.n.c.l.e.CuratorLeaderElectionManager
> > org.apache.nifi.controller.
> > >>> >> leader.election.CuratorLeaderElectionManager$
> > ElectionListener@68f8b6a2
> > >>> >> Connection State changed to SUSPENDED
> > >>> >> 2017-05-25 13:40:01,629 INFO [Curator-ConnectionStateManager-0]
> > >>> >> o.a.n.c.l.e.CuratorLeaderElectionManager
> > org.apache.nifi.controller.
> > >>> >> leader.election.CuratorLeaderElectionManager$
> > ElectionListener@663f55cd
> > >>> >> Connection State changed to SUSPENDED
> > >>> >> 2017-05-25 13:40:02,412 INFO [main-EventThread] o.a.c.f.state.
> > >>> ConnectinoStateManager
> > >>> >> State change: RECONNECTED
> > >>> >> 2017-05-25 13:40:02,413 INFO [Curator-ConnectionStateManager-0]
> > >>> >> o.a.n.c.l.e.CuratorLeaderElectionManager
> > org.apache.nifi.controller.
> > >>> >> leader.election.CuratorLeaderElectionManager$
> > ElectionListener@68f8b6a2
> > >>> >> Connection State changed to RECONNECTED
> > >>> >> 2017-05-25 13:40:02,413 INFO [Curator-ConnectionStateManager-0]
> > >>> >> o.a.n.c.l.e.CuratorLeaderElectionManager
> > org.apache.nifi.controller.
> > >>> >> leader.election.CuratorLeaderElectionManager$
> > ElectionListener@663f55cd
> > >>> >> Connection State changed to RECONNECTED
> > >>> >> 2017-05-25 13:40:02,550 INFO [Clustering Tasks Thread-1]
> o.a.n.c.c.
> > >>> ClusterProtocolHeaertbeater
> > >>> >> Heartbeat create at 2017-05-25 13:40:01,632 and sent to FQDN:PORT
> at
> > >>> >> 2017-05-25 13:40:02,550; send took 917 millis
> > >>> >> 2017-05-25 13:40:07,787 INFO [Clustering Tasks Thread-1]
> o.a.n.c.c.
> > >>> ClusterProtocolHeaertbeater
> > >>> >> Heartbeat create at 2017-05-25 13:40:07,657 and sent to FQDN:PORT
> at
> > >>> >> 2017-05-25 13:40:07,787; send took 129 millis
> > >>> >>
> > >>> >> I will work on setting up an external ZK next, but would still
> like
> > some
> > >>> >> insight to what is being observed with the embedded ZK.
> > >>> >>
> > >>> >> Thanks,
> > >>> >> Mark
> > >>> >>
> > >>> >>
> > >>> >>
> > >>> >>
> > >>> >> On Wed, May 24, 2017 at 3:57 PM, Mark Bean <mark.o.bean@gmail.com
> >
> > >>> wrote:
> > >>> >>
> > >>> >>> Yes, we are using the embedded ZK. We will try instantiating and
> > >>> external
> > >>> >>> ZK and see if that resolves the problem.
> > >>> >>>
> > >>> >>> The load on the system is extremely small. Currently (as Nodes
> are
> > >>> >>> disconnecting/reconnecting) all input ports to the flow are
> turned
> > >>> off. The
> > >>> >>> only data in the flow is from a single GenerateFlow generating 5B
> > >>> every 30
> > >>> >>> secs.
> > >>> >>>
> > >>> >>> Also, it is a 5-node cluster with embedded ZK on each node.
> First,
> > I
> > >>> will
> > >>> >>> try reducing ZK to only 3 nodes. Then, I will try a 3-node
> > external ZK.
> > >>> >>>
> > >>> >>> Thanks,
> > >>> >>> Mark
> > >>> >>>
> > >>> >>> On Wed, May 24, 2017 at 11:49 AM, Joe Witt <jo...@gmail.com>
> > wrote:
> > >>> >>>
> > >>> >>>> Are you using the embedded Zookeeper?  If yes we recommend using
> > an
> > >>> >>>> external zookeeper.
> > >>> >>>>
> > >>> >>>> What type of load are the systems under when this occurs (cpu,
> > >>> >>>> network, memory, disk io)? Under high load the default timeouts
> > for
> > >>> >>>> clustering are too aggressive.  You can relax these for higher
> > load
> > >>> >>>> clusters and should see good behavior.  Even if the system
> > overall is
> > >>> >>>> not under all that high of load if you're seeing garbage
> > collection
> > >>> >>>> pauses that are lengthy and/or frequent it can cause the same
> high
> > >>> >>>> load effect as far as the JVM is concerned.
> > >>> >>>>
> > >>> >>>> Thanks
> > >>> >>>> Joe
> > >>> >>>>
> > >>> >>>> On Wed, May 24, 2017 at 9:11 AM, Mark Bean <
> mark.o.bean@gmail.com
> > >
> > >>> >>>> wrote:
> > >>> >>>> > We have a cluster which is showing signs of instability. The
> > Primary
> > >>> >>>> Node
> > >>> >>>> > and Coordinator are reassigned to different nodes every
> several
> > >>> >>>> minutes. I
> > >>> >>>> > believe this is due to lack of heartbeat or other
> coordination.
> > The
> > >>> >>>> > following error occurs periodically in the nifi-app.log
> > >>> >>>> >
> > >>> >>>> > ERROR [CommitProcessor:1] o.apache.zookeeper.server.
> > NIOServerCnxn
> > >>> >>>> > Unexpected Exception:
> > >>> >>>> > java.nio.channels.CancelledKeyException: null
> > >>> >>>> >         at sun.nio.ch.SelectionKeyImpl.en
> > >>> >>>> sureValid(SectionKeyImpl.java:73)
> > >>> >>>> >         at sun.nio.ch.SelectionKeyImpl.in
> > >>> >>>> terestOps(SelctionKeyImpl.java:77)
> > >>> >>>> >         at
> > >>> >>>> > org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServ
> > >>> >>>> erCnxn.java:151)
> > >>> >>>> >         at
> > >>> >>>> > org.apache.zookeeper.server.NIOServerCnXn.sendResopnse(NIOSe
> > >>> >>>> rverCnxn.java:1081)
> > >>> >>>> >         at
> > >>> >>>> > org.apache.zookeeper.server.FinalRequestProcessor.processReq
> > >>> >>>> uest(FinalRequestProcessor.java:404)
> > >>> >>>> >         at
> > >>> >>>> > org.apache.zookeeper.server.quorum.CommitProcessor.run(Commi
> > >>> >>>> tProcessor.java:74)
> > >>> >>>> >
> > >>> >>>> > Apache NiFi 1.2.0
> > >>> >>>> >
> > >>> >>>> > Thoughts?
> > >>> >>>>
> > >>> >>>
> > >>> >>>
> > >>> >>
> > >>>
> >
>

Re: unstable cluster

Posted by Mark Bean <ma...@gmail.com>.
Updated to an external ZooKeeper last Friday. Over the weekend, there were no
reports of SUSPENDED or RECONNECTED.

Are there plans to upgrade the embedded ZooKeeper to the latest version,
3.4.10?

Thanks,
Mark
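
For reference, the switch described above generally amounts to changes along
these lines in nifi.properties on each node (hostnames are placeholders),
plus the matching Connect String on the ZooKeeper provider in
conf/state-management.xml:

  nifi.state.management.embedded.zookeeper.start=false
  nifi.zookeeper.connect.string=zk-host-1:2181,zk-host-2:2181,zk-host-3:2181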

On Thu, May 25, 2017 at 11:56 AM, Joe Witt <jo...@gmail.com> wrote:

> looked at a secured cluster and the send times are routinely at 100ms
> similar to yours.  I think what i was flagging as potentially
> interesting is not interesting at all.
>
> On Thu, May 25, 2017 at 11:34 AM, Joe Witt <jo...@gmail.com> wrote:
> > Ok.  Well as a point of comparison i'm looking at heartbeat logs from
> > another cluster and the times are consistently 1-3 millis for the
> > send.  Yours above show 100+ms typical with one north of 900ms.  Not
> > sure how relevant that is but something i noticed.
> >
> > On Thu, May 25, 2017 at 11:29 AM, Mark Bean <ma...@gmail.com>
> wrote:
> >> ping shows acceptably fast response time between servers, approximately
> >> 0.100-0.150 ms
> >>
> >>
> >> On Thu, May 25, 2017 at 11:13 AM, Joe Witt <jo...@gmail.com> wrote:
> >>
> >>> have you evaluated latency across the machines in your cluster?  I ask
> >>> because 122ms is pretty long and 917ms is very long.  Are these nodes
> >>> across a WAN link?
> >>>
> >>> On Thu, May 25, 2017 at 11:08 AM, Mark Bean <ma...@gmail.com>
> wrote:
> >>> > Update: now all 5 nodes, regardless of ZK server, are indicating
> >>> SUSPENDED
> >>> > -> RECONNECTED.
> >>> >
> >>> > On Thu, May 25, 2017 at 10:23 AM, Mark Bean <ma...@gmail.com>
> >>> wrote:
> >>> >
> >>> >> I reduced the number of embedded ZooKeeper servers on the 5-Node
> NiFi
> >>> >> Cluster from 5 to 3. This has improved the situation. I do not see
> any
> >>> of
> >>> >> the three Nodes which are also ZK servers
> disconnecting/reconnecting to
> >>> the
> >>> >> cluster as before. However, the two Nodes which are not running ZK
> >>> continue
> >>> >> to disconnect and reconnect. The following is taken from one of the
> >>> non-ZK
> >>> >> Nodes. It's curious that some messages are issued twice from the
> same
> >>> >> thread, but reference a different object
> >>> >>
> >>> >> nifi-app.log
> >>> >> 2017-05-25 13:40:01,628 INFO [main-EventTrhead] o.a.c.f.state.
> >>> ConnectionStateManager
> >>> >> State change: SUSPENDED
> >>> >> 2017-05-25 13:39:45,627 INFO [Clustering Tasks Thread-1] o.a.n.c.c.
> >>> ClusterProtocolHeaertbeater
> >>> >> Heartbeat create at 2017-05-25 13:39:45,504 and sent to FQDN:PORT at
> >>> >> 2017-05-25 13:39:45,627; send took 122 millis
> >>> >> 2017-05-25 13:39:50,862 INFO [Clustering Tasks Thread-1] o.a.n.c.c.
> >>> ClusterProtocolHeaertbeater
> >>> >> Heartbeat create at 2017-05-25 13:39:50,732 and sent to FQDN:PORT at
> >>> >> 2017-05-25 13:39:50,862; send took 122 millis
> >>> >> 2017-05-25 13:39:56,089 INFO [Clustering Tasks Thread-1] o.a.n.c.c.
> >>> ClusterProtocolHeaertbeater
> >>> >> Heartbeat create at 2017-05-25 13:39:55,966 and sent to FQDN:PORT at
> >>> >> 2017-05-25 13:39:56,089; send took 129 millis
> >>> >> 2017-05-25 13:40:01,629 INFO [Curator-ConnectionStateManager-0]
> >>> >> o.a.n.c.l.e.CuratorLeaderElectionManager
> org.apache.nifi.controller.
> >>> >> leader.election.CuratorLeaderElectionManager$
> ElectionListener@68f8b6a2
> >>> >> Connection State changed to SUSPENDED
> >>> >> 2017-05-25 13:40:01,629 INFO [Curator-ConnectionStateManager-0]
> >>> >> o.a.n.c.l.e.CuratorLeaderElectionManager
> org.apache.nifi.controller.
> >>> >> leader.election.CuratorLeaderElectionManager$
> ElectionListener@663f55cd
> >>> >> Connection State changed to SUSPENDED
> >>> >> 2017-05-25 13:40:02,412 INFO [main-EventThread] o.a.c.f.state.
> >>> ConnectinoStateManager
> >>> >> State change: RECONNECTED
> >>> >> 2017-05-25 13:40:02,413 INFO [Curator-ConnectionStateManager-0]
> >>> >> o.a.n.c.l.e.CuratorLeaderElectionManager
> org.apache.nifi.controller.
> >>> >> leader.election.CuratorLeaderElectionManager$
> ElectionListener@68f8b6a2
> >>> >> Connection State changed to RECONNECTED
> >>> >> 2017-05-25 13:40:02,413 INFO [Curator-ConnectionStateManager-0]
> >>> >> o.a.n.c.l.e.CuratorLeaderElectionManager
> org.apache.nifi.controller.
> >>> >> leader.election.CuratorLeaderElectionManager$
> ElectionListener@663f55cd
> >>> >> Connection State changed to RECONNECTED
> >>> >> 2017-05-25 13:40:02,550 INFO [Clustering Tasks Thread-1] o.a.n.c.c.
> >>> ClusterProtocolHeaertbeater
> >>> >> Heartbeat create at 2017-05-25 13:40:01,632 and sent to FQDN:PORT at
> >>> >> 2017-05-25 13:40:02,550; send took 917 millis
> >>> >> 2017-05-25 13:40:07,787 INFO [Clustering Tasks Thread-1] o.a.n.c.c.
> >>> ClusterProtocolHeaertbeater
> >>> >> Heartbeat create at 2017-05-25 13:40:07,657 and sent to FQDN:PORT at
> >>> >> 2017-05-25 13:40:07,787; send took 129 millis
> >>> >>
> >>> >> I will work on setting up an external ZK next, but would still like
> some
> >>> >> insight to what is being observed with the embedded ZK.
> >>> >>
> >>> >> Thanks,
> >>> >> Mark
> >>> >>
> >>> >>
> >>> >>
> >>> >>
> >>> >> On Wed, May 24, 2017 at 3:57 PM, Mark Bean <ma...@gmail.com>
> >>> wrote:
> >>> >>
> >>> >>> Yes, we are using the embedded ZK. We will try instantiating and
> >>> external
> >>> >>> ZK and see if that resolves the problem.
> >>> >>>
> >>> >>> The load on the system is extremely small. Currently (as Nodes are
> >>> >>> disconnecting/reconnecting) all input ports to the flow are turned
> >>> off. The
> >>> >>> only data in the flow is from a single GenerateFlow generating 5B
> >>> every 30
> >>> >>> secs.
> >>> >>>
> >>> >>> Also, it is a 5-node cluster with embedded ZK on each node. First,
> I
> >>> will
> >>> >>> try reducing ZK to only 3 nodes. Then, I will try a 3-node
> external ZK.
> >>> >>>
> >>> >>> Thanks,
> >>> >>> Mark
> >>> >>>
> >>> >>> On Wed, May 24, 2017 at 11:49 AM, Joe Witt <jo...@gmail.com>
> wrote:
> >>> >>>
> >>> >>>> Are you using the embedded Zookeeper?  If yes we recommend using
> an
> >>> >>>> external zookeeper.
> >>> >>>>
> >>> >>>> What type of load are the systems under when this occurs (cpu,
> >>> >>>> network, memory, disk io)? Under high load the default timeouts
> for
> >>> >>>> clustering are too aggressive.  You can relax these for higher
> load
> >>> >>>> clusters and should see good behavior.  Even if the system
> overall is
> >>> >>>> not under all that high of load if you're seeing garbage
> collection
> >>> >>>> pauses that are lengthy and/or frequent it can cause the same high
> >>> >>>> load effect as far as the JVM is concerned.
> >>> >>>>
> >>> >>>> Thanks
> >>> >>>> Joe
> >>> >>>>
> >>> >>>> On Wed, May 24, 2017 at 9:11 AM, Mark Bean <mark.o.bean@gmail.com
> >
> >>> >>>> wrote:
> >>> >>>> > We have a cluster which is showing signs of instability. The
> Primary
> >>> >>>> Node
> >>> >>>> > and Coordinator are reassigned to different nodes every several
> >>> >>>> minutes. I
> >>> >>>> > believe this is due to lack of heartbeat or other coordination.
> The
> >>> >>>> > following error occurs periodically in the nifi-app.log
> >>> >>>> >
> >>> >>>> > ERROR [CommitProcessor:1] o.apache.zookeeper.server.
> NIOServerCnxn
> >>> >>>> > Unexpected Exception:
> >>> >>>> > java.nio.channels.CancelledKeyException: null
> >>> >>>> >         at sun.nio.ch.SelectionKeyImpl.en
> >>> >>>> sureValid(SectionKeyImpl.java:73)
> >>> >>>> >         at sun.nio.ch.SelectionKeyImpl.in
> >>> >>>> terestOps(SelctionKeyImpl.java:77)
> >>> >>>> >         at
> >>> >>>> > org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServ
> >>> >>>> erCnxn.java:151)
> >>> >>>> >         at
> >>> >>>> > org.apache.zookeeper.server.NIOServerCnXn.sendResopnse(NIOSe
> >>> >>>> rverCnxn.java:1081)
> >>> >>>> >         at
> >>> >>>> > org.apache.zookeeper.server.FinalRequestProcessor.processReq
> >>> >>>> uest(FinalRequestProcessor.java:404)
> >>> >>>> >         at
> >>> >>>> > org.apache.zookeeper.server.quorum.CommitProcessor.run(Commi
> >>> >>>> tProcessor.java:74)
> >>> >>>> >
> >>> >>>> > Apache NiFi 1.2.0
> >>> >>>> >
> >>> >>>> > Thoughts?
> >>> >>>>
> >>> >>>
> >>> >>>
> >>> >>
> >>>
>

Re: unstable cluster

Posted by Joe Witt <jo...@gmail.com>.
Looked at a secured cluster and the send times are routinely at 100ms,
similar to yours.  I think what I was flagging as potentially
interesting is not interesting at all.

On Thu, May 25, 2017 at 11:34 AM, Joe Witt <jo...@gmail.com> wrote:
> Ok.  Well as a point of comparison i'm looking at heartbeat logs from
> another cluster and the times are consistently 1-3 millis for the
> send.  Yours above show 100+ms typical with one north of 900ms.  Not
> sure how relevant that is but something i noticed.
>
> On Thu, May 25, 2017 at 11:29 AM, Mark Bean <ma...@gmail.com> wrote:
>> ping shows acceptably fast response time between servers, approximately
>> 0.100-0.150 ms
>>
>>
>> On Thu, May 25, 2017 at 11:13 AM, Joe Witt <jo...@gmail.com> wrote:
>>
>>> have you evaluated latency across the machines in your cluster?  I ask
>>> because 122ms is pretty long and 917ms is very long.  Are these nodes
>>> across a WAN link?
>>>
>>> On Thu, May 25, 2017 at 11:08 AM, Mark Bean <ma...@gmail.com> wrote:
>>> > Update: now all 5 nodes, regardless of ZK server, are indicating
>>> SUSPENDED
>>> > -> RECONNECTED.
>>> >
>>> > On Thu, May 25, 2017 at 10:23 AM, Mark Bean <ma...@gmail.com>
>>> wrote:
>>> >
>>> >> I reduced the number of embedded ZooKeeper servers on the 5-Node NiFi
>>> >> Cluster from 5 to 3. This has improved the situation. I do not see any
>>> of
>>> >> the three Nodes which are also ZK servers disconnecting/reconnecting to
>>> the
>>> >> cluster as before. However, the two Nodes which are not running ZK
>>> continue
>>> >> to disconnect and reconnect. The following is taken from one of the
>>> non-ZK
>>> >> Nodes. It's curious that some messages are issued twice from the same
>>> >> thread, but reference a different object
>>> >>
>>> >> nifi-app.log
>>> >> 2017-05-25 13:40:01,628 INFO [main-EventTrhead] o.a.c.f.state.
>>> ConnectionStateManager
>>> >> State change: SUSPENDED
>>> >> 2017-05-25 13:39:45,627 INFO [Clustering Tasks Thread-1] o.a.n.c.c.
>>> ClusterProtocolHeaertbeater
>>> >> Heartbeat create at 2017-05-25 13:39:45,504 and sent to FQDN:PORT at
>>> >> 2017-05-25 13:39:45,627; send took 122 millis
>>> >> 2017-05-25 13:39:50,862 INFO [Clustering Tasks Thread-1] o.a.n.c.c.
>>> ClusterProtocolHeaertbeater
>>> >> Heartbeat create at 2017-05-25 13:39:50,732 and sent to FQDN:PORT at
>>> >> 2017-05-25 13:39:50,862; send took 122 millis
>>> >> 2017-05-25 13:39:56,089 INFO [Clustering Tasks Thread-1] o.a.n.c.c.
>>> ClusterProtocolHeaertbeater
>>> >> Heartbeat create at 2017-05-25 13:39:55,966 and sent to FQDN:PORT at
>>> >> 2017-05-25 13:39:56,089; send took 129 millis
>>> >> 2017-05-25 13:40:01,629 INFO [Curator-ConnectionStateManager-0]
>>> >> o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.
>>> >> leader.election.CuratorLeaderElectionManager$ElectionListener@68f8b6a2
>>> >> Connection State changed to SUSPENDED
>>> >> 2017-05-25 13:40:01,629 INFO [Curator-ConnectionStateManager-0]
>>> >> o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.
>>> >> leader.election.CuratorLeaderElectionManager$ElectionListener@663f55cd
>>> >> Connection State changed to SUSPENDED
>>> >> 2017-05-25 13:40:02,412 INFO [main-EventThread] o.a.c.f.state.
>>> ConnectinoStateManager
>>> >> State change: RECONNECTED
>>> >> 2017-05-25 13:40:02,413 INFO [Curator-ConnectionStateManager-0]
>>> >> o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.
>>> >> leader.election.CuratorLeaderElectionManager$ElectionListener@68f8b6a2
>>> >> Connection State changed to RECONNECTED
>>> >> 2017-05-25 13:40:02,413 INFO [Curator-ConnectionStateManager-0]
>>> >> o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.
>>> >> leader.election.CuratorLeaderElectionManager$ElectionListener@663f55cd
>>> >> Connection State changed to RECONNECTED
>>> >> 2017-05-25 13:40:02,550 INFO [Clustering Tasks Thread-1] o.a.n.c.c.
>>> ClusterProtocolHeaertbeater
>>> >> Heartbeat create at 2017-05-25 13:40:01,632 and sent to FQDN:PORT at
>>> >> 2017-05-25 13:40:02,550; send took 917 millis
>>> >> 2017-05-25 13:40:07,787 INFO [Clustering Tasks Thread-1] o.a.n.c.c.
>>> ClusterProtocolHeaertbeater
>>> >> Heartbeat create at 2017-05-25 13:40:07,657 and sent to FQDN:PORT at
>>> >> 2017-05-25 13:40:07,787; send took 129 millis
>>> >>
>>> >> I will work on setting up an external ZK next, but would still like some
>>> >> insight to what is being observed with the embedded ZK.
>>> >>
>>> >> Thanks,
>>> >> Mark
>>> >>
>>> >>
>>> >>
>>> >>
>>> >> On Wed, May 24, 2017 at 3:57 PM, Mark Bean <ma...@gmail.com>
>>> wrote:
>>> >>
>>> >>> Yes, we are using the embedded ZK. We will try instantiating and
>>> external
>>> >>> ZK and see if that resolves the problem.
>>> >>>
>>> >>> The load on the system is extremely small. Currently (as Nodes are
>>> >>> disconnecting/reconnecting) all input ports to the flow are turned
>>> off. The
>>> >>> only data in the flow is from a single GenerateFlow generating 5B
>>> every 30
>>> >>> secs.
>>> >>>
>>> >>> Also, it is a 5-node cluster with embedded ZK on each node. First, I
>>> will
>>> >>> try reducing ZK to only 3 nodes. Then, I will try a 3-node external ZK.
>>> >>>
>>> >>> Thanks,
>>> >>> Mark
>>> >>>
>>> >>> On Wed, May 24, 2017 at 11:49 AM, Joe Witt <jo...@gmail.com> wrote:
>>> >>>
>>> >>>> Are you using the embedded Zookeeper?  If yes we recommend using an
>>> >>>> external zookeeper.
>>> >>>>
>>> >>>> What type of load are the systems under when this occurs (cpu,
>>> >>>> network, memory, disk io)? Under high load the default timeouts for
>>> >>>> clustering are too aggressive.  You can relax these for higher load
>>> >>>> clusters and should see good behavior.  Even if the system overall is
>>> >>>> not under all that high of load if you're seeing garbage collection
>>> >>>> pauses that are lengthy and/or frequent it can cause the same high
>>> >>>> load effect as far as the JVM is concerned.
>>> >>>>
>>> >>>> Thanks
>>> >>>> Joe
>>> >>>>
>>> >>>> On Wed, May 24, 2017 at 9:11 AM, Mark Bean <ma...@gmail.com>
>>> >>>> wrote:
>>> >>>> > We have a cluster which is showing signs of instability. The Primary
>>> >>>> Node
>>> >>>> > and Coordinator are reassigned to different nodes every several
>>> >>>> minutes. I
>>> >>>> > believe this is due to lack of heartbeat or other coordination. The
>>> >>>> > following error occurs periodically in the nifi-app.log
>>> >>>> >
>>> >>>> > ERROR [CommitProcessor:1] o.apache.zookeeper.server.NIOServerCnxn
>>> >>>> > Unexpected Exception:
>>> >>>> > java.nio.channels.CancelledKeyException: null
>>> >>>> >         at sun.nio.ch.SelectionKeyImpl.en
>>> >>>> sureValid(SectionKeyImpl.java:73)
>>> >>>> >         at sun.nio.ch.SelectionKeyImpl.in
>>> >>>> terestOps(SelctionKeyImpl.java:77)
>>> >>>> >         at
>>> >>>> > org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServ
>>> >>>> erCnxn.java:151)
>>> >>>> >         at
>>> >>>> > org.apache.zookeeper.server.NIOServerCnXn.sendResopnse(NIOSe
>>> >>>> rverCnxn.java:1081)
>>> >>>> >         at
>>> >>>> > org.apache.zookeeper.server.FinalRequestProcessor.processReq
>>> >>>> uest(FinalRequestProcessor.java:404)
>>> >>>> >         at
>>> >>>> > org.apache.zookeeper.server.quorum.CommitProcessor.run(Commi
>>> >>>> tProcessor.java:74)
>>> >>>> >
>>> >>>> > Apache NiFi 1.2.0
>>> >>>> >
>>> >>>> > Thoughts?
>>> >>>>
>>> >>>
>>> >>>
>>> >>
>>>

Re: unstable cluster

Posted by Joe Witt <jo...@gmail.com>.
Ok.  Well, as a point of comparison, I'm looking at heartbeat logs from
another cluster and the times are consistently 1-3 millis for the
send.  Yours above show 100+ ms typically, with one north of 900 ms.  Not
sure how relevant that is, but something I noticed.
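
For clusters where heartbeat sends routinely take hundreds of milliseconds,
the earlier suggestion to relax the clustering timeouts maps to properties
along these lines in nifi.properties (the values shown are illustrative
only, not recommendations):

  nifi.cluster.protocol.heartbeat.interval=15 sec
  nifi.cluster.node.connection.timeout=30 sec
  nifi.cluster.node.read.timeout=30 sec
  nifi.zookeeper.connect.timeout=10 secs
  nifi.zookeeper.session.timeout=10 secs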

On Thu, May 25, 2017 at 11:29 AM, Mark Bean <ma...@gmail.com> wrote:
> ping shows acceptably fast response time between servers, approximately
> 0.100-0.150 ms
>
>
> On Thu, May 25, 2017 at 11:13 AM, Joe Witt <jo...@gmail.com> wrote:
>
>> have you evaluated latency across the machines in your cluster?  I ask
>> because 122ms is pretty long and 917ms is very long.  Are these nodes
>> across a WAN link?
>>
>> On Thu, May 25, 2017 at 11:08 AM, Mark Bean <ma...@gmail.com> wrote:
>> > Update: now all 5 nodes, regardless of ZK server, are indicating
>> SUSPENDED
>> > -> RECONNECTED.
>> >
>> > On Thu, May 25, 2017 at 10:23 AM, Mark Bean <ma...@gmail.com>
>> wrote:
>> >
>> >> I reduced the number of embedded ZooKeeper servers on the 5-Node NiFi
>> >> Cluster from 5 to 3. This has improved the situation. I do not see any
>> of
>> >> the three Nodes which are also ZK servers disconnecting/reconnecting to
>> the
>> >> cluster as before. However, the two Nodes which are not running ZK
>> continue
>> >> to disconnect and reconnect. The following is taken from one of the
>> non-ZK
>> >> Nodes. It's curious that some messages are issued twice from the same
>> >> thread, but reference a different object
>> >>
>> >> nifi-app.log
>> >> 2017-05-25 13:40:01,628 INFO [main-EventTrhead] o.a.c.f.state.
>> ConnectionStateManager
>> >> State change: SUSPENDED
>> >> 2017-05-25 13:39:45,627 INFO [Clustering Tasks Thread-1] o.a.n.c.c.
>> ClusterProtocolHeaertbeater
>> >> Heartbeat create at 2017-05-25 13:39:45,504 and sent to FQDN:PORT at
>> >> 2017-05-25 13:39:45,627; send took 122 millis
>> >> 2017-05-25 13:39:50,862 INFO [Clustering Tasks Thread-1] o.a.n.c.c.
>> ClusterProtocolHeaertbeater
>> >> Heartbeat create at 2017-05-25 13:39:50,732 and sent to FQDN:PORT at
>> >> 2017-05-25 13:39:50,862; send took 122 millis
>> >> 2017-05-25 13:39:56,089 INFO [Clustering Tasks Thread-1] o.a.n.c.c.
>> ClusterProtocolHeaertbeater
>> >> Heartbeat create at 2017-05-25 13:39:55,966 and sent to FQDN:PORT at
>> >> 2017-05-25 13:39:56,089; send took 129 millis
>> >> 2017-05-25 13:40:01,629 INFO [Curator-ConnectionStateManager-0]
>> >> o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.
>> >> leader.election.CuratorLeaderElectionManager$ElectionListener@68f8b6a2
>> >> Connection State changed to SUSPENDED
>> >> 2017-05-25 13:40:01,629 INFO [Curator-ConnectionStateManager-0]
>> >> o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.
>> >> leader.election.CuratorLeaderElectionManager$ElectionListener@663f55cd
>> >> Connection State changed to SUSPENDED
>> >> 2017-05-25 13:40:02,412 INFO [main-EventThread] o.a.c.f.state.
>> ConnectinoStateManager
>> >> State change: RECONNECTED
>> >> 2017-05-25 13:40:02,413 INFO [Curator-ConnectionStateManager-0]
>> >> o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.
>> >> leader.election.CuratorLeaderElectionManager$ElectionListener@68f8b6a2
>> >> Connection State changed to RECONNECTED
>> >> 2017-05-25 13:40:02,413 INFO [Curator-ConnectionStateManager-0]
>> >> o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.
>> >> leader.election.CuratorLeaderElectionManager$ElectionListener@663f55cd
>> >> Connection State changed to RECONNECTED
>> >> 2017-05-25 13:40:02,550 INFO [Clustering Tasks Thread-1] o.a.n.c.c.
>> ClusterProtocolHeaertbeater
>> >> Heartbeat create at 2017-05-25 13:40:01,632 and sent to FQDN:PORT at
>> >> 2017-05-25 13:40:02,550; send took 917 millis
>> >> 2017-05-25 13:40:07,787 INFO [Clustering Tasks Thread-1] o.a.n.c.c.
>> ClusterProtocolHeaertbeater
>> >> Heartbeat create at 2017-05-25 13:40:07,657 and sent to FQDN:PORT at
>> >> 2017-05-25 13:40:07,787; send took 129 millis
>> >>
>> >> I will work on setting up an external ZK next, but would still like some
>> >> insight to what is being observed with the embedded ZK.
>> >>
>> >> Thanks,
>> >> Mark
>> >>
>> >>
>> >>
>> >>
>> >> On Wed, May 24, 2017 at 3:57 PM, Mark Bean <ma...@gmail.com>
>> wrote:
>> >>
>> >>> Yes, we are using the embedded ZK. We will try instantiating and
>> external
>> >>> ZK and see if that resolves the problem.
>> >>>
>> >>> The load on the system is extremely small. Currently (as Nodes are
>> >>> disconnecting/reconnecting) all input ports to the flow are turned
>> off. The
>> >>> only data in the flow is from a single GenerateFlow generating 5B
>> every 30
>> >>> secs.
>> >>>
>> >>> Also, it is a 5-node cluster with embedded ZK on each node. First, I
>> will
>> >>> try reducing ZK to only 3 nodes. Then, I will try a 3-node external ZK.
>> >>>
>> >>> Thanks,
>> >>> Mark
>> >>>
>> >>> On Wed, May 24, 2017 at 11:49 AM, Joe Witt <jo...@gmail.com> wrote:
>> >>>
>> >>>> Are you using the embedded Zookeeper?  If yes we recommend using an
>> >>>> external zookeeper.
>> >>>>
>> >>>> What type of load are the systems under when this occurs (cpu,
>> >>>> network, memory, disk io)? Under high load the default timeouts for
>> >>>> clustering are too aggressive.  You can relax these for higher load
>> >>>> clusters and should see good behavior.  Even if the system overall is
>> >>>> not under all that high of load if you're seeing garbage collection
>> >>>> pauses that are lengthy and/or frequent it can cause the same high
>> >>>> load effect as far as the JVM is concerned.
>> >>>>
>> >>>> Thanks
>> >>>> Joe
>> >>>>
>> >>>> On Wed, May 24, 2017 at 9:11 AM, Mark Bean <ma...@gmail.com>
>> >>>> wrote:
>> >>>> > We have a cluster which is showing signs of instability. The Primary
>> >>>> Node
>> >>>> > and Coordinator are reassigned to different nodes every several
>> >>>> minutes. I
>> >>>> > believe this is due to lack of heartbeat or other coordination. The
>> >>>> > following error occurs periodically in the nifi-app.log
>> >>>> >
>> >>>> > ERROR [CommitProcessor:1] o.apache.zookeeper.server.NIOServerCnxn
>> >>>> > Unexpected Exception:
>> >>>> > java.nio.channels.CancelledKeyException: null
>> >>>> >         at sun.nio.ch.SelectionKeyImpl.en
>> >>>> sureValid(SectionKeyImpl.java:73)
>> >>>> >         at sun.nio.ch.SelectionKeyImpl.in
>> >>>> terestOps(SelctionKeyImpl.java:77)
>> >>>> >         at
>> >>>> > org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServ
>> >>>> erCnxn.java:151)
>> >>>> >         at
>> >>>> > org.apache.zookeeper.server.NIOServerCnXn.sendResopnse(NIOSe
>> >>>> rverCnxn.java:1081)
>> >>>> >         at
>> >>>> > org.apache.zookeeper.server.FinalRequestProcessor.processReq
>> >>>> uest(FinalRequestProcessor.java:404)
>> >>>> >         at
>> >>>> > org.apache.zookeeper.server.quorum.CommitProcessor.run(Commi
>> >>>> tProcessor.java:74)
>> >>>> >
>> >>>> > Apache NiFi 1.2.0
>> >>>> >
>> >>>> > Thoughts?
>> >>>>
>> >>>
>> >>>
>> >>
>>

Re: unstable cluster

Posted by Mark Bean <ma...@gmail.com>.
Ping shows an acceptably fast response time between servers, approximately
0.100-0.150 ms.


On Thu, May 25, 2017 at 11:13 AM, Joe Witt <jo...@gmail.com> wrote:

> have you evaluated latency across the machines in your cluster?  I ask
> because 122ms is pretty long and 917ms is very long.  Are these nodes
> across a WAN link?
>
> On Thu, May 25, 2017 at 11:08 AM, Mark Bean <ma...@gmail.com> wrote:
> > Update: now all 5 nodes, regardless of ZK server, are indicating
> SUSPENDED
> > -> RECONNECTED.
> >
> > On Thu, May 25, 2017 at 10:23 AM, Mark Bean <ma...@gmail.com>
> wrote:
> >
> >> I reduced the number of embedded ZooKeeper servers on the 5-Node NiFi
> >> Cluster from 5 to 3. This has improved the situation. I do not see any
> of
> >> the three Nodes which are also ZK servers disconnecting/reconnecting to
> the
> >> cluster as before. However, the two Nodes which are not running ZK
> continue
> >> to disconnect and reconnect. The following is taken from one of the
> non-ZK
> >> Nodes. It's curious that some messages are issued twice from the same
> >> thread, but reference a different object
> >>
> >> nifi-app.log
> >> 2017-05-25 13:40:01,628 INFO [main-EventTrhead] o.a.c.f.state.
> ConnectionStateManager
> >> State change: SUSPENDED
> >> 2017-05-25 13:39:45,627 INFO [Clustering Tasks Thread-1] o.a.n.c.c.
> ClusterProtocolHeaertbeater
> >> Heartbeat create at 2017-05-25 13:39:45,504 and sent to FQDN:PORT at
> >> 2017-05-25 13:39:45,627; send took 122 millis
> >> 2017-05-25 13:39:50,862 INFO [Clustering Tasks Thread-1] o.a.n.c.c.
> ClusterProtocolHeaertbeater
> >> Heartbeat create at 2017-05-25 13:39:50,732 and sent to FQDN:PORT at
> >> 2017-05-25 13:39:50,862; send took 122 millis
> >> 2017-05-25 13:39:56,089 INFO [Clustering Tasks Thread-1] o.a.n.c.c.
> ClusterProtocolHeaertbeater
> >> Heartbeat create at 2017-05-25 13:39:55,966 and sent to FQDN:PORT at
> >> 2017-05-25 13:39:56,089; send took 129 millis
> >> 2017-05-25 13:40:01,629 INFO [Curator-ConnectionStateManager-0]
> >> o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.
> >> leader.election.CuratorLeaderElectionManager$ElectionListener@68f8b6a2
> >> Connection State changed to SUSPENDED
> >> 2017-05-25 13:40:01,629 INFO [Curator-ConnectionStateManager-0]
> >> o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.
> >> leader.election.CuratorLeaderElectionManager$ElectionListener@663f55cd
> >> Connection State changed to SUSPENDED
> >> 2017-05-25 13:40:02,412 INFO [main-EventThread] o.a.c.f.state.
> ConnectinoStateManager
> >> State change: RECONNECTED
> >> 2017-05-25 13:40:02,413 INFO [Curator-ConnectionStateManager-0]
> >> o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.
> >> leader.election.CuratorLeaderElectionManager$ElectionListener@68f8b6a2
> >> Connection State changed to RECONNECTED
> >> 2017-05-25 13:40:02,413 INFO [Curator-ConnectionStateManager-0]
> >> o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.
> >> leader.election.CuratorLeaderElectionManager$ElectionListener@663f55cd
> >> Connection State changed to RECONNECTED
> >> 2017-05-25 13:40:02,550 INFO [Clustering Tasks Thread-1] o.a.n.c.c.
> ClusterProtocolHeaertbeater
> >> Heartbeat create at 2017-05-25 13:40:01,632 and sent to FQDN:PORT at
> >> 2017-05-25 13:40:02,550; send took 917 millis
> >> 2017-05-25 13:40:07,787 INFO [Clustering Tasks Thread-1] o.a.n.c.c.
> ClusterProtocolHeaertbeater
> >> Heartbeat create at 2017-05-25 13:40:07,657 and sent to FQDN:PORT at
> >> 2017-05-25 13:40:07,787; send took 129 millis
> >>
> >> I will work on setting up an external ZK next, but would still like some
> >> insight to what is being observed with the embedded ZK.
> >>
> >> Thanks,
> >> Mark
> >>
> >>
> >>
> >>
> >> On Wed, May 24, 2017 at 3:57 PM, Mark Bean <ma...@gmail.com>
> wrote:
> >>
> >>> Yes, we are using the embedded ZK. We will try instantiating and
> external
> >>> ZK and see if that resolves the problem.
> >>>
> >>> The load on the system is extremely small. Currently (as Nodes are
> >>> disconnecting/reconnecting) all input ports to the flow are turned
> off. The
> >>> only data in the flow is from a single GenerateFlow generating 5B
> every 30
> >>> secs.
> >>>
> >>> Also, it is a 5-node cluster with embedded ZK on each node. First, I
> will
> >>> try reducing ZK to only 3 nodes. Then, I will try a 3-node external ZK.
> >>>
> >>> Thanks,
> >>> Mark
> >>>
> >>> On Wed, May 24, 2017 at 11:49 AM, Joe Witt <jo...@gmail.com> wrote:
> >>>
> >>>> Are you using the embedded Zookeeper?  If yes we recommend using an
> >>>> external zookeeper.
> >>>>
> >>>> What type of load are the systems under when this occurs (cpu,
> >>>> network, memory, disk io)? Under high load the default timeouts for
> >>>> clustering are too aggressive.  You can relax these for higher load
> >>>> clusters and should see good behavior.  Even if the system overall is
> >>>> not under all that high of load if you're seeing garbage collection
> >>>> pauses that are lengthy and/or frequent it can cause the same high
> >>>> load effect as far as the JVM is concerned.
> >>>>
> >>>> Thanks
> >>>> Joe
> >>>>
> >>>> On Wed, May 24, 2017 at 9:11 AM, Mark Bean <ma...@gmail.com>
> >>>> wrote:
> >>>> > We have a cluster which is showing signs of instability. The Primary
> >>>> Node
> >>>> > and Coordinator are reassigned to different nodes every several
> >>>> minutes. I
> >>>> > believe this is due to lack of heartbeat or other coordination. The
> >>>> > following error occurs periodically in the nifi-app.log
> >>>> >
> >>>> > ERROR [CommitProcessor:1] o.apache.zookeeper.server.NIOServerCnxn
> >>>> > Unexpected Exception:
> >>>> > java.nio.channels.CancelledKeyException: null
> >>>> >         at sun.nio.ch.SelectionKeyImpl.en
> >>>> sureValid(SectionKeyImpl.java:73)
> >>>> >         at sun.nio.ch.SelectionKeyImpl.in
> >>>> terestOps(SelctionKeyImpl.java:77)
> >>>> >         at
> >>>> > org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServ
> >>>> erCnxn.java:151)
> >>>> >         at
> >>>> > org.apache.zookeeper.server.NIOServerCnXn.sendResopnse(NIOSe
> >>>> rverCnxn.java:1081)
> >>>> >         at
> >>>> > org.apache.zookeeper.server.FinalRequestProcessor.processReq
> >>>> uest(FinalRequestProcessor.java:404)
> >>>> >         at
> >>>> > org.apache.zookeeper.server.quorum.CommitProcessor.run(Commi
> >>>> tProcessor.java:74)
> >>>> >
> >>>> > Apache NiFi 1.2.0
> >>>> >
> >>>> > Thoughts?
> >>>>
> >>>
> >>>
> >>
>

Re: unstable cluster

Posted by Joe Witt <jo...@gmail.com>.
Have you evaluated latency across the machines in your cluster? I ask
because 122 ms is pretty long and 917 ms is very long. Are these nodes
across a WAN link?

On Thu, May 25, 2017 at 11:08 AM, Mark Bean <ma...@gmail.com> wrote:
> Update: now all 5 nodes, regardless of ZK server, are indicating SUSPENDED
> -> RECONNECTED.
>
> On Thu, May 25, 2017 at 10:23 AM, Mark Bean <ma...@gmail.com> wrote:
>
>> I reduced the number of embedded ZooKeeper servers on the 5-Node NiFi
>> Cluster from 5 to 3. This has improved the situation. I do not see any of
>> the three Nodes which are also ZK servers disconnecting/reconnecting to the
>> cluster as before. However, the two Nodes which are not running ZK continue
>> to disconnect and reconnect. The following is taken from one of the non-ZK
>> Nodes. It's curious that some messages are issued twice from the same
>> thread, but reference a different object
>>
>> nifi-app.log
>> 2017-05-25 13:40:01,628 INFO [main-EventTrhead] o.a.c.f.state.ConnectionStateManager
>> State change: SUSPENDED
>> 2017-05-25 13:39:45,627 INFO [Clustering Tasks Thread-1] o.a.n.c.c.ClusterProtocolHeaertbeater
>> Heartbeat create at 2017-05-25 13:39:45,504 and sent to FQDN:PORT at
>> 2017-05-25 13:39:45,627; send took 122 millis
>> 2017-05-25 13:39:50,862 INFO [Clustering Tasks Thread-1] o.a.n.c.c.ClusterProtocolHeaertbeater
>> Heartbeat create at 2017-05-25 13:39:50,732 and sent to FQDN:PORT at
>> 2017-05-25 13:39:50,862; send took 122 millis
>> 2017-05-25 13:39:56,089 INFO [Clustering Tasks Thread-1] o.a.n.c.c.ClusterProtocolHeaertbeater
>> Heartbeat create at 2017-05-25 13:39:55,966 and sent to FQDN:PORT at
>> 2017-05-25 13:39:56,089; send took 129 millis
>> 2017-05-25 13:40:01,629 INFO [Curator-ConnectionStateManager-0]
>> o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.
>> leader.election.CuratorLeaderElectionManager$ElectionListener@68f8b6a2
>> Connection State changed to SUSPENDED
>> 2017-05-25 13:40:01,629 INFO [Curator-ConnectionStateManager-0]
>> o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.
>> leader.election.CuratorLeaderElectionManager$ElectionListener@663f55cd
>> Connection State changed to SUSPENDED
>> 2017-05-25 13:40:02,412 INFO [main-EventThread] o.a.c.f.state.ConnectinoStateManager
>> State change: RECONNECTED
>> 2017-05-25 13:40:02,413 INFO [Curator-ConnectionStateManager-0]
>> o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.
>> leader.election.CuratorLeaderElectionManager$ElectionListener@68f8b6a2
>> Connection State changed to RECONNECTED
>> 2017-05-25 13:40:02,413 INFO [Curator-ConnectionStateManager-0]
>> o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.
>> leader.election.CuratorLeaderElectionManager$ElectionListener@663f55cd
>> Connection State changed to RECONNECTED
>> 2017-05-25 13:40:02,550 INFO [Clustering Tasks Thread-1] o.a.n.c.c.ClusterProtocolHeaertbeater
>> Heartbeat create at 2017-05-25 13:40:01,632 and sent to FQDN:PORT at
>> 2017-05-25 13:40:02,550; send took 917 millis
>> 2017-05-25 13:40:07,787 INFO [Clustering Tasks Thread-1] o.a.n.c.c.ClusterProtocolHeaertbeater
>> Heartbeat create at 2017-05-25 13:40:07,657 and sent to FQDN:PORT at
>> 2017-05-25 13:40:07,787; send took 129 millis
>>
>> I will work on setting up an external ZK next, but would still like some
>> insight to what is being observed with the embedded ZK.
>>
>> Thanks,
>> Mark
>>
>>
>>
>>
>> On Wed, May 24, 2017 at 3:57 PM, Mark Bean <ma...@gmail.com> wrote:
>>
>>> Yes, we are using the embedded ZK. We will try instantiating and external
>>> ZK and see if that resolves the problem.
>>>
>>> The load on the system is extremely small. Currently (as Nodes are
>>> disconnecting/reconnecting) all input ports to the flow are turned off. The
>>> only data in the flow is from a single GenerateFlow generating 5B every 30
>>> secs.
>>>
>>> Also, it is a 5-node cluster with embedded ZK on each node. First, I will
>>> try reducing ZK to only 3 nodes. Then, I will try a 3-node external ZK.
>>>
>>> Thanks,
>>> Mark
>>>
>>> On Wed, May 24, 2017 at 11:49 AM, Joe Witt <jo...@gmail.com> wrote:
>>>
>>>> Are you using the embedded Zookeeper?  If yes we recommend using an
>>>> external zookeeper.
>>>>
>>>> What type of load are the systems under when this occurs (cpu,
>>>> network, memory, disk io)? Under high load the default timeouts for
>>>> clustering are too aggressive.  You can relax these for higher load
>>>> clusters and should see good behavior.  Even if the system overall is
>>>> not under all that high of load if you're seeing garbage collection
>>>> pauses that are lengthy and/or frequent it can cause the same high
>>>> load effect as far as the JVM is concerned.
>>>>
>>>> Thanks
>>>> Joe
>>>>
>>>> On Wed, May 24, 2017 at 9:11 AM, Mark Bean <ma...@gmail.com>
>>>> wrote:
>>>> > We have a cluster which is showing signs of instability. The Primary
>>>> Node
>>>> > and Coordinator are reassigned to different nodes every several
>>>> minutes. I
>>>> > believe this is due to lack of heartbeat or other coordination. The
>>>> > following error occurs periodically in the nifi-app.log
>>>> >
>>>> > ERROR [CommitProcessor:1] o.apache.zookeeper.server.NIOServerCnxn
>>>> > Unexpected Exception:
>>>> > java.nio.channels.CancelledKeyException: null
>>>> >         at sun.nio.ch.SelectionKeyImpl.en
>>>> sureValid(SectionKeyImpl.java:73)
>>>> >         at sun.nio.ch.SelectionKeyImpl.in
>>>> terestOps(SelctionKeyImpl.java:77)
>>>> >         at
>>>> > org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServ
>>>> erCnxn.java:151)
>>>> >         at
>>>> > org.apache.zookeeper.server.NIOServerCnXn.sendResopnse(NIOSe
>>>> rverCnxn.java:1081)
>>>> >         at
>>>> > org.apache.zookeeper.server.FinalRequestProcessor.processReq
>>>> uest(FinalRequestProcessor.java:404)
>>>> >         at
>>>> > org.apache.zookeeper.server.quorum.CommitProcessor.run(Commi
>>>> tProcessor.java:74)
>>>> >
>>>> > Apache NiFi 1.2.0
>>>> >
>>>> > Thoughts?
>>>>
>>>
>>>
>>

Re: unstable cluster

Posted by Mark Bean <ma...@gmail.com>.
Update: now all 5 nodes, regardless of ZK server, are indicating SUSPENDED
-> RECONNECTED.
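
In case it is useful, this is the sort of check I plan to run against each
embedded ZK server to see whether the quorum itself is healthy (hostnames
are placeholders; 2181 is assumed to be the clientPort from
zookeeper.properties):

  # ask each ZooKeeper server whether it is up and what role it holds
  for host in node1 node2 node3; do
    echo -n "$host: "; echo ruok | nc $host 2181; echo
    echo stat | nc $host 2181 | grep Mode
  done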

On Thu, May 25, 2017 at 10:23 AM, Mark Bean <ma...@gmail.com> wrote:

> I reduced the number of embedded ZooKeeper servers on the 5-Node NiFi
> Cluster from 5 to 3. This has improved the situation. I do not see any of
> the three Nodes which are also ZK servers disconnecting/reconnecting to the
> cluster as before. However, the two Nodes which are not running ZK continue
> to disconnect and reconnect. The following is taken from one of the non-ZK
> Nodes. It's curious that some messages are issued twice from the same
> thread, but reference a different object
>
> nifi-app.log
> 2017-05-25 13:40:01,628 INFO [main-EventTrhead] o.a.c.f.state.ConnectionStateManager
> State change: SUSPENDED
> 2017-05-25 13:39:45,627 INFO [Clustering Tasks Thread-1] o.a.n.c.c.ClusterProtocolHeaertbeater
> Heartbeat create at 2017-05-25 13:39:45,504 and sent to FQDN:PORT at
> 2017-05-25 13:39:45,627; send took 122 millis
> 2017-05-25 13:39:50,862 INFO [Clustering Tasks Thread-1] o.a.n.c.c.ClusterProtocolHeaertbeater
> Heartbeat create at 2017-05-25 13:39:50,732 and sent to FQDN:PORT at
> 2017-05-25 13:39:50,862; send took 122 millis
> 2017-05-25 13:39:56,089 INFO [Clustering Tasks Thread-1] o.a.n.c.c.ClusterProtocolHeaertbeater
> Heartbeat create at 2017-05-25 13:39:55,966 and sent to FQDN:PORT at
> 2017-05-25 13:39:56,089; send took 129 millis
> 2017-05-25 13:40:01,629 INFO [Curator-ConnectionStateManager-0]
> o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.
> leader.election.CuratorLeaderElectionManager$ElectionListener@68f8b6a2
> Connection State changed to SUSPENDED
> 2017-05-25 13:40:01,629 INFO [Curator-ConnectionStateManager-0]
> o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.
> leader.election.CuratorLeaderElectionManager$ElectionListener@663f55cd
> Connection State changed to SUSPENDED
> 2017-05-25 13:40:02,412 INFO [main-EventThread] o.a.c.f.state.ConnectinoStateManager
> State change: RECONNECTED
> 2017-05-25 13:40:02,413 INFO [Curator-ConnectionStateManager-0]
> o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.
> leader.election.CuratorLeaderElectionManager$ElectionListener@68f8b6a2
> Connection State changed to RECONNECTED
> 2017-05-25 13:40:02,413 INFO [Curator-ConnectionStateManager-0]
> o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.
> leader.election.CuratorLeaderElectionManager$ElectionListener@663f55cd
> Connection State changed to RECONNECTED
> 2017-05-25 13:40:02,550 INFO [Clustering Tasks Thread-1] o.a.n.c.c.ClusterProtocolHeaertbeater
> Heartbeat create at 2017-05-25 13:40:01,632 and sent to FQDN:PORT at
> 2017-05-25 13:40:02,550; send took 917 millis
> 2017-05-25 13:40:07,787 INFO [Clustering Tasks Thread-1] o.a.n.c.c.ClusterProtocolHeaertbeater
> Heartbeat create at 2017-05-25 13:40:07,657 and sent to FQDN:PORT at
> 2017-05-25 13:40:07,787; send took 129 millis
>
> I will work on setting up an external ZK next, but would still like some
> insight to what is being observed with the embedded ZK.
>
> Thanks,
> Mark
>
>
>
>
> On Wed, May 24, 2017 at 3:57 PM, Mark Bean <ma...@gmail.com> wrote:
>
>> Yes, we are using the embedded ZK. We will try instantiating and external
>> ZK and see if that resolves the problem.
>>
>> The load on the system is extremely small. Currently (as Nodes are
>> disconnecting/reconnecting) all input ports to the flow are turned off. The
>> only data in the flow is from a single GenerateFlow generating 5B every 30
>> secs.
>>
>> Also, it is a 5-node cluster with embedded ZK on each node. First, I will
>> try reducing ZK to only 3 nodes. Then, I will try a 3-node external ZK.
>>
>> Thanks,
>> Mark
>>
>> On Wed, May 24, 2017 at 11:49 AM, Joe Witt <jo...@gmail.com> wrote:
>>
>>> Are you using the embedded Zookeeper?  If yes we recommend using an
>>> external zookeeper.
>>>
>>> What type of load are the systems under when this occurs (cpu,
>>> network, memory, disk io)? Under high load the default timeouts for
>>> clustering are too aggressive.  You can relax these for higher load
>>> clusters and should see good behavior.  Even if the system overall is
>>> not under all that high of load if you're seeing garbage collection
>>> pauses that are lengthy and/or frequent it can cause the same high
>>> load effect as far as the JVM is concerned.
>>>
>>> Thanks
>>> Joe
>>>
>>> On Wed, May 24, 2017 at 9:11 AM, Mark Bean <ma...@gmail.com>
>>> wrote:
>>> > We have a cluster which is showing signs of instability. The Primary
>>> Node
>>> > and Coordinator are reassigned to different nodes every several
>>> minutes. I
>>> > believe this is due to lack of heartbeat or other coordination. The
>>> > following error occurs periodically in the nifi-app.log
>>> >
>>> > ERROR [CommitProcessor:1] o.apache.zookeeper.server.NIOServerCnxn
>>> > Unexpected Exception:
>>> > java.nio.channels.CancelledKeyException: null
>>> >         at sun.nio.ch.SelectionKeyImpl.en
>>> sureValid(SectionKeyImpl.java:73)
>>> >         at sun.nio.ch.SelectionKeyImpl.in
>>> terestOps(SelctionKeyImpl.java:77)
>>> >         at
>>> > org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServ
>>> erCnxn.java:151)
>>> >         at
>>> > org.apache.zookeeper.server.NIOServerCnXn.sendResopnse(NIOSe
>>> rverCnxn.java:1081)
>>> >         at
>>> > org.apache.zookeeper.server.FinalRequestProcessor.processReq
>>> uest(FinalRequestProcessor.java:404)
>>> >         at
>>> > org.apache.zookeeper.server.quorum.CommitProcessor.run(Commi
>>> tProcessor.java:74)
>>> >
>>> > Apache NiFi 1.2.0
>>> >
>>> > Thoughts?
>>>
>>
>>
>

Re: unstable cluster

Posted by Mark Bean <ma...@gmail.com>.
I reduced the number of embedded ZooKeeper servers on the 5-Node NiFi
Cluster from 5 to 3. This has improved the situation: I no longer see the
three Nodes that are also ZK servers disconnecting and reconnecting to the
cluster as before. However, the two Nodes which are not running ZK continue
to disconnect and reconnect. The following is taken from one of the non-ZK
Nodes. It's curious that some messages are issued twice from the same
thread, but reference a different object.

nifi-app.log
2017-05-25 13:40:01,628 INFO [main-EventThread]
o.a.c.f.state.ConnectionStateManager State change: SUSPENDED
2017-05-25 13:39:45,627 INFO [Clustering Tasks Thread-1]
o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-25
13:39:45,504 and sent to FQDN:PORT at 2017-05-25 13:39:45,627; send took
122 millis
2017-05-25 13:39:50,862 INFO [Clustering Tasks Thread-1]
o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-25
13:39:50,732 and sent to FQDN:PORT at 2017-05-25 13:39:50,862; send took
122 millis
2017-05-25 13:39:56,089 INFO [Clustering Tasks Thread-1]
o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-25
13:39:55,966 and sent to FQDN:PORT at 2017-05-25 13:39:56,089; send took
129 millis
2017-05-25 13:40:01,629 INFO [Curator-ConnectionStateManager-0]
o.a.n.c.l.e.CuratorLeaderElectionManager
org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@68f8b6a2
Connection State changed to SUSPENDED
2017-05-25 13:40:01,629 INFO [Curator-ConnectionStateManager-0]
o.a.n.c.l.e.CuratorLeaderElectionManager
org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@663f55cd
Connection State changed to SUSPENDED
2017-05-25 13:40:02,412 INFO [main-EventThread]
o.a.c.f.state.ConnectionStateManager State change: RECONNECTED
2017-05-25 13:40:02,413 INFO [Curator-ConnectionStateManager-0]
o.a.n.c.l.e.CuratorLeaderElectionManager
org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@68f8b6a2
Connection State changed to RECONNECTED
2017-05-25 13:40:02,413 INFO [Curator-ConnectionStateManager-0]
o.a.n.c.l.e.CuratorLeaderElectionManager
org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@663f55cd
Connection State changed to RECONNECTED
2017-05-25 13:40:02,550 INFO [Clustering Tasks Thread-1]
o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-25
13:40:01,632 and sent to FQDN:PORT at 2017-05-25 13:40:02,550; send took
917 millis
2017-05-25 13:40:07,787 INFO [Clustering Tasks Thread-1]
o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-25
13:40:07,657 and sent to FQDN:PORT at 2017-05-25 13:40:07,787; send took
129 millis

I will work on setting up an external ZK next, but would still like some
insight into what is being observed with the embedded ZK.
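
For reference, going from 5 embedded servers to 3 essentially means
trimming the server list in conf/zookeeper.properties and enabling the
embedded server on only three nodes; roughly along these lines, with
placeholder hostnames:

  # conf/zookeeper.properties (same server list on every node)
  server.1=node1.example.com:2888:3888
  server.2=node2.example.com:2888:3888
  server.3=node3.example.com:2888:3888

  # nifi.properties
  # true on the three ZK nodes, false on the other two
  nifi.state.management.embedded.zookeeper.start=true
  nifi.zookeeper.connect.string=node1.example.com:2181,node2.example.com:2181,node3.example.com:2181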

Thanks,
Mark




On Wed, May 24, 2017 at 3:57 PM, Mark Bean <ma...@gmail.com> wrote:

> Yes, we are using the embedded ZK. We will try instantiating and external
> ZK and see if that resolves the problem.
>
> The load on the system is extremely small. Currently (as Nodes are
> disconnecting/reconnecting) all input ports to the flow are turned off. The
> only data in the flow is from a single GenerateFlow generating 5B every 30
> secs.
>
> Also, it is a 5-node cluster with embedded ZK on each node. First, I will
> try reducing ZK to only 3 nodes. Then, I will try a 3-node external ZK.
>
> Thanks,
> Mark
>
> On Wed, May 24, 2017 at 11:49 AM, Joe Witt <jo...@gmail.com> wrote:
>
>> Are you using the embedded Zookeeper?  If yes we recommend using an
>> external zookeeper.
>>
>> What type of load are the systems under when this occurs (cpu,
>> network, memory, disk io)? Under high load the default timeouts for
>> clustering are too aggressive.  You can relax these for higher load
>> clusters and should see good behavior.  Even if the system overall is
>> not under all that high of load if you're seeing garbage collection
>> pauses that are lengthy and/or frequent it can cause the same high
>> load effect as far as the JVM is concerned.
>>
>> Thanks
>> Joe
>>
>> On Wed, May 24, 2017 at 9:11 AM, Mark Bean <ma...@gmail.com> wrote:
>> > We have a cluster which is showing signs of instability. The Primary
>> Node
>> > and Coordinator are reassigned to different nodes every several
>> minutes. I
>> > believe this is due to lack of heartbeat or other coordination. The
>> > following error occurs periodically in the nifi-app.log
>> >
>> > ERROR [CommitProcessor:1] o.apache.zookeeper.server.NIOServerCnxn
>> > Unexpected Exception:
>> > java.nio.channels.CancelledKeyException: null
>> >         at sun.nio.ch.SelectionKeyImpl.ensureValid(SectionKeyImpl.java:
>> 73)
>> >         at sun.nio.ch.SelectionKeyImpl.interestOps(SelctionKeyImpl.java
>> :77)
>> >         at
>> > org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServ
>> erCnxn.java:151)
>> >         at
>> > org.apache.zookeeper.server.NIOServerCnXn.sendResopnse(NIOSe
>> rverCnxn.java:1081)
>> >         at
>> > org.apache.zookeeper.server.FinalRequestProcessor.processReq
>> uest(FinalRequestProcessor.java:404)
>> >         at
>> > org.apache.zookeeper.server.quorum.CommitProcessor.run(Commi
>> tProcessor.java:74)
>> >
>> > Apache NiFi 1.2.0
>> >
>> > Thoughts?
>>
>
>

Re: unstable cluster

Posted by Mark Bean <ma...@gmail.com>.
Yes, we are using the embedded ZK. We will try instantiating an external
ZK and see if that resolves the problem.

The load on the system is extremely small. Currently (as Nodes are
disconnecting/reconnecting) all input ports to the flow are turned off. The
only data in the flow is from a single GenerateFlowFile processor generating
5 bytes every 30 seconds.

Also, it is a 5-node cluster with embedded ZK on each node. First, I will
try reducing ZK to only 3 nodes. Then, I will try a 3-node external ZK.
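
For the external ZK attempt, the plan is essentially to disable the
embedded server and point every node at the external ensemble; a sketch of
the nifi.properties changes, with placeholder hostnames:

  # nifi.properties on every node
  nifi.state.management.embedded.zookeeper.start=false
  nifi.zookeeper.connect.string=zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181
  nifi.zookeeper.root.node=/nifi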

Thanks,
Mark

On Wed, May 24, 2017 at 11:49 AM, Joe Witt <jo...@gmail.com> wrote:

> Are you using the embedded Zookeeper?  If yes we recommend using an
> external zookeeper.
>
> What type of load are the systems under when this occurs (cpu,
> network, memory, disk io)? Under high load the default timeouts for
> clustering are too aggressive.  You can relax these for higher load
> clusters and should see good behavior.  Even if the system overall is
> not under all that high of load if you're seeing garbage collection
> pauses that are lengthy and/or frequent it can cause the same high
> load effect as far as the JVM is concerned.
>
> Thanks
> Joe
>
> On Wed, May 24, 2017 at 9:11 AM, Mark Bean <ma...@gmail.com> wrote:
> > We have a cluster which is showing signs of instability. The Primary Node
> > and Coordinator are reassigned to different nodes every several minutes.
> I
> > believe this is due to lack of heartbeat or other coordination. The
> > following error occurs periodically in the nifi-app.log
> >
> > ERROR [CommitProcessor:1] o.apache.zookeeper.server.NIOServerCnxn
> > Unexpected Exception:
> > java.nio.channels.CancelledKeyException: null
> >         at sun.nio.ch.SelectionKeyImpl.ensureValid(SectionKeyImpl.
> java:73)
> >         at sun.nio.ch.SelectionKeyImpl.interestOps(SelctionKeyImpl.
> java:77)
> >         at
> > org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(
> NIOServerCnxn.java:151)
> >         at
> > org.apache.zookeeper.server.NIOServerCnXn.sendResopnse(
> NIOServerCnxn.java:1081)
> >         at
> > org.apache.zookeeper.server.FinalRequestProcessor.processRequest(
> FinalRequestProcessor.java:404)
> >         at
> > org.apache.zookeeper.server.quorum.CommitProcessor.run(
> CommitProcessor.java:74)
> >
> > Apache NiFi 1.2.0
> >
> > Thoughts?
>

Re: unstable cluster

Posted by Joe Witt <jo...@gmail.com>.
Are you using the embedded ZooKeeper? If yes, we recommend using an
external ZooKeeper.

What type of load are the systems under when this occurs (CPU,
network, memory, disk I/O)? Under high load the default timeouts for
clustering are too aggressive. You can relax these for higher-load
clusters and should see good behavior. Even if the system overall is
not under all that much load, garbage collection pauses that are
lengthy and/or frequent can cause the same high-load effect as far as
the JVM is concerned.
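
For example, relaxing the relevant settings in nifi.properties to
something in this ballpark usually helps; the values below are
illustrative, not prescriptive:

  nifi.cluster.protocol.heartbeat.interval=15 sec
  nifi.cluster.node.connection.timeout=30 secs
  nifi.cluster.node.read.timeout=30 secs
  nifi.zookeeper.connect.timeout=30 secs
  nifi.zookeeper.session.timeout=30 secs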

Thanks
Joe

On Wed, May 24, 2017 at 9:11 AM, Mark Bean <ma...@gmail.com> wrote:
> We have a cluster which is showing signs of instability. The Primary Node
> and Coordinator are reassigned to different nodes every several minutes. I
> believe this is due to lack of heartbeat or other coordination. The
> following error occurs periodically in the nifi-app.log
>
> ERROR [CommitProcessor:1] o.apache.zookeeper.server.NIOServerCnxn
> Unexpected Exception:
> java.nio.channels.CancelledKeyException: null
>         at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
>         at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
>         at
> org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:151)
>         at
> org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1081)
>         at
> org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:404)
>         at
> org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:74)
>
> Apache NiFi 1.2.0
>
> Thoughts?