Posted to users@nifi.apache.org by Aaron Longfield <al...@gmail.com> on 2016/07/14 15:14:13 UTC

Nifi cluster nodes regularly stop processing any flowfiles

Hi,

I'm having an issue with a small (two node) NiFi cluster where the nodes
will stop processing any queued flowfiles.  I haven't seen any error
messages logged related to it, and when attempting to restart the service,
NiFi doesn't respond and the script forcibly kills it.  This causes
multiple flowfile versions to hang around, and generally makes me worry that
it might be causing data loss.

I'm running the web UI on a different box, and when things stop working, it
stops showing changes to counts in any queues, and the thread count never
changes.  It still thinks the nodes are connecting and responding, though.

My environment is two 8-CPU systems with 60GB of memory each, 48GB of which
is given to the NiFi JVM in bootstrap.conf.  I have timer threads limited to
12 and event threads to 4.  The install is on the current Amazon Linux AMI,
using OpenJDK 1.8.0.91 x64.
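
For reference, the heap settings live in conf/bootstrap.conf; mine look
roughly like the following (argument numbers and values here are
illustrative, not an exact copy of my file):

  # conf/bootstrap.conf
  java.arg.2=-Xms48g
  java.arg.3=-Xmx48g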

Any ideas, other debug steps, or changes that I can try?  I'm running 0.7.0,
having upgraded from 0.6.1, but this has been occurring with both
versions.  The higher the flowfile volume I push through, the faster this
happens.

Thanks for any help there is to give!

-Aaron Longfield

Re: Nifi cluster nodes regularly stop processing any flowfiles

Posted by Mark Payne <ma...@hotmail.com>.
Aaron,

Excellent! Glad that you're seeing better results. Sorry about that. Let us know if you run into any other strangeness!

Thanks
-Mark

> On Aug 3, 2016, at 6:18 PM, Aaron Longfield <al...@gmail.com> wrote:
> 
> I backported the patch from the master branch and it applies without changing much at all.  Workflow processing works fine by my eye, but I do see quite a few provenance warnings logged.  I haven't tried out to see how that repository is working yet, but I just pushed a few million flowfiles through my flows, and output probably 25GB to a remote processor without anything falling over!
> 
> -Aaron
> 
> On Mon, Aug 1, 2016 at 4:08 PM, Joe Witt <joe.witt@gmail.com> wrote:
> Aaron,
> 
> Ok so from a production point of view I'd recommend a small patched
> version of the 0.7 release you were working with.  It might be the
> case that grafting the master line patch for that JIRA into an 0.x
> patch is pretty straight forward.  You could take a look at that as a
> short term option.  We probably should start 0.7.1 and 1.0-M1 type
> release motions soon anyway so this could be a helpful catalyst.
> 
> Thanks
> Joe
> 
> On Mon, Aug 1, 2016 at 4:03 PM, Aaron Longfield <alongfield@gmail.com> wrote:
> > Joe,
> >
> > Sure, I can give that a go.  Any serious bugs that I might run across with
> > that branch that should make me worried about running it on a production
> > flow?
> >
> > -Aaron
> >
> > On Mon, Aug 1, 2016 at 4:01 PM, Joe Witt <joe.witt@gmail.com> wrote:
> >>
> >> Aaron,
> >>
> >> It doesn't look like the 0.x version of that patch has been created
> >> yet.  Any chance you could build master (slated for upcoming 1.x
> >> release) and try that?
> >>
> >> Thanks
> >> Joe
> >>
> >> On Mon, Aug 1, 2016 at 3:30 PM, Aaron Longfield <alongfield@gmail.com>
> >> wrote:
> >> > Great, glad there's already a fixed bug for it!  Is there anything I try
> >> > to
> >> > work around it for now, or at least just get longer processing times
> >> > between
> >> > restarts?
> >> >
> >> > -Aaron
> >> >
> >> > On Mon, Aug 1, 2016 at 11:54 AM, Mark Payne <markap14@hotmail.com>
> >> > wrote:
> >> >>
> >> >> Aaron,
> >> >>
> >> >> Thanks for getting that to us quickly! It is extremely useful.
> >> >>
> >> >> Joe,
> >> >>
> >> >> I do indeed believe this is the same thing. I was in the middle of
> >> >> typing
> >> >> a response, but you beat me to it!
> >> >>
> >> >> Thanks
> >> >> -Mark
> >> >>
> >> >>
> >> >> > On Aug 1, 2016, at 11:49 AM, Joe Witt <joe.witt@gmail.com> wrote:
> >> >> >
> >> >> > Aaron, Mark,
> >> >> >
> >> >> > In looking at the thread-dump provided it looks to me like this is
> >> >> > the
> >> >> > same as what was reported and addressed in
> >> >> > https://issues.apache.org/jira/browse/NIFI-2395
> >> >> >
> >> >> > The fix for this has not yet been released, but it is slated to end up
> >> >> > on the 0.x and 1.0 release lines.
> >> >> >
> >> >> > Mark do you agree it is the same thing by looking at the logs?
> >> >> >
> >> >> > Thanks
> >> >> > Joe
> >> >> >
> >> >> > On Mon, Aug 1, 2016 at 11:39 AM, Aaron Longfield
> >> >> > <alongfield@gmail.com>
> >> >> > wrote:
> >> >> >> Alright, here you go for one of the nodes!
> >> >> >>
> >> >> >> On Mon, Aug 1, 2016 at 10:33 AM, Mark Payne <markap14@hotmail.com>
> >> >> >> wrote:
> >> >> >>>
> >> >> >>> Aaron,
> >> >> >>>
> >> >> >>> Any time that you find NiFi has stopped performing its work, the best
> >> >> >>> thing to do is to take a thread dump and send it to the mailing list.
> >> >> >>> This allows us to determine exactly what is happening, so we know what
> >> >> >>> action is being performed that prevents any other progress.
> >> >> >>>
> >> >> >>> To do this, you can go to the NiFi node that is not performing and
> >> >> >>> run
> >> >> >>> the
> >> >> >>> command:
> >> >> >>>
> >> >> >>> bin/nifi.sh dump thread-dump.txt
> >> >> >>>
> >> >> >>> This will generate a file named thread-dump.txt that you can send
> >> >> >>> to
> >> >> >>> us.
> >> >> >>>
> >> >> >>> Thanks!
> >> >> >>> -Mark
> >> >> >>>
> >> >> >>>
> >> >> >>> On Aug 1, 2016, at 10:19 AM, Aaron Longfield <alongfield@gmail.com>
> >> >> >>> wrote:
> >> >> >>>
> >> >> >>> I've been trying different things to try to fix my NiFi freeze
> >> >> >>> problems,
> >> >> >>> and it seems the most frequent reason that my cluster gets stuck
> >> >> >>> and
> >> >> >>> stops
> >> >> >>> processing has to do with network related processors.  My data
> >> >> >>> enters
> >> >> >>> the
> >> >> >>> environment from Kafka and leaves via a site-to-site output port.
> >> >> >>> After
> >> >> >>> some time processing (sometimes a few minutes, sometimes a few
> >> >> >>> hours)
> >> >> >>> one of
> >> >> >>> those will start logging connection errors, and then that node will
> >> >> >>> stop
> >> >> >>> processing any flowfiles across all processors.
> >> >> >>>
> >> >> >>> So far, this followed me from 0.6.1 to 0.7.0, and on Amazon Linux
> >> >> >>> to
> >> >> >>> RHEL7
> >> >> >>> (although RHEL seems to be happier).  I've tried restricting
> >> >> >>> threads
> >> >> >>> to less
> >> >> >>> than the number of available cores on each node, different heap
> >> >> >>> sizes,
> >> >> >>> and
> >> >> >>> different garbage collectors.  So far none of that has prevented the
> >> >> >>> problem, unfortunately.
> >> >> >>>
> >> >> >>> I'm not quite ready to build all custom processors for my flow
> >> >> >>> logic...
> >> >> >>> most of it is straightforward attribute routing, text replacement,
> >> >> >>> and
> >> >> >>> flowfile merging.
> >> >> >>>
> >> >> >>> What are other things that I could try, or just be doing wrong that
> >> >> >>> could
> >> >> >>> lead to this?  I'm happy to keep trying suggestions and changes; I
> >> >> >>> really
> >> >> >>> want this to work!
> >> >> >>>
> >> >> >>> Thanks,
> >> >> >>> -Aaron
> >> >> >>>
> >> >> >>> On Fri, Jul 15, 2016 at 12:07 PM, Lee Laim <lee.laim@gmail.com>
> >> >> >>> wrote:
> >> >> >>>>
> >> >> >>>> Aaron,
> >> >> >>>>
> >> >> >>>> I ran into an issue where the Execute Stream Command (ESC)
> >> >> >>>> processor
> >> >> >>>> with
> >> >> >>>> many threads would run a legacy script that would hang if the
> >> >> >>>> incoming file
> >> >> >>>> was 'inconsistent'.  It appeared that ESC slowly collected stuck
> >> >> >>>> threads as
> >> >> >>>> malformed data randomly streamed through it. Eventually I ran out
> >> >> >>>> of
> >> >> >>>> threads
> >> >> >>>> as the system was just waiting for a thread to become available.
> >> >> >>>>
> >> >> >>>> It was apparent in the processor statistics where the
> >> >> >>>> flowfiles-out
> >> >> >>>> statistic would eventually step down to zero as threads became
> >> >> >>>> stuck.
> >> >> >>>>
> >> >> >>>> It might be worth trying InvokeScriptedProcessor or building
> >> >> >>>> custom
> >> >> >>>> processors as they provide a means to handle these inconsistencies
> >> >> >>>> more
> >> >> >>>> gracefully.
> >> >> >>>>
> >> >> >>>>
> >> >> >>>>
> >> >> >>>> https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.script.InvokeScriptedProcessor/index.html
> >> >> >>>>
> >> >> >>>> Thanks,
> >> >> >>>> Lee
> >> >> >>>>
> >> >> >>>>
> >> >> >>>>
> >> >> >>>>
> >> >> >>>>
> >> >> >>>> On Fri, Jul 15, 2016 at 6:50 AM, Aaron Longfield
> >> >> >>>> <alongfield@gmail.com>
> >> >> >>>> wrote:
> >> >> >>>>>
> >> >> >>>>> Hi Mark,
> >> >> >>>>>
> >> >> >>>>> I've been using the G1 garbage collector.  I brought the nodes
> >> >> >>>>> down
> >> >> >>>>> to
> >> >> >>>>> 8GB heap and let it run overnight, but processing still got stuck,
> >> >> >>>>> requiring NiFi to be restarted on all nodes.  It took longer to
> >> >> >>>>> happen, but
> >> >> >>>>> they went down after a few hours.  Are there any other things I
> >> >> >>>>> can
> >> >> >>>>> look
> >> >> >>>>> into?
> >> >> >>>>>
> >> >> >>>>> Thanks!
> >> >> >>>>>
> >> >> >>>>> -Aaron
> >> >> >>>>>
> >> >> >>>>> On Thu, Jul 14, 2016 at 2:33 PM, Mark Payne
> >> >> >>>>> <markap14@hotmail.com>
> >> >> >>>>> wrote:
> >> >> >>>>>>
> >> >> >>>>>> Aaron,
> >> >> >>>>>>
> >> >> >>>>>> My guess would be that you are hitting a Full Garbage
> >> >> >>>>>> Collection.
> >> >> >>>>>> With
> >> >> >>>>>> such a huge Java heap, that will cause a "stop the world" pause
> >> >> >>>>>> for
> >> >> >>>>>> quite a
> >> >> >>>>>> long time.
> >> >> >>>>>> Which garbage collector are you using? Have you tried reducing
> >> >> >>>>>> the
> >> >> >>>>>> heap
> >> >> >>>>>> from 48 GB to say 4 or 8 GB?
> >> >> >>>>>>
> >> >> >>>>>> Thanks
> >> >> >>>>>> -Mark
> >> >> >>>>>>
> >> >> >>>>>>
> >> >> >>>>>>> On Jul 14, 2016, at 11:14 AM, Aaron Longfield
> >> >> >>>>>>> <alongfield@gmail.com>
> >> >> >>>>>>> wrote:
> >> >> >>>>>>>
> >> >> >>>>>>> Hi,
> >> >> >>>>>>>
> >> >> >>>>>>> I'm having an issue with a small (two node) NiFi cluster where
> >> >> >>>>>>> the
> >> >> >>>>>>> nodes will stop processing any queued flowfiles.  I haven't
> >> >> >>>>>>> seen
> >> >> >>>>>>> any error
> >> >> >>>>>>> messages logged related to it, and when attempting to restart
> >> >> >>>>>>> the
> >> >> >>>>>>> service,
> >> >> >>>>>>> NiFi doesn't respond and the script forcibly kills it.  This
> >> >> >>>>>>> causes multiple
> >> >> >>>>>>> flowfile version to hang around, and generally makes me feel
> >> >> >>>>>>> like
> >> >> >>>>>>> it might
> >> >> >>>>>>> be causing data loss.
> >> >> >>>>>>>
> >> >> >>>>>>> I'm running the web UI on a different box, and when things stop
> >> >> >>>>>>> working, it stops showing changes to counts in any queues, and
> >> >> >>>>>>> the
> >> >> >>>>>>> thread
> >> >> >>>>>>> count never changes.  It still thinks the nodes are connecting
> >> >> >>>>>>> and
> >> >> >>>>>>> responding, though.
> >> >> >>>>>>>
> >> >> >>>>>>> My environment is two 8 cpu systems w/ 60GB memory with 48GB
> >> >> >>>>>>> given
> >> >> >>>>>>> to
> >> >> >>>>>>> the NiFi JVM in bootstrap.conf.  I have timer threads limited
> >> >> >>>>>>> to
> >> >> >>>>>>> 12, and
> >> >> >>>>>>> event threads to 4.  Install is on the current Amazon Linux AMI
> >> >> >>>>>>> and using
> >> >> >>>>>>> OpenJDK 1.8.0.91 x64.
> >> >> >>>>>>>
> >> >> >>>>>>> Any idea, other debug steps, or changes that I can try?  I'm
> >> >> >>>>>>> running
> >> >> >>>>>>> 0.7.0, having upgraded from 0.6.1, but this has been occurring
> >> >> >>>>>>> with both
> >> >> >>>>>>> versions.  The higher the flowfile volume I push through, the
> >> >> >>>>>>> faster this
> >> >> >>>>>>> happens.
> >> >> >>>>>>>
> >> >> >>>>>>> Thanks for any help there is to give!
> >> >> >>>>>>>
> >> >> >>>>>>> -Aaron Longfield
> >> >> >>>>>>
> >> >> >>>>>
> >> >> >>>>
> >> >> >>>
> >> >> >>>
> >> >> >>
> >> >>
> >> >
> >
> >
> 


Re: Nifi cluster nodes regularly stop processing any flowfiles

Posted by Aaron Longfield <al...@gmail.com>.
I backported the patch from the master branch, and it applied without
changing much at all.  Workflow processing looks fine to my eye, but I do
see quite a few provenance warnings logged.  I haven't dug in to see how
that repository is behaving yet, but I just pushed a few million flowfiles
through my flows and output probably 25GB to a remote processor without
anything falling over!
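
The warnings show up in the app log; I've been pulling them out with
something along these lines (paths assume a default install layout):

  grep -i provenance logs/nifi-app.log | tail -n 20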

-Aaron


Re: Nifi cluster nodes regularly stop processing any flowfiles

Posted by Joe Witt <jo...@gmail.com>.
Aaron,

OK, so from a production point of view I'd recommend a small patched
version of the 0.7 release you were working with.  It might be the case
that grafting the master line patch for that JIRA into an 0.x patch is
pretty straightforward.  You could take a look at that as a short-term
option.  We probably should start 0.7.1 and 1.0-M1 type release motions
soon anyway, so this could be a helpful catalyst.

Thanks
Joe


Re: Nifi cluster nodes regularly stop processing any flowfiles

Posted by Aaron Longfield <al...@gmail.com>.
Joe,

Sure, I can give that a go.  Are there any serious bugs that I might run
across with that branch that should make me worried about running it on a
production flow?

-Aaron


Re: Nifi cluster nodes regularly stop processing any flowfiles

Posted by Joe Witt <jo...@gmail.com>.
Aaron,

It doesn't look like the 0.x version of that patch has been created
yet.  Any chance you could build master (slated for upcoming 1.x
release) and try that?
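
If it helps, a build of master is roughly the following (the GitHub mirror
is fine for this, and the built assembly lands under nifi-assembly/target):

  git clone https://github.com/apache/nifi.git
  cd nifi
  mvn clean install -DskipTests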

Thanks
Joe


Re: Nifi cluster nodes regularly stop processing any flowfiles

Posted by Aaron Longfield <al...@gmail.com>.
Great, glad there's already a fix for it!  Is there anything I can try to
work around it for now, or at least to get longer processing times
between restarts?

-Aaron


Re: Nifi cluster nodes regularly stop processing any flowfiles

Posted by Mark Payne <ma...@hotmail.com>.
Aaron,

Thanks for getting that to us quickly! It is extremely useful.

Joe,

I do indeed believe this is the same thing. I was in the middle of typing a response, but you beat me to it!

Thanks
-Mark


> On Aug 1, 2016, at 11:49 AM, Joe Witt <jo...@gmail.com> wrote:
> 
> Aaron, Mark,
> 
> In looking at the thread-dump provided it looks to me like this is the
> same as what was reported and addressed in
> https://issues.apache.org/jira/browse/NIFI-2395
> 
> The fix for this has not yet been released but it slated to end up on
> an 0.x and 1.0 release line.
> 
> Mark do you agree it is the same thing by looking at the logs?
> 
> Thanks
> Joe
> 
> On Mon, Aug 1, 2016 at 11:39 AM, Aaron Longfield <al...@gmail.com> wrote:
>> Alright, here you go for one of the nodes!
>> 
>> On Mon, Aug 1, 2016 at 10:33 AM, Mark Payne <ma...@hotmail.com> wrote:
>>> 
>>> Aaron,
>>> 
>>> Any time that you find that NiFi has stopped performing its work, the best
>>> thing to do is to take a thread dump and send it
>>> to the mailing list. This allows us to determine what exactly is
>>> happening, so we know what action is being
>>> performed that prevents any other progress.
>>> 
>>> To do this, you can go to the NiFi node that is not performing and run the
>>> command:
>>> 
>>> bin/nifi.sh dump thread-dump.txt
>>> 
>>> This will generate a file named thread-dump.txt that you can send to us.
>>> 
>>> Thanks!
>>> -Mark
>>> 
>>> 
>>> On Aug 1, 2016, at 10:19 AM, Aaron Longfield <al...@gmail.com> wrote:
>>> 
>>> I've been trying different things to try to fix my NiFi freeze problems,
>>> and it seems the most frequent reason that my cluster gets stuck and stops
>>> processing has to do with network related processors.  My data enters the
>>> environment from Kafka and leaves via a site-to-site output port.  After
>>> some time processing (sometimes a few minutes, sometimes a few hours) one of
>>> those will start logging connection errors, and then that node will stop
>>> processing any flowfiles across all processors.
>>> 
>>> So far, this followed me from 0.6.1 to 0.7.0, and on Amazon Linux to RHEL7
>>> (although RHEL seems to be happier).  I've tried restricting threads to less
>>> than the number of available cores on each node, different heap sizes, and
>>> different garbage collectors.  So far none of that has prevented the
>>> problem, unfortunately.
>>> 
>>> I'm not quite ready to build all custom processors for my flow logic...
>>> most of it is straightforward attribute routing, text replacement, and
>>> flowfile merging.
>>> 
>>> What are other things that I could try, or just be doing wrong that could
>>> lead to this?  I'm happy to keep trying suggestions and changes; I really
>>> want this to work!
>>> 
>>> Thanks,
>>> -Aaron
>>> 
>>> On Fri, Jul 15, 2016 at 12:07 PM, Lee Laim <le...@gmail.com> wrote:
>>>> 
>>>> Aaron,
>>>> 
>>>> I ran into an issue where the Execute Stream Command (ESC) processor with
>>>> many threads would run a legacy script that would hang if the incoming file
>>>> was 'inconsistent'.  It appeared that ESC slowly collected stuck threads as
>>>> malformed data randomly streamed through it. Eventually I ran out of threads
>>>> as the system was just waiting for a thread to become available.
>>>> 
>>>> It was apparent in the processor statistics where the flowfiles-out
>>>> statistic would eventually step down to zero as threads became stuck.
>>>> 
>>>> It might be worth trying InvokeScriptedProcessor or building custom
>>>> processors as they provide a means to handle these inconsistencies more
>>>> gracefully.
>>>> 
>>>> https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.script.InvokeScriptedProcessor/index.html
>>>> 
>>>> Thanks,
>>>> Lee
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Fri, Jul 15, 2016 at 6:50 AM, Aaron Longfield <al...@gmail.com>
>>>> wrote:
>>>>> 
>>>>> Hi Mark,
>>>>> 
>>>>> I've been using the G1 garbage collector.  I brought the nodes down to
>>>>> 8GB heap and let it run overnight, but processing still got stuck and
>>>>> required NiFi to be restarted on all nodes.  It took longer to happen, but
>>>>> they went down after a few hours.  Are there any other things I can look
>>>>> into?
>>>>> 
>>>>> Thanks!
>>>>> 
>>>>> -Aaron
>>>>> 
>>>>> On Thu, Jul 14, 2016 at 2:33 PM, Mark Payne <ma...@hotmail.com>
>>>>> wrote:
>>>>>> 
>>>>>> Aaron,
>>>>>> 
>>>>>> My guess would be that you are hitting a Full Garbage Collection. With
>>>>>> such a huge Java heap, that will cause a "stop the world" pause for quite a
>>>>>> long time.
>>>>>> Which garbage collector are you using? Have you tried reducing the heap
>>>>>> from 48 GB to say 4 or 8 GB?
>>>>>> 
>>>>>> Thanks
>>>>>> -Mark
>>>>>> 
>>>>>> 
>>>>>>> On Jul 14, 2016, at 11:14 AM, Aaron Longfield <al...@gmail.com>
>>>>>>> wrote:
>>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> I'm having an issue with a small (two node) NiFi cluster where the
>>>>>>> nodes will stop processing any queued flowfiles.  I haven't seen any error
>>>>>>> messages logged related to it, and when attempting to restart the service,
>>>>>>> NiFi doesn't respond and the script forcibly kills it.  This causes multiple
>>>>>>> flowfile version to hang around, and generally makes me feel like it might
>>>>>>> be causing data loss.
>>>>>>> 
>>>>>>> I'm running the web UI on a different box, and when things stop
>>>>>>> working, it stops showing changes to counts in any queues, and the thread
>>>>>>> count never changes.  It still thinks the nodes are connecting and
>>>>>>> responding, though.
>>>>>>> 
>>>>>>> My environment is two 8 cpu systems w/ 60GB memory with 48GB given to
>>>>>>> the NiFi JVM in bootstrap.conf.  I have timer threads limited to 12, and
>>>>>>> event threads to 4.  Install is on the current Amazon Linux AMI and using
>>>>>>> OpenJDK 1.8.0.91 x64.
>>>>>>> 
>>>>>>> Any idea, other debug steps, or changes that I can try?  I'm running
>>>>>>> 0.7.0, having upgraded from 0.6.1, but this has been occurring with both
>>>>>>> versions.  The higher the flowfile volume I push through, the faster this
>>>>>>> happens.
>>>>>>> 
>>>>>>> Thanks for any help there is to give!
>>>>>>> 
>>>>>>> -Aaron Longfield
>>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
>> 


Re: Nifi cluster nodes regularly stop processing any flowfiles

Posted by Joe Witt <jo...@gmail.com>.
Aaron, Mark,

In looking at the thread-dump provided it looks to me like this is the
same as what was reported and addressed in
https://issues.apache.org/jira/browse/NIFI-2395

The fix for this has not yet been released, but it is slated to end up on
both the 0.x and 1.0 release lines.
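
If you need relief before that release ships, a rough, unverified sketch of how
one might pull the fix into a local build (the 0.x branch name and the commit
hash are assumptions to check against the NIFI-2395 JIRA, not an official
procedure):

git clone https://github.com/apache/nifi.git && cd nifi
git checkout -b nifi-2395-backport origin/0.x      # assumes the 0.x maintenance branch
git log origin/master --oneline --grep=NIFI-2395   # locate the fix commit on master
git cherry-pick <commit-hash-from-the-log-above>   # placeholder; resolve any conflicts
mvn clean install -DskipTests                      # patched assembly ends up under nifi-assembly/target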

Mark, do you agree it is the same thing by looking at the logs?

Thanks
Joe

On Mon, Aug 1, 2016 at 11:39 AM, Aaron Longfield <al...@gmail.com> wrote:
> Alright, here you go for one of the nodes!
>
> On Mon, Aug 1, 2016 at 10:33 AM, Mark Payne <ma...@hotmail.com> wrote:
>>
>> Aaron,
>>
>> Any time that you find that NiFi has stopped performing its work, the best
>> thing to do is to take a thread dump and send it
>> to the mailing list. This allows us to determine what exactly is
>> happening, so we know what action is being
>> performed that prevents any other progress.
>>
>> To do this, you can go to the NiFi node that is not performing and run the
>> command:
>>
>> bin/nifi.sh dump thread-dump.txt
>>
>> This will generate a file named thread-dump.txt that you can send to us.
>>
>> Thanks!
>> -Mark
>>
>>
>> On Aug 1, 2016, at 10:19 AM, Aaron Longfield <al...@gmail.com> wrote:
>>
>> I've been trying different things to try to fix my NiFi freeze problems,
>> and it seems the most frequent reason that my cluster gets stuck and stops
>> processing has to do with network related processors.  My data enters the
>> environment from Kafka and leaves via a site-to-site output port.  After
>> some time processing (sometimes a few minutes, sometimes a few hours) one of
>> those will start logging connection errors, and then that node will stop
>> processing any flowfiles across all processors.
>>
>> So far, this followed me from 0.6.1 to 0.7.0, and on Amazon Linux to RHEL7
>> (although RHEL seems to be happier).  I've tried restricting threads to less
>> than the number of available cores on each node, different heap sizes, and
>> different garbage collectors.  So far none of that has prevented the
>> problem, unfortunately.
>>
>> I'm not quite ready to build all custom processors for my flow logic...
>> most of it is straightforward attribute routing, text replacement, and
>> flowfile merging.
>>
>> What are other things that I could try, or just be doing wrong that could
>> lead to this?  I'm happy to keep trying suggestions and changes; I really
>> want this to work!
>>
>> Thanks,
>> -Aaron
>>
>> On Fri, Jul 15, 2016 at 12:07 PM, Lee Laim <le...@gmail.com> wrote:
>>>
>>> Aaron,
>>>
>>> I ran into an issue where the Execute Stream Command (ESC) processor with
>>> many threads would run a legacy script that would hang if the incoming file
>>> was 'inconsistent'.  It appeared that ESC slowly collected stuck threads as
>>> malformed data randomly streamed through it. Eventually I ran out of threads
>>> as the system was just waiting for a thread to become available.
>>>
>>> It was apparent in the processor statistics where the flowfiles-out
>>> statistic would eventually step down to zero as threads became stuck.
>>>
>>> It might be worth trying InvokeScriptedProcessor or building custom
>>> processors as they provide a means to handle these inconsistencies more
>>> gracefully.
>>>
>>> https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.script.InvokeScriptedProcessor/index.html
>>>
>>> Thanks,
>>> Lee
>>>
>>>
>>>
>>>
>>>
>>> On Fri, Jul 15, 2016 at 6:50 AM, Aaron Longfield <al...@gmail.com>
>>> wrote:
>>>>
>>>> Hi Mark,
>>>>
>>>> I've been using the G1 garbage collector.  I brought the nodes down to
>>>> 8GB heap and let it run overnight, but processing still got stuck and
>>>> required NiFi to be restarted on all nodes.  It took longer to happen, but
>>>> they went down after a few hours.  Are there any other things I can look
>>>> into?
>>>>
>>>> Thanks!
>>>>
>>>> -Aaron
>>>>
>>>> On Thu, Jul 14, 2016 at 2:33 PM, Mark Payne <ma...@hotmail.com>
>>>> wrote:
>>>>>
>>>>> Aaron,
>>>>>
>>>>> My guess would be that you are hitting a Full Garbage Collection. With
>>>>> such a huge Java heap, that will cause a "stop the world" pause for quite a
>>>>> long time.
>>>>> Which garbage collector are you using? Have you tried reducing the heap
>>>>> from 48 GB to say 4 or 8 GB?
>>>>>
>>>>> Thanks
>>>>> -Mark
>>>>>
>>>>>
>>>>> > On Jul 14, 2016, at 11:14 AM, Aaron Longfield <al...@gmail.com>
>>>>> > wrote:
>>>>> >
>>>>> > Hi,
>>>>> >
>>>>> > I'm having an issue with a small (two node) NiFi cluster where the
>>>>> > nodes will stop processing any queued flowfiles.  I haven't seen any error
>>>>> > messages logged related to it, and when attempting to restart the service,
>>>>> > NiFi doesn't respond and the script forcibly kills it.  This causes multiple
>>>>> > flowfile version to hang around, and generally makes me feel like it might
>>>>> > be causing data loss.
>>>>> >
>>>>> > I'm running the web UI on a different box, and when things stop
>>>>> > working, it stops showing changes to counts in any queues, and the thread
>>>>> > count never changes.  It still thinks the nodes are connecting and
>>>>> > responding, though.
>>>>> >
>>>>> > My environment is two 8 cpu systems w/ 60GB memory with 48GB given to
>>>>> > the NiFi JVM in bootstrap.conf.  I have timer threads limited to 12, and
>>>>> > event threads to 4.  Install is on the current Amazon Linux AMI and using
>>>>> > OpenJDK 1.8.0.91 x64.
>>>>> >
>>>>> > Any idea, other debug steps, or changes that I can try?  I'm running
>>>>> > 0.7.0, having upgraded from 0.6.1, but this has been occurring with both
>>>>> > versions.  The higher the flowfile volume I push through, the faster this
>>>>> > happens.
>>>>> >
>>>>> > Thanks for any help there is to give!
>>>>> >
>>>>> > -Aaron Longfield
>>>>>
>>>>
>>>
>>
>>
>

Re: Nifi cluster nodes regularly stop processing any flowfiles

Posted by Aaron Longfield <al...@gmail.com>.
Alright, here you go for one of the nodes!

On Mon, Aug 1, 2016 at 10:33 AM, Mark Payne <ma...@hotmail.com> wrote:

> Aaron,
>
> Any time that you find that NiFi has stopped performing its work, the best
> thing to do is to take a thread dump and send it
> to the mailing list. This allows us to determine what exactly is
> happening, so we know what action is being
> performed that prevents any other progress.
>
> To do this, you can go to the NiFi node that is not performing and run the
> command:
>
> bin/nifi.sh dump thread-dump.txt
>
> This will generate a file named thread-dump.txt that you can send to us.
>
> Thanks!
> -Mark
>
>
> On Aug 1, 2016, at 10:19 AM, Aaron Longfield <al...@gmail.com> wrote:
>
> I've been trying different things to try to fix my NiFi freeze problems,
> and it seems the most frequent reason that my cluster gets stuck and stops
> processing has to do with network related processors.  My data enters the
> environment from Kafka and leaves via a site-to-site output port.  After
> some time processing (sometimes a few minutes, sometimes a few hours) one
> of those will start logging connection errors, and then that node will stop
> processing any flowfiles across all processors.
>
> So far, this followed me from 0.6.1 to 0.7.0, and on Amazon Linux to RHEL7
> (although RHEL seems to be happier).  I've tried restricting threads to
> less than the number of available cores on each node, different heap sizes,
> and different garbage collectors.  So far none of that has prevented the
> problem, unfortunately.
>
> I'm not quite ready to build all custom processors for my flow logic...
> most of it is straightforward attribute routing, text replacement, and
> flowfile merging.
>
> What are other things that I could try, or just be doing wrong that could
> lead to this?  I'm happy to keep trying suggestions and changes; I really
> want this to work!
>
> Thanks,
> -Aaron
>
> On Fri, Jul 15, 2016 at 12:07 PM, Lee Laim <le...@gmail.com> wrote:
>
>> Aaron,
>>
>> I ran into an issue where the Execute Stream Command (ESC) processor with
>> many threads would run a legacy script that would hang if the incoming file
>> was 'inconsistent'.  It appeared that ESC slowly collected stuck threads as
>> malformed data randomly streamed through it. Eventually I ran out of
>> threads as the system was just waiting for a thread to become available.
>>
>> It was apparent in the processor statistics where the flowfiles-out
>> statistic would eventually step down to zero as threads became stuck.
>>
>> It might be worth trying InvokeScriptedProcessor or building custom
>> processors as they provide a means to handle these inconsistencies more
>> gracefully.
>>
>> https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.script.InvokeScriptedProcessor/index.html
>>
>> Thanks,
>> Lee
>>
>>
>>
>>
>>
>> On Fri, Jul 15, 2016 at 6:50 AM, Aaron Longfield <al...@gmail.com>
>> wrote:
>>
>>> Hi Mark,
>>>
>>> I've been using the G1 garbage collector.  I brought the nodes down to
>>> 8GB heap and let it run overnight, but processing still got stuck and
>>> required NiFi to be restarted on all nodes.  It took longer to happen, but
>>> they went down after a few hours.  Are there any other things I can look
>>> into?
>>>
>>> Thanks!
>>>
>>> -Aaron
>>>
>>> On Thu, Jul 14, 2016 at 2:33 PM, Mark Payne <ma...@hotmail.com>
>>> wrote:
>>>
>>>> Aaron,
>>>>
>>>> My guess would be that you are hitting a Full Garbage Collection. With
>>>> such a huge Java heap, that will cause a "stop the world" pause for quite a
>>>> long time.
>>>> Which garbage collector are you using? Have you tried reducing the heap
>>>> from 48 GB to say 4 or 8 GB?
>>>>
>>>> Thanks
>>>> -Mark
>>>>
>>>>
>>>> > On Jul 14, 2016, at 11:14 AM, Aaron Longfield <al...@gmail.com>
>>>> wrote:
>>>> >
>>>> > Hi,
>>>> >
>>>> > I'm having an issue with a small (two node) NiFi cluster where the
>>>> nodes will stop processing any queued flowfiles.  I haven't seen any error
>>>> messages logged related to it, and when attempting to restart the service,
>>>> NiFi doesn't respond and the script forcibly kills it.  This causes
>>>> multiple flowfile version to hang around, and generally makes me feel like
>>>> it might be causing data loss.
>>>> >
>>>> > I'm running the web UI on a different box, and when things stop
>>>> working, it stops showing changes to counts in any queues, and the thread
>>>> count never changes.  It still thinks the nodes are connecting and
>>>> responding, though.
>>>> >
>>>> > My environment is two 8 cpu systems w/ 60GB memory with 48GB given to
>>>> the NiFi JVM in bootstrap.conf.  I have timer threads limited to 12, and
>>>> event threads to 4.  Install is on the current Amazon Linux AMI and using
>>>> OpenJDK 1.8.0.91 x64.
>>>> >
>>>> > Any idea, other debug steps, or changes that I can try?  I'm running
>>>> 0.7.0, having upgraded from 0.6.1, but this has been occurring with both
>>>> versions.  The higher the flowfile volume I push through, the faster this
>>>> happens.
>>>> >
>>>> > Thanks for any help there is to give!
>>>> >
>>>> > -Aaron Longfield
>>>>
>>>>
>>>
>>
>
>

Re: Nifi cluster nodes regularly stop processing any flowfiles

Posted by Mark Payne <ma...@hotmail.com>.
Aaron,

Any time that you find that NiFi has stopped performing its work, the best thing to do is to take a thread dump and send it
to the mailing list. This allows us to determine what exactly is happening, so we know what action is being
performed that prevents any other progress.

To do this, you can go to the NiFi node that is not performing and run the command:

bin/nifi.sh dump thread-dump.txt

This will generate a file named thread-dump.txt that you can send to us.
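
If you want a quick first look at the dump yourself before mailing it, a rough
sketch of things to check (the grep patterns are only heuristics and depend on
the exact dump formatting):

grep -ci 'blocked' thread-dump.txt               # rough count of threads stuck on a lock
grep -i 'timer-driven' thread-dump.txt | head    # what NiFi's flow worker threads are doing
less thread-dump.txt                             # then read those threads' full stack traces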

Thanks!
-Mark


> On Aug 1, 2016, at 10:19 AM, Aaron Longfield <al...@gmail.com> wrote:
> 
> I've been trying different things to try to fix my NiFi freeze problems, and it seems the most frequent reason that my cluster gets stuck and stops processing has to do with network related processors.  My data enters the environment from Kafka and leaves via a site-to-site output port.  After some time processing (sometimes a few minutes, sometimes a few hours) one of those will start logging connection errors, and then that node will stop processing any flowfiles across all processors.
> 
> So far, this followed me from 0.6.1 to 0.7.0, and on Amazon Linux to RHEL7 (although RHEL seems to be happier).  I've tried restricting threads to less than the number of available cores on each node, different heap sizes, and different garbage collectors.  So far none of that has prevented the problem, unfortunately.
> 
> I'm not quite ready to build all custom processors for my flow logic... most of it is straightforward attribute routing, text replacement, and flowfile merging.
> 
> What are other things that I could try, or just be doing wrong that could lead to this?  I'm happy to keep trying suggestions and changes; I really want this to work!
> 
> Thanks,
> -Aaron
> 
> On Fri, Jul 15, 2016 at 12:07 PM, Lee Laim <lee.laim@gmail.com <ma...@gmail.com>> wrote:
> Aaron, 
> 
> I ran into an issue where the Execute Stream Command (ESC) processor with many threads would run a legacy script that would hang if the incoming file was 'inconsistent'.  It appeared that ESC slowly collected stuck threads as malformed data randomly streamed through it. Eventually I ran out of threads as the system was just waiting for a thread to become available.  
> 
> It was apparent in the processor statistics where the flowfiles-out statistic would eventually step down to zero as threads became stuck.  
> 
> It might be worth trying InvokeScriptedProcessor or building custom processors as they provide a means to handle these inconsistencies more gracefully.
> https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.script.InvokeScriptedProcessor/index.html <https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.script.InvokeScriptedProcessor/index.html>
> 
> Thanks,
> Lee 
> 
> 
> 
> 
> 
> On Fri, Jul 15, 2016 at 6:50 AM, Aaron Longfield <alongfield@gmail.com <ma...@gmail.com>> wrote:
> Hi Mark,
> 
> I've been using the G1 garbage collector.  I brought the nodes down to 8GB heap and let it run overnight, but processing still got stuck and required NiFi to be restarted on all nodes.  It took longer to happen, but they went down after a few hours.  Are there any other things I can look into?
> 
> Thanks!
> 
> -Aaron
> 
> On Thu, Jul 14, 2016 at 2:33 PM, Mark Payne <markap14@hotmail.com <ma...@hotmail.com>> wrote:
> Aaron,
> 
> My guess would be that you are hitting a Full Garbage Collection. With such a huge Java heap, that will cause a "stop the world" pause for quite a long time.
> Which garbage collector are you using? Have you tried reducing the heap from 48 GB to say 4 or 8 GB?
> 
> Thanks
> -Mark
> 
> 
> > On Jul 14, 2016, at 11:14 AM, Aaron Longfield <alongfield@gmail.com <ma...@gmail.com>> wrote:
> >
> > Hi,
> >
> > I'm having an issue with a small (two node) NiFi cluster where the nodes will stop processing any queued flowfiles.  I haven't seen any error messages logged related to it, and when attempting to restart the service, NiFi doesn't respond and the script forcibly kills it.  This causes multiple flowfile version to hang around, and generally makes me feel like it might be causing data loss.
> >
> > I'm running the web UI on a different box, and when things stop working, it stops showing changes to counts in any queues, and the thread count never changes.  It still thinks the nodes are connecting and responding, though.
> >
> > My environment is two 8 cpu systems w/ 60GB memory with 48GB given to the NiFi JVM in bootstrap.conf.  I have timer threads limited to 12, and event threads to 4.  Install is on the current Amazon Linux AMI and using OpenJDK 1.8.0.91 x64.
> >
> > Any idea, other debug steps, or changes that I can try?  I'm running 0.7.0, having upgraded from 0.6.1, but this has been occurring with both versions.  The higher the flowfile volume I push through, the faster this happens.
> >
> > Thanks for any help there is to give!
> >
> > -Aaron Longfield
> 
> 
> 
> 


Re: Nifi cluster nodes regularly stop processing any flowfiles

Posted by Aaron Longfield <al...@gmail.com>.
I've been trying different things to fix my NiFi freeze problems, and it
seems the most frequent reason that my cluster gets stuck and stops
processing has to do with network-related processors.  My data enters the
environment from Kafka and leaves via a site-to-site output port.  After
some time processing (sometimes a few minutes, sometimes a few hours) one
of those will start logging connection errors, and then that node will stop
processing any flowfiles across all processors.

So far, this has followed me from 0.6.1 to 0.7.0, and from Amazon Linux to
RHEL7 (although RHEL seems to be happier).  I've tried restricting threads to
less than the number of available cores on each node, different heap sizes,
and different garbage collectors.  So far none of that has prevented the
problem, unfortunately.

I'm not quite ready to build all custom processors for my flow logic...
most of it is straightforward attribute routing, text replacement, and
flowfile merging.

What are other things that I could try, or just be doing wrong that could
lead to this?  I'm happy to keep trying suggestions and changes; I really
want this to work!

Thanks,
-Aaron

On Fri, Jul 15, 2016 at 12:07 PM, Lee Laim <le...@gmail.com> wrote:

> Aaron,
>
> I ran into an issue where the Execute Stream Command (ESC) processor with
> many threads would run a legacy script that would hang if the incoming file
> was 'inconsistent'.  It appeared that ESC slowly collected stuck threads as
> malformed data randomly streamed through it. Eventually I ran out of
> threads as the system was just waiting for a thread to become available.
>
> It was apparent in the processor statistics where the flowfiles-out
> statistic would eventually step down to zero as threads became stuck.
>
> It might be worth trying InvokeScriptedProcessor or building custom
> processors as they provide a means to handle these inconsistencies more
> gracefully.
>
> https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.script.InvokeScriptedProcessor/index.html
>
> Thanks,
> Lee
>
>
>
>
>
> On Fri, Jul 15, 2016 at 6:50 AM, Aaron Longfield <al...@gmail.com>
> wrote:
>
>> Hi Mark,
>>
>> I've been using the G1 garbage collector.  I brought the nodes down to
>> 8GB heap and let it run overnight, but processing still got stuck and
>> required NiFi to be restarted on all nodes.  It took longer to happen, but
>> they went down after a few hours.  Are there any other things I can look
>> into?
>>
>> Thanks!
>>
>> -Aaron
>>
>> On Thu, Jul 14, 2016 at 2:33 PM, Mark Payne <ma...@hotmail.com> wrote:
>>
>>> Aaron,
>>>
>>> My guess would be that you are hitting a Full Garbage Collection. With
>>> such a huge Java heap, that will cause a "stop the world" pause for quite a
>>> long time.
>>> Which garbage collector are you using? Have you tried reducing the heap
>>> from 48 GB to say 4 or 8 GB?
>>>
>>> Thanks
>>> -Mark
>>>
>>>
>>> > On Jul 14, 2016, at 11:14 AM, Aaron Longfield <al...@gmail.com>
>>> wrote:
>>> >
>>> > Hi,
>>> >
>>> > I'm having an issue with a small (two node) NiFi cluster where the
>>> nodes will stop processing any queued flowfiles.  I haven't seen any error
>>> messages logged related to it, and when attempting to restart the service,
>>> NiFi doesn't respond and the script forcibly kills it.  This causes
>>> multiple flowfile version to hang around, and generally makes me feel like
>>> it might be causing data loss.
>>> >
>>> > I'm running the web UI on a different box, and when things stop
>>> working, it stops showing changes to counts in any queues, and the thread
>>> count never changes.  It still thinks the nodes are connecting and
>>> responding, though.
>>> >
>>> > My environment is two 8 cpu systems w/ 60GB memory with 48GB given to
>>> the NiFi JVM in bootstrap.conf.  I have timer threads limited to 12, and
>>> event threads to 4.  Install is on the current Amazon Linux AMI and using
>>> OpenJDK 1.8.0.91 x64.
>>> >
>>> > Any idea, other debug steps, or changes that I can try?  I'm running
>>> 0.7.0, having upgraded from 0.6.1, but this has been occurring with both
>>> versions.  The higher the flowfile volume I push through, the faster this
>>> happens.
>>> >
>>> > Thanks for any help there is to give!
>>> >
>>> > -Aaron Longfield
>>>
>>>
>>
>

Re: Nifi cluster nodes regularly stop processing any flowfiles

Posted by Lee Laim <le...@gmail.com>.
Aaron,

I ran into an issue where the ExecuteStreamCommand (ESC) processor with
many threads would run a legacy script that would hang if the incoming file
was 'inconsistent'.  It appeared that ESC slowly collected stuck threads as
malformed data randomly streamed through it. Eventually I ran out of
threads as the system was just waiting for a thread to become available.

It was apparent in the processor statistics where the flowfiles-out
statistic would eventually step down to zero as threads became stuck.

It might be worth trying InvokeScriptedProcessor or building custom
processors as they provide a means to handle these inconsistencies more
gracefully.
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.script.InvokeScriptedProcessor/index.html
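
As a lower-tech stopgap (just an illustration, not NiFi-specific advice), wrapping
the legacy script so a hung invocation cannot hold its thread forever also helps.
A minimal sketch, assuming a Linux host with GNU coreutils; the script path and
the 300-second limit are placeholders:

#!/bin/bash
# wrapper.sh - invoked by ExecuteStreamCommand in place of the legacy script.
# If the legacy script hangs on malformed input, kill it after 5 minutes so the
# processor thread is released instead of waiting indefinitely.
timeout --kill-after=10s 300s /opt/scripts/legacy_script.sh "$@"

A timed-out run then surfaces as a non-zero exit code (124 from timeout), which
ExecuteStreamCommand exposes on the flowfile (the execution.status attribute, if
memory serves), so bad inputs can be routed on downstream rather than silently
eating a thread.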

Thanks,
Lee





On Fri, Jul 15, 2016 at 6:50 AM, Aaron Longfield <al...@gmail.com>
wrote:

> Hi Mark,
>
> I've been using the G1 garbage collector.  I brought the nodes down to 8GB
> heap and let it run overnight, but processing still got stuck and required
> NiFi to be restarted on all nodes.  It took longer to happen, but they went
> down after a few hours.  Are there any other things I can look into?
>
> Thanks!
>
> -Aaron
>
> On Thu, Jul 14, 2016 at 2:33 PM, Mark Payne <ma...@hotmail.com> wrote:
>
>> Aaron,
>>
>> My guess would be that you are hitting a Full Garbage Collection. With
>> such a huge Java heap, that will cause a "stop the world" pause for quite a
>> long time.
>> Which garbage collector are you using? Have you tried reducing the heap
>> from 48 GB to say 4 or 8 GB?
>>
>> Thanks
>> -Mark
>>
>>
>> > On Jul 14, 2016, at 11:14 AM, Aaron Longfield <al...@gmail.com>
>> wrote:
>> >
>> > Hi,
>> >
>> > I'm having an issue with a small (two node) NiFi cluster where the
>> nodes will stop processing any queued flowfiles.  I haven't seen any error
>> messages logged related to it, and when attempting to restart the service,
>> NiFi doesn't respond and the script forcibly kills it.  This causes
>> multiple flowfile version to hang around, and generally makes me feel like
>> it might be causing data loss.
>> >
>> > I'm running the web UI on a different box, and when things stop
>> working, it stops showing changes to counts in any queues, and the thread
>> count never changes.  It still thinks the nodes are connecting and
>> responding, though.
>> >
>> > My environment is two 8 cpu systems w/ 60GB memory with 48GB given to
>> the NiFi JVM in bootstrap.conf.  I have timer threads limited to 12, and
>> event threads to 4.  Install is on the current Amazon Linux AMI and using
>> OpenJDK 1.8.0.91 x64.
>> >
>> > Any idea, other debug steps, or changes that I can try?  I'm running
>> 0.7.0, having upgraded from 0.6.1, but this has been occurring with both
>> versions.  The higher the flowfile volume I push through, the faster this
>> happens.
>> >
>> > Thanks for any help there is to give!
>> >
>> > -Aaron Longfield
>>
>>
>

Re: Nifi cluster nodes regularly stop processing any flowfiles

Posted by Aaron Longfield <al...@gmail.com>.
Hi Mark,

I've been using the G1 garbage collector.  I brought the nodes down to 8GB
heap and let it run overnight, but processing still got stuck and required
NiFi to be restarted on all nodes.  It took longer to happen, but they went
down after a few hours.  Are there any other things I can look into?

Thanks!

-Aaron

On Thu, Jul 14, 2016 at 2:33 PM, Mark Payne <ma...@hotmail.com> wrote:

> Aaron,
>
> My guess would be that you are hitting a Full Garbage Collection. With
> such a huge Java heap, that will cause a "stop the world" pause for quite a
> long time.
> Which garbage collector are you using? Have you tried reducing the heap
> from 48 GB to say 4 or 8 GB?
>
> Thanks
> -Mark
>
>
> > On Jul 14, 2016, at 11:14 AM, Aaron Longfield <al...@gmail.com>
> wrote:
> >
> > Hi,
> >
> > I'm having an issue with a small (two node) NiFi cluster where the nodes
> will stop processing any queued flowfiles.  I haven't seen any error
> messages logged related to it, and when attempting to restart the service,
> NiFi doesn't respond and the script forcibly kills it.  This causes
> multiple flowfile version to hang around, and generally makes me feel like
> it might be causing data loss.
> >
> > I'm running the web UI on a different box, and when things stop working,
> it stops showing changes to counts in any queues, and the thread count
> never changes.  It still thinks the nodes are connecting and responding,
> though.
> >
> > My environment is two 8 cpu systems w/ 60GB memory with 48GB given to
> the NiFi JVM in bootstrap.conf.  I have timer threads limited to 12, and
> event threads to 4.  Install is on the current Amazon Linux AMI and using
> OpenJDK 1.8.0.91 x64.
> >
> > Any idea, other debug steps, or changes that I can try?  I'm running
> 0.7.0, having upgraded from 0.6.1, but this has been occurring with both
> versions.  The higher the flowfile volume I push through, the faster this
> happens.
> >
> > Thanks for any help there is to give!
> >
> > -Aaron Longfield
>
>

Re: Nifi cluster nodes regularly stop processing any flowfiles

Posted by Mark Payne <ma...@hotmail.com>.
Aaron,

My guess would be that you are hitting a Full Garbage Collection. With such a huge Java heap, that will cause a "stop the world" pause for quite a long time.
Which garbage collector are you using? Have you tried reducing the heap from 48 GB to say 4 or 8 GB?
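
For reference, a sketch of the kind of bootstrap.conf settings involved; the
java.arg numbers are placeholders (match them to whatever numbering your file
already uses) and the sizes are examples only:

# conf/bootstrap.conf
java.arg.2=-Xms8g
java.arg.3=-Xmx8g
# G1 generally keeps pauses shorter than the default collector on large heaps
java.arg.13=-XX:+UseG1GC
# GC logging makes it easy to confirm whether freezes line up with full GCs
java.arg.14=-verbose:gc
java.arg.15=-XX:+PrintGCDetails
java.arg.16=-Xloggc:./logs/gc.log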

Thanks
-Mark


> On Jul 14, 2016, at 11:14 AM, Aaron Longfield <al...@gmail.com> wrote:
> 
> Hi,
> 
> I'm having an issue with a small (two node) NiFi cluster where the nodes will stop processing any queued flowfiles.  I haven't seen any error messages logged related to it, and when attempting to restart the service, NiFi doesn't respond and the script forcibly kills it.  This causes multiple flowfile version to hang around, and generally makes me feel like it might be causing data loss.
> 
> I'm running the web UI on a different box, and when things stop working, it stops showing changes to counts in any queues, and the thread count never changes.  It still thinks the nodes are connecting and responding, though.
> 
> My environment is two 8 cpu systems w/ 60GB memory with 48GB given to the NiFi JVM in bootstrap.conf.  I have timer threads limited to 12, and event threads to 4.  Install is on the current Amazon Linux AMI and using OpenJDK 1.8.0.91 x64.
> 
> Any idea, other debug steps, or changes that I can try?  I'm running 0.7.0, having upgraded from 0.6.1, but this has been occurring with both versions.  The higher the flowfile volume I push through, the faster this happens.
> 
> Thanks for any help there is to give!
> 
> -Aaron Longfield