Posted to users@nifi.apache.org by Alessio Palma <al...@buongiorno.com> on 2016/10/28 09:55:39 UTC

How I put the cluster down.

Hello all,
yesterday, by mistake, I effectively ran "ls -R /" through the
ListHDFS processor, and the whole cluster went down (not just one node).

Something similar also happened when I was experimenting with some DO
WHILE / WHILE DO patterns. I only have the NiFi logs, and they show that
the heartbeat was lost. I have no data on CPU load or network traffic.
Any pointers on where I should look for the root cause?

Today I'm trying to reproduce the problems I had with DO/WHILE. Nothing
bad is happening, although the CPU load is fairly high and network
traffic has increased to 282 Kb/sec.

Of course I could rerun the "ls -R /" on production, but I would rather
avoid it, since there are already some ingestion flows running.

AP

Re: How I put the cluster down.

Posted by Andrew Grande <ap...@gmail.com>.

Hi,

I'd suggest a couple of things. Have you configured backpressure controls
on connections? NiFi 1.0.0 adds default thresholds of 10,000 events / 1 GB
per connection, IIRC. This can help avoid overwhelming components in a
flow.
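
To make the idea of backpressure concrete, here is a minimal toy model in Python. It is only a sketch and not NiFi's actual implementation: the Connection class, its method names, and run_producer are hypothetical names invented for illustration, and the two thresholds simply mirror the 10,000 events / 1 GB figures mentioned above. Once either threshold is reached, the upstream side stops adding work until the downstream side drains the queue.

```python
from collections import deque

# Thresholds mirror the defaults quoted above (10,000 objects / 1 GB).
MAX_OBJECTS = 10_000
MAX_BYTES = 1 * 1024**3


class Connection:
    """Toy queue between two processors; not NiFi's real implementation."""

    def __init__(self, max_objects=MAX_OBJECTS, max_bytes=MAX_BYTES):
        self.max_objects = max_objects
        self.max_bytes = max_bytes
        self.queue = deque()
        self.queued_bytes = 0

    def is_full(self):
        # Backpressure applies once either threshold is reached.
        return (len(self.queue) >= self.max_objects
                or self.queued_bytes >= self.max_bytes)

    def offer(self, size_bytes):
        """Enqueue one item unless backpressure is active; return success."""
        if self.is_full():
            return False  # the upstream side has to wait
        self.queue.append(size_bytes)
        self.queued_bytes += size_bytes
        return True

    def poll(self):
        """Dequeue one item for the downstream side, or None if empty."""
        if not self.queue:
            return None
        size_bytes = self.queue.popleft()
        self.queued_bytes -= size_bytes
        return size_bytes


def run_producer(conn, attempts, size_bytes=1024):
    """Try to enqueue `attempts` items; return how many were accepted."""
    return sum(conn.offer(size_bytes) for _ in range(attempts))
```

In this model a producer that never pauses is capped by the queue limits instead of growing without bound, which is the effect the connection thresholds are meant to have on a fast source feeding slower processors.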

Next, a 2-core CPU is really inadequate for a high-throughput system; see
if you can get something better. It seems there's a lot going on in your
cluster. A full NiFi node with many flows does a lot of housekeeping in
the background and needs some power.

Andrew


Re: How I put the cluster down.

Posted by Alessio Palma <al...@buongiorno.com>.

Hello Witt,
before anything else, thanks for your help.
Fortunately I only brought down the NiFi cluster; otherwise I would
already be on vacation :)

After I posted this problem I kept torturing the staging NiFi and
discovered that when the CPU load gets very high, nodes lose their
connection and everything starts to go wrong. The web GUI also becomes
unresponsive, so you have no way to stop the workflows.

You can reproduce this issue by starting several workflows composed of:
1) GenerateFlowFile (1 KB size, timer driven, 0 sec run schedule)
2) ReplaceText (just to force the use of a regexp)
3) HashContent (auto-terminate both relationships)

Currently my staging cluster consists of 2 virtual hosts, each configured as:
2-core CPU ( Intel(R) Xeon(R) CPU E7- 2870  @ 2.40GHz )
2 GB RAM
18 GB HD

The problem arises when the CPU load goes over 8, which basically means
when you start 8 of the above workflows.
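
To put that number in context, a load of 8 on a 2-core host works out to roughly four runnable tasks competing for each core. The sketch below shows that arithmetic in Python. The function names and the threshold of 1.0 per core are illustrative assumptions and do not come from NiFi; os.getloadavg() is only available on Unix-like systems.

```python
import os


def load_per_core(load_average, cores):
    """Return the load average normalised by the number of CPU cores."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_average / cores


def is_overloaded(load_average, cores, threshold=1.0):
    """True when the per-core load exceeds the (assumed) threshold."""
    return load_per_core(load_average, cores) > threshold


def current_load_per_core():
    """Read the 1-minute load average of this host and normalise it."""
    one_minute, _, _ = os.getloadavg()  # Unix-like systems only
    return load_per_core(one_minute, os.cpu_count() or 1)
```

For the staging hosts described above, load_per_core(8, 2) gives 4.0, well past the point where every core is fully busy.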

I noticed that NiFi attempts to reduce the load, but this does not help
much and does not prevent the general failure.

Here you can see the errors that started to show up under stress:
https://drive.google.com/drive/folders/0B7NTMIqrCjESN0JURnRtZWp5Tms?usp=sharing

The first question is: is there a way to keep the load under some
critical value? Is there some "how to" that would help me configure NiFi?
Currently it is using the factory settings, and no customization has been
done apart from LDAP login.

AP


Re: How I put the cluster down.

Posted by Joe Witt <jo...@gmail.com>.

Alessio

You potentially have two clusters here: the NiFi cluster and the
Hadoop cluster.  Which one went down?

If NiFi went down I'd suspect memory exhaustion, because other kinds of
resource exhaustion (full file system, exhausted file handles, pegged
CPU, etc.) tend not to cause it to restart.  If it is memory related
you'll probably see something in nifi-app.log.  Try going with a larger
heap, which can be controlled in conf/bootstrap.conf.
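
One way to check the log for signs of memory exhaustion is sketched below in Python. It assumes that a memory problem shows up as the standard JVM error name java.lang.OutOfMemoryError somewhere in the file; the function names are hypothetical, and the log path has to be supplied by the caller because it depends on the installation.

```python
from pathlib import Path

# Error name the JVM reports when the heap (or another memory area) runs out.
OOM_MARKER = "java.lang.OutOfMemoryError"


def find_oom_lines(lines):
    """Return (line_number, text) pairs for lines mentioning the marker."""
    return [(number, line.rstrip("\n"))
            for number, line in enumerate(lines, start=1)
            if OOM_MARKER in line]


def scan_log(path):
    """Scan a log file line by line and collect OutOfMemoryError entries."""
    # errors="replace" keeps the scan going if the log has odd bytes in it.
    with Path(path).open(encoding="utf-8", errors="replace") as handle:
        return find_oom_lines(handle)
```

Calling scan_log() with the path to nifi-app.log returns the matching lines with their line numbers. An empty result would suggest looking at causes other than the heap.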

Thanks
Joe
