Posted to users@nifi.apache.org by Mike Thomsen <mi...@gmail.com> on 2020/09/11 16:51:11 UTC

Best way to tune NiFi for huge amounts of small flowfiles

What are the general recommended practices around tuning NiFi to
safely handle flows that may drop in several million very small
flowfiles (2 KB-10 KB each) onto a single node? It's possible that some
of the data dumps we're processing (and we can't control their size)
will drop about 3.5-5M flowfiles the moment we expand them in the
flow.
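
For scale, a rough back-of-envelope on those figures (the 5M count and the
midpoint size are assumptions taken from the ranges above, not measurements):

```python
# Rough sizing for the scenario above: ~5M flowfiles at the 2-10 KB midpoint.
flowfiles = 5_000_000
avg_size_kb = 6  # assumed midpoint of the 2-10 KB range

content_gb = flowfiles * avg_size_kb / 1024 / 1024
print(f"{content_gb:.1f} GB of content")  # -> 28.6 GB of content
```

So the raw content is modest; the pressure comes from the flowfile count, not
the bytes.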

(Let me emphasize again, it was not our idea to dump the data this way)

Any pointers would be appreciated.

Thanks,

Mike

Re: Best way to tune NiFi for huge amounts of small flowfiles

Posted by Ryan Hendrickson <ry...@gmail.com>.
We keep our queue limit at 20,000 to keep data from swapping between
ArrayLists and prioritized queues.  See this bug:
https://issues.apache.org/jira/browse/NIFI-7583

You can also raise that limit in nifi.properties.
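
For reference, the relevant nifi.properties entry looks roughly like this
(property name from the default config; the raised value is purely
illustrative):

```properties
# Per-queue threshold at which NiFi swaps flowfiles out to disk (default 20000).
# Raising it keeps more flowfiles in the in-memory queues; value illustrative.
nifi.queue.swap.threshold=50000
```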


Re: Best way to tune NiFi for huge amounts of small flowfiles

Posted by Chris Sampson <ch...@naimuri.com>.
One thing we've not done yet but I think might help is to stripe disks for
each repo too, i.e. multiple disks for content, etc., which will help
spread the disk I/O.
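
As a sketch, striping the content repository across disks might look like
this in nifi.properties (mount paths are made up; NiFi spreads content
claims across the named directories):

```properties
# Multiple content repository directories, one per physical disk (paths assumed).
nifi.content.repository.directory.disk1=/mnt/disk1/content_repository
nifi.content.repository.directory.disk2=/mnt/disk2/content_repository
nifi.content.repository.directory.disk3=/mnt/disk3/content_repository
```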


Cheers,

Chris Sampson


Re: Best way to tune NiFi for huge amounts of small flowfiles

Posted by Mike Thomsen <mi...@gmail.com>.
Craig and Jeremy,

Thanks. The point about using different disks for different
repositories is definitely something to add to the list.


Re: Best way to tune NiFi for huge amounts of small flowfiles

Posted by Jeremy Dyer <jd...@gmail.com>.
Hey Mike,

When you say "flows that may drop in several million ... flowfiles", I read
that as a single node that might be inundated with tons of source data
(local files, FTP, Kafka messages, etc.). Just my 2 cents, but if you don't
have strict SLAs (and this kind of sounds like a one-time thing), I wouldn't
even worry about it; just let the system apply backpressure and process in
time as designed. That process will be "safe", although maybe not fast. If
you need speed, throw lots of NVMe mounts at it. We process well into the
tens (sometimes hundreds) of millions of flowfiles a day on a 5-node
cluster with no issues. However, our hardware is quite over the top.
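
For context, backpressure applies per connection; a sketch of the defaults
applied to new connections in recent NiFi releases (property names as I
recall them, worth double-checking against your version):

```properties
# Default backpressure thresholds applied to newly created connections.
nifi.queue.backpressure.count=10000
nifi.queue.backpressure.size=1 GB
```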

Thanks,
Jeremy Dyer


Re: Best way to tune NiFi for huge amounts of small flowfiles

Posted by Craig Connell <cs...@staq.com>.
Hi Mike,

I might have a few more pointers to offer when I can get unburied from some
other work ... but the couple things that jump to mind are the following:


   - I think for that many flowfiles, you will want to make sure you have
   separate disks set up for data provenance.  We have several different types
   of flowfile profiles.  For the ones where we didn't have too many
   flowfiles, we didn't do much to change the default settings, and we
   actually (against recommendation and better judgement) had everything
   hitting the same set of disks.  When we had another, more real-time
   processing profile, closer to the volume you are talking about, we began
   to run into issues with provenance's ability to keep up.  We created three
   separate disks and changed the accompanying config, and that helped a
   great deal.  You'd need to make some changes around threading for that too.
   You can find some info on that here:
   https://community.cloudera.com/t5/Community-Articles/HDF-CFM-NIFI-Best-practices-for-setting-up-a-high/ta-p/244999


   - I don't know what you've done with regard to the Maximum Timer Driven
   Thread Count, but the default is quite low (depending on the size of your
   machine).  If I'm not mistaken (there is a best practices doc out there),
   you can set this to 2-4 times the number of cores that you have.  We have
   been fairly aggressive and set it to 4 times the core count.  Once we did
   that, we had some of the processors run multiple threads - but you have to
   be careful that you don't have one set of processors eating all of your
   available cycles.
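
A sketch of the provenance side of that config (paths made up; thread
counts illustrative, not a recommendation):

```properties
# Dedicated disk for provenance, plus more indexing/query threads (values assumed).
nifi.provenance.repository.directory.default=/mnt/disk4/provenance_repository
nifi.provenance.repository.index.threads=4
nifi.provenance.repository.query.threads=2
```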

One of the sizing docs we used was this one:
https://community.cloudera.com/t5/Community-Articles/NiFi-Sizing-Guide-Deployment-Best-Practices/ta-p/246781
which helped us think through our server size and the throughput we wanted.

In all, we found that there were some best practices, but it required some
tuning and observation.

I hope that helps.

Craig

Craig S. Connell
CTO & Senior VP of Engineering
CSConnell@staq.com
443-789-4842



