You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@nifi.apache.org by Martin Eden <ma...@gmail.com> on 2017/06/02 19:32:27 UTC

Load Balancing in NiFi

Hi everyone,

Simple flow in NiFi 1.2.0:
ListHDFS -> FetchHDFS -> PutHDFS

Just moving files from one HDFS folder to another for evaluation purposes,
to see if NiFi can be used for this sort of ETL.

To benchmark I am doing is on a 50 x 1 GB input files dataset.

I am testing out with varying cluster sizes: 1, 2, 3 nodes and am expecting
to see linear scalability.

I am following the advice in this article
<https://community.hortonworks.com/articles/16120/how-do-i-distribute-data-across-a-nifi-cluster.html>
and
the docs
<https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#clustering>
by
executing the ListHDFS on the primary and sending the file names to a
Remote Process Group which I am expecting to do the load balancing to the
FetchHDFS -> PutHDFS executed on each of the nodes. However 90% of the
files get sent to only one node even on a 3 node cluster. It seems the load
balancing is not Round Robin and the Remote Process Group does not allow
you to set one either. Later I found this post
<https://community.hortonworks.com/questions/53153/load-balancing-while-the-fetching-of-file-from-a-s.html>
which
says "Nodes with higher load will get fewer FlowFiles. The load balancing
is done in batches for efficiency, so under light load you may not see an
exact balanced delivery, but under higher FlowFile volumes you will see a
balanced delivery over the 5 minutes delivery statistics."

Questions:
- What are the ways to get more control and explicitly enforce a load
balancing policy like Round Robin across nodes?
I found a *DistributeLoad* processor which I haven't tried because based on
it's docs it seems to be load balancing in between multiple outbound
processors (which obviously will be on the same node).
- NiFi seems to be sensitive to skews in input file sizes because it treats
files as one (does not partition them) which means that larger files get
processed by one node and will effectively be processed much slower. What
are the recommended ways to mitigate this?

Thanks,
M

Re: Load Balancing in NiFi

Posted by Martin Eden <ma...@gmail.com>.

Kevin, Koji,

Your prompt replies are much appreciated.

So Koji's solution is for solving the data skew problem in cases where we
want to bundle up smaller files to achieve a certain size. This would
however not break up a very large file in smaller more manageable chunks
which is what Kevin's proposition would solve. Hope I understood correctly.

Kevin's solution also nicely ties into the other thread I have going on in
the mailing list where I have a workflow where a RouteOnContent processor
was the bottleneck and there too the SplitText processor was suggested as a
better solution.

Thanks again for following up with me. A helpful community is a big plus
for us. Keep up the good work!

I will be writing posting back further discoveries as we do more NiFi work.

Cheers,
Martin

On Sat, Jun 3, 2017 at 5:50 AM, Koji Kawamura <ij...@gmail.com>
wrote:

> Hi Martin,
>
> As Kevin mentioned, the recommended way of distribute files depends on
> your use-case.
>
> So, please take this as one possible example to address that challenge.
> While I was playing with the new RecordReader functionality added
> since 1.2.0, I came up with an idea to distribute filename list by
> file size so that each node can process roughly even amount of data
> volume without skew, when those are distributed with S2S (RPG).
>
> I experiment the idea locally, and put a NiFi template file and wrote
> some explanation on Gist.
> https://gist.github.com/ijokarumawak/27f646fdbfa29e71b8abf556aa50fe39
>
> I hope it will be helpful for your use-case, too.
>
> Thanks,
> Koji
>
> On Sat, Jun 3, 2017 at 8:58 AM, Kevin Doran <kd...@gmail.com>
> wrote:
> > Hi Martin,
> >
> > Cool that you are making progress and getting expected results!
> Regarding your second question, I can offer a bit of advice, and others
> might chime in as well.
> >
> > First, w.r.t. this observation:
> >
> >> NiFi seems to be sensitive to skews in input file sizes because it
> treats files as one...
> >
> > I would point out that this is more a characteristic of the test flow
> you've setup and not NiFi in general. In your case, you've setup your data
> flow so that the primary node is doing ListHDFS and sending the name of
> each file to the cluster for processing, where the node that receives the
> file name processes it entirely. If a big file gets to one node and the
> processing is expensive, it's true it could take longer for that entire
> file to process while another node crunches through several smaller files.
> In some applications, that might be fine. In others, it would be
> undesirable. The main disadvantage I could think of would be the case there
> is only one large file to process and you give it to one node that has to
> deal with it while others sit idle – that doesn't seem optimal :-). On top
> of the variances in files sizes, consider the complexity that nodes in the
> cluster might not be uniformly sized and we can see that processing could
> easily become unbalanced.
> >
> > Which brings me to your question:
> >
> >> ... What are the recommended ways to mitigate this?
> >
> > It would depend on the characteristics of the data you are dealing with
> and the desired data flow you want to achieve. In general, you would want
> to apply some data partitioning strategy early in the data flow that is
> specific to the data you are dealing with. NiFi comes with a ton of tools
> built in to control and tune the flow of data, including dividing (and
> joining) flow files along the way. For example, say your input files are
> CSV (or similar) and the expensive part of your data flow is incurred when
> processing a row. In that case, early in your data flow you could use a
> processor such as SplitText [1] to divide your data into smaller flow files
> before sending them down stream (e.g., to a RPG cluster) for the expensive
> processing, so that you take full advantage of the cluster at all times.
> >
> > The recommended strategy really depends on what your goals are and what
> your data looks like. Once data is loaded into NiFi, you can do a lot with
> it, and even before the data gets to NiFi, there are usually some
> strategies you can apply to achieve roughly uniform-sized inputs if that is
> important to your architecture. For example, if you are using NiFi to
> process application log files, you could set the logger in your application
> to roll over based on file size.
> >
> > I hope this helps. If you get further along and have questions regarding
> specific types of data and processing, be sure to post a question back here
> and someone will be able to assist you.
> >
> > Cheers,
> > Kevin
> >
> > [1] https://nifi.apache.org/docs/nifi-docs/components/org.
> apache.nifi/nifi-standard-nar/1.2.0/org.apache.nifi.
> processors.standard.SplitText
> >
> > On 6/2/17, 18:50, "Martin Eden" <ma...@gmail.com> wrote:
> >
> >     Hi Joe,
> >
> >     I changed the batch size on the input port of the Remote Process
> Group to 1
> >     and got the results I was looking for: ~1/3 of the time for a 3 node
> >     cluster compared to 1 node. So big thanks!
> >
> >     Any takes on my second question though?
> >     - NiFi seems to be sensitive to skews in input file sizes because it
> treats
> >     files as one (does not partition them) which means that larger files
> get
> >     processed by one node and will effectively be processed much slower.
> What
> >     are the recommended ways to mitigate this?
> >
> >     Thanks again,
> >     Martin
> >
> >
> >     On Fri, Jun 2, 2017 at 8:37 PM, Joe Witt <jo...@gmail.com> wrote:
> >
> >     > Martin,
> >     >
> >     > The problem you're hitting is that site-to-site doesn't by default
> do
> >     > file by file load balancing.  It sends a set of files to one node,
> >     > then a set to another, and so on.  This was tuned for constant high
> >     > rate/volume transmission so a test like this will have funny
> results.
> >     > Did you tune the batch settings in site-to-site which become
> available
> >     > due to https://issues.apache.org/jira/browse/NIFI-1202
> >     >
> >     > You can set it to batch sizes of one I'd assume (i've never done
> this)
> >     > and that should then behave the way you're looking for.
> >     >
> >     > Thanks
> >     >
> >     > On Fri, Jun 2, 2017 at 3:32 PM, Martin Eden <
> martineden131@gmail.com>
> >     > wrote:
> >     > > Hi everyone,
> >     > >
> >     > > Simple flow in NiFi 1.2.0:
> >     > > ListHDFS -> FetchHDFS -> PutHDFS
> >     > >
> >     > > Just moving files from one HDFS folder to another for evaluation
> >     > purposes,
> >     > > to see if NiFi can be used for this sort of ETL.
> >     > >
> >     > > To benchmark I am doing is on a 50 x 1 GB input files dataset.
> >     > >
> >     > > I am testing out with varying cluster sizes: 1, 2, 3 nodes and am
> >     > expecting
> >     > > to see linear scalability.
> >     > >
> >     > > I am following the advice in this article
> >     > > <https://community.hortonworks.com/articles/
> 16120/how-do-i-distribute-
> >     > data-across-a-nifi-cluster.html>
> >     > > and
> >     > > the docs
> >     > > <https://nifi.apache.org/docs/nifi-docs/html/administration-
> >     > guide.html#clustering>
> >     > > by
> >     > > executing the ListHDFS on the primary and sending the file names
> to a
> >     > > Remote Process Group which I am expecting to do the load
> balancing to the
> >     > > FetchHDFS -> PutHDFS executed on each of the nodes. However 90%
> of the
> >     > > files get sent to only one node even on a 3 node cluster. It
> seems the
> >     > load
> >     > > balancing is not Round Robin and the Remote Process Group does
> not allow
> >     > > you to set one either. Later I found this post
> >     > > <https://community.hortonworks.com/questions/
> 53153/load-balancing-while-
> >     > the-fetching-of-file-from-a-s.html>
> >     > > which
> >     > > says "Nodes with higher load will get fewer FlowFiles. The load
> balancing
> >     > > is done in batches for efficiency, so under light load you may
> not see an
> >     > > exact balanced delivery, but under higher FlowFile volumes you
> will see a
> >     > > balanced delivery over the 5 minutes delivery statistics."
> >     > >
> >     > > Questions:
> >     > > - What are the ways to get more control and explicitly enforce a
> load
> >     > > balancing policy like Round Robin across nodes?
> >     > > I found a *DistributeLoad* processor which I haven't tried
> because based
> >     > on
> >     > > it's docs it seems to be load balancing in between multiple
> outbound
> >     > > processors (which obviously will be on the same node).
> >     > > - NiFi seems to be sensitive to skews in input file sizes
> because it
> >     > treats
> >     > > files as one (does not partition them) which means that larger
> files get
> >     > > processed by one node and will effectively be processed much
> slower. What
> >     > > are the recommended ways to mitigate this?
> >     > >
> >     > > Thanks,
> >     > > M
> >     >
> >
> >
> >
>

Re: Load Balancing in NiFi

Posted by Koji Kawamura <ij...@gmail.com>.

Hi Martin,

As Kevin mentioned, the recommended way of distribute files depends on
your use-case.

So, please take this as one possible example to address that challenge.
While I was playing with the new RecordReader functionality added
since 1.2.0, I came up with an idea to distribute filename list by
file size so that each node can process roughly even amount of data
volume without skew, when those are distributed with S2S (RPG).

I experiment the idea locally, and put a NiFi template file and wrote
some explanation on Gist.
https://gist.github.com/ijokarumawak/27f646fdbfa29e71b8abf556aa50fe39

I hope it will be helpful for your use-case, too.

Thanks,
Koji

On Sat, Jun 3, 2017 at 8:58 AM, Kevin Doran <kd...@gmail.com> wrote:
> Hi Martin,
>
> Cool that you are making progress and getting expected results! Regarding your second question, I can offer a bit of advice, and others might chime in as well.
>
> First, w.r.t. this observation:
>
>> NiFi seems to be sensitive to skews in input file sizes because it treats files as one...
>
> I would point out that this is more a characteristic of the test flow you've setup and not NiFi in general. In your case, you've setup your data flow so that the primary node is doing ListHDFS and sending the name of each file to the cluster for processing, where the node that receives the file name processes it entirely. If a big file gets to one node and the processing is expensive, it's true it could take longer for that entire file to process while another node crunches through several smaller files. In some applications, that might be fine. In others, it would be undesirable. The main disadvantage I could think of would be the case there is only one large file to process and you give it to one node that has to deal with it while others sit idle – that doesn't seem optimal :-). On top of the variances in files sizes, consider the complexity that nodes in the cluster might not be uniformly sized and we can see that processing could easily become unbalanced.
>
> Which brings me to your question:
>
>> ... What are the recommended ways to mitigate this?
>
> It would depend on the characteristics of the data you are dealing with and the desired data flow you want to achieve. In general, you would want to apply some data partitioning strategy early in the data flow that is specific to the data you are dealing with. NiFi comes with a ton of tools built in to control and tune the flow of data, including dividing (and joining) flow files along the way. For example, say your input files are CSV (or similar) and the expensive part of your data flow is incurred when processing a row. In that case, early in your data flow you could use a processor such as SplitText [1] to divide your data into smaller flow files before sending them down stream (e.g., to a RPG cluster) for the expensive processing, so that you take full advantage of the cluster at all times.
>
> The recommended strategy really depends on what your goals are and what your data looks like. Once data is loaded into NiFi, you can do a lot with it, and even before the data gets to NiFi, there are usually some strategies you can apply to achieve roughly uniform-sized inputs if that is important to your architecture. For example, if you are using NiFi to process application log files, you could set the logger in your application to roll over based on file size.
>
> I hope this helps. If you get further along and have questions regarding specific types of data and processing, be sure to post a question back here and someone will be able to assist you.
>
> Cheers,
> Kevin
>
> [1] https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.2.0/org.apache.nifi.processors.standard.SplitText
>
> On 6/2/17, 18:50, "Martin Eden" <ma...@gmail.com> wrote:
>
>     Hi Joe,
>
>     I changed the batch size on the input port of the Remote Process Group to 1
>     and got the results I was looking for: ~1/3 of the time for a 3 node
>     cluster compared to 1 node. So big thanks!
>
>     Any takes on my second question though?
>     - NiFi seems to be sensitive to skews in input file sizes because it treats
>     files as one (does not partition them) which means that larger files get
>     processed by one node and will effectively be processed much slower. What
>     are the recommended ways to mitigate this?
>
>     Thanks again,
>     Martin
>
>
>     On Fri, Jun 2, 2017 at 8:37 PM, Joe Witt <jo...@gmail.com> wrote:
>
>     > Martin,
>     >
>     > The problem you're hitting is that site-to-site doesn't by default do
>     > file by file load balancing.  It sends a set of files to one node,
>     > then a set to another, and so on.  This was tuned for constant high
>     > rate/volume transmission so a test like this will have funny results.
>     > Did you tune the batch settings in site-to-site which become available
>     > due to https://issues.apache.org/jira/browse/NIFI-1202
>     >
>     > You can set it to batch sizes of one I'd assume (i've never done this)
>     > and that should then behave the way you're looking for.
>     >
>     > Thanks
>     >
>     > On Fri, Jun 2, 2017 at 3:32 PM, Martin Eden <ma...@gmail.com>
>     > wrote:
>     > > Hi everyone,
>     > >
>     > > Simple flow in NiFi 1.2.0:
>     > > ListHDFS -> FetchHDFS -> PutHDFS
>     > >
>     > > Just moving files from one HDFS folder to another for evaluation
>     > purposes,
>     > > to see if NiFi can be used for this sort of ETL.
>     > >
>     > > To benchmark I am doing is on a 50 x 1 GB input files dataset.
>     > >
>     > > I am testing out with varying cluster sizes: 1, 2, 3 nodes and am
>     > expecting
>     > > to see linear scalability.
>     > >
>     > > I am following the advice in this article
>     > > <https://community.hortonworks.com/articles/16120/how-do-i-distribute-
>     > data-across-a-nifi-cluster.html>
>     > > and
>     > > the docs
>     > > <https://nifi.apache.org/docs/nifi-docs/html/administration-
>     > guide.html#clustering>
>     > > by
>     > > executing the ListHDFS on the primary and sending the file names to a
>     > > Remote Process Group which I am expecting to do the load balancing to the
>     > > FetchHDFS -> PutHDFS executed on each of the nodes. However 90% of the
>     > > files get sent to only one node even on a 3 node cluster. It seems the
>     > load
>     > > balancing is not Round Robin and the Remote Process Group does not allow
>     > > you to set one either. Later I found this post
>     > > <https://community.hortonworks.com/questions/53153/load-balancing-while-
>     > the-fetching-of-file-from-a-s.html>
>     > > which
>     > > says "Nodes with higher load will get fewer FlowFiles. The load balancing
>     > > is done in batches for efficiency, so under light load you may not see an
>     > > exact balanced delivery, but under higher FlowFile volumes you will see a
>     > > balanced delivery over the 5 minutes delivery statistics."
>     > >
>     > > Questions:
>     > > - What are the ways to get more control and explicitly enforce a load
>     > > balancing policy like Round Robin across nodes?
>     > > I found a *DistributeLoad* processor which I haven't tried because based
>     > on
>     > > it's docs it seems to be load balancing in between multiple outbound
>     > > processors (which obviously will be on the same node).
>     > > - NiFi seems to be sensitive to skews in input file sizes because it
>     > treats
>     > > files as one (does not partition them) which means that larger files get
>     > > processed by one node and will effectively be processed much slower. What
>     > > are the recommended ways to mitigate this?
>     > >
>     > > Thanks,
>     > > M
>     >
>
>
>

Re: Load Balancing in NiFi

Posted by Kevin Doran <kd...@gmail.com>.

Hi Martin,

Cool that you are making progress and getting expected results! Regarding your second question, I can offer a bit of advice, and others might chime in as well.

First, w.r.t. this observation:

> NiFi seems to be sensitive to skews in input file sizes because it treats files as one... 

I would point out that this is more a characteristic of the test flow you've setup and not NiFi in general. In your case, you've setup your data flow so that the primary node is doing ListHDFS and sending the name of each file to the cluster for processing, where the node that receives the file name processes it entirely. If a big file gets to one node and the processing is expensive, it's true it could take longer for that entire file to process while another node crunches through several smaller files. In some applications, that might be fine. In others, it would be undesirable. The main disadvantage I could think of would be the case there is only one large file to process and you give it to one node that has to deal with it while others sit idle – that doesn't seem optimal :-). On top of the variances in files sizes, consider the complexity that nodes in the cluster might not be uniformly sized and we can see that processing could easily become unbalanced.

Which brings me to your question:

> ... What are the recommended ways to mitigate this?

It would depend on the characteristics of the data you are dealing with and the desired data flow you want to achieve. In general, you would want to apply some data partitioning strategy early in the data flow that is specific to the data you are dealing with. NiFi comes with a ton of tools built in to control and tune the flow of data, including dividing (and joining) flow files along the way. For example, say your input files are CSV (or similar) and the expensive part of your data flow is incurred when processing a row. In that case, early in your data flow you could use a processor such as SplitText [1] to divide your data into smaller flow files before sending them down stream (e.g., to a RPG cluster) for the expensive processing, so that you take full advantage of the cluster at all times.

The recommended strategy really depends on what your goals are and what your data looks like. Once data is loaded into NiFi, you can do a lot with it, and even before the data gets to NiFi, there are usually some strategies you can apply to achieve roughly uniform-sized inputs if that is important to your architecture. For example, if you are using NiFi to process application log files, you could set the logger in your application to roll over based on file size.

I hope this helps. If you get further along and have questions regarding specific types of data and processing, be sure to post a question back here and someone will be able to assist you.

Cheers,
Kevin

[1] https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.2.0/org.apache.nifi.processors.standard.SplitText

On 6/2/17, 18:50, "Martin Eden" <ma...@gmail.com> wrote:

    Hi Joe,

    I changed the batch size on the input port of the Remote Process Group to 1
    and got the results I was looking for: ~1/3 of the time for a 3 node
    cluster compared to 1 node. So big thanks!

    Any takes on my second question though?
    - NiFi seems to be sensitive to skews in input file sizes because it treats
    files as one (does not partition them) which means that larger files get
    processed by one node and will effectively be processed much slower. What
    are the recommended ways to mitigate this?

    Thanks again,
    Martin

    On Fri, Jun 2, 2017 at 8:37 PM, Joe Witt <jo...@gmail.com> wrote:

    > Martin,
    >
    > The problem you're hitting is that site-to-site doesn't by default do
    > file by file load balancing.  It sends a set of files to one node,
    > then a set to another, and so on.  This was tuned for constant high
    > rate/volume transmission so a test like this will have funny results.
    > Did you tune the batch settings in site-to-site which become available
    > due to https://issues.apache.org/jira/browse/NIFI-1202
    >
    > You can set it to batch sizes of one I'd assume (i've never done this)
    > and that should then behave the way you're looking for.
    >
    > Thanks
    >
    > On Fri, Jun 2, 2017 at 3:32 PM, Martin Eden <ma...@gmail.com>
    > wrote:
    > > Hi everyone,
    > >
    > > Simple flow in NiFi 1.2.0:
    > > ListHDFS -> FetchHDFS -> PutHDFS
    > >
    > > Just moving files from one HDFS folder to another for evaluation
    > purposes,
    > > to see if NiFi can be used for this sort of ETL.
    > >
    > > To benchmark I am doing is on a 50 x 1 GB input files dataset.
    > >
    > > I am testing out with varying cluster sizes: 1, 2, 3 nodes and am
    > expecting
    > > to see linear scalability.
    > >
    > > I am following the advice in this article
    > > <https://community.hortonworks.com/articles/16120/how-do-i-distribute-
    > data-across-a-nifi-cluster.html>
    > > and
    > > the docs
    > > <https://nifi.apache.org/docs/nifi-docs/html/administration-
    > guide.html#clustering>
    > > by
    > > executing the ListHDFS on the primary and sending the file names to a
    > > Remote Process Group which I am expecting to do the load balancing to the
    > > FetchHDFS -> PutHDFS executed on each of the nodes. However 90% of the
    > > files get sent to only one node even on a 3 node cluster. It seems the
    > load
    > > balancing is not Round Robin and the Remote Process Group does not allow
    > > you to set one either. Later I found this post
    > > <https://community.hortonworks.com/questions/53153/load-balancing-while-
    > the-fetching-of-file-from-a-s.html>
    > > which
    > > says "Nodes with higher load will get fewer FlowFiles. The load balancing
    > > is done in batches for efficiency, so under light load you may not see an
    > > exact balanced delivery, but under higher FlowFile volumes you will see a
    > > balanced delivery over the 5 minutes delivery statistics."
    > >
    > > Questions:
    > > - What are the ways to get more control and explicitly enforce a load
    > > balancing policy like Round Robin across nodes?
    > > I found a *DistributeLoad* processor which I haven't tried because based
    > on
    > > it's docs it seems to be load balancing in between multiple outbound
    > > processors (which obviously will be on the same node).
    > > - NiFi seems to be sensitive to skews in input file sizes because it
    > treats
    > > files as one (does not partition them) which means that larger files get
    > > processed by one node and will effectively be processed much slower. What
    > > are the recommended ways to mitigate this?
    > >
    > > Thanks,
    > > M
    >

Re: Load Balancing in NiFi

Posted by Martin Eden <ma...@gmail.com>.

Hi Joe,

I changed the batch size on the input port of the Remote Process Group to 1
and got the results I was looking for: ~1/3 of the time for a 3 node
cluster compared to 1 node. So big thanks!

Any takes on my second question though?
- NiFi seems to be sensitive to skews in input file sizes because it treats
files as one (does not partition them) which means that larger files get
processed by one node and will effectively be processed much slower. What
are the recommended ways to mitigate this?

Thanks again,
Martin


On Fri, Jun 2, 2017 at 8:37 PM, Joe Witt <jo...@gmail.com> wrote:

> Martin,
>
> The problem you're hitting is that site-to-site doesn't by default do
> file by file load balancing.  It sends a set of files to one node,
> then a set to another, and so on.  This was tuned for constant high
> rate/volume transmission so a test like this will have funny results.
> Did you tune the batch settings in site-to-site which become available
> due to https://issues.apache.org/jira/browse/NIFI-1202
>
> You can set it to batch sizes of one I'd assume (i've never done this)
> and that should then behave the way you're looking for.
>
> Thanks
>
> On Fri, Jun 2, 2017 at 3:32 PM, Martin Eden <ma...@gmail.com>
> wrote:
> > Hi everyone,
> >
> > Simple flow in NiFi 1.2.0:
> > ListHDFS -> FetchHDFS -> PutHDFS
> >
> > Just moving files from one HDFS folder to another for evaluation
> purposes,
> > to see if NiFi can be used for this sort of ETL.
> >
> > To benchmark I am doing is on a 50 x 1 GB input files dataset.
> >
> > I am testing out with varying cluster sizes: 1, 2, 3 nodes and am
> expecting
> > to see linear scalability.
> >
> > I am following the advice in this article
> > <https://community.hortonworks.com/articles/16120/how-do-i-distribute-
> data-across-a-nifi-cluster.html>
> > and
> > the docs
> > <https://nifi.apache.org/docs/nifi-docs/html/administration-
> guide.html#clustering>
> > by
> > executing the ListHDFS on the primary and sending the file names to a
> > Remote Process Group which I am expecting to do the load balancing to the
> > FetchHDFS -> PutHDFS executed on each of the nodes. However 90% of the
> > files get sent to only one node even on a 3 node cluster. It seems the
> load
> > balancing is not Round Robin and the Remote Process Group does not allow
> > you to set one either. Later I found this post
> > <https://community.hortonworks.com/questions/53153/load-balancing-while-
> the-fetching-of-file-from-a-s.html>
> > which
> > says "Nodes with higher load will get fewer FlowFiles. The load balancing
> > is done in batches for efficiency, so under light load you may not see an
> > exact balanced delivery, but under higher FlowFile volumes you will see a
> > balanced delivery over the 5 minutes delivery statistics."
> >
> > Questions:
> > - What are the ways to get more control and explicitly enforce a load
> > balancing policy like Round Robin across nodes?
> > I found a *DistributeLoad* processor which I haven't tried because based
> on
> > it's docs it seems to be load balancing in between multiple outbound
> > processors (which obviously will be on the same node).
> > - NiFi seems to be sensitive to skews in input file sizes because it
> treats
> > files as one (does not partition them) which means that larger files get
> > processed by one node and will effectively be processed much slower. What
> > are the recommended ways to mitigate this?
> >
> > Thanks,
> > M
>

Re: Load Balancing in NiFi

Posted by Joe Witt <jo...@gmail.com>.

Martin,

The problem you're hitting is that site-to-site doesn't by default do
file by file load balancing.  It sends a set of files to one node,
then a set to another, and so on.  This was tuned for constant high
rate/volume transmission so a test like this will have funny results.
Did you tune the batch settings in site-to-site which become available
due to https://issues.apache.org/jira/browse/NIFI-1202

You can set it to batch sizes of one I'd assume (i've never done this)
and that should then behave the way you're looking for.

Thanks

On Fri, Jun 2, 2017 at 3:32 PM, Martin Eden <ma...@gmail.com> wrote:
> Hi everyone,
>
> Simple flow in NiFi 1.2.0:
> ListHDFS -> FetchHDFS -> PutHDFS
>
> Just moving files from one HDFS folder to another for evaluation purposes,
> to see if NiFi can be used for this sort of ETL.
>
> To benchmark I am doing is on a 50 x 1 GB input files dataset.
>
> I am testing out with varying cluster sizes: 1, 2, 3 nodes and am expecting
> to see linear scalability.
>
> I am following the advice in this article
> <https://community.hortonworks.com/articles/16120/how-do-i-distribute-data-across-a-nifi-cluster.html>
> and
> the docs
> <https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#clustering>
> by
> executing the ListHDFS on the primary and sending the file names to a
> Remote Process Group which I am expecting to do the load balancing to the
> FetchHDFS -> PutHDFS executed on each of the nodes. However 90% of the
> files get sent to only one node even on a 3 node cluster. It seems the load
> balancing is not Round Robin and the Remote Process Group does not allow
> you to set one either. Later I found this post
> <https://community.hortonworks.com/questions/53153/load-balancing-while-the-fetching-of-file-from-a-s.html>
> which
> says "Nodes with higher load will get fewer FlowFiles. The load balancing
> is done in batches for efficiency, so under light load you may not see an
> exact balanced delivery, but under higher FlowFile volumes you will see a
> balanced delivery over the 5 minutes delivery statistics."
>
> Questions:
> - What are the ways to get more control and explicitly enforce a load
> balancing policy like Round Robin across nodes?
> I found a *DistributeLoad* processor which I haven't tried because based on
> it's docs it seems to be load balancing in between multiple outbound
> processors (which obviously will be on the same node).
> - NiFi seems to be sensitive to skews in input file sizes because it treats
> files as one (does not partition them) which means that larger files get
> processed by one node and will effectively be processed much slower. What
> are the recommended ways to mitigate this?
>
> Thanks,
> M