Posted to users@nifi.apache.org by Aurélien DEHAY <au...@zorel.org> on 2015/10/08 12:52:46 UTC

Nifi & Spark receiver performance configuration

Hello.

I'm running some experiments with Apache NiFi to see where we could use it.

One idea is to use NiFi to feed a Spark cluster, so I'm running a simple test: GenerateFlowFile => Spark output port, with a simple word count on the Spark side.

I was pretty unhappy with the performance out of the box, so I searched the net and found almost nothing.

So I looked at nifi.properties and found that some of the following properties have a huge impact on how many messages per second are delivered to Spark:

nifi.queue.swap.threshold=20000
nifi.swap.in.period=1 sec
nifi.swap.in.threads=1
nifi.swap.out.period=1 sec
nifi.swap.out.threads=4

The documentation seems unclear on this point for output ports. Does anyone have a pointer for me?

Thanks.

Aurélien.

Re: Nifi & Spark receiver performance configuration

Posted by Mark Payne <ma...@hotmail.com>.
Aurelien,

The way that swapping works in NiFi is that when the number of FlowFiles in a particular queue builds up
past a certain point, NiFi will write those files to disk and drop them from the Java heap in order to avoid
running out of heap space. Then, when the number of FlowFiles in the queue drops below a certain number,
those swap files are swapped back in so that the queue can start serving them up again.

Unfortunately, the current implementation certainly needs some work. Right now, swapping in happens in a background
thread, so when you see only 20,000 FlowFiles being sent, that's because only 20,000 FlowFiles are in memory,
and the queue is then waiting for the background thread to swap the next ones back in.
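
The behaviour described above can be illustrated with a toy model. This is only a sketch, not NiFi's code: the SwappingQueue class, the simulate function, and the consumer_rate parameter are hypothetical names made up for the illustration. It assumes that at most "threshold" items are held in memory, that a background task refills memory once per swap-in period, and that time advances in whole seconds.

```python
from collections import deque


class SwappingQueue:
    """Toy model of a queue that keeps at most `threshold` items in memory."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.in_memory = deque()  # items a consumer can take immediately
        self.swapped = deque()    # stands in for items written to disk

    def offer(self, item):
        # Once memory is full (or older items are already swapped out),
        # new arrivals go to swap so that ordering is preserved.
        if self.swapped or len(self.in_memory) >= self.threshold:
            self.swapped.append(item)
        else:
            self.in_memory.append(item)

    def poll(self):
        # The consumer only ever sees in-memory items.
        return self.in_memory.popleft() if self.in_memory else None

    def background_swap_in(self):
        # Called once per swap-in period: refill memory up to the threshold.
        while self.swapped and len(self.in_memory) < self.threshold:
            self.in_memory.append(self.swapped.popleft())


def simulate(total, threshold, swap_in_period, consumer_rate):
    """Return the number of seconds needed to drain `total` queued items.

    consumer_rate (an assumed value) is how many items per second the
    consumer could take if the queue never stalled.
    """
    queue = SwappingQueue(threshold)
    for i in range(total):
        queue.offer(i)

    seconds = 0
    delivered = 0
    while delivered < total:
        # The background task fires at every multiple of the period.
        if seconds > 0 and seconds % swap_in_period == 0:
            queue.background_swap_in()
        for _ in range(consumer_rate):
            if queue.poll() is None:
                break  # memory is empty: stall until the next swap-in
            delivered += 1
        seconds += 1
    return seconds
```

With 200,000 queued items and an assumed consumer rate of 50,000 items per second, simulate(200000, 20000, 10, 50000) returns 91, simulate(200000, 20000, 1, 50000) returns 10, and simulate(200000, 200000, 1, 50000) returns 4. In this model the consumer is idle most of the time whenever the threshold is smaller than the backlog, which matches the pattern of one batch of at most 20,000 FlowFiles per swap-in period.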

This is certainly something that we want to address, so that rather than having the background thread do this,
the queue itself will be responsible for pulling those FlowFiles back in, and that should alleviate this issue.
We've just not yet gotten to the point of being able to reimplement this.
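
As a rough illustration of the idea of letting the queue pull swapped data back in by itself, here is a minimal sketch. It is not NiFi's implementation, current or planned: SelfRefillingQueue and its swap_ins counter are hypothetical names, and the "disk" is just a second in-memory deque.

```python
from collections import deque


class SelfRefillingQueue:
    """Toy queue that swaps data back in on demand instead of on a timer."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.in_memory = deque()
        self.swapped = deque()  # stands in for swap files on disk
        self.swap_ins = 0       # number of on-demand swap-ins performed

    def offer(self, item):
        # Keep at most `threshold` items in memory; the rest is swapped out.
        if self.swapped or len(self.in_memory) >= self.threshold:
            self.swapped.append(item)
        else:
            self.in_memory.append(item)

    def poll(self):
        # When memory runs dry, the polling caller pulls in the next batch
        # itself rather than waiting for a periodic background task.
        if not self.in_memory and self.swapped:
            self._swap_in()
        return self.in_memory.popleft() if self.in_memory else None

    def _swap_in(self):
        self.swap_ins += 1
        while self.swapped and len(self.in_memory) < self.threshold:
            self.in_memory.append(self.swapped.popleft())


# Queue 200,000 items with a threshold of 20,000, then drain without pauses.
queue = SelfRefillingQueue(threshold=20_000)
for i in range(200_000):
    queue.offer(i)

delivered = 0
while queue.poll() is not None:
    delivered += 1
```

In this sketch the consumer never waits for a timer: all 200,000 items are delivered in one pass, with 9 on-demand swap-ins after the first 20,000 in-memory items are consumed. The trade-off is that the swap-in cost is paid on the polling thread.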

Thanks
-Mark


> On Oct 8, 2015, at 11:22 AM, Aurélien DEHAY <au...@zorel.org> wrote:
> Attachments: nifi.scala, 2015-10-08 17_07_04-Spark shell - Streaming Statistics.png, 2015-10-08 17_13_28-Spark shell - Streaming Statistics.png, 2015-10-08 17_21_09-Spark shell - Streaming Statistics.png


RE: Nifi & Spark receiver performance configuration

Posted by Aurélien DEHAY <au...@zorel.org>.
Hello.

I'm testing on a VM with 8 vCPUs (E5606, 2.13 GHz) and 16 GB of memory.

I just have a GenerateFlowFile processor that sends data to an output port for Spark. On that side the performance is very good: I can generate a huge number of FlowFiles.

My Spark job is configured as local[4] and uses 3 receivers. It just does a simple word count on the stream, with a streaming context interval of 2 seconds.

My test is simple: I generate about 200k messages (about 200 MB) from GenerateFlowFile without starting the output port, in order to queue the data. Then I stop the processor and start the output port and my Spark job (see attached file) with:
 bin/spark-shell --master local[4] --packages "org.apache.nifi:nifi-spark-receiver:0.3.0" -i nifi.scala


With:
nifi.queue.swap.threshold=20000
nifi.swap.in.period=10 sec
nifi.swap.in.threads=1
nifi.swap.out.period=1 sec
nifi.swap.out.threads=4

I get the result shown in the 2015-10-08 17_07_04 screenshot.

With:
nifi.queue.swap.threshold=20000
nifi.swap.in.period=1 sec
nifi.swap.in.threads=1
nifi.swap.out.period=1 sec
nifi.swap.out.threads=4

I get the result shown in the 2015-10-08 17_13_28 screenshot.

With:
nifi.queue.swap.threshold=200000
nifi.swap.in.period=1 sec
nifi.swap.in.threads=1
nifi.swap.out.period=1 sec
nifi.swap.out.threads=4

I get the result shown in the 2015-10-08 17_21_09 screenshot.

Many questions:
- Why does NiFi limit the number of FlowFiles sent to at most the swap threshold?
- Why does NiFi wait for swap.in.period before sending the next batch of FlowFiles?

It's not that I'm unhappy with NiFi's performance or the Spark receiver, but the configuration or the documentation should be clearer about tuning.

Regards.
________________________________
From: Bryan Bende <bb...@gmail.com>
Sent: Thursday, October 8, 2015 16:52
To: users@nifi.apache.org
Subject: Re: Nifi & Spark receiver performance configuration

Hello,

When you say you were unhappy with the performance, can you give some more information about what was not performing well?

Was the NiFi Spark Receiver not pulling messages in fast enough and they were queuing up in NiFi?
Was NiFi not producing messages as fast as you expected?
In what kind of environment were you running this? All on a local machine for testing?

-Bryan



Re: Nifi & Spark receiver performance configuration

Posted by Bryan Bende <bb...@gmail.com>.
Hello,

When you say you were unhappy with the performance, can you give some more
information about what was not performing well?

Was the NiFi Spark Receiver not pulling messages in fast enough and they
were queuing up in NiFi?
Was NiFi not producing messages as fast as you expected?
In what kind of environment were you running this? All on a local machine for
testing?

-Bryan
