Posted to dev@nifi.apache.org by ravi mannan <ra...@yahoo.com.INVALID> on 2015/09/06 04:16:31 UTC

nifi for large files

Hello,
I am new to Apache NiFi (and, somewhat, to Java), so forgive my ignorance. I want to use NiFi to process large files. My university has some research work that could be written as NiFi processors. I know "large" is a relative term, but for the resources I have, these files are large: about 700 MB to 1 GB.

I *had* processors that took these large files and split them into smaller ones, ran our algorithms (which I ported over to Java), extracted some data, and finally did an almost "if-else" branching. Seems great for NiFi, right? Unfortunately, I am seeing problems where NiFi appears to be I/O bound. I think this is because of the write-ahead log, the provenance log, and all the transactions recorded in them after each processor runs. So I essentially combined my processes into one large processor, which I know goes against the grain of NiFi.
I wonder if I should use some of these annotations:
@SideEffectFree
@SupportsBatching

Maybe I should have used @SupportsBatching instead of throwing all my code into one big processor?
Even if the above helps with the I/O problems, I still like having one large NiFi processor that does a lot of work. One benefit of a single big processor is that I can use the callback in session.read to load my large files into a java.nio.ByteBuffer. A ByteBuffer is useful in my case because I can keep my data off-heap and run my algorithms one chunk (or ".slice()") at a time. Hopefully you are familiar with this class. If I had many small processors (with the grain), I would have to constantly call session.read with an InputStreamCallback and read the InputStream, which is not as efficient as using a ByteBuffer.
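To show what I mean, here is a toy sketch of the off-heap chunking idea, using only plain JDK classes (no NiFi API); the data and chunk size are made up for illustration, not my real algorithm:

```java
import java.nio.ByteBuffer;

public class SliceDemo {
    // Walk a direct (off-heap) buffer one slice at a time, summing the bytes
    // as a stand-in for running an algorithm on each chunk.
    static long sumInChunks(byte[] data, int chunkSize) {
        ByteBuffer buf = ByteBuffer.allocateDirect(data.length);
        buf.put(data);
        buf.flip();                              // switch from writing to reading
        long sum = 0;
        while (buf.hasRemaining()) {
            int n = Math.min(chunkSize, buf.remaining());
            buf.limit(buf.position() + n);
            ByteBuffer chunk = buf.slice();      // zero-copy view of the next n bytes
            while (chunk.hasRemaining()) sum += chunk.get();
            buf.position(buf.limit());           // advance past this chunk
            buf.limit(data.length);              // restore the limit for the next pass
        }
        return sum;
    }

    public static void main(String[] args) {
        byte[] data = new byte[10];
        for (int i = 0; i < data.length; i++) data[i] = (byte) i;
        System.out.println(sumInChunks(data, 4)); // prints 45
    }
}
```

The point is that slice() gives a bounded view of the buffer without copying, so the algorithm only ever touches one chunk at a time while the backing bytes stay off the Java heap.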

If I'm reading things correctly, FlowFileController.getContent returns an InputStream, so that's not bad. My concern is that with many processors (and many threads) reading InputStreams, I will have many objects waiting to be garbage collected.


So, as you can imagine, I am going off on tangents. I was wondering if you have ideas to reduce the I/O I'm seeing, and what you think of my use of ByteBuffers. Could I pass the ByteBuffer around in a FlowFile?
Thanks,
Ravi

RE: nifi for large files

Posted by Mark Payne <ma...@hotmail.com>.
Ravi,

Not to overload you with responses here, but I did want to mention another
very important aspect to keep in mind as you contemplate loading your
data into off-heap ByteBuffers. Generally, the operating system provides a *very* good
disk cache that will typically give great results.

Of course, the results depend on a lot of different things, but in many of the flows that 
we have, when we look at the disk operations performed, we will see that we have 
many write operations and almost no read operations, even though we read the 
content far more often than we write. This is because as we write the content, the 
operating system will cache the data on write, so that when we do read, it's all 
being read from RAM anyway.

That being said, if the issue you are running into really is related to the Write-Ahead Log
and the Provenance repo, then @SupportsBatching would certainly help. But the
changes that went into 0.3.0 should also help. I would recommend that you look into
those before refactoring the processors into a single monolithic processor.

Thanks
-Mark




Re: nifi for large files

Posted by Rick Braddy <rb...@softnas.com>.
Ravi,

We process large numbers of files and noticed the I/O overhead as well. I put the content, FlowFile, and provenance repositories on SSD and was then able to achieve high enough throughput with our workloads to consistently saturate a 1 Gb pipe with lots of parallel processing, which was fast enough for our purposes.

I'd rather have the provenance history available for support and troubleshooting, so it's a trade-off that makes sense in our case, as SSD is cheap enough now.

Rick




Re: nifi for large files

Posted by Joe Witt <jo...@gmail.com>.
Ravi,

Tangents are fine.  We're here to help and there are a ton of ways to
tackle these sorts of things.

So here are some general thoughts that help with processing data at a high rate:

1) Consider how you have NiFi laid out on the physical system,
specifically with regard to disk usage. You can, for example, put
the FlowFile repo, content repo, and provenance repo on their own physical
disks. You can even have multiple content and provenance repos to spread
the load further as needed. So you should truly expect hundreds
of MB per second of read/write rate. Even on a slow, single-disk,
all-repos-on-one-disk config you should see 50+ MB/s.
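For reference, those repository locations are set in conf/nifi.properties. A rough sketch of splitting them across disks might look like the following (the paths here are made up, and the suffix on the extra content-repository entry is an arbitrary name):

```properties
# conf/nifi.properties -- illustrative paths only
nifi.flowfile.repository.directory=/disk1/flowfile_repository
nifi.content.repository.directory.default=/disk2/content_repository
# additional content repositories use any unique suffix after the final dot
nifi.content.repository.directory.content2=/disk3/content_repository
nifi.provenance.repository.directory.default=/disk4/provenance_repository
```

NiFi stripes content across all of the content repository directories it finds, so adding entries on separate physical disks spreads the write load.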

2) Consider the effect on the garbage collector. Loading lots of
really large objects onto the heap creates pressure and can lead to more
frequent and costly full garbage collections - of course this isn't
always true, as it's always about the use case. I'd avoid being too
eager to go with direct allocation of ByteBuffers unless it is for
long-lived reference-type data. If it is just for processing a large
object that you are already able to break into chunks, the
Input/OutputStream concept NiFi exposes is ideal. In such a
case you shouldn't even have to split up the large objects, because the
amount you'll ever have in memory should be no larger than the largest
object contained within that set. In short, there are often very
efficient ways to process data.
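To illustrate the streaming idea with plain JDK classes (a NiFi InputStreamCallback would hand you the stream the same way; the sizes here are made up):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

public class StreamChunks {
    // Process a stream through a fixed-size buffer: memory use stays at
    // bufSize no matter how large the underlying content is.
    static long countBytes(InputStream in, int bufSize) {
        byte[] buf = new byte[bufSize];
        long total = 0;
        try {
            int n;
            while ((n = in.read(buf)) != -1) {
                // run the algorithm on buf[0..n) here; we just count bytes
                total += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        // stand-in for a large FlowFile's content stream
        InputStream in = new ByteArrayInputStream(new byte[1_000_000]);
        System.out.println(countBytes(in, 8192)); // prints 1000000
    }
}
```

The 8 KB buffer is the only per-thread allocation, which is why the streaming approach keeps heap pressure low even with many processors reading concurrently.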

3) The write-ahead log for FlowFiles and the provenance repo should not become
an issue unless you have completely saturated the disks they run on, or unless
the transaction rate is so high that they simply cannot keep up.

4) Whether you build one monolithic processor or break the logic up
certainly impacts performance, but mostly it is about reuse and
ease of troubleshooting. This concept (how you compose processors) is
something that will evolve over time as you learn more and find
valuable reuse and so on. You should not worry too much about it yet.
The most effective abstractions are often revealed - not invented.

Troubleshooting:
- How are you measuring the I/O? Are you doing this on Linux using
something like 'iostat'?
- Are you sure the issue you're seeing is truly I/O and not garbage
collection? Watching GC behavior with 'jstat' can be extremely helpful.

Which version of NiFi are you on? In the latest snapshot/codebase,
which is NiFi 0.3.0-SNAPSHOT, a massive performance change has landed
which could potentially be a factor for you. It wasn't actually an I/O
loading issue, but rather an implementation problem which caused
NiFi to slow down artificially, like 'stop the world' pauses in the JVM. It
has been remedied, and I've personally seen 4x or more improvement for
certain use cases.

So let's iterate on this for a while and see if we can't get you
where you want to be performance-wise. Let's start with understanding
how you're observing high I/O and what the GC behavior is.

Thanks
Joe
