Posted to users@nifi.apache.org by Chris Lim <ch...@gmail.com> on 2015/12/04 03:48:46 UTC

Re: Multiple Content Repositories

Thanks Joe.

Following up on the earlier questions about multiple content repositories, I
still have a few more. :)

1. Is it correct to say that the use case for having multiple content
repositories is to take advantage of parallel disk writes, assuming that the
system has multiple bare-metal disk drives mounted? Are there any other
use cases for multiple content repositories?

2. In an enterprise environment where NiFi writes to a SAN (Storage Area
Network), does it make sense to have logical volumes mounted for the
multiple content repositories, or are we better off with just one content
repository? The assumption here is that we are dealing with multiple files
of 10 to 50 gigabytes each.

3. Will NiFi have disk contention issues in a scenario where we have 5 or
more independent flows on a single NiFi instance and all the flows are
involved in ETL?

Regards,
Chris



On Fri, Nov 27, 2015 at 3:56 AM, Joe Witt <jo...@gmail.com> wrote:

> Chris,
>
> It is something which occurs automatically and behind the scenes.
> Under normal circumstances there will be many FlowFiles written to the
> same content claim; they'll just each have different offsets.  It is
> more aligned with how disks work in terms of efficiently writing data,
> efficiently reading data, and efficiently deleting the entire claim
> (which is a file on disk).  Rather than a delete per flowfile we
> delete once there are no more references to the entire claim.  Much
> faster.  And all of that is totally abstracted away from the
> perspective of someone writing extensions.  This bit, combined with
> the copy-on-write and pass-by-reference logic the content repository
> provides, is a key part of what makes NiFi efficient.
>
> Thanks
> Joe
>
> On Thu, Nov 26, 2015 at 1:40 AM, Chris Lim <ch...@gmail.com>
> wrote:
> > Thanks Mark.
> >
> > The answer on the content repository round-robin is perfect. :)
> >
> > It got me curious when you mentioned that one or more FlowFiles can be
> > written to the same Resource Claim. Is there a specific scenario in which
> > this can occur? Under normal circumstances, is there only one FlowFile
> > written to a Resource Claim?
> >
> > --
> > Chris
> >
> >
> > On Wed, Nov 25, 2015 at 9:39 PM, Mark Payne <ma...@hotmail.com> wrote:
> >>
> >> Chris,
> >>
> >> In terms of round robin-ing between the repositories, yes, it follows a
> >> simple round-robin approach.
> >> In terms of sections within those containers, the answer is more of a
> >> "sort-of." Each FlowFile has what we refer to as a Resource Claim, which
> >> points to a location in the content repository. In the case of the
> >> FileSystemRepository (which is the default and almost all that's ever used
> >> right now), the Resource Claim maps to a file on disk. In order to be very
> >> efficient, we may write many FlowFiles to the same Resource Claim.
> >>
> >> Once we finish writing to a particular Resource Claim, we close the
> >> resources and create a new one for the next FlowFile. When we create these
> >> Resource Claims, we do so in a round-robin fashion across the different
> >> Sections of the content repository.
> >>
> >> Sorry, this is a fairly long-winded answer to such a seemingly simple
> >> question :) but I wasn't sure how much detail you were
> >> looking for. If anything is not clear, let us know.
> >>
> >> Thanks
> >> -Mark
> >>
> >>
> >> On Nov 25, 2015, at 5:12 AM, Chris Lim <ch...@gmail.com>
> >> wrote:
> >>
> >> Hi Guys,
> >>
> >> I am configuring our NiFi instance to have multiple content repositories,
> >> specifically with the "nifi.content.repository.directory." property setting
> >> as mentioned in the Administrator's guide. Am I correct that FlowFile
> >> contents are written to the repositories using a round-robin algorithm?
> >> Also, do the sections within a specific content repository follow the same
> >> round-robin algorithm?
> >>
> >> Thanks,
> >> Chris
> >>
> >>
> >
>

Re: Multiple Content Repositories

Posted by Mark Payne <ma...@hotmail.com>.
Chris,

Not a problem, we're happy to answer questions :)

Re #1: There are two benefits to having multiple content repositories. The first, as you mentioned, is parallel reads and writes, which can be a tremendous performance improvement. The other benefit is simply that it provides you with more storage, in general. By default, NiFi "archives" the content when it's done with it instead of immediately deleting it. This allows you to go into your Provenance data and actually View/Download the data exactly as it was at that point in the flow. This is extremely powerful because Provenance shows you the lineage (How did it get to this point?), the attributes (The context used to get to this point), and the data itself. Having all 3 of these pieces of information dramatically improves your ability to debug and understand what's happening - and gives you the ability to replay individual pieces of data from anywhere in the flow if something wasn't done right. But, as you can imagine, storing all of this information can take a lot of - well, storage. So having multiple disks to store that on can be very helpful.
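
For reference, a nifi.properties fragment along these lines is roughly what "multiple content repositories plus archiving" looks like. Treat it as a sketch: the suffixes content1..content3 and the /mnt/diskN mount points are made up for illustration, and the archive values shown are only examples, so check the Administrator's Guide for the exact properties and defaults in your NiFi version.

```properties
# One entry per content repository; the suffix after "directory." is a name you choose.
# (content1..content3 and the mount points below are hypothetical.)
nifi.content.repository.directory.content1=/mnt/disk1/content_repository
nifi.content.repository.directory.content2=/mnt/disk2/content_repository
nifi.content.repository.directory.content3=/mnt/disk3/content_repository

# Archiving keeps content around after the flow is done with it,
# so it can still be viewed or replayed from Provenance (example values).
nifi.content.repository.archive.enabled=true
nifi.content.repository.archive.max.retention.period=12 hours
nifi.content.repository.archive.max.usage.percentage=50%
```

Ideally each directory sits on its own physical disk; several directories on the same disk add names but not parallelism or capacity.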

Re #2: I don't know that I've used any SAN to back my repositories other than the EBS provided by Amazon EC2. In that environment, I found that having one or having multiple repos was essentially equivalent.

Re #3: Whether or not NiFi has disk contention really depends on the data rate. NiFi is pretty smart about how it handles file I/O: it is able to write multiple FlowFiles to the same underlying file on disk, and by default FlowFiles are sorted/prioritized in a queue such that they are the most efficient to read. That being said, if you're reading/writing hundreds of MB/sec then you're probably going to have some disk contention :) The number of flows you have running does not really play a factor, though - one flow processing 100 MB/sec will result in approximately the same contention as 10 flows each processing 10 MB/sec.
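
To make the "multiple FlowFiles in the same underlying file" idea concrete, here is a toy Python model of what was described earlier in this thread: content is appended to a shared claim at increasing offsets, new claims are created round-robin across the repositories, and a claim is only deleted once nothing references it any more. This is not NiFi's code. The class names, the max_claim_bytes rollover threshold, and the in-memory bytearray standing in for a file on disk are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from itertools import cycle


@dataclass
class ResourceClaim:
    """Stands in for one file on disk that may hold many FlowFiles' content."""
    container: str                      # which content repository it lives in
    claim_id: int
    data: bytearray = field(default_factory=bytearray)
    ref_count: int = 0                  # number of FlowFiles still pointing here


@dataclass
class ContentClaim:
    """What a FlowFile holds: a reference into a ResourceClaim."""
    resource: ResourceClaim
    offset: int
    length: int


class ToyContentRepository:
    """Hypothetical model; names and rollover rule are illustrative only."""

    def __init__(self, containers, max_claim_bytes=1024):
        self._containers = cycle(containers)    # round-robin over repositories
        self._max_claim_bytes = max_claim_bytes
        self._next_id = 0
        self._current = self._new_claim()
        self.deleted = []                       # ids of claims already removed

    def _new_claim(self):
        claim = ResourceClaim(next(self._containers), self._next_id)
        self._next_id += 1
        return claim

    def write(self, content: bytes) -> ContentClaim:
        # Roll over to a fresh claim in the next repository once this one is full.
        if len(self._current.data) >= self._max_claim_bytes:
            self._current = self._new_claim()
        rc = self._current
        offset = len(rc.data)
        rc.data.extend(content)                 # append; earlier content is untouched
        rc.ref_count += 1
        return ContentClaim(rc, offset, len(content))

    def read(self, claim: ContentClaim) -> bytes:
        start = claim.offset
        return bytes(claim.resource.data[start:start + claim.length])

    def release(self, claim: ContentClaim) -> None:
        # One delete per claim rather than one per FlowFile: the backing "file"
        # goes away only when no FlowFile references it and it is closed.
        rc = claim.resource
        rc.ref_count -= 1
        if rc.ref_count == 0 and rc is not self._current:
            self.deleted.append(rc.claim_id)
            rc.data.clear()
```

With two repositories and a small threshold, e.g. ToyContentRepository(["repo1", "repo2"], max_claim_bytes=8), the first two 4-byte writes share one claim in repo1 at different offsets, the third write starts a new claim in repo2, and the first claim is only removed after both of its FlowFiles have been released.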

Also of note, you can assign multiple partitions to the Provenance Repository as well. If you are processing tons of very small FlowFiles, you may actually be better off using multiple partitions for the Provenance Repository than using multiple partitions for the content repository - or if you have the partitions free, use multiple for both.
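
As a sketch of what that looks like in nifi.properties (the prov1/prov2 suffixes and the mount points are hypothetical, so adjust to your own layout and confirm against the Administrator's Guide for your version):

```properties
# One entry per Provenance Repository partition; the suffix is a name you choose.
nifi.provenance.repository.directory.prov1=/mnt/disk4/provenance_repository
nifi.provenance.repository.directory.prov2=/mnt/disk5/provenance_repository
```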

Does this clear things up? Hopefully it doesn't muddy the waters more, at least! :)

Thanks
-Mark

