Posted to users@nifi.apache.org by Brett Tiplitz <br...@systolic-inc.com> on 2016/10/05 15:38:19 UTC

Re: provenance

James -

I believe the complication for me is both the number of objects and the
number of processors the data goes through.  I talked with a few people
and it sounds like NiFi writes each event out to disk and then executes a
commit, which really does have a major impact on performance.  I don't
have the liberty of resolving the disk performance, though I think I will
try moving the journals directory to /dev/shm.  I know I'll lose data on
reboot, but that is just like 1-2 times a year, so I think that loss is
acceptable.  Also, I'm not specifying anything about what data gets
indexed, so it's whatever the default is.
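For anyone trying the same thing: the storage location is set in
nifi.properties, and the journals live under the provenance repository
directory, so repointing that directory at tmpfs should be enough.  The
property names below are the 0.x-era defaults and the path is just an
example, so verify them against your release:

```properties
# nifi.properties (0.x-era names; verify against your release)
nifi.provenance.repository.implementation=org.apache.nifi.provenance.PersistentProvenanceRepository
# Journals and event logs are written under this directory
nifi.provenance.repository.directory.default=/dev/shm/provenance_repository
# Attributes to index (empty/unset = whatever your version's default is)
nifi.provenance.repository.indexed.attributes=
```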

If I'm producing about 6000 events per second (just a guess, though I
think it's pretty large), it would be nice if there were an option not to
perform a commit on every one of the 6000 items.  In reality, I would say
a commit should never occur more than once a second, and even that is
likely too often.
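The batching being asked for here can be sketched generically.  This is
just an illustration of count/time-threshold commit batching, not NiFi's
actual repository code:

```python
import time

class BatchedCommitLog:
    """Toy illustration of commit batching: events accumulate in an
    in-memory buffer and are flushed (one "commit") when either a count
    or an age threshold is reached, instead of committing per event."""

    def __init__(self, max_events=1000, max_age_s=1.0):
        self.max_events = max_events
        self.max_age_s = max_age_s
        self.buffer = []
        self.last_flush = time.monotonic()
        self.commits = 0

    def record(self, event):
        self.buffer.append(event)
        if (len(self.buffer) >= self.max_events
                or time.monotonic() - self.last_flush >= self.max_age_s):
            self.flush()

    def flush(self):
        if self.buffer:
            # In a real repository this is where the write + fsync happens.
            self.commits += 1
            self.buffer.clear()
        self.last_flush = time.monotonic()

log = BatchedCommitLog(max_events=1000, max_age_s=1.0)
for i in range(6000):      # one second's worth of events at 6000/sec
    log.record({"id": i})
log.flush()                # final flush at shutdown
print(log.commits)         # 6 batched commits instead of 6000
```

With a 1000-event threshold, a burst of 6000 events costs 6 commits
rather than 6000; the age threshold bounds how long an event can sit
unpersisted.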

Last, is there a way to measure the actual provenance events going
through?  I'm guessing at what it's actually doing here.

brett

On Fri, Sep 30, 2016 at 2:16 PM, James Wing <jv...@gmail.com> wrote:

> Brett,
>
> The default provenance store, PersistentProvenanceRepository, does
> require I/O in proportion to flowfile events.  Flowfiles with many
> attributes, especially large attributes, are a frequent contributor to
> provenance overload because attribute state is tracked in provenance
> events.  But this is different from flowfile content reads and writes,
> which use the separate content repository.  You might consider moving the
> provenance repository to a separate disk for additional I/O capacity.
>
> Does this sound relevant?  Can you share some details of your flow volumes
> and attribute sizes?
>
> nifi.provenance.repository.buffer.size is only used by the
> VolatileProvenanceRepository implementation, an in-memory provenance
> store.  The property defines the size of the in-memory store.  The volatile
> store can avoid disk I/O issues, but at the expense of reduced provenance
> functionality.
>
> Thanks,
>
> James
>
> On Thu, Sep 29, 2016 at 1:37 PM, Brett Tiplitz <
> brett.m.tiplitz@systolic-inc.com> wrote:
>
>> I'm having a throughput problem when processing data with Provenance
>> recording enabled.  I've pretty much disabled it, so I believe that is the
>> source of my issue.  On occasion, I get a message saying the flow is
>> slowing due to provenance recording.  I was running the out of the box
>> configuration for provenance.
>>
>> I believe the issue might be related to commit writes, though it's just a
>> theory.  There is a variable nifi.provenance.repository.buffer.size,
>> though I don't see anything about what that does.
>>
>> Any suggestions?
>>
>> thanks,
>>
>> brett
>>
>> --
>> Brett Tiplitz
>> Systolic, Inc
>>
>
>


-- 
Brett Tiplitz
Systolic, Inc

Re: provenance

Posted by Brett Tiplitz <br...@systolic-inc.com>.
Joe -

On the data side, the data gets written as it comes in, 2 more times
during processing, and then 1 more time with a MergeContent.  So the data
is written 4 times, but only 3 of those really hit a commit to the file
system.

The provenance, on the other hand, is the other side of the coin.  Not
much gets written, just fact-of records.  I looked at my worst case and I
think it's 13 events for a flow uuid.  It's not my worst path, but it's
the one with the most data going through it.

The disk partitioning likely helps with QoS by ensuring that flow and
provenance events are not held up by content writes, but QoS isn't
available to me.  On the version: it says I'm running 0.6.1.  I just
switched from an old release that was years old, and I think the key to it
working before was that I had QoS on the disk I/O, which I've now lost.

Also, as a note, I had moved the provenance repository to volatile and the
problem disappeared.  That was not the desired outcome, though, so I
re-enabled the persistent one again today.
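For the record, switching to the volatile store is a two-line change in
nifi.properties (the buffer size value here is just an example):

```properties
nifi.provenance.repository.implementation=org.apache.nifi.provenance.VolatileProvenanceRepository
# Number of provenance events held in memory (only used by the volatile store)
nifi.provenance.repository.buffer.size=100000
```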

brett

On Wed, Oct 5, 2016 at 12:42 PM, Joe Witt <jo...@gmail.com> wrote:

> NiFi only writes data to disk when it is actually changing the data.
> It is very uncommon to have a 10-processor flow where all or even most
> are actually touching the content.  You can look at the live status
> history data to see precisely how much content is being read from and
> written to disk.  This makes it very easy to find the heavy users of
> the underlying content repository - and of the disk.
>
> Even in the case of reads you should generally benefit from pretty
> excellent disk/OS caching.  Also, even if your flow forks data and
> sends it down multiple paths, it is not actually creating copies, just
> new references.  NiFi will also automatically combine writes of events
> to the same file on disk within a short span of time and space, which
> also helps with efficiency of disk utilization.  The key point is that
> the content repository is quite efficient at this stage.  If you're
> using a version of NiFi that is years old, these things may not be
> true.
>
> Now, the run duration suggestion is about efficiency of the flowfile
> repository, which is the bookkeeping for the flowfiles (not their
> content).  We want you to be able to reduce how often we commit the
> session, so run duration lets you choose your tolerance for delay while
> we automatically batch sessions together.
>
> So, key is to keep in mind that there are a few repositories and
> things (depending on your configuration) that will use disk:
> 1) Content repository (the bytes of the things you're reading/writing)
> 2) FlowFile repository (information about the flow files and their
> attributes - no content)
> 3) Provenance Repository
> 4) Logs
>
> All of these can be on different partitions, or spread across several
> partitions.
>
> To really help with this particular case I think we'll need you to
> list out the processors involved (generically if necessary) and how
> much they read/write over a five minute period in steady state.  If
> there is really a chain of 10 processors and most are actually reading
> and writing content we can talk about additional strategies such as
> alternative composition of processors that will be more efficient.
>
> Thanks
> jOe
>
>
> On Wed, Oct 5, 2016 at 11:21 AM, Brett Tiplitz
> <br...@systolic-inc.com> wrote:
> > I was always trying to understand the run duration.  I'm fine on
> > latency - if it processes a bunch of events at once and my overall
> > throughput is the same, that's OK.  I increased it to 100 ms.  But I
> > looked at the bulk of my flow, and this feature was only on 1 of the
> > more than 10 processors the data goes through.
> >
> > I realize that slowing the rate of commits seems bad, but even the
> > big guys limit commits.
> >
> >
> > On Wed, Oct 5, 2016 at 12:05 PM, Bryan Bende <bb...@gmail.com> wrote:
> >>
> >> Brett,
> >>
> >> One thing that could possibly improve the performance here, although
> >> hard to say how much, is the concept of "Run Duration" on the
> >> processor scheduling tab.  This is only available on processors
> >> marked with the @SupportsBatching annotation, so it depends on which
> >> processors you are using.
> >>
> >> By increasing the run duration it lets the framework batch together
> >> all of the framework operations during that time period.  The
> >> default setting is 0, which means no batching, giving you the lowest
> >> latency per flowfile, but users can choose to sacrifice some latency
> >> for higher throughput.
> >>
> >> I don't know enough about how provenance events are specifically
> >> committed, but I believe they would be tied to the session commits
> >> so that if a rollback occurred there wouldn't be unwanted events
> >> written.
> >>
> >> -Bryan



-- 
Brett Tiplitz
Systolic, Inc
