Posted to users@nifi.apache.org by si...@vonos.net on 2017/04/19 10:50:15 UTC

Provenance Event performance

Hi All,

In some parts of the NiFi documentation, it is stated that a provenance 
event is emitted for each flowfile for each processor. However elsewhere 
it is stated that no provenance-event is generated for a flowfile sent 
to the “success” output of a processor - which is true?

And are there mechanisms for reducing the number of provenance events 
generated by a NiFi flow? When a dataflow is processing large numbers of 
events, it would seem to me that the generation of provenance events 
will be the limiting factor for performance. When processing 1 million 
records per day, generating 1 million provenance events (or worse) is 
not helpful..

Thanks in advance,

Simon

Re: Provenance Event performance

Posted by si...@vonos.net.

Thanks Joe, Juan,

Perhaps it would be useful to be able to generate provenance events for 
a _sample_ of flowfiles? E.g. every Nth flowfile created by a "data 
ingress" (GET* or LISTEN*) processor gets tracked? Or maybe better: 
every flowfile gets tracked with a probability of 1/N, to ensure that 
specific input patterns (e.g. every (N+1)th message is unusual) don't go 
unaudited.
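
To make the difference between the two sampling ideas concrete, here is a minimal Python sketch. It is not NiFi code and NiFi does not offer such a setting in this thread; the function names, the id range and the "unusual" pattern are all made up for illustration. The pattern is deliberately chosen to be in lockstep with the sampling period, which is a simpler case than the (N+1)th example above.

```python
import random


def sample_every_nth(flowfile_ids, n):
    """Deterministic sampling: keep the n-th, 2n-th, 3n-th, ... item."""
    return [fid for pos, fid in enumerate(flowfile_ids, start=1) if pos % n == 0]


def sample_with_probability(flowfile_ids, n, rng):
    """Probabilistic sampling: keep each item independently with probability 1/n."""
    return [fid for fid in flowfile_ids if rng.random() < 1.0 / n]


# Hypothetical stream: ids 1..10000, where every id with id % 10 == 1 is "unusual".
N = 10
ids = list(range(1, 10_001))
unusual = {fid for fid in ids if fid % N == 1}

nth = sample_every_nth(ids, N)
prob = sample_with_probability(ids, N, random.Random(42))  # fixed seed for repeatability

# The periodic sampler is in lockstep with the pattern and never sees an unusual id.
print("every-Nth:     sampled", len(nth), "unusual seen", len(unusual & set(nth)))
# The random sampler tracks roughly 1/N of the unusual ids.
print("probabilistic: sampled", len(prob), "unusual seen", len(unusual & set(prob)))
```

Both samplers track about one flowfile in N, but only the probabilistic one gives every flowfile the same chance of being audited regardless of its position in the stream.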

I have seen users reporting problems on this email list where the 
provenance repository becomes full and everything stops. That is clearly 
not desirable, but neither is simply discarding the oldest provenance 
records in the repository; some flows are presumably more important than 
others. In particular, a single large-volume flow should presumably not 
cause provenance for other flows to be flushed. The admin-guide page you 
referenced below apparently does not allow provenance storage to be 
configured per-flow. Maybe the ability to configure "sampling" might 
help?

I'm developing a data import process right now for a customer; some 
datasources will be reasonably low-volume while others will be very 
high-volume. Sampling for high-volume flows might be useful, but 
tracking each one is simply not practical. In addition, some datasources 
hold very confidential data; it doesn't seem desirable to record this at 
all, although AFAIK retaining this data in the NiFi content repository 
for unknown periods of time is unavoidable.

Thanks once again for your feedback!

Regards,
Simon

On 2017-04-19 23:36, Joe Witt wrote:
> You're right that the generation and indexing of provenance data
> creates overhead.  We've put considerable effort into minimizing that
> overhead to a point where you should not have to think about it and
> still get all the powerful user experience/auditing gains it provides.
> However, when you're talking about 100s of thousands of events per
> second the overhead can simply be too much to accept.  I don't know if
> we have a JIRA for it yet but it makes a lot of sense to allow
> properly authorized folks to shut off generation of provenance events
> at certain points of a flow.
> 
> On Wed, Apr 19, 2017 at 5:34 PM, Juan Sequeiros <he...@gmail.com> wrote:
>> Simon,
>>
>> I feel that "a provenance event is emitted for each flowfile for each
>> processor" is accurate, with the understanding that "each processor"
>> means the unique processors the flowfile goes through.
>>
>> The provenance database is a Lucene database, and 1 million provenance
>> events is not unreasonable.
>> Performance depends on how you configure your NiFi, and a best practice
>> is to store your provenance on its own disk.
>>
>> Many tweakable settings for provenance are in nifi.properties [1]
>>
>> [1] https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html

Re: Provenance Event performance

Posted by Joe Witt <jo...@gmail.com>.

You're right that the generation and indexing of provenance data
creates overhead.  We've put considerable effort into minimizing that
overhead to a point where you should not have to think about it and
still get all the powerful user experience/auditing gains it provides.
However, when you're talking about 100s of thousands of events per
second the overhead can simply be too much to accept.  I don't know if
we have a JIRA for it yet but it makes a lot of sense to allow
properly authorized folks to shut off generation of provenance events
at certain points of a flow.
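
For scale, the 1 million records per day from the original question can be converted into an average rate and set against the hundreds of thousands of events per second mentioned above. This is only a rough sketch: the function name and the five-processors figure are made-up assumptions, and real traffic is bursty rather than evenly spread over the day.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def events_per_second(records_per_day, events_per_record=1):
    """Average provenance event rate, assuming events are spread evenly over the day."""
    return records_per_day * events_per_record / SECONDS_PER_DAY


# 1 million records per day, one event per record: about 11.6 events per second.
print(round(events_per_second(1_000_000), 1))

# Hypothetical flow where each record passes through 5 processors, each emitting
# one event: about 57.9 events per second.
print(round(events_per_second(1_000_000, events_per_record=5), 1))
```

Either figure is several orders of magnitude below the 100s of thousands of events per second at which the overhead is described as becoming a problem.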

On Wed, Apr 19, 2017 at 5:34 PM, Juan Sequeiros <he...@gmail.com> wrote:
> Simon,
>
> I feel that "a provenance event is emitted for each flowfile for each
> processor" is accurate, with the understanding that "each processor"
> means the unique processors the flowfile goes through.
>
> The provenance database is a Lucene database, and 1 million provenance
> events is not unreasonable.
> Performance depends on how you configure your NiFi, and a best practice
> is to store your provenance on its own disk.
>
> Many tweakable settings for provenance are in nifi.properties [1]
>
> [1] https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html

Re: Provenance Event performance

Posted by Juan Sequeiros <he...@gmail.com>.

Simon,

I feel that "a provenance event is emitted for each flowfile for each
processor" is accurate, with the understanding that "each processor"
means the unique processors the flowfile goes through.

The provenance database is a Lucene database, and 1 million provenance
events is not unreasonable.
Performance depends on how you configure your NiFi, and a best practice
is to store your provenance on its own disk.

Many tweakable settings for provenance are in nifi.properties [1]

[1] https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html
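
As an illustration of the kind of settings the linked guide covers, a nifi.properties fragment might look like the one below. The property names are taken from the Administration Guide as I understand it; the values and the mount path are assumptions for illustration, not recommendations, and should be checked against the guide for the NiFi version in use.

```properties
# Keep the provenance repository on its own disk (the path is hypothetical)
nifi.provenance.repository.directory.default=/mnt/provenance_disk/provenance_repository

# Retention limits for provenance data (illustrative values)
nifi.provenance.repository.max.storage.time=24 hours
nifi.provenance.repository.max.storage.size=10 GB

# Rollover thresholds for the active event file (illustrative values)
nifi.provenance.repository.rollover.time=30 secs
nifi.provenance.repository.rollover.size=100 MB

# Threads used for indexing and for querying (illustrative values)
nifi.provenance.repository.index.threads=2
nifi.provenance.repository.query.threads=2
```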
