Posted to dev@nifi.apache.org by Joe Skora <js...@gmail.com> on 2017/01/27 14:42:44 UTC

Re: Thoughts on NIFI-1847 Improve Provenance Space Utilization

I'm bumping this hoping for some feedback before I dive back into the
ticket.

Lacking any response for 30 days, I figure this either got overlooked due
to year-end or no one has an opinion to add to the discussion (which seems
unlikely).  ;-)



On Tue, Dec 27, 2016 at 2:50 PM, Joe Skora <js...@gmail.com> wrote:

> All,
>
> Before the change to the schema-based repositories was committed, I was doing
> some testing for NIFI-1847 Improve Provenance Space Utilization
> <https://issues.apache.org/jira/browse/NIFI-1847> based on these
> assumptions.
>
>    - A partition {{nifi.provenance.repository.directory.XYZ}} entry would
>    only be individually tracked if there was a corresponding
>    {{nifi.provenance.repository.directorySize.XYZ}} entry; otherwise it
>    would only be considered against the aggregate totals.
>    - The original {{nifi.provenance.repository.max.storage.size}}
>    property would represent an aggregate across all partitions, whether
>    specifically tracked or not.
>    - Tracked partitions will be evaluated first and their sizes
>    accumulated to avoid double work.
>
>
> My testing showed improved use of space by partition, but also showed two
> problems.
>
>    - Calling the OS for the size of every journal, partition, and index
>    file is expensive so I'm looking at going to the OS every Nth pass and
>    tracking delta writes in between.
>    - Writers are chosen based on round robin, which is far from optimal
>    when the size and available space vary by partition.  I have some
>    thoughts but haven't put anything in code yet.
>
>
> Considering that provenance recording seems to be a bottleneck on some
> flows, this needs to be as fast as possible while staying 100%
> reliable.  So, any thoughts on these issues or wisdom relating to
> repositories and provenance would be appreciated.
>
> Thanks,
> Joe
>
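
The first problem in the quoted message (asking the OS for sizes on every pass) can be sketched as follows. This is a minimal illustration, not code from NiFi or from the NIFI-1847 work: the class name CachedPartitionSize, its methods, and the refreshEvery parameter are all hypothetical. The idea is to keep a running estimate from recorded write/purge deltas and only re-measure on every Nth pass.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Hypothetical helper: caches a partition's size and re-measures only every Nth pass. */
final class CachedPartitionSize {
    private final LongSupplier measure;  // the expensive call, e.g. summing file sizes on disk
    private final int refreshEvery;      // N: number of passes between real measurements
    private final AtomicLong estimate = new AtomicLong();
    private int passesSinceRefresh;

    CachedPartitionSize(LongSupplier measure, int refreshEvery) {
        if (refreshEvery < 1) {
            throw new IllegalArgumentException("refreshEvery must be >= 1");
        }
        this.measure = measure;
        this.refreshEvery = refreshEvery;
        this.estimate.set(measure.getAsLong()); // one real measurement up front
    }

    /** Record bytes written (positive) or purged (negative) between measurements. */
    void recordDelta(long bytes) {
        estimate.addAndGet(bytes);
    }

    /**
     * Called once per maintenance pass. Returns the running estimate, except on
     * every Nth call, where the estimate is replaced by a fresh measurement.
     * A delta recorded while the re-measurement is in flight may be overwritten;
     * the next refresh corrects it.
     */
    synchronized long sizeForThisPass() {
        if (++passesSinceRefresh >= refreshEvery) {
            passesSinceRefresh = 0;
            estimate.set(measure.getAsLong());
        }
        return estimate.get();
    }
}
```

The trade-off in this sketch is accuracy versus cost: a larger refreshEvery means fewer OS calls but a longer window in which the estimate can drift from what is really on disk.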

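For the second problem in the message above (round-robin writer selection when partitions differ in size and free space), one possible alternative is to pick a partition with probability proportional to its remaining space. The message does not say which approach was being considered, so this is only an assumed illustration; Partition and FreeSpaceWeightedSelector are hypothetical names, not NiFi classes.

```java
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical view of one provenance storage location. */
final class Partition {
    final String name;
    final long capacityBytes;
    final long usedBytes;

    Partition(String name, long capacityBytes, long usedBytes) {
        this.name = name;
        this.capacityBytes = capacityBytes;
        this.usedBytes = usedBytes;
    }

    long freeBytes() {
        return Math.max(0L, capacityBytes - usedBytes);
    }
}

/** Hypothetical selector: chooses a partition weighted by its free space. */
final class FreeSpaceWeightedSelector {

    /** Picks a partition using a caller-supplied roll in [0, 1); useful for testing. */
    static Partition select(List<Partition> partitions, double roll) {
        long totalFree = 0L;
        for (Partition p : partitions) {
            totalFree += p.freeBytes();
        }
        if (totalFree == 0L) {
            throw new IllegalStateException("no free space in any partition");
        }
        double target = roll * totalFree;
        long cumulative = 0L;
        Partition lastWithSpace = null;
        for (Partition p : partitions) {
            if (p.freeBytes() == 0L) {
                continue; // full partitions are never chosen
            }
            cumulative += p.freeBytes();
            lastWithSpace = p;
            if (target < cumulative) {
                return p;
            }
        }
        return lastWithSpace; // guards against rounding at the upper edge
    }

    /** Picks a partition using a random roll. */
    static Partition select(List<Partition> partitions) {
        return select(partitions, ThreadLocalRandom.current().nextDouble());
    }
}
```

With this weighting, a partition holding 900 bytes free out of 910 total receives roughly 99% of new writers, and a full partition receives none, whereas round robin would give each location an equal share regardless of space.
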
Re: Thoughts on NIFI-1847 Improve Provenance Space Utilization

Posted by Mark Payne <ma...@hotmail.com>.
Joe,

Yes - certainly I believe that the existing implementation needs to be improved, as it can
result in quite a bit of thrashing and overhead, as you've laid out. I should point out that
a "complete rewrite" would indeed be concerning. I'm not at all proposing that we rewrite
the existing implementation. Rather, I'm putting together an alternate implementation that
users will be able to use by changing a property in nifi.properties.

I simply brought that up because I think that eventually users will move away from the
PersistentProvenanceRepository altogether in favor of the new one. So I didn't want you to
spend a lot of effort improving the utilization of the old one if it isn't used much longer. I did not,
however, consider the 0.x baseline - the new repository won't be backported to that line. If this
is something that you need for the 0.x baseline then by all means go for it :)

Thanks
-Mark

> On Feb 1, 2017, at 10:24 AM, Joe Skora <js...@gmail.com> wrote:
> 
> Mark,
> 
> The gist of the original ticket NIFI-1847 Improve Provenance Space
> Utilization <https://issues.apache.org/jira/browse/NIFI-1847>[1] was about
> efficient use of the configured repository space and support for multiple
> asymmetric storage locations.  My email was seeking input on that,
> especially in consideration of property changes necessary to describe
> discrete storage locations of different sizes.
> 
> I think any improvement to the repository performance will be welcomed by a
> lot of folks, but I'm a little concerned about a complete rewrite.  Do you
> plan to port the new repository back to 0.x?  Without porting it back,
> users of 0.x will still have problems.
> 
> On a heavy provenance test flow I observed storage thrashing and overrun on
> 0.x and 1.x, but it seemed to be caused by the cleanup logic not the
> underlying repository implementation.  With minor changes to cleanup
> thresholds it ran better and without overrunning storage.  The PRs
> submitted in November on NIFI-3039
> <https://issues.apache.org/jira/browse/NIFI-3039>[2] (PR#1240
> <https://github.com/apache/nifi/pull/1240>[3] and PR#1241
> <https://github.com/apache/nifi/pull/1241>[4]) implemented those cleanup
> threshold adjustments.  I know you commented on NIFI-3039, but did you try
> the changes before starting a complete rewrite?
> 
> The first problem was that the current repository starts cleanup at >90%
> used but stops once it reaches <100% used, so it tends to fluctuate close
> to capacity, increasing cleanup cycles.  Similarly, rollover limits itself
> to 110% of configured space, implying an intentional overrun.  The changes
> on the PRs resulted in intermittent instead of constant cleanup, so
> provenance ran smoother and more reliably even with the current repository
> implementation.
> 
> [1] https://issues.apache.org/jira/browse/NIFI-1847
> [2] https://issues.apache.org/jira/browse/NIFI-3039
> [3] https://github.com/apache/nifi/pull/1240
> [4] https://github.com/apache/nifi/pull/1241
> 
> Regards,
> Joe


Re: Thoughts on NIFI-1847 Improve Provenance Space Utilization

Posted by Joe Skora <js...@gmail.com>.
Mark,

The gist of the original ticket NIFI-1847 Improve Provenance Space
Utilization <https://issues.apache.org/jira/browse/NIFI-1847>[1] was about
efficient use of the configured repository space and support for multiple
asymmetric storage locations.  My email was seeking input on that,
especially in consideration of property changes necessary to describe
discrete storage locations of different sizes.

I think any improvement to the repository performance will be welcomed by a
lot of folks, but I'm a little concerned about a complete rewrite.  Do you
plan to port the new repository back to 0.x?  Without porting it back,
users of 0.x will still have problems.

On a heavy provenance test flow I observed storage thrashing and overrun on
0.x and 1.x, but it seemed to be caused by the cleanup logic not the
underlying repository implementation.  With minor changes to cleanup
thresholds it ran better and without overrunning storage.  The PRs
submitted in November on NIFI-3039
<https://issues.apache.org/jira/browse/NIFI-3039>[2] (PR#1240
<https://github.com/apache/nifi/pull/1240>[3] and PR#1241
<https://github.com/apache/nifi/pull/1241>[4]) implemented those cleanup
threshold adjustments.  I know you commented on NIFI-3039, but did you try
the changes before starting a complete rewrite?

The first problem was that the current repository starts cleanup at >90%
used but stops once it reaches <100% used, so it tends to fluctuate close
to capacity, increasing cleanup cycles.  Similarly, rollover limits itself
to 110% of configured space, implying an intentional overrun.  The changes
on the PRs resulted in intermittent instead of constant cleanup, so
provenance ran smoother and more reliably even with the current repository
implementation.
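
The effect described above, cleanup that runs intermittently rather than constantly, is what you get when the stop threshold sits clearly below the start threshold. The sketch below illustrates that general pattern only. It is not the code from PR#1240 or PR#1241, the class name CleanupPolicy is hypothetical, and the 90%/80% values in the usage note are illustrative assumptions rather than the thresholds the PRs actually chose.

```java
/** Hypothetical cleanup policy with separate start and stop thresholds. */
final class CleanupPolicy {
    private final long maxBytes;         // configured repository size
    private final double startFraction;  // begin purging above this fraction of maxBytes
    private final double stopFraction;   // keep purging until usage drops below this fraction
    private boolean purging;

    CleanupPolicy(long maxBytes, double startFraction, double stopFraction) {
        if (!(stopFraction < startFraction)) {
            throw new IllegalArgumentException("stopFraction must be below startFraction");
        }
        this.maxBytes = maxBytes;
        this.startFraction = startFraction;
        this.stopFraction = stopFraction;
    }

    /** Returns true while cleanup should keep deleting old event files. */
    boolean shouldPurge(long usedBytes) {
        if (!purging && usedBytes > maxBytes * startFraction) {
            purging = true;   // crossed the upper threshold: start a cleanup cycle
        } else if (purging && usedBytes < maxBytes * stopFraction) {
            purging = false;  // dropped below the lower threshold: end the cycle
        }
        return purging;
    }
}
```

For example, new CleanupPolicy(1000, 0.90, 0.80) starts purging once usage passes 900 bytes and keeps going until it falls below 800, so the repository then has room to grow before the next cycle begins.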

[1] https://issues.apache.org/jira/browse/NIFI-1847
[2] https://issues.apache.org/jira/browse/NIFI-3039
[3] https://github.com/apache/nifi/pull/1240
[4] https://github.com/apache/nifi/pull/1241

Regards,
Joe

On Fri, Jan 27, 2017 at 9:58 AM, Mark Payne <ma...@hotmail.com> wrote:

> Hey Joe,
>
> Sorry - I don't think I saw this. I have actually been working on
> NIFI-3356 [1] for which
> I hope to have a PR up in the next few days. I've been doing some
> long-running tests,
> and I did find an issue yesterday so I've redeployed to some nodes to let
> it run over the
> weekend. If all looks good I can perhaps have a PR in on Monday.
>
> The Persistent Provenance Repository is quite old. At the time that it was
> written, the requirements
> were simply to store data in a sequential fashion and make it available
> for a Reporting Task to iterate
> over the events sequentially. There was no compression, and there was no
> indexing/searching. The
> requirements clearly have changed over the years :) So I started working
> on a totally new implementation
> and my testing shows that it is 2-3 times faster than the Persistent
> Provenance Repository while at the
> same time providing faster query capabilities and immediate access to
> events (as opposed to after a 30-
> second rollover period).
>
> When I get a chance to get it posted, it would be great if you want to put
> it through the wringer as well.
> I say all of this, because if you are interested, it may be worth holding
> off a few days and looking into
> implementing something similar to the new repo instead of focusing on the
> PersistentProvenanceRepository
> (or updating both).
>
> Thanks
> -Mark
>
>
> [1] https://issues.apache.org/jira/browse/NIFI-3356

Re: Thoughts on NIFI-1847 Improve Provenance Space Utilization

Posted by Mark Payne <ma...@hotmail.com>.
Hey Joe,

Sorry - I don't think I saw this. I have actually been working on NIFI-3356 [1] for which
I hope to have a PR up in the next few days. I've been doing some long-running tests,
and I did find an issue yesterday so I've redeployed to some nodes to let it run over the
weekend. If all looks good I can perhaps have a PR in on Monday.

The Persistent Provenance Repository is quite old. At the time that it was written, the requirements
were simply to store data in a sequential fashion and make it available for a Reporting Task to iterate
over the events sequentially. There was no compression, and there was no indexing/searching. The
requirements clearly have changed over the years :) So I started working on a totally new implementation
and my testing shows that it is 2-3 times faster than the Persistent Provenance Repository while at the
same time providing faster query capabilities and immediate access to events (as opposed to after a 30-
second rollover period).

When I get a chance to get it posted, it would be great if you want to put it through the wringer as well.
I say all of this, because if you are interested, it may be worth holding off a few days and looking into
implementing something similar to the new repo instead of focusing on the PersistentProvenanceRepository
(or updating both).

Thanks
-Mark


[1] https://issues.apache.org/jira/browse/NIFI-3356



On Jan 27, 2017, at 9:42 AM, Joe Skora <js...@gmail.com> wrote:

I'm bumping this hoping for some feedback before I dive back into the
ticket.

Lacking any response for 30 days, I figure this either got overlooked due
to year-end or no one has an opinion to add to the discussion (which seems
unlikely).  ;-)