Posted to dev@nifi.apache.org by Daniel Cave <dc...@ssglimited.com> on 2016/11/28 14:25:50 UTC

MiNiFi C++ Data Provenance and Related Issues

This is a break off from the discussion on the MiNiFi C++ 0.1.0 Release
thread. I assume a hub and spoke NiFi/MiNiFi C++ architecture.

As discussed on that thread, I am concerned about the existing choice for
data provenance tracking and the implications it leads to as well as the
current data provenance requirements for MiNiFi C++. MiNiFi C++ must be
highly efficient and carry a minimal footprint in order to be able to
function at background and embedded levels. As such, performance and space
are priorities as are the ability to communicate to the NiFi hub the needed
information (i.e. there isn't space for a large unindexed data provenance
archive locally nor the processing ability to handle it).

The data provenance registry must be: 1) Fault tolerant, 2) able to be
easily purged, 3) fast to write, 4) easily accessed in session, 5) easily
accessed post session. The current choice (LevelDB) meets #3, but not the
other 4 requirements. LevelDB is prone to corruption in cases of
application failure during a write (fails #1). LevelDB has no indexing, and
if keys are by UUID then there is no way to efficiently sort by date or by
parent/child (fails #2, #4, #5). The choice for a provenance store should
answer as many of these as possible. For permanent stores, the choices
would be super lightweight databases or something fault resistant like LMDB.
I don't have any preference, just that it functionally addresses as many
criteria as possible and absolutely satisfies #1.
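
The indexing concern above can be pictured with a small sketch. In an ordered key-value store (LevelDB and LMDB both iterate keys in sorted order), the key layout decides which queries are cheap. The example is in Python for brevity and is not MiNiFi C++ code; `make_key`, `OrderedStore`, and the 8-byte big-endian timestamp prefix are assumptions made purely for illustration. With a time-prefixed key, a date-range read and a purge of old events each become one contiguous range operation, whereas a bare UUID key would force a full scan.

```python
import bisect
import struct
import uuid


def make_key(timestamp_ms: int, event_id: uuid.UUID) -> bytes:
    """Hypothetical key layout: 8-byte big-endian timestamp, then the UUID bytes."""
    return struct.pack(">Q", timestamp_ms) + event_id.bytes


class OrderedStore:
    """Toy stand-in for a store that keeps keys in bytewise sorted order."""

    def __init__(self) -> None:
        self._keys = []    # sorted list of keys
        self._values = {}  # key -> event payload

    def put(self, key: bytes, value: str) -> None:
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan(self, start_ms: int, end_ms: int):
        """Return values with start_ms <= timestamp < end_ms, in time order."""
        lo = bisect.bisect_left(self._keys, struct.pack(">Q", start_ms))
        hi = bisect.bisect_left(self._keys, struct.pack(">Q", end_ms))
        return [self._values[k] for k in self._keys[lo:hi]]

    def purge_before(self, cutoff_ms: int) -> int:
        """Drop every event older than cutoff_ms; return how many were removed."""
        hi = bisect.bisect_left(self._keys, struct.pack(">Q", cutoff_ms))
        for k in self._keys[:hi]:
            del self._values[k]
        del self._keys[:hi]
        return hi


store = OrderedStore()
for ts, name in [(3000, "SEND"), (1000, "CREATE"), (2000, "MODIFY")]:
    store.put(make_key(ts, uuid.uuid4()), name)

in_window = store.scan(1000, 3000)   # events in [1000, 3000)
removed = store.purge_before(2000)   # purge everything older than 2000
remaining = store.scan(0, 10_000)
```

The big-endian encoding matters here: it makes bytewise key order agree with numeric timestamp order, which is what lets the range bounds be found by binary search.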

A solution to #4 and #5 could be that the entire provenance tree inside
MiNiFi C++ rides with the flowfile and transfers to NiFi (including through
descendants). This I see as something of a requirement as well, as it is
the only efficient way to provide cradle to grave provenance through the
entire MiNiFi/NiFi system without the need for heavy post processing to
reconstruct the tree. While this adds slightly to the package being sent
between MiNiFi and NiFi, the cost is negligible compared to querying for it
afterward, especially where MiNiFi is embedded or on an IoT device.

Any thoughts?

--
View this message in context: http://apache-nifi-developer-list.39713.n7.nabble.com/MiNiFi-C-Data-Provenance-and-Related-Issues-tp14024.html
Sent from the Apache NiFi Developer List mailing list archive at Nabble.com.

Re: MiNiFi C++ Data Provenance and Related Issues

Posted by Daniel Cave <dc...@ssglimited.com>.

I will not be continuing this discussion.  I will leave it to others to pick
it up if they feel it's needed.




Re: MiNiFi C++ Data Provenance and Related Issues

Posted by Joe Witt <jo...@gmail.com>.

Regarding the scenario I am highlighting to show the problem of
in-band or in-line provenance exfil, what I was pointing out is not:

MiNiFi -> SystemA -> SystemB -> ... -> NiFi

but rather it is

MiNiFi -> SystemA
MiNiFi -> SystemB

Where the data being sent to A and B is happening in parallel (not
series) and is actually the same piece of data for instance.  This
would look like a "fan out" graph.

The current model that we've followed supports generation and
transmission of the provenance graph regardless of the nature of the
graph of how data flows within the system.  The current approach we
have for exfil of events is to leverage reporting tasks and this too
has worked well.  We can filter events in such tasks, we can manage
bandwidth used, etc. Can we rebase the discussion on the problems we're
trying to solve?  That will help us better discuss solutions to those
problems.  If I look at the original thread I see "#4 and #5" being
used to articulate what I think became the s2s alteration proposal.
But I don't quite follow what #4 or #5 mean, so can we restate/rephrase
the core problem?

Regarding ETL patterns and fundamental disagreement: It wasn't clear
to me what part of the discussion that was referring to and I'm not
familiar with the public papers you've released.  Would be happy to
read through to better understand your perspective. Can you share the
links here?

Regarding contributions and branching: I don't believe anyone has
pushed back on your idea to provide an alternative implementation of
the repositories.  Please do feel free to contribute your alternative
implementation.  It would be great to be able to have both available
and run side by side.  This sort of pluggability also promotes good
interface design to the repositories so it will be healthy regardless
of what the outcome is.
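
To illustrate the pluggability point, here is a minimal sketch in Python. It is not the MiNiFi C++ repository API; `ProvenanceRepository`, `InMemoryRepository`, `LogOnlyRepository`, and `record_send` are hypothetical names chosen for this example. The log-only variant mirrors the "log only" local repo idea raised elsewhere in this thread.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional


class ProvenanceRepository(ABC):
    """Hypothetical interface for this sketch; not the MiNiFi C++ API."""

    @abstractmethod
    def put(self, event_id: str, event: dict) -> None:
        """Record one provenance event."""

    @abstractmethod
    def get(self, event_id: str) -> Optional[dict]:
        """Return the event, or None if it is missing or lookups are unsupported."""


class InMemoryRepository(ProvenanceRepository):
    """Keyed store standing in for an embedded database backend."""

    def __init__(self) -> None:
        self._events: Dict[str, dict] = {}

    def put(self, event_id: str, event: dict) -> None:
        self._events[event_id] = event

    def get(self, event_id: str) -> Optional[dict]:
        return self._events.get(event_id)


class LogOnlyRepository(ProvenanceRepository):
    """Append-only variant: keeps one line per event and offers no lookup."""

    def __init__(self) -> None:
        self.lines: List[str] = []

    def put(self, event_id: str, event: dict) -> None:
        self.lines.append(f"{event_id} {event['type']}")

    def get(self, event_id: str) -> Optional[dict]:
        return None


def record_send(repo: ProvenanceRepository, event_id: str) -> None:
    """Caller code depends only on the interface, so backends can be swapped."""
    repo.put(event_id, {"type": "SEND"})


keyed, log_only = InMemoryRepository(), LogOnlyRepository()
for repo in (keyed, log_only):
    record_send(repo, "evt-1")
```

Because `record_send` only sees the interface, two implementations can run side by side and be compared without touching the calling code.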

Regarding issues getting contributions into NiFi: Is there a specific
engagement you've found has been left hanging?  I see a couple of
JIRAs and contribs you were involved in that culminated in merged
commits and one that appears to have hit some snags and has not
progressed.  Is that the one you're talking about or are there other
challenges?  Let's take these cases and work through them.

Thanks
Joe


On Tue, Nov 29, 2016 at 10:35 AM, Daniel Cave <dc...@ssglimited.com> wrote:
> "Yes but there can be other hubs too and in parallel."
> [Daniel]For MiNiFi C++ -> SystemA -> SystemB -> ... -> NiFi, if you don't
> want provenance to travel then I don't see it as an issue since the outgoing
> message would be identical to what you have now.  If you feel it's going to
> be extremely confusing then I could make it a new clone of the S2S MiNiFi
> C++ processor, but I don't see a point to just hide a toggle.  On the NiFi
> side for this case you would use the normal S2S intake methods you use now.
> No change.  Also, if you're going from MiNiFi C++ -> SystemA there is no
> change.
> For MiNiFi C++ -> MiNiFi C++ ->....-> NiFi, if you want provenance travel
> then yes you are locked into using n*(MiNiFi C++) -> NiFi with the
> provenance toggled on and using the new S2S receiving processors in MiNiFi
> C++/NiFi (it has to be a new one to avoid backwards compatibility issues)
> that can handle provenance.  Again, I don't see this as an issue either
> since you are clearly wanting this functionality if you're doing this.
> Am I missing something in my logic flow that you are seeing that I need to
> account for?
>
> "You've mentioned this a couple times now. "
> [Daniel] Agreed and this is how this discussion is meant to be taken.
>
> "I'm not quite sure I understand so please elaborate if my
> comments don't apply."
> [Daniel]It has to do with when and how it's consumed.  On current path Atlas
> won't answer the issues, but as you said there are others and I have my own
> in progress as well.  I fundamentally disagree with the current
> sink-retrieve-sink ETL paradigm (as you've seen from my public papers, there
> are others not public yet as well) as it is a complete waste of time and
> resources at this point.  In all my work, data is handled as available (near
> real-time) rather than waiting for some ETL processes to run at some
> arbitrary point in the future.  By doing this you avoid unnecessary traffic,
> storage, processing, maintenance, and design all while improving data
> availability.  More specifically to this discussion, the issue comes down to
> access from the point of origin.  In an embedded or background instance of
> MiNiFi C++, bidirectional followup calls for provenance only are not always
> going to be available.  Additionally, where they are available they are not
> going to be current and hence are fairly useless for security applications.
> Think of trying this on your laptop, IoT devices, or on financial
> transactions.  If I find out 12-36hrs later when you reconnect or I can send
> someone to the field to retrieve it or the ETL processes run that there was
> an issue, it doesn't do me any good.  As Randy mentioned, you can recombine
> all this later, however it is a very resource consuming process.  There is
> no reason not to have it available when the data is available since it's
> just a matter of allowing for its transfer in line with the data.  NiFi is
> not assuming responsibility for anything it doesn't already, this just
> extends its reach to the full NiFi/MiNiFi instance so there should not be
> an ownership concern.  This requires an extremely minor update in NiFi, but
> is for a fundamental need in MiNiFi C++.
>
> "Ok so I think what you're saying is"
> [Daniel] Right, and since you can just disable it if you don't need it there
> is no performance or bandwidth hit unless you enable it.
>
> "It is really important to propose and advocate"
> [Daniel] I don't see this as a model change, as per my previous questions
> MiNiFi C++ seems to not yet have a solid model as the time and effort is
> mainly being put into MiNiFi Java.  Since I have very specific ideas
> around MiNiFi C++ (and have discussed them with you last year and others at
> HW when MiNiFi was only going to be in C) I have not seen this as a radical
> departure but an elaboration on what we had already discussed.  If you or
> the community wants to go a different path, I have no issue branching and
> going a separate way with these and the LevelDB changes rather than
> introducing these changes into the current path.  Being OpenSource there is
> no right answer, so I'm certainly open to any suggestions, but I think
> you'll find what I'm proposing here is going to be important when you get to
> actual implementations of it and it's easier to change now than when you're
> locked in later, especially given my issues getting our contributions into
> NiFi.  As stated above, I don't see how this affects any other
> implementations or use cases of MiNiFi C++/NiFi as proposed.
>

Re: MiNiFi C++ Data Provenance and Related Issues

Posted by Daniel Cave <dc...@ssglimited.com>.

"Yes but there can be other hubs too and in parallel."
[Daniel]For MiNiFi C++ -> SystemA -> SystemB -> ... -> NiFi, if you don't
want provenance to travel then I don't see it as an issue since the outgoing
message would be identical to what you have now. If you feel it's going to
be extremely confusing then I could make it a new clone of the S2S MiNiFi
C++ processor, but I don't see a point to just hide a toggle. On the NiFi
side for this case you would use the normal S2S intake methods you use now.
No change. Also, if you're going from MiNiFi C++ -> SystemA there is no
change.
For MiNiFi C++ -> MiNiFi C++ ->....-> NiFi, if you want provenance travel
then yes you are locked into using n*(MiNiFi C++) -> NiFi with the
provenance toggled on and using the new S2S receiving processors in MiNiFi
C++/NiFi (it has to be a new one to avoid backwards compatibility issues)
that can handle provenance. Again, I don't see this as an issue either
since you are clearly wanting this functionality if you're doing this.
Am I missing something in my logic flow that you are seeing that I need to
account for?

"You've mentioned this a couple times now. "
[Daniel] Agreed and this is how this discussion is meant to be taken.

"I'm not quite sure I understand so please elaborate if my
comments don't apply."
[Daniel]It has to do with when and how it's consumed. On current path Atlas
won't answer the issues, but as you said there are others and I have my own
in progress as well. I fundamentally disagree with the current
sink-retrieve-sink ETL paradigm (as you've seen from my public papers, there
are others not public yet as well) as it is a complete waste of time and
resources at this point. In all my work, data is handled as available (near
real-time) rather than waiting for some ETL processes to run at some
arbitrary point in the future. By doing this you avoid unnecessary traffic,
storage, processing, maintenance, and design all while improving data
availability. More specifically to this discussion, the issue comes down to
access from the point of origin. In an embedded or background instance of
MiNiFi C++, bidirectional followup calls for provenance only are not always
going to be available. Additionally, where they are available they are not
going to be current and hence are fairly useless for security applications.
Think of trying this on your laptop, IoT devices, or on financial
transactions. If I find out 12-36hrs later when you reconnect or I can send
someone to the field to retrieve it or the ETL processes run that there was
an issue, it doesn't do me any good. As Randy mentioned, you can recombine
all this later, however it is a very resource consuming process. There is
no reason not to have it available when the data is available since it's
just a matter of allowing for its transfer in line with the data. NiFi is
not assuming responsibility for anything it doesn't already, this just
extends its reach to the full NiFi/MiNiFi instance so there should not be
an ownership concern. This requires an extremely minor update in NiFi, but
is for a fundamental need in MiNiFi C++.

"Ok so I think what you're saying is"
[Daniel] Right, and since you can just disable it if you don't need it there
is no performance or bandwidth hit unless you enable it.

"It is really important to propose and advocate"
[Daniel] I don't see this as a model change, as per my previous questions
MiNiFi C++ seems to not yet have a solid model as the time and effort is
mainly being put into MiNiFi Java. Since I have very specific ideas
around MiNiFi C++ (and have discussed them with you last year and others at
HW when MiNiFi was only going to be in C) I have not seen this as a radical
departure but an elaboration on what we had already discussed. If you or
the community wants to go a different path, I have no issue branching and
going a separate way with these and the LevelDB changes rather than
introducing these changes into the current path. Being OpenSource there is
no right answer, so I'm certainly open to any suggestions, but I think
you'll find what I'm proposing here is going to be important when you get to
actual implementations of it and it's easier to change now than when you're
locked in later, especially given my issues getting our contributions into
NiFi. As stated above, I don't see how this affects any other
implementations or use cases of MiNiFi C++/NiFi as proposed.


Re: MiNiFi C++ Data Provenance and Related Issues

Posted by Joe Witt <jo...@gmail.com>.

"I look at MiNiFi C++ as a direct spoke of a NiFi hub and as such it
really can be treated as one
"NiFi" instance."

[joe]Yes but there can be other hubs too and in parallel.  For
example, it is quite common for an edge collection location to write
events to a local message bus for local usage while at the same time
send the feed to a central NiFi instance.  We should avoid introducing
a single exfil point limitation especially when the primary reason
would be to simplify the concept of provenance.  The whole point of
provenance is to capture and embrace what really happens in end to end
flows.

"Additionally, since MiNiFi C++ is a complete rewrite, as has been
previously discussed, making requirement variations from NiFi or
MiNiFi Java is acceptable, in my opinion."

[joe]You've mentioned this a couple times now.  I don't think anyone
is making a case here on the basis that we don't want to change it
because we want to avoid requirement variations.  The discussion is
purely on merit of the ideas.  We should always be open to requirement
changes.

"As such, there is no value in having separate provenance for MiNiFi
C++ and NiFi since it is one cradle to
grave path (that happens to use both)."

[joe]I'm not quite sure I understand so please elaborate if my
comments don't apply.  There is no such thing as 'separate provenance'
really.  The bottom line is that capturing facts about what happens to
a piece of data at a point in its lifecycle happens all over the
end-to-end chain.  These things ultimately when wired together
conceptually form a representation of the graph of how data flowed.
Ultimately a single instance of MiNiFi only knows about the events
that happened on its watch.  Same is true for an instance of NiFi.  In
the end, there are various places where provenance gets generated and
then you get to the scenario of "how do I see the end to end chain".
This requires something even beyond any NiFi itself.  Apache Atlas
(incubating) might be an answer but there may be others.  This is why
we have facilities like reporting tasks to send provenance events to
some place.  This is often just sending to HDFS so all events are in
one place for retention and analysis.  The concept of provenance is
bigger than NiFi or MiNiFi to be clear.  And, at this point we do not
have any plans or designs for having a NiFi cluster take ownership of
other systems' provenance events (even if those other systems are NiFi
or MiNiFi agents).  We can certainly act as a relay point for such
information but to index them and properly represent them in the
context of who owns them is another matter.  Frankly, if you get into
the deep weeds of provenance you can get into some fun discussions
about data identity.  When I am systemX and have object Y and send it
to systemZ did I send object X or did I send some object X2? If you
think I sent X then what happens to X if it is altered on the other
system?  We can't now both be talking about X but talking about
different versions.  Etc.

"I personally don't see this as an attribute as currently represented
in the flowfiles since that would not be an efficient structure to
handle or maintain through MiNiFi C++ pathing.  This requires the
provenance tree related to that flowfile to be sent (which should be
small-ish in a MiNiFi C++ instance). My design for it was that it
would be a separate data point on the flowfile package using a simple,
extremely lightweight, and easy to manipulate structure.  Truthfully,
it doesn't even have to be resident all through the MiNiFi C++ flow if
a viable repo replaces LevelDB and my preference is to add it in at
the S2S processor.  The important thing is that it can be sent with
the flowfile through S2S and then added to the main NiFi provenance
repo so as to provide a continuous chain.  This would be easy to
toggle through a single checkbox added to a MiNiFi C++ S2S variant so
that if you choose not to integrate as provenance isn't important to
you, you could."

[joe]Ok so I think what you're saying is that you'd have a sort of
hybrid out of band model where it is brought in-band during site to
site transfers. I see how that helps and that is certainly fine as a
transport.  I'm not sure how expensive it would be to collect the
provenance trail during transfer but of course the provenance
repository for MiNiFi could be optimized for that.  Also, we still
have to consider that MiNiFi isn't limited to just being tethered to a
single NiFi instance so we'd need to be clear that there could be
additional provenance we're not getting via this path and if it came
in via other paths we'd have to have a way to resolve this.

"Since in this model, MiNiFi C++ plus provenance only integrates with NiFi
hubs, there is no reason to concern with outside compatibility for this
specific S2S processor mechanism."

[joe]It is really important to propose and advocate a model for
provenance that honors the existing plan and model for MiNiFi and
NiFi.  Or, if we should discuss altering that model we should do that
on a separate thread and we should also have good reasons to limit it
from what is planned today and ideally for more reasons than just
making provenance more clear.  It was definitely built with the
understanding of edge use cases requiring more than a single exfil
path.

Thanks
Joe

On Tue, Nov 29, 2016 at 8:02 AM, Daniel Cave <dc...@ssglimited.com> wrote:
> As to Joe and Aldrin's concerns, I feel a bit more detail of what I had in
> mind might clear up some of the concerns and vagaries (all valid) that you
> mentioned.
>
> As Aldrin mentioned, to me provenance is not about metadata needed for
> routing.  I don't doubt there are use cases for that, as Randy mentioned,
> however it was not the concern I had in mind that I am looking to address
> with this discussion.  If the community wants to add more functionality from
> a metadata standpoint as well, we can certainly add that.
>
> As for Joe's examples and concerns for in-band, I look at MiNiFi C++ as a
> direct spoke of a NiFi hub and as such it really can be treated as one
> "NiFi" instance.  Additionally, since MiNiFi C++ is a complete rewrite, as
> has been previously discussed, making requirement variations from NiFi or
> MiNiFi Java is acceptable, in my opinion.  As such, there is no value in
> having separate provenance for MiNiFi C++ and NiFi since it is one cradle to
> grave path (that happens to use both).  As for bandwidth concerns, this is
> actually exactly one of the issues that concerns me as later calling to the
> MiNiFi C++ enabled device merely to sort and retrieve provenance (which
> would be a heavy operation as currently constructed) is not realistic.  One
> of the biggest selling points of NiFi is its full data provenance ability,
> and my goal is merely to extend it through the full "flow".  I personally
> don't see this as an attribute as currently represented in the flowfiles
> since that would not be an efficient structure to handle or maintain through
> MiNiFi C++ pathing.  This requires the provenance tree related to that
> flowfile to be sent (which should be small-ish in a MiNiFi C++ instance).
> My design for it was that it would be a separate data point on the flowfile
> package using a simple, extremely lightweight, and easy to manipulate
> structure.  Truthfully, it doesn't even have to be resident all through the
> MiNiFi C++ flow if a viable repo replaces LevelDB and my preference is to
> add it in at the S2S processor.  The important thing is that it can be sent
> with the flowfile through S2S and then added to the main NiFi provenance
> repo so as to provide a continuous chain.  This would be easy to toggle
> through a single checkbox added to a MiNiFi C++ S2S variant so that if you
> choose not to integrate as provenance isn't important to you, you could.
> Since in this model, MiNiFi C++ plus provenance only integrates with NiFi
> hubs, there is no reason to concern with outside compatibility for this
> specific S2S processor mechanism.
>
> I see the ability to allow for "in-band" communication at the S2S-S2S point
> as a requirement for some use cases.
>

Re: MiNiFi C++ Data Provenance and Related Issues

Posted by Daniel Cave <dc...@ssglimited.com>.

As to Joe and Aldrin's concerns, I feel a bit more detail of what I had in
mind might clear up some of the concerns and vagaries (all valid) that you
mentioned.

As Aldrin mentioned, to me provenance is not about metadata needed for
routing. I don't doubt there are use cases for that, as Randy mentioned,
however it was not the concern I had in mind that I am looking to address
with this discussion. If the community wants to add more functionality from
a metadata standpoint as well, we can certainly add that.

As for Joe's examples and concerns for in-band, I look at MiNiFi C++ as a
direct spoke of a NiFi hub and as such it really can be treated as one
"NiFi" instance. Additionally, since MiNiFi C++ is a complete rewrite, as
has been previously discussed, making requirement variations from NiFi or
MiNiFi Java is acceptable, in my opinion. As such, there is no value in
having separate provenance for MiNiFi C++ and NiFi since it is one cradle to
grave path (that happens to use both). As for bandwidth concerns, this is
actually exactly one of the issues that concerns me as later calling to the
MiNiFi C++ enabled device merely to sort and retrieve provenance (which
would be a heavy operation as currently constructed) is not realistic. One
of the biggest selling points of NiFi is its full data provenance ability,
and my goal is merely to extend it through the full "flow". I personally
don't see this as an attribute as currently represented in the flowfiles
since that would not be an efficient structure to handle or maintain through
MiNiFi C++ pathing. This requires the provenance tree related to that
flowfile to be sent (which should be small-ish in a MiNiFi C++ instance).
My design for it was that it would be a separate data point on the flowfile
package using a simple, extremely lightweight, and easy to manipulate
structure. Truthfully, it doesn't even have to be resident all through the
MiNiFi C++ flow if a viable repo replaces LevelDB and my preference is to
add it in at the S2S processor. The important thing is that it can be sent
with the flowfile through S2S and then added to the main NiFi provenance
repo so as to provide a continuous chain. This would be easy to toggle
through a single checkbox added to a MiNiFi C++ S2S variant so that if you
choose not to integrate as provenance isn't important to you, you could.
Since in this model, MiNiFi C++ plus provenance only integrates with NiFi
hubs, there is no reason to concern with outside compatibility for this
specific S2S processor mechanism.
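
As a rough illustration of the "separate data point on the flowfile package" idea described above, here is a sketch in Python. Nothing in it reflects the actual site-to-site wire format; `FlowFilePackage`, `to_wire`, the `include_provenance` toggle, and the component names are all hypothetical, and JSON is used only to keep the example readable.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FlowFilePackage:
    attributes: Dict[str, str]
    content: bytes
    # Provenance is kept apart from the attributes, as the proposal suggests.
    provenance: List[dict] = field(default_factory=list)

    def to_wire(self, include_provenance: bool) -> bytes:
        """Serialize for transfer; the toggle decides whether provenance rides along."""
        envelope = {
            "attributes": self.attributes,
            "content": self.content.hex(),
        }
        if include_provenance:
            envelope["provenance"] = self.provenance
        return json.dumps(envelope, sort_keys=True).encode("utf-8")


package = FlowFilePackage(
    attributes={"filename": "reading.csv"},
    content=b"42\n",
    provenance=[
        {"event": "CREATE", "component": "sensor-reader"},  # made-up component
        {"event": "SEND", "component": "s2s-sender"},       # made-up component
    ],
)

without = json.loads(package.to_wire(include_provenance=False))
with_prov = json.loads(package.to_wire(include_provenance=True))
```

With the toggle off, the serialized package carries no provenance field at all, so the only cost of the feature is paid by flows that enable it.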

I see the ability to allow for "in-band" communication at the S2S-S2S point
as a requirement for some use cases.


Re: MiNiFi C++ Data Provenance and Related Issues

Posted by Joe Witt <jo...@gmail.com>.

Might make sense to split these discussions out.  Regarding provenance...

Data provenance is about tracking the origin and attribution of data
and the model that we've got allows that to occur despite the
fact that we're often handling directed graphs of flows involving
numerous systems.

Transport models:
- "In-band" embedded with the data the provenance is about.

This is the original model we considered several years ago and is how
things are commonly done in other systems with some aspects of
provenance.  The problem with this approach is how do you resolve the
provenance chain when you deliver from MiNiFi-A to System1 and System2
in parallel?  This is a simple case.  But what does that provenance
chain look like for the object sent to System1 and what does it look
like for the System2 provenance?  Of course one object won't know
about the other object.  So what does that provenance tell us?

- "Out of band" A separate feed of event data which is 'about the
data' but is not the data itself.  This is how NiFi works today.

Now, as Randy points out there are cases where having the provenance
data in-band would help with routing cases.  I'd make the case that
this is not about data provenance as we're generally talking about it.
That is just contextual metadata and is why both MiNiFi and NiFi
support and advocate the flowfile construct which has metadata and
content - just like HTTP does.  If you need information about where
data came from then it should be embedded in the flowfile metadata.
If there are common details that are valuable and can best be relayed
by the last component that touched an object let's discuss that.

-- Now, we could arguably support both models whereby we allow you to
optionally send in-band provenance but we must make it clear that the
provenance chain of an in-band message only reflects the linear chain
of provenance that is known by that object and does NOT reflect the
full graph.  However, I'd also have significant concerns about how to
efficiently store this data.  FlowFile attributes are today held in
memory.  So, alternatively we could make a new FlowFile construct for
this such that the flowfile chain (which could be quite large) is in
some form of non-memory-loaded content.  But this would also be a
pretty huge change.
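
The fan-out limitation can be made concrete with a small sketch. It is written in Python purely for illustration; the event tuples and the helper names `linear_chain` and `full_graph` are assumptions, not NiFi or MiNiFi APIs. An object delivered to System1 carries only its own ancestry in-band, so it cannot show that a sibling copy went to System2, while the out-of-band event feed can.

```python
from collections import defaultdict

# Hypothetical out-of-band event feed: (flowfile_id, parent_id, event).
EVENTS = [
    ("ff-1", None, "CREATE on MiNiFi-A"),
    ("ff-1a", "ff-1", "CLONE on MiNiFi-A"),
    ("ff-1b", "ff-1", "CLONE on MiNiFi-A"),
    ("ff-1a", "ff-1", "SEND to System1"),
    ("ff-1b", "ff-1", "SEND to System2"),
]


def linear_chain(flowfile_id, events):
    """What an in-band payload could carry: events for this object and its ancestors."""
    parents = {fid: parent for fid, parent, _ in events}
    lineage = set()
    current = flowfile_id
    while current is not None:
        lineage.add(current)
        current = parents.get(current)
    return [event for fid, _, event in events if fid in lineage]


def full_graph(events):
    """What the out-of-band feed supports: parent -> children edges for every object."""
    children = defaultdict(set)
    for fid, parent, _ in events:
        if parent is not None:
            children[parent].add(fid)
    return {parent: sorted(kids) for parent, kids in children.items()}


chain_for_system1 = linear_chain("ff-1a", EVENTS)
graph = full_graph(EVENTS)
```

Here `chain_for_system1` contains the create, clone, and send-to-System1 events but nothing about System2, whereas `graph` shows both branches under `ff-1`.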

It isn't clear to me that introducing an in-band model is a good path for us.

Thanks
Joe

On Tue, Nov 29, 2016 at 7:38 AM, Aldrin Piri <al...@gmail.com> wrote:
> Hey folks,
>
> Good commentary and I would encourage you to create associated tickets
> where applicable such that we can track such ideas and their efforts from a
> community project level.
>
> Concerning building, Randy, if you could provide more details on your OS X
> build problems, this would be greatly helpful.  I know a number of
> contributors have OS X machines and seem to have reasonable success so any
> details on your environment would be helpful in trying to track down the
> problem.  Certainly understand the concerns over wanting things to work on
> a wide variety of systems as stock.  This was voiced in part by
> https://issues.apache.org/jira/browse/MINIFI-118.  We certainly have
> options here depending on what the target environment will support, such as
> more static linking which may be acceptable for larger systems running more
> enterprise level OSes.
>
> LMDB certainly seems like it could be an interesting candidate doing some
> initial glances over it and its licensing (OpenLDAP Public License) seems
> like a variant of a 3-Clause BSD, so it should be okay to utilize from an
> ALv2 concern.  Definitely worth pursuing, and as mentioned in the prior
> thread, there are no hard and fast commitments to a particular technology
> but rather, especially in its early stages, to establish the interfaces and
> framework and provide a working implementation such that there is a place
> to start.
>
> Concerning the idea of integrating provenance with FlowFiles, I can
> certainly see the value in bundling it with the FlowFiles from the
> standpoint of minimizing footprint and resource utilization on
> device/source.  One important item to also be mindful of that has come up
> with a number of folks looking to tackle management of dataflow is also
> that of limited communications and/or prohibitive cost when looking at
> large deployments of such agents.  A separate provenance repository allows
> the sending of provenance events out of band when convenient or explicitly
> requested/needed.  In another aspect on that idea, including provenance in
> each FlowFile could exhaust disk more quickly in the event that a means of
> transmission is not available.  In this case, the discrete storage
> mechanisms could allow the purging and removal of provenance without the
> cost of losing data that might otherwise be able to continue being
> buffered.  That's not to say this use case is any more valid or important,
> but another point of consideration in the design choices made for
> data/provenance storage and transmission.
>
> I think the key item of import for the effort is that there are many and
> widely varying use cases and situations for how this particular
> implementation needs to be built, deployed, and utilized but makes for some
> interesting discussions and design processes that should make for a
> rewarding challenge.
>
> Thanks for the input!
>
> On Tue, Nov 29, 2016 at 4:56 AM, Daniel Cave <dc...@ssglimited.com> wrote:
>
>> Since MiNiFi C++ requires completely new code (unlike the Java version), I
>> don't see any reason we can't deviate where it makes requirement sense.  If
>> we move the provenance onto the flowfile, then your build issues and my
>> stability issues can be simplified because the local provenance repo
>> becomes log only, and the local repo could be handled by a standard
>> logging mechanism instead.  As you stated, installing additional open source
>> libraries in production environments is a near non-starter.
>>
>> If no one disagrees with the approach or really desperately wants to
>> take it on, I'm ok with taking the action item to start working on a good
>> transport structure and looking at making the changes needed for it to
>> work through S2S. This also requires making changes to NiFi to allow for
>> the provenance to be added to the main NiFi repo; this is something I was
>> planning on doing anyway as part of a new enterprise dp/dg engine based
>> on NiFi I'm working on.
>>
>> We need someone to test a reliable replacement for LevelDB (be that LMDB,
>> which I believe comes standard in RHEL distributions, or whatever) and
>> integrate it or convert the local repo to log only.  I'll get to it
>> eventually after I make the other changes if no one else does.

Re: MiNiFi C++ Data Provenance and Related Issues

Posted by Aldrin Piri <al...@gmail.com>.

Hey folks,

Good commentary and I would encourage you to create associated tickets
where applicable such that we can track such ideas and their efforts from a
community project level.

Concerning building, Randy, if you could provide more details on your OS X
build problems, this would be greatly helpful.  I know a number of
contributors have OS X machines and seem to have reasonable success so any
details on your environment would be helpful in trying to track down the
problem.  Certainly understand the concerns over wanting things to work on
a wide variety of systems as stock.  This was voiced in part by
https://issues.apache.org/jira/browse/MINIFI-118.  We certainly have
options here depending on what the target environment will support, such as
more static linking which may be acceptable for larger systems running more
enterprise level OSes.

From some initial glances, LMDB certainly seems like it could be an
interesting candidate.  Its license (OpenLDAP Public License) seems like a
variant of the 3-Clause BSD license, so it should be okay to use from an
ALv2 standpoint.  Definitely worth pursuing and, as mentioned in the prior
thread, there are no hard and fast commitments to a particular technology.
The aim, especially in these early stages, is to establish the interfaces
and framework and provide a working implementation such that there is a
place to start.

Concerning the idea of integrating provenance with FlowFiles, I can
certainly see the value in bundling it with the FlowFiles from the
standpoint of minimizing footprint and resource utilization on the
device/source.  One important item to be mindful of, which has come up with
a number of folks looking to tackle management of dataflow, is limited
communications and/or prohibitive cost in large deployments of such
agents.  A separate provenance repository allows provenance events to be
sent out of band, when convenient or when explicitly requested/needed.
On a related note, including provenance in each FlowFile could exhaust disk
more quickly when no means of transmission is available.  In that case,
discrete storage mechanisms would allow provenance to be purged without
losing data that could otherwise continue to be buffered.  That's not to
say this use case is any more valid or important, but it is another point
to consider in the design choices made for data/provenance storage and
transmission.
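
To make the separate-repository trade-off concrete, here is a rough sketch
of an agent that keeps provenance events and buffered FlowFiles in
discrete stores, so provenance can be purged or drained for out-of-band
sending without touching buffered data.  All names here (EdgeAgent,
ProvenanceEvent, BufferedFlowFile, purgeProvenance, drainProvenance) are
hypothetical and in-memory only; this is not the MiNiFi C++ API or its
repository implementation.

```cpp
#include <cstddef>
#include <deque>
#include <string>
#include <utility>

// Hypothetical types for illustration only; not MiNiFi C++ classes.
struct ProvenanceEvent {
  std::string flowfile_uuid;
  std::string event_type;
};

struct BufferedFlowFile {
  std::string uuid;
  std::string content;
};

class EdgeAgent {
 public:
  // Buffer the content and record a CREATE event in the separate store.
  void ingest(std::string uuid, std::string content) {
    provenance_.push_back({uuid, "CREATE"});
    flowfiles_.push_back({std::move(uuid), std::move(content)});
  }

  // Drop the oldest provenance events until at most `keep` remain.
  // Buffered FlowFile content is left untouched.
  std::size_t purgeProvenance(std::size_t keep) {
    std::size_t removed = 0;
    while (provenance_.size() > keep) {
      provenance_.pop_front();
      ++removed;
    }
    return removed;
  }

  // Hand all pending provenance events to the caller, e.g. to send them
  // out of band when a link is available, and clear the local store.
  std::deque<ProvenanceEvent> drainProvenance() {
    std::deque<ProvenanceEvent> out;
    out.swap(provenance_);
    return out;
  }

  std::size_t provenanceCount() const { return provenance_.size(); }
  std::size_t flowfileCount() const { return flowfiles_.size(); }

 private:
  std::deque<ProvenanceEvent> provenance_;  // separate provenance store
  std::deque<BufferedFlowFile> flowfiles_;  // buffered data awaiting transfer
};
```

With provenance bundled into each FlowFile instead, freeing the space used
by provenance would mean rewriting or dropping the FlowFiles themselves.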

I think the key point for this effort is that there are many widely
varying use cases and situations governing how this particular
implementation needs to be built, deployed, and utilized.  That makes for
some interesting discussions and design processes, and should make for a
rewarding challenge.

Thanks for the input!

On Tue, Nov 29, 2016 at 4:56 AM, Daniel Cave <dc...@ssglimited.com> wrote:

> Since MiNiFi C++ requires completely new code (unlike the Java version),
> I don't see any reason we cant deviate where it makes requirement sense.
> If we move the provenance onto the flowfile, then your build issues and my
> stability issues can be simplified because the local provenance repo
> becomes log only and where the local repo could be handled by a standard
> logging mechanism instead.  As you stated, installing additional open
> source libraries in production environments is a near non-starter.
>
> If no one disagrees with the approach or really desperately wants to take
> it on, I'm ok with taking the action item to start working on a good
> transport structure and looking at making the changes needed for it to
> work through S2S. This also requires making changes to NiFi to allow for
> the provenance to be added to the main NiFi repo; this is something I was
> planning on doing anyway as part of a new enterprise dp/dg engine based on
> NiFi I'm working on.
>
> We need someone to test a reliable replacement for LevelDB (be that LMDB,
> which I believe comes standard in RHEL distributions, or whatever) and
> integrate it or convert the local repo to log only.  I'll get to it
> eventually after I make the other changes if no one else does.

Re: MiNiFi C++ Data Provenance and Related Issues

Posted by Daniel Cave <dc...@ssglimited.com>.

Since MiNiFi C++ requires completely new code (unlike the Java version), I
don't see any reason we can't deviate where the requirements make it
sensible.  If we move the provenance onto the flowfile, then your build
issues and my stability issues can be simplified, because the local
provenance repo becomes log only and could be handled by a standard logging
mechanism instead.  As you stated, installing additional open source
libraries in production environments is a near non-starter.

If no one disagrees with the approach or really desperately wants to take it
on, I'm ok with taking the action item to start working on a good transport
structure and looking at making the changes needed for it to work through
S2S. This also requires making changes to NiFi to allow for the provenance
to be added to the main NiFi repo; this is something I was planning on doing
anyway as part of a new enterprise dp/dg engine based on NiFi I'm working
on.

We need someone to test a reliable replacement for LevelDB (be that LMDB,
which I believe comes standard in RHEL distributions, or whatever) and
integrate it or convert the local repo to log only.  I'll get to it
eventually after I make the other changes if no one else does.



--
View this message in context: http://apache-nifi-developer-list.39713.n7.nabble.com/MiNiFi-C-Data-Provenance-and-Related-Issues-tp14024p14040.html
Sent from the Apache NiFi Developer List mailing list archive at Nabble.com.