Posted to dev@nifi.apache.org by Andre <an...@fucs.org> on 2017/02/21 23:10:53 UTC

[DISCUSS] Scale-out/Object Storage - taming the diversity of processors

dev,

I was having a chat with Pierre around PR#379 and we thought it would be
worth sharing this with the wider group:


I recently noticed that we merged a number of PRs around scale-out/cloud-based
object stores into master.

Would it make sense to consider adopting a pattern where Put/Get/ListHDFS
are used in tandem with implementations of the hadoop.filesystem interfaces
instead of creating new processors, except where a particular
deficiency/incompatibility in a hadoop.filesystem implementation exists?

Candidates for removal / non-merge would be:

- Alluxio (PR#379)
- WASB (PR#626)
- Azure* (PR#399)
- *GCP (recently merged as PR#1482)
- *S3 (although these are already in the codebase, so they would have to be deprecated)

The pattern would be pretty much the same as the one documented and
successfully deployed here:

https://community.hortonworks.com/articles/71916/connecting-to-azure-data-lake-from-a-nifi-dataflow.html

Which means that in the case of Alluxio, one would use the properties
documented here:

https://www.alluxio.com/docs/community/1.3/en/Running-Hadoop-MapReduce-on-Alluxio.html

While with Google Cloud Storage we would use the properties documented here:

https://cloud.google.com/hadoop/google-cloud-storage-connector
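
To make the pattern concrete: once the provider's client jar is on the
processors' classpath, the existing *HDFS processors only need configuration
that maps the URI scheme to the connector class. Below is a minimal Java
sketch of what the Hadoop layer resolves under the hood - the property name
and class are taken from the linked Alluxio docs and the host/port/path are
placeholders, so treat it as an illustration rather than a tested setup:

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HcfsExample {
        public static void main(String[] args) throws Exception {
            // The same Hadoop FileSystem API the *HDFS processors already use;
            // only the configuration decides which backend is reached.
            Configuration conf = new Configuration();

            // Property name per the Alluxio docs linked above; for GCS the
            // equivalent would be fs.gs.impl pointing at the GCS connector class.
            conf.set("fs.alluxio.impl", "alluxio.hadoop.FileSystem");

            // Placeholder master address; any URI whose scheme resolves to an
            // HCFS implementation on the classpath works the same way.
            FileSystem fs = FileSystem.get(URI.create("alluxio://alluxio-master:19998/"), conf);
            fs.copyFromLocalFile(new Path("/tmp/example.txt"), new Path("/ingest/example.txt"));
            fs.close();
        }
    }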

I realise that provider-specific processors can expose properties that are
particular to one filesystem; however, the same need presumably applies to
Hadoop users, so is it not reasonable to expect the Hadoop-compatible
implementations to have ways of exposing those properties as well?

Where such properties are exposed, we could simply adjust the *HDFS
processors to accept dynamic properties and pass them to the underlying
module, providing a way to reach provider-specific settings of the
underlying storage platform.
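
As a rough illustration of that adjustment - not current behaviour, just a
sketch of the glue that would be needed (the helper class is hypothetical;
the *HDFS processors would call something like this while building their
Hadoop Configuration):

    import java.util.Map;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nifi.components.PropertyDescriptor;
    import org.apache.nifi.processor.ProcessContext;

    public final class DynamicHadoopProps {

        private DynamicHadoopProps() {}

        // Copy user-added (dynamic) properties straight into the Hadoop
        // Configuration before the FileSystem is opened, so provider-specific
        // keys (alluxio.*, fs.gs.*, fs.azure.*, ...) can be set per processor.
        public static void apply(final ProcessContext context, final Configuration config) {
            for (final Map.Entry<PropertyDescriptor, String> entry : context.getProperties().entrySet()) {
                if (entry.getKey().isDynamic()) {
                    config.set(entry.getKey().getName(), entry.getValue());
                }
            }
        }
    }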

Any opinion would be welcome.

PS - sent again with the proper subject label.

Re: [DISCUSS] Scale-out/Object Storage - taming the diversity of processors

Posted by James Wing <jv...@gmail.com>.
Andre,

I definitely believe it is worth documenting this capability as a path to
storage providers that have no NiFi processors.  But I'm not sold on
dropping the processors we have now.  In addition to the great points made
by Andrew and Matt:

* Usability - Specific storage processors provide an intuitive path for
the user thinking "I want to write this to S3".  Being specific to a
storage provider allows the processors to mirror provider terminology and
features.  It would be a difficult challenge to smoothly signal users that
they should use PutHDFS, but torture the configuration files until it
writes to S3.

* Advertising - Having a broad array of storage processors gives NiFi
tangible, linkable, and Googleable answers to what NiFi supports.

* Positioning - Stripping out other stores in favor of an
HDFS-library-first design would position NiFi closer to Hadoop/HDFS and
make it less of an independent mediator, if only in some small way.

I also believe that the NiFi Registry initiative should help address the
processor explosion.


Thanks,

James


Re: [DISCUSS] Scale-out/Object Storage - taming the diversity of processors

Posted by Matt Burgess <ma...@apache.org>.
I agree with Andrew in the operations sense, and would like to add
that the user experience around dynamic properties (and even
"conditional" properties that are not dynamic but can be exposed when
other properties are "Applied") can be less-than-ideal and IMHO should
be used sparingly. Full disclosure: My latest processor uses
"conditional" properties at the moment, choosing them over dynamic
properties in the hopes that the user experience is better, but
without in-place updates (possibly implemented under [1]) and/or the
UI making it obvious that dynamic properties are supported (under
[2]), I'm not sure which is better (or if I should create different
processors for my case as well).
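
For reference, the way a processor opts in to user-defined (dynamic)
properties is fairly small. A minimal sketch, assuming the standard NiFi
extension API - the class name, descriptions and validator choice are
purely illustrative:

    import org.apache.nifi.annotation.behavior.DynamicProperty;
    import org.apache.nifi.components.PropertyDescriptor;
    import org.apache.nifi.processor.AbstractProcessor;
    import org.apache.nifi.processor.ProcessContext;
    import org.apache.nifi.processor.ProcessSession;
    import org.apache.nifi.processor.exception.ProcessException;
    import org.apache.nifi.processor.util.StandardValidators;

    @DynamicProperty(name = "A configuration key", value = "The value to set",
            description = "Forwarded to the underlying storage client (illustrative)")
    public class ExampleDynamicPropertyProcessor extends AbstractProcessor {

        // Returning a descriptor here (instead of null) is what tells the
        // framework that arbitrary user-added properties are accepted.
        @Override
        protected PropertyDescriptor getSupportedDynamicPropertyDescriptor(final String propertyDescriptorName) {
            return new PropertyDescriptor.Builder()
                    .name(propertyDescriptorName)
                    .addValidator(StandardValidators.NON_EMPTY_VALIDATOR)
                    .dynamic(true)
                    .build();
        }

        @Override
        public void onTrigger(final ProcessContext context, final ProcessSession session) throws ProcessException {
            // No-op for the purposes of this sketch.
        }
    }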

Under the hood, if it makes sense to group these processors and
abstract away common code, then I'm all for it.  Especially if we can
use something like the nifi-hadoop-libraries-nar as an ancestor NAR to
provide a common set of libraries to all the Hadoop-Compatible File
System (HCFS) implementations.  However, I fear that, depending on the
versions of the specific HCFS implementations, they may also need different
versions of the Hadoop client dependencies, in which case we'd be looking
to the Extension Registry and some smart classloading to alleviate
those pain points without ballooning the NiFi footprint.

Regards,
Matt

[1] https://issues.apache.org/jira/browse/NIFI-1121
[2] https://issues.apache.org/jira/browse/NIFI-2629


On Tue, Feb 21, 2017 at 6:21 PM, Andrew Grande <ap...@gmail.com> wrote:
> Andre,
>
> I came across multiple NiFi use cases where going through the HDFS layer
> and the fs plugin may not be possible, i.e. when no HDFS layer is present
> at all, so there is no NameNode to connect to.
>
> Another important aspect is operations. The current PutHDFS model with an
> additional jar location, well, it kinda works, but I very much dislike it.
> Too many possibilities for human error, in addition to deployment pain,
> especially in a cluster.
>
> Finally, native object storage processors have features which may not even
> apply to the HDFS layer. E.g. the Azure storage has Table storage, etc.
>
> I agree consolidating various efforts is worthwhile, but only within a
> context of a specific storage solution. Not 'unifying' them into a single
> layer.
>
> Andrew
>

Re: [DISCUSS] Scale-out/Object Storage - taming the diversity of processors

Posted by Andre <an...@fucs.org>.
Joe,

Thanks for the comments!

Slightly deviating from the code consistency discussion, but I think you
raised some important points.

I may be on the glass-half-empty side, but as a user who has witnessed many
times how modularity played out in other communities, I am not particularly
excited about the prospect.

I agree with Oleg's comments about the benefits of the registry for code
that, due to licensing issues, would not be able to be merged into NiFi,
and I am excited about faster builds and selective packaging, but past that
point, IMNSHO, things get ugly very quickly.

To be 100% transparent, it is probably because I personally benefited from
having my code reviewed and subjected to the always helpful input of people
like yourself, Aldrin, BBende, JPercivall, Matt B, Oleg, Pierre, and the
rest of the wider community. I think of how ListenSMTP evolved more in two
days than I would have been able to improve it in a lifetime, all thanks to
a very strong core of contributors around this community.

Perhaps it is because, as a volunteer, I have had to manage some sites
built around WordPress and noticed that quite frequently a WordPress site
is a sea of plugin modules that start getting outdated as time progresses,
turning long-term maintenance of the platform into an absolute pain. And to
be honest, WordPress has a healthy ecosystem around it: free modules,
commercial modules, independent registries, product reviews and support
forums.

At this stage I would like to point out: support forums - we need to start
thinking about how we plan to support third-party plugins, as shared
mailing lists are woefully inadequate for that. Even Elastic seems to have
ended up using both its support forums and GitHub's issue pages to provide
support to users.

In any case, back to the original issue:

I will second Adam's comments about not assuming the registry will help
with the original issues highlighted: consistency, code repetition and
long-term support.

Cheers




Re: [DISCUSS] Scale-out/Object Storage - taming the diversity of processors

Posted by Joe Witt <jo...@gmail.com>.
more good points...

We will need to think through/document the benefits that this
extension bazaar would provide and explain the risks it also brings to
users.  We should document best practices for developers building
new components or extending existing ones, and we should allow users to
socialize their findings on processors.  If someone sees a processor
with 'one star' from a few users versus '4.5' from a thousand, they
would get a different level of confidence.  We should in general think
about how to 'mature' components in these registries and such.

That said, I think some of the problems we're talking about we'd be
very honored and fortunate to be solving.  Right now we're making it
hard to build and develop releases and slowing our agility as a
community.

Good problems to have though!

joe

On Wed, Feb 22, 2017 at 2:20 PM, Adam Lamar <ad...@gmail.com> wrote:
> To be clear, I really love the idea of an extension registry and have at
> least one custom processor that would be a great fit for some of the
> reasons listed by Oleg, and its really cool thinking that user data can
> drive NiFi improvements. We're on the same page there.
>
> Let's go back to one of Andre's points as I understand it: Many of the
> processors that do similar things are different in lots of minor but
> unnecessary ways, and this hurts the user experience.
>
> With an extension registry, NiFi users would potentially have access to
> even more processors, but these processors don't undergo the same code
> review as they would being introduced into the mainline tree today, and if
> someone writes ListX or ListY, there seems to be little incentive for them
> to match existing processor behavior, because ListX and ListY exist
> independently from the core processors.
>
> I've already shown my hand, but I'm interested to hear what others think.
> Is this lack of consistency a problem, and if so, how does NiFi mitigate
> the potential issues?
>
> Adam
>
>
> On Wed, Feb 22, 2017 at 11:22 AM, Oleg Zhurakousky <
> ozhurakousky@hortonworks.com> wrote:
>
>> Just wanted to add one more point which IMHO just as important. . .
>> Certain “artifacts” (i.e., NARs that depends on libraries which are not
>> ASF friendly) may not fit the ASF licensing requirements of genuine Apache
>> NiFi distribution, yet add a great value for greater community of NiFi
>> users, so having them NOT being part of official NiFi distribution is a
>> value in itself.
>>
>> Cheers
>> Oleg
>>
>> > On Feb 22, 2017, at 12:52 PM, Oleg Zhurakousky <
>> ozhurakousky@hortonworks.com> wrote:
>> >
>> > Adam
>> >
>> > I 100% agree with your comment on "official/sanctioned”. As an external
>> artifact registry such as BinTray for example or GitHub, one can not
>> control what is there, rather how to get it. The final decision is left to
>> the end user.
>> > Artifacts could be rated and/or Apache NiFi (and/or commercial
>> distributions of NiFi) can “endorse” and/or “un-endorse” certain artifacts
>> and IMHO that is perfectly fine. On top of that a future distribution of
>> NiFi can have configuration to account for the “endorsed/supported”
>> artifacts, yet it should not stop one from downloading and trying something
>> new.
>> >
>> > Cheers
>> > Oleg
>> >
>> >> On Feb 22, 2017, at 12:43 PM, Adam Lamar <ad...@gmail.com> wrote:
>> >>
>> >> Hey all,
>> >>
>> >> I can understand Andre's perspective - when I was building the ListS3
>> >> processor, I mostly just copied the bits that made sense from ListHDFS
>> and
>> >> ListFile. That worked, but it's a poor way to ensure consistency across
>> >> List* processors.
>> >>
>> >> As a once-in-a-while contributor, I love the idea that community
>> >> contributions are respected and we're not dropping them, because they
>> solve
>> >> real needs right now, and it isn't clear another approach would be
>> better.
>> >>
>> >> And I disagree slightly with the notion that an artifact registry will
>> >> solve the problem - I think it could make it worse, at least from a
>> >> consistency point of view. Taming _is_ important, which is one reason
>> >> registry communities have official/sanctioned modules. Quality and
>> >> interoperability can vary vastly.
>> >>
>> >> By convention, it seems like NiFi already has a handful of
>> well-understood
>> >> patterns - List, Fetch, Get, Put, etc all mean something specific in
>> >> processor terms. Is there a reason not to formalize those patterns in
>> the
>> >> code as well? That would help with processor consistency, and if done
>> >> right, it may even be easier to write new processors, fix bugs, etc.
>> >>
>> >> For example, ListS3 initially shipped with some bad session commit()
>> >> behavior, which was obvious once identified, but a generalized
>> >> AbstractListProcessor (higher level than the one that already exists)
>> could
>> >> make it easier to avoid this class of bug.
>> >>
>> >> Admittedly this could be a lot of work.
>> >>
>> >> Cheers,
>> >> Adam
>> >>
>> >>
>> >>
>> >> On Wed, Feb 22, 2017 at 8:38 AM, Oleg Zhurakousky <
>> >> ozhurakousky@hortonworks.com> wrote:
>> >>
>> >>> I’ll second Pierre
>> >>>
>> >>> Yes with the current deployment model the amount of processors and the
>> >>> size of NiFi distribution is a concern simply because it’s growing with
>> >>> each release. But it should not be the driver to start jamming more
>> >>> functionality into existing processors which on the surface may look
>> like
>> >>> related (even if they are).
>> >>> Basically a processor should never be complex with regard to it being
>> >>> understood by the end user who is non-technical, so “specialization”
>> >>> always takes precedence here since it limits “configuration” and thus
>> >>> making such processor simpler. It also helps with maintenance and
>> >>> management of such processor by the developer. Also, having multiple
>> >>> related processors will promote healthy competition where my MyputHDFS
>> may
>> >>> for certain cases be better/faster than YourPutHDFS and why not have
>> both?
>> >>>
>> >>> The “artifact registry” (flow, extension, template etc) is the only
>> answer
>> >>> here since it will remove the “proliferation” and the need for “taming”
>> >>> anything from the picture. With “artifact registry” one or one million
>> >>> processors, the NiFi size/state will always remain constant and small.
>> >>>
>> >>> Cheers
>> >>> Oleg
>> >>>> On Feb 22, 2017, at 6:05 AM, Pierre Villard <
>> pierre.villard.fr@gmail.com>
>> >>> wrote:
>> >>>>
>> >>>> Hey guys,
>> >>>>
>> >>>> Thanks for the thread Andre.
>> >>>>
>> >>>> +1 to James' answer.
>> >>>>
>> >>>> I understand the appeal of a single processor that would connect to
>> >>>> all the back ends... and we could document/improve PutHDFS to ease
>> >>>> such use, but I really don't think that it would benefit the user
>> >>>> experience. That may be interesting in some cases for some users,
>> >>>> but I don't think that would be the majority.
>> >>>>
>> >>>> I believe NiFi is great for one reason: you have a lot of specialized
>> >>>> processors that are really easy to use and efficient for what they've
>> >>> been
>> >>>> designed for.
>> >>>>
>> >>>> Let's ask ourselves the question the other way: with the NiFi
>> registry on
>> >>>> its way, what is the problem having multiple processors for each back
>> >>> end?
>> >>>> I don't really see the issue here. OK we have a lot of processors
>> (but I
>> >>>> believe this is a good point for NiFi, for user experience, for
>> >>>> advertising, etc. - maybe we should improve the processor listing
>> though,
>> >>>> but again, this will be part of the NiFi Registry work), it generates
>> a
>> >>>> heavy NiFi binary (but that will be solved with the registry), but
>> that's
>> >>>> all, no?
>> >>>>
>> >>>> Also agree on the positioning aspect: IMO NiFi should not be highly
>> >>>> tied to the Hadoop ecosystem. There are a lot of users using NiFi
>> >>>> with absolutely no relation to Hadoop. Not sure that would send the
>> >>>> right "signal".
>> >>>>
>> >>>> Pierre
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> 2017-02-22 6:50 GMT+01:00 Andre <an...@fucs.org>:
>> >>>>
>> >>>>> Andrew,
>> >>>>>
>> >>>>>
>> >>>>> On Wed, Feb 22, 2017 at 11:21 AM, Andrew Grande <ap...@gmail.com>
>> >>>>> wrote:
>> >>>>>
>> >>>>>> I am observing one assumption in this thread. For some reason we are
>> >>>>>> implying all these will be hadoop compatible file systems. They
>> don't
>> >>>>>> always have an HDFS plugin, nor should they as a mandatory
>> requirement.
>> >>>>>>
>> >>>>>
>> >>>>> You are partially correct.
>> >>>>>
>> >>>>> There is a direct assumption in the availability of a HCFS (thanks
>> >>> Matt!)
>> >>>>> implementation.
>> >>>>>
>> >>>>> This is the case with:
>> >>>>>
>> >>>>> * Windows Azure Blob Storage
>> >>>>> * Google Cloud Storage Connector
>> >>>>> * MapR FileSystem (currently done via NAR recompilation / mvn
>> profile)
>> >>>>> * Alluxio
>> >>>>> * Isilon (via HDFS)
>> >>>>> * others
>> >>>>>
>> >>>>> But I wouldn't say this will apply to every other storage system and
>> >>>>> in certain cases may not even be necessary (e.g. Isilon scale-out
>> >>>>> storage may be reached using its native HDFS-compatible interfaces).
>> >>>>>
>> >>>>>
>> >>>>> Untie completely from the Hadoop nar. This allows for effective
>> minifi
>> >>>>>> interaction without the weight of hadoop libs for example. Massive
>> size
>> >>>>>> savings where it matters.
>> >>>>>>
>> >>>>>>
>> >>>>> Are you suggesting a use case where MiNiFi agents interact directly
>> >>>>> with cloud storage, without relying on NiFi hubs to do that?
>> >>>>>
>> >>>>>
>> >>>>>> For the deployment, it's easy enough for an admin to either rely on
>> a
>> >>>>>> standard tar or rpm if the NAR modules are already available in the
>> >>>>> distro
>> >>>>>> (well, I won't talk registry till it arrives). Mounting a common
>> >>>>> directory
>> >>>>>> on every node or distributing additional jars everywhere, plus
>> configs,
>> >>>>> and
>> >>>>>> then keeping it consistent across is something which can be avoided
>> by
>> >>>>>> simpler packaging.
>> >>>>>>
>> >>>>>
>> >>>>> As long as the NAR or RPM supports your use case, which is not the
>> >>>>> case of people running NiFi with MapR-FS, for example. For those, a
>> >>>>> recompilation is required anyway. A flexible processor may remove the
>> >>>>> need to recompile (I am currently playing with the classpath
>> >>>>> implications for MapR users).
>> >>>>>
>> >>>>> Cheers
>> >>>>>
>> >>>
>> >>>
>> >
>>
>>

Re: [DISCUSS] Scale-out/Object Storage - taming the diversity of processors

Posted by Adam Lamar <ad...@gmail.com>.
To be clear, I really love the idea of an extension registry and have at
least one custom processor that would be a great fit for some of the
reasons listed by Oleg, and it's really cool to think that user data can
drive NiFi improvements. We're on the same page there.

Let's go back to one of Andre's points as I understand it: Many of the
processors that do similar things are different in lots of minor but
unnecessary ways, and this hurts the user experience.

With an extension registry, NiFi users would potentially have access to
even more processors, but these processors don't undergo the same code
review as they would being introduced into the mainline tree today, and if
someone writes ListX or ListY, there seems to be little incentive for them
to match existing processor behavior, because ListX and ListY exist
independently from the core processors.

I've already shown my hand, but I'm interested to hear what others think.
Is this lack of consistency a problem, and if so, how does NiFi mitigate
the potential issues?

Adam
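
One way to picture the "formalize the patterns" idea raised earlier in the
thread is a higher-level listing base class that owns state handling and
the session commit ordering, leaving each backend to implement only the
actual listing call. The sketch below is purely illustrative - the class,
method and relationship names are hypothetical and this is not the existing
AbstractListProcessor API:

    import java.io.IOException;
    import java.util.List;
    import java.util.Map;

    import org.apache.nifi.flowfile.FlowFile;
    import org.apache.nifi.processor.AbstractProcessor;
    import org.apache.nifi.processor.ProcessContext;
    import org.apache.nifi.processor.ProcessSession;
    import org.apache.nifi.processor.Relationship;
    import org.apache.nifi.processor.exception.ProcessException;

    public abstract class AbstractObjectStoreListProcessor extends AbstractProcessor {

        public static final Relationship REL_SUCCESS = new Relationship.Builder()
                .name("success")
                .description("One FlowFile per listed object")
                .build();

        /** A single listed object, however the backend represents it. */
        public interface ListedEntity {
            long getTimestamp();
            Map<String, String> toAttributes();
        }

        /** Backends (S3, HDFS, GCS, ...) implement only the listing call. */
        protected abstract List<ListedEntity> performListing(ProcessContext context, long listAfterTimestamp) throws IOException;

        /** State handling lives in one place instead of being copied per processor. */
        protected abstract long restoreLastListedTimestamp(ProcessContext context);

        protected abstract void persistLastListedTimestamp(ProcessContext context, long timestamp);

        @Override
        public void onTrigger(final ProcessContext context, final ProcessSession session) throws ProcessException {
            try {
                final long lastListed = restoreLastListedTimestamp(context);
                final List<ListedEntity> entities = performListing(context, lastListed);
                long maxTimestamp = lastListed;

                for (final ListedEntity entity : entities) {
                    FlowFile flowFile = session.create();
                    flowFile = session.putAllAttributes(flowFile, entity.toAttributes());
                    session.transfer(flowFile, REL_SUCCESS);
                    maxTimestamp = Math.max(maxTimestamp, entity.getTimestamp());
                }

                // Commit before persisting state so a crash can at worst re-list
                // objects, never silently skip them -- the kind of session
                // commit() pitfall mentioned for ListS3 earlier in the thread.
                session.commit();
                persistLastListedTimestamp(context, maxTimestamp);
            } catch (final IOException e) {
                throw new ProcessException("Listing failed", e);
            }
        }
    }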



Re: [DISCUSS] Scale-out/Object Storage - taming the diversity of processors

Posted by Oleg Zhurakousky <oz...@hortonworks.com>.
Just wanted to add one more point which is IMHO just as important...
Certain “artifacts” (i.e., NARs that depend on libraries which are not ASF friendly) may not fit the ASF licensing requirements of a genuine Apache NiFi distribution, yet add great value for the greater community of NiFi users, so having them NOT be part of the official NiFi distribution is a value in itself.

Cheers
Oleg


Re: [DISCUSS] Scale-out/Object Storage - taming the diversity of processors

Posted by Oleg Zhurakousky <oz...@hortonworks.com>.
Adam

I 100% agree with your comment on "official/sanctioned”. With an external artifact registry such as BinTray or GitHub, for example, one cannot control what is there, only how to get it. The final decision is left to the end user.
Artifacts could be rated, and/or Apache NiFi (and/or commercial distributions of NiFi) could “endorse” and/or “un-endorse” certain artifacts, and IMHO that is perfectly fine. On top of that, a future distribution of NiFi could have configuration to account for the “endorsed/supported” artifacts, yet that should not stop one from downloading and trying something new.

Cheers
Oleg


Re: [DISCUSS] Scale-out/Object Storage - taming the diversity of processors

Posted by Joe Witt <jo...@gmail.com>.
Adam,

Some great points there.  I think what would be good to keep in mind here
is 'who' will tame these things.

For various patterns that are chosen and abstractions found and code written:
  - The developers do the taming.

For the extension registry and which processors become popular or
become unused and phase out:
  - The users/flow managers do the taming.

It is certainly the case we need to think through a robust plan which
allows both developers and users to provide the feedback and energy
necessary.  To date, we've not allowed the users to have much direct
influence here and we really don't have a strong sense of which
components are most commonly used.  One of the things I am most
excited by with the extension registry and related efforts is that it
will help us make more data driven decisions about where to focus our
energies.

Thanks
Joe

On Wed, Feb 22, 2017 at 12:43 PM, Adam Lamar <ad...@gmail.com> wrote:
> Hey all,
>
> I can understand Andre's perspective - when I was building the ListS3
> processor, I mostly just copied the bits that made sense from ListHDFS and
> ListFile. That worked, but its a poor way to ensure consistency across
> List* processors.
>
> As a once-in-a-while contributor, I love the idea that community
> contributions are respected and we're not dropping them, because they solve
> real needs right now, and it isn't clear another approach would be better.
>
> And I disagree slightly with the notion that an artifact registry will
> solve the problem - I think it could make it worse, at least from a
> consistency point of view. Taming _is_ important, which is one reason
> registry communities have official/sanctioned modules. Quality and
> interoperability can vary vastly.
>
> By convention, it seems like NiFi already has a handful of well-understood
> patterns - List, Fetch, Get, Put, etc all mean something specific in
> processor terms. Is there a reason not to formalize those patterns in the
> code as well? That would help with processor consistency, and if done
> right, it may even be easier to write new processors, fix bugs, etc.
>
> For example, ListS3 initially shipped with some bad session commit()
> behavior, which was obvious once identified, but a generalized
> AbstractListProcessor (higher level that the one that already exists) could
> make it easier to avoid this class of bug.
>
> Admittedly this could be a lot of work.
>
> Cheers,
> Adam
>
>
>
> On Wed, Feb 22, 2017 at 8:38 AM, Oleg Zhurakousky <
> ozhurakousky@hortonworks.com> wrote:
>
>> I’ll second Pierre
>>
>> Yes with the current deployment model the amount of processors and the
>> size of NiFi distribution is a concern simply because it’s growing with
>> each release. But it should not be the driver to start jamming more
>> functionality into existing processors which on the surface may look like
>> related (even if they are).
>> Basically a processor should never be complex with regard to it being
>> understood by the end user who is non-technical, so “specialization” is
>> always takes precedence here since it limits “configuration” and thus
>> making such processor simpler. It also helps with maintenance and
>> management of such processor by the developer. Also, having multiple
>> related processors will promote healthy competition where my MyputHDFS may
>> for certain cases be better/faster then YourPutHDFS and why not have both?
>>
>> The “artifact registry” (flow, extension, template etc) is the only answer
>> here since it will remove the “proliferation” and the need for “taming”
>> anything from the picture. With “artifact registry” one or one million
>> processors, the NiFi size/state will always remain constant and small.
>>
>> Cheers
>> Oleg
>> > On Feb 22, 2017, at 6:05 AM, Pierre Villard <pi...@gmail.com>
>> wrote:
>> >
>> > Hey guys,
>> >
>> > Thanks for the thread Andre.
>> >
>> > +1 to James' answer.
>> >
>> > I understand the interest that would provide a single processor to
>> connect
>> > to all the back ends... and we could document/improve the PutHDFS to ease
>> > such use but I really don't think that it will benefit the user
>> experience.
>> > That may be interesting in some cases for some users but I don't think
>> that
>> > would be a majority.
>> >
>> > I believe NiFi is great for one reason: you have a lot of specialized
>> > processors that are really easy to use and efficient for what they've
>> been
>> > designed for.
>> >
>> > Let's ask ourselves the question the other way: with the NiFi registry on
>> > its way, what is the problem having multiple processors for each back
>> end?
>> > I don't really see the issue here. OK we have a lot of processors (but I
>> > believe this is a good point for NiFi, for user experience, for
>> > advertising, etc. - maybe we should improve the processor listing though,
>> > but again, this will be part of the NiFi Registry work), it generates a
>> > heavy NiFi binary (but that will be solved with the registry), but that's
>> > all, no?
>> >
>> > Also agree on the positioning aspect: IMO NiFi should not be highly tied
>> to
>> > the Hadoop ecosystem. There is a lot of users using NiFi with absolutely
>> no
>> > relation to Hadoop. Not sure that would send the good "signal".
>> >
>> > Pierre
>> >
>> >
>> >
>> >
>> > 2017-02-22 6:50 GMT+01:00 Andre <an...@fucs.org>:
>> >
>> >> Andrew,
>> >>
>> >>
>> >> On Wed, Feb 22, 2017 at 11:21 AM, Andrew Grande <ap...@gmail.com>
>> >> wrote:
>> >>
>> >>> I am observing one assumption in this thread. For some reason we are
>> >>> implying all these will be hadoop compatible file systems. They don't
>> >>> always have an HDFS plugin, nor should they as a mandatory requirement.
>> >>>
>> >>
>> >> You are partially correct.
>> >>
>> >> There is a direct assumption in the availability of a HCFS (thanks
>> Matt!)
>> >> implementation.
>> >>
>> >> This is the case with:
>> >>
>> >> * Windows Azure Blob Storage
>> >> * Google Cloud Storage Connector
>> >> * MapR FileSystem (currently done via NAR recompilation / mvn profile)
>> >> * Alluxio
>> >> * Isilon (via HDFS)
>> >> * others
>> >>
>> >> But I wouldn't say this will apply to every other storage system and
>> in
>> >> certain cases may not even be necessary (e.g. Isilon scale-out storage
>> may
>> >> be reached using its native HDFS compatible interfaces).
>> >>
>> >>
>> >> Untie completely from the Hadoop nar. This allows for effective minifi
>> >>> interaction without the weight of hadoop libs for example. Massive size
>> >>> savings where it matters.
>> >>>
>> >>>
>> >> Are you suggesting a use case where MiNiFi agents interact directly with
>> >> cloud storage, without relying on NiFi hubs to do that?
>> >>
>> >>
>> >>> For the deployment, it's easy enough for an admin to either rely on a
>> >>> standard tar or rpm if the NAR modules are already available in the
>> >> distro
>> >>> (well, I won't talk registry till it arrives). Mounting a common
>> >> directory
>> >>> on every node or distributing additional jars everywhere, plus configs,
>> >> and
>> >>> then keeping it consistent across is something which can be avoided by
>> >>> simpler packaging.
>> >>>
>> >>
>> >> As long as the NAR or RPM supports your use-case, which is not the case of
>> >> people running NiFi with MapR-FS for example. For those, a
>> recompilation is
>> >> required anyway. A flexible processor may remove the need to recompile
>> (I
>> >> am currently playing with the classpath implication to MapR users).
>> >>
>> >> Cheers
>> >>
>>
>>

Re: [DISCUSS] Scale-out/Object Storage - taming the diversity of processors

Posted by Andre <an...@fucs.org>.
Adam,

On 23 Feb 2017 4:43 AM, "Adam Lamar" <ad...@gmail.com> wrote:

Hey all,

I can understand Andre's perspective - when I was building the ListS3
processor, I mostly just copied the bits that made sense from ListHDFS and
ListFile. That worked, but it's a poor way to ensure consistency across
List* processors.


Been there, done that and continue to do that :-)

This is particularly tricky, however, because those processors drift apart
once the first iteration is made. Bugs get fixed in one processor but not
in the other.

As a once-in-a-while contributor, I love the idea that community
contributions are respected and we're not dropping them, because they solve
real needs right now, and it isn't clear another approach would be better.


I feel your pain, and the good news is that removing them would be a breaking
change anyhow...

Plus, we all love the *S3 processors. :-)

And I disagree slightly with the notion that an artifact registry will
solve the problem - I think it could make it worse, at least from a
consistency point of view.


100% agreed.

Taming _is_ important, which is one reason
registry communities have official/sanctioned modules. Quality and
interoperability can vary vastly.


100% agreed. Just look at Maven, JCenter, PyPI...


I suspect you will agree with the idea that the user would think twice about
using a 3rd party processor due to the fear of it being obsoleted by later
version upgrades?



By convention, it seems like NiFi already has a handful of well-understood
patterns - List, Fetch, Get, Put, etc all mean something specific in
processor terms. Is there a reason not to formalize those patterns in the
code as well? That would help with processor consistency, and if done
right, it may even be easier to write new processors, fix bugs, etc.


100% agreed. My suggestion was HCFS but our dislike for this approach
should not preclude us from achieving the final goal:

Currently consistency isn't easily maintained; it would be great if it were.

Thanks a lot for your comments, truly appreciated.

Re: [DISCUSS] Scale-out/Object Storage - taming the diversity of processors

Posted by Adam Lamar <ad...@gmail.com>.
Hey all,

I can understand Andre's perspective - when I was building the ListS3
processor, I mostly just copied the bits that made sense from ListHDFS and
ListFile. That worked, but it's a poor way to ensure consistency across
List* processors.

As a once-in-a-while contributor, I love the idea that community
contributions are respected and we're not dropping them, because they solve
real needs right now, and it isn't clear another approach would be better.

And I disagree slightly with the notion that an artifact registry will
solve the problem - I think it could make it worse, at least from a
consistency point of view. Taming _is_ important, which is one reason
registry communities have official/sanctioned modules. Quality and
interoperability can vary vastly.

By convention, it seems like NiFi already has a handful of well-understood
patterns - List, Fetch, Get, Put, etc all mean something specific in
processor terms. Is there a reason not to formalize those patterns in the
code as well? That would help with processor consistency, and if done
right, it may even be easier to write new processors, fix bugs, etc.

For example, ListS3 initially shipped with some bad session commit()
behavior, which was obvious once identified, but a generalized
AbstractListProcessor (higher level than the one that already exists) could
make it easier to avoid this class of bug.
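
To make that concrete, here is roughly what I have in mind - just a rough,
untested sketch, and the names (AbstractObjectListProcessor, ObjectEntry,
performListing) are made up rather than existing NiFi API:

import java.util.Collections;
import java.util.List;
import java.util.Set;

import org.apache.nifi.annotation.behavior.InputRequirement;
import org.apache.nifi.annotation.behavior.InputRequirement.Requirement;
import org.apache.nifi.annotation.behavior.TriggerSerially;
import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;

// Hypothetical generalization: every List* processor would inherit the same
// input requirement, listing loop and single commit, and only implement the
// storage-specific listing call.
@TriggerSerially
@InputRequirement(Requirement.INPUT_FORBIDDEN)
public abstract class AbstractObjectListProcessor extends AbstractProcessor {

    public static final Relationship REL_SUCCESS = new Relationship.Builder()
            .name("success")
            .description("One FlowFile is emitted per listed object")
            .build();

    /** Minimal, storage-agnostic view of one listed object. */
    public static class ObjectEntry {
        public final String name;
        public final long size;
        public final long lastModified;

        public ObjectEntry(final String name, final long size, final long lastModified) {
            this.name = name;
            this.size = size;
            this.lastModified = lastModified;
        }
    }

    /** S3, GCS, Azure, HDFS, ... subclasses implement only this. */
    protected abstract List<ObjectEntry> performListing(ProcessContext context, long minTimestamp)
            throws ProcessException;

    @Override
    public Set<Relationship> getRelationships() {
        return Collections.singleton(REL_SUCCESS);
    }

    @Override
    public void onTrigger(final ProcessContext context, final ProcessSession session)
            throws ProcessException {
        final long minTimestamp = 0L; // a real version would read this from managed state

        for (final ObjectEntry entry : performListing(context, minTimestamp)) {
            FlowFile flowFile = session.create();
            flowFile = session.putAttribute(flowFile, "filename", entry.name);
            flowFile = session.putAttribute(flowFile, "file.size", String.valueOf(entry.size));
            flowFile = session.putAttribute(flowFile, "file.lastModified", String.valueOf(entry.lastModified));
            session.transfer(flowFile, REL_SUCCESS);
        }

        // A single commit after everything has been transferred (and, in a real
        // version, after the new listing timestamp has been persisted) - the
        // commit placement is the kind of detail ListS3 initially got wrong.
        session.commit();
    }
}

Each storage-specific subclass would then just be a thin adapter around its
client library, and behaviour like state tracking and commit placement would
only need to be fixed once.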

Admittedly this could be a lot of work.

Cheers,
Adam



On Wed, Feb 22, 2017 at 8:38 AM, Oleg Zhurakousky <
ozhurakousky@hortonworks.com> wrote:

> I’ll second Pierre
>
> Yes with the current deployment model the amount of processors and the
> size of NiFi distribution is a concern simply because it’s growing with
> each release. But it should not be the driver to start jamming more
> functionality into existing processors which on the surface may look like
> related (even if they are).
> Basically a processor should never be complex with regard to it being
> understood by the end user who is non-technical, so “specialization”
> always takes precedence here since it limits “configuration” and thus
> making such processor simpler. It also helps with maintenance and
> management of such processor by the developer. Also, having multiple
> related processors will promote healthy competition where my MyputHDFS may
> for certain cases be better/faster than YourPutHDFS and why not have both?
>
> The “artifact registry” (flow, extension, template etc) is the only answer
> here since it will remove the “proliferation” and the need for “taming”
> anything from the picture. With “artifact registry” one or one million
> processors, the NiFi size/state will always remain constant and small.
>
> Cheers
> Oleg
> > On Feb 22, 2017, at 6:05 AM, Pierre Villard <pi...@gmail.com>
> wrote:
> >
> > Hey guys,
> >
> > Thanks for the thread Andre.
> >
> > +1 to James' answer.
> >
> > I understand the interest that would provide a single processor to
> connect
> > to all the back ends... and we could document/improve the PutHDFS to ease
> > such use but I really don't think that it will benefit the user
> experience.
> > That may be interesting in some cases for some users but I don't think
> that
> > would be a majority.
> >
> > I believe NiFi is great for one reason: you have a lot of specialized
> > processors that are really easy to use and efficient for what they've
> been
> > designed for.
> >
> > Let's ask ourselves the question the other way: with the NiFi registry on
> > its way, what is the problem having multiple processors for each back
> end?
> > I don't really see the issue here. OK we have a lot of processors (but I
> > believe this is a good point for NiFi, for user experience, for
> > advertising, etc. - maybe we should improve the processor listing though,
> > but again, this will be part of the NiFi Registry work), it generates a
> > heavy NiFi binary (but that will be solved with the registry), but that's
> > all, no?
> >
> > Also agree on the positioning aspect: IMO NiFi should not be highly tied
> to
> > the Hadoop ecosystem. There is a lot of users using NiFi with absolutely
> no
> > relation to Hadoop. Not sure that would send the good "signal".
> >
> > Pierre
> >
> >
> >
> >
> > 2017-02-22 6:50 GMT+01:00 Andre <an...@fucs.org>:
> >
> >> Andrew,
> >>
> >>
> >> On Wed, Feb 22, 2017 at 11:21 AM, Andrew Grande <ap...@gmail.com>
> >> wrote:
> >>
> >>> I am observing one assumption in this thread. For some reason we are
> >>> implying all these will be hadoop compatible file systems. They don't
> >>> always have an HDFS plugin, nor should they as a mandatory requirement.
> >>>
> >>
> >> You are partially correct.
> >>
> >> There is a direct assumption in the availability of a HCFS (thanks
> Matt!)
> >> implementation.
> >>
> >> This is the case with:
> >>
> >> * Windows Azure Blob Storage
> >> * Google Cloud Storage Connector
> >> * MapR FileSystem (currently done via NAR recompilation / mvn profile)
> >> * Alluxio
> >> * Isilon (via HDFS)
> >> * others
> >>
> >> But I wouldn't say this will apply to every other storage system and
> in
> >> certain cases may not even be necessary (e.g. Isilon scale-out storage
> may
> >> be reached using its native HDFS compatible interfaces).
> >>
> >>
> >> Untie completely from the Hadoop nar. This allows for effective minifi
> >>> interaction without the weight of hadoop libs for example. Massive size
> >>> savings where it matters.
> >>>
> >>>
> >> Are you suggesting a use case where MiNiFi agents interact directly with
> >> cloud storage, without relying on NiFi hubs to do that?
> >>
> >>
> >>> For the deployment, it's easy enough for an admin to either rely on a
> >>> standard tar or rpm if the NAR modules are already available in the
> >> distro
> >>> (well, I won't talk registry till it arrives). Mounting a common
> >> directory
> >>> on every node or distributing additional jars everywhere, plus configs,
> >> and
> >>> then keeping it consistent across is something which can be avoided by
> >>> simpler packaging.
> >>>
> >>
> >> As long as the NAR or RPM supports your use-case, which is not the case of
> >> people running NiFi with MapR-FS for example. For those, a
> recompilation is
> >> required anyway. A flexible processor may remove the need to recompile
> (I
> >> am currently playing with the classpath implication to MapR users).
> >>
> >> Cheers
> >>
>
>

Re: [DISCUSS] Scale-out/Object Storage - taming the diversity of processors

Posted by Oleg Zhurakousky <oz...@hortonworks.com>.
I’ll second Pierre

Yes, with the current deployment model the number of processors and the size of the NiFi distribution are a concern simply because they are growing with each release. But that should not be the driver to start jamming more functionality into existing processors which on the surface may look related (even if they are).
Basically a processor should never be complex with regard to it being understood by the end user, who is non-technical, so “specialization” always takes precedence here since it limits “configuration” and thus makes such a processor simpler. It also helps with maintenance and management of such a processor by the developer. Also, having multiple related processors will promote healthy competition where my MyputHDFS may for certain cases be better/faster than YourPutHDFS, and why not have both?

The “artifact registry” (flow, extension, template etc) is the only answer here since it will remove the “proliferation” and the need for “taming” anything from the picture. With “artifact registry” one or one million processors, the NiFi size/state will always remain constant and small.

Cheers
Oleg
> On Feb 22, 2017, at 6:05 AM, Pierre Villard <pi...@gmail.com> wrote:
> 
> Hey guys,
> 
> Thanks for the thread Andre.
> 
> +1 to James' answer.
> 
> I understand the interest that would provide a single processor to connect
> to all the back ends... and we could document/improve the PutHDFS to ease
> such use but I really don't think that it will benefit the user experience.
> That may be interesting in some cases for some users but I don't think that
> would be a majority.
> 
> I believe NiFi is great for one reason: you have a lot of specialized
> processors that are really easy to use and efficient for what they've been
> designed for.
> 
> Let's ask ourselves the question the other way: with the NiFi registry on
> its way, what is the problem having multiple processors for each back end?
> I don't really see the issue here. OK we have a lot of processors (but I
> believe this is a good point for NiFi, for user experience, for
> advertising, etc. - maybe we should improve the processor listing though,
> but again, this will be part of the NiFi Registry work), it generates a
> heavy NiFi binary (but that will be solved with the registry), but that's
> all, no?
> 
> Also agree on the positioning aspect: IMO NiFi should not be highly tied to
> the Hadoop ecosystem. There is a lot of users using NiFi with absolutely no
> relation to Hadoop. Not sure that would send the good "signal".
> 
> Pierre
> 
> 
> 
> 
> 2017-02-22 6:50 GMT+01:00 Andre <an...@fucs.org>:
> 
>> Andrew,
>> 
>> 
>> On Wed, Feb 22, 2017 at 11:21 AM, Andrew Grande <ap...@gmail.com>
>> wrote:
>> 
>>> I am observing one assumption in this thread. For some reason we are
>>> implying all these will be hadoop compatible file systems. They don't
>>> always have an HDFS plugin, nor should they as a mandatory requirement.
>>> 
>> 
>> You are partially correct.
>> 
>> There is a direct assumption in the availability of a HCFS (thanks Matt!)
>> implementation.
>> 
>> This is the case with:
>> 
>> * Windows Azure Blob Storage
>> * Google Cloud Storage Connector
>> * MapR FileSystem (currently done via NAR recompilation / mvn profile)
>> * Alluxio
>> * Isilon (via HDFS)
>> * others
>> 
>> But I wouldn't say this will apply to every other storage system and in
>> certain cases may not even be necessary (e.g. Isilon scale-out storage may
>> be reached using its native HDFS compatible interfaces).
>> 
>> 
>> Untie completely from the Hadoop nar. This allows for effective minifi
>>> interaction without the weight of hadoop libs for example. Massive size
>>> savings where it matters.
>>> 
>>> 
>> Are you suggesting a use case where MiNiFi agents interact directly with
>> cloud storage, without relying on NiFi hubs to do that?
>> 
>> 
>>> For the deployment, it's easy enough for an admin to either rely on a
>>> standard tar or rpm if the NAR modules are already available in the
>> distro
>>> (well, I won't talk registry till it arrives). Mounting a common
>> directory
>>> on every node or distributing additional jars everywhere, plus configs,
>> and
>>> then keeping it consistent across is something which can be avoided by
>>> simpler packaging.
>>> 
>> 
>> As long as the NAR or RPM supports your use-case, which is not the case of
>> people running NiFi with MapR-FS for example. For those, a recompilation is
>> required anyway. A flexible processor may remove the need to recompile (I
>> am currently playing with the classpath implication to MapR users).
>> 
>> Cheers
>> 


Re: [DISCUSS] Scale-out/Object Storage - taming the diversity of processors

Posted by Bryan Bende <bb...@gmail.com>.
I tend to agree with a lot of the points made by James and Pierre...

Given that the end user of NiFi is not always a developer, it seems
more user-friendly to have the specific processors and not have users
trying to come up with the right set of JARs and the right
configuration properties (although many power users can do this).

Since the processors we are talking about already exist, and many came
from great community contributions, I don't think we should get rid of
any of them.  If there are inconsistencies that can be improved, such
as some processors using EL and others not, then we should definitely
make those improvements.
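
For reference, the EL inconsistency usually comes down to a single flag on
each property descriptor - a quick sketch, with a purely illustrative
property name:

import org.apache.nifi.components.PropertyDescriptor;
import org.apache.nifi.processor.util.StandardValidators;

public class ElSupportExample {
    // "Bucket" is just an illustrative name; whether a processor calls
    // expressionLanguageSupported(true) on descriptors like this one is
    // exactly the kind of detail that has drifted between the storage
    // processors.
    public static final PropertyDescriptor BUCKET = new PropertyDescriptor.Builder()
            .name("Bucket")
            .description("The bucket or container to operate on")
            .required(true)
            .expressionLanguageSupported(true)
            .addValidator(StandardValidators.NON_EMPTY_VALIDATOR)
            .build();
}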





On Wed, Feb 22, 2017 at 8:42 AM, Andre <an...@fucs.org> wrote:
> Pierre,
>
>
>> I believe NiFi is great for one reason: you have a lot of specialized
>> processors that are really easy to use and efficient for what they've been
>> designed for.
>>
>
>> Let's ask ourselves the question the other way: with the NiFi registry on
>> its way, what is the problem having multiple processors for each back end?
>> I don't really see the issue here. OK we have a lot of processors (but I
>> believe this is a good point for NiFi, for user experience, for
>> advertising, etc. - maybe we should improve the processor listing though,
>> but again, this will be part of the NiFi Registry work), it generates a
>> heavy NiFi binary (but that will be solved with the registry), but that's
>> all, no?
>>
>
> The natural trade-off being fragmentation, code support and consistency?
>
> Simple example?
>
> ListS3 = Uses InputRequirement(Requirement.INPUT_FORBIDDEN)
> ListGCSBucket = INPUT_FORBIDDEN seems to be absent, however, expression
> language is disabled on most properties, suggesting design did not intend
> to have input. Simple bug (NIFI-3514), simple fix (PR#1526).
>
> Yes, no doubts, ListS3 presents S3's properties in clear fashion. Certainly
> ListGCSBucket represents  GCS metadata as attributes in a more specific way
> and this is handy, but that wouldn't be an unmanageable challenge.
>
> This is not an isolated issue, there are plenty of examples, some as simple
> as naming...  After all, one could be ultra pedantic for a second and note
> the ListGCSBucket does not follow the same convention as ListS3(*).
>
>
> Therefore, while the examples above are overly trivial, they still
> serve as a clear reminder of a very WET vs DRY dilemma. I strongly believe
> we should strive to stay in DRY land.
>
>
> Note however, that I am 100% OK with the idea that using HCFS may be overly
> complex and possibly undesirable;
>
> Nonetheless I think we should at least consider Matt's suggestion of using
> some refactoring magic, or anything that can help us achieve programmatic
> ways of promoting consistency across the common features of those
> processors (with the registry or not).
>
>
>
> I will take the community guidance on this.
>
> Cheers
>
> Andre
>
>
> (*) The closest conventional name would probably be ListGCS as no other
> ListProcessor seems to define the unit of collection, (i.e. it is ListSFTP
> not ListSFTPFolder).  I have not raised a JIRA ticket but I suggest the
> name to be changed for better user experience.

Re: [DISCUSS] Scale-out/Object Storage - taming the diversity of processors

Posted by Andre <an...@fucs.org>.
Pierre,


> I believe NiFi is great for one reason: you have a lot of specialized
> processors that are really easy to use and efficient for what they've been
> designed for.
>

> Let's ask ourselves the question the other way: with the NiFi registry on
> its way, what is the problem having multiple processors for each back end?
> I don't really see the issue here. OK we have a lot of processors (but I
> believe this is a good point for NiFi, for user experience, for
> advertising, etc. - maybe we should improve the processor listing though,
> but again, this will be part of the NiFi Registry work), it generates a
> heavy NiFi binary (but that will be solved with the registry), but that's
> all, no?
>

The natural trade-off being fragmentation, code support and consistency?

Simple example?

ListS3 = Uses InputRequirement(Requirement.INPUT_FORBIDDEN)
ListGCSBucket = INPUT_FORBIDDEN seems to be absent, however, expression
language is disabled on most properties, suggesting the design did not intend
it to have input. Simple bug (NIFI-3514), simple fix (PR#1526).
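
For anyone following along, the fix is a one-line class-level annotation - a
minimal sketch, with a made-up class name:

import org.apache.nifi.annotation.behavior.InputRequirement;
import org.apache.nifi.annotation.behavior.InputRequirement.Requirement;
import org.apache.nifi.processor.AbstractProcessor;

// With the requirement declared, the framework itself refuses incoming
// connections, instead of relying on disabled EL as an implicit hint.
@InputRequirement(Requirement.INPUT_FORBIDDEN)
public abstract class ListSomeObjectStore extends AbstractProcessor {
    // listing logic would go here
}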

Yes, no doubt, ListS3 presents S3's properties in a clear fashion. Certainly
ListGCSBucket represents GCS metadata as attributes in a more specific way
and this is handy, but that wouldn't be an unmanageable challenge.

This is not an isolated issue, there are plenty of examples, some as simple
as naming... After all, one could be ultra pedantic for a second and note
that ListGCSBucket does not follow the same naming convention as ListS3 (*).


Therefore, while the examples above are overly trivial, they still
serve as a clear reminder of a very WET vs DRY dilemma. I strongly believe
we should strive to stay in DRY land.


Note, however, that I am 100% OK with the idea that using HCFS may be overly
complex and possibly undesirable.

Nonetheless I think we should at least consider Matt's suggestion of using
some refactoring magic, or anything that can help us achieve programmatic
ways of promoting consistency across the common features of those
processors (with the registry or not).



I will take the community guidance on this.

Cheers

Andre


(*) The closest conventional name would probably be ListGCS, as no other
List processor seems to include the unit of collection (i.e. it is ListSFTP,
not ListSFTPFolder). I have not raised a JIRA ticket but I suggest the
name be changed for better user experience.

Re: [DISCUSS] Scale-out/Object Storage - taming the diversity of processors

Posted by Pierre Villard <pi...@gmail.com>.
Hey guys,

Thanks for the thread Andre.

+1 to James' answer.

I understand the interest in having a single processor to connect
to all the back ends... and we could document/improve PutHDFS to ease
such use, but I really don't think it would benefit the user experience.
That may be interesting in some cases for some users, but I don't think they
would be the majority.

I believe NiFi is great for one reason: you have a lot of specialized
processors that are really easy to use and efficient for what they've been
designed for.

Let's ask ourselves the question the other way: with the NiFi registry on
its way, what is the problem having multiple processors for each back end?
I don't really see the issue here. OK we have a lot of processors (but I
believe this is a good point for NiFi, for user experience, for
advertising, etc. - maybe we should improve the processor listing though,
but again, this will be part of the NiFi Registry work), it generates a
heavy NiFi binary (but that will be solved with the registry), but that's
all, no?

Also agree on the positioning aspect: IMO NiFi should not be highly tied to
the Hadoop ecosystem. There are a lot of users using NiFi with absolutely no
relation to Hadoop. Not sure that would send the right "signal".

Pierre




2017-02-22 6:50 GMT+01:00 Andre <an...@fucs.org>:

> Andrew,
>
>
> On Wed, Feb 22, 2017 at 11:21 AM, Andrew Grande <ap...@gmail.com>
> wrote:
>
> > I am observing one assumption in this thread. For some reason we are
> > implying all these will be hadoop compatible file systems. They don't
> > always have an HDFS plugin, nor should they as a mandatory requirement.
> >
>
> You are partially correct.
>
> There is a direct assumption in the availability of a HCFS (thanks Matt!)
> implementation.
>
> This is the case with:
>
> * Windows Azure Blob Storage
> * Google Cloud Storage Connector
> * MapR FileSystem (currently done via NAR recompilation / mvn profile)
> * Alluxio
> * Isilon (via HDFS)
> * others
>
> But I wouldn't say this will apply to every other storage system and in
> certain cases may not even be necessary (e.g. Isilon scale-out storage may
> be reached using its native HDFS compatible interfaces).
>
>
> Untie completely from the Hadoop nar. This allows for effective minifi
> > interaction without the weight of hadoop libs for example. Massive size
> > savings where it matters.
> >
> >
> Are you suggesting a use case where MiNiFi agents interact directly with
> cloud storage, without relying on NiFi hubs to do that?
>
>
> > For the deployment, it's easy enough for an admin to either rely on a
> > standard tar or rpm if the NAR modules are already available in the
> distro
> > (well, I won't talk registry till it arrives). Mounting a common
> directory
> > on every node or distributing additional jars everywhere, plus configs,
> and
> > then keeping it consistent across is something which can be avoided by
> > simpler packaging.
> >
>
> As long as the NAR or RPM supports your use-case, which is not the case of
> people running NiFi with MapR-FS for example. For those, a recompilation is
> required anyway. A flexible processor may remove the need to recompile (I
> am currently playing with the classpath implication to MapR users).
>
> Cheers
>

Re: [DISCUSS] Scale-out/Object Storage - taming the diversity of processors

Posted by Andre <an...@fucs.org>.
Andrew,


On Wed, Feb 22, 2017 at 11:21 AM, Andrew Grande <ap...@gmail.com> wrote:

> I am observing one assumption in this thread. For some reason we are
> implying all these will be hadoop compatible file systems. They don't
> always have an HDFS plugin, nor should they as a mandatory requirement.
>

You are partially correct.

There is a direct assumption in the availability of a HCFS (thanks Matt!)
implementation.

This is the case with:

* Windows Azure Blob Storage
* Google Cloud Storage Connector
* MapR FileSystem (currently done via NAR recompilation / mvn profile)
* Alluxio
* Isilon (via HDFS)
* others

But I wouldn't say this will apply to every other storage system and in
certain cases it may not even be necessary (e.g. Isilon scale-out storage may
be reached using its native HDFS compatible interfaces).
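
To make the pattern concrete, the HCFS route boils down to the plain Hadoop
FileSystem API plus a couple of fs.<scheme>.impl properties - a rough sketch
(the bucket name is made up, and the impl class names plus any credential
settings should be checked against the connector docs linked earlier):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HcfsListingSketch {
    public static void main(final String[] args) throws Exception {
        final Configuration conf = new Configuration();

        // Normally these live in the core-site.xml handed to the *HDFS
        // processors; each connector also needs its own credential/project
        // settings, omitted here.
        conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem");
        conf.set("fs.alluxio.impl", "alluxio.hadoop.FileSystem");

        // Same FileSystem API that List/Get/PutHDFS use internally - only the
        // URI scheme (and the jars on the classpath) selects the backing store.
        final FileSystem fs = FileSystem.get(URI.create("gs://some-bucket/"), conf);
        for (final FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }
    }
}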


Untie completely from the Hadoop nar. This allows for effective minifi
> interaction without the weight of hadoop libs for example. Massive size
> savings where it matters.
>
>
Are you suggesting a use case where MiNiFi agents interact directly with
cloud storage, without relying on NiFi hubs to do that?


> For the deployment, it's easy enough for an admin to either rely on a
> standard tar or rpm if the NAR modules are already available in the distro
> (well, I won't talk registry till it arrives). Mounting a common directory
> on every node or distributing additional jars everywhere, plus configs, and
> then keeping it consistent across is something which can be avoided by
> simpler packaging.
>

As long as the NAR or RPM supports your use-case, which is not the case of
people running NiFi with MapR-FS for example. For those, a recompilation is
required anyway. A flexible processor may remove the need to recompile (I
am currently playing with the classpath implications for MapR users).

Cheers

Re: [DISCUSS] Scale-out/Object Storage - taming the diversity of processors

Posted by Andrew Grande <ap...@gmail.com>.
I am observing one assumption in this thread. For some reason we are
implying all these will be hadoop compatible file systems. They don't
always have an HDFS plugin, nor should they as a mandatory requirement.
Untie completely from the Hadoop nar. This allows for effective minifi
interaction without the weight of hadoop libs for example. Massive size
savings where it matters.

For the deployment, it's easy enough for an admin to either rely on a
standard tar or rpm if the NAR modules are already available in the distro
(well, I won't talk registry till it arrives). Mounting a common directory
on every node or distributing additional jars everywhere, plus configs, and
then keeping it consistent across is something which can be avoided by
simpler packaging.

Andrew

On Tue, Feb 21, 2017, 6:47 PM Andre <an...@fucs.org> wrote:

> Andrew,
>
> Thank you for contributing.
>
> On 22 Feb 2017 10:21 AM, "Andrew Grande" <ap...@gmail.com> wrote:
>
> Andre,
>
> I came across multiple NiFi use cases where going through the HDFS layer
> and the fs plugin may not be possible. I.e. when no HDFS layer present at
> all, so no NN to connect to.
>
>
> Not sure I understand what you mean.
>
>
> Another important aspect is operations. Current PutHDFS model with
> additional jar location, well, it kinda works, but I very much dislike it.
> Too many possibilities for a human error in addition to deployment pain,
> especially in a cluster.
>
>
> Fair enough. Would you mind expanding a bit on what sort of  challenges
> currently apply in terms of cluster deployment?
>
>
> Finally, native object storage processors have features which may not even
> apply to the HDFS layer. E.g. the Azure storage has Table storage, etc.
>
>
> This is a very valid point but I am sure exceptions (in this case a NoSQL
> DB operating under the umbrella term of "storage").
>
> I perhaps should have made it more explicit but the requirements are:
>
> - existence of a hadoop compatible interface
> - ability to handle files
>
> Again, thank you for the input, truly appreciated.
>
> Andre
>
> I agree consolidating various efforts is worthwhile, but only within a
> context of a specific storage solution. Not 'unifying' them into a single
> layer.
>
> Andrew
>
> On Tue, Feb 21, 2017, 6:10 PM Andre <an...@fucs.org> wrote:
>
> > dev,
> >
> > I was having a chat with Pierre around PR#379 and we thought it would be
> > worth sharing this with the wider group:
> >
> >
> > I recently noticed that we merged a number of PRs and merges around
> > scale-out/cloud based object store into the master.
> >
> > Would it make sense to start considering adopting a pattern where
> > Put/Get/ListHDFS are used in tandem with implementations of the
> > hadoop.filesystem interfaces instead of creating new processors, except
> > where a particular deficiency/incompatibility in the hadoop.filesystem
> > implementation exists?
> >
> > Candidates for removal / non merge would be:
> >
> > - Alluxio (PR#379)
> > - WASB (PR#626)
> >  - Azure* (PR#399)
> > - *GCP (recently merged as PR#1482)
> > - *S3 (although this has been in code so it would have to be deprecated)
> >
> > The pattern would be pretty much the same as the one documented and
> > successfully deployed here:
> >
> > https://community.hortonworks.com/articles/71916/connecting-
> > to-azure-data-lake-from-a-nifi-dataflow.html
> >
> > Which means that in the case of Alluxio, one would use the properties
> > documented here:
> >
> > https://www.alluxio.com/docs/community/1.3/en/Running-
> > Hadoop-MapReduce-on-Alluxio.html
> >
> > While with Google Cloud Storage we would use the properties documented
> > here:
> >
> > https://cloud.google.com/hadoop/google-cloud-storage-connector
> >
> > I noticed that specific processors could have the ability to handle
> > particular properties to a filesystem, however I would like to believe
> the
> > same issue would plague hadoop users, and therefore is reasonable to
> > believe the Hadoop compatible implementations would have ways of exposing
> > those properties as well?
> >
> > In the case the properties are exposed, we perhaps simply adjust the
> *HDFS
> > processors to use dynamic properties to pass those to the underlying
> > module, therefore providing a way to explore particular settings of an
> > underlying storage platforms.
> >
> > Any opinion would be welcome
> >
> > PS-sent it again with proper subject label
> >
>

Re: [DISCUSS] Scale-out/Object Storage - taming the diversity of processors

Posted by Andre <an...@fucs.org>.
Andrew,

Thank you for contributing.

On 22 Feb 2017 10:21 AM, "Andrew Grande" <ap...@gmail.com> wrote:

Andre,

I came across multiple NiFi use cases where going through the HDFS layer
and the fs plugin may not be possible. I.e. when no HDFS layer present at
all, so no NN to connect to.


Not sure I understand what you mean.


Another important aspect is operations. Current PutHDFS model with
additional jar location, well, it kinda works, but I very much dislike it.
Too many possibilities for a human error in addition to deployment pain,
especially in a cluster.


Fair enough. Would you mind expanding a bit on what sort of  challenges
currently apply in terms of cluster deployment?


Finally, native object storage processors have features which may not even
apply to the HDFS layer. E.g. the Azure storage has Table storage, etc.


This is a very valid point, but I am sure there will be exceptions (in this
case a NoSQL DB operating under the umbrella term of "storage").

I perhaps should have made it more explicit but the requirements are:

- existence of a hadoop compatible interface
- ability to handle files

Again, thank you for the input, truly appreciated.

Andre

I agree consolidating various efforts is worthwhile, but only within a
context of a specific storage solution. Not 'unifying' them into a single
layer.

Andrew

On Tue, Feb 21, 2017, 6:10 PM Andre <an...@fucs.org> wrote:

> dev,
>
> I was having a chat with Pierre around PR#379 and we thought it would be
> worth sharing this with the wider group:
>
>
> I recently noticed that we merged a number of PRs and merges around
> scale-out/cloud based object store into the master.
>
> Would it make sense to start considering adopting a pattern where
> Put/Get/ListHDFS are used in tandem with implementations of the
> hadoop.filesystem interfaces instead of creating new processors, except
> where a particular deficiency/incompatibility in the hadoop.filesystem
> implementation exists?
>
> Candidates for removal / non merge would be:
>
> - Alluxio (PR#379)
> - WASB (PR#626)
>  - Azure* (PR#399)
> - *GCP (recently merged as PR#1482)
> - *S3 (although this has been in code so it would have to be deprecated)
>
> The pattern would be pretty much the same as the one documented and
> successfully deployed here:
>
> https://community.hortonworks.com/articles/71916/connecting-
> to-azure-data-lake-from-a-nifi-dataflow.html
>
> Which means that in the case of Alluxio, one would use the properties
> documented here:
>
> https://www.alluxio.com/docs/community/1.3/en/Running-
> Hadoop-MapReduce-on-Alluxio.html
>
> While with Google Cloud Storage we would use the properties documented
> here:
>
> https://cloud.google.com/hadoop/google-cloud-storage-connector
>
> I noticed that specific processors could have the ability to handle
> particular properties to a filesystem, however I would like to believe the
> same issue would plague hadoop users, and therefore is reasonable to
> believe the Hadoop compatible implementations would have ways of exposing
> those properties as well?
>
> In the case the properties are exposed, we perhaps simply adjust the *HDFS
> processors to use dynamic properties to pass those to the underlying
> module, therefore providing a way to explore particular settings of an
> underlying storage platforms.
>
> Any opinion would be welcome
>
> PS-sent it again with proper subject label
>

Re: [DISCUSS] Scale-out/Object Storage - taming the diversity of processors

Posted by Andrew Grande <ap...@gmail.com>.
Andre,

I came across multiple NiFi use cases where going through the HDFS layer
and the fs plugin may not be possible, i.e. when no HDFS layer is present at
all, so there is no NN to connect to.

Another important aspect is operations. Current PutHDFS model with
additional jar location, well, it kinda works, but I very much dislike it.
Too many possibilities for a human error in addition to deployment pain,
especially in a cluster.

Finally, native object storage processors have features which may not even
apply to the HDFS layer. E.g. the Azure storage has Table storage, etc.

I agree consolidating various efforts is worthwhile, but only within a
context of a specific storage solution. Not 'unifying' them into a single
layer.

Andrew

On Tue, Feb 21, 2017, 6:10 PM Andre <an...@fucs.org> wrote:

> dev,
>
> I was having a chat with Pierre around PR#379 and we thought it would be
> worth sharing this with the wider group:
>
>
> I recently noticed that we merged a number of PRs and merges around
> scale-out/cloud based object store into the master.
>
> Would it make sense to start considering adopting a pattern where
> Put/Get/ListHDFS are used in tandem with implementations of the
> hadoop.filesystem interfaces instead of creating new processors, except
> where a particular deficiency/incompatibility in the hadoop.filesystem
> implementation exists?
>
> Candidates for removal / non merge would be:
>
> - Alluxio (PR#379)
> - WASB (PR#626)
>  - Azure* (PR#399)
> - *GCP (recently merged as PR#1482)
> - *S3 (although this has been in code so it would have to be deprecated)
>
> The pattern would be pretty much the same as the one documented and
> successfully deployed here:
>
> https://community.hortonworks.com/articles/71916/connecting-
> to-azure-data-lake-from-a-nifi-dataflow.html
>
> Which means that in the case of Alluxio, one would use the properties
> documented here:
>
> https://www.alluxio.com/docs/community/1.3/en/Running-
> Hadoop-MapReduce-on-Alluxio.html
>
> While with Google Cloud Storage we would use the properties documented
> here:
>
> https://cloud.google.com/hadoop/google-cloud-storage-connector
>
> I noticed that specific processors could have the ability to handle
> particular properties to a filesystem, however I would like to believe the
> same issue would plague hadoop users, and therefore is reasonable to
> believe the Hadoop compatible implementations would have ways of exposing
> those properties as well?
>
> In the case the properties are exposed, we perhaps simply adjust the *HDFS
> processors to use dynamic properties to pass those to the underlying
> module, therefore providing a way to explore particular settings of an
> underlying storage platforms.
>
> Any opinion would be welcome
>
> PS-sent it again with proper subject label
>